A template-based N-gram part-of-speech tagger for Tagalog

Vlamir S. Rabo, De La Salle University, Manila

Abstract/Summary

TPOST is a template-based n-gram Part-Of-Speech (POS) tagger for Tagalog. It is designed for languages whose lexical resources are scarce and incomplete. The key idea of the algorithm is to use carefully chosen basic words and the fundamental features used in word construction, both to tag those words themselves and to disambiguate and resolve the unknown words surrounding them. TPOST was trained on 1,983 words with 450 distinct features, taken from the first three chapters of the Book of Philippians. The training text was manually tagged by a linguist and high school Filipino teachers, using 59 tags classified under 10 major POS tags. The tagger was first tested in the same domain on 539 words with 221 distinct word features, where it achieved error rates below 8% for general errors and below 11% for specific errors. It was also tested on a corpus from a different domain, children’s story books, consisting of 1,093 words with 397 distinct word features. This test resulted in error rates below 17% for general errors and below 23% for specific errors. Several variations of the algorithm were also tested and reduced the errors further, making the TPOST algorithm a good foundation for further research on POS tagging in NLP.

Keywords: Part-of-Speech Tagger, Natural Language Processing, Language Ambiguity, Template-based tagging, N-gram taggers.