Date of Publication
12-11-2004
Document Type
Master's Thesis
Degree Name
Master of Science in Computer Science
Subject Categories
Computer Sciences
College
College of Computer Studies
Department/Unit
Computer Science
Thesis Adviser
Charibeth Cheng Ko
Defense Panel Chair
Ethel Ong
Defense Panel Member
Nathalie Rose Lim
Michelle Wendy Tan
Abstract/Summary
TPOST is a template-based n-gram Part-Of-Speech (POS) tagger for Tagalog. It is designed for languages whose lexical resources are scarce and not comprehensive. The key to the algorithm is to use carefully chosen basic words and the fundamental features of word construction, both to tag a word itself and to disambiguate and resolve the unknown words surrounding it. TPOST was trained using 1983 words with 450 distinct features, taken from the first three chapters of the Book of Philippians. The training text was manually tagged by a linguist and high school Filipino teachers, using 59 tags classified under 10 major POS tags. The accuracy of the tagger was tested in the same domain on 539 words with 221 distinct word features, and it achieved error rates below 8% and 11% for general and specific errors, respectively. It was also tested on a different corpus, from the domain of children's story books, consisting of 1093 words with 397 distinct word features. That test resulted in error rates below 17% and 23% for general and specific errors, respectively. A number of variations were also tested, which further reduced the errors, making the TPOST algorithm a good foundation for further research in the field of POS tagging in NLP.
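The abstract reports "general" and "specific" errors without defining them. As an illustration only, the sketch below assumes that a general error is a wrong major POS class (one of the 10) and a specific error is a wrong fine-grained tag (one of the 59). The tag names, the `MAJOR_OF` mapping, and the `error_rates` function are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from specific tags to major POS classes.
# These tag names are invented for illustration; they are NOT the thesis tagset.
MAJOR_OF: Dict[str, str] = {
    "NNC": "NOUN",
    "NNP": "NOUN",
    "VBTS": "VERB",
    "VBTR": "VERB",
    "DTC": "DET",
}


def error_rates(gold: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Return (general_error_rate, specific_error_rate).

    Assumption: a "general" error is a mismatch in the major POS class,
    and a "specific" error is a mismatch in the fine-grained tag.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    pairs = list(zip(gold, predicted))
    # Any difference in the fine-grained tag counts as a specific error.
    specific_errors = sum(g != p for g, p in pairs)
    # Only a difference in the major class counts as a general error.
    general_errors = sum(MAJOR_OF[g] != MAJOR_OF[p] for g, p in pairs)
    n = len(pairs)
    return general_errors / n, specific_errors / n


gold = ["DTC", "NNC", "VBTS", "NNP"]
pred = ["DTC", "NNP", "VBTS", "VBTR"]
general, specific = error_rates(gold, pred)
print(f"general: {general:.2%}, specific: {specific:.2%}")
```

Under this reading, every general error is also a specific error, so the general rate can never exceed the specific rate. That is consistent with the reported figures (below 8% versus 11%, and below 17% versus 23%).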
Language
English
Accession Number
TG05885
Shelf Location
Archives, The Learning Commons, 12F Henry Sy Sr. Hall
Physical Description
vi, 111 leaves
Keywords
Tagalog language--Parts of speech; Natural language processing (Computer science)
Recommended Citation
Rabo, V. S. (2004). TPOST: A template-based, n-gram part-of-speech tagger for Tagalog. Retrieved from https://animorepository.dlsu.edu.ph/etd_masteral/4788