Date of Publication

12-11-2004

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Charibeth Cheng Ko

Defense Panel Chair

Ethel Ong

Defense Panel Member

Nathalie Rose Lim
Michelle Wendy Tan

Abstract/Summary

TPOST is a template-based n-gram Part-Of-Speech (POS) tagger for Tagalog. It is designed for languages whose lexical resources are scarce and not comprehensive. The key to the algorithm is to use carefully chosen basic words and the fundamental features of word construction, both to tag those words themselves and to disambiguate and resolve the unknown words surrounding them. TPOST was trained on 1983 words with 450 distinct features, taken from the first three chapters of the Book of Philippians. The training text was manually tagged by a linguist and high school Filipino teachers, using 59 tags classified under 10 major POS tags. The tagger was tested in the same domain on 539 words with 221 distinct word features, and achieved error rates below 8% for general errors and below 11% for specific errors. It was also tested on a different corpus, from the domain of children's story books, consisting of 1093 words with 397 distinct word features. That test yielded error rates below 17% for general errors and below 23% for specific errors. Several variations were also tested, which further reduced the errors, making the TPOST algorithm a good foundation for further research on POS tagging in NLP.

Abstract Format

html

Language

English

Format

Print

Accession Number

TG05885

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

vi, 111 leaves

Keywords

Tagalog language--Parts of speech; Natural language processing (Computer science)

