HATPOST: Hybrid approach to Tagalog part of speech tagging

Date of Publication

2007

Document Type

Bachelor's Thesis

Degree Name

Bachelor of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Rachel Roxas

Abstract/Summary

Part-of-speech (POS) tagging is the process of identifying the part of speech of each word in a text. It is used in many Natural Language Processing (NLP) applications. It attempts to solve the problem of language ambiguity, the state wherein a word may have more than one meaning. Several linguistic paradigms are employed to perform POS tagging, the most common of which are the rule-based and statistical approaches. The rule-based approach tags words based on the Simple Rule-Based Tagger (Brill, 1992), which makes use of patches. The statistical approach checks the context of the sentence by looking at the relation of one tag to another, using computed probability values of the possible tag sequences. Combining two or more approaches, known as the hybrid approach, allows the approaches to complement one another. HATPOST implements the hybrid approach for Tagalog part-of-speech tagging to address the issue of language ambiguity. Since it combines the rule-based and statistical approaches, HATPOST requires large training data to generate the patches and tag sequences that aid in tagging a text. Five testing methods were conducted on HATPOST: testing for every genre; testing incrementally; testing for every two corpora of different genres; testing for test data that is not part of the training data but is under the same genre; and testing for test data 95% of which is part of the training data. The corresponding results are 92.47%, 76.46%, 52.58%, 61.86%, and 92.75%, respectively, for the rule-based approach and 92.62%, 78.46%, 55.42%, 64.66%, and 93.16%, respectively, after applying the statistical approach, that is, for the hybrid approach. Subtracting the results of the rule-based approach from those of the hybrid approach, the average improvements are 0.15, 2.00, 2.84, 2.80, and 0.41 percentage points, respectively.
In the hybrid approach, the first and last testing methods have the highest accuracy, while the third testing method has the lowest. High accuracy is attained, first, when the tagging data is similar to the training data in terms of content or when they belong to the same genre, and second, when the training data is larger than or about the same size as the tagging data, even if there are many unknown words. In contrast, low accuracy results when the training and the tagging data differ in size and content. HATPOST's drawback is that it cannot tag all types of named entities and cannot handle a few punctuation marks.
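
The improvement figures reported in the abstract can be checked with a few lines of arithmetic. The following is only a minimal sketch: the accuracy values are copied from the abstract, while the names rule_based, hybrid, and improvements are illustrative assumptions and do not come from the thesis.

```python
# Accuracies (%) as reported in the abstract, in the order the five
# testing methods are listed. Variable and function names here are
# illustrative assumptions, not taken from HATPOST itself.
rule_based = [92.47, 76.46, 52.58, 61.86, 92.75]
hybrid = [92.62, 78.46, 55.42, 64.66, 93.16]


def improvements(before, after):
    """Return (after - before) for each pair, rounded to two decimals."""
    return [round(a - b, 2) for b, a in zip(before, after)]


gains = improvements(rule_based, hybrid)

for method, gain in enumerate(gains, start=1):
    print(f"Testing method {method}: +{gain:.2f} percentage points")
```

The differences are taken as hybrid minus rule-based, so a positive value means the statistical step improved on the rule-based result for that testing method.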

Abstract Format

html

Language

English

Format

Electronic

Accession Number

CDTU019186

Shelf Location

Archives, The Learning Commons, 12F, Henry Sy Sr. Hall

Physical Description

1 computer optical disc ; 4 3/4 in.

Keywords

Natural language processing (Computer science)

This document is currently not available here.
