Statistics-based rule generation for Filipino style and grammar checking
Date of Publication
2014
Document Type
Master's Thesis
Degree Name
Master of Science in Computer Science
College
College of Computer Studies
Department/Unit
Computer Science
Thesis Adviser
Joel Ilao
Defense Panel Member
Rachel Edita Roxas
Allan Borra
Abstract/Summary
Current research in corpus and computational linguistics is data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies. Style and grammar checkers can be used for this purpose. However, recent technologies rely on manually developed rules, and writing those rules is a time-consuming, herculean task. This work presents a statistics-based rule generation framework that can be used to learn spelling variations, affix usage, and common mistakes. The research focuses on Filipino, a language characterized by a high degree of inflection. Monolingual corpora, annotated documents, and tagged data were collected. The monolingual corpus was modeled, and machine learning was used to aid in detecting spelling variations; the tagged data was processed, and data association was applied to determine affix usage; and a subset of the annotated documents was digitized and used as training data for a statistical machine translation engine to determine common mistakes. A total of 396 variant pairs, 16 affix usages, and 22 phrase pairs were generated and transformed into rules. A subset of these linguistic phenomena has been reported in the literature, an indication that the framework can be used to automate linguistic tasks. The proposed variant scoring matches the style proposed by the Sentro ng Wikang Filipino (SWF) with 30% recall and the style proposed by the Komisyon sa Wikang Filipino (KWF) with 60% recall, an indication that the KWF style is more closely aligned with the variant scoring. As future work, a policy paper could be drafted in coordination with experts in language planning.
Abstract Format
html
Language
English
Format
Electronic
Electronic File Format
MS WORD
Accession Number
CDTG005544
Shelf Location
Archives, The Learning Commons, 12F Henry Sy Sr. Hall
Physical Description
leaves ; 4 3/4 in.
Recommended Citation
Oco, N. A. (2014). Statistics-based rule generation for Filipino style and grammar checking. Retrieved from https://animorepository.dlsu.edu.ph/etd_masteral/4610