Statistics-based rule generation for Filipino style and grammar checking

Date of Publication

2014

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Joel Ilao

Defense Panel Member

Rachel Edita Roxas
Allan Borra

Abstract/Summary

Current research in corpus and computational linguistics is largely data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies, and style and grammar checkers can be used for this purpose. However, existing tools rely on manually developed rules, which are time-consuming and laborious to write. This work presents a statistics-based rule generation framework that can be used to learn spelling variations, affix usage, and common mistakes. The research focuses on Filipino, a language with a high degree of inflection. Monolingual corpora, annotated documents, and tagged data were collected. The monolingual corpus was modeled, and machine learning was used to aid in detecting spelling variations. The tagged data was processed, and data association was applied to determine affix usage. A subset of the annotated documents was digitized and used as training data for a statistical machine translation engine to determine common mistakes. A total of 396 variant pairs, 16 affix usages, and 22 phrase pairs were generated and transformed into rules. A subset of these linguistic phenomena has been reported in the literature, an indication that the framework can be used to automate linguistic tasks. The proposed variant scoring matches the style proposed by the Sentro ng Wikang Filipino (SWF) with 30% recall and the style proposed by the Komisyon sa Wikang Filipino (KWF) with 60% recall, an indication that the KWF style is more closely aligned with the variant scoring. As future work, a policy paper could be drafted in coordination with experts in language planning.
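
The recall figures quoted in the abstract compare the variants preferred by the scoring with those preferred by a style guide. The abstract does not give the thesis's exact evaluation procedure, so the following is only a minimal sketch under an assumed definition: recall is the fraction of a guide's preferred variants that the scoring also prefers. The function name `recall` and the word lists are hypothetical toy data, not results from the thesis.

```python
def recall(preferred_by_guide, preferred_by_scoring):
    """Fraction of the guide's preferred variants that the scoring also prefers.

    Assumed definition for illustration; the thesis may evaluate differently.
    """
    if not preferred_by_guide:
        raise ValueError("style guide set is empty")
    matched = preferred_by_guide & preferred_by_scoring
    return len(matched) / len(preferred_by_guide)


# Hypothetical toy data: spellings a style guide prefers, and spellings
# that an automatic variant scoring ranks highest.
guide = {"kompyuter", "titser", "iskolar", "telebisyon", "dyip"}
scoring = {"kompyuter", "iskolar", "dyip", "taksi"}

score = recall(guide, scoring)
print(f"recall = {score:.2f}")
```

Under this definition, variants that the scoring prefers but the guide does not list (here `taksi`) do not affect recall; they would only affect precision.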

Abstract Format

html

Language

English

Format

Electronic

Electronic File Format

MS WORD

Accession Number

CDTG005544

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

leaves ; 4 3/4 in.

This document is currently not available here.