A corpus-based Filipino grammar checker using hybrid N-gram rules from grammatically-correct terms

Date of Publication

2016

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Allan B. Borra

Defense Panel Member

Maribeth Cheng
Allan B. Borra
Nathalie Rose Lim-Cheng
Merlin Teodosia C. Suarez

Abstract/Summary

This study examines the use of a corpus-based approach for detecting grammatical errors and suggesting corrections in Filipino. Prior to this study, the approach had not been applied to Filipino, although it had shown high potential for error detection and correction in other languages. Existing Filipino grammar checkers are few and mostly rule-based. A major concern with such rule-based systems is that they can only detect errors that were defined in the system, which results in a very limited set of error types. The proposed approach, being corpus-based, learns grammar rules from a grammatically correct, tagged corpus; these rules are then used to detect errors and provide suggestions. The grammar rules, which are hybrid n-grams, are composed of words, part-of-speech tags, and lemmas. Input sentences are compared against these grammar rules using a weighted Levenshtein edit distance algorithm to identify whether there is an error. Using this approach, the following correction types can be suggested: insertion, deletion, substitution, merging, and unmerging. The approach also covers a broad range of error types, such as incorrect affixes, misspellings, wrong word usage, missing words, unnecessary words, incorrectly merged words, and incorrectly unmerged words. The developed system scored 64.11% in producing correct suggestions for 248 test phrases containing spelling or grammar errors, and 70.95% accuracy in flagging error-free words in a corpus of 1,284 error-free words, using only a small training corpus of 7,384 complex sentences.
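
The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the edit weights, the `POS:` prefix convention, the tag names, and the function names are all assumptions made for illustration, and the sketch handles only words and part-of-speech tags (no lemmas, merging, or unmerging). It computes a weighted Levenshtein distance between one hybrid n-gram rule and a tagged input phrase, where a distance of zero means the phrase conforms to the rule.

```python
from typing import List, Tuple

# Hypothetical edit weights; the abstract does not state the values used.
COSTS = {"insert": 1.0, "delete": 1.0, "substitute": 0.8}

Token = Tuple[str, str]  # (word, POS tag) for one input token


def matches(rule_item: str, token: Token) -> bool:
    """A rule item is a literal word, or a POS tag written as 'POS:<tag>'."""
    word, pos = token
    if rule_item.startswith("POS:"):
        return rule_item[4:] == pos
    return rule_item == word.lower()


def weighted_distance(rule: List[str], tokens: List[Token]) -> float:
    """Weighted Levenshtein distance between a hybrid n-gram rule and tagged input."""
    rows, cols = len(rule) + 1, len(tokens) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + COSTS["insert"]  # input lacks a rule item
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + COSTS["delete"]  # input has an extra token
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if matches(rule[i - 1], tokens[j - 1]) else COSTS["substitute"]
            d[i][j] = min(
                d[i - 1][j - 1] + sub,            # match or substitution
                d[i - 1][j] + COSTS["insert"],    # suggest inserting rule[i-1]
                d[i][j - 1] + COSTS["delete"],    # suggest deleting tokens[j-1]
            )
    return d[-1][-1]


# Illustrative rule mixing literal words and POS tags (tag names are made up).
rule = ["ang", "POS:NN", "ay", "POS:JJ"]
correct = [("ang", "DT"), ("bata", "NN"), ("ay", "LM"), ("masaya", "JJ")]
missing_word = [("ang", "DT"), ("bata", "NN"), ("masaya", "JJ")]

print(weighted_distance(rule, correct))       # 0.0, phrase conforms to the rule
print(weighted_distance(rule, missing_word))  # 1.0, one insertion needed
```

In a fuller system, the rule with the smallest non-zero distance to the input would presumably supply the suggestion, with the chosen edit operations mapping to the correction types listed above.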

Abstract Format

html

Language

English

Format

Electronic

Accession Number

CDTG006938

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

1 computer disc ; 4 3/4 in.

Keywords

Filipino language--Grammar; Filipino language; Filipino language--Study and teaching

This document is currently not available here.