Autolex: An automatic lexicon builder for minority languages using an open corpus
Added Title
PACLIC 24
Pacific Asia Conference on Language, Information and Computation (24th)
College
College of Computer Studies
Department/Unit
Software Technology
Document Type
Article
Volume
PACLIC 24 - Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation
First Page
603
Last Page
611
Publication Date
12-1-2010
Abstract
This study aims to build natural language resources for minority languages and other languages with limited resources, since building such resources manually is tedious and costly. These resources, such as language corpora and lexicons, will be used for natural language processing research and system development. Tagalog, a minority language, was used as a test bed. The study exploited the World Wide Web to retrieve documents written in a minority language, and we employed a frequency-based algorithm to build the lexicon. For our evaluation, we used 260 Tagalog documents extracted from the web as our corpus. From this corpus, the system automatically selected 1,386 unique candidate words as lexical entries, based on a threshold value of 10. Each lexical entry was validated by a language expert. Our evaluation shows an accuracy of 97.84% and an error rate of 2.16%. The errors consisted of misspelled words and words that are not Tagalog.
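The frequency-based selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenization rule, the lowercasing step, the reading of the threshold as a minimum corpus-wide count (inclusive), and the names `build_lexicon` and `accuracy` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by hyphens or
# apostrophes (e.g. "mag-aral"). The abstract does not specify tokenization.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_lexicon(documents, threshold=10):
    """Return the sorted unique words whose corpus-wide count is >= threshold.

    Assumption: the threshold is a minimum frequency over the whole corpus
    and is inclusive; the abstract only gives its value (10).
    """
    counts = Counter()
    for text in documents:
        # Lowercase so that "Ang" and "ang" are counted as one word.
        counts.update(m.group(0).lower() for m in TOKEN_RE.finditer(text))
    return sorted(word for word, n in counts.items() if n >= threshold)


def accuracy(num_entries, num_rejected):
    """Percentage of lexical entries accepted by the validator."""
    return 100.0 * (num_entries - num_rejected) / num_entries


if __name__ == "__main__":
    # Toy corpus; a low threshold is used because the sample is tiny.
    docs = [
        "Ang bata ay kumain. Ang aso ay tumakbo.",
        "Ang bahay ay malaki.",
    ]
    print(build_lexicon(docs, threshold=3))
```

The `accuracy` helper mirrors the expert validation step: it divides the accepted entries by the total number of candidate entries. The abstract reports only the percentages, not the number of rejected entries, so any count passed to it is the reader's own input.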
Recommended Citation
Buhay, E. C., Evardone, M. P., Nocon, H. B., Dimalen, D. D., & Roxas, R. O. (2010). Autolex: An automatic lexicon builder for minority languages using an open corpus. PACLIC 24 - Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, 603-611. Retrieved from https://animorepository.dlsu.edu.ph/faculty_research/4041
Disciplines
Computer Sciences
Keywords
Lexicography—Data processing; Computational linguistics