Automatic lexicon extraction from comparable, non-parallel corpora
Date of Publication
4-22-2004
Document Type
Master's Thesis
Degree Name
Master of Science in Computer Science
Subject Categories
Computer Sciences
College
College of Computer Studies
Department/Unit
Computer Science
Thesis Adviser
Rachel Edita O. Roxas
Defense Panel Chair
Lolita V. Reyes
Defense Panel Member
Allan B. Borra
Abstract/Summary
An automated approach to extracting a bilingual lexicon (or dictionary) from comparable, non-parallel corpora was developed, implemented, and tested. The corpora used are from the biblical domain and contain 381,553 English and 92,610 Tagalog terms, with 4,817 and 3,421 distinct root words, respectively. The terms in the resulting lexicon are grouped into their respective senses. The system was evaluated on 100 test words: 50 high-frequency words (HFW) and 50 low-frequency words (LFW). In the recall test, 50.29% (HFW) and 31.37% (LFW) of the expected translations across all clusters were generated. In the precision test, 56.12% (HFW) and 21.98% (LFW) of the expected translations within clusters were generated. The overall result, represented by the F-measure (a combination of recall and precision), shows that 10.65% of the expected translations for the 100 test words were generated. Overall performance would improve with the inclusion of additional natural language resources (e.g., lexicon expansion to include alternate senses, word-for-word lexicon translation, larger comparable corpora), better preprocessing techniques (e.g., stemming and part-of-speech tagging for Tagalog), and other enhancements (e.g., smoothing of sparse data and disambiguation techniques).
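The abstract does not state the exact formulas behind its recall, precision, and F-measure figures, so the following is only a minimal sketch of the conventional set-based definitions, with the F-measure assumed to be the harmonic mean of precision and recall. The function name `precision_recall_f1` and the Tagalog word lists are hypothetical toy data, not taken from the thesis, and the thesis's own cluster-based scoring may well differ from this.

```python
def precision_recall_f1(generated, expected):
    """Score one test word's generated translations against a gold list.

    Assumption: conventional set-based definitions, with F1 as the
    harmonic mean of precision and recall. The thesis may use a
    different, cluster-aware formulation.
    """
    generated, expected = set(generated), set(expected)
    hits = len(generated & expected)  # correct translations produced

    # Precision: share of generated translations that are expected.
    precision = hits / len(generated) if generated else 0.0
    # Recall: share of expected translations that were generated.
    recall = hits / len(expected) if expected else 0.0
    # F1: harmonic mean, defined as 0 when both scores are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data (not from the thesis): candidate Tagalog
# translations for one English test word versus a gold-standard list.
generated = ["bahay", "tahanan", "gusali", "silid"]
expected = ["bahay", "tahanan", "tirahan"]

precision, recall, f1 = precision_recall_f1(generated, expected)
print(f"precision={precision:.2%} recall={recall:.2%} F1={f1:.2%}")
```

Under these assumed definitions, a system scores well only when it both finds most of the expected translations and avoids proposing many unexpected ones, which is the trade-off a combined F-measure is meant to capture.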
Abstract Format
html
Language
English
Format
Electronic
Accession Number
CDTG003689
Shelf Location
Archives, The Learning Commons, 12F, Henry Sy Sr. Hall
Keywords
Lexicography--Data processing; Computational linguistics; Bilingualism--Lexicology
Recommended Citation
Tiu, E. K. (2004). Automatic lexicon extraction from comparable, non-parallel corpora. Retrieved from https://animorepository.dlsu.edu.ph/etd_masteral/6990
Embargo Period
2-22-2022