Automatic bilingual lexicon extraction for a minority target language
Added Title
Pacific Asia Conference on Language, Information and Computation (22nd)
PACLIC 22
College
College of Computer Studies
Department/Unit
Software Technology
Document Type
Conference Proceeding
Source Title
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22
First Page
368
Last Page
376
Publication Date
12-1-2008
Abstract
An automated approach for extracting a bilingual lexicon from comparable, nonparallel corpora was developed for a target language with limited linguistic resources. We combined approaches from previous research, each of which concentrated on only one of context extraction, clustering techniques, or the use of part-of-speech tags for defining the different senses of a word. The domain-specific corpus for the source language contains 381,553 English words, while the corpus for the target language, which has minimal language resources, contains 92,610 Tagalog words, with 4,817 and 3,421 distinct root words, respectively. Despite the limited size of the corpora (400k words vs. Sadat's (2003) 39M-word corpora) and of the seed lexicon (9,026 entries vs. Rapp's (1999) 16,380 entries), the evaluation yielded promising results. The 50 high-frequency and 50 low-frequency words yielded recall values of 50.29% and 31.37% and precision values of 56.12% and 21.98%, respectively, which are within the range of values from previous studies, 39% to 84.45% (Koehn et al., 2002 and Zhou et al., 2001). Ranking improved the overall F-measure from 7.32% to 10.65%. © 2007 by Eileen Pamela Tiu and Rachel Edita O. Roxas.
Recommended Citation
Tiu, E., & Roxas, R. O. (2008). Automatic bilingual lexicon extraction for a minority target language. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22, 368-376. Retrieved from https://animorepository.dlsu.edu.ph/faculty_research/4040
Disciplines
Computer Sciences
Keywords
Lexicography—Data processing; Computational linguistics