Automatic bilingual lexicon extraction for a minority target language

Added Title

Pacific Asia Conference on Language, Information and Computation (22nd)
PACLIC 22

College

College of Computer Studies

Department/Unit

Software Technology

Document Type

Conference Proceeding

Source Title

Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22

First Page

368

Last Page

376

Publication Date

12-1-2008

Abstract

An automated approach to extracting a bilingual lexicon from comparable, nonparallel corpora was developed for a target language with limited linguistic resources. We combined approaches from previous research that concentrated only on context extraction, clustering techniques, or the use of part-of-speech tags to define the different senses of a word. The domain-specific corpus for the source language contains 381,553 English words, while the corpus for the target language, which has minimal language resources, contains 92,610 Tagalog words; the two corpora contain 4,817 and 3,421 distinct root words, respectively. Despite the limited size of the corpora (400k words vs. Sadat's (2003) 39M-word corpora) and of the seed lexicon (9,026 entries vs. Rapp's (1999) 16,380 entries), the evaluation yielded promising results. The 50 high-frequency and 50 low-frequency words yielded recall values of 50.29% and 31.37% and precision values of 56.12% and 21.98%, respectively, which are within the range of values reported in previous studies, 39% to 84.45% (Koehn et al., 2002; Zhou et al., 2001). Ranking improved the overall F-measure from 7.32% to 10.65%. © 2007 by Eileen Pamela Tiu and Rachel Edita O. Roxas.

Disciplines

Computer Sciences

Keywords

Lexicography—Data processing; Computational linguistics
