Autolex: An automatic lexicon builder for minority languages using an open corpus

Added Title

PACLIC 24
Pacific Asia Conference on Language, Information and Computation (24th)

College

College of Computer Studies

Department/Unit

Software Technology

Document Type

Article

Volume

PACLIC 24 - Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

First Page

603

Last Page

611

Publication Date

12-1-2010

Abstract

This study aims to build natural language resources, such as corpora and lexicons, for minority languages and other languages with limited resources. Building these resources manually is tedious and costly, yet they are needed for natural language processing research and system development. Tagalog, a minority language, served as the test bed. The study used the World Wide Web to retrieve documents written in the minority language and employed a frequency-based algorithm to build the lexicon. For the evaluation, the corpus consisted of 260 Tagalog documents extracted from the web. From this corpus, the system automatically selected 1,386 unique candidate words as lexical entries, based on a frequency threshold of 10. Each lexical entry was validated by a language expert. The evaluation shows an accuracy of 97.84% and an error rate of 2.16%. The errors were incorrectly spelled words or words that are not Tagalog.
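The frequency-based selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenization rule, the lowercasing, the names `build_lexicon` and `accuracy_percent`, and the choice of "at least the threshold" (rather than "strictly above") are all assumptions, since the abstract states only that a frequency threshold of 10 was used.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, optionally joined by an internal
# hyphen or apostrophe (e.g. "mag-aral"). The paper's actual rule is not
# given in the abstract.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_lexicon(documents, threshold=10):
    """Return {word: count} for words seen at least `threshold` times.

    Counts are pooled over the whole corpus (an assumption); words are
    lowercased so that "Ang" and "ang" count as one entry.
    """
    counts = Counter()
    for text in documents:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return {word: n for word, n in counts.items() if n >= threshold}


def accuracy_percent(valid_entries, total_entries):
    """Share of candidate entries accepted by the validator, in percent."""
    return round(100.0 * valid_entries / total_entries, 2)


if __name__ == "__main__":
    # Toy corpus with a low threshold, only to show the mechanics.
    toy_corpus = ["Ang bata ay masaya", "ang aso ay malaki", "Ang"]
    print(build_lexicon(toy_corpus, threshold=2))
```

With the figures reported in the abstract, an error rate of 2.16% over 1,386 candidates corresponds to 30 rejected entries. That count is inferred from the percentages and is not stated in the record; `accuracy_percent(1356, 1386)` reproduces the reported 97.84%.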

Disciplines

Computer Sciences

Keywords

Lexicography—Data processing; Computational linguistics
