Date of Publication

8-5-2004

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Rachel Edita O. Roxas

Defense Panel Chair

Allan B. Borra

Defense Panel Member

Lolita Reyes
Ethel C. Ong

Abstract/Summary

AutoCor is a method for the automatic acquisition and classification of corpora of documents in closely related languages. It extends and enhances CorpusBuilder, a system that automatically builds specific minority language corpora from a closed corpus (Ghani et al., 2001a). AutoCor used odds ratio, the query generation method reported to produce the best results in CorpusBuilder. It considered closely related languages rather than a single minority language, and introduced the concept of common word pruning to the language models of closely related languages, which was found to improve the precision of the system. The method was implemented in PHP and Perl and tested on the three most closely related languages in the Philippines, namely Bicolano, Cebuano, and Tagalog (Fortunato, 1993). Each target language was tested for query lengths 1 to 5, with 100 generated queries per query length, both with and without pruning. Precision and recall were computed per query, and average precision was computed per query length. The results show that common word pruning improved the precision of the system: the highest improvements were 52.96% for Bicolano at query length 4, 18.00% for Cebuano at query length 1, and 19.78% for Tagalog at query length 2.
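
The pipeline summarized in the abstract (odds-ratio query generation, common word pruning, and per-query precision and recall) can be illustrated with a minimal Python sketch. This is not the thesis implementation, which was written in PHP and Perl. The function names, the whitespace tokenizer, the add-one smoothing, and the reading of "common word pruning" as dropping any candidate word that also appears in another language's vocabulary are all assumptions made for illustration; the abstract does not specify them. The score used is the standard log odds ratio, log(p(1-q) / (q(1-p))), where p and q are the fractions of target-language and other-language documents containing the word.

```python
import math
from collections import Counter


def doc_frequency(docs):
    """Count, for each word, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def odds_ratio_scores(target_docs, other_docs):
    """Score each word by the log odds ratio of appearing in target vs. other documents."""
    target_df = doc_frequency(target_docs)
    other_df = doc_frequency(other_docs)
    n_target, n_other = len(target_docs), len(other_docs)
    scores = {}
    for word in set(target_df) | set(other_df):
        # Assumption: add-one smoothing keeps p and q strictly between 0 and 1.
        p = (target_df[word] + 1) / (n_target + 2)
        q = (other_df[word] + 1) / (n_other + 2)
        scores[word] = math.log(p * (1 - q) / (q * (1 - p)))
    return scores


def prune_common_words(scores, other_vocabularies):
    """Assumed pruning rule: drop words found in any other language's vocabulary."""
    common = set().union(*other_vocabularies)
    return {word: s for word, s in scores.items() if word not in common}


def build_query(scores, length):
    """Take the top-scoring words as query terms (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:length]


def precision_recall(retrieved, relevant):
    """Precision and recall of one query's retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

In this sketch, a run "with pruning" would pass the scores through `prune_common_words` before calling `build_query`, while a run "without pruning" would call `build_query` on the raw scores; averaging the per-query precision over all queries of a given length gives the per-length figure the abstract reports.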

Abstract Format

html

Language

English

Format

Print

Accession Number

TG03719; CDTG003719

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

vii, 158 leaves ; 28 cm. + 1 computer optical disc.

Keywords

Query languages (Computer science); Corpora (Linguistics); Machine translating; Computational linguistics; QUERY (Information retrieval system); Language and languages