A bi-directional example-based English-Tagalog machine translation system

Date of Publication

2006

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Rachel Edita O. Roxas

Defense Panel Chair

Allan B. Borra

Defense Panel Member

Rachel Edita O. Roxas
Raquel E. Sison Buban

Abstract/Summary

A bi-directional English-Tagalog machine translation system named Halo is created based on the example-based machine translation (EBMT) approach, wherein the translation is based primarily on knowledge obtained from analysis of parallel corpora. The system focused on the creation of a knowledge base for translation, requiring no linguistic knowledge prior to and during translation.

Halo is composed of two major phases: the knowledge extraction phase and the translation phase. From parallel corpora, databases of sentence pair examples are extracted. All the words that occur in the stored sentence pairs are indexed with information on their frequency and position. A database structure for this purpose, based on the relational database concept, was also developed. The Dice coefficient is used to establish a relationship between words from the two languages, and the resulting values are used to approximate the most probable translation of each word. Algorithms were developed for the following processes: build-up of the correlation table (dictionary), segmentation of the input text, translation of the segments, and recombination of the translated segments to form the final translation of the whole input text.
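
As an illustration of the correlation step only, the Dice coefficient between a source word and a target word can be computed from the sets of sentence pairs in which each occurs: 2 x (pairs containing both) / (pairs containing the source word + pairs containing the target word). The sketch below is not taken from the thesis. The function names, the whitespace tokenization, and the three toy sentence pairs are assumptions made for the example, and the relational database described above is replaced by in-memory dictionaries.

```python
from collections import defaultdict

# Toy parallel corpus (illustrative data, not from the thesis corpora).
PAIRS = [
    ("the house is big", "malaki ang bahay"),
    ("the house is small", "maliit ang bahay"),
    ("the dog is big", "malaki ang aso"),
]


def build_index(sentence_pairs):
    """Map every word to the set of sentence-pair ids in which it occurs."""
    src_index = defaultdict(set)
    tgt_index = defaultdict(set)
    for pair_id, (src, tgt) in enumerate(sentence_pairs):
        # Assumption: simple lowercase whitespace tokenization.
        for word in src.lower().split():
            src_index[word].add(pair_id)
        for word in tgt.lower().split():
            tgt_index[word].add(pair_id)
    return src_index, tgt_index


def dice(src_ids, tgt_ids):
    """Dice coefficient of two sets of sentence-pair ids, in [0.0, 1.0]."""
    total = len(src_ids) + len(tgt_ids)
    if total == 0:
        return 0.0
    return 2 * len(src_ids & tgt_ids) / total


def best_translation(word, src_index, tgt_index):
    """Return (target_word, score) with the highest Dice score, or None."""
    src_ids = src_index.get(word, set())
    scores = {t: dice(src_ids, ids) for t, ids in tgt_index.items()}
    if not scores:
        return None
    return max(scores.items(), key=lambda kv: kv[1])


src_index, tgt_index = build_index(PAIRS)
print(best_translation("house", src_index, tgt_index))
```

In this toy corpus "house" and "bahay" occur in exactly the same two sentence pairs, so they receive the maximum score, while a frequent function word such as "ang" scores lower because it also appears in pairs that do not contain "house".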

The system was tested on subsets of parallel corpora from the 1987 Philippine Constitution and the novel Alchemist. A scoring algorithm is used to generate the two candidate translations with the highest scores (1.0 being the maximum value), and the candidate with the higher score is taken as the correct translation. For the Philippine Constitution test data, the average translation score at both the chunk and sentence levels is 0.85 from English to Tagalog and 0.72 from Tagalog to English. For the Alchemist corpus, the average scores from English to Tagalog are 0.56 at the chunk level and 0.64 at the sentence level; from Tagalog to English, the scores at the chunk and sentence levels are 0.63 and 0.62, respectively. The percentage of segments or chunks translated correctly, as determined manually against the expected translation for selected input sentences, is highest (66%) for Tagalog to English translation using the Alchemist corpus, while English to Tagalog translation of the same corpus has the lowest percentage of correct translations (40%). For the 1987 Philippine Constitution, the percentage of correct translations was evaluated to be 59% for English to Tagalog and 41% for Tagalog to English.

The quality of translation depends heavily on the quality and nature of the corpus used. The Philippine Constitution test data had better translation scores, since strict and proper translations are necessary for such a legal document. In contrast, the Alchemist test data produced low-quality translations in which most of the segments were not translated correctly, because the sentences in the corpus were translated non-literally (or subjectively), as is typical of a literary document. In general, the results show acceptable translations at the chunk level, while translations of whole input texts composed of several chunks tend to degenerate in thought because they are derived from different sentence examples.

Abstract Format

html

Language

English

Format

Print

Accession Number

TG04081; CDTG004081

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

v, 78, 10 leaves ; 28 cm. + 1 computer optical disc + 1 cd supplement.

Keywords

Machine translating; English language--Machine translating; English language--Translating into Tagalog

This document is currently not available here.
