Building an English-Tagalog tourism corpus and lexicon for a statistical machine translation system

Date of Publication


Document Type

Master's Thesis

Degree Name

Master in Computer Science


College of Computer Studies


Software Technology

Thesis Adviser

Charibeth K. Cheng

Defense Panel Chair

Ethel C. Ong

Defense Panel Member

Nathalie Rose Lim-Cheng
Charibeth K. Cheng


Statistical machine translation systems extract a bilingual dictionary and translation rules from a large volume of bilingual training data, and select the most probable translation by statistically resolving structural ambiguity. The machine learns how to translate words by observing a large number of examples and inferring constraints from them. The more translation examples there are, the more accurate the translation becomes.

This study aimed to build a bilingual corpus of Philippine tourism data and a bilingual lexicon of named entities from the Philippine tourism domain. The output of this project is intended to further enhance the Philippine component of the ASEAN-MT project, a statistical machine translation system.

The corpus was built manually, and the retrieved data were translated manually. The data consisted of documents from Philippine tourism websites such as itsmorefuninthephilippines.com, www.experiencephilippines.org, www.wowphilippines.ca and http://www.visitmyphilippines.com.

Named entities such as people's names, group names, company names, currency units, temporal entities, language names, locations, products, and artistic creations were manually annotated following the guidelines set by the National Electronics and Computer Technology Centre (NECTEC) (Appendix A). NECTEC is the group that headed the ASEAN-MT project.

Data were analysed and evaluated using a statistical machine translation system called MOSES.

The corpus was tested by category (Festivals and Events, Provincial Profile, Tourist Attractions and General Information); the Tourist Attractions category got a BLEU score of 76.74. The corpus was also evaluated by translator: BLEU scores of 31.59, 31.87, 24.6 and 64.02 were computed for the manual translations of translator1, translator2, translator3 and translator4, respectively.

The corpus was further tested by translator per category, yielding BLEU scores of 76.57 and 69.69 for the Provincial Profile and General Information categories under translator2, and 65.73 for the Tourist Attractions category under translator4.

However, because of factors such as the number of function words, named entities and numbers, the BLEU score of the corpus as a whole was 34.42.

Based on the BLEU score, the overall quality of the corpus was poor. However, because the Tourist Attractions category got a high BLEU score, the bilingual corpus for that category can contribute to the translation quality of the ASEAN-MT project.
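
The BLEU scores reported above measure how many word n-grams a system translation shares with a human reference translation, with a penalty for translations that are too short. The following is a minimal sketch of that computation, not the evaluation script used in this study. It assumes whitespace tokenization, one reference per sentence, n-grams up to length 4, no smoothing, and a 0-100 scale; the function names `ngrams` and `corpus_bleu` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams found in the hypotheses, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each hypothesis n-gram count
            # by the number of times it occurs in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, one empty order makes BLEU zero

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # The brevity penalty lowers the score when the output is shorter
    # than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Made-up sentences for illustration only, not taken from the corpus.
    system_output = ["ang bulkang mayon ay nasa albay"]
    reference = ["ang bulkang mayon ay matatagpuan sa albay"]
    print(round(corpus_bleu(system_output, reference), 2))
```

Because the n-gram counts are pooled over all sentences before the precisions are taken, the score of a whole corpus is not the average of the scores of its parts, which is one reason per-category and per-translator scores can differ widely from the overall score.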

Abstract Format






Accession Number


Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

1 computer optical disc ; 4 3/4 in.

This document is currently not available here.