Date of Publication

7-14-2021

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Advisor

Charibeth K. Cheng

Defense Panel Chair

Joel Ilao

Defense Panel Member

Nathalie Rose-Lim Cheng

Abstract/Summary

The Philippines is home to more than 150 languages that are considered low-resourced, resulting in a lack of pursuit in developing translation systems for most of them. To improve the results and processes of translation systems for low-resource languages, multilingual neural machine translation (NMT) has become an active area of research. However, existing works in multilingual NMT have not analyzed a multilingual model on a closely related, low-resource language group in the context of zero-resource translation.

In this study, we benchmark translation systems for several Philippine languages and provide an analysis of a Transformer-based multilingual NMT system for morphologically rich and low-resource languages in terms of its capability to translate unseen language pairs using zero-shot translation and pivot-based translation. Our results show that, due to the architectural design of the Transformer model, common words and sentence-length differences affect the performance of a multilingual NMT system in translating both seen and unseen language pairs. Bicolano, Cebuano, and Hiligaynon consistently perform better than the other languages across translation tasks because they have a good balance of commonality and sentence-length difference.

This work also investigates the effect of increasing the model size and capacity. The larger model was able to build a language-invariant shared representation space and stronger decoding capabilities for zero-shot translation, whereas the previous, smaller-capacity model failed to develop a language-invariant shared representation space and could only produce translations up to English when attempting zero-shot translation.

Since we are dealing with low-resource multilingual data, the risks involved include domain shift and out-of-vocabulary words. We also show how the multilingual NMT model leverages joint byte-pair encoding and the shared representation space to produce translations for unseen or rare words.

Lastly, we show that the Transformer-based multilingual NMT model can compete with or outperform other translation approaches. In a comparative analysis, several statistical MT models were built as baselines and compared against a single multilingual NMT model. The results show that the multilingual NMT model outperforms the statistical MT models in both a bidirectional English and Philippine languages translation task and a pivot-based Philippine languages translation task, where the multilingual NMT model retained information and context across multilingual translation, something the statistical MT models failed to do. The multilingual NMT model also produces competitive results against a directly trained NMT model in a bidirectional Cebuano and Tagalog translation task: its pivot-based approach achieved BLEU scores of 6.72 and 7.20, against 9.54 and 10.55 for the directly trained NMT model, on Tagalog-to-Cebuano and Cebuano-to-Tagalog translation, respectively, even though the multilingual NMT model had no parallel Cebuano and Tagalog data. This demonstrates the effectiveness of a multilingual NMT model in building translation systems for low-resource languages.

Abstract Format

html

Language

English

Format

Electronic

Physical Description

93 leaves

Keywords

Philippine languages—Translations; Translators (Computer programs)

Embargo Period

8-20-2022

Available for download on Saturday, August 20, 2022
