Date of Publication

2024

Document Type

Dissertation/Thesis

Degree Name

Bachelor of Science (Honors) in Computer Science and Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Advisor

Ethel C. Ong

Defense Panel Chair

Charibeth K. Cheng

Defense Panel Member

Edward P. Tighe

Abstract (English)

Machine reading comprehension (MRC) is a popular task in natural language processing with applications in sectors such as customer service and healthcare. It has been integrated into software systems such as search engines and chatbots, and it serves as a benchmark for evaluating language models. Despite the large body of MRC research in high-resource languages such as English and Chinese, no work has been done on Filipino MRC. This study aims to kick-start the field by constructing the Filipino Question Answering Dataset (FilQuAD), the first dataset in this area, to facilitate the training and evaluation of Filipino MRC models. A total of 4,063 question-answer pairs were gathered through manual data collection and synthetic data generation. Dataset analysis and model evaluation were then conducted to understand the properties of the collected questions and to benchmark existing Filipino language models on the MRC task. The evaluation experiments show that cross-lingual language models significantly outperform Filipino models and that synthetic data augmentation improves model performance. The models struggled most with questions requiring multi-sentence reasoning and world knowledge, as well as questions with numeric answers.

Language

English

Embargo Period

8-15-2024
