Date of Publication

2024

Document Type

Master's Thesis

Degree Name

Bachelor of Science (Honors) in Computer Science and Master of Science in Computer Science

Subject Categories

Computer Sciences | Software Engineering

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Advisor

Ethel C. Ong

Defense Panel Chair

Charibeth K. Cheng

Defense Panel Member

Edward P. Tighe

Abstract (English)

Machine reading comprehension (MRC) is a popular task in natural language processing with applications in sectors such as customer service and healthcare. MRC has been integrated into software systems such as search engines and chatbots, and it serves as a benchmark for evaluating language models. Despite the large body of MRC research in high-resource languages like English and Chinese, no work has been done on Filipino MRC. This study aims to kick-start the field of Filipino MRC by constructing the Filipino Question Answering Dataset (FilQuAD), the first dataset in this area, to facilitate the training and evaluation of Filipino MRC models. The questions in FilQuAD were gathered through manual data collection and synthetic data generation, yielding a total of 4,063 question-answer pairs. Dataset analysis and model evaluation were conducted to understand the properties of the created questions and to benchmark existing Filipino language models on the MRC task. Model evaluation experiments show that cross-lingual language models significantly outperform Filipino models and that synthetic data augmentation improves model performance. The models struggled most on questions requiring multi-sentence reasoning and world knowledge, as well as on questions with numeric answers.

Language

English

Keywords

Natural language processing (Computer science); Computational linguistics; Machine learning -- Educational applications; Filipino language -- Data processing; Question-answering systems

Recommended Citation

Pua, G. T. (2024). FilQuAD: A Filipino question answering dataset for machine reading comprehension. Retrieved from https://animorepository.dlsu.edu.ph/etdm_softtech/14

Embargo Period

8-15-2024

Software Technology Master's Theses

FilQuAD: A Filipino question answering dataset for machine reading comprehension
