Date of Publication

12-14-2024

Document Type

Dissertation/Thesis

Degree Name

Bachelor of Science (Honors) in Computer Science and Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Advisor

Dr. Ethel Chua Joy Ong

Defense Panel Chair

Dr. Charibeth K. Cheng

Defense Panel Member

Dr. Ann Franchesca Laguna

Abstract (English)

Pretrained large language models require vast amounts of training data, and fine-tuning them demands high-quality specialized datasets; this renews attention on high-quality data and its creation. Annotation and dataset creation are usually done manually, which is time-consuming and expensive. Synthetic data generation addresses this by producing large amounts of data at low cost, but with drawbacks in reliability and consistency, as machine-generated data can be inaccurate. In this study, we investigate the potential and capabilities of a fine-tuned Large Language Model (LLM) as a curator model for dataset curation in text classification tasks. A pretrained GPT-4o-mini model was fine-tuned and evaluated by curating synthetic data on various text classification tasks such as question classification. The curator model provides a tool for assessing data quality through a score and an explanation; this is especially useful for synthetic data, given its inconsistent nature, and serves as a way to establish a baseline quality for a dataset. Results show that the best performance is achieved by augmenting the training data with high-quality synthetic data, and that training purely on a synthetic dataset, even a curated one, suffers from performance issues. Ethical considerations regarding the use of LLMs are also discussed, along with possible ways to mitigate ethical risks and concerns.
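The curation-then-augmentation step described above can be sketched as follows. This is a hypothetical minimal illustration, not the thesis's actual implementation: the function name, data layout, and score threshold are all assumptions. It assumes the curator model has already assigned each synthetic sample a quality score and an explanation, and keeps only high-scoring samples to augment the real training data.

```python
# Hypothetical sketch of curated augmentation: synthetic samples carry a
# curator-assigned quality score and explanation; only samples at or above
# a threshold are merged into the real training data.

def augment_with_curated(real_data, synthetic_data, threshold=0.8):
    """Return real_data extended with synthetic samples that pass curation.

    real_data: list of (text, label) pairs from the original dataset.
    synthetic_data: list of (text, label, score, explanation) tuples,
        where score and explanation come from the curator model.
    """
    curated = [
        (text, label)
        for text, label, score, _explanation in synthetic_data
        if score >= threshold
    ]
    return real_data + curated

real = [("What city is the capital of France?", "LOCATION")]
synthetic = [
    ("Who wrote Hamlet?", "HUMAN", 0.95, "Clear, correctly labeled question."),
    ("Paris?", "LOCATION", 0.40, "Too short; ambiguous intent."),
]

# Only the high-scoring synthetic sample is kept alongside the real data.
augmented = augment_with_curated(real, synthetic)
```

The low-scoring sample is dropped, reflecting the finding that augmenting with high-quality synthetic data outperforms training on unfiltered or purely synthetic data.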

Abstract Format

html

Language

English

Upload Full Text

wf_yes

Embargo Period

12-13-2024
