Date of Publication
12-14-2024
Document Type
Dissertation/Thesis
Degree Name
Bachelor of Science (Honors) in Computer Science and Master of Science in Computer Science
College
College of Computer Studies
Department/Unit
Software Technology
Thesis Advisor
Dr. Ethel Chua Joy Ong
Defense Panel Chair
Dr. Charibeth K. Cheng
Defense Panel Member
Dr. Ann Franchesca Laguna
Abstract (English)
Pretrained large language models require training on large amounts of data, while fine-tuning them requires high-quality specialized datasets; this renews attention on the importance of high-quality data and how it is created. Annotation, or dataset creation, is usually done manually, making it time-consuming and expensive. Synthetic data generation addresses this by producing large amounts of data at low cost, but at the expense of reliability and consistency, as machine-generated data can be inaccurate. In this study, we investigate the potential and capabilities of a fine-tuned Large Language Model (LLM) as a curator model for dataset curation in text classification tasks. A pretrained GPT-4o-mini model is fine-tuned and evaluated by curating synthetic data on various text classification tasks, such as question classification. The curator model provides a tool for assessing data quality through a score and an explanation; this is especially useful for synthetic data, given their inconsistent quality, and as a way to establish a baseline quality for a dataset. Results show that the best performance is achieved by augmenting the training data with high-quality synthetic data, and that training purely on a synthetic dataset, even a curated one, suffers from performance issues. Ethical considerations regarding the use of LLMs are also discussed, along with possible ways to mitigate ethical risks and concerns.
Abstract Format
html
Language
English
Recommended Citation
Ibrahim, H. A. (2024). Dataset Curator Model for Improving Classification Task Capabilities of Large Language Models. Retrieved from https://animorepository.dlsu.edu.ph/etdm_softtech/13
Upload Full Text
wf_yes
2024_Ibrahim_PageswithSignature.pdf (734 kB)
2024_Ibrahim_Chapter1.pdf (70 kB)
2024_Ibrahim_Chapter2.pdf (103 kB)
2024_Ibrahim_Chapter3.pdf (417 kB)
2024_Ibrahim_Chapter4.pdf (786 kB)
2024_Ibrahim_Chapter5.pdf (59 kB)
2024_Ibrahim_AppendixA.pdf (584 kB)
2024_Ibrahim_References.pdf (103 kB)
Embargo Period
12-13-2024