Date of Publication

12-14-2024

Document Type

Dissertation/Thesis

Degree Name

Bachelor of Science (Honors) in Computer Science and Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Advisor

Dr. Ethel Chua Joy Ong

Defense Panel Chair

Dr. Charibeth K. Cheng

Defense Panel Member

Dr. Ann Franchesca Laguna

Abstract (English)

Pretrained large language models require vast amounts of training data, and fine-tuning them demands high-quality specialized datasets; this renews attention on high-quality data and its creation. Annotation and dataset creation are usually done manually, which is time-consuming and expensive. Synthetic data generation addresses this by producing large amounts of data at low cost, but with drawbacks in reliability and consistency, as machine-generated data can be inaccurate. In this study, we investigate the potential and capabilities of a fine-tuned Large Language Model (LLM) as a curator model for dataset curation in text classification tasks. A pretrained GPT-4o-mini model was fine-tuned and evaluated by curating synthetic data on various text classification tasks such as question classification. The curator model provides a tool for assessing data quality through a score and an explanation; this is especially useful for synthetic data, given its inconsistent nature, and serves as a way to establish a baseline quality for a dataset. Results show that the best performance is achieved by augmenting the training data with high-quality synthetic data, and that training purely on a synthetic dataset, even a curated one, suffers from performance issues. Ethical considerations regarding the use of LLMs are also discussed, along with possible ways to mitigate ethical risks and concerns.
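The curation-then-augmentation step described above can be sketched as follows. This is a hypothetical minimal illustration, not the thesis's actual implementation: the function name, data layout, and score threshold are all assumptions. It assumes the curator model has already assigned each synthetic sample a quality score and an explanation, and keeps only high-scoring samples to augment the real training data.

```python
# Hypothetical sketch of curated augmentation: synthetic samples carry a
# curator-assigned quality score and explanation; only samples at or above
# a threshold are merged into the real training data.

def augment_with_curated(real_data, synthetic_data, threshold=0.8):
    """Return real_data extended with synthetic samples that pass curation.

    real_data: list of (text, label) pairs from the original dataset.
    synthetic_data: list of (text, label, score, explanation) tuples,
        where score and explanation come from the curator model.
    """
    curated = [
        (text, label)
        for text, label, score, _explanation in synthetic_data
        if score >= threshold
    ]
    return real_data + curated

real = [("What city is the capital of France?", "LOCATION")]
synthetic = [
    ("Who wrote Hamlet?", "HUMAN", 0.95, "Clear, correctly labeled question."),
    ("Paris?", "LOCATION", 0.40, "Too short; ambiguous intent."),
]

# Only the high-scoring synthetic sample is kept alongside the real data.
augmented = augment_with_curated(real, synthetic)
```

The low-scoring sample is dropped, reflecting the finding that augmenting with high-quality synthetic data outperforms training on unfiltered or purely synthetic data.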

Abstract Format

html

Language

English

Upload Full Text

wf_yes

Embargo Period

12-13-2024
