Software Technology Master's Theses

Exploring the use of pre-trained transformer-based models and semi-supervised learning to build training sets for text classification

Gian Marco I. Te, De La Salle University, Manila

Date of Publication

12-12-2022

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Advisor

Charibeth Cheng

Defense Panel Chair

Joel Ilao

Defense Panel Member

Edward Tighe
Charibeth Cheng

Abstract/Summary

Data annotation is the process of labeling text, images, or other types of content for machine learning tasks. With the rise in popularity of machine learning for classification tasks, large amounts of labeled data are typically needed to train effective models using different algorithms and architectures. Data annotation is a critical step in developing these models and, while an abundance of unlabeled data is generated every day, annotation is often a laborious and costly process. Furthermore, low-resource languages such as Filipino have fewer readily available datasets than mainstream languages for fine-tuning existing models that were pre-trained on large amounts of data. In this study, we explored the use of BERT and semi-supervised learning for textual data to see how they might ease the burden of human annotation when building text classification training sets and, at the same time, reduce the amount of manually labeled data needed to fine-tune a pre-trained model for a specific downstream text classification task. We then analyzed relevant factors that may affect pseudo-labeling performance. We also compared the accuracy scores of different non-BERT classifiers trained on the same samples in two configurations: solely human-labeled data, and a mixture of human-labeled and pseudo-labeled data produced by semi-supervised learning.

Abstract Format

html

Language

English

Format

Electronic

Physical Description

217 leaves

Keywords

Supervised learning (Machine learning); Natural language processing (Computer science)

Recommended Citation

Te, G. I. (2022). Exploring the use of pre-trained transformer-based models and semi-supervised learning to build training sets for text classification. Retrieved from https://animorepository.dlsu.edu.ph/etdm_softtech/6

Embargo Period

11-26-2022
