Building neural word embeddings using new media resources for text classification
Date of Publication
9-9-2016
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Subject Categories
Computer Sciences
College
College of Computer Studies
Department/Unit
Software Technology
Thesis Adviser
Merlin Teodosia C. Suarez
Defense Panel Chair
Arnulfo P. Azcarraga
Defense Panel Member
Joel P. Ilao
Susan Pancho Festin
Rachel Edita O. Roxas
Merlin Teodosia C. Suarez
Abstract/Summary
For computers to understand the meaning of human language, that meaning must be represented in the language of computers. Distributional semantic models (DSMs) assume that the meaning of a word can be inferred from its distributional properties in text. Through statistical analysis of the contexts in which a word occurs, these models dynamically build semantic representations in the form of high-dimensional vector spaces. This study shows how different DSMs improve classification performance. Existing work on modeling the Filipino language, and on the inputs to Filipino NLP tasks, has focused on count-based DSMs. These models suffer from the curse of dimensionality, where the size of the vocabulary has a significant negative impact on the dimensions of the word vector representation. They also have difficulty representing rare but relevant words. Moreover, the elements of these vectors treat each word as a bare token with no other associated information. Recent developments in language modeling, and in NLP in general, suggest that neural network-based word embedding models outperform traditional count-based distributional models. In these neural network-based approaches, words are “embedded” into a low-dimensional vector space, with each word represented as an n-dimensional vector of real numbers. Relationships between word vectors are determined by the distance between them: vectors that are “closer” to each other should be more semantically related. How do these neural word embedding models handle morphologically rich languages, which tend to have large vocabulary sizes? Can they generalize to a less-resourced language using a smaller dataset? For low-resourced languages like Filipino, the lack of linguistic tools and resources, as well as of expert-annotated datasets, makes it difficult to apply state-of-the-art techniques directly. This research compared the improvements and limitations of count-based vector space models and neural word embedding models through five text classification problems. To this end, a Tagalog Wikipedia corpus was built and used to train word embeddings with GloVe and FastText.
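As an illustration of the embedding approach described in the abstract (a minimal sketch, not the study's actual pipeline), the following Python snippet trains FastText subword embeddings on a tokenized corpus and compares two word vectors by cosine similarity. It assumes the gensim library (4.x API); the corpus file name, example words, and hyperparameters are hypothetical placeholders rather than the settings used in the dissertation.

# Sketch: FastText subword embeddings + cosine similarity with gensim.
# File names, example words, and hyperparameters are illustrative only.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Each line of the corpus file is assumed to be one tokenized sentence.
corpus = LineSentence("tlwiki_tokenized.txt")  # hypothetical Tagalog Wikipedia text

model = FastText(
    sentences=corpus,
    vector_size=100,  # dimensionality n of each word vector
    window=5,         # context window defining a word's distribution
    min_count=5,      # ignore very rare tokens during vocabulary building
    sg=1,             # skip-gram training objective
)

# Words that are "closer" in the vector space should be more related.
print(model.wv.similarity("bahay", "tahanan"))  # e.g., "house" vs. "home"

# Because FastText composes vectors from character n-grams, it can also
# return vectors for rare or out-of-vocabulary inflected forms.
vec = model.wv["kabahayan"]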
Abstract Format
html
Language
English
Format
Electronic
Accession Number
CDTG008242
Keywords
Natural language processing (Computer science); Neural networks (Computer science)
Recommended Citation
Cheng, C. K. (2016). Building neural word embeddings using new media resources for text classification. Retrieved from https://animorepository.dlsu.edu.ph/etd_doctoral/1521
Embargo Period
3-12-2025