"Building neural word embeddings using new media resources for text cl" by Charibeth Ko Cheng

Building neural word embeddings using new media resources for text classification

Date of Publication

9-9-2016

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Adviser

Merlin Teodosia C. Suarez

Defense Panel Chair

Arnulfo P. Azcarraga

Defense Panel Member

Joel P. Ilao
Susan Pancho Festin
Rachel Edita O. Roxas
Merlin Teodosia C. Suarez

Abstract/Summary

For computers to understand the meaning of human language, that meaning must be represented in the language of computers. Distributional semantic models (DSMs) assume that the meaning of a word can be inferred from its distributional properties in text. Through statistical analysis of the contexts in which a word occurs, these models dynamically build semantic representations in the form of high-dimensional vector spaces. This study shows how different DSMs improve classification performance. Existing work on modeling the Filipino language, and on the inputs to Filipino NLP tasks, has focused on count-based DSMs. These models suffer from the curse of dimensionality: the size of the vocabulary has a significant negative impact on the dimensionality of the word vector representation. They also have difficulty representing rare but relevant words. Moreover, the elements of these vectors treat each word as a bare token with no other associated information. Recent developments in language modeling, and in NLP in general, suggest that neural network-based word embedding models outperform traditional count-based distributional models. In these neural network-based approaches, words are "embedded" into a low-dimensional vector space, with each word represented as an n-dimensional vector of real numbers. Relationships between word vectors are determined through the distance between them: vectors that are "closer" to each other should be more semantically related. How do these neural word embedding models handle morphologically rich languages, which tend to have large vocabulary sizes? Can they generalize for a less-resourced language using a smaller dataset? For low-resourced languages like Filipino, the lack of linguistic tools and resources, as well as of expert-annotated datasets, makes it difficult to apply state-of-the-art techniques directly. This research compared the improvements and limitations of count-based vector space models and neural word embedding models through five text classification problems. Consequently, a Tagalog Wikipedia corpus was built and used to train word embeddings using GloVe and FastText.
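As a minimal illustration of the neural embedding approach described in the abstract (not the dissertation's actual pipeline), the sketch below trains FastText vectors on a tokenized Tagalog corpus using the gensim library and compares words by cosine similarity. The corpus path, example words, and hyperparameters are placeholder assumptions.

    # Minimal sketch (assumptions: gensim >= 4.x is installed and
    # tagalog_wiki.txt holds one whitespace-tokenized sentence per line;
    # hyperparameters are illustrative, not those used in the dissertation).
    from gensim.models import FastText

    def read_corpus(path):
        # Yield each sentence as a list of lowercase tokens.
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.strip().lower().split()
                if tokens:
                    yield tokens

    sentences = list(read_corpus("tagalog_wiki.txt"))

    # FastText builds word vectors from character n-grams, so it can compose
    # representations for rare or unseen word forms -- useful for a
    # morphologically rich language such as Filipino.
    model = FastText(
        sentences=sentences,
        vector_size=100,   # dimensionality n of each word vector
        window=5,          # context window defining co-occurrence
        min_count=2,       # ignore extremely rare tokens
        epochs=10,
    )

    # "Closer" vectors should be more semantically related.
    print(model.wv.similarity("paaralan", "eskwela"))   # cosine similarity
    print(model.wv.most_similar("paaralan", topn=5))    # nearest neighbours

The resulting vectors could then serve as input features to a downstream text classifier, which is the setting the study evaluates.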

Abstract Format

html

Language

English

Format

Electronic

Accession Number

CDTG008242

Keywords

Natural language processing (Computer science); Neural networks (Computer science)

Embargo Period

3-12-2025
