A statistical feature extraction tool for mining short text data
Date of Publication
2015
Document Type
Master's Thesis
Degree Name
Master of Science in Computer Science
Subject Categories
Computer Sciences
College
College of Computer Studies
Department/Unit
Computer Science
Thesis Adviser
Charibeth K. Cheng
Defense Panel Chair
Ethel Ong
Defense Panel Member
Nathalie Rose Lim-Cheng
Charibeth K. Cheng
Abstract/Summary
The surging global popularity and influence of social media opens many doors for instantaneous information sharing, marketing, self-promotion and crowdsourcing. With the good comes the bad: social media also faces criticism and issues such as cyberbullying, invasion of privacy, online harassment and false reporting. For NLP practitioners, social media data is interesting because of its properties; it differs from traditional text data because its texts are short. This research aims to create an easy-to-use feature extraction tool for users beyond NLP practitioners, such as students, site administrators and blog owners.
To make a universal feature extraction tool, the following issues were addressed: language independence, domain independence, data sparsity, informal grammar patterns and feature selection/reduction. The tool was designed and built as a set of modules, each handling one of these issues. It emphasizes extracting and building a data model that goes beyond the commonly used Bag-of-Words by using the co-occurrence properties of the extracted features. Third-party performance evaluation surveys were conducted to evaluate the tool on the criteria of ease of use, efficiency, accuracy, usability and completeness. The surveys yielded a favorable evaluation from the respondents. For benchmarking and general testing, the tool was subjected to four experiments with varying domains, languages, numbers of classes and data sources. All the datasets used in the experiments were gathered from previous studies, and the results of each experiment were benchmarked against the results reported in the corresponding previous work. Overall, the tool produced higher accuracy in each experiment than the corresponding previously reported results.
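As a rough illustration of what a model "beyond Bag-of-Words" built on co-occurrence might look like, the sketch below counts both single tokens and unordered pairs of tokens that appear together in the same short text. It is not the thesis's implementation: the function names (`tokenize`, `extract_features`, `cooccurrence_counts`), the whitespace tokenizer and the choice of whole-text pair counting are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase + whitespace splitting, chosen only because it
    # needs no language-specific resources.
    return text.lower().split()


def extract_features(text):
    """Return (unigram counts, co-occurring token pairs) for one short text."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)  # plain Bag-of-Words counts
    # Each unordered pair of distinct tokens in the text is counted once.
    # Sorting the vocabulary makes the pair keys deterministic.
    vocab = sorted(set(tokens))
    pairs = Counter(combinations(vocab, 2))
    return unigrams, pairs


def cooccurrence_counts(texts):
    """Count, over a corpus, how many texts contain each token pair."""
    totals = Counter()
    for text in texts:
        _, pairs = extract_features(text)
        totals.update(pairs)
    return totals


if __name__ == "__main__":
    corpus = ["good phone good price", "good price"]
    for pair, count in cooccurrence_counts(corpus).most_common():
        print(pair, count)
```

Pair features of this kind add some context that single-token counts lack, which is one common way to offset the sparsity of short texts; the number of pairs grows quickly with text length, which is why such a model usually needs the kind of feature selection/reduction step the abstract mentions.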
Abstract Format
html
Language
English
Format
Electronic
Accession Number
CDTG006546
Shelf Location
Archives, The Learning Commons, 12F Henry Sy Sr. Hall
Physical Description
1 computer optical disc ; 4 3/4 in.
Keywords
Social media; Electronic data processing--Data entry
Recommended Citation
Chan, O. L. (2015). A statistical feature extraction tool for mining short text data. Retrieved from https://animorepository.dlsu.edu.ph/etd_masteral/5049