A statistical feature extraction tool for mining short text data
Date of Publication
Master of Science in Computer Science
College of Computer Studies
Charibeth K. Cheng
Defense Panel Chair
Defense Panel Member
Nathalie Rose Lim-Cheng
Charibeth K. Cheng
The surging popularity and global influence of social media opens many doors for instantaneous information sharing, marketing opportunities, self-promotion, and crowdsourcing. With the good comes the bad: social media also faces criticism and problems such as cyberbullying, invasion of privacy, online harassment, and false reporting. For NLP practitioners, social media data is interesting because of its properties; in particular, it differs from traditional text data in that its texts are short. This research aims to create an easy-to-use tool for users beyond NLP practitioners, such as students, site administrators, blog owners, and others.
To make the feature extraction tool universal, the following issues were addressed: language independence, domain independence, data sparsity, informal grammar patterns, and feature selection/reduction. The tool was designed and built from separate modules, each handling one of these issues. It emphasizes extracting and building a data model that goes beyond the commonly used Bag-of-Words by utilizing the co-occurrence properties of the extracted features. Third-party performance evaluation surveys were conducted to evaluate the tool on the criteria of ease of use, efficiency, accuracy, usability, and completeness, and the respondents evaluated it favorably. For benchmarking and general testing, the tool was subjected to four experiments with varying domains, languages, numbers of classes, and data sources. All datasets used in the experiments were gathered from previous studies, and the results of each experiment were benchmarked against the results reported in the corresponding previous work. Overall, the tool produced higher accuracies in every experiment than the respective reported results.
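As an illustration only, the idea of extending Bag-of-Words with co-occurrence information might be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `tokenize` and `extract_features`, the whitespace tokenizer, and the choice to count each unordered word pair once per text are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def extract_features(text):
    """Return word counts plus co-occurrence pair features for one short text.

    Single words are keyed by the word string (plain Bag-of-Words).
    Co-occurring words are keyed by a sorted (word_a, word_b) tuple and are
    counted once per text. This pairing rule is an assumption made for the
    sketch, not a detail taken from the thesis.
    """
    tokens = tokenize(text)
    features = Counter(tokens)  # Bag-of-Words term frequencies

    # Every unordered pair of distinct words in the same text co-occurs.
    for pair in combinations(sorted(set(tokens)), 2):
        features[pair] += 1
    return features


if __name__ == "__main__":
    for name, value in extract_features("good movie good").items():
        print(name, value)
```

Because short texts contain few words, the number of pairs per text stays small, but across a corpus the pair features grow quickly, which is one reason a feature selection/reduction step would be needed on top of a scheme like this.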
Archives, The Learning Commons, 12F Henry Sy Sr. Hall
1 computer optical disc ; 4 3/4 in.
Social media; Electronic data processing--Data entry
Chan, O. (2015). A statistical feature extraction tool for mining short text data. Retrieved from https://animorepository.dlsu.edu.ph/etd_masteral/5049