A statistical feature extraction tool for mining short text data

Date of Publication


Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

Subject Categories

Computer Sciences


College of Computer Studies


Computer Science

Thesis Adviser

Charibeth K. Cheng

Defense Panel Chair

Ethel Ong

Defense Panel Member

Nathalie Rose Lim-Cheng
Charibeth K. Cheng


The surging global popularity and influence of social media open many doors for instantaneous information sharing, marketing, self-promotion, and crowdsourcing. With the good comes the bad: social media also faces criticism and issues such as cyberbullying, invasion of privacy, online harassment, and false reporting. For NLP practitioners, social media data is interesting because of its properties; it differs from traditional text data in that it consists of short texts. This research aims to create an easy-to-use tool for users beyond NLP practitioners, such as students, site administrators, and blog owners.

To make the feature extraction tool universal, the following issues were addressed: language independence, domain independence, data sparsity, informal grammar patterns, and feature selection/reduction. The tool was designed and built with different modules to handle each of these issues. It emphasizes extracting features and building a data model that goes beyond the commonly used Bag-of-Words by utilizing the co-occurrence properties of the extracted features. Third-party performance evaluation surveys were conducted to assess the tool on ease of use, efficiency, accuracy, usability, and completeness, and the respondents evaluated it favorably. For benchmarking and general testing, the tool was subjected to four experiments with varying domains, languages, numbers of classes, and data sources. All datasets used in the experiments were gathered from previous studies, and the results of each experiment were benchmarked against those reported in the corresponding previous work. Overall, the tool achieved higher accuracy in each experiment than the respective reported results.
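
To illustrate the difference between a plain Bag-of-Words model and one that also uses co-occurrence properties, here is a minimal Python sketch. It is not the tool described in this thesis. The function names (`bag_of_words`, `cooccurrence_features`, `extract`), the whitespace tokenizer, and the `a&b` pair naming are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(tokens):
    """Count how often each token appears in one short text."""
    return Counter(tokens)


def cooccurrence_features(tokens):
    """Record each unordered pair of distinct tokens found in the same text."""
    unique = sorted(set(tokens))
    return Counter(combinations(unique, 2))


def extract(text):
    """Hypothetical extractor: lowercase, split on whitespace, merge both feature sets."""
    tokens = text.lower().split()
    features = {}
    for token, count in bag_of_words(tokens).items():
        features[token] = count
    for (a, b), count in cooccurrence_features(tokens).items():
        # Pair features are named "a&b"; this naming is an assumption of the sketch.
        features[a + "&" + b] = count
    return features


posts = ["so happy today", "not happy today"]
vectors = [extract(p) for p in posts]
```

In this sketch the two posts share the unigrams `happy` and `today`, but only the second contains pair features such as `happy&not`, which a pure Bag-of-Words representation cannot express. For short texts, where few words are available per document, such pair features are one plausible way to add signal, at the cost of a larger feature space that then needs selection or reduction.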

Abstract Format






Accession Number


Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

1 computer optical disc ; 4 3/4 in.


Social media; Electronic data processing--Data entry
