Using unsupervised techniques and manual analysis: A framework for discovering themes from social media posts

Date of Publication


Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science


College of Computer Studies


Computer Science

Thesis Adviser

Charibeth K. Cheng

Defense Panel Chair

Merlin Teodosia C. Suarez

Defense Panel Member

Allan Borra


Given the role of social media in modern society, it is imperative that data from these sources be organized so that they can be properly utilized. Current technologies rely on supervised learning approaches that require the development of training data. However, for these training data to be useful, accurate or expert knowledge is often required. As an alternative to manual approaches, which are impractical and uneconomical, social scientists use Natural Language Processing (NLP) as a guide to derive themes from a dataset. However, these automatic approaches are either biased toward frequently occurring terms or do not provide enough information to aid experts. Given these constraints, a framework that combines unsupervised methods with manual topic extraction is presented.

For this research, the data gathered from related studies (Meier, 2012a; Meier, 2012b; Pablo, Oco, Cheng, Roldan, & Roxas, 2014) are first preprocessed and represented using the bag-of-words representation and the TF-IDF weighting scheme. The entire dataset then undergoes feature reduction to reduce the dimensionality of the vector space. Next, k-means clustering (k = 3, 5, and 8) is used to organize the data into categories. The silhouette coefficients of the clusters indicate that the clustering suffers from the high dimensionality of the features. Furthermore, because the clusters produced by unsupervised methods are unlabeled, content analysis using open coding is performed. Evaluation of the assigned labels yielded an agreement rate of 41.5%, while analysis of the results shows different types of cluster behavior: (1) multi-clustered themes, (2) consistent clusters, (3) multi-topic clusters, (4) language clusters, and (5) dispersing clusters. As future work, an improved preprocessing technique could be used for the clustering, and other values of k could be explored.
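The clustering stage described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the thesis implementation. The toy posts, the `min_df` document-frequency filter (standing in for the unspecified feature reduction step), the particular TF-IDF variant, and the fixed random seed are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical toy posts; the actual thesis data came from the cited studies.
POSTS = [
    "flood water rising in the city",
    "flood rescue needed near river",
    "river water level rising fast",
    "donate food and water for victims",
    "food donation drive for flood victims",
    "relief goods donation center open",
    "power outage in the city tonight",
    "no power and no signal tonight",
    "signal lost after power outage",
    "rescue team needed near the bridge",
    "bridge closed due to rising river",
    "relief center needs food volunteers",
]

def tfidf_vectors(posts, min_df=2):
    """Bag-of-words with TF-IDF weights, keeping terms found in >= min_df posts."""
    docs = [p.lower().split() for p in posts]
    df = Counter(term for d in docs for term in set(d))
    # Assumed feature reduction: drop terms that occur in fewer than min_df posts.
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    vectors = []
    for d in docs:
        counts = Counter(d)
        vec = [counts[t] / len(d) * math.log(len(docs) / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, seed=0, max_iter=100):
    """Plain k-means; initial centroids are k randomly sampled points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(vectors, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for v, l in zip(vectors, labels):
        clusters.setdefault(l, []).append(v)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; 0 used as a placeholder
    scores = []
    for v, l in zip(vectors, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(v, o) for o in own) / (len(own) - 1)
        b = min(sum(dist(v, o) for o in other) / len(other)
                for c, other in clusters.items() if c != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

vocab, vectors = tfidf_vectors(POSTS)
results = {}
for k in (3, 5, 8):  # the values of k reported in the abstract
    labels = kmeans(vectors, k)
    results[k] = (labels, silhouette(vectors, labels))
    print(k, labels, round(results[k][1], 3))
```

Silhouette values near 0 suggest overlapping clusters, which is one way the high-dimensionality problem noted in the abstract might show up. The cluster labels themselves carry no meaning, which is why a manual step such as open coding is still needed to name the themes.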

Abstract Format






Accession Number


Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

leaves ; 4 3/4 in.
