Analyzing Filipino editorials through information extraction and sentiment analysis

Date of Publication

2015

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Computer Science

Abstract/Summary

This research aims to make Filipino editorials easier to analyze by applying information extraction and sentiment analysis to them. Information extraction was guided by rules based on the researchers' observations and was automated through bootstrapping. The extracted attributes are the Tagalog equivalents of the 5W user requirement proposed by Das et al. (2012): sino (who), ano (what), kailan (when), saan (where), and bakit (why). Comparative experiments on sentiment analysis were also conducted using machine learning and lexicon-based approaches. Both information extraction and sentiment analysis were performed at the paragraph level, and the combined results were presented visually. In developing the visualization, several factors were considered, including how well the end user would be able to comprehend the important points of each editorial and its overall sentiment. The three main components of the research process, namely information extraction, sentiment analysis, and result visualization, were evaluated both objectively and subjectively.

To evaluate the performance of rule-based information extraction, a gold standard was built against which the machine output was compared. The approach performed below average in extracting the ano, sino, and saan features, with correctness percentages of 0%, 6.06%, and 19.51%, respectively. It performed at an average level on the bakit feature, with 50% correct extraction. The highest result recorded was 84.39%, for kailan feature extraction. The performance of lexicon-based and machine learning-based sentiment analysis was also compared. Machine learning-based sentiment analysis performed better on bigger data sets, attaining a classification accuracy of 80.98% compared with the 61% accuracy of the lexicon-based approach. The lexicon-based approach also showed its potential, obtaining an accuracy of 87.71% against the 70.5% accuracy of the machine learning-based approach on a balanced data set with only a few instances. The visualization elements that represented the output of the two major processes of this research were evaluated to be appropriate representations. The visualization system was also subjectively rated as easy to use and understand.

Abstract Format

html

Language

English

Format

Electronic

Accession Number

CDTG005926

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

1 computer optical disc ; 4 3/4 in.

