A hybrid approach to extracting the 5Ws in Filipino news articles
Date of Publication
2016
Document Type
Bachelor's Thesis
Degree Name
Bachelor of Science in Computer Science
Subject Categories
Computer Sciences
College
College of Computer Studies
Department/Unit
Computer Science
Thesis Adviser
Charibeth Chua
Defense Panel Member
Charibeth Cheng
Briane Paul Samson
Abstract/Summary
The goal of this research is to develop an information extraction system for Filipino news articles that extracts the 5Ws, namely sino (who), ano (what), kailan (when), saan (where), and bakit (why), and produces output that reduces the effort required for further data analysis. Using the output of the information extraction system, an interface allows users to view, search, and edit the extracted data in a structured format.
The information extraction system applies both rule-based and machine learning techniques as well as various tools in order to perform text processing, candidate selection, and feature extraction. The functions that fall under text processing include tokenization, sentence segmentation, named-entity recognition, part-of-speech tagging, and word scoring. Afterwards, rule-based candidate selection is performed by utilizing both the output of the text processing module as well as text markers. Subsequently, feature extraction is done through both machine-learned candidate classification models for the who, when, and where features and rule-based algorithms for the what and why features.
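As a rough illustration of the text processing and marker-based candidate selection stages described above, the following is a minimal sketch and not the proponents' implementation. The marker lists, the function names, and the capitalization rule are all assumptions made for this example; the abstract does not specify the actual markers or rules. Named-entity recognition, part-of-speech tagging, word scoring, and the machine-learned classification step are omitted.

```python
import re

# Hypothetical text markers, chosen only for illustration.
# The thesis's actual marker lists are not given in the abstract.
WHO_MARKERS = {"si", "sina"}
WHEN_MARKERS = {"noong", "nitong"}
WHERE_MARKERS = {"sa"}


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Naive tokenization: alphanumeric runs, allowing internal hyphens or apostrophes."""
    return re.findall(r"[A-Za-zÑñ0-9]+(?:[-'][A-Za-zÑñ0-9]+)*", sentence)


def select_candidates(tokens, markers, max_len=3):
    """Rule-based candidate selection (assumed rule): after each marker,
    collect up to max_len consecutive capitalized or numeric tokens."""
    candidates = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in markers:
            continue
        span = []
        for nxt in tokens[i + 1 : i + 1 + max_len]:
            if nxt[0].isupper() or nxt.isdigit():
                span.append(nxt)
            else:
                break
        if span:
            candidates.append(" ".join(span))
    return candidates


def extract_candidates(text):
    """Run segmentation, tokenization, and candidate selection per sentence."""
    result = {"who": [], "when": [], "where": []}
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        result["who"] += select_candidates(tokens, WHO_MARKERS)
        result["when"] += select_candidates(tokens, WHEN_MARKERS)
        result["where"] += select_candidates(tokens, WHERE_MARKERS)
    return result


if __name__ == "__main__":
    sample = ("Dumating si Pangulong Aquino sa Davao City noong Lunes. "
              "Nagsalita siya tungkol sa ekonomiya.")
    print(extract_candidates(sample))
```

In a hybrid system of the kind described, candidates such as these would then be passed to a trained classifier for the who, when, and where features, rather than being accepted directly.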
Furthermore, the information extraction system was evaluated alongside the system in the research of Cagampan (2014) in order to compare the results against a similar system that extracts the same features. However, the system in Cagampan's research is optimized for Filipino editorials as opposed to news articles.
The proponents' system was able to achieve 63.3257% accuracy for 'who', 71.3768% accuracy for 'when', 58.2492% accuracy for 'where', 89.2% accuracy for 'what', and 50% accuracy for 'why'. In comparison to Cagampan's system, the 'who', 'where', and 'what' feature extraction modules of the proponents' system performed better.
Abstract Format
html
Language
English
Format
Electronic
Accession Number
CDTU022249
Shelf Location
Archives, The Learning Commons, 12F, Henry Sy Sr. Hall
Physical Description
1 computer optical disc ; 4 3/4 in.
Keywords
Text processing (Computer science); Natural language processing (Computer science); Information retrieval
Recommended Citation
Chua, J. L., Livelo, E. S., Ver, A. O., & Yao, J. S. (2016). A hybrid approach to extracting the 5Ws in Filipino news articles. Retrieved from https://animorepository.dlsu.edu.ph/etd_bachelors/11501