A hybrid approach to extracting the 5Ws in Filipino news articles

Date of Publication

2016

Document Type

Bachelor's Thesis

Degree Name

Bachelor of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Charibeth Chua

Defense Panel Member

Charibeth Cheng
Briane Paul Samson

Abstract/Summary

The goal of this research is to develop an information extraction system for Filipino news articles that extracts the 5Ws, namely sino (who), ano (what), kailan (when), saan (where), and bakit (why), and produces an output that can reduce the effort required for further data analysis. Utilizing the output of the information extraction system, an interface is provided to allow its users to view, search, and edit the extracted data in a structured format.

The information extraction system applies both rule-based and machine learning techniques, as well as various tools, in order to perform text processing, candidate selection, and feature extraction. The functions that fall under text processing include tokenization, sentence segmentation, named-entity recognition, part-of-speech tagging, and word scoring. Afterwards, rule-based candidate selection is performed by utilizing both the output of the text processing module and text markers. Subsequently, feature extraction is done through machine-learned candidate classification models for the who, when, and where features and rule-based algorithms for the what and why features.
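As a rough illustration of the rule-based candidate selection step described above, the following Python sketch segments sentences, tokenizes them, and collects capitalized token runs that follow a text marker. The marker table, the function names, and the sample sentence are all assumptions made for illustration; the abstract does not list the thesis's actual markers or rules, and the named-entity recognition, part-of-speech tagging, word scoring, and machine-learned classification stages are omitted here.

```python
import re

# Hypothetical marker table (assumption): common Filipino function words that
# often precede a person, a time expression, or a place. The actual markers
# used in the thesis are not given in the abstract.
MARKERS = {
    "who": {"si", "sina", "ni"},
    "when": {"noong", "nitong"},
    "where": {"sa"},
}


def split_sentences(text):
    """Naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def select_candidates(tokens):
    """Collect the run of capitalized tokens that directly follows a marker."""
    candidates = {label: [] for label in MARKERS}
    for i, tok in enumerate(tokens):
        for label, markers in MARKERS.items():
            if tok.lower() not in markers:
                continue
            span = []
            for nxt in tokens[i + 1:]:
                if nxt[:1].isupper():
                    span.append(nxt)
                else:
                    break
            if span:
                candidates[label].append(" ".join(span))
    return candidates


text = "Dumalaw si Maria Santos sa Maynila noong Lunes. Umulan nang malakas."
sentences = split_sentences(text)
candidates = select_candidates(tokenize(sentences[0]))
print(candidates)
```

In a fuller pipeline, candidates like these would then be passed to a classifier (for who, when, and where) that decides which candidate is the actual answer; that stage is left out of this sketch.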

Furthermore, the information extraction system was evaluated alongside the system in the research of Cagampan (2014) in order to compare the results against a similar system that extracts the same features. However, the system in Cagampan's research is optimized for Filipino editorials as opposed to news articles.

The proponents' system was able to achieve 63.3257% accuracy for 'who', 71.3768% accuracy for 'when', 58.2492% accuracy for 'where', 89.2% accuracy for 'what', and 50% accuracy for 'why'. In comparison to Cagampan's system, the 'who', 'where', and 'what' feature extraction modules of the proponents' system performed better.

Abstract Format

html

Language

English

Format

Electronic

Accession Number

CDTU022249

Shelf Location

Archives, The Learning Commons, 12F, Henry Sy Sr. Hall

Physical Description

1 computer optical disc ; 4 3/4 in.

Keywords

Text processing (Computer science); Natural language processing (Computer science); Information retrieval

This document is currently not available here.