Information extraction for elegislation

Date of Publication

2010

Document Type

Bachelor's Thesis

Degree Name

Bachelor of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Allan Borra

Defense Panel Member

Charibeth Cheng

Rachel Roxas

Abstract/Summary

Information extraction (IE) is the process of transforming unstructured information in documents into a structured database. This technology allows narrower search results over documents stored in a Document Management System (DMS). An IE system was developed to augment a Blue Ribbon Committee (BRC) DMS for the eParticipation Project. IE architectures were studied and related tools were identified to develop the IE system specifically for the BRC. The IE system is composed of 7 minor modules, namely Sentence Splitter, Tokenizer, Cross Reference, Part of Speech Tagger, Unknown Word, Named Entity Recognition and Preparser; 3 major modules, which are Semantic Tagger, CoReference Resolution and Template Filler; and 2 external modules, which are the Search and Evaluation modules. With the help of, and constant communication with, the Blue Ribbon Committee, the research was able to gather documents that helped in creating the system. The output is created and extracted based on the preference of the client, and the system already meets the standards requested by the Blue Ribbon Committee. Overall, the system showed favorable results in the actual testing phase, with a result of 95.42%; when the initial format of the documents was followed, the result of the system would be 100% accurate. When the system was presented to the main stakeholders, they remarked that what they had seen was already beyond their expectations and that they were very pleased with the outcome.
There are still parts of the system that could be improved: training the values of the POS Tagger and the Named Entity Recognition modules from the documents being fed in, updating the library used to open Word document files, adding documents and templates to the system's process, adding image recognition to the system, updating the web crawler for more sources, and improving the search ranking algorithm.
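
The modular design described in the abstract can be pictured as a chain of stages that each enrich a shared document record. The following is only a minimal sketch under stated assumptions, not the thesis implementation: the function names, the dictionary layout, the regex rules, and the sample sentence are all hypothetical, only four of the named modules are imitated, and the abstract does not state the actual execution order of the modules.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    # Toy sentence splitter: break after '.', '!' or '?' followed by whitespace.
    text = doc["text"].strip()
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return doc


def tokenize(doc: Document) -> Document:
    # Toy tokenizer: words and single punctuation marks become separate tokens.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


def tag_entities(doc: Document) -> Document:
    # Hypothetical stand-in for named entity recognition:
    # a run of capitalized words that ends in "Committee".
    pattern = re.compile(r"(?:[A-Z][a-z]+\s)+Committee")
    doc["entities"] = [m.group(0) for m in pattern.finditer(doc["text"])]
    return doc


def fill_template(doc: Document) -> Document:
    # Hypothetical template: the structured record produced from the text.
    doc["template"] = {
        "committees": sorted(set(doc["entities"])),
        "sentence_count": len(doc["sentences"]),
    }
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    # Pass one document record through every stage in order.
    doc: Document = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


# Assumed ordering, for illustration only.
STAGES: List[Stage] = [split_sentences, tokenize, tag_entities, fill_template]

if __name__ == "__main__":
    sample = "Yesterday the Blue Ribbon Committee met. It heard two witnesses."
    print(run_pipeline(sample, STAGES)["template"])
```

Keeping each module as a function over the same record means that a module such as a part-of-speech tagger or coreference resolver could be added to the list without changing the others, which is one plausible reading of the minor/major module split.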

Abstract Format

html

Language

English

Format

Print

Accession Number

TU19863

Shelf Location

Archives, The Learning Commons, 12F, Henry Sy Sr. Hall

Physical Description

1 v. (various foliations) : illustrations (some colored) ; 28 cm.

Keywords

Text processing (Computer science); Natural language processing (Computer science); Database management

This document is currently not available here.
