Visualization and analysis of document clusters produced by self-organizing maps

Date of Publication

2013

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Arnulfo Azcarraga

Abstract/Summary

Information overload caused by the huge number of available text documents makes them increasingly difficult to organize and analyze. To alleviate this problem, text document clustering is used to automatically group related documents together. However, documents usually produce very high-dimensional data, which makes processing them resource-intensive. The Random Projection Method (RPM) is shown to reduce the dimensionality of a large document dataset. This dimensionality reduction scheme is then coupled with Self-Organizing Maps (SOM) to organize the documents in the dataset. K-Means clustering is then performed on the SOM units to produce clusters of the documents organized within the SOM. Various SOM-based properties were introduced, along with a method to measure and visualize them. These allowed for detailed analysis of the clusters and aided in finding outliers in the dataset, overlap between clusters, concentration of documents within clusters, possible subclusters, and the quality of different parts of clusters, among others. Cross-referencing between different property visualizations provided internal validation of the observations. For future work, the different SOM-based properties and their visualizations can be used for interactive document selection, recommendation systems, and quality measurement.
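
The pipeline outlined in the abstract (random projection, then SOM training, then K-Means on the SOM units) can be illustrated with a minimal sketch. This is not the thesis implementation: every function name, the Gaussian projection matrix, the linearly decaying learning rate and neighbourhood width, and all default parameters are assumptions chosen only for illustration, and the thesis's SOM-based properties and visualizations are not reproduced here.

```python
import math
import random


def random_projection(vectors, target_dim, rng):
    """Project each vector onto target_dim random Gaussian directions (assumed scheme)."""
    source_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(v[i] * matrix[i][j] for i in range(source_dim))
             for j in range(target_dim)] for v in vectors]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, candidates):
    """Index of the candidate closest to point."""
    return min(range(len(candidates)), key=lambda k: sq_dist(point, candidates[k]))


def train_som(data, rows, cols, epochs, rng):
    """Online SOM training on a rows x cols grid with a Gaussian neighbourhood."""
    units = [list(rng.choice(data)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)  # learning rate decays linearly
            sigma = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5  # neighbourhood shrinks
            bmu = nearest(x, units)  # best-matching unit
            for k, unit in enumerate(units):
                h = lr * math.exp(-sq_dist(coords[k], coords[bmu]) / (2.0 * sigma ** 2))
                for j in range(len(unit)):
                    unit[j] += h * (x[j] - unit[j])
            step += 1
    return units


def kmeans(points, k, iterations, rng):
    """Plain K-Means; returns one cluster label per point."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_documents(vectors, target_dim=8, rows=4, cols=4, k=2, seed=0):
    """Reduce dimensionality, organize with a SOM, then cluster the SOM units."""
    rng = random.Random(seed)
    reduced = random_projection(vectors, target_dim, rng)
    units = train_som(reduced, rows, cols, epochs=20, rng=rng)
    unit_labels = kmeans(units, k, iterations=20, rng=rng)
    # Each document inherits the cluster of its best-matching SOM unit.
    return [unit_labels[nearest(doc, units)] for doc in reduced]
```

In this sketch, `cluster_documents` takes document vectors (for example, term-frequency rows) and returns one cluster label per document. K-Means runs on the SOM unit weight vectors rather than on the documents themselves, which mirrors the ordering of steps described in the abstract.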

Abstract Format

html

Language

English

Format

Print

Accession Number

TG05337

Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

vii, 66 leaves ; 28 cm.

Keywords

Document clustering; Cluster analysis
