Measuring language similarity using trigrams: Limitations of language identification

College

College of Computer Studies

Department/Unit

Computer Technology

Document Type

Conference Proceeding

Source Title

2013 International Conference on Recent Trends in Information Technology, ICRTIT 2013

First Page

478

Last Page

481

Publication Date

1-1-2013

Abstract

Computational approaches in language identification often result in a high number of false positives and low recall rates, especially if the languages involved come from the same subfamily. In this paper, we aim to determine the cause of this problem by measuring language similarity through trigrams. Religious and literary texts were used as training data. Our experiments involving language identification show that the number of common trigrams for a given language pair is inversely proportional to precision and recall rates, whereas the average word length is directly proportional to the number of true positives. Future directions include improving language modeling and providing an approach to increase precision and recall. © 2013 IEEE.
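
The two quantities named in the abstract, the number of common trigrams for a language pair and the average word length, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say whether the trigrams are character-level or word-level, how word boundaries are handled, or how the counts are normalized, so the sketch assumes character trigrams taken within words padded with an underscore. The function names `char_trigrams`, `common_trigram_count`, and `average_word_length` are hypothetical.

```python
from collections import Counter


def char_trigrams(text):
    """Count character trigrams inside each whitespace-separated word.

    Assumption (not from the paper): words are lowercased and padded
    with "_" so that word-initial and word-final trigrams are counted.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def common_trigram_count(text_a, text_b):
    """Number of distinct trigrams that occur in both texts."""
    return len(char_trigrams(text_a).keys() & char_trigrams(text_b).keys())


def average_word_length(text):
    """Mean length of whitespace-separated words (0.0 for empty text)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


# Toy strings for illustration only; they are not data from the paper.
shared = common_trigram_count("aba", "abc")
mean_len = average_word_length("ab abcd")
```

With real training corpora, one such count would be computed per language pair and compared against the precision and recall obtained for that pair.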

Digital Object Identifier (DOI)

10.1109/ICRTIT.2013.6844250

Disciplines

Computer Sciences

Keywords

Computational linguistics; Philippine languages—Data processing; Similarity (Language learning)
