Dice's coefficient on trigram profiles as metric for language similarity
College
College of Computer Studies
Department/Unit
Computer Technology
Document Type
Conference Proceeding
Source Title
2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation, O-COCOSDA/CASLRE 2013
Publication Date
12-1-2013
Abstract
In this study, we present Dice's coefficient on trigram profiles as a metric for language similarity. As a testbed, we focused on eight Philippine languages, for which no known language similarity values exist. Documents containing transcribed audio recordings, news articles, and religious and literary texts were taken from an online corpus and used as training data. Character trigram profiles were then generated using an n-gram generator, and language similarity was computed. The results were matched against those reported in the literature and against the language family tree. To evaluate the metric, it was applied to five languages with known similarity values, and the results were compared with an existing lexical similarity metric. The average difference is 27%. Analyses of the results reveal that phonetic spelling plays an important role in language similarity. As future work, the metric can be used on phonetic transcriptions. © 2013 IEEE.
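The computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `trigram_profile` and `dice_coefficient` are hypothetical, and the sketch assumes that a profile is the set of distinct character trigrams in lowercased, whitespace-normalized text. The abstract does not state the actual preprocessing, whether trigram frequencies are used, or whether profiles are truncated.

```python
def trigram_profile(text: str) -> set[str]:
    """Return the set of distinct character trigrams in the text.

    Assumption: the text is lowercased and runs of whitespace are
    collapsed to a single space before trigrams are extracted.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def dice_coefficient(profile_a: set[str], profile_b: set[str]) -> float:
    """Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    total = len(profile_a) + len(profile_b)
    if total == 0:
        # Two empty profiles share nothing measurable; report 0.0 by convention.
        return 0.0
    return 2 * len(profile_a & profile_b) / total


# Toy strings, not real language data: "abcd" yields {abc, bcd} and
# "bcde" yields {bcd, cde}, so one trigram out of 2 + 2 is shared.
similarity = dice_coefficient(trigram_profile("abcd"), trigram_profile("bcde"))
print(similarity)
```

Under these assumptions the score ranges from 0 (no shared trigrams) to 1 (identical profiles). In practice each profile would be built from a large training corpus per language rather than from a short string.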
Digital Object Identifier (DOI)
10.1109/ICSDA.2013.6709892
Recommended Citation
Oco, N., Syliongka, L., Roxas, R., & Ilao, J. P. (2013). Dice's coefficient on trigram profiles as metric for language similarity. 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation, O-COCOSDA/CASLRE 2013. https://doi.org/10.1109/ICSDA.2013.6709892
Disciplines
Computer Sciences
Keywords
Computational linguistics; Similarity (Language learning); Philippine languages—Data processing