Use of word and character N-grams for low-resourced local languages
Added Title
International Conference on Asian Language Processing (2018)
IALP 2018
College
College of Computer Studies
Department/Unit
Computer Science
Document Type
Conference Proceeding
Source Title
Proceedings of the 2018 International Conference on Asian Language Processing, IALP 2018
First Page
250
Last Page
254
Publication Date
1-28-2019
Abstract
Language identification is a text classification task that identifies the language of a given text. Several works use it as a preprocessing step prior to sentiment analysis, mood analysis, and named entity recognition, among others. Building an accurate language identification engine is therefore important, given that the Philippines is home to more than 170 languages and has few language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for classification of Philippine languages. Results show that the Linear SVM model had the best performance, with an F1-score of 0.97. © 2018 IEEE.
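The abstract names word and character N-grams as features and Naive Bayes as one of the compared classifiers, but this record does not give the authors' implementation. The following is only a minimal sketch of the general idea, assuming lowercased text, character 1- to 3-grams, add-one smoothing, and a six-sentence toy corpus. The function names, the class `NaiveBayesLanguageID`, the n-gram ranges, and the sample sentences are all illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return character n-grams of length n_min..n_max (assumed range)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """Return whitespace-token n-grams of length n_min..n_max (assumed range)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayesLanguageID:
    """Hypothetical multinomial Naive Bayes over n-gram counts."""

    def __init__(self, featurize=char_ngrams):
        self.featurize = featurize
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()                 # label -> number of texts
        self.vocab = set()                          # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.feature_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.feature_counts[label]
            total = sum(counts.values())
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for feat in self.featurize(text):
                if feat in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[feat] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy corpus for illustration only; not the data used in the paper.
texts = [
    "magandang umaga sa inyong lahat",
    "salamat po sa tulong ninyo",
    "kumusta ka na kaibigan",
    "maayong buntag kaninyong tanan",
    "salamat kaayo sa imong tabang",
    "kumusta ka na higala",
]
labels = ["tagalog", "tagalog", "tagalog", "cebuano", "cebuano", "cebuano"]

model = NaiveBayesLanguageID(featurize=char_ngrams).fit(texts, labels)
print(model.predict("maayong"))
print(model.predict("magandang"))
```

Passing `featurize=word_ngrams` switches the same classifier to word-level features. Character n-grams tend to be the more forgiving choice on small corpora because closely related languages share many whole words but differ in affixes and spelling patterns. The Linear SVM and Random Forest models mentioned in the abstract are not covered by this sketch.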
Digital Object Identifier (DOI)
10.1109/IALP.2018.8629235
Recommended Citation
Regalado, R., Agarap, A., Baliber, R., Yambao, A., & Cheng, C. (2019). Use of word and character N-grams for low-resourced local languages. Proceedings of the 2018 International Conference on Asian Language Processing, IALP 2018, 250-254. https://doi.org/10.1109/IALP.2018.8629235
Disciplines
Computer Sciences
Keywords
Natural language processing (Computer science); Machine learning