AMHARIC-ENGLISH CODE-MIXED LANGUAGE IDENTIFICATION ON SOCIAL MEDIA USING MACHINE LEARNING AND DEEP LEARNING APPROACH
Date
2024-01
Publisher
WOLKITE UNIVERSITY
Abstract
In recent years, identifying languages in social media text has become an active area of research. English once dominated social media communication, but code-mixed text is now prevalent in non-English-speaking countries. In code-mixed data, a single sentence combines two different languages, which makes it difficult to determine the language of each part of the text. Language identification is therefore an essential step when processing such code-mixed sentences in language processing tasks. In this work, we employed different machine learning and deep learning techniques to identify the languages in Amharic-English code-mixed text from social media platforms such as Facebook, YouTube, TikTok, and Telegram. We prepared 5021 sentences from these platforms by applying scraping, cleaning, and filtration, followed by tokenization and POS tagging using NLTK. After tokenization, we manually labeled each word with a language tag and an Amharic POS tag, yielding a total of thirty-one thousand three hundred five (31,305) words. Each word in the Amharic-English code-mixed data was categorized as Amharic, English, Named Entity, or Universal. We trained our machine learning and deep learning models with count vector, TF-IDF vector, bigram and trigram, and word-to-vector features. After training, we evaluated the effectiveness of each model using accuracy, precision, recall, and F1-score, and, for user-friendly interaction, we saved the model and deployed it using the Flask web framework. The results show that SVM, logistic regression, and naïve Bayes achieve accuracies of 96%, 86%, and 86%, respectively, with TF-IDF and count vector features, while decision tree, XGBoost, random forest, and MaxEnt achieve accuracies of 94%, 93%, 93%, and 87%, respectively, with word-to-vector features. LSTM, CNN, Bi-LSTM, and MLP achieve accuracies of 84%, 87%, 87%, and 87%, respectively.
Overall, SVM is the best fit for our code-mixed language identification task. Among the deep learning models, MLP performs best, with F1-scores of 0.90, 0.95, 0.64, and 0.55 for the Amharic, English, Named Entity, and Universal labels, respectively.
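The per-label evaluation described in the abstract (accuracy plus precision, recall, and F1-score for each of the four language tags) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' evaluation code: the function name `per_label_scores`, the label spellings, and the toy tag sequences are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical spellings of the four word-level tags named in the abstract.
LABELS = ["Amharic", "English", "NamedEntity", "Universal"]


def per_label_scores(gold, pred, labels=LABELS):
    """Return (accuracy, {label: (precision, recall, f1)}) for two tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed

    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f_den = precision + recall
        f1 = 2 * precision * recall / f_den if f_den else 0.0
        scores[label] = (precision, recall, f1)

    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    return accuracy, scores


# Toy, made-up tags for six tokens (not data from the study).
gold = ["Amharic", "Amharic", "English", "English", "NamedEntity", "Universal"]
pred = ["Amharic", "English", "English", "English", "NamedEntity", "Universal"]

accuracy, scores = per_label_scores(gold, pred)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting F1 per label, rather than accuracy alone, makes visible the kind of gap the abstract reports between the frequent Amharic and English tags and the rarer Named Entity and Universal tags.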
Keywords
SVM, Code-Mixed, Language identification, MLP, Decision tree, Machine learning