AMHARIC-ENGLISH CODE-MIXED LANGUAGE IDENTIFICATION ON SOCIAL MEDIA USING MACHINE LEARNING AND DEEP LEARNING APPROACH

YETMWORK TESFAYE ASFAW

Files

Yetmwork_Tesfaye_LID_Research.pdf (3.66 MB)

Date

2024-01

Authors

YETMWORK TESFAYE ASFAW

Publisher

WOLKITE UNIVERSITY

Abstract

Identifying the languages used in social media text has become an active area of research in recent years. English once dominated social media communication, but code-mixed text is now prevalent in non-English-speaking states. In code-mixed data, a single sentence combines two different languages, which makes it challenging to determine the language used in the text. Language identification is therefore essential for processing such code-mixed sentences in language processing tasks. In this work, we applied several machine learning and deep learning techniques to identify the languages in Amharic-English code-mixed text from social media platforms such as Facebook, YouTube, TikTok, and Telegram. We prepared 5021 sentences from these platforms through scraping, cleaning, and filtering, followed by tokenization and POS tagging using NLTK. After tokenization, we manually labeled each word with a language tag and an Amharic POS tag, giving a total of 31,305 words. Each word in the Amharic-English code-mixed data was categorized as Amharic, English, Named Entity, or Universal. We trained our models with count-vector, TF-IDF, bigram and trigram, and word-to-vector features, and evaluated each model using accuracy, precision, recall, and F1-score. For user-friendly interaction, we saved the model and deployed it using the Flask web framework. The results show that SVM, logistic regression, and naïve Bayes achieve accuracies of 96%, 86%, and 86%, respectively, with TF-IDF and count-vector features, while decision tree, XGBoost, random forest, and MaxEnt achieve accuracies of 94%, 93%, 93%, and 87%, respectively, with word-to-vector features. LSTM, CNN, Bi-LSTM, and MLP achieve accuracies of 84%, 87%, 87%, and 87%, respectively. Overall, SVM is the best fit for our code-mixed language identification task. Among the deep learning algorithms, MLP performs best, with F1-scores of 0.90, 0.95, 0.64, and 0.55 for the Amharic, English, Named Entity, and Universal labels, respectively.
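
The abstract reports both overall accuracy and per-label F1-scores for the four labels (Amharic, English, Named Entity, Universal). As an illustration of how such word-level scores can be computed, the following is a minimal sketch in plain Python. The function names and the toy gold and predicted labels are assumptions made for illustration only; they are not the thesis data or the author's evaluation code.

```python
from collections import Counter

# The four word-level labels named in the abstract.
LABELS = ["Amharic", "English", "Named Entity", "Universal"]


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_label_scores(gold, predicted, labels=LABELS):
    """Compute precision, recall, and F1 for each label (one-vs-rest)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed
    scores = {}
    for label in labels:
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy labels for six tokens (hypothetical, for illustration only).
gold = ["Amharic", "Amharic", "English", "English", "Named Entity", "Universal"]
predicted = ["Amharic", "English", "English", "English", "Named Entity", "Amharic"]

scores = per_label_scores(gold, predicted)
print(f"accuracy = {accuracy(gold, predicted):.2f}")
for label in LABELS:
    s = scores[label]
    print(f"{label}: P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
```

Reporting F1 per label alongside accuracy matters when labels are unevenly represented: a model can reach high overall accuracy while scoring much lower on less frequent labels, which is consistent with the lower F1-scores the abstract reports for the Named Entity and Universal labels.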

Keywords

SVM, Code-Mixed, Language identification, MLP, Decision tree, Machine learning

Collections

Department of Governance and Development studies
