AMHARIC-ENGLISH CODE-MIXED LANGUAGE IDENTIFICATION ON SOCIAL MEDIA USING MACHINE LEARNING AND DEEP LEARNING APPROACH

YETMWORK TESFAYE ASFAW

dc.contributor.author	YETMWORK TESFAYE ASFAW
dc.date.accessioned	2024-04-03T08:04:23Z
dc.date.available	2024-04-03T08:04:23Z
dc.date.issued	2024-01
dc.description.abstract	In recent years, language identification in social media text has become an active area of research. English once dominated social media communication, but code-mixed text is now prevalent in non-English-speaking states. In code-mixed data, a single sentence combines two different languages, which makes it challenging to determine the language used in the text. Language identification is therefore an essential step in processing code-mixed sentences. In this work, we applied several machine learning and deep learning techniques to identify the languages in Amharic-English code-mixed text from social media platforms such as Facebook, YouTube, TikTok, and Telegram. We prepared 5,021 sentences from these platforms by scraping, cleaning, and filtering, followed by tokenization and POS tagging with NLTK. After tokenization, we manually labeled each word with a language tag and an Amharic POS tag, giving a total of 31,305 labeled words. Each word was categorized as Amharic, English, Named Entity, or Universal. We trained the models on count vectors, TF-IDF vectors, bigram and trigram features, and word-to-vector embeddings. After training, we evaluated each model using accuracy, precision, recall, and F1-score, and for user-friendly interaction we saved the model and deployed it with the Flask web framework. The results show that SVM, logistic regression, and naïve Bayes achieve accuracies of 96%, 86%, and 86% with TF-IDF and count vectors, respectively, and that decision tree, XGBoost, random forest, and MaxEnt achieve accuracies of 94%, 93%, 93%, and 87% with word-to-vector features, respectively. LSTM, CNN, Bi-LSTM, and MLP reach accuracies of 84%, 87%, 87%, and 87%, respectively. Overall, SVM is the best fit for our code-mixed language identification task. Among the deep learning models, MLP performs best, with F1-scores of 0.90, 0.95, 0.64, and 0.55 for the Amharic, English, Named Entity, and Universal labels, respectively.	en_US
dc.description.sponsorship	Wolkite University	en_US
dc.language.iso	en	en_US
dc.publisher	WOLKITE UNIVERSITY	en_US
dc.subject	SVM	en_US
dc.subject	Code-Mixed	en_US
dc.subject	Language identification	en_US
dc.subject	MLP, Decision tree	en_US
dc.subject	Machine learning	en_US
dc.title	AMHARIC-ENGLISH CODE-MIXED LANGUAGE IDENTIFICATION ON SOCIAL MEDIA USING MACHINE LEARNING AND DEEP LEARNING APPROACH	en_US
dc.type	Thesis	en_US
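
The abstract reports accuracy together with per-label precision, recall, and F1-score for the four word-level labels. As a rough illustration of how such figures are computed, here is a minimal sketch in plain Python. The tag names (AM, EN, NE, UNIV), the function names, and the toy gold/predicted sequences are assumptions made for this example; they are not taken from the thesis, and the numbers they produce are unrelated to its reported results.

```python
from collections import Counter

# Hypothetical short tag names for the four labels named in the abstract.
LABELS = ["AM", "EN", "NE", "UNIV"]


def accuracy(gold, pred):
    """Fraction of words whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


def per_label_scores(gold, pred, labels=LABELS):
    """Return precision, recall, and F1 for each label (one-vs-rest)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


if __name__ == "__main__":
    # Invented toy sequences, one tag per word.
    gold = ["AM", "AM", "EN", "EN", "NE", "UNIV", "AM", "EN"]
    pred = ["AM", "AM", "EN", "AM", "NE", "UNIV", "AM", "EN"]
    print("accuracy:", accuracy(gold, pred))
    for label, s in per_label_scores(gold, pred).items():
        print(label, s)
```

Reporting scores per label rather than accuracy alone matters for data like this, where rarer labels such as Named Entity and Universal can score much lower than the dominant languages while overall accuracy stays high.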

Files

Original bundle

Name: YETMWORK TESFAYE ASFAW.pdf
Size: 3.66 MB
Format: Adobe Portable Document Format

Collections

Department of Governance and Development studies