AMHARIC-ENGLISH CODE-MIXED LANGUAGE IDENTIFICATION ON SOCIAL MEDIA USING MACHINE LEARNING AND DEEP LEARNING APPROACH

dc.contributor.author: YETMWORK TESFAYE ASFAW
dc.date.accessioned: 2024-04-03T08:04:23Z
dc.date.available: 2024-04-03T08:04:23Z
dc.date.issued: 2024-01
dc.description.abstract: In recent years, identifying the languages used in social media text has become an active area of research. English was once the dominant language of social media communication, but code-mixed text is now prevalent in non-English-speaking regions. In code-mixed data, a single sentence combines two different languages, which makes it challenging to determine the language used in the text. Language identification is therefore an essential step in processing such code-mixed sentences. In this work, we employed several machine learning and deep learning techniques to identify the languages in Amharic-English code-mixed text from social media platforms such as Facebook, YouTube, TikTok, and Telegram. We prepared 5021 sentences from these platforms by scraping, cleaning, and filtering, followed by tokenization and POS tagging using NLTK. After tokenization, we manually labeled each word with a language tag and an Amharic POS tag; the resulting dataset contains 31,305 words. The code-mixed data was categorized into four labels: Amharic, English, Named Entity, and Universal. We trained the models with count vector, TF-IDF vector, bigram and trigram, and word-to-vector features. After training, we evaluated each model using accuracy, precision, recall, and F1-score, and for user-friendly interaction we saved the model and deployed it using the Flask web framework. The results show that SVM, logistic regression, and naïve Bayes achieve accuracies of 96%, 86%, and 86% with TF-IDF and count vector features, respectively, and that decision tree, XGBoost, random forest, and MaxEnt achieve accuracies of 94%, 93%, 93%, and 87%, respectively, with word-to-vector features. LSTM, CNN, Bi-LSTM, and MLP achieve accuracies of 84%, 87%, 87%, and 87%, respectively.
Overall, SVM is the best fit for our code-mixed language identification task. Among the deep learning algorithms, MLP performs best, with F1-scores of 0.90, 0.95, 0.64, and 0.55 for the Amharic, English, Named Entity, and Universal labels, respectively.
dc.description.sponsorship: Wolkite University
dc.language.iso: en
dc.publisher: WOLKITE UNIVERSITY
dc.subject: SVM
dc.subject: Code-Mixed
dc.subject: Language identification
dc.subject: MLP, Decision tree
dc.subject: Machine learning
dc.title: AMHARIC-ENGLISH CODE-MIXED LANGUAGE IDENTIFICATION ON SOCIAL MEDIA USING MACHINE LEARNING AND DEEP LEARNING APPROACH
dc.type: Thesis
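
The abstract reports accuracy together with per-label precision, recall, and F1-score for the four language labels. As a minimal sketch of how such word-level scores can be computed, the following uses only the Python standard library. The function name `per_label_scores` and the toy tag sequences are hypothetical and chosen for illustration; they are not taken from the thesis, and the thesis may have used a library routine instead.

```python
from collections import Counter

# The four word-level labels named in the abstract.
LABELS = ["Amharic", "English", "Named Entity", "Universal"]


def per_label_scores(gold, pred, labels=LABELS):
    """Return (accuracy, {label: (precision, recall, f1)}) for two tag lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed

    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)

    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    return accuracy, scores


# Toy example with invented tags (not data from the thesis).
gold = ["Amharic", "Amharic", "English", "English", "Named Entity", "Universal"]
pred = ["Amharic", "English", "English", "English", "Named Entity", "Amharic"]

accuracy, scores = per_label_scores(gold, pred)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Per-label scores show something overall accuracy hides: a model can be accurate on the frequent Amharic and English words while still scoring low on rarer labels, which is consistent with the lower Named Entity and Universal F1-scores reported above.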

Files

Original bundle

Name: Yetmwork_Tesfaye_LID_Research.pdf
Size: 3.66 MB
Format: Adobe Portable Document Format
