SCHOOL OF GRADUATE STUDIES

DEVELOPING SEMANTIC TEXTUAL SIMILARITY FOR GURAGIGNA LANGUAGE USING DEEP LEARNING APPROACH

MSc. THESIS

GETNET DEGEMU

May 28, 2024
WOLKITE, ETHIOPIA

Wolkite University
School of Graduate Studies

Developing Semantic Textual Similarity for Guragigna Language Using a Deep Learning Approach

An MSc thesis submitted to the School of Graduate Studies in partial fulfillment of the requirements for the Degree of Master of Science in Computer Science and Engineering

Getnet Degemu Besir

Major Advisor: Sintayehu Hirpassa (Ph.D.)
Co-Advisor: Abdo Ababor (MSc)

May 28, 2024
Wolkite, Ethiopia

APPROVAL SHEET
Wolkite University
School of Graduate Studies

As thesis advisor, I hereby certify that all comments given by the reviewers have been considered and that I have read and evaluated the thesis entitled "Developing Semantic Textual Similarity for Guragigna Language Using a Deep Learning Approach", submitted by Getnet Degemu.

Getnet Degemu              ___________       5/28/2024
Name                       Signature         Date

Approved by:
Sintayehu Hirpassa (Ph.D.)                   25/05/2024
Name of Major Advisor      Signature         Date

Abdo Ababor (MSc)          _______________   _____________
Name of Co-Advisor         Signature         Date

Kindie Biredagn (Ph.D.)                      25/05/2024
Name of External Examiner  Signature         Date

Worku Muluye               _______________   _____________
Name of Internal Examiner  Signature         Date

_______________            _______________   _______________
Name of DGC Chairman       Signature         Date

_______________            _______________   _______________
Name of CPG Coordinator    Signature         Date

_______________            _______________   _______________
SGS Approval               Signature         Date

APPROVAL SHEET
WOLKITE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby certify that we have read and evaluated this thesis, titled "Developing Semantic Textual Similarity for Guragigna Language Using a Deep Learning Approach", prepared under our guidance by Getnet Degemu Besir. We recommend that the thesis be submitted as fulfilling the requirements for the award of an MSc degree in Computer Science and Engineering.

Sintayehu Hirpassa (Ph.D.)                   25/05/2024
Major Advisor              Signature         Date

Abdo Ababor (MSc)          ______________    ______________
Co-Advisor                 Signature         Date

As members of the Board of Examiners of the Master of Science thesis open defense examination, we have read and evaluated this thesis prepared by Getnet Degemu Besir and examined the candidate. We hereby certify that the thesis is accepted as fulfilling the requirements for the award of the degree of Master of Science (M.Sc.) in Computer Science and Engineering.

Kindie Biredagn (Ph.D.)                      25/05/2024
Name of External Examiner  Signature         Date

Worku Muluye               ______________    ______________
Name of Internal Examiner  Signature         Date

________________________   ______________    ______________
Name of Chairman           Signature         Date

Final approval and acceptance of the thesis is contingent upon the submission of its final copy to the Council of Postgraduate Program (CPCS), through the candidate's department or school graduate committee (DGC or SGC).

DEDICATION

I dedicate the present work to my daughter, Ayana Getnet (ኤያና).

DECLARATION

By my signature below, I declare and affirm that this thesis is my own work. I have followed all ethical principles of scholarship in the preparation, data collection, data analysis, and completion of this thesis. All scholarly matter included in the thesis has been given recognition through citation, and I affirm that I have cited and referenced all sources used in this document. Every serious effort has been made to avoid any plagiarism in the preparation of this thesis.

This thesis is submitted in partial fulfillment of the requirements for a degree from the School of Graduate Studies at Wolkite University. The thesis is deposited in the Wolkite University Library and is made available to borrowers under the rules of the library. I solemnly declare that this thesis has not been submitted to any other institution anywhere for the award of any academic degree, diploma, or certificate.
Brief quotations from this thesis may be used without special permission, provided that accurate and complete acknowledgement of the source is made. Requests for permission for extended quotation from, or reproduction of, this thesis in whole or in part may be granted by the Head of the School or Department or the Dean of the School of Graduate Studies when, in his or her judgment, the proposed use of the material is in the interest of scholarship. In all other instances, however, permission must be obtained from the author of the thesis.

Name: Getnet Degemu
Signature: _________
Date: 05/28/24
Department: Computer Science and Engineering

ACKNOWLEDGMENT

First of all, I give thanks to God and Saint Mary (የረሸ ማርያም) for all of their support in my life. Next, I am grateful to Dr. Sintayehu H., my advisor, for his suggestions, assistance, and invaluable insights throughout this research. His expertise and encouragement have been instrumental in shaping the direction of this work and in pushing me to achieve my best. I would also like to convey my heartfelt appreciation to my co-advisor, Abdo Ababor (MSc), whose guidance has greatly influenced and enhanced my work; I am truly grateful for your deep guidance, constructive comments, and sharing of your experience and skills. Working with you has been a privilege, and I am fortunate to have had your support and mentorship throughout this work.

I would also like to convey my appreciation to the CCI college members of the Department of Software Engineering at Wolkite University for their continuous support and encouragement. Their dedication to academic excellence and their commitment to fostering a conducive learning environment have greatly enriched my educational experience. Special thanks go to my family, colleagues (Alex), and friends (Temesgen, Tsegaye F, D Yalewu F, D Dawit F and D Cheru T), who have been a source of inspiration and motivation.
I thank them for their insightful discussions, constructive feedback, and support in collecting data for this research. Lastly, I extend this acknowledgment to my beloved wife, Marta W. (ማዬ), for her unwavering support, love, and understanding. Your presence in my life has been a constant source of inspiration and strength, and your unwavering belief in my abilities, even during the most challenging times, has propelled me forward and given me the confidence to follow my dreams.

Name: Getnet Degemu
Signature: _________
Date: 05/28/24
Department: Computer Science and Engineering

ABBREVIATIONS AND ACRONYMS

BERT      Bidirectional Encoder Representations from Transformers
BRNN      Bi-Directional Recurrent Neural Network
CNN       Convolutional Neural Network
DL        Deep Learning
EXP       Expert
GloVe     Global Vectors for Word Representation
GLSA      Generalized Latent Semantic Analysis
GPU       Graphics Processing Unit
GRU       Gated Recurrent Unit
IDE       Integrated Development Environment
IDF       Inverse Document Frequency
LSTM      Long Short-Term Memory
ML        Machine Learning
NLP       Natural Language Processing
NN        Neural Network
PDF       Portable Document Format
PNG       Portable Network Graphics
ReLU      Rectified Linear Unit
RNN       Recurrent Neural Network
S1        Sentence One
S2        Sentence Two
SRNN      Stacked Recurrent Neural Network
STS       Semantic Textual Similarity
SVD       Singular Value Decomposition
SVG       Scalable Vector Graphics
TXT       Text File Extension
USE       Universal Sentence Encoder
UTF-8     Unicode Transformation Format, 8-Bit
Word2vec  Word to Vector

TABLE OF CONTENTS

APPROVAL SHEET
DEDICATION
DECLARATION
ACKNOWLEDGMENT
ABBREVIATIONS AND ACRONYMS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF TABLES IN THE APPENDIX
LIST OF FIGURES IN THE APPENDIX
ABSTRACT
CHAPTER ONE
1. INTRODUCTION
1.1. Background of the Study
1.2. Motivation
1.3. Statement of the Problem
1.4. Research Questions
1.5. Objective
1.5.1. General objective
1.5.2. Specific objective
1.6. Scope of the study
1.7. Limitations of the study
1.8. Significance of the study
1.9. Organization of the thesis
CHAPTER TWO
2. LITERATURE REVIEW
2.1. Introduction
2.2. The Semantic Textual Similarity Approach
2.2.1. String-based similarity
2.2.2. Corpus-Based Approaches
2.2.3. Knowledge-Base STS Approaches
2.3. Deep learning techniques
2.4. Overview of Guragigna Language
2.5. Related works
CHAPTER THREE
3. MATERIAL AND METHODS
3.1. Introduction
3.2. Proposed Approach
3.3. Material and Tools
3.3.1. Hardware Tools
3.3.2. Software Tools
3.4. Corpus
3.5. Preprocessing
3.5.1. Removing extra spaces
3.5.2. Removal of stop-words
3.5.3. Removing punctuation
3.5.4. Tokenization
3.6. Universal Sentence Encoder (USE)
3.7. Word embedding
3.7.1. Global Vectors for Word Representation (GloVe)
3.7.2. Word2Vec (Word to Vector)
3.8. Optimization Algorithms in STS using deep learning
3.9. Performance Measurement Methods
3.10. Accuracy
3.11. Evaluation Metrics
3.11.1. Mean Squared Error (MSE)
3.11.2. Process of adapting a pre-trained model
CHAPTER FOUR
4. RESEARCH DESIGN
4.1. Corpus Preparation
4.2. Architecture of Developing STS for Guragigna Language
4.3. Pre-processing
4.4. Data Splitting
4.5. Vectorization
4.6. Model Selection
4.6.1. Long Short-Term Memory Model
4.6.2. Bi-directional RNN
4.6.3. Gated Recurrent Unit (GRU)
4.6.4. Stacked RNN
CHAPTER FIVE
5. EXPERIMENTATION
5.1. Introduction
5.2. Data Collection and Preparation
5.3. Environment of implementation
5.3.1. Removing extra spaces
5.3.2. Removal of stop-words
5.3.3. Removing punctuation
5.3.4. Tokenization
5.4. Embedding Process
5.5. Parameter Selection
CHAPTER SIX
6. RESULT AND DISCUSSION
6.1. Introduction
6.2. Experimental Result
6.3. Discussion on the Result
6.4. Confusion matrix
6.5. Summary
6.6. Evaluation by Linguistic Experts
6.7. Answering Research Questions
CHAPTER SEVEN
7. CONCLUSION AND RECOMMENDATION
7.1. Overview
7.2. Conclusion
7.3. Contribution and challenges
7.4. Future work
7.5. Recommendation
REFERENCES
APPENDICES
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I

LIST OF TABLES

Table 1-1: The weakness of lexical matching in capturing semantic similarity
Table 2-1: List of related works
Table 3-1: Tools and materials
Table 5-1: Source of Data Collection
Table 5-2: List of Hyper-Parameters

LIST OF FIGURES

Figure 4-1: Architecture of Developing STS of Guragigna Language
Figure 4-2: LSTM Encoder-Decoder Architecture
Figure 4-3: Architecture of Bi-directional RNN Model
Figure 4-4: Gated Recurrent Unit (GRU) Model Architecture
Figure 4-5: Architecture of Stacked RNN
Figure 5-1: Sample code for removing spaces and newlines
Figure 5-2: Sample code for removing stopwords
Figure 5-3: Sample code for removing punctuation
Figure 5-4: Sample code for tokenization
Figure 5-5: Sample code for embedding
Figure 6-1: LSTM Training Using USE Embedding
Figure 6-2: LSTM Training Using GloVe Embedding
Figure 6-3: LSTM Training Using Word2vec Embedding
Figure 6-4: LSTM Model Comparison of Actual and Predicted Similarity Scores
Figure 6-5: GRU Training History Using USE Embedding
Figure 6-6: GRU Training History Using GloVe Embedding
Figure 6-7: GRU Training History Using Word2vec Embedding
Figure 6-8: GRU Model Comparison of Actual and Predicted Similarity Scores
Figure 6-9: Bidirectional RNN Training History Using USE Embedding
Figure 6-10: Bidirectional RNN Training History Using Word2vec Embedding
Figure 6-11: Bidirectional RNN Training History Using GloVe Embedding
Figure 6-12: Bidirectional RNN Model Comparison of Actual and Predicted Similarity Scores
Figure 6-13: Stacked RNN Training History Using USE Embedding
Figure 6-14: Stacked RNN Training History Using GloVe Embedding
Figure 6-15: Stacked RNN Training History Using Word2vec Embedding
Figure 6-16: Stacked RNN Model Comparison of Actual and Predicted Similarity Scores
Figure 6-17: LSTM confusion matrix
Figure 6-18: GRU confusion matrix
Figure 6-19: Bidirectional RNN confusion matrix
Figure 6-20: Stacked RNN confusion matrix

LIST OF ALGORITHMS

Algorithm 3-1: Algorithm for removing extra spaces from the dataset
Algorithm 3-2: Algorithm for removing stop words from the dataset
Algorithm 3-3: Algorithm for removing punctuation from the dataset
Algorithm 3-4: Algorithm for tokenizing the dataset
Algorithm 3-5: Algorithm for the Universal Sentence Encoder (USE)
Algorithm 3-6: Algorithm for Global Vectors for Word Representation (GloVe)
Algorithm 3-7: Algorithm for Word2Vec (Word to Vector)

LIST OF TABLES IN THE APPENDIX

1. Sample of Corpus
2. Punctuation marks commonly used in Guragigna Language
3. Sample List of Stop Words of Guragigna

LIST OF FIGURES IN THE APPENDIX

1. Alphabet of Guragigna
2. Required Python Libraries
3. Sample Train Data
4. Sample output of model prediction
5. Sample output of model Loss and Accuracy
6. Sample pairs of sentences with similarity scored by experts

ABSTRACT

Natural language processing (NLP) is one measure of how far the world has come in terms of technology. It is the process of teaching human language to machines and includes everything from morphological analysis to pragmatic analysis.
Semantic Similarity is one of the highest levels of NLP. The Previous Semantic textual similarity (STS) studies have been conducted using from string-based similarity methods to deep learning methods. These studies have their limitations, and no research has been done for STS in the local language using deep learning. STS has significant advantages in NLP applications like information retrieval, information extraction, text summarization, data mining, machine translation, and other tasks. This thesis aims to present a deep learning approach for capturing semantic textual similarity (STS) in the Guragigna language. The methodology involves collecting a Guragigna language corpus and preprocessing the text data and text representation is done using the Universal Sentence Encoder (USE), along with word embedding techniques including Word2Vec and GloVe and mean Square Error (MSE) is used to measure the performance. In the experimentation phase, models like LSTM, Bidirectional RNN, GRU, and Stacked RNN are trained and evaluated using different embedding techniques. The results demonstrate the efficacy of the developed models in capturing semantic textual similarity in the Guragigna language. Across different embedding techniques, including Word2Vec, GloVe, and USE, the Bidirectional RNN model with USE embedding achieves the lowest MSE of 0.0950 and the highest accuracy of 0.9244. GloVe and Word2Vec embedding also show competitive performance with slightly higher MSE and lower accuracy. The Universal Sentence Encoder consistently emerges as the top-performing embedding across all RNN architectures. The research results demonstrate the effectiveness of LSTM, GRU, Bi RNN, and Stacked RNN models in measuring semantic textual similarity in the Guragigna language. Keywords: Semantic textual similarity, Guragigna language, deep learning, corpus-based approaches, LSTM, GRU, Bidirectional RNN, Stacked RNN and Word embedding. 1 CHAPTER ONE 1. INTRODUCTION 1.1. 
Background of the Study NLP means doing computations in natural language. Semantic analysis is one of the processes involved in natural language processing. When building the syntactic structure of the sentence the input sentence analysis does a semantic analysis of the sentence and Sentences are given meaning by semantic interpretation. Logical forms are mapped to knowledge representations by contextual interpretation. The semantic similarity of features in a vector model is the fundamental building block of semantic analysis.[1]. The comparison of text meaning known as semantic text similarity (STS) plays a vital role in various tasks within natural language processing (NLP) like information retrieval, categorization, content extraction, answering questions, and identifying plagiarism. Text similarity between simple sentence is an important and necessary task in many information retrieval applications. Performance of many natural language processing (NLP) applications like text summarization, machine translation, plagiarism detection, and sentiment analysis. It also relies on similarity of text and meaning. Several other applications have used similarity such as text classification, feedback on relevancy, word disambiguation, subtopic mining, and web search[2]. Similarity measures for many languages such as English, Spanish and Arabic are available, and some have been organized by the organizers of SemEval ST for calculating similarity between multilingual and monolingual simple sentence research duties [2]. One typical approach for computing similarity is lexical matching between simple sentence. A similarity score is determined using the quantity of terms that belong to both text segments. These metrics however, are only able to calculate similarities at a very basic level. Furthermore, this matching can only estimate text similarity but not semantics. Consider two simple sentence “ሁት ሜና ነረን ባረም ተሳረምታ ቸነም” (does he has a work? 
He asked) and “ሁት ሜና ኤነን ባረም ተሳረምታ ቸነም” (doesn’t he has a work? He asked). As indicated by the lexical assignment in both sentences he has two headwords ("ሁት" and "ሜና"). But these he has no semantic connection between the two simple sentence. Consider another pair of sentences: “አት አርች ቸዋች ተሐረ አወገዳታ ብ𞟠 ንስራነ ቧረንም” (A boy went to cry with his good friend) and “አማት አርች ሶሬሳ ተሐረ አወገዳታ ብ𞟠 ንወነ 2 ቧረንም” (A boy went to cry with his good friend). There is no clear terminology present in these two sentences. but there are clear semantic similarities [2]. Table 1-1: The weakness of lexical matching in capturing semantic similarity Sentence 1 Sentence 2 Similarity “ሁት ሜና ነረን ባረም ተሳረምታ ቸነም” “ሁት ሜና ኤነን ባረም ተሳረምታ ቸነም” Lexically similar but not semantically “አት አርች ቸዋች ተሐረ አወገዳታ ብ𞟠 ንስራነ ቧረንም” “አማት አርች ሶሬሳ ተሐረ አወገዳታ ብ𞟠 ንወነ ቧረንም” Semantically similar but not lexically Similarities between Guragigna simple sentence is more difficult than simple sentence in other languages. One of the main reasons is that the Guragigna resources are not comparable to those of any other language. Other text preprocessing includes the well-known tokenizers, stemmers, and lemmatizes used in almost every NLP task, and their performance is arguably even better. But on the contrary. This kind of tool is less common in Guragigna simple sentence. Additionally, there are well-organized resources such as WordNet, NLP POS-Tagger, and more. This improves the performance of similarity estimation methods and Guragigna text methods therefore, lack such tools and resources [2]. Trying to overcome the challenge of capturing semantic similarities between Guragigna text pairs. We introduced a method to measure the semantic similarity of Guragigna simple sentence Based on Deep learning techniques an efficient Guragigna algorithm for measuring semantic text similarity is used. Prepare a dataset that can be used to test the performance of the Guragigna text semantic similarity measure [2]. 1.2. 
Motivation
Research in natural language processing has been motivated primarily by the prospect of better understanding the structure and function of human language and of building natural language interfaces that facilitate communication between humans and computers. Recently, considerable research on semantic sentence similarity has been carried out internationally; semantic textual similarity systems have been developed for foreign languages such as English, Arabic, Spanish, and Bengali [2]. However, research on local Ethiopian languages, and on Guragigna in particular, remains very limited, and no semantic textual similarity system has yet been developed for the Guragigna language. This gap motivates the present study.

1.3. Statement of the Problem
Semantic Textual Similarity (STS) is an essential component of natural language processing, with significant implications for various tasks and applications. In information retrieval (IR), STS plays a vital role by measuring the similarity between user queries and documents, enabling precise retrieval of relevant information. STS is also valuable in information extraction, where it helps mine unstructured text for useful information by measuring semantic similarity between different pieces of text. Another important application is text summarization, where STS helps identify similar or redundant content within a document, making it easier to create informative summaries. STS likewise supports data mining, where it aids in clustering similar instances or identifying similar patterns. In machine translation, STS improves the accuracy of translations by capturing the semantic similarity between source- and target-language sentences. In question answering systems, STS helps determine the similarity between user queries and candidate answers, leading to more accurate responses.
STS is also relevant in sentiment analysis, where it measures similarity between sentiment-bearing texts, aiding in tasks such as sentiment classification. Additionally, STS aids in paraphrase detection, which is crucial for tasks like plagiarism detection and text generation. Overall, STS is a fundamental concept in natural language processing that enhances the efficiency and accuracy of language understanding across various domains.

Guragigna is one of the most widely spoken languages in Ethiopia; it is an Afro-Asiatic language of the Southern Ethiopian Semitic branch spoken by the Gurage people. According to the 2007 Census, there are currently over 6.8 million native speakers of the language. The language is used in the middle grades of elementary school and in various community institutions [3], as well as in media and publishing: Wolkite radio broadcasts in the language, and magazines, textbooks, and fiction are published in it.

The limited study of the Guragigna language can be explained by several factors. NLP tasks require significant linguistic resources, such as annotated corpora and language models, which are typically developed for languages with greater demand and research backing; as a result, Guragigna lacks the resources needed to support advanced NLP research. Data availability also plays a crucial role in NLP, and under-studied languages suffer from a shortage of publicly available language resources, which hinders the development and evaluation of NLP systems for Guragigna. Moreover, the absence of practical applications or tools, for instance machine translation systems or part-of-speech taggers for Guragigna, further indicates the limited research and development in these areas.
For these reasons, only a few research studies have been conducted on the Guragigna language across natural language processing (NLP) tasks, for instance part-of-speech tagging and machine translation [4], automatic Guragigna character recognition [5], and others. International academic research databases such as IEEE Xplore, the ACM Digital Library, and Google Scholar, as well as local repositories associated with Ethiopian universities, linguistic research institutions, and language departments, were searched using keywords such as "Guragigna language", "semantic textual similarity", and "NLP". No specific studies on STS for Guragigna were found, although related research exists in the broader NLP field and in studies on STS for other languages, such as Bengali [6], English [7], and Arabic [8]. These studies showed promising results but also had limitations, including not utilizing any RNN or CNN models; a large gap in accuracy compared to English STS models; the lack of a lexical standard for Arabic; insufficient experimental detail (failure to explain actual scores and model predictions) and comparison with unrelated work; a lack of information about pre-trained embeddings; a shortage of annotated corpora; good performance only on smaller datasets; and limited availability of training data. Additionally, a study on the Amharic language [9] attempted to develop an Amharic-English CLSTSM system using a statistical topic-modeling-based semantic text similarity measurement approach. Such a model, built on statistical topic modeling techniques like LDA, has the disadvantage that it relies primarily on word co-occurrence statistics and fails to incorporate the semantic meaning of words; as a result, the topics it generates may not always align with human interpretation of the underlying themes.
Therefore, based on these problems, research on semantic textual similarity is both necessary and important for information retrieval, information extraction, text summarization, data mining, machine translation, and related tasks, and this motivates the present study.

1.4. Research Questions
At the end of this study, the following research questions are investigated and answered.
RQ1. Which word embedding techniques can be used for model development to maximize the effectiveness and robustness of Semantic Text Similarity (STS)?
RQ2. Which deep learning model is the most effective for performing Semantic Text Similarity (STS) analysis for the Guragigna language?

1.5. Objective
1.5.1. General objective
 The general objective is to develop a semantic textual similarity analyzer for the Guragigna language using a deep learning approach.
1.5.2. Specific objectives
 To prepare a semantic text similarity corpus for the Guragigna language.
 To develop word embedding techniques for the Guragigna Semantic Text Similarity (STS) analyzer.
 To develop a deep learning model for the Guragigna semantic text similarity (STS) analyzer.
 To measure the performance of the word embedding techniques in conjunction with the deep learning models.
 To measure the effect of each deep learning algorithm on the Guragigna Semantic Text Similarity (STS) model.

1.6. Scope of the study
This study examines the semantic similarity of simple sentences in the Cheha dialect of the Guragigna language, focusing specifically on sentence-level semantic similarity. The study employs a deep learning approach to investigate semantic similarity in the context of Guragigna, with the goal of developing a model that can accurately measure semantic similarity between Guragigna sentences. To facilitate the development and evaluation of the model, a dataset is prepared, consisting of annotated Guragigna sentence pairs along with their corresponding similarity scores.
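To make the corpus format concrete, a record of such a dataset can be thought of as a sentence pair plus a similarity score. The sketch below is purely illustrative: the actual file format and score scale used in this thesis are not specified here, and the tab-separated layout, the 0-5 score scale (as used in SemEval STS), and the example scores are all assumptions.

```python
import csv
import io

# Illustrative only: hypothetical TSV records "sentence1 <TAB> sentence2 <TAB> score";
# the scores 1.0 and 4.5 are invented for this example, not taken from the thesis corpus.
sample = (
    "ሁት ሜና ነረን ባረም ተሳረምታ ቸነም\tሁት ሜና ኤነን ባረም ተሳረምታ ቸነም\t1.0\n"
    "አት አርች ቸዋች ተሐረ አወገዳታ ብ𞟠 ንስራነ ቧረንም\tአማት አርች ሶሬሳ ተሐረ አወገዳታ ብ𞟠 ንወነ ቧረንም\t4.5\n"
)

def load_sts_pairs(text):
    """Parse (sentence1, sentence2, score) records from tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(s1, s2, float(score)) for s1, s2, score in reader]

pairs = load_sts_pairs(sample)
print(len(pairs), pairs[0][2])  # 2 1.0
```

A format of this kind lets annotated pairs be loaded uniformly for training and evaluation, whatever concrete scale the annotators use.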
The dataset covers diverse sentence pairs, representing various semantic relationships and degrees of similarity. The study aims to leverage deep learning techniques to advance the understanding and capabilities of semantic similarity analysis in the Guragigna language.

1.7. Limitations of the study
The analysis focuses on simple sentences, meaning that the findings may not directly apply to more complex sentence structures or longer texts, and the study is limited to the Cheha dialect, which could restrict its generalizability to other dialects or languages. The effectiveness of the model developed in this study depends heavily on the quality and representativeness of the dataset used; any limitations or biases in the dataset may affect the accuracy and reliability of the model's results. Lastly, this study primarily addresses general semantic relationships in Guragigna sentences and does not address domain-specific semantic similarity analysis. By considering these limitations, we can better interpret and contextualize the findings of the study.

1.8. Significance of the study
Semantic text similarity is an important and fundamental task in natural language processing (NLP). Being able to compare semantic similarity between sentences has applications in many fields, including plagiarism detection, search engines, and customer service. The development of STS offers distinct benefits to both the Gurage language community and the research community.
 For the Gurage community: Semantic Textual Similarity (STS) holds great significance for the Gurage community by contributing to language preservation, technology development, education, information retrieval, and cultural representation.
STS enables the accurate measurement of semantic similarity between Gurage language sentences, helping to preserve and revitalize the language and facilitating the development of language technologies specific to the community's needs. It supports educational applications, empowering learners to improve their language proficiency. Lastly, it promotes cultural representation and identity by conveying the unique aspects of the Gurage community's language and culture in various domains. Overall, STS empowers the Gurage community in communication, information access, and language preservation for future generations.
 For the research community: this work provides a preprocessing component that enables researchers to build more advanced NLP applications for the Guragigna language.

1.9. Organization of the thesis
The rest of this thesis is organized as follows. In Chapter 2, we explain the different approaches used to develop Semantic Text Similarity (STS) and review related work on developing STS for the Guragigna language. Chapter 3 focuses on the methodology employed in this study. Chapter 4 presents the design and implementation of the proposed STS system for Guragigna. In Chapter 5, we present the experimental results of the proposed system, and Chapter 6 discusses those results. Finally, in Chapter 7, we conclude the thesis by highlighting the research contributions and discussing future work.

CHAPTER TWO
2. LITERATURE REVIEW
2.1. Introduction
Semantic similarity plays a crucial role in many Natural Language Processing (NLP) applications. One fundamental task in this field is Semantic Textual Similarity (STS), which involves assessing the similarity between documents. To determine this similarity, a metric is used to evaluate the direct and indirect relationships among the documents.
By identifying semantic relations, we can measure and recognize these relationships accurately [8][10]. The primary objective of the STS task is to establish a unified framework that combines different independent semantic components and assesses the influence of these components on different NLP tasks. Developing such a framework is a crucial research challenge with significant applications in NLP, including information retrieval (IR) and text summarization [4], [11], as well as question answering [12], relevance feedback [13], text classification [14], word sense disambiguation (WSD), and extractive summarization [15]. Semantic similarity is relevant not only to NLP applications but also to various semantic web applications, including information extraction, ontology generation, and disambiguation. It is particularly valuable in search [50], where the ability to accurately measure semantic relatedness across entities is important for IR; a key problem is retrieving documents or images that are semantically related to a user's query in a web search engine, including retrieving images based on their captions [11]. Text similarity also has applications beyond NLP and the semantic web, extending into databases. In database systems, text similarity can be leveraged for schema matching, addressing the challenge of semantic heterogeneity in data sharing, data integration, message passing, and peer-to-peer data management systems [16]. Additionally, text similarity is beneficial for relational join operations, particularly when the join attributes exhibit textual similarity. The utility of text similarity spans various application domains, including the integration and querying of data from diverse sources, data cleansing, and data mining [17]. In NLP, STS is connected to both Textual Entailment (TE) and paraphrasing, although there are differences between them.
In TE, a directional relationship is established between two text fragments, which are treated as the "text" (t) and the "hypothesis" (h). Paraphrase identification, in contrast, aims to recognize text fragments that have approximately the same meaning within a specific context. TE and paraphrasing therefore yield a yes/no decision, while STS goes a step further by evaluating the degree of equivalence between texts and assigning a graded rating to their semantic connection.

2.2. Semantic Textual Similarity Approaches
2.2.1. String-based similarity
String-based similarity methods evaluate text from a lexical standpoint, working only with character and string sequences. They are widely used in NLP tasks to compare phrases, sentences, and other text fragments, and can be used to gauge the level of surface (lexical) relatedness between two strings.

2.2.1.1. Character-Wise Approach
LCS and N-grams are two of the most common approaches at the character level. The Longest Common Substring (LCS) algorithm uses dynamic programming to find the length of substrings common to both terms, while the N-gram approach considers sub-sequences of n items of a term; N-gram distance is computed by dividing the number of shared n-grams by the maximal number of n-grams available. The Longest Common Substring (LCS) algorithm identifies the longest shared substring between two strings: it compares the two strings and determines their similarity by examining the longest sequence of characters they have in common [16].
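The dynamic-programming idea behind the longest common substring can be sketched in a few lines of Python (a minimal illustration, not the thesis implementation):

```python
def longest_common_substring(s1, s2):
    """Length of the longest substring shared by s1 and s2.

    table[i][j] holds the length of the longest common suffix of
    s1[:i] and s2[:j]; the answer is the maximum entry in the table.
    """
    m, n = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

print(longest_common_substring("similarity", "simulate"))  # "sim" -> 3
```

The table-based recurrence is exactly the LCSuff function referenced in the formal definition below; taking the maximum over all cells yields the longest common substring length.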
The measurement can be computed as follows:

LCSubstr(S1, S2) = max_{1 <= i <= m, 1 <= j <= n} LCSuff(S1[1..i], S2[1..j])

Here, m is the length of the first string S1, n is the length of the second string S2, and LCSuff(S1[1..i], S2[1..j]) is the length of the longest common suffix of the prefixes S1[1..i] and S2[1..j]; the longest common substring is the maximum such suffix length over all prefix pairs.

Damerau-Levenshtein distance is a metric for evaluating the difference between two strings: it quantifies the minimum number of edit operations (insertions, deletions, substitutions, and transpositions of adjacent characters) needed to transform one string into the other [16].

The Jaro similarity is a normalized score in which 0 indicates no similarity between the strings and 1 indicates an exact match. It is calculated as follows:

d_j = 0, if m = 0
d_j = (1/3) * ( m/|s1| + m/|s2| + (m - t)/m ), otherwise

Here, |s1| and |s2| denote the lengths of strings s1 and s2, m is the number of matching characters, and t is half the number of transpositions. Two characters are considered matching only if they are identical and no farther apart than

floor( max(|s1|, |s2|) / 2 ) - 1.

The Jaro-Winkler distance extends the Jaro metric with a bonus for a common prefix, which makes it well suited to short strings such as simple sentences:

d_w = d_j + l * p * (1 - d_j)

Here, d_j is the Jaro similarity of the strings, l is the length of their common prefix (up to a maximum of 4 characters), and p is a constant scaling factor (commonly p = 0.1). The prefix bonus provides a more refined similarity measure [20].

The Needleman-Wunsch algorithm is an optimal matching algorithm and a global alignment technique based on dynamic programming, commonly used in bioinformatics to align biological sequences [18]. The Smith-Waterman algorithm performs local sequence alignment and is used to evaluate the similarity of strings such as nucleotide sequences.
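The edit-distance and alignment algorithms above all share the same dynamic-programming recurrence, which can be illustrated with a minimal Levenshtein distance implementation (Needleman-Wunsch generalizes this recurrence with configurable gap and match scores, and the Damerau variant would additionally count adjacent transpositions):

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform s1 into s2 (classic dynamic programming)."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                       # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                       # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```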
Unlike Needleman-Wunsch, Smith-Waterman optimizes over local segments of the strings rather than their full lengths; it is not practical for large-scale problems [19]. The N-gram model is a probabilistic language model that predicts the next item in a sequence from the preceding (n - 1) terms or characters; its main advantages are its simplicity of implementation and its scalability [20].

2.2.1.2. Term-Wise Approach
At the term level, two measures are commonly used to evaluate similarity: cosine similarity and Jaccard similarity, both of which compare texts represented as vectors or sets of terms. (Character-level methods such as Damerau-Levenshtein, Jaro-Winkler, Needleman-Wunsch, and Smith-Waterman were described in the previous subsection.)

Block Distance, also known as City Block Distance, Snake Distance, Manhattan Distance, Manhattan Length, or L1 Distance [21], measures the distance d1 between two points represented by vectors p and q as follows:

d1(p, q) = ||p - q||_1 = sum_{i=1}^{n} |p_i - q_i|

Cosine Similarity is a similarity metric between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It is commonly used in data mining to assess the cohesion between vectors [22]. The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product:

a . b = ||a|| ||b|| cos(theta)

Given two vectors A and B, the cosine similarity cos(theta) is the dot product divided by the product of the vectors' magnitudes:

cos(theta) = (A . B) / (||A|| ||B||) = sum_{i=1}^{n} A_i B_i / ( sqrt(sum_{i=1}^{n} A_i^2) * sqrt(sum_{i=1}^{n} B_i^2) )

Here, A . B represents the dot product of vectors A and B, and ||A|| and ||B|| represent the magnitudes (or norms) of vectors A and B, respectively. Soft Cosine Similarity generalizes cosine similarity by taking into account the pairwise similarity of features in the Vector Space Model [23].
It is calculated using the following formula:

soft_cos(theta) = sum_{i,j=1}^{n} s_ij A_i B_j / ( sqrt(sum_{i,j=1}^{n} s_ij A_i A_j) * sqrt(sum_{i,j=1}^{n} s_ij B_i B_j) )

In this formula, s_ij is the value from the similarity matrix between features i and j. Note that if the similarity matrix is diagonal, meaning each feature is similar only to itself, the soft cosine reduces to the ordinary cosine similarity [24].

The Sorensen-Dice index (Dice's coefficient) is used to quantify the similarity of two samples [25], commonly in terms of the presence or absence of elements in two data sets. It is calculated as follows:

QS = 2|X ∩ Y| / (|X| + |Y|)

Here, |X| and |Y| are the numbers of elements in the two sets, and the quotient of similarity QS ranges in value from 0 to 1. Applied to the bigrams of strings S1 and S2, the coefficient is calculated as:

sim = 2 n_t / (n_s1 + n_s2)

In this formula, n_t is the number of bigrams shared by the two strings, while n_s1 and n_s2 are the total numbers of bigrams in S1 and S2, respectively.

Euclidean Distance measures the straight-line distance between two points. The Euclidean distance between points s and t, denoted d(s, t) = d(t, s), is calculated using the following formula:

d(s, t) = sqrt( sum_{i=1}^{n} (s_i - t_i)^2 )

Here, n is the number of dimensions (features) of the space, and s_i and t_i are the coordinates of the two points in dimension i; the formula takes the square root of the sum of the squared coordinate-wise differences.

The Jaccard Index (Jaccard similarity coefficient) is a statistical measure used to gauge the similarity and diversity of two finite sets [26]. It is defined by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)

In this formula, A and B represent the two sets being compared.
|A| and |B| denote the cardinality (number of elements) of sets A and B, respectively.

The Simple Matching Coefficient (SMC) is a statistical measure used to assess the similarity and diversity of two objects, each viewed as a collection of n binary attributes. The SMC between objects A and B can be calculated using the following formula:

SMC = (number of matching attributes) / (total number of attributes) = (a_00 + a_11) / (a_00 + a_01 + a_10 + a_11)

Here, the total number of attributes is a_00 + a_01 + a_10 + a_11, and the matching attributes are those with the same value in both objects: a_00 is the number of attributes that are 0 in both A and B, a_11 the number that are 1 in both, a_10 the number that are 1 in A and 0 in B, and a_01 the number that are 0 in A and 1 in B. Dividing the number of matching attributes by the total number of attributes yields a similarity value between 0 and 1.

The Overlap Coefficient, also known as the Szymkiewicz-Simpson Coefficient, is a similarity measure closely related to the Jaccard Index. It quantifies the overlap between two sets and is defined as follows:

overlap(A, B) = |A ∩ B| / min(|A|, |B|)

In this formula, |A| and |B| represent the cardinalities of sets A and B; the numerator is the size of their intersection and the denominator is the size of the smaller set. The overlap coefficient takes values between 0 and 1, reaching 1 exactly when one set is a subset of the other; in that case, all elements of the smaller set are also present in the larger one. 2.2.2.
Corpus-Based Approaches
Corpus-based methods in language studies use real language samples from large collections of written or spoken texts to examine language structure, usage, and meaning. These approaches are commonly used to identify patterns in communication, create networks of word meanings, develop computational models of language and learning, measure differences between dialects, and understand how language changes over time; they can also shed light on how languages are learned and provide information for NLP tasks. An important aspect of this approach is finding similarities between words by analyzing the data in the collection, which requires a sizable body of text: analyzing large collections allows us to identify common word occurrences and accurately estimate word similarities, and many of the methods proposed for measuring word similarity rely on such analysis. A text collection used for language research is called a corpus, and it contains written or spoken sentences. To determine word similarities, we typically examine how words appear together in the corpus; obtaining reliable word co-occurrence statistics requires a very large and balanced corpus [27].

2.2.2.1. Method of LSA (Latent Semantic Analysis)
One example of this type of analysis is Latent Semantic Analysis (LSA). In LSA, each word is represented as a vector derived from statistical calculations. To create these vectors, a large text is analyzed and a word-by-segment matrix is constructed, with words as rows and paragraphs or segments of text as columns. Singular value decomposition (SVD) is then applied to reduce the dimensionality of the matrix, and after dimensionality reduction, word similarity is computed using cosine similarity.
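The first and last steps of this pipeline, building the word-by-segment count matrix and comparing word rows by cosine, can be sketched in pure Python. This is a toy illustration with invented English segments; the SVD step is omitted here and would in practice be performed with a numerical library.

```python
from collections import Counter
from math import sqrt

# Toy corpus: each "segment" is one short text (invented for illustration).
segments = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vocab = sorted({w for seg in segments for w in seg.split()})

# Word-by-segment frequency matrix: rows = words, columns = segments.
counts = [Counter(seg.split()) for seg in segments]
matrix = {w: [c[w] for c in counts] for w in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(round(cosine(matrix["sat"], matrix["on"]), 2))   # 1.0 (same segments)
print(round(cosine(matrix["cat"], matrix["dog"]), 2))  # 0.0 (no shared segment)
```

Note that with raw counts, "cat" and "dog" never co-occur in a segment and therefore score 0; it is precisely the SVD step omitted here that lets LSA recover such latent similarity from shared contexts.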
In this method, contextual information for words is extracted from a large text corpus [28]. The first step is to represent the text as a matrix in which rows represent unique words and columns represent segments of text, with each entry recording the frequency with which a word appears in a particular segment [29]. The cell frequencies are weighted by two factors: the importance of a word in the text and the degree to which words share information across discourse contexts. This approach can be used in two ways: as a practical similarity matrix over words and text segments, and as a computational model of the underlying knowledge acquisition and usage. To reduce the dimensionality of the matrix, singular value decomposition (SVD) is applied while preserving the similarity structure among the columns; similarity is then measured as the cosine of the angle between the word vectors formed by any two rows. LSA relies on the distributional hypothesis, which holds that words appearing in similar contexts tend to have similar meanings [30]; evidence of word similarity can therefore be computed through statistical analysis of large collections of sentences. LSA is a mathematical and statistical technique that extracts and infers relationships from the expected contextual usage of words in discourse passages. It is not a traditional natural language or artificial intelligence processing program: instead of relying on human-made dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, LSA takes raw text as input, treating it as sentences or paragraphs [29].

2.2.2.2. Method of Hyperspace Analogue to Language
Hyperspace Analogue to Language (HAL) constructs a word co-occurrence matrix in which both rows and columns represent words in the vocabulary. The matrix elements are filled with association strength values.
These association strength values are computed by applying a sliding "window" over the corpus, where the size of the window can be adjusted; the strength of association between words within the window decreases as the distance between them increases. For example, in the sentence "This is a survey of various semantic similarity measures," the words "survey" and "various" would have a higher association value than "survey" and "measures." Word vectors are formed by considering both the row and the column of a given word in the co-occurrence matrix. To reduce dimensionality, columns with low entropy values are eliminated. Finally, semantic similarity is calculated by measuring the Euclidean or Manhattan distance between the word vectors [31].

2.2.2.3. Method of Explicit Semantic Analysis (ESA)
ESA (Explicit Semantic Analysis) is a semantic similarity measurement method that relies on Wikipedia concepts. Because it builds on Wikipedia, the approach can be applied to different domains and languages, and the dynamic nature of Wikipedia keeps the method adaptable to change over time [32]. In ESA, each concept in Wikipedia is represented as an attribute vector of the words associated with it, and an inverted index is constructed linking each word to the concepts it is associated with. To determine the strength of these associations, TF-IDF weighting is applied, and concepts with weak word associations are filtered out. As a result, the input text is represented by weighted vectors of concepts, known as "interpretation vectors" [32]. To measure semantic similarity, the cosine similarity between these interpretation vectors is calculated, which captures how close the vectors are in direction in the vector space [32]. 2.2.2.4.
Method of Word-alignment Models
Word-alignment models determine the semantic similarity between sentences based on the alignment of their words, estimated over a large corpus [32]. These models performed well in the SemEval 2015 tasks, securing the second, third, and fifth positions. The unsupervised method that ranked fifth used a word-alignment technique based on the Paraphrase Database (PPDB) [32]: it measures the semantic similarity between two sentences as the proportion of aligned content words shared between the sentences relative to the total number of words in both sentences. The supervised methods that ranked second and third employed word2vec to establish word alignments. In the first supervised method, a sentence vector is created by computing the component-wise average of the word vectors, and the cosine similarity between these sentence vectors serves as the STS measure. The second supervised method considers only words that exhibit contextual semantic similarity [32].

2.2.2.5. Method of Latent Dirichlet Allocation (LDA)
LDA (Latent Dirichlet Allocation) is a technique commonly used for topic modeling. It represents a document's topics or general idea as a vector rather than including every single word of the document, which offers the advantage of reduced dimensionality, since the number of topics is typically much smaller than the number of words in the document [33]. To evaluate document similarity, each document is represented as a vector in a space in which each dimension corresponds to a specific topic, and the cosine similarity between these document vectors is calculated to measure the semantic similarity of the documents [34]; the cosine similarity captures how close the document vectors are in direction, indicating their semantic similarity. 2.2.2.6.
Method of Normalized Google Distance (NGD) Normalized Google Distance (NGD) is a measure of similarity between two terms based on the results obtained from querying them using the Google search engine. The underlying assumption is that if two words are more related, they will appear together more frequently in web pages [35]. To calculate the NGD between two terms, denoted as t1 and t2, the following formula is used: NGD(x,y) = max {loд f (t1),loд f (t2)} − loд f (t1,t2), (9) loд G − min {loд f (t1),loд f (t2)} 16 In this formula, f(x) and f(y) represent the number of hits in the Google search results for the respective terms, while f(x,y) represents the number of hits when the terms are searched together. The variable G represents the total number of pages in the overall Google search. NGD is commonly used to measure semantic relatedness rather than semantic similarity. This is because related terms tend to appear together more frequently in web pages, even if they have opposite meanings. 2.2.2.7. Method of Dependency-based Models Dependency-based approaches aim to determine the meaning of a given word or phrase by examining its neighboring words within a specified window. These approaches typically begin by parsing the corpus using Inductive Dependency Parsing [36]., which involves analyzing the distribution of words within the corpus. For each word, a "syntactic context template" is constructed, taking into account the preceding and succeeding nodes in the parse tree. As an example, the phrase "thinks delicious" could have a context template such as "pizza, burger, and food." A vector representation of a word is then created by aggregating the context templates in which the word appears as the root word. The frequency of these word windows occurring in the entire corpus is also considered. Once the vector representation is formed, semantic similarity can be calculated using cosine similarity between these vectors. Levy et al. 
[36] introduced DEPS, a word-embedding model that applies the bag-of-words approach to dependency contexts. The model was evaluated using the WS353 dataset, which involves ranking similar words above related words. When comparing recall-precision curves, the DEPS curve demonstrated a stronger affinity towards similarity rankings than the bag-of-words (BoW) methods.

2.2.2.8. Method of Word-attention Models

In many corpus-based methods, all components of the text are treated as equally significant. In human interpretation of similarity, however, the importance of specific keywords in a given context is often emphasized. Word-attention models aim to capture the importance or relevance of words in the underlying corpus before calculating semantic similarity [37]. These models employ various techniques to determine the attention weights of the words in the text being analyzed, including word frequency, alignment, and word association. By assigning higher attention weights to key terms in the context, word-attention models capture the relative importance of specific words in determining semantic similarity. This allows the models to focus on the most relevant information when calculating similarity measures.

2.2.2.9. Method of GLSA (Generalized Latent Semantic Analysis)

Generalized Latent Semantic Analysis (GLSA) is a technique for calculating semantically motivated phrase and document vectors. It extends the LSA methodology by emphasizing term vectors rather than the dual document-term representation. GLSA requires a dimensionality reduction technique and a measure of semantic association between concepts. With the GLSA approach, any appropriate dimensionality reduction technique can be combined with any similarity measure on the space of words.
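The vector-based similarity idea shared by the word-alignment and LDA methods above (build a sentence or document vector, then compare with cosine similarity) can be sketched as follows. This is a minimal illustration: the three-dimensional word vectors are invented toy values, not real embeddings.

```python
import math

def average_vector(word_vectors):
    """Component-wise average of word vectors, yielding a sentence vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy word vectors for two short "sentences" (invented for illustration).
s1 = average_vector([[1.0, 0.0, 1.0], [0.5, 0.5, 1.0]])
s2 = average_vector([[1.0, 0.2, 0.9], [0.4, 0.6, 1.1]])
score = cosine_similarity(s1, s2)  # close to 1.0 since the vectors point similarly
```

The same cosine function applies whether the vectors come from averaged word embeddings or from LDA topic distributions; only the way the vectors are built differs.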
The final phase obtains the weights in the linear combination of term vectors from the conventional term-document matrix [10].

2.2.3. Knowledge Base STS Approaches

Recently developed state-of-the-art methods for determining similarity scores between text sample pairs incorporate knowledge-based linguistic features. These techniques use lexical relations and word-level semantic networks to assess relevance at the text (sentence) level. Electronic resources, such as lexical resources and knowledge bases, serve as the primary sources of information for these methods. The semantic similarity between two simple sentences is quantified by evaluating a global measure based on pairwise comparisons of word similarity within those sentences. Sentence-to-sentence semantic similarity is thus constructed by aggregating individual word semantic similarities [16]. One specific measure used for sentence-to-sentence similarity is the term set-to-term set measure, which represents an extreme case of this approach [38]. To compute this measure, the two texts being compared are queried separately in the corpora to determine the number of documents containing each text, and the number of documents in which both appear together is also queried. These queries are performed using a Lucene index built on the corpora. The cardinality of a word refers to the number of corpora in which the word appears, while the cardinality of a conjunction of words represents the number of documents in which both words appear [8]. Based on the principles used to assess the semantic similarity between words, knowledge-based semantic similarity methods can be further categorized into edge-counting methods, feature-based methods, and information content-based methods. These categories employ distinct techniques and measures to capture the semantic relatedness between words and extend them to sentence-level similarity assessments.

2.2.3.1.
Edge-counting Methods

A simple approach to measuring similarity between terms is to view the underlying ontology as a graph in which words are connected taxonomically. By counting the edges between two terms, we can gauge their similarity: the shorter the path between the terms, the more similar they are. This measure, known as "path", was proposed by Rada et al. [39]. It determines similarity as the inverse of the shortest path length between two terms. However, this simple edge count does not account for the fact that words lower in the hierarchy may have more specific meanings and can therefore be more similar to each other, even when the path between them is the same length as that between two terms denoting more general concepts. To address this, Wu and Palmer [39] proposed the "wup" measure, which treats the depth of words in the ontology as an important factor. The wup measure counts the number of edges between each term and their Least Common Subsumer (LCS), the common ancestor shared by both terms in the given ontology. Denoting two terms as t1 and t2, their LCS as t_lcs, and the shortest path length between them as min_len(t1, t2), the path measure is defined as:

sim_path(t1, t2) = 1 / (1 + min_len(t1, t2))

and wup is defined as:

sim_wup(t1, t2) = 2·depth(t_lcs) / (depth(t1) + depth(t2))

Li et al. [40] proposed a measure that takes into account both the minimum path distance and the depth. li is defined as:

sim_li(t1, t2) = e^(−α·min_len(t1, t2)) · (e^(β·depth(t_lcs)) − e^(−β·depth(t_lcs))) / (e^(β·depth(t_lcs)) + e^(−β·depth(t_lcs)))

However, edge-counting methods ignore the fact that the edges in an ontology need not be of equal length. To overcome this shortcoming of simple edge-counting methods, feature-based semantic similarity methods were proposed.

2.2.3.2. Feature-based Methods

Feature-based methods calculate similarity as a function of the properties of the words, such as gloss, neighboring concepts, and so on [12].
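The path and wup measures from Section 2.2.3.1 follow directly from their formulas. The sketch below uses a hypothetical taxonomy position (depths and a shortest path length invented for illustration) rather than a real ontology lookup.

```python
def sim_path(min_len):
    """path measure: inverse of (1 + shortest path length between the terms)."""
    return 1.0 / (1.0 + min_len)

def sim_wup(depth_t1, depth_t2, depth_lcs):
    """wup measure: depth of the Least Common Subsumer, scaled by both term depths."""
    return 2.0 * depth_lcs / (depth_t1 + depth_t2)

# Hypothetical taxonomy: two terms at depth 4 sharing an LCS at depth 3,
# with a shortest path of 2 edges between them.
p = sim_path(2)        # 1 / (1 + 2)
w = sim_wup(4, 4, 3)   # (2 * 3) / (4 + 4)
```

In practice the depths and path lengths would be read from an ontology such as WordNet; the formulas themselves are unchanged.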
Gloss is defined as the meaning of a word in a dictionary; a collection of glosses is called a glossary. Various semantic similarity methods have been proposed that use the glosses of words. Gloss-based semantic similarity measures exploit the observation that words with the same meanings share more common words in their glosses, so semantic similarity is measured as the extent of overlap between the glosses of the words under consideration. The measure in [41] assigns a relatedness value to two words based on the overlap between their glosses and the glosses of the concepts they are related to in an ontology such as WordNet [42]. [14] proposed a feature-based method in which semantic similarity is measured using the glosses of concepts present in Wikipedia. Most feature-based methods take into account common and non-common features between two words or terms: common features increase the similarity value, while non-common features decrease it. The major limitation of feature-based methods is their dependency on ontologies with semantic features, and most ontologies rarely incorporate any semantic features other than taxonomic relationships [12].

2.2.3.3. Information Content-based Methods

The information content (IC) of a concept is defined as the information derived from the concept when it appears in context [43]. A high IC value indicates that the word is more specific and describes a concept with less ambiguity, while a lower IC value indicates that the word is more abstract in meaning [44]. The specificity of a word is determined using Inverse Document Frequency (IDF), which relies on the principle that the more specific a word is, the less often it occurs across documents. IC-based methods measure the similarity between terms using the IC values associated with them.
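The gloss-overlap idea from Section 2.2.3.2 can be sketched as a simple count of shared words. The glosses below are invented for illustration; a real system would draw them from WordNet or Wikipedia.

```python
def gloss_overlap(gloss_a, gloss_b):
    """Count distinct words shared by two glosses (a crude Lesk-style signal)."""
    words_a = set(gloss_a.lower().split())
    words_b = set(gloss_b.lower().split())
    return len(words_a & words_b)

# Hypothetical dictionary glosses, invented for illustration.
g1 = "a domesticated animal kept for companionship"
g2 = "an animal kept by humans for companionship or work"
overlap = gloss_overlap(g1, g2)  # shares: animal, kept, for, companionship
```

Real gloss-based measures additionally weight multi-word overlaps and include glosses of related concepts, but the core signal is this overlap count.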
Resnik [45] proposed a semantic similarity measure called res, based on the idea that if two concepts share a common subsumer, they share more information, since the IC value of the LCS is higher. With IC denoting the information content of a given term, res is defined as:

sim_res(t1, t2) = IC(t_lcs)

Lin [46] proposed an extension of the res measure that takes into account the IC values of both terms, which capture the individual information of each term, as well as the IC value of their LCS, which captures the commonality shared between the terms. lin is defined as:

sim_lin(t1, t2) = 2·IC(t_lcs) / (IC(t1) + IC(t2))

Jiang and Conrath [47] calculate a distance measure based on the difference between the sum of the individual IC values of the terms and the IC value of their LCS:

dis_jcn(t1, t2) = IC(t1) + IC(t2) − 2·IC(t_lcs)

This distance replaces the shortest path length in the path measure, and the similarity is inversely proportional to the distance. Hence jcn is defined as:

sim_jcn(t1, t2) = 1 / (1 + dis_jcn(t1, t2))

The IC value can be derived either from an underlying corpus or from the intrinsic structure of the ontology itself [33], on the assumption that the ontology is structured in a meaningful way. Some terms may not be included in a single ontology, which provides scope for using multiple ontologies to calculate their relationship [13]. Based on whether both given terms are present in a single ontology or not, IC-based methods can be classified as mono-ontological or multi-ontological methods. When multiple ontologies are involved, the IC of the Least Common Subsumer in each ontology is accessed to estimate the semantic similarity values. Jiang et al. [48] proposed IC-based semantic similarity measures based on Wikipedia pages, concepts, and neighbors; Wikipedia was used both as a structured taxonomy and as a corpus providing IC values.
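A minimal sketch of the res, lin, and jcn measures, assuming the IC values have already been computed from a corpus; the numeric IC values below are invented for illustration.

```python
def sim_res(ic_lcs):
    """res: similarity is simply the IC of the Least Common Subsumer."""
    return ic_lcs

def sim_lin(ic_t1, ic_t2, ic_lcs):
    """lin: shared information (LCS) scaled by the terms' own information."""
    return 2.0 * ic_lcs / (ic_t1 + ic_t2)

def sim_jcn(ic_t1, ic_t2, ic_lcs):
    """jcn: inverse of the Jiang-Conrath distance, as in the equations above."""
    dis = ic_t1 + ic_t2 - 2.0 * ic_lcs
    return 1.0 / (1.0 + dis)

# Hypothetical IC values for two terms and their LCS (invented for illustration).
ic1, ic2, ic_lcs = 6.0, 8.0, 5.0
lin = sim_lin(ic1, ic2, ic_lcs)   # 10 / 14
jcn = sim_jcn(ic1, ic2, ic_lcs)   # 1 / (1 + 4)
```

Note how lin falls in [0, 1] by construction, while res grows with the specificity of the shared subsumer.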
Semantic Textual Similarity (STS) is a task in natural language processing that involves measuring how closely related pairs of text units are in terms of their meaning. To preprocess STS data, several steps are typically followed to enhance the accuracy of the similarity measurement:

1. Tokenization: The text is divided into individual words or tokens to establish the basic units of analysis.
2. Stop word removal: Common words that carry little semantic meaning, such as "a," "the," or "of," are removed to reduce noise and focus on more meaningful content.
3. Stemming and lemmatization: Words are reduced to their base or root forms to handle variations of the same word. This helps capture the core meaning and avoids redundancy.
4. Part-of-speech (POS) tagging: Each word is assigned a syntactic category, such as noun, verb, or adjective, to understand the grammatical structure and potential relationships between words.
5. Dependency parsing: The relationships and dependencies between words are analyzed to determine which words depend on others in terms of syntax and meaning.
6. Named entity recognition: Entities such as names of people, organizations, and locations are identified to handle their specific semantic significance.
7. Parsing trees: The syntactic structure of the sentence is represented using parse trees, which capture the hierarchical relationships between words.

By applying these preprocessing techniques, noise is reduced and the semantic meaning of the sentence pair is captured more accurately, which in turn improves the performance and accuracy of STS systems in measuring the similarity between texts.

2.3. Deep learning techniques

A Recurrent Neural Network (RNN) is a type of neural network that addresses the requirement for sequential information processing.
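The first two preprocessing steps listed above (tokenization and stop word removal) can be sketched with the Python standard library alone; the stop word set below is a tiny illustrative sample, and steps 3 through 7 would require an NLP library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "in", "to"}

def preprocess(text):
    """Lowercase, tokenize on letter runs (step 1), then drop stop words (step 2)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The meaning of a sentence is captured here.")
# -> ["meaning", "sentence", "captured", "here"]
```

For a language like Guragigna written in the Ethiopic script, the token pattern and stop word list would of course need to be adapted to that script.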
Unlike traditional neural networks, where inputs and outputs are treated independently, RNNs feed the previous output back in as input for the current step. This is particularly useful in tasks like predicting the next word in a sentence, where the context of previous words is necessary. To enable this sequential processing, RNNs introduce a Hidden Layer that plays a crucial role. This Hidden Layer, also known as the Hidden State or Memory State, retains information about the sequence's previous inputs; it acts as a memory that helps the network remember and utilize past inputs. One of the key advantages of RNNs is that they share parameters across different inputs or time steps: the same set of parameters is used for each input, performing the same operation on all of them. As a result, the number of parameters is reduced compared to other neural network architectures. In short, RNNs are specialized neural networks for sequential information processing; they use a hidden state to remember past inputs, enabling them to capture dependencies and context in sequential data, and their parameter sharing contributes to their efficiency [49]. RNNs perform especially well when modeling sequential data, where the context and order of the input items are important. They can process input sequences of varying lengths and identify the connections between the elements of a sequence. Because of this, RNNs are used for a variety of applications, including language modeling, machine translation, speech recognition, time series prediction, and sentiment analysis. RNNs are also effective at capturing long-term dependencies and contextual information within a sequence: an RNN's hidden state stores details of the inputs it has already seen, enabling the network to retain a memory of context.
Tasks that require analyzing a word or phrase in the context of the full sequence benefit from this contextual comprehension. Because RNNs can represent sequential data, they are frequently utilized in NLP tasks; language generation, text categorization, named entity recognition, sentiment analysis, question answering, and machine translation are just a few of the tasks to which they have been applied successfully. LSTM and GRU are RNN variants that have proven very successful at capturing long-range dependencies and reducing the vanishing gradient problem. RNNs are also a good fit for analyzing and predicting time series data, i.e., values arranged according to a specific time interval. They are helpful for applications like signal processing, anomaly detection, weather forecasting, and stock market prediction because they can extract temporal patterns and dependencies from the data. Transfer learning is made possible by pre-training RNN models on extensive language modeling tasks, such as training on a sizable corpus of text data; the pre-trained RNN models can then be fine-tuned on smaller labeled datasets for specific downstream tasks. By utilizing the language knowledge acquired during pre-training, this method enhances performance on the intended task. RNNs also allow examination of the hidden states and their temporal evolution, which contributes to their interpretability to some degree; this can provide insight into the attributes the model deems crucial for the task and aid in understanding how it makes decisions. RNNs are a popular option in machine learning because of their versatility in NLP and time series analysis tasks, as well as their capacity to handle sequential input and capture context and dependencies. Their adaptability and efficiency in modeling sequential information make them a useful tool for many applications [49].
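The recurrence at the heart of an RNN, where each new hidden state depends on the current input and the previous state, can be sketched with a single-unit toy cell. The weights here are fixed invented values rather than learned ones, so this only illustrates the forward recurrence, not training.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8):
    """Minimal single-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}).
    The hidden state h carries context from earlier steps into later ones."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # new state mixes input and past state
    return h

# The same values in a different order give different final states,
# showing the network's sensitivity to sequence order.
h_a = rnn_forward([1.0, 0.0, 0.0])
h_b = rnn_forward([0.0, 0.0, 1.0])
```

Real RNN layers operate on vectors with weight matrices and learn the weights by backpropagation through time, but the update rule is the same shape as this scalar sketch.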
LSTM (Long Short-Term Memory) is a special kind of recurrent neural network (RNN) design that addresses the vanishing gradient problem found in traditional RNNs. LSTM introduces a memory cell and three gates: an input gate, a forget gate, and an output gate. These gates control the flow of information into and out of the memory cell, allowing LSTMs to selectively retain or discard information over long sequences. The memory cell enables LSTMs to capture long-range dependencies and remember relevant information from earlier parts of the sequence. GRU (Gated Recurrent Unit) is another variant of the RNN architecture that addresses the vanishing gradient problem and has a simpler structure than LSTM. GRU also includes gating mechanisms, but it uses only two gates: an update gate and a reset gate. The update gate regulates how much of the previous hidden state should be maintained, while the reset gate determines the extent to which past information should be disregarded. GRU performs similarly to LSTM in many sequence modeling tasks while being computationally more efficient due to its reduced number of gates [49]. In some cases, information from both past and future inputs is important for understanding the current input in a sequence. Bidirectional RNNs (Bi-RNNs) address this by processing the input sequence in two directions, one forward and one backward, so that the hidden state of the network at each step is influenced by both past and future input contexts. Bi-RNNs are particularly useful in tasks where context from both directions is crucial, such as part-of-speech tagging or named entity recognition. By capturing information from both directions, Bi-RNNs provide improved context awareness and capture dependencies across the entire sequence [49]. Stacked RNNs involve stacking multiple recurrent layers on top of each other.
Each layer in the stack processes the input sequence sequentially, and its hidden state is passed as input to the next layer. Stacked RNNs allow for more complex representations and can capture hierarchical dependencies in the input data: the lower layers capture local dependencies, while the higher layers capture more abstract and global dependencies. Stacked RNNs can therefore enhance a model's ability to understand intricate relationships and patterns in sequential data, making them beneficial for tasks that require a deeper understanding of the input sequence. In general, RNNs are commonly utilized for Semantic Textual Similarity (STS) because assessing the semantic similarity between sentences requires considering the order and context of words and phrases. RNNs, with their recurrent connections and hidden states, can effectively model the sequential nature of sentences and encode contextual information, and they are appropriate for capturing the complex semantic linkages between phrases because they can record long-term dependencies. Furthermore, their flexibility in handling variable-length input and their capacity for transfer learning make RNNs a valuable choice for STS, allowing them to leverage pre-trained language models and improve performance on the task.

2.4. Overview of Guragigna Language

Guragigna, also known as Gurage or Guragegna, is a Semitic language spoken by the Gurage people in Ethiopia. It belongs to the Afro-Asiatic language family and specifically falls under the South Ethiopian Semitic branch. Guragigna is primarily spoken in the Gurage Zone, located in the southern part of the country. Guragigna has several dialects, with variations in vocabulary, pronunciation, and grammar across different Gurage communities. These dialects include Ezha, Cheha, Soddo, Inor, Gumer, Gura, Meskane, Muher, and Gyeto.
The language is characterized by a rich oral tradition and has its own unique writing system, the Guragigna script. However, that script is not widely used, and the majority of Guragigna speakers primarily use the Ethiopian script known as Fidel [50]. Guragigna exhibits a phonological system characterized by a diverse range of consonants and a set of five vowel phonemes. The language allows for complex syllable structures and permits consonant clusters in both initial and final positions. Stress typically falls on the penultimate syllable, while intonation plays a significant role in conveying meaning. Grammatically, Guragigna features noun and verb conjugation, adjective agreement with nouns, and a predominantly subject-verb-object word order. The language historically used the Ge'ez script, but in modern times it is commonly written using the Ethiopian script, an abugida; more recently, the Gurage Zone Culture and Tourism Office has prepared a new Guragigna script. Guragigna holds socio-cultural significance for the Gurage people, being intertwined with their traditions, folklore, and identity. Efforts are underway to preserve and promote Guragigna through educational initiatives and cultural events, contributing to the linguistic and cultural landscape of the Gurage Zone and Ethiopia as a whole [51][52][53]. Guragigna has influenced and been influenced by other Ethiopian languages, particularly Amharic, due to historical and geographical interactions. As a result, there are similarities in vocabulary and grammar between Guragigna and Amharic. The Gurage people, the native speakers of Guragigna, have a diverse cultural heritage and are known for their agricultural practices, craftsmanship, and music. Guragigna plays a significant role in preserving and transmitting their cultural traditions and expressions [51][52][53].
While Guragigna is primarily spoken within the Gurage community, there have been efforts to promote the language and its cultural significance through educational initiatives and documentation projects. These endeavors aim to preserve and enhance the understanding and use of Guragigna among its speakers and to promote appreciation for its linguistic and cultural richness [51][52][53].

2.5. Related works

The paper in [6] investigates several word embedding techniques (Word2Vec, GloVe, and FastText) to estimate the semantic similarity of Bengali sentences. Due to the unavailability of a standard dataset, this work developed a Bengali dataset containing 187,031 text documents with 400,824 unique words. Moreover, the work considers three semantic distance measures to compute the similarity between word vectors using cosine similarity: no weighting, term frequency weighting, and part-of-speech weighting. The performance of the proposed approach was evaluated on a developed dataset containing 50 pairs of Bengali sentences. The evaluation shows that FastText with continuous bag-of-words and a vector size of 100 achieved the highest Pearson's correlation (ρ) score of 77.28% [6]. The work in [8] offers three distinct methods for producing Arabic STS models that work well. The first is based on fine-tuning with automatic machine translation of English STS data into Arabic. The second strategy is based on integrating English data resources with Arabic models. The third strategy focuses on optimizing knowledge distillation-based models on a proposed translated dataset to improve their performance in Arabic. Using a very small collection of resources, a few hundred Arabic STS sentence pairs, the authors were able to obtain an 81% correlation score on the standard STS 2017 Arabic evaluation set.
Additionally, it was possible to expand the Arabic models to process two regional dialects, Saudi Arabian (SA) and Egyptian (EG) [13]. Determining how similar two sentences are in meaning is a crucial part of comprehending natural languages automatically. The problem of semantic similarity involves evaluating the closeness of sentence meanings. To address this problem, recurrent and recursive neural networks have been used and have shown significant improvements over basic models. These neural networks are designed to handle the structure of language: recurrent neural networks (RNNs) are suitable for processing sentences and understanding the relationships between words, while recursive neural networks (RecNNs) go further by considering the hierarchical structure of sentences. By utilizing recurrent and recursive neural networks, there have been notable enhancements in measuring semantic similarity, with reported improvements ranging from 16% to 70% over basic models. This highlights the effectiveness of these neural network approaches in evaluating the similarity of sentence meanings. These advancements contribute to better automated language understanding and have applications in tasks like question answering, information retrieval, and language translation [54]. Semantic Textual Similarity (STS) forms the foundation for numerous applications in Natural Language Processing (NLP). To measure the semantic similarity of sentences, one system combines convolutional and recurrent neural networks: it utilizes a convolutional network to consider the nearby context of words and a Long Short-Term Memory (LSTM) network to account for the overall context of sentences. By combining these networks, the system retains important sentence information and enhances the calculation of sentence similarity. The model has demonstrated favorable outcomes and is competitive with leading state-of-the-art systems [7].
The study in [9] attempted to develop an Amharic-English CLSTSM system utilizing a statistical topic-modeling-based semantic text similarity measurement approach. It helps native speakers of Amharic gauge the amount of web content available in Amharic by using a query in their own language. Publicly accessible Amharic and English text materials, making up comparable and non-comparable document collections, were used to test the system prototype. The LDA topic model methodology is used to turn the text documents into vectors by projecting the two texts into an LDA topic space, and three distinct techniques are used to measure the similarity of the two text documents. Across varying data sizes, the Jaccard algorithm outperforms the other matching algorithms with accuracy rates of 70%, 79%, 92%, and 96%; on non-comparable corpora, the Jaccard algorithm likewise surpasses the other algorithms with accuracy rates of 65%, 78%, 92%, and 95.6%. Measuring Semantic Textual Similarity (STS) is an important study area in NLP that plays a significant role in many applications such as question answering, document summarization, information retrieval, and information extraction. One paper evaluates Siamese recurrent architectures, a special type of neural network, used to measure STS; several variants of the architecture are compared with existing methods [55].
Table 2-1: List of related works

| Year [Ref] | Research work | Method | Accuracy | Dataset | Algorithm | Evaluation metric | Gaps/Features |
|---|---|---|---|---|---|---|---|
| 2021 [6] | "Word Embedding-based Textual Semantic Similarity Measure in Bengali" | Word embedding techniques and cosine similarity | 77.28% Pearson correlation | 187,031 text documents | No weighting, term frequency weighting, and part-of-speech weighting | Pearson's correlation (ρ) | Ambiguous words not considered; no RNN or CNN models used |
| 2022 [8] | "Semantic textual similarity for modern standard and dialectal Arabic using transfer learning" | Transfer learning with BERT embeddings | 81% Pearson correlation | 100 sentence pairs | Transfer learning | Pearson correlation | Large accuracy gap compared to English STS models; lacks a lexical standard for Arabic |
| 2022 [54] | "Deep learning based semantic similarity detection using text data" | Combined LSTM and CNN with word embeddings | 70% accuracy | 404,290 question pairs | LSTM and CNN | Precision, recall, and F1 | Insufficient experimental detail (actual scores and model predictions not explained); compared with unrelated works |
| 2018 [7] | "Predicting the Semantic Textual Similarity with Siamese CNN and LSTM" | CNN and LSTM | 0.79 Pearson correlation | 9,927 sentence pairs | LSTM and CNN | Pearson (r) and Spearman (ρ) correlation coefficients, Mean Squared Error | No information about pre-training embeddings; lack of annotated corpora |
| 2019 [55] | "Semantic Textual Similarity with Siamese Neural Networks" | Siamese neural networks with word embeddings | 0.81 Pearson correlation | 9,927 sentence pairs | Siamese neural networks | Pearson correlation | Better performance on smaller datasets; only training data available |
| 2021 [8] | "Cross-Language Semantic Text Similarity Measurement using Statistical Topic Model: The Case of Amharic-English Languages" | LDA topic model and Jaccard algorithm | 96% | 1,200 comparable and non-comparable texts | Cosine, Jaccard, and Hellinger | Precision, recall, and F1 | Relies primarily on word co-occurrence statistics and fails to incorporate the semantic meaning of words; the model may not always align with human interpretation |

Summary of Related Work

Research on STS has been conducted in different ways, with varying approaches, for foreign languages. Nevertheless, no deep learning study on STS has been done for Guragigna or other local languages. The studies examined above, conducted on other languages, largely favor deep learning algorithms over conventional STS approaches. The purpose of this work is to apply deep learning techniques to the development of semantic textual similarity for the Guragigna language. We conducted experiments using LSTM, Bi-RNN, GRU, and Stacked RNN models, since, as seen in the works reviewed above, the majority of recent research for various languages has relied on such models. To assess the effectiveness of the models, Mean Squared Error (MSE) is used as the evaluation metric; it compares system output to reference sentences that have been manually scored in order to assess score correctness. We employed preprocessing approaches and optimization strategies to improve the MSE and training speed of the deep learning STS models, hence reducing the complexity of the work. Based on the gaps described in Table 2-1, this thesis explores alternative approaches and different embedding techniques to bridge the gap in STS model accuracy. It investigates domain-specific pre-training techniques and leverages annotated Guragigna datasets to improve the performance of semantic similarity models. In addition, the thesis provides complete experimental setups, with detailed descriptions of model architectures, hyper-parameters, and evaluation metrics.
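The MSE metric adopted as the evaluation measure above can be computed directly from paired system and reference scores; the similarity scores below are invented for illustration.

```python
def mean_squared_error(predicted, reference):
    """MSE between system similarity scores and manually assigned reference scores."""
    assert len(predicted) == len(reference)
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical similarity scores on a 0-1 scale, invented for illustration.
system = [0.90, 0.40, 0.75]
gold = [1.00, 0.50, 0.70]
mse = mean_squared_error(system, gold)  # (0.01 + 0.01 + 0.0025) / 3 = 0.0075
```

Lower MSE is better; a model that exactly reproduced the human scores would achieve an MSE of zero.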
It compares the proposed models with relevant approaches, highlighting the strengths and weaknesses of each based on the results. Moreover, by collaborating with linguists or using different data sourcing platforms to create an annotated corpus, these models allow for more accurate and reliable evaluation. Th