SCHOOL OF GRADUATE STUDIES

DEVELOPING SEMANTIC TEXTUAL SIMILARITY FOR GURAGIGNA LANGUAGE USING DEEP LEARNING APPROACH

MSc. THESIS

GETNET DEGEMU

May 28, 2024
WOLKITE, ETHIOPIA

Wolkite University
School of Graduate Studies

Developing Semantic Textual Similarity for Guragigna Language Using a Deep Learning Approach

An MSc thesis submitted to the School of Graduate Studies in partial fulfillment of the requirements for the Degree of Master of Science in Computer Science and Engineering

Getnet Degemu Besir

Major Advisor: Sintayehu Hirpassa (Ph.D.)
Co-Advisor: Abdo Ababor (MSc)

May 28, 2024
Wolkite, Ethiopia

APPROVAL SHEET
Wolkite University
School of Graduate Studies

As thesis advisor, I hereby certify that all comments given by the reviewers have been considered and that I have read and evaluated the thesis entitled "Developing Semantic Textual Similarity for Guragigna Language Using a Deep Learning Approach", submitted by Getnet Degemu.

Getnet Degemu              ___________       5/28/2024
Name                       Signature         Date

Approved by:
Sintayehu Hirpassa (Ph.D.)                   25/05/2024
Name of Major Advisor      Signature         Date

Abdo Ababor (MSc)          _______________   _____________
Name of Co-Advisor         Signature         Date

Kindie Biredagn (Ph.D.)                      25/05/2024
Name of External Examiner  Signature         Date

Worku Muluye               _______________   _____________
Name of Internal Examiner  Signature         Date

_______________            _______________   _______________
Name of DGC Chairman       Signature         Date

_______________            _______________   _______________
Name of CPG Coordinator    Signature         Date

_______________            _______________   _______________
SGS Approval               Signature         Date

APPROVAL SHEET
WOLKITE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby certify that we have read and evaluated this thesis, titled "Developing Semantic Textual Similarity for Guragigna Language Using a Deep Learning Approach", prepared under our guidance by Getnet Degemu Besir. We recommend that the thesis be submitted as fulfilling the requirements for the award of an MSc degree in Computer Science and Engineering.

Sintayehu Hirpassa (Ph.D.)                   25/05/2024
Major Advisor              Signature         Date

Abdo Ababor (MSc)          ______________    ______________
Co-Advisor                 Signature         Date

As members of the Board of Examiners of the Master of Science thesis open defense examination, we have read and evaluated this thesis prepared by Getnet Degemu Besir and examined the candidate. We hereby certify that the thesis is accepted as fulfilling the requirements for the award of the degree of Master of Science (M.Sc.) in Computer Science and Engineering.

Kindie Biredagn (Ph.D.)                      25/05/2024
Name of External Examiner  Signature         Date

Worku Muluye               ______________    ______________
Name of Internal Examiner  Signature         Date

________________________   ______________    ______________
Name of Chairman           Signature         Date

Final approval and acceptance of the thesis is contingent upon the submission of its final copy to the Council of Postgraduate Program (CPCS), through the candidate's department or school graduate committee (DGC or SGC).

DEDICATION

I dedicate the present work to my daughter, Ayana Getnet (ኤያና).

DECLARATION

By my signature below, I declare and affirm that this thesis is my own work. I have followed all ethical principles of scholarship in the preparation, data collection, data analysis, and completion of this thesis. All scholarly matter included in the thesis has been given recognition through citation, and I affirm that I have cited and referenced all sources used in this document. Every serious effort has been made to avoid any plagiarism in the preparation of this thesis.

This thesis is submitted in partial fulfillment of the requirements for a degree from the School of Graduate Studies at Wolkite University. The thesis is deposited in the Wolkite University Library and is made available to borrowers under the rules of the library. I solemnly declare that this thesis has not been submitted to any other institution anywhere for the award of any academic degree, diploma, or certificate.
Brief quotations from this thesis may be used without special permission, provided that accurate and complete acknowledgement of the source is made. Requests for permission for extended quotation from, or reproduction of, this thesis in whole or in part may be granted by the Head of the School or Department or the Dean of the School of Graduate Studies when, in his or her judgment, the proposed use of the material is in the interest of scholarship. In all other instances, however, permission must be obtained from the author of the thesis.

Name: Getnet Degemu
Signature: _________
Date: 05/28/24
Department: Computer Science and Engineering

ACKNOWLEDGMENT

First of all, I give thanks to God and Saint Mary (የረሸ ማርያም) for all of their support in my life. Next, I am grateful to Dr. Sintayehu H., my advisor, for his suggestions, assistance, and invaluable insights throughout this research. His expertise and encouragement have been instrumental in shaping the direction of this work and in pushing me to achieve my best. I would also like to convey my heartfelt appreciation to my co-advisor, Abdo Ababor (MSc), whose guidance has greatly influenced and enhanced my work; I am truly grateful for your deep guidance, constructive comments, and sharing of your experience and skills. Working with you has been a privilege, and I am fortunate to have had your support and mentorship throughout this work.

I would also like to convey my appreciation to the CCI college members of the Department of Software Engineering at Wolkite University for their continuous support and encouragement. Their dedication to academic excellence and their commitment to fostering a conducive learning environment have greatly enriched my educational experience. Special thanks go to my family, colleagues (Alex), and friends (Temesgen, Tsegaye F, D Yalewu F, D Dawit F and D Cheru T), who have been a source of inspiration and motivation.
I thank them for their insightful discussions, constructive feedback, and support in collecting data for this research. Lastly, I extend this acknowledgment to my beloved wife, Marta W. (ማዬ), for her unwavering support, love, and understanding. Your presence in my life has been a constant source of inspiration and strength, and your unwavering belief in my abilities, even during the most challenging times, has propelled me forward and given me the confidence to follow my dreams.

Name: Getnet Degemu
Signature: _________
Date: 05/28/24
Department: Computer Science and Engineering

ABBREVIATIONS AND ACRONYMS

BERT      Bidirectional Encoder Representations from Transformers
BRNN      Bi-Directional Recurrent Neural Network
CNN       Convolutional Neural Network
DL        Deep Learning
EXP       Expert
GloVe     Global Vectors for Word Representation
GLSA      Generalized Latent Semantic Analysis
GPU       Graphics Processing Unit
GRU       Gated Recurrent Unit
IDE       Integrated Development Environment
IDF       Inverse Document Frequency
LSTM      Long Short-Term Memory
ML        Machine Learning
NLP       Natural Language Processing
NN        Neural Network
PDF       Portable Document Format
PNG       Portable Network Graphics
ReLU      Rectified Linear Unit
RNN       Recurrent Neural Network
S1        Sentence One
S2        Sentence Two
SRNN      Stacked Recurrent Neural Network
STS       Semantic Textual Similarity
SVD       Singular Value Decomposition
SVG       Scalable Vector Graphics
TXT       Text File Extension
USE       Universal Sentence Encoder
UTF-8     Unicode Transformation Format, 8-Bit
Word2vec  Word to Vector

TABLE OF CONTENTS

APPROVAL SHEET
DEDICATION
DECLARATION
ACKNOWLEDGMENT
ABBREVIATIONS AND ACRONYMS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF TABLES IN THE APPENDIX
LIST OF FIGURES IN THE APPENDIX
ABSTRACT
CHAPTER ONE
1. INTRODUCTION
1.1. Background of the Study
1.2. Motivation
1.3. Statement of the Problem
1.4. Research Questions
1.5. Objective
1.5.1. General objective
1.5.2. Specific objective
1.6. Scope of the study
1.7. Limitations of the study
1.8. Significance of the study
1.9. Organization of the thesis
CHAPTER TWO
2. LITERATURE REVIEW
2.1. Introduction
2.2. The Semantic Textual Similarity Approach
2.2.1. String-based similarity
2.2.2. Corpus-Based Approaches
2.2.3. Knowledge-Base STS Approaches
2.3. Deep learning techniques
2.4. Overview of Guragigna Language
2.5. Related works
CHAPTER THREE
3. MATERIAL AND METHODS
3.1. Introduction
3.2. Proposed Approach
3.3. Material and Tools
3.3.1. Hardware Tools
3.3.2. Software Tools
3.4. Corpus
3.5. Preprocessing
3.5.1. Removing extra spaces
3.5.2. Removal of stop-words
3.5.3. Removing punctuation
3.5.4. Tokenization
3.6. Universal Sentence Encoder (USE)
3.7. Word embedding
3.7.1. Global Vectors for Word Representation (GloVe)
3.7.2. Word2Vec (Word to Vector)
3.8. Optimization Algorithms in STS using deep learning
3.9. Performance Measurement Methods
3.10. Accuracy
3.11. Evaluation Metrics
3.11.1. Mean Squared Error (MSE)
3.11.2. Process of adapting a pre-trained model
CHAPTER FOUR
4. RESEARCH DESIGN
4.1. Corpus Preparation
4.2. Architecture of Developing STS for Guragigna Language
4.3. Pre-processing
4.4. Data Splitting
4.5. Vectorization
4.6. Model Selection
4.6.1. Long Short-Term Memory Model
4.6.2. Bi-directional RNN
4.6.3. Gated Recurrent Unit (GRU)
4.6.4. Stacked RNN
CHAPTER FIVE
5. EXPERIMENTATION
5.1. Introduction
5.2. Data Collection and Preparation
5.3. Environment of implementation
5.3.1. Removing extra spaces
5.3.2. Removal of stop-words
5.3.3. Removing punctuation
5.3.4. Tokenization
5.4. Embedding Process
5.5. Parameter Selection
CHAPTER SIX
6. RESULT AND DISCUSSION
6.1. Introduction
6.2. Experimental Result
6.3. Discussion on the Result
6.4. Confusion matrix
6.5. Summary
6.6. Evaluation by Linguistic Experts
6.7. Answering Research Questions
CHAPTER SEVEN
7. CONCLUSION AND RECOMMENDATION
7.1. Overview
7.2. Conclusion
7.3. Contribution and challenges
7.4. Future work
7.5. Recommendation
REFERENCES
APPENDICES
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I

LIST OF TABLES

Table 1-1: The weakness of lexical matching in capturing semantic similarity
Table 2-1: List of related works
Table 3-1: Tools and materials
Table 5-1: Source of Data Collection
Table 5-2: List of Hyper-Parameters

LIST OF FIGURES

Figure 4-1: Architecture of Developing STS of Guragigna Language
Figure 4-2: LSTM Encoder-Decoder Architecture
Figure 4-3: Architecture of Bi-directional RNN Model
Figure 4-4: Gated Recurrent Unit (GRU) Model Architecture
Figure 4-5: Architecture of Stacked RNN
Figure 5-1: Sample code for removing spaces and newlines
Figure 5-2: Sample code for removing stopwords
Figure 5-3: Sample code for removing punctuation
Figure 5-4: Sample code for tokenization
Figure 5-5: Sample code for embedding
Figure 6-1: LSTM Training Using USE Embedding
Figure 6-2: LSTM Training Using GloVe Embedding
Figure 6-3: LSTM Training Using Word2vec Embedding
Figure 6-4: LSTM Model Comparison of Actual and Predicted Similarity Scores
Figure 6-5: GRU Training History Using USE Embedding
Figure 6-6: GRU Training History Using GloVe Embedding
Figure 6-7: GRU Training History Using Word2vec Embedding
Figure 6-8: GRU Model Comparison of Actual and Predicted Similarity Scores
Figure 6-9: Bidirectional RNN Training History Using USE Embedding
Figure 6-10: Bidirectional RNN Training History Using Word2vec Embedding
Figure 6-11: Bidirectional RNN Training History Using GloVe Embedding
Figure 6-12: Bidirectional RNN Model Comparison of Actual and Predicted Similarity Scores
Figure 6-13: Stacked RNN Training History Using USE Embedding
Figure 6-14: Stacked RNN Training History Using GloVe Embedding
Figure 6-15: Stacked RNN Training History Using Word2vec Embedding
Figure 6-16: Stacked RNN Model Comparison of Actual and Predicted Similarity Scores
Figure 6-17: LSTM confusion matrix
Figure 6-18: GRU confusion matrix
Figure 6-19: Bidirectional RNN confusion matrix
Figure 6-20: Stacked RNN confusion matrix

LIST OF ALGORITHMS

Algorithm 3-1: Algorithm for removing extra spaces from the dataset
Algorithm 3-2: Algorithm for removing stop words from the dataset
Algorithm 3-3: Algorithm for removing punctuation from the dataset
Algorithm 3-4: Algorithm for tokenizing the dataset
Algorithm 3-5: Algorithm for the Universal Sentence Encoder (USE)
Algorithm 3-6: Algorithm for Global Vectors for Word Representation (GloVe)
Algorithm 3-7: Algorithm for Word2Vec (Word to Vector)

LIST OF TABLES IN THE APPENDIX

1. Sample of Corpus
2. Punctuation marks commonly used in Guragigna Language
3. Sample List of Stop Words of Guragigna

LIST OF FIGURES IN THE APPENDIX

1. Alphabet of Guragigna
2. Required Python Libraries
3. Sample Train Data
4. Sample output of model prediction
5. Sample output of model Loss and Accuracy
6. Sample pairs of sentences with similarity scored by experts

ABSTRACT

Natural language processing (NLP) is one measure of how far the world has come in terms of technology. It is the process of teaching human language to machines and includes everything from morphological analysis to pragmatic analysis.
Semantic Similarity is one of the highest levels of NLP. The Previous Semantic textual similarity (STS) studies have been conducted using from string-based similarity methods to deep learning methods. These studies have their limitations, and no research has been done for STS in the local language using deep learning. STS has significant advantages in NLP applications like information retrieval, information extraction, text summarization, data mining, machine translation, and other tasks. This thesis aims to present a deep learning approach for capturing semantic textual similarity (STS) in the Guragigna language. The methodology involves collecting a Guragigna language corpus and preprocessing the text data and text representation is done using the Universal Sentence Encoder (USE), along with word embedding techniques including Word2Vec and GloVe and mean Square Error (MSE) is used to measure the performance. In the experimentation phase, models like LSTM, Bidirectional RNN, GRU, and Stacked RNN are trained and evaluated using different embedding techniques. The results demonstrate the efficacy of the developed models in capturing semantic textual similarity in the Guragigna language. Across different embedding techniques, including Word2Vec, GloVe, and USE, the Bidirectional RNN model with USE embedding achieves the lowest MSE of 0.0950 and the highest accuracy of 0.9244. GloVe and Word2Vec embedding also show competitive performance with slightly higher MSE and lower accuracy. The Universal Sentence Encoder consistently emerges as the top-performing embedding across all RNN architectures. The research results demonstrate the effectiveness of LSTM, GRU, Bi RNN, and Stacked RNN models in measuring semantic textual similarity in the Guragigna language. Keywords: Semantic textual similarity, Guragigna language, deep learning, corpus-based approaches, LSTM, GRU, Bidirectional RNN, Stacked RNN and Word embedding. 1 CHAPTER ONE 1. INTRODUCTION 1.1. 
Background of the Study NLP means doing computations in natural language. Semantic analysis is one of the processes involved in natural language processing. When building the syntactic structure of the sentence the input sentence analysis does a semantic analysis of the sentence and Sentences are given meaning by semantic interpretation. Logical forms are mapped to knowledge representations by contextual interpretation. The semantic similarity of features in a vector model is the fundamental building block of semantic analysis.[1]. The comparison of text meaning known as semantic text similarity (STS) plays a vital role in various tasks within natural language processing (NLP) like information retrieval, categorization, content extraction, answering questions, and identifying plagiarism. Text similarity between simple sentence is an important and necessary task in many information retrieval applications. Performance of many natural language processing (NLP) applications like text summarization, machine translation, plagiarism detection, and sentiment analysis. It also relies on similarity of text and meaning. Several other applications have used similarity such as text classification, feedback on relevancy, word disambiguation, subtopic mining, and web search[2]. Similarity measures for many languages such as English, Spanish and Arabic are available, and some have been organized by the organizers of SemEval ST for calculating similarity between multilingual and monolingual simple sentence research duties [2]. One typical approach for computing similarity is lexical matching between simple sentence. A similarity score is determined using the quantity of terms that belong to both text segments. These metrics however, are only able to calculate similarities at a very basic level. Furthermore, this matching can only estimate text similarity but not semantics. Consider two simple sentence “ሁት ሜና ነረን ባረም ተሳረምታ ቸነም” (does he has a work? 
He asked) and “ሁት ሜና ኤነን ባረም ተሳረምታ ቸነም” (doesn’t he has a work? He asked). As indicated by the lexical assignment in both sentences he has two headwords ("ሁት" and "ሜና"). But these he has no semantic connection between the two simple sentence. Consider another pair of sentences: “አት አርች ቸዋች ተሐረ አወገዳታ ብ𞟠 ንስራነ ቧረንም” (A boy went to cry with his good friend) and “አማት አርች ሶሬሳ ተሐረ አወገዳታ ብ𞟠 ንወነ 2 ቧረንም” (A boy went to cry with his good friend). There is no clear terminology present in these two sentences. but there are clear semantic similarities [2]. Table 1-1: The weakness of lexical matching in capturing semantic similarity Sentence 1 Sentence 2 Similarity “ሁት ሜና ነረን ባረም ተሳረምታ ቸነም” “ሁት ሜና ኤነን ባረም ተሳረምታ ቸነም” Lexically similar but not semantically “አት አርች ቸዋች ተሐረ አወገዳታ ብ𞟠 ንስራነ ቧረንም” “አማት አርች ሶሬሳ ተሐረ አወገዳታ ብ𞟠 ንወነ ቧረንም” Semantically similar but not lexically Similarities between Guragigna simple sentence is more difficult than simple sentence in other languages. One of the main reasons is that the Guragigna resources are not comparable to those of any other language. Other text preprocessing includes the well-known tokenizers, stemmers, and lemmatizes used in almost every NLP task, and their performance is arguably even better. But on the contrary. This kind of tool is less common in Guragigna simple sentence. Additionally, there are well-organized resources such as WordNet, NLP POS-Tagger, and more. This improves the performance of similarity estimation methods and Guragigna text methods therefore, lack such tools and resources [2]. Trying to overcome the challenge of capturing semantic similarities between Guragigna text pairs. We introduced a method to measure the semantic similarity of Guragigna simple sentence Based on Deep learning techniques an efficient Guragigna algorithm for measuring semantic text similarity is used. Prepare a dataset that can be used to test the performance of the Guragigna text semantic similarity measure [2]. 1.2. 
Motivation
Research in natural language processing has been motivated primarily by the prospect of better understanding the structure and function of human language and of building natural language interfaces that facilitate communication between humans and computers. Recently, considerable research on semantic sentence similarity has been carried out internationally; semantic textual similarity systems have been developed for foreign languages such as English, Arabic, Spanish, and Bengali [2]. However, research on local Ethiopian languages, and on Guragigna in particular, remains very limited, and no semantic textual similarity system has yet been developed for the Guragigna language. This gap motivates the present study.

1.3. Statement of the Problem
Semantic Textual Similarity (STS) is an essential component of natural language processing, with significant implications for various tasks and applications. In information retrieval (IR), STS plays a vital role by measuring the similarity between user queries and documents, enabling precise retrieval of relevant information. STS is also valuable in information extraction, where it helps mine unstructured text for useful information by measuring semantic similarity between different pieces of text. Another important application is text summarization, where STS helps identify similar or redundant content within a document, making it easier to create informative summaries. STS likewise supports data mining, where it aids in clustering similar instances or identifying similar patterns. In machine translation, STS improves the accuracy of translations by capturing the semantic similarity between source- and target-language sentences. In question answering systems, STS helps determine the similarity between user queries and candidate answers, leading to more accurate responses.
STS is also relevant in sentiment analysis, where it measures similarity between sentiment-bearing texts, aiding in tasks such as sentiment classification. Additionally, STS aids in paraphrase detection, which is crucial for tasks like plagiarism detection and text generation. Overall, STS is a fundamental concept in natural language processing that enhances the efficiency and accuracy of language understanding across various domains.

Guragigna is one of the most widely spoken languages in Ethiopia; it is an Afro-Asiatic language of the Southern Ethiopian Semitic branch spoken by the Gurage people. According to the 2007 Census, there are currently over 6.8 million native speakers of the language. The language is used in the middle grades of elementary school and in various community institutions [3], as well as in media and publishing: Wolkite radio broadcasts in the language, and magazines, textbooks, and fiction are published in it.

The limited study of the Guragigna language can be explained by several factors. NLP tasks require significant linguistic resources, such as annotated corpora and language models, which are typically developed for languages with greater demand and research backing; as a result, Guragigna lacks the resources needed to support advanced NLP research. Data availability also plays a crucial role in NLP, and under-studied languages suffer from a shortage of publicly available language resources, which hinders the development and evaluation of NLP systems for Guragigna. Moreover, the absence of practical applications or tools, for instance machine translation systems or part-of-speech taggers for Guragigna, further indicates the limited research and development in these areas.
For these reasons, only a few research studies have been conducted on the Guragigna language across natural language processing (NLP) tasks, for instance part-of-speech tagging and machine translation [4], automatic Guragigna character recognition [5], and others. International academic research databases such as IEEE Xplore, the ACM Digital Library, and Google Scholar, as well as local repositories associated with Ethiopian universities, linguistic research institutions, and language departments, were searched using keywords such as "Guragigna language", "semantic textual similarity", and "NLP". No specific studies on STS for Guragigna were found, although related research exists in the broader NLP field and in studies on STS for other languages, such as Bengali [6], English [7], and Arabic [8]. These studies showed promising results but also had limitations, including not utilizing any RNN or CNN models; a large gap in accuracy compared to English STS models; the lack of a lexical standard for Arabic; insufficient experimental detail (failure to explain actual scores and model predictions) and comparison with unrelated work; a lack of information about pre-trained embeddings; a shortage of annotated corpora; good performance only on smaller datasets; and limited availability of training data. Additionally, a study on the Amharic language [9] attempted to develop an Amharic-English CLSTSM system using a statistical topic-modeling-based semantic text similarity measurement approach. Such a model, built on statistical topic modeling techniques like LDA, has the disadvantage that it relies primarily on word co-occurrence statistics and fails to incorporate the semantic meaning of words; as a result, the topics it generates may not always align with human interpretation of the underlying themes.
Therefore, based on these problems, research on semantic textual similarity is both necessary and important for information retrieval, information extraction, text summarization, data mining, machine translation, and related tasks, and this motivates the present study.

1.4. Research Questions
At the end of this study, the following research questions are investigated and answered.
RQ1. Which word embedding techniques can be used for model development to maximize the effectiveness and robustness of Semantic Text Similarity (STS)?
RQ2. Which deep learning model is the most effective for performing Semantic Text Similarity (STS) analysis for the Guragigna language?

1.5. Objective
1.5.1. General objective
 The general objective is to develop a semantic textual similarity analyzer for the Guragigna language using a deep learning approach.
1.5.2. Specific objectives
 To prepare a semantic text similarity corpus for the Guragigna language.
 To develop word embedding techniques for the Guragigna Semantic Text Similarity (STS) analyzer.
 To develop a deep learning model for the Guragigna semantic text similarity (STS) analyzer.
 To measure the performance of the word embedding techniques in conjunction with the deep learning models.
 To measure the effect of each deep learning algorithm on the Guragigna Semantic Text Similarity (STS) model.

1.6. Scope of the study
This study examines the semantic similarity of simple sentences in the Cheha dialect of the Guragigna language, focusing specifically on sentence-level semantic similarity. The study employs a deep learning approach to investigate semantic similarity in the context of Guragigna, with the goal of developing a model that can accurately measure semantic similarity between Guragigna sentences. To facilitate the development and evaluation of the model, a dataset is prepared, consisting of annotated Guragigna sentence pairs along with their corresponding similarity scores.
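To make the corpus format concrete, a record of such a dataset can be thought of as a sentence pair plus a similarity score. The sketch below is purely illustrative: the actual file format and score scale used in this thesis are not specified here, and the tab-separated layout, the 0-5 score scale (as used in SemEval STS), and the example scores are all assumptions.

```python
import csv
import io

# Illustrative only: hypothetical TSV records "sentence1 <TAB> sentence2 <TAB> score";
# the scores 1.0 and 4.5 are invented for this example, not taken from the thesis corpus.
sample = (
    "ሁት ሜና ነረን ባረም ተሳረምታ ቸነም\tሁት ሜና ኤነን ባረም ተሳረምታ ቸነም\t1.0\n"
    "አት አርች ቸዋች ተሐረ አወገዳታ ብ𞟠 ንስራነ ቧረንም\tአማት አርች ሶሬሳ ተሐረ አወገዳታ ብ𞟠 ንወነ ቧረንም\t4.5\n"
)

def load_sts_pairs(text):
    """Parse (sentence1, sentence2, score) records from tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(s1, s2, float(score)) for s1, s2, score in reader]

pairs = load_sts_pairs(sample)
print(len(pairs), pairs[0][2])  # 2 1.0
```

A format of this kind lets annotated pairs be loaded uniformly for training and evaluation, whatever concrete scale the annotators use.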
The dataset covers diverse sentence pairs, representing various semantic relationships and degrees of similarity. The study aims to leverage deep learning techniques to advance the understanding and capabilities of semantic similarity analysis in the Guragigna language.

1.7. Limitations of the study
The analysis focuses on simple sentences, meaning that the findings may not directly apply to more complex sentence structures or longer texts, and the study is limited to the Cheha dialect, which could restrict its generalizability to other dialects or languages. The effectiveness of the model developed in this study depends heavily on the quality and representativeness of the dataset used; any limitations or biases in the dataset may affect the accuracy and reliability of the model's results. Lastly, this study primarily addresses general semantic relationships in Guragigna sentences and does not address domain-specific semantic similarity analysis. By considering these limitations, we can better interpret and contextualize the findings of the study.

1.8. Significance of the study
Semantic text similarity is an important and fundamental task in natural language processing (NLP). Being able to compare semantic similarity between sentences has applications in many fields, including plagiarism detection, search engines, and customer service. The development of STS offers distinct benefits to both the Gurage language community and the research community.
 For the Gurage community: Semantic Textual Similarity (STS) holds great significance for the Gurage community by contributing to language preservation, technology development, education, information retrieval, and cultural representation.
STS enables the accurate measurement of semantic similarity between Gurage language sentences, helping to preserve and revitalize the language and facilitating the development of language technologies specific to the community's needs. It supports educational applications, empowering learners to improve their language proficiency. Lastly, it promotes cultural representation and identity by conveying the unique aspects of the Gurage community's language and culture in various domains. Overall, STS empowers the Gurage community in communication, information access, and language preservation for future generations.
 For the research community: this work provides a preprocessing component that enables researchers to build more advanced NLP applications for the Guragigna language.

1.9. Organization of the thesis
The rest of this thesis is organized as follows. In Chapter 2, we explain the different approaches used to develop Semantic Text Similarity (STS) and review related work on developing STS for the Guragigna language. Chapter 3 focuses on the methodology employed in this study. Chapter 4 presents the design and implementation of the proposed STS system for Guragigna. In Chapter 5, we present the experimental results of the proposed system, and Chapter 6 discusses those results. Finally, in Chapter 7, we conclude the thesis by highlighting the research contributions and discussing future work.

CHAPTER TWO
2. LITERATURE REVIEW
2.1. Introduction
Semantic similarity plays a crucial role in many Natural Language Processing (NLP) applications. One fundamental task in this field is Semantic Textual Similarity (STS), which involves assessing the similarity between documents. To determine this similarity, a metric is used to evaluate the direct and indirect relationships among the documents.
By identifying semantic relations, we can measure and recognize these relationships accurately [8][10]. The primary objective of the STS task is to establish a unified framework that combines different independent semantic components and assesses the influence of these components on different NLP tasks. Developing such a framework is a crucial research challenge with significant applications in NLP, including information retrieval (IR) and text summarization [4], [11], as well as question answering [12], relevance feedback [13], text classification [14], word sense disambiguation (WSD), and extractive summarization [15]. Semantic similarity is relevant not only to NLP applications but also to various semantic web applications, including information extraction, ontology generation, and disambiguation. It is particularly valuable in search [50], where the ability to accurately measure semantic relatedness across entities is important for IR; a key problem is retrieving documents or images that are semantically related to a user's query in a web search engine, including retrieving images based on their captions [11]. Text similarity also has applications beyond NLP and the semantic web, extending into databases. In database systems, text similarity can be leveraged for schema matching, addressing the challenge of semantic heterogeneity in data sharing, data integration, message passing, and peer-to-peer data management systems [16]. Additionally, text similarity is beneficial for relational join operations, particularly when the join attributes exhibit textual similarity. The utility of text similarity spans various application domains, including the integration and querying of data from diverse sources, data cleansing, and data mining [17]. In NLP, STS is connected to both Textual Entailment (TE) and paraphrasing, although there are differences between them.
In TE, a directional relationship is established between two text fragments, which are treated as the "text" (t) and the "hypothesis" (h). Paraphrase identification, in contrast, aims to recognize text fragments that have approximately the same meaning within a specific context. TE and paraphrasing therefore yield a yes/no decision, while STS goes a step further by evaluating the degree of equivalence between texts and assigning a graded rating to their semantic connection.

2.2. Semantic Textual Similarity Approaches
2.2.1. String-based similarity
String-based similarity methods evaluate text from a lexical standpoint, working only with character and string sequences. They are widely used in NLP tasks to compare phrases, sentences, and other text fragments, and can be used to gauge the level of surface (lexical) relatedness between two strings.

2.2.1.1. Character-Wise Approach
LCS and N-grams are two of the most common approaches at the character level. The Longest Common Substring (LCS) algorithm uses dynamic programming to find the length of substrings common to both terms, while the N-gram approach considers sub-sequences of n items of a term; N-gram distance is computed by dividing the number of shared n-grams by the maximal number of n-grams available. The Longest Common Substring (LCS) algorithm identifies the longest shared substring between two strings: it compares the two strings and determines their similarity by examining the longest sequence of characters they have in common [16].
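The dynamic-programming idea behind the longest common substring can be sketched in a few lines of Python (a minimal illustration, not the thesis implementation):

```python
def longest_common_substring(s1, s2):
    """Length of the longest substring shared by s1 and s2.

    table[i][j] holds the length of the longest common suffix of
    s1[:i] and s2[:j]; the answer is the maximum entry in the table.
    """
    m, n = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

print(longest_common_substring("similarity", "simulate"))  # "sim" -> 3
```

The table-based recurrence is exactly the LCSuff function referenced in the formal definition below; taking the maximum over all cells yields the longest common substring length.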
The measurement can be computed as follows:

LCSubstr(S1, S2) = max_{1 <= i <= m, 1 <= j <= n} LCSuff(S1[1..i], S2[1..j])

Here, m is the length of the first string S1, n is the length of the second string S2, and LCSuff(S1[1..i], S2[1..j]) is the length of the longest common suffix of the prefixes S1[1..i] and S2[1..j]; the longest common substring is the maximum such suffix length over all prefix pairs.

Damerau-Levenshtein distance is a metric for evaluating the difference between two strings: it quantifies the minimum number of edit operations (insertions, deletions, substitutions, and transpositions of adjacent characters) needed to transform one string into the other [16].

The Jaro similarity is a normalized score in which 0 indicates no similarity between the strings and 1 indicates an exact match. It is calculated as follows:

d_j = 0, if m = 0
d_j = (1/3) * ( m/|s1| + m/|s2| + (m - t)/m ), otherwise

Here, |s1| and |s2| denote the lengths of strings s1 and s2, m is the number of matching characters, and t is half the number of transpositions. Two characters are considered matching only if they are identical and no farther apart than

floor( max(|s1|, |s2|) / 2 ) - 1.

The Jaro-Winkler distance extends the Jaro metric with a bonus for a common prefix, which makes it well suited to short strings such as simple sentences:

d_w = d_j + l * p * (1 - d_j)

Here, d_j is the Jaro similarity of the strings, l is the length of their common prefix (up to a maximum of 4 characters), and p is a constant scaling factor (commonly p = 0.1). The prefix bonus provides a more refined similarity measure [20].

The Needleman-Wunsch algorithm is an optimal matching algorithm and a global alignment technique based on dynamic programming, commonly used in bioinformatics to align biological sequences [18]. The Smith-Waterman algorithm performs local sequence alignment and is used to evaluate the similarity of strings such as nucleotide sequences.
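The edit-distance and alignment algorithms above all share the same dynamic-programming recurrence, which can be illustrated with a minimal Levenshtein distance implementation (Needleman-Wunsch generalizes this recurrence with configurable gap and match scores, and the Damerau variant would additionally count adjacent transpositions):

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform s1 into s2 (classic dynamic programming)."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                       # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                       # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```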
Unlike Needleman-Wunsch, Smith-Waterman optimizes over local segments of the strings rather than their full lengths; it is not practical for large-scale problems [19]. The N-gram model is a probabilistic language model that predicts the next item in a sequence from the preceding (n - 1) terms or characters; its main advantages are its simplicity of implementation and its scalability [20].

2.2.1.2. Term-Wise Approach
At the term level, two measures are commonly used to evaluate similarity: cosine similarity and Jaccard similarity, both of which compare texts represented as vectors or sets of terms. (Character-level methods such as Damerau-Levenshtein, Jaro-Winkler, Needleman-Wunsch, and Smith-Waterman were described in the previous subsection.)

Block Distance, also known as City Block Distance, Snake Distance, Manhattan Distance, Manhattan Length, or L1 Distance [21], measures the distance d1 between two points represented by vectors p and q as follows:

d1(p, q) = ||p - q||_1 = sum_{i=1}^{n} |p_i - q_i|

Cosine Similarity is a similarity metric between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It is commonly used in data mining to assess the cohesion between vectors [22]. The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product:

a . b = ||a|| ||b|| cos(theta)

Given two vectors A and B, the cosine similarity cos(theta) is the dot product divided by the product of the vectors' magnitudes:

cos(theta) = (A . B) / (||A|| ||B||) = sum_{i=1}^{n} A_i B_i / ( sqrt(sum_{i=1}^{n} A_i^2) * sqrt(sum_{i=1}^{n} B_i^2) )

Here, A . B represents the dot product of vectors A and B, and ||A|| and ||B|| represent the magnitudes (or norms) of vectors A and B, respectively. Soft Cosine Similarity generalizes cosine similarity by taking into account the pairwise similarity of features in the Vector Space Model [23].
It is calculated using the following formula:

soft_cos(theta) = sum_{i,j=1}^{n} s_ij A_i B_j / ( sqrt(sum_{i,j=1}^{n} s_ij A_i A_j) * sqrt(sum_{i,j=1}^{n} s_ij B_i B_j) )

In this formula, s_ij is the value from the similarity matrix between features i and j. Note that if the similarity matrix is diagonal, meaning each feature is similar only to itself, the soft cosine reduces to the ordinary cosine similarity [24].

The Sorensen-Dice index (Dice's coefficient) is used to quantify the similarity of two samples [25], commonly in terms of the presence or absence of elements in two data sets. It is calculated as follows:

QS = 2|X ∩ Y| / (|X| + |Y|)

Here, |X| and |Y| are the numbers of elements in the two sets, and the quotient of similarity QS ranges in value from 0 to 1. Applied to the bigrams of strings S1 and S2, the coefficient is calculated as:

sim = 2 n_t / (n_s1 + n_s2)

In this formula, n_t is the number of bigrams shared by the two strings, while n_s1 and n_s2 are the total numbers of bigrams in S1 and S2, respectively.

Euclidean Distance measures the straight-line distance between two points. The Euclidean distance between points s and t, denoted d(s, t) = d(t, s), is calculated using the following formula:

d(s, t) = sqrt( sum_{i=1}^{n} (s_i - t_i)^2 )

Here, n is the number of dimensions (features) of the space, and s_i and t_i are the coordinates of the two points in dimension i; the formula takes the square root of the sum of the squared coordinate-wise differences.

The Jaccard Index (Jaccard similarity coefficient) is a statistical measure used to gauge the similarity and diversity of two finite sets [26]. It is defined by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)

In this formula, A and B represent the two sets being compared.
|A| and |B| denote the cardinality (number of elements) of sets A and B, respectively.

The Simple Matching Coefficient (SMC) is a statistical measure used to assess the similarity and diversity of two objects, each viewed as a collection of n binary attributes. The SMC between objects A and B can be calculated using the following formula:

SMC = (number of matching attributes) / (total number of attributes) = (a_00 + a_11) / (a_00 + a_01 + a_10 + a_11)

Here, the total number of attributes is a_00 + a_01 + a_10 + a_11, and the matching attributes are those with the same value in both objects: a_00 is the number of attributes that are 0 in both A and B, a_11 the number that are 1 in both, a_10 the number that are 1 in A and 0 in B, and a_01 the number that are 0 in A and 1 in B. Dividing the number of matching attributes by the total number of attributes yields a similarity value between 0 and 1.

The Overlap Coefficient, also known as the Szymkiewicz-Simpson Coefficient, is a similarity measure closely related to the Jaccard Index. It quantifies the overlap between two sets and is defined as follows:

overlap(A, B) = |A ∩ B| / min(|A|, |B|)

In this formula, |A| and |B| represent the cardinalities of sets A and B; the numerator is the size of their intersection and the denominator is the size of the smaller set. The overlap coefficient takes values between 0 and 1, reaching 1 exactly when one set is a subset of the other; in that case, all elements of the smaller set are also present in the larger one. 2.2.2.
Corpus-Based Approaches
Corpus-based methods in language studies use real language samples from large collections of written or spoken texts to examine language structure, usage, and meaning. These approaches are commonly used to identify patterns in communication, create networks of word meanings, develop computational models of language and learning, measure differences between dialects, and understand how language changes over time; they can also shed light on how languages are learned and provide information for NLP tasks. An important aspect of this approach is finding similarities between words by analyzing the data in the collection, which requires a sizable body of text: analyzing large collections allows us to identify common word occurrences and accurately estimate word similarities, and many of the methods proposed for measuring word similarity rely on such analysis. A text collection used for language research is called a corpus, and it contains written or spoken sentences. To determine word similarities, we typically examine how words appear together in the corpus; obtaining reliable word co-occurrence statistics requires a very large and balanced corpus [27].

2.2.2.1. Method of LSA (Latent Semantic Analysis)
One example of this type of analysis is Latent Semantic Analysis (LSA). In LSA, each word is represented as a vector derived from statistical calculations. To create these vectors, a large text is analyzed and a word-by-segment matrix is constructed, with words as rows and paragraphs or segments of text as columns. Singular value decomposition (SVD) is then applied to reduce the dimensionality of the matrix, and after dimensionality reduction, word similarity is computed using cosine similarity.
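The first and last steps of this pipeline, building the word-by-segment count matrix and comparing word rows by cosine, can be sketched in pure Python. This is a toy illustration with invented English segments; the SVD step is omitted here and would in practice be performed with a numerical library.

```python
from collections import Counter
from math import sqrt

# Toy corpus: each "segment" is one short text (invented for illustration).
segments = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vocab = sorted({w for seg in segments for w in seg.split()})

# Word-by-segment frequency matrix: rows = words, columns = segments.
counts = [Counter(seg.split()) for seg in segments]
matrix = {w: [c[w] for c in counts] for w in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(round(cosine(matrix["sat"], matrix["on"]), 2))   # 1.0 (same segments)
print(round(cosine(matrix["cat"], matrix["dog"]), 2))  # 0.0 (no shared segment)
```

Note that with raw counts, "cat" and "dog" never co-occur in a segment and therefore score 0; it is precisely the SVD step omitted here that lets LSA recover such latent similarity from shared contexts.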
In this method, contextual information for words is extracted from a large text corpus [28]. The first step is to represent the text as a matrix in which rows represent unique words and columns represent segments of text, with each entry recording the frequency with which a word appears in a particular segment [29]. The cell frequencies are weighted by two factors: the importance of a word in the text and the degree to which words share information across discourse contexts. This approach can be used in two ways: as a practical similarity matrix over words and text segments, and as a computational model of the underlying knowledge acquisition and usage. To reduce the dimensionality of the matrix, singular value decomposition (SVD) is applied while preserving the similarity structure among the columns; similarity is then measured as the cosine of the angle between the word vectors formed by any two rows. LSA relies on the distributional hypothesis, which holds that words appearing in similar contexts tend to have similar meanings [30]; evidence of word similarity can therefore be computed through statistical analysis of large collections of sentences. LSA is a mathematical and statistical technique that extracts and infers relationships from the expected contextual usage of words in discourse passages. It is not a traditional natural language or artificial intelligence processing program: instead of relying on human-made dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, LSA takes raw text as input, treating it as sentences or paragraphs [29].

2.2.2.2. Method of Hyperspace Analogue to Language
Hyperspace Analogue to Language (HAL) constructs a word co-occurrence matrix in which both rows and columns represent words in the vocabulary. The matrix elements are filled with association strength values.
These association strength values are computed by applying a sliding "window" over the corpus, where the size of the window can be adjusted; the strength of association between words within the window decreases as the distance between them increases. For example, in the sentence "This is a survey of various semantic similarity measures," the words "survey" and "various" would have a higher association value than "survey" and "measures." Word vectors are formed by considering both the row and the column of a given word in the co-occurrence matrix. To reduce dimensionality, columns with low entropy values are eliminated. Finally, semantic similarity is calculated by measuring the Euclidean or Manhattan distance between the word vectors [31].

2.2.2.3. Method of Explicit Semantic Analysis (ESA)
ESA (Explicit Semantic Analysis) is a semantic similarity measurement method that relies on Wikipedia concepts. Because it builds on Wikipedia, the approach can be applied to different domains and languages, and the dynamic nature of Wikipedia keeps the method adaptable to change over time [32]. In ESA, each concept in Wikipedia is represented as an attribute vector of the words associated with it, and an inverted index is constructed linking each word to the concepts it is associated with. To determine the strength of these associations, TF-IDF weighting is applied, and concepts with weak word associations are filtered out. As a result, the input text is represented by weighted vectors of concepts, known as "interpretation vectors" [32]. To measure semantic similarity, the cosine similarity between these interpretation vectors is calculated, which captures how close the vectors are in direction in the vector space [32]. 2.2.2.4.
Method of Word-alignment Models
Word-alignment models determine the semantic similarity between sentences based on the alignment of their words, estimated over a large corpus [32]. These models performed well in the SemEval 2015 tasks, securing the second, third, and fifth positions. The unsupervised method that ranked fifth used a word-alignment technique based on the Paraphrase Database (PPDB) [32]: it measures the semantic similarity between two sentences as the proportion of aligned content words shared between the sentences relative to the total number of words in both sentences. The supervised methods that ranked second and third employed word2vec to establish word alignments. In the first supervised method, a sentence vector is created by computing the component-wise average of the word vectors, and the cosine similarity between these sentence vectors serves as the STS measure. The second supervised method considers only words that exhibit contextual semantic similarity [32].

2.2.2.5. Method of Latent Dirichlet Allocation (LDA)
LDA (Latent Dirichlet Allocation) is a technique commonly used for topic modeling. It represents a document's topics or general idea as a vector rather than including every single word of the document, which offers the advantage of reduced dimensionality, since the number of topics is typically much smaller than the number of words in the document [33]. To evaluate document similarity, each document is represented as a vector in a space in which each dimension corresponds to a specific topic, and the cosine similarity between these document vectors is calculated to measure the semantic similarity of the documents [34]; the cosine similarity captures how close the document vectors are in direction, indicating their semantic similarity. 2.2.2.6.
Method of Normalized Google Distance (NGD) Normalized Google Distance (NGD) is a measure of similarity between two terms based on the results obtained from querying them using the Google search engine. The underlying assumption is that if two words are more related, they will appear together more frequently in web pages [35]. To calculate the NGD between two terms, denoted as t1 and t2, the following formula is used: NGD(x,y) = max {loд f (t1),loд f (t2)} − loд f (t1,t2), (9) loд G − min {loд f (t1),loд f (t2)} 16 In this formula, f(x) and f(y) represent the number of hits in the Google search results for the respective terms, while f(x,y) represents the number of hits when the terms are searched together. The variable G represents the total number of pages in the overall Google search. NGD is commonly used to measure semantic relatedness rather than semantic similarity. This is because related terms tend to appear together more frequently in web pages, even if they have opposite meanings. 2.2.2.7. Method of Dependency-based Models Dependency-based approaches aim to determine the meaning of a given word or phrase by examining its neighboring words within a specified window. These approaches typically begin by parsing the corpus using Inductive Dependency Parsing [36]., which involves analyzing the distribution of words within the corpus. For each word, a "syntactic context template" is constructed, taking into account the preceding and succeeding nodes in the parse tree. As an example, the phrase "thinks delicious" could have a context template such as "pizza, burger, and food." A vector representation of a word is then created by aggregating the context templates in which the word appears as the root word. The frequency of these word windows occurring in the entire corpus is also considered. Once the vector representation is formed, semantic similarity can be calculated using cosine similarity between these vectors. Levy et al. 
[36] introduced DEPS, a word-embedding model that applies the bag-of-words approach to dependency contexts. The model was evaluated using the WS353 dataset, which involves ranking similar words above related words. When comparing recall-precision curves, the DEPS curve demonstrated a stronger affinity towards similarity rankings than the bag-of-words (BoW) methods.

2.2.2.8. Method of Word-attention Models

In many corpus-based methods, all components of the text are treated as equally significant. In human interpretation of similarity, however, the importance of specific keywords in a given context is often emphasized. Word-attention models aim to capture the importance or relevance of words in the underlying corpus before calculating semantic similarity [37]. These models employ various techniques to determine the attention weights of the words in the text being analyzed, including word frequency, alignment, and word association. By assigning higher attention weights to key terms in the context, word-attention models capture the relative importance of specific words in determining semantic similarity. This allows the models to focus on the most relevant information when calculating similarity measures.

2.2.2.9. Method of GLSA (Generalized Latent Semantic Analysis)

Generalized Latent Semantic Analysis (GLSA) is a technique for calculating semantically motivated phrase and document vectors. It extends the LSA methodology by emphasizing term vectors rather than the dual document-term representation. GLSA requires a dimensionality reduction technique and a measure of semantic association between concepts. With the GLSA approach, any appropriate dimensionality reduction technique can be combined with any similarity measure on the space of words.
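The vector-based similarity idea shared by the word-alignment and LDA methods above (build a sentence or document vector, then compare with cosine similarity) can be sketched as follows. This is a minimal illustration: the three-dimensional word vectors are invented toy values, not real embeddings.

```python
import math

def average_vector(word_vectors):
    """Component-wise average of word vectors, yielding a sentence vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy word vectors for two short "sentences" (invented for illustration).
s1 = average_vector([[1.0, 0.0, 1.0], [0.5, 0.5, 1.0]])
s2 = average_vector([[1.0, 0.2, 0.9], [0.4, 0.6, 1.1]])
score = cosine_similarity(s1, s2)  # close to 1.0 since the vectors point similarly
```

The same cosine function applies whether the vectors come from averaged word embeddings or from LDA topic distributions; only the way the vectors are built differs.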
The final phase obtains the weights in the linear combination of term vectors from the conventional term-document matrix [10].

2.2.3. Knowledge Base STS Approaches

Recently developed state-of-the-art methods for determining similarity scores between text sample pairs incorporate knowledge-based linguistic features. These techniques use lexical relations and word-level semantic networks to assess relevance at the text (sentence) level. Electronic resources, such as lexical resources and knowledge bases, serve as the primary sources of information for these methods. The semantic similarity between two simple sentences is quantified by evaluating a global measure based on pairwise comparisons of word similarity within those sentences. Sentence-to-sentence semantic similarity is thus constructed by aggregating individual word semantic similarities [16]. One specific measure used for sentence-to-sentence similarity is the term set-to-term set measure, which represents an extreme case of this approach [38]. To compute this measure, the two texts being compared are queried separately in the corpora to determine the number of documents containing each text, and the number of documents in which both appear together is also queried. These queries are performed using a Lucene index built on the corpora. The cardinality of a word refers to the number of corpora in which the word appears, while the cardinality of a conjunction of words represents the number of documents in which both words appear [8]. Based on the principles used to assess the semantic similarity between words, knowledge-based semantic similarity methods can be further categorized into edge-counting methods, feature-based methods, and information content-based methods. These categories employ distinct techniques and measures to capture the semantic relatedness between words and extend them to sentence-level similarity assessments.

2.2.3.1.
Edge-counting Methods

A simple approach to measuring similarity between terms is to view the underlying ontology as a graph in which words are connected taxonomically. By counting the edges between two terms, we can gauge their similarity: the shorter the path between the terms, the more similar they are. This measure, known as "path", was proposed by Rada et al. [39]. It determines similarity as the inverse of the shortest path length between two terms. However, this simple edge count does not account for the fact that words lower in the hierarchy may have more specific meanings and can therefore be more similar to each other, even when the path between them is the same length as that between two terms denoting more general concepts. To address this, Wu and Palmer [39] proposed the "wup" measure, which treats the depth of words in the ontology as an important factor. The wup measure counts the number of edges between each term and their Least Common Subsumer (LCS), the common ancestor shared by both terms in the given ontology. Denoting two terms as t1 and t2, their LCS as t_lcs, and the shortest path length between them as min_len(t1, t2), the path measure is defined as:

sim_path(t1, t2) = 1 / (1 + min_len(t1, t2))

and wup is defined as:

sim_wup(t1, t2) = 2·depth(t_lcs) / (depth(t1) + depth(t2))

Li et al. [40] proposed a measure that takes into account both the minimum path distance and the depth. li is defined as:

sim_li(t1, t2) = e^(−α·min_len(t1, t2)) · (e^(β·depth(t_lcs)) − e^(−β·depth(t_lcs))) / (e^(β·depth(t_lcs)) + e^(−β·depth(t_lcs)))

However, edge-counting methods ignore the fact that the edges in an ontology need not be of equal length. To overcome this shortcoming of simple edge-counting methods, feature-based semantic similarity methods were proposed.

2.2.3.2. Feature-based Methods

Feature-based methods calculate similarity as a function of the properties of the words, such as gloss, neighboring concepts, and so on [12].
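The path and wup measures from Section 2.2.3.1 follow directly from their formulas. The sketch below uses a hypothetical taxonomy position (depths and a shortest path length invented for illustration) rather than a real ontology lookup.

```python
def sim_path(min_len):
    """path measure: inverse of (1 + shortest path length between the terms)."""
    return 1.0 / (1.0 + min_len)

def sim_wup(depth_t1, depth_t2, depth_lcs):
    """wup measure: depth of the Least Common Subsumer, scaled by both term depths."""
    return 2.0 * depth_lcs / (depth_t1 + depth_t2)

# Hypothetical taxonomy: two terms at depth 4 sharing an LCS at depth 3,
# with a shortest path of 2 edges between them.
p = sim_path(2)        # 1 / (1 + 2)
w = sim_wup(4, 4, 3)   # (2 * 3) / (4 + 4)
```

In practice the depths and path lengths would be read from an ontology such as WordNet; the formulas themselves are unchanged.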
Gloss is defined as the meaning of a word in a dictionary; a collection of glosses is called a glossary. Various semantic similarity methods have been proposed that use the glosses of words. Gloss-based semantic similarity measures exploit the observation that words with the same meanings share more common words in their glosses, so semantic similarity is measured as the extent of overlap between the glosses of the words under consideration. The measure in [41] assigns a relatedness value to two words based on the overlap between their glosses and the glosses of the concepts they are related to in an ontology such as WordNet [42]. [14] proposed a feature-based method in which semantic similarity is measured using the glosses of concepts present in Wikipedia. Most feature-based methods take into account common and non-common features between two words or terms: common features increase the similarity value, while non-common features decrease it. The major limitation of feature-based methods is their dependency on ontologies with semantic features, and most ontologies rarely incorporate any semantic features other than taxonomic relationships [12].

2.2.3.3. Information Content-based Methods

The information content (IC) of a concept is defined as the information derived from the concept when it appears in context [43]. A high IC value indicates that the word is more specific and describes a concept with less ambiguity, while a lower IC value indicates that the word is more abstract in meaning [44]. The specificity of a word is determined using Inverse Document Frequency (IDF), which relies on the principle that the more specific a word is, the less often it occurs across documents. IC-based methods measure the similarity between terms using the IC values associated with them.
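The gloss-overlap idea from Section 2.2.3.2 can be sketched as a simple count of shared words. The glosses below are invented for illustration; a real system would draw them from WordNet or Wikipedia.

```python
def gloss_overlap(gloss_a, gloss_b):
    """Count distinct words shared by two glosses (a crude Lesk-style signal)."""
    words_a = set(gloss_a.lower().split())
    words_b = set(gloss_b.lower().split())
    return len(words_a & words_b)

# Hypothetical dictionary glosses, invented for illustration.
g1 = "a domesticated animal kept for companionship"
g2 = "an animal kept by humans for companionship or work"
overlap = gloss_overlap(g1, g2)  # shares: animal, kept, for, companionship
```

Real gloss-based measures additionally weight multi-word overlaps and include glosses of related concepts, but the core signal is this overlap count.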
Resnik [45] proposed a semantic similarity measure called res, based on the idea that if two concepts share a common subsumer, they share more information, since the IC value of the LCS is higher. With IC denoting the information content of a given term, res is defined as:

sim_res(t1, t2) = IC(t_lcs)

Lin [46] proposed an extension of the res measure that takes into account the IC values of both terms, which capture the individual information of each term, as well as the IC value of their LCS, which captures the commonality shared between the terms. lin is defined as:

sim_lin(t1, t2) = 2·IC(t_lcs) / (IC(t1) + IC(t2))

Jiang and Conrath [47] calculate a distance measure based on the difference between the sum of the individual IC values of the terms and the IC value of their LCS:

dis_jcn(t1, t2) = IC(t1) + IC(t2) − 2·IC(t_lcs)

This distance replaces the shortest path length in the path measure, and the similarity is inversely proportional to the distance. Hence jcn is defined as:

sim_jcn(t1, t2) = 1 / (1 + dis_jcn(t1, t2))

The IC value can be derived either from an underlying corpus or from the intrinsic structure of the ontology itself [33], on the assumption that the ontology is structured in a meaningful way. Some terms may not be included in a single ontology, which provides scope for using multiple ontologies to calculate their relationship [13]. Based on whether both given terms are present in a single ontology or not, IC-based methods can be classified as mono-ontological or multi-ontological methods. When multiple ontologies are involved, the IC of the Least Common Subsumer in each ontology is accessed to estimate the semantic similarity values. Jiang et al. [48] proposed IC-based semantic similarity measures based on Wikipedia pages, concepts, and neighbors; Wikipedia was used both as a structured taxonomy and as a corpus providing IC values.
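A minimal sketch of the res, lin, and jcn measures, assuming the IC values have already been computed from a corpus; the numeric IC values below are invented for illustration.

```python
def sim_res(ic_lcs):
    """res: similarity is simply the IC of the Least Common Subsumer."""
    return ic_lcs

def sim_lin(ic_t1, ic_t2, ic_lcs):
    """lin: shared information (LCS) scaled by the terms' own information."""
    return 2.0 * ic_lcs / (ic_t1 + ic_t2)

def sim_jcn(ic_t1, ic_t2, ic_lcs):
    """jcn: inverse of the Jiang-Conrath distance, as in the equations above."""
    dis = ic_t1 + ic_t2 - 2.0 * ic_lcs
    return 1.0 / (1.0 + dis)

# Hypothetical IC values for two terms and their LCS (invented for illustration).
ic1, ic2, ic_lcs = 6.0, 8.0, 5.0
lin = sim_lin(ic1, ic2, ic_lcs)   # 10 / 14
jcn = sim_jcn(ic1, ic2, ic_lcs)   # 1 / (1 + 4)
```

Note how lin falls in [0, 1] by construction, while res grows with the specificity of the shared subsumer.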
Semantic Textual Similarity (STS) is a task in natural language processing that involves measuring how closely related pairs of text units are in terms of their meaning. To preprocess STS data, several steps are typically followed to enhance the accuracy of the similarity measurement:

1. Tokenization: The text is divided into individual words or tokens to establish the basic units of analysis.
2. Stop word removal: Common words that carry little semantic meaning, such as "a," "the," or "of," are removed to reduce noise and focus on more meaningful content.
3. Stemming and lemmatization: Words are reduced to their base or root forms to handle variations of the same word. This helps capture the core meaning and avoids redundancy.
4. Part-of-speech (POS) tagging: Each word is assigned a syntactic category, such as noun, verb, or adjective, to understand the grammatical structure and potential relationships between words.
5. Dependency parsing: The relationships and dependencies between words are analyzed to determine which words depend on others in terms of syntax and meaning.
6. Named entity recognition: Entities such as names of people, organizations, and locations are identified to handle their specific semantic significance.
7. Parsing trees: The syntactic structure of the sentence is represented using parse trees, which capture the hierarchical relationships between words.

By applying these preprocessing techniques, noise is reduced and the semantic meaning of the sentence pair is captured more accurately, which in turn improves the performance and accuracy of STS systems in measuring the similarity between texts.

2.3. Deep learning techniques

A Recurrent Neural Network (RNN) is a type of neural network that addresses the requirement for sequential information processing.
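The first two preprocessing steps listed above (tokenization and stop word removal) can be sketched with the Python standard library alone; the stop word set below is a tiny illustrative sample, and steps 3 through 7 would require an NLP library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "in", "to"}

def preprocess(text):
    """Lowercase, tokenize on letter runs (step 1), then drop stop words (step 2)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The meaning of a sentence is captured here.")
# -> ["meaning", "sentence", "captured", "here"]
```

For a language like Guragigna written in the Ethiopic script, the token pattern and stop word list would of course need to be adapted to that script.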
Unlike traditional neural networks, where inputs and outputs are treated independently, RNNs feed the previous output back in as input for the current step. This is particularly useful in tasks like predicting the next word in a sentence, where the context of previous words is necessary. To enable this sequential processing, RNNs introduce a Hidden Layer that plays a crucial role. This Hidden Layer, also known as the Hidden State or Memory State, retains information about the sequence's previous inputs; it acts as a memory that helps the network remember and utilize past inputs. One of the key advantages of RNNs is that they share parameters across different inputs or time steps: the same set of parameters is used for each input, performing the same operation on all of them. As a result, the number of parameters is reduced compared to other neural network architectures. In short, RNNs are specialized neural networks for sequential information processing; they use a hidden state to remember past inputs, enabling them to capture dependencies and context in sequential data, and their parameter sharing contributes to their efficiency [49]. RNNs perform especially well when modeling sequential data, where the context and order of the input items are important. They can process input sequences of varying lengths and identify the connections between the elements of a sequence. Because of this, RNNs are used for a variety of applications, including language modeling, machine translation, speech recognition, time series prediction, and sentiment analysis. RNNs are also effective at capturing long-term dependencies and contextual information within a sequence: an RNN's hidden state stores details of the inputs it has already seen, enabling the network to retain a memory of context.
Tasks that require analyzing a word or phrase in the context of the full sequence benefit from this contextual comprehension. Because RNNs can represent sequential data, they are frequently utilized in NLP tasks; language generation, text categorization, named entity recognition, sentiment analysis, question answering, and machine translation are just a few of the tasks to which they have been applied successfully. LSTM and GRU are RNN variants that have proven very successful at capturing long-range dependencies and reducing the vanishing gradient problem. RNNs are also a good fit for analyzing and predicting time series data, i.e., values arranged according to a specific time interval. They are helpful for applications like signal processing, anomaly detection, weather forecasting, and stock market prediction because they can extract temporal patterns and dependencies from the data. Transfer learning is made possible by pre-training RNN models on extensive language modeling tasks, such as training on a sizable corpus of text data; the pre-trained RNN models can then be fine-tuned on smaller labeled datasets for specific downstream tasks. By utilizing the language knowledge acquired during pre-training, this method enhances performance on the intended task. RNNs also allow examination of the hidden states and their temporal evolution, which contributes to their interpretability to some degree; this can provide insight into the attributes the model deems crucial for the task and aid in understanding how it makes decisions. RNNs are a popular option in machine learning because of their versatility in NLP and time series analysis tasks, as well as their capacity to handle sequential input and capture context and dependencies. Their adaptability and efficiency in modeling sequential information make them a useful tool for many applications [49].
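The recurrence at the heart of an RNN, where each new hidden state depends on the current input and the previous state, can be sketched with a single-unit toy cell. The weights here are fixed invented values rather than learned ones, so this only illustrates the forward recurrence, not training.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8):
    """Minimal single-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}).
    The hidden state h carries context from earlier steps into later ones."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # new state mixes input and past state
    return h

# The same values in a different order give different final states,
# showing the network's sensitivity to sequence order.
h_a = rnn_forward([1.0, 0.0, 0.0])
h_b = rnn_forward([0.0, 0.0, 1.0])
```

Real RNN layers operate on vectors with weight matrices and learn the weights by backpropagation through time, but the update rule is the same shape as this scalar sketch.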
LSTM (Long Short-Term Memory) is a special kind of recurrent neural network (RNN) design that addresses the vanishing gradient problem found in traditional RNNs. LSTM introduces a memory cell and three gates: an input gate, a forget gate, and an output gate. These gates control the flow of information into and out of the memory cell, allowing LSTMs to selectively retain or discard information over long sequences. The memory cell enables LSTMs to capture long-range dependencies and remember relevant information from earlier parts of the sequence. GRU (Gated Recurrent Unit) is another variant of the RNN architecture that addresses the vanishing gradient problem and has a simpler structure than LSTM. GRU also includes gating mechanisms, but it uses only two gates: an update gate and a reset gate. The update gate regulates how much of the previous hidden state should be maintained, while the reset gate determines the extent to which past information should be disregarded. GRU performs similarly to LSTM in many sequence modeling tasks while being computationally more efficient due to its reduced number of gates [49]. In some cases, information from both past and future inputs is important for understanding the current input in a sequence. Bidirectional RNNs (Bi-RNNs) address this by processing the input sequence in two directions, one forward and one backward, so that the hidden state of the network at each step is influenced by both past and future input contexts. Bi-RNNs are particularly useful in tasks where context from both directions is crucial, such as part-of-speech tagging or named entity recognition. By capturing information from both directions, Bi-RNNs provide improved context awareness and capture dependencies across the entire sequence [49]. Stacked RNNs involve stacking multiple recurrent layers on top of each other.
Each layer in the stack processes the input sequence sequentially, and its hidden state is passed as input to the next layer. Stacked RNNs allow for more complex representations and can capture hierarchical dependencies in the input data: the lower layers capture local dependencies, while the higher layers capture more abstract and global dependencies. Stacked RNNs can therefore enhance a model's ability to understand intricate relationships and patterns in sequential data, making them beneficial for tasks that require a deeper understanding of the input sequence. In general, RNNs are commonly utilized for Semantic Textual Similarity (STS) because assessing the semantic similarity between sentences requires considering the order and context of words and phrases. RNNs, with their recurrent connections and hidden states, can effectively model the sequential nature of sentences and encode contextual information, and they are appropriate for capturing the complex semantic linkages between phrases because they can record long-term dependencies. Furthermore, their flexibility in handling variable-length input and their capacity for transfer learning make RNNs a valuable choice for STS, allowing them to leverage pre-trained language models and improve performance on the task.

2.4. Overview of Guragigna Language

Guragigna, also known as Gurage or Guragegna, is a Semitic language spoken by the Gurage people in Ethiopia. It belongs to the Afro-Asiatic language family and specifically falls under the South Ethiopian Semitic branch. Guragigna is primarily spoken in the Gurage Zone, located in the southern part of the country. Guragigna has several dialects, with variations in vocabulary, pronunciation, and grammar across different Gurage communities. These dialects include Ezha, Cheha, Soddo, Inor, Gumer, Gura, Meskane, Muher, and Gyeto.
The language is characterized by a rich oral tradition and has its own unique writing system, the Guragigna script. However, that script is not widely used, and the majority of Guragigna speakers primarily use the Ethiopian script known as Fidel [50]. Guragigna exhibits a phonological system characterized by a diverse range of consonants and a set of five vowel phonemes. The language allows for complex syllable structures and permits consonant clusters in both initial and final positions. Stress typically falls on the penultimate syllable, while intonation plays a significant role in conveying meaning. Grammatically, Guragigna features noun and verb conjugation, adjective agreement with nouns, and a predominantly subject-verb-object word order. The language historically used the Ge'ez script, but in modern times it is commonly written using the Ethiopian script, an abugida; more recently, the Gurage Zone Culture and Tourism Office has prepared a new Guragigna script. Guragigna holds socio-cultural significance for the Gurage people, being intertwined with their traditions, folklore, and identity. Efforts are underway to preserve and promote Guragigna through educational initiatives and cultural events, contributing to the linguistic and cultural landscape of the Gurage Zone and Ethiopia as a whole [51][52][53]. Guragigna has influenced and been influenced by other Ethiopian languages, particularly Amharic, due to historical and geographical interactions. As a result, there are similarities in vocabulary and grammar between Guragigna and Amharic. The Gurage people, the native speakers of Guragigna, have a diverse cultural heritage and are known for their agricultural practices, craftsmanship, and music. Guragigna plays a significant role in preserving and transmitting their cultural traditions and expressions [51][52][53].
While Guragigna is primarily spoken within the Gurage community, there have been efforts to promote the language and its cultural significance through educational initiatives and documentation projects. These endeavors aim to preserve and enhance the understanding and use of Guragigna among its speakers and to promote appreciation for its linguistic and cultural richness [51][52][53].

2.5. Related works

The paper in [6] investigates several word embedding techniques (Word2Vec, GloVe, and FastText) to estimate the semantic similarity of Bengali sentences. Due to the unavailability of a standard dataset, this work developed a Bengali dataset containing 187,031 text documents with 400,824 unique words. Moreover, the work considers three semantic distance measures to compute the similarity between word vectors using cosine similarity: no weighting, term frequency weighting, and part-of-speech weighting. The performance of the proposed approach was evaluated on a developed dataset containing 50 pairs of Bengali sentences. The evaluation shows that FastText with continuous bag-of-words and a vector size of 100 achieved the highest Pearson's correlation (ρ) score of 77.28% [6]. The work in [8] offers three distinct methods for producing Arabic STS models that work well. The first is based on fine-tuning with automatic machine translation of English STS data into Arabic. The second strategy is based on integrating English data resources with Arabic models. The third strategy focuses on optimizing knowledge distillation-based models on a proposed translated dataset to improve their performance in Arabic. Using a very small collection of resources, a few hundred Arabic STS sentence pairs, the authors were able to obtain an 81% correlation score on the standard STS 2017 Arabic evaluation set.
Additionally, it was possible to expand the Arabic models to process two regional dialects, Saudi Arabian (SA) and Egyptian (EG) [13]. Determining how similar two sentences are in meaning is a crucial part of comprehending natural languages automatically. The problem of semantic similarity involves evaluating the closeness of sentence meanings. To address this problem, recurrent and recursive neural networks have been used and have shown significant improvements over basic models. These neural networks are designed to handle the structure of language: recurrent neural networks (RNNs) are suitable for processing sentences and understanding the relationships between words, while recursive neural networks (RecNNs) go further by considering the hierarchical structure of sentences. By utilizing recurrent and recursive neural networks, there have been notable enhancements in measuring semantic similarity, with reported improvements ranging from 16% to 70% over basic models. This highlights the effectiveness of these neural network approaches in evaluating the similarity of sentence meanings. These advancements contribute to better automated language understanding and have applications in tasks like question answering, information retrieval, and language translation [54]. Semantic Textual Similarity (STS) forms the foundation for numerous applications in Natural Language Processing (NLP). To measure the semantic similarity of sentences, one system combines convolutional and recurrent neural networks: it utilizes a convolutional network to consider the nearby context of words and a Long Short-Term Memory (LSTM) network to account for the overall context of sentences. By combining these networks, the system retains important sentence information and enhances the calculation of sentence similarity. The model has demonstrated favorable outcomes and is competitive with leading state-of-the-art systems [7].
The study in [9] attempted to develop an Amharic-English CLSTSM system utilizing a statistical topic-modeling-based semantic text similarity measurement approach. It helps native speakers of Amharic gauge the amount of web content available in Amharic by using a query in their own language. Publicly accessible Amharic and English text materials, making up comparable and non-comparable document collections, were used to test the system prototype. The LDA topic model methodology is used to turn the text documents into vectors by projecting the two texts into an LDA topic space, and three distinct techniques are used to measure the similarity of the two text documents. Across varying data sizes, the Jaccard algorithm outperforms the other matching algorithms with accuracy rates of 70%, 79%, 92%, and 96%; on non-comparable corpora, the Jaccard algorithm likewise surpasses the other algorithms with accuracy rates of 65%, 78%, 92%, and 95.6%. Measuring Semantic Textual Similarity (STS) is an important study area in NLP that plays a significant role in many applications such as question answering, document summarization, information retrieval, and information extraction. One paper evaluates Siamese recurrent architectures, a special type of neural network, used to measure STS; several variants of the architecture are compared with existing methods [55].
Table 2-1: List of related works

| Year [Ref] | Research work | Method | Accuracy | Dataset | Algorithm | Evaluation metric | Gaps/Features |
|---|---|---|---|---|---|---|---|
| 2021 [6] | "Word Embedding-based Textual Semantic Similarity Measure in Bengali" | Word embedding techniques and cosine similarity | 77.28% Pearson correlation | 187,031 text documents | No weighting, term frequency weighting, and part-of-speech weighting | Pearson's correlation (ρ) | Ambiguous words not considered; no RNN or CNN models used |
| 2022 [8] | "Semantic textual similarity for modern standard and dialectal Arabic using transfer learning" | Transfer learning with BERT embeddings | 81% Pearson correlation | 100 sentence pairs | Transfer learning | Pearson correlation | Large accuracy gap compared to English STS models; lacks a lexical standard for Arabic |
| 2022 [54] | "Deep learning based semantic similarity detection using text data" | Combined LSTM and CNN with word embeddings | 70% accuracy | 404,290 question pairs | LSTM and CNN | Precision, recall, and F1 | Insufficient experimental detail (actual scores and model predictions not explained); compared with unrelated works |
| 2018 [7] | "Predicting the Semantic Textual Similarity with Siamese CNN and LSTM" | CNN and LSTM | 0.79 Pearson correlation | 9,927 sentence pairs | LSTM and CNN | Pearson (r) and Spearman (ρ) correlation coefficients, Mean Squared Error | No information about pre-training embeddings; lack of annotated corpora |
| 2019 [55] | "Semantic Textual Similarity with Siamese Neural Networks" | Siamese neural networks with word embeddings | 0.81 Pearson correlation | 9,927 sentence pairs | Siamese neural networks | Pearson correlation | Better performance on smaller datasets; only training data available |
| 2021 [8] | "Cross-Language Semantic Text Similarity Measurement using Statistical Topic Model: The Case of Amharic-English Languages" | LDA topic model and Jaccard algorithm | 96% | 1,200 comparable and non-comparable texts | Cosine, Jaccard, and Hellinger | Precision, recall, and F1 | Relies primarily on word co-occurrence statistics and fails to incorporate the semantic meaning of words; the model may not always align with human interpretation |

Summary of Related Work

Research on STS has been conducted in different ways, with varying approaches, for foreign languages. Nevertheless, no deep learning study on STS has been done for Guragigna or other local languages. The studies examined above, conducted on other languages, largely favor deep learning algorithms over conventional STS approaches. The purpose of this work is to apply deep learning techniques to the development of semantic textual similarity for the Guragigna language. We conducted experiments using LSTM, Bi-RNN, GRU, and Stacked RNN models, since, as seen in the works reviewed above, the majority of recent research for various languages has relied on such models. To assess the effectiveness of the models, Mean Squared Error (MSE) is used as the evaluation metric; it compares system output to reference sentences that have been manually scored in order to assess score correctness. We employed preprocessing approaches and optimization strategies to improve the MSE and training speed of the deep learning STS models, hence reducing the complexity of the work. Based on the gaps described in Table 2-1, this thesis explores alternative approaches and different embedding techniques to bridge the gap in STS model accuracy. It investigates domain-specific pre-training techniques and leverages annotated Guragigna datasets to improve the performance of semantic similarity models. In addition, the thesis provides complete experimental setups, with detailed descriptions of model architectures, hyper-parameters, and evaluation metrics.
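The MSE metric adopted as the evaluation measure above can be computed directly from paired system and reference scores; the similarity scores below are invented for illustration.

```python
def mean_squared_error(predicted, reference):
    """MSE between system similarity scores and manually assigned reference scores."""
    assert len(predicted) == len(reference)
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical similarity scores on a 0-1 scale, invented for illustration.
system = [0.90, 0.40, 0.75]
gold = [1.00, 0.50, 0.70]
mse = mean_squared_error(system, gold)  # (0.01 + 0.01 + 0.0025) / 3 = 0.0075
```

Lower MSE is better; a model that exactly reproduced the human scores would achieve an MSE of zero.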
It compares the proposed models with relevant approaches, highlighting the strengths and weaknesses of each based on the results. Moreover, by collaborating with linguists or using different data sourcing platforms to create an annotated corpus, these models allow for more accurate and reliable evaluation. Th