A Technique for Discovering Similarities between Texts Based on Extracting Features from the Text

Jihad, Alaa Abdalqahar; Hamad, Mortadha M.

doi:10.37652/juaps.2022.171876

Journal of University of Anbar for Pure Science
Article 8, Volume 13, Issue 1, April 2019, Pages 50-54
Document Type: Research Paper
DOI: 10.37652/juaps.2022.171876
Authors
Alaa Abdalqahar Jihad*¹; Mortadha M. Hamad²
¹Computer Center, University of Anbar
²College of Computer Sciences and IT, University of Anbar
Abstract
Discovering the similarity between two texts is important and useful in many applications. Text similarity is a core research area in the study of datasets, data warehouses, and data mining. This paper provides a framework that computes the similarity between two input texts based on pattern recognition and approximate string matching, with a weight that affects the resulting similarity ratio. The comparison is made without regard to grammar, synonyms, or word meanings. Preliminary results showed the benefit of extracting some features when discovering the similarity between texts.
Keywords
Similarity; Text Processing; Pattern Recognition; Extraction; Semantic Textual Similarity; STS; Natural Language Processing; NLP
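
The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one common form of approximate string matching: a Dice coefficient over character q-grams, blended with a word-level Dice score through a weight. The function names, the choice of q = 2, the word-level score, and the default weight of 0.5 are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def qgrams(text: str, q: int = 2) -> Counter:
    """Return the multiset of character q-grams of a normalised text."""
    # Lower-case and collapse whitespace so formatting differences are ignored.
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + q] for i in range(len(normalised) - q + 1))


def dice_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over character q-grams: 2 * shared / (|A| + |B|)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0  # neither text is long enough to yield a q-gram
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * shared / total


def word_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of distinct lower-cased words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    total = len(words_a) + len(words_b)
    if total == 0:
        return 0.0
    return 2 * len(words_a & words_b) / total


def weighted_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Blend character-level and word-level scores.

    `weight` is a hypothetical parameter: 1.0 uses only q-grams,
    0.0 uses only whole words.
    """
    return weight * dice_similarity(a, b) + (1 - weight) * word_dice(a, b)


if __name__ == "__main__":
    print(dice_similarity("night", "nacht"))
    print(weighted_similarity("text similarity", "similarity of texts"))
```

Like the approach described in the abstract, this sketch compares surface strings only and uses no grammar, synonyms, or word meanings. Raising `weight` makes the score more tolerant of small spelling differences, while lowering it favours exact word matches.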
