Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification

Kholodna, Nataliia; Vysotska, Victoria; Markiv, Oksana; Chyrun, Sofia

DC Element	Value	Language
dc.contributor.author	Kholodna, Nataliia
dc.contributor.author	Vysotska, Victoria
dc.contributor.author	Markiv, Oksana
dc.contributor.author	Chyrun, Sofia
dc.contributor.editor	Emmerich, M.
dc.contributor.editor	Vysotska, V.
dc.date.accessioned	2023-07-12T06:59:35Z	-
dc.date.available	2023-07-12T06:59:35Z	-
dc.date.issued	2022
dc.identifier.issn	1613-0073
dc.identifier.uri	http://osnascholar.ub.uni-osnabrueck.de/handle/unios/72144	-
dc.description	Cited by: 2; Conference name: 4th International Workshop of Modern Machine Learning Technologies and Data Science, MoMLeT and DS 2022; Conference date: 25 November 2022 through 26 November 2022; Conference code: 185816
dc.description.abstract	This article describes the development of an ML model that detects paraphrasing through binary classification of text pairs. The following semantic similarity metrics were selected as features: the Jaccard coefficient for shared N-grams, cosine distance between vector representations of sentences, Word Mover's Distance, distances according to WordNet dictionaries, and the predictions of two ML models: a recurrent Siamese neural network and a Transformer-type model, RoBERTa. The developed software uses model stacking and feature engineering. Additional features indicate the semantic affiliation of sentences or the normalized number of common N-grams. The resulting model achieves the following classification results on PAWS test data: weighted precision 93%, weighted recall 92%, F1-score 92%, accuracy 92%. The results show that Transformer-type NNs can detect paraphrasing in a pair of texts with fairly high accuracy without additional feature generation. The fine-tuned RoBERTa NN (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This model specificity may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both overall classification accuracy and model sensitivity to pairs of sentences that are not paraphrases of each other. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
dc.language.iso	en
dc.publisher	CEUR-WS
dc.relation.ispartof	CEUR Workshop Proceedings
dc.subject	Classification (of information)
dc.subject	Computational linguistics
dc.subject	content
dc.subject	deep learning
dc.subject	Embeddings
dc.subject	Language processing
dc.subject	Machine learning
dc.subject	Machine-learning
dc.subject	model
dc.subject	natural language processing
dc.subject	Natural language processing systems
dc.subject	Natural languages
dc.subject	Ontology
dc.subject	paraphrasing detection
dc.subject	Recurrent neural networks
dc.subject	rewrite identification
dc.subject	semantic similarity
dc.subject	Semantics
dc.subject	text
dc.subject	text analysis
dc.subject	text classification
dc.subject	Text processing
dc.subject	Vector embedding of word
dc.subject	vector embedding of words
dc.subject	WordNet
dc.title	Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification
dc.type	conference paper
dc.identifier.scopus	2-s2.0-85146121840
dc.identifier.url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85146121840&partnerID=40&md5=6e935265568960bed9aec68295f33535
dc.description.volume	3312
dc.description.startpage	283 – 306
dcterms.isPartOf.abbreviation	CEUR Workshop Proc.
local.import.remains	affiliations : Lviv Polytechnic National University, S. Bandera Street, 12, Lviv, 79013, Ukraine; Osnabrück University, Friedrich-Janssen-Str. 1, Osnabrück, 49076, Germany
local.import.remains	publication_stage : Final
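
Two of the features named in the abstract, the Jaccard coefficient for shared N-grams and the cosine distance between sentence vectors, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and bag-of-words count vectors stand in for whatever sentence representations the paper actually uses.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the set of word n-grams found in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngrams(a, b, n=2):
    """Jaccard coefficient |A & B| / |A | B| over word n-gram sets.

    Assumption: lowercase whitespace tokenization, chosen for illustration.
    """
    grams_a = ngrams(a.lower().split(), n)
    grams_b = ngrams(b.lower().split(), n)
    union = grams_a | grams_b
    if not union:
        # Both sentences are shorter than n tokens: no n-grams to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)


def cosine_distance(a, b):
    """1 - cosine similarity between bag-of-words count vectors.

    Assumption: count vectors are a stand-in for sentence embeddings.
    """
    vec_a = Counter(a.lower().split())
    vec_b = Counter(b.lower().split())
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty sentence has no direction; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

In a stacking setup such as the one the abstract outlines, values like these would be computed per sentence pair and passed, together with the base models' predictions, to a final classifier.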
