Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification

Author(s): Kholodna, Nataliia
Vysotska, Victoria
Markiv, Oksana
Chyrun, Sofia
Editor(s): Emmerich, M.
Vysotska, V.
Keywords: Classification (of information); Computational linguistics; content; deep learning; Embeddings; Language processing; Machine learning; Machine-learning; model; natural language processing; Natural language processing systems; Natural languages; Ontology; paraphrasing detection; Recurrent neural networks; rewrite identification; semantic similarity; Semantics; text; text analysis; text classification; Text processing; Vector embedding of word; vector embedding of words; WordNet
Publication date: 2022
Publisher: CEUR-WS
Journal: CEUR Workshop Proceedings
Volume: 3312
Pages: 283 – 306
Abstract: 
This article describes the development of an ML model that detects paraphrasing through binary classification of text pairs. The following semantic similarity metrics were selected as features: the Jaccard coefficient for shared N-grams, the cosine distance between vector representations of sentences, Word Mover's Distance, distances based on WordNet dictionaries, and the predictions of two ML models: a recurrent Siamese neural network and a Transformer-type network, RoBERTa. The developed software applies model stacking and feature engineering. The additional features indicate the semantic affiliation of sentences or the normalized number of common N-grams. On PAWS test data, the resulting model achieves a weighted precision of 93%, a weighted recall of 92%, an F1-score of 92%, and an accuracy of 92%. The results show that Transformer-type NNs can detect paraphrasing in a pair of texts with fairly high accuracy without additional feature generation. However, the fine-tuned RoBERTa NN (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This property may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. The additional features increase both overall classification accuracy and model sensitivity to pairs of sentences that are not paraphrases of each other. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
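As an illustration of one feature named in the abstract, the Jaccard coefficient for shared N-grams can be sketched as follows. This is a minimal sketch, not the authors' implementation: whitespace tokenization, lowercasing, the default n = 2, the handling of empty input, and the function names `ngrams` and `jaccard_ngram` are all assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram(sentence_a, sentence_b, n=2):
    """Jaccard coefficient |A & B| / |A | B| over word n-gram sets.

    Assumption: sentences are lowercased and split on whitespace;
    the paper's actual preprocessing is not stated in the abstract.
    """
    a = ngrams(sentence_a.lower().split(), n)
    b = ngrams(sentence_b.lower().split(), n)
    if not a and not b:
        # Assumption: with no n-grams on either side, report zero overlap.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sentence pair: 3 shared bigrams out of 7 distinct bigrams.
pair = ("the cat sat on the mat", "the cat lay on the mat")
score = jaccard_ngram(*pair, n=2)
```

A score near 1 means the two sentences share most of their N-grams. In a stacked model of the kind the abstract describes, such a value would be one input feature alongside the other similarity metrics and model predictions.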
Description: 
Cited by: 2; Conference name: 4th International Workshop of Modern Machine Learning Technologies and Data Science, MoMLeT and DS 2022; Conference date: 25 November 2022 through 26 November 2022; Conference code: 185816
ISSN: 1613-0073
External URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85146121840&partnerID=40&md5=6e935265568960bed9aec68295f33535
