Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification

DC Element | Value | Language
dc.contributor.author | Kholodna, Nataliia
dc.contributor.author | Vysotska, Victoria
dc.contributor.author | Markiv, Oksana
dc.contributor.author | Chyrun, Sofia
dc.contributor.editor | Emmerich, M.
dc.contributor.editor | Vysotska, V.
dc.date.accessioned | 2023-07-12T06:59:35Z | -
dc.date.available | 2023-07-12T06:59:35Z | -
dc.date.issued | 2022
dc.identifier.issn | 1613-0073
dc.identifier.uri | http://osnascholar.ub.uni-osnabrueck.de/handle/unios/72144 | -
dc.description | Cited by: 2; Conference name: 4th International Workshop of Modern Machine Learning Technologies and Data Science, MoMLeT and DS 2022; Conference date: 25 November 2022 through 26 November 2022; Conference code: 185816
dc.description.abstract | This article describes the development of an ML model that detects paraphrasing by binary classification of text pairs. The following semantic similarity metrics were selected as features: the Jaccard coefficient for shared N-grams, the cosine distance between vector representations of sentences, Word Mover's Distance, distances according to WordNet dictionaries, and the predictions of two ML models: a Siamese neural network based on a recurrent architecture and a Transformer-type network, RoBERTa. The developed software uses model stacking and feature engineering. Additional features indicate the semantic affiliation of sentences or the normalized number of common N-grams. The resulting model shows excellent classification results on PAWS test data: weighted precision 93%, weighted recall 92%, F1-score 92%, accuracy 92%. The results show that Transformer-type NNs can detect paraphrasing in a pair of texts with fairly high accuracy without additional feature generation. The fine-tuned RoBERTa NN (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This model specificity may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both overall classification accuracy and model sensitivity to pairs of sentences that are not paraphrases of each other. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
dc.language.iso | en
dc.publisher | CEUR-WS
dc.relation.ispartof | CEUR Workshop Proceedings
dc.subject | Classification (of information)
dc.subject | Computational linguistics
dc.subject | content
dc.subject | deep learning
dc.subject | Embeddings
dc.subject | Language processing
dc.subject | Machine learning
dc.subject | Machine-learning
dc.subject | model
dc.subject | natural language processing
dc.subject | Natural language processing systems
dc.subject | Natural languages
dc.subject | Ontology
dc.subject | paraphrasing detection
dc.subject | Recurrent neural networks
dc.subject | rewrite identification
dc.subject | semantic similarity
dc.subject | Semantics
dc.subject | text
dc.subject | text analysis
dc.subject | text classification
dc.subject | Text processing
dc.subject | Vector embedding of word
dc.subject | vector embedding of words
dc.subject | WordNet
dc.title | Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification
dc.type | conference paper
dc.identifier.scopus | 2-s2.0-85146121840
dc.identifier.url | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85146121840&partnerID=40&md5=6e935265568960bed9aec68295f33535
dc.description.volume | 3312
dc.description.startpage | 283 – 306
dcterms.isPartOf.abbreviation | CEUR Workshop Proc.
local.import.remains | affiliations: Lviv Polytechnic National University, S. Bandera Street, 12, Lviv, 79013, Ukraine; Osnabrück University, Friedrich-Janssen-Str. 1, Osnabrück, 49076, Germany
local.import.remains | publication_stage: Final
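
Two of the features named in the abstract, the Jaccard coefficient for shared N-grams and the cosine distance between sentence vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and bag-of-words counts stand in for the sentence embeddings the paper refers to.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Set of word n-grams from a lowercased, whitespace-tokenized text (assumed tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngrams(a, b, n=2):
    """Jaccard coefficient: shared n-grams divided by all distinct n-grams of the pair."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def cosine_distance(a, b):
    """1 - cosine similarity of bag-of-words count vectors (a stand-in for sentence embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return 1.0 - dot / norm if norm else 1.0


def pair_features(a, b):
    """Hypothetical feature vector for one sentence pair, to be fed to a downstream classifier."""
    return [jaccard_ngrams(a, b, 1), jaccard_ngrams(a, b, 2), cosine_distance(a, b)]


if __name__ == "__main__":
    s1 = "Flights from New York to Florida"
    s2 = "Flights from Florida to New York"
    print(pair_features(s1, s2))
```

For the word-swapped pair in the example, the unigram Jaccard coefficient is 1.0 while the bigram coefficient is 0.25, which suggests why a stacked model would combine several such signals with neural predictions rather than rely on word overlap alone.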