Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling

Author(s): Baroni, M.
Evert, S.
Keywords: Correction techniques; Cross validation; Frequency distributions; Non-randomness; Over fitting problem; Pre-processing method; Prediction accuracy; Random sampling; Rigorous evaluation; Test data; Word frequencies; Computational linguistics
Publication date: 2007
Journal: ACL 2007 - Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics
Start page: 904
End page: 911
Abstract: 
Frequency distribution models tuned to words and other linguistic events can predict the number of distinct types and their frequency distribution in samples of arbitrary sizes. We conduct, for the first time, a rigorous evaluation of these models based on cross-validation and separation of training and test data. Our experiments reveal that the prediction accuracy of the models is marred by serious overfitting problems, due to violations of the random sampling assumption in corpus data. We then propose a simple pre-processing method to alleviate such non-randomness problems. Further evaluation confirms the effectiveness of the method, which compares favourably to more complex correction techniques. © 2007 Association for Computational Linguistics.
Description: 
Conference: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007; Conference date: 23 June 2007 through 30 June 2007; Conference code: 89523
ISBN: 9781932432862
External URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84859999046&partnerID=40&md5=dd7530d99d74a3113082d95f3b0d1ed0
