Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling
Author(s): | Baroni, M.; Evert, S. |
Keywords: | Correction techniques; Cross validation; Frequency distributions; Non-randomness; Over fitting problem; Pre-processing method; Prediction accuracy; Random sampling; Rigorous evaluation; Test data; Word frequencies; Computational linguistics |
Publication date: | 2007 |
Journal: | ACL 2007 - Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics |
Start page: | 904 |
End page: | 911 |
Abstract: | Frequency distribution models tuned to words and other linguistic events can predict the number of distinct types and their frequency distribution in samples of arbitrary sizes. We conduct, for the first time, a rigorous evaluation of these models based on cross-validation and separation of training and test data. Our experiments reveal that the prediction accuracy of the models is marred by serious overfitting problems, due to violations of the random sampling assumption in corpus data. We then propose a simple pre-processing method to alleviate such non-randomness problems. Further evaluation confirms the effectiveness of the method, which compares favourably to more complex correction techniques. © 2007 Association for Computational Linguistics. |
Description: | Conference of 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007; Conference Date: 23 June 2007 through 30 June 2007; Conference Code: 89523 |
ISBN: | 9781932432862 |
External URL: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-84859999046&partnerID=40&md5=dd7530d99d74a3113082d95f3b0d1ed0 |