Classification of documents based on the structure of their DOM trees

Author(s): Geibel, P.
Pustylnikov, O.
Mehler, A.
Gust, H.
Kühnberger, K.-U. 
Keywords: Biological materials; Information retrieval systems; Markup languages; XML; DOM trees; Efficient; Newspaper articles; Ordered trees; Parse trees; Partial trees; Structural features; Textual contents; Tree kernels; XML documents; XML tags; Content based retrieval
Publication date: 2008
Published in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4985 LNCS
Issue: PART 2
Start page: 779
End page: 788
In this paper, we discuss kernels that can be applied for the classification of XML documents based on their DOM trees. DOM trees are ordered trees in which every node might be labeled by a vector of attributes including its XML tag and the textual content. We describe five new kernels suitable for such structures: a kernel based on predefined structural features, a tree kernel derived from the well-known parse tree kernel, the set tree kernel that allows permutations of children, the string tree kernel being an extension of the so-called partial tree kernel, and the soft tree kernel as a more efficient alternative. We evaluate the kernels experimentally on a corpus containing the DOM trees of newspaper articles and on the well-known SUSANNE corpus. © 2008 Springer-Verlag Berlin Heidelberg.
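As a rough illustration of the kind of computation a parse-tree-style kernel performs on DOM trees, the following minimal sketch counts shared tree fragments between two XML documents. It is not the authors' implementation and does not reproduce any of the five kernels from the paper: it follows the general recursion of the well-known parse tree kernel, using only the XML tag of each node as its label and ignoring textual content and other attributes. The function names `production` and `tree_kernel` and the `decay` parameter are hypothetical, chosen for this example.

```python
import xml.etree.ElementTree as ET


def production(node):
    """Label of a node plus the ordered labels of its children."""
    return (node.tag, tuple(child.tag for child in node))


def tree_kernel(xml_a, xml_b, decay=1.0):
    """Sum, over all node pairs, of the weighted number of common
    fragments rooted at that pair (parse-tree-kernel style recursion).

    Assumption: nodes are compared by XML tag only; text is ignored.
    """
    nodes_a = list(ET.fromstring(xml_a).iter())
    nodes_b = list(ET.fromstring(xml_b).iter())
    cache = {}

    def common(a, b):
        key = (id(a), id(b))
        if key in cache:
            return cache[key]
        if production(a) != production(b):
            value = 0.0
        else:
            # Same tag and same ordered child tags: each child pair can
            # either be cut off (the 1) or extended (the recursive term).
            value = decay
            for child_a, child_b in zip(a, b):
                value *= 1.0 + common(child_a, child_b)
        cache[key] = value
        return value

    return sum(common(a, b) for a in nodes_a for b in nodes_b)


doc_1 = "<article><title/><p/></article>"
doc_2 = "<article><title/><p/><p/></article>"
print(tree_kernel(doc_1, doc_1))
print(tree_kernel(doc_1, doc_2))
```

Because the comparison requires identical ordered child tags, the two example documents share no fragment rooted at `article`; only their `title` and `p` leaves match. Kernels that tolerate permuted or partially matching children, such as those the abstract describes, relax exactly this condition.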
Conference: 14th International Conference on Neural Information Processing, ICONIP 2007; Conference date: 13 November 2007 through 16 November 2007; Conference code: 73944
ISBN: 9783540691594
ISSN: 03029743
DOI: 10.1007/978-3-540-69162-4_81
