A hybrid information extraction approach using transfer learning on richly-structured documents

Author(s): Chowdhury, A.G.
Schut, N.
Atzmueller, M.
Editor(s): Seidl, T.
Fromm, M.
Obermeier, S.
Keywords: Deep learning; Extract information; Information extraction; Information retrieval; Complex structure; Optical character recognition (OCR); PDF document; Sources of information; Structured document; Table detection; Transfer learning; Unified framework
Publication date: 2020
Publisher: CEUR-WS
Journal: CEUR Workshop Proceedings
Volume: 2993
Start page: 13
End page: 25
Abstract:
Richly-structured documents such as PDFs provide a rich source of information; however, extracting that information is often challenging due to their complex structures. Computer vision, optical character recognition (OCR) and deep learning offer significant opportunities in the field of information extraction from PDF articles. However, it is extremely challenging to create a unified framework to extract information from different types of PDF documents due to their diverse visual appearance. In this paper, we propose a hybrid information extraction approach for documents with complex structures. In particular, it features a pipeline which uses OCR for plain textual information extraction and transfer learning for table detection from documents with such rich and complex structure. Our application context is given by technical (product) datasheets, in particular plastic product technical data sheets for service provisioning. We discuss first experimental results and outline several challenges in this context. © 2021 Copyright for this paper by its authors.
Description:
Conference of 2021 Learning, Knowledge, Data, Analytics Workshops, LWDA 2021; Conference Date: 1 September 2021 through 3 September 2021; Conference Code: 173242
ISSN: 1613-0073
External URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85118898876&partnerID=40&md5=214b8b172333f1ceac155a071df1b151
