A hybrid information extraction approach using transfer learning on richly-structured documents
Author(s): | Chowdhury, A.G.; Schut, N.; Atzmueller, M. |
Editor(s): | Seidl, T.; Fromm, M.; Obermeier, S. |
Keywords: | Deep learning; Extract informations; Information Extraction; Information retrieval; Complexes structure; Optical character recognition; Optical Character Recognition (OCR); PDF document; Sources of informations; Structured document; Table Detection; Transfer Learning; Unified framework |
Publication date: | 2020 |
Publisher: | CEUR-WS |
Journal: | CEUR Workshop Proceedings |
Volume: | 2993 |
Start page: | 13 |
End page: | 25 |
Abstract: | Richly-structured documents such as PDFs provide a rich source of information; however, extracting that information is often challenging due to their complex structures. Computer vision, optical character recognition (OCR) and deep learning offer significant opportunities in the field of information extraction from PDF articles. However, it is extremely challenging to create a unified framework to extract information from different types of PDF documents due to their diverse visual appearance. In this paper, we propose a hybrid information extraction approach for documents with complex structures. In particular, it features a pipeline which uses OCR for plain textual information extraction and transfer learning for table detection from documents with such rich and complex structure. Our application context is given by technical (product) datasheets, in particular plastic product technical data sheets for service provisioning. We discuss first experimental results and outline several challenges in this context. © 2021 Copyright for this paper by its authors. |
Description: | Conference of 2021 Learning, Knowledge, Data, Analytics Workshops, LWDA 2021; Conference Date: 1 September 2021 through 3 September 2021; Conference Code: 173242 |
ISSN: | 1613-0073 |
External URL: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85118898876&partnerID=40&md5=214b8b172333f1ceac155a071df1b151 |