A hybrid information extraction approach using transfer learning on richly-structured documents

DC Element | Value | Language
dc.contributor.author | Chowdhury, A.G. |
dc.contributor.author | Schut, N. |
dc.contributor.author | Atzmueller, M. |
dc.contributor.editor | Seidl, T. |
dc.contributor.editor | Fromm, M. |
dc.contributor.editor | Obermeier, S. |
dc.date.accessioned | 2021-12-23T16:35:07Z | -
dc.date.available | 2021-12-23T16:35:07Z | -
dc.date.issued | 2020 |
dc.identifier.issn | 16130073 |
dc.identifier.uri | https://osnascholar.ub.uni-osnabrueck.de/handle/unios/18337 | -
dc.description | Conference of 2021 Learning, Knowledge, Data, Analytics Workshops, LWDA 2021; Conference Date: 1 September 2021 through 3 September 2021; Conference Code: 173242 |
dc.description.abstract | Richly-structured documents such as PDFs are a rich source of information; however, extracting that information is often challenging due to their complex structures. Computer vision, optical character recognition (OCR) and deep learning offer significant opportunities in the field of information extraction from PDF articles. However, it is extremely challenging to create a unified framework to extract information from different types of PDF documents due to their diverse visual appearance. In this paper, we propose a hybrid information extraction approach for documents with complex structures. In particular, it features a pipeline which uses OCR for plain textual information extraction and transfer learning for table detection from documents with such rich and complex structure. Our application context is given by technical (product) datasheets, in particular plastic product technical data sheets for service provisioning. We discuss first experimental results and outline several challenges in this context. © 2021 Copyright for this paper by its authors. |
dc.description.sponsorship | This work has been funded by the Interreg North-West Europe program (Interreg NWE), project Di-Plast - Digital Circular Economy for the Plastics Industry (NWE729). |
dc.language.iso | en |
dc.publisher | CEUR-WS |
dc.relation.ispartof | CEUR Workshop Proceedings |
dc.subject | Deep learning |
dc.subject | Extract informations |
dc.subject | Information Extraction |
dc.subject | Information retrieval, Complexes structure |
dc.subject | Optical character recognition |
dc.subject | Optical Character Recognition (OCR) |
dc.subject | PDF document |
dc.subject | Sources of informations |
dc.subject | Structured document |
dc.subject | Table Detection |
dc.subject | Transfer Learning |
dc.subject | Unified framework, Optical character recognition |
dc.title | A hybrid information extraction approach using transfer learning on richly-structured documents |
dc.type | conference paper |
dc.identifier.scopus | 2-s2.0-85118898876 |
dc.identifier.url | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85118898876&partnerID=40&md5=214b8b172333f1ceac155a071df1b151 |
dc.description.volume | 2993 |
dc.description.startpage | 13 |
dc.description.endpage | 25 |
dcterms.isPartOf.abbreviation | CEUR Workshop Proc. |
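
The abstract describes a hybrid pipeline in which OCR handles plain text and a transfer-learned model detects tables. The record does not include the authors' implementation, so the following is only a minimal sketch of one step such a pipeline might need: separating OCR words that fall inside detected table regions from the remaining plain text. The names `Box`, `Word`, and `split_words`, the bounding-box representation, and the 0.5 overlap threshold are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical format)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> float:
        """Area shared with another box (0.0 if they do not overlap)."""
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


@dataclass(frozen=True)
class Word:
    """One OCR token with its position; assumed output of an OCR step."""
    text: str
    box: Box


def split_words(words, table_boxes, min_overlap=0.5):
    """Assign each word to the first table box covering at least
    `min_overlap` of the word's area; all other words are plain text.

    `table_boxes` stands in for the output of a table detector, which is
    not implemented here.
    """
    plain = []
    tables = [[] for _ in table_boxes]
    for word in words:
        area = word.box.area()
        for i, table_box in enumerate(table_boxes):
            if area > 0 and word.box.intersection(table_box) / area >= min_overlap:
                tables[i].append(word.text)
                break
        else:
            # No table region claimed this word.
            plain.append(word.text)
    return plain, tables
```

Under these assumptions, the words routed to `tables` would go to a table-specific parser, while `plain` would be processed as running text.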