A hybrid information extraction approach using transfer learning on richly-structured documents

Chowdhury, A.G.; Schut, N.; Atzmueller, M.

DC Element	Value	Language
dc.contributor.author	Chowdhury, A.G.
dc.contributor.author	Schut, N.
dc.contributor.author	Atzmueller, M.
dc.contributor.editor	Seidl, T.
dc.contributor.editor	Fromm, M.
dc.contributor.editor	Obermeier, S.
dc.date.accessioned	2021-12-23T16:35:07Z	-
dc.date.available	2021-12-23T16:35:07Z	-
dc.date.issued	2020
dc.identifier.issn	16130073
dc.identifier.uri	https://osnascholar.ub.uni-osnabrueck.de/handle/unios/18337	-
dc.description	Conference of 2021 Learning, Knowledge, Data, Analytics Workshops, LWDA 2021 ; Conference Date: 1 September 2021 Through 3 September 2021; Conference Code:173242
dc.description.abstract	Richly-structured documents such as PDFs provide a rich source of information; however, extracting it is often challenging due to their complex structures. Computer vision, optical character recognition (OCR) and deep learning offer significant opportunities in the field of information extraction from PDF articles. However, it is extremely challenging to create a unified framework to extract information from different types of PDF documents due to their diverse visual appearance. In this paper, we propose a hybrid information extraction approach for documents with complex structures. In particular, it features a pipeline which uses OCR for plain textual information extraction and transfer learning for table detection from documents with such rich and complex structure. Our application context is given by technical (product) datasheets, in particular plastic product technical data sheets for service provisioning. We discuss first experimental results and outline several challenges in this context. © 2021 Copyright for this paper by its authors.
dc.description.sponsorship	This work has been funded by the Interreg North-West Europe program (Interreg NWE), project Di-Plast - Digital Circular Economy for the Plastics Industry (NWE729).
dc.language.iso	en
dc.publisher	CEUR-WS
dc.relation.ispartof	CEUR Workshop Proceedings
dc.subject	Deep learning
dc.subject	Extract informations
dc.subject	Information Extraction
dc.subject	Information retrieval, Complexes structure
dc.subject	Optical character recognition
dc.subject	Optical Character Recognition (OCR)
dc.subject	PDF document
dc.subject	Sources of informations
dc.subject	Structured document
dc.subject	Table Detection
dc.subject	Transfer Learning
dc.subject	Unified framework, Optical character recognition
dc.title	A hybrid information extraction approach using transfer learning on richly-structured documents
dc.type	conference paper
dc.identifier.scopus	2-s2.0-85118898876
dc.identifier.url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85118898876&partnerID=40&md5=214b8b172333f1ceac155a071df1b151
dc.description.volume	2993
dc.description.startpage	13
dc.description.endpage	25
dcterms.isPartOf.abbreviation	CEUR Workshop Proc.
