Towards Tabular Data Extraction From Richly-Structured Documents Using Supervised and Weakly-Supervised Learning

Author(s): Ghosh Chowdhury, A.
Ben Ahmed, M.
Atzmueller, M.
Keywords: Barlow Twins; Classification (of information); Computer vision; Data extraction; Domain specific; Image classification; Information retrieval; Information retrieval systems; Learning systems; Object detection; Object recognition; Pipelines; Barlow twin; Self-supervised Learning; Structure recognition; Structured document; Table Detection; Table structure; Table Structure Recognition; Tabular data; Supervised learning
Publication date: 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Journal: IEEE International Conference on Emerging Technologies and Factory Automation, ETFA
Volume: 2022-September
Abstract: 
Tabular information extraction from richly structured documents is a challenging task due to rich table and document structures. Supervised approaches to document table detection include image classification and object localization methods, which typically rely on manually annotated data that is often costly to acquire, especially for domain-specific datasets. Self-supervised learning is quickly closing the gap with supervised methods in computer vision research [1]. This paper investigates the impact of a self-supervised image classifier as the primary backbone in supervised object detection for document table detection. Furthermore, we study an approach for table structure recognition based on the pix2pix Generative Adversarial Network (GAN) approach [2]. We propose these approaches as the basis of a machine learning pipeline for table detection and structure recognition. Our evaluation results on different publicly available datasets, as well as on a domain-specific dataset, demonstrate the efficacy of the presented approaches towards tabular information extraction pipelines for richly structured documents. © 2022 IEEE.
Description: 
27th IEEE International Conference on Emerging Technologies and Factory Automation, ETFA 2022; Conference Date: 6 September 2022 through 9 September 2022; Conference Code: 183811
ISBN: 9781665499965
ISSN: 1946-0740
DOI: 10.1109/ETFA52439.2022.9921455
External URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85141435828&doi=10.1109%2fETFA52439.2022.9921455&partnerID=40&md5=899c638c63ada760548cc4fc7526a6e1
