Towards Tabular Data Extraction From Richly-Structured Documents Using Supervised and Weakly-Supervised Learning
Author(s): | Ghosh Chowdhury, A.; Ben Ahmed, M.; Atzmueller, M. |
Keywords: | Barlow Twins; Classification (of information); Computer vision; Data extraction; Domain specific; Image classification; Information retrieval; Information retrieval systems; Learning systems; Object detection; Object recognition; Pipelines; Barlow twin; Self-supervised Learning; Structure recognition; Structured document; Table Detection; Table structure; Table Structure Recognition; Tabular data; Supervised learning |
Publication date: | 2022 |
Publisher: | Institute of Electrical and Electronics Engineers Inc. |
Journal: | IEEE International Conference on Emerging Technologies and Factory Automation, ETFA |
Volume: | 2022-September |
Abstract: | Tabular information extraction from richly structured documents is a challenging task due to rich table and document structures. Supervised approaches to document table detection include image classification and object localization methods, which typically rely on manually annotated data that is often costly to acquire, especially for domain-specific datasets. Self-supervised learning is quickly closing the gap with supervised methods in computer vision research [1]. This paper investigates the impact of a self-supervised image classifier as the primary backbone in supervised object detection for document table detection. Furthermore, we study an approach for table structure recognition based on the pix2pix Generative Adversarial Network (GAN) approach [2]. We propose these approaches as the basis of a machine learning pipeline for table detection and structure recognition. Our evaluation results on different publicly available datasets, as well as on a domain-specific dataset, demonstrate the efficacy of the presented approaches towards tabular information extraction pipelines for richly structured documents. © 2022 IEEE. |
Description: | Conference: 27th IEEE International Conference on Emerging Technologies and Factory Automation, ETFA 2022; Conference Date: 6 September 2022 through 9 September 2022; Conference Code: 183811 |
ISBN: | 9781665499965 |
ISSN: | 1946-0740 |
DOI: | 10.1109/ETFA52439.2022.9921455 |
External URL: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85141435828&doi=10.1109%2fETFA52439.2022.9921455&partnerID=40&md5=899c638c63ada760548cc4fc7526a6e1 |