Good day everyone, I have an issue that I need some help with regarding extraction.
I have been given a native PDF pack that contains financial tables of varying sizes, ranging from 1 column to 18 columns. It is a relatively large document, and my problem is getting reliable extraction for each table type (I am classifying table types by their size). I have tried IXP, but it is not giving me the results I want, and Document Understanding didn’t seem to help either. My other option would be to use Content Generation to extract the data, but the problem is the double spreads, where 2 tables may sit side by side: extraction won’t be reliable because values may cross over between the tables. Some tables also span multiple pages.
Please could someone assist with a reliable solution to PERFECTLY extract all the tables from the document. Attached is a sample of what is in the Financials.
There is no way to perfectly auto-extract these financial tables with AI tools. The reliable approach is rule-based: classify each page by layout, split pages into regions where tables sit side by side, extract the data from each region using fixed coordinates, and then stitch multi-page tables together by header detection.
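
To make the region split and the stitching step more concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not a tested solution for your document: it assumes you already have the words of a page as dictionaries with `x0`, `x1` and `text` keys (most PDF text extractors can give you something similar), that the widest empty vertical band on a page is the gutter between two side-by-side tables, and that a continued table repeats its header row on the next page. The function names and the `min_gap` threshold are made up for the example and would need tuning against your real pages.

```python
from typing import Dict, List, Optional

Word = Dict[str, object]          # assumed shape: {"x0": float, "x1": float, "text": str}
Table = List[List[str]]           # rows of cell strings, first row is the header


def find_gutter(words: List[Word], min_gap: float = 40.0) -> Optional[float]:
    """Return the x midpoint of the widest empty vertical band, or None.

    Assumption: the gutter between side-by-side tables is wider than any
    column gap inside a table, and at least `min_gap` points wide.
    """
    if not words:
        return None
    intervals = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    best_gap, best_x = 0.0, None
    reach = intervals[0][1]       # right-most x covered by text so far
    for x0, x1 in intervals[1:]:
        gap = x0 - reach
        if gap > best_gap:
            best_gap, best_x = gap, reach + gap / 2
        reach = max(reach, x1)
    return best_x if best_gap >= min_gap else None


def split_regions(words: List[Word], min_gap: float = 40.0) -> List[List[Word]]:
    """Split a page into left/right word lists if a gutter is found."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        return [words]
    left = [w for w in words if float(w["x1"]) <= gutter]
    right = [w for w in words if float(w["x0"]) >= gutter]
    return [left, right]


def stitch_tables(tables: List[Table]) -> List[Table]:
    """Merge consecutive tables whose header row matches the previous one."""
    merged: List[Table] = []
    for rows in tables:
        if not rows:
            continue
        if merged and rows[0] == merged[-1][0]:
            merged[-1].extend(rows[1:])   # continuation: drop repeated header
        else:
            merged.append(list(rows))
    return merged
```

You would run `split_regions` on each page first, build the rows for each region separately (so values cannot cross over between the two tables), and then pass the per-page tables, in reading order, to `stitch_tables`. If your continued tables do not repeat the header, the stitching rule would need a different signal, for example matching column positions.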
