Hi guys, I trust you are all well. I am working on a project where I am given a native PDF pack (100+ pages) that contains many different tables. My goal is to extract all of these tables as they appear in the document and insert them into Excel. The problem arises when a page has two or more tables side by side. Using the digitized document text doesn’t seem to be the solution, because side-by-side tables get picked up as one table when I use Content Generation to generate the tables. Also, certain tables span multiple pages and need to be extracted as one table. Each table can be identified by the header above it. I will provide examples of what the document looks like and what the output in Excel should look like. Any ideas or solutions would be greatly appreciated!
Since you mentioned that each table has a header above it, you can try the following:
- Use Regex/Keyword Anchors to detect headers.
- Extract the table content that follows until the next header.
- This way side-by-side tables won’t merge, because each header acts as a boundary.
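As a rough illustration of the header-anchor idea, here is a minimal Python sketch that groups text lines under the most recent header. The header names in `HEADER_PATTERN` and the function name `split_by_headers` are placeholders, not anything from your project, and the sketch assumes the lines arrive in reading order for a single column region. If a page holds two tables side by side, the plain text stream may interleave them, so this would need to run on each region separately.

```python
import re

# Hypothetical header names; replace with the headers used in your PDF pack.
HEADER_PATTERN = re.compile(
    r"^\s*(Lease liabilities|Intangible assets)\s*$", re.IGNORECASE
)

def split_by_headers(lines):
    """Group text lines under the most recent header line that precedes them."""
    tables = {}
    current = None
    for line in lines:
        match = HEADER_PATTERN.match(line)
        if match:
            # A header starts (or resumes) a table; later lines belong to it.
            current = match.group(1)
            tables.setdefault(current, [])
        elif current is not None and line.strip():
            tables[current].append(line.strip())
    return tables
```

Because a repeated header resumes the same entry, a table whose header is printed again on the next page ends up in one list.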
Here are some steps you can try:
- Digitize PDF (using OCR or native text extraction).
- Detect Headers (Regex or ML Classifier).
- Extract Table Content:
  - Use Form Extractor or Regex Extractor.
  - Apply bounding box logic for side-by-side tables.
- Merge Multi-Page Tables:
  - Append rows until a new header is found.
- Write to Excel:
  - Each table → separate sheet or append below with header.
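The "append rows until a new header is found" step could look roughly like the Python sketch below. The fragment structure (a dict with `header` and `rows`) and the name `merge_page_tables` are assumptions made for illustration. It also assumes that a fragment without a header continues the table seen just before it, which may not hold when two side-by-side tables both continue onto the next page.

```python
def merge_page_tables(page_tables):
    """Merge per-page table fragments into whole tables keyed by header.

    page_tables: fragments in reading order, each a dict with
      "header": str or None (None = continuation of the previous table)
      "rows":   list of rows, each row a list of cell strings
    """
    merged = {}
    current = None
    for fragment in page_tables:
        header = fragment["header"] or current
        if header is None:
            continue  # continuation with no table before it; nothing to attach to
        rows = fragment["rows"]
        if header in merged:
            # Drop a column-title row that is repeated on a continuation page.
            if rows and merged[header] and rows[0] == merged[header][0]:
                rows = rows[1:]
            merged[header].extend(rows)
        else:
            merged[header] = list(rows)
        current = header
    return merged
```

Each key of the result can then be written to its own sheet, or appended below the previous table with its header.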
Another idea:
Try using the Document Understanding Framework with a Machine Learning (ML) Extractor.
The core idea is to treat each unique table type (identified by its header) as a separate document type for the ML model to learn.
Step 1: Digitization and Classification
- Use the Digitize Document activity.
- Use a Machine Learning Classifier (trained in AI Center) to analyze the whole 100+ page PDF pack. The classifier uses the headers and surrounding context to determine where each table begins and ends within the pack.
Step 2: Extraction
- Use the Data Extraction Scope activity.
- Plug in the ML Extractor skill that you trained. This ML Extractor is responsible for locating and accurately extracting the rows and columns of all identified tables.
Step 3: Post-Processing and Output (Excel)
- Loop through the extraction results.
- Use the Write Range Workbook activity to append the extracted data.
You’re dealing with a common challenge when extracting side-by-side tables and multi-page tables from native PDFs. Here are a few approaches that usually work well, depending on the document layout:
If the tables have consistent headers, you can use the Table Extraction activity with Anchor-based detection:
• Set the header text (e.g., “Lease liabilities”, “Intangible assets”, etc.) as the anchor.
• Extract only the table immediately following that header.
• This helps DU separate side-by-side tables instead of merging them.
When tables appear side-by-side, they are often better processed using coordinate-based extraction:
• Use Digitize Document → get the page dimensions.
• Split the page into left and right regions using cropping.
• Run Table Extraction on each region separately.
This prevents DU from merging both tables into one.
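Since not every page has side-by-side tables, one possible way to decide whether a page needs splitting is to look for a vertical whitespace gutter near the middle of the page. The Python sketch below assumes you can get word bounding boxes from the digitization output as dicts with `x0` and `x1` keys. That structure, the function names, and the `min_gap` and `center_band` thresholds are all illustrative assumptions that would need tuning on real pages.

```python
def find_gutter(words, page_width, min_gap=20.0, center_band=0.25):
    """Return the x-coordinate of a central vertical gutter, or None.

    words: iterable of dicts with "x0" and "x1" (left and right edge of a word).
    A gutter is a horizontal range near the page centre that no word covers.
    """
    intervals = sorted((w["x0"], w["x1"]) for w in words)
    if not intervals:
        return None
    lo = page_width * (0.5 - center_band)
    hi = page_width * (0.5 + center_band)
    best = None  # (gap width, gap midpoint)
    covered_until = intervals[0][1]
    for x0, x1 in intervals[1:]:
        gap = x0 - covered_until
        if gap >= min_gap:
            mid = covered_until + gap / 2
            if lo <= mid <= hi and (best is None or gap > best[0]):
                best = (gap, mid)
        covered_until = max(covered_until, x1)
    return best[1] if best else None

def split_words(words, gutter_x):
    """Split words into left and right regions around the gutter."""
    left = [w for w in words if w["x1"] <= gutter_x]
    right = [w for w in words if w["x0"] >= gutter_x]
    return left, right
```

If `find_gutter` returns `None`, the page can be treated as one continuous table. Two caveats: a full-width page title or footer would cover the gutter, so such lines may need to be excluded first, and a single table with a very wide gap between two columns could be mistaken for two tables. Checking that two known headers sit at roughly the same height on the page would be a useful extra signal.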
Thank you, I will give it a try. I was also wondering whether using Content Generation would be a good approach.
The Excel template that I need to write to is a consolidated Excel file with multiple sheets, each representing one of the tables found in the document.
I was thinking of creating a config with the table names and sheet names, and possibly a prompt dictionary, but I’m not sure how well this would work with side-by-side tables.
I thought about splitting the document into left and right regions but ran into some challenges. Not all pages have side-by-side tables; some pages have one continuous table. How do I determine which pages need to be split into left and right, and based on what properties?
Also, tables may continue onto the next page…
