How do I extract tables from PDFs where column positions keep shifting based on content length? Data Scraping fails when product descriptions are long and push other columns to the right, causing it to merge cells or skip rows entirely.
I’m processing invoices from multiple vendors and the table structure varies: some have borders, some don’t, and columns don’t stay in fixed positions. Anchor Base is too slow for 100+ line items per invoice.
Is there a way to make Data Scraping detect columns dynamically, or do I need to use Regex on raw text? Document Understanding seems too heavy for just extracting tabular data. Currently on Studio 2023.10 Enterprise.
This is blocking our AP automation and I need a scalable solution that works across different vendor formats without building separate workflows for each one.
Hi @Maurya-ji,
For dynamic PDF tables where columns shift, use Read PDF Text and apply Regex patterns instead of Data Scraping. Extract the full text, then match amounts, quantities, and descriptions with named capture groups, so each field is captured by its pattern rather than its horizontal position.
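To illustrate the idea (shown in Python rather than a UiPath expression; the field names and the number formats in the pattern are assumptions you would adapt to your invoices), a sketch of named-capture-group extraction over raw PDF text:

```python
import re

# Hypothetical line-item pattern: quantity, free-text description, unit price,
# and line total, separated by arbitrary runs of whitespace. Named groups pull
# each field out no matter where the columns start on the page.
LINE_ITEM = re.compile(
    r"^\s*(?P<qty>\d+)\s+"                 # quantity at the start of the line
    r"(?P<desc>.+?)\s+"                    # description (non-greedy, any length)
    r"(?P<unit_price>\d[\d,]*\.\d{2})\s+"  # unit price, e.g. 1,234.56
    r"(?P<total>\d[\d,]*\.\d{2})\s*$",     # line total at the end of the line
    re.MULTILINE,
)

def extract_line_items(pdf_text: str) -> list[dict]:
    """Return one dict per matched invoice line, keyed by field name."""
    return [m.groupdict() for m in LINE_ITEM.finditer(pdf_text)]

sample = """\
2  Widget with a very long description that pushes columns  10.00  20.00
1  Bolt  0.50  0.50
"""
print(extract_line_items(sample))
```

Because the description group is non-greedy and the numeric groups anchor the end of the line, a long description simply absorbs the extra width instead of shifting the other fields out of reach. In Studio you would put the same pattern into a Matches activity or a `System.Text.RegularExpressions.Regex` call.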
Another option is Get Text on the table area: split the result by line breaks, then split each row with String.Split using multiple spaces or tabs as delimiters. This handles variable spacing better than assuming fixed column positions.
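The row-splitting step can be sketched like this (Python stand-in for the String.Split call; the two-space minimum gap is an assumption you would tune to your PDFs):

```python
import re

def split_row(line: str, min_gap: int = 2) -> list[str]:
    """Split a table row on tabs or runs of min_gap+ spaces, so single
    spaces inside a description do not break the cell apart."""
    return [cell for cell in re.split(rf"\t+| {{{min_gap},}}", line.strip()) if cell]

row = "Widget A Pro   2   10.00   20.00"
print(split_row(row))  # ['Widget A Pro', '2', '10.00', '20.00']
```

The key design point is splitting on *runs* of whitespace rather than every space, which is also why a plain single-space delimiter in String.Split tends to shred multi-word descriptions.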
For multiple vendor formats, I created a config file with vendor-specific Regex patterns. The workflow identifies the vendor and applies the matching pattern, avoiding separate workflows for each format.
If your license includes Document Understanding, use it. The ML extractors handle shifting columns better than rule-based extraction, and you can train them on your own invoice formats. Initial setup takes time, but it scales well for 50+ vendors.
Try this, hope it works fine.
Thanks & Happy Automation