I have been working on a solution to extract data from multiple PDFs. These are invoice PDFs containing tabular data of items; please see the attached screenshot for reference.
There are various ways to extract specific fields such as name or invoice number from a PDF. However, getting the data out of the items table and saving it to Excel is proving challenging.
Has anyone solved this kind of problem?
I would appreciate a quick response.
Try to use the Document Understanding ML model to extract the table from the PDF.
@hasib08 I really don't want to use DU at this point. Is there any other way with the PDF activities?
Have you tried data scraping?
Yes, I did try data scraping; however, the data is not consistent across PDFs.
Here I tried extracting the tabular data from the PDF using string manipulation and regular expressions. Take a look; it might be helpful.
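To illustrate the regex idea outside of UiPath, here is a minimal Python sketch. It assumes the PDF text has already been read into a string and that each item row ends with a quantity, a unit price, and a line total; that layout, the sample text, and the names `ROW_PATTERN` and `parse_item_rows` are all made up for illustration, so the pattern would need adjusting to the real invoices.

```python
import re

# Assumed row layout (hypothetical): description, quantity, unit price, total.
ROW_PATTERN = re.compile(
    r"^(?P<description>.+?)\s+"
    r"(?P<quantity>\d+)\s+"
    r"(?P<unit_price>\d[\d,]*\.\d{2})\s+"
    r"(?P<total>\d[\d,]*\.\d{2})\s*$"
)


def parse_item_rows(text):
    """Return one dict per line that matches the assumed item-row layout."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            # Headers, totals, and other non-item lines are skipped.
            continue
        rows.append({
            "description": match.group("description"),
            "quantity": int(match.group("quantity")),
            # Strip thousands separators before converting to a number.
            "unit_price": float(match.group("unit_price").replace(",", "")),
            "total": float(match.group("total").replace(",", "")),
        })
    return rows


# Made-up sample text standing in for the output of a PDF text read.
sample_text = """Invoice No: INV-001
Description Qty Unit Price Total
Widget A 2 10.00 20.00
Large Gadget 10 1,250.50 12,505.00
Grand Total 12,525.00
"""

rows = parse_item_rows(sample_text)
for row in rows:
    print(row)
```

Only lines that fit the whole pattern are kept, so this works best when item rows are the only lines ending in a quantity followed by two amounts. If the layout differs between vendors, one pattern per layout is probably needed.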
Hi @desineediaditya, thank you for sharing this.
Yes, string manipulation is always an option. Before going down that route, I wanted to know whether an out-of-the-box technique is available.
Have you tried screen scraping? If not, try that and let me know if it works.
Can you share the PDF?
I used OmniPage OCR here; the results for the PDF files are below.
AnandMain.xaml (5.8 KB) dataPDF.pdf (132.7 KB)
A PDF can have either a single page or multiple pages. Please let me know how it goes.
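For the multi-page case, one possible approach is to merge the per-page tables and drop the header row when it repeats on later pages. The Python sketch below is only an illustration and is not taken from the attached workflow. It assumes each page has already been turned into a list of rows, and that the first row of the first page is the header. The function names and sample data are hypothetical.

```python
import csv
import io


def merge_page_tables(page_tables):
    """Combine per-page tables, keeping the header row only once.

    Assumes the first row of the first page is the header.
    """
    merged = []
    header = None
    for table in page_tables:
        for row in table:
            if header is None:
                header = row
                merged.append(row)
            elif row == header:
                # Header repeated at the top of a later page: skip it.
                continue
            else:
                merged.append(row)
    return merged


def to_csv_text(rows):
    """Serialize rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up tables for a two-page invoice.
page_tables = [
    [["Description", "Qty", "Total"], ["Widget A", "2", "20.00"]],
    [["Description", "Qty", "Total"], ["Widget B", "1", "5.00"]],
]

merged = merge_page_tables(page_tables)
csv_text = to_csv_text(merged)
print(csv_text)
```

A single-page PDF is just the one-element case of `page_tables`, so the same code path covers both.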