Easiest data extraction methods for a scanned PDF

Hi Team,

I have a requirement to extract some data (invoice number, shipment number, total invoice amount, credit party, and tabular data, i.e. charges containing line items, etc.) from a scanned PDF document.
This extracted information should then be compared with data in an Excel file to check for discrepancies.
Could you please recommend the best cost-effective methods for performing this data extraction using built-in functionality in UiPath?
(The average volume to be extracted is about 800 pages per month.)
Is Computer Vision a good approach in this case?

Hi @maria.josephina

Try either of the following:

1. Use the Read PDF With OCR activity.
2. Use regex to extract the required data and write it to Excel.

Or

1. Use Document Understanding.
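
For the first option, the regex step could look roughly like the sketch below. It is written in Python purely for illustration (in UiPath the same patterns would go into a regex activity or .NET expression). The sample OCR text, the label wording, and the names `PATTERNS` and `extract_fields` are all assumptions; the patterns would need adjusting to the real invoice layout.

```python
import re

# Hypothetical OCR output; real text depends on the invoice layout and OCR engine.
ocr_text = """
INVOICE
Invoice Number: INV-20391
Shipment No: SHP 774512
Credit Party: Acme Logistics Ltd
Total Invoice Amount: USD 1,284.50
"""

# Assumed label wording. Each pattern captures the value that follows its label.
PATTERNS = {
    "invoice_number": r"Invoice\s*(?:Number|No)\.?\s*[:\-]?\s*([A-Z0-9\-]+)",
    "shipment_number": r"Shipment\s*(?:Number|No)\.?\s*[:\-]?\s*([A-Z0-9 ]+?)\s*$",
    "credit_party": r"Credit\s*Party\s*[:\-]?\s*(.+?)\s*$",
    "total_amount": r"Total\s*Invoice\s*Amount\s*[:\-]?\s*(?:[A-Z]{3}\s*)?([\d,]+\.\d{2})",
}


def extract_fields(text):
    """Return a dict of field name -> captured string, or None when not found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        fields[name] = match.group(1).strip() if match else None
    return fields


fields = extract_fields(ocr_text)
for name, value in fields.items():
    print(f"{name}: {value}")
```

A field that comes back as `None` is a useful signal that either the OCR misread the label or the pattern does not fit the layout.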

Regards,

@maria.josephina,

If your documents are always going to be in the same format, it’s a good idea to extract the PDF text using OCR.

Try all the OCR engines available and pick whichever gives the best results.
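
One way to compare engines objectively is to type out a sample page by hand and score each engine's output against it. The sketch below does that in Python with the standard library; the engine names and sample strings are placeholders, not real engine output, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(reference, candidate):
    """Score 0.0 to 1.0 for how closely candidate matches reference."""
    # Collapse whitespace so layout differences do not dominate the score.
    ref = " ".join(reference.split())
    cand = " ".join(candidate.split())
    # autojunk=False keeps the score stable on long, page-sized strings.
    return SequenceMatcher(None, ref, cand, autojunk=False).ratio()


# Hand-typed text of one sample page (hypothetical).
reference = "Invoice Number: INV-20391 Total Invoice Amount: USD 1,284.50"

# Placeholder outputs; in practice, paste what each OCR engine returned.
engine_outputs = {
    "engine_a": "Invoice Number: INV-20391 Total Invoice Amount: USD 1,284.50",
    "engine_b": "lnvoice Nurnber: INV-2O391 Total lnvoice Amount: USD 1,284.5O",
}

scores = {name: similarity(reference, text) for name, text in engine_outputs.items()}
best_engine = max(scores, key=scores.get)
print(scores)
print("best:", best_engine)
```

Scoring a handful of representative pages rather than just one gives a fairer picture, since scan quality tends to vary.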

Thanks,
Ashok :slightly_smiling_face:

Hi @maria.josephina, if you have many documents with the same structure, go with the Document Understanding method; it's fast and more reliable.

Cheers, happy automation!

@singh_sumit My requirement covers only one document type, and I'm looking for cost-effective techniques.
Document Understanding involves a cost, right?

@maria.josephina Yes, you could say that. So try Read PDF With OCR and then use regex or string manipulation to pull out the values.
Cheers, happy automation!

If you have a fixed document type, then read the PDF data with OCR and apply regex to it to extract the necessary data.
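
For the charges table and the discrepancy check against Excel, something along these lines might work. This is only a Python sketch under assumptions: each OCR'd row is taken to be a description followed by an amount, the `expected` dict stands in for rows already read from the Excel file, and `parse_charges` and `find_discrepancies` are hypothetical helper names.

```python
import re
from decimal import Decimal

# Hypothetical OCR text of the charges table: description, then amount.
charges_text = """
Ocean Freight 1,000.00
Terminal Handling 184.50
Documentation Fee 100.00
"""

# Assumed row layout: free-text description, spaces, amount with two decimals.
ROW = re.compile(r"^(?P<desc>.+?)[ \t]+(?P<amount>[\d,]+\.\d{2})[ \t]*$", re.MULTILINE)


def parse_charges(text):
    """Map each charge description to its amount as a Decimal."""
    return {
        m.group("desc").strip(): Decimal(m.group("amount").replace(",", ""))
        for m in ROW.finditer(text)
    }


def find_discrepancies(extracted, expected):
    """List (description, extracted, expected) for every mismatch or missing row."""
    issues = []
    for desc in sorted(set(extracted) | set(expected)):
        got, want = extracted.get(desc), expected.get(desc)
        if got != want:
            issues.append((desc, got, want))
    return issues


# Stand-in for the Excel data; reading the workbook is outside this sketch.
expected = {
    "Ocean Freight": Decimal("1000.00"),
    "Terminal Handling": Decimal("184.50"),
    "Documentation Fee": Decimal("110.00"),
}

discrepancies = find_discrepancies(parse_charges(charges_text), expected)
for desc, got, want in discrepancies:
    print(f"{desc}: invoice={got} excel={want}")
```

Amounts are parsed as `Decimal` rather than floats so that currency values compare exactly.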

Thanks,
Bharat