Scrape Text from Scanned PDF

I have 1000 scanned PDFs. I have to scrape data from different fields, and the position of the fields changes in every PDF. How can I achieve this?

@Abhishek_Bharti Hi,
You can achieve this by using Regex
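
A minimal sketch of the regex idea, written in Python rather than as a UiPath workflow. The field labels (`Invoice No`, `Date`) are hypothetical placeholders, not taken from the actual documents. The point is that each pattern anchors on the label text instead of a page position, so it still matches when the field moves around:

```python
import re

# Hypothetical field labels; replace them with the labels that appear in your PDFs.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\.?\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.IGNORECASE),
}

def extract_fields(text):
    """Return a dict of field name -> first matched value (None if the label is missing)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

Calling `extract_fields` on the text read from each PDF gives one dictionary per document, with `None` marking the fields that need a manual look.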

Hi @Abhishek_Bharti

Try this

Regarding custom activities or functions for Excel, Pdf, Notepad, etc. - To have market place

Thanks
Ashwin.S

All the PDFs have more than 20 pages.

@Abhishek_Bharti Can you share one PDF and list the data you want to scrape?

I used Regex and I can find the position of the required fields, but the data that defines each field is a paragraph.
If I use OCR, then I do not get the exact text that is in the paragraph.

I have also tried changing the OCR engine and the scraping method, and increasing/decreasing the scale.
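
For a field whose value is a whole paragraph, one possible approach is to capture everything between the field's label and the next known label, then collapse the line breaks that OCR introduces. This is only a sketch in Python; it assumes the paragraph is bounded by two recognizable labels, and the label names used with it (for example `Description` and `Terms`) are hypothetical:

```python
import re

def extract_paragraph(text, start_label, end_label):
    """Return the text between start_label and end_label with line breaks collapsed."""
    pattern = re.compile(
        re.escape(start_label)
        + r"\s*[:\-]?\s*(.*?)\s*(?="
        + re.escape(end_label)
        + r"|\Z)",
        re.IGNORECASE | re.DOTALL,  # DOTALL lets "." span multiple lines
    )
    match = pattern.search(text)
    if match is None:
        return None
    # OCR output usually breaks a paragraph across lines; join the pieces with single spaces.
    return " ".join(match.group(1).split())
```

If the end label is missing, the pattern falls back to the end of the text, so the result may include more than the paragraph and should be checked.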

Hi @Abhishek_Bharti,

Since you have scanned PDFs, you have to read them using the PDF OCR activity. I would first check the accuracy and precision of the data extracted by OCR; if it works fine, you can use Regex to extract the data.
If it does not work, then you will have to go for an ML/ICR solution: either build one yourself or use ABBYY FlexiCapture or another tool on the market.
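
A rough way to check the OCR accuracy is to hand-type the expected text for a few sample pages and compare it with what the OCR returns. The sketch below uses Python's standard `difflib`; the function names and the 0.95 threshold are arbitrary assumptions for illustration, not a recommended standard:

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so layout differences are not counted as errors."""
    return " ".join(text.lower().split())

def ocr_similarity(ocr_text, reference_text):
    """Return a similarity ratio in [0, 1] between OCR output and a hand-typed reference."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(reference_text)).ratio()

def ocr_is_usable(ocr_text, reference_text, threshold=0.95):
    """Decide whether the OCR output is close enough to the reference to trust regex on it."""
    return ocr_similarity(ocr_text, reference_text) >= threshold
```

Running this over a handful of pages per OCR engine gives a number to compare engines and scale settings, instead of judging the output by eye.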

@shibani Sorry, it's confidential, so I can't share it.
I can talk with management and then share it if they permit.

Thank You

If it's a handwritten document and all the OCR engines are failing,
I would recommend using ABBYY as the OCR tool for the task.

No, it is not a handwritten doc.

It was first a Word file, then it was converted into PDF.

Again, if it's random extraction and no regex can be obtained, the best way would be to use the ABBYY OCR. You can also try

@Abhishek_Bharti - you might also want to have a look at this: How to use the IntelligentOCR Package - it might help in getting you started writing your own extractor!