Scrape Text from Scanned PDF

Abhishek_Bharti · November 8, 2019, 4:31am

I have 1000 scanned PDFs. I have to scrape data from different fields, and the position of the fields changes in every PDF. How can I achieve this?

shibani · November 8, 2019, 4:33am

@Abhishek_Bharti Hi,
You can achieve this by using Regex.

AshwinS2 · November 8, 2019, 4:33am

Hi @Abhishek_Bharti

Try this

Regarding custom activities or functions for Excel, Pdf, Notepad, etc. - To have market place

Thanks
Ashwin.S

Abhishek_Bharti · November 8, 2019, 4:46am

All the PDFs have more than 20 pages.

shibani · November 8, 2019, 4:52am

@Abhishek_Bharti Can you share one PDF and list the data you want to scrape?

Abhishek_Bharti · November 8, 2019, 4:52am

I used Regex and I can find the position of the required fields, but the data that defines each field is a paragraph.
If I use OCR, then I am not getting the exact text that is in the paragraph.

I have also changed the OCR engine and the scraping method, and increased/decreased the scale.
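For a field whose value is a whole paragraph, one option is to capture everything between one field label and the next, instead of matching the value itself. Below is a minimal sketch in Python rather than UiPath activities. It assumes the OCR text is already available as a string and that each label starts a line. The label names and the `extract_fields` helper are hypothetical and only for illustration.

```python
import re

# Hypothetical field labels; replace with the labels used in your documents.
FIELD_LABELS = ["Scope of Work", "Payment Terms", "Termination"]


def extract_fields(ocr_text, labels=FIELD_LABELS):
    """Map each label found in the text to the paragraph that follows it."""
    # Match any known label at the start of a line, with an optional colon.
    alternation = "|".join(re.escape(label) for label in labels)
    label_re = re.compile(
        rf"^[ \t]*({alternation})[ \t]*:?[ \t]*",
        re.IGNORECASE | re.MULTILINE,
    )
    canonical = {label.lower(): label for label in labels}

    matches = list(label_re.finditer(ocr_text))
    fields = {}
    for i, match in enumerate(matches):
        # The value runs until the next label, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_text)
        value = ocr_text[match.end():end]
        # Collapse OCR line breaks and repeated spaces into single spaces.
        fields[canonical[match.group(1).lower()]] = " ".join(value.split())
    return fields


sample = (
    "Payment Terms: Net 30 days from\n"
    "the invoice date.\n"
    "\n"
    "Scope of Work\n"
    "Install and configure\n"
    "the servers.\n"
)
print(extract_fields(sample))
```

Because the labels are located by search, the order and position of the fields can differ from one PDF to the next. The sketch would split a paragraph incorrectly if a label also appears at the start of a line inside a value, so that case needs separate handling.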

SaurabhDisawal · November 8, 2019, 4:56am

Hi @Abhishek_Bharti,

Since you have scanned PDFs, you have to read them using the PDF OCR activity. I would first check the accuracy and precision of the data extracted using OCR; if it works fine, you can use Regex to extract the data.
If it does not work, then you have to go for an ML/ICR solution: either build it yourself or use ABBYY FlexiCapture or another tool on the market.
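One rough way to do the accuracy check described above is to type out a sample page by hand and compare it with the OCR output. The sketch below uses Python's standard `difflib` for a character-level similarity score. This is an approximation, not a formal character error rate, and the `0.95` threshold and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.split()).lower()


def ocr_similarity(ocr_text, reference_text):
    """Return a similarity score between 0.0 and 1.0 (1.0 means identical)."""
    matcher = SequenceMatcher(
        None, normalize(ocr_text), normalize(reference_text), autojunk=False
    )
    return matcher.ratio()


def ocr_is_usable(ocr_text, reference_text, threshold=0.95):
    """Assumed rule of thumb: treat OCR as good enough above the threshold."""
    return ocr_similarity(ocr_text, reference_text) >= threshold


reference = "The quick brown fox"
ocr_output = "Tne quick brovvn fox"
print(round(ocr_similarity(ocr_output, reference), 3))
print(ocr_is_usable(ocr_output, reference))
```

Running this on a few representative pages for each OCR engine gives a number to compare before deciding between Regex and an ML/ICR tool.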

Abhishek_Bharti · November 8, 2019, 4:57am

@shibani Sorry, it’s confidential, so I can’t share it.
I can talk with management and share it if they permit.

Thank You

Shubham_Varshney · November 8, 2019, 5:07am

If it’s a handwritten document and all the OCR engines are failing,
I would recommend using ABBYY as the OCR tool for the task.

Abhishek_Bharti · November 8, 2019, 5:16am

No, it is not a handwritten document.

It was first a Word file, then it was converted into a PDF.

Shubham_Varshney · November 8, 2019, 5:54am

Again, if the fields appear at random positions and no regex can be derived, the best way would be to use the ABBYY OCR engine. You can also try:

https://docs.uipath.com/activities/docs/about-the-intelligent-ocr-activities-pack

Ioana_Gligan · November 18, 2019, 7:57am

@Abhishek_Bharti - you might also want to have a look at this: How to use the IntelligentOCR Package - it might help in getting you started writing your own extractor!
