Not able to extract some data from scanned PDF using OCR

Rakesh_Tiwari · July 27, 2022, 5:47am

Hi All,

I am trying to extract some data from a scanned PDF but got the error below.

error: Read PDF With OCR: Could not find file ‘C:\Users.…\Documents\UiPath\FirstAutomation_OCR\C’.

What I have done:

Made a config file and read the PDF file path from it.
Looped through each PDF file using a For Each activity.
Used the Read PDF With OCR activity with both Tesseract OCR and Microsoft OCR to read the text, but did not get the desired output.

Note: since some PDF files have more than 6 pages, I am trying to read only 1 page.

Kindly suggest some solutions.

Happy automation

THIRU_NANI · July 27, 2022, 5:54am

Hey!

Before reading the PDF, check whether the PDF file exists in the folder.

Like this:

Assign strFolderPath = "C:\Users\Name\Documents\UiPath\FirstAutomation_OCR\"

Assign ArrFiles = Directory.GetFiles(strFolderPath,"*.pdf")

Add a For Each activity and pass ArrFiles to it.

You’ll get all the PDF files one by one…

Inside the For Each, add a Read PDF activity, pass item as the file path, and set the output to strPdfOutput.

Now use string manipulation to get the desired output.
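
The same idea can be sketched outside UiPath. This is only a rough illustration in Python, not UiPath code: `list_pdf_files` and `first_loop_item` are made-up helper names, and the folder is whatever you pass in. The first helper checks that the folder exists and collects the full paths of the PDF files in it. The second shows one possible reason (a guess from the error text, not a confirmed cause) for a path that ends in `\C`: if the loop runs over the path string itself instead of a list of files, each item is a single character, and the first one is "C".

```python
from pathlib import Path
import tempfile


def list_pdf_files(folder):
    """Return full paths of the .pdf files directly inside `folder`, sorted by path.

    Hypothetical helper: raises FileNotFoundError if the folder is missing,
    so the problem shows up before any PDF reading is attempted.
    """
    folder_path = Path(folder)
    if not folder_path.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder_path}")
    return sorted(
        p for p in folder_path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def first_loop_item(collection):
    """Return the first item a loop over `collection` would produce."""
    return next(iter(collection))


if __name__ == "__main__":
    # Build a throwaway folder so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as folder:
        for name in ("invoice1.pdf", "invoice2.pdf", "readme.txt"):
            (Path(folder) / name).touch()

        for pdf_path in list_pdf_files(folder):
            # Each item is a full file path, which is what a read step needs.
            print(pdf_path.name, pdf_path.is_file())

    # Assumed pitfall: looping over a path string yields single characters.
    print(first_loop_item(r"C:\Users\Name\Documents"))
```

If the loop item in the workflow turns out to be a single character rather than a full path, it is worth checking that the For Each is given the array of files and not the folder path string.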

Regards,
NaNi
