I receive PDF files every day and need to extract data from them one by one. While iterating through the PDFs, I need to check whether each one is a regular (digital) PDF or a scanned one. If it is regular, we have fixed-position rules and will extract the data with string manipulation; if it is scanned or image-based, we will use OCR to extract the data. Has anyone here worked with OCR? Please guide me on how to proceed and how to check the limitations or feasibility.
@shraddha071987 I am not sure this is the best way to decide whether a PDF is scanned or digital, but if the output of the Read PDF Text activity is empty, the file is most likely a scanned PDF and you can move on to the OCR part to extract the data.
You can give this method a try and confirm that a scanned PDF returns empty text.
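To make the empty-text check concrete, here is a minimal sketch in Python rather than a UiPath workflow. It assumes you already have the string returned by Read PDF Text; the function name `needs_ocr` and the `min_chars` threshold are hypothetical, added because a scanned PDF can sometimes return only whitespace or a few stray characters instead of a truly empty string.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when the text layer looks empty, which suggests a scanned PDF.

    min_chars is an assumed threshold, not a UiPath setting; tune it on your own files.
    """
    # Count only non-whitespace characters so blank lines and spaces are ignored.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


# Example: route each document based on what the plain text read returned.
for text in ["", "   \n\t", "Invoice number 12345, total amount 99.50"]:
    branch = "OCR" if needs_ocr(text) else "fixed-position rules"
    print(repr(text), "->", branch)
```

In a workflow the same idea would probably be an If activity on the trimmed length of the Read PDF Text output, but please verify that against your own sample files.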
To extract data from a PDF, whether it is scanned or regular, you can try the Document Understanding feature in UiPath. In your case the data sits in almost the same position in every PDF, so you can try the Form Extractor to extract it.
Yes, I was also thinking of trying this. I thought people might be familiar with this kind of scenario and know the optimal way to handle it. Thanks for the response, I will surely try this.
Try this: Read PDF Files. It will help you tell scanned PDFs apart from normal PDFs.
This will give you the basic idea.
Drag an Input Dialog activity and connect it to the Start Node.
In the Properties panel, add the expression "Choose one option below:" in the Label field.
Add the expression {"Read PDF Text", "Read PDF With OCR"} in the Options field.
Add the value "Options" in the Title field.
Add the variable chooseOption in the Result field.
Place a Flow Decision activity below the Input Dialog activity and connect the two.
In the Properties panel, add the expression chooseOption = "Read PDF Text" in the Condition field.
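The branching that these steps set up can be sketched as follows. This is plain Python, not UiPath code: `OPTIONS` mirrors the Options field, `choose_option` stands in for the chooseOption variable, and the function name `pick_branch` and the returned labels "text" and "ocr" are hypothetical.

```python
# The two choices offered by the Input Dialog in the steps above.
OPTIONS = ("Read PDF Text", "Read PDF With OCR")


def pick_branch(choose_option: str) -> str:
    """Mirror the Flow Decision: the True branch reads text, the False branch uses OCR."""
    if choose_option not in OPTIONS:
        raise ValueError(f"Unexpected option: {choose_option!r}")
    # Same test as the condition chooseOption = "Read PDF Text".
    return "text" if choose_option == "Read PDF Text" else "ocr"


print(pick_branch("Read PDF Text"))
print(pick_branch("Read PDF With OCR"))
```

Note that here a person picks the branch by hand. To automate the choice, the condition could instead test whether the text read came back empty.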