PDF extraction from multiple PDFs: how to check which PDF is scanned and which is regular

Hi,

I receive PDF files every day and need to extract data from them one by one. While iterating through the PDFs, I need to check whether each one is regular or scanned. If it is regular, we have fixed-position rules and will extract the data with string manipulation; if it is scanned (image-based), we will use OCR to extract the data. Has anyone worked with OCR who can guide me on how to proceed, and on how to check the limitations or feasibility?

Hi @shraddha071987

Is the data in the same position in all the PDFs, irrespective of whether they are scanned or regular?

Yes, almost fixed positions. But how can we check whether a file is regular or scanned in UiPath?

@shraddha071987 I am not sure this is the best way to decide whether a PDF is scanned or digital, but if the Read PDF Text activity returns empty output, the file is most likely a scanned PDF, and you can move on to the OCR part to extract the data.

You can give this method a try and confirm that a scanned PDF produces empty text.
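To make the "empty output" test concrete, here is a minimal sketch in Python (not UiPath code). The function name `is_probably_scanned` and the `min_chars` threshold are my own assumptions for illustration; the idea is only that a text layer with no visible characters suggests a scanned PDF.

```python
from typing import Optional


def is_probably_scanned(text: Optional[str], min_chars: int = 1) -> bool:
    """Guess whether a PDF is scanned from the text its text layer produced.

    `text` stands for whatever the plain text read returned (hypothetical input).
    `min_chars` is an assumed threshold: fewer visible characters than this
    is treated as "no usable text layer".
    """
    if text is None:
        return True
    # Drop all whitespace so that a page of blanks/newlines counts as empty.
    visible = "".join(text.split())
    return len(visible) < min_chars
```

In a UiPath workflow the equivalent condition in an If activity would probably be the .NET expression `String.IsNullOrWhiteSpace(pdfText)`, where `pdfText` is an assumed name for the output variable of Read PDF Text. A threshold above 1 may help if scanned files carry a stray character or two in their text layer, but that is something to verify on your own files.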

Hi @shraddha071987

As @supermanPunch said, you can try that approach.

To extract data from a PDF whether it is scanned or regular, you can also try the Document Understanding feature in UiPath. In your case the data is in almost the same position in every PDF, so you can try the Form Extractor to extract it.

Hope it helps you

Regards

Nived N :robot:

Happy Automation :relaxed::relaxed::relaxed:

Yes, I was also thinking of trying this. I thought people might know about this kind of scenario and the optimal way to handle it. Thanks for the response, I will surely try this.

Hi Nived,

The client is not using a version with Document Understanding right now. Please let me know if you have any other solution.

Hi @shraddha071987, did you try the approach @supermanPunch suggested?

Hi @shraddha071987,

Try this Read PDF Files approach. It will help you identify scanned PDFs and normal PDFs.

This will give you the basic idea:
  1. Drag an Input Dialog activity and connect it to the Start node.
    In the Properties panel, set the Label field to "Choose one option below:".
    Set the Options field to {"Read PDF Text", "Read PDF With OCR"}.
    Set the Title field to "Options".
    Set the Result field to the variable chooseOption.
  2. Place a Flow Decision activity below the Input Dialog activity and connect the two.
    In the Properties panel, set the Condition field to chooseOption = "Read PDF Text".

Best Regards

I am not sure, but try the Image Exists activity. It returns a Boolean value, which you can use as the condition of an If activity.

If it is true, use Read PDF With OCR; otherwise use Read PDF Text.

Hi,

First, use the Read PDF Text activity to read all the files you want to process.

The Read PDF Text activity returns a String.

If you use the Read PDF Text activity on a scanned PDF file, it returns an empty (or null) value.

So check whether this String value is null or empty. If it is, go for Read PDF With OCR, so that OCR runs only on the scanned PDF files.
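For a batch of files, the check described above could be sketched roughly like this in Python. It is only an outline of the routing logic: `extract_all`, `read_text` and `read_ocr` are hypothetical names, with the two readers standing in for the Read PDF Text and Read PDF With OCR activities.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def extract_all(
    paths: Iterable[str],
    read_text: Callable[[str], Optional[str]],
    read_ocr: Callable[[str], str],
) -> Dict[str, Tuple[str, str]]:
    """Read every PDF with the plain text reader first, then fall back to OCR.

    Returns a mapping of path -> (method, text), where method is "text" or "ocr".
    """
    results: Dict[str, Tuple[str, str]] = {}
    for path in paths:
        text = read_text(path)
        if text is None or not text.strip():
            # No usable text layer: assume a scanned PDF and run OCR only here.
            results[path] = ("ocr", read_ocr(path))
        else:
            # Regular PDF: the fixed-position string rules can be applied to this text.
            results[path] = ("text", text)
    return results
```

Recording which method was used for each file makes it easier to apply the fixed-position rules to regular PDFs and a separate, more tolerant set of rules to OCR output, which tends to be noisier.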

Hope that helps you.