How to loop through each page of a PDF file looking for text (OCR or Digitize)?

As shown in the screenshot below, I have a scanned PDF file. Tax PDFs can run up to 100 pages, so I don’t want to digitize the whole file at once, since that would take too long.

Is there a way to loop through each page of a PDF file and check whether it contains a given text? For example, if the bot finds the page that contains the text “U.S. Individual Income Tax Return”, it should stop and return the page number. Any ideas would be a great help, thank you.

  1. Get PDF files from directory
  2. For each PDF, loop through its pages and use OCR to read each page, since the file is scanned.
  3. Identify whether a string of text or a text format occurs in the PDF (using OCR or Digitize)
  4. Identify which pages of the PDF match the text criteria and get their page numbers.

Any ideas on this? Thanks.

That might be hard.

Your requirement is self-defeating.

Remember, the document is scanned, so for UiPath to read it, it has to go through OCR. How can you find that specific string if the document has not been read by the program yet?

@Jelrey - Please take a look at this … Download this workflow; it will give you an idea of how to loop through pages. Here I have looped through the pages and deleted the page where the text was found. The one thing you have to change is that instead of Read PDF Text, you have to use Read PDF With OCR.
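For readers who want to see the page-by-page idea outside of UiPath, here is a minimal Python sketch. It is not the attached workflow. The `ocr_page` callable is a hypothetical stand-in for whatever OCR engine reads a single page, and the function and variable names are made up for illustration. The point is that pages are OCR'd one at a time and the loop stops at the first match, so a 100-page file is only read as far as the page that contains the text.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so OCR line breaks do not hide a match.
    return " ".join(text.lower().split())


def find_first_page_with_text(
    page_count: int,
    ocr_page: Callable[[int], str],  # hypothetical: returns the OCR text of one page
    needle: str,
) -> Optional[int]:
    """Return the 1-based number of the first page containing needle, or None."""
    target = normalize(needle)
    for page_number in range(1, page_count + 1):
        # Only this single page is read; later pages are never touched after a match.
        if target in normalize(ocr_page(page_number)):
            return page_number
    return None


# Fake three-page document standing in for a scanned PDF (illustration only).
FAKE_PAGES = {
    1: "Cover sheet",
    2: "Form 1040  U.S. Individual\nIncome Tax Return",
    3: "Schedule A",
}
ocr_calls = []


def fake_ocr_page(page_number: int) -> str:
    ocr_calls.append(page_number)  # record which pages were actually read
    return FAKE_PAGES[page_number]


found = find_first_page_with_text(
    len(FAKE_PAGES), fake_ocr_page, "U.S. Individual Income Tax Return"
)
print(found, ocr_calls)
```

In the fake document the match is on page 2, so page 3 is never passed to the OCR stand-in. In a real setup, `ocr_page` would wrap a single-page OCR call, which corresponds to running the OCR activity with a one-page range inside the loop.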

@prasath17, do you have an example that does the opposite of what you did? Instead of deleting the page where the text is found, I want to keep the page that contains the text and delete the other pages that do not contain it. Is that possible?

Thank you.

@Jelrey - Try moving the PDF splitter from the Else branch to the Then branch. It will then split out the pages where the text is found and keep them separate, and you can combine them at the end.

I guess it should work…
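Swapping the branches amounts to keeping the matching pages instead of the non-matching ones. A rough Python sketch of that decision follows, assuming the per-page text has already been read. The `page_texts` dictionary and the function name are hypothetical and only illustrate the logic, not the workflow itself.

```python
from typing import Dict, List, Tuple


def split_pages_by_text(
    page_texts: Dict[int, str],  # hypothetical: page number -> text read from that page
    needle: str,
) -> Tuple[List[int], List[int]]:
    """Return (pages_to_keep, pages_to_drop) based on whether needle appears."""
    target = needle.lower()
    keep: List[int] = []
    drop: List[int] = []
    for page_number in sorted(page_texts):
        if target in page_texts[page_number].lower():
            keep.append(page_number)  # "Then" branch: text found, keep the page
        else:
            drop.append(page_number)  # "Else" branch: text not found, discard the page
    return keep, drop


# Illustration data only.
sample_pages = {
    1: "Cover sheet",
    2: "U.S. Individual Income Tax Return",
    3: "Schedule A",
}
keep, drop = split_pages_by_text(sample_pages, "U.S. Individual Income Tax Return")
print(keep, drop)
```

The pages in `keep` are the ones to split out and join back into the final PDF; the pages in `drop` are simply left behind.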


Which activity combines the PDFs again, @prasath17? Thanks.

@Jelrey - It is already there in the workflow…


The str_inputfile in your example is the page, right? And the splitter path is where we save the deleted page? Am I correct?

@Jelrey - Yes, that is correct. Once you have downloaded the workflow, go to the splitted folder, delete all the files in it, and try running the workflow as is.

Thanks for the idea, @prasath17. Much appreciated. I will update you on my progress.
