How to loop through each page of a PDF file looking for text, or digitize?

Jelrey · May 21, 2021, 7:12am

As shown in the screenshot below, I have a scanned PDF file. Tax PDFs can run up to 100 pages, so I don’t want to digitize the whole document at once because it would take too long.

Is there a way to loop through each page of a PDF file and check for a piece of text? For example, if the bot finds a page that contains the text “U.S. Individual Income Tax Return”, it should stop and return the page number. Any ideas would be a great help. Thank you.
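
To make the idea concrete outside of UiPath, here is a minimal Python sketch of the "read one page, check, stop at the first hit" loop. It is only an illustration: `ocr_page` is a hypothetical stand-in for whatever OCR engine is actually used, and `find_first_page` and `normalize` are names made up for this example. Whitespace is collapsed before comparing, on the assumption that OCR output may break the phrase across lines.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not split the phrase."""
    return " ".join(text.lower().split())


def find_first_page(num_pages: int,
                    ocr_page: Callable[[int], str],
                    needle: str) -> Optional[int]:
    """Return the 1-based number of the first page containing needle, else None."""
    target = normalize(needle)
    for page in range(1, num_pages + 1):
        # ocr_page is a hypothetical callback: it OCRs a single page on demand,
        # so pages after the first match are never read.
        if target in normalize(ocr_page(page)):
            return page
    return None


# Usage with fake OCR output instead of a real engine.
fake_pages = {
    1: "Cover sheet",
    2: "Form 1040\nU.S. Individual Income\nTax Return",
    3: "Schedule A",
}
first_hit = find_first_page(3, lambda n: fake_pages[n],
                            "U.S. Individual Income Tax Return")
print(first_hit)
```

Because the OCR call sits inside the loop, a 100-page file only costs as many OCR calls as it takes to reach the first matching page.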

Jelrey · May 21, 2021, 7:23am

Get the PDF files from a directory.
Loop through each PDF with a For Each and use OCR to read each page, since the files are scanned.
Identify whether a string of text or a text format occurs in the PDF (using OCR or Digitize).
Identify which pages of the PDF match the text criteria and get their page numbers.
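
The four steps above could be sketched roughly like this in Python (not UiPath). Everything engine-specific is passed in as a callback: `page_count` and `ocr_page` are hypothetical placeholders for the PDF and OCR tooling, and `list_pdfs` and `matching_pages` are names invented for this sketch.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def list_pdfs(directory: str) -> List[Path]:
    """Step 1: return the PDF files in a directory, sorted by name."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def matching_pages(pdfs: Iterable[Path],
                   page_count: Callable[[Path], int],
                   ocr_page: Callable[[Path, int], str],
                   needle: str) -> Dict[str, List[int]]:
    """Steps 2-4: map each PDF name to the 1-based pages whose text contains needle."""
    target = " ".join(needle.lower().split())
    result: Dict[str, List[int]] = {}
    for pdf in pdfs:
        hits: List[int] = []
        # page_count and ocr_page are hypothetical callbacks supplied by the caller.
        for page in range(1, page_count(pdf) + 1):
            text = " ".join(ocr_page(pdf, page).lower().split())
            if target in text:
                hits.append(page)
        result[pdf.name] = hits
    return result
```

Unlike a stop-at-first-hit search, this version reads every page of every file, so it is the slower option when only the first matching page is needed.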

Any ideas on this? Thanks.

josephatomwanzia · May 21, 2021, 9:52am

That might be hard.

Your requirement is self-defeating.

Remember the document is scanned, and for UiPath to read it, it has to go through OCR. How can you find that specific string if the program has not read the document yet?

prasath17 · May 21, 2021, 11:35am

@Jelrey - Please take a look at this… download this workflow and it will give you an idea of how to loop through pages… here I have looped through the pages and deleted a page where the text was found… the one thing you have to change is that instead of Read PDF Text, you have to use Read PDF With OCR…

Jelrey · May 21, 2021, 3:28pm

@prasath17, do you have an example that does the opposite of yours? Instead of deleting the page where the text is found, I want to keep the pages that contain the text and delete the pages that do not. Is that possible?

Thank you.

prasath17 · May 21, 2021, 3:30pm

@Jelrey - Try moving the PDF splitter from the Else branch to the Then branch… it will then split out the pages where the text is found and keep them separate, and you can combine them at the end…

I guess it should work…
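
In plain terms, swapping the branches just flips which group of pages gets split out. A rough Python sketch of that decision, assuming the per-page text has already been read (for a scanned file, by OCR); `partition_pages` is a name made up for this example and is not part of the workflow:

```python
from typing import Iterable, List, Tuple


def partition_pages(page_texts: Iterable[str],
                    needle: str) -> Tuple[List[int], List[int]]:
    """Split 1-based page numbers into (keep, drop) by whether the text contains needle."""
    target = needle.lower()
    keep: List[int] = []
    drop: List[int] = []
    for number, text in enumerate(page_texts, start=1):
        # Then branch: text found, keep the page. Else branch: discard it.
        (keep if target in text.lower() else drop).append(number)
    return keep, drop
```

The `keep` list comes out in page order, so combining the kept pages at the end preserves their original sequence.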

Jelrey · May 21, 2021, 3:32pm

Which activity combines the PDFs again, @prasath17? Thanks.

prasath17 · May 21, 2021, 3:33pm

@Jelrey - It is already there in the workflow…

Jelrey · May 21, 2021, 3:33pm

The str_inputfile in your example is the page, right? And the splitter path is where the deleted page is saved? Am I correct?

prasath17 · May 21, 2021, 3:35pm

@Jelrey - Yes, that is correct… Once you have downloaded the workflow, go to the splitted folder and delete all the files… then try running the workflow as is.

Jelrey · May 21, 2021, 3:36pm

Thanks for the idea, @prasath17. Appreciated. I will update you on my progress.

system · May 24, 2021, 3:37pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
