Split a PDF file into multiple files based on specific text (the PDF consists of images)

I want to read a PDF file that consists of scanned images, look for a specific text in it, and then split the file into multiple files at the pages where that text appears.
To do that, I need the page numbers of the pages that contain the text. How can I get the page number using OCR on a scanned PDF?
I have already done this with a regular (text-based) PDF, but not with a scanned one.
Can anybody assist me with this?
Thanks

Instead of Read PDF Text, use Read PDF With OCR.

I did.
My question is: suppose a text like “split from here” appears on several pages of the file. I want to get the page numbers where “split from here” appears. I parsed a regular PDF file before and could get the page numbers, but I do not know how to do this with a scanned PDF.

Hi @mahsa.mohk

I suppose you could work page by page. Read first page, process it, take action, read the second page, take action, and so on. This way you would always know which page you are on. You can store this information and take a cumulative action afterwards (splitting the PDF) based on that.
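As a rough sketch of that page-by-page idea, written in Python rather than as a workflow: assume you already have the OCR text of each page as one string per page (for example by running the OCR read on a single-page range inside a loop). The function name `find_marker_pages` and the sample texts are made up for illustration.

```python
def find_marker_pages(page_texts, marker="split from here"):
    """Return the 1-based page numbers whose text contains the marker.

    page_texts is assumed to hold one OCR result string per page, in order.
    Case and runs of whitespace are ignored when comparing.
    """
    needle = " ".join(marker.lower().split())
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        normalized = " ".join(text.lower().split())
        if needle in normalized:
            hits.append(page_number)
    return hits


# Hypothetical OCR output for a four-page scan.
pages = [
    "Invoice 1\nsplit from here",
    "details",
    "Invoice 2\nSplit  From Here",
    "details",
]
marker_pages = find_marker_pages(pages)
print(marker_pages)
```

Because the loop index is the page number, nothing else is needed to know which page a hit came from.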

Hello @mahsa.mohk - have you tried the new PDF activities? There is a new Extract PDF Page Range activity, which accepts arguments such as “1-4” (the first four pages) or “5-END” (all pages from page 5 to the end of the file). This might help!

You can decide where to split by using the “Digitize Document” activity, and try to find the word you are searching for either in the DocumentObjectModel object, or in the Text version (and then search in the DocumentObjectModel where that index appears to identify the page). This is indeed a little bit of custom code, but it shouldn’t be too complicated…
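The "find the index in the text, then work out which page it belongs to" part could look roughly like the Python sketch below. It does not use the DocumentObjectModel API at all. It only assumes that the full text is the page texts joined with a known separator, so that each page's starting offset can be computed. All names here are hypothetical.

```python
from bisect import bisect_right


def page_start_offsets(page_texts, separator="\n"):
    """Offset in the joined text at which each page begins."""
    starts, position = [], 0
    for text in page_texts:
        starts.append(position)
        position += len(text) + len(separator)
    return starts


def offset_to_page(starts, offset):
    """Map a character offset in the joined text to a 1-based page number."""
    return bisect_right(starts, offset)


def marker_pages(page_texts, marker, separator="\n"):
    """Pages on which the marker starts, found by searching the joined text."""
    full_text = separator.join(page_texts)
    starts = page_start_offsets(page_texts, separator)
    pages = []
    index = full_text.find(marker)
    while index != -1:
        page = offset_to_page(starts, index)
        if not pages or pages[-1] != page:
            pages.append(page)
        index = full_text.find(marker, index + 1)
    return pages


sample = ["aaa", "b split from here", "ccc split from here"]
print(marker_pages(sample, "split from here"))
```

If the digitized document already exposes per-page text, searching each page directly is simpler; the offset lookup is only needed when you start from the combined text.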

Hi,
Thanks for the replies. I found out that my issue was related to OCR: the recognized text was dirty. I played with wildcards and changed the profile to NONE in Read PDF With OCR, and could then find the text I was looking for.
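For anyone hitting the same dirty-OCR problem, here is a small Python sketch of a tolerant match. It is not the wildcard approach used above, just one possible alternative: it strips punctuation and then compares sliding word windows against the marker with the standard library's `difflib`. The function name and the 0.8 threshold are assumptions and would need tuning on real scans.

```python
import re
from difflib import SequenceMatcher


def contains_marker(text, marker="split from here", threshold=0.8):
    """Return True if the text contains the marker or a close OCR variant."""
    # Lowercase and collapse everything that is not a letter or digit.
    clean = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    target = re.sub(r"[^a-z0-9]+", " ", marker.lower()).strip()
    if target in clean:
        return True
    words = clean.split()
    size = len(target.split())
    # Slide a window of as many words as the marker has over the page text.
    for start in range(max(len(words) - size + 1, 0)):
        window = " ".join(words[start:start + size])
        if SequenceMatcher(None, window, target).ratio() >= threshold:
            return True
    return False


print(contains_marker("Page 3 -- Sp1it frorn here!"))
print(contains_marker("nothing to see on this page"))
```

A lower threshold catches more OCR mistakes but also raises the risk of false splits, so it is worth checking the matched pages by eye at first.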
For splitting, I created a list and added every page number that contains the text. Then I iterated over the page numbers and used them in the Extract PDF Page Range activity to split the file.
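Turning that list of page numbers into range strings could be sketched like this in Python. It assumes each marker page is the first page of a new output file and that pages before the first marker form their own file. The “first-last” and “N-END” shapes follow the examples mentioned earlier in the thread; whether a one-page range such as “3-3” is accepted should be checked against the activity. `split_ranges` is a made-up name.

```python
def split_ranges(marker_pages):
    """Build page-range strings from the 1-based pages where a new file starts."""
    starts = sorted(set(marker_pages))
    # Pages before the first marker (if any) become the first output file.
    if not starts or starts[0] != 1:
        starts.insert(0, 1)
    ranges = []
    for i, first in enumerate(starts):
        if i + 1 < len(starts):
            last = starts[i + 1] - 1
            ranges.append(f"{first}-{last}")
        else:
            # The last chunk runs to the end of the document.
            ranges.append(f"{first}-END")
    return ranges


print(split_ranges([3, 6]))
```

Each resulting string would then be passed to one range-extraction call, producing one output file per range.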
Thanks all for your responses and help.

Hi, can you please share the workflow? Thank you!
