Identify pages in a PDF file that contain a keyword string

Greetings,

I have a working process to extract the pages of a PDF file that contain a specific keyword or string, e.g. “Sports Section”. The current process splits the file first and stores the single-page files in a directory. It then uses “Read PDF Text” on each single page and “Find Matching Pattern” to check whether the page contains the keyword. If the condition is true, it moves the single-page PDF to another folder.

While the above approach works, it is very slow for a large file of over 500 pages, especially since only about 10% of the pages contain the keyword. If I run “Read PDF Text” on the whole file before splitting it into single-page PDFs, I get one long string and lose track of which page the keyword came from.

I would appreciate any higher power’s help with a process to identify the pages that contain the keyword and store their page numbers in a string. Then I can use “Extract PDF Page Range” to pull the desired pages out of the PDF.

Thanks!

@Bei_Jing

If the PDF contains a static header or footer, read all the text at once, split it on the header or footer, then loop over the resulting array and check each string. That gives the page numbers where your string is present, and you can then use a single extract to pull all of those pages at once.

Cheers

There is a static footer. So I tried adding some sort of marker after the static footer, but the result is still one long string of text with my “true” page markers in it. What activities should I use to split it? Do I split the one long string into smaller strings, with the appropriate page number at the end of each smaller string? Could you please provide more details on the footer approach? Thanks.

Perhaps “Text to Left/Right” of the footer? Save the right-hand part and iterate through it?

@beijing03051

If the footer text is, say, xxxx, then str.Split({"xxxx"}, StringSplitOptions.None) will split the text into pages exactly at that footer.

Then iterate with a For loop to get each page.
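
For anyone following along, the split-and-loop idea above can be sketched outside Studio. This is only an illustration in Python, not the workflow itself: the function name `pages_containing`, the footer text `FOOTER-XXXX`, and the sample text are all made up, and it assumes every page ends with exactly the same footer string.

```python
def pages_containing(full_text, footer, keyword):
    """Return 1-based page numbers whose text contains the keyword."""
    # Each footer occurrence marks the end of one page (assumption).
    chunks = full_text.split(footer)
    # Whatever follows the last footer is not a page; drop it if blank.
    if chunks and not chunks[-1].strip():
        chunks.pop()
    return [
        number
        for number, page_text in enumerate(chunks, start=1)
        if keyword.lower() in page_text.lower()
    ]


# Made-up three-page text standing in for the "Read PDF Text" output.
sample = (
    "Invoice 1001 Bread Butter FOOTER-XXXX"
    "Invoice 1002 Olive Oil FOOTER-XXXX"
    "Invoice 1003 Milk FOOTER-XXXX"
)
print(pages_containing(sample, "FOOTER-XXXX", "Olive Oil"))
```

The position of each chunk in the split array is the page number, so nothing extra has to be appended to the text to remember where a match came from.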

Cheers

Attached please find the package:
Extract_PDF_Page_by_Keyword.1.0.1.nupkg (26.3 KB)
Invoice Examples.pdf (36.0 KB)
I can’t seem to store the pages in a string array. If I look for “Olive Oil”, it should extract page 2.

What activity should I use for the str.Split statement?

Main.xaml (11.1 KB)
Should have uploaded this one instead.

@Bei_Jing

Please check this. This is how the xaml looks:

PDFSplitExtract.xaml (9.2 KB)

cheers

Hi,

Another approach: How about using DocumentUnderstanding - Digitize Document activity with ApplyOcrOnPdfArgument.No option?

This gives us the text in the PDF file (without OCR) together with page information, so we can easily get the text of each page in a single pass over the file.

Sample20230621-3L.zip (634.1 KB)

Hope this helps you.

Regards,

Thanks for the xaml. It successfully extracts the page when only 1 page matches. If more than one page needs to be extracted, the workflow crashes with this message.
error msg
Your xaml is below.
PDFSplitExtract.xaml (9.0 KB)
A new example input file with more than 1 page to extract:
Invoice Examples2.pdf (67.1 KB)

The input argument needs to be a single comma-separated string. Arrays are my weakest area. :frowning: I need to play around with your working xaml to make it work.
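
In case it helps with the array part: turning a list of matching page numbers into one comma-separated string is a small step. Here is a rough Python sketch; the name `to_page_range` is hypothetical, and it assumes the range input accepts a plain list such as "2,5".

```python
def to_page_range(page_numbers):
    """Join page numbers into a comma-separated string, e.g. "2,5"."""
    # Sort and de-duplicate so each page is requested once, in order.
    unique_sorted = sorted(set(page_numbers))
    return ",".join(str(number) for number in unique_sorted)


print(to_page_range([5, 2, 2]))
```

When no page matches, the result is an empty string, so it is worth checking for that before passing the value on to the extract step.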

Thanks!!

Thank you very much for your xaml. It worked great, splitting all the pages into separate *.txt files. Now I can move on to the rest of the steps. The videos on YouTube seem to overcomplicate Document Understanding. Yours is simple to understand and use. Thank you!

I see that you split out each page. I can probably add a “Find” activity to store only the pages with a matching keyword (phrase) in a string array instead of writing them to files. Thanks again.
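
A rough sketch of that filtering step, written in Python only to show the logic: it assumes the per-page text is already available as a list of strings, one per page in order, and the names `normalize` and `matching_pages` are hypothetical. Whitespace is collapsed before comparing because extracted PDF text may break a phrase across lines.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and lowercase for a tolerant comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def matching_pages(page_texts, phrase):
    """Return 1-based numbers of the pages whose text contains the phrase."""
    wanted = normalize(phrase)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if wanted in normalize(text)
    ]


# Made-up page texts; the second one has the phrase split across a line break.
pages = ["Bread and butter", "Olive\nOil 2 bottles", "olive   oil, 1 tin"]
print(matching_pages(pages, "Olive Oil"))
```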
