I have a working process to extract the pages of a PDF file that contain a specific keyword or string, e.g. “Sports Section”. The current process is to split the file first and store the single-page files in a directory. Then I use “Read PDF Text” on each single page and “Find Matching Pattern” to check whether the page contains the keyword. If the condition is true, I move the single-page PDF to another folder.
While the above approach works, it is very slow for a large file of over 500 pages, especially since only about 10% of the pages contain the keyword. If I run “Read PDF Text” on the whole file before splitting it into single-page PDFs, I get one long string and lose track of which page the keyword came from.
I'd appreciate any higher power’s help with a process to identify the pages that contain the keyword and store their page numbers in a string. Then I can use “Extract PDF Page Range” to pull the desired pages out of the PDF.
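As an illustration of the "store them in a string" step, here is a minimal Python sketch that collapses a list of page numbers into a compact range string. The function name `to_page_range` is hypothetical, and the output format (comma-separated pages and ranges such as 2,4-6,9) is an assumption about what “Extract PDF Page Range” accepts, so check the activity's documentation before relying on it. The same logic can be rebuilt with UiPath activities or a VB.NET expression.

```python
def to_page_range(pages):
    """Collapse page numbers into a string such as '2,4-6,9'.

    Assumption: the target activity accepts comma-separated pages and
    hyphenated ranges. Verify this against the activity's documentation.
    """
    ordered = sorted(set(pages))  # drop duplicates, ascending order
    if not ordered:
        return ""

    def fmt(first, last):
        # A run of one page is written as "7", a longer run as "4-6".
        return str(first) if first == last else f"{first}-{last}"

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside a consecutive run
            continue
        parts.append(fmt(start, prev))  # close the finished run
        start = prev = page
    parts.append(fmt(start, prev))  # close the final run
    return ",".join(parts)
```

Consecutive pages are merged into one range only to keep the string short; passing every page number individually should work just as well if the activity accepts a plain comma-separated list.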
If the PDF contains a static header or footer, read all the text at once, split it on the header or footer, and then loop over the resulting array and check each string. That gives you the page numbers where your string is present, and you can then use a single extract to pull all of those pages at once.
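The footer approach above can be sketched in Python as follows. This is only an illustration of the logic, not the actual workflow: the function name `pages_with_keyword` is hypothetical, and it assumes the footer text is identical on every page, appears exactly once per page at the end of that page, and does not occur anywhere in the body text. In UiPath the same steps map onto a string split followed by a loop with a counter.

```python
def pages_with_keyword(full_text, footer, keyword):
    """Return the 1-based page numbers whose text contains the keyword.

    Assumptions: `footer` closes every page exactly once and never
    appears in the body text, so the Nth chunk is page N.
    """
    chunks = full_text.split(footer)

    # The text after the last footer is not a page; it is usually just
    # trailing whitespace, so drop it when it is blank.
    if chunks and not chunks[-1].strip():
        chunks.pop()

    needle = keyword.lower()  # case-insensitive match
    return [
        page_number
        for page_number, chunk in enumerate(chunks, start=1)
        if needle in chunk.lower()
    ]
```

With a static header instead of a footer, the first chunk (the text before the first header) would be the empty one, so the numbering would need to skip it rather than the last chunk.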
There is a static footer. So I tried adding some sort of marker after the static footer, but the result is still one long string of text, now with my “true” page markers in it. What activities should I use to split it? Do I split the one long string into smaller strings, each with the appropriate page number at the end? Could you please provide more details on the footer approach? Thanks.
Thank you very much for your xaml. It worked great, splitting all the pages into separate *.txt files. Now I can work on the rest of the steps. The videos on YouTube seem to overcomplicate this by using Document Understanding. Yours is simple to understand and use. Thank you!