How to extract all pages of a PDF based on a specific Text?

I am a beginner with UiPath. I have gone through videos and tutorials on the web as and when needed. However, there is one problem I haven't been able to solve.

Problem Statement -

There is a PDF file consisting of 500 pages or more.
Each page may or may not have a text like ‘Block 01’. The number for each block varies.

For example, the first pdf page might have ‘Block 01’.
Second page might not have anything.
The third page could contain ‘Block 02’.
The fourth page may have ‘Block 04’.
Each page can have a maximum of one Block (No).

How can I read all the pages containing a Block and then combine the pages for each Block into a single PDF, so that each output PDF is associated with a particular Block (No)?

Please suggest.
Thanks All.

Hello,

One approach sticking with UiPath activities would be to:

  • Read each page
  • Check if it contains the string
  • If a match, keep track of the page number
  • At the end, extract retained pages in one document

Below you’ll find a skeleton with emphasis on some key points.

Assign (Int32)
Initialise a counter (the page number)
page = 0

Assign (List of String)
A list of strings where we’ll stack the matching page numbers
pages = New List(Of String)

Get PDF Page Count
Get the number of pages in the document
lastPage

While (page < lastPage)

  • Assign
    Increment the counter
    page = page + 1

  • Read PDF Text
    With range set to page, you get a string
    pageText

  • If System.Text.RegularExpressions.Regex.IsMatch(pageText, "\bBlock \d+\b")
    If the text contains “Block (No)”, do the following:

    • Add To Collection (page.ToString to pages)
      We’re adding the page number to the list of retained pages

// We’ve reached the last page of the document, so it’s time to generate the output

Extract PDF Page Range
Create a new document from the original, passing the retained pages as the range argument
range = String.Join(", ", pages)
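
For anyone who wants to check the logic outside UiPath, the loop above can be sketched in Python. This is only an illustration of the algorithm, not UiPath code: `page_texts` is a hypothetical list standing in for the per-page output of Read PDF Text, and `matching_pages` is a helper name made up for this sketch.

```python
import re

# Same pattern as in the workflow: the word "Block", a space, then digits.
BLOCK_PATTERN = re.compile(r"\bBlock \d+\b")

def matching_pages(page_texts):
    """Return the 1-based page numbers (as strings) whose text mentions a block."""
    pages = []
    for page, text in enumerate(page_texts, start=1):
        if BLOCK_PATTERN.search(text):
            pages.append(str(page))
    return pages

# Hypothetical page texts standing in for the Read PDF Text output.
page_texts = ["Intro Block 01 details", "no marker here", "Block 02", "Block 04 appendix"]
pages = matching_pages(page_texts)
page_range = ", ".join(pages)  # mirrors String.Join(", ", pages)
print(page_range)
```

With these sample texts, pages 1, 3 and 4 are retained and page 2 is skipped.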

Thanks. This way I am able to do it, though my algorithm differs a little. Is it not possible through Screen Scraping? My apologies for failing to mention this earlier.

I don’t know, but you can adapt

Once I scrape the screen, how can I go through all the pages? I am only able to capture the first one. Any suggestions?

You’re adding the PDF reader as an extra problem: you have to interact with its interface, set it up so the text flows the way you expect, etc.

Really, I’d go directly to the file: it’s more straightforward and probably more reliable.

For fun, I want to try this approach as well. I am not sure what you mean by the PDF reader problem, as I am a beginner at all this. Can you point me to some tutorials on this? Thanks.

Sorry, I have no experience in that matter. You’ll have to find out how to navigate page by page and how to scrape the text.

I guess I need to maintain some sort of Map holding (Block Id, list of pages on which this particular block appears).

I see that you have used pages = New List(Of String). A list won’t be able to maintain such a relationship.

I thought as soon as a page contains the text Block … it should be retained. The list contains only the related page number.

The goal is to feed Extract PDF Page Range with the page number to output.

If Block 05 appears on page 1, 5, 9 then I need to insert the same in a map
Map<BlockNo, ListofPages>

Is there any way to create a Map variable? Please suggest.

use a dictionary
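
As a rough illustration of the idea, here is the grouping sketched in Python rather than in UiPath. The names `page_texts` and `pages_by_block` are made up for this example; in a workflow the closest equivalent would presumably be a Dictionary(Of String, List(Of String)) variable filled inside the page loop.

```python
import re
from collections import defaultdict

BLOCK_PATTERN = re.compile(r"\bBlock \d+\b")

def pages_by_block(page_texts):
    """Map each block label to the list of 1-based pages it appears on."""
    groups = defaultdict(list)
    for page, text in enumerate(page_texts, start=1):
        match = BLOCK_PATTERN.search(text)
        if match:
            # The matched text, e.g. "Block 05", is used as the dictionary key.
            groups[match.group()].append(page)
    return dict(groups)

# Hypothetical page texts: Block 05 on pages 1 and 5, Block 02 on page 3.
page_texts = ["Block 05 a", "nothing", "Block 02", "x", "Block 05 b"]
for block, pages in pages_by_block(page_texts).items():
    # Each joined string could feed one Extract PDF Page Range call.
    print(block, "->", ", ".join(str(p) for p in pages))
```

Each dictionary entry then yields one page range, so one output PDF can be produced per block.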

Is that what you mean?

image

And shall I maintain an array of pages corresponding to a block no.?

I am done with the dictionary. It works. The only issue I am facing is how to extract the string Block along with the No so that I can then maintain the key in the dictionary based on the Block No.
So, the keys will be like this in the dictionary.
Block 01
Block 02
Block 03
Block 04

Any suggestions, dear?

Hello,

key = System.Text.RegularExpressions.Regex.Match(pageText, "\bBlock \d+\b").Value
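
The same extraction, sketched in Python for illustration only (`block_key` is a hypothetical helper name). In .NET, `Regex.Match(...).Value` is an empty string when nothing matches, so the sketch returns "" in that case too; it is worth checking for that before using the result as a dictionary key.

```python
import re

def block_key(page_text):
    """Return the first 'Block <digits>' found in the text, or "" if none."""
    match = re.search(r"\bBlock \d+\b", page_text)
    return match.group() if match else ""

print(block_key("Header Block 07 footer"))  # a page with a block marker
print(block_key("no marker on this page"))  # a page without one
```

Note that the key is the matched text as written, so "Block 5" and "Block 05" would end up as two different keys unless the number is normalised first.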

Amazing. Thanks dear. I am learning a lot from you. Thanks once again.