How to extract all pages of a PDF based on a specific Text?

Hello,

One approach sticking with UiPath activities would be to:

  • Read each page
  • Check if it contains the string
  • If it matches, keep track of the page number
  • At the end, extract the retained pages into one document
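
To make the logic easier to follow outside of UiPath, here is a minimal Python sketch of the same loop. It assumes the text of each page has already been read into a list (the `sample` list is made up and only stands in for what Read PDF Text would return), and the function name `matching_pages` is mine, not a UiPath activity.

```python
import re

def matching_pages(page_texts, pattern):
    """Return the 1-based numbers of the pages whose text matches pattern."""
    regex = re.compile(pattern)
    pages = []
    # Page numbers start at 1, like the counter in the workflow.
    for page, text in enumerate(page_texts, start=1):
        if regex.search(text):
            pages.append(page)
    return pages

# Hypothetical page texts standing in for the output of Read PDF Text.
sample = ["Intro", "Block 1 details", "Appendix", "Block 22 summary"]
print(matching_pages(sample, r"\bBlock \d+\b"))  # prints [2, 4]
```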

Below you’ll find a skeleton with emphasis on some key points.

Assign (Int32)
Initialise a counter (the page number)
page = 0

Assign (List of String)
A list of strings where we’ll stack the matching page numbers
pages = New List(Of String)

Get PDF Page Count
Get the number of pages in the document
lastPage

While (page < lastPage)

  • Assign
    Increment the counter
    page = page + 1

  • Read PDF Text
    With range set to page, you get a string
    pageText

  • If System.Text.RegularExpressions.Regex.IsMatch(pageText, "\bBlock \d+\b")
    If the text contains “Block” followed by a number, do the following:

    • Add To Collection (page.ToString to pages)
      We’re adding the page number to the list of retained pages
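
If you want to check what the pattern accepts before wiring it into the workflow, a quick test like the one below may help. It is only a sketch in Python (the helper name `has_block_reference` is made up), but `\b` and `\d+` behave the same way for these inputs as they do in .NET.

```python
import re

# Same pattern as in the If condition: the word "Block", one space, then digits.
PATTERN = re.compile(r"\bBlock \d+\b")

def has_block_reference(text):
    """True if the text contains 'Block' followed by a number, anywhere."""
    return PATTERN.search(text) is not None

print(has_block_reference("See Block 12 below"))  # True
print(has_block_reference("Blocks 12"))           # False: 's' instead of a space
print(has_block_reference("SubBlock 3"))          # False: no word boundary before 'Block'
print(has_block_reference("block 3"))             # False: the match is case-sensitive
```

If lower-case “block” should also count, the .NET side would need RegexOptions.IgnoreCase (re.IGNORECASE in the sketch).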

// We’ve reached the last page of the document, it’s time to generate the new one

Extract PDF Page Range
Create a new document from the original, passing the retained pages as the range argument
range = String.Join(", ", pages)
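
One thing worth guarding against: if no page matched, the joined string is empty. I have not checked how Extract PDF Page Range reacts to an empty range, so it seems safer to skip the extraction in that case. A small Python sketch of the idea (`build_range` is a made-up helper, not a UiPath function):

```python
def build_range(pages):
    """Join page numbers into a range string such as '2, 4, 7'.

    Returns None when the list is empty so the caller can skip the extraction.
    """
    if not pages:
        return None
    return ", ".join(str(p) for p in pages)

print(build_range([2, 4, 7]))  # prints 2, 4, 7
print(build_range([]))         # prints None
```

In the workflow, the equivalent would be an If on pages.Count > 0 placed around Extract PDF Page Range.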
