How to extract all pages of a PDF based on a specific Text?

Farhan_Shirgill · May 11, 2020, 5:58am

I am a beginner to UiPath. I have gone through the videos and tutorials on the web as and when needed. However, there is a problem which I haven’t had success with.

Problem Statement -

There is a PDF file consisting of 500 pages or more.
Each page may or may not have a text like ‘Block 01’. The number for each block varies.

For example, the first pdf page might have ‘Block 01’.
Second page might not have anything.
The third page could contain ‘Block 02’.
The fourth page may have ‘Block 04’.
Each page can have a maximum of one Block (No).

How can I read all the pages containing a Block and then combine them into PDFs, with each PDF holding the pages of one particular Block (No)?

Please suggest.
Thanks All.

msan · May 11, 2020, 6:33am

Hello,

One approach sticking with UiPath activities would be to:

Read each page
Check if it contains the string
If a match, keep track of the page number
At the end, extract the retained pages into one document

Below you’ll find a skeleton with emphasis on some key points.

Assign (Int32)
Initialise a counter (the page number)
page = 0

Assign (List of String)
A list of strings where we’ll stack the matching page numbers
pages = New List(Of String)

Get PDF Page Count
Get the number of pages in the document
lastPage

While (page < lastPage)

Assign
Increment the counter
page = page + 1
Read PDF Text
With range set to page, you get a string
pageText
If System.Text.RegularExpressions.Regex.IsMatch(pageText, "\bBlock \d+\b")
If the text contains “Block (No)”, do the following:
- Add To Collection (page.ToString to pages)
  We’re adding the page number to the list of retained pages

// We’ve reached the last page of the document, so it’s time to generate the new document

Extract PDF Page Range
Create a new document from the original, passing the retained pages as the range argument
range = String.Join(", ", pages)
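The logic of the skeleton above can be checked outside UiPath with a small Python sketch. This is only an illustration, not a UiPath workflow: `page_texts` is a hypothetical list standing in for the per-page output of Read PDF Text, and `matching_pages` is a name made up for this sketch.

```python
import re

# Same pattern as in the skeleton: the word "Block", a space, then digits.
BLOCK_PATTERN = re.compile(r"\bBlock \d+\b")


def matching_pages(page_texts):
    """Return the 1-based numbers (as strings) of pages containing a Block label."""
    pages = []
    for page, page_text in enumerate(page_texts, start=1):
        if BLOCK_PATTERN.search(page_text):
            pages.append(str(page))
    return pages


# Hypothetical page texts standing in for what Read PDF Text would return.
sample = ["Intro Block 01 details", "nothing here", "Block 02", "Block 04 summary"]

# Join the retained page numbers the same way the skeleton builds its range.
page_range = ", ".join(matching_pages(sample))
print(page_range)  # 1, 3, 4
```

The page numbers are kept as strings only so they can be joined directly into the range text.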

Farhan_Shirgill · May 11, 2020, 8:21am

Thanks. I am able to do it this way, though my algorithm differs a little. Is it not possible through screen scraping? My apologies for failing to mention this.

msan · May 11, 2020, 8:25am

I don’t know, but you can adapt.

Farhan_Shirgill · May 11, 2020, 8:28am

Once I scrape the screen, how can I go through all the pages? I am only able to capture the first one. Any suggestions?

msan · May 11, 2020, 8:35am

You’re adding the PDF reader to the problem: you have to interact with its interface, configure it so that it reads the flow as you expect, etc.

Really, I’d go directly to the file: it’s more straightforward and probably more reliable.

Farhan_Shirgill · May 11, 2020, 9:12am

For fun, I want to try this approach as well. I am not sure what you mean by the PDF reader problem, as I am a beginner to all this. Can you guide me to some tutorials on this? Thanks.

msan · May 11, 2020, 9:14am

Sorry, I have no experience in that matter. You’ll have to find out how to navigate page by page and how to scrape the text.

Farhan_Shirgill · May 11, 2020, 1:28pm

I guess I need to maintain some sort of Map to hold (Block Id, list of pages on which this particular block appears).

I see that you have used pages = New List(Of String). A list won’t be able to maintain such a relationship.

msan · May 11, 2020, 1:31pm

I thought that as soon as a page contains the text Block … it should be retained. The list contains only the matching page numbers.

The goal is to feed Extract PDF Page Range with the page numbers to output.

Farhan_Shirgill · May 11, 2020, 1:36pm

If Block 05 appears on pages 1, 5 and 9, then I need to insert that in a map:
Map<BlockNo, ListofPages>

Is there any way to create a Map variable? Please suggest.

msan · May 11, 2020, 1:45pm

Use a dictionary.
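As a rough illustration of the dictionary idea (in Python rather than UiPath, so treat it as a sketch of the logic only), each Block label can map to the list of pages it appears on. `page_texts` and `pages_by_block` are hypothetical names, and the sample data simply reuses the "Block 05 on pages 1, 5 and 9" example from the question.

```python
import re


def pages_by_block(page_texts):
    """Map each Block label to the 1-based page numbers it appears on."""
    blocks = {}
    for page, page_text in enumerate(page_texts, start=1):
        match = re.search(r"\bBlock \d+\b", page_text)
        if match is None:
            continue  # page without a Block label: nothing to record
        # setdefault creates the empty list the first time a label is seen
        blocks.setdefault(match.group(0), []).append(page)
    return blocks


# Hypothetical nine-page document: "Block 05" on pages 1, 5 and 9, "Block 02" on page 7.
sample = ["Block 05", "", "", "", "Block 05", "", "Block 02", "", "Block 05"]

for block, pages in pages_by_block(sample).items():
    # One comma-separated range string per block
    print(block, "->", ", ".join(str(p) for p in pages))
```

Each value can then be joined into its own range string, giving one extraction per Block.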

Farhan_Shirgill · May 11, 2020, 2:01pm

Is that what you mean?

And shall I maintain an array of pages corresponding to a block no.?

Farhan_Shirgill · May 15, 2020, 12:22pm

I am done with the dictionary. It works. The only issue I am facing is how to extract the string Block along with its number, so that I can use it as the key in the dictionary.
So, the keys will look like this in the dictionary:
Block 01
Block 02
Block 03
Block 04

Any suggestions, dear?

msan · May 15, 2020, 12:32pm

Hello,

key = System.Text.RegularExpressions.Regex.Match(pageText, "\bBlock \d+\b").Value
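One detail worth keeping in mind: in .NET, `Match.Value` is an empty string when nothing matches, so a page without a label should be checked before its key is added to the dictionary. A minimal Python sketch of the same extraction, where `block_key` is a hypothetical helper written to mirror that behaviour:

```python
import re


def block_key(page_text):
    """Return the Block label found in page_text, or "" when there is none."""
    match = re.search(r"\bBlock \d+\b", page_text)
    return match.group(0) if match else ""


print(block_key("Summary for Block 03, continued"))  # Block 03
print(repr(block_key("A page without a label")))     # ''
```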

Farhan_Shirgill · May 15, 2020, 12:50pm

Amazing. Thanks dear. I am learning a lot from you. Thanks once again.
