Extracting pdf page count without using regex

Hawkins · October 21, 2019, 8:08am

I want to extract the page count of a PDF. There are some solutions that get the total number of pages using a regular expression, like this: regex.Matches(StreamReader.ReadToEnd(), "/Type\s/Page[^s]"), but they do not always return an accurate result. Is there any activity, or any other way in UiPath, to get the total number of pages in a PDF without using an external library?

strqsr · October 21, 2019, 9:02am

Maybe you can use an OCR read on the image and crop out the total page count shown on the right?

Hawkins · October 21, 2019, 3:27pm

Unfortunately, I cannot open the Acrobat PDF reader. This has to happen in the background.

bcorrea · October 21, 2019, 4:41pm

Do you only need to know how many pages a pdf file has? Or do you need to read the contents as well?

DanielMitchell · October 21, 2019, 4:56pm

@Hawkins
There’s a workaround using the Extract PDF Page Range activity. It’s not particularly fast, but it works behind the scenes using only the UiPath PDF activities.

Start an integer variable at 1. Then, in a loop, pass that variable as the range of the PDF extraction, and increment it after every successful extraction.

Inside the loop, use a try/catch. An exception will be thrown when you try to extract a page that doesn’t exist. Once the exception is thrown, the page count is the last page that extracted successfully, i.e. the value of the variable minus one.

The main workflow: [screenshot not preserved]

Inside the catch: [screenshot not preserved]
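For readers who think better in code, here is a rough Python sketch of the same probing loop. It is not UiPath code: `extract_page`, `PageOutOfRange` and `make_fake_extractor` are hypothetical stand-ins for the Extract PDF Page Range activity and the exception it throws, so treat it as an illustration of the logic only.

```python
from typing import Callable


class PageOutOfRange(Exception):
    """Hypothetical stand-in for the error raised when a page does not exist."""


def count_pages_linear(extract_page: Callable[[int], None]) -> int:
    """Probe pages 1, 2, 3, ... until extraction fails, then return the count."""
    page = 1
    while True:
        try:
            extract_page(page)
        except PageOutOfRange:
            # The first page that fails is one past the last real page.
            return page - 1
        page += 1


def make_fake_extractor(total_pages: int) -> Callable[[int], None]:
    """Simulate a PDF with total_pages pages (assumption, for the demo only)."""
    def extract_page(page: int) -> None:
        if page < 1 or page > total_pages:
            raise PageOutOfRange(page)
    return extract_page


print(count_pages_linear(make_fake_extractor(7)))  # prints 7
```

Note that the loop needs one failed extraction on top of one successful extraction per page, which is why it gets slow on long documents.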

Hawkins · October 21, 2019, 5:08pm

Thank you so much Daniel.

If there is no other option I will go with this, although it seems to be a long workaround.

Hawkins · October 21, 2019, 5:10pm

I do need to read the content as well @bcorrea. I am already doing that using “read pdf text” and “read pdf ocr text” activities.

dmccammond · October 21, 2019, 5:11pm

This is a really hacky workaround.

For one, you know that the PDF is always going to have at least one page, so you should start looking at page 2. You may also want to try a different step than simply incrementing by one. I would suggest increasing the page number exponentially (2, 4, 8, ...) until you get an out-of-bounds error, and then repeatedly halving the remaining interval until you land on the correct number, like a binary search on a sorted array.
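A minimal Python sketch of that search, assuming a hypothetical `page_exists(page)` check that wraps the extraction in a try/catch and returns True or False. The `make_page_exists` helper only simulates a PDF and records which pages were probed; neither name comes from UiPath.

```python
from typing import Callable, List


def count_pages_fast(page_exists: Callable[[int], bool]) -> int:
    """Find the page count by doubling, then binary search (assumes >= 1 page)."""
    low = 1   # highest page known to exist
    high = 2  # first candidate that might not exist
    # Phase 1: double the probe until we overshoot the end of the document.
    while page_exists(high):
        low = high
        high *= 2
    # Phase 2: page `low` exists and page `high` does not; narrow the gap.
    while high - low > 1:
        mid = (low + high) // 2
        if page_exists(mid):
            low = mid
        else:
            high = mid
    return low


def make_page_exists(total_pages: int, probes: List[int]) -> Callable[[int], bool]:
    """Simulated check (demo assumption); appends every probed page to probes."""
    def page_exists(page: int) -> bool:
        probes.append(page)
        return 1 <= page <= total_pages
    return page_exists


probes: List[int] = []
result = count_pages_fast(make_page_exists(100, probes))
print(result, len(probes))
```

For the simulated 100-page file this takes 13 probes, where stepping one page at a time would take about a hundred. Each probe is still a full extraction attempt, so the saving only matters for longer documents.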

The real answer is that there should be a new activity that has this functionality since it appears to be pretty fundamental for dealing with PDFs.

dmccammond · October 21, 2019, 5:15pm

I’ve created a request for the activity. Please vote here: PDF Page Count Activity

bcorrea · October 21, 2019, 5:16pm

Ok, but if you don’t mind me asking, does your PDF have the page number printed on it or not? Because that regex can be made to work in your case if the page numbers do appear in the text after you read it… If you are going with @DanielMitchell’s solution and you do need the page contents, then you can use his approach BUT with the Read PDF Text activity, passing a single page as the range. Then you kill two birds with one stone…
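Roughly, that combined idea looks like the Python sketch below. `read_page_text` is a hypothetical stand-in for Read PDF Text called with a single-page range, and `IndexError` stands in for whatever exception the activity throws on a missing page; both are assumptions for illustration.

```python
from typing import Callable, List


def read_all_pages(read_page_text: Callable[[int], str]) -> List[str]:
    """Read page 1, 2, 3, ... until a read fails; return the text of each page."""
    texts: List[str] = []
    page = 1
    while True:
        try:
            texts.append(read_page_text(page))
        except IndexError:
            # No such page: everything collected so far is the whole document.
            return texts
        page += 1


# Simulated three-page document (demo data, not from the thread).
fake_pdf = ["Invoice header", "Line items", "Totals"]


def fake_read_page_text(page: int) -> str:
    if page < 1:
        raise IndexError(page)
    return fake_pdf[page - 1]  # raises IndexError past the last page


pages = read_all_pages(fake_read_page_text)
print(len(pages))  # prints 3
```

The page count is simply the length of the collected list, so no separate counting pass is needed.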

dmccammond · October 21, 2019, 5:19pm

It’s a really bad idea to count on page numbers being present unless you’re sure that these are machine-generated files that include them. PDFs could even be scanned images, in which case there would not be any text present at all.

Hawkins · October 21, 2019, 5:25pm

I just got an idea. My requirement is that if a PDF exceeds the limit of 10 pages, a business exception should be thrown. So I will use Extract PDF Page Range with the range set to 11.

If it throws an exception, page 11 does not exist, so I will catch it and proceed normally. If it does not throw an exception, then I can conclude that the PDF exceeds the page limit, and a business exception will be thrown. yayyyyyyyyyy!!!
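In code terms the check comes down to a single probe. A small Python sketch, where `page_exists` is a hypothetical wrapper around the extraction attempt and `BusinessRuleError` stands in for a business exception (both are assumptions, not UiPath names):

```python
from typing import Callable


class BusinessRuleError(Exception):
    """Hypothetical stand-in for a business exception."""


def enforce_page_limit(page_exists: Callable[[int], bool], limit: int = 10) -> None:
    """Raise if the document has more than `limit` pages, using one probe."""
    # If page limit + 1 can be extracted, the document is too long.
    if page_exists(limit + 1):
        raise BusinessRuleError(f"PDF has more than {limit} pages")


def make_page_exists(total_pages: int) -> Callable[[int], bool]:
    """Simulated check for a PDF with total_pages pages (demo assumption)."""
    return lambda page: 1 <= page <= total_pages


enforce_page_limit(make_page_exists(10))  # passes silently

try:
    enforce_page_limit(make_page_exists(11))
except BusinessRuleError as err:
    print(err)  # prints: PDF has more than 10 pages
```

This never learns the exact page count, only whether the limit is exceeded, which is all the requirement asks for.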

Thank you @DanielMitchell @dmccammond @bcorrea.

Hawkins · October 21, 2019, 5:27pm

Very much true. That’s why I am not going with Regex. @dmccammond @bcorrea

