Extracting PDF page count without using regex

I want to extract the page count of a PDF. Some solutions get the total number of pages using a regular expression, like regex.Matches(StreamReader.ReadToEnd(),"/Type\s/Page[^s]"), but this does not always return an accurate result. Is there an activity, or any other way in UiPath, to determine the total number of pages in a PDF without using an external library?
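For reference, here is a minimal Python sketch of what that regex approach does, so the failure mode is easier to see. The function name `count_page_markers` is made up for illustration, and the pattern uses `\s*` rather than the `\s` quoted above, which is an assumption on my part. It only counts `/Type /Page` markers that appear as plain text in the raw bytes. Page objects stored inside compressed object streams are invisible to it, and stale objects left behind by incremental saves can be counted twice, which are two likely reasons the result is not always accurate.

```python
import re

# Matches "/Type /Page" followed by anything other than "s",
# so the "/Type /Pages" tree node is not counted.
# Assumption: optional whitespace (\s*) between the two names.
PAGE_MARKER = re.compile(rb"/Type\s*/Page[^s]")


def count_page_markers(pdf_bytes: bytes) -> int:
    """Count plain-text page markers in raw PDF bytes (a heuristic, not a parser)."""
    return len(PAGE_MARKER.findall(pdf_bytes))


# Tiny hand-written fragment, not a real PDF file.
sample = (
    b"<< /Type /Pages /Count 2 >>\n"
    b"<< /Type /Page /Parent 1 0 R >>\n"
    b"<< /Type/Page\n/Parent 1 0 R >>\n"
)
print(count_page_markers(sample))
```

If the markers are compressed or duplicated, this number will simply be wrong, and nothing in the raw bytes tells you so.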

Maybe you can use OCR on an image of the reader window and pick out the total page number shown on the right?

Unfortunately, I cannot open the Acrobat PDF reader. This should happen in the background.

Do you only need to know how many pages a pdf file has? Or do you need to read the contents as well?

@Hawkins
There’s a workaround using the Extract PDF Page Range activity. It’s not particularly fast, but it works in the background using only the UiPath PDF activities.

Start an integer variable at 1. Then, in a loop, pass that variable as the range for the PDF extraction and increment it after every successful extraction.

Inside the loop use a try/catch. An exception will be thrown when you try to extract a page that doesn’t exist. Once that happens, the page count is the last value that extracted successfully, which is the variable minus one if you increment after each extraction.

[Screenshots: the main workflow, and the contents of the catch block]
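The loop described above can be sketched in Python to make the counting logic explicit. This is not UiPath code: `extract_page` is a hypothetical stand-in for the Extract PDF Page Range activity, `PageOutOfRange` is a made-up exception type standing in for whatever the activity throws, and `make_fake_extractor` only simulates a PDF with a known number of pages.

```python
from typing import Callable


class PageOutOfRange(Exception):
    """Hypothetical stand-in for the exception thrown on a missing page."""


def count_pages_linear(extract_page: Callable[[int], None]) -> int:
    """Extract page 1, 2, 3, ... until extraction fails, then report the count."""
    page = 1
    while True:
        try:
            extract_page(page)
        except PageOutOfRange:
            # 'page' is the first page that does not exist,
            # so the last page that did exist is page - 1.
            return page - 1
        page += 1


def make_fake_extractor(total_pages: int) -> Callable[[int], None]:
    """Simulate a PDF with 'total_pages' pages (for illustration only)."""
    def extract_page(page: int) -> None:
        if page < 1 or page > total_pages:
            raise PageOutOfRange(f"page {page} does not exist")
    return extract_page


print(count_pages_linear(make_fake_extractor(7)))
```

The cost is one extraction per page plus one failing attempt, which is why this gets slow on long documents.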

Thank you so much Daniel.

If there is no other option I will go with this, although it seems like a long workaround.

I do need to read the content as well @bcorrea. I am already doing that using “read pdf text” and “read pdf ocr text” activities.

This is a really hacky workaround.

For one, you know that the PDF always has at least one page, so you can start probing at page 2. You may also want to step by more than one page at a time. I would even suggest doubling the page number on each attempt and then, once you get an out-of-bounds error, halving the interval until you land on the correct number, like running a binary search on a sorted array.
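A rough Python sketch of that doubling-then-halving idea, under a few assumptions: `page_exists` is a hypothetical helper that would wrap the extraction in a try/catch and return True or False, and the PDF is assumed to have at least one page, as noted above.

```python
from typing import Callable


def count_pages_fast(page_exists: Callable[[int], bool]) -> int:
    """Find the last existing page with exponential probing plus binary search.

    Assumes page 1 always exists.
    """
    low = 1   # highest page known to exist
    high = 2  # candidate for the first page known NOT to exist

    # Phase 1: double the probe until it falls off the end of the document.
    while page_exists(high):
        low = high
        high *= 2

    # Phase 2: page 'low' exists and page 'high' does not; narrow the gap.
    while high - low > 1:
        mid = (low + high) // 2
        if page_exists(mid):
            low = mid
        else:
            high = mid
    return low


# Simulated 1000-page document; 'probes' records every page number tried.
probes = []


def fake_page_exists(page: int) -> bool:
    probes.append(page)
    return 1 <= page <= 1000


print(count_pages_fast(fake_page_exists), len(probes))
```

The number of probes grows with the logarithm of the page count instead of the page count itself, so a document with a thousand pages needs a couple of dozen extractions rather than a thousand.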

The real answer is that there should be a new activity that has this functionality since it appears to be pretty fundamental for dealing with PDFs.

I’ve created a request for the activity, please vote here: PDF Page Count Activity

OK, but if you don’t mind me asking, does your PDF have the page number on it or not? Because a regex can be made to work in your case if the page numbers appear in the text after you read it. If you are going with @DanielMitchell’s solution and you do need the page contents, then you can use his approach BUT with the Read PDF Text activity, passing a single page as the range. That way you kill two birds with one stone.

It’s a really bad idea to rely on page numbers being present in the text unless you’re sure that these are machine-generated files that include them. PDFs could even be scanned images, in which case there would not be any text present at all.

Just now I got an idea. My requirement is that if a PDF exceeds the limit of 10 pages, it should throw a business exception. So I will use Extract PDF Page Range with the range set to 11.

If it throws an exception, I will catch it and proceed normally. If it does not throw an exception, I can conclude that the PDF exceeds the page limit, and a business exception will be thrown. yayyyyyyyyyy!!!
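In Python terms, the single-probe check looks roughly like the sketch below. The names are hypothetical: `page_exists` stands for "Extract PDF Page Range did not throw for this page", and `BusinessRuleError` stands in for the business exception.

```python
from typing import Callable


class BusinessRuleError(Exception):
    """Hypothetical stand-in for a business rule exception."""


def enforce_page_limit(page_exists: Callable[[int], bool], limit: int = 10) -> None:
    """Raise if the document has more than 'limit' pages, using one probe."""
    # If page limit + 1 can be extracted, the document is too long.
    if page_exists(limit + 1):
        raise BusinessRuleError(f"PDF has more than {limit} pages")


# A simulated 10-page document passes without raising.
enforce_page_limit(lambda page: 1 <= page <= 10)
```

Only one extraction is attempted regardless of the document length, because the exact page count is never needed for this check.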

Thank you @DanielMitchell @dmccammond @bcorrea.

Very much true. That’s why I am not going with Regex. @dmccammond @bcorrea
