How to efficiently extract “Page X of Y” for all pages of a scanned PDF?

My earlier post on related issues did not get a satisfactory response, so I am re-posting with just the essential issue.
I have multipage brokerage account statements which are scanned PDFs. Each page is numbered Page X of Y (e.g., Page 1 of 4, Page 2 of 4, etc.) at the top right. I want to extract the “Page X of Y” for each page so I can run logic to verify that all pages of the statement were captured in the scan and that there are no duplicated pages. So, for this 4-page example document, I expect to get extracted data as follows:
“Page 1 of 4”, “Page 2 of 4”, “Page 3 of 4”, “Page 4 of 4”.
Additionally, there is a ton of scanned text on each page, which I don’t care about. If possible, for better efficiency, I want the OCR engine to ignore the main body of text on each page, and capture only the occurrences of “Page X of Y”. A sample page is attached.
What is an efficient way to extract this data? Thanks for your help!
PageXofY.pdf (55.8 KB)

Manage Packages > Official > PDF Activities. They are great for page / text extraction.

For OCR I suggest Google Tesseract with a language model installed (slow but free).
I bet some cloud-based solution is way faster. I have never tried it that way, though.

You can get tons of languages from here:

Language installation instruction:

Hi @TastyToast,
Thank you. I chewed on your response, LOL! But unfortunately, your answer is too generic - it’s a little like saying RTFM :grinning: It does not provide a solution to the specific problem I described in my post. Btw, I have read the documentation on pdf data extraction and on Document Understanding, as well as seen videos of the same. However, I have not yet seen a way to do what I want, in an efficient manner. Hopefully, someone else will post an example to show how to do this. Thanks again.

Hey @amodsinghal, can you upload a PDF with multiple pages, and this time do not highlight Page 2 of 4?

PageXofY-d.pdf (1.2 MB)

Here it is. I have erased/blurred out sensitive information. Thank you.


try with this xaml…
Test.xaml (45.6 KB)

I am using Adobe Acrobat as the PDF reader; if you are using any other app, you might have to change a few steps.
For some weird reason it is giving me False at the end even if all the pages exist.

Hi @AkshaySandhu,
First of all, thanks for taking time to work on this!
I got errors running this, so I’ll try to fix them.

However, you may be getting false at the end due to a typo in your default value of strPageValue: “,Page 1 ot 4,Page 2 of 4,Page 3 of 4,Page 4 of 4”
Not sure if I understand your code completely.
Looks like you are using PowerShell to open the document in Acrobat, then using hotkeys to page to the last page? I am not sure I understand what you get from the Get Text text /4" GetValue activity. Finally, it appears that in your While loop you are getting the OCR text in the clipping region.
If you don’t mind, can you briefly explain the steps in your workflow? Are you getting efficiency by avoiding OCR’ing the entire page by using Clipping Region?
Thanks again.

Sure, below is the step-by-step explanation:

  1. Open the PDF with the default PDF reader. In my case it is Adobe PDF Reader.
  2. After opening the file, send the hotkey Ctrl + Home to get to the first page. (Adobe remembers the page at which the PDF was closed, which is why I added this.)
  3. Send the hotkey Ctrl + 3 to adjust the view of the PDF to screen width.
  4. Use the Get Text activity to get the exact number of pages in the PDF file.
  5. In the first While loop:
     a. Find the “Page” image [the name of the activity is “Find Image ‘AcroRd32.exe Doc1.pdf’”].
     b. Enlarge the region using the Set Clipping Region activity.
     c. Perform OCR on that region.
     d. Store the extracted text in the variable strPageValue.
     e. Send the hotkey to go to the next page.
  6. In the second While loop:
     a. Get the number of pages that should be in the PDF (we get this with the following: Right(strPageValue.Split(","c)(1).Trim,1) ).
     b. Check whether all the pages are present in strPageValue. E.g., if the PDF has 3 pages but the label says there should be 4 (determined from “Page 1 of 4”), then bAllPageExist will be False.
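For anyone doing the cropping outside UiPath, here is a rough Python sketch of the idea behind step 5b: work out a small box in the top-right corner so that only that strip goes to the OCR engine. Note that this swaps in a different technique from the workflow above: it uses a fixed fraction of the page instead of anchoring on the “Page” image. The function name `top_right_box` and the default fractions (35% of the width, 8% of the height) are my own assumptions and would need tuning against the real statements.

```python
def top_right_box(page_width, page_height, width_frac=0.35, height_frac=0.08):
    """Return a (left, upper, right, lower) pixel box for the top-right corner.

    width_frac and height_frac are illustrative guesses, not values taken
    from the original workflow.
    """
    if not (0 < width_frac <= 1 and 0 < height_frac <= 1):
        raise ValueError("fractions must be in the range (0, 1]")
    # Keep only the right-most slice of the page width...
    left = page_width - round(page_width * width_frac)
    # ...and only the top slice of the page height.
    lower = round(page_height * height_frac)
    return (left, 0, page_width, lower)


if __name__ == "__main__":
    # A 1000 x 2000 pixel page image.
    print(top_right_box(1000, 2000))
```

The tuple is in (left, upper, right, lower) order, which is the order Pillow's `Image.crop` expects, so the cropped strip could then be handed to whatever OCR engine is in use.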

Note: I found the issue. Just set bAllPageExist = True in the Variables pane. The variable was giving False because I never assigned it True.
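In case it helps, the check from step 6 can also be sketched outside the workflow. This is a minimal Python sketch, not the logic in Test.xaml: the name `check_pages` and the shape of the result are made up for illustration. It reads X and Y with a regular expression instead of taking the last character, so it should also cope with statements of 10 or more pages, and it reports labels the OCR misread (such as “Page 1 ot 4”) instead of silently failing.

```python
import re

# Tolerant of case and spacing; accepts multi-digit page numbers.
PAGE_RE = re.compile(r"page\s*(\d+)\s*of\s*(\d+)", re.IGNORECASE)


def check_pages(labels):
    """Check a list of OCR'd "Page X of Y" labels, one per scanned page."""
    seen, totals, unreadable = [], set(), []
    for text in labels:
        match = PAGE_RE.search(text)
        if match:
            seen.append(int(match.group(1)))    # X: this page's number
            totals.add(int(match.group(2)))     # Y: claimed page count
        else:
            unreadable.append(text)             # OCR output did not match
    total = max(totals) if totals else 0
    missing = [n for n in range(1, total + 1) if n not in seen]
    duplicated = sorted({n for n in seen if seen.count(n) > 1})
    # OK only if every label was readable, all labels agree on Y, and the
    # page numbers are exactly 1..Y with nothing missing, repeated, or extra.
    ok = (
        len(totals) == 1
        and not unreadable
        and sorted(seen) == list(range(1, total + 1))
    )
    return {"ok": ok, "missing": missing,
            "duplicated": duplicated, "unreadable": unreadable}


if __name__ == "__main__":
    print(check_pages(["Page 1 of 4", "Page 2 of 4", "Page 3 of 4", "Page 4 of 4"]))
```

The result is computed from scratch on every call, so there is no flag that needs a default value the way bAllPageExist does.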

Thank you for the clear explanation! You have been very helpful. I’ll run some tests to verify whether the added overhead of opening the PDF in Acrobat results in poorer performance than other methods that OCR the whole document. Thanks again.
