Looped "Read PDF with OCR" -- process breaks when attempting to use try catch for exception handling

loop
pdf
ocr
scraping
trycatch

#1

Scenario: My newest UiPath robot has a simple loop: a For Each (row in a Data Table I’ve created) cycles through a folder of PDFs and uses Google OCR inside a “Read PDF with OCR” activity. I scrape each PDF, determine if it contains certain text, and save pages of the PDF to different locations depending on what I find. If at any point my robot runs into the “Scrape returned empty text” exception (which I attempt to handle using a Try Catch), the “Read PDF with OCR” activity bugs out “eternally” on every loop going forward: it simply stops scraping the page on each pass.

Steps to reproduce:

(1) Create a simple loop similar to the one in the screenshot I will attach below. The loop should use “Google OCR” inside a “Read PDF with OCR” activity to read each page of a PDF document one at a time. (Use an incrementing counter.)
(2) The “Read PDF with OCR” activity should be inside a Try Catch. The “ArgumentOutOfRangeException” catch is optional (add it if you want to duplicate how I detect the end of the document), but the “Exception” catch is necessary, as it is the only catch that can handle the “Scrape returned empty text” error.
(3) Point the “Read PDF with OCR” activity to a PDF that contains at least one page with no text.
(4) Run the process in Slow Step debug mode and note the difference in how UiPath handles the “Read PDF with OCR”/“Google OCR” activities before and after the first time it runs into the “Scrape returned empty text” exception.

Current Behavior:

One aspect of my loop is that I use a Try Catch to determine if I have scraped to the last page in each file. (I scrape each page separately rather than the whole document at once because I may be saving specific pages in each document to different locations depending on what I find, and I found that this was the simplest way to build that out.) I use this Try Catch on the outside of the “Read PDF with OCR” activity, and the exception type it is catching is “ArgumentOutOfRangeException” (exception source: Read PDF with OCR). This part of the Try Catch works completely fine. I have now run the process for thousands of PDFs and my loop/last page catch work flawlessly.

My issue occurs when I run into a different exception, namely “Scrape returned empty text.” This occurs whenever the page I am trying to scrape has no readable text on it. The exception type for this exception is simply “Exception”, and the exception source is “Google OCR” (rather than “Read PDF with OCR”), which I think is where the problem comes in.

To attempt to handle this exception, I have added an additional catch to the Try Catch on the outside of my “Read PDF with OCR” activity that catches any additional generic “Exceptions” that may occur. (The “Scrape returned empty text.” exception is the only other exception I have run into during my extensive testing, so I am fine with this.) I believe the reason this may not work appropriately (as I will explain below) is that the error is really occurring within the “Read PDF with OCR” step (in the “Google OCR” activity), but there is no way for me to put a Try Catch within the “Read PDF with OCR” activity, as the only activities it will accept are OCR activities.

So what happens is this: The Try Catch “successfully” catches the error, in that the robot doesn’t error out and stop running. Instead it continues on with the loop, increments my page counter, and comes back around to the “Read PDF with OCR” step. After running the process in Slow Step debug mode several times, I finally figured out what happens at this point: the robot completely skips the “Google OCR” step in every later iteration of the loop. The UiPath yellow debug highlighting stops at the “Read PDF with OCR” step and does not highlight the “Google OCR” step, nor does it spend enough time on the “Read PDF with OCR” activity to have actually scraped anything. In addition, my Try Catch for finding the last page in the PDF never triggers, so my robot goes into an eternal loop and I have to force quit it.

Expected Behavior:

I just need some way for UiPath to catch these “Scrape returned empty text” errors and continue on with my loop without breaking every later pass of the loop.
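
To make the expected behavior concrete, here is a minimal Python sketch of the control flow described above. It is only an illustration, not UiPath code: `read_page_with_ocr`, `EmptyScrapeError`, and `scrape_document` are hypothetical names, `IndexError` stands in for the “ArgumentOutOfRangeException” catch, and a list of strings stands in for the pages of a PDF.

```python
class EmptyScrapeError(Exception):
    """Stand-in for the generic "Scrape returned empty text" exception."""


def read_page_with_ocr(pages, page_number):
    """Hypothetical stand-in for "Read PDF with OCR" on one page (1-based)."""
    if page_number > len(pages):
        # Mirrors the out-of-range error used to detect the end of the document.
        raise IndexError(f"page {page_number} is out of range")
    text = pages[page_number - 1]
    if not text.strip():
        raise EmptyScrapeError("Scrape returned empty text")
    return text


def scrape_document(pages):
    """Return (page_number, text or None) for every page in the document."""
    results = []
    page_number = 1
    while True:
        try:
            text = read_page_with_ocr(pages, page_number)
        except IndexError:
            break  # past the last page: leave the loop
        except EmptyScrapeError:
            text = None  # blank page: record it and keep going
        results.append((page_number, text))
        page_number += 1
    return results


if __name__ == "__main__":
    print(scrape_document(["invoice 42", "   ", "summary"]))
```

The point of the sketch is that a blank page should only affect its own iteration: the next call to the OCR step still runs, and the out-of-range check still ends the loop.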

Studio/Robot/Orchestrator Version:

Last stable behavior: NA
Last stable version: NA
OS Version: Windows 7 Enterprise
Others if Relevant: (workflow, logs, .net version, service pack, etc):
Loop:


#2

I have uploaded (see below) a sample project containing my “problem loop” as well as two PDFs (one with a blank page, one without) to replicate this error. Please – would someone test this out and (A) confirm if you are experiencing this same error, and/or (B) reply if you know what is causing this.

Just to summarize: The issue you should run into is that after the robot encounters the blank page in the attached “PDF with Blank Page” document, it will get stuck in an eternal loop in which the robot no longer even attempts to run the “Google OCR” activity. (Turn Debug mode on and run it in Slow Step to confirm this.)

Note: The OCR output on these test PDFs is gibberish, but that doesn’t have anything to do with the error.

Thanks again!
Riley

Error Demonstration.xaml (16.7 KB)
PDF with Blank Page.pdf (231.0 KB)
Regular PDF.pdf (230.0 KB)


#3

For me it just zipped through up to page 132 at which point Studio crashed completely…

Something similar was happening at one time with MODI (Microsoft OCR). Solution back then was to run it in a separate process (via Invoke with Isolated flag).
Your code, with a small modification, is below; it works on my end with 2017.1.6435.

PdfOcrErrorExample.xaml (16.2 KB)
ExtractOCR.xaml (6.6 KB)
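
For readers without the attachments, the idea of running the OCR step in a separate process can be sketched generically in Python. This is an analogy under stated assumptions, not the contents of the attached XAML files: `WORKER_CODE` and `ocr_page_isolated` are hypothetical names, and the worker only upper-cases its input in place of real OCR.

```python
import subprocess
import sys

# Hypothetical worker: a stand-in for the OCR step. It prints the "scraped"
# text, or exits with a non-zero code when the page has no text.
WORKER_CODE = """
import sys
text = sys.argv[1]
if not text.strip():
    sys.exit("Scrape returned empty text")
print(text.upper())
"""


def ocr_page_isolated(page_text, timeout=30):
    """Run the stand-in OCR step in a fresh process; return None on failure."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, page_text],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if completed.returncode != 0:
        # The failure stays inside the child process; the caller just moves on.
        return None
    return completed.stdout.strip()


if __name__ == "__main__":
    for page in ["page one", "   ", "page three"]:
        print(repr(ocr_page_isolated(page)))
```

Because every call starts from a fresh process, whatever state a failed scrape leaves behind cannot carry over into the next page. The trade-off is the cost of starting a new process on every call.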


#4

Hi Andrzej,

Thanks for the recommendation! I tried applying this to my process last night and it ALMOST worked as a workaround. (Although even if it had worked, I still feel that the fundamental problem is with the “Read PDF with OCR” step, and the ultimate solution needs to be figuring out why that activity isn’t working properly in a try catch/loop situation.)

As I said, when I first started running the modified process it looked like your idea of putting the “Read PDF with OCR” portion of the loop inside its own isolated Invoked Workflow was going to work. My process was able to get through several PDFs that I knew had “Scrape returned empty text” pages in them. If I were only going through 10 or 11 documents, this would have been great. Unfortunately, my process is looping through thousands of scanned PDFs, each with between 1 and 15 pages. I started the process at about 6:00 PM last night. By 7:00 PM, my computer was popping warnings about running slow, and the screen was flickering black (it looked like it may have been flickering every time it reached the OCR step in the loop). I left it running all night, and when I woke up in the morning the robot was still showing as running (in my Start Menu bar), but everything had frozen. In fact, when I went to open a folder on my desktop to determine where in the process the robot had got to, my Windows Explorer wouldn’t even load properly, and a blank (not fully loaded) “Workflow Designer” dialogue box popped up that prevented me from doing anything else in either Windows Explorer or UiPath until I had closed it.

I don’t know if this is directly related to my original error, or if this freezing up problem was brought on as a result of my looping through the same “Invoke Workflow” step so many times in a row. (I can only guess that Invoking a Workflow uses a fair amount of system resources, and perhaps Invoking the same Workflow in such a tight loop was what caused my process to freeze like this?)

So, all that goes to say: this is still an issue and I’m still looking for a workaround for my process. (Last night the robot only read through 146 out of 1,293 PDFs before freezing up. It appears to have fully “frozen” right around midnight. Just for the record, I do have a very good computer that has run much more complex robots overnight before with no problem. Those robots contained OCR steps, just not in a try catch/loop situation like this.)

There were a couple other things from your post, Andrzej, that I wanted to mention/comment on. Regarding your following quote:

Yes, this is what it does to me as well. If you debug it in Slow Step mode you will notice what is actually happening is that once the process hits the “Scrape returned empty text” page and exception, on all future loops the robot just skips the “Google OCR” step entirely (i.e. this step stops getting highlighted in yellow). This is why it speeds through the following “pages” so quickly after reaching the blank page in the PDF. Another indication that it is skipping the OCR altogether is that it reports reading “page 132” and beyond in the first place. The PDF is only a few pages long, and if the process were working as it should, the “ArgumentOutOfRange” catch would kick it out of the loop as soon as it went past the last page.

I also just wanted to bring up this part of your post as well:

You mention that the isolated flag has to be checked in order to make your recommendation work. You’re right in that if you don’t check this box the Robot will freak out on the very first PDF it encounters (whether or not that PDF has blank pages in it). It doesn’t begin skipping the “Google OCR” step or anything “weird” like that, but rather crashes the Robot outright.

I definitely think there is something weird going on, either with the “Read PDF with OCR” step or with the way OCR interacts with UiPath in general.

Let me know if anyone has additional recommendations/ideas on how I can work around this. (My current solution is that I run the robot but have to manually reset it whenever I encounter a “Scrape returned empty text” error.)

Cheers,
Riley


#5

Hi,
I am getting a strange error. UiPath stops after reading 15-20 random files while working with the “Read PDF with OCR” activity. When I re-run from where it stopped, it works again for the next 15-20 files. I have 1000 files to process, so I am not able to move forward. Any help at the earliest would be highly appreciated. Below is the error:

Main has thrown an exception

Source: Read PDF with OCR

Message: Value cannot be null.
Parameter name: property

Exception Type: ArgumentNullException

System.ArgumentNullException: Value cannot be null.
Parameter name: property
at System.Activities.ExecutionProperties.Add(String name, Object property, Boolean skipValidations, Boolean onlyVisibleToPublicChildren)
at UiPath.PDF.Activities.ReadOCRFileActivity.SchenduleProcessImage(NativeActivityContext context)
at UiPath.PDF.Activities.ReadOCRFileActivity.EndExecute(NativeActivityContext context, IAsyncResult result)
at UiPath.PDF.Activities.AsyncNativeActivity.BookmarkResumptionCallback(NativeActivityContext context, Bookmark bookmark, Object value)
at System.Activities.Runtime.BookmarkCallbackWrapper.Invoke(NativeActivityContext context, Bookmark bookmark, Object value)
at System.Activities.Runtime.BookmarkWorkItem.Execute(ActivityExecutor executor, BookmarkManager bookmarkManager)

Thanks & Regards,
Aashish
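
One generic way to keep a long batch moving when a step fails intermittently like this is to retry each file and record the ones that never succeed. The Python sketch below illustrates only that control flow and does not address the underlying ArgumentNullException. `process_batch`, `process_file`, and `max_attempts` are hypothetical names, and the `flaky` function in the demo is a made-up stand-in for the OCR step.

```python
def process_batch(files, process_file, max_attempts=2):
    """Process every file, retrying failures; return (succeeded, failed)."""
    succeeded = []
    failed = {}  # path -> description of the last error seen
    for path in files:
        for _attempt in range(max_attempts):
            try:
                process_file(path)
            except Exception as exc:  # broad on purpose: keep the batch going
                failed[path] = repr(exc)
            else:
                succeeded.append(path)
                failed.pop(path, None)  # a retry succeeded, so clear the error
                break
    return succeeded, failed


if __name__ == "__main__":
    attempts = {}

    def flaky(path):
        # Made-up failure pattern: b.pdf fails once, c.pdf always fails.
        attempts[path] = attempts.get(path, 0) + 1
        if path == "c.pdf" or (path == "b.pdf" and attempts[path] == 1):
            raise ValueError("Value cannot be null.")

    done, errors = process_batch(["a.pdf", "b.pdf", "c.pdf"], flaky)
    print(done)
    print(errors)
```

The `failed` mapping doubles as a list of files to re-run or inspect later, which replaces restarting the whole run by hand from the point where it stopped.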


#6

What engine are you using for ocr scanning?


#8

I am using Google OCR. I use a For Each loop to pass a list of PDF files to “Read PDF with OCR”.


#9

And what Tesseract language are you using? Are you up to date with that?
We have a known issue with the Japanese language when processing lots of files.