Auto-Assign "Not classified" documents to MISC

The Intelligent Keyword Classifier completely ignores blank pages. Is there a way to auto-assign these to MISC instead of leaving them not classified?

@David_Hernandez2

With the Intelligent Keyword Classifier you might not be able to do that. You can read the document pages separately and assign or ignore them, but since they are not classified as of now, this is what it shows.

cheers

You can parse the classification results before presenting Classification Station: loop through the pages, and if one has no classification result, update it to the MISC classification. I haven’t specifically done this, but I have parsed certain information out of the classification results. Hopefully it’s not a read-only object.

I just ran a test and found the same as you, the blank page isn’t classified. It isn’t even included in the classification results. There are only 2 results in the classification results for my 3 page test document where one page is blank.
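
A minimal sketch of how the missing pages could be detected, in Python purely for illustration (the real workflow would use UiPath activities). Each classification result is modeled as a hypothetical `(start_page, page_count)` tuple with 0-indexed start pages; the function name `unclassified_pages` is an assumption, not part of any UiPath API.

```python
def unclassified_pages(bounds, total_pages):
    """Return the 0-indexed pages not covered by any classification result.

    bounds: iterable of (start_page, page_count) tuples, a hypothetical
    stand-in for the page bounds carried by each classification result.
    """
    covered = set()
    for start_page, page_count in bounds:
        covered.update(range(start_page, start_page + page_count))
    return [page for page in range(total_pages) if page not in covered]


# A 3-page document whose middle page is blank, so only 2 results come back.
print(unclassified_pages([(0, 1), (2, 1)], total_pages=3))  # prints [1]
```

Comparing against the total page count of the PDF is what exposes the gap, since the blank page leaves no entry of its own in the results.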

I came up with a fairly simple way to use classification results to remove the blank pages from the PDF. The only catch is then you’d have to re-digitize and re-classify the document or the classification results will no longer match the document structure.

The trick is to use the classification results to build a page range of the pages you want to keep, to pass to the Extract PDF Pages activity.

Start by looping through the classification results:

Inside the loop, check if the current page count is 1:

If it is 1, we just add the page number to our list:

(We add 1 to the StartPage property because the classification results are 0-indexed, but Extract PDF Pages is not. So page 0 in the classification results is page 1 to Extract PDF Pages.)

Otherwise (i.e. the page count is greater than 1), build a page range:

Now we Join the list into a string:

This gives us a result like:
(screenshot of the resulting page range string)

Then we pass that to Extract PDF Pages:
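
The page-range steps above can be sketched roughly as follows, in Python purely for illustration. The `(start_page, page_count)` tuples are a hypothetical stand-in for the bounds of each classification result, `build_page_range` is an assumed name, and the comma-separated output format (single pages and `first-last` ranges) is an assumption about what the extraction activity accepts.

```python
def build_page_range(bounds):
    """Build a 1-indexed page range string from 0-indexed result bounds.

    bounds: iterable of (start_page, page_count) tuples (hypothetical
    stand-in for the classification results).
    """
    parts = []
    for start_page, page_count in bounds:
        first = start_page + 1  # results are 0-indexed, the PDF activity is not
        if page_count == 1:
            parts.append(str(first))
        else:
            last = first + page_count - 1
            parts.append(f"{first}-{last}")
    return ",".join(parts)


# Pages 0-1 form one document, page 2 is blank (no result), page 3 is another.
print(build_page_range([(0, 2), (3, 1)]))  # prints 1-2,4
```

Because blank pages never appear in the results, they simply drop out of the range string.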


As for the re-classification, here’s how I would do it.

Put everything in a Repeat Number of Times activity set to repeat 2 times. Use Get PDF Page Count to store the page count in a variable. Digitize and classify. Loop through the classification results to total the number of pages that were classified. If the PDF page count equals the classified page count, we are fine and break out of the repeat. Otherwise, build the page range, extract the PDF pages to a new document, and let the repeat run again.
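
A rough sketch of that repeat logic, in Python purely for illustration. The three callables passed to `clean_document` are hypothetical stand-ins for the page count, digitize-and-classify, and page extraction steps, and the toy helpers below only simulate a PDF as a list of page texts.

```python
def clean_document(pdf, page_count, classify, extract_pages, max_passes=2):
    """Re-classify after stripping unclassified pages, at most max_passes times.

    page_count(pdf)        -> number of pages in the PDF
    classify(pdf)          -> list of (start_page, page_count), 0-indexed
    extract_pages(pdf, rg) -> new PDF holding only the pages in range rg
    All three are hypothetical stand-ins for workflow activities.
    """
    results = []
    for _ in range(max_passes):
        results = classify(pdf)
        classified = sum(count for _, count in results)
        if classified == page_count(pdf) or not results:
            break  # results match the document (or nothing to keep)
        keep = ",".join(
            str(s + 1) if c == 1 else f"{s + 1}-{s + c}" for s, c in results
        )
        pdf = extract_pages(pdf, keep)
    return pdf, results


# Toy model: a "PDF" is a list of page texts; an empty string is a blank page.
def toy_classify(pdf):
    return [(i, 1) for i, text in enumerate(pdf) if text]


def toy_extract(pdf, rg):
    pages = []
    for part in rg.split(","):
        lo, _, hi = part.partition("-")
        pages.extend(range(int(lo), int(hi or lo) + 1))
    return [pdf[p - 1] for p in pages]


pdf, results = clean_document(["a", "", "b"], len, toy_classify, toy_extract)
print(pdf, results)  # prints ['a', 'b'] [(0, 1), (1, 1)]
```

The second pass exists only so the classification results line up with the shortened document. If the counts still differ on the last pass, the returned results no longer match the extracted PDF, so a real workflow may want to check for that case.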

In the end, the whole thing looks like…

DU Remove Blank PDF Pages.xaml (30.0 KB)

I used…

UiPath.IntelligentOCR.Activities 6.22.1
UiPath.PDF.Activities 3.20.2
UiPath.Persistence.Activities 1.5.11
UiPath.System.Activities 24.10.6
UiPath.UIAutomation.Activities 24.10.10


Hello,

I also initially removed blank pages from the PDF. The issue is that about 1-2% of the time the Intelligent Keyword Classifier will classify a document with actual images or text on it as blank. We still need those documents, which is why I want to set them as MISC.

I came up with a somewhat complicated solution for this, but I’ll attach the workflow below, which turns all non-classified documents into MISC. You just need to pass in the initial classification results and the total page count of the PDF!

CheckBlankPages.xaml (67.6 KB)


What dependency versions are you using? I tried to open that but I’m getting…

What version of Studio?

From what I can see, you’re looping through and wherever there is a missing page you’re manually inserting a new page into the classification result?

I’m using Studio 2025.0.157, with all dependencies updated to the most recent version.

Also in C#.

Yes, I’m essentially looping through and manually inserting a classification result object of “MISC” where there is a missing page. Was a fun problem to solve.
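
A minimal sketch of that gap-filling idea, in Python purely for illustration; it is not the attached workflow. The `Result` class is a hypothetical stand-in for a classification result object, `fill_gaps` is an assumed name, and giving each missing page its own one-page MISC result is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for one classification result."""
    doc_type: str
    start_page: int      # 0-indexed
    page_count: int = 1


def fill_gaps(results, total_pages, fallback="MISC"):
    """Return results covering every page, inserting fallback results in gaps."""
    filled = []
    next_page = 0  # first page not yet covered
    for r in sorted(results, key=lambda r: r.start_page):
        for page in range(next_page, r.start_page):
            filled.append(Result(fallback, page))  # page had no result
        filled.append(r)
        next_page = max(next_page, r.start_page + r.page_count)
    for page in range(next_page, total_pages):
        filled.append(Result(fallback, page))  # trailing unclassified pages
    return filled


filled = fill_gaps([Result("Invoice", 0), Result("Receipt", 2)], total_pages=4)
print([(r.doc_type, r.start_page) for r in filled])
# prints [('Invoice', 0), ('MISC', 1), ('Receipt', 2), ('MISC', 3)]
```

The total page count has to come from the PDF itself, because the classification results alone cannot reveal unclassified pages at the end of the document.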


Oh, that explains why I can’t see any of it: I opened it in a VB.NET project on an earlier version of Studio. I can at least see your high-level logic. Awesome solution, I had wondered if the classification results could be updated like that. I have a project where the PDFs are very long and the digitize step takes too long. We aren’t doing automatic classification; we are presenting the user with Classification Station. But the digitize and classify steps are still required, so I spoof the document DOM in a similar fashion by building it manually to avoid the digitize step.
