Splitting a multi-page PDF document containing multiple document types

Hi all

Pretty big ramble here, so please bear with me.

We are working on a problem where we are using DU to ultimately ingest files of one type into a system; let's call these cover pages. We will be performing text extraction on these at a later stage (in the performer), but for now, in the dispatcher, the goal is to extract a number of documents from a single, large master PDF.

Now, I am using DU & the Intelligent Keyword Classifier to split these documents based on the cover pages.

To train the model, I manually split one of these documents up and carefully extracted only the cover pages, and that training is done.

So the plan is: the document gets digitized and then classified, and using the classification results I feed DocumentBounds.StartPage into the Extract PDF Page Range activity, with a little logic based on DocumentBounds.PageCount to generate a range string of the format “2”, “2-4”, etc. This approach works nicely.
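
For reference, the range logic is essentially this (a simplified sketch of what my workflow does; `result` is an illustrative name for one ClassificationResult, and I'm assuming StartPage is 0-based, so drop the +1 if yours is already 1-based):

```vb
' Build the 1-based range string for Extract PDF Page Range
' from one classification result ("result" is illustrative).
Dim firstPage As Integer = result.DocumentBounds.StartPage + 1
Dim lastPage As Integer = firstPage + result.DocumentBounds.PageCount - 1
Dim range As String = If(result.DocumentBounds.PageCount = 1,
                         firstPage.ToString(),
                         String.Format("{0}-{1}", firstPage, lastPage))
' range is now e.g. "2" or "2-4"
```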

My first question.

The example I am using to dev with is a 96-page PDF, and the digitization process takes a r-e-a-l-l-y long time and absolutely pins the cores on my CPU for the duration. That in itself is no big issue, but I do have concerns about when this scales up to larger volumes of these. Does anyone know of any optimizations I can make to the digitization step?

Another question.

Some of these cover pages actually stretch to two pages depending on the volume of content in them, so I have trained some as two-page documents and others as single-page documents. Does that pose a problem? I only ask because I am getting mixed results in classification.

Ultimately, I think we are going to use the Classification Station within Action Center to provide human-in-the-loop input to help us split these initially, feeding that back into training via the Train Classifiers scope, but I had hoped for better results out of the gate without it.

I did actually wonder whether I should train the system using the worst-case first page of a cover document: that is, work out what a worst-case document looks like, see what content is on that first page, and train on that.

I guess this is difficult because the Intelligent Keyword Classifier takes granular control of the word list it builds away from me. I would appreciate your thoughts.

Andy


The speed of digitization depends on which OCR engine you use. It's expected that digitizing 90+ pages takes a fair amount of time, but you do have a small window to increase the speed through the OCR engine itself: if it's hosted on a server, you can increase the memory allocated to it, add GPU acceleration, etc.

Here is another approach to speeding up the digitization process.
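
To sketch the idea (my own illustration, not an official feature; the numbers are made up): split the master PDF into fixed-size chunks with Extract PDF Page Range and digitize the chunks in parallel, e.g. across multiple jobs. The chunk ranges can be computed in an Invoke Code activity like this:

```vb
' Build 1-based range strings ("1-12", "13-24", ..., "85-96") so a large
' PDF can be split into chunks and each chunk digitized in parallel.
' totalPages and chunkSize are illustrative values.
Dim totalPages As Integer = 96
Dim chunkSize As Integer = 12
Dim ranges As New List(Of String)
For start As Integer = 1 To totalPages Step chunkSize
    Dim finish As Integer = Math.Min(start + chunkSize - 1, totalPages)
    ranges.Add(If(start = finish, start.ToString(), String.Format("{0}-{1}", start, finish)))
Next
```

The trade-off is that a cover page sitting on a chunk boundary can end up split across two chunks, so you would need some overlap or boundary handling, and any per-chunk classification results have to be offset back to master-PDF page numbers.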

I don't see a problem here; as long as you have trained the classifier on both the one-page and two-page variants, they should be fine.

You can improve the accuracy of the model over time by using the Intelligent Keyword Classifier Trainer, which will retrain the model with human-in-the-loop validated data.
