How to classify and extract multiple invoices on the same page of a document?

Hello UiPath Community! I am working on a scenario where one PDF file contains a list of multiple invoices. The invoices vary in length. If they are short, there can be multiple invoices on one page. An invoice can also extend over multiple pages if it is long or if it starts near the bottom of a page. The invoices are fairly standardized, but there can be some differences between them. Some fields always appear on each invoice, while optional fields may or may not appear. I have included a simple drawing showing how the invoices could be laid out. Note how page 1 has multiple invoices on it and how invoice 3 begins on page 1 but extends onto page 2.

I am trying to figure out the best way to classify and extract my data but have not had success yet. I tried using the Intelligent Keyword Classifier but it did not give me the results I was expecting. I only trained the classifier on one document so perhaps it was not enough? Is there a different classifier that would work better for this use case? Or perhaps I do not need a classifier at all and can go straight to extraction? I will only be dealing with one document type (invoices) so maybe I do not need to classify them but I thought classification may be helpful for distinguishing where one invoice ends and the next begins?

Appreciate any help with my situation.

Based on your scenario, you are trying to extract data from a PDF file that contains a list of multiple invoices, which can vary in length and layout. The invoices are standardized but may have some variations, and some fields may be optional.

There are a few different approaches you can take to tackle this problem. One approach is to use an OCR engine to extract the text from the PDF, and then use UiPath’s built-in data extraction activities, such as the “Form Extractor” inside a “Data Extraction Scope”, to extract the relevant data from the invoices.
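To make the extraction step more concrete, here is a minimal sketch in plain Python rather than UiPath activities. It swaps the Form Extractor for simple regular expressions, assumes the text of a single invoice is already available as a string, and uses made-up field labels (`Invoice No`, `Total`, `PO Number`) and a hypothetical helper `extract_fields`. None of these names come from the original post.

```python
import re

# Hypothetical field labels; replace with the labels on the real invoices.
REQUIRED = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "total": re.compile(r"Total:\s*([\d.,]+)"),
}
OPTIONAL = {
    "po_number": re.compile(r"PO Number:\s*(\S+)"),
}


def extract_fields(invoice_text):
    """Return a dict of field values; optional fields may be None."""
    result = {}
    for name, pattern in REQUIRED.items():
        match = pattern.search(invoice_text)
        if match is None:
            # A missing required field usually signals bad OCR or a bad split.
            raise ValueError(f"required field missing: {name}")
        result[name] = match.group(1)
    for name, pattern in OPTIONAL.items():
        match = pattern.search(invoice_text)
        result[name] = match.group(1) if match else None
    return result


sample = "Invoice No: 1001\nPO Number: PO-77\nTotal: 1,250.00"
print(extract_fields(sample))
```

Treating required and optional fields differently means a missing optional field simply comes back as `None`, while a missing required field fails loudly and can be routed to manual review.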

Alternatively, you could use the “Intelligent Keyword Classifier” activity to classify the invoices, but as you mentioned, it may not be suitable for your use case. In that case, you could train a custom machine learning model, for example a sequence model such as an RNN or LSTM, to classify the invoices and extract the data. This requires labeled training data and considerably more effort.

Additionally, you could use a combination of the above approaches. For example, you could use the OCR engine to extract text from the PDF, and then use the “Intelligent Keyword Classifier” activity to classify the invoices, followed by data extraction activities to extract the relevant data.

In any case, I recommend that you start by training the classifier with a good amount of data, and you can fine-tune the classifier as you go.

It’s also worth noting that the UiPath “Document Understanding” framework covers OCR, classification and data extraction in a single workflow, although these run as separate stages.

Overall, the best approach will depend on the specific characteristics of your invoices and the data you need to extract. I recommend experimenting with different approaches to find the one that works best for your use case.

Hello…

I have faced this scenario several times with some vendors. The tricky thing here is that we don’t know how many invoices there will be on a given page. In addition, if another document is included along with the invoices, that can cause issues. In such cases, creating templates for extraction is not possible.

Classification can help to some extent, but it works per page. In other words, if you use the Intelligent Keyword Classifier, it classifies each page. When you have multiple invoices on the same page, or an invoice and a purchase order on the same page, that is tricky to handle.

So the best thing to do is to try to standardize the way the data is presented in the file. If we can get every invoice to start on a new page (a single invoice can still span multiple pages), that would be ideal.

Thanks Lahiru

Yes it seems the Keyword and Intelligent Keyword Classifiers can’t handle multiple documents on the same page. I tried manually training them through the Classification Station but could not find any way to assign the same page as part of two separate invoices. Each page can only be used once in the classification.

It seems your suggestion to separate the invoices so that each invoice starts on a new page would fix this, but it would be a very manual and time-consuming process in my case.

I have not had a chance to test the Machine Learning Classifier yet. Maybe it can identify 2 invoices on the same page? I will keep trying.

Even with the Machine Learning Classifier, it will probably be the same.

The challenge is that if you have two documents on one page, we also need to figure out which part of the page belongs to each. This is very complicated. We could still try to use regular expressions or similar text matching to do it, but it will be complex.
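As a rough illustration of the regex idea, here is a minimal sketch written in Python for brevity rather than as a UiPath workflow. It assumes the OCR text is already available as one string per page and that every invoice starts with a line such as `Invoice No: 12345`. The marker pattern and the names `InvoiceSegment` and `split_invoices` are hypothetical and would need adapting to the real documents.

```python
import re
from dataclasses import dataclass

# Hypothetical start-of-invoice marker; adjust to the real layout.
INVOICE_START = re.compile(r"^Invoice No:\s*(\S+)", re.MULTILINE)


@dataclass
class InvoiceSegment:
    number: str
    first_page: int  # 1-based page where the invoice starts
    last_page: int   # 1-based page where the invoice ends
    text: str


def split_invoices(pages):
    """Split per-page OCR text into one segment per invoice."""
    # Join the pages, remembering the character offset where each page starts.
    offsets, pos = [], 0
    for page in pages:
        offsets.append(pos)
        pos += len(page) + 1  # +1 for the newline used as page separator
    full_text = "\n".join(pages)

    def page_of(char_index):
        # Last page whose start offset is <= char_index (1-based).
        page = 1
        for i, start in enumerate(offsets, start=1):
            if start <= char_index:
                page = i
        return page

    matches = list(INVOICE_START.finditer(full_text))
    segments = []
    for i, m in enumerate(matches):
        # An invoice runs until the next marker, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        body = full_text[m.start():end].rstrip()
        segments.append(InvoiceSegment(
            number=m.group(1),
            first_page=page_of(m.start()),
            last_page=page_of(m.start() + len(body) - 1),
            text=body,
        ))
    return segments


pages = [
    "Invoice No: 1001\nTotal: 10.00\nInvoice No: 1002\nTotal: 20.00\nInvoice No: 1003\nItem A",
    "Item B\nTotal: 30.00",
]
for seg in split_invoices(pages):
    print(seg.number, seg.first_page, seg.last_page)
```

With the sample pages, invoices 1001 and 1002 both fall on page 1 and invoice 1003 runs from page 1 to page 2. The approach only works if a reliable start marker exists on every invoice; anything appearing before the first marker is ignored.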

How about setting some standards for the users who submit these documents, so that each file contains only one invoice? I think we can also improve the process a bit if we can convince them.