AI Document Understanding for Date Processing

Hello community,

I have a use case that requires us to extract the dock date from a pdf. However, sometimes there are three different types of dates within the pdf: dock date, ship date, and a random date.
Is there any tool that can help me categorize a ship date, dock date, and a random date?
One idea is to scan nearby words to infer whether the date is one of the three categories (or maybe there is a better solution that I am unaware of).
Additionally, not every PDF follows the same layout.

Let me know your thoughts!

Thanks,
Grant Walker

What type of document do you have?
Is it a Purchase Order?

Hi @grant.walker

Since your files are in .pdf format and the form structure might not be the same every time, you can go ahead with Document Understanding, which lets you extract data from semi-structured and unstructured files.

Let's assume the data you want to fetch is present in a key: value format, like ship date: aa/bb/cc or dock date: xx/yy/zz. If that is the case, you can read the PDF file directly, which gives you a text string as output. From this, you can use RegEx patterns to extract the values associated with keywords such as ship date, dock date, etc.
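To illustrate the keyword idea outside of Studio, here is a minimal Python sketch. It assumes the PDF text has already been read into a string, that labels look like "Ship Date:" or "Dock date -", and that dates are numeric with / or - separators. The names DATE, PATTERNS and extract_labelled_dates are made up for this example and are not UiPath APIs.

```python
import re

# Assumed date shape: 1-2 digits, separator, 1-2 digits, separator, 2-4 digits.
DATE = r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})"

# Hypothetical labels; adjust to what your suppliers actually print.
PATTERNS = {
    "dock_date": re.compile(r"dock\s*date\s*[:\-]?\s*" + DATE, re.IGNORECASE),
    "ship_date": re.compile(r"ship\s*date\s*[:\-]?\s*" + DATE, re.IGNORECASE),
}

def extract_labelled_dates(text):
    """Return {field: date string or None} for each known label."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

sample = "PO 123\nShip Date: 03/14/2023\nDock date - 03/20/2023\nPrinted 01/02/2023"
print(extract_labelled_dates(sample))
```

The unlabelled "Printed" date is ignored because no pattern claims it, which is also why this approach breaks down once suppliers stop labelling their dates consistently.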

If the above assumption doesn’t hold true, then you might have to deploy a Machine Learning Skill, where you train an ML package to acquire the skills needed to extract the data from the file.

UiPath provides a wide range of out-of-the-box ML models, from which you can choose the one closest to your use case. (If you are dealing with invoices, UiPath provides the Invoices model and, in some cases, country-specific versions as well.)

Once you select the base model, you will label the fields that you want to extract across different files, say 30-50 files, and then train the model so it is ready for similar extractions in future. Once you train and deploy the skill, you can use it in Studio to extract the data from the files you receive subsequently.

Hope this helps, please let us know if you plan on proceeding with Document Understanding.
Best Regards.

Hi,
Yes, it is a purchase order document. We get confirmations from lots of different suppliers, so we cannot rely on the same format.

Hi,
Thank you for the ideas!
We are currently using regex, however it is not reliable enough, so we want to move towards an AI or ML approach.
Would the ML models be able to train and learn even though the documents have different formats? In the demos I have seen, people use documents that all share the same format, so they do not fit our use case.

@grant.walker

Yes. You can train a model to get started with the initial extraction. Even though the file structures are different, your ML skill will be capable of extracting the data, provided that you have labeled enough data and trained the model on it.

You can create a data set in the AI Center and import 30-50 files into a data labeling session. You first need to manually label the ship date, dock date and the random date in all of these 30-50 documents in order to train the model. Once you label the data and export it to the data set, you can train an ML package with this data set and deploy the skill. This skill can then identify the ship date, dock date and the random date in files of differing structure.

Even if the model doesn’t extract the data in the initial runs (this generally happens if the model is not trained on sufficient data), you can validate such documents and send them for user review, which is the human-in-the-loop concept. This human-validated data can then also be used to retrain the model, which makes the model more reliable over time.

Hope this helps,
Best Regards.

Oh wonderful!
Would this work even if sometimes only one of the three dates were in the document?
For example, only the ship date is present and the dock/random date are not. Would the model get confused by that?

How would I try this out to see if this would work?

Thanks!
Grant Walker

@grant.walker

If the ship date is available and the dock/random dates are not, the bot will simply populate the respective fields with null values, as returned by the ML model. You can handle those values explicitly in the workflow with conditional controls such as the If activity: check whether the field has a value, then take the necessary action based on the outcome.
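The check itself is simple. In Studio it would be an If activity, but the logic can be sketched in Python as below. The fields dictionary, the pick_dock_date name and the "needs_review" status are assumptions made for illustration; they do not reflect the actual shape of the extraction results.

```python
def pick_dock_date(fields):
    """fields: hypothetical dict such as {"dock_date": None, "ship_date": "03/14/2023"}."""
    dock = fields.get("dock_date")
    if dock:
        # A non-empty value was extracted, so use it as-is.
        return {"status": "ok", "dock_date": dock}
    # Missing, None or empty: flag the document instead of guessing a date.
    return {"status": "needs_review", "dock_date": None}

print(pick_dock_date({"dock_date": None, "ship_date": "03/14/2023"}))
```

Flagging the document rather than falling back to another date keeps a ship date from being silently used as the dock date.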

Best Regards.

Thank you @arjunshenoy.
Do you have a link to where I can read about this more and the different types of models?

Thanks,
Grant Walker

@grant.walker

You can take a look at different models by navigating through the following resource:

https://docs.uipath.com/document-understanding/automation-cloud/latest/user-guide/out-of-the-box-pre-trained-ml-packages

Additional information can be found here:

https://docs.uipath.com/document-understanding/automation-cloud/latest/user-guide/overview-ml-packages

Best Regards.
