I am working on a project to extract multiple fields from a form-type document. The forms all share a similar format, so I am trying to use Forms AI under the Document Understanding service to handle this.
After I labeled the data under Forms AI and published the model, I added the extractor's endpoint to the ML Extractor in Studio. I also set up the ML Extractor Trainer, hoping to improve the extraction accuracy as additional documents are processed.
The problem is that I cannot work out where to feed the collected training data back into the ML model, which is set up under Forms AI.
I have set up a local storage folder to collect the training data. I also tried using the extractor endpoint provided by Forms AI as the dataset endpoint, but that produced an error message.
My questions are:
Can the ML extractor model derived from Forms AI be trained and re-trained?
Forms AI allows 20 documents for training. Is that the maximum number of documents one can use to train the ML model under Forms AI?
If this is not the case, is there a way to import the training data collected locally into the ML model to improve its accuracy? [I tried to locate the dataset file under AI Center. However, the project that was developed under “Document Understanding > Forms AI” did not show up under AI Center.]
If re-training is a limitation of Forms AI, can one easily convert an existing ML Extractor developed under Forms AI into one under AI Center without starting from scratch?
For your use case, why are you using Forms AI? Do you have structured or dynamic documents?
The idea of training an ML model is continuous improvement when the documents vary over time. If that’s the case, you can use the Document Understanding OOTB model, or any other OOTB model that suits your requirements, in combination with the Train Extractors Scope.
There is only a single type of document in this use case. The document is a Form so it can be considered a structured document.
However, the files are scanned PDFs, so the alignment of each scan can shift the field positions. The fields are therefore not in the same position in every PDF, which leads to inaccurate data extraction.
This is why I am exploring whether training the ML Extractor could improve the accuracy of the data extraction.
I have read the article at the above URL. However, it does not mention training the ML extractor.
If you only have a single type of structured document, you are on the right path in choosing Forms AI. Forms AI is built to handle skewed and native/scanned scenarios, so when you’re training your Forms AI model, make sure to include a mix of native, scanned, and skewed documents in the training set.
Once you have trained the Forms AI model, check its accuracy by testing some files. In my experience, Forms AI does a very good job of predicting values for documents, so you should test it out to determine whether Forms AI is still the best solution.
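If it helps, here is a rough way to put a number on "check the accuracy by testing some files". This is only a sketch in plain Python and does not call any UiPath API: it assumes you have already written out the values the extractor returned and the values you verified by hand as simple dictionaries. The file names, field names (`invoice_no`, `total`) and the helper functions are all made up for illustration.

```python
from collections import defaultdict


def normalize(value):
    """Collapse whitespace and ignore case so trivial differences do not count as errors."""
    return " ".join(str(value or "").split()).casefold()


def field_accuracy(expected, predicted):
    """Return the share of documents in which each field was extracted correctly.

    Both arguments map a document name to a dict of field name -> value.
    A field missing from `predicted` counts as a miss.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for doc, truth in expected.items():
        guess = predicted.get(doc, {})
        for field, value in truth.items():
            totals[field] += 1
            if normalize(guess.get(field)) == normalize(value):
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


# Hypothetical hand-verified values and extractor output for two test scans.
expected = {
    "scan_01.pdf": {"invoice_no": "A-100", "total": "250.00"},
    "scan_02.pdf": {"invoice_no": "A-101", "total": "99.50"},
}
predicted = {
    "scan_01.pdf": {"invoice_no": "a-100 ", "total": "250.00"},
    "scan_02.pdf": {"invoice_no": "A-101", "total": "99.80"},
}

scores = field_accuracy(expected, predicted)
for field, score in sorted(scores.items()):
    print(f"{field}: {score:.0%}")
```

Running the same test set through the model before and after you add more skewed or scanned samples shows which fields actually improved, rather than relying on a general impression.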
OOTB means out-of-the-box. UiPath has many OOTB models, and for a general document you can use the Document Understanding ML model. If you want to retrain your model continuously over time, you can explore this option.
On a related topic…
Is there a way to export the ML model developed under Forms AI so it can be included in a DU process performed by an unattended robot on a separate machine rather than on the developer’s machine?
You can always export the ML package from one machine to another, and across different environments. You can also use the export option to export the schema and labelled dataset, import them into the environment or machine you need, and run a train/eval pipeline there.