When training UiPath machine learning extraction models, I understand that by labeling data, we help train the model where on the document this value can be found.
But if, during data labeling, we correct the actual value that is being extracted, does this help the model read values that are poor in quality or handwritten more accurately? Or will documents with low resolution or warped text always extract poorly?
The OCR of a document is not affected by labelling.
If you have handwritten documents, there are perhaps other OCR methods that can be used to extract the text more accurately. UiPath doesn't have the strongest offering on handwritten documents, and some other vendors produce better extractions of that raw text. Document quality itself, though, is a tricky one to fix.
The thing to remember is that it's the OCR engine, not the machine learning model, that is the weak part, and you only train the model.
It is the toughest part about utilizing the machine learning extractor. Our stakeholders always want us to build automations over variable documents that contain low-quality or handwritten text, which is just not easy to work with in UiPath.
Yeah, it's important to explain the limitations there, as poor text quality cannot be improved by more training.
You need to investigate the document quality early on to set expectations of what's possible.
It's perhaps worth throwing a doc at a GPT-style model and seeing how it goes. I'm not sure how they work under the hood, and I suspect they use similar OCR engines so you'd get a similar result, but it's worth benchmarking.
It's what we do early on in projects: benchmark a few options.
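A minimal sketch of what that benchmarking can look like: score each engine's raw text against a hand-transcribed ground truth for a few sample pages. The engine names and outputs below are hypothetical placeholders, and the scoring here is a quick similarity-based proxy for character error rate rather than a true edit-distance CER.

```python
import difflib

def char_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Quick CER proxy: 1 minus difflib's similarity ratio.
    A proper CER uses Levenshtein edit distance; this is a
    rough stand-in good enough for ranking engines."""
    if not ground_truth:
        return 0.0 if not ocr_output else 1.0
    return 1.0 - difflib.SequenceMatcher(None, ground_truth, ocr_output).ratio()

# Hand-transcribed ground truth for one sample page.
ground_truth = "Invoice No: 4821 Total: $1,250.00"

# Hypothetical raw-text outputs from candidate OCR engines.
engine_outputs = {
    "engine_a": "Invoice No: 4821 Total: $1,250.00",
    "engine_b": "lnvoice N0: 482l Total: $l,25O.OO",
}

for name, text in engine_outputs.items():
    print(f"{name}: approx CER = {char_error_rate(ground_truth, text):.2f}")
```

Running this over a representative sample of the real documents (especially the worst-quality ones) gives stakeholders a concrete number early, before anyone commits to a particular extractor.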