Incorrect Data Extraction or Partial Data Extraction from Document using AI Center OOB Model

Dear Forum Members,

I am working on a solution that needs to extract details from different documents. For that, I am using the out-of-the-box models named “Passport” and “Document Understanding”, and the model has been trained on enough documents. In a few cases the data is extracted completely, and in a few cases it is only partially extracted. For another document, the model extracts an incomplete number.

Below are the details for the cases of Partial Data Extraction and Incorrect Data Extraction:

  1. For partial data extraction, I checked the OCR output under Digitize Document. With UiPath Document OCR the data is present, but the other OCR engines I checked (OmniPage, Tesseract, Microsoft Computer Vision) do not return any data. The input of the ML extractor contains the data, but its output does not.
  2. For incorrect data extraction, I am trying to extract data from an Aadhaar card. From the back page the data is extracted correctly, but from the front page an incomplete Aadhaar number is extracted. Here the output of UiPath Document OCR does not contain the complete number. The other OCR engines, apart from Microsoft, contain the complete number but none of the other details. The Microsoft OCR output is blank.
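
One way to narrow down the second case is to check each OCR engine's raw text for a complete number before the extractor is involved. Below is a minimal Python sketch, meant to be run outside the workflow on text copied from each engine's output. It assumes the number is printed as 12 digits, optionally in groups of four, and that the last digit is a Verhoeff check digit, which is my understanding of the Aadhaar format. The function names and the engine labels are hypothetical.

```python
import re

# Verhoeff multiplication table (dihedral group D5).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Verhoeff position-dependent permutation table.
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

# Assumption: 12 digits, optionally grouped as 4-4-4 with a space or hyphen.
_CANDIDATE = re.compile(r"(?<!\d)\d{4}[ -]?\d{4}[ -]?\d{4}(?!\d)")


def verhoeff_valid(digits: str) -> bool:
    """Return True if the digit string passes the Verhoeff checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def complete_numbers(ocr_text: str) -> list[str]:
    """Return every 12-digit candidate in the text that passes the checksum."""
    found = []
    for match in _CANDIDATE.finditer(ocr_text):
        digits = re.sub(r"\D", "", match.group())
        if verhoeff_valid(digits):
            found.append(digits)
    return found


def compare_engines(outputs: dict[str, str]) -> dict[str, list[str]]:
    """Map each engine label to the complete numbers found in its text."""
    return {engine: complete_numbers(text) for engine, text in outputs.items()}
```

Calling `compare_engines({"UiPath Document OCR": text_a, "Tesseract": text_b})` with the pasted outputs returns an empty list for any engine whose text has no complete, checksum-valid number. If the list is empty for an engine, the problem is already present at the digitization step for that engine rather than in the ML extractor.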

I hope I have explained the problem clearly. Can you please suggest a solution? I have already retrained the model multiple times. Is there anything else I can do to get complete and correct data extraction?

Thanks,
Dimple

Hi @dimple.khurana

There is no way to enhance the OCR ML Package, as it is a non-retrainable package. But since you mentioned that UiPath Document OCR digitizes the document without any issues, that should not be a problem.

To enhance the ML model, you can try the following:

  • For training, make sure the dataset does not contain any noise, and use only good-quality samples. If you find any low-quality files, leave them out.
  • Try adding more files to your dataset, and make sure that you are labelling them without any redundancy.
  • Try creating an evaluation dataset and training the model with a Train and Evaluation pipeline (if you have not tried that yet).
  • For Aadhaar, try the ID Card ML Package and verify whether it yields better results.
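
As a rough illustration of the dataset points above, here is a minimal Python sketch that drops byte-identical duplicate files and then sets aside part of the remaining files for evaluation. The 20% evaluation share, the fixed seed, and the function names are assumptions for the example. Copying the files into the folder layout your pipelines expect is left out.

```python
import hashlib
import random
from pathlib import Path


def unique_files(folder: Path) -> list[Path]:
    """Return the files in a folder, keeping one file per distinct content."""
    seen = set()
    unique = []
    for path in sorted(p for p in folder.iterdir() if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique


def split_train_eval(files, eval_fraction=0.2, seed=0):
    """Shuffle reproducibly and return (train_files, eval_files)."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    if len(shuffled) > 1:
        # Keep at least one file for evaluation when there is more than one.
        n_eval = max(1, round(len(shuffled) * eval_fraction))
    else:
        n_eval = 0
    return shuffled[n_eval:], shuffled[:n_eval]
```

Note that hashing only catches exact copies of the same file. Two different scans of the same card will not be detected, and low-quality samples still need to be reviewed by hand.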

Hope this helps.

Regards.