Sorry for the slow response; I just saw your post. Here are my answers.
→ How many layouts/batches can a single Document Understanding Model accommodate to reliably extract data?
As far as I know, there is no fixed limit on the number of layouts or batches a single model can handle. We have had several use cases with more than 200 vendors sending invoices, which means more than 200 layouts. Our approach in those cases is:
- Identify the most frequent layouts
- Collect sample documents for each layout (making sure we have enough samples per layout, starting with 20)
- Train the model
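The selection steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of Document Understanding: the function name plan_training_set, the top_n cut-off, and the assumption that each collected document already carries a layout/vendor label are all mine.

```python
from collections import Counter

MIN_SAMPLES = 20  # starting point mentioned above; adjust per use case


def plan_training_set(doc_layouts, top_n=10, min_samples=MIN_SAMPLES):
    """Rank layouts by frequency and report how many samples are still missing.

    doc_layouts: one layout/vendor label per collected document (assumed input).
    """
    counts = Counter(doc_layouts)
    plan = []
    # most_common() returns the layouts ordered from most to least frequent
    for layout, samples in counts.most_common(top_n):
        plan.append({
            "layout": layout,
            "samples": samples,
            "missing": max(0, min_samples - samples),
        })
    return plan
```

Calling plan_training_set on the list of labels shows which of the most frequent layouts still need more samples before training.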
Since you mentioned the extraction results are not so good in your case, I highly recommend evaluating the model to identify where its training falls short. This requires running an evaluation pipeline.
→ Are there any suitable pre-processing steps we can adopt to improve the OCR performance?
Yes, there are different methods we can use depending on the scenario. One of the pre-processing steps we apply is converting the document to grayscale and increasing the contrast where required. This can make the dark text stand out more clearly for the OCR engine. In addition, try to set some standards for how people send documents to you. This is part of process improvement and standardization for better accuracy.
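To make the grayscale and contrast idea concrete, here is a minimal pure-Python sketch. It assumes the image is already available as rows of (r, g, b) tuples, uses the common luma weights 0.299/0.587/0.114 for the grayscale conversion, and uses a simple linear stretch as the contrast step. The function names are hypothetical; in a real workflow you would normally let an imaging library or activity do this.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples into rows of 0-255 gray values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def stretch_contrast(gray_rows):
    """Linearly rescale gray values so the darkest becomes 0 and the brightest 255."""
    flat = [value for row in gray_rows for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged
        return [row[:] for row in gray_rows]
    return [
        [round((value - lo) * 255 / (hi - lo)) for value in row]
        for row in gray_rows
    ]
```

Running stretch_contrast(to_grayscale(pixels)) on a washed-out scan spreads the gray values over the full range, so faint text ends up darker relative to the background.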
→ Can Unstructured Documents or multiple documents on single page be handled with NER & Document Understanding?
Yes. We can connect Document Understanding with NER. I have done this in another project for some unstructured documents. The Digitize Document activity gives you the Document Text, which becomes the input for the NER model. Depending on the scenario, you may need to do a little cleansing before submitting the text to NER, for example removing extra line breaks and special characters.
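A cleansing step like the one described could look roughly like this in Python. It is a sketch under my own assumptions: the function name clean_for_ner is hypothetical, and which characters count as "special" (here, anything other than letters, digits, whitespace and some common punctuation) depends on your documents.

```python
import re


def clean_for_ner(text):
    """Normalize digitized text before sending it to an NER model."""
    # Unify Windows/Mac line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace characters outside the allowed set with a space (assumed set)
    text = re.sub(r"[^\w\s.,:;/@%&()'\-]", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around line breaks
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse repeated line breaks into one
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()
```

If the NER model relies on blank lines to separate paragraphs, keep up to two consecutive line breaks instead of collapsing them all to one.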
It can be tricky when you have multiple documents on a single page. However, if they are embedded as images, we can use PDF activities to extract the images into a list and process them separately using the new Document Understanding activity pack. If a single file has multiple documents spread across different pages, you can try applying an Intelligent Keyword Classifier to split those documents. The output of this classification can then be used to process each identified document separately and extract the data.
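To illustrate the general idea of splitting a multi-page file by keywords, here is a simplified Python sketch. It is not how the Intelligent Keyword Classifier works internally; the function name, the keyword dictionary, and the rule "a page with a matching keyword starts a new document, a page without one continues the previous document" are assumptions made for the example.

```python
def split_by_keywords(pages, keywords_by_type):
    """Group page indexes into documents based on keywords found on each page.

    pages: list of page texts (for example, from digitization).
    keywords_by_type: dict mapping a document type to a list of keywords.
    Returns a list of (doc_type, [page_indexes]) tuples.
    """
    documents = []
    for index, text in enumerate(pages):
        lowered = text.lower()
        # First document type with any keyword present on the page, else None
        doc_type = next(
            (
                name
                for name, keywords in keywords_by_type.items()
                if any(keyword.lower() in lowered for keyword in keywords)
            ),
            None,
        )
        if doc_type is not None or not documents:
            # A keyword hit (or the very first page) starts a new document
            documents.append((doc_type or "unknown", [index]))
        else:
            # No keyword: treat the page as a continuation of the previous document
            documents[-1][1].append(index)
    return documents
```

Each resulting page group can then be extracted separately. Note that this naive rule splits a document whenever a keyword reappears on a continuation page, so real keyword sets need to be chosen with that in mind.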
I hope this helps. I'm sure you have a lot of questions, and maybe more based on my reply. Feel free to connect with me so I can help you.
Check out my Dev Dives Follow Up video as well: