Not able to Read the PDF data

Hi Guys,

I am trying to read a PDF from which I need to extract data such as Voter ID, Name, Age, Address, Guardian Type, Guardian Name, and Gender from multiple boxes.

When I tried reading it with an OCR engine, I did not get good results.

Any suggestions are highly appreciated.

Thanks in advance.

PDF3.pdf (104.5 KB)

Did you try Document Understanding?

Give it a try. Considering the language, I believe it will be better to have structured extraction using DOM rather than regex.


This is Telugu, a native language. You need to find an OCR engine that is capable of extracting Telugu text, specifically for names and similar fields.

For voter IDs specifically, you can look into Document Understanding, ABBYY FlexiCapture, or Azure Form Recognizer.

Hope this helps.


I tried using Document Understanding but am still not getting good results.

Okay, I will look for an OCR engine that is compatible with the Telugu language.

Can you share the workflow? I did a project for a less popular language using Document Understanding with the Form Extractor, and the outcome was quite decent.

The Form Extractor focuses on extracting by the form layout and the positions specified in training. It will definitely work, as the language is challenging to decode.
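To illustrate the idea of extracting by position rather than by reading labels, here is a minimal Python sketch. It is not how the Form Extractor is implemented; the `Word` type, the `TEMPLATE` regions, and the coordinates are all hypothetical and only show the general technique of assigning OCR words to fields by where they sit in a voter box.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with the centre of its bounding box, normalised to 0..1."""
    text: str
    x: float
    y: float


# Hypothetical template: field name -> (left, top, right, bottom) inside one box.
# In a real project these regions would come from the template you define.
TEMPLATE = {
    "voter_id": (0.60, 0.00, 1.00, 0.20),
    "name": (0.00, 0.20, 1.00, 0.40),
}


def extract_by_position(words, template):
    """Assign each word to the field whose region contains its centre."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [w for w in words if left <= w.x < right and top <= w.y < bottom]
        # Read top-to-bottom, then left-to-right (assumes words on the same
        # line share the same y value, which real OCR output may not).
        inside.sort(key=lambda w: (w.y, w.x))
        fields[name] = " ".join(w.text for w in inside)
    return fields


words = [
    Word("ABC1234567", 0.80, 0.10),
    Word("Kumar", 0.50, 0.30),
    Word("Ravi", 0.20, 0.30),
]
print(extract_by_position(words, TEMPLATE))
```

The labels never need to be recognised for this to work, but the extracted values themselves are still only as good as the OCR output.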

In my experience, if you go looking for OCR engines, Google OCR and Tesseract will give you some output, but with very poor accuracy.

The Form Extractor takes the output of Digitize Document as its input, but the output of Digitize Document is not good, so how is the Form Extractor helpful?

Could you elaborate a bit on that?

This comes down to the limitations of OCR extraction and how clear the input file is. You can try Omni OCR, which should capture this well. I tried it for Arabic and Mandarin, and it worked well.

As I mentioned, the Form Extractor extracts data according to predefined elements. You can give it a try to get those values into some container.

If you can confirm that the structure will remain exactly the same for this document, then you may use regex and split on new lines.
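As a rough idea of what the regex-and-split approach could look like, here is a small Python sketch. The sample text, the English labels, and the voter ID pattern are assumptions for illustration only; the real document is in Telugu, so the patterns would have to be adapted to whatever the OCR actually returns.

```python
import re

# Hypothetical OCR text for one voter box; real labels would be in Telugu.
SAMPLE_BOX = """ABC1234567
Name: Ravi Kumar
Father's Name: Suresh Kumar
House Number: 12-3/4
Age: 34 Gender: Male"""

# Assumed line patterns; adjust them to the actual OCR output.
VOTER_ID = re.compile(r"^[A-Z]{3}\d{7}$")
GUARDIAN = re.compile(r"^(Father|Husband|Mother|Other)'s Name\s*:\s*(.+)$")
AGE_GENDER = re.compile(r"^Age\s*:\s*(\d+)\s+Gender\s*:\s*(\w+)$")


def parse_voter_box(text):
    """Split one box on new lines and match each line against the patterns."""
    record = {}
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        guardian = GUARDIAN.match(line)
        age_gender = AGE_GENDER.match(line)
        if VOTER_ID.match(line):
            record["voter_id"] = line
        elif guardian:
            record["guardian_type"] = guardian.group(1)
            record["guardian_name"] = guardian.group(2).strip()
        elif age_gender:
            record["age"] = int(age_gender.group(1))
            record["gender"] = age_gender.group(2)
        elif line.startswith("Name:"):
            record["name"] = line.split(":", 1)[1].strip()
        elif line.startswith("House Number:"):
            record["address"] = line.split(":", 1)[1].strip()
    return record


print(parse_voter_box(SAMPLE_BOX))
```

This only holds up if every box keeps the same line layout; a single missing or merged line in the OCR output will leave fields out of the record.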

The ML Extractor will require some serious training and will definitely have low accuracy (in my experience with foreign-language documents), so go for it only if you don't get anything from the two extractors above.

You can share the XAML if possible.

I will try it out first. If I am still not getting results, I will share the workflow XAML.
