Not able to Read the PDF data

Shanmukh_P · October 9, 2021, 8:49am

Hi Guys,

I am trying to read a PDF from which I need to extract data such as Voter ID, Name, Age, Address, Guardian Type, Guardian Name, and Gender from multiple boxes.

When I tried reading it with an OCR engine, I did not get good results.

Any suggestions are highly appreciated.

Thanks in advance.

PDF3.pdf (104.5 KB)

rahulsharma · October 9, 2021, 8:55am

Did you try Document Understanding?

Give it a try. Considering the language, I believe it’ll be better to use structured extraction using DOM rather than regex.

Srini84 · October 9, 2021, 9:27am

@Shanmukh_P

This document is in Telugu, a native language, so you need to find an OCR engine that is capable of extracting Telugu text, specifically for fields such as the name.

For voter IDs specifically, you can look into Document Understanding, ABBYY FlexiCapture, and Azure Form Recognizer.

Hope this helps.

Thanks

Shanmukh_P · October 9, 2021, 9:33am

I tried using Document Understanding, but I am still not getting good results.

Shanmukh_P · October 9, 2021, 9:34am

Okay, I will look for an OCR engine that is compatible with the Telugu language.

rahulsharma · October 9, 2021, 11:26am

Can you share the workflow? I have done a project for a less popular language using Document Understanding with the Form Extractor, and the outcome was quite decent.

The Form Extractor focuses on extracting values using the form layout and the positions specified in training. It should work here, since the language itself is challenging to decode.

In my experience, if you go looking for other OCR engines, Google OCR and Tesseract will give you some output, but with very poor accuracy.

Shanmukh_P · October 17, 2021, 6:32am

The Form Extractor takes the output of Digitize Document as its input, but the output of Digitize Document is not good, so how is the Form Extractor helpful?

Could you elaborate a bit on it?

rahulsharma · October 17, 2021, 6:40am

This is related to the limitations of OCR extraction and to how clear the input file is. You can try Omni OCR, which should capture this well. I tried it for Arabic and Mandarin, and it worked well.

As I mentioned, the Form Extractor extracts data according to predefined elements. You can give it a try to get those values into some container.

If you can confirm that the structure will remain exactly the same for this document, then you may use regex and split the text on the newline character.
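To illustrate the regex-and-split idea, here is a minimal Python sketch rather than a UiPath workflow. Everything in it is an assumption for illustration: the sample text, the English labels (the real document has Telugu labels), the line order, the voter ID pattern of three letters followed by seven digits, and the function name `parse_voter_box`. It only shows the approach of splitting one box's text on newlines and matching each line with a fixed pattern.

```python
import re

# Hypothetical OCR text for a single voter box. The labels are English
# placeholders; the real document would carry Telugu labels instead.
SAMPLE_BOX = """ABC1234567
Name: Ravi Kumar
Father's Name: Suresh Kumar
House Number: 12-3/4
Age: 34 Gender: Male"""

# Assumed patterns; adjust them to the labels the OCR engine actually returns.
VOTER_ID_RE = re.compile(r"^[A-Z]{3}\d{7}$")
GUARDIAN_RE = re.compile(r"^(Father|Husband|Mother|Other)'s Name\s*:\s*(.+)$")
AGE_GENDER_RE = re.compile(r"^Age\s*:\s*(\d+)\s+Gender\s*:\s*(\w+)$")


def parse_voter_box(text):
    """Split one box's text on newlines and match each line to a field."""
    record = {}
    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue
        guardian = GUARDIAN_RE.match(line)
        age_gender = AGE_GENDER_RE.match(line)
        if VOTER_ID_RE.match(line):
            record["voter_id"] = line
        elif guardian:
            record["guardian_type"] = guardian.group(1)
            record["guardian_name"] = guardian.group(2).strip()
        elif line.startswith("Name:"):
            record["name"] = line.split(":", 1)[1].strip()
        elif line.startswith("House Number:"):
            record["address"] = line.split(":", 1)[1].strip()
        elif age_gender:
            record["age"] = int(age_gender.group(1))
            record["gender"] = age_gender.group(2)
    return record


if __name__ == "__main__":
    print(parse_voter_box(SAMPLE_BOX))
```

This only holds up if every box keeps the same line layout. If the OCR output reorders or merges lines, the patterns will silently miss fields, so it is worth checking that each record contains all the expected keys.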

The ML Extractor will require some serious training and will definitely have low accuracy (in my experience with foreign-language documents), so go for it only if you don’t get anything from the two extractors above.

You can share the XAML if possible.

Shanmukh_P · October 17, 2021, 6:43am

I will try it out first. If I am still not getting results, I will share the workflow XAML.
