Data Extraction From Scanned PDFs

Hi!
I’m here to ask for an opinion.
I’m automating a process that handles scanned PDFs of different types. Each type always has the same structure, and classification is working well. I’m not using Document Understanding because, as far as I saw in the documentation, it’s only available for licenses with Orchestrator/Automation Cloud, and in this case we only have UiPath Studio Enterprise.

I’m extracting data with OmniPage OCR because, of all the engines, it’s the one getting the best results. But sometimes the values are extracted badly, with strange characters, so I’m asking how I can get something like the extraction confidence we have in Document Understanding. I’m thinking about regexes too, but I’d like to know your opinion on this case and what you would do!

Thanks!!! :))
@Palaniyappan
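
To make the question concrete, here is a rough sketch of how a confidence-like score could be approximated with regexes alone, assuming the OCR output is available as plain text per field. The field names, the patterns, the set of "expected" characters, and the `score_field` helper are all hypothetical and only for illustration; real patterns depend on each document type.

```python
import re

# Hypothetical per-field patterns; real ones depend on the document type.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"[A-Z]{2}\d{6}"),
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "total": re.compile(r"\d{1,3}(?:\.\d{3})*,\d{2}"),  # e.g. 1.234,56
}

# Characters treated as plausible in an extracted value (an assumption).
EXPECTED_CHARS = re.compile(r"[0-9A-Za-zÀ-ÿ\s.,/:€-]")


def score_field(name: str, value: str) -> float:
    """Return a crude 0.0-1.0 score for one extracted value."""
    value = value.strip()
    if not value:
        return 0.0
    # A full pattern match is treated as the highest score.
    if FIELD_PATTERNS[name].fullmatch(value):
        return 1.0
    # Otherwise score by the share of expected characters, capped at 0.5.
    expected = sum(1 for ch in value if EXPECTED_CHARS.fullmatch(ch))
    return 0.5 * expected / len(value)
```

A value that scores below some chosen threshold could then be sent to a retry or to manual review. This is not an OCR confidence in the Document Understanding sense, only a plausibility check on the text that came out.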

Hi

Glad to be talking to you

Yeah, with an on-premise Orchestrator and license it’s only possible to go with regex, because the other options need an API key.
OmniPage OCR is fine and good to go, but in your case we need to make multiple tries with different OCR engines and scale levels.

I faced a similar situation where the PDF was a non-digital one.
So I used OCR engines and tried to find the optimal solution with different scale levels, and I also included regex expressions to fetch the exact data.

So I would suggest making multiple tries with other engines like Tesseract, Google, and Microsoft, at different scale levels as well.

And if we can procure a license, why not: ABBYY is good.

Cheers @mmcruzRPA

Using Google Tesseract plus a language pack would be my suggestion.

You can find the installation and language training data link here:

and here in more detail :

Right!
Thanks for your feedback, I will do it that way :))
Basically, try different scales/engines until the regex matches a valid value. Makes sense :))

Cheers @Palaniyappan
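
The "try different engines and scales until the regex matches" idea could be sketched roughly like this. It is a minimal sketch, not a UiPath workflow: each OCR engine is modelled as a plain callable taking `(image_path, scale)` and returning text, which is an assumption made for the example, and `extract_with_retries` is a hypothetical helper name.

```python
import re
from typing import Callable, Dict, Iterable, Optional, Pattern, Tuple

# An OCR engine is modelled as any callable taking (image_path, scale)
# and returning the recognised text. This signature is an assumption.
OcrEngine = Callable[[str, float], str]


def extract_with_retries(
    image_path: str,
    engines: Dict[str, OcrEngine],
    scales: Iterable[float],
    pattern: Pattern[str],
) -> Optional[Tuple[str, str, float]]:
    """Return (value, engine_name, scale) for the first valid match, else None."""
    scales = list(scales)  # allow reuse across engines
    for name, engine in engines.items():
        for scale in scales:
            text = engine(image_path, scale)
            match = pattern.search(text)
            if match:
                # Stop at the first engine/scale whose output passes the regex.
                return match.group(0), name, scale
    return None
```

Putting the engine that usually performs best first in `engines` keeps the common case cheap, and a `None` result is the signal to route the document to manual handling.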

Already did that, and in this case OmniPage with the language set to Portuguese performs better than Tesseract with the Portuguese language installed!
Thanks for your help anyway :))

Interesting. I always had 100% correct extraction from machine-written documents with Tesseract and the English eng.traineddata installed.

But my documents were pretty straightforward, PC-generated, with no occlusion or anything.

MS OCR seems to be the best in the field, with 5,000 pages/month free:

Wow, nice image, very helpful with nice info! Thank you very much!!! :))
Yeah, I like the MSFT OCR, it works well!
I also like the UiPath Document OCR, but in this case I can’t use it because the customer only has UiPath Studio Enterprise :stuck_out_tongue:
