Data Extraction From Scanned PDFs

mmcruzRPA · October 29, 2020, 12:43pm

Hi!
I’m here to ask for an opinion.
I’m automating a process that handles scanned PDFs of different types. Each type always has the same structure, and classification is working well. I’m not using Document Understanding because, as I saw in the documentation, it’s only available for licenses with Orchestrator/Automation Cloud, and in this case there is just UiPath Studio Enterprise.

I’m extracting data with OmniPage OCR because, of all the engines, it’s the one getting the best results… but sometimes the extracted values come out wrong, with strange characters. How can I get something like the extraction confidence we have in Document Understanding? I’m thinking about regexes too, but I would like to know your opinion about this case and what you would do!

Thanks!!! :))
@Palaniyappan

Palaniyappan · October 29, 2020, 1:13pm

Hi

Glad talking to you

Yeah, with an on-premise Orchestrator and license it’s only possible to go for Regex, because the other options need an API key.
And OmniPage OCR is fine and good to go, but considering your case we need to go for multiple tries with different OCR engines and scale levels.

Because I have faced one such situation where the PDF was a non-digital one.
So I was using OCR engines and tried to get an optimal solution with different scale levels, and I also included Regex expressions to fetch the exact data.
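
As a rough illustration of how Regex checks can stand in for an extraction confidence, here is a minimal Python sketch. It is not a UiPath API: the field names, the patterns, and the equal weighting of the two signals are all assumptions made for the example, and the same idea could be rebuilt with Matches/Is Match activities in Studio.

```python
import re

# Hypothetical field patterns; adjust them to the real document types.
FIELD_PATTERNS = {
    "invoice_date": re.compile(r"\b(\d{2}/\d{2}/\d{4})\b"),
    "total": re.compile(r"\b(\d{1,3}(?:\.\d{3})*,\d{2})\b"),
}

# Characters we would expect in clean Portuguese text (an assumption).
EXPECTED_CHARS = re.compile(r"[0-9A-Za-zÀ-ÿ\s.,:;/€%()\-]")


def char_confidence(text):
    """Share of characters that belong to the expected set (0.0 to 1.0)."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text if EXPECTED_CHARS.match(ch))
    return ok / len(text)


def extract_fields(ocr_text):
    """Return ({field: value or None}, rough confidence between 0 and 1)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(ocr_text)
        fields[name] = m.group(1) if m else None
    matched = sum(v is not None for v in fields.values()) / len(fields)
    # Average the two signals; the equal weighting is arbitrary.
    confidence = (matched + char_confidence(ocr_text)) / 2
    return fields, confidence
```

A page whose score falls under some threshold you choose could then be sent for another OCR attempt or for manual review.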

So I would suggest going for multiple tries with other engines like Tesseract, Google, and Microsoft, and at different scale levels as well.

And why not, if we can procure a license, ABBYY is good.

Cheers @mmcruzRPA

TastyToast · October 29, 2020, 1:15pm

Using Google Tesseract + a language pack would be my suggestion.

You can find the installation and language training data link here:

and here in more detail :

mmcruzRPA · October 29, 2020, 2:05pm

Right!
Thanks for your feedback, I will do it that way :))
Basically, try different scales/engines until the regex matches a valid value. Makes sense :))
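
A minimal Python sketch of that retry loop, under stated assumptions: the engine callables below are stubs standing in for whatever OCR activity or library is really used, and the 9-digit pattern is only a hypothetical field format.

```python
import re
from itertools import product


def first_valid_read(image, engines, scales, pattern):
    """Try every engine/scale pair; return the first value the regex accepts.

    `engines` maps a name to a callable (image, scale) -> text. The callables
    are placeholders, not real OCR bindings.
    """
    for (name, run_ocr), scale in product(engines.items(), scales):
        text = run_ocr(image, scale)
        match = pattern.search(text)
        if match:
            return {"engine": name, "scale": scale, "value": match.group(0)}
    return None  # nothing validated: route the document to manual review


# Stub engines that only simulate OCR output for the demo.
def stub_tesseract(image, scale):
    return "N1F: ???"


def stub_omnipage(image, scale):
    return "NIF: 12345678g" if scale < 2 else "NIF: 123456789"


engines = {"tesseract": stub_tesseract, "omnipage": stub_omnipage}
nine_digits = re.compile(r"\b\d{9}\b")  # hypothetical 9-digit field
result = first_valid_read(None, engines, [1, 2, 3], nine_digits)
print(result)
```

Returning the engine and scale along with the value makes it easy to log which combination worked for each document type.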

Cheers @Palaniyappan

mmcruzRPA · October 29, 2020, 2:06pm

Already did that, and in this case OmniPage with the language set to Portuguese performs better than Tesseract with the Portuguese language installed!
Thanks for your help anyway :))

TastyToast · October 29, 2020, 2:38pm

Interesting. I always had 100% correct extraction from machine-written documents with Tesseract and the English eng.traineddata installed.

But my documents were pretty straightforward, PC-generated, with no occlusion or anything.

MS OCR seems to be the best in the field, with 5000 pages/month free:

mmcruzRPA · October 30, 2020, 9:31am

Wow, nice image, very helpful with nice info! Thank you very much!!! :))
Yeah, I like the MSFT OCR, it works well!
I like the UiPath Document OCR also, but in this case I can’t use it because the customer has just UiPath Studio Enterprise.

