Hi!
I’m here to ask for an opinion.
I’m automating a process that handles scanned PDFs of different types. Each type always has the same structure, and classification is working well. I’m not using Document Understanding because, as I saw in the documentation, it’s only available for licenses with Orchestrator/Automation Cloud, and in this case there is only UiPath Studio Enterprise.
I’m extracting data with OmniPage OCR because, of all the engines, it’s the one getting the best results. But sometimes the extracted values come out wrong, with strange characters, and I’m asking how I can get something like the extraction confidence we have in Document Understanding. I’m thinking about regexes too, but I would like to know your opinion about this case and what you would do!
Yeah, with an on-premises Orchestrator and license, it’s possible to go with regex only, because the other options need an API key.
And OmniPage OCR is fine and good to go, but considering your case we need to make multiple tries with different OCR engines and scale levels.
Because I have faced a similar situation where the PDF was a non-digital one.
So I used OCR engines and tried to find the optimal solution with different scale levels, and I also included regex expressions to fetch the exact data.
So I would suggest making multiple tries with other engines like Tesseract, Google, and Microsoft, at different scale levels as well.
And why not, if we can procure a license, ABBYY is good.
Right!
Thanks for your feedback, I will do it that way :))
Basically, try different scales/engines until the regex matches a valid value. Makes sense :))
Already did that, and OmniPage with the language set to Portuguese performs best compared to Tesseract with the Portuguese language installed, in this case!
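For anyone reading later, here is a minimal sketch of that retry loop in Python. It is not UiPath workflow code, only an illustration of the logic. Everything in it is an assumption for the example: the engines are hypothetical stand-in functions that take a file path and a scale and return the text of one field, and the pattern is just a sample amount format. A full regex match acts as a rough pass/fail substitute for a confidence score.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical signature: an engine takes (pdf_path, scale) and returns
# the raw text read for ONE field. Real engines would be wrapped to fit this.
OcrEngine = Callable[[str, float], str]

# Sample pattern (an assumption): amounts such as 123.456,78
AMOUNT_PATTERN = r"\d{1,3}(?:\.\d{3})*,\d{2}"


def extract_with_retries(
    pdf_path: str,
    engines: Iterable[Tuple[str, OcrEngine]],
    scales: Iterable[float],
    pattern: str,
) -> Optional[Tuple[str, str, float]]:
    """Return (value, engine_name, scale) for the first OCR result that
    fully matches the pattern, or None if no combination matches."""
    regex = re.compile(pattern)
    scale_list = list(scales)  # reused for every engine
    for name, engine in engines:
        for scale in scale_list:
            text = engine(pdf_path, scale).strip()
            # fullmatch rejects values with stray characters anywhere
            if regex.fullmatch(text):
                return text, name, scale
    return None


# Stand-in engines that simulate OCR noise (not real OCR calls).
def engine_a(pdf_path: str, scale: float) -> str:
    return "1Z3.456,78" if scale < 2 else "123.456,78"


def engine_b(pdf_path: str, scale: float) -> str:
    return "l23.4S6,78"


result = extract_with_retries(
    "invoice.pdf",
    [("engine_a", engine_a), ("engine_b", engine_b)],
    [1.0, 2.0],
    AMOUNT_PATTERN,
)
```

Here `engine_a` fails at scale 1.0 and passes at scale 2.0, so the loop stops there. A `None` result means no engine/scale combination produced a valid value, and the document can be sent for manual review.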
Thanks for your help anyway :))
Wow, nice image, very helpful with nice info! Thank you very much!!! :))
Yeah I like the MSFT OCR, works well!
I like the UiPath Document OCR too, but in this case I can’t use it because the customer has just UiPath Studio Enterprise.