How to extract data from PDF files in a dynamic way with OCR

Hi there,
I have scanned documents whose contents are more or less the same, but because of the scanning the positions are always slightly different. (I have tried to improve the PDF files as much as I could…)
I am trying to get the needed data from the PDFs with OCR (GetOCRText with the Tesseract OCR or OmniPage OCR engines), but the results are inconsistent.
Could you recommend a more dynamic way of extracting the data that does not depend on the position changes?
Thank you very much!

Hi @Adam_Biro

Have you tried using regex?

Regards
Sudharsan
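
To make the regex suggestion a bit more concrete, here is a minimal sketch in Python of label-anchored extraction: each field is located by the label text next to it rather than by its position on the page. The sample OCR text, the field names and the patterns are all invented for illustration, so they would need to be adapted to the real documents.

```python
import re

# Hypothetical OCR output; labels and values are invented for illustration.
ocr_text = """
ACME Ltd.     Invoice  No :  INV-20231
   Date: 12.03.2021
Total   amount :   1 234,50 EUR
"""

# One pattern per field, anchored on the label instead of a page position.
# \s* and \s+ absorb the irregular spacing that scanning tends to introduce.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\s*[:.]?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:.]?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})", re.IGNORECASE),
    "total": re.compile(r"Total\s+amount\s*[:.]?\s*([\d ]+[.,]\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None if the label is not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


fields = extract_fields(ocr_text)
print(fields)
```

Because the patterns search the whole text, a field that shifts up, down or sideways between scans is still found as long as its label is recognised.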

Hi Sudharsan,
Yes, I wanted to try it, but the “Read PDF Text” activity returned an empty string.
Regards
Adam

And with “Read PDF with OCR” there are a lot of extra characters and the result is not accurate…

Hi @Adam_Biro ,

When dealing with scanned documents, we might not always get the data correctly, or it might not be 100% accurate, so we cannot rely on strict methods of finding or locating the required field values, whether because of the position changes or because of the OCR inaccuracy. However, if the scanned documents are of better quality, the accuracy should be close to 100%, which should be good enough.

So we would suggest trying a different OCR engine, especially UiPath Document OCR, and maybe also trying UiPath's Document Understanding approach.

Let us know if you were able to get your desired outcome, and please also post the method you used.
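
As an illustration of a less strict way of locating field values, here is a minimal Python sketch that finds a label by fuzzy matching, so that OCR misreads such as "lnvoice Nurnber" still count as "invoice number". It only uses the standard library `difflib` module; the function name `find_value_after_label`, the 0.75 cutoff and the sample text are assumptions made for this example, not something taken from UiPath.

```python
import difflib


def find_value_after_label(ocr_text, label, cutoff=0.75):
    """Return the text that follows the best fuzzy match for `label`, or None."""
    target = label.lower()
    n = len(target.split())  # number of words in the label
    best_ratio, best_value = 0.0, None
    for line in ocr_text.splitlines():
        words = line.split()
        # Slide a window of n words over the line and compare it to the label.
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n]).lower().rstrip(":.")
            ratio = difflib.SequenceMatcher(None, candidate, target).ratio()
            if ratio >= cutoff and ratio > best_ratio:
                best_ratio = ratio
                # Everything after the label on the same line is taken as the value.
                best_value = " ".join(words[i + n:]).lstrip(":. ") or None
    return best_value


# Hypothetical noisy OCR output, invented for illustration.
noisy_text = """
lnvoice Nurnber: INV-20231
Tota1 arnount : 1 234,50
"""

invoice_no = find_value_after_label(noisy_text, "invoice number")
total = find_value_after_label(noisy_text, "total amount")
print(invoice_no, total)
```

The cutoff is a trade-off: a lower value tolerates more OCR errors but raises the risk of matching the wrong label, so it is worth tuning against a handful of real scans.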

Hi,
Yes, this is exactly the case you were describing…
I have tried all the OCR engines. The UiPath Document OCR result was very bad; for me Tesseract and OmniPage are the best, the only problem is the position change.
I also wanted to try Document Understanding, but the endpoint given on the UiPath site gave me no usable method, so unfortunately it did not work. I am currently using the trial version of UiPath; maybe that is the reason?
I tried the regex approach as well, but the PDF is too complex to get the needed output.
So the OCR approach still seems to be the best; it just needs to be more dynamic somehow…
I don’t know what the best way to adjust the PDF would be…
Thanks,
Adam