How to extract data from PDF files in a dynamic way with OCR

Adam_Biro · October 14, 2022, 9:49am

Hi there,
I have scanned documents whose contents are more or less the same, but because of the scanning, the positions of the fields always differ slightly. (I have tried to improve the PDF files as much as I could…)
I am trying to get the needed data from the PDFs with OCR (GetOCRText with the Tesseract OCR or OmniPage OCR engines), but the results differ from file to file.
Could you recommend a more dynamic way of extracting the data that does not depend on these position changes?
Thank you very much!

Sudharsan_Ka · October 14, 2022, 10:28am

Hi @Adam_Biro

Have you tried using regex?

Regards
Sudharsan
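
To illustrate the regex idea: the sketch below anchors each pattern on a field label instead of a position, so a value can still be found when the scan shifts the text around. It is written in Python rather than as a UiPath workflow, and the sample text, the labels ("Invoice No", "Date", "Total amount") and the helper name `extract_fields` are all hypothetical, since the thread does not show the actual documents. Treat it as a starting point under those assumptions, not a finished solution.

```python
import re

# Hypothetical OCR output; labels and values are invented for illustration.
ocr_text = """
  ACME Ltd.      Invoice  No :  INV-00123
Date:14.10.2022
     Total   amount :   1 234,50 EUR
"""

# Hypothetical field patterns. Each one is anchored on a label, not a position.
# \s* between tokens tolerates the irregular spacing that scanning introduces.
FIELD_PATTERNS = {
    "invoice_no": r"Invoice\s*No\s*[:.]?\s*([A-Z]+-\d+)",
    "date": r"Date\s*[:.]?\s*(\d{2}\.\d{2}\.\d{4})",
    "total": r"Total\s*amount\s*[:.]?\s*([\d ]+,\d{2})",
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None if not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1).strip() if match else None
    return result


fields = extract_fields(ocr_text)
print(fields)
```

A field that cannot be matched comes back as `None`, which makes it easy to flag a document for manual review instead of silently using a wrong value.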

Adam_Biro · October 14, 2022, 5:47pm

Hi Sudharsan,
Yes, I wanted to try it, but the “Read PDF Text” activity returned an empty string.
Regards
Adam

Adam_Biro · October 14, 2022, 6:12pm

And with “Read PDF with OCR”, the result contains a lot of extra characters and is not accurate…

supermanPunch · October 15, 2022, 10:42am

Hi @Adam_Biro ,

When dealing with scanned documents, we might not always get the data in the right form, or it might not be 100% accurate, so we cannot rely on strict methods of locating the necessary field values, whether because of the position changes or because of the OCR inaccuracy. However, if the scanned documents are of better quality, the accuracy should come close to 100%, which should be good enough.

So we would suggest trying different OCR engines, especially UiPath Document OCR, and maybe also trying the Document Understanding approach by UiPath.

Let us know if you were able to get your desired outcome and also post the method that you have used.

Adam_Biro · October 15, 2022, 3:52pm

Hi,
Yes, this is exactly the case you described…
I have tried all the OCR engines. The UiPath Document OCR result was very bad; for me, Tesseract and OmniPage are the best, and the only problem is the position change.
I also wanted to try Document Understanding, but the endpoint given on the UiPath site gave me no usable method, so unfortunately it did not work. I am currently using the trial version of UiPath; maybe that is the reason?
I tried the regex approach as well, but the PDF is too complex to get the needed output.
So OCR still seems to be the best way; it just needs to be more dynamic somehow…
I don’t know what the best way to adjust the PDF would be…
Thanks,
Adam
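
One way to make label-based extraction tolerate both position changes and OCR misreads is to match labels approximately instead of exactly. The sketch below uses Python's standard `difflib` module for this. It is not a UiPath activity, and the function name `find_value_after_label`, the `min_ratio` threshold of 0.75 and the sample text are assumptions made for illustration. It also assumes that each value sits on the same line as its label, which may not hold for every layout.

```python
import difflib


def find_value_after_label(ocr_text, label, min_ratio=0.75):
    """Return the text that follows the best approximate match of `label`.

    Every window of consecutive words (as many words as the label has) is
    compared with the label, so the label may appear anywhere on any line.
    Returns None if no window reaches `min_ratio`.
    """
    n_words = len(label.split())
    target = label.lower()
    best_ratio, best_value = 0.0, None
    for line in ocr_text.splitlines():
        words = line.split()
        for i in range(len(words) - n_words + 1):
            candidate = " ".join(words[i:i + n_words])
            # Drop trailing separators such as ":" before comparing.
            cleaned = candidate.rstrip(":.").lower()
            ratio = difflib.SequenceMatcher(None, cleaned, target).ratio()
            if ratio >= min_ratio and ratio > best_ratio:
                # Everything after the label on the same line is the value.
                value = " ".join(words[i + n_words:]).lstrip(":. ")
                best_ratio, best_value = ratio, value
    return best_value


# Hypothetical noisy OCR output: "lnvoice Nurnber" and "Tota1" are misreads.
noisy_text = """Some header
   lnvoice Nurnber: INV-00123
Tota1 : 99,00
"""

print(find_value_after_label(noisy_text, "Invoice Number"))
print(find_value_after_label(noisy_text, "Total"))
```

The threshold is a trade-off: a lower value tolerates worse scans but raises the risk of matching the wrong label, so it would need tuning against real documents.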
