UiPath Document OCR engine recognizes PDF text with words in the wrong order

Hi all,
I have been using the UiPath Document OCR engine in the Read PDF With OCR activity since May 2021. The resulting text was very good. But suddenly, from October 2021 until now, the resulting text has come out in the wrong order.
For example, if the PDF says “That is a good idea”, the output is “That good is a idea”. Everything is correct except the word order.
I have attached the PDF file. The first few lines of the result are:
“Prasmatic Bookshelf of Many the designations used manufacturers by and sellers to distinguish their prod- ucts claimed are trademarks. as Where those designations in this appear book, and The Pragmatic Programmers, LLC was aware of trademark a claim, the designations have been printed in initial letters capital all in or capitals…”
You can see that in many places the words are in the wrong order.
I chose the UiPath Document OCR engine because it is nearly 100% accurate on my documents. I have tested other engines such as UiPath Screen OCR, Microsoft OCR, Google OCR, Tesseract, ABBYY, and OmniPage many times on lots of my documents, and the results are not as good as with UiPath Document OCR.
Has anyone else run into this error? Or can you recommend another engine for me? Thank you.
Test.pdf (80.8 KB)
Test.xaml (11.6 KB)

hi

I tried your sequence and it worked for me.
Test.xaml (11.8 KB)

Dunno what's going on.

Thank you for trying my case. I am using Community Edition, and I’m in Vietnam. Does this only happen with Community Edition?

I really need to fix this problem. I don’t think it is a bug, because fernando_zuluaga does not see it. So what other settings or configuration should I try? Thank you.

Sorry, can someone from UiPath provide me an official answer for this case?

I see this is not a scanned PDF. Have you tried the Read PDF Text activity instead of Read PDF With OCR?

No, this is just a sample file. Many of my documents are scanned and need OCR, so I cannot use Read PDF Text. The problem is that it used to work very well until recently (October). I have tried different machines to see what is going on, but I really can’t understand this strange error. I don’t have a clue.

Hi, I used Community Edition and had no problems. Have you tried your workflow on another machine?

Regards!

Try changing the “preserve format” parameters.

Hi @Gabriel_Wisniewski, can you please explain more about “preserve format” in the UiPath Document OCR engine? I really don’t know about it. Thank you.

Yes, I’ve tried with the other engines (Google, Tesseract…). They are correct with the word order, but the accuracy is lower than UiPath Document OCR engine. So I can’t use them.

Have you received an answer on this topic since December 2021? I have the same problem.

Thanks in advance for your reply.

Does anybody have a solution for this kind of issue?

Sorry, there’s no answer yet. I’m very disappointed about this error. There seems to be no logic to it, so I don’t know where to start fixing it.

It works for me now. Try using the Digitize Document activity together with UiPath Document OCR, and in the properties of the UiPath Document OCR engine, set the variable Localserver in the “UseLocalServer” field.

Hope this will solve your issue.

Rgds,
Maic

Thanks mce, I’ll try.

Hello @mce - I am also having this issue. Can you elaborate a little more on your solution please? I do not see “Digitize Document”, and if I set True for Localserver, I get an error saying I need the localserver package installed, which I did do. Any help would be appreciated, thanks!

Update: I found the “Digitize Document” activity. I had to get the IntelligentOCR official package. I thought it would have been in the DU one, but it was not. LocalServer threw an error when set to True, but it worked fine when I removed it altogether. The lines are out of order, which is fine. The words aren’t jumbled anymore, which is great! I appreciate the suggestion of using “Digitize Document”. That seems to make a huge difference, thanks!
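
For anyone who still sees lines or words out of order and can get word positions (bounding boxes) out of their OCR step, here is a rough post-processing sketch in Python. It groups words into lines by vertical position and then sorts each line left to right. The `Word` class, its fields, and the `reading_order` function are hypothetical names made up for illustration and are not part of any UiPath API. It also assumes a simple single-column page, so treat it as a starting point rather than a fix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word with its position (hypothetical structure)."""
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    height: float  # bounding-box height, used as the line tolerance scale


def reading_order(words: List[Word], line_tolerance: float = 0.5) -> str:
    """Rebuild text in reading order for a single-column page.

    Words whose top edge lies within line_tolerance * height of the first
    word of the current line are treated as belonging to that line.
    """
    lines: List[List[Word]] = []
    # Visit words top to bottom so that lines are discovered in order.
    for word in sorted(words, key=lambda w: w.y):
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance * lines[-1][0].height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right.
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# The jumbled example from the first post, with made-up coordinates.
jumbled = [
    Word("That", 0, 10, 12),
    Word("good", 100, 10, 12),
    Word("is", 50, 10, 12),
    Word("a", 80, 10, 12),
    Word("idea", 150, 10, 12),
]
print(reading_order(jumbled))
```

The tolerance of half a word height is an arbitrary choice. Skewed scans or multi-column layouts would need something smarter.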