UiPath Document OCR engine recognizes PDF text with words in the wrong order

Nguyen_Ky · November 27, 2021, 3:55am

Hi all,
I have used the UiPath Document OCR engine in the Read PDF With OCR activity since May 2021. The resulting text was very good, but since October 2021 the words in the result have been coming out in the wrong order.
For example, if the PDF contains “That is a good idea”, the output is “That good is a idea”. Everything is correct except the word order.
I have attached the PDF file. The first few lines of the result are:
“Prasmatic Bookshelf of Many the designations used manufacturers by and sellers to distinguish their prod- ucts claimed are trademarks. as Where those designations in this appear book, and The Pragmatic Programmers, LLC was aware of trademark a claim, the designations have been printed in initial letters capital all in or capitals…”
You can see that in many places the words are in the wrong order.
I chose the UiPath Document OCR engine because it is nearly 100% accurate on my documents. I tested other engines such as UiPath Screen OCR, Microsoft OCR, Google OCR, Tesseract, ABBYY, and OmniPage many times on many of my documents, and the results were not as good as with UiPath Document OCR.
Has anyone else run into this problem? Or can you recommend another engine? Thank you.
Test.pdf (80.8 KB)
Test.xaml (11.6 KB)

fernando_zuluaga · November 27, 2021, 4:35am

Hi,

I tried your sequence and it worked for me.
Test.xaml (11.8 KB)

I don't know what's going on.

Nguyen_Ky · November 27, 2021, 10:02am

Thank you for trying my case. I am using Community Edition, and I’m in Vietnam. Does this only happen with Community Edition?

Nguyen_Ky · November 27, 2021, 10:06am

I really need to fix this problem. I don’t think it is a bug, because fernando_zuluaga does not see it. What other settings or configuration should I try? Thank you.

Nguyen_Ky · November 28, 2021, 1:39pm

Sorry, can someone from UiPath give me an official answer on this case?

dokumentor · November 28, 2021, 2:34pm

I see this is not a scanned PDF. Have you tried the Read PDF Text activity instead of Read PDF With OCR?

Nguyen_Ky · November 28, 2021, 2:50pm

No, this is just a sample file. Many of my documents are scanned and need OCR, so I cannot use Read PDF Text. The problem is that the results used to be very good until recently (October). I have tried different machines to see what is going on, but I really can’t understand this strange error. I don’t have a clue.

fernando_zuluaga · December 1, 2021, 4:34pm

Hi, I used Community Edition too and had no problems. Have you tried your workflow on another machine?

Regards!

Gabriel_Wisniewski · December 1, 2021, 11:16pm

Try changing the “preserve format” parameters.

Nguyen_Ky · December 2, 2021, 12:50am

Hi @Gabriel_Wisniewski, can you please explain more about “preserve format” in the UiPath Document OCR engine? I am not familiar with it. Thank you.

Nguyen_Ky · December 2, 2021, 12:53am

Yes, I’ve tried the other engines (Google, Tesseract…). They get the word order right, but their accuracy is lower than that of the UiPath Document OCR engine, so I can’t use them.

mce · March 23, 2022, 1:56pm

Have you received an answer on this topic since December 2021? I have the same problem.

Thanks in advance for your reply.

mce · April 6, 2022, 1:48pm

Does anybody have a solution for this kind of issue?

Nguyen_Ky · April 6, 2022, 2:27pm

Sorry, there’s no answer yet. I’m very disappointed about this error. There seems to be no logic to it, so I don’t know where to start.

mce · April 13, 2022, 7:50am

It works for me now. Try using the Digitize Document activity together with UiPath Document OCR, and in the properties of the UiPath Document OCR engine set the variable Localserver in the “UseLocalServer” field.

Hope this will solve your issue.

Rgds,
Maic

Nguyen_Ky · April 13, 2022, 9:10am

Thanks mce, I’ll try.

Josh_James · January 12, 2024, 2:30pm

Hello @mce - I am also having this issue. Can you elaborate a little more on your solution please? I do not see “Digitize Document”, and if I set True for Localserver, I get an error saying I need the localserver package installed, which I did do. Any help would be appreciated, thanks!

Update: I found the “Digitize Document” activity. I had to get the IntelligentOCR official package. I thought it would have been in the DU one, but it was not. LocalServer threw an error when set to True, but it worked fine when I removed it altogether. The lines are out of order, which is fine. The words aren’t jumbled anymore, which is great! I appreciate the suggestion of using “Digitize Document”. That seems to make a huge difference, thanks!
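
For anyone left with correct words but lines or words still out of sequence: if the OCR result gives a position for each word, the reading order can be rebuilt by sorting on coordinates. Below is a minimal Python sketch of that idea. It is not UiPath code and does not use any UiPath API; the `WordBox` class, its fields, and the `line_tolerance` parameter are hypothetical names made up for illustration, and the sample coordinates are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    """Hypothetical OCR word: text plus top-left corner and height (pixels)."""
    text: str
    x: float
    y: float
    height: float


def reading_order(words: List[WordBox], line_tolerance: float = 0.5) -> str:
    """Group words into lines by vertical position, then sort each line left to right."""
    lines: List[List[WordBox]] = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line when its top edge is close to the
        # top edge of that line's first word (relative to the word height).
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance * lines[-1][0].height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words by their horizontal position.
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines
    )


# Invented sample: the words of one line, delivered in a jumbled order.
jumbled = [
    WordBox("good", 120, 10, 12),
    WordBox("That", 0, 10, 12),
    WordBox("idea", 170, 11, 12),
    WordBox("is", 50, 9, 12),
    WordBox("a", 90, 10, 12),
]
print(reading_order(jumbled))
```

This only handles simple single-column pages. Multi-column layouts or skewed scans would need more than a sort on y and x.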
