Extract text from a digitally signed PDF document

Hi,
I’m using the Read PDF Text activity to read PDF documents that contain non-English text.
It works well, but when I try to read a digitally signed PDF document (an invoice), the output is mostly gibberish.
Attached is the main part of the output in a txt file (part of the output was OK, but I removed it because it contains private information).
FileContent.txt (1.8 KB)

Is there a way to correctly extract the text from a digitally signed PDF? Please advise.

Thank you

@Udiar

Try using the Read PDF With OCR activity:

https://docs.uipath.com/activities/other/latest/document-understanding/read-pdf-with-ocr

Cheers

Try Read PDF With OCR with an OCR engine, or use Document Understanding here

For reference

Cheers @Udiar

Thank you. I tried Read PDF With OCR with both Google OCR and Microsoft OCR, but the result contains only a small part of the content.
Is there a chance that other OCR engines would give better results? How can I know which OCR engine I need?

Hi @Udiar

Try Tesseract OCR or OmniPage OCR.

For OmniPage OCR, you have to install the UiPath.OmniPage.Activities package.

@Udiar

You have to try them… whichever works well on the file is the one to use.

Generally, Tesseract suits most cases.
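
If it helps, here is a rough way to compare the engines on one file outside of Studio. This is only a sketch under my own assumptions, not anything built into UiPath: save each engine's output as text, then score how many of the letters are actually Hebrew. The function names (hebrew_ratio, best_engine) and the engine labels are made up for illustration.

```python
def hebrew_ratio(text: str) -> float:
    """Return the share of alphabetic characters in `text` that are Hebrew letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Hebrew letters alef..tav occupy U+05D0..U+05EA.
    hebrew = sum(1 for ch in letters if "\u05d0" <= ch <= "\u05ea")
    return hebrew / len(letters)


def best_engine(outputs: dict[str, str]) -> str:
    """Pick the engine label whose output has the highest Hebrew-letter share."""
    return max(outputs, key=lambda name: hebrew_ratio(outputs[name]))


if __name__ == "__main__":
    # Hypothetical sample outputs; replace with the text each engine returned.
    samples = {
        "engine_a": "xq zzv wrt 1234",
        "engine_b": "חשבונית מס 1234",
    }
    for name, text in samples.items():
        print(name, round(hebrew_ratio(text), 2))
    print("best:", best_engine(samples))
```

A high score only means the output looks like Hebrew; it does not prove the text is correct, so still check a few lines against the PDF by eye.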

Cheers

Have a look at this to understand which one suits your needs

Hope this clarifies

Cheers @Udiar

I tried them both, but got no results at all.

Did you try this one?
https://docs.uipath.com/activities/other/latest/document-understanding/ui-path-document-ocr

@Udiar

Yes, I tried it, and it didn’t work. I think the issue is related to setting the correct reading format, because the text in the file is mostly Hebrew.