Extract data from PDF using OCR or Text read activity

Hi,

We are receiving emails with PDF attachments and we’re trying to read the content of the PDFs.
Is there a good way to figure out whether a PDF document contains images or readable text?
We see that if we use the OCR activity on documents that contain text, it generally does a bad job and confuses a lot of letters.

It would be beneficial to do some kind of check to decide which activity to use. Does anyone know how to do this?

Thanks

Hi @RayJay

Use the Read PDF With OCR activity and give it the Microsoft OCR engine.

Thanks
Ashwin S

Yes, we are already using both of them. But we would like a more precise check, so that PDFs that contain text use the “Read PDF Text” activity and the ones that contain images use the Microsoft OCR activity. Right now we use a Try Catch that first runs Microsoft OCR on the PDF and, if that fails, falls back to Read PDF Text. Unfortunately, this leads to problems with the recognition of some characters. If we do it the other way around, we just get gibberish from the Read PDF Text activity on the image-based PDF files.

Does anyone know of a way to check what the PDF contains?

Hi
- Once you have retrieved the mails and saved their attachments to a folder, use an Assign activity like this:
arr_files = Directory.GetFiles("yourfolderpath", "*.pdf")

Here arr_files is a variable of type array of String that will hold all the PDF file paths.
- Now use a FOR EACH loop, pass the above variable as input, and change the type argument to String.
- Inside the loop, use a READ PDF TEXT activity, set the file path to item.ToString, and store the output in a String variable named str_output.
- After that, use a READ PDF WITH OCR activity with the same input and the GOOGLE OCR engine, and store its output in a String variable named str_output1.

- Now use an IF activity with a condition like this:
str_output.Length > str_output1.Length

If the condition is true, the flow goes to the THEN part, where we can use the str_output variable for the rest of the activities.
Otherwise, it goes to the ELSE part, where we can use the str_output1 variable for the rest of the activities.
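
For reference, the comparison in the IF step can be sketched outside UiPath roughly as follows. This is only a minimal Python illustration of the same rule, not UiPath code: the function name pick_pdf_text and the "text"/"ocr" labels are made up for the example, and the two arguments stand in for str_output and str_output1. Treating a missing value as an empty string is an assumption added here.

```python
from typing import Optional, Tuple


def pick_pdf_text(text_output: Optional[str], ocr_output: Optional[str]) -> Tuple[str, str]:
    """Pick the extraction result that produced more characters.

    text_output stands in for str_output (Read PDF Text) and
    ocr_output for str_output1 (Read PDF With OCR).
    """
    # Assumption: a missing result is treated like an empty string.
    text_output = text_output or ""
    ocr_output = ocr_output or ""

    # Same test as the IF activity: str_output.Length > str_output1.Length
    if len(text_output) > len(ocr_output):
        return "text", text_output  # THEN branch
    return "ocr", ocr_output        # ELSE branch


if __name__ == "__main__":
    # Hypothetical sample values, only to show how the function is called.
    source, content = pick_pdf_text("", "Scanned page text")
    print(source, len(content))
```

The idea is that a scanned PDF usually gives little or nothing from the text read, so the OCR output ends up longer, while a text PDF gives its full content from the text read. It is a length heuristic only, so it is worth checking against a few of your own files.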

Cheers @RayJay

Hi, that was actually not a bad idea. I will try this out and see how it works out.
Thanks!

Just an update. This solution worked great! Thanks a lot!
