Extract an encoded PDF without using OCR

I have a PDF file that I need to read without using OCR. I used the PDF to Text activity; it reads the file and writes to a text file without any errors. The problem is that the content is just a bunch of squares, as the text is in some other encoding.

While doing some research, I found out that this could be due to font details not being embedded in the PDF file. When I check the font details of the PDF, the encoding is shown as encoding-H. The font I get when the text is copied and pasted is called Arial,Unicode#20MS. I tried adding the Arial Unicode MS font to the machine (I don't know whether it works that way), but that did not resolve it.

Some help would be great…:slight_smile:

Try this -

  1. Create a text file with the intended encoding.
  2. Append the text to this text file.

I had the same issue and resolved it with this approach (in another programming language).
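
The two steps above can be sketched roughly as follows. This is only an illustration in Python (the language actually used is not named here), and the helper name `append_with_encoding` and the default of UTF-8 are assumptions. It only helps if the extracted text is already valid Unicode; it cannot recover characters that the extractor could not map in the first place.

```python
def append_with_encoding(path, text, encoding="utf-8"):
    """Append text to a file, creating the file if it does not exist.

    The encoding is set explicitly instead of relying on the
    platform default (hypothetical helper, for illustration only).
    """
    with open(path, "a", encoding=encoding) as handle:
        handle.write(text)
```

Calling `append_with_encoding("extracted.txt", extracted_text)` would then write the extracted content with a known encoding, where both the file name and `extracted_text` are placeholders.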

In my case, do you mean encoding-H as the intended encoding? Or is it UTF something?

I created a text file with UTF encoding and appended the extracted content to it, but so far no luck…:frowning:

Is it OK to share the PDF file, if it is not confidential? That would make it easier to find the issue.

Regards,
Karthik

Well, it is kind of confidential. If you need, I can provide the property details of the file. But it would be great if you could direct me to ways of finding the missing components. I know that is not as easy as it sounds…:confused:
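
One rough way to look for the missing components without sharing the file is to scan the raw PDF bytes for a few font-related names. This is a sketch under assumptions: it guesses that "encoding-H" refers to the standard `Identity-H` encoding, the function name `scan_pdf_font_markers` is made up for illustration, and the scan can miss entries when the font dictionaries are stored inside compressed object streams.

```python
import re

# PDF name objects that are relevant to text extraction.
MARKERS = {
    "identity_h": rb"/Identity-H",        # glyph-ID based encoding
    "to_unicode": rb"/ToUnicode",         # glyph-to-Unicode map
    "embedded_font": rb"/FontFile[23]?",  # embedded font program
}

def scan_pdf_font_markers(data: bytes) -> dict:
    """Count occurrences of each marker in the raw bytes of a PDF."""
    return {
        name: len(re.findall(pattern, data))
        for name, pattern in MARKERS.items()
    }
```

The bytes can be read with `pathlib.Path("file.pdf").read_bytes()`, where the path is a placeholder. If `identity_h` is non-zero while `to_unicode` is zero, the file probably carries no mapping from glyphs back to characters, in which case changing the output encoding is unlikely to help.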