Extract an encoded PDF without using OCR

urweeraratne · January 3, 2019, 8:56am

I have a PDF file that I need to read without using OCR. I used the PDF to Text activity; it reads the file and writes to a text file without any errors. The problem is that the content is just a bunch of squares, as the text is in some other encoding.

While doing some research, I found out that this could be due to font details not being embedded in the PDF file. When I check the font details of the PDF, the encoding is given as encoding-H. The font I get when the text is copied and pasted is something called Arial,Unicode#20MS. I tried adding the font Arial Unicode MS to the machine (I don’t know whether it works that way), but that didn’t resolve it.
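A note on the font name: in PDF name objects, `#` followed by two hex digits encodes a single byte, so `#20` is just a space. A minimal Python sketch of that decoding follows. The function name `decode_pdf_name` is hypothetical, and it assumes the escaped bytes are UTF-8.

```python
import re

def decode_pdf_name(name: str) -> str:
    """Decode #xx hex escapes in a PDF name, e.g. '#20' becomes a space."""
    raw = re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # two hex digits -> one byte
        name.encode("utf-8"),
    )
    # Assumption: the resulting bytes are UTF-8; bad bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")

print(decode_pdf_name("Arial,Unicode#20MS"))  # prints: Arial,Unicode MS
```

So the name in the file does correspond to Arial Unicode MS. Installing that font on the machine may still not change what a text extractor returns if the PDF lacks the data needed to map its character codes back to Unicode.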

Some help would be great…

KarthikByggari · January 3, 2019, 10:48am

Try this -

Create a text file with the intended encoding.
Append the text to this text file.

I had the same issue and resolved it with this approach (in another programming language).
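The two steps above could look roughly like this in Python. This is only a sketch: the helper name `append_with_encoding` and the UTF-8 default are assumptions, not something stated in the thread.

```python
from pathlib import Path

def append_with_encoding(path, text, encoding="utf-8"):
    """Create the file with the chosen encoding if missing, then append text."""
    target = Path(path)
    if not target.exists():
        # Step 1: create an empty text file in the intended encoding.
        target.write_text("", encoding=encoding)
    # Step 2: append the extracted text using the same encoding.
    with target.open("a", encoding=encoding) as handle:
        handle.write(text)
```

This can only help when the extracted string already holds the right characters and they are being damaged on write. If the extraction step itself returns squares, writing them with a different encoding will not bring the original text back.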

urweeraratne · January 3, 2019, 11:05am

In my case, you mean encoding-H as the intended encoding? Or is it UTF something?

I created a text file with UTF encoding and appended the extracted content to it, but so far no luck…

KarthikByggari · January 3, 2019, 12:41pm

Is it OK to share the PDF file, if it is not confidential? That would make it easier to find the issue.

Regards,
Karthik

urweeraratne · January 4, 2019, 1:31am

Well, it is kind of confidential. If you need, I can provide the property details of the file. But it would be great if you could direct me to ways of finding the missing components. I know that is not as easy as it sounds…
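One rough way to look for the missing components without sharing the file is to count a few font-related keys in the raw PDF bytes. The sketch below is only a heuristic under stated assumptions: it assumes the "encoding-H" reported above is `Identity-H`, the function name `scan_pdf_markers` is hypothetical, and font dictionaries stored inside compressed object streams will not be visible to a plain byte scan, so a count of zero is not proof that something is absent.

```python
def scan_pdf_markers(data: bytes) -> dict:
    """Count font-related keys in raw PDF bytes (uncompressed parts only)."""
    markers = [
        b"/ToUnicode",    # CMap that maps character codes to Unicode
        b"/FontFile",     # embedded font program (also matches /FontFile2, /FontFile3)
        b"/Identity-H",   # two-byte identity encoding used by composite fonts
        b"/CIDFontType2", # TrueType-based CID font
    ]
    return {marker.decode("ascii"): data.count(marker) for marker in markers}

# Synthetic font dictionary for illustration, not taken from the real file.
sample = b"<< /Type /Font /Subtype /Type0 /Encoding /Identity-H >>"
print(scan_pdf_markers(sample))
```

For a real file, the bytes would come from opening the PDF in binary mode and reading it. An `Identity-H` font with no `/ToUnicode` entry is a common reason why extracted text turns into squares even though the page renders correctly.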
