Converting PDF to Text File

Hi all,
I am new here. I am working on a process where I need to convert a PDF file of 5 to 7 pages to a text file.

I am using Tesseract OCR, but one or two pages contain a stamp, and the words in the stamp are also being extracted, which I don’t want.

Can anyone help me with this, please?

Hi @0bb4628e217fd43ac86ac9294 ,

Can you please share a sample PDF file so that I can try to share a solution?

Do we need to extract specific data or the entire data from the PDF?

Regards,
Pavan Kumar

Hi @pavan_kumar5
We need to extract the entire data.

Hi @0bb4628e217fd43ac86ac9294

After extracting the data from the PDF file, store it in a String variable. Then use the Replace function to replace the stamp words with an empty string, and write the String variable to the text file.
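The same idea can be sketched outside the workflow. This is only a minimal Python illustration, assuming the stamp wording is known in advance; the function name `remove_stamp_text`, the sample OCR text, and the stamp phrases are all made up for the example and do not come from the actual PDF.

```python
import re

def remove_stamp_text(ocr_text, stamp_phrases):
    """Return ocr_text with every stamp phrase removed (case-insensitive)."""
    cleaned = ocr_text
    for phrase in stamp_phrases:
        # re.escape makes the phrase match as literal text, not as a pattern
        cleaned = re.sub(re.escape(phrase), "", cleaned, flags=re.IGNORECASE)
    # Collapse runs of spaces/tabs left behind where a phrase was removed
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Trim trailing whitespace on every line
    return "\n".join(line.rstrip() for line in cleaned.splitlines())

# Hypothetical OCR output and stamp wording, for illustration only
ocr_text = "Invoice No: 1234 RECEIVED\nTotal: 500 USD\nAPPROVED 12 JAN 2023"
stamp_phrases = ["RECEIVED", "APPROVED 12 JAN 2023"]

cleaned = remove_stamp_text(ocr_text, stamp_phrases)
print(cleaned)
```

The cleaned string can then be written out, for example with `pathlib.Path("output.txt").write_text(cleaned, encoding="utf-8")`. Note that this only works when the stamp text is static; a phrase that also appears in the real document content would be removed there too.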

Hope it helps!!

Hi @0bb4628e217fd43ac86ac9294

When using Read PDF with OCR, store the output in a variable. If the stamp words are static, you can use the Replace function to remove that text and store the result in the same variable. After that, you can write the variable to a text file using the Write Text File activity.

Regards,

Hi @0bb4628e217fd43ac86ac9294

In the properties of the Tesseract OCR activity, there are two options called Allowed Characters and Denied Characters.
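For reference, the Tesseract command-line tool exposes similar settings through the `tessedit_char_whitelist` and `tessedit_char_blacklist` config variables. Whether the activity's Allowed/Denied Characters properties map exactly onto these is an assumption on my part. The Python sketch below only builds the command line and does not run OCR; the helper name `build_tesseract_command` and the file names are hypothetical.

```python
import shlex

def build_tesseract_command(image_path, output_base,
                            denied_chars="", allowed_chars=""):
    """Build a Tesseract CLI command with optional character filters."""
    command = ["tesseract", image_path, output_base]
    if denied_chars:
        # Characters Tesseract should not output
        command += ["-c", f"tessedit_char_blacklist={denied_chars}"]
    if allowed_chars:
        # Characters Tesseract is restricted to
        command += ["-c", f"tessedit_char_whitelist={allowed_chars}"]
    return command

# Hypothetical file names and denied characters, for illustration only
command = build_tesseract_command("page_1.png", "page_1", denied_chars="#*|")
print(shlex.join(command))
```

Keep in mind that these filters work per character across the whole page, not per word or region, so they are unlikely to remove a stamp whose text uses the same letters as the rest of the document.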

Hope this helps

Hi @0bb4628e217fd43ac86ac9294 ,
You can try


or

regards,