Convert PDF to Text File

Hi all,
I am stuck at one point; please help me.

I have a PDF file with 4 to 5 pages. I need to read all the pages and convert them to a .txt file. There is also a stamp in the PDF file.

Please help me.

Hi,

How about using the ReadPdfText activity or the ReadPDFWithOcr activity?

Regards,

Hi @Yoichi, thank you for the quick reply.
I have used the ReadPDF activity, but the output comes back empty.
I have also tried the Read PDF with OCR activity with the Tesseract OCR engine. I get output, but it is not as expected.

Because there is a stamp at the end of the page, unwanted commas, full stops, numbers, and letters appear in the output.
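
One way to deal with this kind of stamp noise is to post-process the OCR string before saving it. The snippet below is only a rough Python sketch, not UiPath code: the function name `drop_noisy_lines` and both thresholds are assumptions made for illustration. It drops lines that are mostly punctuation or very short fragments. It will not catch stray letters or digits that look like ordinary text.

```python
def drop_noisy_lines(text: str, min_alnum_ratio: float = 0.6, min_length: int = 3) -> str:
    """Remove lines that look like OCR noise (mostly punctuation or tiny fragments).

    Both thresholds are illustrative guesses and need tuning on real output.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            kept.append("")  # keep blank lines so paragraph breaks survive
            continue
        visible = [ch for ch in stripped if not ch.isspace()]
        alnum = sum(1 for ch in visible if ch.isalnum())
        # Keep the line only if it is long enough and mostly letters or digits.
        if len(visible) >= min_length and alnum / len(visible) >= min_alnum_ratio:
            kept.append(line)
    return "\n".join(kept)
```

In a workflow, the same idea could be applied to the OCR output string before it is written to the text file.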

Hi @0bb4628e217fd43ac86ac9294

You can use the Read PDF Text activity to read structured (text-based) files.
You can use the Read PDF with OCR activity to read unstructured (scanned or image-based) files.

The output of these activities is a String. You can then write the String to a text file by using the Write Text File activity.
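
The flow described above (read the text layer, fall back to OCR when the result is empty, then write the string to a file) can be sketched as follows. This is a minimal Python sketch, not UiPath code: `read_text_layer` and `read_with_ocr` are hypothetical callables standing in for the two read activities, and `pdf_to_text_file` is a name made up for this example.

```python
from pathlib import Path
from typing import Callable


def pdf_to_text_file(
    pdf_path: str,
    txt_path: str,
    read_text_layer: Callable[[str], str],  # hypothetical stand-in for the text reader
    read_with_ocr: Callable[[str], str],    # hypothetical stand-in for the OCR reader
) -> str:
    """Extract text from a PDF and write it to a .txt file.

    Returns "text" or "ocr" to show which reader produced the output.
    """
    text = read_text_layer(pdf_path)
    method = "text"
    # An empty result usually means the pages are scanned images with no text layer.
    if not text.strip():
        text = read_with_ocr(pdf_path)
        method = "ocr"
    Path(txt_path).write_text(text, encoding="utf-8")
    return method
```

In a workflow, the equivalent would be an If on whether the first output string is empty, followed by the Write Text File activity.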

Hope it helps!!

Each OCR engine extracts text in a different format. If the format from Tesseract OCR suits you, change the Scale value in the OCR engine's properties, trying values between 0 and 5.

Change the scale on each run until you get the expected output.
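
The trial-and-error over scale values can also be automated. The following is a hedged Python sketch, not UiPath code: `run_ocr` is a hypothetical callable that runs OCR at a given scale and returns the text, integer scale steps are an assumption, and `readable_ratio` is a crude score invented for this example (the share of characters that are letters, digits, or whitespace).

```python
from typing import Callable, Iterable, Tuple


def readable_ratio(text: str) -> float:
    """Fraction of characters that are letters, digits, or whitespace."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text if ch.isalnum() or ch.isspace())
    return ok / len(text)


def best_scale(
    run_ocr: Callable[[int], str],  # hypothetical: runs OCR at the given scale
    scales: Iterable[int] = range(0, 6),
) -> Tuple[int, str]:
    """Try each scale and return the (scale, text) pair with the highest score.

    On a tie, the first scale tried wins.
    """
    best = None
    for scale in scales:
        text = run_ocr(scale)
        score = readable_ratio(text)
        if best is None or score > best[0]:
            best = (score, scale, text)
    if best is None:
        raise ValueError("no scale values were given")
    return best[1], best[2]
```

The score only penalizes punctuation noise, so the winning output still needs a visual check.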

Hope it helps!!

It seems the PDF contains images rather than text.

It may be better to use OmniPage OCR or a cloud OCR such as UiPathDocumentOCR, Google Cloud Vision OCR, or Azure Computer Vision OCR.

Regards,

@Yoichi How do I get the API keys for OmniPage OCR or a cloud OCR such as UiPathDocumentOCR, Google Cloud Vision OCR, or Azure Computer Vision OCR?

Hi,

We can use OmniPage OCR without an API key. Please check the following document.

Google Cloud Vision OCR and Azure Computer Vision OCR require an API key. Please check the website of each service. (They can be used free of charge up to a certain amount.)
If you use Community Edition, UiPathDocumentOCR can also be used for free.

Regards,