Get entire text from a scanned PDF file

Hi all, I would like to extract all the text from a scanned PDF and store the extracted text back as a PDF. The steps I have in mind are below:
OCR – Extract all text from the PDF file, page by page
PDF to JPG – single page
JPG to TIFF – multi-page TIFF
If anyone can suggest a solution, it would be a great help.

This is my PDF file:

Sample PDF_compressed.zip (1.7 MB)

Thank You,

Hi @DivyaT

If we just want to save the same PDF again, rather than extracting the text and writing it to a new PDF, we can use a Copy File activity, giving the source file path as input and the destination file path with the file name we want.

or

If we need only the text and not the images from this PDF, we can use the Read PDF With OCR activity and set the range to “All” so that it reads every page. The output is a string, say out_text.
– Then use a Word Application Scope and pass a Word file path as input
– Inside it, use an Append Text activity and pass out_text, the variable obtained from the Read PDF With OCR activity, as input
– Then use the Export to PDF activity from the Word activities to convert this Word document, with the appended text, to a PDF

That's all, you are done buddy.
Hope this helps.
Cheers @DivyaT

Yes, I have already followed this procedure, but some PDFs are in image format and some are normal (text) PDFs. After reading the PDF I can fetch only the data from the normal PDFs, so please help me with this.

extractpdf.zip (14.2 KB)

This is what I am getting in Word.

For this you need to read every page of the PDF individually and check whether the page is an image or normal text. If it is an image, read it using some other method (for example, Python code); if it is normal, use the usual read activity. Finally, combine all the results (but note that reading the image pages will not be fully accurate).
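The per-page check could be sketched in Python roughly as follows. This is only an illustration, not something posted in the thread: it assumes the third-party pypdf package (version 3 or later), and the names page_needs_ocr, classify_pages and the min_chars threshold of 20 are hypothetical. The idea is that a scanned page has an (almost) empty text layer, so it should go to OCR.

```python
def page_needs_ocr(extracted_text, min_chars=20):
    """Treat a page as a scanned image when its text layer is (almost) empty.

    min_chars is an assumed threshold; tune it for your own documents.
    """
    return len((extracted_text or "").strip()) < min_chars


def classify_pages(pdf_path, min_chars=20):
    """Return a list of (page_number, "image" or "text") for every page."""
    # pypdf is a third-party package (pip install pypdf). It is imported here
    # so that page_needs_ocr can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    result = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text()  # text layer only, no OCR is performed
        kind = "image" if page_needs_ocr(text, min_chars) else "text"
        result.append((number, kind))
    return result
```

Pages reported as "image" would then go to an OCR activity, and pages reported as "text" to the normal read activity.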

How should I invoke the code in it? I am able to read each page and save it as a separate PDF file. The issue now is that some PDF pages have different dimensions, so in those cases I should rotate the page and then read it. Can you suggest how to rotate a page and read its text in a loop?

Hi @DivyaT

Can you please tell me what process you are following, so that I can explain clearly?

– If the PDF contains both images and normal text, use Python code. In that code you can check and rotate the page (PyPDF2 library); with Python you can easily rotate the page and then read it.
– If you want to use Python code, let me know
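A minimal sketch of the rotate step in a loop is shown here. It rests on some assumptions: it uses pypdf (the successor of PyPDF2, version 3 or later, where PageObject.rotate is available), and it treats any page that is wider than it is tall as a sideways scan that needs a 90 degree clockwise turn. That rule, and the names rotation_for and rotate_landscape_pages, are hypothetical and may not match your documents.

```python
def rotation_for(width, height):
    """Assumed rule: a page wider than it is tall is a sideways scan."""
    return 90 if width > height else 0


def rotate_landscape_pages(src_path, dst_path):
    """Copy src_path to dst_path, rotating landscape pages; return how many were rotated."""
    # pypdf is a third-party package (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    rotated = 0
    for page in reader.pages:
        box = page.mediabox  # page size; ignores any existing /Rotate entry
        angle = rotation_for(float(box.width), float(box.height))
        if angle:
            page.rotate(angle)  # clockwise, must be a multiple of 90
            rotated += 1
        writer.add_page(page)
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)
    return rotated
```

The rotated file written to dst_path can then be read page by page with OCR. From a UiPath workflow, a script like this could be called through the Python activities (Python Scope, Load Python Script, Invoke Python Method).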

Yes, I want to use Python code if it's easy to understand, because I don't know how it will work. This is my PDF; you may get an idea by looking at it.

Sample PDF_compressed.zip (1.7 MB)

I am trying to extract handwritten text from a scanned PDF. I tried different OCR engines but could not extract the text. Can anyone help me?

Hi @rosemarykp997

Have you tried using the Intelligent OCR activity, ABBYY OCR, or FlexiCapture?

Thanks
Ashwin.S

Thanks for your reply.
Yes, I tried ABBYY Cloud OCR, but couldn't.

Hello, how would you create a searchable PDF after running OCR? Could you help me? I'm attaching the project. Thank you

Read PDF files - Example.zip (132.4 KB)

Hi @Enmanuel_D_Talla_Neg,

To make the PDF searchable, please follow the steps below; it should work.

  1. Read the PDF with OCR.
  2. Save the extracted data from this activity.
  3. Use an Invoke Code activity.
  4. Write the C# code below to place the data extracted from the scanned PDF into the PDF's “Keywords” field. Once done, the PDF can be found by searching on the keywords stored in that field.

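// Needs the namespaces iTextSharp.text.pdf and System.IO (for FileStream and FileMode).
// pdfText is the variable that holds the data extracted in step 1.
// The file names after path are left blank here; use a different output file than the input.
string path = "";
PdfReader reader = new PdfReader(path + "");
PdfStamper stamper = new PdfStamper(reader, new FileStream(path + "", FileMode.Create));
// Copy the existing document info and overwrite its Keywords entry.
var info = reader.Info;
info["Keywords"] = pdfText;
stamper.MoreInfo = info;
stamper.FormFlattening = true;
stamper.Close();
reader.Close();
// Length is the number of characters stored in Keywords.
insertedWordCount = info["Keywords"].Length;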

Hope this helps.

Regards
Sonali

Hi Sonali,

I just came across your post, thanks for that.

However, I am getting the following error message.

I have also imported the following namespaces: iTextSharp.text.pdf and iTextSharp.text.xml.xmp

(screenshot of the error message)

Do you happen to know what is causing this?

best regards