Get entire text from scanned pdf file

DivyaT · June 13, 2019, 10:40am

Hi all, I would like to extract all the text from a scanned PDF and store the extracted text back as a PDF. The steps I have in mind are:
OCR – extract all text from the PDF file, page by page
PDF to JPG – single page
JPG to TIFF – multi-page TIFF
If anyone can suggest a solution, it would be a great help.

This is my PDF file:

Sample PDF_compressed.zip (1.7 MB)

Thank You,

Palaniyappan · June 13, 2019, 10:53am

Hi @DivyaT

If we are only trying to save the same PDF again, rather than extracting the text and writing it to a new PDF, we can use a Copy File activity, giving the source file path as input and the destination file path with the file name we want.

or

If we need only the text and not the images from this PDF, we can use the Read PDF With OCR activity and set the range to “All” so that it reads every page. The output is a string, named out_text here.
– Then use a Word Application Scope and pass a Word file path as input.
– Inside it, use an Append Text activity and pass in out_text, the variable obtained from the Read PDF With OCR activity.
– Then use the Export to PDF activity from the Word activities to convert the Word document with the appended text to a PDF.

That's all, you are done, buddy.
Hope this helps.
Cheers @DivyaT

DivyaT · June 13, 2019, 11:17am

Yes, I have already followed this procedure, but some PDFs are in image format and some are normal (text) PDFs. After reading the PDF, I can fetch only the data from the normal PDFs, so please help me with this.

DivyaT · June 13, 2019, 11:30am

extractpdf.zip (14.2 KB)

This is what I am getting in Word.

kalyanDev · June 13, 2019, 11:32am

For this, you need to read every page of the PDF individually and check whether the page is an image or normal text. If it is an image, read it using some other method (for example, Python code); if it is normal, use the activity. Finally, combine all the results (but reading the image pages will not be fully accurate).
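A minimal Python sketch of that per-page check, under some assumptions: the names `read_pages`, `combine`, `extract_text`, `ocr_page` and the `min_chars` threshold are made up for illustration, and the two callables stand in for whatever you actually use (for example a PDF library's text extraction and an OCR engine). It only shows the routing and combining logic, not the OCR itself.

```python
from typing import Callable, List


def read_pages(
    page_count: int,
    extract_text: Callable[[int], str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,  # assumed threshold; tune it for your documents
) -> List[str]:
    """Return one text string per page, falling back to OCR where needed."""
    texts = []
    for index in range(page_count):
        text = extract_text(index) or ""
        # A page with almost no embedded text is treated as a scanned image.
        if len(text.strip()) < min_chars:
            text = ocr_page(index)
        texts.append(text)
    return texts


def combine(texts: List[str]) -> str:
    """Join the per-page texts with a visible page separator."""
    return "\n\n".join(
        f"--- Page {number} ---\n{text.strip()}"
        for number, text in enumerate(texts, start=1)
    )


if __name__ == "__main__":
    # Stand-in data: page 1 has embedded text, page 2 is a scan.
    embedded = {0: "Invoice 1001, total due 250.00 USD", 1: ""}
    pages = read_pages(
        2,
        extract_text=lambda i: embedded[i],
        ocr_page=lambda i: f"[OCR text of page {i + 1}]",
    )
    print(combine(pages))
```

Passing the extraction and OCR steps in as functions keeps the check independent of any one library, so the same loop works whichever OCR engine ends up reading the image pages.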

DivyaT · June 14, 2019, 4:30am

How should I invoke code for this? I am able to read each page and save it as a separate PDF file. The problem now is that some PDF pages have different dimensions, so I need to rotate those pages before reading them. Can you suggest how to rotate a page and read its text in a loop?

kalyanDev · June 14, 2019, 5:15am

Hi @DivyaT

Can you please tell me what process you are following, so that I can explain clearly?

– If the PDF contains both images and normal text, use Python code. With the PyPDF2 library you can check each page, rotate it easily, and then read it.
– If you want to use Python code, let me know.
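A rough sketch of the rotation step in Python, assuming the `pypdf` package (the current name of the PyPDF2 project) is installed. The function names and the rule "rotate every landscape page by 90 degrees clockwise" are assumptions for illustration; a real scan may need 270 degrees instead, or may be landscape on purpose.

```python
def upright_rotation(width: float, height: float, current_rotation: int = 0) -> int:
    """Clockwise degrees to add so a landscape page is displayed as portrait."""
    # A stored rotation of 90 or 270 swaps the displayed width and height.
    if current_rotation % 180 == 90:
        width, height = height, width
    return 90 if width > height else 0


def rotate_landscape_pages(src_path: str, dst_path: str) -> int:
    """Write a copy of src_path with landscape pages rotated; return the count."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    rotated = 0
    for page in reader.pages:
        angle = upright_rotation(
            float(page.mediabox.width),
            float(page.mediabox.height),
            page.rotation,
        )
        if angle:
            page.rotate(angle)  # clockwise, in multiples of 90 degrees
            rotated += 1
        writer.add_page(page)
    with open(dst_path, "wb") as handle:
        writer.write(handle)
    return rotated
```

Calling `rotate_landscape_pages("input.pdf", "rotated.pdf")` (both file names are placeholders) produces a copy that can then be read page by page in the usual loop.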

DivyaT · June 14, 2019, 5:20am

Yes, I want to use Python code if it is easy to understand, because I don't know how it will work. This is my PDF; you may get an idea by looking at it.

Sample PDF_compressed.zip (1.7 MB)

rosemarykp997 · October 17, 2019, 7:06am

I am trying to extract handwritten text from a scanned PDF. I tried different OCR engines but could not extract the text. Can anyone help me?

AshwinS2 · October 17, 2019, 7:11am

Hi @rosemarykp997

Have you tried using the Intelligent OCR activities, ABBYY OCR, or FlexiCapture?

Thanks
Ashwin.S

rosemarykp997 · October 17, 2019, 7:24am

Thanks for your reply.
Yes, I tried ABBYY Cloud OCR, but it couldn't extract the text.

Enmanuel_D_Talla_Neg · January 30, 2020, 11:18pm

Hello, how would you create a searchable PDF after running the OCR? Could you help me? I'm attaching the project. Thank you.

Read PDF files - Example.zip (132.4 KB)

sonaliaggarwal47 · April 19, 2021, 6:28pm

Hi @Enmanuel_D_Talla_Neg,

To make the PDF searchable, please follow the steps below; it should work.

1. Read the PDF with OCR.
2. Save the data extracted by this activity.
3. Use an Invoke Code activity.
4. Write the C# code below to place the data extracted from the scanned PDF into the PDF's “Keywords” field. Once done, the PDF can be found by searching on the keywords stored in that field.

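// Namespaces to import for the Invoke Code activity: System.IO, iTextSharp.text.pdf
// pdfText holds the text extracted in step 1; insertedWordCount receives the result.
string path = "";
PdfReader reader = new PdfReader(path + "");
PdfStamper stamper = new PdfStamper(reader, new FileStream(path + "", FileMode.Create));
// Copy the existing document info and store the OCR text under "Keywords".
var info = reader.Info;
info["Keywords"] = pdfText;
stamper.MoreInfo = info;
stamper.FormFlattening = true;
stamper.Close();
reader.Close();
// Note: Length is the number of characters stored, not the number of words.
insertedWordCount = info["Keywords"].Length;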

Hope this helps.

Regards
Sonali

Benjamin_Wurth · March 7, 2022, 9:25am

Hi Sonali,

I just came across your post, thanks for that.

However, I am getting the following error message.

I have also imported the following namespaces: iTextSharp.text.pdf and iTextSharp.text.xml.xmp

(screenshot of the error message)

Do you happen to know what is causing this?

Best regards
