Hi all, I would like to extract all the text from a scanned PDF and store the extracted text back as a PDF, using the steps below:
OCR – extract all text from the PDF file, page by page
PDF to JPG – single page
JPG to TIFF – multi-page TIFF
If anyone can suggest a solution, that would be a great help.
If we are only trying to save the same PDF again, then rather than extracting the text and writing it back to a PDF, we can use a Copy File activity, where we give the source file path as input and the destination file path with the file name we want.
Or, if we need only the text and not the images from this PDF, we can use the Read PDF With OCR activity and set the range to "All". That reads all the pages and returns the output as a string, for example a variable named out_text.
– Then use a Word Application Scope and pass a Word file path as input.
– Inside it, use an Append Text activity and pass out_text, the variable obtained from the Read PDF With OCR activity, as input.
– Then use the Export to PDF activity from the Word activities, which converts this Word document with the appended text to a PDF.
That's all, you are done, buddy.
Hope this helps.
Cheers @DivyaT
Yes, I have already tried this procedure, but some of the PDFs are in image format and some are normal PDFs. After reading the PDF, I can fetch data only from the normal PDFs, so please help me with this.
For this, you need to identify which kind each page is. Read every page of the PDF individually and check whether the page is an image or normal text. If it is an image, read it using some other method (for example, Python code); if it is normal, use the activity. Finally, combine all the results. (Reading the image pages will not be fully accurate, though.)
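In case it helps, here is a rough Python sketch of the page check described above. It assumes the pypdf library (the successor to PyPDF2, with the same PdfReader API in recent versions), and the function names and the 20-character threshold are my own placeholders, not anything from UiPath. The idea is simply that a scanned page returns little or no text from direct extraction, so it gets routed to OCR.

```python
def page_needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when direct extraction returns almost no text.

    min_chars is an assumed threshold; tune it for your documents.
    """
    return len(extracted_text.strip()) < min_chars


def classify_pages(pdf_path: str, min_chars: int = 20) -> dict:
    """Map each 1-based page number to 'image' or 'text'."""
    # Third-party dependency, imported here so the helper above works without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    result = {}
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        result[number] = "image" if page_needs_ocr(text, min_chars) else "text"
    return result
```

Pages marked 'image' would go through OCR and pages marked 'text' through the normal read, and the results can then be joined in page order. A page with a scanned image plus a few stray characters of real text can still be misclassified, so treat the threshold as a starting point.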
How should I invoke code in it? I am able to read each page and save it as a separate PDF file. Now the issue is that some PDF pages have different dimensions, and in that case I need to rotate the page and then read it. Can you suggest how to rotate a page and read its text in a loop?
Can you please tell me what process you are following, so that I can explain clearly?
– If the PDF contains both images and normal text, then use Python code. In that code you can check and rotate the page (PyPDF2 library); with Python you can easily rotate the page and then read it.
– If you want to use Python code, let me know.
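A minimal sketch of the rotation step, again assuming pypdf (PyPDF2 3.x exposes the same calls). The function names are hypothetical, and I am assuming that "different dimension" means landscape pages that should be turned by 90 degrees; whether 90 or 270 is the right angle depends on how your pages were scanned.

```python
def is_landscape(width: float, height: float, rotation: int = 0) -> bool:
    """True when the page is displayed wider than tall, given its /Rotate value."""
    if rotation % 180 == 90:
        # A 90 or 270 degree display rotation swaps the visible width and height.
        width, height = height, width
    return width > height


def rotate_landscape_pages(src_path: str, dst_path: str, angle: int = 90) -> int:
    """Write a copy of the PDF with every landscape page rotated clockwise by angle.

    Returns the number of pages that were rotated.
    """
    # Third-party dependency, imported here so the helper above works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    rotated = 0
    for page in reader.pages:
        box = page.mediabox
        if is_landscape(float(box.width), float(box.height), page.rotation):
            page.rotate(angle)  # angle must be a multiple of 90
            rotated += 1
        writer.add_page(page)
    with open(dst_path, "wb") as handle:
        writer.write(handle)
    return rotated
```

After this, the rotated copy can be read page by page in the same loop as before. Note that this only changes how the page is displayed; if the text inside a scanned image is itself sideways, the OCR engine still has to cope with that.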
Hello, how would you make a searchable PDF after running the OCR? Could you help me? I'm attaching the project. Thank you.
Read PDF files - Example.zip (132.4 KB)
To make the PDF searchable, please follow the steps below; it should work.
1. Read the PDF with OCR.
2. Save the data extracted by this activity.
3. Use an Invoke Code activity.
4. Write the C# code below to place the data extracted from the scanned PDF into the PDF's "Keywords" metadata field. Once done, the PDF can be found by searching for keywords present in its "Keywords" field. Note that this stores the text in the document metadata; it does not add a text layer over the scanned page images.
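// Assumes the iTextSharp.text.pdf and System.IO namespaces are imported in the Invoke Code activity.
string path = "";
// Input PDF (the scanned file).
PdfReader reader = new PdfReader(path + "");
// Output PDF; use a file name different from the input file.
PdfStamper stamper = new PdfStamper(reader, new FileStream(path + "", FileMode.Create));
var info = reader.Info;
// pdfText is the variable that holds the text extracted in step 1.
info["Keywords"] = pdfText;
stamper.MoreInfo = info;
stamper.FormFlattening = true;
stamper.Close();
reader.Close();
// Length is the number of characters stored in the field, not the number of words.
insertedWordCount = info["Keywords"].Length;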