Pytesseract (a Python wrapper for Tesseract), in combination with OpenCV, can achieve what you are looking for.
A few years ago one of my deliverables was a similar case: a scanned PDF (an image, not rich text).
Pytesseract handles the OCR itself, while the OpenCV integration provides the coordinates of each piece of text extracted from the image. You can then create an empty document and superimpose the extracted text at the same coordinates it had in the original image (essentially removing the background). The last step is to convert the result back to a PDF.
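As a rough sketch of the OCR-with-coordinates step, pytesseract's `image_to_data` returns word-level bounding boxes alongside the recognized text. This assumes the `tesseract` binary plus the `pytesseract` and `Pillow` packages are installed; the file name and confidence threshold are illustrative.

```python
# Sketch: OCR with word-level bounding boxes via pytesseract.
# Assumes tesseract, pytesseract, and Pillow are installed;
# the input path and min_conf threshold are illustrative.

def filter_words(data, min_conf=60):
    """Keep words with non-empty text and confidence >= min_conf.

    `data` is the dict returned by pytesseract.image_to_data with
    Output.DICT: parallel lists under 'text', 'conf', 'left',
    'top', 'width', and 'height'.
    """
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip() and float(data["conf"][i]) >= min_conf:
            words.append({
                "text": text,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return words

def ocr_with_boxes(path, min_conf=60):
    # Imports deferred so filter_words stays usable without tesseract.
    import pytesseract
    from PIL import Image
    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return filter_words(data, min_conf)
```

Each returned entry gives you the text plus the pixel coordinates you need to superimpose it on an empty page later.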
Handwritten text and the sharpness of the text in the image will be the main extraction challenges. You could use an ensemble approach (a combination of ML models), although even then you should expect some outliers and exception cases.
Whichever approach you take, preprocessing the image with OpenCV will be an essential part of your pipeline.
In addition to the above, you could try opening your PDF in Microsoft Word or LibreOffice Draw; both support PDF files. LibreOffice Draw also handles scanned PDFs (though not 100% reliably).
The resulting PDFs worked well in our testing (around 20 cases). The OCR engine failed to extract some of the text, but we still got the text coordinates, and there was little to no distortion.
Another approach you could try is using Ghostscript in parts of your process. It can convert image files to PDF; test whether the resulting PDF is rich text or not, as I cannot remember how it embeds the image. (How to Use Ghostscript)
Hi Shron,
I believe Tesseract reads the page from top to bottom and returns a single string. When pasted into Word, the text will be in the same order, and converting that to PDF preserves the order as well.
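Tesseract generally emits words in reading order already, but if you have collected word boxes yourself (e.g. from `image_to_data`) you can enforce top-to-bottom, left-to-right order explicitly. This is a sketch; the 10 px row tolerance is an assumption you would tune to your scan resolution.

```python
# Sketch: enforce top-to-bottom, left-to-right reading order on a
# list of word boxes (dicts with 'text', 'left', 'top'). The
# row_tolerance groups words whose tops are nearly level; 10 px
# is an assumed value, tune it for your scans.

def reading_order(words, row_tolerance=10):
    rows = []
    for w in sorted(words, key=lambda w: w["top"]):
        # Join the last row if vertically close, else start a new row.
        if rows and abs(rows[-1][0]["top"] - w["top"]) <= row_tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda w: w["left"]))
    return " ".join(w["text"] for w in ordered)
```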
Regards,