Convert a pdf to a searchable pdf

Hi All,

Would anyone know how you can convert a pdf into searchable pdf?

I am converting an image > pdf > searchable pdf.

Is there a way we can convert a pdf into searchable pdf without using adobe activities?

I think there’s a way to invoke code using C# and and itextsharp but not sure how to implement it in UiPath.

Hi @sharon.palawandram ,

Pyteseract (based on Teseract) in combination with OpenCV can achieve what you are looking for.
Some years ago one of my deliverables was a similar case. Scanned pdf (image and not the rich text)

Pyteseract mostly allows for OCR while OpenCV integration would provide the coordinates for each extracted text from the image. Later you can create an empty document and superimpose the extracted text using the same coordinates of texts from your original image (essentially removing background). The last step would be to convert it back to a PDF.

Handwritten text and sharpness of the text in the image will be the challenge to extract but you could use an ensemble approach ( combination of ML models) , although even with that you should expect some outlier or exception cases.

Whichever approach you use, preprocessing your image using OpenCV will be an essential part of your pipeline.

2 Likes

@sharon.palawandram

In addition to the above, you could also try opening your PDF in Microsoft word or Libre Office Draw, both support PDF files. Libre Office Draw also supports scanned pdfs (not 100% though)

2 Likes

Hi @sharon.palawandram ,
Can you share your file?
I think we can use “read PDF with OCR”
image
or
image
regards,

1 Like

Thank you, I can try this out. I’m trying to generally convert any text image of (.jpeg,jpg) to pdf.

After we read the pdf, what will be the step? I’ve digitized it and on the output you can either convert it to .txt or a key value pair

Thank you Jeevith for your feedback, interesting approach I will test this out.

Does the output pdf distort the original image?

Thank you for your suggestions, I will test this out too.

The resulting PDF worked well (we had around 20 cases in our testing), the OCR engine failed in some while extracting some of the texts, but we did get the coordinates of the text and there was little to no distortion.

----------------------------------------------------------------

Another approach you could also try is using GhostScript in some parts of your process. You can convert image files to pdf and test if the resulting PDF is rich text or not, I cannot remember how it embeds the image. (How to Use Ghostscript)

I also have a component in the Marketplace to convert image files to PDF using GhostScript, you could try it during your prototyping. https://cloud.uipath.com/jeeviszafpnq/marketplace_/listings/pdf-manipulation-using-ghostscript

Hi @sharon.palawandram ,
Sorry reply late,
You can write text to word file
then convert word to PDF


hope it help
regards,

1 Like

Hi Jeevith, thanks for letting me know. I will try this too.

In the previous comment, if we get the output of an OCR to a key value pair, how can we resave it to a pdf?

Hi Nguyen,

Thank you for your answer. But wouldn’t this distort the original image position?

Hi Shron,
I think it will take it from top to bottom and return 1 string. When pasted into word, it will be in the same order. Converting it to pdf will also be in the same order.
regards,