Pytesseract (a Python wrapper for Tesseract), in combination with OpenCV, can achieve what you are looking for.
A few years ago one of my deliverables was a similar case: a scanned PDF (an image, not rich text).
Pytesseract handles the OCR itself, while the OpenCV integration provides the coordinates of each piece of text extracted from the image. You can then create an empty document and superimpose the extracted text at the same coordinates it had in the original image (essentially removing the background). The last step is to convert the result back to a PDF.
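As a rough sketch of the OCR-with-coordinates step, pytesseract's `image_to_data` returns word-level bounding boxes alongside the recognized text. This assumes the `tesseract` binary plus the `pytesseract` and `Pillow` packages are installed; the file name and confidence threshold are illustrative.

```python
# Sketch: OCR with word-level bounding boxes via pytesseract.
# Assumes tesseract, pytesseract, and Pillow are installed;
# the input path and min_conf threshold are illustrative.

def filter_words(data, min_conf=60):
    """Keep words with non-empty text and confidence >= min_conf.

    `data` is the dict returned by pytesseract.image_to_data with
    Output.DICT: parallel lists under 'text', 'conf', 'left',
    'top', 'width', and 'height'.
    """
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip() and float(data["conf"][i]) >= min_conf:
            words.append({
                "text": text,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return words

def ocr_with_boxes(path, min_conf=60):
    # Imports deferred so filter_words stays usable without tesseract.
    import pytesseract
    from PIL import Image
    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return filter_words(data, min_conf)
```

Each returned entry gives you the text plus the pixel coordinates you need to superimpose it on an empty page later.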
Handwritten text and the sharpness of the text in the image will be the main extraction challenges. You could use an ensemble approach (a combination of ML models), although even then you should expect some outliers and exception cases.
Whichever approach you take, preprocessing the image with OpenCV will be an essential part of your pipeline.
In addition to the above, you could try opening your PDF in Microsoft Word or LibreOffice Draw; both support PDF files. LibreOffice Draw also handles scanned PDFs (though not 100% reliably).
The resulting PDFs worked well in our testing (around 20 cases). The OCR engine failed to extract some of the text, but we still got the text coordinates, and there was little to no distortion.
Another approach you could try is using Ghostscript in parts of your process. It can convert image files to PDF; test whether the resulting PDF is rich text or not, as I cannot remember how it embeds the image. (How to Use Ghostscript)
Hi Shron,
I believe Tesseract reads the page from top to bottom and returns a single string. When pasted into Word, the text will be in the same order, and converting that to PDF preserves the order as well.
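Tesseract generally emits words in reading order already, but if you have collected word boxes yourself (e.g. from `image_to_data`) you can enforce top-to-bottom, left-to-right order explicitly. This is a sketch; the 10 px row tolerance is an assumption you would tune to your scan resolution.

```python
# Sketch: enforce top-to-bottom, left-to-right reading order on a
# list of word boxes (dicts with 'text', 'left', 'top'). The
# row_tolerance groups words whose tops are nearly level; 10 px
# is an assumed value, tune it for your scans.

def reading_order(words, row_tolerance=10):
    rows = []
    for w in sorted(words, key=lambda w: w["top"]):
        # Join the last row if vertically close, else start a new row.
        if rows and abs(rows[-1][0]["top"] - w["top"]) <= row_tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda w: w["left"]))
    return " ".join(w["text"] for w in ordered)
```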
Regards,