Hi guys, I need to extract checkboxes from a digital PDF. I have tried Microsoft OCR, but it doesn't give me accurate results.
We are using the on-premises version.
Any suggestions on how to extract checkboxes would be helpful.
Hi @Rajat
When it comes to extracting checkboxes from digital PDFs, accuracy can be a challenge, especially when using OCR technology. However, there are alternative approaches you can consider to improve checkbox extraction:
- Utilize PDF Libraries: Instead of relying solely on OCR, you can use PDF processing libraries. If the checkboxes are stored as interactive form fields in the PDF, libraries like iText, PDFBox, or PyPDF2 provide APIs to parse the document and read the checkbox states directly from its underlying structure.
- Coordinate-Based Extraction: If the checkbox positions are consistent across the PDF documents you’re working with, you can programmatically extract checkboxes by specifying their coordinates. You can use PDF libraries or image processing libraries, such as OpenCV, to locate and extract checkboxes based on their position on the page.
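
To make the coordinate-based idea concrete, here is a minimal sketch in plain Python. It assumes the PDF page has already been rendered to a grayscale image (0 = black, 255 = white) and that each checkbox's pixel rectangle is known in advance. The names `dark_ratio`, `is_checked`, the box coordinates, and both thresholds are illustrative assumptions, not part of any library, and would need tuning on real documents.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
# Assumption: the box is placed just inside the printed checkbox border,
# so the border itself does not count as "ink".
Box = Tuple[int, int, int, int]


def dark_ratio(gray: List[List[int]], box: Box, dark_below: int = 128) -> float:
    """Return the fraction of pixels inside `box` darker than `dark_below`."""
    left, top, right, bottom = box
    pixels = [v for row in gray[top:bottom] for v in row[left:right]]
    if not pixels:
        raise ValueError("box is empty or lies outside the image")
    return sum(1 for v in pixels if v < dark_below) / len(pixels)


def is_checked(gray: List[List[int]], box: Box,
               min_dark_ratio: float = 0.15, dark_below: int = 128) -> bool:
    """Treat a box as checked when enough of its interior is dark."""
    return dark_ratio(gray, box, dark_below) >= min_dark_ratio


# Synthetic 20x10 white "page" standing in for a rendered PDF page.
page = [[255] * 20 for _ in range(10)]

# Draw an "X" inside the first box (rows 2..7, columns 2..7).
for i in range(6):
    page[2 + i][2 + i] = 0
    page[2 + i][7 - i] = 0

# Hypothetical fixed checkbox positions for this page layout.
boxes: Dict[str, Box] = {
    "option_a": (2, 2, 8, 8),
    "option_b": (12, 2, 18, 8),
}

states = {name: is_checked(page, box) for name, box in boxes.items()}
print(states)  # {'option_a': True, 'option_b': False}
```

With a real page, a grayscale image loaded through OpenCV (for example with `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`) could be passed in after converting it with `.tolist()`. Because the check only counts dark pixels, it runs locally and does not involve an OCR engine.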
Thanks!!
Hi @Rajat ,
Could you tell us more about what you mean by extracting checkboxes? Do you want to extract the checkbox value?
Could you check the output of a Read PDF Text activity? A checked box is sometimes indicated there as x.
Do let us know what the output text looks like after using the Read PDF Text activity.
If that method does not work, we can look further into the requirements and the structure/format of the documents.
Thanks for your reply. To use the APIs, I would need to pay for an API key.
However, the position of my checkboxes will always be fixed, so I will try your second solution.
Thanks, let me know if you have any other ideas.
It’s a simple digital PDF with text and checkboxes.
Yes, I need to extract checkbox values.
Read PDF Text gives me good results for the text, but for checkboxes it doesn't provide anything.
I have tried OCR engines, but the results are not accurate.
@Rajat ,
Is it possible for you to provide the PDF file? Then we can check other options if direct conversion to text is not possible.
Sorry, it’s a bit difficult as it contains sensitive data.
It seems that for CV we also need an API key, which I don't have.
Hi guys, just to update you: I used both Microsoft OCR and Tesseract OCR to extract my checkboxes.
Microsoft OCR did most of the job; it failed to extract only one checkbox, so for that one I used Tesseract OCR, which resolved my problem.
Thanks, everyone, for your responses.