Looking for ideas on detecting checkbox information on PDF (scan)

postwick · March 16, 2025, 4:12pm

I have some ideas for how to do this, but before I get too deep into it I’d like to know how others would approach it. Keep in mind we are on Studio 2023.4 with an on-prem installation, so we don’t have cloud, AI, advanced Document Understanding, etc. I need to do this with standard UI automation activities, OCR, etc.

Take this example…

How would you determine which boxes are checked and whether or not they are initialed?

lrtetala · March 16, 2025, 5:09pm

Hi @postwick

My approach is
1. Read the PDF with an OCR activity.
2. Write the output to a text file and compare how checked and unchecked fields appear.
3. Use Regex to extract the fields that are checked.
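
A minimal sketch of steps 2 and 3, written in Python rather than as a Studio workflow to keep it short. It assumes the OCR engine happens to render a ticked box as `[X]` or `[x]` and an empty one as `[ ]`; the sample text and the name `find_checked` are made up for illustration, and the real markers have to be read off your own OCR output.

```python
import re

# Hypothetical OCR output. Real engines may emit different markers for a
# ticked box, or nothing at all, so inspect your own text file first.
SAMPLE_OCR_TEXT = """\
[X] Option A
[ ] Option B
[x] Option C
"""

# Assumed convention: "[X]" or "[x]" at the start of a line marks a checked box.
# [ \t]* is used instead of \s* so a match never spills onto the next line.
CHECKED_LINE = re.compile(r"^[ \t]*\[[xX]\][ \t]*(.+?)[ \t]*$", re.MULTILINE)


def find_checked(ocr_text: str) -> list[str]:
    """Return the label of every line that starts with a checked marker."""
    return CHECKED_LINE.findall(ocr_text)


print(find_checked(SAMPLE_OCR_TEXT))  # ['Option A', 'Option C']
```

If the checked and unchecked lines look identical in the text file, the OCR is not capturing the marks and no pattern will recover them.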

Regards,

manasrlenka25 · March 16, 2025, 5:19pm

You can use the Microsoft Form Recognizer API to extract checkbox values from a scanned PDF, returning results in Boolean format.

Somanath1 · March 17, 2025, 5:29am

@postwick ,

Use “Read PDF Text” to extract content.
Check for checkboxes represented as [X] (checked) and [ ] (unchecked).
Use Regex to extract checked values:
checkedBoxes = System.Text.RegularExpressions.Regex.Matches(pdfText, "\[X\]\s*(.+)").Cast(Of System.Text.RegularExpressions.Match)().Select(Function(m) m.Groups(1).Value).ToList()

cheers!

postwick · March 17, 2025, 1:50pm

Unfortunately the form is filled out by hand so OCR doesn’t reliably detect checked boxes.
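
One OCR-free direction, not proposed by anyone in this thread and untested against this form, is to measure how much ink falls inside each box. The sketch below is plain Python and only illustrates the arithmetic. It assumes each checkbox (or initials area) has already been cropped just inside its printed border and converted to an 8-bit grayscale grid. The names `ink_ratio` and `looks_marked` and both thresholds are hypothetical and would need tuning on real scans.

```python
# Assumption: a region is a grid of 8-bit grayscale values
# (0 = black, 255 = white), cropped inside the printed box border
# so the border itself is not counted as ink.
Grid = list[list[int]]


def ink_ratio(region: Grid, dark_below: int = 128) -> float:
    """Fraction of pixels in the region darker than the cutoff."""
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    dark = sum(1 for row in region for px in row if px < dark_below)
    return dark / total


def looks_marked(region: Grid, min_ratio: float = 0.10) -> bool:
    """Heuristic: call the region marked if enough of it is ink.

    min_ratio is a guess, not a measured value.
    """
    return ink_ratio(region) >= min_ratio


# Toy 4x4 regions: an empty box and one with a diagonal pen stroke.
EMPTY_BOX = [[255] * 4 for _ in range(4)]
TICKED_BOX = [[0 if r == c else 255 for c in range(4)] for r in range(4)]

print(looks_marked(EMPTY_BOX), looks_marked(TICKED_BOX))  # False True
```

The same measure could in principle be applied to the initials area next to each box, since handwriting of any kind raises the ink ratio. It says nothing about what was written, only that something was.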
