Looking for ideas on detecting checkbox information in a scanned PDF

I have some ideas for how to do this, but before I get too deep into it I’d like to know how others would achieve this. Keep in mind we are on Studio 2023.4 with an on-prem installation, so we don’t have cloud, AI, advanced Document Understanding, etc. I need to do this with standard UI automation activities, OCR, etc.

Take this example…

How would you determine which boxes are checked and whether or not they are initialed?

Hi @postwick

My approach is
1. Read the PDF with an OCR activity.
2. Write the output to a text file and compare how checked and unchecked fields appear in it.
3. Use Regex to extract the fields that are checked.
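
A minimal sketch of the comparison in step 2, written in Python purely for illustration. The two OCR strings are hypothetical samples, not real engine output, and `changed_lines` is an illustrative helper name. It assumes both texts have the same number of lines; the lines that differ show which characters the OCR engine emits for a ticked box, which is what the Regex in step 3 would then target.

```python
# Hypothetical OCR output for the same form, once blank and once ticked.
# A real OCR engine may render the marks quite differently.
ocr_unchecked = """Option A
[ ] I agree to the terms
[ ] Send me updates"""

ocr_checked = """Option A
[X] I agree to the terms
[ ] Send me updates"""


def changed_lines(before, after):
    """Return (before_line, after_line) pairs for lines that differ.

    Assumes both texts contain the same number of lines.
    """
    pairs = []
    for b, a in zip(before.splitlines(), after.splitlines()):
        if b != a:
            pairs.append((b, a))
    return pairs


for before_line, after_line in changed_lines(ocr_unchecked, ocr_checked):
    print(repr(before_line), "->", repr(after_line))
```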

Regards,

You can use the Microsoft Form Recognizer API (now Azure AI Document Intelligence) to extract checkbox values from a scanned PDF. It detects selection marks and reports each one as selected or unselected.

@postwick,

  • Use “Read PDF Text” to extract the content.
  • Look for checkboxes rendered as [X] (checked) and [ ] (unchecked) in the extracted text.
  • Use Regex to capture the text that follows each checked box:
    checkedBoxes = System.Text.RegularExpressions.Regex.Matches(pdfText, "\[X\]\s*(.+)").Cast(Of System.Text.RegularExpressions.Match)().Select(Function(m) m.Groups(1).Value).ToList()
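
To show what such a pattern returns, here is a small Python sketch run on made-up text. It assumes the extracted text really uses the [X] / [ ] convention described above, one checkbox per line. The sample text and the names `CHECKED` and `checked_labels` are illustrative only, and the pattern also accepts a lowercase x, which is an added assumption.

```python
import re

# Hypothetical extracted text; real output depends on the PDF and the engine.
sample_text = """[X] Option one
[ ] Option two
[x] Option three"""

# A checked box at the start of a line, followed by its label.
# [ \t]* is used instead of \s* so a match never spills onto the next line.
CHECKED = re.compile(r"^[ \t]*\[[xX]\][ \t]*(.+?)[ \t]*$", re.MULTILINE)


def checked_labels(text):
    """Return the label text that follows each checked box."""
    return [m.group(1) for m in CHECKED.finditer(text)]


print(checked_labels(sample_text))  # ['Option one', 'Option three']
```

The same idea carries over to the VB.NET expression: the pattern has to name the checked marker explicitly, otherwise it captures every line.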

cheers!

Unfortunately the form is filled out by hand so OCR doesn’t reliably detect checked boxes.