Scanned PDF having data in square boxes

pdf
ocr

#1

Hi All,

I am using UiPath OCR to read data from scanned PDFs. In a few of them, the characters are written in square boxes.

Because of the square boxes, UiPath OCR doesn’t give accurate results. There is one box per character, as shown in the image. If I read a whole section containing multiple boxes, I get a lot of junk in between the characters, and I am not sure how to read it character by character from the PDF.

Please suggest an alternative way of extracting data from PDFs in the format described.


#2

A bit tricky indeed.

I see one option but it’ll be a bit fragile because it depends on the exact scaling & rotation of the image.

In broad terms, it would work something like this:
• Make sure the scaling is 100% or another fixed amount.
• Start with a Find Image for “Name of Bank”.
• Then get the coordinates of the found image.
• Then do a While loop (11 times for the 11 boxes), scrape the interior of each box, and append each result to a single string. After the loop is done you should have the full contents of the boxes in the string.

To scrape the interior of each box you’ll need to generate the coordinates for each box inside the loop, something like this:
box_X = nameOfBank_X + offset + boxWidth*counter
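
As a rough illustration of that arithmetic, here is a minimal Python sketch. It is not UiPath code: the function name, the `margin` parameter (used to stay clear of the box borders) and the assumption that all boxes share one row and one size are mine, and the real numbers would have to come from trial and error on the actual scan.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def box_regions(label_x: int, label_y: int, offset: int,
                box_width: int, box_height: int,
                count: int = 11, margin: int = 2) -> List[Region]:
    """Return the interior scrape region of each box in a single row.

    label_x / label_y: coordinates of the found "Name of Bank" image.
    offset: horizontal distance from the label to the first box.
    margin: hypothetical inset so the box border itself is not scraped.
    """
    regions = []
    for counter in range(count):
        # Same formula as in the post: label position + offset + width * index.
        box_x = label_x + offset + box_width * counter
        regions.append((box_x + margin,
                        label_y + margin,
                        box_width - 2 * margin,
                        box_height - 2 * margin))
    return regions
```

Each tuple would then be used as the clipping region for one OCR call inside the loop.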

A lot of trial and error might be involved. Also, not sure what OCR will return for an empty box, hope not too much junk.

I think this would be an interesting exercise for Automation Challenge.


#5

This solution should work. I’m using it for scraping tables row by row, and once the numbers are right it’s pretty consistent over RDP. For scanned documents it will be harder, as there might be 1 or 2 pixel differences (e.g. the paper got a little bit bent) that can throw it off, especially when finding the label image, whose exact coordinates can vary.
Also make sure that the image is 100% horizontally aligned, or you will catch the box borders, especially toward the right-hand end of the row.

Quality will probably be awful though. Handwriting should go through ICR; most regular OCR engines give garbage output for it.

For empty text it will either return junk or throw an error (“OCR returned empty text” is an actual exception, at least with GoogleOCR).
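
One way to cope with both outcomes is to treat anything that is not a single clean character as a blank. The Python sketch below only illustrates that idea: `ocr_region` is a hypothetical callable standing in for whatever OCR call is used per box, and catching a generic exception is a stand-in for the engine-specific "empty text" error.

```python
from typing import Callable, Iterable


def read_boxes(regions: Iterable, ocr_region: Callable[[object], str]) -> str:
    """Read one character per box, mapping junk and empty boxes to a space."""
    chars = []
    for region in regions:
        try:
            text = ocr_region(region).strip()
        except Exception:
            # Stand-in for an "OCR returned empty text" style error.
            text = ""
        # Keep the result only if it is exactly one letter or digit.
        if len(text) == 1 and text.isalnum():
            chars.append(text)
        else:
            chars.append(" ")
    return "".join(chars)
```

This does not make the recognition itself any better; it only keeps junk from empty boxes out of the final string.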

Sidenote:
In printed forms, the boxes are usually printed in colour (either green or red), which is filtered out during scanning. That way OCR/ICR can just read an area instead of single characters, and actually apply dictionary correction (for names, addresses and other known inputs).
With single-character readings it will be nigh impossible to catch an error in recognition, and 99% of the time human validation will be required. No matter what OCR/ICR companies say, in my experience if you want reliably high accuracy for handwriting, you need human validation.
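
If the scan is in colour and the boxes really are printed in a distinct red or green, a similar dropout can be approximated in software before OCR. The Python sketch below is only an illustration under those assumptions: the image is taken to be a nested list of RGB tuples, and the `threshold` value is a guess that would need tuning per scanner.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255
WHITE: Pixel = (255, 255, 255)


def drop_colour(pixels: List[List[Pixel]], channel: int = 0,
                threshold: int = 60) -> List[List[Pixel]]:
    """Whiten pixels in which one colour channel clearly dominates.

    channel: 0 for red boxes, 1 for green boxes.
    threshold: assumed minimum lead of that channel over the other two.
    """
    cleaned = []
    for row in pixels:
        new_row = []
        for px in row:
            others = [v for i, v in enumerate(px) if i != channel]
            if px[channel] - max(others) > threshold:
                new_row.append(WHITE)  # box line: erase it
            else:
                new_row.append(px)     # ink or paper: keep as is
        cleaned.append(new_row)
    return cleaned
```

Dark ink and white paper have roughly equal channel values, so they pass through unchanged. This cannot help when the boxes are printed in black.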


#6

Cosin,
We have tried what you described, but as Andrzej mentioned, it returns garbage for empty boxes, and if the text overlaps the boxes it says X for the Y.

Andrzej,
I totally agree that human validation will be required for handwriting. Having the boxes in one colour and the handwritten text in another is a niche technique to separate concerns and increase efficiency. I wanted to check whether UiPath internally provides any operations alongside OCR (pre-processing of images to ignore the boxes, just throwing out a thought) that could make things easier than they are right now.

Regarding ICR, can you recommend any available solution that could improve the results for handwritten text?

Thanks for sharing your valuable thoughts on this.


#7

This approach was quite useful. Thanks!


#8

Nothing out-of-the-box comes to mind.
While it may not feel encouraging, in my completely honest opinion robots (at least the ones I’ve tried) cannot rival specialized digitizing software when it comes to these tasks. They can come very close with printed text, but for handwriting it’s not even a competition.

I don’t want to advertise for or against any particular solution, although it might be best to just ask your sourcing dept to look for a digitizing partner/software solution. It helps a lot if you can control how the forms are designed too, to match what your software can process successfully.

Other than that, at least at this stage, I don’t see robots, being general-purpose software, dethroning specialized OCR/ICR companies anytime soon. They could very well work for processing the data afterwards, but the readout itself will always be lagging behind I think, at least for handwriting.