Using OCR to extract Small text Data ( not working )

MrJoints · June 18, 2019, 2:50pm

Hey all,

Reaching out to see if anyone might have an answer for how I can extract or scrape this data. See picture below.

The areas with a red box around them are what I need. The text is very small and OCR seems to be having a lot of trouble with it. These are the things I have tried:

Extracting just the text (does not work because the CAD file is flattened before it is sent out, so the page is treated as an image).
Extracting text with OCR, trying multiple OCR engines (Google, Microsoft).
Changing the scale of the OCR scan to make the small text a little bigger.
Changing my default app for opening PDFs. I typically do my measuring in Bluebeam, but since it is very advanced software I did not want to wait for the load time each time it opened a new PDF, so I tried opening the PDFs with Google and then using a scrape function (this did not work either).
Most recently I changed my PDF viewer to Adobe thinking it might help, but it does not seem to have made much of a difference.
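
On the scale point, a rough sketch of the arithmetic behind "make the small text bigger": a PDF point is 1/72 inch, so the pixel height of text depends on the DPI the page is rendered at. The 4 pt font size and the 20 px target below are assumptions for illustration only, not measurements from the drawing.

```python
import math

POINTS_PER_INCH = 72  # one PDF point is 1/72 of an inch


def text_height_px(font_size_pt: float, dpi: float) -> float:
    """Approximate pixel height of text when a page is rendered at `dpi`."""
    return font_size_pt * dpi / POINTS_PER_INCH


def dpi_for_target_height(font_size_pt: float, target_px: float) -> int:
    """Smallest whole DPI at which the text reaches `target_px` pixels."""
    return math.ceil(target_px * POINTS_PER_INCH / font_size_pt)


# Hypothetical numbers: 4 pt labels, aiming for about 20 px tall glyphs.
needed_dpi = dpi_for_target_height(4, 20)
print(needed_dpi, text_height_px(4, needed_dpi))
```

If the page is already a flattened raster, rendering at a higher DPI only enlarges the existing pixels, so the gain depends on the resolution the drawing was exported at.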

If I could pull just the text, I am not too bad with regex now and I wouldn’t mind using a matching activity, but I can’t even get that far.
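
For reference, a minimal sketch of the regex step once text is available, assuming the OCR output has one "label: value" pair per line. The labels and values in the sample are made up for illustration; the real drawing will need its own pattern.

```python
import re

# One "label: value" or "label = value" pair per line (assumed layout).
FIELD_PATTERN = re.compile(
    r"^[ \t]*(?P<label>[A-Za-z][A-Za-z ]*?)[ \t]*[:=][ \t]*(?P<value>\S.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(ocr_text: str) -> dict:
    """Return a dict of lower-cased labels to their values."""
    return {
        match.group("label").lower(): match.group("value")
        for match in FIELD_PATTERN.finditer(ocr_text)
    }


# Hypothetical OCR output, not taken from the actual PDF.
sample = "Perimeter: 182.5 ft\nDecking = Cedar 2x6\nnoise line\n"
print(extract_fields(sample))
```

Lines without a colon or equals sign are skipped, which helps when the OCR picks up stray marks around the boxed areas.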

After everything I have tried, I am almost certain this will have to be a step where a human does the legwork and types in these fields so the bot can continue, but that rather defeats the purpose of automation.

I look forward to hearing from anyone who thinks they have a good answer for this. If you need me to send you a sample version of my PDF for testing on your side let me know.

Kind regards,
Mr. Joints

asesor-rpa · June 18, 2019, 3:05pm

I think you should use the Google API or a similar service.

Use the Try API box to upload an image and see if that API can solve your problem.

Sorry, I took your screenshot to test.

MrJoints · June 18, 2019, 3:14pm

Woah, don’t be sorry my friend, that looks like a great scrape :0 You’re making me feel hope for this part of the project once again.

I’m going to read this article now. Does it tell me how I get that API? These files are stored locally in my Dropbox sync folder, so they have a path and I have to tell it what app to open them with. Should I be using Google Chrome, I assume, since I’d be using the Google API? Going to read this now, it may help solve this.

asesor-rpa · June 18, 2019, 3:17pm

Read the Pricing section and the Get started section if you need more details about Google’s Vision API.

Example code looks like this:

MrJoints · June 18, 2019, 3:24pm

Thanks so much, I will 100% be looking into this. It also looks super cost effective for my needs.
Just curious, can you let me know a couple of things before I go start playing in there?
Does it count a text unit per character or per word? Also, I cannot see what it is outputting for the text. Is it looking accurate, for the perimeter for example?

I’m not quite sure how to read that code, in all honesty. It does not look like anything within UiPath. I assume I will see more of it when getting started with Vision AI. Thanks again!!
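
On what the output looks like: a minimal sketch of pulling the recognized text out of a Vision API JSON response, assuming the usual shape where `fullTextAnnotation.text` holds the whole page and the first `textAnnotations` entry holds the same text as one string. The sample response is invented for illustration and is not the result of the test on the screenshot.

```python
def full_text(response: dict) -> str:
    """Return the recognized text from the first response, or '' if none."""
    responses = response.get("responses") or [{}]
    first = responses[0]
    if "error" in first:
        raise RuntimeError(first["error"].get("message", "unknown error"))
    annotation = first.get("fullTextAnnotation")
    if annotation:
        return annotation.get("text", "")
    # Fallback: the first textAnnotations entry carries the full text,
    # later entries are the individual words with their bounding boxes.
    words = first.get("textAnnotations", [])
    return words[0]["description"] if words else ""


# Hypothetical response body.
sample_response = {
    "responses": [
        {
            "textAnnotations": [
                {"description": "PERIMETER 182\n"},
                {"description": "PERIMETER"},
                {"description": "182"},
            ]
        }
    ]
}
print(full_text(sample_response))
```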

asesor-rpa · June 18, 2019, 3:30pm

The pricing guide says that you are billed per image, and for PDF files each page is treated as a separate image. A unit is one feature applied to one image. In your case Text Detection is the feature, and Google will charge you for every unit applied to your images.
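
A small sketch of that unit arithmetic, assuming units are simply pages multiplied by features per page. The free allowance and the price per 1000 units are placeholder parameters here, not Google's actual rates; check the pricing page for the real numbers.

```python
def billable_units(pages: int, features_per_page: int = 1) -> int:
    """One unit is one feature applied to one image (or one PDF page)."""
    return pages * features_per_page


def estimate_cost(units: int, free_units: int, price_per_1000: float) -> float:
    """Cost of the units beyond the free allowance (placeholder rates)."""
    return max(0, units - free_units) / 1000 * price_per_1000


# Hypothetical month: 3000 single-page drawings, Text Detection only.
units = billable_units(3000)
print(units, estimate_cost(units, free_units=1000, price_per_1000=1.5))
```

Under this model the count of characters or words on a page does not matter; only the number of pages and the number of features applied to each.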

You can create a program in C# and call it from your robot, or create a Python script. Whatever the language, the robot can deal with it (that’s the easiest part).
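
A minimal sketch of the Python script route, using only the standard library and the Vision REST endpoint `images:annotate`. It assumes authentication with an API key passed as the `key` query parameter; the function names and the file path in the usage note are hypothetical.

```python
import base64
import json
import urllib.request

ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_bytes: bytes, feature: str = "TEXT_DETECTION") -> dict:
    """Build the JSON body: the image is sent inline as base64."""
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }


def annotate(image_path: str, api_key: str) -> dict:
    """POST one local image to the API and return the parsed JSON reply."""
    with open(image_path, "rb") as handle:
        body = build_request(handle.read())
    request = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

The robot could then run something like `annotate("drawing_page1.png", api_key)` and read the text from the result. For dense text, `DOCUMENT_TEXT_DETECTION` can be passed as the feature instead; whether it does better on these drawings would need testing.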

MrJoints · June 18, 2019, 3:37pm

In all honesty I do not know any coding language.
Regex is the best I’ve got at the moment, but I believe I have it down. Looking at your code, I think I could change certain parts of it to suit my needs (I need the general field off that PDF, the decking info, and the address info from the main customer area). It sounds to me like that would be considered 3 separate units, which I would set up on Google’s side.

Is C# the best language to use with Google Vision AI? If so, I will start learning how to read and write it today.

MrJoints · June 18, 2019, 3:50pm

Hey Asesor,

Do I have to turn my PDFs into images before I send them to Vision?

asesor-rpa · June 18, 2019, 3:51pm

AFAIK, you can process PDFs; every page is counted as an image for billing purposes.
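
A hedged sketch of what a PDF request body might look like, assuming the synchronous `files:annotate` REST endpoint, which takes the file inline as base64 with a MIME type and a list of page numbers. This is an assumption about which endpoint fits; it only handles a few pages per request, and larger files go through the asynchronous route with Cloud Storage instead. The function name is hypothetical.

```python
import base64

PDF_ENDPOINT = "https://vision.googleapis.com/v1/files:annotate"


def build_pdf_request(pdf_bytes: bytes, pages: list) -> dict:
    """Build a JSON body asking for text on the given 1-based page numbers."""
    return {
        "requests": [
            {
                "inputConfig": {
                    "content": base64.b64encode(pdf_bytes).decode("ascii"),
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "pages": pages,
            }
        ]
    }


# Each requested page counts as one image for billing.
body = build_pdf_request(b"%PDF-1.4", [1, 2])
print(len(body["requests"][0]["pages"]), "pages requested")
```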

