How to extract text from an image using OCR

Hi,

I am new to UiPath and would like to know how to use OCR with image or PDF files. Could anyone send me a sample or the steps to do it?

Thanks in advance.

Hi,
Go through these articles.

Regards,
Ranjith

Hi,
Did you sign up for this?
https://academy.uipath.com/lms/index.php?r=site/index
It is the world's first open online training platform for RPA users.

Happy Automation. :)

Hi,

I got an exception when I tried the Microsoft OCR engine and the ABBYY OCR engine.

Google OCR runs without errors, but it does not read the entire text properly.

Similarly, how can I read the text from an image?

@ranjith

Hi Ranjith,

How can I read a particular portion of the PDF?

You have to use string manipulation or regex to read a particular portion of the PDF.
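
As a rough illustration of the string-manipulation route: once OCR has returned the page as one string, a portion can be cut out by locating the text that comes before and after it. The sketch below is in Python rather than VB.NET, and the sample text, the labels, and the `text_between` helper are all made up for the example.

```python
def text_between(text, start_marker, end_marker):
    """Return the text found between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Hypothetical OCR output; real output depends on the document and the OCR engine.
ocr_text = "Customer: ACME Ltd\nFax Number: 555-0100\nDate: 12/03/2017\n"

# Take everything between the label and the end of that line.
fax = text_between(ocr_text, "Fax Number:", "\n")
print(fax)
```

This only works when the surrounding labels are read reliably by the OCR engine; when the layout varies, a regular expression is usually the more forgiving option.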

@Bharat_Kumar

If possible, could you provide a sample or let me know the steps to do it?

Please find attached a xaml file in which I fetch the Fax Number, File# and Date from the PDF.
I used regular expressions to fetch the data. You can achieve the same with string manipulation.
You can use regex101 to build, test, and debug different regular expressions,
and for string manipulation see https://www.dotnetperls.com/string-vbnet

ScanedPDF.zip (23.2 KB)
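
For readers who cannot open the attachment, here is a minimal sketch of the same idea: one regular expression per field, applied to the OCR text. It is written in Python, not taken from the xaml file, and the sample text, the patterns, and the `extract_fields` helper are assumptions made for illustration.

```python
import re

# Hypothetical OCR output; the real PDF in the attachment may be laid out differently.
ocr_text = """Fax Number: 555-0100
File# A-20417
Date: 12/03/2017
"""

# One pattern per field; each captures the value that follows its label.
FIELD_PATTERNS = {
    "fax": r"Fax\s*Number\s*:?\s*([\d\-]+)",
    "file": r"File\s*#\s*:?\s*([\w\-]+)",
    "date": r"Date\s*:?\s*(\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})",
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None when not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(ocr_text)
print(fields)
```

The patterns would need adjusting to the actual labels and value formats in your document; regex101 is a convenient place to try them against real OCR output.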

Thank you @Bharat_Kumar

@Bharat_Kumar
Thanks, Bharat, this information is very helpful. I have one question: is there any function other than IsDate() to check whether a given string is a date?

Try this regex for checking the invoice date. It works for dd/mm/yy[yy] or dd/MMM/yy[yy], and it also supports dd-mm-yy[yy] (dash as a separator). I don't remember whether I tested it with a space as the separator, but hopefully it also works for dd mmm yyyy. It also handles hidden spaces, tabs, #, : and, in most cases, a newline between the label and the actual date.

"((INVOICE DATE)|(INVOICE DT)|(DATE))"+
"(\t){0,3}"+
"(\s){0,3}"+
"(\t){0,3}"+
"(#){0,2}"+
"(:){0,1}"+
"(#){0,2}"+
"(\t){0,3}"+
"(\s){0,3}"+
"(\t){0,3}"+
"(\n){0,1}"+
"((\d){1,2}|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[a-z]{0,9}"+"[/\-\s,]{1,4}"+
"((\d){1,2}|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[a-z]{0,9}"+"[/\-\s,]{1,4}"+
"(\d){2,4}"+
"(\s){0,3}"

This should work for most of the invoice templates if they are already OCRed.
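
To show how a pattern like this can answer the "is it a date?" question without IsDate(), here is a hedged Python sketch. It is a loose port, not the exact .NET expression: the run of tab/space/#/: pieces is collapsed into one character class, matching is case-insensitive, dates are assumed to be day-first, and `find_invoice_date` is a hypothetical helper name. After the regex finds a candidate, `datetime.strptime` rejects values that only look like dates.

```python
import re
from datetime import datetime

# Day or month token: 1-2 digits, or a month abbreviation plus optional extra letters.
TOKEN = r"(?:\d{1,2}|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[a-z]{0,9}"
SEP = r"[/\-\s,]{1,4}"

INVOICE_DATE = re.compile(
    r"(?:INVOICE DATE|INVOICE DT|DATE)"
    r"[\s#:]{0,10}"  # simplified stand-in for the optional tabs, spaces, # and :
    r"(" + TOKEN + SEP + TOKEN + SEP + r"\d{2,4})",
    re.IGNORECASE,
)

# Day-first formats only (an assumption); %b and %B depend on the current locale.
FORMATS = ("%d %m %Y", "%d %m %y", "%d %b %Y", "%d %b %y", "%d %B %Y", "%d %B %y")


def find_invoice_date(text):
    """Return the labelled date as a datetime.date, or None if absent or invalid."""
    match = INVOICE_DATE.search(text)
    if not match:
        return None
    # Normalise every separator to a single space before parsing.
    parts = [p for p in re.split(r"[/\-\s,]+", match.group(1)) if p]
    normalized = " ".join(parts)
    for fmt in FORMATS:
        try:
            return datetime.strptime(normalized, fmt).date()
        except ValueError:
            continue
    return None


print(find_invoice_date("INVOICE DATE : 12/03/2017"))
print(find_invoice_date("Invoice Dt# 5-Jan-18"))
print(find_invoice_date("Date: 45/13/2017"))
```

The last call returns None: the text has the right shape for the regex, but it is not a real calendar date, which is the check the pattern alone cannot make.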

Unfortunately, 99.99% of the average handling time (AHT) is spent entering line items in the procure-to-pay (P2P) systems, so for a long-term enterprise solution you would want to use a tool like ABBYY FlexiCapture.

Thanks a lot, Bharat. I was really stuck for a long time and your solution was absolutely bang on.
Cheers mate.
