Need Help with Data Extraction from OCR Processed Images in UiPath

Hello UiPath Community,

I’m currently working on a project where I need to extract specific data from OCR-processed images using UiPath. I’ve implemented an OCR process that successfully extracts text from the images, but I’m struggling with efficiently extracting the required data using regular expressions.

Here’s what I have so far:

  • I’ve used the UiPath Document OCR activity to extract text from images.
  • The extracted text contains multiple fields such as Centre, Firstname, Lastname, VFS-Ref-No, Visa Category, Sub Visa Category, and Sub Type Category.

I’ve attempted to use an Invoke Code activity with regular expressions to extract the data, but I’m not getting any output.

My goal is to efficiently extract the required data from the extracted text and store it in variables for further processing.

If anyone has experience with data extraction from OCR-processed images or can guide me on using iTextSharp for this purpose, I would greatly appreciate your insights. Any examples or guidance on how to improve my workflow would be very helpful.

Input text that the OCR returns:


Centre, Facilitation Visa and Permit SA Town ABC LTD PTY SA Processing Visa VFS XYZ123456 receipt cum Invoice 14:35 Time : 31/7/2023 JOHN DOE Mr. ABC123456 Category: VISA Visa Residence Temporary Visa Renewal - TRV Category: VISA Sub S Visa Person Retired Category: Type Sub section 425.00 ZAR Fee: VISA 185.00 ZAR Fee: PCC 00 1550. ZAR fee: Service 20 ZAR Fee: SMS 15% VAT: 425. ZAR Vatable): Non Fees Visa ( Total 00 1550 ZAR VAT): (Including Fees VFS Total

Desired Extracted Output:


Centre: Facilitation Visa and Permit SA Town ABC LTD PTY SA
Firstname: JOHN
Lastname: DOE
VFS-Ref-No: XYZ123456
Visa Category: Visa Residence Temporary
Sub Visa Category: TRV Renewal - TRV
Sub Type Category: Person Retired

In this example, the input OCR text contains various fields, and the desired output consists of the extracted value for each field: Centre, Firstname, Lastname, VFS-Ref-No, Visa Category, Sub Visa Category, and Sub Type Category. This is the output I expect once the relevant information has been extracted from the OCR text using regular expressions or other methods.

After that, I want to convert the extracted data to JSON for use in a database.
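
A minimal sketch of that JSON step, assuming the Newtonsoft.Json package is available to the project (an assumption, not something confirmed in this thread). The values are copied from the desired output above as stand-ins for the extracted variables, and the module name JsonSketch is made up for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports Newtonsoft.Json

Module JsonSketch
    Sub Main()
        ' Stand-in values taken from the desired output; in the workflow
        ' these would be the variables filled by the extraction step.
        Dim fields As New Dictionary(Of String, String) From {
            {"Centre", "Facilitation Visa and Permit SA Town ABC LTD PTY SA"},
            {"Firstname", "JOHN"},
            {"Lastname", "DOE"},
            {"VFS-Ref-No", "XYZ123456"}
        }

        ' Serialize the dictionary to an indented JSON object string.
        Dim json As String = JsonConvert.SerializeObject(fields, Formatting.Indented)
        Console.WriteLine(json)
    End Sub
End Module
```

Inside an Invoke Code activity the same call could be written with the fully qualified name Newtonsoft.Json.JsonConvert.SerializeObject, since the code body cannot contain Imports statements.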

Here’s a description of the sequence I’ve created so far:

  1. Load Images and Perform OCR:
  • Load images from a specified folder using a For Each loop.
  • Perform OCR on each image using the UiPath Document OCR activity.
  • Extract the OCR results (ExtractedText) from each image.
  2. Data Extraction using Regular Expressions:
  • Invoke Code activity is used to extract specific information from the OCR results.
  • Use regular expressions to match and extract the following fields:
    • Centre
    • Firstname
    • Lastname
    • VFS-Ref-No
    • Visa Category
    • Sub Visa Category
    • Sub Type Category
  • Store the extracted values in corresponding variables.
  3. Output Extracted Data:
  • Use Write Line activities to output the extracted data for each field in a formatted manner.
  • Output the extracted values for Centre, Firstname, Lastname, VFS-Ref-No, Visa Category, Sub Visa Category, and Sub Type Category.
  4. MessageBox Confirmation:
  • Display a MessageBox indicating that the image OCR job has been completed.

Overall, this sequence aims to automate the extraction of specific information from OCR results using regular expressions and then present the extracted data in a human-readable format using Write Line activities. The MessageBox at the end provides confirmation that the OCR job has finished processing the images.

Here is my regex code in the Invoke Code activity:

Dim extractedText As String = extractedTextIn 

' Extract Centre
Dim centreMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Centre, ([^\n]+)")
Dim centre As String = centreMatch.Groups(1).Value.Trim()

' Extract Firstname
Dim firstnameMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "(Mr\.|Ms\.|Mrs\.) ([A-Za-z]+)")
Dim firstname As String = firstnameMatch.Groups(2).Value.ToLowerInvariant()

' Extract Lastname
Dim lastnameMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "([A-Za-z]+) [A-Za-z]+ (Mr\.|Ms\.|Mrs\.)")
Dim lastname As String = lastnameMatch.Groups(1).Value.ToLowerInvariant()

' Extract VFS-Ref-No
Dim vfsRefNoMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "VAT\d+")
Dim vfsRefNo As String = vfsRefNoMatch.Value

' Extract Visa Category
Dim visaCategoryMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Category: ([A-Za-z ]+)")
Dim visaCategory As String = visaCategoryMatch.Groups(1).Value

' Extract Sub Visa Category
Dim subVisaCategoryMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Sub VISA ([A-Za-z \-]+)")
Dim subVisaCategory As String = subVisaCategoryMatch.Groups(1).Value

' Extract Sub Type Category
Dim subTypeCategoryMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Sub Type Category: ([A-Za-z0-9 \(\)\-]+)")
Dim subTypeCategory As String = subTypeCategoryMatch.Groups(1).Value

This is the output I get (all blank):


Centre: 
Firstname: 
Lastname: 
VFS-Ref-No: 
Visa Category: 
Sub Visa Category: 
Sub Type Category:

OCR OUT : Centre, Facilitation Visa and Permit SA Town ABC LTD PTY SA Processing Visa VFS XYZ123456 receipt cum Invoice 14:35 Time : 31/7/2023 JOHN DOE Mr. ABC123456 Category: VISA Visa Residence Temporary Visa Renewal - TRV Category: VISA Sub S Visa Person Retired Category: Type Sub section 425.00 ZAR Fee: VISA 185.00 ZAR Fee: PCC 00 1550. ZAR fee: Service 20 ZAR Fee: SMS 15% VAT: 425. ZAR Vatable): Non Fees Visa ( Total 00 1550 ZAR VAT): (Including Fees VFS Total

Hope this is clear with enough details.

Thank you in advance for your assistance!

Best regards

@Ray_Shadow

One basic question: does your Invoke Code activity have the direction of its arguments set to Out?

Also, if you can provide the data exactly as it appears, in a text file, we can help better. Your regex looks fine, assuming each field comes on a different line.

Another possible cause of the blank value: for Centre, your pattern stops at a newline, so if "Centre" and its data come on two different lines, nothing will be captured.

cheers

Yes, this is the exact output. The OCR reads the receipt and puts it all on one line, and I then pass that to the regex code as an argument with direction In.
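
Because everything arrives on one line, a pattern such as "Centre, ([^\n]+)" captures the whole rest of the text rather than just the centre name. Below is a rough sketch of patterns bounded by neighbouring tokens instead. It assumes the token order always matches the sample (for example that "Processing Visa VFS" follows the centre name and that the two names sit directly before the title); the module and variable names are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SingleLineOcrSketch
    Sub Main()
        ' Shortened copy of the single OCR line from the question.
        Dim ocrLine As String = "Centre, Facilitation Visa and Permit SA Town ABC LTD PTY SA Processing Visa VFS XYZ123456 receipt cum Invoice 14:35 Time : 31/7/2023 JOHN DOE Mr. ABC123456 Category: VISA Visa Residence Temporary"

        ' Centre: lazily capture everything between "Centre," and "Processing Visa VFS".
        Dim centre As String = Regex.Match(ocrLine, "Centre,\s+(.+?)\s+Processing Visa VFS").Groups(1).Value

        ' VFS-Ref-No: the first alphanumeric token after "Processing Visa VFS".
        Dim vfsRefNo As String = Regex.Match(ocrLine, "Processing Visa VFS\s+([A-Z0-9]+)").Groups(1).Value

        ' Names: the two words directly before the title (Mr./Mrs./Ms.).
        Dim nameMatch As Match = Regex.Match(ocrLine, "\b([A-Za-z]+)\s+([A-Za-z]+)\s+(?:Mr|Mrs|Ms)\.")
        Dim firstname As String = nameMatch.Groups(1).Value
        Dim lastname As String = nameMatch.Groups(2).Value

        Console.WriteLine("Centre: " & centre)
        Console.WriteLine("Firstname: " & firstname)
        Console.WriteLine("Lastname: " & lastname)
        Console.WriteLine("VFS-Ref-No: " & vfsRefNo)
    End Sub
End Module
```

On the sample line these patterns pick out the Centre, JOHN, DOE and XYZ123456 values from the desired output. The three category fields are left out because the sample alone does not make clear where one ends and the next begins; they would need boundary tokens chosen in the same way.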

Hi @Ray_Shadow ,

Please take a moment to reflect on the question that was asked. We do not see arguments being passed out for the field values you declared. As mentioned, the fields you want extracted need to be arguments of the Invoke Code activity with their direction set to Out (currently we only see them as variables declared inside the Invoke Code).
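
A minimal sketch of what that could look like inside the Invoke Code activity. It assumes the activity has an In argument named extractedTextIn (as in the posted code) and two String arguments with direction Out named firstname and lastname (names chosen here to mirror the posted variables). This is the body of the activity, not a standalone program, and the name pattern is only an illustration based on the sample text.

```vb
' Assumed arguments of the Invoke Code activity:
'   extractedTextIn : In,  String
'   firstname       : Out, String
'   lastname        : Out, String

' Illustrative pattern: the two words directly before the title.
Dim nameMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedTextIn, "\b([A-Za-z]+)\s+([A-Za-z]+)\s+(?:Mr|Mrs|Ms)\.")

' No Dim for firstname/lastname: they are the Out arguments themselves,
' so assigning to them is what passes the values back to the workflow.
firstname = nameMatch.Groups(1).Value
lastname = nameMatch.Groups(2).Value
```

In the activity's Arguments panel, each Out argument is then bound to a workflow variable, and that variable holds the extracted value once the activity has run.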

Next, to check if the extraction is working fine, you could use the below Expression after each Extraction done :

Console.WriteLine(firstname) ' change the variable accordingly for each extraction

Hi, I tried your suggestion of using Out arguments, but it didn’t work. Console.WriteLine(firstname) only prints to the console, which doesn’t help because I need to use the value later in my workflow. I can’t get the code block to return the value to the workflow. How do I do that?

@Ray_Shadow ,

Could you let us know what you actually tried and what did not work? Screenshots of the Invoke Code activity’s arguments would also help us analyse this better.

As for Console.WriteLine, we suggested it only as a way to verify that the regex extraction works. I believe you were able to confirm that from the Console/Output panel.