Hello UiPath Community,
I’m currently working on a project where I need to extract specific data from OCR-processed images using UiPath. I’ve implemented an OCR process that successfully extracts text from the images, but I’m struggling with efficiently extracting the required data using regular expressions.
Here’s what I have so far:
- I’ve used the UiPath Document OCR activity to extract text from images.
- The extracted text contains multiple fields such as Centre, Firstname, Lastname, VFS-Ref-No, Visa Category, Sub Visa Category, and Sub Type Category.
I’ve attempted to use an Invoke Code activity with regular expressions to extract the data, but I’m not getting any output.
My goal is to efficiently extract the required data from the extracted text and store it in variables for further processing.
If anyone has experience with data extraction from OCR-processed images or can guide me on using iTextSharp for this purpose, I would greatly appreciate your insights. Any examples or guidance on how to improve my workflow would be very helpful.
Input text that the OCR returns:
Centre, Facilitation Visa and Permit SA Town ABC LTD PTY SA Processing Visa VFS XYZ123456 receipt cum Invoice 14:35 Time : 31/7/2023 JOHN DOE Mr. ABC123456 Category: VISA Visa Residence Temporary Visa Renewal - TRV Category: VISA Sub S Visa Person Retired Category: Type Sub section 425.00 ZAR Fee: VISA 185.00 ZAR Fee: PCC 00 1550. ZAR fee: Service 20 ZAR Fee: SMS 15% VAT: 425. ZAR Vatable): Non Fees Visa ( Total 00 1550 ZAR VAT): (Including Fees VFS Total
Desired Extracted Output:
Centre: Facilitation Visa and Permit SA Town ABC LTD PTY SA
Firstname: JOHN
Lastname: DOE
VFS-Ref-No: XYZ123456
Visa Category: Visa Residence Temporary
Sub Visa Category: TRV Renewal - TRV
Sub Type Category: Person Retired
In this example, the input OCR text contains various fields, and the desired output consists of the extracted value for each field: Centre, Firstname, Lastname, VFS-Ref-No, Visa Category, Sub Visa Category, and Sub Type Category. This is the output I expect after successfully extracting the relevant information from the OCR text using regular expressions or other methods.
After that, I want to convert the extracted values to JSON for use in a database.
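To make the JSON step concrete, here is a minimal sketch of what I have in mind. It assumes the Newtonsoft.Json package is available (my assumption for this project) and uses the field names from the desired output above as JSON keys, which is only an illustrative choice. The values are hard-coded from the example; in the real workflow they would come from the extraction step.

```vb
Imports System
Imports Newtonsoft.Json.Linq

Module JsonSketch
    Sub Main()
        ' Hypothetical record built from the desired output shown above.
        ' JObject keeps properties in the order they are added.
        Dim record As New JObject()
        record.Add("Centre", New JValue("Facilitation Visa and Permit SA Town ABC LTD PTY SA"))
        record.Add("Firstname", New JValue("JOHN"))
        record.Add("Lastname", New JValue("DOE"))
        record.Add("VFS-Ref-No", New JValue("XYZ123456"))
        record.Add("Visa Category", New JValue("Visa Residence Temporary"))
        record.Add("Sub Visa Category", New JValue("TRV Renewal - TRV"))
        record.Add("Sub Type Category", New JValue("Person Retired"))

        ' Serialize with indentation so the result is easy to inspect.
        Dim json As String = record.ToString(Newtonsoft.Json.Formatting.Indented)
        Console.WriteLine(json)
    End Sub
End Module
```

Inside an Invoke Code activity the `Module` and `Sub Main` wrapper would be dropped, and the JSON string would be assigned to an Out argument instead of being written to the console.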
Here’s a description of the sequence I’ve created so far:
- Load Images and Perform OCR:
  - Load images from a specified folder using a For Each loop.
  - Perform OCR on each image using the UiPath Document OCR activity.
  - Extract the OCR results (ExtractedText) from each image.
- Data Extraction using Regular Expressions:
  - Invoke Code activity is used to extract specific information from the OCR results.
  - Use regular expressions to match and extract the following fields:
    - Centre
    - Firstname
    - Lastname
    - VFS-Ref-No
    - Visa Category
    - Sub Visa Category
    - Sub Type Category
  - Store the extracted values in corresponding variables.
- Output Extracted Data:
  - Use Write Line activities to output the extracted data for each field in a formatted manner.
  - Output the extracted values for Centre, Firstname, Lastname, VFS-Ref-No, Visa Category, Sub Visa Category, and Sub Type Category.
- MessageBox Confirmation:
  - Display a MessageBox indicating that the image OCR job has been completed.
Overall, this sequence aims to automate the extraction of specific information from OCR results using regular expressions and then present the extracted data in a human-readable format using Write Line activities. The MessageBox at the end provides confirmation that the OCR job has finished processing the images.
Here is the code in my Invoke Code activity:
Dim extractedText As String = extractedTextIn
' Extract Centre
Dim centreMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Centre, ([^\n]+)")
Dim centre As String = centreMatch.Groups(1).Value.Trim()
' Extract Firstname
Dim firstnameMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "(Mr\.|Ms\.|Mrs\.) ([A-Za-z]+)")
Dim firstname As String = firstnameMatch.Groups(2).Value.ToLowerInvariant()
' Extract Lastname
Dim lastnameMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "([A-Za-z]+) [A-Za-z]+ (Mr\.|Ms\.|Mrs\.)")
Dim lastname As String = lastnameMatch.Groups(1).Value.ToLowerInvariant()
' Extract VFS-Ref-No
Dim vfsRefNoMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "VAT\d+")
Dim vfsRefNo As String = vfsRefNoMatch.Value
' Extract Visa Category
Dim visaCategoryMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Category: ([A-Za-z ]+)")
Dim visaCategory As String = visaCategoryMatch.Groups(1).Value
' Extract Sub Visa Category
Dim subVisaCategoryMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Sub VISA ([A-Za-z \-]+)")
Dim subVisaCategory As String = subVisaCategoryMatch.Groups(1).Value
' Extract Sub Type Category
Dim subTypeCategoryMatch As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(extractedText, "Sub Type Category: ([A-Za-z0-9 \(\)\-]+)")
Dim subTypeCategory As String = subTypeCategoryMatch.Groups(1).Value
This is the output I get (all fields blank):
Centre:
Firstname:
Lastname:
VFS-Ref-No:
Visa Category:
Sub Visa Category:
Sub Type Category:
OCR OUT : Centre, Facilitation Visa and Permit SA Town ABC LTD PTY SA Processing Visa VFS XYZ123456 receipt cum Invoice 14:35 Time : 31/7/2023 JOHN DOE Mr. ABC123456 Category: VISA Visa Residence Temporary Visa Renewal - TRV Category: VISA Sub S Visa Person Retired Category: Type Sub section 425.00 ZAR Fee: VISA 185.00 ZAR Fee: PCC 00 1550. ZAR fee: Service 20 ZAR Fee: SMS 15% VAT: 425. ZAR Vatable): Non Fees Visa ( Total 00 1550 ZAR VAT): (Including Fees VFS Total
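One way to narrow this down would be to run the same patterns against the OCR text outside UiPath. The sketch below is a standalone console program (the module name and the choice of four patterns are only for illustration) that prints whether each pattern matches and what it captures.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexCheck
    Sub Main()
        ' OCR text copied verbatim from this post.
        Dim extractedText As String = "Centre, Facilitation Visa and Permit SA Town ABC LTD PTY SA Processing Visa VFS XYZ123456 receipt cum Invoice 14:35 Time : 31/7/2023 JOHN DOE Mr. ABC123456 Category: VISA Visa Residence Temporary Visa Renewal - TRV Category: VISA Sub S Visa Person Retired Category: Type Sub section 425.00 ZAR Fee: VISA 185.00 ZAR Fee: PCC 00 1550. ZAR fee: Service 20 ZAR Fee: SMS 15% VAT: 425. ZAR Vatable): Non Fees Visa ( Total 00 1550 ZAR VAT): (Including Fees VFS Total"

        ' A few of the patterns from the Invoke Code activity, unchanged.
        Dim patterns As String() = {
            "Centre, ([^\n]+)",
            "(Mr\.|Ms\.|Mrs\.) ([A-Za-z]+)",
            "VAT\d+",
            "Sub Type Category: ([A-Za-z0-9 \(\)\-]+)"
        }

        ' Report match success and the matched text for each pattern.
        For Each pattern As String In patterns
            Dim m As Match = Regex.Match(extractedText, pattern)
            Console.WriteLine("{0} -> Success={1}, Value=[{2}]", pattern, m.Success, m.Value)
        Next
    End Sub
End Module
```

If some patterns report a match here while the workflow still prints blanks, the problem may be in how values leave the Invoke Code activity rather than in the regex itself. For example, variables declared with `Dim` inside the activity are local to it and would not reach the workflow unless they are assigned to Out arguments. That is a guess on my part and not something I have confirmed. Patterns that fail here as well, such as `VAT\d+` (the text never has digits directly after "VAT"), would need to be rewritten for the word order the OCR actually produces.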
I hope this is clear and includes enough detail.
Thank you in advance for your assistance!
Best regards