Not able to extract specific data from PDF

How do I amend the regex expressions below to extract both the invoice number and the account number from a PDF (not scanned)?

System.Text.RegularExpressions.Regex.Match(strInvoiceText, "(?<=Tax Invoice No : )(\d+)").Value

System.Text.RegularExpressions.Regex.Match(strInvoiceText, "(?<=Customer ID : )(\d+)").Value

I used the .NET Regex Tester (Regex Storm) for reference, and there the expressions picked up the specific data I wanted from the PDF text.

Text as follows:

Singapore 408942 Registration No : 199802208C GST Registration No : M9-0005650-C Tax Invoice No : 8000010691 Invoice Date : 25.08.2020 Sales Order No : 3600011123 Customer ID : 1000000464 Payment Terms : 30 days Due Date : 24.09.2020 Page No : Page 1 of 1
Qty Unit Price Amount SGD
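
As a quick sanity check outside UiPath, the two lookbehind patterns can be tried against the sample text above. This is only a minimal sketch in Python, used as a stand-in for the .NET `Regex.Match(...).Value` calls; the helper name `extract_after` is hypothetical and not part of the original workflow.

```python
import re

# Sample text copied from the post above (shortened to the relevant fields).
invoice_text = (
    "Tax Invoice No : 8000010691 Invoice Date : 25.08.2020 "
    "Sales Order No : 3600011123 Customer ID : 1000000464 "
    "Payment Terms : 30 days"
)


def extract_after(label: str, text: str) -> str:
    """Return the digits that directly follow `label`, or "" if the label is absent.

    Hypothetical helper: it mirrors Regex.Match(...).Value, which also yields
    an empty string when nothing matches.
    """
    match = re.search(r"(?<=" + re.escape(label) + r")\d+", text)
    return match.group(0) if match else ""


invoice_no = extract_after("Tax Invoice No : ", invoice_text)
customer_id = extract_after("Customer ID : ", invoice_text)
print(invoice_no, customer_id)  # 8000010691 1000000464
```

If the same patterns return nothing inside the workflow, comparing the exact characters around the label in the extracted PDF text with the typed pattern may be a reasonable next step.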

Hi @Justine ,

Do you want a single regex expression that can extract both values?

If so, is there a specific reason you want to do it this way?

Hi @supermanPunch, I am still new to UiPath and wasn't aware that a single regex expression could do this. I am fine with either way. I tried using the Matches activity to extract the two values, and it did populate them, but my text differs from invoice to invoice.

Therefore, I am not sure why the regex expression is not picking up the values this time round…

@Justine Your Regex Expressions are good. Are you facing any errors?
Kindly share more information about your requirement.

@Justine ,

We would also need to check different samples and prepare the regex so that it accepts all the formats available.

Is the number of formats in which the text appears unknown (can it appear in any format), or is it finite? Is there always a key-value pair that we can look for?

The above points need to be considered when the format of the text varies.

Let us know if you can check the above on your own and provide us with the necessary unique formats that you may receive, so that we can work on creating the appropriate regex pattern.

Hi @supermanPunch

Really appreciate the help :slight_smile:

There are no other formats, only 1 format, thankfully. As for now, both the Account No and the Invoice No are the key value pairs for extraction.

Hi @Gokul_Jayakumar, thank you for getting back to me. No errors are flagged when the workflow is executed.

I tried using the Matches activity and the regex expression works there, so I have no idea why it is not extracting anything when used in the Multiple Assign activity.

I added a Message Box to print out the whole invoice in text format, and it did print the whole invoice.

@Justine ,

Based on the statement you provided, we gave a suggestion. Let us know if the current regex is failing, and for which text data.

Your regex uses a specific label as an anchor, for example "invoice number:". If that exact label is present in the document, the regex extracts the value; otherwise it won't.
To cover that case, the pattern below lists the possible label variants separated by "|", which means "OR" in regex.

(?<=Tax Invoice No : )(\d+)|(?<=Tax Invoice no : )(\d+)|(?<=Tax Invoice Number : )(\d+)
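
To illustrate how the "|" alternation behaves, here is a minimal Python sketch that applies the same three lookbehinds to three label spellings. The sample strings other than the first are made up for illustration, and `find_invoice_no` is a hypothetical helper name.

```python
import re

# The three label spellings from the pattern above, each in its own lookbehind.
PATTERN = re.compile(
    r"(?<=Tax Invoice No : )\d+"
    r"|(?<=Tax Invoice no : )\d+"
    r"|(?<=Tax Invoice Number : )\d+"
)


def find_invoice_no(text: str) -> str:
    """Return the first invoice number found, or "" if no label variant is present."""
    match = PATTERN.search(text)
    return match.group(0) if match else ""


samples = [
    "Tax Invoice No : 8000010691 Invoice Date : 25.08.2020",
    "Tax Invoice no : 8000010691 Invoice Date : 25.08.2020",      # made-up variant
    "Tax Invoice Number : 8000010691 Invoice Date : 25.08.2020",  # made-up variant
    "Invoice #8000010691",  # made-up text with none of the labels
]
results = [find_invoice_no(s) for s in samples]
print(results)
```

Only text that contains one of the listed labels, character for character, produces a value; anything else yields an empty string.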

@Gokul_Jayakumar, I tried the regex expression you provided and am still unable to extract the invoice and account numbers.
I tried using Matches, and it didn't populate any values either…

@Justine try this

System.Text.RegularExpressions.Regex.Matches(strInvoiceText, "(?<=Tax Invoice No : )(\d+)|(?<=Customer ID : )(\d+)")(0) for the 1st match
System.Text.RegularExpressions.Regex.Matches(strInvoiceText, "(?<=Tax Invoice No : )(\d+)|(?<=Customer ID : )(\d+)")(1) for the 2nd match

Hi @Gokul_Jayakumar, do I use a Multiple Assign activity or a Matches activity for the regex expression?

Because when I use the regex expression provided in a Multiple Assign activity, it throws an error stating:

RemoteException wrapping System.InvalidOperationException: Can not assign 'system.text.RegularExpressions.Regex.Matches(strInvoiceText,“(?<=Tax Invoice No : )(\d+)|(?<=Customer ID : )(\d+)”)(1).value to ‘strInvoiceNumber’. —> RemoteException wrapping System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.Parameter name: i

It's better to use a separate Assign activity for each value.
If the data is available in the text, it returns the value; otherwise it results in an empty value. No error will occur.

I tried using a single Assign activity and pasted the formula:

System.Text.RegularExpressions.Regex.Matches(strInvoiceText, "(?<=Tax Invoice No : )(\d+)|(?<=Customer ID : )(\d+)")(0).ToString

It throws an error stating:
Assign: Specified argument was out of the range of valid values.
Parameter name: i

I tried excluding .ToString and got an error stating:

Assign: Object reference not set to an instance of an object

I searched the forum for other alternatives and tried adding a Matches activity followed by an Assign activity, but the same error still appeared: Assign: Object reference not set to an instance of an object.

Would you be able to share the workflow with me?

It means the value is not in the text file.

this is weird… but anyways, appreciate the assistance provided!

If you need to find the total number of matches available in the text document, try this:
System.Text.RegularExpressions.Regex.Matches(strInvoiceText, "(?<=Tax Invoice No : )(\d+)|(?<=Customer ID : )(\d+)").Count.ToString
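
The "out of the range of valid values" error reported earlier is what indexing a match collection with (0) or (1) produces when there are fewer matches than that. A minimal Python sketch of checking the count before indexing follows; `nth_match` is a hypothetical helper, and `re.findall` stands in for `Regex.Matches`.

```python
import re

PATTERN = r"(?<=Tax Invoice No : )\d+|(?<=Customer ID : )\d+"


def nth_match(text: str, index: int) -> str:
    """Return the match at `index`, or "" when there are not that many matches."""
    matches = re.findall(PATTERN, text)  # list of matched strings, possibly empty
    if 0 <= index < len(matches):
        return matches[index]
    return ""


text = "Tax Invoice No : 8000010691 Invoice Date : 25.08.2020 Customer ID : 1000000464"
print(len(re.findall(PATTERN, text)))  # number of matches, like .Count
print(nth_match(text, 0), nth_match(text, 1))
print(repr(nth_match("no labels here", 0)))  # empty string instead of an exception
```

A count of 0 here would indicate that neither label was found in the text exactly as typed in the pattern.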

Thanks @Gokul_Jayakumar. I managed to extract the specific data by using IndexOf instead of a regex expression.

Hi @Justine ,

If you were able to solve your issue, please do post the solution (a bit more in detail - an extended expression) or mark the appropriate suggestion post as solution, so that we could close the topic.

Hi @supermanPunch, alright, noted.

Below are the formulas used to extract both the invoice number and the account number using IndexOf in a Multiple Assign activity.

strInvoiceText.Substring(strInvoiceText.IndexOf("Invoice No ")+"Invoice No ".Length,1302).Split(Environment.NewLine.ToCharArray)(0).Trim.Replace(":","")

strInvoiceText.Substring(strInvoiceText.IndexOf("Customer ID")+"Customer ID".Length,1866).Split(Environment.NewLine.ToCharArray)(0).Trim.Replace(":","")

I watched this informative video to understand how to use IndexOf:
Extract Specific text from string
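
For readers who want to see the IndexOf idea step by step, here is a minimal Python sketch of the same approach: find the label, take the rest of that line, and strip the colon. It assumes each field sits on its own line in the extracted text (the sample below is arranged that way for illustration), and `value_after` is a hypothetical helper name. Unlike the formulas above, it takes everything after the label instead of a fixed length such as 1302 or 1866, and it trims once more after removing the colon.

```python
def value_after(label: str, text: str) -> str:
    """Return the text that follows `label` on the same line, without the colon."""
    start = text.find(label)  # like IndexOf: -1 when the label is absent
    if start == -1:
        return ""
    rest = text[start + len(label):]
    lines = rest.splitlines()
    first_line = lines[0] if lines else ""
    return first_line.strip().replace(":", "").strip()


# Assumed layout: one field per line, as a PDF text extraction might return it.
invoice_text = (
    "Tax Invoice No : 8000010691\n"
    "Invoice Date : 25.08.2020\n"
    "Customer ID : 1000000464\n"
    "Payment Terms : 30 days"
)
invoice_no = value_after("Invoice No ", invoice_text)
customer_id = value_after("Customer ID", invoice_text)
print(invoice_no, customer_id)  # 8000010691 1000000464
```

Returning an empty string when the label is missing avoids an exception from taking a substring at an invalid position.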