How to get value by Regex match

Hi everybody,
I’m scraping information from a sheet.
I have managed to write RegEx expressions for getting most of the information I need.
However, I am having trouble getting some values from the scraped text.

The RegEx pattern below doesn’t work well:

(?<=[0-9A-Za-z]{4}.[0-9A-Za-z]{8}.\d.[0-9]{10}\s\S{1,}\s[^0-9]{0,})([0-9]{1,})

The sample text is below. The values I want to get are the quantities in front of the unit word (99, 10, 10, 3, ... in the rows below):

ATW222 PAGE 25
INVOICE N0 : BZ9999TW99 COUNTRY
9999-99999999-9 9968009999 géégouTYPE 20 CODE SCANNER 99 PCS. TWD 9,997.91 TWD 999,999.30 FRANCE
7131-99999999-2 9999999999 SASS SUB—ASSY 10 PCS. 123.33 1,223.30 FRANCE
7030-99999999-2 9999999999 SEES; 10 SET. 16.13 162.10 FRANCE
7030-99999999-2 9999999999 PANE: ASSY 3 PCS. 1,523.22 7,777.77 FRANCE
7030—99999999-2 1234567891 EERNINAL SUB-ASSY 10 PCS. 300.33 3,033.33 FRANCE
7030-99999999—2 1234567891 gliggER 60 SET. 16.25 915.10 CHINA
7030-99999999-2 1234567891 gASE SUB-ASSY 60 PCS. 223.35 12,808.00 KOREA
7030-99999999—2 1234567891 glngER 40 PCS. 19.25 510.00 FRANCE
7030-99999999-2 1234567891 €28: SUB-ASSY 40 PGS. 233.45 8,848.00 ITALY
7030-99999999-2 1234567891 SIEEUIT ASSY 100 PCS. 915.79 99,479.00 JAPAN
7030-99999999-2 1234567891 PIECE(TAPE) 100 PCS. 9.19 919.00 SPAIN
Parts
TOTAL 999 PCS. TWD 999,999.99

Do you have any idea how to get these values?

@TakeshiC,

Try this regex; it matches most of the expected values.

(?i)(\d+)\s(?=PCS|PGS|SET)

The (?i) makes the match case-insensitive.

Try this one:

System.Text.RegularExpressions.Regex.Match(txt, "\d{2,3}\s(?=PCS|SET)").ToString


Cheers

Hi @TakeshiC

Check out this expression:

System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\sPCS|\sPGS|\sSET)").ToString

Regards
Sudharsan

@sarathi125 , @Manju_Reddy_Kanughula, @Sudharsan_Ka
Thank you for all your replies.
But I don’t think these Regex patterns will work well in my case,
because PCS, PGS or SET is not fixed. Depending on the item, it may change to GAL, DOZ, PKG or something else.

Do you know the list? If you do, you can add the words to the pattern like this:

\d+(?=\sPCS|\sPGS|\sSET|\sGAL|\sDOZ|\sPKG)

Use the pipe character "|" to add whatever other words you need.
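As a rough illustration of the alternation idea (sketched in Python rather than in a UiPath expression, since the lookahead syntax is the same), the list of unit words can be joined with "|" when building the pattern. The `UNITS` list is only an assumption here; the row is taken from the sample text.

```python
import re

# Assumed list of known unit words; extend it when a new one shows up.
UNITS = ["PCS", "PGS", "SET", "GAL", "DOZ", "PKG"]

# Builds \d+(?=\s(?:PCS|PGS|SET|GAL|DOZ|PKG)): digits followed by
# whitespace and one of the unit words (the unit itself is not consumed).
pattern = re.compile(r"\d+(?=\s(?:" + "|".join(UNITS) + r"))")

line = "7030-99999999-2 9999999999 PANE: ASSY 3 PCS. 1,523.22 7,777.77 FRANCE"
print(pattern.findall(line))  # ['3']
```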

@Sudharsan_Ka @Manju_Reddy_Kanughula

I know the current list, but I don’t know which word will appear next time.
I only know that it is an English measure word in upper case.

If you know the list, you just need to add those words to the pattern, like this.

You need to separate each of them with “|\s”:

System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\sPCS|\sPGS|\sSET|\sGAL|\sDOZ|\sPKG)").ToString

Also check this, @TakeshiC:

System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\s[A-Z]{3}\.)").ToString

This pattern should match whatever the measure word is. I think the word will most probably be three capital letters followed by a dot, so you can try this:

\d+(?=\s[A-Z]{3}\.)
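To see how this generic pattern behaves, here is a small sketch in Python (the lookahead works the same way there). The first two rows come from the sample text; the GAL row is made up to stand in for a unit word that is not known in advance.

```python
import re

rows = [
    "7030-99999999-2 9999999999 SEES; 10 SET. 16.13 162.10 FRANCE",
    "7030-99999999-2 1234567891 €28: SUB-ASSY 40 PGS. 233.45 8,848.00 ITALY",
    # Made-up row with a unit word that is not in any fixed list.
    "7030-99999999-2 1234567891 COOLANT 5 GAL. 12.00 60.00 JAPAN",
]

# Digits followed by whitespace, exactly three capital letters and a dot.
pattern = re.compile(r"\d+(?=\s[A-Z]{3}\.)")

quantities = [m for row in rows for m in pattern.findall(row)]
print(quantities)  # ['10', '40', '5']
```

Note that this only works while the unit word really is three capital letters plus a dot; a unit like "BOX" would match, but "PAIR." would not.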

Regards
Sudharsan


Hello @TakeshiC
Try this Regex Expression.
(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])

System.Text.RegularExpressions.Regex.Matches(YourInput,"(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])")

Use the above expression in a For Each and loop over it. Inside the loop, item.ToString.Trim gives you each value.
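The looping idea, sketched in Python as a stand-in for the For Each activity (the variable names are only illustrative). Each match object plays the role of the loop item, and its text is the quantity.

```python
import re

your_input = "\n".join([
    "7030-99999999-2 1234567891 gASE SUB-ASSY 60 PCS. 223.35 12,808.00 KOREA",
    "7030-99999999-2 1234567891 PIECE(TAPE) 100 PCS. 9.19 919.00 SPAIN",
])

# Same pattern as above: 1-4 digits preceded by a letter or whitespace
# and followed by letters/spaces that end in a dot.
pattern = re.compile(r"(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])")

values = []
for match in pattern.finditer(your_input):
    # match.group() is the matched text, like item.ToString in UiPath.
    values.append(match.group().strip())

print(values)  # ['60', '100']
```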

@Sudharsan_Ka @Gokul_Jayakumar
Thank you all.

Your Regex patterns nearly work.
There is just one problem left:
they also pick up the quantity (999) in the last line.

TOTAL 999 PCS. TWD 999,999.99

If I don’t want to get this value, do you have any idea how to correct the Regex pattern?

@TakeshiC
First, use this to remove the TOTAL line:

YourInput=System.Text.RegularExpressions.Regex.Replace(YourInput,"(?>TOTAL)[\D\d\s\n].*","").ToString

Then use this:

System.Text.RegularExpressions.Regex.Matches(YourInput,"(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])")
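A quick sketch of this two-step approach in Python. The removal pattern is simplified to `TOTAL.*` (everything from TOTAL to the end of that line), which is an assumption about what the Replace above is meant to do.

```python
import re

your_input = "\n".join([
    "7030-99999999-2 9999999999 SEES; 10 SET. 16.13 162.10 FRANCE",
    "7030-99999999-2 1234567891 SIEEUIT ASSY 100 PCS. 915.79 99,479.00 JAPAN",
    "Parts",
    "TOTAL 999 PCS. TWD 999,999.99",
])

pattern = re.compile(r"(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])")

# Without the cleanup step, the TOTAL quantity is picked up as well.
before = pattern.findall(your_input)

# Step 1: drop everything from TOTAL to the end of its line.
cleaned = re.sub(r"TOTAL.*", "", your_input)
# Step 2: run the quantity pattern on the cleaned text.
after = pattern.findall(cleaned)

print(before)  # ['10', '100', '999']
print(after)   # ['10', '100']
```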

@TakeshiC

Have you tried the expression I sent you? It removes the last line by splitting the text at “Parts”:

System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\s[A-Z]{3}\.)").ToString

Your Regex pattern seems to work well,
but how do I get each value using this pattern?

Check this out, @TakeshiC:

System.Text.RegularExpressions.Regex.Matches(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\s[A-Z]{3}\.)")

Use this expression in a For Each; inside the loop you can read each match in turn.

I have updated the expression, so check with this one, @TakeshiC.
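For comparison, the split-then-match idea looks roughly like this in Python. Everything before the first "Parts" is kept, so the TOTAL line never reaches the quantity pattern.

```python
import re

input_string = "\n".join([
    "7030-99999999-2 9999999999 PANE: ASSY 3 PCS. 1,523.22 7,777.77 FRANCE",
    "7030-99999999-2 1234567891 €28: SUB-ASSY 40 PGS. 233.45 8,848.00 ITALY",
    "Parts",
    "TOTAL 999 PCS. TWD 999,999.99",
])

quantity = re.compile(r"\d+(?=\s[A-Z]{3}\.)")

# Keep only the text before the "Parts" marker (element 0 of the split).
body = re.split("Parts", input_string)[0]

all_matches = quantity.findall(input_string)  # includes the TOTAL quantity
row_matches = quantity.findall(body)          # rows only

print(all_matches)  # ['3', '40', '999']
print(row_matches)  # ['3', '40']
```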

Ohhhh,
it works well!
Sorry, I have one more question.
What if “Parts” and “TOTAL” can also change, and I only know that “Parts” is a five-letter word (one upper-case letter followed by four lower-case letters) and “TOTAL” is a five-letter word in upper case? Do you have any idea how to correct the pattern?

I don’t quite understand. Can you elaborate, please?

If you know the words and only the letter case differs, you can do this:

System.Text.RegularExpressions.Regex.Matches(System.Text.RegularExpressions.Regex.Split(InputString, "Parts",System.Text.RegularExpressions.RegexOptions.IgnoreCase)(0).ToString,"\d+(?=\s[A-Z]{3}\.)")

I ask this question because I guess you use “Parts” to remove the last line.
I just want to know what to do if “Parts” and “TOTAL” behave like PCS, PGS or SET that I mentioned earlier:
I don’t know which word will appear next time.

So in this case, is there anything I can do?
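One possible direction, which nobody in this thread has confirmed: split on the shape of the marker line instead of the literal word. The sketch below (in Python) assumes the marker always sits alone on its own line and is one upper-case letter followed by four lower-case letters. "Items" and "GRAND" are made-up stand-ins for "Parts" and "TOTAL".

```python
import re

text = "\n".join([
    "7030-99999999-2 1234567891 gASE SUB-ASSY 60 PCS. 223.35 12,808.00 KOREA",
    "7030-99999999-2 1234567891 PIECE(TAPE) 100 PCS. 9.19 919.00 SPAIN",
    "Items",                          # made-up stand-in for "Parts"
    "GRAND 999 PCS. TWD 999,999.99",  # made-up stand-in for "TOTAL"
])

# A whole line made of one upper-case letter plus four lower-case letters.
marker = re.compile(r"^[A-Z][a-z]{4}\s*$", re.MULTILINE)

# Keep only the text before the first marker line.
body = marker.split(text, maxsplit=1)[0]

quantities = re.findall(r"\d+(?=\s[A-Z]{3}\.)", body)
print(quantities)  # ['60', '100']
```

If the marker word can share a line with other text, this will not find it, and a different anchor would be needed, for example only accepting quantities on lines that start with the part-number format.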