TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 4:01am
1
Hi everybody,
I’m scraping information from a sheet.
I have managed to write RegEx expressions for getting most of the information I need.
However, I am facing trouble getting values from the scraped text.
The RegEx pattern below doesn’t work well:
(?<=[0-9A-Za-z]{4}.[0-9A-Za-z]{8}.\d.[0-9]{10}\s\S{1,}\s[^0-9]{0,})([0-9]{1,})
The sample text is below (Bold letters are the values which I want to get):
ATW222 PAGE 25
INVOICE N0 : BZ9999TW99 COUNTRY
9999-99999999-9 9968009999 géégouTYPE 20 CODE SCANNER 99 PCS. TWD 9,997.91 TWD 999,999.30 FRANCE
7131-99999999-2 9999999999 SASS SUB—ASSY 10 PCS. 123.33 1,223.30 FRANCE
7030-99999999-2 9999999999 SEES; 10 SET. 16.13 162.10 FRANCE
7030-99999999-2 9999999999 PANE: ASSY 3 PCS. 1,523.22 7,777.77 FRANCE
7030—99999999-2 1234567891 EERNINAL SUB-ASSY 10 PCS. 300.33 3,033.33 FRANCE
7030-99999999—2 1234567891 gliggER 60 SET. 16.25 915.10 CHINA
7030-99999999-2 1234567891 gASE SUB-ASSY 60 PCS. 223.35 12,808.00 KOREA
7030-99999999—2 1234567891 glngER 40 PCS. 19.25 510.00 FRANCE
7030-99999999-2 1234567891 €28: SUB-ASSY 40 PGS. 233.45 8,848.00 ITALY
7030-99999999-2 1234567891 SIEEUIT ASSY 100 PCS. 915.79 99,479.00 JAPAN
7030-99999999-2 1234567891 PIECE(TAPE) 100 PCS. 9.19 919.00 SPAIN
Parts
TOTAL 999 PCS. TWD 999,999.99
Do you have any idea how to get the values?
sarathi125
(Parthasarathi)
February 16, 2023, 4:27am
2
@TakeshiC ,
Check with this Regex, it is matching most of the expected values.
(?i)(\d+)\s*(?=PCS|PGS|SET)
The (?i) makes the match case-insensitive.
Check this once
System.Text.RegularExpressions.Regex.Match(txt, "\d{2,3}\s(?=PCS|SET)").ToString
Cheers
Hi @TakeshiC
Check out this expression
System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\sPCS|\sPGS|\sSET)").ToString
Regards
Sudharsan
TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 5:28am
5
@sarathi125 , @Manju_Reddy_Kanughula , @Sudharsan_Ka
Thank you for all your replies,
But I think the Regex patterns you suggested won’t work well in my case,
because PCS, PGS or SET is not fixed. Depending on the document, it can change to GAL, DOZ, PKG or something else.
Do you know the list? If you do, you can add the words to the pattern like this:
\d+(?=\sPCS|\sPGS|\sSET|\sGAL|\sDOZ|\sPKG)
Use the pipe " | " to add whatever words you need.
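As a rough illustration of the alternation idea above, here is a minimal sketch in Python (not the UiPath/VB.NET syntax used in this thread; the pattern itself is the same). The variable names `units`, `pattern` and `quantities` are only for this example, and the two input lines are copied from the sample text in the first post.

```python
import re

# Two lines copied from the sample text in the first post.
sample = (
    "7030-99999999-2 9999999999 SEES; 10 SET. 16.13 162.10 FRANCE\n"
    "7030-99999999-2 9999999999 PANE: ASSY 3 PCS. 1,523.22 7,777.77 FRANCE\n"
)

# Known unit words, joined with "|" inside one lookahead.
units = ["PCS", "PGS", "SET", "GAL", "DOZ", "PKG"]
pattern = r"\d+(?=\s(?:" + "|".join(units) + r"))"

# Each match is the number that sits directly before a unit word.
quantities = re.findall(pattern, sample)
print(quantities)
```

This only works for words that are already in the list, so a new unit word means editing the list.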
TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 5:38am
8
@Sudharsan_Ka @Manju_Reddy_Kanughula
I know the current list, but I don’t know which word will appear next time.
I only know that it is an English measure word in upper case.
If you know the list, you just need to add those words to the pattern, like this.
You need to separate each of them with “|\s”:
System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\sPCS|\sPGS|\sSET|\sGAL|\sDOZ|\sPKG)").ToString
Check this also @TakeshiC
System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\s[A-Z]{3}\.)").ToString
This pattern will match whatever the measure word is. Most probably the word will be three capital letters followed by a dot, I think, so you can try this:
\d+(?=\s[A-Z]{3}\.)
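To see what this generic pattern picks up, here is a small sketch in Python (the pattern is unchanged from the one above). The first two lines come from the sample text; the third is the PANE line with its unit changed to GAL, which is my own assumption to show an unlisted unit.

```python
import re

sample = (
    "9999-99999999-9 9968009999 géégouTYPE 20 CODE SCANNER 99 PCS. TWD 9,997.91 TWD 999,999.30 FRANCE\n"
    "7030-99999999-2 1234567891 €28: SUB-ASSY 40 PGS. 233.45 8,848.00 ITALY\n"
    # Hypothetical line: unit changed to GAL for illustration only.
    "7030-99999999-2 9999999999 PANE: ASSY 3 GAL. 1,523.22 7,777.77 FRANCE\n"
)

# A number followed by whitespace, exactly three capital letters and a dot.
pattern = r"\d+(?=\s[A-Z]{3}\.)"

quantities = re.findall(pattern, sample)
print(quantities)
```

The "20" in the first line is skipped because "CODE" is four letters and has no dot, and the TWD amounts are skipped because no dot follows "TWD". A unit word that is not exactly three capital letters would be missed.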
Regards
Sudharsan
Hello @TakeshiC
Try this Regex Expression.
(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])
System.Text.RegularExpressions.Regex.Matches(YourInput,"(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])")
Use the above expression in a For Each and loop over it; inside the loop you can read each value.
TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 6:02am
12
@Sudharsan_Ka @Gokul_Jayakumar
Thank you all.
Your Regex patterns almost work.
There is just one problem left:
they also match the number (in bold) in the last line:
TOTAL 999 PCS. TWD 999,999.99
If I don’t want to match this number, do you have any idea how to correct the Regex pattern?
@TakeshiC
Initially use this
YourInput=System.Text.RegularExpressions.Regex.Replace(YourInput,"(?>TOTAL)[\D\d\s\n].*","").ToString
Later use this
System.Text.RegularExpressions.Regex.Matches(YourInput,"(?<=[A-Za-z\s])\d{1,4}(?=[A-Za-z\s]+[.])")
@TakeshiC
Have you used the expression I sent you? That expression removes the last line by splitting on "Parts":
System.Text.RegularExpressions.Regex.Match(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\s[A-Z]{3}\.)").ToString
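The effect of the split can be sketched like this in Python (same idea as Regex.Split(...)(0) above: keep only the text before "Parts"). The names `body`, `with_total` and `without_total` are just for the example; the input lines are from the sample text.

```python
import re

text = (
    "7030-99999999-2 1234567891 SIEEUIT ASSY 100 PCS. 915.79 99,479.00 JAPAN\n"
    "7030-99999999-2 1234567891 PIECE(TAPE) 100 PCS. 9.19 919.00 SPAIN\n"
    "Parts\n"
    "TOTAL 999 PCS. TWD 999,999.99\n"
)

pattern = r"\d+(?=\s[A-Z]{3}\.)"

# Without the split, the TOTAL line is matched as well.
with_total = re.findall(pattern, text)

# Keep only the text before the word "Parts", then search that part.
body = re.split("Parts", text)[0]
without_total = re.findall(pattern, body)

print(with_total)
print(without_total)
```

This relies on "Parts" appearing once, right before the total line.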
TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 6:31am
15
Your Regex pattern seems to work well,
but how do I get each value using this pattern?
Check this out, @TakeshiC
System.Text.RegularExpressions.Regex.Matches(System.Text.RegularExpressions.Regex.Split(InputString, "Parts")(0).ToString,"\d+(?=\s[A-Z]{3}\.)")
Use this expression in a For Each; inside the loop you can read each match.
I have updated the expression, so check with this one, @TakeshiC
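For reference, the loop over all matches looks roughly like this in Python, where `re.finditer` plays the role of Regex.Matches and the `for` loop plays the role of For Each. Converting each value to an integer is my own addition, not something asked for in the thread.

```python
import re

text = (
    "7030-99999999-2 9999999999 SEES; 10 SET. 16.13 162.10 FRANCE\n"
    "7030-99999999-2 1234567891 gASE SUB-ASSY 60 PCS. 223.35 12,808.00 KOREA\n"
    "7030-99999999-2 1234567891 SIEEUIT ASSY 100 PCS. 915.79 99,479.00 JAPAN\n"
)

quantities = []
# Each iteration yields one match object, like one item in a For Each.
for match in re.finditer(r"\d+(?=\s[A-Z]{3}\.)", text):
    quantities.append(int(match.group()))

print(quantities)
```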
TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 6:56am
18
ohhhh
It works well!
Then, I’m sorry, I have another question.
If “Parts” and “TOTAL” are both changeable words, and I only know that “Parts” is a 5-letter word (one uppercase letter followed by four lowercase letters) and “TOTAL” is a 5-letter word in upper case, do you have any idea how to correct the pattern?
TakeshiC:
If “Parts” and “TOTAL” are both changeable words, and I only know that “Parts” is a 5-letter word (one uppercase letter followed by four lowercase letters) and “TOTAL” is a 5-letter word in upper case, do you have any idea how to correct the pattern?
I don’t quite understand; could you elaborate, please?
If you know the words and only the letter case differs, what you can do is
System.Text.RegularExpressions.Regex.Matches(System.Text.RegularExpressions.Regex.Split(InputString, "Parts",System.Text.RegularExpressions.RegexOptions.IgnoreCase)(0).ToString,"\d+(?=\s[A-Z]{3}\.)")
TakeshiC
(Hsiuchu Chen A3n)
February 16, 2023, 7:20am
20
I ask this question because I guess you use “Parts” to remove the last line.
I just want to know what to do if “Parts” and “TOTAL” are like PCS, PGS or SET, as I said previously:
I don’t know which word will appear next time.
So in this case, is there any way to handle it?
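One possible way around the unknown footer words, offered only as a sketch: instead of cutting the text at "Parts", match only lines that begin with the part-number and 10-digit code shown in the sample, so lines such as "Parts" or "TOTAL ..." never match whatever they say. This assumes every item line starts that way. The example is in Python; the footer words "Items" and "GRAND" are made up for the test, and in .NET the value would be read from the first capture group of each match.

```python
import re

text = (
    # First line uses the OCR-damaged long dash seen in the sample text.
    "7030—99999999-2 1234567891 EERNINAL SUB-ASSY 10 PCS. 300.33 3,033.33 FRANCE\n"
    "7030-99999999-2 1234567891 PIECE(TAPE) 100 PCS. 9.19 919.00 SPAIN\n"
    # Hypothetical footer words standing in for "Parts" and "TOTAL".
    "Items\n"
    "GRAND 999 PCS. TWD 999,999.99\n"
)

# Line start: 4 chars, separator, 8 chars, separator, 1 digit, space,
# 10 digits. \W accepts both "-" and the long dash. Then capture the
# first number that is followed by a three-letter unit and a dot.
pattern = re.compile(
    r"^\w{4}\W\w{8}\W\d\s\d{10}\s.*?\s(\d+)\s[A-Z]{3}\.",
    re.MULTILINE,
)

quantities = pattern.findall(text)
print(quantities)
```

The footer lines are rejected because their fifth character is a letter, not a separator, so nothing about their wording needs to be known in advance.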