Help with RegExp. (read from PDF)

Hi all,

I need some help with a regexp and hope you can help me.
The example rows below come from a PDF and have 4 columns.
The brackets are only there to show the columns:
[2019-09-20] [2019-10-26] [940 C] [65565656]
There is always exactly ONE space between columns.
I need to get the value “940 C”.
The values can also look like this:

  • 2019-09-20 2019-10-26 940 A65565656
    ** Should give the value “940”

  • 2019-09-20 2019-10-26 930 C 65565656
    ** Should give the value “930 C”

  • 2019-09-20 2019-10-26 940   B B655656CC
    ** Should give the value “940   B”

  • 2019-09-20 2019-10-26 B655656CC
    ** Here we have 3 columns. Bonus if this could give the value “” or empty, otherwise this can be handled afterwards.

What is a good RegExp I can use for this?

You can use the Substring function.

Are you set on regex? I think Split might be better here. First split by newline, then for each line you can split on spaces, ignore empty values, then grab index 2, check if index 3 contains exactly 1 character and append that to index 2 value if so.
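The split approach described above can be sketched as follows. This is a minimal illustration in Python rather than VB.NET, and the function name third_column_by_split is hypothetical. Returning an empty string for rows with fewer than 4 tokens is an added assumption to cover the three-column case from the question.

```python
def third_column_by_split(line: str) -> str:
    """Return column 3, plus a single-character suffix if one follows it."""
    # str.split() with no argument splits on runs of whitespace
    # and never yields empty tokens.
    tokens = line.split()
    # Assumption: a row with only two dates and one value has no column 3.
    if len(tokens) < 4:
        return ""
    value = tokens[2]
    # Append index 3 when it is exactly one character long.
    if len(tokens[3]) == 1:
        value += " " + tokens[3]
    return value


text = """2019-09-20 2019-10-26 940 A65565656
2019-09-20 2019-10-26 930 C 65565656
2019-09-20 2019-10-26 940   B B655656CC
2019-09-20 2019-10-26 B655656CC"""

# First split by newline, then handle each line on its own.
results = [third_column_by_split(line) for line in text.splitlines()]
```

Note that this variant normalises the run of spaces in the third row, so it yields "940 B" rather than "940   B".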

If you’re absolutely set on regex then I can help with an expression on that as well, but it will rely on you always getting 2 dates, and always having the dates in the yyyy-MM-dd format & it is a bit tougher to get the optional single character after the number.

Thanks for the quick reply. Will that work for all the examples that I have?
I ask because I don’t know the number of characters in this “third” column.

I’d still recommend using split, but below is a regex solution that would work. Get the value and use .Trim() to get rid of any excess spaces. Note that this regex cannot return an empty value for option 4. However, you could run the regex on each line individually and, if no match is found, use String.Empty instead.

(?<=\d{4}-\d{2}-\d{2}\s+\d{4}-\d{2}-\d{2}\s*)(?(\d+\s+.(?=\s))\d+\s+.(?=\s)|\d+)

Assumptions:
You are looking for digits only in column 3 (switch the \d+ with .+ or the specific characters within [square brackets] if that isn’t true)
The 2 dates always come in the format of 4 digits - 2 digits - 2 digits

This has been tested with the .NET Regex Tester at Regex Storm and works without issue on your sample.
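The per-line idea (run the regex on each line and fall back to an empty string when nothing matches) might look like the sketch below. It is written in Python, whose re module has neither variable-length lookbehind nor lookahead conditionals, so the expression is ported to a capture group with a plain alternation. The names ROW and third_column are hypothetical.

```python
import re

# Two yyyy-MM-dd dates, then either "digits + spaces + one character"
# (only when that character is followed by whitespace) or digits alone.
ROW = re.compile(
    r"\d{4}-\d{2}-\d{2}\s+\d{4}-\d{2}-\d{2}\s*"
    r"(\d+\s+.(?=\s)|\d+)"
)


def third_column(line: str) -> str:
    """Return the column 3 value of one row, or "" when there is no match."""
    match = ROW.match(line)
    # strip() plays the role of .Trim() in the .NET version.
    return match.group(1).strip() if match else ""
```

With the four rows from the question this returns "940", "930 C", "940   B" and "" respectively. Like the expression above, it assumes column 3 starts with digits.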


Thanks @Dave

I’ll try with split as well and see how that works.
Your RegEx seems to work fine but not for an example like this (where it will find “655656” which I don’t want):
2019-09-20 2019-10-26 655656

How can we add a condition so that the value we are looking for is not at the end of the line (i.e. not followed by $)?
Then I think we can have a solution with RegExp for this.

Good point, I’ll add that in and edit this post (give me a couple min)

(?<=\d{4}-\d{2}-\d{2}\s*\d{4}-\d{2}-\d{2}\s*)(?(\d+\s+.(?=\s))\d+\s+.(?=\s)|\d+\b)(?!$)

You must have the multiline option checked (or use RegexOptions.Multiline if doing it in an assign) in order for this to work
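As a rough illustration of why the multiline option matters: without it, $ only matches at the very end of the input, so an end-of-line check only affects the last row. The snippet below shows the same behaviour with Python's re.MULTILINE flag, used here as a stand-in for RegexOptions.Multiline. The two sample rows are made up.

```python
import re

text = "2019-09-20 2019-10-26 111111\n2019-09-20 2019-10-26 222222"

# Without the flag, "$" matches only at the end of the whole input.
end_of_input_only = re.findall(r"\d+$", text)

# With the flag, "$" also matches at the end of every line.
end_of_each_line = re.findall(r"\d+$", text, re.MULTILINE)

print(end_of_input_only)  # ['222222']
print(end_of_each_line)   # ['111111', '222222']
```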


Sorry, one more example is the following:
2019-09-20 2019-10-26 904C 655656

Where I would need “904C”

@P_S Haha, alright, I’ve edited again for the new requirements :)

(?<=\d{4}-\d{2}-\d{2}\s*\d{4}-\d{2}-\d{2}\s*)(?([0-9A-Za-z]+\s+.(?=\s))[0-9A-Za-z]+\s+.(?=\s)|[0-9A-Za-z]+\b)(?!$)

The multiline option must be used. This assumes the value you want to capture can contain any of the characters 0-9, A-Z, or a-z. It will grab column 3 plus the single character that follows it (if one exists), and it will not grab the final column.

I tested the below Input and received the following matches: 940, 930 C, 940 B, 904C

2019-09-20 2019-10-26 940 A65565656

2019-09-20 2019-10-26 930 C 65565656

2019-09-20 2019-10-26 940   B B655656CC

2019-09-20 2019-10-26 655656

2019-09-20 2019-10-26 904C 655656

2019-09-20 2019-10-26 B655656CC
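For readers without a .NET tester at hand, the run over this sample can be approximated in Python. This is a sketch, not the exact expression above: the lookbehind and the conditional are replaced by a capture group and an alternation, and [ \t] is used instead of \s so a match cannot spill onto the next line. SAMPLE, PATTERN and matches are hypothetical names.

```python
import re

SAMPLE = "\n\n".join([
    "2019-09-20 2019-10-26 940 A65565656",
    "2019-09-20 2019-10-26 930 C 65565656",
    "2019-09-20 2019-10-26 940   B B655656CC",
    "2019-09-20 2019-10-26 655656",
    "2019-09-20 2019-10-26 904C 655656",
    "2019-09-20 2019-10-26 B655656CC",
])

PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}[ \t]+\d{4}-\d{2}-\d{2}[ \t]+"
    # Value plus one trailing character, or the value alone ...
    r"([0-9A-Za-z]+[ \t]+.(?=[ \t])|[0-9A-Za-z]+\b)"
    # ... but never when the value is the last thing on the line.
    r"(?!$)",
    re.MULTILINE,
)

matches = [m.group(1) for m in PATTERN.finditer(SAMPLE)]
print(matches)  # ['940', '930 C', '940   B', '904C']
```

The rows ending in 655656 and B655656CC produce no match, and the internal spaces in the third row are kept as-is.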

Convert the line to a list of words,
then return list(2).

Thanks again @Dave! This seems to work fine! :)

Seems it could be easier, though:
"2019-09-20 2019-10-26 940 A65565656".Remove(0, 21).TrimStart.Split(" "c)(0)
