I need to extract some text from a PDF extract…
Below is the sample text:
The quick brown fox jumped over the lazy dog
ABC 02 1234 1234567 00 Y N N N 38 2468 1234567 00
ABC 02 1234 1234567 01 Y Y N N 38 2468 1234567 02
ABC 02 1234 1234567 03 N N Y N 38 2468 1234567 04
Additional Comments (if any)
In the sample text above, I want to get the 3 lines that contain the characters ‘ABC’.
From those lines, I want to extract further text so that each line is subdivided into the following groups:
Text ABC will be saved as 1 group
Text beginning with 02 will be saved as 1 group
Y - 1 group
N - 1 group
N - 1 group
Y - 1 group
Text beginning with 38 will be saved as 1 group
How do I bundle up the numbers after ABC, e.g. ‘02 1234 1234567 00’ together? Same with the last set of numbers, e.g. ‘38 2468 1234567 00’?
Also, in my extract I noticed that the second number does not come out correctly: instead of ABC 02 1234 1234567 01, it appears as ABC 021234 1234567 01, with the space between 02 and 1234 missing.
@redanime94, do we have a pattern for each of the groups that you want to extract?
For example, after ABC there will always be a 2-digit number, after the 2-digit number there will be a 4-digit number, and so on.
If we know the exact pattern of the groups to be extracted, we may be able to separate them even when they are fused together; otherwise it wouldn’t be possible.
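If the fixed-width pattern suggested above did hold, the grouping could be sketched roughly as follows. This is only an illustration in Python (the thread does not name a language), and the digit widths (2, 4, 7, 2) are an assumption read off the sample lines, not something confirmed for the real documents. The names `NUM`, `LINE_RE` and `parse_line` are hypothetical.

```python
import re

# Assumed layout, taken from the sample only: a 2-, 4-, 7- and 2-digit run.
# \s* between the first two numbers tolerates the fused form "021234".
NUM = r"(\d{2})\s*(\d{4})\s+(\d{7})\s+(\d{2})"

LINE_RE = re.compile(
    r"^(ABC)\s+" + NUM
    + r"\s+([YN])\s+([YN])\s+([YN])\s+([YN])\s+"
    + NUM + r"\s*$"
)

def parse_line(line):
    """Return [tag, first numbers, 4 flags..., last numbers], or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    g = m.groups()
    first = " ".join(g[1:5])   # re-join with single spaces, repairing "021234"
    last = " ".join(g[9:13])
    return [g[0], first, g[5], g[6], g[7], g[8], last]

print(parse_line("ABC 021234 1234567 01 Y Y N N 38 2468 1234567 02"))
```

Because each number is captured separately and re-joined, a fused "021234" comes back as "02 1234". A line whose digit widths differ from the assumed ones simply returns None.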
The pattern can be random, as the input comes from a scanned document converted to a PDF. I was actually able to get it working, but it’s not a straightforward solution: I used regex to extract the lines I need, then used Substring to get the details within those lines of text.
So it’s all sorted for me. Thanks for the initial suggestion, I was able to achieve my goal.
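For anyone landing here later, the regex-then-substring approach described above might look something like the sketch below. It is not the poster's actual code, which was not shared. It is written in Python for illustration and assumes that the four Y/N flags always appear as separate single letters, so they can serve as the anchor between the two number runs. The names `abc_lines`, `split_line` and `FLAGS_RE` are hypothetical.

```python
import re

SAMPLE = """The quick brown fox jumped over the lazy dog
ABC 02 1234 1234567 00 Y N N N 38 2468 1234567 00
ABC 02 1234 1234567 01 Y Y N N 38 2468 1234567 02
ABC 02 1234 1234567 03 N N Y N 38 2468 1234567 04
Additional Comments (if any)"""

# Assumption: the flags are four standalone Y/N letters in a row.
FLAGS_RE = re.compile(r"(?:\b[YN]\b\s*){4}")

def abc_lines(text):
    """Step 1: use a regex to keep only the lines that start with ABC."""
    return re.findall(r"^ABC\b.*$", text, flags=re.MULTILINE)

def split_line(line):
    """Step 2: slice the line around the flags instead of matching digit widths."""
    m = FLAGS_RE.search(line)
    if m is None:
        return None
    head = line[:m.start()].strip()
    tag, first = head[:3], head[3:].strip()   # "ABC", then the first number run
    flags = m.group(0).split()                # the four Y/N flags
    last = line[m.end():].strip()             # everything after the flags
    return [tag, first, *flags, last]

for line in abc_lines(SAMPLE):
    print(split_line(line))
```

Slicing around the flags keeps each number run together as one group whatever its internal spacing, so a fused value such as "021234" is carried through unchanged rather than rejected.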