Getting text from a PDF extract

Hi,

I need to be able to extract some text within a PDF extract…

Below is the sample text:

The quick brown fox jumped over the lazy dog
ABC 02 1234 1234567 00 Y N N N 38 2468 1234567 00
ABC 02 1234 1234567 01 Y Y N N 38 2468 1234567 02
ABC 02 1234 1234567 03 N N Y N 38 2468 1234567 04
Additional Comments (if any)

In the above sample text, I want to get the 3 lines that contain the characters ‘ABC’.

From each of those lines I then want to extract the text into the following groups:

  1. Text ABC will be saved as 1 group
  2. Text where it begins with 02 will be saved as 1 group
  3. Y - 1 group
  4. N - 1 group
  5. N - 1 group
  6. Y - 1 group
  7. Text where it begins with 38 will be saved as 1 group

Thanks.

How can I accomplish this, please?

Hi @redanime94 ,

Could you check with the regex below:

(.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\w)\s+(\w)\s+(\w)\s+(\w)\s+(\d+)
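
For anyone who wants to try the expression outside the workflow, here is a minimal Python sketch that runs it against the sample text from the question. The names `SAMPLE`, `PATTERN` and `extract_rows` are made up for illustration, and the `'ABC' in line` filter is an assumption about how the wanted lines are picked out.

```python
import re

# Sample text copied from the question.
SAMPLE = """The quick brown fox jumped over the lazy dog
ABC 02 1234 1234567 00 Y N N N 38 2468 1234567 00
ABC 02 1234 1234567 01 Y Y N N 38 2468 1234567 02
ABC 02 1234 1234567 03 N N Y N 38 2468 1234567 04
Additional Comments (if any)"""

# The expression suggested above, unchanged.
PATTERN = re.compile(
    r"(.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\w)\s+(\w)\s+(\w)\s+(\w)\s+(\d+)"
)

def extract_rows(text):
    """Return one tuple of captured groups per line that contains 'ABC'."""
    rows = []
    for line in text.splitlines():
        if "ABC" not in line:
            continue  # skip the heading and the comments line
        match = PATTERN.match(line)
        if match:
            rows.append(match.groups())
    return rows

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Each number lands in its own group, and the expression stops after the first number that follows the four flags, so the rest of the line is not captured.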

Hi,

Thanks for that. It partly works…

How do I bundle up the numbers after ABC, e.g. ‘02 1234 1234567 00’ together? Same with the last set of numbers, e.g. ‘38 2468 1234567 00’?

Also, in my extract I noticed that the second number is not always spaced correctly. Instead of ABC 02 1234 1234567 01 it appears as ABC 021234 1234567 01.

Thanks.
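
One possible way to bundle each run of numbers into a single group is to capture "digits, optionally followed by more whitespace-separated digits" as one unit. The sketch below is in Python and assumes the flags are always the letters Y or N; `BUNDLED` and `bundle` are hypothetical names, not part of any library.

```python
import re

# (\d+(?:\s+\d+)*) captures a whole run of space-separated numbers as ONE group.
# Assumption: the four flags are always Y or N.
BUNDLED = re.compile(
    r"^(ABC)\s+(\d+(?:\s+\d+)*)"
    r"\s+([YN])\s+([YN])\s+([YN])\s+([YN])"
    r"\s+(\d+(?:\s+\d+)*)\s*$"
)

def bundle(line):
    """Return (prefix, left numbers, 4 flags, right numbers) or None."""
    match = BUNDLED.match(line.strip())
    return match.groups() if match else None

print(bundle("ABC 02 1234 1234567 00 Y N N N 38 2468 1234567 00"))
print(bundle("ABC 021234 1234567 01 Y Y N N 38 2468 1234567 02"))
```

Because the number run is captured as a whole, a line where two numbers have been merged (such as 021234) still matches; the merged text is simply kept as it appears.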

@redanime94, do we have a pattern for each of the groups that you want to extract?

For example: after ABC there will always be a 2-digit number, after the 2-digit number there will be a 4-digit number, and so on.

If we know the exact pattern for each group to be extracted, we may be able to separate the groups even when they are merged together. Otherwise it wouldn’t be possible.
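
To illustrate the idea, here is a small Python sketch that assumes fixed widths of 2, 4, 7 and 2 digits. Those widths are only read off the sample lines and may not hold for real documents; `FIXED` and `normalise_numbers` are hypothetical names.

```python
import re

# Assumed widths (taken from the sample only): 2, 4, 7 and 2 digits.
# \s* between the parts lets the pattern match even when a space is missing.
FIXED = re.compile(r"^(\d{2})\s*(\d{4})\s*(\d{7})\s*(\d{2})$")

def normalise_numbers(numbers):
    """Re-insert single spaces between the fixed-width parts, or return None."""
    match = FIXED.match(numbers.strip())
    return " ".join(match.groups()) if match else None

print(normalise_numbers("021234 1234567 01"))   # merged input
print(normalise_numbers("02 1234 1234567 01"))  # already well spaced
```

If the widths vary from document to document, this approach has nothing to anchor on and the function returns None.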

Hi,

The pattern can be random, as the input comes from a scanned document converted to a PDF. I was actually able to get it working, but it’s not a straightforward solution. I used regex to extract the lines I need, then used Substring to get the details within each line of text.

So it’s all sorted for me. Thanks for the initial suggestion, I was able to achieve my goal.
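
The poster's actual workflow is not shown in the thread. As a rough sketch of the "regex for the line, then substring for the parts" approach, the Python below uses the four single-letter flags as an anchor and slices the text on either side of them. `FLAGS` and `split_line` are hypothetical names, and treating the flags as always Y or N is an assumption.

```python
import re

# Assumption: the four flags are always Y or N, each separated by whitespace.
FLAGS = re.compile(r"\s([YN])\s+([YN])\s+([YN])\s+([YN])\s")

def split_line(line, prefix="ABC"):
    """Slice a line into prefix, left numbers, flags and right numbers."""
    start = line.find(prefix)
    if start == -1:
        return None  # not one of the wanted lines
    flags = FLAGS.search(line, start + len(prefix))
    if flags is None:
        return None  # no run of four flags found
    # Plain substring operations on either side of the flags.
    left = line[start + len(prefix):flags.start()].strip()
    right = line[flags.end():].strip()
    return {"prefix": prefix, "left": left,
            "flags": list(flags.groups()), "right": right}

print(split_line("ABC 021234 1234567 01 Y Y N N 38 2468 1234567 02"))
```

Since the numbers are taken by position rather than matched digit by digit, irregular spacing inside them does not stop the line from being split.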
