Extraction using the regex

I need to extract the total, but sometimes a subtotal appears too, like this:

Total: 567
Sub Total: 3481
subtotal 38813
sub Total 38813

I need to get only the total, not the subtotal.

Please help with a regex.

If your text appears exactly as you have provided it, this regex will work:

(?<=^Total: )\d+
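
As a rough illustration, here is a minimal Python sketch of this anchored lookbehind. The use of Python's re module and the variable names are assumptions for illustration only; the thread does not say which tool runs the regex. In Python the re.MULTILINE flag is needed so that ^ matches at the start of every line rather than only at the start of the text.

```python
import re

# Sample text taken from the question above.
text = """Total: 567
Sub Total: 3481
subtotal 38813
sub Total 38813"""

# With re.MULTILINE, ^ matches at the start of each line, so the
# lookbehind only accepts "Total: " when it begins a line.
pattern = re.compile(r"(?<=^Total: )\d+", re.MULTILINE)

totals = pattern.findall(text)
print(totals)  # ['567']
```

Because the lookbehind is tied to the start of the line, "Sub Total: 3481" is skipped, but so is any "Total: " that appears mid-sentence.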

But the total can come in the middle of the sentence

Please provide an example for me to test.

Thus the Total: 567
Sub Total: 3481
today subtotal $38813
sub Total 38813

One additional thing: if Total Tax is provided, can we omit it?

Thus the Total: 567
Sub Total: 3481
today subtotal $38813
sub Total 38813
the Total Tax: 6672

I need only the total; omit Sub Total, subtotal, and Total Tax.

Does the total always come before the subtotals?

You mean the order of the lines?

This may vary from PDF to PDF.

It may be simpler in this case to iterate over the matches for regex Total:\s+, and use a For Each loop to find the first instance not containing sub or Sub. The regex above would capture the first and second lines, and the For Each loop would omit the second line, leading you to your line with the total. You can extract the value using my original regex, (?<=Total: )\d+. This also works for your preceding examples.
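
The filtering idea described here can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name first_total is hypothetical, the text is processed line by line, and a plain for loop stands in for the For Each activity mentioned above.

```python
import re

# Sample text taken from the examples earlier in the thread.
text = """Thus the Total: 567
Sub Total: 3481
today subtotal $38813
sub Total 38813
the Total Tax: 6672"""


def first_total(text):
    """Return the first 'Total:' value on a line that does not mention 'sub'."""
    for line in text.splitlines():
        # Keep only lines that contain "Total:" followed by whitespace.
        if not re.search(r"Total:\s+", line):
            continue
        # Skip subtotal lines, whatever the capitalisation of "sub".
        if "sub" in line.lower():
            continue
        # Extract the digits that follow "Total: ".
        match = re.search(r"(?<=Total: )\d+", line)
        if match:
            return int(match.group())
    return None


print(first_total(text))  # 567
```

The "Total Tax: 6672" line is skipped as well, because "Total" is not immediately followed by a colon there.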

@Sweety_Girl Can you check this regex against all the types of input that you have and verify that it works for them:

Everything is good except one case.

If Total is found in the middle of a sentence, like this:

Ram increases the Total 456

it must capture the total, unless ‘sub’ comes before it, as in subtotal.

It sounds like it will be easier to use a regex or different logic for each different document type. It will either be very difficult or impossible to handle all cases for all document types.

Yup… But we have more than 30 PDF formats.

(?<![sS]ub)(?<![sS]ub\s)[Tt]otal: (?<total>\d+)

or (?<![sS]ub)(?<![sS]ub\s)[Tt]otal\s?:\s?\D?\s?(?<total>\d+) if you expect a currency symbol
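
For testing outside the original tool, a negative-lookbehind pattern of this kind can be tried in Python. The sketch below is an adaptation under stated assumptions, not the poster's exact regex. Python writes named groups as (?P<total>...) and only allows fixed-width lookbehinds, so two lookbehinds are used: one for "sub" directly before "Total" and one for "sub" followed by a single whitespace character. The colon is also made optional (an assumption) so that "Ram increases the Total 456" is covered.

```python
import re

# Sample lines collected from the examples in this thread.
text = """Thus the Total: 567
Sub Total: 3481
today subtotal $38813
sub Total 38813
the Total Tax: 6672
Ram increases the Total 456"""

# Reject "total" when it is directly preceded by "sub" or by "sub" plus one
# whitespace character; allow an optional colon and one optional non-digit
# (for example a currency symbol) before the number.
pattern = re.compile(
    r"(?<![sS]ub)(?<![sS]ub\s)[Tt]otal\s?:?\s?\D?\s?(?P<total>\d+)"
)

totals = [m.group("total") for m in pattern.finditer(text)]
print(totals)  # ['567', '456']
```

A lookbehind for "sub" alone would still let "Sub Total: 3481" through, because the three characters just before "Total" are then "ub " rather than "Sub"; the second lookbehind covers that case.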

Try