Extraction using the regex

Sweety_Girl · April 13, 2020, 12:09pm

I need to extract the total, but sometimes a subtotal also appears, like this:

Total: 567
Sub Total: 3481
subtotal 38813
sub Total 38813

I need to get only the total,
not the subtotal.

Please help with a regex.

Anthony_Humphries · April 13, 2020, 12:10pm

If your text appears exactly as you have provided it, this regex will work:

(?<=^Total: )\d+
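As a rough illustration only: the sketch below tries the pattern in Python rather than in the .NET engine that UiPath uses. It assumes multiline mode, so that ^ matches at the start of each line. The names sample and line_start_total are made up for the example.

```python
import re

# The four sample lines from the question.
sample = "Total: 567\nSub Total: 3481\nsubtotal 38813\nsub Total 38813"

# re.MULTILINE lets ^ match at the start of every line, not just the string.
line_start_total = re.compile(r"(?<=^Total: )\d+", re.MULTILINE)

# Only digits directly after a line-initial "Total: " are returned.
matches = line_start_total.findall(sample)
print(matches)
```

Because the lookbehind is anchored with ^, a line such as "Sub Total: 3481" is skipped, but so is any line where Total appears after other words.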

Sweety_Girl · April 13, 2020, 12:16pm

But the total can come in the middle of a sentence.

Anthony_Humphries · April 13, 2020, 12:16pm

Please provide an example for me to test.

Sweety_Girl · April 13, 2020, 12:24pm

Thus the Total: 567
Sub Total: 3481
today subtotal $38813
sub Total 38813

Sweety_Girl · April 13, 2020, 12:25pm

One more thing to add to this:
if Total Tax is provided, can we omit it?

Thus the Total: 567
Sub Total: 3481
today subtotal $38813
sub Total 38813
the Total Tax: 6672

I need only the total; omit Sub Total, subtotal and Total Tax.

Anthony_Humphries · April 13, 2020, 12:38pm

Does the total always come before the subtotals?

Sweety_Girl · April 13, 2020, 12:43pm

You mean the order of the lines?

This may vary from PDF to PDF.

Anthony_Humphries · April 13, 2020, 12:47pm

It may be simpler in this case to iterate over the lines that match the regex Total:\s+ and use a For Each loop to find the first one that does not contain sub or Sub. For your latest example, that regex matches the first and second lines, and the For Each loop skips the second, leaving you with the line that holds the total. You can then extract the value using my original regex without the ^ anchor, (?<=Total: )\d+. This also works for your preceding examples.
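A minimal sketch of that loop, written in Python instead of a UiPath For Each activity. It assumes the sub/Sub check is made against the whole line, since the matched text "Total: " never contains it. The function name first_non_sub_total is hypothetical.

```python
import re

def first_non_sub_total(text):
    """Return the digits after 'Total: ' on the first line without 'sub'."""
    for line in text.splitlines():
        # Keep only lines that contain "Total:" followed by whitespace.
        if not re.search(r"Total:\s+", line):
            continue
        # Skip "Sub Total: ..." style lines (case-insensitive check).
        if "sub" in line.lower():
            continue
        # Pull the number out with the lookbehind, without the ^ anchor.
        value = re.search(r"(?<=Total: )\d+", line)
        if value:
            return value.group(0)
    return None

sample = (
    "Sub Total: 3481\n"
    "Thus the Total: 567\n"
    "today subtotal $38813\n"
    "sub Total 38813\n"
    "the Total Tax: 6672"
)
print(first_non_sub_total(sample))
```

Working line by line means the order of total and subtotal lines does not matter. A "Total Tax: ..." line is never picked up, because Total is not directly followed by a colon there.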

supermanPunch · April 13, 2020, 1:52pm

@Sweety_Girl Can you check this regex against all the types of input you have and verify that it works:

Sweety_Girl · April 13, 2020, 2:09pm

Everything is good except one case.

That is,
if the total is found in the middle of a sentence, like this:

Ram increases the Total 456

It must take the total unless ‘sub’ comes before it, as in subtotal.

Anthony_Humphries · April 13, 2020, 2:15pm

It sounds like it will be easier to use a regex or different logic for each different document type. It will either be very difficult or impossible to handle all cases for all document types.

Sweety_Girl · April 13, 2020, 2:16pm

Yup… But we have more than 30 PDF formats.

msan · April 13, 2020, 3:58pm

(?<![sS]ub\s?)[Tt]otal: (?<total>\d+)

or (?<![sS]ub\s?)[Tt]otal\s?:\s?\D?\s?(?<total>\d+) if you expect a currency symbol
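For anyone testing the negative-lookbehind idea outside UiPath, here is a hedged Python sketch. It is an adaptation, not the pattern above: Python requires fixed-width lookbehinds and writes named groups as (?P<total>...), so "sub" and "sub " are checked with two separate lookbehinds. The optional colon and the Total Tax lookahead are additions taken from the requirements earlier in the thread. TOTAL_RE and extract_total are made-up names.

```python
import re

TOTAL_RE = re.compile(
    r"(?<![sS]ub)(?<![sS]ub\s)"   # not preceded by "sub" or "sub "
    r"\b[Tt]otal\b"               # the word Total / total on its own
    r"(?!\s+[Tt]ax)"              # but not "Total Tax"
    r"\s*:?\s*[^\d\s]?\s*"        # optional colon and currency symbol
    r"(?P<total>\d+)"             # the amount itself
)

def extract_total(text):
    """Return the first total that is not a subtotal or a total tax."""
    match = TOTAL_RE.search(text)
    return match.group("total") if match else None

sample = (
    "Sub Total: 3481\n"
    "today subtotal $38813\n"
    "sub Total 38813\n"
    "the Total Tax: 6672\n"
    "Ram increases the Total 456"
)
print(extract_total(sample))
```

This covers only the sample lines posted in this thread; other PDF layouts may need their own patterns.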

ToddPull · April 13, 2020, 9:10pm

Try
