How do I isolate these values within a PDF?

Hey there,

I am scraping information from a PDF and I’m a little confused as to how I can isolate specific pieces of information within it; the variable is called pNewQuotePDF.

The PDF to text output (that I want) looks like this - but the values will change each time:

[image: PDF text output with the required values highlighted]

Is there a way in which I can take the highlighted values and have them as an array? I don’t know what syntax I would need to isolate them. For the ‘Dealer Fitted Options’, I would also need the price immediately after.

Would this be possible to do or would I need to find another method?

Thanks for your help.

Hi @dr1992

Use a regular expression: write a pattern that matches the data you need, and you can extract it.

Thanks
Varun

How would I do that?

@dr1992

Please share the input and the output you expect, and I will help you with this.

Thanks
Varun

Ah, thank you! So the variable of pNewQuotePdf contains:

Factory fitted options

Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05

Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90

Turismo leather pack - 595/595C - Black 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Dealer fitted options

matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00

Other costs / profits

I have highlighted the fields required in bold. I need these as their own array of strings: factory fitted options then another for dealer fitted options and then another for dealer fitted option price.

These values will be different each time though.

Thank you for your help, regex still eludes me…

@dr1992

Let me know if my understanding is correct, based on the data below:

Input:
Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05

Output:
458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05

If not, please explain what you are expecting.

Thanks
Varun

So the whole text is the input; the output is the part in bold.

@dr1992

Input:
Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05

Output:
Metallic - Record grey

Is this correct? And is the wording constant?

Thanks
Varun

This is what is required, but the entire text is one output variable which needs splitting up. The items themselves will differ each time.

@dr1992

Share the input file if possible

Thanks
Varun

I’m not sure what you mean, the text I posted is what comes from the PDF, but the values can change within it.

The output I require is what I highlighted in bold.

@dr1992

Is there any constant wording that always appears before the data you need?

Thanks
Varun

Consistent words are:

Factory fitted options

Dealer fitted options

Other costs / profits

@dr1992

Please find the pattern below; adapt it in the same way for the other outputs you need.

System.Text.RegularExpressions.Regex.Match(YourString, "(?<=Factory fitted options)(?sim).*(?=Dealer fitted options)").ToString

This will match your criteria

Thanks
Varun
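For anyone following along, here is a minimal sketch of what the lookbehind/lookahead pattern above does. It is written in Python rather than as a UiPath expression, so treat it as an illustration only. The sample text is the one posted earlier in the thread; `section_between` and the variable names are made up for this example. It uses a lazy `.*?` so the match stops at the first closing heading.

```python
import re

# Sample text copied from the thread; a real quote will contain different items.
quote_text = """Factory fitted options

Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05

Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90

Turismo leather pack - 595/595C - Black 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Dealer fitted options

matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00

Other costs / profits
"""


def section_between(text, start_heading, end_heading):
    """Return the text between two constant headings, or "" if not found."""
    pattern = (
        "(?<=" + re.escape(start_heading) + ")"   # must follow the start heading
        + "(.*?)"                                  # lazily capture everything...
        + "(?=" + re.escape(end_heading) + ")"     # ...up to the end heading
    )
    # DOTALL lets "." cross line breaks, like the "s" in (?sim).
    match = re.search(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else ""


factory_section = section_between(
    quote_text, "Factory fitted options", "Dealer fitted options"
)
dealer_section = section_between(
    quote_text, "Dealer fitted options", "Other costs / profits"
)

print(dealer_section)  # matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00
```

The three headings are the only fixed anchors in the text, so each section is defined purely by the heading before it and the heading after it.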


Hi @dr1992

Please try this regex

^.*?(?=\s\d+\.\d+)

Output:

For the last part, after "Dealer fitted options", split the string first and then use the expression below:
^.*?(?=(\s\d+\.\d+))

output:

Do let me know if you need any help

I hope this is what you are looking for

cheers
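As a rough illustration of the "everything before the first decimal number" idea, here is a small Python sketch (not a UiPath expression). The two section strings are copied from the sample in this thread, and the helper names `option_names` and `names_and_first_price` are invented for the example. One assumption worth stating: the pattern is applied with a multiline flag so that `^` matches at the start of every line. In .NET the equivalent is `RegexOptions.Multiline`.

```python
import re

# Sections as they would look once cut out of the full PDF text (sample data).
factory_section = """Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05

Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90

Turismo leather pack - 595/595C - Black 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"""

dealer_section = "matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00"

# Name only: shortest run from line start that is followed by " <digits>.<digits>".
NAME_PATTERN = re.compile(r"^.*?(?=\s\d+\.\d+)", re.MULTILINE)

# Name plus the first price after it, captured as two groups.
NAME_PRICE_PATTERN = re.compile(r"^(.*?)\s(\d+\.\d+)", re.MULTILINE)


def option_names(section):
    """Return the option name from every non-empty line of a section."""
    return [m.group(0).strip() for m in NAME_PATTERN.finditer(section)]


def names_and_first_price(section):
    """Return (names, prices), where each price is the first number after the name."""
    pairs = NAME_PRICE_PATTERN.findall(section)
    names = [name.strip() for name, _ in pairs]
    prices = [price for _, price in pairs]
    return names, prices


factory_names = option_names(factory_section)
dealer_names, dealer_prices = names_and_first_price(dealer_section)

print(factory_names)
# ['Metallic - Record grey', 'Style pack - 595/595C', 'Turismo leather pack - 595/595C - Black']
print(dealer_names, dealer_prices)
# ['matts'] ['30.00']
```

The escaped dot in `\d+\.\d+` matters here: it is what stops "595/595C" from being mistaken for a price.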


Hey, thanks for this. I am struggling to get this to work, but I’m unsure why. I output Varun’s solution for getting that section as pGetCriteria (String), but when I try using that as an input for the Matches activity with your regex, it doesn’t seem to give me the right thing. Any ideas?

Hi @dr1992

Can you show what you are trying to do, or where you are facing the issue?

System.Text.RegularExpressions.Regex.Matches("your string", "Your regex")

This is the format to get all the matches.

cheers

Like this in an assign?

System.Text.Regularexpressions.Regex.Matches(pGetCriteria,“^.*?(?=\s\d+.\d+))”)

Seem to be getting this:

image

edit: no idea why, but this defaulted to an IEnumerable rather than a string lol ignore me. Trying now!

Hi @dr1992

The variable that you use on the left side should be of type MatchCollection (System.Text.RegularExpressions.MatchCollection). Can you change that, please?

cheers

Hmm, this tells me MatchCollection is not a member of Regex. With Matches, it tells me there are too many ) :(

Error: Assign: parsing “^.*?(?=\s\d+.\d+))” - Too many )'s.
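A note on that error, as a hedged sketch: the pattern in the Assign has one opening parenthesis but two closing ones, and the backslash before the dot has gone missing compared with the pattern suggested earlier. The Python snippet below (Python, not UiPath; `compiles` and `first_name` are hypothetical helper names) shows both effects. The same reasoning should apply to the .NET regex engine, which is what reports "Too many )'s".

```python
import re

BROKEN = r"^.*?(?=\s\d+.\d+))"      # extra ")" at the end, dot not escaped
UNESCAPED = r"^.*?(?=\s\d+.\d+)"    # parentheses balanced, dot still unescaped
FIXED = r"^.*?(?=\s\d+\.\d+)"       # balanced, and "\." matches a literal dot


def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def first_name(pattern, line):
    """Return the text the pattern matches at the start of the line."""
    match = re.search(pattern, line)
    return match.group(0) if match else None


line = "Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90"

print(compiles(BROKEN))             # False
print(compiles(FIXED))              # True
print(first_name(UNESCAPED, line))  # Style pack -
print(first_name(FIXED, line))      # Style pack - 595/595C
```

With the unescaped dot, "595/595" counts as "digits, any character, digits", so the name is cut short. Removing the extra parenthesis fixes the parse error, and restoring `\.` fixes the matching.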
