dr1992
December 6, 2022, 10:31am
1
Hey there,
I am scraping information from a PDF and I’m a little confused as to how I can isolate specific pieces of information within it; the variable is called pNewQuotePDF.
The PDF to text output (that I want) looks like this - but the values will change each time:
Is there a way in which I can take the highlighted values and have them as an array? I don’t know what syntax I would need to isolate them. For the ‘Dealer Fitted Options’, I would also need the price immediately after.
Would this be possible to do or would I need to find another method?
Thanks for your help.
varunk
(Varun Kumar)
December 6, 2022, 10:34am
2
Hi @dr1992
Use a regular expression: write a pattern that matches the data you need and you will be able to extract it.
Thanks
Varun
varunk
(Varun Kumar)
December 6, 2022, 10:36am
4
@dr1992
Please share the input and the output you expect, and I will help you with this.
Thanks
Varun
dr1992
December 6, 2022, 10:40am
5
Ah, thank you! So the variable of pNewQuotePdf contains:
Factory fitted options
Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05
Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90
Turismo leather pack - 595/595C - Black 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Dealer fitted options
matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00
Other costs / profits
I have highlighted the fields required in bold. I need these as their own array of strings: factory fitted options then another for dealer fitted options and then another for dealer fitted option price.
These values will be different each time though.
Thank you for your help, regex still eludes me…
varunk
(Varun Kumar)
December 6, 2022, 10:45am
6
@dr1992
Let me know if my understanding is correct, based on the data below:
Input:
Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05
Output:
458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05
If not, please explain what you are expecting.
Thanks
Varun
dr1992
December 6, 2022, 10:47am
7
So the whole text is the input, the output is that in bold.
varunk
(Varun Kumar)
December 6, 2022, 11:11am
8
@dr1992
Input:
Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05
Output:
Metallic - Record grey
Is this correct? And is the wording constant?
Thanks
Varun
dr1992
December 6, 2022, 11:13am
9
This is what is required, but the entire text is one output variable which needs splitting up. The items themselves will differ each time.
varunk
(Varun Kumar)
December 6, 2022, 11:17am
10
@dr1992
Share the input file if possible
Thanks
Varun
dr1992
December 6, 2022, 11:18am
11
I’m not sure what you mean, the text I posted is what comes from the PDF, but the values can change within it.
The output I require is what I highlighted in bold.
varunk
(Varun Kumar)
December 6, 2022, 11:25am
12
@dr1992
Is there any constant wording that always appears before the data you want?
Thanks
Varun
varunk
(Varun Kumar)
December 6, 2022, 12:09pm
14
@dr1992
Please find the pattern below; you can adapt it in the same way for your other expected outputs.
System.Text.RegularExpressions.Regex.Match(YourString, "(?<=Factory fitted options)(?sim).*(?=Dealer fitted options)").ToString
This will match your criteria
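For anyone who wants to try the idea outside UiPath, here is a minimal Python sketch of the same lookbehind/lookahead approach. It is not the UiPath expression itself: Python's `re` module stands in for .NET's `Regex`, the helper name `section_between` is made up for illustration, the sample text is copied from the post above, and it uses a non-greedy `.*?` with the dot-matches-newline flag only, rather than the inline `(?sim)` options.

```python
import re

# Sample text copied from the thread; real PDF-to-text output may differ in whitespace.
quote_text = """Factory fitted options
Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05
Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90
Turismo leather pack - 595/595C - Black 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Dealer fitted options
matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00
Other costs / profits
"""


def section_between(text, start_heading, end_heading):
    """Return the text found between two constant headings (hypothetical helper)."""
    pattern = (
        r"(?<=" + re.escape(start_heading) + r")"  # must be preceded by the start heading
        r".*?"                                     # as little text as possible
        r"(?=" + re.escape(end_heading) + r")"     # must be followed by the end heading
    )
    # DOTALL lets "." cross line breaks, like the "s" in the inline (?sim) options.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(0).strip() if match else ""


factory_section = section_between(quote_text, "Factory fitted options", "Dealer fitted options")
dealer_section = section_between(quote_text, "Dealer fitted options", "Other costs / profits")

print(factory_section)
print(dealer_section)
```

This relies on the three headings being constant wording, which is why that question keeps coming up in this thread.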
Thanks
Varun
Anil_G
(Anil Gorthi)
December 6, 2022, 12:24pm
15
varunk:
required
Hi @dr1992
Please try this regex
^.*?(?=\s\d+\.\d+)
Output:
For the last part, after "Dealer fitted options", split the string first and then use the expression below:
^.*?(?=(\s\d+\.\d+))
output:
Do let me know if you need any help
I hope this is what you are looking for
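As a rough illustration of how that pattern behaves line by line, here is a small Python sketch (Python's `re`, not the UiPath activity). It assumes the two sections have already been separated, that an option name never contains a space followed by a decimal number, and that the "price immediately after" means the first number on the line. The variable names are made up for the example. Note that `^` only matches at the start of every line when multiline mode is on; in .NET that corresponds to `RegexOptions.Multiline`.

```python
import re

# Section text as posted in the thread, assumed to be already split by heading.
factory_section = """Metallic - Record grey 458.33 0.00 0.00 12.90 59.12 399.21 79.84 479.05
Style pack - 595/595C 750.00 0.00 0.00 12.90 96.75 653.25 130.65 783.90
Turismo leather pack - 595/595C - Black 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"""
dealer_section = "matts 30.00 0.00 0.00 0.00 0.00 30.00 6.00 36.00"

# Everything from the start of a line up to the first " <digits>.<digits>".
NAME_PATTERN = re.compile(r"^.*?(?=\s\d+\.\d+)", re.MULTILINE)

# Same idea, but capturing the name and the first price as two groups.
NAME_AND_PRICE_PATTERN = re.compile(r"^(.*?)\s(\d+\.\d+)", re.MULTILINE)

factory_options = NAME_PATTERN.findall(factory_section)

dealer_pairs = NAME_AND_PRICE_PATTERN.findall(dealer_section)
dealer_options = [name for name, _price in dealer_pairs]
dealer_prices = [price for _name, price in dealer_pairs]

print(factory_options)
print(dealer_options, dealer_prices)
```

"595/595C" is not mistaken for a price because the pattern requires a literal dot between the digits, which is why the backslash in `\.` matters.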
cheers
dr1992
December 6, 2022, 3:18pm
16
Anil_G:
^.*?(?=\s\d+\.\d+)
Hey, thanks for this. I am struggling to get this working, but I’m unsure why. I output Varun’s solution for getting that section as pGetCriteria (String), but when I use that as the input for the Matches activity with your regex, it doesn’t seem to give me the right thing. Any ideas?
Anil_G
(Anil Gorthi)
December 6, 2022, 3:20pm
17
Hi @dr1992
Can you show what you are trying to do, or where you are facing the issue?
System.Text.RegularExpressions.Regex.Matches("your string", "Your regex")
This is the format to get all the matches.
cheers
dr1992
December 6, 2022, 3:30pm
19
Like this in an assign?
System.Text.RegularExpressions.Regex.Matches(pGetCriteria, "^.*?(?=\s\d+.\d+))")
Seem to be getting this:
Edit: no idea why, but this defaulted to an IEnumerable rather than a String, lol, ignore me. Trying now!
Anil_G
(Anil Gorthi)
December 6, 2022, 3:33pm
20
Hi @dr1992
The variable that you use on the left side of the Assign should be of type MatchCollection. Can you change that, please?
cheers
dr1992
December 6, 2022, 3:45pm
21
Hmm, this tells me MatchCollection is not a member of Regex. With Matches, it tells me there are too many closing parentheses:
Error: Assign: parsing “^.*?(?=\s\d+.\d+))” - Too many )'s.
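The "Too many )'s" message suggests the pattern string has one more closing parenthesis than opening ones. A quick way to see this is to compile both versions. The sketch below uses Python's `re` module, which reports the problem as an `re.error` with its own wording rather than the .NET message above; the function name `compiles` is made up for the example.

```python
import re

# Pattern as posted: the lookahead "(?=" is opened once but ")" appears twice.
broken_pattern = r"^.*?(?=\s\d+.\d+))"

# Balanced version, with the dot escaped so it only matches a literal ".".
fixed_pattern = r"^.*?(?=\s\d+\.\d+)"


def compiles(pattern):
    """Return True if the regex engine accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print(f"rejected {pattern!r}: {exc}")
        return False


print(compiles(broken_pattern))
print(compiles(fixed_pattern))
```

Counting the "(" and ")" characters in the pattern by eye is usually enough: each opening parenthesis needs exactly one closing parenthesis.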