Converting inconsistent string data into the same structure

Hi everyone. Part of my automation loops through different PDFs and returns the total amount of each, then saves each amount with the corresponding PDF file path in a dictionary. The amount values are string-manipulated so they all have the same structure for further processing. I built a template using demo PDFs and it works, but I’m having a bit of trouble with the real data. The PDFs all have different structures. I’m able to pull the amount from each, but some additional data is extracted along with it, such as blank spaces and some additional numbers.

This is how the amounts appear once extracted from the PDFs:

  • R 1,432,47
  • R2257.76
  • ZAR 403,90

I want to be left with just the number, without any leading letters, commas, full stops, or white space. How can I tweak the parameter below to achieve this and log the results?
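As a rough illustration of the requirement above, here is a minimal sketch in Python (not UiPath) that removes every non-digit character. The function name `digits_only` is made up for this example. Note that dropping the separators also drops the decimal position, so this only suits a comparison against values that are stripped the same way.

```python
import re

def digits_only(raw: str) -> str:
    """Remove letters, commas, full stops and white space, keeping only digits."""
    return re.sub(r"\D", "", raw)

# The three sample values from the question.
for raw in ["R 1,432,47", "R2257.76", "ZAR 403,90"]:
    print(f"{raw!r} -> {digits_only(raw)}")
# 'R 1,432,47' -> 143247
# 'R2257.76' -> 225776
# 'ZAR 403,90' -> 40390
```

The same idea should carry over to a regex replace in the workflow, since `\D` is a standard character class.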
[Image: InvokeMethod]

Check whether regex can help you extract the amount:
[Image]

Later you can cleanse it (preferred: value parsing, or the String method Replace):

[Image]
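Since the screenshots are not reproduced here, the following is a hedged Python sketch of the "extract with regex first, cleanse later" idea. The pattern is an assumption based only on the three sample values (an `R` or `ZAR` prefix followed by digits, commas and full stops); it is not the pattern from the screenshot, and `extract_amount` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed pattern: "ZAR" or "R" as a separate word, optional spaces,
# then a run of digits/commas/full stops that ends on a digit.
AMOUNT_PATTERN = re.compile(r"\b(?:ZAR|R)\s*(\d[\d.,]*\d|\d)")

def extract_amount(pdf_text: str) -> Optional[str]:
    """Return the first amount found in the text, without the currency prefix."""
    match = AMOUNT_PATTERN.search(pdf_text)
    return match.group(1) if match else None

print(extract_amount("Total due: R 1,432,47 incl. VAT"))  # 1,432,47
print(extract_amount("Amount: R2257.76."))                # 2257.76
print(extract_amount("Balance ZAR 403,90"))               # 403,90
```

The extracted string still contains its separators, so the cleansing step (parsing or Replace) is applied to it afterwards.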

Be aware with ZAR that it belongs to a different locale: the comma may be the decimal separator.

How would the architecture look? Would I be using the regex in the Parameters? And since it’s a dictionary, would I be saving the amounts in a variable below the Invoke Method activity?

In the second image, for the number 2257.76, was there an issue with the full stop? It didn’t get stripped.

Yes, thank you for that pointer. Once the value is left as just a number, I’ll be matching it to a specific number in an Excel sheet, which will be the same number sequence.

The questions are not fully clear to me, but let me try to answer.

What is the input? A String variable with the PDF text? Then let’s feed this to the regex.
If it is something different, we will find another option as well.
We can use an Assign activity in most cases.

A string in the format X,XXX.XX (X = digit) will be recognized as a double. Default: comma = group separator, dot = decimal separator.

If the value differs because of locale or country specifics (Spain: dot = group separator, comma = decimal separator), the default behaviour is controlled by the globalization settings.
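To make the separator point concrete, here is a small Python sketch that parses a number string with the group and decimal separators passed in explicitly, rather than relying on machine-wide globalization settings. `parse_amount` is a hypothetical helper, and which separators apply to a given PDF is an assumption that has to be decided per source.

```python
from decimal import Decimal

def parse_amount(number_text: str, group_sep: str, decimal_sep: str) -> Decimal:
    """Parse a number string using explicitly chosen separators."""
    # Drop spaces and group separators, then normalise the decimal separator to ".".
    cleaned = number_text.replace(" ", "").replace(group_sep, "")
    return Decimal(cleaned.replace(decimal_sep, "."))

# Comma = group separator, dot = decimal separator (the default described above).
print(parse_amount("1,432.47", group_sep=",", decimal_sep="."))  # 1432.47
# Dot = group separator, comma = decimal separator (the Spain-style case).
print(parse_amount("403,90", group_sep=".", decimal_sep=","))    # 403.90
```

A value such as `1,432,47`, where the comma appears in both roles, fits neither convention cleanly and would need its own rule.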

On the last part, we can also support you if help is needed.


Once the PDFs’ values are pulled, I add them to a dictionary. In the image, the first In row, item.Value.Split("."c).First.ToString.Replace(",", ""), is where I originally processed the PDF data. I’m not sure how to approach editing that, because once I get the value that has gone through the regex, I need it for further processing.
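For illustration only, this Python sketch mirrors what that expression does to the three sample values and contrasts it with a digits-only cleanup applied to the whole dictionary. The file names are invented, and `original_cleanup` is a Python approximation of the expression, not the actual workflow.

```python
import re

# Hypothetical dictionary: PDF file path -> raw amount as extracted.
raw_amounts = {
    "invoice_a.pdf": "R 1,432,47",
    "invoice_b.pdf": "R2257.76",
    "invoice_c.pdf": "ZAR 403,90",
}

def original_cleanup(value: str) -> str:
    # Approximates item.Value.Split("."c).First.ToString.Replace(",", ""):
    # keep the text before the first full stop, then remove commas.
    return value.split(".")[0].replace(",", "")

def digits_only(value: str) -> str:
    # Alternative: drop every non-digit character.
    return re.sub(r"\D", "", value)

before = {path: original_cleanup(v) for path, v in raw_amounts.items()}
after = {path: digits_only(v) for path, v in raw_amounts.items()}

print(before)  # {'invoice_a.pdf': 'R 143247', 'invoice_b.pdf': 'R2257', 'invoice_c.pdf': 'ZAR 40390'}
print(after)   # {'invoice_a.pdf': '143247', 'invoice_b.pdf': '225776', 'invoice_c.pdf': '40390'}
```

The first result shows why the real data breaks the template: the currency prefix and spaces survive, and everything after a full stop is discarded.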

Thank you, I’m just attempting to get the first bit of the sequence working.

Hi, hope you’re well. Could I reopen this query?

Yes, the topic is still open. Just go ahead with your questions.


Once the PDFs’ values are pulled, I add them to a dictionary. In the image, the first In row, item.Value.Split("."c).First.ToString.Replace(",", ""), is where I originally processed the PDF data. I’m not sure how to approach editing that, because once I get the value that has gone through the regex, I need it for further processing.

We would suggest working it out with us step by step, so we can align the suggestion on a clean base.

Also, let’s sharpen the requirements.

Define input: string text from the PDF containing the following, right?
R 1,432,47
R2257.76
ZAR 403,90

Output: a dictionary was requested.

Here we would ask what should be taken as the key (is it guaranteed to be unique, given that the values in the PDFs may differ?).

Maybe an intermediate datatable structure with 2 columns (Key, Value) would be preferred, as it can be processed into a dictionary more easily while handling duplicated keys.
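A rough Python sketch of that intermediate (Key, Value) idea: collect rows first, then separate keys that occur once from keys that occur several times before building the dictionary. The rows and the `rows_to_dict` helper are hypothetical, and setting duplicates aside rather than overwriting them is just one possible policy.

```python
from collections import defaultdict

# Hypothetical intermediate rows: (key, value) pairs collected per PDF.
rows = [
    ("invoice_a.pdf", "143247"),
    ("invoice_b.pdf", "225776"),
    ("invoice_a.pdf", "40390"),  # same key again: a duplicate to handle
]

def rows_to_dict(rows):
    """Split rows into a dictionary of unique keys and a report of duplicated keys."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    unique = {k: v[0] for k, v in grouped.items() if len(v) == 1}
    duplicates = {k: v for k, v in grouped.items() if len(v) > 1}
    return unique, duplicates

unique, duplicates = rows_to_dict(rows)
print(unique)      # {'invoice_b.pdf': '225776'}
print(duplicates)  # {'invoice_a.pdf': ['143247', '40390']}
```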

Maybe you can share some sample pdfs with us. Thanks

I have provided the necessary information.

Have you received them, by any chance?