Read PDF UiPath Activities 2

Hello
I have a problem reading a PDF file: I need to extract one specific value but have not been able to. The PDF activities extract all of the text from the PDF, which is not what I need. What I want is to extract one specific part so that I can apply an If condition to it.

Example
BALANCE AT DATE 0

I need to extract the value of that balance at date.

Could you help me?
Thank you

How large are the PDFs? I have usually found that if they are a reasonable size (under roughly 50 pages), the best method is to read the entire PDF using the PDF activities, then use string manipulation to pull out the specific data as needed.

One possible way would be with regex: Assign BalanceAtDate = System.Text.RegularExpressions.Regex.Match(FullPDFAsString,"(?<=BALANCE AT DATE\s*)\d+").Value

This finds the words “BALANCE AT DATE” and returns, as a string, the digits that follow them. Note that it is case sensitive (pass RegexOptions.IgnoreCase to turn this off) and that it doesn’t match decimals (alter the regex pattern to include them).
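
To try the idea outside UiPath, here is a rough Python sketch of the same lookup. It is not the UiPath expression itself: Python's `re` module does not accept a variable-length lookbehind such as `(?<=BALANCE AT DATE\s*)`, so the sketch uses a capture group instead. The function name `extract_balance` and the sample text are made up for illustration.

```python
import re

def extract_balance(full_text, ignore_case=False):
    """Return the digits that follow 'BALANCE AT DATE', or '' if not found."""
    flags = re.IGNORECASE if ignore_case else 0
    # The capture group plays the role of the lookbehind in the .NET pattern.
    match = re.search(r"BALANCE AT DATE\s*(\d+)", full_text, flags)
    return match.group(1) if match else ""

# Hypothetical text, standing in for the string read from the PDF.
sample = "ACCOUNT 123\nBALANCE AT DATE 0\nEND"

print(repr(extract_balance(sample)))
print(repr(extract_balance("balance at date 42")))
print(repr(extract_balance("balance at date 42", ignore_case=True)))
```

The second call returns an empty string because the match is case sensitive by default, which mirrors the note about RegexOptions.IgnoreCase.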


Hello
I tried this expression, since the value has decimals: System.Text.RegularExpressions.Regex.Match(Output, "(?<=BALANCE TO DATE\s)\d+(.\d+)+(.\d+)").Value
It worked, but when the balance in the PDF is 0 it does not show anything on screen. I do not know if I am using it wrong or missing something. Could you help me?

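Your pattern requires at least two more groups after the first digits, so a bare 0 can never match. If you want to include decimals but keep them optional, change it to this: System.Text.RegularExpressions.Regex.Match(Output,"(?<=BALANCE AT DATE\s*)\d+(\.\d{1,2})?").Value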

If you want to allow more than 2 decimal places, change the 2 in the {1,2} portion to the maximum number of decimal digits you want to match.
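
A small Python sketch comparing the optional-decimal pattern with the shape of the earlier attempt may make the difference concrete. As an assumption for illustration, both patterns use a capture group rather than the .NET lookbehind (Python's `re` rejects variable-length lookbehind), and the helper `first_group` and the sample strings are hypothetical.

```python
import re

# Optional decimal part: a literal dot followed by 1 or 2 digits.
DECIMAL_PATTERN = re.compile(r"BALANCE AT DATE\s*(\d+(?:\.\d{1,2})?)")
# Shape of the earlier attempt: two extra "any character + digits" groups are mandatory.
STRICT_PATTERN = re.compile(r"BALANCE AT DATE\s*(\d+(.\d+)+(.\d+))")

def first_group(pattern, text):
    """Return the captured value, or '' when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else ""

for text in ("BALANCE AT DATE 0", "BALANCE AT DATE 1250.75"):
    print(text, "|", repr(first_group(DECIMAL_PATTERN, text)),
          "|", repr(first_group(STRICT_PATTERN, text)))
```

With a balance of 0 the stricter pattern returns nothing, while the optional-decimal pattern returns "0".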


Hello, a question: if I wanted to capture a formatted date in this case, such as
2020-09-19, what should the regular expression be?

You could use: \d{4}-\d{2}-\d{2} This grabs 4 digits then a dash, then 2 digits, then a dash, then 2 digits again.

You can make it more complex or specific as needed depending on your input text. For example, your input text might contain something like 9827-28-27-38859-183-12, which isn’t a valid date, yet the regex would still find 9827-28-27. Putting \b (word boundary) on one or both sides of the regex, \b\d{4}-\d{2}-\d{2}\b, stops it from matching inside a longer run of digits such as 12020-09-190. A hyphen counts as a word boundary, though, so to reject the hyphenated string above you would need lookarounds instead: (?<![\d-])\d{4}-\d{2}-\d{2}(?![\d-])
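
Here is a quick Python check of how these date patterns behave on the sample strings. One thing it shows is that a hyphen is a non-word character, so `\b` still matches between `27` and `-`. The third pattern, with negative lookarounds, is an addition for illustration and not something from the original pattern; the variable names are made up.

```python
import re

plain = r"\d{4}-\d{2}-\d{2}"
bounded = r"\b\d{4}-\d{2}-\d{2}\b"
# Assumption: reject a match that touches another digit or a hyphen.
isolated = r"(?<![\d-])\d{4}-\d{2}-\d{2}(?![\d-])"

samples = [
    "Date: 2020-09-19",          # a normal date
    "12020-09-190",              # date shape buried in longer digit runs
    "9827-28-27-38859-183-12",   # hyphenated number that is not a date
]

for text in samples:
    results = [re.findall(pattern, text) for pattern in (plain, bounded, isolated)]
    print(text, results)
```

The word-boundary version rejects the second sample but still finds 9827-28-27 in the third; only the lookaround version rejects both.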

You also know specific digits are invalid, so you could further complicate it by validating specific digits:

\b[1-2][09]\d{2}- You know the year is in either the 1000s or the 2000s, the second digit of the year is either a 0 (2000s) or a 9 (1900s), and the last 2 digits of the year could be anything. Write [09], not [0,9]: inside a character class a comma is matched literally.
[0-1]\d- The first digit of the month has to be a 0 or a 1 since there is no month 20+
[0-3]\d\b The first digit of the day needs to be a 0, 1, 2, or 3 since there is no day 40+

Putting that all together, the more complicated regex (possibly very unnecessarily complicated) would be: \b[1-2][09]\d{2}-[0-1]\d-[0-3]\d\b
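
As a rough illustration of what the digit checks buy you, this Python sketch runs the combined pattern against a few made-up strings. It also includes the `[0,9]` spelling to show that a comma inside a character class is matched literally. The variable names and test strings are hypothetical.

```python
import re

# Character class without a comma: second digit of the year is 0 or 9.
digit_checked = re.compile(r"\b[1-2][09]\d{2}-[0-1]\d-[0-3]\d\b")
# Same pattern with "[0,9]": the comma becomes an allowed character.
with_comma = re.compile(r"\b[1-2][0,9]\d{2}-[0-1]\d-[0-3]\d\b")

for text in ("2020-09-19", "9827-28-27", "2020-19-39", "2,20-09-19"):
    print(text,
          bool(digit_checked.search(text)),
          bool(with_comma.search(text)))
```

Note that 2020-19-39 still passes the digit checks (month 19, day 39), so per-digit validation narrows the matches without guaranteeing a real date.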

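A better way than making it more complex is sometimes to pull the value, then check after the fact that it is a valid date: DateTime.TryParseExact(RegexMatch, "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.None, Nothing) returns True only for a real date, and might make it a bit easier to tell what you’re doing. (ParseExact throws an exception on an invalid date, so wrapping it in IsDate doesn’t help.)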
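
The "match first, validate afterwards" idea can be sketched in Python as follows. Here `datetime.strptime` stands in for the .NET date parsing call, and the function name `is_valid_date` is made up; treat it as an illustration of the approach rather than the UiPath expression.

```python
from datetime import datetime

def is_valid_date(text):
    """Return True if text parses as a real calendar date in yyyy-MM-dd form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        # Raised for impossible dates such as month 19 or 30 February.
        return False

for candidate in ("2020-09-19", "2020-19-39", "2021-02-29", "9827-28-27"):
    print(candidate, is_valid_date(candidate))
```

This catches cases the per-digit regex lets through, such as 2020-19-39 or 29 February in a non-leap year.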
