Read PDF Uipath Activities 2

Mike99 · September 2, 2020, 6:42pm

Hello
I am having trouble reading a PDF file. I need to extract one specific value, but the PDF activities return all of the text in the PDF. What I want is to extract just one part of it so that I can use it in an If condition.

Example
BALANCE AT DATE 0

I need to extract the value of that balance at date.

Could you help me?
Thank you

Dave · September 2, 2020, 6:55pm

How large are the PDFs? I have usually found that if they are of a reasonable size (fewer than about 50 pages), the best method is to read the entire PDF using the PDF activities, then use string manipulation to pull out the specific data as needed.

One possible way would be with regex: Assign BalanceAtDate = System.Text.RegularExpressions.Regex.Match(FullPDFAsString,"(?<=BALANCE AT DATE\s*)\d+").Value

This finds the words "BALANCE AT DATE" and returns, as a string, the digits that follow them. Note that it is case sensitive (pass RegexOptions.IgnoreCase to turn this off) and it doesn't capture decimals (alter the regex pattern to include them in the search).
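For checking the idea outside of UiPath, here is a minimal sketch in Python rather than VB.NET. Python's re module does not accept the variable-length lookbehind (?<=...\s*) that .NET allows, so this version swaps in a capture group instead. The sample text and the function name extract_balance are made up for illustration.

```python
import re

def extract_balance(full_pdf_text: str) -> str:
    """Return the digits that follow 'BALANCE AT DATE', or '' if not found."""
    # A capture group stands in for the .NET lookbehind;
    # re.IGNORECASE plays the role of RegexOptions.IgnoreCase.
    match = re.search(r"BALANCE AT DATE\s*(\d+)", full_pdf_text, re.IGNORECASE)
    return match.group(1) if match else ""

# Hypothetical text, as it might come back from reading a PDF.
sample = "Account 123\nBalance at date 0\nEnd of statement"
print(extract_balance(sample))  # prints 0
```

Returning an empty string when nothing matches mirrors what .Value gives on a failed Match in .NET, so the result can go straight into an If condition.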

Mike99 · September 4, 2020, 9:42pm

Hello
I tried this expression, since the value has decimals: System.Text.RegularExpressions.Regex.Match(Output, "(?<=BALANCE TO DATE\s)\d+(.\d+)+(.\d+)").Value
It worked, but when the balance in the PDF is 0 nothing is shown on screen. I don't know whether I am misusing it or missing something. Could you help me?

Dave · September 4, 2020, 10:52pm

If you want to include decimals, change it to this: System.Text.RegularExpressions.Regex.Match (Output,"(?<=BALANCE AT DATE\s*)\d+(\.\d{1,2})?").Value

If you want to allow more than 2 decimal places, change the 2 in the {1,2} portion to the maximum number of decimal digits you want to match.
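A small Python sketch (not UiPath code) may help show why the optional group matters. The earlier attempt requires at least two extra groups after the first digits, so a bare 0 can never match it. As an assumption for illustration, the .NET lookbehind is again replaced by a capture group, and balance_of is a hypothetical helper name.

```python
import re

# The optional group (?:\.\d{1,2})? lets a bare integer such as 0 match too.
BALANCE = re.compile(r"BALANCE AT DATE\s*(\d+(?:\.\d{1,2})?)")
# The earlier attempt, with its lookbehind written as plain text for Python.
EARLIER = re.compile(r"BALANCE TO DATE\s\d+(.\d+)+(.\d+)")

def balance_of(text: str) -> str:
    """Return the balance as a string, or '' when no balance is found."""
    match = BALANCE.search(text)
    return match.group(1) if match else ""

print(balance_of("BALANCE AT DATE 0"))        # prints 0
print(balance_of("BALANCE AT DATE 1234.56"))  # prints 1234.56
print(EARLIER.search("BALANCE TO DATE 0"))    # prints None
```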

Mike99 · September 23, 2020, 8:14pm

Hello, one more question: if I wanted to capture a formatted date, in this case 2020-09-19, what should the regular expression be?

Dave · September 23, 2020, 8:44pm

You could use: \d{4}-\d{2}-\d{2} This grabs 4 digits then a dash, then 2 digits, then a dash, then 2 digits again.

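You can make it more complex or specific as needed depending on your input text. For example, your input text might contain something like 9827-28-27-38859-183-12, which isn't a valid date, yet the regex would still find 9827-28-27. Putting \b (word boundary) on one or both sides of the regex, \b\d{4}-\d{2}-\d{2}\b, stops it from matching inside a longer run of digits or letters. A hyphen is not a word character, though, so \b alone still matches next to a hyphen. To reject neighbouring hyphens as well, use lookarounds: (?<![\d-])\d{4}-\d{2}-\d{2}(?![\d-])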
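A quick way to see how the boundary behaves is to run the variants side by side. This is a sketch in Python with a made-up input string. Since a hyphen is not a word character, \b still allows a match right next to a hyphen, and the lookaround version is one way to exclude that.

```python
import re

# Made-up input: a hyphenated id, a longer digit run, and a real date.
text = "id 9827-28-27-38859-183-12, x19827-28-271, due 2020-09-19"

plain = re.findall(r"\d{4}-\d{2}-\d{2}", text)
bounded = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
strict = re.findall(r"(?<![\d-])\d{4}-\d{2}-\d{2}(?![\d-])", text)

print(plain)    # ['9827-28-27', '9827-28-27', '2020-09-19']
print(bounded)  # ['9827-28-27', '2020-09-19']
print(strict)   # ['2020-09-19']
```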

You also know specific digits are invalid, so you could further complicate it by validating specific digits:

\b[12][09]\d{2}- The first digit of the year is a 1 or a 2 (1000s or 2000s), the second digit is either a 0 (20xx) or a 9 (19xx), and the last 2 digits of the year could be any digit. Don't put a comma inside the brackets: [0,9] would also match a literal comma.
[01]\d- The first digit of the month has to be a 0 or a 1 since there is no month 20+
[0-3]\d\b The first digit of the day needs to be a 0, 1, 2, or 3 since there is no day 40+

Putting that all together, the more complicated regex (possibly very unnecessarily complicated) would be: \b[12][09]\d{2}-[01]\d-[0-3]\d\b
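To see what the digit checks do and do not catch, here is a short Python sketch with made-up candidate strings. It writes the classes as [12], [09], and [01] because a comma inside square brackets is a literal character, not a separator. The last candidate shows that digit-by-digit checks still let some impossible dates through.

```python
import re

DATE = re.compile(r"\b[12][09]\d{2}-[01]\d-[0-3]\d\b")

# Made-up candidates: two real dates, the earlier non-date, and month 19 / day 39.
candidates = ["2020-09-19", "1999-12-31", "9827-28-27", "2020-19-39"]
results = {c: bool(DATE.search(c)) for c in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
# 2020-09-19 True
# 1999-12-31 True
# 9827-28-27 False
# 2020-19-39 True
```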

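Rather than making the regex more complex, it is sometimes better to pull the value, then check afterwards that it is a valid date. DateTime.TryParseExact(RegexMatch, "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.None, ParsedDate) returns True only for a real date (ParsedDate is a DateTime variable you create to receive the result), and it might be a bit easier to tell what you're doing. Wrapping DateTime.ParseExact in IsDate would not work as a check, because ParseExact throws an exception on an invalid date instead of returning False.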
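The "match loosely, then validate" idea can be sketched in Python as follows. Here datetime.strptime stands in for the .NET date parsing, and find_valid_date is a hypothetical helper name, so treat this as an illustration of the approach rather than UiPath code.

```python
import re
from datetime import datetime
from typing import Optional

def find_valid_date(text: str) -> Optional[datetime]:
    """Return the first yyyy-MM-dd candidate that is a real calendar date."""
    for candidate in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text):
        try:
            # strptime raises ValueError for impossible dates such as month 28.
            return datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            continue
    return None

print(find_valid_date("ref 9827-28-27 issued 2020-09-19"))  # prints 2020-09-19 00:00:00
print(find_valid_date("2020-19-39"))                        # prints None
```

The simple regex stays readable, and the calendar rules (month lengths, leap years) are left to the date parser instead of being encoded in the pattern.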
