Reading PDF Text


#1

Hey Everyone,

I’m having some issues with one of my PDF files. I did a read text so I know how UiPath reads the file. One PDF works fine because it has all of the information I need from the invoice, for example: “SUM TOTAL 0.00 41,691.75”, where the 0.00 is the deduction. The other PDF invoice did not come with a deduction value, so I’m not sure of the best way to read it.
Example: “TOTALS: 575866.76 575866.76”. The header has a deduction column, but the row does not show a value of 0.00. How would I still be able to grab it?


#2

@kishanpatel728 ,
Forgive me, but I’m not clear on your requirement. Whatever the PDF file is, once you read it the output will be a string, so understanding some string manipulation methods will make the solution easier.

In the first example, 0.00 is present, and in the second it is not. So what value do you want to get now?
If possible, can you please share a screenshot of that PDF file?

Regards,
SP


#4

Hello,

First off, those are two different PDF files, so each may need a different configuration.

Is there an example of, or a possibility for, the second one having numbers under the Discount or WHT amount columns? If not, then you might be able to just ignore those columns.

In any case, you essentially need to create an argument or variable that stores the keyword, either “SUM TOTAL” or “TOTALS:”, then split the text on it to find the Net Amount, which is the last value.

fulltext.Split({totalKeyword},System.StringSplitOptions.None)(1).Split(System.Environment.Newline(0))(0).Split({" "},System.StringSplitOptions.RemoveEmptyEntries).Last

This splits by the keyword and takes item (1), the text after the keyword, then splits by the newline character and takes item (0), the rest of that line, then finally splits by spaces and takes the Last item.
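The same three steps can be sketched outside UiPath to check the logic against the two sample lines from this thread. This is only a minimal illustration in Python, not the workflow itself; the function name `last_amount_after` and the surrounding sample text are made up for the example.

```python
def last_amount_after(full_text: str, total_keyword: str) -> str:
    """Return the last whitespace-separated value on the keyword's line."""
    # Step 1: split by the keyword and keep the text after its first occurrence.
    after_keyword = full_text.split(total_keyword)[1]
    # Step 2: keep only the remainder of that line.
    rest_of_line = after_keyword.splitlines()[0]
    # Step 3: split on whitespace (empty entries are dropped) and take the last item.
    return rest_of_line.split()[-1]


# Hypothetical surrounding text; only the totals rows come from the thread.
first_pdf = "Invoice A\nSUM TOTAL 0.00 41,691.75\nPage 1"
second_pdf = "Invoice B\nTOTALS: 575866.76 575866.76\nPage 1"

print(last_amount_after(first_pdf, "SUM TOTAL"))
print(last_amount_after(second_pdf, "TOTALS:"))
```

Like the expression above, this sketch fails if the keyword is missing or nothing follows it on the same line, so a real workflow would want a check for that first.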

You can also do this with Regex which may or may not be simpler.
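As a rough idea of what the regex route could look like, here is a small Python sketch. The two patterns are assumptions based only on the two sample lines in this thread (amounts with two decimals and optional thousands commas), and `amounts_on_total_line` is a made-up helper name. The patterns use only basic syntax, so they should carry over to a .NET regex, but that has not been verified here.

```python
import re

# Assumed patterns, derived from the two sample rows only.
TOTAL_LINE = re.compile(r"(?:SUM TOTAL|TOTALS:)([^\r\n]*)")
AMOUNT = re.compile(r"\d[\d,]*\.\d{2}")


def amounts_on_total_line(full_text: str) -> list[str]:
    """Return every amount found on the totals line, in order of appearance."""
    line = TOTAL_LINE.search(full_text)
    if line is None:
        return []  # keyword not present in this document
    return AMOUNT.findall(line.group(1))


print(amounts_on_total_line("SUM TOTAL 0.00 41,691.75"))
print(amounts_on_total_line("TOTALS: 575866.76 575866.76"))
```

The Net Amount is then the last element of the returned list, and the number of elements tells you how many columns actually had a value.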

Extracting the other amounts when they are sometimes missing will be trickier, but if all you need is the Net Amount, then I wouldn’t worry about it, since the Net Amount is always the last value.

I hope this helps some.

Regards.


#5

I know they are different. The first PDF is in the right format. As for the second one, I’m not sure yet whether there need to be values. The issue is that when I do a write text to see how it looks, there is only one space between the totals. I used regex to extract most of the info I needed. This is how it appears as text:
“TOTALS: 575866.76 575866.76”


#6

What if you create a condition where if the .First value = the .Last value, then assume the discount is 0?

To do that, you would first extract the first value and the last value, using similar code as I presented in my previous post.

Assign amount = fulltext.Split({totalKeyword},System.StringSplitOptions.None)(1).Split(System.Environment.Newline(0))(0).Split({" "},System.StringSplitOptions.RemoveEmptyEntries).First
Assign netAmount = fulltext.Split({totalKeyword},System.StringSplitOptions.None)(1).Split(System.Environment.Newline(0))(0).Split({" "},System.StringSplitOptions.RemoveEmptyEntries).Last

if amount = netAmount
    Assign discount = 0
Else
    Assign discount = fulltext.Split({totalKeyword},System.StringSplitOptions.None)(1).Split(System.Environment.Newline(0))(0).Split({" "},System.StringSplitOptions.RemoveEmptyEntries)(1)

This would basically set the discount to 0 when the invoice amount = the net amount, and to the second value when they differ. However, there is no invoice amount in the first PDF you have there, so that one will need additional logic. Maybe you can first determine whether there is an Invoice/Document Amount column and, where I am showing Assign discount = …, change the last (1) to (0) if it doesn’t exist, et cetera.
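To make the condition concrete, here is a minimal Python sketch of the same rule, assuming the values on the totals line have already been split into a list. The helper names and the three-value row are hypothetical; only the “TOTALS:” row comes from this thread.

```python
from decimal import Decimal


def parse_amount(token: str) -> Decimal:
    """Convert text such as '41,691.75' to a Decimal (commas removed)."""
    return Decimal(token.replace(",", ""))


def split_totals(tokens: list[str]) -> tuple[Decimal, Decimal, Decimal]:
    """Return (amount, discount, net_amount) using the first = last rule."""
    amount = parse_amount(tokens[0])
    net_amount = parse_amount(tokens[-1])
    if amount == net_amount:
        discount = Decimal("0")  # nothing was deducted, so assume 0
    else:
        discount = parse_amount(tokens[1])  # second value on the line
    return amount, discount, net_amount


# Row from the second PDF in this thread.
print(split_totals(["575866.76", "575866.76"]))
# Hypothetical row where a discount is printed.
print(split_totals(["1000.00", "50.00", "950.00"]))
```

Comparing parsed numbers rather than raw strings avoids a mismatch when one value has a thousands comma and the other does not. The first PDF’s row (“0.00 41,691.75”) has no invoice amount column, so it is not handled by this rule and still needs the extra check described above.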

Regards.