Reading text from PDF left to right

pdf

#1

Hi!

I’ve been trying to extract text from an invoice that contains values such as totals and taxes. The thing is, every activity I tried (Read PDF Text, Get Text and Screen Scraping) reads from top to bottom first and then from left to right, which makes the output unusable.

PDF

Is there a way to make the program read from left to right? In essence, what I’d like is for the output to be:

“Total $ 31,160.40
Total gravado al 21.00% 4,427.40
Total gravado al 10,50 % 26,733.00”
[etc…]

Thanks!


#2

Hi Smassau!

I’m not entirely sure I understand your question. Looking at the screenshots you provided of the PDF, the screen scrape results, and your desired data, it looks like the only difference is that the screen scrape is not capturing paragraph breaks. Is that the essence of your problem?

Let me know!
Riley


#3

Hi Riley!

It’s definitely capturing the data. My issue is with how it is presented. Right now, instead of writing each total followed by its value, it writes all the total “types” first and then all the values.

Here is the current output in a .TXT format:

TXT

What I’m looking for is for each number to appear next to the item it belongs to, as it does in the PDF (e.g. IVA 21.00% 929.75, then IVA 10.50% 2806.97), instead of all the numbers at the bottom, which makes the output difficult to understand.

Sorry if I’m not being clear enough.

Thanks!


#4

Hi Smassau,

That makes sense! Sorry, I was confused by the original post because the amount displayed in the Screen Scraper Wizard is correct for the “TOTAL” row. It’s only after that row that the order starts getting mixed up.

Unfortunately, I’m guessing that UiPath reads the PDF this way because the source file originally used to create the PDF stored its text in that order. If you are able to scrape the PDF using Native or Full Text scraping, UiPath is accessing the text of the file the way it was originally saved to PDF. This is great for accuracy, since the words and numbers should come through exactly as stored, but it has the downside that the reading order can differ from the visual layout if the original file was formatted strangely.

There are two potential workarounds for this:

  1. You could try using an OCR scrape rather than a Native or Full Text scrape. Since OCR looks at the PDF as a “picture” rather than at the text as it was originally stored, I expect it would put the titles and amounts in the appropriate rows, the way you would like them to be formatted. I would hesitate to use this method, however, as OCR is not 100% reliable when it comes to reading text and numbers. (That said, the text and numbers in your screenshot appear to be very readable, so I expect OCR would do a relatively good job with these.)
  2. You can try programming in your own logic to split and then concatenate the rows back together the way you want them to be represented visually. Your logic could go something like this: Scrape the data --> Create a Data Table variable with two columns --> Create a loop that loops through the scraped string and splits it out into different segments, each segment stored to a different row in column 1 of the Data Table --> Have this loop continue until it reaches the final “TOTAL” line (there are several different ways you could have it differentiate this line logically) --> At this point start a new loop that stores the next several string segments (which would be the $ amounts) into column 2 of the Data Table --> Finally, use a “For Each Row” loop to loop through the Data Table, concatenating columns 1 and 2 together to form the string the way you want it (Title Amount Paragraph, Title Amount Paragraph, Title Amount Paragraph, etc.)
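To make the second option more concrete, here is a minimal sketch in Python rather than as a UiPath workflow. It is only an illustration under assumptions: the function name `pair_labels_with_amounts`, the `amounts_reversed` flag, and the number pattern are all hypothetical, and it assumes the scrape returns one label or one amount per line. Instead of looping until the final “TOTAL” line, it tells labels and amounts apart with a pattern, which is a swapped-in simplification of the steps above.

```python
import re
from typing import List

# Assumed amount format: digits with "," or "." separators, e.g. "31,160.40".
AMOUNT_RE = re.compile(r"^\$?\s*\d[\d.,]*$")


def pair_labels_with_amounts(scraped: str, amounts_reversed: bool = False) -> List[str]:
    """Rejoin labels and amounts that were scraped as two separate blocks.

    Hypothetical helper: assumes one label or one amount per line, and that
    the n-th label belongs to the n-th amount (or to the n-th amount counted
    from the end when amounts_reversed is True).
    """
    lines = [ln.strip() for ln in scraped.splitlines() if ln.strip()]

    # "Column 1": every line that does not look like an amount.
    labels = [ln for ln in lines if not AMOUNT_RE.match(ln)]
    # "Column 2": every line that does look like an amount.
    amounts = [ln for ln in lines if AMOUNT_RE.match(ln)]

    if len(labels) != len(amounts):
        raise ValueError(
            f"found {len(labels)} labels but {len(amounts)} amounts; "
            "the pairing assumption does not hold for this text"
        )

    if amounts_reversed:
        amounts.reverse()

    # Equivalent of the final "For Each Row" step: concatenate both columns.
    return [f"{label} {amount}" for label, amount in zip(labels, amounts)]


if __name__ == "__main__":
    sample = (
        "Total $\n"
        "Total gravado al 21.00%\n"
        "Total gravado al 10,50 %\n"
        "31,160.40\n"
        "4,427.40\n"
        "26,733.00\n"
    )
    for row in pair_labels_with_amounts(sample):
        print(row)
```

The sample text reuses the values from the first post. The sketch deliberately raises an error when the label and amount counts differ, since a silent mismatch would pair amounts with the wrong labels.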

I know this is probably quite a bit more complicated than you are hoping for, but due to the limitations of screen scraping this will probably be your best option.

Let me know if you need any additional help!

Cheers,
Riley


#5

Thanks for the detailed response Riley!

I liked your second option, but unfortunately the order in which the numbers are returned is completely jumbled as well. It’s as if it copies the descriptions from top to bottom and then copies the numbers from bottom to top.

I think I’ll just end up using relative scraping for each value manually and hoping that future invoices don’t change too much from my template.

Again, thanks for your help!
Santiago