Extracting tables from a multi-page PDF with the same structure to Excel

Hi all,

Can anyone help me find a solution? I have been trying for 3 days now, but I can’t seem to get it working. I tried almost all the methods from the Text to documents learning, but I am still failing.

So the real question is: how can I extract the tables from the PDF file, pages 1 to 10 (all in the same format), and then transfer them to Excel?

One thing I would like to point out: each page represents a place, so I was thinking of creating a separate tab for each one.

Solution I have tried:

Using a taxonomy and building a template along with OCR reading tools, but I am still failing.

This is only page 1, but pages 2-10 have the same format.

Fields to extract:

  1. Address
  2. PO number
  3. The tables (all of them)

Main.xaml.json (141 Bytes)

Hi @bidfood_IT ,

Could you let us know whether the PDF will always be digital/native, or whether it could also contain scanned documents?


Hi, this is my “home” account. Sorry for the confusion; I didn’t bring my work laptop home. To answer your question, the PDF will always be a digital copy and will never contain scanned documents.

In fact, it is a computer-generated PDF from one of our customers’ portals, so the format will always be the same. I have hidden some of the details, but the format is the same as shown in the picture.

I would appreciate it if you could help me!

@vincenthoh.capital ,

In that case, we could start by using regular expressions to get the details of each row in the table.

Check whether this is feasible and whether the resulting regular expression applies to all the different example files you have.

This should be our first analysis; if it does not work, we could then move on to Document Understanding (DU).

To check feasibility with RegEx, would you be able to provide a sample document? We would also need to understand the constraints on the table columns (whether all values will always be present, or whether some are optional).
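To make the feasibility check concrete, here is a rough sketch of what a row-level regular expression could look like, written in Python only for illustration (the same pattern idea carries over to other regex engines). The row layout is purely an assumption, since no sample document is available: item code, free-text description, quantity, unit price, and an optional line total. The names `ROW_PATTERN` and `parse_row` are hypothetical.

```python
import re

# ASSUMED row layout (not taken from the real PDF):
#   <item code> <description, may contain spaces> <qty> <unit price> [<line total>]
# The numeric columns are anchored at the end of the line, so a free-text
# description can still be captured even when columns are separated by a
# single space.
ROW_PATTERN = re.compile(
    r"^(?P<code>\S+)\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d[\d,]*\.\d{2})"
    r"(?:\s+(?P<total>\d[\d,]*\.\d{2}))?$"  # optional column
)


def parse_row(line):
    """Return the named fields of one table row, or None if the line is not a row."""
    match = ROW_PATTERN.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_row("A100 Fresh Milk 2L 12 3.50 42.00"))
    print(parse_row("Code Description Qty Price Total"))  # heading line, not a row
```

Lines that return `None` (headings, blank lines, addresses) can simply be skipped. If a column that is optional sits in the middle of the row rather than at the end, the pattern becomes ambiguous, which is why the column constraints matter.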

If you want to use DU without checking RegEx first, you could try the Invoice model that UiPath provides out of the box and see whether it works.

Well, it is a little sensitive, because these are the cost prices we give to another company, and this forum is public, right? Is there any other way we could go about this? As I mentioned in my first post, the template is exactly the same except for the figures. The table headings are all the same.

A few things differ:

PO number
Invoice to:
Deliver to:

Everything else is the same, except for the figures and descriptions in the tables.

For your information, the PDF itself technically covers 10 outlets, meaning:

Page 1: Outlet A
Page 2: Outlet B

That is why the “Deliver to” and PO number are different on each page. I tried converting it to .txt format, but the table columns are separated by only a single space, so I can’t use another method to turn the text back into tables.
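Since each page is one outlet, the exported text could first be split per page and the per-outlet fields pulled out with regular expressions. A minimal sketch in Python (outside UiPath), assuming the .txt export separates pages with a form-feed character (as `pdftotext` does) and that the labels appear literally as "PO number" and "Deliver to:". Both are assumptions about the file, and the function names are hypothetical.

```python
import re

# ASSUMED label spellings; adjust them to whatever the real export contains.
PO_PATTERN = re.compile(r"PO number\s*:?\s*(?P<po>\S+)", re.IGNORECASE)
DELIVER_PATTERN = re.compile(r"Deliver to\s*:\s*(?P<address>.+)", re.IGNORECASE)


def split_pages(text):
    """Split exported text on form feeds (ASSUMED page separator), dropping empty pages."""
    return [page for page in text.split("\f") if page.strip()]


def extract_header(page):
    """Pull the PO number and the first line of the delivery address from one page."""
    po = PO_PATTERN.search(page)
    deliver = DELIVER_PATTERN.search(page)
    return {
        "po_number": po.group("po") if po else None,
        "deliver_to": deliver.group("address").strip() if deliver else None,
    }


def headers_by_page(text):
    """Map 'Page 1', 'Page 2', ... to the header fields found on that page."""
    return {
        f"Page {number}": extract_header(page)
        for number, page in enumerate(split_pages(text), start=1)
    }


if __name__ == "__main__":
    sample = "PO number: 1001\nDeliver to: Outlet A\n\fPO number: 1002\nDeliver to: Outlet B\n"
    for name, fields in headers_by_page(sample).items():
        print(name, fields)
```

Each key of the resulting dictionary could then become one Excel tab. Note that this only captures the first line after "Deliver to:", so a multi-line address would need a wider pattern.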

Anyway, I am very serious about learning this, so I apologize in advance for not being able to show you the full PDF. Please let me know what else I could do in this case. As I mentioned earlier, everything is the same; nothing changes except the information.