Invoice data extraction using Document Understanding

Greetings community, I would like your input on a project I am working on.

I have some multi-page invoices (PDF).
All of them have a summary on the first page: some general information, a small table of 2 or 3 rows with the expense categories and their amounts, and a total amount after the table.
The remaining pages break those categories down in detail: 2 or 3 lines of text at the top, then a table listing the expenses one by one.

The problem is with the detail pages, because the tables are not clearly laid out: there are no headers and no lines separating the rows, only spacing, and the ML extractor does not seem able to identify them correctly.
In the template setup, the IDs suggested for matching with the taxonomy fields are very wrong.

I cannot use a form extractor because the placement of the table depends on its size…

The PDF is split into its individual pages, and I made two taxonomy types: one for the first page and one for the rest.
Do I just need more invoices to train the ML model, or have I approached this the wrong way?

Any input is welcome, thanks!

Hi @Christodoulos

  1. Training the ML model with more invoices can definitely help improve the extraction accuracy. The model needs to learn from a diverse range of invoices to better understand the patterns and structures of the tables on the detailed pages.
  2. Consider alternative extraction approaches: Since the tables on the detailed pages are not clearly defined, you may need to explore alternative extraction methods.


Thanks for the input!

By other methods, do you mean without Document Understanding extraction? I ask because I have already tried all the extractors in that scope.
I will try to get my hands on more training samples, and if that doesn’t work I am thinking of manipulating the string returned from the digitization step.
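
To make the string idea concrete, here is a rough sketch in Python of what parsing the digitized text could look like. It is only an illustration under several assumptions: the sample text, the three-column layout (description, date, amount), and the rule that two or more spaces separate columns are all made up, and the real digitization output may look quite different. The names `SAMPLE_PAGE`, `parse_amount` and `parse_spaced_rows` are hypothetical helpers, not part of UiPath.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical digitized text of one detail page; the real layout will differ.
SAMPLE_PAGE = """\
Category: Travel
Period 01/2023

Taxi airport to hotel      12/01/2023      45.50
Train ticket               13/01/2023     120.00
Hotel, 2 nights            14/01/2023   1,310.25
"""

# Assumption: columns are separated by runs of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_amount(token):
    """Return the token as a Decimal, or None if it is not a number."""
    try:
        return Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None


def parse_spaced_rows(text, expected_columns=3):
    """Collect lines that split into the expected columns and end in an amount."""
    rows = []
    for line in text.splitlines():
        cells = COLUMN_GAP.split(line.strip())
        if len(cells) != expected_columns:
            continue  # intro text or blank line, not a table row
        amount = parse_amount(cells[-1])
        if amount is None:
            continue  # last cell is not numeric, so not an expense row
        rows.append({"description": cells[0], "date": cells[1], "amount": amount})
    return rows


rows = parse_spaced_rows(SAMPLE_PAGE)
total = sum(row["amount"] for row in rows)
```

The per-page `total` could then be compared with the category amount in the summary table on the first page as a sanity check. The same logic should be portable to a workflow using regex and string-splitting activities, though I have not tried that.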

Thanks again!

Do you know how I can further train the existing public endpoints provided by UiPath?
