Multiple invoices in a single PDF

Hi all, I want to extract invoice data from a PDF, but there are multiple invoices in a single PDF, and each invoice may sometimes span 2 or 3 pages. How can I extract the data? Should I split the PDF, or is there another way?

Hi @john_smith2 ,
Can you share your PDF?
I think we can read all the text in the PDF file into a string, then split it wherever a new invoice starts.

Every page has the string "invoice". The only thing I found is the page markers: "page 1 of 1", "page 1 of 2", etc. But on some pages these don't get extracted because the PDF is scanned, and in some invoices it is just "1 of 2" without the word "page".
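
To make the page-marker idea concrete, here is a minimal Python sketch. It assumes you already have the text of each page as a list of strings (from native text extraction or OCR); the names `PAGE_MARKER` and `group_pages` are made up for illustration. The pattern treats the word "page" as optional, and a page with no readable marker is simply kept with the invoice currently being built.

```python
import re

# Matches "page 1 of 2", "Page 1 of 2" and a bare "1 of 2" (the word "page" is optional).
PAGE_MARKER = re.compile(r"(?:page\s*)?(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def group_pages(page_texts):
    """Group page indices into invoices using 'N of M' markers.

    A page whose marker reads '1 of M' starts a new invoice. A page with no
    readable marker (for example a scanned page) is attached to the invoice
    currently being built, or starts one if none is open yet.
    """
    invoices = []
    current = []
    for index, text in enumerate(page_texts):
        match = PAGE_MARKER.search(text or "")
        starts_new = match is not None and int(match.group(1)) == 1
        if starts_new and current:
            invoices.append(current)
            current = []
        current.append(index)
    if current:
        invoices.append(current)
    return invoices


pages = [
    "Invoice 1001 Page 1 of 2",
    "",                          # scanned page, no text extracted
    "Invoice 1002 page 1 of 1",
    "Invoice 1003 1 of 2",       # marker without the word "page"
    "Invoice 1003 2 of 2",
]
print(group_pages(pages))  # [[0, 1], [2], [3, 4]]
```

Two caveats: if the first page of an invoice is the unreadable one, it gets merged into the previous invoice, and a loose pattern like this can also match unrelated text such as "3 of 5 items", so it would need tightening against real documents.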

Hi @john_smith2 ,
I think there is some marker where a new bill starts, is that right?
We can use it to separate the invoices, since the invoices have different lengths.
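
As a rough illustration of splitting on a marker, the Python sketch below starts a new group whenever the detected invoice number changes. The label format ("Invoice No: 1001", "Invoice # A-17") and the names `INVOICE_NUMBER` and `split_by_invoice_number` are assumptions, not something taken from the actual PDF, so the pattern would have to be adapted to the real documents.

```python
import re

# Assumed label format such as "Invoice No: 1001" or "Invoice # A-17".
# Adjust the pattern to whatever marker your documents actually contain.
INVOICE_NUMBER = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Za-z0-9-]+)", re.IGNORECASE
)


def split_by_invoice_number(page_texts):
    """Return a list of (invoice_number, [page indices]) pairs.

    A new group starts whenever the detected invoice number differs from the
    previous one. Pages with no detectable number stay with the current group.
    """
    groups = []
    for index, text in enumerate(page_texts):
        match = INVOICE_NUMBER.search(text or "")
        number = match.group(1) if match else None
        if groups and (number is None or number == groups[-1][0]):
            groups[-1][1].append(index)
        else:
            groups.append((number, [index]))
    return groups


pages = [
    "Invoice No: 1001 total 50",
    "Invoice No: 1001 continued",
    "",                            # unreadable page, kept with invoice 1001
    "Invoice # A-17 total 9",
]
print(split_by_invoice_number(pages))
# [('1001', [0, 1, 2]), ('A-17', [3])]
```

Grouping on a value that changes per invoice, rather than on a fixed keyword, matters here because a word like "invoice" may appear on every page.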

You can use an intelligent keyword classifier to handle this case. It can be configured via training, where you show the classifier what the individual invoice formats might look like. Also, if page numbers are present, they can be utilised to do the splitting.
Refer to this page for more info:

OK, I will give this a try, as I have never used it before.

Hi John,

You have one PDF that contains multiple invoices, and each invoice is 1, 2, or 3 pages long.

Here’s how you can build a DU process to handle your problem.

  1. First, the model needs to learn how to split a PDF into separate invoices. You can achieve this with a classification model such as intelligent keyword classification.
  2. Once you have trained an intelligent keyword classifier, you can split your PDF into invoices.
  3. Now you have a set of invoices that the model split from your PDF, and you can pass each invoice into an ML invoice model to extract data from it.

Hope this helps. You can find many tutorials on how to train a classifier model and split documents with it.
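
The split-then-extract flow in the steps above can be sketched in Python as follows. This is only an outline of the logic, not the DU activities themselves: the per-page flags stand in for whatever the classifier reports as "first page of an invoice", and `extract` is a hypothetical callable standing in for the invoice extraction model.

```python
def to_page_ranges(starts_new_invoice):
    """Turn per-page 'first page of an invoice' flags into 1-based inclusive (first, last) ranges."""
    ranges = []
    for page_number, is_first in enumerate(starts_new_invoice, start=1):
        if is_first or not ranges:
            ranges.append([page_number, page_number])
        else:
            # Continuation page: extend the invoice that is currently open.
            ranges[-1][1] = page_number
    return [tuple(r) for r in ranges]


def process(starts_new_invoice, extract):
    """Run the hypothetical `extract` callable once per invoice page range."""
    return [extract(first, last) for first, last in to_page_ranges(starts_new_invoice)]


# Six pages: a 2-page invoice, a 1-page invoice, then a 3-page invoice.
flags = [True, False, True, True, False, False]
print(to_page_ranges(flags))  # [(1, 2), (3, 3), (4, 6)]
```

Keeping the range computation separate from the extraction makes it easy to check the split on its own before any data extraction runs.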

Will the ML classifier ever be able to do this?

The intelligent keyword classifier is not good enough at this, to the point that we had to abandon it for our project and manually split the pages for the ML classifier to handle.

Yes, I am trying that now. To train the intelligent keyword classifier, should I pass individual invoices or the PDF with multiple invoices? Also, in the Present Validation Station, should I provide a reference to the second page as well?