I would like to know how to extract multiple subtotals per page from a PDF invoice. There is only one machine learning extractor for ‘total’, and the other extractors for numeric amounts are not picking up the other totals. I’ve tried using the utility bills endpoint, but the other numeric options for utility bills don’t pick up the additional totals either.
Not that I’ll be able to solve the problem, but what kind of data are we working with here?
- scanned image of a print/fax
- fully digital file, such as “print to pdf” (text can be natively selected with a cursor)
- fully standard form (has structured/tagged input fields)
Additionally, do all the invoices look the same, or can they vary by page or by vendor?
The answer to your question depends heavily on this. Ideally you have at least a digital file in a standardized format. If so, you may be able to read the text from the PDF and parse it with regex or other means. If not (scanned images, or layouts that change from vendor to vendor), you will need more complex solutions.
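To make the regex route concrete, here is a minimal sketch, assuming the text has already been pulled out of a digital PDF into a plain string. The label pattern, the sample text, and the `extract_totals` name are all made up for illustration and are not part of any extractor product. It would not work on a scanned image without an OCR step first, and the pattern would likely need tuning per vendor.

```python
import re
from decimal import Decimal

# Hypothetical pattern: any line whose label contains "total", "subtotal",
# or "sub-total" and that ends in an amount with two decimal places.
TOTAL_LINE = re.compile(
    r"^(?P<label>.*?\b(?:sub[\s-]?)?total\b.*?)"      # label text up to the amount
    r"[ \t:]*\$?(?P<amount>\d[\d,]*\.\d{2})[ \t]*$",  # optional "$", then the amount
    re.IGNORECASE | re.MULTILINE,
)


def extract_totals(text):
    """Return a list of (label, amount) pairs for every total-like line."""
    results = []
    for match in TOTAL_LINE.finditer(text):
        label = match.group("label").strip()
        # Drop thousands separators before converting to Decimal.
        amount = Decimal(match.group("amount").replace(",", ""))
        results.append((label, amount))
    return results


# Invented sample text standing in for one page of extracted PDF text.
SAMPLE = "\n".join([
    "Electric charges        120.50",
    "Electric Subtotal:      $120.50",
    "Gas Sub-total           $1,045.10",
    "Taxes                   12.00",
    "Total Amount Due        $1,177.60",
])

if __name__ == "__main__":
    for label, amount in extract_totals(SAMPLE):
        print(label, amount)
```

Because it returns every matching line rather than a single value, each subtotal on the page comes back with its own label, which is the part the single ‘total’ extractor misses.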
@j_run - You can try the regex-based extractor…
I did try using regex at first, but I was told the focus has to be on machine learning extractors, to account for invoices coming from different vendors in the future.
The PDFs are scanned images of a printed utility bill. My understanding is that the format is not standardized, since we expect invoices from other vendors as well.