Extracting tables with a varying number of items from PDF using Document Understanding

I am using Document Understanding (Form Extractor) to extract tables from PDF files. The number of items in the table varies: some PDFs have tables with 6 items, whereas others have tables with only 1 item.

So if I create a template based on the 6-item PDF, the table in the file that has only 1 item is not extracted properly:

4 items.xlsx (9.3 KB)

1 item.xlsx (9.0 KB)

In the Excel files above, the “4 items” file is the data extracted from the PDF on which the template was created, so it is extracted properly. For the other PDF, which contains only 1 item, the extraction is not correct: some of the headers are not extracted.

Is there any solution to this? Can I use an anchor here, or is that not possible?

Thank you for your time and help!

Hi @shrey.shah

Can you try the “Intelligent Form Extractor” and train it on the document that has multiple line items (6 items in your case)?

Then check the output for both files.

Thanks.

@suraj.setty I tried the Intelligent Form Extractor, but it still extracts incorrectly!

Hi,

Did you try the ML Extractor, providing the API key and endpoint?

Please find the link for endpoints

Thanks.

@suraj.setty Hi, as I mentioned in the question, all the PDFs have multiple pages, and the ML Extractor has a limit of 2 pages and 4 MB. So it won't accept PDF files with more than 2 pages.
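
For anyone hitting the same limit, a quick pre-check before sending files to the ML Extractor could look roughly like the Python sketch below. It is only an illustration: `PdfInfo` and `limit_violations` are hypothetical names, the 2-page and 4 MB figures are the ones quoted in this thread, reading "4 MB" as 4 × 1024 × 1024 bytes is an assumption, and the page count and file size have to be obtained separately (for example from a PDF library).

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted in this thread; treating "4 MB" as 4 * 1024 * 1024 bytes
# is an assumption, not a documented value.
MAX_PAGES = 2
MAX_BYTES = 4 * 1024 * 1024


@dataclass
class PdfInfo:
    """Hypothetical container for the two properties the limit depends on."""
    name: str
    page_count: int
    size_bytes: int


def limit_violations(pdf: PdfInfo,
                     max_pages: int = MAX_PAGES,
                     max_bytes: int = MAX_BYTES) -> List[str]:
    """Return a list of reasons why the file would exceed the limits.

    An empty list means the file is within both limits.
    """
    problems = []
    if pdf.page_count > max_pages:
        problems.append(
            f"{pdf.name}: {pdf.page_count} pages exceeds the {max_pages}-page limit"
        )
    if pdf.size_bytes > max_bytes:
        problems.append(
            f"{pdf.name}: {pdf.size_bytes} bytes exceeds the {max_bytes}-byte limit"
        )
    return problems


if __name__ == "__main__":
    # Example values are made up for illustration.
    for doc in (PdfInfo("one_page.pdf", 1, 200_000),
                PdfInfo("multi_page.pdf", 5, 900_000)):
        issues = limit_violations(doc)
        print(doc.name, "OK" if not issues else issues)
```

A check like this only tells you which files would be rejected up front; it does not change the extraction quality for files that are accepted.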

Hi @shrey.shah

Yes, that limit applies to the Community Plan.

If possible, you can request an “Enterprise Trial” and try extracting with the ML Extractor.

I tried the ML Extractor on an invoice that had only 1 page, but it gives the same problem there too @suraj.setty

If you have an Enterprise plan, you can use a combination of AI Center and Document Understanding for more accurate results.

@suraj.setty AI Center is used for training custom ML models, right? Do I really need the Enterprise version for that?

Hi @shrey.shah

Yes, an Enterprise license is required to train an ML model.

Thanks.