I am not able to extract a table using Form Extractor

Hi all,
I am trying to use the Document Understanding activities to read invoices. It works perfectly with the “Form Extractor”, but my issue is extracting a table with more than one row. I have tried the ML model, but it is not working. Is it not possible to extract a table with the “Form Extractor”? Please find the attached workflow, which has everything. I would appreciate it if anyone could help me figure it out… Thanks. DocumentUnderstanding_Invoices.zip (2.3 MB)

Hello @menna_almahdy,

You can use the Form Extractor to extract tables as well, provided they have a fixed height, fixed columns, and a fixed number of rows.

All you need to do is click on the table field and mark the table as you would for regular processing (use the three-dot menu on the right side of the header row to access Extract New Table).

Then click save, and your template should be saved.

Ioana


I cannot get the table extractor in the Form Extractor to work effectively. It does not pick up all the values and, for some reason, seems to reposition the capture area for the table incorrectly, missing the bottom third. Is anyone an expert in this function?

Hi,
Does that mean that you can’t use the Form Extractor for invoices, since the number of items differs per invoice?

Yes, unless you want to configure very specific invoices that have a fixed table area… in this case, either form extractor (if lines have the same height) or regex extractor could be used for the line items.
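To illustrate the regex idea for line items, here is a minimal Python sketch run outside UiPath on already-digitized text. The line layout (description, quantity, unit price, line total), the pattern, the sample text, and the name `parse_line_items` are all assumptions for illustration; this is not how the regex extractor activity is configured.

```python
import re
from decimal import Decimal

# Assumed layout of one line item: "<description> <qty> <unit price> <line total>"
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)

def parse_line_items(text):
    """Return one dict per line that matches the assumed line-item layout."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match is None:
            continue  # header, subtotal, or anything else that does not fit
        items.append({
            "description": match["description"],
            "qty": int(match["qty"]),
            "unit_price": Decimal(match["unit_price"]),
            "total": Decimal(match["total"]),
        })
    return items

# Hypothetical digitized text of a table area
sample = """Description Qty Price Total
Widget A 2 10.00 20.00
Blue gasket, large 10 1.50 15.00
Subtotal 35.00"""

items = parse_line_items(sample)
```

Because the pattern matches line by line, the number of rows does not need to be fixed, but the column order and number formats do.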

Do remember form extractor is designed to process fixed form documents at this stage, imagine a W-4 IRS form for example. To process invoices or other document types of variable format, we strongly recommend using the machine learning extractor.


Could you share a sample file and template? (Please include the .local folder in the project archive as well, so the templates can be seen.)

Hi, I didn’t see the reply as it was several days after my post. Are you able to help with the form extractor? It is a standard, unchanging structure, but I cannot get it read by the Form extractor. At the moment I solve the problem by ingesting it into Adobe or Abbyy and exporting it as HTML.

I have been working with the Form Extractor and have run multiple tests, and I seem to have the same problem. When the template is created, it recognizes all columns of a fixed table. But when the DU flow actually runs, only the first few columns of each table are extracted.

Here is a screenshot from the Validation Station. As you can see, the last 5 columns of each of the two tables are not extracted.

And this is the Template Mapping, which shows a more reassuring picture that does not hold up when the flow actually runs.

This PDF has many pages and is therefore too large for the free limits of the Form Extractor. So I took a screenshot of one page and saved it as a JPEG image. The template was created based on this image.

Ok, assuming that this forum thread is still live and responsive, I will post some of my findings over the past couple of days, and I hope this is helpful to others.

The Challenge:

The table structures in this PDF are not strictly conventional and this may be a contributing factor. Here is what I mean:

Originally I used the Tesseract OCR Engine to digitize the document and also to create a template for the Form Extractor. I had no success. :frowning:

One Possible Workaround:

  1. First, I changed all my columns in the Taxonomy from Number to Text datatype
  2. Second, I used the Microsoft OCR Engine in Scan Mode with Scale Factor 2 for creating the template
  3. Third, I modified my template to scan the table along its original lines and that means each of my columns will contain two data points - that is why I had to change my datatypes in step 1 to Text

Cons of the Solution
With these changes in place, I was able to extract all the data points, but the compromise is that I have to split the data in each column downstream.
This is how the extracted data looks in my Excel - each table resembles its source from the PDF above :neutral_face:
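For anyone facing the same downstream split, here is a minimal Python sketch of the idea. It assumes each extracted Text cell holds exactly two numbers separated by whitespace and that commas are thousands separators, which may not match your data. The function name `split_pair`, the column names, and the sample values are made up for illustration.

```python
from decimal import Decimal, InvalidOperation

def split_pair(cell):
    """Split a Text cell holding two numbers into a (first, second) tuple.

    Returns None when the cell does not contain exactly two parseable numbers,
    so bad OCR reads can be flagged instead of silently guessed.
    """
    # Assumption: commas are thousands separators, not decimal marks
    parts = cell.replace(",", "").split()
    if len(parts) != 2:
        return None
    try:
        return Decimal(parts[0]), Decimal(parts[1])
    except InvalidOperation:
        return None

# Hypothetical extracted rows, one dict per table row
rows = [{"Col1": "1,250.00 3.5", "Col2": "40 12.75"}]
split_rows = [
    {name: split_pair(value) for name, value in row.items()}
    for row in rows
]
```

Cells that come back as `None` are the ones worth checking by hand in the Validation Station.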

Today, I followed the same process and edited the template for the Form Extractor. But this time I tried to extract each of the 9 data points individually from a different part of the PDF.

:grey_exclamation: Important: I used the Microsoft OCR Engine in Scan Mode with Scale 2 for the template, as before!

This time I was able to extract each of the individual data points from another table on the same page. The Validation Station shows all data points extracted individually :slight_smile:

To Summarize:

  • Two different OCR Engines were used - OmniPage to digitize the documents, Microsoft to create the templates
  • Microsoft OCR in Scan Mode with a Scale set to 2 seems to work with consistent results

Hope this helps.

Cheers :+1:


Hi,
I am also trying to extract a table from an image, but when I click on Save New Table nothing happens. Please help.

Hello Menna,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with multiple words in the last column
27:00 File 5 PDF with multiple words in an inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF, removing spaces from headers and from data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

Thanks,
Cristian Negulescu