I am not able to Extract a table using Form Extractor

menna_almahdy · June 25, 2020, 8:38am

Hi all,
I am trying to use the Document Understanding activities to read invoices. It works perfectly with "Form Extractor", but my issue is extracting a table with more than one row. I have tried the ML model, but it's not working. Is it not possible to extract a table with "Form Extractor"? Please find the attached workflow, which has everything. I would appreciate it if anyone could help me figure this out. Thanks. DocumentUnderstanding_Invoices.zip (2.3 MB)

Ioana_Gligan · July 10, 2020, 4:21pm

Hello @menna_almahdy,

You can use Form Extractor to extract tables as well, provided they have a fixed height, fixed columns, and a fixed number of rows.

All you need to do is click the table field and mark the table as you would for regular processing (in the header row, use the three-dot menu on the right side to access Extract New Table).

Then click save, and your template should be saved.
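
To illustrate why the table needs a fixed height, fixed columns, and a fixed number of rows, here is a conceptual Python sketch of template-style extraction. It is not how UiPath implements the Form Extractor; the `Word` class, the `extract_fixed_table` helper, and all coordinates are assumptions made up for this example. The idea is only that each OCR word is assigned to a cell by position, so anything that falls outside the fixed grid is lost.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal centre of the OCR word box (assumed units: pixels)
    y: float  # vertical centre of the OCR word box


def extract_fixed_table(words, left, top, col_widths, row_height, n_rows):
    """Assign OCR words to the cells of a fixed grid (hypothetical helper)."""
    # Column boundaries, measured from the table's left edge.
    edges = [left]
    for width in col_widths:
        edges.append(edges[-1] + width)

    table = [[[] for _ in col_widths] for _ in range(n_rows)]
    # Process words in reading order so multi-word cells keep their order.
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        row = int((word.y - top) // row_height)
        if not 0 <= row < n_rows:
            continue  # outside the template area, so the word is dropped
        for col in range(len(col_widths)):
            if edges[col] <= word.x < edges[col + 1]:
                table[row][col].append(word.text)
                break
    return [[" ".join(cell) for cell in row] for row in table]


words = [
    Word("Widget", 20, 15), Word("2", 120, 15),
    Word("Gadget", 20, 35), Word("5", 120, 35),
    Word("Extra", 20, 55),  # a third row the template does not know about
]
rows = extract_fixed_table(words, left=0, top=10,
                           col_widths=[100, 50], row_height=20, n_rows=2)
```

In this sketch the word "Extra" sits below the two rows the grid covers, so it never reaches the output. A document with a variable number of rows runs into the same limit.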

Ioana

ChrisC · September 3, 2020, 7:34pm

I cannot get the table extractor in the Form Extractor to work effectively. It cannot pick up all the values and, for some reason, seems to reposition the capture area for the table incorrectly, missing out the bottom third. Is anyone an expert in this function?

JC_Lim · September 11, 2020, 11:48pm

Hi,
Does that mean that you can’t use Form Extractor for invoices, since the number of items differs per invoice?

Ioana_Gligan · September 15, 2020, 8:12am

Yes, unless you want to configure very specific invoices that have a fixed table area… in this case, either form extractor (if lines have the same height) or regex extractor could be used for the line items.

Do remember form extractor is designed to process fixed form documents at this stage, imagine a W-4 IRS form for example. To process invoices or other document types of variable format, we strongly recommend using the machine learning extractor.
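
As a rough illustration of the regex approach for line items, here is a minimal Python sketch. It is plain Python, not the configuration of the UiPath Regex Based Extractor, and the line layout (description, quantity, unit price, line total separated by spaces), the `LINE_ITEM` pattern, and the `parse_line_items` helper are all assumptions for this example. Note that .NET regular expressions write named groups as `(?<name>...)` rather than Python's `(?P<name>...)`.

```python
import re

# Assumed layout of one line item: description, quantity, unit price, total.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+"
    r"(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+"
    r"(?P<total>\d+\.\d{2})$"
)


def parse_line_items(text):
    """Return one dict per line that matches the assumed line-item layout."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append(match.groupdict())
    return items


sample = (
    "Description Qty Price Total\n"
    "Blue widget 2 10.00 20.00\n"
    "Gadget 1 5.50 5.50\n"
    "Subtotal 25.50"
)
items = parse_line_items(sample)
```

Because every line is matched on its own, the number of line items can vary from invoice to invoice. The trade-off is that the pattern only works for invoices that actually follow the assumed layout.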

Ioana_Gligan · September 15, 2020, 1:00pm

Could you share a sample file and template? (Please include the project's .local folder in the archive as well, so that the templates are visible.)

ChrisC · September 19, 2020, 7:40am

Hi, I didn’t see the reply as it came several days after my post. Are you able to help with the Form Extractor? The document has a standard, unchanging structure, but I cannot get the Form Extractor to read it. At the moment I solve the problem by ingesting it into Adobe or Abbyy and exporting it as HTML.

AndyMenon · January 14, 2021, 5:20am

I have been working with the Form Extractor and have run multiple tests, and I seem to have the same problem. When the template is created, it recognizes all columns of a fixed table. But when the DU flow actually runs, only the first few columns of each table are extracted.

Here is a screenshot from the Validation Station. As you can see, the last 5 columns of each of the two tables are not extracted.

And this is the template mapping, which shows a more reassuring picture that does not hold up when the flow runs.

This PDF has many pages and is therefore too large for the free limits of the Form Extractor, so I took a screenshot of one page and saved it as a JPEG image. The template was created from this image.

AndyMenon · January 16, 2021, 12:15am

Ok, assuming that this forum thread is still live and responsive, I will post some of my findings over the past couple of days, and I hope this is helpful to others.

The Challenge:

The table structures in this PDF are not strictly conventional and this may be a contributing factor. Here is what I mean:

Originally I used the Tesseract OCR Engine to digitize the document and also to create a template for the Form Extractor. I had no success.

One Possible Workaround:

First, I changed all my columns in the Taxonomy from the Number to the Text datatype.
Second, I used the Microsoft OCR Engine in Scan Mode with Scale Factor 2 for creating the template.
Third, I modified my template to scan the table along its original lines, which means each of my columns contains two data points. That is why I had to change my datatypes to Text in the first step.

Cons of the Solution:
With these changes in place, I was able to extract all the data points, but the compromise is that I have to split the data in each column downstream.
This is how the extracted data looks in Excel. Each table resembles its source in the PDF above.
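
For anyone facing the same downstream split, here is a minimal Python sketch of the idea. It assumes that each extracted cell holds exactly two values separated by whitespace (a space or a line break); the thread does not show the actual separator, and `split_cell` and `split_rows` are hypothetical helper names.

```python
def split_cell(cell):
    """Split a cell holding two values into a (first, second) pair."""
    parts = cell.split()  # splits on spaces and line breaks alike
    if len(parts) != 2:
        raise ValueError(f"expected two values, got {parts!r}")
    return parts[0], parts[1]


def split_rows(rows):
    """Expand every combined column into two separate columns."""
    result = []
    for row in rows:
        expanded = []
        for cell in row:
            expanded.extend(split_cell(cell))
        result.append(expanded)
    return result


combined = [["12 3.5", "7\n0.25"]]
separated = split_rows(combined)
```

The values stay as text here, in line with the Text datatype used in the Taxonomy. Converting them to numbers would be a separate step once the split is verified.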

Today, I followed the same process and edited the template for the Form Extractor. But this time I tried to extract each of the 9 data points individually from a different part of the PDF.

Important: as before, I used the Microsoft OCR Engine in Scan Mode with Scale 2 for the template.

This time I was able to extract each of the individual data points from another table on the same page. The Validation Station shows all data points extracted individually.

To Summarize:

Two different OCR Engines were used - OmniPage to digitize the documents, Microsoft to create the templates
Microsoft OCR in Scan Mode with a Scale set to 2 seems to work with consistent results

Hope this helps.

Cheers

sailesh.tiwari · February 24, 2021, 10:10am

Hi ,
I am also trying to extract a table from an image, but when I click on save new table nothing happens. Please help.

Cristian_Negulescu · March 1, 2021, 7:21am

Hello Menna,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with a column with multiple words in the last column
27:00 File 5 PDF with a column with multiple words in inner columns (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF remove spaces from headers also remove space from Data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

github.com/cristinegulescu/startUiPathFromSalesforce/blob/master/PDFdecode.txt

        'FILE1
        Dim strtmp As String
        strtmp = strin.Substring(strin.IndexOf("Number"), strin.IndexOf("Subtotal") - strin.IndexOf("Number")).Trim
        strout = strtmp.Replace(" ", "|")

        strtmp = strin.Substring(strin.IndexOf("Subtotal") + 8)
        strpar = strtmp.Substring(0, strtmp.IndexOf(Environment.NewLine)).Trim


        'FILE2
        Dim strtmp As String
        Dim strout As String
        strout = "Col1|Col2|Col3|Col4"
        strtmp = strin.Substring(strin.IndexOf("Vacancies") + 11).Trim
        For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
            If (line.Length > 3) Then
                If (IsNumeric(line(0))) And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + Environment.NewLine + line.Replace("  ", "").Replace("  ", "|").Trim
                ElseIf (line(0) = "") And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + line.Replace("  ", "$").Trim()

This file has been truncated.
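
For readers who do not work in VB.NET, the FILE1 block above can be sketched in Python roughly as follows. This is a reading of the visible lines only, not the author's code: `decode_file1` is a hypothetical name, the sample text is made up, and the names `strin`, `strout`, and `strpar` are kept from the original, where they appear without declarations.

```python
def decode_file1(strin):
    """Python reading of the FILE1 block: table text plus the subtotal value."""
    start = strin.index("Number")    # the table starts at the "Number" header
    end = strin.index("Subtotal")    # and ends just before "Subtotal"
    # Pipe-delimit the table text by replacing every space with "|".
    strout = strin[start:end].strip().replace(" ", "|")
    # The subtotal is whatever follows "Subtotal" up to the end of that line.
    rest = strin[end + len("Subtotal"):]
    strpar = rest.split("\n", 1)[0].strip()
    return strout, strpar


text = (
    "Invoice\n"
    "Number Item Price\n"
    "1 Pen 2.00\n"
    "2 Ink 3.50\n"
    "Subtotal 5.50\n"
    "Thanks\n"
)
strout, strpar = decode_file1(text)
```

One behavioural difference: if no line break follows "Subtotal", this sketch returns the rest of the text, whereas the VB.NET `Substring` call would throw because `IndexOf` returns -1. As in the original, replacing every space with "|" only works when no cell value contains a space, which is why the later files in the video need more elaborate rules.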

Thanks,
Cristian Negulescu
