Item Number not recognized - Invoice Document Understanding

I’m trying to extract data from an invoice PDF using Document Understanding.

Invoice.pdf (58.6 KB)

It is not recognizing the “Item Number” Column as a separate field.

Thank you.

@Charbel1 Can you let us know the details of the Extractor you are using and the Taxonomy details that were configured for the Item Number column?

If you are using an invoice template, have drawn the column line after the Item_Number column, and have specified the column name, it should work.

Another option: try an alternate OCR engine with a scale factor of 2 in the Digitization step. For example, the Microsoft OCR engine sometimes works better with Scale set to 2 in the OCR engine properties.

Yes sure,

I tried using part-no and item-po-no, they didn’t work either.

We’re working with multiple styles of invoices, and we get the item number issue in several of them.

And unfortunately, the scale didn’t solve the problem…

Thanks for your suggestions anyway!

I just noticed in your screenshot that “123” is appearing under the description column.

There is a text view in the Present Validation Station that displays the raw text of the OCR scan. Maybe that could reveal why your first Item Number column is out of step.

Here it is.

I have gotten this far. And this is what I see on my end in the Text mode

Okay, I was able to get the data extracted, but the Present Validation Station is not working on my end. I don’t know if this is the case on your end as well.
I can see that the information is being extracted successfully, but when I bring up the Present Validation Station, it shows that no data has been extracted.

My previous post led me down this path, because in the Text view I am able to see the OCR data extracted. So I output extractedResults (the output of the Extractor) instead of validatedExtractedResults (the output of the PV Station) and found that the Form Extractor is indeed pulling the values from your invoice image correctly.

I have included the json of the ExtractedResults object and you can search for the string “Value=” and you will see all of your values extracted as they are supposed to be.
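If you would rather not search the attachment by hand, here is a minimal Python sketch of the same "Value=" search. It assumes each extracted value appears in the dump as `Value=<text>` and ends at a comma, a closing brace, or the end of the line; the real layout of the attached file may differ, and the sample string and the `find_values` helper are made up for illustration.

```python
import re
from typing import List

# Assumption: each value is written as `Value=<text>` and is terminated by
# a comma, a closing brace, or a line break. Adjust the pattern if the
# actual dump uses a different layout.
VALUE_PATTERN = re.compile(r"Value=([^,}\r\n]*)")


def find_values(dump_text: str) -> List[str]:
    """Return every non-empty string that follows 'Value=' in the dump."""
    return [m.strip() for m in VALUE_PATTERN.findall(dump_text) if m.strip()]


if __name__ == "__main__":
    # Hypothetical sample, not copied from the attached results file.
    sample = (
        "{FieldName=Item_Number, Value=123, Confidence=0.98}\n"
        "{FieldName=Total, Value=45.00}"
    )
    for value in find_values(sample):
        print(value)
```

To run it against the attachment, read the file into a string first and pass that string to `find_values`.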

C1234_invoice_form_extractor_results.txt (9.8 KB)

I skipped the PV station and extracted data directly from the ExtractedResults and output them to the Excel file. I have included that as well in this post.

Invoice_Number_Image_png_results.xlsx (8.9 KB)

Some of the things I did to prepare the template are:

I downloaded the image file of your invoice and resized it to 300%.

For digitization I used the OmniPage OCR with Profile set to “Screen” and Scale = 2. If I set it to anything else, the two highlighted values below aren’t extracted correctly. You can also see that in the image from my previous post.

For the Form Extractor, I enabled the “Force OCR” option because I’m working with an image scan. Here is what the Form Extractor template looks like:

Finally, I have attached the exported Form Extractor template, which includes the invoice image file. I don’t know if it will work on your end. Given that my Document Id in the taxonomy is different (Finance.Accounts.Invoices), importing it into your environment may not be straightforward. But if you open each of the files in the attached zip file and replace the Document Id with yours before you import the template, that might work. I hope there is an easier way.
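The replace-the-Document-Id step could be scripted rather than done file by file. This is only a rough sketch and I have not tried it against a real import: it assumes the Id is stored as plain UTF-8 text inside the files of the zip, and the function name, the paths, and the target Id in the usage comment are placeholders.

```python
import zipfile


def rewrite_document_id(src_zip, dst_zip, old_id, new_id):
    """Copy src_zip to dst_zip, replacing old_id with new_id in every member.

    Returns the number of members that contained old_id.
    Assumption: the Id is stored as UTF-8 text. The replacement is done on
    raw bytes, so members that do not contain the Id (for example the
    invoice image) are copied through unchanged.
    """
    old_bytes = old_id.encode("utf-8")
    new_bytes = new_id.encode("utf-8")
    changed = 0
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if old_bytes in data:
                data = data.replace(old_bytes, new_bytes)
                changed += 1
            dst.writestr(info.filename, data)
    return changed


# Example call (output path and target Id are placeholders):
# rewrite_document_id(
#     "T_C1234_Invoice_PNG_WORKS.zip",
#     "T_C1234_Invoice_PNG_WORKS_patched.zip",
#     "Finance.Accounts.Invoices",
#     "Your.Group.Invoices",
# )
```

It writes a new zip and leaves the original untouched, so you can compare the two if the import still fails.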

T_C1234_Invoice_PNG_WORKS.zip (1.9 KB)

I hope this helps. Now I need to find out why the PV Station is breaking!

Thank you very much Andy! The Form Extractor works perfectly.

We still have the issue with the Machine Learning Extractor. Do you have any idea why Machine Learning is giving wrong results?

Got it. But could it be the same situation with the ML Extractor as well? Is the ExtractionResult object populated with the information, but the PV Station is not able to render it visually?

I tried to do it on my end and, for the love of life, your invoice is not rendering in the PV Station! I thought something was broken on my end, but when I ran a much more complex PDF it worked as it should.

This might sound silly, but is it common for Invoices to have two columns named “Total”? I see two in there - one at the line item level and another at the invoice level. Could this be a factor? Sometimes the most unexpected things could be the cause of the larger pains.

Thank you for your insights. We tried changing the Total column, but that didn’t work either.

Can you please send us the complex PDF that you’ve used, maybe it could help us find the issue?

Hello @Charbel1 ,

The PDF isn’t mine to share openly. I will see if I can get permission from the source before I send it to you.
