Unable to extract data from Invoices

Aishwarya_Bhargava · October 12, 2020, 2:07pm

I am trying to extract the following data:

InvoiceNo
InvoiceDate
Order Information
- ItemNo
- Description
- Quantity
- Price
SubTotal
GST
Total

I tried using Document Understanding, but it is not working: it does not recognize any of the attributes. If I use Get Text instead, how will I get the table data?

Can anyone please help me out? I am attaching sample invoices:
- JaneDoe_01092020_130792.pdf (17.6 KB)
- JoeyTribbiani_01102020_281092.pdf (25.7 KB)
- MonicaGeller_09052020_87654.pdf (15.9 KB)
- RachelGreen_04042020_40874.pdf (20.6 KB)

ghazanfar · October 12, 2020, 2:34pm

Hi Aishwarya,

First use CV Screen Scope, then use CV Get Text. The CV activities are reliable as long as the position of the elements in your PDF does not change.

Hope it works.

Aishwarya_Bhargava · October 12, 2020, 2:35pm

How will I get the data from the table? I also need to add all of this data to one Excel file.

RPAForEveryone · October 12, 2020, 2:35pm

Hi @Aishwarya_Bhargava
Can you please share your current approach?
Have you used IntelligentOCR activities or the Document Understanding module available through ML models on Cloud?

Aishwarya_Bhargava · October 12, 2020, 2:41pm

So far I have tried two approaches:

- Document Understanding
- Data scraping

(Both failed.)

Project2.zip (89.9 KB)

ghazanfar · October 12, 2020, 2:45pm

Aishwarya,

Instead of adding more complexity, simply use the Computer Vision activities: install the CV activities package in Studio, then use CV Screen Scope followed by CV Get Text. I have looked at your PDF files and their layout is stable, so there is no need for a complex method such as ML models on Cloud.

Aishwarya_Bhargava · October 12, 2020, 2:46pm

Will the Get Text activity work for the table I need to get data from? One of the tables spans 2 pages.

RPAForEveryone · October 12, 2020, 2:51pm

I asked about ML models on Cloud for the same reason @ghazanfar stated: these documents do not warrant a complicated approach.

That said, CV activities are designed for remote environments such as Citrix.

IntelligentOCR, on the other hand, can produce stellar results without much upfront effort or time investment.
You can also map tables for extraction with relative ease. See this section for a demo of table selection during IntelligentOCR training.

RPAForEveryone · October 12, 2020, 2:57pm

Also, you have all the resources you need to get started (including the project that you can readily use!)

Aishwarya_Bhargava · October 12, 2020, 2:59pm

The problem I am facing with Document Understanding is that none of the data is being recognized and captured, so there is nothing to verify in the human validation step.

ghazanfar · October 12, 2020, 3:02pm

Just try it.

Aishwarya_Bhargava · October 12, 2020, 3:06pm

Okay, I will try it and share the output.

RPAForEveryone · October 12, 2020, 3:13pm

Hi @Aishwarya_Bhargava

There is nothing wrong with your code/approach.

You just need to make one change.

The output of the Extraction Scope (extractionResults) is not being passed to the Validation Station.
I made that one change and could see results as shown above.

The ML extractor (or any other extractor which you use to train) should then pass the extraction details to validation. Otherwise, the extraction step is basically not doing anything productive as its results are never used.
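To make the wiring issue concrete, here is a minimal, tool-agnostic sketch in Python. It is not UiPath code: `ExtractionResult`, `extract`, and `validation_station` are hypothetical stand-ins, and the sample values are made up. It only illustrates why validation shows nothing when the extraction output is never handed over.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    """Hypothetical container for extracted field values."""
    fields: dict = field(default_factory=dict)


def extract(document_text: str) -> ExtractionResult:
    """Toy extractor: collects 'Key: value' lines from the text."""
    result = ExtractionResult()
    for line in document_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result.fields[key.strip()] = value.strip()
    return result


def validation_station(result: ExtractionResult) -> dict:
    """Stand-in for human validation: it can only show what it is given."""
    return dict(result.fields)


# Made-up sample text, not taken from the attached invoices.
text = "InvoiceNo: INV-001\nTotal: 118.00"

# Broken wiring: extraction runs, but its output is discarded, so
# validation receives an empty result and has nothing to display.
extract(text)
shown_broken = validation_station(ExtractionResult())

# Fixed wiring: the extraction output is passed on to validation.
extraction_results = extract(text)
shown_fixed = validation_station(extraction_results)
```

In the broken case the extractor did its work, yet the reviewer sees an empty form, which matches the symptom described earlier in the thread.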

I hope that clears the issue you have.

Aishwarya_Bhargava · October 13, 2020, 5:34am

Yes, that clears up a lot of things, thank you.
But I have a question: how can I extract the table? I want the information from all of its rows, and in some cases the table is spread across 2 pages.

RPAForEveryone · October 13, 2020, 11:24am

In my reply above, I have quoted the link to Intelligent Form Extractor.
Please scroll down to the section Configuring a template with table selection.

The gifs provided in this section explain how to easily extract a table.
I haven’t worked much with multi-page documents. Going by the info in the same article, though, you should be able to use ‘Page 1 Matching Info’ and ‘Page 2 Matching Info’ to your advantage.
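Whichever extractor is used, a table split over two pages usually comes back as one set of rows per page. As a rough sketch of the post-processing step (plain Python, not a UiPath activity; the row layout, the `merge_page_tables` and `to_csv` helpers, and the sample rows are all assumptions for illustration), the rows can be merged and written to a CSV file that Excel can open:

```python
import csv
import io

# Column names taken from the fields listed in the original question.
HEADER = ["ItemNo", "Description", "Quantity", "Price"]


def merge_page_tables(pages):
    """Concatenate per-page row lists, skipping repeated header rows."""
    merged = []
    for rows in pages:
        for row in rows:
            if row == HEADER:
                continue  # the header is often repeated on each page
            merged.append(row)
    return merged


def to_csv(invoice_no, rows):
    """Return CSV text with the invoice number repeated on every row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["InvoiceNo"] + HEADER)
    for row in rows:
        writer.writerow([invoice_no] + row)
    return buffer.getvalue()


# Made-up rows standing in for what an extractor might return per page.
page_1 = [HEADER, ["1", "Widget", "2", "10.00"]]
page_2 = [HEADER, ["2", "Gadget", "1", "5.50"]]

rows = merge_page_tables([page_1, page_2])
csv_text = to_csv("INV-001", rows)
```

Repeating the invoice number on each line item keeps everything in one flat sheet, so rows from several invoices can be appended to the same file.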

In my experience, when the documents get complicated, such as tables spanning across multiple pages, an advanced data extraction platform such as ABBYY or UiPath’s Document Understanding module is ideal to achieve maximum accuracy.

IntelligentOCR is still developing, but some of its limitations will probably stay the same, given that the Document Understanding piece is quite capable of achieving these results accurately.

@Ioana_Gligan May we have an expert weigh in here?
Cheers!

Aishwarya_Bhargava · October 15, 2020, 10:49am

I got the expected result using Document Understanding.

Thank you, everyone, for the help.

system · October 18, 2020, 10:49am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
