PDF - Invoice Data extraction only of product name and Quantity

Rahulsinha · February 19, 2020, 4:00pm

Hi Team,

How are you?
I have a query about some automation I am trying to do on CSR.

I have to extract details such as item description, quantity and price from a PDF file. Can you please advise how IntelligentOCR can help me with this? I am trying to use it, but I am not able to create the workflow.

If you have any other suggestion for me on how to extract the details, please let me know.

Below is an example of how the data appears in the PDF; the layout may vary.

@lakshman
@Palaniyappan
@RishiVC1

Lahiru.Fernando · February 19, 2020, 4:26pm

Hi @Rahulsinha

I think you can make use of Intelligent OCR for this one. It has Position Based extractors, which you can easily configure to capture the data in the grids if the format of the document is always the same.

The first thing you need to do is create a template using the taxonomy manager and classify the document. Next, use the position based extractors if the structure is the same. You can also try the machine learning extractor to see how well it extracts your fields…

For how to build the workflow, you can refer to these sample workflows and create your own…

Rahulsinha · February 19, 2020, 4:40pm

@Lahiru.Fernando
Thanks. I have two queries:
1- How do I use the Taxonomy Manager? I have read about it in the portal, but I am not clear on what needs to be done next after it creates the JSON file. Can you please explain in detail?

2- If the extraction is position based, it might be a problem, because different PDFs may have different formats. In that case, how will it pick up the data?

Thanks!

Lahiru.Fernando · February 19, 2020, 4:50pm

@Rahulsinha

You can get a much better understanding of how to use Intelligent OCR and build workflows if you follow the 2019.04 updates course in the UiPath Academy. It explains well how to use the taxonomy manager and how to build Intelligent OCR workflows…

For your second question, if the position changes from time to time along with the format, I would suggest trying the Machine Learning extractor, the regex based extractor, or a combination of these…

Go through the course first to get a better understanding…
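
To make the regex idea a bit more concrete outside of Studio, here is a minimal Python sketch. It is not the Regex Based Extractor itself, only an illustration of trying several patterns for the same field when the layout varies. The field labels, patterns and sample text are all assumptions, since the real documents are not available in this thread.

```python
import re

# Hypothetical label variants for the same field; real invoices will differ.
FIELD_PATTERNS = {
    "quantity": [
        r"Quantity\s*[:\-]?\s*(\d+)",
        r"Qty\.?\s*[:\-]?\s*(\d+)",
    ],
    "price": [
        r"Unit Price\s*[:\-]?\s*\$?(\d+(?:\.\d{1,2})?)",
        r"Price\s*[:\-]?\s*\$?(\d+(?:\.\d{1,2})?)",
    ],
}


def extract_field(text, patterns):
    """Return the first capture group of the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def extract_fields(text):
    """Apply every field's pattern list to the text and collect the results."""
    return {name: extract_field(text, pats) for name, pats in FIELD_PATTERNS.items()}


# Two made-up layouts carrying the same information.
layout_a = "Item: Widget\nQuantity: 5\nUnit Price: $12.50"
layout_b = "Widget | Qty 5 | Price 12.5"

print(extract_fields(layout_a))
print(extract_fields(layout_b))
```

Each new document variation would then only need another pattern added to the list for the affected field, which is roughly what a regex based approach costs you in maintenance.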

Rahulsinha · February 20, 2020, 3:47am

@Lahiru.Fernando
This course is really helpful, but I am not able to work out how many times we need to define things for the extractor. I mean, how can this be resolved? Also, the output does not contain the data that I selected in the PDF.
Kindly let me know if my query is not clear.

Ioana_Gligan · February 20, 2020, 8:44am

Hello @Rahulsinha,

The machine learning extractor uses a pre-trained model that does not adjust to the manual corrections submitted by users. So it does not learn at the moment.

Ioana

Rahulsinha · February 20, 2020, 11:19am

Thanks for the response.
Could you please advise how I should proceed with my query?
I need to extract the PO number, the date and the data in the table. How can I extract all of this data into an Excel sheet?

Ioana_Gligan · February 20, 2020, 12:29pm

Could you send me a real document to have a look at?

Rahulsinha · February 20, 2020, 5:17pm

Sorry, I am not able to share the real document, but here is a copy (kind of) of it.

Ioana_Gligan · February 21, 2020, 7:43am

Hello @Rahulsinha,

Without a few real-life documents, I cannot propose a solution. Try looking into the Regex Based Extractor and the Position Based Extractor if you have just a few variations of where / how the data appears.

Ioana

Rahulsinha · February 24, 2020, 5:45am

It's not working.
How can I extract all the data available in the table from the PDF?

MinalGupta · February 24, 2020, 9:52am

@Rahulsinha ,

Please share the data in text (txt) format.

Just use the Read PDF Text activity, put the data in a variable, and then share the data here.

Based on that, a solution can be proposed.
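
As a rough idea of what such a solution might look like once the text is in a variable, here is a small Python sketch. It is an assumption-heavy illustration, not a UiPath workflow: the sample text is made up, and the row shape (description, whole-number quantity, price with two decimals, separated by spaces) is a guess that would need adjusting to the real data.

```python
import csv
import io
import re

# Assumed row shape: description, integer quantity, decimal price,
# separated by whitespace, e.g. "Blue Widget   2   10.00".
ROW_PATTERN = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+(?P<price>\d+\.\d{2})\s*$"
)


def parse_rows(text):
    """Return one dict per line that looks like a table row; skip everything else."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["description", "quantity", "price"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up text standing in for the output of reading the PDF text.
sample_text = """PO Number: 12345
Item Description    Qty    Price
Blue Widget         2      10.00
Steel Bolt M8       100    0.15
Total                      35.00
"""

rows = parse_rows(sample_text)
print(rows_to_csv(rows))
```

Lines that do not fit the assumed row shape (the PO line, the header and the total) are simply skipped, so the header fields would need their own patterns.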

Thanks
Minal

abhishek.tyagi · April 29, 2020, 10:53am

Hi Ioana,
I am facing an issue with the ML extractor. It is able to fetch all the details of an invoice, but somehow it is not able to fetch the invoice quantity details, which are in a table format.
Initially I thought it could be a PDF issue, but it is not working for any PDF.
Is this a known issue?

Thanks
Abhishek

Ioana_Gligan · April 30, 2020, 10:56am

Hello @abhishek.tyagi, and Welcome to our forum!

Any chance you could share a sample file with me, in private, for analysis? The workflow failing to extract the quantity would also be of great help.

Meanwhile, do check out our examples here: How to use the IntelligentOCR Package. Do these samples work for you once you upgrade the packages?

Ioana

abhishek.tyagi · April 30, 2020, 11:33am

Hi Ioana,

Thanks for the reply, but I think I resolved the issue by myself.

Abhishek
