Unable to capture PDF Invoice information using OCR

Tom1989 · November 12, 2018, 7:19am

Hello Everyone,

I am trying to capture information from an invoice in PDF format (basically a JPG image converted to PDF).

I tried using both Google and Microsoft OCR activities to extract the data but it didn’t work.

Attached is the file from which I am trying to pull out key information such as order number, Invoice Number, Total Due, Invoice Date, Order Date.

Zalora Sample Invoice_Goodman Project.pdf (50.4 KB)

Attached below is the workflow design for this project along with the ‘blank’ output:

Please, can anybody advise me on this? I have also tried changing the accessibility parameters on the PDF.

indra · November 12, 2018, 7:21am

@Tom1989 can you attach your xaml file

Tom1989 · November 12, 2018, 7:34am

new.xaml (9.5 KB)

There you go @indra

Tom1989 · November 12, 2018, 8:12am

Hi @indra, can you suggest anything? How to resolve this problem?

indra · November 12, 2018, 8:39am

@Tom1989 In the Read PDF with OCR activity, remove the pdfData variable and pass it in the Text property of the Google OCR activity instead.

Tom1989 · November 12, 2018, 8:56am

hi @indra, thank you for your reply.

I made the changes you suggested to the OCR activity, but the outcome is still not what I need.

I am receiving the following output when I run the task:

In fact, I should be receiving Invoice Date, Order Date and Invoice Number without the string ‘Number’ and the vertical lines ‘|’ in the output.

Please, can you help me resolve it.
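
A minimal sketch of the cleanup described above, written in Python purely for illustration. The thread's screenshots are not available, so the sample field strings are hypothetical, and `clean_field` is an invented helper name rather than a UiPath activity. It assumes the stray `|` characters come from table borders and that the word "Number" leaks in from the column label.

```python
import re


def clean_field(raw: str) -> str:
    """Remove table-border pipes and a leading 'Number' label from one OCR'd field."""
    # Treat every vertical bar as whitespace (assumed to be a table border).
    no_borders = raw.replace("|", " ")
    # Drop a leading "Number" label, plus any colon or spaces that follow it.
    no_label = re.sub(r"^\s*Number\b[\s:]*", "", no_borders, flags=re.IGNORECASE)
    # Collapse the remaining runs of whitespace into single spaces.
    return " ".join(no_label.split())


# Hypothetical raw values; the real OCR output may look different.
print(clean_field("Number | INV-0001 |"))
print(clean_field("| 12/11/2018 |"))
```

The same two steps (replace the bars, then strip the label) can be expressed with `Replace` and `Regex.Replace` in an Assign activity.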

Rishi1 · November 12, 2018, 8:56am

@Tom1989
I will give you another solution, since Read PDF with OCR won’t work on your PDF, but before that I have a few questions:
How many PDFs do you want to process per day?
Will this be the standard format of your PDFs, or will it change?
Do you want all the information from the PDF?
If it is the standard PDF, is it possible to make a few changes to it?

Tom1989 · November 12, 2018, 9:07am

Hi @Rishi1,

That would be of great help.

Our client is a logistics company and I believe they receive thousands of such invoices every day.
I am not sure whether or not it will be a standard format. I would request you to share your insights on approaching the problem from both perspectives.
Yes, I reckon. The following are my client’s requirements:

I don’t think so.

Rishi1 · November 12, 2018, 11:00am

Hi @Tom1989
Please find the attached text file. This is the data I managed to fetch from your PDF using Python. If you are familiar with coding, I can explain more. Basically, I wrote Python scripts that fetch the data from the PDF and output it in text and Excel formats. The only things I am unable to fetch are the total due and the table header, because their background color is gray. I need some more time, but I will fetch that data as well. Kindly have a look at the text file and let me know whether it works for you.

indra · November 12, 2018, 12:00pm

@Tom Use Regex

Tom1989 · November 13, 2018, 3:51am

Hi @Rishi1 @indra @Manjuts90,

If you look at the output of Microsoft OCR for the entire file, you will notice that the robot is able to capture all the details, although it presents them in a different layout from that of the PDF.

How can I modify my Assign activity to capture specific details from the file? What changes do I need to make to the .NET expression?

For example, what change would assign the value HKD 93.50 to the TotalDue variable?

My current argument doesn’t work:

System.Text.RegularExpressions.Regex.Match(pdfData,"(?<=Total Due).+").ToString.Trim

And I receive the following output when I try to extract this specific parameter using the above expression:

From the first image it is noticeable that the OCR engine is working fine. I just need to make changes to my expression to capture the correct parameter.

Please, can anybody advise?
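
One possible cause, which the thread does not confirm, is that the OCR text puts the amount on the line after the "Total Due" label. In .NET regex `.` does not match `\n`, so `.+` stops at the end of the label's own line and `Trim` can leave an empty string. The sketch below, in Python for illustration only, anchors on the label, allows any whitespace (including line breaks) after it, and captures a three-letter currency code followed by a number. The sample text and the `total_due` helper name are hypothetical.

```python
import re

# Label, optional colon, any whitespace (including newlines), then e.g. "HKD 93.50".
AMOUNT_AFTER_LABEL = re.compile(
    r"Total\s+Due\s*:?\s*([A-Z]{3}\s*\d[\d,]*(?:\.\d+)?)"
)


def total_due(ocr_text: str) -> str:
    """Return the amount that follows 'Total Due', or '' when none is found."""
    match = AMOUNT_AFTER_LABEL.search(ocr_text)
    return match.group(1) if match else ""


# Hypothetical OCR output where the value lands on the next line.
sample = "Rate HKD 48.90\nTotal Due\r\nHKD 93.50\n"
print(total_due(sample))
```

The pattern itself uses only constructs that .NET's `Regex.Match` also supports, so it could be tried in the Assign expression by reading `Groups(1).Value` instead of `ToString`.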

Manjuts90 · November 13, 2018, 4:21am

@Tom1989 In this particular case, try the expression below.

DueDate = "HKD "+System.Text.RegularExpressions.Regex.Match(pdfData,"(?<=HKD).+").ToString.Trim

Tom1989 · November 13, 2018, 4:31am

Hi @Manjuts90, it doesn’t work. This is the output I received:

Manjuts90 · November 13, 2018, 4:48am

@Tom1989 I am getting the correct output; see the screenshot below. Check once whether the variable in the Microsoft OCR activity and the one in the regex statement are the same.

Rishi1 · November 13, 2018, 5:25am

Hi @Tom1989
One thing I want to add here: the output from Microsoft OCR is not coming out as required. For example, the last rate value comes out as EKS 488.90, and the due date, invoice date and "to" address are all incorrect. Also, what will you do if the total due and the rate are equal, say both HKD 93.50?
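
A small Python sketch of why anchoring on the currency code alone is ambiguous: when the page contains several HKD amounts, a plain search returns whichever comes first in the OCR text, which may be a rate rather than the total. The sample text is hypothetical and the helper names are invented for illustration.

```python
import re

# Any HKD amount, e.g. "HKD 93.50" or "HKD 1,200.00".
AMOUNT = re.compile(r"HKD\s*\d[\d,]*(?:\.\d+)?")


def all_amounts(ocr_text: str) -> list:
    """Every HKD amount in the text, in reading order."""
    return AMOUNT.findall(ocr_text)


def first_amount(ocr_text: str) -> str:
    """What a single currency-anchored match would return."""
    match = AMOUNT.search(ocr_text)
    return match.group(0) if match else ""


# Hypothetical OCR output with a rate line before the total.
sample = "Rate HKD 48.90\nTotal Due HKD 93.50"
print(all_amounts(sample))
print(first_amount(sample))
```

Counting the matches first makes it easy to detect the ambiguous case before assigning a value.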

Tom1989 · November 13, 2018, 6:26am

@Rishi1, in that case, I will pass the value of any one variable or use an if condition to verify it and then assign a value.

Tom1989 · November 13, 2018, 6:36am

No change in the output.

Here is the xaml file for your reference:

new.xaml (15.3 KB)

Tom1989 · November 13, 2018, 6:47am

@UiRobot

Hi, Can you help me address this case?

Manjuts90 · November 13, 2018, 7:01am

@Tom1989 The regex mentioned above cannot produce the required output; you have to use string split operations to get it.
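
A minimal sketch of the string split approach, in Python for illustration. It assumes the OCR text keeps each label either on the same line as its value or on the line just before it; the thread does not show the actual layout, so the sample text is hypothetical and `value_after_label` is an invented helper name.

```python
def value_after_label(ocr_text: str, label: str) -> str:
    """Find a label by splitting the text into lines and return the value after it."""
    lines = [line.strip() for line in ocr_text.splitlines()]
    for index, line in enumerate(lines):
        if label in line:
            # Keep whatever follows the label on the same line, minus separators.
            rest = line.split(label, 1)[1].strip(" :|")
            if rest:
                return rest
            # Otherwise fall back to the next non-empty line.
            for following in lines[index + 1:]:
                if following:
                    return following
    return ""


# Hypothetical OCR output; the real text may be laid out differently.
sample = "Invoice Date: 12/11/2018\nTotal Due\n\nHKD 93.50"
print(value_after_label(sample, "Total Due"))
print(value_after_label(sample, "Invoice Date"))
```

The same steps map onto `Split`, `Contains` and `Trim` on a .NET string, which are available in UiPath expressions.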

Tom1989 · November 13, 2018, 7:06am

@Manjuts90 Please, can you help me with the command?
