Unable to capture PDF Invoice information using OCR

Hello Everyone,

I am trying to capture information from an invoice in PDF format (basically a JPG image converted to PDF).

I tried using both the Google and Microsoft OCR activities to extract the data, but neither worked.

Attached is the file from which I am trying to pull key information such as Order Number, Invoice Number, Total Due, Invoice Date and Order Date.

Zalora Sample Invoice_Goodman Project.pdf (50.4 KB)

Attached below is the workflow design for this project along with the ‘blank’ output:

Can anybody please advise me on this? I have also tried changing the accessibility parameters on the PDF.

@Tom1989 Can you attach your XAML file?

new.xaml (9.5 KB)

There you go @indra

Hi @indra, can you suggest anything? How to resolve this problem?

@Tom1989 In the Read PDF With OCR activity, remove the pdfData variable and pass it in the Text property of the Google OCR activity instead.

Hi @indra, thank you for your reply.

I made the changes you suggested to the OCR activity, but the outcome still isn’t what I need.

I receive the following output when I run the workflow:

image

In fact, I should be receiving the Invoice Date, Order Date and Invoice Number, without the string ‘Number’ and the vertical bars ‘|’ in the output.

Can you please help me resolve this?
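For anyone reading along, the cleanup described above (dropping a stray ‘Number’ label and the ‘|’ table borders) can be sketched in a few lines. This is only an illustration in Python, not the UiPath expression itself. The function name `clean_ocr_value` and the sample fragment are made up for the example, since the real OCR output is only visible in the screenshot.

```python
import re

def clean_ocr_value(raw: str) -> str:
    """Strip table borders and a leftover 'Number' label from an OCR fragment."""
    # Vertical bars are usually table borders that the OCR engine read as text.
    text = raw.replace("|", " ")
    # Drop a leading 'Number' label, optionally followed by a colon.
    text = re.sub(r"^\s*Number\b\s*:?", "", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

# Hypothetical fragment; the actual OCR output may look different.
sample = "Number | HK-000123 |"
print(clean_ocr_value(sample))
```

The same idea should carry over to an Assign activity using the .NET `Replace` and `Trim` string methods, though that has not been tested against this invoice.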

@Tom1989
I will give you another solution, since Read PDF With OCR won’t work on your PDF. Before that, I have a few questions:
How many PDFs do you want to process per day?
Will this be the standard format of your PDFs, or will it change?
Do you want all the information from the PDF?
Is it possible to make a few changes to this PDF if it is the standard format?

Hi @Rishi1,

That would be of great help.

  1. Our client is a logistics company and I believe they receive thousands of such invoices every day.

  2. I am not sure whether or not it will be a standard format. I would request you to share your insights on approaching the problem from both perspectives.

  3. Yes, I reckon. Following are the requirements of my client:

image

  4. I don’t think so.

Hi @Tom1989
Please find the attached text file. This is the data I was able to fetch from your PDF using Python. If you are familiar with coding, I can explain more. Basically, I write Python scripts that run, fetch the data from the PDF and output it in text and Excel formats. The only things I am unable to fetch are the total due and the table header, because their background colour is gray. I need some more time, and then I will fetch that data as well. Kindly have a look at the text file and let me know whether it is okay for you.

@Tom Use Regex

Hi @Rishi1 @indra @Manjuts90,

If you look at the output of Microsoft OCR for the entire file, you will notice that the robot is able to capture all the details, although it presents them in a different layout from that of the PDF.

How can I modify my Assign activity to capture specific details from the file? What changes do I need to make to the .NET expression?

For example, what change is needed to assign the value HKD 93.50 to the TotalDue variable?

My current expression doesn’t work:

System.Text.RegularExpressions.Regex.Match(pdfData,"(?<=Total Due).+").ToString.Trim

I receive the following output when I try to extract this specific value using the above expression:

image

From the first image you can see that the OCR engine is working fine. I just need to change my expression so that it captures the correct value.

Please, can anybody advise?
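One possible reason the lookbehind expression returns nothing useful: if the OCR text puts ‘Total Due’ and its value on separate lines, `.+` stops at the line feed. With Windows line endings it can match only the carriage return, which then trims to an empty string. The Python sketch below shows this under that assumption. The sample text and both function names are hypothetical, as the real OCR text is only visible in the screenshots.

```python
import re

# Hypothetical OCR text: label and value are on separate lines (CRLF endings).
ocr_text = "Total Due\r\nHKD 93.50\r\n"

def total_due_original(text: str) -> str:
    """Same pattern as the expression in the post above."""
    m = re.search(r"(?<=Total Due).+", text)
    # '.' does not match '\n', so only the '\r' is captured here.
    return m.group(0).strip() if m else ""

def total_due_tolerant(text: str) -> str:
    """Allow whitespace (including line breaks) between the label and the amount."""
    m = re.search(r"Total Due\s*:?\s*([A-Z]{3}\s*[\d,]+\.\d{2})", text)
    return m.group(1) if m else ""

print(repr(total_due_original(ocr_text)))  # empty string for this sample
print(total_due_tolerant(ocr_text))        # HKD 93.50
```

In .NET, `.` likewise does not match a line feed and `\s` does match line breaks, so the second pattern could be tried with `Regex.Match(...).Groups(1).Value`. Whether it fits this invoice depends on what the OCR engine actually returns.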

@Tom1989 In this particular case, try it like this:

DueDate = "HKD "+System.Text.RegularExpressions.Regex.Match(pdfData,"(?<=HKD).+").ToString.Trim

Hi @Manjuts90, it doesn’t work. This is the output I received:

image

@Tom1989 I am getting the correct output; see the screenshot below. Please check whether the variable used in the Microsoft OCR activity and the one in the regex statement are the same.

image

Hi @Tom1989
One thing I want to add: your output from Microsoft OCR is not coming out as required. For example, the last rate value comes out as EKS 488.90, and the due date, invoice date and ‘to’ address are all incorrect. Also, what will you do if the total due and the rate are equal, say both HKD 93.50?

@Rishi1, in that case, I will pass the value of any one variable or use an if condition to verify it and then assign a value.
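On the equal-values concern: one way around it is to look up each amount by its own label rather than by the currency code, so it does not matter whether two fields hold the same value. Below is a rough Python sketch of that idea. The helper `amount_after`, the label ‘Rate’ and the sample text are assumptions made for illustration and are not taken from the actual invoice.

```python
import re

# Three-letter currency code followed by an amount with two decimals.
AMOUNT = r"([A-Z]{3}\s*[\d,]+\.\d{2})"

def amount_after(label: str, text: str):
    """Return the first currency amount that follows `label`, or None."""
    m = re.search(re.escape(label) + r"\s*:?\s*" + AMOUNT, text)
    return m.group(1) if m else None

# Hypothetical OCR text in which the rate and the total happen to be equal.
sample = "Rate HKD 93.50\nTotal Due HKD 93.50\n"
fields = {label: amount_after(label, sample) for label in ("Rate", "Total Due")}
print(fields)
```

Each field is found independently, so no extra If check is needed to tell the two amounts apart. A label that the OCR engine misreads would still come back as `None` and need separate handling.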

No change in the output.

Here is the xaml file for your reference:

new.xaml (15.3 KB)

@UiRobot

Hi, Can you help me address this case?

@Tom1989 The regex mentioned above cannot produce the required output; you will have to use string split operations to get it.

@Manjuts90 Can you please help me with the command?
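As a rough idea of what a split-based approach could look like, here is a Python sketch rather than the exact UiPath command. It splits the OCR text into lines, finds the line containing the label, and returns either the rest of that line or the next non-empty line. The function name `value_for_label` and the sample strings are hypothetical.

```python
def value_for_label(text: str, label: str):
    """Locate `label` line by line and return the value that follows it."""
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if label in line:
            # Value on the same line, after the label and any separators.
            rest = line.split(label, 1)[1].strip(" :|")
            if rest:
                return rest
            # Otherwise take the next non-empty line.
            for nxt in lines[i + 1:]:
                if nxt:
                    return nxt
    return None

# Hypothetical OCR text; the real layout may differ.
print(value_for_label("Total Due\r\nHKD 93.50\r\n", "Total Due"))
```

In a UiPath Assign activity, the analogous building blocks would be the .NET `String.Split` and `String.Trim` methods. The exact expression depends on how the OCR engine lays out the text.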