I am trying to capture information from an invoice in PDF format (basically a JPF document converted to PDF).
I tried using both Google and Microsoft OCR activities to extract the data but it didn’t work.
Attached is the file from which I am trying to pull out key information such as Order Number, Invoice Number, Total Due, Invoice Date and Order Date.
Zalora Sample Invoice_Goodman Project.pdf (50.4 KB)
Attached below is the workflow design for this project along with the ‘blank’ output:
Please, can anybody advise me on this? I have also tried changing the accessibility parameters on the PDF.
@Tom1989 Can you attach your xaml file?
Hi @indra, can you suggest anything? How to resolve this problem?
@Tom1989 In the Read PDF With OCR activity, remove the pdfData variable and pass it in the Text property of the Google OCR activity instead.
hi @indra, thank you for your reply.
I made the changes you suggested to the OCR activity, but the outcome still isn’t what I need.
I am receiving the following output when I run the task:
In fact, I should be receiving the Invoice Date, Order Date and Invoice Number without the string ‘Number’ and the vertical lines ‘|’ in the output.
Please, can you help me resolve it?
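One way to get rid of the stray label and the vertical lines is to clean each captured value after the OCR step rather than change the OCR step itself. Below is a minimal sketch in Python. The function name `clean_field` and the sample values are hypothetical, since the actual OCR output is only shown as a screenshot in the thread, and it assumes the unwanted text is a leading ‘Number’ label plus table borders read as ‘|’.

```python
import re

def clean_field(raw: str) -> str:
    """Drop table borders read as '|' and a leading 'Number' label, then trim."""
    # Table borders are often recognised as '|' by OCR engines.
    no_bars = raw.replace("|", " ")
    # Remove a leftover 'Number' label at the start of the captured value.
    no_label = re.sub(r"^\s*Number\b[\s:.]*", "", no_bars, flags=re.IGNORECASE)
    # Collapse repeated whitespace and trim both ends.
    return " ".join(no_label.split())

# Hypothetical captured values, not taken from the real invoice.
print(clean_field("Number | HK12345 |"))
print(clean_field("| 12/03/2018 |"))
```

The same two steps (replace ‘|’, then strip the label) can be written in an Assign activity with .NET string and regex methods.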
I will give you another solution, since Read PDF With OCR won’t work on your PDF. Before that, I have a few questions:
How many PDFs do you want to process per day?
Will this be the standard format of your PDFs, or will it change?
Do you want all the information from the PDF?
Is it possible to make a few changes to this PDF if it is the standard format?
That would be of great help.
Our client is a logistics company, and I believe they receive thousands of such invoices every day.
I am not sure whether or not it will be a standard format. I would request you to share your insights on approaching the problem from both perspectives.
Yes, I reckon. Following are the requirements of my client:
- I don’t think so.
Please find the attached text file. This is the data I was able to fetch from your PDF using Python. If you are familiar with coding, I can explain more. Basically, I write Python scripts that run, fetch the data from the PDF, and output it in text and Excel formats. The only things I am unable to fetch are the total due and the table header, because their background color is gray. In any case, I need some more time and will fetch that data as well. Kindly have a look at the text file and let me know whether it is okay for you.
Hi @Rishi1 @indra @Manjuts90,
If you look at the output of Microsoft OCR for the entire file, you will notice that the robot is able to capture all the details, although it presents them in a different layout from that of the PDF.
How can I modify my Assign activity to capture specific details from the file? What changes do I need to make to the .NET expression?
For example, what do I need to change to assign the value HKD 93.50 to the TotalDue variable?
My current argument doesn’t work:
And I receive following output when I try to extract this specific parameter using the above expression:
From the first image it is noticeable that the OCR engine is working fine. I just need to make changes to my expression to capture the correct parameter.
Please, can anybody advise?
@Tom1989 In this particular case, try the expression below.
DueDate = "HKD "+System.Text.RegularExpressions.Regex.Match(pdfData,"(?<=HKD).+").ToString.Trim
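To see what this pattern actually picks up, here is a small Python sketch that mirrors the expression above. The lookbehind `(?<=HKD)` starts the match right after the first occurrence of HKD, and `.+` runs to the end of that line. The `pdf_data` text is a hypothetical stand-in, because the real OCR output is not reproduced in the thread.

```python
import re

# Hypothetical OCR output; the real text from the invoice is not shown in the thread.
pdf_data = "Invoice Date 01/02/2018\nRate HKD 88.90\nTotal Due HKD 93.50\n"

# Same pattern as the .NET expression: everything after the first "HKD" on its line.
match = re.search(r"(?<=HKD).+", pdf_data)
first_amount = "HKD " + match.group(0).strip() if match else ""

print(first_amount)  # HKD 88.90
```

With this sample text the result is the rate, not the total due, because the pattern always returns the first HKD amount it finds. If the OCR text contains no "HKD" at all, the match is empty and only the prefix (or, in this sketch, an empty string) comes back.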
Hi @Manjuts90, it doesn’t work. This is the output I received:
@Tom1989 I am getting the correct output; check the screenshot below. Check once whether the variable in the Microsoft OCR activity and the one in the regex statement are the same.
One thing I want to add here: your output from Microsoft OCR is not coming out as required. For example, the last rate value comes out as EKS 488.90, and the due date, invoice date and ‘to’ address are all incorrect. Also, what will you do if the total due and the rate are equal, say both HKD 93.50?
@Rishi1, in that case, I will pass the value of any one variable or use an if condition to verify it and then assign a value.
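An alternative to comparing values afterwards is to anchor the search on the field label, so that it does not matter whether another amount on the page is identical. The sketch below is only an illustration in Python: the function name `total_due` and the sample text are hypothetical, and it assumes the OCR output keeps the label "Total Due" directly in front of its amount, which may not hold for every scan.

```python
import re

def total_due(ocr_text: str) -> str:
    """Return the HKD amount that directly follows the 'Total Due' label, or ''."""
    # Find the label first, then capture the currency amount that follows it.
    m = re.search(
        r"Total\s+Due\s*:?\s*(HKD\s*\d+(?:[.,]\d+)*)",
        ocr_text,
        flags=re.IGNORECASE,
    )
    # Normalise inner whitespace so "HKD   93.50" becomes "HKD 93.50".
    return " ".join(m.group(1).split()) if m else ""

# Hypothetical OCR text in which the rate and the total are the same amount.
sample = "Rate HKD 93.50\nTotal Due HKD 93.50\n"
print(total_due(sample))  # HKD 93.50
```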
No change in the output.
Here is the xaml file for your reference:
new.xaml (15.3 KB)
Hi, Can you help me address this case?
@Tom1989 The regex mentioned above cannot produce the required output; you have to use string split operations to get it.
@Manjuts90 Please, can you help me with the command?
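A split-based approach could look roughly like the sketch below, written in Python for illustration. It walks through the OCR text line by line, finds the line containing the label, splits on the label, and keeps what follows. The helper name `value_after` and the sample text are hypothetical, and the sketch assumes the label and its value end up on the same line of the OCR output.

```python
def value_after(ocr_text: str, label: str) -> str:
    """Return the text that follows `label` on the same line, or '' if absent."""
    for line in ocr_text.splitlines():
        if label in line:
            # Split once on the label and keep the right-hand part,
            # trimming spaces, colons, tabs and '|' table borders.
            return line.split(label, 1)[1].strip(" :|\t")
    return ""

# Hypothetical OCR text, not taken from the real invoice.
sample = "Invoice Number | 1234\r\nTotal Due: HKD 93.50\r\n"
print(value_after(sample, "Total Due"))
print(value_after(sample, "Invoice Number"))
```

In a UiPath Assign activity the same idea can be expressed with the .NET `String.Split` and `Trim` methods on the OCR output variable.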