How to capture the data from the scanned PDF with different nature

vaibhav2.chavan · March 11, 2020, 4:12pm

I want to capture the invoice ID, date, customer ID, etc. from PDFs that have different layouts.
Does anyone have an idea how to do this?

Please see the attached file as reference.
All files will be PDF files.

Invoice 1.pdf (28.8 KB)

MikeBlades · March 11, 2020, 4:38pm

Hello @vaibhav2.chavan,

I work for a company where we have many different supplier invoices, so I’m in a similar boat to you. The way I got around it was as follows:

Get all the PDFs into one folder, then run a For Each loop over each file in the directory.
Use a Read PDF activity (you might have to download it from the manage activities area) and set the output to a String variable (let’s call this variable OUTPUT).
Then use a string Split method to isolate the desired string. Think of it as finding an anchor word (a word that keeps the same position relative to the desired word in every invoice from the same supplier), then splitting the rest of the String variable to narrow down and isolate the word you’re after.
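
The steps above can be sketched outside UiPath as follows. This is a minimal Python illustration, not the workflow itself: `first_line_after`, `pdf_text_by_file`, and `ANCHORS` are hypothetical names, and the dictionary stands in for the string that Read PDF would return for each file in the folder.

```python
def first_line_after(text: str, anchor: str) -> str:
    """Return the first non-empty line of text following `anchor`.

    Returns "" when the anchor does not occur. If the anchor sits in the
    middle of a line, the remainder of that line is returned.
    """
    _, found, rest = text.partition(anchor)
    if not found:
        return ""
    for line in rest.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Hypothetical stand-in for the Read PDF output of each file in the folder.
pdf_text_by_file = {
    "Invoice 1.pdf": "[Leon, leon@robopro.co]\n2170\nCUSTOMER ID\nINVOICE #",
}

# One anchor per field, valid for a single supplier layout (assumption).
ANCHORS = {"INVOICE": "co]"}

# The "For Each file" loop: apply every anchor to every file's text.
results = {}
for file_name, text in pdf_text_by_file.items():
    results[file_name] = {
        field: first_line_after(text, anchor)
        for field, anchor in ANCHORS.items()
    }

print(results)
```

Each supplier layout would need its own set of anchors, since an anchor is only reliable within invoices that share the same template.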

Here is what the Read PDF OUTPUT variable looks like for your invoice:

INVOICE
1 Main Road
Johannesburg
South Africa DATE
leon@robopro.co 2017/09/29
TERMS
Net 30 Days
Mellicent Ivoshin
Dynazzy
37 Carpenter Court
Sinilian First
560-390-2703
mivoshincp@gravatar.com
DESCRIPTION QTY (hours) UNIT PRICE ($) AMOUNT ($)
Service Fee 6 200,00 1 200,00
Additional Services 7 75,00 525,00
1 725,00
If you have any questions about this invoice, please contact
[Leon, leon@robopro.co]
2170
CUSTOMER ID
INVOICE #
Thank you for your business! TOTAL
279
BILL TO

For your attached invoice, I found the following splits:

INVOICE = Output.ToString.Split({"co]"+vbCrLf},2,StringSplitOptions.RemoveEmptyEntries)(1).Split({vbCrLf},2,StringSplitOptions.RemoveEmptyEntries)(0).Trim

DATE = Output.ToString.Split({"DATE"+vbCrLf},2,StringSplitOptions.RemoveEmptyEntries)(1).Split({" "},2,StringSplitOptions.RemoveEmptyEntries)(1).Split({vbCrLf},2,StringSplitOptions.RemoveEmptyEntries)(0).Trim
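
To make the chained splits easier to follow, here is a rough Python mirror of the same logic, run against an excerpt of the Read PDF output quoted above. It is only a sketch under assumptions: the text is assumed to use CRLF line breaks (as the `vbCrLf` in the expressions implies), the anchor `co]` is taken exactly as it appears in that output, and `split(sep, 1)` plays the role of `Split(..., 2, ...)`, which returns at most two parts.

```python
CRLF = "\r\n"

# Excerpt of the Read PDF output quoted above, joined with CRLF line breaks.
OUTPUT = CRLF.join([
    "INVOICE",
    "1 Main Road",
    "Johannesburg",
    "South Africa DATE",
    "leon@robopro.co 2017/09/29",
    "TERMS",
    "Net 30 Days",
    "If you have any questions about this invoice, please contact",
    "[Leon, leon@robopro.co]",
    "2170",
    "CUSTOMER ID",
    "INVOICE #",
])

# INVOICE: keep what follows the anchor "co]" + line break,
# then keep only the first line of that remainder.
invoice = OUTPUT.split("co]" + CRLF, 1)[1].split(CRLF, 1)[0].strip()

# DATE: keep what follows "DATE" + line break, drop the e-mail address
# before the first space, then keep only the rest of that line.
date = OUTPUT.split("DATE" + CRLF, 1)[1].split(" ", 1)[1].split(CRLF, 1)[0].strip()

print(invoice)
print(date)
```

If an anchor is missing from the text, the `[1]` index raises an error here, just as index `(1)` does in the VB.NET expressions, so it is worth checking that the anchor is present (for example with `Output.Contains(...)`) before splitting.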

I hope you can use this to find the rest of the variables you’re after.

Cheers
MikeB

vaibhav2.chavan · March 11, 2020, 5:17pm

@MikeBlades - Thanks for your response. I will try and let you know in case of any issues.

Palaniyappan · March 11, 2020, 7:13pm

Hi,
Regex can handle a known set of PDF templates, but if the text format or the position of a field changes, regex won’t be able to handle it.
Have you tried ABBYY FlexiCapture for this?
Hope this video gives you some insights.

Cheers @vaibhav2.chavan
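
The limitation of regex described above can be illustrated with a small Python sketch. The pattern and the second layout are assumptions made up for illustration: `DATE_PATTERN` is tied to the sample invoice's layout (the label `DATE`, a line break, an e-mail address, then a `YYYY/MM/DD` date), and `other_layout` is a hypothetical invoice from a different supplier.

```python
import re
from typing import Optional

# Assumed pattern for the sample layout: "DATE", line break,
# one token (the e-mail address), a space, then YYYY/MM/DD.
DATE_PATTERN = re.compile(r"DATE\r?\n\S+ (\d{4}/\d{2}/\d{2})")


def find_date(text: str) -> Optional[str]:
    """Return the date captured by DATE_PATTERN, or None if the layout differs."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


# Layout taken from the sample invoice's Read PDF output.
sample_layout = "South Africa DATE\nleon@robopro.co 2017/09/29\nTERMS"

# Hypothetical layout: same information, different position and date format.
other_layout = "INVOICE DATE: 29/09/2017\nTERMS"

print(find_date(sample_layout))
print(find_date(other_layout))
```

The pattern finds the date in the layout it was written for and returns nothing for the other one, so every new template needs its own pattern.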
