Extract invoice number from PDF

Hi everyone, I have a set of invoices combined in one PDF file, and I want to extract the invoice number from each invoice. The issue I'm facing is that the invoice formats are not consistent. Can anyone guide me on how to solve this?
I'm attaching a sample PDF below.

All invoices.pdf (1.2 MB)

Hello @Learner007 ,

Perhaps you should define a selector for each type of invoice.

Hence, you could:

  1. Open the invoice .pdf file
  2. Use the Check App State activity to identify your type of invoice: https://docs.uipath.com/activities/docs/n-check-state . Note that each type of invoice should have its own Check App State activity and its own activity for reading/extracting the invoice number.
  3. Finally, extract your invoice number from the .pdf.
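Outside UiPath, the same "identify the type first, then extract" idea can be sketched on the extracted page text. This is only a minimal Python sketch, not the Check App State activity itself: the vendor names, marker patterns, and extractor patterns in `FORMATS` are hypothetical and would have to be replaced with ones taken from the real invoices.

```python
import re

# Hypothetical formats: (name, marker identifying the layout, pattern capturing the number).
FORMATS = [
    ("vendor_a", re.compile(r"ACME Corp", re.I),
     re.compile(r"Invoice\s+No\.?\s*:?\s*(\d+)", re.I)),
    ("vendor_b", re.compile(r"Globex", re.I),
     re.compile(r"Bill\s+Ref\s*[:#]\s*([A-Z]{2}-\d+)")),
]


def extract_by_format(page_text):
    """Identify the layout first, then run only that layout's extractor."""
    for name, marker, extractor in FORMATS:
        if marker.search(page_text):
            match = extractor.search(page_text)
            return name, (match.group(1) if match else None)
    # No known layout matched this page.
    return None, None
```

Each new invoice type then means one more entry in the list rather than a change to the logic, but every type still has to be described by hand.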

Hope it helps!
Best regards,
Marius

@Marius_Puscasu But the problem here is that I will have 40+ formats in each PDF. How can I write code to get the invoice number for all of them combined, when I don't know which format I will get?

Hi @Learner007 ,

40 possible formats looks a bit complex to automate in the traditional way.

Do you have by any chance access to UiPath’s Document Understanding?

The second option I'd use is regex extraction. Use the Read PDF Text activity for native (digitally generated) PDFs or the Read PDF With OCR activity for scanned documents, then identify the keywords around the invoice number and build a regex to extract it.

  • This second option takes longer to build and can be more complex because of the number of variations in the invoices.
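The choice between native reading and OCR can also be made automatically. A minimal Python sketch, assuming that a scanned page yields almost no text when read natively: `read_native` and `read_ocr` are placeholder callables standing in for whatever does the reading, and the `min_chars` threshold is an arbitrary value chosen for illustration.

```python
def read_page_text(read_native, read_ocr, min_chars=20):
    """Try native text extraction first; fall back to OCR if it yields too little text."""
    text = read_native()
    if len(text.strip()) >= min_chars:
        return text
    # Assumption: nearly empty native text means the page is a scanned image.
    return read_ocr()
```

For example, `read_page_text(lambda: "", lambda: "INVOICE NO. 12345")` falls through to the OCR reader because the native result is empty.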

Hope this helps!

Best Regards,
Ignasi Peiris

Hi @ignasi.peiris, I will try Document Understanding.

Hi, can anyone guide me on this problem? It would be a great help.

@ppr , @Gokul001 @Sudharsan_Ka , @Yoichi , @supermanPunch , @omer.ozturk , @Rahul_Unnikrishnan

Hi,

FYI, we can extract the invoice number from pages 1 to 6 using the following regex.

System.Text.RegularExpressions.Regex.Match(strPdf,"(?<=Invoice\s*(NUMBER|NO\.?)?\s*\W*\s*)\d\S*",System.Text.RegularExpressions.RegexOptions.IgnoreCase).Value

Sample20221122-8.zip (1.1 MB)

However, page 7 and later need another approach. It might be necessary to try it against actual data.
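To see which pages the expression covers, it can be run page by page. Below is a minimal Python sketch of that check, under a few assumptions: it is a port, not the .NET expression itself, because Python's `re` module rejects variable-width lookbehind, so the number is taken from a capture group instead; the `NO` alternative is written as `NO\.?` so that the dot is literal and optional; and `pages` is assumed to be a list holding the text of each page, already extracted.

```python
import re

# Python port of the thread's expression; the capture group replaces the lookbehind.
INVOICE_RE = re.compile(
    r"Invoice\s*(?:NUMBER|NO\.?)?\s*\W*\s*(\d\S*)",
    re.IGNORECASE,
)


def extract_invoice_numbers(pages):
    """Return (found, missed): page number -> invoice number, and pages with no match."""
    found, missed = {}, []
    for page_no, text in enumerate(pages, start=1):
        match = INVOICE_RE.search(text)
        if match:
            found[page_no] = match.group(1)
        else:
            missed.append(page_no)
    return found, missed
```

The `missed` list shows which layouts the pattern does not cover, so those pages can be inspected separately.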

Regards,

@Yoichi Is there any way to write a single piece of code that extracts the invoice number from all pages, using regex or Document Understanding?

Hi,

In my personal opinion, it is difficult to extract the invoice number with a single regex or any other rule-based approach.
However, it might be possible with a machine-learning-based extractor. (Sorry, but I'm not very familiar with it.)

Regards,


Thanks @Yoichi for the useful information. I will give it a try on my end using Document Understanding.