How to extract invoice data from PDFs?

Is there a way to extract invoice information such as Invoice No, PO No, Address, and Amount from multiple vendors whose invoices follow different patterns? The data is not standardized, and field positions vary from one invoice to another.

Which is the best way to solve this problem?

  1. Regex
  2. AI & Machine Learning

If anyone has dealt with this kind of real-world scenario, please advise.

Hello @harsha_vardhan,

The best approach is to use the IntelligentOCR stable package together with the MachineLearningExtractor beta package.

Please see this example here - it already has data extraction from invoices configured for some fields.

@harsha_vardhan

I have done this task using regex and string manipulation.


Hello @monika.c,

Good to hear this! To help out in such cases, we built a Regex Based Extractor, available in the IntelligentOCR 3.0.0 package. You might want to try it out; it makes it a lot easier to handle data from within files!

Also, please check out the example above to see how to use it.

Ioana


@monika.c - Are you able to handle multiple fields, e.g. invoice no, account no, PO, address, vendor name? I mean, each PDF in the loop has a different layout and different data (multiple markets such as the US, Canada, and India).

If multiple PDFs have different layouts but these keywords are fixed in all PDF formats, we can extract the data using a regex pattern.
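As a rough illustration of the keyword-based idea, here is a minimal Python sketch (not UiPath code). It assumes the PDF has already been converted to plain text and that each field appears on one line as "label, optional separator, value". The label lists, field names, and the `extract_fields` helper are hypothetical and would need to be adapted to the real invoices.

```python
import re

# Hypothetical label variants per field; real invoices will need their own lists.
FIELD_LABELS = {
    "invoice_no": ["Invoice No.", "Invoice No", "Invoice #", "Invoice Number"],
    "po_no": ["PO No.", "PO No", "PO Number", "Purchase Order"],
    "total": ["Total Amount", "Amount Due", "Total"],
}

def build_pattern(labels):
    # Try longer labels first so "Invoice Number" wins over "Invoice No".
    ordered = sorted(labels, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in ordered)
    # \b stops "Total" from matching inside "Subtotal".
    # After the label: optional ":" or "-" separator, then the rest of the line.
    return re.compile(
        rf"\b(?:{alternatives})[ \t]*[:\-]?[ \t]*(\S.*)", re.IGNORECASE
    )

PATTERNS = {field: build_pattern(labels) for field, labels in FIELD_LABELS.items()}

def extract_fields(text):
    """Return the first value found for each field, or None if the label is absent."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result
```

This only works while the labels stay recognizable; when a vendor uses a label that is not in the list, the field comes back as `None` and the list has to be extended.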

@monika.c - OK, let’s take a case where the labels are not standard and the positions change. After converting the PDF to text with OCR, can we apply regex and extract the information?

Example: "Invoice number" is a standard label, but the value has no fixed length; it might be a combination of letters, digits, and special symbols. How do we extract that invoice number from those PDFs? The same question applies to Total, Net amount, Tax %, etc.

OK. If "Invoice Number" is a standard label, then we can use a regex pattern like this:
(?<=(Invoice No.)|(INVOICE )).*

Use this pattern in an Assign activity:
InvoiceNumber = System.Text.RegularExpressions.Regex.Match(PdtText,"(?<=(Invoice No.)|(INVOICE No)|(Invoice #:)|(Invoice#:)).*").Value
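For readers who want to try the same idea outside UiPath, here is a hedged Python sketch. Python's `re` module does not accept variable-width lookbehind the way .NET does, so this version swaps the lookbehind for a capturing group. The `find_invoice_number` helper is hypothetical. As assumptions of this sketch, the dot in "Invoice No." is escaped, matching is case-insensitive, and the value is taken to be the first run of non-whitespace characters after the label, so letters, digits, and symbols are all accepted.

```python
import re

# Label variants taken from the pattern above. The capture group replaces the
# .NET lookbehind, and \S+ accepts any mix of letters, digits, and symbols.
INVOICE_NO = re.compile(
    r"(?:Invoice No\.|INVOICE No|Invoice #:|Invoice#:)[ \t]*:?[ \t]*(\S+)",
    re.IGNORECASE,
)

def find_invoice_number(pdf_text):
    """Return the token that follows the first invoice-number label, or None."""
    match = INVOICE_NO.search(pdf_text)
    return match.group(1) if match else None
```

Because the capture stops at the first whitespace, an invoice number that itself contains spaces would be cut short; in that case the group could be widened to take the rest of the line, as the original `.*` does.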

@monika.c - Thanks for the update. I understand that if there is a standard pattern we can use regex, but in my case there is no standard pattern. For example, if I need to retrieve the bill-to address, account number, or vendor name from an invoice, how would I do that?

When I try to extract data from PDF to Excel using IntelligentOCR, the Regex Based Extractor, and all the activities in the documentation, some data is extracted but some is missing, such as the invoice number and invoice date. Could you please help me solve this issue?

My team worked on a similar business requirement for invoice data extraction a few months ago. After tackling these issues with regex and string manipulation, we also tried several OCR products on the market, such as FlexiCapture, Kofax’s OmniPage, and a few others. The major issue we faced was that we had approximately 5K-6K invoices daily and no two had the same layout. Even Automation Anywhere’s IQ Bot could not prove to be perfect, since the user needs to draw templates around the significant fields. After a lot of market research, we are now testing two AI-based tools, UiPath and KlearStack. These two have been excellent in catering to our business use case, since they need no templates and are both cloud-based.
UiPath has been versatile in extracting common fields like invoice number and PO number, but could not extract Address, Amount, and table data accurately. On the other hand, KlearStack is accurate at extracting detailed information such as table data (i.e. line items), tax details, tax amount, and supplier and customer addresses, even if the invoice or PO formats are not standard and have not been seen before. UiPath is suitable for cases that need a few basic fields (3-4) at high speed, while KlearStack is more suitable for cases that need all the invoice, receipt, or purchase order details, taxes, and amounts via bulk processing at high accuracy.