PDF Data Extraction (Invoice)

How can I extract specific data from PDF documents (for example, invoices) without using the Anchor Base activity?
When I use the Anchor Base activity to extract the data, it takes too long to process.

Hi @Aditya_Srinivas

You can read the PDF using the Read PDF Text activity and then use regex to extract the data.

That would be an easier approach.
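As a rough illustration of the regex idea outside of UiPath, here is a minimal Python sketch. The sample text and the "Invoice No:" label are assumptions made up for the example; the real text and labels depend on what the read step returns for your invoices.

```python
import re

# Hypothetical text, standing in for the string returned by the PDF read step.
sample_text = """ACME Supplies Ltd.
Invoice No: INV-2041
Invoice Date: 12/03/2021
Total Amount: 1,250.00
"""

# Capture the non-whitespace token that follows the "Invoice No:" label.
match = re.search(r"Invoice No:\s*(\S+)", sample_text)
invoice_number = match.group(1) if match else None
print(invoice_number)
```

The same pattern string could be tried in the Matches activity, since the idea (a fixed label followed by a capture group) does not depend on the tool.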

Hi @Aditya_Srinivas ,

If the PDF contains image data, you can use the Read PDF With OCR activity with the Microsoft OCR engine, and then use the Matches activity (regular expressions) to get the data.


Thank you very much for the solution.

For example, I have 1000 PDF invoices and I want to extract specific data from them, but with the Anchor Base activity the extraction takes a lot of time. As you said, we can use the Matches activity, but how do we extract the data with it? Could you please provide an example?

@Aditya_Srinivas - If all of your 1000 PDF invoices follow the same pattern, then:

  1. Use the Read PDF Text activity with Preserve Formatting set to true, and output the text to a String variable.
  2. Pass the String variable to the Matches activity and build the regex pattern.
  3. Write the output.
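The three steps above can be sketched in Python for a batch of invoices. This is only an illustration under assumptions: the field labels, the patterns, and the two sample strings are hypothetical, and the text is supplied inline rather than read from real PDF files.

```python
import csv
import io
import re

# Hypothetical patterns; each one has a single capture group for the value.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No:\s*(\S+)"),
    "invoice_date": re.compile(r"Invoice Date:\s*(\d{2}/\d{2}/\d{4})"),
    "total": re.compile(r"Total Amount:\s*([\d,]+\.\d{2})"),
}

def extract_fields(text):
    """Step 2: apply every pattern; a field is None when nothing matches."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        row[name] = m.group(1) if m else None
    return row

def write_rows(texts, out):
    """Step 3: write one CSV row per invoice text."""
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    for text in texts:
        writer.writerow(extract_fields(text))

# Step 1 stand-in: made-up strings in place of the text read from each PDF.
invoice_texts = [
    "Invoice No: INV-2041\nInvoice Date: 12/03/2021\nTotal Amount: 1,250.00\n",
    "Invoice No: INV-2042\nInvoice Date: 15/03/2021\nTotal Amount: 980.50\n",
]

buffer = io.StringIO()
write_rows(invoice_texts, buffer)
print(buffer.getvalue())
```

Because every invoice goes through the same compiled patterns, this only works when the layouts share the same labels; a field that comes back empty is a sign that the pattern needs adjusting for that invoice.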

The pattern will be different for every case. If you could share some sample text from the output of the Read PDF Text activity, we can help with the regex.

You can refer to this post.
