Read PDF data


#1

Hey everyone, I’m stuck on an issue and I’m not sure how to go about it. I have multiple PDF files that I’m trying to read and pull certain data from, such as date, account number, and invoice number. The issue I’m having is that the data is placed differently in every PDF. Example: PDF_1 has the date then the account number, and PDF_2 has the account number then the date. What would you recommend?


#2

Hey there,

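I would suggest using regular expressions to extract the data, covering every possible ordering. Assuming the account and invoice numbers have different formats (number of characters), you can have three regular expressions, and one of them will match. Schematically, with dd/dd/dd as the date and each s* standing for one of the other two fields: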

  1. dd/dd/dd s* s*
  2. s* dd/dd/dd s*
  3. s* s* dd/dd/dd

PS: Use the Matches activity.
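
To make the three-pattern idea concrete, here is a minimal Python sketch rather than a UiPath workflow. It assumes the date looks like dd/dd/dd, that the other two fields are plain digit runs, and that the account number is the longer of the two. That length rule and the function name `extract_fields` are my own assumptions for illustration, so adjust them to your real formats.

```python
import re

DATE = r"(\d{2}/\d{2}/\d{2})"   # date in dd/dd/dd form
FIELD = r"(\d+)"                # assumed: account and invoice are digit runs

# One pattern per position the date can take among the three fields.
PATTERNS = [
    re.compile(rf"{DATE}\s+{FIELD}\s+{FIELD}"),
    re.compile(rf"{FIELD}\s+{DATE}\s+{FIELD}"),
    re.compile(rf"{FIELD}\s+{FIELD}\s+{DATE}"),
]

def extract_fields(line):
    """Try each ordering in turn and return the first one that matches."""
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match is None:
            continue
        groups = match.groups()
        date = next(g for g in groups if "/" in g)
        numbers = [g for g in groups if "/" not in g]
        # Hypothetical rule: the longer number is the account number.
        account = max(numbers, key=len)
        invoice = min(numbers, key=len)
        return {"date": date, "account": account, "invoice": invoice}
    return None  # none of the three orderings matched
```

Calling `extract_fields` on a line of text gives back a dictionary with the three values no matter where the date sits, or `None` if nothing matched. If both numbers can have the same length, the length rule cannot tell them apart and you would need a stricter pattern for each.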


#3

What if they are on different lines?


#4

It depends on the PDF.

If your PDF is a flat image, use an anchor with the Find Image activity to start screen scraping, or read the complete PDF with OCR and then use a regular expression that looks for a certain keyword before the data.
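
As a rough sketch of the keyword approach, here it is in Python. It assumes the OCR text has already been read into a string and that the fields are preceded by labels such as "Date", "Account No" and "Invoice No". Those label spellings and the name `extract_by_label` are hypothetical, so swap in whatever your PDFs actually print.

```python
import re

# Hypothetical labels; \s* also matches line breaks, so the value may sit
# on the line after its label.
LABELLED_FIELDS = {
    "date": r"Date\s*:?\s*(\d{2}/\d{2}/\d{2})",
    "account": r"Account\s*(?:No\.?|Number|#)?\s*:?\s*(\d+)",
    "invoice": r"Invoice\s*(?:No\.?|Number|#)?\s*:?\s*(\d+)",
}

def extract_by_label(text):
    """Find each value by the keyword that comes right before it."""
    result = {}
    for name, pattern in LABELLED_FIELDS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result
```

Because each field is found by its own keyword, the order of the fields and the line they sit on no longer matter. A field whose label is missing comes back as `None`.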

If your PDF has metadata, you can use the “Find Text Position” activity to locate where the data starts.

If you do not have a placeholder anywhere, then your best bet is to get all the data from the PDF and search for the pattern I outlined above.
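
For the case where the values sit on different lines, one option is to search the whole text for each field on its own instead of matching one combined pattern. A minimal Python sketch, assuming a 10-digit account number and a 6-digit invoice number (both lengths are made up for illustration, as is the name `extract_anywhere`):

```python
import re

# Each field has its own pattern, so line breaks between fields do not matter.
# The lookarounds stop a match from starting or ending inside a longer number
# or inside a date.
FIELD_PATTERNS = {
    "date": re.compile(r"\b\d{2}/\d{2}/\d{2}\b"),
    "account": re.compile(r"(?<![\d/])\d{10}(?![\d/])"),  # assumed length
    "invoice": re.compile(r"(?<![\d/])\d{6}(?![\d/])"),   # assumed length
}

def extract_anywhere(text):
    """Return the first occurrence of each field anywhere in the text."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(0) if match else None
    return found
```

This only works if the formats really are distinct. If another 6-digit or 10-digit number can appear earlier in the document, the first hit may be the wrong one and a keyword or position anchor is the safer route.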

Let me know if this makes sense.