Read PDF data

kishanpatel728 · August 1, 2018, 4:45pm

Hey everyone, I’m stuck on an issue and I’m not sure how to go about it. I have multiple PDF files that I’m trying to read and pull certain data from, such as date, account number, and invoice number. The issue I’m having is that the data is placed differently in every PDF. Example: PDF_1 has the date and then the account number, while PDF_2 has the account number and then the date. What would you recommend?

Kemal · August 1, 2018, 4:52pm

Hey there,

I would suggest you use regular expressions to extract the data, covering all the possible orderings. Assuming that the account and invoice numbers have different formats (number of characters), you can have three regular expressions, and one of them will match:

\d\d/\d\d/\d\d \S* \S*
\S* \d\d/\d\d/\d\d \S*
\S* \S* \d\d/\d\d/\d\d

PS. Use the Matches activity.
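
For anyone who wants to try the idea outside of UiPath first, here is a rough Python sketch of the "three patterns, one will match" approach. The field lengths (8 characters for the account number, 5 for the invoice number), the sample values, and the function name `extract_fields` are my own assumptions for illustration, not something from the PDFs in question.

```python
import re

# Assumed formats (hypothetical): date is dd/dd/dd, account number is
# 8 characters long, invoice number is 5 characters long.
ACCOUNT_LEN = 8
INVOICE_LEN = 5

DATE = r"(?P<date>\d\d/\d\d/\d\d)"
TOKEN_A = r"(?P<a>\S+)"
TOKEN_B = r"(?P<b>\S+)"

# One pattern per possible position of the date among the three values.
PATTERNS = [
    re.compile(rf"{DATE}\s+{TOKEN_A}\s+{TOKEN_B}"),
    re.compile(rf"{TOKEN_A}\s+{DATE}\s+{TOKEN_B}"),
    re.compile(rf"{TOKEN_A}\s+{TOKEN_B}\s+{DATE}"),
]


def extract_fields(line):
    """Try each ordering in turn; return a dict of fields, or None if nothing matches."""
    for pattern in PATTERNS:
        m = pattern.fullmatch(line.strip())
        if m is None:
            continue
        tokens = [m.group("a"), m.group("b")]
        # Tell account and invoice apart by their (assumed) lengths.
        account = next((t for t in tokens if len(t) == ACCOUNT_LEN), None)
        invoice = next((t for t in tokens if len(t) == INVOICE_LEN), None)
        return {"date": m.group("date"), "account": account, "invoice": invoice}
    return None


print(extract_fields("08/01/18 12345678 54321"))
print(extract_fields("54321 08/01/18 12345678"))
```

In UiPath the same patterns would go into the Matches activity; the length check is only there because the two non-date values are assumed to differ in number of characters.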

kishanpatel728 · August 1, 2018, 6:44pm

What if they are on different lines?

Kemal · August 1, 2018, 7:19pm

It depends on the PDF.

If your PDF is a flat image, then use an anchor with the Find Image activity to start screen scraping, or attempt to read the complete PDF with OCR and then use a regular expression that looks for a certain keyword before the data.

If your PDF has metadata, you can use the “Find Text Position” activity to locate where the data starts.

If you do not have a placeholder anywhere, then your best bet is to get all the data from the PDF and search for the patterns I outlined above.
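
To make the "keyword before the data" idea concrete, here is a small Python sketch that searches the full extracted text for each field on its own, so neither the order nor the line breaks matter. It assumes the text has already been pulled out of the PDF as a plain string (for example via OCR), and the labels ("Date", "Account Number", "Invoice #", and so on), the sample text, and the function name `extract_by_keyword` are hypothetical; the real keywords depend on your documents.

```python
import re

# Hypothetical labels; adjust them to the keywords your PDFs actually use.
# \s* also matches newlines, so the value may sit on the line after its label.
FIELD_PATTERNS = {
    "date": re.compile(r"Date\s*:?\s*(\d\d/\d\d/\d\d)", re.IGNORECASE),
    "account": re.compile(r"Account\s*(?:Number|No\.?|#)?\s*:?\s*(\d+)", re.IGNORECASE),
    "invoice": re.compile(r"Invoice\s*(?:Number|No\.?|#)?\s*:?\s*(\d+)", re.IGNORECASE),
}


def extract_by_keyword(text):
    """Search the whole text for each field independently of the others."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        result[name] = m.group(1) if m else None
    return result


# Two made-up layouts: same data, different order and line breaks.
pdf_1 = "Date: 08/01/18\nAccount Number: 12345678\nInvoice Number: 54321"
pdf_2 = "Account No. 12345678\nInvoice #\n54321\nDate\n08/01/18"

print(extract_by_keyword(pdf_1))
print(extract_by_keyword(pdf_2))
```

A label such as "Invoice Date" would confuse these simple patterns, so treat this as a starting point and tighten the expressions against your real files.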

Let me know if this makes sense.
