How to get specific data from PDFs into an Excel file without an OCR engine

Dear all,

How can I get specific data from PDFs (with different templates) and put it all into a single Excel file, without using an OCR engine?

The scenario: I have PDFs in different templates (similar to invoices), and I need to extract specific information such as the date, invoice number, item number (each item has one), client, and so on, and put all of it into a single Excel file.

Problem:
The items have no title or heading in any template, and the number of item lines varies from one PDF to the next.
The Excel file needs to contain all the items of each invoice, one row per item, with the data from all the different templates (different clients) merged together.

Example of the Excel file:

Column1: Item number
Column2: Quantity of the item
Column3: Client name (each client has a different template of the PDF)
Column4: Version of the Item
Column5: Invoice Number
Column6: Invoice Date
Column7: Address

Data:

Item1 10 CLIENT1 1.0 14901 05/03/2018 Address1
Item2 3 CLIENT1 1.1 14901 05/03/2018 Address1
Item1 5 CLIENT2 1.1 489760 07/03/2018 Address2
Item1 1 CLIENT3 1.0 11133 08/03/2018 Address3

Kind regards,
Rafael

Hello,

Two things are required to do this without OCR: the PDFs must be computer-generated (not scanned files), and the column order must always stay the same.

  1. Use “Read PDF Text” to read all the data into a string variable.
  2. Use “Generate Datatable” to turn that string variable into a datatable. There you can select different column separators and parse the string into a table. In your example, the separator looks like a single space.
  3. Do this in a loop, merging each generated table into a master table.
  4. Then you can either rename the columns with an assign activity and output a CSV, or write the range to a prepared Excel file.
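
For anyone who wants to see the logic of steps 2 to 4 outside the workflow designer, here is a minimal Python sketch. It is not the activities themselves, only an illustration of the same idea. The names `parse_rows`, `merge_texts`, `to_csv` and `COLUMNS` are made up for this example, and the hard-coded strings in `pdf_texts` stand in for whatever “Read PDF Text” returns in step 1, using the sample data from the question. It assumes no field contains a space.

```python
import csv
import io

# Column names taken from the example in the question (assumed final layout).
COLUMNS = ["Item number", "Quantity", "Client name", "Version",
           "Invoice Number", "Invoice Date", "Address"]


def parse_rows(text, separator=None):
    """Step 2: split each non-empty line into fields.

    separator=None splits on any run of whitespace.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(separator)
        if len(fields) != len(COLUMNS):
            # A field that contains the separator would end up here.
            raise ValueError(
                f"expected {len(COLUMNS)} fields, got {len(fields)}: {line!r}"
            )
        rows.append(fields)
    return rows


def merge_texts(texts):
    """Step 3: parse every text in a loop and append to one master table."""
    master = []
    for text in texts:
        master.extend(parse_rows(text))
    return master


def to_csv(rows):
    """Step 4: add the column names and produce CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-ins for the text extracted from three PDFs (assumption for the demo).
pdf_texts = [
    "Item1 10 CLIENT1 1.0 14901 05/03/2018 Address1\n"
    "Item2 3 CLIENT1 1.1 14901 05/03/2018 Address1",
    "Item1 5 CLIENT2 1.1 489760 07/03/2018 Address2",
    "Item1 1 CLIENT3 1.0 11133 08/03/2018 Address3",
]

master = merge_texts(pdf_texts)
print(to_csv(master))
```

Note that a real address will usually contain spaces, so a plain space separator would split it into several columns. The field-count check above is there to make that visible instead of silently producing a shifted row.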