I am having trouble extracting a table from a PDF file into a DataTable. The first cell of the table in the PDF is blank, and there is data in the 2nd, 3rd and 4th columns. When the data is extracted into a DataTable and then written to Excel, the blank in the first column is ignored and everything in that row shifts to the left. It only happens when the blank is in the first column. Blanks in other columns are preserved.
Here is the PDF table
Here is the excel table. The data table that I extract into also reflects this.
Have you unticked ‘ignore first column’?
I am not sure where that is. There are some cases where the first 2 columns are blank and both are ignored. The other strange behavior is that in rows with a blank in column 1, all blanks are ignored: there may be data in columns 2, 3 and 8, but it will appear in columns 1, 2 and 3. Normal rows work fine. In the example screenshot, the second record has a blank in column 5 and it is processed correctly.
Did you find a solution? I am running into the same issue. I can get the data, but empty columns are collapsed (skipped) and the columns have significance in my source. I am working with a Native PDF (text, not image). I tried ExtractData (all 3 options) as well as GetFullText and GetVisibleText. Nothing preserves the column/data relationship. I realize that PDF is not intended to provide structure, but the data isn’t much good without it.
Looks like I found a solution, and it is the easiest possible one. You don’t even need to extract the table. You can use the (very fast and simple) PDF.Activities.ReadPDFText with the PreserveFormatting property set to True. Splitting the result on NewLine gives an array of strings, one per row, in which the columns are space-padded, so each cell can be pulled out by position using Substring.
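For anyone who wants the substring step spelled out, here is a rough sketch in Python (not UiPath code) of how space-padded lines can be cut into cells by position. The sample text, the function name `split_fixed_width` and the column offsets are all made up for illustration. You would need to work out the real offsets from your own PreserveFormatting output, and this assumes the columns really do line up at the same character positions on every line.

```python
def split_fixed_width(text, boundaries):
    """Cut each non-empty line into cells using (start, end) character offsets."""
    rows = []
    width = boundaries[-1][1]
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        # Pad short lines so slicing past the end still yields an empty cell
        padded = line.ljust(width)
        rows.append([padded[start:end].strip() for start, end in boundaries])
    return rows


# Hypothetical text, standing in for the formatted output of the PDF read
sample = (
    "ID    Name      Qty\n"
    "      Widget    3\n"
    "A2    Gadget\n"
)

# Assumed column offsets: ID = 0-6, Name = 6-16, Qty = 16-19
boundaries = [(0, 6), (6, 16), (16, 19)]

rows = split_fixed_width(sample, boundaries)
for row in rows:
    print(row)
```

Because each cell is taken by position rather than by splitting on whitespace, a blank first column stays an empty string instead of collapsing, which is the behavior the table extraction was losing.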