Info: How PDF Files Exported from Excel Store Table Data

In many discussions and posts it is often assumed that PDF files exported from Excel store the table data they contain as structured data. Just as often, posts describe solutions for turning tables stored in a PDF back into structured data. You can find excellent examples here

In all these cases the text of the PDF is restructured into a table.

This shows us a key insight: PDF files store unstructured data.

Let’s take a closer look at that. First, a tiny Excel table …

image

… and here is its representation in a PDF document.

image

They look identical, but they are not.

According to ISO 32000, “The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. At the core of PDF is an advanced imaging model derived from the PostScript® page description language. The PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner.”
In my words: PDF reproduces the layout of a document; it is not a container for structured content.

The PDF data is stored hierarchically in the document.

In this example, the table is a stream.

Here you see the representation of the first row of the table: the number 1 and the text Text 1. The operators BT and ET mark the beginning and the end of a text object.
Hint: You can find the operators in Table 51 of ISO 32000.
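
To make the BT/ET text objects a bit more tangible, here is a minimal Python sketch. The stream fragment in it is hand-written for illustration and is an assumption, not the actual output of Excel; real content streams are usually compressed and use font-specific encodings, so a PDF library is needed in practice. The function name `extract_positioned_text` is hypothetical as well.

```python
import re

# Hypothetical, hand-written fragment of a PDF content stream.
# A real Excel export will differ in fonts, coordinates and encoding.
SAMPLE_STREAM = """\
BT
/F1 11 Tf
1 0 0 1 72 720 Tm
(1) Tj
ET
BT
/F1 11 Tf
1 0 0 1 144 720 Tm
(Text 1) Tj
ET
"""

# BT ... ET delimits one text object.
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# Tm sets the text matrix; its last two operands are the x and y position.
TEXT_MATRIX = re.compile(
    r"([-\d.]+) ([-\d.]+) ([-\d.]+) ([-\d.]+) ([-\d.]+) ([-\d.]+) Tm"
)
# Tj shows a literal string given in parentheses.
SHOW_TEXT = re.compile(r"\((.*?)\)\s*Tj")


def extract_positioned_text(stream):
    """Return a list of (x, y, text) tuples, one per text object."""
    fragments = []
    for body in TEXT_OBJECT.findall(stream):
        matrix = TEXT_MATRIX.search(body)
        shown = SHOW_TEXT.search(body)
        if matrix and shown:
            x, y = float(matrix.group(5)), float(matrix.group(6))
            fragments.append((x, y, shown.group(1)))
    return fragments


print(extract_positioned_text(SAMPLE_STREAM))
```

The result is just a list of text fragments with coordinates. Nothing in it says that "1" and "Text 1" belong to the same table row.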

The text is saved with its position; the table structure is no longer present. There is no structuring as we know it, e.g., from HTML.

<table>
  <thead>
    <tr>
      <th>Number</th>
      <th>Text</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Text 1</td>
    </tr>
  </tbody>
</table>

Therefore, the existing data must be restructured.
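
As a rough idea of what such restructuring means, here is a minimal Python sketch. It assumes the text fragments are already available as (x, y, text) tuples, groups them into rows by similar y position, and orders each row by x. The function name `rebuild_rows`, the tolerance value and the sample coordinates are assumptions for illustration only; real documents need more robust rules, for example for multi-line cells or empty columns.

```python
def rebuild_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows and order each row by x.

    In PDF the origin is normally at the bottom left, so a larger y
    means the text is higher on the page.
    """
    rows = []  # list of (y, [(x, text), ...]), top of the page first
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough to the previous row's y: same table row.
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Sort the cells of every row from left to right and drop the coordinates.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, deliberately out of order.
fragments = [
    (144.0, 720.0, "Text 1"),
    (72.0, 720.0, "1"),
    (72.0, 735.0, "Number"),
    (144.0, 735.0, "Text"),
]

table = rebuild_rows(fragments)
print(table)
```

The rows and columns are derived purely from the positions, which is why this kind of logic often has to be tuned per document layout.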

To make this restructuring process as simple as possible and, above all, to ensure a high degree of reusability, Document Understanding is available. You can find a great introduction to unstructured data analysis with AI, OCR and RPA here, from @Tony_Tzeng.

I hope I have explained why it is necessary to restructure data from PDF.
As we can see, in this case, nothing is as it seems.


Hello Stefan,
Very nice article, and thank you for mentioning me in it. I would also like to add this video, in which I show 17 use cases for extracting tables from PDF and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with multiple words in the last column
27:00 File 5 PDF with multiple words in an inner column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF, remove spaces from headers and from data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

Thanks,
Cristian Negulescu
