Many discussions and posts assume that PDF files exported from Excel store their table data as structured data. Just as often, posts describe solutions for turning tables stored in a PDF back into structured data. You can find excellent examples here.
In all these cases the text of the PDF is restructured into a table.
This shows us a key insight: PDF files store unstructured data.
Let’s take a closer look at that. First, a tiny Excel table …
… and here is its representation in a PDF document.
They look identical, but they are not.
According to ISO 32000: “The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. At the core of PDF is an advanced imaging model derived from the PostScript® page description language. The PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner.”
In my words: PDF reproduces the layout of a document and is not a container for structured content.
The PDF data is stored hierarchically in the document.
In this example, the table is stored in a stream.
Here you see the representation of the first line of the table: the number 1 and the text Text 1. The operators BT and ET mark the beginning and the end of a text object.
Hint: You can find the operator categories in Table 51 of ISO 32000.
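To make this more tangible, here is a minimal Python sketch that reads text objects from a content stream. The stream fragment, its coordinates and the font name /F1 are made up for illustration; they are not taken from the file shown above. Real content streams are usually compressed and may use other operators (for example Td or TJ) or escaped parentheses, which this sketch does not handle.

```python
import re

# Hypothetical, uncompressed content stream fragment (illustration only).
CONTENT_STREAM = """
BT
/F1 11 Tf
1 0 0 1 72 720 Tm
(1) Tj
ET
BT
/F1 11 Tf
1 0 0 1 144 720 Tm
(Text 1) Tj
ET
"""

# Everything between BT and ET belongs to one text object.
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# Tm sets the text matrix; its last two operands are the x and y position.
TEXT_MATRIX = re.compile(r"(?:[-\d.]+\s+){4}([-\d.]+)\s+([-\d.]+)\s+Tm")
# Tj shows a literal string written in parentheses.
SHOW_TEXT = re.compile(r"\((.*?)\)\s*Tj")


def extract_fragments(stream):
    """Return a list of (x, y, text) tuples, one per text object."""
    fragments = []
    for body in TEXT_OBJECT.findall(stream):
        matrix = TEXT_MATRIX.search(body)
        shown = SHOW_TEXT.search(body)
        if matrix and shown:
            x, y = float(matrix.group(1)), float(matrix.group(2))
            fragments.append((x, y, shown.group(1)))
    return fragments


for fragment in extract_fragments(CONTENT_STREAM):
    print(fragment)
```

All the sketch gets back is a position and a string per text object. Nothing in the stream says that the two fragments belong to the same table row.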
The text is saved with its position, but the table structure is no longer present. There is no markup like the one we know from HTML, for example:
<table>
  <thead>
    <tr>
      <th>Number</th>
      <th>Text</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Text 1</td>
    </tr>
  </tbody>
</table>
Therefore, the existing data must be restructured.
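As a rough idea of what such restructuring involves, the following Python sketch rebuilds rows from positioned text fragments. The coordinates, the function name fragments_to_rows and the tolerance value are assumptions made for this example. It groups fragments with a similar y position into a row and orders the cells of each row by x position.

```python
from collections import defaultdict

# Hypothetical (x, y, text) fragments as they might come out of a PDF page.
FRAGMENTS = [
    (144.0, 720.0, "Text 1"),
    (72.0, 734.0, "Number"),
    (72.0, 720.0, "1"),
    (144.0, 734.0, "Text"),
]


def fragments_to_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows by y position, then sort cells by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Fragments whose y values fall into the same bucket share a row.
        rows[round(y / y_tolerance)].append((x, text))
    # PDF coordinates start at the bottom left, so a larger y is higher up.
    ordered = sorted(rows.items(), reverse=True)
    return [[text for _, text in sorted(cells)] for _, cells in ordered]


for row in fragments_to_rows(FRAGMENTS):
    print(row)
```

Simple bucketing like this breaks down quickly with merged cells, wrapped text or values that sit right on a bucket boundary, which is why a more robust approach is worth having.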
To make this restructuring process as simple as possible and, above all, to ensure a high degree of reusability, Document Understanding is available. You can find a great introduction to unstructured data analysis with AI, OCR and RPA here, from @Tony_Tzeng.
I hope I have explained why it is necessary to restructure data from PDF.
As we can see, in this case, nothing is as it seems.