Info: How PDF Files Exported from Excel Store Table Data

StefanSchnell · February 27, 2021, 9:31am

Many discussions and posts assume that PDF files exported from Excel store the table data they contain as structured data. Just as often, posts describe solutions for turning tables stored in a PDF back into structured data. You can find excellent examples here:

Regex to extract data from @prasath17
How to Extract Table from PDF from @Cristian_Negulescu

In all these cases the text of the PDF is restructured into a table.

This shows us a key insight: PDF files store unstructured data.

Let’s take a closer look. First, a tiny Excel table …

… and here its representation in a PDF document.

They look identical, but they are not.

According to ISO 32000, “The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. At the core of PDF is an advanced imaging model derived from the PostScript(R) page description language. The PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner.”
In my words: PDF reproduces the layout of a document and is not a container for structured content.

The PDF data is stored hierarchically in the document.

In this example, the table is a stream.
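A minimal sketch of how such a stream can be read, assuming it is compressed with the common /FlateDecode filter and that /Length is a direct integer. The object below is built by hand for illustration; it is not taken from the exported file, and the name read_flate_streams is hypothetical.

```python
import re
import zlib

# Hypothetical page content; a real export from Excel will differ.
content = b"BT /F1 11 Tf 1 0 0 1 72 720 Tm (1) Tj ET"
compressed = zlib.compress(content)

# A hand-built stream object in the shape it could have inside a PDF file.
pdf_object = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

# Matches the end of a stream dictionary and captures the /Length value.
STREAM_HEADER = re.compile(rb"/Length\s+(\d+)[^>]*>>\s*stream\r?\n")


def read_flate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    bodies = []
    for match in STREAM_HEADER.finditer(pdf_bytes):
        length = int(match.group(1))
        data = pdf_bytes[match.end():match.end() + length]
        try:
            bodies.append(zlib.decompress(data))
        except zlib.error:
            # Not Flate-compressed, or /Length was an indirect reference.
            continue
    return bodies


print(read_flate_streams(pdf_object))
```

Reading exactly /Length bytes avoids guessing where the binary data ends. A dedicated PDF library is the better choice for real files; this only shows that the page content is an ordinary, if compressed, sequence of drawing operators.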

Here you see the representation of the first line of the table: the number 1 and the text Text 1. The operators BT and ET mark the beginning and the end of a text object.
Hint: You can find the operators in Table 51 of ISO 32000.
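To illustrate, here is a rough sketch that pulls the positioned text out of BT/ET text objects. The content stream is a hypothetical fragment in the style described above, not the bytes of the actual file, and it assumes each text object sets its position with Tm and shows one simple literal string with Tj.

```python
import re

# Hypothetical content stream fragment for the first table row.
SAMPLE_STREAM = """\
BT
/F1 11 Tf
1 0 0 1 72 720 Tm
(1) Tj
ET
BT
/F1 11 Tf
1 0 0 1 144 720 Tm
(Text 1) Tj
ET
"""

# One text object: everything between the BT and ET operators.
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# The last two operands of Tm are the x and y translation.
TEXT_MATRIX = re.compile(r"([-+]?[\d.]+)\s+([-+]?[\d.]+)\s+Tm\b")
# A literal string followed by Tj (no escaped parentheses handled).
SHOW_TEXT = re.compile(r"\((.*?)\)\s*Tj\b")


def extract_text_objects(stream):
    """Return (x, y, text) for each text object that has a Tm and a Tj."""
    fragments = []
    for block in TEXT_OBJECT.finditer(stream):
        body = block.group(1)
        matrix = TEXT_MATRIX.search(body)
        shown = SHOW_TEXT.search(body)
        if matrix and shown:
            x, y = float(matrix.group(1)), float(matrix.group(2))
            fragments.append((x, y, shown.group(1)))
    return fragments


for fragment in extract_text_objects(SAMPLE_STREAM):
    print(fragment)
```

The result is a flat list of strings with coordinates. Nothing in it says that "1" and "Text 1" are two cells of the same row.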

The text is saved with its position, but the table structure is no longer present. There is no structure as we know it from, for example, HTML:

<table>
  <thead>
    <tr>
      <th>Number</th>
      <th>Text</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Text 1</td>
    </tr>
  </tbody>
</table>

Therefore, the existing data must be restructured.
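One simple way to restructure, sketched under stated assumptions: the fragments are available as (x, y, text) tuples, y grows upwards as in PDF page coordinates, and cells of one row share nearly the same y. The coordinates and the y_tolerance parameter are made up for illustration.

```python
def rebuild_rows(fragments, y_tolerance=2.0):
    """Group positioned text fragments into rows, top to bottom, left to right."""
    rows = []
    current = []
    current_y = None
    # Higher y means higher on the page, so sort by descending y first.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if current_y is None or abs(y - current_y) > y_tolerance:
            if current:
                rows.append(current)
            current = []
            current_y = y
        current.append((x, text))
    if current:
        rows.append(current)
    # Within each row, order the cells by their x position.
    return [[text for _, text in sorted(row)] for row in rows]


# Hypothetical fragments in arbitrary order, as they might occur in a stream.
fragments = [
    (144.0, 720.0, "Text 1"),
    (72.0, 734.0, "Number"),
    (72.0, 720.0, "1"),
    (144.0, 734.5, "Text"),
]

for row in rebuild_rows(fragments):
    print(row)
```

This only works for clean grids. Cells that wrap over several lines, empty cells, and multi-page tables need additional rules, which is why reusable tooling is attractive.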

To make this restructuring process as simple as possible and, above all, to ensure a high degree of reusability, Document Understanding is available. You can find a great introduction to unstructured data analysis with AI, OCR and RPA here, from @Tony_Tzeng.

I hope I have explained why it is necessary to restructure data from PDF files.
As we can see, in this case, nothing is as it seems.

Cristian_Negulescu · February 28, 2021, 7:39pm

Hello Stefan,
Very nice article, and thank you for mentioning me. I would also like to add this video, in which I cover 17 use cases for extracting tables from PDF files and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with multiple words in the last column
27:00 File 5 PDF with multiple words in an inner column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF remove spaces from headers also remove space from Data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

github.com/cristinegulescu/startUiPathFromSalesforce/blob/master/PDFdecode.txt

        'FILE1
        Dim strtmp As String
        strtmp = strin.Substring(strin.IndexOf("Number"), strin.IndexOf("Subtotal") - strin.IndexOf("Number")).Trim
        strout = strtmp.Replace(" ", "|")

        strtmp = strin.Substring(strin.IndexOf("Subtotal") + 8)
        strpar = strtmp.Substring(0, strtmp.IndexOf(Environment.NewLine)).Trim


        'FILE2
        Dim strtmp As String
        Dim strout As String
        strout = "Col1|Col2|Col3|Col4"
        strtmp = strin.Substring(strin.IndexOf("Vacancies") + 11).Trim
        For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
            If (line.Length > 3) Then
                If (IsNumeric(line(0))) And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + Environment.NewLine + line.Replace("  ", "").Replace("  ", "|").Trim
                ElseIf (line(0) = "") And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + line.Replace("  ", "$").Trim()

This file has been truncated.
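For readers who do not use VB.NET, the FILE1 logic above can be approximated in Python as follows. This is a sketch, not the original workflow: the function name and the sample text are hypothetical, and only the variable names strin, strout and strpar are taken from the snippet.

```python
def file1_to_pipe_table(strin):
    """Cut the text between "Number" and "Subtotal" and pipe-delimit it."""
    start = strin.index("Number")
    end = strin.index("Subtotal")
    # Every single space becomes a column separator.
    strout = strin[start:end].strip().replace(" ", "|")
    # The rest of the line after "Subtotal" holds the subtotal value.
    rest = strin[end + len("Subtotal"):]
    strpar = rest.partition("\n")[0].strip()
    return strout, strpar


# Hypothetical text as a PDF text reader might return it.
sample = "Invoice\nNumber Text\n1 Text1\nSubtotal 42\nEnd"

strout, strpar = file1_to_pipe_table(sample)
print(strout)
print(strpar)
```

Because every space becomes a separator, a cell such as "Text 1" would be split into two columns. That is the kind of case the later files in the video deal with.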

Thanks,
Cristian Negulescu

katara · October 24, 2021, 5:52pm

thanks

document.pdf (109.8 KB)

