PDF Regex problem

Construction_Directive.pdf (1.3 MB)
I need a regex that will read a single line with columns, for example:

Column name: ESO
Content: CEN
-----------------
Column name: Reference and title of the standard (and reference document)
Content:
EN 1:1998
Flued oil stoves with vaporizing burners
EN 1:1998/A1:2007
-----------------
Column name: Reference and title of the standard
Content:
-----------------
Column name: Beginning of the coexistence period
Content:
01/01/2008
01/01/2008
-----------------
Column name: End of the coexistence period
Content:
01/01/2009
01/01/2009

This regex only captures "Flued oil stoves with vaporizing burners
EN 1:1998/A1:2007", but it doesn't capture the first reference, EN 1:1998: "(?:(ESO)\s+)?(?<Reference>[A-Z0-9\/:]+(?: [\w\-]+)+(?:\s+[A-Z0-9\/:]+(?:/[A-Z0-9]+)?)*)\s+(?<Title>[\w\s\-\:]+(?:[\w\s\-\:]+)*)\s*(?<Superseded>[A-Z0-9\/:]+(?:\/[A-Z0-9]+)?)?\s+(?<Beginning>\d{2}/\d{2}/\d{4})\s+(?<End>\d{2}/\d{2}/\d{4})"

@vytmon, please show the required output. Or is it the example you showed above?

Are you looking to flatten the tabular data in the attached doc? How are you reading the pdf? Based on your query, I’m assuming you want to use Regex rather than DU?

Hello

Please provide examples of the text you need to extract from your sample.

Cheers

For example, it needs to capture just:

EN 1:1998 01/01/2009 and EN 1:1998/A1:2007 01/01/2009

Output:

[      EN 1:1998       01/01/2008 01/01/2009
 Flued oil stoves with vaporizing burners

 EN 1:1998/A1:2007       01/01/2008 01/01/2009]
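
One way to approach this is to stop matching the title at all and anchor only on the reference and the two dates that share a line with it. The sketch below is written in Python, so named groups use `(?P<Name>...)`; in a regex flavor that writes them as `(?<Name>...)`, as in the pattern above, drop the `P`. It rests on two assumptions taken from the output shown above, not from the full PDF: every reference starts with "EN", and it sits on the same line as its two dd/mm/yyyy dates. The names `ROW`, `extract_rows` and `sample` are made up for illustration.

```python
import re

# Assumption: each reference starts with "EN" and shares a line with its
# two dd/mm/yyyy dates, as in the output above. Titles are simply skipped.
ROW = re.compile(
    r"(?P<Reference>EN(?: [A-Z]+)* \d+(?:-\d+)*:\d{4}"  # e.g. EN 1:1998
    r"(?:/[A-Z]{1,2}\d*:\d{4})*)"                        # optional /A1:2007
    r"[ \t]+(?P<Beginning>\d{2}/\d{2}/\d{4})"            # start of coexistence
    r"[ \t]+(?P<End>\d{2}/\d{2}/\d{4})"                  # end of coexistence
)


def extract_rows(text):
    """Return (reference, beginning, end) tuples for every matching line."""
    return [(m["Reference"], m["Beginning"], m["End"]) for m in ROW.finditer(text)]


# Sample copied from the output above.
sample = """      EN 1:1998       01/01/2008 01/01/2009
 Flued oil stoves with vaporizing burners

 EN 1:1998/A1:2007       01/01/2008 01/01/2009"""

# Print only the reference and the end date, as requested.
for reference, _beginning, end in extract_rows(sample):
    print(reference, end)
```

Because the separator between the reference and the dates is `[ \t]+` rather than `\s+`, a match cannot run across a line break, so the title line between the two references is never pulled into a group. If the PDF text arrives with the dates on separate lines from the references, this pattern would need adjusting.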