Need to extract information from PDFs using a regular expression

Hi,
I need to extract the information for two tags from a list of PDFs using a regular expression.
The structure of every PDF is the same.

The first tag is Wertmindernde Faktoren (in bold), and it is always followed by information in table form, like below:

As you can see, the information is in table format, and I need a regex that extracts it as a table for this particular tag only.

The second tag is Gebrauchsspuren (in bold).
Its information is sometimes in table format and sometimes a single sentence.
Both cases are shown below.


Table format


Information in sentence format (when no data is available)

Any help or suggestions on what the regex for these two tags should be?

Thanks in advance.

At least two strategies are possible (assuming the text can be split into lines):

Filtering

  • start line grabbing when the keyword is detected
  • stop grabbing once the lines matching the typical tabular format pattern have been added
  • feed it to a generate table activity
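
A minimal sketch of the filtering strategy, written in Python rather than as a UiPath workflow. The names grab_table and ROW are hypothetical, and the row pattern (a running number followed by whitespace) and the rule that a blank line ends the block are assumptions based on the sample text in this thread. The returned rows stand in for the input of the generate table step.

```python
import re

# Assumption: a data row starts with a running number followed by whitespace.
ROW = re.compile(r"^\d+\s+\S")


def grab_table(text, keyword):
    """Return the table rows that follow the line starting with `keyword`."""
    rows = []
    grabbing = False
    for line in text.splitlines():
        line = line.strip()
        if not grabbing:
            # Start grabbing once the keyword line is detected.
            grabbing = line.startswith(keyword)
            continue
        if not line:
            break  # assumption: a blank line ends the block
        if ROW.match(line):
            rows.append(line)
        # Other lines (the column header, or a sentence when there is
        # no table) are skipped.
    return rows


sample = "\n".join([
    "Wertmindernde Faktoren",
    "Nr. Bauteilgruppe Beschreibung",
    "1 Heckklappe/-tür Heckklappe - Dellen - sanft instandsetzen",
    "",
    "Gebrauchsspuren",
    "Nr. Bauteilgruppe Beschreibung",
    "1 Stossfänger vorn Spoiler (Unterhalb) - Kratzer - kein Abzug",
    "2 Stossfänger hinten Stossfänger hinten - Kratzer - kein Abzug",
    "",
    "Vorschaden",
    "Nr. Vorschaden Schadenshöhe",
    "1 fachgerecht repariert , Reparaturrechnung nicht vorhanden 311,10 €",
])

for row in grab_table(sample, "Gebrauchsspuren"):
    print(row)
```

When the keyword is followed by a sentence instead of a table, this sketch returns an empty list, because no line in the block matches the row pattern.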

Slicing and index

  • match all lines on the tabular format pattern
  • identify the table starters while checking the match position index
  • check for keyword before first table data line
  • feed it to a generate table activity
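
The slicing-and-index steps above can be sketched as follows, again in Python rather than UiPath. The names find_tables, ROW, and lookback are hypothetical. The sketch assumes that a data row starts with a running number, and that the keyword sits at most two lines above the first data row (heading line, then column header), as in the sample text in this thread.

```python
import re

# Assumption: a data row starts with a running number followed by whitespace.
ROW = re.compile(r"^\d+\s+\S")


def find_tables(text, keywords, lookback=2):
    """Map each keyword to the run of table rows that follows it."""
    lines = [ln.strip() for ln in text.splitlines()]
    # 1. Match all lines on the tabular pattern and remember their indexes.
    hits = {i for i, ln in enumerate(lines) if ROW.match(ln)}
    # 2. A table starter is a row whose previous line is not a row.
    starters = sorted(i for i in hits if i - 1 not in hits)
    tables = {}
    for start in starters:
        # 3. Check for a keyword shortly before the first data line.
        before = lines[max(0, start - lookback):start]
        keyword = next(
            (k for k in keywords if any(b.startswith(k) for b in before)),
            None,
        )
        if keyword is None:
            continue
        # 4. Slice the consecutive rows; these would feed the table step.
        end = start
        while end in hits:
            end += 1
        tables[keyword] = lines[start:end]
    return tables


sample = "\n".join([
    "04159 Leipzig",
    "",
    "Wertmindernde Faktoren",
    "Nr. Bauteilgruppe Beschreibung",
    "1 Heckklappe/-tür Heckklappe - Dellen - sanft instandsetzen",
    "",
    "Gebrauchsspuren",
    "Nr. Bauteilgruppe Beschreibung",
    "1 Stossfänger vorn Spoiler (Unterhalb) - Kratzer - kein Abzug",
    "2 Stossfänger hinten Stossfänger hinten - Kratzer - kein Abzug",
    "",
    "Vorschaden",
    "Nr. Vorschaden Schadenshöhe",
    "1 fachgerecht repariert , Reparaturrechnung nicht vorhanden 311,10 €",
])

result = find_tables(sample, ["Wertmindernde Faktoren", "Gebrauchsspuren"])
for name, rows in result.items():
    print(name, len(rows))
```

Lines such as the postal code in the address block also match the row pattern, which is why the keyword check before each table starter matters.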

Thanks for the reply. Can you please elaborate, or provide a small PoC for these?

please provide some sample text

Hi, please find the sample below:

"hinten zu 65% abgedunkelt
Verkehrszeichenerkennung
Vordersitze beheizbar
Vordersitze elektrisch einstellbar, Fahrersitz mit Memory, Komforteinstieg, längs verschiebbarer
Oberschenkelauflage
Wegfahrsperre elektronisch

20.07.2021 Gutachtennummer: XXXXXX Seite 6 / 17TÜV SÜD Auto Plus GmbH Fahrzeugbewertung
Wiesenring 2
04159 Leipzig
XXXXXXXXX
GUTACHTENNUMMER: XXXXXX
Bei Rückfragen bitte Gutachtennummer und Datum angeben Datum: 20.07.2021

Ausstattung
Zentralverriegelung ohne Safe-Sicherung,mit Funkfernbedienung, 2 Funkschlüssel, Komfortstartfunktion
“Press & Drive”

Wertmindernde Faktoren
Nr. Bauteilgruppe Beschreibung
1 Heckklappe/-tür Heckklappe - Dellen - sanft instandsetzen

Gebrauchsspuren
Nr. Bauteilgruppe Beschreibung
1 Stossfänger vorn Spoiler (Unterhalb) - Kratzer - kein Abzug
2 Stossfänger hinten Stossfänger hinten - Kratzer - kein Abzug
3 Tür hinten rechts Tür - Dellen - kein Abzug
4 Tür vorn rechts Tür - Dellen - kein Abzug

Vorschaden
Nr. Vorschaden Schadenshöhe
1 fachgerecht repariert , Reparaturrechnung nicht vorhanden 311,10 €
2 fachgerecht repariert , Reparaturrechnung nicht vorhanden 493,31 €
3 Seite links, fachgerecht repariert , Reparaturrechnung nicht vorhanden

Summe (netto): 804,41 €"

I didn't include the whole data, as it contains some sensitive information as well.

So, as you can see, I need to extract the table for “Wertmindernde Faktoren” and the table for “Gebrauchsspuren”. These two words usually appear in bold. You can refer to my first post to see the structure of these two tables.

FYI, as I mentioned, I have a list of PDFs from which I need to extract these tables, and the structure of the tables is the same in all of them.

Hi there,
for a good regex extraction of a table, it can be helpful to keep the formatting:

do you see a difference?

Yes, I actually already did that and wrote the information extracted from the PDF to a txt file.
Please find the text in the attachment. You can find the correct format there.

TUV_Data - Copy.txt (2.3 KB)

I had to omit some information as it contains sensitive data.

Thanks in advance.

@ppr please find the text file which has the information (it has preserved format)

TUV_Data - Copy.txt (2.3 KB)

@ppr @yrobert
I created this regular expression, which works on regex101.com but not in UiPath:

(?:Wertmindernde Faktoren|Gebrauchsspuren.*)\n(.*(?:\n.+)*)

Any suggestions on how I should modify this?

(?:Wertmindernde Faktoren|Gebrauchsspuren.*)\r?\n(.*(?:\n.+)*)

Include the optional \r? and check again; it handles the Windows line break (\r\n).
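
A small sketch of why the optional \r? matters, using Python's re module as a stand-in for the .NET regex engine behind UiPath. It assumes the extracted text uses Windows line breaks (\r\n), which is not confirmed in this thread; the variable names are hypothetical.

```python
import re

# Assumption: the extracted PDF text uses Windows line breaks (\r\n).
crlf_text = "Wertmindernde Faktoren\r\nNr. Bauteilgruppe Beschreibung\r\n"

# A bare \n right after the keyword fails, because \r sits in between.
strict = re.search(r"Wertmindernde Faktoren\n", crlf_text)
# With the optional \r? the same pattern matches both \n and \r\n.
tolerant = re.search(r"Wertmindernde Faktoren\r?\n", crlf_text)

print(strict is None)        # True
print(tolerant is not None)  # True

# '.' matches '\r', so a pattern such as \n.+ also accepts an "empty"
# Windows line (a lone \r) and can run past the blank line after a table.
dot_matches_cr = re.fullmatch(r".+", "\r") is not None
print(dot_matches_cr)        # True
```

An alternative under the same assumption is to normalize the text first, replacing \r\n with \n, and keep the pattern simple.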

Thanks for your response. I gave the wrong regex previously by mistake; it is actually

(?:Wertmindernde Faktoren|Gebrauchsspuren.*)\n(.*(?:\n.+)*)

I added the optional Windows line break as well, but it didn't work.

My code currently looks like this.

In the Assign activity I put the code below:

matchString = Regex.Match(extractedText, "(?:Wertmindernde Faktoren|Gebrauchsspuren.*)\r?\n(.*(?:\n.+)*)").Value

Then I printed it, but it didn't work in UiPath.

Let me know if I missed anything.

Thanks

@Debartha_Mitra_DE
can you check?


(?:Wertmindernde Faktoren|Gebrauchsspuren)(?:.*?\r?\n)(.|\n)*?(?=\r?\n\r?\n)
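
A hedged sketch of how this pattern behaves, run with Python's re module instead of UiPath. It assumes the quantifiers are the lazy forms .*? and (.|\n)*?, which is an assumption about how the pattern was meant to read, and that each table is followed by a blank line. The names PATTERN and extract_blocks are hypothetical.

```python
import re

# Heading, rest of the heading line, then everything up to (but not
# including) the next blank line. Lazy quantifiers are assumed.
PATTERN = re.compile(
    r"(?:Wertmindernde Faktoren|Gebrauchsspuren)(?:.*?\r?\n)(.|\n)*?(?=\r?\n\r?\n)"
)


def extract_blocks(text):
    """Map each matched heading line to the lines that follow it."""
    blocks = {}
    for match in PATTERN.finditer(text):
        lines = match.group(0).splitlines()
        blocks[lines[0].strip()] = lines[1:]
    return blocks


# Sample joined with Windows line breaks; the pattern also accepts \n.
sample = "\r\n".join([
    "Wertmindernde Faktoren",
    "Nr. Bauteilgruppe Beschreibung",
    "1 Heckklappe/-tür Heckklappe - Dellen - sanft instandsetzen",
    "",
    "Gebrauchsspuren",
    "Nr. Bauteilgruppe Beschreibung",
    "1 Stossfänger vorn Spoiler (Unterhalb) - Kratzer - kein Abzug",
    "2 Stossfänger hinten Stossfänger hinten - Kratzer - kein Abzug",
    "",
    "Vorschaden",
    "Nr. Vorschaden Schadenshöhe",
])

blocks = extract_blocks(sample)
for heading, lines in blocks.items():
    print(heading, lines)
```

The lookahead requires a blank line after the block, so a table at the very end of the text with no trailing blank line would not be matched by this sketch.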


Thanks for your modification, it worked :)