Robott
(Sajid Younas)
April 5, 2023, 4:18pm
1
I want to capture everything that comes between city pairs, including empty lines, up to the next city pair.
The next city pair is in turn the beginning of the next section, with its own text.
Any idea what I am doing wrong? Is there a better approach?
(?<city_pair1>KOELN EIFELTOR UBF - BUSTO ARSIZIO)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
(?<city_pair2>BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
(?<city_pair3>DUISBURG RHEINHAUSEN DKT - NOVARA CIM)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
(?<city_pair4>BUSTO ARSIZIO - KOELN EIFELTOR UBF)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
(?<city_pair5>KOELN EIFELTOR UBF - DUISBURG RHEINHAUSEN DKT)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
(?<city_pair6>BUSTO ARSIZIO - NOVARA CIM)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
(?<city_pair7>DUISBURG RHEINHAUSEN DKT - KOELN EIFELTOR UBF)(?:\r?\n(?:\d.*\n)*\r?\n(?:(?!KOELN EIFELTOR UBF - BUSTO ARSIZIO|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT).)*\r?\n)*
Hi @Robott ,
Could you also provide the sample data?
We could then check on our end whether the regex matches it.
Robott
(Sajid Younas)
April 5, 2023, 4:38pm
3
The text between city pairs can look like this. Instead of # there can be any other character, of course.
##/##/## BUS#/###### AGCU######-# ###/# #### #### ###.## 06 AGCU####### Additional performances:
* SUPPL.REDUCTION CH GOVERNMENT SUBSIDIES 30.00
##/##/## BUS#/###### AGCU######-# ###/# ### #### ###.## 06 AGCU####### Additional performances:
* SUPPL.REDUCTION CH GOVERNMENT SUBSIDIES 30.00
##/##/## BUS#/###### AGCU######-# ###/# ### #### ###.## 06 AGCU######-# Additional performances:
* SUPPL.REDUCTION CH GOVERNMENT SUBSIDIES 30.00
##/##/## BUS#/###### AGCU######-# ###/# ### #### ###.## 06 AGCU####### Additional performances:
* SUPPL.REDUCTION CH GOVERNMENT SUBSIDIES 30.00
##/##/## BUS#/###### AGCU######-# ###/# ### #### ###.## 06 AGCU####### Additional performances:
* SUPPL.REDUCTION CH GOVERNMENT SUBSIDIES 30.00
##/##/## BUS#/###### AGCU######-# ###/# ### #### ###.## 06 AGCU####### Additional performances:
* SUPPL.REDUCTION CH GOVERNMENT SUBSIDIES 30.00
##/##/## BUS#/###### AGCU######-# ###/# ### #### ###.## 06 AGCU######-# Additional performances:
Sono valide le Condizioni Generali delle Società . Es gelten die Allgemeinen Bedingungen.### ########## ###### ##################### #################### ###################### #################### [www.###########](http://www./###########) D-###### Berlin
############IVA INVOICE ####### / 2
January
OUR REF ### ###### Period #.##.## ##.##.##
@Robott ,
The regex you provided uses the keyword KOELN EIFELTOR UBF - BUSTO ARSIZIO
and many others, but we cannot find them in the sample data provided.
Please make sure the keywords/anchors are not masked when sending the data.
Robott
(Sajid Younas)
April 5, 2023, 4:53pm
5
Please have a look here (link to an online regular expression tester).
It should capture all text following the city pair until a new city pair appears.
Also, the city pairs can appear in any order the next time.
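One way to get "heading plus everything up to the next heading" without a long lookaround pattern is to split the text on the city-pair headings themselves. Below is a minimal sketch in Python, assuming the list of city pairs is known in advance; `CITY_PAIRS`, `split_by_city_pair`, and the sample text are made up for illustration and are not from the thread.

```python
import re

# Hypothetical anchor list; in practice it would hold every city pair
# that can occur in the document.
CITY_PAIRS = [
    "KOELN EIFELTOR UBF - BUSTO ARSIZIO",
    "BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT",
    "DUISBURG RHEINHAUSEN DKT - NOVARA CIM",
]


def split_by_city_pair(text, pairs=CITY_PAIRS):
    """Return a list of (city_pair, body) tuples in document order."""
    # Longest names first, so a longer heading wins over a shorter prefix.
    ordered = sorted(pairs, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # The capturing group makes re.split keep the headings in the result:
    # [text before first heading, heading, body, heading, body, ...]
    parts = re.split(f"({alternation})", text)
    return list(zip(parts[1::2], parts[2::2]))


# Made-up sample; the real data is masked in the thread.
sample = (
    "KOELN EIFELTOR UBF - BUSTO ARSIZIO\n"
    "01/02/23 line a\n"
    "\n"
    "* SUPPL.REDUCTION 30.00\n"
    "DUISBURG RHEINHAUSEN DKT - NOVARA CIM\n"
    "03/02/23 line b\n"
)

sections = split_by_city_pair(sample)
for pair, body in sections:
    print(repr(pair), repr(body))
```

Because the split does not depend on the order of the headings, it does not matter which city pair comes next. Empty lines stay inside the body, and the last section runs to the end of the text.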
@Robott ,
Could you check whether the regex below is what you require:
(?<=KOELN EIFELTOR UBF - BUSTO ARSIZIO|BUSTO ARSIZIO - KOELN EIFELTOR UBF|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|NOVARA CIM - DUISBURG RHEINHAUSEN DKT|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT|DUISBURG RHEINHAUSEN DKT - BUSTO ARSIZIO|KOELN EIFELTOR UBF - DUISBURG RHEINHAUSEN DKT)[\S\s]+?(?=KOELN EIFELTOR UBF - BUSTO ARSIZIO|BUSTO ARSIZIO - KOELN EIFELTOR UBF|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|NOVARA CIM - DUISBURG RHEINHAUSEN DKT|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT|DUISBURG RHEINHAUSEN DKT - BUSTO ARSIZIO|KOELN EIFELTOR UBF - DUISBURG RHEINHAUSEN DKT)
Robott
(Sajid Younas)
April 6, 2023, 9:15am
7
Not exactly…
The city pair should also be part of the result, or it should at least be possible to name the matches somehow.
As you can see in the regex check, it stops capturing after 4 matches.
@Robott ,
Could you check with this updated regex:
(?=KOELN EIFELTOR UBF - BUSTO ARSIZIO|BUSTO ARSIZIO - KOELN EIFELTOR UBF|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|NOVARA CIM - DUISBURG RHEINHAUSEN DKT|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT|DUISBURG RHEINHAUSEN DKT - BUSTO ARSIZIO|KOELN EIFELTOR UBF - DUISBURG RHEINHAUSEN DKT|KOELN EIFELTOR UBF - NOVARA CIM|NOVARA CIM - KOELN EIFELTOR UBF)[\S\s]+?(?=KOELN EIFELTOR UBF - BUSTO ARSIZIO|BUSTO ARSIZIO - KOELN EIFELTOR UBF|DUISBURG RHEINHAUSEN DKT - NOVARA CIM|NOVARA CIM - DUISBURG RHEINHAUSEN DKT|BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT|DUISBURG RHEINHAUSEN DKT - BUSTO ARSIZIO|KOELN EIFELTOR UBF - DUISBURG RHEINHAUSEN DKT|KOELN EIFELTOR UBF - NOVARA CIM|NOVARA CIM - KOELN EIFELTOR UBF)
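Two things about the pattern above may be worth checking. It ends with a lookahead that requires another city pair, so the text after the last heading in the document would not be matched. It also does not name the heading. The following Python sketch shows one possible variation under those assumptions: the heading is consumed into a named group, and the closing lookahead also accepts the end of the text. Python's `re` writes named groups as `(?P<name>...)` rather than `(?<name>...)`; the shortened `PAIRS` list and the sample text are illustrative only.

```python
import re

# Shortened, illustrative list of anchors.
PAIRS = [
    "KOELN EIFELTOR UBF - BUSTO ARSIZIO",
    "BUSTO ARSIZIO - KOELN EIFELTOR UBF",
    "DUISBURG RHEINHAUSEN DKT - NOVARA CIM",
]
ALT = "|".join(re.escape(p) for p in PAIRS)

SECTION = re.compile(
    rf"(?P<city_pair>{ALT})"   # the heading is consumed and named
    rf"(?P<body>[\s\S]*?)"     # lazily take everything, newlines included
    rf"(?={ALT}|\Z)"           # stop before the next heading or at end of text
)

# Made-up sample text.
text = (
    "BUSTO ARSIZIO - KOELN EIFELTOR UBF\n"
    "row 1\n"
    "KOELN EIFELTOR UBF - BUSTO ARSIZIO\n"
    "row 2\n"
    "\n"
    "row 3\n"
)

matches = [(m.group("city_pair"), m.group("body")) for m in SECTION.finditer(text)]
for pair, body in matches:
    print(repr(pair), repr(body))
```

With the `\Z` alternative the final section is returned as well, and each match carries its city pair in the `city_pair` group.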
Also, are you sure the keyword values that we use as anchors are constant?
Initially you had used only a few keywords (city_pair identifiers), so there were only 4 matches; now, after adding the other values, the regex identifies 7 matches.
Are 7 matches what you require?
The keywords below are used to identify the different sections. If other values can occur, the list may need to be dynamic. Do you have a list of these keywords that we can use?
Robott
(Sajid Younas)
April 6, 2023, 11:15am
9
Hi,
Well, the problem is that the city pair repeats on each page. So the requirement is to get all text for a city pair until a new city pair appears in the text. In this case that should give 4 matches in total for the whole text.
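If the same heading is repeated at the top of each page, one option is to match every heading first and then merge neighbouring sections that share a city pair. The sketch below assumes the text has already been split into (city_pair, body) tuples by some earlier step; `merge_repeated_headings` and the sample sections are hypothetical.

```python
from itertools import groupby


def merge_repeated_headings(sections):
    """Collapse consecutive (city_pair, body) entries that share a city pair."""
    merged = []
    # groupby only groups adjacent items, so a city pair that comes back
    # later, after a different one, starts a new section.
    for pair, group in groupby(sections, key=lambda s: s[0]):
        merged.append((pair, "".join(body for _, body in group)))
    return merged


# Made-up input: the first heading repeats because of a page break.
sections = [
    ("KOELN EIFELTOR UBF - BUSTO ARSIZIO", "\npage 1 rows\n"),
    ("KOELN EIFELTOR UBF - BUSTO ARSIZIO", "\npage 2 rows\n"),
    ("BUSTO ARSIZIO - DUISBURG RHEINHAUSEN DKT", "\npage 2 more rows\n"),
    ("KOELN EIFELTOR UBF - BUSTO ARSIZIO", "\npage 3 rows\n"),
]

merged = merge_repeated_headings(sections)
for pair, body in merged:
    print(repr(pair), repr(body))
```

Page header and footer text that sits between two repeated headings would remain in the merged body, so it may need to be filtered out separately.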
@Robott , then was the first regex not the required one?
Just change the initial part (?<= to (?= and check again.
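The effect of that change can be seen on a small example: with a lookbehind the heading is excluded from the match, while with a lookahead the match starts at the heading and includes it. This is only a sketch in Python, whose `re` module accepts fixed-width lookbehind only, so it uses a single start heading instead of the full alternation from the thread; the sample text is made up.

```python
import re

START = re.escape("KOELN EIFELTOR UBF - BUSTO ARSIZIO")
STOP = re.escape("DUISBURG RHEINHAUSEN DKT - NOVARA CIM")

# Made-up sample text.
text = (
    "KOELN EIFELTOR UBF - BUSTO ARSIZIO\n"
    "row 1\n"
    "DUISBURG RHEINHAUSEN DKT - NOVARA CIM\n"
    "row 2\n"
)

# Lookbehind: the match begins right after the heading.
behind = re.search(rf"(?<={START})[\s\S]+?(?={STOP})", text).group(0)

# Lookahead: the match begins at the heading, so the heading is included.
ahead = re.search(rf"(?={START})[\s\S]+?(?={STOP})", text).group(0)

print(repr(behind))
print(repr(ahead))
```

Either way the match stops just before the next heading, because the closing lookahead does not consume it.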