I want to split my input string at each capital letter, with two exceptions. Exception 1: don't split if the capital letter is preceded by a space or an apostrophe "’" (Euris / d’Investissements). Exception 2: don't split if the capital letter is immediately followed by another capital letter or a "." (S.A.S / SARIS).
This is my sample input text:
“Fonciere EurisCarpinienne De ParticipationEuris Cie Europeenne d’InvestissementsMiramont Finance et Distribution SASociété EurismaSociété SARIS S.A.S.”
Judging by the rules you presented, I would say that you need to split every time a lowercase letter is immediately followed by an uppercase one.
Fonciere EurisCarpinienne De ParticipationEuris Cie Europeenne d’InvestissementsMiramont Finance et DistributionSASociété EurismaSociété SARIS S.A.S.
I used a regex ([a-z][A-Z] for English, [a-zàâäèéêëîïôœùûüÿç][A-ZÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ] for French) to find these occurrences. Next, I add a separator character (e.g. a pipe) between the two letters. The last step is to split the string on the separator, which leaves the individual tokens in an array of strings.
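Here is a minimal sketch of those three steps in Python. The language, the function name `split_on_case_boundary`, and the constant names are my own choices for illustration; the character classes are the French ones from above, and the pipe separator is assumed not to occur in the input.

```python
import re

# Lowercase and uppercase classes (English plus the French accented letters).
LOWER = "a-zàâäèéêëîïôœùûüÿç"
UPPER = "A-ZÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ"

# A lowercase letter immediately followed by an uppercase letter.
BOUNDARY = re.compile(f"([{LOWER}])([{UPPER}])")


def split_on_case_boundary(text, sep="|"):
    """Insert `sep` at each lowercase/uppercase boundary, then split on it.

    Assumes `sep` does not already appear in `text`.
    """
    marked = BOUNDARY.sub(lambda m: m.group(1) + sep + m.group(2), text)
    return marked.split(sep)


if __name__ == "__main__":
    sample = "Fonciere EurisCarpinienne De ParticipationEuris Cie"
    for token in split_on_case_boundary(sample):
        print(token)
```

Because the rule only fires on a lowercase letter followed by an uppercase one, a capital after a space, after "’", or after another capital is left alone. That also means a run like "SASociété" stays in one piece under this rule.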