Regex - clean string from unwanted characters

Yameso · March 22, 2021, 1:36pm

What would be the best regex pattern to keep only digits, letters, spaces, and newlines?

I’m guessing it should be:

System.Text.RegularExpressions.Regex.Replace(myStr, "pattern", "") but I don’t know the right "pattern".

myStr comes from OCR (a scan), so it can contain many unwanted characters such as . , & ^ % and others.

prasath17 · March 22, 2021, 1:40pm

Hi @Yameso, you can try the pattern below.

Regex.Replace(Input, "[\~#%&*{}/:<>?|""-]", "")

Add every character you want to remove inside the square brackets.

songoel · March 22, 2021, 1:42pm

@Yameso,

Use the regex below to keep only digits, letters, spaces, and newlines:
[^A-Za-z0-9\s\r?\n|\r]

This matches any character other than the ones mentioned above. You can replace the matches with an empty string.
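
One detail worth checking in the pattern above: inside square brackets, ? and | are ordinary characters rather than a quantifier or alternation, so a negated class that lists them leaves them in the text. The thread uses .NET; the sketch below is in Python, whose re module treats these two characters the same way inside a class. The sample string is made up for illustration.

```python
import re

# Hypothetical OCR-like sample, not taken from the thread.
sample = "Invoice #42?\nTotal: 10|20 (net)."

# Pattern as posted: '?' and '|' sit inside the brackets, so they count as
# allowed characters and survive the replacement.
as_posted = re.sub(r"[^A-Za-z0-9\s\r?\n|\r]", "", sample)

# Simplified class: \s already covers spaces, \r and \n.
simplified = re.sub(r"[^A-Za-z0-9\s]", "", sample)

print(repr(as_posted))
print(repr(simplified))
```

The simplified class is the form that removes ? and | along with the other punctuation.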

Srini84 · March 22, 2021, 1:44pm

@Yameso

Try the pattern below.

I added : and , to the allowed characters; everything else will be matched.

Use Regex.Replace(yourString, @"[^0-9a-zA-Z:,]+", "")

Hope this helps.

Thanks

songoel · March 22, 2021, 1:45pm

Note: there should be a double backslash before the s, that is, two backslashes before the letter s. The forum automatically collapses them to one when I type the expression here.

Yameso · March 22, 2021, 2:20pm

I think this would work: [^A-Za-z0-9\s]+

But I also need to keep accented letters such as “Ę” and “ę” and the other Polish characters. Is there a way to keep them other than listing each one, like this:

[^A-Za-z0-9\sĘęĄąŻżŹźŁłŃńÓóŚśĆć]+
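
To make the difference between the two patterns above concrete, here is a small Python sketch (the thread uses .NET, but these two character classes behave the same way in Python's re module). The sample sentence is an assumption chosen because it contains every Polish diacritic, and it is assumed to use precomposed characters.

```python
import re

# Hypothetical sample containing all nine Polish diacritics.
sample = "Zażółć gęślą jaźń, 123!"

# ASCII-only class: every accented letter is removed with the punctuation.
ascii_only = re.sub(r"[^A-Za-z0-9\s]+", "", sample)

# Class with the Polish letters listed explicitly: they are kept.
with_polish = re.sub(r"[^A-Za-z0-9\sĘęĄąŻżŹźŁłŃńÓóŚśĆć]+", "", sample)

print(ascii_only)   # Za gl ja 123
print(with_polish)  # Zażółć gęślą jaźń 123
```

Listing letters by hand works for one language, but any accented letter missing from the list is silently dropped.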

Adrian_Star · March 22, 2021, 2:29pm

Check this:
Replace:

System.Text.RegularExpressions.Regex.Replace(your_String,"[^[\p{L}|\p{N}|\s]","")
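
In this pattern, \p{L} matches any Unicode letter and \p{N} any Unicode number, so accented letters are kept without listing them. Note that the extra [ and the | characters sit inside the negated class, where they are literals, so [ and | would also be kept; a tighter form would be [^\p{L}\p{N}\s]. Python's built-in re module has no \p{...} support, so the sketch below mirrors the tighter form with unicodedata categories instead of a regex. The function name and sample are hypothetical, and str.isspace may differ from .NET's \s on rare whitespace characters.

```python
import unicodedata


def keep_letters_digits_whitespace(text):
    """Keep characters in Unicode categories L* or N*, plus whitespace."""
    kept = []
    for ch in text:
        # category() returns codes such as 'Lu', 'Ll', 'Nd', 'Po'.
        category = unicodedata.category(ch)
        if category[0] in ("L", "N") or ch.isspace():
            kept.append(ch)
    return "".join(kept)


# Hypothetical OCR-like sample with brackets, a pipe, and punctuation.
sample = "Zażółć [gęślą] jaźń | 123.\nOK?"
cleaned = keep_letters_digits_whitespace(sample)
print(repr(cleaned))
```

Here the brackets, the pipe, the dot, and the question mark are dropped, while the Polish letters, digits, spaces, and the newline remain.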

songoel · March 22, 2021, 2:29pm

@Yameso,

Here you can use ASCII encoding instead of regex.
This will simply replace any Ę with a plain E, and the same goes for the other accented characters.

songoel · March 22, 2021, 2:37pm

Use this in an Assign activity:
System.Text.Encoding.UTF8.GetString(System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(string.ToString))

Yameso · March 22, 2021, 3:03pm

I used:

System.Text.Encoding.UTF8.GetString(System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(string.ToString))

and this one leaves marks such as " ? * , . ( ) and a few others, and it changes letters such as “Ę” to “E”.
I don’t want to replace accented letters with unaccented ones. I’d like to keep them and remove the other characters.

The pattern [~#%&*{}/:<>?|"-] will not catch something like this:

With [^[\p{L}|\p{N}|\s], those dots are also treated as characters to replace, and the accented letters stay:

Thanks for the help. @Adrian_Star nailed it.
