Regex - clean string from unwanted characters

What would be the best regex pattern to keep only digits, letters, spaces and new lines?

I’m guessing it should be:

System.Text.RegularExpressions.Regex.Replace(myStr, "pattern", "") but I don’t know the right "pattern".

This myStr comes from OCR (a scan), so it can contain a lot of characters like . , & ^ % and many others.

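Hi @Yameso, you can try the pattern below.

Regex.Replace(Input, "[\~#%&*{}/:<>?|""-]", "")

Add every character you want to remove inside the square brackets. The double quote inside the class is doubled because VB.NET strings and C# verbatim strings escape it that way.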

@Yameso,

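Use the Regex below to retain only digits, letters, spaces and new lines:
[^A-Za-z0-9\s]

This matches any character other than the ones listed above, and you can replace each match with an empty string. \s already covers \r and \n, so they do not need to be listed separately; inside square brackets, ? and | are literal characters, so adding \r?\n|\r there would also keep ? and | in the output.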
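As a rough illustration of the negated-class approach, here is a minimal C# sketch. The class name `AsciiCleaner`, the method name `Clean`, and the sample strings are made up for the example; only `Regex.Replace` and the pattern come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiCleaner
{
    // Hypothetical helper: removes every character that is not an
    // ASCII letter, a digit, or whitespace (\s includes \r and \n).
    public static string Clean(string input)
    {
        return Regex.Replace(input, @"[^A-Za-z0-9\s]", "");
    }
}
```

For example, `AsciiCleaner.Clean("Inv#oice, 12.3%")` should give `Invoice 123`. Note that accented letters such as Ę are outside A-Za-z, so this version removes them as well.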

@Yameso

Check the example below

image

I added : and , to the characters that are kept; everything else will be matched and removed

Use Regex.Replace(yourString, @"[^0-9a-zA-Z:,]+", "")

Hope this may help you

Thanks

There should be a double backslash before the s, that is, two \ characters before the letter s. The forum automatically collapses them into one when the expression is typed here.

I think that would work [^A-Za-z0-9\s]+

But I also need to keep letters with accents like “Ę” “ę” and other Polish characters. Is there another way to keep them besides this:

[^A-Za-z0-9\sĘęĄąŻżŹźŁłŃńÓóŚśĆć]+


Check this:
Replace:

System.Text.RegularExpressions.Regex.Replace(your_String, "[^\p{L}\p{N}\s]", "")

image
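To show how Unicode categories handle the Polish letters, here is a small C# sketch. It assumes the class is written as `[^\p{L}\p{N}\s]`, where `\p{L}` is any letter and `\p{N}` is any number in .NET regex; the names `UnicodeCleaner` and `Clean` and the sample text are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeCleaner
{
    // Hypothetical helper: keeps Unicode letters (\p{L}), Unicode
    // numbers (\p{N}) and whitespace (\s); removes everything else.
    public static string Clean(string input)
    {
        return Regex.Replace(input, @"[^\p{L}\p{N}\s]", "");
    }
}
```

With this, `UnicodeCleaner.Clean("Zażółć, gęślą! jaźń?")` should return `Zażółć gęślą jaźń`: the punctuation goes, the accented letters stay, and there is no need to list each Polish character by hand.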


@Yameso,

Here you can use an encoding conversion instead of Regex.
This will simply replace any Ę with a plain E, and the same goes for other accented characters too.

Use it in an Assign:
System.Text.Encoding.UTF8.GetString(System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(myStr.ToString))

I used:

System.Text.Encoding.UTF8.GetString(System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(myStr.ToString))

and this one leaves marks such as " ? * , . ( ) and a few others, and changes letters such as “Ę” to “E”.
I don’t want to replace them with unaccented characters. I’d like to keep them as they are and clean out everything else.

The pattern [~#%&*{}/:<>?|"-] will not catch something like this:

image

With the \p{L} / \p{N} / \s pattern, those dots are also matched for removal, and letters with accents stay:

image

Thanks for the help. @Adrian_Star nailed it :wink:

