Yameso
(Jakub Swiderski)
March 22, 2021, 1:36pm
1
What would be the best regex pattern to keep only digits, letters, spaces and new lines?
I'm guessing it should be:
System.Text.RegularExpressions.Regex.Replace(myStr, "pattern", "") but I don't know the right "pattern".
myStr comes from OCR (a scan), so it can contain a lot of characters like . , & ^ % and many others.
Hi @Yameso, you can try the pattern below.
Regex.Replace(Input, "[\~#%&*{}/:<>?|""-]", "")
Add every character you want to remove inside the square brackets.
songoel
(Sonal)
March 22, 2021, 1:42pm
3
@Yameso ,
Use the below Regex to only retain Digits, Letters, Spaces and NewLines:
[^A-Za-z0-9\s\r?\n|\r]
This matches any character other than the ones mentioned above. You can replace each match with an empty string.
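As a rough illustration of this whitelist approach (a sketch, not the poster's exact expression): `\s` already matches `\r` and `\n`, so a shorter character class is enough to keep ASCII letters, digits, spaces and new lines. The class name `AsciiCleanup`, the method name and the sample text are made up for the example, and C# syntax is an assumption; in a UiPath VB.NET expression the same pattern would be written without the `@` prefix.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiCleanup
{
    // Removes every character that is not an ASCII letter, a digit, or whitespace.
    // \s covers spaces, tabs, \r and \n, so new lines survive the replacement.
    public static string KeepAsciiLettersDigitsWhitespace(string input)
    {
        return Regex.Replace(input, @"[^A-Za-z0-9\s]+", "");
    }

    public static void Main()
    {
        // Hypothetical OCR-like sample text
        string myStr = "Invoice #42: total 100%\nLine 2, more?";
        // Returns "Invoice 42 total 100\nLine 2 more"
        Console.WriteLine(KeepAsciiLettersDigitsWhitespace(myStr));
    }
}
```

Note that this class is ASCII-only, so accented letters such as "Ę" are removed as well.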
Srini84
(Srinivas Kadamati)
March 22, 2021, 1:44pm
4
@Yameso
Try the following.
I added : and , to the characters to keep; everything else will be matched and removed.
Use Regex.Replace(yourString, @"[^0-9a-zA-Z:,]+", "")
Hope this helps.
Thanks
songoel
(Sonal)
March 22, 2021, 1:45pm
5
There should be a double backslash before the s, that is, two backslashes before the letter s. The editor here automatically collapses them to one when the expression is typed.
Yameso
(Jakub Swiderski)
March 22, 2021, 2:20pm
6
songoel:
[^A-Za-z0-9\s\r?\n|\r]
I think this would work: [^A-Za-z0-9\s]+
But I also need to keep letters with accents like "Ę" and "ę" and the other Polish characters. Is there a way to keep them other than this:
[^A-Za-z0-9\sĘęĄąŻżŹźŁłŃńÓóŚśĆć]+
1 Like
Adrian_Star
(Adrian Starukiewicz)
March 22, 2021, 2:29pm
7
Check this:
Replace:
System.Text.RegularExpressions.Regex.Replace(your_String,"[^\p{L}\p{N}\s]","")
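A minimal sketch of the Unicode-category idea, assuming C# syntax and a made-up sample string (the names `UnicodeCleanup` and `KeepLettersDigitsWhitespace` are hypothetical). `\p{L}` matches any Unicode letter, including Polish ones, `\p{N}` any Unicode number, and `\s` whitespace including new lines. Inside a character class, `[` and `|` are literal characters rather than grouping or alternation, so the class below lists only the three categories.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeCleanup
{
    // Removes everything except Unicode letters (\p{L}), Unicode numbers (\p{N})
    // and whitespace (\s, which includes \r and \n).
    public static string KeepLettersDigitsWhitespace(string input)
    {
        return Regex.Replace(input, @"[^\p{L}\p{N}\s]+", "");
    }

    public static void Main()
    {
        // Hypothetical OCR-like sample with Polish letters
        string myStr = "Zażółć gęślą jaźń: 123 zł!\n(OCR) *test*";
        // Returns "Zażółć gęślą jaźń 123 zł\nOCR test"
        Console.WriteLine(KeepLettersDigitsWhitespace(myStr));
    }
}
```

`\p{N}` is broader than 0-9 (it also covers characters such as superscript digits); if only decimal digits should survive, `\p{Nd}` or `0-9` could be used in its place.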
1 Like
songoel
(Sonal)
March 22, 2021, 2:29pm
8
@Yameso ,
Here you can use ASCII encoding instead of Regex.
This will simply replace any Ę with a plain E, and the same goes for the other accented characters.
songoel
(Sonal)
March 22, 2021, 2:37pm
9
Use this in an Assign activity:
System.Text.Encoding.UTF8.GetString(System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(string.ToString))
Yameso
(Jakub Swiderski)
March 22, 2021, 3:03pm
11
I used:
System.Text.Encoding.UTF8.GetString(System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(string.ToString))
It leaves marks such as " ? * , . ( ) and a few others, and it changes letters such as "Ę" to "E".
I don't want accented letters replaced with unaccented ones. I'd like to keep them as they are and remove the other characters.
The pattern [~#%&*{}/:<>?|"-] will not catch something like this:
With this one [ those dots are also treated as characters to replace, and letters with accents stay:
Thanks for the help. @Adrian_Star nailed it.
1 Like