Regex can't match an unusual dash symbol

Hi,

Can anybody help me identify the symbol below? I believe it is not a regular hyphen.
The input comes from PDF extraction.

Input Data

Block 15 PLOT 166- PLOT 177 (LOT 72322 — LOT 72333) = 12 units

My issue is the symbol between the two lot numbers in (LOT 72322 — LOT 72333).
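One way to find out exactly which character this is would be to print the code point and Unicode name of every non-ASCII character in the extracted text. The snippet below is only a sketch in Python: the helper name `describe_non_ascii` is made up for illustration, and the sample line is retyped with `\u2014` on the assumption (based on the later reply in this thread) that the symbol is U+2014.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Sample line retyped by hand; \u2014 is assumed to be the symbol in question.
sample = "Block 15 PLOT 166- PLOT 177 (LOT 72322 \u2014 LOT 72333) = 12 units"

found = describe_non_ascii(sample)
for ch, code_point, name in found:
    print(ch, code_point, name)
```

Running the same check on the real extracted string shows which dash the PDF actually produced, so the regex or replace can target that exact character.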

If the input has a normal hyphen, the regex below gets the output.

(Block [\d]+) PLOT ([\d]+)- PLOT ([\d]+)\s(LOT ([\d]+)\s- LOT ([\d]+))\s?=.+?(\d+)

Any idea how I can process the above input?

Hi @Abang_Jamuri_Abang_Shoker

Before using Regex, use Replace

input = input.Replace("–", "-")

Regards,


@Abang_Jamuri_Abang_Shoker

The symbol between the two lot numbers in “LOT 72322 — LOT 72333” is not a standard hyphen (-, U+002D) but a longer dash. The usual candidates are the en dash (–, U+2013), which is slightly longer than a hyphen and typically used for ranges of values, and the em dash (—, U+2014), which is longer still. Both are easy to mistake for a hyphen, especially in text extracted from PDFs, where the extractor or OCR (Optical Character Recognition) may emit typographic dashes. Your regex pattern has to account for the dash character the text actually contains instead of the regular hyphen.

You can either replace the character with a regular hyphen before applying the regex, or remove it.
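A rough sketch of the replace-first approach, written in Python rather than as a UiPath expression. The table `DASH_TO_HYPHEN` and the function `normalize_dashes` are illustrative names, not part of any library, and the list of dash characters is my own selection of common look-alikes; extend it if your PDF produces others.

```python
# Map common dash look-alikes to the ASCII hyphen (U+002D).
DASH_TO_HYPHEN = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2212": "-",  # MINUS SIGN
})


def normalize_dashes(text):
    """Replace typographic dashes with a plain hyphen before running a regex."""
    return text.translate(DASH_TO_HYPHEN)


cleaned = normalize_dashes("(LOT 72322 \u2014 LOT 72333)")
print(cleaned)
```

Replacing several dash variants at once avoids having to guess which one the extraction produced. A replace that targets only one of them leaves the text unchanged if the input contains a different one.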


Hi Irtetala,

I tried the replace as you suggested, but it did not work.

Hi Ashokkarale,

How can I replace the en dash using the intxt.replace command?

Thank you.

Hi @Abang_Jamuri_Abang_Shoker ,

Use \u2014 in your regex pattern. Please note that you need to escape the “(” and “)” in the string.

For example:

Block\s[\d]+\sPLOT\s[\d]+\-\sPLOT\s[\d]+\s\(LOT\s[\d]+\s\u2014\sLOT [\d]+\)\s=\s\d+\sunits
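To check the pattern above outside the workflow, here is a small Python sketch. It assumes the extracted line contains ordinary spaces and an em dash (U+2014); the variable names are only for illustration. Python's `re` module accepts the `\u2014` escape in patterns in the same way.

```python
import re

# Pattern copied from the reply above: literal parentheses are escaped,
# and the dash between the LOT numbers is written as \u2014 (em dash).
pattern = re.compile(
    r"Block\s[\d]+\sPLOT\s[\d]+\-\sPLOT\s[\d]+\s"
    r"\(LOT\s[\d]+\s\u2014\sLOT [\d]+\)\s=\s\d+\sunits"
)

with_em_dash = "Block 15 PLOT 166- PLOT 177 (LOT 72322 \u2014 LOT 72333) = 12 units"
with_hyphen = "Block 15 PLOT 166- PLOT 177 (LOT 72322 - LOT 72333) = 12 units"

matches_em_dash = pattern.search(with_em_dash) is not None
matches_hyphen = pattern.search(with_hyphen) is not None
print(matches_em_dash, matches_hyphen)
```

Written this way, the pattern accepts only the em dash, so a line with a regular hyphen between the LOT numbers would no longer match.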

Hi sudster,

Thanks for the advice. I changed my existing command following your suggestion and managed to get the output with the command below.

(Block [\d]+) PLOT ([\d]+)- PLOT ([\d]+)\s(LOT ([\d]+)\s[(-||\u2014)] LOT ([\d]+))\s?=.+?(\d+)
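A note on the character class in the command above: inside square brackets, `|` and parentheses are literal characters, and `(-|` is read as a range from `(` to `|`. That range happens to include the hyphen, but it also includes digits and letters, so the class is looser than intended. A tighter variant might look like the Python sketch below. The names `DASH` and `PATTERN` are illustrative, the capture groups are my own choice (block, both plot numbers, both lot numbers, unit count), and the literal parentheses are escaped as suggested earlier in the thread.

```python
import re

# Accept a plain hyphen, an en dash (U+2013) or an em dash (U+2014).
DASH = r"[-\u2013\u2014]"

PATTERN = re.compile(
    r"(Block \d+) PLOT (\d+)\s?" + DASH + r"\s?PLOT (\d+)"
    r"\s\(LOT (\d+)\s?" + DASH + r"\s?LOT (\d+)\)"
    r"\s?=\s?(\d+)"
)

line = "Block 15 PLOT 166- PLOT 177 (LOT 72322 \u2014 LOT 72333) = 12 units"
match = PATTERN.search(line)
groups = match.groups() if match else None
print(groups)
```

With the dash class kept to just the expected characters, the same pattern handles both the hyphenated and the em-dash form of the line without matching unrelated symbols.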
