Problem with Regex

I have a problem with a regex pattern that works fine when I test it on a different website.

"(?<=CAKE:)(.|\n)*?(?=\n[A-Z]{4,})"

Input
CAKE:
ASdlamdsa alökjsdpk epwm wöemr
öl asdpo pöqweo pqwe qpwme öl
ölqew påö ölkqwe ölwkmqeöl.
JNCR

Output
ASdlamdsa alökjsdpk epwm wöemr
öl asdpo pöqweo pqwe qpwme öl
ölqew påö ölkqwe ölwkmqeöl.

But when I run this in Studio using the Matches activity, I always get this message: Object reference not set to an instance of an object.

This is the code I am using to get the value from the match:

ienOutTest(0).Value.ToString

What am I missing?

Hello

How are you getting your text? From OCR?


I use Word Application Scope and then Read Text.

Hi

Triple check your input/output on the Matches activity.

Make sure the sample above is what is being read from the read text activity.

Sometimes a \n should be swapped with (\r|\n). Maybe try that.
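
To illustrate the line-ending point, here is a minimal sketch in Python. It is only a stand-in for the .NET engine that Studio uses, and the sample text is made up: it assumes the text read from Word uses \r as its only line break, which nobody in this thread has confirmed.

```python
import re

# Pattern from the original post: the lookahead insists on a line feed (\n).
original = re.compile(r"(?<=CAKE:)(.|\n)*?(?=\n[A-Z]{4,})")

# Variant that accepts either a carriage return or a line feed before the capitals.
tolerant = re.compile(r"(?<=CAKE:)[\s\S]*?(?=(?:\r|\n)[A-Z]{4,})")

# Hypothetical sample: same shape as the post, but every line ends with \r only.
text_cr = "CAKE:\rfirst line\rsecond line.\rJNCR\r"

strict_result = original.search(text_cr)     # no \n anywhere, so no match
tolerant_result = tolerant.search(text_cr)   # stops right before "\rJNCR"

print(strict_result)
print(repr(tolerant_result.group(0)))
```

If the strict pattern returns nothing while the tolerant one matches, the line breaks in the input are the likely culprit.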


I have triple-checked.

When I try it, it works.
But I don't get any value from it.

@Anders_Dahl1
When reading Word text, the bell character can cause confusion.

Just check in the debugger whether \a characters occur, and maybe cleanse the text, e.g. with the following approach:
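
The snippet that belonged here is not preserved. As a rough idea of what such a cleanse step can look like, here is a minimal Python sketch (a stand-in, not the poster's actual approach and not Studio code). The helper name `cleanse` and the sample text are hypothetical.

```python
import re

def cleanse(text):
    """Drop bell characters (U+0007); everything else is left untouched."""
    return text.replace("\a", "")

# Hypothetical sample: a bell character sits between the line break and JNCR.
raw = "CAKE:\rsome text.\r\aJNCR\r"
cleaned = cleanse(raw)

pattern = re.compile(r"(?<=CAKE:)[\s\S]*?(?=[\r\n][A-Z]{4,})")

before = pattern.search(raw)      # the bell breaks the lookahead, so no match
after = pattern.search(cleaned)   # matches once the bell is gone

print(before)
print(repr(after.group(0)))
```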


Your pattern works fine on a file.txt but not in a Word document.

Hmm :)

As mentioned, the bell character often occurs when Word text is read in, e.g. by the Read Text activity from the Word package.

However, whenever there are issues, it is recommended to debug and inspect the variables and their content.


But I have debugged the text.

Last time I had another pattern that worked fine but had some limits.
Then you gave me a pattern that works much better, so I want to use it.
Last time the debugger told me that there was a bad symbol.
But this time it only refers to "instance of an object".

You should know that I appreciate your help, @ppr.

We are sure that you have debugged it. But when the issue cannot be solved at your end, we need details (e.g. screenshots from the debugger). Currently the questions are:

  • Is the outcome of the Matches activity null?
  • If so, reduce the pattern, e.g. to (?<=CAKE:)(.|\n)*?, and explore iteratively where it is failing.
  • For this, the Immediate panel is helpful while debugging.
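
The steps above can be sketched as follows. This is a Python stand-in for what you would type into the Immediate panel, not Studio code, and the sample text (with \r line breaks) is an assumption. It builds the pattern up piece by piece and checks for a match before reading a value, since indexing into an empty result is the kind of thing that ends in a null reference error.

```python
import re

# Hypothetical text as it might come back from a Word document (lines end with \r).
text = "CAKE:\rsome text.\rJNCR\r"

# Build the pattern up one piece at a time to see which piece stops matching.
steps = [
    r"(?<=CAKE:)",
    r"(?<=CAKE:)(.|\n)*?",
    r"(?<=CAKE:)(.|\n)*?(?=[A-Z]{4,})",
    r"(?<=CAKE:)(.|\n)*?(?=\n[A-Z]{4,})",
]

results = {}
for pattern in steps:
    results[pattern] = re.search(pattern, text) is not None
    print(f"matched={results[pattern]}  {pattern}")

# Guard before reading the value instead of assuming a first match exists.
first = re.search(steps[-1], text)
value = first.group(0) if first else None
print(value)
```

With this sample only the last step fails, which points at the \n in the lookahead rather than at the rest of the pattern.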

These are the first lines, before some of the text:

image

Could this be the problem?

I tried (?<=CAKE:)(.|\n)*? and it works.
So there is something wrong with this part: (?=\n[A-Z]{4,})

I tried removing \n from (?=\n[A-Z]{4,}), and then it works, but it doesn't stop at the next capital letters.

What do you think it could be?

@Anders_Dahl1 - It would be great if you could share a couple of lines before CAKE.


Hi,

In my environment, the text contains \u000B (VT, vertical tab).
So can you try the following pattern?

"(?<=CAKE:)(.|\n)*(?=[\u000A-\u000D][A-Z]{4,})"

Regards,


@Anders_Dahl1 - When I first tried your pattern, the output came back empty even though the .NET Regex Tester (Regex Storm) was showing a match. The pattern below did yield output, so please give it a try.

(?<=CAKE:)[\s\S]+(?=[A-Z]{4,})



Hi,

Have you tried opening a Word document and then running your pattern?

I use Word Application Scope,
and there is something strange when you use the Word file and "\n".

I tried your pattern in Word; I don't get any error message, but it does not stop at the CAPITAL LETTERS.

Hi,

Have you tried opening a Word document and then running your pattern?

Yes, as follows.

I tried your pattern in Word; I don't get any error message, but it does not stop at the CAPITAL LETTERS.

Doesn’t your text end with “JNCR”? If more words follow it, the following pattern might be better.

"(?<=CAKE:)(.|\n)*?(?=[\u000A-\u000D][A-Z]{4,})"

If this doesn’t work, can you share your Word file? Dummy data is no problem.
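
For what it's worth, the difference between the two patterns with [\u000A-\u000D] (with and without the ? after the *) can be seen in a small Python sketch. Python is only a stand-in for .NET here, and the sample text is invented: it has a vertical tab before JNCR and more text with a second all-caps line after it.

```python
import re

# Hypothetical text: a vertical tab (\u000b) before JNCR, then more text and ABCD.
text = "CAKE:\nfirst line\nsecond line.\u000bJNCR\nmore text\nABCD\n"

# Greedy: (.|\n)* grabs as much as possible, so it runs to the LAST all-caps line.
greedy = re.search(r"(?<=CAKE:)(.|\n)*(?=[\u000A-\u000D][A-Z]{4,})", text)

# Lazy: (.|\n)*? grabs as little as possible, so it stops at the FIRST all-caps line.
lazy = re.search(r"(?<=CAKE:)(.|\n)*?(?=[\u000A-\u000D][A-Z]{4,})", text)

print(repr(greedy.group(0)))
print(repr(lazy.group(0)))
```

The range [\u000A-\u000D] covers line feed, vertical tab, form feed and carriage return, so the lookahead does not depend on which of these Word happens to produce.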

Regards,


Hi again,

The pattern you gave me works fine when I move the text to another Word document.
So there is something wrong with the original data.

I can't share the data because of privacy; otherwise I would have done it.

Do you think there could be a workaround?

Hi,

Perhaps we should check the original Word file in detail.

I’ll share a small tool that shows each character code in a string, as follows.

Sample20210202-1.zip (12.5 KB)

It outputs a text file like the following. Can you check which code number appears before JNCR?
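
The attached tool is not reproduced here. The same idea can be sketched in a few lines of Python; this is not the contents of the zip file, and the function name `dump_codes` and the sample string are hypothetical.

```python
def dump_codes(text):
    """Return one line per character: index, code point, and a printable form."""
    lines = []
    for index, ch in enumerate(text):
        # repr() makes invisible characters such as \r, \x0b or \x07 visible.
        lines.append(f"{index}\tU+{ord(ch):04X}\t{ch!r}")
    return lines

# Hypothetical sample: the characters right before and at the start of JNCR.
sample = "\r\x0bJ"
for line in dump_codes(sample):
    print(line)
```

Running this over the text read from the original Word file should show exactly which control character sits in front of JNCR.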

Regards,
