The scenario is as follows:
I have an Excel file with several keywords.
I have to find all of these keywords within a lot of CVs. Each of the CVs is completely different.
I simply have to detect that the keyword is within the CV.
What I did:
I used the PDF Read activity, and the whole text is stored in the string “PDFText”.
Then I use a For Each activity that reads each keyword from the Excel file, and inside it I use PDFText.Contains(keyword) to check whether the keyword is there.
This seems fairly simple. But imagine that the keyword is “ERP” and that “PDFText” contains the word “PowerPoint”. If the comparison ignores case, PDFText.Contains(keyword) would return True, because “erP” appears inside “PowerPoint”. This is not what I want. I want to detect “ERP” as a separate word, not as a substring of another word.
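To make the problem concrete, here is a minimal sketch of the false positive in Python. The sample text and keyword list are invented stand-ins for the PDF text and the Excel column. Note that a case-sensitive check would not find “ERP” in “PowerPoint”; the false hit appears once the comparison ignores case.

```python
# Hypothetical sample data standing in for the extracted PDF text
# and the keywords read from the Excel file.
pdf_text = "Skills: PowerPoint, Excel; SAP"
keywords = ["ERP", "SAP"]


def contains_ignore_case(text, keyword):
    """Plain substring test that ignores case (no word-boundary check)."""
    return keyword.lower() in text.lower()


# "ERP" is reported as found, although it only occurs inside "PowerPoint".
hits = {kw: contains_ignore_case(pdf_text, kw) for kw in keywords}
print(hits)
```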
So I think the best solution would be to match the string “PDFText” against a Regex with a given pattern.
The pattern would consist of the keyword, and before and after the keyword there must be either a space, a comma or a semicolon.
Furthermore, if the keyword is at the beginning of the string (or at the beginning of a line?), there will be no space, comma or semicolon before the keyword, so this should also be reflected in the Regex pattern. The same applies if the keyword is at the end of the string (or at the end of a line?).
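One possible way to express this, sketched in Python rather than in the workflow itself: the function names below are hypothetical, and the case-insensitive flag is an assumption about what is wanted. The first pattern uses lookarounds so that the keyword may not be preceded or followed by a letter, digit or underscore, which also covers the start and end of the text. The second pattern is a stricter variant that only accepts a space, comma, semicolon, or a line start/end around the keyword. As far as I know, the same lookaround syntax is accepted by the .NET Regex class, where Regex.Escape plays the role of re.escape.

```python
import re


def has_keyword(text, keyword):
    """True if keyword occurs with no word character directly before or after it."""
    # re.escape keeps keywords such as "C++" from being read as regex syntax.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def has_keyword_strict(text, keyword):
    """Stricter variant: only space, comma, semicolon or a line edge may surround the keyword."""
    pattern = r"(?:^|[ ,;])" + re.escape(keyword) + r"(?=$|[ ,;])"
    # re.MULTILINE makes ^ and $ match at the start and end of every line.
    return re.search(pattern, text, re.IGNORECASE | re.MULTILINE) is not None


print(has_keyword("Skills: PowerPoint, Excel", "ERP"))    # keyword only inside another word
print(has_keyword("ERP, SAP and more", "ERP"))            # keyword at the start of the text
print(has_keyword_strict("SAP;ERP\nother line", "ERP"))   # keyword at the end of a line
```

The lookaround version also accepts cases like “ERP.” or “(ERP)”, which the strict version rejects; which behaviour is preferable depends on how the CVs are written.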
Do you think this is the best way to detect a given keyword as an independent word rather than as a substring of another word?
Could someone please indicate what the exact pattern would be?
Thanks in advance!