Regex to fetch Data from Word Document

Hi Team,

I have a Word document containing dynamic data that consists of the Question Number, Question Type, Question, and the Options available for each question.

Input String -

Q013 - PastCheck: Past Participation Check Single coded

Not back

Have you taken part in any market research related to dairy products in the past 1 month?

Normal

1 Yes  GO TO SCREEN OUT
2 No

Q014 - GENDER: Gender Single coded
Input -

Not back

Please select your gender.

Normal

1 Male  GO TO SCREEN OUT
2 Female

Q015 - AGE: Age Numeric

Not back | Min = 1 | Max = 99

Please input your age.

Q016 - AgeQuota_TH: Age Quota (Thailand) Single coded

Not back | Dummy

HIDDEN QUESTION: AUTOCODE FROM < Age >

Normal

1 Below 25 years old  GO TO SCREEN OUT
2 25 - 29
3 30 - 40
4 41 years old and above  GO TO SCREEN OUT

I need a Regex pattern that will fetch the details below from the above input -

  1. Question Number (starts with ‘Q0’).
  2. Question Type (on the same line as the Question Number, e.g. Single coded, Numeric, etc.).
  3. Question (always on the line after the keyword ‘Not back’).
  4. Options available for that Question (start on the line after the keyword ‘Normal’).

I am attaching a snapshot of the Word document here for a better understanding of the input data -

Hi @DewanjeeS,

The first two are easy and can be extracted without issues.

  1. Question Number [A-Z]{1}\d(?<QuestionNumber>\d{2})

Start matching at a single uppercase letter followed by one digit, then capture the next two digits, which in this case are the question number.
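
For anyone who wants to try the question-number pattern outside a .NET-style regex engine, here is a minimal Python sketch. Python spells named groups `(?P<name>...)` instead of the `(?<name>...)` form used in this thread, and the `sample` string is just a few header lines copied from the input above.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
QUESTION_NUMBER = re.compile(r"[A-Z]\d(?P<QuestionNumber>\d{2})")

sample = (
    "Q013 - PastCheck: Past Participation Check Single coded\n"
    "Q014 - GENDER: Gender Single coded\n"
    "Q015 - AGE: Age Numeric\n"
)

# Collect the two digits captured after the letter and the first digit.
numbers = [m.group("QuestionNumber") for m in QUESTION_NUMBER.finditer(sample)]
print(numbers)
```

Note that the pattern is not anchored to the start of a line, so any uppercase letter followed by three digits elsewhere in the document would also match.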


  2. Question Type [A-Z]{1}\d{1}\d{2}\s-\s.*\s(?<Numeric>Numeric)$|([A-Z]{1}\d{1}\d{2}\s-\s.*\s(?<SingleCoded>\w+\s\w+)$)

    Since the input string has two different types of question, we have to take them one at a time. We first put the cursor at the start of the question line, which always carries the question type at the end. It is important that you know all the question types beforehand, because each type has to be extracted separately. Here I show only the Numeric and Single coded types. Regex cannot be used reliably if the text follows an unknown pattern.

  3. Question: ‘Not back’ will not always come immediately after the question line as you mentioned, so this one is tricky. See the Q014 string: it has an extra string ‘Input -’ and a new line before ‘Not back’. I am not sure how to handle this. Maybe someone who is really good at regex can help you here.

  4. Options (grouped syntax) 1\s(?<Option1>\w.*(\n|(?!\?)))2\s(?<Option2>\w.*)\n|3\s(?<Option3>\w.*)\n|4\s(?<Option4>\w.*)
    We place the cursor at the start of each option's bullet. In this case the bullets are numbered, so we check that the number is followed by a space. Option 1 is the tricky one because your input string contains a question that ends with a similar pattern, “1 month?”, so we have to use a negation (\n|(?!\?)). The rest of the options are quite straightforward: match a word character and continue to the end of the line.
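
One alternative way to sidestep the “1 month?” collision from item 4 is to anchor the option pattern to the start of a line. This is only a sketch in Python, not the pattern above: it assumes every option sits on its own line as a number, some spaces, and the option text, and the names `OPTION_LINE` and `sample` are made up for the illustration.

```python
import re

# Assumption: each option is on its own line as "<number> <text>".
# With re.MULTILINE, ^ and $ match at the start and end of every line.
OPTION_LINE = re.compile(r"^(?P<number>\d+)[ \t]+(?P<text>\S.*)$", re.MULTILINE)

sample = (
    "Have you taken part in any market research related to dairy products "
    "in the past 1 month?\n"
    "\n"
    "Normal\n"
    "\n"
    "1 Yes  GO TO SCREEN OUT\n"
    "2 No\n"
)

# The question line starts with a letter, so its trailing "1 month?" is never matched.
options = [(m.group("number"), m.group("text")) for m in OPTION_LINE.finditer(sample)]
print(options)
```

Because the digit has to be the first thing on the line, no special handling for the question text is needed, and the same pattern works for any number of options.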

I will try and update the explanation of how things work soon in this answer.

Hope this helps!


@jeevith …Thanks for your suggestions.
One thing here: for the 2nd result, the Question Type can be anything, and the formatting of that line can vary as well (but the line will always start with ‘Q0’). I have given only 2 types here for reference (Single coded, Numeric). Can we have a more dynamic pattern for this one as well?

@DewanjeeS - Since @jeevith has already provided a very good regex, I am just sharing my views here…

Since Single coded is 2 words and Numeric is 1 word, I think capturing the type generically is a little difficult (I may be wrong here), so I have added an OR condition… feel free to add any other types that come up…

Group1 = Question#
Group2 = Single coded or Numeric

I guess I already showed how to capture the text between Not Back and Normal in the previous post, right?


@prasath17 … Thanks for your suggestions… Let me try different combinations for the 2nd one and I will let you know whether it works.
For capturing the ‘Question’ part, I initially posted a separate thread, which you mentioned here.
But it was later observed that even though the Question always starts after ‘Not back’, the end identifier keyword ‘Normal’/‘End’ may not always be there (see Q015 for reference: it starts with ‘Not back’, but there is no such end keyword).
Hence I posted it again here along with the other questions…

@DewanjeeS … Got it… Regex would be impossible if the text/string does not follow some pattern (just from my experience).

Please try and let us know what else is pending out of the 4 points you mentioned above…


Hi @DewanjeeS … Please check the regex below… It captures the values for Q015 too… The idea I have used here is: any text between Not Back and (Normal or Q0)…
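
The pattern itself did not survive in the text of this post, so what follows is only a Python sketch of the idea as described (any text between ‘Not back’ and either ‘Normal’ or the next ‘Q0’ line), not the original regex. The name `QUESTION_TEXT` and the stop conditions are assumptions, and it expects plain `\n` line endings.

```python
import re

QUESTION_TEXT = re.compile(
    r"^Not back[^\n]*\n+"                 # the 'Not back' line, incl. '| Min = 1 ...'
    r"(?P<question>.+?)"                  # question text, as short as possible
    r"(?=\n+(?:Normal\b|Q0\d+)|\s*\Z)",   # stop at 'Normal', the next 'Q0..', or the end
    re.MULTILINE | re.DOTALL,
)

sample = (
    "Q014 - GENDER: Gender Single coded\n"
    "Input -\n\n"
    "Not back\n\n"
    "Please select your gender.\n\n"
    "Normal\n\n"
    "1 Male  GO TO SCREEN OUT\n"
    "2 Female\n\n"
    "Q015 - AGE: Age Numeric\n\n"
    "Not back | Min = 1 | Max = 99\n\n"
    "Please input your age.\n\n"
    "Q016 - AgeQuota_TH: Age Quota (Thailand) Single coded\n"
)

questions = [m.group("question") for m in QUESTION_TEXT.finditer(sample)]
print(questions)
```

Since the match starts at ‘Not back’ itself, the extra ‘Input -’ line in Q014 does not get in the way, and Q015 is handled by the ‘Q0’ alternative because it has no ‘Normal’ line.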


Hi @DewanjeeS,

I agree with @prasath17: (Question 2) it is impossible to look for characters/words using Regex without knowing which characters are possible in your input string.

Why is it impossible?
Although the question type is usually on the same line as Q0, the type itself can be a single word, two words, or any number of words. Without knowing this, regex can never know what to look for. In addition, there are no unique separator characters that could help anchor the search either.

QuestionType possible text: 
Numeric 
Single Code
Foo Bar Bar 
Foo Foo Bar Bar  

So you will either need to be given a list of possible question types to search for or compile one manually. Then tweak the regex expressions in this thread to match those terms.
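
As a rough illustration of the list-driven approach, here is a Python sketch that builds the alternation from a list instead of hard-coding it. The `KNOWN_TYPES` list and the `Q020` line with an unlisted type are made up for the example; Python uses `(?P<name>...)` for named groups.

```python
import re

# Hypothetical list; extend it with every question type the documents can contain.
KNOWN_TYPES = ["Single coded", "Multi coded", "Numeric", "Text"]

# re.escape keeps any special characters in a type name literal.
alternation = "|".join(re.escape(t) for t in KNOWN_TYPES)
HEADER = re.compile(
    rf"^(?P<number>Q0\d+)[ \t]-[ \t].*?[ \t](?P<type>{alternation})[ \t]*$",
    re.MULTILINE,
)

sample = (
    "Q013 - PastCheck: Past Participation Check Single coded\n"
    "Q015 - AGE: Age Numeric\n"
    "Q020 - Other: Some Question Foo Bar\n"   # made-up line with an unlisted type
)

found = [(m.group("number"), m.group("type")) for m in HEADER.finditer(sample)]
print(found)
```

The `Q020` line is skipped because ‘Foo Bar’ is not in the list, which is exactly the limitation described above: a type that is not listed cannot be found.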


@prasath17 and @jeevith … Thanks a ton to both of you for looking into this one and providing your valuable inputs.

I am able to fetch the following as of now -

  1. Question Number
  2. Question Type

For 3) Question - I will check with the concerned team whether we can get a fixed end identifier key, to which I can apply the fix @prasath17 provided in one of the earlier posts.

The only thing remaining is 4) Options (always start on the line after the keyword ‘Normal’). Is there a way we can fetch those as well?

This should get you all the options. If there are only two options, it returns only two matches; if there are 4 options, it matches all of them.

You could also extend this to more options if some questions have more than 4 answers.

.....\n|5\s(?<Option5>\w.*) etc. Here \n moves to the next line; the pattern then looks for the number 5 and, after a space character, captures everything up to the end of the line in a group named Option5.
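
If the number of options varies a lot, adding one named group per option gets tedious. A possible alternative, sketched in Python under the assumption that each question block starts at a line beginning with ‘Q0’, that options follow a line containing only ‘Normal’, and that line endings are plain `\n`, is to collect the option lines per question. The function name `options_by_question` is made up for this example.

```python
import re

def options_by_question(text):
    """Map each question number to the option lines found after 'Normal'."""
    result = {}
    # Split into blocks, each starting at a line that begins with 'Q0<digits>'.
    for block in re.split(r"(?m)^(?=Q0\d+)", text):
        header = re.match(r"Q0\d+", block)
        if header is None:
            continue  # text before the first question, or an empty piece
        # Keep only what follows a line consisting of 'Normal', if there is one.
        _, marker, tail = block.partition("\nNormal\n")
        found = re.findall(r"(?m)^\d+[ \t]+(\S.*)$", tail) if marker else []
        result[header.group()] = found
    return result

sample = (
    "Q013 - PastCheck: Past Participation Check Single coded\n\n"
    "Not back\n\n"
    "Have you taken part in any market research related to dairy products "
    "in the past 1 month?\n\n"
    "Normal\n\n"
    "1 Yes  GO TO SCREEN OUT\n"
    "2 No\n\n"
    "Q015 - AGE: Age Numeric\n\n"
    "Not back | Min = 1 | Max = 99\n\n"
    "Please input your age.\n"
)

result = options_by_question(sample)
print(result)
```

Questions without a ‘Normal’ line, such as Q015, simply come back with an empty list.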

@jeevith @prasath17 … One more query about the above input pattern -

Is there a way we can get the Variable ID of each question?
The pattern for it is as below -

Question Number - Variable ID: *******Question Type

I am able to fetch the Question Number and Question Type with the pattern @prasath17 shared above, i.e. ([qQ]\d{1,}).*(Single coded|Numeric|Multi coded|Matrix|Shared list|Text)

The Variable IDs from the above input text are -

Q013 - PastCheck
Q014 - GENDER
Q015 - AGE
Q016 - AgeQuota_TH

@DewanjeeS - Here you go…
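
The pattern posted here is not preserved in the text, so this is only a Python sketch of one way to do it, not the original answer. It assumes the Variable ID is whatever sits between the ‘ - ’ after the question number and the first colon on the same line; `HEADER` and `sample` are names made up for the illustration.

```python
import re

# Assumption: header lines look like "<Q0nn> - <Variable ID>: <rest of line>".
HEADER = re.compile(
    r"^(?P<number>Q0\d+)[ \t]-[ \t](?P<variable>[^:\n]+):",
    re.MULTILINE,
)

sample = (
    "Q013 - PastCheck: Past Participation Check Single coded\n"
    "Q014 - GENDER: Gender Single coded\n"
    "Q015 - AGE: Age Numeric\n"
    "Q016 - AgeQuota_TH: Age Quota (Thailand) Single coded\n"
)

pairs = [(m.group("number"), m.group("variable")) for m in HEADER.finditer(sample)]
print(pairs)
```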


@prasath17 … Thanks a ton Man!!

