Extract specific data from Word

Hi all,

I need to extract some bullet points from a Word document wherever the text [KEY] appears.

I also need to extract the header for each of those bullet points. The documents are usually 5 to 10 pages long, and each one can contain 3 to 4 bullet points that I need to extract along with their headers. I have attached two files: the first is the actual data and the second is the output I am looking for.


Hi @Krithi1 ,

  • Word Application Scope (specify file path)
    └ Assign activity: documentText = Extracted Text
  • Assign activity: lines = documentText.Split(Environment.NewLine.ToCharArray)
  • For Each activity (iterate through lines)
    └ If condition: Regex.IsMatch(line, "Heading Pattern")
    └ Assign activity: currentHeader = line
    └ If condition: line.Contains("[KEY]")
    └ Add to Dictionary (header: currentHeader, keyPoint: line)
  • Output: Write dictionary or data table to Excel/CSV
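
For anyone who wants to try the logic outside UiPath first, the outline above can be sketched in Python. This is only an illustration, not the UiPath workflow itself: the heading rule (any non-empty line that does not start with a bullet) and the function name extract_key_points are assumptions, standing in for the "Heading Pattern" placeholder.

```python
import re

# Assumption: a heading is any non-empty line that does not start with "•".
# Replace this with the real "Heading Pattern" for your documents.
HEADING_PATTERN = re.compile(r"^(?!•)\S.*")


def extract_key_points(document_text):
    """Return (header, bullet) pairs for every bullet line containing [KEY]."""
    results = []
    current_header = None
    for raw_line in document_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if HEADING_PATTERN.match(line):
            current_header = line  # remember the most recent heading
        elif "[KEY]" in line:
            # Drop the leading bullet symbol and keep the text
            results.append((current_header, line.lstrip("•").strip()))
    return results


sample = (
    "Project Alpha\n"
    "• [KEY] Budget approved\n"
    "• Kickoff held\n"
    "\n"
    "Project Beta\n"
    "• Vendor selected\n"
    "• [KEY] Contract signed\n"
)
pairs = extract_key_points(sample)
for header, bullet in pairs:
    print(header, "->", bullet)
```

Because the header is tracked line by line, a [KEY] bullet is paired with its section heading even when other bullets come before it.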

Could you please try this and let me know whether it works?

Hey @Krithi1
You can achieve this by reading the Word content with the Read Text activity and applying a regular expression that extracts only the bullet points containing [KEY], along with their corresponding headers.

I’ve attached the project and a screenshot with a sample output. It works even if multiple [KEY] points appear under the same header.
Regex expression:
System.Text.RegularExpressions.Regex.Matches(myTxt, "(?<Header>^[^\r\n]+)\s*(?:\r?\n)+(?<Block>(\s*•\s*\[KEY\].*(?:\r?\n\s*•\s+.*)*))", System.Text.RegularExpressions.RegexOptions.Multiline)
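
To experiment with this pattern outside UiPath, here is a rough Python translation. It is a sketch under assumptions: Python writes named groups as (?P<Name>...) instead of the .NET form (?<Name>...), the function name find_key_blocks is made up for illustration, and the sample text uses plain \n line endings.

```python
import re

# Same pattern as above, with .NET named groups rewritten for Python.
PATTERN = re.compile(
    r"(?P<Header>^[^\r\n]+)\s*(?:\r?\n)+"
    r"(?P<Block>(\s*•\s*\[KEY\].*(?:\r?\n\s*•\s+.*)*))",
    re.MULTILINE,
)


def find_key_blocks(text):
    """Return (header, [bullet lines]) for each match of the pattern."""
    blocks = []
    for match in PATTERN.finditer(text):
        bullets = [line.strip() for line in match.group("Block").splitlines()]
        blocks.append((match.group("Header").strip(), bullets))
    return blocks


sample = (
    "Project Alpha\n"
    "• [KEY] Budget approved\n"
    "• [KEY] Vendor selected\n"
    "\n"
    "Project Beta\n"
    "• [KEY] Contract signed\n"
)
matches = find_key_blocks(sample)
for header, bullets in matches:
    print(header, bullets)
```

One thing to keep in mind with this translation: the Header group is simply the line directly above the first [KEY] bullet. If a [KEY] bullet follows a bullet without [KEY], that preceding bullet is captured as the header instead of the section title.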


Result:

Project to download:
BlankProcess15.zip (32.2 KB)

@pikorpa

Can you please let me know what these highlighted “dots” are? I have highlighted them in yellow. Are they regular “.” characters or something else?

Hey @Krithi1
Those yellow dots are actually bullet points (•) in the original Word document. In the regex pattern, • is matched literally to identify bullet points that start with [KEY].
The expression looks for the section header, and then captures one or more bullet points that contain [KEY], along with any subsequent indented bullets as part of the same block.
So, yes - it’s a specific bullet character, and not interchangeable with a dot.
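
If it is unclear which character a symbol really is, printing its Unicode code point settles it. The snippet below is a small Python sketch (the helper name describe_chars is made up for illustration) that compares the bullet with an ordinary dot.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


info = describe_chars("•.")
for ch, code, name in info:
    print(ch, code, name)
# • U+2022 BULLET
# . U+002E FULL STOP
```

Running the same helper on a line copied from the extracted text shows whether the bullet in the document is the same character as the one in the pattern.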

Hey @Krithi1
Just checking in to see if the solution worked for your case - were you able to extract the bullet points along with the headers?
If everything looks good, feel free to mark the post as a solution so others can benefit too. And if anything’s still unclear, please ask :slight_smile:

@pikorpa

I am still not able to solve this, so I haven't marked it as the solution yet.

Attached are screenshots of my output and the regex expression. Can you please check what I could be doing wrong?

It does not go into the for loop after the regex expression.

The output is not completely captured because the Output window has a size limit, but it's exactly as I mentioned earlier.


@Krithi1
If my project didn't work, it’s possible that something in the structure of your Word document is affecting the match.

If possible, could you please share the actual Word file you’re working with? That would help. I’ll check it directly and adjust the pattern to make it work exactly for your case.

I can't share the actual document due to security reasons. Attached is a screenshot of the PDF that I am trying to use.

Because I am not able to preserve the formatting when reading from Word, I am first saving the file as a PDF and then reading the text from the PDF. That's why I am attaching the PDF screenshot.

@pikorpa

@Krithi1
Thanks for the screenshot - but it’s really tricky to prepare the correct regex just based on the image. The formatting or invisible characters might cause the issue.
Would you be able to create a sample test file, for example a Word or PDF file with dummy data, but keeping the same structure, bullets, and indentation?
That way I can analyze the real extracted content and help build a working regex or extraction logic.

@pikorpa

Attached is the Word document. But as I stated in my earlier post, I am saving it as a PDF to preserve the formatting before reading the text.

Please let me know if you need anything else. Thank you for looking into this.

Project Updates.docx (15.6 KB)

@pikorpa

I wanted to check whether you have had a chance to look into the file I sent. I appreciate all your help.

Hey @Krithi1
Sure. Sorry I didn’t have time to look at it today. I’ll try later or tomorrow. I’ll let you know.

Hey @Krithi1
This turned out to be a slightly more complex task than it seemed at first look :slight_smile:

The main issue was that when reading the Word document directly, all the formatting and bullet structure were lost. So, I saved the Word file as a PDF, and then read the content as plain text - that way, all the bullet symbols like , , and o were preserved.

From there, I built a logic that:

  • Detects main headers based on the symbol,
  • Finds sub-headers marked with [KEY],
  • Collects bullet points that belong to those [KEY] headers,
  • Organizes everything into a nested dictionary structure,
  • And finally, exports it all into a JSON file for easier consumption.
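
The steps above can be sketched in Python as follows. This is not the attached UiPath project, only an illustration of the same idea. The marker symbols are assumptions: "■" for main headers and "•" for sub-headers are placeholders (the real symbols depend on what the PDF text extraction returns), and "o" is assumed for the lowest-level bullets. The function name build_key_structure is also made up.

```python
import json


def build_key_structure(text, main_marker="■", sub_marker="•", bullet_marker="o"):
    """Nest bullets under [KEY] sub-headers, and those under main headers."""
    result = {}
    main_header = None
    key_header = None  # stays None while the current sub-header has no [KEY]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        marker, _, rest = line.partition(" ")
        rest = rest.strip()
        if marker == main_marker:
            main_header, key_header = rest, None
        elif marker == sub_marker:
            # Only sub-headers marked with [KEY] start a collected section
            key_header = rest if "[KEY]" in rest else None
            if key_header is not None and main_header is not None:
                result.setdefault(main_header, {})[key_header] = []
        elif (marker == bullet_marker
              and key_header is not None
              and main_header is not None):
            result[main_header][key_header].append(rest)
    return result


sample = (
    "■ Project Alpha\n"
    "• [KEY] Budget\n"
    "o Approved in Q1\n"
    "o Covers two teams\n"
    "• Timeline\n"
    "o Kickoff in March\n"
    "■ Project Beta\n"
    "• [KEY] Risks\n"
    "o Vendor delay\n"
)
structure = build_key_structure(sample)
# ensure_ascii=False keeps any non-ASCII text readable in the JSON output
json_text = json.dumps(structure, ensure_ascii=False, indent=2)
print(json_text)
```

Passing the markers as parameters keeps the sketch adjustable if a document uses different bullet symbols at each level.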

The solution works correctly for the example you provided, and will work for documents that follow the same structure. If your future documents change layout or structure significantly, you might need to adjust the workflow a bit.

Hope it works well for you – I spent a fair amount of time getting everything just right :slight_smile:

Project:
BlankProcess15 (2).zip (221.6 KB)

@pikorpa

Thank you so much for working on this. I will try it out and let you know.

Hey @Krithi1
Did you find a solution to your issue in the project I posted above?