I need to extract some bullet points from a Word document, specifically the ones that contain the text [KEY].
I also need to extract the header for each of those bullet points. The documents are usually 5 to 10 pages long, and each one could have 3 to 4 bullet points that I need to extract along with their headers. I have attached two files: the first is the actual data and the second is the output I am looking for.
For Each activity (iterate through lines)
└ If condition: Regex.IsMatch(line, "Heading Pattern")
   └ Assign activity: currentHeader = line
└ If condition: line.Contains("[KEY]")
   └ Add to Dictionary (header: currentHeader, keyPoint: line)
Output: Write dictionary or data table to Excel/CSV
Could you please try this and let me know if this solution works?
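For anyone who wants to prototype the loop above outside UiPath first, here is a minimal Python sketch of the same idea. The heading pattern (a number, a dot, then a title) and the sample lines are assumptions made up for illustration, so swap in whatever matches your documents. It collects rows in a list rather than a dictionary keyed by header, so that several [KEY] points under one header are all kept.

```python
import re

# Assumption: headings look like "1. Introduction". Adjust to your documents.
HEADING_PATTERN = re.compile(r"^\d+\.\s+\S")


def extract_key_points(lines):
    """Return one row per [KEY] line, paired with the most recent heading."""
    rows = []
    current_header = None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if HEADING_PATTERN.match(text):
            current_header = text  # remember the latest heading seen
        elif "[KEY]" in text:
            rows.append({"header": current_header, "keyPoint": text})
    return rows


# Hypothetical sample input, standing in for the lines read from the document.
sample = [
    "1. Introduction",
    "• [KEY] First important point",
    "• Ordinary point",
    "2. Scope",
    "• [KEY] Second important point",
]
rows = extract_key_points(sample)
for row in rows:
    print(row["header"], "|", row["keyPoint"])
```

Each row maps directly onto one line of an Excel/CSV output with a header column and a key point column.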
Hey @Krithi1
You can achieve this by reading the Word content with the Read Text activity and applying a regular expression to extract only the bullet points containing [KEY], along with their corresponding headers.
I’ve attached the project and a screenshot with a sample output. It works even if multiple [KEY] points appear under the same header.
Regex expression: System.Text.RegularExpressions.Regex.Matches(myTxt, "(?<Header>^[^\r\n]+)\s*(?:\r?\n)+(?<Block>(\s*•\s*\[KEY\].*(?:\r?\n\s*•\s+.*)*))", System.Text.RegularExpressions.RegexOptions.Multiline)
Hey @Krithi1
Those yellow dots are actually bullet points (•) in the original Word document. In the regex pattern, • is matched literally to identify bullet points that start with [KEY].
The expression looks for the section header, and then captures one or more bullet points that contain [KEY], along with any subsequent indented bullets as part of the same block.
So, yes - it’s a specific bullet character, and not interchangeable with a dot.
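To see what the expression captures, it can be tried outside UiPath. Below is a small Python sketch of the same pattern; the sample text is invented for illustration, and the only changes to the pattern are Python's `(?P<Name>...)` spelling for named groups and one redundant pair of parentheses dropped.

```python
import re

# Same pattern as the .NET expression above, using Python's named-group syntax.
KEY_BLOCK = re.compile(
    r"(?P<Header>^[^\r\n]+)\s*(?:\r?\n)+"
    r"(?P<Block>\s*•\s*\[KEY\].*(?:\r?\n\s*•\s+.*)*)",
    re.MULTILINE,
)

# Hypothetical sample text, standing in for the output of Read Text.
text = (
    "Project Overview\n"
    "• [KEY] Budget approved\n"
    "• Follow-up detail\n"
    "Team\n"
    "• Not a key point\n"
    "Timeline\n"
    "• [KEY] Launch in Q3"
)

# One (header, list of bullet lines) pair per match.
results = [
    (m.group("Header"), m.group("Block").splitlines())
    for m in KEY_BLOCK.finditer(text)
]
for header, bullets in results:
    print(header, bullets)
```

Two things are worth noting from this sketch: the pattern treats whatever line sits directly above the [KEY] bullet as the header, and every bullet that follows the [KEY] bullet in the same run is included in the block, whether or not it contains [KEY] itself.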
Hey @Krithi1
Just checking in to see if the solution worked for your case - were you able to extract the bullet points along with the headers?
If everything looks good, feel free to mark the post as a solution so others can benefit too. And if anything’s still unclear, please ask.
@Krithi1
If my project did not work, it’s possible that something in the structure of your Word document is affecting the match.
If possible, could you please share the actual Word file you’re working with? That would help. I’ll check it directly and adjust the pattern to make it work exactly for your case.
@Krithi1
Thanks for the screenshot - but it’s really tricky to prepare the correct regex based on the image alone. The formatting or invisible characters might be causing the issue.
Would you be able to create a sample test file, for example a Word or PDF file with dummy data, but keeping the same structure, bullets, and indentation?
That way I can analyze the real extracted content and help build a working regex or extraction logic.
Hey @Krithi1
This turned out to be a slightly more complex task than it seemed at first look.
The main issue was that when reading the Word document directly, all the formatting and bullet structure were lost. So, I saved the Word file as a PDF, and then read the content as plain text - that way, all the bullet symbols like ➢, •, and o were preserved.
From there, I built a logic that:
Detects main headers based on the ➢ symbol,
Finds sub-headers marked with [KEY],
Collects bullet points that belong to those [KEY] headers,
Organizes everything into a nested dictionary structure,
And finally, exports it all into a JSON file for easier consumption.
The solution works correctly for the example you provided, and will work for documents that follow the same structure. If your future documents change layout or structure significantly, you might need to adjust the workflow a bit.
Hope it works well for you – I spent a fair amount of time getting everything just right
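For readers without the attached project, the steps listed above can be sketched in Python roughly as follows. This is not the attached workflow, only an illustration under assumptions: main headers start with ➢, [KEY] sub-headers are • lines, their child bullets start with "o ", and any other line ends the current [KEY] block. The sample lines are invented.

```python
import json


def build_structure(lines):
    """Nest bullets as {main header: {[KEY] sub-header: [bullets]}}."""
    result = {}
    main = None        # current ➢ header
    key_header = None  # current [KEY] sub-header, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("➢"):
            main = line.lstrip("➢ ").strip()
            key_header = None
        elif "[KEY]" in line and main is not None:
            key_header = line.lstrip("• ").strip()
            # Main headers are only added once they have a [KEY] sub-header.
            result.setdefault(main, {})[key_header] = []
        elif line.startswith("o ") and key_header is not None:
            result[main][key_header].append(line[2:].strip())
        else:
            key_header = None  # any other line closes the [KEY] block
    return result


# Hypothetical sample, standing in for the plain text read from the PDF.
sample = [
    "➢ Planning",
    "• [KEY] Risks",
    "o Vendor delay",
    "o Budget overrun",
    "• Notes",
    "o Not captured",
    "➢ Delivery",
    "• Status",
    "o Also not captured",
]
structure = build_structure(sample)

# ensure_ascii=False keeps symbols and accented characters readable in the JSON.
json_text = json.dumps(structure, ensure_ascii=False, indent=2)
print(json_text)
```

Writing `json_text` to a file gives the JSON export; the nested layout keeps each bullet attached to both its [KEY] sub-header and its main header.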