Regex to extract Table of content items using pattern

Sairam_RPA · June 22, 2023, 7:41pm

Hi, I have a Word document of about 100 pages.

One of the pages contains a table of contents like the one below.

Table of contents

Cover Page_______________________________ 1

Contents___________________________________ 3

Sites____________________________________ 4

Information______________ 5

(Description)__________________ 6

Narrative____________________________________ 7

Cited_______________________ 8

Resources_________________________ 13

Equipment________________________________________ 17

Attachments_________________________________ 20

Animals ____________________ 20

I need to get all the values and their associated page numbers. Can we use regex to pull these values and page numbers?

The values are not fixed; they change from document to document.

postwick · June 22, 2023, 7:47pm

Don’t really need to use RegEx, although you could.

You could just split on VbCrLf to get it all into an array. Then run a For Each over the array and split each line on _, taking the first element as the title and the last element as the page.
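For illustration, the split-based idea could be sketched as follows. This is a minimal Python sketch rather than the UiPath/VB.NET activities discussed in the thread; the helper name `parse_toc_line` and the sample text are hypothetical, and it assumes the table of contents text has already been isolated and uses CRLF line breaks.

```python
def parse_toc_line(line):
    """Split one TOC line on "_" and return (title, page) as strings."""
    parts = line.split("_")
    title = parts[0].strip()   # first element: text before the underscores
    page = parts[-1].strip()   # last element: text after the underscores
    return title, page


# Hypothetical sample, already limited to the table of contents.
toc_text = "Cover Page_______ 1\r\nContents_________ 3\r\nAnimals ____ 20"

# "\r\n" plays the role of VbCrLf here; blank lines are skipped.
entries = [parse_toc_line(line) for line in toc_text.split("\r\n") if line.strip()]
print(entries)
```

Splitting on a single underscore produces many empty strings in the middle of the list, which is why only the first and last elements are used.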

Sairam_RPA · June 22, 2023, 7:52pm

The issue is that it is a 100-page Word document, and I do not know where this text appears in it. Hence the need for regex.

postwick · June 22, 2023, 7:53pm

Isn’t there a way to determine where it starts and ends using other text such as headers (i.e. a Contents header before it, and some other header on the next section after it)?

Sairam_RPA · June 22, 2023, 7:58pm

It starts with Table of contents as the heading but it does not have a defined footer or ending text.

postwick · June 22, 2023, 7:59pm

What about the next section, does it have a consistent header?

postwick · June 22, 2023, 8:01pm

It’s a little bit of an ugly way to do it but you could read the entire text of the document, split on VbCrLf, and then loop through it. If the current array value contains “___” then keep it (put it into another array). Then on the resulting array you go through it and split each value up by _ to get the first item (title) and last item (page).
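As a rough sketch of this filter-then-split approach (in Python rather than UiPath activities; the function name `extract_toc_candidates` and the sample document text are hypothetical):

```python
def extract_toc_candidates(document_text):
    """Keep every line containing "___" and split it into (title, page)."""
    # First pass: keep only the lines that contain a run of underscores.
    kept = [line for line in document_text.splitlines() if "___" in line]

    # Second pass: first piece is the title, last piece is the page.
    results = []
    for line in kept:
        parts = line.split("_")
        results.append((parts[0].strip(), parts[-1].strip()))
    return results


# Hypothetical document text with a heading, two entries and body text.
sample = "Table of contents\n\nSites_______ 4\n\nNarrative_________ 7\n\nSome body text\n"
print(extract_toc_candidates(sample))
```

Note that this keeps any line in the document that contains "___", whether or not it belongs to the table of contents.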

Sairam_RPA · June 22, 2023, 8:06pm

Yes, but it will take a long time to loop through the whole text of a 100-page document.

I felt that regex would do it faster.

postwick · June 22, 2023, 8:10pm

You don’t have to loop through the whole thing. As soon as you get to an array item that does not contain “___” you know you’ve gotten to the end of the table of contents and can break out of the loop.
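A minimal Python sketch of the early-exit loop, under two assumptions that are not spelled out in this reply: scanning starts at the "Table of contents" heading mentioned earlier in the thread, and blank lines between entries are skipped rather than treated as the end. The function name `extract_toc_after_heading` and the sample text are hypothetical.

```python
def extract_toc_after_heading(document_text, heading="Table of contents"):
    """Collect (title, page) pairs after the heading; stop at the first non-TOC line."""
    entries = []
    in_toc = False
    for line in document_text.splitlines():
        stripped = line.strip()
        if not in_toc:
            # Ignore everything until the heading is found.
            if stripped.lower() == heading.lower():
                in_toc = True
            continue
        if not stripped:
            continue  # assumption: blank lines between entries are skipped
        if "___" not in stripped:
            break  # first real line without underscores ends the table
        parts = stripped.split("_")
        entries.append((parts[0].strip(), parts[-1].strip()))
    return entries


# Hypothetical document with underscore runs before and after the table.
doc = (
    "Intro\nSign here ______\nTable of contents\n\n"
    "Cover Page_____ 1\n\nSites____ 4\n\n"
    "Chapter one text\nName: _____ 9\n"
)
print(extract_toc_after_heading(doc))
```

Because the loop only starts collecting after the heading and stops at the first non-matching line, underscore runs elsewhere in the document are never reached.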

Sairam_RPA · June 22, 2023, 8:11pm

Ok I will try that

Sairam_RPA · June 22, 2023, 8:39pm

I tried this, but it did not work because there are a lot of other places in the document that contain “___”.

raja.arslankhan · June 22, 2023, 9:04pm

@Sairam_RPA Hi
Please try this regex. It will return an array of matches.

After this, loop over the match collection and run two more regexes on each match to get the name and page number. You can store the name and page number from each match in a dictionary:

Regex for name:

Regex for Number
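As an illustration of the general idea only, here is a Python sketch with a hypothetical pattern; it is not the regex posted in this reply. It assumes each entry sits on one line as a title, a run of at least three underscores, and a page number, and it uses named groups in a single pattern instead of two follow-up regexes.

```python
import re

# Hypothetical pattern: title, 3+ underscores, page number, all on one line.
TOC_LINE = re.compile(
    r"^[ \t]*(?P<title>[^_\r\n]+?)[ \t]*_{3,}[ \t]*(?P<page>\d+)[ \t]*\r?$",
    re.MULTILINE,
)


def extract_toc(text):
    """Return a dictionary mapping each title to its page number."""
    # Note: a repeated title would overwrite the earlier entry.
    return {m.group("title"): int(m.group("page")) for m in TOC_LINE.finditer(text)}


# Hypothetical document text with CRLF line breaks.
sample = (
    "Table of contents\r\n\r\n"
    "Cover Page_______ 1\r\n"
    "(Description)_____ 6\r\n"
    "Animals ______ 20\r\n"
    "Signature: ________\r\n"
    "See section 3 for details\r\n"
)
print(extract_toc(sample))
```

Requiring a page number at the end of the line filters out underscore runs such as signature lines, although any other line in the document with the same shape would still match.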

system · June 29, 2023, 12:41pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
