Regex to extract table of contents items using a pattern

Hi, I have a Word document with about 100 pages.

On one of the pages there is a table of contents like the one below.

Table of contents

Cover Page_______________________________ 1

Contents___________________________________ 3

Sites____________________________________ 4

Information______________ 5

(Description)__________________ 6

Narrative____________________________________ 7

Cited_______________________ 8

Resources_________________________ 13

Equipment________________________________________ 17

Attachments_________________________________ 20

Animals ____________________ 20

I need to get all the values and their associated page numbers. Can we use regex to pull these values and page numbers?

These values are not fixed and can change.

You don’t really need to use regex, although you could.

You could just split on VbCrLf to get it all into an array. Then For Each over the array, split each item on _, and take the first and last elements as the title and page.
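A minimal sketch of that split-based idea, written in Python rather than the VB.NET the thread implies, and assuming the table of contents text has already been isolated in a string. The function name `parse_toc_lines` and the sample text are illustrative only.

```python
def parse_toc_lines(toc_text):
    """Split TOC text into (title, page) pairs without using regex."""
    entries = []
    # splitlines() handles both \r\n (VbCrLf) and \n line endings.
    for line in toc_text.splitlines():
        if "_" not in line:
            continue  # skip blank lines and the heading
        parts = line.split("_")
        title = parts[0].strip()  # text before the first underscore
        page = parts[-1].strip()  # text after the last underscore
        entries.append((title, page))
    return entries


sample = "Table of contents\r\n\r\nCover Page_______ 1\r\n\r\nAnimals ____ 20\r\n"
print(parse_toc_lines(sample))
```

Splitting on a single underscore produces many empty elements in the middle, which is why only the first and last elements are used.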

The issue is that it is a 100-page Word document, and I do not know where this text appears in it. Hence the need for regex.

Isn’t there a way to determine where it starts and ends using other text like headers (i.e. a Contents header before it and some other header on the next section after it)?

It starts with Table of contents as the heading but it does not have a defined footer or ending text.

What about the next section, does it have a consistent header?

It’s a little bit of an ugly way to do it but you could read the entire text of the document, split on VbCrLf, and then loop through it. If the current array value contains “___” then keep it (put it into another array). Then on the resulting array you go through it and split each value up by _ to get the first item (title) and last item (page).
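As a rough Python sketch of this filter-then-split idea (the thread itself implies VB.NET), assuming the whole document text is already available as one string. The name `toc_candidates` and the sample text are hypothetical.

```python
def toc_candidates(document_text, marker="___"):
    """Keep every line containing the underscore leader, then split it."""
    kept = [line for line in document_text.splitlines() if marker in line]
    result = []
    for line in kept:
        parts = line.split("_")
        # First element is the title, last element is the page (may be empty).
        result.append((parts[0].strip(), parts[-1].strip()))
    return result


doc = "Intro text\nSites______ 4\nMore text\nSign here: ______\n"
print(toc_candidates(doc))
```

Note that any line elsewhere in the document that contains the marker is also kept, such as the "Sign here" line in the sample, which comes back with an empty page value.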

Yes, but it will take a long time to loop through the whole text of the 100-page document.

I felt that regex would do it faster.

You don’t have to loop through the whole thing. Once you are inside the table of contents, as soon as you get to an array item that does not contain “___” you know you’ve reached the end of it and can break out of the loop.
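One way to sketch the early exit in Python, assuming the scan starts at the "Table of contents" heading mentioned earlier in the thread and that blank lines separate the entries as in the sample. The function name `extract_toc` and its parameters are illustrative, not part of any library.

```python
def extract_toc(document_text, heading="Table of contents", marker="___"):
    """Collect TOC entries after the heading and stop at the first non-entry line."""
    entries = []
    in_toc = False
    for line in document_text.splitlines():
        stripped = line.strip()
        if not in_toc:
            # Ignore everything until the heading line is found.
            if stripped.lower() == heading.lower():
                in_toc = True
            continue
        if not stripped:
            continue  # blank lines between entries are skipped
        if marker not in stripped:
            break  # first non-blank line without the marker ends the TOC
        parts = stripped.split("_")
        entries.append((parts[0].strip(), parts[-1].strip()))
    return entries


doc = (
    "Preface\nSign: ______\nTable of contents\n\n"
    "Cover Page____ 1\n\nSites_____ 4\n\n"
    "Cover Page\nBody text with ___ in it\n"
)
print(extract_toc(doc))
```

Because collection only begins after the heading and stops at the first non-entry line, underscore runs before or after the table of contents are ignored.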

OK, I will try that.

I tried this, but it did not work because there are a lot of other places in the document that contain “___”.

@Sairam_RPA Hi,
Please try this regex. It will return a match collection.

After this, loop over the match collection and run two more regexes on each match to get the name and page number. You can store each name and page number pair in a dictionary:

Regex for name: [image]
Regex for number: [image]
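The patterns in this reply were posted as images and are not preserved here, so the following Python sketch uses an assumed pattern rather than the original one. It also swaps the two follow-up regexes for named groups in a single pattern. `ENTRY_RE` and `toc_to_dict` are hypothetical names.

```python
import re

# Assumed pattern: a title, a run of three or more underscores, then a page number.
ENTRY_RE = re.compile(
    r"^[ \t]*(?P<name>[^_\r\n]+?)[ \t]*_{3,}[ \t]*(?P<page>\d+)[ \t]*\r?$",
    re.MULTILINE,
)


def toc_to_dict(document_text):
    """Map each TOC title to its page number using one regex pass."""
    toc = {}
    for match in ENTRY_RE.finditer(document_text):
        # A repeated title would overwrite the earlier entry.
        toc[match.group("name").strip()] = int(match.group("page"))
    return toc


doc = (
    "Table of contents\r\n\r\nCover Page_______ 1\r\n\r\n"
    "(Description)____ 6\r\n\r\nAnimals ____ 20\r\n\r\n"
    "Signature: ______\r\n"
)
print(toc_to_dict(doc))
```

Requiring digits after the underscores filters out lines such as signature blanks, though a line elsewhere in the document with the same shape would still match.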
