You don’t really need RegEx for this, although you could use it.
You could just split the text on VbCrLf to get the lines into an array. Then For Each over the array and split each line on _, taking the first element as the title and the last element as the page.
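A minimal sketch of that split-and-take-first/last idea, written in Python purely for illustration (`str.split` stands in for VB's `Split`). The sample text and the name `parse_toc_line` are assumptions, not something from the original document.

```python
def parse_toc_line(line):
    """Split one TOC line on '_' and keep the first and last pieces."""
    parts = line.split("_")      # runs of underscores yield empty strings in the middle
    title = parts[0].strip()     # first element: the section title
    page = parts[-1].strip()     # last element: the page number, still a string
    return title, page


# Hypothetical sample text; "\r\n" is what VbCrLf represents.
sample = "Introduction_____1\r\nGetting Started_____4\r\nAppendix_____12"
entries = [parse_toc_line(line) for line in sample.split("\r\n") if line.strip()]
print(entries)
```

Note that this only works cleanly if titles themselves never contain an underscore.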
Isn’t there a way to determine where it starts and ends using other text like headers (i.e. a Contents header before it, and some other header on the next section after it)?
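If such headers exist, the idea could look roughly like this Python sketch. The header names ("Contents", "Overview") and the function `lines_between` are hypothetical; it assumes each header sits alone on its own line.

```python
def lines_between(lines, start_header, end_header):
    """Return the lines after start_header and before end_header."""
    start = None
    for i, line in enumerate(lines):
        text = line.strip()
        if start is None:
            if text == start_header:
                start = i + 1            # TOC begins on the line after the header
        elif text == end_header:
            return lines[start:i]        # stop just before the next section header
    # Start header never found: nothing; end header never found: take the rest.
    return [] if start is None else lines[start:]


# Hypothetical document text.
document = "Title Page\r\nContents\r\nOverview_____2\r\nSetup_____5\r\nOverview\r\nBody text"
toc_block = lines_between(document.split("\r\n"), "Contents", "Overview")
print(toc_block)
```

The exact-match comparison matters here: a TOC entry such as "Overview_____2" is not mistaken for the "Overview" section header.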
It’s a bit of an ugly way to do it, but you could read the entire text of the document, split it on VbCrLf, and then loop through the result. If the current array value contains “___”, keep it (put it into another array). Then go through the resulting array and split each value on _ to get the first item (title) and the last item (page).
You don’t have to loop through the whole thing. As soon as you get to an array item that does not contain “___” you know you’ve gotten to the end of the table of contents and can break out of the loop.
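The filter-and-break loop described above might be sketched as follows, in Python for illustration. The name `collect_toc_lines` and the sample text are assumptions, and the sketch assumes the table of contents is one contiguous run of lines with no blank lines inside it.

```python
def collect_toc_lines(text, marker="___"):
    """Keep lines containing the marker; stop at the first line after them that lacks it."""
    toc_lines = []
    for line in text.split("\r\n"):
        if marker in line:
            toc_lines.append(line)
        elif toc_lines:
            # We were inside the TOC and this line has no marker: the TOC is over.
            break
    return toc_lines


# Hypothetical document text; the last line has a marker but comes after the TOC ended.
sample_doc = "Report\r\nContents\r\nA___1\r\nB___2\r\nChapter A\r\nsome ___ text"
print(collect_toc_lines(sample_doc))
```

Breaking early also avoids picking up stray "___" sequences later in the document, such as signature lines.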
After this, loop over the match collection and run two more regexes on each match to get the name and the page number. You will get a name and a page number from each index, and you can maintain a dictionary for them.
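A rough Python sketch of that regex variant: one pattern finds the TOC lines, then two more pull the name and page number out of each match into a dictionary. All three patterns and the name `toc_to_dict` are assumptions about what the entries look like (a title, three or more underscores, then a numeric page), not the original poster's expressions.

```python
import re

ENTRY = re.compile(r"^.*_{3,}.*$", re.MULTILINE)  # whole lines containing "___"
NAME = re.compile(r"^[^_]+")                      # everything before the first underscore
PAGE = re.compile(r"(\d+)\s*$")                   # trailing digits, ignoring a stray "\r"


def toc_to_dict(text):
    """Map each TOC title to its page number."""
    toc = {}
    for match in ENTRY.finditer(text):            # loop over the match collection
        line = match.group(0)
        name = NAME.search(line)
        page = PAGE.search(line)
        if name and page:                         # skip lines that do not fit the assumed shape
            toc[name.group(0).strip()] = int(page.group(1))
    return toc


# Hypothetical document text.
doc_text = "Contents\r\nIntroduction_____1\r\nGetting Started_____4\r\nBody"
print(toc_to_dict(doc_text))
```

A dictionary keyed by title assumes titles are unique; duplicate titles would overwrite each other, in which case a list of pairs is the safer container.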