Need to extract specific content from a Word document

Hi,

I have a Word document of around 75 pages, and I need to extract specific data from it. For example, the document has an index of topics: the topic “Management discussion” starts on page 31 and ends on page 44, and on page 45 another topic, “Quality and materials”, begins. The page numbers may vary from one document to another, but the headings will be the same. Can you please guide me on how to scrape the data from pages 31-44?

@vaish
If the index of the topic is stable, then you can use the Substring method based on two specified indices.

Thanks for your reply.
Can you please explain in more detail how to do that?

  1. Use the Read Text activity inside a Word Application Scope activity. The output will be a string variable, strWordContent.
  2. Find the index of the topic "Management discussion":

intStartIndex = strWordContent.IndexOf("Management discussion")

  3. Find the index of the topic "Quality and Materials":

intEndIndex = strWordContent.IndexOf("Quality and Materials")

  4. Use the Substring function to get the required data:

strManagementContent = strWordContent.Substring(intStartIndex, intEndIndex - intStartIndex)
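For readers who want to try the logic outside UiPath, the steps above can be sketched in Python. This is only an illustration of the same idea, not the workflow from this thread: `extract_between` is a hypothetical helper name, `str.find` plays the role of `IndexOf` (both return -1 when nothing matches), and slicing plays the role of `Substring(start, end - start)`. The sketch also checks for -1, because a negative start index would make `Substring` fail.

```python
def extract_between(text: str, start_heading: str, end_heading: str) -> str:
    """Return the text from start_heading up to (not including) end_heading."""
    # Like IndexOf: returns -1 when the heading is not present.
    start = text.find(start_heading)
    if start == -1:
        raise ValueError(f"start heading not found: {start_heading!r}")

    # Look for the end heading only after the start heading.
    end = text.find(end_heading, start + len(start_heading))
    if end == -1:
        raise ValueError(f"end heading not found: {end_heading!r}")

    # Like Substring(start, end - start).
    return text[start:end]


# Made-up sample text standing in for strWordContent.
sample = (
    "Intro text. "
    "Management discussion: revenue grew. "
    "Quality and Materials: suppliers audited."
)
section = extract_between(sample, "Management discussion", "Quality and Materials")
print(section)
```

The returned text includes the start heading itself; add the heading length to the start index if only the body under the heading is wanted.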

You can also refer

@vaish
Exactly the same format posted by @Madhavi

Hi,
Thanks for your help.

But I got the output as M.

Sorry. My mistake. I have updated the above solution. Please check step 4

strManagementContent = strWordContent.Substring(intStartIndex, intEndIndex - intStartIndex)

Hi,

But I got the topics and the page numbers; I didn't get the content under the topic.
I have attached my flowchart. Please check.
Thanks
Flowchart.xaml (16.9 KB)

Is it possible to send input file as well?

For both the start index and the end index, you are using the same topic:
intStartIndex = StrWord.IndexOf("Management’s Discussion and Analysis of Financial Condition and Results of Operations")
intEndIndex = StrWord.IndexOf("Management’s Discussion and Analysis of Financial Condition and Results of Operations") => This should be "Quality and materials"

and step 4:
strManagementContent = strWordContent.Substring(intStartIndex, intEndIndex - intStartIndex)

Also, the intStartIndex and intEndIndex variables should be of type Integer.
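Even with two different headings, the first match can land in the document's index rather than in the body, which would return only topic names and page numbers. Whether that is what happened here is an assumption, not something confirmed in this thread. A minimal Python sketch of one way around it, assuming the last occurrence of the start heading is the real section heading (`extract_body_section` and the sample text are made up for illustration):

```python
def extract_body_section(text: str, start_heading: str, end_heading: str) -> str:
    """Slice from the LAST start_heading to the next end_heading after it."""
    # Assumption: earlier occurrences belong to the table of contents.
    start = text.rfind(start_heading)
    if start == -1:
        raise ValueError(f"start heading not found: {start_heading!r}")

    # Search for the end heading only after the chosen start heading.
    end = text.find(end_heading, start + len(start_heading))
    if end == -1:
        raise ValueError(f"end heading not found: {end_heading!r}")
    return text[start:end]


# Made-up document: an index followed by the actual sections.
doc = (
    "Contents\n"
    "Management discussion 31\n"
    "Quality and Materials 45\n"
    "\n"
    "Management discussion\n"
    "Revenue grew this quarter.\n"
    "\n"
    "Quality and Materials\n"
    "Suppliers were audited.\n"
)

# First-match slicing stops inside the index.
naive = doc[doc.find("Management discussion"):doc.find("Quality and Materials")]
# Last-match slicing reaches the section body.
body = extract_body_section(doc, "Management discussion", "Quality and Materials")
print(repr(naive))
print(repr(body))
```

This breaks if the start heading is mentioned again later in the body text, so it is worth checking the result against the source document.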

I made a small change in step 4. Please check. This will give you the result.

Hi,
Thanks for your help.
Sorry for the late reply. Yes, I made the changes as you said.
Please find attached both the xaml file and the docx.
Flowchart.xaml (12.0 KB) MSFT_FY20Q2_10Q.docx (893.3 KB)

I see the issue: the text is not read completely inside the Word scope, maybe because of the size of the file.
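The thread does not show which workaround was used. As one alternative outside UiPath, a .docx file is a zip archive whose main text lives in `word/document.xml`, so the paragraphs can be read with the Python standard library alone. This is a rough sketch: `read_docx_text` is a hypothetical helper, it ignores headers, footnotes and text boxes, and the demo file is a stripped-down archive built in memory, not a document Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_text(source) -> str:
    """Return the body text of a .docx (path or file-like), one paragraph per line."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def build_demo_docx() -> io.BytesIO:
    """Build a minimal in-memory archive containing only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Management discussion</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Revenue </w:t></w:r><w:r><w:t>grew.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


text = read_docx_text(build_demo_docx())
print(text)
```

The resulting string can then be searched for the two headings in the same way as strWordContent.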

Hi,
Thank you so much for your help; your code was really helpful.
I am able to scrape the page number for the start index. With reference to your code, I scraped the end index.
With the start index and the end index, I have copied the content of the document.
