Need to extract specific content from word

vaish · March 13, 2020, 10:00am

Hi,

I have a Word document of around 75 pages, and I need to extract specific data from it. For example, the document has an index of topics such as "Management discussion", which starts on page 31 and ends on page 44; on page 45 another topic, "Quality and materials", begins. The page numbers may vary from one document to another, but the headings stay the same. Can you please guide me on how to scrape the data from pages 31-44?

KGPK · March 13, 2020, 10:41am

@vaish
If the index of the topic is stable, you can use the Substring method with the two specified indices.

vaish · March 13, 2020, 10:53am

Thanks for your reply.
Can you please explain a bit more? How do I do that?

Madhavi · March 13, 2020, 10:57am

1. Use the Read Text activity inside the Word Application Scope activity. The output will be a string variable, strWordContent.

2. Find the index of the topic "Management discussion":

intStartIndex = strWordContent.IndexOf("Management discussion")

3. Find the index of the topic "Quality and Materials":

intEndIndex = strWordContent.IndexOf("Quality and Materials")

4. Use the Substring function to get the required data:

strManagementContent = strWordContent.Substring(intStartIndex, intEndIndex - intStartIndex)
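For readers who want to try the logic outside UiPath, the steps above can be sketched in Python. This is only an illustration, not the workflow from this thread: the function name `extract_between` and the sample text are made up, `str.find` stands in for `IndexOf`, and slicing stands in for `Substring`. Unlike the steps above, the sketch looks for the end heading only after the start heading and raises an error when a heading is missing.

```python
def extract_between(text: str, start_heading: str, end_heading: str) -> str:
    """Return the text from start_heading up to (not including) end_heading."""
    # str.find plays the role of IndexOf: it returns -1 when nothing matches.
    start_index = text.find(start_heading)
    if start_index == -1:
        raise ValueError(f"start heading not found: {start_heading!r}")

    # Look for the end heading only after the start heading.
    end_index = text.find(end_heading, start_index + len(start_heading))
    if end_index == -1:
        raise ValueError(f"end heading not found: {end_heading!r}")

    # Equivalent to Substring(start_index, end_index - start_index).
    return text[start_index:end_index]


# Hypothetical sample text standing in for the Read Text output.
sample = "Intro\nManagement discussion\nRevenue grew.\nQuality and Materials\nSteel."
section = extract_between(sample, "Management discussion", "Quality and Materials")
print(section)
```

The second argument of `Substring` is a length, not an end position, which is why the steps above pass `intEndIndex - intStartIndex`.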

You can also refer

KGPK · March 13, 2020, 11:02am

@vaish
Exactly, use the same format posted by @Madhavi.

vaish · March 13, 2020, 12:04pm

Hi,
Thanks for your help.

But I got the output as M.

Madhavi · March 13, 2020, 12:07pm

Sorry, my mistake. I have updated the above solution. Please check step 4:

strManagementContent = strWordContent.Substring(intStartIndex, intEndIndex - intStartIndex)

vaish · March 13, 2020, 12:28pm

Hi,

But I got the topics and the page numbers; I didn't get the content under the topic.
I have attached my flowchart. Please check.
Thanks
Flowchart.xaml (16.9 KB)
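One possible explanation, not confirmed in this thread, is that the heading appears twice: once in the index at the front of the document and once above the actual section. A plain first-match search would then land in the index and return topic names with page numbers. Assuming that is the case, a Python sketch of picking the second occurrence could look like this (the helper `find_nth` and the sample text are hypothetical):

```python
def find_nth(text: str, heading: str, n: int) -> int:
    """Return the index of the n-th (1-based) occurrence of heading, or -1."""
    index = -1
    for _ in range(n):
        # Continue searching just past the previous match.
        index = text.find(heading, index + 1)
        if index == -1:
            return -1
    return index


# Hypothetical text: an index with page numbers, followed by the real sections.
doc = (
    "Index\n"
    "Management discussion 31\n"
    "Quality and Materials 45\n"
    "Management discussion\n"
    "Revenue grew.\n"
    "Quality and Materials\n"
    "Steel prices."
)

# Skip the index entries by asking for the second occurrence of each heading.
start_index = find_nth(doc, "Management discussion", 2)
end_index = find_nth(doc, "Quality and Materials", 2)
body = doc[start_index:end_index]
print(body)
```

In .NET, the `IndexOf(String, Int32)` overload, which takes a start position, can be used the same way.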

Madhavi · March 13, 2020, 12:30pm

Is it possible to send input file as well?

Madhavi · March 13, 2020, 12:34pm

You are using the same topic for both the start index and the end index:
intStartIndex = StrWord.IndexOf("Management’s Discussion and Analysis of Financial Condition and Results of Operations")
intEndIndex = StrWord.IndexOf("Management’s Discussion and Analysis of Financial Condition and Results of Operations") => This should be "Quality and materials"

And step 4:
strManagementContent = strWordContent.Substring(intStartIndex, intEndIndex - intStartIndex)

Also, the intStartIndex and intEndIndex variables should be of type Integer.

Madhavi · March 13, 2020, 1:02pm

I made a small change in step 4. Please check; this should give you the result.

vaish · March 13, 2020, 3:37pm

Hi,
Thanks for your help.
Sorry for the late reply. Yes, I made the changes as you said.
Please find attached both the XAML file and the DOCX.
Flowchart.xaml (12.0 KB) MSFT_FY20Q2_10Q.docx (893.3 KB)

Madhavi · March 16, 2020, 11:32am

I see the issue: the Word Application Scope does not read the text completely, maybe because of the size of the file.
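A quick way to check for this kind of problem is to look at how much text was read and whether each heading is present at all. The Python sketch below is a generic diagnostic, not part of the workflow in this thread; the function name `report_headings` and the truncated sample text are assumptions for illustration.

```python
def report_headings(text: str, headings: list[str]) -> dict[str, int]:
    """Map each heading to the index of its first match (-1 if absent)."""
    return {heading: text.find(heading) for heading in headings}


# Hypothetical output of a read that stopped before the second heading.
partial_text = "Cover page\nManagement discussion\nRevenue grew."
report = report_headings(
    partial_text, ["Management discussion", "Quality and Materials"]
)

# A -1 entry means the heading never made it into the text that was read.
print(len(partial_text), report)
```

If the end heading comes back as -1, the substring length `intEndIndex - intStartIndex` becomes negative and the Substring call fails, so the check is worth doing before extracting.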

vaish · March 17, 2020, 12:11pm

Hi,
Thank you so much for your help; your code was really helpful.
I am able to scrape the page number for the start index. With reference to your code, I scraped the end index as well.
With the start index and end index, I copied the content of the document.

system · March 20, 2020, 12:11pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
