How to extract content from word document

Nathan_Betters1 · May 6, 2022, 2:41am

Hello!

I am struggling with splitting content. I have a couple hundred Word files with email campaign content for our company, and I need to get that content into our marketing platform. Rather than copying and pasting, I was hoping to extract the valuable content and then use Type Into activities to place it into the CMS (this part I think I have down OK).

Each document has this starting string after the title of the document:

After the content of the word document there is an end of document indicator:

<!– END CONTENT

Disclaimer (AGENCY, PLEASE READ. DO NOT PRINT IN NEWSLETTER)

This content is provided for your convenience only. You should review this content carefully, because you are ultimately responsible for the accuracy of information that you provide through your website. Every agent and broker is responsible for knowing the guidelines and laws that govern rating, underwriting and claims handling in their states. Safeco provides you with a limited, non-exclusive, non-transferable license to display and use this content, and to make and display derivative works from it, so long as you are appointed with Safeco. By using this content, you agree to defend, indemnify, and hold Liberty Mutual Group, Inc., and its affiliates, including Safeco Insurance, harmless from any loss, claim, or damages arising from your display, or use, of the content or of any derivative works.

→

** One thing that would also be helpful is extracting the file name from the document footer

Here is an image of what the document looks like. The red text is ALWAYS in the document in this exact format.

Nathan_Betters1 · May 6, 2022, 2:42am

PS. I should note, the content is used with permission from Safeco Insurance so I am not breaking any laws by extracting it.

Rahul_Unnikrishnan · May 6, 2022, 2:46am

Hello @Nathan_Betters1 ,

Is this format static, with only the highlighted values changing?

If yes, I would suggest using regex to extract the values. You can create one regex per pattern and extract each value with it.

You can use text activities (or the activities from UiPath.Word.Activities) to read your Word files, then use a regex expression to extract the values you want (the Matches activity returns the matched text; Is Match only reports whether a match exists).
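
To illustrate the regex idea outside UiPath, here is a minimal Python sketch. The marker strings are assumptions based on this thread, and `extract_between` is a hypothetical helper name; the same pattern idea should carry over to a regex activity, but check it against your real document text.

```python
import re

# Assumed marker text; copy the exact strings from your own documents.
BEGIN_MARKER = "BEGIN CONTENT"
END_MARKER = "END CONTENT"

def extract_between(text, begin=BEGIN_MARKER, end=END_MARKER):
    """Return the text between the first begin marker and the next end marker, or None."""
    # re.escape keeps any special characters in the markers literal;
    # DOTALL lets "." match line breaks, so multi-paragraph content is captured.
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None

sample = "Title BEGIN CONTENT Hello, reader. END CONTENT Disclaimer"
print(extract_between(sample).strip())  # prints: Hello, reader.
```

The lazy `.*?` stops at the first end marker, which matters if the marker text could appear again in the disclaimer.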

Use below for regex creation:

Nathan_Betters1 · May 6, 2022, 2:47am

Everything between the Begin content–> and <!–End Content markers is what I need to extract. I will take a look at the RegExr link you posted; I have never done anything with this before.

Nathan_Betters1 · May 6, 2022, 2:49am

one look at the RegExr tool and I am confused…

Rahul_Unnikrishnan · May 6, 2022, 2:54am

Is END CONTENT text that actually appears in the Word document? If yes, you can read the Word document and then use the expressions below. InputString is the variable that holds the text after reading the Word document.

String val1=Split(InputString,"END CONTENT")[0]
String val2=Split(val1,"BEGIN CONTENT")[1]
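
For comparison, the same two-step split can be sketched in Python (not UiPath syntax). `split_between` is a hypothetical helper, and the marker strings are assumed from this thread. The extra checks are there because a split on a missing marker silently returns the whole text.

```python
def split_between(text, begin, end):
    """Two-step split: cut at the end marker, then keep what follows the begin marker."""
    if begin not in text or end not in text:
        raise ValueError("marker not found in text")
    before_end = text.split(end)[0]      # index 0: everything before the first end marker
    pieces = before_end.split(begin)     # index 1: what follows the first begin marker
    if len(pieces) < 2:
        raise ValueError("begin marker does not come before end marker")
    return pieces[1]

doc = "Newsletter title BEGIN CONTENT Body text here. END CONTENT Disclaimer"
print(split_between(doc, "BEGIN CONTENT", "END CONTENT").strip())  # prints: Body text here.
```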

Nathan_Betters1 · May 6, 2022, 3:03am

I get a compiler error. I tried adding a comma between the expressions, and then a period, and neither worked.

Rahul_Unnikrishnan · May 6, 2022, 3:09am

No… you made a mistake in the Assign activity.

Create two variables, val1 and val2. I assume Content is the variable that holds the text read from Word. Then add two Assign activities and set them up as below.

In the first, use val1 on the left side and Split(Content,"END CONTENT")[0] on the right side.
In the second, use val2 on the left side and Split(val1,"BEGIN CONTENT")[0] on the right side.

Nathan_Betters1 · May 6, 2022, 3:15am

humm… I am still getting a compiler error that end of expression expected

PS… thank you for such a quick reply!

Rahul_Unnikrishnan · May 6, 2022, 3:20am

Instead of [0], can you try (0)? VB.NET expressions index arrays with parentheses rather than square brackets.

Nathan_Betters1 · May 6, 2022, 11:37am

The syntax worked, but the results were not as expected when writing the assigned values to a new Word document:

val1 writes everything up to the last <!–
val2 writes the last paragraph

I don’t understand how to get only the words in yellow with black text.
I don’t need the yellow highlight with red text, and I don’t need the blue highlight.

Rahul_Unnikrishnan · May 6, 2022, 11:49am

Can you confirm whether “<!–” appears in the real document?

If yes,

val1= Split(Content,"<!–")(1)

val2= Split(val1,"–>")(1)

Here val1 drops only the blue-highlighted part, and val2 then drops the red text.

That leaves only the yellow-highlighted content with black letters. It is a good idea to print val1 and val2 in a message box to check how the text is being split; you can then modify the splits to get the required output.
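
The cross-check can be sketched in Python like this: print every piece of each split with its index, then pick the index that holds the text you want. `show_pieces` is a hypothetical helper and `content` is a made-up stand-in for the document text. The marker strings are copied as they appear in this thread, so the dash characters in your real documents may differ.

```python
def show_pieces(text, delimiter):
    """Print each piece of the split with its index and return the list of pieces."""
    pieces = text.split(delimiter)
    for index, piece in enumerate(pieces):
        print(f"({index}) {piece.strip()!r}")
    return pieces

# Made-up sample: title, a marker comment, the wanted paragraph, then the end marker.
content = "Title <!– internal note –> Paragraph to keep. <!– END CONTENT Disclaimer →"

val1 = show_pieces(content, "<!–")[1]   # drops the title before the first marker
val2 = show_pieces(val1, "–>")[1]       # drops the marker comment itself
print(val2.strip())  # prints: Paragraph to keep.
```

If the wanted text shows up under a different index in the printed list, change the index in the split rather than the delimiter.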

Nathan_Betters1 · May 6, 2022, 12:18pm

THANK YOU SOOO MUCH!!!

Nathan_Betters1 · May 8, 2022, 4:00pm

Follow-up question: how do I keep the original formatting of the document that was read? When writing the new document, it seems to remove all formatting and paragraph breaks and place all the text into one long string of words.

Rahul_Unnikrishnan · May 8, 2022, 4:08pm

Hello @Nathan_Betters1 ,

What type of document are you trying to read? Is it a PDF or word?

Nathan_Betters1 · May 8, 2022, 5:24pm

Word documents

Rahul_Unnikrishnan · May 8, 2022, 5:32pm

Hello @Nathan_Betters1 ,

If you are trying to write to a Word document, you can refer to the video below. You can use the bookmark feature to write the content. I think the text will align automatically based on the text field you create for the bookmark.

Rahul_Unnikrishnan · May 8, 2022, 5:32pm

Try this. Using the bookmark

Nathan_Betters1 · May 8, 2022, 5:51pm

Unfortunately this will not work because each Word file has different content; it can be any variation of paragraphs, images, etc. I would have no way to know what bookmarks to build.
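
One possible direction for keeping at least the paragraph breaks, without bookmarks, is to read the document paragraph by paragraph instead of as one string. Below is a rough Python sketch, not a UiPath solution. It relies on a .docx file being a ZIP archive whose main text lives in word/document.xml. It assumes the markers sit in their own paragraphs, it recovers plain text only (no fonts, highlights or images), and the helper names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_paragraphs(docx_file):
    """Return the text of each paragraph in a .docx file (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # A paragraph is a w:p element; its visible text is spread over w:t elements.
    return ["".join(t.text or "" for t in p.iter(W + "t")) for p in root.iter(W + "p")]

def paragraphs_between(paragraphs, begin, end):
    """Keep the paragraphs strictly between the ones containing the markers."""
    kept, inside = [], False
    for text in paragraphs:
        if not inside:
            inside = begin in text      # skip everything up to and including the begin marker
        elif end in text:
            break                       # stop at the paragraph holding the end marker
        else:
            kept.append(text)
    return kept
```

Joining the result with blank lines, for example `"\n\n".join(paragraphs_between(read_paragraphs(path), "BEGIN CONTENT", "END CONTENT"))`, would give text that still has one block per original paragraph.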

system · May 11, 2022, 5:52pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
