How to extract content from a Word document

Hello!

I am struggling with splitting content. I have a couple hundred Word files with email campaign content for our company, and I need to get it into our marketing platform. Rather than copying and pasting, I was hoping to extract the valuable content and then use Type Into activities to place it into the CMS (this part I think I have down OK).

Each document has this starting string after the title of the document:

After the content of the word document there is an end of document indicator:

<!-- END CONTENT

Disclaimer (AGENCY, PLEASE READ. DO NOT PRINT IN NEWSLETTER)

This content is provided for your convenience only. You should review this content carefully, because you are ultimately responsible for the accuracy of information that you provide through your website. Every agent and broker is responsible for knowing the guidelines and laws that govern rating, underwriting and claims handling in their states. Safeco provides you with a limited, non-exclusive, non-transferable license to display and use this content, and to make and display derivative works from it, so long as you are appointed with Safeco. By using this content, you agree to defend, indemnify, and hold Liberty Mutual Group, Inc., and its affiliates, including Safeco Insurance, harmless from any loss, claim, or damages arising from your display, or use, of the content or of any derivative works.

** One thing that would also be helpful is extracting the file name from the document footer

Here is an image of what the document looks like. The red text is ALWAYS in the document in this exact format.

PS. I should note, the content is used with permission from Safeco Insurance so I am not breaking any laws by extracting it.

Hello @Nathan_Betters1 ,

Is this format static, with only the highlighted values changing?

If yes, I would suggest using regex to extract the values. You can create one regex per pattern and extract each value with it.

You can use text activities (or activities from the UiPath.Word.Activities package) to read your Word files, and then use a regex expression to extract the values you want (the Matches activity returns the matched text; Is Match only checks whether a match exists).

Use below for regex creation:

Everything between the BEGIN CONTENT --> and <!-- END CONTENT markers is what I need to extract. I will take a look at the RegExr tool you posted; I have never done anything like this.

one look at the RegExr tool and I am confused…
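For readers in the same position, here is a minimal sketch of the regex idea, written in Python rather than as a UiPath workflow. The marker strings, the sample text, and the helper name extract_content are assumptions made for illustration; the real documents may use slightly different markers.

```python
import re

# Assumed marker strings; adjust them to match the real documents.
BEGIN_MARKER = "BEGIN CONTENT -->"
END_MARKER = "<!-- END CONTENT"

# re.DOTALL lets "." match line breaks, and the lazy ".*?" stops at the
# first end marker instead of running on to the last one.
CONTENT_PATTERN = re.compile(
    re.escape(BEGIN_MARKER) + r"(.*?)" + re.escape(END_MARKER),
    re.DOTALL,
)


def extract_content(document_text):
    """Return the text between the markers, or None if a marker is missing."""
    match = CONTENT_PATTERN.search(document_text)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical sample standing in for the text read from one Word file.
sample = (
    "Newsletter title\n"
    "<!-- AGENCY NOTES BEGIN CONTENT -->\n"
    "First paragraph.\n\n"
    "Second paragraph.\n"
    "<!-- END CONTENT\n"
    "Disclaimer text -->\n"
)

print(extract_content(sample))
```

The same idea should carry over to a .NET regex such as (?s)BEGIN CONTENT -->(.*?)<!-- END CONTENT, with the first capture group holding the content, though that pattern has not been checked against the real files.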

Is END CONTENT text that actually appears in the Word document? If yes, you can read the Word document and then use the expressions below, where InputString is the variable holding the text read from the document:

String val1=Split(InputString,"END CONTENT")[0]
String val2=Split(val1,"BEGIN CONTENT")[1]

I get a compiler error. I tried adding a comma between the expressions, and then a period, and neither worked.

No…Here in the assign activity you made a mistake.

Create two variables, val1 and val2. I assume Content is the variable that holds the data read from the Word document. Then add two Assign activities and set them up as below.

In the first, put val1 on the left side and Split(Content,"END CONTENT")[0] on the right side.
In the second, put val2 on the left side and Split(val1,"BEGIN CONTENT")[0] on the right side.

humm… I am still getting a compiler error that end of expression expected

PS… thank you for such a quick reply!

Instead of [0], can you try (0)? VB.NET expressions index arrays with parentheses rather than square brackets.

The syntax worked, but the results were not as expected when writing the assigned values to a new Word document:

val1 writes everything up to the last <!--
val2 writes the last paragraph

I don’t understand how to only get the words in yellow/black
I don’t need the highlighted yellow with red text, and I don’t need the highlighted blue.

Can you confirm whether "<!--" will be there in the real document?

If yes:

val1 = Split(Content,"<!--")(1)

val2 = Split(val1,"-->")(1)

Here val1 drops only the blue-highlighted part, and val2 then drops the red text as well.

That leaves only the yellow-highlighted content with black letters. It is worth printing val1 and val2 in a message box to cross-check how the text is being split, and then modifying the split to get the required output.
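To make the two-step split easier to follow, here is a minimal Python sketch of the same logic. The sample text and the function name between_comment_markers are hypothetical, and the sketch assumes the document contains an opening comment that ends in "-->" before the content and a second "<!--" after it.

```python
def between_comment_markers(document_text):
    """Mirror the two Assign steps: split on "<!--", then on "-->"."""
    parts = document_text.split("<!--")
    if len(parts) < 2:
        raise ValueError('no "<!--" marker found')
    # val1: text after the first "<!--", up to the next "<!--" (if any).
    val1 = parts[1]

    pieces = val1.split("-->")
    if len(pieces) < 2:
        raise ValueError('no "-->" marker found after the first "<!--"')
    # val2: text after the "-->" that closes the opening comment.
    val2 = pieces[1]
    return val1, val2


# Hypothetical sample standing in for the text read from one Word file.
sample = (
    "Newsletter title\n"
    "<!-- BEGIN CONTENT -->\n"
    "First paragraph.\n"
    "Second paragraph.\n"
    "<!-- END CONTENT\n"
    "Disclaimer text -->\n"
)

val1, val2 = between_comment_markers(sample)
print(repr(val1))  # still contains the BEGIN CONTENT marker
print(repr(val2))  # only the content, with surrounding line breaks
```

Printing both intermediate values, as suggested above, shows which marker each step removes. The explicit length checks stand in for the index error that taking element 1 would raise when a marker is missing.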

THANK YOU SOOO MUCH!!!

Follow-up question: how do I get this to keep the original formatting of the document that was read? When writing the new document, it seems to remove all formatting and paragraph breaks and just place all the text into one long string of words.

Hello @Nathan_Betters1 ,

What type of document are you trying to read? Is it a PDF or a Word file?

Word documents

Hello @Nathan_Betters1 ,

If you are trying to write to a Word document, you can refer to the video below. You can use the bookmark feature to write the content. I think the text will align automatically based on the text field you create for the bookmark.

Try this, using the bookmark.

Unfortunately this will not work, because each Word file has different content. It can be any variation of paragraphs, images, etc., so I would have no way to know what bookmarks to build.