Finding a word in a document, extracting the sentence it's in

Hi all

This may be a bit of a long shot but I wondered if anyone had any ideas on how to do the process someone has asked me to look into…

Step 1 - check whether any word from a list of words appears in a PDF document
Step 2 - if it does, either

  • A) grab the page number of the page it appears on, then carry on searching through the document
    or
  • B) grab the whole sentence it appears in (this may be the weaker of the two options, in case something important is missed in the next sentence)

Step 3 - enter the information obtained (page number or sentence) in an Excel table

e.g. - if the word was “Contractor”, it would find it and write back to Excel that the word was found on page number 10:

The documents will be anywhere from 50 to 300 pages long. They’ll be sent from different companies, so there’s no set template, and the word may or may not appear in a table (as shown above).

I can manage steps 1 and 3 but I have no idea how to approach step 2. Any ideas please?

Thanks

Read the text into a variable and then use RegEx to find the word and extract everything between the previous and next periods.
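
For illustration, here is a rough Python sketch of that idea (the thread itself is about UiPath, so treat this as a plain-regex outline, not a UiPath workflow). The function name `sentences_containing` and the sample text are made up for the example, and the sentence split is naive: it breaks after `.`, `!` or `?`, so abbreviations such as "e.g." will cut a sentence short.

```python
import re


def sentences_containing(text, word):
    """Return each sentence in `text` that contains `word` as a whole word."""
    # PDF text extraction often leaves line breaks mid-sentence, so flatten whitespace first.
    flat = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    # Whole-word, case-insensitive match; re.escape guards against regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Hypothetical sample standing in for text read from a PDF.
sample = (
    "The Contractor shall supply all materials.\n"
    "Payment is due in 30 days. Delays caused by the contractor\n"
    "are not covered."
)
matches = sentences_containing(sample, "Contractor")
for sentence in matches:
    print(sentence)
```

The same pattern should carry over to a RegEx step in other tools, though the exact syntax for lookbehind and case-insensitive matching may differ.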

Hello,

I would solve this problem with a couple of lines in Python but I think I have found a way in UiPath also…

Basically, you get the number of pages in the PDF document and loop through them, reading each page one by one (using the Range property of the Read PDF Text activity) and checking whether the word is on that page. From there you can add the page number to a list, print it to the console, etc.
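
Since Python was mentioned, the same page-by-page loop might look roughly like the sketch below. It assumes the text of each page has already been extracted into a list of strings (for example with a PDF library such as pypdf, which is an assumption and not something used in this thread); `pages_containing` and the sample pages are hypothetical names and data for the example.

```python
import re


def pages_containing(page_texts, words):
    """Map each word to the 1-based page numbers on which it appears."""
    hits = {w: [] for w in words}
    # Compile one whole-word, case-insensitive pattern per search word.
    patterns = {
        w: re.compile(r"\b" + re.escape(w) + r"\b", re.IGNORECASE) for w in words
    }
    # Visit every page so later occurrences are recorded as well as the first.
    for page_number, text in enumerate(page_texts, start=1):
        for w in words:
            if patterns[w].search(text):
                hits[w].append(page_number)
    return hits


# Hypothetical three-page document, one string per page.
page_texts = [
    "Scope of works and definitions.",
    "The Contractor shall attend site.",
    "No penalty applies to the contractor.",
]
hits = pages_containing(page_texts, ["Contractor", "Penalty"])
for word, pages in hits.items():
    print(word, pages)
```

The resulting dictionary has one entry per search word, which maps naturally onto rows of an Excel table (word in one column, page numbers in another).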

I tested it on a 422-page document.

Amazing, thank you so much!
