Extract Text after underlined first word

I want to create a process as shown below. Currently I am not able to separate the text elements, so all the text gets extracted instead of just one line.

  1. Open the PDF (sample above).

  2. Create an Excel file with the same file name as the one at the bottom of the page.

  3. Extract specific information (sample data extraction file shown below):

     1. Extract the bold text and put it in a column in Excel.

     2. Extract all text after the underlined text and put it in separate rows under the bold text.

     3. Perform the same task for all the files in the same folder.


Is the pdf in image format?
Did you use Get PDF Text activity or Get Text activity?

@packiaa The file is not in image format. Yes, I have used the Get PDF activity, but it gets all the words. What I am trying to do is get a specific line. For example, if you look at the image attached to my question, I am trying to extract all the text written after the words MATERIAL AND SELECTION. I am looking for a solution that lets me get only the text I mentioned.

@Rupendankhara, can you provide a copy of the pdf file? From my experience with pdf files and Uipath, using Acrobat Reader 2018 is better than using Acrobat Reader 2019.
Related to this article:

Please see the image in the first post on this page. It's attached!

With a pdf file from you as a test, I can try different settings (such as the ones in the link that I shared in my first post) to see which method reads the file best.
Even if you can’t take the text line by line, you can take all the text and apply some ‘substring’ operations to get the text and the line you want.

In short, as I understand your request, the workflow should look like this:

  1. Read the PDF file and save the output in a variable.
  2. Use an Assign activity to put the text you want to extract in a variable (using substring or RegEx).
  3. Use Excel Application Scope with Write Range to put the data in the Excel file you want.
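
To make step 2 more concrete, here is a minimal sketch of both approaches. It is written in Python rather than as a UiPath workflow, only to show the logic; in UiPath the same idea would go into an Assign activity using .NET string methods or `System.Text.RegularExpressions.Regex`. The sample text and the function names are made up for illustration, and the sketch assumes the wanted text sits on the same line as the heading.

```python
import re
from typing import Optional

# Made-up sample standing in for the text read from the PDF.
sample_text = (
    "SAMPLE PAGE\n"
    "MATERIAL AND SELECTION: Polished concrete floor\n"
    "NOTES: Verify on site\n"
)


def text_after_substring(text: str, heading: str) -> Optional[str]:
    """Return the rest of the line after `heading`, using plain index lookups."""
    start = text.find(heading)
    if start == -1:
        return None  # heading not present
    start += len(heading)
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    # Drop the separator (":" and spaces) in front of the value.
    return text[start:end].lstrip(" :\t").rstrip()


def text_after_regex(text: str, heading: str) -> Optional[str]:
    """Same result, but with a regular expression."""
    # "." does not match a newline, so the group stops at the end of the line.
    pattern = re.escape(heading) + r"[ \t]*:?[ \t]*(.*)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


print(text_after_substring(sample_text, "MATERIAL AND SELECTION"))
print(text_after_regex(sample_text, "MATERIAL AND SELECTION"))
```

Both functions return `None` when the heading is missing, so the caller can skip that file or log it instead of writing an empty cell.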

Hey wasea,

Will you be able to do it from the pdf (attached)? It is similar for all the others. I want to extract the bold text, then extract the text after the underlined text.

I am new to UiPath and coding. Would you be able to explain how Regex will help?

Thank you,

Well, attached in the first post is a .jpg file, not a .pdf file.

Let me give it to you then.

Please use this for reference.

RTA - 001 AUTO GALLERY for blog.pdf (334.6 KB)

@wasea any luck with a solution?

Hi @Rupendankhara,

Please check this workflow: PDF.zip (341.4 KB)

It can certainly be optimized, but it is just an idea of how to deal with that pdf.

  1. I used “For each” to get all the files from a specific folder.
  2. I open each file one by one (Start Process)
  3. I read the first and the last row (Last row is the ImageName or so)
  4. Get Text activity to get Flooring data
  5. Get Text to get Treads Data
  6. Get Text to get Walls Data
  7. Get Text to get Ceiling Data
  8. Excel application scope to add data to Excel row by row (it can be optimized with DataTable for sure).
  9. Kill Process to close the PDF.
  10. There are 2 folders: 1 for Invoices/ 1 for Excel file.
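
The loop over the folder and the "collect first, write once" optimization mentioned in step 8 can be sketched outside UiPath as follows. This is an illustration in Python, not the attached workflow: `collect_rows`, `write_rows`, and the `read_text` and `extract` callables are hypothetical names, and the output is a CSV file standing in for the Excel sheet.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List


def collect_rows(
    folder: Path,
    read_text: Callable[[Path], str],           # hypothetical: returns the text of one PDF
    extract: Callable[[str], Dict[str, str]],   # hypothetical: returns heading -> value
) -> List[Dict[str, str]]:
    """Build one row per PDF in the folder, in file-name order."""
    rows = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        fields = extract(read_text(pdf_path))
        rows.append({"File": pdf_path.stem, **fields})
    return rows


def write_rows(rows: List[Dict[str, str]], out_path: Path) -> None:
    """Write all rows in one pass instead of one write per row."""
    # Gather column names in first-seen order, since files may have different headings.
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the rows in memory and writing them once is the same idea as building a DataTable and using a single Write Range at the end.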

It might work also with RegEx (https://regex101.com/) to extract specific words from the pdf, or using substring function over the entire file.

I hope it helps to give you some ideas.

Vasile.


@wasea I loved it. It extracts all the required fields, as I specified. Thank you so much for your help. I can't thank you enough.

What do you use for extracting specific text? TREADS, WALLS and CEILING can all be different words in the next file. For example, they could be COUNTERTOP, FLOOR, ROOF. The workflow you have given is very specific to this file.

Would it be possible to extract the data in the same way? The extraction criterion would be " : ": make the word in front of the " : " the column, and put the text under it in the rows below.

The PDFs are ever changing. See the example below. It doesn't have the same headings (text).
pdf7-3.pdf (868 Bytes)

The only criteria here would be BOLD, CAPS, UNDERLINE and " : ". I do not know how to go about it. Let me know if any part of this is unclear.

Hi @Rupendankhara,

As you can see in the solution that I’ve sent, I’ve created a lot of variables in order to get the required text. You can change the variable names, as you want.

Unfortunately, at this moment, I’m not aware of how to extract only the BOLD or underlined words. For CAPS words, REGEX can be used to extract the data.
What I’ve done are only some examples of how to extract data; you just need to enhance them to get the required data.
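
As a rough sketch of the CAPS idea: plain extracted text carries no bold or underline information, so the example below relies only on an all-caps heading followed by ":". It is written in Python to show the logic; the pattern uses common regex syntax and should carry over to a .NET regex in UiPath, but I have not tested it there. The sample text, the `HEADING` pattern and the `split_sections` function are assumptions made for illustration, not taken from the real files.

```python
import re
from typing import Dict, List

# An all-caps heading (letters, digits, spaces, & / -) followed by ":".
# Anything after the colon on the same line is captured as the first value.
HEADING = re.compile(r"^\s*([A-Z][A-Z0-9 &/\-]*?)\s*:\s*(.*)$")

# Made-up sample standing in for the text read from a PDF.
sample_text = """COUNTERTOP:
Honed granite, 30 mm
Eased edge
FLOOR:
Oak planks
ROOF: Standing seam metal
"""


def split_sections(text: str) -> Dict[str, List[str]]:
    """Map each CAPS heading to the non-empty lines that follow it."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
            rest = match.group(2).strip()
            if rest:
                sections[current].append(rest)
        elif current is not None and line.strip():
            # Ordinary line: it belongs to the most recent heading.
            sections[current].append(line.strip())
    return sections


for heading, lines in split_sections(sample_text).items():
    print(heading, "->", lines)
```

Each key can then become a column and each list the rows under it, so the headings do not need to be known in advance. A line in mixed case such as "Note: see drawing" is treated as ordinary text, because the pattern only accepts capitals before the colon.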

By the way, your pdf file “pdf7-3.pdf” appears to be empty.

Vasile.

Hey, thank you for all you have done. Let me send you that file again; see below. The files will have different text and titles. The only uniform things will be the caps and the text after " : ". See if you can find a way to extract using those patterns. Thanks again.
RTA - 007 WINE CELLAR 008 RED WINE STORAGE 009 WHITE WINE STORAGE 010 WINE BUTLER-2.pdf (3.1 MB)

Hi @wasea,

Can you generate a search for this example using regex, with an expression like the one shown below?
[Image: Regex expression for Bold]