Convert colored text in a PDF file to hyperlinks

I have a PDF file in which certain colored text needs to be converted into hyperlinks.
Scenario:
A. Identify the blue text that contains the keyword "section" (followed by a dynamic value), and ignore occurrences of the section keyword that are not blue. The format of the section keyword is not consistent; sometimes it appears inside brackets.
B. In some cases, a module keyword appears in front of the section keyword. We then have to determine which module the section belongs to and link only the text in blue.
C. In some cases, the section keyword itself is missing.

  1. How can this be achieved in UiPath?
  2. How should I go about it?
  3. Which activities can be used?

Expecting an early response.

@loginerror @Vibhor.Shrivastava @Palaniyappan @Lahiru.Fernando @mukeshkala @RAKESH_KUMAR_BEHERA

Read the PDF and get the text.
You can use regular expressions (regex) to identify URLs in that text.

Here’s a simple regex pattern that can help you match URLs:

import re

text = "Here is a sample text with a URL: https://www.example.com and another one http://google.com"

url_pattern = r"https?://\S+|www\.\S+"
urls = re.findall(url_pattern, text)

This regex pattern will match URLs that start with “http://” or “https://” and those that start with “www.”

You can modify the pattern to suit your specific requirements, but this should work for most common cases.
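
As one example of such a modification, the sketch below looks for section references instead of URLs. The reference format is an assumption, since the question does not show real samples: an optional "Module <id>" in front of "Section <number>", optionally wrapped in round or square brackets. The names `SECTION_REF` and `find_section_refs` are made up for illustration.

```python
import re

# Hypothetical format: optional "Module <id>" before "Section <number>",
# optionally wrapped in round or square brackets. Adjust to the real PDF.
SECTION_REF = re.compile(
    r"""
    [\[(]?\s*                                   # optional opening bracket
    (?:module\s+(?P<module>[\w.-]+)\s*,?\s*)?   # optional module part
    section\s+(?P<section>\d+(?:\.\d+)*)        # section keyword and number
    \s*[\])]?                                   # optional closing bracket
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_section_refs(text):
    """Return (module, section) pairs; module is None when it is absent."""
    return [(m.group("module"), m.group("section"))
            for m in SECTION_REF.finditer(text)]
```

Running this only on text that has already been identified as blue would cover scenarios A and B. Scenario C (no section keyword at all) cannot be handled by a keyword pattern and would need a different rule.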

Solution of the Day:

import fitz  # PyMuPDF

def extract_blue_text(pdf_path):
    """Return the text of every span whose color is pure blue."""
    doc = fitz.open(pdf_path)
    blue_text = []

    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]

        for b in blocks:
            # Image blocks have no "lines" key, so fall back to an empty list
            for l in b.get("lines", []):
                for s in l["spans"]:
                    color = s["color"]
                    # Span colors are sRGB integers; 255 (0x0000FF) is pure blue
                    if color == 255:
                        blue_text.append(s["text"])

    doc.close()
    return blue_text

pdf_path = "color.pdf"
blue_text = extract_blue_text(pdf_path)

for text in blue_text:
    print(text)

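from openpyxl import Workbook

def main():
    pdf_path = "PDF input file path"
    blue_text = extract_blue_text(pdf_path)

    # Write the extracted text to an Excel file (the Output folder must exist)
    excel_output_path = "Output/bluetext.xlsx"
    wb = Workbook()
    ws = wb.active
    ws.append(["Extracted Blue Text"])  # Header
    for text in blue_text:
        ws.append([text])
    wb.save(excel_output_path)

if __name__ == "__main__":
    main()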
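
The check `color == 255` only matches pure blue, and many PDFs use a different shade for link-style text. A possible refinement, sketched below, treats a span as blue when its blue channel dominates, and also returns the span's bounding box, which is what you would need to place a link over the text (for example with PyMuPDF's `Page.insert_link`, not shown here). The thresholds and the names `is_bluish` and `blue_spans` are assumptions for illustration. The function expects the dictionary returned by `page.get_text("dict")`.

```python
def is_bluish(color, min_blue=150, max_other=100):
    """True when an sRGB integer (0xRRGGBB) is dominated by its blue channel.

    The thresholds are illustrative guesses, not values taken from the PDF.
    """
    r, g, b = (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF
    return b >= min_blue and r <= max_other and g <= max_other


def blue_spans(page_dict):
    """Yield (text, bbox) for each bluish span of a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if is_bluish(span["color"]):
                    yield span["text"], tuple(span["bbox"])
```

Printing a few `s["color"]` values from your own PDF first is the easiest way to choose thresholds that fit the document.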