I have a PDF file in which the colored (blue) text needs to be converted into hyperlinks.
Scenario:
A. I must identify the blue text that contains the keyword "section" (the section reference itself is dynamic) and ignore any "section" keyword that is not blue. The format of the section keyword is not consistent; sometimes it appears inside brackets.
B. In some cases, a "module" keyword appears in front of the section keyword. We then have to work out which module the section belongs to, and link only the text in blue.
C. In some cases, the section keyword itself is missing.
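To make the three cases concrete, here is a minimal sketch of how a piece of blue text might be parsed once it has been extracted. The name `parse_reference`, the `SectionRef` type, and all three regular expressions are hypothetical; they assume references look like "Section 4.2", "(Section 4.2)", "Module 3 Section 1.1", or a bare "5.3.1", which may not match the real wording in the PDF.

```python
import re
from typing import NamedTuple, Optional


class SectionRef(NamedTuple):
    module: Optional[str]   # module identifier, if one precedes the section
    section: Optional[str]  # section number such as "4.2", or None if not found


# Hypothetical patterns: adjust them to the wording actually used in the PDF.
MODULE_RE = re.compile(r"\bmodule\s+([A-Za-z0-9.]+)", re.IGNORECASE)
SECTION_RE = re.compile(r"\bsection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)
BARE_NUMBER_RE = re.compile(r"\d+(?:\.\d+)*")


def parse_reference(text: str) -> SectionRef:
    """Parse one blue text span into an optional module and section number."""
    # Scenario A: drop surrounding brackets such as "(Section 4.2)"
    cleaned = text.strip().strip("()[]{}").strip()

    # Scenario B: an optional "module" keyword in front of the section
    module_match = MODULE_RE.search(cleaned)
    module = module_match.group(1) if module_match else None

    section_match = SECTION_RE.search(cleaned)
    if section_match:
        return SectionRef(module, section_match.group(1))

    # Scenario C: no "section" keyword, so accept text that is only a number
    bare = BARE_NUMBER_RE.fullmatch(cleaned)
    return SectionRef(module, bare.group(0) if bare else None)
```

For example, `parse_reference("Module 3 Section 1.1")` yields a module of "3" and a section of "1.1", while text with no recognizable reference yields `None` for both fields and can be skipped.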
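import fitz  # PyMuPDF

# Function name and pdf_path parameter are assumed here
def extract_blue_text(pdf_path):
    doc = fitz.open(pdf_path)
    blue_text = []
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:
            # Image blocks have no "lines" key, so default to an empty list
            for l in b.get("lines", []):
                for s in l["spans"]:
                    color = s["color"]  # packed sRGB integer (0xRRGGBB)
                    # Assuming pure blue (0x0000FF == 255) is used
                    if color == 255:
                        blue_text.append(s["text"])
    doc.close()
    return blue_text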
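The exact comparison `color == 255` only matches pure blue. If the PDF uses a slightly different shade (for instance the default hyperlink blue produced by word processors), a tolerance check on the packed sRGB value may be more robust. The helper names `int_to_rgb` and `is_blue` and the threshold values below are assumptions chosen for illustration, not tuned values.

```python
def int_to_rgb(color: int) -> tuple:
    """Split a packed sRGB integer (0xRRGGBB) into an (r, g, b) tuple."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF


def is_blue(color: int, min_blue: int = 150, margin: int = 50) -> bool:
    """Return True when the blue channel is strong and clearly dominant.

    min_blue and margin are arbitrary illustrative thresholds.
    """
    r, g, b = int_to_rgb(color)
    return b >= min_blue and b - max(r, g) >= margin
```

With this in place, the check `if color == 255:` could be replaced by `if is_blue(color):`.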
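from openpyxl import Workbook

# Write the extracted blue text to an Excel sheet, one span per row.
# The "Output" directory must already exist.
pdf_output_path = "Output/bluetext.xlsx"
wb = Workbook()
ws = wb.active
ws.append(["Extracted Blue Text"])  # Header
for text in blue_text:
    ws.append([text])
wb.save(pdf_output_path)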