Extract Hyperlink from PDF

Hi,
I need help with the PDF activities. I want to extract the hyperlinks attached to some text and images. I have tried multiple times, but the PDF activities only extract the visible text. Please help if anyone has faced the same issue.

I have seen someone on the forum use Copy File (PDF to text) and then search for the hyperlinks, but it didn’t work for me.

Thanks

  1. Extract the text: Use UiPath’s OCR capabilities (e.g., the “Read PDF with OCR” activity) to extract text from the PDF. Select the OCR engine that gives the best results for your specific PDF, as OCR accuracy can vary.
  2. Find hyperlinks in the extracted text: After extracting the text, use regular expressions or string manipulation to search for patterns that represent hyperlinks.

Hi @naveed.zafariqbal,

I assume you need to extract only the hyperlinks from the PDF. If so, you can convert the PDF data into JSON and traverse each node that needs to be extracted from the PDF.
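
To illustrate the traversal idea, here is a minimal Python sketch. It assumes the PDF has already been converted to JSON by some tool available to you; the structure of `sample` (the `pages`, `text`, `links` and `uri` keys) is hypothetical and only there for the demo. The function walks every node and collects anything in a string value that looks like a URL.

```python
import json
import re

# Same idea as the regex approach: anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def collect_urls(node):
    """Recursively walk dicts and lists, returning every URL found in string values."""
    urls = []
    if isinstance(node, dict):
        for value in node.values():
            urls.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(collect_urls(item))
    elif isinstance(node, str):
        urls.extend(URL_PATTERN.findall(node))
    return urls


# Hypothetical JSON layout; a real converter will produce its own structure.
sample = json.loads("""
{
  "pages": [
    {"text": "Invoice 42", "links": [{"uri": "https://example.com/pay"}]},
    {"text": "See http://example.org/terms for details"}
  ]
}
""")

print(collect_urls(sample))
```

Because the walk does not depend on key names, it should keep working if the converter nests the links differently.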

Happy Learning !!

Try using Read PDF With OCR with an engine like UiPath Document OCR or OmniPage OCR and try to get the hidden URLs as well.

Once you get that, you can use Regex to get just the link part.
Refer to this to extract the link with Regex

Hope this helps

Cheers @naveed.zafariqbal

Thank you everyone for the prompt replies. Actually, due to client limitations I can’t use any website to convert the PDF to another format like JSON, as proposed by one of the members. Otherwise, I will try the OCR engines. Hopefully this helps.

Hi @naveed.zafariqbal ,

Is it possible for you to provide us with a sample PDF file so that we can analyse it and try to extract what is needed from our end?

Hello @naveed.zafariqbal

Read Text File
Output: plainText

Matches
Input: plainText
Pattern: http[s]?://\S+
Result: hyperlinkMatches

For Each
Type Argument: System.Text.RegularExpressions.Match
Values: hyperlinkMatches

Assign
To: hyperlink
Value: item.Value

Log Message
Message: hyperlink
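
For anyone who wants to test the pattern outside Studio, the workflow above can be sketched in Python roughly like this. The sample text is made up, and the trailing-punctuation cleanup is my own addition: `\S+` also swallows a full stop or comma that directly follows a link, so the sketch strips those.

```python
import re

# Equivalent to the Matches pattern http[s]?://\S+
HYPERLINK_PATTERN = re.compile(r"https?://\S+")


def find_hyperlinks(plain_text):
    """Return every http/https link in the text, without trailing punctuation."""
    return [m.group(0).rstrip(".,;:)") for m in HYPERLINK_PATTERN.finditer(plain_text)]


# Made-up text standing in for the output of Read Text File.
plain_text = "Pay at https://example.com/invoice/123 or read http://example.org/help."

# Same role as the For Each + Log Message steps.
for hyperlink in find_hyperlinks(plain_text):
    print(hyperlink)
```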

Thanks & Cheers!!!

Sorry mate, due to security concerns I can’t share any PDF. The PDF is an invoice.

Hey, Kartheek. Thank you for the help.

Actually, this is the second part of the process, in which we use regex to extract hyperlinks from the text. When reading the PDF file, the activity only extracts the visible text, not the hyperlinks attached to the images in the PDF.
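
One possible reason for this: a clickable link on an image is normally stored in the PDF as a link annotation with a `/URI` entry, not as page text, so text extraction never sees it. Below is a rough Python sketch that runs locally (no website needed) and scans the raw file bytes for those entries. It is not a real PDF parser: it assumes the file is not encrypted, that the URIs are written as plain `(...)` strings, and that any compressed streams use Flate. The sample bytes are hand-made for the demo, not a real invoice.

```python
import re
import zlib

# A URI action looks like: /A << /S /URI /URI (https://example.com) >>
URI_ENTRY = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def extract_link_uris(pdf_bytes):
    """Return the unique /URI targets found in raw or Flate-compressed PDF data."""
    chunks = [pdf_bytes]
    for match in STREAM.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream"
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (e.g. an image stream), skip it
    uris = []
    for chunk in chunks:
        for match in URI_ENTRY.finditer(chunk):
            # Drop the backslash from escapes such as \( and \);
            # octal and control escapes are not handled here.
            raw = re.sub(rb"\\(.)", rb"\1", match.group(1), flags=re.DOTALL)
            uri = raw.decode("latin-1")
            if uri not in uris:
                uris.append(uri)
    return uris


# Hand-made bytes imitating one plain and one compressed annotation.
sample_pdf = b"".join([
    b"%PDF-1.7\n1 0 obj\n",
    b"<< /Type /Annot /Subtype /Link /A << /S /URI /URI (https://example.com/pay) >> >>",
    b"\nendobj\n2 0 obj\n<< /Filter /FlateDecode >>\nstream\n",
    zlib.compress(b"<< /A << /S /URI /URI (http://example.org/terms) >> >>"),
    b"\nendstream\nendobj\n",
])

print(extract_link_uris(sample_pdf))
```

For a real file you would pass in `open(path, "rb").read()`, where `path` is your invoice. If this returns nothing on your PDFs, the links may be stored in a form this sketch does not cover, and a proper PDF library would be the safer route.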