Extract Hyperlink from PDF

naveed.zafariqbal · October 23, 2023, 12:30pm

Hi,
I need your help with the PDF activities. I want to extract the hyperlinks attached to some text and images. I have tried multiple times, but the PDF activities only extract the visible text. Please help if you have faced the same issue.

I have seen someone on the forum use Copy File (PDF to text) and then search for hyperlinks, but it didn’t work for me.

Thanks

Gayathri_Mk · October 23, 2023, 12:40pm

First, use UiPath’s OCR capabilities (e.g., the “Read PDF with OCR” activity) to extract text from the PDF. Make sure to select an OCR engine that provides the best results for your specific PDF, as OCR accuracy can vary.
Find Hyperlinks in Extracted Text: After extracting the text, you can use regular expressions or string manipulation to search for patterns that represent hyperlinks.

Jayavignesh_G · October 23, 2023, 12:43pm

Hi @naveed.zafariqbal,

I assume you need to extract only the hyperlinks from the PDF. If so, you can convert the PDF data into JSON and traverse each node that needs to be extracted from the PDF.
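
As a rough illustration of the traversal idea, here is a minimal Python sketch. It assumes the PDF has already been converted to JSON by some tool; the JSON layout and the `collect_urls` helper below are hypothetical and only stand in for whatever the converter actually produces.

```python
import json
import re

URL_PATTERN = re.compile(r"https?://\S+")

def collect_urls(node):
    """Recursively walk a parsed JSON value and return URLs found in its string values."""
    urls = []
    if isinstance(node, dict):
        for value in node.values():
            urls.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(collect_urls(item))
    elif isinstance(node, str):
        urls.extend(URL_PATTERN.findall(node))
    return urls

# Hypothetical JSON, not the output of any specific converter.
sample = json.loads(
    '{"pages": [{"text": "Invoice", "links": [{"uri": "https://example.com/pay"}]}]}'
)
print(collect_urls(sample))
```

Because the walk visits every node, it does not depend on knowing in advance where the converter places the link entries.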

Happy Learning !!

Palaniyappan · October 23, 2023, 1:14pm

Try the Read PDF With OCR activity with an OCR engine such as UiPath Document OCR or OmniPage OCR, and see whether it picks up the hidden URLs as well.

Once you have that, you can use Regex to get just the link part.
Refer to this to extract the link with Regex

Hope this helps

Cheers @naveed.zafariqbal

naveed.zafariqbal · October 24, 2023, 6:30am

Thank you everyone for the prompt replies. Actually, I can’t use any website to convert to another format like JSON, as proposed by one of the members, due to client limitations. Otherwise, I will try the OCR engines. Hopefully this helps.

supermanPunch · October 24, 2023, 6:40am

Hi @naveed.zafariqbal ,

Is it possible for you to provide us with a sample PDF file so that we can analyse it and try to extract what is needed from our end?

Kartheek_Battu · October 24, 2023, 6:42am

Hello @naveed.zafariqbal

Read Text File
Output: plainText

Matches
Input: plainText
Pattern: http[s]?://\S+
Result: hyperlinkMatches

For Each
Type Argument: System.Text.RegularExpressions.Match
Values: hyperlinkMatches

Assign
To: hyperlink
Value: item.Value

Log Message
Message: hyperlink
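
For anyone who wants to check the pattern outside Studio, the same steps can be sketched in Python. This is only an illustration of the regex logic, not a UiPath activity; `find_hyperlinks` and the sample text are made up for the example.

```python
import re

def find_hyperlinks(plain_text):
    """Return every substring that starts with http:// or https:// and runs to the next whitespace."""
    return re.findall(r"https?://\S+", plain_text)

# Made-up text standing in for the output of Read Text File.
sample_text = "Pay at https://example.com/invoice/42 or read http://example.org/help first"
for hyperlink in find_hyperlinks(sample_text):
    print(hyperlink)
```

Note that `\S+` keeps matching until it hits whitespace, so punctuation directly after a URL (a trailing full stop or bracket) ends up in the match and may need trimming.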

Thanks & Cheers!!!

naveed.zafariqbal · October 24, 2023, 11:02am

Sorry mate, due to security concerns I can’t share any PDF. The PDF is an invoice.

naveed.zafariqbal · October 24, 2023, 11:04am

Hey Kartheek, thank you for the help.

Actually, this is the second part of the process, in which we use regex to extract hyperlinks from text. When reading the PDF file, it only extracts the visible text, not the hyperlinks attached to the images in the PDF.
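
For context, a clickable link in a PDF is normally stored as a link annotation whose action dictionary carries the target in a `/URI (...)` entry, which is why it never shows up in the extracted page text. Below is a minimal Python sketch of scanning the raw file bytes for those entries. It is an assumption-heavy illustration: it only works when the annotation dictionaries are stored uncompressed (many PDFs pack them into compressed object streams, where a proper PDF library would be needed), it ignores hex-encoded strings and unescaped nested parentheses, and `extract_link_targets` is a hypothetical helper name.

```python
import re

# Matches "/URI (...)" and captures the literal string, allowing backslash escapes inside it.
URI_ENTRY = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")

def extract_link_targets(pdf_bytes):
    """Return the URI strings found in uncompressed link annotations of raw PDF bytes."""
    targets = []
    for match in URI_ENTRY.finditer(pdf_bytes):
        # Drop the backslash from simple escapes such as \( and \).
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
        targets.append(raw.decode("latin-1"))
    return targets

# Hypothetical fragment of a link annotation, not a complete PDF file.
sample = (
    b"<< /Type /Annot /Subtype /Link /Rect [72 700 200 720] "
    b"/A << /S /URI /URI (https://example.com/invoice\\(42\\)) >> >>"
)
print(extract_link_targets(sample))
```

On a real file the bytes would come from something like `open(path, "rb").read()`. If the result is empty even though the PDF has clickable links, the annotations are most likely compressed and this raw scan will not see them.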
