Extracting a URL from a PDF

I'm going to receive a lot of emails, all virtually the same. I have them saved as .eml and converted to PDF, and I need to get the URL from them. Regex, I would imagine, but that's a weak one for me.

https://www.renew-patent-trademark-registered-design.service.gov.uk/renewal-documents/start?orderId=c7f0fa54f0f54a14ba80bde6a13f87ca

Is all I need

Dear Robin Morgan
Your reference: CPAG6.
This is confirmation that GB2560592 was renewed on 15 June 2021.
You can download your payment receipt and renewal certificate via the following link:
https://www.renew-patent-trademark-registered-design.service.gov.uk/renewal-documents/start?orderId=c7f0fa54f0f54a14ba80bde6a13f87ca https://urldefense.proofpoint.com/v2/url?u=https-3A__www.renew-2Dpatent-2Dtrademark-2Dregistered-2Ddesign.service.gov.uk_renewal-2Ddocuments_start-3ForderId-3Dc7f0fa54f0f54a14ba80bde6a13f87ca&d=DwMFaQ&c=OGmtg_3SI10Cogwk-ShFiw&r=Y_mXAI-6lrvoJdaAmCy6bxSGGurXxk8-lDZkYv8lp7I&m=WL6o45xCmgNfx8QGhs5gq2m-tN4xSDRbeDkjgIYyuHQ&s=6ul_XtzPew4Q1D9esYy3RbgOzYbhbSVM6ZbK4i0l5Ag&e=
It can take up to two working days for confirmation of this renewal to appear on the official register.

This is what the actual PDF looks like

@Jersey_Practical_Sho

There is no need to save the emails in pdf format. You can read the body of the emails in a string and then use regex to extract the URL.
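As a rough sketch of reading the body into a string, the snippet below uses Python's standard `email` package. The helper name `eml_body_text` and the sample message are made up for illustration, and it assumes the message has a plain-text or HTML body part.

```python
from email import message_from_string, policy


def eml_body_text(raw_eml: str) -> str:
    """Return the text of the main body part of a raw .eml message.

    Hypothetical helper: prefers the plain-text part and falls back to HTML.
    """
    msg = message_from_string(raw_eml, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content() if part is not None else ""


# Made-up sample message, only to show the call.
SAMPLE_EML = (
    "From: sender@example.com\n"
    "To: receiver@example.com\n"
    "Subject: Renewal confirmation\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "You can download your receipt via https://example.com/start?orderId=abc\n"
)

body = eml_body_text(SAMPLE_EML)
```

For files on disk, `email.message_from_binary_file` with the same `policy` argument would be the equivalent entry point.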

Is there any pattern in the URL?

Like, it will always begin with

https://www.renew-patent-trademark-registered-design.service.gov.uk/

To begin with you can use this pattern.

image
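The pattern from the screenshot is not preserved in the thread, so the following is only a guess at what a prefix-anchored regex could look like. It assumes the URL always starts with the prefix quoted above and that `orderId` is a run of lowercase hex characters, as in the one example shown; `RENEWAL_URL` and `find_renewal_urls` are illustrative names.

```python
import re

# Assumption: the path and query follow the single example in the question,
# and orderId is lowercase hexadecimal.
RENEWAL_URL = re.compile(
    r"https://www\.renew-patent-trademark-registered-design\.service\.gov\.uk/"
    r"renewal-documents/start\?orderId=[0-9a-f]+"
)


def find_renewal_urls(text: str) -> list[str]:
    """Return the distinct renewal URLs in text, in order of first appearance."""
    return list(dict.fromkeys(RENEWAL_URL.findall(text)))


# The link line from the email, with the rewritten Proofpoint link shortened.
line = (
    "https://www.renew-patent-trademark-registered-design.service.gov.uk/"
    "renewal-documents/start?orderId=c7f0fa54f0f54a14ba80bde6a13f87ca "
    "https://urldefense.proofpoint.com/v2/url?u=https-3A__www.renew-2Dpatent"
    "-2Dtrademark-2Dregistered-2Ddesign.service.gov.uk_renewal-2Ddocuments"
    "_start-3ForderId-3Dc7f0fa54f0f54a14ba80bde6a13f87ca&d=DwMFaQ"
)

urls = find_renewal_urls(line)
```

Because the literal prefix begins with `https://www.`, the rewritten Proofpoint link (which encodes it as `https-3A__www.`) is not matched.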

Most Basic Regex to match URLs - https://.*\s
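One thing to watch with that pattern: `.*` is greedy, so when two links share a line it runs on to the last whitespace character. A small comparison in Python, using a made-up line, with `https://\S+` as one possible tighter alternative:

```python
import re

# Made-up line with two links separated by a space, as in the email body.
line = "link: https://a.example/x?id=1 https://b.example/y?id=2 end"

# Greedy: one match running from the first "https://" to the last whitespace.
greedy = re.findall(r"https://.*\s", line)

# Non-whitespace run: each link is matched separately, without trailing space.
tight = re.findall(r"https://\S+", line)
```

Here `greedy` holds a single string covering both links plus the trailing space, while `tight` holds the two links as separate items.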

The issue is that what I showed is not everything that appears in the text; other URLs appear as well.

You can try clicking on every line of your email text. If it is a hyperlink, then it will open in a browser and you can capture it easily.

Hi @Jersey_Practical_Sho
It seems like the email body should be in HTML format, right?

Unfortunately there's going to be around 5000 of these, and then I'm going to have to go to the URL and download the docs.
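For a volume like that, a batch pass over a folder of saved .eml files might look roughly like this. It is a sketch under assumptions: the folder layout, the `collect_urls` name, and the loose `\S+` tail are all illustrative, and the download step is left out because the thread does not describe how the site serves the documents.

```python
import re
from email import message_from_bytes, policy
from pathlib import Path

# Assumption: every wanted link starts with this host; \S+ takes the rest
# up to the next whitespace (HTML bodies may need a stricter tail).
URL = re.compile(
    r"https://www\.renew-patent-trademark-registered-design\.service\.gov\.uk/\S+"
)


def collect_urls(folder):
    """Map each .eml file name in folder to its first matching URL, or None."""
    found = {}
    for path in sorted(Path(folder).glob("*.eml")):
        msg = message_from_bytes(path.read_bytes(), policy=policy.default)
        part = msg.get_body(preferencelist=("plain", "html"))
        text = part.get_content() if part is not None else ""
        match = URL.search(text)
        found[path.name] = match.group(0) if match else None
    return found
```

Files that map to `None` can then be checked by hand rather than silently skipped.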

With some extra regex I have now achieved it :slight_smile: Thank you.

Share your Regex

We're saving them as PDF for other reasons as well :wink:
