Pick Data From PDF

Hi all,

How can I extract the line that contains "URL:" from a PDF?

“An updated List is accessible on the Committee’s website at the following
URL: www.un.org/securitycouncil/sanctions/materials.

Thanks and Regards,
Supriya Galentic

Hi @supu123
Try this regex pattern


[\w\n\S\s]+URL:[\s\w\d\S]+

Regards,
Nived N
Happy Automation

Hi @NIVED_NAMBIAR

It is fetching the entire PDF.

I need to extract only the string that contains the URL, not the text above and below it.

E.g.:
- the entire item 2. should be fetched
- the entire item 3. should be fetched

I hope you understand my query.

Thanks and Regards,
Supriya Yenaganti

Hi @supu123

So you need to extract the URL part?
Is that right?

Hi @supu123

Have a look at this regex pattern

(?<=URL:).*
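
As a rough illustration of how this lookbehind behaves, here is a minimal Python sketch. Python is used only for demonstration; the sample text is taken from the question plus one made-up extra line, and the variable names are hypothetical. Because `.` does not match a newline by default, the match stops at the end of the line that contains `URL:`.

```python
import re

# Sample text based on the question; in practice this would be the text read from the PDF.
text = (
    "An updated List is accessible on the Committee's website at the following\n"
    "URL: www.un.org/securitycouncil/sanctions/materials.\n"
    "Some other line that should not be captured.\n"
)

# (?<=URL:) asserts that "URL:" sits immediately before the match;
# .* then captures the remainder of that single line.
match = re.search(r"(?<=URL:).*", text)
url_part = match.group().strip() if match else None
print(url_part)
```

Note that this only helps when the URL is on the same line as `URL:`; the trailing full stop of the sentence is also captured and may need trimming.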

Hi @NIVED_NAMBIAR

I need to extract the URL whose statement contains the keyword "updated" or "latest".

Thanks and Regards,
Supriya Galentic

Hi @supu123
Did the above regex pattern work for you?

Fetch url which has updated

Hi @NIVED_NAMBIAR

The expression extracts the entire PDF.

Thanks and Regards,
Supriya Galentic

Hi @supu123
Do you need to extract those URLs that have the word "updated" near them?

Hi @NIVED_NAMBIAR
How can I extract those URLs that have the word "updated" near them?

If you are dealing with documents (scanned or native PDFs), the best approach would be to look into Document Understanding. This way, you can:

  • digitize the documents using whatever OCR engine is available to you
  • extract the data using a Regex Based Extractor and a pattern that fits your needs
  • review the data using Validation Station, in case review or corrections are needed
  • use the data in a simple way afterwards.

Hi @NIVED_NAMBIAR
(?<=URL :).*
cannot fetch the URLs shown in the image.

Thanks and Regards,
Supriya

@supu123 - that is because these 2 URLs are not on the same line as the word URL…

Could you please give the first few letters of the URLs? I saw that you masked them. For your information, URLs on the forms are not PII, so I don't think you have to mask them…

If they always start with www., then please try www\.\S+

This will fetch all the URLs starting with www
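
A small Python sketch of this idea, for illustration only. The `www.example.org` addresses are placeholders, since the real URLs are masked in the question, and the variable names are hypothetical. The dot is escaped as `\.` so that it matches a literal dot rather than any character.

```python
import re

# Placeholder text: the real URLs are masked in the question,
# so www.example.org addresses stand in for them here.
text = (
    "The latest versions of the Sanctions lists are accessible at the following URL:\n"
    "a) List of individuals and entities:\n"
    "www.example.org/list-a\n"
    "b) List issued by the Committee:\n"
    "www.example.org/list-b\n"
)

# www\. matches the literal prefix; \S+ runs until the next whitespace,
# so it does not matter which line the URL is on.
urls = re.findall(r"www\.\S+", text)
print(urls)
```

This returns every such URL in the text, so on its own it cannot restrict the result to the statement that contains a particular keyword.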

Hi @prasath17

I want to extract the entire statement with the URL in it.

Number 3. has the "latest" keyword, so I want to extract only those 2 URLs.

Thanks and Regards,
Supriya

@supu123 - I am not clear on your requirement. If you are looking to extract the URLs, the pattern below should work…

Hi @prasath17

I also want the part above the URLs, starting from 3. till the end of 3.

The latest versions of the Sanctions lists are accessible on the UN Security
Council’s website at the following URL:a) List of individuals and entities issued by the UNSC ISIL (Da’esh) and Al-Qaida
Sanctions Committee:

b) List issued by the UNSC Committee established pursuant to
resolution 1988 (2011) of individuals and entities linked to Taliban

Thanks and Regards,
Supriya

@supu123 - I am not positive that this can be achieved using Regex…

You may have to try a different approach.

Hi @prasath17

For @supu123's case, we had to implement two regex patterns: one to find the data (lines) containing the word "updated", and then another regex to extract the URLs from those lines.

We had to do it that way to achieve this.

Sorry for the late response @supu123
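
A minimal sketch of that two-step idea in Python, for illustration only; this is not the actual workflow that was built. It assumes the statements are numbered like "2." and "3." at the start of a line and that the URLs begin with www. The sample text, the `www.example.org` placeholders, and all names are hypothetical.

```python
import re

# Hypothetical sample text; www.example.org stands in for the masked URLs.
sample_text = """1. This notice concerns the sanctions regime.
2. An updated List is accessible on the Committee's website at the following
URL: www.un.org/securitycouncil/sanctions/materials.
3. The latest versions of the Sanctions lists are accessible at the following URL:
a) List of individuals and entities:
www.example.org/list-a
b) List linked to Taliban:
www.example.org/list-b
4. For questions, see www.example.org/contact
"""

# Step 1 pattern: one numbered statement, from "N. " at a line start
# up to the next numbered statement or the end of the text.
ITEM_PATTERN = re.compile(r"^\d+\.\s.*?(?=^\d+\.\s|\Z)", re.MULTILINE | re.DOTALL)
KEYWORD_PATTERN = re.compile(r"\b(?:updated|latest)\b", re.IGNORECASE)
# Step 2 pattern: anything starting with "www." up to the next whitespace.
URL_PATTERN = re.compile(r"www\.\S+")


def extract_keyword_items(text):
    """Return (statement, urls) pairs for statements containing a keyword."""
    results = []
    for item in ITEM_PATTERN.findall(text):
        if KEYWORD_PATTERN.search(item):
            # Trim sentence punctuation that ends up glued to the URL.
            urls = [u.rstrip(".,;") for u in URL_PATTERN.findall(item)]
            results.append((item.strip(), urls))
    return results


results = extract_keyword_items(sample_text)
for statement, urls in results:
    print(statement)
    print(urls)
```

Here statements 2 and 3 are kept with their URLs, while statements 1 and 4 are skipped because they contain neither keyword. The patterns use only common regex syntax, but they would still need testing against the real PDF text.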

Regards

Nived N

Happy Automation

Hi @NIVED_NAMBIAR

Can I get an example?

Thanks and Regards,
Supriya