Take a url from a string

I know nothing about regex. How would I remove all of the text in this string and be left with just the URL in Python?

“Go to this link: https://www.google.com”

Sorry for the stupid question; I’ve looked around and have not been able to find anything.

I went here: https://regexr.com/, searched, and found the following:
/[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi

If you test your string, you will see it works. However, trusting nobody very much, I went to verify it here: https://regex101.com/, and it told me that the forward slashes needed escaping with backslashes, so I got the following:

/[-a-zA-Z0-9@:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)?/gi

I’d go with the second one if it were me.
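
Since the question is about Python: the patterns above are written as JavaScript regex literals (the `/.../gi` wrapper), so in Python you would drop the delimiters and pass the ignore-case flag to `re` instead. Here is a minimal sketch, assuming the pattern is the common regexr one with `\.` before the TLD and `_` in the character class. The names `URL_PATTERN` and `extract_urls` are made up for this example.

```python
import re

# Body of the pattern without the JavaScript /.../gi delimiters.
# Assumption: "_" belongs in the character class and the dot before the TLD is escaped.
# The optional path group is non-capturing so that group(0) is the whole URL.
URL_PATTERN = re.compile(
    r"[-a-zA-Z0-9@:%_+.~#?&/=]{2,256}\.[a-z]{2,4}\b"
    r"(?:/[-a-zA-Z0-9@:%_+.~#?&/=]*)?",
    re.IGNORECASE,  # Python equivalent of the "i" flag
)


def extract_urls(text):
    """Return every substring of text that the pattern matches."""
    # finditer plays the role of the "g" flag: it walks over all matches
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


text = "Go to this link: https://www.google.com"
print(extract_urls(text))  # ['https://www.google.com']
```

Keep in mind that this pattern is quite loose: it does not require a scheme, so it will also pick up things like file names (`report.txt`). For a one-off string like the one in the question that is usually fine.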

URLs can be quite complicated to write a good regex for, so I normally use Diego Perini's URL validation regex as a base (it is not mine, but it is released under the MIT license; full credit is at the end of this post).

Adapted for your purpose, it would look something like this:

import re

regexURL = re.compile(
  '(?:^|\\s)(' +
    # protocol identifier (optional)
    # short syntax // still required
    '(?:(?:(?:https?|ftp):)?//)' +
    # user:pass BasicAuth (optional)
    '(?:\\S+(?::\\S*)?@)?' +
    '(?:' +
      # IP address exclusion
      # private & local networks
      '(?!(?:10|127)(?:\\.\\d{1,3}){3})' +
      '(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})' +
      '(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})' +
      # IP address dotted notation octets
      # excludes loopback network 0.0.0.0
      # excludes reserved space >= 224.0.0.0
      # excludes network & broadcast addresses
      # (first & last IP address of each class)
      '(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])' +
      '(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}' +
      '(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))' +
    '|' +
      # host & domain names, may end with dot
      # can be replaced by a shortest alternative
      # (?![-_])(?:[-\\w\\u00a1-\\uffff]{0,63}[^-_]\\.)+
      '(?:' +
        '(?:' +
          '[a-z0-9\\u00a1-\\uffff]' +
          '[a-z0-9\\u00a1-\\uffff_-]{0,62}' +
        ')?' +
        '[a-z0-9\\u00a1-\\uffff]\\.' +
      ')+' +
      # TLD identifier name, may end with dot
      '(?:[a-z\\u00a1-\\uffff]{2,}\\.?)' +
    ')' +
    # port number (optional)
    '(?::\\d{2,5})?' +
    # resource path (optional)
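    '(?:[/?#]\\S*)?' +
  # lookahead instead of a consuming group, so the whitespace between
  # two adjacent URLs is still available to the next match
  ')(?=\\s|$)',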
  re.IGNORECASE
)

text = 'Go to this link: https://www.google.com'

# gets the first URL only
urlMatch = regexURL.search(text)
if urlMatch:
  # group(1) is the URL itself; group() would also include
  # the whitespace matched around it
  url = urlMatch.group(1)
  print(url)

# gets all URLs
urlList = regexURL.findall(text)
for url in urlList:
  print(url)
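
One detail worth knowing about `search` versus `findall` with a pattern shaped like this: `group()` with no argument returns the entire match, including any whitespace consumed by the leading `(?:^|\s)`, whereas `group(1)` returns only the captured URL. Because the pattern has exactly one capturing group, `findall` returns that group's text for each match. The sketch below uses a deliberately simplified stand-in pattern (`simple` is hypothetical, not the full regex above) just to show the difference:

```python
import re

# Simplified stand-in: same (?:^|\s)( ... ) wrapper, trivial URL body
simple = re.compile(r"(?:^|\s)(https?://\S+)")

text = "Go to this link: https://www.google.com"
match = simple.search(text)

whole = match.group()         # entire match, including the leading space
url = match.group(1)          # only the captured URL
found = simple.findall(text)  # list of group 1 for every match

print(repr(whole))  # ' https://www.google.com'
print(repr(url))    # 'https://www.google.com'
print(found)        # ['https://www.google.com']
```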

Credit for regex:

Regular Expression for URL validation

Author: Diego Perini
Created: 2010/12/05
Updated: 2018/09/12
License: MIT

Copyright (c) 2010-2018 Diego Perini (http://www.iport.it)