Get Subdomain From Url Using Python
Solution 1:
The tldextract package makes this task very easy, and you can then use urlparse, as suggested, if you need any further information:
>>> import tldextract
>>> import urlparse
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles multi-level subdomains: 'sub.lol1' above comes back as a single subdomain string.
Solution 2:
urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by '.' to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]  # first label of the host name only
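For anyone on Python 3, where urlparse lives in urllib.parse, the same idea can be sketched roughly as below. The helper name first_label is made up for illustration, and the None check is an addition that is not in the answer above. It also shows that "first label" is not always a real subdomain.

```python
from urllib.parse import urlparse


def first_label(address):
    """Return the first dot-separated label of the URL's host name, or None."""
    # hostname is None when the URL has no '//' authority part (e.g. no scheme)
    hostname = urlparse(address).hostname
    if not hostname:
        return None
    return hostname.split('.')[0]


print(first_label("http://lol1.domain.com:8888/some/page"))  # lol1
print(first_label("http://domain.com/some/page"))            # domain (not a real subdomain)
print(first_label("lol1.domain.com/some/page"))              # None (no scheme, so no hostname)
```

So this only works if you already know the host has exactly one subdomain label in front of the domain.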
Solution 3:
This is a modified version of the fantastic answer to "How to extract top-level domain name (TLD) from URL".
You will need the list of effective TLDs (the Public Suffix List), saved locally as effective_tld_names.dat.txt:
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate
        # match tlds:
        if (exceptionCandidate in tlds):
            # exception rule: the candidate itself is the registered domain,
            # so the TLD is the candidate without its first label
            return DomainParts(urlElements[:i+1], '.'.join(urlElements[i+1:]))
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # returns ["abcde"]
    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
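If you just want to see how the suffix matching behaves without downloading the list, here is a rough Python 3 sketch of the same idea. Everything in it is an assumption for illustration: SAMPLE_RULES is a tiny hand-picked subset of rules, split_host is a made-up name, and it returns a plain tuple instead of the DomainParts class. Exception rules (the ones starting with "!") are handled by treating the matched name minus its leftmost label as the suffix.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample of Public Suffix List rules, for illustration only.
# A real run would load the full list from the downloaded file instead.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(url, rules=SAMPLE_RULES):
    """Return (subdomains, domain, suffix) for the URL's host name."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first, then progressively shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable, so the
            # suffix is the candidate minus its leftmost label.
            cut = i + 1
        elif candidate in rules or wildcard in rules:
            cut = i
        else:
            continue
        if cut == 0:
            raise ValueError("Host name is itself a public suffix")
        return labels[:cut - 1], labels[cut - 1], ".".join(labels[cut:])
    raise ValueError("Domain not in the sample rule list")


print(split_host("http://sub2.sub1.example.co.uk:80"))  # (['sub2', 'sub1'], 'example', 'co.uk')
print(split_host("http://www.ck/"))                      # ([], 'www', 'ck')
```

With only five rules this is obviously a toy, but swapping SAMPLE_RULES for a set built from the real file should give the same kind of result as the code above.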
Solution 4:
A very basic approach, without any sanity checking, could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
Is that what you mean?
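To make the "without any sanity checking" part concrete, here is the same partition logic wrapped in a function (Python 3 syntax; the name first_label_by_partition is made up for this sketch), run against two inputs it does not cope with:

```python
def first_label_by_partition(address):
    """Return the text between '://' and the first '.', with no validation."""
    host = address.partition('://')[2]
    return host.partition('.')[0]


print(first_label_by_partition('http://lol1.domain.com:8888/some/page'))  # lol1
print(first_label_by_partition('http://localhost:8888/some/page'))        # localhost:8888/some/page
print(repr(first_label_by_partition('lol1.domain.com/some/page')))        # ''
```

A host without any dot drags the port and path along, and an address without '://' gives an empty string, so this is only safe on input you control.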
Solution 5:
What you are looking for is in: http://docs.python.org/library/urlparse.html
For example:
from urlparse import urlparse
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
will do the job for you (it returns "www.my").
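One thing to watch: dropping the last two labels assumes the TLD is a single label. A quick Python 3 sketch (urlparse is in urllib.parse there; strip_last_two_labels is a made-up helper name, and it uses hostname rather than netloc so the port never gets into the labels) shows where that assumption breaks:

```python
from urllib.parse import urlparse


def strip_last_two_labels(url):
    """Join every host label except the final two (assumes a one-label TLD)."""
    return ".".join(urlparse(url).hostname.split(".")[:-2])


print(strip_last_two_labels("http://www.my.cwi.nl:80/%7Eguido/Python.html"))  # www.my
print(strip_last_two_labels("http://sub.example.co.uk/"))  # sub.example ('example' is really the domain)
print(repr(strip_last_two_labels("http://cwi.nl/")))       # ''
```

For hosts under suffixes like co.uk you need a suffix list, as in the tldextract answer.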