Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I'm looping over a series of URLs and want to clean them up. I have the following code:

# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])

# Remove www
new_url = o_url.host.gsub('www.', '').strip

How can I extend this to remove the subdomains that exist in some URLs?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
1.2k views
Welcome To Ask or Share your Answers For Others

1 Answer

I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix

require 'rubygems'
require 'domainatrix'

url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix       # => "net"
url.domain    # => "pauldix"
url.canonical # => "net.pauldix"

url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix       # => "co.uk"
url.domain    # => "pauldix"
url.subdomain # => "foo.bar"
url.path      # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...