Shortening and Expanding URLs with Python


Opinions on shortened URLs are a dime a dozen these days, but the basic facts are:

  1. They’re awfully convenient for passing around (and this was true even before Twitter came about)
  2. They are, by nature, short-lived (either the services or the URLs)
  3. You should never rely on their being around later on

So basically you have absolutely no excuse to not be able to handle them. I decided to mess around with the concept a few weeks back to see how simple I could make it all work, and came up with a couple of useful Python classes that I can share with the world:

Creating short URLs

The trouble with creating short URLs is that there are entirely too many shortening services, and far too many variations on their APIs. Nearly all of them suffer from “not invented here” syndrome: they try to “enhance” their APIs with a lot of stuff that you basically don’t (ever) need, and wrap their results in JSON or XML.

Me, I refuse to put up with that kind of crap.

So I poked around a bit, found the simplest services to work against and created the following class, which will try all its known services in turn until one of them returns a working URL:

import http.client
import urllib.parse

BITLY_AUTH = 'login=foo&apiKey=bar'

class URLShortener:
  # host -> request path; the quoted long URL gets appended to the path
  services = {
    'api.bit.ly':  '/shorten?version=2.0.1&%s&format=text&longUrl=' % BITLY_AUTH,
    'api.tr.im':   '/api/trim_simple?url=',
    'tinyurl.com': '/api-create.php?url=',
    'is.gd':       '/api.php?longurl='
  }

  def query(self, url):
    """ Try each service in turn and return the first usable short URL """
    for host, path in self.services.items():
      c = http.client.HTTPConnection(host)
      c.request("GET", path + urllib.parse.quote(url))
      shorturl = c.getresponse().read().decode('utf-8', 'replace').strip()
      c.close()
      # a usable reply is plain text containing a URL and no error message
      # (urlparse() on a bare host name has an empty netloc, so the old
      # host-based test amounted to this same check)
      if ("Error" not in shorturl) and ("http://" in shorturl):
        return shorturl
    raise IOError("no shortening service returned a usable URL")

Yes, the error handling is naïve – any network exceptions and stuff ought to be caught upstream from this – but it works fine so far.
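As a rough idea of what that upstream catching could look like, here is a minimal sketch written for Python 3. The `shorten_or_original` helper and the `AlwaysFailing` stand-in are hypothetical names made up for illustration, and the sketch assumes you would rather keep the long URL than fail outright:

```python
import http.client

def shorten_or_original(shortener, url):
    """Return a short URL, or the original URL if shortening fails.

    `shortener` is any object with a query(url) method, such as the
    URLShortener class above. This helper is hypothetical.
    """
    try:
        return shortener.query(url)
    except (OSError, http.client.HTTPException):
        # In Python 3, OSError covers IOError, socket errors and timeouts
        return url


class AlwaysFailing:
    """Stand-in for a shortener whose services are all down."""
    def query(self, url):
        raise IOError("no service available")


# Falls back to the long URL instead of blowing up
print(shorten_or_original(AlwaysFailing(), "http://example.com/a/long/path"))
```

Whether falling back silently is the right call depends on the application; logging the failure before returning would be an easy addition.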

Expanding short URLs

This is the really fun bit, because it is not immediately obvious whether a short URL will actually be useful as is. There are plenty of times when you’ll be redirected to something else, and while fooling around with the Google Reader API (something I’ll eventually write about later), I found that this also applied (in spades) to Feedburner links and whatnot.

So I decided to build some smarts into the process: it not only pings some known hosts twice, but also acts as a link checker of sorts, learning which hosts actually redirect to other places:

import http.client
import urllib.parse

class URLExpander:
  def __init__(self):
    # known shortening services
    self.shorteners = ['tr.im','is.gd','tinyurl.com','bit.ly','snipurl.com','cli.gs',
                       'feedproxy.google.com','feeds.arstechnica.com']
    # services that take two hops to reach the real URL
    self.twofers = ['\u272Adf.ws']
    # learned hosts (kept on the instance so they are pickled with the object)
    self.learned = []

  def resolve(self, url, components):
    """ Try to resolve a single URL """
    # keep the query string and never send an empty path
    path = components.path or '/'
    if components.query:
      path += '?' + components.query
    c = http.client.HTTPConnection(components.netloc)
    c.request("GET", path)
    l = c.getresponse().getheader('Location')
    c.close()
    if l is None:
      return url # it might be impossible to resolve, so best leave it as is
    return urllib.parse.urljoin(url, l) # the Location header may be relative

  def query(self, url, recurse = True):
    """ Resolve a URL """
    components = urllib.parse.urlparse(url)
    # Check weird (two-hop) shortening services first
    if (components.netloc in self.twofers) and recurse:
      return self.query(self.resolve(url, components), False)
    # Then check known shortening services
    if components.netloc in self.shorteners:
      return self.resolve(url, components)
    # If we haven't seen this host before, ping it, just in case
    if components.netloc not in self.learned:
      ping = self.resolve(url, components)
      if ping != url:
        self.shorteners.append(components.netloc)
        self.learned.append(components.netloc)
        return ping
    # The original URL was OK
    return url

This one’s a bit more convoluted but has turned out to be very useful indeed, and you can simply pickle the whole object to preserve its learned hosts.
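One caveat worth knowing: pickle stores an object's instance attributes, not attributes defined on its class, so lists that live at class level would not travel with a pickled instance. A minimal sketch that saves the two host lists explicitly works either way. The `save_hosts` and `load_hosts` helpers, the `FakeExpander` stand-in and the file name are all hypothetical, chosen only for illustration:

```python
import os
import pickle
import tempfile

def save_hosts(expander, filename):
    """Write the expander's host lists to disk."""
    state = {'shorteners': list(expander.shorteners),
             'learned': list(expander.learned)}
    with open(filename, 'wb') as f:
        pickle.dump(state, f)

def load_hosts(expander, filename):
    """Restore host lists written by save_hosts() into an expander."""
    with open(filename, 'rb') as f:
        state = pickle.load(f)
    expander.shorteners = state['shorteners']
    expander.learned = state['learned']


class FakeExpander:
    """Stand-in carrying the same two lists as URLExpander."""
    def __init__(self):
        self.shorteners = ['bit.ly']
        self.learned = []


# Pretend one run learned a new redirecting host...
old = FakeExpander()
old.shorteners.append('example.org')
old.learned.append('example.org')
path = os.path.join(tempfile.mkdtemp(), 'hosts.pickle')
save_hosts(old, path)

# ...and a later run picks it up again
new = FakeExpander()
load_hosts(new, path)
print(new.learned)
```

As usual with pickle, only load files you wrote yourself, since unpickling untrusted data can run arbitrary code.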