The FeedBurner De-Moronizer


You know those graphical buttons that FeedBurner adds to RSS/Atom feeds? You know, the ones that say "Digg this", "Take our survey", "Add to del.icio.us" and similar stuff, in an ugly non-antialiased font?

Well, I'm starting to consider them as obnoxious as RSS advertising, especially when I want to read news on my mobile phone.

The worst thing about them is that the images are downloaded from the FeedBurner servers every time an item is read, and no matter how much publishers care about knowing their readership, it is extremely annoying to wait while my phone retrieves every single one.

Add to that some people's propensity for adding up to six of these annoyances to every feed item, and you'll wonder why it took me so long to start a pogrom against the blasted things.

So here's a first stab at stripping all of that junk from a FeedBurner-mangled feed using Beautiful Soup. It works for a couple of the ones I'm subscribed to, and I'll see if there are any significant variations later on:

import urllib2, re, cgi, urlparse
from BeautifulSoup import BeautifulSoup

def demoronizer(url):
  stream = urllib2.urlopen(url)
  buffer = stream.read()
  soup = BeautifulSoup(buffer)
  # break down real URL after redirect
  sections = urlparse.urlparse(stream.geturl())
  host = re.compile(sections[0] + '://' + sections[1] + '/*')
  for item in soup('item'):
    for description in item('description'):
      # poor man's entity mapping
      buffer = description.string
      buffer = buffer.replace('&lt;','<')
      buffer = buffer.replace('&gt;','>')
      buffer = buffer.replace('&amp;','&')
      nuts = BeautifulSoup(buffer)
      # remove the holder div contents
      for target in nuts('div', {'class': 'feedflare'}):
        del target['class']
        target.contents=''
      # remove the linked buttons
      for link in nuts('a', {'href': host}):
        del link['href']
        link.contents=''
      # remove any stragglers
      for img in nuts('img', {'src': host}):
        del img['src']
        img.contents=''
      buffer = str(nuts)
      # BS does not let me remove the tags themselves,
      # so this cleanup is required
      for block in ['<a></a>','<div></div>','<p></p>','<img />']:
        buffer = buffer.replace(block,'')
      description.contents=cgi.escape(buffer)
  print soup
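Incidentally, the poor man's entity mapping near the top of the loop can be delegated to the standard library; a minimal sketch, using the module layout of newer Pythons (3.x):

```python
from html import unescape            # maps &lt; &gt; &amp; and many more entities
from xml.sax.saxutils import escape  # re-escapes &, < and > for embedding in XML

encoded = '&lt;div class="feedflare"&gt;Digg this!&lt;/div&gt;'
decoded = unescape(encoded)   # real markup, ready for a second parsing pass
roundtrip = escape(decoded)   # back to the entity-escaped form for the feed
```

This also catches entities beyond the three handled above, which the replace() chain silently ignores.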

This is a quick hack, but the basic technique should be useful to lots of people.
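The key trick is deriving the host pattern from wherever the redirect actually lands, so the same code catches any FeedBurner-hosted links and images without hardcoding a domain. A small illustration of that step (the URL below is a stand-in for what stream.geturl() returns; on newer Pythons the urlparse module lives in urllib.parse):

```python
import re
from urllib.parse import urlparse  # 'import urlparse' on Python 2

# Stand-in for the post-redirect URL reported by stream.geturl()
final_url = 'http://feeds.feedburner.com/SomeFeed'
parts = urlparse(final_url)

# Same prefix pattern the script builds: scheme://host followed by slashes
host = re.compile(parts.scheme + '://' + parts.netloc + '/*')

matched = bool(host.match('http://feeds.feedburner.com/~ff/SomeFeed?a=abc'))
missed = bool(host.match('http://example.org/post'))
```

Any anchor or image whose URL matches that prefix is assumed to be FeedBurner chrome and gets emptied out.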

Right now I'm using it with newspipe's built-in external feed generator mechanism, but I'm seriously thinking of folding this and other ad-blocking niceties into newspipe itself (provided I have the time, of course).

The supremely paranoid might also want to regenerate the link value from feedburner:origlink and avoid a round-trip to FeedBurner's servers just for the sake of a redirect to the original post, but I think it is a reasonable compromise to let the original publishers keep track of click-through stats.
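For the record, that link rewrite is easy enough; a minimal sketch using the standard library's ElementTree rather than Beautiful Soup (the namespace URI and the camel-cased origLink element name are what FeedBurner feeds appear to use, so treat both as assumptions to verify against your own feed):

```python
import xml.etree.ElementTree as ET

# Assumed FeedBurner extension namespace; check it against an actual feed.
FEEDBURNER_NS = 'http://rssnamespace.org/feedburner/ext/1.0'

def restore_origlinks(rss_text):
    """Copy each item's feedburner:origLink value over its <link> value."""
    root = ET.fromstring(rss_text)
    for item in root.iter('item'):
        orig = item.find('{%s}origLink' % FEEDBURNER_NS)
        link = item.find('link')
        if orig is not None and link is not None and orig.text:
            link.text = orig.text
    return ET.tostring(root, encoding='unicode')
```

Run over a fetched feed before the de-moronizing pass, this sends every reader click straight to the publisher instead of bouncing it through FeedBurner.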