Scrape, And Ye Shall Be Informed

Don't you just hate those stingy sites that don't provide full-text feeds?

It's right up there in my pet annoyances list, just after the wanton stupidity of inserting banner ads in RSS, so for the past year or so I've been honing the art of screen scraping using Beautiful Soup (which, in the author's own words, is "usually good enough to get the data you need and then run away").

Since newspipe can run external Python scripts to create "virtual" feeds using the pipe: scheme in its OPML file, I build small Python scripts that essentially do the following:

  • Parse the original, stingy feed
  • Go through the items on it and fetch each referenced page individually
  • Strip out everything but the main post contents
  • Stick the whole thing into a simple RSS feed that gets printed to standard output (and hence processed by newspipe when it invokes the scripts)
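The four steps above can be sketched roughly as follows. This is a minimal illustration rather than the script discussed in this post: it targets Python 3 and uses only the standard library (ElementTree instead of Beautiful Soup), and the names full_text_feed, fetch_page and extract_body are hypothetical. The caller supplies the HTTP and page-stripping logic as plain functions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Simplified item layout; a real feed would likely carry more elements.
ITEM = "<item><title>%s</title><link>%s</link><description>%s</description></item>"

def full_text_feed(feed_xml, fetch_page, extract_body, title="Full-text feed"):
    """Return an RSS 2.0 string whose items carry the full page body.

    fetch_page(url) -> HTML string and extract_body(html) -> HTML string
    are hypothetical callables supplied by the caller.
    """
    items = []
    # 1. parse the original, stingy feed
    for item in ET.fromstring(feed_xml).iter("item"):
        item_title = (item.findtext("title") or "").strip()
        # use the <guid> as the page URL, falling back to <link>
        url = (item.findtext("guid") or item.findtext("link") or "").strip()
        if not url:
            continue
        # 2. fetch the referenced page, 3. keep only the main post contents
        body = extract_body(fetch_page(url))
        items.append(ITEM % (escape(item_title), escape(url), escape(body)))
    # 4. wrap everything in a simple RSS envelope
    return ('<rss version="2.0"><channel><title>%s</title>%s</channel></rss>'
            % (escape(title), "".join(items)))
```

Printing the returned string to standard output is all that is left for the calling script to do.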

The first phase occasionally involves some fancy trickery (I have a couple of industry news sites that don't even publish a decent feed, so I usually navigate their front page headings), and I have some HTTP logic squirreled away in a module called scraper that essentially uses Mark Pilgrim's HTTP compression support examples and throws in some cookie handling.

Since you can do HTTP in plenty of other ways, I'll skip those bits.
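For reference, one of those other ways might look like the sketch below. It is not the scraper module itself, which isn't shown here: it is written for Python 3's standard library, and fetch_url and decode_body are hypothetical names. The only thing taken from the script further down is the convention of returning a dictionary with the page body under 'data'.

```python
import gzip
import http.cookiejar
import urllib.request

# One shared opener, so cookies set by a site are sent back on later requests.
_cookies = http.cookiejar.CookieJar()
_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(_cookies))

def decode_body(raw, content_encoding):
    """Undo gzip compression if the server says it applied it."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw

def fetch_url(url, timeout=30):
    """Hypothetical stand-in for fetchURL(): returns a dict with the body under 'data'."""
    # Ask for a compressed response; servers are free to ignore the request.
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with _opener.open(request, timeout=timeout) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
        return {"url": response.geturl(), "data": decode_body(raw, encoding)}
```

Keeping the decompression in its own function means it can be checked without touching the network.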

But it is amazingly trivial to do. Let's say you have a site that publishes a titles-only RSS feed and that has articles inside a consistently-named div tag like so:

   (annoying banner ads, titles, advertising, etc.)
   <div id="copy" class="bodytext">
      <p>In the beginning, the dinosaurs ruled the Earth...</p>
   </div>

This is a typical setup on most sites, and gives us two ways to reference the body text: by the div's id or by its class. In fact, sometimes you have to string together all divs of a given class to gather the full article, or even parse "next page" links and grab more HTML, but those things would make for a rather involved example, so I'll stick to the basics and do the whole thing in (almost) one shot:

#!/usr/bin/env python
# (Python 2, Beautiful Soup 3)

__version__ = "0.1"

# scraper is my own HTTP helper module (not shown here); it provides fetchURL()
from scraper import *
from BeautifulSoup import BeautifulSoup

baseurl = ''

def fetchindex():
  # feed header (the exact markup is assumed; it takes one value, the site URL)
  print """
<rss version="2.0">
  <channel>
    <title>Stingy Site</title>
    <link>%s</link>
""" % (baseurl)
  # grab the RSS feed (fetchURL is defined in scraper, and does all the HTTP stuff)
  index = fetchURL(baseurl)
  # build a parse tree to navigate the feed
  soup = BeautifulSoup(index['data'], selfClosingTags=['link'])
  # for each feed item
  for entry in soup('item'):
    # grab the item title (this is a Soupism, and rather crude, but works)
    title = entry('title')[0].renderContents()
    # find the guid (we can also use the <item>'s <link>, depending on the feed)
    for link in entry('guid'):
      # get the page URL
      url = link.renderContents()
      # get the page itself
      page = fetchURL(url)
      # navigate it
      dessert = BeautifulSoup(page['data'])
      # remove JavaScript
      for script in dessert('script'):
        script.extract()
      # get rid of stupid inline styles
      for span in dessert('span'):
        del span['style']
      # this site only uses one div for the post content, but we grab all of them here...
      post = dessert('div', {'class':'bodytext'})
      # skip pages where the div is missing, rather than die on post[0]
      if not post:
        continue
      # item layout assumed: title, description, link, guid
      print """
    <item>
      <title>%s</title>
      <description><![CDATA[%s]]></description>
      <link>%s</link>
      <guid>%s</guid>
    </item>
""" % (title,post[0],url,url) # and insert the first element here
  print """
  </channel>
</rss>"""

if __name__ == '__main__':
  fetchindex()

...and that's about it. More sophisticated stuff (which I also do, but removed for clarity) includes parsing and passing on item post dates, removing banners in mostly the same way as inline styles, buffering the output instead of doing the whole thing at once, etc., etc.
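Two of those extras, passing on post dates and buffering the output, could look something like the sketch below. It is only an illustration under stated assumptions: it uses Python 3's standard library, assumes the source feed carries RFC 822 style dates (as RSS 2.0 feeds normally do), and the names pub_date_element and emit are hypothetical.

```python
import sys
from email.utils import format_datetime, parsedate_to_datetime

def pub_date_element(raw_date):
    """Return a <pubDate> element for an RFC 822 style date, or '' if it cannot be parsed."""
    try:
        when = parsedate_to_datetime(raw_date.strip())
    except (AttributeError, TypeError, ValueError):
        # raw_date was None, or a string that is not a recognisable date
        return ""
    return "<pubDate>%s</pubDate>" % format_datetime(when)

def emit(fragments, out=None):
    """Write all collected fragments in a single call.

    Collecting the feed in a list first means a failure halfway through
    a run never leaves half a feed on standard output.
    """
    if out is None:
        out = sys.stdout
    out.write("".join(fragments))
```

The script would then append each header, item and footer string to a list instead of printing it, and call emit() once at the very end.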

But it should give you an idea of the power of Beautiful Soup, and why I love the thing.
