Don't you just hate those stingy sites that don't provide full-text feeds?
It's right up there in my pet annoyances list, just after the wanton stupidity of inserting banner ads in RSS, so for the past year or so I've been honing the art of screen scraping using Beautiful Soup (which, in the author's own words, "...is usually good enough to get the data you need and then run away").
The scripts I've ended up with all follow roughly the same four steps:
- Parse the original, stingy feed
- Go through the items on it and fetch each referenced page individually
- Strip out everything but the main post contents
- Stick the whole thing into a simple RSS feed that gets printed to standard output (and hence processed by newspipe when it invokes the scripts)
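A minimal sketch of the first and last steps, using only Python's standard library, might look like the following. The names read_items and build_feed, the sample feed and the example.com URLs are all hypothetical, and it assumes a plain RSS 2.0 feed where every item carries a title and a link. The fetching and stripping in the middle are faked with a placeholder.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical titles-only feed, standing in for the real (stingy) one.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Stingy Site</title>
    <link>http://example.com/</link>
    <item>
      <title>Dinosaurs</title>
      <link>http://example.com/dinosaurs.html</link>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Step 1: return (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def build_feed(channel_title, channel_link, entries):
    """Step 4: build a simple RSS 2.0 feed from (title, link, html) triples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Full-text version of " + channel_title
    for title, link, html in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        # ElementTree escapes the markup, so the HTML travels as escaped text.
        ET.SubElement(item, "description").text = html
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Steps 2 and 3 (fetching and stripping each page) are faked here.
    entries = [
        (title, link, "<p>Full text would go here</p>")
        for title, link in read_items(SAMPLE_FEED)
    ]
    sys.stdout.write(build_feed("Stingy Site", "http://example.com/", entries) + "\n")
```

Printing to standard output keeps the script easy to plug into whatever invokes it, since the caller only has to read the resulting feed from the pipe.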
The first step occasionally involves some fancy trickery: a couple of the industry news sites I follow don't even publish a decent feed, so I scrape the headings off their front pages instead. I also have some HTTP logic squirreled away in a module called scraper, which essentially takes Mark Pilgrim's HTTP compression support examples and throws in some cookie handling.
Since you can do HTTP in plenty of other ways, I'll skip those bits.
But the scraping itself is amazingly trivial to do. Let's say you have a site that publishes a titles-only RSS feed and keeps its articles inside a consistently named div tag, like so:
    (annoying banner ads, titles, advertising, etc.)
    <div id="copy" class="bodytext">
    <p>In the beginning, the dinosaurs ruled the Earth
This is a typical setup on most sites, and it gives us two ways to reference the body text: by the div's id or by its class. Sometimes you have to string together all the divs of a given class to gather the full article, or even parse "next page" links and grab more HTML, but those things would make for a rather involved example, so I'll stick to the basics and do the whole thing in (almost) one shot:
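A minimal sketch of that kind of extraction, assuming the current bs4 package (Beautiful Soup 4) and Python's built-in html.parser. The function name extract_body and the sample page are hypothetical, and the fetching is left out, so it works on an HTML string you already have:

```python
from bs4 import BeautifulSoup

# Hypothetical page, modelled on the markup described above.
SAMPLE_PAGE = """
<html><body>
<div class="banner">Buy stuff!</div>
<div id="copy" class="bodytext">
<p style="color: red">In the beginning, the dinosaurs ruled the Earth.</p>
</div>
</body></html>
"""


def extract_body(html, div_id="copy", div_class="bodytext"):
    """Return the inner HTML of the article div, or "" if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # Try the id first, then fall back to the class.
    body = soup.find("div", id=div_id)
    if body is None:
        body = soup.find("div", class_=div_class)
    if body is None:
        return ""
    # Drop inline style attributes from everything inside the article.
    for tag in body.find_all(style=True):
        del tag["style"]
    return body.decode_contents().strip()


if __name__ == "__main__":
    print(extract_body(SAMPLE_PAGE))
```

For the multi-div case, soup.find_all("div", class_="bodytext") returns every matching div, and their contents can be joined in order.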
...and that's about it. More sophisticated stuff (which I also do, but removed for clarity) includes parsing and passing on item post dates, removing banners in mostly the same way as inline styles, buffering the output instead of doing the whole thing at once, etc., etc.
But it should give you an idea of the power of Beautiful Soup, and why I love the thing.