Random Remainders

I'm taking things a bit slow these days, mostly because I find myself wanting to get home, put on some comfy slippers and read my brains out. Somehow, it seems like the only time my brain gets properly exercised.

Which reminds me: Today, during a 15-minute break to kick-start my neurons again after wading through a bunch of standards documents, I implemented a quick-and-dirty Growl-to-Gnome-notifications bridge (or, rather, a Growl/D-Bus bridge). I expect to be able to clean it up a bit and post it during the weekend, so that Linux folk will be able to exchange network-based notifications with Macs. The main thing left to do is packet decoding, which I never got to implement in ReGrowl.

I still think it's a bit lame that Growl doesn't support HTML notifications off the bat (or at least a way to send a clickable URL over the network), but at least my Ubuntu laptop will be able to receive my servers' notifications...

Anyway, on to the news. Or, rather, to my new-found ability to float atop the "river of pointless news" that RSS feeds have become.

People curious about my Bayesian RSS filtering mechanism will be glad to know that it is turning out great: It is filtering out roughly 60% of my news items, and, more to the point, it is doing a decent job of picking the right ones.

I changed my approach a bit, and besides having added author and feed names to the raw data fed to the classifier, I now have the following folder layout on my IMAP news account:

  - Archive          (+) Positive Training
  - Interesting      (+) Positive Training
  - Not Interesting  (-) Negative Training
  - Unsorted
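
For the curious, here is a rough sketch of what "adding author and feed names to the raw data" could look like. It is not my actual script: the `tokenize` function and the `X-Feed-Name` header are made up for illustration (check which headers your own feed-to-mail gateway really sets), and it assumes plain, single-part messages.

```python
import re
from email import message_from_string


def tokenize(raw_message):
    """Turn one RSS-item email into a list of classifier tokens."""
    msg = message_from_string(raw_message)
    tokens = []

    # Prefix header-derived tokens so they never collide with body words.
    author = msg.get("From", "").strip().lower()
    if author:
        tokens.append("author:" + author)

    # "X-Feed-Name" is a hypothetical header name, used here as a placeholder.
    feed = msg.get("X-Feed-Name", "").strip().lower()
    if feed:
        tokens.append("feed:" + feed)

    # Only single-part messages are handled in this sketch.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    text = msg.get("Subject", "") + " " + body
    tokens.extend(re.findall(r"[a-z0-9']+", text.lower()))
    return tokens
```

The point of the prefixes is that the classifier can learn "this author is usually boring" separately from "this word is usually boring".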

newspipe sends me new RSS items by SMTP, and my script then goes through the INBOX and moves what it believes to be "junk" into Unsorted (I could have used my Junk folder, but I wanted to keep things apart for the moment).
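
The INBOX pass boils down to something like the following. Again, this is a minimal sketch rather than the real thing: `move_junk` and `is_junk` are hypothetical names, `conn` is assumed to be an already authenticated `imaplib.IMAP4` or `imaplib.IMAP4_SSL` connection, and error handling is left out entirely.

```python
def move_junk(conn, is_junk, source="INBOX", target="Unsorted"):
    """Move every message in `source` that `is_junk` flags into `target`.

    `conn` is an authenticated imaplib connection; `is_junk` is a
    hypothetical callable that takes the raw message bytes and returns
    True for items that should be filed away. Returns the number moved.
    """
    conn.select(source)
    typ, data = conn.search(None, "ALL")
    moved = 0
    for num in data[0].split():
        # BODY.PEEK[] fetches the whole message without marking it as read.
        typ, parts = conn.fetch(num, "(BODY.PEEK[])")
        raw = parts[0][1]
        if is_junk(raw):
            # Classic IMAP "move": copy, then flag the original as deleted.
            conn.copy(num, target)
            conn.store(num, "+FLAGS", "\\Deleted")
            moved += 1
    # Expunge once at the end so sequence numbers stay stable in the loop.
    conn.expunge()
    return moved
```

Copy-then-delete is the lowest common denominator here, since not every IMAP server supports a native move command.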

I then go into the INBOX and (thanks to Mail.app and Mail Act-On) breeze through my (vastly reduced flow of) messages and take one of four actions:

  • If they're interesting enough, they go into my permanent Archive (which is also where I archive web pages).
  • If they're about something I want to keep track of, I file them into Interesting.
  • If I really don't want to know more, I file them into Not Interesting.
  • Otherwise, I just flag them for follow-up or delete them straight away.

Upon the next iteration, the script will find new items in the training folders and learn from them. After a while, I clear out everything but the Archive folder (the Bayesian classifier keeps its own database).
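
If you have never looked inside one of these classifiers, the training step is less magical than it sounds. Below is a toy naive Bayes sketch with add-one smoothing, assuming messages have already been split into tokens. It is not the classifier I am actually using, and the `NaiveBayes` class and its label names are purely illustrative.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class naive Bayes over token lists (illustrative only)."""

    LABELS = ("interesting", "junk")

    def __init__(self):
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, label, tokens):
        # Called once per message found in a training folder.
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def score(self, label, tokens):
        counts = self.token_counts[label]
        total = sum(counts.values())
        vocab = len(set().union(*self.token_counts.values()))
        # Smoothed log prior: how common is this label overall?
        logp = math.log(
            (self.doc_counts[label] + 1)
            / (sum(self.doc_counts.values()) + len(self.LABELS))
        )
        for token in tokens:
            # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
            logp += math.log((counts[token] + 1) / (total + vocab + 1))
        return logp

    def is_junk(self, tokens):
        return self.score("junk", tokens) > self.score("interesting", tokens)
```

Since only the counts matter, the messages themselves can be thrown away after training, which is why clearing the folders does no harm as long as the counts are persisted somewhere.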

So far, this has ensured that I remain blissfully ignorant of (and waste hardly any time putting up with) posts regarding:

  • The US elections
  • The Novell-Microsoft deal (don't really care, sorry)
  • All the Zune hype (I care even less)
  • Most of the utterly dumb, tongue-in-cheek posts from a bunch of "trendy" page-view-oriented tech blogs that think all phones are cool
  • Miscellaneous pieces from specific authors (it was pretty quick to cotton on to folk blogging about personal stuff or who keep harping on about the same things)

To give you an idea of what the training curve has been like, let's just say that after two days I've only had to go into Unsorted and flag as Interesting a handful of messages.

Here's the current message count for the relevant folders:

  - INBOX            30 messages
  - Archive          247 messages
  - Interesting      166 messages
  - Not Interesting  389 messages
  - Unsorted         274 messages
  - Trash            808 messages (700 from cleaning Unsorted)

New message counts in INBOX and Unsorted tend to have a 1-to-3 ratio, which is pretty good.

Still, it is too early to claim it's completely successful. I've since started reading up on different techniques, and will be tweaking this over the next few months - I have half a notion to start graphing some kind of statistics related to this, but really can't be bothered just yet.
