Geeking Out

After nearly a week of typing up work e-mail and analysis docs on my Windows laptop well into the night, it's amazingly relaxing to walk in, throw my keys onto the desk, ditch the laptop onto the couch, and chill out on my . One of my best friends is getting married tomorrow, and I want to show up relaxed and at ease, not in some weird ecliptic orbit around corporate stress.

(To give you an idea of just how bad it is these days, the absolutely perfect stress relief would most likely be walking into some of my latest meetings with a paintball Gatling gun.)

My current pursuits are far more realistic and much less bellicose, of course: I'm grafting Bayesian classification onto . I've picked up Reverend and bolted a BaseHTTPServer onto the main thread. I did this because the classifier has to be trained in some way, and I assume that the entire point of using HTML mail is to click on it occasionally, so feels like the logical way to proceed.
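
For a rough idea of what "training" means here, the sketch below is a tiny multinomial naive Bayes classifier with add-one smoothing. It is a stand-in written for illustration, not Reverend's API and not the code described above: the class name `TinyBayes`, its `train`/`guess` methods, and the label strings are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayes:
    """Illustrative multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> trained documents

    @staticmethod
    def tokenize(text):
        # Lowercased word tokens; \w matches Unicode letters in Python 3.
        return re.findall(r"\w+", text.lower())

    def train(self, label, text):
        """Record one example text under the given label."""
        self.doc_counts[label] += 1
        self.token_counts[label].update(self.tokenize(text))

    def guess(self, text):
        """Return the most likely label, or None if nothing was trained."""
        if not self.doc_counts:
            return None
        tokens = self.tokenize(text)
        vocab = {t for counts in self.token_counts.values() for t in counts}
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Each click in a mail message would then boil down to one `train(label, text)` call, and each new post to one `guess(text)` call. Stemming is left out to keep the sketch short.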

As soon as I have something that minimally works, I'll hand it over to Ricardo for inclusion in the main package, but so far the concept I've outlined is this:

  • runs as a daemon and grabs the feed contents, tokenizes and stems the text-only contents, and feeds the results to the Bayesian classifier:
    • Message subjects and some URLs are to be part of the classification as well (URLs impart some important context, although I don't know how much of them to use for classification).
    • Until I can figure out a better scheme (categories? knowledge domains? good/bad?), posts are classified on an -like scale (1-5 stars). I want to have more than two categories right from the start, and this is as good an approach as any for testing.
    • If there is no database, initial Bayesian seed data is read from a section of the OPML file.
  • Four new mail headers are added to each message: Message-ID (I've been meaning to do this based on the RSS GUID, as an easy way to remove duplicate messages generated from tests), X-Spam-Level and X-Spam-Status (to test interaction with ), and X-Interest-Level.
  • Mail messages will include links back to the daemon's server, something like http://localhost:port/feed_id/post_id/classification. Clicking on those will make re-read the cached data (if any) and train the Bayesian classifier with it. This will be a major UI problem, since HTML support in MUAs is completely broken in interaction terms and I have to figure out some way to prevent umpteen browser windows from piling up. I already have a "mini-webmail" that renders a WAP view of my IMAP inbox (and generates an RSS "metafeed" of all the items), but I don't want to rely on that for training.
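
The header and link parts of that outline could look roughly like the sketch below, written against the modern `email` and `http.server` modules rather than the old BaseHTTPServer one. Everything named here is an assumption made for illustration: the function names, port 8080, the `feeds.invalid` domain, hashing the GUID to get a safe Message-ID, and using a plain digit as the X-Interest-Level value. The X-Spam-* headers are left out because the outline doesn't say what values they should carry.

```python
import hashlib
from email.message import EmailMessage
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlsplit

BASE_URL = "http://localhost:8080"  # assumed; the outline only says "port"


def build_message(guid, subject, body, stars, feed_id, post_id):
    """Build a mail message with a GUID-derived Message-ID and rating links."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # Hash the RSS GUID so the same post always maps to the same Message-ID,
    # whatever characters the GUID contains.
    digest = hashlib.sha1(guid.encode("utf-8")).hexdigest()
    msg["Message-ID"] = f"<{digest}@feeds.invalid>"
    msg["X-Interest-Level"] = str(stars)
    # One training link per star rating, in the feed_id/post_id/classification shape.
    links = "\n".join(f"{BASE_URL}/{feed_id}/{post_id}/{n}" for n in range(1, 6))
    msg.set_content(f"{body}\n\nRate this post:\n{links}\n")
    return msg


def parse_training_path(path):
    """Split /feed_id/post_id/classification; return None if it doesn't fit."""
    parts = [p for p in urlsplit(path).path.split("/") if p]
    if len(parts) != 3 or parts[2] not in {"1", "2", "3", "4", "5"}:
        return None
    return parts[0], parts[1], int(parts[2])


class TrainingHandler(BaseHTTPRequestHandler):
    """Answers clicks on rating links; the training call itself is left as a stub."""

    def do_GET(self):
        parsed = parse_training_path(self.path)
        if parsed is None:
            self.send_error(404)
            return
        feed_id, post_id, stars = parsed
        # A real daemon would re-read the cached post here and train on it.
        self.send_response(204)  # no body to render
        self.end_headers()
```

Serving it would be a matter of passing `TrainingHandler` to `http.server.HTTPServer(("localhost", 8080), TrainingHandler)` and calling `serve_forever()`. Replying with 204 keeps the response empty, though it does nothing about the mail client opening a browser window in the first place.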

The rest of the problems are the usual ones: some feeds have very little context (barely more than headlines), there are digest messages to be reckoned with (not sure how to deal with those yet), removing bugs, making sure I can classify posts in any language (UTF-8 should make this trivial), etc.

Like most of the things I have been planning to do for months now, it probably won't happen anytime soon.

But it's a start.
