Taming my RSS feeds, the Bayesian way

I've been meaning to graft Bayesian classification onto newspipe for months now (probably even years), but never found a usable way of doing it.

The main reason is that it is very tricky to set up a decent training mechanism that newspipe can take advantage of and that doesn't get in the way. From a coding perspective, things are pretty straightforward - it knows that it has incoming feeds, an outbound e-mail address, and a bunch of history files that tell it which items it has seen (and that, conceivably, could be extended to add Bayesian classification).

So I tried creating a little web UI to deal with those, but manipulating the files directly was tricky, and setting up an HTTP thread within newspipe itself was altogether too fiddly - not because it was tough to code, but because it was a mess in terms of UI - you viewed a feed item in Mail.app (which is where I want to read and manipulate news), clicked on a training link, and up came Safari, blocking your view of Mail.app.

This week, I decided I had had enough of reading RSS feeds - even if I only spent about an hour a day doing so, it was one hour too many, most of which seemed to consist of wading through crap.

Since I've been tinkering with IMAP for a while now (for a number of different reasons, one of which consists of exploring ways of using it for a Wiki back-end), I decided to approach the issue from a different angle:

On my "news" IMAP account, I set up three folders:

Archive, which has always been there, and where I store everything I want to keep for an extended period of time (including web page archives created with MailArchive.py)
Interesting, for stuff I find interesting, but don't really want to keep.
Junk, which has always been there as well, but which is now used for a different purpose.

Then I added corresponding Mail Act-On rules to move messages into each (Archive and Junk had always been there, too, but Interesting wasn't), and defined a Flagged folder that listed only flagged items from that INBOX.

You may have guessed where this is going - yes, that's the one I read. I have also been thinking about using MailTags, but I wanted a simple, surefire approach.

I then grabbed thomas.py from divmod's ancient Python toolkit (the latest is here; I've had a copy on my hard disk for years now), and set to coding something like this:

Check Archive and Interesting for new messages
Train the Bayesian classifier with those (into the flag bucket)
Check Junk for new messages
Train the Bayesian classifier with those (into the junk bucket)
Check INBOX for new messages
Classify each one and set the IMAP \Flagged flag as appropriate.
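
The train-then-classify steps above can be illustrated with a toy classifier. This is only a sketch under stated assumptions: TinyBayes and should_flag are hypothetical names, the maths is a plain multinomial naive Bayes with Laplace smoothing (not necessarily what thomas.py does internally), and it is written for Python 3 rather than the Python 2 of the script below. The "flag" and "junk" bucket names and the 0.90 threshold mirror the ones the script uses.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TinyBayes:
    """Toy naive Bayes classifier; NOT the thomas.py algorithm."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # bucket -> token -> count
        self.docs = Counter()               # bucket -> number of training texts

    def train(self, bucket, text):
        self.counts[bucket].update(TOKEN.findall(text.lower()))
        self.docs[bucket] += 1

    def guess(self, text):
        """Return {bucket: probability}, normalised to sum to 1."""
        tokens = TOKEN.findall(text.lower())
        vocab = set()
        for counter in self.counts.values():
            vocab.update(counter)
        total_docs = sum(self.docs.values())
        log_scores = {}
        for bucket, counter in self.counts.items():
            total = sum(counter.values())
            score = math.log(self.docs[bucket] / total_docs)
            for tok in tokens:
                # Laplace smoothing so an unseen token does not zero the score
                score += math.log((counter[tok] + 1) / (total + len(vocab)))
            log_scores[bucket] = score
        if not log_scores:
            return {}
        top = max(log_scores.values())
        exp = {b: math.exp(s - top) for b, s in log_scores.items()}
        norm = sum(exp.values())
        return {b: v / norm for b, v in exp.items()}


def should_flag(classifier, text, threshold=0.90):
    """True when the 'flag' bucket wins by a comfortable margin."""
    return classifier.guess(text).get("flag", 0.0) > threshold
```

Training on messages from Archive and Interesting would be calls like train("flag", text), messages from Junk would go through train("junk", text), and each new INBOX message would be passed to should_flag.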

The resulting script runs every 30 minutes and assumes the messages are either HTML mail as generated by newspipe and the rest of my stuff, or plaintext.
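
For readers on Python 3, the "HTML or plaintext" assumption can be sketched with nothing but the standard library. This is an illustration rather than a port of the script: distill and _TextExtractor are hypothetical names, html.parser stands in for BeautifulSoup, and falling back to UTF-8 when the charset is missing or unknown is my assumption.

```python
import re
from email import message_from_bytes
from html.parser import HTMLParser

WHITESPACE = re.compile(r"\s+")


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def distill(raw_bytes):
    """Reduce a raw RFC 822 message to one line of whitespace-normalised text."""
    msg = message_from_bytes(raw_bytes)
    text = ""
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/html", "text/plain"):
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            body = payload.decode(charset, "replace")
        except LookupError:  # unknown charset name: assume UTF-8
            body = payload.decode("utf-8", "replace")
        if ctype == "text/html":
            extractor = _TextExtractor()
            extractor.feed(body)
            extractor.close()
            text = " ".join(extractor.chunks)
            break  # prefer the HTML part over any plaintext alternative
        text = body
    return WHITESPACE.sub(" ", text).strip()
```

As in the script, the first HTML part wins and a plaintext part is only used when no HTML part turns up.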

And since it bases its classification on what I toss into particular IMAP folders, it allows me to train it in a painless, simple fashion, without ever leaving Mail.app. So far, it seems to be working (it is catching on to what I like and don't like), and with luck, I'll be letting it delete RSS items fairly soon.

It isn't particularly complex, so I'm posting it inline - apologies in advance to those of you who hate code listings.

Oh, and those of you using Plagger might be able to use this as well. Have the appropriate amount of fun.

#!/usr/bin/env python

"""IMAP Bayes Classifier"""
__version__ = "0.1"
__author__ = "Rui Carmo (http://the.taoofmac.com)"
__copyright__ = "(C) 2006 Rui Carmo. Code under BSD License."

import getpass, os, gc, sys, time, platform, getopt
import mailbox, rfc822, imaplib, socket
import StringIO, re, csv, sha, gzip, bz2
import cPickle, BeautifulSoup
from email.Parser import Parser
from email.Utils import decode_rfc2231
from thomas import Bayes

uidpattern = re.compile(r"\d+ \(UID (\d+)\)")
whitespace = re.compile(r"\s+", re.MULTILINE)

def remap(a):
  return (a[0], a[1])

class Classifier:
  def __init__(self, imap, folders = {'archive':'Archive', 'interesting':'Interesting', 'junk':'Junk','inbox':'INBOX'}, path = '.'):
    self.imap = imap
    self.bayesState = path + "/bayes.dat"
    self.imapState = path + "/imap.dat"
    self.folders = folders
    self.reverend = Bayes()

  def run(self):
    if os.path.exists(self.bayesState):
      self.reverend.load(self.bayesState)
    if os.path.exists(self.imapState):
      self.known = cPickle.load(open(self.imapState,'rb'))
    else:
      self.known = {}
      for key in self.folders.keys():
        self.known[self.folders[key]] = []
    self.train(self.folders['archive'],'flag')
    self.train(self.folders['interesting'],'flag')
    self.train(self.folders['junk'],'junk')
    self.classify(self.folders['inbox'])

  def classify(self, folder):
    self.imap.select(folder)
    typ, data = self.imap.search(None, 'ALL')
    newmessages = []
    i = 0
    for num in data[0].split():
      typ, data = self.imap.fetch(num, '(UID)')
      try:
        uid = uidpattern.match(data[0]).group(1)
        if uid not in self.known[folder]:
          newmessages.append(uid)
          self.known[folder].append(uid)
      except:
        pass
    if len(newmessages) > 0:
      for uid in newmessages:
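        guess = dict(map(remap,self.reverend.guess(self.distillMessage(uid))))
        # guess() may leave out a bucket that has not been trained yet
        if guess.get('flag', 0) > 0.90:
          # +FLAGS adds \Flagged without clearing flags already set on the message
          typ, data = self.imap.uid("STORE", uid, "+FLAGS", r"(\Flagged)")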
      cPickle.dump(self.known, open(self.imapState,'wb'))

  def train(self, folder, bucket):
    self.imap.select(folder)
    typ, data = self.imap.search(None, 'ALL')
    newmessages = []
    i = 0
    for num in data[0].split():
      typ, data = self.imap.fetch(num, '(UID)')
      try:
        uid = uidpattern.match(data[0]).group(1)
        if uid not in self.known[folder]:
          newmessages.append(uid)
          self.known[folder].append(uid)
      except:
        pass
    if len(newmessages) > 0:
      for uid in newmessages:
        self.reverend.train(bucket,self.distillMessage(uid))
      self.reverend.save(self.bayesState)
      cPickle.dump(self.known, open(self.imapState,'wb'))

  def distillMessage(self, uid):
    # BODY.PEEK[] fetches the whole message without setting \Seen
    typ, data = self.imap.uid("FETCH", uid, "(BODY.PEEK[])")
    p = Parser()
    msg = p.parsestr(data[0][1])
    plaintext = u'' # stays empty if the message has no text part
    for part in msg.walk():
      content = part.get_content_type()
      if content == 'text/html':
        soup = BeautifulSoup.BeautifulSoup(part.get_payload(decode=True),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES,smartQuotesTo=None)
        plaintext = u' '.join(soup.findAll(text=re.compile('.+')))
        break # we assume there is a single html part worth decoding
      elif content == 'text/plain':
        # decode to unicode so the final encode('utf-8') cannot choke on non-ASCII bytes
        plaintext = unicode(part.get_payload(decode=True), part.get_content_charset() or 'utf-8', 'replace')
    plaintext = whitespace.sub(' ', plaintext)
    return plaintext.encode('utf-8')

def main():
  try:
    # "p:" (with the colon) so that -p takes the PASSWORD argument shown in the usage text
    opts, args = getopt.getopt(sys.argv[1:], "c:s:u:p:", ["corpus=","server=", "username=","password="])
  except getopt.GetoptError:
    print "Usage: bayesimap.py [OPTIONS]"
    print "-s HOSTNAME --server=HOSTNAME   connect to HOSTNAME"
    print "-u USERNAME --username=USERNAME with USERNAME"
    print "-p PASSWORD --password=PASSWORD with PASSWORD (you will be prompted for one if missing)"
    sys.exit(2)
  corpus = username = password = server = None
  clobber = False
  for option, value in opts:
    if option in ("-s", "--server"):
      server = value
    if option in ("-u", "--username"):
      username = value
    if option in ("-p", "--password"):
      password = value
  if(server is None):
    print "ERROR: No server specified."
    sys.exit(2)
  if(username is None):
    print "ERROR: No username specified."
    sys.exit(2)
  if(password is None):
    password = getpass.getpass()
  server = imaplib.IMAP4(server)
  server.login(username, password)
  c = Classifier(server)
  c.run()
  server.logout()

if __name__ == '__main__':
  main()
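
One thing to watch if you port this to Python 3: imaplib hands back bytes there, so the UID pattern has to be a bytes pattern too. A minimal sketch of the "have I seen this UID before?" step follows; new_uids is a hypothetical helper, and keeping the known UIDs in a set instead of the list used above is my choice, not something the script does.

```python
import re

# Bytes pattern: under Python 3, imaplib returns response lines as bytes
UID_PATTERN = re.compile(rb"\d+ \(UID (\d+)\)")


def new_uids(fetch_lines, known):
    """Return the UIDs (as str) in fetch_lines that are not yet in known.

    fetch_lines: items like those in the data list from imap.fetch(num, '(UID)')
    known: a set of UID strings, updated in place
    """
    fresh = []
    for line in fetch_lines:
        if not isinstance(line, bytes):
            continue  # skip None or tuple items
        match = UID_PATTERN.match(line)
        if match is None:
            continue  # not a UID response line
        uid = match.group(1).decode("ascii")
        if uid not in known:
            known.add(uid)
            fresh.append(uid)
    return fresh
```

Keep in mind that IMAP UIDs are only stable while the folder's UIDVALIDITY value stays the same, so a stricter version would store that value alongside the known UIDs and start over if it changes.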

Update: Those of you who don't think reading RSS feeds by e-mail is the right way to go might like to know that I also have a simple web interface that accesses my news IMAP account (via browser or mobile phone) and lets me classify items.

So I have all the advantages of web-based feed readers, plus being able to archive everything for years (without, say, images going stale), perform full-text search (via Spotlight or anything else that talks to IMAP) and easily forward news items to my friends and colleagues.
