Making your mail sit up and beg

I’ve got a specific itch that needs scratching regarding e-mail management (i.e., I get too much of it, which is normal, but I want it properly contextualized and filed away, which is nearly impossible to achieve by conventional means), so I’ve been investigating both Apple’s Latent Semantic Mapping framework and building mail plugins with PyObjC in order to try to build an even better “related messages” feature than that which ships with Lion.

Which is a pretty tall order, but one that I suppose I will have fun trying to achieve with the minimal amount of code required.

And, in the great tradition of my iSync hacking (back when it was actually a part of the OS, oh, up until a few weeks ago), I’m delving into undocumented stuff.

LSM in a nutshell

No, not the theory. There’s a massive corpus of machine learning devoted to it, and Apple’s framework doesn’t (apparently) implement the most sophisticated techniques, so I’m just going to explain how to use it with this rather thin wrapper[1]:

  • LSMMapCreate to get a handle for a map
  • LSMMapStartTraining to prime it for loading data
  • LSMMapAddCategory to add categories and LSMMapAddText to add representative text for those categories
  • LSMMapCompile to compile the map and prime it for queries
  • LSMResultCreate to query it and iterate over the results

The sample code is pretty small, so I’m including it inline:

from Foundation import *
from ScriptingBridge import *
from lsm import *
import os, sys, codecs, gzip

loc = CFLocaleGetSystem()

categories = ['work', 'family', 'alumni']

queries = ['football economy','kids books dinner','linux cluster lunch']

def addToMap(m,category):
    c = LSMMapAddCategory(m)
    t = LSMTextCreate(None,m)
    f = + '.raw.gz', 'rb')
    text = CFSTR(
    r = LSMTextAddWords(t,text,loc,kLSMTextPreserveAcronyms)
    r = LSMMapAddText(m,t,c)
    return (r,c,category)

def queryMap(m,q):
    t = LSMTextCreate(None,m)
    r = LSMTextAddWords(t,CFSTR(q),loc,kLSMTextPreserveAcronyms)
    rows = LSMResultCreate(None,m,t,10,0)
    print q
    for i in range(0,LSMResultGetCount(rows)):
        c = LSMResultGetCategory(rows,i)
        s = LSMResultGetScore(rows,i)
        print c,s

# create a new map
m = LSMMapCreate(None,0)
for c in categories:
    print addToMap(m,c)
r = LSMMapCompile(m)

for q in queries:
    queryMap(m,q)

What the above does is load a set of compressed (but otherwise raw) text files that I built by scraping messages from my mailbox. That was merely a matter of grabbing the Unicode plaintext and stripping out any quotes, remnants of previous messages and signatures using a few simple rules and the Scripting Bridge, which is a great way to get Unicode out of Mail and into Python without messing about with MIME parsing, by the way.

Here’s the (rather naïve, but usable) way I’m doing that after grabbing a message via the Scripting Bridge:

import re

def extractPlaintext(message):
  """ Extracts all the plaintext from a given message, removing quoted portions """
  plaintext = message.content().get() # this gets us plaintext immediately
  # Drop very short lines, quoted portions and delimiters for signatures and suchlike
  valid = [x for x in plaintext.split('\n')
           if len(x) > 2 and x[0:2] not in ['> ','>>','--','==','__']]
  # Go through the whole thing and remove most non-alphanumeric characters
  pattern = re.compile(r'[^\w\s\-\@\.\/\:]', re.U)
  valid = [pattern.sub('', x.strip()) for x in valid]
  return u' '.join(valid)
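Once messages are cleaned up like this, getting them into the compressed files the first script loads is not much work. Here is a minimal sketch of how that could look, assuming one file per category named with the same `.raw.gz` suffix, one message per line, and UTF-8 encoding. `write_corpus` and `read_corpus` are hypothetical helper names of mine, not part of any framework:

```python
import gzip
import os

def write_corpus(directory, category, texts):
    """Write cleaned message texts for one category to <category>.raw.gz."""
    path = os.path.join(directory, category + '.raw.gz')
    # One message per line; UTF-8 is my assumption, not an LSM requirement
    data = u'\n'.join(texts).encode('utf-8')
    f = gzip.open(path, 'wb')
    try:
        f.write(data)
    finally:
        f.close()
    return path

def read_corpus(directory, category):
    """Read a category file back as a single Unicode string."""
    f = gzip.open(os.path.join(directory, category + '.raw.gz'), 'rb')
    try:
        return f.read().decode('utf-8')
    finally:
        f.close()
```

Keeping the files as plain compressed text means they can be inspected with `zcat` and rebuilt at will without touching the map itself.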

But back to the map. The larger script above then grabs a few random words and asks the map to match them to a category, printing out the results like so:

(0, 1, 'work')
(0, 2, 'family')
(0, 3, 'alumni')
football economy  
3 0.244184538722
2 0.21324416995
1 0.191463932395
kids books dinner
2 0.311590999365
3 0.175982058048
1 0.169467896223
linux cluster lunch
1 0.316538363695
3 0.126791742444
2 0.123535719514

The scores are pretty low for only a few words, but you get the idea. Tossing in a full e-mail message yields values above 0.5 and even better differentiation.
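To turn those scores into an actual filing decision, the rows need mapping back to category names. A minimal sketch, assuming category ids are handed out sequentially from 1 in the order the categories were added (which is what the output above suggests, not something I have confirmed in any documentation) and that the rows have already been collected as (id, score) pairs. `best_category` and the 0.05 margin are hypothetical:

```python
def best_category(rows, categories, min_margin=0.05):
    """Return the name of the top-scoring category, or None if the
    winner does not beat the runner-up by at least min_margin."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None  # too close to call
    # Category ids are assumed to be 1-based indexes into the list
    return categories[ranked[0][0] - 1]

categories = ['work', 'family', 'alumni']
# Scores copied from the sample output above
print(best_category([(1, 0.316538363695), (3, 0.126791742444), (2, 0.123535719514)], categories))  # work
print(best_category([(3, 0.244184538722), (2, 0.21324416995), (1, 0.191463932395)], categories))   # None
```

The second query is rejected because the top two scores are only about 0.03 apart, which seems a safer default for short inputs than filing on a near tie.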

Of course, doing proper language detection, stemming and saving/reloading the map is left as an exercise to the reader (I’m working on the former, as soon as I manage to get suitable pure Python libraries).

Building Mail Bundles for Lion

This was comparatively trickier, considering that Apple changed their take on bundles yet again and that the last time I bothered trying was back in Tiger or so.

As it happens, this article had enough for me to get up to speed on the basics, and then it was mostly a matter of reading up on the changes brought on in version 4 and figuring out the required UUIDs to add to the .plist for compatibility with version 5.

These are the required UUIDs for the current versions on 10.7.0 (and the way to update them upon the next upgrade):

$ defaults read /Applications/ PluginCompatibilityUUID
$ defaults read /System/Library/Frameworks/Message.framework/Resources/Info PluginCompatibilityUUID
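The same lookup can be scripted, which is handy for regenerating the list after an upgrade. A minimal sketch, assuming Python 3's `plistlib` and that you pass in the full path to each `Info.plist` yourself. `read_compatibility_uuid` and `collect_uuids` are hypothetical helpers:

```python
import plistlib

def read_compatibility_uuid(info_plist_path):
    """Return the PluginCompatibilityUUID from an Info.plist, or None."""
    with open(info_plist_path, 'rb') as fp:
        info = plistlib.load(fp)  # handles both XML and binary plists
    return info.get('PluginCompatibilityUUID')

def collect_uuids(paths):
    """Gather the UUIDs from several plists, skipping those without one."""
    uuids = [read_compatibility_uuid(p) for p in paths]
    return [u for u in uuids if u is not None]
```

Note that `defaults read` takes the path without the `.plist` extension, whereas `open` needs it. The resulting list is what goes into `SupportedPluginCompatibilityUUIDs`.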

Again, the code is small enough to just add inline. Here’s the py2app build script:

from distutils.core import setup
import py2app

plist = {
    'CFBundleGetInfoString':'TestPlugin 0.1',
    'SupportedPluginCompatibilityUUIDs': [
        # These cover many versions of
 plugin = [''],
 options=dict(py2app=dict(extension='.mailbundle', plist=plist))

…and the plugin itself:

from AppKit import *
from Foundation import *
import objc

MVMailBundle = objc.lookUpClass('MVMailBundle')
class TestPlugin(MVMailBundle):
    @classmethod
    def initialize(cls):
        # +initialize runs once, before the class receives its first message
        NSLog("TestPlugin registered with Mail")

I’m now looking at extending the above to handle incoming messages. Currently it just prints out a bunch of information I can get at from inside Mail while I learn more about what happens.

Putting the two things together is sure to take me some time yet, but doing training in the background by asking about available mailboxes and feeding text to a map seems fairly easy.
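For the background part, one plausible shape is a single worker thread that pulls (category, text) pairs off a queue. This is only a sketch under assumptions: `add_text` is a placeholder for whatever ends up wrapping `LSMMapAddText`, `start_trainer` is a hypothetical name, and I keep all map calls on one thread because I don't know whether the framework is safe to call from several:

```python
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def start_trainer(add_text):
    """Start a daemon thread that feeds (category, text) pairs to add_text.

    Returns the work queue and the thread; put None on the queue to stop.
    """
    work = queue.Queue()

    def run():
        while True:
            item = work.get()
            if item is None:      # sentinel: no more messages
                break
            category, text = item
            add_text(category, text)

    t = threading.Thread(target=run)
    t.daemon = True               # don't keep the host process alive
    t.start()
    return work, t
```

The plugin side would then only need to `put` each cleaned message on the queue as it walks the mailboxes, and compile the map once the worker has drained it.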

The tricky thing is going to be having a sensible UI (if necessary) and figuring out a way to add my own metadata to messages without breaking them.

I’ve been there before some five years ago, so I have a few ideas. Now all I really need is time to put all of this together.

  1. The wrapper works, but I ignored nearly everything in it except the framework declarations - don’t bother installing it, just get the source and extract those.