I’ve got a specific itch that needs scratching regarding e-mail management (i.e., I get too much of it, which is normal, but I want it properly contextualized and filed away, which is nearly impossible to achieve by conventional means), so I’ve been investigating both Apple’s Latent Semantic Mapping framework and building mail plugins with PyObjC in order to try to build an even better “related messages” feature than that which ships with Lion.
Which is a pretty tall order, but one that I suppose I will have fun trying to achieve with the minimal amount of code required.
And, in the great tradition of my iSync hacking (back when it was actually a part of the OS, oh, up until a few weeks ago), I’m delving into undocumented stuff.
LSM in a nutshell
No, not the theory. There’s a massive corpus of machine learning devoted to it, and Apple’s framework doesn’t (apparently) implement the most sophisticated techniques, so I’m just going to explain how to use it with this rather thin wrapper[1]:
LSMMapCreate to get a handle for a map
LSMMapStartTraining to prime it for loading data
LSMMapAddCategory to add categories and
LSMMapAddText to add representative text for those categories
LSMMapCompile to compile the map and prime it for queries
LSMResultCreate to query it and iterate over the results
The sample code is pretty small, so I’m including it inline:
from Foundation import *
from ScriptingBridge import *
from lsm import *
import os, sys, codecs, gzip

loc = CFLocaleGetSystem()
categories = ['work', 'family', 'alumni']
queries = ['football economy','kids books dinner','linux cluster lunch']

def addToMap(m,category):
    # register a new category and feed it the matching <category>.raw.gz file
    c = LSMMapAddCategory(m)
    t = LSMTextCreate(None,m)
    f = gzip.open(category + '.raw.gz', 'rb')
    text = CFSTR(f.read())
    f.close()
    r = LSMTextAddWords(t,text,loc,kLSMTextPreserveAcronyms)
    r = LSMMapAddText(m,t,c)
    return (r,c,category)

def queryMap(m,q):
    # wrap the query words in an LSM text and ask for up to 10 matches
    t = LSMTextCreate(None,m)
    r = LSMTextAddWords(t,CFSTR(q),loc,kLSMTextPreserveAcronyms)
    rows = LSMResultCreate(None,m,t,10,0)
    print q
    for i in range(0,LSMResultGetCount(rows)):
        c = LSMResultGetCategory(rows,i)
        s = LSMResultGetScore(rows,i)
        print c,s

# create a new map
m = LSMMapCreate(None,0)
LSMMapStartTraining(m)
for c in categories:
    print addToMap(m,c)
# compile the map so it can be queried
r = LSMMapCompile(m)
for q in queries:
    queryMap(m,q)
What the above does is load a set of compressed (but otherwise raw) text files that I built by scraping messages from my mailbox - that was a matter of merely grabbing the Unicode plaintext and stripping out any quotes, remnants of previous messages and signatures using a few simple rules and the Scripting Bridge - which is a great way to get Unicode out of mail and into Python without messing about with MIME parsing, by the way.
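The training script expects one `<category>.raw.gz` file per category. Here is a minimal sketch of how such files could be written and read back, in Python 3 syntax. The helper names `write_corpus` and `read_corpus` are made up for illustration, and the UTF-8 encoding and one-message-per-line layout are assumptions, not something the script above requires.

```python
import gzip
import os

def write_corpus(category, texts, directory='.'):
    """Write cleaned message texts to <category>.raw.gz, one message per line."""
    path = os.path.join(directory, category + '.raw.gz')
    with gzip.open(path, 'wb') as f:
        for text in texts:
            # assumption: UTF-8 on disk; the original script just reads raw bytes
            f.write(text.encode('utf-8') + b'\n')
    return path

def read_corpus(category, directory='.'):
    """Read a corpus file back and decode it to text."""
    path = os.path.join(directory, category + '.raw.gz')
    with gzip.open(path, 'rb') as f:
        return f.read().decode('utf-8')
```

Calling `write_corpus('work', list_of_cleaned_messages)` once per category would produce the three files the script loads.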
Here’s the (rather naïve, but usable) way I’m stripping out quotes and delimiters after grabbing a message via the Scripting Bridge:
import re

def extractPlaintext(message):
    """ Extracts all the plaintext from a given message, removing quoted portions """
    plaintext = message.content().get() # this gets us plaintext immediately
    # Now remove quoted portions and delimiters for signatures and suchlike
    valid = [x for x in plaintext.split('\n')
             if len(x) > 2 and x[0:2] not in ['> ','>>','--','==','__']]
    # Go through the whole thing and remove most non-alphanumeric characters
    pattern = re.compile(r'[^\w\s\-\@\.\/\:]', re.U)
    valid = [re.sub(pattern,'',x.strip()) for x in valid]
    return u' '.join(valid)
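To see what these rules keep and drop without involving Mail.app, the same filtering can be tried on a plain string. This is only a sketch in Python 3 syntax: `strip_quotes` and the sample message are hypothetical, made up for illustration.

```python
import re

# Same two-character prefixes the function above treats as quotes or delimiters
SKIP_PREFIXES = ('> ', '>>', '--', '==', '__')
# Same whitelist: word characters, whitespace and - @ . / :
PATTERN = re.compile(r'[^\w\s\-@./:]', re.U)

def strip_quotes(plaintext):
    """Apply the quote/delimiter rules to an already-extracted string."""
    valid = []
    for line in plaintext.split('\n'):
        # Skip very short lines and anything starting with a quote or delimiter marker
        if len(line) > 2 and line[:2] not in SKIP_PREFIXES:
            valid.append(PATTERN.sub('', line.strip()))
    return ' '.join(valid)

# Hypothetical sample message
sample = "Lunch on Friday?\n> Sure, where?\n-- \nJoe (joe@example.com)"
print(strip_quotes(sample))  # prints: Lunch on Friday Joe joe@example.com
```

Note that only the `-- ` delimiter line is dropped. The signature text after it survives, which is part of what makes the approach naïve.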
But back to the map. The larger script above then grabs a few random words and asks the map to match them to a category, printing out the results like so:
(0, 1, 'work')
(0, 2, 'family')
(0, 3, 'alumni')
football economy
3 0.244184538722
2 0.21324416995
1 0.191463932395
kids books dinner
2 0.311590999365
3 0.175982058048
1 0.169467896223
linux cluster lunch
1 0.316538363695
3 0.126791742444
2 0.123535719514
The scores are pretty low for only a few words, but you get the idea. Tossing in a full e-mail message yields values above 0.5 and even better differentiation.
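The results come back as numeric category ids, so a small helper can map them back to names using the tuples that addToMap returns. This is a sketch: `label_results` is a hypothetical name, and it assumes queryMap is changed to collect (category, score) pairs instead of printing them.

```python
def label_results(added, rows):
    """Map numeric category ids back to category names.

    added: the (status, category_id, name) tuples returned by addToMap
    rows:  (category_id, score) pairs, in the order the map returned them
    """
    names = {cat_id: name for _status, cat_id, name in added}
    # Unknown ids are labelled '?' rather than raising
    return [(names.get(cat_id, '?'), score) for cat_id, score in rows]

# Values copied from the sample output for 'football economy'
added = [(0, 1, 'work'), (0, 2, 'family'), (0, 3, 'alumni')]
rows = [(3, 0.244184538722), (2, 0.21324416995), (1, 0.191463932395)]

for name, score in label_results(added, rows):
    print('%-8s %.3f' % (name, score))
```

With the sample figures, the best match for "football economy" is the alumni category.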
Of course, doing proper language detection, stemming and saving/reloading the map is left as an exercise to the reader (I’m working on the former, as soon as I manage to get suitable pure Python libraries).
Building Mail Bundles for Lion
This was comparatively trickier, considering that Apple changed their take on bundles yet again and that the last time I bothered trying was back in Tiger or so.
As it happens, this article had enough for me to get up to speed on the basics, and then it was mostly a matter of reading up on the changes brought on in version 4 and figuring out the required UUIDs to add to the .plist for compatibility with version 5.
These are the required UUIDs for the current versions on 10.7.0 (and the way to update them upon the next upgrade):
$ defaults read /Applications/Mail.app/Contents/Info PluginCompatibilityUUID
2DE49D65-B49E-4303-A280-8448872EFE87
$ defaults read /System/Library/Frameworks/Message.framework/Resources/Info PluginCompatibilityUUID
1146A009-E373-4DB6-AB4D-47E59A7E50FD
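The same lookup could be scripted, which might help when regenerating the UUID list after an upgrade. The sketch below assumes Python 3's plistlib and that the Info.plist files still sit where they did on 10.7. `read_compat_uuid` and `current_uuids` are hypothetical helper names.

```python
import plistlib

# Paths taken from the `defaults read` commands, with the .plist suffix added.
# Assumption: these locations are only known to be right for 10.7.
INFO_PLISTS = [
    '/Applications/Mail.app/Contents/Info.plist',
    '/System/Library/Frameworks/Message.framework/Resources/Info.plist',
]

def read_compat_uuid(path):
    """Return the PluginCompatibilityUUID from an Info.plist, or None if absent."""
    with open(path, 'rb') as f:
        info = plistlib.load(f)
    return info.get('PluginCompatibilityUUID')

def current_uuids(paths=INFO_PLISTS):
    """Collect the UUIDs from every plist that exists and can be opened."""
    found = []
    for path in paths:
        try:
            uuid = read_compat_uuid(path)
        except OSError:
            continue  # file missing or unreadable on this system
        if uuid:
            found.append(uuid)
    return found
```

The list returned by `current_uuids()` could then be merged into the `SupportedPluginCompatibilityUUIDs` entry of the bundle's plist.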
Again, the code is small enough to just add inline. Here’s setup.py:
from distutils.core import setup
import py2app

plist = {
    'NSPrincipalClass':'TestPlugin',
    'CFBundleInfoDictionaryVersion':'6.0',
    'CFBundlePackageType':'APPL',
    'CFBundleName':'TestPlugin',
    'CFBundleSignature':'????',
    'CFBundleGetInfoString':'TestPlugin 0.1',
    'CFBundleVersion':'0.1',
    'CFBundleShortVersionString':'0.1',
    'SupportedPluginCompatibilityUUIDs': [
        # These cover many versions of Mail.app
        '0CB5F2A0-A173-4809-86E3-9317261F1745',
        '1146A009-E373-4DB6-AB4D-47E59A7E50FD',
        '1C58722D-AFBD-464E-81BB-0E05C108BE06',
        '225E0A48-2CDB-44A6-8D99-A9BB8AF6BA04',
        '2610F061-32C6-4C6B-B90A-7A3102F9B9C8',
        '2DE49D65-B49E-4303-A280-8448872EFE87',
        '2F0CF6F9-35BA-4812-9CB2-155C0FDB9B0F',
        '36555EB0-53A7-4B29-9B84-6C0C6BACFC23',
        '857A142A-AB81-4D99-BECC-D1B55A86D94E',
        '9049EF7D-5873-4F54-A447-51D722009310',
        '99BB3782-6C16-4C6F-B910-25ED1C1CB38B',
        'B3F3FC72-315D-4323-BE85-7AB76090224D',
        'B842F7D0-4D81-4DDF-A672-129CA5B32D57',
        'BDD81F4D-6881-4A8D-94A7-E67410089EEB',
        'E71BD599-351A-42C5-9B63-EA5C47F7CE8E',
    ]
}

setup(
    plugin = ['TestPlugin.py'],
    options=dict(py2app=dict(extension='.mailbundle', plist=plist))
)
…and the plugin itself:
from AppKit import *
from Foundation import *
import objc

MVMailBundle = objc.lookUpClass('MVMailBundle')

class TestPlugin(MVMailBundle):
    def initialize (cls):
        MVMailBundle.registerBundle()
        NSLog("TestPlugin registered with Mail")
    initialize = classmethod(initialize)
I’m now looking at extending the above to handle incoming messages - currently just printing out a bunch of information I can get at from inside Mail.app and learning more about what happens.
Putting the two things together is sure to take me some time yet, but doing training in the background by asking Mail.app about available mailboxes and feeding text to a map seems fairly easy.
The tricky thing is going to be having a sensible UI (if necessary) and figuring out a way to add my own metadata to messages without breaking them.
I’ve been there before some five years ago, so I have a few ideas. Now all I really need is time to put all of this together.
[1] The wrapper works, but I ignored nearly everything in it except the framework declarations - don’t bother installing it, just get the source and extract those.