# Patching .emlx files

Even amidst all the ruckus caused by my home renovations, my temporary loss of two other machines and a whole new set of personal logistics (besides work, of course), I can’t stop pondering solutions for information management – and that includes RSS.

Now, if you happen to recall my latest piece regarding my RSS setup, you’ll remember that I am still tackling the issue of how to go about doing Bayesian classification, and that I was using newspipe as an archiver.

A lot of people missed the point there and pointed out that Mail.app has built-in RSS support. That is correct, except that Mail.app does not store inline images or enclosures along with the feed items, which I find a rather myopic omission (just filed as #5777759), since both are absolutely essential for archiving.

Now, Mail.app stores its RSS items in .emlx format (which Jamie Zawinski documented), just like “ordinary” messages. And the format is pretty straightforward:

byte count for message as first line
MIME dump of message
XML plist with flags
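
To make that layout concrete, here is a minimal sketch (Python 3, standard library only) of splitting an .emlx blob into those three parts and gluing it back together. The names `split_emlx` and `join_emlx` and the sample data are made up for illustration; the only thing taken from the format is that the first line counts the bytes of the MIME dump.

```python
def split_emlx(raw):
    """Split raw .emlx bytes into (byte_count, message_bytes, plist_bytes)."""
    first_line, _, rest = raw.partition(b"\n")
    count = int(first_line.strip())
    # The count covers the MIME dump only; whatever follows is the plist.
    return count, rest[:count], rest[count:]


def join_emlx(message, plist):
    """Reassemble an .emlx blob, recomputing the byte count from the message."""
    return b"%d\n" % len(message) + message + plist


# Made-up sample, just to exercise the round trip.
sample_message = b"Subject: hello\n\nbody\n"
sample_plist = b"<?xml version=\"1.0\"?><plist><dict/></plist>\n"
sample = join_emlx(sample_message, sample_plist)
```

Working on bytes rather than decoded text matters here, since the count on the first line is a byte offset and not a character count.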


And the MIME dump invariably contains a part with well-formed HTML (at least the samples I looked at), with neat direct references to inline images and stuff.
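
Finding those references takes very little code. The following is a rough Python 3 sketch that uses only the standard library (`email` and `html.parser`) instead of Beautiful Soup; `ImageCollector`, `remote_images` and the sample message are hypothetical names and data for illustration.

```python
import email
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src of every <img> that points at an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if src.startswith(("http://", "https://")):
                self.urls.append(src)


def remote_images(message_bytes):
    """Return the remote image URLs in the first text/html part, if any."""
    message = email.message_from_bytes(message_bytes)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes base64 or quoted-printable transfer encoding
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            collector = ImageCollector()
            collector.feed(payload.decode(charset, errors="replace"))
            return collector.urls
    return []


# Made-up feed item for illustration.
sample = (
    b"Subject: test\n"
    b"Content-Type: text/html; charset=utf-8\n"
    b"\n"
    b'<p>Hi <img src="http://example.com/a.png"> '
    b'<img src="cid:local.png"> <img src="https://example.com/b.jpg"></p>\n'
)
```

Images that already use a `cid:` reference are skipped, so only the ones that still live on a remote server are returned.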

So I had one of those shower epiphanies: Why not parse the .emlx file, download the referenced images (since the URLs are low-hanging fruit), and add the images back into the .emlx file as inline attachments?

That way I could just use Mail.app (without newspipe in the middle) and run a simple archival script every now and then.

Lo and behold, after 20 minutes of Python coding (and thanks to the ineffable miracle that is Beautiful Soup), I have a little proof of concept that does just that: images in RSS items are downloaded and injected into a new MIME message, which is then written back into the .emlx file with the byte count updated to match.

And Mail.app seems to like it, too.

Update: Here’s the source code, after a few cleanups and some re-structuring towards making it a Python class that I can re-use later:

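    #!/usr/bin/env python
    # encoding: utf-8
    """
    emlx.py

    Created by Rui Carmo on 2008-03-03.
    """

    from BeautifulSoup import BeautifulSoup
    import os, re, codecs, email, urllib2
    from email.MIMEImage import MIMEImage
    from email.MIMEMultipart import MIMEMultipart

    # Message headers used by Mail.app that we want to preserve
    preserved_headers = [
        "X-Uniform-Type-Identifier",
        "Subject",
        "Message-Id",
        "Mime-Version",
        "Date"
    ]

    # Placeholder value: substitute the User-agent string Mail.app actually sends
    USER_AGENT = 'Mail.app'

    class emlx:
        """emlx parser"""
        def __init__(self, filename):
            """initialization"""
            self.filename = filename
            self.opener = urllib2.build_opener()
            # Mimic Mail.app User-agent
            self.opener.addheaders = [('User-agent', USER_AGENT)]

            # open the .emlx file as binary (and not using codecs) to ensure byte offsets work
            self.fh = open(self.filename,'rb')
            # the first line is the byte count of the MIME dump
            self.bytes = int(self.fh.readline().strip())
            # the next self.bytes bytes are the MIME message itself
            self.message = email.message_from_string(self.fh.read(self.bytes))
            # the remaining bytes are the .plist
            self.plist = self.fh.read()
            self.fh.close()

        def save(self, filename):
            """write the message back out, recomputing the byte count"""
            # flatten once (without a "From " envelope line) so that the
            # count and the written payload cannot disagree
            payload = self.message.as_string()
            fh = open(filename,'wb')
            fh.write("%d\n%s%s" % (len(payload), payload, self.plist))
            fh.close()

        def grab(self, url):
            """grab images (not very sophisticated yet, doesn't handle redirects and such)"""
            h = self.opener.open(url)
            # bare type/subtype, without any Content-Type parameters
            mtype = h.info().gettype()
            data = h.read()
            h.close()
            return (mtype,data)

        def parse(self):
            """rebuild the message around its first HTML part"""
            for part in self.message.walk():
                if part.get_content_type() == 'text/html':
                    self.rebuild(part)
                    return

        def rebuild(self,part):
            # parse the HTML (undoing base64/quoted-printable first)
            soup = BeautifulSoup(part.get_payload(decode=True))
            # strain out all images referenced by HTTP/HTTPS
            images = soup('img',{'src':re.compile('^http')})
            count = 0

            # prepare new MIME message
            newmessage = MIMEMultipart('related')
            # carry over the headers Mail.app relies on
            for name in preserved_headers:
                if self.message[name] is not None:
                    # avoid duplicates (MIMEMultipart already sets MIME-Version)
                    del newmessage[name]
                    newmessage[name] = self.message[name]

            attachments = []
            for i in images:
                # Grab the image
                (mtype, data) = self.grab(i['src'])
                # Build a cid for it
                subtype = mtype.split('/')[1]
                cid = '%(count)d.%(subtype)s' % locals()
                # Create and attach new MIME part
                # we use all reference methods to ensure cross-MUA compatibility
                image = MIMEImage(data, subtype,name=cid)
                image.add_header('Content-ID', '<%s>' % cid)
                image.add_header('Content-Location', cid)
                image.add_header('Content-Disposition', 'inline', filename=cid)
                attachments.append(image)
                # update references to images
                i['src'] = 'cid:%s' % cid
                count = count + 1
            # inject rewritten HTML first; str(soup) is UTF-8, so drop the old
            # transfer encoding and let set_payload pick a matching one
            del part['Content-Transfer-Encoding']
            part.set_payload(str(soup), 'utf-8')
            newmessage.attach(part)
            # now add inline images as extra MIME parts
            for a in attachments:
                newmessage.attach(a)
            # replace the message
            self.message = newmessage

    if __name__ == "__main__":
        a = emlx('320611.emlx')
        a.parse()
        a.save('injected.emlx')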


Right now, I’m considering tweaking the plist flags a bit, and since I absolutely loathe the bright blue header Mail.app uses to display feed items (which often hides large portions of item titles) I will be doing outright conversion to “normal” e-mail messages.
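
Since the trailer is a plain XML plist, tweaking it should not need much code. Here is a minimal Python 3 sketch using the standard `plistlib` module. The `flags` key name and the values below are placeholders for illustration and have not been checked against what Mail.app expects, and `update_plist` is a made-up helper name.

```python
import plistlib


def update_plist(plist_bytes, **changes):
    """Parse the XML plist trailer, apply changes, and serialize it again."""
    info = plistlib.loads(plist_bytes)
    info.update(changes)
    return plistlib.dumps(info, fmt=plistlib.FMT_XML)


# Made-up trailer; the "flags" key and its value are placeholders.
sample_plist = plistlib.dumps({"flags": 1, "subject": "hello"})
```

The result can then be written out after the MIME dump in place of the original trailer; the byte count on the first line is unaffected because it only covers the message.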

Plus, of course, I still need a decent way to invoke it upon an entire folder crammed with RSS items. That is easy enough to do, but I’d rather try to code something that can be re-used by other folk, and as such I’m looking into developing an Automator action for this.
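
The plain scripted version of that is short enough. Below is a sketch in Python 3 that walks a folder and hands every .emlx file to a callback; `process_folder` and the file names are hypothetical, and the callback is where something like the class above would be invoked.

```python
import tempfile
from pathlib import Path


def process_folder(folder, handler):
    """Call handler(path) for every .emlx file under folder; return the paths seen."""
    seen = []
    for path in sorted(Path(folder).rglob("*.emlx")):
        handler(path)
        seen.append(path)
    return seen


# Exercise it against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "feeds").mkdir()
    (root / "feeds" / "1.emlx").write_bytes(b"0\n")
    (root / "feeds" / "2.emlx").write_bytes(b"0\n")
    (root / "notes.txt").write_text("not a message")
    names = []
    found = process_folder(root, lambda p: names.append(p.name))
```

Running this against a live mail folder is best done on a copy, since Mail.app may be reading or rewriting the same files.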

Time (my scarcest resource) will tell if it’s doable. Still, I wonder why Apple doesn’t allow for archival of RSS items with inline images – it’s not as if they don’t have all the pieces (and Automator already has plenty of RSS support…).