Even amidst all the ruckus caused by my home renovations, my temporary loss of two other machines and a whole new set of personal logistics (besides work, of course), I can’t stop pondering solutions for information management – and that includes RSS.
Now, if you happen to recall my latest piece regarding my RSS setup, you’ll remember that I am still tackling the issue of how to go about doing Bayesian classification, and that I was using newspipe as an archiver.
A lot of people missed the point there and pointed out that Mail.app has built-in RSS support – which is correct, except that Mail.app does not store inline images or enclosures along with the feed items. I find that a rather myopic omission (just filed as Radar #5777759), since storing them is absolutely essential for archiving.
Now, Mail.app stores its RSS items in .emlx format (which Jamie Zawinski documented), just like “ordinary” messages. And the format is pretty straightforward:
- byte count for the message, as the first line
- MIME dump of the message
- XML plist with flags
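The three-part layout above can be sketched in a few lines. This is only a rough illustration in Python 3 (the listing further down is Python 2), and the helper names `split_emlx` and `join_emlx` as well as the sample data are made up for the example:

```python
import email


def split_emlx(raw):
    """Split raw .emlx bytes into (message_bytes, plist_bytes)."""
    # The first line holds the byte count of the MIME message that follows.
    first_line, _, rest = raw.partition(b"\n")
    length = int(first_line.strip())
    return rest[:length], rest[length:]


def join_emlx(message_bytes, plist_bytes):
    """Reassemble an .emlx file, recomputing the byte count."""
    header = str(len(message_bytes)).encode("ascii") + b"\n"
    return header + message_bytes + plist_bytes


# Made-up sample data, just to show the round trip
sample_message = b"Subject: hello\r\n\r\n<p>body</p>\r\n"
sample_plist = b'<?xml version="1.0"?><plist><dict/></plist>\n'

raw = join_emlx(sample_message, sample_plist)
message_bytes, plist_bytes = split_emlx(raw)
subject = email.message_from_bytes(message_bytes)["Subject"]
```

Working on bytes rather than decoded text matters here, since the count on the first line is a byte offset and not a character offset.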
And the MIME dump invariably contains a part with well-formed HTML (at least the samples I looked at), with neat direct references to inline images and stuff.
So I had one of those shower epiphanies: Why not parse the .emlx file, download the referenced images (since the URLs are low-hanging fruit), and add the images back into the .emlx file as inline attachments?
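To give an idea of how low-hanging those URLs are, here is a minimal sketch that collects them using only the Python 3 standard library. It is not what the script below does (that one relies on Beautiful Soup), and `ImageCollector` and `image_urls` are hypothetical names:

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag that points at HTTP or HTTPS."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src.lower().startswith(("http://", "https://")):
                self.urls.append(src)


def image_urls(html):
    """Return the remote image URLs referenced in an HTML string."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


sample_html = (
    '<p><img src="http://example.com/a.png">'
    '<img src="cid:1.png"/>'
    '<img src="https://example.com/b.jpg" /></p>'
)
found = image_urls(sample_html)
```

Images that are already referenced by something other than HTTP/HTTPS (such as `cid:`) are skipped, since there is nothing to download for them.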
That way I could just use Mail.app (without newspipe in the middle) and run a simple archival script every now and then.
Lo and behold, after 20 minutes of Python coding (and thanks to the ineffable miracle that is Beautiful Soup), I have a little proof of concept that does just that – images in RSS items are downloaded, injected into a new MIME message, and the whole thing is written back into the .emlx file, with the byte count updated accordingly.
And Mail.app seems to like it, too.
Update: Here’s the source code, after a few cleanups and some re-structuring towards making it a Python class that I can re-use later:
#!/usr/bin/env python
# encoding: utf-8
"""
emlx.py
Created by Rui Carmo on 2008-03-03.
Released under the MIT license
"""
from BeautifulSoup import BeautifulSoup
import os, re, codecs, email, urllib2
from email.MIMEImage import MIMEImage
from email.MIMEMultipart import MIMEMultipart
# Message headers used by Mail.app that we want to preserve
preserved_headers = [
    "X-Uniform-Type-Identifier",
    "X-Mail-Rss-Source-Url",
    "X-Mail-Rss-Article-Identifier",
    "X-Mail-Rss-Article-Url",
    "Received",
    "Subject",
    "X-Mail-Rss-Author",
    "Message-Id",
    "X-Mail-Rss-Source-Name",
    "Reply-To",
    "Mime-Version",
    "Date"
]

class emlx:
    """emlx parser"""

    def __init__(self, filename):
        """initialization"""
        self.filename = filename
        self.opener = urllib2.build_opener()
        # Mimic Mail.app User-agent
        self.opener.addheaders = [('User-agent', 'Apple-PubSub/59')]
        self.load()

    def load(self):
        # open the .emlx file as binary (and not using codecs) to ensure byte offsets work
        self.fh = open(self.filename,'rb')
        # get the payload length
        self.bytes = int(self.fh.readline().strip())
        # get the MIME payload
        self.message = email.message_from_string(self.fh.read(self.bytes))
        # the remaining bytes are the .plist
        self.plist = ''.join(self.fh.readlines())
        self.fh.close()

    def save(self, filename):
        fh = open(filename,'wb')
        # get the payload length
        bytes = len(str(self.message))
        fh.write("%d\n%s%s" % (bytes, self.message, self.plist))
        fh.close()

    def grab(self, url):
        """grab images (not very sophisticated yet, doesn't handle redirects and such)"""
        h = self.opener.open(url)
        mtype = h.info().getheader('Content-Type')
        data = h.read()
        return (mtype,data)

    def parse(self):
        for part in self.message.walk():
            if part.get_content_type() == 'text/html':
                # rebuild() replaces self.message, so stop at the first HTML part
                self.rebuild(part)
                return

    def rebuild(self,part):
        # parse the HTML
        soup = BeautifulSoup(part.get_payload())
        # strain out all images referenced by HTTP/HTTPS
        images = soup('img',{'src':re.compile('^http')})
        count = 0
        # prepare new MIME message
        newmessage = MIMEMultipart('related')
        for h in preserved_headers:
            # skip headers that this message does not carry
            if self.message[h] is not None:
                newmessage.add_header(h,self.message[h])
        attachments = []
        for i in images:
            # Grab the image
            (mtype, data) = self.grab(i['src'])
            # Build a cid for it
            subtype = mtype.split('/')[1]
            cid = '%(count)d.%(subtype)s' % locals()
            # Create and attach new MIME part
            # we use all reference methods to ensure cross-MUA compatibility
            image = MIMEImage(data, subtype,name=cid)
            image.add_header('Content-ID', '<%s>' % cid)
            image.add_header('Content-Location', cid)
            image.add_header('Content-Disposition','inline', filename=("%s" % cid))
            attachments.append(image)
            # update references to images
            i['src'] = '%s' % cid
            count = count + 1
        # inject rewritten HTML first
        part.set_payload(str(soup))
        newmessage.attach(part)
        # now add inline images as extra MIME parts
        for a in attachments:
            newmessage.attach(a)
        # replace the message
        self.message = newmessage


if __name__ == "__main__":
    a = emlx('320611.emlx')
    a.parse()
    a.save('injected.emlx')

Right now, I’m considering tweaking the plist flags a bit, and since I absolutely loathe the bright blue header Mail.app uses to display feed items (which often hides large portions of item titles), I will be doing outright conversion to “normal” e-mail messages.
Plus, of course, I still need a decent way to invoke it upon an entire folder crammed with RSS items. That is easy enough to do, but I’d rather try to code something that can be re-used by other folk, and as such I’m looking into developing an Automator action for this.
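The "easy enough" version of running it over a folder might look something like the sketch below (Python 3, standard library only). `find_emlx` and `process_folder` are hypothetical helpers, the handler is whatever you want to do with each file, and the demonstration runs on a throwaway directory with made-up file names rather than a real Mail.app folder:

```python
import tempfile
from pathlib import Path


def find_emlx(folder):
    """Return every .emlx file below folder, in sorted order."""
    return sorted(Path(folder).rglob("*.emlx"))


def process_folder(folder, handler):
    """Call handler(path) for each .emlx file; return the ones that failed."""
    failed = []
    for path in find_emlx(folder):
        try:
            handler(path)
        except Exception as exc:
            # Keep going, so one broken item does not stop the whole run
            failed.append((path, exc))
    return failed


# Demonstration on a temporary directory (file names are invented)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Messages").mkdir()
    (root / "Messages" / "1.emlx").write_bytes(b"0\n")
    (root / "Messages" / "2.emlx").write_bytes(b"0\n")
    (root / "notes.txt").write_text("not a message")
    seen = []
    failures = process_folder(root, lambda p: seen.append(p.name))
```

Collecting failures instead of raising seems like the safer choice for a batch job, given that a single unreachable image would otherwise abort the entire folder.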
Time (my scarcest resource) will tell if it’s doable. Still, I wonder why Apple doesn’t allow for archival of RSS items with inline images – it’s not as if they don’t have all the pieces (and Automator already has plenty of RSS support…).