The Little Python IMAP Archiver That Couldn't


Update: Bob Ippolito left a very helpful note pointing me to Bug# 1092502. I had come across this and discarded it since it pertained to Panther, and it's been open since 2004 (you would have expected this to have been fixed by now, right?). I've managed to work around the issue (see below) and will be testing it thoroughly...

What initially began as a simple script to back up my IMAP mailboxes to mbox format efficiently (by checking whether messages already existed, and so on) has turned out to be a frustrating excursion into the limitations of Python's built-in libraries and memory management.

Yes, yes, I know all about fetchmail, and incantations thereof like:

fetchmail --keep --fetchall --user username --folder "Personal/2006/Q3" \
  --proto imap --mda "formail >> Personal.2006.Q3.mbox" server.local

...after all, I've been using it for ages. I even have somewhere a no-nonsense "full dump" script that lists IMAP mailboxes and dumps the entire tree using a loop surrounding the above.

But that's 8 gigabytes in one shot, so my goal was to improve upon that in order to do incremental backups, not full mailbox dumps. Open the local .mbox files, get a list of Message-Ids, and only download "new" messages.
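The "which messages do I already have" step can be sketched with the standard library's mailbox module. This is a minimal illustration written for a current Python 3 rather than the 2.3 build discussed here; the helper names existing_message_ids and new_messages, and the shape of the server_ids mapping (Message-Id to IMAP message number), are assumptions made for the example and not the script's actual code.

```python
import mailbox


def existing_message_ids(mbox_path):
    """Return the set of Message-Id header values found in a local mbox file."""
    ids = set()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            message_id = message.get("Message-Id")
            if message_id:
                ids.add(message_id.strip())
    finally:
        box.close()
    return ids


def new_messages(server_ids, local_ids):
    """Keep only the server entries whose Message-Id is not on disk yet.

    server_ids is assumed to map Message-Id -> IMAP message number.
    """
    return {mid: num for mid, num in server_ids.items() if mid not in local_ids}
```

Only the entries returned by new_messages() would then need to be fetched and appended to the mbox file.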

Although the Python script I concocted works for small messages (to the point of my having already tested Mail.app import of generated files, which works flawlessly), I have been poring over mailing-list archives for the last day or so trying to figure out why Python's imaplib is so damnably hopeless in dealing with very large messages, and the hows and whys of the Mac OS X build's memory management.

To cut a long story short, the following piece of code:

CHUNK_SIZE = 1024*1024
msgsize = re.compile(r"\d+ \(RFC822\.SIZE (\d+)\)")

def updateMailbox(server, imap_folder, mailbox, messages, existing):
  server.select(imap_folder)
  # check if server supports PEEK
  # (bit redundant to do it every time, I know...)
  fetch_command = "(RFC822.PEEK)"
  response = server.fetch("1:1", fetch_command)
  if response[0] != "OK":
    fetch_command = "RFC822"
  else:
    fetch_command = "RFC822.PEEK"
  i = 0
  mbx = file(mailbox,'a')
  for id in messages:
    if id not in existing:
      typ, data = server.fetch(messages[id], '(RFC822.SIZE)')
      length = int(msgsize.match(data[0]).group(1))
      print "Message %d bytes in size" % (length)
      buffer = "From nobody %s\n" % time.strftime('%a %b %d %H:%M:%S %Y')
      mbx.write(buffer)
      buffer = ''
      for offset in xrange(1,length,CHUNK_SIZE):
        print "Grabbing %d to %d" % (offset, offset + CHUNK_SIZE)
        typ, data = server.partial(messages[id], fetch_command, offset, offset + CHUNK_SIZE)
        mbx.write(data[0][1])
        del data
        gc.collect()
      mbx.write('\n')
      i = i + 1
  mbx.close()
  print "Appended %d messages to disk file." % i

...blows up unceremoniously whenever I try to download a 5MB message from an IMAP server, with the Python interpreter gobbling up to three gigabytes of virtual memory.

Yes, you read that right: THREE GIGABYTES OF MEMORY. Un-fsck-ing believable.

So far, I've tried:

  • Doing the full-blown FETCH (duh).
  • Doing partial fetches with several different CHUNK_SIZEs.
  • Creating a new IMAP4 connection object (complete with a new LOGIN and SELECT) after del()ing the previous one for each message.

Tuning CHUNK_SIZE is pretty useless, since anything under 32768 bytes just takes ages to download average-sized messages and fragments memory to no end - using gc.collect() appears to mitigate the issue, but at the expense of it spending a while clearing out the mess.

So I'm beginning to wonder just how "partial" IMAP4.partial() really is, and whether I should just give up and check out the Perl IMAP modules (which I'd rather not, since the rest of the code I've written is working fine).
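One thing worth checking on that front: imaplib documents the call as partial(message_num, message_part, start, length), so the last argument is a byte count rather than an end offset. The listing above passes offset + CHUNK_SIZE in that position, which would ask for a larger slice on every pass. Here is a minimal sketch of computing non-overlapping (start, length) pairs, assuming octet positions start at 1 as in the loop above; chunk_ranges is a hypothetical helper, not something imaplib provides.

```python
def chunk_ranges(total_size, chunk_size):
    """Yield (start, length) pairs that cover total_size octets exactly once.

    start is 1-based (an assumption matching the loop in the listing above).
    """
    start = 1
    while start <= total_size:
        length = min(chunk_size, total_size - start + 1)
        yield start, length
        start += length


if __name__ == "__main__":
    for start, length in chunk_ranges(10, 4):
        print("start=%d length=%d" % (start, length))
```

Each pair could then be handed to server.partial(messages[id], fetch_command, start, length), so that no single request asks for more than one chunk.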

Here's the traceback with CHUNK_SIZE set to 1MB, for those of you so inclined:

Message 15849659 bytes in size
Grabbing 1 to 1048577
Grabbing 1048577 to 2097153
Grabbing 2097153 to 3145729
Grabbing 3145729 to 4194305
Grabbing 4194305 to 5242881
python(672) malloc: *** vm_allocate(size=1331200) failed (error code=3)
python(672) malloc: *** error: can't allocate region
python(672) malloc: *** set a breakpoint in szone_error to debug
Traceback (most recent call last):
  File "test.py", line 136, in ?
    tree(server)
  File "test.py", line 122, in tree
    updateMailbox(server, imap_folder, filename, messages, existing)
  File "test.py", line 81, in updateMailbox
    typ, data = server.partial(messages[id], fetch_command, offset, offset + CHUNK_SIZE)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/imaplib.py", line 550, in partial
    typ, dat = self._simple_command(name, message_num, message_part, start, length)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/imaplib.py", line 1000, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/imaplib.py", line 830, in _command_complete
    typ, data = self._get_tagged_response(tag)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/imaplib.py", line 931, in _get_tagged_response
    self._get_response()
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/imaplib.py", line 893, in _get_response
    data = self.read(size)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/imaplib.py", line 231, in read
    return self.file.read(size)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/socket.py", line 301, in read
    data = self._sock.recv(recv_size)
MemoryError

I've also found a number of ancient mentions of malloc() behavior changes on the Panther Python build and odd, inscrutable notes in cuneiform writing pertaining to the gc library, but so far nothing of real use, so Python wizards are free to toss a few eyes of newt my way...

The Fix

So, after Bob Ippolito's comment, I decided to take a look at _fileobject.read() in /System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/socket.py (starting at line #270), and rather than wantonly vandalize a system file, I decided to create my own version of read() and replace the method at runtime like so (portions omitted for clarity):

import socket

# Hideous fix to counteract http://python.org/sf/1092502
# (which should have been fixed ages ago.)
def _fixed_socket_read(self, size=-1):
  ...
      while True:
          left = size - buf_len
          recv_size = min(self._rbufsize, left) # this is the actual fix
          data = self._sock.recv(recv_size)
      ...
      return "".join(buffers)

# patch the method at runtime
socket._fileobject.read = _fixed_socket_read
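The idea behind the patch is simply to cap how much each recv() call asks for, instead of requesting the whole remaining size in one go. Below is a minimal sketch of that idea, written for a current Python 3 and independent of the 2.3 socket internals. The names read_exact and RecordingSocket, and the 8192-byte default, are hypothetical and exist only for this illustration; this is not the omitted portion of the patched method.

```python
import io


def read_exact(sock, size, max_recv=8192):
    """Read up to size bytes from sock, never asking recv() for more than max_recv."""
    buffers = []
    remaining = size
    while remaining > 0:
        data = sock.recv(min(max_recv, remaining))  # the capped request
        if not data:
            break  # peer closed the connection early
        buffers.append(data)
        remaining -= len(data)
    return b"".join(buffers)


class RecordingSocket:
    """Stand-in for a socket that records the size of every recv() request."""

    def __init__(self, payload):
        self._stream = io.BytesIO(payload)
        self.requested = []

    def recv(self, n):
        self.requested.append(n)
        return self._stream.read(n)
```

With a 20000-byte payload and the default cap, the stand-in sees three requests, none larger than 8192 bytes, rather than one 20000-byte request.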

So far the fixed script has handled messages 15MB in size (with CHUNK_SIZE set to 1MB) and seems to maintain a sensible memory footprint - I will update this post with a link to a cleaned-up version once I've done at least one full set of backups with it and added some niceties like command-line arguments.

