The Quest For Easier Information Management

After unwinding a bit, I decided to take a look at my personal ToDo list and see what needed doing. It has been duly cleansed and reorganized, but it turns out the list isn't even half the story - there are higher-level goals that never made the list but nevertheless deserve some attention.

All of them have to do with dealing with the deluge of information we're all exposed to these days. Even though there are lots of creative people out there building software crutches to enable our feeble primate brains to cope with it, I tend to prefer simple, utterly reliable and easy-to-use solutions to the latest and greatest do-it-all software tool.

Being a systems engineer by training, I always have the option of writing my own. But being a pragmatic person and having a lot better things to do with my time, I'll only do it if I have no alternative - and when I do, I keep it simple enough to manage.

But without further ado, here are the Four Quests I have embarked upon:

The First Holy Grail: The Chart Of Bookmarked Knowledge

My first quest was for the Holy Wiki, although I did not know it at the time. Since I'm prone to latching on to issues for weeks and am constantly undergoing context shifts (changing machines, picking up another project where I left off yesterday, etc.), I needed some way to manage bookmarks and assorted notes no matter where I was.

Very short-term stuff (like IP addresses, daily task lists and open issues) is dealt with by the time-honored tradition of using Notepad as a virtual Post-It note. Project stuff goes into a standard trouble-ticketing system, etc., but research (bookmarks, articles to read, notes on non-corporate stuff, etc.) was falling through the cracks.

After a couple of years, I'm now perfectly comfortable with using a Wiki for those things - I just keep adding to it, and it's available everywhere I have a network connection (which, these days, means literally anywhere, thanks to GPRS and my 7650).

Keeping corporate notes is still a pain, though (I'm probably more aware than most of the need for proper information security), and that's why I need a PDA again.

The Second Holy Grail: The Library Of Self

My second (and ongoing) quest has to do with managing both my absurdly vast e-mail archive and the increasing acreage of storage space I have. Both are extensive libraries of documentation, media and source trees that I occasionally have to wade through in search of specific items, and it's taken a while to get them even minimally organized.

The Bottomless IMAP Archive

My current e-mail archive consists of an IMAP server where I aggregate all my incoming mail (via a procmail/SpamAssassin combination). I've so far kept to the mbox format since it makes for more efficient storage and backup than maildir (and I'm the only user, so performance issues are irrelevant).

It is fast, completely platform-independent, easy to back up and searchable. It can, however, feel a bit like a bottomless pit if you want to find specific items and don't know if they're part of a specific project or mailing-list.
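To give an idea of what "searchable" means for a plain mbox file, here is a minimal sketch using Python's standard `mailbox` module. The function name `search_mbox` and the choice to match only on sender and subject are my own assumptions for illustration, not a description of how my server is actually set up.

```python
import mailbox


def search_mbox(path, term):
    """Return (sender, subject) pairs for messages whose sender or subject contains term."""
    term = term.lower()
    hits = []
    box = mailbox.mbox(path)
    try:
        for message in box:
            # Header values can be missing, so fall back to an empty string.
            sender = str(message.get("From", ""))
            subject = str(message.get("Subject", ""))
            if term in sender.lower() or term in subject.lower():
                hits.append((sender, subject))
    finally:
        box.close()
    return hits
```

This walks every message in turn, which is exactly why a big archive starts to feel bottomless: there is no index, just a linear scan.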

Using Zoë

Enter Zoë. It can best be described as "Google for your e-mail", but is quite a bit more than that. Besides providing full-text indexing and searching of e-mail messages via a browser, it also provides very useful attachment and contact lists for each message, group of messages or date.

Mine is set up to index my IMAP repository remotely, and has proven so useful I've set up another on my work machine for the sole purpose of tracking project documentation and progress.

It suffers from a single serious flaw, however: since it uses a maildir-like format to store messages, it eats up disk space like there is no tomorrow...

Harvesting Files

Files tend to be easy to find, and are relatively free from context (unless they are project reports and so on, but those are usually in my e-mail archive).

I've already decided that there is absolutely nothing one can do to deal with the hordes of assorted files and folders you can find scattered throughout your hard disk other than:

Create high-level categories like "Documents", "Music", "Photos", etc. and sub-folders by project, genre, date, etc.
Use plain, simple and brutal searching.
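The "plain, simple and brutal" part can be sketched in a few lines of Python. This is only an illustration of the idea (walk the tree, match file names), with a made-up function name `find_files`; it makes no attempt to look inside files, which is rather the point.

```python
import fnmatch
import os


def find_files(root, pattern):
    """Walk root and return paths whose file name matches a shell-style pattern, ignoring case."""
    pattern = pattern.lower()
    matches = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            # Lower-case both sides so "*.TXT" also finds "report.txt".
            if fnmatch.fnmatchcase(name.lower(), pattern):
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

Something like `find_files("Documents", "*report*")` is usually all it takes once the top-level categories are in place.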

I don't particularly care for the current trend towards content indexing in most operating systems, since it tends to add bloated, slow and OS-specific search methods. For now, Mac OS X does a reasonably good job at searching filesystems, even remote ones - at least as good as the latest incarnation of Windows file searching, but without the visual clutter and, best of all, without the grossly ineffective content indexer.

Central Storage Station

Another simple and effective strategy for finding stuff is putting it just in one place. Pretty obvious, I know, but it took me a while to make sure I was doing the right thing by hanging all my storage off one Windows box.

Yeah, Windows. I'm not going to ignore the obvious choice, for this is one of the areas where Windows XP shines. Not only does it provide extremely fast, stable and reliable SMB file services on top of NTFS, but I can increase storage space with just about any device under the sun and manage it remotely with Terminal Services.

All without endless fiddling with Samba, accented character mappings and other voodoo.

What about media?

Media types are best left to specific cataloguing applications, but I can't say I have much regard for either iTunes or iPhoto: iTunes insists on having a local music repository with everything (but I regularly use more than one machine, and syncing is a pain) and iPhoto (despite being an excellent way to import and pre-process photos) suffers from the same basic flaw.

So I just use my SliMP3 to organize my music and store all my photos on my home server in monthly folders.

The Third Quest: Killing the RSS Hydra

Since I started using RSS, I've cancelled just about all my e-mail newsletters. RSS is now highly popular and provides just about all the news I need without having to read messages or visit pages filled with junk advertising.

However, like e-mail, it is becoming more than a bit of a problem to manage. I subscribe to around 120 RSS feeds (on average), and I've found the main issues with managing those to be:

The sheer impossibility of keeping all my RSS readers in sync (even with OPML)
The lack of a standard archival mechanism for interesting items
The need to somehow classify information (490 posts a day begs for Bayesian classification)
The variety of formats (with/without post contents, images, etc.)
The likelihood that RSS feeds will, sooner or later, start carrying advertisements (so I'd best have some sort of filtering in place soon)
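As for what Bayesian classification of posts might look like, here is a toy naive Bayes classifier with add-one smoothing. It is a sketch under generous assumptions: the class name, the labels ("read" and "skip") and the whitespace tokenizer are all hypothetical, and a real filter would need far better tokenizing and a lot more training data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.doc_counts = Counter()              # label -> number of training posts

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        """Return the most probable label, or None if nothing has been trained yet."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Work in log space to avoid underflow on long posts.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Train it with a few posts per label via `train("read", ...)` and `train("skip", ...)`, then call `classify()` on each incoming item.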

I keep NetNewsWire around to remind me of how simple it all should be, but it doesn't address even half of the problems above. So I've started work on an RSS-to-email aggregator, but progress is slow.

I'm more and more convinced that this is the right approach, if only because having both an RSS aggregator and an e-mail client open is twice as distracting.
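The core of an RSS-to-email aggregator is small enough to sketch with the Python standard library alone. This is not my actual aggregator, just an illustration: it assumes a well-formed RSS 2.0 document, the function name `rss_to_messages` and the `rss@localhost` sender address are placeholders, and fetching the feed, tracking seen items and delivering the messages are all left out.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage
from email.utils import formataddr


def rss_to_messages(rss_xml, recipient):
    """Turn each <item> of an RSS 2.0 document into an EmailMessage."""
    channel = ET.fromstring(rss_xml).find("channel")
    if channel is None:
        return []
    # Collapse whitespace so titles are safe to use in headers.
    feed_title = " ".join(channel.findtext("title", default="Unknown feed").split())
    messages = []
    for item in channel.findall("item"):
        title = " ".join(item.findtext("title", default="(untitled)").split())
        link = item.findtext("link", default="").strip()
        body = item.findtext("description", default="")
        msg = EmailMessage()
        # The feed title becomes the sender name, so mail filters can sort by feed.
        msg["From"] = formataddr((feed_title, "rss@localhost"))
        msg["To"] = recipient
        msg["Subject"] = title
        msg.set_content(f"{body}\n\n{link}\n")
        messages.append(msg)
    return messages
```

Once posts are ordinary messages, archiving, searching and filtering come for free from the existing e-mail setup.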

The Fourth Quest: Shepherding Contacts With LDAP

Oooo, this is a complex one. The concept is simple: I want every single contact I have to be stored centrally and available to all my e-mail clients through LDAP.

However, the current state of affairs is laughable. I've already ranted about the apparent impossibility of moving vCard information to LDAP in a straightforward fashion, and so far I've come up with exactly zero ways of easily building an LDAP-based address book from vCard files.

(And by easily I mean drag-and-drop, not some half-baked command-line from hell.)

My current approach to this is to grab a simple LDAP engine, perform ruthless surgery on it until it knows nothing more than vCard fields, and have its back-end query a shared folder full of vCard files. Wanna update? Edit the file, save, done. Wanna export? Drag and drop. Wanna import? Drag and drop.
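The vCard side of that idea might look something like the sketch below. The field-to-attribute mapping (FN to cn, EMAIL to mail, TEL to telephoneNumber) is my own assumption of a sensible minimum, and the parser deliberately ignores folded lines, grouped properties and encodings, so treat it as a starting point rather than a vCard implementation.

```python
# Hypothetical mapping from vCard properties to common LDAP attribute names.
VCARD_TO_LDAP = {
    "FN": "cn",
    "EMAIL": "mail",
    "TEL": "telephoneNumber",
}


def vcard_to_entry(vcard_text):
    """Map a single vCard to a dict of LDAP-style attributes (each a list of values)."""
    entry = {}
    for line in vcard_text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        # Drop parameters such as ";TYPE=CELL" and normalize the property name.
        field = name.split(";", 1)[0].upper()
        attribute = VCARD_TO_LDAP.get(field)
        if attribute and value.strip():
            entry.setdefault(attribute, []).append(value.strip())
    return entry
```

A back-end built this way would simply re-read the shared folder and answer searches from the resulting dictionaries.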

Seems like a product someone might have designed already, right?

Wrong. There's no trace of anything similar, so I've been considering writing a simple LDAP-like daemon in Python. The only problem with that (besides my relative lack of experience in Python coding) is that the wire protocol specs in RFC 2251 have so much ASN.1 in them that even jaded telecoms engineers like myself feel the urge to spit.

(We used to call ASN.1 the "asinine syntax notation version 0.1", by the way...)
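For a taste of what that means on the wire, here is a minimal sketch of BER tag-length-value encoding, just enough to build an anonymous LDAPv3 simple bind request. The helper names are made up for this example, only non-negative integers and single-byte tags are handled, and nothing here decodes a reply.

```python
def ber_length(n):
    """Encode a length: short form below 128, long form (0x80 | byte count) otherwise."""
    if n < 0x80:
        return bytes([n])
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(raw)]) + raw


def ber_tlv(tag, value):
    """Wrap value bytes in a single-byte tag and a BER length."""
    return bytes([tag]) + ber_length(len(value)) + value


def ber_integer(n):
    """Encode a non-negative INTEGER (tag 0x02) in the fewest bytes."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # The extra byte keeps the top bit clear so the value is not read as negative.
    size = n.bit_length() // 8 + 1
    return ber_tlv(0x02, n.to_bytes(size, "big"))


def anonymous_bind(message_id):
    """Build an LDAPMessage holding a version 3 BindRequest with empty DN and password."""
    bind_request = ber_tlv(
        0x60,                     # [APPLICATION 0], constructed: BindRequest
        ber_integer(3)            # protocol version
        + ber_tlv(0x04, b"")      # OCTET STRING: empty bind DN
        + ber_tlv(0x80, b""),     # [0] simple authentication, empty password
    )
    # LDAPMessage ::= SEQUENCE { messageID, protocolOp }
    return ber_tlv(0x30, ber_integer(message_id) + bind_request)
```

`anonymous_bind(1).hex()` gives `300c020101600702010304008000`: fourteen bytes just to say hello, and every search filter and result entry nests the same way.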

In short, I have no real solution for this one right now - which makes it all the more interesting...

Update: After some pretty extensive Googling, I came across this minimal LDAP daemon written in Perl.

Tao of Mac