Gmail

So it looks increasingly like Gmail is not an April Fools' joke. Duh. Okay, let's regroup our neurons and reflect upon this.

Having been involved in more than a few e-mail projects myself (the first large one was 6-odd years ago, when the ISP I was working at started looking at building a distributed-storage qmail installation), the 1GB-per-user thing starts making sense only when it finally hits you that average disk sizes are ten to twenty times larger today than they were back then.

Sun, for instance (and to use a more recent example dear to my heart), used to sell SCSI storage arrays with 9GB hard disks. Today, you can build arrays with 300GB SATA disks for roughly the same cost, especially if you buy them in bulk. So that's over thirty times the single-drive capacity, and assuming the SCSI/SATA differences are also offset by a volume deal, it looks doable.

So that was my first nagging hint that it could be done affordably for a relatively large (say tens of thousands) user base. The years spent struggling against my meager 50MB corporate quota have obviously dulled my reasoning (but that makes sense for other reasons, and is not really the point here...).

Playing With Blocks

Let's think about how such a beast could be built. For the sake of completeness, I'm going to sketch a brief (quasi-historical) rundown of large-scale e-mail services - the following paragraphs are rife with oversimplification, but they should give you an idea of how things work.

Large e-mail installations used to comprise four sets of components, layered from top to bottom:

A load-balancing network (usually implemented using a set of Layer 7 switches like Nortel Alteons or similar devices)
SMTP and POP3/IMAP front-ends to deal with inbound/outbound mail and user sessions
An LDAP directory that stored e-mail aliases, passwords and mailbox location data (often also spread across several load-balanced boxes to lower read response times)
A "dumb" storage layer that the POP3/IMAP servers accessed directly (using NFS, for instance), mostly optimized for writing for the following reasons:
- You want delivery to be quick, so your SMTP servers can deal with a 10-to-1 inbound/outbound ratio.
- Users slurp e-mail via slow connections.
- Way back then, message deletion implied quite a few disk accesses.

In storage terms, the whole thing was geared towards holding e-mail until the user emptied the mailbox (99% of users relied on POP3 to retrieve messages and delete them from the server).

Things progressed quickly on the mail storage front, though, with the age-old mbox format (one file to hold all mail, which is nice for reading and appending new messages but lousy for deleting them) being replaced with maildir (one file per message inside a folder, which made deletion an atomic operation but implied maintaining a message index file per directory).
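To make the mbox/maildir difference concrete, here is a minimal sketch using Python's standard mailbox module. It is only an illustration of the two on-disk layouts, not how any ISP-scale store was built; the function name, the temporary paths and the sample messages are all made up for the example.

```python
import mailbox
import os
import tempfile


def compare_layouts(n_messages=3):
    """Store identical messages in mbox and maildir, delete one, report layout."""
    root = tempfile.mkdtemp()  # throwaway location, assumption for the demo
    mbox_path = os.path.join(root, "inbox.mbox")
    maildir_path = os.path.join(root, "Maildir")

    mbox = mailbox.mbox(mbox_path)                        # one file holds all mail
    maildir = mailbox.Maildir(maildir_path, create=True)  # one file per message

    for i in range(n_messages):
        text = f"From: a@example.com\nSubject: test {i}\n\nbody {i}\n"
        mbox.add(text)
        maildir.add(text)
    mbox.flush()

    new_dir = os.path.join(maildir_path, "new")
    files_before = len(os.listdir(new_dir))

    # maildir deletion: exactly one file is unlinked
    maildir.remove(maildir.keys()[0])
    files_after = len(os.listdir(new_dir))

    # mbox deletion: flush() rewrites the single file without the message
    size_before = os.path.getsize(mbox_path)
    mbox.remove(mbox.keys()[0])
    mbox.flush()
    size_after = os.path.getsize(mbox_path)
    remaining = len(mbox)
    mbox.close()

    return {
        "maildir_files_before": files_before,
        "maildir_files_after": files_after,
        "mbox_size_before": size_before,
        "mbox_size_after": size_after,
        "mbox_remaining": remaining,
    }
```

The point to notice is that the maildir delete touches one small file, while the mbox delete ends up rewriting the whole mailbox file.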

Then along came webmail, and everything changed.

Browsing Mail

Once people figured out you could easily hack together a few CGI scripts to access a mailbox, the mail architecture began to change. First you had a few HTTP servers alongside the POP3/IMAP boxes, then HTTP-to-IMAP software like Horde's IMP became popular.

Why? Because the whole process of reading mail changed once you stopped deleting it from the POP3 server to read on your computer. People needed folders (a native IMAP feature) to sort out their stuff, and deletion became rather less frequent, since webmail's most important feature was that you could get at it from anywhere and your stuff would still (mostly) be there.

Again, I'm oversimplifying, but the gist of the matter is that mail started accumulating on the servers, and the storage layer changed with it. Proprietary systems (like Sun's, Netscape's, Notes and Exchange) bet the farm on pseudo-database storage systems, using technology developed for corporate environments, where saving storage space and indexing content had quickly become both necessities (due to the increasing size of Office attachments) and value-added features.

Honey, I Blew Up The Server

So any ISP that wanted to offer webmail had to commit significant resources, and all sorts of vendor-driven consultancy services popped up to help ISPs deal with it. Most of their advice was basic common sense (and heavily biased), but one soon figured out that sizing factors for large e-mail installations can be split into two categories:

Processing and Throughput, sized mostly according to:
- Inbound and Outbound messages/second (which has an indirect impact on storage architecture to minimize access times)
- Simultaneous POP3/IMAP/HTTP/HTTPS sessions (not relevant for Google, since it's bound to be just an HTTPS auth and plain HTTP after that)
Storage, sized according to:
- Number of Active Users
- Average mailbox size
- Amount of available cash

The first category is all about network bandwidth, CPU usage and RAM. Inbound SMTP is very cheap by itself (i.e., if you discount anti-spam and anti-virus), but user sessions can be extremely resource-intensive (listing folder contents, for instance, is often one of the first things the HTTP-to-IMAP component needs to be optimized for).

Modern webmail systems have deprecated IMAP in favour of "lighter" (often proprietary) means to access the message store, and Exchange, for instance, now has a WebDAV interface to it - so the web front-ends often act as little more than HTTP proxies for portions of message transactions.

But that's not important right now. Let's focus on the second category.

Oooh, Look At The Size

Let's assume that bandwidth is not an issue for Google, that their web, LDAP and SMTP servers are more than adequately sized for the load, that their IPO will provide all the required cash, and that storage is the only real issue here.

How do you size storage on an e-mail service, then? Well, there are several issues here:

Not everyone will actually use their account.

Usage patterns will vary, and you can define a Hotmail-like policy and deactivate accounts after a given period of inactivity. At a typical ISP, the percentage of "inactive" or unused accounts can vary between 30% and 70% - it all depends on how you define an "active" user. Whatever the criteria, mailbox usage among active users is typically distributed as follows (this is a rule of thumb that, oddly, hasn't changed much in 6 years and seems to be independent of quota size):

 5% of users are at or near quota
30% are above half their quota
65% have little or no e-mail in their mailboxes

Webmail has made a difference to the composition of those 65%, true, but webmail users are not very keen on swapping large attachments - not even with broadband (which, by the way, tends to be asymmetrical, further deterring them from uploading large files).
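As a rough sketch of what that rule of thumb means for provisioning, the snippet below turns the 5/30/65 split into an expected fill ratio. The per-band fill fractions (100%, 75% and 5% of quota) are my own assumptions for illustration, as are the names USAGE_BANDS, expected_fill and provisioned_tb; only the user shares come from the list above.

```python
# (share of active users, ASSUMED fraction of quota actually used)
USAGE_BANDS = [
    (0.05, 1.00),  # at or near quota
    (0.30, 0.75),  # above half their quota (assumed midpoint)
    (0.65, 0.05),  # little or no e-mail (assumed)
]


def expected_fill(bands=USAGE_BANDS):
    """Average fraction of the nominal quota that is really occupied."""
    return sum(share * fill for share, fill in bands)


def provisioned_tb(users, quota_gb, bands=USAGE_BANDS):
    """Disk actually needed, in TB (1 TB = 1000 GB), for a given user base."""
    return users * quota_gb * expected_fill(bands) / 1000


print(f"expected fill: {expected_fill():.1%}")
print(f"10,000 users at 1GB: {provisioned_tb(10_000, 1.0):.2f} TB")
```

With these assumed fractions the fill ratio comes out at a little over 30%, which is in the same ballpark as the oversubscription figure used for the Hotmail estimate further down. Change the fractions and the answer moves accordingly.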

There is a significant difference between total message size (i.e., your quota) and average message size.

An example: my 5-year work archive, which is filled to the brim with big, unsightly Office attachments, is only 3.8GB, and I've archived 400MB of e-mail this quarter. Average message size is 200KB - thanks to a lot of documents, most prominent of which are two PowerPoint presentations of 11MB each (yes, both authored by me).

Average message size for a consumer e-mail service is between 8 and 32KB (yes, even considering all those MP3s and movies people are likely to be swapping via e-mail once Gmail gets off the ground), and 90% of messages are smaller than that. Spam, in particular, averages around 12KB these days, judging from my SpamAssassin logs.

You can use reference-based storage to try to store a single copy of each message if it's addressed to several people on the system.

This one is a no-brainer. Just using the Message-ID field saves a lot of bother.
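A minimal in-memory sketch of the idea, assuming the Message-ID header can be trusted as a unique key (in real life it can be missing, duplicated or forged, so a production store would likely want a content hash as well). The SingleInstanceStore class and its method names are hypothetical, invented for this example.

```python
from email import message_from_string


class SingleInstanceStore:
    """Keep one copy of each message body; mailboxes hold only references."""

    def __init__(self):
        self.blobs = {}      # Message-ID -> raw message text (stored once)
        self.refcount = {}   # Message-ID -> number of mailbox references
        self.mailboxes = {}  # user -> list of Message-IDs

    def deliver(self, raw, recipients):
        msg_id = message_from_string(raw)["Message-ID"]
        if msg_id is None:
            raise ValueError("message has no Message-ID header")
        if msg_id not in self.blobs:
            self.blobs[msg_id] = raw
            self.refcount[msg_id] = 0
        for user in recipients:
            self.mailboxes.setdefault(user, []).append(msg_id)
            self.refcount[msg_id] += 1
        return msg_id

    def delete(self, user, msg_id):
        """Drop one user's reference; free the blob when nobody is left."""
        self.mailboxes[user].remove(msg_id)
        self.refcount[msg_id] -= 1
        if self.refcount[msg_id] == 0:
            del self.blobs[msg_id]
            del self.refcount[msg_id]
```

A message sent to three local users costs one copy plus three references, and the copy only goes away when the last of them deletes it.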

You can gamble on the fact that it takes most people a long time to fill up 1GB.

At a rate of ten messages a day, most people will take years to fill the 1GB quota. Let's do the math, though, and consider three scenarios with a constant message/day rate:

Someone who uses Gmail to receive large attachments occasionally and has a small amount of traffic (5 msg/day).
A moderate user that subscribes to a few mailing-lists (20 msg/day) - remember that most mailing-lists these days have fairly small messages, so that significantly lowers the average message size.
A heavy user that uses Gmail to keep track of several mailing-lists (100 msg/day).

Let's assume that 1MB is roughly 1000K (heck, if hard disk manufacturers do it, we can do it too):

Messages/day:                5     20    100
Average message size (KB): 100     50     20
Average usage/day (MB):    0.5      1      2
Average usage/year (GB):   0.2    0.4    0.7
Years until 1GB reached:   5.5    2.7    1.4
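For anyone who wants to redo the table with their own numbers, here is a small sketch of the arithmetic. It assumes a constant message rate, a 365-day year and decimal units (1MB = 1000KB, 1GB = 1000MB); the function and scenario names are mine.

```python
def years_to_fill(msgs_per_day, avg_kb, quota_gb=1.0, unit=1000):
    """Years needed to fill the quota at a constant daily message rate.

    `unit` is the KB-per-MB and MB-per-GB factor (1000 here, 1024 for binary).
    """
    mb_per_day = msgs_per_day * avg_kb / unit
    gb_per_year = mb_per_day * 365 / unit
    return quota_gb / gb_per_year


# (messages/day, average message size in KB), taken from the table above
SCENARIOS = {
    "light": (5, 100),
    "moderate": (20, 50),
    "heavy": (100, 20),
}

for name, (msgs, kb) in SCENARIOS.items():
    print(f"{name:9s} {years_to_fill(msgs, kb):.1f} years")
```

Passing unit=1024 gives the binary-unit variant, which stretches each figure by a few percent.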

And yes, you can always point out that the real usage pattern will be nothing like this, but these numbers should give you pause, especially if you consider that there is likely to be some sort of prevention against abuse (such as an upper limit on message size, content, etc.) and that I left one very important scenario out -

Someone who receives an average of 10 messages a day at 32KB/message will take nine years to fill their mailbox.

Do the math. It's simple. And that's a lot of mail.

The Me Too Scenario

Let's look at the competition, then. Hotmail has roughly 110 million subscribers - or so we're told, since for all we know half of those could be dormant, or simply stored Passport/LDAP entries with no storage space attached.

Hotmail currently offers a comparatively insignificant amount of disk quota (2MB, with a 1MB cap on messages). Assuming they weren't overbooking their storage (which they surely are), that would be a little under a quarter of a petabyte (yeah, hundreds of terabytes, or hundreds of thousands of gigabytes).

It is also the world's largest sink hole for spam, but that's beside the point right now. If they were to provide 1GB mailboxes (and assuming they got their anti-spam act together), they would need a nominal capacity of oh... let's say a hundred petabytes (by the way, do you know what a zettabyte is?).

A pack of twenty 300GB drives can probably be driven down to around US$5000, so, even accounting for minimal expense on RAID, we're looking at roughly US$200 million in hardware costs. Yeah, it's a gross exaggeration, since I've not factored in oversubscribing (which probably shrinks the required disk capacity to 30%) or made a real attempt at estimating costs other than the disks themselves - let's say things even themselves out and it can be done at half that (US$100 million).
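Here is a back-of-the-envelope sketch of that estimate, so the knobs are visible. The subscriber count, pack size and pack price come from the text; the redundancy factor of 2 (think mirroring) is purely my assumption to show how bare disks could add up to something near the US$200 million figure, and the function name is made up. Units are decimal (1TB = 1000GB).

```python
import math


def disk_cost_usd(users, quota_gb, pack_drives=20, drive_gb=300,
                  pack_price_usd=5000, redundancy=1.0, fill=1.0):
    """Cost of the drive packs needed to hold users * quota_gb of mail.

    redundancy: multiplier for RAID/mirroring overhead (ASSUMED, not from the text)
    fill:       fraction of nominal quota really provisioned (oversubscription)
    """
    needed_gb = users * quota_gb * fill * redundancy
    packs = math.ceil(needed_gb / (pack_drives * drive_gb))
    return packs * pack_price_usd


HOTMAIL_USERS = 110_000_000

raw = disk_cost_usd(HOTMAIL_USERS, 1)
mirrored = disk_cost_usd(HOTMAIL_USERS, 1, redundancy=2.0)
oversubscribed = disk_cost_usd(HOTMAIL_USERS, 1, redundancy=2.0, fill=0.3)

for label, cost in [("raw disks", raw), ("mirrored", mirrored),
                    ("mirrored, 30% fill", oversubscribed)]:
    print(f"{label:20s} US${cost / 1e6:,.0f} million")
```

Bare disks alone come to roughly US$92 million, and doubling for redundancy gives about US$183 million. Oversubscribing at 30% pulls the disk bill well below the US$100 million guess, leaving room for everything that isn't a disk.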

Looks cheap, even for Bill. Expect Hotmail to try to outflank Google.

Why Google Will Probably Win

It's getting late (and it was a very exhausting week to be doing guesstimates about a webmail service at 1AM), so I'll make this one a shortlist:

It will be as great as using Zoë.
You can compress the messages - especially if you index them for searching, as Google aims to do.
They will not try to stick it to a particular operating system or browser (even though it currently doesn't work with Safari).
They will actually earn money with the ads (more on that another day).
They will probably do it better.

And that, as usual, will make all the difference.

Tao of Mac