The Follies, Part One

Around this time every year, nostalgia kicks in and I am reminded of my telco years. I’ve written about that period before (The Blue Packet was an allegory for the Vodafone 360 debacle and Template For Small Countries merely one of the poignant moments I witnessed there), but as it happens I have a tangible souvenir of those years:

Part of my home office decoration for the past nineteen years

This oddly shaped, unwieldy chunk of purple plastic (which is around 6cm to a side, if you’re wondering) has been on my office desk for nearly twenty years now, and despite it being fundamentally useless (it doesn’t even make for a good paperweight) I keep it as a daily reminder of how dogma and preconceived notions can turn well-meaning engineering into a massive iceberg of technical debt.

The Very Beginning

I joined Telecel (which was later to become Vodafone Portugal) on April 1st 1999, originally as a product manager for “internet addiction”, an amazingly prescient job description that at the time entailed finding ways to keep dial-up ISP customers online for as much time as possible. Call termination rates and per-minute charges accounted for a sizable chunk of revenue, and it was in our best interest to maximize the latter.

So my first afternoon on the job was a memorable meeting with the CMO and the Director for New Business (my boss, who was effectively the head of the ISP business unit and easily the best line and skip manager I ever had) to discuss tariff plans and which ISP services had the most potential to keep people online for long periods of time.

We had the usual staple services: e-mail, NNTP, personal web hosting and a plain portal with a built-in search engine, but at the time “internet addiction” essentially meant IRC and online gaming, because those had the added advantage of being low-bandwidth and having a nearly residual impact on peering costs to other ISPs (PIX, the Portuguese Internet eXchange, was hardly useful at the time, and Portugal Telecom charged heavily for traffic exchange, so that was a factor too).

The fun bit was that one of the reasons I got a Marketing job instead of going to Engineering with other friends who joined at the time was that I ran the Quake servers at my former employer, IP Global1.

And Quake also played a role in challenging the engineering status quo, but I’m getting ahead of myself here.

Flipping The NETC Bird

The NETC logo

The ISP we were putting together was called “Netcétera” (NETC for short), and bootstrapping it was the technical responsibility of one of the Telecel shareholders (AirTouch, which was later acquired by Vodafone). The name came with a whimsical tag line and logo, and since at the time you had to provide people with software to get connected (a SLIP/PPP dialer, a mail client and a browser), we actually shipped subscription CDs in a plastic bird cage, with the subscription papers in the bottom:

The famous birdcage.
Courtesy of Jorge Alves

Those were cute, innovative, very appealing for store displays, but a major pain to stock due to their size. Oh, and the guy who ended up building hybrid CD images for Mac and PC? Well, that was me too, since I had experience using IEAK (the Internet Explorer Administration Kit)2.

This was at the dawn of portals and search engines, and we were competing against Portugal Telecom’s Telepac (which had SAPO on its side as a prime online destination) and a bunch of other local ISPs.

The Menagerie Of The Rising Sun

AirTouch brought to Portugal a very experienced, senior engineer (whom I’ll call Calvin Brown) who had built several regional ISPs in the USA and was considered to be at the top of his game. And they also hired sysadmins, webmasters, DBAs and other roles from local ISPs, including one of my opposite numbers at Portugal Telecom—Telepac’s Quake server admin (whose handle was Underspell, and who became my partner in crime in this story).

And the culture (and budgeting) difference showed. Most other local ISPs had built their services atop cheap Linux or FreeBSD servers. Brown, however, swore by Sun hardware (and, unfortunately, also its ISP software suite) and followed the approach that a Portuguese national ISP should be at least as big as (if not bigger than) the ones he’d built before, so the whole thing was a massive multi-tier deployment, with separate front-end (web, SMTP, POP), mid-tier (search, support systems, etc.) and back-end systems (mail, databases, storage, etc.).

I remember poring over the network diagram and thinking that it was at least three times as many boxes as what I was used to. The naming convention was also somewhat whimsical: machines were named after marine life (cod, beluga, etc.) according to their size and “depth” in the architecture, and there were a lot of names.

But one of the things that stuck with me was that every tier had four separate LAN segments (public, internal, monitoring and management). Almost every machine had multiple Ethernet interfaces, and all those LANs, for all layers, talked to each other via a SunScreen firewall with multiple quad Ethernet cards.

Keep this last bit in mind, it’s important.

Hundreds Of Little Feet

The actual size of the whole thing only became apparent to me when I visited the data center and looked upon dozens of racks filled with a sizable sample of the Sun Microsystems portfolio.

And by sizable I mean that a typical low-end machine in that setup would be a Sun Enterprise 250, a boxy affair that was rack-mounted sideways. But there were also a lot of Sun Enterprise 4500s, voluminous mid-range modular beasts that shipped with four purple plastic feet.

The Sun Enterprise 4500 in all its glory. Note the feet.

I forget exactly how many there were, but what I do remember was that when I leaned onto a couple of stacked cardboard boxes near the datacenter entrance, my elbow slid into the topmost with a sound that was not unlike LEGO bricks.

I reached in and realized that the box was filled to the brim with E4500 plastic feet that had been removed for rack mounting. Both boxes, in fact—there were a couple of discarded bezels and other things, but they were both 90% full of… feet.

Having previously worked at a Sun reseller, I was completely floored by the amount of cash all those feet translated into. If it sounds like overkill, well, it was.

By now you’ve probably figured out that the chunk of plastic I kept for all these years is one of those feet. And there was so much more I could write about that datacenter tour3

Our estimate at the time was that if not for mail storage and databases (which would run better on Sun gear at that time), we could have run the whole thing on Compaq servers and taken up probably 50% less rack space and definitely much less cash4.

Launch Day

That was a day to remember, since we were under attack from the get-go. Every single script kiddie in the country (and a few pros in the competition) wanted to get a feel for what we were running, and it brought to light a number of harsh realities, the first of which was that it was trivial to bring down the entire portal by typing in a single letter in the search field and hitting Return.

I can’t find any quality screenshots from that time (I suspect there are plenty squirreled away in e-mails, but nothing decent online), but this one should give you an idea of what the home page looked like on launch day:

All that yellow is just glorious, isn't it? Sorry you have to squint at it.

All you had to do was run a few simultaneous searches for “a”, and the whole thing would just collapse. Sun Web Server gave up and crashed, database load went through the roof, etc.

And then someone decided to script the GET requests at scale, and the entire ISP went down for hours–including sign-up pages, service links, tariff plans, the works.

Why? Well, because there was no input validation, no real separation between the search engine and the front-end, and no caching at any layer. And I’m pretty sure of that, because after much debate I was one of those who ended up trying to patch it until we could get the local company who coded it (which shall remain nameless) to fix the “search engine”.

Everyone was going nuts at the time, and I somehow ended up in front of a terminal session to one of the front-ends looking at the code alongside Underspell and a couple of other people.

The first thing we did was to add an if (strlen(input) < 3) { return; }. It wasn’t any sort of decent fix, but merely the simplest thing we could do across servers in 5 minutes until someone else tried to reconfigure the firewall.

But then some bright spark (whose IP address we logged, and tracked back to a rather amusing domain name I knew very well) decided to search for %%%, and I dug further into the code to discover (with considerable horror) that the “search engine” was effectively doing a "SELECT * FROM pages WHERE text LIKE '%" . query . "%';" 5.
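For anyone who hasn’t been bitten by this, here is a minimal sketch of what a less fragile version of that lookup could look like. To be clear, this is an illustration and not what we (or the subcontractor) actually wrote: it uses Python and SQLite purely for convenience, and the function names, the result cap and the id column are all assumptions. Only the pages table, its text column and the three-character minimum come from the story above. The idea is simply to reject short queries, escape the LIKE wildcards so that a search for %%% matches literal percent signs, bind the pattern as a parameter instead of concatenating it, and cap the result set.

```python
import sqlite3

MIN_QUERY_LENGTH = 3   # same threshold as the emergency strlen() patch
MAX_RESULTS = 20       # hypothetical cap; the original had no LIMIT at all


def escape_like(term, escape="\\"):
    """Escape LIKE wildcards so that % and _ only match themselves."""
    return (
        term.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def search_pages(conn, query):
    """Return at most MAX_RESULTS (id, text) rows containing the query."""
    term = query.strip()
    if len(term) < MIN_QUERY_LENGTH:
        return []  # refuse the single-letter searches that took us down
    pattern = "%" + escape_like(term) + "%"
    # The pattern is bound as a parameter, never concatenated into the SQL.
    cursor = conn.execute(
        "SELECT id, text FROM pages WHERE text LIKE ? ESCAPE '\\' LIMIT ?",
        (pattern, MAX_RESULTS),
    )
    return cursor.fetchall()


def make_demo_db():
    """Build a tiny in-memory database (made-up rows) to try this out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany(
        "INSERT INTO pages (text) VALUES (?)",
        [("quake servers",), ("100% uptime",), ("tariff plans",)],
    )
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    for q in ("a", "%%%", "quake", "100%"):
        print(repr(q), search_pages(db, q))
```

None of this would have made the thing fast (a leading-wildcard LIKE still scans the whole table), which is why caching and keeping search separate from the front-end mattered just as much.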

It took us a few days to get it properly fixed, but being able to articulate what was wrong was one of the reasons I ended up taking a permanent seat in Engineering and spending more time there than in Marketing during those early years.

Breaking the Monolith With Quake

We had constant trouble with the Sun ISP software stack. Sun Web Server was the first thing to be replaced (with Apache), LDAP was a major pain (we used it as a subscriber database for unified dial-up, mail and FTP login) that we spent years trying to fix, and Sun Mail Server was the worst e-mail system I’ve ever dealt with with regard to management and troubleshooting.

I remember Brown (who was in his fifties, bespectacled, blue-eyed, and white-haired) becoming visibly agitated and red-faced whenever we suggested replacing any of the software with “unsupported” Open Source alternatives, and how subdued he was when Sun support kept failing to get things to work.

He considered Linux inferior and insecure, but begrudgingly put up with running a few machines at the outer layer, as long as they were isolated from the rest. After all, we had to run some things on Intel (for instance, we couldn’t run any game servers on Solaris), but that was always a sore point with Brown.

So we began setting up game servers (and other things) outside the “monolith”. By 2000, we had a separate portal for games. I paid (with my own money) for a separate domain name (one that became quite notorious, and has sadly fallen into disrepair as a malware haven these days), and we got to the point where we even ran our own e-mail and web hosting services–and quite a few extra projects started using those instead of the Sun ones.

A joint promotion with YORN (our mobile youth brand) for Internet World 2001.

But the games service was the main topic of dissent. Besides the internal discussions about branding and whatnot (it was one of those things that just grew into its own brand and became very popular indeed), the remaining Sun partisans were somewhat annoyed at us running it off a handful of Cobalt boxes (which someone wittily named piranhas)6.

The first two piranhas.
Courtesy of Bruno Rodrigues

That was until we started creating mailboxes for them to use when Sun Mail Server blew up (for testing, obviously…), and eventually managed to build upon that to sneak in software that actually worked, mostly by testing things on Linux, swapping some of it out on Sun boxes and letting things run for weeks without hitches before presenting Brown with a fait accompli.

But we never managed to touch the NNTP service–Brown was obsessed with News/NNTP, and managed those boxes himself.

Our shenanigans were tolerated because by that time we had fixed DNS, RADIUS and most critical services to a fair degree, but we had to put up with the Sun ISP software and architecture design for a long time, and it was a constant struggle.

At one time, we had so much unofficial support to get rid of Sun hardware that I even designed (and put up on the wall behind my desk) an “Eclipse” logo:

The Eclipse logo.

Black Hole Sun

But the SunScreen firewall took the cake when it came down to utter uselessness. Remember that it sat in between all the architecture tiers?

With four LANs per tier, the network design was so complex the thing had trouble with even basic routing, but when it got really confused and rebooted on its own, SunScreen would often set all the Ethernet ports to the same MAC address (zeroed out, for extra kicks).

That, in turn, brought down the entire ISP (except for the piranhas, which just kept going…) and forced us to manually reboot all the Cisco gear.

But I’m not even going to start on how much over-engineering and technical debt we got into on the networking side (at least in the ISP core, since the dial-up network was designed by my former colleagues and they ran things sanely). I’m just going to stop here and reminisce about having hair (and huge monitors) at the time:

At my desk, a few years later, as we were packing to move off the ISP floor.
Courtesy of Bruno Rodrigues

After all, this post is getting really long and we’re only halfway there–I expect to be able to tell the rest of this saga (or at least a few more highlights) in a few weeks…

But I think you get why I kept that plastic foot. It stands for all that we went through at the time, and is the reason why I vowed never to get into anything else I couldn’t understand (or design) from the ground up.

And, more to the point, to never throw money at problems until they go away.

Because they don’t–they just feed on it and become massively huge problems that can take years to fix.

I should know, right?

  1. (that later became Novis and was eventually merged into NOS, the third Portuguese telco by chronological order). ↩︎

  2. That I would end up writing about this while working at Microsoft a couple of decades later is ironic at best. ↩︎

  3. Another thing that really stuck with me was that all the boxes with redundant power supplies had both PSUs hooked up to the same power rail via Y cables. I have no idea who signed off on that… ↩︎

  4. Sun salespeople were on-site every week to dispel that notion, obviously. ↩︎

  5. Yep, no prepared statements. And I don’t recall there being a LIMIT clause when I looked at that code, either. And there were also some SQL injection attacks–the whole thing was a spectacular failure that proved to be a very valuable lesson to our management about subcontracting anything… ↩︎

  6. Somewhat amusingly, those Linux boxes outlasted all of the Sun machines, and I ended up bringing a couple of “dead” Cobalt boxes home with me and swapping parts to make a working one. This site actually ran on it for a year or so from my cable connection… ↩︎
