Dealing With a Million Legacy Files Every Day

There is a very frequent and particularly messy ETL scenario that I like to call “death by a million files”: large corporations find themselves needing to collect and process thousands of old-school CSV and XML files (or worse) and move them to the cloud, which is something that no traditional ETL tool (especially not a graphical one) can do efficiently.

Last week I sat down during one of those engagements and drew up a pipeline (partially) based on a set of Azure Functions that took those files off a blob store, parsed them, collated the data and inserted it into a SQL database.

A twisty little maze of queues, all alike

I was working my way through how many queues I’d need to thread data through a handful of different transformation functions until I remembered that Azure has an equivalent to AWS Step Functions in the form of Durable Functions, an extension to the Azure Functions stack (which, incidentally, is fully Open Source) that lets you string Functions together in complex orchestration patterns without having to manage the queueing yourself.
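To make the idea concrete, here is a minimal sketch of that orchestration style in plain JavaScript. It does not use the Durable Functions library: the `activities` table, the `pipeline` generator and the `run` driver are all hypothetical stand-ins of mine, meant only to show how a generator can describe "parse every file, collate, insert" while something else does the plumbing that would otherwise need queues.

```javascript
// Hypothetical stand-ins for activity functions; in a real deployment each
// would be a separate Function reading from or writing to storage.
const activities = {
  // Parse a tiny CSV document into an array of row objects.
  parseCsv: async (text) => {
    const [header, ...lines] = text.trim().split("\n");
    const keys = header.split(",");
    return lines.map((line) => {
      const values = line.split(",");
      return Object.fromEntries(keys.map((k, i) => [k, values[i]]));
    });
  },
  // Merge the parsed rows of every file into one flat list.
  collate: async (rowSets) => rowSets.flat(),
  // Pretend to insert rows into a database and report how many were written.
  insertRows: async (rows) => rows.length,
};

// The orchestrator only describes the order of the steps. It yields
// requests and receives results, and never does any I/O itself.
function* pipeline(files) {
  // Fan out: one parse request per file, resolved together.
  const parsed = yield files.map((text) => ({ activity: "parseCsv", input: text }));
  // Chain: collate the parsed rows, then insert them.
  const rows = yield { activity: "collate", input: parsed };
  return yield { activity: "insertRows", input: rows };
}

// Minimal driver playing the role of the runtime: it runs each yielded
// request (or array of requests) and feeds the result back in.
async function run(orchestrator, input) {
  const call = (req) => activities[req.activity](req.input);
  const it = orchestrator(input);
  let step = it.next();
  while (!step.done) {
    const req = step.value;
    const result = Array.isArray(req)
      ? await Promise.all(req.map(call))
      : await call(req);
    step = it.next(result);
  }
  return step.value;
}
```

Calling `run(pipeline, listOfCsvStrings)` resolves to the number of rows "inserted". The point of the split is that the orchestrator stays a plain description of the flow, which is roughly the division of labour that an orchestration service gives you in place of hand-managed queues.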

So yes, the serverless revolution is real. You can do pretty interesting things without touching a VM, and with a pretty great development cycle–git push your code, have the service set things up and fetch dependencies for you (like I do in piku, actually), and everything runs (and scales up and down) on-demand without having to mess around with cron jobs, and with nice monitoring to boot:

You can monitor Functions in real-time via Application Insights—CPU, RAM, exceptions, the works, as well as custom metrics.

From ETL to static site generation

This approach is extremely useful for a number of scenarios and I needed not only a re-usable sample but also something I could fiddle with at length (I still don’t like coding in NodeJS, but I have hope of turning that distaste into a useful skill), so after coding the first iteration of the PoC I decided to generalise things a bit and turn it into a simple, fully serverless static site generator:

It can sync with OneDrive, too - but that's another kettle of fish.

And so far, it’s working out really well, to the point where I’ve thrown most of this site at it and (other than lacking proper design and breaking in several pieces due to missing functionality) it can readily cope with thousands of files a second once it’s warmed up.

Considering that the code for the above currently clocks in at around 200 active, useful lines as of this writing, I would say this is pretty damn good bang for the buck (and it should cost much less than a couple of Euro to run per month, too).

Although there are a few constraints when compared with running a site generator in a standard execution environment (for instance, you can’t go off and enumerate other files in the same folder while rendering a page without some planning), it seems I finally found something that can, with a little more work, provide an interesting alternative to my current wiki engine.
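One way to do that kind of planning, sketched here under my own assumptions rather than taken from the actual generator, is to build an index of folders up front and hand it to whatever renders a page. The function names `buildFolderIndex` and `siblingsOf` and the sample paths are hypothetical.

```javascript
// Split a path into its folder part; files at the root get an empty folder.
function folderOf(path) {
  const cut = path.lastIndexOf("/");
  return cut === -1 ? "" : path.slice(0, cut);
}

// Group a flat list of paths by folder, so a renderer can look up siblings
// from the index instead of listing storage while it runs.
function buildFolderIndex(paths) {
  const index = new Map();
  for (const path of paths) {
    const folder = folderOf(path);
    if (!index.has(folder)) index.set(folder, []);
    index.get(folder).push(path);
  }
  return index;
}

// Return the other files that live in the same folder as `path`.
function siblingsOf(index, path) {
  return (index.get(folderOf(path)) || []).filter((p) => p !== path);
}

// Usage with made-up paths.
const index = buildFolderIndex(["blog/a.md", "blog/b.md", "notes/c.md", "index.md"]);
const siblings = siblingsOf(index, "blog/a.md");
```

The index is built once from the full file listing and is small enough to pass along as input to each rendering step, so no individual step has to enumerate a folder on its own.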

As long as I have the patience to re-code some of the smarter bits in JavaScript, that is…

The Mouth Of Hell

A peek down into the "Boca do Inferno" (Hellmouth) in Cascais.

Slow Summer

Posting has been slow for a number of reasons, so here is a short update on the whys and whats of it.


Notes On The Raspberry Pi 4

Of course I ordered one. I did it partly because I need to plan ahead for replacing my ageing ODROID U2, which has been the main house server for nearly six years (since it was the only ARM device I had with 2GB of RAM), and partly because my lab setup (which runs on a 3B+) is a little short on RAM.



Last Sunday I spent a few hours revisiting LISP-related languages, partly because I miss writing Clojure and partly because I wanted to do a relatively simple thing: issue a bunch of HTTPS requests, collate the resulting JSON data and then issue a final POST request. And I wanted to do it with an HTTP library that didn’t suck, in the smallest possible amount of space, and with a static binary. Two out of three wouldn’t be bad, right?


Making k3s Self-Aware

Over the past couple of bank holidays I’ve kept playing around with k3s, which is a fun way to take my mind off the end-of-fiscal-year madness that peaks around this time. In this installment, we’re going to start making it self-aware, or, at the very least, infrastructure-aware, which is the only real way to do truly flexible cloud solutions.


Catching Up

A great deal has happened this week, which kicked off with what was likely the most eventful Apple WWDC keynote in recent years. I have had little to no time to spend writing my thoughts about it, but an extended weekend is just the ticket for fixing that (as well as posting a few updates on multi-arch Docker images and my upcoming migration away from Dropbox).