The Strange Case Of The Illegal Instruction

Here’s something I do often, which is to go off into the weeds and try to fix something that suddenly broke. Except this time it’s not cloud (it’s on-premises) and it’s not someone else’s (it’s mine, through and through, although most of the dependencies and stack are in widespread use).

It’s also something pretty low-level and borderline inscrutable…

Problem Statement

For quite some time now, I’ve maintained my own Node-RED cross-platform container image, which I use pretty much everywhere.

I started building those images for , and they currently support arm32v6 (Pi Zero), arm32v7 (Pi 2/3+, my and most other modern 32-bit ARM chips), aarch64 (Pi 4 and some cloud hosts) and, of course, amd64.

But while upgrading my home automation setup on my late last Sunday, I had the nasty surprise of finding that my latest container crashed with an Illegal instruction (core dumped) message.

strace was useless, as it spewed too much information, and export NODE_DEBUG=module didn’t afford immediate clues, since whatever happened came immediately after the module loads. So I pulled the same container onto a 3B I have on my desk, and it ran fine.

Plus the ZigBee dongle I use mysteriously stopped working on one of the USB ports on Monday morning (always a nice way to start your week), so I started wondering what was wrong with the itself.

After all, it’s been in service , which is no mean feat, and it has a rather niche Exynos CPU, so I started suspecting it might just be obsolete by now.

To add to this predicament, although I usually keep older container versions around, Murphy’s Law (and excess confidence) had ensured I hadn’t done that this time around [1], so half of my home automation (the bits that relied on Node-RED, like ZigBee bridging to HomeKit) was dead, including part of my office lighting.

So I had to fix this ASAP.

Parenthesis: My Current Build Process

I’ve been doing ARM cross-compilation for a long time, but over the past couple of years the technique I’ve come to favor is to use qemu-user-static on Linux together with “wrap” containers. In short, this relies on registering QEMU on the host as a binfmt_misc handler and injecting the respective qemu-$(ARCH)-static binary into a base container [2], which is then invoked upon build [3].
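As a rough illustration of the “wrap” container idea, the sketch below generates the Dockerfile text for a base image that carries its own emulator. The helper name wrap_dockerfile, the Docker Hub image names and the /usr/bin destination are assumptions made for illustration, not necessarily what these builds actually use, and the sketch assumes the binfmt_misc handler is already registered on the host with that interpreter path.

```python
# Map each target architecture to the QEMU user-mode binary that emulates it.
# amd64 runs natively on the build host, so it needs no emulator.
# (Docker Hub publishes 64-bit ARM images under "arm64v8", i.e. aarch64.)
QEMU_BINARY = {
    "arm32v6": "qemu-arm-static",
    "arm32v7": "qemu-arm-static",
    "arm64v8": "qemu-aarch64-static",
    "amd64": None,
}


def wrap_dockerfile(arch: str, alpine_version: str = "3.13") -> str:
    """Return Dockerfile text for a base image that ships its own emulator."""
    qemu = QEMU_BINARY[arch]
    lines = [f"FROM {arch}/alpine:{alpine_version}"]
    if qemu is not None:
        # The emulator has to live inside the image at the path the host's
        # binfmt_misc entry points to (assumed here to be /usr/bin).
        lines.append(f"COPY {qemu} /usr/bin/{qemu}")
    return "\n".join(lines) + "\n"


print(wrap_dockerfile("arm32v7"))
```

Any later build stage that starts FROM such an image can then run ARM binaries during docker build, because the kernel hands them to the emulator that is already inside the container.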

Since armv6 has been falling out of favor (and I still have a Pi Zero W “running” Node-RED) and there were a number of years when getting an up-to-date, native armv6 or armv7 build of nodejs was almost impossible, I’ve taken to building the full stack (nodejs, Node-RED and all associated binaries), which takes a few hours for all the architectures [4].

It’s usually a “fire and forget” thing I do on one of my machines, and is usually faster and much less hassle than building on native hardware due to I/O performance alone. Still, it’s slow enough that I usually only have fresh containers four to eight hours later (depending on build).

Side Note About docker buildx

Before someone writes in asking why I don’t use docker buildx which would be a lot neater, allow me to say a few things:

  • Yes, I have tried it every few months.
  • No, it doesn’t work for me across both 4.5.x and 5.x kernels, on public cloud or on my i5/i7 hosts, because it keeps crashing randomly.
  • Yes, I tried it again this week (Docker version 20.10.5, build 55c4c88) and it bombed out on both the arm32v6 and arm32v7 builds.
  • When it works, I find it a lot slower than my current process (partially because it tries to do all the builds at once, and partially because it does not, by default, use all CPU cores).

Also, I lost an entire day (in wall clock time, around 2 hours spread throughout the day) trying to sort out the above during the rest of my investigation.

Binary Sleuthing

My initial suspicion (based on both the literal error message and past experience) was that a QEMU upgrade might have broken the final binaries somehow, and that the ‘s Exynos chip might not like the binary’s instruction set – this usually happened in the past with floating point extensions, and the reports itself as armv7l, with a typical binary stating support for VFP3 but not NEON:

# readelf -A /bin/sh
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv3-D16
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_enum_size: int
  Tag_ABI_VFP_args: VFP registers
  Tag_CPU_unaligned_access: v6

…and my custom node builds (which had previously worked) were earmarked as having NEON instructions:

# readelf -A /usr/local/bin/node 
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv3
  Tag_Advanced_SIMD_arch: NEONv1
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_enum_size: int
  Tag_ABI_VFP_args: VFP registers
  Tag_ABI_optimization_goals: Aggressive Size
  Tag_CPU_unaligned_access: v6
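Comparing these attribute dumps by eye gets tedious, so here is a minimal sketch of how the check could be automated. The function names are hypothetical, and the mapping from build attributes to /proc/cpuinfo flags (VFPv3 to vfpv3, NEON to neon) is a deliberate simplification that only covers the two attributes discussed here.

```python
import re


def parse_build_attributes(readelf_output: str) -> dict:
    """Turn the text printed by `readelf -A` into a {tag: value} dict."""
    attrs = {}
    for line in readelf_output.splitlines():
        match = re.match(r"\s*(Tag_\w+):\s*(.*\S)", line)
        if match:
            attrs[match.group(1)] = match.group(2).strip('"')
    return attrs


def cpu_features(cpuinfo_text: str) -> set:
    """Return the flags on the first 'Features' line of an ARM /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("Features"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_features(attrs: dict, features: set) -> set:
    """List CPU flags the binary's attributes imply but the CPU does not report."""
    required = set()
    # Assumption: any VFPv3 variant (including VFPv3-D16) needs the vfpv3 flag.
    if attrs.get("Tag_FP_arch", "").startswith("VFPv3"):
        required.add("vfpv3")
    # Assumption: any Advanced SIMD attribute needs the neon flag.
    if attrs.get("Tag_Advanced_SIMD_arch", "").startswith("NEON"):
        required.add("neon")
    return required - features
```

In practice the first argument would be the captured output of readelf -A for a given binary and the second the contents of /proc/cpuinfo on the target board. Keep in mind that build attributes only record what the toolchain was told it could use, so an empty result does not prove a binary is safe to run.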

Having dealt with compiler vagaries before, I knew that this might have to do with either compiler defaults or auto-detected QEMU features, so I decided to go back and check exactly where in my build process this had been broken.

But first, I tried installing and running nodejs 14.16.0 in a stock alpine 3.13.4 container on the :

# readelf -A /usr/bin/node 
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv3
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_enum_size: int
  Tag_ABI_VFP_args: VFP registers
  Tag_ABI_optimization_goals: Aggressive Size
  Tag_CPU_unaligned_access: v6

And guess what, it crashed immediately upon execution (even without any input):

$ /usr/bin/node 

# Fatal error in , line 0
# unreachable code
#FailureMessage Object: 0xbec1220c
Illegal instruction (core dumped)

So the next step was to downgrade to Alpine 3.12 (which ships with nodejs 12.21.0). That worked, but, like my custom builds, crashed when running Node-RED.

Persons of Interest

Assuming my custom builds were somehow magically better than stock alpine packages, there were multiple suspects that I could try in descending order of complexity/overhead:

  • The alpine base images themselves, which I had bumped to 3.13 recently, and might have different compiler defaults (I have to use alpine as ubuntu doesn’t support arm32v6).
  • The qemu-user-static version, which I had also updated to 5.1.0.
  • nodejs itself (then at 14.15.3), which is finicky as heck to build and highly sensitive to compiler options.
  • And maybe, just maybe, Node-RED itself and its modules.

Since alpine 3.13.4 now ships with 14.16.0, I decided to (for science) try using that instead of my own custom builds as well.

The Hours

So I started trying different builds – which took me around 15 minutes every few hours to start and later pull down to the and a to test.

This added up to a lot of wall time (in fact, this took me the entire week to sort out) [5].

A great (but perplexing) find was that my custom arm32v6 containers ran perfectly (even if slower) on the , which allowed me to get my home automation back up on Monday afternoon.

So I tagged it as node-red/stable on docker-compose and started going through a number of combinations:

Architecture | Alpine | NodeJS build | NodeJS version | Node-RED version | QEMU version | Result
arm32v6      | 3.12   | custom       | 14.15.3        | 1.2.9            | 5.1.0-8      | Works, "new" build with full regression.
             |        |              |                |                  |              | Works, "old" build that was still on Docker Hub
             | 3.13.4 |              | 14.16.0        |                  | 5.2.0-2      | Works again, so this has to be related to arm32v7 somehow
arm32v7      | 3.12   |              | 14.15.3        | 1.2.8            | 4.0.0-2      | Crashes, binaries have VFP3 but no NEON, works on Pi
             |        |              |                | 1.2.9            | 5.1.0-8      | Crashes, binaries have VFP3 and NEONv1, works on Pi
             |        |              |                |                  |              | Crashes, works on Pi. "new" build with full regression. Time to try other tactics
             | 3.13.4 | alpine       | 14.16.0        |                  |              | Crashes, works on Pi. So built-in packages aren't the way (3.12 has Node 12)
             |        | custom       |                |                  | 5.2.0-2      | Crashes, works on Pi, so updating QEMU doesn't fix it

(A blank cell means the value is unchanged from the row above.)

And when I wrote “Crashes”, I meant that it crashed while running Node-RED, for I soon realized that my custom nodejs builds would run fine (or at least give me a usable prompt) nearly all the time.

This took me the best part of four days, and by that time I was starting to think there had to be some other cause–after all, nodejs is a major pain to build, but I was becoming skeptical of it being the culprit in all cases, as well as either alpine or qemu.

The Weeds

Despite the failure of the alpine stock build, my suspicions turned to the extra modules I build and deploy with Node-RED, many of which (like most of npm, really) are poorly maintained one-offs with weird dependencies.

Any one of these might tip nodejs over during startup:

    "dependencies": {
        "@node-red-contrib-themes/midnight-red": "1.4.7",
        "node-red-admin": "0.2.7",
        "node-red-contrib-dir2files": "0.3.0",
        "node-red-contrib-fs-ops": "1.6.0",
        "node-red-contrib-homebridge-automation": "0.0.79",
        "node-red-contrib-httpauth": "1.0.12",
        "node-red-contrib-lgtv": "1.1.0",
        "node-red-contrib-light-scheduler": "0.0.17",
        "node-red-contrib-linux-diskio": "0.2.4",
        "node-red-contrib-linux-memory": "0.8.4",
        "node-red-contrib-linux-network-stats": "0.2.4",
        "node-red-contrib-meobox": "1.0.0",
        "node-red-contrib-moment": "4.0.0",
        "node-red-contrib-msg-speed": "2.0.0",
        "node-red-contrib-os": "0.2.0",
        "node-red-contrib-persist": "1.1.1",
        "node-red-contrib-redis": "1.3.9",
        "node-red-contrib-wemo-emulator": "1.0.1",
        "node-red-dashboard": "2.28.2",
        "node-red-node-base64": "0.3.0",
        "node-red-node-daemon": "0.2.1",
        "node-red-node-msgpack": "1.2.1",
        "node-red-node-prowl": "0.0.10",
        "node-red-node-pushover": "0.0.24",
        "node-red-node-rbe": "0.5.0",
        "node-red-node-smooth": "0.1.2",
        "node-red-node-sqlite": "0.6.0",
        "node-red-node-tail": "0.3.0",
        "node-red-node-ui-list": "0.3.4",
        "node-red-node-ui-table": "0.3.10",
        "node-red-node-ui-vega": "0.1.3",
        "node-red": "1.2.9"

So I pulled up my commit logs and started looking at the changes I had made to the bundles my build ships with (many of them prompted by dependabot, which is great to keep track of updates), and ranked them according to criticality, hackiness and complexity of their dependencies:

  • sqlite, which is one of the largest single dependencies, takes a long time to build as a module and breaks on musl around once a year anyway.
  • cheerio, which brings a lot of baggage with it and that I actually stopped using due to it crashing .
  • ssdp-discover, which is fairly low level and that I relied on (and would tie in nicely with crashing during or just after module loads).
  • lgtv, which was essential for me to automate my LG TV before it got basic HomeKit support.

These had also been recently updated, so they were the most likely suspects.

I started doing (much shorter, but still slow) builds reverting (or removing) each of them from the bundles.

And guess what, it wasn’t any one of them specifically. Some combinations of modules worked, others didn’t.

I ended up just bisecting package.json and doing a sort of binary search until I had removed a particular set of Linux OS statistics modules that I use to build CPU and I/O charts (node-red-contrib-linux-*).

I was quite thoroughly convinced they were to blame, as they crashed the arm32v7 build but yielded a working one when removed. But I then did another test: I tried adding another complex module to the mix (node-red-contrib-chart-image, which has a bunch of cairo-related dependencies), and lo and behold, the resulting arm32v7 container crashed again upon execution (and yes, arm32v6 worked fine).
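For reference, the kind of binary search over package.json described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: still_crashes is a hypothetical callback standing in for "rebuild the container with only these dependencies and see if it dies", and the search assumes a single culprit.

```python
def bisect_dependencies(deps: dict, still_crashes) -> list:
    """Narrow a crashing dependency set down by repeatedly halving it.

    deps maps package names to versions (as in package.json).
    still_crashes(subset) is a hypothetical callback that returns True
    when a build containing only `subset` crashes.
    """
    names = sorted(deps)
    while len(names) > 1:
        half = len(names) // 2
        first, second = names[:half], names[half:]
        if still_crashes({name: deps[name] for name in first}):
            names = first
        elif still_crashes({name: deps[name] for name in second}):
            names = second
        else:
            # Neither half crashes on its own, so the failure depends on
            # a combination (or sheer number) of modules, not on one.
            break
    return names


deps = {"module-a": "1.0.0", "module-b": "1.0.0",
        "module-c": "1.0.0", "module-d": "1.0.0"}

# A single bad module is found quickly...
print(bisect_dependencies(deps, lambda subset: "module-c" in subset))
# ...but a "too many modules" failure leaves the whole set standing.
print(bisect_dependencies(deps, lambda subset: len(subset) >= 3))
```

The second call shows why this approach stalls here: when the crash is tied to how many modules are loaded rather than to one of them, no half reproduces it and the search cannot narrow anything down.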

Partial Conclusions

I’m now pretty sure that there is something wrong with nodejs 14 on arm32v7 (at least on the ), as both my custom builds and the builds that ship with alpine 3.13.4 crash on the at various points.

This might be specific to alpine and the Exynos chip, but I don’t have enough data yet. I’d need to try 3 or 4 more builds to make sure and avoid false positives.

But whatever is happening is definitely triggered by the extension modules I’m loading into Node-RED, as nodejs seems to fail whenever I cross some kind of critical threshold of loaded modules.

Plus I had at least one instance where Node-RED loaded completely and flows ran for a couple more seconds (not sure if that was just slowness in hitting whatever made it crash, as the has a lot of stuff running on it).

Next Steps

Since I really want to figure out what is going on, I’m now looking at things like default stack sizes and other things I can shove into CFLAGS. But since I have to go on with my life (and really wanted to have spent this week’s evenings messing with ), there will be a few changes:

  • Until I sort this out, I’ll be running arm32v6 builds on the .
  • I’ll be thoroughly testing the arm32v6 version of the native nodejs 14.16.0 package from alpine 3.13.4 for Node-RED, homebridge, etc. and skipping my custom builds whenever possible, because they take too long and add extra testing and uncertainty to the mix.
  • I’ll also start planning to phase out the in favor of something else–probably a , of which I have two, and which are quite likely to work with every single amd64 container I ever build.

Using a is extremely unlikely, as what is running on the is there because:

  • It had 2GB of RAM eight years ago, and most of my can’t fit the workloads in RAM.
  • It has built-in eMMC storage that can handle a significant amount of writes, and I don’t want to have the same containers either pounding on an SD card or forcing me to “invest” in a with SSD support.

I might also try debian or ubuntu as a base, but that will mean orphaning the older Pi setups, so I’m not really keen on it.

Lessons Learned

  • Being an early adopter will always come back to bite you (even if it’s eight years later).
  • Before replacing “production” containers in one-of-a-kind hardware, tag the current version and keep it around until you’re happy the new version works.
  • Use tried and tested builds whenever possible (my custom builds made sense when ARM was unfashionable, but I can now get fresh, LTS versions of nodejs, so it’s time to phase them out).
  • It’s not always DNS (this is probably the biggest takeaway, for all of those people out there who like to say things like “It’s always DNS”).
  • Computers are finicky and temperamental.
  • nodejs doubly so, in any form.

Well, maybe not all the above are true.

But this turned out to be a pretty intense week, and I haven’t written up the half of it, so take it with a grain of salt.

  1. One of the reasons this happened was that the was short on storage, so I removed the old containers. But it’s still no excuse. ↩︎

  2. Shipping the emulator with the containers only adds around 10MB, so it isn’t really a problem. ↩︎

  3. There are ways to temporarily inject the binary, but they don’t work with docker build, only with docker run. ↩︎

  4. This makes it a bit of a pain to do in free, public CI/CD systems (which are usually capped at less than an hour per run). ↩︎

  5. A typical arm32v7 build of Node-RED, not including nodejs, takes 2100 seconds, and that’s the shortest iteration cycle. ↩︎
