Here’s something I do often, which is to go off into the weeds and try to fix something that suddenly broke. Except this time it’s not cloud (it’s on-premises) and it’s not someone else’s (it’s mine, through and through, although most of the dependencies and stack are in widespread use).
It’s also something pretty low-level and borderline inscrutable…
Problem Statement
For quite some time now, I’ve maintained my own Node-RED cross-platform container image, which I use pretty much everywhere.
I started building those images for home automation, and they currently support `arm32v6` (Pi Zero), `arm32v7` (Pi 2/3+, my ODROID-U2 and most other modern 32-bit ARM chips), `aarch64` (Pi 4 and some cloud hosts) and, of course, `amd64`.
But while upgrading my home automation setup on my ODROID late last Sunday, I had the nasty surprise of finding that my latest container crashed with an `Illegal instruction (core dumped)` message. `strace` was useless because it spewed too much information, and `export NODE_DEBUG=module` didn’t yield immediate clues either, since whatever happened took place right after the modules loaded. So I pulled the same container onto a Raspberry Pi 3B I have on my desk, and it ran fine.
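If you want to follow along at home, that first pass looks roughly like this (the specific `strace` filter is just one way of cutting the noise, not a magic recipe):

```bash
# Log only signal delivery so the SIGILL isn't buried under millions of syscalls
strace -f -e trace=signal -o /tmp/node-red-signals.log node-red

# Have node log module resolution, to see which module was loaded last before the crash
NODE_DEBUG=module node-red 2> /tmp/node-red-modules.log
```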
Plus the Zigbee dongle I use mysteriously stopped working on one of the USB ports on Monday morning (always a nice way to start your week), so I started wondering what was wrong with the ODROID itself.
After all, it’s been in service for over eight years now, which is no mean feat, and it has a rather niche Exynos CPU, so I began to suspect it might simply be obsolete by now.
To add to this predicament, although I usually keep older container versions around, Murphy’s Law (and excess confidence) had ensured I hadn’t done that this time around1, so half of my home automation (the bits that relied on Node-RED, like Zigbee bridging to HomeKit) was dead, including part of my office lighting.
So I had to fix this ASAP.
Parenthesis: My Current Build Process
I’ve been doing ARM cross-compilation for a long time, but over the past couple of years the technique I’ve come to favor is to use `qemu-user-static` on Linux with “wrap” containers. In short, this relies on registering QEMU on the host as a `binfmt_misc` handler and injecting the respective `qemu-$(ARCH)-static` binary into a base container2, which is then invoked upon build3.
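The moving parts look roughly like this (image names and paths are illustrative, not a copy of my actual build scripts):

```bash
# Register QEMU as the binfmt_misc handler for foreign binaries on the build host
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# The "wrap" part is just a Dockerfile that ships the static emulator alongside the base image:
#   FROM arm32v7/alpine:3.13
#   COPY qemu-arm-static /usr/bin/qemu-arm-static
#   RUN apk add --no-cache build-base python3   # arm binaries now run under emulation
docker build -t example/node-red:arm32v7 .
```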
Since `armv6` has been falling out of favor (and I still have a Pi Zero W “running” Node-RED) and there were a number of years when getting an up-to-date, native `armv6` or `armv7` build of `nodejs` was almost impossible, I’ve taken to building the full stack (`nodejs`, Node-RED and all associated binaries), which takes a few hours for all the architectures4.
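For reference, the `nodejs` part of that stack is the usual configure-and-wait dance inside the wrap container; the ARM-specific switches below are real `nodejs` configure options, but the exact values shown are illustrative rather than what my build scripts pin:

```bash
# The FPU/ABI switches control exactly the VFP/NEON attributes readelf reports further down
./configure --with-arm-float-abi=hard --with-arm-fpu=vfpv3-d16
make -j"$(nproc)"
make install
```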
It’s usually a “fire and forget” thing I do on one of my machines, and it’s typically faster and much less hassle than building on native hardware due to I/O performance alone. Still, it’s slow enough that I only get fresh containers four to eight hours later (depending on the build).
Side Note About docker buildx
Before someone writes in asking why I don’t use `docker buildx`, which would be a lot neater, allow me to say a few things:
- Yes, I have tried it every few months.
- No, it doesn’t work for me across both `4.5.x` and `5.x` kernels, on public cloud or on my i5/i7 hosts, because it keeps crashing randomly.
- Yes, I tried it again this week (`Docker version 20.10.5, build 55c4c88`) and it bombed out on both the `arm32v6` and `arm32v7` builds (the kind of invocation I mean is sketched below).
- When it works, I find it a lot slower than my current process (partially because it tries to do all the builds at once, and partially because it does not, by default, use all CPU cores).
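For context, that invocation is the standard multi-platform one (tags illustrative):

```bash
# One-shot multi-arch build via BuildKit; this is what keeps falling over for me
docker buildx create --name multiarch --use
docker buildx build \
  --platform linux/arm/v6,linux/arm/v7,linux/arm64,linux/amd64 \
  -t example/node-red:latest --push .
```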
Also, I lost an entire day of wall clock time (around two hours of actual effort spread throughout the day) trying to sort out the above during the rest of my investigation.
Binary Sleuthing
My initial suspicion (based on both the literal error message and past experience) was that a `QEMU` upgrade might have broken the final binaries somehow, and that the ODROID’s Exynos chip might not like the binary’s instruction set. In the past this usually happened with floating-point extensions, and the ODROID reports itself as `armv7l`, with a typical binary declaring support for `VFP3` but not `NEON`:
# readelf -A /bin/sh
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3-D16
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_CPU_unaligned_access: v6
…and my custom `node` builds (which had previously worked) were flagged as having `NEON` instructions:
# readelf -A /usr/local/bin/node
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3
Tag_Advanced_SIMD_arch: NEONv1
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_ABI_optimization_goals: Aggressive Size
Tag_CPU_unaligned_access: v6
Having dealt with compiler vagaries before, I knew this might have to do with either compiler defaults or auto-detected `QEMU` features, so I decided to go back and check exactly where in my build process things had broken. But first, I tried installing and running `nodejs 14.16.0` in a stock `alpine 3.13.4` container on the ODROID:
# readelf -A /usr/bin/node
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_ABI_optimization_goals: Aggressive Size
Tag_CPU_unaligned_access: v6
And guess what, it crashed immediately upon execution (even without any input):
$ /usr/bin/node
#
# Fatal error in , line 0
# unreachable code
#
#
#
#FailureMessage Object: 0xbec1220c
Illegal instruction (core dumped)
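For completeness, that quick test amounted to nothing more than this, run directly on the ODROID (the image tag is just the stock `arm32v7` Alpine one):

```bash
# Pull a stock arm32v7 Alpine userland and try the distribution nodejs package
docker run -it --rm arm32v7/alpine:3.13.4 sh -c '
  apk add --no-cache nodejs   # 14.16.0 in the 3.13 branch
  node                        # dies immediately with the error above
'
```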
So the next step was to downgrade to Alpine 3.12 (which ships with `nodejs 12.21.0`). That worked, but, like my custom builds, it crashed when running Node-RED.
Persons of Interest
Assuming my custom builds were somehow magically better than the stock `alpine` packages, there were multiple suspects that I could try, in descending order of complexity/overhead:
- The `alpine` base images themselves, which I had bumped to `3.13` recently and which might have different compiler defaults (I have to use `alpine` as `ubuntu` doesn’t support `arm32v6`).
- The `qemu-user-static` version, which I had also updated to `5.1.0` (there’s a quick check for this right after this list).
- `nodejs` itself, which is finicky as heck to build and highly sensitive to compiler options (my build was at `14.15.3`).
- And maybe, just maybe, Node-RED itself and its modules.
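As for the second suspect, checking which emulator the build host actually hands `arm` binaries to is at least straightforward:

```bash
# See which interpreter binfmt_misc is registered to run 32-bit arm binaries on the build host
cat /proc/sys/fs/binfmt_misc/qemu-arm

# ...and the emulator's own version (5.1.0 vs 5.2.0 in my case)
qemu-arm-static --version
```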
Since `alpine 3.13.4` now ships with `14.16.0`, I decided to (for science) try using that instead of my own custom builds as well.
The Hours
So I started trying different builds, each of which cost me around 15 minutes of hands-on time (to kick off, and then, a few hours later, to pull down to the ODROID and a Pi to test). This added up to a lot of wall time (in fact, it took me the entire week to sort out)5.
A great (but perplexing) find was that my custom `arm32v6` containers ran perfectly (even if slower) on the ODROID, which allowed me to get my home automation back up on Monday afternoon. So I tagged that image as `node-red/stable` in `docker-compose` and started going through a number of combinations:
| Architecture | Alpine | NodeJS build | NodeJS version | Node-RED version | QEMU version | Result |
|---|---|---|---|---|---|---|
| arm32v6 | 3.12 | custom | 14.15.3 | 1.2.9 | 5.1.0-8 | Works, “new” build with full regression. |
| | | | | | | Works, “old” build that was still on Docker Hub |
| | 3.13.4 | | 14.16.0 | | 5.2.0-2 | Works again, so this has to be related to arm32v7 somehow |
| arm32v7 | 3.12 | | 14.15.3 | 1.2.8 | 4.0.0-2 | Crashes, binaries have VFP3 but no NEON, works on Pi |
| | | | | 1.2.9 | 5.1.0-8 | Crashes, binaries have VFP3 and NEONv1, works on Pi |
| | | | | | | Crashes, works on Pi. “new” build with full regression. Time to try other tactics |
| | 3.13.4 | alpine | 14.16.0 | | | Crashes, works on Pi. So built-in packages aren’t the way (3.12 has Node 12) |
| | | custom | | | 5.2.0-2 | Crashes, works on Pi, so updating QEMU doesn’t fix it |
And when I wrote “Crashes”, I meant that it crashed while running Node-RED, for I soon realized that my custom `nodejs` builds would run fine (or at least give me a usable prompt) nearly all the time. This took me the best part of four days, and by then I was starting to think there had to be some other cause. After all, `nodejs` is a major pain to build, but I was becoming skeptical of it being the culprit in all cases, and the same went for `alpine` and `qemu`.
The Weeds
Despite the failure of the `alpine` stock build, my suspicions turned to the extra modules I build and deploy with Node-RED, many of which (like most of `npm`, really) are poorly maintained one-offs with weird dependencies. Any one of these might tip `nodejs` over during Node-RED startup:
"dependencies": {
"@node-red-contrib-themes/midnight-red": "1.4.7",
"node-red-admin": "0.2.7",
"node-red-contrib-dir2files": "0.3.0",
"node-red-contrib-fs-ops": "1.6.0",
"node-red-contrib-homebridge-automation": "0.0.79",
"node-red-contrib-httpauth": "1.0.12",
"node-red-contrib-lgtv": "1.1.0",
"node-red-contrib-light-scheduler": "0.0.17",
"node-red-contrib-linux-diskio": "0.2.4",
"node-red-contrib-linux-memory": "0.8.4",
"node-red-contrib-linux-network-stats": "0.2.4",
"node-red-contrib-meobox": "1.0.0",
"node-red-contrib-moment": "4.0.0",
"node-red-contrib-msg-speed": "2.0.0",
"node-red-contrib-os": "0.2.0",
"node-red-contrib-persist": "1.1.1",
"node-red-contrib-redis": "1.3.9",
"node-red-contrib-wemo-emulator": "1.0.1",
"node-red-dashboard": "2.28.2",
"node-red-node-base64": "0.3.0",
"node-red-node-daemon": "0.2.1",
"node-red-node-msgpack": "1.2.1",
"node-red-node-prowl": "0.0.10",
"node-red-node-pushover": "0.0.24",
"node-red-node-rbe": "0.5.0",
"node-red-node-smooth": "0.1.2",
"node-red-node-sqlite": "0.6.0",
"node-red-node-tail": "0.3.0",
"node-red-node-ui-list": "0.3.4",
"node-red-node-ui-table": "0.3.10",
"node-red-node-ui-vega": "0.1.3",
"node-red": "1.2.9"
},
So I pulled up my commit logs and started looking at the changes I had made to the Node-RED bundles my build ships with (many of them prompted by `dependabot`, which is great for keeping track of updates), and ranked them according to criticality, hackiness and complexity of their dependencies:
- `sqlite`, which is one of the largest single dependencies, takes a long time to build as a module, and breaks on `musl` around once a year anyway.
- `cheerio`, which brings a lot of baggage with it and which I actually stopped using due to it crashing during my last hack.
- `ssdp-discover`, which is fairly low-level and which I relied on to sniff out Chromecast SSDP traffic (and would tie in nicely with Node-RED crashing during or just after module loads).
- `lgtv`, which was essential for me to automate my LG TV before it got basic HomeKit support.
These had also been recently updated, so they were the most likely suspects.
I started doing (much shorter, but still slow) builds reverting (or removing) each of them from the bundles.
And guess what, it wasn’t any one of them specifically. Some combinations of modules worked, others didn’t.
I ended up just bisecting `package.json` and doing a sort of binary search until I had removed a particular set of Linux OS statistics modules that I use to build CPU and I/O charts (`node-red-contrib-linux-*`).
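The bisection itself was nothing clever, just stripping sets of modules from the manifest, rebuilding and testing on the ODROID; the `jq` one-liner below illustrates the kind of thing, rather than being a faithful copy of my build scripts:

```bash
# Drop one batch of suspects from package.json, rebuild the bundle, test, repeat
jq 'del(.dependencies["node-red-contrib-linux-diskio",
                      "node-red-contrib-linux-memory",
                      "node-red-contrib-linux-network-stats"])' \
  package.json > package.json.tmp && mv package.json.tmp package.json
```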
I was quite thoroughly convinced they were to blame, as they crashed the `arm32v7` build but yielded a working one when removed. But then I did another test: I added another complex module to the mix (`node-red-contrib-chart-image`, which has a bunch of `cairo`-related dependencies), and lo and behold, the resulting `arm32v7` container crashed again upon execution (and yes, `arm32v6` worked fine).
Partial Conclusions
I’m now pretty sure that there is something wrong with `nodejs` 14 on `arm32v7` (at least on the ODROID), as both my custom builds and the builds that ship with `alpine 3.13.4` crash at various points.
This might be specific to `alpine` and the Exynos chip, but I don’t have enough data yet. I’d need to try 3 or 4 more builds to make sure and avoid false positives.
But whatever is happening is definitely triggered by the extension modules I’m loading into Node-RED, as `nodejs` seems to fail whenever I cross some kind of critical threshold of loaded modules.
Plus I had at least one instance where Node-RED loaded completely and flows ran for a couple more seconds (not sure if that was just slowness in hitting whatever made it crash, as the ODROID has a lot of stuff running on it).
Next Steps
Since I really want to figure out what is going on, I’m now looking at things like default stack sizes and other things I can shove into `CFLAGS`, but since I have to go on with my life (and I really wanted to have spent this week’s evenings messing with Godot), there will be a few changes:
- Until I sort this out, I’ll be running `arm32v6` builds on the ODROID.
- I’ll be thoroughly testing the `arm32v6` version of the native `nodejs 14.16.0` package from `alpine 3.13.4` for Node-RED, `homebridge`, etc., and skipping my custom builds whenever possible, because they take too long and add extra testing and uncertainty to the mix.
- I’ll also start planning to phase out the ODROID-U2 in favor of something else, probably one of the two Z83ii machines I have, which are quite likely to work with every single `amd64` container I ever build.
Using a Raspberry Pi is extremely unlikely, as what is running on the ODROID is there because:
- It had 2GB of RAM eight years ago, and most of my Pis can’t fit the workloads in RAM.
- It has built-in eMMC storage that can handle a significant amount of writes, and I don’t want the same containers either pounding on an SD card or forcing me to “invest” in a Pi with SSD support.
I might also try `debian` or `ubuntu` as a base, but that will mean orphaning the older Pi setups, so I’m not really keen on it.
Lessons Learned
- Being an early adopter will always come back to bite you (even if it’s eight years later).
- Before replacing “production” containers on one-of-a-kind hardware, tag the current version and keep it around until you’re happy the new version works (a one-liner, sketched after this list).
- Use tried and tested builds whenever possible (my custom builds made sense when ARM was unfashionable, but I can now get fresh, LTS versions of `nodejs`, so it’s time to phase them out).
- It’s not always DNS (this is probably the biggest takeaway, for all of those people out there who like to say things like “It’s always DNS”).
- Computers are finicky and temperamental. `nodejs` doubly so, in any form.
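On that second point, keeping a fallback really is a one-liner before pulling anything new (image names illustrative):

```bash
# Re-tag whatever is currently running as "stable" and point docker-compose at that tag
docker tag example/node-red:arm32v7 example/node-red:stable
```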
Well, maybe not all the above are true.
But this turned out to be a pretty intense week, and I haven’t written up the half of it, so take it with a grain of salt.
1. One of the reasons this happened was that the ODROID was short on storage, so I removed the old containers. But it’s still no excuse. ↩︎
2. Shipping the emulator with the containers only adds around 10MB, so it isn’t really a problem. ↩︎
3. There are ways to temporarily inject the binary, but they don’t work with `docker build`, only with `docker run`. ↩︎
4. This makes it a bit of a pain to do in free, public CI/CD systems (which are usually capped at less than an hour per run). ↩︎
5. A typical `arm32v7` build of Node-RED, not including `nodejs`, takes 2100 seconds, and that’s the shortest iteration cycle. ↩︎