Here’s something I do often, which is to go off into the weeds and try to fix something that suddenly broke. Except this time it’s not cloud (it’s on-premises) and it’s not someone else’s (it’s mine, through and through, although most of the dependencies and stack are in widespread use).
It’s also something pretty low-level and borderline inscrutable…
Problem Statement
For quite some time now, I’ve maintained my own Node-RED cross-platform container image, which I use pretty much everywhere.
I started building those images for home automation, and they currently support `arm32v6` (Pi Zero), `arm32v7` (Pi 2/3+, my ODROID-U2 and most other modern 32-bit ARM chips), `aarch64` (Pi 4 and some cloud hosts) and, of course, `amd64`.
But while upgrading my home automation setup on my ODROID late last Sunday, I had the nasty surprise of finding that my latest container crashed with an `Illegal instruction (core dumped)` message. `strace` was useless because it spewed too much information, and `export NODE_DEBUG=module` didn’t yield immediate clues either, since whatever happened took place right after the modules loaded. So I pulled the same container onto a Raspberry Pi 3B I have on my desk, and it ran fine.
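If you want to follow along at home, that first pass looks roughly like this (the specific `strace` filter is just one way of cutting the noise, not a magic recipe):

```bash
# Log only signal delivery so the SIGILL isn't buried under millions of syscalls
strace -f -e trace=signal -o /tmp/node-red-signals.log node-red

# Have node log module resolution, to see which module was loaded last before the crash
NODE_DEBUG=module node-red 2> /tmp/node-red-modules.log
```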
Plus the Zigbee dongle I use mysteriously stopped working on one of the USB ports on Monday morning (always a nice way to start your week), so I started wondering what was wrong with the ODROID itself.
After all, it’s been in service for over eight years now, which is no mean feat, and it has a rather niche Exynos CPU, so I began to suspect it might simply be obsolete by now.
To add to this predicament, although I usually keep older container versions around, Murphy’s Law (and excess confidence) had ensured I hadn’t done that this time around1, so half of my home automation (the bits that relied on Node-RED, like Zigbee bridging to HomeKit) was dead, including part of my office lighting.
So I had to fix this ASAP.
Parenthesis: My Current Build Process
I’ve been doing ARM cross-compilation for a long time, but over the past couple of years the technique I’ve come to favor is to use `qemu-user-static` on Linux with “wrap” containers. In short, this relies on registering QEMU on the host as a `binfmt_misc` handler and injecting the respective `qemu-$(ARCH)-static` binary into a base container2, which is then invoked upon build3.
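The moving parts look roughly like this (image names and paths are illustrative, not a copy of my actual build scripts):

```bash
# Register QEMU as the binfmt_misc handler for foreign binaries on the build host
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# The "wrap" part is just a Dockerfile that ships the static emulator alongside the base image:
#   FROM arm32v7/alpine:3.13
#   COPY qemu-arm-static /usr/bin/qemu-arm-static
#   RUN apk add --no-cache build-base python3   # arm binaries now run under emulation
docker build -t example/node-red:arm32v7 .
```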
Since `armv6` has been falling out of favor (and I still have a Pi Zero W “running” Node-RED) and there were a number of years when getting an up-to-date, native `armv6` or `armv7` build of `nodejs` was almost impossible, I’ve taken to building the full stack (`nodejs`, Node-RED and all associated binaries), which takes a few hours for all the architectures4.
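For reference, the `nodejs` part of that stack is the usual configure-and-wait dance inside the wrap container; the ARM-specific switches below are real `nodejs` configure options, but the exact values shown are illustrative rather than what my build scripts pin:

```bash
# The FPU/ABI switches control exactly the VFP/NEON attributes readelf reports further down
./configure --with-arm-float-abi=hard --with-arm-fpu=vfpv3-d16
make -j"$(nproc)"
make install
```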
It’s usually a “fire and forget” thing I do on one of my machines, and it’s typically faster and much less hassle than building on native hardware due to I/O performance alone. Still, it’s slow enough that I only get fresh containers four to eight hours later (depending on the build).
Side Note About docker buildx
Before someone writes in asking why I don’t use `docker buildx`, which would be a lot neater, allow me to say a few things:
- Yes, I have tried it every few months.
- No, it doesn’t work for me across both `4.5.x` and `5.x` kernels, on public cloud or on my i5/i7 hosts, because it keeps crashing randomly.
- Yes, I tried it again this week (`Docker version 20.10.5, build 55c4c88`) and it bombed out on both the `arm32v6` and `arm32v7` builds (the kind of invocation I mean is sketched below).
- When it works, I find it a lot slower than my current process (partially because it tries to do all the builds at once, and partially because it does not, by default, use all CPU cores).
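For context, that invocation is the standard multi-platform one (tags illustrative):

```bash
# One-shot multi-arch build via BuildKit; this is what keeps falling over for me
docker buildx create --name multiarch --use
docker buildx build \
  --platform linux/arm/v6,linux/arm/v7,linux/arm64,linux/amd64 \
  -t example/node-red:latest --push .
```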
Also, I lost an entire day of wall clock time (around two hours of actual effort spread throughout the day) trying to sort out the above during the rest of my investigation.
Binary Sleuthing
My initial suspicion (based on both the literal error message and past experience) was that a `QEMU` upgrade might have broken the final binaries somehow, and that the ODROID’s Exynos chip might not like the binary’s instruction set. In the past this usually happened with floating-point extensions, and the ODROID reports itself as `armv7l`, with a typical binary declaring support for `VFP3` but not `NEON`:
# readelf -A /bin/sh
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3-D16
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_CPU_unaligned_access: v6
…and my custom `node` builds (which had previously worked) were flagged as having `NEON` instructions:
# readelf -A /usr/local/bin/node
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3
Tag_Advanced_SIMD_arch: NEONv1
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_ABI_optimization_goals: Aggressive Size
Tag_CPU_unaligned_access: v6
Having dealt with compiler vagaries before, I knew this might have to do with either compiler defaults or auto-detected `QEMU` features, so I decided to go back and check exactly where in my build process things had broken. But first, I tried installing and running `nodejs 14.16.0` in a stock `alpine 3.13.4` container on the ODROID:
# readelf -A /usr/bin/node
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_ABI_optimization_goals: Aggressive Size
Tag_CPU_unaligned_access: v6
And guess what, it crashed immediately upon execution (even without any input):
$ /usr/bin/node
#
# Fatal error in , line 0
# unreachable code
#
#
#
#FailureMessage Object: 0xbec1220c
Illegal instruction (core dumped)
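For completeness, that quick test amounted to nothing more than this, run directly on the ODROID (the image tag is just the stock `arm32v7` Alpine one):

```bash
# Pull a stock arm32v7 Alpine userland and try the distribution nodejs package
docker run -it --rm arm32v7/alpine:3.13.4 sh -c '
  apk add --no-cache nodejs   # 14.16.0 in the 3.13 branch
  node                        # dies immediately with the error above
'
```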
So the next step was to downgrade to Alpine 3.12 (which ships with `nodejs 12.21.0`). That worked, but, like my custom builds, it crashed when running Node-RED.
Persons of Interest
Assuming my custom builds were somehow magically better than the stock `alpine` packages, there were multiple suspects that I could try, in descending order of complexity/overhead:
- The `alpine` base images themselves, which I had bumped to `3.13` recently and which might have different compiler defaults (I have to use `alpine` as `ubuntu` doesn’t support `arm32v6`).
- The `qemu-user-static` version, which I had also updated to `5.1.0` (there’s a quick check for this right after this list).
- `nodejs` itself, which is finicky as heck to build and highly sensitive to compiler options (my build was at `14.15.3`).
- And maybe, just maybe, Node-RED itself and its modules.
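As for the second suspect, checking which emulator the build host actually hands `arm` binaries to is at least straightforward:

```bash
# See which interpreter binfmt_misc is registered to run 32-bit arm binaries on the build host
cat /proc/sys/fs/binfmt_misc/qemu-arm

# ...and the emulator's own version (5.1.0 vs 5.2.0 in my case)
qemu-arm-static --version
```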
Since `alpine 3.13.4` now ships with `14.16.0`, I decided to (for science) try using that instead of my own custom builds as well.
The Hours
So I started trying different builds, each of which cost me around 15 minutes of hands-on time (to kick off, and then, a few hours later, to pull down to the ODROID and a Pi to test). This added up to a lot of wall time (in fact, it took me the entire week to sort out)5.
A great (but perplexing) find was that my custom `arm32v6` containers ran perfectly (even if slower) on the ODROID, which allowed me to get my home automation back up on Monday afternoon. So I tagged that image as `node-red/stable` in `docker-compose` and started going through a number of combinations:
| Architecture | Alpine | NodeJS build | NodeJS version | Node-RED version | QEMU version | Result |
|---|---|---|---|---|---|---|
| arm32v6 | 3.12 | custom | 14.15.3 | 1.2.9 | 5.1.0-8 | Works, “new” build with full regression. |
| | | | | | | Works, “old” build that was still on Docker Hub |
| | 3.13.4 | | 14.16.0 | | 5.2.0-2 | Works again, so this has to be related to arm32v7 somehow |
| arm32v7 | 3.12 | | 14.15.3 | 1.2.8 | 4.0.0-2 | Crashes, binaries have VFP3 but no NEON, works on Pi |
| | | | | 1.2.9 | 5.1.0-8 | Crashes, binaries have VFP3 and NEONv1, works on Pi |
| | | | | | | Crashes, works on Pi. “new” build with full regression. Time to try other tactics |
| | 3.13.4 | alpine | 14.16.0 | | | Crashes, works on Pi. So built-in packages aren’t the way (3.12 has Node 12) |
| | | custom | | | 5.2.0-2 | Crashes, works on Pi, so updating QEMU doesn’t fix it |
And when I wrote “Crashes”, I meant that it crashed while running Node-RED, for I soon realized that my custom `nodejs` builds would run fine (or at least give me a usable prompt) nearly all the time. This took me the best part of four days, and by then I was starting to think there had to be some other cause. After all, `nodejs` is a major pain to build, but I was becoming skeptical of it being the culprit in all cases, and the same went for `alpine` and `qemu`.
The Weeds
Despite the failure of the `alpine` stock build, my suspicions turned to the extra modules I build and deploy with Node-RED, many of which (like most of `npm`, really) are poorly maintained one-offs with weird dependencies. Any one of these might tip `nodejs` over during Node-RED startup:
"dependencies": {
"@node-red-contrib-themes/midnight-red": "1.4.7",
"node-red-admin": "0.2.7",
"node-red-contrib-dir2files": "0.3.0",
"node-red-contrib-fs-ops": "1.6.0",
"node-red-contrib-homebridge-automation": "0.0.79",
"node-red-contrib-httpauth": "1.0.12",
"node-red-contrib-lgtv": "1.1.0",
"node-red-contrib-light-scheduler": "0.0.17",
"node-red-contrib-linux-diskio": "0.2.4",
"node-red-contrib-linux-memory": "0.8.4",
"node-red-contrib-linux-network-stats": "0.2.4",
"node-red-contrib-meobox": "1.0.0",
"node-red-contrib-moment": "4.0.0",
"node-red-contrib-msg-speed": "2.0.0",
"node-red-contrib-os": "0.2.0",
"node-red-contrib-persist": "1.1.1",
"node-red-contrib-redis": "1.3.9",
"node-red-contrib-wemo-emulator": "1.0.1",
"node-red-dashboard": "2.28.2",
"node-red-node-base64": "0.3.0",
"node-red-node-daemon": "0.2.1",
"node-red-node-msgpack": "1.2.1",
"node-red-node-prowl": "0.0.10",
"node-red-node-pushover": "0.0.24",
"node-red-node-rbe": "0.5.0",
"node-red-node-smooth": "0.1.2",
"node-red-node-sqlite": "0.6.0",
"node-red-node-tail": "0.3.0",
"node-red-node-ui-list": "0.3.4",
"node-red-node-ui-table": "0.3.10",
"node-red-node-ui-vega": "0.1.3",
"node-red": "1.2.9"
},
So I pulled up my commit logs and started looking at the changes I had made to the Node-RED bundles my build ships with (many of them prompted by `dependabot`, which is great for keeping track of updates), and ranked them according to criticality, hackiness and complexity of their dependencies:
- `sqlite`, which is one of the largest single dependencies, takes a long time to build as a module, and breaks on `musl` around once a year anyway.
- `cheerio`, which brings a lot of baggage with it and which I actually stopped using due to it crashing during my last hack.
- `ssdp-discover`, which is fairly low-level and which I relied on to sniff out Chromecast SSDP traffic (and would tie in nicely with Node-RED crashing during or just after module loads).
- `lgtv`, which was essential for me to automate my LG TV before it got basic HomeKit support.
These had also been recently updated, so they were the most likely suspects.
I started doing (much shorter, but still slow) builds reverting (or removing) each of them from the bundles.
And guess what, it wasn’t any one of them specifically. Some combinations of modules worked, others didn’t.
I ended up just bisecting `package.json` and doing a sort of binary search until I had removed a particular set of Linux OS statistics modules that I use to build CPU and I/O charts (`node-red-contrib-linux-*`).
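The bisection itself was nothing clever, just stripping sets of modules from the manifest, rebuilding and testing on the ODROID; the `jq` one-liner below illustrates the kind of thing, rather than being a faithful copy of my build scripts:

```bash
# Drop one batch of suspects from package.json, rebuild the bundle, test, repeat
jq 'del(.dependencies["node-red-contrib-linux-diskio",
                      "node-red-contrib-linux-memory",
                      "node-red-contrib-linux-network-stats"])' \
  package.json > package.json.tmp && mv package.json.tmp package.json
```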
I was quite thoroughly convinced they were to blame, as they crashed the `arm32v7` build but yielded a working one when removed. But then I did another test: I added another complex module to the mix (`node-red-contrib-chart-image`, which has a bunch of `cairo`-related dependencies), and lo and behold, the resulting `arm32v7` container crashed again upon execution (and yes, `arm32v6` worked fine).
Partial Conclusions
I’m now pretty sure that there is something wrong with `nodejs` 14 on `arm32v7` (at least on the ODROID), as both my custom builds and the builds that ship with `alpine 3.13.4` crash at various points.
This might be specific to `alpine` and the Exynos chip, but I don’t have enough data yet. I’d need to try 3 or 4 more builds to make sure and avoid false positives.
But whatever is happening is definitely triggered by the extension modules I’m loading into Node-RED, as `nodejs` seems to fail whenever I cross some kind of critical threshold of loaded modules.
Plus I had at least one instance where Node-RED loaded completely and flows ran for a couple more seconds (not sure if that was just slowness in hitting whatever made it crash, as the ODROID has a lot of stuff running on it).
Next Steps
Since I really want to figure out what is going on, I’m now looking at things like default stack sizes and other things I can shove into `CFLAGS`, but since I have to go on with my life (and I really wanted to have spent this week’s evenings messing with Godot), there will be a few changes:
- Until I sort this out, I’ll be running `arm32v6` builds on the ODROID.
- I’ll be thoroughly testing the `arm32v6` version of the native `nodejs 14.16.0` package from `alpine 3.13.4` for Node-RED, `homebridge`, etc., and skipping my custom builds whenever possible, because they take too long and add extra testing and uncertainty to the mix.
- I’ll also start planning to phase out the ODROID-U2 in favor of something else, probably one of the two Z83ii machines I have, which are quite likely to work with every single `amd64` container I ever build.
Using a Raspberry Pi is extremely unlikely, as what is running on the ODROID is there because:
- It had 2GB of RAM eight years ago, and most of my Pis can’t fit the workloads in RAM.
- It has built-in eMMC storage that can handle a significant amount of writes, and I don’t want the same containers either pounding on an SD card or forcing me to “invest” in a Pi with SSD support.
I might also try `debian` or `ubuntu` as a base, but that will mean orphaning the older Pi setups, so I’m not really keen on it.
Lessons Learned
- Being an early adopter will always come back to bite you (even if it’s eight years later).
- Before replacing “production” containers on one-of-a-kind hardware, tag the current version and keep it around until you’re happy the new version works (a one-liner, sketched after this list).
- Use tried and tested builds whenever possible (my custom builds made sense when ARM was unfashionable, but I can now get fresh, LTS versions of `nodejs`, so it’s time to phase them out).
- It’s not always DNS (this is probably the biggest takeaway, for all of those people out there who like to say things like “It’s always DNS”).
- Computers are finicky and temperamental. `nodejs` doubly so, in any form.
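On that second point, keeping a fallback really is a one-liner before pulling anything new (image names illustrative):

```bash
# Re-tag whatever is currently running as "stable" and point docker-compose at that tag
docker tag example/node-red:arm32v7 example/node-red:stable
```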
Well, maybe not all the above are true.
But this turned out to be a pretty intense week, and I haven’t written up the half of it, so take it with a grain of salt.
1. One of the reasons this happened was that the ODROID was short on storage, so I removed the old containers. But it’s still no excuse. ↩︎
2. Shipping the emulator with the containers only adds around 10MB, so it isn’t really a problem. ↩︎
3. There are ways to temporarily inject the binary, but they don’t work with `docker build`, only with `docker run`. ↩︎
4. This makes it a bit of a pain to do in free, public CI/CD systems (which are usually capped at less than an hour per run). ↩︎
5. A typical `arm32v7` build of Node-RED, not including `nodejs`, takes 2100 seconds, and that’s the shortest iteration cycle. ↩︎