Fusion Meltdown

Sometimes computers are just too much of a hassle.

I’ve had my fair share of hardware issues in the past (enough to make me appreciate the carefree “appliance” model of thin terminals and iPads, which, incidentally, are at least twice as appealing during these sweltering summers), but intermittent failures easily take the cake – they are the most annoying and frustrating kind of hardware problem, often taking weeks (if not years) to get properly diagnosed.

If ever.

And, guess what, as a sort of karmic follow-up to my I’ve been having a few more interesting problems with my that fall squarely into that category.

Computer Says No

The first obvious symptom was that the Big Sur 11.5 update refused to install, repeatedly erroring out with an “unable to download” message.

This has been discussed in detail by other people, but my story has an extra twist.

After the fourth or fifth attempt, I gave up on the GUI and switched to the CLI in an attempt to get more information, since (like I pointed out during ), Console.app has become borderline unusable and I wanted detailed, but easily digestible info on what was going on.

The 65% Snafu

And I did get more information, in the sense that softwareupdate logged its lack of success, but in a strange way: It reported successful download up to 65%, and then appeared to immediately proceed to expanding the update package–and failed with a file verification error.

I could see progress building up to 65% and then, after a long pause, “jumping” to 100%, after which it errored out:

Strangely enough, this seemed OK at the time.

Now, that jump from 65% to 100% is typical of when a connection stalls and times out, and was consistent with the error dialogs I got from the GUI. But since Akamai was having CDN issues at the time I assumed the update file had been truncated.

I later figured out the reason this seemed to halt at 65% is that softwareupdate, for some ungodly reason, reports the overall update percentage in front of the “Downloading:” message.

I left the machine running overnight, tried it again, and got this:

This, however, was borderline Twilight Zone stuff.

Now, I don’t believe in coincidences, but there was absolutely no way a fresh package would create exactly the same error. Something else was afoot.

Wash/Rinse Cycles

After exhausting all other options, I went into System Recovery and triggered a full reinstall.

As usual, if you hit Command+L you get a detailed log of what is going on, so I was able to verify that Recovery downloaded the full multi-gigabyte OS package and applied it. This took less than an hour, and kept all my data and user accounts.

However, when the system came up the Finder refused to run, and upon further investigation Console.app complained the binary signature was wrong (unfortunately I didn’t keep a screenshot of that).

I then began a long, long cycle of reboots, tests and troubleshooting.

Since Disk Utility reported no problems with the drive, the next likely suspect was RAM. Which was plausible, since this is one of the last with accessible DIMM slots, and I had added an extra 16GB on top of the 8GB it shipped with.

To test it, I went through a few days’ worth of the following routine:

  • Do a PRAM and SMC reset
  • Unplug the machine, try a new combination of DIMMs (first-party, the third-party I had been running and a few more I had lying around)
  • Run the built-in hardware test–which always gave me “No issues found” and an ADP000 reference code
  • Reinstall the OS from scratch (still keeping my user data)

In the meantime, in those intervals while things apparently worked, I installed smartmontools and DriveDx, which also found no apparent issue with the drive.

Random Breakage

But I kept running into problems:

Oh-oh.
  • Random applications would stop launching (store or non-store, didn’t matter). I eventually started launching every single application I had installed systematically after each reinstall, and, sure enough, some would fail.
  • One or two strange post-launch app crashes (one-offs I could not discern a good reason for in Console.app).
  • My Mail.app message index got corrupted (and had to re-index everything):
This took a long, long time to reindex.
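
The launch-everything routine above lends itself to partial automation. What follows is only a minimal sketch, assuming Python 3 on macOS, and it is not the procedure described in this post: instead of launching each app, it asks the system's codesign tool to verify every bundle, on the assumption that a corrupted binary will fail signature verification. The function names and the injectable `run` parameter are illustrative choices.

```python
import subprocess
from pathlib import Path


def find_app_bundles(root):
    """Return the .app bundles directly under root, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix == ".app" and p.is_dir())


def signature_ok(app, run=subprocess.run):
    """Return True if codesign accepts the bundle's signature.

    `run` is injectable (an assumption of this sketch) so the logic
    can be exercised on a machine without codesign.
    """
    result = run(
        ["codesign", "--verify", "--deep", "--strict", str(app)],
        capture_output=True,
        text=True,
    )
    # codesign exits with 0 when verification succeeds.
    return result.returncode == 0


def broken_apps(root="/Applications", run=subprocess.run):
    """List bundles under root whose signature fails verification."""
    return [app for app in find_app_bundles(root) if not signature_ok(app, run)]
```

Calling `broken_apps()` with no arguments would scan /Applications. A clean result only means the signatures verify; it does not prove every app will actually launch.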

At this point, I had just the stock RAM in the machine, since it had passed all my tests (like every other set of DIMMs, but I wanted to have a stock configuration).

So I decided to bypass the Fusion drive altogether and plug in an NVMe USB-C enclosure I had lying around. I then installed Big Sur to that and got a minimal setup working–which worked, at least in the sense that I had zero issues while doing so.

At this point everything pointed to some kind of Fusion drive issue that, for some reason, doesn’t show up on any disk diagnostic tool, even when testing the SSD and HDD independently.

But while running off the NVMe I experienced a resurgence of , so I’m really starting to suspect that there are deeper issues here than just storage.

Apple Support (Case 101443800801)

In the meantime, I had reached out to @AppleSupport on Twitter and eventually got on a call with a support rep in the UK to whom I recounted my saga. He suggested I set up Big Sur in a new APFS volume (which took another hour, so we had to close the call before it was done).

The new volume crashed during booting, so when a second support rep (this time in PT) reached out there wasn’t much else to do–I was walked through the process of clearing out caches in all boot partitions (which was the only thing in the troubleshooting process I had never done), confirmed the test partition still crashed upon boot, supplied all logs and screenshots I had taken, was told it would be escalated to Engineering, and…

Nothing yet. It’s been a week now.

Saving Graces

I’m not surprised there isn’t a response yet (it’s Summertime and this isn’t a trivial situation), but after spending nearly two weeks troubleshooting this and without a reliable desktop machine, I decided to start counting my blessings.

Fortunately, and although I rely on my to work (sitting down) at my normal desk, my regular work machine is the on my , which is why I was able to keep working throughout all this.

And, even more fortunately than that, 80% of my work (except calls, which I prefer to do standing, and the occasional bit of coding, which I remote to other machines for) is actually done on a Windows Remote Desktop farm, so to use the machine while testing the new installs all I needed was the Remote Desktop app (which, thanks to Murphy, was one of the apps that failed to launch on one of the re-installs).

No (important) data was lost, since everything of consequence is on or my home NAS anyway (and I still have Time Machine backups there). I am, however, a bit worried as to the integrity of those backups.
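
One way to put a number on that worry is to compare checksums between a folder and a plain copy of it. The sketch below assumes Python 3 and a straightforward mirrored directory tree; it does not understand Time Machine's own backup format, and the function names are illustrative. It should also be run from a machine that is trusted, since a mismatch by itself does not say which side is the corrupted one.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source, backup):
    """Compare every file under source with its counterpart under backup.

    Returns (missing, mismatched): two lists of paths relative to source.
    Files that exist only in backup are not reported.
    """
    source, backup = Path(source), Path(backup)
    missing, mismatched = [], []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        copy = backup / relative
        if not copy.is_file():
            missing.append(relative)
        elif sha256_of(path) != sha256_of(copy):
            mismatched.append(relative)
    return missing, mismatched
```

An empty `mismatched` list means the two trees agree byte for byte on every file they share, which is about as much reassurance as a checksum can give.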

Since I still have a MacBook (a 2016 one, which I use lightly due to fear of breaking its infamous keyboard), I also started planning to use that as a stopgap until it makes sense for me to get an M1/M1X/M2 machine (which may take a good while yet–maybe until next March for a Mini/desktop refresh).

Looking Back

I now strongly suspect the is tied to this, and that I may have been having hardware issues for a good while without realising it.

For instance, I have noticed that some of the git repositories I had on my desktop to curate (decide whether to archive/cleanup/re-file) have oddly empty folders (some clearly temporary from build processes, but others just… gone).

And some of the I had (also with git repositories, but a couple of large ones) may be related as well.

Plus there were a few images in Photos that were glitchy a year or so ago, etc., etc. – but this may just be bias, since it’s quite tempting to attribute all sorts of random weird behaviour to this kind of problem.

Looking Forward

Since I can’t really trust the , I decided to step up my upgrade plans to an Apple Silicon machine, but piecemeal.

As a first step, I took the off my desk and am now using my MacBook Pro with an ultra-wide monitor (more on that in a few weeks), which means this is the first time in over 15 years I don’t have a desktop Mac.

Although replacing the Fusion drive is entirely doable (and the kit is around €200-€300 depending on SSD capacity) I’m wary of doing it and failing to address the real issue.

I’m quite interested in seeing what will come out of my support case, since hauling the whole thing to be repaired without having a proper diagnosis (in the middle of Summer and, incidentally, a pandemic) is just asking for weeks of hassle.

And since Portugal doesn’t have any sort of trade-in programs (or, now that I mention it, not even a 100% “normal” Apple Store, despite a local support center and offices), there’s no trade-in upgrade path, either.

Hardware Waste

Having the machine cast aside and in this Schrödinger’s cat-like situation where it’s simultaneously broken and OK depending on how I look at it feels incredibly wasteful (especially considering the built-in display was the best I ever used until now), but the inconvenience of the all-in-one format is that I can’t really make any use of it as it is (this model doesn’t have target display mode because there was really no way to implement that when it was designed).

I’ve been musing that if all else fails I can always gut the machine, get an LVDS driver board and use it as a monitor (it’s been done before), but that would be a project I don’t have the time to do right now.

Would be great fun, though.
