Notes for December 23-29

Nothing much happened during the holiday season except that in an unusual turn of events, my NAS kit killed three of the four SSDs I had in it.

Before I go into that, I should pay homage to the holiday season and mention I got the expected amount of socks, plus a few other doodads. I also spent a while trying to clean up my office, consolidating my compute environments (mostly on the ) and similar endeavors.

Leading Up To The Failure

I recently installed an updated kernel on the that enabled KVM and ZFS, and had gotten to the point where I was going to migrate all my home automation to a container running on it, so I started setting things up and tried to move the machine to my closet.
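For anyone wanting to do something similar, the gist of "enabled KVM and ZFS" is just making sure the kernel actually exposes both before moving workloads over. Here's a minimal sketch of that kind of pre-flight check (my own illustration, not the exact steps I ran):

```python
#!/usr/bin/env python3
"""Pre-flight sanity check: does the running kernel expose KVM and ZFS?
A sketch of the idea, not the exact commands I used."""
import os
import shutil
import subprocess

def module_loaded(name: str) -> bool:
    # /proc/modules lists every loaded kernel module, one per line
    with open("/proc/modules") as fh:
        return any(line.split()[0] == name for line in fh)

print("/dev/kvm present:", os.path.exists("/dev/kvm"))
print("zfs module loaded:", module_loaded("zfs"))

# If the ZFS userland tools are installed, show pool health as well
if shutil.which("zpool"):
    subprocess.run(["zpool", "status"], check=False)
```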

The sequence of events was roughly as follows:

  • Everything was working fine.
  • I (fortunately) sorted and migrated most of the 2.3TB of data I had on the to the over the past week, and set up a few guest machines on it (including a Windows on ARM VM and half a dozen Debian containers, one of which runs one of my XFCE environments).
  • I freed up a LAN port in my closet, went to the console and shut down the .
  • I waited until it was off, unplugged it and moved it to the new location, but it wouldn’t power on.
  • I then tried plugging it into a monitor, and it wouldn’t boot past the FriendlyElec logo.

Recovery Attempts

Since it was in a case and I didn’t have an easy way to get a console going (for some reason U-Boot doesn’t output the boot log over HDMI), it took me a long time to realize that it wouldn’t boot with the SSDs connected. I initially thought the eMMC was corrupted, but I now realize the kernel halts (waiting for an invisible console input?) when it can’t talk to the NVMe drives.

By the time I realized that, I had already disassembled the whole thing, re-flashed the eMMC (which turned out to be fine, although re-flashing wiped out one container I didn’t have backups for), and was still trying to get it to boot–which it only did once I removed every NVMe drive but the first (after manually numbering them, since I was hoping to rebuild the ZFS volume).
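For reference, the rebuild I had in mind was something along these lines (a hypothetical sketch, with "tank" standing in for the actual pool name)–moot with three drives gone, but roughly what I would have tried:

```python
#!/usr/bin/env python3
"""Scan the surviving drives for an importable pool and bring it in read-only.
Sketch only; 'tank' is a placeholder, not my actual pool name."""
import subprocess

# With no pool name, 'zpool import' just lists whatever pools it can find
# on the devices under the given directory, without importing them.
subprocess.run(["zpool", "import", "-d", "/dev/disk/by-id"], check=False)

# Importing read-only avoids writing to a possibly degraded pool while
# copying data off it.
subprocess.run(
    ["zpool", "import", "-d", "/dev/disk/by-id", "-o", "readonly=on", "tank"],
    check=False,
)
```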

Testing The Drives

The drives in question are identical 1TB WD Blue SN580s, and I tried various combinations until I realized that drives 2-4 were dead–as in, they weren’t recognized in any slot or in a USB enclosure, didn’t even report a serial number, and no amount of tooling managed to elicit a peep from them.
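To give an idea of what "not even a serial number" means: any NVMe controller the kernel can talk to shows up under sysfs with its model and serial, so a quick check looks something like this (a sketch of the idea in Python, although any tooling will do):

```python
#!/usr/bin/env python3
"""List every NVMe controller the kernel can see, with model and serial.
Dead drives simply never appear here, in any slot or enclosure."""
from pathlib import Path

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    model = (ctrl / "model").read_text().strip() if (ctrl / "model").exists() else "?"
    serial = (ctrl / "serial").read_text().strip() if (ctrl / "serial").exists() else "?"
    print(f"{ctrl.name}: model={model!r} serial={serial!r}")
```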

I did notice one smoking gun, though: smartctl told me the surviving drive had tallied 72 unsafe shutdowns–which, for a brand new drive that has been running since October, is far too many.
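If you want to check your own drives, this is roughly the query I ran (a minimal sketch assuming smartmontools and its JSON output; the device path is just an example):

```python
#!/usr/bin/env python3
"""Pull the NVMe health counters that tipped me off, via smartctl's JSON output."""
import json
import subprocess

out = subprocess.run(
    ["smartctl", "-a", "-j", "/dev/nvme0"],  # -j asks smartctl for JSON output
    capture_output=True, text=True, check=False,
)
health = json.loads(out.stdout).get("nvme_smart_health_information_log", {})

print("unsafe shutdowns:", health.get("unsafe_shutdowns"))
print("power cycles:    ", health.get("power_cycles"))
print("power-on hours:  ", health.get("power_on_hours"))
```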

The board hasn’t been power cycled anywhere near that many times (especially considering it’s plugged into a UPS), and for comparison, the internal drive on my , which is subjected to all sorts of abuse, has only recorded 66 unsafe shutdowns in its entire lifetime.

Now, one drive failing is a drive problem. Three drives failing simultaneously on the same board has to be a deeper hardware issue–I’ve sent a note to FriendlyElec regarding this and have yet to hear back, but (sadly) for the time being I have to update my and my with a link to this post as a cautionary tale.

Update: In the meantime, a reader sent me this link to the FriendlyElec forum (PDF copy here) where other people report similar issues, also with WD drives.

Next Steps

This is pretty sad considering the was the best ARM machine I had–with 16GB of RAM, 8 cores and a decent cooling solution, it was also the platform I was planning to consolidate my ARM development on–so for now I’m setting up shop temporarily on the .

I’ve started an RMA process with SanDisk to see if I can get the drives replaced (they were under warranty, and a sizable investment) and will be looking for more multi-NVMe devices over the next few months (suggestions are welcome).

In the meantime, I think I will move my home automation setup to the instead, and cut down on the services I run on ARM until I can find a suitably beefy device to replace the .

I’ll try to update this post as things progress.
