Notes for December 23-29

Nothing much happened during the holiday season except that in an unusual turn of events, my NAS kit killed three of the four SSDs I had in it.

Before I go into that, I should pay homage to the holiday season and mention I got the expected amount of socks, plus a few other doodads. I also spent a while trying to clean up my office, consolidating my compute environments (mostly on the ) and similar endeavors.

Leading Up To The Failure

I recently installed an updated kernel on the that enabled KVM and ZFS, and had gotten to the point where I was going to migrate all my home automation to a container running on it, so I started setting things up and tried to move the machine to my closet.
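For anyone wanting to do something similar, the gist of "enabled KVM and ZFS" is just making sure the kernel actually exposes both before moving workloads over. Here's a minimal sketch of that kind of pre-flight check (my own illustration, not the exact steps I ran):

```python
#!/usr/bin/env python3
"""Pre-flight sanity check: does the running kernel expose KVM and ZFS?
A sketch of the idea, not the exact commands I used."""
import os
import shutil
import subprocess

def module_loaded(name: str) -> bool:
    # /proc/modules lists every loaded kernel module, one per line
    with open("/proc/modules") as fh:
        return any(line.split()[0] == name for line in fh)

print("/dev/kvm present:", os.path.exists("/dev/kvm"))
print("zfs module loaded:", module_loaded("zfs"))

# If the ZFS userland tools are installed, show pool health as well
if shutil.which("zpool"):
    subprocess.run(["zpool", "status"], check=False)
```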

The sequence of events was roughly as follows:

  • Everything was working fine.
  • I (fortunately) sorted and migrated most of the 2.3TB of data I had on the to the over the past week, and set up a few guest machines on it (including a Windows on ARM VM and half a dozen Debian containers, one of which runs one of my XFCE environments).
  • I freed up a LAN port in my closet, went to the console and shut down the .
  • I waited until it was off, unplugged it and moved it to the new location, but it wouldn’t power on.
  • I then tried plugging it into a monitor, and it wouldn’t boot past the FriendlyElec logo.

Recovery Attempts

Since it was in a case and I didn’t have an easy way to get a console going (for some reason U-Boot doesn’t output the boot log over HDMI), it took me a long time to realize that it wouldn’t boot with the SSDs connected. I initially thought the eMMC was corrupted, but I now realize the kernel halts (waiting for an invisible console input?) when it can’t talk to the NVMe drives.

By the time I realized that, I had already disassembled the whole thing, re-flashed the eMMC (which turned out to be fine, although re-flashing wiped out one container I didn’t have backups for), and was still trying to get it to boot–which it only did once I removed every NVMe drive but the first (after manually numbering them, since I was hoping to rebuild the ZFS volume).
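For reference, the rebuild I had in mind was something along these lines (a hypothetical sketch, with "tank" standing in for the actual pool name)–moot with three drives gone, but roughly what I would have tried:

```python
#!/usr/bin/env python3
"""Scan the surviving drives for an importable pool and bring it in read-only.
Sketch only; 'tank' is a placeholder, not my actual pool name."""
import subprocess

# With no pool name, 'zpool import' just lists whatever pools it can find
# on the devices under the given directory, without importing them.
subprocess.run(["zpool", "import", "-d", "/dev/disk/by-id"], check=False)

# Importing read-only avoids writing to a possibly degraded pool while
# copying data off it.
subprocess.run(
    ["zpool", "import", "-d", "/dev/disk/by-id", "-o", "readonly=on", "tank"],
    check=False,
)
```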

Testing The Drives

The drives in question are identical 1TB WD Blue SN580s, and I tried various combinations until I realized that drives 2-4 were dead–as in, they weren’t recognized in any slot or in a USB enclosure, didn’t even report a serial number, and no amount of tooling managed to elicit a peep from them.
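To give an idea of what "not even a serial number" means: any NVMe controller the kernel can talk to shows up under sysfs with its model and serial, so a quick check looks something like this (a sketch of the idea in Python, although any tooling will do):

```python
#!/usr/bin/env python3
"""List every NVMe controller the kernel can see, with model and serial.
Dead drives simply never appear here, in any slot or enclosure."""
from pathlib import Path

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    model = (ctrl / "model").read_text().strip() if (ctrl / "model").exists() else "?"
    serial = (ctrl / "serial").read_text().strip() if (ctrl / "serial").exists() else "?"
    print(f"{ctrl.name}: model={model!r} serial={serial!r}")
```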

I did notice one smoking gun, though: smartctl told me the surviving drive had tallied 72 unsafe shutdowns–which, for a brand new drive that has been running since October, is far too many.
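If you want to check your own drives, this is roughly the query I ran (a minimal sketch assuming smartmontools and its JSON output; the device path is just an example):

```python
#!/usr/bin/env python3
"""Pull the NVMe health counters that tipped me off, via smartctl's JSON output."""
import json
import subprocess

out = subprocess.run(
    ["smartctl", "-a", "-j", "/dev/nvme0"],  # -j asks smartctl for JSON output
    capture_output=True, text=True, check=False,
)
health = json.loads(out.stdout).get("nvme_smart_health_information_log", {})

print("unsafe shutdowns:", health.get("unsafe_shutdowns"))
print("power cycles:    ", health.get("power_cycles"))
print("power-on hours:  ", health.get("power_on_hours"))
```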

The board hasn’t been power cycled anywhere near that many times (especially considering it’s plugged into a UPS), and for comparison, the internal drive on my , which is subjected to all sorts of abuse, has only recorded 66 unsafe shutdowns in its entire lifetime.

Now, one drive failing is a drive problem. Three drives failing simultaneously on the same board has to be a deeper hardware issue–I’ve sent a note to FriendlyElec regarding this and have yet to hear back, but (sadly) for the time being I have to update my and my with a link to this post as a cautionary tale.

Update: In the meantime, a reader sent me this link to the FriendlyElec forum (PDF copy here) where other people report similar issues, also with WD drives.

Next Steps

This is pretty sad considering the was the best ARM machine I had–with 16GB of RAM, 8 cores and a decent cooling solution, it was also the platform I was planning to consolidate my ARM development on–so for now I’m setting up shop temporarily on the .

I’ve started an RMA process with SanDisk to see if I can get the drives replaced (they were under warranty, and a sizable investment) and will be looking for more multi-NVMe devices over the next few months (suggestions are welcome).

In the meantime, I think I will move my home automation setup to the instead, and cut down on the services I run on ARM until I can find a suitably beefy device to replace the .

I’ll try to update this post as things progress.
