Nothing much happened during the holiday season except that in an unusual turn of events, my CM3588 NAS kit killed three of the four SSDs I had in it.
Before I go into that, I should pay homage to the holiday season and mention I got the expected amount of socks, plus a few other doodads. I also spent a while trying to clean up my office, consolidating my compute environments (mostly on the F4-424 Max) and similar endeavors.
Leading Up To The Failure
I recently installed an updated kernel on the CM3588 that enabled KVM and ZFS, and was happy enough with it that I was going to migrate all my home automation to a container running on it. So I started setting things up and tried to move the board to my closet.
The sequence of events was roughly as follows:
- Everything was working fine last week.
- I (fortunately) sorted and migrated most of the 2.3TB of data I had on the CM3588 to the TerraMaster F4-424 Max over the past week, as well as setting up a few guest machines in it (that included a Windows on ARM VM and half a dozen Debian containers, including one of my XFCE environments).
- I freed up a LAN port in my closet, went to the Proxmox console and shut down the CM3588.
- I waited until it was off, unplugged it and moved it to the new location, but it wouldn’t power on.
- I then tried plugging it into a monitor, and it wouldn’t boot past the FriendlyElec logo.
Recovery Attempts
Since it was in a case and I didn’t have an easy way to get a console going (for some reason U-Boot doesn’t output the boot log to HDMI), it took me a long time to realize that it wouldn’t boot with the SSDs connected. I initially thought the eMMC was corrupted, but I now realize that the kernel halts (waiting for an invisible console input?) when it can’t talk to the NVMe drives.
By the time I realized that, I had already disassembled the whole thing, re-flashed the eMMC (which was fine, although it did wipe out one container I didn’t have backups for), and was still trying to get it to boot–which it only did when I removed all the NVMe drives but the first (after manually numbering them, since I was hoping to rebuild the ZFS volume).
Testing The Drives
The drives in question are identical 1TB WD Blue SN580s, and I started playing with various combinations until I realized that drives 2-4 were dead–as in, they weren’t recognized in any slot, nor in a USB enclosure. They didn’t even report a serial number, and no tooling managed to elicit a peep from them.
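For anyone wanting to repeat this kind of check, here is a minimal sketch of how one might list the NVMe controllers the Linux kernel has actually enumerated. It assumes the standard sysfs layout under /sys/class/nvme, and the function name list_nvme_controllers is purely illustrative, not something I used at the time. A drive that is as dead as mine simply won’t show up in the output.

```python
from pathlib import Path


def list_nvme_controllers(sysfs_root: str = "/sys/class/nvme") -> list[dict]:
    """Return model/serial/firmware for each NVMe controller found in sysfs.

    Assumption: a Linux system exposing NVMe controllers under sysfs_root.
    Controllers that fail to enumerate will simply be absent from the result.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    controllers = []
    for ctrl in sorted(root.iterdir()):
        info = {"name": ctrl.name}
        for attr in ("model", "serial", "firmware_rev"):
            path = ctrl / attr
            # Attribute files are padded with whitespace, so strip them.
            info[attr] = path.read_text().strip() if path.is_file() else None
        controllers.append(info)
    return controllers
```

Comparing the length of that list against the number of populated slots is a quick way to tell whether a drive is missing entirely rather than merely unhealthy.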
I did notice one smoking gun, though: smartctl told me the surviving drive had tallied 72 unsafe shutdowns–which, for a brand new drive that has been running since October, is far too many.
The board hasn’t been power cycled anywhere near that many times (especially considering it’s plugged into a UPS), and for comparison, the internal drive on my Lenovo Flex has been around for three years, is subjected to all sorts of abuse, and has only recorded 66 unsafe shutdowns in its lifetime.
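If you want to keep an eye on the same counter, the sketch below pulls it out of smartctl’s JSON output. It assumes smartmontools 7.0 or later (which added the --json flag), root privileges, and a drive at /dev/nvme0. The helper names power_counters and read_drive are my own placeholders.

```python
import json
import subprocess


def power_counters(smartctl_json: str) -> dict:
    """Extract power-related counters from `smartctl --json -a` NVMe output."""
    report = json.loads(smartctl_json)
    # NVMe drives report these in the SMART/health information log section.
    log = report.get("nvme_smart_health_information_log", {})
    return {
        "power_cycles": log.get("power_cycles"),
        "unsafe_shutdowns": log.get("unsafe_shutdowns"),
        "power_on_hours": log.get("power_on_hours"),
    }


def read_drive(device: str = "/dev/nvme0") -> dict:
    # Assumption: smartctl is on PATH and we are running as root.
    # check=False because smartctl uses non-zero exit codes as status bitmasks.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return power_counters(result.stdout)
```

Calling read_drive() periodically and logging the result would show whether unsafe shutdowns are climbing faster than power cycles, which is roughly the mismatch I stumbled upon after the fact.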
Now, one drive failing is a drive problem. Three drives failing simultaneously on the same board has to be a deeper hardware issue–I’ve sent a note to FriendlyElec regarding this and have yet to hear back, but (sadly) for the time being I have to update my original review and my notes on getting Proxmox to work with a link to this post as a cautionary tale.
Update: In the meantime a reader sent me this link to the FriendlyElec forum (PDF copy here) where other people had similar issues, also with WD drives.
Next Steps
This is pretty sad considering the CM3588 was the best ARM machine I had–with 16GB RAM, 8 cores and a decent cooling solution, it was also the platform I was going to consolidate my ARM development on–so right now I’m setting up shop temporarily on the Banana Pi M7.
I’ve started an RMA process with SanDisk to see if I can get the drives replaced (they were under warranty, and a sizable investment) and will be looking for more multi-NVMe devices over the next few months (suggestions are welcome).
In the meantime, I think I will move my home automation setup to the u59 Pro instead, and cut down on the services I run on ARM until I can find a suitably beefy device to replace the CM3588.
I’ll try to update this post as things progress.