Synology Volume Crash

Note: I may receive commissions for purchases made through links in this post. This is to help support my blog and does not have any impact on my recommendations.

Back when I bought my Synology NAS in 2012, I picked up four Seagate 3TB drives. Yes, they’re the same model (ST3000DM001) that Backblaze reported on in 2015 for its unusually high failure rate. I even had one drive fail within the first few weeks, but I didn’t think much of it and just submitted an RMA. Since then, I haven’t worried too much about it, and those drives have been running for over six years straight with no issues. Well, until recently anyway…

Hmm. Two drives in my NAS are starting to complain...
They’re also 6 1/2 years old…
… But replacing them costs money.
— Matt (@0x2142) April 27, 2018

Almost a month ago, I got a couple of email alerts from my Synology complaining about bad sectors. I hopped onto the device to check it out and decided to run the extended SMART tests against every disk in the array. I already have these tests scheduled to run regularly, but I figured it couldn’t hurt to run them just in case. As luck would have it, Disk 1 of my array failed its SMART test within minutes.

Dear user,
Disk 1 on DS918+ has degraded severely and is in failing status. Please make sure that your data has been backed up and then replace this disk.

Additional disk information:
Brand: Seagate
Model: ST3000DM001-9YN166
Capacity: 2.7 TB
Serial number: XXXXXXXX
Firmware: CC9C

S.M.A.R.T. status: Failing
IronWolf Health Management: -
Bad sector count: 128
Disk reconnection count: 0
Disk re-identification count: 0
SSD estimated lifespan: -

The other three disks completed their tests with no reported issues. Disk 3 did worry me a bit though, because it had a rather high bad sector count (~12,000). I figured I would deal with Disk 1 first, and once the array was rebuilt I would slowly work my way through the remaining drives. The drives were six years old anyway and I didn’t expect them to last forever. At the same time, I really didn’t want to spend the money on all new drives just yet.

After doing my own research and talking to a few other people I know, I opted to try out the Western Digital Red drives. These drives are made for NAS systems and include some features that standard consumer drives don’t. My array was also around 65% full, so I figured now would be the best time to start expanding it. I bought one of the 4TB WD40EFRX drives and had it shipped as quickly as possible.

Unlike my previous Synology DS411, which required disassembling the chassis, the DS918+ supports hot-swapping drives. Once my new drive showed up, I pulled out Disk 1 and put in the new drive. I logged into the Synology DSM interface and told it to start rebuilding the array onto the new drive. Simple enough, right?

Well, as luck would have it, I only got a few hours into the rebuild before I got this wonderful email:

Dear user,

Volume 1 has crashed. The system may not boot up.

At this point I was at work and wondering what the state of my NAS was. Did another drive die during the rebuild? Had I lost data? Would I be able to recover any of my stuff? I opened a ticket with Synology support immediately so I could get their input. I got back both good and bad news. The bad news? The volume was not going to be repairable, and any attempt to repair the disks would fail. The good news was that my data was fine, but only accessible in a read-only state.

I still had my DS411 lying around, so I decided to make the investment in all new drives. I picked up three additional Red drives and loaded all four of them into my DS411. Soon enough I had created a new volume/RAID array and I was ready to copy my data over from the DS918+.

The big question was how exactly to replicate the data. I knew Synology had a few options for NAS-to-NAS syncing/backups, but I hadn’t used them before.

The first option was to use Synology’s Hyper Backup application. Unfortunately, this required two things I didn’t have. First, I would need to install Hyper Backup, which wouldn’t work since the volume was read-only (a state which also appeared to kill all non-critical packages). Second, I would need at least double the space on my backup device: enough to store a full NAS backup, plus enough to restore that entire backup.

The second option for copying the data was Shared Folder Sync, which is the method I ended up using. This allowed me to set up a Synology-managed rsync job between the two NAS devices. Creating the rsync tasks required a configuration change, which I couldn’t make on the crashed, read-only volume. However, I took a chance and found out that when I rebooted the crashed NAS, the volume would be writable until it crashed again. This gave me about ten minutes after startup to quickly create the replication job.
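As a rough illustration of what an rsync-based copy between two machines looks like, here is a minimal Python sketch that only builds an rsync command line. This is not what Shared Folder Sync runs internally; the helper name `build_rsync_command`, the host `ds411.local`, the `admin` user, and the `/volume1/photos` paths are all hypothetical placeholders.

```python
import shlex
from typing import List


def build_rsync_command(source_dir: str, dest_host: str, dest_dir: str,
                        user: str = "admin") -> List[str]:
    """Build (but do not run) an rsync command that copies a folder to another host.

    All names here are hypothetical; adjust them for your own setup.
    """
    return [
        "rsync",
        "--archive",          # recurse and preserve permissions, times, symlinks
        "--partial",          # keep partially transferred files so a retry can resume
        "--human-readable",   # print sizes in readable units
        "--progress",         # show per-file transfer progress
        # Trailing slash: copy the folder's contents rather than the folder itself.
        f"{source_dir.rstrip('/')}/",
        f"{user}@{dest_host}:{dest_dir}",
    ]


# Hypothetical example: one shared folder, copied to the second NAS.
cmd = build_rsync_command("/volume1/photos", "ds411.local", "/volume1/photos")
print(shlex.join(cmd))
```

The sketch prints the command instead of executing it, so it can be reviewed before anything touches a fragile volume. The `--partial` flag is included because a transfer from a failing array may well be interrupted.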

Unfortunately my older DS411 could barely handle the CPU requirements needed to copy the data. It did work, but it took a full week to copy ~5.3TB over a gigabit network. Once that copy was finished, I did a quick double-check to make sure I had everything. Then I went ahead and swapped the new drives into my DS918+ following the same steps I used during my original migration.
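To put that week in perspective, here is a quick back-of-the-envelope calculation of the average transfer rate. It assumes decimal terabytes (1 TB = 10^12 bytes) and exactly seven days of continuous copying, both of which are approximations; the function name is just for illustration.

```python
def average_throughput(bytes_copied: float, seconds: float):
    """Return the average transfer rate as (MB/s, Mbit/s), using decimal units."""
    mb_per_s = bytes_copied / seconds / 1e6
    mbit_per_s = mb_per_s * 8
    return mb_per_s, mbit_per_s


TB = 1e12                      # assumption: decimal terabytes
WEEK_SECONDS = 7 * 24 * 3600   # assumption: exactly seven days of copying

mb_s, mbit_s = average_throughput(5.3 * TB, WEEK_SECONDS)
share_of_gigabit = mbit_s / 1000 * 100

print(f"{mb_s:.1f} MB/s, {mbit_s:.0f} Mbit/s, {share_of_gigabit:.0f}% of a gigabit link")
# prints "8.8 MB/s, 70 Mbit/s, 7% of a gigabit link"
```

Under those assumptions the copy averaged under a tenth of what the gigabit link could carry, which is consistent with the DS411's CPU being the bottleneck rather than the network.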

Overall I think I got extremely lucky here. The only thing I ended up losing was an iSCSI LUN with a few VMs, but I was able to retain all of my critical data. While I do have all of my NAS data backed up to a cloud service, I wasn’t really happy with the idea of having to download 5.3TB over the Internet to restore my data. Since this event, I’ve configured some additional alerting/monitoring of the drives in my DS918+. Next time I see signs of drive trouble, I’ll probably just replace the drive immediately without thinking twice about it. Hopefully this helps out anyone else who runs into this issue! Let me know in the comments if you have any questions.