A few weeks ago, I noticed that my ZFS array was resilvering due to an HD failure. This is the first time a drive has failed in my ZFS array, which is a little over 2 years old. My ZFS pool was 99.7% full, an issue I'd been meaning to deal with for quite some time, but other things kept taking priority. As a result, a resilver (rebuild/resync in RAID terms) was causing quite a bit of thrashing on the disks.
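For anyone following along at home, this is roughly how you keep an eye on a resilver and on pool capacity. "tank" is just a placeholder for my pool's name:

    # Show resilver progress, the scan ETA, and per-drive state
    zpool status -v tank

    # Show pool capacity -- the CAP column is where my 99.7% was staring back at me
    zpool list tank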
A bit of background: In early 2011, I built the successor to my 20-HD 30TB Raid6 array: a 24-HD 48TB ZFS RaidZ2 array. RaidZ2 is similar to Raid6, in that two drives' worth of capacity go to parity rather than storage. This means that two drives can fail without losing any data. I knowingly went outside best practices while building it and put 24 drives in one vdev. Actually, 23 drives in the vdev, and one hot spare. A vdev is a group of drives within a ZFS pool. Parity is localized to a vdev, and you can have multiple vdevs within a pool (each with its own parity drives). You are only supposed to put up to 9 drives in a RaidZ2 vdev. At the time, it seemed like the only disadvantage of having that many drives in a vdev was performance, which was not a huge factor for me. ZFS spreads every block across all the drives in a RaidZ vdev, so the whole vdev only delivers roughly the IOPS of its slowest single drive. What I didn't consider was the amount of stress a resilver puts on the drives, or how long it would take to complete. Especially when your pool is 99.7% full, and it has to move data around in little tiny chunks.
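To make the layout concrete, here's roughly the difference between what I built and what best practice would suggest for the same 24 bays. The pool name and Solaris-style device names are just placeholders, and the two commands are alternatives, not meant to be run together:

    # What I built: one 23-drive RaidZ2 vdev plus a hot spare. Every block spans
    # the whole vdev, so any resilver involves all 23 drives at once.
    zpool create tank raidz2 \
        c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
        c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 c1t15d0 \
        c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 \
        spare c1t23d0

    # The by-the-book alternative: three 8-drive RaidZ2 vdevs. Parity is per-vdev,
    # so a resilver only touches the 8 drives in the affected vdev.
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
        raidz2 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 c1t15d0 \
        raidz2 c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0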
So, back to the drive failure. This should have been no big deal. I had a hot spare, which is why the array was resilvering itself without any intervention from me. By the time I noticed it, the resilver was about 3 hours in, and the ETA to completion was 72 hours from then! That's 72 hours of continuous hard disk thrashing, on top of the normal read/write load from the 22 VMs I have running on that array. I shut some of my VMs down to hopefully speed up the process, and checked back a few hours later. To my horror and dismay, the hot spare had failed, and 4 other drives had taken IO errors (and were being resilvered as a result). The array continued to resilver across the remaining drives, still thrashing like hell. Several hours later, 3 more drives had taken IO errors. That's 8 drives resilvering, and 2 faulted in the pool. It doesn't get much worse than this. I always buy my drives from multiple sources when I build a NAS, so that they have different dates of manufacture and are less likely to fail in huge batches. So what the hell was going on? All I could figure was that the incredible stress of the resilver was too much for my consumer-grade HDs to handle.
At about 10pm that night, it happened. A third drive failed, less than 18 hours into the resilver. Since one of the three was the hot spare, technically only two drives from the original pool had failed, which is the maximum RaidZ2 allows without losing data. At this point I am shitting bricks, and literally can't sleep. I shut all my VMs down, and am scrambling to move critical data to other disks in the house. I lit up my old array, which hadn't been powered on since we moved into the new house. It wouldn't boot! Something was up with the OS on the boot drive, so I booted off an Ubuntu Live CD and mounted the array. All was fine, but the data (originally a backup of what was on the new array) was quite stale. Since 48TB > 30TB, I obviously had to decide what I was willing to lose, and only copy some stuff over. I also started using external USB drives and my desktop machine as temporary storage, in case another drive failed. The next morning my wife says, "Dave, is something supposed to be beeping in the furnace room?" This can only mean one thing: a drive failed on my old array (which has an enterprise RAID card that notifies you when a drive fails). What else could go wrong? Since my VMs were shut down, I did not get an email notification. I hopped on the console and saw that drive 9 had failed, and the array was rebuilding onto drive 10 (the hot spare). The ETA on this rebuild was much shorter: 10 hours. It completed without incident, and I swapped the bad HD for a cold spare I had on hand.
Eventually, the resilver completed, after 73 hours. No more drives failed, and I hadn't lost any data. I was relieved, but still incredibly spooked that I could lose it all at any minute if another drive failed. Throughout all of this, I had been trying to figure out what my long-term plan was going to be. Up until this all happened, I had been considering rebuilding my old array (the one with the hardware RAID card in it) with larger (and more) disks. But now there was critical data copied onto that array, and I couldn't scrap it and start over. It seemed like my only option was to build another (third) NAS server, at considerable expense. I could go ZFS, which requires lots of RAM (expensive), or RAID, which requires a hardware RAID card (expensive). I was highly annoyed, because if I had just addressed this a few months ago, when I knew I was running out of space, I would not have had to build a third server. Then it occurred to me that I could buy a drive enclosure with a built-in SAS expander and connect it to my existing server. That would require me to upgrade the amount of RAM (ZFS likes RAM), but it was doable. Of course, I would have to scrap my existing RAM, because it was ECC unbuffered and I had maxed out what my motherboard could handle with it (48GB); I would have to purchase ECC Registered DIMMs to go beyond the 48GB barrier. I was telling my tale of woe to a coworker, and he mentioned that we had a bunch of servers sitting unused in our warehouse, and that he thought they were full of RAM. I checked it out, and they were indeed full of RAM. 352GB of ECC Registered RAM, to be exact! So I borrowed twelve 8GB DIMMs and put them in my server. Voila! 96GB!
I ordered up my enclosure, a Norco DS-24E, and 8 Toshiba 3TB 7200RPM SATA drives. I figured I would start with 8 drives and expand later on. The enclosure and drives arrived a few days later, and appeared to install without incident. That is, until I realized that all of the drives were detected as 2.2TB drives. WTF? Some googling quickly revealed that the LSI SAS1068E chipset on my SAS controllers does not support 3TB drives; it can't address more than about 2.2TB per disk. At this point it had been over a week since the first drive failed, and I was on borrowed time with that array. After a few hours of research, I ordered an LSI SAS2008-based PCI-E SAS HBA. It's not the best or newest, but it is known to work in the rather unusual configuration I am running (ESXi hardware passthrough to a Solaris VM, which shares the array back to ESXi via NFS). I also ordered 8 more 3TB drives, because with only 8 drives per vdev I would have much less usable space, and I still needed more room to temporarily store my data. This is getting quite expensive!
The new controller and drives showed up 2 days later, and I began surgery. It went surprisingly well. ZFS is fantastically resilient and scalable. After an export/import, the pool was detected perfectly on the new controller, even though all of the drive IDs had changed. I was super relieved at this point. I then added the other 8 drives to the enclosure and built the new pool as two 8-drive RaidZ2 vdevs. The pool created without incident. I enabled compression, NFS, and SMB on the pool, and immediately began copying my data to it. It's now been a little over 24 hours, and 20TB of the data has been copied over. I intend to get a current copy of everything on the failing array, and then scrap it completely. I will rebuild it with three 8-drive RaidZ2 vdevs, just like the new array, and forgo the hot spare. I'll lose a significant amount of storage (6TB), but this whole event will be much less likely to occur again. Any future resilvering will be limited to 8 drives, instead of 24. Also, my IOPS will be greatly improved, because ZFS stripes across multiple vdevs.
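For the curious, the surgery boiled down to a handful of commands, give or take. The pool and device names below are placeholders, not my actual ones:

    # Move the existing pool to the new HBA: export, recable, then import.
    # ZFS identifies the disks by their on-disk labels, so it doesn't care
    # that the device IDs changed.
    zpool export tank
    zpool import tank

    # Create the new pool as two 8-drive RaidZ2 vdevs on the enclosure's drives.
    zpool create tank2 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 \
        raidz2 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 c2t15d0

    # Turn on compression and NFS/SMB sharing for the pool's root dataset.
    zfs set compression=on tank2
    zfs set sharenfs=on tank2
    zfs set sharesmb=on tank2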