Systems Administrators tend to worry a lot. Computers break all the time, and we try to have a plan for when the inevitable occurs.
On Friday the 8th, a hard drive failed on one of our physical servers. The server had four production virtual servers on it. Supposedly, the hard drives we use in our servers are less likely to fail. They come with 5-year warranties, and graduated at the top of their class. To be on the safe side, each of our servers has two hard drives that are mirrored. So if one drive fails, the server just keeps operating without a hiccup.
The story could have ended there. We could have just left it until Monday, ordered a replacement drive, and installed it without any downtime. The likelihood of the second drive failing in the next few days is low, and we have backups. However, what if the other drive did fail? That would mean downtime as we reinstall and restore data… and downtime means sad customers (you).
Don’t worry, we’ve got a backup plan for our backup plan. Actually, we’ve got many layers of backup plans that could put you to sleep. Here’s the quick version of the contingency plans that actually came into play.
We use Nagios (set up by Sys. Admin Matt Dailey) to monitor all our systems and let us know if something is unhappy. Matt also installed utilities on the servers for keeping track of the hard drives and other critical hardware. If a drive gets grumpy, Nagios knows, and immediately sends us an email. For critical things like drive failures, it also sends a text message to our cell phones.
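To make the alerting idea concrete, here is a minimal sketch of how a drive check and the email/text policy described above could fit together. It is not our actual configuration. The exit codes 0 to 3 follow the standard Nagios plugin convention, but the drive names, the state strings ('online', 'rebuilding', 'failed'), and both function names are assumptions made up for illustration. In a real setup, deciding who gets an email versus a text is normally handled in the Nagios notification configuration rather than in the check itself.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mirror(drive_states):
    """Map per-drive states to a Nagios status code and a one-line message.

    drive_states is a hypothetical dict such as {"disk0": "online"};
    a real check would read this from the RAID controller's utility.
    """
    if not drive_states:
        return UNKNOWN, "UNKNOWN - no drives reported"
    failed = sorted(n for n, s in drive_states.items() if s == "failed")
    rebuilding = sorted(n for n, s in drive_states.items() if s == "rebuilding")
    if failed:
        return CRITICAL, "CRITICAL - failed: " + ", ".join(failed)
    if rebuilding:
        return WARNING, "WARNING - rebuilding: " + ", ".join(rebuilding)
    return OK, "OK - all %d drives online" % len(drive_states)


def notification_channels(code):
    """Illustrate the policy from the post: email for any problem,
    plus a text message when the problem is critical."""
    if code == OK:
        return []
    if code == CRITICAL:
        return ["email", "sms"]
    return ["email"]


# Example: one half of the mirror has failed.
code, message = check_mirror({"disk0": "online", "disk1": "failed"})
print(message, notification_channels(code))
```

A real plugin would print the message and exit with the status code so Nagios can pick it up.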
I received the text message and, after verifying the problem, called Jordan (who happens to be the primary maintainer of the virtual servers in peril).
Jordan immediately transferred the virtual servers to a backup blade server we had waiting on standby. Even over a slow dial-up connection, he was able to do this from home.
The next day, Saturday, Matt Dailey came in to replace the bad hard drive with a spare we had on hand for just such an occasion. Matt pulled the bad drive out, and inserted the new drive. The good drive immediately started mirroring all its data to the new drive for the next time something goes wrong.
Everything is rosy!
Well, mostly rosy. Considering this was our first disk failure since I’ve been working here, everything went surprisingly well. However, there are always surprises. Thankfully, none of the ones we encountered caused any hardship to the community.
First, when you plan for disaster, it is easy to overlook the basics, like where the spare hard drive is actually located. We were all sure that we had put the spare drive on top of the servers. However, we installed larger-capacity drives several months ago, and the spare never made it from our office to the server room. Thankfully, Matt was easily able to find it.
Second, it is a pain, but it can be really helpful to simulate failures. It took the server 11 hours to fully mirror a 147 GB hard drive. This is much slower than we anticipated. Because we had already moved the production virtual servers off the system, it wasn’t an issue. Dell has told us that this is a limitation of the RAID controller, so it is something we’ll need to work around.
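To put that rebuild time in perspective, here is a quick back-of-the-envelope calculation. The 147 GB and 11 hours come from the rebuild above. The decimal units (1 GB = 1000 MB) and the 300 GB drive in the second example are assumptions for illustration only, as is the idea that the controller would rebuild a larger drive at the same rate.

```python
def rebuild_rate_mb_per_s(size_gb, hours):
    """Average rebuild throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / (hours * 3600)


def rebuild_hours(size_gb, rate_mb_per_s):
    """Estimated hours to mirror a drive of size_gb at a steady rate."""
    return size_gb * 1000 / rate_mb_per_s / 3600


# Observed: 147 GB mirrored in 11 hours.
rate = rebuild_rate_mb_per_s(147, 11)
print(f"{rate:.2f} MB/s")  # 3.71 MB/s

# Hypothetical: a 300 GB drive at the same rate.
print(f"{rebuild_hours(300, rate):.1f} hours")  # 22.4 hours
```

At under 4 MB/s, the window during which the server runs on a single drive grows in step with drive size, which is a good reason to move production workloads off the machine before rebuilding.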
The third hiccup we’ve run into is that Hitachi no longer manufactures the hard drive we are using. I purchased a new spare from Seagate with identical specs. However, for reasons Dell doesn’t even understand, the RAID controller refuses to allow two drives from different manufacturers to work together. For systems that you plan to have in production for many years, buy more than one spare, or at least know that you may need a contingency plan for when you can no longer buy a specific component.

Comments 1
Lucky you. My Barracuda failed and I didn’t have mirrors. Restoring the data was a real pain; it taught me never to run a server without mirrors.
Posted 07 Mar 2008 at 6:31 pm