Looking over the last few posts, I seem to be stuck in a rut. I apologize to readers looking for stories about our remodeling project, progress on Barb’s quilt shop, cooking, or other tales of home. Believe me when I say that I’d much rather be writing about those things! Sadly I’ve been attending the school of hard knocks for the last month, and I want to record some of the things I’ve learned in the hopes of helping other people in similar situations. Heck, it may even come in handy for *me* in the future, although I sincerely hope not!
The system I’m currently fixing has a high-reliability disk setup. The server has two Fibre Channel interfaces, each connected to a separate RAID-5 array. The server takes care of mirroring the two arrays, while the external disk arrays implement the RAID-5. This is all done with an Apple Xserve server and an Xserve RAID enclosure.
It looks like one of the RAID arrays got corrupted during our recent power outage, as Disk Utility (the GUI tool) is reporting “mirror degraded”. I used diskutil (the CLI tool) and the RAID Admin utility to check the RAID arrays, but both reported no errors on either side of the mirror. A little time on Google turned up a post which suggested [using iostat to determine which side of the mirror had failed][2], but when I tried rebuilding the mirror using those instructions I kept getting “Error -9980”. After some more digging, it looked like the “broken” side of the mirror was unmounted. I tried several things to get that half of the mirror to reconnect, even going so far as shutting down the server and the RAID and rebooting. No luck, although I was eventually able to get ‘diskutil repairMirror *raid-disk* *slice* *from-disk* *to-disk*’ accepted. (It didn’t actually **do** anything, though. But at least it didn’t error out anymore.)
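
The iostat idea can be sketched in a few lines of Python. This is only a rough sketch: it assumes that the failed half of a mirror is the one showing no throughput while the volume is busy, and the linked post may go about it differently. The disk names and numbers in `SAMPLE` are made up, the column layout is what I would expect from `iostat -d` on Mac OS X, and `idle_disks` is a hypothetical helper, not part of any Apple tool.

```python
def idle_disks(iostat_text, threshold_mb_s=0.0):
    """Return the disks whose summed MB/s never rises above the threshold.

    Assumes the first non-blank line lists the device names and that every
    data row carries three columns (KB/t, tps, MB/s) per device.
    """
    lines = [ln for ln in iostat_text.splitlines() if ln.strip()]
    disks = lines[0].split()
    totals = {disk: 0.0 for disk in disks}

    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 3 * len(disks):
            continue  # repeated device-name line or something unexpected
        try:
            # MB/s is the third column of each device's group of three.
            rates = [float(fields[3 * i + 2]) for i in range(len(disks))]
        except ValueError:
            continue  # column-header line ("KB/t tps MB/s ...")
        for disk, rate in zip(disks, rates):
            totals[disk] += rate

    return [disk for disk in disks if totals[disk] <= threshold_mb_s]


# Made-up capture: disk2 is busy, disk3 sees no traffic at all.
SAMPLE = """\
          disk2           disk3
    KB/t tps  MB/s     KB/t tps  MB/s
  128.00 240 30.00     0.00   0  0.00
  128.00 236 29.50     0.00   0  0.00
"""

print(idle_disks(SAMPLE))
```

With a real capture you would paste in a few seconds of output taken while something is writing to the mirrored volume. A disk that stays at zero is only a suspect, not proof of which half failed.
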
Finally, I noticed that one of the RAID controller cards was behaving oddly. It reported status=okay, but it wouldn’t accept connections from the RAID Admin utility (over the LAN). I had been ignoring it, since I was connected to the other controller, but I decided to connect directly to the flaky one and see if there was something it wasn’t reporting to the controller I had been using to diagnose the problem. I found that the IP address of the dodgy controller wasn’t correct, and the passwords for monitoring and management access weren’t working. I shut down the server again, and this time when I shut down the RAID I pulled the power cords. (I had found out that without pulling the power cords, the array isn’t actually powered off; rather, it’s in “sleep” mode.) I powered up the RAID (but not the server!), then [reset the RAID controller][3] (just the password reset, not the full-blown one). Between power-cycling the RAID and resetting the controller, I was able to get the RAID Admin utility to connect to the array. Then when I rebooted the server, the disk mirror started rebuilding without any other action on my part.
The mirror seems to be rebuilding right now. The RAID arrays are 1.225 TB (1.1 [TiB][]) each, with ATA/100 disks. (The hardware is a few years old.) At the rate it’s been going, I expect the rebuild to take about 11 hours total.
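
For anyone who wants to check the arithmetic, here is a small Python sketch of the TB-to-TiB conversion and the average throughput that an 11-hour rebuild implies. The function names are my own, and the rate assumes the whole 1.225 TB array gets copied at a steady pace, which is a simplification.

```python
TB = 10**12    # terabyte (decimal)
TIB = 2**40    # tebibyte (binary)


def tb_to_tib(size_tb):
    """Convert decimal terabytes to binary tebibytes."""
    return size_tb * TB / TIB


def rebuild_rate_mb_s(size_tb, hours):
    """Average rate in MB/s needed to copy size_tb in the given hours."""
    return size_tb * TB / (hours * 3600) / 10**6


print(round(tb_to_tib(1.225), 2))              # 1.11
print(round(rebuild_rate_mb_s(1.225, 11), 1))  # 30.9
```

So the estimate works out to roughly 31 MB/s sustained.
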
[1]: http://www.frozennorth.org/C2011481421/E20060221212020/ “Setting up mirrored disks on Mac OS X”
[2]: http://www.radiotope.com/node/23 “How to figure out which half of the mirror has failed when both report ‘okay’”
[3]: http://support.apple.com/kb/HT2758 “How to reset the Xserve RAID controller card”
[TiB]: http://en.wikipedia.org/wiki/TiB