OS X Degraded Mirror and Recovery

Here's one that I guide people through pretty often: how do you tell which disk is bad in a degraded software RAID when they both claim to be OK?

Anyone running a software RAID on an Xserve has seen it: a RAID set becomes degraded for some reason, diskutil shows both drives as "OK", but the set is now split.

I'm a huge proponent of watching log files, but in this case, blink and you'll miss it. You'll see an entry in the system log that looks like this:

Jul 29 18:56:15 server kernel: AppleRAID::completeRAIDRequest - error detected for the set, "BigMirrorVol", status = 0xe00002ca.

(This is why I'm a huge proponent of having something automated watch your log files for you... more on that in the future.)
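As a rough illustration of what "something automated" could look like, here is a minimal Python sketch that scans log lines for the AppleRAID error shown above. The pattern is built only from that one sample line, so other AppleRAID messages may be worded differently; the function name find_raid_errors is my own invention, not part of any Apple tool.

```python
import re

# Pattern derived from the single sample log line above (an assumption:
# other AppleRAID messages may be worded differently).
RAID_ERROR = re.compile(
    r'AppleRAID::completeRAIDRequest - error detected for the set, '
    r'"(?P<set>[^"]+)", status = (?P<status>0x[0-9a-fA-F]+)'
)


def find_raid_errors(lines):
    """Return (timestamp, set name, status code) for each matching log line."""
    hits = []
    for line in lines:
        match = RAID_ERROR.search(line)
        if match:
            # Classic syslog timestamps ("Jul 29 18:56:15") are 15 characters.
            hits.append((line[:15], match.group("set"), match.group("status")))
    return hits


if __name__ == "__main__":
    sample = [
        'Jul 29 18:56:15 server kernel: AppleRAID::completeRAIDRequest - '
        'error detected for the set, "BigMirrorVol", status = 0xe00002ca.',
    ]
    for when, raid_set, status in find_raid_errors(sample):
        print(f"{when}: RAID set {raid_set} reported status {status}")
```

To use it for real you would feed it the lines of your system log (for example by opening /var/log/system.log, assuming that is where your kernel messages land) from cron, and mail yourself whatever it finds.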

At this point, no data has been lost, but we're only writing to one drive in the mirror set. The trick here is finding out which one. diskutil isn't really a help here, as it'll tell you something like this:

Name: BigMirrorVol
Unique ID: BigMirrorVold7f6ef50f84e11d985ce000a958381b8
Type: Mirror
Status: Degraded
Device Node: disk5
-------------------------------------------------------------
# Device Node Status
-------------------------------------------------------------
0 disk3 OK
1 disk2 OK
-------------------------------------------------------------

Hmmmmm...if both disks are OK, why do I no longer have a functioning RAID?
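If you want a script to notice this contradiction for you, the check is simple: the set says "Degraded" while every member says "OK". The Python sketch below assumes the RAID status text has the same shape as the output shown above (key/value lines followed by numbered member rows); the names summarize_raid and is_ambiguous are hypothetical helpers, and the layout may differ between OS X releases.

```python
import re

# A member row looks like "0 disk3 OK": index, device node, status.
MEMBER_ROW = re.compile(r"^\s*(\d+)\s+(disk\d+)\s+(.+?)\s*$")


def summarize_raid(text):
    """Pull the set name, set status and member statuses out of the text."""
    summary = {"name": None, "status": None, "members": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Name":
            summary["name"] = value.strip()
        elif sep and key.strip() == "Status":
            summary["status"] = value.strip()
        else:
            match = MEMBER_ROW.match(line)
            if match:
                summary["members"].append((match.group(2), match.group(3)))
    return summary


def is_ambiguous(summary):
    """True when the set is degraded but no member admits to being the problem."""
    members = summary["members"]
    return (
        summary["status"] == "Degraded"
        and bool(members)
        and all(status == "OK" for _, status in members)
    )


SAMPLE = """\
Name: BigMirrorVol
Type: Mirror
Status: Degraded
Device Node: disk5
# Device Node Status
0 disk3 OK
1 disk2 OK
"""

if __name__ == "__main__":
    info = summarize_raid(SAMPLE)
    if is_ambiguous(info):
        print(f"{info['name']} is degraded, but all members report OK")
```

When is_ambiguous returns True, diskutil has told you all it is going to, and you need another way to find the stale disk.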

I use iostat to figure this out. On a busy server, people are still going to be reading and writing to the 'good' disk. On a quiet server, if you must, run iostat and then read some data off the disk to force some activity. Using the example above, there are two disks in this mirror set, disk3 and disk2. Here's what they look like in iostat:

# iostat -c 20 disk3 disk2
iostat: sysctl(kern.tty_nin) failed: No such file or directory
iostat: disabling TTY statistics
disk3 disk2 cpu
KB/t tps MB/s KB/t tps MB/s us sy id
116.96 4 0.46 106.53 2 0.17 4 5 92
8.00 3 0.02 0.00 0 0.00 0 0100
0.00 0 0.00 0.00 0 0.00 0 2 98
121.25 18 2.13 0.00 0 0.00 10 20 71
125.24 340 41.63 0.00 0 0.00 2 22 76
13.25 14 0.18 0.00 0 0.00 4 4 92

Ignore the first row, which iostat averages over the time since boot. In every interval after it, we can see that it's disk3 that's still in active use, and disk2 is not even being touched. Once you re-establish the RAID set (in this case using diskutil repairmirror disk5 3 disk3 disk2), you hope that this never happens again. If it does, and it's the same disk that goes batty, signs point to it being replacement time.
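Reading the iostat columns by eye works, but the same comparison can be scripted. The sketch below is a minimal Python example, assuming iostat output laid out like the capture above (three columns per disk: KB/t, tps, MB/s). It adds up the tps column for each disk, skips the first data row because iostat reports averages since boot there, and lists any disk that saw no transfers. The function name idle_disks is made up for this example.

```python
def idle_disks(iostat_text, skip_first=True):
    """Return (disks with zero transfers, total tps per disk)."""
    disks = []
    totals = {}
    rows_seen = 0
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if not disks:
            # Wait for the header row that names the disks: "disk3 disk2 cpu".
            if fields[0].startswith("disk"):
                disks = [f for f in fields if f.startswith("disk")]
                totals = {d: 0.0 for d in disks}
            continue
        if len(fields) < 3 * len(disks):
            continue
        try:
            # Only the per-disk columns are parsed; the CPU columns can run
            # together (as in "0 0100") and are ignored.
            values = [float(f) for f in fields[:3 * len(disks)]]
        except ValueError:
            continue  # the "KB/t tps MB/s" header row
        rows_seen += 1
        if skip_first and rows_seen == 1:
            continue  # first row is the average since boot
        for i, disk in enumerate(disks):
            totals[disk] += values[3 * i + 1]  # tps is the second column
    idle = [d for d in disks if totals[d] == 0]
    return idle, totals


SAMPLE = """\
iostat: disabling TTY statistics
disk3 disk2 cpu
KB/t tps MB/s KB/t tps MB/s us sy id
116.96 4 0.46 106.53 2 0.17 4 5 92
8.00 3 0.02 0.00 0 0.00 0 0100
0.00 0 0.00 0.00 0 0.00 0 2 98
121.25 18 2.13 0.00 0 0.00 10 20 71
125.24 340 41.63 0.00 0 0.00 2 22 76
13.25 14 0.18 0.00 0 0.00 4 4 92
"""

if __name__ == "__main__":
    idle, totals = idle_disks(SAMPLE)
    print("idle:", idle, "totals:", totals)
```

A disk that shows up as idle while its partner is busy is the one that dropped out of the mirror, so the busy one is your "from disk". On a quiet server an idle result means nothing until you have forced some reads.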

This tip should, at the very least, let you get the latest and greatest data, and know which disk to use as the "from disk".