Orange is my favorite color

Just a quickie: it turns out our logwatch was not alerting us clearly enough when a drive failed in our RAID array. Obviously you want to replace a dead drive as quickly as possible to reduce the likelihood of a second or third drive failing and potentially taking your data with it.

For Linux, HP has a tool available called hpacucli (HP Array Configuration Utility for Linux) for interrogating HP/Compaq array controllers (SmartArray 5i, 6i, whatever) from the command line. Before you can install the RPM (on CentOS/Red Hat), you first need to install a compatibility library:

yum install compat-libstdc++-296
rpm -Uvh hpacucli-8.0-14.noarch.rpm 

Then I put this snippet into a new file /etc/cron.hourly/raidstatus:

#!/bin/sh
/opt/compaq/hpacucli/bld/hpacucli ctrl all show config | egrep -i "(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)"

The command /opt/compaq/hpacucli/bld/hpacucli ctrl all show config normally generates something like this (from our development database server):

Smart Array XXXXXXX in Slot 0      ()

   array A (Parallel SCSI, Unused Space: 0 MB)

      logicaldrive 1 (33.9 GB, RAID 1+0, OK)

      physicaldrive 2:0   (port 2:id 0 , Parallel SCSI, 36.4 GB, OK)
      physicaldrive 2:1   (port 2:id 1 , Parallel SCSI, 36.4 GB, OK)

   array B (Parallel SCSI, Unused Space: 0 MB)

      logicaldrive 2 (67.8 GB, RAID 1+0, OK)

      physicaldrive 2:2   (port 2:id 2 , Parallel SCSI, 36.4 GB, OK)
      physicaldrive 2:3   (port 2:id 3 , Parallel SCSI, 36.4 GB, OK)
      physicaldrive 2:4   (port 2:id 4 , Parallel SCSI, 36.4 GB, OK)
      physicaldrive 2:5   (port 2:id 5 , Parallel SCSI, 36.4 GB, OK)

I believe you can reduce the grep pattern to just “(fail|nok)”, but I’m taking the conservative approach here. Change the permissions on the raidstatus script to 0700, and if you have SELinux running, make sure the file’s context is set properly.
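
As a rough sketch of those two steps (not necessarily the exact commands used on our servers): the function name secure_cron_script is made up for illustration, and it assumes the stock selinuxenabled and restorecon tools are the right way to reset the context on your system.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): lock down a cron script
# and, where SELinux is active, reset its context to the policy default.
secure_cron_script() {
    script="$1"

    # Owner-only read/write/execute, i.e. mode 0700.
    chmod 0700 "$script" || return 1

    # Assumption: selinuxenabled and restorecon are installed.
    # selinuxenabled exits 0 only when SELinux is actually enabled,
    # so this block is skipped on systems without it.
    if command -v selinuxenabled >/dev/null 2>&1 && selinuxenabled; then
        restorecon -v "$script"
    fi
}
```

Run as root, the call would be: secure_cron_script /etc/cron.hourly/raidstatus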

If your array and controller are in fine shape, then this command will output nothing. If you have a dead drive, it will generate content which will cause cron to mail the root user about it. Bingo – time to go to the colo!
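
To see that behaviour without pulling a drive, here is a small sketch that runs the same pattern over canned output instead of the live controller. The raid_filter helper and the variable names are invented for the demo; the sample lines are copied from this post (the failure lines come from the update at the bottom).

```shell
#!/bin/sh
# Same pattern as the raidstatus script; grep -E is the modern spelling of egrep.
PATTERN='(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)'

# Hypothetical helper: filter hpacucli-style text arriving on stdin.
raid_filter() {
    grep -Ei "$PATTERN"
}

# Sample lines from a healthy array.
healthy='      logicaldrive 1 (33.9 GB, RAID 1+0, OK)
      physicaldrive 2:0   (port 2:id 0 , Parallel SCSI, 36.4 GB, OK)
      physicaldrive 2:1   (port 2:id 1 , Parallel SCSI, 36.4 GB, OK)'

# Sample lines from an array with a failed drive and a rebuilding spare.
failed='      physicaldrive 2:2   (port 2:id 2 , Parallel SCSI, ??? GB, Failed)
      physicaldrive 2:5   (port 2:id 5 , Parallel SCSI, 72.8 GB, Rebuilding, active spare)'

# grep exits 1 when nothing matches, hence the "|| true".
healthy_hits=$(printf '%s\n' "$healthy" | raid_filter) || true
failed_hits=$(printf '%s\n' "$failed" | raid_filter) || true

echo "healthy array matches: [$healthy_hits]"
echo "failed array matches:"
echo "$failed_hits"
```

The healthy case leaves the brackets empty, which is exactly why cron stays quiet; any matching line is what turns into mail to root.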

I have seen other people use “ctrl all show status” which generates:

Smart Array XXXXXXX in Slot 0
   Controller Status: OK
   Cache Status: OK
   Battery Status: OK

I prefer to query the config, which looks at individual physical drives in addition to the status of the array. I have seen cases (just last week) where an array with one dead drive still reports its status as OK (because, technically, it is OK; it’s just not optimal, and a major disaster may be pending!)

Update 5/25/2011: Had an error today, needed this code, and corrected a few things. I fixed a missing quote in the raidstatus script above and added a link to the actual utility. For reference, the output from an error looks like this:

physicaldrive 2:2   (port 2:id 2 , Parallel SCSI, ??? GB, Failed)
physicaldrive 2:5   (port 2:id 5 , Parallel SCSI, 72.8 GB, Rebuilding, active spare)
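
If you would rather have the cron mail name the affected drives up front, something along these lines might work. It is only a sketch: summarize_bad_drives is a made-up helper, and it assumes every physicaldrive line has the same shape as the samples in this post (a size in GB followed by the state), which may not hold for other controllers or hpacucli versions.

```shell
#!/bin/sh
# Hypothetical helper: read "ctrl all show config" output on stdin and print
# one short line for every physical drive whose state is not plain OK.
summarize_bad_drives() {
    awk '/physicaldrive/ && !/, OK\)/ {
        id = $2                      # e.g. "2:2"
        state = $0
        sub(/^.*GB, /, "", state)    # drop everything up to and including the size
        sub(/\) *$/, "", state)      # drop the closing parenthesis
        print "drive " id " is " state
    }'
}
```

In the cron script it would sit at the end of the pipe in place of the grep: /opt/compaq/hpacucli/bld/hpacucli ctrl all show config | summarize_bad_drives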
