Just a quick one: it turns out our logwatch was not alerting us clearly enough when a drive failed in our RAID array. Obviously you want to replace a dead drive as quickly as possible, to reduce the likelihood of a second or third drive failing and taking your data with it.
For Linux, HP has a tool called hpacucli for interrogating HP/Compaq array controllers (SmartArray 5i, 6i, whatever) from the command line. I installed this script as /etc/cron.hourly/raidstatus:
#!/bin/sh
/opt/compaq/hpacucli/bld/hpacucli ctrl all show config | egrep -i "(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)"
The command /opt/compaq/hpacucli/bld/hpacucli ctrl all show config normally generates something like this (from our development database server):
Smart Array XXXXXXX in Slot 0 ()
array A (Parallel SCSI, Unused Space: 0 MB)
logicaldrive 1 (33.9 GB, RAID 1+0, OK)
physicaldrive 2:0 (port 2:id 0 , Parallel SCSI, 36.4 GB, OK)
physicaldrive 2:1 (port 2:id 1 , Parallel SCSI, 36.4 GB, OK)
array B (Parallel SCSI, Unused Space: 0 MB)
logicaldrive 2 (67.8 GB, RAID 1+0, OK)
physicaldrive 2:2 (port 2:id 2 , Parallel SCSI, 36.4 GB, OK)
physicaldrive 2:3 (port 2:id 3 , Parallel SCSI, 36.4 GB, OK)
physicaldrive 2:4 (port 2:id 4 , Parallel SCSI, 36.4 GB, OK)
physicaldrive 2:5 (port 2:id 5 , Parallel SCSI, 36.4 GB, OK)
I believe you can reduce the grep to just “(fail|nok)” but I’m taking the conservative approach here. Change the script’s permissions to 0700 and, if you have SELinux running, make sure the file’s security context is set properly.
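The permission and SELinux steps might look something like the sketch below. The function name `secure_cron_script` is made up for illustration. `chmod`, `selinuxenabled` and `restorecon` are standard tools, but whether `restorecon` produces the right context depends on your policy, so treat that part as an assumption and check the result with `ls -Z`.

```shell
#!/bin/sh
# Hypothetical helper: lock down a cron script and reset its SELinux label.
# Usage: secure_cron_script /path/to/raidstatus
secure_cron_script() {
    # Owner-only read/write/execute (0700), as described in the text.
    chmod 0700 "$1" || return 1

    # Only touch SELinux labels if the tools exist and SELinux is enabled.
    if command -v selinuxenabled >/dev/null 2>&1 && selinuxenabled; then
        # restorecon resets the file to the policy's default context for its path.
        restorecon -v "$1" || echo "restorecon failed for $1" >&2
    fi
    return 0
}
```

Call it as root with the path where you installed the script.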
If your array and controller are in fine shape, the script will output nothing. If you have a dead drive, it will print the matching lines, which will cause cron to mail the root user about it. Bingo: time to go to the colo!
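To see the "silent when healthy, noisy when broken" behaviour without a real controller, here is a small sketch that runs the same pattern over canned text. The function name `raid_problems` and both samples are made up for illustration, and the `Failed` status string is an assumption about what hpacucli prints for a dead drive.

```shell
#!/bin/sh
# Same pattern as the cron script, wrapped in a function so it can be fed canned text.
raid_problems() {
    grep -E -i '(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)'
}

# Healthy sample, taken from the output shown earlier.
healthy_sample='logicaldrive 1 (33.9 GB, RAID 1+0, OK)
physicaldrive 2:0 (port 2:id 0 , Parallel SCSI, 36.4 GB, OK)
physicaldrive 2:1 (port 2:id 1 , Parallel SCSI, 36.4 GB, OK)'

# Hypothetical sample: the "Failed" status string is an assumption.
failed_sample='physicaldrive 2:0 (port 2:id 0 , Parallel SCSI, 36.4 GB, OK)
physicaldrive 2:1 (port 2:id 1 , Parallel SCSI, 36.4 GB, Failed)'

# grep exits non-zero when nothing matches, so the echo only runs for the healthy sample.
printf '%s\n' "$healthy_sample" | raid_problems || echo "healthy sample: nothing matched"

# Prints only the physicaldrive 2:1 line; under cron, that line would be mailed.
printf '%s\n' "$failed_sample" | raid_problems
```

Note that the pattern is case-insensitive, and that “nok” does not match a plain “OK” because it needs the leading “n”.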
I have seen other people use “ctrl all show status” which generates:
Smart Array XXXXXXX in Slot 0
Controller Status: OK
Cache Status: OK
Battery Status: OK
I prefer to query the config, which shows the state of each physical drive in addition to the status of the array. I have seen cases (just last week) where an array with one dead drive still reported its status as OK. Technically it is OK; it’s just not optimal, and a major disaster may be pending!
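If you want to watch the status output as well as the config, one possible approach is to report any “Status:” line whose value is not OK. This is only a sketch: the function name `status_problems` is made up, and the `Failed` value in the second sample is an assumption rather than captured output.

```shell
#!/bin/sh
# Print every "... Status: X" line where X is anything other than OK.
status_problems() {
    awk '/Status:/ {
        value = $0
        sub(/^.*Status:[ \t]*/, "", value)   # keep only the text after "Status:"
        sub(/[ \t\r]+$/, "", value)          # drop trailing whitespace
        if (value != "OK") print             # print the original line unchanged
    }'
}

# Sample taken from the "ctrl all show status" output shown above.
status_ok_sample='Smart Array XXXXXXX in Slot 0
Controller Status: OK
Cache Status: OK
Battery Status: OK'

# Hypothetical sample: the "Failed" value is an assumption.
status_bad_sample='Smart Array XXXXXXX in Slot 0
Controller Status: OK
Cache Status: OK
Battery Status: Failed'

printf '%s\n' "$status_ok_sample" | status_problems    # prints nothing
printf '%s\n' "$status_bad_sample" | status_problems   # prints the Battery Status line
```

This complements the config check rather than replacing it, since the status output says nothing about individual physical drives.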