Question : Problem: How do I prevent apparently-spurious fan-redundancy alerts in Gateway 9510 Server (Intel SC5300)?

Ever since I installed a high-availability kit - Intel Management Module, Professional Edition (IMM) and secondary hot-swap power supply and case fans - in my Gateway 9510 Server (aka Intel SC5300 with motherboard SE7520AF2), I've been seeing occasional (sometimes frequent) transient (and apparently-spurious) fan-redundancy alerts.  There are typically five alerts (plus deasserted echoes of the first two) within ca. 10 seconds, and the cycle repeats more or less randomly, with a periodicity of minutes to hours.  The most-recent set from Gateway Systems Manager Reports | Hardware Events is ...

Date | Generator | Sensor type | Details
09/03/2007 07:51:45 PM | IPMB Slave Address | Fan | Fan Sys Fan 4: Lower Non-critical - going low
09/03/2007 07:51:45 PM | IPMB Slave Address | Fan | Fan Sys Fan 4: Lower Critical - going low
09/03/2007 07:51:47 PM | IPMB Slave Address | Fan | Fan Fan Redundancy: Redundancy Lost
09/03/2007 07:51:47 PM | IPMB Slave Address | Fan | Fan Fan Redundancy: Non-redundant - unit is functioning with minimum resources needed for normal operation
09/03/2007 07:51:56 PM | IPMB Slave Address | Fan | Fan Fan Redundancy: Fully Redundant

The machine has two 3.6 GHz Xeon CPUs with 4 GB RAM and several SCSI HDs (boot, temp, I/O and data RAID 10 on a MegaRAID card).  Although both Xeons are Intel SKU BX80546KG3600, one is sSpec SL7ZJ (stepping N0) and the other is sSpec SL8P3 (stepping R0).  The OS is Windows SBS 2003 SP2.  I've run (and rerun) the Gateway 9510/E-9510T Server Platform Firmware Update CD Version 2.4, and have installed: (1) IMM firmware version 0.51; (2) HSC 6-Drive SCSI hot swap backplane version 1.13; (3) BIOS version P06.00.0108; and (4) FRU/SDR version 6.7.4.

I suspect that this issue (plus a CPU stepping-difference warning that I'm ignoring) may have been resolved in BIOS version P.10.00.0109 (February 8th, 2006).  Unfortunately, Gateway has not certified any BIOS updates after P06.00.0108 (having discontinued the machine), and installing such would void my warranty.

The machine is still under warranty, and Gateway tech support has suggested replacing the motherboard.  Maybe I'm cynical, but that seems like a typical tech-support SWAG.  They've also told me that this issue won't harm the machine.

What would be the best course?  Thank you.

Answer : Problem: How do I prevent apparently-spurious fan-redundancy alerts in Gateway 9510 Server (Intel SC5300)?

First, I don't think your processor stepping difference is a component of this problem - the minor bug fixes and changes in the masks that are associated with stepping changes are unlikely to affect the performance of fan monitoring software.

This problem looks like an interaction between a temperature-controlled fan and the setpoint for the fan warning.

I'm speculating regarding your system, but many systems have thermally controlled fans cooling the processors - when the processors run at minimum load their temperature drops and the fan controller slows the fan down.  When a fan slows below a set threshold, the monitoring software detects this low-speed condition and issues an alert.

The fix is usually to lower the minimum speed threshold for the warning condition - from, say, 600 RPM to 400 RPM or lower (but not, of course, to zero), or to raise the minimum speed setpoint of the fan speed controller.

If you have access to either of these settings you might want to try adjusting them so they are not overlapping.
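As a rough illustration of what "overlapping" means here, the check below compares the lowest speed the fan controller will command against the low-speed alert threshold. This is only a sketch: the function name, the 100 RPM safety margin and the 650 RPM controller minimum are hypothetical, and the 600 and 400 RPM thresholds are just the example figures mentioned above, not values read from this server.

```python
def thresholds_overlap(controller_min_rpm, alert_threshold_rpm, margin_rpm=100):
    """Return True if the fan's lowest commanded speed can trip the low-speed alert.

    margin_rpm is a hypothetical allowance for measurement jitter and
    fan-to-fan variation; pick a value that suits your hardware.
    """
    return controller_min_rpm - margin_rpm <= alert_threshold_rpm


# Hypothetical controller minimum of 650 RPM against the two example thresholds.
print(thresholds_overlap(650, 600))  # True: 650 RPM is too close to a 600 RPM alert
print(thresholds_overlap(650, 400))  # False: a 400 RPM alert leaves headroom
```

If the check reports an overlap, either lowering the alert threshold or raising the controller minimum should open up the gap.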

You might also monitor and record temperatures and fan speeds and compare the monitored values with the warnings.  There are many programs out there to do this; my favorite is Everest Ultimate from Lavalys (http://www.lavalys.com/products/overview.php?pid=3&ps=UE&lang=en).
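Once you have a log, the comparison can be as simple as pulling out the fan-speed samples recorded in the few seconds before each alert. The sketch below assumes you have exported the log as (timestamp, RPM) pairs; the sample readings are invented for illustration, and only the alert timestamp is taken from the event list in the question.

```python
from datetime import datetime, timedelta

# Timestamp format used in the Hardware Events report quoted in the question.
FMT = "%m/%d/%Y %I:%M:%S %p"


def samples_near_alert(samples, alert_time, window_s=10):
    """Return the (time, rpm) samples logged in the window_s seconds before an alert."""
    earliest = alert_time - timedelta(seconds=window_s)
    return [(t, rpm) for t, rpm in samples if earliest <= t <= alert_time]


alert = datetime.strptime("09/03/2007 07:51:45 PM", FMT)

# Hypothetical logged readings; real values would come from your monitoring tool.
samples = [
    (datetime.strptime("09/03/2007 07:51:30 PM", FMT), 1500),
    (datetime.strptime("09/03/2007 07:51:40 PM", FMT), 620),
    (datetime.strptime("09/03/2007 07:51:44 PM", FMT), 580),
]

for t, rpm in samples_near_alert(samples, alert):
    print(t.strftime(FMT), rpm)
```

If the logged speed really does dip just before each alert, the thresholds are the likely culprit; if the log shows a steady speed while alerts fire, suspect the measurement itself.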

There is also a more subtle problem in monitoring fan speeds, as many of the system monitoring routines have a relatively low priority and can be starved for resources when systems are busy with higher priority tasks.  As a result the monitoring routines do not run as often as they should (or are expected to) and the total number of fan pulses counted in a time period does not correspond to the actual number of pulses.  This type of problem, depending on where the routines are located (in the BIOS or device drivers) can be more difficult to fix.
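The arithmetic behind this is worth spelling out. A tachometer reading is just pulses counted over a time window, so any pulses the routine fails to count translate directly into a lower reported speed. The sketch below assumes two tach pulses per revolution (common for PC fans, but an assumption for this server), a one-second counting window, and a hypothetical 60% of pulses missed while the system is busy.

```python
def pulses_emitted(true_rpm, window_s=1.0, pulses_per_rev=2):
    """Number of tach pulses a fan spinning at true_rpm emits in window_s seconds."""
    return true_rpm * pulses_per_rev * window_s / 60.0


def computed_rpm(pulses_counted, window_s=1.0, pulses_per_rev=2):
    """Speed the monitor reports from the pulses it actually counted."""
    return pulses_counted * 60.0 / (pulses_per_rev * window_s)


true_pulses = pulses_emitted(1200)          # fan really turning at 1200 RPM
healthy = computed_rpm(true_pulses)         # every pulse counted
starved = computed_rpm(true_pulses * 0.4)   # hypothetical: only 40% of pulses counted

print(healthy, starved)
```

With these made-up numbers a healthy 1200 RPM fan would be reported at well under the 600 RPM example threshold, which is enough to raise a low-speed alert even though nothing is wrong with the fan.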

As the support staff are pushing you toward a board swap, I would be inclined to think you might have the latter problem, but your sense that tech support is not being fully honest with you might be on target, too.

They are correct in saying it is not going to hurt your system, but it DOES hurt your ability to monitor your system - you are conditioning yourself against warnings that, someday, will be indicators of a real failure requiring your intervention: fans are the least reliable part of a system, and fan problems are the primary reason for system failures.

I would take them up on the replacement board offer - and hold their feet to the fire if it doesn't fix the problem.  On the other hand, your system is working now, and a board swap will, at minimum, result in some downtime.  At worst, you might be trading one problem for a new one - they always put new bugs in when they fix the known ones...

wb