dell servers

I have adopted 4 Dell PowerEdge servers from RINIS. This has set me back 350 euro so far, because they had to be completed with more memory.

However, this chapter is mostly about silencing the noisy fans inside the chassis. There are 12 in total: 6 blocks of two fans each.

I will try to describe the best way to keep noise down for these monsters.

tools

Some useful tools will be described next

  • ipmitool for setting and reading fan speeds
  • stress to stress test the CPUs
  • sensors to read core temperatures

idrac

iDRAC is mentioned a lot: it stands for integrated Dell Remote Access Controller.

ipmitool

IPMI (Intelligent Platform Management Interface) is a standardized message-based hardware management interface. At the core of IPMI is a hardware chip known as the Baseboard Management Controller (BMC), or Management Controller (MC).

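get the speeds of all fans
ipmitool sensor reading "Fan1A" "Fan1B" "Fan2A" "Fan2B" "Fan3A" "Fan3B" "Fan4A" "Fan4B" "Fan5A" "Fan5B" "Fan6A" "Fan6B"

This gives the RPM of each fan. The run below is at the lowest speed we can reach via the BIOS. It was captured with an earlier form of the command, which also asked for a sensor named "temp" (not found on this machine) and listed "Fan4B" twice instead of "Fan3B"; Fan6A and Fan6B returned no reading: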

ipmitool sensor reading "temp" "Fan1A" "Fan1B" "Fan2A" "Fan2B" "Fan3A" "Fan4B" "Fan4A" "Fan4B" "Fan5A" "Fan5B" "Fan6A" "Fan6B"
IANA PEN registry open failed: No such file or directory
Sensor "temp" not found!
Fan1A            | 2640
Fan1B            | 2520
Fan2A            | 7560
Fan2B            | 7200
Fan3A            | 6600
Fan4B            | 6360
Fan4A            | 6600
Fan4B            | 6360
Fan5A            | 6600
Fan5B            | 6360

get all info on all fans
ipmitool sdr get "Fan1A" "Fan1B" "Fan2A" "Fan2B" "Fan3A" "Fan3B" "Fan4A" "Fan4B" "Fan5A" "Fan5B" "Fan6A" "Fan6B"

For just fan 6B you get this information

Sensor ID              : Fan6B (0x3b)
 Entity ID             : 7.1 (System Board)
 Sensor Type (Threshold)  : Fan (0x04)
 Sensor Reading        : Disabled
 Status                : Not Available
 Nominal Reading       : 6720.000
 Normal Minimum        : 16680.000
 Normal Maximum        : 23640.000
 Lower critical        : 720.000
 Lower non-critical    : 840.000
 Positive Hysteresis   : 120.000
 Negative Hysteresis   : 120.000
 Minimum sensor range  : Unspecified
 Maximum sensor range  : Unspecified
 Event Message Control : Per-threshold
 Readable Thresholds   : lcr lnc 
 Settable Thresholds   : 
 Threshold Read Mask   : lcr lnc 
 Assertion Events      : 
 Assertions Enabled    : lnc- lcr- 
 Deassertions Enabled  : lnc- lcr- 
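
The fan names passed to both commands follow a fixed pattern (Fan1A up to Fan6B), so a small loop can build the list instead of typing it out, which avoids slips in the long argument list. A minimal sketch: fan_names is a helper name made up here, and the command is only echoed so it can be tried on a machine without a BMC.

```shell
# Build the list of fan sensor names: Fan1A Fan1B ... Fan6A Fan6B.
fan_names() {
    for block in 1 2 3 4 5 6; do
        for side in A B; do
            printf 'Fan%s%s\n' "$block" "$side"
        done
    done
}

# Print the command instead of running it.
# Remove the leading "echo" to really query the sensors.
echo ipmitool sensor reading $(fan_names)
```

The same list works for ipmitool sdr get.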

First we need to enable the manual control to alter fan speeds.

set manual control over the fan speeds
ipmitool raw 0x30 0x30 0x01 0x00 # enable manual
ipmitool raw 0x30 0x30 0x01 0x01 # disable manual again

Note that this setting survives a reboot! Source: https://docs.oracle.com/cd/E19469-01/820-6413-13/IPMI_Overview.html

Then we can set the speed with the next command. You can hear right away whether it works. The last parameter is the percentage in hexadecimal, so I mostly use multiples of 16 (0x10, 0x20, ...) for convenience. But for now let's set it to 20%.

set rpms on the fans
ipmitool raw 0x30 0x30 0x02 0xff 0x14

The last number is the percentage in hex (0x14 means 20%), so you could also try lower, or even 0!

So here are some extremes: 0%, 50% and 100%.
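
Converting the percentage to hex by hand gets old quickly, so here is a small sketch that does it. The function names pct_to_hex and fan_cmd are made up for this example, and fan_cmd only prints the command so nothing is sent to the BMC.

```shell
# Convert a decimal fan percentage (0-100) to the hex byte ipmitool expects.
pct_to_hex() {
    pct=$1
    case "$pct" in
        ''|*[!0-9]*) echo "not a number: $pct" >&2; return 1 ;;
    esac
    if [ "$pct" -gt 100 ]; then
        echo "percentage out of range: $pct" >&2
        return 1
    fi
    printf '0x%02x\n' "$pct"
}

# Print (do not run) the raw command for a given percentage.
fan_cmd() {
    hex=$(pct_to_hex "$1") || return 1
    echo "ipmitool raw 0x30 0x30 0x02 0xff $hex"
}

fan_cmd 20   # prints: ipmitool raw 0x30 0x30 0x02 0xff 0x14
```

To really set the speed, run the printed line, or use eval "$(fan_cmd 20)" once manual control is enabled.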

ipmitool raw 0x30 0x30 0x02 0xff 0x00
ipmitool raw 0x30 0x30 0x02 0xff 0x32
ipmitool raw 0x30 0x30 0x02 0xff 0x64

Never go above 40%: it is terribly noisy and it never gets that hot!

Source: https://gist.github.com/slykar/f90ad596b18d5ab1eb1c66b2ccf51c54#set-fan-speed-to-30-

100%

This really is all hell breaking loose, and it is totally unnecessary, as we will see later, so we will never go here again. Actually, 50% is already too high for any purpose. To set some boundaries: if we load all CPUs to 100% we get these results.

crank up all cores
apt-get install stress
stress --cpu 20

Note that compiling the klopt code and generating networks is about as intensive as this stress test!

compiling at 0% fanspeed
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +92.0°C  (high = +82.0°C, crit = +92.0°C)
Core 0:        +89.0°C  (high = +82.0°C, crit = +92.0°C)
Core 1:        +89.0°C  (high = +82.0°C, crit = +92.0°C)
Core 2:        +88.0°C  (high = +82.0°C, crit = +92.0°C)
Core 3:        +89.0°C  (high = +82.0°C, crit = +92.0°C)
Core 4:        +88.0°C  (high = +82.0°C, crit = +92.0°C)
Core 8:        +88.0°C  (high = +82.0°C, crit = +92.0°C)
Core 9:        +88.0°C  (high = +82.0°C, crit = +92.0°C)
Core 10:       +88.0°C  (high = +82.0°C, crit = +92.0°C)
Core 11:       +88.0°C  (high = +82.0°C, crit = +92.0°C)
Core 12:       +87.0°C  (high = +82.0°C, crit = +92.0°C)

Note that we see only half of the 20 cores. This probably has to do with Hyper-Threading: the temperature sensors are per physical core, while with HT enabled the OS sees twice as many cores. You can check this with cat /proc/cpuinfo.

But you get the general idea: the other cores won't be much hotter.
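
To check the core counts, the /proc/cpuinfo output can be summarized with a few lines of awk. This is a sketch that assumes the usual x86 layout with "processor", "physical id" and "core id" fields; count_cores is a name made up here.

```shell
# Count logical CPUs and distinct physical cores in /proc/cpuinfo-style text on stdin.
count_cores() {
    awk -F': *' '
        /^processor/   { logical++ }                 # one entry per logical CPU
        /^physical id/ { pkg = $2 }                  # socket of the current entry
        /^core id/     { seen[pkg ":" $2] = 1 }      # unique (socket, core) pairs
        END {
            physical = 0
            for (k in seen) physical++
            printf "logical=%d physical=%d\n", logical, physical
        }'
}

if [ -r /proc/cpuinfo ]; then
    count_cores < /proc/cpuinfo
fi
```

If logical is twice physical, Hyper-Threading is on and the sensors output lists only the physical cores.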

measurements

When the machine is idle we can almost disable the fans. The setting can't go lower than 0%, and even at 0% the fans keep turning, at about these speeds:

Fan1A            | 4800
Fan1B            | 4440
Fan2A            | 4800
Fan2B            | 4440
Fan3A            | 4920
Fan4B            | 4440
Fan4A            | 4920
Fan4B            | 4440
Fan5A            | 4800
Fan5B            | 4440

Dell states that cores handle excess heat by either shutting down or throttling. Of course we want the latter, and this also seems to be the default when testing.

0% work

With no work at all, the fan speeds give these temperatures:

  • fanspeed 0%, low use: 38 degrees
  • fanspeed 16%, low use: 28 degrees
  • fanspeed 32%, low use: 25 degrees

38 degrees is perfectly fine, so at rest just set the speed to 0%. Under higher stress we can manage as well:

100% work

With the stress command on all cores

stress --cpu 20

We can also monitor the clock speed of one processor (cpu0 here, reported in kHz) with:

watch cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

  • fanspeed 0%: seems to throttle at exactly 92 degrees (see below)
  • fanspeed 10%: around 90 degrees, freq 2397409 (light throttling)
  • fanspeed 20%: goes down to 70 degrees, freq around 2597202
  • fanspeed 30%: about 57 degrees (54-59)
  • fanspeed 40%: goes down to 50 degrees, freq stays at 2597196

The throttling seems to work! With fanspeed 0% the temperature rises to about 91 degrees and the clock speed keeps going down, to below 2000000: 1998230 -> 1694683 ...
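
The values in scaling_cur_freq are in kHz, which is not the easiest unit to read while watching for throttling. A tiny helper (the name khz_to_ghz is made up here) converts them to GHz:

```shell
# Convert a cpufreq value in kHz to GHz with two decimals.
khz_to_ghz() {
    awk -v khz="$1" 'BEGIN { printf "%.2f GHz\n", khz / 1000000 }'
}

khz_to_ghz 2597196   # prints: 2.60 GHz
khz_to_ghz 1694683   # prints: 1.69 GHz
```

So the unthrottled speed is about 2.6 GHz, and at 0% fan speed it had already dropped to about 1.7 GHz.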

conclusion

Use fan speed 0% for normal operation, compilation and data generation. Use fan speed 20% for hard work; 10% goes up to 90 degrees and starts throttling.

Best is just to keep the server out of the office and use 20-30% for everything.

Setting 0% is not dangerous, because the CPU throttles, but it is not wise either: the processor will last longer when it does not get too hot, and some fan speed is needed anyway while running.
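
These rules could be turned into a small policy function. This is only a sketch: the name pick_fan_pct and the thresholds of 60 and 80 degrees are assumptions chosen for illustration, not measured limits, and the script only prints the command it would run.

```shell
# Pick a fan percentage from a CPU package temperature in whole degrees Celsius.
# The thresholds are illustrative assumptions, not tested limits.
pick_fan_pct() {
    temp=$1
    if   [ "$temp" -ge 80 ]; then echo 30
    elif [ "$temp" -ge 60 ]; then echo 20
    else                          echo 0
    fi
}

# Example: a package temperature of 91 degrees, as seen under full load at 0%.
pct=$(pick_fan_pct 91)
printf 'ipmitool raw 0x30 0x30 0x02 0xff 0x%02x\n' "$pct"
# prints: ipmitool raw 0x30 0x30 0x02 0xff 0x1e
```

The temperature would have to come from sensors, and manual control has to be enabled before the printed command has any effect.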

faulty RAM

The dell4 system was unstable and reported a memory issue. The advice was to check the SEL (System Event Log). When you see this message at boot, press F1 to boot anyway.

Then, to check which DIMM failed, you can run this command as root:

ipmitool sel list

The outcome was :

   1 | 10/27/2016 | 04:17:12 PM CEST | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 12/21/2016 | 04:54:04 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
   3 | 12/21/2016 | 04:54:14 PM CET | Power Supply #0x63 | Power Supply AC lost | Asserted
   4 | 12/21/2016 | 04:54:14 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
   5 | 12/21/2016 | 04:54:19 PM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
   6 | 12/21/2016 | 04:54:52 PM CET | Fan #0x37 | Lower Non-critical going low  | Asserted
   7 | 12/21/2016 | 04:54:52 PM CET | Fan #0x37 | Lower Critical going low  | Asserted
   8 | 12/21/2016 | 04:54:53 PM CET | Fan #0x75 | Redundancy Lost | Asserted
   9 | 12/21/2016 | 04:56:25 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
   a | 12/21/2016 | 04:57:04 PM CET | Cable / Interconnect #0x7c | Config Error | Asserted
   b | 12/21/2016 | 05:00:05 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
   c | 12/21/2016 | 05:00:10 PM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
   d | 12/21/2016 | 05:00:10 PM CET | Power Supply #0x63 | Power Supply AC lost | Asserted
   e | 12/21/2016 | 05:00:15 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
   f | 04/17/2035 | 01:41:23 AM CEST | Battery #0x65 | Failed | Asserted
  10 | 04/17/2035 | 01:41:27 AM CEST | Physical Security #0x73 | General Chassis intrusion () | Asserted
  11 | 04/17/2035 | 01:41:37 AM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
  12 | 04/17/2035 | 01:41:38 AM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
  13 | 04/17/2035 | 01:41:43 AM CEST | Physical Security #0x73 | General Chassis intrusion () | Deasserted
  14 | 12/31/1999 | 07:00:38 PM CET | Battery #0x65 | Failed | Asserted
  15 | 12/31/1999 | 07:00:43 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
  16 | 12/31/1999 | 07:00:49 PM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
  17 | 12/31/1999 | 07:00:59 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
  18 | 12/31/1999 | 07:00:59 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
  19 | 01/01/2001 | 01:02:24 AM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
  1a | 01/01/2001 | 01:03:15 AM CET | Battery #0x65 | Failed | Deasserted
  1b | 01/01/2001 | 05:11:29 PM CET | Cable / Interconnect #0x7c | Config Error | Asserted
  1c | 11/02/2023 | 08:39:41 AM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
  1d | 11/02/2023 | 08:39:46 AM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
  1e | 11/02/2023 | 08:39:46 AM CET | Power Supply #0x74 | Fully Redundant | Asserted
  1f | 11/02/2023 | 08:39:51 AM CET | Power Supply #0x74 | Redundancy Lost | Asserted
  20 | 11/02/2023 | 08:39:52 AM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
  21 | 11/19/2023 | 03:23:54 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
  22 | 11/19/2023 | 03:23:59 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
  23 | 03/13/2024 | 05:19:26 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
  24 | 03/13/2024 | 05:19:31 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
  25 | 03/24/2024 | 10:07:02 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
  26 | 03/24/2024 | 10:07:07 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
  27 | 06/02/2024 | 09:48:04 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
  28 | 06/02/2024 | 09:48:05 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
  29 | 06/02/2024 | 10:25:55 PM CEST | Unknown #0x2e |  | Asserted
  2a | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  2b | 06/02/2024 | 10:25:55 PM CEST | Unknown #0x2e |  | Asserted
  2c | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  2d | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  2e | 06/04/2024 | 08:50:26 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
  2f | 06/04/2024 | 08:50:26 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
  30 | 06/04/2024 | 09:05:28 PM CEST | Unknown #0x2e |  | Asserted
  31 | 06/04/2024 | 09:05:28 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  32 | 06/04/2024 | 10:39:36 PM CEST | Unknown #0x2e |  | Asserted
  33 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  34 | 06/04/2024 | 10:39:36 PM CEST | Unknown #0x2e |  | Asserted
  35 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  36 | 06/04/2024 | 10:39:36 PM CEST | Unknown #0x2e |  | Asserted
  37 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  38 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  39 | 09/02/2024 | 02:30:54 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
  3a | 09/02/2024 | 02:31:00 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
  3b | 11/25/2024 | 06:23:21 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
  3c | 11/25/2024 | 06:23:21 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
  3d | 11/25/2024 | 07:46:42 PM CET | Unknown #0x2e |  | Asserted
  3e | 11/25/2024 | 07:46:42 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  3f | 11/26/2024 | 04:48:43 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  40 | 11/26/2024 | 04:50:18 PM CET | Unknown #0x2e |  | Asserted
  41 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  42 | 11/26/2024 | 04:50:18 PM CET | Unknown #0x2e |  | Asserted
  43 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  44 | 11/26/2024 | 04:50:18 PM CET | Unknown #0x2e |  | Asserted
  45 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  46 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  47 | 03/14/2025 | 03:01:57 PM CET | Drive Slot / Bay #0xa5 | Drive Present () | Deasserted
  48 | 03/14/2025 | 03:02:12 PM CET | Drive Slot / Bay #0xa5 | Drive Present () | Asserted
  49 | 04/06/2025 | 02:07:36 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
  4a | 04/06/2025 | 02:07:36 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
  4b | 04/06/2025 | 02:08:04 PM CEST | Unknown #0x2e |  | Asserted
  4c | 04/06/2025 | 02:08:04 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  4d | 12/13/2025 | 11:46:05 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  4e | 12/13/2025 | 11:46:35 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  4f | 12/13/2025 | 11:48:20 PM CET | Unknown #0x2e |  | Asserted
  50 | 12/13/2025 | 11:48:20 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  51 | 01/03/2026 | 06:09:43 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  52 | 01/05/2026 | 06:52:57 PM CET | Unknown #0x2e |  | Asserted
  53 | 01/05/2026 | 06:52:57 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted

So this DIMM has been failing for a long time (the first ECC error is from June 2024), and it is always DIMMA4.

To fix this, open up the case and remove the DIMM. The markings on the casing should hint at where slot A4 is.
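
With a log this long it is easier to let the shell do the counting. A minimal sketch, assuming the SEL lines look like the ones above; ecc_per_dimm is a name made up here, and the sample input is three lines copied from the log.

```shell
# Count "Uncorrectable ECC" entries per DIMM, reading `ipmitool sel list` output on stdin.
ecc_per_dimm() {
    grep 'Uncorrectable ECC' \
        | grep -o 'DIMM[A-Z]*[0-9]*' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Try it on a few lines from the log above.
ecc_per_dimm <<'EOF'
  2a | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
  2f | 06/04/2024 | 08:50:26 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
  3e | 11/25/2024 | 07:46:42 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA4) | Asserted
EOF
# prints: DIMMA4 2
```

On the real machine this would be ipmitool sel list | ecc_per_dimm, run as root.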