dell servers
I have adopted 4 Dell PowerEdge servers from RINIS. This has set me back 350 euro so far, because they had to be completed with more memory.
However, this chapter is mostly about silencing the noisy fans inside the chassis. There are 12 in total: 6 blocks with two fans each (A and B).
I will try to describe the best way to keep the noise down for these monsters.
tools
The following tools are useful:
- ipmitool for setting and reading fan speeds
- stress to stress test the CPUs
- sensors to read core temperatures
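Before starting it can save time to check that all three tools are actually installed. A minimal sketch; the function name missing_tools is made up for this example, and on most distributions the sensors command comes from the lm-sensors package:

```shell
# missing_tools: print the name of every tool in the argument list
# that cannot be found on the PATH. Prints nothing when all are present.
missing_tools() {
    for t in "$@"; do
        command -v "$t" >/dev/null 2>&1 || echo "$t"
    done
}
```

Calling it as missing_tools ipmitool stress sensors should print nothing on a machine that is ready to go.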
idrac
iDRAC is mentioned a lot; it stands for Integrated Dell Remote Access Controller.
ipmitool
The Intelligent Platform Management Interface (IPMI) is a standardized message-based hardware management interface. At the core of IPMI is a hardware chip known as the Baseboard Management Controller (BMC), or Management Controller (MC).
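ipmitool sensor reading "Fan1A" "Fan1B" "Fan2A" "Fan2B" "Fan3A" "Fan3B" "Fan4A" "Fan4B" "Fan5A" "Fan5B" "Fan6A" "Fan6B"
This gives the rpm for each fan. The capture below was taken at the lowest setting we can reach via the BIOS. Note that the command as it was actually run also asked for a "temp" sensor, which does not exist, and listed "Fan4B" twice instead of "Fan3B", so the output shows an error for "temp", no Fan3B, and Fan4B twice: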
ipmitool sensor reading "temp" "Fan1A" "Fan1B" "Fan2A" "Fan2B" "Fan3A" "Fan4B" "Fan4A" "Fan4B" "Fan5A" "Fan5B" "Fan6A" "Fan6B"
IANA PEN registry open failed: No such file or directory
Sensor "temp" not found!
Fan1A | 2640
Fan1B | 2520
Fan2A | 7560
Fan2B | 7200
Fan3A | 6600
Fan4B | 6360
Fan4A | 6600
Fan4B | 6360
Fan5A | 6600
Fan5B | 6360
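To compare settings it helps to boil a capture like the one above down to a few numbers. A minimal sketch, assuming the "name | rpm" layout shown above; the function name fan_summary is made up for this example:

```shell
# fan_summary: read "FanXY | rpm" lines on stdin and print the number of
# fans seen plus the minimum, maximum and average rpm.
# Lines that do not look like a fan reading (warnings, errors) are skipped.
fan_summary() {
    awk -F'|' '
        $1 ~ /^ *Fan[0-9]+[AB] *$/ && $2 ~ /^ *[0-9]+ *$/ {
            rpm = $2 + 0
            if (n == 0 || rpm < min) min = rpm
            if (n == 0 || rpm > max) max = rpm
            sum += rpm
            n++
        }
        END {
            if (n == 0) { print "no fan readings found"; exit 1 }
            printf "fans=%d min=%d max=%d avg=%d\n", n, min, max, sum / n
        }'
}
```

It is meant to be used at the end of a pipe, for example ipmitool sensor reading "Fan1A" "Fan1B" | fan_summary.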
ipmitool sdr get "Fan1A" "Fan1B" "Fan2A" "Fan2B" "Fan3A" "Fan3B" "Fan4A" "Fan4B" "Fan5A" "Fan5B" "Fan6A" "Fan6B"
For fan 6B alone you get this information:
Sensor ID : Fan6B (0x3b)
Entity ID : 7.1 (System Board)
Sensor Type (Threshold) : Fan (0x04)
Sensor Reading : Disabled
Status : Not Available
Nominal Reading : 6720.000
Normal Minimum : 16680.000
Normal Maximum : 23640.000
Lower critical : 720.000
Lower non-critical : 840.000
Positive Hysteresis : 120.000
Negative Hysteresis : 120.000
Minimum sensor range : Unspecified
Maximum sensor range : Unspecified
Event Message Control : Per-threshold
Readable Thresholds : lcr lnc
Settable Thresholds :
Threshold Read Mask : lcr lnc
Assertion Events :
Assertions Enabled : lnc- lcr-
Deassertions Enabled : lnc- lcr-
First we need to enable the manual control to alter fan speeds.
ipmitool raw 0x30 0x30 0x01 0x00 # enable manual
ipmitool raw 0x30 0x30 0x01 0x01 # disable manual again
Note that these settings survive a reboot! Source: https://docs.oracle.com/cd/E19469-01/820-6413-13/IPMI_Overview.html
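Since the two raw commands differ only in the last byte, they are easy to mix up. A small wrapper can give them names. This is only a sketch: fan_mode and the DRY_RUN variable are made up for this example, and the raw bytes are the ones listed above.

```shell
# fan_mode: switch between manual and automatic fan control.
# With DRY_RUN=1 the command is printed instead of executed.
fan_mode() {
    case "$1" in
        manual) flag=0x00 ;;   # enable manual control
        auto)   flag=0x01 ;;   # disable manual control again
        *) echo "usage: fan_mode manual|auto" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "ipmitool raw 0x30 0x30 0x01 $flag"
    else
        ipmitool raw 0x30 0x30 0x01 "$flag"
    fi
}
```

Running it once with DRY_RUN=1 shows what would be sent before anything touches the BMC.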
Then we can set the speeds with the next commands. You can hear right away whether it works or not. The last parameter is the percentage in hex format (0x14 means 20%), so you could also try lower values and even 0. I mostly use multiples of 16 (0x10, 0x20) for convenience.
Here are some extremes: 0%, 50% and 100%.
ipmitool raw 0x30 0x30 0x02 0xff 0x00
ipmitool raw 0x30 0x30 0x02 0xff 0x32
ipmitool raw 0x30 0x30 0x02 0xff 0x64
Never go above 40%: it is terribly noisy and it never gets that hot!
Source: https://gist.github.com/slykar/f90ad596b18d5ab1eb1c66b2ccf51c54#set-fan-speed-to-30-
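Converting percentages to hex by hand is error-prone, so here is a sketch that does it with printf. The helper names pct_to_hex and fan_speed_cmd are made up for this example. The 0xff byte is copied from the commands above; as far as I can tell from the linked gist it addresses all fans at once.

```shell
# pct_to_hex: convert a percentage (0-100, plain decimal) to a hex byte.
pct_to_hex() {
    case "$1" in
        ''|*[!0-9]*) echo "not a number: $1" >&2; return 2 ;;
    esac
    if [ "$1" -gt 100 ]; then
        echo "percentage out of range: $1" >&2
        return 2
    fi
    printf '0x%02x\n' "$1"
}

# fan_speed_cmd: print (not run) the raw command for a given percentage.
fan_speed_cmd() {
    hex=$(pct_to_hex "$1") || return 2
    echo "ipmitool raw 0x30 0x30 0x02 0xff $hex"
}
```

For example, fan_speed_cmd 20 prints the command ending in 0x14. Printing instead of executing keeps the sketch harmless; pipe the result to sh once it looks right.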
100%
This is really like all hell breaks loose. It is also totally unnecessary, as we will see later, so we will never go here again. Actually even 50% is too high for any purpose. To set some boundaries: if we hog all CPUs at 100% load we get these results.
Note that compiling klopt code and generating networks is about as intensive as this stress test !
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +92.0°C (high = +82.0°C, crit = +92.0°C)
Core 0: +89.0°C (high = +82.0°C, crit = +92.0°C)
Core 1: +89.0°C (high = +82.0°C, crit = +92.0°C)
Core 2: +88.0°C (high = +82.0°C, crit = +92.0°C)
Core 3: +89.0°C (high = +82.0°C, crit = +92.0°C)
Core 4: +88.0°C (high = +82.0°C, crit = +92.0°C)
Core 8: +88.0°C (high = +82.0°C, crit = +92.0°C)
Core 9: +88.0°C (high = +82.0°C, crit = +92.0°C)
Core 10: +88.0°C (high = +82.0°C, crit = +92.0°C)
Core 11: +88.0°C (high = +82.0°C, crit = +92.0°C)
Core 12: +87.0°C (high = +82.0°C, crit = +92.0°C)
Note that we see only half of the 20 cores, which probably has something to do with Hyper-Threading. The explanation I found was written for Core i5 CPUs, but the idea likely applies here too: some are manufactured with 2 physical cores, some with 4, and some have HT enabled, which doubles the number of cores the OS sees. sensors lists physical cores only. To check, run cat /proc/cpuinfo.
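To check the Hyper-Threading guess, the processor, physical id and core id fields of /proc/cpuinfo can be counted. A minimal sketch, assuming the usual x86 layout where physical id comes before core id within each processor entry; count_cpus is a made-up name:

```shell
# count_cpus: read /proc/cpuinfo-style text on stdin and print the number of
# logical CPUs and the number of distinct (physical id, core id) pairs.
count_cpus() {
    awk -F':' '
        /^processor/   { logical++ }
        /^physical id/ { pkg = $2 }
        /^core id/     { seen[pkg "/" $2] = 1 }
        END {
            for (k in seen) physical++
            printf "logical=%d physical=%d\n", logical, physical
        }'
}
```

Run it as count_cpus < /proc/cpuinfo. If logical turns out to be twice physical, the missing cores are just the second hardware thread of each core.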
measurements
If the server is not doing anything, we can almost disable the fans. They do not go lower than 0%, and at 0% the fans are still turning, at about these speeds:
Fan1A | 4800
Fan1B | 4440
Fan2A | 4800
Fan2B | 4440
Fan3A | 4920
Fan4B | 4440
Fan4A | 4920
Fan4B | 4440
Fan5A | 4800
Fan5B | 4440
Dell states that cores handle excess heat by either shutting down or throttling. Of course we want the latter, and this also seems to be the default when testing.
0% work
With no work at all, the fan speeds deliver these temperatures:
- fanspeed 0% low use : 38 degrees
- fanspeed 16% low use : 28 degrees
- fanspeed 32% low use : 25 degrees
38 degrees is perfectly fine, so at rest just set the speed to 0%. With higher stress we can manage as well:
100% work
These results were measured with the stress command running on all cores.
We can also monitor the speed of a (single) processor; the freq values below are that reading, in kHz.
- fanspeed 0% : seems to throttle at exactly 92 degrees (see below)
- fanspeed 10% : around 90 degrees, freq 2397409 (light throttling)
- fanspeed 20% : goes down to 70 degrees, freq around 2597202
- fanspeed 30% : about 57 degrees (54-59)
- fanspeed 40% : goes down to 50 degrees, freq stays at 2597196
The throttling seems to work! With fanspeed 0% the temperature rises to about 91 degrees and the frequency keeps going down: below 2000000 -> 1998230 -> 1694683 ...
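The exact commands used for this test are not recorded here, so the following is only a sketch of how such a run could look. It assumes the load came from something like stress --cpu "$(nproc)" --timeout 600 and that the frequency was read from the cpufreq sysfs file of cpu0, which reports kHz. Both function names are made up.

```shell
# khz_to_ghz: convert a cpufreq reading in kHz to GHz with two decimals.
khz_to_ghz() {
    awk -v khz="$1" 'BEGIN { printf "%.2f\n", khz / 1000000 }'
}

# watch_cpu0: print the current frequency of cpu0 every 5 seconds.
# Stop it with Ctrl-C. Assumes the cpufreq sysfs interface exists.
watch_cpu0() {
    f=/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
    while sleep 5; do
        echo "cpu0: $(khz_to_ghz "$(cat "$f")") GHz"
    done
}
```

Start the stress run in one terminal and watch_cpu0 in another; a reading that keeps dropping under load means the CPU is throttling.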
conclusion
Use fan speed 0% for normal operation, compilation, data generation. Use fan speed 20% for hard work; 10% goes up to 90 degrees and starts throttling.
Best is just to keep the server out of the office and use 20-30% for any use.
Setting 0% is not dangerous thanks to throttling, but it is not wise either: the processor will last longer when it does not get too hot, and some speed is needed anyway to get work done.
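The conclusion could be automated by reading the hottest core from sensors and picking a fan percentage from it. A rough sketch only: the thresholds of 60 and 80 degrees are my own guesses based on the measurements above, not tested values, and both function names are made up.

```shell
# max_core_temp: read `sensors` output on stdin and print the hottest
# "Core N" reading as a whole number of degrees.
max_core_temp() {
    awk '/^Core [0-9]+:/ {
             t = $3                  # for example "+89.0°C"
             gsub(/[^0-9.]/, "", t)  # keep digits and the decimal point
             if (t + 0 > max) max = t + 0
         }
         END { printf "%d\n", max }'
}

# pick_fan_pct: map a temperature to a fan percentage.
# The thresholds are assumptions for illustration.
pick_fan_pct() {
    if   [ "$1" -ge 80 ]; then echo 30
    elif [ "$1" -ge 60 ]; then echo 20
    else echo 0
    fi
}
```

Something like pick_fan_pct "$(sensors | max_core_temp)" run from cron every minute would then give the percentage to set, which stays well below the 40% noise limit.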
faulty RAM
The dell4 system was unstable and reported a memory issue. The advice was to check the SEL (System Event Log). When you see this message, press F1 to boot anyway.
Then, to check which DIMM failed, you can run this command as root:
The outcome was :
1 | 10/27/2016 | 04:17:12 PM CEST | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
2 | 12/21/2016 | 04:54:04 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
3 | 12/21/2016 | 04:54:14 PM CET | Power Supply #0x63 | Power Supply AC lost | Asserted
4 | 12/21/2016 | 04:54:14 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
5 | 12/21/2016 | 04:54:19 PM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
6 | 12/21/2016 | 04:54:52 PM CET | Fan #0x37 | Lower Non-critical going low | Asserted
7 | 12/21/2016 | 04:54:52 PM CET | Fan #0x37 | Lower Critical going low | Asserted
8 | 12/21/2016 | 04:54:53 PM CET | Fan #0x75 | Redundancy Lost | Asserted
9 | 12/21/2016 | 04:56:25 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
a | 12/21/2016 | 04:57:04 PM CET | Cable / Interconnect #0x7c | Config Error | Asserted
b | 12/21/2016 | 05:00:05 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
c | 12/21/2016 | 05:00:10 PM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
d | 12/21/2016 | 05:00:10 PM CET | Power Supply #0x63 | Power Supply AC lost | Asserted
e | 12/21/2016 | 05:00:15 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
f | 04/17/2035 | 01:41:23 AM CEST | Battery #0x65 | Failed | Asserted
10 | 04/17/2035 | 01:41:27 AM CEST | Physical Security #0x73 | General Chassis intrusion () | Asserted
11 | 04/17/2035 | 01:41:37 AM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
12 | 04/17/2035 | 01:41:38 AM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
13 | 04/17/2035 | 01:41:43 AM CEST | Physical Security #0x73 | General Chassis intrusion () | Deasserted
14 | 12/31/1999 | 07:00:38 PM CET | Battery #0x65 | Failed | Asserted
15 | 12/31/1999 | 07:00:43 PM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
16 | 12/31/1999 | 07:00:49 PM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
17 | 12/31/1999 | 07:00:59 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
18 | 12/31/1999 | 07:00:59 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
19 | 01/01/2001 | 01:02:24 AM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
1a | 01/01/2001 | 01:03:15 AM CET | Battery #0x65 | Failed | Deasserted
1b | 01/01/2001 | 05:11:29 PM CET | Cable / Interconnect #0x7c | Config Error | Asserted
1c | 11/02/2023 | 08:39:41 AM CET | Physical Security #0x73 | General Chassis intrusion () | Asserted
1d | 11/02/2023 | 08:39:46 AM CET | Physical Security #0x73 | General Chassis intrusion () | Deasserted
1e | 11/02/2023 | 08:39:46 AM CET | Power Supply #0x74 | Fully Redundant | Asserted
1f | 11/02/2023 | 08:39:51 AM CET | Power Supply #0x74 | Redundancy Lost | Asserted
20 | 11/02/2023 | 08:39:52 AM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
21 | 11/19/2023 | 03:23:54 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
22 | 11/19/2023 | 03:23:59 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
23 | 03/13/2024 | 05:19:26 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
24 | 03/13/2024 | 05:19:31 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
25 | 03/24/2024 | 10:07:02 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
26 | 03/24/2024 | 10:07:07 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
27 | 06/02/2024 | 09:48:04 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
28 | 06/02/2024 | 09:48:05 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
29 | 06/02/2024 | 10:25:55 PM CEST | Unknown #0x2e | | Asserted
2a | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
2b | 06/02/2024 | 10:25:55 PM CEST | Unknown #0x2e | | Asserted
2c | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
2d | 06/02/2024 | 10:25:55 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
2e | 06/04/2024 | 08:50:26 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
2f | 06/04/2024 | 08:50:26 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
30 | 06/04/2024 | 09:05:28 PM CEST | Unknown #0x2e | | Asserted
31 | 06/04/2024 | 09:05:28 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
32 | 06/04/2024 | 10:39:36 PM CEST | Unknown #0x2e | | Asserted
33 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
34 | 06/04/2024 | 10:39:36 PM CEST | Unknown #0x2e | | Asserted
35 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
36 | 06/04/2024 | 10:39:36 PM CEST | Unknown #0x2e | | Asserted
37 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
38 | 06/04/2024 | 10:39:36 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
39 | 09/02/2024 | 02:30:54 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
3a | 09/02/2024 | 02:31:00 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
3b | 11/25/2024 | 06:23:21 PM CET | Power Supply #0x74 | Redundancy Lost | Asserted
3c | 11/25/2024 | 06:23:21 PM CET | Power Supply #0x62 | Power Supply AC lost | Asserted
3d | 11/25/2024 | 07:46:42 PM CET | Unknown #0x2e | | Asserted
3e | 11/25/2024 | 07:46:42 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
3f | 11/26/2024 | 04:48:43 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
40 | 11/26/2024 | 04:50:18 PM CET | Unknown #0x2e | | Asserted
41 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
42 | 11/26/2024 | 04:50:18 PM CET | Unknown #0x2e | | Asserted
43 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
44 | 11/26/2024 | 04:50:18 PM CET | Unknown #0x2e | | Asserted
45 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
46 | 11/26/2024 | 04:50:18 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
47 | 03/14/2025 | 03:01:57 PM CET | Drive Slot / Bay #0xa5 | Drive Present () | Deasserted
48 | 03/14/2025 | 03:02:12 PM CET | Drive Slot / Bay #0xa5 | Drive Present () | Asserted
49 | 04/06/2025 | 02:07:36 PM CEST | Power Supply #0x74 | Redundancy Lost | Asserted
4a | 04/06/2025 | 02:07:36 PM CEST | Power Supply #0x62 | Power Supply AC lost | Asserted
4b | 04/06/2025 | 02:08:04 PM CEST | Unknown #0x2e | | Asserted
4c | 04/06/2025 | 02:08:04 PM CEST | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
4d | 12/13/2025 | 11:46:05 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
4e | 12/13/2025 | 11:46:35 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
4f | 12/13/2025 | 11:48:20 PM CET | Unknown #0x2e | | Asserted
50 | 12/13/2025 | 11:48:20 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
51 | 01/03/2026 | 06:09:43 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
52 | 01/05/2026 | 06:52:57 PM CET | Unknown #0x2e | | Asserted
53 | 01/05/2026 | 06:52:57 PM CET | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC | DIMMA4) | Asserted
So this DIMM has been failing for a long time (since June 2024 according to this log), and it is always DIMMA4.
To fix this, open up the case and remove the DIMM. The markings on the case should hint at where it is.
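With a long SEL it is easier to let awk count the memory events per DIMM than to scroll through the log. A minimal sketch, assuming the log has the layout shown above and was produced by something like ipmitool sel list (the exact command is not recorded here); dimm_errors is a made-up name:

```shell
# dimm_errors: read SEL lines on stdin and print, per DIMM name,
# how many "Uncorrectable ECC" events mention it.
dimm_errors() {
    awk '/Uncorrectable ECC/ && match($0, /DIMM[A-Z][0-9]+/) {
             count[substr($0, RSTART, RLENGTH)]++
         }
         END { for (d in count) print d, count[d] }' | sort
}
```

Used as ipmitool sel list | dimm_errors, it prints one line per DIMM; a single name in the output means a single bad module.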