thermal: Adjust T6 thermal model #2435

Open
hawkw wants to merge 2 commits into master from eliza/adjust-t6-thermals

@hawkw
Member

@hawkw hawkw commented Mar 12, 2026

Our current thermal model for the T6, which sets a target temperature of 70° C, a critical threshold of 80° C, and a power-down threshold of 85° C, is extremely conservative. According to Chelsio, the Tj Typical for the part is 100° C, so we are running with thermal parameters that will shut down the whole system long before the part even reaches the vendor's typical temperature. While I imagine that keeping the T6 well below Tj Typical is probably good for the long-term reliability of the part, the thermal loop is far too twitchy about powering down the whole system if the T6 gets a bit hot, so we can afford to be less careful with it.

This commit updates the thermal model for T6 to set the target temperature to 95° C, the critical threshold to 100° C, and the power-down threshold to 115° C (which is Chelsio's Tj Max value for the part). This is still a bit more conservative than setting the target temperature to 100° C, but I feel like we can afford to be a bit conservative if it improves long-term reliability. Relative to the old thermal model, this still gives us 25° C of additional headroom on the target temperature before the thermal loop gets scared.
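
For reference, here is how far each threshold moves. This is only an illustrative bash sketch of the arithmetic, not the actual thermal configuration: the array names and the `threshold_deltas` function are made up for this example, and the values are the ones quoted above.

```bash
#!/bin/bash
# Thresholds in degrees C, in the order: target, critical, power-down.
# Values are the ones quoted in the description; the names are hypothetical.
names=(target critical power-down)
old=(70 80 85)
new=(95 100 115)

# Print the old value, the new value, and the change for each threshold.
threshold_deltas() {
    local i
    for i in 0 1 2; do
        printf '%s: %d -> %d (+%d)\n' \
            "${names[$i]}" "${old[$i]}" "${new[$i]}" "$(( new[i] - old[i] ))"
    done
}

threshold_deltas
# target: 70 -> 95 (+25)
# critical: 80 -> 100 (+20)
# power-down: 85 -> 115 (+30)
```

The new target sits 5° C below Tj Typical, and the new power-down threshold matches Tj Max.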

Closes #2431

@hawkw hawkw requested review from mkeeter and rmustacc March 12, 2026 17:14
@hawkw hawkw marked this pull request as ready for review March 12, 2026 17:14
@mkeeter
Collaborator

mkeeter commented Mar 16, 2026

I was paranoid that this would tip us from T6-limited to CPU-limited, with unpredictable consequences for the thermal loop. However, doing a quick audit of rack2, many machines are already in the CPU-limited regime, so I think this is fine.

thermal_sweep.sh Terrible bash script:
#!/bin/bash
# Sweep every gimlet and cosmo SP in rack2 and print its CPU and T6 temperatures.
pilot -rrack2 sp ls | grep -E "(gimlet|cosmo)" | while read -r line ; do
        # Fields 2, 3, and 5 of each listing line: serial, board type, address.
        serial=$(echo "$line" | awk '{print $2}')
        type=$(echo "$line" | awk '{print $3}')
        ip=$(echo "$line" | awk '{print $5}')
        # Fetch the inventory once and pull both sensor component IDs out of it.
        inventory=$(faux-mgs --interface=rack2sw0tp0 --discovery-addr="[$ip]:11111" inventory 2>/dev/null)
        cpu=$(echo "$inventory" | grep "CPU temperature" | awk '{print $1}')
        t6=$(echo "$inventory" | grep "T6 temp" | awk '{print $1}')
        echo "$serial $type $ip"
        cpu_temp=$(faux-mgs --interface=rack2sw0tp0 --discovery-addr="[$ip]:11111" component-details "$cpu" 2>/dev/null)
        t6_temp=$(faux-mgs --interface=rack2sw0tp0 --discovery-addr="[$ip]:11111" component-details "$t6" 2>/dev/null)
        echo "$cpu $t6"
        echo $cpu_temp
        echo $t6_temp
        echo ""
done
matt@castle ~ $ ./thermal_sweep.sh
BRM42220006 gimlet fe80::aa40:25ff:fe04:181
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.0) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(63.4375) })

BRM42220017 gimlet fe80::aa40:25ff:fe04:182
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.625) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(60.5625) })

BRM42220051 gimlet fe80::aa40:25ff:fe04:185
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.125) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(64.5) })

BRM42220018 gimlet fe80::aa40:25ff:fe04:1c1
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(71.375) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(70.4375) })

BRM42220014 gimlet fe80::aa40:25ff:fe04:342
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(79.625) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(63.75) })

BRM42220031 gimlet fe80::aa40:25ff:fe04:343
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(78.125) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(69.5) })

BRM44220010 gimlet fe80::aa40:25ff:fe04:344
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(75.25) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(69.9375) })

BRM44220005 gimlet fe80::aa40:25ff:fe04:347
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.375) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(68.1875) })

BRM42220057 gimlet fe80::aa40:25ff:fe04:383
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.25) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(66.9375) })

BRM42220016 gimlet fe80::aa40:25ff:fe04:385
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.375) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(65.375) })

BRM42220009 gimlet fe80::aa40:25ff:fe04:3c4
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(76.5) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(62.5) })

BRM44220011 gimlet fe80::aa40:25ff:fe04:3c5
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(79.625) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(69.625) })

BRM13250012 cosmo fe80::aa40:25ff:fe04:402
U1/SBTSI U53
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(63.5) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(61.3125) })

BRM27230045 gimlet fe80::aa40:25ff:fe04:6c6
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(80.75) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(69.5) })

BRM22250001 cosmo fe80::aa40:25ff:fe04:c86
U1/SBTSI U53
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(61.75) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(58.9375) })
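
To make a sweep like this easier to scan, the output could be boiled down to one line per board with the T6 margin against a chosen target. The following is a rough sketch, assuming the output keeps exactly the format shown above; `summarize_sweep` is a hypothetical helper, not part of `pilot` or `faux-mgs`, and the target is passed in as an argument (70 for the old model, 95 for the proposed one).

```bash
#!/bin/bash
# Hypothetical helper: read thermal_sweep.sh output on stdin and print, per
# board, the CPU reading, the T6 reading, and the T6 margin to the target
# given as $1 (target minus reading, so a negative margin means over target).
summarize_sweep() {
    local t6_target=$1
    awk -v target="$t6_target" '
        # Record header: "<serial> <type> <address>".
        NF == 3 && ($2 == "gimlet" || $2 == "cosmo") { serial = $1; cpu = "?" }
        # Measurement lines carry a sensor name and a reading inside Ok(...).
        /^Measurement/ && match($0, /Ok\([0-9.]+\)/) {
            value = substr($0, RSTART + 3, RLENGTH - 4)
            if ($0 ~ /name: "CPU"/) cpu = value
            if ($0 ~ /name: "t6"/)
                printf "%s cpu=%s t6=%s t6_margin=%.4f\n", serial, cpu, value, target - value
        }
    '
}

# Example using one record from the sweep above, against the old 70 C target.
summarize_sweep 70 <<'EOF'
BRM42220018 gimlet fe80::aa40:25ff:fe04:1c1
P0/SBTSI U491
Measurement(Measurement { name: "CPU", kind: Temperature, value: Ok(71.375) })
Measurement(Measurement { name: "t6", kind: Temperature, value: Ok(70.4375) })
EOF
# BRM42220018 cpu=71.375 t6=70.4375 t6_margin=-0.4375
```

With the real script this would be run as `./thermal_sweep.sh | summarize_sweep 95`. Readings that come back as an error rather than `Ok(...)` are skipped instead of being guessed at.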
