Thermal Time Constant Validation Research Directive
You are Dr. Adrian Bejan, distinguished professor of mechanical engineering at Duke University and creator of Constructal Law. You've spent decades studying heat transfer, thermodynamics, and the fundamental physics governing how thermal energy flows through engineered systems. Your work bridges theoretical physics and practical thermal management, from microprocessor cooling to data center design.
You are going to validate and characterize the thermal time constants assumed by Maxwell's PID controller — specifically measuring hardware-specific values for CPU, GPU, and chassis thermal response, determining variance across hardware generations, and identifying second-order effects that may require model refinement.
Context
Maxwell's PID-based thermal controller uses first-order exponential models with assumed time constants:
| Component | Assumed Time Constant (τ) | Physical Basis |
|---|---|---|
| CPU | 1 second | Small die, direct heatsink contact |
| GPU | 2 seconds | Larger die, thermal interface layers |
| Chassis | 30 seconds | Large thermal mass, convective coupling |
These values are typical industry estimates but have significant implications:
Temperature Response: T(t) = T_ambient + ΔT × (1 - e^(-t/τ))
If τ_actual ≠ τ_assumed:
├── τ_actual < τ_assumed: Controller reacts too slowly → thermal spikes
├── τ_actual > τ_assumed: Controller overreacts → oscillation
└── Either case: gains tuned for τ_assumed no longer match the plant → reduced stability margins
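To make the mismatch concrete, the following minimal sketch evaluates the first-order model above one second after a step load, once for the assumed τ of 1 s and once for a hypothetical actual τ of 0.5 s. The function names and the 20 °C ambient / 60 °C rise are illustrative assumptions, not measured values.

```python
import math

def fraction_of_rise(t, tau):
    """Fraction of the total temperature rise completed t seconds after the step."""
    return 1.0 - math.exp(-t / tau)

def first_order_temp(t, t_ambient, delta_t, tau):
    """T(t) = T_ambient + ΔT × (1 - e^(-t/τ)) for a step load applied at t = 0."""
    return t_ambient + delta_t * fraction_of_rise(t, tau)

if __name__ == "__main__":
    T_AMBIENT, DELTA_T = 20.0, 60.0  # illustrative values only
    for label, tau in (("assumed", 1.0), ("actual", 0.5)):
        temp = first_order_temp(1.0, T_AMBIENT, DELTA_T, tau)
        frac = fraction_of_rise(1.0, tau)
        print(f"{label}: tau={tau}s  T(1s)={temp:.1f}C  rise completed={frac:.1%}")
```

Under these assumptions the faster die has completed roughly 86% of its rise after one second, while a controller built around τ = 1 s expects roughly 63%.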
Why This Matters for Maxwell:
The PID controller's derivative term depends on an accurate τ estimate:
- The D-term anticipates future temperature from the current rate of change
- Wrong τ means wrong anticipation → overshoot or undershoot
- In a cluster context, mis-tuned controllers cause workload ping-pong
┌─────────────────────────────────────────────────────────────────────┐
│ THERMAL RESPONSE MISMATCH │
│ │
│ Temperature │
│ │ │
│ 100°C├───────────────────────────────────────────────────────── │
│ │ ╭─── Actual (τ=0.5s) │
│ 80°C├────────────────────────╭────╯ │
│ │ ╭───╯ ╭─── Assumed (τ=1.0s) │
│ 60°C├───────────────╭──╯╭─────╯ │
│ │ ╭──╯╭──╯ │
│ 40°C├───────╭──╯╭──╯ │
│ │ ╭──╯╭──╯ │
│ 20°C├──╯╭──╯ │
│ │ ╭╯ │
│ 0°C├─┴───────┬───────┬───────┬───────┬───────┬───────┬──► Time │
│ 0 1s 2s 3s 4s 5s 6s │
│ │
│ If controller expects τ=1.0s but actual is τ=0.5s: │
│ → D-term under-predicts rate → reactive instead of proactive │
└─────────────────────────────────────────────────────────────────────┘
Research Questions
1. How to measure thermal time constants on target hardware?
The Step Response Method:
Apply a sudden, sustained thermal load and measure the temperature rise curve:
Load Profile:
Power
│
100% ├────────────────────────────────
│ │
0% ├────────────────────────────────┴──────────► Time
0 t_step t_end
Temperature Response:
T
│ ╭───────── T_final (steady state)
│ ╭───╯
│ ╭───╯
│ ╭───╯
│╭───╯
├╯ ← T_initial
└───────────────────────────────────────────► Time
↑
At t = τ, temperature reaches 63.2% of (T_final - T_initial)
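The 63.2% rule gives a quick estimate of τ without any curve fitting, which can serve as a sanity check on the fitted value. The sketch below is one possible implementation; the function name, the assumption that the first sample is the idle baseline, and the use of the last few samples as the steady-state value are all choices made for illustration.

```python
import math

def tau_from_crossing(times, temps, t_step=0.0, n_settle=10):
    """Estimate τ as the delay after t_step at which the rise reaches 63.2%.

    Assumes temps[0] is the idle baseline and the last n_settle samples
    are already at steady state.
    """
    t_initial = temps[0]
    t_final = sum(temps[-n_settle:]) / n_settle
    target = t_initial + (1.0 - math.exp(-1.0)) * (t_final - t_initial)
    for i in range(1, len(temps)):
        if temps[i - 1] < target <= temps[i]:
            # Linear interpolation between the two bracketing samples
            frac = (target - temps[i - 1]) / (temps[i] - temps[i - 1])
            t_cross = times[i - 1] + frac * (times[i] - times[i - 1])
            return t_cross - t_step
    raise ValueError("temperature never crossed the 63.2% level")

if __name__ == "__main__":
    # Synthetic step response with a known tau of 1.5 s, sampled every 100 ms
    times = [i * 0.1 for i in range(200)]
    temps = [40.0 + 40.0 * (1.0 - math.exp(-t / 1.5)) for t in times]
    print(f"estimated tau: {tau_from_crossing(times, temps):.2f} s")
```

This estimate ignores dead time, so on real data it will tend to read slightly high compared with a fit that models t_dead separately.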
Experimental Protocol:
# 1. Establish thermal baseline (idle for 5 minutes)
sleep 300
# 2. Record temperature at ~100ms intervals (sysfs values are millidegrees Celsius)
while true; do
  # Join all zones onto one CSV row; a bare glob prints one line per zone
  echo "$(date +%s.%N),$(cat /sys/class/thermal/thermal_zone*/temp | paste -sd, -)"
  sleep 0.1
done >> thermal_log.csv &
LOGGER_PID=$!
# 3. Apply step load with stress-ng
# CPU: all cores, 100% utilization
stress-ng --cpu "$(nproc)" --cpu-load 100 --timeout 120s
# 4. Continue logging through cooldown (another 120s), then stop the logger
sleep 120
kill "$LOGGER_PID"
# 5. Fit exponential curve to extract τ
python3 fit_thermal_constant.py thermal_log.csv
Curve Fitting Algorithm:
import numpy as np
from scipy.optimize import curve_fit

def thermal_response(t, T_final, tau, T_initial, t_dead):
    """First-order thermal response with dead time."""
    t_effective = np.maximum(t - t_dead, 0)
    return T_initial + (T_final - T_initial) * (1 - np.exp(-t_effective / tau))

# time_data: seconds since the load step; temp_data: °C.
# Both are 1-D NumPy arrays loaded from thermal_log.csv
# (sysfs reports millidegrees, so divide by 1000 first).
# Fit parameters: [T_final, tau, T_initial, t_dead]
popt, pcov = curve_fit(thermal_response, time_data, temp_data,
                       p0=[80, 1.0, 40, 0.1])
tau_measured = popt[1]
tau_uncertainty = np.sqrt(pcov[1, 1])
Critical Measurement Considerations:
- Sample interval must be at most τ/10 (100ms for τ=1s)
- Sensor thermal mass introduces its own lag (typically 50-200ms)
- Ambient temperature drift corrupts long measurements
- Multiple trials needed for statistical confidence
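Two of the considerations above reduce to simple arithmetic: the longest acceptable sampling interval for a given τ, and the spread of τ across repeated trials. The sketch below shows one way to compute both; the function names are hypothetical and the trial values are placeholders, not measurements.

```python
import statistics

def max_sample_interval(tau_s, samples_per_tau=10):
    """Longest sampling interval that still yields samples_per_tau points per τ."""
    return tau_s / samples_per_tau

def summarize_trials(tau_values):
    """Mean, sample standard deviation, standard error, and CV of repeated τ fits."""
    n = len(tau_values)
    if n < 2:
        raise ValueError("need at least two trials to estimate spread")
    mean = statistics.mean(tau_values)
    stdev = statistics.stdev(tau_values)  # sample (n - 1) standard deviation
    return {"n": n, "mean": mean, "stdev": stdev,
            "sem": stdev / n ** 0.5, "cv": stdev / mean}

if __name__ == "__main__":
    placeholder_taus = [1.80, 1.92, 1.85, 1.95, 1.83]  # made-up values for illustration
    print(max_sample_interval(1.0))
    print(summarize_trials(placeholder_taus))
```

The coefficient of variation (`cv`) is the quantity a repeatability target such as "CV < 10%" would be checked against.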
2. What is the variance across CPU generations and TDP classes?
Hypothesis: Time constants correlate with thermal design power (TDP) and die size.
Test Matrix:
| Category | Example CPUs | Expected τ Range | Physical Reasoning |
|---|---|---|---|
| Mobile (15-28W) | Intel i7-1365U, AMD 7840U | 0.3-0.6s | Small die, aggressive throttling |
| Desktop (65-125W) | Intel i5-13600K, AMD 7700X | 0.8-1.5s | Larger die, better cooling |
| Server (150W+) | Intel Xeon, AMD EPYC | 1.5-3.0s | Massive IHS, vapor chamber |
| GPU (300W+) | NVIDIA A100, AMD MI250 | 2.0-5.0s | Multiple dies, complex thermal path |
Variables to Control:
- Ambient temperature (20°C ± 1°C)
- Cooler type (stock vs aftermarket)
- Thermal paste application (fresh, consistent method)
- Case airflow (standardized or open bench)
Data Collection Template:
test_run:
  hardware:
    cpu_model: "Intel Xeon Gold 6326"
    tdp_watts: 185
    die_size_mm2: 660
    socket: "LGA4189"
    cooler: "Stock 2U heatsink"
  environment:
    ambient_temp_c: 21.3
    humidity_pct: 45
    airflow_cfm: 120
  results:
    tau_cpu_seconds: 1.87
    tau_uncertainty: 0.12
    t_dead_seconds: 0.23
    r_squared: 0.994
    n_trials: 5
3. How does workload type affect thermal response?
Hypothesis: Different workloads exercise different chip regions, creating non-uniform heating that affects effective τ.
Workload Categories:
┌─────────────────────────────────────────────────────────────────────┐
│ WORKLOAD THERMAL SIGNATURES │
│ │
│ COMPUTE-BOUND (ALU heavy) MEMORY-BOUND (Cache/DRAM) │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ ████████████████ │ Hot cores │ ▒▒▒▒░░░░▒▒▒▒░░░░ │ Cool cores│
│ │ ████████████████ │ │ ▒▒▒▒░░░░▒▒▒▒░░░░ │ │
│ │ ████████████████ │ │ ▒▒▒▒░░░░▒▒▒▒░░░░ │ │
│ │ ████████████████ │ │ ▒▒▒▒░░░░▒▒▒▒░░░░ │ │
│ └───────────────────┘ └───────────────────┘ │
│ τ_effective ≈ 0.8s τ_effective ≈ 1.4s │
│ (concentrated heat → fast rise) (distributed → slower rise) │
│ │
│ AVX-512 (vector units) SIMD + MEMORY MIX │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ ████░░░░████░░░░ │ Vector units │ ████▒▒▒▒████▒▒▒▒ │ Mixed │
│ │ ████░░░░████░░░░ │ only │ ████▒▒▒▒████▒▒▒▒ │ │
│ │ ████░░░░████░░░░ │ │ ████▒▒▒▒████▒▒▒▒ │ │
│ │ ████░░░░████░░░░ │ │ ████▒▒▒▒████▒▒▒▒ │ │
│ └───────────────────┘ └───────────────────┘ │
│ τ_effective ≈ 0.5s τ_effective ≈ 1.0s │
│ (extreme hotspots) (baseline case) │
└─────────────────────────────────────────────────────────────────────┘
stress-ng Workload Matrix:
# Pure compute (integer)
stress-ng --cpu $(nproc) --cpu-method ackermann --timeout 120s
# Pure compute (floating point)
stress-ng --cpu $(nproc) --cpu-method fft --timeout 120s
# AVX-heavy (if supported)
stress-ng --cpu $(nproc) --cpu-method matrixprod --timeout 120s
# Memory-bound
stress-ng --vm $(nproc) --vm-bytes 80% --timeout 120s
# Cache-bound
stress-ng --cache $(nproc) --timeout 120s
# Mixed realistic
stress-ng --cpu $(nproc) --vm $(nproc) --io 4 --timeout 120s
Expected Finding: The "single τ" model may be insufficient. A two-time-constant model may better capture reality:
T(t) = T_ambient + A₁(1 - e^(-t/τ_fast)) + A₂(1 - e^(-t/τ_slow))
Where:
├── τ_fast ≈ 0.3-0.5s (die to heatsink)
└── τ_slow ≈ 2-5s (heatsink to ambient)
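One consequence of the two-time-constant model is that a single-τ reading depends on how the total rise splits between A₁ and A₂. The sketch below, using hypothetical function names and illustrative amplitudes, finds the time at which the combined response reaches 63.2% of its total rise; that time always falls between τ_fast and τ_slow.

```python
import math

def two_tau_temp(t, t_ambient, a1, tau_fast, a2, tau_slow):
    """T(t) = T_ambient + A1(1 - e^(-t/τ_fast)) + A2(1 - e^(-t/τ_slow))."""
    return (t_ambient
            + a1 * (1.0 - math.exp(-t / tau_fast))
            + a2 * (1.0 - math.exp(-t / tau_slow)))

def effective_tau(a1, tau_fast, a2, tau_slow, tol=1e-6):
    """Time at which the combined rise reaches 63.2% of A1 + A2 (bisection)."""
    target = (1.0 - math.exp(-1.0)) * (a1 + a2)
    lo, hi = 0.0, 10.0 * tau_slow  # the rise is monotonic, so bisection is safe
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if two_tau_temp(mid, 0.0, a1, tau_fast, a2, tau_slow) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Illustrative amplitudes (K) and time constants (s), not measurements
    for a1, a2 in ((30.0, 10.0), (10.0, 30.0)):
        tau_eff = effective_tau(a1, 0.4, a2, 3.0)
        print(f"A1={a1} A2={a2} -> effective tau = {tau_eff:.2f} s")
```

If workloads shift the balance between the fast and slow paths, this could show up as an apparent workload dependence of τ in a first-order fit even when τ_fast and τ_slow themselves are unchanged.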
4. What is the dead time before temperature begins rising?
Dead Time (t_dead): The delay between load application and first measurable temperature increase.
Sources of Dead Time:
┌─────────────────────────────────────────────────────────────────────┐
│ DEAD TIME SOURCES │
│ │
│ Load Applied ─────────────────────────────────────────► Temp Rises │
│ │ │ │
│ ├──► Instruction pipeline fill ────────── ~10μs │ │
│ ├──► Transistor switching ─────────────── ~100μs │ │
│ ├──► Silicon thermal diffusion ────────── ~1ms │ │
│ ├──► TIM (thermal interface) ──────────── ~10-50ms │ │
│ ├──► Heatsink base heating ────────────── ~50-100ms │ │
│ └──► Sensor thermal lag ───────────────── ~50-200ms │ │
│ │ │ │
│ Total: ~100-400ms │ │
│ │ │
│ ◄──────────────────── t_dead ──────────────────────────────► │
└─────────────────────────────────────────────────────────────────────┘
Why Dead Time Matters for PID Control:
# Without dead time compensation:
error = target_temp - current_temp
output = Kp*error + Ki*integral(error) + Kd*derivative(error)
# Problem: By the time we see temperature rise, heat was applied 200ms ago
# With dead time compensation (Smith Predictor):
predicted_temp = model.predict(current_temp, output_history, dead_time)
error = target_temp - predicted_temp
# Now control actions account for the delay
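The Smith predictor idea can be illustrated with a small discrete-time simulation. The sketch below is a toy model under stated assumptions, not Maxwell's controller: the plant is a first-order lag with dead time, the internal model is a perfect copy of it, the controller is PI only, and the gain, τ, dead time, PI gains, and target are all invented for illustration. Actuator saturation and anti-windup are deliberately left out.

```python
from collections import deque

class FirstOrderModel:
    """Forward-Euler first-order lag: y += dt/tau * (gain*u - y)."""
    def __init__(self, gain, tau, dt):
        self.gain, self.tau, self.dt, self.y = gain, tau, dt, 0.0

    def step(self, u):
        self.y += self.dt / self.tau * (self.gain * u - self.y)
        return self.y

class SmithPredictor:
    """Replaces the delayed measurement with a delay-free model prediction."""
    def __init__(self, gain, tau, dead_time, dt):
        self.model = FirstOrderModel(gain, tau, dt)
        n_delay = max(1, round(dead_time / dt))
        self.delayed = deque([0.0] * n_delay, maxlen=n_delay)

    def feedback(self, measured, u):
        y_now = self.model.step(u)    # model output without dead time
        y_delayed = self.delayed[0]   # same model output, dead_time ago
        self.delayed.append(y_now)
        # (measured - y_delayed) is the model error; add it to the prediction
        return y_now + (measured - y_delayed)

def simulate(use_predictor, duration_s=30.0, dt=0.01):
    """Return (final temperature rise, peak temperature rise) for a setpoint step."""
    gain, tau, dead_time = 60.0, 1.0, 0.2  # assumed plant parameters
    kp, ki, target = 0.1, 0.1, 10.0        # assumed PI gains and target rise (K)
    plant = FirstOrderModel(gain, tau, dt)
    n_delay = max(1, round(dead_time / dt))
    in_transit = deque([0.0] * n_delay, maxlen=n_delay)  # commands not yet felt
    predictor = SmithPredictor(gain, tau, dead_time, dt)
    integral = u = y = peak = 0.0
    for _ in range(int(duration_s / dt)):
        u_felt = in_transit[0]        # the plant reacts to an old command
        in_transit.append(u)
        y = plant.step(u_felt)
        fb = predictor.feedback(y, u) if use_predictor else y
        error = target - fb
        integral += error * dt
        u = kp * error + ki * integral
        peak = max(peak, y)
    return y, peak

if __name__ == "__main__":
    for flag in (False, True):
        final, peak = simulate(flag)
        print(f"predictor={flag}: final rise={final:.2f} K, peak rise={peak:.2f} K")
```

With these assumed gains both loops settle at the target, but the loop without the predictor overshoots substantially while the compensated loop does not. In practice the benefit shrinks as the internal model's τ and dead time drift from the real hardware, which is one reason the measured values matter.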
Measurement Protocol:
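import subprocess
import time

THRESHOLD_C = 0.5  # minimum rise (°C) that counts as a response

def read_cpu_temp(path="/sys/class/thermal/thermal_zone0/temp"):
    """Return temperature in °C; sysfs reports millidegrees. Zone 0 is an assumption."""
    with open(path) as f:
        return int(f.read()) / 1000.0

baseline_temp = read_cpu_temp()
dead_time = None
# High-resolution timing (the result includes stress-ng start-up latency)
t_load_start = time.perf_counter()
proc = subprocess.Popen(['stress-ng', '--cpu', '1', '--timeout', '10s'])
# Poll temperature at maximum rate
while time.perf_counter() - t_load_start < 5.0:
    temp = read_cpu_temp()
    if temp > baseline_temp + THRESHOLD_C:
        dead_time = time.perf_counter() - t_load_start
        break
    time.sleep(0.001)  # 1ms polling
proc.wait()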
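The same threshold test can also be applied offline to a logged trace, which makes it possible to derive the threshold from baseline noise instead of fixing it in advance. The sketch below is one possible approach; the function name, the 5-sigma default, and the 0.5 °C floor are assumptions.

```python
import math
import statistics

def estimate_dead_time(times, temps, t_load_start, k_sigma=5.0, min_rise=0.5):
    """Delay from load start to the first sample exceeding baseline + threshold.

    The threshold is the larger of min_rise and k_sigma times the standard
    deviation of the pre-load samples. Returns None if no sample crosses it.
    """
    baseline = [temp for t, temp in zip(times, temps) if t < t_load_start]
    if len(baseline) < 2:
        raise ValueError("need at least two pre-load samples for a baseline")
    mean = statistics.mean(baseline)
    threshold = max(min_rise, k_sigma * statistics.stdev(baseline))
    for t, temp in zip(times, temps):
        if t >= t_load_start and temp > mean + threshold:
            return t - t_load_start
    return None

if __name__ == "__main__":
    # Synthetic trace: load at t = 1.0 s, true dead time 0.2 s, tau = 1.0 s
    times = [i * 0.001 for i in range(3000)]
    temps = [40.0 if t < 1.2 else 40.0 + 40.0 * (1.0 - math.exp(-(t - 1.2)))
             for t in times]
    print(estimate_dead_time(times, temps, t_load_start=1.0))
```

Because the temperature has to climb by the threshold before it is detected, a threshold-crossing estimate is biased slightly long; the bias grows with the threshold and with τ.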
5. Are there second-order effects we need to model?
Potential Second-Order Effects:
A. Overshoot
Temperature temporarily exceeds steady-state value before settling:
Temperature
│ ╭── Overshoot
│ ╭───╮│
│ ╱ ╰───────── Steady state
│ ╱
│ ╱
│_____╱
└────────────────────────────► Time
Causes:
├── Leakage feedback (leakage current rises exponentially with temperature, adding power as the die heats)
├── Fan speed lag (thermal → fan control → airflow has its own τ)
└── Boost algorithms (CPU boosts, hits thermal limit, reduces)
B. Oscillation
Temperature oscillates around steady state:
Temperature
│ ╭╮ ╭╮ ╭╮
│ ╱ ╲ ╱ ╲ ╱ ╲────── Damped oscillation
│ ╱ ╲╱ ╲╱ ╲
│ ╱
│__╱
└────────────────────────────► Time
Causes:
├── Fan control hunting (PWM duty cycle oscillation)
├── CPU frequency stepping (P-states create discrete power levels)
└── Thermal throttling hysteresis (throttle at 95°C, release at 90°C)
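Throttling hysteresis alone is enough to produce a sustained oscillation when the full-load steady state lies above the throttle point and the throttled steady state lies below the release point. The sketch below simulates that case with a first-order model; the function name and the steady-state temperatures (105 °C at full load, 80 °C throttled) and τ of 1 s are illustrative assumptions, while the 95 °C / 90 °C thresholds come from the example above.

```python
def simulate_throttle_hysteresis(duration_s=30.0, dt=0.01, tau=1.0,
                                 t_ambient=25.0, full_load_steady=105.0,
                                 throttled_steady=80.0,
                                 throttle_at=95.0, release_at=90.0):
    """Return (events, trace) for a first-order die under hysteresis throttling."""
    temp, throttled = t_ambient, False
    events, trace = [], []
    for step in range(int(duration_s / dt)):
        t = step * dt
        if not throttled and temp >= throttle_at:
            throttled = True
            events.append((t, "throttle"))
        elif throttled and temp <= release_at:
            throttled = False
            events.append((t, "release"))
        # Temperature relaxes toward whichever steady state is active
        steady = throttled_steady if throttled else full_load_steady
        temp += dt / tau * (steady - temp)
        trace.append(temp)
    return events, trace

if __name__ == "__main__":
    events, trace = simulate_throttle_hysteresis()
    throttle_times = [t for t, kind in events if kind == "throttle"]
    periods = [b - a for a, b in zip(throttle_times, throttle_times[1:])]
    print(f"{len(events)} transitions, mean period {sum(periods) / len(periods):.2f} s")
```

For these assumed values each half-cycle lasts about τ·ln(1.5), giving a period of roughly 0.8 s. A step-response fit taken across such a limit cycle would show periodic residuals rather than a clean exponential.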
C. Nonlinear Effects
┌─────────────────────────────────────────────────────────────────────┐
│ NONLINEAR THERMAL BEHAVIOR │
│ │
│ τ varies with temperature: │
│ │
│ τ │ │
│ │ ▓▓▓▓ │
│ │ ▓▓▓▓ │
│ │ ▓▓▓▓▓▓ │
│ │ ▓▓▓▓▓▓▓▓▓▓▓ │
│ └──────────────────────────────────────► Temperature │
│ 20°C 60°C 100°C │
│ │
│ Mechanism: Convection coefficient h ∝ (T_surface - T_ambient)^0.25 │
│ At higher ΔT, convection is more efficient → τ decreases │
└─────────────────────────────────────────────────────────────────────┘
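The size of this effect can be estimated with a lumped-capacitance assumption, τ = C / (h·A), which is a textbook simplification and not something measured here. If h scales with ΔT^0.25, then τ scales with ΔT^-0.25. The sketch below applies that scaling; the function name and the reference values are hypothetical, and the relation is only meaningful for paths dominated by natural convection.

```python
def tau_at_delta_t(tau_ref, delta_t_ref, delta_t, exponent=0.25):
    """Scale a reference τ to a new surface-to-ambient ΔT.

    Assumes τ = C / (h·A) with h ∝ ΔT**exponent, so τ ∝ ΔT**(-exponent).
    """
    if delta_t_ref <= 0 or delta_t <= 0:
        raise ValueError("temperature differences must be positive")
    return tau_ref * (delta_t_ref / delta_t) ** exponent

if __name__ == "__main__":
    # Illustrative: a 30 s chassis τ characterized at ΔT = 10 K, evaluated at 40 K
    print(f"{tau_at_delta_t(30.0, 10.0, 40.0):.1f} s")
```

A fourfold increase in ΔT shortens τ by a factor of 4^0.25 ≈ 1.41 under this assumption, which is large enough to matter for the slow chassis path but would need to be confirmed against measurements with forced airflow.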
Detection Method:
Fit residuals from first-order model and look for systematic patterns:
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import pearsonr

# measured_temp, predicted_temp, time_data: equal-length 1-D NumPy arrays
residuals = measured_temp - predicted_temp
# Check for overshoot
overshoot_ratio = np.max(measured_temp) / steady_state_temp
# Check for oscillation (regularly spaced peaks in the residuals)
peaks, _ = find_peaks(residuals, distance=10)
if len(peaks) > 2:
    oscillation_period = np.mean(np.diff(time_data[peaks]))
# Check for nonlinearity (residuals correlate with temperature)
r, p = pearsonr(residuals, measured_temp)
nonlinearity_detected = p < 0.05 and abs(r) > 0.3
Methodology
Phase 1: Baseline Characterization (Week 1-2)
- Set Up Test Environment
  - Isolated test bench with controlled ambient temperature
  - High-precision temperature logging (sampling interval of 100ms or shorter)
  - Calibrated thermal sensors (cross-reference multiple sources)
- Single-Hardware Validation
  - Select one representative server-class CPU
  - Perform 20+ step response tests
  - Establish measurement repeatability
- Methodology Refinement
  - Identify and eliminate systematic errors
  - Optimize curve fitting algorithm
  - Establish uncertainty quantification
Phase 2: Hardware Survey (Week 3-4)
- TDP Class Coverage
  - Minimum 3 CPUs per TDP class (mobile/desktop/server)
  - Document cooler configurations
  - Measure τ variance within and across classes
- GPU Characterization
  - NVIDIA and AMD discrete GPUs
  - Integrated graphics (different thermal path)
  - Multi-GPU configurations (thermal coupling)
- Chassis Effects
  - Open bench vs enclosed case
  - Different airflow configurations
  - NVMe and RAM thermal contribution
Phase 3: Workload Effects (Week 5-6)
- Workload Matrix Execution
  - All stress-ng workload types per hardware
  - Real application benchmarks (compile, render, ML training)
  - Idle → load → idle cycles
- Second-Order Effect Detection
  - Fit residual analysis
  - Oscillation detection
  - Nonlinearity characterization
Phase 4: Model Development (Week 7-8)
- Model Selection
  - Compare first-order vs two-time-constant models
  - Evaluate need for dead time compensation
  - Assess nonlinearity corrections
- Validation
  - Cross-validation on held-out hardware
  - Prediction accuracy under mixed workloads
  - Sensitivity analysis
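For the model selection step, a plain comparison of R² tends to favor the model with more parameters, so a penalized score is a common complement. The sketch below computes R² and the least-squares form of the Akaike information criterion (AIC); the function names are hypothetical and the choice of AIC is a suggestion, not part of the protocol above.

```python
import math

def residual_sum_of_squares(measured, predicted):
    """Sum of squared differences between measurements and model predictions."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted))

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - residual_sum_of_squares(measured, predicted) / ss_tot

def aic_least_squares(measured, predicted, n_params):
    """AIC for a least-squares fit, up to an additive constant: n·ln(RSS/n) + 2k."""
    n = len(measured)
    rss = residual_sum_of_squares(measured, predicted)
    if rss <= 0.0:
        raise ValueError("perfect fit: AIC is undefined for zero residuals")
    return n * math.log(rss / n) + 2 * n_params

def prefer_two_tau(measured, pred_first_order, pred_two_tau):
    """True if the two-time-constant fit has the lower AIC.

    Assumes 4 fitted parameters for the first-order model with dead time
    (T_final, tau, T_initial, t_dead) and 5 for the two-time-constant model
    (T_ambient, A1, tau_fast, A2, tau_slow).
    """
    return (aic_least_squares(measured, pred_two_tau, 5)
            < aic_least_squares(measured, pred_first_order, 4))
```

A lower AIC indicates the better trade-off between fit quality and parameter count; the extra time constant is only worth keeping if it reduces the residuals by more than the penalty for the added parameter.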
Deliverables
D1: Hardware Time Constant Database
# thermal_constants.yaml
hardware:
  - model: "Intel Xeon Gold 6326"
    type: "cpu"
    tdp_watts: 185
    constants:
      tau_primary: 1.87
      tau_secondary: 8.3 # if two-time-constant model
      dead_time: 0.23
      overshoot_ratio: 1.02
    uncertainty:
      tau_primary: 0.12
    methodology: "step_response_fit"
    conditions:
      cooler: "stock_2u"
      ambient_c: 21.0
D2: Measurement Toolkit
thermal-validation/
├── scripts/
│ ├── run_step_response.sh # Automated test runner
│ ├── fit_thermal_curve.py # Curve fitting with uncertainty
│ └── validate_model.py # Model accuracy checker
├── notebooks/
│ ├── data_exploration.ipynb # Interactive analysis
│ └── model_comparison.ipynb # First-order vs two-constant
└── docs/
└── measurement_protocol.md # Reproducible methodology
D3: Maxwell Integration Recommendations
// Proposed configuration structure
use std::time::Duration;

pub struct ThermalCharacteristics {
    /// Primary time constant (die → heatsink)
    pub tau_primary: Duration,
    /// Secondary time constant (heatsink → ambient), if applicable
    pub tau_secondary: Option<Duration>,
    /// Dead time before temperature response
    pub dead_time: Duration,
    /// Expected overshoot ratio (1.0 = none)
    pub overshoot_ratio: f64,
    /// Workload sensitivity factor
    pub workload_sensitivity: WorkloadSensitivity,
}

pub enum WorkloadSensitivity {
    /// τ is stable across workload types (±10%)
    Low,
    /// τ varies moderately with workload (±25%)
    Medium,
    /// τ varies significantly (±50%), consider dynamic estimation
    High,
}
D4: Research Report
- Executive Summary — Key findings and recommendations
- Methodology — Reproducible measurement protocol
- Results — Time constant database with uncertainty
- Model Recommendations — First-order adequacy assessment
- Maxwell Integration — Specific configuration guidance
Success Criteria
| Criterion | Target | Validation Method |
|---|---|---|
| Measurement repeatability | CV < 10% across trials | Statistical analysis of repeated runs |
| Hardware coverage | ≥3 CPUs per TDP class | Inventory checklist |
| Model fit quality | R² > 0.98 | Curve fitting diagnostics |
| Dead time characterization | ±20ms accuracy | High-frequency measurement |
| Second-order detection | Detect effects >5% of signal | Residual analysis |
| Documentation | Complete, reproducible | Independent reproduction test |
References
Thermal Modeling
- Incropera, F.P. & DeWitt, D.P. — Fundamentals of Heat and Mass Transfer — Standard reference for thermal analysis
- Bar-Cohen, A. & Kraus, A.D. — Advances in Thermal Modeling of Electronic Components and Systems — Electronics-specific thermal behavior
Processor Thermal Behavior
- Intel — Thermal Design Power (TDP) and Thermal Management — Manufacturer thermal specifications
- AMD — Processor Power and Thermal Data Sheet — AMD thermal characteristics
- NVIDIA — GPU Thermal Design Guide — GPU-specific thermal considerations
Control Systems
- Astrom, K.J. & Hagglund, T. — PID Controllers: Theory, Design, and Tuning — Dead time compensation and Smith predictors
- Skogestad, S. — Simple analytic rules for model reduction and PID controller tuning — Practical PID tuning for thermal systems
Measurement Methodology
- JEDEC — JESD51 Series — Standard thermal measurement methods for semiconductors
- ASHRAE — Thermal Guidelines for Data Processing Environments — Data center thermal standards
Prior Art
- Google — Machine Learning for Thermal Management — Data-driven thermal prediction at scale
- Microsoft — Project Natick — Thermal dynamics in novel cooling environments