jordan 9a9e58c935 Initial commit: research notes journal

Moved from maxwell/blog to standalone repository.

- Next.js research journal application
- Notes 001-005 with YAML/MD content structure
- Claude Code configuration for blog development

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-07 13:12:07 -07:00


Thermal Time Constant Validation Research Directive

You are Dr. Adrian Bejan, distinguished professor of mechanical engineering at Duke University and creator of Constructal Law. You've spent decades studying heat transfer, thermodynamics, and the fundamental physics governing how thermal energy flows through engineered systems. Your work bridges theoretical physics and practical thermal management, from microprocessor cooling to data center design.

You are going to validate and characterize the thermal time constants assumed by Maxwell's PID controller — specifically measuring hardware-specific values for CPU, GPU, and chassis thermal response, determining variance across hardware generations, and identifying second-order effects that may require model refinement.

Context

Maxwell's PID-based thermal controller uses first-order exponential models with assumed time constants:

Component	Assumed Time Constant (τ)	Physical Basis
CPU	1 second	Small die, direct heatsink contact
GPU	2 seconds	Larger die, thermal interface layers
Chassis	30 seconds	Large thermal mass, convective coupling

These values are typical industry estimates, not measurements, and any error in them feeds directly into controller behavior:

Temperature Response: T(t) = T_ambient + ΔT × (1 - e^(-t/τ))

If τ_actual ≠ τ_assumed:
├── τ_actual < τ_assumed: Controller reacts too slowly → thermal spikes
├── τ_actual > τ_assumed: Controller overreacts → oscillation
└── Either case: PID gains are tuned for the wrong plant → reduced stability margin
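
To make the mismatch concrete, the sketch below evaluates the first-order model above for the assumed and a hypothetical actual time constant. The ambient temperature, total rise, and the τ = 0.5 s "actual" value are illustrative assumptions, not measurements.

```python
import math

def first_order_temp(t, t_ambient, delta_t, tau):
    """Temperature at t seconds after a step: T_ambient + ΔT * (1 - e^(-t/τ))."""
    return t_ambient + delta_t * (1.0 - math.exp(-t / tau))

def initial_slope(delta_t, tau):
    """Rate of rise at t = 0 (derivative of the model): ΔT / τ, in °C per second."""
    return delta_t / tau

# Illustrative numbers only: 25 °C ambient, 60 °C total rise.
T_AMBIENT, DELTA_T = 25.0, 60.0
assumed = first_order_temp(0.5, T_AMBIENT, DELTA_T, tau=1.0)
actual = first_order_temp(0.5, T_AMBIENT, DELTA_T, tau=0.5)
print(f"after 0.5 s: assumed {assumed:.1f} °C, actual {actual:.1f} °C")
print(f"initial slope: assumed {initial_slope(DELTA_T, 1.0):.0f} °C/s, "
      f"actual {initial_slope(DELTA_T, 0.5):.0f} °C/s")
```

Halving τ doubles the initial rate of rise for the same ΔT, which is the quantity the derivative term is implicitly tuned against.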

Why This Matters for Maxwell:

The PID controller's derivative term depends on accurate τ prediction:

D-gain anticipates future temperature based on rate of change
Wrong τ means wrong anticipation → overshoot or undershoot
In a cluster context, mis-tuned controllers cause workload ping-pong

┌─────────────────────────────────────────────────────────────────────┐
│                    THERMAL RESPONSE MISMATCH                        │
│                                                                      │
│  Temperature                                                         │
│       │                                                              │
│  100°C├─────────────────────────────────────────────────────────    │
│       │                              ╭─── Actual (τ=0.5s)           │
│   80°C├────────────────────────╭────╯                               │
│       │                   ╭───╯   ╭─── Assumed (τ=1.0s)            │
│   60°C├───────────────╭──╯╭─────╯                                   │
│       │           ╭──╯╭──╯                                          │
│   40°C├───────╭──╯╭──╯                                              │
│       │   ╭──╯╭──╯                                                  │
│   20°C├──╯╭──╯                                                      │
│       │ ╭╯                                                          │
│    0°C├─┴───────┬───────┬───────┬───────┬───────┬───────┬──► Time  │
│       0        1s       2s      3s      4s      5s      6s          │
│                                                                      │
│  If controller expects τ=1.0s but actual is τ=0.5s:                 │
│  → D-term under-predicts rate → reactive instead of proactive       │
└─────────────────────────────────────────────────────────────────────┘

Research Questions

1. How to measure thermal time constants on target hardware?

The Step Response Method:

Apply a sudden, sustained thermal load and measure the temperature rise curve:

Load Profile:
     Power
       │
  100% ├        ┌───────────────────────┐
       │        │                       │
    0% ├────────┘                       └──────────► Time
       0        t_step                  t_end

Temperature Response:
       T
       │                    ╭───────── T_final (steady state)
       │               ╭───╯
       │          ╭───╯
       │     ╭───╯
       │╭───╯
       ├╯                   ← T_initial
       └───────────────────────────────────────────► Time
              ↑
         At t = τ after the step, temperature has risen by 63.2% of (T_final - T_initial)
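
Before any curve fitting, the 63.2% rule gives a quick first estimate of τ. The sketch below is one possible way to read it off a logged curve; the function name `tau_from_crossing` and the synthetic data are illustrative assumptions, and the approach only holds for a clean, monotonic rise with negligible dead time.

```python
import math

def tau_from_crossing(times, temps, t_step=0.0):
    """Estimate τ as the time after t_step at which the rise reaches 1 - 1/e (63.2%).

    Assumes temps[0] is the initial temperature and temps[-1] is close to
    steady state; linearly interpolates between the two bracketing samples.
    """
    t_initial, t_final = temps[0], temps[-1]
    target = t_initial + (1.0 - math.exp(-1.0)) * (t_final - t_initial)
    for i in range(1, len(times)):
        if temps[i - 1] < target <= temps[i]:
            frac = (target - temps[i - 1]) / (temps[i] - temps[i - 1])
            return times[i - 1] + frac * (times[i] - times[i - 1]) - t_step
    raise ValueError("response never crosses the 63.2% level")

# Synthetic check: a noiseless first-order rise with τ = 1.5 s sampled every 100 ms.
times = [0.1 * i for i in range(101)]
temps = [40.0 + 40.0 * (1.0 - math.exp(-t / 1.5)) for t in times]
tau_estimate = tau_from_crossing(times, temps)
print(f"estimated tau: {tau_estimate:.2f} s")
```

On real logs this estimate is sensitive to noise and to how well the last sample represents steady state, so it is better used as the initial guess for the fit than as a final value.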

Experimental Protocol:

# 1. Establish thermal baseline (idle for 5 minutes)
sleep 300

# 2. Record temperature at 100ms intervals
#    One CSV row per sample: timestamp, then every thermal zone (millidegrees C)
while true; do
  echo "$(date +%s.%N),$(cat /sys/class/thermal/thermal_zone*/temp | paste -sd, -)"
  sleep 0.1
done >> thermal_log.csv &
LOGGER_PID=$!

#    Capture a few seconds of idle baseline before the step
sleep 10

# 3. Apply step load with stress-ng
#    CPU: all cores, 100% utilization
stress-ng --cpu "$(nproc)" --cpu-load 100 --timeout 120s

# 4. Continue logging through cooldown (another 120s), then stop the logger
sleep 120
kill "$LOGGER_PID"

# 5. Fit exponential curve to extract τ
python3 fit_thermal_constant.py thermal_log.csv

Curve Fitting Algorithm:

import numpy as np
from scipy.optimize import curve_fit

def thermal_response(t, T_final, tau, T_initial, t_dead):
    """First-order thermal response with dead time."""
    t_effective = np.maximum(t - t_dead, 0)
    return T_initial + (T_final - T_initial) * (1 - np.exp(-t_effective / tau))

# time_data: seconds since the load was applied (NumPy array)
# temp_data: temperature in °C (sysfs reports millidegrees, so divide by 1000)
# Fit parameters: [T_final, tau, T_initial, t_dead]
popt, pcov = curve_fit(thermal_response, time_data, temp_data,
                       p0=[80, 1.0, 40, 0.1])
tau_measured = popt[1]
# 1-sigma uncertainty from the diagonal of the covariance matrix
tau_uncertainty = np.sqrt(pcov[1,1])

Critical Measurement Considerations:

Sampling interval must be at most τ/10 (≤100ms for τ=1s, ≤30ms for τ=0.3s)
Sensor thermal mass introduces its own lag (typically 50-200ms)
Ambient temperature drift corrupts long measurements
Multiple trials needed for statistical confidence
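
For the last point, one simple way to summarize repeated trials is the mean, sample standard deviation, coefficient of variation, and standard error of the fitted τ values. The sketch below uses made-up τ values purely for illustration; `summarize_trials` is a hypothetical helper name.

```python
import math
import statistics

def summarize_trials(tau_values):
    """Summary statistics for τ estimates from repeated step-response trials."""
    mean = statistics.mean(tau_values)
    stdev = statistics.stdev(tau_values)  # sample standard deviation (n - 1)
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,                          # coefficient of variation
        "sem": stdev / math.sqrt(len(tau_values)),   # standard error of the mean
    }

# Hypothetical τ values (seconds) from five trials on one machine.
summary = summarize_trials([1.82, 1.91, 1.79, 1.95, 1.88])
print(f"tau = {summary['mean']:.2f} ± {summary['sem']:.2f} s, CV = {summary['cv']:.1%}")
```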

2. What is the variance across CPU generations and TDP classes?

Hypothesis: Time constants correlate with thermal design power (TDP) and die size.

Test Matrix:

Category	Example CPUs	Expected τ Range	Physical Reasoning
Mobile (15-28W)	Intel i7-1365U, AMD 7840U	0.3-0.6s	Small die, aggressive throttling
Desktop (65-125W)	Intel i5-13600K, AMD 7700X	0.8-1.5s	Larger die, better cooling
Server (150W+)	Intel Xeon, AMD EPYC	1.5-3.0s	Massive IHS, vapor chamber
GPU (300W+)	NVIDIA A100, AMD MI250	2.0-5.0s	Multiple dies, complex thermal path

Variables to Control:

Ambient temperature (20°C ± 1°C)
Cooler type (stock vs aftermarket)
Thermal paste application (fresh, consistent method)
Case airflow (standardized or open bench)

Data Collection Template:

test_run:
  hardware:
    cpu_model: "Intel Xeon Gold 6326"
    tdp_watts: 185
    die_size_mm2: 660
    socket: "LGA4189"
    cooler: "Stock 2U heatsink"
  environment:
    ambient_temp_c: 21.3
    humidity_pct: 45
    airflow_cfm: 120
  results:
    tau_cpu_seconds: 1.87
    tau_uncertainty: 0.12
    t_dead_seconds: 0.23
    r_squared: 0.994
    n_trials: 5

3. How does workload type affect thermal response?

Hypothesis: Different workloads exercise different chip regions, creating non-uniform heating that affects effective τ.

Workload Categories:

┌─────────────────────────────────────────────────────────────────────┐
│                    WORKLOAD THERMAL SIGNATURES                       │
│                                                                      │
│  COMPUTE-BOUND (ALU heavy)          MEMORY-BOUND (Cache/DRAM)       │
│  ┌───────────────────┐              ┌───────────────────┐           │
│  │ ████████████████  │ Hot cores    │ ▒▒▒▒░░░░▒▒▒▒░░░░  │ Cool cores│
│  │ ████████████████  │              │ ▒▒▒▒░░░░▒▒▒▒░░░░  │           │
│  │ ████████████████  │              │ ▒▒▒▒░░░░▒▒▒▒░░░░  │           │
│  │ ████████████████  │              │ ▒▒▒▒░░░░▒▒▒▒░░░░  │           │
│  └───────────────────┘              └───────────────────┘           │
│  τ_effective ≈ 0.8s                 τ_effective ≈ 1.4s              │
│  (concentrated heat → fast rise)    (distributed → slower rise)     │
│                                                                      │
│  AVX-512 (vector units)             SIMD + MEMORY MIX               │
│  ┌───────────────────┐              ┌───────────────────┐           │
│  │ ████░░░░████░░░░  │ Vector units │ ████▒▒▒▒████▒▒▒▒  │ Mixed     │
│  │ ████░░░░████░░░░  │ only         │ ████▒▒▒▒████▒▒▒▒  │           │
│  │ ████░░░░████░░░░  │              │ ████▒▒▒▒████▒▒▒▒  │           │
│  │ ████░░░░████░░░░  │              │ ████▒▒▒▒████▒▒▒▒  │           │
│  └───────────────────┘              └───────────────────┘           │
│  τ_effective ≈ 0.5s                 τ_effective ≈ 1.0s              │
│  (extreme hotspots)                 (baseline case)                  │
└─────────────────────────────────────────────────────────────────────┘

stress-ng Workload Matrix:

# Pure compute (integer)
stress-ng --cpu $(nproc) --cpu-method ackermann --timeout 120s

# Pure compute (floating point)
stress-ng --cpu $(nproc) --cpu-method fft --timeout 120s

# Vector-friendly floating point (actual AVX use depends on how stress-ng was built)
stress-ng --cpu $(nproc) --cpu-method matrixprod --timeout 120s

# Memory-bound
stress-ng --vm $(nproc) --vm-bytes 80% --timeout 120s

# Cache-bound
stress-ng --cache $(nproc) --timeout 120s

# Mixed realistic
stress-ng --cpu $(nproc) --vm $(nproc) --io 4 --timeout 120s

Expected Finding: The "single τ" model may be insufficient. A two-time-constant model may better capture reality:

T(t) = T_ambient + A₁(1 - e^(-t/τ_fast)) + A₂(1 - e^(-t/τ_slow))

Where:
├── τ_fast ≈ 0.3-0.5s (die to heatsink)
└── τ_slow ≈ 2-5s (heatsink to ambient)
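
A single-τ fit applied to a two-time-constant response reports an "effective" τ somewhere between the two. The sketch below evaluates the model above and finds the 63.2% time by bisection; the amplitudes and time constants are illustrative assumptions, not measured values.

```python
import math

def two_tau_temp(t, t_ambient, a_fast, tau_fast, a_slow, tau_slow):
    """T(t) = T_ambient + A1(1 - e^(-t/τ_fast)) + A2(1 - e^(-t/τ_slow))."""
    return (t_ambient
            + a_fast * (1.0 - math.exp(-t / tau_fast))
            + a_slow * (1.0 - math.exp(-t / tau_slow)))

def effective_tau(a_fast, tau_fast, a_slow, tau_slow):
    """Time at which the combined rise reaches 1 - 1/e of its final value."""
    target = (1.0 - math.exp(-1.0)) * (a_fast + a_slow)
    lo, hi = 0.0, 10.0 * max(tau_fast, tau_slow)
    for _ in range(60):  # bisection; the rise is monotonic in t
        mid = 0.5 * (lo + hi)
        if two_tau_temp(mid, 0.0, a_fast, tau_fast, a_slow, tau_slow) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative split: 30 °C of fast rise (τ = 0.4 s), 20 °C of slow rise (τ = 3.0 s).
tau_eff = effective_tau(a_fast=30.0, tau_fast=0.4, a_slow=20.0, tau_slow=3.0)
print(f"effective tau: {tau_eff:.2f} s")
```

Because the effective value depends on the amplitude split A₁/A₂, a workload that shifts heat between the fast and slow paths would change the apparent single τ even if neither underlying constant moved.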

4. What is the dead time before temperature begins rising?

Dead Time (t_dead): The delay between load application and first measurable temperature increase.

Sources of Dead Time:

┌─────────────────────────────────────────────────────────────────────┐
│                        DEAD TIME SOURCES                             │
│                                                                      │
│  Load Applied ─────────────────────────────────────────► Temp Rises │
│       │                                                      │       │
│       ├──► Instruction pipeline fill ────────── ~10μs       │       │
│       ├──► Transistor switching ─────────────── ~100μs      │       │
│       ├──► Silicon thermal diffusion ────────── ~1ms        │       │
│       ├──► TIM (thermal interface) ──────────── ~10-50ms    │       │
│       ├──► Heatsink base heating ────────────── ~50-100ms   │       │
│       └──► Sensor thermal lag ───────────────── ~50-200ms   │       │
│                                                      │       │       │
│                                          Total: ~100-400ms   │       │
│                                                              │       │
│  ◄──────────────────── t_dead ──────────────────────────────►       │
└─────────────────────────────────────────────────────────────────────┘

Why Dead Time Matters for PID Control:

# Without dead time compensation:
error = target_temp - current_temp
output = Kp*error + Ki*integral(error) + Kd*derivative(error)
# Problem: By the time we see temperature rise, heat was applied 200ms ago

# With dead time compensation (Smith Predictor):
predicted_temp = model.predict(current_temp, output_history, dead_time)
error = target_temp - predicted_temp
# Now control actions account for the delay
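
The Smith predictor idea can be sketched in discrete time as follows. This is a minimal illustration under stated assumptions, not Maxwell's implementation: the class name, the first-order-plus-dead-time internal model, and all parameter values are hypothetical, and temperatures are expressed as rise above baseline.

```python
import math
from collections import deque

class SmithPredictor:
    """Dead-time compensation built on a first-order-plus-dead-time model."""

    def __init__(self, gain, tau, dead_time, dt):
        self.gain = gain                      # steady-state rise per unit of drive
        self.decay = math.exp(-dt / tau)      # exact discretization of the first-order lag
        self.delay_steps = max(1, round(dead_time / dt))
        self.state = 0.0                      # delay-free model output x[k]
        # history[0] holds x[k - delay_steps], history[-1] holds x[k]
        self.history = deque([0.0] * (self.delay_steps + 1),
                             maxlen=self.delay_steps + 1)

    def predict(self, measured):
        """Delay-free estimate: model output plus the measured model mismatch."""
        return self.state + (measured - self.history[0])

    def advance(self, drive):
        """Step the internal model forward by one sample with the applied drive."""
        self.state = self.decay * self.state + (1.0 - self.decay) * self.gain * drive
        self.history.append(self.state)

# Usage with a simulated plant that matches the model exactly (illustrative values).
predictor = SmithPredictor(gain=2.0, tau=1.0, dead_time=0.2, dt=0.1)
plant_history = [0.0]  # delay-free plant output, one entry per sample
for k in range(30):
    d = predictor.delay_steps
    measured = plant_history[k - d] if k >= d else 0.0  # the sensor sees a delayed value
    compensated = predictor.predict(measured)           # what the PID error would use
    drive = 1.0                                         # a real loop computes this from the error
    predictor.advance(drive)
    plant_history.append(predictor.decay * plant_history[k]
                         + (1.0 - predictor.decay) * 2.0 * drive)
```

With a perfect model the compensated value equals the plant's delay-free output, so the controller can be tuned as if there were no dead time. In practice the benefit degrades as the measured τ and dead time drift from the model, which is one reason to measure them per hardware class.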

Measurement Protocol:

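import subprocess
import time

THRESHOLD_C = 0.5  # smallest rise treated as a real response, in °C

def read_cpu_temp(zone="/sys/class/thermal/thermal_zone0/temp"):
    """Read one thermal zone in °C. The zone index is an assumption:
    check thermal_zone*/type to find the CPU sensor on the target machine."""
    with open(zone) as f:
        return int(f.read().strip()) / 1000.0  # sysfs reports millidegrees

baseline_temp = read_cpu_temp()  # take this after the idle soak

# High-resolution timing (includes process start-up latency of stress-ng)
t_load_start = time.perf_counter()
proc = subprocess.Popen(['stress-ng', '--cpu', '1', '--timeout', '10s'])

# Poll temperature at maximum rate
dead_time = None  # stays None if no rise is seen within 5 s
while time.perf_counter() - t_load_start < 5.0:
    if read_cpu_temp() > baseline_temp + THRESHOLD_C:
        dead_time = time.perf_counter() - t_load_start
        break
    time.sleep(0.001)  # 1ms polling

proc.wait()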

5. Are there second-order effects we need to model?

Potential Second-Order Effects:

A. Overshoot

Temperature temporarily exceeds steady-state value before settling:

Temperature
     │              ╭── Overshoot
     │         ╭───╮│
     │        ╱     ╰───────── Steady state
     │       ╱
     │      ╱
     │_____╱
     └────────────────────────────► Time

Causes:
├── Leakage feedback (leakage power rises roughly exponentially with temperature)
├── Fan speed lag (thermal → fan control → airflow has its own τ)
└── Boost algorithms (CPU boosts, hits thermal limit, reduces)

B. Oscillation

Temperature oscillates around steady state:

Temperature
     │      ╭╮    ╭╮    ╭╮
     │     ╱  ╲  ╱  ╲  ╱  ╲────── Damped oscillation
     │    ╱    ╲╱    ╲╱    ╲
     │   ╱
     │__╱
     └────────────────────────────► Time

Causes:
├── Fan control hunting (PWM duty cycle oscillation)
├── CPU frequency stepping (P-states create discrete power levels)
└── Thermal throttling hysteresis (throttle at 95°C, release at 90°C)
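
The hysteresis case is easy to reproduce in isolation. The sketch below is a hypothetical two-threshold state machine using the 95°C / 90°C figures above; real firmware behavior will differ and is not modeled here.

```python
def throttle_states(temps, throttle_at=95.0, release_at=90.0):
    """Return the throttled (True) / unthrottled (False) state per temperature sample."""
    throttled = False
    states = []
    for temp in temps:
        if not throttled and temp >= throttle_at:
            throttled = True       # crossed the upper threshold
        elif throttled and temp <= release_at:
            throttled = False      # fell back to the lower threshold
        states.append(throttled)
    return states

# Between 90 °C and 95 °C the state depends on history, not on the current reading.
samples = [89.0, 96.0, 93.0, 91.0, 90.0, 92.0, 96.0]
states = throttle_states(samples)
print(list(zip(samples, states)))
```

Because the same temperature can map to two different power levels, a first-order fit across a throttling event sees a plant whose input changes without the load changing, which shows up as oscillation in the residuals.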

C. Nonlinear Effects

┌─────────────────────────────────────────────────────────────────────┐
│                    NONLINEAR THERMAL BEHAVIOR                        │
│                                                                      │
│  τ varies with temperature:                                          │
│                                                                      │
│  τ │                                                                 │
│    │ ▓▓▓▓                                                           │
│    │     ▓▓▓▓                                                       │
│    │         ▓▓▓▓▓▓                                                 │
│    │               ▓▓▓▓▓▓▓▓▓▓▓                                      │
│    └──────────────────────────────────────► Temperature             │
│     20°C              60°C              100°C                        │
│                                                                      │
│  Mechanism: natural convection, h ∝ (T_surface - T_ambient)^0.25    │
│  At higher ΔT, convection is more efficient → τ decreases           │
└─────────────────────────────────────────────────────────────────────┘
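
To get a feel for the size of this effect, assume a lumped-capacitance model in which τ = m·c / (h·A), so that τ scales as 1/h. With h ∝ ΔT^0.25 this gives τ ∝ ΔT^(-0.25). The sketch below computes the resulting ratio; the ΔT values are illustrative, and the scaling applies to natural convection, not to fan-driven airflow.

```python
def tau_ratio(delta_t_hot, delta_t_cold, exponent=0.25):
    """Ratio τ(hot) / τ(cold) when τ = m·c / (h·A) and h ∝ ΔT**exponent."""
    return (delta_t_cold / delta_t_hot) ** exponent

# Illustrative: surface 80 °C above ambient versus 20 °C above ambient.
ratio = tau_ratio(delta_t_hot=80.0, delta_t_cold=20.0)
print(f"tau at ΔT=80 is {ratio:.2f}x tau at ΔT=20")
```

A fourfold increase in ΔT shortens τ by roughly 30% under these assumptions, which is large enough to matter against a 10% repeatability target.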

Detection Method:

Compute the residuals of the first-order fit and look for systematic patterns:

import numpy as np
from scipy.signal import find_peaks
from scipy.stats import pearsonr

# measured_temp, predicted_temp, time_data: equal-length NumPy arrays
residuals = measured_temp - predicted_temp

# Check for overshoot (1.0 = none)
overshoot_ratio = np.max(measured_temp) / steady_state_temp

# Check for oscillation (regularly spaced peaks in the residuals)
peaks, _ = find_peaks(residuals, distance=10)
if len(peaks) > 2:
    oscillation_period = np.mean(np.diff(time_data[peaks]))

# Check for nonlinearity (residuals correlate with temperature)
r, p = pearsonr(residuals, measured_temp)
nonlinearity_detected = p < 0.05 and abs(r) > 0.3

Methodology

Phase 1: Baseline Characterization (Week 1-2)

Set Up Test Environment
- Isolated test bench with controlled ambient temperature
- High-precision temperature logging (sampling interval of 100ms or shorter)
- Calibrated thermal sensors (cross-reference multiple sources)
Single-Hardware Validation
- Select one representative server-class CPU
- Perform 20+ step response tests
- Establish measurement repeatability
Methodology Refinement
- Identify and eliminate systematic errors
- Optimize curve fitting algorithm
- Establish uncertainty quantification

Phase 2: Hardware Survey (Week 3-4)

TDP Class Coverage
- Minimum 3 CPUs per TDP class (mobile/desktop/server)
- Document cooler configurations
- Measure τ variance within and across classes
GPU Characterization
- NVIDIA and AMD discrete GPUs
- Integrated graphics (different thermal path)
- Multi-GPU configurations (thermal coupling)
Chassis Effects
- Open bench vs enclosed case
- Different airflow configurations
- NVMe and RAM thermal contribution

Phase 3: Workload Effects (Week 5-6)

Workload Matrix Execution
- All stress-ng workload types per hardware
- Real application benchmarks (compile, render, ML training)
- Idle → load → idle cycles
Second-Order Effect Detection
- Fit residual analysis
- Oscillation detection
- Nonlinearity characterization

Phase 4: Model Development (Week 7-8)

Model Selection
- Compare first-order vs two-time-constant models
- Evaluate need for dead time compensation
- Assess nonlinearity corrections
Validation
- Cross-validation on held-out hardware
- Prediction accuracy under mixed workloads
- Sensitivity analysis

Deliverables

D1: Hardware Time Constant Database

# thermal_constants.yaml
hardware:
  - model: "Intel Xeon Gold 6326"
    type: "cpu"
    tdp_watts: 185
    constants:
      tau_primary: 1.87
      tau_secondary: 8.3  # if two-time-constant model
      dead_time: 0.23
      overshoot_ratio: 1.02
    uncertainty:
      tau_primary: 0.12
      methodology: "step_response_fit"
    conditions:
      cooler: "stock_2u"
      ambient_c: 21.0

D2: Measurement Toolkit

thermal-validation/
├── scripts/
│   ├── run_step_response.sh     # Automated test runner
│   ├── fit_thermal_curve.py     # Curve fitting with uncertainty
│   └── validate_model.py        # Model accuracy checker
├── notebooks/
│   ├── data_exploration.ipynb   # Interactive analysis
│   └── model_comparison.ipynb   # First-order vs two-constant
└── docs/
    └── measurement_protocol.md  # Reproducible methodology

D3: Maxwell Integration Recommendations

// Proposed configuration structure
pub struct ThermalCharacteristics {
    /// Primary time constant (die → heatsink)
    pub tau_primary: Duration,

    /// Secondary time constant (heatsink → ambient), if applicable
    pub tau_secondary: Option<Duration>,

    /// Dead time before temperature response
    pub dead_time: Duration,

    /// Expected overshoot ratio (1.0 = none)
    pub overshoot_ratio: f64,

    /// Workload sensitivity factor
    pub workload_sensitivity: WorkloadSensitivity,
}

pub enum WorkloadSensitivity {
    /// τ is stable across workload types (±10%)
    Low,
    /// τ varies moderately with workload (±25%)
    Medium,
    /// τ varies significantly (±50%), consider dynamic estimation
    High,
}

D4: Research Report

Executive Summary — Key findings and recommendations
Methodology — Reproducible measurement protocol
Results — Time constant database with uncertainty
Model Recommendations — First-order adequacy assessment
Maxwell Integration — Specific configuration guidance

Success Criteria

Criterion	Target	Validation Method
Measurement repeatability	CV < 10% across trials	Statistical analysis of repeated runs
Hardware coverage	≥3 CPUs per TDP class	Inventory checklist
Model fit quality	R² > 0.98	Curve fitting diagnostics
Dead time characterization	±20ms accuracy	High-frequency measurement
Second-order detection	Detect effects >5% of signal	Residual analysis
Documentation	Complete, reproducible	Independent reproduction test
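
The fit-quality criterion can be checked with the usual coefficient of determination. The helper below is a minimal sketch with a hypothetical name; it assumes `measured` and `predicted` are equal-length sequences of temperatures from one step-response run.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
measured = [40.0, 55.0, 64.0, 70.0, 73.0]
predicted = [40.5, 54.0, 64.5, 69.5, 73.5]
fit_ok = r_squared(measured, predicted) > 0.98
print(f"R² = {r_squared(measured, predicted):.4f}, meets target: {fit_ok}")
```

A high R² on a smooth exponential is easy to reach, so it is worth reading alongside the residual checks for overshoot, oscillation, and nonlinearity.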

References

Thermal Modeling

Incropera, F.P. & DeWitt, D.P. — Fundamentals of Heat and Mass Transfer — Standard reference for thermal analysis
Bar-Cohen, A. & Kraus, A.D. — Advances in Thermal Modeling of Electronic Components and Systems — Electronics-specific thermal behavior

Processor Thermal Behavior

Intel — Thermal Design Power (TDP) and Thermal Management — Manufacturer thermal specifications
AMD — Processor Power and Thermal Data Sheet — AMD thermal characteristics
NVIDIA — GPU Thermal Design Guide — GPU-specific thermal considerations

Control Systems

Åström, K.J. & Hägglund, T. — PID Controllers: Theory, Design, and Tuning — Dead time compensation and Smith predictors
Skogestad, S. — Simple analytic rules for model reduction and PID controller tuning — Practical PID tuning for thermal systems

Measurement Methodology

JEDEC — JESD51 Series — Standard thermal measurement methods for semiconductors
ASHRAE — Thermal Guidelines for Data Processing Environments — Data center thermal standards

Prior Art

Google — Machine Learning for Thermal Management — Data-driven thermal prediction at scale
Microsoft — Project Natick — Thermal dynamics in novel cooling environments
