Thermal Coupling Coefficient Measurement Research Directive

You are Dr. Priya Venkataraman, Senior Thermal Systems Engineer with 15 years of experience in data center thermal management and semiconductor thermal characterization. Your expertise spans computational heat transfer modeling, thermal interface material development, and the instrumentation of high-density computing systems for thermal profiling.

You are going to develop and validate an experimental methodology for measuring thermal coupling coefficients (gamma) between CPU cores, and deliver a calibration procedure suitable for automated execution at system boot time.


Context

Maxwell's decentralized pricing formula incorporates thermal coupling coefficients to account for heat flow between computational elements:

\text{Price\_Multiplier} = \frac{1}{M_i / T_{throttle}} \times \left(1 + \sum_j \gamma_{ij} \cdot \frac{1}{M_j}\right) \times \frac{1}{H_Z}
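
To make the structure of the formula concrete, the sketch below evaluates it literally. This section does not define M_i, T_throttle, or H_Z, so the function treats them as plain positive numbers; the function name, parameter names, and example values are illustrative assumptions, not part of the Maxwell specification.

```python
def price_multiplier(m_i, t_throttle, gammas, m_others, h_z):
    """Evaluate the Price_Multiplier formula term by term.

    m_i, t_throttle, h_z: scalars named after M_i, T_throttle, H_Z.
    gammas[k], m_others[k]: gamma_ij and M_j for each coupled element j.
    """
    if len(gammas) != len(m_others):
        raise ValueError("gammas and m_others must have the same length")
    own_term = 1.0 / (m_i / t_throttle)  # algebraically t_throttle / m_i
    coupling_term = 1.0 + sum(g / m for g, m in zip(gammas, m_others))
    return own_term * coupling_term / h_z


# Illustrative numbers only, not measured values.
example = price_multiplier(m_i=20.0, t_throttle=100.0,
                           gammas=[0.9, 0.6], m_others=[30.0, 40.0], h_z=1.0)
```

With no coupled neighbours the sum is empty and the multiplier reduces to T_throttle / (M_i * H_Z), so each gamma_ij can only raise the price relative to an isolated element.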

The thermal-gossip-consensus research (see /research/thermal-gossip-consensus.md) establishes theoretical gamma values:

| Relationship Level | Typical gamma | Physical Mechanism |
|---|---|---|
| Intra-Chassis | 0.90 - 1.00 | Shared heat pipes or liquid loops |
| Intra-Rack | 0.60 - 0.85 | Hot aisle/cold aisle recirculation |
| Intra-Row | 0.30 - 0.55 | Shared air volume and CRAC unit |
| Intra-Zone | 0.10 - 0.25 | Chilled water loop dependency |
| Independent | 0.00 | Thermally isolated sections |

However, these estimates require experimental validation, particularly at the intra-die and intra-package level where core-to-core thermal coupling directly affects scheduling decisions.

The K matrix formulation from thermal-gossip-consensus describes the power-to-temperature mapping:

T_{in} = T_{sup} + K \cdot A^T \cdot P_{IT}

Here K is a diagonal matrix of thermodynamic constants (K_i = rho * f_i * c_p, the product of air density, air flow rate, and specific heat in the data center formulation). This research must determine whether the linear model still holds at the core level or whether nonlinear effects dominate.


Research Questions

RQ1: Core-to-Core Gamma Measurement

How can we experimentally measure the thermal coupling coefficient gamma between two cores on the same die? What instrumentation precision is required, and what are the dominant sources of measurement error?

RQ2: Environmental Stability

How stable are gamma values under varying ambient conditions (temperature, humidity, airflow rates)? Do we need dynamic gamma recalibration, or is a single boot-time measurement sufficient within a given thermal operating window?

RQ3: K Matrix Linearity

Is the K matrix (power to temperature mapping) linear or nonlinear across the operating envelope? At what power densities do nonlinear effects (e.g., thermal runaway, phase transitions in TIM, convection regime changes) become significant?

RQ4: Physical vs Logical Core Coupling

How does gamma vary between physical cores versus hyperthreads sharing the same physical core? Do SMT pairs exhibit gamma approaching 1.0, and does this differ by microarchitecture (Intel vs AMD vs ARM)?

RQ5: Automated Calibration Procedure

Can we build a calibration procedure that runs automatically at boot, completes within acceptable time bounds, and produces reliable gamma matrices without operator intervention?


Methodology

Phase 1: Single-Pair Thermal Coupling Measurement

Equipment Required:

  • stress-ng or equivalent synthetic workload generator
  • Per-core temperature sensors, read via /sys/class/thermal/, /sys/class/hwmon/ (for example, the coretemp driver on Intel), or model-specific registers (MSRs)
  • High-resolution timer (nanosecond precision preferred)
  • Controlled ambient environment (temperature, airflow)
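
As a starting point for the sensor item above, here is a minimal sketch that reads every thermal zone under /sys/class/thermal/. The kernel reports each zone's `temp` file in millidegrees Celsius. Whether a zone corresponds to an individual core depends on the platform and driver, so the zone-to-core mapping is an assumption that must be checked per machine; the function name is hypothetical.

```python
from pathlib import Path


def read_zone_temps(root="/sys/class/thermal"):
    """Return {zone_name: temperature in degrees C} for every thermal zone.

    Zones whose temp file is missing or unreadable are skipped.
    """
    temps = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue
        temps[zone.name] = millideg / 1000.0  # millidegrees C -> degrees C
    return temps
```

Taking the root directory as a parameter lets the same function be exercised against a fake directory tree in tests, without real hardware.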

Experimental Procedure:

  1. Baseline Establishment

    • Allow system to reach thermal equilibrium at idle (minimum 5 minutes)
    • Record baseline temperatures for all cores: T_baseline[i]
    • Verify thermal stability (drift < 0.5C over 60 seconds)
  2. Heat Injection

    • Select source core A
    • Execute stress-ng at 100% load for 30 seconds:
      stress-ng --cpu 1 --cpu-method matrixprod --taskset <core_A> --timeout 30s
      
    • Maintain all other cores at idle
  3. Temperature Observation

    • Sample temperature of observer core B at 100ms intervals
    • Record temperature rise profile: T_B(t)
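    • Continue sampling until steady state is reached (dT/dt < 0.1C/s); if the 30-second load ends before then, lengthen the stress-ng timeout so the steady-state readings are taken under load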
  4. Gamma Calculation

    • Compute steady-state temperature rise on both cores:
      • Delta_T_A = T_A(steady) - T_A(baseline)
      • Delta_T_B = T_B(steady) - T_B(baseline)
    • Calculate coupling coefficient:
      gamma_AB = Delta_T_B / Delta_T_A
      
  5. Validation

    • Reverse the experiment: heat core B, measure core A
    • Verify symmetry: gamma_AB approximately equals gamma_BA (within 10%)
    • If asymmetric, investigate airflow or heat pipe geometry effects
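
The steady-state test, gamma calculation, and symmetry check from steps 3 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the 1-second trailing window for estimating dT/dt and the choice to express the 10% tolerance relative to the mean of the two gammas are not specified in the procedure, and all function names are hypothetical.

```python
def is_steady(samples, interval_s=0.1, window_s=1.0, max_rate=0.1):
    """True if the net rate of change over the trailing window is below max_rate (C/s)."""
    n = int(round(window_s / interval_s))
    if len(samples) <= n:
        return False  # not enough history to judge yet
    rate = abs(samples[-1] - samples[-1 - n]) / (n * interval_s)
    return rate < max_rate


def gamma(t_a_steady, t_a_base, t_b_steady, t_b_base):
    """gamma_AB = Delta_T_B / Delta_T_A for source core A and observer core B."""
    delta_a = t_a_steady - t_a_base
    delta_b = t_b_steady - t_b_base
    if delta_a <= 0:
        raise ValueError("source core shows no temperature rise")
    return delta_b / delta_a


def is_symmetric(gamma_ab, gamma_ba, tolerance=0.10):
    """Compare the two directions, relative to their mean (an assumed convention)."""
    mean = (gamma_ab + gamma_ba) / 2.0
    if mean == 0:
        return gamma_ab == gamma_ba
    return abs(gamma_ab - gamma_ba) / abs(mean) <= tolerance
```

For example, a source core that rises from 40C to 70C while the observer rises from 40C to 52C gives gamma_AB = 12 / 30 = 0.4.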

Phase 2: Full Coupling Matrix Construction

Procedure:

  • Repeat Phase 1 for all unique core pairs: N*(N-1)/2 measurements
  • Construct symmetric coupling matrix Gamma[N x N]
  • For hyperthreaded systems, separately measure:
    • Physical core to physical core coupling
    • Hyperthread to hyperthread on same physical core
    • Hyperthread to hyperthread on different physical cores
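
Assembling the pairwise results into the symmetric matrix could look like the sketch below. The data layout (a dict keyed by core-index pairs) and the function name are assumptions made for illustration. The diagonal is set to 1.0 because gamma_AA = Delta_T_A / Delta_T_A by the Phase 1 definition, and unmeasured pairs are left as None so that gaps stay visible.

```python
def build_gamma_matrix(n_cores, pair_gammas):
    """Build an n_cores x n_cores symmetric matrix from pairwise measurements.

    pair_gammas maps (i, j) with i < j to the measured gamma for that pair.
    """
    matrix = [[None] * n_cores for _ in range(n_cores)]
    for i in range(n_cores):
        matrix[i][i] = 1.0
    for (i, j), value in pair_gammas.items():
        if not (0 <= i < j < n_cores):
            raise ValueError(f"invalid pair {(i, j)}")
        matrix[i][j] = value  # mirror across the diagonal
        matrix[j][i] = value
    return matrix
```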

Optimization:

  • Parallelize measurements where thermal isolation permits
  • Use Latin square design to minimize total experiment time
  • Estimated heating time: approximately 30s * N*(N-1)/2 per iteration, not counting re-stabilization between measurements
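
A quick calculation shows how fast the pairwise count grows. The sketch below counts heating time only, assuming the 30-second load from Phase 1 and no cool-down between runs. It also shows the per-source alternative used in Phase 5, where one heating run per source core samples all observers at once; the function names are illustrative.

```python
def pairwise_runs(n_cores):
    """Number of unique unordered core pairs, N*(N-1)/2."""
    return n_cores * (n_cores - 1) // 2


def heating_time_s(n_runs, seconds_per_run=30):
    """Heating time only; cool-down between runs is not included."""
    return n_runs * seconds_per_run


for n in (8, 64):
    per_pair = heating_time_s(pairwise_runs(n))
    per_source = heating_time_s(n)  # one run per source core
    print(n, per_pair, per_source)
# prints "8 840 240" then "64 60480 1920"
```

At 64 cores the pairwise schedule needs about 16.8 hours of heating per iteration, against 32 minutes for the per-source schedule.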

Phase 3: Environmental Sensitivity Analysis

Variables to Test:

| Variable | Range | Increments |
|---|---|---|
| Ambient temperature | 18C - 35C | 5C steps |
| Fan speed | 30% - 100% PWM | 20% steps |
| System load (background) | 0% - 50% | 10% steps |
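
The test points implied by these ranges can be enumerated as below. Crossing all three variables in a full factorial grid is an assumption; the source does not say whether they are varied jointly or one at a time. Note that 5C steps from 18C end at 33C and 20% steps from 30% end at 90%, so the upper bounds (35C, 100%) need to be added explicitly if they are to be tested.

```python
from itertools import product


def stepped(start, stop, step):
    """Values from start in fixed steps, ending at the last one not above stop."""
    return list(range(start, stop + 1, step))


ambient_c = stepped(18, 35, 5)          # [18, 23, 28, 33]
fan_pwm_pct = stepped(30, 100, 20)      # [30, 50, 70, 90]
background_pct = stepped(0, 50, 10)     # [0, 10, 20, 30, 40, 50]

# Full factorial grid: 4 * 4 * 6 = 96 operating points.
grid = list(product(ambient_c, fan_pwm_pct, background_pct))
```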

Analysis:

  • Plot gamma_ij vs each environmental variable
  • Compute sensitivity coefficients: d(gamma)/d(T_ambient), d(gamma)/d(fan_speed)
  • Determine acceptable operating envelope for single-calibration validity
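
One simple way to estimate a sensitivity coefficient such as d(gamma)/d(T_ambient) is the least-squares slope of gamma against the variable. The sketch below assumes a roughly linear dependence over the tested range; the function name and the sample values in the usage note are illustrative only.

```python
def sensitivity(xs, gammas):
    """Least-squares slope d(gamma)/dx from paired observations."""
    n = len(xs)
    if n < 2 or n != len(gammas):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_g = sum(gammas) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (g - mean_g) for x, g in zip(xs, gammas))
    return sxy / sxx
```

For instance, gamma readings of 0.40, 0.41, 0.42, 0.43 at 18C, 23C, 28C, 33C give a slope of 0.002 per degree C.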

Phase 4: Linearity Analysis of K Matrix

Procedure:

  1. Apply stepped power loads: 25%, 50%, 75%, 100% TDP
  2. Measure temperature response at each level
  3. Plot Delta_T vs Power for each core
  4. Fit linear model: Delta_T = K * P
  5. Calculate residuals and identify nonlinearity threshold
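
Steps 4 and 5 can be sketched as a least-squares fit with no intercept, matching the model Delta_T = K * P, for which K = sum(P * Delta_T) / sum(P^2). The function name and the sample data are illustrative assumptions.

```python
def fit_through_origin(powers, delta_ts):
    """Fit Delta_T = K * P by least squares; return (K, residuals)."""
    if len(powers) != len(delta_ts) or not powers:
        raise ValueError("need equally sized, non-empty inputs")
    spp = sum(p * p for p in powers)
    if spp == 0:
        raise ValueError("at least one power level must be non-zero")
    k = sum(p * t for p, t in zip(powers, delta_ts)) / spp
    residuals = [t - k * p for p, t in zip(powers, delta_ts)]
    return k, residuals


# Illustrative data: the last point rises faster than a straight line predicts.
k, residuals = fit_through_origin([25, 50, 75, 100], [5.0, 10.0, 15.0, 26.0])
```

In this made-up example K comes out at 0.232 C per unit power, and the positive residual at the highest load (about 2.8C) is the kind of pattern that points to a nonlinearity threshold.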

Nonlinearity Indicators:

  • Residual standard error > 2C suggests nonlinear regime
  • Inflection points indicate phase transitions or convection changes
  • Hysteresis between heating and cooling curves may indicate TIM degradation, or that steady state was not reached at each load step
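
The first indicator can be checked mechanically from the residuals of the linear fit. The sketch below computes the residual standard error with one fitted parameter (K) and compares it to the 2C threshold; treating the fit as single-parameter is an assumption tied to the model Delta_T = K * P, and the function names are hypothetical.

```python
import math


def residual_standard_error(residuals, n_params=1):
    """sqrt(sum of squared residuals / degrees of freedom)."""
    dof = len(residuals) - n_params
    if dof <= 0:
        raise ValueError("not enough points for the number of fitted parameters")
    return math.sqrt(sum(r * r for r in residuals) / dof)


def looks_nonlinear(residuals, threshold_c=2.0):
    """True when the residual standard error exceeds the threshold in degrees C."""
    return residual_standard_error(residuals) > threshold_c
```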

Phase 5: Boot-Time Calibration Procedure

Design Constraints:

  • Total calibration time: < 120 seconds
  • No operator intervention required
  • Must not trigger thermal throttling during calibration
  • Results stored in persistent configuration for daemon consumption
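
It is worth checking the 120-second limit against a sequential per-core sweep. The sketch below assumes the figures from the proposed algorithm (up to 60 seconds of stabilization, then 10 seconds of load per core, with no cool-down between cores); the function names are illustrative.

```python
def sequential_calibration_time(n_cores, stabilization_s=60, per_core_s=10):
    """Worst-case wall-clock time for one core heated at a time."""
    return stabilization_s + n_cores * per_core_s


def max_cores_within_budget(budget_s=120, stabilization_s=60, per_core_s=10):
    """Largest core count a sequential sweep can cover inside the budget."""
    return max(0, (budget_s - stabilization_s) // per_core_s)
```

Under these assumptions a sequential sweep fits at most 6 cores into 120 seconds, while 64 cores would take 700 seconds. Larger parts would therefore need concurrent heating of thermally isolated cores, shorter load windows, or a subset of representative cores.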

Proposed Algorithm:

BOOT_CALIBRATION():
  1. Wait for thermal stabilization (60s or drift < 0.5C/s)
  2. Read baseline temperatures
  3. For each core i in [0, N-1]:
     a. Apply 50% load for 10s (safe thermal margin)
     b. Record temperature deltas on all other cores
     c. Calculate preliminary gamma_ij for all j != i
  4. Construct coupling matrix from measurements
  5. Apply symmetry correction: gamma_ij = (gamma_ij + gamma_ji) / 2
  6. Store matrix to /etc/maxwell/thermal-coupling.json
  7. Export to thermal gossip daemon via shared memory
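
Steps 1 to 5 of the algorithm can be sketched in Python as below. Hardware access is passed in as three caller-supplied hooks (`wait_stable`, `read_temps`, `apply_load`), all of which are hypothetical names; in particular, `apply_load` is assumed to return while the core is still at its loaded temperature. Persisting the matrix and the shared-memory export (steps 6 and 7) are left out.

```python
def boot_calibration(n_cores, read_temps, apply_load, wait_stable):
    """Run the per-core sweep and return a symmetrized gamma matrix.

    wait_stable():    blocks until the stabilization condition holds
    read_temps():     returns a list of per-core temperatures in degrees C
    apply_load(core): loads one core (e.g. 50% for 10 s) and returns while
                      the core is still hot
    """
    wait_stable()
    baseline = read_temps()
    raw = [[None] * n_cores for _ in range(n_cores)]
    for i in range(n_cores):
        apply_load(i)
        loaded = read_temps()
        rise_i = loaded[i] - baseline[i]
        if rise_i <= 0:
            raise RuntimeError(f"core {i} did not heat up under load")
        for j in range(n_cores):
            # Preliminary gamma_ij: observer rise divided by source rise.
            raw[i][j] = 1.0 if i == j else (loaded[j] - baseline[j]) / rise_i
    # Symmetry correction: gamma_ij = (gamma_ij + gamma_ji) / 2
    return [[(raw[i][j] + raw[j][i]) / 2.0 for j in range(n_cores)]
            for i in range(n_cores)]
```

The sketch follows the listed steps literally, so it reuses a single baseline and does not wait for cores to cool between iterations. Residual heat from earlier cores would bias later readings unless a cool-down step is added.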

Validation During Boot:

  • Compare measured gamma to expected ranges from hardware profile
  • Flag anomalies (e.g., gamma > 1.0 or gamma < 0 for adjacent cores)
  • Fall back to conservative defaults if calibration fails
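
The anomaly check and fallback could be sketched as follows. The sketch flags any off-diagonal entry outside [0, 1], which is broader than the adjacent-core example given above, and treats any anomaly as a calibration failure; both choices are assumptions, as are the function names. Per-platform expected ranges from a hardware profile could be passed in through `low` and `high`.

```python
def find_anomalies(matrix, low=0.0, high=1.0):
    """Return (i, j, value) for every off-diagonal entry outside [low, high]."""
    anomalies = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i != j and not (low <= value <= high):
                anomalies.append((i, j, value))
    return anomalies


def choose_matrix(measured, defaults):
    """Use the measured matrix unless it contains anomalies."""
    return defaults if find_anomalies(measured) else measured
```

Because the range test is written as `not (low <= value <= high)`, a NaN reading is also flagged rather than silently accepted.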

Deliverables

D1: Experimental Protocol Document

A detailed step-by-step protocol for measuring thermal coupling coefficients, including equipment list, environmental controls, and safety considerations for high-temperature operation.

D2: Measurement Software

A Linux tool (preferably Rust or C) that implements the calibration procedure:

  • Command-line interface for manual single-pair measurements
  • Daemon mode for boot-time full-matrix calibration
  • Output in JSON format compatible with Maxwell thermal gossip daemon

D3: Coupling Matrix Dataset

Measured gamma matrices for reference hardware platforms:

  • Intel Xeon (Sapphire Rapids, Emerald Rapids)
  • AMD EPYC (Genoa, Bergamo)
  • ARM Neoverse (N2, V2)
  • Consumer desktop (for development/testing)

D4: Sensitivity Analysis Report

Quantified sensitivity of gamma values to:

  • Ambient temperature variations
  • Cooling system performance degradation
  • Background workload interference
  • Hardware aging effects (if measurable)

D5: K Matrix Linearity Assessment

Analysis of power-to-temperature linearity including:

  • Valid linear range for each platform
  • Nonlinear correction factors where needed
  • Recommendations for pricing formula adjustments in nonlinear regimes

D6: Calibration Integration Guide

Documentation for integrating boot-time calibration with:

  • systemd service configuration
  • Maxwell thermal gossip daemon handoff
  • Monitoring and alerting for calibration failures

Success Criteria

| Criterion | Target | Measurement Method |
|---|---|---|
| Gamma measurement repeatability | CV < 5% | 10 repeated measurements |
| Symmetry validation | gamma_AB within 10% of gamma_BA | Matrix asymmetry metric |
| Boot calibration time | < 120 seconds | Wall-clock timing |
| Linear model fit (R-squared) | > 0.95 in valid range | Regression analysis |
| Temperature prediction accuracy | RMSE < 2C | Cross-validation |
| Environmental stability | Gamma drift < 10% over operating range | Sensitivity analysis |
| Platform coverage | 3+ server architectures | Hardware availability |
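
Two of these criteria reduce to short calculations. The sketch below computes the coefficient of variation (CV) for repeated gamma measurements and the RMSE of predicted against observed temperatures. Using the sample standard deviation (n - 1 denominator) for the CV is an assumption, and the function names are illustrative.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)


def rmse(predicted, observed):
    """Root-mean-square error between two equally sized sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need equally sized, non-empty inputs")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

A run of ten repeated gamma measurements passes the repeatability criterion when `coefficient_of_variation(run) < 0.05`.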

References

Maxwell Internal Research

  • /research/thermal-gossip-consensus.md - K matrix formulation and thermal coupling theory
  • Maxwell pricing formula specification

Academic Literature

  • Patterson, M. "The Effect of Data Center Cooling on Server Inlet Temperature"
  • Tang, Q. "Sensor-Based Fast Thermal Evaluation Model For Data Centers"
  • Bash, C. "Cool Job Allocation: Measuring the Power Savings of Placing Jobs"
  • Coskun, A. K. "Temperature-Aware Task Scheduling for Multiprocessor SoCs"

Hardware Documentation

  • Intel Xeon Thermal Mechanical Specification and Design Guidelines
  • AMD EPYC Processor Thermal Design Guide
  • ARM Neoverse Reference Platform Thermal Characterization

Measurement Tools

Control Theory

  • Hellerstein, J. "Feedback Control of Computing Systems"
  • Astrom, K. "Feedback Systems: An Introduction for Scientists and Engineers"

Document Status: Research Request
Created: 2026-02
Classification: Maxwell Internal Research