Thermal Coupling Coefficient Measurement Research Directive
You are Dr. Priya Venkataraman, Senior Thermal Systems Engineer with 15 years of experience in data center thermal management and semiconductor thermal characterization. Your expertise spans computational heat transfer modeling, thermal interface material development, and the instrumentation of high-density computing systems for thermal profiling.
You are going to develop and validate an experimental methodology for measuring thermal coupling coefficients (gamma) between CPU cores, and deliver a calibration procedure suitable for automated execution at system boot time.
Context
Maxwell's decentralized pricing formula incorporates thermal coupling coefficients to account for heat flow between computational elements:
\text{Price\_Multiplier} = \frac{1}{M_i / T_{throttle}} \times \left(1 + \sum_j \gamma_{ij} \cdot \frac{1}{M_j}\right) \times \frac{1}{H_Z}
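To make the formula concrete, here is a minimal Python sketch that evaluates it for one element. It is illustrative only: the function and parameter names are hypothetical, the sum is assumed to skip j = i, and M, T_throttle, and H_Z are assumed to be strictly positive values whose precise definitions live in the pricing formula specification rather than in this directive.

```python
def price_multiplier(i, m_values, gamma, t_throttle, h_z):
    """Evaluate the pricing formula for element i.

    Assumptions (not stated in this directive): the sum skips j == i,
    and every M_j, t_throttle, and h_z is strictly positive.
    """
    own_term = 1.0 / (m_values[i] / t_throttle)
    coupling_sum = sum(
        gamma[i][j] / m_values[j]
        for j in range(len(m_values))
        if j != i
    )
    return own_term * (1.0 + coupling_sum) / h_z


# Two elements with made-up numbers, purely for illustration.
m_values = [20.0, 40.0]
gamma = [[0.0, 0.5], [0.5, 0.0]]
multiplier = price_multiplier(0, m_values, gamma, t_throttle=100.0, h_z=1.0)
```

With these inputs the own term is 5.0 and the coupling sum is 0.0125, so a larger gamma toward a neighbor with a small M_j raises the multiplier.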
The thermal-gossip-consensus research (see /research/thermal-gossip-consensus.md) establishes theoretical gamma values:
| Relationship Level | Typical gamma | Physical Mechanism |
|---|---|---|
| Intra-Chassis | 0.90 - 1.00 | Shared heat pipes or liquid loops |
| Intra-Rack | 0.60 - 0.85 | Hot aisle/cold aisle recirculation |
| Intra-Row | 0.30 - 0.55 | Shared air volume and CRAC unit |
| Intra-Zone | 0.10 - 0.25 | Chilled water loop dependency |
| Independent | 0.00 | Thermally isolated sections |
However, these estimates require experimental validation, particularly at the intra-die and intra-package level where core-to-core thermal coupling directly affects scheduling decisions.
The K matrix formulation from thermal-gossip-consensus describes the power-to-temperature mapping:
T_{in} = T_{sup} + K \cdot A^T \cdot P_{IT}
Where K is a diagonal matrix of thermodynamic constants (K_i = rho * f_i * c_p). This research must determine whether this linear model holds at the core level, or whether nonlinear effects dominate.
Research Questions
RQ1: Core-to-Core Gamma Measurement
How can we experimentally measure the thermal coupling coefficient gamma between two cores on the same die? What instrumentation precision is required, and what are the dominant sources of measurement error?
RQ2: Environmental Stability
How stable are gamma values under varying ambient conditions (temperature, humidity, airflow rates)? Do we need dynamic gamma recalibration, or is a single boot-time measurement sufficient for a thermal operating window?
RQ3: K Matrix Linearity
Is the K matrix (power to temperature mapping) linear or nonlinear across the operating envelope? At what power densities do nonlinear effects (e.g., thermal runaway, phase transitions in TIM, convection regime changes) become significant?
RQ4: Physical vs Logical Core Coupling
How does gamma vary between physical cores versus hyperthreads sharing the same physical core? Do SMT pairs exhibit gamma approaching 1.0, and does this differ by microarchitecture (Intel vs AMD vs ARM)?
RQ5: Automated Calibration Procedure
Can we build a calibration procedure that runs automatically at boot, completes within acceptable time bounds, and produces reliable gamma matrices without operator intervention?
Methodology
Phase 1: Single-Pair Thermal Coupling Measurement
Equipment Required:
- stress-ng or equivalent synthetic workload generator
- Per-core temperature sensors via /sys/class/thermal/ or MSR registers
- High-resolution timer (nanosecond precision preferred)
- Controlled ambient environment (temperature, airflow)
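As a rough sketch of the sensor-reading side, the Python below collects temperatures from the Linux thermal sysfs interface, where each `thermal_zone*/temp` file reports millidegrees Celsius. Whether a given zone corresponds to an individual core is platform dependent and is not established by this directive; the function name and its `base` parameter are hypothetical.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: temperature_in_celsius} for every readable zone.

    The kernel reports thermal_zone*/temp in millidegrees Celsius.
    """
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone is unreadable or reports no valid value
        readings[zone.name] = millidegrees / 1000.0
    return readings
```

Taking `base` as a parameter keeps the function testable against a temporary directory instead of live hardware.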
Experimental Procedure:
1. Baseline Establishment
   - Allow system to reach thermal equilibrium at idle (minimum 5 minutes)
   - Record baseline temperatures for all cores: T_baseline[i]
   - Verify thermal stability (drift < 0.5C over 60 seconds)
2. Heat Injection
   - Select source core A
   - Execute stress-ng at 100% load for 30 seconds:
     stress-ng --cpu 1 --cpu-method matrixprod --taskset <core_A> --timeout 30s
   - Maintain all other cores at idle
3. Temperature Observation
   - Sample temperatures of source core A and observer core B at 100ms intervals
   - Record temperature rise profiles: T_A(t) and T_B(t)
   - Continue sampling until steady state is reached (dT/dt < 0.1C/s)
4. Gamma Calculation
   - Compute steady-state temperature rise on both cores:
     - Delta_T_A = T_A(steady) - T_A(baseline)
     - Delta_T_B = T_B(steady) - T_B(baseline)
   - Calculate coupling coefficient:
     gamma_AB = Delta_T_B / Delta_T_A
5. Validation
   - Reverse the experiment: heat core B, measure core A
   - Verify symmetry: gamma_AB approximately equals gamma_BA (within 10%)
   - If asymmetric, investigate airflow or heat pipe geometry effects
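The gamma calculation and symmetry check can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the steady-state test compares samples one window apart (1 s at 100 ms sampling) rather than adjacent samples, and "within 10%" is interpreted as relative to the mean of the two directions.

```python
def settled_temperature(samples, interval_s=0.1, window=10, max_rate=0.1):
    """Mean of the last `window` samples if the trace has settled, else None.

    Settled means |dT/dt| across the final window is below max_rate (C/s).
    """
    if len(samples) <= window:
        return None
    rate = abs(samples[-1] - samples[-1 - window]) / (window * interval_s)
    if rate >= max_rate:
        return None
    tail = samples[-window:]
    return sum(tail) / len(tail)


def coupling_coefficient(baseline_a, steady_a, baseline_b, steady_b):
    """gamma_AB = Delta_T_B / Delta_T_A for heated core A and observer core B."""
    delta_a = steady_a - baseline_a
    delta_b = steady_b - baseline_b
    if delta_a <= 0:
        raise ValueError("source core showed no temperature rise")
    return delta_b / delta_a


def is_symmetric(gamma_ab, gamma_ba, tolerance=0.10):
    """True when the two directions agree within `tolerance` of their mean."""
    mean = (gamma_ab + gamma_ba) / 2.0
    if mean == 0:
        return gamma_ab == gamma_ba
    return abs(gamma_ab - gamma_ba) / abs(mean) <= tolerance


# Made-up readings: core A rises 40C -> 60C, core B rises 38C -> 46C.
gamma_ab = coupling_coefficient(40.0, 60.0, 38.0, 46.0)
```

Averaging the final window instead of taking a single sample reduces the effect of sensor quantization, which matters when the on-die sensors report whole degrees.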
Phase 2: Full Coupling Matrix Construction
Procedure:
- Repeat Phase 1 for all unique core pairs: N*(N-1)/2 measurements
- Construct symmetric coupling matrix Gamma[N x N]
- For hyperthreaded systems, separately measure:
- Physical core to physical core coupling
- Hyperthread to hyperthread on same physical core
- Hyperthread to hyperthread on different physical cores
Optimization:
- Parallelize measurements where thermal isolation permits
- Use Latin square design to minimize total experiment time
- Estimate completion time: approximately 30s * N*(N-1)/2 per iteration
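To see how the pair count drives experiment time, the short Python sketch below enumerates the unique pairs and applies the 30 s per-pair figure. The function names are hypothetical, and the estimate covers heating time only; cool-down between runs is not included.

```python
from itertools import combinations


def pair_schedule(n_cores):
    """All unique unordered core pairs (a, b) with a < b."""
    return list(combinations(range(n_cores), 2))


def estimated_seconds(n_cores, seconds_per_pair=30):
    """Heating time only; excludes cool-down between runs."""
    return seconds_per_pair * len(pair_schedule(n_cores))
```

For 8 cores this gives 28 pairs and 840 s. For 64 cores it gives 2016 pairs and 60,480 s (about 16.8 hours), which is why the parallelization above matters on server parts.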
Phase 3: Environmental Sensitivity Analysis
Variables to Test:
| Variable | Range | Increments |
|---|---|---|
| Ambient temperature | 18C - 35C | 5C steps |
| Fan speed | 30% - 100% PWM | 20% steps |
| System load (background) | 0% - 50% | 10% steps |
Analysis:
- Plot gamma_ij vs each environmental variable
- Compute sensitivity coefficients: d(gamma)/d(T_ambient), d(gamma)/d(fan_speed)
- Determine acceptable operating envelope for single-calibration validity
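One way to compute the sensitivity coefficients is an ordinary least-squares slope of gamma against each variable. The Python below is a minimal sketch under that assumption; the function names are hypothetical and the sample values are made up for illustration, not measured.

```python
def sensitivity(x_values, gamma_values):
    """Least-squares slope d(gamma)/dx from paired observations."""
    n = len(x_values)
    if n < 2 or n != len(gamma_values):
        raise ValueError("need at least two paired observations")
    mean_x = sum(x_values) / n
    mean_g = sum(gamma_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum(
        (x - mean_x) * (g - mean_g)
        for x, g in zip(x_values, gamma_values)
    )
    return sxy / sxx


def relative_drift(gamma_values):
    """Spread of gamma over the tested range, relative to its mean."""
    mean_g = sum(gamma_values) / len(gamma_values)
    return (max(gamma_values) - min(gamma_values)) / mean_g


# Illustrative (not measured) gamma values at four ambient temperatures in C.
ambient_c = [18.0, 23.0, 28.0, 33.0]
gamma_obs = [0.40, 0.41, 0.42, 0.43]
slope_per_c = sensitivity(ambient_c, gamma_obs)
drift = relative_drift(gamma_obs)
```

Here the slope is 0.002 per degree and the relative drift is roughly 7%, so this made-up series would sit inside a 10% drift envelope.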
Phase 4: Linearity Analysis of K Matrix
Procedure:
- Apply stepped power loads: 25%, 50%, 75%, 100% TDP
- Measure temperature response at each level
- Plot Delta_T vs Power for each core
- Fit linear model: Delta_T = K * P
- Calculate residuals and identify nonlinearity threshold
Nonlinearity Indicators:
- Residual standard error > 2C suggests nonlinear regime
- Inflection points indicate phase transitions or convection changes
- Hysteresis between heating and cooling curves indicates TIM degradation
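The linear fit and the residual check can be sketched as a least-squares fit through the origin. The Python below assumes a scalar K per core and uses n - 1 degrees of freedom for the residual standard error (one fitted parameter); the function name and the sample data are hypothetical.

```python
import math


def fit_through_origin(power_w, delta_t_c):
    """Least-squares K for delta_T = K * P, plus the residual standard error."""
    n = len(power_w)
    if n < 2 or n != len(delta_t_c):
        raise ValueError("need at least two paired observations")
    sum_pp = sum(p * p for p in power_w)
    if sum_pp == 0:
        raise ValueError("power values must not all be zero")
    k = sum(p * t for p, t in zip(power_w, delta_t_c)) / sum_pp
    residuals = [t - k * p for p, t in zip(power_w, delta_t_c)]
    # One fitted parameter, so n - 1 degrees of freedom.
    rse = math.sqrt(sum(r * r for r in residuals) / (n - 1))
    return k, rse


# Illustrative (not measured) data at four power steps, in watts and C.
power = [25.0, 50.0, 75.0, 100.0]
k_lin, rse_lin = fit_through_origin(power, [5.0, 10.0, 15.0, 20.0])
k_bent, rse_bent = fit_through_origin(power, [5.0, 10.0, 15.0, 30.0])
```

The first series is exactly linear (K = 0.2, zero residual). The second bends upward at the top step and yields a residual standard error of about 3.9C, which exceeds the 2C indicator.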
Phase 5: Boot-Time Calibration Procedure
Design Constraints:
- Total calibration time: < 120 seconds
- No operator intervention required
- Must not trigger thermal throttling during calibration
- Results stored in persistent configuration for daemon consumption
Proposed Algorithm:
BOOT_CALIBRATION():
  1. Wait for thermal stabilization (60s or drift < 0.5C/s)
  2. Read baseline temperatures
  3. For each core i in [0, N-1]:
     a. Apply 50% load for 10s (safe thermal margin)
     b. Record temperature deltas on all other cores
     c. Calculate preliminary gamma_ij for all j != i
  4. Construct coupling matrix from measurements
  5. Apply symmetry correction: gamma_ij = (gamma_ij + gamma_ji) / 2
  6. Store matrix to /etc/maxwell/thermal-coupling.json
  7. Export to thermal gossip daemon via shared memory
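Steps 2 through 5 of the proposed algorithm could look roughly like the Python sketch below. Hardware access is injected through two hypothetical callables, `read_temps` and `heat_and_measure`, so the matrix logic can be exercised without real sensors. The `{"gamma": ...}` JSON layout is also an assumption, since this directive does not define the file schema, and the shared-memory export and cool-down between cores are omitted.

```python
import json


def build_coupling_matrix(n_cores, read_temps, heat_and_measure):
    """Build a symmetrized coupling matrix from one heating run per core.

    read_temps() returns per-core temperatures in C at idle.
    heat_and_measure(i) loads core i (e.g. 50% for 10 s) and returns the
    per-core temperatures observed at the end of the load.
    Both callables are hypothetical hooks supplied by the caller.
    """
    baseline = read_temps()
    raw = [[0.0] * n_cores for _ in range(n_cores)]
    for i in range(n_cores):
        heated = heat_and_measure(i)
        rise_i = heated[i] - baseline[i]
        if rise_i <= 0:
            raise RuntimeError(f"core {i} showed no temperature rise")
        for j in range(n_cores):
            raw[i][j] = (heated[j] - baseline[j]) / rise_i
    # Symmetry correction: gamma_ij = (gamma_ij + gamma_ji) / 2
    return [
        [(raw[i][j] + raw[j][i]) / 2.0 for j in range(n_cores)]
        for i in range(n_cores)
    ]


# Fake two-core "hardware" with made-up readings, for illustration only.
_BASELINE = [40.0, 40.0]
_HEATED = {0: [50.0, 44.0], 1: [43.0, 50.0]}
matrix = build_coupling_matrix(2, lambda: _BASELINE, lambda core: _HEATED[core])
payload = json.dumps({"gamma": matrix})
```

In this example the raw values 0.4 and 0.3 average to 0.35 in both off-diagonal positions. On timing: if the stabilization wait takes its full 60 s and cores are heated one at a time for 10 s each, only 5 cores fit under a 120 s limit, so larger parts would need parallel heating of thermally distant cores or a shorter per-core window.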
Validation During Boot:
- Compare measured gamma to expected ranges from hardware profile
- Flag anomalies (e.g., gamma > 1.0 or gamma < 0 for adjacent cores)
- Fall back to conservative defaults if calibration fails
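A minimal sketch of the anomaly check and fallback follows. What counts as a "conservative default" is not specified in this directive; the code assumes full coupling (gamma = 1.0) for every pair, and both function names are hypothetical.

```python
def find_anomalies(matrix, low=0.0, high=1.0):
    """Return (i, j, value) for each off-diagonal entry outside [low, high]."""
    anomalies = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i != j and not (low <= value <= high):
                anomalies.append((i, j, value))
    return anomalies


def matrix_or_default(matrix, default_gamma=1.0):
    """Return the measured matrix if clean, else a uniform fallback matrix.

    Assumption: the conservative default treats every pair as fully
    coupled (default_gamma = 1.0); the real policy is not defined here.
    """
    if not find_anomalies(matrix):
        return matrix
    n = len(matrix)
    return [[default_gamma] * n for _ in range(n)]
```

Returning the list of offending entries, rather than a bare boolean, gives the monitoring side something specific to log when calibration is rejected.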
Deliverables
D1: Experimental Protocol Document
A detailed step-by-step protocol for measuring thermal coupling coefficients, including equipment list, environmental controls, and safety considerations for high-temperature operation.
D2: Measurement Software
A Linux tool (preferably Rust or C) that implements the calibration procedure:
- Command-line interface for manual single-pair measurements
- Daemon mode for boot-time full-matrix calibration
- Output in JSON format compatible with Maxwell thermal gossip daemon
D3: Coupling Matrix Dataset
Measured gamma matrices for reference hardware platforms:
- Intel Xeon (Sapphire Rapids, Emerald Rapids)
- AMD EPYC (Genoa, Bergamo)
- ARM Neoverse (N2, V2)
- Consumer desktop (for development/testing)
D4: Sensitivity Analysis Report
Quantified sensitivity of gamma values to:
- Ambient temperature variations
- Cooling system performance degradation
- Background workload interference
- Hardware aging effects (if measurable)
D5: K Matrix Linearity Assessment
Analysis of power-to-temperature linearity including:
- Valid linear range for each platform
- Nonlinear correction factors where needed
- Recommendations for pricing formula adjustments in nonlinear regimes
D6: Calibration Integration Guide
Documentation for integrating boot-time calibration with:
- systemd service configuration
- Maxwell thermal gossip daemon handoff
- Monitoring and alerting for calibration failures
Success Criteria
| Criterion | Target | Measurement Method |
|---|---|---|
| Gamma measurement repeatability | CV < 5% | 10 repeated measurements |
| Symmetry validation | gamma_AB within 10% of gamma_BA | Matrix asymmetry metric |
| Boot calibration time | < 120 seconds | Wall-clock timing |
| Linear model fit (R-squared) | > 0.95 in valid range | Regression analysis |
| Temperature prediction accuracy | RMSE < 2C | Cross-validation |
| Environmental stability | Gamma drift < 10% over operating range | Sensitivity analysis |
| Platform coverage | 3+ server architectures | Hardware availability |
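Two of these criteria reduce to short calculations, sketched below in Python. The coefficient of variation is taken as sample standard deviation over mean, which is an assumption about the intended definition; the function names and the ten sample values are illustrative, not measured.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured temperatures."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Ten illustrative repeated gamma measurements for one core pair.
repeats = [0.40, 0.41, 0.39, 0.40, 0.42, 0.38, 0.40, 0.41, 0.39, 0.40]
cv = coefficient_of_variation(repeats)
```

For this made-up series the CV is about 2.9%, which would meet the 5% repeatability target.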
References
Maxwell Internal Research
- /research/thermal-gossip-consensus.md: K matrix formulation and thermal coupling theory
- Maxwell pricing formula specification
Academic Literature
- Patterson, M. "The Effect of Data Center Cooling on Server Inlet Temperature"
- Tang, Q. "Sensor-Based Fast Thermal Evaluation Model For Data Centers"
- Bash, C. "Cool Job Allocation: Measuring the Power Savings of Placing Jobs"
- Coskun, A. K. "Temperature-Aware Task Scheduling for Multiprocessor SoCs"
Hardware Documentation
- Intel Xeon Thermal Mechanical Specification and Design Guidelines
- AMD EPYC Processor Thermal Design Guide
- ARM Neoverse Reference Platform Thermal Characterization
Measurement Tools
- stress-ng documentation: https://github.com/ColinIanKing/stress-ng
- Linux thermal subsystem: /sys/class/thermal/ interface
- Intel RAPL power measurement: /sys/class/powercap/
Control Theory
- Hellerstein, J. "Feedback Control of Computing Systems"
- Astrom, K. "Feedback Systems: An Introduction for Scientists and Engineers"
Document Status: Research Request
Created: 2026-02
Classification: Maxwell Internal Research