research-notes/blog/content/notes/003-research-planning/files/thermal-gossip-consensus-research.md
jordan 9a9e58c935 Initial commit: research notes journal
Moved from maxwell/blog to standalone repository.

- Next.js research journal application
- Notes 001-005 with YAML/MD content structure
- Claude Code configuration for blog development

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:12:07 -07:00


Thermal Gossip Consensus Research Directive

You are Leslie Lamport, Turing Award laureate and inventor of Paxos, Lamport clocks, and the foundational theory of distributed systems. You've spent your career proving that consensus is possible in the presence of failures, and understanding exactly when it isn't. You know that distributed systems fail in ways that seem impossible until they happen.

You are going to design a gossip protocol for thermal state propagation across a distributed cluster — specifically, a mechanism where nodes autonomously share thermal stress signals, enabling neighbor-aware price adjustment that rebalances workloads before thermal throttling occurs.


Maxwell Cluster Architecture

Critical: Maxwell runs on every node. Nodes share physical infrastructure.

This isn't abstract distributed computing. Nodes share:

  • Cooling zones — A row of racks shares CRAC units
  • Power circuits — PDU capacity is finite per row
  • Ambient temperature — Hot exhaust from Node A becomes intake for Node B
┌─────────────────────────────────────────────────────────────────────────┐
│                           DATA CENTER ROW                                │
│                                                                          │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│   │ Node A  │   │ Node B  │   │ Node C  │   │ Node D  │   │ Node E  │  │
│   │ Maxwell │◄──│ Maxwell │◄──│ Maxwell │◄──│ Maxwell │◄──│ Maxwell │  │
│   │         │──▶│         │──▶│         │──▶│         │──▶│         │  │
│   │ T=78°C  │   │ T=72°C  │   │ T=85°C  │   │ T=71°C  │   │ T=74°C  │  │
│   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘  │
│        │             │             │             │             │        │
│        └─────────────┴──────┬──────┴─────────────┴─────────────┘        │
│                             │                                            │
│                    ┌────────▼────────┐                                  │
│                    │   Shared CRAC   │  ← Cooling capacity is FINITE    │
│                    │   (25kW limit)  │                                  │
│                    └─────────────────┘                                  │
│                                                                          │
│   GOSSIP LAYER: Each Maxwell shares thermal state with neighbors        │
│   GOAL: Autonomous rebalancing before any node throttles                │
└─────────────────────────────────────────────────────────────────────────┘

The Physical Coupling Problem

Node C is overheating (85°C):
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Fan     │   │ Power   │   │ Exhaust │
│ Ramp-up │   │ Draw ↑  │   │ Heat →  │
└────┬────┘   └────┬────┘   └────┬────┘
     │             │             │
     ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Noise   │   │ Circuit │   │ Node D  │
│ affects │   │ capacity│   │ intake  │
│ humans  │   │ shared  │   │ warmer  │
└─────────┘   └─────────┘   └─────────┘
                    │
                    ▼
         CASCADING THERMAL FAILURE

Without coordination: Node C throttles, its work migrates to Node D, Node D overheats, cascade continues.

With gossip: Node C signals distress, neighbors raise prices, work migrates to cool nodes (E, F, G...) BEFORE throttling.


The Paradox

Problem Statement:

Traditional approaches fail:

| Approach | Problem |
|---|---|
| Centralized controller | Single point of failure, latency |
| Periodic broadcast | O(N²) messages, stale data |
| Reactive throttling | Too late: damage already done |
| Static topology | Doesn't adapt to load patterns |

The Challenge:

Design a protocol where:

  1. Node A detects thermal stress
  2. Node A gossips "I am dying" to relevant neighbors
  3. Neighbors autonomously adjust their prices
  4. Workloads migrate without central coordination
  5. System stabilizes without oscillation
  6. All of this happens in <100ms end-to-end

The Distributed Systems Constraints:

- Nodes may fail silently (thermal death)
- Network may partition (switch failure)
- Clocks are not synchronized (physical time varies)
- Messages may be delayed, duplicated, or lost
- Byzantine nodes may lie about temperature (compromised sensors)

Research Objectives

Design a thermal gossip protocol achieving:

  1. Rapid Propagation: Thermal crisis reaches affected neighbors in <10ms
  2. Minimal Overhead: Gossip bandwidth <1% of network capacity
  3. Convergence: System reaches stable price equilibrium
  4. Stability: No oscillations (hunting behavior)
  5. Partition Tolerance: Graceful degradation under network splits
  6. Byzantine Resistance: Robust to lying/faulty temperature sensors

Step 1: Model the Physical Topology

Before designing the protocol, understand what "neighbor" means physically.

1.1 Thermal Coupling Graph

Not all nodes affect each other equally.

Define: thermal_coupling(A, B) ∈ [0, 1]
  - 1.0 = same chassis (multi-GPU node)
  - 0.8 = same rack (shared fans, power)
  - 0.5 = same row (shared CRAC)
  - 0.2 = same zone (shared chilled water)
  - 0.0 = different zones (independent cooling)

Research: How to discover/measure these couplings?
  - Static config from data center DCIM?
  - Dynamic measurement (correlate temp readings)?
  - ML model from historical data?
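
One way to explore the "dynamic measurement" option is to correlate temperature histories between pairs of nodes. The sketch below is only an illustration: `estimate_coupling` is a hypothetical name, and clamping a Pearson correlation into [0, 1] is an assumed mapping, not something the coupling model above prescribes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length temperature series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no coupling information
    return cov / math.sqrt(var_x * var_y)

def estimate_coupling(temps_a, temps_b):
    """Hypothetical estimator: clamp the correlation into [0, 1]."""
    return max(0.0, min(1.0, pearson(temps_a, temps_b)))

node_c = [70.0, 72.0, 75.0, 79.0, 84.0]   # made-up samples, degrees C
node_d = [65.0, 66.0, 67.5, 69.5, 72.0]   # rises with node_c's exhaust
coupling_cd = estimate_coupling(node_c, node_d)
```

Correlation alone can be misleading: two nodes running the same job heat up together even with independent cooling, so a real estimator would probably need to control for load.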

1.2 Cooling Capacity Model

Each cooling zone has capacity:

Zone Z:
  - CRAC capacity: 100kW
  - Current load: 85kW
  - Headroom: 15kW
  - Nodes in zone: {A, B, C, D, E}

If Node C ramps up by 25kW (85kW + 25kW = 110kW):
  - Zone oversubscribed by 10kW
  - CRAC can't keep up
  - Ambient temp rises for ALL nodes in zone

Gossip must propagate: "Zone Z is out of cooling headroom"
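
The headroom arithmetic can be sketched as follows, reading the example as Node C adding 25kW on top of the 85kW zone load. The per-node split of the 85kW is made up for illustration; only the totals come from the example.

```python
def zone_headroom_kw(capacity_kw, node_loads_kw):
    """Remaining cooling capacity; a negative value means oversubscribed."""
    return capacity_kw - sum(node_loads_kw.values())

# Hypothetical per-node loads that add up to the 85kW in the example.
loads = {"A": 15.0, "B": 15.0, "C": 20.0, "D": 15.0, "E": 20.0}
before = zone_headroom_kw(100.0, loads)

loads["C"] += 25.0  # Node C ramps up by 25kW
after = zone_headroom_kw(100.0, loads)
zone_out_of_headroom = after <= 0  # the condition gossip needs to propagate
```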

1.3 Power Delivery Topology

PDU hierarchy:

Substation → Transformer → PDU → Rack PDU → Node

Each level has capacity limits:
  - Rack PDU: 30kW per rack
  - PDU: 200kW per aisle
  - Transformer: 1MW per zone

Thermal stress often correlates with power stress.
Gossip should include power draw, not just temperature.

Step 2: Design the Gossip Protocol

Core mechanism for thermal state dissemination.

2.1 Message Format

ThermalGossipMessage {
  // Identity
  node_id:        UUID
  timestamp:      Lamport clock (not wall clock!)
  sequence:       Monotonic counter (detect duplicates)

  // Thermal state
  temperature_c:  uint8       // 0-255°C, 1°C resolution
  thermal_margin: int8        // Degrees below throttle (-128 to +127)
  trend:          int8        // °C/second rate of change

  // Resource state
  power_draw_w:   uint16      // Current power consumption
  fan_speed_pct:  uint8       // 0-100%

  // Zone context
  zone_id:        uint16      // Physical cooling zone
  zone_headroom:  uint8       // % remaining cooling capacity

  // Price signal
  price_multiplier: uint16    // Fixed-point, 0.01x to 655.35x

  // Protocol
  ttl:            uint8       // Hops remaining
  signature:      [64]byte    // Ed25519 signature (Byzantine resistance)
}

Size: ~108 bytes per message, assuming 64-bit timestamp and sequence fields (the 64-byte signature dominates)
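
To sanity-check the wire size, here is a minimal packing sketch using Python's `struct` module. It assumes 64-bit fields for the Lamport timestamp and the sequence counter (the format above does not fix their widths) and a 64-byte signature slot, which is the length of an Ed25519 signature. `WIRE_FORMAT` and `encode` are illustrative names, and the signature is left as zero bytes because signing needs a crypto library outside the standard library.

```python
import struct
import uuid

# Little-endian, no padding; field order follows the message layout above.
# Assumed widths: Q = 64-bit timestamp and sequence, 64s = signature slot.
WIRE_FORMAT = "<16sQQBbbHBHBHB64s"

def encode(node_id, timestamp, sequence, temperature_c, thermal_margin, trend,
           power_draw_w, fan_speed_pct, zone_id, zone_headroom,
           price_multiplier, ttl, signature=bytes(64)):
    """Pack one ThermalGossipMessage; price is stored as fixed-point x100."""
    price_fixed = round(price_multiplier * 100)
    return struct.pack(WIRE_FORMAT, node_id.bytes, timestamp, sequence,
                       temperature_c, thermal_margin, trend, power_draw_w,
                       fan_speed_pct, zone_id, zone_headroom, price_fixed,
                       ttl, signature)

wire = encode(uuid.uuid4(), timestamp=42, sequence=7, temperature_c=85,
              thermal_margin=5, trend=2, power_draw_w=25000, fan_speed_pct=90,
              zone_id=3, zone_headroom=15, price_multiplier=1.25, ttl=3)
decoded = struct.unpack(WIRE_FORMAT, wire)
```

Under these assumptions the packed message is 108 bytes, most of it the signature and the UUID.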

2.2 Epidemic Gossip (Push Model)

Classic epidemic/rumor spreading:

Every T milliseconds:
  1. Select K random peers from thermal_neighbors
  2. Send my ThermalGossipMessage to each
  3. Receive messages from peers
  4. Update local view of cluster thermal state
  5. Adjust my prices based on neighbor states

Parameters:
  - T = gossip interval (10-100ms?)
  - K = fanout (2-4 peers per round?)

Properties:
  - Convergence time: O(log N) rounds
  - Message complexity: O(N·K) per round, O(N log N) total to converge
  - Distributed: No coordinator required
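
A small simulation can make the O(log N) claim concrete. This is a sketch under simplifying assumptions: uniform random peer choice over the whole cluster (not thermal neighbors), synchronous rounds, and no message loss. `rounds_to_full_spread` is a hypothetical helper.

```python
import random

def rounds_to_full_spread(n_nodes, fanout, seed=0, max_rounds=1000):
    """Count synchronous push rounds until every node has heard the rumor."""
    rng = random.Random(seed)
    informed = {0}  # node 0 detects the thermal event
    rounds = 0
    while len(informed) < n_nodes:
        if rounds >= max_rounds:
            raise RuntimeError("gossip did not converge")
        newly_informed = set()
        for node in informed:
            # Pick `fanout` distinct peers, excluding the sender itself.
            for p in rng.sample(range(n_nodes - 1), fanout):
                newly_informed.add(p if p < node else p + 1)
        informed |= newly_informed
        rounds += 1
    return rounds

rounds = rounds_to_full_spread(n_nodes=256, fanout=2)
```

With fanout K the informed set can grow by at most a factor of K+1 per round, so 256 nodes at K=2 need at least 6 rounds; the simulated count sits above that bound because later rounds waste messages on already-informed peers.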

2.3 Thermal-Aware Peer Selection

Don't gossip randomly — gossip to thermally-coupled peers.

Peer selection weighted by:
  weight(peer) = thermal_coupling(self, peer)
               × urgency(self.thermal_margin)
               × recency(last_gossip_to_peer)

Urgency function:
  urgency(margin) = {
    1.0   if margin < 5°C   (CRITICAL)
    0.5   if margin < 10°C  (WARNING)
    0.1   if margin < 20°C  (NORMAL)
    0.01  otherwise         (COOL)
  }

Result: Hot nodes gossip aggressively to thermal neighbors
        Cool nodes gossip lazily
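
The weighting above could be sketched like this. The `urgency` thresholds come from the table; the `recency` function and its one-second horizon are assumptions, since the text does not define them, and `select_peers` is a hypothetical name.

```python
import random

def urgency(margin_c):
    """Urgency tiers from the table above."""
    if margin_c < 5:
        return 1.0    # CRITICAL
    if margin_c < 10:
        return 0.5    # WARNING
    if margin_c < 20:
        return 0.1    # NORMAL
    return 0.01       # COOL

def recency(seconds_since_last_gossip, horizon_s=1.0):
    """Assumed shape: peers not contacted recently get more weight."""
    return min(1.0, seconds_since_last_gossip / horizon_s)

def select_peers(coupling, last_gossip_age_s, my_margin_c, k, rng=random):
    """Weighted sampling without replacement over thermally coupled peers."""
    candidates = {}
    for peer, c in coupling.items():
        w = c * urgency(my_margin_c) * recency(last_gossip_age_s[peer])
        if w > 0:
            candidates[peer] = w
    chosen = []
    while candidates and len(chosen) < k:
        peers = list(candidates)
        pick = rng.choices(peers, weights=[candidates[p] for p in peers])[0]
        chosen.append(pick)
        del candidates[pick]
    return chosen

picked = select_peers({"B": 0.8, "D": 0.8, "Z": 0.0},
                      {"B": 1.0, "D": 1.0, "Z": 1.0}, my_margin_c=3, k=2)
```

Note that `urgency` depends only on the sender, so it scales every peer's weight equally; in practice it would more likely drive the gossip interval or fanout than the relative choice of peer.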

2.4 Pull Model (On-Demand)

Alternative: Nodes request state only when needed.

When local temp crosses threshold:
  1. Query thermal neighbors for their state
  2. Compute optimal price adjustment
  3. Apply immediately

Pros: Less bandwidth when stable
Cons: Latency when crisis hits

Hybrid approach:
  - Push for critical events (margin < 5°C)
  - Pull for routine updates

2.5 Zone-Level Aggregation

Reduce message complexity with hierarchy:

┌─────────────────────────────────────────────┐
│              Zone Aggregator                │
│  (Elected leader or virtual node)           │
│                                             │
│  Aggregates: max_temp, min_margin,          │
│              total_power, zone_headroom     │
└────────────────┬────────────────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
┌───────┐   ┌───────┐   ┌───────┐
│Node A │   │Node B │   │Node C │
│gossip │   │gossip │   │gossip │
│to zone│   │to zone│   │to zone│
└───────┘   └───────┘   └───────┘

Inter-zone gossip: Zone aggregators gossip to each other
Intra-zone gossip: Nodes gossip within zone

Complexity: O(√N) messages per node (√N zones of √N nodes each) vs O(N) per node for full mesh
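
A zone aggregator's summary might be computed as below. This sketch assumes heat load roughly equals electrical power draw and uses a hypothetical 90°C throttle point to derive margins from the temperatures in the row diagram; `NodeState` and `aggregate_zone` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    temperature_c: int
    thermal_margin: int   # degrees below throttle
    power_draw_w: int

def aggregate_zone(states, cooling_capacity_w):
    """Collapse per-node states into the four zone-level fields."""
    total_power = sum(s.power_draw_w for s in states)
    headroom_w = cooling_capacity_w - total_power
    return {
        "max_temp": max(s.temperature_c for s in states),
        "min_margin": min(s.thermal_margin for s in states),
        "total_power": total_power,
        # Percent of cooling capacity left, floored at zero.
        "zone_headroom": max(0, round(100 * headroom_w / cooling_capacity_w)),
    }

THROTTLE_C = 90  # assumed throttle temperature
temps = {"A": 78, "B": 72, "C": 85, "D": 71, "E": 74}
states = [NodeState(t, THROTTLE_C - t, 17_000) for t in temps.values()]
summary = aggregate_zone(states, cooling_capacity_w=100_000)
```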

Step 3: Design the Price Adjustment Mechanism

Receiving gossip must trigger autonomous price adjustment.

3.1 Neighbor-Influenced Pricing

My price depends on:
  1. My own thermal state
  2. My neighbors' thermal states
  3. Zone-level capacity

price_multiplier = f(
  my_thermal_margin,
  avg_neighbor_margin,
  zone_headroom,
  historical_stability
)

Simple model:
  base_price = 1.0 / (my_thermal_margin / throttle_temp)
  neighbor_penalty = Σ (coupling[i] × (1 / neighbor_margin[i]))
  zone_penalty = 1.0 / zone_headroom

  price_multiplier = base_price × (1 + neighbor_penalty) × zone_penalty
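
The simple model translates almost directly into code. The sketch below adds two things the formula leaves open, both assumptions: margins and headroom are clamped to a small positive floor so the divisions stay finite, and `zone_headroom` is taken as a fraction in (0, 1] instead of a percentage.

```python
def price_multiplier(my_margin_c, throttle_temp_c, neighbors, zone_headroom,
                     min_margin_c=1.0, min_headroom=0.01):
    """neighbors is a list of (coupling, neighbor_margin_c) pairs."""
    margin = max(my_margin_c, min_margin_c)          # avoid divide-by-zero
    base_price = 1.0 / (margin / throttle_temp_c)
    neighbor_penalty = sum(coupling / max(m, min_margin_c)
                           for coupling, m in neighbors)
    zone_penalty = 1.0 / max(zone_headroom, min_headroom)
    return base_price * (1 + neighbor_penalty) * zone_penalty

# 10C of margin, 90C throttle point, one row-level neighbor, half the
# zone's cooling capacity left: 9.0 * 1.05 * 2.0
example = price_multiplier(10, 90, [(0.5, 10)], zone_headroom=0.5)
```

Even a fairly healthy node gets a multiplier near 19 here, which suggests the model would need normalizing against a baseline before it feeds a real auction.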

3.2 Stability Constraints

Problem: Naive adjustment causes oscillations

Node A hot → raises price → work migrates to B
B becomes hot → raises price → work migrates back to A
Repeat forever.

Solutions:

1. Hysteresis:
   - Only raise price when margin < threshold_low
   - Only lower price when margin > threshold_high
   - threshold_low < threshold_high (dead band: no change in between)

2. Rate limiting:
   - Price can only change by X% per second
   - Prevents rapid oscillation

3. Damping:
   - new_price = α × computed_price + (1-α) × old_price
   - α = 0.1 for slow adjustment, 0.5 for fast

4. Predictive:
   - Adjust based on temperature TREND, not current value
   - If trending up, raise price proactively
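
Damping and rate limiting compose naturally into one update step. In this sketch the rate limit is applied per control tick (assuming the function is called once per fixed interval) instead of per second, and the 5% cap is a made-up value.

```python
def smooth_price(old_price, computed_price, alpha=0.1, max_change_frac=0.05):
    """Blend toward the computed price, then cap the per-tick change."""
    # Damping: new = alpha * computed + (1 - alpha) * old
    damped = alpha * computed_price + (1 - alpha) * old_price
    # Rate limiting: stay within +/- max_change_frac of the old price.
    upper = old_price * (1 + max_change_frac)
    lower = old_price * (1 - max_change_frac)
    return min(upper, max(lower, damped))

big_jump = smooth_price(1.0, 3.0, alpha=0.5)    # damped to 2.0, capped at 1.05
small_step = smooth_price(1.0, 1.2, alpha=0.1)  # damped to 1.02, under the cap
```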

3.3 Game-Theoretic Stability

Question: Is the pricing equilibrium a Nash equilibrium?

Model as N-player game:
  - Each node chooses price
  - Payoff = revenue - thermal_damage
  - Neighbors' prices affect my workload

Research:
  - Does a stable equilibrium exist?
  - Is it unique?
  - How fast does best-response dynamics converge?
  - Can nodes profitably deviate?

Step 4: Handle Failure Modes

Distributed systems fail. Design for it.

4.1 Node Failure (Thermal Death)

Scenario: Node C overheats and shuts down suddenly.

Problem:
  - C stops gossiping
  - Neighbors don't know if C is dead or network partitioned
  - C's workload may auto-migrate to neighbors (overwhelming them)

Solution:
  1. Heartbeat timeout → assume dead
  2. Mark C's zone as "degraded" in gossip
  3. All zone nodes preemptively raise prices
  4. Wait for confirmation before lowering

Timeout: 3 × gossip_interval (30ms if interval = 10ms)
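
The timeout rule can be sketched as a tiny failure detector. It only compares elapsed time on the local node's own clock (for example `time.monotonic()` scaled to milliseconds), so it does not rely on synchronized clocks. `HeartbeatDetector` is a hypothetical name.

```python
class HeartbeatDetector:
    def __init__(self, gossip_interval_ms, missed_intervals=3):
        # Timeout = 3 x gossip interval, as in the rule above.
        self.timeout_ms = missed_intervals * gossip_interval_ms
        self.last_seen_ms = {}

    def record(self, node_id, now_ms):
        """Call whenever any gossip message from node_id arrives."""
        self.last_seen_ms[node_id] = now_ms

    def suspected_dead(self, now_ms):
        """Nodes silent for longer than the timeout, in sorted order."""
        return sorted(n for n, seen in self.last_seen_ms.items()
                      if now_ms - seen > self.timeout_ms)

detector = HeartbeatDetector(gossip_interval_ms=10)
detector.record("A", now_ms=0)
detector.record("C", now_ms=0)
detector.record("A", now_ms=25)   # A keeps gossiping, C goes silent
suspects = detector.suspected_dead(now_ms=31)
```

A silent node is only suspected here; the detector cannot tell thermal death from a partition, which is why the response above is to raise prices conservatively.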

4.2 Network Partition

Scenario: Switch failure splits cluster into two halves.

Problem:
  - Each half sees the other as "dead"
  - Each half may accept full workload
  - When partition heals, both halves are overloaded

Solution (conservative):
  - On partition detection, raise prices proportionally
  - "I see only 50% of nodes → assume 50% capacity"
  - When partition heals, gradually lower prices

Detection:
  - Gossip includes "nodes_seen_recently" count
  - If count drops, assume partition

4.3 Byzantine Sensors

Scenario: Compromised node lies about temperature.

Attack 1 - Fake cold:
  - Node C claims 30°C (actually 90°C)
  - Other nodes route work to C
  - C catches fire (or steals work unfairly)

Attack 2 - Fake hot:
  - Node C claims 95°C (actually 50°C)
  - Other nodes avoid C
  - C gets free capacity while others overload

Defense:
  1. Signed attestation (TPM-backed temperature)
  2. Cross-validation (if C claims cold but zone claims hot → suspect)
  3. Reputation system (historically accurate nodes trusted more)
  4. Physical correlation (power draw should match temperature)
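
Cross-validation could start as simply as comparing each report against the zone median. This is a rough sketch: the 15°C tolerance is a made-up value, and `suspect_reports` is a hypothetical helper, not a full defense.

```python
import statistics

def suspect_reports(reported_temps_c, tolerance_c=15.0):
    """Flag nodes whose reported temperature is far from the zone median."""
    median = statistics.median(reported_temps_c.values())
    return sorted(node for node, temp in reported_temps_c.items()
                  if abs(temp - median) > tolerance_c)

# Attack 1 from above: Node C claims 30C while the rest of the row is warm.
reports = {"A": 78, "B": 72, "C": 30, "D": 71, "E": 74}
flagged = suspect_reports(reports)
```

A genuinely hot node can also land outside the tolerance, so a flag is better treated as a reason to check power draw and neighbor trends than as proof of lying.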

4.4 Gossip Storm

Scenario: Cascade of thermal events triggers message explosion.

Trigger:
  - Zone A overheats
  - All nodes in A gossip urgently
  - Zone B (adjacent) receives flood
  - Zone B nodes gossip about Zone A
  - Exponential message growth

Defense:
  1. Rate limiting per source
  2. TTL on messages (prevent infinite propagation)
  3. Deduplication (sequence numbers)
  4. Aggregation (one message per zone, not per node)
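
The first three defenses fit in one forwarding filter. In this sketch the rate limit is a per-source message budget that resets each window; the budget size and the `GossipFilter` name are assumptions.

```python
class GossipFilter:
    def __init__(self, max_per_source_per_window=10):
        self.max_per_window = max_per_source_per_window
        self.highest_seq = {}     # node_id -> highest sequence forwarded
        self.window_counts = {}   # node_id -> messages forwarded this window

    def start_window(self):
        """Call once per rate-limit window to reset the budgets."""
        self.window_counts.clear()

    def should_forward(self, node_id, sequence, ttl):
        if ttl <= 0:
            return False                      # TTL exhausted
        if sequence <= self.highest_seq.get(node_id, -1):
            return False                      # duplicate or stale state
        count = self.window_counts.get(node_id, 0)
        if count >= self.max_per_window:
            return False                      # source over budget
        self.highest_seq[node_id] = sequence
        self.window_counts[node_id] = count + 1
        return True

f = GossipFilter(max_per_source_per_window=2)
results = [
    f.should_forward("C", 1, ttl=3),  # new message
    f.should_forward("C", 1, ttl=3),  # duplicate
    f.should_forward("C", 2, ttl=0),  # no hops left
    f.should_forward("C", 2, ttl=3),  # new message
    f.should_forward("C", 3, ttl=3),  # over budget for this window
]
```

Dropping any sequence at or below the highest seen also discards reordered messages, which seems acceptable when only the latest thermal state matters.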

Step 5: Control-Theoretic Analysis

Model the system as a feedback control loop.

5.1 System Dynamics Model

State vector per node:
  x = [temperature, power_draw, price, workload]

Dynamics:
  d(temp)/dt = f(power_draw, neighbor_temps, cooling_capacity)
  d(workload)/dt = g(my_price, neighbor_prices, global_demand)
  price = h(temp, neighbor_temps, zone_headroom)

Goal: Design h() such that system is stable and optimal.

5.2 Stability Analysis

Linearize around equilibrium point.
Compute eigenvalues of system matrix.
Ensure all eigenvalues have negative real parts.

Research:
  - Under what parameter regimes is system stable?
  - What's the settling time?
  - What disturbances can the system reject?
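
For a two-state linearization the eigenvalue condition reduces to a trace and determinant test: a real 2x2 matrix has both eigenvalues in the left half-plane exactly when its trace is negative and its determinant is positive. The matrix below is a made-up toy model (state = temperature and workload deviations from equilibrium); its coefficients are purely illustrative.

```python
def is_stable_2x2(a11, a12, a21, a22):
    """True when both eigenvalues of [[a11, a12], [a21, a22]] have Re < 0."""
    trace = a11 + a22
    det = a11 * a22 - a12 * a21
    return trace < 0 and det > 0

# Toy model: d(temp)/dt     = -0.2*temp + 0.5*workload   (cooling, heating)
#            d(workload)/dt = -1.0*temp - 0.1*workload   (price pushes work away)
negative_feedback = is_stable_2x2(-0.2, 0.5, -1.0, -0.1)

# Same model, but heat now attracts work instead of repelling it.
positive_feedback = is_stable_2x2(-0.2, 0.5, 1.0, -0.1)
```

This toy model ignores gossip delay and neighbor coupling, both of which can destabilize a loop that looks stable on paper.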

5.3 Optimal Control Formulation

Objective: Minimize total thermal stress while maximizing throughput

min ∫ [Σ thermal_stress(i) + λ × Σ idle_capacity(i)] dt

Subject to:
  - Temperature constraints
  - Power constraints
  - Workload conservation (all work must be done)

Research: Can we derive optimal price function from this?

Step 6: Integration with Maxwell Node-Level Auction

The gossip protocol informs, but doesn't replace, the local auction.

6.1 Price Signal Flow

┌─────────────────────────────────────────────────────────────────┐
│                         GOSSIP LAYER                             │
│                                                                  │
│  Receives: neighbor thermal states, zone headroom               │
│  Computes: external_price_multiplier                            │
│                                                                  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      LOCAL MAXWELL AUCTION                       │
│                                                                  │
│  final_price = base_price                                       │
│              × local_thermal_multiplier                         │
│              × external_price_multiplier  ← FROM GOSSIP         │
│              × demand_multiplier                                │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

6.2 Workload Migration Trigger

When should workload actually migrate?

Trigger conditions:
  1. Local price exceeds threshold
  2. At least one neighbor has lower price by margin M
  3. Network path to neighbor has capacity
  4. Workload is migratable (not pinned)

Migration decision is LOCAL — each node decides independently.
Gossip provides information, not commands.
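
The four trigger conditions can be combined into one local decision. A minimal sketch, assuming prices are plain multipliers and path capacity is already known as a boolean per neighbor; `should_migrate` is a hypothetical name.

```python
def should_migrate(local_price, neighbor_prices, price_threshold, margin_m,
                   path_has_capacity, is_pinned):
    """Return the cheapest eligible neighbor, or None to stay put."""
    if is_pinned or local_price <= price_threshold:
        return None                                   # conditions 4 and 1
    candidates = [(price, node) for node, price in neighbor_prices.items()
                  if local_price - price >= margin_m        # condition 2
                  and path_has_capacity.get(node, False)]   # condition 3
    return min(candidates)[1] if candidates else None

target = should_migrate(local_price=3.0,
                        neighbor_prices={"D": 2.8, "E": 1.2},
                        price_threshold=2.0, margin_m=0.5,
                        path_has_capacity={"D": True, "E": True},
                        is_pinned=False)
```

Always picking the single cheapest neighbor is the simplest policy, but it is also the one most likely to send every hot node's work to the same cool node.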

6.3 Global vs Local Optimality

Question: Does local best-response lead to global optimum?

Potential issues:
  - Tragedy of the commons (everyone migrates to one cool node)
  - Race conditions (two nodes migrate to each other)
  - Information lag (decisions based on stale gossip)

Research: What coordination mechanism ensures global efficiency?
  - Price-based (pure market)
  - Token-based (capacity reservations)
  - Centralized hint (optional optimizer suggests moves)

Step 7: Implementation Architecture

Concrete system design for Maxwell cluster.

7.1 Gossip Daemon

Per-node process:

ThermalGossipDaemon:
  - Reads temperature from sensors (IPMI, /sys/class/thermal)
  - Reads power from PDU or RAPL
  - Maintains peer list (discovery via mDNS or static config)
  - Runs gossip protocol (UDP multicast or unicast)
  - Exposes price_multiplier to local Maxwell scheduler
  - Writes metrics to Prometheus

Interface:
  GET /thermal/state → current thermal state
  GET /thermal/neighbors → known neighbor states
  GET /thermal/price_multiplier → computed multiplier
  WS /thermal/stream → real-time updates

7.2 Network Protocol

Transport: UDP (low latency, tolerates loss)
Discovery: mDNS for LAN, static config for cross-DC
Security: WireGuard mesh or signed messages
Multicast: For intra-zone gossip (reduces message count)

Message flow:
  1. Node → Zone multicast: "Here's my state"
  2. Zone aggregator → Inter-zone unicast: "Zone summary"
  3. Emergency: Direct unicast to thermal neighbors

7.3 Sensor Integration

Temperature sources:
  - CPU: /sys/class/thermal/thermal_zone*/temp
  - GPU: nvidia-smi, rocm-smi
  - Chassis: IPMI sensors
  - Ambient: External probe or DCIM API

Power sources:
  - CPU: Intel RAPL (/sys/class/powercap)
  - GPU: nvidia-smi power draw
  - Node: PDU SNMP or Redfish API
  - Zone: DCIM API

Sampling interval: 100ms (10 Hz)
Smoothing: Exponential moving average (α = 0.3)
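
Reading the CPU source and applying the smoothing might look like this. The Linux thermal sysfs `temp` files report millidegrees Celsius, hence the division by 1000; the function names are illustrative, and a real daemon would also need the GPU, IPMI, and DCIM sources listed above.

```python
from pathlib import Path

def read_cpu_temps_c(root="/sys/class/thermal"):
    """Map each thermal_zone* directory under root to degrees Celsius."""
    temps = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # sensor missing or unreadable; skip it
        temps[zone.name] = millideg / 1000.0
    return temps

def ema(previous, sample, alpha=0.3):
    """Exponential moving average; the first sample passes through."""
    if previous is None:
        return sample
    return alpha * sample + (1 - alpha) * previous

smoothed = None
for sample in (70.0, 80.0):   # stand-in readings, one per 100ms tick
    smoothed = ema(smoothed, sample)
```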

Deliverables

Primary Output: Protocol Specification (about 26 pages, plus appendices)

1. Executive Summary (1 page)
   - Protocol overview
   - Key design decisions
   - Expected performance characteristics

2. Physical Model (3 pages)
   - Thermal coupling graph
   - Cooling capacity model
   - Power delivery topology
   - How to discover/configure

3. Gossip Protocol (6 pages)
   - Message format specification
   - Peer selection algorithm
   - Push/pull hybrid design
   - Zone aggregation
   - Pseudocode for all algorithms

4. Price Adjustment (4 pages)
   - Neighbor-influenced pricing formula
   - Stability mechanisms (hysteresis, damping)
   - Game-theoretic analysis

5. Failure Handling (4 pages)
   - Node failure detection and response
   - Network partition handling
   - Byzantine resistance
   - Gossip storm prevention

6. Control Theory Analysis (3 pages)
   - System dynamics model
   - Stability conditions
   - Convergence proofs (or conjectures)

7. Implementation (3 pages)
   - Daemon architecture
   - Network protocol
   - Sensor integration
   - Maxwell integration

8. Evaluation Plan (2 pages)
   - Simulation framework
   - Testbed requirements
   - Metrics and success criteria

Appendices:
- Full message format specification
- Pseudocode listings
- Parameter tuning guide

Secondary Outputs

  1. Protocol Comparison Matrix

    | Approach | Latency | Bandwidth | Convergence | Partition Tolerance |
    |---|---|---|---|---|
    | Full Mesh Push | 10ms | O(N²) | O(1) rounds | Poor |
    | Epidemic Gossip | 50ms | O(N log N) | O(log N) | Good |
    | Zone Aggregated | 30ms | O(N) | O(log K) | Good |
    | Pull On-Demand | Variable | O(N) | O(N) worst | Excellent |
  2. Simulation Framework

    • Discrete-event simulator for thermal gossip
    • Configurable topology (rack, row, zone, DC)
    • Failure injection
    • Metrics collection
  3. Reference Implementation

    • Go daemon implementing recommended protocol
    • gRPC/protobuf message definitions
    • Prometheus metrics exporter

Quality Checklist

Before considering research complete:

  • Defined thermal coupling graph model
  • Specified complete message format
  • Designed peer selection algorithm
  • Analyzed convergence time (theoretical)
  • Proved or conjectured stability conditions
  • Handled node failure, network partition, and Byzantine faults
  • Integrated with Maxwell local auction
  • Provided implementation architecture
  • Simulated with realistic topology (100+ nodes)
  • Demonstrated <100ms crisis propagation

Research Philosophy

Lamport's Principles Applied:

  1. Specify before implementing — Formal TLA+ spec if possible
  2. Assume messages can be lost, delayed, duplicated — Design defensively
  3. Clocks lie — Use logical time, not wall clocks
  4. Safety over liveness — Better to be slow than wrong
  5. Simple protocols scale — Complexity is the enemy

Maxwell-Specific Constraints:

  • Gossip must not interfere with auction hot path
  • Thermal response must be faster than throttling onset (~1-2 seconds)
  • Protocol must work across rack, row, zone, and DC scales
  • Must integrate with existing DCIM/BMS systems
  • Byzantine resistance required (compromised nodes exist)

Starting Points

Papers to Review

Gossip Protocols:
- "Epidemic Algorithms for Replicated Database Maintenance" (Demers et al.)
- "Gossip-Based Computation of Aggregate Information" (Kempe et al.)
- "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (Das et al.)

Distributed Consensus:
- "The Part-Time Parliament" (Lamport), the original Paxos paper
- "In Search of an Understandable Consensus Algorithm" (Ongaro and Ousterhout), the Raft paper
- "Viewstamped Replication" (Liskov)

Thermal-Aware Computing:
- "Thermal-Aware Scheduling in Data Centers" (various)
- "Thermodynamic Computing" (emerging field)

Control Theory:
- "Feedback Control of Computing Systems" (Hellerstein et al.)

Systems to Study

- Serf (HashiCorp) — Gossip-based membership
- Cassandra gossip — Failure detection and state propagation
- Kubernetes node heartbeats — Distributed health checking
- AWS Nitro thermal management — Hypervisor-level thermal
- Facebook data center cooling — Zone-based thermal management

Code to Examine

# Serf gossip implementation
https://github.com/hashicorp/serf

# SWIM protocol implementation
https://github.com/hashicorp/memberlist

# Linux thermal subsystem
/sys/class/thermal/
/drivers/thermal/ in Linux kernel

# IPMI thermal sensors
ipmitool sensor list

Notes

Scope Boundaries:

  • Focus on intra-DC gossip (assume <1ms network latency)
  • Assume honest-but-failing sensors (Byzantine = misconfigured, not malicious)
  • Don't design cross-DC federation (future work)
  • Assume Maxwell auction exists and accepts price multiplier input

Physical Reality Check:

Thermal time constants:
  - CPU die: ~1 second to heat, ~5 seconds to cool
  - Chassis: ~30 seconds to stabilize
  - Room: ~5 minutes to stabilize
  - Zone: ~15 minutes to stabilize

Gossip must be MUCH faster than thermal response.
10ms gossip latency vs 1s thermal time constant = 100x margin.

The Key Insight:

"A node doesn't need to know the exact temperature of every other node. It needs to know: 'Is my thermal neighborhood healthy, and if not, how should I adjust my behavior?'"

Gossip is about coordination, not surveillance.

The Thermodynamic Argument (Don't Forget):

"Heat doesn't respect software boundaries. A gossip protocol that ignores physical topology is optimizing the wrong thing. The goal isn't distributed consensus on temperature — it's distributed consensus on who should back off so the rack doesn't catch fire."