Thermal Gossip Consensus Research Directive
You are Leslie Lamport, Turing Award laureate and inventor of Paxos, Lamport clocks, and the foundational theory of distributed systems. You've spent your career proving that consensus is possible in the presence of failures, and understanding exactly when it isn't. You know that distributed systems fail in ways that seem impossible until they happen.
You are going to design a gossip protocol for thermal state propagation across a distributed cluster — specifically, a mechanism where nodes autonomously share thermal stress signals, enabling neighbor-aware price adjustment that rebalances workloads before thermal throttling occurs.
Maxwell Cluster Architecture
Critical: Maxwell runs on every node. Nodes share physical infrastructure.
This isn't abstract distributed computing. Nodes share:
- Cooling zones — A row of racks shares CRAC units
- Power circuits — PDU capacity is finite per row
- Ambient temperature — Hot exhaust from Node A becomes intake for Node B
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA CENTER ROW │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node A │ │ Node B │ │ Node C │ │ Node D │ │ Node E │ │
│ │ Maxwell │◄──│ Maxwell │◄──│ Maxwell │◄──│ Maxwell │◄──│ Maxwell │ │
│ │ │──▶│ │──▶│ │──▶│ │──▶│ │ │
│ │ T=78°C │ │ T=72°C │ │ T=85°C │ │ T=71°C │ │ T=74°C │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │ │
│ └─────────────┴──────┬──────┴─────────────┴─────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ Shared CRAC │ ← Cooling capacity is FINITE │
│ │ (25kW limit) │ │
│ └─────────────────┘ │
│ │
│ GOSSIP LAYER: Each Maxwell shares thermal state with neighbors │
│ GOAL: Autonomous rebalancing before any node throttles │
└─────────────────────────────────────────────────────────────────────────┘
The Physical Coupling Problem
Node C is overheating (85°C):
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Fan │ │ Power │ │ Exhaust │
│ Ramp-up │ │ Draw ↑ │ │ Heat → │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Noise │ │ Circuit │ │ Node D │
│ affects │ │ capacity│ │ intake │
│ humans │ │ shared │ │ warmer │
└─────────┘ └─────────┘ └─────────┘
│
▼
CASCADING THERMAL FAILURE
Without coordination: Node C throttles, its work migrates to Node D, Node D overheats, cascade continues.
With gossip: Node C signals distress, neighbors raise prices, work migrates to cool nodes (E, F, G...) BEFORE throttling.
The Paradox
Problem Statement:
Traditional approaches fail:
| Approach | Problem |
|---|---|
| Centralized controller | Single point of failure, latency |
| Periodic broadcast | O(N²) messages, stale data |
| Reactive throttling | Too late — damage already done |
| Static topology | Doesn't adapt to load patterns |
The Challenge:
Design a protocol where:
- Node A detects thermal stress
- Node A gossips "I am dying" to relevant neighbors
- Neighbors autonomously adjust their prices
- Workloads migrate without central coordination
- System stabilizes without oscillation
- All of this happens in <100ms end-to-end
The Distributed Systems Constraints:
- Nodes may fail silently (thermal death)
- Network may partition (switch failure)
- Clocks are not synchronized (physical time varies)
- Messages may be delayed, duplicated, or lost
- Byzantine nodes may lie about temperature (compromised sensors)
Research Objectives
Design a thermal gossip protocol achieving:
- Rapid Propagation: Thermal crisis reaches affected neighbors in <10ms
- Minimal Overhead: Gossip bandwidth <1% of network capacity
- Convergence: System reaches stable price equilibrium
- Stability: No oscillations (hunting behavior)
- Partition Tolerance: Graceful degradation under network splits
- Byzantine Resistance: Robust to lying/faulty temperature sensors
Step 1: Model the Physical Topology
Before designing the protocol, understand what "neighbor" means physically.
1.1 Thermal Coupling Graph
Not all nodes affect each other equally.
Define: thermal_coupling(A, B) ∈ [0, 1]
- 1.0 = same chassis (multi-GPU node)
- 0.8 = same rack (shared fans, power)
- 0.5 = same row (shared CRAC)
- 0.2 = same zone (shared chilled water)
- 0.0 = different zones (independent cooling)
Research: How to discover/measure these couplings?
- Static config from data center DCIM?
- Dynamic measurement (correlate temp readings)?
- ML model from historical data?
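As a starting point for the static-config option, the coupling tiers above can be encoded as a lookup over physical placement. This is a minimal sketch: the `Location` schema and its field names are hypothetical (for example, exported from DCIM), and it assumes identifiers are unique within their parent level.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    """Physical placement of a node (hypothetical schema)."""
    zone: str
    row: str
    rack: str
    chassis: str

def thermal_coupling(a: Location, b: Location) -> float:
    """Return the tiered coupling coefficient for two placements."""
    if a.zone != b.zone:
        return 0.0  # different zones: independent cooling
    if a.row != b.row:
        return 0.2  # same zone: shared chilled water
    if a.rack != b.rack:
        return 0.5  # same row: shared CRAC
    if a.chassis != b.chassis:
        return 0.8  # same rack: shared fans, power
    return 1.0      # same chassis

node_c = Location("Z", "row1", "rack3", "chassis-c")
node_d = Location("Z", "row1", "rack4", "chassis-d")
print(thermal_coupling(node_c, node_d))  # same row, different rack
```

A measured or learned coupling could later replace the fixed tiers without changing the function signature.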
1.2 Cooling Capacity Model
Each cooling zone has capacity:
Zone Z:
- CRAC capacity: 100kW
- Current load: 85kW
- Headroom: 15kW
- Nodes in zone: {A, B, C, D, E}
If Node C's load rises by 25kW:
- Zone oversubscribed by 10kW
- CRAC can't keep up
- Ambient temp rises for ALL nodes in zone
Gossip must propagate: "Zone Z is out of cooling headroom"
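The headroom arithmetic in this example can be sketched as follows. It reads the scenario as Node C adding 25kW to the zone load; the even 17kW-per-node split of the 85kW baseline is an assumption made only for illustration.

```python
def zone_headroom_kw(capacity_kw: float, loads_kw: dict) -> float:
    """Remaining cooling capacity; negative means the zone is oversubscribed."""
    return capacity_kw - sum(loads_kw.values())

def zone_headroom_pct(capacity_kw: float, loads_kw: dict) -> int:
    """Headroom as a 0-100 percentage, the form a gossip message could carry."""
    pct = round(100 * zone_headroom_kw(capacity_kw, loads_kw) / capacity_kw)
    return min(100, max(0, pct))

# Assumed even split of the 85kW baseline across the five nodes.
loads = {"A": 17.0, "B": 17.0, "C": 17.0, "D": 17.0, "E": 17.0}
before = zone_headroom_kw(100.0, loads)   # 15.0 kW of headroom
loads["C"] += 25.0                        # Node C adds 25kW
after = zone_headroom_kw(100.0, loads)    # -10.0 kW: oversubscribed
print(before, after, zone_headroom_pct(100.0, loads))
```

Clamping the percentage at zero loses the size of the deficit, so a real message format might carry the signed value instead.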
1.3 Power Delivery Topology
PDU hierarchy:
Substation → Transformer → PDU → Rack PDU → Node
Each level has capacity limits:
- Rack PDU: 30kW per rack
- PDU: 200kW per aisle
- Transformer: 1MW per zone
Thermal stress often correlates with power stress.
Gossip should include power draw, not just temperature.
Step 2: Design the Gossip Protocol
Core mechanism for thermal state dissemination.
2.1 Message Format
ThermalGossipMessage {
// Identity
node_id: UUID
timestamp: Lamport clock (not wall clock!)
sequence: Monotonic counter (detect duplicates)
// Thermal state
temperature_c: uint8 // 0-255°C, 1°C resolution
thermal_margin: int8 // Degrees below throttle (-128 to +127)
trend: int8 // °C/second rate of change
// Resource state
power_draw_w: uint16 // Current power consumption
fan_speed_pct: uint8 // 0-100%
// Zone context
zone_id: uint16 // Physical cooling zone
zone_headroom: uint8 // % remaining cooling capacity
// Price signal
price_multiplier: uint16 // Fixed-point in 0.01x steps, 0.00x to 655.35x
// Protocol
ttl: uint8 // Hops remaining
signature: [64]byte // Ed25519 signature is 64 bytes (Byzantine resistance)
}
Size: ~108 bytes per message (44-byte body plus 64-byte signature, assuming 8-byte timestamp and sequence fields)
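One way to check the wire size is to pack the unsigned body with Python's `struct` module. This is a sketch, not the wire format: the 8-byte widths for `timestamp` and `sequence` are assumptions (the spec leaves them open), and signing is omitted because it needs a crypto library outside the standard library.

```python
import struct
import uuid

# Network byte order, no padding. Order follows the message definition:
# node_id, timestamp, sequence, temperature_c, thermal_margin, trend,
# power_draw_w, fan_speed_pct, zone_id, zone_headroom, price_multiplier, ttl
BODY_FORMAT = "!16sQQBbbHBHBHB"

def pack_body(node_id: uuid.UUID, lamport: int, sequence: int,
              temperature_c: int, thermal_margin: int, trend: int,
              power_draw_w: int, fan_speed_pct: int,
              zone_id: int, zone_headroom: int,
              price_multiplier: float, ttl: int) -> bytes:
    """Serialize every field except the signature."""
    price_fixed = round(price_multiplier * 100)  # 0.01x steps in a uint16
    return struct.pack(BODY_FORMAT, node_id.bytes, lamport, sequence,
                       temperature_c, thermal_margin, trend,
                       power_draw_w, fan_speed_pct,
                       zone_id, zone_headroom, price_fixed, ttl)

body = pack_body(uuid.uuid4(), 42, 7, 85, 5, 2, 850, 90, 3, 15, 1.75, 4)
print(len(body))  # 44 bytes before the signature is appended
fields = struct.unpack(BODY_FORMAT, body)
```

`struct.pack` raises `struct.error` when a value does not fit its field, which gives a cheap range check on sensor inputs.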
2.2 Epidemic Gossip (Push Model)
Classic epidemic/rumor spreading:
Every T milliseconds:
1. Select K random peers from thermal_neighbors
2. Send my ThermalGossipMessage to each
3. Receive messages from peers
4. Update local view of cluster thermal state
5. Adjust my prices based on neighbor states
Parameters:
- T = gossip interval (10-100ms?)
- K = fanout (2-4 peers per round?)
Properties:
- Convergence time: O(log N) rounds
- Message complexity: O(N·K) messages per round, O(N log N) in total until convergence
- Distributed: No coordinator required
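The O(log N) convergence claim can be checked with a small simulation. This sketch simplifies the protocol above: only informed nodes push, peers are drawn uniformly from the whole cluster rather than from `thermal_neighbors`, and there is no message loss.

```python
import random

def rounds_to_converge(n: int, fanout: int, seed: int = 0,
                       max_rounds: int = 1000) -> int:
    """Rounds until a rumor started at node 0 reaches all n nodes."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n and rounds < max_rounds:
        newly = set()
        # Each informed node pushes to `fanout` uniformly random peers.
        for _ in range(len(informed) * fanout):
            newly.add(rng.randrange(n))
        informed |= newly
        rounds += 1
    return rounds

for size in (100, 1000, 10000):
    print(size, rounds_to_converge(size, fanout=3))
```

The informed set can grow by at most a factor of (1 + K) per round, so 1000 nodes with K = 3 need at least 5 rounds. The simulation shows how close random selection gets to that bound.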
2.3 Thermal-Aware Peer Selection
Don't gossip randomly — gossip to thermally-coupled peers.
Peer selection weighted by:
weight(peer) = thermal_coupling(self, peer)
× urgency(self.thermal_margin)
× recency(last_gossip_to_peer)
Urgency function:
urgency(margin) = {
1.0 if margin < 5°C (CRITICAL)
0.5 if margin < 10°C (WARNING)
0.1 if margin < 20°C (NORMAL)
0.01 otherwise (COOL)
}
Result: Hot nodes gossip aggressively to thermal neighbors
Cool nodes gossip lazily
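A sketch of the selection step, under stated assumptions. The `recency` shape (saturating after one second) and `max_fanout` are hypothetical. Because `urgency` depends only on the sender, it is the same for every candidate and cannot change relative weights, so this sketch uses it to scale how many peers are contacted instead.

```python
import random

def urgency(margin_c: float) -> float:
    """Urgency tiers from the definition above."""
    if margin_c < 5:
        return 1.0   # CRITICAL
    if margin_c < 10:
        return 0.5   # WARNING
    if margin_c < 20:
        return 0.1   # NORMAL
    return 0.01      # COOL

def recency(seconds_since_gossip: float) -> float:
    """Hypothetical: favor peers not contacted lately, saturating at 1 s."""
    return min(1.0, max(0.05, seconds_since_gossip))

def select_peers(coupling, last_gossip_age_s, margin_c, max_fanout=4, rng=None):
    """Weighted sampling without replacement over thermally coupled peers."""
    rng = rng or random.Random()
    fanout = max(1, round(max_fanout * urgency(margin_c)))
    weights = {p: c * recency(last_gossip_age_s.get(p, 1.0))
               for p, c in coupling.items() if c > 0}
    chosen = []
    while weights and len(chosen) < fanout:
        peers = list(weights)
        pick = rng.choices(peers, weights=[weights[p] for p in peers])[0]
        chosen.append(pick)
        del weights[pick]
    return chosen

coupling = {"B": 0.8, "D": 0.8, "A": 0.5, "E": 0.5, "X": 0.0}
hot = select_peers(coupling, {}, margin_c=3, rng=random.Random(1))
cool = select_peers(coupling, {}, margin_c=30, rng=random.Random(1))
print(hot, cool)
```

Urgency could equally shorten the gossip interval T. Either way the effect is that hot nodes send more messages per second.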
2.4 Pull Model (On-Demand)
Alternative: Nodes request state only when needed.
When local temp crosses threshold:
1. Query thermal neighbors for their state
2. Compute optimal price adjustment
3. Apply immediately
Pros: Less bandwidth when stable
Cons: Latency when crisis hits
Hybrid approach:
- Push for critical events (margin < 5°C)
- Pull for routine updates
2.5 Zone-Level Aggregation
Reduce message complexity with hierarchy:
┌─────────────────────────────────────────────┐
│ Zone Aggregator │
│ (Elected leader or virtual node) │
│ │
│ Aggregates: max_temp, min_margin, │
│ total_power, zone_headroom │
└────────────────┬────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Node A │ │Node B │ │Node C │
│gossip │ │gossip │ │gossip │
│to zone│ │to zone│ │to zone│
└───────┘ └───────┘ └───────┘
Inter-zone gossip: Zone aggregators gossip to each other
Intra-zone gossip: Nodes gossip within zone
Complexity: O(√N) messages per node per round (about √N zones of √N nodes each) vs O(N) per node for full mesh
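The aggregator's summary can be sketched as a reduction over member states. Temperatures come from the row diagram earlier. The 90°C throttle point, the power figures, and deriving `zone_headroom` from power draw against a capacity figure are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    node_id: str
    temperature_c: int
    thermal_margin: int
    power_draw_w: int

def aggregate_zone(states, zone_capacity_w):
    """Reduce per-node states to the four fields the aggregator gossips."""
    total_power = sum(s.power_draw_w for s in states)
    headroom = round(100 * (zone_capacity_w - total_power) / zone_capacity_w)
    return {
        "max_temp": max(s.temperature_c for s in states),
        "min_margin": min(s.thermal_margin for s in states),
        "total_power": total_power,
        "zone_headroom": max(0, headroom),  # percent, clamped at zero
    }

# Margins assume a hypothetical 90 C throttle point.
row = [
    NodeState("A", 78, 12, 800),
    NodeState("B", 72, 18, 700),
    NodeState("C", 85, 5, 1000),
    NodeState("D", 71, 19, 650),
    NodeState("E", 74, 16, 750),
]
summary = aggregate_zone(row, zone_capacity_w=5000)
print(summary)
```

All four aggregates are order-independent, so the summary is the same whichever node acts as aggregator.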
Step 3: Design the Price Adjustment Mechanism
Receiving gossip must trigger autonomous price adjustment.
3.1 Neighbor-Influenced Pricing
My price depends on:
1. My own thermal state
2. My neighbors' thermal states
3. Zone-level capacity
price_multiplier = f(
my_thermal_margin,
avg_neighbor_margin,
zone_headroom,
historical_stability
)
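Simple model:
base_price = throttle_temp / my_thermal_margin
neighbor_penalty = Σ (coupling[i] / neighbor_margin[i])
zone_penalty = 1.0 / zone_headroom (zone_headroom as a fraction in (0, 1], not a percentage)
price_multiplier = base_price × (1 + neighbor_penalty) × zone_penalty
Clamp margins and zone_headroom to a small positive floor: each term diverges as its denominator approaches zero.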
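The simple model translates directly into code. This sketch makes two assumptions: `zone_headroom` is treated as a fraction rather than a percentage, and margins and headroom are clamped to small floors (`min_margin_c`, `min_headroom`, both hypothetical) so the divisions stay finite.

```python
def price_multiplier(my_margin_c, throttle_temp_c, neighbors,
                     zone_headroom_frac, min_margin_c=1.0, min_headroom=0.05):
    """neighbors is a list of (coupling, neighbor_margin_c) pairs."""
    margin = max(my_margin_c, min_margin_c)
    base_price = throttle_temp_c / margin  # same as 1 / (margin / throttle_temp)
    neighbor_penalty = sum(c / max(m, min_margin_c) for c, m in neighbors)
    zone_penalty = 1.0 / max(zone_headroom_frac, min_headroom)
    return base_price * (1 + neighbor_penalty) * zone_penalty

calm = price_multiplier(10, 90, [(0.8, 20), (0.5, 10)], zone_headroom_frac=0.5)
stressed = price_multiplier(10, 90, [(0.8, 2), (0.5, 10)], zone_headroom_frac=0.5)
print(calm, stressed)  # calm is roughly 19.62
```

The model is not normalized: even the calm case gives a multiplier well above 1.0, so a real pricing function would likely divide by a reference value.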
3.2 Stability Constraints
Problem: Naive adjustment causes oscillations
Node A hot → raises price → work migrates to B
B becomes hot → raises price → work migrates back to A
Repeat forever.
Solutions:
1. Hysteresis:
- Only raise price when margin < threshold_low
- Only lower price when margin > threshold_high
- threshold_low < threshold_high; hold price in between (dead band)
2. Rate limiting:
- Price can only change by X% per second
- Prevents rapid oscillation
3. Damping:
- new_price = α × computed_price + (1-α) × old_price
- α = 0.1 for slow adjustment, 0.5 for fast
4. Predictive:
- Adjust based on temperature TREND, not current value
- If trending up, raise price proactively
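The first three mechanisms compose naturally into one update step. This sketch reads the dead band as: raise only below a low-margin threshold, lower only above a high-margin threshold, hold in between. The threshold values and the per-update (rather than per-second) step limit are assumptions, and the predictive term is left out.

```python
class PriceStabilizer:
    """Hysteresis, then damping, then a rate limit, applied in that order."""

    def __init__(self, raise_below_c=10.0, lower_above_c=15.0,
                 alpha=0.1, max_step_frac=0.05, price=1.0):
        self.raise_below_c = raise_below_c
        self.lower_above_c = lower_above_c
        self.alpha = alpha
        self.max_step_frac = max_step_frac
        self.price = price

    def update(self, margin_c: float, computed_price: float) -> float:
        if margin_c < self.raise_below_c:
            target = max(computed_price, self.price)   # only move up
        elif margin_c > self.lower_above_c:
            target = min(computed_price, self.price)   # only move down
        else:
            return self.price                          # dead band: hold
        # Damping: exponential blend of the target and the old price.
        smoothed = self.alpha * target + (1 - self.alpha) * self.price
        # Rate limit: at most max_step_frac change per update.
        step = self.max_step_frac * self.price
        self.price = min(max(smoothed, self.price - step), self.price + step)
        return self.price

s = PriceStabilizer()
held = s.update(margin_c=12, computed_price=5.0)    # inside the dead band
raised = s.update(margin_c=5, computed_price=5.0)   # capped by the rate limit
lowered = s.update(margin_c=20, computed_price=0.5)
print(held, raised, lowered)
```

With these defaults the rate limit dominates, so a fivefold jump in computed price takes many updates to show up in full.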
3.3 Game-Theoretic Stability
Question: Is the pricing equilibrium a Nash equilibrium?
Model as N-player game:
- Each node chooses price
- Payoff = revenue - thermal_damage
- Neighbors' prices affect my workload
Research:
- Does a stable equilibrium exist?
- Is it unique?
- How fast does best-response dynamics converge?
- Can nodes profitably deviate?
Step 4: Handle Failure Modes
Distributed systems fail. Design for it.
4.1 Node Failure (Thermal Death)
Scenario: Node C overheats and shuts down suddenly.
Problem:
- C stops gossiping
- Neighbors don't know if C is dead or network partitioned
- C's workload may auto-migrate to neighbors (overwhelming them)
Solution:
1. Heartbeat timeout → assume dead
2. Mark C's zone as "degraded" in gossip
3. All zone nodes preemptively raise prices
4. Wait for confirmation before lowering
Timeout: 3 × gossip_interval (30ms if interval = 10ms)
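The timeout rule can be sketched as a small failure detector. Time is passed in explicitly so the logic is testable; a daemon would presumably feed it a monotonic clock reading. The class and method names are hypothetical.

```python
class FailureDetector:
    """Suspect a node once it has been silent for multiplier x gossip interval."""

    def __init__(self, gossip_interval_s: float = 0.010, multiplier: int = 3):
        self.timeout_s = multiplier * gossip_interval_s
        self.last_seen = {}

    def heard_from(self, node_id: str, now_s: float) -> None:
        self.last_seen[node_id] = now_s

    def suspected(self, now_s: float) -> set:
        """Nodes whose last gossip is older than the timeout."""
        return {n for n, t in self.last_seen.items()
                if now_s - t > self.timeout_s}

fd = FailureDetector()
fd.heard_from("C", now_s=0.000)
fd.heard_from("D", now_s=0.025)
print(fd.suspected(now_s=0.040))  # C has been silent for 40 ms, D for 15 ms
```

A timeout alone cannot tell a dead node from an unreachable one, which is why the steps above raise prices first and wait for confirmation before lowering them.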
4.2 Network Partition
Scenario: Switch failure splits cluster into two halves.
Problem:
- Each half sees the other as "dead"
- Each half may accept full workload
- When partition heals, both halves are overloaded
Solution (conservative):
- On partition detection, raise prices proportionally
- "I see only 50% of nodes → assume 50% capacity"
- When partition heals, gradually lower prices
Detection:
- Gossip includes "nodes_seen_recently" count
- If count drops, assume partition
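One reading of "assume 50% capacity" is to scale price by the inverse of the visible fraction, so seeing half the cluster doubles the price. That interpretation and the `max_factor` cap are assumptions of this sketch.

```python
def partition_price_factor(nodes_seen_recently: int, expected_nodes: int,
                           max_factor: float = 10.0) -> float:
    """Price scaling for a node that can see only part of the cluster."""
    if expected_nodes <= 0:
        raise ValueError("expected_nodes must be positive")
    visible = max(nodes_seen_recently, 1) / expected_nodes
    # Never discount below 1.0x; cap the penalty when almost nothing is visible.
    return min(max_factor, max(1.0, 1.0 / visible))

print(partition_price_factor(50, 100))   # half the cluster visible
print(partition_price_factor(100, 100))  # full view
print(partition_price_factor(2, 100))    # capped at max_factor
```

The same count drop occurs when many nodes actually fail, and the conservative response (higher prices) is appropriate in both cases.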
4.3 Byzantine Sensors
Scenario: Compromised node lies about temperature.
Attack 1 - Fake cold:
- Node C claims 30°C (actually 90°C)
- Other nodes route work to C
- C catches fire (or steals work unfairly)
Attack 2 - Fake hot:
- Node C claims 95°C (actually 50°C)
- Other nodes avoid C
- C gets free capacity while others overload
Defense:
1. Signed attestation (TPM-backed temperature)
2. Cross-validation (if C claims cold but zone claims hot → suspect)
3. Reputation system (historically accurate nodes trusted more)
4. Physical correlation (power draw should match temperature)
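Cross-validation (defense 2) can be sketched as an outlier check against the median of the other nodes in the zone. The 15°C deviation threshold is a hypothetical tuning value, and the example reuses the row temperatures from the earlier diagram with Node C reporting a fake 30°C.

```python
from statistics import median

def suspicious_reports(temps_by_node: dict, max_deviation_c: float = 15.0) -> set:
    """Flag nodes whose claim differs sharply from the median of their peers."""
    suspects = set()
    for node, temp in temps_by_node.items():
        others = [t for n, t in temps_by_node.items() if n != node]
        if others and abs(temp - median(others)) > max_deviation_c:
            suspects.add(node)
    return suspects

reports = {"A": 78, "B": 72, "C": 30, "D": 71, "E": 74}  # C is lying
print(suspicious_reports(reports))
```

A real hotspot can also deviate from its neighbors, so a flag here is a cue to apply the other defenses (power correlation, reputation) rather than proof of a fault.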
4.4 Gossip Storm
Scenario: Cascade of thermal events triggers message explosion.
Trigger:
- Zone A overheats
- All nodes in A gossip urgently
- Zone B (adjacent) receives flood
- Zone B nodes gossip about Zone A
- Exponential message growth
Defense:
1. Rate limiting per source
2. TTL on messages (prevent infinite propagation)
3. Deduplication (sequence numbers)
4. Aggregation (one message per zone, not per node)
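Defenses 1 to 3 fit into a single receive-side filter. This is a sketch with hypothetical names: the window length is left to the caller, who resets the counters by calling `new_window()` periodically.

```python
class GossipFilter:
    """Drop expired, duplicate, or over-rate messages before processing."""

    def __init__(self, max_per_window: int = 5):
        self.max_per_window = max_per_window
        self.highest_seq = {}      # per-source deduplication state
        self.count_in_window = {}  # per-source rate-limit counters

    def accept(self, node_id: str, sequence: int, ttl: int):
        """Return hops left for forwarding, or None to drop.

        A return value of 0 means: process locally, do not forward.
        """
        if ttl <= 0:
            return None  # TTL expired
        if sequence <= self.highest_seq.get(node_id, -1):
            return None  # duplicate or stale
        if self.count_in_window.get(node_id, 0) >= self.max_per_window:
            return None  # source exceeded its rate limit
        self.highest_seq[node_id] = sequence
        self.count_in_window[node_id] = self.count_in_window.get(node_id, 0) + 1
        return ttl - 1

    def new_window(self) -> None:
        self.count_in_window.clear()

f = GossipFilter(max_per_window=2)
first = f.accept("C", sequence=1, ttl=2)
dup = f.accept("C", sequence=1, ttl=2)
second = f.accept("C", sequence=2, ttl=1)
limited = f.accept("C", sequence=3, ttl=2)
print(first, dup, second, limited)
```

Rate-limited messages do not advance the sequence state, so the same update can still be accepted if it arrives again after the window resets.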
Step 5: Control-Theoretic Analysis
Model the system as a feedback control loop.
5.1 System Dynamics Model
State vector per node:
x = [temperature, power_draw, price, workload]
Dynamics:
d(temp)/dt = f(power_draw, neighbor_temps, cooling_capacity)
d(workload)/dt = g(my_price, neighbor_prices, global_demand)
price = h(temp, neighbor_temps, zone_headroom)
Goal: Design h() such that system is stable and optimal.
5.2 Stability Analysis
Linearize around equilibrium point.
Compute eigenvalues of system matrix.
Ensure all eigenvalues have negative real parts.
Research:
- Under what parameter regimes is system stable?
- What's the settling time?
- What disturbances can the system reject?
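The eigenvalue test can be tried on a toy linearization before tackling the full model. Everything here is an assumption for illustration: a two-state system (temperature deviation, workload deviation) with hypothetical coefficients, no gossip delay, and a linear price response.

```python
import cmath

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix via the characteristic polynomial."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def jacobian(cooling, heating, relax, sensitivity, price_gain):
    """Toy model: d(dT)/dt = -cooling*dT + heating*dW
                  d(dW)/dt = -sensitivity*price_gain*dT - relax*dW"""
    return [[-cooling, heating],
            [-sensitivity * price_gain, -relax]]

def is_stable(m) -> bool:
    """Continuous-time stability: every eigenvalue has a negative real part."""
    return all(ev.real < 0 for ev in eigenvalues_2x2(m))

J = jacobian(cooling=1.0, heating=0.5, relax=0.2, sensitivity=1.0, price_gain=2.0)
ev1, ev2 = eigenvalues_2x2(J)
print(is_stable(J), ev1, ev2)
```

A complex pair with negative real part means a decaying oscillation: the system hunts briefly and then settles. In this delay-free toy every positive parameter choice is stable, which suggests the interesting instabilities come from gossip latency and stale state.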
5.3 Optimal Control Formulation
Objective: Minimize total thermal stress while maximizing throughput
min ∫ [Σ thermal_stress(i) + λ × Σ idle_capacity(i)] dt
Subject to:
- Temperature constraints
- Power constraints
- Workload conservation (all work must be done)
Research: Can we derive optimal price function from this?
Step 6: Integration with Maxwell Node-Level Auction
The gossip protocol informs, but doesn't replace, the local auction.
6.1 Price Signal Flow
┌─────────────────────────────────────────────────────────────────┐
│ GOSSIP LAYER │
│ │
│ Receives: neighbor thermal states, zone headroom │
│ Computes: external_price_multiplier │
│ │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LOCAL MAXWELL AUCTION │
│ │
│ final_price = base_price │
│ × local_thermal_multiplier │
│ × external_price_multiplier ← FROM GOSSIP │
│ × demand_multiplier │
│ │
└─────────────────────────────────────────────────────────────────┘
6.2 Workload Migration Trigger
When should workload actually migrate?
Trigger conditions:
1. Local price exceeds threshold
2. At least one neighbor has lower price by margin M
3. Network path to neighbor has capacity
4. Workload is migratable (not pinned)
Migration decision is LOCAL — each node decides independently.
Gossip provides information, not commands.
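The four trigger conditions can be written as one local predicate. This is a sketch with hypothetical parameter names; it returns the cheapest eligible neighbor, which is only one possible tie-breaking rule.

```python
def should_migrate(local_price, neighbor_prices, price_threshold, margin_m,
                   path_has_capacity, workload_pinned):
    """Return the neighbor to migrate to, or None if any condition fails."""
    if workload_pinned:
        return None                      # condition 4: must be migratable
    if local_price <= price_threshold:
        return None                      # condition 1: local price too low
    candidates = {
        n: p for n, p in neighbor_prices.items()
        if local_price - p >= margin_m          # condition 2: cheaper by M
        and path_has_capacity.get(n, False)     # condition 3: usable path
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

target = should_migrate(
    local_price=3.0,
    neighbor_prices={"D": 2.8, "E": 1.2, "F": 1.0},
    price_threshold=2.0,
    margin_m=0.5,
    path_has_capacity={"D": True, "E": True, "F": False},
    workload_pinned=False,
)
print(target)
```

Always picking the cheapest neighbor invites the herd behavior discussed in 6.3; a randomized choice among candidates is one possible mitigation.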
6.3 Global vs Local Optimality
Question: Does local best-response lead to global optimum?
Potential issues:
- Tragedy of the commons (everyone migrates to one cool node)
- Race conditions (two nodes migrate work to each other)
- Information lag (decisions based on stale gossip)
Research: What coordination mechanism ensures global efficiency?
- Price-based (pure market)
- Token-based (capacity reservations)
- Centralized hint (optional optimizer suggests moves)
Step 7: Implementation Architecture
Concrete system design for Maxwell cluster.
7.1 Gossip Daemon
Per-node process:
ThermalGossipDaemon:
- Reads temperature from sensors (IPMI, /sys/class/thermal)
- Reads power from PDU or RAPL
- Maintains peer list (discovery via mDNS or static config)
- Runs gossip protocol (UDP multicast or unicast)
- Exposes price_multiplier to local Maxwell scheduler
- Writes metrics to Prometheus
Interface:
GET /thermal/state → current thermal state
GET /thermal/neighbors → known neighbor states
GET /thermal/price_multiplier → computed multiplier
WS /thermal/stream → real-time updates
7.2 Network Protocol
Transport: UDP (low latency, tolerates loss)
Discovery: mDNS for LAN, static config for cross-DC
Security: WireGuard mesh or signed messages
Multicast: For intra-zone gossip (reduces message count)
Message flow:
1. Node → Zone multicast: "Here's my state"
2. Zone aggregator → Inter-zone unicast: "Zone summary"
3. Emergency: Direct unicast to thermal neighbors
7.3 Sensor Integration
Temperature sources:
- CPU: /sys/class/thermal/thermal_zone*/temp
- GPU: nvidia-smi, rocm-smi
- Chassis: IPMI sensors
- Ambient: External probe or DCIM API
Power sources:
- CPU: Intel RAPL (/sys/class/powercap)
- GPU: nvidia-smi power draw
- Node: PDU SNMP or Redfish API
- Zone: DCIM API
Sampling rate: 100ms (10 Hz)
Smoothing: Exponential moving average (α = 0.3)
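A sketch of the sampling path for the CPU source. Linux sysfs thermal zones report millidegrees Celsius, hence the division by 1000. The `Ema` class name and seeding the average with the first sample are choices of this sketch.

```python
from pathlib import Path

def read_zone_temp_c(zone_temp_path: str) -> float:
    """Read one sysfs thermal zone file, e.g. .../thermal_zone0/temp."""
    return int(Path(zone_temp_path).read_text().strip()) / 1000.0

class Ema:
    """Exponential moving average with the smoothing factor given above."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.value = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample  # seed with the first reading
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

ema = Ema()
smoothed = [ema.update(t) for t in (70.0, 80.0, 80.0)]
print(smoothed)
```

At α = 0.3 and 10 Hz, a step change takes several samples to show through, which adds to the end-to-end latency budget.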
Deliverables
Primary Output: Protocol Specification (about 26 pages plus appendices)
1. Executive Summary (1 page)
- Protocol overview
- Key design decisions
- Expected performance characteristics
2. Physical Model (3 pages)
- Thermal coupling graph
- Cooling capacity model
- Power delivery topology
- How to discover/configure
3. Gossip Protocol (6 pages)
- Message format specification
- Peer selection algorithm
- Push/pull hybrid design
- Zone aggregation
- Pseudocode for all algorithms
4. Price Adjustment (4 pages)
- Neighbor-influenced pricing formula
- Stability mechanisms (hysteresis, damping)
- Game-theoretic analysis
5. Failure Handling (4 pages)
- Node failure detection and response
- Network partition handling
- Byzantine resistance
- Gossip storm prevention
6. Control Theory Analysis (3 pages)
- System dynamics model
- Stability conditions
- Convergence proofs (or conjectures)
7. Implementation (3 pages)
- Daemon architecture
- Network protocol
- Sensor integration
- Maxwell integration
8. Evaluation Plan (2 pages)
- Simulation framework
- Testbed requirements
- Metrics and success criteria
Appendices:
- Full message format specification
- Pseudocode listings
- Parameter tuning guide
Secondary Outputs
- Protocol Comparison Matrix
| Approach | Latency | Bandwidth | Convergence | Partition Tolerance |
|---|---|---|---|---|
| Full Mesh Push | 10ms | O(N²) | O(1) rounds | Poor |
| Epidemic Gossip | 50ms | O(N log N) | O(log N) | Good |
| Zone Aggregated | 30ms | O(N) | O(log K) | Good |
| Pull On-Demand | Variable | O(N) | O(N) worst | Excellent |
- Simulation Framework
  - Discrete-event simulator for thermal gossip
  - Configurable topology (rack, row, zone, DC)
  - Failure injection
  - Metrics collection
- Reference Implementation
  - Go daemon implementing recommended protocol
  - gRPC/protobuf message definitions
  - Prometheus metrics exporter
Quality Checklist
Before considering research complete:
- Defined thermal coupling graph model
- Specified complete message format
- Designed peer selection algorithm
- Analyzed convergence time (theoretical)
- Proved or conjectured stability conditions
- Handled node failure, partition, Byzantine
- Integrated with Maxwell local auction
- Provided implementation architecture
- Simulated with realistic topology (100+ nodes)
- Demonstrated <100ms crisis propagation
Research Philosophy
Lamport's Principles Applied:
- Specify before implementing — Formal TLA+ spec if possible
- Assume messages can be lost, delayed, duplicated — Design defensively
- Clocks lie — Use logical time, not wall clocks
- Safety over liveness — Better to be slow than wrong
- Simple protocols scale — Complexity is the enemy
Maxwell-Specific Constraints:
- Gossip must not interfere with auction hot path
- Thermal response must be faster than throttling onset (~1-2 seconds)
- Protocol must work across rack, row, zone, and DC scales
- Must integrate with existing DCIM/BMS systems
- Byzantine resistance required (compromised nodes exist)
Starting Points
Papers to Review
Gossip Protocols:
- "Epidemic Algorithms for Replicated Database Maintenance" (Demers et al.)
- "Gossip-Based Computation of Aggregate Information" (Kempe et al.)
- "SWIM: Scalable Weakly-consistent Infection-style Membership" (Das et al.)
Distributed Consensus:
- "The Part-Time Parliament" (Lamport) — Paxos original
- "In Search of an Understandable Consensus Algorithm" (Raft)
- "Viewstamped Replication" (Liskov)
Thermal-Aware Computing:
- "Thermal-Aware Scheduling in Data Centers" (various)
- "Thermodynamic Computing" (emerging field)
Control Theory:
- "Feedback Control of Computing Systems" (Hellerstein et al.)
Systems to Study
- Serf (HashiCorp) — Gossip-based membership
- Cassandra gossip — Failure detection and state propagation
- Kubernetes node heartbeats — Distributed health checking
- AWS Nitro thermal management — Hypervisor-level thermal
- Facebook data center cooling — Zone-based thermal management
Code to Examine
# Serf gossip implementation
https://github.com/hashicorp/serf
# SWIM protocol implementation
https://github.com/hashicorp/memberlist
# Linux thermal subsystem
/sys/class/thermal/
drivers/thermal/ in the Linux kernel source tree
# IPMI thermal sensors
ipmitool sensor list
Notes
Scope Boundaries:
- Focus on intra-DC gossip (assume <1ms network latency)
- Assume honest-but-failing sensors (Byzantine = misconfigured, not malicious)
- Don't design cross-DC federation (future work)
- Assume Maxwell auction exists and accepts price multiplier input
Physical Reality Check:
Thermal time constants:
- CPU die: ~1 second to heat, ~5 seconds to cool
- Chassis: ~30 seconds to stabilize
- Room: ~5 minutes to stabilize
- Zone: ~15 minutes to stabilize
Gossip must be MUCH faster than thermal response.
10ms gossip latency vs 1s thermal time constant = 100x margin.
The Key Insight:
"A node doesn't need to know the exact temperature of every other node. It needs to know: 'Is my thermal neighborhood healthy, and if not, how should I adjust my behavior?'"
Gossip is about coordination, not surveillance.
The Thermodynamic Argument (Don't Forget):
"Heat doesn't respect software boundaries. A gossip protocol that ignores physical topology is optimizing the wrong thing. The goal isn't distributed consensus on temperature — it's distributed consensus on who should back off so the rack doesn't catch fire."