# Task 03: ControlPlane

## Delivers

`ControlPlane` in `tidal/src/replication/control.rs`. Embedded within the leader node. Manages cluster topology (shard-to-region assignments, tenant placement, region health). Exposes cluster health metrics serializable to JSON for external monitoring. No separate service: it runs as a background task within the leader process.

## Complexity: L

## Dependencies

- Task 01 (TenantId, TenantConfig)
- Task 02 (TenantRouter, ClusterTopology)
- Phase 8.2, Task 06 (ReplicationLagGauge)

## Technical Design

```rust
// tidal/src/replication/control.rs

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

use dashmap::DashMap;

/// Embedded cluster controller running on the leader node.
///
/// Tracks cluster topology, tenant placement, and shard health.
/// Exposes a `ClusterHealth` snapshot for external monitoring via the
/// existing `MetricsState` integration.
///
/// Design constraint: no external service. The control plane is an
/// in-process component, consistent with tidalDB's embeddable philosophy.
pub struct ControlPlane {
    topology: Arc<RwLock<ClusterTopology>>,
    tenant_router: Arc<TenantRouter>,
    lag_gauge: Arc<ReplicationLagGauge>,
    shard_stats: DashMap<ShardId, ShardStats>,
    region_health: DashMap<RegionId, RegionHealth>,
}

/// Per-shard operational statistics.
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct ShardStats {
    pub shard_id: ShardId,
    pub region_id: RegionId,
    pub entity_count: u64,
    /// WAL events applied per second (EMA, α=0.1).
    pub signal_throughput_eps: f64,
    /// Replication lag to each follower (seqno distance).
    /// The key type is assumed to be the follower's `RegionId`.
    pub replication_lag: HashMap<RegionId, u64>,
    /// Approximate disk usage for this shard's WAL directory (bytes).
    pub disk_bytes: u64,
    /// Last heartbeat from this shard (ns since epoch).
    pub last_heartbeat_ns: u64,
}

/// Per-region health state.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum RegionHealth {
    Healthy,
    Degraded, // replication lag > 5s
    Offline,  // no heartbeat for > 30s
}

/// Full cluster health snapshot.
///
/// Serializable to JSON for monitoring dashboards (Prometheus/Grafana, etc.).
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct ClusterHealth {
    pub snapshot_ns: u64,
    pub shards: Vec<ShardStats>,
    pub regions: HashMap<RegionId, RegionHealth>,
    pub tenant_count: usize,
    pub total_entities: u64,
    pub total_signals_eps: f64,
}

impl ControlPlane {
    pub fn new(
        topology: Arc<RwLock<ClusterTopology>>,
        tenant_router: Arc<TenantRouter>,
        lag_gauge: Arc<ReplicationLagGauge>,
    ) -> Self {
        Self {
            topology,
            tenant_router,
            lag_gauge,
            shard_stats: DashMap::new(),
            region_health: DashMap::new(),
        }
    }

    /// Update shard statistics (called by each shard on its heartbeat interval).
    pub fn record_shard_heartbeat(&self, stats: ShardStats) {
        self.region_health.insert(stats.region_id, RegionHealth::Healthy);
        self.shard_stats.insert(stats.shard_id, stats);
    }

    /// Compute and return the current cluster health snapshot.
    pub fn health(&self) -> ClusterHealth {
        let now_ns = crate::util::now_ns();
        let shards: Vec<_> = self
            .shard_stats
            .iter()
            .map(|r| r.value().clone())
            .collect();

        // Re-derive each region's health from its shard's latest stats.
        // NOTE: only the first shard found for a region is consulted; a region
        // hosting several shards may need its shards aggregated instead.
        let regions: HashMap<_, _> = self
            .region_health
            .iter()
            .map(|r| {
                let shard_for_region = shards.iter().find(|s| s.region_id == *r.key());
                let health = if let Some(s) = shard_for_region {
                    let age_ns = now_ns.saturating_sub(s.last_heartbeat_ns);
                    if age_ns > 30_000_000_000 {
                        // No heartbeat for more than 30s.
                        RegionHealth::Offline
                    } else if s.replication_lag.values().any(|&lag| lag > 5_000_000_000) {
                        // Lag above 5s. NOTE: this treats lag as nanoseconds, while
                        // `replication_lag` is documented as a seqno distance; the
                        // units need to be reconciled.
                        RegionHealth::Degraded
                    } else {
                        RegionHealth::Healthy
                    }
                } else {
                    // Region is known but no shard has reported for it.
                    RegionHealth::Offline
                };
                (*r.key(), health)
            })
            .collect();

        let total_entities: u64 = shards.iter().map(|s| s.entity_count).sum();
        let total_signals_eps: f64 = shards.iter().map(|s| s.signal_throughput_eps).sum();

        ClusterHealth {
            snapshot_ns: now_ns,
            shards,
            regions,
            tenant_count: self.tenant_router.tenant_count(),
            total_entities,
            total_signals_eps,
        }
    }

    /// Update topology: add or reassign a shard.
    ///
    /// Propagated to `TenantRouter`, which re-computes routes on its next call.
    pub fn update_topology(&self, assignment: ShardAssignment) {
        let mut topology = self.topology.write().unwrap();
        if let Some(existing) = topology
            .shards
            .iter_mut()
            .find(|s| s.shard_id == assignment.shard_id)
        {
            *existing = assignment;
        } else {
            topology.shards.push(assignment);
        }
    }

    /// JSON representation of `ClusterHealth` for external monitoring.
    pub fn health_json(&self) -> String {
        serde_json::to_string_pretty(&self.health()).unwrap_or_else(|e| {
            // Build the fallback with `json!` so the error text is escaped
            // and the result is always valid JSON.
            serde_json::json!({ "error": e.to_string() }).to_string()
        })
    }
}
```

### MetricsState Integration

```rust
// tidal/src/db/metrics.rs (extension)

impl MetricsState {
    pub fn cluster_health(&self) -> Option<ClusterHealth> {
        self.control_plane.as_ref().map(|cp| cp.health())
    }
}
```

## Acceptance Criteria

- [ ] `ControlPlane::health()` returns a `ClusterHealth` with per-shard stats for all registered shards
- [ ] `RegionHealth::Offline` is reported for the region of a shard whose `last_heartbeat_ns` is more than 30 seconds old
- [ ] `RegionHealth::Degraded` is reported for the region of a shard with `replication_lag > 5s`
- [ ] `health_json()` produces valid JSON deserializable back to `ClusterHealth` (round-trip test)
- [ ] `update_topology(assignment)` is reflected in the next `health()` call and the next `TenantRouter::route()` call
- [ ] `MetricsState::cluster_health()` returns `None` on single-node deployments (control plane not configured)
- [ ] Control plane heartbeat test: 3 simulated shards, update stats for each, verify `health()` shows all 3 as `Healthy`
- [ ] `cargo clippy -- -D warnings` and `cargo fmt` pass
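
### Sketch: region health classification

The `Offline` and `Degraded` thresholds from the Technical Design (no heartbeat for more than 30s, replication lag above 5s) can be exercised in isolation. The block below is a minimal, std-only sketch, not the `ControlPlane` implementation. `ShardSample` and `classify_region` are hypothetical names, the follower key is a plain `u32`, and lag is assumed to be measured in nanoseconds. It also aggregates over every shard in a region, which is a variation on the single-shard `find` lookup used in `health()`.

```rust
use std::collections::HashMap;

/// Thresholds taken from the `RegionHealth` comments in the design.
const OFFLINE_AFTER_NS: u64 = 30_000_000_000; // 30s without a heartbeat
const DEGRADED_LAG_NS: u64 = 5_000_000_000; // 5s of replication lag

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionHealth {
    Healthy,
    Degraded,
    Offline,
}

/// Hypothetical, trimmed-down stand-in for `ShardStats`.
#[derive(Debug, Clone)]
pub struct ShardSample {
    /// Last heartbeat from this shard (ns since epoch).
    pub last_heartbeat_ns: u64,
    /// Lag per follower; assumed here to be in nanoseconds.
    pub replication_lag_ns: HashMap<u32, u64>,
}

/// Classify one region from all of its shards.
///
/// Assumption: a region is `Offline` when even its freshest heartbeat is
/// stale, and `Degraded` when any shard reports lag above the threshold.
pub fn classify_region(now_ns: u64, shards: &[ShardSample]) -> RegionHealth {
    // Freshest heartbeat across the region; `None` means no shard reported.
    let latest = match shards.iter().map(|s| s.last_heartbeat_ns).max() {
        Some(ts) => ts,
        None => return RegionHealth::Offline,
    };

    // `saturating_sub` guards against heartbeats stamped slightly in the future.
    if now_ns.saturating_sub(latest) > OFFLINE_AFTER_NS {
        return RegionHealth::Offline;
    }

    let lagging = shards
        .iter()
        .flat_map(|s| s.replication_lag_ns.values())
        .any(|&lag| lag > DEGRADED_LAG_NS);

    if lagging {
        RegionHealth::Degraded
    } else {
        RegionHealth::Healthy
    }
}
```

Keeping the classification a pure function of `now_ns` and the shard samples lets the 30s and 5s boundaries be tested without a running cluster or a real clock. Both comparisons are strict, so a heartbeat exactly 30s old still counts as live.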