tidaldb/docs/planning/milestone-8/phase-2/task-06-replication-lag-gauge.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
Milestone 8 (phases 1-4):
- Shard-aware WAL segment naming, BatchHeader v2, ShardRouter
- Transport trait, InProcessTransport, WalShipper, FollowerDb
- HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine
- Session replication bridge with SeqNo/HWM, idempotency store

Forage application:
- Multi-source discovery engine with MAB exploration
- Embedding-based label system, server handlers, UI refresh

Other:
- QUICKSTART.md, README.md, milestone-8 planning docs
- Hard negative union semantics, RLHF export enhancements
- Recovery benchmark and visibility test expansions
- Split 8 oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:17:19 -07:00


Task 06: ReplicationLagGauge

Delivers

ReplicationLagGauge in tidal/src/replication/lag.rs, tracking per-follower lag in seqno units (leader_seqno - follower_applied_seqno). The gauge is exposed via MetricsState so the existing Prometheus scrape picks it up automatically.

Complexity: S

Dependencies

  • Phase 8.1 (ReplicationState)
  • Task 03 (WalShipper -- for leader_seqno)

Technical Design

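// tidal/src/replication/lag.rs

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

use dashmap::DashMap;
// Plus `ShardId` and `ReplicationState` from Phase 8.1 (module paths not shown here).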

/// Tracks per-follower replication lag.
///
/// Lag = leader's latest shipped seqno - follower's applied seqno.
/// A lag of 0 means the follower is fully caught up.
#[derive(Debug)]
pub struct ReplicationLagGauge {
    /// Last seqno the leader has shipped to each follower, keyed by the follower's shard.
    leader_seqno: DashMap<ShardId, AtomicU64>,
    /// Shared replication state; the source of each follower's applied seqno.
    follower_applied: Arc<ReplicationState>,
}

impl ReplicationLagGauge {
    pub fn new(replication_state: Arc<ReplicationState>) -> Self {
        Self {
            leader_seqno: DashMap::new(),
            follower_applied: replication_state,
        }
    }

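    /// Update the leader's known shipped seqno for a follower.
    ///
    /// Overwrites the previous value unconditionally, so an out-of-order call
    /// moves the recorded seqno backwards.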
    pub fn update_leader_seqno(&self, follower: ShardId, seqno: u64) {
        self.leader_seqno
            .entry(follower)
            .or_insert_with(|| AtomicU64::new(0))
            .store(seqno, Ordering::Release);
    }

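    /// Get the current lag for a follower in seqno units.
    ///
    /// A side with no recorded seqno counts as 0. The result is negative if the
    /// follower's applied seqno is ahead of the last value passed to
    /// `update_leader_seqno`.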
    pub fn lag_seqno(&self, follower: ShardId) -> i64 {
        let leader = self.leader_seqno
            .get(&follower)
            .map(|a| a.load(Ordering::Acquire))
            .unwrap_or(0);
        let applied = self.follower_applied
            .applied_seqno(follower)
            .unwrap_or(0);
        leader as i64 - applied as i64
    }

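    /// Collect Prometheus-style gauge values for all followers.
    pub fn collect_metrics(&self) -> Vec<(ShardId, i64)> {
        // Snapshot the keys first so no map guard is held while
        // `lag_seqno` looks the follower up again.
        let followers: Vec<ShardId> = self.leader_seqno.iter().map(|e| *e.key()).collect();
        followers
            .into_iter()
            .map(|follower| (follower, self.lag_seqno(follower)))
            .collect()
    }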
}
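
The design above leaves two edge cases open: `lag_seqno` casts both seqnos with `as i64`, which wraps for values above `i64::MAX`, and `update_leader_seqno` stores unconditionally, so a late update can lower the recorded seqno. The following is a minimal sketch of how both could be tightened, assuming that clamping and monotonic updates are actually wanted. `signed_lag` and `advance_seqno` are hypothetical helper names, not part of the planned API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Signed `leader - applied`, clamped to +/- `i64::MAX` instead of wrapping.
/// Hypothetical helper; not part of the planned API.
fn signed_lag(leader: u64, applied: u64) -> i64 {
    let max = i64::MAX as u64;
    if leader >= applied {
        // Follower is behind or caught up: non-negative lag.
        (leader - applied).min(max) as i64
    } else {
        // Follower is ahead of the recorded leader seqno: negative lag.
        -((applied - leader).min(max) as i64)
    }
}

/// Raise the stored seqno to `seqno` if it is larger; never lower it.
/// Returns the value stored before the call. Hypothetical helper.
fn advance_seqno(slot: &AtomicU64, seqno: u64) -> u64 {
    slot.fetch_max(seqno, Ordering::AcqRel)
}
```

With these helpers, `signed_lag(100, 0)` is 100 and `signed_lag(5, 7)` is -2, and calling `advance_seqno` with 10 and then 7 leaves the slot at 10. Whether a negative lag should be reported or clamped to 0 is a product decision this sketch does not make.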

MetricsState integration

// tidal/src/db/metrics.rs (existing metrics module)

impl MetricsState {
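    // Add next to the existing collect() method. Returns 0 when `lag_gauge`
    // is unset, which is indistinguishable from a fully caught-up follower.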
    pub fn replication_lag_seqno(&self, follower_shard: u16) -> i64 {
        self.lag_gauge
            .as_ref()
            .map(|g| g.lag_seqno(ShardId(follower_shard)))
            .unwrap_or(0)
    }
}
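
For reference, this is roughly what the scraped output could look like. The sketch below renders `(shard, lag)` pairs, such as those returned by `collect_metrics()`, in the Prometheus text exposition format. The metric name `replication_lag_seqno` comes from the acceptance criteria; the `follower_shard` label name, the help text, and the `render_replication_lag` function are assumptions made for illustration, and the real `/metrics` handler may build its output differently.

```rust
use std::fmt::Write;

/// Render `(follower_shard, lag)` pairs as Prometheus text exposition lines.
/// The `follower_shard` label name is an assumption, not taken from the plan.
fn render_replication_lag(samples: &[(u16, i64)]) -> String {
    let mut out = String::new();
    out.push_str("# HELP replication_lag_seqno Leader shipped seqno minus follower applied seqno.\n");
    out.push_str("# TYPE replication_lag_seqno gauge\n");
    for (shard, lag) in samples {
        // Writing to a String cannot fail, so the result is ignored.
        let _ = writeln!(
            out,
            "replication_lag_seqno{{follower_shard=\"{}\"}} {}",
            shard, lag
        );
    }
    out
}
```

Each follower becomes one sample line, so a lag of 100 on shard 1 is emitted as `replication_lag_seqno{follower_shard="1"} 100`.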

Acceptance Criteria

  • ReplicationLagGauge::lag_seqno(follower) returns leader_seqno - follower_applied_seqno
  • lag_seqno returns 0 when follower is fully caught up
  • lag_seqno returns > 0 when follower is behind
  • collect_metrics() returns a snapshot of all follower lags
  • Integrated into MetricsState so existing /metrics endpoint exposes replication_lag_seqno gauge
  • Integration test: leader writes 100 segments, each advancing the shipped seqno by 1; before the follower applies them, lag = 100; after apply, lag = 0
  • cargo clippy -- -D warnings and cargo fmt pass
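
The integration-test criterion can be sketched without the rest of the crate. The version below is std-only and uses stand-ins: `StubReplicationState` and `StubLagGauge` are hypothetical types, a `RwLock<HashMap>` replaces `DashMap`, shards are plain `u16` values, and the leader is assumed to advance its seqno by 1 per shipped segment. The real `ReplicationState::applied_seqno` signature may differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for the Phase 8.1 `ReplicationState`; only the lookup the gauge
/// needs is modelled.
#[derive(Default)]
struct StubReplicationState {
    applied: RwLock<HashMap<u16, u64>>,
}

impl StubReplicationState {
    fn set_applied(&self, follower: u16, seqno: u64) {
        self.applied.write().unwrap().insert(follower, seqno);
    }

    fn applied_seqno(&self, follower: u16) -> Option<u64> {
        self.applied.read().unwrap().get(&follower).copied()
    }
}

/// Std-only stand-in for `ReplicationLagGauge`.
struct StubLagGauge {
    leader_seqno: RwLock<HashMap<u16, u64>>,
    follower_applied: Arc<StubReplicationState>,
}

impl StubLagGauge {
    fn new(state: Arc<StubReplicationState>) -> Self {
        Self {
            leader_seqno: RwLock::new(HashMap::new()),
            follower_applied: state,
        }
    }

    fn update_leader_seqno(&self, follower: u16, seqno: u64) {
        self.leader_seqno.write().unwrap().insert(follower, seqno);
    }

    fn lag_seqno(&self, follower: u16) -> i64 {
        let leader = self.leader_seqno.read().unwrap().get(&follower).copied().unwrap_or(0);
        let applied = self.follower_applied.applied_seqno(follower).unwrap_or(0);
        leader as i64 - applied as i64
    }
}

/// Runs the scenario and returns (lag before apply, lag after apply).
fn lag_before_and_after_apply() -> (i64, i64) {
    let state = Arc::new(StubReplicationState::default());
    let gauge = StubLagGauge::new(Arc::clone(&state));
    let follower = 1u16;

    // Leader ships 100 segments, one seqno each (assumption).
    for seqno in 1..=100u64 {
        gauge.update_leader_seqno(follower, seqno);
    }
    let before = gauge.lag_seqno(follower);

    // Follower catches up by applying everything that was shipped.
    state.set_applied(follower, 100);
    let after = gauge.lag_seqno(follower);

    (before, after)
}
```

The real test would drive the same sequence through `WalShipper` and `FollowerDb` rather than setting the seqnos by hand.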