tidaldb/docs/planning/milestone-8/phase-2/task-06-replication-lag-gauge.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
2026-02-24 13:17:19 -07:00


# Task 06: ReplicationLagGauge
## Delivers
`ReplicationLagGauge` in `tidal/src/replication/lag.rs`, which tracks per-follower replication lag in seqno units (`leader_seqno - follower_applied_seqno`). It is exposed via `MetricsState` so that the existing Prometheus scrape picks it up without extra configuration.
## Complexity: S
## Dependencies
- Phase 8.1 (ReplicationState)
- Task 03 (WalShipper -- for leader_seqno)
## Technical Design
```rust
// tidal/src/replication/lag.rs
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

use dashmap::DashMap;

// `ShardId` and `ReplicationState` come from Phase 8.1; import them from
// wherever that phase defines them.

/// Tracks per-follower replication lag.
///
/// Lag = leader's latest shipped seqno - follower's applied seqno.
/// A lag of 0 means the follower is fully caught up.
#[derive(Debug, Default)]
pub struct ReplicationLagGauge {
    /// Per-follower: last seqno the leader has shipped.
    leader_seqno: DashMap<ShardId, AtomicU64>,
    /// Shared replication state, queried for each follower's last applied seqno.
    follower_applied: Arc<ReplicationState>,
}

impl ReplicationLagGauge {
    pub fn new(replication_state: Arc<ReplicationState>) -> Self {
        Self {
            leader_seqno: DashMap::new(),
            follower_applied: replication_state,
        }
    }

    /// Update the leader's known shipped seqno for a follower.
    pub fn update_leader_seqno(&self, follower: ShardId, seqno: u64) {
        self.leader_seqno
            .entry(follower)
            .or_insert_with(|| AtomicU64::new(0))
            .store(seqno, Ordering::Release);
    }

    /// Get the current lag for a follower in seqno units.
    ///
    /// A side with no recorded seqno counts as 0. The result is negative if
    /// the follower's applied seqno exceeds the last seqno recorded here.
    pub fn lag_seqno(&self, follower: ShardId) -> i64 {
        let leader = self
            .leader_seqno
            .get(&follower)
            .map(|a| a.load(Ordering::Acquire))
            .unwrap_or(0);
        let applied = self.follower_applied.applied_seqno(follower).unwrap_or(0);
        leader as i64 - applied as i64
    }

    /// Collect Prometheus-style gauge values for all followers.
    pub fn collect_metrics(&self) -> Vec<(ShardId, i64)> {
        // Snapshot the keys first so that no map guard is held while
        // `lag_seqno` looks the follower up again.
        let followers: Vec<ShardId> = self.leader_seqno.iter().map(|e| *e.key()).collect();
        followers
            .into_iter()
            .map(|follower| (follower, self.lag_seqno(follower)))
            .collect()
    }
}
```
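Two behaviors of the design above may be worth pinning down in review: `update_leader_seqno` uses a plain `store`, so a late, out-of-order report would move the recorded seqno backwards, and `lag_seqno` subtracts through `as i64` casts, which wrap for values above `i64::MAX`. The following is a minimal std-only sketch of one possible alternative, not part of the task as written; `record_shipped` and `signed_lag` are hypothetical helper names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Record a shipped seqno, ignoring reports older than the stored value.
/// Returns the value held in the slot after the update.
fn record_shipped(slot: &AtomicU64, seqno: u64) -> u64 {
    // `fetch_max` stores max(current, seqno) and returns the previous value.
    let previous = slot.fetch_max(seqno, Ordering::AcqRel);
    previous.max(seqno)
}

/// Signed difference `leader - applied`, saturating at +/- `i64::MAX`
/// instead of wrapping.
fn signed_lag(leader: u64, applied: u64) -> i64 {
    let magnitude = leader.abs_diff(applied).min(i64::MAX as u64) as i64;
    if leader >= applied {
        magnitude
    } else {
        -magnitude
    }
}
```

Whether a monotonic update is desirable depends on how `WalShipper` reports progress; if seqnos can legitimately reset (for example after a leader change), the plain `store` may be the intended behavior.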
### MetricsState integration
```rust
// tidal/src/db/metrics.rs (existing metrics module)
impl MetricsState {
    /// Lag for one follower shard, to be read from the existing `collect()` method.
    /// Returns 0 when `lag_gauge` is unset.
    pub fn replication_lag_seqno(&self, follower_shard: u16) -> i64 {
        self.lag_gauge
            .as_ref()
            .map(|g| g.lag_seqno(ShardId(follower_shard)))
            .unwrap_or(0)
    }
}
```
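The task does not specify how the values are rendered at `/metrics`. Assuming the endpoint emits the Prometheus text exposition format and distinguishes followers with a label, the output of `collect_metrics()` could be turned into gauge samples roughly as follows. The function name `render_replication_lag` and the label name `follower_shard` are assumptions made for illustration, and shard ids are shown as plain `u16` values.

```rust
use std::fmt::Write;

/// Render (follower shard, lag) pairs as Prometheus gauge samples.
fn render_replication_lag(samples: &[(u16, i64)]) -> String {
    let mut out = String::new();
    out.push_str("# HELP replication_lag_seqno Leader shipped seqno minus follower applied seqno.\n");
    out.push_str("# TYPE replication_lag_seqno gauge\n");

    // Sort by shard id so repeated scrapes list followers in a stable order.
    let mut sorted = samples.to_vec();
    sorted.sort_unstable_by_key(|(shard, _)| *shard);

    for (shard, lag) in sorted {
        // Writing into a String cannot fail, so the result is ignored.
        let _ = writeln!(
            out,
            "replication_lag_seqno{{follower_shard=\"{}\"}} {}",
            shard, lag
        );
    }
    out
}
```

For the input `[(2, 0), (1, 100)]` this yields one sample line per follower, for example `replication_lag_seqno{follower_shard="1"} 100`.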
## Acceptance Criteria
- [ ] `ReplicationLagGauge::lag_seqno(follower)` returns `leader_seqno - follower_applied_seqno`
- [ ] `lag_seqno` returns 0 when follower is fully caught up
- [ ] `lag_seqno` returns > 0 when follower is behind
- [ ] `collect_metrics()` returns a snapshot of all follower lags
- [ ] Integrated into `MetricsState` so existing `/metrics` endpoint exposes `replication_lag_seqno` gauge
- [ ] Integration test: leader writes 100 segments; before follower applies them, lag = 100; after apply, lag = 0
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass
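The lag criteria above can be exercised without the real transport. The sketch below is a std-only model, not the actual implementation: `ShardId`, `ReplicationState` and `LagGauge` are simplified stand-ins, `mark_applied` is a hypothetical setter, `RwLock<HashMap>` replaces `DashMap`, and it assumes one seqno per shipped unit.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Stand-in for the real shard identifier (assumed to wrap a u16).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ShardId(u16);

/// Stand-in for the Phase 8.1 `ReplicationState`.
#[derive(Debug, Default)]
struct ReplicationState {
    applied: RwLock<HashMap<ShardId, u64>>,
}

impl ReplicationState {
    /// Hypothetical setter; the real type may record progress differently.
    fn mark_applied(&self, follower: ShardId, seqno: u64) {
        self.applied.write().unwrap().insert(follower, seqno);
    }

    fn applied_seqno(&self, follower: ShardId) -> Option<u64> {
        self.applied.read().unwrap().get(&follower).copied()
    }
}

/// Std-only model of the gauge.
struct LagGauge {
    leader_seqno: RwLock<HashMap<ShardId, AtomicU64>>,
    follower_applied: Arc<ReplicationState>,
}

impl LagGauge {
    fn new(state: Arc<ReplicationState>) -> Self {
        Self {
            leader_seqno: RwLock::new(HashMap::new()),
            follower_applied: state,
        }
    }

    fn update_leader_seqno(&self, follower: ShardId, seqno: u64) {
        self.leader_seqno
            .write()
            .unwrap()
            .entry(follower)
            .or_insert_with(|| AtomicU64::new(0))
            .store(seqno, Ordering::Release);
    }

    fn lag_seqno(&self, follower: ShardId) -> i64 {
        let leader = self
            .leader_seqno
            .read()
            .unwrap()
            .get(&follower)
            .map(|a| a.load(Ordering::Acquire))
            .unwrap_or(0);
        let applied = self.follower_applied.applied_seqno(follower).unwrap_or(0);
        leader as i64 - applied as i64
    }
}

/// Ship seqnos 1..=100, then apply them all.
/// Returns (lag before apply, lag after apply).
fn lag_before_and_after_apply() -> (i64, i64) {
    let state = Arc::new(ReplicationState::default());
    let gauge = LagGauge::new(Arc::clone(&state));
    let follower = ShardId(1);

    for seqno in 1..=100 {
        gauge.update_leader_seqno(follower, seqno);
    }
    let before = gauge.lag_seqno(follower);

    state.mark_applied(follower, 100);
    let after = gauge.lag_seqno(follower);
    (before, after)
}
```

A real integration test would drive `WalShipper` and `FollowerDb` through `InProcessTransport` rather than calling the gauge directly; the assertions on the two lag values would stay the same.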