# Task 06: ReplicationLagGauge

## Delivers

`ReplicationLagGauge` in `tidal/src/replication/lag.rs`, tracking per-follower lag (`leader_seqno - follower_applied_seqno`). Exposed via `MetricsState` so existing Prometheus scraping picks it up automatically.

## Complexity: S

## Dependencies

- Phase 8.1 (ReplicationState)
- Task 03 (WalShipper -- for leader_seqno)

## Technical Design

```rust
// tidal/src/replication/lag.rs

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

use dashmap::DashMap;

// `ShardId` and `ReplicationState` come from Phase 8.1 (imports not shown).

/// Tracks per-follower replication lag.
///
/// Lag = leader's latest shipped seqno - follower's applied seqno.
/// A lag of 0 means the follower is fully caught up.
#[derive(Debug, Default)]
pub struct ReplicationLagGauge {
    /// Per-follower: last seqno the leader has shipped.
    leader_seqno: DashMap<ShardId, AtomicU64>,
    /// Shared replication state holding each follower's applied seqno.
    follower_applied: Arc<ReplicationState>,
}

impl ReplicationLagGauge {
    pub fn new(replication_state: Arc<ReplicationState>) -> Self {
        Self {
            leader_seqno: DashMap::new(),
            follower_applied: replication_state,
        }
    }

    /// Update the leader's known shipped seqno for a follower.
    pub fn update_leader_seqno(&self, follower: ShardId, seqno: u64) {
        self.leader_seqno
            .entry(follower)
            .or_insert_with(|| AtomicU64::new(0))
            .store(seqno, Ordering::Release);
    }

    /// Get the current lag for a follower in seqno units.
    ///
    /// Unknown followers count as seqno 0 on both sides. The result is
    /// negative if the follower's applied seqno is ahead of the value last
    /// recorded via `update_leader_seqno`.
    pub fn lag_seqno(&self, follower: ShardId) -> i64 {
        let leader = self.leader_seqno
            .get(&follower)
            .map(|a| a.load(Ordering::Acquire))
            .unwrap_or(0);
        let applied = self.follower_applied
            .applied_seqno(follower)
            .unwrap_or(0);
        leader as i64 - applied as i64
    }

    /// Collect Prometheus-style gauge values for all followers.
    pub fn collect_metrics(&self) -> Vec<(ShardId, i64)> {
        self.leader_seqno
            .iter()
            .map(|entry| {
                let follower = *entry.key();
                (follower, self.lag_seqno(follower))
            })
            .collect()
    }
}
```
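
The `as i64` casts in `lag_seqno` wrap for sequence numbers above `i64::MAX`. If that range ever matters, one option is a clamped subtraction. The following is only a sketch of that alternative; the helper name `signed_lag` is hypothetical and not part of the design above.

```rust
/// Signed difference `leader - applied` for two u64 sequence numbers,
/// clamped to the i64 range instead of wrapping.
/// Hypothetical helper, not part of the task's API.
pub fn signed_lag(leader: u64, applied: u64) -> i64 {
    if leader >= applied {
        // Follower is behind or caught up: non-negative lag.
        i64::try_from(leader - applied).unwrap_or(i64::MAX)
    } else {
        // Follower is ahead of the recorded leader value: negative lag.
        i64::try_from(applied - leader)
            .map(|d| -d)
            .unwrap_or(i64::MIN)
    }
}
```

With this helper, `signed_lag(100, 0)` is 100, `signed_lag(100, 100)` is 0, and `signed_lag(5, 7)` is -2, matching the plain subtraction for realistic seqno values.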

### MetricsState integration

```rust
// tidal/src/db/metrics.rs (existing metrics module)

impl MetricsState {
    // New accessor, to be called from the existing collect() method.
    // Returns 0 when no lag gauge is configured (`lag_gauge` is `None`).
    pub fn replication_lag_seqno(&self, follower_shard: u16) -> i64 {
        self.lag_gauge
            .as_ref()
            .map(|g| g.lag_seqno(ShardId(follower_shard)))
            .unwrap_or(0)
    }
}
```
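
How the existing `/metrics` handler serializes gauges is not shown in this task. As a rough sketch of what the scraped output could look like, the function below renders per-follower samples in the Prometheus text exposition format. The function name `render_lag_metrics`, the `follower_shard` label name, and the HELP text are assumptions for illustration.

```rust
use std::fmt::Write;

/// Render (follower shard, lag) samples as Prometheus text exposition lines.
/// The `follower_shard` label name is an assumption, not taken from the codebase.
pub fn render_lag_metrics(samples: &[(u16, i64)]) -> String {
    let mut out = String::new();
    out.push_str("# HELP replication_lag_seqno Leader shipped seqno minus follower applied seqno.\n");
    out.push_str("# TYPE replication_lag_seqno gauge\n");
    for (shard, lag) in samples {
        // Writing to a String cannot fail, so the Result is ignored.
        let _ = writeln!(out, "replication_lag_seqno{{follower_shard=\"{shard}\"}} {lag}");
    }
    out
}
```

Feeding it the output of `collect_metrics()` (with each `ShardId` unwrapped to its `u16`) would yield one sample line per follower.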

## Acceptance Criteria

- [ ] `ReplicationLagGauge::lag_seqno(follower)` returns `leader_seqno - follower_applied_seqno`
- [ ] `lag_seqno` returns 0 when follower is fully caught up
- [ ] `lag_seqno` returns > 0 when follower is behind
- [ ] `collect_metrics()` returns a snapshot of all follower lags
- [ ] Integrated into `MetricsState` so existing `/metrics` endpoint exposes `replication_lag_seqno` gauge
- [ ] Integration test: leader writes 100 segments; before follower applies them, lag = 100; after apply, lag = 0
- [ ] `cargo clippy -- -D warnings` and `cargo fmt` pass
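
The before/after-apply criterion can be sketched without the real replication stack. The types below (`ShardId`, `FakeReplicationState`, `FakeLagGauge`) are hypothetical std-only stand-ins, not the real `tidal` types, and the scenario assumes the seqno advances by one per shipped segment.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the real `ShardId` (assumed to be a `u16` newtype).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ShardId(pub u16);

/// Stand-in for `ReplicationState`: follower -> last applied seqno.
#[derive(Debug, Default)]
pub struct FakeReplicationState {
    applied: Mutex<HashMap<ShardId, u64>>,
}

impl FakeReplicationState {
    pub fn set_applied(&self, follower: ShardId, seqno: u64) {
        self.applied.lock().unwrap().insert(follower, seqno);
    }

    pub fn applied_seqno(&self, follower: ShardId) -> Option<u64> {
        self.applied.lock().unwrap().get(&follower).copied()
    }
}

/// Simplified gauge using the same lag arithmetic as the design.
#[derive(Debug, Default)]
pub struct FakeLagGauge {
    leader: Mutex<HashMap<ShardId, u64>>,
}

impl FakeLagGauge {
    pub fn update_leader_seqno(&self, follower: ShardId, seqno: u64) {
        self.leader.lock().unwrap().insert(follower, seqno);
    }

    pub fn lag_seqno(&self, state: &FakeReplicationState, follower: ShardId) -> i64 {
        let leader = self.leader.lock().unwrap().get(&follower).copied().unwrap_or(0);
        let applied = state.applied_seqno(follower).unwrap_or(0);
        leader as i64 - applied as i64
    }
}

/// Leader ships up to seqno 100, then the follower applies everything.
/// Returns (lag before apply, lag after apply).
pub fn lag_before_and_after_apply() -> (i64, i64) {
    let follower = ShardId(1);
    let state = FakeReplicationState::default();
    let gauge = FakeLagGauge::default();

    gauge.update_leader_seqno(follower, 100);
    let before = gauge.lag_seqno(&state, follower);

    state.set_applied(follower, 100);
    let after = gauge.lag_seqno(&state, follower);

    (before, after)
}
```

A real integration test would drive `WalShipper` and `FollowerDb` rather than setting seqnos by hand; this sketch only pins down the expected lag values.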