tidaldb/docs/planning/milestone-8/phase-6/task-04-performance-and-ci.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
Milestone 8 (phases 1-4):
- Shard-aware WAL segment naming, BatchHeader v2, ShardRouter
- Transport trait, InProcessTransport, WalShipper, FollowerDb
- HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine
- Session replication bridge with SeqNo/HWM, idempotency store

Forage application:
- Multi-source discovery engine with MAB exploration
- Embedding-based label system, server handlers, UI refresh

Other:
- QUICKSTART.md, README.md, milestone-8 planning docs
- Hard negative union semantics, RLHF export enhancements
- Recovery benchmark and visibility test expansions
- Split 8 oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:17:19 -07:00

197 lines
6.8 KiB
Markdown

# Task 04: Performance Assertions + CI Integration
## Delivers
Performance assertions added to `m8_uat.rs` that verify: cross-region replication < 2s p99, failover < 10s, reconciliation overhead < 100ms. CI configuration ensuring M8 tests run on every PR without flakiness. A benchmark in `tidal/benches/replication.rs` for sustained 25K signals/sec throughput measurement.
## Complexity: S
## Dependencies
- Task 03 (UAT scenario tests)
## Technical Design
```rust
// tidal/tests/m8_uat.rs (additions)
/// Performance: cross-region replication latency < 2s p99.
///
/// Measures the latency from WAL write on leader to applied on follower.
/// Uses InProcessTransport (no real network). Asserts p99 < 2s.
#[tokio::test]
async fn perf_replication_latency_p99() {
let cluster = SimulatedCluster::build(three_region_config()).await;
let mut latencies_ns: Vec<u64> = Vec::with_capacity(1000);
for i in 0u64..1000 {
let item = EntityId::new(i);
let before_ns = crate::util::now_ns();
cluster.write_signal("view", item, 1.0);
// Wait until eu-west follower has applied this specific event.
cluster.await_event_applied(RegionId(1), before_ns, Duration::from_secs(3)).await;
let after_ns = crate::util::now_ns();
latencies_ns.push(after_ns - before_ns);
}
latencies_ns.sort_unstable();
let p99_ns = latencies_ns[(latencies_ns.len() as f64 * 0.99) as usize];
let p99_ms = p99_ns / 1_000_000;
assert!(
p99_ms < 2000,
"replication latency p99 = {}ms, must be < 2000ms (in-process transport overhead)",
p99_ms
);
println!("Replication latency: p50={}ms p99={}ms",
latencies_ns[latencies_ns.len() / 2] / 1_000_000,
p99_ms,
);
}
/// Performance: failover completes in < 10 seconds.
#[tokio::test]
async fn perf_failover_under_10s() {
let cluster = Arc::new(SimulatedCluster::build(three_region_config()).await);
let start = Instant::now();
let _crash = ShardCrash::crash(ShardId(0), cluster.clone(), false).await;
while !cluster.has_leader() {
tokio::time::sleep(Duration::from_millis(50)).await;
assert!(
start.elapsed() < Duration::from_secs(10),
"failover must complete within 10 seconds"
);
}
let elapsed = start.elapsed();
println!("Failover completed in {}ms", elapsed.as_millis());
assert!(elapsed < Duration::from_secs(10));
}
/// Performance: reconciliation overhead < 100ms for 10K events per side.
#[tokio::test]
async fn perf_reconciliation_overhead() {
let cluster = SimulatedCluster::build(three_region_config()).await;
// Inject partition.
let partition = NetworkPartition::symmetric(
RegionId(0), RegionId(2), cluster.transport_factory()
);
// Write 10K events on each side.
for i in 0..10_000u64 {
cluster.write_signal("view", EntityId::new(i), 1.0);
cluster.node(RegionId(2)).db
.signal("view", EntityId::new(i + 10_000), 1.0, Timestamp::now())
.unwrap();
}
drop(partition); // Heal.
let reconcile_start = Instant::now();
cluster.reconcile_all().await;
cluster.await_full_convergence(Duration::from_secs(10)).await;
let reconcile_elapsed = reconcile_start.elapsed();
println!("Reconciliation of 20K events took {}ms", reconcile_elapsed.as_millis());
assert!(
reconcile_elapsed < Duration::from_millis(100),
"reconciliation overhead must be < 100ms for 20K total events (got {}ms)",
reconcile_elapsed.as_millis()
);
}
```
```rust
// tidal/benches/replication.rs
//! Replication throughput benchmark: sustained 25K signals/sec across 3 regions.
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
fn bench_signal_throughput(c: &mut Criterion) {
let rt = tokio::runtime::Runtime::new().unwrap();
let cluster = rt.block_on(SimulatedCluster::build(three_region_config()));
let mut group = c.benchmark_group("replication");
group.throughput(Throughput::Elements(25_000));
group.bench_function("25k_signals_per_sec", |b| {
b.iter(|| {
rt.block_on(async {
for i in 0..25_000u64 {
cluster.write_signal("view", EntityId::new(i % 10_000), 1.0);
}
cluster.await_full_convergence(Duration::from_secs(5)).await;
});
});
});
group.finish();
}
criterion_group!(benches, bench_signal_throughput);
criterion_main!(benches);
```
### CI Configuration
```yaml
# .github/workflows/m8-tests.yml (or equivalent in the project's CI)
name: M8 Replication Tests
on:
pull_request:
paths:
- 'tidal/src/replication/**'
- 'tidal/src/testing/**'
- 'tidal/tests/m8*'
jobs:
m8-unit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- run: cargo test --manifest-path tidal/Cargo.toml --lib --features test-utils
m8-integration:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- run: cargo test --manifest-path tidal/Cargo.toml --test m8_uat --features test-utils
- run: cargo test --manifest-path tidal/Cargo.toml --test m8p2_replication --features test-utils
- run: cargo test --manifest-path tidal/Cargo.toml --test m8p3_crdt --features test-utils
- run: cargo test --manifest-path tidal/Cargo.toml --test m8p4_session --features test-utils
- run: cargo test --manifest-path tidal/Cargo.toml --test m8p5_multitenancy --features test-utils
clippy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
with:
components: clippy, rustfmt
- run: cargo clippy --manifest-path tidal/Cargo.toml -D warnings --features test-utils
- run: cargo fmt --manifest-path tidal/Cargo.toml --check
```
## Acceptance Criteria
- [ ] `perf_replication_latency_p99`: 1000-sample p99 replication latency < 2000ms with InProcessTransport; prints p50 and p99
- [ ] `perf_failover_under_10s`: leader election + follower promotion completes within 10 seconds; timing printed
- [ ] `perf_reconciliation_overhead`: reconciliation of 20K total events (10K per side) completes in < 100ms; timing printed
- [ ] `benches/replication.rs`: 25K signals/sec benchmark runs without panic; throughput number printed by criterion
- [ ] CI configuration: M8 integration tests run on PRs that touch `tidal/src/replication/**` or `tidal/tests/m8*`; job timeout = 5 minutes
- [ ] No flaky tests: run `cargo test --test m8_uat` 5 times in a row; all passes (deterministic due to InProcessTransport)
- [ ] Total CI job runtime (all M8 integration tests) < 3 minutes
- [ ] `cargo clippy -D warnings` and `cargo fmt` pass