# Task 04: Performance Assertions + CI Integration

## Delivers

Performance assertions added to `m8_uat.rs` that verify three budgets: cross-region replication latency < 2s at p99, failover < 10s, and reconciliation overhead < 100ms. A CI configuration that runs the M8 tests, without flakiness, on every PR that touches the replication code or its tests. A benchmark in `tidal/benches/replication.rs` that measures sustained throughput against the 25K signals/sec target.

## Complexity: S

## Dependencies

- Task 03 (UAT scenario tests)

## Technical Design

```rust
// tidal/tests/m8_uat.rs (additions)

/// Performance: cross-region replication latency < 2s p99.
///
/// Measures the latency from WAL write on leader to applied on follower.
/// Uses InProcessTransport (no real network). Asserts p99 < 2s.
#[tokio::test]
async fn perf_replication_latency_p99() {
    let cluster = SimulatedCluster::build(three_region_config()).await;

    let mut latencies_ns: Vec<u64> = Vec::with_capacity(1000);

    for i in 0u64..1000 {
        let item = EntityId::new(i);
        // NOTE: in an integration test, `crate::` is the test crate itself. If
        // `now_ns` lives in the library, this path would be `tidal::util::now_ns()`.
        let before_ns = crate::util::now_ns();

        cluster.write_signal("view", item, 1.0);

        // Wait until eu-west follower has applied this specific event.
        cluster.await_event_applied(RegionId(1), before_ns, Duration::from_secs(3)).await;

        let after_ns = crate::util::now_ns();
        // saturating_sub avoids a u64 underflow panic if `now_ns` is not monotonic.
        latencies_ns.push(after_ns.saturating_sub(before_ns));
    }

    latencies_ns.sort_unstable();
    // With 1000 sorted samples this reads index 990.
    let p99_ns = latencies_ns[(latencies_ns.len() as f64 * 0.99) as usize];
    let p99_ms = p99_ns / 1_000_000;

    assert!(
        p99_ms < 2000,
        "replication latency p99 = {}ms, must be < 2000ms (in-process transport overhead)",
        p99_ms
    );

    println!(
        "Replication latency: p50={}ms p99={}ms",
        latencies_ns[latencies_ns.len() / 2] / 1_000_000,
        p99_ms,
    );
}

/// Performance: failover completes in < 10 seconds.
#[tokio::test]
async fn perf_failover_under_10s() {
    let cluster = Arc::new(SimulatedCluster::build(three_region_config()).await);

    let start = Instant::now();
    let _crash = ShardCrash::crash(ShardId(0), cluster.clone(), false).await;

    // Poll every 50ms until a leader exists; fail as soon as the budget is exceeded.
    while !cluster.has_leader() {
        tokio::time::sleep(Duration::from_millis(50)).await;
        assert!(
            start.elapsed() < Duration::from_secs(10),
            "failover must complete within 10 seconds"
        );
    }

    let elapsed = start.elapsed();
    println!("Failover completed in {}ms", elapsed.as_millis());
    assert!(elapsed < Duration::from_secs(10));
}

/// Performance: reconciliation overhead < 100ms for 10K events per side (20K total).
#[tokio::test]
async fn perf_reconciliation_overhead() {
    let cluster = SimulatedCluster::build(three_region_config()).await;

    // Inject partition.
    let partition = NetworkPartition::symmetric(
        RegionId(0), RegionId(2), cluster.transport_factory()
    );

    // Write 10K events on each side, using disjoint entity ID ranges.
    for i in 0..10_000u64 {
        cluster.write_signal("view", EntityId::new(i), 1.0);
        cluster.node(RegionId(2)).db
            .signal("view", EntityId::new(i + 10_000), 1.0, Timestamp::now())
            .unwrap();
    }

    drop(partition); // Heal.

    // The timed window covers both the reconcile pass and the wait for convergence.
    let reconcile_start = Instant::now();
    cluster.reconcile_all().await;
    cluster.await_full_convergence(Duration::from_secs(10)).await;
    let reconcile_elapsed = reconcile_start.elapsed();

    println!("Reconciliation of 20K events took {}ms", reconcile_elapsed.as_millis());
    assert!(
        reconcile_elapsed < Duration::from_millis(100),
        "reconciliation overhead must be < 100ms for 20K total events (got {}ms)",
        reconcile_elapsed.as_millis()
    );
}
```
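The p99 lookup in `perf_replication_latency_p99` indexes the sorted samples at `(len * 0.99) as usize`, which for 1000 samples is index 990, one position above the nearest-rank p99. If a strict nearest-rank definition is preferred, the lookup could be factored into a small helper. The following is a minimal sketch using only the standard library; `percentile_nearest_rank` is a hypothetical name, not something the project is known to define.

```rust
/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `pct` is a whole percentage in 1..=100 (hypothetical helper for illustration).
fn percentile_nearest_rank(sorted: &[u64], pct: usize) -> u64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    assert!((1..=100).contains(&pct), "pct must be in 1..=100");
    // 1-based rank = ceil(pct / 100 * n), computed with integer arithmetic.
    let rank = (pct * sorted.len() + 99) / 100;
    sorted[rank - 1]
}

fn main() {
    // Synthetic latencies of 1ms..=1000ms, deliberately unsorted at first.
    let mut latencies_ns: Vec<u64> = (1..=1000u64).rev().map(|ms| ms * 1_000_000).collect();
    latencies_ns.sort_unstable();

    let p50_ms = percentile_nearest_rank(&latencies_ns, 50) / 1_000_000;
    let p99_ms = percentile_nearest_rank(&latencies_ns, 99) / 1_000_000;
    println!("p50={}ms p99={}ms", p50_ms, p99_ms);
}
```

Either definition is defensible at 1000 samples; the point of a helper is that p50 and p99 are computed the same way and the choice is stated in one place.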

```rust
// tidal/benches/replication.rs

//! Replication throughput benchmark: sustained 25K signals/sec across 3 regions.

use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion, Throughput};
// `SimulatedCluster`, `three_region_config`, and `EntityId` must also be imported.
// The path below is an assumption; adjust it to wherever the crate exports them.
// use tidal::testing::{three_region_config, SimulatedCluster};

fn bench_signal_throughput(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let cluster = rt.block_on(SimulatedCluster::build(three_region_config()));

    let mut group = c.benchmark_group("replication");
    // Each iteration writes 25K signals, so criterion reports elements per second.
    group.throughput(Throughput::Elements(25_000));
    group.bench_function("25k_signals_per_sec", |b| {
        b.iter(|| {
            rt.block_on(async {
                for i in 0..25_000u64 {
                    cluster.write_signal("view", EntityId::new(i % 10_000), 1.0);
                }
                cluster.await_full_convergence(Duration::from_secs(5)).await;
            });
        });
    });
    group.finish();
}

criterion_group!(benches, bench_signal_throughput);
criterion_main!(benches);
```
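The benchmark reports a throughput figure but does not itself assert the 25K signals/sec target. The arithmetic behind that target is simple: a batch of 25K signals has to be written and converged in at most one second. The sketch below spells this out with the standard library only; `signals_per_sec`, `meets_target`, and the 800ms measurement are hypothetical and exist only for illustration.

```rust
use std::time::Duration;

/// Target from this task: sustained 25K signals/sec.
const TARGET_SIGNALS_PER_SEC: f64 = 25_000.0;

/// Signals per second for `count` signals processed in `elapsed`.
fn signals_per_sec(count: u64, elapsed: Duration) -> f64 {
    if elapsed.is_zero() {
        // Avoid dividing by zero for a measurement below timer resolution.
        return f64::INFINITY;
    }
    count as f64 / elapsed.as_secs_f64()
}

/// True when the measured rate reaches the 25K signals/sec target.
fn meets_target(count: u64, elapsed: Duration) -> bool {
    signals_per_sec(count, elapsed) >= TARGET_SIGNALS_PER_SEC
}

fn main() {
    // Hypothetical measurement: one 25K batch converged in 800ms.
    let elapsed = Duration::from_millis(800);
    println!(
        "{:.0} signals/sec, meets target: {}",
        signals_per_sec(25_000, elapsed),
        meets_target(25_000, elapsed)
    );
}
```

When reading the criterion output, the same rule applies: a mean iteration time above one second for the `25k_signals_per_sec` function means the target is not being met.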

### CI Configuration

```yaml
# .github/workflows/m8-tests.yml (or equivalent in the project's CI)

name: M8 Replication Tests

on:
  pull_request:
    paths:
      - 'tidal/src/replication/**'
      - 'tidal/src/testing/**'
      - 'tidal/tests/m8*'

jobs:
  m8-unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test --manifest-path tidal/Cargo.toml --lib --features test-utils

  m8-integration:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test --manifest-path tidal/Cargo.toml --test m8_uat --features test-utils
      - run: cargo test --manifest-path tidal/Cargo.toml --test m8p2_replication --features test-utils
      - run: cargo test --manifest-path tidal/Cargo.toml --test m8p3_crdt --features test-utils
      - run: cargo test --manifest-path tidal/Cargo.toml --test m8p4_session --features test-utils
      - run: cargo test --manifest-path tidal/Cargo.toml --test m8p5_multitenancy --features test-utils

  clippy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: clippy, rustfmt
      # Lint flags go after `--` so they are passed to clippy rather than to cargo.
      - run: cargo clippy --manifest-path tidal/Cargo.toml --features test-utils -- -D warnings
      - run: cargo fmt --manifest-path tidal/Cargo.toml --check
```

## Acceptance Criteria

- [ ] `perf_replication_latency_p99`: 1000-sample p99 replication latency < 2000ms with InProcessTransport; prints p50 and p99
- [ ] `perf_failover_under_10s`: leader election + follower promotion completes within 10 seconds; timing printed
- [ ] `perf_reconciliation_overhead`: reconciliation of 20K total events (10K per side) completes in < 100ms; timing printed
- [ ] `benches/replication.rs`: 25K signals/sec benchmark runs without panic; throughput number printed by criterion
- [ ] CI configuration: M8 integration tests run on PRs that touch `tidal/src/replication/**`, `tidal/src/testing/**`, or `tidal/tests/m8*`; job timeout = 5 minutes
- [ ] No flaky tests: run `cargo test --test m8_uat` 5 times in a row; all five runs pass (deterministic due to InProcessTransport)
- [ ] Total CI job runtime (all M8 integration tests) < 3 minutes
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass
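The "5 times in a row" flakiness criterion can be checked by hand, or with a small runner that stops at the first failing run. The following is a sketch under stated assumptions: `all_runs_pass` and `run_m8_uat_once` are hypothetical helpers, and the `cargo test` arguments mirror the CI step above rather than any script known to exist in the repository.

```rust
use std::process::Command;

/// Calls `attempt` up to `runs` times, passing the 1-based run number.
/// Returns `Err(n)` with the number of the first failing run.
fn all_runs_pass(runs: usize, mut attempt: impl FnMut(usize) -> bool) -> Result<(), usize> {
    for n in 1..=runs {
        if !attempt(n) {
            return Err(n);
        }
    }
    Ok(())
}

/// One invocation of the M8 UAT suite; true only if cargo exits successfully.
/// A failure to spawn cargo at all is treated as a failed run.
fn run_m8_uat_once() -> bool {
    Command::new("cargo")
        .args(["test", "--manifest-path", "tidal/Cargo.toml"])
        .args(["--test", "m8_uat", "--features", "test-utils"])
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

fn main() {
    match all_runs_pass(5, |_| run_m8_uat_once()) {
        Ok(()) => println!("5/5 runs passed"),
        Err(n) => println!("run {n} failed; treat the suite as flaky"),
    }
}
```

A CI wrapper would presumably turn the `Err` case into a non-zero exit code; the sketch only prints so that the control flow stays easy to follow.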