stemedb/ai-lookup/features/chaos-testing.md
jordan b3e8a9a058 feat: Multi-application expansion with chaos testing and community UI
2026-02-04 01:24:14 -07:00

# Chaos Testing (Phase 8A)
The `stemedb-chaos` crate provides infrastructure for testing distributed Episteme clusters under failure conditions.
## Overview
Chaos testing verifies that Episteme clusters:
- Continue accepting writes during network partitions
- Converge correctly after a partition heals
- Handle node failures and recovery
- Maintain CRDT invariants under all conditions
- Handle clock skew correctly with hybrid logical clock (HLC) timestamps
## Components
### Test Harness
| Component | Purpose |
|-----------|---------|
| `ChaosNode` | Simulated cluster node with fault injection support |
| `TestCluster` | Manages N ChaosNodes with shared fault controllers |
### Fault Injection
| Controller | Capabilities |
|------------|--------------|
| `NetworkController` | Partitions, latency, message drops |
| `ClockController` | Clock skew injection for HLC testing |
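To make the clock-skew idea concrete, here is a minimal, std-only sketch of a hybrid logical clock whose physical reading is shifted by an injected offset. It follows the standard HLC update rules; `SkewedClock`, `Timestamp`, `now`, and `observe` are hypothetical names chosen for illustration and are not the crate's `ClockController` or `SkewedHlc` API.

```rust
/// HLC timestamp: (wall-clock millis, logical counter).
/// The derived ordering compares `wall_ms` first, then `counter`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Timestamp {
    wall_ms: i64,
    counter: u32,
}

/// Hypothetical HLC that reads physical time through an injected offset.
struct SkewedClock {
    offset_ms: i64,
    last: Timestamp,
}

impl SkewedClock {
    fn new(offset_ms: i64) -> Self {
        Self { offset_ms, last: Timestamp { wall_ms: 0, counter: 0 } }
    }

    /// Timestamp a local event. `physical_ms` is the true time;
    /// the node only ever sees it shifted by `offset_ms`.
    fn now(&mut self, physical_ms: i64) -> Timestamp {
        let seen = physical_ms + self.offset_ms;
        self.last = if seen > self.last.wall_ms {
            Timestamp { wall_ms: seen, counter: 0 }
        } else {
            // Physical time did not advance past the last timestamp:
            // bump the logical counter to stay monotonic.
            Timestamp { wall_ms: self.last.wall_ms, counter: self.last.counter + 1 }
        };
        self.last
    }

    /// Fold in a timestamp received from another node.
    fn observe(&mut self, physical_ms: i64, remote: Timestamp) -> Timestamp {
        let seen = physical_ms + self.offset_ms;
        let wall = seen.max(self.last.wall_ms).max(remote.wall_ms);
        let counter = if wall == self.last.wall_ms && wall == remote.wall_ms {
            self.last.counter.max(remote.counter) + 1
        } else if wall == self.last.wall_ms {
            self.last.counter + 1
        } else if wall == remote.wall_ms {
            remote.counter + 1
        } else {
            0
        };
        self.last = Timestamp { wall_ms: wall, counter };
        self.last
    }
}
```

With a +5000 ms offset on one node, a timestamp it issues still sorts before anything a receiving node issues afterwards, because the receiver adopts the larger wall component and advances the counter.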
### CRDT Property Verification
| Function | Verifies |
|----------|----------|
| `verify_commutativity()` | `merge(A, B) = merge(B, A)` |
| `verify_associativity()` | `merge(merge(A, B), C) = merge(A, merge(B, C))` |
| `verify_idempotence()` | `merge(A, A) = A` |
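The three properties in the table can be illustrated on the simplest CRDT, a grow-only set whose merge is set union. This is a sketch under that assumption, not the crate's `verify_*` functions: `GSet`, `merge`, `gset`, and the `is_*` helpers are hypothetical names, and the real store may merge differently.

```rust
use std::collections::BTreeSet;

/// Grow-only set of assertion hashes: a stand-in for the real CRDT store.
type GSet = BTreeSet<u64>;

/// Merge is set union, the textbook G-Set join.
fn merge(a: &GSet, b: &GSet) -> GSet {
    a.union(b).copied().collect()
}

/// merge(A, B) == merge(B, A)
fn is_commutative(a: &GSet, b: &GSet) -> bool {
    merge(a, b) == merge(b, a)
}

/// merge(merge(A, B), C) == merge(A, merge(B, C))
fn is_associative(a: &GSet, b: &GSet, c: &GSet) -> bool {
    merge(&merge(a, b), c) == merge(a, &merge(b, c))
}

/// merge(A, A) == A
fn is_idempotent(a: &GSet) -> bool {
    merge(a, a) == *a
}

/// Hypothetical helper: build a replica state from a slice.
fn gset(items: &[u64]) -> GSet {
    items.iter().copied().collect()
}
```

Calling `is_commutative(&gset(&[1, 2]), &gset(&[2, 3]))` returns `true`; a property-based test would run the same checks over many generated states rather than a few hand-picked ones.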
## Running Chaos Tests
```bash
# All chaos tests
cargo test -p stemedb-chaos
# Partition tests only
cargo test -p stemedb-chaos --test partition_tests
# Consistency tests only
cargo test -p stemedb-chaos --test consistency_tests
# Unit tests only
cargo test -p stemedb-chaos --lib
```
## Test Categories
### Partition Tests (8 tests)
| Test | Scenario |
|------|----------|
| `test_5_node_kill_2_convergence` | 5-node cluster survives 2 node failures |
| `test_partition_between_groups_convergence` | [0,1,2] vs [3,4] partition and heal |
| `test_message_reordering_convergence` | 100 writes in random order converge |
| `test_message_duplication_idempotent` | Repeated syncs don't create duplicates |
| `test_cascading_failure_recovery` | Sequential node failures and recovery |
| `test_swim_suspicion_not_false_positive` | Slow node marked Suspect, then Alive |
| `test_asymmetric_partition` | One-way partition (0→1 works, 1→0 blocked) |
| `test_write_availability_during_partition` | All nodes can write when fully partitioned |
### Consistency Tests (11 tests)
| Test | Scenario |
|------|----------|
| `test_crdt_eventual_consistency` | 1000 concurrent writes across 5 nodes |
| `test_crdt_commutativity` | Different merge orders produce same result |
| `test_crdt_associativity` | Merge grouping doesn't affect result |
| `test_crdt_idempotence` | Syncing same data repeatedly is stable |
| `test_hlc_handles_clock_skew` | ±5 second skew still converges |
| `test_hlc_monotonic_under_partition` | HLC remains monotonic during partition |
| `test_supersession_ordering_with_clock_skew` | HLC ordering with 2s skew |
| `test_concurrent_writes_same_subject_under_partition` | Both writes survive (append-only) |
| `test_large_merkle_diff_eventual_convergence` | 1500 vs 500 assertions converge |
| `test_all_crdt_properties` | Property-based verification |
| `test_eventual_consistency_property` | Eventual consistency verification |
## Example Usage
### Basic Cluster Test
```rust
use stemedb_chaos::TestCluster;

#[tokio::test]
async fn test_basic_convergence() {
    let mut cluster = TestCluster::spawn(3).await.expect("spawn");

    // Write to node 0
    cluster
        .get_node_mut(0)
        .write_assertion("subject", "pred", 1000)
        .await
        .expect("write");

    // Sync all nodes
    cluster.sync_all().await.expect("sync");

    // Verify convergence
    cluster.assert_converged();
}
```
### Partition Testing
```rust
use stemedb_chaos::TestCluster;

#[tokio::test]
async fn test_partition() {
    let mut cluster = TestCluster::spawn(4).await.expect("spawn");

    // Create partition: [0,1] vs [2,3]
    cluster.network().partition(&[0, 1], &[2, 3]);

    // Write to both sides
    cluster.get_node_mut(0).write_assertion("a", "pred", 1000).await.expect("write");
    cluster.get_node_mut(2).write_assertion("b", "pred", 2000).await.expect("write");

    // Heal and sync
    cluster.network().heal();
    cluster.sync_all().await.expect("sync");

    // Both writes survive
    cluster.assert_converged();
    assert_eq!(cluster.get_node(0).assertion_count(), 2);
}
```
### Clock Skew Testing
```rust
use stemedb_chaos::TestCluster;

#[tokio::test]
async fn test_clock_skew() {
    let mut cluster = TestCluster::spawn(2).await.expect("spawn");

    // Inject +5 second skew on node 0
    cluster.clock().inject_skew(0, 5000);

    // Verify skew is detected
    assert!(cluster.clock().has_significant_skew(0, 1));

    // Write with skewed timestamps
    cluster.get_node_mut(0).write_assertion("skewed", "pred", 1000).await.expect("write");

    // Cluster still converges
    cluster.sync_all().await.expect("sync");
    cluster.assert_converged();
}
```
## Architecture
```
TestCluster
├── nodes: Vec<ChaosNode>
├── network: Arc<NetworkController>
└── clock: Arc<ClockController>
ChaosNode
├── crdt_store: CrdtAssertionStore
├── merkle_tree: MerkleTree
├── hash_to_data: HashMap<Hash, (Subject, Data)>
├── hlc: SkewedHlc (respects ClockController)
└── alive: bool (kill/revive simulation)
NetworkController
├── partitions: DashMap<(from, to), bool>
├── latencies: DashMap<(from, to), Duration>
└── drop_rates: DashMap<(from, to), f64>
ClockController
├── node_offsets: DashMap<node, offset_ms>
└── global_offset_ms: AtomicI64
```
## Design Decisions
### Channel-Based vs iptables/tc
**Chosen: Channel-based interception**
- Aligns with existing `SimNode` pattern in `partition_tolerance.rs`
- Deterministic and CI-friendly (no elevated privileges)
- Production code stays unchanged
- Real network tests can be added later as an optional e2e suite
### Sync Semantics
- `sync_from()` on ChaosNode checks partition state before syncing
- `sync_all()` on TestCluster does full mesh sync respecting partitions
- Content-addressed storage ensures idempotent merges
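The sync semantics above can be sketched as a pull that first checks link state and then inserts entries keyed by a hash of their content. Everything here is an assumption for illustration: `Replica`, `content_hash`, and the `link_up` flag are hypothetical, and `DefaultHasher` stands in for whatever cryptographic hash the real store uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Content address: a hash of the payload (illustrative, not cryptographic).
fn content_hash(payload: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    payload.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical replica holding payloads keyed by their content address.
#[derive(Default)]
struct Replica {
    store: BTreeMap<u64, String>,
}

impl Replica {
    fn write(&mut self, payload: &str) {
        self.store.insert(content_hash(payload), payload.to_string());
    }

    /// Pull entries from `peer` unless the link is partitioned.
    /// Returns how many entries were new to this replica.
    fn sync_from(&mut self, peer: &Replica, link_up: bool) -> usize {
        if !link_up {
            return 0; // partitioned: nothing crosses the link
        }
        let mut added = 0;
        for (hash, payload) in &peer.store {
            if !self.store.contains_key(hash) {
                self.store.insert(*hash, payload.clone());
                added += 1;
            }
        }
        added
    }
}
```

Since the key is derived from the content, syncing the same data a second time finds every key already present and adds nothing, which is the idempotence the last bullet refers to.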
## Metrics
The controllers track:
- `messages_dropped`: Total messages dropped (partition + drop rate)
- `messages_delayed`: Total messages delayed (latency)
- `partition_events`: Number of partition operations
```rust
let summary = cluster.summary();
println!("Dropped: {}", summary.messages_dropped);
println!("Delayed: {}", summary.messages_delayed);
println!("Max skew: {}ms", summary.max_clock_skew_ms);
```
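For readers curious how such counters might be kept without locking, here is a small sketch using atomics. The field names mirror the metrics listed above, but `FaultCounters`, `Outcome`, and the methods are hypothetical and say nothing about how the crate actually records them.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counter set; fields mirror the documented metric names.
#[derive(Default)]
struct FaultCounters {
    messages_dropped: AtomicU64,
    messages_delayed: AtomicU64,
    partition_events: AtomicU64,
}

/// What happened to a single message on its way through the fault layer.
enum Outcome {
    Delivered,
    Delayed,
    Dropped,
}

impl FaultCounters {
    /// Record one message outcome. A drop counts the same whether it came
    /// from a partition or from a configured drop rate.
    fn record(&self, outcome: &Outcome) {
        match outcome {
            Outcome::Dropped => {
                self.messages_dropped.fetch_add(1, Ordering::Relaxed);
            }
            Outcome::Delayed => {
                self.messages_delayed.fetch_add(1, Ordering::Relaxed);
            }
            Outcome::Delivered => {}
        }
    }

    /// Record one partition operation.
    fn record_partition(&self) {
        self.partition_events.fetch_add(1, Ordering::Relaxed);
    }

    fn dropped(&self) -> u64 {
        self.messages_dropped.load(Ordering::Relaxed)
    }
}
```

`Ordering::Relaxed` is enough here because each counter is an independent tally and no other memory is synchronized through it.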
## Related Documentation
- [Architecture](../../architecture.md) - Overall system design
- [Distributed Write Path](../../docs/research/distributed-write-path.md) - CRDT replication
- [Phase 6 UAT](./phase6-uat.md) - Cluster coordination tests