stemedb/ai-lookup/features/chaos-testing.md


Chaos Testing (Phase 8A)

The stemedb-chaos crate provides infrastructure for testing Episteme distributed clusters under failure conditions.

Overview

Chaos testing verifies that Episteme clusters:

  • Continue accepting writes during network partitions
  • Converge correctly after a partition heals
  • Handle node failures and recovery
  • Maintain CRDT invariants under all conditions
  • Handle clock skew correctly with hybrid logical clock (HLC) timestamps

Components

Test Harness

| Component | Purpose |
|-----------|---------|
| ChaosNode | Simulated cluster node with fault injection support |
| TestCluster | Manages N ChaosNodes with shared fault controllers |

Fault Injection

| Controller | Capabilities |
|------------|--------------|
| NetworkController | Partitions, latency, message drops |
| ClockController | Clock skew injection for HLC testing |

CRDT Property Verification

| Function | Verifies |
|----------|----------|
| verify_commutativity() | merge(A, B) = merge(B, A) |
| verify_associativity() | merge(merge(A, B), C) = merge(A, merge(B, C)) |
| verify_idempotence() | merge(A, A) = A |
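
The three laws can be checked against any merge function. The following is a minimal sketch, assuming a grow-only set of content hashes as a stand-in for the real store. The `GSet` type, the `merge` function, and the free-function signatures are illustrative assumptions and may not match the stemedb-chaos API.

```rust
use std::collections::BTreeSet;

/// Stand-in state: a grow-only set of content hashes (assumption, not the real store).
type GSet = BTreeSet<u64>;

/// Merge is set union, which is why the three laws hold for this type.
fn merge(a: &GSet, b: &GSet) -> GSet {
    a.union(b).copied().collect()
}

/// merge(A, B) = merge(B, A)
fn verify_commutativity(a: &GSet, b: &GSet) -> bool {
    merge(a, b) == merge(b, a)
}

/// merge(merge(A, B), C) = merge(A, merge(B, C))
fn verify_associativity(a: &GSet, b: &GSet, c: &GSet) -> bool {
    merge(&merge(a, b), c) == merge(a, &merge(b, c))
}

/// merge(A, A) = A
fn verify_idempotence(a: &GSet) -> bool {
    merge(a, a) == *a
}

#[test]
fn gset_satisfies_crdt_laws() {
    let a = GSet::from([1, 2]);
    let b = GSet::from([2, 3]);
    let c = GSet::from([4]);
    assert!(verify_commutativity(&a, &b));
    assert!(verify_associativity(&a, &b, &c));
    assert!(verify_idempotence(&a));
}
```

A check on fixed inputs like this only spot-checks the laws; a property-based test would generate many random sets and assert the same three predicates.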

Running Chaos Tests

# All chaos tests
cargo test -p stemedb-chaos

# Partition tests only
cargo test -p stemedb-chaos --test partition_tests

# Consistency tests only
cargo test -p stemedb-chaos --test consistency_tests

# Unit tests only
cargo test -p stemedb-chaos --lib

Test Categories

Partition Tests (8 tests)

| Test | Scenario |
|------|----------|
| test_5_node_kill_2_convergence | 5-node cluster survives 2 node failures |
| test_partition_between_groups_convergence | [0,1,2] vs [3,4] partition and heal |
| test_message_reordering_convergence | 100 writes in random order converge |
| test_message_duplication_idempotent | Repeated syncs don't create duplicates |
| test_cascading_failure_recovery | Sequential node failures and recovery |
| test_swim_suspicion_not_false_positive | Slow node marked Suspect, then Alive |
| test_asymmetric_partition | One-way partition (0→1 works, 1→0 blocked) |
| test_write_availability_during_partition | All nodes can write when fully partitioned |

Consistency Tests (11 tests)

| Test | Scenario |
|------|----------|
| test_crdt_eventual_consistency | 1000 concurrent writes across 5 nodes |
| test_crdt_commutativity | Different merge orders produce same result |
| test_crdt_associativity | Merge grouping doesn't affect result |
| test_crdt_idempotence | Syncing same data repeatedly is stable |
| test_hlc_handles_clock_skew | ±5 second skew still converges |
| test_hlc_monotonic_under_partition | HLC remains monotonic during partition |
| test_supersession_ordering_with_clock_skew | HLC ordering with 2s skew |
| test_concurrent_writes_same_subject_under_partition | Both writes survive (append-only) |
| test_large_merkle_diff_eventual_convergence | 1500 vs 500 assertions converge |
| test_all_crdt_properties | Property-based verification |
| test_eventual_consistency_property | Eventual consistency verification |

Example Usage

Basic Cluster Test

use stemedb_chaos::TestCluster;

#[tokio::test]
async fn test_basic_convergence() {
    let mut cluster = TestCluster::spawn(3).await.expect("spawn");

    // Write to node 0
    cluster.get_node_mut(0)
        .write_assertion("subject", "pred", 1000)
        .await.expect("write");

    // Sync all nodes
    cluster.sync_all().await.expect("sync");

    // Verify convergence
    cluster.assert_converged();
}

Partition Testing

use stemedb_chaos::TestCluster;

#[tokio::test]
async fn test_partition() {
    let mut cluster = TestCluster::spawn(4).await.expect("spawn");

    // Create partition: [0,1] vs [2,3]
    cluster.network().partition(&[0, 1], &[2, 3]);

    // Write to both sides
    cluster.get_node_mut(0).write_assertion("a", "pred", 1000).await.expect("write");
    cluster.get_node_mut(2).write_assertion("b", "pred", 2000).await.expect("write");

    // Heal and sync
    cluster.network().heal();
    cluster.sync_all().await.expect("sync");

    // Both writes survive
    cluster.assert_converged();
    assert_eq!(cluster.get_node(0).assertion_count(), 2);
}

Clock Skew Testing

use stemedb_chaos::TestCluster;

#[tokio::test]
async fn test_clock_skew() {
    let mut cluster = TestCluster::spawn(2).await.expect("spawn");

    // Inject +5 second skew on node 0
    cluster.clock().inject_skew(0, 5000);

    // Verify skew is detected
    assert!(cluster.clock().has_significant_skew(0, 1));

    // Write with skewed timestamps
    cluster.get_node_mut(0).write_assertion("skewed", "pred", 1000).await.expect("write");

    // Cluster still converges
    cluster.sync_all().await.expect("sync");
    cluster.assert_converged();
}
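
To see why a skewed node does not break ordering, it helps to look at how a hybrid logical clock advances. The sketch below follows the standard HLC update rules. The `Hlc` and `HlcClock` types are hypothetical and are not the crate's `SkewedHlc`; the physical clock reading is passed in explicitly so skew can be simulated.

```rust
/// HLC timestamp: physical milliseconds plus a logical counter.
/// Field order matters: the derived ordering compares wall_ms first, then counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    wall_ms: u64,
    counter: u32,
}

/// Hypothetical clock holding the last timestamp it issued.
struct HlcClock {
    last: Hlc,
}

impl HlcClock {
    fn new() -> Self {
        Self { last: Hlc { wall_ms: 0, counter: 0 } }
    }

    /// Timestamp a local event given the (possibly skewed) physical reading.
    fn now(&mut self, physical_ms: u64) -> Hlc {
        if physical_ms > self.last.wall_ms {
            self.last = Hlc { wall_ms: physical_ms, counter: 0 };
        } else {
            // Physical time did not advance past what we have seen: bump the counter.
            self.last.counter += 1;
        }
        self.last
    }

    /// Fold in a remote timestamp so that later local events order after it.
    fn observe(&mut self, remote: Hlc, physical_ms: u64) -> Hlc {
        let wall = physical_ms.max(self.last.wall_ms).max(remote.wall_ms);
        let counter = if wall == self.last.wall_ms && wall == remote.wall_ms {
            self.last.counter.max(remote.counter) + 1
        } else if wall == self.last.wall_ms {
            self.last.counter + 1
        } else if wall == remote.wall_ms {
            remote.counter + 1
        } else {
            0
        };
        self.last = Hlc { wall_ms: wall, counter };
        self.last
    }
}

#[test]
fn skewed_sender_does_not_break_ordering() {
    let mut ahead = HlcClock::new();
    let mut behind = HlcClock::new();
    let sent = ahead.now(6_000); // true time 1000 ms, +5000 ms skew
    let received = behind.observe(sent, 1_000);
    let next = behind.now(1_001);
    assert!(sent < received && received < next);
}
```

The node with the slower physical clock adopts the larger wall component it has observed and keeps ordering events through the counter, so its timestamps stay monotonic even though its own clock reads five seconds earlier.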

Architecture

TestCluster
├── nodes: Vec<ChaosNode>
├── network: Arc<NetworkController>
└── clock: Arc<ClockController>

ChaosNode
├── crdt_store: CrdtAssertionStore
├── merkle_tree: MerkleTree
├── hash_to_data: HashMap<Hash, (Subject, Data)>
├── hlc: SkewedHlc (respects ClockController)
└── alive: bool (kill/revive simulation)

NetworkController
├── partitions: DashMap<(from, to), bool>
├── latencies: DashMap<(from, to), Duration>
└── drop_rates: DashMap<(from, to), f64>

ClockController
├── node_offsets: DashMap<node, offset_ms>
└── global_offset_ms: AtomicI64
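
Keying partition state by a directed (from, to) pair is what makes one-way partitions expressible. Here is a minimal sketch of that idea, assuming a `Mutex<HashSet>` from the standard library in place of the DashMap shown above. The method bodies and the `block_one_way` and `can_send` helpers are hypothetical, not the crate's implementation.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Sketch of a partition table keyed by directed (from, to) pairs.
/// Methods take &self so the controller can be shared behind an Arc.
#[derive(Default)]
struct NetworkController {
    blocked: Mutex<HashSet<(usize, usize)>>,
}

impl NetworkController {
    /// Block traffic in both directions between every pair across the two groups.
    fn partition(&self, left: &[usize], right: &[usize]) {
        let mut blocked = self.blocked.lock().unwrap();
        for &l in left {
            for &r in right {
                blocked.insert((l, r));
                blocked.insert((r, l));
            }
        }
    }

    /// Block a single direction only (hypothetical helper for asymmetric partitions).
    fn block_one_way(&self, from: usize, to: usize) {
        self.blocked.lock().unwrap().insert((from, to));
    }

    /// True if a message from `from` to `to` would currently be delivered.
    fn can_send(&self, from: usize, to: usize) -> bool {
        !self.blocked.lock().unwrap().contains(&(from, to))
    }

    /// Remove every partition.
    fn heal(&self) {
        self.blocked.lock().unwrap().clear();
    }
}

#[test]
fn one_way_partition_blocks_single_direction() {
    let net = NetworkController::default();
    net.block_one_way(1, 0);
    assert!(net.can_send(0, 1));
    assert!(!net.can_send(1, 0));
}
```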

Design Decisions

Channel-Based vs iptables/tc

Chosen: Channel-based interception

  • Aligns with the existing SimNode pattern in partition_tolerance.rs
  • Deterministic and CI-friendly (no elevated privileges)
  • Production code stays unchanged
  • Real network tests can be added later as an optional e2e suite
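
Channel-based interception can be illustrated with an in-process link that consults a fault flag before delivering. This is a sketch using `std::sync::mpsc` only; `FaultyLink` and `faulty_link` are hypothetical names, and the actual crate may intercept messages differently.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;

/// A one-directional in-process link whose delivery can be switched off.
struct FaultyLink<T> {
    tx: Sender<T>,
    partitioned: Arc<AtomicBool>,
    dropped: Arc<AtomicU64>,
}

impl<T> FaultyLink<T> {
    /// Returns true if the message was delivered, false if it was dropped.
    fn send(&self, msg: T) -> bool {
        if self.partitioned.load(Ordering::SeqCst) {
            // Count the drop instead of delivering, as a partition would.
            self.dropped.fetch_add(1, Ordering::SeqCst);
            return false;
        }
        self.tx.send(msg).is_ok()
    }
}

/// Build a link plus the receiver, the fault switch, and the drop counter.
fn faulty_link<T>() -> (FaultyLink<T>, Receiver<T>, Arc<AtomicBool>, Arc<AtomicU64>) {
    let (tx, rx) = channel();
    let partitioned = Arc::new(AtomicBool::new(false));
    let dropped = Arc::new(AtomicU64::new(0));
    let link = FaultyLink {
        tx,
        partitioned: Arc::clone(&partitioned),
        dropped: Arc::clone(&dropped),
    };
    (link, rx, partitioned, dropped)
}

#[test]
fn partitioned_link_drops_messages() {
    let (link, rx, partitioned, dropped) = faulty_link::<&str>();
    assert!(link.send("before"));
    partitioned.store(true, Ordering::SeqCst);
    assert!(!link.send("during"));
    assert_eq!(rx.try_iter().count(), 1);
    assert_eq!(dropped.load(Ordering::SeqCst), 1);
}
```

Because the fault is a flag flipped by the test rather than a kernel-level rule, the outcome does not depend on timing or host configuration, which is the property the bullets above rely on.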

Sync Semantics

  • sync_from() on ChaosNode checks partition state before syncing
  • sync_all() on TestCluster does full mesh sync respecting partitions
  • Content-addressed storage ensures idempotent merges
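
The semantics above can be sketched with node state reduced to a set of content hashes. The `Node` and `Cluster` types, the signatures, and the return value of `sync_from` are assumptions for illustration, not the crate's definitions.

```rust
use std::collections::{BTreeSet, HashSet};

/// Node state reduced to a set of content hashes (assumption for illustration).
#[derive(Default, Clone)]
struct Node {
    hashes: BTreeSet<u64>,
}

struct Cluster {
    nodes: Vec<Node>,
    /// Directed (from, to) pairs that are currently partitioned.
    blocked: HashSet<(usize, usize)>,
}

impl Cluster {
    /// Pull everything `src` has into `dst`, unless the src -> dst path is blocked.
    /// Returns how many new entries were merged.
    fn sync_from(&mut self, dst: usize, src: usize) -> usize {
        if dst == src || self.blocked.contains(&(src, dst)) {
            return 0;
        }
        let incoming = self.nodes[src].hashes.clone();
        let before = self.nodes[dst].hashes.len();
        // Set insertion by content hash makes repeated merges a no-op.
        self.nodes[dst].hashes.extend(incoming);
        self.nodes[dst].hashes.len() - before
    }

    /// One full-mesh pass: every node pulls from every other node.
    fn sync_all(&mut self) {
        for dst in 0..self.nodes.len() {
            for src in 0..self.nodes.len() {
                self.sync_from(dst, src);
            }
        }
    }

    /// True when every node holds the same set of hashes.
    fn converged(&self) -> bool {
        self.nodes.windows(2).all(|w| w[0].hashes == w[1].hashes)
    }
}

#[test]
fn repeated_sync_is_idempotent() {
    let mut c = Cluster { nodes: vec![Node::default(); 2], blocked: HashSet::new() };
    c.nodes[1].hashes.insert(42);
    assert_eq!(c.sync_from(0, 1), 1);
    assert_eq!(c.sync_from(0, 1), 0);
    assert!(c.converged());
}
```

A single pass is enough in this sketch when no links are blocked; with asymmetric or partial partitions, a real implementation may need to repeat passes until no node reports new entries.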

Metrics

The controllers track:

  • messages_dropped: Total messages dropped (partition + drop rate)
  • messages_delayed: Total messages delayed (latency)
  • partition_events: Number of partition operations
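
Dropped and delayed message counts, along with the maximum clock skew, can be read from the cluster summary:

let summary = cluster.summary();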
println!("Dropped: {}", summary.messages_dropped);
println!("Delayed: {}", summary.messages_delayed);
println!("Max skew: {}ms", summary.max_clock_skew_ms);