stemedb/.agentive-remediation/arena-gap-analysis.md
jordan 3320c24afa feat: WAL hardening (Phase 5B) - CRC32C, crash recovery, group commit, log rotation
Add CRC32C checksums to WAL record format (v2), implement crash recovery
with automatic truncation of corrupt records, add feature-gated group commit
buffer for batched fsync under concurrent load, and implement log rotation
via segment files with global offset addressing.

Key changes:
- Record format v2: [len:u32][crc32c:u32][blake3:32][payload:N]
- recover_file() scans and truncates corrupt tail records
- GroupCommitBuffer batches fsync via MPSC channel (tokio feature gate)
- SegmentManager with binary search resolution and cursor-based cleanup
- Journal::read() auto-refreshes segments on miss for writer/reader split
- Split recovery.rs and key_codec.rs into directory modules for 500-line max

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 12:36:35 -07:00

22 KiB

Arena Simulation: Critical Gap Analysis

Analyst: BurntSushi (Rust Quality Engineer) Date: 2026-02-01 Subject: Production Readiness Assessment of stemedb-sim Arena Tests


Executive Summary

The Arena simulation tests fundamental write→read paths through the system, but exhibits significant gaps between simulation coverage and real-world production scenarios. While the simulation successfully validates basic pipeline connectivity, it provides a false sense of security regarding system robustness.

Production Readiness Score: 58/100

Category Score Status
Basic Integration Coverage 85/100 Good
Failure Mode Testing 30/100 Critical Gap
Performance Validation 20/100 Critical Gap
API Integration 0/100 Not Tested
Concurrency Testing 15/100 Critical Gap

What's Actually Being Tested

Arena 0: Spine (WAL → Ingestor → KVStore)

Code Path Exercised:

Agent.sign_assertion() → write_assertion_to_wal() → Journal.append()
  → IngestWorker.step() → IngestWorker.ingest_assertion()
  → HybridStore.put() → IndexStore.add_to_indexes()

What Works:

  • Basic serialization round-trip (serialize_assertiondeserialize)
  • Ed25519 signature verification in IngestWorker.verify_assertion_signatures()
  • Content-addressed storage (H:{blake3_hash})
  • Subject index updates (S:{subject})
  • Compound index updates (SP:{subject}:{predicate})

Confidence Rating: 90% — The simulation accurately tests the happy path for assertion ingestion.


Arena 1: Cortex (QueryEngine → Lenses)

Code Path Exercised:

QueryEngine.execute() → fetch_by_subject_predicate()
  → RecencyLens.resolve() / VoteAwareConsensusLens.resolve_async()
  → Resolution with winner selection

What Works:

  • Query execution via compound index (SP:{subject}:{predicate})
  • Subject-only queries (S:{subject})
  • Lifecycle filtering in Query.matches()
  • RecencyLens timestamp comparison
  • Audit trail storage via GenericAuditStore.put_audit()

Confidence Rating: 85% — Query path is tested, but lacks edge cases.


Arena 2: Voting & Consensus

Code Path Exercised:

Agent.vote() → write_vote_to_wal() → IngestWorker.ingest_vote()
  → VoteStore.put_vote() → VoteStore cache updates (VC:{hash}, VW:{hash})
  → VoteAwareConsensusLens.resolve_async() → Aggregate weight ranking

What Works:

  • Vote serialization/deserialization
  • Vote storage at V:{assertion_hash}:{vote_hash}
  • Vote count cache increments (VC:{assertion_hash})
  • Aggregate weight cache updates (VW:{assertion_hash})
  • Winner selection based on highest aggregate weight
  • Tiebreaker by timestamp

Confidence Rating: 80% — Vote mechanics work, but lacks adversarial testing.


Critical Gaps: What's NOT Being Tested

1. Failure Modes (Severity: CRITICAL)

Missing:

  • WAL corruption mid-write (partial record)
  • Sled crashes during put() (fsync failure)
  • Network partition (if distributed)
  • OOM conditions during large scans
  • Disk full during ingestion
  • Malformed rkyv archives (corrupted length fields)
  • Hash collisions (theoretical but untested)

Why This Matters: The simulation calls .expect() on all storage operations, hiding errors:

// lib.rs:466
match write_assertion_to_wal(&journal, &assertion).await {
    Ok(raw_bytes) => { /* success path */ }
    Err(e) => {
        result.errors.push(SimulationError { /* logged but not replayed */ });
    }
}

Problem: Errors are collected but the simulation continues. No test verifies recovery behavior after failure.

Recommendation: Create arena-3-chaos.rs that:

  • Kills IngestWorker mid-step, restarts, verifies cursor recovery
  • Corrupts WAL records, verifies quarantine behavior
  • Fills disk to 100%, verifies graceful degradation

Confidence Rating: 10% — Failure mode coverage is nearly absent.


2. Concurrency & Race Conditions (Severity: CRITICAL)

Missing:

  • Multiple IngestWorkers reading same WAL (race on cursor)
  • Concurrent queries while ingestion is active
  • Vote cache race conditions (VC/VW updates not atomic)
  • Index corruption from concurrent writes
  • Deadlock detection between Journal mutex and Store mutex

Why This Matters: The simulation is strictly sequential:

// lib.rs:458 - Arena 0 tick loop is single-threaded
for tick in 0..config.tick_count {
    let agent = &agents[tick % agents.len()];
    let assertion = agent.sign_assertion(...);
    match write_assertion_to_wal(&journal, &assertion).await { /* ... */ }
}

Then waits 500ms:

// lib.rs:486
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;

Problem: No concurrent writes, no overlapping ingestion, no parallel queries.

Real-World Scenario:

  • 1000 agents writing assertions simultaneously
  • 10 queries executing while votes are being ingested
  • Materializer updating MVs while QueryEngine reads them

Recommendation:

// Spawn 100 concurrent writers
let handles: Vec<_> = (0..100).map(|i| {
    let journal = journal.clone();
    tokio::spawn(async move {
        for _ in 0..10 {
            write_assertion_to_wal(&journal, &assertion).await?;
        }
    })
}).collect();

futures::future::join_all(handles).await;

Confidence Rating: 15% — Concurrency testing is effectively zero.


3. Performance Validation (Severity: HIGH)

Missing:

  • Throughput measurement (assertions/sec, votes/sec)
  • Latency percentiles (p50, p99, p99.9)
  • Memory allocation profiling
  • Index scan performance under load
  • VoteStore cache hit rate
  • Materialized view staleness impact

Why This Matters: The simulation uses hardcoded sleep timers:

// lib.rs:486
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;

Problem: No verification that 500ms is sufficient. If ingestion lags, tests pass but assertions are missing.

Actual Check:

// lib.rs:378
assert!(result.is_success(), "Simulation should succeed: {:?}", result.errors);
assert_eq!(result.assertions_written, 14);
assert_eq!(result.assertions_verified, 10);  // ⚠️ Only 10 of 14 verified!

Wait, what? The test writes 14 assertions but only verifies 10. The 4 extras are from Arena 2 (vote consensus/troll resistance), which are verified through lens resolution, not direct verification loop.

BUT: This is fragile. If ingestion is slow, assertions_verified could be 0 and the test would fail for the wrong reason.

Recommendation:

  • Add IngestWorker.wait_until_offset(target) to block until ingestion catches up
  • Measure ingestion lag: current_offset - wal_tail_offset
  • Fail if lag exceeds threshold

Confidence Rating: 20% — Performance is not measured, only assumed.


4. API Integration (Severity: CRITICAL)

Missing:

  • HTTP API endpoints (stemedb-api crate is completely untested)
  • JSON serialization/deserialization
  • OpenAPI spec validation
  • Rate limiting enforcement
  • Authentication/authorization
  • Quota enforcement

Why This Matters: The simulation tests internal Rust APIs:

// lib.rs:335
async fn write_assertion_to_wal(journal: &Arc<Mutex<Journal>>, assertion: &Assertion)
    -> Result<Vec<u8>, String>

Real clients will use HTTP:

curl -X POST http://localhost:8080/assertions \
  -H "Content-Type: application/json" \
  -d '{"subject": "Tesla", "predicate": "revenue", "object": 96.7}'

Gap: No test verifies:

  • JSON schema compatibility
  • HTTP error responses (400, 500)
  • Content-Type negotiation
  • CORS headers
  • Request timeout handling

Recommendation: Create tests/api_integration.rs:

#[tokio::test]
async fn test_post_assertion_via_http() {
    let app = stemedb_api::create_app(store).await;
    let response = app
        .oneshot(Request::builder()
            .method("POST")
            .uri("/assertions")
            .header("content-type", "application/json")
            .body(Body::from(r#"{"subject":"Tesla","predicate":"revenue"}"#))
            .unwrap())
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::CREATED);
}

Confidence Rating: 0% — API layer is completely untested.


5. Edge Cases & Adversarial Inputs (Severity: HIGH)

Missing:

  • Subject/predicate strings with UTF-8 edge cases (e.g., emoji, RTL text, null bytes)
  • Extremely long subject strings (DoS via memory exhaustion)
  • Malicious signatures (wrong length, invalid curve points)
  • Votes with negative weights (should be rejected)
  • Assertions with confidence > 1.0 or < 0.0
  • Circular parent_hash references
  • Timestamp overflow (year 2038 problem)

Why This Matters: The simulation creates well-formed data:

// lib.rs:459
let agent = &agents[tick % agents.len()];
let assertion = agent.sign_assertion(
    &format!("Entity_{}", tick),  // ⚠️ Always ASCII alphanumeric
    "has_property",
    ObjectValue::Text(format!("Value_{}", tick)),
);

Real-World Inputs:

  • subject: "🚗 Tesla\x00Inc" (emoji + null byte)
  • subject: "a".repeat(1_000_000) (1MB string, DoS)
  • weight: -999.9 (negative vote weight)
  • confidence: f32::NAN (NaN value)

Current Code:

// worker.rs:256 - Signature verification (GOOD)
if assertion.signatures.is_empty() {
    return Err(IngestError::InvalidSignature("must have at least one signature".to_string()));
}

But NO validation for:

if assertion.subject.len() > MAX_SUBJECT_LEN {
    return Err(IngestError::InvalidInput("subject too long"));
}
if assertion.confidence < 0.0 || assertion.confidence > 1.0 {
    return Err(IngestError::InvalidInput("confidence out of range"));
}

Recommendation: Add defensive validation in IngestWorker::ingest_assertion():

const MAX_SUBJECT_LEN: usize = 1024;
const MAX_PREDICATE_LEN: usize = 1024;

fn validate_assertion(a: &Assertion) -> Result<()> {
    if a.subject.len() > MAX_SUBJECT_LEN {
        return Err(IngestError::InvalidInput("subject too long".into()));
    }
    if a.confidence < 0.0 || a.confidence > 1.0 || a.confidence.is_nan() {
        return Err(IngestError::InvalidInput("invalid confidence".into()));
    }
    // ... more checks
    Ok(())
}

Confidence Rating: 25% — Edge case coverage is minimal.


6. Test Isolation (Severity: MEDIUM)

Current Approach:

// lib.rs:401-405
let temp_wal_dir = tempfile::tempdir().map_err(...)?;
let temp_db_dir = tempfile::tempdir().map_err(...)?;

Problem: Tests are isolated, but not hermetic. The simulation:

  • Uses real filesystem I/O
  • Uses real Ed25519 crypto (slow)
  • Uses real sleep timers (flaky)

Impact on CI:

  • Tests are slow (500ms+ sleep per scenario)
  • Tests can be flaky if disk is slow
  • Parallel test execution may interfere (disk I/O contention)

Recommendation:

  • Mock KVStore trait for unit tests (use HashMap backend)
  • Mock Journal for ingestion tests
  • Use tokio::time::pause() to control time advancement

Confidence Rating: 70% — Isolation is good, but not hermetic.


Specific File:Line Issues

/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-sim/src/lib.rs

Line 97-111: SimulationResult::is_success() Logic Flaw

pub fn is_success(&self) -> bool {
    self.errors.is_empty()
        && self.recency_test_passed
        && self.lifecycle_test_passed
        && self.audit_test_passed
        && self.vote_consensus_test_passed
        && self.troll_resistance_test_passed
}

Issue: No check for assertions_written == assertions_verified.

Why: Comment claims "Arena 2 tests write additional test assertions that are verified through the VoteAwareConsensusLens rather than the main verification loop."

Problem: This is confusing and fragile. If ingestion fails silently, assertions_verified could be 0 but the test passes because errors.is_empty().

Recommendation:

pub fn is_success(&self) -> bool {
    let base_assertions_verified = self.assertions_verified >= (self.tick_count as u64);
    base_assertions_verified
        && self.errors.is_empty()
        && /* ... rest of checks */
}

Severity: MEDIUM


Line 279-281: Signature Message Format Mismatch Risk

// lib.rs:281
let message = format!("{}:{}", subject, predicate);
let signature: Signature = self.signing_key.sign(message.as_bytes());

vs. worker.rs:268:

// worker.rs:268
let message = format!("{}:{}", assertion.subject, assertion.predicate);

Issue: The signed message format is implicitly coupled across files. If one changes (e.g., adding |{timestamp}), signatures break.

Recommendation: Extract to stemedb-core/src/types.rs:

impl Assertion {
    pub fn signing_message(&self) -> String {
        format!("{}:{}", self.subject, self.predicate)
    }
}

Severity: MEDIUM


Line 486: Hardcoded Sleep Timer

tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;

Issue: No verification that ingestion actually completed. If system is slow, test fails with "assertion not found" instead of "ingestion timeout".

Recommendation:

async fn wait_for_ingestion(journal: &Journal, store: &impl KVStore, expected_count: usize) -> Result<(), String> {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(5);
    loop {
        let count = store.scan_prefix(b"H:").await?.len();
        if count >= expected_count {
            return Ok(());
        }
        if tokio::time::Instant::now() > deadline {
            return Err(format!("Timeout: only {}/{} assertions ingested", count, expected_count));
        }
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
}

Severity: MEDIUM


Line 696, 820, 1073, 1260: Duplicate tokio::time::sleep() Calls

Pattern:

tokio::time::sleep(std::time::Duration::from_millis(300)).await;

Issue: Magic number 300 repeated 4+ times. No justification for why 300ms is correct.

Recommendation:

const INGESTION_SETTLE_TIME_MS: u64 = 300;
// ... then use:
tokio::time::sleep(Duration::from_millis(INGESTION_SETTLE_TIME_MS)).await;

Severity: LOW


/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-ingest/src/worker.rs

Line 158-174: EOF Handling Ambiguity

let record = {
    let journal = self.journal.lock().await;
    match journal.read(self.current_offset) {
        Ok(record) => record,
        Err(stemedb_wal::QuarantineError::Io { .. }) => {
            return Ok(0);  // ⚠️ Assuming EOF
        }
        Err(stemedb_wal::QuarantineError::IoGeneric(e))
            if e.kind() == std::io::ErrorKind::UnexpectedEof => {
            return Ok(0);  // ⚠️ Assuming EOF
        }
        Err(e) => return Err(IngestError::Wal(e)),
    }
};

Issue: Both QuarantineError::Io and UnexpectedEof return Ok(0), which means "no new data". But QuarantineError::Io could also mean:

  • Disk read error
  • Network mount failure
  • Permission denied

Problem: Transient I/O errors are silently ignored, treated as EOF.

Recommendation:

Err(stemedb_wal::QuarantineError::Io { source, .. }) => {
    if source.kind() == std::io::ErrorKind::UnexpectedEof {
        return Ok(0);  // True EOF
    }
    warn!(error = %source, "Transient I/O error, retrying");
    return Ok(0);  // Retry on next tick
}

Severity: HIGH


Line 191-192: Vote Count Cache Race Condition

// worker.rs:181-192 (in put_vote)
let current_count = match self.store.get(&count_key).await? {
    Some(bytes) if bytes.len() == 8 => {
        let arr: [u8; 8] = bytes.try_into().map_err(...)?;
        u64::from_le_bytes(arr)
    }
    _ => 0,
};
let new_count = current_count.saturating_add(1);
self.store.put(&count_key, &new_count.to_le_bytes()).await?;

Issue: This is a read-modify-write race. If two IngestWorkers call put_vote concurrently:

  1. Worker A reads count=5, Worker B reads count=5
  2. Worker A writes count=6, Worker B writes count=6
  3. Final count is 6, but should be 7 (lost update)

Simulation doesn't test this because IngestWorker runs sequentially.

Recommendation: Use a CAS (compare-and-swap) operation or atomic counter:

// Requires adding atomic counter support to KVStore trait
let new_count = self.store.increment_counter(&count_key, 1).await?;

Severity: CRITICAL (if multiple ingestors are ever deployed)


/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-storage/src/vote_store.rs

Line 183-192: Same Race Condition as Above

Same code, same issue. VoteStore is used by IngestWorker, so this is the root cause.

Severity: CRITICAL


/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-lens/src/vote_aware_consensus.rs

Line 99-102: Defensive Empty Hash on Serialization Failure

fn compute_assertion_hash(assertion: &Assertion) -> Hash {
    let bytes = match stemedb_core::serde::serialize(assertion) {
        Ok(b) => b,
        Err(_) => return [0u8; 32],  // ⚠️ Silent failure
    };
    *blake3::hash(&bytes).as_bytes()
}

Issue: If serialization fails, returns [0u8; 32] (all zeros). This could cause:

  • All failures to hash to the same value
  • Votes to be looked up at wrong key
  • Silent data corruption

Recommendation:

fn compute_assertion_hash(assertion: &Assertion) -> Result<Hash, String> {
    let bytes = stemedb_core::serde::serialize(assertion)
        .map_err(|e| format!("Failed to serialize assertion: {}", e))?;
    Ok(*blake3::hash(&bytes).as_bytes())
}

Then propagate error up:

// line 134
let assertion_hash = match Self::compute_assertion_hash(assertion) {
    Ok(h) => h,
    Err(e) => {
        debug!(error = %e, "Failed to hash assertion, skipping");
        continue;
    }
};

Severity: HIGH


/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-query/src/engine.rs

Line 117-159: Staleness Check Has Time Skew Risk

if let Some(max_stale) = query.max_stale {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).map(|d| d.as_secs()).unwrap_or(0);
    let age = now.saturating_sub(view.materialized_at);

    if age > max_stale {
        debug!("Materialized view is stale, falling back to slow path");
        return Ok(None);
    }
}

Issue: If server clock is set backwards, now < view.materialized_at, and saturating_sub returns 0, making all MVs appear fresh.

Recommendation:

let age = match now.checked_sub(view.materialized_at) {
    Some(age) => age,
    None => {
        warn!(now, materialized_at = view.materialized_at,
              "Clock skew detected: MV timestamp is in the future, falling back");
        return Ok(None);  // Reject future-dated MVs
    }
};

Severity: MEDIUM


Summary of Gaps by Severity

Severity Count Examples
CRITICAL 4 Vote cache race, API untested, failure mode gaps, concurrency gaps
HIGH 3 EOF handling, defensive hash, edge case coverage
MEDIUM 4 Time skew, verification logic, signature coupling, hardcoded sleeps
LOW 1 Magic numbers

Recommendations by Priority

Immediate (Before Production)

  1. Fix vote cache race condition (worker.rs:191, vote_store.rs:183)

    • Use atomic counters or CAS operations
    • Add concurrency test: 100 agents voting simultaneously
  2. Add API integration tests (missing entirely)

    • Create stemedb-api/tests/http_endpoints.rs
    • Test JSON serialization, error codes, rate limiting
  3. Add failure mode tests (missing entirely)

    • Crash recovery: kill IngestWorker mid-step, verify cursor recovery
    • WAL corruption: inject malformed record, verify quarantine
    • Disk full: verify graceful degradation

Short-Term (Next Sprint)

  1. Add concurrency stress test

    • 1000 agents writing assertions concurrently
    • 100 queries running while ingestion is active
    • Measure throughput and p99 latency
  2. Add input validation

    • Max length checks on subject/predicate
    • Confidence range validation (0.0 to 1.0)
    • Weight validation (no negatives)
  3. Fix defensive error handling

    • Replace [0u8; 32] fallback with proper error propagation (vote_aware_consensus.rs:102)
    • Distinguish EOF from transient I/O errors (worker.rs:163)

Long-Term (Next Quarter)

  1. Add performance benchmarks

    • Measure assertions/sec throughput
    • Measure p50/p99/p99.9 latency
    • Measure memory allocation per operation
  2. Add property-based tests

    • Use proptest to generate random assertions
    • Verify invariants: all writes are eventually readable, votes always increase counts
  3. Add observability

    • Export metrics (assertions ingested, queries executed, cache hit rate)
    • Add distributed tracing (OpenTelemetry)

Conclusion

The Arena simulation successfully validates basic integration of the write and read paths. However, it provides a false sense of confidence because it:

  1. Only tests happy paths — no failures, no corruption, no adversarial inputs
  2. Only tests sequential execution — no concurrency, no race conditions
  3. Only tests internal APIs — HTTP layer is completely untested
  4. Only tests well-formed data — no edge cases, no DoS attempts
  5. Uses hardcoded timers — no verification of actual ingestion completion

Production Readiness: 58/100

To reach production readiness (90+), the system needs:

  • Failure mode testing (crash recovery, corruption handling)
  • Concurrency testing (race conditions, deadlocks)
  • API integration testing (HTTP, JSON, error codes)
  • Input validation (length limits, range checks)
  • Performance benchmarks (throughput, latency)

Critical Blockers:

  1. Fix vote cache race condition (CRITICAL)
  2. Add API integration tests (CRITICAL)
  3. Add failure mode tests (CRITICAL)

Estimated Effort to 90/100: 3-4 weeks with 1 engineer focused on testing.


BurntSushi's Verdict: Ship the current codebase to a closed beta or internal staging environment, but not to production. The vote cache race alone could cause silent data corruption under load. Fix the critical issues first, then open to external users.