stemedb/.agentive-remediation/arena-gap-analysis.md

# Arena Simulation: Critical Gap Analysis

**Analyst:** BurntSushi (Rust Quality Engineer)
**Date:** 2026-02-01
**Subject:** Production Readiness Assessment of stemedb-sim Arena Tests

---

## Executive Summary

The Arena simulation tests fundamental write→read paths through the system, but exhibits **significant gaps** between simulation coverage and real-world production scenarios. While the simulation successfully validates basic pipeline connectivity, it provides a **false sense of security** regarding system robustness.

**Production Readiness Score: 58/100**

| Category | Score | Status |
|----------|-------|--------|
| Basic Integration Coverage | 85/100 | ✅ Good |
| Failure Mode Testing | 30/100 | ❌ Critical Gap |
| Performance Validation | 20/100 | ❌ Critical Gap |
| API Integration | 0/100 | ❌ Not Tested |
| Concurrency Testing | 15/100 | ❌ Critical Gap |

---

## What's Actually Being Tested

### Arena 0: Spine (WAL → Ingestor → KVStore)

**Code Path Exercised:**
```
Agent.sign_assertion() → write_assertion_to_wal() → Journal.append()
  → IngestWorker.step() → IngestWorker.ingest_assertion()
  → HybridStore.put() → IndexStore.add_to_indexes()
```

**What Works:**
- ✅ Basic serialization round-trip (`serialize_assertion` → `deserialize`)
- ✅ Ed25519 signature verification in `IngestWorker.verify_assertion_signatures()`
- ✅ Content-addressed storage (`H:{blake3_hash}`)
- ✅ Subject index updates (`S:{subject}`)
- ✅ Compound index updates (`SP:{subject}:{predicate}`)

**Confidence Rating: 90%** — The simulation accurately tests the happy path for assertion ingestion.

---

### Arena 1: Cortex (QueryEngine → Lenses)

**Code Path Exercised:**
```
QueryEngine.execute() → fetch_by_subject_predicate()
  → RecencyLens.resolve() / VoteAwareConsensusLens.resolve_async()
  → Resolution with winner selection
```

**What Works:**
- ✅ Query execution via compound index (`SP:{subject}:{predicate}`)
- ✅ Subject-only queries (`S:{subject}`)
- ✅ Lifecycle filtering in `Query.matches()`
- ✅ RecencyLens timestamp comparison
- ✅ Audit trail storage via `GenericAuditStore.put_audit()`

**Confidence Rating: 85%** — Query path is tested, but lacks edge cases.

---

### Arena 2: Voting & Consensus

**Code Path Exercised:**
```
Agent.vote() → write_vote_to_wal() → IngestWorker.ingest_vote()
  → VoteStore.put_vote() → VoteStore cache updates (VC:{hash}, VW:{hash})
  → VoteAwareConsensusLens.resolve_async() → Aggregate weight ranking
```

**What Works:**
- ✅ Vote serialization/deserialization
- ✅ Vote storage at `V:{assertion_hash}:{vote_hash}`
- ✅ Vote count cache increments (`VC:{assertion_hash}`)
- ✅ Aggregate weight cache updates (`VW:{assertion_hash}`)
- ✅ Winner selection based on highest aggregate weight
- ✅ Tiebreaker by timestamp

**Confidence Rating: 80%** — Vote mechanics work, but lacks adversarial testing.

---

## Critical Gaps: What's NOT Being Tested

### 1. Failure Modes (Severity: CRITICAL)

**Missing:**
- ❌ WAL corruption mid-write (partial record)
- ❌ Sled crashes during `put()` (fsync failure)
- ❌ Network partition (if distributed)
- ❌ OOM conditions during large scans
- ❌ Disk full during ingestion
- ❌ Malformed rkyv archives (corrupted length fields)
- ❌ Hash collisions (theoretical but untested)

**Why This Matters:**
The simulation calls `.expect()` on all storage operations, hiding errors:
```rust
// lib.rs:466
match write_assertion_to_wal(&journal, &assertion).await {
    Ok(raw_bytes) => { /* success path */ }
    Err(e) => {
        result.errors.push(SimulationError { /* logged but not replayed */ });
    }
}
```

**Problem:** Errors are collected but the simulation continues. No test verifies recovery behavior after failure.

**Recommendation:**
Create `arena-3-chaos.rs` that:
- Kills IngestWorker mid-step, restarts, verifies cursor recovery
- Corrupts WAL records, verifies quarantine behavior
- Fills disk to 100%, verifies graceful degradation

**Confidence Rating: 10%** — Failure mode coverage is nearly absent.

---

### 2. Concurrency & Race Conditions (Severity: CRITICAL)

**Missing:**
- ❌ Multiple IngestWorkers reading same WAL (race on cursor)
- ❌ Concurrent queries while ingestion is active
- ❌ Vote cache race conditions (VC/VW updates not atomic)
- ❌ Index corruption from concurrent writes
- ❌ Deadlock detection between Journal mutex and Store mutex

**Why This Matters:**
The simulation is **strictly sequential**:
```rust
// lib.rs:458 - Arena 0 tick loop is single-threaded
for tick in 0..config.tick_count {
    let agent = &agents[tick % agents.len()];
    let assertion = agent.sign_assertion(...);
    match write_assertion_to_wal(&journal, &assertion).await { /* ... */ }
}
```

**Then waits 500ms:**
```rust
// lib.rs:486
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;
```

**Problem:** No concurrent writes, no overlapping ingestion, no parallel queries.

**Real-World Scenario:**
- 1000 agents writing assertions simultaneously
- 10 queries executing while votes are being ingested
- Materializer updating MVs while QueryEngine reads them

**Recommendation:**
```rust
// Spawn 100 concurrent writers
let handles: Vec<_> = (0..100).map(|i| {
    let journal = journal.clone();
    tokio::spawn(async move {
        for _ in 0..10 {
            write_assertion_to_wal(&journal, &assertion).await?;
        }
    })
}).collect();

futures::future::join_all(handles).await;
```

**Confidence Rating: 15%** — Concurrency testing is effectively zero.

---

### 3. Performance Validation (Severity: HIGH)

**Missing:**
- ❌ Throughput measurement (assertions/sec, votes/sec)
- ❌ Latency percentiles (p50, p99, p99.9)
- ❌ Memory allocation profiling
- ❌ Index scan performance under load
- ❌ VoteStore cache hit rate
- ❌ Materialized view staleness impact

**Why This Matters:**
The simulation uses **hardcoded sleep timers**:
```rust
// lib.rs:486
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;
```

**Problem:** No verification that 500ms is sufficient. If ingestion lags, tests pass but assertions are missing.

**Actual Check:**
```rust
// lib.rs:378
assert!(result.is_success(), "Simulation should succeed: {:?}", result.errors);
assert_eq!(result.assertions_written, 14);
assert_eq!(result.assertions_verified, 10);  // ⚠️ Only 10 of 14 verified!
```

**Wait, what?** The test writes 14 assertions but only verifies 10. The 4 extras are from Arena 2 (vote consensus/troll resistance), which are verified through lens resolution, not direct verification loop.

**BUT:** This is fragile. If ingestion is slow, `assertions_verified` could be 0 and the test would fail for the wrong reason.

**Recommendation:**
- Add `IngestWorker.wait_until_offset(target)` to block until ingestion catches up
- Measure ingestion lag: `current_offset - wal_tail_offset`
- Fail if lag exceeds threshold

**Confidence Rating: 20%** — Performance is not measured, only assumed.

---

### 4. API Integration (Severity: CRITICAL)

**Missing:**
- ❌ HTTP API endpoints (`stemedb-api` crate is completely untested)
- ❌ JSON serialization/deserialization
- ❌ OpenAPI spec validation
- ❌ Rate limiting enforcement
- ❌ Authentication/authorization
- ❌ Quota enforcement

**Why This Matters:**
The simulation tests **internal Rust APIs**:
```rust
// lib.rs:335
async fn write_assertion_to_wal(journal: &Arc<Mutex<Journal>>, assertion: &Assertion)
    -> Result<Vec<u8>, String>
```

**Real clients will use HTTP:**
```bash
curl -X POST http://localhost:8080/assertions \
  -H "Content-Type: application/json" \
  -d '{"subject": "Tesla", "predicate": "revenue", "object": 96.7}'
```

**Gap:** No test verifies:
- JSON schema compatibility
- HTTP error responses (400, 500)
- Content-Type negotiation
- CORS headers
- Request timeout handling

**Recommendation:**
Create `tests/api_integration.rs`:
```rust
#[tokio::test]
async fn test_post_assertion_via_http() {
    let app = stemedb_api::create_app(store).await;
    let response = app
        .oneshot(Request::builder()
            .method("POST")
            .uri("/assertions")
            .header("content-type", "application/json")
            .body(Body::from(r#"{"subject":"Tesla","predicate":"revenue"}"#))
            .unwrap())
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::CREATED);
}
```

**Confidence Rating: 0%** — API layer is completely untested.

---

### 5. Edge Cases & Adversarial Inputs (Severity: HIGH)

**Missing:**
- ❌ Subject/predicate strings with UTF-8 edge cases (e.g., emoji, RTL text, null bytes)
- ❌ Extremely long subject strings (DoS via memory exhaustion)
- ❌ Malicious signatures (wrong length, invalid curve points)
- ❌ Votes with negative weights (should be rejected)
- ❌ Assertions with `confidence > 1.0` or `< 0.0`
- ❌ Circular parent_hash references
- ❌ Timestamp overflow (year 2038 problem)

**Why This Matters:**
The simulation creates **well-formed data**:
```rust
// lib.rs:459
let agent = &agents[tick % agents.len()];
let assertion = agent.sign_assertion(
    &format!("Entity_{}", tick),  // ⚠️ Always ASCII alphanumeric
    "has_property",
    ObjectValue::Text(format!("Value_{}", tick)),
);
```

**Real-World Inputs:**
- `subject: "🚗 Tesla\x00Inc"` (emoji + null byte)
- `subject: "a".repeat(1_000_000)` (1MB string, DoS)
- `weight: -999.9` (negative vote weight)
- `confidence: f32::NAN` (NaN value)

**Current Code:**
```rust
// worker.rs:256 - Signature verification (GOOD)
if assertion.signatures.is_empty() {
    return Err(IngestError::InvalidSignature("must have at least one signature".to_string()));
}
```

**But NO validation for:**
```rust
if assertion.subject.len() > MAX_SUBJECT_LEN {
    return Err(IngestError::InvalidInput("subject too long"));
}
if assertion.confidence < 0.0 || assertion.confidence > 1.0 {
    return Err(IngestError::InvalidInput("confidence out of range"));
}
```

**Recommendation:**
Add defensive validation in `IngestWorker::ingest_assertion()`:
```rust
const MAX_SUBJECT_LEN: usize = 1024;
const MAX_PREDICATE_LEN: usize = 1024;

fn validate_assertion(a: &Assertion) -> Result<()> {
    if a.subject.len() > MAX_SUBJECT_LEN {
        return Err(IngestError::InvalidInput("subject too long".into()));
    }
    if a.confidence < 0.0 || a.confidence > 1.0 || a.confidence.is_nan() {
        return Err(IngestError::InvalidInput("invalid confidence".into()));
    }
    // ... more checks
    Ok(())
}
```

**Confidence Rating: 25%** — Edge case coverage is minimal.

---

### 6. Test Isolation (Severity: MEDIUM)

**Current Approach:**
```rust
// lib.rs:401-405
let temp_wal_dir = tempfile::tempdir().map_err(...)?;
let temp_db_dir = tempfile::tempdir().map_err(...)?;
```

**Problem:** Tests are isolated, but **not hermetic**. The simulation:
- Uses real filesystem I/O
- Uses real Ed25519 crypto (slow)
- Uses real sleep timers (flaky)

**Impact on CI:**
- Tests are slow (500ms+ sleep per scenario)
- Tests can be flaky if disk is slow
- Parallel test execution may interfere (disk I/O contention)

**Recommendation:**
- Mock `KVStore` trait for unit tests (use `HashMap` backend)
- Mock `Journal` for ingestion tests
- Use `tokio::time::pause()` to control time advancement

**Confidence Rating: 70%** — Isolation is good, but not hermetic.

---

## Specific File:Line Issues

### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-sim/src/lib.rs`

#### Line 97-111: `SimulationResult::is_success()` Logic Flaw
```rust
pub fn is_success(&self) -> bool {
    self.errors.is_empty()
        && self.recency_test_passed
        && self.lifecycle_test_passed
        && self.audit_test_passed
        && self.vote_consensus_test_passed
        && self.troll_resistance_test_passed
}
```

**Issue:** No check for `assertions_written == assertions_verified`.

**Why:** Comment claims "Arena 2 tests write additional test assertions that are verified through the VoteAwareConsensusLens rather than the main verification loop."

**Problem:** This is confusing and fragile. If ingestion fails silently, `assertions_verified` could be 0 but the test passes because `errors.is_empty()`.

**Recommendation:**
```rust
pub fn is_success(&self) -> bool {
    let base_assertions_verified = self.assertions_verified >= (self.tick_count as u64);
    base_assertions_verified
        && self.errors.is_empty()
        && /* ... rest of checks */
}
```

**Severity:** MEDIUM

---

#### Line 279-281: Signature Message Format Mismatch Risk
```rust
// lib.rs:281
let message = format!("{}:{}", subject, predicate);
let signature: Signature = self.signing_key.sign(message.as_bytes());
```

**vs. worker.rs:268:**
```rust
// worker.rs:268
let message = format!("{}:{}", assertion.subject, assertion.predicate);
```

**Issue:** The signed message format is **implicitly coupled** across files. If one changes (e.g., adding `|{timestamp}`), signatures break.

**Recommendation:**
Extract to `stemedb-core/src/types.rs`:
```rust
impl Assertion {
    pub fn signing_message(&self) -> String {
        format!("{}:{}", self.subject, self.predicate)
    }
}
```

**Severity:** MEDIUM

---

#### Line 486: Hardcoded Sleep Timer
```rust
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;
```

**Issue:** No verification that ingestion actually completed. If system is slow, test fails with "assertion not found" instead of "ingestion timeout".

**Recommendation:**
```rust
async fn wait_for_ingestion(journal: &Journal, store: &impl KVStore, expected_count: usize) -> Result<(), String> {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(5);
    loop {
        let count = store.scan_prefix(b"H:").await?.len();
        if count >= expected_count {
            return Ok(());
        }
        if tokio::time::Instant::now() > deadline {
            return Err(format!("Timeout: only {}/{} assertions ingested", count, expected_count));
        }
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
}
```

**Severity:** MEDIUM

---

#### Line 696, 820, 1073, 1260: Duplicate `tokio::time::sleep()` Calls
**Pattern:**
```rust
tokio::time::sleep(std::time::Duration::from_millis(300)).await;
```

**Issue:** Magic number `300` repeated 4+ times. No justification for why 300ms is correct.

**Recommendation:**
```rust
const INGESTION_SETTLE_TIME_MS: u64 = 300;
// ... then use:
tokio::time::sleep(Duration::from_millis(INGESTION_SETTLE_TIME_MS)).await;
```

**Severity:** LOW

---

### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-ingest/src/worker.rs`

#### Line 158-174: EOF Handling Ambiguity
```rust
let record = {
    let journal = self.journal.lock().await;
    match journal.read(self.current_offset) {
        Ok(record) => record,
        Err(stemedb_wal::QuarantineError::Io { .. }) => {
            return Ok(0);  // ⚠️ Assuming EOF
        }
        Err(stemedb_wal::QuarantineError::IoGeneric(e))
            if e.kind() == std::io::ErrorKind::UnexpectedEof => {
            return Ok(0);  // ⚠️ Assuming EOF
        }
        Err(e) => return Err(IngestError::Wal(e)),
    }
};
```

**Issue:** Both `QuarantineError::Io` and `UnexpectedEof` return `Ok(0)`, which means "no new data". But `QuarantineError::Io` could also mean:
- Disk read error
- Network mount failure
- Permission denied

**Problem:** Transient I/O errors are silently ignored, treated as EOF.

**Recommendation:**
```rust
Err(stemedb_wal::QuarantineError::Io { source, .. }) => {
    if source.kind() == std::io::ErrorKind::UnexpectedEof {
        return Ok(0);  // True EOF
    }
    warn!(error = %source, "Transient I/O error, retrying");
    return Ok(0);  // Retry on next tick
}
```

**Severity:** HIGH

---

#### Line 191-192: Vote Count Cache Race Condition
```rust
// worker.rs:181-192 (in put_vote)
let current_count = match self.store.get(&count_key).await? {
    Some(bytes) if bytes.len() == 8 => {
        let arr: [u8; 8] = bytes.try_into().map_err(...)?;
        u64::from_le_bytes(arr)
    }
    _ => 0,
};
let new_count = current_count.saturating_add(1);
self.store.put(&count_key, &new_count.to_le_bytes()).await?;
```

**Issue:** This is a **read-modify-write** race. If two IngestWorkers call `put_vote` concurrently:
1. Worker A reads `count=5`, Worker B reads `count=5`
2. Worker A writes `count=6`, Worker B writes `count=6`
3. Final count is 6, but should be 7 (lost update)

**Simulation doesn't test this** because IngestWorker runs sequentially.

**Recommendation:**
Use a CAS (compare-and-swap) operation or atomic counter:
```rust
// Requires adding atomic counter support to KVStore trait
let new_count = self.store.increment_counter(&count_key, 1).await?;
```

**Severity:** CRITICAL (if multiple ingestors are ever deployed)

---

### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-storage/src/vote_store.rs`

#### Line 183-192: Same Race Condition as Above
**Same code, same issue.** VoteStore is used by IngestWorker, so this is the root cause.

**Severity:** CRITICAL

---

### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-lens/src/vote_aware_consensus.rs`

#### Line 99-102: Defensive Empty Hash on Serialization Failure
```rust
fn compute_assertion_hash(assertion: &Assertion) -> Hash {
    let bytes = match stemedb_core::serde::serialize(assertion) {
        Ok(b) => b,
        Err(_) => return [0u8; 32],  // ⚠️ Silent failure
    };
    *blake3::hash(&bytes).as_bytes()
}
```

**Issue:** If serialization fails, returns `[0u8; 32]` (all zeros). This could cause:
- All failures to hash to the same value
- Votes to be looked up at wrong key
- Silent data corruption

**Recommendation:**
```rust
fn compute_assertion_hash(assertion: &Assertion) -> Result<Hash, String> {
    let bytes = stemedb_core::serde::serialize(assertion)
        .map_err(|e| format!("Failed to serialize assertion: {}", e))?;
    Ok(*blake3::hash(&bytes).as_bytes())
}
```

Then propagate error up:
```rust
// line 134
let assertion_hash = match Self::compute_assertion_hash(assertion) {
    Ok(h) => h,
    Err(e) => {
        debug!(error = %e, "Failed to hash assertion, skipping");
        continue;
    }
};
```

**Severity:** HIGH

---

### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-query/src/engine.rs`

#### Line 117-159: Staleness Check Has Time Skew Risk
```rust
if let Some(max_stale) = query.max_stale {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).map(|d| d.as_secs()).unwrap_or(0);
    let age = now.saturating_sub(view.materialized_at);

    if age > max_stale {
        debug!("Materialized view is stale, falling back to slow path");
        return Ok(None);
    }
}
```

**Issue:** If server clock is set backwards, `now < view.materialized_at`, and `saturating_sub` returns 0, making all MVs appear fresh.

**Recommendation:**
```rust
let age = match now.checked_sub(view.materialized_at) {
    Some(age) => age,
    None => {
        warn!(now, materialized_at = view.materialized_at,
              "Clock skew detected: MV timestamp is in the future, falling back");
        return Ok(None);  // Reject future-dated MVs
    }
};
```

**Severity:** MEDIUM

---

## Summary of Gaps by Severity

| Severity | Count | Examples |
|----------|-------|----------|
| CRITICAL | 4 | Vote cache race, API untested, failure mode gaps, concurrency gaps |
| HIGH | 3 | EOF handling, defensive hash, edge case coverage |
| MEDIUM | 4 | Time skew, verification logic, signature coupling, hardcoded sleeps |
| LOW | 1 | Magic numbers |

---

## Recommendations by Priority

### Immediate (Before Production)

1. **Fix vote cache race condition** (worker.rs:191, vote_store.rs:183)
   - Use atomic counters or CAS operations
   - Add concurrency test: 100 agents voting simultaneously

2. **Add API integration tests** (missing entirely)
   - Create `stemedb-api/tests/http_endpoints.rs`
   - Test JSON serialization, error codes, rate limiting

3. **Add failure mode tests** (missing entirely)
   - Crash recovery: kill IngestWorker mid-step, verify cursor recovery
   - WAL corruption: inject malformed record, verify quarantine
   - Disk full: verify graceful degradation

### Short-Term (Next Sprint)

4. **Add concurrency stress test**
   - 1000 agents writing assertions concurrently
   - 100 queries running while ingestion is active
   - Measure throughput and p99 latency

5. **Add input validation**
   - Max length checks on subject/predicate
   - Confidence range validation (0.0 to 1.0)
   - Weight validation (no negatives)

6. **Fix defensive error handling**
   - Replace `[0u8; 32]` fallback with proper error propagation (vote_aware_consensus.rs:102)
   - Distinguish EOF from transient I/O errors (worker.rs:163)

### Long-Term (Next Quarter)

7. **Add performance benchmarks**
   - Measure assertions/sec throughput
   - Measure p50/p99/p99.9 latency
   - Measure memory allocation per operation

8. **Add property-based tests**
   - Use `proptest` to generate random assertions
   - Verify invariants: all writes are eventually readable, votes always increase counts

9. **Add observability**
   - Export metrics (assertions ingested, queries executed, cache hit rate)
   - Add distributed tracing (OpenTelemetry)

---

## Conclusion

The Arena simulation successfully validates **basic integration** of the write and read paths. However, it provides a **false sense of confidence** because it:

1. **Only tests happy paths** — no failures, no corruption, no adversarial inputs
2. **Only tests sequential execution** — no concurrency, no race conditions
3. **Only tests internal APIs** — HTTP layer is completely untested
4. **Only tests well-formed data** — no edge cases, no DoS attempts
5. **Uses hardcoded timers** — no verification of actual ingestion completion

**Production Readiness: 58/100**

To reach production readiness (90+), the system needs:
- Failure mode testing (crash recovery, corruption handling)
- Concurrency testing (race conditions, deadlocks)
- API integration testing (HTTP, JSON, error codes)
- Input validation (length limits, range checks)
- Performance benchmarks (throughput, latency)

**Critical Blockers:**
1. Fix vote cache race condition (CRITICAL)
2. Add API integration tests (CRITICAL)
3. Add failure mode tests (CRITICAL)

**Estimated Effort to 90/100:** 3-4 weeks with 1 engineer focused on testing.

---

**BurntSushi's Verdict:** Ship the current codebase to a **closed beta** or **internal staging** environment, but **not to production**. The vote cache race alone could cause silent data corruption under load. Fix the critical issues first, then open to external users.