stemedb/.agentive-remediation/arena-gap-analysis.md
jordan 3320c24afa feat: WAL hardening (Phase 5B) - CRC32C, crash recovery, group commit, log rotation
Add CRC32C checksums to WAL record format (v2), implement crash recovery
with automatic truncation of corrupt records, add feature-gated group commit
buffer for batched fsync under concurrent load, and implement log rotation
via segment files with global offset addressing.

Key changes:
- Record format v2: [len:u32][crc32c:u32][blake3:32][payload:N]
- recover_file() scans and truncates corrupt tail records
- GroupCommitBuffer batches fsync via MPSC channel (tokio feature gate)
- SegmentManager with binary search resolution and cursor-based cleanup
- Journal::read() auto-refreshes segments on miss for writer/reader split
- Split recovery.rs and key_codec.rs into directory modules for 500-line max

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 12:36:35 -07:00

731 lines
22 KiB
Markdown

# Arena Simulation: Critical Gap Analysis
**Analyst:** BurntSushi (Rust Quality Engineer)
**Date:** 2026-02-01
**Subject:** Production Readiness Assessment of stemedb-sim Arena Tests
---
## Executive Summary
The Arena simulation tests fundamental write→read paths through the system, but exhibits **significant gaps** between simulation coverage and real-world production scenarios. While the simulation successfully validates basic pipeline connectivity, it provides a **false sense of security** regarding system robustness.
**Production Readiness Score: 58/100**
| Category | Score | Status |
|----------|-------|--------|
| Basic Integration Coverage | 85/100 | ✅ Good |
| Failure Mode Testing | 30/100 | ❌ Critical Gap |
| Performance Validation | 20/100 | ❌ Critical Gap |
| API Integration | 0/100 | ❌ Not Tested |
| Concurrency Testing | 15/100 | ❌ Critical Gap |
---
## What's Actually Being Tested
### Arena 0: Spine (WAL → Ingestor → KVStore)
**Code Path Exercised:**
```
Agent.sign_assertion() → write_assertion_to_wal() → Journal.append()
→ IngestWorker.step() → IngestWorker.ingest_assertion()
→ HybridStore.put() → IndexStore.add_to_indexes()
```
**What Works:**
- ✅ Basic serialization round-trip (`serialize_assertion` → `deserialize`)
- ✅ Ed25519 signature verification in `IngestWorker.verify_assertion_signatures()`
- ✅ Content-addressed storage (`H:{blake3_hash}`)
- ✅ Subject index updates (`S:{subject}`)
- ✅ Compound index updates (`SP:{subject}:{predicate}`)
**Confidence Rating: 90%** — The simulation accurately tests the happy path for assertion ingestion.
---
### Arena 1: Cortex (QueryEngine → Lenses)
**Code Path Exercised:**
```
QueryEngine.execute() → fetch_by_subject_predicate()
→ RecencyLens.resolve() / VoteAwareConsensusLens.resolve_async()
→ Resolution with winner selection
```
**What Works:**
- ✅ Query execution via compound index (`SP:{subject}:{predicate}`)
- ✅ Subject-only queries (`S:{subject}`)
- ✅ Lifecycle filtering in `Query.matches()`
- ✅ RecencyLens timestamp comparison
- ✅ Audit trail storage via `GenericAuditStore.put_audit()`
**Confidence Rating: 85%** — Query path is tested, but lacks edge cases.
---
### Arena 2: Voting & Consensus
**Code Path Exercised:**
```
Agent.vote() → write_vote_to_wal() → IngestWorker.ingest_vote()
→ VoteStore.put_vote() → VoteStore cache updates (VC:{hash}, VW:{hash})
→ VoteAwareConsensusLens.resolve_async() → Aggregate weight ranking
```
**What Works:**
- ✅ Vote serialization/deserialization
- ✅ Vote storage at `V:{assertion_hash}:{vote_hash}`
- ✅ Vote count cache increments (`VC:{assertion_hash}`)
- ✅ Aggregate weight cache updates (`VW:{assertion_hash}`)
- ✅ Winner selection based on highest aggregate weight
- ✅ Tiebreaker by timestamp
**Confidence Rating: 80%** — Vote mechanics work, but lacks adversarial testing.
---
## Critical Gaps: What's NOT Being Tested
### 1. Failure Modes (Severity: CRITICAL)
**Missing:**
- ❌ WAL corruption mid-write (partial record)
- ❌ Sled crashes during `put()` (fsync failure)
- ❌ Network partition (if distributed)
- ❌ OOM conditions during large scans
- ❌ Disk full during ingestion
- ❌ Malformed rkyv archives (corrupted length fields)
- ❌ Hash collisions (theoretical but untested)
**Why This Matters:**
The simulation calls `.expect()` on all storage operations, hiding errors:
```rust
// lib.rs:466
match write_assertion_to_wal(&journal, &assertion).await {
Ok(raw_bytes) => { /* success path */ }
Err(e) => {
result.errors.push(SimulationError { /* logged but not replayed */ });
}
}
```
**Problem:** Errors are collected but the simulation continues. No test verifies recovery behavior after failure.
**Recommendation:**
Create `arena-3-chaos.rs` that:
- Kills IngestWorker mid-step, restarts, verifies cursor recovery
- Corrupts WAL records, verifies quarantine behavior
- Fills disk to 100%, verifies graceful degradation
**Confidence Rating: 10%** — Failure mode coverage is nearly absent.
---
### 2. Concurrency & Race Conditions (Severity: CRITICAL)
**Missing:**
- ❌ Multiple IngestWorkers reading same WAL (race on cursor)
- ❌ Concurrent queries while ingestion is active
- ❌ Vote cache race conditions (VC/VW updates not atomic)
- ❌ Index corruption from concurrent writes
- ❌ Deadlock detection between Journal mutex and Store mutex
**Why This Matters:**
The simulation is **strictly sequential**:
```rust
// lib.rs:458 - Arena 0 tick loop is single-threaded
for tick in 0..config.tick_count {
let agent = &agents[tick % agents.len()];
let assertion = agent.sign_assertion(...);
match write_assertion_to_wal(&journal, &assertion).await { /* ... */ }
}
```
**Then waits 500ms:**
```rust
// lib.rs:486
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;
```
**Problem:** No concurrent writes, no overlapping ingestion, no parallel queries.
**Real-World Scenario:**
- 1000 agents writing assertions simultaneously
- 10 queries executing while votes are being ingested
- Materializer updating MVs while QueryEngine reads them
**Recommendation:**
```rust
// Spawn 100 concurrent writers
let handles: Vec<_> = (0..100).map(|i| {
let journal = journal.clone();
tokio::spawn(async move {
for _ in 0..10 {
write_assertion_to_wal(&journal, &assertion).await?;
}
})
}).collect();
futures::future::join_all(handles).await;
```
**Confidence Rating: 15%** — Concurrency testing is effectively zero.
---
### 3. Performance Validation (Severity: HIGH)
**Missing:**
- ❌ Throughput measurement (assertions/sec, votes/sec)
- ❌ Latency percentiles (p50, p99, p99.9)
- ❌ Memory allocation profiling
- ❌ Index scan performance under load
- ❌ VoteStore cache hit rate
- ❌ Materialized view staleness impact
**Why This Matters:**
The simulation uses **hardcoded sleep timers**:
```rust
// lib.rs:486
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;
```
**Problem:** No verification that 500ms is sufficient. If ingestion lags, tests pass but assertions are missing.
**Actual Check:**
```rust
// lib.rs:378
assert!(result.is_success(), "Simulation should succeed: {:?}", result.errors);
assert_eq!(result.assertions_written, 14);
assert_eq!(result.assertions_verified, 10); // ⚠️ Only 10 of 14 verified!
```
**Wait, what?** The test writes 14 assertions but only verifies 10. The 4 extras are from Arena 2 (vote consensus/troll resistance), which are verified through lens resolution, not direct verification loop.
**BUT:** This is fragile. If ingestion is slow, `assertions_verified` could be 0 and the test would fail for the wrong reason.
**Recommendation:**
- Add `IngestWorker.wait_until_offset(target)` to block until ingestion catches up
- Measure ingestion lag: `current_offset - wal_tail_offset`
- Fail if lag exceeds threshold
**Confidence Rating: 20%** — Performance is not measured, only assumed.
---
### 4. API Integration (Severity: CRITICAL)
**Missing:**
- ❌ HTTP API endpoints (`stemedb-api` crate is completely untested)
- ❌ JSON serialization/deserialization
- ❌ OpenAPI spec validation
- ❌ Rate limiting enforcement
- ❌ Authentication/authorization
- ❌ Quota enforcement
**Why This Matters:**
The simulation tests **internal Rust APIs**:
```rust
// lib.rs:335
async fn write_assertion_to_wal(journal: &Arc<Mutex<Journal>>, assertion: &Assertion)
-> Result<Vec<u8>, String>
```
**Real clients will use HTTP:**
```bash
curl -X POST http://localhost:8080/assertions \
-H "Content-Type: application/json" \
-d '{"subject": "Tesla", "predicate": "revenue", "object": 96.7}'
```
**Gap:** No test verifies:
- JSON schema compatibility
- HTTP error responses (400, 500)
- Content-Type negotiation
- CORS headers
- Request timeout handling
**Recommendation:**
Create `tests/api_integration.rs`:
```rust
#[tokio::test]
async fn test_post_assertion_via_http() {
let app = stemedb_api::create_app(store).await;
let response = app
.oneshot(Request::builder()
.method("POST")
.uri("/assertions")
.header("content-type", "application/json")
.body(Body::from(r#"{"subject":"Tesla","predicate":"revenue"}"#))
.unwrap())
.await
.unwrap();
assert_eq!(response.status(), StatusCode::CREATED);
}
```
**Confidence Rating: 0%** — API layer is completely untested.
---
### 5. Edge Cases & Adversarial Inputs (Severity: HIGH)
**Missing:**
- ❌ Subject/predicate strings with UTF-8 edge cases (e.g., emoji, RTL text, null bytes)
- ❌ Extremely long subject strings (DoS via memory exhaustion)
- ❌ Malicious signatures (wrong length, invalid curve points)
- ❌ Votes with negative weights (should be rejected)
- ❌ Assertions with `confidence > 1.0` or `< 0.0`
- ❌ Circular parent_hash references
- ❌ Timestamp overflow (year 2038 problem)
**Why This Matters:**
The simulation creates **well-formed data**:
```rust
// lib.rs:459
let agent = &agents[tick % agents.len()];
let assertion = agent.sign_assertion(
&format!("Entity_{}", tick), // ⚠️ Always ASCII alphanumeric
"has_property",
ObjectValue::Text(format!("Value_{}", tick)),
);
```
**Real-World Inputs:**
- `subject: "🚗 Tesla\x00Inc"` (emoji + null byte)
- `subject: "a".repeat(1_000_000)` (1MB string, DoS)
- `weight: -999.9` (negative vote weight)
- `confidence: f32::NAN` (NaN value)
**Current Code:**
```rust
// worker.rs:256 - Signature verification (GOOD)
if assertion.signatures.is_empty() {
return Err(IngestError::InvalidSignature("must have at least one signature".to_string()));
}
```
**But NO validation for:**
```rust
if assertion.subject.len() > MAX_SUBJECT_LEN {
return Err(IngestError::InvalidInput("subject too long"));
}
if assertion.confidence < 0.0 || assertion.confidence > 1.0 {
return Err(IngestError::InvalidInput("confidence out of range"));
}
```
**Recommendation:**
Add defensive validation in `IngestWorker::ingest_assertion()`:
```rust
const MAX_SUBJECT_LEN: usize = 1024;
const MAX_PREDICATE_LEN: usize = 1024;
fn validate_assertion(a: &Assertion) -> Result<()> {
if a.subject.len() > MAX_SUBJECT_LEN {
return Err(IngestError::InvalidInput("subject too long".into()));
}
if a.confidence < 0.0 || a.confidence > 1.0 || a.confidence.is_nan() {
return Err(IngestError::InvalidInput("invalid confidence".into()));
}
// ... more checks
Ok(())
}
```
**Confidence Rating: 25%** — Edge case coverage is minimal.
---
### 6. Test Isolation (Severity: MEDIUM)
**Current Approach:**
```rust
// lib.rs:401-405
let temp_wal_dir = tempfile::tempdir().map_err(...)?;
let temp_db_dir = tempfile::tempdir().map_err(...)?;
```
**Problem:** Tests are isolated, but **not hermetic**. The simulation:
- Uses real filesystem I/O
- Uses real Ed25519 crypto (slow)
- Uses real sleep timers (flaky)
**Impact on CI:**
- Tests are slow (500ms+ sleep per scenario)
- Tests can be flaky if disk is slow
- Parallel test execution may interfere (disk I/O contention)
**Recommendation:**
- Mock `KVStore` trait for unit tests (use `HashMap` backend)
- Mock `Journal` for ingestion tests
- Use `tokio::time::pause()` to control time advancement
**Confidence Rating: 70%** — Isolation is good, but not hermetic.
---
## Specific File:Line Issues
### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-sim/src/lib.rs`
#### Line 97-111: `SimulationResult::is_success()` Logic Flaw
```rust
pub fn is_success(&self) -> bool {
self.errors.is_empty()
&& self.recency_test_passed
&& self.lifecycle_test_passed
&& self.audit_test_passed
&& self.vote_consensus_test_passed
&& self.troll_resistance_test_passed
}
```
**Issue:** No check for `assertions_written == assertions_verified`.
**Why:** Comment claims "Arena 2 tests write additional test assertions that are verified through the VoteAwareConsensusLens rather than the main verification loop."
**Problem:** This is confusing and fragile. If ingestion fails silently, `assertions_verified` could be 0 but the test passes because `errors.is_empty()`.
**Recommendation:**
```rust
pub fn is_success(&self) -> bool {
let base_assertions_verified = self.assertions_verified >= (self.tick_count as u64);
base_assertions_verified
&& self.errors.is_empty()
&& /* ... rest of checks */
}
```
**Severity:** MEDIUM
---
#### Line 279-281: Signature Message Format Mismatch Risk
```rust
// lib.rs:281
let message = format!("{}:{}", subject, predicate);
let signature: Signature = self.signing_key.sign(message.as_bytes());
```
**vs. worker.rs:268:**
```rust
// worker.rs:268
let message = format!("{}:{}", assertion.subject, assertion.predicate);
```
**Issue:** The signed message format is **implicitly coupled** across files. If one changes (e.g., adding `|{timestamp}`), signatures break.
**Recommendation:**
Extract to `stemedb-core/src/types.rs`:
```rust
impl Assertion {
pub fn signing_message(&self) -> String {
format!("{}:{}", self.subject, self.predicate)
}
}
```
**Severity:** MEDIUM
---
#### Line 486: Hardcoded Sleep Timer
```rust
tokio::time::sleep(std::time::Duration::from_millis(config.ingestion_wait_ms)).await;
```
**Issue:** No verification that ingestion actually completed. If system is slow, test fails with "assertion not found" instead of "ingestion timeout".
**Recommendation:**
```rust
async fn wait_for_ingestion(journal: &Journal, store: &impl KVStore, expected_count: usize) -> Result<(), String> {
let deadline = tokio::time::Instant::now() + Duration::from_secs(5);
loop {
let count = store.scan_prefix(b"H:").await?.len();
if count >= expected_count {
return Ok(());
}
if tokio::time::Instant::now() > deadline {
return Err(format!("Timeout: only {}/{} assertions ingested", count, expected_count));
}
tokio::time::sleep(Duration::from_millis(10)).await;
}
}
```
**Severity:** MEDIUM
---
#### Line 696, 820, 1073, 1260: Duplicate `tokio::time::sleep()` Calls
**Pattern:**
```rust
tokio::time::sleep(std::time::Duration::from_millis(300)).await;
```
**Issue:** Magic number `300` repeated 4+ times. No justification for why 300ms is correct.
**Recommendation:**
```rust
const INGESTION_SETTLE_TIME_MS: u64 = 300;
// ... then use:
tokio::time::sleep(Duration::from_millis(INGESTION_SETTLE_TIME_MS)).await;
```
**Severity:** LOW
---
### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-ingest/src/worker.rs`
#### Line 158-174: EOF Handling Ambiguity
```rust
let record = {
let journal = self.journal.lock().await;
match journal.read(self.current_offset) {
Ok(record) => record,
Err(stemedb_wal::QuarantineError::Io { .. }) => {
return Ok(0); // ⚠️ Assuming EOF
}
Err(stemedb_wal::QuarantineError::IoGeneric(e))
if e.kind() == std::io::ErrorKind::UnexpectedEof => {
return Ok(0); // ⚠️ Assuming EOF
}
Err(e) => return Err(IngestError::Wal(e)),
}
};
```
**Issue:** Both `QuarantineError::Io` and `UnexpectedEof` return `Ok(0)`, which means "no new data". But `QuarantineError::Io` could also mean:
- Disk read error
- Network mount failure
- Permission denied
**Problem:** Transient I/O errors are silently ignored, treated as EOF.
**Recommendation:**
```rust
Err(stemedb_wal::QuarantineError::Io { source, .. }) => {
if source.kind() == std::io::ErrorKind::UnexpectedEof {
return Ok(0); // True EOF
}
warn!(error = %source, "Transient I/O error, retrying");
return Ok(0); // Retry on next tick
}
```
**Severity:** HIGH
---
#### Line 191-192: Vote Count Cache Race Condition
```rust
// worker.rs:181-192 (in put_vote)
let current_count = match self.store.get(&count_key).await? {
Some(bytes) if bytes.len() == 8 => {
let arr: [u8; 8] = bytes.try_into().map_err(...)?;
u64::from_le_bytes(arr)
}
_ => 0,
};
let new_count = current_count.saturating_add(1);
self.store.put(&count_key, &new_count.to_le_bytes()).await?;
```
**Issue:** This is a **read-modify-write** race. If two IngestWorkers call `put_vote` concurrently:
1. Worker A reads `count=5`, Worker B reads `count=5`
2. Worker A writes `count=6`, Worker B writes `count=6`
3. Final count is 6, but should be 7 (lost update)
**Simulation doesn't test this** because IngestWorker runs sequentially.
**Recommendation:**
Use a CAS (compare-and-swap) operation or atomic counter:
```rust
// Requires adding atomic counter support to KVStore trait
let new_count = self.store.increment_counter(&count_key, 1).await?;
```
**Severity:** CRITICAL (if multiple ingestors are ever deployed)
---
### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-storage/src/vote_store.rs`
#### Line 183-192: Same Race Condition as Above
**Same code, same issue.** VoteStore is used by IngestWorker, so this is the root cause.
**Severity:** CRITICAL
---
### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-lens/src/vote_aware_consensus.rs`
#### Line 99-102: Defensive Empty Hash on Serialization Failure
```rust
fn compute_assertion_hash(assertion: &Assertion) -> Hash {
let bytes = match stemedb_core::serde::serialize(assertion) {
Ok(b) => b,
Err(_) => return [0u8; 32], // ⚠️ Silent failure
};
*blake3::hash(&bytes).as_bytes()
}
```
**Issue:** If serialization fails, returns `[0u8; 32]` (all zeros). This could cause:
- All failures to hash to the same value
- Votes to be looked up at wrong key
- Silent data corruption
**Recommendation:**
```rust
fn compute_assertion_hash(assertion: &Assertion) -> Result<Hash, String> {
let bytes = stemedb_core::serde::serialize(assertion)
.map_err(|e| format!("Failed to serialize assertion: {}", e))?;
Ok(*blake3::hash(&bytes).as_bytes())
}
```
Then propagate error up:
```rust
// line 134
let assertion_hash = match Self::compute_assertion_hash(assertion) {
Ok(h) => h,
Err(e) => {
debug!(error = %e, "Failed to hash assertion, skipping");
continue;
}
};
```
**Severity:** HIGH
---
### `/Users/jordanwashburn/Workspace/orchard9/stemedb/crates/stemedb-query/src/engine.rs`
#### Line 117-159: Staleness Check Has Time Skew Risk
```rust
if let Some(max_stale) = query.max_stale {
let now = SystemTime::now().duration_since(UNIX_EPOCH).map(|d| d.as_secs()).unwrap_or(0);
let age = now.saturating_sub(view.materialized_at);
if age > max_stale {
debug!("Materialized view is stale, falling back to slow path");
return Ok(None);
}
}
```
**Issue:** If server clock is set backwards, `now < view.materialized_at`, and `saturating_sub` returns 0, making all MVs appear fresh.
**Recommendation:**
```rust
let age = match now.checked_sub(view.materialized_at) {
Some(age) => age,
None => {
warn!(now, materialized_at = view.materialized_at,
"Clock skew detected: MV timestamp is in the future, falling back");
return Ok(None); // Reject future-dated MVs
}
};
```
**Severity:** MEDIUM
---
## Summary of Gaps by Severity
| Severity | Count | Examples |
|----------|-------|----------|
| CRITICAL | 4 | Vote cache race, API untested, failure mode gaps, concurrency gaps |
| HIGH | 3 | EOF handling, defensive hash, edge case coverage |
| MEDIUM | 4 | Time skew, verification logic, signature coupling, hardcoded sleeps |
| LOW | 1 | Magic numbers |
---
## Recommendations by Priority
### Immediate (Before Production)
1. **Fix vote cache race condition** (worker.rs:191, vote_store.rs:183)
- Use atomic counters or CAS operations
- Add concurrency test: 100 agents voting simultaneously
2. **Add API integration tests** (missing entirely)
- Create `stemedb-api/tests/http_endpoints.rs`
- Test JSON serialization, error codes, rate limiting
3. **Add failure mode tests** (missing entirely)
- Crash recovery: kill IngestWorker mid-step, verify cursor recovery
- WAL corruption: inject malformed record, verify quarantine
- Disk full: verify graceful degradation
### Short-Term (Next Sprint)
4. **Add concurrency stress test**
- 1000 agents writing assertions concurrently
- 100 queries running while ingestion is active
- Measure throughput and p99 latency
5. **Add input validation**
- Max length checks on subject/predicate
- Confidence range validation (0.0 to 1.0)
- Weight validation (no negatives)
6. **Fix defensive error handling**
- Replace `[0u8; 32]` fallback with proper error propagation (vote_aware_consensus.rs:102)
- Distinguish EOF from transient I/O errors (worker.rs:163)
### Long-Term (Next Quarter)
7. **Add performance benchmarks**
- Measure assertions/sec throughput
- Measure p50/p99/p99.9 latency
- Measure memory allocation per operation
8. **Add property-based tests**
- Use `proptest` to generate random assertions
- Verify invariants: all writes are eventually readable, votes always increase counts
9. **Add observability**
- Export metrics (assertions ingested, queries executed, cache hit rate)
- Add distributed tracing (OpenTelemetry)
---
## Conclusion
The Arena simulation successfully validates **basic integration** of the write and read paths. However, it provides a **false sense of confidence** because it:
1. **Only tests happy paths** — no failures, no corruption, no adversarial inputs
2. **Only tests sequential execution** — no concurrency, no race conditions
3. **Only tests internal APIs** — HTTP layer is completely untested
4. **Only tests well-formed data** — no edge cases, no DoS attempts
5. **Uses hardcoded timers** — no verification of actual ingestion completion
**Production Readiness: 58/100**
To reach production readiness (90+), the system needs:
- Failure mode testing (crash recovery, corruption handling)
- Concurrency testing (race conditions, deadlocks)
- API integration testing (HTTP, JSON, error codes)
- Input validation (length limits, range checks)
- Performance benchmarks (throughput, latency)
**Critical Blockers:**
1. Fix vote cache race condition (CRITICAL)
2. Add API integration tests (CRITICAL)
3. Add failure mode tests (CRITICAL)
**Estimated Effort to 90/100:** 3-4 weeks with 1 engineer focused on testing.
---
**BurntSushi's Verdict:** Ship the current codebase to a **closed beta** or **internal staging** environment, but **not to production**. The vote cache race alone could cause silent data corruption under load. Fix the critical issues first, then open to external users.