Arena Roadmap: The Simulation
Goal: Incrementally evolve the simulator from Spine validation to a full Agent-Based Modeling environment.
Philosophy: Make it run. Then add. Verify at every step.
Alignment: Tracks main roadmap.md phases; exercises features as they land.
Current State (Baseline)
The simulator (stemedb-sim) currently validates Phase 1: The Spine:
| Component | Status | What It Proves |
|---|---|---|
| WAL Durability | ✓ Works | Writes persist |
| rkyv Serialization | ✓ Works | Roundtrip correctness |
| Ed25519 Signatures | ✓ Works | Sign on write, verify on read |
| Ingestor Pipeline | ✓ Works | WAL → KV async flow |
| Agent Identity | ✓ Works | Keypair generation |
Run command: cargo run --bin stemedb-sim
What's NOW tested (Arena 1-3):
- ✅ Queries via QueryEngine
- ✅ Lens resolution (Recency, VoteAwareConsensus)
- ✅ Lifecycle filtering
- ✅ Voting & consensus
- ✅ Query audit trail
- ✅ Materialized Views (Arena 3)
- ✅ Fast-path MV reads
- ✅ MV freshness under load
What's NOT yet tested:
- ❌ Concurrent agents (Arena 6)
- ❌ TrustRank (Arena 5)
- ❌ Time-travel queries (Arena 7)
Arena Phases
Arena 0: Make It Verifiable ✅ COMPLETE
Goal: Add assertions for what the simulation proves. Currently it prints logs; we need programmatic success/failure.
Why first: Without assertions, we don't know if later changes break things.
- 0.1 Return Result from main(): Change from print-and-exit to structured outcome.
  - Define `SimulationResult { assertions_written: u64, assertions_verified: u64, errors: Vec<SimulationError> }`.
  - Library function `run_simulation(config) -> Result<SimulationResult, SimulationSetupError>`.
  - Print summary at end, exit 0 on success, exit 1 on failure, exit 2 on setup error.
- 0.2 Integration Test Wrapper: Make the sim runnable as a test.
  - Add `crates/stemedb-sim/tests/smoke.rs` with 6 integration tests.
  - Assert on `SimulationResult` fields.
  - Run in CI via `cargo test -p stemedb-sim`.
Exit Criteria: cargo test -p stemedb-sim passes ✅
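The structured outcome from 0.1 can be sketched as follows. This is a minimal illustration, not the crate's actual code: the field of `SimulationError`, the `is_success()` rule (no errors and every written assertion verified), and the `exit_code()` helper are assumptions introduced here to show how the three exit codes could be derived.

```rust
/// A single failed check recorded during a run (the field is illustrative).
#[derive(Debug)]
pub struct SimulationError {
    pub message: String,
}

/// Failure before the simulation could start (hypothetical shape).
#[derive(Debug)]
pub struct SimulationSetupError(pub String);

#[derive(Debug, Default)]
pub struct SimulationResult {
    pub assertions_written: u64,
    pub assertions_verified: u64,
    pub errors: Vec<SimulationError>,
}

impl SimulationResult {
    /// Assumed rule: nothing failed and every written assertion was verified.
    pub fn is_success(&self) -> bool {
        self.errors.is_empty() && self.assertions_written == self.assertions_verified
    }
}

/// Maps a run outcome to an exit code: 0 success, 1 failure, 2 setup error.
pub fn exit_code(outcome: &Result<SimulationResult, SimulationSetupError>) -> i32 {
    match outcome {
        Ok(result) if result.is_success() => 0,
        Ok(_) => 1,
        Err(_) => 2,
    }
}

fn main() {
    let outcome = Ok(SimulationResult {
        assertions_written: 10,
        assertions_verified: 10,
        errors: vec![],
    });
    println!("exit code: {}", exit_code(&outcome));
}
```

Keeping the mapping in a pure function lets an integration test assert on it without spawning the binary.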
Arena 1: Query Path (Exercises Phase 2 Features) ✅ COMPLETE
Goal: Extend simulation to read via QueryEngine, not direct KV access.
Depends on: Phase 2 complete (Query Engine, Lenses)
Aligns with: roadmap.md Phase 2 "The Lattice"
- 1.1 Add Query Engine to Simulation
  - Import `stemedb-query` crate.
  - After ingestion wait, query each assertion via `QueryEngine::execute()`.
  - Verify result matches what was written.
- 1.2 Exercise Recency Lens
  - Write 2 assertions for same subject+predicate with different timestamps.
  - Query with `lens=Recency`.
  - Verify most recent wins.
- 1.3 Exercise Lifecycle Filtering
  - Write one `Proposed` and one `Approved` assertion for same fact.
  - Query with `lifecycle=Approved`.
  - Verify only Approved returned.
  - Use Case Alignment: This is the JWT signing algorithm bug from Agile Agent Team.
- 1.4 Query Audit Verification
  - Set `X-Agent-Id` header (or equivalent in direct API call).
  - After queries, call `AuditStore::get_audits_for_agent()`.
  - Verify audit trail exists with correct contributing assertions.
  - Use Case Alignment: "What did the deployment agent query?"
Exit Criteria: Simulation writes, ingests, queries, and verifies all three scenarios. ✅
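Scenarios 1.2 and 1.3 reduce to one small rule: filter by lifecycle, then let the latest timestamp win. The sketch below models that rule with stand-in types; `Assertion`, `Lifecycle`, `resolve_recency`, and the sample subject and predicate strings are hypothetical and do not reflect the real `stemedb-query` API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Proposed,
    Approved,
}

#[derive(Debug, Clone, PartialEq)]
struct Assertion {
    subject: String,
    predicate: String,
    object: String,
    timestamp: u64,
    lifecycle: Lifecycle,
}

/// Recency resolution: among candidates for one subject+predicate, optionally
/// restricted to a lifecycle stage, the latest timestamp wins.
fn resolve_recency<'a>(
    candidates: &'a [Assertion],
    subject: &str,
    predicate: &str,
    lifecycle: Option<Lifecycle>,
) -> Option<&'a Assertion> {
    candidates
        .iter()
        .filter(|a| a.subject == subject && a.predicate == predicate)
        .filter(|a| lifecycle.map_or(true, |l| a.lifecycle == l))
        .max_by_key(|a| a.timestamp)
}

/// Two assertions about the same fact: an older Approved one, a newer Proposed one.
fn sample() -> Vec<Assertion> {
    let make = |object: &str, timestamp: u64, lifecycle: Lifecycle| Assertion {
        subject: "auth_service".to_string(),
        predicate: "jwt_signing_algorithm".to_string(),
        object: object.to_string(),
        timestamp,
        lifecycle,
    };
    vec![
        make("RS256", 100, Lifecycle::Approved),
        make("ES256", 200, Lifecycle::Proposed),
    ]
}

fn main() {
    let facts = sample();
    let latest = resolve_recency(&facts, "auth_service", "jwt_signing_algorithm", None);
    let approved = resolve_recency(
        &facts,
        "auth_service",
        "jwt_signing_algorithm",
        Some(Lifecycle::Approved),
    );
    println!("latest   = {:?}", latest.map(|a| &a.object));
    println!("approved = {:?}", approved.map(|a| &a.object));
}
```

Without the lifecycle filter the newer Proposed value wins; with `lifecycle=Approved` only the Approved value is eligible, which is the behavior 1.3 checks.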
Arena 2: Voting & Consensus (Exercises Phase 2 VoteStore) ✅ COMPLETE
Goal: Simulate agents voting on assertions and resolving via VoteAwareConsensusLens.
Depends on: Arena 1 complete, Phase 2 VoteStore
Aligns with: roadmap.md Phase 2 "The Ballot Box"
- 2.1 Add Vote Creation to Agents
  - `Agent::vote(assertion_hash, weight)` method.
  - Votes stored directly via VoteStore (bypasses WAL for now - see note below).
- 2.2 Conflicting Assertions with Votes
  - Scientist_Alpha asserts "Protein_X binds Receptor_Y" (confidence 0.8).
  - Scientist_Beta asserts "Protein_X binds Receptor_Z" (confidence 0.8).
  - Alpha votes for own assertion (weight 1.0).
  - Beta votes for own assertion (weight 1.0).
  - Third agent (Believer) votes for Alpha's assertion.
  - Query with `lens=VoteAwareConsensus`.
  - Verify Alpha's assertion wins (2 votes vs 1).
- 2.3 Troll Vote Resistance
  - Troll creates low-confidence assertion contradicting consensus.
  - Troll votes for own assertion.
  - Verify high-vote assertions still win.
Exit Criteria: Vote-based consensus correctly resolves conflicts. ✅
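The 2.2 and 2.3 scenarios can be modeled by summing vote weight per assertion and picking the largest total. This is only a sketch of that idea: the real VoteAwareConsensusLens may also factor in confidence or other signals, and `Vote`, `consensus_winner`, the assertion ids, and the Believer's weight of 1.0 are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// One vote for an assertion, identified here by a plain string id (illustrative).
struct Vote<'a> {
    assertion_id: &'a str,
    weight: f32,
}

/// Sums vote weight per candidate and returns the candidate with the largest
/// total. On a tie, the candidate listed first wins.
fn consensus_winner<'a>(candidates: &[&'a str], votes: &[Vote<'a>]) -> Option<&'a str> {
    let mut totals: HashMap<&'a str, f32> = HashMap::new();
    for vote in votes {
        *totals.entry(vote.assertion_id).or_insert(0.0) += vote.weight;
    }
    let mut best: Option<(&'a str, f32)> = None;
    for &id in candidates {
        let total = totals.get(id).copied().unwrap_or(0.0);
        if best.map_or(true, |(_, best_total)| total > best_total) {
            best = Some((id, total));
        }
    }
    best.map(|(id, _)| id)
}

fn main() {
    let candidates = ["alpha_claim", "beta_claim", "troll_claim"];
    let votes = vec![
        Vote { assertion_id: "alpha_claim", weight: 1.0 }, // Alpha votes for own assertion
        Vote { assertion_id: "beta_claim", weight: 1.0 },  // Beta votes for own assertion
        Vote { assertion_id: "alpha_claim", weight: 1.0 }, // Believer backs Alpha
        Vote { assertion_id: "troll_claim", weight: 1.0 }, // Troll votes for own assertion
    ];
    println!("winner = {:?}", consensus_winner(&candidates, &votes));
}
```

A single self-vote from the Troll cannot overtake an assertion that has two supporters, which is the resistance property 2.3 verifies.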
Arena 2.5: Hardening (Critical Gap Remediation) ✅ COMPLETE
Goal: Fix critical gaps discovered during Arena 0-2 review before adding more features.
Depends on: Arena 2 complete
Blocks: Arena 3+ (don't add features on a shaky foundation)
Rationale: Gap analysis revealed 58/100 production readiness score with 3 critical blockers.
- 2.5.1 Fix Vote Cache Race Condition (P0 - CRITICAL)
  - VoteStore `put_vote()` uses `fetch_and_add_u64` + `compare_and_swap_f32` (vote_store.rs:182-189)
  - Two concurrent calls can lose updates (final count = N instead of N+1)
  - Solution: Add atomic increment or compare-and-swap operation
  - Add test: 100 concurrent `put_vote()` calls, verify final count
  - File: `crates/stemedb-storage/src/vote_store.rs:181-206`
- 2.5.2 Add API Integration Tests (P0 - CRITICAL)
  - Create `crates/stemedb-api/tests/http_integration.rs`
  - Test `POST /assertions` - create assertion via HTTP
  - Test `POST /votes` - submit vote via HTTP
  - Test `GET /query` - query with lens parameter
  - Test error responses (400 Bad Request, 500 Internal Error)
  - Test rate limiting via QuotaStore middleware
  - Gap: Entire HTTP layer is currently untested
- 2.5.3 Add Crash Recovery Test (P0 - CRITICAL)
  - Write assertions to WAL
  - Kill IngestWorker mid-step (simulate crash)
  - Restart IngestWorker with same WAL + KV store
  - Verify: cursor resumes correctly, no duplicate ingestion (worker.rs:1518)
  - Verify: all pre-crash data is recoverable
  - Validates: Durability claims in architecture.md
- 2.5.4 Add Input Validation (P1 - HIGH)
  - Max subject length: 1024 characters
  - Max predicate length: 1024 characters
  - Confidence range: 0.0 to 1.0, reject NaN/Inf
  - Vote weight: non-negative, reject NaN/Inf
  - Timestamp: reject values > current time + 1 hour (clock skew protection)
  - Add validation in `IngestWorker::validate_assertion()` and `validate_vote()`
  - File: `crates/stemedb-ingest/src/worker.rs`
- 2.5.5 Replace Sleep Timers with Ingestion Sync (P1 - HIGH)
  - `wait_until_ingested()` cursor-based polling replaces all sleeps
  - Add: `wait_for_ingestion(store, expected_count, timeout)` helper
  - Poll store until expected assertions exist or timeout
  - Replace all hardcoded sleeps in simulation
  - Benefit: Faster tests, deterministic behavior
- 2.5.6 Fix Defensive Error Handling (P2 - MEDIUM)
  - Input validation propagates errors properly
  - Change to propagate error or skip candidate with warning
  - `worker.rs:161-173`: Ambiguous EOF handling treats all I/O errors as "no data"
  - Distinguish true EOF from transient errors
Exit Criteria:
- Vote cache is atomic (concurrent test passes)
- API layer has integration tests (POST/GET work via HTTP)
- Crash recovery is verified (no data loss on restart)
- Input validation rejects malformed data
- No hardcoded sleep timers in simulation
- Production readiness score: 75+ (up from 58)
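The lost-update problem in 2.5.1 and its concurrent test can be illustrated with standard-library atomics. This is a sketch under stated assumptions, not the VoteStore implementation: `VoteTally` is a hypothetical in-memory stand-in, the count uses `AtomicU64::fetch_add`, and the f32 weight sum is kept as raw bits in an `AtomicU32` and updated in a compare-and-swap retry loop.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical in-memory vote cache for a single assertion.
#[derive(Default)]
struct VoteTally {
    count: AtomicU64,
    /// Sum of vote weights, stored as the bit pattern of an f32 (0 == 0.0).
    weight_bits: AtomicU32,
}

impl VoteTally {
    fn put_vote(&self, weight: f32) {
        // Atomic increment: concurrent callers can never overwrite each other.
        self.count.fetch_add(1, Ordering::Relaxed);

        // Compare-and-swap loop: retry until our addition is applied to the
        // value we actually observed.
        let mut current = self.weight_bits.load(Ordering::Relaxed);
        loop {
            let updated = (f32::from_bits(current) + weight).to_bits();
            match self.weight_bits.compare_exchange_weak(
                current,
                updated,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => break,
                Err(observed) => current = observed,
            }
        }
    }

    fn count(&self) -> u64 {
        self.count.load(Ordering::Acquire)
    }

    fn total_weight(&self) -> f32 {
        f32::from_bits(self.weight_bits.load(Ordering::Acquire))
    }
}

/// Spawns `n` threads that each cast one vote of weight 1.0.
fn run_concurrent_votes(n: usize) -> (u64, f32) {
    let tally = Arc::new(VoteTally::default());
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let tally = Arc::clone(&tally);
            thread::spawn(move || tally.put_vote(1.0))
        })
        .collect();
    for handle in handles {
        handle.join().expect("voter thread panicked");
    }
    (tally.count(), tally.total_weight())
}

fn main() {
    let (count, weight) = run_concurrent_votes(100);
    println!("count = {count}, total weight = {weight}");
}
```

Note that the count and the weight sum are two separate atomics here, so a reader can briefly observe one updated before the other; a store that needs both to change together would have to pack them into one word or use a lock.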
Arena 3: Materialized Views (Exercises Phase 2 Materializer) ✅ COMPLETE
Goal: Verify fast-path MV reads work under simulation load.
Depends on: Arena 2 complete, Phase 2 Materializer
Aligns with: roadmap.md Phase 2 "Materializer"
- 3.1 Materializer Integration
  - Spin up Materializer alongside Ingestor.
  - Wire `Notify` between IngestWorker and Materializer.
  - After ingestion, verify MV keys exist in store.
- 3.2 Fast-Path Verification
  - Query via QueryEngine with subject+predicate.
  - Log whether fast-path or slow-path was used (add debug output).
  - Verify MV winner matches slow-path result.
- 3.3 MV Freshness Under Load
  - Write 10 assertions in rapid succession.
  - Wait for materialization.
  - Verify MV reflects latest state.
  - Aligns with: Phase 2.5 "MV Staleness Detection"
Exit Criteria: Fast-path queries return correct results under load. ✅
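The check in 3.2 and 3.3 amounts to comparing a precomputed winner against a full scan. The sketch below shows that comparison with a toy in-memory store; `Store`, `Path`, `materialize()`, and the recency-based winner rule are assumptions for illustration and are not the Materializer's real interface.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Assertion {
    subject: String,
    predicate: String,
    object: String,
    timestamp: u64,
}

#[derive(Debug, PartialEq)]
enum Path {
    Fast,
    Slow,
}

#[derive(Default)]
struct Store {
    log: Vec<Assertion>,
    /// Materialized view: precomputed winner per (subject, predicate).
    views: HashMap<(String, String), Assertion>,
}

impl Store {
    /// Slow path: scan every assertion and pick the most recent match.
    fn slow_path(&self, subject: &str, predicate: &str) -> Option<&Assertion> {
        self.log
            .iter()
            .filter(|a| a.subject == subject && a.predicate == predicate)
            .max_by_key(|a| a.timestamp)
    }

    /// Rebuilds every view from the log (stands in for the Materializer).
    fn materialize(&mut self) {
        self.views.clear();
        for a in &self.log {
            let key = (a.subject.clone(), a.predicate.clone());
            let is_newer = self
                .views
                .get(&key)
                .map_or(true, |existing| a.timestamp >= existing.timestamp);
            if is_newer {
                self.views.insert(key, a.clone());
            }
        }
    }

    /// Serves from the view when one exists, otherwise falls back to a scan.
    fn query(&self, subject: &str, predicate: &str) -> Option<(&Assertion, Path)> {
        let key = (subject.to_string(), predicate.to_string());
        if let Some(winner) = self.views.get(&key) {
            return Some((winner, Path::Fast));
        }
        self.slow_path(subject, predicate).map(|a| (a, Path::Slow))
    }
}

/// Writes `n` assertions for one fact with increasing timestamps.
fn store_with_writes(n: u64) -> Store {
    let mut store = Store::default();
    for i in 0..n {
        store.log.push(Assertion {
            subject: "Protein_X".to_string(),
            predicate: "binds".to_string(),
            object: format!("candidate_{i}"),
            timestamp: i,
        });
    }
    store
}

fn main() {
    let mut store = store_with_writes(10);
    println!("before: {:?}", store.query("Protein_X", "binds").map(|(_, p)| p));
    store.materialize();
    println!("after:  {:?}", store.query("Protein_X", "binds").map(|(_, p)| p));
}
```

In a test, the assertion of interest is that the fast-path winner equals `slow_path()` for the same key after materialization has caught up.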
Arena 4: Agent Personas (First Strategy Differentiation) ✅ COMPLETE
Goal: Agents behave differently based on persona. No longer uniform.
Depends on: Arena 3 complete
Aligns with: Vision document "The Players"
- 4.1 AgentStrategy Trait
  - `AgentStrategy` trait with `decide_action()`, `base_confidence()`, `name()`.
  - `AgentAction` enum: Assert, Vote, Query, Skip.
  - `WorldState`, `StrategyMetrics`, `AgentSpec`, `StrategyType`.
  - `GroundTruth` hardcoded dataset (5 known-true facts).
  - File: `crates/stemedb-sim/src/strategy.rs`
- 4.2 Scientist Strategy
  - High base confidence (0.9).
  - Even ticks: assert next ground truth fact.
  - Odd ticks: vote for truth-aligned assertions, or assert if none found.
- 4.3 Troll Strategy
  - Low base confidence (0.4).
  - Contradicts existing assertions (NOT_value, negate, flip, FAKE_ref).
  - Skips on empty world (tick 0).
- 4.4 Believer Strategy
  - Medium base confidence (0.65).
  - Votes for highest-confidence assertion (weight 0.7).
  - Pure amplifier - never creates assertions, only votes.
- 4.5 Strategy-Driven Tick Loop
  - `SimulationConfig.agents: Vec<AgentSpec>` replaces `agent_count`.
  - Each tick: strategy decides action, executed by agent.
  - Per-strategy metrics tracked and logged.
  - Arena 4 verification: differentiated behavior check.
Exit Criteria: Different agent types produce different behaviors in logs. ✅
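The three personas can be sketched as implementations of one trait. The method names, confidence values, and per-tick rules come from the list above; everything else is assumed for illustration, including the `&mut self` receiver, the payloads on `AgentAction`, the shape of `WorldState`, the Scientist's vote weight, and the second ground-truth fact (a placeholder).

```rust
/// Illustrative ground truth; the second entry is a made-up placeholder.
const GROUND_TRUTH: [&str; 2] = ["Protein_X binds Receptor_Y", "placeholder_fact_2"];

#[derive(Debug, Clone, PartialEq)]
enum AgentAction {
    Assert { claim: String, confidence: f32 },
    Vote { target: usize, weight: f32 },
    Query,
    Skip,
}

/// What an agent can see: the tick and each (claim, confidence) written so far.
struct WorldState {
    tick: u64,
    assertions: Vec<(String, f32)>,
}

trait AgentStrategy {
    fn name(&self) -> &'static str;
    fn base_confidence(&self) -> f32;
    fn decide_action(&mut self, world: &WorldState) -> AgentAction;
}

#[derive(Default)]
struct Scientist {
    next_fact: usize,
}

impl Scientist {
    fn assert_next(&mut self) -> AgentAction {
        let claim = GROUND_TRUTH[self.next_fact % GROUND_TRUTH.len()].to_string();
        self.next_fact += 1;
        AgentAction::Assert { claim, confidence: self.base_confidence() }
    }
}

impl AgentStrategy for Scientist {
    fn name(&self) -> &'static str { "Scientist" }
    fn base_confidence(&self) -> f32 { 0.9 }
    fn decide_action(&mut self, world: &WorldState) -> AgentAction {
        if world.tick % 2 == 0 {
            return self.assert_next(); // even ticks: assert next known fact
        }
        // Odd ticks: vote for a truth-aligned assertion, or assert if none exists.
        let aligned = world
            .assertions
            .iter()
            .position(|(claim, _)| GROUND_TRUTH.contains(&claim.as_str()));
        match aligned {
            Some(target) => AgentAction::Vote { target, weight: 1.0 },
            None => self.assert_next(),
        }
    }
}

struct Troll;

impl AgentStrategy for Troll {
    fn name(&self) -> &'static str { "Troll" }
    fn base_confidence(&self) -> f32 { 0.4 }
    fn decide_action(&mut self, world: &WorldState) -> AgentAction {
        match world.assertions.first() {
            None => AgentAction::Skip, // nothing to contradict yet
            Some((claim, _)) => AgentAction::Assert {
                claim: format!("NOT_{claim}"),
                confidence: self.base_confidence(),
            },
        }
    }
}

struct Believer;

impl AgentStrategy for Believer {
    fn name(&self) -> &'static str { "Believer" }
    fn base_confidence(&self) -> f32 { 0.65 }
    fn decide_action(&mut self, world: &WorldState) -> AgentAction {
        // Pure amplifier: vote for the highest-confidence assertion, never assert.
        let best = world
            .assertions
            .iter()
            .enumerate()
            .max_by(|(_, a), (_, b)| a.1.total_cmp(&b.1))
            .map(|(index, _)| index);
        match best {
            Some(target) => AgentAction::Vote { target, weight: 0.7 },
            None => AgentAction::Skip,
        }
    }
}

fn main() {
    let mut agents: Vec<Box<dyn AgentStrategy>> =
        vec![Box::new(Scientist::default()), Box::new(Troll), Box::new(Believer)];
    let mut world = WorldState { tick: 0, assertions: vec![] };
    for tick in 0..3 {
        world.tick = tick;
        for agent in agents.iter_mut() {
            let action = agent.decide_action(&world);
            println!("tick {tick} {:<9} -> {action:?}", agent.name());
            if let AgentAction::Assert { claim, confidence } = action {
                world.assertions.push((claim, confidence));
            }
        }
    }
}
```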
Arena 5: TrustRank Integration (Exercises Phase 4 Foundation)
Goal: Reputation updates based on agent behavior.
Depends on: Arena 4 complete, TrustRank implemented
Aligns with: roadmap.md Phase 4 "TrustRank Engine"
- 5.1 Initialize TrustRank for Agents
  - Each agent starts with base TrustRank (e.g., 0.5).
  - Store in TrustRankStore at simulation start.
- 5.2 Reputation Adjustment After Votes
  - When an assertion gains votes, increase author's TrustRank.
  - When an assertion is contradicted by consensus, decrease author's TrustRank.
  - Use `TrustRankStore::record_outcome()`.
- 5.3 TrustAwareAuthorityLens Verification
  - Two assertions from different agents, same confidence.
  - Agent with higher TrustRank should win via `TrustAwareAuthorityLens`.
  - Use Case Alignment: "Expert vs. junior weighting" from Agile Agent Team.
- 5.4 Troll Reputation Decay
  - After 100 ticks, verify Troll's TrustRank has decreased.
  - Verify Scientist's TrustRank has increased.
  - Success Criteria: "Trust clusters form naturally without hardcoded rules."
Exit Criteria: TrustRank diverges based on behavior; Troll reputation tanks.
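One way to picture the divergence that 5.2 and 5.4 expect is an exponential moving average that pulls a rank toward 1.0 on support and toward 0.0 on contradiction. The roadmap does not specify the update rule, so the formula, the `rate` parameter, the `Outcome` enum, and this in-memory `TrustRankStore` are all assumptions, not the planned implementation.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Supported,
    Contradicted,
}

/// Hypothetical in-memory stand-in for the TrustRankStore.
struct TrustRankStore {
    ranks: HashMap<String, f32>,
    base: f32,
    rate: f32,
}

impl TrustRankStore {
    fn new(base: f32, rate: f32) -> Self {
        Self { ranks: HashMap::new(), base, rate }
    }

    /// Agents that have no recorded outcome yet sit at the base rank.
    fn rank(&self, agent: &str) -> f32 {
        self.ranks.get(agent).copied().unwrap_or(self.base)
    }

    /// Assumed rule: move the rank a fixed fraction toward 1.0 or 0.0.
    fn record_outcome(&mut self, agent: &str, outcome: Outcome) {
        let target = match outcome {
            Outcome::Supported => 1.0,
            Outcome::Contradicted => 0.0,
        };
        let current = self.rank(agent);
        let updated = current + self.rate * (target - current);
        self.ranks.insert(agent.to_string(), updated);
    }
}

/// With equal confidence, the author with the higher rank wins.
fn trust_aware_winner<'a>(store: &TrustRankStore, authors: &[&'a str]) -> Option<&'a str> {
    authors
        .iter()
        .copied()
        .max_by(|a, b| store.rank(a).total_cmp(&store.rank(b)))
}

fn main() {
    let mut store = TrustRankStore::new(0.5, 0.05);
    for _tick in 0..100 {
        store.record_outcome("scientist", Outcome::Supported);
        store.record_outcome("troll", Outcome::Contradicted);
    }
    println!("scientist = {:.3}", store.rank("scientist"));
    println!("troll     = {:.3}", store.rank("troll"));
    println!("winner    = {:?}", trust_aware_winner(&store, &["troll", "scientist"]));
}
```

Because the update is bounded between 0.0 and 1.0, ranks stay comparable across agents regardless of how many outcomes each has accumulated.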
Arena 6: Concurrent Agents (Performance Validation)
Goal: Move from sequential to parallel agent execution.
Depends on: Arena 5 complete
Aligns with: Vision "1000 concurrent agents without locking"
- 6.1 Tokio Task Per Agent
  - Wrap each agent's tick in `tokio::spawn()`.
  - Use `Arc<Mutex<Journal>>` for WAL access (already in place).
  - Run 10 agents concurrently.
- 6.2 Scale to 100 Agents
  - Parameterize agent count.
  - Run with 100 agents for 50 ticks.
  - Verify no deadlocks, no data corruption.
- 6.3 Contention Metrics
  - Add timing around WAL lock acquisition.
  - Log P50/P99 latencies.
  - Identify bottlenecks if any.
- 6.4 Target: 1000 Agents
  - Run with 1000 agents (stretch goal).
  - May require connection pooling or batching.
  - Document findings.
Exit Criteria: 100 agents run concurrently without errors.
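The contention measurement in 6.3 can be prototyped before any async wiring. The sketch below deliberately swaps `tokio::spawn()` for `std::thread::spawn` so that it runs with the standard library alone; `Journal` is a stand-in for the real WAL type, and `run_agents` and the nearest-rank `percentile` helper are hypothetical names.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for the WAL journal: an append-only list of entries.
#[derive(Default)]
struct Journal {
    entries: Vec<String>,
}

/// Nearest-rank percentile over an ascending, non-empty slice.
fn percentile(sorted: &[Duration], pct: f64) -> Duration {
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Runs `agent_count` agents for `ticks` ticks each. Returns the number of
/// journal entries written and the sorted lock-acquisition wait times.
fn run_agents(agent_count: usize, ticks: usize) -> (usize, Vec<Duration>) {
    let journal = Arc::new(Mutex::new(Journal::default()));
    let handles: Vec<_> = (0..agent_count)
        .map(|id| {
            let journal = Arc::clone(&journal);
            thread::spawn(move || {
                let mut waits = Vec::with_capacity(ticks);
                for tick in 0..ticks {
                    let start = Instant::now();
                    let mut guard = journal.lock().expect("journal lock poisoned");
                    waits.push(start.elapsed()); // time spent waiting for the lock
                    guard.entries.push(format!("agent-{id} tick-{tick}"));
                }
                waits
            })
        })
        .collect();

    let mut waits: Vec<Duration> = Vec::new();
    for handle in handles {
        waits.extend(handle.join().expect("agent thread panicked"));
    }
    waits.sort();
    let written = journal.lock().expect("journal lock poisoned").entries.len();
    (written, waits)
}

fn main() {
    let (written, waits) = run_agents(10, 50);
    println!("entries written: {written}");
    println!("lock wait p50: {:?}", percentile(&waits, 50.0));
    println!("lock wait p99: {:?}", percentile(&waits, 99.0));
}
```

Checking that the entry count equals agents times ticks is a cheap corruption check; the percentile values themselves depend on the machine and should be logged rather than asserted.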
Arena 7: Time-Travel & Epochs (Exercises Phase 3 Features)
Goal: Validate temporal queries and epoch supersession.
Depends on: Arena 6 complete, Phase 3 Time-Travel + EpochAwareLens
Aligns with: roadmap.md Phase 3 "Time-Travel Engine", Phase 2.5 "EpochAwareLens"
- 7.1 Time-Travel Query Verification
  - At tick 50, record timestamp T1.
  - At tick 100, write a new assertion superseding an old one.
  - Query with `as_of=T1`.
  - Verify result reflects tick-50 state, not tick-100 state.
  - Use Case Alignment: "What was the state of knowledge at 9pm?"
- 7.2 Epoch Creation and Supersession
  - Create epoch "v1" at tick 0.
  - Create epoch "v2" superseding "v1" at tick 50.
  - Assertions referencing "v1" should be filtered by EpochAwareLens.
  - Use Case Alignment: "Security team migrates from RS256 to ES256."
- 7.3 Epoch Cascade Verification
  - Chain: v3 supersedes v2 supersedes v1.
  - Query with EpochAwareLens.
  - Only v3 assertions visible.
Exit Criteria: Historical queries and epoch filtering work correctly.
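Both checks in this arena are filters applied before resolution: `as_of` drops assertions written after a cutoff, and epoch awareness drops assertions whose epoch has been superseded. The sketch below models each with plain slices; `Assertion`, `query_as_of`, `live_epochs`, and `visible` are hypothetical names, and using ticks as timestamps is an assumption.

```rust
#[derive(Debug, PartialEq)]
struct Assertion {
    object: &'static str,
    timestamp: u64,
    epoch: &'static str,
}

/// Time travel: ignore everything written after `as_of`, then take the latest.
fn query_as_of(history: &[Assertion], as_of: u64) -> Option<&Assertion> {
    history
        .iter()
        .filter(|a| a.timestamp <= as_of)
        .max_by_key(|a| a.timestamp)
}

/// An epoch is live if no other epoch supersedes it.
/// `supersessions` holds (newer, older) pairs.
fn live_epochs<'a>(epochs: &[&'a str], supersessions: &[(&str, &str)]) -> Vec<&'a str> {
    epochs
        .iter()
        .copied()
        .filter(|epoch| !supersessions.iter().any(|(_, older)| older == epoch))
        .collect()
}

/// Epoch-aware filtering: keep only assertions that reference a live epoch.
fn visible<'a>(history: &'a [Assertion], live: &[&str]) -> Vec<&'a Assertion> {
    history.iter().filter(|a| live.contains(&a.epoch)).collect()
}

fn main() {
    let history = [
        Assertion { object: "RS256", timestamp: 10, epoch: "v1" },
        Assertion { object: "HS512", timestamp: 60, epoch: "v2" },
        Assertion { object: "ES256", timestamp: 100, epoch: "v3" },
    ];

    // 7.1: a query as of tick 50 must not see the later writes.
    println!("as_of=50 -> {:?}", query_as_of(&history, 50).map(|a| a.object));

    // 7.3: v3 supersedes v2 supersedes v1, so only v3 stays visible.
    let live = live_epochs(&["v1", "v2", "v3"], &[("v2", "v1"), ("v3", "v2")]);
    let objects: Vec<_> = visible(&history, &live).iter().map(|a| a.object).collect();
    println!("live epochs = {live:?}, visible = {objects:?}");
}
```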
Arena 8: Skeptic & Conflict (Exercises Phase 3 Lenses)
Goal: Surface disagreement, measure consensus.
Depends on: Arena 7 complete, Phase 3 Skeptic Lens + Conflict Score
Aligns with: roadmap.md Phase 3C "Skeptic Lens", Phase 3A.2 "Conflict Score"
- 8.1 High-Conflict Scenario
  - 3 Scientist agents assert conflicting values for same fact.
  - Each votes for own assertion.
  - Query with `lens=Skeptic`.
  - Verify `conflict_score` is high (> 0.5).
- 8.2 Low-Conflict Scenario
  - 3 Scientists assert same value (agreement).
  - Query with `lens=Skeptic`.
  - Verify `conflict_score` is low (< 0.2).
- 8.3 Skeptic Surfaces Outlier
  - Consensus is A, one dissenter says B.
  - Skeptic lens returns B (the controversial position).
  - Use Case Alignment: Financial Due Diligence "disagreement is the information."
Exit Criteria: Conflict score accurately reflects disagreement.
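The thresholds in 8.1 and 8.2 can be sanity-checked against a simple candidate formula: one minus the share of vote weight held by the leading value. The roadmap does not define how `conflict_score` is computed, so this formula, along with the `tally`, `conflict_score`, and `dissenting_position` helpers, is an assumption chosen only because it yields 0.0 under full agreement and about 0.67 for a three-way split.

```rust
use std::collections::HashMap;

/// Sums vote weight per asserted value.
fn tally<'a>(votes: &[(&'a str, f32)]) -> HashMap<&'a str, f32> {
    let mut totals = HashMap::new();
    for &(value, weight) in votes {
        *totals.entry(value).or_insert(0.0) += weight;
    }
    totals
}

/// Assumed definition: 1 - (weight behind the leading value / total weight).
/// 0.0 means full agreement; values near 1.0 mean support is widely split.
fn conflict_score(votes: &[(&str, f32)]) -> f32 {
    let totals = tally(votes);
    let total: f32 = totals.values().sum();
    if total <= 0.0 {
        return 0.0;
    }
    let top = totals.values().copied().fold(0.0_f32, f32::max);
    1.0 - top / total
}

/// Returns the least-supported value when at least two values compete.
fn dissenting_position<'a>(votes: &[(&'a str, f32)]) -> Option<&'a str> {
    let totals = tally(votes);
    if totals.len() < 2 {
        return None;
    }
    totals
        .into_iter()
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(value, _)| value)
}

fn main() {
    let split = [("Receptor_Y", 1.0), ("Receptor_Z", 1.0), ("Receptor_W", 1.0)];
    let agreed = [("Receptor_Y", 1.0), ("Receptor_Y", 1.0), ("Receptor_Y", 1.0)];
    let outlier = [("A", 1.0), ("A", 1.0), ("A", 1.0), ("B", 1.0)];

    println!("three-way split: {:.2}", conflict_score(&split));
    println!("full agreement:  {:.2}", conflict_score(&agreed));
    println!("outlier surfaced: {:?}", dissenting_position(&outlier));
}
```

With three equally weighted conflicting values the score is 1 - 1/3, which clears the 0.5 threshold in 8.1; a single dissenter against three supporters scores 0.25.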
Arena 9: Full Gameplay Loop (The Vision)
Goal: Run the complete vision scenario end-to-end.
Depends on: Arena 8 complete, all Phase 3 features
Aligns with: simulation-vision.md "The Gameplay Loop"
- 9.1 Ground Truth Injection
  - Load ground truth from YAML config.
  - Scientists read ground truth, assert facts.
- 9.2 The 5-Tick Scenario
  - Tick 1: Scientist asserts "Protein_X binds Receptor_Y".
  - Tick 2: Troll forks with "Protein_X binds Nothing".
  - Tick 3: Believer queries, votes for Scientist.
  - Tick 4: TrustRank updates (Scientist up, Troll down).
  - Tick 5: Verify consensus via lens.
- 9.3 Extended Run (1000 Ticks)
  - Run full scenario for 1000 ticks.
  - Track metrics:
    - `truth_convergence`: % of facts matching ground truth.
    - `reputation_distribution`: Scientist vs Troll ranks.
    - `fork_depth_max`: Deepest contradiction chain.
- 9.4 Success Criteria Verification
  - ✓ Truth survives: High-reputation assertions outlive spam.
  - ✓ Lenses work: Consensus lens filters Troll noise.
  - ✓ Performance: 1000 ticks complete in < 30 seconds.
  - ✓ Emergence: Trust clusters form naturally.
Exit Criteria: All 4 success criteria from vision document pass.
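The `truth_convergence` metric from 9.3 is a straightforward ratio. The sketch below computes it for facts keyed by a "subject predicate" string; the key format, the function signature, the sample facts other than the Protein_X one, and the choice to report 100% for an empty ground truth are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Percentage of ground-truth facts whose resolved value matches the expected
/// value. Facts that are missing from `resolved` count as mismatches.
fn truth_convergence(ground_truth: &[(&str, &str)], resolved: &HashMap<&str, &str>) -> f64 {
    if ground_truth.is_empty() {
        return 100.0; // assumed convention: nothing to converge on
    }
    let mut matching = 0usize;
    for (key, expected) in ground_truth {
        if resolved.get(*key).copied() == Some(*expected) {
            matching += 1;
        }
    }
    100.0 * matching as f64 / ground_truth.len() as f64
}

fn main() {
    // Illustrative data: only the first fact appears in the roadmap itself.
    let ground_truth = [
        ("Protein_X binds", "Receptor_Y"),
        ("fact_2", "value_2"),
        ("fact_3", "value_3"),
        ("fact_4", "value_4"),
    ];
    let resolved: HashMap<&str, &str> = HashMap::from([
        ("Protein_X binds", "Receptor_Y"),
        ("fact_2", "value_2"),
        ("fact_3", "NOT_value_3"), // a Troll assertion won this one
        ("fact_4", "value_4"),
    ]);
    println!("truth_convergence = {}%", truth_convergence(&ground_truth, &resolved));
}
```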
Alignment with Main Roadmap
| Arena Phase | Exercises Roadmap Phase | Key Features Validated |
|---|---|---|
| Arena 0 ✅ | - | Test infrastructure |
| Arena 1 ✅ | Phase 2 | QueryEngine, Lenses, Lifecycle, Query Audit |
| Arena 2 ✅ | Phase 2 | VoteStore, VoteAwareConsensusLens |
| Arena 2.5 ✅ | - (Hardening) | Race conditions, API tests, crash recovery, input validation |
| Arena 3 ✅ | Phase 2 | Materializer, Fast-Path MV, MV Freshness |
| Arena 4 ✅ | - | Agent personas: Scientist, Troll, Believer (simulator-only) |
| Arena 5 | Phase 4 | TrustRank, TrustAwareAuthorityLens |
| Arena 6 | Phase 4 | Concurrency, Performance |
| Arena 7 | Phase 2.5 + Phase 3 | Time-Travel, Epochs, EpochAwareLens |
| Arena 8 | Phase 3 | Skeptic Lens, Conflict Score |
| Arena 9 | All | Full integration |
Alignment with Use Cases
| Use Case | Arena Phase That Validates It |
|---|---|
| Agile Agent Team | |
| - Lifecycle filtering | Arena 1.3 |
| - Query audit trail | Arena 1.4 |
| - Time-travel debugging | Arena 7.1 |
| - Expert weighting | Arena 5.3 |
| - Persistent learning | Arena 5.4 (TrustRank) |
| Financial Due Diligence | |
| - Conflict detection | Arena 8.1, 8.3 |
| - Time-travel | Arena 7.1 |
| - Epoch cascades | Arena 7.2, 7.3 |
| Consumer Health | |
| - Source-class hierarchy | Phase 3 dependency (not in Arena yet) |
| - Layered consensus | Phase 3 dependency |
Development Cadence
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Arena 0 | CI-runnable simulation ✅ |
| 2 | Arena 1 | Query path verified ✅ |
| 3 | Arena 2 | Voting verified ✅ |
| 4 | Arena 2.5 | Hardening: race fix, API tests, crash recovery ✅ |
| 5 | Arena 3 | Materializer + MVs verified ✅ |
| 6 | Arena 4 | Agent personas differentiated ✅ |
| 7-8 | Arena 5-6 | TrustRank + concurrency |
| 9-10 | Arena 7-8 | Time-travel + Skeptic |
| 11-12 | Arena 9 | Full gameplay loop |
Metrics to Track
Once Arena 6+ is complete, export these to logs (and eventually Prometheus):
| Metric | Description | Success Target |
|---|---|---|
| `truth_convergence` | % of facts matching ground truth | > 95% |
| `troll_reputation` | Troll agent TrustRank at end | < 0.2 |
| `scientist_reputation` | Scientist agent TrustRank at end | > 0.8 |
| `fork_depth_max` | Deepest contradiction chain | < 10 |
| `p99_write_latency_ms` | Write path latency | < 10ms |
| `p99_query_latency_ms` | Query path latency | < 50ms |
| `concurrent_agents` | Max concurrent agents without errors | 1000 |
Non-Goals (Kept Simple)
These are explicitly out of scope for the Arena:
- Prometheus/Grafana integration - Logs suffice for Phase 3.
- YAML scenario config - Hardcoded scenarios are fine until Arena 9.
- Full chaos injection (network partitions, node kills) - Basic crash recovery in 2.5; advanced chaos deferred to Phase 4+.
- External agent frameworks (ADK-Go) - Simulator uses Rust agents.
Note: HTTP API testing was previously a non-goal but is now addressed in Arena 2.5.2 due to critical gap discovery.
Next Step
Arena 0-4 and Arena 2.5 are complete. Proceed to Arena 5: TrustRank Integration.
# Verify Arena 0 + 1 + 2 + 2.5 + 3 + 4 still work:
cargo test -p stemedb-sim
# Binary also works (shows persona differentiation):
RUST_LOG=info cargo run --bin stemedb-sim