# Arena Roadmap: The Simulation

> **Goal:** Incrementally evolve the simulator from Spine validation to a full Agent-Based Modeling environment.
> **Philosophy:** Make it run. Then add. Verify at every step.
> **Alignment:** Tracks main `roadmap.md` phases; exercises features as they land.

---

## Current State (Baseline)

The simulator (`stemedb-sim`) currently validates **Phase 1: The Spine**:

| Component | Status | What It Proves |
|-----------|--------|----------------|
| WAL Durability | ✓ Works | Writes persist |
| rkyv Serialization | ✓ Works | Roundtrip correctness |
| Ed25519 Signatures | ✓ Works | Sign on write, verify on read |
| Ingestor Pipeline | ✓ Works | WAL → KV async flow |
| Agent Identity | ✓ Works | Keypair generation |

**Run command:** `cargo run --bin stemedb-sim`

**What's NOW tested (Arena 1-4):**
- ✅ Queries via QueryEngine
- ✅ Lens resolution (Recency, VoteAwareConsensus)
- ✅ Lifecycle filtering
- ✅ Voting & consensus
- ✅ Query audit trail
- ✅ Materialized Views (Arena 3)
- ✅ Fast-path MV reads
- ✅ MV freshness under load
- ✅ Agent personas (Arena 4)

**What's NOT yet tested:**
- ❌ TrustRank (Arena 5)
- ❌ Concurrent agents (Arena 6)
- ❌ Time-travel queries (Arena 7)

---

## Arena Phases

### Arena 0: Make It Verifiable ✅ COMPLETE
*Goal: Add assertions for what the simulation proves. Currently it prints logs; we need programmatic success/failure.*

**Why first:** Without assertions, we don't know if later changes break things.

- [x] **0.1 Return Result from main()**: Change from print-and-exit to structured outcome.
    - [x] Define `SimulationResult { assertions_written: u64, assertions_verified: u64, errors: Vec<SimulationError> }`.
    - [x] Library function `run_simulation(config) -> Result<SimulationResult, SimulationSetupError>`.
    - [x] Print summary at end, exit 0 on success, exit 1 on failure, exit 2 on setup error.

- [x] **0.2 Integration Test Wrapper**: Make the sim runnable as a test.
    - [x] Add `crates/stemedb-sim/tests/smoke.rs` with 6 integration tests.
    - [x] Assert on `SimulationResult` fields.
    - [x] Run in CI via `cargo test -p stemedb-sim`.

**Exit Criteria:** `cargo test -p stemedb-sim` passes ✅

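The structured outcome from item 0.1 can be sketched as follows. This is a minimal illustration, not the crate's actual code: the field list of `SimulationResult` comes from the item above, while the contents of `SimulationError` and `SimulationSetupError`, the `is_success()` rule, and the `exit_code()` helper are assumptions made for the example. A `main()` would pass the returned value to `std::process::exit`.

```rust
/// Hypothetical error record; the real `SimulationError` may carry more detail.
#[derive(Debug, Clone, PartialEq)]
pub struct SimulationError {
    pub message: String,
}

/// Hypothetical setup failure (for example, the WAL directory could not be created).
#[derive(Debug, Clone, PartialEq)]
pub struct SimulationSetupError(pub String);

/// Fields as listed in item 0.1.
#[derive(Debug, Default)]
pub struct SimulationResult {
    pub assertions_written: u64,
    pub assertions_verified: u64,
    pub errors: Vec<SimulationError>,
}

impl SimulationResult {
    /// Assumed success rule: no errors and every written assertion was verified.
    pub fn is_success(&self) -> bool {
        self.errors.is_empty() && self.assertions_written == self.assertions_verified
    }
}

/// Maps an outcome to the exit codes from item 0.1:
/// 0 on success, 1 on failure, 2 on setup error.
pub fn exit_code(outcome: &Result<SimulationResult, SimulationSetupError>) -> i32 {
    match outcome {
        Ok(result) if result.is_success() => 0,
        Ok(_) => 1,
        Err(_) => 2,
    }
}
```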
---

### Arena 1: Query Path (Exercises Phase 2 Features) ✅ COMPLETE
*Goal: Extend simulation to read via QueryEngine, not direct KV access.*

**Depends on:** Phase 2 complete (Query Engine, Lenses)
**Aligns with:** `roadmap.md` Phase 2 "The Lattice"

- [x] **1.1 Add Query Engine to Simulation**
    - [x] Import `stemedb-query` crate.
    - [x] After ingestion wait, query each assertion via `QueryEngine::execute()`.
    - [x] Verify result matches what was written.

- [x] **1.2 Exercise Recency Lens**
    - [x] Write 2 assertions for same subject+predicate with different timestamps.
    - [x] Query with `lens=Recency`.
    - [x] Verify most recent wins.

- [x] **1.3 Exercise Lifecycle Filtering**
    - [x] Write one `Proposed` and one `Approved` assertion for same fact.
    - [x] Query with `lifecycle=Approved`.
    - [x] Verify only Approved returned.
    - [x] **Use Case Alignment:** This is the JWT signing algorithm bug from Agile Agent Team.

- [x] **1.4 Query Audit Verification**
    - [x] Set `X-Agent-Id` header (or equivalent in direct API call).
    - [x] After queries, call `AuditStore::get_audits_for_agent()`.
    - [x] Verify audit trail exists with correct contributing assertions.
    - [x] **Use Case Alignment:** "What did the deployment agent query?"

**Exit Criteria:** Simulation writes, ingests, queries, and verifies all three scenarios. ✅

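The checks in 1.2 and 1.3 amount to "latest timestamp wins, optionally restricted to one lifecycle state". A minimal sketch of that rule is shown here, assuming a simplified `Assertion` record and a free function `resolve_recency()`; both are hypothetical stand-ins and do not reflect the real `stemedb-core` types or the `QueryEngine` API.

```rust
/// Hypothetical lifecycle states; only the two named in item 1.3 are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Proposed,
    Approved,
}

/// Hypothetical, simplified assertion record (not the real stemedb-core type).
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub timestamp: u64,
    pub lifecycle: Lifecycle,
}

/// Returns the most recent assertion for a subject+predicate pair.
/// When `lifecycle` is `Some`, only assertions in that state are considered.
pub fn resolve_recency<'a>(
    assertions: &'a [Assertion],
    subject: &str,
    predicate: &str,
    lifecycle: Option<Lifecycle>,
) -> Option<&'a Assertion> {
    assertions
        .iter()
        .filter(|a| a.subject == subject && a.predicate == predicate)
        .filter(|a| lifecycle.map_or(true, |wanted| a.lifecycle == wanted))
        .max_by_key(|a| a.timestamp)
}
```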
---

### Arena 2: Voting & Consensus (Exercises Phase 2 VoteStore) ✅ COMPLETE
*Goal: Simulate agents voting on assertions and resolving via VoteAwareConsensusLens.*

**Depends on:** Arena 1 complete, Phase 2 VoteStore
**Aligns with:** `roadmap.md` Phase 2 "The Ballot Box"

- [x] **2.1 Add Vote Creation to Agents**
    - [x] `Agent::vote(assertion_hash, weight)` method.
    - [x] Votes stored directly via VoteStore (bypasses WAL for now - see note below).

- [x] **2.2 Conflicting Assertions with Votes**
    - [x] Scientist_Alpha asserts "Protein_X binds Receptor_Y" (confidence 0.8).
    - [x] Scientist_Beta asserts "Protein_X binds Receptor_Z" (confidence 0.8).
    - [x] Alpha votes for own assertion (weight 1.0).
    - [x] Beta votes for own assertion (weight 1.0).
    - [x] Third agent (Believer) votes for Alpha's assertion.
    - [x] Query with `lens=VoteAwareConsensus`.
    - [x] Verify Alpha's assertion wins (2 votes vs 1).

- [x] **2.3 Troll Vote Resistance**
    - [x] Troll creates low-confidence assertion contradicting consensus.
    - [x] Troll votes for own assertion.
    - [x] Verify high-vote assertions still win.

**Exit Criteria:** Vote-based consensus correctly resolves conflicts. ✅

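Scenario 2.2 can be illustrated with a small weight tally. This is a sketch under stated assumptions: the `Vote` struct and `resolve_by_votes()` are hypothetical, the rule "highest summed weight wins" is a simplification of whatever `VoteAwareConsensusLens` actually computes, and ties are broken arbitrarily because `HashMap` iteration order is unspecified.

```rust
use std::collections::HashMap;

/// Hypothetical vote record: which assertion it supports and with what weight.
pub struct Vote {
    pub assertion_hash: String,
    pub weight: f32,
}

/// Sums vote weights per assertion and returns the hash with the highest total.
/// Returns `None` when there are no votes.
pub fn resolve_by_votes(votes: &[Vote]) -> Option<String> {
    let mut totals: HashMap<&str, f32> = HashMap::new();
    for vote in votes {
        *totals.entry(vote.assertion_hash.as_str()).or_insert(0.0) += vote.weight;
    }
    totals
        .into_iter()
        // total_cmp gives a total order over f32, so no unwrap on partial_cmp is needed.
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(hash, _)| hash.to_string())
}
```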
---

### Arena 2.5: Hardening (Critical Gap Remediation) ✅ COMPLETE
*Goal: Fix critical gaps discovered during Arena 0-2 review before adding more features.*

**Depends on:** Arena 2 complete
**Blocks:** Arena 3+ (don't add features on a shaky foundation)
**Rationale:** Gap analysis revealed 58/100 production readiness score with 3 critical blockers.

- [x] **2.5.1 Fix Vote Cache Race Condition** (P0 - CRITICAL)
    - [x] VoteStore `put_vote()` uses `fetch_and_add_u64` + `compare_and_swap_f32` (vote_store.rs:182-189)
    - [x] Problem: two concurrent calls could lose updates (final count = N instead of N+1)
    - [x] Solution: Add atomic increment or compare-and-swap operation
    - [x] Add test: 100 concurrent `put_vote()` calls, verify final count
    - [x] **File:** `crates/stemedb-storage/src/vote_store.rs:181-206`

- [x] **2.5.2 Add API Integration Tests** (P0 - CRITICAL)
    - [x] Create `crates/stemedb-api/tests/http_integration.rs`
    - [x] Test `POST /assertions` - create assertion via HTTP
    - [x] Test `POST /votes` - submit vote via HTTP
    - [x] Test `GET /query` - query with lens parameter
    - [x] Test error responses (400 Bad Request, 500 Internal Server Error)
    - [x] Test rate limiting via QuotaStore middleware
    - [x] **Gap addressed:** The entire HTTP layer was previously untested

- [x] **2.5.3 Add Crash Recovery Test** (P0 - CRITICAL)
    - [x] Write assertions to WAL
    - [x] Kill IngestWorker mid-step (simulate crash)
    - [x] Restart IngestWorker with same WAL + KV store
    - [x] Verify: cursor resumes correctly, no duplicate ingestion (worker.rs:1518)
    - [x] Verify: all pre-crash data is recoverable
    - [x] **Validates:** Durability claims in architecture.md

- [x] **2.5.4 Add Input Validation** (P1 - HIGH)
    - [x] Max subject length: 1024 characters
    - [x] Max predicate length: 1024 characters
    - [x] Confidence range: 0.0 to 1.0, reject NaN/Inf
    - [x] Vote weight: non-negative, reject NaN/Inf
    - [x] Timestamp: reject values > current time + 1 hour (clock skew protection)
    - [x] Add validation in `IngestWorker::validate_assertion()` and `validate_vote()`
    - [x] **File:** `crates/stemedb-ingest/src/worker.rs`

- [x] **2.5.5 Replace Sleep Timers with Ingestion Sync** (P1 - HIGH)
    - [x] `wait_until_ingested()` cursor-based polling replaces all sleeps
    - [x] Add: `wait_for_ingestion(store, expected_count, timeout)` helper
    - [x] Poll store until expected assertions exist or timeout
    - [x] Replace all hardcoded sleeps in simulation
    - [x] **Benefit:** Faster tests, deterministic behavior

- [x] **2.5.6 Fix Defensive Error Handling** (P2 - MEDIUM)
    - [x] Input validation propagates errors properly
    - [x] Change to propagate error or skip candidate with warning
    - [x] `worker.rs:161-173`: Ambiguous EOF handling treats all I/O errors as "no data"
    - [x] Distinguish true EOF from transient errors

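The limits in 2.5.4 translate into a handful of guard clauses. The sketch below is illustrative only: the function names echo `validate_assertion()` and `validate_vote()` from the item, but the free-function signatures, the `ValidationError` enum, passing `now_secs` explicitly (to keep the check deterministic), and counting Unicode characters rather than bytes are all assumptions.

```rust
const MAX_FIELD_CHARS: usize = 1024;
const MAX_CLOCK_SKEW_SECS: u64 = 3600; // 1 hour of tolerated clock skew

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    SubjectTooLong,
    PredicateTooLong,
    InvalidConfidence,
    InvalidWeight,
    TimestampInFuture,
}

/// Applies the assertion limits listed in 2.5.4.
pub fn validate_assertion(
    subject: &str,
    predicate: &str,
    confidence: f32,
    timestamp_secs: u64,
    now_secs: u64,
) -> Result<(), ValidationError> {
    if subject.chars().count() > MAX_FIELD_CHARS {
        return Err(ValidationError::SubjectTooLong);
    }
    if predicate.chars().count() > MAX_FIELD_CHARS {
        return Err(ValidationError::PredicateTooLong);
    }
    // NaN and infinities fail the range check: NaN compares false with
    // everything, and +/-Inf lies outside 0.0..=1.0.
    if !(0.0..=1.0).contains(&confidence) {
        return Err(ValidationError::InvalidConfidence);
    }
    if timestamp_secs > now_secs.saturating_add(MAX_CLOCK_SKEW_SECS) {
        return Err(ValidationError::TimestampInFuture);
    }
    Ok(())
}

/// Applies the vote-weight rule: finite and non-negative.
pub fn validate_vote(weight: f32) -> Result<(), ValidationError> {
    if !weight.is_finite() || weight < 0.0 {
        return Err(ValidationError::InvalidWeight);
    }
    Ok(())
}
```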
**Exit Criteria:**
- [x] Vote cache is atomic (concurrent test passes)
- [x] API layer has integration tests (POST/GET work via HTTP)
- [x] Crash recovery is verified (no data loss on restart)
- [x] Input validation rejects malformed data
- [x] No hardcoded sleep timers in simulation
- [x] Production readiness score: 75+ (up from 58)

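The first exit criterion (100 concurrent votes, exact final count) can be demonstrated with a standard-library atomic. This sketch does not use the real `VoteStore`: `VoteCounter` and `run_concurrent_votes()` are hypothetical, and `AtomicU64::fetch_add` stands in for the store's own `fetch_and_add_u64` primitive. The point is only that an atomic read-modify-write cannot lose updates the way a separate read followed by a write can.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the cached vote count kept by the vote store.
#[derive(Default)]
pub struct VoteCounter {
    count: AtomicU64,
}

impl VoteCounter {
    /// Atomically increments the count and returns the new value.
    pub fn record_vote(&self) -> u64 {
        self.count.fetch_add(1, Ordering::SeqCst) + 1
    }

    pub fn count(&self) -> u64 {
        self.count.load(Ordering::SeqCst)
    }
}

/// Spawns `voters` threads that each record one vote, then returns the final count.
pub fn run_concurrent_votes(voters: usize) -> u64 {
    let counter = Arc::new(VoteCounter::default());
    let handles: Vec<_> = (0..voters)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                counter.record_vote();
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("voter thread panicked");
    }
    counter.count()
}
```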
---

### Arena 3: Materialized Views (Exercises Phase 2 Materializer) ✅ COMPLETE
*Goal: Verify fast-path MV reads work under simulation load.*

**Depends on:** Arena 2 complete, Phase 2 Materializer
**Aligns with:** `roadmap.md` Phase 2 "Materializer"

- [x] **3.1 Materializer Integration**
    - [x] Spin up Materializer alongside Ingestor.
    - [x] Wire `Notify` between IngestWorker and Materializer.
    - [x] After ingestion, verify MV keys exist in store.

- [x] **3.2 Fast-Path Verification**
    - [x] Query via QueryEngine with subject+predicate.
    - [x] Log whether fast-path or slow-path was used (add debug output).
    - [x] Verify MV winner matches slow-path result.

- [x] **3.3 MV Freshness Under Load**
    - [x] Write 10 assertions in rapid succession.
    - [x] Wait for materialization.
    - [x] Verify MV reflects latest state.
    - [x] **Aligns with:** Phase 2.5 "MV Staleness Detection"

**Exit Criteria:** Fast-path queries return correct results under load. ✅

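The equivalence checked in 3.2 and 3.3 (the materialized winner must match what a full scan would return) can be sketched with an in-memory model. Everything here is hypothetical: `Store`, `ReadPath`, and the recency-based winner rule are simplifications chosen for the example, and a synchronous `materialize()` call replaces the `Notify`-driven Materializer.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified assertion record.
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub timestamp: u64,
}

/// Which read path served a query.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadPath {
    Fast,
    Slow,
}

/// In-memory stand-in: an append-only log plus materialized winners per key.
#[derive(Default)]
pub struct Store {
    log: Vec<Assertion>,
    views: HashMap<(String, String), Assertion>,
}

impl Store {
    pub fn ingest(&mut self, assertion: Assertion) {
        self.log.push(assertion);
    }

    /// Rebuilds the views; the latest timestamp wins, later entries win ties.
    pub fn materialize(&mut self) {
        for a in &self.log {
            let key = (a.subject.clone(), a.predicate.clone());
            let is_newer = self
                .views
                .get(&key)
                .map_or(true, |current| a.timestamp >= current.timestamp);
            if is_newer {
                self.views.insert(key, a.clone());
            }
        }
    }

    /// Full scan over the log with the same winner rule as `materialize`.
    pub fn slow_path(&self, subject: &str, predicate: &str) -> Option<&Assertion> {
        self.log
            .iter()
            .filter(|a| a.subject == subject && a.predicate == predicate)
            .max_by_key(|a| a.timestamp)
    }

    /// Serves from the view when one exists, otherwise falls back to the scan.
    pub fn query(&self, subject: &str, predicate: &str) -> Option<(ReadPath, &Assertion)> {
        let key = (subject.to_string(), predicate.to_string());
        if let Some(winner) = self.views.get(&key) {
            return Some((ReadPath::Fast, winner));
        }
        self.slow_path(subject, predicate).map(|a| (ReadPath::Slow, a))
    }
}
```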
---

### Arena 4: Agent Personas (First Strategy Differentiation) ✅ COMPLETE
*Goal: Agents behave differently based on persona. No longer uniform.*

**Depends on:** Arena 3 complete
**Aligns with:** Vision document "The Players"

- [x] **4.1 AgentStrategy Trait**
    - [x] `AgentStrategy` trait with `decide_action()`, `base_confidence()`, `name()`.
    - [x] `AgentAction` enum: Assert, Vote, Query, Skip.
    - [x] `WorldState`, `StrategyMetrics`, `AgentSpec`, `StrategyType`.
    - [x] `GroundTruth` hardcoded dataset (5 known-true facts).
    - [x] **File:** `crates/stemedb-sim/src/strategy.rs`

- [x] **4.2 Scientist Strategy**
    - [x] High base confidence (0.9).
    - [x] Even ticks: assert next ground truth fact.
    - [x] Odd ticks: vote for truth-aligned assertions, or assert if none found.

- [x] **4.3 Troll Strategy**
    - [x] Low base confidence (0.4).
    - [x] Contradicts existing assertions (NOT_value, negate, flip, FAKE_ref).
    - [x] Skips on empty world (tick 0).

- [x] **4.4 Believer Strategy**
    - [x] Medium base confidence (0.65).
    - [x] Votes for highest-confidence assertion (weight 0.7).
    - [x] Pure amplifier - never creates assertions, only votes.

- [x] **4.5 Strategy-Driven Tick Loop**
    - [x] `SimulationConfig.agents: Vec<AgentSpec>` replaces `agent_count`.
    - [x] Each tick: strategy decides action, executed by agent.
    - [x] Per-strategy metrics tracked and logged.
    - [x] Arena 4 verification: differentiated behavior check.

**Exit Criteria:** Different agent types produce different behaviors in logs. ✅

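The three personas can be sketched against the trait shape named in 4.1. The method names, action names, base confidences, and the Believer's vote weight come from the items above; the payload fields on `AgentAction`, the `SeenAssertion` and `WorldState` layouts, the Scientist's vote weight of 1.0, and the Believer skipping on an empty world are assumptions. Only the `NOT_value` contradiction style is modeled for the Troll.

```rust
/// Actions named in item 4.1; the payload fields are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum AgentAction {
    Assert { object: String },
    Vote { assertion_index: usize, weight: f32 },
    Query,
    Skip,
}

/// Hypothetical view of an assertion already present in the world.
pub struct SeenAssertion {
    pub object: String,
    pub confidence: f32,
}

/// Hypothetical, minimal world snapshot handed to each strategy per tick.
pub struct WorldState {
    pub tick: u64,
    pub assertions: Vec<SeenAssertion>,
}

pub trait AgentStrategy {
    fn name(&self) -> &'static str;
    fn base_confidence(&self) -> f32;
    fn decide_action(&self, world: &WorldState) -> AgentAction;
}

pub struct Scientist {
    pub ground_truth: Vec<String>,
}

impl AgentStrategy for Scientist {
    fn name(&self) -> &'static str { "Scientist" }
    fn base_confidence(&self) -> f32 { 0.9 }
    fn decide_action(&self, world: &WorldState) -> AgentAction {
        if self.ground_truth.is_empty() {
            return AgentAction::Skip;
        }
        let index = (world.tick / 2) as usize % self.ground_truth.len();
        let next_fact = self.ground_truth[index].clone();
        // Even ticks: assert the next ground truth fact.
        if world.tick % 2 == 0 {
            return AgentAction::Assert { object: next_fact };
        }
        // Odd ticks: vote for a truth-aligned assertion, or assert if none exists.
        match world.assertions.iter().position(|a| self.ground_truth.contains(&a.object)) {
            Some(i) => AgentAction::Vote { assertion_index: i, weight: 1.0 },
            None => AgentAction::Assert { object: next_fact },
        }
    }
}

pub struct Troll;

impl AgentStrategy for Troll {
    fn name(&self) -> &'static str { "Troll" }
    fn base_confidence(&self) -> f32 { 0.4 }
    fn decide_action(&self, world: &WorldState) -> AgentAction {
        match world.assertions.first() {
            // Contradict an existing claim by prefixing its value with NOT_.
            Some(target) => AgentAction::Assert { object: format!("NOT_{}", target.object) },
            // Nothing to contradict on an empty world.
            None => AgentAction::Skip,
        }
    }
}

pub struct Believer;

impl AgentStrategy for Believer {
    fn name(&self) -> &'static str { "Believer" }
    fn base_confidence(&self) -> f32 { 0.65 }
    fn decide_action(&self, world: &WorldState) -> AgentAction {
        // Pure amplifier: vote for the highest-confidence assertion, never assert.
        world
            .assertions
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.confidence.total_cmp(&b.1.confidence))
            .map(|(i, _)| AgentAction::Vote { assertion_index: i, weight: 0.7 })
            .unwrap_or(AgentAction::Skip)
    }
}
```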
---

### Arena 5: TrustRank Integration (Exercises Phase 4 Foundation)
*Goal: Reputation updates based on agent behavior.*

**Depends on:** Arena 4 complete, TrustRank implemented
**Aligns with:** `roadmap.md` Phase 4 "TrustRank Engine"

- [ ] **5.1 Initialize TrustRank for Agents**
    - [ ] Each agent starts with base TrustRank (e.g., 0.5).
    - [ ] Store in TrustRankStore at simulation start.

- [ ] **5.2 Reputation Adjustment After Votes**
    - [ ] When an assertion gains votes, increase author's TrustRank.
    - [ ] When an assertion is contradicted by consensus, decrease author's TrustRank.
    - [ ] Use `TrustRankStore::record_outcome()`.

- [ ] **5.3 TrustAwareAuthorityLens Verification**
    - [ ] Two assertions from different agents, same confidence.
    - [ ] Agent with higher TrustRank should win via `TrustAwareAuthorityLens`.
    - [ ] **Use Case Alignment:** "Expert vs. junior weighting" from Agile Agent Team.

- [ ] **5.4 Troll Reputation Decay**
    - [ ] After 100 ticks, verify Troll's TrustRank has decreased.
    - [ ] Verify Scientist's TrustRank has increased.
    - [ ] **Success Criteria:** "Trust clusters form naturally without hardcoded rules."

**Exit Criteria:** TrustRank diverges based on behavior; Troll reputation tanks.

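One way the divergence in 5.4 could come about is an exponential moving update toward 1.0 for upheld outcomes and toward 0.0 for contradicted ones. The sketch below is speculative: the roadmap names `TrustRankStore::record_outcome()` and a base rank of 0.5 but does not specify the update rule, so the `learning_rate` parameter, the `upheld` flag, and the in-memory `HashMap` are all assumptions.

```rust
use std::collections::HashMap;

/// Base rank from item 5.1.
const BASE_TRUST: f64 = 0.5;

/// Hypothetical in-memory store; the update rule is an assumption for this sketch.
pub struct TrustRankStore {
    ranks: HashMap<String, f64>,
    learning_rate: f64,
}

impl TrustRankStore {
    pub fn new(learning_rate: f64) -> Self {
        Self { ranks: HashMap::new(), learning_rate }
    }

    /// Current rank, or the base rank for agents with no recorded outcomes.
    pub fn rank(&self, agent: &str) -> f64 {
        self.ranks.get(agent).copied().unwrap_or(BASE_TRUST)
    }

    /// Moves the rank a fraction of the way toward 1.0 (upheld) or 0.0 (contradicted).
    /// With a rate in 0.0..=1.0 the rank always stays inside 0.0..=1.0.
    pub fn record_outcome(&mut self, agent: &str, upheld: bool) {
        let rate = self.learning_rate;
        let target = if upheld { 1.0 } else { 0.0 };
        let rank = self.ranks.entry(agent.to_string()).or_insert(BASE_TRUST);
        *rank += rate * (target - *rank);
    }
}
```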
---

### Arena 6: Concurrent Agents (Performance Validation)
*Goal: Move from sequential to parallel agent execution.*

**Depends on:** Arena 5 complete
**Aligns with:** Vision "1000 concurrent agents without locking"

- [ ] **6.1 Tokio Task Per Agent**
    - [ ] Wrap each agent's tick in `tokio::spawn()`.
    - [ ] Use `Arc<Mutex<Journal>>` for WAL access (already in place).
    - [ ] Run 10 agents concurrently.

- [ ] **6.2 Scale to 100 Agents**
    - [ ] Parameterize agent count.
    - [ ] Run with 100 agents for 50 ticks.
    - [ ] Verify no deadlocks, no data corruption.

- [ ] **6.3 Contention Metrics**
    - [ ] Add timing around WAL lock acquisition.
    - [ ] Log P50/P99 latencies.
    - [ ] Identify bottlenecks if any.

- [ ] **6.4 Target: 1000 Agents**
    - [ ] Run with 1000 agents (stretch goal).
    - [ ] May require connection pooling or batching.
    - [ ] Document findings.

**Exit Criteria:** 100 agents run concurrently without errors.

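The contention measurement in 6.3 can be prototyped without any async runtime. Note the substitutions in this sketch: `std::thread` replaces the `tokio::spawn()` tasks from 6.1 so the example stays dependency-free, a `Vec<String>` behind `Arc<Mutex<..>>` stands in for the `Journal`, and `run_agents()` and `percentile()` are hypothetical helpers. The percentile uses the nearest-rank method.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `agent_count` threads that each append `ticks` entries to a shared journal.
/// Returns the number of entries written and the sorted lock-wait samples.
pub fn run_agents(agent_count: usize, ticks: usize) -> (usize, Vec<Duration>) {
    // Stand-in for Arc<Mutex<Journal>>.
    let journal: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..agent_count)
        .map(|id| {
            let journal = Arc::clone(&journal);
            thread::spawn(move || {
                let mut waits = Vec::with_capacity(ticks);
                for tick in 0..ticks {
                    // Time only the lock acquisition, not the write itself.
                    let start = Instant::now();
                    let mut guard = journal.lock().expect("journal mutex poisoned");
                    waits.push(start.elapsed());
                    guard.push(format!("agent-{}:tick-{}", id, tick));
                }
                waits
            })
        })
        .collect();

    let mut all_waits = Vec::new();
    for handle in handles {
        all_waits.extend(handle.join().expect("agent thread panicked"));
    }
    all_waits.sort();
    let written = journal.lock().expect("journal mutex poisoned").len();
    (written, all_waits)
}

/// Nearest-rank percentile over a sorted slice; `p` is in 0.0..=100.0.
pub fn percentile(sorted: &[Duration], p: f64) -> Option<Duration> {
    if sorted.is_empty() {
        return None;
    }
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```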
---

### Arena 7: Time-Travel & Epochs (Exercises Phase 3 Features)
*Goal: Validate temporal queries and epoch supersession.*

**Depends on:** Arena 6 complete, Phase 3 Time-Travel + EpochAwareLens
**Aligns with:** `roadmap.md` Phase 3 "Time-Travel Engine", Phase 2.5 "EpochAwareLens"

- [ ] **7.1 Time-Travel Query Verification**
    - [ ] At tick 50, record timestamp T1.
    - [ ] At tick 100, write a new assertion superseding an old one.
    - [ ] Query with `as_of=T1`.
    - [ ] Verify result reflects tick-50 state, not tick-100 state.
    - [ ] **Use Case Alignment:** "What was the state of knowledge at 9pm?"

- [ ] **7.2 Epoch Creation and Supersession**
    - [ ] Create epoch "v1" at tick 0.
    - [ ] Create epoch "v2" superseding "v1" at tick 50.
    - [ ] Assertions referencing "v1" should be filtered by EpochAwareLens.
    - [ ] **Use Case Alignment:** "Security team migrates from RS256 to ES256."

- [ ] **7.3 Epoch Cascade Verification**
    - [ ] Chain: v3 supersedes v2 supersedes v1.
    - [ ] Query with EpochAwareLens.
    - [ ] Only v3 assertions visible.

**Exit Criteria:** Historical queries and epoch filtering work correctly.

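Both behaviours in this arena reduce to filters over the assertion log: `as_of` drops anything written after the cutoff, and epoch filtering drops anything whose epoch has a successor. The sketch below assumes a simplified `Assertion` record, a hypothetical `EpochGraph` that maps each superseded epoch to its replacement, and free functions `query_as_of()` and `filter_current()`; none of these names are taken from the real `EpochAwareLens`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified assertion record tagged with an epoch.
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub value: String,
    pub epoch: String,
    pub timestamp: u64,
}

/// Time-travel read: the latest assertion written at or before `as_of`.
pub fn query_as_of(assertions: &[Assertion], as_of: u64) -> Option<&Assertion> {
    assertions
        .iter()
        .filter(|a| a.timestamp <= as_of)
        .max_by_key(|a| a.timestamp)
}

/// Maps a superseded epoch to the epoch that replaced it (for example "v1" -> "v2").
#[derive(Default)]
pub struct EpochGraph {
    superseded_by: HashMap<String, String>,
}

impl EpochGraph {
    pub fn supersede(&mut self, old: &str, new: &str) {
        self.superseded_by.insert(old.to_string(), new.to_string());
    }

    /// An epoch is current when nothing has superseded it.
    pub fn is_current(&self, epoch: &str) -> bool {
        !self.superseded_by.contains_key(epoch)
    }
}

/// Keeps only assertions whose epoch has not been superseded.
pub fn filter_current<'a>(graph: &EpochGraph, assertions: &'a [Assertion]) -> Vec<&'a Assertion> {
    assertions
        .iter()
        .filter(|a| graph.is_current(&a.epoch))
        .collect()
}
```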
---

### Arena 8: Skeptic & Conflict (Exercises Phase 3 Lenses)
*Goal: Surface disagreement, measure consensus.*

**Depends on:** Arena 7 complete, Phase 3 Skeptic Lens + Conflict Score
**Aligns with:** `roadmap.md` Phase 3C "Skeptic Lens", Phase 3A.2 "Conflict Score"

- [ ] **8.1 High-Conflict Scenario**
    - [ ] 3 Scientist agents assert conflicting values for same fact.
    - [ ] Each votes for own assertion.
    - [ ] Query with `lens=Skeptic`.
    - [ ] Verify `conflict_score` is high (> 0.5).

- [ ] **8.2 Low-Conflict Scenario**
    - [ ] 3 Scientists assert same value (agreement).
    - [ ] Query with `lens=Skeptic`.
    - [ ] Verify `conflict_score` is low (< 0.2).

- [ ] **8.3 Skeptic Surfaces Outlier**
    - [ ] Consensus is A, one dissenter says B.
    - [ ] Skeptic lens returns B (the controversial position).
    - [ ] **Use Case Alignment:** Financial Due Diligence "disagreement is the information."

**Exit Criteria:** Conflict score accurately reflects disagreement.

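The roadmap does not define how `conflict_score` is computed, so the sketch below picks one plausible definition purely for illustration: one minus the largest share of total vote weight. Under that assumption a three-way split scores 2/3 (above the 0.5 threshold in 8.1) and full agreement scores 0 (below the 0.2 threshold in 8.2). `tally()` and `dissenting_value()` are hypothetical helpers; the latter returns the least-supported value when at least two values compete, with ties broken arbitrarily.

```rust
use std::collections::HashMap;

/// Sums vote weight per asserted value.
fn tally<'a>(votes: &[(&'a str, f64)]) -> HashMap<&'a str, f64> {
    let mut totals = HashMap::new();
    for &(value, weight) in votes {
        *totals.entry(value).or_insert(0.0) += weight;
    }
    totals
}

/// Assumed definition: 1 - (largest share of total vote weight).
/// Returns 0.0 when there are no votes.
pub fn conflict_score(votes: &[(&str, f64)]) -> f64 {
    let totals = tally(votes);
    let total: f64 = totals.values().sum();
    if total <= 0.0 {
        return 0.0;
    }
    let top = totals.values().copied().fold(0.0, f64::max);
    1.0 - top / total
}

/// Returns the least-supported value when at least two values compete.
pub fn dissenting_value<'a>(votes: &[(&'a str, f64)]) -> Option<&'a str> {
    let totals = tally(votes);
    if totals.len() < 2 {
        return None;
    }
    totals
        .into_iter()
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(value, _)| value)
}
```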
---

### Arena 9: Full Gameplay Loop (The Vision)
*Goal: Run the complete vision scenario end-to-end.*

**Depends on:** Arena 8 complete, all Phase 3 features
**Aligns with:** `simulation-vision.md` "The Gameplay Loop"

- [ ] **9.1 Ground Truth Injection**
    - [ ] Load ground truth from YAML config.
    - [ ] Scientists read ground truth, assert facts.

- [ ] **9.2 The 5-Tick Scenario**
    - [ ] Tick 1: Scientist asserts "Protein_X binds Receptor_Y".
    - [ ] Tick 2: Troll forks with "Protein_X binds Nothing".
    - [ ] Tick 3: Believer queries, votes for Scientist.
    - [ ] Tick 4: TrustRank updates (Scientist up, Troll down).
    - [ ] Tick 5: Verify consensus via lens.

- [ ] **9.3 Extended Run (1000 Ticks)**
    - [ ] Run full scenario for 1000 ticks.
    - [ ] Track metrics:
        - `truth_convergence`: % of facts matching ground truth.
        - `reputation_distribution`: Scientist vs Troll ranks.
        - `fork_depth_max`: Deepest contradiction chain.

- [ ] **9.4 Success Criteria Verification**
    - [ ] ✓ Truth survives: High-reputation assertions outlive spam.
    - [ ] ✓ Lenses work: Consensus lens filters Troll noise.
    - [ ] ✓ Performance: 1000 ticks complete in < 30 seconds.
    - [ ] ✓ Emergence: Trust clusters form naturally.

**Exit Criteria:** All 4 success criteria from vision document pass.

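The `truth_convergence` metric from 9.3 is a straightforward percentage. A minimal sketch follows, assuming facts are keyed by a hypothetical `(subject, predicate)` pair and that `resolved` holds whatever value the lens returned for each key; an empty ground truth is reported as 0.0 by convention in this example.

```rust
use std::collections::HashMap;

/// Hypothetical fact key: (subject, predicate).
pub type FactKey = (String, String);

/// Percentage (0.0..=100.0) of ground-truth facts whose resolved value matches.
/// Facts missing from `resolved` count as mismatches.
pub fn truth_convergence(
    ground_truth: &HashMap<FactKey, String>,
    resolved: &HashMap<FactKey, String>,
) -> f64 {
    if ground_truth.is_empty() {
        return 0.0;
    }
    let matching = ground_truth
        .iter()
        .filter(|(key, expected)| resolved.get(*key) == Some(*expected))
        .count();
    100.0 * matching as f64 / ground_truth.len() as f64
}
```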
---

## Alignment with Main Roadmap

| Arena Phase | Exercises Roadmap Phase | Key Features Validated |
|-------------|------------------------|------------------------|
| Arena 0 ✅ | - | Test infrastructure |
| Arena 1 ✅ | Phase 2 | QueryEngine, Lenses, Lifecycle, Query Audit |
| Arena 2 ✅ | Phase 2 | VoteStore, VoteAwareConsensusLens |
| Arena 2.5 ✅ | - (Hardening) | Race conditions, API tests, crash recovery, input validation |
| Arena 3 ✅ | Phase 2 | Materializer, Fast-Path MV, MV Freshness |
| Arena 4 ✅ | - | Agent personas: Scientist, Troll, Believer (simulator-only) |
| Arena 5 | Phase 4 | TrustRank, TrustAwareAuthorityLens |
| Arena 6 | Phase 4 | Concurrency, Performance |
| Arena 7 | Phase 2.5 + Phase 3 | Time-Travel, Epochs, EpochAwareLens |
| Arena 8 | Phase 3 | Skeptic Lens, Conflict Score |
| Arena 9 | All | Full integration |

---

## Alignment with Use Cases

| Use Case | Arena Phase That Validates It |
|----------|-------------------------------|
| **Agile Agent Team** | |
| - Lifecycle filtering | Arena 1.3 |
| - Query audit trail | Arena 1.4 |
| - Time-travel debugging | Arena 7.1 |
| - Expert weighting | Arena 5.3 |
| - Persistent learning | Arena 5.4 (TrustRank) |
| **Financial Due Diligence** | |
| - Conflict detection | Arena 8.1, 8.3 |
| - Time-travel | Arena 7.1 |
| - Epoch cascades | Arena 7.2, 7.3 |
| **Consumer Health** | |
| - Source-class hierarchy | Phase 3 dependency (not in Arena yet) |
| - Layered consensus | Phase 3 dependency |

---

## Development Cadence

| Week | Focus | Deliverable |
|------|-------|-------------|
| 1 | Arena 0 | CI-runnable simulation ✅ |
| 2 | Arena 1 | Query path verified ✅ |
| 3 | Arena 2 | Voting verified ✅ |
| **4** | **Arena 2.5** | **Hardening: race fix, API tests, crash recovery** ✅ |
| 5 | Arena 3 | Materializer + MVs verified ✅ |
| 6 | Arena 4 | Agent personas differentiated ✅ |
| 7-8 | Arena 5-6 | TrustRank + concurrency |
| 9-10 | Arena 7-8 | Time-travel + Skeptic |
| 11-12 | Arena 9 | Full gameplay loop |

---

## Metrics to Track

Once Arena 6+ is complete, export these to logs (and eventually Prometheus):

| Metric | Description | Success Target |
|--------|-------------|----------------|
| `truth_convergence` | % of facts matching ground truth | > 95% |
| `troll_reputation` | Troll agent TrustRank at end | < 0.2 |
| `scientist_reputation` | Scientist agent TrustRank at end | > 0.8 |
| `fork_depth_max` | Deepest contradiction chain | < 10 |
| `p99_write_latency_ms` | Write path latency | < 10ms |
| `p99_query_latency_ms` | Query path latency | < 50ms |
| `concurrent_agents` | Max concurrent agents without errors | 1000 |

---

## Non-Goals (Kept Simple)

These are explicitly out of scope for the Arena:

- **Prometheus/Grafana integration** - Logs suffice for Phase 3.
- **YAML scenario config** - Hardcoded scenarios are fine until Arena 9.
- **Full chaos injection (network partitions, node kills)** - Basic crash recovery in 2.5; advanced chaos deferred to Phase 4+.
- **External agent frameworks (ADK-Go)** - Simulator uses Rust agents.

**Note:** HTTP API testing was previously a non-goal but is now addressed in Arena 2.5.2 due to critical gap discovery.

---

## Next Step

Arenas 0-4 (including Arena 2.5) are complete. Proceed to **Arena 5: TrustRank Integration**.

```bash
# Verify Arena 0 + 1 + 2 + 2.5 + 3 + 4 still work:
cargo test -p stemedb-sim

# Binary also works (shows persona differentiation):
RUST_LOG=info cargo run --bin stemedb-sim
```