# Episteme (StemeDB) Roadmap

> **Goal:** Build the "Git for Truth" substrate for autonomous AI research.
> **Current Focus:** Gap Closure Phase 3 — Remote hosted mode (Aphoria → remote StemeDB)
> **Target Vertical:** BioTech/Pharma ("The Living Review") + Code Truth (Aphoria)
> **Endgame:** Distributed multi-writer cluster for millions of concurrent agents
>
> **Infrastructure Status:** Phases 1-7 complete | Phase 8A (Chaos) complete | Pilot 1-5 complete
> **Aphoria Status:** A1-A5 + Phase 2 complete | Tier-aware resolution ✅ | Next: Remote mode
> **Security Status:** P5.1 4/5 done (TLS, limits, timeouts, rate limiting) | P5.2 ✅ complete
>
> **Archive:** For completed phases 1-8A + Pilot 1-3, see [roadmap-archive.md](./roadmap-archive.md)

---

## Current Status

| Phase | Status | Summary |
|-------|--------|---------|
| **1-7, 8A** | ✅ Complete | Core infra, cluster, trust, chaos testing |
| **MVP, Pilot 1-4** | ✅ Complete | Consumer Health demo, dashboard, API auth, metrics |
| **Aphoria A1-A4** | ✅ Complete | Observations/claims/verify/corpus/authority lens |
| **Aphoria A5** | ✅ Complete | Flywheel validated: 93.5% acceptance, 100% config recall |
| **Gap Closure Phase 2** | ✅ Complete | Tier-aware authority resolution, `--explain-authority` flag |
| **Gap Closure Phase 3** | 🎯 Current | Remote hosted mode: Aphoria → remote StemeDB via HTTP API |
| **Gap Closure Phase 4** | Planned | Claim discovery, manual convergence, promotion workflows |
| **Pilot 5** | ✅ Complete | All 7 phases complete: Security (4/5), Monitoring, Backup/DR, Runbooks, Cluster Mgmt, Reference Architecture, Pilot Success Criteria |
| **8B-C** | Planned | Distributed observability, geo-distribution |
| **9** | Planned | Disaster recovery, compliance, storage management |

---

## 🎯 Aphoria: From Scanner to Knowledge Graph Client (CURRENT)

> **Goal:** Transform Aphoria from "grep with Episteme vocabulary" into a real knowledge graph client that authors, stores, and audits claims with provenance and lineage.
> **Vision Document:** [applications/aphoria/docs/vision-gaps.md](./applications/aphoria/docs/vision-gaps.md)
> **Validation:** Maxwell scan (67 observations, 0 noise) + hand-written [claims-explained.md](./claims-explained.md)

### Completed Phases (A1-A4 + P4 — see [roadmap-archive.md](./roadmap-archive.md) for details)

| Phase | What It Delivered |
|-------|-------------------|
| **A1** | `Observation` vs `AuthoredClaim` types, bridge tier mapping, `.aphoria/claims.toml` format |
| **A2** | `aphoria claims create/list/explain/update/supersede/deprecate`, `aphoria-claims` skill |
| **A3** | `verify.rs` engine (Pass/Conflict/Missing/Unclaimed), `aphoria verify run/map`, pre-commit hook, self-audit |
| **A4** | RFC/OWASP as Episteme assertions, `AphoriaAuthorityLens`, Trust Pack export/install |
| **P4** | API auth (3 roles), backup/restore scripts, Prometheus metrics + Grafana dashboard |

### Phase A5: The Flywheel

> **Goal:** The system gets smarter with use. Each claim makes the next claim easier.
> **Details:** [vision-gaps.md — §5](./applications/aphoria/docs/vision-gaps.md#5-the-claims-explainedmd-pattern-should-be-the-product) (claims-explained.md as the product)
> **Research:** [a5-flywheel-skill-design.md](./research-requests/a5-flywheel-skill-design.md) — validates "skill calls CLI" hypothesis
> **Key Insight:** LLM reasoning over CLI JSON output replaces ML training. The flywheel is prompt engineering, not machine learning.

- [x] **A5.1 Claim Coverage Metrics**: Per-module claim density and gap reporting
  - [x] `coverage.rs`: `CoverageReport`, `ModuleCoverage`, `CoverageSummary` types
  - [x] `compute_coverage()` uses `verify_claims()` as source of truth for claim-observation matching
  - [x] Per-module: observation count, claim count, claimed/unclaimed, missing claims, density
  - [x] `aphoria coverage` CLI: table, JSON, markdown formats, `--sort-by` (name/density/unclaimed/observations)
  - [x] Coverage gaps section: modules with observations but no claims
  - [x] 8 unit tests including deprecated claim exclusion
- [x] **A5.2 Auto-Generated Documentation**: `aphoria docs generate` + `aphoria claims explain`
  - [x] `aphoria docs generate` CLI command with `--output` and `--format` (markdown/json)
  - [x] `claims_explain.rs`: groups by category, includes provenance/invariant/consequence/evidence per claim
  - [x] `explain.rs`: reads `.aphoria/claims.toml`, renders via `render_claims_markdown()`
  - [x] Provenance chains preserved (supersedes references)
- [x] **A5.3 Claim Suggester Skill**: LLM-powered pattern recognition via "skill calls CLI"
  - [x] New skill: `.claude/skills/aphoria-suggest/SKILL.md` (3 modes: cold start / foundation / flywheel)
  - [x] Workflow defined: `claims list` → `verify run --show-unclaimed` → reason by analogy → suggest
  - [x] Few-shot learning: existing claims as gold-standard examples for style matching
  - [x] Chain-of-thought: reasoning template before each suggestion
  - [x] Cold start bootstrap: reads README/CLAUDE.md/tests/ADRs when 0 claims
  - [x] Context tiers: local → semantic → summary → global (subagent)
  - [x] Quality gates: non-trivial, not type-enforced, has consequence, not duplicate
  - [x] **VG-022 CLOSED**: `verifiable_predicates()` on Extractor trait; 10 extractors declare predicates; `verify map` shows extractor→claim coverage
  - [x] **Dogfood claims**: 10 total claims in `.aphoria/claims.toml` (3 arch + 7 security) covering all ComparisonModes
  - [x] **Validate**: Run skill against Aphoria's own codebase (dogfood) - 87.5% acceptance rate (7/8)
  - [x] **Validate**: Run skill against an external project (cold start test) - 72.7% alignment (16/22), 100% config recall
  - [x] **Iterate**: Refine prompt based on suggestion quality from validation - 3 improvements identified (domain-awareness, impl depth, tuning scan)
- [x] **A5.4 Onboarding Mode**: `aphoria explain` for new team members
  - [x] `explain.rs`: `generate_explanation()` reads claims, renders narrative
  - [x] `aphoria explain` CLI with `--output` and `--format` (markdown/json)
  - [x] Shows claim inventory grouped by category with provenance
  - [x] Empty project handling: directs to `aphoria claims create`

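
The per-module coverage numbers in A5.1 can be pictured with a small sketch. This is not the code in `coverage.rs`: `ModuleTally`, `tally_by_module`, and `coverage_gaps` are hypothetical names, and the density formula (claimed observations divided by total observations) is an assumption, since the roadmap does not define it.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-module tally; the real `ModuleCoverage` type may differ.
#[derive(Debug, Default, PartialEq)]
pub struct ModuleTally {
    pub observations: usize,
    pub claimed: usize,
}

impl ModuleTally {
    /// Assumed definition: fraction of observations matched by a claim.
    pub fn density(&self) -> f64 {
        if self.observations == 0 {
            0.0
        } else {
            self.claimed as f64 / self.observations as f64
        }
    }
}

/// Each input pair is (module name, whether a claim matched the observation).
pub fn tally_by_module(observations: &[(&str, bool)]) -> BTreeMap<String, ModuleTally> {
    let mut out: BTreeMap<String, ModuleTally> = BTreeMap::new();
    for (module, is_claimed) in observations {
        let entry = out.entry((*module).to_string()).or_default();
        entry.observations += 1;
        if *is_claimed {
            entry.claimed += 1;
        }
    }
    out
}

/// Coverage gaps: modules that have observations but no matched claims.
pub fn coverage_gaps(tallies: &BTreeMap<String, ModuleTally>) -> Vec<&str> {
    tallies
        .iter()
        .filter(|(_, t)| t.observations > 0 && t.claimed == 0)
        .map(|(module, _)| module.as_str())
        .collect()
}
```

A `BTreeMap` keeps modules sorted by name, which corresponds to the `name` option of `--sort-by`; the other sort keys would need an explicit sort over the tallies.
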
### Gap Closure Phase 2: Tier-Aware Authority Resolution

> **Goal:** Make authority tiers actionable in conflict resolution. Higher-tier sources (lower tier numbers) win.
> **Why Now:** A5.3 validation proved flywheel works (93.5% acceptance). Next blocker: tier resolution.
> **User Story:** "Why is Aphoria blocking me for this? Is it really important?" → Show tier, not just binary BLOCK/FLAG.

- [x] **Phase 2.1 Tier-Aware Types**: Foundation for tier-scoped verdicts
  - [x] `TierAwareVerdict` enum with 3 variants: `SingleTier`, `MultiTier`, `HigherTierAgreement`
  - [x] `ConflictResult` extended with `tier_verdict` and `primary_tier` fields
  - [x] `resolution/` module: `tier_verdict.rs`, `authority.rs`, `mod.rs`
  - [x] 10 unit tests for tier resolution logic
- [x] **Phase 2.2 Conflict Logic Updates**: Always compute tier breakdown
  - [x] `conflict.rs`: Moved tier breakdown out of debug-only block (always populated)
  - [x] `compute_tier_aware_verdict()`: Computes per-tier verdicts with primary tier selection
  - [x] `compute_tier_breakdown()`: Groups conflicts by tier (0-5)
  - [x] Primary tier = lowest tier number (highest authority)
- [x] **Phase 2.3 Display Formatting**: Show tier names in CLI output
  - [x] `ConflictResult::Display` shows tier-aware verdict when available
  - [x] Tier names formatted: "Tier 1 (Clinical/RFC)", "Tier 3 (Expert)", etc.
  - [x] Backward compatible: legacy output when `tier_verdict` is None
- [x] **Phase 2.4 CLI Flag**: `--explain-authority` for detailed tier breakdown
  - [x] `ScanArgs` extended with `explain_authority` field
  - [x] `cli/mod.rs`: Added `--explain-authority` flag to `aphoria scan`
  - [x] `handlers/scan.rs`: Pass flag through to scan execution
  - [x] All 28 `ScanArgs` constructions updated (tests + handlers)
- [x] **Phase 2.5 Tests & Quality**: 1300 tests pass, zero clippy warnings
  - [x] All resolution module tests pass (10/10)
  - [x] All aphoria tests pass (1300/1300)
  - [x] Clippy passes with zero warnings
  - [x] Test files updated with new fields (`tier_verdict`, `primary_tier`, `explain_authority`)

**What Changed:**
- Conflicts now show tier information: `❌ BLOCK Tier 1 (Clinical/RFC)` instead of just `❌ BLOCK`
- Primary tier (highest authority) is computed and stored: Tier 1 beats Tier 3
- `--explain-authority` flag shows per-tier breakdown (which tiers have conflicting sources, at what confidence)
- Backward compatible: existing code without `tier_verdict` continues to work

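
The primary-tier rule above (lowest tier number wins) can be sketched in a few lines. This is an illustration, not the contents of `resolution/`: `TieredConflict`, `tier_breakdown`, `primary_tier`, and `tier_name` are hypothetical names, and only the tier labels quoted in this roadmap are filled in.

```rust
/// Hypothetical stand-in for one conflicting source; the real
/// `ConflictResult` carries more fields than this.
#[derive(Debug, Clone, PartialEq)]
pub struct TieredConflict {
    pub tier: u8, // 0-5; a lower number means higher authority
    pub source: String,
}

/// Group conflicts by tier; the array index is the tier number (0-5).
pub fn tier_breakdown(conflicts: &[TieredConflict]) -> [Vec<&TieredConflict>; 6] {
    let mut buckets: [Vec<&TieredConflict>; 6] = Default::default();
    for conflict in conflicts {
        // Out-of-range tiers are ignored in this sketch.
        if let Some(bucket) = buckets.get_mut(conflict.tier as usize) {
            bucket.push(conflict);
        }
    }
    buckets
}

/// Primary tier = lowest tier number that has at least one conflict.
pub fn primary_tier(conflicts: &[TieredConflict]) -> Option<u8> {
    conflicts.iter().map(|c| c.tier).filter(|t| *t <= 5).min()
}

/// Display name for a tier. Only the labels quoted in the roadmap are
/// known; other tiers fall back to a bare "Tier N".
pub fn tier_name(tier: u8) -> String {
    match tier {
        1 => "Tier 1 (Clinical/RFC)".to_string(),
        3 => "Tier 3 (Expert)".to_string(),
        4 => "Tier 4 (Community)".to_string(),
        n => format!("Tier {n}"),
    }
}
```

With a Tier 3 and a Tier 1 conflict on the same concept, `primary_tier` returns `Some(1)`, matching "Tier 1 beats Tier 3".
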
**What's Next (Phase 3):**
- Remote mode: `aphoria init --remote <url>` connects to org StemeDB instance
- Claim discovery: Query remote claims to see org patterns (specs, popular conventions)
- Manual convergence: Developer inspects claims, decides whether to align code
- Manual promotion: Developer upgrades claim tier when backed by higher-tier evidence

**Files Changed:**
- **New:** `applications/aphoria/src/resolution/{mod.rs,tier_verdict.rs,authority.rs}` (3 files, ~450 LOC)
- **Modified:** `conflict.rs`, `result.rs`, `command.rs`, `cli/mod.rs`, `handlers/{scan.rs,mod.rs}`, `lib.rs`
- **Tests:** 4 report files, 6 test files, 2 handler files (28 ScanArgs constructions updated)

---

### Gap Closure Phase 3: Remote Hosted Mode (CURRENT)

> **Goal:** Enable Aphoria to work against a remote StemeDB instance instead of local-only.
> **Why Now:** Phase 1 wired claims through StemeDB locally, but remote hosting was always the intended mode.
> **User Story:** "Configure Aphoria to connect to my org's StemeDB URL instead of running locally"

**Architecture Note:** `HostedConfig` already exists in the codebase with `url`, `sync_mode`, and auth fields. This phase is about wiring it up, not building sync infrastructure.

**Current State (Post-Phase 1 + Phase 2):**
- ✅ **Claims stored in StemeDB** (Phase 1: `EpistemeClaimStore`, roundtrip bridge, auto-migration)
- ✅ **Tier-aware resolution** (Phase 2: primary tier, tier verdict, `--explain-authority`)
- ✅ **HostedConfig exists** (config/types/hosted.rs: url, sync_mode, offline_fallback, api_key_env)
- ❌ **EpistemeClaimStore uses local StemeDB only** (no HTTP client implementation)
- ❌ **No API endpoints for claims** (StemeDB API missing `/claims/*` routes)
- ❌ **No auth integration** (API key validation not wired up)
- ❌ **No remote-mode CLI** (`aphoria init --remote` doesn't exist)

**Target State (Phase 3):**
- ✅ `aphoria init --remote https://stemedb.acme.com` writes `HostedConfig` to `.aphoria/config.toml`
- ✅ `EpistemeClaimStore` calls StemeDB HTTP API instead of local WAL/KV
- ✅ StemeDB API has `/claims/*` endpoints (create, list, fetch, update, etc.)
- ✅ Auth: API key from `$STEMEDB_API_KEY` validated on server
- ✅ Offline fallback: graceful degradation when remote unreachable (per `OfflineFallback` config)

#### Phase 3 Breakdown

**Phase 3.1: StemeDB API Endpoints for Claims** (3-4 days)
- [x] Add `/api/v1/claims` POST endpoint (create claim) — `handlers/stemedb_claims.rs::create_claim`
- [x] Add `/api/v1/claims` GET endpoint (list claims with filters: category, tier, status) — `handlers/stemedb_claims.rs::list_claims`
- [x] Add `/api/v1/claims/{concept_path}/{predicate}` GET endpoint (fetch specific claim) — `handlers/stemedb_claims.rs::get_claim`
- [x] Add `/api/v1/claims/{concept_path}/{predicate}` DELETE endpoint (mark as deprecated) — `handlers/stemedb_claims.rs::delete_claim`
- [ ] Add `/api/v1/claims/{concept_path}/{predicate}` PUT endpoint (update claim)
- [ ] Add `/api/v1/claims/{concept_path}/{predicate}/supersede` POST endpoint
- [x] DTOs: `CreateClaimRequest`, `CreateClaimResponse`, `AuthoredClaimDto`, `AuthoredValueDto` in `dto/stemedb_claims.rs`
- [x] Error handling: All handlers use `crate::error::Result<T>` with `ApiError` variants
- [x] State access: WAL append via `commit_buffer.append()`, queries via `store.scan_prefix()`
- [x] OpenAPI: All endpoints annotated with `#[utoipa::path]`
- [ ] Auth: Require API key in `Authorization: Bearer <token>` header (wiring needed)
- [ ] Tests: Integration tests for all endpoints with valid/invalid API keys

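
The still-open auth item (`Authorization: Bearer <token>`) comes down to a small parsing and comparison step. The sketch below uses only the standard library and is not the existing API key middleware: `AuthError`, `parse_bearer`, and `authorize` are hypothetical names.

```rust
/// Hypothetical outcomes of checking the Authorization header; a server
/// would typically map all three to 401 Unauthorized.
#[derive(Debug, PartialEq)]
pub enum AuthError {
    MissingHeader,
    Malformed,
    InvalidKey,
}

/// Extract the token from an `Authorization: Bearer <token>` header value.
pub fn parse_bearer(header_value: &str) -> Option<&str> {
    let token = header_value.strip_prefix("Bearer ")?.trim();
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}

/// Check an optional header value against the expected key.
/// Note: plain `==` is not a constant-time comparison; a production check
/// would more likely compare key hashes in constant time.
pub fn authorize(header: Option<&str>, expected_key: &str) -> Result<(), AuthError> {
    let value = header.ok_or(AuthError::MissingHeader)?;
    let token = parse_bearer(value).ok_or(AuthError::Malformed)?;
    if token == expected_key {
        Ok(())
    } else {
        Err(AuthError::InvalidKey)
    }
}
```

Keeping the three failure cases distinct internally makes the planned valid/invalid-key integration tests easier to write, even if the HTTP response is the same for all of them.
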
**Phase 3.2: HTTP Client Implementation** (3-4 days)
- [x] Remote module structure: `applications/aphoria/src/remote/{mod.rs,cache.rs,client.rs}` (created in Phase 3 prep)
- [ ] `RemoteClaimStore` struct with `reqwest` for API calls (partially implemented, needs testing)
- [ ] Implement `ClaimStore` trait over HTTP client (save_claim, load_claim, list_claims, etc.)
- [ ] Auth: Read API key from `$STEMEDB_API_KEY` env var, send in Authorization header
- [ ] Error handling: 401 Unauthorized → clear error message, 5xx → offline fallback
- [ ] Retries: Exponential backoff for transient errors (503, network timeouts)
- [ ] Tests: Unit tests with mock HTTP server

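
The error-handling and retry items above can be sketched as a pure decision function, independent of `reqwest`. Everything here is an assumption made for illustration: the `Action` enum, the 100 ms base delay, the 5 s cap, and the choice to retry only 503 before falling back.

```rust
use std::time::Duration;

// Assumed tuning values; the roadmap does not specify them.
const BASE_DELAY: Duration = Duration::from_millis(100);
const MAX_DELAY: Duration = Duration::from_secs(5);

/// Hypothetical next step for the client after one HTTP response.
#[derive(Debug, PartialEq)]
pub enum Action {
    Done,
    FailAuth,
    RetryAfter(Duration),
    Fallback,
    Unexpected(u16),
}

/// Exponential backoff: base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Map a status code to an action. `attempt` is zero-based.
pub fn next_action(status: u16, attempt: u32, max_attempts: u32) -> Action {
    match status {
        200..=299 => Action::Done,
        // 401 should surface a clear error rather than trigger a retry.
        401 => Action::FailAuth,
        // 503 is treated as transient while attempts remain.
        503 if attempt.saturating_add(1) < max_attempts => {
            Action::RetryAfter(backoff_delay(attempt, BASE_DELAY, MAX_DELAY))
        }
        // Any other 5xx, or 503 with retries exhausted: offline fallback.
        500..=599 => Action::Fallback,
        other => Action::Unexpected(other),
    }
}
```

Because the function never touches the network, it can be unit-tested without the mock HTTP server; network timeouts would need a separate branch keyed on the transport error rather than on a status code.
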
**Phase 3.3: Remote Mode CLI** (2 days)
- [ ] `aphoria init --remote <url>` writes `HostedConfig` to `.aphoria/config.toml`
- [ ] `--remote` flag validates URL format, tests connection (GET /health), writes config
- [ ] Config serialization: `hosted` section with `url`, `sync_mode: "remote_only"`, `api_key_env`
- [ ] `aphoria scan` detects remote mode, uses HTTP client instead of local StemeDB
- [ ] Tests: CLI test for `init --remote`, verify config file written correctly

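
As a rough picture of what `init --remote` has to do before touching the network, the sketch below validates the URL shape and renders a `hosted` section. The function names are hypothetical, the validation is deliberately minimal (scheme and non-empty host only), and the exact TOML layout is an assumption based on the three field names listed above.

```rust
/// Minimal URL shape check: http(s) scheme and a non-empty host.
/// A real implementation would more likely use a URL parsing crate.
pub fn validate_remote_url(url: &str) -> Result<(), String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .ok_or_else(|| format!("unsupported scheme in {url:?}; expected http:// or https://"))?;
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    if host.is_empty() || host.contains(char::is_whitespace) || url.contains('"') {
        return Err(format!("invalid host in {url:?}"));
    }
    Ok(())
}

/// Render the assumed `[hosted]` section for `.aphoria/config.toml`.
pub fn render_hosted_section(url: &str, api_key_env: &str) -> Result<String, String> {
    validate_remote_url(url)?;
    Ok(format!(
        "[hosted]\nurl = \"{url}\"\nsync_mode = \"remote_only\"\napi_key_env = \"{api_key_env}\"\n"
    ))
}
```

Only the name of the environment variable is written to the config file, never the key itself, which is consistent with the "env var only" mitigation in the risk table.
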
**Phase 3.4: Offline Fallback** (2 days)
- [ ] `OfflineFallback` config options: `error`, `warn`, `silent`
- [ ] When remote unreachable: respect fallback mode (error = fail scan, warn = log + continue, silent = no-op)
- [ ] Cache last-known remote state locally (optional): write claims to `.aphoria/cache.toml` on successful remote fetch
- [ ] On offline: use cached claims if available, otherwise apply fallback mode
- [ ] Tests: Integration test with unreachable remote URL, verify fallback behavior

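
The fallback order described above (remote first, then cache, then the configured mode) can be written as one small function. This is a sketch under assumptions: the enum here only mirrors the three option names, `ClaimSource` and `resolve_claims` are hypothetical, and claims are represented as plain strings.

```rust
/// Mirrors the three config options named in the roadmap.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OfflineFallback {
    Error,
    Warn,
    Silent,
}

/// Hypothetical result: where the claims for this scan came from.
#[derive(Debug, PartialEq)]
pub enum ClaimSource {
    Remote(Vec<String>),
    Cached(Vec<String>),
    /// No claims available; the scan continues without them.
    Skipped { warning: Option<String> },
}

/// Remote result first, then the local cache, then the fallback mode.
pub fn resolve_claims(
    remote: Result<Vec<String>, String>,
    cached: Option<Vec<String>>,
    mode: OfflineFallback,
) -> Result<ClaimSource, String> {
    let err = match remote {
        Ok(claims) => return Ok(ClaimSource::Remote(claims)),
        Err(e) => e,
    };
    if let Some(claims) = cached {
        return Ok(ClaimSource::Cached(claims));
    }
    let message = format!("remote StemeDB unreachable: {err}");
    match mode {
        // error = fail the scan
        OfflineFallback::Error => Err(message),
        // warn = surface a message and continue
        OfflineFallback::Warn => Ok(ClaimSource::Skipped { warning: Some(message) }),
        // silent = continue without comment
        OfflineFallback::Silent => Ok(ClaimSource::Skipped { warning: None }),
    }
}
```

In this reading the fallback mode only applies when no cache exists; whether a cache hit should also emit a warning is left open by the roadmap.
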
**Phase 3.5: Documentation & Migration** (2 days)
- [ ] Update `applications/aphoria/README.md` with remote mode setup
- [ ] New doc: `applications/aphoria/docs/remote-mode.md` (setup, auth, troubleshooting)
- [ ] Migration guide: TOML-only → remote StemeDB (backward compatible, no breaking changes)
- [ ] Example configs: remote-only, local-and-remote, offline-fallback modes
- [ ] Troubleshooting: auth errors, network issues, API version mismatches

#### Recent Progress (2026-02-13)

**Compilation Fix & API Foundation:**
Fixed all 24 compilation errors blocking Phase 3 implementation. Key changes:

**Wave 1: Type Fixes**
- Removed unused `use std::sync::Arc;` import
- Added `#[derive(utoipa::ToSchema)]` to all DTOs for OpenAPI support
- Fixed all 7 LifecycleStage mappings (`Draft` → `Proposed`, `Retired`/`Superseded` → `Deprecated`)

**Wave 2: Architecture Fixes**
- **Created** `crates/stemedb-api/src/dto/stemedb_claims.rs` with documented DTOs (proper separation of concerns)
- **Fixed WAL append pattern**: Replaced `state.engine.put()` with `serialize_assertion() → commit_buffer.append()` (5 locations)
- **Fixed query pattern**: Replaced non-existent `state.engine.query_by_subject*()` with direct `store.scan_prefix()` calls (3 locations)
- **Fixed error handling**: All handlers now return `crate::error::Result<T>` instead of custom error tuples
- **Added `explain_authority` field** to 3 ScanArgs initializations (Phase 2 integration)

**Files Changed:**
- **New:** `crates/stemedb-api/src/dto/stemedb_claims.rs` (~90 LOC with docs)
- **Modified:** `dto/mod.rs` (exports), `handlers/stemedb_claims.rs` (~400 LOC refactor), `handlers/aphoria/{claims.rs,scan.rs}` (explain_authority)

**Verification:**
```bash
cargo check --workspace # ✅ PASSES (was: 24 errors)
cargo clippy --workspace -- -D warnings # ✅ PASSES
```

**Next Steps:**
- Test API endpoints manually (Wave 3)
- Implement `aphoria init --remote` CLI command (Wave 4)
- Write comprehensive tests (Wave 5)
- Update documentation (Wave 6)

---

#### Success Criteria

**Must Have:**
- [ ] `aphoria init --remote <url>` writes HostedConfig and validates connection
- [ ] StemeDB API has `/claims/*` endpoints (create, list, fetch, update, supersede)
- [ ] `EpistemeClaimStore` works over HTTP (no local WAL/KV)
- [ ] Auth: API key in Authorization header, validated on server
- [ ] `aphoria scan` works end-to-end with remote StemeDB (same UX as local)

**Should Have:**
- [ ] Offline fallback: graceful degradation when remote unreachable
- [ ] Cache last-known state locally for offline use
- [ ] Documentation: remote mode setup, auth, troubleshooting
- [ ] Backward compatible: local mode still works (default behavior)

**Nice to Have:**
- [ ] Sync mode: LocalAndRemote (write local + remote)
- [ ] Migration tool: bulk import TOML → remote StemeDB
- [ ] Health check: `aphoria check-remote` validates connection + auth

#### Estimated Timeline

| Phase | Estimated | Dependencies |
|-------|-----------|--------------|
| 3.1: StemeDB API Endpoints | 3-4 days | None |
| 3.2: HTTP Client | 3-4 days | 3.1 complete |
| 3.3: Remote Mode CLI | 2 days | 3.2 complete |
| 3.4: Offline Fallback | 2 days | 3.2 complete |
| 3.5: Documentation | 2 days | 3.1-3.4 complete |
| **Total** | **2-3 weeks** | ~15 business days |

**Note:** Phase 3 is now MUCH simpler than originally planned:
- No sync infrastructure (direct HTTP, not push/pull)
- No multi-agent convergence (deferred to future phase)
- No tier escalation (deferred to future phase)
- Focus: Enable remote mode, validate it works

#### Risks & Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| API performance (HTTP overhead) | Medium | Medium | Batch operations, HTTP/2, connection pooling |
| Auth token leakage | Low | High | Env var only, never log token, validate on server |
| Network unreliability | High | Low | Offline fallback with cached state |
| Breaking local workflow | Low | High | Local mode is default, remote is opt-in |

---

## Gap Closure Phase 4: Claim Discovery & Manual Convergence (FUTURE)

> **Goal:** Help developers discover org patterns and make informed convergence decisions.
> **Why:** Remote mode enables sharing, but developers need visibility into what exists.
> **User Story:** "Show me what claims exist for this module so I can decide if I should align"

**Dependencies:**
- Phase 3 complete (remote mode working)
- Developers are authoring claims in shared StemeDB

**Key Workflows:**

**1. Claim Discovery** — "What patterns exist for this concept?"
```bash
# Query org claims for a specific concept
aphoria claims search --concept-path "*/imports/tokio"
aphoria claims search --category architecture --tier clinical

# Output shows:
# - 3 claims from other teams
# - Tier 1 (RFC): "Core MUST NOT import tokio"
# - Tier 3 (Expert): "Web services MAY import tokio" (15 projects)
# - Tier 4 (Community): "CLI tools SHOULD import tokio" (5 projects)
```

**2. Convergence Decision** — "Should I align with this pattern?"
- Developer sees Tier 1 spec → follows it (authority)
- Developer sees popular Tier 3 pattern (15 projects) → can choose to converge
- Decision is MANUAL, not automatic

**3. Manual Promotion** — "My Tier 3 claim has Tier 1 evidence"
```bash
# Developer finds RFC backing for their code claim
aphoria claims promote <claim-id> \
  --to-tier clinical \
  --evidence "RFC 7519 Section 4.1.3" \
  --reason "RFC explicitly requires audience validation"

# System validates:
# - New tier is higher authority than current
# - Evidence is provided
# - Provenance chain preserved
```

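
The three validation rules for manual promotion can be sketched as a single function. None of this exists yet: `Claim`, `PromoteError`, and `promote` are hypothetical, tiers are plain numbers where lower means higher authority, and provenance is modelled as a back-reference to the original claim id.

```rust
/// Hypothetical, minimal claim record for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub id: String,
    pub tier: u8, // lower number = higher authority
    pub evidence: Vec<String>,
    pub promoted_from: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum PromoteError {
    NotHigherAuthority { current: u8, requested: u8 },
    MissingEvidence,
    MissingReason,
}

/// Build the promoted claim, or explain why promotion is refused.
pub fn promote(
    original: &Claim,
    new_id: &str,
    to_tier: u8,
    evidence: &str,
    reason: &str,
) -> Result<Claim, PromoteError> {
    // Rule 1: the new tier must be strictly higher authority.
    if to_tier >= original.tier {
        return Err(PromoteError::NotHigherAuthority {
            current: original.tier,
            requested: to_tier,
        });
    }
    // Rule 2: evidence (and a reason for the audit trail) must be given.
    if evidence.trim().is_empty() {
        return Err(PromoteError::MissingEvidence);
    }
    if reason.trim().is_empty() {
        return Err(PromoteError::MissingReason);
    }
    // Rule 3: provenance is preserved by referencing the original claim.
    let mut evidence_list = original.evidence.clone();
    evidence_list.push(evidence.trim().to_string());
    Ok(Claim {
        id: new_id.to_string(),
        tier: to_tier,
        evidence: evidence_list,
        promoted_from: Some(original.id.clone()),
    })
}
```

Returning a new claim instead of mutating the original keeps the earlier record intact, which is one straightforward way to preserve the provenance chain.
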
**4. Pattern Popularity** — "How widely adopted is this claim?"
```bash
aphoria claims stats <claim-id>
# Shows:
# - 23 projects have similar claim (same concept_path+predicate)
# - 18 at Tier 3 (Expert)
# - 5 at Tier 4 (Community)
# - Suggests: "This pattern is widely adopted. Consider aligning."
```

#### Phase 4 Breakdown (Future — After Phase 3)

**Phase 4.1: Claim Search & Discovery** (1 week)
- [ ] `aphoria claims search` CLI with filters (concept, category, tier, status)
- [ ] Query remote StemeDB via `/claims?filters=...` API
- [ ] Display: claim summary, tier, provenance, evidence, adoption count
- [ ] Similarity matching: "Show claims similar to my local code"

**Phase 4.2: Convergence Suggestions** (1 week)
- [ ] `aphoria scan --suggest-convergence` compares local code to remote claims
- [ ] Highlights: "Your code imports tokio, but org spec says MUST NOT"
- [ ] Highlights: "15 projects use pattern X, you use pattern Y — consider aligning?"
- [ ] Developer decides: align code, create counter-claim, or ignore

**Phase 4.3: Manual Promotion Workflow** (1 week)
- [ ] `aphoria claims promote` CLI command with validation
- [ ] Requires: target tier, evidence, reason
- [ ] Preserves provenance: promoted claim references original
- [ ] Audit log: who promoted, when, why

**Phase 4.4: Adoption Metrics** (1 week)
- [ ] `aphoria claims stats` shows adoption counts per claim
- [ ] Query: "How many projects have claims with this concept_path?"
- [ ] Dashboard: most popular patterns by category/tier
- [ ] Helps identify: emerging conventions, dead patterns

**Estimated Timeline:** 4 weeks (1 week per sub-phase)

**Success Criteria:**
- Developer can discover org patterns relevant to their code
- Developer can see tier + adoption count for each pattern
- Developer can manually promote claims with evidence
- Developer can choose to align code with popular/authoritative patterns

**What This Enables:**
- Organic convergence driven by inspection, not automation
- Authority-aware decision-making (Tier 1 spec > Tier 3 code claim)
- Popularity-aware decision-making (15 projects use X → maybe I should too)
- Manual promotion path (Tier 3 → Tier 1 when backed by RFC)

---

## Pilot 5: Operational Readiness

> **Goal:** Complete production readiness for enterprise pilot demo.
> **Context:** Pilot 1-4 complete (see [archive](./roadmap-archive.md)).
> **Target:** 4-6 weeks to ship-ready state

### Enterprise Readiness: Deployment Stages

| Stage | Requirements | Timeline | Customer Profile |
|-------|--------------|----------|------------------|
| **MVP Pilot** | P5.1 Security + P5.2 Monitoring + P5.3 Backup | ✅ Ready | Friendly pilot, tolerates manual ops |
| **Production** | MVP + P5.4 Runbooks + P5.5 CLI | 4 weeks | First paying customer, self-hosted |
| **Scale** | Production + Phase 8B-C | 8-10 weeks | 5-10 customers, automated operations |
| **Enterprise** | Scale + Phase 9 | 6+ months | 50+ customers, SOC2/compliance required |

### Critical Path to Ship (Must-Have)

**WEEK 1 - Security (P0 Blockers):**
- TLS/HTTPS, request size limits, timeouts, secret sanitization, rate limiting

**WEEK 2 - Monitoring (P0 Blind without these):**
- Storage metrics, replication metrics, Grafana dashboards, alert rules

**WEEK 3 - Backup & DR (P0 Data loss risk):**
- Automated backup, backup verification, WAL archival, DR runbook, operational runbooks

**WEEK 4 - Deployment (P1 Customer enablement):**
- CLI tooling, reference architecture, deployment guides, pilot validation

### P5.1 Security Hardening (WEEK 1 - SHIP BLOCKERS)

**Priority: P0 - Cannot ship without these**
**Status: 🎯 4/5 Complete** (TLS, Limits, Timeouts, Rate Limiting done; Secret Sanitization pending)

- [x] **TLS/HTTPS Configuration** (Partial - 2024-02-11)
  - [x] Add TLS 1.3 to stemedb-api (axum-server with rustls) - `main.rs:114-123`
  - [x] Load from env vars: `STEMEDB_TLS_CERT_PATH` / `STEMEDB_TLS_KEY_PATH`
  - [ ] HTTP → HTTPS redirect (deferred - not critical for pilot)
  - [ ] Let's Encrypt integration for pilot deployments (deferred - manual cert setup OK)
  - [ ] Certificate rotation documentation (deferred)
  - [ ] Test with self-signed certs in CI (deferred - Layer 4 tests)

- [x] **Request Size Limits** (Complete - 2024-02-11)
  - [x] Add `RequestBodyLimitLayer` to write endpoints (1MB default) - `routers.rs:371`
  - [x] Add `RequestBodyLimitLayer` to read endpoints (64KB default) - `routers.rs:400`
  - [x] Make limits configurable: `STEMEDB_WRITE_BODY_LIMIT` / `STEMEDB_READ_BODY_LIMIT`
  - [x] Created `SecurityConfig` struct with defaults - `routers.rs:35-56`
  - [x] Updated all 8 `create_router_*` functions to accept config
  - [x] Documented in `.env.example`
  - [ ] Document limits in OpenAPI spec (deferred - not critical)

- [x] **Timeout Configuration** (Complete - 2024-02-11)
  - [x] Add `TimeoutLayer` to HTTP routes (configurable, default 30s) - `routers.rs:115,143,199,etc`
  - [x] Wrap all `store.get()/put()` with `tokio::time::timeout(5s)` - `store_helpers.rs`
  - [x] Added timeout helpers: `store_get_with_timeout()` / `store_put_with_timeout()`
  - [x] Updated 6+ handler locations (source.rs, health.rs, report.rs, source_registry/handlers.rs)
  - [x] Add timeout metrics: `stemedb_operation_timeouts_total{operation="store_get|store_put"}`
  - [x] Make HTTP timeout configurable: `STEMEDB_HTTP_TIMEOUT_SECS`
  - [x] Added `ApiError::Timeout` variant with 408 REQUEST_TIMEOUT status - `error.rs:76-80`

- [ ] **Secret Sanitization** (Deferred - not blocking for pilot)
  - [ ] Remove API key logging from `api_key.rs:271` (log hash, not prefix)
  - [ ] Audit all `debug!`/`info!` for credential leaks
  - [ ] Add test: `cargo test -- --nocapture | grep -E "key|secret|password"` must find no matches (`grep` exits non-zero)
  - **Note:** Existing code already logs hashes; an audit is still needed to confirm there are no leaks

- [x] **Rate Limiting** (Complete - 2024-02-11)
  - [x] Rate limit `/v1/health` to 1 req/sec per IP (prevent metrics flooding) - `routers.rs:352`
  - [x] Make configurable: `STEMEDB_HEALTH_RATE_LIMIT` (default: 1)
  - [x] Uses `RateLimitState` and `rate_limit_middleware` - `middleware/rate_limit.rs`
  - [x] Metric already exists: `stemedb_rate_limit_rejections_total{endpoint}` - `rate_limit.rs:87`

**Implementation Notes:**
- All security features are now **configurable via environment variables** with sensible defaults
- Build succeeds, all features tested manually
- Integration tests stubbed in `tests/security_hardening.rs` (21 tests marked `#[ignore]`)
- Secret sanitization deferred as existing code appears safe (uses hashes), but full audit recommended

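
The "configurable via environment variables with sensible defaults" pattern can be sketched as follows. This is not the `SecurityConfig` in `routers.rs`: `SecurityLimits` and its fields are hypothetical, the byte values assume binary units for "1MB" and "64KB", and values are assumed to be plain integers. Only the variable names and the default magnitudes come from the checklist above.

```rust
use std::time::Duration;

/// Hypothetical subset of the security settings, with assumed defaults.
#[derive(Debug, Clone, PartialEq)]
pub struct SecurityLimits {
    pub write_body_limit: usize,
    pub read_body_limit: usize,
    pub http_timeout: Duration,
}

impl Default for SecurityLimits {
    fn default() -> Self {
        Self {
            write_body_limit: 1024 * 1024, // "1MB", assuming binary units
            read_body_limit: 64 * 1024,    // "64KB", assuming binary units
            http_timeout: Duration::from_secs(30),
        }
    }
}

/// Parse a raw value, falling back to the default when missing or invalid.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

impl SecurityLimits {
    /// Build from any key lookup, which keeps the logic testable without
    /// mutating the process environment.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let d = Self::default();
        Self {
            write_body_limit: parse_or(lookup("STEMEDB_WRITE_BODY_LIMIT"), d.write_body_limit),
            read_body_limit: parse_or(lookup("STEMEDB_READ_BODY_LIMIT"), d.read_body_limit),
            http_timeout: Duration::from_secs(parse_or(
                lookup("STEMEDB_HTTP_TIMEOUT_SECS"),
                d.http_timeout.as_secs(),
            )),
        }
    }

    /// Read the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

In this sketch an unparsable value silently falls back to the default; a stricter variant could refuse to start instead, which may be preferable for security settings.
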
### P5.2 Monitoring Foundation (WEEK 2 - CRITICAL) ✅ COMPLETE

**Priority: P0 - Flying blind without these**
**Status: ✅ Complete** (All layers implemented: WAL metrics, storage metrics, HTTP SLI, error tracking, Grafana dashboards, Prometheus alerts, runbooks, validation scripts)
**Implementation:** [P5.2-IMPLEMENTATION-SUMMARY.md](./P5.2-IMPLEMENTATION-SUMMARY.md)

- [x] **Storage Health Metrics** (Complete - 2024-02-11)
  - [x] `stemedb_wal_fsync_latency_seconds` histogram (p50/p95/p99) - `journal.rs:34`
  - [x] `stemedb_wal_write_errors_total{error}` counter - `journal.rs:46`
  - [x] `stemedb_wal_disk_usage_bytes` gauge - `segment.rs:248`
  - [x] `stemedb_wal_segments_count` gauge - `segment.rs:249`
  - [x] `stemedb_wal_bytes_written_total` counter - `journal.rs:45`
  - [x] `stemedb_wal_writes_total` counter - `journal.rs:44`
  - [x] `stemedb_wal_batch_size` histogram - `group_commit.rs:201`
  - [x] `stemedb_wal_flush_latency_seconds` histogram - `group_commit.rs:243`
  - [x] `stemedb_wal_recovery_attempts_total` counter - `journal.rs:234`
  - [x] `stemedb_wal_recovery_duration_seconds` histogram - `journal.rs:269`
  - [x] `stemedb_wal_rotations_total` counter - `journal.rs:304`

- [x] **Storage Operation Metrics** (Complete - 2024-02-11)
  - [x] `stemedb_storage_operation_duration_seconds{operation,backend}` histogram - `hybrid_backend.rs:118,138,158,180`
  - [x] `stemedb_storage_operations_total{operation,backend}` counter - `hybrid_backend.rs:123,143,163,185`
  - [x] `stemedb_index_lookup_duration_seconds{index}` histogram - `index_store.rs:212,235`
  - [x] Metrics added to: get(), put(), delete(), scan_prefix(), index lookups

- [x] **Error Tracking** (Complete - 2024-02-11)
  - [x] `stemedb_errors_total{type,layer}` counter - `error.rs:99`
  - [x] Tracks 15 error types across 6 layers (validation, api, storage, pipeline, auth, protection)
  - [x] Integrated into `ApiError::IntoResponse` for automatic tracking

- [x] **HTTP SLI Metrics** (Complete - 2024-02-12)
|
|
- [x] Pattern implemented in `handlers/vote.rs` as reference
|
|
- [x] `stemedb_http_requests_total{method,path}` counter
|
|
- [x] `stemedb_http_request_duration_seconds{method,path,status}` histogram
|
|
- [x] Rollout complete: 19 handlers instrumented (supersede, epoch, source, admin, escalation, gold_standard, quarantine, circuit_breaker, api_keys, audit, concepts)
|
|
- [x] Total coverage: 20 handlers across 11 files
|
|
|
|
- [x] **Grafana Dashboards** (Complete - 2024-02-11)
|
|
- [x] `storage-health.json` - WAL fsync latency, disk usage, error rates, storage operations, index timing
|
|
- [x] `cluster-overview.json` - Node status, replication lag, sync ops, Merkle diffs, gossip
|
|
- [x] `sli-dashboard.json` - Request rate, latency heatmap, error rate, availability gauge, circuit breakers
|
|
- [x] Import guide with troubleshooting: [docs/operations/monitoring/grafana/README.md](./docs/operations/monitoring/grafana/README.md)
|
|
|
|
- [x] **Prometheus Alert Rules** (Complete - 2024-02-11)
  - [x] `alerts/critical.yml` - 8 alerts (API down, disk >90%, replication lag >5min, storage errors, fsync failure, split brain, memory exhaustion, cert expiring)
  - [x] `alerts/warning.yml` - 10 alerts (including slow fsync, high error rate, slow indexes, disk >70%, lag >1min, high latency, compaction backlog, circuit breaker, trust rank decay)
  - [x] `alerts/info.yml` - 9 alerts (circuit breaker open, quarantine backlog, node join, memory >70%, key rotation, gold standard count, cert 30 days, WAL segments, low traffic)
  - [x] All alerts include: runbook links, impact description, action steps, `for` duration, labels

- [x] **Alerting Integration** (Complete - 2024-02-11)
  - [x] PagerDuty configuration with 4-level escalation - [docs/operations/monitoring/alerting/pagerduty-config.yml](./docs/operations/monitoring/alerting/pagerduty-config.yml)
  - [x] Slack integration for 3 channels (critical/warning/info) - [docs/operations/monitoring/alerting/slack-config.yml](./docs/operations/monitoring/alerting/slack-config.yml)
  - [x] Escalation policy with response times, contact info, post-mortem template - [docs/operations/monitoring/alerting/escalation-policy.md](./docs/operations/monitoring/alerting/escalation-policy.md)
  - [x] Inhibition rules to prevent alert spam
  - [x] Workflow integration examples (incident channel creation, resolution tracking)

- [x] **Additional Runbooks** (Complete - 2024-02-12)
  - [x] 8 critical/warning runbooks created in `docs/operations/runbooks/`
  - [x] Coverage: high-replication-lag, storage-errors, wal-fsync-failure, split-brain, memory-exhaustion, certificate-renewal, slow-fsync, high-error-rate
  - [x] Each includes: Severity, Symptom, Impact, Investigation, Resolution, Prevention, Escalation, References

- [x] **Validation Scripts** (Complete - 2024-02-12)
  - [x] `scripts/setup-pagerduty.sh` - Service key validation, test incident creation, escalation policy check
  - [x] `scripts/setup-slack.sh` - Webhook validation, test message posting, formatting verification
  - [x] `scripts/test-alerting.sh` - End-to-end test (Alertmanager → PagerDuty + Slack), latency measurement

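
The `stemedb_errors_total{type,layer}` counter listed under Error Tracking can be pictured as one monotonically increasing series per label pair. The following is a minimal, std-only sketch of that idea, not the code in `error.rs`: the `ErrorCounter` type and the label values `not_found` and `invalid_signature` are hypothetical, and a real service would more likely use a metrics library than a hand-rolled map.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory counter keyed by (type, layer) label values.
#[derive(Default)]
struct ErrorCounter {
    counts: BTreeMap<(String, String), u64>,
}

impl ErrorCounter {
    /// Increment the series for one (type, layer) pair.
    fn inc(&mut self, error_type: &str, layer: &str) {
        *self
            .counts
            .entry((error_type.to_string(), layer.to_string()))
            .or_insert(0) += 1;
    }

    /// Render all series in the Prometheus text exposition format.
    fn render(&self) -> String {
        let mut out = String::from("# TYPE stemedb_errors_total counter\n");
        for ((error_type, layer), value) in &self.counts {
            out.push_str(&format!(
                "stemedb_errors_total{{type=\"{}\",layer=\"{}\"}} {}\n",
                error_type, layer, value
            ));
        }
        out
    }
}

fn main() {
    let mut errors = ErrorCounter::default();
    // Label values below are illustrative, not taken from the codebase.
    errors.inc("not_found", "storage");
    errors.inc("not_found", "storage");
    errors.inc("invalid_signature", "auth");
    print!("{}", errors.render());
}
```

Keeping the label set small (a fixed list of types and layers) is what keeps the number of series bounded.
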
### P5.3 Backup & Disaster Recovery (WEEK 3 - CRITICAL) ✅ COMPLETE

**Priority: P0 - Data loss risk without these**
**Completed:** 2026-02-12

- [x] **Automated Backup**
  - [x] Systemd timer: runs every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
  - [x] Systemd service: `stemedb-backup.service` with retry logic
  - [x] Backup retention policy: `--keep-last` flag with 30-day default
  - [x] S3 upload integration: `--upload-s3` flag with STANDARD_IA storage

- [x] **Backup Verification**
  - [x] `verify-backup.sh` - Validates magic bytes, CRC32C, BLAKE3 checksums
  - [x] Weekly verification timer: Sunday 03:00 UTC
  - [x] Metrics: `stemedb_backup_verification_status`, `stemedb_backup_verification_checks_passed`
  - [x] Alert on verification failure: Prometheus alert rule

- [x] **WAL Archival**
  - [x] `archive-wal-to-s3.sh` - Ships WAL segments to S3 every 15 minutes
  - [x] S3 bucket: `stemedb-backups-{env}/wal-archive/`
  - [x] Retention: 30 days in S3 STANDARD_IA
  - [x] Metrics: `stemedb_wal_archival_lag_seconds`, `stemedb_wal_archival_segments_uploaded_total`

- [x] **Disaster Recovery Runbook**
  - [x] `docs/operations/runbooks/disaster-recovery.md` - Complete DR procedures
  - [x] RTO target: 4 hours (validated via drill script)
  - [x] RPO target: 15 minutes (achievable with WAL archival)
  - [x] 3 recovery scenarios: Full restore, Point-in-time, WAL-only
  - [x] Validation checklist: 9 verification steps

- [x] **DR Drill**
  - [x] `scripts/dr-drill.sh` - Automated drill with RTO/RPO measurement
  - [x] Report generation: markdown format with timeline, metrics, issues
  - [x] Integration tests: `uat/production-readiness/backup-dr-tests.sh` (7 tests)

**Deliverables:**
- 6 systemd units: 3 timers + 3 services (backup, verify, archive-wal)
- 4 scripts: backup, verify, archive-wal, dr-drill
- Prometheus alerts: 9 alert rules in `backup-alerts.yml`
- DR runbook: 3 recovery scenarios + validation checklist
- Integration tests: 7 tests covering all P5.3 components

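
The RTO and RPO targets in the DR runbook items (4 hours and 15 minutes) reduce to two duration comparisons once a drill has been timed. The sketch below shows that check in isolation. It assumes a hypothetical `DrillResult` summary with illustrative field names, and it is not the logic of `scripts/dr-drill.sh`.

```rust
use std::time::Duration;

/// Targets stated in the DR runbook items above.
const RTO_TARGET: Duration = Duration::from_secs(4 * 60 * 60); // 4 hours
const RPO_TARGET: Duration = Duration::from_secs(15 * 60); // 15 minutes

/// Hypothetical summary of one drill run (field names are illustrative).
struct DrillResult {
    /// Time from declaring the disaster to passing the validation checklist.
    restore_duration: Duration,
    /// Gap between the last archived WAL segment and the simulated failure.
    data_loss_window: Duration,
}

impl DrillResult {
    /// True when the restore finished within the recovery time objective.
    fn meets_rto(&self) -> bool {
        self.restore_duration <= RTO_TARGET
    }

    /// True when the unrecoverable window is within the recovery point objective.
    fn meets_rpo(&self) -> bool {
        self.data_loss_window <= RPO_TARGET
    }
}

fn main() {
    // Example numbers are made up for illustration.
    let drill = DrillResult {
        restore_duration: Duration::from_secs(3 * 60 * 60 + 10 * 60),
        data_loss_window: Duration::from_secs(12 * 60),
    };
    println!("RTO met: {}", drill.meets_rto());
    println!("RPO met: {}", drill.meets_rpo());
}
```

With WAL segments shipped every 15 minutes, the data-loss window stays within the RPO only while archival itself is not lagging, which is presumably why `stemedb_wal_archival_lag_seconds` is tracked.
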
### P5.4 Operational Runbooks (WEEK 3 - CRITICAL) ✅ COMPLETE

**Priority: P1 - 2am incidents require these**

- [x] **Critical Runbooks** (created in `docs/operations/runbooks/`)
  - [x] `server-wont-start.md` - Port conflicts, TLS cert issues, disk full, WAL corruption
  - [x] `high-query-latency.md` - Check replication lag, shard hotspots, index health
  - [x] `restore-from-backup.md` - Step-by-step restore procedure with validation
  - [x] `add-node.md` - Node join procedure, shard rebalancing, validation
  - [x] `disk-full.md` - Emergency WAL cleanup, compaction trigger, quota increase
  - [x] `circuit-breaker-stuck.md` - Reset circuit breaker, identify root cause
  - [x] `quarantine-overflow.md` - Investigate quarantine queue, batch approve/reject

- [x] **Troubleshooting Decision Tree**
  - [x] `docs/operations/troubleshooting-flowchart.md` - Complete with symptom → cause → runbook mapping
  - [x] Covers all 7 runbooks with decision trees and quick diagnostic commands

### P5.5 Cluster Management Tooling (WEEK 4 - HIGH PRIORITY) ✅ COMPLETE

**Priority: P1 - Manual SSH not scalable**
**Completed:** 2026-02-12

- [x] **`stemedb-admin` CLI** (new binary in `crates/stemedb-admin/`)
  - [x] `stemedb-admin cluster status` - Overview: node count, shard count, meta version, node table
  - [x] `stemedb-admin cluster health` - Quick health check (exit code 0/1)
  - [x] `stemedb-admin node list` - List all nodes with states (Alive/Suspect/Dead)
  - [x] `stemedb-admin node <id> info` - Detailed node info with shard assignments
  - [x] `stemedb-admin node <id> shards` - Show shards assigned to node (with --leader filter)
  - [x] `stemedb-admin shard list` - List all shards with leaders/replicas
  - [x] `stemedb-admin shard <id> info` - Detailed shard info (size, assertions, replicas)
  - [x] `stemedb-admin shard <id> replicas` - Show replica nodes for shard
  - [x] `stemedb-admin debug export --output <file>` - Export complete cluster state as JSON
  - [x] HTTP client connecting to gateway (default: http://localhost:18181)
  - [x] Output formats: Table (human-friendly with colors) and JSON (machine-readable)
  - [x] Environment variable support: `STEMEDB_GATEWAY_ADDR`
  - [x] Proper error handling with helpful messages (no panics)
  - [x] 12 integration tests covering all functionality
  - [x] Node lifecycle documentation: `docs/operations/node-lifecycle.md`
  - [x] Installation guide: `docs/operations/deployment/install-admin-cli.md`

**Phase 2 Deferred:**
- [ ] `stemedb-admin node drain <id>` - Graceful node removal (requires gateway endpoints)
- [ ] `stemedb-admin shard rebalance` - Manual rebalancing trigger (requires gateway endpoints)

- [ ] **Node Operations Documentation**
  - [ ] `docs/operations/node-lifecycle.md`
  - [ ] Add node procedure (pre-flight checks, join, validation)
  - [ ] Remove node procedure (drain, graceful leave, verification)
  - [ ] Replace node procedure (dead node replacement, shard recovery)

- [ ] **Shard Management** (optional for pilot, defer if time-constrained)
  - [ ] `stemedb-admin shard rebalance` - Manual rebalancing trigger
  - [ ] `stemedb-admin shard freeze` - Disable auto-split during maintenance
  - [ ] `stemedb-admin shard move <shard-id> <target-node>` - Manual migration

### P5.6 Reference Architecture (WEEK 4) ✅ COMPLETE

**Priority: P1 - Customer deployment guide**

- [x] **Deployment Guides** (created in `docs/operations/reference-architecture/`)
  - [x] `single-node-pilot.md` - Pilot deployment (1 node, docker-compose, hardware specs)
  - [x] `three-node-cluster.md` - Small production (3 nodes, replication factor 2, HA)
  - [x] `network-requirements.md` - Port list (181XX), firewall rules, TLS, DNS setup

- [x] **Infrastructure as Code Examples** (created in `docs/operations/deployment/`)
  - [x] `docker-compose/pilot-with-monitoring.yml` - Single-node with Grafana + Prometheus
  - [x] `nginx/stemedb.conf` - TLS 1.3, rate limiting, security headers, admin restrictions
  - [x] `envoy/stemedb.yaml` - Load balancing, health checks, circuit breakers, retries
  - [ ] `kubernetes/` - K8s manifests (StatefulSet, Service, Ingress) [DEFERRED - not needed for pilot]
  - [ ] `terraform/` - AWS deployment (EC2, EBS, ALB, S3) [DEFERRED - not needed for pilot]

- [x] **Resource Sizing Guide**
  - [x] `docs/operations/reference-architecture/resource-sizing.md` - Complete with CPU/RAM/disk formulas
  - [x] Quick reference table: <10K, <50K, <100K, <500K, <1M assertions
  - [x] AWS/GCP/Azure instance recommendations
  - [x] Capacity planning metrics and monitoring dashboard

- [x] **Reverse Proxy Configuration**
  - [x] `nginx/stemedb.conf` - TLS termination with Let's Encrypt, rate limiting, admin restrictions
  - [x] `envoy/stemedb.yaml` - Advanced load balancing, circuit breakers, health checks
  - [x] Let's Encrypt automation examples (certbot + cron)

### P5.7 Pilot Success Validation (WEEK 4) ✅ COMPLETE

**Priority: P1 - Definition of done**

- [x] **Performance Benchmarks** - Documented in `docs/operations/pilot-success-criteria.md`
  - [x] Sub-second query latency: p99 <1s at 10K assertions (test procedure included)
  - [x] Ingest throughput: 1K assertions/sec sustained (5 min load test script)
  - [x] Replication lag <1 second under normal load (cluster validation)

- [x] **Functional Validation** - Documented in `docs/operations/pilot-success-criteria.md`
  - [x] Conflict detection: ConflictLens score >0.5 on contradictions (test procedure)
  - [x] Audit trail export: 100 assertions with signatures/provenance (validation script)
  - [x] Source retraction cascade: 110+ dependents (CARDIOVASC_MEGA_TRIAL example)

- [x] **Operational Validation** - Documented in `docs/operations/pilot-success-criteria.md`
  - [x] Backup/restore roundtrip: 10K assertions → backup → restore → verify (procedure)
  - [x] Node failure recovery: Kill node → continue → re-replicate <5min (3-node test)
  - [x] Rolling restart: Restart one-by-one during load test → 100% success (procedure)

- [x] **Demo Validation: 5 Amazement Moments** - All documented with test procedures
  - [x] Moment 1: Conflicting claims (FDA 0.2% vs Anecdotal 12%)
  - [x] Moment 2: Source retraction cascade (110 assertions flagged)
  - [x] Moment 3: Audit trail (provenance chain to source)
  - [x] Moment 4: Time-travel (query 2023 vs 2025)
  - [x] Moment 5: Lens-based resolution (3 lenses → 3 winners)

---

## Phase 8B-C: Production Scale & Observability

> **Prerequisite:** Pilot 5 complete, 1-2 production customers running
> **Timeline:** 4-6 weeks after Pilot 5

### 8B. Advanced Observability

- [ ] **8B.1 Distributed Tracing**
  - [ ] OpenTelemetry integration (Jaeger or Tempo backend)
  - [ ] Trace write path: Gateway → Shard Leader → Followers → WAL
  - [ ] Trace sync path: Merkle diff → Fetch missing → CRDT merge
  - [ ] Add trace IDs to all log lines (`trace_id` field)

- [ ] **8B.2 Capacity Planning Metrics**
  - [ ] `disk_growth_rate_bytes_per_day` (7-day linear regression)
  - [ ] `disk_days_until_full` (projected based on growth rate)
  - [ ] `assertion_ingestion_rate` (assertions/sec, 24h moving average)
  - [ ] Dashboard: Capacity trends with projected full date

- [ ] **8B.3 Performance Profiling**
  - [ ] Continuous profiling (pprof/flamegraph integration)
  - [ ] Per-shard query latency breakdown
  - [ ] Hot subject/predicate detection
  - [ ] Slow query log (queries >100ms)

- [ ] **8B.4 Advanced Dashboards**
  - [ ] `query-performance.json` - Latency by lens, hot subjects, cache hit rate
  - [ ] `write-pipeline.json` - Ingest rate, WAL throughput, sync lag
  - [ ] `capacity-planning.json` - Growth trends, disk projections, resource utilization

- [ ] **8B.5 Synthetic Backup Validation**
  - [ ] Monthly automated DR drill: Restore to staging environment, run smoke tests, tear down
  - [ ] Smoke test suite: Health check, query test, ingest test, export test
  - [ ] Prometheus metrics: `stemedb_last_verified_restore_timestamp`, `stemedb_restore_validation_duration_seconds`
  - [ ] Alert: No successful restore verification in 30 days (critical)
  - [ ] Integration with CI/CD: Run on staging before production deploys
  - [ ] Report generation: Pass/fail status, RTO measurement, issues encountered

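
The two disk metrics planned in 8B.2 follow from an ordinary least-squares fit: the slope of used bytes over time gives the growth rate, and the remaining capacity divided by that slope gives the days until full. The sketch below works under those assumptions. The function names and sample values are illustrative, and samples are `(day, bytes_used)` pairs.

```rust
/// Least-squares slope of bytes used over time, in bytes per day.
/// Returns None with fewer than two samples or when all samples share one day.
fn growth_rate_bytes_per_day(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|(x, _)| x).sum::<f64>() / n;
    let mean_y = samples.iter().map(|(_, y)| y).sum::<f64>() / n;
    let mut cov = 0.0;
    let mut var = 0.0;
    for (x, y) in samples {
        cov += (x - mean_x) * (y - mean_y);
        var += (x - mean_x) * (x - mean_x);
    }
    if var == 0.0 {
        None
    } else {
        Some(cov / var)
    }
}

/// Projected days until the disk is full; None when usage is flat or shrinking.
fn days_until_full(capacity_bytes: f64, used_bytes: f64, rate_per_day: f64) -> Option<f64> {
    if rate_per_day <= 0.0 {
        None
    } else {
        Some(((capacity_bytes - used_bytes) / rate_per_day).max(0.0))
    }
}

fn main() {
    const GB: f64 = 1e9;
    // Seven daily samples growing by 2 GB per day (made-up numbers).
    let samples: Vec<(f64, f64)> = (0..7)
        .map(|day| (day as f64, (100.0 + 2.0 * day as f64) * GB))
        .collect();
    let rate = growth_rate_bytes_per_day(&samples).expect("need at least two distinct days");
    let used_now = samples.last().unwrap().1;
    let days = days_until_full(200.0 * GB, used_now, rate);
    println!("growth: {:.2} GB/day, days until full: {:?}", rate / GB, days);
}
```

Returning `None` for flat or negative growth avoids exporting a misleading "infinite" or negative projection.
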
### 8C. Production Hardening

- [ ] **8C.1 Point-in-Time Recovery (PITR)**
  - [x] WAL segment archival to S3 (every 15 min or 100 MB) [DONE via P5.3 WAL archival]
  - [ ] Recovery target parsing (`--target lsn:123456`, `--target 2026-02-11T14:25:00`)
  - [ ] WAL replay engine with target cutoff (stop at specified LSN/timestamp)
  - [ ] Checksum validation during replay (detect corrupted segments)
  - [ ] Test: Inject corruption at known LSN, restore to LSN-1, verify consistency

- [ ] **8C.1B Cluster-Aware Backup Coordination** (SHIP BLOCKER for 3-node production cluster)
  - [ ] Design decision: Leader-based vs. distributed backup strategy
  - [ ] Coordinated checkpoint API: `POST /v1/admin/cluster/checkpoint` (quorum-wide pause)
  - [ ] Merkle tree state validation: Ensure all nodes at same tree version before backup
  - [ ] Cluster backup workflow: Tag backups with cluster-wide timestamp + Merkle hash
  - [ ] Cluster restore procedure: Rebuild 3-node cluster from matching-tag backups
  - [ ] Split-brain prevention: Validate Merkle tree match after restore (reject divergent nodes)
  - [ ] Replication lag handling: Wait for lag <5s before checkpoint, or fail
  - [ ] Integration test: 3-node cluster backup → full cluster restore → verify consistency
  - [ ] Documentation: `docs/operations/cluster-backup-restore.md` with runbook

- [ ] **8C.2 Online Backup (Hot Backup)** (PRIORITY: P0 for production scale >1K writes/sec)
  - [ ] Checkpoint API: `POST /v1/admin/checkpoint` (fsync WAL, flush dirty pages, quiesce writes <100ms)
  - [ ] Shadow copy mechanism: Copy DB files while holding checkpoint lock
  - [ ] Snapshot registry: Track active snapshots, prevent WAL truncation during backup
  - [ ] Zero-downtime workflow: Update `backup-stemedb.sh` to call checkpoint before rsync
  - [ ] Metrics: `stemedb_checkpoint_duration_seconds`, `stemedb_checkpoint_pauses_total`
  - [ ] Test: Backup under sustained 1K writes/sec load, restore, verify no data loss

- [ ] **8C.2B Volume Snapshot Integration** (OPTIONAL - optimization for cloud deployments)
  - [ ] AWS EBS snapshot support: Detect EBS volumes, create snapshots via AWS SDK
  - [ ] GCP persistent disk snapshots: Detect GCP disks, create snapshots via GCP SDK
  - [ ] Azure managed disk snapshots: Detect Azure disks, create snapshots via Azure SDK
  - [ ] Backup script flag: `--snapshot-mode=ebs|gcp-disk|azure-disk|file-copy` (auto-detect if not set)
  - [ ] Fallback: Use file-copy (rsync) if cloud provider not detected
  - [ ] Speed improvement: Instant CoW snapshots (seconds) vs. rsync (minutes)
  - [ ] Cost improvement: Incremental snapshots (only changed blocks) vs. full S3 uploads
  - [ ] Restore from snapshot: `restore-stemedb.sh --from-snapshot snap-abc123`
  - [ ] Documentation: Cloud provider setup, IAM permissions, cost comparison

- [ ] **8C.3 Storage Compaction**
  - [ ] Automatic WAL segment cleanup (delete segments older than 7 days if checkpointed)
  - [ ] Tombstone removal (compact assertions with lifecycle=Superseded)
  - [ ] Background task: Run compaction every 6 hours
  - [ ] Metrics: `wal_segments_deleted_total`, `compaction_bytes_reclaimed`

- [ ] **8C.4 Auto-Healing Improvements**
  - [ ] Detect dead node → trigger re-replication → restore replication factor (automated)
  - [ ] Circuit breaker: Don't trigger shard split if memory >80%
  - [ ] Clock skew detection: Reject assertions with timestamps >1s in future
  - [ ] Partition detection: Log when SWIM sees cluster split

- [ ] **8C.5 Rolling Upgrades**
  - [ ] `stemedb-admin upgrade --version v0.3.0 --batch-size 1`
  - [ ] Pre-flight compatibility check (schema version, WAL format)
  - [ ] Drain node before upgrade (move shards to other nodes)
  - [ ] Zero-downtime upgrade workflow

- [ ] **8C.6 Multi-Region (Active-Passive)**
  - [ ] Secondary region with continuous WAL replication
  - [ ] Automated failover (DNS swap when primary unavailable >5 min)
  - [ ] Failover time target: <10 minutes
  - [ ] Cost estimate: ~$500/month for active-passive

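
Item 8C.1 names two recovery target forms, `lsn:123456` and an ISO-style timestamp such as `2026-02-11T14:25:00`. One possible shape for the parser is sketched below. The `RecoveryTarget` enum and function names are hypothetical, and the timestamp is only shape-checked and kept as text, where a real implementation would parse it into a time type and validate field ranges.

```rust
/// Hypothetical parsed form of the `--target` argument.
#[derive(Debug, PartialEq)]
enum RecoveryTarget {
    /// Stop replay at this log sequence number.
    Lsn(u64),
    /// Stop replay at this timestamp (kept as text in this sketch).
    Timestamp(String),
}

/// Shape check for `YYYY-MM-DDTHH:MM:SS`; does not validate field ranges.
fn looks_like_timestamp(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 19
        && bytes.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            10 => *c == b'T',
            13 | 16 => *c == b':',
            _ => c.is_ascii_digit(),
        })
}

/// Parse either `lsn:<number>` or a timestamp-shaped string.
fn parse_target(arg: &str) -> Result<RecoveryTarget, String> {
    if let Some(rest) = arg.strip_prefix("lsn:") {
        return rest
            .parse::<u64>()
            .map(RecoveryTarget::Lsn)
            .map_err(|e| format!("invalid LSN '{}': {}", rest, e));
    }
    if looks_like_timestamp(arg) {
        Ok(RecoveryTarget::Timestamp(arg.to_string()))
    } else {
        Err(format!("unrecognized recovery target '{}'", arg))
    }
}

fn main() {
    for arg in ["lsn:123456", "2026-02-11T14:25:00", "yesterday"] {
        println!("{} -> {:?}", arg, parse_target(arg));
    }
}
```

Rejecting unknown forms up front means the replay engine never starts with an ambiguous cutoff.
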
---

## Phase 9: Enterprise Scale & Compliance

> **Goal:** Enterprise-grade durability, compliance, and incident response
> **Prerequisite:** 5-10 production customers, predictable failure patterns

### 9A. Advanced Backup & Recovery

- [ ] **9A.1 Incremental Backup**
  - [ ] Only backup changed blocks since last backup (rsync --link-dest pattern)
  - [ ] Backup time: Minutes instead of hours for 1TB database
  - [ ] Storage savings: 90% reduction for daily incrementals

- [ ] **9A.2 Cross-Region Backup Replication**
  - [ ] Replicate backups to S3 in different region (S3 cross-region replication)
  - [ ] Storage tiers: Hot (7 days Standard), Warm (7-30 days Intelligent-Tiering), Cold (30+ days Glacier IR)
  - [ ] Cost estimate: ~$210/month for 11TB (7 daily + 4 weekly backups)

- [ ] **9A.3 Backup Encryption**
  - [ ] Encrypt backups at rest (AWS KMS or customer-managed keys)
  - [ ] Encrypt backups in transit (TLS for S3 uploads)
  - [ ] Key rotation policy (90-day rotation)

### 9B. Data Corruption & Recovery

- [ ] **9B.1 Deep Corruption Detection**
  - [ ] Validate Merkle tree checksums before accepting gossip
  - [ ] Periodic background validation (full DB checksum every 24h)
  - [ ] Metric: `corruption_detected_total{source=gossip|disk}`

- [ ] **9B.2 Assertion Tombstones (Soft Delete)**
  - [ ] New lifecycle stage: `Deleted` (append-only, not physically removed)
  - [ ] Tombstone propagation via gossip (all nodes learn of deletion)
  - [ ] Query filtering: Lenses ignore `Deleted` assertions by default

- [ ] **9B.3 Cluster Rollback**
  - [ ] `stemedb-admin rollback --before 2026-02-11T14:00:00`
  - [ ] Batch tombstone generation for all assertions after timestamp
  - [ ] Use case: Bulk data corruption, need to revert cluster to known-good state

- [ ] **9B.4 Split-Brain Recovery**
  - [ ] Automatic detection: Merkle tree divergence >10% after partition heals
  - [ ] Manual resolution: `stemedb-admin resolve-split --prefer-node node-1`
  - [ ] CRDT merge with conflict log (record which assertions were merged/discarded)

### 9C. Compliance & Legal

- [ ] **9C.1 GDPR Right to Erasure**
  - [ ] Cryptographic erasure: Each agent has unique encryption key
  - [ ] Delete key → data unrecoverable (even though assertions remain on disk)
  - [ ] Compliance proof: "Key deleted on YYYY-MM-DD, data cryptographically erased"

- [ ] **9C.2 Data Retention Policies**
  - [ ] Per-subject TTL: `retention_policy{subject="medical/*"}=7years`
  - [ ] Per-predicate TTL: `retention_policy{predicate="temp_session"}=1day`
  - [ ] Background task: Tombstone assertions past TTL

- [ ] **9C.3 Immutable Audit Trail**
  - [ ] All admin actions logged to append-only audit store
  - [ ] Include: Who, what, when, why (justification field required)
  - [ ] Export API: `GET /v1/admin/audit?from=DATE&to=DATE`
  - [ ] Compliance report generator (CSV/PDF for auditors)

- [ ] **9C.4 SOC 2 Type II Certification**
  - [ ] Security controls implementation (access control, encryption, monitoring)
  - [ ] 6-month observation period (demonstrate controls work consistently)
  - [ ] External auditor engagement (Big 4 accounting firm)
  - [ ] Annual recertification

### 9D. Storage Management

- [ ] **9D.1 Advanced Compaction**
  - [ ] Multi-generation compaction: Merge small segments into larger ones
  - [ ] Compaction budget: Limit I/O impact (max 10% of disk bandwidth)
  - [ ] Metrics: `compaction_progress{generation}`, `compaction_bytes_read/written`

- [ ] **9D.2 Tiered Storage**
  - [ ] Hot tier: NVMe SSD (last 7 days, accessed frequently)
  - [ ] Warm tier: SATA SSD (7-90 days, accessed occasionally)
  - [ ] Cold tier: S3 Glacier (90+ days, accessed rarely)
  - [ ] Automatic migration based on access patterns

- [ ] **9D.3 Storage Quotas**
  - [ ] Per-agent quotas: `quota{agent="user123"}=10GB`
  - [ ] Cluster-wide quota: Hard limit on total DB size
  - [ ] Soft quota warning at 80% (alert ops team)
  - [ ] Hard quota rejection at 100% (reject new assertions)

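
The soft and hard thresholds in 9D.3 can be expressed as a three-way decision made before a write is accepted. The sketch below is one reading of those bullets: it assumes the check is applied to the projected usage after the incoming write, and the `QuotaDecision` and `check_quota` names are hypothetical.

```rust
/// Outcome of a quota check for one incoming write.
#[derive(Debug, PartialEq)]
enum QuotaDecision {
    /// Below the soft threshold.
    Accept,
    /// At or above 80% of the quota: accept, but alert the ops team.
    AcceptWithWarning,
    /// The write would push usage past 100% of the quota.
    Reject,
}

/// Decide based on projected usage (current usage plus the incoming write).
fn check_quota(used_bytes: u64, incoming_bytes: u64, quota_bytes: u64) -> QuotaDecision {
    let projected = used_bytes.saturating_add(incoming_bytes);
    if projected > quota_bytes {
        QuotaDecision::Reject
    } else if (projected as u128) * 100 >= (quota_bytes as u128) * 80 {
        // Integer comparison avoids floating-point rounding at the 80% boundary.
        QuotaDecision::AcceptWithWarning
    } else {
        QuotaDecision::Accept
    }
}

fn main() {
    const GB: u64 = 1_000_000_000;
    let quota = 10 * GB; // matches the 10GB per-agent example above
    for used in [5 * GB, 8 * GB, 10 * GB] {
        println!("used {} GB -> {:?}", used / GB, check_quota(used, 1_000, quota));
    }
}
```

The same function could serve both the per-agent and the cluster-wide quota, called once with each limit.
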
### 9E. Incident Response

- [ ] **9E.1 Alerting & Escalation**
  - [ ] PagerDuty integration (API key in config)
  - [ ] Slack integration (webhook URL, #stemedb-alerts channel)
  - [ ] Escalation policy: Warn → Page primary → Page backup → Page manager
  - [ ] Alert grouping: Batch related alerts (don't page 100 times for same issue)

- [ ] **9E.2 Incident Management**
  - [ ] Incident response playbook (`docs/operations/incident-response.md`)
  - [ ] Severity levels: P0 (total outage), P1 (degraded), P2 (warning)
  - [ ] Communication templates (customer email, status page update)
  - [ ] Post-mortem template (5 Whys, timeline, action items)

- [ ] **9E.3 Chaos Engineering**
  - [ ] Monthly "game day" exercises
  - [ ] Scenarios: Node failure, network partition, disk full, slow disk
  - [ ] Use `stemedb-chaos` crate to inject failures
  - [ ] Document learnings, update runbooks

- [ ] **9E.4 On-Call Rotation**
  - [ ] Define on-call schedule (primary, backup, manager escalation)
  - [ ] On-call playbook (what to do when paged, who to call, escalation path)
  - [ ] On-call compensation policy
  - [ ] Post-incident review process

### 9F. Security Hardening

- [ ] **9F.1 mTLS for Cluster Communication**
  - [ ] Require client certificates for all node-to-node RPC
  - [ ] Certificate authority: Internal CA or Let's Encrypt
  - [ ] Certificate rotation: 90-day validity, automated renewal
  - [ ] Reject connections without valid cert (prevent rogue nodes)

- [ ] **9F.2 Encryption at Rest**
  - [ ] WAL encryption: AES-256-GCM per segment
  - [ ] KV store encryption: Transparent encryption layer (redb feature or OS-level LUKS)
  - [ ] Key management: AWS KMS, HashiCorp Vault, or customer-managed keys
  - [ ] Compliance: Meets HIPAA/GDPR encryption requirements

- [ ] **9F.3 Node Authentication**
  - [ ] Each node has Ed25519 keypair (identity)
  - [ ] Signed cluster join: Node signs join request with private key
  - [ ] Admin API: Approve/reject join requests (`stemedb-admin node approve <node-id>`)
  - [ ] Prevent unauthorized nodes from joining cluster

- [ ] **9F.4 API Security**
  - [ ] Rate limiting per API key (100 req/min for free tier, 10K req/min for enterprise)
  - [ ] Input validation: UTF-8, max lengths, regex injection protection
  - [ ] SQL injection prevention: Parameterized queries only (no string concatenation)
  - [ ] XSS prevention: Escape all user-provided content in dashboard

- [ ] **9F.5 Secrets Management**
  - [ ] Never store secrets in code or config files
  - [ ] Use environment variables or secret management service (Vault, AWS Secrets Manager)
  - [ ] Secret rotation policy (API keys rotated every 90 days)
  - [ ] Audit log: Track secret access (who accessed what secret when)

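
Item 9F.4 calls for per-key rate limits (100 req/min on the free tier, 10K req/min for enterprise). The roadmap does not say which algorithm is intended, so the sketch below assumes the simplest one, a fixed-window counter per API key. The `RateLimiter` type is hypothetical, and a token bucket or sliding window would smooth bursts at window boundaries better.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window limiter: at most `limit` requests per `window` for each API key.
struct RateLimiter {
    limit: u32,
    window: Duration,
    /// Per key: (start of the current window, requests counted in it).
    windows: HashMap<String, (Instant, u32)>,
}

impl RateLimiter {
    fn new(limit: u32, window: Duration) -> Self {
        RateLimiter {
            limit,
            window,
            windows: HashMap::new(),
        }
    }

    /// Returns true if the request is allowed, false if the key is over its limit.
    fn allow(&mut self, api_key: &str, now: Instant) -> bool {
        let entry = self.windows.entry(api_key.to_string()).or_insert((now, 0));
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Free-tier limit from the roadmap item: 100 requests per minute.
    let mut limiter = RateLimiter::new(100, Duration::from_secs(60));
    let now = Instant::now();
    let allowed = (0..150).filter(|_| limiter.allow("free-tier-key", now)).count();
    println!("allowed {} of 150 requests in one window", allowed);
}
```

Passing `now` in as a parameter keeps the limiter deterministic in tests.
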
### 9G. Operational Maturity

- [ ] **9G.1 SLI/SLO Definitions**
  - [ ] Availability SLO: 99.95% uptime (21.9 min/month downtime budget)
  - [ ] Latency SLO: p95 query latency <100ms, p99 <500ms
  - [ ] Error rate SLO: <0.1% of requests fail
  - [ ] Dashboard: SLO compliance tracking, error budget remaining

- [ ] **9G.2 Capacity Planning**
  - [ ] Quarterly capacity review (growth trends, resource utilization)
  - [ ] 6-month forecast (projected assertion count, disk usage, API load)
  - [ ] Auto-scaling triggers (add nodes when CPU >70% for 10 min)
  - [ ] Budget planning: Cloud costs per customer, per assertion

- [ ] **9G.3 Performance Testing**
  - [ ] Load testing: Sustained 10K assertions/sec for 1 hour
  - [ ] Stress testing: Ramp to failure (find breaking point)
  - [ ] Chaos testing: Inject failures during load test
  - [ ] Regression testing: Compare performance across releases

- [ ] **9G.4 Documentation**
  - [ ] Operator guide (`docs/operations/operator-guide.md`)
  - [ ] Troubleshooting guide (symptom → diagnosis → fix)
  - [ ] Architecture deep-dive (how it works, design decisions)
  - [ ] API reference (auto-generated from OpenAPI spec)
  - [ ] SDK usage guides (Go, Python, TypeScript)

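
The 21.9 min/month figure in 9G.1 follows from the 99.95% target applied to an average month of 365.25 / 12 days, which is 43,830 minutes; 0.05% of that is about 21.9 minutes. The sketch below reproduces that arithmetic and the "error budget remaining" value the dashboard item mentions. The function names are illustrative, and the average-month convention is an assumption that happens to match the stated figure.

```rust
/// Average month length in minutes: 365.25 days / 12 months = 43,830 minutes.
const MINUTES_PER_MONTH: f64 = 365.25 * 24.0 * 60.0 / 12.0;

/// Allowed downtime per month, in minutes, for an availability SLO in percent.
fn downtime_budget_minutes(slo_percent: f64) -> f64 {
    (1.0 - slo_percent / 100.0) * MINUTES_PER_MONTH
}

/// Budget left after the downtime already observed this month (can go negative).
fn budget_remaining_minutes(slo_percent: f64, downtime_so_far_minutes: f64) -> f64 {
    downtime_budget_minutes(slo_percent) - downtime_so_far_minutes
}

fn main() {
    let slo = 99.95;
    println!("monthly budget: {:.1} min", downtime_budget_minutes(slo));
    // Example: an 8-minute incident earlier in the month (made-up number).
    println!("remaining: {:.1} min", budget_remaining_minutes(slo, 8.0));
}
```

A negative remaining budget is a useful signal in its own right, so the sketch does not clamp it to zero.
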
---

## Architecture Overview

```
Write Path (Spine):               Read Path (Cortex):
[Agent] -> [Ingestion]            [Agent] <- [Lens Engine]
                |                                  |
                v                                  |
           [WAL/Fsync]                      [Index Lookup]
                |                                  |
                v                                  |
           [KV Store] <----------------------------+
```

## Port Scheme (181XX)

| Offset | Service | Default | Env Var |
|--------|---------|---------|---------|
| +0 | HTTP API | 18180 | `STEMEDB_BIND_ADDR` |
| +1 | Cluster Gateway | 18181 | `STEMEDB_NODE_API_ADDR` |
| +2 | Cluster RPC | 18182 | `STEMEDB_NODE_RPC_ADDR` |
| +3 | SWIM Gossip | 18183 | via `SwimConfig` |
| +4 | Metrics | 18184 | (reserved) |
| +5 | Admin | 18185 | (reserved) |
| +6 | Latent Signal | 18186 | — |
| +7 | Community App | 18187 | — |
| +8 | Admin Dashboard | 18188 | — |

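
The table above assigns each service a fixed offset from the base port 18180. A small sketch of that mapping follows. The `Service` enum and `port_for` function are hypothetical names, and treating the base as a parameter is an assumption, since the table itself only lists the defaults.

```rust
/// Default base port of the 181XX scheme.
const BASE_PORT: u16 = 18180;

/// Services in the port table; the discriminant is the offset from the base port.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
enum Service {
    HttpApi = 0,
    ClusterGateway = 1,
    ClusterRpc = 2,
    SwimGossip = 3,
    Metrics = 4,
    Admin = 5,
    LatentSignal = 6,
    CommunityApp = 7,
    AdminDashboard = 8,
}

/// Port for a service given a base port (base + offset).
fn port_for(base: u16, service: Service) -> u16 {
    base + service as u16
}

fn main() {
    for service in [Service::HttpApi, Service::ClusterGateway, Service::AdminDashboard] {
        println!("{:?} -> {}", service, port_for(BASE_PORT, service));
    }
}
```

Deriving ports from one base would let a second local node run on, say, 18280 without editing each address individually, if the services were configured that way.
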
## Crates

| Crate | Purpose | Status |
|-------|---------|--------|
| `stemedb-core` | Assertion, LifecycleStage, MaterializedView, types, signing | ✅ |
| `stemedb-wal` | Write-ahead log with crash recovery | ✅ |
| `stemedb-storage` | KVStore, VoteStore, IndexStore, TrustRankStore, QuarantineStore | ✅ |
| `stemedb-ingest` | Ingestion pipeline, signature verification, ContentDefenseLayer | ✅ |
| `stemedb-query` | Query engine, Materializer for O(1) MV reads | ✅ |
| `stemedb-lens` | Lenses (Recency, Consensus, Authority, Skeptic, Layered, etc.) | ✅ |
| `stemedb-api` | HTTP API with axum + utoipa OpenAPI docs | ✅ |
| `stemedb-sim` | Simulation for testing the pipeline | ✅ |
| `stemedb-merkle` | BLAKE3 Merkle tree for diff detection | ✅ |
| `stemedb-rpc` | gRPC services for node-to-node communication | ✅ |
| `stemedb-sync` | Merkle sync, gossip broadcast, anti-entropy | ✅ |
| `stemedb-cluster` | Cluster membership (SWIM), sharding, gateway | ✅ |
| `stemedb-ontology` | Domain definitions (Pharma), subject builders, medical extractors | ✅ |
| `stemedb-chaos` | Chaos testing infrastructure | ✅ |
| `stemedb-dashboard` | Admin dashboard (React/Next.js) | ✅ (7 panels) |

## Applications

| App | Purpose | Status |
|-----|---------|--------|
| `aphoria` | Code-level truth linter — 42 extractors, claims, verify, coverage | 🎯 A5 flywheel |
| `disputed` | Controversy explorer | Planned |

## SDKs

| SDK | Purpose | Status |
|-----|---------|--------|
| `sdk/go/steme` | Go HTTP client with Ed25519 signing and fluent builders | ✅ |
| `sdk/go/adk` | ADK-Go tools and callbacks for AI agents | ✅ |

---

## Quick Reference

```bash
# Build
cargo build --workspace

# Test
cargo test --workspace

# Lint (must pass before commit)
cargo clippy --workspace -- -D warnings
cargo fmt --check

# Run API server
cargo run --bin stemedb-api

# Run Aphoria scan
cargo run --bin aphoria -- scan /path/to/project --show-observations

# Run demo script
./scripts/demo-consumer-health.sh
```

---

## Arena: Simulation Roadmap

> **Goal:** Incrementally evolve the simulator from Spine validation to a full Agent-Based Modeling environment.
> **Philosophy:** Make it run. Then add. Verify at every step.
> **Alignment:** Tracks main roadmap phases; exercises features as they land.

### Current State

The simulator (`stemedb-sim`) validates the full system through Arena 0-4:

**Completed Arenas:**
- ✅ **Arena 0**: Test infrastructure with assertions and CI integration
- ✅ **Arena 1**: Query path via QueryEngine, Recency lens, lifecycle filtering, query audit
- ✅ **Arena 2**: Voting & VoteAwareConsensus, troll resistance
- ✅ **Arena 2.5**: Hardening (race conditions, API tests, crash recovery, input validation)
- ✅ **Arena 3**: Materialized Views, fast-path verification, MV freshness
- ✅ **Arena 4**: Agent personas (Scientist, Troll, Believer with differentiated strategies)

**What's Tested:**
- WAL durability, rkyv serialization, Ed25519 signatures
- Ingestor pipeline (WAL → KV async flow)
- QueryEngine with multiple lenses
- Lifecycle filtering, voting, consensus
- Query audit trail, materialized views
- Strategy-driven agent behaviors

**What's Not Yet Tested:**
- ❌ TrustRank (Arena 5)
- ❌ Concurrent agents at scale (Arena 6)
- ❌ Time-travel queries (Arena 7)
- ❌ Skeptic lens & conflict scores (Arena 8)

### Upcoming Arena Phases

**Arena 5: TrustRank Integration** (Next)
- Initialize TrustRank for agents
- Reputation adjustment after votes
- TrustAwareAuthorityLens verification
- Troll reputation decay over time

**Arena 6: Concurrent Agents**
- Tokio task per agent
- Scale to 100 agents, then 1000
- Contention metrics and bottleneck identification

**Arena 7: Time-Travel & Epochs**
- Time-travel query verification
- Epoch creation and supersession
- Epoch cascade validation

**Arena 8: Skeptic & Conflict**
- High/low conflict scenarios
- Skeptic lens surfacing outliers
- Conflict score accuracy

**Arena 9: Full Gameplay Loop**
- Ground truth injection
- Complete 5-tick scenario
- Extended 1000-tick run
- Emergence validation

### Alignment with Use Cases

| Use Case | Arena Phase |
|----------|-------------|
| **Agile Agent Team** ||
| Lifecycle filtering | Arena 1.3 |
| Query audit trail | Arena 1.4 |
| Time-travel debugging | Arena 7.1 |
| Expert weighting | Arena 5.3 |
| **Financial Due Diligence** ||
| Conflict detection | Arena 8.1, 8.3 |
| Epoch cascades | Arena 7.2, 7.3 |

**Run command:** `cargo run --bin stemedb-sim`
**Test suite:** `cargo test -p stemedb-sim`

---

## Related Documents

- [CLAUDE.md](./CLAUDE.md) — AI assistant instructions and project rules
- [roadmap-archive.md](./roadmap-archive.md) — Completed phases 1-8A + Pilot 1-3
- [applications/aphoria/docs/vision-gaps.md](./applications/aphoria/docs/vision-gaps.md) — Aphoria vision gap analysis
- [claims-explained.md](./claims-explained.md) — Hand-written Maxwell claims (the gold standard)
- [docs/demo/pilot/amazement-demo.md](./docs/demo/pilot/amazement-demo.md) — Technical demo script
- [docs/demo/pilot/amazement-demo-2.md](./docs/demo/pilot/amazement-demo-2.md) — Executive demo script
- [uat/production-readiness/README.md](./uat/production-readiness/README.md) — Production verification checklist