stemedb/roadmap.md
jml fae9b47fae feat(aphoria): implement hosted mode with remote StemeDB integration
2026-02-14 09:29:56 +00:00


Episteme (StemeDB) Roadmap

Goal: Build the "Git for Truth" substrate for autonomous AI research.
Current Focus: Gap Closure Phase 3, remote hosted mode (Aphoria → remote StemeDB)
Target Vertical: BioTech/Pharma ("The Living Review") + Code Truth (Aphoria)
Endgame: Distributed multi-writer cluster for millions of concurrent agents

Infrastructure Status: Phases 1-7 complete | Phase 8A (Chaos) complete | Pilot 1-5 complete
Aphoria Status: A1-A5 + Phase 2 complete | Tier-aware resolution | Next: Remote mode
Security Status: P5.1 4/5 done (TLS, limits, timeouts, rate limiting) | P5.2 complete

Archive: For completed phases 1-8A + Pilot 1-3, see roadmap-archive.md


Current Status

| Phase | Status | Summary |
|-------|--------|---------|
| 1-7, 8A | Complete | Core infra, cluster, trust, chaos testing |
| MVP, Pilot 1-4 | Complete | Consumer Health demo, dashboard, API auth, metrics |
| Aphoria A1-A4 | Complete | Observations/claims/verify/corpus/authority lens |
| Aphoria A5 | Complete | Flywheel validated: 93.5% acceptance, 100% config recall |
| Gap Closure Phase 2 | Complete | Tier-aware authority resolution, --explain-authority flag |
| Gap Closure Phase 3 | 🎯 Current | Remote hosted mode: Aphoria → remote StemeDB via HTTP API |
| Gap Closure Phase 4 | Planned | Claim discovery, manual convergence, promotion workflows |
| Pilot 5 | Complete | All 7 phases complete: Security (4/5), Monitoring, Backup/DR, Runbooks, Cluster Mgmt, Reference Architecture, Pilot Success Criteria |
| 8B-C | Planned | Distributed observability, geo-distribution |
| 9 | Planned | Disaster recovery, compliance, storage management |

🎯 Aphoria: From Scanner to Knowledge Graph Client (CURRENT)

Goal: Transform Aphoria from "grep with Episteme vocabulary" into a real knowledge graph client that authors, stores, and audits claims with provenance and lineage.
Vision Document: applications/aphoria/docs/vision-gaps.md
Validation: Maxwell scan (67 observations, 0 noise) + hand-written claims-explained.md

Completed Phases (A1-A4 + P4 — see roadmap-archive.md for details)

| Phase | What It Delivered |
|-------|-------------------|
| A1 | Observation vs AuthoredClaim types, bridge tier mapping, .aphoria/claims.toml format |
| A2 | aphoria claims create/list/explain/update/supersede/deprecate, aphoria-claims skill |
| A3 | verify.rs engine (Pass/Conflict/Missing/Unclaimed), aphoria verify run/map, pre-commit hook, self-audit |
| A4 | RFC/OWASP as Episteme assertions, AphoriaAuthorityLens, Trust Pack export/install |
| P4 | API auth (3 roles), backup/restore scripts, Prometheus metrics + Grafana dashboard |

Phase A5: The Flywheel

Goal: The system gets smarter with use. Each claim makes the next claim easier.
Details: vision-gaps.md, §5 (claims-explained.md as the product)
Research: a5-flywheel-skill-design.md (validates the "skill calls CLI" hypothesis)
Key Insight: LLM reasoning over CLI JSON output replaces ML training. The flywheel is prompt engineering, not machine learning.

  • A5.1 Claim Coverage Metrics: Per-module claim density and gap reporting
    • coverage.rs: CoverageReport, ModuleCoverage, CoverageSummary types
    • compute_coverage() uses verify_claims() as source of truth for claim-observation matching
    • Per-module: observation count, claim count, claimed/unclaimed, missing claims, density
    • aphoria coverage CLI: table, JSON, markdown formats, --sort-by (name/density/unclaimed/observations)
    • Coverage gaps section: modules with observations but no claims
    • 8 unit tests including deprecated claim exclusion
  • A5.2 Auto-Generated Documentation: aphoria docs generate + aphoria claims explain
    • aphoria docs generate CLI command with --output and --format (markdown/json)
    • claims_explain.rs: groups by category, includes provenance/invariant/consequence/evidence per claim
    • explain.rs: reads .aphoria/claims.toml, renders via render_claims_markdown()
    • Provenance chains preserved (supersedes references)
  • A5.3 Claim Suggester Skill: LLM-powered pattern recognition via "skill calls CLI"
    • New skill: .claude/skills/aphoria-suggest/SKILL.md (3 modes: cold start / foundation / flywheel)
    • Workflow defined: claims list → verify run --show-unclaimed → reason by analogy → suggest
    • Few-shot learning: existing claims as gold-standard examples for style matching
    • Chain-of-thought: reasoning template before each suggestion
    • Cold start bootstrap: reads README/CLAUDE.md/tests/ADRs when 0 claims
    • Context tiers: local → semantic → summary → global (subagent)
    • Quality gates: non-trivial, not type-enforced, has consequence, not duplicate
    • VG-022 CLOSED: verifiable_predicates() on Extractor trait; 10 extractors declare predicates; verify map shows extractor→claim coverage
    • Dogfood claims: 10 total claims in .aphoria/claims.toml (3 arch + 7 security) covering all ComparisonModes
    • Validate: Run skill against Aphoria's own codebase (dogfood) - 87.5% acceptance rate (7/8)
    • Validate: Run skill against an external project (cold start test) - 72.7% alignment (16/22), 100% config recall
    • Iterate: Refine prompt based on suggestion quality from validation - 3 improvements identified (domain-awareness, impl depth, tuning scan)
  • A5.4 Onboarding Mode: aphoria explain for new team members
    • explain.rs: generate_explanation() reads claims, renders narrative
    • aphoria explain CLI with --output and --format (markdown/json)
    • Shows claim inventory grouped by category with provenance
    • Empty project handling: directs to aphoria claims create
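
The per-module metrics listed under A5.1 can be illustrated with a minimal sketch. The field names and the density formula (claimed observations divided by total observations) are assumptions made for illustration; the actual `ModuleCoverage` type in coverage.rs may define them differently.

```rust
/// Illustrative stand-in for one per-module coverage row.
/// Field names are assumptions, not the real coverage.rs types.
#[derive(Debug, Clone, PartialEq)]
pub struct ModuleCoverage {
    pub module: String,
    pub observations: usize,
    pub claimed: usize,
}

impl ModuleCoverage {
    /// Observations that no claim matched.
    pub fn unclaimed(&self) -> usize {
        self.observations.saturating_sub(self.claimed)
    }

    /// Assumed definition: fraction of observations covered by a claim.
    /// Returns 0.0 for a module without observations.
    pub fn density(&self) -> f64 {
        if self.observations == 0 {
            0.0
        } else {
            self.claimed as f64 / self.observations as f64
        }
    }
}

/// Coverage gaps: modules that have observations but no claimed ones.
pub fn coverage_gaps(rows: &[ModuleCoverage]) -> Vec<&str> {
    rows.iter()
        .filter(|r| r.observations > 0 && r.claimed == 0)
        .map(|r| r.module.as_str())
        .collect()
}
```

A report sorted by density or by unclaimed count, as the `--sort-by` option suggests, would then only need to sort these rows by the corresponding method.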

Gap Closure Phase 2: Tier-Aware Authority Resolution

Goal: Make authority tiers actionable in conflict resolution. Higher-tier sources (lower tier numbers) win.
Why Now: A5.3 validation proved flywheel works (93.5% acceptance). Next blocker: tier resolution.
User Story: "Why is Aphoria blocking me for this? Is it really important?" → Show tier, not just binary BLOCK/FLAG.

  • Phase 2.1 Tier-Aware Types: Foundation for tier-scoped verdicts
    • TierAwareVerdict enum with 3 variants: SingleTier, MultiTier, HigherTierAgreement
    • ConflictResult extended with tier_verdict and primary_tier fields
    • resolution/ module: tier_verdict.rs, authority.rs, mod.rs
    • 10 unit tests for tier resolution logic
  • Phase 2.2 Conflict Logic Updates: Always compute tier breakdown
    • conflict.rs: Moved tier breakdown out of debug-only block (always populated)
    • compute_tier_aware_verdict(): Computes per-tier verdicts with primary tier selection
    • compute_tier_breakdown(): Groups conflicts by tier (0-5)
    • Primary tier = lowest tier number (highest authority)
  • Phase 2.3 Display Formatting: Show tier names in CLI output
    • ConflictResult::Display shows tier-aware verdict when available
    • Tier names formatted: "Tier 1 (Clinical/RFC)", "Tier 3 (Expert)", etc.
    • Backward compatible: legacy output when tier_verdict is None
  • Phase 2.4 CLI Flag: --explain-authority for detailed tier breakdown
    • ScanArgs extended with explain_authority field
    • cli/mod.rs: Added --explain-authority flag to aphoria scan
    • handlers/scan.rs: Pass flag through to scan execution
    • All 28 ScanArgs constructions updated (tests + handlers)
  • Phase 2.5 Tests & Quality: 1300 tests pass, zero clippy warnings
    • All resolution module tests pass (10/10)
    • All aphoria tests pass (1300/1300)
    • Clippy passes with zero warnings
    • Test files updated with new fields (tier_verdict, primary_tier, explain_authority)

What Changed:

  • Conflicts now show tier information: ❌ BLOCK Tier 1 (Clinical/RFC) instead of just ❌ BLOCK
  • Primary tier (highest authority) is computed and stored: Tier 1 beats Tier 3
  • --explain-authority flag shows per-tier breakdown (which tiers have conflicting sources, at what confidence)
  • Backward compatible: existing code without tier_verdict continues to work
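
The rule "primary tier = lowest tier number" and the per-tier grouping can be sketched as follows. This is a simplified illustration, not the code in the resolution/ module: `TierConflict`, `tier_label`, `block_line`, and all signatures are assumptions, and only the tier names that appear in this roadmap are spelled out.

```rust
use std::collections::BTreeMap;

/// Illustrative conflict record; the real ConflictResult carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct TierConflict {
    pub tier: u8, // 0-5; a lower number means higher authority
    pub source: String,
    pub confidence: f64,
}

/// Group conflicts by tier. BTreeMap iterates keys in ascending order,
/// so the first key is the highest-authority tier present.
pub fn compute_tier_breakdown(conflicts: &[TierConflict]) -> BTreeMap<u8, Vec<&TierConflict>> {
    let mut breakdown: BTreeMap<u8, Vec<&TierConflict>> = BTreeMap::new();
    for c in conflicts {
        breakdown.entry(c.tier).or_default().push(c);
    }
    breakdown
}

/// Primary tier = lowest tier number present, or None when there are no conflicts.
pub fn primary_tier(conflicts: &[TierConflict]) -> Option<u8> {
    conflicts.iter().map(|c| c.tier).min()
}

/// Human-readable tier label; unnamed tiers fall back to the bare number.
pub fn tier_label(tier: u8) -> String {
    match tier {
        1 => "Tier 1 (Clinical/RFC)".to_string(),
        3 => "Tier 3 (Expert)".to_string(),
        4 => "Tier 4 (Community)".to_string(),
        n => format!("Tier {n}"),
    }
}

/// Render the tier-aware BLOCK line, or None when nothing conflicts.
pub fn block_line(conflicts: &[TierConflict]) -> Option<String> {
    primary_tier(conflicts).map(|t| format!("❌ BLOCK {}", tier_label(t)))
}
```

With one Tier 1 conflict and two Tier 3 conflicts, the primary tier is 1 and the rendered line names Tier 1, while the breakdown still exposes the Tier 3 sources for a detailed `--explain-authority` view.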

What's Next (Phase 3):

  • Remote mode: aphoria init --remote <url> connects to org StemeDB instance
  • Claim discovery: Query remote claims to see org patterns (specs, popular conventions)
  • Manual convergence: Developer inspects claims, decides whether to align code
  • Manual promotion: Developer upgrades claim tier when backed by higher-tier evidence

Files Changed:

  • New: applications/aphoria/src/resolution/{mod.rs,tier_verdict.rs,authority.rs} (3 files, ~450 LOC)
  • Modified: conflict.rs, result.rs, command.rs, cli/mod.rs, handlers/{scan.rs,mod.rs}, lib.rs
  • Tests: 4 report files, 6 test files, 2 handler files (28 ScanArgs constructions updated)

Gap Closure Phase 3: Remote Hosted Mode (CURRENT)

Goal: Enable Aphoria to work against a remote StemeDB instance instead of local-only.
Why Now: Phase 1 wired claims through StemeDB locally; remote hosting was always the intended deployment.
User Story: "Configure Aphoria to connect to my org's StemeDB URL instead of running locally"

Architecture Note: HostedConfig already exists in the codebase with url, sync_mode, and auth fields. This phase is about wiring it up, not building sync infrastructure.

Current State (Post-Phase 1 + Phase 2):

  • Claims stored in StemeDB (Phase 1: EpistemeClaimStore, roundtrip bridge, auto-migration)
  • Tier-aware resolution (Phase 2: primary tier, tier verdict, --explain-authority)
  • HostedConfig exists (config/types/hosted.rs: url, sync_mode, offline_fallback, api_key_env)
  • EpistemeClaimStore uses local StemeDB only (no HTTP client implementation)
  • No API endpoints for claims (StemeDB API missing /claims/* routes)
  • No auth integration (API key validation not wired up)
  • No remote-mode CLI (aphoria init --remote doesn't exist)

Target State (Phase 3):

  • aphoria init --remote https://stemedb.acme.com writes HostedConfig to .aphoria/config.toml
  • EpistemeClaimStore calls StemeDB HTTP API instead of local WAL/KV
  • StemeDB API has /claims/* endpoints (create, list, fetch, update, etc.)
  • Auth: API key from $STEMEDB_API_KEY validated on server
  • Offline fallback: graceful degradation when remote unreachable (per OfflineFallback config)

Phase 3 Breakdown

Phase 3.1: StemeDB API Endpoints for Claims (3-4 days)

  • Add /api/v1/claims POST endpoint (create claim) — handlers/stemedb_claims.rs::create_claim
  • Add /api/v1/claims GET endpoint (list claims with filters: category, tier, status) — handlers/stemedb_claims.rs::list_claims
  • Add /api/v1/claims/{concept_path}/{predicate} GET endpoint (fetch specific claim) — handlers/stemedb_claims.rs::get_claim
  • Add /api/v1/claims/{concept_path}/{predicate} DELETE endpoint (mark as deprecated) — handlers/stemedb_claims.rs::delete_claim
  • Add /api/v1/claims/{concept_path}/{predicate} PUT endpoint (update claim)
  • Add /api/v1/claims/{concept_path}/{predicate}/supersede POST endpoint
  • DTOs: CreateClaimRequest, CreateClaimResponse, AuthoredClaimDto, AuthoredValueDto in dto/stemedb_claims.rs
  • Error handling: All handlers use crate::error::Result<T> with ApiError variants
  • State access: WAL append via commit_buffer.append(), queries via store.scan_prefix()
  • OpenAPI: All endpoints annotated with #[utoipa::path]
  • Auth: Require API key in Authorization: Bearer <token> header (wiring needed)
  • Tests: Integration tests for all endpoints with valid/invalid API keys
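
The endpoint list above amounts to a small routing table. The sketch below models it with the standard library only, purely to make the method/path combinations explicit; the real handlers are registered through the API framework, and `ClaimRoute` and `match_claim_route` are hypothetical names. It also assumes that `concept_path` is percent-encoded into a single path segment, since concept paths shown elsewhere in this roadmap contain slashes.

```rust
/// Routes described in Phase 3.1, modeled as a plain enum (illustrative only).
#[derive(Debug, PartialEq)]
pub enum ClaimRoute {
    Create,
    List,
    Get { concept_path: String, predicate: String },
    Update { concept_path: String, predicate: String },
    Deprecate { concept_path: String, predicate: String },
    Supersede { concept_path: String, predicate: String },
}

/// Match an HTTP method and path (without query string) to a claim route.
pub fn match_claim_route(method: &str, path: &str) -> Option<ClaimRoute> {
    let rest = path.strip_prefix("/api/v1/claims")?;
    // Reject lookalike prefixes such as "/api/v1/claimsx".
    if !rest.is_empty() && !rest.starts_with('/') {
        return None;
    }
    let segs: Vec<&str> = rest.split('/').filter(|s| !s.is_empty()).collect();
    match (method, segs.as_slice()) {
        ("POST", []) => Some(ClaimRoute::Create),
        ("GET", []) => Some(ClaimRoute::List),
        ("GET", [c, p]) => Some(ClaimRoute::Get {
            concept_path: c.to_string(),
            predicate: p.to_string(),
        }),
        ("PUT", [c, p]) => Some(ClaimRoute::Update {
            concept_path: c.to_string(),
            predicate: p.to_string(),
        }),
        // DELETE marks the claim as deprecated rather than erasing it.
        ("DELETE", [c, p]) => Some(ClaimRoute::Deprecate {
            concept_path: c.to_string(),
            predicate: p.to_string(),
        }),
        ("POST", [c, p, "supersede"]) => Some(ClaimRoute::Supersede {
            concept_path: c.to_string(),
            predicate: p.to_string(),
        }),
        _ => None,
    }
}
```

Keeping DELETE mapped to a deprecation keeps the provenance chain intact, which matches the "mark as deprecated" wording above.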

Phase 3.2: HTTP Client Implementation (3-4 days)

  • Remote module structure: applications/aphoria/src/remote/{mod.rs,cache.rs,client.rs} (created in Phase 3 prep)
  • RemoteClaimStore struct with reqwest for API calls (partially implemented, needs testing)
  • Implement ClaimStore trait over HTTP client (save_claim, load_claim, list_claims, etc.)
  • Auth: Read API key from $STEMEDB_API_KEY env var, send in Authorization header
  • Error handling: 401 Unauthorized → clear error message, 5xx → offline fallback
  • Retries: Exponential backoff for transient errors (503, network timeouts)
  • Tests: Unit tests with mock HTTP server
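
The error-handling and retry items above can be sketched independently of any HTTP library. The following is a minimal illustration under stated assumptions: `RemoteError`, `classify_status`, `backoff_delay`, and `with_retries` are hypothetical names, and the request itself is passed in as a closure so that a real client (for example one built on reqwest) could plug in its own call.

```rust
use std::time::Duration;

/// Outcome categories taken from the Phase 3.2 list (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum RemoteError {
    Unauthorized,     // 401: report a clear error, do not retry
    Transient,        // 503 or network timeout: retry with backoff
    Unavailable(u16), // other 5xx: hand over to offline fallback
    Rejected(u16),    // any other non-success status
}

/// Map an HTTP status code to Ok or an error category.
pub fn classify_status(status: u16) -> Result<(), RemoteError> {
    match status {
        200..=299 => Ok(()),
        401 => Err(RemoteError::Unauthorized),
        503 => Err(RemoteError::Transient),
        500..=599 => Err(RemoteError::Unavailable(status)),
        other => Err(RemoteError::Rejected(other)),
    }
}

/// Exponential backoff: base * 2^attempt, capped at `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(cap)
}

/// Run `op` up to `max_attempts` times, sleeping between transient failures.
/// `sleep` is injected so tests can record delays instead of waiting.
pub fn with_retries<T>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, RemoteError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, RemoteError> {
    let mut attempt = 0;
    loop {
        match op() {
            Err(RemoteError::Transient) if attempt + 1 < max_attempts => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

In a blocking client the `sleep` argument could simply be `std::thread::sleep`; an async client would need an async variant of the same loop.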

Phase 3.3: Remote Mode CLI (2 days)

  • aphoria init --remote <url> writes HostedConfig to .aphoria/config.toml
  • --remote flag validates URL format, tests connection (GET /health), writes config
  • Config serialization: hosted section with url, sync_mode: "remote_only", api_key_env
  • aphoria scan detects remote mode, uses HTTP client instead of local StemeDB
  • Tests: CLI test for init --remote, verify config file written correctly
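
To make the `init --remote` steps concrete, here is a minimal sketch of URL validation and of the `hosted` section that would be written. It is an illustration only: the struct below is a reduced stand-in for the existing `HostedConfig`, the serialized name `local_and_remote` and the helper names are assumptions, and the hand-rolled TOML rendering skips string escaping that a real serializer would handle.

```rust
/// Sync modes mentioned in this roadmap; serialized names are assumptions
/// except "remote_only", which appears in the Phase 3.3 list.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyncMode {
    RemoteOnly,
    LocalAndRemote,
}

impl SyncMode {
    fn as_str(self) -> &'static str {
        match self {
            SyncMode::RemoteOnly => "remote_only",
            SyncMode::LocalAndRemote => "local_and_remote",
        }
    }
}

/// Reduced stand-in for HostedConfig (the real type has more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct HostedConfig {
    pub url: String,
    pub sync_mode: SyncMode,
    pub api_key_env: String,
}

/// Basic format check: scheme must be http(s) and a host must follow.
pub fn validate_remote_url(url: &str) -> Result<(), String> {
    let host = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .ok_or_else(|| format!("remote URL must start with http:// or https://: {url}"))?;
    if host.is_empty() || host.starts_with('/') || host.contains(char::is_whitespace) {
        return Err(format!("remote URL has no valid host: {url}"));
    }
    Ok(())
}

impl HostedConfig {
    /// Render the [hosted] section (no escaping; sketch only).
    pub fn to_toml(&self) -> String {
        format!(
            "[hosted]\nurl = \"{}\"\nsync_mode = \"{}\"\napi_key_env = \"{}\"\n",
            self.url,
            self.sync_mode.as_str(),
            self.api_key_env
        )
    }
}
```

The connection test (GET /health) would sit between validation and writing the file, so that a config is only persisted for a reachable instance.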

Phase 3.4: Offline Fallback (2 days)

  • OfflineFallback config options: error, warn, silent
  • When remote unreachable: respect fallback mode (error = fail scan, warn = log + continue, silent = no-op)
  • Cache last-known remote state locally (optional): write claims to .aphoria/cache.toml on successful remote fetch
  • On offline: use cached claims if available, otherwise apply fallback mode
  • Tests: Integration test with unreachable remote URL, verify fallback behavior
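
The decision order described above (remote first, then cache, then the configured fallback mode) can be written as a single function. This is a sketch under assumptions: claims are reduced to strings, and `ClaimSource` and `resolve_claims` are hypothetical names rather than existing Aphoria APIs.

```rust
/// Fallback modes listed in Phase 3.4.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OfflineFallback {
    Error,
    Warn,
    Silent,
}

/// Where the claims used by a scan came from (illustrative type).
#[derive(Debug, PartialEq)]
pub enum ClaimSource {
    Remote(Vec<String>),
    Cache(Vec<String>),
    /// No claims available; `warn` tells the caller whether to log a warning.
    Unavailable { warn: bool },
}

/// Pick the claim source for a scan.
/// `remote` is the result of the remote fetch, `cached` the last-known state.
pub fn resolve_claims(
    remote: Result<Vec<String>, String>,
    cached: Option<Vec<String>>,
    mode: OfflineFallback,
) -> Result<ClaimSource, String> {
    match (remote, cached) {
        // Remote reachable: always prefer fresh claims.
        (Ok(claims), _) => Ok(ClaimSource::Remote(claims)),
        // Remote unreachable but a cache exists: use it regardless of mode.
        (Err(_), Some(claims)) => Ok(ClaimSource::Cache(claims)),
        // Remote unreachable and no cache: apply the fallback mode.
        (Err(e), None) => match mode {
            OfflineFallback::Error => Err(format!("remote StemeDB unreachable: {e}")),
            OfflineFallback::Warn => Ok(ClaimSource::Unavailable { warn: true }),
            OfflineFallback::Silent => Ok(ClaimSource::Unavailable { warn: false }),
        },
    }
}
```

Keeping the decision in a pure function like this makes the fallback matrix easy to cover in unit tests without an unreachable URL; the integration test can then focus on the network path.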

Phase 3.5: Documentation & Migration (2 days)

  • Update applications/aphoria/README.md with remote mode setup
  • New doc: applications/aphoria/docs/remote-mode.md (setup, auth, troubleshooting)
  • Migration guide: TOML-only → remote StemeDB (backward compatible, no breaking changes)
  • Example configs: remote-only, local-and-remote, offline-fallback modes
  • Troubleshooting: auth errors, network issues, API version mismatches

Recent Progress (2026-02-13)

Compilation Fix & API Foundation: Fixed all 24 compilation errors blocking Phase 3 implementation. Key changes:

Wave 1: Type Fixes

  • Removed unused use std::sync::Arc; import
  • Added #[derive(utoipa::ToSchema)] to all DTOs for OpenAPI support
  • Fixed all 7 LifecycleStage mappings (Draft → Proposed, Retired/Superseded → Deprecated)

Wave 2: Architecture Fixes

  • Created crates/stemedb-api/src/dto/stemedb_claims.rs with documented DTOs (proper separation of concerns)
  • Fixed WAL append pattern: Replaced state.engine.put() with serialize_assertion() → commit_buffer.append() (5 locations)
  • Fixed query pattern: Replaced non-existent state.engine.query_by_subject*() with direct store.scan_prefix() calls (3 locations)
  • Fixed error handling: All handlers now return crate::error::Result<T> instead of custom error tuples
  • Added explain_authority field to 3 ScanArgs initializations (Phase 2 integration)

Files Changed:

  • New: crates/stemedb-api/src/dto/stemedb_claims.rs (~90 LOC with docs)
  • Modified: dto/mod.rs (exports), handlers/stemedb_claims.rs (~400 LOC refactor), handlers/aphoria/{claims.rs,scan.rs} (explain_authority)

Verification:

cargo check --workspace  # ✅ PASSES (was: 24 errors)
cargo clippy --workspace -- -D warnings  # ✅ PASSES

Next Steps:

  • Test API endpoints manually (Wave 3)
  • Implement aphoria init --remote CLI command (Wave 4)
  • Write comprehensive tests (Wave 5)
  • Update documentation (Wave 6)

Success Criteria

Must Have:

  • aphoria init --remote <url> writes HostedConfig and validates connection
  • StemeDB API has /claims/* endpoints (create, list, fetch, update, supersede)
  • EpistemeClaimStore works over HTTP (no local WAL/KV)
  • Auth: API key in Authorization header, validated on server
  • aphoria scan works end-to-end with remote StemeDB (same UX as local)

Should Have:

  • Offline fallback: graceful degradation when remote unreachable
  • Cache last-known state locally for offline use
  • Documentation: remote mode setup, auth, troubleshooting
  • Backward compatible: local mode still works (default behavior)

Nice to Have:

  • Sync mode: LocalAndRemote (write local + remote)
  • Migration tool: bulk import TOML → remote StemeDB
  • Health check: aphoria check-remote validates connection + auth

Estimated Timeline

| Phase | Estimated | Dependencies |
|-------|-----------|--------------|
| 3.1: StemeDB API Endpoints | 3-4 days | None |
| 3.2: HTTP Client | 3-4 days | 3.1 complete |
| 3.3: Remote Mode CLI | 2 days | 3.2 complete |
| 3.4: Offline Fallback | 2 days | 3.2 complete |
| 3.5: Documentation | 2 days | 3.1-3.4 complete |
| Total | 2-3 weeks | ~15 business days |

Note: Phase 3 is now MUCH simpler than originally planned:

  • No sync infrastructure (direct HTTP, not push/pull)
  • No multi-agent convergence (deferred to future phase)
  • No tier escalation (deferred to future phase)
  • Focus: Enable remote mode, validate it works

Risks & Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| API performance (HTTP overhead) | Medium | Medium | Batch operations, HTTP/2, connection pooling |
| Auth token leakage | Low | High | Env var only, never log token, validate on server |
| Network unreliability | High | Low | Offline fallback with cached state |
| Breaking local workflow | Low | High | Local mode is default, remote is opt-in |
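
The mitigation for auth token leakage ("env var only, never log token") can be illustrated with a small wrapper type. This is a sketch, not existing Aphoria code: `ApiKey` and its methods are hypothetical, and the environment lookup is injected as a closure so the logic can be tested without touching the process environment.

```rust
use std::fmt;

/// Wrapper that keeps the token out of Debug output and therefore out of logs.
pub struct ApiKey(String);

impl fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Never print the secret itself.
        f.write_str("ApiKey(<redacted>)")
    }
}

impl ApiKey {
    /// Load the key from the variable named by `var_name` (e.g. the value of
    /// `api_key_env`). `lookup` abstracts over the environment.
    pub fn load(
        var_name: &str,
        lookup: impl Fn(&str) -> Option<String>,
    ) -> Result<ApiKey, String> {
        match lookup(var_name) {
            Some(v) if !v.trim().is_empty() => Ok(ApiKey(v.trim().to_string())),
            _ => Err(format!(
                "API key not set: export {var_name} before using remote mode"
            )),
        }
    }

    /// Header pair to attach to each request.
    pub fn authorization_header(&self) -> (&'static str, String) {
        ("Authorization", format!("Bearer {}", self.0))
    }
}
```

In production code the lookup would typically be `|name| std::env::var(name).ok()`; the error message names the variable but never echoes a value.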

Gap Closure Phase 4: Claim Discovery & Manual Convergence (FUTURE)

Goal: Help developers discover org patterns and make informed convergence decisions.
Why: Remote mode enables sharing, but developers need visibility into what exists.
User Story: "Show me what claims exist for this module so I can decide if I should align"

Dependencies:

  • Phase 3 complete (remote mode working)
  • Developers are authoring claims in shared StemeDB

Key Workflows:

1. Claim Discovery — "What patterns exist for this concept?"

# Query org claims for a specific concept
aphoria claims search --concept-path "*/imports/tokio"
aphoria claims search --category architecture --tier clinical

# Output shows:
# - 3 claims from other teams
# - Tier 1 (RFC): "Core MUST NOT import tokio"
# - Tier 3 (Expert): "Web services MAY import tokio" (15 projects)
# - Tier 4 (Community): "CLI tools SHOULD import tokio" (5 projects)

2. Convergence Decision — "Should I align with this pattern?"

  • Developer sees Tier 1 spec → follows it (authority)
  • Developer sees popular Tier 3 pattern (15 projects) → can choose to converge
  • Decision is MANUAL, not automatic

3. Manual Promotion — "My Tier 3 claim has Tier 1 evidence"

# Developer finds RFC backing for their code claim
aphoria claims promote <claim-id> \
  --to-tier clinical \
  --evidence "RFC 7519 Section 4.1.3" \
  --reason "RFC explicitly requires audience validation"

# System validates:
# - New tier is higher authority than current
# - Evidence is provided
# - Provenance chain preserved

4. Pattern Popularity — "How widely adopted is this claim?"

aphoria claims stats <claim-id>
# Shows:
# - 23 projects have similar claim (same concept_path+predicate)
# - 18 at Tier 3 (Expert)
# - 5 at Tier 4 (Community)
# - Suggests: "This pattern is widely adopted. Consider aligning."

Phase 4 Breakdown (Future — After Phase 3)

Phase 4.1: Claim Search & Discovery (1 week)

  • aphoria claims search CLI with filters (concept, category, tier, status)
  • Query remote StemeDB via /claims?filters=... API
  • Display: claim summary, tier, provenance, evidence, adoption count
  • Similarity matching: "Show claims similar to my local code"

Phase 4.2: Convergence Suggestions (1 week)

  • aphoria scan --suggest-convergence compares local code to remote claims
  • Highlights: "Your code imports tokio, but org spec says MUST NOT"
  • Highlights: "15 projects use pattern X, you use pattern Y — consider aligning?"
  • Developer decides: align code, create counter-claim, or ignore

Phase 4.3: Manual Promotion Workflow (1 week)

  • aphoria claims promote CLI command with validation
  • Requires: target tier, evidence, reason
  • Preserves provenance: promoted claim references original
  • Audit log: who promoted, when, why
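
The promotion checks listed for this workflow (higher-authority target tier, evidence present, provenance preserved, audit trail) can be sketched as one validation function. Everything here is an assumption made for illustration: the types, field names, and the choice to create a new claim that points back at the original are not taken from the codebase, and the audit timestamp ("when") is left to the caller.

```rust
/// Simplified claim; `promoted_from` preserves the provenance chain.
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub id: String,
    pub tier: u8, // lower number = higher authority
    pub evidence: Vec<String>,
    pub promoted_from: Option<String>,
}

/// Inputs corresponding to --to-tier, --evidence, and --reason.
pub struct PromotionRequest {
    pub to_tier: u8,
    pub evidence: String,
    pub reason: String,
    pub promoted_by: String,
}

/// Audit record: who promoted which claim, and why.
#[derive(Debug, PartialEq)]
pub struct AuditEntry {
    pub claim_id: String,
    pub promoted_by: String,
    pub from_tier: u8,
    pub to_tier: u8,
    pub reason: String,
}

#[derive(Debug, PartialEq)]
pub enum PromoteError {
    NotHigherAuthority { current: u8, requested: u8 },
    MissingEvidence,
    MissingReason,
}

/// Validate a promotion and build the promoted claim plus its audit entry.
pub fn promote(
    original: &Claim,
    new_id: &str,
    req: &PromotionRequest,
) -> Result<(Claim, AuditEntry), PromoteError> {
    // The target must be strictly higher authority (a lower tier number).
    if req.to_tier >= original.tier {
        return Err(PromoteError::NotHigherAuthority {
            current: original.tier,
            requested: req.to_tier,
        });
    }
    if req.evidence.trim().is_empty() {
        return Err(PromoteError::MissingEvidence);
    }
    if req.reason.trim().is_empty() {
        return Err(PromoteError::MissingReason);
    }
    let mut evidence = original.evidence.clone();
    evidence.push(req.evidence.clone());
    let promoted = Claim {
        id: new_id.to_string(),
        tier: req.to_tier,
        evidence,
        promoted_from: Some(original.id.clone()),
    };
    let audit = AuditEntry {
        claim_id: promoted.id.clone(),
        promoted_by: req.promoted_by.clone(),
        from_tier: original.tier,
        to_tier: req.to_tier,
        reason: req.reason.clone(),
    };
    Ok((promoted, audit))
}
```

Returning the audit entry alongside the claim keeps the two in step: a promotion cannot succeed without producing the record that explains it.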

Phase 4.4: Adoption Metrics (1 week)

  • aphoria claims stats shows adoption counts per claim
  • Query: "How many projects have claims with this concept_path?"
  • Dashboard: most popular patterns by category/tier
  • Helps identify: emerging conventions, dead patterns
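
The adoption query ("how many projects have claims with this concept_path?") reduces to a filtered count grouped by tier. The sketch below assumes a flat list of records and counts each project once using the first matching record; `ClaimRecord`, `AdoptionStats`, and `adoption_stats` are hypothetical names, and a real implementation would run this against the remote store rather than an in-memory slice.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One claim as seen across the org (illustrative shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ClaimRecord {
    pub project: String,
    pub concept_path: String,
    pub predicate: String,
    pub tier: u8,
}

/// Adoption summary for one concept_path + predicate pair.
#[derive(Debug, Default, PartialEq)]
pub struct AdoptionStats {
    pub projects: usize,
    pub by_tier: BTreeMap<u8, usize>,
}

/// Count distinct projects holding a claim with the same concept_path and
/// predicate, broken down by tier.
pub fn adoption_stats(
    records: &[ClaimRecord],
    concept_path: &str,
    predicate: &str,
) -> AdoptionStats {
    let mut seen: BTreeSet<&str> = BTreeSet::new();
    let mut stats = AdoptionStats::default();
    let matching = records
        .iter()
        .filter(|r| r.concept_path == concept_path && r.predicate == predicate);
    for r in matching {
        // Count each project once, even if it holds duplicate claims.
        if seen.insert(r.project.as_str()) {
            stats.projects += 1;
            *stats.by_tier.entry(r.tier).or_insert(0) += 1;
        }
    }
    stats
}
```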

Estimated Timeline: 4 weeks (1 week per sub-phase)

Success Criteria:

  • Developer can discover org patterns relevant to their code
  • Developer can see tier + adoption count for each pattern
  • Developer can manually promote claims with evidence
  • Developer can choose to align code with popular/authoritative patterns

What This Enables:

  • Organic convergence driven by inspection, not automation
  • Authority-aware decision-making (Tier 1 spec > Tier 3 code claim)
  • Popularity-aware decision-making (15 projects use X → maybe I should too)
  • Manual promotion path (Tier 3 → Tier 1 when backed by RFC)

Pilot 5: Operational Readiness

Goal: Complete production readiness for enterprise pilot demo.
Context: Pilot 1-4 complete (see archive).
Target: 4-6 weeks to ship-ready state

Enterprise Readiness: Deployment Stages

| Stage | Requirements | Timeline | Customer Profile |
|-------|--------------|----------|------------------|
| MVP Pilot | P5.1 Security + P5.2 Monitoring + P5.3 Backup | Ready | Friendly pilot, tolerates manual ops |
| Production | MVP + P5.4 Runbooks + P5.5 CLI | 4 weeks | First paying customer, self-hosted |
| Scale | Production + Phase 8B-C | 8-10 weeks | 5-10 customers, automated operations |
| Enterprise | Scale + Phase 9 | 6+ months | 50+ customers, SOC2/compliance required |

Critical Path to Ship (Must-Have)

WEEK 1 - Security (P0 Blockers):

  • TLS/HTTPS, request size limits, timeouts, secret sanitization, rate limiting

WEEK 2 - Monitoring (P0 Blind without these):

  • Storage metrics, replication metrics, Grafana dashboards, alert rules

WEEK 3 - Backup & DR (P0 Data loss risk):

  • Automated backup, backup verification, WAL archival, DR runbook, operational runbooks

WEEK 4 - Deployment (P1 Customer enablement):

  • CLI tooling, reference architecture, deployment guides, pilot validation

P5.1 Security Hardening (WEEK 1 - SHIP BLOCKERS)

Priority: P0 - Cannot ship without these
Status: 🎯 4/5 Complete (TLS, Limits, Timeouts, Rate Limiting done; Secret Sanitization pending)

  • TLS/HTTPS Configuration (Partial - 2024-02-11)

    • Add TLS 1.3 to stemedb-api (axum-server with rustls) - main.rs:114-123
    • Load from env vars: STEMEDB_TLS_CERT_PATH / STEMEDB_TLS_KEY_PATH
    • HTTP → HTTPS redirect (deferred - not critical for pilot)
    • Let's Encrypt integration for pilot deployments (deferred - manual cert setup OK)
    • Certificate rotation documentation (deferred)
    • Test with self-signed certs in CI (deferred - Layer 4 tests)
  • Request Size Limits (Complete - 2024-02-11)

    • Add RequestBodyLimitLayer to write endpoints (1MB default) - routers.rs:371
    • Add RequestBodyLimitLayer to read endpoints (64KB default) - routers.rs:400
    • Make limits configurable: STEMEDB_WRITE_BODY_LIMIT / STEMEDB_READ_BODY_LIMIT
    • Created SecurityConfig struct with defaults - routers.rs:35-56
    • Updated all 8 create_router_* functions to accept config
    • Documented in .env.example
    • Document limits in OpenAPI spec (deferred - not critical)
  • Timeout Configuration (Complete - 2024-02-11)

    • Add TimeoutLayer to HTTP routes (configurable, default 30s) - routers.rs:115,143,199,etc
    • Wrap all store.get()/put() with tokio::time::timeout(5s) - store_helpers.rs
    • Added timeout helpers: store_get_with_timeout() / store_put_with_timeout()
    • Updated 6+ handler locations (source.rs, health.rs, report.rs, source_registry/handlers.rs)
    • Add timeout metrics: stemedb_operation_timeouts_total{operation="store_get|store_put"}
    • Make HTTP timeout configurable: STEMEDB_HTTP_TIMEOUT_SECS
    • Added ApiError::Timeout variant with 408 REQUEST_TIMEOUT status - error.rs:76-80
  • Secret Sanitization (Deferred - not blocking for pilot)

    • Remove API key logging from api_key.rs:271 (log hash, not prefix)
    • Audit all debug!/info! for credential leaks
    • Add test: cargo test -- --nocapture | grep -E "key|secret|password" (grep should find no matches, i.e. exit non-zero)
    • Note: Existing code already logs hashes, audit needed to confirm no leaks
  • Rate Limiting (Complete - 2024-02-11)

    • Rate limit /v1/health to 1 req/sec per IP (prevent metrics flooding) - routers.rs:352
    • Make configurable: STEMEDB_HEALTH_RATE_LIMIT (default: 1)
    • Uses RateLimitState and rate_limit_middleware - middleware/rate_limit.rs
    • Metric already exists: stemedb_rate_limit_rejections_total{endpoint} - rate_limit.rs:87
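
The health-endpoint limit of 1 request per second per IP can be illustrated with a minimal fixed-interval limiter. This is a sketch and not the `rate_limit_middleware` implementation: `PerIpLimiter` is a hypothetical type, the fixed-interval policy is an assumption, and the current time is passed in explicitly so behavior can be tested deterministically.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Minimal per-IP limiter: at most one request per `min_interval` per address.
pub struct PerIpLimiter {
    min_interval: Duration,
    last_allowed: HashMap<IpAddr, Instant>,
}

impl PerIpLimiter {
    /// `limit` requests per second per IP (a limit of 0 is treated as 1).
    pub fn per_second(limit: u32) -> Self {
        PerIpLimiter {
            min_interval: Duration::from_secs(1) / limit.max(1),
            last_allowed: HashMap::new(),
        }
    }

    /// Returns true if a request from `ip` at time `now` is allowed.
    /// Rejected requests do not extend the waiting period.
    pub fn allow(&mut self, ip: IpAddr, now: Instant) -> bool {
        let too_soon = self
            .last_allowed
            .get(&ip)
            .map_or(false, |&last| now.saturating_duration_since(last) < self.min_interval);
        if too_soon {
            return false;
        }
        self.last_allowed.insert(ip, now);
        true
    }
}
```

A rejected call is where a counter such as `stemedb_rate_limit_rejections_total{endpoint}` would be incremented. A long-running server would also need to evict idle entries from the map, which this sketch omits.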

Implementation Notes:

  • All security features are now configurable via environment variables with sensible defaults
  • Build succeeds, all features tested manually
  • Integration tests stubbed in tests/security_hardening.rs (21 tests marked #[ignore])
  • Secret sanitization deferred as existing code appears safe (uses hashes), but full audit recommended

P5.2 Monitoring Foundation (WEEK 2 - CRITICAL) COMPLETE

Priority: P0 - Flying blind without these
Status: Complete (All layers implemented: WAL metrics, storage metrics, HTTP SLI, error tracking, Grafana dashboards, Prometheus alerts, runbooks, validation scripts)
Implementation: P5.2-IMPLEMENTATION-SUMMARY.md

  • Storage Health Metrics (Complete - 2024-02-11)

    • stemedb_wal_fsync_latency_seconds histogram (p50/p95/p99) - journal.rs:34
    • stemedb_wal_write_errors_total{error} counter - journal.rs:46
    • stemedb_wal_disk_usage_bytes gauge - segment.rs:248
    • stemedb_wal_segments_count gauge - segment.rs:249
    • stemedb_wal_bytes_written_total counter - journal.rs:45
    • stemedb_wal_writes_total counter - journal.rs:44
    • stemedb_wal_batch_size histogram - group_commit.rs:201
    • stemedb_wal_flush_latency_seconds histogram - group_commit.rs:243
    • stemedb_wal_recovery_attempts_total counter - journal.rs:234
    • stemedb_wal_recovery_duration_seconds histogram - journal.rs:269
    • stemedb_wal_rotations_total counter - journal.rs:304
  • Storage Operation Metrics (Complete - 2024-02-11)

    • stemedb_storage_operation_duration_seconds{operation,backend} histogram - hybrid_backend.rs:118,138,158,180
    • stemedb_storage_operations_total{operation,backend} counter - hybrid_backend.rs:123,143,163,185
    • stemedb_index_lookup_duration_seconds{index} histogram - index_store.rs:212,235
    • Metrics added to: get(), put(), delete(), scan_prefix(), index lookups
  • Error Tracking (Complete - 2024-02-11)

    • stemedb_errors_total{type,layer} counter - error.rs:99
    • Tracks 15 error types across 6 layers (validation, api, storage, pipeline, auth, protection)
    • Integrated into ApiError::IntoResponse for automatic tracking
  • HTTP SLI Metrics (Complete - 2024-02-12)

    • Pattern implemented in handlers/vote.rs as reference
    • stemedb_http_requests_total{method,path} counter
    • stemedb_http_request_duration_seconds{method,path,status} histogram
    • Rollout complete: 19 handlers instrumented (supersede, epoch, source, admin, escalation, gold_standard, quarantine, circuit_breaker, api_keys, audit, concepts)
    • Total coverage: 20 handlers across 11 files
  • Grafana Dashboards (Complete - 2024-02-11)

    • storage-health.json - WAL fsync latency, disk usage, error rates, storage operations, index timing
    • cluster-overview.json - Node status, replication lag, sync ops, Merkle diffs, gossip
    • sli-dashboard.json - Request rate, latency heatmap, error rate, availability gauge, circuit breakers
    • Import guide with troubleshooting: docs/operations/monitoring/grafana/README.md
  • Prometheus Alert Rules (Complete - 2024-02-11)

    • alerts/critical.yml - 8 alerts (API down, disk >90%, replication lag >5min, storage errors, fsync failure, split brain, memory exhaustion, cert expiring)
    • alerts/warning.yml - 10 alerts (slow fsync, high error rate, slow indexes, disk >70%, lag >1min, high latency, compaction backlog, circuit breaker, trust rank decay)
    • alerts/info.yml - 9 alerts (circuit breaker open, quarantine backlog, node join, memory >70%, key rotation, gold standard count, cert 30 days, WAL segments, low traffic)
    • All alerts include: runbook links, impact description, action steps, for duration, labels
  • Alerting Integration (Complete - 2024-02-11)

  • Additional Runbooks (Complete - 2024-02-12)

    • 8 critical/warning runbooks created in docs/operations/runbooks/
    • Coverage: high-replication-lag, storage-errors, wal-fsync-failure, split-brain, memory-exhaustion, certificate-renewal, slow-fsync, high-error-rate
    • Each includes: Severity, Symptom, Impact, Investigation, Resolution, Prevention, Escalation, References
  • Validation Scripts (Complete - 2024-02-12)

    • scripts/setup-pagerduty.sh - Service key validation, test incident creation, escalation policy check
    • scripts/setup-slack.sh - Webhook validation, test message posting, formatting verification
    • scripts/test-alerting.sh - End-to-end test (Alertmanager → PagerDuty + Slack), latency measurement

P5.3 Backup & Disaster Recovery (WEEK 3 - CRITICAL) COMPLETE

Priority: P0 - Data loss risk without these
Completed: 2026-02-12

  • Automated Backup

    • Systemd timer: runs every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
    • Systemd service: stemedb-backup.service with retry logic
    • Backup retention policy: --keep-last flag with 30-day default
    • S3 upload integration: --upload-s3 flag with STANDARD_IA storage
  • Backup Verification

    • verify-backup.sh - Validates magic bytes, CRC32C, BLAKE3 checksums
    • Weekly verification timer: Sunday 03:00 UTC
    • Metrics: stemedb_backup_verification_status, stemedb_backup_verification_checks_passed
    • Alert on verification failure: Prometheus alert rule
  • WAL Archival

    • archive-wal-to-s3.sh - Ships WAL segments to S3 every 15 minutes
    • S3 bucket: stemedb-backups-{env}/wal-archive/
    • Retention: 30 days in S3 STANDARD_IA
    • Metrics: stemedb_wal_archival_lag_seconds, stemedb_wal_archival_segments_uploaded_total
  • Disaster Recovery Runbook

    • docs/operations/runbooks/disaster-recovery.md - Complete DR procedures
    • RTO target: 4 hours (validated via drill script)
    • RPO target: 15 minutes (achievable with WAL archival)
    • 3 recovery scenarios: Full restore, Point-in-time, WAL-only
    • Validation checklist: 9 verification steps
  • DR Drill

    • scripts/dr-drill.sh - Automated drill with RTO/RPO measurement
    • Report generation: markdown format with timeline, metrics, issues
    • Integration tests: uat/production-readiness/backup-dr-tests.sh (7 tests)

Deliverables:

  • 6 systemd units: 3 timers + 3 services (backup, verify, archive-wal)
  • 4 scripts: backup, verify, archive-wal, dr-drill
  • Prometheus alerts: 9 alert rules in backup-alerts.yml
  • DR runbook: 3 recovery scenarios + validation checklist
  • Integration tests: 7 tests covering all P5.3 components

P5.4 Operational Runbooks (WEEK 3 - CRITICAL) COMPLETE

Priority: P1 - 2am incidents require these

  • Critical Runbooks (created in docs/operations/runbooks/)

    • server-wont-start.md - Port conflicts, TLS cert issues, disk full, WAL corruption
    • high-query-latency.md - Check replication lag, shard hotspots, index health
    • restore-from-backup.md - Step-by-step restore procedure with validation
    • add-node.md - Node join procedure, shard rebalancing, validation
    • disk-full.md - Emergency WAL cleanup, compaction trigger, quota increase
    • circuit-breaker-stuck.md - Reset circuit breaker, identify root cause
    • quarantine-overflow.md - Investigate quarantine queue, batch approve/reject
  • Troubleshooting Decision Tree

    • docs/operations/troubleshooting-flowchart.md - Complete with symptom → cause → runbook mapping
    • Covers all 7 runbooks with decision trees and quick diagnostic commands

P5.5 Cluster Management Tooling (WEEK 4 - HIGH PRIORITY) COMPLETE

Priority: P1 - Manual SSH not scalable
Completed: 2026-02-12

  • stemedb-admin CLI (new binary in crates/stemedb-admin/)
    • stemedb-admin cluster status - Overview: node count, shard count, meta version, node table
    • stemedb-admin cluster health - Quick health check (exit code 0/1)
    • stemedb-admin node list - List all nodes with states (Alive/Suspect/Dead)
    • stemedb-admin node <id> info - Detailed node info with shard assignments
    • stemedb-admin node <id> shards - Show shards assigned to node (with --leader filter)
    • stemedb-admin shard list - List all shards with leaders/replicas
    • stemedb-admin shard <id> info - Detailed shard info (size, assertions, replicas)
    • stemedb-admin shard <id> replicas - Show replica nodes for shard
    • stemedb-admin debug export --output <file> - Export complete cluster state as JSON
    • HTTP client connecting to gateway (default: http://localhost:18181)
    • Output formats: Table (human-friendly with colors) and JSON (machine-readable)
    • Environment variable support: STEMEDB_GATEWAY_ADDR
    • Proper error handling with helpful messages (no panics)
    • 12 integration tests covering all functionality
    • Node lifecycle documentation: docs/operations/node-lifecycle.md
    • Installation guide: docs/operations/deployment/install-admin-cli.md
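
The gateway address lookup and the `cluster health` exit code described above can be sketched as follows. This is not the `stemedb-admin` source: the precedence order (explicit flag, then `STEMEDB_GATEWAY_ADDR`, then the default) and the rule that any non-Alive node makes the check fail are assumptions made for illustration.

```rust
/// Default gateway address quoted in the CLI feature list.
const DEFAULT_GATEWAY: &str = "http://localhost:18181";

/// Resolve the gateway address. Assumed precedence: CLI flag, then the
/// value of STEMEDB_GATEWAY_ADDR, then the default. Blank values are skipped.
pub fn resolve_gateway(flag: Option<&str>, env_value: Option<&str>) -> String {
    flag.into_iter()
        .chain(env_value)
        .map(str::trim)
        .find(|s| !s.is_empty())
        .unwrap_or(DEFAULT_GATEWAY)
        .to_string()
}

/// Node states as listed for `stemedb-admin node list`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    Alive,
    Suspect,
    Dead,
}

/// Assumed health policy: exit code 0 only if at least one node exists
/// and every node is Alive; otherwise 1.
pub fn health_exit_code(nodes: &[NodeState]) -> i32 {
    let healthy = !nodes.is_empty() && nodes.iter().all(|s| *s == NodeState::Alive);
    if healthy { 0 } else { 1 }
}
```

In a real binary the second argument would typically come from `std::env::var("STEMEDB_GATEWAY_ADDR").ok()`; passing it in as a parameter keeps the function easy to test.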

Phase 2 Deferred:

  • stemedb-admin node drain <id> - Graceful node removal (requires gateway endpoints)

  • stemedb-admin shard rebalance - Manual rebalancing trigger (requires gateway endpoints)

  • Node Operations Documentation

    • docs/operations/node-lifecycle.md
    • Add node procedure (pre-flight checks, join, validation)
    • Remove node procedure (drain, graceful leave, verification)
    • Replace node procedure (dead node replacement, shard recovery)
  • Shard Management (optional for pilot, defer if time-constrained)

    • stemedb-admin shard rebalance - Manual rebalancing trigger
    • stemedb-admin shard freeze - Disable auto-split during maintenance
    • stemedb-admin shard move <shard-id> <target-node> - Manual migration

P5.6 Reference Architecture (WEEK 4) COMPLETE

Priority: P1 - Customer deployment guide

  • Deployment Guides (created in docs/operations/reference-architecture/)

    • single-node-pilot.md - Pilot deployment (1 node, docker-compose, hardware specs)
    • three-node-cluster.md - Small production (3 nodes, replication factor 2, HA)
    • network-requirements.md - Port list (181XX), firewall rules, TLS, DNS setup
  • Infrastructure as Code Examples (created in docs/operations/deployment/)

    • docker-compose/pilot-with-monitoring.yml - Single-node with Grafana + Prometheus
    • nginx/stemedb.conf - TLS 1.3, rate limiting, security headers, admin restrictions
    • envoy/stemedb.yaml - Load balancing, health checks, circuit breakers, retries
    • kubernetes/ - K8s manifests (StatefulSet, Service, Ingress) [DEFERRED - not needed for pilot]
    • terraform/ - AWS deployment (EC2, EBS, ALB, S3) [DEFERRED - not needed for pilot]
  • Resource Sizing Guide

    • docs/operations/reference-architecture/resource-sizing.md - Complete with CPU/RAM/disk formulas
    • Quick reference table: <10K, <50K, <100K, <500K, <1M assertions
    • AWS/GCP/Azure instance recommendations
    • Capacity planning metrics and monitoring dashboard
  • Reverse Proxy Configuration

    • nginx/stemedb.conf - TLS termination with Let's Encrypt, rate limiting, admin restrictions
    • envoy/stemedb.yaml - Advanced load balancing, circuit breakers, health checks
    • Let's Encrypt automation examples (certbot + cron)

P5.7 Pilot Success Validation (WEEK 4) COMPLETE

Priority: P1 - Definition of done

  • Performance Benchmarks - Documented in docs/operations/pilot-success-criteria.md

    • Sub-second query latency: p99 <1s at 10K assertions (test procedure included)
    • Ingest throughput: 1K assertions/sec sustained (5 min load test script)
    • Replication lag <1 second under normal load (cluster validation)
  • Functional Validation - Documented in docs/operations/pilot-success-criteria.md

    • Conflict detection: ConflictLens score >0.5 on contradictions (test procedure)
    • Audit trail export: 100 assertions with signatures/provenance (validation script)
    • Source retraction cascade: 110+ dependents (CARDIOVASC_MEGA_TRIAL example)
  • Operational Validation - Documented in docs/operations/pilot-success-criteria.md

    • Backup/restore roundtrip: 10K assertions → backup → restore → verify (procedure)
    • Node failure recovery: Kill node → continue → re-replicate <5min (3-node test)
    • Rolling restart: Restart one-by-one during load test → 100% success (procedure)
  • Demo Validation: 5 Amazement Moments - All documented with test procedures

    • Moment 1: Conflicting claims (FDA 0.2% vs Anecdotal 12%)
    • Moment 2: Source retraction cascade (110 assertions flagged)
    • Moment 3: Audit trail (provenance chain to source)
    • Moment 4: Time-travel (query 2023 vs 2025)
    • Moment 5: Lens-based resolution (3 lenses → 3 winners)
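
To make the "p99 <1s" benchmark above concrete, here is a minimal sketch of a nearest-rank percentile check over recorded query latencies. The function names are hypothetical, and the documented test procedure in `pilot-success-criteria.md` may compute percentiles differently (for example with interpolation or histograms).

```rust
/// Nearest-rank percentile of latency samples in milliseconds.
/// Returns None for empty input or a percentile outside 0..=100.
pub fn percentile(samples_ms: &[f64], p: f64) -> Option<f64> {
    if samples_ms.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
    let n = sorted.len();
    // Nearest rank: ceil(p * n / 100), clamped into 1..=n.
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    let index = rank.max(1).min(n) - 1;
    Some(sorted[index])
}

/// True when the 99th percentile latency is under one second.
pub fn meets_p99_target(samples_ms: &[f64]) -> bool {
    match percentile(samples_ms, 99.0) {
        Some(p99) => p99 < 1000.0,
        None => false,
    }
}
```

With nearest-rank, a run of 100 samples tolerates exactly one outlier above the threshold before the p99 target fails.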

Phase 8B-C: Production Scale & Observability

Prerequisite: Pilot 5 complete, 1-2 production customers running

Timeline: 4-6 weeks after Pilot 5

8B. Advanced Observability

  • 8B.1 Distributed Tracing

    • OpenTelemetry integration (Jaeger or Tempo backend)
    • Trace write path: Gateway → Shard Leader → Followers → WAL
    • Trace sync path: Merkle diff → Fetch missing → CRDT merge
    • Add trace IDs to all log lines (trace_id field)
  • 8B.2 Capacity Planning Metrics

    • disk_growth_rate_bytes_per_day (7-day linear regression)
    • disk_days_until_full (projected based on growth rate)
    • assertion_ingestion_rate (assertions/sec, 24h moving average)
    • Dashboard: Capacity trends with projected full date
  • 8B.3 Performance Profiling

    • Continuous profiling (pprof/flamegraph integration)
    • Per-shard query latency breakdown
    • Hot subject/predicate detection
    • Slow query log (queries >100ms)
  • 8B.4 Advanced Dashboards

    • query-performance.json - Latency by lens, hot subjects, cache hit rate
    • write-pipeline.json - Ingest rate, WAL throughput, sync lag
    • capacity-planning.json - Growth trends, disk projections, resource utilization
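
The two capacity metrics in 8B.2 (`disk_growth_rate_bytes_per_day` from a linear regression, and `disk_days_until_full` projected from it) can be sketched with an ordinary least-squares slope. The function names and the `(day, bytes_used)` sample shape are assumptions for illustration, not the planned implementation.

```rust
/// Least-squares slope of (day, bytes_used) samples, in bytes per day.
/// Returns None with fewer than two samples or when all days are equal.
pub fn growth_rate_bytes_per_day(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|s| s.0).sum::<f64>() / n;
    let mean_y = samples.iter().map(|s| s.1).sum::<f64>() / n;
    let sxy: f64 = samples.iter().map(|s| (s.0 - mean_x) * (s.1 - mean_y)).sum();
    let sxx: f64 = samples.iter().map(|s| (s.0 - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        None
    } else {
        Some(sxy / sxx)
    }
}

/// Projected days until the disk is full at the given growth rate.
/// Returns None when usage is flat or shrinking (no projected full date).
pub fn days_until_full(used_bytes: f64, capacity_bytes: f64, rate_per_day: f64) -> Option<f64> {
    if rate_per_day <= 0.0 {
        return None;
    }
    Some(((capacity_bytes - used_bytes) / rate_per_day).max(0.0))
}
```

For a 7-day window, `samples` would hold one point per day; a disk growing by 10 units per day with 40 units of headroom is projected to fill in 4 days.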

8C. Production Hardening

  • 8C.1 Point-in-Time Recovery (PITR)

    • WAL segment archival to S3 (every 15 min or 100 MB)
    • Recovery target parsing (--target lsn:123456, --target 2026-02-11T14:25:00)
    • WAL replay engine with checksum validation
    • Test: Inject corruption at known LSN, restore to LSN-1, verify consistency
  • 8C.2 Online Backup (Hot Backup)

    • Snapshot API: POST /v1/admin/snapshot (trigger checkpoint, freeze writes briefly)
    • Shadow copy: Copy data files while DB is running
    • Snapshot registry: Track active snapshots, prevent WAL truncation
    • Zero-downtime backup workflow
  • 8C.3 Storage Compaction

    • Automatic WAL segment cleanup (delete segments older than 7 days if checkpointed)
    • Tombstone removal (compact assertions with lifecycle=Superseded)
    • Background task: Run compaction every 6 hours
    • Metrics: wal_segments_deleted_total, compaction_bytes_reclaimed
  • 8C.4 Auto-Healing Improvements

    • Detect dead node → trigger re-replication → restore replication factor (automated)
    • Circuit breaker: Don't trigger shard split if memory >80%
    • Clock skew detection: Reject assertions with timestamps >1s in future
    • Partition detection: Log when SWIM sees cluster split
  • 8C.5 Rolling Upgrades

    • stemedb-admin upgrade --version v0.3.0 --batch-size 1
    • Pre-flight compatibility check (schema version, WAL format)
    • Drain node before upgrade (move shards to other nodes)
    • Zero-downtime upgrade workflow
  • 8C.6 Multi-Region (Active-Passive)

    • Secondary region with continuous WAL replication
    • Automated failover (DNS swap when primary unavailable >5 min)
    • Failover time target: <10 minutes
    • Cost estimate: ~$500/month for active-passive
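
The recovery target forms shown in 8C.1 (`lsn:123456` and `2026-02-11T14:25:00`) suggest a small parser along these lines. This is a sketch only: the `RecoveryTarget` type is hypothetical, and the timestamp branch checks the shape of the string without validating calendar ranges or time zones, which a real implementation would need to do.

```rust
/// Hypothetical parsed form of a `--target` argument.
#[derive(Debug, PartialEq, Eq)]
pub enum RecoveryTarget {
    /// Replay the WAL up to and including this log sequence number.
    Lsn(u64),
    /// Replay the WAL up to this timestamp (kept as text in this sketch).
    Timestamp(String),
}

/// Parse `lsn:<number>` or `YYYY-MM-DDTHH:MM:SS`.
pub fn parse_target(raw: &str) -> Result<RecoveryTarget, String> {
    let raw = raw.trim();
    if let Some(rest) = raw.strip_prefix("lsn:") {
        return rest
            .parse::<u64>()
            .map(RecoveryTarget::Lsn)
            .map_err(|e| format!("invalid LSN '{}': {}", rest, e));
    }
    if looks_like_timestamp(raw) {
        Ok(RecoveryTarget::Timestamp(raw.to_string()))
    } else {
        Err(format!("unrecognized recovery target '{}'", raw))
    }
}

/// Shape check only: digits with '-', 'T' and ':' in the expected positions.
fn looks_like_timestamp(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 19
        && bytes.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            10 => *c == b'T',
            13 | 16 => *c == b':',
            _ => c.is_ascii_digit(),
        })
}
```

For the corruption test in the same item, the restore target would be the known-bad LSN minus one, for example `lsn:123455` when corruption was injected at LSN 123456.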

Phase 9: Enterprise Scale & Compliance

Goal: Enterprise-grade durability, compliance, and incident response

Prerequisite: 5-10 production customers, predictable failure patterns

9A. Advanced Backup & Recovery

  • 9A.1 Incremental Backup

    • Back up only data changed since the last backup (rsync --link-dest pattern, which hard-links unchanged files)
    • Backup time: Minutes instead of hours for 1TB database
    • Storage savings: 90% reduction for daily incrementals
  • 9A.2 Cross-Region Backup Replication

    • Replicate backups to S3 in different region (S3 cross-region replication)
    • Storage tiers: Hot (7 days Standard), Warm (7-30 days Intelligent-Tiering), Cold (30+ days Glacier IR)
    • Cost estimate: ~$210/month for 11TB (7 daily + 4 weekly backups)
  • 9A.3 Backup Encryption

    • Encrypt backups at rest (AWS KMS or customer-managed keys)
    • Encrypt backups in transit (TLS for S3 uploads)
    • Key rotation policy (90-day rotation)
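
The storage tiers in 9A.2 and the 90-day key rotation in 9A.3 reduce to simple age thresholds. The sketch below assumes the boundaries are half-open (a backup exactly 7 days old is already Warm, one exactly 30 days old is already Cold); the roadmap does not say which side of each boundary is inclusive, and the names are illustrative.

```rust
/// Storage tier for a backup, following the Hot/Warm/Cold split in 9A.2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackupTier {
    /// S3 Standard.
    Hot,
    /// S3 Intelligent-Tiering.
    Warm,
    /// S3 Glacier Instant Retrieval.
    Cold,
}

/// Map backup age in whole days to a tier (boundary handling is assumed).
pub fn tier_for_age(age_days: u32) -> BackupTier {
    match age_days {
        0..=6 => BackupTier::Hot,
        7..=29 => BackupTier::Warm,
        _ => BackupTier::Cold,
    }
}

/// True when an encryption key has reached the 90-day rotation age.
pub fn key_rotation_due(key_age_days: u32) -> bool {
    key_age_days >= 90
}
```

In practice the tier transitions would more likely be expressed as an S3 lifecycle configuration than as application code; the function is useful mainly for reporting and cost estimates.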

9B. Data Corruption & Recovery

  • 9B.1 Deep Corruption Detection

    • Validate Merkle tree checksums before accepting gossip
    • Periodic background validation (full DB checksum every 24h)
    • Metric: corruption_detected_total{source=gossip|disk}
  • 9B.2 Assertion Tombstones (Soft Delete)

    • New lifecycle stage: Deleted (append-only, not physically removed)
    • Tombstone propagation via gossip (all nodes learn of deletion)
    • Query filtering: Lenses ignore Deleted assertions by default
  • 9B.3 Cluster Rollback

    • stemedb-admin rollback --before 2026-02-11T14:00:00
    • Batch tombstone generation for all assertions after timestamp
    • Use case: Bulk data corruption, need to revert cluster to known-good state
  • 9B.4 Split-Brain Recovery

    • Automatic detection: Merkle tree divergence >10% after partition heals
    • Manual resolution: stemedb-admin resolve-split --prefer-node node-1
    • CRDT merge with conflict log (record which assertions were merged/discarded)
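
The ">10% divergence" trigger in 9B.4 can be illustrated with a leaf-level comparison of two nodes' hashes. This is a deliberate simplification: a real Merkle comparison would walk the tree from the root and skip matching subtrees, and how `stemedb-merkle` exposes its leaves is not specified here. The 32-byte arrays stand in for BLAKE3 digests.

```rust
/// Fraction of leaf positions whose hashes differ between two nodes.
/// Leaves present on only one side count as differing.
pub fn divergence_ratio(a: &[[u8; 32]], b: &[[u8; 32]]) -> f64 {
    let total = a.len().max(b.len());
    if total == 0 {
        return 0.0;
    }
    let shared = a.len().min(b.len());
    let mismatched = a.iter().zip(b.iter()).filter(|(x, y)| x != y).count();
    let differing = mismatched + (total - shared);
    differing as f64 / total as f64
}

/// True when divergence exceeds the 10% threshold quoted in 9B.4.
pub fn needs_manual_resolution(ratio: f64) -> bool {
    ratio > 0.10
}
```

Below the threshold, the normal CRDT merge would proceed; above it, the operator would be directed to `stemedb-admin resolve-split`.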

9C. Compliance

  • 9C.1 GDPR Right to Erasure

    • Cryptographic erasure: Each agent has unique encryption key
    • Delete key → data unrecoverable (even though assertions remain on disk)
    • Compliance proof: "Key deleted on YYYY-MM-DD, data cryptographically erased"
  • 9C.2 Data Retention Policies

    • Per-subject TTL: retention_policy{subject="medical/*"}=7years
    • Per-predicate TTL: retention_policy{predicate="temp_session"}=1day
    • Background task: Tombstone assertions past TTL
  • 9C.3 Immutable Audit Trail

    • All admin actions logged to append-only audit store
    • Include: Who, what, when, why (justification field required)
    • Export API: GET /v1/admin/audit?from=DATE&to=DATE
    • Compliance report generator (CSV/PDF for auditors)
  • 9C.4 SOC 2 Type II Certification

    • Security controls implementation (access control, encryption, monitoring)
    • 6-month observation period (demonstrate controls work consistently)
    • External auditor engagement (Big 4 accounting firm)
    • Annual recertification
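
The per-subject TTL in 9C.2 (for example `medical/*` retained for 7 years) implies a pattern lookup followed by an age comparison. The sketch below assumes that a trailing `*` means "any subject with this prefix" and that the first matching rule wins; neither rule is stated in the roadmap, and the type and function names are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical retention rule: a subject pattern and its time to live.
pub struct RetentionRule {
    pub subject_pattern: String,
    pub ttl: Duration,
}

/// Assumed pattern semantics: a trailing '*' matches any suffix,
/// otherwise the pattern must equal the subject exactly.
pub fn pattern_matches(pattern: &str, subject: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => subject.starts_with(prefix),
        None => pattern == subject,
    }
}

/// True when the first matching rule says the assertion is past its TTL.
/// Subjects with no matching rule are kept indefinitely.
pub fn is_expired(rules: &[RetentionRule], subject: &str, age: Duration) -> bool {
    rules
        .iter()
        .find(|rule| pattern_matches(&rule.subject_pattern, subject))
        .map(|rule| age > rule.ttl)
        .unwrap_or(false)
}
```

The background task in 9C.2 would then tombstone every assertion for which `is_expired` returns true; per-predicate rules could reuse the same matching function.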

9D. Storage Management

  • 9D.1 Advanced Compaction

    • Multi-generation compaction: Merge small segments into larger ones
    • Compaction budget: Limit I/O impact (max 10% of disk bandwidth)
    • Metrics: compaction_progress{generation}, compaction_bytes_read/written
  • 9D.2 Tiered Storage

    • Hot tier: NVMe SSD (last 7 days, accessed frequently)
    • Warm tier: SATA SSD (7-90 days, accessed occasionally)
    • Cold tier: S3 Glacier (90+ days, accessed rarely)
    • Automatic migration based on access patterns
  • 9D.3 Storage Quotas

    • Per-agent quotas: quota{agent="user123"}=10GB
    • Cluster-wide quota: Hard limit on total DB size
    • Soft quota warning at 80% (alert ops team)
    • Hard quota rejection at 100% (reject new assertions)
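
The soft and hard quota thresholds in 9D.3 can be sketched as a single admission check. The names are hypothetical, and one detail is an assumption: the write is rejected only when it would push usage above the quota, so a write that lands exactly on 100% is still accepted with a warning.

```rust
/// Outcome of a quota check for an incoming write.
#[derive(Debug, PartialEq, Eq)]
pub enum QuotaDecision {
    /// Below the soft threshold.
    Accept,
    /// At or above 80% of quota: accept, but alert the ops team.
    AcceptWithWarning,
    /// Would exceed the quota: reject the new assertion.
    Reject,
}

/// Decide whether a write of `incoming_bytes` fits within `quota_bytes`.
pub fn check_quota(used_bytes: u64, incoming_bytes: u64, quota_bytes: u64) -> QuotaDecision {
    let projected = used_bytes.saturating_add(incoming_bytes);
    if projected > quota_bytes {
        QuotaDecision::Reject
    } else if (projected as u128) * 100 >= (quota_bytes as u128) * 80 {
        // Integer comparison avoids floating-point rounding at the 80% line.
        QuotaDecision::AcceptWithWarning
    } else {
        QuotaDecision::Accept
    }
}
```

The same function works for per-agent and cluster-wide quotas; only the `used_bytes` and `quota_bytes` inputs change.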

9E. Incident Response

  • 9E.1 Alerting & Escalation

    • PagerDuty integration (API key in config)
    • Slack integration (webhook URL, #stemedb-alerts channel)
    • Escalation policy: Warn → Page primary → Page backup → Page manager
    • Alert grouping: Batch related alerts (don't page 100 times for same issue)
  • 9E.2 Incident Management

    • Incident response playbook (docs/operations/incident-response.md)
    • Severity levels: P0 (total outage), P1 (degraded), P2 (warning)
    • Communication templates (customer email, status page update)
    • Post-mortem template (5 Whys, timeline, action items)
  • 9E.3 Chaos Engineering

    • Monthly "game day" exercises
    • Scenarios: Node failure, network partition, disk full, slow disk
    • Use stemedb-chaos crate to inject failures
    • Document learnings, update runbooks
  • 9E.4 On-Call Rotation

    • Define on-call schedule (primary, backup, manager escalation)
    • On-call playbook (what to do when paged, who to call, escalation path)
    • On-call compensation policy
    • Post-incident review process

9F. Security Hardening

  • 9F.1 mTLS for Cluster Communication

    • Require client certificates for all node-to-node RPC
    • Certificate authority: Internal CA or Let's Encrypt
    • Certificate rotation: 90-day validity, automated renewal
    • Reject connections without valid cert (prevent rogue nodes)
  • 9F.2 Encryption at Rest

    • WAL encryption: AES-256-GCM per segment
    • KV store encryption: Transparent encryption layer (redb feature or OS-level LUKS)
    • Key management: AWS KMS, HashiCorp Vault, or customer-managed keys
    • Compliance: Meets HIPAA/GDPR encryption requirements
  • 9F.3 Node Authentication

    • Each node has Ed25519 keypair (identity)
    • Signed cluster join: Node signs join request with private key
    • Admin API: Approve/reject join requests (stemedb-admin node approve <node-id>)
    • Prevent unauthorized nodes from joining cluster
  • 9F.4 API Security

    • Rate limiting per API key (100 req/min for free tier, 10K req/min for enterprise)
    • Input validation: UTF-8, max lengths, regex injection protection
    • SQL injection prevention: Parameterized queries only (no string concatenation)
    • XSS prevention: Escape all user-provided content in dashboard
  • 9F.5 Secrets Management

    • Never store secrets in code or config files
    • Use environment variables or secret management service (Vault, AWS Secrets Manager)
    • Secret rotation policy (API keys rotated every 90 days)
    • Audit log: Track secret access (who accessed what secret when)
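
The per-key rate limits in 9F.4 (100 req/min for free tier, 10K req/min for enterprise) could be enforced in several ways. The sketch below uses a fixed one-minute window per API key, chosen for brevity rather than because the roadmap prescribes it; the type name is hypothetical and the clock is passed in as seconds so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;

/// Fixed-window limiter: at most `limit_per_min` requests per key
/// within each calendar minute of the supplied clock.
pub struct RateLimiter {
    limit_per_min: u32,
    /// Per key: (window index, requests seen in that window).
    windows: HashMap<String, (u64, u32)>,
}

impl RateLimiter {
    pub fn new(limit_per_min: u32) -> Self {
        RateLimiter {
            limit_per_min,
            windows: HashMap::new(),
        }
    }

    /// Record a request at `now_secs` and report whether it is allowed.
    pub fn allow(&mut self, api_key: &str, now_secs: u64) -> bool {
        let window = now_secs / 60;
        let entry = self
            .windows
            .entry(api_key.to_string())
            .or_insert((window, 0));
        if entry.0 != window {
            // A new minute has started: reset the counter for this key.
            *entry = (window, 0);
        }
        if entry.1 < self.limit_per_min {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

A fixed window permits a burst of up to twice the limit across a window boundary; a token bucket or sliding window would smooth that out at the cost of slightly more state.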

9G. Operational Maturity

  • 9G.1 SLI/SLO Definitions

    • Availability SLO: 99.95% uptime (21.9 min/month downtime budget)
    • Latency SLO: p95 query latency <100ms, p99 <500ms
    • Error rate SLO: <0.1% of requests fail
    • Dashboard: SLO compliance tracking, error budget remaining
  • 9G.2 Capacity Planning

    • Quarterly capacity review (growth trends, resource utilization)
    • 6-month forecast (projected assertion count, disk usage, API load)
    • Auto-scaling triggers (add nodes when CPU >70% for 10 min)
    • Budget planning: Cloud costs per customer, per assertion
  • 9G.3 Performance Testing

    • Load testing: Sustained 10K assertions/sec for 1 hour
    • Stress testing: Ramp to failure (find breaking point)
    • Chaos testing: Inject failures during load test
    • Regression testing: Compare performance across releases
  • 9G.4 Documentation

    • Operator guide (docs/operations/operator-guide.md)
    • Troubleshooting guide (symptom → diagnosis → fix)
    • Architecture deep-dive (how it works, design decisions)
    • API reference (auto-generated from OpenAPI spec)
    • SDK usage guides (Go, Python, TypeScript)
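
The downtime budget quoted in 9G.1 follows directly from the availability SLO: budget = (1 - SLO) x period. The sketch below reproduces the 21.9 min/month figure under the assumption that "month" means an average month of 365.25 / 12 days; with a flat 30-day month the same SLO gives 21.6 minutes. The function names are illustrative.

```rust
/// Minutes in an average month (365.25 days / 12), which is 43,830.
pub const AVG_MINUTES_PER_MONTH: f64 = 365.25 * 24.0 * 60.0 / 12.0;

/// Allowed downtime in minutes for an availability SLO over a period.
/// `slo` is a fraction, for example 0.9995 for 99.95%.
pub fn downtime_budget_minutes(slo: f64, period_minutes: f64) -> f64 {
    (1.0 - slo) * period_minutes
}

/// Error budget left after `downtime_minutes` of outage; never negative.
pub fn budget_remaining_minutes(slo: f64, period_minutes: f64, downtime_minutes: f64) -> f64 {
    (downtime_budget_minutes(slo, period_minutes) - downtime_minutes).max(0.0)
}
```

For example, `downtime_budget_minutes(0.9995, AVG_MINUTES_PER_MONTH)` is about 21.9, so a single 10-minute incident leaves roughly 11.9 minutes of budget for the rest of the month.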

Architecture Overview

Write Path (Spine):           Read Path (Cortex):
[Agent] -> [Ingestion]        [Agent] <- [Lens Engine]
              |                              |
              v                              |
         [WAL/Fsync]                  [Index Lookup]
              |                              |
              v                              |
         [KV Store] <------------------------+

Port Scheme (181XX)

| Offset | Service | Default | Env Var |
|--------|---------|---------|---------|
| +0 | HTTP API | 18180 | STEMEDB_BIND_ADDR |
| +1 | Cluster Gateway | 18181 | STEMEDB_NODE_API_ADDR |
| +2 | Cluster RPC | 18182 | STEMEDB_NODE_RPC_ADDR |
| +3 | SWIM Gossip | 18183 | via SwimConfig |
| +4 | Metrics | 18184 | (reserved) |
| +5 | Admin | 18185 | (reserved) |
| +6 | Latent Signal | 18186 | |
| +7 | Community App | 18187 | |
| +8 | Admin Dashboard | 18188 | |
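
The offset column above implies that each service port is a base port plus a fixed offset. A minimal sketch of that mapping follows; the `Service` enum and the idea of a configurable base are assumptions for illustration, while the default base of 18180 and the offsets come from the port list.

```rust
/// Default base port for the 181XX scheme.
pub const DEFAULT_BASE_PORT: u16 = 18180;

/// Services in the port scheme; the discriminant is the offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Service {
    HttpApi = 0,
    ClusterGateway = 1,
    ClusterRpc = 2,
    SwimGossip = 3,
    Metrics = 4,
    Admin = 5,
    LatentSignal = 6,
    CommunityApp = 7,
    AdminDashboard = 8,
}

/// Port for a service given a base port; None if it would overflow u16.
pub fn port_for(base: u16, service: Service) -> Option<u16> {
    base.checked_add(service as u16)
}
```

With the default base, `port_for(DEFAULT_BASE_PORT, Service::ClusterGateway)` yields 18181, matching the default address used by `stemedb-admin`.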

Crates

| Crate | Purpose | Status |
|-------|---------|--------|
| stemedb-core | Assertion, LifecycleStage, MaterializedView, types, signing | |
| stemedb-wal | Write-ahead log with crash recovery | |
| stemedb-storage | KVStore, VoteStore, IndexStore, TrustRankStore, QuarantineStore | |
| stemedb-ingest | Ingestion pipeline, signature verification, ContentDefenseLayer | |
| stemedb-query | Query engine, Materializer for O(1) MV reads | |
| stemedb-lens | Lenses (Recency, Consensus, Authority, Skeptic, Layered, etc.) | |
| stemedb-api | HTTP API with axum + utoipa OpenAPI docs | |
| stemedb-sim | Simulation for testing the pipeline | |
| stemedb-merkle | BLAKE3 Merkle tree for diff detection | |
| stemedb-rpc | gRPC services for node-to-node communication | |
| stemedb-sync | Merkle sync, gossip broadcast, anti-entropy | |
| stemedb-cluster | Cluster membership (SWIM), sharding, gateway | |
| stemedb-ontology | Domain definitions (Pharma), subject builders, medical extractors | |
| stemedb-chaos | Chaos testing infrastructure | |
| stemedb-dashboard | Admin dashboard (React/Next.js) (7 panels) | |

Applications

| App | Purpose | Status |
|-----|---------|--------|
| aphoria | Code-level truth linter: 42 extractors, claims, verify, coverage | 🎯 A5 flywheel |
| disputed | Controversy explorer | Planned |

SDKs

| SDK | Purpose | Status |
|-----|---------|--------|
| sdk/go/steme | Go HTTP client with Ed25519 signing and fluent builders | |
| sdk/go/adk | ADK-Go tools and callbacks for AI agents | |

Quick Reference

# Build
cargo build --workspace

# Test
cargo test --workspace

# Lint (must pass before commit)
cargo clippy --workspace -- -D warnings
cargo fmt --check

# Run API server
cargo run --bin stemedb-api

# Run Aphoria scan
cargo run --bin aphoria -- scan /path/to/project --show-observations

# Run demo script
./scripts/demo-consumer-health.sh

Arena: Simulation Roadmap

Goal: Incrementally evolve the simulator from Spine validation to a full Agent-Based Modeling environment.

Philosophy: Make it run. Then add. Verify at every step.

Alignment: Tracks main roadmap phases; exercises features as they land.

Current State

The simulator (stemedb-sim) validates the system through Arenas 0-4:

Completed Arenas:

  • Arena 0: Test infrastructure with assertions and CI integration
  • Arena 1: Query path via QueryEngine, Recency lens, lifecycle filtering, query audit
  • Arena 2: Voting & VoteAwareConsensus, troll resistance
  • Arena 2.5: Hardening (race conditions, API tests, crash recovery, input validation)
  • Arena 3: Materialized Views, fast-path verification, MV freshness
  • Arena 4: Agent personas (Scientist, Troll, Believer with differentiated strategies)

What's Tested:

  • WAL durability, rkyv serialization, Ed25519 signatures
  • Ingestor pipeline (WAL → KV async flow)
  • QueryEngine with multiple lenses
  • Lifecycle filtering, voting, consensus
  • Query audit trail, materialized views
  • Strategy-driven agent behaviors

What's Not Yet Tested:

  • TrustRank (Arena 5)
  • Concurrent agents at scale (Arena 6)
  • Time-travel queries (Arena 7)
  • Skeptic lens & conflict scores (Arena 8)

Upcoming Arena Phases

Arena 5: TrustRank Integration (Next)

  • Initialize TrustRank for agents
  • Reputation adjustment after votes
  • TrustAwareAuthorityLens verification
  • Troll reputation decay over time

Arena 6: Concurrent Agents

  • Tokio task per agent
  • Scale to 100 agents, then 1000
  • Contention metrics and bottleneck identification

Arena 7: Time-Travel & Epochs

  • Time-travel query verification
  • Epoch creation and supersession
  • Epoch cascade validation

Arena 8: Skeptic & Conflict

  • High/low conflict scenarios
  • Skeptic lens surfacing outliers
  • Conflict score accuracy

Arena 9: Full Gameplay Loop

  • Ground truth injection
  • Complete 5-tick scenario
  • Extended 1000-tick run
  • Emergence validation

Alignment with Use Cases

| Use Case | Arena Phase |
|----------|-------------|
| Agile Agent Team | |
| Lifecycle filtering | Arena 1.3 |
| Query audit trail | Arena 1.4 |
| Time-travel debugging | Arena 7.1 |
| Expert weighting | Arena 5.3 |
| Financial Due Diligence | |
| Conflict detection | Arena 8.1, 8.3 |
| Epoch cascades | Arena 7.2, 7.3 |

Run command: cargo run --bin stemedb-sim

Test suite: cargo test -p stemedb-sim