jml fae9b47fae feat(aphoria): implement hosted mode with remote StemeDB integration

Add remote mode infrastructure for querying claims from StemeDB API:
- Remote client with caching layer for claim queries
- Authority resolution logic with tier-based verdict system
- StemeDB API handlers for claims CRUD operations
- Enhanced conflict detection with remote claim support
- Validation reports documenting A5.3 phase completion

Changes:
- applications/aphoria/src/remote/: New client + cache modules
- applications/aphoria/src/resolution/: Authority tier resolution
- crates/stemedb-api/src/handlers/stemedb_claims.rs: API handlers
- applications/aphoria/validation/a5.3/: Phase validation reports
- Updated roadmap with hosted mode milestones

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-14 09:29:56 +00:00

Episteme (StemeDB) Roadmap

Goal: Build the "Git for Truth" substrate for autonomous AI research.
Current Focus: Gap Closure Phase 3, remote hosted mode (Aphoria → remote StemeDB)
Target Vertical: BioTech/Pharma ("The Living Review") + Code Truth (Aphoria)
Endgame: Distributed multi-writer cluster for millions of concurrent agents

Infrastructure Status: Phases 1-7 complete | Phase 8A (Chaos) complete | Pilot 1-5 complete
Aphoria Status: A1-A5 + Phase 2 complete | Tier-aware resolution ✅ | Next: Remote mode
Security Status: P5.1 4/5 done (TLS, limits, timeouts, rate limiting) | P5.2 ✅ complete

Archive: For completed phases 1-8A + Pilot 1-3, see roadmap-archive.md

Current Status

Phase	Status	Summary
1-7, 8A	✅ Complete	Core infra, cluster, trust, chaos testing
MVP, Pilot 1-4	✅ Complete	Consumer Health demo, dashboard, API auth, metrics
Aphoria A1-A4	✅ Complete	Observations/claims/verify/corpus/authority lens
Aphoria A5	✅ Complete	Flywheel validated: 93.5% acceptance, 100% config recall
Gap Closure Phase 2	✅ Complete	Tier-aware authority resolution, `--explain-authority` flag
Gap Closure Phase 3	🎯 Current	Remote hosted mode: Aphoria → remote StemeDB via HTTP API
Gap Closure Phase 4	Planned	Claim discovery, manual convergence, promotion workflows
Pilot 5	✅ Complete	All 7 phases complete: Security (4/5), Monitoring, Backup/DR, Runbooks, Cluster Mgmt, Reference Architecture, Pilot Success Criteria
8B-C	Planned	Distributed observability, geo-distribution
9	Planned	Disaster recovery, compliance, storage management

🎯 Aphoria: From Scanner to Knowledge Graph Client (CURRENT)

Goal: Transform Aphoria from "grep with Episteme vocabulary" into a real knowledge graph client that authors, stores, and audits claims with provenance and lineage.
Vision Document: applications/aphoria/docs/vision-gaps.md
Validation: Maxwell scan (67 observations, 0 noise) + hand-written claims-explained.md

Completed Phases (A1-A4 + P4 — see roadmap-archive.md for details)

Phase	What It Delivered
A1	`Observation` vs `AuthoredClaim` types, bridge tier mapping, `.aphoria/claims.toml` format
A2	`aphoria claims create/list/explain/update/supersede/deprecate`, `aphoria-claims` skill
A3	`verify.rs` engine (Pass/Conflict/Missing/Unclaimed), `aphoria verify run/map`, pre-commit hook, self-audit
A4	RFC/OWASP as Episteme assertions, `AphoriaAuthorityLens`, Trust Pack export/install
P4	API auth (3 roles), backup/restore scripts, Prometheus metrics + Grafana dashboard

Phase A5: The Flywheel

Goal: The system gets smarter with use. Each claim makes the next claim easier.
Details: vision-gaps.md §5 (claims-explained.md as the product)
Research: a5-flywheel-skill-design.md (validates the "skill calls CLI" hypothesis)
Key Insight: LLM reasoning over CLI JSON output replaces ML training. The flywheel is prompt engineering, not machine learning.

A5.1 Claim Coverage Metrics: Per-module claim density and gap reporting
- coverage.rs: CoverageReport, ModuleCoverage, CoverageSummary types
- compute_coverage() uses verify_claims() as source of truth for claim-observation matching
- Per-module: observation count, claim count, claimed/unclaimed, missing claims, density
- aphoria coverage CLI: table, JSON, markdown formats, --sort-by (name/density/unclaimed/observations)
- Coverage gaps section: modules with observations but no claims
- 8 unit tests including deprecated claim exclusion
A5.2 Auto-Generated Documentation: aphoria docs generate + aphoria claims explain
- aphoria docs generate CLI command with --output and --format (markdown/json)
- claims_explain.rs: groups by category, includes provenance/invariant/consequence/evidence per claim
- explain.rs: reads .aphoria/claims.toml, renders via render_claims_markdown()
- Provenance chains preserved (supersedes references)
A5.3 Claim Suggester Skill: LLM-powered pattern recognition via "skill calls CLI"
- New skill: .claude/skills/aphoria-suggest/SKILL.md (3 modes: cold start / foundation / flywheel)
- Workflow defined: claims list → verify run --show-unclaimed → reason by analogy → suggest
- Few-shot learning: existing claims as gold-standard examples for style matching
- Chain-of-thought: reasoning template before each suggestion
- Cold start bootstrap: reads README/CLAUDE.md/tests/ADRs when 0 claims
- Context tiers: local → semantic → summary → global (subagent)
- Quality gates: non-trivial, not type-enforced, has consequence, not duplicate
- VG-022 CLOSED: verifiable_predicates() on Extractor trait; 10 extractors declare predicates; verify map shows extractor→claim coverage
- Dogfood claims: 10 total claims in .aphoria/claims.toml (3 arch + 7 security) covering all ComparisonModes
- Validate: Run skill against Aphoria's own codebase (dogfood) - 87.5% acceptance rate (7/8)
- Validate: Run skill against an external project (cold start test) - 72.7% alignment (16/22), 100% config recall
- Iterate: Refine prompt based on suggestion quality from validation - 3 improvements identified (domain-awareness, impl depth, tuning scan)
A5.4 Onboarding Mode: aphoria explain for new team members
- explain.rs: generate_explanation() reads claims, renders narrative
- aphoria explain CLI with --output and --format (markdown/json)
- Shows claim inventory grouped by category with provenance
- Empty project handling: directs to aphoria claims create

Gap Closure Phase 2: Tier-Aware Authority Resolution

Goal: Make authority tiers actionable in conflict resolution. Higher-tier sources (lower tier numbers) win.
Why Now: A5.3 validation proved the flywheel works (93.5% acceptance). Next blocker: tier resolution.
User Story: "Why is Aphoria blocking me for this? Is it really important?" → Show the tier, not just a binary BLOCK/FLAG.

Phase 2.1 Tier-Aware Types: Foundation for tier-scoped verdicts
- TierAwareVerdict enum with 3 variants: SingleTier, MultiTier, HigherTierAgreement
- ConflictResult extended with tier_verdict and primary_tier fields
- resolution/ module: tier_verdict.rs, authority.rs, mod.rs
- 10 unit tests for tier resolution logic
Phase 2.2 Conflict Logic Updates: Always compute tier breakdown
- conflict.rs: Moved tier breakdown out of debug-only block (always populated)
- compute_tier_aware_verdict(): Computes per-tier verdicts with primary tier selection
- compute_tier_breakdown(): Groups conflicts by tier (0-5)
- Primary tier = lowest tier number (highest authority)
Phase 2.3 Display Formatting: Show tier names in CLI output
- ConflictResult::Display shows tier-aware verdict when available
- Tier names formatted: "Tier 1 (Clinical/RFC)", "Tier 3 (Expert)", etc.
- Backward compatible: legacy output when tier_verdict is None
Phase 2.4 CLI Flag: --explain-authority for detailed tier breakdown
- ScanArgs extended with explain_authority field
- cli/mod.rs: Added --explain-authority flag to aphoria scan
- handlers/scan.rs: Pass flag through to scan execution
- All 28 ScanArgs constructions updated (tests + handlers)
Phase 2.5 Tests & Quality: 1300 tests pass, zero clippy warnings
- All resolution module tests pass (10/10)
- All aphoria tests pass (1300/1300)
- Clippy passes with zero warnings
- Test files updated with new fields (tier_verdict, primary_tier, explain_authority)

What Changed:

Conflicts now show tier information: ❌ BLOCK Tier 1 (Clinical/RFC) instead of just ❌ BLOCK
Primary tier (highest authority) is computed and stored: Tier 1 beats Tier 3
--explain-authority flag shows per-tier breakdown (which tiers have conflicting sources, at what confidence)
Backward compatible: existing code without tier_verdict continues to work
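
The tier rules above (group conflicts by tier, pick the lowest tier number as primary, render a tier name) can be sketched in a few lines. This is a minimal illustration, not the code in resolution/: the `TierConflict` struct and the three function names are hypothetical, and only the tier labels that appear in this roadmap are filled in.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one conflicting source (not the real Aphoria type).
#[derive(Debug, Clone, PartialEq)]
pub struct TierConflict {
    pub tier: u8, // 0-5; a lower number means higher authority
    pub source: String,
    pub confidence: f64,
}

/// Group conflicts by tier. A BTreeMap keeps keys in ascending order,
/// so iteration starts at the highest-authority tier.
pub fn tier_breakdown(conflicts: &[TierConflict]) -> BTreeMap<u8, Vec<&TierConflict>> {
    let mut by_tier: BTreeMap<u8, Vec<&TierConflict>> = BTreeMap::new();
    for conflict in conflicts {
        by_tier.entry(conflict.tier).or_default().push(conflict);
    }
    by_tier
}

/// Primary tier = lowest tier number present, or None without conflicts.
pub fn primary_tier(conflicts: &[TierConflict]) -> Option<u8> {
    conflicts.iter().map(|c| c.tier).min()
}

/// Display label. Only tiers 1, 3 and 4 are named in this roadmap;
/// every other tier falls back to its number (an assumption).
pub fn tier_label(tier: u8) -> String {
    match tier {
        1 => "Tier 1 (Clinical/RFC)".to_string(),
        3 => "Tier 3 (Expert)".to_string(),
        4 => "Tier 4 (Community)".to_string(),
        n => format!("Tier {n}"),
    }
}
```

With one Tier 1 source and two Tier 3 sources in conflict, `primary_tier` returns `Some(1)`, which is what makes the verdict read as a Tier 1 block rather than a plain BLOCK.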

What's Next (Phases 3-4):

Remote mode (Phase 3): aphoria init --remote <url> connects to org StemeDB instance
Claim discovery (Phase 4): Query remote claims to see org patterns (specs, popular conventions)
Manual convergence (Phase 4): Developer inspects claims, decides whether to align code
Manual promotion (Phase 4): Developer upgrades claim tier when backed by higher-tier evidence

Files Changed:

New: applications/aphoria/src/resolution/{mod.rs,tier_verdict.rs,authority.rs} (3 files, ~450 LOC)
Modified: conflict.rs, result.rs, command.rs, cli/mod.rs, handlers/{scan.rs,mod.rs}, lib.rs
Tests: 4 report files, 6 test files, 2 handler files (28 ScanArgs constructions updated)

Gap Closure Phase 3: Remote Hosted Mode (CURRENT)

Goal: Enable Aphoria to work against a remote StemeDB instance instead of local-only.
Why Now: Phase 1 wired claims through StemeDB locally, and remote hosting was always the intended deployment.
User Story: "Configure Aphoria to connect to my org's StemeDB URL instead of running locally"

Architecture Note: HostedConfig already exists in the codebase (config/types/hosted.rs) with url, sync_mode, offline_fallback, and api_key_env fields. This phase is about wiring it up, not building sync infrastructure.

Current State (Post-Phase 1 + Phase 2):

✅ Claims stored in StemeDB (Phase 1: EpistemeClaimStore, roundtrip bridge, auto-migration)
✅ Tier-aware resolution (Phase 2: primary tier, tier verdict, --explain-authority)
✅ HostedConfig exists (config/types/hosted.rs: url, sync_mode, offline_fallback, api_key_env)
❌ EpistemeClaimStore uses local StemeDB only (no HTTP client implementation)
❌ No API endpoints for claims (StemeDB API missing /claims/* routes)
❌ No auth integration (API key validation not wired up)
❌ No remote-mode CLI (aphoria init --remote doesn't exist)

Target State (Phase 3):

✅ aphoria init --remote https://stemedb.acme.com writes HostedConfig to .aphoria/config.toml
✅ EpistemeClaimStore calls StemeDB HTTP API instead of local WAL/KV
✅ StemeDB API has /claims/* endpoints (create, list, fetch, update, etc.)
✅ Auth: API key from $STEMEDB_API_KEY validated on server
✅ Offline fallback: graceful degradation when remote unreachable (per OfflineFallback config)

Phase 3 Breakdown

Phase 3.1: StemeDB API Endpoints for Claims (3-4 days)

Add /api/v1/claims POST endpoint (create claim) — handlers/stemedb_claims.rs::create_claim
Add /api/v1/claims GET endpoint (list claims with filters: category, tier, status) — handlers/stemedb_claims.rs::list_claims
Add /api/v1/claims/{concept_path}/{predicate} GET endpoint (fetch specific claim) — handlers/stemedb_claims.rs::get_claim
Add /api/v1/claims/{concept_path}/{predicate} DELETE endpoint (mark as deprecated) — handlers/stemedb_claims.rs::delete_claim
Add /api/v1/claims/{concept_path}/{predicate} PUT endpoint (update claim)
Add /api/v1/claims/{concept_path}/{predicate}/supersede POST endpoint
DTOs: CreateClaimRequest, CreateClaimResponse, AuthoredClaimDto, AuthoredValueDto in dto/stemedb_claims.rs
Error handling: All handlers use crate::error::Result<T> with ApiError variants
State access: WAL append via commit_buffer.append(), queries via store.scan_prefix()
OpenAPI: All endpoints annotated with #[utoipa::path]
Auth: Require API key in Authorization: Bearer <token> header (wiring needed)
Tests: Integration tests for all endpoints with valid/invalid API keys
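
For the auth item above, a minimal sketch of pulling the token out of an `Authorization: Bearer <token>` header value before key validation. The function names are hypothetical and the key check is passed in as a closure, since how stemedb-api actually validates keys is not described here.

```rust
/// Extract the token from an `Authorization` header value of the form
/// `Bearer <token>`. The scheme is matched case-insensitively.
/// Returns None for another scheme, a missing token, or embedded whitespace.
pub fn parse_bearer(header_value: &str) -> Option<&str> {
    let (scheme, rest) = header_value.trim().split_once(' ')?;
    if !scheme.eq_ignore_ascii_case("Bearer") {
        return None;
    }
    let token = rest.trim();
    if token.is_empty() || token.contains(char::is_whitespace) {
        return None;
    }
    Some(token)
}

/// Hypothetical gate for a handler: Ok(()) when the key is accepted,
/// otherwise the HTTP status to return (401 Unauthorized).
pub fn authorize<F: Fn(&str) -> bool>(header: Option<&str>, is_valid_key: F) -> Result<(), u16> {
    match header.and_then(parse_bearer) {
        Some(token) if is_valid_key(token) => Ok(()),
        _ => Err(401),
    }
}
```

A missing header, a non-Bearer scheme, and an unknown key all collapse to the same 401, so a response does not reveal which part of the check failed.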

Phase 3.2: HTTP Client Implementation (3-4 days)

Remote module structure: applications/aphoria/src/remote/{mod.rs,cache.rs,client.rs} (created in Phase 3 prep)
RemoteClaimStore struct with reqwest for API calls (partially implemented, needs testing)
Implement ClaimStore trait over HTTP client (save_claim, load_claim, list_claims, etc.)
Auth: Read API key from $STEMEDB_API_KEY env var, send in Authorization header
Error handling: 401 Unauthorized → clear error message, 5xx → offline fallback
Retries: Exponential backoff for transient errors (503, network timeouts)
Tests: Unit tests with mock HTTP server
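
One way the error-handling and retry items above could fit together is sketched below: 401 fails fast, 503 is retried until attempts run out, and any remaining 5xx hands over to the offline fallback. The `Action` enum, both function names, and the choice to retry only 503 are assumptions for illustration, and the backoff has no jitter.

```rust
use std::time::Duration;

/// Hypothetical client decision for one HTTP response.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Succeed,
    FailAuth, // 401: surface a clear error, never retry
    Retry,    // transient: wait, then try again
    Fallback, // server-side failure: defer to the OfflineFallback policy
    Fail,     // any other status is treated as a hard error
}

/// `attempt` is zero-based; `max_attempts` is the total number of tries allowed.
pub fn classify(status: u16, attempt: u32, max_attempts: u32) -> Action {
    match status {
        200..=299 => Action::Succeed,
        401 => Action::FailAuth,
        503 if attempt + 1 < max_attempts => Action::Retry,
        500..=599 => Action::Fallback,
        _ => Action::Fail,
    }
}

/// Exponential backoff: base * 2^attempt, capped at `cap`.
/// Overflow in the shift or the multiplication saturates to the cap.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

With a 100 ms base and a 5 s cap, the delays run 100 ms, 200 ms, 400 ms, 800 ms and so on until they reach the cap.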

Phase 3.3: Remote Mode CLI (2 days)

aphoria init --remote <url> writes HostedConfig to .aphoria/config.toml
--remote flag validates URL format, tests connection (GET /health), writes config
Config serialization: hosted section with url, sync_mode: "remote_only", api_key_env
aphoria scan detects remote mode, uses HTTP client instead of local StemeDB
Tests: CLI test for init --remote, verify config file written correctly
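
As a rough sketch of the `--remote` flow above (validate the URL, then write the hosted section), assuming a `[hosted]` table with the three keys listed. The table name, both function names, and the deliberately simple URL check are assumptions; the real HostedConfig serialization may differ, and no TOML escaping is done here.

```rust
/// Minimal URL check for `aphoria init --remote <url>`: requires an http(s)
/// scheme and a non-empty host. This is not a full RFC 3986 parser.
/// Returns the URL with surrounding whitespace and trailing slashes removed.
pub fn validate_remote_url(url: &str) -> Result<String, String> {
    let trimmed = url.trim().trim_end_matches('/');
    let rest = trimmed
        .strip_prefix("https://")
        .or_else(|| trimmed.strip_prefix("http://"))
        .ok_or_else(|| format!("remote URL must start with http:// or https://: {url}"))?;
    let host = rest.split('/').next().unwrap_or("");
    if host.is_empty() || host.contains(char::is_whitespace) {
        return Err(format!("remote URL has no valid host: {url}"));
    }
    Ok(trimmed.to_string())
}

/// Render the hosted section for .aphoria/config.toml (layout is assumed).
/// `api_key_env` names the environment variable; the key itself is never written.
pub fn render_hosted_section(url: &str, api_key_env: &str) -> String {
    format!(
        "[hosted]\nurl = \"{url}\"\nsync_mode = \"remote_only\"\napi_key_env = \"{api_key_env}\"\n"
    )
}
```

Storing only the variable name (for example STEMEDB_API_KEY) keeps the token out of the config file, in line with the "env var only" mitigation in the risk table.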

Phase 3.4: Offline Fallback (2 days)

OfflineFallback config options: error, warn, silent
When remote unreachable: respect fallback mode (error = fail scan, warn = log + continue, silent = no-op)
Cache last-known remote state locally (optional): write claims to .aphoria/cache.toml on successful remote fetch
On offline: use cached claims if available, otherwise apply fallback mode
Tests: Integration test with unreachable remote URL, verify fallback behavior
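
The fallback order above (remote first, then cached claims, then the configured mode) can be written as a single match. A minimal sketch: claims are plain strings here, and `ClaimSource` and `resolve_claims` are hypothetical names; only the three `OfflineFallback` options come from the plan.

```rust
/// The three fallback modes named in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OfflineFallback {
    Error,
    Warn,
    Silent,
}

/// Hypothetical result: where the claims for this scan came from.
#[derive(Debug, PartialEq, Eq)]
pub enum ClaimSource {
    Remote(Vec<String>),
    Cache(Vec<String>),
    /// The scan continues without remote claims; `warning` is set in Warn mode.
    Skipped { warning: Option<String> },
}

/// Remote result wins; on failure use the cache if there is one;
/// otherwise apply the configured fallback mode.
pub fn resolve_claims(
    remote: Result<Vec<String>, String>,
    cached: Option<Vec<String>>,
    mode: OfflineFallback,
) -> Result<ClaimSource, String> {
    match (remote, cached) {
        (Ok(claims), _) => Ok(ClaimSource::Remote(claims)),
        (Err(_), Some(claims)) => Ok(ClaimSource::Cache(claims)),
        (Err(err), None) => match mode {
            OfflineFallback::Error => Err(format!("remote StemeDB unreachable: {err}")),
            OfflineFallback::Warn => Ok(ClaimSource::Skipped {
                warning: Some(format!("remote StemeDB unreachable, continuing: {err}")),
            }),
            OfflineFallback::Silent => Ok(ClaimSource::Skipped { warning: None }),
        },
    }
}
```

Keeping the decision in a pure function lets the fallback matrix be unit-tested without an unreachable URL; the integration test then only has to confirm the wiring.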

Phase 3.5: Documentation & Migration (2 days)

Update applications/aphoria/README.md with remote mode setup
New doc: applications/aphoria/docs/remote-mode.md (setup, auth, troubleshooting)
Migration guide: TOML-only → remote StemeDB (backward compatible, no breaking changes)
Example configs: remote-only, local-and-remote, offline-fallback modes
Troubleshooting: auth errors, network issues, API version mismatches

Recent Progress (2026-02-13)

Compilation Fix & API Foundation: Fixed all 24 compilation errors blocking Phase 3 implementation. Key changes:

Wave 1: Type Fixes

Removed unused use std::sync::Arc; import
Added #[derive(utoipa::ToSchema)] to all DTOs for OpenAPI support
Fixed all 7 LifecycleStage mappings (Draft → Proposed, Retired/Superseded → Deprecated)

Wave 2: Architecture Fixes

Created crates/stemedb-api/src/dto/stemedb_claims.rs with documented DTOs (proper separation of concerns)
Fixed WAL append pattern: Replaced state.engine.put() with serialize_assertion() → commit_buffer.append() (5 locations)
Fixed query pattern: Replaced non-existent state.engine.query_by_subject*() with direct store.scan_prefix() calls (3 locations)
Fixed error handling: All handlers now return crate::error::Result<T> instead of custom error tuples
Added explain_authority field to 3 ScanArgs initializations (Phase 2 integration)

Files Changed:

New: crates/stemedb-api/src/dto/stemedb_claims.rs (~90 LOC with docs)
Modified: dto/mod.rs (exports), handlers/stemedb_claims.rs (~400 LOC refactor), handlers/aphoria/{claims.rs,scan.rs} (explain_authority)

Verification:

cargo check --workspace  # ✅ PASSES (was: 24 errors)
cargo clippy --workspace -- -D warnings  # ✅ PASSES

Next Steps:

Test API endpoints manually (Wave 3)
Implement aphoria init --remote CLI command (Wave 4)
Write comprehensive tests (Wave 5)
Update documentation (Wave 6)

Success Criteria

Must Have:

aphoria init --remote <url> writes HostedConfig and validates connection
StemeDB API has /claims/* endpoints (create, list, fetch, update, supersede)
EpistemeClaimStore works over HTTP (no local WAL/KV)
Auth: API key in Authorization header, validated on server
aphoria scan works end-to-end with remote StemeDB (same UX as local)

Should Have:

Offline fallback: graceful degradation when remote unreachable
Cache last-known state locally for offline use
Documentation: remote mode setup, auth, troubleshooting
Backward compatible: local mode still works (default behavior)

Nice to Have:

Sync mode: LocalAndRemote (write local + remote)
Migration tool: bulk import TOML → remote StemeDB
Health check: aphoria check-remote validates connection + auth

Estimated Timeline

Phase	Estimated	Dependencies
3.1: StemeDB API Endpoints	3-4 days	None
3.2: HTTP Client	3-4 days	3.1 complete
3.3: Remote Mode CLI	2 days	3.2 complete
3.4: Offline Fallback	2 days	3.2 complete
3.5: Documentation	2 days	3.1-3.4 complete
Total	2-3 weeks	~15 business days

Note: Phase 3 is now MUCH simpler than originally planned:

No sync infrastructure (direct HTTP, not push/pull)
No multi-agent convergence (deferred to future phase)
No tier escalation (deferred to future phase)
Focus: Enable remote mode, validate it works

Risks & Mitigation

Risk	Likelihood	Impact	Mitigation
API performance (HTTP overhead)	Medium	Medium	Batch operations, HTTP/2, connection pooling
Auth token leakage	Low	High	Env var only, never log token, validate on server
Network unreliability	High	Low	Offline fallback with cached state
Breaking local workflow	Low	High	Local mode is default, remote is opt-in

Gap Closure Phase 4: Claim Discovery & Manual Convergence (FUTURE)

Goal: Help developers discover org patterns and make informed convergence decisions.
Why: Remote mode enables sharing, but developers need visibility into what exists.
User Story: "Show me what claims exist for this module so I can decide if I should align"

Dependencies:

Phase 3 complete (remote mode working)
Developers are authoring claims in shared StemeDB

Key Workflows:

1. Claim Discovery — "What patterns exist for this concept?"

# Query org claims for a specific concept
aphoria claims search --concept-path "*/imports/tokio"
aphoria claims search --category architecture --tier clinical

# Output shows:
# - 3 claims from other teams
# - Tier 1 (RFC): "Core MUST NOT import tokio"
# - Tier 3 (Expert): "Web services MAY import tokio" (15 projects)
# - Tier 4 (Community): "CLI tools SHOULD import tokio" (5 projects)

2. Convergence Decision — "Should I align with this pattern?"

Developer sees Tier 1 spec → follows it (authority)
Developer sees popular Tier 3 pattern (15 projects) → can choose to converge
Decision is MANUAL, not automatic

3. Manual Promotion — "My Tier 3 claim has Tier 1 evidence"

# Developer finds RFC backing for their code claim
aphoria claims promote <claim-id> \
  --to-tier clinical \
  --evidence "RFC 7519 Section 4.1.3" \
  --reason "RFC explicitly requires audience validation"

# System validates:
# - New tier is higher authority than current
# - Evidence is provided
# - Provenance chain preserved

4. Pattern Popularity — "How widely adopted is this claim?"

aphoria claims stats <claim-id>
# Shows:
# - 23 projects have similar claim (same concept_path+predicate)
# - 18 at Tier 3 (Expert)
# - 5 at Tier 4 (Community)
# - Suggests: "This pattern is widely adopted. Consider aligning."

Phase 4 Breakdown (Future — After Phase 3)

Phase 4.1: Claim Search & Discovery (1 week)

aphoria claims search CLI with filters (concept, category, tier, status)
Query remote StemeDB via /claims?filters=... API
Display: claim summary, tier, provenance, evidence, adoption count
Similarity matching: "Show claims similar to my local code"

Phase 4.2: Convergence Suggestions (1 week)

aphoria scan --suggest-convergence compares local code to remote claims
Highlights: "Your code imports tokio, but org spec says MUST NOT"
Highlights: "15 projects use pattern X, you use pattern Y — consider aligning?"
Developer decides: align code, create counter-claim, or ignore

Phase 4.3: Manual Promotion Workflow (1 week)

aphoria claims promote CLI command with validation
Requires: target tier, evidence, reason
Preserves provenance: promoted claim references original
Audit log: who promoted, when, why
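
The promotion checks above reduce to a small validation step. A minimal sketch, assuming tiers are the same 0-5 numbers used in conflict resolution (lower number = higher authority); `PromotionRequest`, `PromotionError`, and `validate_promotion` are hypothetical names, and the provenance link and audit log are out of scope here.

```rust
/// Hypothetical input to `aphoria claims promote`.
pub struct PromotionRequest<'a> {
    pub current_tier: u8,
    pub target_tier: u8,
    pub evidence: &'a str,
    pub reason: &'a str,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PromotionError {
    /// The target must be a strictly lower tier number than the current one.
    NotHigherAuthority { current: u8, target: u8 },
    MissingEvidence,
    MissingReason,
}

/// Check the three requirements: higher-authority target, evidence, reason.
pub fn validate_promotion(req: &PromotionRequest) -> Result<(), PromotionError> {
    if req.target_tier >= req.current_tier {
        return Err(PromotionError::NotHigherAuthority {
            current: req.current_tier,
            target: req.target_tier,
        });
    }
    if req.evidence.trim().is_empty() {
        return Err(PromotionError::MissingEvidence);
    }
    if req.reason.trim().is_empty() {
        return Err(PromotionError::MissingReason);
    }
    Ok(())
}
```

Because a lower number means higher authority, moving a claim from Tier 3 to Tier 1 passes the first check, while Tier 1 to Tier 3 (a demotion) is rejected.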

Phase 4.4: Adoption Metrics (1 week)

aphoria claims stats shows adoption counts per claim
Query: "How many projects have claims with this concept_path?"
Dashboard: most popular patterns by category/tier
Helps identify: emerging conventions, dead patterns

Estimated Timeline: 4 weeks (1 week per sub-phase)

Success Criteria:

Developer can discover org patterns relevant to their code
Developer can see tier + adoption count for each pattern
Developer can manually promote claims with evidence
Developer can choose to align code with popular/authoritative patterns

What This Enables:

Organic convergence driven by inspection, not automation
Authority-aware decision-making (Tier 1 spec > Tier 3 code claim)
Popularity-aware decision-making (15 projects use X → maybe I should too)
Manual promotion path (Tier 3 → Tier 1 when backed by RFC)

Pilot 5: Operational Readiness

Goal: Complete production readiness for enterprise pilot demo.
Context: Pilot 1-4 complete (see archive).
Target: 4-6 weeks to ship-ready state

Enterprise Readiness: Deployment Stages

Stage	Requirements	Timeline	Customer Profile
MVP Pilot	P5.1 Security + P5.2 Monitoring + P5.3 Backup	✅ Ready	Friendly pilot, tolerates manual ops
Production	MVP + P5.4 Runbooks + P5.5 CLI	4 weeks	First paying customer, self-hosted
Scale	Production + Phase 8B-C	8-10 weeks	5-10 customers, automated operations
Enterprise	Scale + Phase 9	6+ months	50+ customers, SOC2/compliance required

Critical Path to Ship (Must-Have)

WEEK 1 - Security (P0 Blockers):

TLS/HTTPS, request size limits, timeouts, secret sanitization, rate limiting

WEEK 2 - Monitoring (P0 Blind without these):

Storage metrics, replication metrics, Grafana dashboards, alert rules

WEEK 3 - Backup & DR (P0 Data loss risk):

Automated backup, backup verification, WAL archival, DR runbook, operational runbooks

WEEK 4 - Deployment (P1 Customer enablement):

CLI tooling, reference architecture, deployment guides, pilot validation

P5.1 Security Hardening (WEEK 1 - SHIP BLOCKERS)

Priority: P0 - Cannot ship without these
Status: 🎯 4/5 Complete (TLS, Limits, Timeouts, Rate Limiting done; Secret Sanitization pending)

TLS/HTTPS Configuration (Partial - 2024-02-11)
- Add TLS 1.3 to stemedb-api (axum-server with rustls) - main.rs:114-123
- Load from env vars: STEMEDB_TLS_CERT_PATH / STEMEDB_TLS_KEY_PATH
- HTTP → HTTPS redirect (deferred - not critical for pilot)
- Let's Encrypt integration for pilot deployments (deferred - manual cert setup OK)
- Certificate rotation documentation (deferred)
- Test with self-signed certs in CI (deferred - Layer 4 tests)
Request Size Limits (Complete - 2024-02-11)
- Add RequestBodyLimitLayer to write endpoints (1MB default) - routers.rs:371
- Add RequestBodyLimitLayer to read endpoints (64KB default) - routers.rs:400
- Make limits configurable: STEMEDB_WRITE_BODY_LIMIT / STEMEDB_READ_BODY_LIMIT
- Created SecurityConfig struct with defaults - routers.rs:35-56
- Updated all 8 create_router_* functions to accept config
- Documented in .env.example
- Document limits in OpenAPI spec (deferred - not critical)
Timeout Configuration (Complete - 2024-02-11)
- Add TimeoutLayer to HTTP routes (configurable, default 30s) - routers.rs:115,143,199,etc
- Wrap all store.get()/put() with tokio::time::timeout(5s) - store_helpers.rs
- Added timeout helpers: store_get_with_timeout() / store_put_with_timeout()
- Updated 6+ handler locations (source.rs, health.rs, report.rs, source_registry/handlers.rs)
- Add timeout metrics: stemedb_operation_timeouts_total{operation="store_get|store_put"}
- Make HTTP timeout configurable: STEMEDB_HTTP_TIMEOUT_SECS
- Added ApiError::Timeout variant with 408 REQUEST_TIMEOUT status - error.rs:76-80
Secret Sanitization (Deferred - not blocking for pilot)
- Remove API key logging from api_key.rs:271 (log hash, not prefix)
- Audit all debug!/info! for credential leaks
- Add test: cargo test -- --nocapture | grep -E "key|secret|password" must find no matches (grep exits non-zero when the output is clean)
- Note: Existing code already logs hashes, audit needed to confirm no leaks
Rate Limiting (Complete - 2024-02-11)
- Rate limit /v1/health to 1 req/sec per IP (prevent metrics flooding) - routers.rs:352
- Make configurable: STEMEDB_HEALTH_RATE_LIMIT (default: 1)
- Uses RateLimitState and rate_limit_middleware - middleware/rate_limit.rs
- Metric already exists: stemedb_rate_limit_rejections_total{endpoint} - rate_limit.rs:87
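
To illustrate the per-IP limit on /v1/health, here is a minimal sketch that allows at most N requests per second per address by enforcing a minimum interval between accepted requests. It is not the `RateLimitState` from middleware/rate_limit.rs: `PerIpLimiter` is a hypothetical type, the clock is passed in for testability, and old entries are never evicted.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical per-IP limiter: remembers when each address was last allowed.
pub struct PerIpLimiter {
    min_interval: Duration,
    last_allowed: HashMap<IpAddr, Instant>,
}

impl PerIpLimiter {
    /// `max_per_sec` mirrors STEMEDB_HEALTH_RATE_LIMIT (default 1);
    /// zero is treated as 1 to avoid dividing by zero.
    pub fn per_second(max_per_sec: u32) -> Self {
        let max = max_per_sec.max(1);
        PerIpLimiter {
            min_interval: Duration::from_secs(1) / max,
            last_allowed: HashMap::new(),
        }
    }

    /// True when the request may proceed. A rejected request does not
    /// reset the window, so steady flooding cannot extend the block.
    pub fn allow(&mut self, ip: IpAddr, now: Instant) -> bool {
        let too_soon = self
            .last_allowed
            .get(&ip)
            .map_or(false, |&prev| now.duration_since(prev) < self.min_interval);
        if too_soon {
            return false;
        }
        self.last_allowed.insert(ip, now);
        true
    }
}
```

A production limiter would also need to bound the map, for example by dropping entries older than the interval; a rejected request is where a counter such as stemedb_rate_limit_rejections_total would be incremented.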

Implementation Notes:

All security features are now configurable via environment variables with sensible defaults
Build succeeds; all features tested manually
Integration tests stubbed in tests/security_hardening.rs (21 tests marked #[ignore])
Secret sanitization deferred as existing code appears safe (uses hashes), but full audit recommended

P5.2 Monitoring Foundation (WEEK 2 - CRITICAL) ✅ COMPLETE

Priority: P0 - Flying blind without these
Status: ✅ Complete (All layers implemented: WAL metrics, storage metrics, HTTP SLI, error tracking, Grafana dashboards, Prometheus alerts, runbooks, validation scripts)
Implementation: P5.2-IMPLEMENTATION-SUMMARY.md

Storage Health Metrics (Complete - 2024-02-11)
- stemedb_wal_fsync_latency_seconds histogram (p50/p95/p99) - journal.rs:34
- stemedb_wal_write_errors_total{error} counter - journal.rs:46
- stemedb_wal_disk_usage_bytes gauge - segment.rs:248
- stemedb_wal_segments_count gauge - segment.rs:249
- stemedb_wal_bytes_written_total counter - journal.rs:45
- stemedb_wal_writes_total counter - journal.rs:44
- stemedb_wal_batch_size histogram - group_commit.rs:201
- stemedb_wal_flush_latency_seconds histogram - group_commit.rs:243
- stemedb_wal_recovery_attempts_total counter - journal.rs:234
- stemedb_wal_recovery_duration_seconds histogram - journal.rs:269
- stemedb_wal_rotations_total counter - journal.rs:304
Storage Operation Metrics (Complete - 2024-02-11)
- stemedb_storage_operation_duration_seconds{operation,backend} histogram - hybrid_backend.rs:118,138,158,180
- stemedb_storage_operations_total{operation,backend} counter - hybrid_backend.rs:123,143,163,185
- stemedb_index_lookup_duration_seconds{index} histogram - index_store.rs:212,235
- Metrics added to: get(), put(), delete(), scan_prefix(), index lookups
Error Tracking (Complete - 2024-02-11)
- stemedb_errors_total{type,layer} counter - error.rs:99
- Tracks 15 error types across 6 layers (validation, api, storage, pipeline, auth, protection)
- Integrated into ApiError::IntoResponse for automatic tracking
HTTP SLI Metrics (Complete - 2024-02-12)
- Pattern implemented in handlers/vote.rs as reference
- stemedb_http_requests_total{method,path} counter
- stemedb_http_request_duration_seconds{method,path,status} histogram
- Rollout complete: 19 handlers instrumented (supersede, epoch, source, admin, escalation, gold_standard, quarantine, circuit_breaker, api_keys, audit, concepts)
- Total coverage: 20 handlers across 11 files
Grafana Dashboards (Complete - 2024-02-11)
- storage-health.json - WAL fsync latency, disk usage, error rates, storage operations, index timing
- cluster-overview.json - Node status, replication lag, sync ops, Merkle diffs, gossip
- sli-dashboard.json - Request rate, latency heatmap, error rate, availability gauge, circuit breakers
- Import guide with troubleshooting: docs/operations/monitoring/grafana/README.md
Prometheus Alert Rules (Complete - 2024-02-11)
- alerts/critical.yml - 8 alerts (API down, disk >90%, replication lag >5min, storage errors, fsync failure, split brain, memory exhaustion, cert expiring)
- alerts/warning.yml - 10 alerts (including slow fsync, high error rate, slow indexes, disk >70%, lag >1min, high latency, compaction backlog, circuit breaker, trust rank decay)
- alerts/info.yml - 9 alerts (circuit breaker open, quarantine backlog, node join, memory >70%, key rotation, gold standard count, cert 30 days, WAL segments, low traffic)
- All alerts include: runbook links, impact description, action steps, for duration, labels
Alerting Integration (Complete - 2024-02-11)
- PagerDuty configuration with 4-level escalation - docs/operations/monitoring/alerting/pagerduty-config.yml
- Slack integration for 3 channels (critical/warning/info) - docs/operations/monitoring/alerting/slack-config.yml
- Escalation policy with response times, contact info, post-mortem template - docs/operations/monitoring/alerting/escalation-policy.md
- Inhibition rules to prevent alert spam
- Workflow integration examples (incident channel creation, resolution tracking)
Additional Runbooks (Complete - 2024-02-12)
- 8 critical/warning runbooks created in docs/operations/runbooks/
- Coverage: high-replication-lag, storage-errors, wal-fsync-failure, split-brain, memory-exhaustion, certificate-renewal, slow-fsync, high-error-rate
- Each includes: Severity, Symptom, Impact, Investigation, Resolution, Prevention, Escalation, References
Validation Scripts (Complete - 2024-02-12)
- scripts/setup-pagerduty.sh - Service key validation, test incident creation, escalation policy check
- scripts/setup-slack.sh - Webhook validation, test message posting, formatting verification
- scripts/test-alerting.sh - End-to-end test (Alertmanager → PagerDuty + Slack), latency measurement

P5.3 Backup & Disaster Recovery (WEEK 3 - CRITICAL) ✅ COMPLETE

Priority: P0 - Data loss risk without these
Completed: 2026-02-12

Automated Backup
- Systemd timer: runs every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
- Systemd service: stemedb-backup.service with retry logic
- Backup retention policy: --keep-last flag with 30-day default
- S3 upload integration: --upload-s3 flag with STANDARD_IA storage
Backup Verification
- verify-backup.sh - Validates magic bytes, CRC32C, BLAKE3 checksums
- Weekly verification timer: Sunday 03:00 UTC
- Metrics: stemedb_backup_verification_status, stemedb_backup_verification_checks_passed
- Alert on verification failure: Prometheus alert rule
WAL Archival
- archive-wal-to-s3.sh - Ships WAL segments to S3 every 15 minutes
- S3 bucket: stemedb-backups-{env}/wal-archive/
- Retention: 30 days in S3 STANDARD_IA
- Metrics: stemedb_wal_archival_lag_seconds, stemedb_wal_archival_segments_uploaded_total
Disaster Recovery Runbook
- docs/operations/runbooks/disaster-recovery.md - Complete DR procedures
- RTO target: 4 hours (validated via drill script)
- RPO target: 15 minutes (achievable with WAL archival)
- 3 recovery scenarios: Full restore, Point-in-time, WAL-only
- Validation checklist: 9 verification steps
DR Drill
- scripts/dr-drill.sh - Automated drill with RTO/RPO measurement
- Report generation: markdown format with timeline, metrics, issues
- Integration tests: uat/production-readiness/backup-dr-tests.sh (7 tests)
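
The magic-byte and CRC32C checks in backup verification can be illustrated with a std-only sketch. The backup layout used here (4 magic bytes, payload, then a little-endian CRC32C trailer over everything before it) and the `STBK` magic value are assumptions, not the real format, and the BLAKE3 check is omitted because it needs an external crate.

```rust
/// Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78.
/// Slow but dependency-free; real code would use a table or hardware path.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            crc = if crc & 1 == 1 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Hypothetical magic bytes; the real backup header is not specified here.
pub const MAGIC: &[u8; 4] = b"STBK";

/// Verify the assumed layout: MAGIC | payload | crc32c(MAGIC + payload) as LE u32.
pub fn verify_backup(bytes: &[u8]) -> Result<(), String> {
    if bytes.len() < MAGIC.len() + 4 {
        return Err("backup too short".to_string());
    }
    if !bytes.starts_with(MAGIC) {
        return Err("bad magic bytes".to_string());
    }
    let (body, trailer) = bytes.split_at(bytes.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    let actual = crc32c(body);
    if stored != actual {
        return Err(format!("CRC32C mismatch: stored {stored:#010x}, computed {actual:#010x}"));
    }
    Ok(())
}
```

The cheap checks run first, so a truncated or foreign file is rejected before any hashing work is done.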

Deliverables:

6 systemd units: 3 timers + 3 services (backup, verify, archive-wal)
4 scripts: backup, verify, archive-wal, dr-drill
Prometheus alerts: 9 alert rules in backup-alerts.yml
DR runbook: 3 recovery scenarios + validation checklist
Integration tests: 7 tests covering all P5.3 components

P5.4 Operational Runbooks (WEEK 3 - CRITICAL) ✅ COMPLETE

Priority: P1 - 2am incidents require these

Critical Runbooks (created in docs/operations/runbooks/)
- server-wont-start.md - Port conflicts, TLS cert issues, disk full, WAL corruption
- high-query-latency.md - Check replication lag, shard hotspots, index health
- restore-from-backup.md - Step-by-step restore procedure with validation
- add-node.md - Node join procedure, shard rebalancing, validation
- disk-full.md - Emergency WAL cleanup, compaction trigger, quota increase
- circuit-breaker-stuck.md - Reset circuit breaker, identify root cause
- quarantine-overflow.md - Investigate quarantine queue, batch approve/reject
Troubleshooting Decision Tree
- docs/operations/troubleshooting-flowchart.md - Complete with symptom → cause → runbook mapping
- Covers all 7 runbooks with decision trees and quick diagnostic commands

P5.5 Cluster Management Tooling (WEEK 4 - HIGH PRIORITY) ✅ COMPLETE

Priority: P1 - Manual SSH not scalable
Completed: 2026-02-12

stemedb-admin CLI (new binary in crates/stemedb-admin/)
- stemedb-admin cluster status - Overview: node count, shard count, meta version, node table
- stemedb-admin cluster health - Quick health check (exit code 0/1)
- stemedb-admin node list - List all nodes with states (Alive/Suspect/Dead)
- stemedb-admin node <id> info - Detailed node info with shard assignments
- stemedb-admin node <id> shards - Show shards assigned to node (with --leader filter)
- stemedb-admin shard list - List all shards with leaders/replicas
- stemedb-admin shard <id> info - Detailed shard info (size, assertions, replicas)
- stemedb-admin shard <id> replicas - Show replica nodes for shard
- stemedb-admin debug export --output <file> - Export complete cluster state as JSON
- HTTP client connecting to gateway (default: http://localhost:18181)
- Output formats: Table (human-friendly with colors) and JSON (machine-readable)
- Environment variable support: STEMEDB_GATEWAY_ADDR
- Proper error handling with helpful messages (no panics)
- 12 integration tests covering all functionality
- Node lifecycle documentation: docs/operations/node-lifecycle.md
- Installation guide: docs/operations/deployment/install-admin-cli.md

Phase 2 Deferred:

stemedb-admin node drain <id> - Graceful node removal (requires gateway endpoints)
stemedb-admin shard rebalance - Manual rebalancing trigger (requires gateway endpoints)
Node Operations Documentation
- docs/operations/node-lifecycle.md
- Add node procedure (pre-flight checks, join, validation)
- Remove node procedure (drain, graceful leave, verification)
- Replace node procedure (dead node replacement, shard recovery)
Shard Management (optional for pilot, defer if time-constrained)
- stemedb-admin shard rebalance - Manual rebalancing trigger
- stemedb-admin shard freeze - Disable auto-split during maintenance
- stemedb-admin shard move <shard-id> <target-node> - Manual migration

P5.6 Reference Architecture (WEEK 4) ✅ COMPLETE

Priority: P1 - Customer deployment guide

Deployment Guides (created in docs/operations/reference-architecture/)
- single-node-pilot.md - Pilot deployment (1 node, docker-compose, hardware specs)
- three-node-cluster.md - Small production (3 nodes, replication factor 2, HA)
- network-requirements.md - Port list (181XX), firewall rules, TLS, DNS setup
Infrastructure as Code Examples (created in docs/operations/deployment/)
- docker-compose/pilot-with-monitoring.yml - Single-node with Grafana + Prometheus
- nginx/stemedb.conf - TLS 1.3, rate limiting, security headers, admin restrictions
- envoy/stemedb.yaml - Load balancing, health checks, circuit breakers, retries
- kubernetes/ - K8s manifests (StatefulSet, Service, Ingress) [DEFERRED - not needed for pilot]
- terraform/ - AWS deployment (EC2, EBS, ALB, S3) [DEFERRED - not needed for pilot]
Resource Sizing Guide
- docs/operations/reference-architecture/resource-sizing.md - Complete with CPU/RAM/disk formulas
- Quick reference table: <10K, <50K, <100K, <500K, <1M assertions
- AWS/GCP/Azure instance recommendations
- Capacity planning metrics and monitoring dashboard
Reverse Proxy Configuration
- nginx/stemedb.conf - TLS termination with Let's Encrypt, rate limiting, admin restrictions
- envoy/stemedb.yaml - Advanced load balancing, circuit breakers, health checks
- Let's Encrypt automation examples (certbot + cron)

P5.7 Pilot Success Validation (WEEK 4) ✅ COMPLETE

Priority: P1 - Definition of done

Performance Benchmarks - Documented in docs/operations/pilot-success-criteria.md
- Sub-second query latency: p99 <1s at 10K assertions (test procedure included)
- Ingest throughput: 1K assertions/sec sustained (5 min load test script)
- Replication lag <1 second under normal load (cluster validation)
Functional Validation - Documented in docs/operations/pilot-success-criteria.md
- Conflict detection: ConflictLens score >0.5 on contradictions (test procedure)
- Audit trail export: 100 assertions with signatures/provenance (validation script)
- Source retraction cascade: 110+ dependents (CARDIOVASC_MEGA_TRIAL example)
Operational Validation - Documented in docs/operations/pilot-success-criteria.md
- Backup/restore roundtrip: 10K assertions → backup → restore → verify (procedure)
- Node failure recovery: Kill node → continue → re-replicate <5min (3-node test)
- Rolling restart: Restart one-by-one during load test → 100% success (procedure)
Demo Validation: 5 Amazement Moments - All documented with test procedures
- Moment 1: Conflicting claims (FDA 0.2% vs Anecdotal 12%)
- Moment 2: Source retraction cascade (110 assertions flagged)
- Moment 3: Audit trail (provenance chain to source)
- Moment 4: Time-travel (query 2023 vs 2025)
- Moment 5: Lens-based resolution (3 lenses → 3 winners)

Phase 8B-C: Production Scale & Observability

Prerequisite: Pilot 5 complete, 1-2 production customers running
Timeline: 4-6 weeks after Pilot 5

8B. Advanced Observability

8B.1 Distributed Tracing
- OpenTelemetry integration (Jaeger or Tempo backend)
- Trace write path: Gateway → Shard Leader → Followers → WAL
- Trace sync path: Merkle diff → Fetch missing → CRDT merge
- Add trace IDs to all log lines (trace_id field)
8B.2 Capacity Planning Metrics
- disk_growth_rate_bytes_per_day (7-day linear regression)
- disk_days_until_full (projected based on growth rate)
- assertion_ingestion_rate (assertions/sec, 24h moving average)
- Dashboard: Capacity trends with projected full date
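The two disk projections above could be computed along these lines. This is a minimal sketch, assuming one disk-usage sample per day and an ordinary least-squares fit over the 7-day window; the function names and sample values are illustrative and are not taken from the StemeDB codebase.

```python
from typing import Optional, Sequence


def growth_rate_bytes_per_day(daily_usage: Sequence[float]) -> float:
    """Least-squares slope of disk usage (bytes) against day index 0..n-1."""
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def days_until_full(daily_usage: Sequence[float], capacity_bytes: float) -> Optional[float]:
    """Project days until the disk is full; None if usage is flat or shrinking."""
    rate = growth_rate_bytes_per_day(daily_usage)
    if rate <= 0:
        return None
    return (capacity_bytes - daily_usage[-1]) / rate


# Hypothetical 7-day window growing by 10 bytes/day on a 260-byte disk.
week = [100, 110, 120, 130, 140, 150, 160]
rate = growth_rate_bytes_per_day(week)   # 10.0
remaining = days_until_full(week, 260)   # 10.0
```

Returning `None` for a non-positive growth rate avoids reporting a negative or infinite "days until full" on the dashboard.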
8B.3 Performance Profiling
- Continuous profiling (pprof/flamegraph integration)
- Per-shard query latency breakdown
- Hot subject/predicate detection
- Slow query log (queries >100ms)
8B.4 Advanced Dashboards
- query-performance.json - Latency by lens, hot subjects, cache hit rate
- write-pipeline.json - Ingest rate, WAL throughput, sync lag
- capacity-planning.json - Growth trends, disk projections, resource utilization

8C. Production Hardening

8C.1 Point-in-Time Recovery (PITR)
- WAL segment archival to S3 (every 15 min or 100 MB)
- Recovery target parsing (--target lsn:123456, --target 2026-02-11T14:25:00)
- WAL replay engine with checksum validation
- Test: Inject corruption at known LSN, restore to LSN-1, verify consistency
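The two `--target` forms listed above (an LSN and an ISO 8601 timestamp) could be parsed roughly as follows. This is only a sketch: the `LsnTarget` and `TimeTarget` types and the `parse_recovery_target` function are hypothetical names, and how a timestamp without a zone is interpreted is left open here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union


@dataclass(frozen=True)
class LsnTarget:
    lsn: int


@dataclass(frozen=True)
class TimeTarget:
    at: datetime


def parse_recovery_target(raw: str) -> Union[LsnTarget, TimeTarget]:
    """Parse 'lsn:<number>' or an ISO 8601 timestamp into a typed target."""
    if raw.startswith("lsn:"):
        value = raw[len("lsn:"):]
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"invalid LSN in recovery target: {raw!r}")
        return LsnTarget(int(value))
    try:
        return TimeTarget(datetime.fromisoformat(raw))
    except ValueError as exc:
        raise ValueError(f"unrecognized recovery target: {raw!r}") from exc


by_lsn = parse_recovery_target("lsn:123456")
by_time = parse_recovery_target("2026-02-11T14:25:00")
```

Using distinct types for the two targets lets the replay engine branch on the kind of target instead of re-parsing the string.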
8C.2 Online Backup (Hot Backup)
- Snapshot API: POST /v1/admin/snapshot (trigger checkpoint, freeze writes briefly)
- Shadow copy: Copy data files while DB is running
- Snapshot registry: Track active snapshots, prevent WAL truncation
- Zero-downtime backup workflow
8C.3 Storage Compaction
- Automatic WAL segment cleanup (delete segments older than 7 days if checkpointed)
- Tombstone removal (compact assertions with lifecycle=Superseded)
- Background task: Run compaction every 6 hours
- Metrics: wal_segments_deleted_total, compaction_bytes_reclaimed
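The WAL cleanup rule above (older than 7 days and checkpointed) can be sketched as a pure selection function. The `WalSegment` fields are hypothetical, and "checkpointed" is assumed here to mean that the segment's last LSN is at or below the most recent checkpoint LSN.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class WalSegment:
    name: str
    last_lsn: int        # highest LSN stored in this segment
    closed_at: datetime  # when the segment stopped receiving writes


def segments_to_delete(
    segments: Sequence[WalSegment],
    checkpoint_lsn: int,
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[WalSegment]:
    """Select segments that are both fully checkpointed and older than max_age."""
    return [
        seg
        for seg in segments
        if seg.last_lsn <= checkpoint_lsn and now - seg.closed_at > max_age
    ]


now = datetime(2026, 2, 14)
segments = [
    WalSegment("a", last_lsn=100, closed_at=datetime(2026, 2, 1)),   # old, checkpointed
    WalSegment("b", last_lsn=200, closed_at=datetime(2026, 2, 1)),   # old, not checkpointed
    WalSegment("c", last_lsn=120, closed_at=datetime(2026, 2, 10)),  # checkpointed, too recent
]
doomed = segments_to_delete(segments, checkpoint_lsn=150, now=now)
```

Keeping the selection separate from the actual file deletion makes the rule easy to test and lets the background task report `wal_segments_deleted_total` from the length of the returned list.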
8C.4 Auto-Healing Improvements
- Detect dead node → trigger re-replication → restore replication factor (automated)
- Circuit breaker: Don't trigger shard split if memory >80%
- Clock skew detection: Reject assertions with timestamps >1s in future
- Partition detection: Log when SWIM sees cluster split
8C.5 Rolling Upgrades
- stemedb-admin upgrade --version v0.3.0 --batch-size 1
- Pre-flight compatibility check (schema version, WAL format)
- Drain node before upgrade (move shards to other nodes)
- Zero-downtime upgrade workflow
8C.6 Multi-Region (Active-Passive)
- Secondary region with continuous WAL replication
- Automated failover (DNS swap when primary unavailable >5 min)
- Failover time target: <10 minutes
- Cost estimate: ~$500/month for active-passive

Phase 9: Enterprise Scale & Compliance

Goal: Enterprise-grade durability, compliance, and incident response
Prerequisite: 5-10 production customers, predictable failure patterns

9A. Advanced Backup & Recovery

9A.1 Incremental Backup
- Only backup changed blocks since last backup (rsync --link-dest pattern)
- Backup time: Minutes instead of hours for 1TB database
- Storage savings: 90% reduction for daily incrementals
9A.2 Cross-Region Backup Replication
- Replicate backups to S3 in different region (S3 cross-region replication)
- Storage tiers: Hot (7 days Standard), Warm (7-30 days Intelligent-Tiering), Cold (30+ days Glacier IR)
- Cost estimate: ~$210/month for 11TB (7 daily + 4 weekly backups)
9A.3 Backup Encryption
- Encrypt backups at rest (AWS KMS or customer-managed keys)
- Encrypt backups in transit (TLS for S3 uploads)
- Key rotation policy (90-day rotation)

9B. Data Corruption & Recovery

9B.1 Deep Corruption Detection
- Validate Merkle tree checksums before accepting gossip
- Periodic background validation (full DB checksum every 24h)
- Metric: corruption_detected_total{source=gossip|disk}
9B.2 Assertion Tombstones (Soft Delete)
- New lifecycle stage: Deleted (append-only, not physically removed)
- Tombstone propagation via gossip (all nodes learn of deletion)
- Query filtering: Lenses ignore Deleted assertions by default
9B.3 Cluster Rollback
- stemedb-admin rollback --before 2026-02-11T14:00:00
- Batch tombstone generation for all assertions after timestamp
- Use case: Bulk data corruption, need to revert cluster to known-good state
9B.4 Split-Brain Recovery
- Automatic detection: Merkle tree divergence >10% after partition heals
- Manual resolution: stemedb-admin resolve-split --prefer-node node-1
- CRDT merge with conflict log (record which assertions were merged/discarded)

9C. Compliance & Legal

9C.1 GDPR Right to Erasure
- Cryptographic erasure: Each agent has unique encryption key
- Delete key → data unrecoverable (even though assertions remain on disk)
- Compliance proof: "Key deleted on YYYY-MM-DD, data cryptographically erased"
9C.2 Data Retention Policies
- Per-subject TTL: retention_policy{subject="medical/*"}=7years
- Per-predicate TTL: retention_policy{predicate="temp_session"}=1day
- Background task: Tombstone assertions past TTL
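A retention check for the policies above might look like the following. This is a sketch under stated assumptions: subject patterns are treated as shell-style globs, "7years" is approximated as 7 × 365 days, and when both a subject rule and a predicate rule match, the shortest TTL wins. That precedence is only one possible choice and is not specified by the roadmap.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatchcase
from typing import Optional

# Example policies mirroring the two rules listed above.
SUBJECT_TTLS = {"medical/*": timedelta(days=7 * 365)}
PREDICATE_TTLS = {"temp_session": timedelta(days=1)}


def ttl_for(subject: str, predicate: str) -> Optional[timedelta]:
    """Return the applicable TTL, or None if no retention policy matches."""
    matches = [
        ttl for pattern, ttl in SUBJECT_TTLS.items() if fnmatchcase(subject, pattern)
    ]
    if predicate in PREDICATE_TTLS:
        matches.append(PREDICATE_TTLS[predicate])
    return min(matches) if matches else None  # assumption: shortest TTL wins


def is_past_ttl(created_at: datetime, now: datetime, ttl: Optional[timedelta]) -> bool:
    """True when the assertion is older than its TTL and should be tombstoned."""
    return ttl is not None and now - created_at > ttl
```

The background task would call `is_past_ttl` for each candidate assertion and emit a tombstone rather than deleting data in place, in keeping with the append-only model.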
9C.3 Immutable Audit Trail
- All admin actions logged to append-only audit store
- Include: Who, what, when, why (justification field required)
- Export API: GET /v1/admin/audit?from=DATE&to=DATE
- Compliance report generator (CSV/PDF for auditors)
9C.4 SOC 2 Type II Certification
- Security controls implementation (access control, encryption, monitoring)
- 6-month observation period (demonstrate controls work consistently)
- External auditor engagement (Big 4 accounting firm)
- Annual recertification

9D. Storage Management

9D.1 Advanced Compaction
- Multi-generation compaction: Merge small segments into larger ones
- Compaction budget: Limit I/O impact (max 10% of disk bandwidth)
- Metrics: compaction_progress{generation}, compaction_bytes_read/written
9D.2 Tiered Storage
- Hot tier: NVMe SSD (last 7 days, accessed frequently)
- Warm tier: SATA SSD (7-90 days, accessed occasionally)
- Cold tier: S3 Glacier (90+ days, accessed rarely)
- Automatic migration based on access patterns
9D.3 Storage Quotas
- Per-agent quotas: quota{agent="user123"}=10GB
- Cluster-wide quota: Hard limit on total DB size
- Soft quota warning at 80% (alert ops team)
- Hard quota rejection at 100% (reject new assertions)
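The soft and hard thresholds above can be expressed as a single admission check. This is a hypothetical sketch: `check_quota` and `QuotaDecision` are illustrative names, and the check is assumed to run against the projected size after the incoming write, so a write that lands exactly on the quota is still accepted with a warning.

```python
from enum import Enum


class QuotaDecision(Enum):
    ACCEPT = "accept"
    ACCEPT_WITH_WARNING = "accept_with_warning"  # soft quota crossed, alert ops
    REJECT = "reject"                            # hard quota would be exceeded


def check_quota(
    used_bytes: int,
    incoming_bytes: int,
    quota_bytes: int,
    soft_percent: int = 80,
) -> QuotaDecision:
    """Decide whether a write fits within an agent's (or the cluster's) quota."""
    projected = used_bytes + incoming_bytes
    if projected > quota_bytes:
        return QuotaDecision.REJECT
    # Integer arithmetic avoids floating-point rounding at the 80% boundary.
    if projected * 100 >= quota_bytes * soft_percent:
        return QuotaDecision.ACCEPT_WITH_WARNING
    return QuotaDecision.ACCEPT


GB = 1024 ** 3
decision = check_quota(used_bytes=8 * GB, incoming_bytes=0, quota_bytes=10 * GB)
```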

9E. Incident Response

9E.1 Alerting & Escalation
- PagerDuty integration (API key in config)
- Slack integration (webhook URL, #stemedb-alerts channel)
- Escalation policy: Warn → Page primary → Page backup → Page manager
- Alert grouping: Batch related alerts (don't page 100 times for same issue)
9E.2 Incident Management
- Incident response playbook (docs/operations/incident-response.md)
- Severity levels: P0 (total outage), P1 (degraded), P2 (warning)
- Communication templates (customer email, status page update)
- Post-mortem template (5 Whys, timeline, action items)
9E.3 Chaos Engineering
- Monthly "game day" exercises
- Scenarios: Node failure, network partition, disk full, slow disk
- Use stemedb-chaos crate to inject failures
- Document learnings, update runbooks
9E.4 On-Call Rotation
- Define on-call schedule (primary, backup, manager escalation)
- On-call playbook (what to do when paged, who to call, escalation path)
- On-call compensation policy
- Post-incident review process

9F. Security Hardening

9F.1 mTLS for Cluster Communication
- Require client certificates for all node-to-node RPC
- Certificate authority: Internal CA or Let's Encrypt
- Certificate rotation: 90-day validity, automated renewal
- Reject connections without valid cert (prevent rogue nodes)
9F.2 Encryption at Rest
- WAL encryption: AES-256-GCM per segment
- KV store encryption: Transparent encryption layer (redb feature or OS-level LUKS)
- Key management: AWS KMS, HashiCorp Vault, or customer-managed keys
- Compliance: Meets HIPAA/GDPR encryption requirements
9F.3 Node Authentication
- Each node has Ed25519 keypair (identity)
- Signed cluster join: Node signs join request with private key
- Admin API: Approve/reject join requests (stemedb-admin node approve <node-id>)
- Prevent unauthorized nodes from joining cluster
9F.4 API Security
- Rate limiting per API key (100 req/min for free tier, 10K req/min for enterprise)
- Input validation: UTF-8, max lengths, regex injection protection
- SQL injection prevention: Parameterized queries only (no string concatenation)
- XSS prevention: Escape all user-provided content in dashboard
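For the per-key rate limits in 9F.4, one common technique is a token bucket per API key. The sketch below illustrates that technique only; it is not a description of the rate limiter already shipped in P5.1, and the class names and tier keys are hypothetical. Time is passed in explicitly so the logic stays deterministic.

```python
from typing import Dict

# Per-minute limits from the roadmap; the tier names are illustrative keys.
TIER_LIMITS_PER_MIN = {"free": 100, "enterprise": 10_000}


class TokenBucket:
    """Holds up to `rate_per_min` tokens and refills continuously over time."""

    def __init__(self, rate_per_min: int, now: float) -> None:
        self.capacity = float(rate_per_min)
        self.tokens = float(rate_per_min)
        self.refill_per_sec = rate_per_min / 60.0
        self.updated_at = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one bucket per API key, sized by the key's tier."""

    def __init__(self) -> None:
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str, tier: str, now: float) -> bool:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(TIER_LIMITS_PER_MIN[tier], now)
            self._buckets[api_key] = bucket
        return bucket.allow(now)
```

In a server, `now` would typically come from a monotonic clock such as `time.monotonic()`, and a rejected request would map to HTTP 429.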
9F.5 Secrets Management
- Never store secrets in code or config files
- Use environment variables or secret management service (Vault, AWS Secrets Manager)
- Secret rotation policy (API keys rotated every 90 days)
- Audit log: Track secret access (who accessed what secret when)

9G. Operational Maturity

9G.1 SLI/SLO Definitions
- Availability SLO: 99.95% uptime (21.9 min/month downtime budget)
- Latency SLO: p95 query latency <100ms, p99 <500ms
- Error rate SLO: <0.1% of requests fail
- Dashboard: SLO compliance tracking, error budget remaining
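The downtime budget quoted for the availability SLO follows from simple arithmetic. The sketch below assumes an average month of 365.25 / 12 days, which reproduces the 21.9 min figure; a flat 30-day month gives about 21.6 min instead. The function names are illustrative.

```python
AVG_DAYS_PER_MONTH = 365.25 / 12  # about 30.44 days


def downtime_budget_minutes(slo_percent: float, period_days: float = AVG_DAYS_PER_MONTH) -> float:
    """Allowed downtime in minutes for an availability SLO over a period."""
    return (1 - slo_percent / 100) * period_days * 24 * 60


def budget_remaining_minutes(
    slo_percent: float,
    downtime_so_far_minutes: float,
    period_days: float = AVG_DAYS_PER_MONTH,
) -> float:
    """Error budget left in the period; negative means the SLO is breached."""
    return downtime_budget_minutes(slo_percent, period_days) - downtime_so_far_minutes


monthly = round(downtime_budget_minutes(99.95), 1)          # 21.9
thirty_day = round(downtime_budget_minutes(99.95, 30), 1)   # 21.6
```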
9G.2 Capacity Planning
- Quarterly capacity review (growth trends, resource utilization)
- 6-month forecast (projected assertion count, disk usage, API load)
- Auto-scaling triggers (add nodes when CPU >70% for 10 min)
- Budget planning: Cloud costs per customer, per assertion
9G.3 Performance Testing
- Load testing: Sustained 10K assertions/sec for 1 hour
- Stress testing: Ramp to failure (find breaking point)
- Chaos testing: Inject failures during load test
- Regression testing: Compare performance across releases
9G.4 Documentation
- Operator guide (docs/operations/operator-guide.md)
- Troubleshooting guide (symptom → diagnosis → fix)
- Architecture deep-dive (how it works, design decisions)
- API reference (auto-generated from OpenAPI spec)
- SDK usage guides (Go, Python, TypeScript)

Architecture Overview

Write Path (Spine):           Read Path (Cortex):
[Agent] -> [Ingestion]        [Agent] <- [Lens Engine]
              |                              |
              v                              |
         [WAL/Fsync]                  [Index Lookup]
              |                              |
              v                              |
         [KV Store] <------------------------+

Port Scheme (181XX)

Offset	Service	Default	Env Var
+0	HTTP API	18180	`STEMEDB_BIND_ADDR`
+1	Cluster Gateway	18181	`STEMEDB_NODE_API_ADDR`
+2	Cluster RPC	18182	`STEMEDB_NODE_RPC_ADDR`
+3	SWIM Gossip	18183	via `SwimConfig`
+4	Metrics	18184	(reserved)
+5	Admin	18185	(reserved)
+6	Latent Signal	18186	—
+7	Community App	18187	—
+8	Admin Dashboard	18188	—
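The offsets in this table can be read as additions to a base port of 18180. The helper below is a small illustration of that scheme; the service keys and the `port_for` function are hypothetical, and whether the base is configurable as a single value is an assumption.

```python
BASE_PORT = 18180  # the "+0" entry in the table above

# Offsets copied from the port scheme table; the keys are illustrative names.
SERVICE_OFFSETS = {
    "http_api": 0,
    "cluster_gateway": 1,
    "cluster_rpc": 2,
    "swim_gossip": 3,
    "metrics": 4,
    "admin": 5,
    "latent_signal": 6,
    "community_app": 7,
    "admin_dashboard": 8,
}


def port_for(service: str, base: int = BASE_PORT) -> int:
    """Return the default port for a service under the 181XX scheme."""
    return base + SERVICE_OFFSETS[service]


gateway_port = port_for("cluster_gateway")  # 18181, the stemedb-admin default gateway port
```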

Crates

Crate	Purpose	Status
`stemedb-core`	Assertion, LifecycleStage, MaterializedView, types, signing	✅
`stemedb-wal`	Write-ahead log with crash recovery	✅
`stemedb-storage`	KVStore, VoteStore, IndexStore, TrustRankStore, QuarantineStore	✅
`stemedb-ingest`	Ingestion pipeline, signature verification, ContentDefenseLayer	✅
`stemedb-query`	Query engine, Materializer for O(1) MV reads	✅
`stemedb-lens`	Lenses (Recency, Consensus, Authority, Skeptic, Layered, etc.)	✅
`stemedb-api`	HTTP API with axum + utoipa OpenAPI docs	✅
`stemedb-sim`	Simulation for testing the pipeline	✅
`stemedb-merkle`	BLAKE3 Merkle tree for diff detection	✅
`stemedb-rpc`	gRPC services for node-to-node communication	✅
`stemedb-sync`	Merkle sync, gossip broadcast, anti-entropy	✅
`stemedb-cluster`	Cluster membership (SWIM), sharding, gateway	✅
`stemedb-ontology`	Domain definitions (Pharma), subject builders, medical extractors	✅
`stemedb-chaos`	Chaos testing infrastructure	✅
`stemedb-dashboard`	Admin dashboard (React/Next.js)	✅ (7 panels)

Applications

App	Purpose	Status
`aphoria`	Code-level truth linter — 42 extractors, claims, verify, coverage	🎯 Remote hosted mode (Gap Closure Phase 3)
`disputed`	Controversy explorer	Planned

SDKs

SDK	Purpose	Status
`sdk/go/steme`	Go HTTP client with Ed25519 signing and fluent builders	✅
`sdk/go/adk`	ADK-Go tools and callbacks for AI agents	✅

Quick Reference

# Build
cargo build --workspace

# Test
cargo test --workspace

# Lint (must pass before commit)
cargo clippy --workspace -- -D warnings
cargo fmt --check

# Run API server
cargo run --bin stemedb-api

# Run Aphoria scan
cargo run --bin aphoria -- scan /path/to/project --show-observations

# Run demo script
./scripts/demo-consumer-health.sh

Arena: Simulation Roadmap

Goal: Incrementally evolve the simulator from Spine validation to a full Agent-Based Modeling environment.
Philosophy: Make it run. Then add. Verify at every step.
Alignment: Tracks main roadmap phases; exercises features as they land.

Current State

The simulator (stemedb-sim) validates the full system through Arena 0-4:

Completed Arenas:

✅ Arena 0: Test infrastructure with assertions and CI integration
✅ Arena 1: Query path via QueryEngine, Recency lens, lifecycle filtering, query audit
✅ Arena 2: Voting & VoteAwareConsensus, troll resistance
✅ Arena 2.5: Hardening (race conditions, API tests, crash recovery, input validation)
✅ Arena 3: Materialized Views, fast-path verification, MV freshness
✅ Arena 4: Agent personas (Scientist, Troll, Believer with differentiated strategies)

What's Tested:

WAL durability, rkyv serialization, Ed25519 signatures
Ingestor pipeline (WAL → KV async flow)
QueryEngine with multiple lenses
Lifecycle filtering, voting, consensus
Query audit trail, materialized views
Strategy-driven agent behaviors

What's Not Yet Tested:

❌ TrustRank (Arena 5)
❌ Concurrent agents at scale (Arena 6)
❌ Time-travel queries (Arena 7)
❌ Skeptic lens & conflict scores (Arena 8)

Upcoming Arena Phases

Arena 5: TrustRank Integration (Next)

Initialize TrustRank for agents
Reputation adjustment after votes
TrustAwareAuthorityLens verification
Troll reputation decay over time

Arena 6: Concurrent Agents

Tokio task per agent
Scale to 100 agents, then 1000
Contention metrics and bottleneck identification

Arena 7: Time-Travel & Epochs

Time-travel query verification
Epoch creation and supersession
Epoch cascade validation

Arena 8: Skeptic & Conflict

High/low conflict scenarios
Skeptic lens surfacing outliers
Conflict score accuracy

Arena 9: Full Gameplay Loop

Ground truth injection
Complete 5-tick scenario
Extended 1000-tick run
Emergence validation

Alignment with Use Cases

Use Case	Arena Phase
Agile Agent Team
Lifecycle filtering	Arena 1.3
Query audit trail	Arena 1.4
Time-travel debugging	Arena 7.1
Expert weighting	Arena 5.3
Financial Due Diligence
Conflict detection	Arena 8.1, 8.3
Epoch cascades	Arena 7.2, 7.3

Run command: cargo run --bin stemedb-sim
Test suite: cargo test -p stemedb-sim

CLAUDE.md — AI assistant instructions and project rules
roadmap-archive.md — Completed phases 1-8A + Pilot 1-3
applications/aphoria/docs/vision-gaps.md — Aphoria vision gap analysis
claims-explained.md — Hand-written Maxwell claims (the gold standard)
docs/demo/pilot/amazement-demo.md — Technical demo script
docs/demo/pilot/amazement-demo-2.md — Executive demo script
uat/production-readiness/README.md — Production verification checklist
