stemedb/architecture.md
jordan 3cfaa1e1d3 feat: Complete Phase 1 (The Spine) - storage foundation
Phase 1 delivers the complete durability and storage layer:

- WAL with crash recovery: Append-only journal with BLAKE3 checksums,
  fsync guarantees, and proper seek-to-EOF on reopen
- Storage engine: sled-backed KVStore with scan_prefix for range queries
- Content-addressed storage: H:{hash}, V:{hash}, E:{hash} key patterns
- Ingestor: Background worker tailing WAL, writing to KV with 8-byte
  aligned record headers for rkyv zero-copy deserialization
- Comprehensive tests: 31 tests covering crash recovery, round-trips,
  and multi-cycle durability

New crates: stemedb-wal, stemedb-storage, stemedb-ingest

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 14:15:34 -07:00

Episteme (StemeDB) Architecture

Design Philosophy: Immutable History, Probabilistic Resolution, Materialized Speed.

Status: Draft Spec v1.0

1. System Overview

Episteme is a log-structured, content-addressed knowledge graph. Unlike traditional databases that mutate state in place, Episteme appends Assertions to an immutable ledger (a Merkle DAG). State is resolved at read time by Lenses, which are policies for choosing among competing Assertions (see Section 4).

Resolving conflicts at read time costs O(N), where N is the number of candidate Assertions and Votes for a key. To avoid paying that cost on every read, Episteme employs a Materialized View layer that pre-calculates the "Current Truth" for standard lenses.
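The trade-off between scanning the ledger and consulting a materialized view can be sketched as follows. This is a minimal in-memory illustration, not the StemeDB implementation: `Claim`, `Ledger`, and the single `score` field (standing in for whatever a lens would compute) are hypothetical names introduced here.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified assertion with a precomputed score.
#[derive(Clone, Debug, PartialEq)]
pub struct Claim {
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub score: f32,
}

/// Append-only log plus a view that tracks the current winner per key.
#[derive(Default)]
pub struct Ledger {
    log: Vec<Claim>,                        // entries are never mutated in place
    view: HashMap<(String, String), usize>, // (subject, predicate) -> index of winner in `log`
}

impl Ledger {
    /// Append a claim and update the view if it beats the current winner.
    pub fn append(&mut self, claim: Claim) {
        let key = (claim.subject.clone(), claim.predicate.clone());
        let idx = self.log.len();
        let wins = match self.view.get(&key) {
            Some(&cur) => claim.score > self.log[cur].score,
            None => true,
        };
        self.log.push(claim);
        if wins {
            self.view.insert(key, idx);
        }
    }

    /// O(N) resolution: scan every entry in the log.
    pub fn resolve_by_scan(&self, subject: &str, predicate: &str) -> Option<&Claim> {
        let mut best: Option<&Claim> = None;
        for c in &self.log {
            if c.subject == subject && c.predicate == predicate {
                if best.map_or(true, |b| c.score > b.score) {
                    best = Some(c);
                }
            }
        }
        best
    }

    /// Expected O(1) resolution: one hash lookup in the view.
    pub fn resolve_materialized(&self, subject: &str, predicate: &str) -> Option<&Claim> {
        self.view
            .get(&(subject.to_string(), predicate.to_string()))
            .map(|&i| &self.log[i])
    }
}
```

Both resolution methods return the same winner; the view simply moves the comparison work from read time to write time.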

High-Level Data Flow

[Writer Agent]      [Reader Agent]
      │                   ▲
      │ (1) Sign &        │ (6) Sub-millisecond Answer
      │     Propose       │     (Pre-computed)
      ▼                   │
┌────────────┐      ┌────────────┐
│  Ingestion │      │ Resolution │
│  Gateway   │      │ Engine     │
└─────┬──────┘      └─────┬──────┘
      │ (2) Append        │ (5) Read Materialized View
      │     to Ballot     │     OR Apply Custom Lens
      ▼                   │
┌────────────┐      ┌────────────┐
│ Quarantine │      │ Indexing   │
│ Journal    │──────► Service    │
└─────┬──────┘ (3)  └─────┬──────┘
      │                   │ (4) Compaction & Materialization
      ▼                   ▼
┌────────────┐      ┌────────────┐
│ Job Manager│      │Materialized│
└────────────┘      │ Views      │
 (TAN Meter)        └────────────┘

2. Core Data Structures

2.1. The Atomic Unit: Assertion (The Candidate)

Assertions are proposals of truth. They are immutable.

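struct Assertion {
    pub subject: EntityId,       // Entity the claim is about
    pub predicate: RelationId,   // Relation being asserted
    pub object: Value,           // Claimed value for (subject, predicate)
    pub epoch: Option<EpochId>,  // Optional epoch the claim belongs to
    pub agent_id: PublicKey,     // The Proposer
    pub timestamp: u64,          // When the proposal was made
    // ... lineage and vector fields ...
}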
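Because Assertions are immutable, an Assertion can be addressed by a hash of its contents. The sketch below illustrates the idea under loud assumptions: `SimpleAssertion` is a hypothetical stand-in with plain field types, and the standard library's 64-bit `DefaultHasher` is used only so the example runs without dependencies. A real content store would use a cryptographic hash over a canonical encoding.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical simplified assertion; field types are placeholders.
#[derive(Hash, Clone, Debug, PartialEq, Eq)]
pub struct SimpleAssertion {
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub agent_id: [u8; 32],
    pub timestamp: u64,
}

/// Derive an `H:{hash}` style key from the assertion's contents.
/// NOTE: DefaultHasher is NOT cryptographic; it is a stand-in for illustration.
pub fn content_key(a: &SimpleAssertion) -> String {
    let mut hasher = DefaultHasher::new();
    a.hash(&mut hasher);
    format!("H:{:016x}", hasher.finish())
}
```

Two Assertions with identical fields map to the same key, so re-submitting the same content is naturally idempotent, while changing any field yields a different key.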

2.2. The Ballot Box: Vote (The High-Velocity Stream)

To prevent lock contention on Assertions, Agents write Votes to a separate high-velocity log.

struct Vote {
    pub assertion_hash: Hash,    // What are we voting on?
    pub agent_id: PublicKey,     // Who is voting?
    pub weight: f32,             // 0.0 - 1.0 (Confidence)
    pub signature: Signature,    // Cryptographic proof
    pub timestamp: u64,
}

2.3. The Storage Layout (LSM Tree)

| Key | Value | Purpose |
|---|---|---|
| H:{Hash} | Assertion | Immutable Content Store |
| V:{Hash} | List<Vote> | The Ballot Box (Append-only) |
| MV:{Subject}:{Predicate} | Assertion | Materialized View (The "Winner") |
| S:{Subject} | List<Hash> | Adjacency Index |
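The key patterns above can be built with small helper functions, and a sorted map makes prefix scans straightforward. This is a sketch only: the helper names are hypothetical, and a `BTreeMap` stands in for the ordered key-value store so the example has no dependencies.

```rust
use std::collections::BTreeMap;

// Hypothetical key builders mirroring the layout table.
pub fn assertion_key(hash: &str) -> String {
    format!("H:{}", hash)
}
pub fn votes_key(hash: &str) -> String {
    format!("V:{}", hash)
}
pub fn view_key(subject: &str, predicate: &str) -> String {
    format!("MV:{}:{}", subject, predicate)
}
pub fn adjacency_key(subject: &str) -> String {
    format!("S:{}", subject)
}

/// Return all keys that start with `prefix`, in sorted order.
/// Starts the range at `prefix` and stops at the first non-matching key.
pub fn scan_prefix(store: &BTreeMap<String, Vec<u8>>, prefix: &str) -> Vec<String> {
    store
        .range(prefix.to_string()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, _)| k.clone())
        .collect()
}
```

One design caveat under this scheme: if a subject can itself contain `:`, the `MV:{Subject}:{Predicate}` key becomes ambiguous, so a real implementation would need escaping or fixed-width identifiers.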

3. The Write Path (The Ballot Box)

  1. Ingest: Agents submit Assertions or Votes.
  2. Journal: Written to episteme-wal.
  3. Ballot Box: Votes are appended to the V:{Hash} stream.
  4. Compactor (Async): A background worker aggregates Votes + TrustRank to update the MV:{Subject}:{Predicate} key.
    • This ensures that Read queries (GET /query) are O(1) lookups on the Materialized View, not O(N) calculations.
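The Compactor step can be sketched as a tally followed by a single write to the view key. The spec does not define how Votes and TrustRank are combined, so the scoring rule below (sum of `weight * trust`, with unknown agents trusted at 0.0) is an assumption, as are the names `VoteRecord`, `tally`, `pick_winner`, and `materialize`.

```rust
use std::collections::HashMap;

/// Hypothetical simplified vote; hashes and agent ids are plain strings here.
pub struct VoteRecord {
    pub assertion_hash: String,
    pub agent_id: String,
    pub weight: f32,
}

/// Sum `weight * trust` per assertion hash (assumed scoring rule).
pub fn tally(votes: &[VoteRecord], trust: &HashMap<String, f32>) -> HashMap<String, f32> {
    let mut scores: HashMap<String, f32> = HashMap::new();
    for v in votes {
        let t = trust.get(&v.agent_id).copied().unwrap_or(0.0);
        *scores.entry(v.assertion_hash.clone()).or_insert(0.0) += v.weight.clamp(0.0, 1.0) * t;
    }
    scores
}

/// Pick the candidate with the highest score; the earliest candidate wins ties.
pub fn pick_winner(candidates: &[String], scores: &HashMap<String, f32>) -> Option<String> {
    let mut best: Option<(&String, f32)> = None;
    for c in candidates {
        let s = scores.get(c).copied().unwrap_or(0.0);
        if best.map_or(true, |(_, b)| s > b) {
            best = Some((c, s));
        }
    }
    best.map(|(c, _)| c.clone())
}

/// Write the winning hash under the MV:{Subject}:{Predicate} key.
pub fn materialize(
    mv: &mut HashMap<String, String>,
    subject: &str,
    predicate: &str,
    candidates: &[String],
    votes: &[VoteRecord],
    trust: &HashMap<String, f32>,
) {
    let scores = tally(votes, trust);
    if let Some(winner) = pick_winner(candidates, &scores) {
        mv.insert(format!("MV:{}:{}", subject, predicate), winner);
    }
}
```

In this sketch the view stores the winning hash rather than the full Assertion; either choice keeps the read a single key lookup.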

4. The Read Path (The Cortex)

Fast Path (Standard Lenses):

  • Query: GET /query?lens=Consensus
  • Action: GET MV:{Subject}:{Predicate}
  • Cost: O(1). Low latency.

Slow Path (Custom/Skeptic Lenses):

  • Query: GET /query?lens=Skeptic
  • Action: Gather all candidates + votes, compute variance on the fly.
  • Cost: O(N). High latency, used for analysis/debugging.
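The slow path's "compute variance on the fly" step might look like the following. The spec does not say which quantity the variance is taken over; this sketch assumes it is the population variance of vote weights per candidate, and the function names are hypothetical.

```rust
/// Mean and population variance of a set of vote weights.
/// Returns None when there are no votes.
pub fn mean_and_variance(weights: &[f32]) -> Option<(f32, f32)> {
    if weights.is_empty() {
        return None;
    }
    let n = weights.len() as f32;
    let mean = weights.iter().sum::<f32>() / n;
    let variance = weights.iter().map(|w| (w - mean).powi(2)).sum::<f32>() / n;
    Some((mean, variance))
}

/// For each (candidate hash, vote weights) pair, report (hash, mean, variance).
/// Candidates without votes are skipped. Cost is linear in the number of votes.
pub fn skeptic_report(candidates: &[(String, Vec<f32>)]) -> Vec<(String, f32, f32)> {
    candidates
        .iter()
        .filter_map(|(hash, weights)| {
            mean_and_variance(weights).map(|(mean, var)| (hash.clone(), mean, var))
        })
        .collect()
}
```

A candidate whose voters all agree has variance near zero, while a candidate with weights split between 0.0 and 1.0 reaches the maximum of 0.25, which is the kind of conflict signal a Skeptic lens would surface.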

Standard Lenses

  • Consensus: Highest cluster density.
  • Authority: Filter by Reputation.
  • Recency: Last Writer Wins.
  • Skeptic: Returns variance/conflict metrics.
  • EpochAware: Validates against current paradigm.
  • Constraints: Returns all must_use/forbidden assertions for a context. Acts as a "Pre-Flight Check" to solve the Optimization Conflict.
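Two of the simpler lenses can be expressed as variants of one enum applied to a list of candidates. This is an illustrative sketch: `Candidate`, its `reputation` field, and the `min_reputation` threshold are assumptions, since the spec does not define how reputation is represented.

```rust
/// Hypothetical candidate view of an assertion for lens evaluation.
#[derive(Clone, Debug, PartialEq)]
pub struct Candidate {
    pub object: String,
    pub timestamp: u64,
    pub reputation: f32, // assumed per-proposer reputation score
}

/// A subset of the standard lenses, for illustration only.
pub enum Lens {
    /// Last Writer Wins: keep only the newest candidate.
    Recency,
    /// Keep candidates whose proposer meets a reputation threshold.
    Authority { min_reputation: f32 },
}

/// Apply a lens and return the surviving candidates.
pub fn apply(lens: &Lens, candidates: &[Candidate]) -> Vec<Candidate> {
    match lens {
        Lens::Recency => candidates
            .iter()
            .max_by_key(|c| c.timestamp)
            .cloned()
            .into_iter()
            .collect(),
        Lens::Authority { min_reputation } => candidates
            .iter()
            .filter(|c| c.reputation >= *min_reputation)
            .cloned()
            .collect(),
    }
}
```

Modeling lenses as data keeps the query interface uniform: the same candidate set can be passed through different lenses without changing the stored Assertions.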

5. The Meter (Economic Safety)

To prevent infinite loops, the Job Manager enforces Temporal Advantage Normalization (TAN).

  • Budgeting: Every Job must declare a max_cost.
  • Throttling: An operation that forks reality or recurses deeply is rejected if current_cost + projected_cost > max_cost.
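The budget rule above reduces to one comparison before each expensive operation. A minimal sketch, assuming costs are unsigned integers in some abstract unit; `Job`, `MeterError`, and `charge` are hypothetical names, not the Job Manager's actual API.

```rust
/// Reason a charge was refused.
#[derive(Debug, PartialEq)]
pub enum MeterError {
    OverBudget { needed: u64, max_cost: u64 },
}

/// Hypothetical job with a declared budget.
pub struct Job {
    pub max_cost: u64,     // declared up front by the job
    pub current_cost: u64, // accumulated so far
}

impl Job {
    /// Reject the operation if current_cost + projected_cost > max_cost;
    /// otherwise record the spend. saturating_add avoids integer overflow.
    pub fn charge(&mut self, projected_cost: u64) -> Result<(), MeterError> {
        let needed = self.current_cost.saturating_add(projected_cost);
        if needed > self.max_cost {
            return Err(MeterError::OverBudget { needed, max_cost: self.max_cost });
        }
        self.current_cost = needed;
        Ok(())
    }
}
```

Because a rejected charge leaves `current_cost` unchanged, a job that hits its limit can still complete cheaper operations that fit within the remaining budget.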

6. The Simulator (Mid-Training Pipeline)

The system continuously exports data to train the next generation of Agents.

  • Negative Samples: High-confidence assertions that were later superseded (Failures).
  • Golden Paths: Branches that successfully merged to Main (Successes).
  • Format: Exported as HuggingFace-compatible datasets for LoRA fine-tuning.

7. Implementation Roadmap

Phase 1: The Spine (Foundation)

  • Reuse quarantine-journal pattern for WAL.
  • Implement Assertion, Epoch, and Vote structs.
  • Basic sled storage backend.

Phase 2: The Lattice (Connectivity)

  • The Ballot Box: Implement separate Vote storage stream.
  • Materializer: Implement background worker to maintain MV keys.
  • The Meter: Implement Budget/TAN middleware in Job Manager.
  • Agent Wallet: Sidecar for key management/signing.

Phase 3: The Cortex (Reasoning)

  • SMT Backend & Branching.
  • Vector Search.
  • Constraints Lens: Implement the pre-flight check logic.

Phase 4: The Hive (Learning)

  • The Simulator: Log exporter pipeline.
  • TrustRank Learning Loop.