stemedb/vision.md
jordan 422e2d4416 feat(aphoria): wire claims through StemeDB — Gap Closure Phase 1
Claims now flow through StemeDB's append-only knowledge graph instead of
mutable TOML files. This resolves all 6 critical claim-bypass code paths:

- Bridge: lossless AuthoredClaim ↔ Assertion round-trip (comparison, status, lifecycle mapping)
- LocalEpisteme: ingest_authored_claim() and fetch_authored_claims() with AUTHORED_CLAIM predicate index
- EpistemeClaimStore: ClaimStore trait backed by StemeDB (append-only delete via deprecation)
- CLI handlers: all claim commands read/write through StemeDB
- Scanner: loads claims from StemeDB with auto-migration fallback to TOML
- Export: new `aphoria claims export` serializes StemeDB claims to TOML/JSON

Also cleans up dead code (EpistemeConfig.url), renames ingest_claims→ingest_observations,
fixes ClaimFilter.authority_tier type, adds Draft variant to ClaimStatus, and fixes
pre-existing clippy warnings (too_many_arguments, filter_next→rfind).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 02:02:51 -07:00

# Episteme: Git for Truth
> **Internal Codename:** StemeDB
> **Category:** Infrastructure / Probabilistic Knowledge Database
> **Role:** The shared memory for AI research agents that disagree
## The Problem: Databases Force False Certainty
Current databases (Postgres, Neo4j, Vector DBs) suffer from **The Tower of Babel** problem: they store _Data_, not _Evidence_. They are deterministic, stateless, and brittle.
When multiple agents observe the world and report different things, traditional databases force one of three outcomes:
- **Pick a winner** (and lose the disagreement)
- **Version-table chaos** (complexity explodes)
- **Application logic everywhere** (authority weighting, decay, and cascades all live in app code)
**Real example:** A woman researching Semaglutide heard "well-tolerated" from her doctor while Reddit users were flagging gastroparesis months before the FDA added the warning. She had no structural way to weigh these sources against each other. The Reddit signal was right. The system failed her.
## The Solution: Store Claims, Resolve at Read Time
Episteme rejects the idea of a single, static "database state." Instead, it models knowledge as a **Probabilistic Marketplace**:
- **Assertions are immutable.** Every claim is signed, timestamped, and preserved forever.
> **Status: PARTIALLY IMPLEMENTED.** StemeDB Assertions follow this model. Aphoria's `AuthoredClaim` entries are currently stored in a mutable TOML file (`.aphoria/claims.toml`), not yet routed through the append-only DAG. Bridging claims into StemeDB as Assertions is tracked in the gap closure plan.
- **Contradictions coexist.** The database holds disagreement without forcing resolution.
- **Lenses resolve at query time.** Different readers can apply different resolution strategies.
- **Source authority is structural.** A regulatory filing outweighs a Reddit post by design.
## The Four Pillars
Every use case must demonstrate at least one pillar. If Postgres could do it, it's not a compelling use case.
| Pillar | What It Enables | Postgres Gap |
| ----------------------------- | -------------------------------------------------- | ---------------------------------------------- |
| **First-Class Contradiction** | Hold conflicting facts without forcing resolution | Must pick one value or version-table chaos |
| **Invalidation Cascades** | Retracted evidence flags all downstream decisions | Recursive CTEs don't scale, app logic drifts |
| **Multi-Signature Consensus** | Weighted trust via cryptographic co-signatures | Join tables have no cryptographic proof |
| **Semantic Decay** | Old data fades from hot path but remains auditable | Manual WHERE clauses, inconsistent decay rates |
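To make the **Invalidation Cascades** pillar concrete, here is a minimal sketch of flagging everything downstream of a retracted assertion. It assumes a hypothetical in-memory map from each node to the nodes derived from it. `NodeId` and `flag_downstream` are illustrative names, not StemeDB APIs.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical identifier for an assertion or a decision derived from one.
type NodeId = u64;

/// Returns every node that transitively depends on `retracted`,
/// excluding `retracted` itself, using a breadth-first traversal.
fn flag_downstream(
    dependents: &HashMap<NodeId, Vec<NodeId>>,
    retracted: NodeId,
) -> BTreeSet<NodeId> {
    let mut seen = BTreeSet::from([retracted]);
    let mut queue = VecDeque::from([retracted]);
    while let Some(node) = queue.pop_front() {
        if let Some(children) = dependents.get(&node) {
            for &child in children {
                // `insert` returns false for nodes already visited, so
                // cycles and diamond-shaped dependencies terminate.
                if seen.insert(child) {
                    queue.push_back(child);
                }
            }
        }
    }
    seen.remove(&retracted);
    seen
}
```

The visited set is what keeps the traversal linear in the number of edges, which is the property a recursive CTE struggles to guarantee on dense graphs.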
## The Core Data Model: The Signed Assertion
The atomic unit is not a Row, Document, or Embedding. It is the **Signed Assertion**:
```rust
struct Assertion {
    // The Proposition (what is being claimed)
    subject: EntityId,      // "semaglutide", "Tesla_Inc"
    predicate: RelationId,  // "has_side_effect", "annual_revenue"
    object: ObjectValue,    // "gastroparesis", "$96.7B"

    // The Lineage (why we believe it)
    source_hash: Hash,              // Content-addressed link to source document
    source_class: SourceClass,      // Authority tier (0 = Regulatory ... 5 = Anecdotal)
    source_metadata: Option<JSON>,  // Rich provenance (journal, DOI, etc.)
    visual_hash: Option<PHash>,     // Perceptual hash for image provenance
    epoch: Option<EpochId>,         // Paradigm context ("covid-2020", "gaap-2024")

    // The Meta-Cognition (who said it, how confident)
    signatures: Vec<SignatureEntry>,  // Ed25519 cryptographic proofs
    confidence: f32,                  // 0.0 to 1.0, subjective certainty
    timestamp: u64,                   // When created
    lifecycle: LifecycleStage,        // Proposed → Approved → Deprecated

    // The Semantic (meaning for similarity search)
    vector: Option<Vec<f32>>,  // Embedding for k-NN queries
}
```
> **Current state:** Aphoria uses a separate `AuthoredClaim` struct with fields like `concept_path`, `predicate`, `value`, `comparison`, `invariant`, and `consequence`. A bridge function (`authored_claim_to_assertion()`) exists to convert between the two representations but is not yet used in the primary claim storage path. Claims are currently persisted in `.aphoria/claims.toml`, not as `Assertion` entries in the DAG. Routing claims through StemeDB is planned.
## The Source Class Hierarchy
Every assertion has a source class that structurally affects resolution weight and decay:
| Tier | Class | Example | Decay Half-Life | Authority Weight |
| ---- | ----------------- | --------------------- | --------------- | ---------------- |
| 0 | **Regulatory** | FDA label, SEC filing | Never | 1.0 |
| 1 | **Clinical** | Peer-reviewed RCTs | 2 years | 0.9 |
| 2 | **Observational** | Real-world evidence | 1 year | 0.7 |
| 3 | **Expert** | Physician guidelines | 6 months | 0.5 |
| 4 | **Community** | Patient registries | 3 months | 0.2 |
| 5 | **Anecdotal** | Reddit posts, social | 30 days | 0.1 |
A million Tier-5 anecdotal assertions cannot outvote a single Tier-0 regulatory assertion. But the million anecdotes can signal "something is happening here" via cluster escalation.
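The table above can be read as a weight function. The sketch below is one possible reading, not the StemeDB implementation: it assumes exponential half-life decay, treats "6 months" and "3 months" as 180 and 90 days, and enforces the "cannot outvote" rule by comparing tiers before any weights are summed. `effective_weight` and `dominant_class` are hypothetical names.

```rust
/// Authority tiers from the table; a lower discriminant means higher authority.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum SourceClass {
    Regulatory = 0,
    Clinical = 1,
    Observational = 2,
    Expert = 3,
    Community = 4,
    Anecdotal = 5,
}

impl SourceClass {
    fn authority_weight(self) -> f64 {
        match self {
            SourceClass::Regulatory => 1.0,
            SourceClass::Clinical => 0.9,
            SourceClass::Observational => 0.7,
            SourceClass::Expert => 0.5,
            SourceClass::Community => 0.2,
            SourceClass::Anecdotal => 0.1,
        }
    }

    /// Half-life in days; `None` means the class never decays.
    /// Month lengths are an assumption (30 days each).
    fn half_life_days(self) -> Option<f64> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730.0),
            SourceClass::Observational => Some(365.0),
            SourceClass::Expert => Some(180.0),
            SourceClass::Community => Some(90.0),
            SourceClass::Anecdotal => Some(30.0),
        }
    }
}

/// Authority weight after `age_days` of exponential decay.
fn effective_weight(class: SourceClass, age_days: f64) -> f64 {
    let decay = match class.half_life_days() {
        Some(half_life) => 0.5_f64.powf(age_days / half_life),
        None => 1.0,
    };
    class.authority_weight() * decay
}

/// The most authoritative class present, regardless of how many
/// lower-tier assertions accompany it.
fn dominant_class(classes: &[SourceClass]) -> Option<SourceClass> {
    classes.iter().copied().min()
}
```

Under these assumptions a 30-day-old Reddit post carries half its original 0.1 weight, while a regulatory filing keeps its full 1.0 at any age. Selecting the dominant tier by rank rather than by summed weight is what stops volume from overriding authority.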
## The Query Engine: Truth Lenses
Reading applies a **Lens** to collapse the probabilistic field into a concrete answer. Materialized Views ensure sub-millisecond latency for common patterns.
### Resolution Lenses (Pick a Winner)
| Lens | Behavior |
| -------------- | ---------------------------------------- |
| **Recency** | Last writer wins |
| **Consensus** | Highest cluster density of object values |
| **Authority** | Filter by TrustRank reputation |
| **Vote-Aware** | Weight by Ballot Box votes |
| **EpochAware** | Filter out superseded paradigms |
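As a rough illustration of read-time resolution, the sketch below runs two lenses over the same set of conflicting claims. The `Lens` trait and the trimmed-down `Claim` struct are assumptions made for this example, and "cluster density" is simplified to exact-match counting of object values.

```rust
use std::collections::HashMap;

/// A trimmed-down stand-in for an assertion (hypothetical).
#[derive(Clone, Debug, PartialEq)]
struct Claim {
    object: String,
    timestamp: u64,
}

/// A resolution lens collapses many claims into at most one winner.
trait Lens {
    fn resolve<'a>(&self, claims: &'a [Claim]) -> Option<&'a Claim>;
}

/// Last writer wins.
struct Recency;

impl Lens for Recency {
    fn resolve<'a>(&self, claims: &'a [Claim]) -> Option<&'a Claim> {
        claims.iter().max_by_key(|c| c.timestamp)
    }
}

/// The most frequently asserted object value wins.
struct Consensus;

impl Lens for Consensus {
    fn resolve<'a>(&self, claims: &'a [Claim]) -> Option<&'a Claim> {
        let mut counts: HashMap<&str, usize> = HashMap::new();
        for claim in claims {
            *counts.entry(claim.object.as_str()).or_insert(0) += 1;
        }
        // On ties, `max_by_key` returns the last maximal element.
        claims.iter().max_by_key(|c| counts[c.object.as_str()])
    }
}
```

Given two older "well-tolerated" claims and one newer "gastroparesis" claim, `Recency` returns the gastroparesis claim and `Consensus` returns a well-tolerated one. The stored data is identical in both cases; only the reader's lens differs.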
### Analysis Lenses (Surface Disagreement)
| Lens | Behavior |
| --------------------- | ------------------------------------------------------- |
| **Skeptic** | Return all claims with conflict score and weight shares |
| **Layered Consensus** | Per-source-class resolution (tier-by-tier visibility) |
| **Constraints** | Pre-flight check for must_use/forbidden predicates |
The **Skeptic** and **Layered Consensus** lenses are key differentiators: they answer "where do sources agree and disagree?" rather than hiding the variance.
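One way to picture a Skeptic-style result is as a weight share per object value plus a single conflict score. The sketch below is an assumption-laden illustration: the source does not define the conflict score, so it is taken here to be one minus the largest share. `SkepticReport` and `skeptic` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical shape of an analysis-lens result.
struct SkepticReport {
    /// Fraction of total weight behind each object value.
    shares: BTreeMap<String, f64>,
    /// 0.0 when every source agrees; approaches 1.0 as weight fragments.
    conflict_score: f64,
}

/// Takes `(object value, weight)` pairs and reports how the weight splits.
fn skeptic(claims: &[(&str, f64)]) -> SkepticReport {
    let total: f64 = claims.iter().map(|(_, weight)| weight).sum();
    let mut shares = BTreeMap::new();
    if total > 0.0 {
        for (object, weight) in claims {
            *shares.entry(object.to_string()).or_insert(0.0) += weight / total;
        }
    }
    // Assumed definition: conflict is whatever the leading value does not hold.
    let top_share = shares.values().cloned().fold(0.0, f64::max);
    let conflict_score = if shares.is_empty() { 0.0 } else { 1.0 - top_share };
    SkepticReport { shares, conflict_score }
}
```

Unlike a resolution lens, nothing is discarded: every object value stays in `shares`, so a minority signal remains visible alongside the majority view.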
## Key Capabilities
### Time-Travel Queries
"What was the known risk profile when I started Semaglutide in June 2023?"
```http
GET /query?subject=semaglutide&predicate=side_effects&as_of=1687000000
```
The append-only DAG preserves every historical state. Time travel is a hash lookup, not a reconstruction.
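A simplified way to see why append-only storage makes `as_of` cheap: if entries are appended in timestamp order, any historical state is a prefix of the log. The sketch below uses a binary search over a flat in-memory log rather than the DAG hash lookup described above. `Entry` and `as_of` are illustrative names, and the sample values in the usage note are placeholders.

```rust
/// Hypothetical log entry; a real assertion carries far more fields.
struct Entry {
    timestamp: u64,
    object: &'static str,
}

/// Returns the slice of the log that existed at `cutoff`.
/// Assumes `log` is sorted by timestamp, which holds for an
/// append-only log whose timestamps are assigned at write time.
fn as_of(log: &[Entry], cutoff: u64) -> &[Entry] {
    let end = log.partition_point(|entry| entry.timestamp <= cutoff);
    &log[..end]
}
```

With entries at timestamps 1680000000, 1687000000, and 1700000000, `as_of(&log, 1687000000)` yields the first two and nothing written afterwards. No entry is copied or rewritten to answer the query.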
### Semantic Decay
Confidence decays based on source class. Old Reddit posts fade; regulatory filings persist:
```http
GET /query?subject=semaglutide&predicate=efficacy&source_class_decay=true
```
### Conflict Analysis
Instead of "here is the answer," show "here is the shape of disagreement":
```http
GET /skeptic?subject=semaglutide&predicate=gastroparesis_risk
```
Returns: which tiers agree, which disagree, and which emerging signals lack clinical evidence.
### Query Audit Trail
Every query is logged with full provenance. "Why did you believe that?" is answerable:
```http
GET /audit/query/{query_id}
```
## The Ballot Box: High-Velocity Consensus
To avoid write contention on assertions, agents vote separately:
```rust
struct Vote {
    assertion_hash: Hash,
    agent_id: PublicKey,
    weight: f32,
    signature: Signature,
}
```
Votes are append-only. A background Materializer aggregates votes to update O(1) read views.
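The aggregation step might look roughly like the following. This is a sketch under simplifying assumptions: hashes are plain 32-byte arrays, agent IDs are integers, signature verification is omitted, and repeated votes from the same agent are simply summed. `AssertionHash` and `materialize` are hypothetical names.

```rust
use std::collections::HashMap;

/// Stand-in for a content hash (assumption: 32 bytes).
type AssertionHash = [u8; 32];

/// Simplified vote; the signature field is left out of this sketch.
struct Vote {
    assertion_hash: AssertionHash,
    agent_id: u64,
    weight: f32,
}

/// Folds the append-only vote log into a per-assertion total,
/// so that readers pay one hash-map lookup instead of a scan.
fn materialize(votes: &[Vote]) -> HashMap<AssertionHash, f32> {
    let mut totals = HashMap::new();
    for vote in votes {
        *totals.entry(vote.assertion_hash).or_insert(0.0) += vote.weight;
    }
    totals
}
```

Because writers only ever append to the vote log and never touch the totals, contention is confined to the background fold rather than to the assertions themselves.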
## Trust Packs: The Curator Economy
Users subscribe to "Trust Packs" (curated lists of trusted agents) to filter reality:
- _"The Skeptical Cardio Pack"_ filters out low-quality cardiac studies
- _"Mayo Clinic Curated"_ only shows assertions from verified Mayo researchers
Trust Packs are BitSet overlays that filter the Consensus Lens efficiently.
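A bitset overlay can be sketched in a few lines. The version below assumes each agent has already been assigned a dense integer index, which the source does not specify. `TrustPack` and `visible` are illustrative names.

```rust
/// One bit per agent index; a set bit means "trusted by this pack".
struct TrustPack {
    bits: Vec<u64>,
}

impl TrustPack {
    /// Creates an empty pack able to address `agent_count` agents.
    fn new(agent_count: usize) -> Self {
        Self { bits: vec![0; (agent_count + 63) / 64] }
    }

    /// Marks an agent as trusted. Panics if `agent` is out of range.
    fn trust(&mut self, agent: usize) {
        self.bits[agent / 64] |= 1u64 << (agent % 64);
    }

    /// Out-of-range agents are treated as untrusted.
    fn contains(&self, agent: usize) -> bool {
        self.bits
            .get(agent / 64)
            .map_or(false, |word| (word & (1u64 << (agent % 64))) != 0)
    }
}

/// Keeps only the claims whose author is in the pack.
/// Each claim is an `(agent index, text)` pair.
fn visible<'a>(pack: &TrustPack, claims: &[(usize, &'a str)]) -> Vec<&'a str> {
    claims
        .iter()
        .filter(|(agent, _)| pack.contains(*agent))
        .map(|(_, text)| *text)
        .collect()
}
```

Membership is a shift and a mask, so applying a pack adds a constant cost per claim, and two packs could be combined with a word-wise AND or OR.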
## The Meter: Economics of Reasoning
Deep Research is computationally expensive. Episteme enforces token-bucket quotas:
- **Assert:** 10 tokens
- **Vote:** 1 token
- **Query:** 5 + lens complexity tokens
- **Default:** 10,000 tokens/agent/hour
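The costs above map naturally onto a token bucket. The following sketch assumes the hourly default is both the bucket capacity and the refill amount per hour, and it takes the current time as a parameter so that behavior is deterministic. `Op` and `Meter` are hypothetical names.

```rust
/// Operations and their costs, taken from the list above.
enum Op {
    Assert,
    Vote,
    Query { lens_complexity: u32 },
}

impl Op {
    fn cost(&self) -> u32 {
        match self {
            Op::Assert => 10,
            Op::Vote => 1,
            Op::Query { lens_complexity } => 5 + lens_complexity,
        }
    }
}

/// Per-agent token bucket that refills continuously up to `capacity`.
struct Meter {
    capacity: f64,
    tokens: f64,
    last_refill_secs: u64,
}

impl Meter {
    /// Starts full; `capacity` tokens are replenished per hour (assumption).
    fn per_hour(capacity: u32, now_secs: u64) -> Self {
        let capacity = f64::from(capacity);
        Self { capacity, tokens: capacity, last_refill_secs: now_secs }
    }

    /// Refills for the elapsed time, then spends if the balance allows it.
    fn try_spend(&mut self, op: &Op, now_secs: u64) -> bool {
        let elapsed = now_secs.saturating_sub(self.last_refill_secs) as f64;
        let refill = elapsed * self.capacity / 3600.0;
        self.tokens = (self.tokens + refill).min(self.capacity);
        self.last_refill_secs = self.last_refill_secs.max(now_secs);
        let cost = f64::from(op.cost());
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```

With the default of 10,000 tokens, an agent can issue 1,000 assertions back to back, after which even a 1-token vote is rejected until the bucket refills.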
## Architecture: The Biological Stack
| Layer | Crate | Role |
| --------------- | ------------------------------- | ---------------------------------------- |
| **The Spine** | `stemedb-wal` | Append-only WAL for durability |
| **The Lattice** | `stemedb-storage` | KV store, indexes, vector/visual indices |
| **The Cortex** | `stemedb-query`, `stemedb-lens` | Query engine, Lenses, Materializer |
| **The Surface** | `stemedb-api` | HTTP API with OpenAPI docs |
The biological metaphor:
- **Spine:** Raw persistence. Never loses a claim.
- **Lattice:** Connectivity. O(1) lookups via compound indexes.
- **Cortex:** Reasoning. Collapse probability into answers.
## Future Vision
### Forking Reality (Planned)
Agents simulate futures without polluting the main branch via **Copy-on-Write Branching** using Sparse Merkle Trees.
### The Super Curator (Planned)
A specialized swarm of reviewer agents that audits high-variance facts and escalates emerging signals.
### The Simulator (Planned)
A pipeline that converts high-confidence failure logs into synthetic training trajectories.
## The Git Analogy
| Git Concept | Episteme Equivalent |
| ----------- | ---------------------------------------- |
| Commit | Assertion (immutable, content-addressed) |
| Branch | Epoch (paradigm context) |
| Merge | Lens resolution |
| Revert | Epoch supersession cascade |
| Blame | Signature/agent audit trail |
| History | Append-only DAG preserved forever |
## When to Use Episteme
**Use Episteme when:**
- Multiple sources report conflicting information
- You need to weight sources by authority, not just timestamp
- You need to surface disagreement, not hide it
- Guidance changes and you need to notify prior consumers
- You need to audit "why did you believe that?"
- You need historical snapshots ("what was true on this date?")
**Use Postgres when:**
- You have a single source of truth
- Data never conflicts
- Temporal validity doesn't matter
- Consensus has already been reached by humans
For everything else: **Episteme is the database.**