stemedb/vision.md
jordan a776744889 Initial project setup with Claude Code monorepo structure
- Rust workspace with stemedb-core crate
- Full .claude/ configuration (agents, skills, commands, guides)
- ai-lookup/ for token-efficient fact storage
- Quality gates: clippy, fmt, jscpd duplication detection
- Pre-commit hook with 5-phase quality checks
- CLAUDE.md router and CODING_GUIDELINES.md standards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 10:56:26 -07:00


# Episteme: The Probabilistic Knowledge Lattice
> **Internal Codename:** StemeDB
> **Category:** Infrastructure / Database
> **Role:** The Cortex (Reasoning & Truth)
## 1. The Manifesto: "Git for Truth"
We are building the shared, long-term memory for autonomous research agents.
Current databases (Postgres, Neo4j, vector DBs) suffer from the **Tower of Babel** problem: they store *Data*, not *Evidence*. They are deterministic and brittle, and they keep a single current value per record. If one Agent writes `Revenue = $10M` and another writes `Revenue = $12M`, one must overwrite the other. History is lost. Truth is flattened.
**Episteme** rejects the idea of a single, static "database state." Instead, it models knowledge as a **Probabilistic Lattice of Assertions**.
* We do not store "Facts."
* We store "Claims."
* We do not "Update" records.
* We "Append" new evidence.
* We do not query "The Truth."
* We query through "Lenses" (Consensus, Recency, Authority).
## 2. The Core Data Model: The Hyper-Edge
The atomic unit of Episteme is not a Row, Document, or Embedding. It is the **Signed Assertion**.
```rust
// Placeholder aliases so the sketch compiles; the real types are not specified here.
type EntityId = String;
type RelationId = String;
type Value = String;
type Hash = [u8; 32];
type PublicKey = [u8; 32];

struct Assertion {
    // The Proposition (The "What")
    subject: EntityId,     // e.g., "Tesla_Inc"
    predicate: RelationId, // "has_annual_revenue"
    object: Value,         // e.g., "$96.7B"
    // The Meta-Cognition (The "Why")
    confidence: f32,     // 0.0 to 1.0 (Agent's subjective certainty)
    source_hash: Hash,   // Content-addressed link to source (PDF, URL, Log)
    agent_id: PublicKey, // Who made this claim? (key that verifies the claim's signature)
    timestamp: u64,      // When?
    // The Semantic Vector (The "Meaning")
    vector: Vec<f32>, // Embedding for semantic navigation
}
```
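The `confidence` field is documented as lying between 0.0 and 1.0, so a writer would probably want to reject out-of-range values before anything is appended. The following is a minimal sketch of that check, not the project's API: `Claim`, `ClaimError`, and `Claim::new` are hypothetical names, and plain strings stand in for `EntityId`, `RelationId`, and `Value`.

```rust
/// Hypothetical, simplified stand-in for `Assertion`. Hashing, signing, and
/// the embedding vector are left out to keep the sketch short.
#[derive(Debug, Clone, PartialEq)]
struct Claim {
    subject: String,
    predicate: String,
    object: String,
    confidence: f32,
    timestamp: u64,
}

#[derive(Debug, PartialEq)]
enum ClaimError {
    ConfidenceOutOfRange,
}

impl Claim {
    /// Builds a claim, rejecting confidences outside 0.0..=1.0.
    /// NaN is rejected as well, because a range never contains NaN.
    fn new(
        subject: &str,
        predicate: &str,
        object: &str,
        confidence: f32,
        timestamp: u64,
    ) -> Result<Claim, ClaimError> {
        if !(0.0..=1.0).contains(&confidence) {
            return Err(ClaimError::ConfidenceOutOfRange);
        }
        Ok(Claim {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            object: object.to_string(),
            confidence,
            timestamp,
        })
    }
}
```

Validating at construction time means every claim that reaches the log already satisfies the documented range, so lenses do not need to re-check it.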
### 2.1. Non-Destructive Writes
Episteme is an **Append-Only Merkle DAG**.
* **Conflict is a Feature:** If Agent A claims X, and Agent B claims Y, the database holds *both* realities simultaneously.
* **Traceability:** Every assertion links back to its parent (if it modifies/refutes a previous claim) and its source (evidence).
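The two properties above can be sketched with an in-memory log in which every node gets a content-derived id and may point at a parent. This is an illustration under stated assumptions: `Node`, `AppendOnlyLog`, and their methods are hypothetical names, and the standard library's 64-bit `DefaultHasher` stands in for the cryptographic hash a real Merkle DAG would require.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Content id. Assumption: a 64-bit non-cryptographic digest, for illustration only.
type NodeId = u64;

#[derive(Debug, Clone, Hash)]
struct Node {
    subject: String,
    predicate: String,
    object: String,
    agent: String,
    parent: Option<NodeId>, // the claim this one modifies or refutes, if any
}

#[derive(Default)]
struct AppendOnlyLog {
    nodes: Vec<(NodeId, Node)>, // nothing is ever updated or removed
}

impl AppendOnlyLog {
    /// Appends a node and returns its id. The parent id is hashed too,
    /// so an id commits to the chain of claims behind it.
    fn append(&mut self, node: Node) -> NodeId {
        let mut hasher = DefaultHasher::new();
        node.hash(&mut hasher);
        let id = hasher.finish();
        self.nodes.push((id, node));
        id
    }

    /// Every claim for a (subject, predicate) pair, conflicting or not.
    fn claims(&self, subject: &str, predicate: &str) -> Vec<&Node> {
        self.nodes
            .iter()
            .map(|(_, n)| n)
            .filter(|n| n.subject == subject && n.predicate == predicate)
            .collect()
    }

    /// Follows parent links from `id` back to the original claim.
    fn lineage(&self, id: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut current = Some(id);
        while let Some(c) = current {
            match self.nodes.iter().find(|(nid, _)| *nid == c) {
                Some((_, n)) => {
                    out.push(c);
                    current = n.parent;
                }
                None => break,
            }
        }
        out
    }
}
```

Appending a `$10M` claim from one agent and a `$12M` claim from another that names the first as its parent leaves both visible through `claims`, and `lineage` recovers the refutation chain.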
## 3. The Query Engine: "Truth Lenses"
Because the database holds conflicting realities, "Reading" is a compute-heavy operation. You cannot just `GET key`. You must apply a **Lens**.
A **Lens** is a compiled WASM filter that resolves the probability field into a concrete answer at Read Time.
### Standard Lenses
1. **Lens::Consensus:** "Return the value with the highest cluster density across all agents." (Democratic Truth)
2. **Lens::Authority:** "Return values signed by Agents with `Reputation > 900`." (Expert Truth)
3. **Lens::Recency:** "Return the latest assertion, ignoring history." (News)
4. **Lens::Skeptic:** "Return the *variance* between claims." (Finds controversy/ambiguity)
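To make the four lenses concrete, here is a sketch that writes them as plain Rust functions rather than compiled WASM modules. Several things are assumptions made for illustration: the `Claim` struct is a simplified numeric stand-in for `Assertion`, "cluster density" is reduced to "most frequently asserted exact value", and the Skeptic lens uses population variance.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified claim: a numeric value plus the metadata the lenses read.
#[derive(Debug, Clone)]
struct Claim {
    value: f64,      // e.g. revenue in $M
    reputation: u32, // reputation of the signing agent (assumed integer scale)
    timestamp: u64,
}

/// Consensus, simplified: the most frequently asserted exact value.
/// A real density measure would group nearby values; this sketch does not.
fn consensus(claims: &[Claim]) -> Option<f64> {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for c in claims {
        *counts.entry(c.value.to_bits()).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Ties fall to the larger bit pattern, only to keep the result deterministic.
        .max_by_key(|&(bits, n)| (n, bits))
        .map(|(bits, _)| f64::from_bits(bits))
}

/// Authority: values signed by agents whose reputation exceeds `threshold`.
fn authority(claims: &[Claim], threshold: u32) -> Vec<f64> {
    claims
        .iter()
        .filter(|c| c.reputation > threshold)
        .map(|c| c.value)
        .collect()
}

/// Recency: the value of the latest assertion, ignoring everything earlier.
fn recency(claims: &[Claim]) -> Option<f64> {
    claims.iter().max_by_key(|c| c.timestamp).map(|c| c.value)
}

/// Skeptic: population variance of the claimed values.
fn skeptic(claims: &[Claim]) -> Option<f64> {
    if claims.is_empty() {
        return None;
    }
    let n = claims.len() as f64;
    let mean = claims.iter().map(|c| c.value).sum::<f64>() / n;
    Some(claims.iter().map(|c| (c.value - mean).powi(2)).sum::<f64>() / n)
}
```

Each function reads the same slice of conflicting claims and returns a different answer, which is the point of a lens: the stored data never changes, only the resolution rule does.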
## 4. Features for the AI Scientist
### 4.1. "Forking Reality" (Branching)
Agents need to simulate futures ("What if inflation hits 5%?"). Episteme supports **Copy-on-Write Branching**.
* An Agent creates a `Scenario Branch`.
* It inserts hypothetical assertions (`Inflation = 5%`).
* It queries for 2nd-order effects.
* The Main Branch remains unpolluted.
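The branching steps above can be sketched with reference counting: a fork holds a pointer to its frozen parent and stores only its own additions. This is one possible shape, not the planned implementation. `Branch`, `fork`, `insert`, and `claims` are hypothetical names, and freezing the parent behind an `Rc` at fork time is an assumption of the sketch.

```rust
use std::rc::Rc;

#[derive(Debug, Clone, PartialEq)]
struct Claim {
    key: String,
    value: String,
}

/// A branch shares its parent's history by reference and owns only its local claims.
struct Branch {
    parent: Option<Rc<Branch>>,
    local: Vec<Claim>,
}

impl Branch {
    fn new() -> Branch {
        Branch { parent: None, local: Vec::new() }
    }

    /// Forking clones a pointer, not the claims behind it.
    fn fork(parent: &Rc<Branch>) -> Branch {
        Branch { parent: Some(Rc::clone(parent)), local: Vec::new() }
    }

    /// Appends a claim to this branch only; ancestors are untouched.
    fn insert(&mut self, key: &str, value: &str) {
        self.local.push(Claim { key: key.to_string(), value: value.to_string() });
    }

    /// Claims visible from this branch: inherited ones first, then local ones.
    fn claims(&self, key: &str) -> Vec<&Claim> {
        let mut out = match &self.parent {
            Some(p) => p.claims(key),
            None => Vec::new(),
        };
        out.extend(self.local.iter().filter(|c| c.key == key));
        out
    }
}
```

A scenario forked from the main branch sees the inherited `Inflation` claim plus its own hypothetical `5%`, while a query against the main branch still returns only the original claim.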
### 4.2. TrustRank (Reputation Markets)
We implement a recursive PageRank-style algorithm for **Source Credibility**.
1. **Validation:** If an Agent's claim is later verified by Ground Truth (e.g., an earnings call), their Reputation Score (`R`) increases.
2. **Back-Propagation:** High-`R` agents confer weight to the sources they cite.
3. **Decay:** Claims from low-`R` agents fade faster from the "Hot" tier.
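The three mechanisms can be illustrated with a small sketch. Every formula in it is an assumption chosen for illustration, since the text does not specify them: reputation lives on a 0 to 1000 scale (suggested by the `Reputation > 900` lens), validation moves `R` a fixed fraction toward the maximum, a source's weight is the sum of the reputations of the agents citing it (a single propagation step, not the full recursive algorithm), and hot-tier decay uses a half-life proportional to `R`.

```rust
use std::collections::HashMap;

const MAX_R: f64 = 1000.0; // assumed reputation scale

#[derive(Default)]
struct TrustRank {
    reputation: HashMap<String, f64>, // agent -> R
    citations: Vec<(String, String)>, // (agent, source) pairs
}

impl TrustRank {
    /// Validation: a verified claim moves R a fraction `rate` of the way to MAX_R.
    fn reward(&mut self, agent: &str, rate: f64) {
        let r = self.reputation.entry(agent.to_string()).or_insert(0.0);
        *r += rate * (MAX_R - *r);
    }

    /// Back-propagation, one step: a source inherits the summed R of its citers.
    fn source_weight(&self, source: &str) -> f64 {
        self.citations
            .iter()
            .filter(|(_, s)| s.as_str() == source)
            .map(|(a, _)| self.reputation.get(a).copied().unwrap_or(0.0))
            .sum()
    }

    /// Decay: exponential fade whose half-life shrinks with the agent's R,
    /// so claims from low-R agents leave the hot tier sooner.
    fn hot_score(&self, agent: &str, age: f64, base_half_life: f64) -> f64 {
        let r = self.reputation.get(agent).copied().unwrap_or(0.0);
        // Floor the ratio so an agent with R = 0 still gets a finite half-life.
        let half_life = base_half_life * (r / MAX_R).max(0.01);
        0.5f64.powf(age / half_life)
    }
}
```

With these assumed formulas, two claims of the same age score differently depending on who signed them, which is what lets the hot tier evict low-reputation claims first.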
## 5. Architecture: The Rust Stack
Episteme follows **"Defensive by Default"** practices.
### Tier 1: The Spine (Durability)
* **Component:** `episteme-wal` (Implementing the Quarantine Journal pattern)
* **Role:** Raw, serialized append-only log. Ensures we never lose a claim.
* **Format:** Binary `Record` with BLAKE3 checksums.
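A checksummed binary record might be framed roughly as follows. This is a sketch under assumptions: the layout (4-byte little-endian length, payload, 8-byte checksum) is invented for illustration, `encode_record`, `decode_record`, and `WalError` are hypothetical names, and a 64-bit FNV-1a hash stands in for BLAKE3 so the example needs no external crates.

```rust
/// FNV-1a, 64-bit. Stand-in only: the design above calls for BLAKE3.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Frames a payload as [len: u32 LE][payload][checksum: u64 LE].
fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + payload.len() + 8);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out
}

#[derive(Debug, PartialEq)]
enum WalError {
    Truncated,
    ChecksumMismatch,
}

/// Reads one record from the front of `buf`.
/// Returns the payload and the number of bytes consumed.
fn decode_record(buf: &[u8]) -> Result<(&[u8], usize), WalError> {
    if buf.len() < 4 {
        return Err(WalError::Truncated);
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let end = 4 + len;
    if buf.len() < end + 8 {
        return Err(WalError::Truncated);
    }
    let payload = &buf[4..end];
    let mut stored = [0u8; 8];
    stored.copy_from_slice(&buf[end..end + 8]);
    if u64::from_le_bytes(stored) != checksum(payload) {
        return Err(WalError::ChecksumMismatch);
    }
    Ok((payload, end + 8))
}
```

Returning the consumed byte count lets a reader walk a log of back-to-back records, and the two error variants distinguish a torn write at the tail from corruption in the middle.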
### Tier 2: The Lattice (Graph/Index)
* **Component:** `episteme-core`
* **Role:** The Hot/Warm memory.
* **Hot Tier:** `DashMap` of active contradiction clusters.
* **Warm Tier:** `sled` (embedded key-value store on log-structured storage) for the Merkle DAG + `hnsw` for vector search.
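One way to picture a "contradiction cluster" is as the set of distinct objects claimed for the same subject and predicate. The sketch below assumes that reading, and it uses a single-threaded `HashMap` in place of `DashMap` so it runs without external crates. `HotTier`, `observe`, and `contradictions` are hypothetical names.

```rust
use std::collections::{BTreeSet, HashMap};

type Key = (String, String); // (subject, predicate)

#[derive(Default)]
struct HotTier {
    // Distinct objects claimed for each (subject, predicate) pair.
    clusters: HashMap<Key, BTreeSet<String>>,
}

impl HotTier {
    /// Records one claimed object. Repeats of the same object collapse in the set.
    fn observe(&mut self, subject: &str, predicate: &str, object: &str) {
        self.clusters
            .entry((subject.to_string(), predicate.to_string()))
            .or_default()
            .insert(object.to_string());
    }

    /// Keys whose cluster holds more than one distinct object, in sorted order.
    fn contradictions(&self) -> Vec<&Key> {
        let mut keys: Vec<&Key> = self
            .clusters
            .iter()
            .filter(|(_, objects)| objects.len() > 1)
            .map(|(k, _)| k)
            .collect();
        keys.sort();
        keys
    }
}
```

Only keys with disagreeing values are reported, so agents that repeat the same claim do not register as a contradiction.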
### Tier 3: The Cortex (Compute)
* **Component:** `episteme-lens`
* **Role:** The WASM runtime for executing Lenses.
* **Function:** Collapses the probabilistic graph into deterministic answers for the client.
## 6. The Ecosystem Triad
Episteme completes the Intelligence Stack:
| System | Biological Analogy | Function | Question Answered |
| :--- | :--- | :--- | :--- |
| **LogDB** | **The Spine** | Immutable Event Log | "What happened?" |
| **AssociativeDB** | **The Hippocampus** | Associative Memory | "What is this like?" |
| **Episteme** | **The Cortex** | Structured Reasoning | "Is this true?" |