

# StemeDB Data Structures

> **Last Updated:** 2026-01-31
> **Source:** `crates/stemedb-core/src/types.rs`

This document describes the core data structures in StemeDB (Episteme). These types form the foundation of the "Git for Truth" knowledge graph.

---

## Design Principles
1. **Append-Only**: Data is never mutated. New assertions create new records.
2. **Content-Addressed**: Every assertion's ID is a BLAKE3 hash of its content.
3. **Zero-Copy**: Uses `rkyv` for serialization - data can be read directly from disk without parsing.
4. **Provenance-First**: Every fact carries its source, signers, and confidence.
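
The first two principles can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: `AppendOnlyStore` and `ContentId` are illustrative names, and the standard library's `DefaultHasher` stands in for BLAKE3 only to keep the example dependency-free. It is not StemeDB's storage layer.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content ID. StemeDB uses a 32-byte BLAKE3 hash; this sketch
/// uses a 64-bit std hash purely for illustration.
pub type ContentId = u64;

/// Hypothetical store showing append-only, content-addressed writes.
#[derive(Default)]
pub struct AppendOnlyStore {
    records: HashMap<ContentId, String>,
}

impl AppendOnlyStore {
    /// The ID is derived from the content, so equal content yields an equal ID.
    pub fn content_id(content: &str) -> ContentId {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        hasher.finish()
    }

    /// Inserts the record if absent. An existing record is never overwritten.
    pub fn put(&mut self, content: &str) -> ContentId {
        let id = Self::content_id(content);
        self.records.entry(id).or_insert_with(|| content.to_string());
        id
    }

    /// Looks a record up by its content ID.
    pub fn get(&self, id: ContentId) -> Option<&str> {
        self.records.get(&id).map(String::as_str)
    }

    /// Number of distinct records stored.
    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```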
---
## Primitive Types
```rust
pub type Hash = [u8; 32]; // BLAKE3 256-bit hash
pub type PHash = [u8; 8]; // Perceptual hash for images (8 bytes)
pub type EntityId = String; // Subject or object identifier
pub type RelationId = String; // Predicate identifier
pub type EpochId = Hash; // Paradigm/era identifier
pub type QueryId = Hash; // Query audit record identifier
```
---
## The Assertion (Atomic Unit of Knowledge)
The `Assertion` is the fundamental unit. It represents a single claim about the world.
```rust
pub struct Assertion {
    // ═══════════════════════════════════════════════════════════
    // 1. THE FACT (What is being claimed)
    // ═══════════════════════════════════════════════════════════
    /// The entity this assertion is about (e.g., "Semaglutide", "Tesla_Inc")
    pub subject: EntityId,
    /// The relationship or property (e.g., "has_side_effect", "annual_revenue")
    pub predicate: RelationId,
    /// The claimed value
    pub object: ObjectValue,

    // ═══════════════════════════════════════════════════════════
    // 2. THE LINEAGE (Why we believe it)
    // ═══════════════════════════════════════════════════════════
    /// If this modifies/forks another assertion, its hash
    pub parent_hash: Option<Hash>,
    /// Hash of the source evidence (PDF, URL, database export)
    pub source_hash: Hash,
    /// Authority tier of the source (enables indexing and decay rates)
    pub source_class: SourceClass,
    /// Perceptual hash of a visual anchor (e.g., screenshot of table)
    pub visual_hash: Option<PHash>,
    /// Which paradigm/era this belongs to (for paradigm shifts)
    pub epoch: Option<EpochId>,
    /// Lifecycle stage (Proposed → Approved → Deprecated)
    pub lifecycle: LifecycleStage,

    // ═══════════════════════════════════════════════════════════
    // 3. META-COGNITION (Who said it, how sure are they)
    // ═══════════════════════════════════════════════════════════
    /// Cryptographic signatures from agents vouching for this
    pub signatures: Vec<SignatureEntry>,
    /// Subjective confidence score (0.0 to 1.0)
    pub confidence: f32,
    /// Unix timestamp when created
    pub timestamp: u64,
    /// Semantic embedding vector for similarity search
    pub vector: Option<Vec<f32>>,
}
```
### ObjectValue
The value in a subject-predicate-object triple:
```rust
pub enum ObjectValue {
    Text(String),        // "muscle loss"
    Number(f64),         // 96.7
    Boolean(bool),       // true
    Reference(EntityId), // Points to another entity (graph edge)
}
```
### LifecycleStage
Assertions progress through lifecycle stages. Each stage change is recorded as a new assertion, never as a mutation of an existing one:
```
Proposed → UnderReview → Approved
                       ↘ Rejected
                                  ↘ Deprecated
```
```rust
pub enum LifecycleStage {
    Proposed,    // Initial idea, not for production use
    UnderReview, // Gathering votes and feedback
    Approved,    // Accepted as current truth
    Deprecated,  // Was true, now superseded
    Rejected,    // Explicitly declined
}
```
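
A transition check over these stages might look like the sketch below. `can_transition` is a hypothetical helper, not part of `types.rs`, and the allowed edges are an assumption read off the diagram: `Rejected` branches from `UnderReview`, and `Deprecated` follows `Approved`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleStage {
    Proposed,
    UnderReview,
    Approved,
    Deprecated,
    Rejected,
}

/// Hypothetical helper: returns true for the stage changes the diagram
/// appears to allow. Rejected and Deprecated are treated as terminal.
pub fn can_transition(from: LifecycleStage, to: LifecycleStage) -> bool {
    use LifecycleStage::*;
    matches!(
        (from, to),
        (Proposed, UnderReview)
            | (UnderReview, Approved)
            | (UnderReview, Rejected)
            | (Approved, Deprecated)
    )
}
```

Because assertions are append-only, a caller would run a check like this before writing the new assertion that records the stage change.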
### SourceClass
`SourceClass` classifies each source by authority tier. The tier drives both indexing and the default decay rate:
| Tier | Class | Example | Default Decay |
|------|-------|---------|---------------|
| 0 | Regulatory | FDA, EMA, WHO | Never |
| 1 | Clinical | Phase III trials, peer-reviewed RCTs | 2 years |
| 2 | Observational | Real-world evidence, cohort studies | 1 year |
| 3 | Expert | Medical professional opinions, guidelines | 6 months |
| 4 | Community | Curated forums, patient advocacy groups | 3 months |
| 5 | Anecdotal | Reddit posts, individual testimonials | 1 month |
```rust
pub enum SourceClass {
    Regulatory,    // Tier 0: Highest authority, never decays
    Clinical,      // Tier 1: Peer-reviewed research
    Observational, // Tier 2: Real-world evidence
    Expert,        // Tier 3: Professional opinions (default)
    Community,     // Tier 4: Curated community knowledge
    Anecdotal,     // Tier 5: Individual reports, fast decay
}

// Method signatures only; bodies are omitted here.
impl SourceClass {
    pub fn tier(&self) -> u8;                        // Returns 0-5
    pub fn default_decay_days(&self) -> Option<u32>; // None = never decays
    pub fn authority_weight(&self) -> f32;           // 1.0 for Regulatory, 0.1 for Anecdotal
}
```
**Key Benefits:**
- **Indexing**: `SC:{source_class}` index enables "show me only regulatory sources"
- **Decay rates**: Anecdotal claims decay faster than clinical evidence
- **Trust weighting**: Lenses can weight sources by authority tier in conflict resolution
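
The three methods could be filled in along these lines. This is a sketch, not the code in `types.rs`: the day counts assume 365-day years and 30-day months, and the intermediate authority weights are an assumption (linear interpolation between the two documented endpoints, 1.0 and 0.1).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Community,
    Anecdotal,
}

impl SourceClass {
    /// Tier number from the table above (0 = highest authority).
    pub fn tier(&self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Community => 4,
            SourceClass::Anecdotal => 5,
        }
    }

    /// Default decay window in days; `None` means the class never decays.
    /// Assumes 365-day years and 30-day months.
    pub fn default_decay_days(&self) -> Option<u32> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730),
            SourceClass::Observational => Some(365),
            SourceClass::Expert => Some(180),
            SourceClass::Community => Some(90),
            SourceClass::Anecdotal => Some(30),
        }
    }

    /// Assumption: weights fall linearly from 1.0 (tier 0) to 0.1 (tier 5).
    pub fn authority_weight(&self) -> f32 {
        1.0 - 0.18 * f32::from(self.tier())
    }
}
```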
### SignatureEntry
Cryptographic proof that an agent vouches for an assertion:
```rust
pub struct SignatureEntry {
    pub agent_id: [u8; 32],  // Ed25519 public key
    pub signature: [u8; 64], // Ed25519 signature over assertion content
    pub timestamp: u64,      // When the agent signed
}
```
---
## The Vote (High-Velocity Consensus)
Votes are separated from assertions to enable thousands of agents to vote simultaneously without lock contention (the "Ballot Box" pattern).
```rust
pub struct Vote {
    /// Hash of the assertion being voted on
    pub assertion_hash: Hash,
    /// Ed25519 public key of the voter
    pub agent_id: [u8; 32],
    /// Weight of the vote (0.0 = reject, 1.0 = full endorsement)
    pub weight: f32,
    /// Signature over the assertion_hash
    pub signature: [u8; 64],
    /// When the vote was cast
    pub timestamp: u64,
}
```
**Key Insight**: Votes are append-only. An agent can change their vote by submitting a new one with a later timestamp.
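
Since a changed vote is just a newer record, a reader has to reduce the vote log to one effective vote per agent. The sketch below shows one way to do that. `effective_votes` and `aggregate_weight` are hypothetical helpers, the `Vote` struct is trimmed (no signature field), and breaking timestamp ties in favor of the later log entry is an assumption.

```rust
use std::collections::HashMap;

pub type Hash = [u8; 32];

/// Trimmed-down vote; the signature field is omitted for brevity.
#[derive(Debug, Clone, PartialEq)]
pub struct Vote {
    pub assertion_hash: Hash,
    pub agent_id: [u8; 32],
    pub weight: f32,
    pub timestamp: u64,
}

/// Keeps only each agent's most recent vote on `target`.
/// Assumption: on equal timestamps, the vote later in the log wins.
pub fn effective_votes<'a>(log: &'a [Vote], target: &Hash) -> HashMap<[u8; 32], &'a Vote> {
    let mut latest: HashMap<[u8; 32], &'a Vote> = HashMap::new();
    for vote in log.iter().filter(|v| &v.assertion_hash == target) {
        let replace = latest
            .get(&vote.agent_id)
            .map_or(true, |prev| vote.timestamp >= prev.timestamp);
        if replace {
            latest.insert(vote.agent_id, vote);
        }
    }
    latest
}

/// Sums the weights of the effective votes on `target`.
pub fn aggregate_weight(log: &[Vote], target: &Hash) -> f32 {
    effective_votes(log, target).values().map(|v| v.weight).sum()
}
```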
---
## The Epoch (Paradigm Shifts)
Epochs represent distinct periods of truth. When knowledge paradigms shift, old epochs can be superseded.
```rust
pub struct Epoch {
    pub id: EpochId,
    pub name: String,                // "Pre-2024", "Newtonian"
    pub supersedes: Option<EpochId>, // What this replaces
    pub supersession_type: Option<SupersessionType>,
    pub start_timestamp: u64,
    pub end_timestamp: Option<u64>,
}

pub enum SupersessionType {
    Invalidation, // Old epoch was factually wrong (e.g., "Earth is flat")
    Temporal,     // Old epoch was correct but outdated (e.g., "President is Obama")
    Refinement,   // Old epoch was a simplification (e.g., Newtonian → Relativity)
}
```
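
Because each epoch only records what it `supersedes`, finding the newest epoch means following those links forward. The sketch below shows one way to do it. `latest_epoch` is a hypothetical helper, the `Epoch` struct is trimmed (timestamps omitted), and the code assumes each epoch has at most one direct successor.

```rust
use std::collections::HashMap;

pub type EpochId = [u8; 32];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupersessionType {
    Invalidation,
    Temporal,
    Refinement,
}

/// Trimmed-down epoch; timestamp fields are omitted for brevity.
#[derive(Debug, Clone)]
pub struct Epoch {
    pub id: EpochId,
    pub name: String,
    pub supersedes: Option<EpochId>,
    pub supersession_type: Option<SupersessionType>,
}

/// Follows supersession links forward from `start` to the newest epoch.
/// Assumption: each epoch is superseded by at most one successor.
/// The loop is bounded by the number of epochs to guard against cycles.
pub fn latest_epoch(epochs: &[Epoch], start: EpochId) -> EpochId {
    // Reverse index: superseded epoch -> the epoch that replaced it.
    let successor: HashMap<EpochId, EpochId> = epochs
        .iter()
        .filter_map(|e| e.supersedes.map(|old| (old, e.id)))
        .collect();

    let mut current = start;
    for _ in 0..epochs.len() {
        match successor.get(&current) {
            Some(next) => current = *next,
            None => break,
        }
    }
    current
}
```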
---
## Query Results
### MaterializedView (O(1) Winner Lookup)
Pre-computed resolution stored at `MV:{subject}:{predicate}`:
```rust
pub struct MaterializedView {
    /// The winning assertion from lens resolution
    pub winner: Assertion,
    /// Which lens produced this (e.g., "VoteAwareConsensus")
    pub lens_name: String,
    /// Confidence in the resolution (0.0 to 1.0)
    pub resolution_confidence: f32,
    /// How many candidates were considered
    pub candidates_count: usize,
    /// When this view was computed
    pub materialized_at: u64,
}
```
### ConflictAnalysis (Trust but Verify)
Returned by the `SkepticLens`, which surfaces all competing claims instead of picking a winner:
```rust
pub struct ConflictAnalysis {
    /// Overall status: Unanimous, Agreed, or Contested
    pub status: ResolutionStatus,
    /// Conflict score (0.0 = unanimous, 1.0 = maximum chaos)
    /// Calculated using normalized Shannon entropy
    pub conflict_score: f32,
    /// All distinct claims, ranked by weight_share descending
    pub claims: Vec<ClaimSummary>,
    /// Total candidates considered
    pub candidates_count: usize,
}

pub enum ResolutionStatus {
    Unanimous, // All agree (entropy < 0.1)
    Agreed,    // Strong majority (entropy < 0.4)
    Contested, // Significant disagreement (entropy >= 0.4)
}
```
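
The conflict score can be sketched as follows. The function names are hypothetical, and the normalization is an assumption: the entropy of the claims' weight shares is divided by `ln(n)`, where `n` is the number of claims with positive weight, so that an even split scores 1.0. The thresholds are the ones documented on `ResolutionStatus`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResolutionStatus {
    Unanimous,
    Agreed,
    Contested,
}

/// Normalized Shannon entropy of the claim weights, in the range 0.0 to 1.0.
/// Returns 0.0 when fewer than two claims carry positive weight.
pub fn conflict_score(weights: &[f32]) -> f32 {
    let positive: Vec<f32> = weights.iter().copied().filter(|w| *w > 0.0).collect();
    if positive.len() < 2 {
        return 0.0;
    }
    let total: f32 = positive.iter().sum();
    let entropy: f32 = positive
        .iter()
        .map(|w| {
            let p = w / total; // this claim's weight share
            -p * p.ln()
        })
        .sum();
    // Dividing by ln(n) maps an even split across n claims to 1.0.
    entropy / (positive.len() as f32).ln()
}

/// Applies the documented thresholds (< 0.1, < 0.4, otherwise contested).
pub fn classify(score: f32) -> ResolutionStatus {
    if score < 0.1 {
        ResolutionStatus::Unanimous
    } else if score < 0.4 {
        ResolutionStatus::Agreed
    } else {
        ResolutionStatus::Contested
    }
}
```

Under these assumptions a 95/5 split between two claims scores roughly 0.29 and is classified as `Agreed`, while a 50/50 split scores 1.0 and is `Contested`.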
### ClaimSummary
A single competing claim within a ConflictAnalysis:
```rust
pub struct ClaimSummary {
    /// The claimed value
    pub value: ObjectValue,
    /// This claim's share of total support (0.0 to 1.0)
    pub weight_share: f32,
    /// Number of assertions making this claim
    pub assertion_count: u32,
    /// Hash of the highest-confidence assertion (for drill-down)
    pub representative_hash: Hash,
    /// Source provenance
    pub source: SourceSummary,
    /// Agents who signed assertions for this claim
    pub supporting_agents: Vec<AgentSummary>,
}
```
### SourceSummary & AgentSummary
Provenance types for "show me the proof" UX:
```rust
pub struct SourceSummary {
    pub source_hash: Hash,          // Hash of source document
    pub visual_hash: Option<PHash>, // Visual anchor (screenshot)
}

pub struct AgentSummary {
    pub agent_id: [u8; 32], // Agent's public key
    pub trust_score: f32,   // Trust score at query time
}
```
---
## Query Audit Trail
Every query is logged for "Why did you think that?" debugging:
```rust
pub struct QueryAudit {
    pub query_id: QueryId,
    pub agent_id: Option<[u8; 32]>, // Who queried (from X-Agent-Id header)
    pub timestamp: u64,
    pub params: QueryParams,
    pub result_hash: Option<Hash>,  // Winning assertion hash
    pub result_confidence: f32,
    pub contributing_assertions: Vec<ContributingAssertion>,
}

pub struct QueryParams {
    pub subject: Option<EntityId>,
    pub predicate: Option<RelationId>,
    pub lifecycle: Option<LifecycleStage>,
    pub epoch: Option<EpochId>,
    pub lens: Option<String>,
}

pub struct ContributingAssertion {
    pub assertion_hash: Hash,
    pub weight: f32, // How much this influenced the result
    pub source_hash: Hash,
    pub lifecycle: LifecycleStage,
}
```
---
## Storage Layout
Key patterns in the KV store:
| Key Pattern | Value | Purpose |
|-------------|-------|---------|
| `H:{hash}` | Serialized Assertion | Primary assertion storage |
| `S:{subject}` | `Vec<Hash>` | Subject index |
| `SP:{subject}:{predicate}` | `Vec<Hash>` | Compound index (O(1) lookup) |
| `MV:{subject}:{predicate}` | MaterializedView | Pre-computed winner |
| `V:{assertion_hash}:{vote_hash}` | Vote | Individual votes |
| `VC:{assertion_hash}` | u64 | Vote count cache |
| `VW:{assertion_hash}` | f32 | Aggregate vote weight cache |
| `TR:{agent_id}` | TrustRank | Agent reputation |
| `TP:{pack_id}` | TrustPack | Curated agent lists |
| `AUD:{query_id}` | QueryAudit | Query audit record |
| `E:{epoch_id}` | Epoch | Epoch definitions |
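
Key construction for a few of these patterns might look like the sketch below. The function names are hypothetical, and hex-encoding the binary components is an assumption borrowed from the JSON API; the actual on-disk key encoding may differ.

```rust
pub type Hash = [u8; 32];

/// Lowercase hex encoding. Assumption: binary key parts are hex-encoded.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// `H:{hash}`: primary assertion storage.
pub fn assertion_key(hash: &Hash) -> String {
    format!("H:{}", hex(hash))
}

/// `SP:{subject}:{predicate}`: compound index.
pub fn subject_predicate_key(subject: &str, predicate: &str) -> String {
    format!("SP:{subject}:{predicate}")
}

/// `MV:{subject}:{predicate}`: pre-computed winner.
pub fn materialized_view_key(subject: &str, predicate: &str) -> String {
    format!("MV:{subject}:{predicate}")
}

/// `V:{assertion_hash}:`: shared prefix of all votes on one assertion.
pub fn vote_prefix(assertion_hash: &Hash) -> String {
    format!("V:{}:", hex(assertion_hash))
}

/// `V:{assertion_hash}:{vote_hash}`: an individual vote.
pub fn vote_key(assertion_hash: &Hash, vote_hash: &Hash) -> String {
    format!("{}{}", vote_prefix(assertion_hash), hex(vote_hash))
}
```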
---
## The Trust Pack (Curator Economy)
Trust Packs are the "App Store for Trust" - curated lists of trusted agents that filter consensus through domain expertise.
```rust
pub struct TrustPack {
    /// Content-addressed pack ID (BLAKE3 hash)
    pub id: PackId,
    /// Human-readable name (e.g., "Mayo_Clinic_Experts")
    pub name: String,
    /// Ed25519 public key of the pack maintainer
    pub maintainer: [u8; 32],
    /// Agent public keys in this pack
    /// Future: Replace with RoaringBitmap for O(1) membership
    pub agents: Vec<[u8; 32]>,
    /// Unix timestamp when pack was created
    pub created_at: u64,
    /// Unix timestamp of last modification
    pub updated_at: u64,
}
```
**Key Methods:**
- `add_agent(agent_id)` - Idempotent agent addition
- `remove_agent(agent_id)` - Safe removal
- `contains_agent(agent_id) -> bool` - Membership check
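
These methods could behave as in the sketch below. It is an illustration, not the code in `types.rs`: the struct is trimmed, and the `now` parameter and the `bool` return values are assumptions added so that idempotence is observable.

```rust
pub type AgentId = [u8; 32];

/// Trimmed-down pack; `id`, `maintainer`, and `created_at` are omitted.
#[derive(Debug, Default)]
pub struct TrustPack {
    pub name: String,
    pub agents: Vec<AgentId>,
    pub updated_at: u64,
}

impl TrustPack {
    /// Linear membership check over the agent list.
    pub fn contains_agent(&self, agent_id: &AgentId) -> bool {
        self.agents.contains(agent_id)
    }

    /// Idempotent: adding an existing member changes nothing.
    /// Returns true only if the pack was modified.
    pub fn add_agent(&mut self, agent_id: AgentId, now: u64) -> bool {
        if self.contains_agent(&agent_id) {
            return false;
        }
        self.agents.push(agent_id);
        self.updated_at = now;
        true
    }

    /// Safe removal: removing a non-member is a no-op.
    /// Returns true only if the pack was modified.
    pub fn remove_agent(&mut self, agent_id: &AgentId, now: u64) -> bool {
        let before = self.agents.len();
        self.agents.retain(|a| a != agent_id);
        let changed = self.agents.len() != before;
        if changed {
            self.updated_at = now;
        }
        changed
    }
}
```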

**Use Case:** Users subscribe to packs like "Skeptical Cardio Pack" to filter GLP-1 side effect claims through vetted cardiologists.

---
## Serialization
All types use `rkyv` for zero-copy deserialization:
```rust
use stemedb_core::serde::{serialize, deserialize};
// Serialize
let bytes: Vec<u8> = serialize(&assertion)?;
// Deserialize (zero-copy when possible)
let assertion: Assertion = deserialize(&bytes)?;
```
**Critical Rule**: Never use raw `AllocSerializer` in production code. Always use `stemedb_core::serde::{serialize, deserialize}`.

---
## Relationship Diagram
```
┌─────────────────────────────────────────────┐
│                  ASSERTION                  │
│  ┌─────────┐ ┌───────────┐ ┌─────────────┐  │
│  │ subject │ │ predicate │ │   object    │  │
│  └─────────┘ └───────────┘ └─────────────┘  │
│                                             │
│  ┌─────────────────┐  ┌──────────────────┐  │
│  │   source_hash   │  │   signatures[]   │  │
│  └────────┬────────┘  └────────┬─────────┘  │
│           │                    │            │
└───────────┼────────────────────┼────────────┘
            │                    │
┌───────────▼───────┐   ┌────────▼────────┐
│  SOURCE DOCUMENT  │   │     AGENTS      │
│   (PDF, URL...)   │   │ (Ed25519 keys)  │
└───────────────────┘   └────────┬────────┘
                        ┌────────▼────────┐
                        │   TRUST RANK    │
                        │  (reputation)   │
                        └─────────────────┘

┌─────────────────┐         ┌─────────────────┐
│      VOTE       │◄────────│    ASSERTION    │
│  (Ballot Box)   │  votes  │    (target)     │
│  weight: 0.0-1.0│   on    │                 │
└─────────────────┘         └─────────────────┘

┌─────────────────┐         ┌─────────────────┐
│     EPOCH B     │◄────────│     EPOCH A     │
│  supersedes: A  │  older  │                 │
│  type: Temporal │  epoch  │                 │
└─────────────────┘         └─────────────────┘
```
---
## API Representation
All binary data (hashes, signatures, agent IDs) is hex-encoded in JSON APIs:
```json
{
  "subject": "Semaglutide",
  "predicate": "muscle_effect",
  "object": { "type": "Text", "value": "Significant loss" },
  "source_hash": "a1b2c3d4e5f6...",
  "signatures": [
    {
      "agent_id": "deadbeef...",
      "signature": "cafebabe...",
      "timestamp": 1706745600
    }
  ],
  "confidence": 0.85,
  "timestamp": 1706745600
}
```
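
Converting between a 32-byte hash and its hex form can be done with the standard library alone. The sketch below uses hypothetical helper names (`to_hex`, `hash_from_hex`) and emits lowercase hex, which is an assumption; the API layer's actual helpers and casing are not shown in this document.

```rust
pub type Hash = [u8; 32];

/// Encodes bytes as lowercase hex, two characters per byte.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Parses a 64-character hex string into a 32-byte hash.
/// Returns None on a wrong length or any non-hex character.
pub fn hash_from_hex(s: &str) -> Option<Hash> {
    if s.len() != 64 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        // Safe to slice by byte offset: the input is known to be ASCII.
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}
```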
---
## See Also
- [Architecture Overview](../architecture.md)
- [Lens Documentation](../ai-lookup/services/lens.md)
- [API Endpoints Guide](../.claude/guides/backend/api-endpoints.md)