StemeDB Data Structures

Last Updated: 2026-01-31
Source: crates/stemedb-core/src/types.rs

This document describes the core data structures in StemeDB (Episteme). These types form the foundation of the "Git for Truth" knowledge graph.


Design Principles

  1. Append-Only: Data is never mutated. New assertions create new records.
  2. Content-Addressed: Every assertion's ID is a BLAKE3 hash of its content.
  3. Zero-Copy: Uses rkyv for serialization - data can be read directly from disk without parsing.
  4. Provenance-First: Every fact carries its source, signers, and confidence.

Primitive Types

pub type Hash = [u8; 32];       // BLAKE3 256-bit hash
pub type PHash = [u8; 8];       // Perceptual hash for images (8 bytes)
pub type EntityId = String;     // Subject or object identifier
pub type RelationId = String;   // Predicate identifier
pub type EpochId = Hash;        // Paradigm/era identifier
pub type QueryId = Hash;        // Query audit record identifier

The Assertion (Atomic Unit of Knowledge)

The Assertion is the fundamental unit. It represents a single claim about the world.

pub struct Assertion {
    // ═══════════════════════════════════════════════════════════
    // 1. THE FACT (What is being claimed)
    // ═══════════════════════════════════════════════════════════

    /// The entity this assertion is about (e.g., "Semaglutide", "Tesla_Inc")
    pub subject: EntityId,

    /// The relationship or property (e.g., "has_side_effect", "annual_revenue")
    pub predicate: RelationId,

    /// The claimed value
    pub object: ObjectValue,

    // ═══════════════════════════════════════════════════════════
    // 2. THE LINEAGE (Why we believe it)
    // ═══════════════════════════════════════════════════════════

    /// If this modifies/forks another assertion, its hash
    pub parent_hash: Option<Hash>,

    /// Hash of the source evidence (PDF, URL, database export)
    pub source_hash: Hash,

    /// Authority tier of the source (enables indexing and decay rates)
    pub source_class: SourceClass,

    /// Perceptual hash of a visual anchor (e.g., screenshot of table)
    pub visual_hash: Option<PHash>,

    /// Which paradigm/era this belongs to (for paradigm shifts)
    pub epoch: Option<EpochId>,

    /// Lifecycle stage (Proposed → Approved → Deprecated)
    pub lifecycle: LifecycleStage,

    // ═══════════════════════════════════════════════════════════
    // 3. META-COGNITION (Who said it, how sure are they)
    // ═══════════════════════════════════════════════════════════

    /// Cryptographic signatures from agents vouching for this
    pub signatures: Vec<SignatureEntry>,

    /// Subjective confidence score (0.0 to 1.0)
    pub confidence: f32,

    /// Unix timestamp when created
    pub timestamp: u64,

    /// Semantic embedding vector for similarity search
    pub vector: Option<Vec<f32>>,
}

ObjectValue

The value in a subject-predicate-object triple:

pub enum ObjectValue {
    Text(String),           // "muscle loss"
    Number(f64),            // 96.7
    Boolean(bool),          // true
    Reference(EntityId),    // Points to another entity (graph edge)
}

LifecycleStage

Assertions progress through stages (as new assertions, not mutations):

Proposed → UnderReview → Approved
                      ↘ Rejected
         ↘ Deprecated

pub enum LifecycleStage {
    Proposed,      // Initial idea, not for production use
    UnderReview,   // Gathering votes and feedback
    Approved,      // Accepted as current truth
    Deprecated,    // Was true, now superseded
    Rejected,      // Explicitly declined
}

SourceClass

Authority tier classification for sources. Enables indexing by tier and tier-based decay rates:

| Tier | Class         | Example                                   | Default Decay |
|------|---------------|-------------------------------------------|---------------|
| 0    | Regulatory    | FDA, EMA, WHO                             | Never         |
| 1    | Clinical      | Phase III trials, peer-reviewed RCTs      | 2 years       |
| 2    | Observational | Real-world evidence, cohort studies       | 1 year        |
| 3    | Expert        | Medical professional opinions, guidelines | 6 months      |
| 4    | Community     | Curated forums, patient advocacy groups   | 3 months      |
| 5    | Anecdotal     | Reddit posts, individual testimonials     | 1 month       |

pub enum SourceClass {
    Regulatory,    // Tier 0: Highest authority, never decays
    Clinical,      // Tier 1: Peer-reviewed research
    Observational, // Tier 2: Real-world evidence
    Expert,        // Tier 3: Professional opinions (default)
    Community,     // Tier 4: Curated community knowledge
    Anecdotal,     // Tier 5: Individual reports, fast decay
}

// Signatures only; method bodies are omitted here.
impl SourceClass {
    pub fn tier(&self) -> u8;              // Returns 0-5
    pub fn default_decay_days(&self) -> Option<u32>;
    pub fn authority_weight(&self) -> f32; // 1.0 for Regulatory, 0.1 for Anecdotal
}

Key Benefits:

  • Indexing: SC:{source_class} index enables "show me only regulatory sources"
  • Decay rates: Anecdotal claims decay faster than clinical evidence
  • Trust weighting: Lenses can weight sources by authority tier in conflict resolution
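
The tier table and method signatures above can be sketched roughly as follows. This is a minimal illustration, not the code in types.rs. The day counts (730, 365, 180, 90, 30) are assumed approximations of the durations in the table. `authority_weight` is left out because only its endpoint values (1.0 and 0.1) are documented. `is_stale` is a hypothetical helper added for the example.

```rust
/// Authority tiers, mirroring the SourceClass table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Community,
    Anecdotal,
}

impl SourceClass {
    /// Tier number, from 0 (highest authority) to 5.
    pub fn tier(&self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Community => 4,
            SourceClass::Anecdotal => 5,
        }
    }

    /// Default decay window in days; `None` means the tier never decays.
    /// The day counts are assumptions approximating the documented durations.
    pub fn default_decay_days(&self) -> Option<u32> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730),
            SourceClass::Observational => Some(365),
            SourceClass::Expert => Some(180),
            SourceClass::Community => Some(90),
            SourceClass::Anecdotal => Some(30),
        }
    }
}

/// Hypothetical helper: has a claim of this class outlived its decay window?
pub fn is_stale(class: SourceClass, age_days: u32) -> bool {
    match class.default_decay_days() {
        Some(limit) => age_days > limit,
        None => false,
    }
}
```

Under these assumptions, `is_stale(SourceClass::Anecdotal, 45)` is true, while a Regulatory claim is never stale regardless of age.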

SignatureEntry

Cryptographic proof that an agent vouches for an assertion:

pub struct SignatureEntry {
    pub agent_id: [u8; 32],   // Ed25519 public key
    pub signature: [u8; 64],  // Ed25519 signature over assertion content
    pub timestamp: u64,       // When the agent signed
}

The Vote (High-Velocity Consensus)

Votes are separated from assertions to enable thousands of agents to vote simultaneously without lock contention (the "Ballot Box" pattern).

pub struct Vote {
    /// Hash of the assertion being voted on
    pub assertion_hash: Hash,

    /// Ed25519 public key of the voter
    pub agent_id: [u8; 32],

    /// Weight of the vote (0.0 = reject, 1.0 = full endorsement)
    pub weight: f32,

    /// Signature over the assertion_hash
    pub signature: [u8; 64],

    /// When the vote was cast
    pub timestamp: u64,
}

Key Insight: Votes are append-only. An agent can change their vote by submitting a new one with a later timestamp.
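
One way to read the "later timestamp wins" rule is sketched below. This is an illustration under assumptions, not the VoteStore implementation: the `Vote` struct is trimmed (no signature), `tally` is a hypothetical function name, and counting distinct voters rather than raw vote records is an assumed interpretation of the cached counts.

```rust
use std::collections::HashMap;

pub type Hash = [u8; 32];

/// Trimmed-down Vote; the signature field is omitted for brevity.
#[derive(Debug, Clone)]
pub struct Vote {
    pub assertion_hash: Hash,
    pub agent_id: [u8; 32],
    pub weight: f32,
    pub timestamp: u64,
}

/// Keep only each agent's most recent vote on `target`.
/// Returns (number of distinct voters, sum of their current weights).
pub fn tally(votes: &[Vote], target: &Hash) -> (u64, f32) {
    let mut latest: HashMap<[u8; 32], &Vote> = HashMap::new();
    for vote in votes.iter().filter(|v| &v.assertion_hash == target) {
        let entry = latest.entry(vote.agent_id).or_insert(vote);
        // A strictly later timestamp replaces the agent's earlier vote.
        if vote.timestamp > entry.timestamp {
            *entry = vote;
        }
    }
    let total: f32 = latest.values().map(|v| v.weight).sum();
    (latest.len() as u64, total)
}
```

Because old votes are never rewritten, the full voting history stays available; only the aggregation step decides which record counts as current.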


The Epoch (Paradigm Shifts)

Epochs represent distinct periods of truth. When knowledge paradigms shift, old epochs can be superseded.

pub struct Epoch {
    pub id: EpochId,
    pub name: String,                           // "Pre-2024", "Newtonian"
    pub supersedes: Option<EpochId>,            // What this replaces
    pub supersession_type: Option<SupersessionType>,
    pub start_timestamp: u64,
    pub end_timestamp: Option<u64>,
}

pub enum SupersessionType {
    Invalidation,  // Old epoch was factually wrong (e.g., "Earth is flat")
    Temporal,      // Old epoch was correct but outdated (e.g., "President is Obama")
    Refinement,    // Old epoch was a simplification (e.g., Newtonian → Relativity)
}
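
The `supersedes` field forms a linked chain from newer epochs back to older ones. A minimal sketch of walking that chain is shown below, assuming epochs are held in an in-memory `HashMap`. The trimmed `Epoch` struct and the `supersession_chain` function are hypothetical and do not reflect the SupersessionStore API.

```rust
use std::collections::HashMap;

pub type EpochId = [u8; 32];

/// Trimmed-down Epoch with only the fields needed to walk the chain.
pub struct Epoch {
    pub id: EpochId,
    pub name: String,
    pub supersedes: Option<EpochId>,
}

/// Follow `supersedes` links from `start` back to the oldest known epoch.
/// Returns epoch names from newest to oldest.
/// Stops at an unknown epoch ID or when a cycle is detected.
pub fn supersession_chain(epochs: &HashMap<EpochId, Epoch>, start: EpochId) -> Vec<String> {
    let mut chain = Vec::new();
    let mut seen: Vec<EpochId> = Vec::new();
    let mut cursor = Some(start);
    while let Some(id) = cursor {
        if seen.contains(&id) {
            break; // cycle guard
        }
        seen.push(id);
        match epochs.get(&id) {
            Some(epoch) => {
                chain.push(epoch.name.clone());
                cursor = epoch.supersedes;
            }
            None => break,
        }
    }
    chain
}
```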

Query Results

MaterializedView (O(1) Winner Lookup)

Pre-computed resolution stored at MV:{subject}:{predicate}:

pub struct MaterializedView {
    /// The winning assertion from lens resolution
    pub winner: Assertion,

    /// Which lens produced this (e.g., "VoteAwareConsensus")
    pub lens_name: String,

    /// Confidence in the resolution (0.0 to 1.0)
    pub resolution_confidence: f32,

    /// How many candidates were considered
    pub candidates_count: usize,

    /// When this view was computed
    pub materialized_at: u64,
}

ConflictAnalysis (Trust but Verify)

Returned by the SkepticLens, which surfaces all competing claims instead of picking a winner:

pub struct ConflictAnalysis {
    /// Overall status: Unanimous, Agreed, or Contested
    pub status: ResolutionStatus,

    /// Conflict score (0.0 = unanimous, 1.0 = maximum chaos)
    /// Calculated using normalized Shannon entropy
    pub conflict_score: f32,

    /// All distinct claims, ranked by weight_share descending
    pub claims: Vec<ClaimSummary>,

    /// Total candidates considered
    pub candidates_count: usize,
}

pub enum ResolutionStatus {
    Unanimous,   // All agree (entropy < 0.1)
    Agreed,      // Strong majority (entropy < 0.4)
    Contested,   // Significant disagreement (entropy >= 0.4)
}
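
The conflict score and the status thresholds can be illustrated with a short sketch. It assumes the score is the Shannon entropy of the claims' weight shares divided by ln(n), where n is the number of distinct claims; the exact normalization used by the SkepticLens is not stated here, so treat this as one plausible reading. The function names `conflict_score` and `classify` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum ResolutionStatus {
    Unanimous,
    Agreed,
    Contested,
}

/// Normalized Shannon entropy of a weight distribution, in [0.0, 1.0].
/// `weights` need not sum to 1; they are normalized to shares first.
pub fn conflict_score(weights: &[f32]) -> f32 {
    let total: f32 = weights.iter().sum();
    // A single claim (or no support at all) has no conflict.
    if weights.len() < 2 || total <= 0.0 {
        return 0.0;
    }
    let entropy: f32 = weights
        .iter()
        .map(|w| w / total)
        .filter(|p| *p > 0.0) // treat 0 * ln(0) as 0
        .map(|p| -p * p.ln())
        .sum();
    entropy / (weights.len() as f32).ln()
}

/// Apply the documented thresholds (0.1 and 0.4) to a score.
pub fn classify(score: f32) -> ResolutionStatus {
    if score < 0.1 {
        ResolutionStatus::Unanimous
    } else if score < 0.4 {
        ResolutionStatus::Agreed
    } else {
        ResolutionStatus::Contested
    }
}
```

Under this reading, an even two-way split scores close to 1.0. A 95/5 split scores roughly 0.29 and would be classified as Agreed.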

ClaimSummary

A single competing claim within a ConflictAnalysis:

pub struct ClaimSummary {
    /// The claimed value
    pub value: ObjectValue,

    /// This claim's share of total support (0.0 to 1.0)
    pub weight_share: f32,

    /// Number of assertions making this claim
    pub assertion_count: u32,

    /// Hash of the highest-confidence assertion (for drill-down)
    pub representative_hash: Hash,

    /// Source provenance
    pub source: SourceSummary,

    /// Agents who signed assertions for this claim
    pub supporting_agents: Vec<AgentSummary>,
}

SourceSummary & AgentSummary

Provenance types for "show me the proof" UX:

pub struct SourceSummary {
    pub source_hash: Hash,           // Hash of source document
    pub visual_hash: Option<PHash>,  // Visual anchor (screenshot)
}

pub struct AgentSummary {
    pub agent_id: [u8; 32],   // Agent's public key
    pub trust_score: f32,     // Trust score at query time
}

Query Audit Trail

Every query is logged for "Why did you think that?" debugging:

pub struct QueryAudit {
    pub query_id: QueryId,
    pub agent_id: Option<[u8; 32]>,   // Who queried (from X-Agent-Id header)
    pub timestamp: u64,
    pub params: QueryParams,
    pub result_hash: Option<Hash>,    // Winning assertion hash
    pub result_confidence: f32,
    pub contributing_assertions: Vec<ContributingAssertion>,
}

pub struct QueryParams {
    pub subject: Option<EntityId>,
    pub predicate: Option<RelationId>,
    pub lifecycle: Option<LifecycleStage>,
    pub epoch: Option<EpochId>,
    pub lens: Option<String>,
}

pub struct ContributingAssertion {
    pub assertion_hash: Hash,
    pub weight: f32,          // How much this influenced the result
    pub source_hash: Hash,
    pub lifecycle: LifecycleStage,
}

Storage Layout

Key patterns in the KV store:

| Key Pattern                    | Value                | Purpose                     |
|--------------------------------|----------------------|-----------------------------|
| H:{hash}                       | Serialized Assertion | Primary assertion storage   |
| S:{subject}                    | Vec<Hash>            | Subject index               |
| SP:{subject}:{predicate}       | Vec<Hash>            | Compound index (O(1) lookup) |
| MV:{subject}:{predicate}       | MaterializedView     | Pre-computed winner         |
| V:{assertion_hash}:{vote_hash} | Vote                 | Individual votes            |
| VC:{assertion_hash}            | u64                  | Vote count cache            |
| VW:{assertion_hash}            | f32                  | Aggregate vote weight cache |
| TR:{agent_id}                  | TrustRank            | Agent reputation            |
| TP:{pack_id}                   | TrustPack            | Curated agent lists         |
| AUD:{query_id}                 | QueryAudit           | Query audit record          |
| E:{epoch_id}                   | Epoch                | Epoch definitions           |
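
The prefix structure of these keys matters for range scans: every vote on one assertion shares the prefix `V:{assertion_hash}:`. The sketch below illustrates this with a `BTreeMap` standing in for the KV store. Several things here are assumptions: keys are treated as strings, hashes are taken as already hex-encoded, and the helper names (`sp_key`, `vote_key`, `vote_keys_for`) are hypothetical. The real store may use raw bytes and a different scan API.

```rust
use std::collections::BTreeMap;

/// Build the compound-index key for a subject/predicate pair.
pub fn sp_key(subject: &str, predicate: &str) -> String {
    format!("SP:{}:{}", subject, predicate)
}

/// Build the key for one vote; both hashes are assumed to be hex strings.
pub fn vote_key(assertion_hex: &str, vote_hex: &str) -> String {
    format!("V:{}:{}", assertion_hex, vote_hex)
}

/// Collect all vote keys for one assertion with an ordered prefix scan.
pub fn vote_keys_for<'a>(
    store: &'a BTreeMap<String, Vec<u8>>,
    assertion_hex: &str,
) -> Vec<&'a String> {
    let prefix = format!("V:{}:", assertion_hex);
    store
        .range::<String, _>(prefix.clone()..)
        // Keys are sorted, so the scan can stop at the first non-match.
        .take_while(|(key, _)| key.starts_with(prefix.as_str()))
        .map(|(key, _)| key)
        .collect()
}
```

Note that a subject or predicate containing `:` would make string keys ambiguous under this scheme, which a real implementation would need to handle through escaping or length prefixes.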

The Trust Pack (Curator Economy)

Trust Packs are the "App Store for Trust" - curated lists of trusted agents that filter consensus through domain expertise.

pub struct TrustPack {
    /// Content-addressed pack ID (BLAKE3 hash)
    pub id: PackId,

    /// Human-readable name (e.g., "Mayo_Clinic_Experts")
    pub name: String,

    /// Ed25519 public key of the pack maintainer
    pub maintainer: [u8; 32],

    /// Agent public keys in this pack
    /// Future: Replace with RoaringBitmap for O(1) membership
    pub agents: Vec<[u8; 32]>,

    /// Unix timestamp when pack was created
    pub created_at: u64,

    /// Unix timestamp of last modification
    pub updated_at: u64,
}

Key Methods:

  • add_agent(agent_id) - Idempotent agent addition
  • remove_agent(agent_id) - Safe removal
  • contains_agent(agent_id) -> bool - Membership check

Use Case: Users subscribe to packs like "Skeptical Cardio Pack" to filter GLP-1 side effect claims through vetted cardiologists.
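
The three methods listed above could behave roughly as sketched here. This is an illustration on a trimmed `TrustPack` (name, agents, and `updated_at` only). The `now` parameter and the `bool` return values are assumptions added for the example; the documented signatures take only `agent_id`.

```rust
/// Trimmed-down TrustPack with only the fields this sketch touches.
pub struct TrustPack {
    pub name: String,
    pub agents: Vec<[u8; 32]>,
    pub updated_at: u64,
}

impl TrustPack {
    /// Membership check; a linear scan over the Vec.
    pub fn contains_agent(&self, agent_id: &[u8; 32]) -> bool {
        self.agents.contains(agent_id)
    }

    /// Idempotent: adding an agent that is already present changes nothing.
    /// Returns true only when the list actually changed.
    pub fn add_agent(&mut self, agent_id: [u8; 32], now: u64) -> bool {
        if self.contains_agent(&agent_id) {
            return false;
        }
        self.agents.push(agent_id);
        self.updated_at = now;
        true
    }

    /// Safe removal: removing an absent agent is a no-op.
    pub fn remove_agent(&mut self, agent_id: &[u8; 32], now: u64) -> bool {
        let before = self.agents.len();
        self.agents.retain(|a| a != agent_id);
        let changed = self.agents.len() != before;
        if changed {
            self.updated_at = now;
        }
        changed
    }
}
```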


Serialization

All types use rkyv for zero-copy deserialization:

use stemedb_core::serde::{serialize, deserialize};

// Serialize
let bytes: Vec<u8> = serialize(&assertion)?;

// Deserialize (zero-copy when possible)
let assertion: Assertion = deserialize(&bytes)?;

Critical Rule: Never use raw AllocSerializer in production code. Always use stemedb_core::serde::{serialize, deserialize}.


Relationship Diagram

                    ┌─────────────────────────────────────────────┐
                    │              ASSERTION                      │
                    │  ┌─────────┐ ┌───────────┐ ┌─────────────┐  │
                    │  │ subject │ │ predicate │ │   object    │  │
                    │  └─────────┘ └───────────┘ └─────────────┘  │
                    │                                             │
                    │  ┌─────────────────┐  ┌──────────────────┐  │
                    │  │   source_hash   │  │   signatures[]   │  │
                    │  └────────┬────────┘  └────────┬─────────┘  │
                    │           │                    │            │
                    └───────────┼────────────────────┼────────────┘
                                │                    │
                    ┌───────────▼───────┐  ┌────────▼────────┐
                    │  SOURCE DOCUMENT  │  │     AGENTS      │
                    │   (PDF, URL...)   │  │ (Ed25519 keys)  │
                    └───────────────────┘  └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │   TRUST RANK    │
                                           │  (reputation)   │
                                           └─────────────────┘


    ┌─────────────────┐         ┌─────────────────┐
    │      VOTE       │◄────────│   ASSERTION     │
    │  (Ballot Box)   │ votes   │    (target)     │
    │  weight: 0.0-1.0│  on     │                 │
    └─────────────────┘         └─────────────────┘


    ┌─────────────────┐         ┌─────────────────┐
    │     EPOCH B     │◄────────│    EPOCH A      │
    │  supersedes: A  │  older  │                 │
    │  type: Temporal │  epoch  │                 │
    └─────────────────┘         └─────────────────┘

API Representation

All binary data (hashes, signatures, agent IDs) is hex-encoded in JSON APIs:

{
  "subject": "Semaglutide",
  "predicate": "muscle_effect",
  "object": { "type": "Text", "value": "Significant loss" },
  "source_hash": "a1b2c3d4e5f6...",
  "signatures": [
    {
      "agent_id": "deadbeef...",
      "signature": "cafebabe...",
      "timestamp": 1706745600
    }
  ],
  "confidence": 0.85,
  "timestamp": 1706745600
}
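
Converting between a 32-byte `Hash` and its hex form needs only the standard library. The sketch below shows one way to do it. `encode_hex` and `decode_hash` are hypothetical helper names, and lowercase output is an assumption based on the example values above.

```rust
pub type Hash = [u8; 32];

/// Encode bytes as lowercase hex (two characters per byte).
pub fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decode a 64-character hex string into a Hash.
/// Returns None on wrong length or any non-hex character.
pub fn decode_hash(hex: &str) -> Option<Hash> {
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; 32];
    for (i, slot) in out.iter_mut().enumerate() {
        // Slicing by byte index is safe: every character is ASCII.
        *slot = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}
```

The same approach extends to 64-byte signatures by changing the expected length.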

See Also