StemeDB Data Structures

Last Updated: 2026-01-31

Source: crates/stemedb-core/src/types.rs

This document describes the core data structures in StemeDB (Episteme). These types form the foundation of the "Git for Truth" knowledge graph.


Design Principles

  1. Append-Only: Data is never mutated. New assertions create new records.
  2. Content-Addressed: Every assertion's ID is a BLAKE3 hash of its content.
  3. Zero-Copy: Uses rkyv for serialization - data can be read directly from disk without parsing.
  4. Provenance-First: Every fact carries its source, signers, and confidence.
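The first two principles can be illustrated with a minimal sketch. It is not StemeDB's storage code: `AppendOnlyStore`, `content_id`, and `ContentId` are hypothetical names, and a 64-bit hash from the standard library stands in for the 32-byte BLAKE3 hash so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content ID. StemeDB uses a 32-byte BLAKE3 hash; a 64-bit std
/// hash is used here only to keep the sketch dependency-free.
type ContentId = u64;

/// Derives the ID purely from the content (content addressing).
fn content_id(content: &str) -> ContentId {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory store: records are only ever added, never changed.
#[derive(Default)]
struct AppendOnlyStore {
    records: HashMap<ContentId, String>,
}

impl AppendOnlyStore {
    /// Stores `content` under its own hash and returns that ID.
    /// Re-inserting identical content leaves the store unchanged.
    fn put(&mut self, content: &str) -> ContentId {
        let id = content_id(content);
        self.records.entry(id).or_insert_with(|| content.to_string());
        id
    }

    /// Looks a record up by its content ID.
    fn get(&self, id: ContentId) -> Option<&str> {
        self.records.get(&id).map(String::as_str)
    }
}
```

Because the ID is a function of the content, writing the same claim twice yields the same key, and a "changed" claim is a different record with a different key.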

Primitive Types

pub type Hash = [u8; 32];       // BLAKE3 256-bit hash
pub type PHash = [u8; 8];       // Perceptual hash for images (8 bytes)
pub type EntityId = String;     // Subject or object identifier
pub type RelationId = String;   // Predicate identifier
pub type EpochId = Hash;        // Paradigm/era identifier
pub type QueryId = Hash;        // Query audit record identifier

The Assertion (Atomic Unit of Knowledge)

The Assertion is the fundamental unit. It represents a single claim about the world.

pub struct Assertion {
    // ═══════════════════════════════════════════════════════════
    // 1. THE FACT (What is being claimed)
    // ═══════════════════════════════════════════════════════════

    /// The entity this assertion is about (e.g., "Semaglutide", "Tesla_Inc")
    pub subject: EntityId,

    /// The relationship or property (e.g., "has_side_effect", "annual_revenue")
    pub predicate: RelationId,

    /// The claimed value
    pub object: ObjectValue,

    // ═══════════════════════════════════════════════════════════
    // 2. THE LINEAGE (Why we believe it)
    // ═══════════════════════════════════════════════════════════

    /// If this modifies/forks another assertion, its hash
    pub parent_hash: Option<Hash>,

    /// Hash of the source evidence (PDF, URL, database export)
    pub source_hash: Hash,

    /// Authority tier of the source (enables indexing and decay rates)
    pub source_class: SourceClass,

    /// Perceptual hash of a visual anchor (e.g., screenshot of table)
    pub visual_hash: Option<PHash>,

    /// Which paradigm/era this belongs to (for paradigm shifts)
    pub epoch: Option<EpochId>,

    /// Lifecycle stage (Proposed → Approved → Deprecated)
    pub lifecycle: LifecycleStage,

    // ═══════════════════════════════════════════════════════════
    // 3. META-COGNITION (Who said it, how sure are they)
    // ═══════════════════════════════════════════════════════════

    /// Cryptographic signatures from agents vouching for this
    pub signatures: Vec<SignatureEntry>,

    /// Subjective confidence score (0.0 to 1.0)
    pub confidence: f32,

    /// Unix timestamp when created
    pub timestamp: u64,

    /// Semantic embedding vector for similarity search
    pub vector: Option<Vec<f32>>,
}

ObjectValue

The value in a subject-predicate-object triple:

pub enum ObjectValue {
    Text(String),           // "muscle loss"
    Number(f64),            // 96.7
    Boolean(bool),          // true
    Reference(EntityId),    // Points to another entity (graph edge)
}

LifecycleStage

Assertions progress through stages (as new assertions, not mutations):

Proposed → UnderReview → Approved
                      ↘ Rejected
         ↘ Deprecated
pub enum LifecycleStage {
    Proposed,      // Initial idea, not for production use
    UnderReview,   // Gathering votes and feedback
    Approved,      // Accepted as current truth
    Deprecated,    // Was true, now superseded
    Rejected,      // Explicitly declined
}
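
The stage progression could be checked with a small helper like the one below. This is a sketch under assumptions: `can_transition` is a hypothetical function, and since the diagram leaves the origin of the Deprecated arrow ambiguous, the sketch assumes it leaves Approved, following the enum comment "Was true, now superseded".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LifecycleStage {
    Proposed,
    UnderReview,
    Approved,
    Deprecated,
    Rejected,
}

/// Returns true if a new assertion at stage `to` may follow one at `from`.
/// The allowed pairs are an assumption drawn from the diagram and comments.
fn can_transition(from: LifecycleStage, to: LifecycleStage) -> bool {
    use LifecycleStage::*;
    matches!(
        (from, to),
        (Proposed, UnderReview)
            | (UnderReview, Approved)
            | (UnderReview, Rejected)
            | (Approved, Deprecated) // assumed origin of the Deprecated arrow
    )
}
```

Since assertions are append-only, a "transition" here means writing a new assertion at the later stage, not updating the existing one.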

SourceClass

Authority tier classification for sources. Enables indexing by tier and tier-based decay rates:

| Tier | Class | Example | Default Decay |
|------|-------|---------|---------------|
| 0 | Regulatory | FDA, EMA, WHO | Never |
| 1 | Clinical | Phase III trials, peer-reviewed RCTs | 2 years |
| 2 | Observational | Real-world evidence, cohort studies | 1 year |
| 3 | Expert | Medical professional opinions, guidelines | 6 months |
| 4 | Community | Curated forums, patient advocacy groups | 3 months |
| 5 | Anecdotal | Reddit posts, individual testimonials | 1 month |

pub enum SourceClass {
    Regulatory,    // Tier 0: Highest authority, never decays
    Clinical,      // Tier 1: Peer-reviewed research
    Observational, // Tier 2: Real-world evidence
    Expert,        // Tier 3: Professional opinions (default)
    Community,     // Tier 4: Curated community knowledge
    Anecdotal,     // Tier 5: Individual reports, fast decay
}

impl SourceClass {
    pub fn tier(&self) -> u8;              // Returns 0-5
    pub fn default_decay_days(&self) -> Option<u32>;
    pub fn authority_weight(&self) -> f32; // 1.0 for Regulatory, 0.1 for Anecdotal
}

Key Benefits:

  • Indexing: SC:{source_class} index enables "show me only regulatory sources"
  • Decay rates: Anecdotal claims decay faster than clinical evidence
  • Trust weighting: Lenses can weight sources by authority tier in conflict resolution
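
One way the three method signatures above might be filled in from the tier table is sketched here. The tiers come straight from the table; the day counts assume 365-day years and 30-day months, and the intermediate `authority_weight` values are a linear interpolation between the two documented endpoints (1.0 and 0.1). Both are assumptions, not the crate's actual constants.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Community,
    Anecdotal,
}

impl SourceClass {
    /// Tier number from the table (0 = highest authority).
    pub fn tier(&self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Community => 4,
            SourceClass::Anecdotal => 5,
        }
    }

    /// Default decay window in days; `None` means the class never decays.
    /// Day counts are assumed conversions of the table's years and months.
    pub fn default_decay_days(&self) -> Option<u32> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730),
            SourceClass::Observational => Some(365),
            SourceClass::Expert => Some(180),
            SourceClass::Community => Some(90),
            SourceClass::Anecdotal => Some(30),
        }
    }

    /// 1.0 for Regulatory down to 0.1 for Anecdotal.
    /// The linear steps in between are an assumption.
    pub fn authority_weight(&self) -> f32 {
        1.0 - 0.18 * f32::from(self.tier())
    }
}
```

Returning `Option<u32>` for the decay window lets "Never" be expressed without a sentinel value such as `u32::MAX`.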

SignatureEntry

Cryptographic proof that an agent vouches for an assertion:

pub struct SignatureEntry {
    pub agent_id: [u8; 32],   // Ed25519 public key
    pub signature: [u8; 64],  // Ed25519 signature over assertion content
    pub timestamp: u64,       // When the agent signed
}

The Vote (High-Velocity Consensus)

Votes are separated from assertions to enable thousands of agents to vote simultaneously without lock contention (the "Ballot Box" pattern).

A vote is not just "I agree" - it's a cryptographic witness: "I saw this exact text at this URL at this time." This enables browser extension products where votes represent observations, not opinions.

pub struct Vote {
    /// Hash of the assertion being voted on
    pub assertion_hash: Hash,

    /// Ed25519 public key of the voter
    pub agent_id: [u8; 32],

    /// Weight of the vote (0.0 = reject, 1.0 = full endorsement)
    pub weight: f32,

    /// Signature over the assertion_hash
    pub signature: [u8; 64],

    /// When the vote was cast
    pub timestamp: u64,

    /// The URL where the claim was observed (optional)
    /// Enables provenance tracking: "I saw this at example.com/article"
    pub source_url: Option<String>,

    /// Optional context (page snippet, etc.) stored as bytes
    /// Same pattern as source_metadata on Assertion for rkyv zero-copy
    pub observed_context: Option<Vec<u8>>,
}

Key Insight: Votes are append-only. An agent can change their vote by submitting a new one with a later timestamp.

Provenance Witness: The source_url and observed_context fields transform votes from opinions into observations, enabling the browser extension to count "How many people saw this claim on this page?" rather than just "How many people agree?"
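
The append-only rule for votes implies that readers must collapse the vote log to each agent's most recent entry before tallying. A minimal sketch of that reduction follows. The `Vote` struct is trimmed to the fields the tally needs, `latest_votes` and `total_weight` are hypothetical helpers, and keeping the earlier vote when two timestamps are equal is an assumption.

```rust
use std::collections::HashMap;

type AgentId = [u8; 32];

/// Trimmed-down vote: only the fields needed for tallying.
#[derive(Debug, Clone)]
struct Vote {
    agent_id: AgentId,
    weight: f32,
    timestamp: u64,
}

/// Collapses an append-only vote log to each agent's most recent vote.
/// On equal timestamps the earlier log entry is kept (an assumption).
fn latest_votes(log: &[Vote]) -> HashMap<AgentId, &Vote> {
    let mut latest: HashMap<AgentId, &Vote> = HashMap::new();
    for vote in log {
        let is_newer = latest
            .get(&vote.agent_id)
            .map_or(true, |prev| vote.timestamp > prev.timestamp);
        if is_newer {
            latest.insert(vote.agent_id, vote);
        }
    }
    latest
}

/// Sums the weights of the surviving (latest) votes.
fn total_weight(log: &[Vote]) -> f32 {
    latest_votes(log).values().map(|v| v.weight).sum()
}
```

Nothing is deleted from the log: an agent's earlier votes stay available for audit, and only the read path ignores them.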


The Epoch (Paradigm Shifts)

Epochs represent distinct periods of truth. When knowledge paradigms shift, old epochs can be superseded.

pub struct Epoch {
    pub id: EpochId,
    pub name: String,                           // "Pre-2024", "Newtonian"
    pub supersedes: Option<EpochId>,            // What this replaces
    pub supersession_type: Option<SupersessionType>,
    pub start_timestamp: u64,
    pub end_timestamp: Option<u64>,
}

pub enum SupersessionType {
    Invalidation,  // Old epoch was factually wrong (e.g., "Earth is flat")
    Temporal,      // Old epoch was correct but outdated (e.g., "President is Obama")
    Refinement,    // Old epoch was a simplification (e.g., Newtonian → Relativity)
}
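
Because each epoch points at the one it supersedes, the history of a paradigm can be recovered by following those links. The sketch below assumes epochs are held in a map keyed by `EpochId`. `lineage` is a hypothetical helper and the `Epoch` struct is trimmed to the two fields it uses.

```rust
use std::collections::HashMap;

type EpochId = [u8; 32];

/// Trimmed-down epoch: only the fields needed to walk the chain.
struct Epoch {
    name: String,
    supersedes: Option<EpochId>,
}

/// Follows `supersedes` links from `start` back to the oldest ancestor and
/// returns the epoch names, newest first. Stops on a missing epoch, and the
/// length check bounds the walk if the links ever form a cycle.
fn lineage(epochs: &HashMap<EpochId, Epoch>, start: EpochId) -> Vec<String> {
    let mut names = Vec::new();
    let mut cursor = Some(start);
    while let Some(id) = cursor {
        match epochs.get(&id) {
            Some(epoch) if names.len() < epochs.len() => {
                names.push(epoch.name.clone());
                cursor = epoch.supersedes;
            }
            _ => break,
        }
    }
    names
}
```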

Query Results

MaterializedView (O(1) Winner Lookup)

Pre-computed resolution stored at MV:{subject}:{predicate}:

pub struct MaterializedView {
    /// The winning assertion from lens resolution
    pub winner: Assertion,

    /// Which lens produced this (e.g., "VoteAwareConsensus")
    pub lens_name: String,

    /// Confidence in the resolution (0.0 to 1.0)
    pub resolution_confidence: f32,

    /// How many candidates were considered
    pub candidates_count: usize,

    /// When this view was computed
    pub materialized_at: u64,
}

ConflictAnalysis (Trust but Verify)

For the SkepticLens - surfaces all competing claims instead of picking a winner:

pub struct ConflictAnalysis {
    /// Overall status: Unanimous, Agreed, or Contested
    pub status: ResolutionStatus,

    /// Conflict score (0.0 = unanimous, 1.0 = maximum chaos)
    /// Calculated using normalized Shannon entropy
    pub conflict_score: f32,

    /// All distinct claims, ranked by weight_share descending
    pub claims: Vec<ClaimSummary>,

    /// Total candidates considered
    pub candidates_count: usize,
}

pub enum ResolutionStatus {
    Unanimous,   // All agree (entropy < 0.1)
    Agreed,      // Strong majority (entropy < 0.4)
    Contested,   // Significant disagreement (entropy >= 0.4)
}
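
The conflict score and status thresholds can be sketched as follows. The thresholds are the ones in the enum comments. The normalization is an assumption: the entropy of the weight shares is divided by the natural log of the number of claims with non-zero weight, which maps an even split to 1.0. `conflict_score` and `classify` are hypothetical function names.

```rust
#[derive(Debug, PartialEq)]
pub enum ResolutionStatus {
    Unanimous,
    Agreed,
    Contested,
}

/// Normalized Shannon entropy of the claims' weight shares, in [0, 1].
/// Returns 0.0 when fewer than two claims carry any weight.
pub fn conflict_score(weight_shares: &[f32]) -> f32 {
    let total: f32 = weight_shares.iter().sum();
    let n = weight_shares.iter().filter(|w| **w > 0.0).count();
    if n < 2 || total <= 0.0 {
        return 0.0;
    }
    let entropy: f32 = weight_shares
        .iter()
        .filter(|w| **w > 0.0)
        .map(|w| {
            let p = w / total; // renormalize in case shares do not sum to 1
            -p * p.ln()
        })
        .sum();
    // Dividing by ln(n) maps the maximum entropy (even split) to 1.0.
    entropy / (n as f32).ln()
}

/// Applies the documented thresholds: < 0.1, < 0.4, and >= 0.4.
pub fn classify(score: f32) -> ResolutionStatus {
    if score < 0.1 {
        ResolutionStatus::Unanimous
    } else if score < 0.4 {
        ResolutionStatus::Agreed
    } else {
        ResolutionStatus::Contested
    }
}
```

Under this normalization a 97/3 split between two claims scores roughly 0.19 and lands in Agreed, while a 50/50 split scores 1.0 and is Contested.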

ClaimSummary

A single competing claim within a ConflictAnalysis:

pub struct ClaimSummary {
    /// The claimed value
    pub value: ObjectValue,

    /// This claim's share of total support (0.0 to 1.0)
    pub weight_share: f32,

    /// Number of assertions making this claim
    pub assertion_count: u32,

    /// Hash of the highest-confidence assertion (for drill-down)
    pub representative_hash: Hash,

    /// Source provenance
    pub source: SourceSummary,

    /// Agents who signed assertions for this claim
    pub supporting_agents: Vec<AgentSummary>,
}

SourceSummary & AgentSummary

Provenance types for "show me the proof" UX:

pub struct SourceSummary {
    pub source_hash: Hash,           // Hash of source document
    pub visual_hash: Option<PHash>,  // Visual anchor (screenshot)
}

pub struct AgentSummary {
    pub agent_id: [u8; 32],   // Agent's public key
    pub trust_score: f32,     // Trust score at query time
}

Query Audit Trail

Every query is logged for "Why did you think that?" debugging:

pub struct QueryAudit {
    pub query_id: QueryId,
    pub agent_id: Option<[u8; 32]>,   // Who queried (from X-Agent-Id header)
    pub timestamp: u64,
    pub params: QueryParams,
    pub result_hash: Option<Hash>,    // Winning assertion hash
    pub result_confidence: f32,
    pub contributing_assertions: Vec<ContributingAssertion>,
}

pub struct QueryParams {
    pub subject: Option<EntityId>,
    pub predicate: Option<RelationId>,
    pub lifecycle: Option<LifecycleStage>,
    pub epoch: Option<EpochId>,
    pub lens: Option<String>,
}

pub struct ContributingAssertion {
    pub assertion_hash: Hash,
    pub weight: f32,          // How much this influenced the result
    pub source_hash: Hash,
    pub lifecycle: LifecycleStage,
}

Storage Layout

Key patterns in the KV store:

| Key Pattern | Value | Purpose |
|-------------|-------|---------|
| `H:{hash}` | Serialized `Assertion` | Primary assertion storage |
| `S:{subject}` | `Vec<Hash>` | Subject index |
| `SP:{subject}:{predicate}` | `Vec<Hash>` | Compound index (O(1) lookup) |
| `MV:{subject}:{predicate}` | `MaterializedView` | Pre-computed winner |
| `V:{assertion_hash}:{vote_hash}` | `Vote` | Individual votes |
| `VC:{assertion_hash}` | `u64` | Vote count cache |
| `VW:{assertion_hash}` | `f32` | Aggregate vote weight cache |
| `TR:{agent_id}` | `TrustRank` | Agent reputation |
| `TP:{pack_id}` | `TrustPack` | Curated agent lists |
| `AUD:{query_id}` | `QueryAudit` | Query audit record |
| `E:{epoch_id}` | `Epoch` | Epoch definitions |
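
A few of these key patterns could be built with helpers like the ones below. This is a sketch only: the function names are hypothetical, and it assumes keys are UTF-8 strings with hashes rendered as lowercase hex, which the table does not specify.

```rust
type Hash = [u8; 32];

/// Lowercase hex rendering of a byte slice (assumed key encoding).
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// `H:{hash}`: primary assertion storage.
fn assertion_key(hash: &Hash) -> String {
    format!("H:{}", to_hex(hash))
}

/// `SP:{subject}:{predicate}`: compound index.
fn subject_predicate_key(subject: &str, predicate: &str) -> String {
    format!("SP:{}:{}", subject, predicate)
}

/// `MV:{subject}:{predicate}`: pre-computed winner.
fn materialized_view_key(subject: &str, predicate: &str) -> String {
    format!("MV:{}:{}", subject, predicate)
}

/// `V:{assertion_hash}:{vote_hash}`: one key per individual vote.
fn vote_key(assertion_hash: &Hash, vote_hash: &Hash) -> String {
    format!("V:{}:{}", to_hex(assertion_hash), to_hex(vote_hash))
}
```

Note that this sketch does no escaping, so a subject or predicate containing `:` would make the compound keys ambiguous. Giving each vote its own key under the assertion's prefix fits the Ballot Box pattern, since concurrent voters never write to the same key.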

The Trust Pack (Curator Economy)

Trust Packs are the "App Store for Trust" - curated lists of trusted agents that filter consensus through domain expertise.

pub struct TrustPack {
    /// Content-addressed pack ID (BLAKE3 hash)
    pub id: PackId,

    /// Human-readable name (e.g., "Mayo_Clinic_Experts")
    pub name: String,

    /// Ed25519 public key of the pack maintainer
    pub maintainer: [u8; 32],

    /// Agent public keys in this pack
    /// Future: Replace with RoaringBitmap for O(1) membership
    pub agents: Vec<[u8; 32]>,

    /// Unix timestamp when pack was created
    pub created_at: u64,

    /// Unix timestamp of last modification
    pub updated_at: u64,
}

Key Methods:

  • add_agent(agent_id) - Idempotent agent addition
  • remove_agent(agent_id) - Safe removal
  • contains_agent(agent_id) -> bool - Membership check

Use Case: Users subscribe to packs like "Skeptical Cardio Pack" to filter GLP-1 side effect claims through vetted cardiologists.
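
The three methods listed above might look like this over the current `Vec`-backed `agents` field. The struct is trimmed to two fields, and the `bool` return values on `add_agent` and `remove_agent`, as well as passing the ID by reference for lookups, are assumptions rather than the crate's documented signatures.

```rust
type AgentId = [u8; 32];

/// Trimmed-down pack: only the fields the membership methods touch.
#[derive(Debug, Default)]
pub struct TrustPack {
    pub name: String,
    pub agents: Vec<AgentId>,
}

impl TrustPack {
    /// Idempotent add: returns true only if the agent was not already present.
    pub fn add_agent(&mut self, agent_id: AgentId) -> bool {
        if self.contains_agent(&agent_id) {
            return false;
        }
        self.agents.push(agent_id);
        true
    }

    /// Safe removal: returns true if the agent was present, false otherwise.
    pub fn remove_agent(&mut self, agent_id: &AgentId) -> bool {
        let before = self.agents.len();
        self.agents.retain(|a| a != agent_id);
        self.agents.len() != before
    }

    /// Linear scan over the Vec, so O(n) in pack size.
    pub fn contains_agent(&self, agent_id: &AgentId) -> bool {
        self.agents.contains(agent_id)
    }
}
```

A real implementation would presumably also bump `updated_at` on every successful add or remove; that is left out here.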


Serialization

All types use rkyv for zero-copy deserialization:

use stemedb_core::serde::{serialize, deserialize};

// Serialize
let bytes: Vec<u8> = serialize(&assertion)?;

// Deserialize (zero-copy when possible)
let assertion: Assertion = deserialize(&bytes)?;

Critical Rule: Never use raw AllocSerializer in production code. Always use stemedb_core::serde::{serialize, deserialize}.


Relationship Diagram

                    ┌─────────────────────────────────────────────┐
                    │              ASSERTION                      │
                    │  ┌─────────┐ ┌───────────┐ ┌─────────────┐  │
                    │  │ subject │ │ predicate │ │   object    │  │
                    │  └─────────┘ └───────────┘ └─────────────┘  │
                    │                                             │
                    │  ┌─────────────────┐  ┌──────────────────┐  │
                    │  │   source_hash   │  │   signatures[]   │  │
                    │  └────────┬────────┘  └────────┬─────────┘  │
                    │           │                    │            │
                    └───────────┼────────────────────┼────────────┘
                                │                    │
                    ┌───────────▼───────┐  ┌────────▼────────┐
                    │  SOURCE DOCUMENT  │  │     AGENTS      │
                    │   (PDF, URL...)   │  │ (Ed25519 keys)  │
                    └───────────────────┘  └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │   TRUST RANK    │
                                           │  (reputation)   │
                                           └─────────────────┘


    ┌─────────────────┐         ┌─────────────────┐
    │      VOTE       │◄────────│   ASSERTION     │
    │  (Ballot Box)   │ votes   │    (target)     │
    │  weight: 0.0-1.0│  on     │                 │
    └─────────────────┘         └─────────────────┘


    ┌─────────────────┐         ┌─────────────────┐
    │     EPOCH B     │◄────────│    EPOCH A      │
    │  supersedes: A  │  older  │                 │
    │  type: Temporal │  epoch  │                 │
    └─────────────────┘         └─────────────────┘

API Representation

All binary data (hashes, signatures, agent IDs) is hex-encoded in JSON APIs:

{
  "subject": "Semaglutide",
  "predicate": "muscle_effect",
  "object": { "type": "Text", "value": "Significant loss" },
  "source_hash": "a1b2c3d4e5f6...",
  "signatures": [
    {
      "agent_id": "deadbeef...",
      "signature": "cafebabe...",
      "timestamp": 1706745600
    }
  ],
  "confidence": 0.85,
  "timestamp": 1706745600
}
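
Converting between the 32-byte in-memory form and the hex strings in the JSON could be done with helpers like these. `encode_hex` and `decode_hash` are hypothetical names. The sketch assumes lowercase output and a decoder that accepts either case and rejects anything that is not exactly 64 hex digits.

```rust
/// Encodes bytes as lowercase hex, two characters per byte.
fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decodes a 64-character hex string into a 32-byte hash.
/// Returns `None` on wrong length or any non-hex character.
fn decode_hash(hex: &str) -> Option<[u8; 32]> {
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        // Safe to slice by byte offset: the input is known to be ASCII.
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}
```

The explicit hex-digit check matters because `u8::from_str_radix` on its own accepts a leading `+`, which would let malformed input through.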

See Also