stemedb/docs/data-structures.md


# StemeDB Data Structures
> **Last Updated:** 2026-02-19
> **Source:** `crates/stemedb-core/src/types.rs`

This document describes the core data structures in StemeDB (Episteme). These types form the foundation of the "Git for Truth" knowledge graph.

---
## Design Principles
1. **Append-Only**: Data is never mutated. New assertions create new records.
2. **Content-Addressed**: Every assertion's ID is a BLAKE3 hash of its content.
3. **Zero-Copy**: Uses `rkyv` for serialization - data can be read directly from disk without parsing.
4. **Provenance-First**: Every fact carries its source, signers, and confidence.
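
The first two principles can be illustrated with a small in-memory sketch. Everything here is hypothetical: `AppendOnlyStore`, `content_id`, and `ContentId` are names invented for illustration, and the standard library's `DefaultHasher` is swapped in for BLAKE3 only to keep the example dependency-free. It is not cryptographic and produces a 64-bit value rather than the 32-byte `Hash`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the 32-byte BLAKE3 hash (assumption: NOT cryptographic).
type ContentId = u64;

/// Hypothetical in-memory store keyed by the hash of each record's bytes.
#[derive(Default)]
struct AppendOnlyStore {
    records: HashMap<ContentId, Vec<u8>>,
}

/// Derives a record's ID purely from its content.
fn content_id(bytes: &[u8]) -> ContentId {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

impl AppendOnlyStore {
    /// Adds a record and returns its ID. Existing entries are never
    /// overwritten, so writing identical content twice is a no-op.
    fn put(&mut self, bytes: &[u8]) -> ContentId {
        let id = content_id(bytes);
        self.records.entry(id).or_insert_with(|| bytes.to_vec());
        id
    }

    /// Looks a record up by its content-derived ID.
    fn get(&self, id: ContentId) -> Option<&[u8]> {
        self.records.get(&id).map(Vec::as_slice)
    }
}
```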
---
## Primitive Types
```rust
pub type Hash = [u8; 32]; // BLAKE3 256-bit hash
pub type PHash = [u8; 8]; // Perceptual hash for images (8 bytes)
pub type EntityId = String; // Subject or object identifier
pub type RelationId = String; // Predicate identifier
pub type EpochId = Hash; // Paradigm/era identifier
pub type QueryId = Hash; // Query audit record identifier
```
---
## The Assertion (Atomic Unit of Knowledge)
The `Assertion` is the fundamental unit. It represents a single claim about the world.
```rust
pub struct Assertion {
    // ═══════════════════════════════════════════════════════════
    // 1. THE FACT (What is being claimed)
    // ═══════════════════════════════════════════════════════════
    /// The entity this assertion is about (e.g., "Semaglutide", "Tesla_Inc")
    pub subject: EntityId,
    /// The relationship or property (e.g., "has_side_effect", "annual_revenue")
    pub predicate: RelationId,
    /// The claimed value
    pub object: ObjectValue,

    // ═══════════════════════════════════════════════════════════
    // 2. THE LINEAGE (Why we believe it)
    // ═══════════════════════════════════════════════════════════
    /// If this modifies/forks another assertion, its hash
    pub parent_hash: Option<Hash>,
    /// Hash of the source evidence (PDF, URL, database export)
    pub source_hash: Hash,
    /// Authority tier of the source (enables indexing and decay rates)
    pub source_class: SourceClass,
    /// Perceptual hash of a visual anchor (e.g., screenshot of table)
    pub visual_hash: Option<PHash>,
    /// Which paradigm/era this belongs to (for paradigm shifts)
    pub epoch: Option<EpochId>,
    /// Lifecycle stage (Proposed → Approved → Deprecated)
    pub lifecycle: LifecycleStage,

    // ═══════════════════════════════════════════════════════════
    // 3. META-COGNITION (Who said it, how sure are they)
    // ═══════════════════════════════════════════════════════════
    /// Cryptographic signatures from agents vouching for this
    pub signatures: Vec<SignatureEntry>,
    /// Subjective confidence score (0.0 to 1.0)
    pub confidence: f32,
    /// Unix timestamp when created
    pub timestamp: u64,
    /// Semantic embedding vector for similarity search
    pub vector: Option<Vec<f32>>,
}
```
### ObjectValue
The value in a subject-predicate-object triple:
```rust
pub enum ObjectValue {
    Text(String),        // "muscle loss"
    Number(f64),         // 96.7
    Boolean(bool),       // true
    Reference(EntityId), // Points to another entity (graph edge)
}
```
### LifecycleStage
Assertions progress through stages (as new assertions, not mutations):
```
Proposed → UnderReview → Approved
                ↘ Rejected
                              ↘ Deprecated
```
```rust
pub enum LifecycleStage {
    Proposed,    // Initial idea, not for production use
    UnderReview, // Gathering votes and feedback
    Approved,    // Accepted as current truth
    Deprecated,  // Was true, now superseded
    Rejected,    // Explicitly declined
}
```
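A transition check might look like the sketch below. The allowed transitions are an assumption inferred from the stage comments (a claim is rejected during review; an approved claim is later deprecated), and `can_transition_to` is a hypothetical helper, not a method confirmed to exist in `types.rs`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LifecycleStage {
    Proposed,
    UnderReview,
    Approved,
    Deprecated,
    Rejected,
}

impl LifecycleStage {
    /// Assumed rules: Proposed → UnderReview → Approved → Deprecated,
    /// with UnderReview → Rejected as the only other exit.
    pub fn can_transition_to(self, next: LifecycleStage) -> bool {
        use LifecycleStage::*;
        matches!(
            (self, next),
            (Proposed, UnderReview)
                | (UnderReview, Approved)
                | (UnderReview, Rejected)
                | (Approved, Deprecated)
        )
    }
}
```

Because the store is append-only, a valid transition would be recorded as a new assertion carrying the new stage (presumably linked via `parent_hash`) rather than by editing the existing record.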
### SourceClass
Authority tier classification for sources. Enables indexing by tier and tier-based decay rates:
| Tier | Class | Example | Default Decay |
|------|-------|---------|---------------|
| 0 | Regulatory | FDA, EMA, WHO | Never |
| 1 | Clinical | Phase III trials, peer-reviewed RCTs | 2 years |
| 2 | Observational | Real-world evidence, cohort studies | 1 year |
| 3 | Expert | Medical professional opinions, guidelines | 6 months |
| 4 | Community | Curated forums, patient advocacy groups | 3 months |
| 5 | Anecdotal | Reddit posts, individual testimonials | 1 month |
```rust
pub enum SourceClass {
    Regulatory,    // Tier 0: Highest authority, never decays
    Clinical,      // Tier 1: Peer-reviewed research
    Observational, // Tier 2: Real-world evidence
    Expert,        // Tier 3: Professional opinions (default)
    Community,     // Tier 4: Curated community knowledge
    Anecdotal,     // Tier 5: Individual reports, fast decay
}

impl SourceClass {
    pub fn tier(&self) -> u8; // Returns 0-5
    pub fn default_decay_days(&self) -> Option<u32>;
    pub fn authority_weight(&self) -> f32; // 1.0 for Regulatory, 0.1 for Anecdotal
}
```
**Key Benefits:**
- **Indexing**: `SC:{source_class}` index enables "show me only regulatory sources"
- **Decay rates**: Anecdotal claims decay faster than clinical evidence
- **Trust weighting**: Lenses can weight sources by authority tier in conflict resolution
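
One way the three methods could be filled in from the table above is sketched here. The day counts are approximations of the table's durations (for example, 2 years as 730 days), and the intermediate authority weights are assumptions: the source only states 1.0 for Regulatory and 0.1 for Anecdotal.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Community,
    Anecdotal,
}

impl SourceClass {
    /// Tier number from the table (0 = highest authority).
    pub fn tier(&self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Community => 4,
            SourceClass::Anecdotal => 5,
        }
    }

    /// Default decay in days; `None` means the class never decays.
    /// Day counts approximate the table's durations (assumption).
    pub fn default_decay_days(&self) -> Option<u32> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730),
            SourceClass::Observational => Some(365),
            SourceClass::Expert => Some(180),
            SourceClass::Community => Some(90),
            SourceClass::Anecdotal => Some(30),
        }
    }

    /// Only the endpoints (1.0 and 0.1) come from the source;
    /// the values in between are illustrative assumptions.
    pub fn authority_weight(&self) -> f32 {
        match self {
            SourceClass::Regulatory => 1.0,
            SourceClass::Clinical => 0.8,
            SourceClass::Observational => 0.6,
            SourceClass::Expert => 0.4,
            SourceClass::Community => 0.2,
            SourceClass::Anecdotal => 0.1,
        }
    }
}
```

Returning `Option<u32>` lets "never decays" be expressed as `None` instead of a sentinel such as `u32::MAX`.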
### SignatureEntry
Cryptographic proof that an agent vouches for an assertion:
```rust
pub struct SignatureEntry {
    pub agent_id: [u8; 32],  // Ed25519 public key
    pub signature: [u8; 64], // Ed25519 signature over assertion content
    pub timestamp: u64,      // When the agent signed
}
```
---
## The Vote (High-Velocity Consensus)
Votes are separated from assertions to enable thousands of agents to vote simultaneously without lock contention (the "Ballot Box" pattern).
A vote is not just "I agree" - it's a **cryptographic witness**: "I saw this exact text at this URL at this time." This enables browser extension products where votes represent observations, not opinions.
```rust
pub struct Vote {
    /// Hash of the assertion being voted on
    pub assertion_hash: Hash,
    /// Ed25519 public key of the voter
    pub agent_id: [u8; 32],
    /// Weight of the vote (0.0 = reject, 1.0 = full endorsement)
    pub weight: f32,
    /// Signature over the assertion_hash
    pub signature: [u8; 64],
    /// When the vote was cast
    pub timestamp: u64,
    /// The URL where the claim was observed (optional)
    /// Enables provenance tracking: "I saw this at example.com/article"
    pub source_url: Option<String>,
    /// Optional context (page snippet, etc.) stored as bytes
    /// Same pattern as source_metadata on Assertion for rkyv zero-copy
    pub observed_context: Option<Vec<u8>>,
}
```
**Key Insight**: Votes are append-only. An agent can change their vote by submitting a new one with a later timestamp.
**Provenance Witness**: The `source_url` and `observed_context` fields transform votes from opinions into observations, enabling the browser extension to count "How many people saw this claim on this page?" rather than just "How many people agree?"
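
Since votes are never updated in place, a reader has to derive each agent's current vote from the log. The sketch below shows one way to do that with a trimmed-down `Vote`. The function names `effective_votes` and `aggregate_weight` are hypothetical, and letting the later log entry win when timestamps tie is an assumption. How the actual `VW:` cache is computed is not specified here.

```rust
use std::collections::HashMap;

/// Trimmed-down vote carrying only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Vote {
    pub agent_id: [u8; 32],
    pub weight: f32,
    pub timestamp: u64,
}

/// Returns each agent's most recent vote from an append-only log.
/// On equal timestamps the later log entry wins (assumption).
pub fn effective_votes(log: &[Vote]) -> HashMap<[u8; 32], &Vote> {
    let mut latest: HashMap<[u8; 32], &Vote> = HashMap::new();
    for vote in log {
        let supersedes = latest
            .get(&vote.agent_id)
            .map_or(true, |current| vote.timestamp >= current.timestamp);
        if supersedes {
            latest.insert(vote.agent_id, vote);
        }
    }
    latest
}

/// Sums the weights of the effective votes only.
pub fn aggregate_weight(log: &[Vote]) -> f32 {
    effective_votes(log).values().map(|v| v.weight).sum()
}
```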

---
## The Epoch (Paradigm Shifts)
Epochs represent distinct periods of truth. When knowledge paradigms shift, old epochs can be superseded.
```rust
pub struct Epoch {
    pub id: EpochId,
    pub name: String,                // "Pre-2024", "Newtonian"
    pub supersedes: Option<EpochId>, // What this replaces
    pub supersession_type: Option<SupersessionType>,
    pub start_timestamp: u64,
    pub end_timestamp: Option<u64>,
}

pub enum SupersessionType {
    Invalidation, // Old epoch was factually wrong (e.g., "Earth is flat")
    Temporal,     // Old epoch was correct but outdated (e.g., "President is Obama")
    Refinement,   // Old epoch was a simplification (e.g., Newtonian → Relativity)
}
```
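Because each epoch points back at the one it replaces, the history of a paradigm can be recovered by following `supersedes` links. The sketch below uses a trimmed `Epoch`. The `lineage` helper and the in-memory `HashMap` lookup are hypothetical stand-ins for whatever the store does with `E:{epoch_id}` keys.

```rust
use std::collections::HashMap;

pub type EpochId = [u8; 32];

/// Trimmed-down epoch carrying only the fields this sketch needs.
pub struct Epoch {
    pub id: EpochId,
    pub name: String,
    pub supersedes: Option<EpochId>,
}

/// Walks `supersedes` links from `start` back to the oldest known epoch,
/// returning names from newest to oldest.
pub fn lineage(epochs: &HashMap<EpochId, Epoch>, start: EpochId) -> Vec<&str> {
    let mut names = Vec::new();
    let mut cursor = Some(start);
    while let Some(id) = cursor {
        let epoch = match epochs.get(&id) {
            Some(epoch) => epoch,
            None => break, // unknown epoch: stop at the last known link
        };
        names.push(epoch.name.as_str());
        // Bounded by the map size so a malformed cycle cannot loop forever.
        if names.len() >= epochs.len() {
            break;
        }
        cursor = epoch.supersedes;
    }
    names
}
```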
---
## Query Results
### MaterializedView (O(1) Winner Lookup)
Pre-computed resolution stored at `MV:{subject}:{predicate}`:
```rust
pub struct MaterializedView {
    /// The winning assertion from lens resolution
    pub winner: Assertion,
    /// Which lens produced this (e.g., "VoteAwareConsensus")
    pub lens_name: String,
    /// Confidence in the resolution (0.0 to 1.0)
    pub resolution_confidence: f32,
    /// How many candidates were considered
    pub candidates_count: usize,
    /// When this view was computed
    pub materialized_at: u64,
}
```
### ConflictAnalysis (Trust but Verify)
Returned by the SkepticLens, which surfaces all competing claims instead of picking a winner:
```rust
pub struct ConflictAnalysis {
    /// Overall status: Unanimous, Agreed, or Contested
    pub status: ResolutionStatus,
    /// Conflict score (0.0 = unanimous, 1.0 = maximum chaos)
    /// Calculated using normalized Shannon entropy
    pub conflict_score: f32,
    /// All distinct claims, ranked by weight_share descending
    pub claims: Vec<ClaimSummary>,
    /// Total candidates considered
    pub candidates_count: usize,
}

pub enum ResolutionStatus {
    Unanimous, // All agree (entropy < 0.1)
    Agreed,    // Strong majority (entropy < 0.4)
    Contested, // Significant disagreement (entropy >= 0.4)
}
```
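The conflict score and status thresholds can be sketched as follows. Normalizing by the logarithm of the number of claims with non-zero weight is an assumption about what "normalized Shannon entropy" means here, and `conflict_score` and `classify` are hypothetical function names.

```rust
#[derive(Debug, PartialEq)]
pub enum ResolutionStatus {
    Unanimous,
    Agreed,
    Contested,
}

/// Shannon entropy of the claims' weight shares, divided by ln(n) so the
/// result lies in [0.0, 1.0]. Zero-weight claims are ignored (assumption).
pub fn conflict_score(weight_shares: &[f32]) -> f32 {
    let positive: Vec<f32> = weight_shares.iter().copied().filter(|w| *w > 0.0).collect();
    if positive.len() < 2 {
        return 0.0; // zero or one claim: nothing to disagree about
    }
    let total: f32 = positive.iter().sum();
    let entropy: f32 = positive
        .iter()
        .map(|w| {
            let p = w / total;
            -p * p.ln()
        })
        .sum();
    (entropy / (positive.len() as f32).ln()).clamp(0.0, 1.0)
}

/// Applies the thresholds listed on `ResolutionStatus`.
pub fn classify(score: f32) -> ResolutionStatus {
    if score < 0.1 {
        ResolutionStatus::Unanimous
    } else if score < 0.4 {
        ResolutionStatus::Agreed
    } else {
        ResolutionStatus::Contested
    }
}
```

Under this normalization, two claims with shares 0.95 and 0.05 score roughly 0.29 (`Agreed`), while an even 0.5/0.5 split scores 1.0 (`Contested`).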
### ClaimSummary
A single competing claim within a ConflictAnalysis:
```rust
pub struct ClaimSummary {
    /// The claimed value
    pub value: ObjectValue,
    /// This claim's share of total support (0.0 to 1.0)
    pub weight_share: f32,
    /// Number of assertions making this claim
    pub assertion_count: u32,
    /// Hash of the highest-confidence assertion (for drill-down)
    pub representative_hash: Hash,
    /// Source provenance
    pub source: SourceSummary,
    /// Agents who signed assertions for this claim
    pub supporting_agents: Vec<AgentSummary>,
}
```
### SourceSummary & AgentSummary
Provenance types for "show me the proof" UX:
```rust
pub struct SourceSummary {
    pub source_hash: Hash,          // Hash of source document
    pub visual_hash: Option<PHash>, // Visual anchor (screenshot)
}

pub struct AgentSummary {
    pub agent_id: [u8; 32], // Agent's public key
    pub trust_score: f32,   // Trust score at query time
}
```
---
## Query Audit Trail
Every query is logged for "Why did you think that?" debugging:
```rust
pub struct QueryAudit {
    pub query_id: QueryId,
    pub agent_id: Option<[u8; 32]>, // Who queried (from X-Agent-Id header)
    pub timestamp: u64,
    pub params: QueryParams,
    pub result_hash: Option<Hash>,  // Winning assertion hash
    pub result_confidence: f32,
    pub contributing_assertions: Vec<ContributingAssertion>,
}

pub struct QueryParams {
    pub subject: Option<EntityId>,
    pub predicate: Option<RelationId>,
    pub lifecycle: Option<LifecycleStage>,
    pub epoch: Option<EpochId>,
    pub lens: Option<String>,
}

pub struct ContributingAssertion {
    pub assertion_hash: Hash,
    pub weight: f32, // How much this influenced the result
    pub source_hash: Hash,
    pub lifecycle: LifecycleStage,
}
```
---
## Storage Layout
Key patterns in the KV store:
| Key Pattern | Value | Purpose |
|-------------|-------|---------|
| `H:{hash}` | Serialized Assertion | Primary assertion storage |
| `S:{subject}` | `Vec<Hash>` | Subject index |
| `SP:{subject}:{predicate}` | `Vec<Hash>` | Compound index (O(1) lookup) |
| `MV:{subject}:{predicate}` | MaterializedView | Pre-computed winner |
| `V:{assertion_hash}:{vote_hash}` | Vote | Individual votes |
| `VC:{assertion_hash}` | u64 | Vote count cache |
| `VW:{assertion_hash}` | f32 | Aggregate vote weight cache |
| `TR:{agent_id}` | TrustRank | Agent reputation |
| `TP:{pack_id}` | TrustPack | Curated agent lists |
| `AUD:{query_id}` | QueryAudit | Query audit record |
| `E:{epoch_id}` | Epoch | Epoch definitions |
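
The key patterns above could be built with small helpers like the ones below. This is a sketch under assumptions: the function names are hypothetical, and rendering hashes as lowercase hex inside the key is a guess (the store may use raw bytes instead). `vote_prefix` shows how the `V:` layout lends itself to a prefix scan over all votes for one assertion.

```rust
pub type Hash = [u8; 32];

/// Lowercase hex rendering (assumption: keys embed hashes as hex text).
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// `H:{hash}`: primary assertion storage.
pub fn assertion_key(hash: &Hash) -> String {
    format!("H:{}", hex(hash))
}

/// `S:{subject}`: subject index.
pub fn subject_key(subject: &str) -> String {
    format!("S:{}", subject)
}

/// `SP:{subject}:{predicate}`: compound index.
pub fn subject_predicate_key(subject: &str, predicate: &str) -> String {
    format!("SP:{}:{}", subject, predicate)
}

/// `MV:{subject}:{predicate}`: pre-computed winner.
pub fn materialized_view_key(subject: &str, predicate: &str) -> String {
    format!("MV:{}:{}", subject, predicate)
}

/// Shared prefix of every vote key for one assertion.
pub fn vote_prefix(assertion_hash: &Hash) -> String {
    format!("V:{}:", hex(assertion_hash))
}

/// `V:{assertion_hash}:{vote_hash}`: an individual vote.
pub fn vote_key(assertion_hash: &Hash, vote_hash: &Hash) -> String {
    format!("{}{}", vote_prefix(assertion_hash), hex(vote_hash))
}
```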
---
## The Trust Pack (Curator Economy)
Trust Packs are the "App Store for Trust" - curated lists of trusted agents that filter consensus through domain expertise.
```rust
pub struct TrustPack {
    /// Content-addressed pack ID (BLAKE3 hash)
    pub id: PackId,
    /// Human-readable name (e.g., "Mayo_Clinic_Experts")
    pub name: String,
    /// Ed25519 public key of the pack maintainer
    pub maintainer: [u8; 32],
    /// Agent public keys in this pack
    /// Future: Replace with RoaringBitmap for O(1) membership
    pub agents: Vec<[u8; 32]>,
    /// Unix timestamp when pack was created
    pub created_at: u64,
    /// Unix timestamp of last modification
    pub updated_at: u64,
}
```
**Key Methods:**
- `add_agent(agent_id)` - Idempotent agent addition
- `remove_agent(agent_id)` - Safe removal
- `contains_agent(agent_id) -> bool` - Membership check
**Use Case:** Users subscribe to packs like "Skeptical Cardio Pack" to filter GLP-1 side effect claims through vetted cardiologists.
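
A minimal sketch of the three key methods, using a trimmed `TrustPack` with only `name` and `agents`. The `bool` return values on `add_agent` and `remove_agent` are assumptions added for illustration; the source only names the methods and says addition is idempotent and removal is safe.

```rust
/// Trimmed-down pack carrying only the fields this sketch needs.
pub struct TrustPack {
    pub name: String,
    pub agents: Vec<[u8; 32]>,
}

impl TrustPack {
    /// Linear membership check over the agent list.
    pub fn contains_agent(&self, agent_id: &[u8; 32]) -> bool {
        self.agents.contains(agent_id)
    }

    /// Idempotent: adding an existing agent leaves the pack unchanged.
    /// Returns true only when the agent was newly added (assumption).
    pub fn add_agent(&mut self, agent_id: [u8; 32]) -> bool {
        if self.contains_agent(&agent_id) {
            return false;
        }
        self.agents.push(agent_id);
        true
    }

    /// Safe: removing an absent agent is a no-op.
    /// Returns true only when something was removed (assumption).
    pub fn remove_agent(&mut self, agent_id: &[u8; 32]) -> bool {
        let before = self.agents.len();
        self.agents.retain(|a| a != agent_id);
        self.agents.len() != before
    }
}
```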

---
## The SourceRecord (Source Registry)
The Source Registry maps content-addressed source hashes to human-readable metadata. This enables the dashboard to show "FDA Approval Letter for Wegovy" instead of a raw BLAKE3 hash.
```rust
pub struct SourceRecord {
    /// Content-addressed hash of the source (BLAKE3, 32 bytes).
    pub hash: [u8; 32],
    /// Human-readable label.
    pub label: String,
    /// Optional URL where the source can be accessed.
    pub url: Option<String>,
    /// Authority tier (0-5), matching SourceClass.
    pub tier: u8,
    /// Current status (Active, Deprecated, Quarantined).
    pub status: SourceStatus,
    /// HLC timestamp when the record was created.
    pub created_at: u64,
    /// HLC timestamp of the last update.
    pub updated_at: u64,
    /// Optional curator notes about the source.
    pub notes: Option<String>,
    /// Optional full-text content of the source document.
    /// Populated by pipelines that extract text from PDFs.
    /// Max size: 1 MB (MAX_SOURCE_CONTENT_LEN).
    pub content: Option<String>,
}
```
**Key Points:**
- **Status lifecycle:** Active → Deprecated or Quarantined (curator-driven)
- **Content field:** Stores extracted document text (e.g., from `pdftotext`). Stripped from list responses (`GET /v1/sources`) to avoid returning megabytes; included in single-source responses (`GET /v1/sources/{hash}`)
- **rkyv compat:** Uses `deserialize_source_record_compat()` for backward compatibility with data written before the `content` field was added
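
The content rules above can be sketched with a trimmed `SourceRecord`. `validate_content` and `strip_content_for_list` are hypothetical helper names, and treating 1 MB as 1024 × 1024 bytes with an inclusive limit is an assumption; only the constant name `MAX_SOURCE_CONTENT_LEN` comes from the source.

```rust
/// Assumed to be 1 MiB, measured in bytes of UTF-8 text.
pub const MAX_SOURCE_CONTENT_LEN: usize = 1024 * 1024;

/// Trimmed-down record carrying only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct SourceRecord {
    pub label: String,
    pub content: Option<String>,
}

/// Rejects content larger than the limit; absent content is always fine.
pub fn validate_content(content: Option<&str>) -> Result<(), String> {
    match content {
        Some(text) if text.len() > MAX_SOURCE_CONTENT_LEN => Err(format!(
            "content is {} bytes; limit is {}",
            text.len(),
            MAX_SOURCE_CONTENT_LEN
        )),
        _ => Ok(()),
    }
}

/// Builds the list-response view: same records with `content` dropped,
/// leaving the stored records untouched.
pub fn strip_content_for_list(records: &[SourceRecord]) -> Vec<SourceRecord> {
    records
        .iter()
        .map(|r| SourceRecord {
            label: r.label.clone(),
            content: None,
        })
        .collect()
}
```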
---
## Serialization
All types use `rkyv` for zero-copy deserialization:
```rust
use stemedb_core::serde::{serialize, deserialize};
// Serialize
let bytes: Vec<u8> = serialize(&assertion)?;
// Deserialize (zero-copy when possible)
let assertion: Assertion = deserialize(&bytes)?;
```
**Critical Rule**: Never use raw `AllocSerializer` in production code. Always use `stemedb_core::serde::{serialize, deserialize}`.
### Schema Evolution (rkyv Compat)
rkyv does **not** support schema evolution. When a field is added to a struct, old data can't be deserialized with the new struct. The solution is a legacy compat pattern:
| Type | Compat Function | Legacy Struct |
|------|----------------|---------------|
| `Assertion` | `deserialize_assertion_compat()` | `LegacyAssertion` (pre-`narrative`) |
| `SourceRecord` | `deserialize_source_record_compat()` | `LegacySourceRecord` (pre-`content`) |
All assertion deserialization should use `deserialize_assertion_compat()`. All source record deserialization should use `deserialize_source_record_compat()`. When adding fields to rkyv structs in the future, always add a legacy compat deserializer following this pattern.
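
The shape of the compat pattern can be sketched without rkyv. In the sketch below the structs are trimmed, `decode_compat` is a hypothetical generic helper, and the two decoder closures stand in for the real rkyv calls. Trying the current layout first and falling back to the legacy one is an assumption about how the actual compat functions are ordered.

```rust
/// Layout written before `content` existed (trimmed for illustration).
#[derive(Debug, PartialEq)]
pub struct LegacySourceRecord {
    pub hash: [u8; 32],
    pub label: String,
}

/// Current layout (trimmed for illustration).
#[derive(Debug, PartialEq)]
pub struct SourceRecord {
    pub hash: [u8; 32],
    pub label: String,
    pub content: Option<String>,
}

impl From<LegacySourceRecord> for SourceRecord {
    /// Upgrades old data by filling the new field with its empty value.
    fn from(old: LegacySourceRecord) -> Self {
        SourceRecord {
            hash: old.hash,
            label: old.label,
            content: None,
        }
    }
}

/// Tries the current decoder first, then the legacy decoder plus upgrade.
/// The closures are placeholders for the real deserializers.
pub fn decode_compat<T, L>(
    bytes: &[u8],
    decode_current: impl Fn(&[u8]) -> Option<T>,
    decode_legacy: impl Fn(&[u8]) -> Option<L>,
) -> Option<T>
where
    L: Into<T>,
{
    decode_current(bytes).or_else(|| decode_legacy(bytes).map(Into::into))
}
```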

---
## Relationship Diagram
```
┌─────────────────────────────────────────────┐
│                  ASSERTION                  │
│ ┌─────────┐ ┌───────────┐ ┌─────────────┐   │
│ │ subject │ │ predicate │ │   object    │   │
│ └─────────┘ └───────────┘ └─────────────┘   │
│                                             │
│ ┌─────────────────┐ ┌──────────────────┐    │
│ │   source_hash   │ │   signatures[]   │    │
│ └────────┬────────┘ └────────┬─────────┘    │
│          │                   │              │
└──────────┼───────────────────┼──────────────┘
           │                   │
  ┌────────▼────────┐ ┌────────▼────────┐
  │ SOURCE DOCUMENT │ │     AGENTS      │
  │  (PDF, URL...)  │ │ (Ed25519 keys)  │
  └─────────────────┘ └────────┬────────┘
                      ┌────────▼────────┐
                      │   TRUST RANK    │
                      │  (reputation)   │
                      └─────────────────┘

┌─────────────────┐         ┌─────────────────┐
│      VOTE       │◄────────│    ASSERTION    │
│  (Ballot Box)   │  votes  │    (target)     │
│ weight: 0.0-1.0 │   on    │                 │
└─────────────────┘         └─────────────────┘

┌─────────────────┐         ┌─────────────────┐
│     EPOCH B     │◄────────│     EPOCH A     │
│  supersedes: A  │  older  │                 │
│ type: Temporal  │  epoch  │                 │
└─────────────────┘         └─────────────────┘
```
---
## API Representation
All binary data (hashes, signatures, agent IDs) is hex-encoded in JSON APIs:
```json
{
  "subject": "Semaglutide",
  "predicate": "muscle_effect",
  "object": { "type": "Text", "value": "Significant loss" },
  "source_hash": "a1b2c3d4e5f6...",
  "signatures": [
    {
      "agent_id": "deadbeef...",
      "signature": "cafebabe...",
      "timestamp": 1706745600
    }
  ],
  "confidence": 0.85,
  "timestamp": 1706745600
}
```
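Converting between the binary types and their JSON form comes down to hex encoding and decoding. The sketch below uses only the standard library. `to_hex` and `hash_from_hex` are hypothetical helpers, and accepting both upper- and lowercase digits on input while emitting lowercase is an assumption about the API's behavior.

```rust
/// Encodes bytes as lowercase hex, two characters per byte.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decodes a 64-character hex string into a 32-byte hash.
pub fn hash_from_hex(s: &str) -> Result<[u8; 32], String> {
    if s.len() != 64 {
        return Err(format!("expected 64 hex characters, got {}", s.len()));
    }
    // Checking digits up front also guarantees the slicing below is safe.
    if !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("input contains non-hex characters".to_string());
    }
    let mut out = [0u8; 32];
    for (i, slot) in out.iter_mut().enumerate() {
        let pair = &s[2 * i..2 * i + 2];
        *slot = u8::from_str_radix(pair, 16).map_err(|e| e.to_string())?;
    }
    Ok(out)
}
```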
---
## See Also
- [Architecture Overview](../architecture.md)
- [Lens Documentation](../ai-lookup/services/lens.md)
- [API Endpoints Guide](../.claude/guides/backend/api-endpoints.md)