# StemeDB Data Structures

> **Last Updated:** 2026-02-19
> **Source:** `crates/stemedb-core/src/types.rs`

This document describes the core data structures in StemeDB (Episteme). These types form the foundation of the "Git for Truth" knowledge graph.

---

## Design Principles

1. **Append-Only**: Data is never mutated. New assertions create new records.
2. **Content-Addressed**: Every assertion's ID is a BLAKE3 hash of its content.
3. **Zero-Copy**: Uses `rkyv` for serialization, so data can be read directly from disk without a parsing step.
4. **Provenance-First**: Every fact carries its source, signers, and confidence.

---

## Primitive Types

```rust
pub type Hash = [u8; 32];      // BLAKE3 256-bit hash
pub type PHash = [u8; 8];      // Perceptual hash for images (8 bytes)
pub type EntityId = String;    // Subject or object identifier
pub type RelationId = String;  // Predicate identifier
pub type EpochId = Hash;       // Paradigm/era identifier
pub type QueryId = Hash;       // Query audit record identifier
```

---

## The Assertion (Atomic Unit of Knowledge)

The `Assertion` is the fundamental unit. It represents a single claim about the world.

```rust
pub struct Assertion {
    // ═══════════════════════════════════════════════════════════
    // 1. THE FACT (What is being claimed)
    // ═══════════════════════════════════════════════════════════

    /// The entity this assertion is about (e.g., "Semaglutide", "Tesla_Inc")
    pub subject: EntityId,

    /// The relationship or property (e.g., "has_side_effect", "annual_revenue")
    pub predicate: RelationId,

    /// The claimed value
    pub object: ObjectValue,

    // ═══════════════════════════════════════════════════════════
    // 2. THE LINEAGE (Why we believe it)
    // ═══════════════════════════════════════════════════════════

    /// If this modifies/forks another assertion, its hash
    pub parent_hash: Option<Hash>,

    /// Hash of the source evidence (PDF, URL, database export)
    pub source_hash: Hash,

    /// Authority tier of the source (enables indexing and decay rates)
    pub source_class: SourceClass,

    /// Perceptual hash of a visual anchor (e.g., screenshot of table)
    pub visual_hash: Option<PHash>,

    /// Which paradigm/era this belongs to (for paradigm shifts)
    pub epoch: Option<EpochId>,

    /// Lifecycle stage (Proposed → Approved → Deprecated)
    pub lifecycle: LifecycleStage,

    // ═══════════════════════════════════════════════════════════
    // 3. META-COGNITION (Who said it, how sure are they)
    // ═══════════════════════════════════════════════════════════

    /// Cryptographic signatures from agents vouching for this
    pub signatures: Vec<SignatureEntry>,

    /// Subjective confidence score (0.0 to 1.0)
    pub confidence: f32,

    /// Unix timestamp when created
    pub timestamp: u64,

    /// Semantic embedding vector for similarity search
    pub vector: Option<Vec<f32>>,
}
```

### ObjectValue

The value in a subject-predicate-object triple:

```rust
pub enum ObjectValue {
    Text(String),         // "muscle loss"
    Number(f64),          // 96.7
    Boolean(bool),        // true
    Reference(EntityId),  // Points to another entity (graph edge)
}
```

### LifecycleStage

Assertions progress through stages (as new assertions, not mutations):

```
Proposed → UnderReview → Approved
                      ↘ Rejected
                                  ↘ Deprecated
```

```rust
pub enum LifecycleStage {
    Proposed,     // Initial idea, not for production use
    UnderReview,  // Gathering votes and feedback
    Approved,     // Accepted as current truth
    Deprecated,   // Was true, now superseded
    Rejected,     // Explicitly declined
}
```

### SourceClass

Each source is classified into an authority tier, which enables indexing by tier and tier-based decay rates:

| Tier | Class | Example | Default Decay |
|------|-------|---------|---------------|
| 0 | Regulatory | FDA, EMA, WHO | Never |
| 1 | Clinical | Phase III trials, peer-reviewed RCTs | 2 years |
| 2 | Observational | Real-world evidence, cohort studies | 1 year |
| 3 | Expert | Medical professional opinions, guidelines | 6 months |
| 4 | Community | Curated forums, patient advocacy groups | 3 months |
| 5 | Anecdotal | Reddit posts, individual testimonials | 1 month |

```rust
pub enum SourceClass {
    Regulatory,     // Tier 0: Highest authority, never decays
    Clinical,       // Tier 1: Peer-reviewed research
    Observational,  // Tier 2: Real-world evidence
    Expert,         // Tier 3: Professional opinions (default)
    Community,      // Tier 4: Curated community knowledge
    Anecdotal,      // Tier 5: Individual reports, fast decay
}

impl SourceClass {
    pub fn tier(&self) -> u8;  // Returns 0-5
    pub fn default_decay_days(&self) -> Option<u32>;
    pub fn authority_weight(&self) -> f32;  // 1.0 for Regulatory, 0.1 for Anecdotal
}
```

**Key Benefits:**

- **Indexing**: The `SC:{source_class}` index enables queries such as "show me only regulatory sources"
- **Decay rates**: Anecdotal claims decay faster than clinical evidence
- **Trust weighting**: Lenses can weight sources by authority tier in conflict resolution
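
The tier table maps directly onto `tier()` and `default_decay_days()`. The following is a minimal sketch on a standalone copy of the enum, not the crate's implementation. It assumes timestamps are Unix seconds, a year is 365 days, and a month is 30 days; the real day counts may differ. `is_stale` is a hypothetical helper added for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Community,
    Anecdotal,
}

impl SourceClass {
    /// Tier number from the table (0 = highest authority).
    pub fn tier(&self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Community => 4,
            SourceClass::Anecdotal => 5,
        }
    }

    /// Default decay window in days; `None` means the tier never decays.
    /// Assumed conversions: 1 year = 365 days, 1 month = 30 days.
    pub fn default_decay_days(&self) -> Option<u32> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730),
            SourceClass::Observational => Some(365),
            SourceClass::Expert => Some(180),
            SourceClass::Community => Some(90),
            SourceClass::Anecdotal => Some(30),
        }
    }
}

/// Hypothetical helper: has a claim of this class outlived its decay window?
/// Both timestamps are assumed to be Unix seconds.
pub fn is_stale(class: SourceClass, created_at: u64, now: u64) -> bool {
    match class.default_decay_days() {
        None => false,
        Some(days) => now.saturating_sub(created_at) > u64::from(days) * 86_400,
    }
}
```

Returning `Option<u32>` keeps "never decays" distinct from any finite window, so callers cannot mistake a sentinel such as `0` or `u32::MAX` for a real duration.
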

### SignatureEntry

Cryptographic proof that an agent vouches for an assertion:

```rust
pub struct SignatureEntry {
    pub agent_id: [u8; 32],   // Ed25519 public key
    pub signature: [u8; 64],  // Ed25519 signature over assertion content
    pub timestamp: u64,       // When the agent signed
}
```

---

## The Vote (High-Velocity Consensus)

Votes are separated from assertions to enable thousands of agents to vote simultaneously without lock contention (the "Ballot Box" pattern).

A vote is more than "I agree"; it is a **cryptographic witness**: "I saw this exact text at this URL at this time." This enables browser-extension products in which votes represent observations, not opinions.

```rust
pub struct Vote {
    /// Hash of the assertion being voted on
    pub assertion_hash: Hash,

    /// Ed25519 public key of the voter
    pub agent_id: [u8; 32],

    /// Weight of the vote (0.0 = reject, 1.0 = full endorsement)
    pub weight: f32,

    /// Signature over the assertion_hash
    pub signature: [u8; 64],

    /// When the vote was cast
    pub timestamp: u64,

    /// The URL where the claim was observed (optional)
    /// Enables provenance tracking: "I saw this at example.com/article"
    pub source_url: Option<String>,

    /// Optional context (page snippet, etc.) stored as bytes
    /// Same pattern as source_metadata on Assertion for rkyv zero-copy
    pub observed_context: Option<Vec<u8>>,
}
```

**Key Insight**: Votes are append-only. An agent changes its vote by submitting a new one with a later timestamp; the earlier vote stays in the log.

**Provenance Witness**: The `source_url` and `observed_context` fields turn votes from opinions into observations, so the browser extension can count "How many people saw this claim on this page?" rather than just "How many people agree?"
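
Because votes are append-only, a reader has to pick each agent's current vote out of the log. The sketch below shows one way to do that, assuming the vote with the latest timestamp wins and that an aggregate weight is the sum of those current weights. `VoteLite`, `effective_votes`, and `aggregate_weight` are hypothetical names, and the actual aggregation rule in StemeDB may differ.

```rust
use std::collections::HashMap;

/// Trimmed-down vote: only the fields needed to pick the current vote per agent.
#[derive(Debug, Clone, PartialEq)]
pub struct VoteLite {
    pub agent_id: [u8; 32],
    pub weight: f32,
    pub timestamp: u64,
}

/// Keep only the latest vote from each agent.
/// On equal timestamps, the entry that appears later in the log wins.
pub fn effective_votes(log: &[VoteLite]) -> HashMap<[u8; 32], VoteLite> {
    let mut latest: HashMap<[u8; 32], VoteLite> = HashMap::new();
    for vote in log {
        let is_newer = latest
            .get(&vote.agent_id)
            .map_or(true, |seen| vote.timestamp >= seen.timestamp);
        if is_newer {
            latest.insert(vote.agent_id, vote.clone());
        }
    }
    latest
}

/// Sum of the current weights, counting each agent once.
pub fn aggregate_weight(log: &[VoteLite]) -> f32 {
    effective_votes(log).values().map(|v| v.weight).sum()
}
```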

---

## The Epoch (Paradigm Shifts)

Epochs represent distinct periods of truth. When knowledge paradigms shift, old epochs can be superseded.

```rust
pub struct Epoch {
    pub id: EpochId,
    pub name: String,                 // "Pre-2024", "Newtonian"
    pub supersedes: Option<EpochId>,  // What this replaces
    pub supersession_type: Option<SupersessionType>,
    pub start_timestamp: u64,
    pub end_timestamp: Option<u64>,
}

pub enum SupersessionType {
    Invalidation,  // Old epoch was factually wrong (e.g., "Earth is flat")
    Temporal,      // Old epoch was correct but outdated (e.g., "President is Obama")
    Refinement,    // Old epoch was a simplification (e.g., Newtonian → Relativity)
}
```
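
Since each epoch points at the one it supersedes, the history of a paradigm can be recovered by following `supersedes` links. The sketch below illustrates this with a trimmed-down `EpochLink` struct and a hypothetical `lineage` function over an in-memory map; how StemeDB actually loads epochs from the store is not shown here.

```rust
use std::collections::HashMap;

pub type EpochId = [u8; 32];

/// Only the fields needed to follow supersession links.
pub struct EpochLink {
    pub id: EpochId,
    pub name: String,
    pub supersedes: Option<EpochId>,
}

/// Epoch names from `start` back to the oldest ancestor, newest first.
/// Stops on a missing epoch or a repeated ID so a bad link cannot loop forever.
pub fn lineage(epochs: &HashMap<EpochId, EpochLink>, start: EpochId) -> Vec<String> {
    let mut names = Vec::new();
    let mut seen: Vec<EpochId> = Vec::new();
    let mut cursor = Some(start);
    while let Some(id) = cursor {
        if seen.contains(&id) {
            break;
        }
        seen.push(id);
        match epochs.get(&id) {
            Some(epoch) => {
                names.push(epoch.name.clone());
                cursor = epoch.supersedes;
            }
            None => break,
        }
    }
    names
}
```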

---

## Query Results

### MaterializedView (O(1) Winner Lookup)

Pre-computed resolution stored at `MV:{subject}:{predicate}`:

```rust
pub struct MaterializedView {
    /// The winning assertion from lens resolution
    pub winner: Assertion,

    /// Which lens produced this (e.g., "VoteAwareConsensus")
    pub lens_name: String,

    /// Confidence in the resolution (0.0 to 1.0)
    pub resolution_confidence: f32,

    /// How many candidates were considered
    pub candidates_count: usize,

    /// When this view was computed
    pub materialized_at: u64,
}
```

### ConflictAnalysis (Trust but Verify)

Used by the SkepticLens, which surfaces all competing claims instead of picking a winner:

```rust
pub struct ConflictAnalysis {
    /// Overall status: Unanimous, Agreed, or Contested
    pub status: ResolutionStatus,

    /// Conflict score (0.0 = unanimous, 1.0 = maximum chaos)
    /// Calculated using normalized Shannon entropy
    pub conflict_score: f32,

    /// All distinct claims, ranked by weight_share descending
    pub claims: Vec<ClaimSummary>,

    /// Total candidates considered
    pub candidates_count: usize,
}

pub enum ResolutionStatus {
    Unanimous,  // All agree (entropy < 0.1)
    Agreed,     // Strong majority (entropy < 0.4)
    Contested,  // Significant disagreement (entropy >= 0.4)
}
```
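
The conflict score is described as normalized Shannon entropy, and the thresholds in `ResolutionStatus` partition its range. A minimal sketch of that calculation follows. It assumes the entropy is taken over the claims' weight shares and normalized by the logarithm of the number of claims; the function names `conflict_score` and `classify` are illustrative, and the crate's exact formula may differ.

```rust
#[derive(Debug, PartialEq)]
pub enum ResolutionStatus {
    Unanimous,
    Agreed,
    Contested,
}

/// Normalized Shannon entropy of the weight shares: 0.0 when one claim holds
/// all the weight, 1.0 when the weight is spread evenly across every claim.
pub fn conflict_score(weight_shares: &[f32]) -> f32 {
    let total: f32 = weight_shares.iter().sum();
    if weight_shares.len() < 2 || total <= 0.0 {
        return 0.0;
    }
    let entropy: f32 = weight_shares
        .iter()
        .map(|w| w / total) // re-normalize in case the shares do not sum to 1
        .filter(|p| *p > 0.0) // zero-weight claims contribute no entropy
        .map(|p| -p * p.ln())
        .sum();
    entropy / (weight_shares.len() as f32).ln()
}

/// Apply the thresholds listed in the `ResolutionStatus` comments.
pub fn classify(score: f32) -> ResolutionStatus {
    if score < 0.1 {
        ResolutionStatus::Unanimous
    } else if score < 0.4 {
        ResolutionStatus::Agreed
    } else {
        ResolutionStatus::Contested
    }
}
```

Under these assumptions a 99/1 split scores about 0.08 (Unanimous), a 95/5 split about 0.29 (Agreed), and an even 50/50 split scores 1.0 (Contested).
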

### ClaimSummary

A single competing claim within a `ConflictAnalysis`:

```rust
pub struct ClaimSummary {
    /// The claimed value
    pub value: ObjectValue,

    /// This claim's share of total support (0.0 to 1.0)
    pub weight_share: f32,

    /// Number of assertions making this claim
    pub assertion_count: u32,

    /// Hash of the highest-confidence assertion (for drill-down)
    pub representative_hash: Hash,

    /// Source provenance
    pub source: SourceSummary,

    /// Agents who signed assertions for this claim
    pub supporting_agents: Vec<AgentSummary>,
}
```

### SourceSummary & AgentSummary

Provenance types for "show me the proof" UX:

```rust
pub struct SourceSummary {
    pub source_hash: Hash,           // Hash of source document
    pub visual_hash: Option<PHash>,  // Visual anchor (screenshot)
}

pub struct AgentSummary {
    pub agent_id: [u8; 32],  // Agent's public key
    pub trust_score: f32,    // Trust score at query time
}
```

---

## Query Audit Trail

Every query is logged for "Why did you think that?" debugging:

```rust
pub struct QueryAudit {
    pub query_id: QueryId,
    pub agent_id: Option<[u8; 32]>,  // Who queried (from X-Agent-Id header)
    pub timestamp: u64,
    pub params: QueryParams,
    pub result_hash: Option<Hash>,   // Winning assertion hash
    pub result_confidence: f32,
    pub contributing_assertions: Vec<ContributingAssertion>,
}

pub struct QueryParams {
    pub subject: Option<EntityId>,
    pub predicate: Option<RelationId>,
    pub lifecycle: Option<LifecycleStage>,
    pub epoch: Option<EpochId>,
    pub lens: Option<String>,
}

pub struct ContributingAssertion {
    pub assertion_hash: Hash,
    pub weight: f32,  // How much this influenced the result
    pub source_hash: Hash,
    pub lifecycle: LifecycleStage,
}
```

---

## Storage Layout

Key patterns in the KV store:

| Key Pattern | Value | Purpose |
|-------------|-------|---------|
| `H:{hash}` | Serialized Assertion | Primary assertion storage |
| `S:{subject}` | `Vec<Hash>` | Subject index |
| `SP:{subject}:{predicate}` | `Vec<Hash>` | Compound index (O(1) lookup) |
| `MV:{subject}:{predicate}` | MaterializedView | Pre-computed winner |
| `V:{assertion_hash}:{vote_hash}` | Vote | Individual votes |
| `VC:{assertion_hash}` | u64 | Vote count cache |
| `VW:{assertion_hash}` | f32 | Aggregate vote weight cache |
| `TR:{agent_id}` | TrustRank | Agent reputation |
| `TP:{pack_id}` | TrustPack | Curated agent lists |
| `AUD:{query_id}` | QueryAudit | Query audit record |
| `E:{epoch_id}` | Epoch | Epoch definitions |
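
The key patterns above can be produced with a few small formatting helpers. The sketch below assumes that hashes appear in keys as lowercase hex and that keys are plain strings; the store may equally well embed raw bytes. All function names here are hypothetical.

```rust
/// Lowercase hex encoding (hypothetical helper; a hex crate would also work).
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// `H:{hash}`: primary assertion storage.
pub fn assertion_key(hash: &[u8; 32]) -> String {
    format!("H:{}", to_hex(hash))
}

/// `SP:{subject}:{predicate}`: compound index.
pub fn compound_index_key(subject: &str, predicate: &str) -> String {
    format!("SP:{}:{}", subject, predicate)
}

/// `MV:{subject}:{predicate}`: pre-computed winner.
pub fn materialized_view_key(subject: &str, predicate: &str) -> String {
    format!("MV:{}:{}", subject, predicate)
}

/// `V:{assertion_hash}:{vote_hash}`: one entry per individual vote.
pub fn vote_key(assertion_hash: &[u8; 32], vote_hash: &[u8; 32]) -> String {
    format!("V:{}:{}", to_hex(assertion_hash), to_hex(vote_hash))
}
```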

---

## The Trust Pack (Curator Economy)

Trust Packs are the "App Store for Trust": curated lists of trusted agents that filter consensus through domain expertise.

```rust
pub struct TrustPack {
    /// Content-addressed pack ID (BLAKE3 hash)
    pub id: PackId,

    /// Human-readable name (e.g., "Mayo_Clinic_Experts")
    pub name: String,

    /// Ed25519 public key of the pack maintainer
    pub maintainer: [u8; 32],

    /// Agent public keys in this pack
    /// Future: Replace with RoaringBitmap for O(1) membership
    pub agents: Vec<[u8; 32]>,

    /// Unix timestamp when pack was created
    pub created_at: u64,

    /// Unix timestamp of last modification
    pub updated_at: u64,
}
```

**Key Methods:**

- `add_agent(agent_id)`: Idempotent agent addition
- `remove_agent(agent_id)`: Safe removal
- `contains_agent(agent_id) -> bool`: Membership check

**Use Case:** Users subscribe to packs like "Skeptical Cardio Pack" to filter GLP-1 side effect claims through vetted cardiologists.
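
The three key methods can be sketched against the `Vec`-backed `agents` field. This is an illustration on a trimmed-down struct, not the crate's code: the `bool` return values on `add_agent` and `remove_agent` are assumptions, and timestamp handling is left out.

```rust
/// Trimmed-down pack: only the fields the membership methods touch.
pub struct TrustPack {
    pub name: String,
    pub agents: Vec<[u8; 32]>,
}

impl TrustPack {
    /// Membership check: a linear scan over the agent list.
    pub fn contains_agent(&self, agent_id: &[u8; 32]) -> bool {
        self.agents.contains(agent_id)
    }

    /// Idempotent addition: returns false and changes nothing if already present.
    pub fn add_agent(&mut self, agent_id: [u8; 32]) -> bool {
        if self.contains_agent(&agent_id) {
            return false;
        }
        self.agents.push(agent_id);
        true
    }

    /// Safe removal: returns false instead of failing when the agent is absent.
    pub fn remove_agent(&mut self, agent_id: &[u8; 32]) -> bool {
        let before = self.agents.len();
        self.agents.retain(|a| a != agent_id);
        self.agents.len() != before
    }
}
```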

---

## The SourceRecord (Source Registry)

The Source Registry maps content-addressed source hashes to human-readable metadata. This enables the dashboard to show "FDA Approval Letter for Wegovy" instead of a raw BLAKE3 hash.

```rust
pub struct SourceRecord {
    /// Content-addressed hash of the source (BLAKE3, 32 bytes).
    pub hash: [u8; 32],

    /// Human-readable label.
    pub label: String,

    /// Optional URL where the source can be accessed.
    pub url: Option<String>,

    /// Authority tier (0-5), matching SourceClass.
    pub tier: u8,

    /// Current status (Active, Deprecated, Quarantined).
    pub status: SourceStatus,

    /// HLC timestamp when the record was created.
    pub created_at: u64,

    /// HLC timestamp of the last update.
    pub updated_at: u64,

    /// Optional curator notes about the source.
    pub notes: Option<String>,

    /// Optional full-text content of the source document.
    /// Populated by pipelines that extract text from PDFs.
    /// Max size: 1 MB (MAX_SOURCE_CONTENT_LEN).
    pub content: Option<String>,
}
```

**Key Points:**

- **Status lifecycle:** Active → Deprecated or Quarantined (curator-driven).
- **Content field:** Stores extracted document text (e.g., from `pdftotext`). It is stripped from list responses (`GET /v1/sources`) to avoid returning megabytes of text, and included in single-source responses (`GET /v1/sources/{hash}`).
- **rkyv compat:** Uses `deserialize_source_record_compat()` for backward compatibility with data written before the `content` field was added.
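
The content rules above come down to two small operations: reject oversized content at registration, and clear `content` before building a list response. The sketch below assumes "1 MB" means 1,048,576 bytes measured on the UTF-8 string; the crate's constant may instead be 1,000,000. `SourceView`, `validate_content`, and `strip_for_list` are hypothetical names.

```rust
/// Assumed value: 1 MiB in bytes. The real constant may differ.
pub const MAX_SOURCE_CONTENT_LEN: usize = 1024 * 1024;

/// Trimmed-down record as it might appear in an API response.
#[derive(Debug, Clone, PartialEq)]
pub struct SourceView {
    pub label: String,
    pub content: Option<String>,
}

/// Reject content whose UTF-8 byte length exceeds the limit; absent content is fine.
pub fn validate_content(content: Option<&str>) -> Result<(), String> {
    match content {
        Some(text) if text.len() > MAX_SOURCE_CONTENT_LEN => Err(format!(
            "content is {} bytes, limit is {}",
            text.len(),
            MAX_SOURCE_CONTENT_LEN
        )),
        _ => Ok(()),
    }
}

/// Drop `content` from every record before returning a list response.
pub fn strip_for_list(records: Vec<SourceView>) -> Vec<SourceView> {
    records
        .into_iter()
        .map(|mut record| {
            record.content = None;
            record
        })
        .collect()
}
```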

---

## Serialization

All types use `rkyv` for zero-copy deserialization:

```rust
use stemedb_core::serde::{serialize, deserialize};

// Serialize
let bytes: Vec<u8> = serialize(&assertion)?;

// Deserialize (zero-copy when possible)
let assertion: Assertion = deserialize(&bytes)?;
```

**Critical Rule**: Never use raw `AllocSerializer` in production code. Always use `stemedb_core::serde::{serialize, deserialize}`.

### Schema Evolution (rkyv Compat)

rkyv does **not** support schema evolution. When a field is added to a struct, data written with the old layout can no longer be deserialized with the new struct. The workaround is a legacy compat pattern: keep the old layout as a legacy struct and route reads through a compat function.

| Type | Compat Function | Legacy Struct |
|------|----------------|---------------|
| `Assertion` | `deserialize_assertion_compat()` | `LegacyAssertion` (pre-`narrative`) |
| `SourceRecord` | `deserialize_source_record_compat()` | `LegacySourceRecord` (pre-`content`) |

All assertion deserialization should use `deserialize_assertion_compat()`. All source record deserialization should use `deserialize_source_record_compat()`. When adding fields to rkyv structs in the future, always add a legacy compat deserializer following this pattern.
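
The shape of the compat pattern can be sketched without rkyv itself. In the illustration below the two decoders are passed in as closures that stand in for the real rkyv calls, the structs are trimmed to three fields, and the "try the current layout first, then fall back to the legacy one" order is an assumption about how the compat functions behave. `deserialize_compat` is a hypothetical name.

```rust
/// Layout written before `content` existed (fields trimmed for brevity).
pub struct LegacySourceRecord {
    pub hash: [u8; 32],
    pub label: String,
}

/// Current layout, with the new optional field.
pub struct SourceRecord {
    pub hash: [u8; 32],
    pub label: String,
    pub content: Option<String>,
}

/// Upgrade step: old records simply have no content.
impl From<LegacySourceRecord> for SourceRecord {
    fn from(old: LegacySourceRecord) -> Self {
        SourceRecord {
            hash: old.hash,
            label: old.label,
            content: None,
        }
    }
}

/// Try the current layout first, then fall back to the legacy layout.
/// `decode_new` and `decode_legacy` are placeholders for the real decoders.
pub fn deserialize_compat<N, L>(
    bytes: &[u8],
    decode_new: N,
    decode_legacy: L,
) -> Option<SourceRecord>
where
    N: Fn(&[u8]) -> Option<SourceRecord>,
    L: Fn(&[u8]) -> Option<LegacySourceRecord>,
{
    decode_new(bytes).or_else(|| decode_legacy(bytes).map(SourceRecord::from))
}
```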

---

## Relationship Diagram

```
┌─────────────────────────────────────────────┐
│                  ASSERTION                  │
│  ┌─────────┐ ┌───────────┐ ┌─────────────┐  │
│  │ subject │ │ predicate │ │   object    │  │
│  └─────────┘ └───────────┘ └─────────────┘  │
│                                             │
│  ┌─────────────────┐  ┌──────────────────┐  │
│  │   source_hash   │  │   signatures[]   │  │
│  └────────┬────────┘  └────────┬─────────┘  │
│           │                    │            │
└───────────┼────────────────────┼────────────┘
            │                    │
┌───────────▼───────┐   ┌────────▼────────┐
│  SOURCE DOCUMENT  │   │     AGENTS      │
│  (PDF, URL...)    │   │  (Ed25519 keys) │
└───────────────────┘   └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │   TRUST RANK    │
                        │  (reputation)   │
                        └─────────────────┘


┌─────────────────┐         ┌─────────────────┐
│      VOTE       │◄────────│    ASSERTION    │
│  (Ballot Box)   │  votes  │    (target)     │
│  weight: 0.0-1.0│   on    │                 │
└─────────────────┘         └─────────────────┘


┌─────────────────┐         ┌─────────────────┐
│     EPOCH B     │◄────────│     EPOCH A     │
│  supersedes: A  │  older  │                 │
│  type: Temporal │  epoch  │                 │
└─────────────────┘         └─────────────────┘
```

---

## API Representation

All binary data (hashes, signatures, agent IDs) is hex-encoded in JSON APIs:

```json
{
  "subject": "Semaglutide",
  "predicate": "muscle_effect",
  "object": { "type": "Text", "value": "Significant loss" },
  "source_hash": "a1b2c3d4e5f6...",
  "signatures": [
    {
      "agent_id": "deadbeef...",
      "signature": "cafebabe...",
      "timestamp": 1706745600
    }
  ],
  "confidence": 0.85,
  "timestamp": 1706745600
}
```
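
On the receiving side, a hex string from the JSON body has to be turned back into a fixed-size array before it can be used as a `Hash`. The following sketch shows one way to do this with the standard library only. `parse_hash` is a hypothetical helper, and it assumes a 32-byte hash is always sent as exactly 64 hex characters with no `0x` prefix.

```rust
/// Decode a 64-character hex string into a 32-byte hash.
/// Accepts upper- and lowercase digits; rejects any other character.
pub fn parse_hash(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!(
            "expected 64 hex characters, got {:?} ({} bytes)",
            hex,
            hex.len()
        ));
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        // Slicing is safe: the input was checked to be ASCII hex digits only.
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).map_err(|e| e.to_string())?;
    }
    Ok(out)
}
```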

---

## See Also

- [Architecture Overview](../architecture.md)
- [Lens Documentation](../ai-lookup/services/lens.md)
- [API Endpoints Guide](../.claude/guides/backend/api-endpoints.md)