Concept Hierarchy Spec

Status: Draft
Author: Jordan Washburn
Date: 2026-02-02


Problem

The current EntityId is a flat String. Subjects like "Semaglutide" or "Tesla_Inc" have no structure, no namespace, no hierarchy. This creates three problems:

  1. Collision. Two projects making claims about JWT validation produce assertions with the same subject string. There's no way to distinguish "citadeldb's JWT config" from "some other project's JWT config" without manual namespacing conventions.

  2. No hierarchical queries. You can't ask "show me all security-related conflicts in citadeldb." You can only query exact subjects. There's no way to traverse up ("all authentication claims") or down ("just the audience validation claims").

  3. Entity resolution is manual. "Ozempic" and "Semaglutide" and "GLP-1 agonist" are different strings that refer to overlapping concepts. There's no alias mechanism. Every consumer of the data has to know the synonyms.


Design

ConceptPath

A ConceptPath replaces the flat EntityId as the subject of an Assertion. It's a structured, hierarchical identifier with a scheme, segments, and a leaf.

Wire format (string representation):

{scheme}://{segment_0}/{segment_1}/.../{segment_n}/{leaf}

Examples:

code://rust/citadeldb/auth/jwt/audience_validation
code://go/episteme-sdk/auth/api_keys/rotation_policy
rfc://7519/jwt/audience_validation
owasp://cheatsheet/tls/certificate_verification
fda://label/semaglutide/side_effects/gastroparesis
sec://10k/tesla/revenue/2024
internal://wiki/citadeldb/auth/jwt/skip_aud

Components:

| Component | Required | Description |
|-----------|----------|-------------|
| scheme | Yes | The source domain. Maps to a default source tier. |
| segments | Yes (1+) | The hierarchical path. Ordered from broad to narrow. |
| leaf | Yes | The specific concept at the bottom of the hierarchy. |

Rules:

  • Scheme is lowercase alphanumeric + hyphens. No colons, no slashes.
  • Segments and leaf are lowercase alphanumeric + underscores + hyphens. No spaces, no dots.
  • The :// separator is mandatory between scheme and path.
  • / separates segments and leaf. No trailing slash.
  • Maximum total length: 1024 bytes (matches current MAX_SUBJECT_LEN).
  • Minimum: {scheme}://{leaf} (one segment that is also the leaf). Example: code://readme.
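
The rules above can be checked mechanically. The following is a minimal sketch, not the project's validator: `WireFormatError` and `validate_wire_format` are hypothetical names introduced for illustration (the spec itself only names `ConceptPathError`). Note that a check this strict would reject legacy subjects such as `custom://Tesla_Inc`, which contain uppercase letters, so a real implementation would presumably need to relax it for the legacy fallback.

```rust
/// Maximum wire-format length in bytes (the spec's MAX_SUBJECT_LEN).
const MAX_SUBJECT_LEN: usize = 1024;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
pub enum WireFormatError {
    TooLong,
    MissingSeparator,
    InvalidScheme,
    EmptySegment,
    InvalidSegment,
}

/// Scheme characters: lowercase alphanumeric plus hyphen.
fn is_scheme_char(c: char) -> bool {
    c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-'
}

/// Segment and leaf characters: lowercase alphanumeric plus underscore and hyphen.
fn is_segment_char(c: char) -> bool {
    c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_' || c == '-'
}

/// Check a string against the wire-format rules listed above.
pub fn validate_wire_format(s: &str) -> Result<(), WireFormatError> {
    if s.len() > MAX_SUBJECT_LEN {
        return Err(WireFormatError::TooLong);
    }
    let (scheme, path) = s
        .split_once("://")
        .ok_or(WireFormatError::MissingSeparator)?;
    if scheme.is_empty() || !scheme.chars().all(is_scheme_char) {
        return Err(WireFormatError::InvalidScheme);
    }
    // An empty path, a trailing slash, and "//" all surface as an empty segment.
    for segment in path.split('/') {
        if segment.is_empty() {
            return Err(WireFormatError::EmptySegment);
        }
        if !segment.chars().all(is_segment_char) {
            return Err(WireFormatError::InvalidSegment);
        }
    }
    Ok(())
}
```

The "no trailing slash" rule needs no separate branch: `split('/')` yields an empty final item for `code://rust/auth/`, which the empty-segment check rejects.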

Scheme Registry

Schemes map to default source tiers. This is the connection between the concept hierarchy and the existing SourceClass system.

| Scheme | Default Tier | SourceClass | Description |
|--------|--------------|-------------|-------------|
| rfc | 0 | Regulatory | IETF standards documents |
| nist | 0 | Regulatory | NIST publications, CIS benchmarks |
| fda | 0 | Regulatory | FDA labels, approvals, warnings |
| sec | 0 | Regulatory | SEC filings, regulations |
| owasp | 1 | Clinical | Security best practices |
| pubmed | 1 | Clinical | Peer-reviewed biomedical literature |
| vendor | 2 | Observational | Official vendor documentation |
| internal | 3 | Expert | Internal wikis, runbooks, decisions |
| code | 3 | Expert | Extracted from source code |
| community | 4 | Community | Stack Overflow, forums, registries |
| blog | 5 | Anecdotal | Blog posts, tutorials, social media |
| custom | 3 | Expert | User-defined (tier overridable) |

The scheme provides a default tier. An assertion can still override source_class explicitly. But if source_class is not provided at ingestion time, the scheme determines it. This means the concept path carries its own trust signal.

ConceptSpace

A ConceptSpace is a node in the hierarchy tree. It's not a separate entity — it's an emergent structure defined by the set of assertions whose ConceptPaths share a common prefix.

Querying code://rust/citadeldb/auth returns all assertions under that prefix:

  • code://rust/citadeldb/auth/jwt/audience_validation
  • code://rust/citadeldb/auth/jwt/expiry_policy
  • code://rust/citadeldb/auth/oauth/scope_validation
  • code://rust/citadeldb/auth/session/timeout

No explicit ConceptSpace object needs to be created or registered for hierarchical queries. The hierarchy exists implicitly in the paths. This is critical — it means the system works without any upfront taxonomy.

Aliases

Aliases connect ConceptPaths that refer to the same underlying concept across different hierarchies or naming conventions.

Alias record:

canonical: rfc://7519/jwt/audience_validation
aliases:
  - code://rust/citadeldb/auth/jwt/aud_check
  - internal://wiki/citadeldb/auth/skip_aud
  - community://stackoverflow/jwt/verify_aud

When a query targets any alias, the system resolves to the canonical path and returns assertions from all aliased paths. This is how entity resolution works without NLP.

Alias creation:

  1. Manual. A human or agent explicitly declares: "these paths are about the same concept."
  2. Suggested. When a new path is ingested that shares a leaf name with an existing path under a different scheme, the system flags it as a potential alias. Example: ingesting code://rust/citadeldb/auth/jwt/audience_validation when rfc://7519/jwt/audience_validation already exists. Same leaf, different scheme. Likely about the same thing.
  3. Merged. Two ConceptSpaces that were separate can be merged by aliasing their roots. Aliasing code://rust/citadeldb/auth/jwt to code://go/episteme-sdk/auth/jwt means queries against either return claims from both.

Alias rules:

  • An alias always points to a canonical path. The canonical is the most authoritative path, i.e. the one whose scheme has the lowest default tier number (Tier 0 outranks Tier 5).
  • If two Tier 0 paths are aliased, the first registered is canonical.
  • Aliases are transitive. If A aliases to B and B aliases to C, querying A returns results from A, B, and C.
  • Aliases are stored, not computed. They don't change retroactively. Existing assertions keep their original ConceptPath. The alias is a lookup-time resolution.
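
As an illustration of the first two alias rules, canonical selection reduces to "lowest tier number, earliest registration on ties". The sketch below assumes the caller already knows each candidate's default tier and passes the candidates in registration order; `pick_canonical` is a hypothetical helper, not part of the spec.

```rust
/// Pick the canonical path from `(path, default_tier)` candidates.
///
/// Candidates are assumed to be listed in registration order. A lower tier
/// number means a more authoritative source (Tier 0 outranks Tier 5).
pub fn pick_canonical<'a>(candidates: &[(&'a str, u8)]) -> Option<&'a str> {
    candidates
        .iter()
        // `min_by_key` returns the first of several equal minima, which
        // gives the "first registered wins" tie-break for free.
        .min_by_key(|(_, tier)| *tier)
        .map(|(path, _)| *path)
}
```

With the alias record shown earlier, the `rfc://` path (Tier 0) is selected over the `code://`, `internal://`, and `community://` paths regardless of the order in which they were registered.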

Rust Types

ConceptPath

/// A hierarchical, scheme-qualified concept identifier.
///
/// Replaces flat `EntityId` strings as assertion subjects. Enables
/// hierarchical queries, cross-scheme alias resolution, and scheme-based
/// default source tier inference.
///
/// # Wire Format
/// `{scheme}://{segment_0}/{segment_1}/.../{leaf}`
///
/// # Examples
/// - `code://rust/citadeldb/auth/jwt/audience_validation`
/// - `rfc://7519/jwt/audience_validation`
/// - `fda://label/semaglutide/side_effects/gastroparesis`
#[derive(Archive, Deserialize, Serialize, Debug, Clone, PartialEq, Eq, Hash)]
#[archive(check_bytes)]
pub struct ConceptPath {
    /// The source domain scheme (e.g., "code", "rfc", "fda").
    /// Determines default source tier.
    pub scheme: String,

    /// Ordered path segments from broad to narrow.
    /// Must have at least one element. The last element is the leaf.
    ///
    /// For `code://rust/citadeldb/auth/jwt/audience_validation`:
    ///   segments = ["rust", "citadeldb", "auth", "jwt", "audience_validation"]
    pub segments: Vec<String>,
}

The leaf is segments.last(). Segments includes the leaf — no separate field. This simplifies serialization and prefix matching.

ConceptAlias

/// An alias mapping between concept paths.
///
/// Stored at `CA:{alias_path}` → canonical path.
/// Enables cross-scheme entity resolution without NLP.
#[derive(Archive, Deserialize, Serialize, Debug, Clone, PartialEq)]
#[archive(check_bytes)]
pub struct ConceptAlias {
    /// The alias path (the one being looked up).
    pub alias: ConceptPath,

    /// The canonical path (the one it resolves to).
    pub canonical: ConceptPath,

    /// Who created this alias (agent public key).
    pub created_by: [u8; 32],

    /// When this alias was created.
    pub created_at: u64,

    /// How this alias was created.
    pub origin: AliasOrigin,
}

/// How an alias was created.
#[derive(Archive, Deserialize, Serialize, Debug, Clone, Copy, PartialEq, Eq)]
#[archive(check_bytes)]
pub enum AliasOrigin {
    /// Explicitly declared by a human or agent.
    Manual,
    /// Suggested by the system (shared leaf name), confirmed by a human.
    Suggested,
    /// Created by a merge operation.
    Merged,
}

SourceScheme

/// Known source domain schemes with default tier mappings.
///
/// Used to infer `SourceClass` from a ConceptPath's scheme when
/// no explicit source class is provided at ingestion time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceScheme {
    Rfc,
    Nist,
    Fda,
    Sec,
    Owasp,
    Pubmed,
    Vendor,
    Internal,
    Code,
    Community,
    Blog,
    Custom,
}

impl SourceScheme {
    /// Parse a scheme string into a known SourceScheme.
    /// Unknown schemes map to Custom.
    pub fn from_str(s: &str) -> Self { /* ... */ }

    /// The default SourceClass for this scheme.
    pub fn default_source_class(&self) -> SourceClass { /* ... */ }

    /// The scheme string for serialization.
    pub fn as_str(&self) -> &'static str { /* ... */ }
}
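
One way the elided method bodies could be filled in, following the Scheme Registry table, is sketched below. The `SourceClass` enum here is a stand-in declared only so the example compiles on its own; its variant names are taken from the table and may not match the existing type.

```rust
/// Stand-in for the existing `SourceClass` type (assumed variant names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,    // Tier 0
    Clinical,      // Tier 1
    Observational, // Tier 2
    Expert,        // Tier 3
    Community,     // Tier 4
    Anecdotal,     // Tier 5
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceScheme {
    Rfc, Nist, Fda, Sec, Owasp, Pubmed, Vendor, Internal, Code, Community, Blog, Custom,
}

impl SourceScheme {
    /// Parse a scheme string. Unknown schemes map to `Custom`.
    pub fn from_str(s: &str) -> Self {
        match s {
            "rfc" => Self::Rfc,
            "nist" => Self::Nist,
            "fda" => Self::Fda,
            "sec" => Self::Sec,
            "owasp" => Self::Owasp,
            "pubmed" => Self::Pubmed,
            "vendor" => Self::Vendor,
            "internal" => Self::Internal,
            "code" => Self::Code,
            "community" => Self::Community,
            "blog" => Self::Blog,
            _ => Self::Custom,
        }
    }

    /// Default source class per the Scheme Registry table.
    pub fn default_source_class(&self) -> SourceClass {
        match self {
            Self::Rfc | Self::Nist | Self::Fda | Self::Sec => SourceClass::Regulatory,
            Self::Owasp | Self::Pubmed => SourceClass::Clinical,
            Self::Vendor => SourceClass::Observational,
            Self::Internal | Self::Code | Self::Custom => SourceClass::Expert,
            Self::Community => SourceClass::Community,
            Self::Blog => SourceClass::Anecdotal,
        }
    }

    /// The scheme string used in the wire format.
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Rfc => "rfc",
            Self::Nist => "nist",
            Self::Fda => "fda",
            Self::Sec => "sec",
            Self::Owasp => "owasp",
            Self::Pubmed => "pubmed",
            Self::Vendor => "vendor",
            Self::Internal => "internal",
            Self::Code => "code",
            Self::Community => "community",
            Self::Blog => "blog",
            Self::Custom => "custom",
        }
    }
}
```

Because unknown schemes collapse to `Custom`, `as_str` does not round-trip them: `from_str("acme").as_str()` yields `"custom"`. That is harmless as long as the original scheme string is kept on the ConceptPath itself, as the struct above does.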

ConceptPath Methods

impl ConceptPath {
    /// Parse from wire format: `scheme://segment/segment/.../leaf`
    pub fn parse(s: &str) -> Result<Self, ConceptPathError> { /* ... */ }

    /// Serialize to wire format.
    pub fn to_string(&self) -> String {
        format!("{}://{}", self.scheme, self.segments.join("/"))
    }

    /// The leaf concept (last segment).
    pub fn leaf(&self) -> &str {
        // segments is guaranteed non-empty by construction
        &self.segments[self.segments.len() - 1]
    }

    /// The parent path (all segments except the last).
    /// Returns None if this path has only one segment (the leaf).
    pub fn parent(&self) -> Option<ConceptPath> {
        if self.segments.len() <= 1 {
            return None;
        }
        Some(ConceptPath {
            scheme: self.scheme.clone(),
            segments: self.segments[..self.segments.len() - 1].to_vec(),
        })
    }

    /// The depth of this path (number of segments).
    pub fn depth(&self) -> usize {
        self.segments.len()
    }

    /// Check if this path is a prefix of another.
    pub fn is_prefix_of(&self, other: &ConceptPath) -> bool {
        self.scheme == other.scheme
            && self.segments.len() < other.segments.len()
            && other.segments.starts_with(&self.segments)
    }

    /// Check if this path is under a given prefix.
    pub fn is_under(&self, prefix: &ConceptPath) -> bool {
        prefix.is_prefix_of(self)
    }

    /// The inferred default SourceClass based on the scheme.
    pub fn default_source_class(&self) -> SourceClass {
        SourceScheme::from_str(&self.scheme).default_source_class()
    }

    /// Generate the storage key prefix for hierarchical queries.
    /// Used by IndexStore for scan_prefix operations.
    ///
    /// The trailing `/` keeps `auth` from also matching `authentication`.
    pub fn storage_prefix(&self) -> Vec<u8> {
        let mut key = b"S:".to_vec();
        key.extend_from_slice(self.to_string().as_bytes());
        key.push(b'/');
        key
    }
}

Storage Changes

Key Format Migration

Current key format uses flat subjects:

S:Tesla_Inc                    → Vec<Hash>
SP:Tesla_Inc:has_revenue       → Vec<Hash>

New key format uses ConceptPath wire format:

S:sec://10k/tesla/revenue/2024                          → Vec<Hash>
SP:sec://10k/tesla/revenue/2024:reported_amount         → Vec<Hash>

Because :// and / are part of the subject string now, and the existing S: prefix scanning uses byte-level prefix matching, hierarchical queries work automatically:

scan_prefix("S:code://rust/citadeldb/auth/")

This returns all keys under that path. The / in the ConceptPath maps directly to byte-level prefix scanning that KVStore::scan_prefix already supports. No new index structure needed for hierarchical queries.

New Key Prefixes

| Prefix | Format | Content | Purpose |
|--------|--------|---------|---------|
| CA: | CA:{alias_path_string} | Serialized canonical ConceptPath | Alias resolution |
| CAR: | CAR:{canonical_path_string} | Serialized Vec<ConceptPath> | Reverse alias index (canonical -> all aliases) |
| CS: | CS:{scheme}://{partial_path} | Serialized concept metadata | Concept space metadata (optional, for caching) |

Alias Resolution in Queries

When the query engine receives a subject query:

1. Parse the subject as a ConceptPath
2. Check CA:{subject} for an alias
   - If alias exists, resolve to canonical
   - Check CA:{canonical} recursively (transitive aliases)
   - Collect all aliased paths from CAR:{canonical}
3. For each path (canonical + all aliases):
   - Run the existing index lookup (S: or SP:)
4. Merge results, deduplicate by assertion hash
5. Pass merged candidates to the Lens for resolution

Alias resolution itself adds one or two extra lookups per query (CA: and CAR:), plus one index lookup for each aliased path. The CAR: reverse index avoids scanning all CA: keys to find the aliases of a canonical path.
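
The resolution steps can be sketched with two in-memory maps standing in for the CA: and CAR: key spaces. `AliasIndex` and its methods are hypothetical names; the real store would read serialized records from the KV store, and the cycle guard is an added safety measure the spec does not mention.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-in for the `CA:` and `CAR:` key spaces.
#[derive(Default)]
pub struct AliasIndex {
    /// CA: alias path -> canonical path
    forward: HashMap<String, String>,
    /// CAR: canonical path -> all alias paths
    reverse: HashMap<String, Vec<String>>,
}

impl AliasIndex {
    /// Record an alias in both directions.
    pub fn add_alias(&mut self, alias: &str, canonical: &str) {
        self.forward.insert(alias.to_string(), canonical.to_string());
        self.reverse
            .entry(canonical.to_string())
            .or_default()
            .push(alias.to_string());
    }

    /// Follow CA: links until a path with no further alias is reached.
    /// The visited set stops the walk if the stored aliases form a cycle.
    pub fn canonical_of(&self, subject: &str) -> String {
        let mut current = subject.to_string();
        let mut seen = BTreeSet::new();
        while seen.insert(current.clone()) {
            match self.forward.get(&current) {
                Some(next) => current = next.clone(),
                None => break,
            }
        }
        current
    }

    /// Expand a subject to every path that should be looked up:
    /// the canonical path plus everything reachable through CAR:.
    pub fn expand(&self, subject: &str) -> BTreeSet<String> {
        let mut paths = BTreeSet::new();
        let mut stack = vec![self.canonical_of(subject)];
        while let Some(path) = stack.pop() {
            if paths.insert(path.clone()) {
                if let Some(aliases) = self.reverse.get(&path) {
                    stack.extend(aliases.iter().cloned());
                }
            }
        }
        paths.insert(subject.to_string());
        paths
    }
}
```

Each path in the expanded set is then fed to the existing S: or SP: lookup, and the resulting assertion hashes are deduplicated before being handed to the Lens.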

Hierarchical Queries

When the subject is a prefix (not a full leaf path), the query engine uses scan_prefix instead of get:

Query: subject = "code://rust/citadeldb/auth"

1. Parse as ConceptPath
2. Detect: this path has no assertions directly (it's a namespace, not a leaf)
   OR: the caller explicitly requests hierarchical mode
3. Run scan_prefix("S:code://rust/citadeldb/auth/")
   - Note the trailing / — this prevents "auth" matching "authentication"
4. Collect all assertion hashes from all matching keys
5. Resolve aliases for each matched path
6. Pass all candidates to the Lens

The trailing / is important. Without it, auth would prefix-match authentication. The query engine appends / when running hierarchical queries against non-leaf paths.

Detection heuristic: A query is hierarchical if:

  • The subject has no assertions in S:{subject} (exact match misses)
  • OR the caller sets a hierarchical: true flag in QueryParams
  • OR the subject ends with /* (glob syntax)
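
The detection heuristic and the trailing-slash rule can be sketched together. Here a `BTreeMap` stands in for the subject index, and `scan_prefix` mimics the byte-level prefix scan that `KVStore::scan_prefix` is described as providing; the function names, the `u64` hash type, and the exact order of checks are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Stand-in for the subject index: "S:{subject}" -> assertion hashes.
pub type SubjectIndex = BTreeMap<String, Vec<u64>>;

/// Collect hashes from every key that starts with `prefix`.
/// Strings order byte-wise, so all matches form one contiguous range.
pub fn scan_prefix(index: &SubjectIndex, prefix: &str) -> Vec<u64> {
    index
        .range(prefix.to_string()..)
        .take_while(|(key, _)| key.starts_with(prefix))
        .flat_map(|(_, hashes)| hashes.iter().copied())
        .collect()
}

/// Apply the detection heuristic. Returns the scan prefix (with the
/// trailing slash appended) when the query should run hierarchically.
pub fn hierarchical_prefix(
    index: &SubjectIndex,
    subject: &str,
    hierarchical_flag: bool,
) -> Option<String> {
    // Glob syntax: "prefix/*" is shorthand for hierarchical mode.
    let (base, glob) = match subject.strip_suffix("/*") {
        Some(base) => (base, true),
        None => (subject, false),
    };
    let exact_miss = !index.contains_key(&format!("S:{base}"));
    if glob || hierarchical_flag || exact_miss {
        Some(format!("S:{base}/"))
    } else {
        None
    }
}

/// Run either an exact lookup or a hierarchical scan.
pub fn query_subject(index: &SubjectIndex, subject: &str, hierarchical_flag: bool) -> Vec<u64> {
    match hierarchical_prefix(index, subject, hierarchical_flag) {
        Some(prefix) => scan_prefix(index, &prefix),
        None => index
            .get(&format!("S:{subject}"))
            .cloned()
            .unwrap_or_default(),
    }
}
```

Scanning `S:code://rust/citadeldb/auth` without the slash would also pick up keys under `.../authentication/`, which is the failure the trailing `/` prevents.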

Assertion Changes

Subject Field

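Conceptually, the Assertion's subject becomes a ConceptPath instead of a flat EntityId (String). The Implementation Plan below refines this: the stored field stays an EntityId string holding the ConceptPath wire format, and ConceptPath acts as a parsing and validation layer on top of it.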

pub struct Assertion {
    pub subject: ConceptPath,   // was: EntityId (String)
    pub predicate: RelationId,  // unchanged
    // ... rest unchanged
}

Backward Compatibility

Existing assertions with flat string subjects are interpreted as:

"Tesla_Inc" → custom://Tesla_Inc
"Semaglutide" → custom://Semaglutide

The custom scheme with a single-segment path. This means:

  • All existing data remains valid
  • Old-format queries still work (the parser falls back to custom:// for unqualified strings)
  • No data migration required for existing assertions

The parser logic:

impl ConceptPath {
    pub fn parse(s: &str) -> Result<Self, ConceptPathError> {
        if let Some((scheme, path)) = s.split_once("://") {
            // New format: scheme://segments/leaf
            // `split` always yields at least one item, so an empty path, a
            // trailing slash, and "//" all show up as an empty segment.
            let segments: Vec<String> = path.split('/').map(String::from).collect();
            if segments.iter().any(|s| s.is_empty()) {
                return Err(ConceptPathError::EmptySegment);
            }
            Ok(ConceptPath { scheme: scheme.to_string(), segments })
        } else if s.is_empty() {
            // An empty legacy subject would produce an empty leaf.
            Err(ConceptPathError::EmptySegment)
        } else {
            // Legacy format: bare string → custom://{string}
            Ok(ConceptPath {
                scheme: "custom".to_string(),
                segments: vec![s.to_string()],
            })
        }
    }
}

Source Class Inference

During ingestion, if no explicit source_class is set, it's inferred from the ConceptPath scheme:

// In ingestion pipeline
if assertion.source_class == SourceClass::default() {
    assertion.source_class = assertion.subject.default_source_class();
}

This means ingesting a claim under rfc://7519/jwt/audience_validation automatically classifies it as Tier 0 (Regulatory) unless explicitly overridden. The path carries its own trust weight.


Query Changes

QueryParams

pub struct QueryParams {
    /// Subject filter. Accepts:
    /// - Full ConceptPath: `code://rust/citadeldb/auth/jwt/audience_validation`
    /// - Prefix (hierarchical): `code://rust/citadeldb/auth/*`
    /// - Legacy bare string: `Tesla_Inc` (interpreted as `custom://Tesla_Inc`)
    pub subject: Option<String>,

    /// Enable hierarchical query mode.
    /// When true, subject is treated as a prefix and all descendants are returned.
    /// When false (default), only exact subject matches are returned.
    pub hierarchical: bool,

    /// Resolve aliases before querying.
    /// When true (default), alias resolution expands the query to include
    /// all aliased paths for the target concept.
    pub resolve_aliases: bool,

    // ... existing fields unchanged
}

API Endpoints

New endpoints for concept management:

POST   /v1/concepts/alias          Create an alias between two paths
GET    /v1/concepts/aliases/{path}  List all aliases for a path
DELETE /v1/concepts/alias           Remove an alias
GET    /v1/concepts/tree/{prefix}   Browse the concept hierarchy under a prefix
GET    /v1/concepts/suggest         Get suggested aliases for a path (shared leaf detection)

Existing endpoints are unchanged. The subject parameter in query endpoints now accepts ConceptPath strings, with backward compatibility for bare strings.


Escalation Integration

The existing EscalationPolicy gains a concept_prefix field:

pub struct EscalationPolicy {
    pub name: String,
    pub min_conflict_score: f32,
    pub level: EscalationLevel,
    pub predicate_pattern: Option<String>,

    /// Optional concept path prefix. If set, this policy only applies
    /// to assertions whose subject is under this prefix.
    /// Example: "code://rust/citadeldb/auth" applies only to auth claims.
    pub concept_prefix: Option<String>,
}

This enables scoped escalation policies:

Policy: "production-auth-safety"
  concept_prefix: "code://*/*/auth"
  min_conflict_score: 0.4
  level: High

Policy: "general-config"
  concept_prefix: "code://"
  min_conflict_score: 0.7
  level: Medium

Auth claims get tighter thresholds. General config claims get looser ones. The wildcard * in concept prefixes matches any single segment.
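
A sketch of how a policy's concept_prefix might be matched against a subject, taking the single-segment wildcard rule literally. `matches_concept_prefix` is a hypothetical helper; following `is_prefix_of`, it treats "under" as strictly below the prefix, which is an assumption.

```rust
/// Split a wire-format path into (scheme, segments).
/// A bare scheme prefix such as "code://" yields zero segments.
fn split_path(s: &str) -> Option<(&str, Vec<&str>)> {
    let (scheme, rest) = s.split_once("://")?;
    let segments = rest.split('/').filter(|seg| !seg.is_empty()).collect();
    Some((scheme, segments))
}

/// True if `subject` lies strictly under `pattern`.
/// A `*` in the pattern matches exactly one segment; schemes must be equal.
pub fn matches_concept_prefix(pattern: &str, subject: &str) -> bool {
    let (p_scheme, p_segs) = match split_path(pattern) {
        Some(parts) => parts,
        None => return false,
    };
    let (s_scheme, s_segs) = match split_path(subject) {
        Some(parts) => parts,
        None => return false,
    };
    p_scheme == s_scheme
        && p_segs.len() < s_segs.len()
        // Compare whole segments, so "auth" never matches "authentication".
        && p_segs
            .iter()
            .zip(s_segs.iter())
            .all(|(p, s)| *p == "*" || p == s)
}
```

Under this reading a pattern needs one `*` per skipped segment: `code://*/*/auth` reaches `code://rust/citadeldb/auth/...`, while `code://*/auth` only reaches paths whose second segment is `auth`.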


CLI Code Analyzer (First Application)

The concept hierarchy is designed to support a code analysis CLI as the first consumer. The CLI:

  1. Walks a project's file tree
  2. Maps directory structure to ConceptPaths under the code:// scheme
  3. Extracts claims from source files using pattern-based extractors
  4. Ingests assertions into a local Episteme instance
  5. Queries for conflicts against authoritative concept spaces (rfc://, owasp://)

Path Mapping

The CLI maps project structure to ConceptPaths deterministically:

Project: citadeldb
Language: rust
Root: crates/citadeldb/

File: crates/citadeldb/src/auth/jwt.rs
  → code://rust/citadeldb/auth/jwt/{extracted_concept}

File: crates/citadeldb/src/net/tls.rs
  → code://rust/citadeldb/net/tls/{extracted_concept}

File: config/production.yaml
  → code://config/citadeldb/production/{extracted_concept}

Normalization

Directory names are normalized to concept segments:

| Directory | Concept Segment | Rule |
|-----------|-----------------|------|
| src/auth/ | auth | Strip src/ (language boilerplate) |
| src/net/ | net | Direct mapping |
| crates/citadeldb/ | citadeldb | Strip crates/ wrapper |
| config/ | config | Direct mapping |
| src/middleware/auth_jwt.rs | middleware/auth_jwt | Underscore preserved |

Normalization rules are language-specific and configurable. The CLI ships with defaults for Rust, Go, Python, and TypeScript.
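
For Rust source files, the default mapping might look like the sketch below. `map_source_path` and its parameters are hypothetical, and the hard-coded `src` and `crates` rules stand in for the configurable, language-specific rule set. Config files such as config/production.yaml follow a different mapping in the examples above and are not covered here.

```rust
/// Map a project-relative Rust source file to a `code://` concept prefix.
/// The extracted concept (the leaf) is appended by the extractor afterwards.
pub fn map_source_path(language: &str, project: &str, file: &str) -> String {
    // Separate the directory part from the file name.
    let (dirs, file_name) = match file.rsplit_once('/') {
        Some((dirs, name)) => (dirs, name),
        None => ("", file),
    };
    // Drop the file extension: "jwt.rs" -> "jwt".
    let stem = file_name
        .rsplit_once('.')
        .map_or(file_name, |(stem, _)| stem);

    let mut segments: Vec<String> = Vec::new();
    for dir in dirs.split('/') {
        // Skip wrapper directories, language boilerplate, and the project
        // directory itself (the project name is already in the prefix).
        if dir.is_empty() || dir == "src" || dir == "crates" || dir == project {
            continue;
        }
        segments.push(dir.to_ascii_lowercase());
    }
    segments.push(stem.to_ascii_lowercase());

    format!("code://{}/{}/{}", language, project, segments.join("/"))
}
```

For example, `crates/citadeldb/src/auth/jwt.rs` maps to `code://rust/citadeldb/auth/jwt`, to which an extractor would append a leaf such as `audience_validation`.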

Extractor Output

Each extractor produces a claim that maps to an Assertion:

struct ExtractedClaim {
    /// ConceptPath derived from file location + claim type
    path: ConceptPath,            // code://rust/citadeldb/auth/jwt/audience_validation
    /// What the code asserts
    predicate: String,            // "config_value"
    /// The extracted value
    value: ObjectValue,           // Boolean(false)  — i.e., aud validation is disabled
    /// File and line where found
    source_location: String,      // "crates/citadeldb/src/auth/jwt.rs:47"
    /// Confidence of extraction (1.0 for regex, lower for heuristic)
    extraction_confidence: f32,
}

Automatic Alias Suggestion

When the CLI ingests code://rust/citadeldb/auth/jwt/audience_validation and the database already contains rfc://7519/jwt/audience_validation, the shared leaf audience_validation under the shared parent segment jwt triggers an alias suggestion:

SUGGEST: code://rust/citadeldb/auth/jwt/audience_validation
      ↔  rfc://7519/jwt/audience_validation

Reason: shared leaf "audience_validation" under shared segment "jwt"
Action: [accept] [reject] [defer]

If accepted, the alias is created. Future queries for either path return claims from both. The conflict between "RFC says validate" and "code says skip" is now structurally visible through the Skeptic lens without anyone manually connecting them.
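
The suggestion trigger described in this section (same leaf, same segment directly above it, different scheme) can be sketched as a pure function over two wire-format strings. `suggest_alias` is a hypothetical name, and the sketch uses the stricter "shared parent segment" rule from this section rather than the leaf-only rule given under alias creation.

```rust
/// Return a human-readable reason if two paths look like alias candidates.
///
/// Candidates must use different schemes, share the same leaf, and share the
/// segment immediately above the leaf.
pub fn suggest_alias(a: &str, b: &str) -> Option<String> {
    let (a_scheme, a_path) = a.split_once("://")?;
    let (b_scheme, b_path) = b.split_once("://")?;
    if a_scheme == b_scheme {
        return None;
    }

    // Walk both paths from the leaf upward.
    let mut a_rev = a_path.rsplit('/');
    let mut b_rev = b_path.rsplit('/');
    let a_leaf = a_rev.next()?;
    let b_leaf = b_rev.next()?;
    if a_leaf != b_leaf {
        return None;
    }

    match (a_rev.next(), b_rev.next()) {
        (Some(a_parent), Some(b_parent)) if a_parent == b_parent => Some(format!(
            "shared leaf \"{}\" under shared segment \"{}\"",
            a_leaf, a_parent
        )),
        _ => None,
    }
}
```

The function only proposes; creating the alias still requires the accept step, which matches the Suggested origin described earlier (suggested by the system, confirmed by a human).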


Migration Path

Phase A: Add ConceptPath type

  • Add ConceptPath, ConceptAlias, SourceScheme to stemedb-core/src/types/
  • ConceptPath parsing with backward-compatible legacy fallback
  • Unit tests for parsing, prefix matching, parent traversal

Phase B: Storage layer support

  • Add CA: and CAR: key prefixes to storage
  • Implement alias resolution in query path
  • Implement hierarchical scan using existing scan_prefix
  • Add concept management endpoints to API

Phase C: Assertion migration

  • Change Assertion.subject from EntityId to ConceptPath
  • Update serialization (rkyv derive)
  • Update ingestion pipeline for ConceptPath subjects
  • Update index key construction
  • Backward compat: legacy assertions read as custom://{subject}

Phase D: Source class inference

  • Wire scheme-based tier inference into ingestion pipeline
  • Update escalation policies with concept_prefix
  • Update materialized views for ConceptPath subjects

Phase E: CLI analyzer (first application)

  • Project walker + path mapper
  • Security config extractors (TLS, JWT, secrets, timeouts, deps)
  • Ingestion into local Episteme
  • Conflict report generation
  • Alias suggestion engine

Design Decisions (Resolved)

  1. Wildcard depth: Recursive by default. code://rust/citadeldb/* matches all descendants, not just direct children. This matches what users expect from file systems (ls -R). scan_prefix is inherently recursive. If single-level queries become necessary later, add a depth: Option<usize> parameter to QueryParams. Don't build it now.

  2. No cross-scheme queries. *://*/auth/jwt/audience_validation is not supported. Too expensive — requires scanning every S: key in the database. If a user wants to see all sources about JWT audience validation across schemes, they alias the paths. Aliases are the cross-scheme mechanism. This forces users to be intentional about which concepts they connect, which is the right default.

  3. Alias conflict: the more authoritative tier wins. If rfc://7519/jwt/audience_validation (Tier 0) and blog://some-post/jwt/aud_check (Tier 5) both claim the same alias target, the RFC becomes canonical. Authority resolves ambiguity. Implementation: when creating an alias, check if the target already has a canonical. If so, compare scheme default tiers. The lower tier number (more authoritative) wins canonical status; the loser becomes an alias of the winner.

  4. No concept space metadata in v1. ConceptSpaces exist implicitly from assertion paths. No explicit ConceptSpace object, no metadata, no registration. The hierarchy is emergent. If we need metadata later (descriptions, owners, tags), it can be added as assertions about concept paths — the database eating its own tail.

  5. No path supersession. Alias is sufficient. Renaming auth to authentication is expressed as an alias from the old path to the new path. Existing assertions keep their original ConceptPath (append-only). No mutation, no supersession chain, no complexity. Old queries still work because the alias resolves.


Implementation Plan

Touch Points

Based on codebase analysis, ConceptPath changes touch these files:

| Component | File | Lines | What Changes |
|-----------|------|-------|--------------|
| Type definition | stemedb-core/src/types/mod.rs | 53 | EntityId = String stays as-is for backward compat. New ConceptPath type added alongside. |
| Assertion struct | stemedb-core/src/types/assertion.rs | 14 | subject: EntityId stays. ConceptPath stored as the EntityId string (wire format). |
| Index key construction | stemedb-storage/src/index_store.rs | 112-125 | subject_key() and subject_predicate_key(): no change needed. They already use subject.as_bytes(), and ConceptPath wire format is a valid UTF-8 string. |
| Index trait | stemedb-storage/src/index_store.rs | 161 | add_to_indexes(&self, subject: &str, ...): no signature change. ConceptPath's to_string() output is passed. |
| Ingestion validation | stemedb-ingest/src/worker/processing.rs | 210-216 | Subject length check stays (1024 byte limit). Add ConceptPath parse validation. |
| Ingestion indexing | stemedb-ingest/src/worker/processing.rs | 137 | add_to_indexes(&assertion.subject, ...): no change. Subject is still a String. |
| Signature message | stemedb-ingest/src/worker/processing.rs | 260-261 | format!("{}:{}", subject, predicate): no change. The subject string now contains :// and / but the signature format is still {subject}:{predicate}. Backward compatible because legacy subjects don't contain ://. |
| Query matching | stemedb-query/src/query/mod.rs | 233-240 | assertion.subject != subject stays as string comparison for exact match. Add hierarchical mode that uses starts_with for prefix queries. |
| Query execution | stemedb-query/src/engine/mod.rs | 122-149 | Fast path and slow path selection. Add hierarchical branch that uses scan_prefix instead of get. |
| Candidate fetch | stemedb-query/src/engine/candidates.rs | 23-64 | fetch_by_subject() and fetch_by_subject_predicate(): add fetch_by_subject_prefix() alongside. |
| MV key construction | stemedb-query/src/engine/execution.rs | 38 | format!("MV:{}:{}", subject, predicate): no change. ConceptPath wire format works as-is. |
| MV materializer | stemedb-query/src/materializer/mod.rs | 147-155 | SP key parsing. Must handle :// and / in subject portion of SP:{subject}:{predicate}. Current parser splits on : which will break on code://. Needs fix. |
| API query params | stemedb-api/src/dto/query_params.rs | 14-26 | Add hierarchical: Option<bool> and resolve_aliases: Option<bool>. |
| API create request | stemedb-api/src/dto/create.rs | 14-65 | Subject stays as String in DTO. ConceptPath validation happens in handler. |
| API handlers | stemedb-api/src/handlers/assert.rs | 99 | Add ConceptPath parse validation before creating Assertion. |
| API query handler | stemedb-api/src/handlers/query.rs | 89-96 | Add alias resolution and hierarchical query routing. |
| Lenses | stemedb-lens/src/*.rs | | No change. Lenses operate on pre-filtered candidates. They never access assertion.subject for resolution. |
| Simulator | stemedb-sim/src/agent.rs | 47-51 | Test subjects must use ConceptPath format. Update arena test data. |
| Escalation | stemedb-core/src/types/escalation.rs | 74-85 | Add concept_prefix: Option<String> to EscalationPolicy. |

Critical Insight: Minimal Type Change

The key realization: EntityId stays as String. ConceptPath is a parsing/validation layer, not a storage type. The wire format of a ConceptPath (code://rust/citadeldb/auth/jwt/aud_validation) is a valid UTF-8 string. It's stored as a String in the Assertion, indexed as bytes in the KV store, and compared as strings in queries.

This means:

  • No rkyv serialization changes
  • No storage migration
  • No signature format breakage
  • Existing assertions with flat string subjects remain valid ("Tesla_Inc" is a valid string that parses as custom://Tesla_Inc)

ConceptPath is a type that parses from and serializes to String. It provides structured access (scheme, segments, leaf, parent, prefix matching) but the underlying storage representation is unchanged.

The One Breaking Change: SP Key Parsing

The materializer parses SP: keys to extract subject and predicate:

SP:{subject}:{predicate}

Current parsing splits on :. This breaks when the subject contains ://:

SP:code://rust/citadeldb/auth/jwt/aud_validation:config_value
     ^--- splits here, wrong

Fix: Either the SP key separator changes from : to \x00 (null byte), or the parser splits on the last : instead of the first. Splitting on the last : is safe and backward compatible as long as predicates never contain a colon; the subject may then contain any number of them:

// Current (broken for ConceptPath subjects):
let parts: Vec<&str> = key_str.splitn(3, ':').collect();

// Fixed (split from the right):
let key_str = &key_str[3..]; // strip "SP:" prefix
if let Some(last_colon) = key_str.rfind(':') {
    let subject = &key_str[..last_colon];
    let predicate = &key_str[last_colon + 1..];
}

This is the only storage-level code change required. Everything else is additive.
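
Wrapped as a function, the right-split can be checked against both legacy and ConceptPath keys. `split_sp_key` is a hypothetical name, and the sketch assumes well-formed keys whose predicate contains no colon.

```rust
/// Split an `SP:{subject}:{predicate}` key on its last colon.
///
/// Returns `None` if the `SP:` prefix is missing, if there is no colon
/// separating subject and predicate, or if either part is empty.
pub fn split_sp_key(key: &str) -> Option<(&str, &str)> {
    let rest = key.strip_prefix("SP:")?;
    // Split from the right so colons inside the subject ("code://...")
    // stay with the subject.
    let (subject, predicate) = rest.rsplit_once(':')?;
    if subject.is_empty() || predicate.is_empty() {
        return None;
    }
    Some((subject, predicate))
}
```

Legacy keys such as `SP:Tesla_Inc:has_revenue` contain exactly one separator colon, so they split the same way under the old and the new parser.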

New Code

| What | Where | Purpose |
|------|-------|---------|
| ConceptPath struct | stemedb-core/src/types/concept.rs | Parse, validate, traverse, prefix-match |
| ConceptAlias struct | stemedb-core/src/types/concept.rs | Alias record with origin tracking |
| SourceScheme enum | stemedb-core/src/types/concept.rs | Known schemes → default tier mapping |
| AliasStore trait | stemedb-storage/src/alias_store.rs | CA: and CAR: key management |
| GenericAliasStore | stemedb-storage/src/alias_store.rs | Implementation over KVStore |
| Alias resolution | stemedb-query/src/engine/aliases.rs | Resolve aliases before candidate fetch |
| Hierarchical fetch | stemedb-query/src/engine/candidates.rs | fetch_by_subject_prefix() using scan_prefix |
| Concept API handlers | stemedb-api/src/handlers/concepts.rs | CRUD for aliases, tree browsing, suggestions |
| Concept DTOs | stemedb-api/src/dto/concepts.rs | Request/response types for concept endpoints |

Where This Fits in the Roadmap

This is Phase 5A.2 from the existing roadmap — "Key Layout Redesign." The roadmap already calls for preparing keys for subject-prefix range sharding. ConceptPath is the answer to that item. The subject-prefix key layout proposed in the roadmap:

{subject}\x00H:{hash}       → Assertion data
{subject}\x00MV:{predicate}  → Materialized view

...is exactly what ConceptPath enables, because the subject is now a hierarchical path that naturally groups related data.

Sequence within Phase 5:

5A.1  Replace sled with redb/fjall        ← storage engine swap (independent)
5A.2  ConceptPath + Key Layout Redesign    ← THIS SPEC
5B.*  WAL Hardening                        ← independent, can parallel
5C.*  Index Persistence                    ← depends on 5A.1

5A.2 (ConceptPath) can start before 5A.1 (sled replacement) because the changes are at the type and parsing layer, not the storage engine layer. The SP key parser fix and the alias store are the only storage-touching changes, and they work against any KVStore backend.

However, if we're sequencing for the agent-first roadmap (tmp/agent-first-roadmap.md), the order is:

1. ConceptPath types + parsing              (stemedb-core)     ← do first
2. Alias store                              (stemedb-storage)  ← do second
3. Hierarchical query + alias resolution    (stemedb-query)    ← do third
4. Concept API endpoints                    (stemedb-api)      ← do fourth
5. SP key parser fix                        (stemedb-query)    ← do with #3
6. Escalation policy concept_prefix         (stemedb-core)     ← do with #4
7. CLI code analyzer                        (new crate)        ← Phase 1 of agent-first roadmap

Steps 1-6 are database changes. Step 7 is the first application. Steps 1-4 are sufficient to validate the core loop from the agent-first roadmap — you can ingest assertions with ConceptPath subjects, query them hierarchically, and see conflicts across schemes via aliases.

What We Do NOT Build

  • No ConceptSpace metadata objects
  • No cross-scheme wildcard queries (*://)
  • No path supersession mechanism
  • No automatic concept extraction (that's the CLI, not the database)
  • No UI for concept browsing (API only)
  • No concept-level permissions or ACLs
  • No concept versioning or history (concepts are implicit, assertions have history)