# Concept Hierarchy Spec
**Status:** Draft
**Author:** Jordan Washburn
**Date:** 2026-02-02
---
## Problem
The current `EntityId` is a flat `String`. Subjects like `"Semaglutide"` or `"Tesla_Inc"` have no structure, no namespace, no hierarchy. This creates three problems:
1. **Collision.** Two projects making claims about JWT validation produce assertions with the same subject string. There's no way to distinguish "citadeldb's JWT config" from "some other project's JWT config" without manual namespacing conventions.
2. **No hierarchical queries.** You can't ask "show me all security-related conflicts in citadeldb." You can only query exact subjects. There's no way to traverse up ("all authentication claims") or down ("just the audience validation claims").
3. **Entity resolution is manual.** "Ozempic" and "Semaglutide" and "GLP-1 agonist" are different strings that refer to overlapping concepts. There's no alias mechanism. Every consumer of the data has to know the synonyms.
---
## Design
### ConceptPath
A ConceptPath replaces the flat `EntityId` as the subject of an Assertion. It's a structured, hierarchical identifier with a scheme, segments, and a leaf.
**Wire format (string representation):**
```
{scheme}://{segment_0}/{segment_1}/.../{segment_n}/{leaf}
```
**Examples:**
```
code://rust/citadeldb/auth/jwt/audience_validation
code://go/episteme-sdk/auth/api_keys/rotation_policy
rfc://7519/jwt/audience_validation
owasp://cheatsheet/tls/certificate_verification
fda://label/semaglutide/side_effects/gastroparesis
sec://10k/tesla/revenue/2024
internal://wiki/citadeldb/auth/jwt/skip_aud
```
**Components:**
| Component | Required | Description |
|-----------|----------|-------------|
| `scheme` | Yes | The source domain. Maps to a default source tier. |
| `segments` | Yes (1+) | The hierarchical path, ordered from broad to narrow. The last segment is the leaf. |
| `leaf` | Yes | The specific concept at the bottom of the hierarchy (always the final segment). |
**Rules:**
- Scheme is lowercase alphanumeric + hyphens. No colons, no slashes.
- Segments and leaf are lowercase alphanumeric + underscores + hyphens. No spaces, no dots.
- The `://` separator is mandatory between scheme and path.
- `/` separates segments and leaf. No trailing slash.
- Maximum total length: 1024 bytes (matches current `MAX_SUBJECT_LEN`).
- Minimum: `{scheme}://{leaf}` (one segment that is also the leaf). Example: `code://readme`.
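The rules above can be checked mechanically. The following is a minimal sketch, not the project's validator: `PathRuleError`, `check_path_rules`, and `MAX_PATH_LEN` are hypothetical names introduced for illustration, and the real `ConceptPathError` may be shaped differently.

```rust
/// Maximum wire-format length in bytes (the spec ties this to `MAX_SUBJECT_LEN`).
const MAX_PATH_LEN: usize = 1024;

/// Hypothetical error type for this sketch; the real `ConceptPathError` may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum PathRuleError {
    TooLong,
    MissingSeparator,
    BadScheme,
    BadSegment,
}

/// Scheme characters: lowercase ASCII letters, digits, hyphens.
fn is_scheme_char(c: char) -> bool {
    c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-'
}

/// Segment characters: everything a scheme allows, plus underscores.
fn is_segment_char(c: char) -> bool {
    is_scheme_char(c) || c == '_'
}

/// Check a wire-format string against the rules listed above.
pub fn check_path_rules(s: &str) -> Result<(), PathRuleError> {
    if s.len() > MAX_PATH_LEN {
        return Err(PathRuleError::TooLong);
    }
    let (scheme, path) = s.split_once("://").ok_or(PathRuleError::MissingSeparator)?;
    if scheme.is_empty() || !scheme.chars().all(is_scheme_char) {
        return Err(PathRuleError::BadScheme);
    }
    // An empty segment covers a trailing slash, a doubled slash, and an empty path.
    for segment in path.split('/') {
        if segment.is_empty() || !segment.chars().all(is_segment_char) {
            return Err(PathRuleError::BadSegment);
        }
    }
    Ok(())
}
```

Under these strict rules a path such as `custom://Tesla_Inc` is rejected because of its uppercase letters, so the legacy fallback described later presumably needs a looser check for bare strings.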
### Scheme Registry
Schemes map to default source tiers. This is the connection between the concept hierarchy and the existing `SourceClass` system.
| Scheme | Default Tier | SourceClass | Description |
|--------|-------------|-------------|-------------|
| `rfc` | 0 | Regulatory | IETF standards documents |
| `nist` | 0 | Regulatory | NIST publications, CIS benchmarks |
| `fda` | 0 | Regulatory | FDA labels, approvals, warnings |
| `sec` | 0 | Regulatory | SEC filings, regulations |
| `owasp` | 1 | Clinical | Security best practices |
| `pubmed` | 1 | Clinical | Peer-reviewed biomedical literature |
| `vendor` | 2 | Observational | Official vendor documentation |
| `internal` | 3 | Expert | Internal wikis, runbooks, decisions |
| `code` | 3 | Expert | Extracted from source code |
| `community` | 4 | Community | Stack Overflow, forums, registries |
| `blog` | 5 | Anecdotal | Blog posts, tutorials, social media |
| `custom` | 3 | Expert | User-defined (tier overridable) |
The scheme provides a **default** tier. An assertion can still override `source_class` explicitly. But if `source_class` is not provided at ingestion time, the scheme determines it. This means the concept path carries its own trust signal.
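A minimal sketch of the registry lookup and the override rule, using plain tier numbers instead of `SourceClass` so it stays self-contained. The function names `default_tier` and `effective_tier` are illustrative, and treating unknown schemes like `custom` follows the fallback described for `SourceScheme` later in this spec.

```rust
/// Default tier for a scheme string, following the registry table above.
/// Unknown schemes fall back to the `custom` row (Tier 3).
pub fn default_tier(scheme: &str) -> u8 {
    match scheme {
        "rfc" | "nist" | "fda" | "sec" => 0,
        "owasp" | "pubmed" => 1,
        "vendor" => 2,
        "internal" | "code" | "custom" => 3,
        "community" => 4,
        "blog" => 5,
        _ => 3,
    }
}

/// Resolve the effective tier: an explicit override wins, otherwise the scheme decides.
pub fn effective_tier(scheme: &str, explicit: Option<u8>) -> u8 {
    explicit.unwrap_or_else(|| default_tier(scheme))
}
```

Modelling the explicit value as an `Option` keeps "not provided" distinct from "provided and equal to the default", which a plain comparison against a default value cannot do.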
### ConceptSpace
A ConceptSpace is a node in the hierarchy tree. It's not a separate entity — it's an emergent structure defined by the set of assertions whose ConceptPaths share a common prefix.
Querying `code://rust/citadeldb/auth` returns all assertions under that prefix:
- `code://rust/citadeldb/auth/jwt/audience_validation`
- `code://rust/citadeldb/auth/jwt/expiry_policy`
- `code://rust/citadeldb/auth/oauth/scope_validation`
- `code://rust/citadeldb/auth/session/timeout`
No explicit ConceptSpace object needs to be created or registered for hierarchical queries. The hierarchy exists implicitly in the paths. This is critical — it means the system works without any upfront taxonomy.
### Aliases
Aliases connect ConceptPaths that refer to the same underlying concept across different hierarchies or naming conventions.
**Alias record:**
```
canonical: rfc://7519/jwt/audience_validation
aliases:
- code://rust/citadeldb/auth/jwt/aud_check
- internal://wiki/citadeldb/auth/skip_aud
- community://stackoverflow/jwt/verify_aud
```
When a query targets any alias, the system resolves to the canonical path and returns assertions from **all** aliased paths. This is how entity resolution works without NLP.
**Alias creation:**
1. **Manual.** A human or agent explicitly declares: "these paths are about the same concept."
2. **Suggested.** When a new path is ingested that shares a leaf name with an existing path under a different scheme, the system flags it as a potential alias. Example: ingesting `code://rust/citadeldb/auth/jwt/audience_validation` when `rfc://7519/jwt/audience_validation` already exists. Same leaf, different scheme. Likely about the same thing.
3. **Merged.** Two ConceptSpaces that were separate can be merged by aliasing their roots. Aliasing `code://rust/citadeldb/auth/jwt` to `code://go/episteme-sdk/auth/jwt` means queries against either return claims from both.
**Alias rules:**
- An alias always points to a canonical path. The canonical is the most authoritative path by scheme default, i.e. the one with the lowest tier number (Tier 0 outranks Tier 5).
- If two Tier 0 paths are aliased, the first registered is canonical.
- Aliases are transitive. If A aliases to B and B aliases to C, querying A returns results from A, B, and C.
- Aliases are stored, not computed. They don't change retroactively. Existing assertions keep their original ConceptPath. The alias is a lookup-time resolution.
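The canonical-selection rule can be sketched as follows. This is an illustration under stated assumptions: `pick_canonical`, `tier_of`, and `scheme_of` are hypothetical helpers, tiers are plain numbers, and applying the first-registered tie-break to every tier (the spec only states it for Tier 0) is an assumption.

```rust
/// Scheme of a wire-format path, or "custom" for a legacy bare string.
fn scheme_of(path: &str) -> &str {
    path.split_once("://").map_or("custom", |(scheme, _)| scheme)
}

/// Default tier per the Scheme Registry (lower number = more authoritative).
fn tier_of(path: &str) -> u8 {
    match scheme_of(path) {
        "rfc" | "nist" | "fda" | "sec" => 0,
        "owasp" | "pubmed" => 1,
        "vendor" => 2,
        "community" => 4,
        "blog" => 5,
        _ => 3, // internal, code, custom, and unknown schemes
    }
}

/// Pick the canonical path from a group listed in registration order.
/// The lowest tier number wins; `min_by_key` returns the first minimum,
/// so ties go to the earliest-registered path.
pub fn pick_canonical<'a>(paths_in_registration_order: &[&'a str]) -> Option<&'a str> {
    paths_in_registration_order
        .iter()
        .copied()
        .min_by_key(|p| tier_of(p))
}
```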
---
## Rust Types
### ConceptPath
```rust
/// A hierarchical, scheme-qualified concept identifier.
///
/// Replaces flat `EntityId` strings as assertion subjects. Enables
/// hierarchical queries, cross-scheme alias resolution, and scheme-based
/// default source tier inference.
///
/// # Wire Format
/// `{scheme}://{segment_0}/{segment_1}/.../{leaf}`
///
/// # Examples
/// - `code://rust/citadeldb/auth/jwt/audience_validation`
/// - `rfc://7519/jwt/audience_validation`
/// - `fda://label/semaglutide/side_effects/gastroparesis`
#[derive(Archive, Deserialize, Serialize, Debug, Clone, PartialEq, Eq, Hash)]
#[archive(check_bytes)]
pub struct ConceptPath {
    /// The source domain scheme (e.g., "code", "rfc", "fda").
    /// Determines default source tier.
    pub scheme: String,
    /// Ordered path segments from broad to narrow.
    /// Must have at least one element. The last element is the leaf.
    ///
    /// For `code://rust/citadeldb/auth/jwt/audience_validation`:
    /// segments = ["rust", "citadeldb", "auth", "jwt", "audience_validation"]
    pub segments: Vec<String>,
}
```
The leaf is `segments.last()`. Segments includes the leaf — no separate field. This simplifies serialization and prefix matching.
### ConceptAlias
```rust
/// An alias mapping between concept paths.
///
/// Stored at `CA:{alias_path}` → canonical path.
/// Enables cross-scheme entity resolution without NLP.
#[derive(Archive, Deserialize, Serialize, Debug, Clone, PartialEq)]
#[archive(check_bytes)]
pub struct ConceptAlias {
    /// The alias path (the one being looked up).
    pub alias: ConceptPath,
    /// The canonical path (the one it resolves to).
    pub canonical: ConceptPath,
    /// Who created this alias (agent public key).
    pub created_by: [u8; 32],
    /// When this alias was created.
    pub created_at: u64,
    /// How this alias was created.
    pub origin: AliasOrigin,
}
/// How an alias was created.
#[derive(Archive, Deserialize, Serialize, Debug, Clone, Copy, PartialEq, Eq)]
#[archive(check_bytes)]
pub enum AliasOrigin {
    /// Explicitly declared by a human or agent.
    Manual,
    /// Suggested by the system (shared leaf name), confirmed by a human.
    Suggested,
    /// Created by a merge operation.
    Merged,
}
```
### SourceScheme
```rust
/// Known source domain schemes with default tier mappings.
///
/// Used to infer `SourceClass` from a ConceptPath's scheme when
/// no explicit source class is provided at ingestion time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceScheme {
    Rfc,
    Nist,
    Fda,
    Sec,
    Owasp,
    Pubmed,
    Vendor,
    Internal,
    Code,
    Community,
    Blog,
    Custom,
}
impl SourceScheme {
    /// Parse a scheme string into a known SourceScheme.
    /// Unknown schemes map to Custom.
    pub fn from_str(s: &str) -> Self { /* ... */ }
    /// The default SourceClass for this scheme.
    pub fn default_source_class(&self) -> SourceClass { /* ... */ }
    /// The scheme string for serialization.
    pub fn as_str(&self) -> &'static str { /* ... */ }
}
```
### ConceptPath Methods
```rust
impl ConceptPath {
    /// Parse from wire format: `scheme://segment/segment/.../leaf`
    pub fn parse(s: &str) -> Result<Self, ConceptPathError> { /* ... */ }
    /// Serialize to wire format.
    pub fn to_string(&self) -> String {
        format!("{}://{}", self.scheme, self.segments.join("/"))
    }
    /// The leaf concept (last segment).
    pub fn leaf(&self) -> &str {
        // segments is guaranteed non-empty by construction
        &self.segments[self.segments.len() - 1]
    }
    /// The parent path (all segments except the last).
    /// Returns None if this path has only one segment (the leaf).
    pub fn parent(&self) -> Option<ConceptPath> {
        if self.segments.len() <= 1 {
            return None;
        }
        Some(ConceptPath {
            scheme: self.scheme.clone(),
            segments: self.segments[..self.segments.len() - 1].to_vec(),
        })
    }
    /// The depth of this path (number of segments).
    pub fn depth(&self) -> usize {
        self.segments.len()
    }
    /// Check if this path is a strict prefix of another: same scheme,
    /// fewer segments, and matching leading segments.
    /// A path is not considered a prefix of itself.
    pub fn is_prefix_of(&self, other: &ConceptPath) -> bool {
        self.scheme == other.scheme
            && self.segments.len() < other.segments.len()
            && other.segments.starts_with(&self.segments)
    }
    /// Check if this path is under a given prefix.
    pub fn is_under(&self, prefix: &ConceptPath) -> bool {
        prefix.is_prefix_of(self)
    }
    /// The inferred default SourceClass based on the scheme.
    pub fn default_source_class(&self) -> SourceClass {
        SourceScheme::from_str(&self.scheme).default_source_class()
    }
    /// Generate the storage key for this path: `S:` followed by the wire format.
    /// Used by IndexStore for scan_prefix operations. No trailing `/` is added
    /// here; for hierarchical scans the caller appends it so that `auth`
    /// does not match `authentication`.
    pub fn storage_prefix(&self) -> Vec<u8> {
        let mut key = b"S:".to_vec();
        key.extend_from_slice(self.to_string().as_bytes());
        key
    }
}
```
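As a design note, Clippy flags inherent methods named `to_string` and `from_str` (the `inherent_to_string` and `should_implement_trait` lints). One possible alternative, sketched here under the assumption that the wire format and the legacy fallback stay as described in this spec, is to implement `Display` and `FromStr`. The `String` error type is a placeholder for `ConceptPathError`, and the rkyv derives are left out to keep the sketch self-contained.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ConceptPath {
    pub scheme: String,
    pub segments: Vec<String>,
}

impl fmt::Display for ConceptPath {
    /// Writes the wire format `{scheme}://{segments joined by '/'}`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}://{}", self.scheme, self.segments.join("/"))
    }
}

impl FromStr for ConceptPath {
    // Placeholder error type; the spec's `ConceptPathError` would go here.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Bare strings without "://" are treated as legacy `custom://` subjects.
        let (scheme, path) = s.split_once("://").unwrap_or(("custom", s));
        if scheme.is_empty() || path.split('/').any(str::is_empty) {
            return Err(format!("invalid concept path: {}", s));
        }
        Ok(ConceptPath {
            scheme: scheme.to_string(),
            segments: path.split('/').map(String::from).collect(),
        })
    }
}
```

With these impls, `"rfc://7519/jwt/audience_validation".parse::<ConceptPath>()` and `path.to_string()` work through the standard traits, and the type can be used directly in `format!` strings.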
---
## Storage Changes
### Key Format Migration
Current key format uses flat subjects:
```
S:Tesla_Inc → Vec<Hash>
SP:Tesla_Inc:has_revenue → Vec<Hash>
```
New key format uses ConceptPath wire format:
```
S:sec://10k/tesla/revenue/2024 → Vec<Hash>
SP:sec://10k/tesla/revenue/2024:reported_amount → Vec<Hash>
```
Because `://` and `/` are part of the subject string now, and the existing `S:` prefix scanning uses byte-level prefix matching, hierarchical queries work automatically:
```
scan_prefix("S:code://rust/citadeldb/auth/")
```
This returns all keys under that path. The `/` in the ConceptPath maps directly to byte-level prefix scanning that `KVStore::scan_prefix` already supports. No new index structure needed for hierarchical queries.
### New Key Prefixes
| Prefix | Format | Content | Purpose |
|--------|--------|---------|---------|
| `CA:` | `CA:{alias_path_string}` | Serialized canonical ConceptPath | Alias resolution |
| `CAR:` | `CAR:{canonical_path_string}` | Serialized `Vec<ConceptPath>` | Reverse alias index (canonical -> all aliases) |
| `CS:` | `CS:{scheme}://{partial_path}` | Serialized concept metadata | Concept space metadata (optional, for caching) |
### Alias Resolution in Queries
When the query engine receives a subject query:
```
1. Parse the subject as a ConceptPath
2. Check CA:{subject} for an alias
   - If alias exists, resolve to canonical
   - Check CA:{canonical} recursively (transitive aliases)
   - Collect all aliased paths from CAR:{canonical}
3. For each path (canonical + all aliases):
   - Run the existing index lookup (S: or SP:)
4. Merge results, deduplicate by assertion hash
5. Pass merged candidates to the Lens for resolution
```
This adds one or two extra lookups per query. The `CAR:` reverse index prevents needing to scan all `CA:` keys to find aliases for a canonical path.
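The expansion step can be sketched with two in-memory maps standing in for the `CA:` and `CAR:` key spaces. `AliasIndex`, `canonical_of`, `expand`, and the hop limit are hypothetical; the real stores would sit on top of `KVStore` and deserialize `ConceptPath` values.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-ins for the `CA:` and `CAR:` key spaces (illustrative only).
pub struct AliasIndex {
    /// alias path -> canonical path (the `CA:` mapping)
    pub ca: HashMap<String, String>,
    /// canonical path -> all alias paths (the `CAR:` mapping)
    pub car: HashMap<String, Vec<String>>,
}

/// Upper bound on transitive hops; guards against accidental alias cycles.
const MAX_HOPS: usize = 16;

impl AliasIndex {
    /// Follow `CA:` links until a path with no further alias is reached.
    pub fn canonical_of(&self, subject: &str) -> String {
        let mut current = subject.to_string();
        for _ in 0..MAX_HOPS {
            let next = match self.ca.get(&current) {
                Some(next) if *next != current => next.clone(),
                _ => break,
            };
            current = next;
        }
        current
    }

    /// Every path whose index entries should be consulted for `subject`:
    /// the subject itself, its canonical path, and all aliases of that canonical.
    pub fn expand(&self, subject: &str) -> BTreeSet<String> {
        let canonical = self.canonical_of(subject);
        let mut paths = BTreeSet::new();
        paths.insert(subject.to_string());
        if let Some(aliases) = self.car.get(&canonical) {
            paths.extend(aliases.iter().cloned());
        }
        paths.insert(canonical);
        paths
    }
}
```

Each path in the returned set is then looked up through the existing `S:` or `SP:` index, and the resulting hashes are deduplicated before they reach the Lens.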
### Hierarchical Queries
When the subject is a prefix (not a full leaf path), the query engine uses `scan_prefix` instead of `get`:
```
Query: subject = "code://rust/citadeldb/auth"
1. Parse as ConceptPath
2. Detect: this path has no assertions directly (it's a namespace, not a leaf)
   OR: the caller explicitly requests hierarchical mode
3. Run scan_prefix("S:code://rust/citadeldb/auth/")
   - Note the trailing /: it prevents "auth" matching "authentication"
4. Collect all assertion hashes from all matching keys
5. Resolve aliases for each matched path
6. Pass all candidates to the Lens
```
The trailing `/` is important. Without it, `auth` would prefix-match `authentication`. The query engine appends `/` when running hierarchical queries against non-leaf paths.
**Detection heuristic:** A query is hierarchical if:
- The subject has no assertions in `S:{subject}` (exact match misses)
- OR the caller sets a `hierarchical: true` flag in QueryParams
- OR the subject ends with `/*` (glob syntax)
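The detection rules and the trailing-slash scan can be sketched together. An ordered `BTreeMap` stands in for `KVStore::scan_prefix` here; `detect_hierarchical`, `scan_descendants`, and their parameters are illustrative names, not the query engine's API.

```rust
use std::collections::BTreeMap;

/// Decide whether a query is hierarchical and return the bare subject.
/// `flag` is the caller's `hierarchical` setting; `exact_hit` says whether
/// `S:{subject}` exists. A `/*` suffix always forces hierarchical mode.
pub fn detect_hierarchical(subject: &str, flag: bool, exact_hit: bool) -> (String, bool) {
    if let Some(base) = subject.strip_suffix("/*") {
        return (base.to_string(), true);
    }
    (subject.to_string(), flag || !exact_hit)
}

/// Collect all subject keys under `subject`, in key order.
pub fn scan_descendants<'a>(
    index: &'a BTreeMap<String, Vec<u64>>,
    subject: &str,
) -> Vec<&'a str> {
    // The trailing '/' keeps "auth" from matching "authentication".
    let prefix = format!("S:{}/", subject.trim_end_matches('/'));
    index
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, _)| key.as_str())
        .collect()
}
```

Because the keys are ordered bytewise, all descendants of a prefix are contiguous, so the scan can stop at the first key that no longer starts with the prefix.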
---
## Assertion Changes
### Subject Field
The `Assertion` struct's `subject` field changes from `EntityId` (String) to `ConceptPath`.
```rust
pub struct Assertion {
    pub subject: ConceptPath, // was: EntityId (String)
    pub predicate: RelationId, // unchanged
    // ... rest unchanged
}
```
### Backward Compatibility
Existing assertions with flat string subjects are interpreted as:
```
"Tesla_Inc" → custom://Tesla_Inc
"Semaglutide" → custom://Semaglutide
```
The `custom` scheme with a single-segment path. This means:
- All existing data remains valid
- Old-format queries still work (the parser falls back to `custom://` for unqualified strings)
- No data migration required for existing assertions
The parser logic:
```rust
impl ConceptPath {
    pub fn parse(s: &str) -> Result<Self, ConceptPathError> {
        if let Some((scheme, path)) = s.split_once("://") {
            // New format: scheme://segments/leaf
            let segments: Vec<String> = path.split('/').map(String::from).collect();
            if segments.is_empty() || segments.iter().any(|s| s.is_empty()) {
                return Err(ConceptPathError::EmptySegment);
            }
            Ok(ConceptPath { scheme: scheme.to_string(), segments })
        } else {
            // Legacy format: bare string → custom://{string}
            Ok(ConceptPath {
                scheme: "custom".to_string(),
                segments: vec![s.to_string()],
            })
        }
    }
}
```
### Source Class Inference
During ingestion, if no explicit `source_class` is set, it's inferred from the ConceptPath scheme:
```rust
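// In ingestion pipeline
// Caveat: an explicitly supplied value equal to `SourceClass::default()` is
// indistinguishable from "not provided" and is also replaced by the inferred class.
if assertion.source_class == SourceClass::default() {
    assertion.source_class = assertion.subject.default_source_class();
}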
```
This means ingesting a claim under `rfc://7519/jwt/audience_validation` automatically classifies it as Tier 0 (Regulatory) unless explicitly overridden. The path carries its own trust weight.
---
## Query Changes
### QueryParams
```rust
pub struct QueryParams {
    /// Subject filter. Accepts:
    /// - Full ConceptPath: `code://rust/citadeldb/auth/jwt/audience_validation`
    /// - Prefix (hierarchical): `code://rust/citadeldb/auth/*`
    /// - Legacy bare string: `Tesla_Inc` (interpreted as `custom://Tesla_Inc`)
    pub subject: Option<String>,
    /// Enable hierarchical query mode.
    /// When true, subject is treated as a prefix and all descendants are returned.
    /// When false (default), only exact subject matches are returned.
    pub hierarchical: bool,
    /// Resolve aliases before querying.
    /// When true (default), alias resolution expands the query to include
    /// all aliased paths for the target concept.
    pub resolve_aliases: bool,
    // ... existing fields unchanged
}
```
### API Endpoints
New endpoints for concept management:
```
POST /v1/concepts/alias Create an alias between two paths
GET /v1/concepts/aliases/{path} List all aliases for a path
DELETE /v1/concepts/alias Remove an alias
GET /v1/concepts/tree/{prefix} Browse the concept hierarchy under a prefix
GET /v1/concepts/suggest Get suggested aliases for a path (shared leaf detection)
```
Existing endpoints are unchanged. The `subject` parameter in query endpoints now accepts ConceptPath strings, with backward compatibility for bare strings.
---
## Escalation Integration
The existing `EscalationPolicy` gains a `concept_prefix` field:
```rust
pub struct EscalationPolicy {
    pub name: String,
    pub min_conflict_score: f32,
    pub level: EscalationLevel,
    pub predicate_pattern: Option<String>,
    /// Optional concept path prefix. If set, this policy only applies
    /// to assertions whose subject is under this prefix.
    /// Example: "code://rust/citadeldb/auth" applies only to auth claims.
    pub concept_prefix: Option<String>,
}
```
This enables scoped escalation policies:
```
Policy: "production-auth-safety"
  concept_prefix: "code://*/*/auth"
  min_conflict_score: 0.4
  level: High
Policy: "general-config"
  concept_prefix: "code://"
  min_conflict_score: 0.7
  level: Medium
```
Auth claims get tighter thresholds. General config claims get looser ones. The wildcard `*` in concept prefixes matches any single segment, so `code://*/*/auth` covers the `auth` subtree of every language and project (for example `code://rust/citadeldb/auth/...`).
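A minimal sketch of how a `concept_prefix` with single-segment wildcards could be matched against an assertion subject. `matches_concept_prefix` and `split_path` are hypothetical names, and treating a subject equal to the prefix as a match is an assumption.

```rust
/// Split a wire-format path into (scheme, segments); illustrative helper.
/// Empty segments are dropped so that a bare `code://` yields no segments.
fn split_path(path: &str) -> Option<(&str, Vec<&str>)> {
    let (scheme, rest) = path.split_once("://")?;
    let segments = rest.split('/').filter(|s| !s.is_empty()).collect();
    Some((scheme, segments))
}

/// True if `subject` is at or under `pattern`, where `*` in the pattern
/// matches exactly one segment. Schemes must match literally.
pub fn matches_concept_prefix(pattern: &str, subject: &str) -> bool {
    match (split_path(pattern), split_path(subject)) {
        (Some((p_scheme, p_segs)), Some((s_scheme, s_segs))) => {
            p_scheme == s_scheme
                && p_segs.len() <= s_segs.len()
                && p_segs.iter().zip(&s_segs).all(|(p, s)| *p == "*" || p == s)
        }
        _ => false,
    }
}
```

Under this reading, `code://*/*/auth` matches `code://rust/citadeldb/auth/jwt/audience_validation`, while `code://*/auth` does not, because its second segment would have to line up with the project name. Matching whole segments also keeps `auth` from matching `authentication`.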
---
## CLI Code Analyzer (First Application)
The concept hierarchy is designed to support a code analysis CLI as the first consumer. The CLI:
1. Walks a project's file tree
2. Maps directory structure to ConceptPaths under the `code://` scheme
3. Extracts claims from source files using pattern-based extractors
4. Ingests assertions into a local Episteme instance
5. Queries for conflicts against authoritative concept spaces (`rfc://`, `owasp://`)
### Path Mapping
The CLI maps project structure to ConceptPaths deterministically:
```
Project: citadeldb
Language: rust
Root: crates/citadeldb/
File: crates/citadeldb/src/auth/jwt.rs
  → code://rust/citadeldb/auth/jwt/{extracted_concept}
File: crates/citadeldb/src/net/tls.rs
  → code://rust/citadeldb/net/tls/{extracted_concept}
File: config/production.yaml
  → code://config/citadeldb/production/{extracted_concept}
```
### Normalization
Directory names are normalized to concept segments:
| Directory | Concept Segment | Rule |
|-----------|----------------|------|
| `src/auth/` | `auth` | Strip `src/` (language boilerplate) |
| `src/net/` | `net` | Strip `src/`, then direct mapping |
| `crates/citadeldb/` | `citadeldb` | Strip `crates/` wrapper |
| `config/` | `config` | Direct mapping |
| `src/middleware/auth_jwt.rs` | `middleware/auth_jwt` | Underscore preserved |
Normalization rules are language-specific and configurable. The CLI ships with defaults for Rust, Go, Python, and TypeScript.
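A minimal sketch of the default mapping for source files inside a crate, assuming two hard-coded rules (drop `crates` and `src` components, drop the file extension). `map_file_to_prefix` is a hypothetical helper; it does not cover the `config/` case shown under Path Mapping, and real rules would be configurable per language.

```rust
use std::path::Path;

/// Map a project-relative source file path to a `code://` ConceptPath prefix.
/// The extracted concept would be appended as the leaf by the extractor.
pub fn map_file_to_prefix(language: &str, file: &str) -> String {
    // Drop the file extension: "jwt.rs" becomes "jwt".
    let stripped = Path::new(file).with_extension("");
    let segments: Vec<String> = stripped
        .components()
        .filter_map(|c| c.as_os_str().to_str())
        // Strip wrapper and boilerplate directories.
        .filter(|s| *s != "crates" && *s != "src")
        // Concept segments are lowercase; underscores are preserved.
        .map(|s| s.to_ascii_lowercase())
        .collect();
    format!("code://{}/{}", language, segments.join("/"))
}
```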
### Extractor Output
Each extractor produces a claim that maps to an Assertion:
```rust
struct ExtractedClaim {
    /// ConceptPath derived from file location + claim type
    path: ConceptPath, // code://rust/citadeldb/auth/jwt/audience_validation
    /// What the code asserts
    predicate: String, // "config_value"
    /// The extracted value
    value: ObjectValue, // Boolean(false), i.e., aud validation is disabled
    /// File and line where found
    source_location: String, // "crates/citadeldb/src/auth/jwt.rs:47"
    /// Confidence of extraction (1.0 for regex, lower for heuristic)
    extraction_confidence: f32,
}
```
### Automatic Alias Suggestion
When the CLI ingests `code://rust/citadeldb/auth/jwt/audience_validation` and the database already contains `rfc://7519/jwt/audience_validation`, the shared leaf `audience_validation` under the shared parent segment `jwt` triggers an alias suggestion:
```
SUGGEST: code://rust/citadeldb/auth/jwt/audience_validation
       ↔ rfc://7519/jwt/audience_validation
Reason: shared leaf "audience_validation" under shared segment "jwt"
Action: [accept] [reject] [defer]
```
If accepted, the alias is created. Future queries for either path return claims from both. The conflict between "RFC says validate" and "code says skip" is now structurally visible through the Skeptic lens without anyone manually connecting them.
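The suggestion trigger can be sketched as a pure function over two wire-format paths. This takes the stricter reading from this section (shared leaf and shared parent segment, different schemes); `should_suggest_alias` and `parts` are hypothetical names.

```rust
/// Split a wire-format path into (scheme, segments); illustrative helper.
fn parts(path: &str) -> Option<(&str, Vec<&str>)> {
    let (scheme, rest) = path.split_once("://")?;
    Some((scheme, rest.split('/').collect()))
}

/// Suggest an alias when two paths from different schemes share both the
/// leaf and the segment directly above it.
pub fn should_suggest_alias(a: &str, b: &str) -> bool {
    match (parts(a), parts(b)) {
        (Some((scheme_a, segs_a)), Some((scheme_b, segs_b))) => {
            scheme_a != scheme_b
                && segs_a.len() >= 2
                && segs_b.len() >= 2
                // Compare the last two segments: [parent, leaf].
                && segs_a[segs_a.len() - 2..] == segs_b[segs_b.len() - 2..]
        }
        _ => false,
    }
}
```

A suggestion only flags a candidate pair. Creating the alias still requires an explicit accept, which matches the `Suggested` origin being described as confirmed by a human.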
---
## Migration Path
### Phase A: Add ConceptPath type
- Add `ConceptPath`, `ConceptAlias`, `SourceScheme` to `stemedb-core/src/types/`
- ConceptPath parsing with backward-compatible legacy fallback
- Unit tests for parsing, prefix matching, parent traversal
### Phase B: Storage layer support
- Add `CA:` and `CAR:` key prefixes to storage
- Implement alias resolution in query path
- Implement hierarchical scan using existing `scan_prefix`
- Add concept management endpoints to API
### Phase C: Assertion migration
- Change `Assertion.subject` from `EntityId` to `ConceptPath`
- Update serialization (rkyv derive)
- Update ingestion pipeline for ConceptPath subjects
- Update index key construction
- Backward compat: legacy assertions read as `custom://{subject}`
### Phase D: Source class inference
- Wire scheme-based tier inference into ingestion pipeline
- Update escalation policies with `concept_prefix`
- Update materialized views for ConceptPath subjects
### Phase E: CLI analyzer (first application)
- Project walker + path mapper
- Security config extractors (TLS, JWT, secrets, timeouts, deps)
- Ingestion into local Episteme
- Conflict report generation
- Alias suggestion engine
---
## Design Decisions (Resolved)
1. **Wildcard depth: Recursive by default.** `code://rust/citadeldb/*` matches all descendants, not just direct children. This matches what users expect from file systems (`ls -R`). `scan_prefix` is inherently recursive. If single-level queries become necessary later, add a `depth: Option<usize>` parameter to QueryParams. Don't build it now.
2. **No cross-scheme queries.** `*://*/auth/jwt/audience_validation` is not supported. Too expensive — requires scanning every `S:` key in the database. If a user wants to see all sources about JWT audience validation across schemes, they alias the paths. Aliases are the cross-scheme mechanism. This forces users to be intentional about which concepts they connect, which is the right default.
3. **Alias conflict: Higher-tier wins.** "Higher" means more authoritative, i.e. the lower tier number (Tier 0 outranks Tier 5). If `rfc://7519/jwt/audience_validation` (Tier 0) and `blog://some-post/jwt/aud_check` (Tier 5) both claim the same alias target, the RFC becomes canonical. Authority resolves ambiguity. Implementation: when creating an alias, check if the target already has a canonical. If so, compare scheme tiers. The more authoritative tier wins canonical status; the loser becomes an alias of the winner.
4. **No concept space metadata in v1.** ConceptSpaces exist implicitly from assertion paths. No explicit `ConceptSpace` object, no metadata, no registration. The hierarchy is emergent. If we need metadata later (descriptions, owners, tags), it can be added as assertions *about* concept paths — the database eating its own tail.
5. **No path supersession. Alias is sufficient.** Renaming `auth` to `authentication` is expressed as an alias from the old path to the new path. Existing assertions keep their original ConceptPath (append-only). No mutation, no supersession chain, no complexity. Old queries still work because the alias resolves.
---
## Implementation Plan
### Touch Points
Based on codebase analysis, ConceptPath changes touch these files:
| Component | File | Lines | What Changes |
|-----------|------|-------|-------------|
| **Type definition** | `stemedb-core/src/types/mod.rs` | 53 | `EntityId = String` stays as-is for backward compat. New `ConceptPath` type added alongside. |
| **Assertion struct** | `stemedb-core/src/types/assertion.rs` | 14 | `subject: EntityId` stays. ConceptPath stored as the EntityId string (wire format). |
| **Index key construction** | `stemedb-storage/src/index_store.rs` | 112-125 | `subject_key()` and `subject_predicate_key()` — no change needed. They already use `subject.as_bytes()`, and ConceptPath wire format is a valid UTF-8 string. |
| **Index trait** | `stemedb-storage/src/index_store.rs` | 161 | `add_to_indexes(&self, subject: &str, ...)` — no signature change. ConceptPath's `to_string()` output is passed. |
| **Ingestion validation** | `stemedb-ingest/src/worker/processing.rs` | 210-216 | Subject length check stays (1024 byte limit). Add ConceptPath parse validation. |
| **Ingestion indexing** | `stemedb-ingest/src/worker/processing.rs` | 137 | `add_to_indexes(&assertion.subject, ...)` — no change. Subject is still a String. |
| **Signature message** | `stemedb-ingest/src/worker/processing.rs` | 260-261 | `format!("{}:{}", subject, predicate)`: **no change**. The subject string now contains `://` and `/` but the signature format is still `{subject}:{predicate}`. Backward compatible because legacy subjects don't contain `://`. |
| **Query matching** | `stemedb-query/src/query/mod.rs` | 233-240 | `assertion.subject != subject` — stays as string comparison for exact match. Add hierarchical mode that uses `starts_with` for prefix queries. |
| **Query execution** | `stemedb-query/src/engine/mod.rs` | 122-149 | Fast path and slow path selection. Add hierarchical branch that uses `scan_prefix` instead of `get`. |
| **Candidate fetch** | `stemedb-query/src/engine/candidates.rs` | 23-64 | `fetch_by_subject()` and `fetch_by_subject_predicate()` — add `fetch_by_subject_prefix()` alongside. |
| **MV key construction** | `stemedb-query/src/engine/execution.rs` | 38 | `format!("MV:{}:{}", subject, predicate)` — no change. ConceptPath wire format works as-is. |
| **MV materializer** | `stemedb-query/src/materializer/mod.rs` | 147-155 | SP key parsing. Must handle `://` and `/` in subject portion of `SP:{subject}:{predicate}`. Current parser splits on `:` which will break on `code://`. **Needs fix.** |
| **API query params** | `stemedb-api/src/dto/query_params.rs` | 14-26 | Add `hierarchical: Option<bool>` and `resolve_aliases: Option<bool>`. |
| **API create request** | `stemedb-api/src/dto/create.rs` | 14-65 | Subject stays as `String` in DTO. ConceptPath validation happens in handler. |
| **API handlers** | `stemedb-api/src/handlers/assert.rs` | 99 | Add ConceptPath parse validation before creating Assertion. |
| **API query handler** | `stemedb-api/src/handlers/query.rs` | 89-96 | Add alias resolution and hierarchical query routing. |
| **Lenses** | `stemedb-lens/src/*.rs` | — | **No change.** Lenses operate on pre-filtered candidates. They never access `assertion.subject` for resolution. |
| **Simulator** | `stemedb-sim/src/agent.rs` | 47-51 | Test subjects must use ConceptPath format. Update arena test data. |
| **Escalation** | `stemedb-core/src/types/escalation.rs` | 74-85 | Add `concept_prefix: Option<String>` to `EscalationPolicy`. |
### Critical Insight: Minimal Type Change
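The key realization: **`EntityId` stays as `String`**. ConceptPath is a parsing/validation layer, not a storage type. The wire format of a ConceptPath (`code://rust/citadeldb/auth/jwt/aud_validation`) is a valid UTF-8 string. It's stored as a String in the Assertion, indexed as bytes in the KV store, and compared as strings in queries. Note that this differs from the `subject: ConceptPath` field change shown under Assertion Changes and in Phase C of the Migration Path: under this plan the struct field is left as `EntityId`, and ConceptPath is applied when subjects are parsed and validated.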
This means:
- No rkyv serialization changes
- No storage migration
- No signature format breakage
- Existing assertions with flat string subjects remain valid (`"Tesla_Inc"` is a valid string that parses as `custom://Tesla_Inc`)
ConceptPath is a **type that parses from and serializes to String**. It provides structured access (scheme, segments, leaf, parent, prefix matching) but the underlying storage representation is unchanged.
### The One Breaking Change: SP Key Parsing
The materializer parses `SP:` keys to extract subject and predicate:
```
SP:{subject}:{predicate}
```
Current parsing splits on `:`. This breaks when the subject contains `://`:
```
SP:code://rust/citadeldb/auth/jwt/aud_validation:config_value
^--- splits here, wrong
```
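**Fix:** Either the SP key separator changes from `:` to `\x00` (null byte), or the parser splits on the *last* `:` instead of the first. Splitting on the last `:` is safe as long as predicates never contain `:` (subjects may now contain `://`, so only the predicate side can be assumed colon-free), and it stays backward compatible with existing keys whose predicates are colon-free: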
```rust
// Current (broken for ConceptPath subjects):
let parts: Vec<&str> = key_str.splitn(3, ':').collect();
// Fixed (split from the right):
let key_str = &key_str[3..]; // strip "SP:" prefix
if let Some(last_colon) = key_str.rfind(':') {
    let subject = &key_str[..last_colon];
    let predicate = &key_str[last_colon + 1..];
}
```
This is the only storage-level code change required. Everything else is additive.
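The right-split can be written as a small standalone function using `str::strip_prefix` and `str::rsplit_once` from the standard library. `parse_sp_key` is a hypothetical name for illustration, and the sketch assumes predicates never contain `:`.

```rust
/// Split an `SP:{subject}:{predicate}` key at the last ':'.
/// Returns `None` if the `SP:` prefix is missing or no separator is found.
/// Assumes predicates never contain ':'; subjects may contain "://".
pub fn parse_sp_key(key: &str) -> Option<(&str, &str)> {
    let rest = key.strip_prefix("SP:")?;
    rest.rsplit_once(':')
}
```

Compared with slicing at a fixed offset, `strip_prefix` avoids a panic on keys shorter than three bytes and rejects keys from other prefixes.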
### New Code
| What | Where | Purpose |
|------|-------|---------|
| `ConceptPath` struct | `stemedb-core/src/types/concept.rs` | Parse, validate, traverse, prefix-match |
| `ConceptAlias` struct | `stemedb-core/src/types/concept.rs` | Alias record with origin tracking |
| `SourceScheme` enum | `stemedb-core/src/types/concept.rs` | Known schemes → default tier mapping |
| `AliasStore` trait | `stemedb-storage/src/alias_store.rs` | `CA:` and `CAR:` key management |
| `GenericAliasStore` | `stemedb-storage/src/alias_store.rs` | Implementation over KVStore |
| Alias resolution | `stemedb-query/src/engine/aliases.rs` | Resolve aliases before candidate fetch |
| Hierarchical fetch | `stemedb-query/src/engine/candidates.rs` | `fetch_by_subject_prefix()` using `scan_prefix` |
| Concept API handlers | `stemedb-api/src/handlers/concepts.rs` | CRUD for aliases, tree browsing, suggestions |
| Concept DTOs | `stemedb-api/src/dto/concepts.rs` | Request/response types for concept endpoints |
### Where This Fits in the Roadmap
This is **Phase 5A.2** from the existing roadmap — "Key Layout Redesign." The roadmap already calls for preparing keys for subject-prefix range sharding. ConceptPath is the answer to that item. The subject-prefix key layout proposed in the roadmap:
```
{subject}\x00H:{hash} → Assertion data
{subject}\x00MV:{predicate} → Materialized view
```
...is exactly what ConceptPath enables, because the subject is now a hierarchical path that naturally groups related data.
**Sequence within Phase 5:**
```
5A.1 Replace sled with redb/fjall ← storage engine swap (independent)
5A.2 ConceptPath + Key Layout Redesign ← THIS SPEC
5B.* WAL Hardening ← independent, can parallel
5C.* Index Persistence ← depends on 5A.1
```
5A.2 (ConceptPath) can start **before** 5A.1 (sled replacement) because the changes are at the type and parsing layer, not the storage engine layer. The SP key parser fix and the alias store are the only storage-touching changes, and they work against any KVStore backend.
**However**, if we're sequencing for the agent-first roadmap (`tmp/agent-first-roadmap.md`), the order is:
```
1. ConceptPath types + parsing (stemedb-core) ← do first
2. Alias store (stemedb-storage) ← do second
3. Hierarchical query + alias resolution (stemedb-query) ← do third
4. Concept API endpoints (stemedb-api) ← do fourth
5. SP key parser fix (stemedb-query) ← do with #3
6. Escalation policy concept_prefix (stemedb-core) ← do with #4
7. CLI code analyzer (new crate) ← Phase 1 of agent-first roadmap
```
Steps 1-6 are database changes. Step 7 is the first application. Steps 1-4 are sufficient to validate the core loop from the agent-first roadmap — you can ingest assertions with ConceptPath subjects, query them hierarchically, and see conflicts across schemes via aliases.
### What We Do NOT Build
- No `ConceptSpace` metadata objects
- No cross-scheme wildcard queries (`*://`)
- No path supersession mechanism
- No automatic concept extraction (that's the CLI, not the database)
- No UI for concept browsing (API only)
- No concept-level permissions or ACLs
- No concept versioning or history (concepts are implicit, assertions have history)