# Schema Specification
**Status:** Draft

**Author:** tidalDB Engineering

**Last Updated:** 2026-02-20

**Prerequisites:** [02-entity-model.md](02-entity-model.md), [03-signal-system.md](03-signal-system.md), [04-relationships.md](04-relationships.md), [API.md](../../API.md)

**Research:** [thoughts.md](../../thoughts.md) (Stage 3 insight: schema encodes behavior, not just shape)

---
## Table of Contents
1. [Design Principles](#1-design-principles)
2. [Type System](#2-type-system)
3. [Schema Definition API](#3-schema-definition-api)
4. [Schema Versioning](#4-schema-versioning)
5. [Schema Validation Rules](#5-schema-validation-rules)
6. [Schema Migration](#6-schema-migration)
7. [Schema Introspection](#7-schema-introspection)
8. [Defaults and Population Priors](#8-defaults-and-population-priors)
9. [A/B Testing Support](#9-ab-testing-support)
10. [Schema Storage](#10-schema-storage)
11. [Example: Video Platform Schema](#11-example-video-platform-schema)
12. [Invariants and Correctness Guarantees](#12-invariants-and-correctness-guarantees)
---
## 1. Design Principles
The schema system is the contract between the application and the database. It defines not just what data exists, but how that data behaves -- decay rates, velocity computation, scoring weights, diversity rules, cohort boundaries. This is the Stage 3 insight from thoughts.md: **schema encodes behavior, not just shape**.
### Schema Is the Source of Truth for Behavior
In traditional databases, schema defines columns and types. Application code defines behavior. In tidalDB, the boundary shifts. A signal's half-life is not a magic constant in application code -- it is a declaration in schema that the database enforces. A ranking profile's scoring weights are not buried in a microservice -- they are versioned schema objects the database executes.
This design choice has three consequences:
1. **The query optimizer reasons about behavior.** When the database sees `USING PROFILE trending`, it knows to use velocity signals, skip total-count indexes, and enforce per-creator diversity. A general-purpose database running the same logic as an opaque UDF has no such knowledge and cannot make these optimizations.
2. **Behavior changes do not require redeployment.** Changing a ranking profile's exploration budget from 10% to 15% is a schema operation (defining and activating a new profile version), not a code change. Once the new version is active, it applies to the next query.
3. **Behavior is auditable.** Every ranking profile version is stored with a timestamp. "What scoring function was active during the incident last Tuesday?" is answerable by schema introspection.
### Additive Changes Are Always Safe
The schema system distinguishes additive changes (always safe, no migration required) from breaking changes (require explicit migration with dry-run validation). This distinction is enforced at the API level -- an additive change is applied immediately; a breaking change returns a `MigrationRequired` error with a description of what would break.
### Immutability Where It Matters
Signal definitions are immutable once created. Changing a signal's decay half-life would retroactively invalidate all historical running scores -- the O(1) running decay formula assumes a constant lambda. Rather than silently producing incorrect scores, the schema system rejects the mutation and requires the application to define a new signal type.
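To make the constant-lambda assumption concrete, here is a minimal sketch of an O(1) running decay update. `RunningScore`, `observe`, and `lambda_for_half_life` are hypothetical names chosen for illustration, not tidalDB's implementation. The key point is that the stored `score` summarizes all past events under a single lambda, so a different lambda cannot be applied retroactively without replaying the event history.

```rust
/// Hypothetical running score for one (entity, signal) pair.
/// Illustrative only; not the tidalDB storage layout.
#[derive(Debug, Clone, Copy)]
pub struct RunningScore {
    /// Decayed sum of all event weights, valid as of `last_update_secs`.
    pub score: f64,
    /// Time of the last update, in seconds since an arbitrary epoch.
    pub last_update_secs: f64,
}

impl RunningScore {
    /// Fold a new event of weight `w` at time `now_secs` into the score.
    /// The whole history is compressed into `score`, which is only
    /// meaningful if every past event decayed with the same `lambda`.
    pub fn observe(&mut self, lambda: f64, now_secs: f64, w: f64) {
        let dt = (now_secs - self.last_update_secs).max(0.0);
        self.score = self.score * (-lambda * dt).exp() + w;
        self.last_update_secs = now_secs;
    }
}

/// lambda = ln(2) / half_life, with the half-life given in seconds.
pub fn lambda_for_half_life(half_life_secs: f64) -> f64 {
    std::f64::consts::LN_2 / half_life_secs
}
```

With a one-hour half-life, two unit-weight events one hour apart leave a score of 1.5: the first event has decayed to 0.5 by the time the second arrives.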
Ranking profiles are versioned rather than mutated. Version 1 of `for_you` and version 2 coexist. The application controls which version is active. Old versions can be queried explicitly for comparison and debugging.
### Deep Module, Small Interface
The schema system exposes six definition methods (`define_entity`, `define_signal`, `define_profile`, `define_cohort`, `define_relationship`, `migrate`) and six introspection methods. Everything else -- validation, versioning, storage, cache invalidation, WAL logging -- is internal. The caller never interacts with the schema storage format, the version counter, or the validation engine directly.

---
## 2. Type System
This section lists all types that compose the schema. These are the Rust types that the application constructs and passes to `define_*` methods.
### Entity Types
```rust
/// Definition of an entity type (Item, User, or Creator).
/// Passed to `db.define_entity()`.
pub struct EntityDef {
    /// Which entity kind this definition applies to.
    pub kind: EntityKind,
    /// Metadata fields carried by entities of this kind.
    pub metadata_fields: Vec<Field>,
    /// Embedding slots for vector search.
    pub embedding: EmbeddingDef,
}

/// The three entity kinds. Fixed -- not extensible by the application.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum EntityKind {
    Item,
    User,
    Creator,
}

/// A metadata field declaration.
pub struct Field {
    /// Field name. Lowercase alphanumeric plus underscores. Max 64 chars.
    pub name: String,
    /// Field data type, which determines indexing behavior.
    pub field_type: FieldType,
    /// Writability: who can set this field.
    pub writability: Writability,
}

/// Convenience constructors for Field.
impl Field {
    pub fn text(name: &str) -> Self;
    pub fn keyword(name: &str) -> Self;
    pub fn keywords(name: &str) -> Self;
    pub fn i64(name: &str) -> Self;
    pub fn f64(name: &str) -> Self;
    pub fn bool(name: &str) -> Self;
    pub fn timestamp(name: &str) -> Self;
    pub fn duration(name: &str) -> Self;
    /// A database-computed field with the given underlying storage type.
    /// Writability is automatically set to `DbComputed`.
    pub fn computed(name: &str, underlying: FieldType) -> Self;
}

/// Field data types. Determines storage format, index type, and query semantics.
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum FieldType {
    /// UTF-8 string, BM25-indexed, full-text searchable.
    Text,
    /// UTF-8 string, exact-match indexed, filterable, facetable.
    Keyword,
    /// `Vec<String>`, each value exact-match indexed.
    Keywords,
    /// 64-bit signed integer, range-filterable, sortable.
    I64,
    /// 64-bit float, range-filterable, sortable.
    F64,
    /// Boolean, equality-filterable.
    Bool,
    /// UTC nanosecond timestamp, range-filterable, sortable.
    Timestamp,
    /// Duration in seconds (f64), range-filterable, sortable.
    Duration,
}

/// Who can write this field.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Writability {
    /// Application writes via write_*() / update_*().
    AppSet,
    /// Database computes from signal patterns and relationships.
    DbComputed,
    /// Database manages as part of signal processing (embeddings).
    DbManaged,
}
```
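The constructor signatures above are listed without bodies. The following is a minimal sketch of how two of them might be implemented, using a trimmed copy of the types so that it compiles on its own. It assumes that the plain constructors default to `Writability::AppSet`, which the spec implies but does not state. `app_writable` is a hypothetical helper, not part of the documented API.

```rust
/// Trimmed copy of the spec's `FieldType`, redeclared for a standalone sketch.
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum FieldType {
    Text,
    Keyword,
    I64,
    F64,
}

/// Copy of the spec's `Writability`.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Writability {
    AppSet,
    DbComputed,
    DbManaged,
}

#[derive(Clone, Debug)]
pub struct Field {
    pub name: String,
    pub field_type: FieldType,
    pub writability: Writability,
}

impl Field {
    /// Assumption: ordinary constructors produce application-writable fields.
    pub fn keyword(name: &str) -> Self {
        Field {
            name: name.to_string(),
            field_type: FieldType::Keyword,
            writability: Writability::AppSet,
        }
    }

    /// Computed fields always carry `DbComputed`, as the spec states.
    pub fn computed(name: &str, underlying: FieldType) -> Self {
        Field {
            name: name.to_string(),
            field_type: underlying,
            writability: Writability::DbComputed,
        }
    }

    /// Hypothetical helper: may the application set this field via the write API?
    pub fn app_writable(&self) -> bool {
        self.writability == Writability::AppSet
    }
}
```

A write path could use a check like `app_writable` to decide when to return `SchemaError::ComputedFieldWrite`.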
### Embedding Types
```rust
/// Embedding configuration for an entity type.
pub struct EmbeddingDef {
    /// One or more embedding slots. Max 4 per entity type.
    pub slots: Vec<EmbeddingSlot>,
}

/// A single embedding vector slot.
pub struct EmbeddingSlot {
    /// Slot name. Unique within the entity type.
    pub name: String,
    /// Vector dimensions. Range: [2, 4096].
    pub dimensions: u32,
    /// Who provides this embedding.
    pub source: EmbeddingSource,
    /// Storage precision. Default: F16.
    pub precision: EmbeddingPrecision,
}

/// Who computes and writes the embedding.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum EmbeddingSource {
    /// Application computes externally, writes via API.
    External,
    /// Database computes and maintains (e.g., user preference vector).
    DatabaseManaged,
}

/// Storage precision for embedding vectors.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum EmbeddingPrecision {
    /// 16-bit float. Default. ~1% recall loss vs f32, 50% memory savings.
    F16,
    /// 32-bit float. Use when embedding model requires higher precision.
    F32,
    /// 8-bit integer quantization. For memory-constrained deployments.
    I8,
}

impl Default for EmbeddingPrecision {
    fn default() -> Self { Self::F16 }
}
```
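The precision choice is mostly a memory trade-off. As a rough sketch, the raw vector footprint is dimensions times bytes per component times entity count. The helper names below are hypothetical, and the calculation ignores index overhead and any per-vector quantization metadata.

```rust
/// Copy of the spec's `EmbeddingPrecision`, redeclared for a standalone sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum EmbeddingPrecision {
    F16,
    F32,
    I8,
}

impl EmbeddingPrecision {
    /// Bytes used by one vector component at this precision.
    pub fn bytes_per_component(self) -> u64 {
        match self {
            Self::F16 => 2,
            Self::F32 => 4,
            Self::I8 => 1,
        }
    }
}

/// Raw bytes needed to store one embedding slot for `entity_count` entities.
/// Hypothetical helper: excludes ANN index structures and metadata.
pub fn raw_vector_bytes(
    dimensions: u32,
    precision: EmbeddingPrecision,
    entity_count: u64,
) -> u64 {
    u64::from(dimensions) * precision.bytes_per_component() * entity_count
}
```

For example, a 768-dimension slot (an arbitrary illustrative size) over 10 million entities needs about 30.7 GB of raw vector data at `F32` and half of that at `F16`.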
### Signal Types
```rust
/// Definition of a signal type. Passed to `db.define_signal()`.
/// Immutable once created -- changing decay would invalidate historical data.
pub struct SignalDef {
    /// Signal name. Unique globally. Lowercase alphanumeric plus underscores.
    pub name: String,
    /// Which entity type this signal targets.
    pub target: EntityKind,
    /// How the signal weight decays over time.
    pub decay: Decay,
    /// Time windows for which aggregates are maintained.
    pub windows: Vec<Window>,
    /// Whether to compute rate-of-change (velocity) per window.
    pub velocity: bool,
    /// Durability level for this signal type's WAL writes.
    /// Default: Batched { max_batch: 256, max_delay: 10ms }.
    pub durability: Option<DurabilityLevel>,
}

/// How signal weight diminishes over time.
#[derive(Clone, Debug, PartialEq)]
pub enum Decay {
    /// Signal weight halves every `half_life` duration.
    /// Formula: w(t) = w_0 * exp(-lambda * t), lambda = ln(2) / half_life
    /// The database precomputes and stores lambda at definition time.
    Exponential { half_life: Duration },
    /// Signal weight drops linearly to zero over `lifetime`.
    /// Formula: w(t) = w_0 * max(0, 1 - t / lifetime)
    /// Cannot use the O(1) running score trick (not multiplicatively
    /// composable). Uses windowed aggregation with linear interpolation
    /// at the boundary.
    Linear { lifetime: Duration },
    /// Signal weight never decays. For permanent state: hides, blocks.
    Permanent,
}

impl Decay {
    /// Precompute the decay rate constant lambda.
    /// Only meaningful for Exponential decay; returns None otherwise.
    pub fn lambda(&self) -> Option<f64> {
        match self {
            Decay::Exponential { half_life } => {
                Some(2.0_f64.ln() / half_life.as_secs_f64())
            }
            _ => None,
        }
    }
}

/// Time window for signal aggregation.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Window {
    /// Fixed-duration sliding window.
    Sliding { duration: Duration },
    /// Unbounded accumulator -- all events since entity creation.
    AllTime,
}

impl Window {
    pub fn hours(n: u64) -> Self {
        Window::Sliding { duration: Duration::from_secs(n * 3600) }
    }
    pub fn days(n: u64) -> Self {
        Window::Sliding { duration: Duration::from_secs(n * 86400) }
    }
    pub fn all_time() -> Self { Window::AllTime }
}
```
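The two decay formulas in the doc comments can be evaluated directly. The sketch below redeclares `Decay` so it compiles standalone and adds a hypothetical `weight_at` function (not part of the spec) that returns the remaining weight of a single event after a given age.

```rust
use std::time::Duration;

/// Copy of the spec's `Decay`, redeclared for a standalone sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum Decay {
    Exponential { half_life: Duration },
    Linear { lifetime: Duration },
    Permanent,
}

/// Hypothetical helper: weight remaining after `age` for an event of
/// initial weight `w0`, following the formulas in the doc comments.
pub fn weight_at(decay: &Decay, w0: f64, age: Duration) -> f64 {
    let t = age.as_secs_f64();
    match decay {
        // w(t) = w_0 * exp(-lambda * t), lambda = ln(2) / half_life
        Decay::Exponential { half_life } => {
            let lambda = std::f64::consts::LN_2 / half_life.as_secs_f64();
            w0 * (-lambda * t).exp()
        }
        // w(t) = w_0 * max(0, 1 - t / lifetime)
        Decay::Linear { lifetime } => {
            w0 * (1.0 - t / lifetime.as_secs_f64()).max(0.0)
        }
        // Permanent signals keep their full weight.
        Decay::Permanent => w0,
    }
}
```

After two half-lives an exponentially decayed event retains a quarter of its weight; a linearly decayed event retains half of its weight at half its lifetime and nothing once the lifetime has passed.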
### Ranking Profile Types
```rust
/// Definition of a ranking profile. Passed to `db.define_profile()`.
/// Versioned -- multiple versions coexist under the same name.
pub struct ProfileDef {
    /// Profile name. Lowercase alphanumeric plus underscores and hyphens.
    pub name: String,
    /// Version number. Must be strictly greater than the latest existing
    /// version for this name (or 1 if no prior versions exist).
    pub version: u32,
    /// How to generate the initial candidate set.
    pub candidate: Candidate,
    /// Signal and relationship boosts applied during scoring.
    pub boosts: Vec<Boost>,
    /// Recency decay applied to candidate age.
    pub decay: Option<ProfileDecay>,
    /// Quality gates -- candidates below threshold are excluded.
    pub gates: Vec<Gate>,
    /// Negative signal penalties subtracted from score.
    pub penalties: Vec<Penalty>,
    /// Hard exclusion predicates evaluated before scoring.
    pub excludes: Vec<Exclude>,
    /// Post-scoring diversity constraints.
    pub diversity: Option<DiversitySpec>,
    /// Fraction of results reserved for exploration (new/unseen creators).
    /// Range: [0.0, 1.0]. Default: 0.0 (no exploration).
    pub exploration: f64,
    /// Optional sort override. If None, results are ordered by computed
    /// score. If Some, the specified sort mode takes precedence.
    pub sort: Option<Sort>,
}

/// How to generate the initial candidate set for scoring.
#[derive(Clone, Debug)]
pub enum Candidate {
    /// Approximate nearest neighbor retrieval over entity embeddings.
    Ann {
        /// Which vector to use as the query.
        query_vector: VectorSource,
        /// Which entity type to search.
        index: EntityKind,
        /// Which embedding slot to search against.
        embedding_slot: Option<String>,
        /// Number of ANN candidates to retrieve before scoring.
        top_k: u32,
    },
    /// Full scan of all entities of a given kind. Used for trending,
    /// browse, and other non-personalized surfaces.
    Scan {
        entity: EntityKind,
    },
    /// Retrieve content from entities connected by a relationship edge.
    /// E.g., items from followed creators.
    Relationship {
        edge: String,
    },
    /// Social graph traversal -- items engaged by users in the
    /// querying user's extended social graph.
    SocialGraph {
        depth: u8,
        edge: String,
        min_weight: f64,
    },
    /// Hybrid text + vector retrieval (for search).
    Hybrid {
        text_weight: f64,
        vector_weight: f64,
        fusion: Fusion,
    },
}

/// Where the query vector comes from.
#[derive(Clone, Debug)]
pub enum VectorSource {
    /// Use the querying user's preference embedding.
    UserPreference,
    /// Use a specific item's embedding (for related/up-next queries).
    ItemEmbedding { item_id: String },
    /// Use a vector provided by the caller (for search).
    Provided,
}

/// Fusion strategy for hybrid text + vector search.
#[derive(Clone, Debug)]
pub enum Fusion {
    /// Reciprocal Rank Fusion. RRF(d) = 1/(k + rank_bm25) + 1/(k + rank_ann).
    /// k=60 is the standard default. Rank-based, no score normalization needed.
    Rrf { k: u32 },
    /// Linear combination: alpha * text_score + (1-alpha) * vector_score.
    /// Requires score normalization. Use only after relevance tuning.
    Linear { alpha: f64 },
}

/// A positive scoring boost.
#[derive(Clone, Debug)]
pub enum Boost {
    /// Boost based on a signal's value within a window.
    Signal {
        signal: String,
        window: Window,
        mode: SignalMode,
        weight: f64,
    },
    /// Boost based on a relationship edge weight.
    Relationship {
        edge: String,
        weight: f64,
    },
    /// Boost based on social proof (engagement by user's social graph).
    SocialProof {
        weight: f64,
    },
}

/// What aspect of a signal to use in scoring.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SignalMode {
    /// Raw count within the window.
    Count,
    /// Running decay score (exponentially weighted).
    Value,
    /// Rate of change within the window.
    Velocity,
    /// Ratio of unique users to total count.
    UniqueRatio,
    /// Ratio of this signal to another (e.g., likes / views).
    Ratio,
}

impl Boost {
    pub fn signal(signal: &str, window: Window, mode: SignalMode, weight: f64) -> Self {
        Boost::Signal {
            signal: signal.to_string(),
            window,
            mode,
            weight,
        }
    }
    pub fn relationship(edge: &str, weight: f64) -> Self {
        Boost::Relationship { edge: edge.to_string(), weight }
    }
    pub fn social_proof(weight: f64) -> Self {
        Boost::SocialProof { weight }
    }
}

/// Recency decay applied to candidate age in the profile.
#[derive(Clone, Debug)]
pub struct ProfileDecay {
    /// The timestamp field to use as the age reference.
    pub field: String,
    /// Half-life for age decay.
    pub half_life: Duration,
}

/// Quality gate -- candidates below the threshold are excluded.
#[derive(Clone, Debug)]
pub enum Gate {
    /// Minimum signal value to pass. Candidates below are excluded.
    Min {
        signal: String,
        window: Window,
        threshold: f64,
    },
    /// Minimum ratio of one signal to another.
    MinRatio {
        name: String,
        threshold: f64,
    },
}

impl Gate {
    pub fn min(signal: &str, window: Window, threshold: f64) -> Self {
        Gate::Min {
            signal: signal.to_string(),
            window,
            threshold,
        }
    }
    pub fn min_ratio(name: &str, threshold: f64) -> Self {
        Gate::MinRatio {
            name: name.to_string(),
            threshold,
        }
    }
}

/// Negative signal penalty subtracted from score.
#[derive(Clone, Debug)]
pub struct Penalty {
    /// Signal name.
    pub signal: String,
    /// Window to evaluate.
    pub window: Window,
    /// Penalty weight (should be negative).
    pub weight: f64,
}

impl Penalty {
    pub fn signal(signal: &str, window: Window, weight: f64) -> Self {
        Penalty {
            signal: signal.to_string(),
            window,
            weight,
        }
    }
}

/// Hard exclusion predicate evaluated before scoring begins.
#[derive(Clone, Debug)]
pub enum Exclude {
    /// Exclude items where this signal exists for the querying user.
    /// E.g., Exclude::signal("hide") excludes all hidden items.
    Signal { signal: String },
    /// Exclude based on relationship. E.g., Exclude::relationship("blocked").
    Relationship { edge: String },
}

impl Exclude {
    pub fn signal(signal: &str) -> Self {
        Exclude::Signal { signal: signal.to_string() }
    }
    pub fn relationship(edge: &str) -> Self {
        Exclude::Relationship { edge: edge.to_string() }
    }
}

/// Post-scoring diversity enforcement.
#[derive(Clone, Debug, Default)]
pub struct DiversitySpec {
    /// Maximum items from the same creator in the result set.
    pub max_per_creator: Option<u32>,
    /// Enforce a mix of content formats (video, short, article, etc.).
    pub format_mix: bool,
    /// Topic diversity via maximal marginal relevance (MMR).
    /// 0.0 = no enforcement, 1.0 = maximize diversity.
    pub topic_diversity: Option<f64>,
}

/// Sort mode override. Can be specified per-profile or per-query.
#[derive(Clone, Debug)]
pub enum Sort {
    Relevance,
    Personalized,
    New,
    Old,
    Hot,
    Trending,
    Rising,
    Controversial,
    HiddenGems,
    TopAllTime,
    TopHour,
    TopToday,
    TopWeek,
    TopMonth,
    TopYear,
    MostViewed,
    MostLiked,
    MostCommented,
    MostShared,
    Shortest,
    Longest,
    AlphabeticalAsc,
    AlphabeticalDesc,
    Shuffle,
    LiveViewerCount,
    DateSaved,
    CreatorEngagementRate,
    /// Sort by a specific metadata field.
    Field(String, SortDirection),
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SortDirection {
    Asc,
    Desc,
}
```
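The `Fusion::Rrf` formula above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not tidalDB's: `rrf_fuse` is a hypothetical function, ranks are 1-based, and a document missing from one list simply contributes no term for that list.

```rust
use std::collections::HashMap;

/// Fuse two best-first ranked lists of document ids with
/// Reciprocal Rank Fusion: RRF(d) = sum over lists of 1 / (k + rank).
/// Hypothetical helper; ranks are 1-based.
pub fn rrf_fuse(bm25: &[&str], ann: &[&str], k: u32) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [bm25, ann] {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (f64::from(k) + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because only ranks enter the formula, BM25 scores and vector distances never need to be brought onto a common scale, which is the property the doc comment refers to as "no score normalization needed".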
### Cohort Types
```rust
/// Definition of a named cohort. Passed to `db.define_cohort()`.
/// Cohorts define reusable user segments for cohort-scoped queries.
pub struct CohortDef {
    /// Cohort name. Unique globally. Lowercase alphanumeric plus underscores.
    pub name: String,
    /// Predicate that defines cohort membership.
    pub predicate: Predicate,
    /// How often cohort membership is recomputed.
    pub refresh: RefreshPolicy,
}

/// Composable predicate for cohort membership evaluation.
/// Predicates reference fields on the User entity type.
#[derive(Clone, Debug)]
pub enum Predicate {
    /// Field equals a specific value.
    Eq(String, PredicateValue),
    /// Field does not equal a specific value.
    Neq(String, PredicateValue),
    /// Numeric field is greater than a threshold.
    Gt(String, f64),
    /// Numeric field is less than a threshold.
    Lt(String, f64),
    /// Numeric field is in a range [low, high].
    Range(String, f64, f64),
    /// Keywords field contains a specific value.
    Contains(String, String),
    /// Keywords field contains any of the given values (OR).
    ContainsAny(String, Vec<String>),
    /// All child predicates must be true.
    And(Vec<Predicate>),
    /// At least one child predicate must be true.
    Or(Vec<Predicate>),
    /// Child predicate must be false.
    Not(Box<Predicate>),
}

/// Value types used in predicate comparisons.
#[derive(Clone, Debug)]
pub enum PredicateValue {
    String(String),
    I64(i64),
    F64(f64),
    Bool(bool),
}

/// How often a cohort's membership set is recomputed.
#[derive(Clone, Debug)]
pub enum RefreshPolicy {
    /// Recompute every N minutes.
    Interval { minutes: u32 },
    /// Recompute every hour.
    Hourly,
    /// Recompute every day.
    Daily,
    /// Recompute on every relevant user metadata change.
    /// More expensive but always fresh. Suitable for small cohorts
    /// defined over app-set fields.
    OnWrite,
}
```
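To illustrate how a composable predicate might be evaluated against a user record, here is a simplified sketch. It uses a reduced `Predicate` and a hypothetical `Value` and `UserFields` representation rather than the spec's types, and it assumes that a missing or wrongly typed field makes a leaf predicate false. The field names in the usage note are made up for the example.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory field value; not the tidalDB storage format.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Keywords(Vec<String>),
}

/// Reduced version of the spec's `Predicate` for illustration.
#[derive(Clone, Debug)]
pub enum Predicate {
    Eq(String, Value),
    Gt(String, f64),
    Range(String, f64, f64),
    Contains(String, String),
    And(Vec<Predicate>),
    Or(Vec<Predicate>),
    Not(Box<Predicate>),
}

/// A user's metadata fields keyed by field name.
pub type UserFields = HashMap<String, Value>;

/// Recursively evaluate a predicate against one user's fields.
/// Assumption: a missing or mistyped field makes a leaf predicate false.
pub fn is_member(p: &Predicate, user: &UserFields) -> bool {
    match p {
        Predicate::Eq(f, v) => user.get(f) == Some(v),
        Predicate::Gt(f, t) => matches!(user.get(f), Some(Value::Num(n)) if n > t),
        Predicate::Range(f, lo, hi) => {
            matches!(user.get(f), Some(Value::Num(n)) if n >= lo && n <= hi)
        }
        Predicate::Contains(f, v) => {
            matches!(user.get(f), Some(Value::Keywords(ks)) if ks.contains(v))
        }
        Predicate::And(children) => children.iter().all(|c| is_member(c, user)),
        Predicate::Or(children) => children.iter().any(|c| is_member(c, user)),
        Predicate::Not(child) => !is_member(child, user),
    }
}
```

A cohort such as "users in a given country who either watch heavily or list gaming as an interest" would be an `And` of an `Eq` on a country field with an `Or` of a `Gt` and a `Contains`. Note that under the missing-field assumption, `Not` over an undefined field evaluates to true, which is one reason definition-time validation of field references matters.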
### Relationship Types
```rust
/// Definition of a relationship type. Passed to `db.define_relationship()`.
pub struct RelationshipDef {
    /// Relationship name. Unique globally.
    pub name: String,
    /// Source entity kind.
    pub from: EntityKind,
    /// Target entity kind.
    pub to: EntityKind,
    /// Default weight for new edges of this type.
    pub weight_default: f64,
    /// Optional decay for the relationship weight.
    /// None = permanent (follows, blocks).
    /// Some = weight decays toward zero over time.
    pub decay: Option<Decay>,
    /// Whether the relationship is symmetric (A->B implies B->A).
    pub symmetric: bool,
}
```
### Error Types
```rust
/// All errors that can occur during schema operations.
#[derive(Debug)]
pub enum SchemaError {
    // -- Entity validation errors --
    /// Entity kind already has a definition.
    EntityAlreadyDefined { kind: EntityKind },
    /// Duplicate field name within an entity type.
    DuplicateFieldName { kind: EntityKind, field: String },
    /// Field name is invalid (not lowercase alphanumeric + underscores).
    InvalidFieldName { field: String, reason: String },
    /// Embedding dimensions out of range [2, 4096].
    InvalidDimensions { slot: String, dimensions: u32 },
    /// Too many embedding slots (max 4 per entity type).
    TooManyEmbeddingSlots { kind: EntityKind, count: usize },
    /// Duplicate embedding slot name within an entity type.
    DuplicateEmbeddingSlot { kind: EntityKind, slot: String },

    // -- Signal validation errors --
    /// Signal name already exists.
    SignalAlreadyDefined { name: String },
    /// Signal name is invalid.
    InvalidSignalName { name: String, reason: String },
    /// Signal targets an entity kind that has no definition.
    UndefinedTargetEntity { signal: String, target: EntityKind },
    /// Permanent-decay signal has velocity enabled (meaningless).
    PermanentWithVelocity { signal: String },
    /// Too many windows on a signal (max 8).
    TooManyWindows { signal: String, count: usize },
    /// Too many signal types per entity type (max 64).
    TooManySignals { target: EntityKind, count: usize },
    /// AllTime window specified with velocity (undefined operation).
    AllTimeWithVelocity { signal: String },
    /// Attempted to modify an immutable signal definition.
    SignalImmutable { name: String },

    // -- Profile validation errors --
    /// Profile version already exists for this name.
    ProfileVersionExists { name: String, version: u32 },
    /// Profile version is not sequential (must be > latest).
    ProfileVersionNotSequential { name: String, expected: u32, got: u32 },
    /// Profile references a signal that is not defined.
    UndefinedSignal { profile: String, signal: String },
    /// Profile references a relationship type that is not defined.
    UndefinedRelationship { profile: String, edge: String },
    /// Profile references an entity type that is not defined.
    UndefinedEntity { profile: String, entity: EntityKind },
    /// Profile candidate strategy references an embedding slot that
    /// does not exist on the target entity type.
    UndefinedEmbeddingSlot { profile: String, slot: String },
    /// Exploration budget out of range [0.0, 1.0].
    InvalidExploration { profile: String, value: f64 },
    /// Topic diversity out of range [0.0, 1.0].
    InvalidTopicDiversity { profile: String, value: f64 },
    /// Profile name is invalid.
    InvalidProfileName { name: String, reason: String },

    // -- Cohort validation errors --
    /// Cohort name already exists.
    CohortAlreadyDefined { name: String },
    /// Cohort predicate references a field not defined on User entity.
    UndefinedCohortField { cohort: String, field: String },
    /// Cohort predicate references a field with incompatible type.
    CohortFieldTypeMismatch {
        cohort: String,
        field: String,
        expected: FieldType,
        got: String,
    },
    /// Maximum number of cohorts exceeded (100).
    TooManyCohorts { count: usize },

    // -- Relationship validation errors --
    /// Relationship name already exists.
    RelationshipAlreadyDefined { name: String },
    /// Relationship references an entity kind that is not defined.
    UndefinedRelationshipEntity { relationship: String, entity: EntityKind },
    /// Default weight out of range [0.0, 1.0].
    InvalidDefaultWeight { relationship: String, weight: f64 },

    // -- Migration errors --
    /// A breaking change was attempted without using the migration API.
    MigrationRequired { description: String },
    /// Migration references objects that no longer exist.
    MigrationTargetNotFound { description: String },
    /// Migration would invalidate active profiles or cohorts.
    MigrationBreaksDependent { migration: String, dependents: Vec<String> },

    // -- Write-path errors --
    /// Attempted to write a computed field via the write API.
    ComputedFieldWrite { entity: EntityKind, field: String },
    /// Entity with this ID already exists (use update_*() instead).
    EntityExists { kind: EntityKind, id: String },
    /// Entity ID collision in BLAKE3 hash space (astronomically unlikely).
    IdCollision { id_a: String, id_b: String },

    // -- Storage errors --
    /// Schema storage operation failed.
    StorageFailure(String),
}
```
---
## 3. Schema Definition API
The schema definition API is the set of methods on `TidalDB` that declare the structure and behavior of the database. All definitions are WAL-logged for crash recovery and stored in the B-tree backend under the `SCHEMA:` key prefix.
### 3.1 Define Entity
```rust
impl TidalDB {
    /// Define an entity type's metadata fields and embedding slots.
    ///
    /// Each entity kind (Item, User, Creator) is defined exactly once.
    /// Calling define_entity for an already-defined kind returns
    /// SchemaError::EntityAlreadyDefined.
    ///
    /// After definition, entities of this kind can be written via
    /// write_item(), write_user(), or write_creator().
    pub fn define_entity(&self, def: EntityDef) -> Result<(), SchemaError>;
}
```
**Behavior on commit:**
1. Validate field names (unique, valid characters, max length).
2. Validate embedding slots (unique names, valid dimensions, max 4 slots).
3. Validate field types (computed fields have valid underlying type).
4. WAL-log the schema change (record type `0x04`).
5. Store definition in `SCHEMA:entity:{kind}` key.
6. Update in-memory schema cache.
7. Initialize indexes for all declared fields (inverted index for text fields, term dictionary for keyword fields, sorted numeric index for numeric fields, etc.).
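Step 1 above can be sketched as follows. The rules come from the spec (lowercase `[a-z0-9_]`, at most 64 characters, starting with a letter, unique within the entity type); the function names and the standalone `FieldNameError` type are hypothetical stand-ins for the corresponding `SchemaError` variants.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `SchemaError::InvalidFieldName` and
/// `SchemaError::DuplicateFieldName`.
#[derive(Debug, PartialEq, Eq)]
pub enum FieldNameError {
    Invalid { field: String, reason: String },
    Duplicate { field: String },
}

fn invalid(field: &str, reason: &str) -> FieldNameError {
    FieldNameError::Invalid {
        field: field.to_string(),
        reason: reason.to_string(),
    }
}

/// Check one field name against the naming rules.
pub fn validate_field_name(name: &str) -> Result<(), FieldNameError> {
    if name.is_empty() {
        return Err(invalid(name, "must not be empty"));
    }
    if name.len() > 64 {
        return Err(invalid(name, "longer than 64 characters"));
    }
    if !name.starts_with(|c: char| c.is_ascii_lowercase()) {
        return Err(invalid(name, "must start with a lowercase letter"));
    }
    if !name
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
    {
        return Err(invalid(name, "only [a-z0-9_] allowed"));
    }
    Ok(())
}

/// Check every name and reject duplicates within one entity definition.
pub fn validate_field_names(names: &[&str]) -> Result<(), FieldNameError> {
    let mut seen: HashSet<&str> = HashSet::new();
    for name in names {
        validate_field_name(name)?;
        if !seen.insert(*name) {
            return Err(FieldNameError::Duplicate { field: name.to_string() });
        }
    }
    Ok(())
}
```

The order of the individual checks is an assumption; the spec only requires that all of them hold.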
### 3.2 Define Signal
```rust
impl TidalDB {
    /// Define a signal type with its decay, windowing, and velocity behavior.
    ///
    /// Signal names are globally unique. The target entity kind must already
    /// be defined via define_entity.
    ///
    /// Signal definitions are immutable once created. Attempting to redefine
    /// an existing signal returns SchemaError::SignalImmutable.
    ///
    /// On success, existing entities of the target kind receive a zeroed
    /// signal ledger for this signal type lazily, on their next signal write.
    pub fn define_signal(&self, def: SignalDef) -> Result<(), SchemaError>;
}
```
**Behavior on commit:**
1. Validate signal name (unique, valid characters).
2. Validate target entity kind is defined.
3. Validate decay/window/velocity constraints (see Section 5).
4. Precompute lambda for exponential decay and store alongside definition.
5. WAL-log the schema change.
6. Store definition in `SCHEMA:signal:{name}` key.
7. Update in-memory schema cache (signal type registry).
8. Register signal type index (u8) for compact storage in WAL events.
9. Existing entities of the target kind lazily receive zeroed ledger state for this signal on their next signal write (not eagerly initialized -- this would be O(N) for 10M entities).
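Step 3 can be illustrated with a small constraint check. The sketch below redeclares `Decay` and `Window` and uses a hypothetical `SignalCheckError` in place of the `SchemaError` variants. It assumes that enabling velocity alongside any `AllTime` window is rejected outright, and the order of the checks is also an assumption; Section 5 is authoritative on both points.

```rust
use std::time::Duration;

/// Copies of the spec's types, redeclared for a standalone sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum Decay {
    Exponential { half_life: Duration },
    Linear { lifetime: Duration },
    Permanent,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Window {
    Sliding { duration: Duration },
    AllTime,
}

/// Hypothetical stand-in for the signal-related `SchemaError` variants.
#[derive(Debug, PartialEq, Eq)]
pub enum SignalCheckError {
    TooManyWindows { count: usize },
    PermanentWithVelocity,
    AllTimeWithVelocity,
}

/// Maximum windows per signal, per the `TooManyWindows` doc comment.
pub const MAX_WINDOWS: usize = 8;

/// Check the decay/window/velocity combination of a signal definition.
pub fn check_signal(
    decay: &Decay,
    windows: &[Window],
    velocity: bool,
) -> Result<(), SignalCheckError> {
    if windows.len() > MAX_WINDOWS {
        return Err(SignalCheckError::TooManyWindows { count: windows.len() });
    }
    // Velocity of a value that never decays is flagged as meaningless.
    if velocity && *decay == Decay::Permanent {
        return Err(SignalCheckError::PermanentWithVelocity);
    }
    // Assumption: velocity plus any AllTime window is rejected.
    if velocity && windows.contains(&Window::AllTime) {
        return Err(SignalCheckError::AllTimeWithVelocity);
    }
    Ok(())
}
```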
### 3.3 Define Profile
```rust
impl TidalDB {
    /// Define a ranking profile version.
    ///
    /// Profile names are reusable -- each call creates a new version.
    /// Version numbers must be strictly increasing for a given name.
    /// The first version for a new name must be version 1.
    ///
    /// New profiles start in Draft status. Call set_profile_status()
    /// with ProfileStatus::Active to make them available for queries.
    pub fn define_profile(&self, def: ProfileDef) -> Result<(), SchemaError>;

    /// Transition a profile version's lifecycle status.
    pub fn set_profile_status(
        &self,
        name: &str,
        version: u32,
        status: ProfileStatus,
    ) -> Result<(), SchemaError>;

    /// Retrieve a profile by name. If version is None, returns the
    /// latest active version. If no active version exists, returns
    /// the latest version regardless of status.
    pub fn get_profile(
        &self,
        name: &str,
        version: Option<u32>,
    ) -> Result<ProfileDef, SchemaError>;
}
```
**Behavior on commit:**
1. Validate profile name (valid characters).
2. Validate version is sequential (> latest version for this name, or 1 if new).
3. Validate all signal references exist (boost signals, gate signals, penalty signals, exclude signals).
4. Validate all relationship references exist (boost relationships, exclude relationships, candidate edges).
5. Validate candidate strategy (entity kind is defined, embedding slot exists, dimensions match).
6. Validate exploration budget is in [0.0, 1.0].
7. Validate diversity spec (topic_diversity in [0.0, 1.0] if present).
8. WAL-log the schema change.
9. Store definition in `SCHEMA:profile:{name}:{version}` key.
10. Set initial status to `Draft`.
11. Update in-memory schema cache.
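The version check in step 2 might look like the sketch below. `check_profile_version` and `VersionError` are hypothetical names standing in for the `ProfileVersionExists` and `ProfileVersionNotSequential` errors. The sketch reads "strictly greater than the latest" literally, so gaps are accepted, and it reports the smallest acceptable version as `expected`; whether tidalDB allows gaps is not settled by the text above.

```rust
/// Hypothetical stand-in for the profile version `SchemaError` variants.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionError {
    Exists { version: u32 },
    NotSequential { expected: u32, got: u32 },
}

/// Validate a new profile version number.
/// `latest` is the highest version already stored under the profile name,
/// or `None` if the name is new.
pub fn check_profile_version(latest: Option<u32>, got: u32) -> Result<(), VersionError> {
    match latest {
        // The first version for a new name must be 1.
        None if got == 1 => Ok(()),
        None => Err(VersionError::NotSequential { expected: 1, got }),
        // Any strictly greater version is accepted (gaps allowed: assumption).
        Some(l) if got > l => Ok(()),
        Some(l) if got == l => Err(VersionError::Exists { version: got }),
        Some(l) => Err(VersionError::NotSequential {
            expected: l.saturating_add(1),
            got,
        }),
    }
}
```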
### 3.4 Define Cohort
```rust
impl TidalDB {
    /// Define a named cohort (user segment) for cohort-scoped queries.
    ///
    /// Cohort predicates reference fields defined on the User entity type.
    /// The User entity must be defined before any cohorts can be defined.
    ///
    /// Maximum 100 cohort definitions (bounded by the cohort tracking
    /// storage budget -- see 03-signal-system.md Section 7).
    pub fn define_cohort(&self, def: CohortDef) -> Result<(), SchemaError>;
}
```
**Behavior on commit:**
1. Validate cohort name (unique, valid characters).
2. Validate total cohort count does not exceed 100.
3. Validate predicate: all referenced fields exist on the User entity, types are compatible with the predicate operator.
4. WAL-log the schema change.
5. Store definition in `SCHEMA:cohort:{name}` key.
6. Update in-memory schema cache.
7. Schedule initial membership computation (background materializer evaluates the predicate against all existing users).
### 3.5 Define Relationship
```rust
impl TidalDB {
    /// Define a relationship type (edge kind) between entity types.
    ///
    /// Both source and target entity kinds must already be defined.
    /// Relationship names are globally unique.
    pub fn define_relationship(&self, def: RelationshipDef) -> Result<(), SchemaError>;
}
```
**Behavior on commit:**
1. Validate relationship name (unique, valid characters).
2. Validate from/to entity kinds are defined.
3. Validate default weight is in [0.0, 1.0].
4. If decay is specified, validate it (same rules as signal decay).
5. WAL-log the schema change.
6. Store definition in `SCHEMA:relationship:{name}` key.
7. Update in-memory schema cache.
---
## 4. Schema Versioning
Different schema objects have different versioning semantics, reflecting the different consequences of change.
### 4.1 Versioning by Object Type
| Schema Object | Versioning Model | Rationale |
|---------------|-----------------|-----------|
| Entity definitions | Append-only fields | Removing or changing a field type would invalidate indexes and break queries. |
| Signal definitions | Immutable | Changing decay invalidates all historical running scores. Lambda is baked into the O(1) formula. |
| Ranking profiles | Explicitly versioned | Profiles are the tuning knob. Multiple versions must coexist for A/B testing and rollback. |
| Cohort definitions | Mutable (predicate can change) | Cohort membership is recomputed periodically. Changing the predicate simply changes the next computation. |
| Relationship definitions | Immutable | Changing from/to entity kinds or decay would invalidate existing edges. |
### 4.2 Profile Version Lifecycle
Every profile version follows a four-state lifecycle:
```
            define_profile()
  (none) ─────────────────────────> Draft
                                      │
             set_profile_status()     │  (validate all references)
                                      v
                                    Active
                                      │
             set_profile_status()     │  (mark as deprecated,
                                      │   still queryable)
                                      v
                                  Deprecated
                                      │
             set_profile_status()     │  (no longer queryable
                                      │   except by explicit version)
                                      v
                                   Archived
```
```rust
/// Lifecycle status of a ranking profile version.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ProfileStatus {
    /// Newly defined. Not yet available for queries.
    /// Can be tested via explicit version: get_profile("name", Some(version)).
    Draft,
    /// Available for queries. `get_profile("name", None)` returns
    /// the latest active version.
    Active,
    /// Still queryable by explicit version, but no longer returned
    /// as the "latest" active version. Used during A/B test wind-down.
    Deprecated,
    /// No longer queryable. Retained for audit purposes only.
    /// Querying an archived profile returns SchemaError.
    Archived,
}
```
**Status transition rules:**
| Current | Allowed Next | Forbidden |
|---------|-------------|-----------|
| Draft | Active | Deprecated, Archived |
| Active | Deprecated | Draft, Archived |
| Deprecated | Archived, Active (re-activation) | Draft |
| Archived | (terminal) | Any |
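The transition table can be encoded as a single predicate. The sketch below is illustrative: `can_transition` is a hypothetical helper name, and it treats a same-state "transition" as rejected, which the table does not specify.

```rust
/// Mirrors the `ProfileStatus` enum defined above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ProfileStatus {
    Draft,
    Active,
    Deprecated,
    Archived,
}

/// Returns true if `from -> to` appears in the "Allowed Next" column.
/// Hypothetical helper; not part of the documented API.
pub fn can_transition(from: ProfileStatus, to: ProfileStatus) -> bool {
    use ProfileStatus::*;
    matches!(
        (from, to),
        (Draft, Active)
            | (Active, Deprecated)
            | (Deprecated, Archived)
            | (Deprecated, Active) // re-activation
    )
}
```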
**Multiple active versions.** Multiple versions of the same profile name can be `Active` simultaneously. This is intentional -- it enables A/B testing. The application decides which version to use per query by specifying the version explicitly. `get_profile("for_you", None)` returns the highest-versioned active version.
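As a rough illustration of the lookup rule, the sketch below resolves a version from a `(name, version) -> status` map. The function name `resolve_version` and the use of `Option` in place of `SchemaError` are assumptions made to keep the example self-contained.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ProfileStatus {
    Draft,
    Active,
    Deprecated,
    Archived,
}

/// Hypothetical resolver. `None` selects the highest-versioned Active
/// version; `Some(v)` selects that exact version unless it is Archived
/// or does not exist.
pub fn resolve_version(
    profiles: &HashMap<(String, u32), ProfileStatus>,
    name: &str,
    version: Option<u32>,
) -> Option<u32> {
    match version {
        Some(v) => match profiles.get(&(name.to_string(), v)) {
            Some(ProfileStatus::Archived) | None => None,
            Some(_) => Some(v),
        },
        None => profiles
            .iter()
            .filter(|((n, _), status)| n.as_str() == name && **status == ProfileStatus::Active)
            .map(|((_, v), _)| *v)
            .max(),
    }
}
```

With versions 2 and 3 both `Active`, a `None` lookup yields 3, while an experiment arm can still pin version 2 explicitly.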
### 4.3 Schema Version Counter
The database maintains a monotonically increasing schema version counter. Every `define_*` call, `set_profile_status` call, and migration increments this counter. The counter serves as a cache invalidation epoch -- query plan caches are invalidated when the schema version changes.
```rust
impl TidalDB {
    /// Returns the current schema version number.
    /// Incremented on every schema definition or modification.
    pub fn schema_version(&self) -> u64;
}
```
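One way to use the counter as an invalidation epoch is to tag each cached plan with the schema version it was built under and treat any mismatch as a miss. The `PlanCache` type below is a hypothetical sketch, with plans represented as plain strings for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical query-plan cache. Each entry remembers the schema
/// version it was planned under.
pub struct PlanCache {
    /// query text -> (schema_version, plan)
    entries: HashMap<String, (u64, String)>,
}

impl PlanCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached plan only if it was built under `current_version`.
    pub fn get(&self, query: &str, current_version: u64) -> Option<&str> {
        match self.entries.get(query) {
            Some((v, plan)) if *v == current_version => Some(plan.as_str()),
            _ => None,
        }
    }

    /// Stores a plan together with the schema version it was built under.
    pub fn insert(&mut self, query: &str, version: u64, plan: String) {
        self.entries.insert(query.to_string(), (version, plan));
    }
}
```

Stale entries are never returned, so no explicit sweep is needed at the moment the schema changes; they can be overwritten or evicted lazily.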
---
## 5. Schema Validation Rules
Every schema definition is validated at definition time. Validation is eager and complete -- a definition that passes validation is guaranteed to be self-consistent and compatible with all existing definitions.
### 5.1 Validation Rules Reference
| Rule ID | Object | Rule | Error |
|---------|--------|------|-------|
| V-E01 | Entity | Entity kind can only be defined once. | `EntityAlreadyDefined` |
| V-E02 | Entity | Field names must be unique within an entity type. | `DuplicateFieldName` |
| V-E03 | Entity | Field names: lowercase `[a-z0-9_]`, max 64 characters, must start with a letter. | `InvalidFieldName` |
| V-E04 | Entity | Embedding dimensions must be in [2, 4096]. | `InvalidDimensions` |
| V-E05 | Entity | Maximum 4 embedding slots per entity type. | `TooManyEmbeddingSlots` |
| V-E06 | Entity | Embedding slot names must be unique within an entity type. | `DuplicateEmbeddingSlot` |
| V-S01 | Signal | Signal names must be globally unique. | `SignalAlreadyDefined` |
| V-S02 | Signal | Signal names: lowercase `[a-z0-9_]`, max 64 characters. | `InvalidSignalName` |
| V-S03 | Signal | Target entity kind must have a definition. | `UndefinedTargetEntity` |
| V-S04 | Signal | Permanent decay signals must have `velocity: false`. | `PermanentWithVelocity` |
| V-S05 | Signal | Maximum 8 windows per signal type. | `TooManyWindows` |
| V-S06 | Signal | Maximum 64 signal types per entity type. | `TooManySignals` |
| V-S07 | Signal | AllTime window with velocity is forbidden. | `AllTimeWithVelocity` |
| V-S08 | Signal | Existing signal definitions cannot be modified. | `SignalImmutable` |
| V-P01 | Profile | Profile name: lowercase `[a-z0-9_-]`, max 64 characters. | `InvalidProfileName` |
| V-P02 | Profile | Version must be > latest version for this name (or 1 if new). | `ProfileVersionNotSequential` |
| V-P03 | Profile | Version must not already exist for this name. | `ProfileVersionExists` |
| V-P04 | Profile | All boost/penalty/gate signal references must be defined signals. | `UndefinedSignal` |
| V-P05 | Profile | All boost/exclude relationship references must be defined relationship types. | `UndefinedRelationship` |
| V-P06 | Profile | Candidate entity kind must be defined. | `UndefinedEntity` |
| V-P07 | Profile | Candidate ANN embedding slot must exist on the target entity. | `UndefinedEmbeddingSlot` |
| V-P08 | Profile | Exploration must be in [0.0, 1.0]. | `InvalidExploration` |
| V-P09 | Profile | DiversitySpec.topic_diversity must be in [0.0, 1.0] if present. | `InvalidTopicDiversity` |
| V-P10 | Profile | ProfileDecay.field must be a timestamp field on the candidate entity. | `UndefinedSignal` (reused) |
| V-C01 | Cohort | Cohort names must be globally unique. | `CohortAlreadyDefined` |
| V-C02 | Cohort | Predicate fields must exist on the User entity type. | `UndefinedCohortField` |
| V-C03 | Cohort | Predicate field types must be compatible with the operator (Eq on keyword, Gt on numeric, Contains on keywords). | `CohortFieldTypeMismatch` |
| V-C04 | Cohort | Maximum 100 cohort definitions. | `TooManyCohorts` |
| V-R01 | Relationship | Relationship names must be globally unique. | `RelationshipAlreadyDefined` |
| V-R02 | Relationship | From and To entity kinds must be defined. | `UndefinedRelationshipEntity` |
| V-R03 | Relationship | Default weight must be in [0.0, 1.0]. | `InvalidDefaultWeight` |
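As an example of how one of these rules might be checked, the sketch below implements V-E03 (field names). The function name is hypothetical; only the rule itself comes from the table.

```rust
/// Sketch of rule V-E03: lowercase `[a-z0-9_]`, at most 64 characters,
/// and the first character must be a letter.
pub fn is_valid_field_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first character must be a lowercase ASCII letter.
    match chars.next() {
        Some(c) if c.is_ascii_lowercase() => {}
        _ => return false,
    }
    // Byte length equals character count once every char is known to be ASCII;
    // any non-ASCII char fails the `all` check below.
    name.len() <= 64
        && chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}
```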
### 5.2 Cross-Object Dependency Graph
Schema objects reference each other. The validation system maintains a dependency graph to prevent orphaned references and to power impact analysis during migrations.
```
EntityDef (Item)
  ^
  |-- SignalDef (view, target: Item)
  |     ^
  |     |-- ProfileDef (for_you, boost: view.velocity(24h))
  |     |-- ProfileDef (trending, boost: view.velocity(6h))
  |
  |-- EmbeddingSlot (content, 1536D)
  |     ^
  |     |-- ProfileDef (for_you, candidate: Ann, slot: content)
  |
  |-- Field (category)

EntityDef (User)
  ^
  |-- Field (inferred_interests)
  |     ^
  |     |-- CohortDef (jazz_fans, predicate: Contains(inferred_interests, "jazz"))
  |
  |-- Field (region)
  |     ^
  |     |-- CohortDef (us_users, predicate: Eq(region, "US"))
  |
  |-- CohortDef (young_us_jazz, predicate: And(...))

RelationshipDef (follows, from: User, to: Creator)
  ^
  |-- ProfileDef (following, candidate: Relationship("follows"))

RelationshipDef (blocked)
  ^
  |-- ProfileDef (for_you, exclude: Relationship("blocked"))
```
**Invariant: no dangling references.** Every signal, profile, cohort, and relationship definition references only objects that exist at definition time. The validation engine checks all references eagerly. There are no deferred reference checks.
**Invariant: no circular dependencies.** Entity definitions depend on nothing. Signal definitions depend on entity definitions. Profile definitions depend on signal and relationship definitions. Cohort definitions depend on entity field definitions. This is a strict DAG with no cycles.
---
## 6. Schema Migration
### 6.1 Additive Changes (Always Safe)
These changes can be applied immediately via the standard `define_*` methods. No migration API required.
| Change | Method | Effect on Existing Data |
|--------|--------|------------------------|
| Add new field to entity type | `define_entity` with additional fields | Existing entities get `NULL` / default for the new field. Indexes are created empty and populated by background scan. |
| Add new signal type | `define_signal` | Existing entities lazily receive a zeroed signal ledger on the first signal write. |
| Add new ranking profile version | `define_profile` | New version coexists with old versions. No effect on existing data. |
| Add new cohort definition | `define_cohort` | Membership computed by background materializer. No effect on existing data. |
| Add new relationship type | `define_relationship` | No existing edges. Edges created on first `write_relationship` call. |
| Activate/deprecate/archive a profile | `set_profile_status` | Only affects which version `get_profile(name, None)` returns. |
**Adding fields to an entity type.** This is the most common schema change. The API accepts a partial `EntityDef` that adds fields to an already-defined entity kind:
```rust
impl TidalDB {
    /// Add fields to an existing entity type definition.
    /// Only new fields are accepted -- existing fields cannot be
    /// modified or removed via this method.
    pub fn add_fields(
        &self,
        kind: EntityKind,
        fields: Vec<Field>,
    ) -> Result<(), SchemaError>;
}
```
After `add_fields`, the new fields are available for filtering, sorting, and cohort predicates. Existing entities that have not been updated return `NULL` for the new fields. Background index population scans existing entities and builds indexes for any non-NULL values.
### 6.2 Breaking Changes (Require Migration)
These changes would invalidate existing data, indexes, or references. They cannot be applied via `define_*` methods -- attempting to do so returns `SchemaError::MigrationRequired`.
| Change | Why It Breaks | Migration Requirement |
|--------|--------------|----------------------|
| Remove entity field | Profiles, cohorts, or sorts may reference it. Indexes must be dropped. | Verify no dependents reference the field. Drop index. |
| Change field type | Index format changes. Existing values may not be representable in the new type. | Rebuild index. Validate existing values are compatible. |
| Remove signal type | Profiles may reference it as a boost/gate/penalty/exclude. | Verify no active profiles reference the signal. Mark signal as removed. |
| Change signal decay/windows | Invalidates all historical running scores and windowed aggregates. | Cannot be done. Define a new signal type instead. |
| Remove relationship type | Profiles may reference it in candidate, boost, or exclude. | Verify no active profiles reference the relationship. Delete all edges. |
| Remove cohort definition | No direct dependents, but users relying on the cohort name lose it. | Safe to remove if confirmed. |
### 6.3 Migration API
```rust
impl TidalDB {
/// Analyze a proposed migration and return a plan.
/// Does NOT apply any changes. The plan describes:
/// - What objects are affected
/// - What dependents reference the affected objects
/// - Estimated cost (index rebuild time, storage impact)
pub fn plan_migration(
&self,
migration: Migration,
) -> Result<MigrationPlan, SchemaError>;
/// Apply a previously planned migration.
/// The plan must have been generated by plan_migration() in the
/// same schema version (the plan is invalidated if schema changes
/// between planning and application).
pub fn apply_migration(
&self,
plan: MigrationPlan,
) -> Result<(), SchemaError>;
}
/// A migration describes one or more breaking schema changes.
pub struct Migration {
/// Human-readable description.
pub description: String,
/// The individual operations in this migration.
pub operations: Vec<MigrationOp>,
}
/// A single migration operation.
pub enum MigrationOp {
/// Remove a field from an entity type.
RemoveField { kind: EntityKind, field: String },
/// Change a field's type (requires index rebuild + value validation).
ChangeFieldType { kind: EntityKind, field: String, new_type: FieldType },
/// Remove a signal type definition.
RemoveSignal { name: String },
/// Remove a relationship type definition and all its edges.
RemoveRelationship { name: String },
/// Remove a cohort definition.
RemoveCohort { name: String },
}
/// The result of analyzing a migration.
pub struct MigrationPlan {
/// The schema version at which this plan was generated.
/// Plan is invalidated if schema_version changes.
schema_version: u64,
/// Objects that will be modified or removed.
affected_objects: Vec<String>,
/// Active profiles, cohorts, or other objects that reference
/// the affected objects and must be updated first.
blocked_by: Vec<MigrationBlocker>,
/// Estimated cost of applying this migration.
estimated_cost: MigrationCost,
}
pub struct MigrationBlocker {
/// The dependent object (e.g., "profile:for_you:v3").
pub object: String,
/// Why it blocks the migration.
pub reason: String,
}
pub struct MigrationCost {
/// Estimated time to rebuild affected indexes.
pub index_rebuild_time: Duration,
/// Number of entities that need to be scanned.
pub entities_affected: u64,
/// Storage that will be freed.
pub storage_freed: u64,
}
```
**Migration workflow:**
```
1. Application defines the migration (as a closure, because
   plan_migration() takes the Migration by value and step 4 needs it again):
     let remove_flair = || Migration {
         description: "Remove deprecated 'flair' field from User".to_string(),
         operations: vec![MigrationOp::RemoveField {
             kind: EntityKind::User,
             field: "flair".to_string(),
         }],
     };
2. Application plans the migration (dry-run):
     let plan = db.plan_migration(remove_flair())?;
     // plan.blocked_by = [{ object: "cohort:flair_users",
     //                      reason: "references field 'flair'" }]
     // Application must remove the cohort first.
3. Application resolves blockers:
     db.apply_migration(db.plan_migration(Migration {
         description: "Remove flair_users cohort".to_string(),
         operations: vec![MigrationOp::RemoveCohort {
             name: "flair_users".to_string(),
         }],
     })?)?;
4. Application re-plans the original migration (the plan from step 2 is
   stale because step 3 advanced the schema version):
     let plan = db.plan_migration(remove_flair())?;
     // plan.blocked_by = [] -- no more blockers
5. Application applies the migration:
     db.apply_migration(plan)?;
```
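The two preconditions that `apply_migration` enforces on a plan (same schema version, no outstanding blockers) could be checked along these lines. This is a sketch under assumptions: `PlanHeader` is a trimmed-down stand-in for `MigrationPlan`, and the `ApplyError` type with its variant names is hypothetical. The spec does not state which `SchemaError` variants are returned.

```rust
/// Hypothetical error type; variant names are illustrative only.
#[derive(Debug, PartialEq)]
pub enum ApplyError {
    /// The schema changed between planning and application.
    StalePlan { planned_at: u64, current: u64 },
    /// Dependent objects still reference the affected objects.
    Blocked(Vec<String>),
}

/// Trimmed-down stand-in for `MigrationPlan`.
pub struct PlanHeader {
    pub schema_version: u64,
    pub blocked_by: Vec<String>,
}

/// Checks a plan against the current schema version before applying it.
pub fn check_plan(plan: &PlanHeader, current_version: u64) -> Result<(), ApplyError> {
    if plan.schema_version != current_version {
        return Err(ApplyError::StalePlan {
            planned_at: plan.schema_version,
            current: current_version,
        });
    }
    if !plan.blocked_by.is_empty() {
        return Err(ApplyError::Blocked(plan.blocked_by.clone()));
    }
    Ok(())
}
```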
### 6.4 Migration Compatibility Matrix
This matrix shows which schema changes are additive (safe) vs breaking (require migration).
| Operation | Entity Fields | Signal Defs | Profiles | Cohorts | Relationships |
|-----------|:---:|:---:|:---:|:---:|:---:|
| **Add** | Safe | Safe | Safe (new version) | Safe | Safe |
| **Remove** | Migration | Migration | N/A (archive instead) | Migration | Migration |
| **Modify type** | Migration | Forbidden | N/A (new version) | Safe (predicate) | Forbidden |
| **Modify behavior** | N/A | Forbidden | N/A (new version) | Safe (refresh) | Forbidden |
| **Rename** | Migration | Forbidden | N/A (new name) | Migration | Forbidden |
"Forbidden" means the operation is not supported at all -- the application must create a new object. This applies to signal definitions and relationship definitions where the original declaration's semantics are baked into persisted data (running scores, edge weights).
---
## 7. Schema Introspection
The introspection API allows the application to discover the current schema state. All introspection methods are read-only and never touch disk: they read from the in-memory schema cache under a shared read lock (see Section 10.2).
```rust
impl TidalDB {
// -- Entity introspection --
/// List all defined entity types with their field schemas.
pub fn list_entities(&self) -> Vec<EntityInfo>;
/// Describe a specific entity type.
pub fn describe_entity(&self, kind: EntityKind) -> Result<EntityInfo, SchemaError>;
// -- Signal introspection --
/// List all defined signal types with their decay/window config.
pub fn list_signals(&self) -> Vec<SignalInfo>;
/// Describe a specific signal type.
pub fn describe_signal(&self, name: &str) -> Result<SignalInfo, SchemaError>;
// -- Profile introspection --
/// List all profile names with their version history and statuses.
pub fn list_profiles(&self) -> Vec<ProfileSummary>;
/// Describe a specific profile version. If version is None,
/// returns the latest active version.
pub fn describe_profile(
&self,
name: &str,
version: Option<u32>,
) -> Result<ProfileInfo, SchemaError>;
// -- Cohort introspection --
/// List all cohort definitions with their membership counts.
pub fn list_cohorts(&self) -> Vec<CohortInfo>;
/// Describe a specific cohort with its full predicate.
pub fn describe_cohort(&self, name: &str) -> Result<CohortInfo, SchemaError>;
// -- Relationship introspection --
/// List all defined relationship types.
pub fn list_relationships(&self) -> Vec<RelationshipInfo>;
/// Describe a specific relationship type.
pub fn describe_relationship(&self, name: &str) -> Result<RelationshipInfo, SchemaError>;
// -- Global schema state --
/// Current schema version number.
pub fn schema_version(&self) -> u64;
/// Full dependency graph of all schema objects.
/// Useful for understanding the impact of a proposed change.
pub fn schema_dependencies(&self) -> DependencyGraph;
}
```
### Introspection Return Types
```rust
/// Summary of an entity type definition.
pub struct EntityInfo {
pub kind: EntityKind,
pub fields: Vec<FieldInfo>,
pub embedding_slots: Vec<EmbeddingSlotInfo>,
/// Number of active (non-archived) entities of this kind.
pub entity_count: u64,
/// Number of signal types targeting this entity kind.
pub signal_type_count: u32,
}
pub struct FieldInfo {
pub name: String,
pub field_type: FieldType,
pub writability: Writability,
/// Whether an index exists for this field.
pub indexed: bool,
}
pub struct EmbeddingSlotInfo {
pub name: String,
pub dimensions: u32,
pub source: EmbeddingSource,
pub precision: EmbeddingPrecision,
/// Number of entities with a non-null vector in this slot.
pub populated_count: u64,
}
/// Summary of a signal type definition.
pub struct SignalInfo {
pub name: String,
pub target: EntityKind,
pub decay: Decay,
pub lambda: Option<f64>,
pub windows: Vec<Window>,
pub velocity: bool,
pub durability: DurabilityLevel,
}
/// Summary of profile versions for a given name.
pub struct ProfileSummary {
pub name: String,
pub versions: Vec<ProfileVersionSummary>,
}
pub struct ProfileVersionSummary {
pub version: u32,
pub status: ProfileStatus,
pub created_at: Timestamp,
}
/// Full profile definition with metrics.
pub struct ProfileInfo {
pub definition: ProfileDef,
pub status: ProfileStatus,
pub created_at: Timestamp,
/// Total queries executed with this profile version.
pub query_count: u64,
/// Average query latency for this profile version.
pub avg_latency: Duration,
}
/// Summary of a cohort definition.
pub struct CohortInfo {
pub name: String,
pub predicate: Predicate,
pub refresh: RefreshPolicy,
/// Current membership count (as of last refresh).
pub member_count: u64,
/// When membership was last recomputed.
pub last_refreshed: Timestamp,
}
/// Summary of a relationship type definition.
pub struct RelationshipInfo {
pub name: String,
pub from: EntityKind,
pub to: EntityKind,
pub weight_default: f64,
pub decay: Option<Decay>,
pub symmetric: bool,
/// Total number of active edges of this type.
pub edge_count: u64,
}
/// The full dependency graph of all schema objects.
pub struct DependencyGraph {
/// Each entry is (object_id, Vec<dependent_object_ids>).
pub edges: Vec<(String, Vec<String>)>,
}
```
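A typical use of `DependencyGraph` is impact analysis: given an object, find everything that depends on it directly or transitively. The breadth-first sketch below assumes object IDs in the `kind:name` style used elsewhere in this spec (for example `profile:for_you:v3`); the function name and the `entity:item` / `signal:view` IDs are illustrative.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Same shape as the `DependencyGraph` struct above.
pub struct DependencyGraph {
    /// Each entry is (object_id, Vec<dependent_object_ids>).
    pub edges: Vec<(String, Vec<String>)>,
}

/// All objects that depend on `object_id`, directly or transitively,
/// in breadth-first order. Hypothetical helper, not a documented API.
pub fn transitive_dependents(graph: &DependencyGraph, object_id: &str) -> Vec<String> {
    // Index the edge list for O(1) lookup by object id.
    let index: HashMap<&str, &Vec<String>> = graph
        .edges
        .iter()
        .map(|(id, deps)| (id.as_str(), deps))
        .collect();

    let mut seen: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<&str> = VecDeque::from([object_id]);
    let mut out = Vec::new();

    while let Some(current) = queue.pop_front() {
        if let Some(deps) = index.get(current) {
            for dep in deps.iter() {
                // `insert` returns false for already-visited nodes.
                if seen.insert(dep.as_str()) {
                    out.push(dep.clone());
                    queue.push_back(dep.as_str());
                }
            }
        }
    }
    out
}
```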
---
## 8. Defaults and Population Priors
The database ships with sensible defaults that enable a working system before the application defines any custom profiles. These defaults are overridable -- defining a profile with the same name replaces the built-in.
### 8.1 Built-in Ranking Profiles
The following profiles are automatically available after entity and signal types are defined. They are created with `ProfileStatus::Active` and version `0` (a reserved version number for built-ins that application-defined profiles override starting at version 1).
| Profile | Candidate Strategy | Primary Signal | Sort Semantics |
|---------|-------------------|----------------|----------------|
| `for_you` | ANN over user preference vector, top_k=500 | preference match + engagement velocity | Personalized blend of semantic relevance and social proof |
| `trending` | Scan all items | `view.velocity(6h) + share.velocity(6h)` | Pure signal velocity, no personalization |
| `rising` | Scan all items | Relative velocity: `velocity(1h) / velocity(24h)`, age-boosted | Content accelerating relative to its baseline |
| `hot` | Scan all items | `score / (age_hours + 2)^1.8` | Hacker News-style gravity decay over cumulative engagement |
| `following` | Relationship: `follows` | N/A | `created_at DESC` (pure chronological) |
| `related` | ANN over anchor item embedding, top_k=200 | Semantic similarity + collaborative filtering | Most similar content to the anchor |
| `browse` | Scan all items | `completion_rate * 0.4 + like_ratio * 0.3 + log(views) * 0.3` | Quality-weighted with reach tiebreaker |
| `search` | Hybrid text + vector, RRF(k=60) | BM25 * 0.6 + semantic_similarity * 0.4 | Relevance with quality boost |
| `controversial` | Scan all items | `sqrt(positive_signals * negative_signals)` | Maximize engagement polarity |
| `hidden_gems` | Scan all items | `completion_rate * like_ratio / log(views + 1)` | High quality, low reach |
| `notification` | Relationship: `follows`, since last_seen | `interaction_weight * item_quality` | Most important notifications first |
| `live` | Filter: `status=live` | `interaction_weight * log(viewer_count)` | Live content the user cares about |
**Override behavior.** When the application defines `for_you` version 1, the built-in version 0 is automatically archived. The application's version takes precedence. If the application archives all versions of a profile that has a built-in, the built-in is restored as the fallback.
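Three of the scoring expressions from the built-in profile table (`hot`, `controversial`, `hidden_gems`) can be written out directly. This is a sketch, not the shipped implementation: the function names are hypothetical, `log` is assumed to be the natural logarithm, and the handling of zero views in `hidden_gems_score` is an assumption because the table does not define it.

```rust
/// `hot`: score / (age_hours + 2)^1.8
pub fn hot_score(score: f64, age_hours: f64) -> f64 {
    score / (age_hours + 2.0).powf(1.8)
}

/// `controversial`: sqrt(positive_signals * negative_signals)
pub fn controversial_score(positive: f64, negative: f64) -> f64 {
    (positive * negative).sqrt()
}

/// `hidden_gems`: completion_rate * like_ratio / log(views + 1).
/// Returns None when views == 0, where the denominator is zero
/// (assumed behavior; the table does not cover this case).
pub fn hidden_gems_score(completion_rate: f64, like_ratio: f64, views: f64) -> Option<f64> {
    let denom = (views + 1.0).ln();
    if denom > 0.0 {
        Some(completion_rate * like_ratio / denom)
    } else {
        None
    }
}
```

Note that `controversial_score` is zero whenever either side is zero, so an item needs both positive and negative signals to rank at all.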
### 8.2 Built-in Signal Types
The database does not define signal types automatically. Signal types must be explicitly defined by the application because they determine storage layout and memory budget. However, the documentation includes a recommended set of 40+ signal types (see 03-signal-system.md Section 11) that covers the common content platform use case.
### 8.3 Population-Level Priors
These are database-maintained values that serve as defaults for cold-start entities.
| Prior | Definition | Used For |
|-------|-----------|----------|
| Population preference vector | Centroid (mean) of all active user preference vectors. Recomputed hourly by the background materializer. | New users with no signal history. Their preference vector is initialized to this centroid. |
| Default signal baselines | Per-signal-type median values across all active items. | Cold-start exploration budget calibration: a new item's signals are compared against these baselines to estimate how much exploration is needed. |
| Global engagement distribution | Distribution of engagement_level across all users (% power_user, regular, casual, dormant, new). | Cohort-scoped queries without explicit cohort: "trending globally" uses the full distribution. |
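The population preference vector is described as a centroid (mean) of user preference vectors. A minimal sketch of that computation follows; the function name is hypothetical, and accumulating in `f64` is a choice made here to limit rounding error, not something the spec prescribes.

```rust
/// Component-wise mean of equal-length vectors.
/// Returns None for empty input or mismatched dimensions.
pub fn centroid(vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dims = vectors.first()?.len();
    // Accumulate in f64 to reduce rounding error over many vectors.
    let mut sum = vec![0.0f64; dims];
    for v in vectors {
        if v.len() != dims {
            return None;
        }
        for (acc, x) in sum.iter_mut().zip(v) {
            *acc += f64::from(*x);
        }
    }
    let n = vectors.len() as f64;
    Some(sum.into_iter().map(|s| (s / n) as f32).collect())
}
```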
### 8.4 Cold Start Configuration
Cold start behavior is specified per ranking profile, not globally. The `exploration` field in `ProfileDef` controls how much of the result set is reserved for cold-start items.
```rust
// Profile with 10% exploration budget
ProfileDef {
name: "for_you",
exploration: 0.10, // 10% of results from new/unseen content
..
}
```
**Exploration budget mechanics:**
1. The query executor reserves `floor(limit * exploration)` slots for exploration items.
2. Exploration candidates are items that meet ALL of:
- Created within the last 48 hours (configurable)
- Fewer than 1000 impressions (configurable)
- Not hidden or blocked by the querying user
3. Exploration candidates are ranked by a simplified score: `content_similarity * freshness_bonus`. No signal-based scoring (there are no signals to score).
4. Exploration slots are distributed evenly through the result set (not clustered at the end).
5. As an item accumulates signals, it exits the exploration pool and competes normally.
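Steps 1 and 4 above can be sketched as follows. The slot count follows the stated `floor(limit * exploration)` rule; the exact spacing formula in `exploration_positions` is an assumption, since the spec says only that slots are distributed evenly. Both function names are hypothetical.

```rust
/// Number of exploration slots: floor(limit * exploration).
pub fn exploration_slots(limit: usize, exploration: f64) -> usize {
    ((limit as f64) * exploration.clamp(0.0, 1.0)).floor() as usize
}

/// 0-based result positions reserved for exploration items, placed at
/// the midpoints of `slots` equal-width segments of the result list.
/// The spacing rule is an assumption, not taken from the spec.
pub fn exploration_positions(limit: usize, exploration: f64) -> Vec<usize> {
    let slots = exploration_slots(limit, exploration);
    (0..slots)
        .map(|i| ((2 * i + 1) * limit) / (2 * slots))
        .collect()
}
```

For `limit = 20` and `exploration = 0.10`, this reserves two slots at positions 5 and 15 rather than appending both at the end.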
---
## 9. A/B Testing Support
tidalDB supports A/B testing of ranking profiles through the profile versioning system. The database does not perform traffic splitting -- that is application logic. The database provides the infrastructure: multiple active profile versions, per-version metrics, and deterministic query execution.
### 9.1 How A/B Testing Works
```rust
// The application maintains its own traffic split logic.
let profile_version = if user_in_experiment_bucket(user_id) {
    "for_you_v2" // or get_profile("for_you", Some(2))
} else {
    "for_you" // latest active version (v1)
};
let results = db.retrieve(Retrieve {
    for_user: Some(user_id),
    profile: profile_version,
    ..
})?;
```
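The bucketing function is application code, but a sticky assignment is easy to get wrong, so here is one possible sketch. It hashes the experiment name and user ID with 64-bit FNV-1a and compares the result against a percentage. The three-argument signature is hypothetical (the snippet above shows only `user_in_experiment_bucket(user_id)`), and the choice of FNV-1a is illustrative.

```rust
/// 64-bit FNV-1a. A fixed algorithm, so assignments stay stable across
/// processes and releases; the algorithm behind
/// `std::collections::hash_map::DefaultHasher` is unspecified and may change.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical sticky bucketing: the same (experiment, user) pair always
/// lands in the same bucket. `treatment_percent` is clamped to 0..=100.
pub fn user_in_experiment_bucket(experiment: &str, user_id: u64, treatment_percent: u8) -> bool {
    let mut key = experiment.as_bytes().to_vec();
    key.extend_from_slice(&user_id.to_le_bytes());
    (fnv1a_64(&key) % 100) < u64::from(treatment_percent.min(100))
}
```

Including the experiment name in the hash input keeps assignments independent across experiments, so the same users are not always the treatment group.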
### 9.2 Profile Metrics
The database tracks per-profile-version metrics automatically:
```rust
pub struct ProfileMetrics {
/// Total queries executed with this profile version.
pub query_count: u64,
/// Latency percentiles (p50, p95, p99).
pub latency_p50: Duration,
pub latency_p95: Duration,
pub latency_p99: Duration,
/// Average number of candidates scored per query.
pub avg_candidates_scored: f64,
/// Average number of results returned per query.
pub avg_results_returned: f64,
/// When the first query was executed with this version.
pub first_query_at: Option<Timestamp>,
/// When the most recent query was executed.
pub last_query_at: Option<Timestamp>,
}
impl TidalDB {
/// Retrieve metrics for a specific profile version.
pub fn profile_metrics(
&self,
name: &str,
version: u32,
) -> Result<ProfileMetrics, SchemaError>;
}
```
These metrics help the application decide which version to keep `Active` and which to deprecate once an experiment concludes. The database does not make this decision -- it only provides the data.
### 9.3 What the Database Does NOT Do
- **Traffic splitting.** The application decides which user sees which profile.
- **Statistical significance testing.** The application runs its own hypothesis tests.
- **Automatic promotion.** The application calls `set_profile_status` explicitly.
- **Metric comparison.** The application queries `profile_metrics` for each version and compares.
This is a deliberate design choice. Traffic splitting and experimentation are application-domain concerns with complex requirements (random assignment, sticky bucketing, interaction effects, ramp-up schedules) that vary wildly across organizations. The database provides the building blocks; the application provides the logic.
---
## 10. Schema Storage
### 10.1 Storage Format
Schema definitions are stored in the B-tree backend (redb) under the `SCHEMA:` key prefix. This is the same backend used for entity metadata and materialized views -- read-heavy, rarely written.
```
Key Encoding:
SCHEMA:entity:{kind} -> serialized EntityDef
SCHEMA:signal:{name} -> serialized SignalDef + precomputed lambda
SCHEMA:profile:{name}:{version} -> serialized ProfileDef + status + metadata
SCHEMA:cohort:{name} -> serialized CohortDef + membership bitmap ref
SCHEMA:relationship:{name} -> serialized RelationshipDef
SCHEMA:version -> u64 schema version counter
SCHEMA:metrics:profile:{name}:{v} -> serialized ProfileMetrics
```
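The key layout above is simple enough to build and parse with string operations. The helpers below are a sketch for the profile keyspace only; the function names are hypothetical. Parsing relies on rule V-P01, which keeps `:` out of profile names.

```rust
/// Builds `SCHEMA:profile:{name}:{version}`.
pub fn profile_key(name: &str, version: u32) -> String {
    format!("SCHEMA:profile:{name}:{version}")
}

/// Parses a profile key back into (name, version).
/// Returns None if the prefix or version component is malformed.
pub fn parse_profile_key(key: &str) -> Option<(&str, u32)> {
    let rest = key.strip_prefix("SCHEMA:profile:")?;
    // Profile names cannot contain ':' (V-P01), so the last ':' separates
    // the name from the version.
    let (name, version) = rest.rsplit_once(':')?;
    Some((name, version.parse().ok()?))
}
```

Because all versions of a profile share the `SCHEMA:profile:{name}:` prefix, they can be listed with a single prefix scan.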
### 10.2 In-Memory Schema Cache
On database open, all `SCHEMA:*` keys are loaded into an in-memory cache. The cache provides O(1) access to any schema object. All validation and introspection reads come from the cache, never from disk.
```rust
/// In-memory representation of the complete schema.
/// Loaded once at startup. Updated atomically on define_*() calls.
pub(crate) struct SchemaCache {
/// Entity definitions by kind.
entities: HashMap<EntityKind, EntityDef>,
/// Signal definitions by name.
signals: HashMap<String, SignalDef>,
/// Signal type index: maps signal name to compact u8 index
/// used in WAL events and hot-tier state.
signal_type_ids: HashMap<String, u8>,
/// Profile definitions by (name, version).
profiles: HashMap<(String, u32), (ProfileDef, ProfileStatus)>,
/// Cohort definitions by name.
cohorts: HashMap<String, CohortDef>,
/// Relationship definitions by name.
relationships: HashMap<String, RelationshipDef>,
/// Dependency graph for migration impact analysis.
dependencies: DependencyGraph,
/// Schema version counter.
version: AtomicU64,
}
```
**Cache invalidation.** When a `define_*` method succeeds:
1. The change is WAL-logged as a `SchemaChange` record (see Section 10.3).
2. The new definition is written to the B-tree backend.
3. The schema cache is updated with the new definition.
4. The schema version counter is incremented (atomic).
5. Query plan caches that reference the old schema version are invalidated.
The cache update is performed under a `RwLock` (write-locked during mutation, read-locked during validation and introspection). Schema mutations are rare (minutes to hours between changes in production), so write-lock contention is negligible and read-lock acquisition is almost always uncontended.
### 10.3 WAL Logging
Every schema change is WAL-logged as a `SchemaChange` record (type `0x04`) before the B-tree write occurs. This ensures crash recovery can replay schema changes and restore the schema to a consistent state.
```
SchemaChange WAL Record Payload:
+----------+-------+-----------------------------+
| Op Type | Name | Serialized Definition |
| 1 byte | var | var |
+----------+-------+-----------------------------+
Op Types:
0x01 = DefineEntity
0x02 = DefineSignal
0x03 = DefineProfile
0x04 = DefineCohort
0x05 = DefineRelationship
0x06 = SetProfileStatus
0x07 = AddFields
0x08 = ApplyMigration
```
**Recovery sequence.** On crash recovery, `SchemaChange` records are replayed in sequence order. The entity store, signal ledger, and other subsystems are not updated until schema recovery completes -- they depend on having a consistent schema to validate incoming replayed events.
---
## 11. Example: Video Platform Schema
A complete schema definition for a video streaming platform, demonstrating all five object types. This example produces a working database that supports all 14 use cases from USE_CASES.md.
```rust
use tidaldb::{TidalDB, Config};
use tidaldb::schema::*;
use std::time::Duration;
fn define_video_platform_schema(db: &TidalDB) -> Result<(), SchemaError> {
// =====================================================================
// 1. ENTITY TYPES
// =====================================================================
db.define_entity(EntityDef {
kind: EntityKind::Item,
metadata_fields: vec![
// Text fields (BM25 full-text indexed)
Field::text("title"),
Field::text("description"),
// Keyword fields (exact match, filterable)
Field::keyword("category"),
Field::keywords("tags"),
Field::keyword("format"), // video, short, live, podcast
Field::keyword("language"),
Field::keyword("content_rating"), // G, PG, PG-13, R
Field::keyword("status"), // published, live, scheduled
Field::keyword("availability"), // free, premium
// Numeric
Field::i64("award_count"),
// Boolean
Field::bool("has_subtitles"),
Field::bool("downloadable"),
Field::bool("safe_search"),
// Duration
Field::duration("duration"),
// Timestamps
Field::timestamp("created_at"),
Field::timestamp("updated_at"),
],
embedding: EmbeddingDef {
slots: vec![
EmbeddingSlot {
name: "content".to_string(),
dimensions: 1536,
source: EmbeddingSource::External,
precision: EmbeddingPrecision::F16,
},
],
},
})?;
db.define_entity(EntityDef {
kind: EntityKind::User,
metadata_fields: vec![
// Application-set
Field::keyword("locale"),
Field::keyword("language"),
Field::keyword("region"),
Field::keyword("age_range"),
Field::keyword("account_type"),
Field::keywords("explicit_interests"),
// Database-computed
Field::computed("inferred_interests", FieldType::Keywords),
Field::computed("engagement_level", FieldType::Keyword),
Field::computed("content_format_preference", FieldType::Keyword),
Field::computed("platform_tenure_days", FieldType::I64),
Field::computed("followed_creator_count", FieldType::I64),
],
embedding: EmbeddingDef {
slots: vec![
EmbeddingSlot {
name: "preference".to_string(),
dimensions: 1536,
source: EmbeddingSource::DatabaseManaged,
precision: EmbeddingPrecision::F16,
},
],
},
})?;
db.define_entity(EntityDef {
kind: EntityKind::Creator,
metadata_fields: vec![
Field::text("name"),
Field::keyword("handle"),
Field::keyword("language"),
Field::keyword("region"),
Field::keywords("categories"),
Field::bool("verified"),
// Database-computed
Field::computed("follower_count", FieldType::I64),
Field::computed("total_items", FieldType::I64),
Field::computed("avg_engagement_rate", FieldType::F64),
],
embedding: EmbeddingDef {
slots: vec![
EmbeddingSlot {
name: "catalog".to_string(),
dimensions: 1536,
source: EmbeddingSource::DatabaseManaged,
precision: EmbeddingPrecision::F16,
},
],
},
})?;

    // =====================================================================
    // 2. SIGNAL TYPES
    // =====================================================================

    // -- Positive engagement signals --
    db.define_signal(SignalDef {
        name: "view".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(7 * 86400) },
        windows: vec![
            Window::hours(1),
            Window::hours(24),
            Window::days(7),
            Window::days(30),
            Window::all_time(),
        ],
        velocity: true,
        durability: None, // default: Batched
    })?;

    db.define_signal(SignalDef {
        name: "like".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(7 * 86400) },
        windows: vec![
            Window::hours(1),
            Window::hours(24),
            Window::days(7),
            Window::all_time(),
        ],
        velocity: true,
        durability: None,
    })?;

    db.define_signal(SignalDef {
        name: "share".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(3 * 86400) },
        windows: vec![
            Window::hours(1),
            Window::hours(24),
            Window::days(7),
        ],
        velocity: true,
        durability: None,
    })?;

    db.define_signal(SignalDef {
        name: "comment".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(3 * 86400) },
        windows: vec![
            Window::hours(1),
            Window::hours(24),
            Window::days(7),
            Window::all_time(),
        ],
        velocity: true,
        durability: None,
    })?;

    db.define_signal(SignalDef {
        name: "save".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(7 * 86400) },
        windows: vec![Window::hours(24), Window::days(7), Window::all_time()],
        velocity: false,
        durability: None,
    })?;

    // -- Quality signals --
    db.define_signal(SignalDef {
        name: "completion".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(30 * 86400) },
        windows: vec![Window::all_time()],
        velocity: false,
        durability: None,
    })?;

    db.define_signal(SignalDef {
        name: "dwell_time".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(3 * 86400) },
        windows: vec![Window::hours(24), Window::days(7)],
        velocity: false,
        durability: Some(DurabilityLevel::Eventual),
    })?;

    db.define_signal(SignalDef {
        name: "impression".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(86400) },
        windows: vec![Window::hours(1), Window::hours(24)],
        velocity: false,
        durability: Some(DurabilityLevel::Eventual),
    })?;

    // -- Negative engagement signals --
    db.define_signal(SignalDef {
        name: "skip".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(86400) },
        windows: vec![Window::hours(1), Window::hours(24)],
        velocity: false,
        durability: None,
    })?;

    db.define_signal(SignalDef {
        name: "hide".to_string(),
        target: EntityKind::Item,
        decay: Decay::Permanent,
        windows: vec![],
        velocity: false,
        durability: Some(DurabilityLevel::Immediate),
    })?;

    db.define_signal(SignalDef {
        name: "dislike".to_string(),
        target: EntityKind::Item,
        decay: Decay::Exponential { half_life: Duration::from_secs(7 * 86400) },
        windows: vec![
            Window::hours(1),
            Window::hours(24),
            Window::days(7),
            Window::all_time(),
        ],
        velocity: true,
        durability: None,
    })?;

    db.define_signal(SignalDef {
        name: "report".to_string(),
        target: EntityKind::Item,
        decay: Decay::Permanent,
        windows: vec![Window::all_time()],
        velocity: false,
        durability: Some(DurabilityLevel::Immediate),
    })?;

    // =====================================================================
    // 3. RELATIONSHIP TYPES
    // =====================================================================

    db.define_relationship(RelationshipDef {
        name: "follows".to_string(),
        from: EntityKind::User,
        to: EntityKind::Creator,
        weight_default: 1.0,
        decay: None,
        symmetric: false,
    })?;

    db.define_relationship(RelationshipDef {
        name: "blocked".to_string(),
        from: EntityKind::User,
        to: EntityKind::Creator,
        weight_default: 1.0,
        decay: None,
        symmetric: false,
    })?;

    db.define_relationship(RelationshipDef {
        name: "muted".to_string(),
        from: EntityKind::User,
        to: EntityKind::Creator,
        weight_default: 1.0,
        decay: None,
        symmetric: false,
    })?;

    db.define_relationship(RelationshipDef {
        name: "saved".to_string(),
        from: EntityKind::User,
        to: EntityKind::Item,
        weight_default: 1.0,
        decay: None,
        symmetric: false,
    })?;

    db.define_relationship(RelationshipDef {
        name: "interaction_weight".to_string(),
        from: EntityKind::User,
        to: EntityKind::Creator,
        weight_default: 0.0,
        decay: Some(Decay::Exponential {
            half_life: Duration::from_secs(30 * 86400),
        }),
        symmetric: false,
    })?;

    db.define_relationship(RelationshipDef {
        name: "similarity".to_string(),
        from: EntityKind::Item,
        to: EntityKind::Item,
        weight_default: 0.0,
        decay: None, // recomputed periodically, not decayed
        symmetric: true,
    })?;

    // =====================================================================
    // 4. RANKING PROFILES
    // =====================================================================

    // -- Personalized feed --
    db.define_profile(ProfileDef {
        name: "for_you".to_string(),
        version: 1,
        candidate: Candidate::Ann {
            query_vector: VectorSource::UserPreference,
            index: EntityKind::Item,
            embedding_slot: Some("content".to_string()),
            top_k: 500,
        },
        boosts: vec![
            Boost::signal("view", Window::hours(24), SignalMode::Velocity, 0.3),
            Boost::relationship("interaction_weight", 0.2),
            Boost::social_proof(0.15),
        ],
        decay: Some(ProfileDecay {
            field: "created_at".to_string(),
            half_life: Duration::from_secs(48 * 3600),
        }),
        gates: vec![
            Gate::min("completion", Window::all_time(), 0.3),
        ],
        penalties: vec![
            Penalty::signal("skip", Window::hours(24), -0.5),
        ],
        excludes: vec![
            Exclude::signal("hide"),
            Exclude::relationship("blocked"),
        ],
        diversity: Some(DiversitySpec {
            max_per_creator: Some(2),
            format_mix: true,
            topic_diversity: None,
        }),
        exploration: 0.10,
        sort: None,
    })?;
    db.set_profile_status("for_you", 1, ProfileStatus::Active)?;

    // -- Trending --
    db.define_profile(ProfileDef {
        name: "trending".to_string(),
        version: 1,
        candidate: Candidate::Scan { entity: EntityKind::Item },
        boosts: vec![
            Boost::signal("share", Window::hours(6), SignalMode::Velocity, 0.5),
            Boost::signal("view", Window::hours(6), SignalMode::Velocity, 0.3),
            Boost::signal("view", Window::hours(24), SignalMode::UniqueRatio, 0.2),
        ],
        decay: None,
        gates: vec![],
        penalties: vec![],
        excludes: vec![],
        diversity: Some(DiversitySpec {
            max_per_creator: Some(1),
            format_mix: false,
            topic_diversity: None,
        }),
        exploration: 0.0,
        sort: None,
    })?;
    db.set_profile_status("trending", 1, ProfileStatus::Active)?;

    // -- Following feed --
    db.define_profile(ProfileDef {
        name: "following".to_string(),
        version: 1,
        candidate: Candidate::Relationship { edge: "follows".to_string() },
        boosts: vec![],
        decay: None,
        gates: vec![],
        penalties: vec![],
        excludes: vec![
            Exclude::relationship("blocked"),
        ],
        diversity: None,
        exploration: 0.0,
        sort: Some(Sort::New),
    })?;
    db.set_profile_status("following", 1, ProfileStatus::Active)?;

    // -- Search --
    db.define_profile(ProfileDef {
        name: "search".to_string(),
        version: 1,
        candidate: Candidate::Hybrid {
            text_weight: 0.6,
            vector_weight: 0.4,
            fusion: Fusion::Rrf { k: 60 },
        },
        boosts: vec![
            Boost::signal("completion", Window::all_time(), SignalMode::Value, 0.15),
            Boost::signal("like", Window::all_time(), SignalMode::Ratio, 0.10),
        ],
        decay: Some(ProfileDecay {
            field: "created_at".to_string(),
            half_life: Duration::from_secs(90 * 86400),
        }),
        gates: vec![],
        penalties: vec![],
        excludes: vec![
            Exclude::signal("hide"),
            Exclude::relationship("blocked"),
        ],
        diversity: Some(DiversitySpec {
            max_per_creator: Some(2),
            format_mix: false,
            topic_diversity: None,
        }),
        exploration: 0.0,
        sort: None,
    })?;
    db.set_profile_status("search", 1, ProfileStatus::Active)?;

    // -- Hidden gems --
    db.define_profile(ProfileDef {
        name: "hidden_gems".to_string(),
        version: 1,
        candidate: Candidate::Scan { entity: EntityKind::Item },
        boosts: vec![
            Boost::signal("completion", Window::all_time(), SignalMode::Value, 0.4),
            Boost::signal("like", Window::all_time(), SignalMode::Ratio, 0.3),
        ],
        decay: Some(ProfileDecay {
            field: "created_at".to_string(),
            half_life: Duration::from_secs(30 * 86400),
        }),
        gates: vec![
            Gate::min("completion", Window::all_time(), 0.6),
            Gate::min("view", Window::all_time(), 10.0),
        ],
        penalties: vec![
            // Penalize high-reach content (inverse reach scoring)
            Penalty::signal("view", Window::all_time(), -0.3),
        ],
        excludes: vec![
            Exclude::signal("hide"),
            Exclude::relationship("blocked"),
        ],
        diversity: Some(DiversitySpec {
            max_per_creator: Some(1),
            format_mix: true,
            topic_diversity: Some(0.7),
        }),
        exploration: 0.0,
        sort: None,
    })?;
    db.set_profile_status("hidden_gems", 1, ProfileStatus::Active)?;

    // =====================================================================
    // 5. COHORT DEFINITIONS
    // =====================================================================

    db.define_cohort(CohortDef {
        name: "us_young_jazz".to_string(),
        predicate: Predicate::And(vec![
            Predicate::Eq("region".to_string(), PredicateValue::String("US".to_string())),
            Predicate::Eq("age_range".to_string(), PredicateValue::String("18-24".to_string())),
            Predicate::Or(vec![
                Predicate::Contains("explicit_interests".to_string(), "jazz".to_string()),
                Predicate::Contains("inferred_interests".to_string(), "jazz".to_string()),
            ]),
        ]),
        refresh: RefreshPolicy::Hourly,
    })?;

    db.define_cohort(CohortDef {
        name: "power_users".to_string(),
        predicate: Predicate::Eq(
            "engagement_level".to_string(),
            PredicateValue::String("power_user".to_string()),
        ),
        refresh: RefreshPolicy::Hourly,
    })?;

    db.define_cohort(CohortDef {
        name: "new_users".to_string(),
        predicate: Predicate::And(vec![
            Predicate::Eq(
                "engagement_level".to_string(),
                PredicateValue::String("new".to_string()),
            ),
            Predicate::Lt("platform_tenure_days".to_string(), 30.0),
        ]),
        refresh: RefreshPolicy::Hourly,
    })?;

    Ok(())
}
```
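To make the `half_life` values in the signal definitions concrete, the following is a minimal sketch of exponential decay using the rate constant `lambda = ln(2) / half_life_seconds` from the glossary. The function names `decay_lambda` and `decayed_value` are illustrative assumptions, not part of the tidalDB API, and the sketch assumes a single contribution decayed as `value * exp(-lambda * elapsed)`.

```rust
use std::time::Duration;

/// Decay rate constant for a half-life: lambda = ln(2) / half_life_seconds.
/// (Hypothetical helper name, used only for this illustration.)
fn decay_lambda(half_life: Duration) -> f64 {
    std::f64::consts::LN_2 / half_life.as_secs_f64()
}

/// Value remaining after `elapsed`, assuming plain exponential decay:
/// value * exp(-lambda * elapsed_seconds).
fn decayed_value(value: f64, half_life: Duration, elapsed: Duration) -> f64 {
    value * (-decay_lambda(half_life) * elapsed.as_secs_f64()).exp()
}
```

Under this model, a `like` with a 7-day half-life contributes half of its original weight after one week and a quarter after two weeks.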
**What this schema enables:**
After defining this schema, the application can execute all of these queries without any additional configuration:
```rust
// NOTE: `..Default::default()` assumes `Retrieve` and `Search` implement `Default`.

// Personalized For You feed
db.retrieve(Retrieve {
    profile: "for_you",
    for_user: Some("user_123"),
    ..Default::default()
})?;

// Global trending
db.retrieve(Retrieve { profile: "trending", ..Default::default() })?;

// Trending in jazz category
db.retrieve(Retrieve {
    profile: "trending",
    filters: vec![Filter::eq("category", "jazz")],
    ..Default::default()
})?;

// Trending among US users aged 18-24 who like jazz
db.retrieve(Retrieve {
    profile: "trending",
    for_cohort: Some("us_young_jazz"),
    ..Default::default()
})?;

// Following feed (chronological)
db.retrieve(Retrieve {
    profile: "following",
    for_user: Some("user_123"),
    ..Default::default()
})?;

// Search with hybrid text + vector
db.search(Search {
    query: "jazz piano tutorial",
    vector: Some(&query_embedding),
    profile: "search",
    for_user: Some("user_123"),
    ..Default::default()
})?;

// Hidden gems in the last 30 days
db.retrieve(Retrieve {
    profile: "hidden_gems",
    filters: vec![Filter::created_within(Duration::from_secs(30 * 86400))],
    ..Default::default()
})?;
```
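The `search` profile declares `Fusion::Rrf { k: 60 }` with a text weight of 0.6 and a vector weight of 0.4. This section does not specify how tidalDB combines the weights with the ranks, so the sketch below assumes the common weighted form of reciprocal rank fusion, `score(d) = sum_i weight_i / (k + rank_i(d))` with 1-based ranks. The function `rrf_fuse` and its signature are hypothetical.

```rust
use std::collections::HashMap;

/// Weighted reciprocal rank fusion over two ranked lists of item ids.
/// Each list is ordered best-first; ranks are 1-based.
/// (Illustrative only; not the tidalDB implementation.)
fn rrf_fuse(
    text_ranked: &[&str],
    vector_ranked: &[&str],
    text_weight: f64,
    vector_weight: f64,
    k: f64,
) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, list) in [(text_weight, text_ranked), (vector_weight, vector_ranked)] {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64;
            // An item missing from a list simply receives no contribution from it.
            *scores.entry((*id).to_string()).or_insert(0.0) += weight / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id for deterministic output.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With `k = 60`, an item ranked second by text and first by vector search outranks an item that appears first in the text list only, because it collects contributions from both lists.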
---
## 12. Invariants and Correctness Guarantees
These invariants must hold at all times. They are encoded as property tests, assertions, and crash recovery tests.
### Schema Integrity Invariants
**INV-SCH-1: No dangling references.** Every signal, profile, cohort, and relationship definition references only objects that exist at the time of definition. Formally: for every reference `R` in a schema object `O`, the referenced object exists in the schema when `O` is defined. No lazy or deferred reference resolution.
**INV-SCH-2: No orphaned dependents.** A schema object referenced by another schema object cannot be removed unless the referencing object is removed first. The migration API enforces this via the `blocked_by` field in `MigrationPlan`.
**INV-SCH-3: Signal immutability.** Once a signal definition is committed, its `name`, `target`, `decay`, `windows`, and `velocity` fields cannot be changed. Any attempt returns `SchemaError::SignalImmutable`.
**INV-SCH-4: Profile version monotonicity.** For a given profile name, version numbers are strictly increasing. If versions 1, 2, 3 exist, the next must be 4 or greater.
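As a rough illustration of INV-SCH-4, the check can be reduced to comparing a proposed version against the highest version already stored for that profile name. The types `ProfileVersions` and `VersionError` below are hypothetical stand-ins, not the actual tidalDB types; gaps in the sequence are allowed, as the invariant states.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum VersionError {
    /// The proposed version is not greater than the highest existing one.
    NotIncreasing { latest: u32, proposed: u32 },
}

/// Tracks the highest committed version per profile name.
/// (Hypothetical stand-in for the schema catalog.)
#[derive(Default)]
struct ProfileVersions {
    latest: BTreeMap<String, u32>,
}

impl ProfileVersions {
    /// Accept `version` only if it is strictly greater than every existing
    /// version of `name`; on rejection the stored state is left unchanged.
    fn define(&mut self, name: &str, version: u32) -> Result<(), VersionError> {
        if let Some(&latest) = self.latest.get(name) {
            if version <= latest {
                return Err(VersionError::NotIncreasing { latest, proposed: version });
            }
        }
        self.latest.insert(name.to_string(), version);
        Ok(())
    }
}
```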
**INV-SCH-5: Schema cache consistency.** The in-memory schema cache is always consistent with the B-tree storage. Formally: `cache.get(key) == btree.get(key)` for all `SCHEMA:*` keys, at all times after database open completes.
**INV-SCH-6: WAL recoverability.** After crash recovery, the schema state is identical to the last committed schema state before the crash. Recovery replays all `SchemaChange` WAL records in order, and the resulting schema must match that committed state.
**INV-SCH-7: Computed field write rejection.** Any attempt to write a `DbComputed` or `DbManaged` field via the write API returns `SchemaError::ComputedFieldWrite`. The database never silently ignores a computed field write.
**INV-SCH-8: Validation completeness.** Every validation rule in Section 5 is checked for every definition. A definition that passes all rules is guaranteed to produce a consistent schema state. A definition that fails any rule is rejected without side effects (no partial writes).
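One way to satisfy INV-SCH-1 and INV-SCH-8 together is to split every `define_*` call into a validation phase that only reads the schema and a commit phase that runs only after all checks pass. The sketch below shows that shape on a toy in-memory schema. `ToySchema`, `DefineError`, and the `DuplicateSignal` variant are assumptions made for this example; only the name `UndefinedTargetEntity` mirrors the error used in the property tests.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq)]
enum DefineError {
    UndefinedTargetEntity(String),
    DuplicateSignal(String),
}

/// Toy in-memory schema (hypothetical; stands in for the real catalog).
#[derive(Default)]
struct ToySchema {
    entities: BTreeSet<String>,
    signals: BTreeMap<String, String>, // signal name -> target entity
    version: u64,                      // bumped on every successful change
}

impl ToySchema {
    fn define_entity(&mut self, kind: &str) {
        if self.entities.insert(kind.to_string()) {
            self.version += 1;
        }
    }

    fn define_signal(&mut self, name: &str, target: &str) -> Result<(), DefineError> {
        // Phase 1: validate against the current state; nothing is mutated here.
        if !self.entities.contains(target) {
            return Err(DefineError::UndefinedTargetEntity(target.to_string()));
        }
        if self.signals.contains_key(name) {
            return Err(DefineError::DuplicateSignal(name.to_string()));
        }
        // Phase 2: commit. Reached only when every check has passed.
        self.signals.insert(name.to_string(), target.to_string());
        self.version += 1;
        Ok(())
    }
}
```

Because every early return happens before the first mutation, a rejected definition leaves both the stored objects and the schema version untouched, which is the property P1 checks.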
### Property Tests
```rust
// P1: Schema operations are atomic -- a failed define_* has no side effects.
proptest! {
    #[test]
    fn failed_define_no_side_effects(
        def in arb_invalid_signal_def(),
    ) {
        let db = TidalDB::open(test_config())?;
        let version_before = db.schema_version();
        let _ = db.define_signal(def); // expected to fail
        let version_after = db.schema_version();
        prop_assert_eq!(version_before, version_after);
    }
}

// P2: Profile version ordering is maintained.
proptest! {
    #[test]
    fn profile_versions_strictly_increasing(
        versions in prop::collection::vec(1u32..100, 1..20),
    ) {
        let db = TidalDB::open(test_config())?;
        setup_base_schema(&db)?;
        let mut sorted = versions.clone();
        sorted.sort();
        sorted.dedup();
        for &v in &sorted {
            let result = db.define_profile(make_profile("test", v));
            prop_assert!(result.is_ok());
        }
        // Verify versions are stored in order
        let summary = db.list_profiles();
        let stored_versions: Vec<u32> = summary.iter()
            .find(|p| p.name == "test")
            .expect("profile `test` was defined above")
            .versions.iter()
            .map(|v| v.version)
            .collect();
        prop_assert_eq!(stored_versions, sorted);
    }
}

// P3: Schema survives crash at any point during define_*.
proptest! {
    #[test]
    fn schema_crash_recovery(
        defs in arb_schema_definition_sequence(1..50),
        crash_point in 0usize..50,
    ) {
        let (wal, expected_schema) = execute_defs_with_crash(&defs, crash_point);
        let recovered_schema = replay_schema_from_wal(wal);
        prop_assert_eq!(expected_schema, recovered_schema);
    }
}

// P4: Validation rejects all invalid states.
proptest! {
    #[test]
    fn validation_rejects_invalid_references(
        signal_name in "[a-z]{1,10}",
    ) {
        let db = TidalDB::open(test_config())?;
        // No entity types defined -- signal should fail validation
        let result = db.define_signal(SignalDef {
            name: signal_name,
            target: EntityKind::Item,
            decay: Decay::Permanent,
            windows: vec![],
            velocity: false,
            durability: None,
        });
        prop_assert!(matches!(result, Err(SchemaError::UndefinedTargetEntity { .. })));
    }
}

// P5: Migration blockers are complete -- no migration succeeds
// that would leave a dangling reference.
proptest! {
    #[test]
    fn migration_blockers_complete(
        schema in arb_complete_schema(),
        removal in arb_removal_from_schema(),
    ) {
        // `apply_schema` is an assumed test helper that installs the
        // generated schema into a fresh database.
        let db = TidalDB::open(test_config())?;
        apply_schema(&db, &schema)?;
        let plan = db.plan_migration(removal.clone())?;
        if plan.blocked_by.is_empty() {
            // Migration should succeed without creating dangling refs
            db.apply_migration(plan)?;
            assert_no_dangling_references(&db);
        } else {
            // Migration should be blocked
            // Verify each blocker is a real dependency
            for blocker in &plan.blocked_by {
                prop_assert!(schema_references(&db, &blocker.object, &removal));
            }
        }
    }
}
```
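The blocker logic that P5 exercises, and that INV-SCH-2 requires, can be pictured as a reverse lookup over a reference graph: a removal is allowed only when no remaining object references the target. The sketch below is a simplified model under that assumption. `SchemaGraph` and its methods are hypothetical and do not reflect the real `MigrationPlan` structure beyond the idea of a `blocked_by` list.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy dependency graph: each schema object name maps to the names it references.
/// (Hypothetical stand-in for the real schema catalog.)
struct SchemaGraph {
    references: BTreeMap<String, BTreeSet<String>>,
}

impl SchemaGraph {
    fn new() -> Self {
        Self { references: BTreeMap::new() }
    }

    /// Register `object` together with the objects it references.
    fn define(&mut self, object: &str, refs: &[&str]) {
        let refs = refs.iter().map(|r| r.to_string()).collect();
        self.references.insert(object.to_string(), refs);
    }

    /// Objects that still reference `target`; removal is allowed only if this is empty.
    fn blocked_by(&self, target: &str) -> Vec<String> {
        self.references
            .iter()
            .filter(|(name, refs)| name.as_str() != target && refs.contains(target))
            .map(|(name, _)| name.clone())
            .collect()
    }

    /// Remove `target`, or return the blockers without changing anything.
    fn remove(&mut self, target: &str) -> Result<(), Vec<String>> {
        let blockers = self.blocked_by(target);
        if !blockers.is_empty() {
            return Err(blockers);
        }
        self.references.remove(target);
        Ok(())
    }
}
```

In this model, removing the `skip` signal is rejected while the `for_you` profile still references it, and succeeds once that profile has been removed first.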
---
## Appendix A: Glossary
| Term | Definition |
|------|------------|
| **Schema** | The complete set of entity, signal, profile, cohort, and relationship definitions that describe the structure and behavior of a tidalDB instance. |
| **Entity Definition** | Declaration of an entity kind's metadata fields and embedding slots. |
| **Signal Definition** | Immutable declaration of a signal type's decay, windowing, and velocity behavior. |
| **Ranking Profile** | Versioned, named scoring function combining candidate generation, boosts, gates, penalties, excludes, and diversity constraints. |
| **Cohort** | A named user segment defined by a predicate over user entity fields. |
| **Profile Version** | A specific numbered iteration of a ranking profile. Multiple versions can coexist. |
| **Profile Lifecycle** | The four-state progression: Draft -> Active -> Deprecated -> Archived. |
| **Additive Change** | A schema modification that does not invalidate existing data (add field, add signal, new profile version). Always safe. |
| **Breaking Change** | A schema modification that would invalidate existing data or references (remove field, change type). Requires the migration API. |
| **Migration Plan** | The result of analyzing a proposed breaking change: affected objects, blockers, and estimated cost. |
| **Schema Version** | A monotonically increasing counter incremented on every schema change. Used for cache invalidation. |
| **Lambda** | The precomputed decay rate constant: `ln(2) / half_life_seconds`. Stored alongside signal definitions. |
| **Exploration Budget** | The fraction of query results reserved for cold-start items. Declared per ranking profile. |
| **Population Prior** | Database-maintained default values (preference centroid, signal baselines) used for cold-start entities. |
## Appendix B: References
1. thoughts.md -- Stage 3 insight: "Schema encodes behavior, not just shape."
2. VISION.md -- Design principles: temporal decay as a type, ranking profiles as data.
3. API.md -- Schema definition API surface and examples.
4. 02-entity-model.md -- Entity type definitions, field types, writability model.
5. 03-signal-system.md -- Signal type declarations, decay computation, windowed aggregation.
6. 04-relationships.md -- Relationship edge types, weight update mechanics.
7. CODING_GUIDELINES.md -- Error handling (`Result<T, E>` everywhere), trait abstraction, module boundaries.
8. Ousterhout, J. "A Philosophy of Software Design." -- Deep modules, small interfaces.