# 02 -- Entity Model Specification

The entity model defines the three core domain objects in tidalDB: **Items** (content), **Users** (consumers), and **Creators** (producers). Every entity has metadata fields, an embedding slot, and an attached signal ledger. The model is designed to support cohort-based targeting, personalized ranking, and the full query surface described in VISION.md and USE_CASES.md.

This specification covers entity schemas, field types, lifecycle semantics, embedding management, and the cohort-ready attribute design that enables queries like "what is trending among US users aged 18-24 who are interested in jazz."

---

## Table of Contents

- [Design Principles](#design-principles)
- [Field Type Reference](#field-type-reference)
- [Entity Relationships Diagram](#entity-relationships-diagram)
- [Item Entity](#item-entity)
- [User Entity](#user-entity)
- [Creator Entity](#creator-entity)
- [Field Writability Model](#field-writability-model)
- [Entity Lifecycle](#entity-lifecycle)
- [Embedding Management](#embedding-management)
- [Cohort-Ready Design](#cohort-ready-design)
- [Signal Ledger Attachment](#signal-ledger-attachment)
- [Storage Representation](#storage-representation)
- [Design Rationale](#design-rationale)

---

## Design Principles

**Entities are nodes, not rows.** An entity is not a collection of columns in a table. It is a node in a graph with metadata, embeddings, a signal ledger, and relationship edges. The database reasons about entities holistically -- not as field bags.

**Some fields are yours; some are ours.** The entity model distinguishes between application-set fields (written by the caller) and database-computed fields (maintained by tidalDB). The application sets demographic attributes on a user. The database computes behavioral segments from signal patterns. Neither overwrites the other.

**Rich attributes enable cohort queries.** A user entity with two fields (language, region) cannot answer "what is trending among power users in Japan who prefer short-form video." The user model must carry enough dimensionality to resolve cohort membership efficiently at query time.

**Every field earns its index.** Fields exist because a query needs them. Every field in this spec can be traced to a filter, sort mode, ranking profile signal, or cohort predicate in USE_CASES.md.

---

## Field Type Reference

Every metadata field on an entity has a declared type that determines its indexing behavior, storage format, and query semantics.

| Type | Storage | Indexed As | Query Operations | Example |
|------|---------|------------|------------------|---------|
| `text` | UTF-8 string | Inverted index (BM25, tokenized) | Full-text search, phrase match, field-scoped search | `title`, `description` |
| `keyword` | UTF-8 string | Term dictionary, exact match | Equality, IN-list, faceting | `category`, `locale` |
| `keywords` | `Vec<String>` | Term dictionary per value | Equality per value, IN-list, faceting | `tags`, `explicit_interests` |
| `i64` | 64-bit signed integer | Sorted numeric index | Range, equality, min/max, sort | `birth_year`, `follower_count` |
| `f64` | 64-bit float | Sorted numeric index | Range, equality, min/max, sort | `avg_completion_rate` |
| `bool` | 1-bit boolean | Boolean index | Equality | `verified`, `has_subtitles` |
| `timestamp` | UTC nanoseconds (`i64`) | Sorted numeric index | Range, presets (`today`, `this_week`), since | `created_at`, `first_signal_at` |
| `duration` | Seconds (`f64`) | Sorted numeric index | Range, presets (`short`, `medium`, `long`), sort | `duration` |
| `embedding` | `Vec<f32>` or quantized | HNSW (USearch) | ANN search, cosine similarity | `content_vector`, `preference_vector` |
| `computed` | Varies (keyword, keywords, i64, f64) | Same as underlying type | Same as underlying type | `engagement_level`, `inferred_interests` |

**`computed` fields** are a special category. They have an underlying storage type (keyword, keywords, i64, f64) and are indexed identically to that type. The distinction is write semantics: computed fields are not directly writable by the application. They are maintained by the database based on signal patterns, relationship state, or periodic background computation. Attempting to set a computed field via `write_user()` or `update_user()` returns a `SchemaError`.

---

## Entity Relationships Diagram

```
                    ┌──────────────┐
                    │     User     │
                    │              │
                    │  metadata    │
                    │  embedding   │
                    │  signals     │
                    └──────┬───────┘
                           │
             ┌─────────────┼─────────────┐
             │             │             │
       follows/blocks  viewed/liked  interacted
       (Relationship)    (Signal)   (Relationship)
             │             │             │
             ▼             ▼             ▼
     ┌──────────────┐          ┌──────────────┐
     │   Creator    │◄─────────│     Item     │
     │              │ created  │              │
     │  metadata    │          │  metadata    │
     │  embedding   │          │  embedding   │
     │  signals     │          │  signals     │
     └──────────────┘          └──────────────┘

Relationship edges:
  User ──follows──▶ Creator        (permanent, weight)
  User ──blocks───▶ Creator        (permanent, hard filter)
  User ──viewed───▶ Item           (signal-derived)
  User ──liked────▶ Item           (signal-derived)
  User ──saved────▶ Item           (explicit)
  User ──hid──────▶ Item           (permanent negative)
  Item ──created_by──▶ Creator     (structural, immutable)
  Creator ──similar_to──▶ Creator  (computed, embedding distance)
  Item ──similar_to──▶ Item        (computed, embedding distance)
```

Every entity participates in two kinds of connections:

1. **Relationships** -- explicit, weighted, directional edges managed via `write_relationship()`. Used for follows, blocks, saves, collections.
2. **Signal-derived state** -- implicit edges created automatically when signals are written. A `view` signal on an item by a user creates a user-item "seen" edge. A `like` creates a user-item "liked" edge. These are queryable via `Filter::unseen()`, `Filter::user_state("liked")`, etc.

---

## Item Entity

Items are the content that gets ranked.
Videos, articles, images, audio tracks, podcasts, live streams, galleries -- anything a user consumes and engages with. Every item belongs to exactly one creator (the `creator_id` link). Items carry metadata for filtering and display, one or more embedding slots for semantic retrieval, and a signal ledger that accumulates engagement data.

### Schema Definition

```rust
db.define_entity(EntityDef {
    kind: EntityKind::Item,
    metadata_fields: vec![
        // --- Text fields: full-text indexed, searchable via BM25 ---
        Field::text("title"),
        Field::text("description"),

        // --- Keyword fields: exact match, filterable, facetable ---
        Field::keyword("category"),            // primary category: "music", "gaming", "cooking"
        Field::keywords("tags"),               // multi-value: ["jazz", "piano", "tutorial"]
        Field::keyword("format"),              // video, short, live, vod, podcast, article, image, gallery, audio
        Field::keyword("language"),            // ISO 639-1: "en", "ja", "es"
        Field::keywords("subtitle_languages"), // available subtitle languages
        Field::keywords("dubbed_languages"),   // available dub languages
        Field::keyword("content_rating"),      // G, PG, PG-13, R, NC-17
        Field::keyword("status"),              // published, live, scheduled, archived, draft
        Field::keyword("availability"),        // free, premium, subscriber_only, rental
        Field::keyword("resolution"),          // SD, HD, FHD, 4K, 8K
        Field::keyword("audio_quality"),       // standard, high, lossless, spatial
        Field::keyword("content_region"),      // geographic origin: "US", "JP"
        Field::keyword("post_type"),           // text, link, image, video, poll (forum-style)
        Field::keywords("hashtags"),           // #jazz, #tutorial
        Field::keyword("flair"),               // community-specific label

        // --- Numeric fields: range-filterable, sortable ---
        Field::i64("award_count"),             // community awards/gilding count

        // --- Boolean fields: filterable ---
        Field::bool("has_subtitles"),
        Field::bool("has_audio_description"),
        Field::bool("has_sign_language"),
        Field::bool("downloadable"),
        Field::bool("hdr"),
        Field::bool("is_original"),            // not a crosspost/repost
        Field::bool("safe_search"),            // passes safe-search filter

        // --- Duration: range-filterable, sortable, preset-filterable ---
        Field::duration("duration"),

        // --- Timestamps: range-filterable, sortable ---
        Field::timestamp("created_at"),
        Field::timestamp("updated_at"),
        Field::timestamp("scheduled_at"),      // for premieres / scheduled live
        Field::timestamp("available_until"),   // for "leaving soon" filter
    ],

    // Primary content embedding -- externally computed, DB-indexed.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "content",               // text/semantic content vector
                dimensions: 1536,
                source: EmbeddingSource::External,
            },
        ],
    },
})?;
```

### Field Summary Table

| Field | Type | Writability | Indexed | Used By |
|-------|------|-------------|---------|---------|
| `title` | text | app-set | BM25 inverted | UC-02 search, UC-06 alphabetical sort |
| `description` | text | app-set | BM25 inverted | UC-02 search |
| `category` | keyword | app-set | term dictionary | UC-03 scoped trending, UC-06 browse, cohort |
| `tags` | keywords | app-set | term dictionary | UC-02 search, UC-06 filter |
| `format` | keyword | app-set | term dictionary | UC-01 format filter, UC-06 browse, diversity |
| `language` | keyword | app-set | term dictionary | UC-02 language filter |
| `subtitle_languages` | keywords | app-set | term dictionary | UC-02 accessibility filter |
| `dubbed_languages` | keywords | app-set | term dictionary | UC-02 accessibility filter |
| `content_rating` | keyword | app-set | term dictionary | UC-02 maturity filter |
| `status` | keyword | app-set | term dictionary | UC-12 live filter |
| `availability` | keyword | app-set | term dictionary | UC-02 availability filter |
| `resolution` | keyword | app-set | term dictionary | UC-02 quality filter |
| `audio_quality` | keyword | app-set | term dictionary | UC-02 quality filter |
| `content_region` | keyword | app-set | term dictionary | UC-02 geographic filter, cohort |
| `post_type` | keyword | app-set | term dictionary | UC-14 forum filtering |
| `hashtags` | keywords | app-set | term dictionary | UC-02 hashtag search |
| `flair` | keyword | app-set | term dictionary | UC-14 community filter |
| `award_count` | i64 | app-set | sorted numeric | UC-14 gilded filter |
| `has_subtitles` | bool | app-set | boolean | UC-02 accessibility filter |
| `has_audio_description` | bool | app-set | boolean | UC-02 accessibility filter |
| `has_sign_language` | bool | app-set | boolean | UC-02 accessibility filter |
| `downloadable` | bool | app-set | boolean | UC-09 download filter |
| `hdr` | bool | app-set | boolean | UC-02 quality filter |
| `is_original` | bool | app-set | boolean | UC-14 original-only filter |
| `safe_search` | bool | app-set | boolean | UC-02 safe search toggle |
| `duration` | duration | app-set | sorted numeric | UC-02 duration filter, UC-06 shortest/longest sort |
| `created_at` | timestamp | app-set | sorted numeric | UC-04 chronological, UC-06 date filter |
| `updated_at` | timestamp | app-set | sorted numeric | change tracking |
| `scheduled_at` | timestamp | app-set | sorted numeric | UC-12 scheduled content |
| `available_until` | timestamp | app-set | sorted numeric | UC-02 "leaving soon" filter |
| `content` (embedding) | embedding | app-set | HNSW (USearch) | UC-01 ANN retrieval, UC-02 semantic search, UC-05 related |

### Additional Embedding Slots

Applications may define additional embedding slots for multi-modal retrieval:

```rust
EmbeddingSlot {
    name: "visual",                // image/thumbnail embedding
    dimensions: 512,
    source: EmbeddingSource::External,
},
EmbeddingSlot {
    name: "audio",                 // audio fingerprint embedding
    dimensions: 256,
    source: EmbeddingSource::External,
},
```

Each slot gets its own HNSW index. Queries specify which embedding to search against. This supports UC-11 (visual/semantic search) without overloading a single vector space.

---

## User Entity

Users are the consumers of content.
They generate signals (views, likes, skips, hides), accumulate preference profiles, and form relationships with creators and items.

The user entity carries two categories of fields:

1. **Application-set fields** -- demographic and preference data the application writes explicitly. These are known at registration time or provided by the user.
2. **Database-computed fields** -- behavioral segments, interest profiles, and engagement patterns derived from signal history. The database maintains these automatically. The application reads them (for display, analytics, cohort targeting) but never writes them directly.

This distinction is the foundation of cohort targeting. An application sets `locale: "en-US"` and `birth_year: 2001`. The database computes `engagement_level: "power_user"` and `inferred_interests: ["jazz", "piano", "music_theory"]`. A cohort query combines both: `locale:en-US AND age_range:18-24 AND engagement_level:power_user AND interest:jazz`.

### Schema Definition

```rust
db.define_entity(EntityDef {
    kind: EntityKind::User,
    metadata_fields: vec![
        // ================================================================
        // APPLICATION-SET: Demographic Attributes
        // Written by the application at registration or profile update.
        // ================================================================
        Field::keyword("locale"),              // full locale: "en-US", "ja-JP", "es-MX"
        Field::keyword("language"),            // preferred content language: "en", "ja"
        Field::keyword("region"),              // geographic region: "US", "JP", "DE"
        Field::keyword("timezone"),            // IANA timezone: "America/New_York", "Asia/Tokyo"
        Field::i64("birth_year"),              // for age-based cohort bucketing (optional)
        Field::keyword("age_range"),           // explicit bucket: "13-17", "18-24", "25-34", "35-44", "45-54", "55+"
        Field::keyword("gender"),              // optional: "male", "female", "non-binary", "undisclosed"
        Field::keyword("account_type"),        // free, premium, creator, admin
        Field::keywords("explicit_interests"), // stated interests at signup: ["jazz", "cooking", "rust"]
        Field::keywords("preferred_formats"),  // stated format preference: ["video", "short"]

        // ================================================================
        // DATABASE-COMPUTED: Interest Profile
        // Derived from engagement patterns. Updated by background computation.
        // ================================================================
        Field::computed("inferred_interests", FieldType::Keywords),
            // keywords derived from engagement history.
            // top N topics by weighted engagement volume.
            // e.g., ["jazz", "piano", "music_theory", "cooking", "rust"]
            // updated: every signal write triggers incremental update;
            //          full recomputation on background schedule.

        Field::computed("primary_categories", FieldType::Keywords),
            // top categories by engagement volume (coarser than interests).
            // e.g., ["music", "programming", "food"]
            // updated: background computation, hourly.

        // ================================================================
        // DATABASE-COMPUTED: Behavioral Segments
        // Derived from signal frequency, patterns, and recency.
        // ================================================================
        Field::computed("engagement_level", FieldType::Keyword),
            // power_user: > 50 signals/day, 7-day streak
            // regular:    10-50 signals/day, active 4+ days/week
            // casual:     1-10 signals/day, active 1-3 days/week
            // dormant:    < 1 signal/day for 7+ days
            // new:        < 7 days since first signal
            // updated: background computation, every 6 hours.

        Field::computed("content_format_preference", FieldType::Keyword),
            // short: > 60% of completions are items with duration < 4min
            // long:  > 60% of completions are items with duration > 20min
            // mixed: neither threshold met
            // updated: background computation, daily.

        Field::computed("session_pattern", FieldType::Keyword),
            // binge:     avg session > 30min, sequential consumption
            // browsing:  avg session 5-30min, diverse consumption
            // searching: > 40% of sessions start with search
            // updated: background computation, daily.

        Field::computed("platform_tenure_days", FieldType::I64),
            // days since first signal was written for this user.
            // updated: on every signal write (trivial computation).

        Field::computed("daily_active_hours", FieldType::F64),
            // average number of distinct hours with signal activity per day.
            // computed over trailing 7-day window.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Creator Relationship Profile
        // Derived from relationship graph and signal patterns.
        // ================================================================
        Field::computed("followed_creator_count", FieldType::I64),
            // count of active "follows" relationships.
            // updated: on relationship write (increment/decrement).

        Field::computed("avg_creator_interaction_depth", FieldType::F64),
            // average interaction_weight across all followed creators.
            // 0.0 = passive scroller, 1.0 = deeply engaged with every follow.
            // updated: background computation, daily.
    ],

    // User preference vector -- managed by the database.
    // Updated automatically on every signal write: shifted toward
    // (positive signal) or away from (negative signal) the item's embedding.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "preference",
                dimensions: 1536,
                source: EmbeddingSource::DatabaseManaged,
            },
        ],
    },
})?;
```

### Field Summary Table

| Field | Type | Writability | Indexed | Used By |
|-------|------|-------------|---------|---------|
| `locale` | keyword | app-set | term dictionary | cohort targeting, content language matching |
| `language` | keyword | app-set | term dictionary | content language filter |
| `region` | keyword | app-set | term dictionary | geographic cohort, regional trending |
| `timezone` | keyword | app-set | term dictionary | time-aware ranking, notification timing |
| `birth_year` | i64 | app-set | sorted numeric | age-based cohort bucketing |
| `age_range` | keyword | app-set | term dictionary | age-based cohort targeting |
| `gender` | keyword | app-set | term dictionary | demographic cohort targeting |
| `account_type` | keyword | app-set | term dictionary | feature gating, cohort |
| `explicit_interests` | keywords | app-set | term dictionary | cold-start preference seeding, cohort |
| `preferred_formats` | keywords | app-set | term dictionary | format ranking boost, cohort |
| `inferred_interests` | computed (keywords) | db-computed | term dictionary | interest-based cohort, profile display |
| `primary_categories` | computed (keywords) | db-computed | term dictionary | category-based cohort |
| `engagement_level` | computed (keyword) | db-computed | term dictionary | behavioral cohort |
| `content_format_preference` | computed (keyword) | db-computed | term dictionary | format-based cohort |
| `session_pattern` | computed (keyword) | db-computed | term dictionary | behavioral cohort |
| `platform_tenure_days` | computed (i64) | db-computed | sorted numeric | tenure-based cohort |
| `daily_active_hours` | computed (f64) | db-computed | sorted numeric | engagement depth cohort |
| `followed_creator_count` | computed (i64) | db-computed | sorted numeric | social graph cohort |
| `avg_creator_interaction_depth` | computed (f64) | db-computed | sorted numeric | engagement depth cohort |
| `preference` (embedding) | embedding | db-managed | HNSW (USearch) | UC-01 For You ANN retrieval |

### Cohort Query Examples

With the expanded user model, tidalDB can resolve cohort predicates at query time:

```
-- Trending among US users aged 18-24 who like jazz
RETRIEVE items
USING PROFILE trending
FOR COHORT region:US AND age_range:18-24 AND (explicit_interests:jazz OR inferred_interests:jazz)
LIMIT 25

-- Popular among power users who prefer long-form content
RETRIEVE items
USING PROFILE top_week
FOR COHORT engagement_level:power_user AND content_format_preference:long
LIMIT 25

-- Rising content among new users (cold-start cohort)
RETRIEVE items
USING PROFILE rising
FOR COHORT engagement_level:new AND platform_tenure_days<30
LIMIT 25
```

The `FOR COHORT` clause resolves to a user set, aggregates their signal patterns over the matching items, and ranks accordingly. This is the mechanism that replaces the "feature store" in the traditional stack.

---

## Creator Entity

Creators are the entities that produce content. Every item belongs to exactly one creator. Creators have their own metadata, embeddings, and signal ledgers that enable creator discovery (UC-10), creator profile pages (UC-08), and creator-level ranking signals.

### Schema Definition

```rust
db.define_entity(EntityDef {
    kind: EntityKind::Creator,
    metadata_fields: vec![
        // ================================================================
        // APPLICATION-SET: Profile Information
        // ================================================================
        Field::text("name"),                   // display name, full-text searchable
        Field::keyword("handle"),              // unique handle, exact match searchable
        Field::keyword("language"),            // primary content language
        Field::keyword("region"),              // geographic region
        Field::keywords("categories"),         // content categories: ["music", "education"]
        Field::keywords("tags"),               // more specific: ["jazz", "piano", "tutorial"]
        Field::bool("verified"),               // platform verification status
        Field::keyword("account_type"),        // individual, brand, organization, label

        // ================================================================
        // DATABASE-COMPUTED: Audience Metrics
        // ================================================================
        Field::computed("follower_count", FieldType::I64),
            // count of active "follows" relationships pointing to this creator.
            // updated: on relationship write (increment/decrement).

        Field::computed("follower_growth_velocity", FieldType::F64),
            // net new followers per day, 7-day trailing average.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Content Catalog Statistics
        // ================================================================
        Field::computed("total_items", FieldType::I64),
            // count of non-archived items by this creator.
            // updated: on item write/archive.

        Field::computed("category_distribution", FieldType::Keywords),
            // top categories by item count.
            // e.g., ["jazz:45", "blues:20", "tutorial:15"]
            // stored as keyword values for faceting, with counts encoded.
            // updated: background computation, daily.

        Field::computed("avg_item_quality", FieldType::F64),
            // average completion_rate across all items with > 100 views.
            // proxy for content quality independent of reach.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Engagement Metrics
        // ================================================================
        Field::computed("avg_engagement_rate", FieldType::F64),
            // average (likes + comments + shares) / views across recent catalog.
            // trailing 30-day window over items created in that window.
            // updated: background computation, daily.

        Field::computed("posting_frequency", FieldType::F64),
            // average items published per week, trailing 30-day window.
            // updated: background computation, daily.

        Field::computed("last_posted_at", FieldType::Timestamp),
            // timestamp of most recent item creation.
            // updated: on item write.
    ],

    // Creator embedding -- aggregated from their item catalog.
    // Represents the semantic "center" of what this creator produces.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "catalog",
                dimensions: 1536,
                source: EmbeddingSource::DatabaseManaged,
            },
        ],
    },
})?;
```

### Field Summary Table

| Field | Type | Writability | Indexed | Used By |
|-------|------|-------------|---------|---------|
| `name` | text | app-set | BM25 inverted | UC-10 people search |
| `handle` | keyword | app-set | term dictionary | UC-02 `creator:handle` search |
| `language` | keyword | app-set | term dictionary | UC-10 language filter |
| `region` | keyword | app-set | term dictionary | UC-10 geographic filter |
| `categories` | keywords | app-set | term dictionary | UC-10 topic filter |
| `tags` | keywords | app-set | term dictionary | UC-10 niche discovery |
| `verified` | bool | app-set | boolean | UC-10 verified filter |
| `account_type` | keyword | app-set | term dictionary | UC-10 creator type filter |
| `follower_count` | computed (i64) | db-computed | sorted numeric | UC-10 follower range filter, sort |
| `follower_growth_velocity` | computed (f64) | db-computed | sorted numeric | UC-03 rising creators |
| `total_items` | computed (i64) | db-computed | sorted numeric | UC-08 catalog size |
| `category_distribution` | computed (keywords) | db-computed | term dictionary | UC-08 catalog browsing |
| `avg_item_quality` | computed (f64) | db-computed | sorted numeric | UC-13 hidden gems by creator |
| `avg_engagement_rate` | computed (f64) | db-computed | sorted numeric | UC-10 engagement rate sort |
| `posting_frequency` | computed (f64) | db-computed | sorted numeric | UC-10 activity filter |
| `last_posted_at` | computed (timestamp) | db-computed | sorted numeric | UC-10 recently active filter |
| `catalog` (embedding) | embedding | db-managed | HNSW (USearch) | UC-10 "creators like X" |

### Creator Embedding Computation

The creator's `catalog` embedding is the centroid of their non-archived items' content embeddings, weighted by item quality (completion rate). This is computed by the database on a background schedule. Vectors and weights are drawn from the same set of non-archived items, so the two lists always have equal length:

```
catalog_embedding = weighted_mean(
    vectors: [item.content_embedding for item in creator.items if item.status != "archived"],
    weights: [item.completion_rate_all_time.max(0.1) for item in creator.items if item.status != "archived"]
)
```

When a new item is published by a creator, the catalog embedding is incrementally updated. The incremental step is an unweighted running mean, which is one source of the drift that full recomputation corrects:

```
new_catalog = (old_catalog * old_count + new_item_embedding) / (old_count + 1)
```

Full recomputation occurs on a background schedule (daily) to correct for incremental drift and account for archived items.

---

## Field Writability Model

Every field in the entity model belongs to one of three writability categories. This distinction is enforced at the schema level -- the database rejects writes that violate writability constraints.

| Category | Who Writes | When Updated | Examples |
|----------|-----------|--------------|----------|
| **app-set** | Application via `write_*()` / `update_*()` | On explicit write | `title`, `locale`, `birth_year`, `verified` |
| **db-computed** | Database background computation | On schedule or trigger (see below) | `engagement_level`, `inferred_interests`, `follower_count` |
| **db-managed** | Database signal processing | On every relevant signal write | `preference` embedding, `interaction_weight` |

### Update Triggers for Computed Fields

Computed fields are updated by one of three mechanisms:

| Trigger | Latency | Fields |
|---------|---------|--------|
| **Immediate** (on write) | < 1ms | `follower_count`, `total_items`, `platform_tenure_days`, `last_posted_at` |
| **Incremental** (signal-driven) | < 100ms | `inferred_interests` (top-N update), `preference` embedding (vector shift) |
| **Background** (scheduled) | Minutes to hours | `engagement_level`, `content_format_preference`, `session_pattern`, `daily_active_hours`, `avg_creator_interaction_depth`, `avg_engagement_rate`, `posting_frequency`, `avg_item_quality`, `category_distribution`, `follower_growth_velocity`, `primary_categories`, creator `catalog` embedding |

Background computation runs on a configurable schedule. The default is:

- **Hourly:** `engagement_level`, `primary_categories`, `inferred_interests` (full recomputation)
- **Daily:** `content_format_preference`, `session_pattern`, `daily_active_hours`, `avg_creator_interaction_depth`, `avg_engagement_rate`, `posting_frequency`, `avg_item_quality`, `category_distribution`, `follower_growth_velocity`, creator `catalog` embedding (full recomputation)

Applications can trigger immediate recomputation of any computed field via `db.recompute_field(entity_id, field_name)` for debugging or operational purposes. This is not intended for production hot paths.
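
As an illustration of what a background job for a computed field might do, the sketch below classifies `engagement_level` from the trailing 7 days of per-day signal counts, using the thresholds listed in the User schema comments. It is a minimal sketch, not tidalDB's implementation: the names `EngagementLevel` and `classify_engagement` are hypothetical, and the tie-breaking rules (precedence of `new`, handling of users who meet a volume threshold but not the matching activity threshold) are assumptions the spec does not state.

```rust
/// Behavioral segment stored in the computed `engagement_level` field.
/// (Hypothetical type; the spec only defines the keyword values.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngagementLevel {
    New,
    PowerUser,
    Regular,
    Casual,
    Dormant,
}

impl EngagementLevel {
    /// Keyword value as used in cohort predicates, e.g. `engagement_level:power_user`.
    pub fn as_keyword(self) -> &'static str {
        match self {
            EngagementLevel::New => "new",
            EngagementLevel::PowerUser => "power_user",
            EngagementLevel::Regular => "regular",
            EngagementLevel::Casual => "casual",
            EngagementLevel::Dormant => "dormant",
        }
    }
}

/// Classify a user from `platform_tenure_days` and the signal counts of the
/// trailing 7 days (one entry per day).
///
/// Assumptions not stated in the spec:
/// - `new` takes precedence over every other level;
/// - signals/day is the 7-day average;
/// - a user who misses the activity requirement of a level falls through to
///   the next lower level.
pub fn classify_engagement(tenure_days: u32, daily_signals: &[u32; 7]) -> EngagementLevel {
    if tenure_days < 7 {
        return EngagementLevel::New;
    }

    let total: u64 = daily_signals.iter().map(|&n| u64::from(n)).sum();
    let active_days = daily_signals.iter().filter(|&&n| n > 0).count();
    let avg_per_day = total as f64 / 7.0;

    if avg_per_day < 1.0 {
        EngagementLevel::Dormant
    } else if avg_per_day > 50.0 && active_days == 7 {
        EngagementLevel::PowerUser
    } else if avg_per_day >= 10.0 && active_days >= 4 {
        EngagementLevel::Regular
    } else {
        EngagementLevel::Casual
    }
}
```

A scheduled job could call `classify_engagement` for each user and store `as_keyword()` in the field, which keeps the stored values aligned with the keywords used in `FOR COHORT` predicates.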

### Write API Enforcement

```rust
// This succeeds -- locale is app-set
db.update_user("user_123", UpdateUser {
    metadata: Some(metadata! {
        "locale" => "ja-JP",
        "timezone" => "Asia/Tokyo",
    }),
    ..Default::default()
})?;

// This fails with SchemaError::ComputedFieldWrite
db.update_user("user_123", UpdateUser {
    metadata: Some(metadata! {
        "engagement_level" => "power_user",  // ERROR: computed field
    }),
    ..Default::default()
})?;
```

---

## Entity Lifecycle

Every entity follows the same lifecycle model. The lifecycle defines what state transitions are legal and what each transition means for storage, indexing, and query visibility.

### States

```
          write_*()
(none) ──────────────▶ Active
                         │
               update_*()│ (metadata/embedding changes)
               ◄─────────┘
                         │
               archive() │
                         ▼
                      Archived
                         │
               delete()  │
                         ▼
                      Deleted (hard remove)
```

### State Semantics

| State | Query Visible | Signals Accepted | Signal Ledger | Relationships | Embeddings |
|-------|--------------|------------------|---------------|---------------|------------|
| **Active** | Yes | Yes | Accumulating | Active | Indexed in HNSW |
| **Archived** | No (excluded by default) | No (rejected with error) | Preserved (read-only) | Preserved but inactive | Removed from HNSW |
| **Deleted** | No | No | Destroyed | Destroyed | Destroyed |

### Create

On `write_item()`, `write_user()`, or `write_creator()`:

1. Entity metadata is stored in the entity store.
2. Text fields are indexed in the inverted index (Tantivy).
3. Keyword, numeric, boolean, timestamp, and duration fields are indexed in their respective indexes.
4. Embedding is inserted into the HNSW index (USearch) -- normalized to unit length at insertion.
5. Signal ledger is initialized (all counters at zero, all decay scores at zero, `last_update_ns` set to creation time).
6. For items: linked to creator entity; cold-start exploration budget applied.
7. For users: if no embedding provided, initialized to population-level default preference vector.
8. For creators: catalog embedding initialized to zero vector (will be computed when first item is published).
9. Entity is immediately queryable after commit.

**Idempotency:** Writing an entity with an ID that already exists is an error (`SchemaError::EntityExists`). Use `update_*()` for modifications.

### Update

On `update_item()`, `update_user()`, or `update_creator()`:

1. Only provided fields are modified. Omitted fields retain their current values (partial update).
2. Modified text fields trigger re-indexing in the inverted index.
3. Modified keyword/numeric/boolean fields trigger re-indexing in their respective indexes.
4. If an embedding is provided, the old vector is replaced in the HNSW index. The new vector is normalized at insertion.
5. Signal ledger is not affected by metadata updates.
6. Computed fields cannot be set (returns `SchemaError::ComputedFieldWrite`).

### Archive

On `db.archive(entity_kind, entity_id)`:

1. Entity `status` is set to `"archived"`.
2. Entity is removed from query candidate sets (excluded from RETRIEVE, SEARCH results).
3. Entity embedding is removed from the HNSW index.
4. Entity is removed from the inverted index.
5. Signal ledger is preserved in read-only state. Historical queries and analytics can still access signal data.
6. Relationships involving this entity are preserved but marked inactive. They no longer influence ranking for other entities.
7. The entity can be unarchived via `db.unarchive(entity_kind, entity_id)`, which reverses all of the above.

Archive is the expected path for content removal. Creators unpublish videos. Users deactivate accounts. The data remains for analytics, audit, and potential restoration.

### Delete

On `db.delete(entity_kind, entity_id)`:

1. Entity metadata is destroyed.
2. All indexes are updated to remove the entity.
3. Signal ledger is destroyed.
4. All relationships involving this entity are destroyed.
5. For items: the creator's `total_items` count is decremented and catalog embedding is marked for recomputation.
6. For users: all user-specific signal state (seen items, preference vector, relationship weights) is destroyed.
7. For creators: all items by this creator remain but lose their creator link (orphaned items should be archived or reassigned by the application before deleting a creator).

Delete is a destructive, irreversible operation intended for legal compliance (GDPR right to erasure, DMCA takedowns). Normal content removal should use archive.

### Cold Start State

A newly created entity with no signal history is in cold-start state. The database handles this natively:

- **Items:** Receive an exploration budget (configurable per ranking profile) that injects them into a percentage of query results regardless of signal state. The budget decays as signals accumulate. Default: 10% of For You feed slots for the first 48 hours or until 1000 impressions, whichever comes first.
- **Users:** Start with a population-level default preference vector. If `explicit_interests` are provided at creation, the vector is seeded toward those interest embeddings. After approximately 20 signal events, the preference vector becomes user-specific.
- **Creators:** Start with a zero catalog embedding. After their first item is published, the catalog embedding is set to that item's content embedding. Subsequent items refine it.

Cold start handling is specified in the ranking profile, not in the entity model. The entity model provides the fields and embedding slots that ranking profiles use to detect and handle cold-start conditions.

---

## Embedding Management

Embeddings are dense vector representations stored alongside entities and indexed for approximate nearest neighbor (ANN) retrieval via USearch (HNSW).
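
Embeddings are normalized to unit length when they are inserted (Create, step 4). The sketch below shows one way that step could look, together with the identity that relates squared L2 distance to cosine similarity on unit vectors: `||a - b||^2 = 2 - 2 * cos(a, b)`. The function names `normalize`, `cosine_similarity`, and `squared_l2` are illustrative and are not part of the tidalDB API. How tidalDB treats a zero vector is not specified here, so the sketch simply reports failure in that case.

```rust
/// Scale `v` to unit length in place.
/// Returns `false` and leaves `v` unchanged when the norm is zero or not
/// finite, because such a vector has no direction to preserve.
pub fn normalize(v: &mut [f32]) -> bool {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 || !norm.is_finite() {
        return false;
    }
    for x in v.iter_mut() {
        *x /= norm;
    }
    true
}

/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 when either vector has zero norm (an assumption of this sketch).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Squared Euclidean (L2) distance between two equal-length vectors.
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

Because the squared L2 distance of unit vectors is a decreasing function of their cosine similarity, sorting neighbors by ascending L2 distance yields the same order as sorting by descending cosine similarity.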

### Embedding Sources

| Source | Meaning | Who Writes | When Updated |
|--------|---------|-----------|--------------|
| `External` | Application computes and provides the vector | Application | On `write_*()` or `update_*()` with embedding |
| `DatabaseManaged` | Database computes and maintains the vector | Database | On signal writes (incremental) and background schedule (full) |

### External Embeddings

The application is responsible for computing external embeddings using its own model (OpenAI, Cohere, custom, etc.). tidalDB indexes and retrieves over these vectors but never generates them.

```rust
// Application computes the embedding externally
let content_vector: Vec<f32> = embedding_service.embed(&title_and_description);

db.write_item(WriteItem {
    id: "item_abc",
    creator_id: "creator_xyz",
    metadata: metadata! { /* ... */ },
    embeddings: embeddings! {
        "content" => &content_vector, // 1536-dim, externally computed
    },
})?;
```

**Normalization:** All embeddings are normalized to unit length at insertion time. For unit vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so ranking by L2 distance yields the same ordering as ranking by cosine similarity, and L2 distance is more SIMD-friendly. The application does not need to pre-normalize -- the database handles it. See `docs/research/ann_for_tidaldb.md` for rationale.

**Dimensions:** Configurable per embedding slot in the entity definition. The default is 1536 (matching OpenAI text-embedding-3-small). Changing dimensions after data has been written requires rebuilding the HNSW index for that slot.

### Database-Managed Embeddings

Two embeddings are managed by the database:

**User preference vector** (`User.preference`): Updated incrementally on every signal write. When a user generates a positive signal (like, completion, save) for an item, the preference vector is shifted toward the item's content embedding. When a user generates a negative signal (skip, hide, not-interested), the preference vector is shifted away.
The learning rate and momentum are configurable per signal type in the ranking profile.

```
# Positive signal (like, completion)
preference += learning_rate * (item.content_embedding - preference)

# Negative signal (skip, hide)
preference -= learning_rate * (item.content_embedding - preference) * negative_weight

# Re-normalize to unit length after each update
preference = normalize(preference)
```

Full recomputation from signal history occurs on a daily background schedule to correct for incremental drift.

**Creator catalog vector** (`Creator.catalog`): Weighted centroid of all non-archived item embeddings by this creator. Updated incrementally when items are published or archived. Full recomputation on a daily background schedule.

### Multiple Embedding Slots

An entity type can define multiple embedding slots for multi-modal retrieval:

```rust
embedding: EmbeddingDef {
    slots: vec![
        EmbeddingSlot { name: "content", dimensions: 1536, source: External },
        EmbeddingSlot { name: "visual", dimensions: 512, source: External },
        EmbeddingSlot { name: "audio", dimensions: 256, source: External },
    ],
},
```

Each slot is independently indexed in its own HNSW graph. Queries specify which slot to search:

```rust
// Semantic search over content embeddings (default)
db.search(Search { vector: Some(&query_vec), vector_slot: "content", .. })?;

// Visual similarity search (UC-11)
db.search(Search { vector: Some(&image_vec), vector_slot: "visual", .. })?;
```

If `vector_slot` is omitted, the first defined slot is used as the default.

### Embedding Slot Constraints

- An entity can have at most **4 embedding slots**. This is a pragmatic limit -- each slot consumes memory for the HNSW graph (approximately 300 bytes per node at M=16, per slot).
- Embedding dimensions must be between **2 and 4096** (inclusive). Dimensions below 2 are meaningless; above 4096, ANN quality degrades and memory costs become prohibitive at scale.
- All embeddings are stored as `f16` by default (per `docs/research/ann_for_tidaldb.md`). The `EmbeddingSlot` definition can override to `f32` if the embedding model requires higher precision. `i8` quantization is available for memory-constrained deployments.

---

## Cohort-Ready Design

The expanded user attribute model enables cohort-based queries that are central to content platform analytics and targeting. This section describes how cohort resolution works and what indexing is required.

### Cohort Predicate Resolution

A cohort is a set of users matching a composite predicate over user attributes. tidalDB resolves cohort membership using the same index infrastructure that powers entity filtering:

1. Each predicate term resolves to a roaring bitmap of matching user IDs.
2. Compound predicates (AND, OR, NOT) are resolved via bitmap intersection, union, and complement.
3. The resulting user set feeds into signal aggregation for the cohort query.

```
Predicate: region:US AND age_range:18-24 AND inferred_interests:jazz

Step 1: region_index["US"]        → bitmap A (all US users)
Step 2: age_range_index["18-24"]  → bitmap B (all 18-24 users)
Step 3: interests_index["jazz"]   → bitmap C (all jazz-interested users)
Step 4: A ∩ B ∩ C                 → bitmap D (the cohort)
Step 5: aggregate signals over items engaged by users in bitmap D
Step 6: rank items by aggregated signal velocity within the cohort
```

### Required Indexes

Every keyword and keywords field on the User entity gets a term-to-bitmap index:

| Field | Index Type | Cardinality Estimate |
|-------|-----------|---------------------|
| `locale` | keyword → roaring bitmap | ~200 values |
| `language` | keyword → roaring bitmap | ~100 values |
| `region` | keyword → roaring bitmap | ~250 values |
| `timezone` | keyword → roaring bitmap | ~400 values |
| `age_range` | keyword → roaring bitmap | ~6 values |
| `gender` | keyword → roaring bitmap | ~4 values |
| `account_type` | keyword → roaring bitmap | ~4 values |
| `explicit_interests` | keyword → roaring bitmap | ~10,000 values |
| `preferred_formats` | keyword → roaring bitmap | ~10 values |
| `inferred_interests` | keyword → roaring bitmap | ~10,000 values |
| `primary_categories` | keyword → roaring bitmap | ~100 values |
| `engagement_level` | keyword → roaring bitmap | ~5 values |
| `content_format_preference` | keyword → roaring bitmap | ~3 values |
| `session_pattern` | keyword → roaring bitmap | ~3 values |

Numeric fields (`birth_year`, `platform_tenure_days`, `daily_active_hours`, `followed_creator_count`, `avg_creator_interaction_depth`) use sorted numeric indexes that support range predicates.

### Bitmap Freshness

Application-set field bitmaps are updated synchronously on entity write. Database-computed field bitmaps are updated when the computed field is refreshed (hourly or daily, per the background computation schedule). This means cohort queries over computed fields reflect the last background computation, not real-time state.

For most cohort use cases (trending among power users, popular in a demographic), hourly freshness is sufficient. If sub-second freshness is required for a specific computed field, the application can call `db.recompute_field(entity_id, field_name)` to trigger immediate recomputation and re-indexing. This should be used sparingly.

### Memory Budget for Cohort Indexes

At 10M users with the field set defined above, the bitmap indexes require approximately:

- Low-cardinality keyword fields (region, age_range, engagement_level, etc.): ~50 MB total (roaring bitmaps compress well when cardinality is low)
- High-cardinality keyword fields (explicit_interests, inferred_interests): ~500 MB total (10,000 terms, average 1,000 users per term, roaring bitmap of 1,000 u64s each)
- Numeric range indexes: ~80 MB total

**Total: approximately 630 MB** for full cohort resolution capability over 10M users. This fits comfortably within the memory budget recommended in `docs/research/tidaldb_signal_ledger.md`.
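
The AND case of the predicate resolution steps described earlier in this section can be sketched with ordinary sorted sets standing in for roaring bitmaps. This is a minimal illustration under assumptions, not tidalDB's implementation: `TermIndex`, `build_index`, and `resolve_and` are hypothetical names, and `BTreeSet<u64>` replaces the roaring bitmap so the example needs only the standard library.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a term-to-bitmap index: term -> set of user IDs.
/// A real index would hold a roaring bitmap per term instead of a BTreeSet.
type TermIndex = HashMap<String, BTreeSet<u64>>;

/// Build a term index from (term, user IDs) pairs.
fn build_index(entries: Vec<(&str, Vec<u64>)>) -> TermIndex {
    entries
        .into_iter()
        .map(|(term, ids)| (term.to_string(), ids.into_iter().collect()))
        .collect()
}

/// Resolve an AND predicate by intersecting the user sets of every
/// (index, term) pair. A term missing from its index matches no users,
/// so the whole cohort is empty.
fn resolve_and(terms: &[(&TermIndex, &str)]) -> BTreeSet<u64> {
    let mut sets = Vec::with_capacity(terms.len());
    for (index, term) in terms {
        match index.get(*term) {
            Some(set) => sets.push(set),
            None => return BTreeSet::new(),
        }
    }
    // Start from the smallest set so intermediate results stay small.
    sets.sort_by_key(|s| s.len());
    let mut iter = sets.into_iter();
    let mut cohort = match iter.next() {
        Some(first) => first.clone(),
        None => return BTreeSet::new(),
    };
    for set in iter {
        cohort = cohort.intersection(set).copied().collect();
    }
    cohort
}
```

Calling `resolve_and(&[(&region, "US"), (&age_range, "18-24"), (&interests, "jazz")])` corresponds to Steps 1 through 4 of the example predicate. OR and NOT would map to set union and difference in the same way; signal aggregation and ranking (Steps 5 and 6) are out of scope for this sketch.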

---

## Signal Ledger Attachment

Every entity automatically receives a signal ledger at creation time. The ledger is not part of the entity's metadata schema -- it is an intrinsic property of being an entity. Signal types and their behavior are defined separately via `define_signal()` (see the Signal Specification).

### What the Ledger Contains

For each signal type defined in the schema and targeting this entity kind:

| Component | Storage | Purpose |
|-----------|---------|---------|
| Running decay scores | `[f64; N]` per lambda | O(1) read of decayed signal value at query time |
| Windowed counters | Bucketed counters per window | Windowed aggregation (1h, 24h, 7d, 30d, all_time) |
| Velocity state | Derived from windowed counters | Rate-of-change computation |
| Last update timestamp | `u64` (nanoseconds) | Decay computation reference point |

The ledger follows the three-tier architecture from `docs/research/tidaldb_signal_ledger.md`:

- **Tier 1 (in-memory):** Running decay scores, SWAG-backed windowed counters, recent events. ~80 bytes per entity per signal type.
- **Tier 2 (disk):** Raw signal events, time-partitioned with FIFO compaction, 7-day retention.
- **Tier 3 (materialized rollups):** Hourly and daily aggregates for longer windows.

### Ledger Initialization

At entity creation:

```rust
// Pseudocode -- internal to the database, not public API
fn initialize_ledger(entity_id: EntityId, signal_types: &[SignalDef]) {
    for signal in signal_types {
        ledger.set_decay_scores(entity_id, signal.name, [0.0; N_LAMBDAS]);
        ledger.set_last_update(entity_id, signal.name, creation_time_ns);
        ledger.init_windowed_counters(entity_id, signal.name, &signal.windows);
    }
}
```

All scores start at zero. The `last_update` is set to creation time so that the first signal write computes correct decay deltas.
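
To make the interaction between the running decay score and the last-update timestamp concrete, here is a minimal sketch for a single lambda. The update rule `score = score * exp(-lambda * dt) + weight` is a common exponential-decay formulation and is an assumption here; this section does not state tidalDB's exact formula. `DecayState`, `read`, and `record` are hypothetical names, and the real ledger keeps one score per lambda rather than one.

```rust
/// Hypothetical decay state for one entity, one signal type, one lambda.
#[derive(Debug, Clone, Copy)]
struct DecayState {
    score: f64,          // decayed score as of `last_update_ns`
    last_update_ns: u64, // reference point for the next decay computation
}

const NANOS_PER_SEC: f64 = 1e9;

impl DecayState {
    /// Mirrors ledger initialization: zero score, last_update = creation time.
    fn new(creation_time_ns: u64) -> Self {
        DecayState { score: 0.0, last_update_ns: creation_time_ns }
    }

    /// Decayed value at `now_ns`, computed in O(1) without mutating state.
    /// Timestamps earlier than `last_update_ns` are treated as zero elapsed time.
    fn read(&self, lambda_per_sec: f64, now_ns: u64) -> f64 {
        let dt = now_ns.saturating_sub(self.last_update_ns) as f64 / NANOS_PER_SEC;
        self.score * (-lambda_per_sec * dt).exp()
    }

    /// Apply one signal event: decay the stored score to `now_ns`,
    /// add the event weight, and advance the reference timestamp.
    fn record(&mut self, lambda_per_sec: f64, now_ns: u64, weight: f64) {
        self.score = self.read(lambda_per_sec, now_ns) + weight;
        self.last_update_ns = self.last_update_ns.max(now_ns);
    }
}
```

With a one-hour half-life (`lambda = ln(2) / 3600` per second), an event of weight 1.0 reads back as 0.5 one hour later under this assumed rule. Storing only the score and its reference timestamp is what keeps both reads and writes constant-time.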

---

## Storage Representation

Entities are stored using the key encoding pattern from `CODING_GUIDELINES.md`, following the subject-prefix design from `thoughts.md`:

```
[entity_kind: u8][entity_id: u64 BE][0x00][TAG]:[suffix]

Tags:
  META           → serialized metadata (all fields)
  EMB:slot_name  → raw embedding vector bytes
  SIG:type:win   → signal windowed aggregate
  REL:kind       → relationship edge list
  STATE          → entity lifecycle state (active/archived)
```

### Examples

The `u64` IDs below are illustrative placeholders, not actual hash outputs.

```
[0x01][0x0000000000000ABC][0x00][META]            → Item item_abc metadata
[0x01][0x0000000000000ABC][0x00][EMB:content]     → Item item_abc content embedding
[0x01][0x0000000000000ABC][0x00][SIG:view:24h]    → Item item_abc view count, 24h window
[0x01][0x0000000000000ABC][0x00][REL:created_by]  → Item item_abc → creator link
[0x02][0x000000000000007B][0x00][META]            → User user_123 metadata
[0x02][0x000000000000007B][0x00][EMB:preference]  → User user_123 preference vector
[0x03][0x00000000000000FF][0x00][META]            → Creator creator_xyz metadata
[0x03][0x00000000000000FF][0x00][EMB:catalog]     → Creator creator_xyz catalog vector
```

Entity kind byte values:

| Kind | Byte |
|------|------|
| Item | `0x01` |
| User | `0x02` |
| Creator | `0x03` |

This encoding co-locates all data for a single entity under one key prefix, enabling efficient prefix scans (fetch all state for one entity) and natural shard boundaries. Per-entity-type storage isolation (separate column families or keyspaces) prevents cross-entity-type contention as recommended in `thoughts.md`.

### Entity ID Encoding

Entity IDs are provided by the application as strings (e.g., `"item_abc"`, `"user_123"`). Internally, they are hashed with BLAKE3 and reduced to a `u64` for compact, fixed-width storage and comparison. The original string ID is stored in metadata for external reference. Collisions in a 64-bit hash are very unlikely at the expected scale (the birthday bound puts a 50% collision chance at roughly 2^32, about 4 billion entities), but the system detects them at write time and returns `SchemaError::IdCollision` if one occurs.
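
The key layout described in this section can be sketched as a small encoder. This is an illustration under assumptions: `EntityKind`, `entity_prefix`, and `encode_key` are hypothetical names, the tag is written as raw ASCII bytes, and the `u64` ID is taken as given rather than derived with BLAKE3 (which would require an external crate).

```rust
/// Entity kind bytes, as listed in the kind table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum EntityKind {
    Item = 0x01,
    User = 0x02,
    Creator = 0x03,
}

/// Shared prefix for every key of one entity:
/// [kind: u8][id: u64 big-endian][0x00 separator].
fn entity_prefix(kind: EntityKind, id: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(10);
    key.push(kind as u8);
    key.extend_from_slice(&id.to_be_bytes());
    key.push(0x00);
    key
}

/// Full key: the entity prefix followed by the tag,
/// for example "META" or "EMB:content".
fn encode_key(kind: EntityKind, id: u64, tag: &str) -> Vec<u8> {
    let mut key = entity_prefix(kind, id);
    key.extend_from_slice(tag.as_bytes());
    key
}
```

Because the ID is big-endian, byte-wise key order matches numeric ID order within a kind, and every key for one entity shares the 10-byte prefix returned by `entity_prefix`, which is what makes a single prefix scan sufficient to fetch all of an entity's state.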

---

## Design Rationale

### Why the User Model Expanded From 2 Fields to 20+

The original API.md user entity had `language` and `region`. This is sufficient for a single-user personalization model where ranking depends entirely on the user's signal history and preference vector. It is insufficient for cohort-based queries.

The thesis of tidalDB includes replacing the feature store. A feature store's primary job in the content ranking stack is to answer "given this user's attributes and behavior, what segment do they belong to, and what is trending/popular/rising within that segment?" Without rich user attributes, tidalDB cannot answer this question. The application would need an external feature store, which defeats the single-system thesis.

The expanded model enables three categories of queries that the 2-field model cannot:

1. **Demographic cohorts:** "Trending among US users aged 18-24" -- requires `region`, `age_range`.
2. **Behavioral cohorts:** "Popular among power users who prefer short-form" -- requires `engagement_level`, `content_format_preference`.
3. **Interest cohorts:** "Rising in jazz among users who have shown interest in jazz" -- requires `explicit_interests`, `inferred_interests`.

### Why Computed Fields Are a Separate Category

Behavioral segments like `engagement_level` change continuously as users interact with the platform. If the application were responsible for computing and writing these, it would need to:

1. Maintain signal frequency counters per user
2. Run classification logic on every signal write
3. Write the result back to the database

This is exactly the feature-store-plus-Kafka pattern that tidalDB replaces. By making these fields database-computed, the feedback loop closes natively. The signal write updates the signal ledger, the background computation reads the ledger to classify the user, and the next cohort query sees the updated classification. One system.

### Why Items Have Many Fields

Every field on the Item entity maps to a filter dimension in USE_CASES.md Appendix A. The filter reference lists 30+ filterable dimensions. Each dimension must be represented as a field on the entity so the database can build the appropriate index. Removing a field means removing a filter that real users on real platforms use daily.

The alternative -- a generic JSON field for "other metadata" -- sacrifices indexing. A JSON field cannot be efficiently filtered, faceted, or range-scanned. Every field that appears in a filter predicate must be a typed, indexed field.

### Why Multiple Embedding Slots

UC-11 (Visual and Semantic Search) requires searching by image similarity. UC-02 requires text/semantic search. These are fundamentally different vector spaces with different dimensionality and different models. Forcing them into a single embedding slot would require either:

1. Training a multi-modal embedding (impractical for most teams)
2. Concatenating vectors (destroys distance metric quality)
3. Maintaining only one search modality (loses functionality)

Multiple slots, each with its own HNSW index, keep vector spaces clean and searchable independently while allowing the query planner to choose which space to search based on the query.

### Why Entity IDs Are Hashed to u64

String comparison is 5-10x slower than integer comparison for key lookups. Signal writes and ranking queries perform thousands of entity lookups per operation. The 8-byte fixed-width key enables:

1. Cache-line-friendly key encoding (aligned, fixed size)
2. Fast comparison in hot-path data structures
3. Compact storage in roaring bitmaps (u64 values)
4. Deterministic key ordering (big-endian u64 sort)

The original string ID is preserved in metadata for external reference and API responses. The hash is an internal optimization.