tidaldb/docs/specs/02-entity-model.md
jordan 413b712c0a chore: initialize tidalDB repository with schema foundation and standards
2026-02-20 12:52:20 -07:00


02 -- Entity Model Specification

The entity model defines the three core domain objects in tidalDB: Items (content), Users (consumers), and Creators (producers). Every entity has metadata fields, an embedding slot, and an attached signal ledger. The model is designed to support cohort-based targeting, personalized ranking, and the full query surface described in VISION.md and USE_CASES.md.

This specification covers entity schemas, field types, lifecycle semantics, embedding management, and the cohort-ready attribute design that enables queries like "what is trending among US users aged 18-24 who are interested in jazz."



Design Principles

Entities are nodes, not rows. An entity is not a collection of columns in a table. It is a node in a graph with metadata, embeddings, a signal ledger, and relationship edges. The database reasons about entities holistically -- not as field bags.

Some fields are yours; some are ours. The entity model distinguishes between application-set fields (written by the caller) and database-computed fields (maintained by tidalDB). The application sets demographic attributes on a user. The database computes behavioral segments from signal patterns. Neither overwrites the other.

Rich attributes enable cohort queries. A user entity with two fields (language, region) cannot answer "what is trending among power users in Japan who prefer short-form video." The user model must carry enough dimensionality to resolve cohort membership efficiently at query time.

Every field earns its index. Fields exist because a query needs them. Every field in this spec can be traced to a filter, sort mode, ranking profile signal, or cohort predicate in USE_CASES.md.


Field Type Reference

Every metadata field on an entity has a declared type that determines its indexing behavior, storage format, and query semantics.

| Type | Storage | Indexed As | Query Operations | Example |
|---|---|---|---|---|
| text | UTF-8 string | Inverted index (BM25, tokenized) | Full-text search, phrase match, field-scoped search | title, description |
| keyword | UTF-8 string | Term dictionary, exact match | Equality, IN-list, faceting | category, locale |
| keywords | Vec<String> | Term dictionary per value | Equality per value, IN-list, faceting | tags, explicit_interests |
| i64 | 64-bit signed integer | Sorted numeric index | Range, equality, min/max, sort | birth_year, follower_count |
| f64 | 64-bit float | Sorted numeric index | Range, equality, min/max, sort | avg_completion_rate |
| bool | 1-bit boolean | Boolean index | Equality | verified, has_subtitles |
| timestamp | UTC nanoseconds (i64) | Sorted numeric index | Range, presets (today, this_week), since | created_at, first_signal_at |
| duration | Seconds (f64) | Sorted numeric index | Range, presets (short, medium, long), sort | duration |
| embedding | Vec<f32> or quantized | HNSW (USearch) | ANN search, cosine similarity | content_vector, preference_vector |
| computed | Varies (keyword, keywords, i64, f64, timestamp) | Same as underlying type | Same as underlying type | engagement_level, inferred_interests |

computed fields are a special category. They have an underlying storage type (keyword, keywords, i64, f64, or timestamp) and are indexed identically to that type. The distinction is write semantics: the application cannot write computed fields directly. The database maintains them from signal patterns, relationship state, or periodic background computation. Attempting to set a computed field via write_user() or update_user() returns a SchemaError.


Entity Relationships Diagram

                           ┌──────────────┐
                           │    User      │
                           │              │
                           │  metadata    │
                           │  embedding   │
                           │  signals     │
                           └──────┬───────┘
                                  │
                    ┌─────────────┼─────────────┐
                    │             │             │
              follows/blocks   viewed/liked   interacted
              (Relationship)   (Signal)       (Relationship)
                    │             │             │
                    ▼             ▼             ▼
            ┌──────────────┐          ┌──────────────┐
            │   Creator    │◄─────────│    Item      │
            │              │ created  │              │
            │  metadata    │          │  metadata    │
            │  embedding   │          │  embedding   │
            │  signals     │          │  signals     │
            └──────────────┘          └──────────────┘

  Relationship edges:
    User ──follows──▶ Creator       (permanent, weight)
    User ──blocks───▶ Creator       (permanent, hard filter)
    User ──viewed───▶ Item          (signal-derived)
    User ──liked────▶ Item          (signal-derived)
    User ──saved────▶ Item          (explicit)
    User ──hid──────▶ Item          (permanent negative)
    Item ──created_by──▶ Creator    (structural, immutable)
    Creator ──similar_to──▶ Creator (computed, embedding distance)
    Item ──similar_to──▶ Item       (computed, embedding distance)

Every entity participates in two kinds of connections:

  1. Relationships -- explicit, weighted, directional edges managed via write_relationship(). Used for follows, blocks, saves, collections.
  2. Signal-derived state -- implicit edges created automatically when signals are written. A view signal on an item by a user creates a user-item "seen" edge. A like creates a user-item "liked" edge. These are queryable via Filter::unseen(), Filter::user_state("liked"), etc.
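
The second kind of connection can be illustrated with a small in-memory sketch. It is not tidalDB's implementation: the DerivedState type, the SignalKind enum, and the string labels are hypothetical stand-ins, and the sketch assumes a like does not also imply "seen", which the spec does not state.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical signal kinds; the spec names views and likes as examples.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum SignalKind {
    View,
    Like,
}

/// In-memory stand-in for signal-derived user-item state.
#[derive(Default)]
pub struct DerivedState {
    // (user_id, item_id) -> state labels such as "seen" or "liked"
    edges: HashMap<(String, String), HashSet<&'static str>>,
}

impl DerivedState {
    /// Record a signal and create the implied edge as a side effect.
    pub fn write_signal(&mut self, user: &str, item: &str, kind: SignalKind) {
        let label = match kind {
            SignalKind::View => "seen",
            SignalKind::Like => "liked",
        };
        self.edges
            .entry((user.to_string(), item.to_string()))
            .or_default()
            .insert(label);
    }

    /// True if the user-item pair carries the given state label.
    pub fn has_state(&self, user: &str, item: &str, label: &str) -> bool {
        self.edges
            .get(&(user.to_string(), item.to_string()))
            .map_or(false, |labels| labels.contains(label))
    }

    /// Rough analogue of Filter::unseen(): keep candidates with no "seen" edge.
    pub fn unseen<'a>(&self, user: &str, candidates: &[&'a str]) -> Vec<&'a str> {
        candidates
            .iter()
            .copied()
            .filter(|item| !self.has_state(user, item, "seen"))
            .collect()
    }
}
```

The point of the sketch is that the application never writes the edge itself; it only writes the signal, and the filterable state follows from it.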

Item Entity

Items are the content that gets ranked. Videos, articles, images, audio tracks, podcasts, live streams, galleries -- anything a user consumes and engages with.

Every item belongs to exactly one creator (the creator_id link). Items carry metadata for filtering and display, one or more embedding slots for semantic retrieval, and a signal ledger that accumulates engagement data.

Schema Definition

db.define_entity(EntityDef {
    kind: EntityKind::Item,
    metadata_fields: vec![
        // --- Text fields: full-text indexed, searchable via BM25 ---
        Field::text("title"),
        Field::text("description"),

        // --- Keyword fields: exact match, filterable, facetable ---
        Field::keyword("category"),           // primary category: "music", "gaming", "cooking"
        Field::keywords("tags"),              // multi-value: ["jazz", "piano", "tutorial"]
        Field::keyword("format"),             // video, short, live, vod, podcast, article, image, gallery, audio
        Field::keyword("language"),           // ISO 639-1: "en", "ja", "es"
        Field::keywords("subtitle_languages"),// available subtitle languages
        Field::keywords("dubbed_languages"),  // available dub languages
        Field::keyword("content_rating"),     // G, PG, PG-13, R, NC-17
        Field::keyword("status"),             // published, live, scheduled, archived, draft
        Field::keyword("availability"),       // free, premium, subscriber_only, rental
        Field::keyword("resolution"),         // SD, HD, FHD, 4K, 8K
        Field::keyword("audio_quality"),      // standard, high, lossless, spatial
        Field::keyword("content_region"),     // geographic origin: "US", "JP"
        Field::keyword("post_type"),          // text, link, image, video, poll (forum-style)
        Field::keywords("hashtags"),          // #jazz, #tutorial
        Field::keyword("flair"),              // community-specific label

        // --- Numeric fields: range-filterable, sortable ---
        Field::i64("award_count"),            // community awards/gilding count

        // --- Boolean fields: filterable ---
        Field::bool("has_subtitles"),
        Field::bool("has_audio_description"),
        Field::bool("has_sign_language"),
        Field::bool("downloadable"),
        Field::bool("hdr"),
        Field::bool("is_original"),           // not a crosspost/repost
        Field::bool("safe_search"),           // passes safe-search filter

        // --- Duration: range-filterable, sortable, preset-filterable ---
        Field::duration("duration"),

        // --- Timestamps: range-filterable, sortable ---
        Field::timestamp("created_at"),
        Field::timestamp("updated_at"),
        Field::timestamp("scheduled_at"),     // for premieres / scheduled live
        Field::timestamp("available_until"),  // for "leaving soon" filter
    ],
    // Primary content embedding -- externally computed, DB-indexed.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "content",              // text/semantic content vector
                dimensions: 1536,
                source: EmbeddingSource::External,
            },
        ],
    },
})?;

Field Summary Table

| Field | Type | Writability | Indexed | Used By |
|---|---|---|---|---|
| title | text | app-set | BM25 inverted | UC-02 search, UC-06 alphabetical sort |
| description | text | app-set | BM25 inverted | UC-02 search |
| category | keyword | app-set | term dictionary | UC-03 scoped trending, UC-06 browse, cohort |
| tags | keywords | app-set | term dictionary | UC-02 search, UC-06 filter |
| format | keyword | app-set | term dictionary | UC-01 format filter, UC-06 browse, diversity |
| language | keyword | app-set | term dictionary | UC-02 language filter |
| subtitle_languages | keywords | app-set | term dictionary | UC-02 accessibility filter |
| dubbed_languages | keywords | app-set | term dictionary | UC-02 accessibility filter |
| content_rating | keyword | app-set | term dictionary | UC-02 maturity filter |
| status | keyword | app-set | term dictionary | UC-12 live filter |
| availability | keyword | app-set | term dictionary | UC-02 availability filter |
| resolution | keyword | app-set | term dictionary | UC-02 quality filter |
| audio_quality | keyword | app-set | term dictionary | UC-02 quality filter |
| content_region | keyword | app-set | term dictionary | UC-02 geographic filter, cohort |
| post_type | keyword | app-set | term dictionary | UC-14 forum filtering |
| hashtags | keywords | app-set | term dictionary | UC-02 hashtag search |
| flair | keyword | app-set | term dictionary | UC-14 community filter |
| award_count | i64 | app-set | sorted numeric | UC-14 gilded filter |
| has_subtitles | bool | app-set | boolean | UC-02 accessibility filter |
| has_audio_description | bool | app-set | boolean | UC-02 accessibility filter |
| has_sign_language | bool | app-set | boolean | UC-02 accessibility filter |
| downloadable | bool | app-set | boolean | UC-09 download filter |
| hdr | bool | app-set | boolean | UC-02 quality filter |
| is_original | bool | app-set | boolean | UC-14 original-only filter |
| safe_search | bool | app-set | boolean | UC-02 safe search toggle |
| duration | duration | app-set | sorted numeric | UC-02 duration filter, UC-06 shortest/longest sort |
| created_at | timestamp | app-set | sorted numeric | UC-04 chronological, UC-06 date filter |
| updated_at | timestamp | app-set | sorted numeric | change tracking |
| scheduled_at | timestamp | app-set | sorted numeric | UC-12 scheduled content |
| available_until | timestamp | app-set | sorted numeric | UC-02 "leaving soon" filter |
| content (embedding) | embedding | app-set | HNSW (USearch) | UC-01 ANN retrieval, UC-02 semantic search, UC-05 related |

Additional Embedding Slots

Applications may define additional embedding slots for multi-modal retrieval:

EmbeddingSlot {
    name: "visual",               // image/thumbnail embedding
    dimensions: 512,
    source: EmbeddingSource::External,
},
EmbeddingSlot {
    name: "audio",                // audio fingerprint embedding
    dimensions: 256,
    source: EmbeddingSource::External,
},

Each slot gets its own HNSW index. Queries specify which embedding to search against. This supports UC-11 (visual/semantic search) without overloading a single vector space.
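
To make the "one index per slot" idea concrete, here is a minimal sketch that swaps the HNSW index for a brute-force cosine scan. SlotIndexes and its methods are hypothetical names, not part of the tidalDB API; the only behavior taken from the spec is that each slot is indexed separately and a query names the slot it searches.

```rust
use std::collections::HashMap;

/// Brute-force stand-in for one HNSW index per embedding slot.
#[derive(Default)]
pub struct SlotIndexes {
    // slot name -> (item id, vector) pairs
    slots: HashMap<String, Vec<(u64, Vec<f32>)>>,
}

/// Cosine similarity; returns 0.0 if either vector has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

impl SlotIndexes {
    /// Add a vector to the named slot, creating the slot on first use.
    pub fn insert(&mut self, slot: &str, id: u64, vector: Vec<f32>) {
        self.slots
            .entry(slot.to_string())
            .or_default()
            .push((id, vector));
    }

    /// Return up to k ids from the named slot, most similar first.
    /// Returns None when no vectors were ever inserted under that slot.
    pub fn search(&self, slot: &str, query: &[f32], k: usize) -> Option<Vec<u64>> {
        let entries = self.slots.get(slot)?;
        let mut scored: Vec<(f32, u64)> = entries
            .iter()
            .map(|(id, v)| (cosine(query, v), *id))
            .collect();
        scored.sort_by(|a, b| b.0.total_cmp(&a.0));
        Some(scored.into_iter().take(k).map(|(_, id)| id).collect())
    }
}
```

Because slots never share a vector space, a 512-dimensional "visual" query is only ever compared against "visual" vectors, regardless of what the "content" slot holds for the same item.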


User Entity

Users are the consumers of content. They generate signals (views, likes, skips, hides), accumulate preference profiles, and form relationships with creators and items.

The user entity carries two categories of fields:

  1. Application-set fields -- demographic and preference data the application writes explicitly. These are known at registration time or provided by the user.
  2. Database-computed fields -- behavioral segments, interest profiles, and engagement patterns derived from signal history. The database maintains these automatically. The application reads them (for display, analytics, cohort targeting) but never writes them directly.

This distinction is the foundation of cohort targeting. An application sets locale: "en-US" and birth_year: 2001. The database computes engagement_level: "power_user" and inferred_interests: ["jazz", "piano", "music_theory"]. A cohort query combines both: locale:en-US AND age_range:18-24 AND engagement_level:power_user AND interest:jazz.

Schema Definition

db.define_entity(EntityDef {
    kind: EntityKind::User,
    metadata_fields: vec![
        // ================================================================
        // APPLICATION-SET: Demographic Attributes
        // Written by the application at registration or profile update.
        // ================================================================
        Field::keyword("locale"),             // full locale: "en-US", "ja-JP", "es-MX"
        Field::keyword("language"),           // preferred content language: "en", "ja"
        Field::keyword("region"),             // geographic region: "US", "JP", "DE"
        Field::keyword("timezone"),           // IANA timezone: "America/New_York", "Asia/Tokyo"
        Field::i64("birth_year"),             // for age-based cohort bucketing (optional)
        Field::keyword("age_range"),          // explicit bucket: "13-17", "18-24", "25-34", "35-44", "45-54", "55+"
        Field::keyword("gender"),             // optional: "male", "female", "non-binary", "undisclosed"
        Field::keyword("account_type"),       // free, premium, creator, admin
        Field::keywords("explicit_interests"),// stated interests at signup: ["jazz", "cooking", "rust"]
        Field::keywords("preferred_formats"), // stated format preference: ["video", "short"]

        // ================================================================
        // DATABASE-COMPUTED: Interest Profile
        // Derived from engagement patterns. Updated by background computation.
        // ================================================================
        Field::computed("inferred_interests", FieldType::Keywords),
            // keywords derived from engagement history.
            // top N topics by weighted engagement volume.
            // e.g., ["jazz", "piano", "music_theory", "cooking", "rust"]
            // updated: every signal write triggers incremental update;
            //          full recomputation on background schedule.

        Field::computed("primary_categories", FieldType::Keywords),
            // top categories by engagement volume (coarser than interests).
            // e.g., ["music", "programming", "food"]
            // updated: background computation, hourly.

        // ================================================================
        // DATABASE-COMPUTED: Behavioral Segments
        // Derived from signal frequency, patterns, and recency.
        // ================================================================
        Field::computed("engagement_level", FieldType::Keyword),
            // power_user:  > 50 signals/day, 7-day streak
            // regular:     10-50 signals/day, active 4+ days/week
            // casual:      1-10 signals/day, active 1-3 days/week
            // dormant:     < 1 signal/day for 7+ days
            // new:         < 7 days since first signal
            // updated: background computation, every 6 hours.

        Field::computed("content_format_preference", FieldType::Keyword),
            // short:  > 60% of completions are items with duration < 4min
            // long:   > 60% of completions are items with duration > 20min
            // mixed:  neither threshold met
            // updated: background computation, daily.

        Field::computed("session_pattern", FieldType::Keyword),
            // binge:      avg session > 30min, sequential consumption
            // browsing:   avg session 5-30min, diverse consumption
            // searching:  > 40% of sessions start with search
            // updated: background computation, daily.

        Field::computed("platform_tenure_days", FieldType::I64),
            // days since first signal was written for this user.
            // updated: on every signal write (trivial computation).

        Field::computed("daily_active_hours", FieldType::F64),
            // average number of distinct hours with signal activity per day.
            // computed over trailing 7-day window.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Creator Relationship Profile
        // Derived from relationship graph and signal patterns.
        // ================================================================
        Field::computed("followed_creator_count", FieldType::I64),
            // count of active "follows" relationships.
            // updated: on relationship write (increment/decrement).

        Field::computed("avg_creator_interaction_depth", FieldType::F64),
            // average interaction_weight across all followed creators.
            // 0.0 = passive scroller, 1.0 = deeply engaged with every follow.
            // updated: background computation, daily.
    ],
    // User preference vector -- managed by the database.
    // Updated automatically on every signal write: shifted toward
    // (positive signal) or away from (negative signal) the item's embedding.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "preference",
                dimensions: 1536,
                source: EmbeddingSource::DatabaseManaged,
            },
        ],
    },
})?;
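
The comment on the preference slot says the vector is shifted toward or away from the item's embedding on each signal write, but the spec does not give the update rule. The sketch below shows one plausible form, a linear interpolation step followed by re-normalization. The function name, the signal_weight sign convention, and the rate parameter are all assumptions made for illustration.

```rust
/// Scale a vector to unit length; a zero vector is left unchanged.
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Move `pref` toward `item` (signal_weight > 0) or away from it
/// (signal_weight < 0). `rate` controls the step size; both parameters
/// are hypothetical and not defined by the spec.
pub fn shift_preference(pref: &mut [f32], item: &[f32], signal_weight: f32, rate: f32) {
    assert_eq!(pref.len(), item.len(), "dimension mismatch");
    for (p, i) in pref.iter_mut().zip(item) {
        *p += rate * signal_weight * (*i - *p);
    }
    // Keep the stored vector at unit length, matching how embeddings
    // are normalized when inserted into the index.
    normalize(pref);
}
```

With a small rate, a single signal nudges the vector only slightly, so the profile reflects accumulated behavior rather than the most recent click.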

Field Summary Table

| Field | Type | Writability | Indexed | Used By |
|---|---|---|---|---|
| locale | keyword | app-set | term dictionary | cohort targeting, content language matching |
| language | keyword | app-set | term dictionary | content language filter |
| region | keyword | app-set | term dictionary | geographic cohort, regional trending |
| timezone | keyword | app-set | term dictionary | time-aware ranking, notification timing |
| birth_year | i64 | app-set | sorted numeric | age-based cohort bucketing |
| age_range | keyword | app-set | term dictionary | age-based cohort targeting |
| gender | keyword | app-set | term dictionary | demographic cohort targeting |
| account_type | keyword | app-set | term dictionary | feature gating, cohort |
| explicit_interests | keywords | app-set | term dictionary | cold-start preference seeding, cohort |
| preferred_formats | keywords | app-set | term dictionary | format ranking boost, cohort |
| inferred_interests | computed (keywords) | db-computed | term dictionary | interest-based cohort, profile display |
| primary_categories | computed (keywords) | db-computed | term dictionary | category-based cohort |
| engagement_level | computed (keyword) | db-computed | term dictionary | behavioral cohort |
| content_format_preference | computed (keyword) | db-computed | term dictionary | format-based cohort |
| session_pattern | computed (keyword) | db-computed | term dictionary | behavioral cohort |
| platform_tenure_days | computed (i64) | db-computed | sorted numeric | tenure-based cohort |
| daily_active_hours | computed (f64) | db-computed | sorted numeric | engagement depth cohort |
| followed_creator_count | computed (i64) | db-computed | sorted numeric | social graph cohort |
| avg_creator_interaction_depth | computed (f64) | db-computed | sorted numeric | engagement depth cohort |
| preference (embedding) | embedding | db-managed | HNSW (USearch) | UC-01 For You ANN retrieval |
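
As an illustration of how a computed behavioral field could be derived, the following sketch applies the content_format_preference thresholds from the schema comments (more than 60% of completions under 4 minutes, or more than 60% over 20 minutes). The function and enum names are hypothetical, and returning Mixed for a user with no completions is an assumption the spec does not cover.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FormatPreference {
    Short,
    Long,
    Mixed,
}

/// Classify a user from the durations (in seconds) of items they completed.
pub fn classify_format_preference(completed_durations_secs: &[f64]) -> FormatPreference {
    if completed_durations_secs.is_empty() {
        // Assumption: no completions means no measurable preference.
        return FormatPreference::Mixed;
    }
    let total = completed_durations_secs.len() as f64;
    // Thresholds from the schema comments: < 4 min is short, > 20 min is long.
    let short = completed_durations_secs.iter().filter(|d| **d < 240.0).count() as f64;
    let long = completed_durations_secs.iter().filter(|d| **d > 1200.0).count() as f64;
    if short / total > 0.6 {
        FormatPreference::Short
    } else if long / total > 0.6 {
        FormatPreference::Long
    } else {
        FormatPreference::Mixed
    }
}
```

Since the result is a single keyword, it indexes in the term dictionary like any app-set keyword and can be used directly in a cohort predicate.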

Cohort Query Examples

With the expanded user model, tidalDB can resolve cohort predicates at query time:

-- Trending among US users aged 18-24 who like jazz
RETRIEVE items
USING PROFILE trending
FOR COHORT region:US AND age_range:18-24 AND (explicit_interests:jazz OR inferred_interests:jazz)
LIMIT 25

-- Popular among power users who prefer long-form content
RETRIEVE items
USING PROFILE top_week
FOR COHORT engagement_level:power_user AND content_format_preference:long
LIMIT 25

-- Rising content among new users (cold-start cohort)
RETRIEVE items
USING PROFILE rising
FOR COHORT engagement_level:new AND platform_tenure_days<30
LIMIT 25

The FOR COHORT clause resolves to a user set, aggregates their signal patterns over the matching items, and ranks accordingly. This is the mechanism that replaces the "feature store" in the traditional stack.
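
The first step, resolving a cohort predicate to a user set, might look like the sketch below. It models only that step, not signal aggregation or ranking. The Pred tree, the Value enum, and in_cohort are hypothetical names, and a real engine would presumably intersect index postings rather than scan users one by one.

```rust
use std::collections::HashMap;

/// Attribute values a user can carry (a subset of the spec's field types).
pub enum Value {
    Keyword(String),
    Keywords(Vec<String>),
    I64(i64),
}

/// Minimal cohort predicate tree.
pub enum Pred {
    /// field:value, matching a keyword exactly or any element of a keywords field.
    Has(&'static str, &'static str),
    /// field < bound, for i64 fields.
    Lt(&'static str, i64),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

pub type User = HashMap<&'static str, Value>;

/// True if the user's attributes satisfy the predicate.
/// A missing field or a type mismatch never matches.
pub fn in_cohort(user: &User, pred: &Pred) -> bool {
    match pred {
        Pred::Has(field, want) => match user.get(field) {
            Some(Value::Keyword(v)) => v == want,
            Some(Value::Keywords(vs)) => vs.iter().any(|v| v == want),
            _ => false,
        },
        Pred::Lt(field, bound) => match user.get(field) {
            Some(Value::I64(v)) => v < bound,
            _ => false,
        },
        Pred::And(a, b) => in_cohort(user, a) && in_cohort(user, b),
        Pred::Or(a, b) => in_cohort(user, a) || in_cohort(user, b),
    }
}

/// Resolve a predicate to the ids of all matching users.
pub fn resolve(users: &[(u64, User)], pred: &Pred) -> Vec<u64> {
    users
        .iter()
        .filter(|(_, u)| in_cohort(u, pred))
        .map(|(id, _)| *id)
        .collect()
}
```

The first example query maps onto this tree as And(region:US, And(age_range:18-24, Or(explicit_interests:jazz, inferred_interests:jazz))), mixing app-set and db-computed fields in one predicate.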


Creator Entity

Creators are the entities that produce content. Every item belongs to exactly one creator. Creators have their own metadata, embeddings, and signal ledgers that enable creator discovery (UC-10), creator profile pages (UC-08), and creator-level ranking signals.

Schema Definition

db.define_entity(EntityDef {
    kind: EntityKind::Creator,
    metadata_fields: vec![
        // ================================================================
        // APPLICATION-SET: Profile Information
        // ================================================================
        Field::text("name"),                  // display name, full-text searchable
        Field::keyword("handle"),             // unique handle, exact match searchable
        Field::keyword("language"),           // primary content language
        Field::keyword("region"),             // geographic region
        Field::keywords("categories"),        // content categories: ["music", "education"]
        Field::keywords("tags"),              // more specific: ["jazz", "piano", "tutorial"]
        Field::bool("verified"),              // platform verification status
        Field::keyword("account_type"),       // individual, brand, organization, label

        // ================================================================
        // DATABASE-COMPUTED: Audience Metrics
        // ================================================================
        Field::computed("follower_count", FieldType::I64),
            // count of active "follows" relationships pointing to this creator.
            // updated: on relationship write (increment/decrement).

        Field::computed("follower_growth_velocity", FieldType::F64),
            // net new followers per day, 7-day trailing average.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Content Catalog Statistics
        // ================================================================
        Field::computed("total_items", FieldType::I64),
            // count of non-archived items by this creator.
            // updated: on item write/archive.

        Field::computed("category_distribution", FieldType::Keywords),
            // top categories by item count.
            // e.g., ["jazz:45", "blues:20", "tutorial:15"]
            // stored as keyword values for faceting, with counts encoded.
            // updated: background computation, daily.

        Field::computed("avg_item_quality", FieldType::F64),
            // average completion_rate across all items with > 100 views.
            // proxy for content quality independent of reach.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Engagement Metrics
        // ================================================================
        Field::computed("avg_engagement_rate", FieldType::F64),
            // average (likes + comments + shares) / views across recent catalog.
            // trailing 30-day window over items created in that window.
            // updated: background computation, daily.

        Field::computed("posting_frequency", FieldType::F64),
            // average items published per week, trailing 30-day window.
            // updated: background computation, daily.

        Field::computed("last_posted_at", FieldType::Timestamp),
            // timestamp of most recent item creation.
            // updated: on item write.
    ],
    // Creator embedding -- aggregated from their item catalog.
    // Represents the semantic "center" of what this creator produces.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "catalog",
                dimensions: 1536,
                source: EmbeddingSource::DatabaseManaged,
            },
        ],
    },
})?;

Field Summary Table

| Field | Type | Writability | Indexed | Used By |
|---|---|---|---|---|
| name | text | app-set | BM25 inverted | UC-10 people search |
| handle | keyword | app-set | term dictionary | UC-02 creator:handle search |
| language | keyword | app-set | term dictionary | UC-10 language filter |
| region | keyword | app-set | term dictionary | UC-10 geographic filter |
| categories | keywords | app-set | term dictionary | UC-10 topic filter |
| tags | keywords | app-set | term dictionary | UC-10 niche discovery |
| verified | bool | app-set | boolean | UC-10 verified filter |
| account_type | keyword | app-set | term dictionary | UC-10 creator type filter |
| follower_count | computed (i64) | db-computed | sorted numeric | UC-10 follower range filter, sort |
| follower_growth_velocity | computed (f64) | db-computed | sorted numeric | UC-03 rising creators |
| total_items | computed (i64) | db-computed | sorted numeric | UC-08 catalog size |
| category_distribution | computed (keywords) | db-computed | term dictionary | UC-08 catalog browsing |
| avg_item_quality | computed (f64) | db-computed | sorted numeric | UC-13 hidden gems by creator |
| avg_engagement_rate | computed (f64) | db-computed | sorted numeric | UC-10 engagement rate sort |
| posting_frequency | computed (f64) | db-computed | sorted numeric | UC-10 activity filter |
| last_posted_at | computed (timestamp) | db-computed | sorted numeric | UC-10 recently active filter |
| catalog (embedding) | embedding | db-managed | HNSW (USearch) | UC-10 "creators like X" |

Creator Embedding Computation

The creator's catalog embedding is the centroid of their non-archived items' content embeddings, weighted by item quality (completion rate). This is computed by the database on a background schedule:

catalog_embedding = weighted_mean(
    vectors: [item.content_embedding for item in creator.items if item.status != "archived"],
    weights: [item.completion_rate_all_time.max(0.1) for item in creator.items if item.status != "archived"]
)

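When a creator publishes a new item, the catalog embedding is updated incrementally. The update is an unweighted running mean, so it only approximates the quality-weighted centroid until the next full recomputation: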

new_catalog = (old_catalog * old_count + new_item_embedding) / (old_count + 1)

Full recomputation occurs on a background schedule (daily) to correct for incremental drift and account for archived items.
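
The two formulas above translate into Rust roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the caller is assumed to have already filtered out archived items, and each item is passed as an (embedding, completion_rate) pair.

```rust
/// Quality-weighted centroid of a creator's item embeddings.
/// Each entry is (content embedding, all-time completion rate).
/// Weights are floored at 0.1, as in the pseudocode. Returns None
/// for an empty catalog.
pub fn catalog_embedding(items: &[(Vec<f32>, f32)]) -> Option<Vec<f32>> {
    let dims = items.first()?.0.len();
    let mut acc = vec![0.0f32; dims];
    let mut weight_sum = 0.0f32;
    for (vector, completion_rate) in items {
        assert_eq!(vector.len(), dims, "all embeddings must share one dimension");
        let w = (*completion_rate).max(0.1);
        for (a, x) in acc.iter_mut().zip(vector) {
            *a += w * x;
        }
        weight_sum += w;
    }
    for a in acc.iter_mut() {
        *a /= weight_sum;
    }
    Some(acc)
}

/// Incremental update when one new item is published: a plain running
/// mean over item count, matching the second formula.
pub fn add_item(old_catalog: &[f32], old_count: usize, new_item: &[f32]) -> Vec<f32> {
    let n = old_count as f32;
    old_catalog
        .iter()
        .zip(new_item)
        .map(|(c, x)| (c * n + x) / (n + 1.0))
        .collect()
}
```

The incremental path ignores completion-rate weights, which is one source of the drift that the daily full recomputation corrects.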


Field Writability Model

Every field in the entity model belongs to one of three writability categories. This distinction is enforced at the schema level -- the database rejects writes that violate writability constraints.

| Category | Who Writes | When Updated | Examples |
|---|---|---|---|
| app-set | Application via write_*() / update_*() | On explicit write | title, locale, birth_year, verified |
| db-computed | Database background computation | On schedule or trigger (see below) | engagement_level, inferred_interests, follower_count |
| db-managed | Database signal processing | On every relevant signal write | preference embedding, interaction_weight |

Update Triggers for Computed Fields

Computed fields are updated by one of three mechanisms:

| Trigger | Latency | Fields |
|---|---|---|
| Immediate (on write) | < 1ms | follower_count, followed_creator_count, total_items, platform_tenure_days, last_posted_at |
| Incremental (signal-driven) | < 100ms | inferred_interests (top-N update), preference embedding (vector shift) |
| Background (scheduled) | Minutes to hours | engagement_level, content_format_preference, session_pattern, daily_active_hours, avg_creator_interaction_depth, avg_engagement_rate, posting_frequency, avg_item_quality, category_distribution, follower_growth_velocity, primary_categories, creator catalog embedding |

Background computation runs on a configurable schedule. The default is:

  • Hourly: primary_categories, inferred_interests (full recomputation)
  • Every 6 hours: engagement_level
  • Daily: content_format_preference, session_pattern, daily_active_hours, avg_creator_interaction_depth, avg_engagement_rate, posting_frequency, avg_item_quality, category_distribution, follower_growth_velocity, creator catalog embedding (full recomputation)

Applications can trigger immediate recomputation of any computed field via db.recompute_field(entity_id, field_name) for debugging or operational purposes. This is not intended for production hot paths.

Write API Enforcement

// This succeeds -- locale is app-set
db.update_user("user_123", UpdateUser {
    metadata: Some(metadata! {
        "locale" => "ja-JP",
        "timezone" => "Asia/Tokyo",
    }),
    ..Default::default()
})?;

// This fails with SchemaError::ComputedFieldWrite
db.update_user("user_123", UpdateUser {
    metadata: Some(metadata! {
        "engagement_level" => "power_user",  // ERROR: computed field
    }),
    ..Default::default()
})?;
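
One way such enforcement could work internally is a per-field writability lookup performed before any field is applied. The sketch below is illustrative only: validate_update, the Writability enum, and the UnknownField variant are hypothetical, and reporting db-managed fields with the same error as computed ones is an assumption. Only SchemaError::ComputedFieldWrite is named in the spec.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Writability {
    AppSet,
    DbComputed,
    DbManaged,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    /// Named in the spec; carries the offending field name here.
    ComputedFieldWrite(String),
    /// Hypothetical variant for fields missing from the schema.
    UnknownField(String),
}

/// Check every field in an update before applying any of them, so a
/// rejected update leaves the entity untouched.
pub fn validate_update(
    schema: &HashMap<&str, Writability>,
    fields: &[&str],
) -> Result<(), SchemaError> {
    for field in fields {
        match schema.get(*field) {
            Some(Writability::AppSet) => {}
            // Assumption: db-managed fields are rejected the same way.
            Some(_) => return Err(SchemaError::ComputedFieldWrite(field.to_string())),
            None => return Err(SchemaError::UnknownField(field.to_string())),
        }
    }
    Ok(())
}
```

Validating the full field list up front means an update that mixes a legal field with a computed one fails as a whole instead of being partially applied.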

Entity Lifecycle

Every entity follows the same lifecycle model. The lifecycle defines what state transitions are legal and what each transition means for storage, indexing, and query visibility.

States

                write_*()
    (none) ──────────────▶ Active
                              │
                    update_*()│ (metadata/embedding changes)
                    ◄─────────┘
                              │
                   archive()  │
                              ▼
                           Archived
                              │
                    delete()  │
                              ▼
                           Deleted
                        (hard remove)

State Semantics

| State | Query Visible | Signals Accepted | Signal Ledger | Relationships | Embeddings |
|---|---|---|---|---|---|
| Active | Yes | Yes | Accumulating | Active | Indexed in HNSW |
| Archived | No (excluded by default) | No (rejected with error) | Preserved (read-only) | Preserved but inactive | Removed from HNSW |
| Deleted | No | No | Destroyed | Destroyed | Destroyed |
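
The lifecycle can be read as a small state machine. The sketch below encodes the transitions from the diagram plus unarchive, using hypothetical State and Op enums. Two points are assumptions because the spec does not settle them: delete is allowed directly from Active, and updates to archived entities are rejected.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum State {
    Active,
    Archived,
    Deleted,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Op {
    Update,
    WriteSignal,
    Archive,
    Unarchive,
    Delete,
}

/// Next state for a legal operation, or None if the operation is rejected.
pub fn apply(state: State, op: Op) -> Option<State> {
    match (state, op) {
        (State::Active, Op::Update) | (State::Active, Op::WriteSignal) => Some(State::Active),
        (State::Active, Op::Archive) => Some(State::Archived),
        (State::Archived, Op::Unarchive) => Some(State::Active),
        // Assumption: delete is accepted from Active as well as Archived.
        (State::Active, Op::Delete) | (State::Archived, Op::Delete) => Some(State::Deleted),
        // Everything else is rejected, including signals on archived
        // entities and any operation on a deleted entity.
        _ => None,
    }
}

/// Only active entities appear in default query results.
pub fn query_visible(state: State) -> bool {
    state == State::Active
}
```

Deleted has no outgoing transitions, which reflects that delete is irreversible while archive is not.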

Create

On write_item(), write_user(), or write_creator():

  1. Entity metadata is stored in the entity store.
  2. Text fields are indexed in the inverted index (Tantivy).
  3. Keyword, numeric, boolean, timestamp, and duration fields are indexed in their respective indexes.
  4. Embedding is inserted into the HNSW index (USearch) -- normalized to unit length at insertion.
  5. Signal ledger is initialized (all counters at zero, all decay scores at zero, last_update_ns set to creation time).
  6. For items: linked to creator entity; cold-start exploration budget applied.
  7. For users: if no embedding provided, initialized to population-level default preference vector.
  8. For creators: catalog embedding initialized to zero vector (will be computed when first item is published).
  9. Entity is immediately queryable after commit.

Idempotency: write_*() is not idempotent. Writing an entity with an ID that already exists is an error (SchemaError::EntityExists). Use update_*() for modifications.

Update

On update_item(), update_user(), or update_creator():

  1. Only provided fields are modified. Omitted fields retain their current values (partial update).
  2. Modified text fields trigger re-indexing in the inverted index.
  3. Modified keyword, numeric, boolean, timestamp, and duration fields trigger re-indexing in their respective indexes.
  4. If an embedding is provided, the old vector is replaced in the HNSW index. The new vector is normalized at insertion.
  5. Signal ledger is not affected by metadata updates.
  6. Computed fields cannot be set (returns SchemaError::ComputedFieldWrite).
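
The create and update rules above can be sketched with a toy in-memory store. Everything here is an assumption for illustration: `Store`, `UserMeta`, `UserPatch`, the `NotFound` variant, and the `String` payload on `ComputedFieldWrite` are hypothetical. Only the `EntityExists` and `ComputedFieldWrite` error names come from this specification.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum SchemaError {
    EntityExists,
    ComputedFieldWrite(String), // payload is an assumption
    NotFound,                   // hypothetical variant for this sketch
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct UserMeta {
    pub region: Option<String>,
    pub language: Option<String>,
    pub engagement_level: Option<String>, // database-computed
}

/// A partial update: `None` means "leave the current value alone".
#[derive(Default)]
pub struct UserPatch {
    pub region: Option<String>,
    pub language: Option<String>,
    pub engagement_level: Option<String>,
}

#[derive(Default)]
pub struct Store {
    users: HashMap<String, UserMeta>,
}

impl Store {
    /// Create-only write: an existing ID is an error, not an overwrite.
    pub fn write_user(&mut self, id: &str, meta: UserMeta) -> Result<(), SchemaError> {
        if self.users.contains_key(id) {
            return Err(SchemaError::EntityExists);
        }
        self.users.insert(id.to_string(), meta);
        Ok(())
    }

    /// Partial update: only provided fields change; computed fields are rejected.
    pub fn update_user(&mut self, id: &str, patch: UserPatch) -> Result<(), SchemaError> {
        if patch.engagement_level.is_some() {
            return Err(SchemaError::ComputedFieldWrite("engagement_level".to_string()));
        }
        let user = self.users.get_mut(id).ok_or(SchemaError::NotFound)?;
        if let Some(region) = patch.region {
            user.region = Some(region);
        }
        if let Some(language) = patch.language {
            user.language = Some(language);
        }
        Ok(())
    }

    pub fn get_user(&self, id: &str) -> Option<&UserMeta> {
        self.users.get(id)
    }
}
```

Rejecting the computed field before touching any state keeps the update all-or-nothing in this sketch; whether the real implementation validates first is not stated here.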

Archive

On db.archive(entity_kind, entity_id):

  1. Entity status is set to "archived".
  2. Entity is removed from query candidate sets (excluded from RETRIEVE, SEARCH results).
  3. Entity embedding is removed from the HNSW index.
  4. Entity is removed from the inverted index.
  5. Signal ledger is preserved in read-only state. Historical queries and analytics can still access signal data.
  6. Relationships involving this entity are preserved but marked inactive. They no longer influence ranking for other entities.
  7. The entity can be unarchived via db.unarchive(entity_kind, entity_id), which reverses all of the above.

Archive is the expected path for content removal. Creators unpublish videos. Users deactivate accounts. The data remains for analytics, audit, and potential restoration.

Delete

On db.delete(entity_kind, entity_id):

  1. Entity metadata is destroyed.
  2. All indexes are updated to remove the entity.
  3. Signal ledger is destroyed.
  4. All relationships involving this entity are destroyed.
  5. For items: the creator's total_items count is decremented and catalog embedding is marked for recomputation.
  6. For users: all user-specific signal state (seen items, preference vector, relationship weights) is destroyed.
  7. For creators: all items by this creator remain but lose their creator link (orphaned items should be archived or reassigned by the application before deleting a creator).

Delete is a destructive, irreversible operation intended for legal compliance (GDPR right to erasure, DMCA takedowns). Normal content removal should use archive.

Cold Start State

A newly created entity with no signal history is in cold-start state. The database handles this natively:

  • Items: Receive an exploration budget (configurable per ranking profile) that injects them into a percentage of query results regardless of signal state. The budget decays as signals accumulate. Default: 10% of For You feed slots for the first 48 hours or until 1000 impressions, whichever comes first.
  • Users: Start with a population-level default preference vector. If explicit_interests are provided at creation, the vector is seeded toward those interest embeddings. After approximately 20 signal events, the preference vector becomes user-specific.
  • Creators: Start with a zero catalog embedding. After their first item is published, the catalog embedding is set to that item's content embedding. Subsequent items refine it.

Cold start handling is specified in the ranking profile, not in the entity model. The entity model provides the fields and embedding slots that ranking profiles use to detect and handle cold-start conditions.
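
As a rough illustration of the default item budget quoted above, the following sketch checks whether an item is still inside its exploration window and how many feed slots a 10% budget yields. The function names and constants are hypothetical. It also uses a hard cutoff, whereas the text describes a budget that decays as signals accumulate, so treat it as a simplification.

```rust
// Defaults quoted in this section; in practice they would come from the ranking profile.
pub const EXPLORATION_WINDOW_HOURS: f64 = 48.0;
pub const EXPLORATION_MAX_IMPRESSIONS: u64 = 1_000;
pub const EXPLORATION_SLOT_PERCENT: usize = 10;

/// An item stays in exploration until it is 48 hours old or has
/// reached 1000 impressions, whichever comes first.
pub fn in_exploration(age_hours: f64, impressions: u64) -> bool {
    age_hours < EXPLORATION_WINDOW_HOURS && impressions < EXPLORATION_MAX_IMPRESSIONS
}

/// Number of feed slots reserved for cold-start items (integer division rounds down).
pub fn exploration_slots(feed_len: usize) -> usize {
    feed_len * EXPLORATION_SLOT_PERCENT / 100
}
```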


Embedding Management

Embeddings are dense vector representations stored alongside entities and indexed for approximate nearest neighbor (ANN) retrieval via USearch (HNSW).

Embedding Sources

| Source | Meaning | Who Writes | When Updated |
|---|---|---|---|
| External | Application computes and provides the vector | Application | On write_*() or update_*() with embedding |
| DatabaseManaged | Database computes and maintains the vector | Database | On signal writes (incremental) and background schedule (full) |

External Embeddings

The application is responsible for computing external embeddings using its own model (OpenAI, Cohere, custom, etc.). tidalDB indexes and retrieves over these vectors but never generates them.

// Application computes the embedding externally
let content_vector: Vec<f32> = embedding_service.embed(&title_and_description);

db.write_item(WriteItem {
    id: "item_abc",
    creator_id: "creator_xyz",
    metadata: metadata! { /* ... */ },
    embeddings: embeddings! {
        "content" => &content_vector,    // 1536-dim, externally computed
    },
})?;

Normalization: All embeddings are normalized to unit length at insertion time. For unit vectors, cosine similarity equals the dot product, and squared L2 distance equals 2 - 2*cos, so ranking by L2 distance returns the same order as ranking by cosine similarity while being more SIMD-friendly. The application does not need to pre-normalize -- the database handles it. See docs/research/ann_for_tidaldb.md for rationale.
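
The relationship between L2 distance and cosine similarity on unit vectors is the identity ||a - b||^2 = 2 - 2*cos(a, b). A minimal sketch that normalizes two vectors and checks the identity follows. The helper names are illustrative, and leaving a zero vector unchanged is an assumption, since this section does not say how zero vectors are handled.

```rust
/// Scale a vector to unit length. A zero vector is returned unchanged
/// (an assumption; the spec does not define this case).
pub fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; equals cosine similarity when both inputs are unit vectors.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Squared Euclidean distance.
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

Because the mapping from cosine to squared distance is monotonically decreasing, the nearest neighbor under L2 is also the most similar under cosine.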

Dimensions: Configurable per embedding slot in the entity definition. The default is 1536 (matching OpenAI text-embedding-3-small; text-embedding-3-large defaults to 3072). Changing dimensions after data has been written requires rebuilding the HNSW index for that slot.

Database-Managed Embeddings

Two embeddings are managed by the database:

User preference vector (User.preference): Updated incrementally on every signal write. When a user generates a positive signal (like, completion, save) for an item, the preference vector is shifted toward the item's content embedding. When a user generates a negative signal (skip, hide, not-interested), the preference vector is shifted away. The learning rate and momentum are configurable per signal type in the ranking profile.

# Positive signal (like, completion)
preference += learning_rate * (item.content_embedding - preference)

# Negative signal (skip, hide)
preference -= learning_rate * (item.content_embedding - preference) * negative_weight

# Re-normalize to unit length after each update
preference = normalize(preference)

Full recomputation from signal history occurs on a daily background schedule to correct for incremental drift.
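
The incremental update rule above can be written as a small Rust function. This is a sketch of the pseudocode only: `Feedback` and `update_preference` are hypothetical names, and momentum (mentioned as configurable) is left out because the pseudocode does not define it.

```rust
/// Direction of the signal. `weight` corresponds to `negative_weight` in the pseudocode.
pub enum Feedback {
    Positive,
    Negative { weight: f32 },
}

/// Shift `preference` toward (positive) or away from (negative) the item's
/// content embedding, then re-normalize to unit length.
pub fn update_preference(
    preference: &mut [f32],
    item_embedding: &[f32],
    learning_rate: f32,
    feedback: Feedback,
) {
    assert_eq!(preference.len(), item_embedding.len(), "dimension mismatch");

    // Positive signals add the scaled difference; negative signals subtract it.
    let step = match feedback {
        Feedback::Positive => learning_rate,
        Feedback::Negative { weight } => -learning_rate * weight,
    };
    for (p, x) in preference.iter_mut().zip(item_embedding) {
        *p += step * (*x - *p);
    }

    // Re-normalize so the vector stays on the unit sphere.
    let norm = preference.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for p in preference.iter_mut() {
            *p /= norm;
        }
    }
}
```

Each update re-normalizes in f32, so small rounding errors accumulate over many writes, which is consistent with the daily full recomputation described above.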

Creator catalog vector (Creator.catalog): Weighted centroid of all non-archived item embeddings by this creator. Updated incrementally when items are published or archived. Full recomputation on a daily background schedule.
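
One way to maintain such a centroid incrementally is to keep a running sum and count, adding an embedding on publish and subtracting it on archive. The sketch below assumes equal weights for every item, because the weighting scheme is not specified in this section, and `CatalogCentroid` is a hypothetical name.

```rust
/// Running centroid over a creator's non-archived item embeddings.
/// Equal weighting is an assumption made for this sketch.
pub struct CatalogCentroid {
    sum: Vec<f32>,
    count: u32,
}

impl CatalogCentroid {
    /// A creator with no items starts with a zero vector.
    pub fn new(dimensions: usize) -> Self {
        Self { sum: vec![0.0; dimensions], count: 0 }
    }

    /// Called when an item is published.
    pub fn add_item(&mut self, embedding: &[f32]) {
        assert_eq!(embedding.len(), self.sum.len(), "dimension mismatch");
        for (s, x) in self.sum.iter_mut().zip(embedding) {
            *s += x;
        }
        self.count += 1;
    }

    /// Called when an item is archived. Does nothing if no items are tracked.
    pub fn remove_item(&mut self, embedding: &[f32]) {
        assert_eq!(embedding.len(), self.sum.len(), "dimension mismatch");
        if self.count == 0 {
            return;
        }
        for (s, x) in self.sum.iter_mut().zip(embedding) {
            *s -= x;
        }
        self.count -= 1;
    }

    /// Current centroid; the zero vector when the creator has no items.
    pub fn vector(&self) -> Vec<f32> {
        if self.count == 0 {
            return vec![0.0; self.sum.len()];
        }
        self.sum.iter().map(|s| s / self.count as f32).collect()
    }
}
```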

Multiple Embedding Slots

An entity type can define multiple embedding slots for multi-modal retrieval:

embedding: EmbeddingDef {
    slots: vec![
        EmbeddingSlot { name: "content", dimensions: 1536, source: External },
        EmbeddingSlot { name: "visual",  dimensions: 512,  source: External },
        EmbeddingSlot { name: "audio",   dimensions: 256,  source: External },
    ],
},

Each slot is independently indexed in its own HNSW graph. Queries specify which slot to search:

// Semantic search over content embeddings (default)
db.search(Search { vector: Some(&query_vec), vector_slot: "content", .. })?;

// Visual similarity search (UC-11)
db.search(Search { vector: Some(&image_vec), vector_slot: "visual", .. })?;

If vector_slot is omitted, the first defined slot is used as the default.

Embedding Slot Constraints

  • An entity can have at most 4 embedding slots. This is a pragmatic limit -- each slot consumes memory for the HNSW graph (approximately 300 bytes per node at M=16, per slot).
  • Embedding dimensions must be between 2 and 4096 (inclusive). Dimensions below 2 are meaningless; above 4096, ANN quality degrades and memory costs become prohibitive at scale.
  • All embeddings are stored as f16 by default (per docs/research/ann_for_tidaldb.md). The EmbeddingSlot definition can override to f32 if the embedding model requires higher precision. i8 quantization is available for memory-constrained deployments.
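
The slot limits above lend themselves to a validation step at schema-definition time. The following is a sketch under assumptions: `SlotSpec`, `SlotError`, and `validate_slots` are hypothetical names, and the real `EmbeddingSlot` type may carry additional fields such as source and precision.

```rust
pub const MAX_SLOTS: usize = 4;
pub const MIN_DIMENSIONS: usize = 2;
pub const MAX_DIMENSIONS: usize = 4096;

/// Hypothetical, reduced view of an embedding slot definition.
pub struct SlotSpec {
    pub name: String,
    pub dimensions: usize,
}

#[derive(Debug, PartialEq)]
pub enum SlotError {
    TooManySlots { found: usize },
    DimensionsOutOfRange { slot: String, dimensions: usize },
}

/// Enforce the slot-count limit and the inclusive dimension range.
pub fn validate_slots(slots: &[SlotSpec]) -> Result<(), SlotError> {
    if slots.len() > MAX_SLOTS {
        return Err(SlotError::TooManySlots { found: slots.len() });
    }
    for slot in slots {
        if !(MIN_DIMENSIONS..=MAX_DIMENSIONS).contains(&slot.dimensions) {
            return Err(SlotError::DimensionsOutOfRange {
                slot: slot.name.clone(),
                dimensions: slot.dimensions,
            });
        }
    }
    Ok(())
}

/// Raw vector size at the default f16 precision (2 bytes per component),
/// excluding HNSW graph overhead.
pub fn f16_vector_bytes(dimensions: usize) -> usize {
    dimensions * 2
}
```

For example, a 1536-dimension slot stored as f16 takes 3072 bytes per vector before graph overhead.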

Cohort-Ready Design

The expanded user attribute model enables cohort-based queries that are central to content platform analytics and targeting. This section describes how cohort resolution works and what indexing is required.

Cohort Predicate Resolution

A cohort is a set of users matching a composite predicate over user attributes. tidalDB resolves cohort membership using the same index infrastructure that powers entity filtering:

  1. Each predicate term resolves to a roaring bitmap of matching user IDs.
  2. Compound predicates (AND, OR, NOT) are resolved via bitmap intersection, union, and complement.
  3. The resulting user set feeds into signal aggregation for the cohort query.

Predicate: region:US AND age_range:18-24 AND inferred_interests:jazz

Step 1: region_index["US"]           → bitmap A (all US users)
Step 2: age_range_index["18-24"]     → bitmap B (all 18-24 users)
Step 3: interests_index["jazz"]      → bitmap C (all jazz-interested users)
Step 4: A ∩ B ∩ C                    → bitmap D (the cohort)
Step 5: aggregate signals over items engaged by users in bitmap D
Step 6: rank items by aggregated signal velocity within the cohort
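
Steps 1 through 4 can be sketched with the standard library alone. This example uses `BTreeSet<u64>` as a stand-in for a roaring bitmap so that it stays self-contained. `TermIndex` and `intersect_all` are hypothetical names, and a real implementation would use compressed bitmaps rather than tree sets.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a roaring bitmap of user IDs.
pub type UserSet = BTreeSet<u64>;

/// Term-to-user-set index for one keyword field (e.g. region).
#[derive(Default)]
pub struct TermIndex {
    postings: HashMap<String, UserSet>,
}

impl TermIndex {
    pub fn insert(&mut self, term: &str, user_id: u64) {
        self.postings.entry(term.to_string()).or_default().insert(user_id);
    }

    /// Users matching a term; empty if the term is unknown.
    pub fn lookup(&self, term: &str) -> UserSet {
        self.postings.get(term).cloned().unwrap_or_default()
    }
}

/// AND of all predicate terms: the intersection of their user sets.
pub fn intersect_all(sets: &[UserSet]) -> UserSet {
    let mut iter = sets.iter();
    let first = match iter.next() {
        Some(set) => set.clone(),
        None => return UserSet::new(),
    };
    iter.fold(first, |acc, set| acc.intersection(set).copied().collect())
}
```

Intersecting the most selective term first keeps intermediate sets small; this sketch simply processes the sets in the order given.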

Required Indexes

Every keyword and keywords field on the User entity gets a term-to-bitmap index:

| Field | Index Type | Cardinality Estimate |
|---|---|---|
| locale | keyword → roaring bitmap | ~200 values |
| language | keyword → roaring bitmap | ~100 values |
| region | keyword → roaring bitmap | ~250 values |
| timezone | keyword → roaring bitmap | ~400 values |
| age_range | keyword → roaring bitmap | ~6 values |
| gender | keyword → roaring bitmap | ~4 values |
| account_type | keyword → roaring bitmap | ~4 values |
| explicit_interests | keyword → roaring bitmap | ~10,000 values |
| preferred_formats | keyword → roaring bitmap | ~10 values |
| inferred_interests | keyword → roaring bitmap | ~10,000 values |
| primary_categories | keyword → roaring bitmap | ~100 values |
| engagement_level | keyword → roaring bitmap | ~5 values |
| content_format_preference | keyword → roaring bitmap | ~3 values |
| session_pattern | keyword → roaring bitmap | ~3 values |

Numeric fields (birth_year, platform_tenure_days, daily_active_hours, followed_creator_count, avg_creator_interaction_depth) use sorted numeric indexes that support range predicates.
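
A sorted numeric index with range support can be approximated with a `BTreeMap` keyed by field value. The sketch below is an assumption about shape, not the actual index structure; `NumericIndex` is a hypothetical name and `u32` is chosen arbitrarily as the value type.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Sorted value-to-users index for one numeric field (e.g. birth_year).
#[derive(Default)]
pub struct NumericIndex {
    by_value: BTreeMap<u32, BTreeSet<u64>>,
}

impl NumericIndex {
    pub fn insert(&mut self, value: u32, user_id: u64) {
        self.by_value.entry(value).or_default().insert(user_id);
    }

    /// All users whose value lies in the inclusive range [lo, hi].
    pub fn range(&self, lo: u32, hi: u32) -> BTreeSet<u64> {
        // BTreeMap::range panics when start > end, so guard explicitly.
        if lo > hi {
            return BTreeSet::new();
        }
        self.by_value
            .range(lo..=hi)
            .flat_map(|(_, users)| users.iter().copied())
            .collect()
    }
}
```

The returned set can be intersected with keyword-field sets to combine range and term predicates in one cohort.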

Bitmap Freshness

Application-set field bitmaps are updated synchronously on entity write. Database-computed field bitmaps are updated when the computed field is refreshed (hourly or daily, per the background computation schedule). This means cohort queries over computed fields reflect the last background computation, not real-time state. For most cohort use cases (trending among power users, popular in a demographic), hourly freshness is sufficient.

If sub-second freshness is required for a specific computed field, the application can call db.recompute_field(entity_id, field_name) to trigger immediate recomputation and re-indexing. This should be used sparingly.

Memory Budget for Cohort Indexes

At 10M users with the field set defined above, the bitmap indexes require approximately:

  • Low-cardinality keyword fields (region, age_range, engagement_level, etc.): ~50 MB total (roaring bitmaps compress well when cardinality is low)
  • High-cardinality keyword fields (explicit_interests, inferred_interests): ~500 MB total (10,000 terms, average 1,000 users per term, roaring bitmap of 1,000 u64s each)
  • Numeric range indexes: ~80 MB total

Total: approximately 630 MB for full cohort resolution capability over 10M users. This fits comfortably within the memory budget recommended in docs/research/tidaldb_signal_ledger.md.


Signal Ledger Attachment

Every entity automatically receives a signal ledger at creation time. The ledger is not part of the entity's metadata schema -- it is an intrinsic property of being an entity. Signal types and their behavior are defined separately via define_signal() (see the Signal Specification).

What the Ledger Contains

For each signal type defined in the schema and targeting this entity kind:

| Component | Storage | Purpose |
|---|---|---|
| Running decay scores | [f64; N] per lambda | O(1) read of decayed signal value at query time |
| Windowed counters | Bucketed counters per window | Windowed aggregation (1h, 24h, 7d, 30d, all_time) |
| Velocity state | Derived from windowed counters | Rate-of-change computation |
| Last update timestamp | u64 (nanoseconds) | Decay computation reference point |

The ledger follows the three-tier architecture from docs/research/tidaldb_signal_ledger.md:

  • Tier 1 (in-memory): Running decay scores, SWAG-backed windowed counters, recent events. ~80 bytes per entity per signal type.
  • Tier 2 (disk): Raw signal events, time-partitioned with FIFO compaction, 7-day retention.
  • Tier 3 (materialized rollups): Hourly and daily aggregates for longer windows.

Ledger Initialization

At entity creation:

// Pseudocode -- internal to the database, not public API
fn initialize_ledger(entity_id: EntityId, signal_types: &[SignalDef]) {
    for signal in signal_types {
        ledger.set_decay_scores(entity_id, signal.name, [0.0; N_LAMBDAS]);
        ledger.set_last_update(entity_id, signal.name, creation_time_ns);
        ledger.init_windowed_counters(entity_id, signal.name, &signal.windows);
    }
}

All scores start at zero. The last_update is set to creation time so that the first signal write computes correct decay deltas.
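
To see why the initial timestamp matters, consider a running score that is decayed lazily from its last update time. The sketch below assumes plain exponential decay with a single rate; the actual decay models are defined in the Signal Specification, and `DecayState` with its methods is a hypothetical name.

```rust
/// One running decay score with its reference timestamp.
/// Exponential decay is an assumption made for this sketch.
pub struct DecayState {
    pub score: f64,
    pub last_update_ns: u64,
}

impl DecayState {
    /// A new ledger entry starts at zero, anchored at creation time.
    pub fn new(creation_time_ns: u64) -> Self {
        Self { score: 0.0, last_update_ns: creation_time_ns }
    }

    fn decay_factor(&self, now_ns: u64, lambda_per_sec: f64) -> f64 {
        let dt_secs = now_ns.saturating_sub(self.last_update_ns) as f64 / 1e9;
        (-lambda_per_sec * dt_secs).exp()
    }

    /// Signal write: decay the old score up to `now_ns`, then add the new weight.
    pub fn record(&mut self, now_ns: u64, weight: f64, lambda_per_sec: f64) {
        self.score = self.score * self.decay_factor(now_ns, lambda_per_sec) + weight;
        self.last_update_ns = self.last_update_ns.max(now_ns);
    }

    /// Query-time read: O(1), no event history needed.
    pub fn read(&self, now_ns: u64, lambda_per_sec: f64) -> f64 {
        self.score * self.decay_factor(now_ns, lambda_per_sec)
    }
}
```

With a rate of ln(2) per second (a one-second half-life), a score of 1.0 written at t = 2s reads as 0.5 at t = 3s.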


Storage Representation

Entities are stored using the key encoding pattern from CODING_GUIDELINES.md, following the subject-prefix design from thoughts.md:

[entity_kind: u8][entity_id: u64 BE][0x00][TAG]:[suffix]

Tags:
  META           → serialized metadata (all fields)
  EMB:slot_name  → raw embedding vector bytes
  SIG:type:win   → signal windowed aggregate
  REL:kind       → relationship edge list
  STATE          → entity lifecycle state (active/archived)

Examples

[0x01][0x0000000000000ABC][0x00][META]           → Item item_abc metadata
[0x01][0x0000000000000ABC][0x00][EMB:content]    → Item item_abc content embedding
[0x01][0x0000000000000ABC][0x00][SIG:view:24h]   → Item item_abc view count, 24h window
[0x01][0x0000000000000ABC][0x00][REL:created_by] → Item item_abc → creator link

[0x02][0x000000000000007B][0x00][META]           → User user_123 metadata
[0x02][0x000000000000007B][0x00][EMB:preference] → User user_123 preference vector

[0x03][0x00000000000000FF][0x00][META]           → Creator creator_xyz metadata
[0x03][0x00000000000000FF][0x00][EMB:catalog]    → Creator creator_xyz catalog vector

Entity kind byte values:

| Kind | Byte |
|---|---|
| Item | 0x01 |
| User | 0x02 |
| Creator | 0x03 |

This encoding co-locates all data for a single entity under one key prefix, enabling efficient prefix scans (fetch all state for one entity) and natural shard boundaries. Per-entity-type storage isolation (separate column families or keyspaces) prevents cross-entity-type contention as recommended in thoughts.md.
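
A minimal sketch of the key layout follows, assuming the tag and suffix are written as ASCII bytes after the 0x00 separator. The function names are hypothetical, and the exact tag serialization is not specified in this section.

```rust
/// Entity kind byte values from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum EntityKind {
    Item = 0x01,
    User = 0x02,
    Creator = 0x03,
}

/// Shared prefix for all keys of one entity: [kind][id as big-endian u64][0x00].
pub fn entity_prefix(kind: EntityKind, entity_id: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(10);
    key.push(kind as u8);
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key
}

/// Full key: prefix followed by the tag, e.g. "META" or "EMB:content".
/// Writing the tag as ASCII bytes is an assumption of this sketch.
pub fn encode_key(kind: EntityKind, entity_id: u64, tag: &str) -> Vec<u8> {
    let mut key = entity_prefix(kind, entity_id);
    key.extend_from_slice(tag.as_bytes());
    key
}
```

Big-endian encoding makes byte-wise key order match numeric ID order, so a prefix scan over `entity_prefix(kind, id)` returns every tag for that entity and nothing else.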

Entity ID Encoding

Entity IDs are provided by the application as strings (e.g., "item_abc", "user_123"). Internally, they are hashed to u64 using BLAKE3 for compact, fixed-width storage and comparison. The original string ID is stored in metadata for external reference. Collisions in a 64-bit hash are rare at expected scale: by the birthday bound, the probability of any collision is on the order of 10^-6 at 10M entities and only approaches 40-50% around 2^32 (~4 billion) entities. The system still detects collisions at write time and returns SchemaError::IdCollision if one occurs.
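
The birthday-bound estimate for a 64-bit hash can be checked with a short calculation. This sketch uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^65) and assumes the hash output is uniformly distributed; the function name is illustrative.

```rust
/// Approximate probability that at least two of `n` entity IDs share
/// the same 64-bit hash, assuming uniformly distributed hash values.
pub fn collision_probability(n: f64) -> f64 {
    let space = 2f64.powi(64);
    let x = n * (n - 1.0) / (2.0 * space);
    // 1 - e^(-x), computed with exp_m1 to keep precision when x is tiny.
    -(-x).exp_m1()
}
```

At 10 million entities this evaluates to roughly 2.7 in a million; at 2^32 entities it is about 0.39.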


Design Rationale

Why the User Model Expanded From 2 Fields to 20+

The original API.md user entity had language and region. This is sufficient for a single-user personalization model where ranking depends entirely on the user's signal history and preference vector. It is woefully insufficient for cohort-based queries.

The thesis of tidalDB includes replacing the feature store. A feature store's primary job in the content ranking stack is to answer "given this user's attributes and behavior, what segment do they belong to, and what is trending/popular/rising within that segment?" Without rich user attributes, tidalDB cannot answer this question. The user would need an external feature store, which defeats the single-system thesis.

The expanded model enables three categories of queries that the 2-field model cannot:

  1. Demographic cohorts: "Trending among US users aged 18-24" -- requires region, age_range.
  2. Behavioral cohorts: "Popular among power users who prefer short-form" -- requires engagement_level, content_format_preference.
  3. Interest cohorts: "Rising in jazz among users who have shown interest in jazz" -- requires explicit_interests, inferred_interests.

Why Computed Fields Are a Separate Category

Behavioral segments like engagement_level change continuously as users interact with the platform. If the application were responsible for computing and writing these, it would need to:

  1. Maintain signal frequency counters per user
  2. Run classification logic on every signal write
  3. Write the result back to the database

This is exactly the feature-store-plus-Kafka pattern that tidalDB replaces. By making these fields database-computed, the feedback loop closes natively. The signal write updates the signal ledger, the background computation reads the ledger to classify the user, and the next cohort query sees the updated classification. One system.

Why Items Have Many Fields

Every field on the Item entity maps to a filter dimension in USE_CASES.md Appendix A. The filter reference lists 30+ filterable dimensions. Each dimension must be represented as a field on the entity so the database can build the appropriate index. Removing a field means removing a filter that real users on real platforms use daily.

The alternative -- a generic JSON field for "other metadata" -- sacrifices indexing. A JSON field cannot be efficiently filtered, faceted, or range-scanned. Every field that appears in a filter predicate must be a typed, indexed field.

Why Multiple Embedding Slots

UC-11 (Visual and Semantic Search) requires searching by image similarity. UC-02 requires text/semantic search. These are fundamentally different vector spaces with different dimensionality and different models. Forcing them into a single embedding slot would require either:

  1. Training a multi-modal embedding (impractical for most teams)
  2. Concatenating vectors (destroys distance metric quality)
  3. Maintaining only one search modality (loses functionality)

Multiple slots, each with its own HNSW index, keep vector spaces clean and searchable independently while allowing the query planner to choose which space to search based on the query.

Why Entity IDs Are Hashed to u64

String comparison is 5-10x slower than integer comparison for key lookups. Signal writes and ranking queries perform thousands of entity lookups per operation. The 8-byte fixed-width key enables:

  1. Cache-line-friendly key encoding (aligned, fixed size)
  2. Fast comparison in hot-path data structures
  3. Compact storage in roaring bitmaps (u64 values)
  4. Deterministic key ordering (big-endian u64 sort)

The original string ID is preserved in metadata for external reference and API responses. The hash is an internal optimization.