jordan 413b712c0a chore: initialize tidalDB repository with schema foundation and standards

- Schema phase 1 (tasks 01-02): EntityId, EntityKind, Timestamp, Score, SignalTypeDef, DecayModel, Window, WindowSet — all with property tests and benchmarks scaffolding
- Stub modules for storage, signals, query, ranking
- Full documentation suite: VISION, USE_CASES, SEQUENCE, API, CODING_GUIDELINES, ai-lookup, research docs, specs, roadmap, planning docs
- Marketing site (Next.js) with blog infrastructure
- .claude/ agents and skills for the tidalDB development workflow
- Foundation standards enforced: thiserror + tracing declared as dependencies, clippy::unwrap_used = deny added to lint config
- .gitignore hardened: .next/, node_modules/, .env, secrets, logs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 12:52:20 -07:00


02 -- Entity Model Specification

The entity model defines the three core domain objects in tidalDB: Items (content), Users (consumers), and Creators (producers). Every entity has metadata fields, an embedding slot, and an attached signal ledger. The model is designed to support cohort-based targeting, personalized ranking, and the full query surface described in VISION.md and USE_CASES.md.

This specification covers entity schemas, field types, lifecycle semantics, embedding management, and the cohort-ready attribute design that enables queries like "what is trending among US users aged 18-24 who are interested in jazz."

Design Principles
Field Type Reference
Entity Relationships Diagram
Item Entity
User Entity
Creator Entity
Field Writability Model
Entity Lifecycle
Embedding Management
Cohort-Ready Design
Signal Ledger Attachment
Storage Representation
Design Rationale

Design Principles

Entities are nodes, not rows. An entity is not a collection of columns in a table. It is a node in a graph with metadata, embeddings, a signal ledger, and relationship edges. The database reasons about entities holistically -- not as field bags.

Some fields are yours; some are ours. The entity model distinguishes between application-set fields (written by the caller) and database-computed fields (maintained by tidalDB). The application sets demographic attributes on a user. The database computes behavioral segments from signal patterns. Neither overwrites the other.

Rich attributes enable cohort queries. A user entity with two fields (language, region) cannot answer "what is trending among power users in Japan who prefer short-form video." The user model must carry enough dimensionality to resolve cohort membership efficiently at query time.

Every field earns its index. Fields exist because a query needs them. Every field in this spec can be traced to a filter, sort mode, ranking profile signal, or cohort predicate in USE_CASES.md.

Field Type Reference

Every metadata field on an entity has a declared type that determines its indexing behavior, storage format, and query semantics.

Type	Storage	Indexed As	Query Operations	Example
`text`	UTF-8 string	Inverted index (BM25, tokenized)	Full-text search, phrase match, field-scoped search	`title`, `description`
`keyword`	UTF-8 string	Term dictionary, exact match	Equality, IN-list, faceting	`category`, `locale`
`keywords`	`Vec<String>`	Term dictionary per value	Equality per value, IN-list, faceting	`tags`, `explicit_interests`
`i64`	64-bit signed integer	Sorted numeric index	Range, equality, min/max, sort	`birth_year`, `follower_count`
`f64`	64-bit float	Sorted numeric index	Range, equality, min/max, sort	`avg_completion_rate`
`bool`	1-bit boolean	Boolean index	Equality	`verified`, `has_subtitles`
`timestamp`	UTC nanoseconds (`i64`)	Sorted numeric index	Range, presets (`today`, `this_week`), since	`created_at`, `first_signal_at`
`duration`	Seconds (`f64`)	Sorted numeric index	Range, presets (`short`, `medium`, `long`), sort	`duration`
`embedding`	`Vec<f32>` or quantized	HNSW (USearch)	ANN search, cosine similarity	`content_vector`, `preference_vector`
`computed`	Varies (keyword, keywords, i64, f64, timestamp)	Same as underlying type	Same as underlying type	`engagement_level`, `inferred_interests`

computed fields are a special category. They have an underlying storage type (keyword, keywords, i64, f64, or timestamp) and are indexed identically to that type. The distinction is write semantics: computed fields are not directly writable by the application. They are maintained by the database based on signal patterns, relationship state, or periodic background computation. Attempting to set a computed field via any write_*() or update_*() call returns SchemaError::ComputedFieldWrite.

Entity Relationships Diagram

                           ┌──────────────┐
                           │    User      │
                           │              │
                           │  metadata    │
                           │  embedding   │
                           │  signals     │
                           └──────┬───────┘
                                  │
                    ┌─────────────┼─────────────┐
                    │             │             │
              follows/blocks   viewed/liked   interacted
              (Relationship)   (Signal)       (Relationship)
                    │             │             │
                    ▼             ▼             ▼
            ┌──────────────┐          ┌──────────────┐
            │   Creator    │◄─────────│    Item      │
            │              │ created  │              │
            │  metadata    │          │  metadata    │
            │  embedding   │          │  embedding   │
            │  signals     │          │  signals     │
            └──────────────┘          └──────────────┘

  Relationship edges:
    User ──follows──▶ Creator       (permanent, weight)
    User ──blocks───▶ Creator       (permanent, hard filter)
    User ──viewed───▶ Item          (signal-derived)
    User ──liked────▶ Item          (signal-derived)
    User ──saved────▶ Item          (explicit)
    User ──hid──────▶ Item          (permanent negative)
    Item ──created_by──▶ Creator    (structural, immutable)
    Creator ──similar_to──▶ Creator (computed, embedding distance)
    Item ──similar_to──▶ Item       (computed, embedding distance)

Every entity participates in two kinds of connections:

Relationships -- explicit, weighted, directional edges managed via write_relationship(). Used for follows, blocks, saves, collections.
Signal-derived state -- implicit edges created automatically when signals are written. A view signal on an item by a user creates a user-item "seen" edge. A like creates a user-item "liked" edge. These are queryable via Filter::unseen(), Filter::user_state("liked"), etc.
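The signal-derived edges described above can be sketched with a small in-memory model. This is an illustration only: `SignalState`, `write_signal`, `has_state`, and `unseen` are hypothetical names, and the mapping from signal names to state names ("view" to "seen", "like" to "liked") is taken from the examples in the text, not from the real storage layer.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory store of signal-derived user-item edges.
#[derive(Default)]
pub struct SignalState {
    // (user id, state name) -> item ids carrying that state for the user
    edges: HashMap<(String, &'static str), HashSet<String>>,
}

impl SignalState {
    /// Writing a signal implicitly records the matching edge.
    /// Signals with no derived state in this sketch are ignored.
    pub fn write_signal(&mut self, user: &str, item: &str, signal: &str) {
        let state = match signal {
            "view" => "seen",
            "like" => "liked",
            _ => return,
        };
        self.edges
            .entry((user.to_string(), state))
            .or_default()
            .insert(item.to_string());
    }

    /// Rough analogue of a user-state filter such as "liked".
    pub fn has_state(&self, user: &str, item: &str, state: &'static str) -> bool {
        self.edges
            .get(&(user.to_string(), state))
            .map_or(false, |items| items.contains(item))
    }

    /// Rough analogue of an "unseen" filter: keeps candidates without a "seen" edge.
    pub fn unseen<'a>(&self, user: &str, candidates: &[&'a str]) -> Vec<&'a str> {
        candidates
            .iter()
            .copied()
            .filter(|item| !self.has_state(user, item, "seen"))
            .collect()
    }
}
```

In this sketch a like does not imply a view; whether tidalDB treats a liked item as seen is not stated in this section.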

Item Entity

Items are the content that gets ranked. Videos, articles, images, audio tracks, podcasts, live streams, galleries -- anything a user consumes and engages with.

Every item belongs to exactly one creator (the creator_id link). Items carry metadata for filtering and display, one or more embedding slots for semantic retrieval, and a signal ledger that accumulates engagement data.

Schema Definition

db.define_entity(EntityDef {
    kind: EntityKind::Item,
    metadata_fields: vec![
        // --- Text fields: full-text indexed, searchable via BM25 ---
        Field::text("title"),
        Field::text("description"),

        // --- Keyword fields: exact match, filterable, facetable ---
        Field::keyword("category"),           // primary category: "music", "gaming", "cooking"
        Field::keywords("tags"),              // multi-value: ["jazz", "piano", "tutorial"]
        Field::keyword("format"),             // video, short, live, vod, podcast, article, image, gallery, audio
        Field::keyword("language"),           // ISO 639-1: "en", "ja", "es"
        Field::keywords("subtitle_languages"),// available subtitle languages
        Field::keywords("dubbed_languages"),  // available dub languages
        Field::keyword("content_rating"),     // G, PG, PG-13, R, NC-17
        Field::keyword("status"),             // published, live, scheduled, archived, draft
        Field::keyword("availability"),       // free, premium, subscriber_only, rental
        Field::keyword("resolution"),         // SD, HD, FHD, 4K, 8K
        Field::keyword("audio_quality"),      // standard, high, lossless, spatial
        Field::keyword("content_region"),     // geographic origin: "US", "JP"
        Field::keyword("post_type"),          // text, link, image, video, poll (forum-style)
        Field::keywords("hashtags"),          // #jazz, #tutorial
        Field::keyword("flair"),              // community-specific label

        // --- Numeric fields: range-filterable, sortable ---
        Field::i64("award_count"),            // community awards/gilding count

        // --- Boolean fields: filterable ---
        Field::bool("has_subtitles"),
        Field::bool("has_audio_description"),
        Field::bool("has_sign_language"),
        Field::bool("downloadable"),
        Field::bool("hdr"),
        Field::bool("is_original"),           // not a crosspost/repost
        Field::bool("safe_search"),           // passes safe-search filter

        // --- Duration: range-filterable, sortable, preset-filterable ---
        Field::duration("duration"),

        // --- Timestamps: range-filterable, sortable ---
        Field::timestamp("created_at"),
        Field::timestamp("updated_at"),
        Field::timestamp("scheduled_at"),     // for premieres / scheduled live
        Field::timestamp("available_until"),  // for "leaving soon" filter
    ],
    // Primary content embedding -- externally computed, DB-indexed.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "content",              // text/semantic content vector
                dimensions: 1536,
                source: EmbeddingSource::External,
            },
        ],
    },
})?;

Field Summary Table

Field	Type	Writability	Indexed	Used By
`title`	text	app-set	BM25 inverted	UC-02 search, UC-06 alphabetical sort
`description`	text	app-set	BM25 inverted	UC-02 search
`category`	keyword	app-set	term dictionary	UC-03 scoped trending, UC-06 browse, cohort
`tags`	keywords	app-set	term dictionary	UC-02 search, UC-06 filter
`format`	keyword	app-set	term dictionary	UC-01 format filter, UC-06 browse, diversity
`language`	keyword	app-set	term dictionary	UC-02 language filter
`subtitle_languages`	keywords	app-set	term dictionary	UC-02 accessibility filter
`dubbed_languages`	keywords	app-set	term dictionary	UC-02 accessibility filter
`content_rating`	keyword	app-set	term dictionary	UC-02 maturity filter
`status`	keyword	app-set	term dictionary	UC-12 live filter
`availability`	keyword	app-set	term dictionary	UC-02 availability filter
`resolution`	keyword	app-set	term dictionary	UC-02 quality filter
`audio_quality`	keyword	app-set	term dictionary	UC-02 quality filter
`content_region`	keyword	app-set	term dictionary	UC-02 geographic filter, cohort
`post_type`	keyword	app-set	term dictionary	UC-14 forum filtering
`hashtags`	keywords	app-set	term dictionary	UC-02 hashtag search
`flair`	keyword	app-set	term dictionary	UC-14 community filter
`award_count`	i64	app-set	sorted numeric	UC-14 gilded filter
`has_subtitles`	bool	app-set	boolean	UC-02 accessibility filter
`has_audio_description`	bool	app-set	boolean	UC-02 accessibility filter
`has_sign_language`	bool	app-set	boolean	UC-02 accessibility filter
`downloadable`	bool	app-set	boolean	UC-09 download filter
`hdr`	bool	app-set	boolean	UC-02 quality filter
`is_original`	bool	app-set	boolean	UC-14 original-only filter
`safe_search`	bool	app-set	boolean	UC-02 safe search toggle
`duration`	duration	app-set	sorted numeric	UC-02 duration filter, UC-06 shortest/longest sort
`created_at`	timestamp	app-set	sorted numeric	UC-04 chronological, UC-06 date filter
`updated_at`	timestamp	app-set	sorted numeric	change tracking
`scheduled_at`	timestamp	app-set	sorted numeric	UC-12 scheduled content
`available_until`	timestamp	app-set	sorted numeric	UC-02 "leaving soon" filter
`content` (embedding)	embedding	app-set	HNSW (USearch)	UC-01 ANN retrieval, UC-02 semantic search, UC-05 related

Additional Embedding Slots

Applications may define additional embedding slots for multi-modal retrieval:

EmbeddingSlot {
    name: "visual",               // image/thumbnail embedding
    dimensions: 512,
    source: EmbeddingSource::External,
},
EmbeddingSlot {
    name: "audio",                // audio fingerprint embedding
    dimensions: 256,
    source: EmbeddingSource::External,
},

Each slot gets its own HNSW index. Queries specify which embedding to search against. This supports UC-11 (visual/semantic search) without overloading a single vector space.
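A minimal sketch of the one-index-per-slot idea follows. It swaps the HNSW index for a brute-force cosine scan so the example stays self-contained; `SlotRegistry`, `SlotIndex`, `SlotError`, and their methods are hypothetical names, and the dimension check at insertion is an assumption about how a slot's declared dimensions might be enforced.

```rust
use std::collections::HashMap;

/// One index per embedding slot. A real deployment would hold an HNSW index here;
/// this sketch stores raw vectors and scans them.
pub struct SlotIndex {
    dimensions: usize,
    vectors: Vec<(String, Vec<f32>)>,
}

#[derive(Default)]
pub struct SlotRegistry {
    slots: HashMap<String, SlotIndex>,
}

#[derive(Debug, PartialEq)]
pub enum SlotError {
    UnknownSlot,
    DimensionMismatch { expected: usize, got: usize },
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl SlotRegistry {
    pub fn define_slot(&mut self, name: &str, dimensions: usize) {
        let index = SlotIndex { dimensions, vectors: Vec::new() };
        self.slots.insert(name.to_string(), index);
    }

    /// Inserts a vector into the named slot, rejecting wrong-sized vectors.
    pub fn insert(&mut self, slot: &str, id: &str, v: Vec<f32>) -> Result<(), SlotError> {
        let index = self.slots.get_mut(slot).ok_or(SlotError::UnknownSlot)?;
        if v.len() != index.dimensions {
            return Err(SlotError::DimensionMismatch { expected: index.dimensions, got: v.len() });
        }
        index.vectors.push((id.to_string(), v));
        Ok(())
    }

    /// Searches only the named slot, so vector spaces never mix.
    pub fn nearest(&self, slot: &str, query: &[f32]) -> Result<Option<&str>, SlotError> {
        let index = self.slots.get(slot).ok_or(SlotError::UnknownSlot)?;
        if query.len() != index.dimensions {
            return Err(SlotError::DimensionMismatch { expected: index.dimensions, got: query.len() });
        }
        Ok(index
            .vectors
            .iter()
            .max_by(|a, b| cosine(&a.1, query).total_cmp(&cosine(&b.1, query)))
            .map(|(id, _)| id.as_str()))
    }
}
```

Keeping the slot name in every call mirrors the statement that queries specify which embedding to search against.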

User Entity

Users are the consumers of content. They generate signals (views, likes, skips, hides), accumulate preference profiles, and form relationships with creators and items.

The user entity carries two categories of fields:

Application-set fields -- demographic and preference data the application writes explicitly. These are known at registration time or provided by the user.
Database-computed fields -- behavioral segments, interest profiles, and engagement patterns derived from signal history. The database maintains these automatically. The application reads them (for display, analytics, cohort targeting) but never writes them directly.

This distinction is the foundation of cohort targeting. An application sets locale: "en-US" and birth_year: 2001. The database computes engagement_level: "power_user" and inferred_interests: ["jazz", "piano", "music_theory"]. A cohort query combines both: locale:en-US AND age_range:18-24 AND engagement_level:power_user AND inferred_interests:jazz.
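The schema stores both birth_year and an explicit age_range bucket. One way an application might derive the bucket from the birth year is sketched here. The function name `age_range` and the `current_year` parameter are hypothetical; the bucket labels are the ones listed for the age_range field. Because only the year is known, the result can be off by one year around a birthday.

```rust
/// Maps a birth year to one of the age_range buckets used by the user schema.
/// Returns None for ages under 13, including birth years in the future.
pub fn age_range(birth_year: i64, current_year: i64) -> Option<&'static str> {
    let age = current_year - birth_year;
    match age {
        13..=17 => Some("13-17"),
        18..=24 => Some("18-24"),
        25..=34 => Some("25-34"),
        35..=44 => Some("35-44"),
        45..=54 => Some("45-54"),
        a if a >= 55 => Some("55+"),
        _ => None,
    }
}
```

Since age_range is an app-set field, the application would need to rewrite it as users age; the spec does not say the database does this.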

Schema Definition

db.define_entity(EntityDef {
    kind: EntityKind::User,
    metadata_fields: vec![
        // ================================================================
        // APPLICATION-SET: Demographic Attributes
        // Written by the application at registration or profile update.
        // ================================================================
        Field::keyword("locale"),             // full locale: "en-US", "ja-JP", "es-MX"
        Field::keyword("language"),           // preferred content language: "en", "ja"
        Field::keyword("region"),             // geographic region: "US", "JP", "DE"
        Field::keyword("timezone"),           // IANA timezone: "America/New_York", "Asia/Tokyo"
        Field::i64("birth_year"),             // for age-based cohort bucketing (optional)
        Field::keyword("age_range"),          // explicit bucket: "13-17", "18-24", "25-34", "35-44", "45-54", "55+"
        Field::keyword("gender"),             // optional: "male", "female", "non-binary", "undisclosed"
        Field::keyword("account_type"),       // free, premium, creator, admin
        Field::keywords("explicit_interests"),// stated interests at signup: ["jazz", "cooking", "rust"]
        Field::keywords("preferred_formats"), // stated format preference: ["video", "short"]

        // ================================================================
        // DATABASE-COMPUTED: Interest Profile
        // Derived from engagement patterns. Updated by background computation.
        // ================================================================
        Field::computed("inferred_interests", FieldType::Keywords),
            // keywords derived from engagement history.
            // top N topics by weighted engagement volume.
            // e.g., ["jazz", "piano", "music_theory", "cooking", "rust"]
            // updated: every signal write triggers incremental update;
            //          full recomputation on background schedule.

        Field::computed("primary_categories", FieldType::Keywords),
            // top categories by engagement volume (coarser than interests).
            // e.g., ["music", "programming", "food"]
            // updated: background computation, hourly.

        // ================================================================
        // DATABASE-COMPUTED: Behavioral Segments
        // Derived from signal frequency, patterns, and recency.
        // ================================================================
        Field::computed("engagement_level", FieldType::Keyword),
            // power_user:  > 50 signals/day, 7-day streak
            // regular:     10-50 signals/day, active 4+ days/week
            // casual:      1-10 signals/day, active 1-3 days/week
            // dormant:     < 1 signal/day for 7+ days
            // new:         < 7 days since first signal
            // updated: background computation, every 6 hours.

        Field::computed("content_format_preference", FieldType::Keyword),
            // short:  > 60% of completions are items with duration < 4min
            // long:   > 60% of completions are items with duration > 20min
            // mixed:  neither threshold met
            // updated: background computation, daily.

        Field::computed("session_pattern", FieldType::Keyword),
            // binge:      avg session > 30min, sequential consumption
            // browsing:   avg session 5-30min, diverse consumption
            // searching:  > 40% of sessions start with search
            // updated: background computation, daily.

        Field::computed("platform_tenure_days", FieldType::I64),
            // days since first signal was written for this user.
            // updated: on every signal write (trivial computation).

        Field::computed("daily_active_hours", FieldType::F64),
            // average number of distinct hours with signal activity per day.
            // computed over trailing 7-day window.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Creator Relationship Profile
        // Derived from relationship graph and signal patterns.
        // ================================================================
        Field::computed("followed_creator_count", FieldType::I64),
            // count of active "follows" relationships.
            // updated: on relationship write (increment/decrement).

        Field::computed("avg_creator_interaction_depth", FieldType::F64),
            // average interaction_weight across all followed creators.
            // 0.0 = passive scroller, 1.0 = deeply engaged with every follow.
            // updated: background computation, daily.
    ],
    // User preference vector -- managed by the database.
    // Updated automatically on every signal write: shifted toward
    // (positive signal) or away from (negative signal) the item's embedding.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "preference",
                dimensions: 1536,
                source: EmbeddingSource::DatabaseManaged,
            },
        ],
    },
})?;
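The thresholds in the behavioral segment comments above can be sketched as plain classification functions. The comments leave precedence and edge cases open, so this sketch assumes that "new" is checked first, that a user who misses one tier's activity requirement falls through to the next tier, and that rates are averaged over the trailing 7 days. `Activity` and both function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngagementLevel {
    New,
    PowerUser,
    Regular,
    Casual,
    Dormant,
}

/// Hypothetical summary of a user's recent signal activity.
pub struct Activity {
    pub days_since_first_signal: u32,
    pub avg_signals_per_day: f64, // assumed: trailing 7-day average
    pub active_days_last_7: u32,
}

pub fn engagement_level(a: &Activity) -> EngagementLevel {
    // Assumption: tenure under 7 days wins over every other rule.
    if a.days_since_first_signal < 7 {
        return EngagementLevel::New;
    }
    let rate = a.avg_signals_per_day;
    if rate > 50.0 && a.active_days_last_7 >= 7 {
        EngagementLevel::PowerUser
    } else if rate >= 10.0 && a.active_days_last_7 >= 4 {
        EngagementLevel::Regular
    } else if rate >= 1.0 && a.active_days_last_7 >= 1 {
        EngagementLevel::Casual
    } else {
        EngagementLevel::Dormant
    }
}

/// Classifies format preference from the durations (seconds) of completed items:
/// "short" if more than 60% are under 4 minutes, "long" if more than 60% are
/// over 20 minutes, otherwise "mixed".
pub fn content_format_preference(completed_secs: &[f64]) -> &'static str {
    if completed_secs.is_empty() {
        return "mixed";
    }
    let total = completed_secs.len() as f64;
    let short = completed_secs.iter().filter(|&&d| d < 240.0).count() as f64 / total;
    let long = completed_secs.iter().filter(|&&d| d > 1200.0).count() as f64 / total;
    if short > 0.6 {
        "short"
    } else if long > 0.6 {
        "long"
    } else {
        "mixed"
    }
}
```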

Field Summary Table

Field	Type	Writability	Indexed	Used By
`locale`	keyword	app-set	term dictionary	cohort targeting, content language matching
`language`	keyword	app-set	term dictionary	content language filter
`region`	keyword	app-set	term dictionary	geographic cohort, regional trending
`timezone`	keyword	app-set	term dictionary	time-aware ranking, notification timing
`birth_year`	i64	app-set	sorted numeric	age-based cohort bucketing
`age_range`	keyword	app-set	term dictionary	age-based cohort targeting
`gender`	keyword	app-set	term dictionary	demographic cohort targeting
`account_type`	keyword	app-set	term dictionary	feature gating, cohort
`explicit_interests`	keywords	app-set	term dictionary	cold-start preference seeding, cohort
`preferred_formats`	keywords	app-set	term dictionary	format ranking boost, cohort
`inferred_interests`	computed (keywords)	db-computed	term dictionary	interest-based cohort, profile display
`primary_categories`	computed (keywords)	db-computed	term dictionary	category-based cohort
`engagement_level`	computed (keyword)	db-computed	term dictionary	behavioral cohort
`content_format_preference`	computed (keyword)	db-computed	term dictionary	format-based cohort
`session_pattern`	computed (keyword)	db-computed	term dictionary	behavioral cohort
`platform_tenure_days`	computed (i64)	db-computed	sorted numeric	tenure-based cohort
`daily_active_hours`	computed (f64)	db-computed	sorted numeric	engagement depth cohort
`followed_creator_count`	computed (i64)	db-computed	sorted numeric	social graph cohort
`avg_creator_interaction_depth`	computed (f64)	db-computed	sorted numeric	engagement depth cohort
`preference` (embedding)	embedding	db-managed	HNSW (USearch)	UC-01 For You ANN retrieval
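The schema comment says the preference vector is shifted toward an item's embedding on a positive signal and away from it on a negative one, but it does not give the update rule. A minimal sketch, assuming a simple additive step followed by re-normalization to unit length; `shift_preference`, `weight`, and `rate` are hypothetical names, and the step size is an arbitrary illustrative parameter.

```rust
/// Moves `pref` toward `item` when `weight` is positive and away when it is
/// negative, then rescales to unit length. `rate` controls the step size.
/// Assumes both vectors have the same dimensionality.
pub fn shift_preference(pref: &mut [f32], item: &[f32], weight: f32, rate: f32) {
    assert_eq!(pref.len(), item.len(), "dimension mismatch");
    for (p, e) in pref.iter_mut().zip(item) {
        *p += rate * weight * e;
    }
    let norm = pref.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for p in pref.iter_mut() {
            *p /= norm;
        }
    }
}
```

For example, starting from `[1.0, 0.0]`, a like on an item with embedding `[0.0, 1.0]` gives the second component a positive value, while a hide on the same item gives it a negative value.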

Cohort Query Examples

With the expanded user model, tidalDB can resolve cohort predicates at query time:

-- Trending among US users aged 18-24 who like jazz
RETRIEVE items
USING PROFILE trending
FOR COHORT region:US AND age_range:18-24 AND (explicit_interests:jazz OR inferred_interests:jazz)
LIMIT 25

-- Popular among power users who prefer long-form content
RETRIEVE items
USING PROFILE top_week
FOR COHORT engagement_level:power_user AND content_format_preference:long
LIMIT 25

-- Rising content among new users (cold-start cohort)
RETRIEVE items
USING PROFILE rising
FOR COHORT engagement_level:new AND platform_tenure_days<30
LIMIT 25

The FOR COHORT clause resolves to a user set, aggregates their signal patterns over the matching items, and ranks accordingly. This is the mechanism that replaces the "feature store" in the traditional stack.
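The two steps named above, resolving a user set and then aggregating that set's signals, can be sketched as follows. This is not the query engine: `Value`, `Predicate`, `in_cohort`, and `cohort_item_counts` are hypothetical names, and raw signal counts stand in for a real ranking profile such as trending.

```rust
use std::collections::HashMap;

/// A user attribute value, covering the field types used in the cohort examples.
pub enum Value {
    Keyword(String),
    Keywords(Vec<String>),
    I64(i64),
}

/// Hypothetical predicate tree for a FOR COHORT clause.
pub enum Predicate {
    /// `field:value` -- keyword equality, or membership for keywords fields.
    Has(&'static str, &'static str),
    /// `field<bound` on an i64 field.
    LessThan(&'static str, i64),
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

pub type User = HashMap<&'static str, Value>;

/// Step 1: does this user belong to the cohort? Missing fields never match.
pub fn in_cohort(user: &User, p: &Predicate) -> bool {
    match p {
        Predicate::Has(field, want) => match user.get(field) {
            Some(Value::Keyword(v)) => v == want,
            Some(Value::Keywords(vs)) => vs.iter().any(|v| v == want),
            _ => false,
        },
        Predicate::LessThan(field, bound) => {
            matches!(user.get(field), Some(Value::I64(n)) if n < bound)
        }
        Predicate::And(a, b) => in_cohort(user, a) && in_cohort(user, b),
        Predicate::Or(a, b) => in_cohort(user, a) || in_cohort(user, b),
    }
}

/// Step 2: count signals per item from cohort members only, highest count first.
/// `signals` holds (user_id, item_id) pairs.
pub fn cohort_item_counts(
    users: &HashMap<&str, User>,
    signals: &[(&str, &str)],
    cohort: &Predicate,
) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for (user_id, item_id) in signals {
        if users.get(user_id).map_or(false, |u| in_cohort(u, cohort)) {
            *counts.entry(*item_id).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, usize)> =
        counts.into_iter().map(|(item, n)| (item.to_string(), n)).collect();
    // Ties are broken by item id so the output is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

The sketch scans every signal; the spec's emphasis on indexed fields suggests the real system resolves cohort membership through its term and numeric indexes instead.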

Creator Entity

Creators are the entities that produce content. Every item belongs to exactly one creator. Creators have their own metadata, embeddings, and signal ledgers that enable creator discovery (UC-10), creator profile pages (UC-08), and creator-level ranking signals.

Schema Definition

db.define_entity(EntityDef {
    kind: EntityKind::Creator,
    metadata_fields: vec![
        // ================================================================
        // APPLICATION-SET: Profile Information
        // ================================================================
        Field::text("name"),                  // display name, full-text searchable
        Field::keyword("handle"),             // unique handle, exact match searchable
        Field::keyword("language"),           // primary content language
        Field::keyword("region"),             // geographic region
        Field::keywords("categories"),        // content categories: ["music", "education"]
        Field::keywords("tags"),              // more specific: ["jazz", "piano", "tutorial"]
        Field::bool("verified"),              // platform verification status
        Field::keyword("account_type"),       // individual, brand, organization, label

        // ================================================================
        // DATABASE-COMPUTED: Audience Metrics
        // ================================================================
        Field::computed("follower_count", FieldType::I64),
            // count of active "follows" relationships pointing to this creator.
            // updated: on relationship write (increment/decrement).

        Field::computed("follower_growth_velocity", FieldType::F64),
            // net new followers per day, 7-day trailing average.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Content Catalog Statistics
        // ================================================================
        Field::computed("total_items", FieldType::I64),
            // count of non-archived items by this creator.
            // updated: on item write/archive.

        Field::computed("category_distribution", FieldType::Keywords),
            // top categories by item count.
            // e.g., ["jazz:45", "blues:20", "tutorial:15"]
            // stored as keyword values for faceting, with counts encoded.
            // updated: background computation, daily.

        Field::computed("avg_item_quality", FieldType::F64),
            // average completion_rate across all items with > 100 views.
            // proxy for content quality independent of reach.
            // updated: background computation, daily.

        // ================================================================
        // DATABASE-COMPUTED: Engagement Metrics
        // ================================================================
        Field::computed("avg_engagement_rate", FieldType::F64),
            // average (likes + comments + shares) / views across recent catalog.
            // trailing 30-day window over items created in that window.
            // updated: background computation, daily.

        Field::computed("posting_frequency", FieldType::F64),
            // average items published per week, trailing 30-day window.
            // updated: background computation, daily.

        Field::computed("last_posted_at", FieldType::Timestamp),
            // timestamp of most recent item creation.
            // updated: on item write.
    ],
    // Creator embedding -- aggregated from their item catalog.
    // Represents the semantic "center" of what this creator produces.
    embedding: EmbeddingDef {
        slots: vec![
            EmbeddingSlot {
                name: "catalog",
                dimensions: 1536,
                source: EmbeddingSource::DatabaseManaged,
            },
        ],
    },
})?;

Field Summary Table

Field	Type	Writability	Indexed	Used By
`name`	text	app-set	BM25 inverted	UC-10 people search
`handle`	keyword	app-set	term dictionary	UC-02 `creator:handle` search
`language`	keyword	app-set	term dictionary	UC-10 language filter
`region`	keyword	app-set	term dictionary	UC-10 geographic filter
`categories`	keywords	app-set	term dictionary	UC-10 topic filter
`tags`	keywords	app-set	term dictionary	UC-10 niche discovery
`verified`	bool	app-set	boolean	UC-10 verified filter
`account_type`	keyword	app-set	term dictionary	UC-10 creator type filter
`follower_count`	computed (i64)	db-computed	sorted numeric	UC-10 follower range filter, sort
`follower_growth_velocity`	computed (f64)	db-computed	sorted numeric	UC-03 rising creators
`total_items`	computed (i64)	db-computed	sorted numeric	UC-08 catalog size
`category_distribution`	computed (keywords)	db-computed	term dictionary	UC-08 catalog browsing
`avg_item_quality`	computed (f64)	db-computed	sorted numeric	UC-13 hidden gems by creator
`avg_engagement_rate`	computed (f64)	db-computed	sorted numeric	UC-10 engagement rate sort
`posting_frequency`	computed (f64)	db-computed	sorted numeric	UC-10 activity filter
`last_posted_at`	computed (timestamp)	db-computed	sorted numeric	UC-10 recently active filter
`catalog` (embedding)	embedding	db-managed	HNSW (USearch)	UC-10 "creators like X"

Creator Embedding Computation

The creator's catalog embedding is the centroid of their non-archived items' content embeddings, weighted by item quality (completion rate). This is computed by the database on a background schedule:

catalog_embedding = weighted_mean(
    vectors: [item.content_embedding for item in creator.items if item.status != "archived"],
    weights: [item.completion_rate_all_time.max(0.1) for item in creator.items if item.status != "archived"]
)

When a creator publishes a new item, the catalog embedding is updated incrementally with an unweighted running mean, where old_count is the number of items already folded into the centroid:

new_catalog = (old_catalog * old_count + new_item_embedding) / (old_count + 1)

Full recomputation occurs on a background schedule (daily) to correct for incremental drift and account for archived items.
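Both formulas in this section can be sketched in Rust. The `Item` struct and the function names are hypothetical; the 0.1 weight floor, the exclusion of archived items, and the running-mean update come from the formulas above. The sketch assumes every embedding has the same dimensionality.

```rust
/// Hypothetical view of the item fields the centroid needs.
pub struct Item {
    pub content_embedding: Vec<f32>,
    pub completion_rate: f32,
    pub archived: bool,
}

/// Full recomputation: quality-weighted mean of non-archived item embeddings.
/// Returns None when the creator has no non-archived items.
pub fn catalog_embedding(items: &[Item]) -> Option<Vec<f32>> {
    let live: Vec<&Item> = items.iter().filter(|i| !i.archived).collect();
    let dims = live.first()?.content_embedding.len();
    let mut sum = vec![0.0f32; dims];
    let mut total_weight = 0.0f32;
    for item in &live {
        // Floor the weight at 0.1 so items with no completions still count.
        let w = item.completion_rate.max(0.1);
        for (s, x) in sum.iter_mut().zip(&item.content_embedding) {
            *s += w * x;
        }
        total_weight += w;
    }
    Some(sum.into_iter().map(|s| s / total_weight).collect())
}

/// Incremental update on publish: unweighted running mean over `old_count` items.
pub fn incremental_update(old: &[f32], old_count: usize, new_item: &[f32]) -> Vec<f32> {
    let n = old_count as f32;
    old.iter()
        .zip(new_item)
        .map(|(o, x)| (o * n + x) / (n + 1.0))
        .collect()
}
```

The incremental path ignores quality weights while the full recomputation uses them, which is one concrete source of the drift the daily recomputation corrects.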

Field Writability Model

Every field in the entity model belongs to one of three writability categories. This distinction is enforced at the schema level -- the database rejects writes that violate writability constraints.

Category	Who Writes	When Updated	Examples
app-set	Application via `write_*()` / `update_*()`	On explicit write	`title`, `locale`, `birth_year`, `verified`
db-computed	Database background computation	On schedule or trigger (see below)	`engagement_level`, `inferred_interests`, `follower_count`
db-managed	Database signal processing	On every relevant signal write	`preference` embedding, `interaction_weight`

Update Triggers for Computed Fields

Computed fields are updated by one of three mechanisms:

Trigger	Latency	Fields
Immediate (on write)	< 1ms	`follower_count`, `total_items`, `platform_tenure_days`, `last_posted_at`
Incremental (signal-driven)	< 100ms	`inferred_interests` (top-N update), `preference` embedding (vector shift)
Background (scheduled)	Minutes to hours	`engagement_level`, `content_format_preference`, `session_pattern`, `daily_active_hours`, `avg_creator_interaction_depth`, `avg_engagement_rate`, `posting_frequency`, `avg_item_quality`, `category_distribution`, `follower_growth_velocity`, `primary_categories`, creator `catalog` embedding

Background computation runs on a configurable schedule. The default is:

Hourly: engagement_level, primary_categories, inferred_interests (full recomputation)
Daily: content_format_preference, session_pattern, daily_active_hours, avg_creator_interaction_depth, avg_engagement_rate, posting_frequency, avg_item_quality, category_distribution, follower_growth_velocity, creator catalog embedding (full recomputation)

Applications can trigger immediate recomputation of any computed field via db.recompute_field(entity_id, field_name) for debugging or operational purposes. This is not intended for production hot paths.

Write API Enforcement

// This succeeds -- locale is app-set
db.update_user("user_123", UpdateUser {
    metadata: Some(metadata! {
        "locale" => "ja-JP",
        "timezone" => "Asia/Tokyo",
    }),
    ..Default::default()
})?;

// This fails with SchemaError::ComputedFieldWrite
db.update_user("user_123", UpdateUser {
    metadata: Some(metadata! {
        "engagement_level" => "power_user",  // ERROR: computed field
    }),
    ..Default::default()
})?;
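One way the schema-level check behind these two calls might work is sketched here. Only the `SchemaError::ComputedFieldWrite` variant name comes from this spec; `Writability`, `validate_update`, and the `UnknownField` variant are hypothetical, and treating db-managed fields the same as db-computed ones is an assumption.

```rust
use std::collections::HashMap;

/// The three writability categories from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Writability {
    AppSet,
    DbComputed,
    DbManaged,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    /// Variant name from the spec; the payload is an assumption.
    ComputedFieldWrite(String),
    /// Hypothetical: the field is not declared in the schema.
    UnknownField(String),
}

/// Rejects an update if any field it touches is not app-set.
/// The whole update fails on the first offending field, so nothing is
/// partially applied.
pub fn validate_update(
    schema: &HashMap<&str, Writability>,
    fields: &[&str],
) -> Result<(), SchemaError> {
    for name in fields {
        match schema.get(name) {
            Some(Writability::AppSet) => {}
            Some(_) => return Err(SchemaError::ComputedFieldWrite(name.to_string())),
            None => return Err(SchemaError::UnknownField(name.to_string())),
        }
    }
    Ok(())
}
```

Running the check before any index is touched keeps a rejected update from leaving the entity half-modified.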

Entity Lifecycle

Every entity follows the same lifecycle model. The lifecycle defines what state transitions are legal and what each transition means for storage, indexing, and query visibility.

States

                write_*()
    (none) ──────────────▶ Active
                              │
                    update_*()│ (metadata/embedding changes)
                    ◄─────────┘
                              │
                   archive()  │
                              ▼
                           Archived
                              │
                    delete()  │
                              ▼
                           Deleted
                        (hard remove)

State Semantics

State	Query Visible	Signals Accepted	Signal Ledger	Relationships	Embeddings
Active	Yes	Yes	Accumulating	Active	Indexed in HNSW
Archived	No (excluded by default)	No (rejected with error)	Preserved (read-only)	Preserved but inactive	Removed from HNSW
Deleted	No	No	Destroyed	Destroyed	Destroyed
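The diagram and table can be read as a small state machine. The sketch below follows the diagram literally, so delete is reachable only from Archived; whether an Active entity can be deleted directly is not stated here, and the `Op`, `IllegalTransition`, and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Active,
    Archived,
    Deleted,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Update,
    WriteSignal,
    Archive,
    Delete,
}

#[derive(Debug, PartialEq, Eq)]
pub struct IllegalTransition {
    pub from: State,
    pub op: Op,
}

/// Applies an operation, returning the next state or an error if the
/// transition does not appear in the lifecycle diagram.
pub fn apply(state: State, op: Op) -> Result<State, IllegalTransition> {
    match (state, op) {
        (State::Active, Op::Update) | (State::Active, Op::WriteSignal) => Ok(State::Active),
        (State::Active, Op::Archive) => Ok(State::Archived),
        (State::Archived, Op::Delete) => Ok(State::Deleted),
        (from, op) => Err(IllegalTransition { from, op }),
    }
}

/// Only Active entities appear in default query results.
pub fn query_visible(state: State) -> bool {
    state == State::Active
}

/// Only Active entities accept new signals; others reject them with an error.
pub fn accepts_signals(state: State) -> bool {
    state == State::Active
}
```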

Create

On write_item(), write_user(), or write_creator():

Entity metadata is stored in the entity store.
Text fields are indexed in the inverted index (Tantivy).
Keyword, numeric, boolean, timestamp, and duration fields are indexed in their respective indexes.
Embedding is inserted into the HNSW index (USearch) -- normalized to unit length at insertion.
Signal ledger is initialized (all counters at zero, all decay scores at zero, last_update_ns set to creation time).
For items: linked to creator entity; cold-start exploration budget applied.
For users: if no embedding provided, initialized to population-level default preference vector.
For creators: catalog embedding initialized to zero vector (will be computed when first item is published).
Entity is immediately queryable after commit.

Duplicate writes: write_*() is not idempotent. Writing an entity with an ID that already exists is an error (SchemaError::EntityExists). Use update_*() for modifications.

Update

On update_item(), update_user(), or update_creator():

Only provided fields are modified. Omitted fields retain their current values (partial update).
Modified text fields trigger re-indexing in the inverted index.
Modified keyword/numeric/boolean fields trigger re-indexing in their respective indexes.
If an embedding is provided, the old vector is replaced in the HNSW index. The new vector is normalized at insertion.
Signal ledger is not affected by metadata updates.
Computed fields cannot be set (returns SchemaError::ComputedFieldWrite).

Delete

On db.delete(entity_kind, entity_id):

Entity metadata is destroyed.
All indexes are updated to remove the entity.
Signal ledger is destroyed.
All relationships involving this entity are destroyed.
For items: the creator's total_items count is decremented and catalog embedding is marked for recomputation.
For users: all user-specific signal state (seen items, preference vector, relationship weights) is destroyed.
For creators: all items by this creator remain but lose their creator link (orphaned items should be archived or reassigned by the application before deleting a creator).

Delete is a destructive, irreversible operation intended for legal compliance (GDPR right to erasure, DMCA takedowns). Normal content removal should use archive.

Cold Start State

A newly created entity with no signal history is in cold-start state. The database handles this natively:

Items: Receive an exploration budget (configurable per ranking profile) that injects them into a percentage of query results regardless of signal state. The budget decays as signals accumulate. Default: 10% of For You feed slots for the first 48 hours or until 1000 impressions, whichever comes first.
Users: Start with a population-level default preference vector. If explicit_interests are provided at creation, the vector is seeded toward those interest embeddings. After approximately 20 signal events, the preference vector becomes user-specific.
Creators: Start with a zero catalog embedding. After their first item is published, the catalog embedding is set to that item's content embedding. Subsequent items refine it.

Cold start handling is specified in the ranking profile, not in the entity model. The entity model provides the fields and embedding slots that ranking profiles use to detect and handle cold-start conditions.
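
As an illustration of the default item budget described above (10% of slots, 48 hours, 1000 impressions), a ranking profile might hold something like the following. This is a sketch under assumptions: `ExplorationBudget` and its fields are hypothetical names, and the gradual decay of the budget as signals accumulate is not modeled because the specification does not state its shape.

```rust
use std::time::Duration;

/// Hypothetical container for the cold-start limits quoted in the text.
struct ExplorationBudget {
    max_age: Duration,
    max_impressions: u64,
    slot_percent: usize,
}

impl ExplorationBudget {
    /// The defaults stated in this section.
    fn spec_default() -> Self {
        ExplorationBudget {
            max_age: Duration::from_secs(48 * 3600),
            max_impressions: 1000,
            slot_percent: 10,
        }
    }

    /// An item is eligible until either limit is reached, whichever comes first.
    fn is_eligible(&self, age: Duration, impressions: u64) -> bool {
        age < self.max_age && impressions < self.max_impressions
    }

    /// Number of feed slots reserved for exploration (rounded down).
    fn reserved_slots(&self, feed_len: usize) -> usize {
        feed_len * self.slot_percent / 100
    }
}
```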

Embedding Management

Embeddings are dense vector representations stored alongside entities and indexed for approximate nearest neighbor (ANN) retrieval via USearch (HNSW).

Embedding Sources

Source	Meaning	Who Writes	When Updated
`External`	Application computes and provides the vector	Application	On `write_*()` or `update_*()` with embedding
`DatabaseManaged`	Database computes and maintains the vector	Database	On signal writes (incremental) and background schedule (full)

External Embeddings

The application is responsible for computing external embeddings using its own model (OpenAI, Cohere, custom, etc.). tidalDB indexes and retrieves over these vectors but never generates them.

// Application computes the embedding externally
let content_vector: Vec<f32> = embedding_service.embed(&title_and_description);

db.write_item(WriteItem {
    id: "item_abc",
    creator_id: "creator_xyz",
    metadata: metadata! { /* ... */ },
    embeddings: embeddings! {
        "content" => &content_vector,    // 1536-dim, externally computed
    },
})?;

Normalization: All embeddings are normalized to unit length at insertion time. For unit vectors, squared L2 distance is a monotonic function of cosine similarity (||a - b||^2 = 2 - 2 cos(a, b)), so ranking by L2 distance yields the same order as ranking by cosine similarity, and L2 is more SIMD-friendly. The application does not need to pre-normalize -- the database handles it. See docs/research/ann_for_tidaldb.md for rationale.
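
The relationship between cosine similarity and L2 distance on unit vectors can be checked numerically. The helpers below are an illustrative sketch, not tidalDB code; how the database treats a zero vector is not specified, so returning it unchanged is an assumption.

```rust
/// Scale a vector to unit length. A zero vector is returned unchanged
/// (assumption: the spec does not say how this case is handled).
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; equals cosine similarity when both inputs are unit vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Squared Euclidean distance.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

For unit vectors `a` and `b`, `squared_l2(a, b)` equals `2.0 - 2.0 * dot(a, b)` up to rounding, so the nearest neighbor by L2 is also the most similar by cosine.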

Dimensions: Configurable per embedding slot in the entity definition. The default is 1536 (matching OpenAI text-embedding-3-small; text-embedding-3-large defaults to 3072). Changing dimensions after data has been written requires rebuilding the HNSW index for that slot.

Database-Managed Embeddings

Two embeddings are managed by the database:

User preference vector (User.preference): Updated incrementally on every signal write. When a user generates a positive signal (like, completion, save) for an item, the preference vector is shifted toward the item's content embedding. When a user generates a negative signal (skip, hide, not-interested), the preference vector is shifted away. The learning rate and momentum are configurable per signal type in the ranking profile.

# Positive signal (like, completion)
preference += learning_rate * (item.content_embedding - preference)

# Negative signal (skip, hide)
preference -= learning_rate * (item.content_embedding - preference) * negative_weight

# Re-normalize to unit length after each update
preference = normalize(preference)

Full recomputation from signal history occurs on a daily background schedule to correct for incremental drift.
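
The update rule above translates fairly directly into Rust. This is a minimal sketch of the incremental step only; the function name and parameter layout are hypothetical, and momentum (mentioned as configurable) is left out because the pseudocode does not show how it enters the update.

```rust
/// Shift `preference` toward (positive signal) or away from (negative signal)
/// the item's content embedding, then re-normalize to unit length.
fn update_preference(
    preference: &mut [f32],
    item_embedding: &[f32],
    learning_rate: f32,
    positive: bool,
    negative_weight: f32,
) {
    // Negative signals flip the sign and are scaled by `negative_weight`.
    let step = if positive {
        learning_rate
    } else {
        -learning_rate * negative_weight
    };
    for (p, x) in preference.iter_mut().zip(item_embedding) {
        *p += step * (*x - *p);
    }
    // Re-normalize so the vector stays comparable under cosine / L2.
    let norm = preference.iter().map(|p| p * p).sum::<f32>().sqrt();
    if norm > 0.0 {
        for p in preference.iter_mut() {
            *p /= norm;
        }
    }
}
```

With `preference = [1, 0]`, `item_embedding = [0, 1]` and `learning_rate = 0.5`, a positive signal moves the vector to roughly `[0.707, 0.707]`, while a negative signal with `negative_weight = 1.0` moves it to roughly `[0.949, -0.316]`.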

Creator catalog vector (Creator.catalog): Weighted centroid of all non-archived item embeddings by this creator. Updated incrementally when items are published or archived. Full recomputation on a daily background schedule.
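
A weighted centroid can be sketched as follows. The specification does not say what the weights are, so they are passed in by the caller here; `catalog_centroid` is a hypothetical name, and normalizing the result is an assumption based on the rule that all embeddings are stored at unit length.

```rust
/// Weighted centroid of item embeddings, normalized to unit length.
/// Each entry is (embedding, weight). Returns None when there are no items
/// or the weights do not sum to a positive value.
fn catalog_centroid(items: &[(Vec<f32>, f32)]) -> Option<Vec<f32>> {
    let dims = items.first()?.0.len();
    let total: f32 = items.iter().map(|(_, w)| *w).sum();
    if total <= 0.0 {
        return None;
    }
    let mut centroid = vec![0.0f32; dims];
    for (embedding, weight) in items {
        for (c, x) in centroid.iter_mut().zip(embedding) {
            *c += (*weight / total) * *x;
        }
    }
    let norm = centroid.iter().map(|c| c * c).sum::<f32>().sqrt();
    if norm > 0.0 {
        for c in centroid.iter_mut() {
            *c /= norm;
        }
    }
    Some(centroid)
}
```

With a single item the result is that item's (unit-length) embedding, which matches the cold-start behavior described for creators.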

Multiple Embedding Slots

An entity type can define multiple embedding slots for multi-modal retrieval:

embedding: EmbeddingDef {
    slots: vec![
        EmbeddingSlot { name: "content", dimensions: 1536, source: External },
        EmbeddingSlot { name: "visual",  dimensions: 512,  source: External },
        EmbeddingSlot { name: "audio",   dimensions: 256,  source: External },
    ],
},

Each slot is independently indexed in its own HNSW graph. Queries specify which slot to search:

// Semantic search over content embeddings (default)
db.search(Search { vector: Some(&query_vec), vector_slot: "content", .. })?;

// Visual similarity search (UC-11)
db.search(Search { vector: Some(&image_vec), vector_slot: "visual", .. })?;

If vector_slot is omitted, the first defined slot is used as the default.

Embedding Slot Constraints

An entity can have at most 4 embedding slots. This is a pragmatic limit -- each slot consumes memory for the HNSW graph (approximately 300 bytes per node at M=16, per slot).
Embedding dimensions must be between 2 and 4096 (inclusive). Dimensions below 2 are meaningless; above 4096, ANN quality degrades and memory costs become prohibitive at scale.
All embeddings are stored as f16 by default (per docs/research/ann_for_tidaldb.md). The EmbeddingSlot definition can override to f32 if the embedding model requires higher precision. i8 quantization is available for memory-constrained deployments.
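
The slot-count and dimension limits above could be enforced at schema definition time along these lines. This is a sketch only: the `EmbeddingSlot` struct here is a simplified stand-in for the one in the schema, and `SlotError` is a hypothetical error type, not part of the documented API.

```rust
/// Simplified stand-in for the schema's EmbeddingSlot (name and size only).
struct EmbeddingSlot {
    name: String,
    dimensions: usize,
}

/// Hypothetical validation errors for embedding slot definitions.
#[derive(Debug, PartialEq)]
enum SlotError {
    TooManySlots(usize),
    BadDimensions { slot: String, dimensions: usize },
}

const MAX_SLOTS: usize = 4;
const MIN_DIMENSIONS: usize = 2;
const MAX_DIMENSIONS: usize = 4096;

/// Check the constraints listed above: at most 4 slots, 2..=4096 dimensions.
fn validate_slots(slots: &[EmbeddingSlot]) -> Result<(), SlotError> {
    if slots.len() > MAX_SLOTS {
        return Err(SlotError::TooManySlots(slots.len()));
    }
    for slot in slots {
        if !(MIN_DIMENSIONS..=MAX_DIMENSIONS).contains(&slot.dimensions) {
            return Err(SlotError::BadDimensions {
                slot: slot.name.clone(),
                dimensions: slot.dimensions,
            });
        }
    }
    Ok(())
}
```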

Cohort-Ready Design

The expanded user attribute model enables cohort-based queries that are central to content platform analytics and targeting. This section describes how cohort resolution works and what indexing is required.

Cohort Predicate Resolution

A cohort is a set of users matching a composite predicate over user attributes. tidalDB resolves cohort membership using the same index infrastructure that powers entity filtering:

Each predicate term resolves to a roaring bitmap of matching user IDs.
Compound predicates (AND, OR, NOT) are resolved via bitmap intersection, union, and complement.
The resulting user set feeds into signal aggregation for the cohort query.

Predicate: region:US AND age_range:18-24 AND inferred_interests:jazz

Step 1: region_index["US"]           → bitmap A (all US users)
Step 2: age_range_index["18-24"]     → bitmap B (all 18-24 users)
Step 3: interests_index["jazz"]      → bitmap C (all jazz-interested users)
Step 4: A ∩ B ∩ C                    → bitmap D (the cohort)
Step 5: aggregate signals over items engaged by users in bitmap D
Step 6: rank items by aggregated signal velocity within the cohort
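
Steps 1 through 4 above can be sketched with standard-library sets. This is an illustration only: `BTreeSet<u64>` stands in for a roaring bitmap, and `TermIndex`, `lookup`, and `intersect_all` are hypothetical names rather than tidalDB internals.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a roaring bitmap of user IDs.
type UserSet = BTreeSet<u64>;
/// Stand-in for one field's term-to-bitmap index.
type TermIndex = HashMap<String, UserSet>;

/// Resolve one predicate term to its user set (empty if the term is unknown).
fn lookup(index: &TermIndex, term: &str) -> UserSet {
    index.get(term).cloned().unwrap_or_default()
}

/// AND of several predicate terms: intersect all sets.
fn intersect_all(sets: &[&UserSet]) -> UserSet {
    let mut iter = sets.iter().copied();
    let mut acc = match iter.next() {
        Some(first) => first.clone(),
        None => return UserSet::new(),
    };
    for set in iter {
        acc = acc.intersection(set).copied().collect();
    }
    acc
}
```

OR and NOT follow the same pattern with `union` and `difference` against the set of all users. A real roaring bitmap performs these operations on compressed containers rather than element by element.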

Required Indexes

Every keyword and keywords field on the User entity gets a term-to-bitmap index:

Field	Index Type	Cardinality Estimate
`locale`	keyword → roaring bitmap	~200 values
`language`	keyword → roaring bitmap	~100 values
`region`	keyword → roaring bitmap	~250 values
`timezone`	keyword → roaring bitmap	~400 values
`age_range`	keyword → roaring bitmap	~6 values
`gender`	keyword → roaring bitmap	~4 values
`account_type`	keyword → roaring bitmap	~4 values
`explicit_interests`	keyword → roaring bitmap	~10,000 values
`preferred_formats`	keyword → roaring bitmap	~10 values
`inferred_interests`	keyword → roaring bitmap	~10,000 values
`primary_categories`	keyword → roaring bitmap	~100 values
`engagement_level`	keyword → roaring bitmap	~5 values
`content_format_preference`	keyword → roaring bitmap	~3 values
`session_pattern`	keyword → roaring bitmap	~3 values

Numeric fields (birth_year, platform_tenure_days, daily_active_hours, followed_creator_count, avg_creator_interaction_depth) use sorted numeric indexes that support range predicates.
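
A sorted numeric index with range lookups can be approximated with an ordered map. The sketch below is one possible shape, not the actual index implementation; `NumericIndex` is a hypothetical name, and it handles only integer-valued fields such as birth_year.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sorted index: field value -> set of user IDs with that value.
struct NumericIndex {
    by_value: BTreeMap<i64, BTreeSet<u64>>,
}

impl NumericIndex {
    fn new() -> Self {
        NumericIndex { by_value: BTreeMap::new() }
    }

    /// Record that `user_id` has `value` for this field.
    fn insert(&mut self, value: i64, user_id: u64) {
        self.by_value.entry(value).or_default().insert(user_id);
    }

    /// All users whose value lies in [lo, hi]. An inverted range is empty.
    fn range(&self, lo: i64, hi: i64) -> BTreeSet<u64> {
        if lo > hi {
            return BTreeSet::new();
        }
        self.by_value
            .range(lo..=hi)
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }
}
```

The result of `range` is a user set, so it can be combined with keyword-field bitmaps in the same intersection step used for cohort resolution.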

Bitmap Freshness

Application-set field bitmaps are updated synchronously on entity write. Database-computed field bitmaps are updated when the computed field is refreshed (hourly or daily, per the background computation schedule). This means cohort queries over computed fields reflect the last background computation, not real-time state. For most cohort use cases (trending among power users, popular in a demographic), hourly freshness is sufficient.

If sub-second freshness is required for a specific computed field, the application can call db.recompute_field(entity_id, field_name) to trigger immediate recomputation and re-indexing. This should be used sparingly.

Memory Budget for Cohort Indexes

At 10M users with the field set defined above, the bitmap indexes require approximately:

Low-cardinality keyword fields (region, age_range, engagement_level, etc.): ~50 MB total (roaring bitmaps compress well when cardinality is low)
High-cardinality keyword fields (explicit_interests, inferred_interests): ~500 MB total (10,000 terms, average 1,000 users per term, roaring bitmap of 1,000 u64s each)
Numeric range indexes: ~80 MB total

Total: approximately 630 MB for full cohort resolution capability over 10M users. This fits comfortably within the memory budget recommended in docs/research/tidaldb_signal_ledger.md.

Signal Ledger Attachment

Every entity automatically receives a signal ledger at creation time. The ledger is not part of the entity's metadata schema -- it is an intrinsic property of being an entity. Signal types and their behavior are defined separately via define_signal() (see the Signal Specification).

What the Ledger Contains

For each signal type defined in the schema and targeting this entity kind:

Component	Storage	Purpose
Running decay scores	`[f64; N]` per lambda	O(1) read of decayed signal value at query time
Windowed counters	Bucketed counters per window	Windowed aggregation (1h, 24h, 7d, 30d, all_time)
Velocity state	Derived from windowed counters	Rate-of-change computation
Last update timestamp	`u64` (nanoseconds)	Decay computation reference point

The ledger follows the three-tier architecture from docs/research/tidaldb_signal_ledger.md:

Tier 1 (in-memory): Running decay scores, SWAG-backed windowed counters, recent events. ~80 bytes per entity per signal type.
Tier 2 (disk): Raw signal events, time-partitioned with FIFO compaction, 7-day retention.
Tier 3 (materialized rollups): Hourly and daily aggregates for longer windows.

Ledger Initialization

At entity creation:

// Pseudocode -- internal to the database, not public API
fn initialize_ledger(entity_id: EntityId, signal_types: &[SignalDef]) {
    for signal in signal_types {
        ledger.set_decay_scores(entity_id, signal.name, [0.0; N_LAMBDAS]);
        ledger.set_last_update(entity_id, signal.name, creation_time_ns);
        ledger.init_windowed_counters(entity_id, signal.name, &signal.windows);
    }
}

All scores start at zero. The last_update is set to creation time so that the first signal write computes correct decay deltas.
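
To show why last_update matters, here is a sketch of a lazily decayed score for a single lambda. It assumes exponential decay of the form score * exp(-lambda * dt), which the "per lambda" wording suggests but this section does not state outright; `DecayState` and its methods are hypothetical names.

```rust
/// Hypothetical running decay score for one signal type and one lambda.
struct DecayState {
    score: f64,
    last_update_ns: u64,
}

impl DecayState {
    /// Matches ledger initialization: score zero, last_update = creation time.
    fn new(creation_time_ns: u64) -> Self {
        DecayState { score: 0.0, last_update_ns: creation_time_ns }
    }

    /// O(1) read: decay the stored score from last_update to `now_ns`.
    fn decayed_at(&self, now_ns: u64, lambda_per_sec: f64) -> f64 {
        let dt_secs = now_ns.saturating_sub(self.last_update_ns) as f64 / 1e9;
        self.score * (-lambda_per_sec * dt_secs).exp()
    }

    /// Signal write: decay the old score to now, add the new weight,
    /// and advance the reference timestamp.
    fn record(&mut self, now_ns: u64, weight: f64, lambda_per_sec: f64) {
        self.score = self.decayed_at(now_ns, lambda_per_sec) + weight;
        self.last_update_ns = self.last_update_ns.max(now_ns);
    }
}
```

Because the score starts at zero, the first write yields exactly the signal's weight regardless of how long the entity has existed, and later reads decay from that write's timestamp.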

Storage Representation

Entities are stored using the key encoding pattern from CODING_GUIDELINES.md, following the subject-prefix design from thoughts.md:

[entity_kind: u8][entity_id: u64 BE][0x00][TAG]:[suffix]

Tags:
  META           → serialized metadata (all fields)
  EMB:slot_name  → raw embedding vector bytes
  SIG:type:win   → signal windowed aggregate
  REL:kind       → relationship edge list
  STATE          → entity lifecycle state (active/archived)

Examples

[0x01][0x0000000000000ABC][0x00][META]           → Item item_abc metadata
[0x01][0x0000000000000ABC][0x00][EMB:content]    → Item item_abc content embedding
[0x01][0x0000000000000ABC][0x00][SIG:view:24h]   → Item item_abc view count, 24h window
[0x01][0x0000000000000ABC][0x00][REL:created_by] → Item item_abc → creator link

[0x02][0x000000000000007B][0x00][META]           → User user_123 metadata
[0x02][0x000000000000007B][0x00][EMB:preference] → User user_123 preference vector

[0x03][0x00000000000000FF][0x00][META]           → Creator creator_xyz metadata
[0x03][0x00000000000000FF][0x00][EMB:catalog]    → Creator creator_xyz catalog vector

Entity kind byte values:

Kind	Byte
Item	`0x01`
User	`0x02`
Creator	`0x03`

This encoding co-locates all data for a single entity under one key prefix, enabling efficient prefix scans (fetch all state for one entity) and natural shard boundaries. Per-entity-type storage isolation (separate column families or keyspaces) prevents cross-entity-type contention as recommended in thoughts.md.
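
The key layout can be sketched as a small encoder. This is illustrative only: it assumes the tag and suffix are written as ASCII bytes exactly as shown in the tag list, and `entity_prefix` and `encode_key` are hypothetical helper names.

```rust
/// Shared 10-byte prefix for every key of one entity:
/// [entity_kind: u8][entity_id: u64 BE][0x00]
fn entity_prefix(kind: u8, entity_id: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(10);
    key.push(kind);
    // Big-endian so that byte-wise key order matches numeric ID order.
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key
}

/// Full key: prefix followed by the tag, e.g. "META" or "EMB:content".
/// Assumption: tags are stored as their ASCII bytes.
fn encode_key(kind: u8, entity_id: u64, tag: &str) -> Vec<u8> {
    let mut key = entity_prefix(kind, entity_id);
    key.extend_from_slice(tag.as_bytes());
    key
}
```

A prefix scan over `entity_prefix(0x01, id)` would then return the META, EMB, SIG, REL, and STATE entries for that item in one pass.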

Entity ID Encoding

Entity IDs are provided by the application as strings (e.g., "item_abc", "user_123"). Internally, they are hashed to u64 using BLAKE3 for compact, fixed-width storage and comparison. The original string ID is stored in metadata for external reference. Collisions are rare but not impossible: a 64-bit hash has a birthday bound of about 2^32 (~4.3 billion) entities, and at 10M entities the probability of any collision is roughly 3 in a million. The system detects collisions at write time and returns SchemaError::IdCollision if one occurs.
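
The collision risk for a 64-bit hash can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^65). The function below is a worked-arithmetic sketch assuming the hash output is uniformly distributed; it is not part of tidalDB.

```rust
/// Approximate probability that at least two of `n` uniformly hashed IDs
/// collide in a 64-bit space (birthday approximation).
fn collision_probability(n: f64) -> f64 {
    let space = 2f64.powi(64);
    let exponent = n * (n - 1.0) / (2.0 * space);
    // 1 - exp(-x), computed via exp_m1 for accuracy when x is tiny.
    -(-exponent).exp_m1()
}
```

Under this approximation, 10 million entities give a collision probability of about 2.7e-6, while 2^32 entities give about 0.39, which is why write-time collision detection is still worthwhile.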

Design Rationale

Why the User Model Expanded From 2 Fields to 20+

The original API.md user entity had language and region. This is sufficient for a single-user personalization model where ranking depends entirely on the user's signal history and preference vector. It is woefully insufficient for cohort-based queries.

The thesis of tidalDB includes replacing the feature store. A feature store's primary job in the content ranking stack is to answer "given this user's attributes and behavior, what segment do they belong to, and what is trending/popular/rising within that segment?" Without rich user attributes, tidalDB cannot answer this question. The application would need an external feature store, which defeats the single-system thesis.

The expanded model enables three categories of queries that the 2-field model cannot:

Demographic cohorts: "Trending among US users aged 18-24" -- requires region, age_range.
Behavioral cohorts: "Popular among power users who prefer short-form" -- requires engagement_level, content_format_preference.
Interest cohorts: "Rising in jazz among users who have shown interest in jazz" -- requires explicit_interests, inferred_interests.

Why Computed Fields Are a Separate Category

Behavioral segments like engagement_level change continuously as users interact with the platform. If the application were responsible for computing and writing these, it would need to:

Maintain signal frequency counters per user
Run classification logic on every signal write
Write the result back to the database

This is exactly the feature-store-plus-Kafka pattern that tidalDB replaces. By making these fields database-computed, the feedback loop closes natively. The signal write updates the signal ledger, the background computation reads the ledger to classify the user, and the next cohort query sees the updated classification. One system.

Why Items Have Many Fields

Every field on the Item entity maps to a filter dimension in USE_CASES.md Appendix A. The filter reference lists 30+ filterable dimensions. Each dimension must be represented as a field on the entity so the database can build the appropriate index. Removing a field means removing a filter that real users on real platforms use daily.

The alternative -- a generic JSON field for "other metadata" -- sacrifices indexing. A JSON field cannot be efficiently filtered, faceted, or range-scanned. Every field that appears in a filter predicate must be a typed, indexed field.

Why Multiple Embedding Slots

UC-11 (Visual and Semantic Search) requires searching by image similarity. UC-02 requires text/semantic search. These are fundamentally different vector spaces with different dimensionality and different models. Forcing them into a single embedding slot would require either:

Training a multi-modal embedding (impractical for most teams)
Concatenating vectors (destroys distance metric quality)
Maintaining only one search modality (loses functionality)

Multiple slots, each with its own HNSW index, keep vector spaces clean and searchable independently while allowing the query planner to choose which space to search based on the query.

Why Entity IDs Are Hashed to u64

String comparison is 5-10x slower than integer comparison for key lookups. Signal writes and ranking queries perform thousands of entity lookups per operation. The 8-byte fixed-width key enables:

Cache-line-friendly key encoding (aligned, fixed size)
Fast comparison in hot-path data structures
Compact storage in roaring bitmaps (u64 values)
Deterministic key ordering (big-endian u64 sort)

The original string ID is preserved in metadata for external reference and API responses. The hash is an internal optimization.
