tidaldb/docs/specs/05-cohorts.md

05 -- Cohort Specification

Status: Draft
Authors: tidalDB Engineering
Date: 2026-02-20
Depends on: Entity Model (02), Signal System (03), Query Engine
Research: docs/research/tidaldb_signal_ledger.md (Section 7: Cohort-Scoped Signal Aggregation)


Table of Contents

  1. Overview
  2. Cohort as a First-Class Primitive
  3. Cohort Types
  4. Cohort Definition Language
  5. Membership Resolution
  6. The Three-Layer Trending Model
  7. Integration Architecture
  8. Cohort-Scoped Ranking Profiles
  9. Hierarchical Cohort Model
  10. Cohort Analytics
  11. API Surface
  12. Worked Example
  13. Accuracy Analysis
  14. Configuration and Defaults
  15. Scale Considerations
  16. Invariants and Correctness Guarantees

1. Overview

A cohort is a dynamic predicate over user attributes that defines a population segment. Cohorts are not user groups. They are not lists. A user does not "join" a cohort -- they match its predicate based on their current attributes. When a user's attributes change, their cohort memberships change automatically.

Cohorts exist to answer a question that global signal aggregates cannot:

"What is trending for users who look like this?"

The product owner's requirement is a three-layer model:

  1. Global trending -- what is trending everywhere.
  2. Cohort trending -- what is trending for users matching a profile (e.g., US users aged 18-24 who like jazz).
  3. Search within cohort trending -- text and semantic search constrained to the cohort-trending candidate set.

Each layer builds on the previous. Global trending uses global signal aggregates (already designed in the Signal System spec, Section 6, Level 0). Cohort trending uses the hierarchical dimensional rollup system (Signal System spec, Section 7, Levels 1-2). Search within cohort trending composes text/semantic retrieval with the cohort-scoped candidate set.

This specification defines cohorts as a first-class primitive that connects the Entity Model's rich user attributes, the Signal System's dimensional rollup architecture, and the Query Engine's retrieval and ranking pipeline.


2. Cohort as a First-Class Primitive

What a Cohort Is

A cohort is a named predicate over user attributes that resolves, at query time, to a set of user IDs. The predicate is evaluated against the User entity's metadata fields -- both application-set fields (region, locale, age_range) and database-computed fields (engagement_level, inferred_interests).

Cohort "young_us_jazz":
    Predicate: region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz
    Resolution: bitmap of user IDs matching this predicate
    Signal scope: aggregate signals only from users in this bitmap

What a Cohort Is Not

Not a user group. A cohort has no membership list that someone manages. Users match or do not match a predicate. There is no "add user to cohort" operation.

Not a segment stored on the user. Users do not carry a cohorts field. Membership is computed from attributes. If a user moves from the US to Japan, they stop matching region:US cohorts and start matching region:JP cohorts -- without any explicit membership update.

Not a filter on items. A cohort defines a population of users, not a subset of items. The items that "trend in a cohort" are items that users in that cohort engage with at high velocity. The cohort constrains the signal aggregation, not the item candidate set.

Not an audience. Cohorts are not used for targeting or ad delivery. They are used to scope signal aggregation for ranking queries. "What is trending among young US jazz fans" is a ranking question, not a targeting question.

Why Cohorts Are Necessary

Global trending surfaces content that appeals to the broadest audience. This is useful but incomplete. A jazz video gaining rapid traction among 18-24 year old US users will never appear on a global trending list dominated by gaming and pop music. But for a user who matches that cohort, that jazz video is the most relevant trending result.

Without cohorts, the application must:

  1. Maintain its own user segmentation system
  2. Track per-segment signal aggregates in a feature store
  3. Build custom trending logic per segment
  4. Stitch these together with the ranking service

This is the feature-store pattern that tidalDB replaces. Cohorts are the mechanism by which it replaces it.


3. Cohort Types

3.1 Static Cohorts

Predicates over immutable or slow-changing user attributes. Membership changes rarely -- only when the user explicitly updates their profile.

DEFINE COHORT us_english AS region:US AND locale IN (en-US, en-GB)
DEFINE COHORT gen_z AS age_range IN (13-17, 18-24)
DEFINE COHORT premium AS account_type:premium

Resolution strategy: Pre-computed roaring bitmap, cached indefinitely. Invalidated and recomputed only when a user's matching attribute changes via update_user(). Because the underlying attributes are application-set and change infrequently, the bitmap is effectively static.

Refresh cost: O(1) per user attribute change (bitmap flip). Full recomputation is O(users) but only triggered on schema change.

3.2 Dynamic Cohorts

Predicates over database-computed attributes. Membership changes as user behavior changes, on the background computation schedule defined in the Entity Model spec.

DEFINE COHORT power_users AS engagement_level:power_user
DEFINE COHORT jazz_fans AS inferred_interests CONTAINS jazz
DEFINE COHORT binge_watchers AS session_pattern:binge AND content_format_preference:long

Resolution strategy: Roaring bitmap refreshed on the same schedule as the underlying computed field. engagement_level is recomputed every 6 hours (Entity Model spec, Section: Field Writability Model), so the power_users cohort bitmap is at most 6 hours stale. inferred_interests is recomputed hourly (incremental) and daily (full), so the jazz_fans bitmap is at most about one hour stale.

Refresh cost: Piggybacks on the existing computed field refresh. No additional computation -- the bitmap is updated as a side effect of the computed field update.

3.3 Hybrid Cohorts

Predicates combining static and dynamic attributes. The most common cohort type in practice.

DEFINE COHORT young_us_jazz AS
    region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz

Resolution strategy: Bitmap intersection of the static components (region:US, age_range:18-24) with the dynamic component (inferred_interests CONTAINS jazz). The static bitmaps are cached. The dynamic bitmap is refreshed on schedule. Intersection is computed on demand or cached with the staleness of the most-stale component.

3.4 Ad-hoc Cohorts

Inline predicates in a query, not named or saved. Used for exploratory queries and one-off analytics.

RETRIEVE items
USING PROFILE trending
FOR COHORT region:JP AND age_range:25-34
WINDOW 24h
LIMIT 25

Resolution strategy: Computed at query time from the predicate. Bitmaps for individual attribute values are always available (they are the term-to-bitmap indexes from the Entity Model spec, Section: Cohort-Ready Design). The compound bitmap is the intersection of these per-value bitmaps. Resolution cost depends on predicate complexity but is bounded by the bitmap intersection performance target (<5ms for compound predicates).

Caching: Ad-hoc cohort bitmaps are not cached between queries. If the same ad-hoc predicate appears frequently, the application should define it as a named cohort to benefit from caching.


4. Cohort Definition Language

4.1 Predicate Syntax

Cohort predicates are boolean expressions over user attribute fields. Every field on the User entity (both application-set and database-computed) is a valid predicate dimension.

Simple equality:

region:US
engagement_level:power_user
account_type:premium

Set membership (IN):

locale IN (en-US, en-GB, en-AU)
age_range IN (18-24, 25-34)

Contains (for keywords fields):

inferred_interests CONTAINS jazz
explicit_interests CONTAINS cooking
primary_categories CONTAINS music

Range predicates (for numeric fields):

birth_year:1995-2005
platform_tenure_days > 365
daily_active_hours >= 4.0
followed_creator_count:100-1000

Negation:

NOT engagement_level:dormant
NOT account_type:admin

Compound predicates (AND/OR/NOT with grouping):

region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz
(region:US OR region:CA) AND age_range:18-24
region:US AND NOT engagement_level:dormant
(locale IN (en-US, en-GB) OR language:en) AND engagement_level:power_user

4.2 Named Cohort Definition

Named cohorts are defined in schema and persist across queries. They are the recommended approach for any cohort used more than once.

db.define_cohort(CohortDef {
    name: "young_us_jazz",
    predicate: Predicate::and(vec![
        Predicate::eq("region", "US"),
        Predicate::eq("age_range", "18-24"),
        Predicate::contains("inferred_interests", "jazz"),
    ]),
})?;

db.define_cohort(CohortDef {
    name: "latam_power_users",
    predicate: Predicate::and(vec![
        Predicate::in_set("region", &["BR", "MX", "AR", "CO", "CL"]),
        Predicate::eq("engagement_level", "power_user"),
    ]),
})?;

db.define_cohort(CohortDef {
    name: "long_form_enthusiasts",
    predicate: Predicate::and(vec![
        Predicate::eq("content_format_preference", "long"),
        Predicate::gt("daily_active_hours", 2.0),
        Predicate::not(Predicate::eq("engagement_level", "dormant")),
    ]),
})?;

Text DSL equivalent (for query strings and configuration):

DEFINE COHORT young_us_jazz AS region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz
DEFINE COHORT latam_power_users AS region IN (BR, MX, AR, CO, CL) AND engagement_level:power_user
DEFINE COHORT long_form_enthusiasts AS content_format_preference:long AND daily_active_hours > 2.0 AND NOT engagement_level:dormant

4.3 Predicate Validation Rules

  1. Every field referenced in a predicate must exist on the User entity. Referencing a non-existent field returns SchemaError::UnknownField.
  2. Predicate operators must match the field type. > on a keyword field returns SchemaError::TypeMismatch. CONTAINS on a non-keywords field returns SchemaError::TypeMismatch.
  3. Cohort names must be unique. Redefining a cohort with the same name replaces the previous definition (the bitmap is recomputed on the next refresh cycle).
  4. Maximum predicate depth is 8 levels of nesting. This prevents pathological evaluation but allows all practical cohort definitions.
  5. Maximum 500 named cohorts. This is a practical limit on the schema catalog, not on query-time ad-hoc cohorts which are unlimited.
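
The field, operator and depth rules above can be sketched as a recursive check over a predicate tree. This is a minimal illustration, not the tidalDB implementation: `Pred`, `FieldType`, `validate` and the `TooDeep` variant are hypothetical names, and counting the root as depth 1 is an assumption. Only `SchemaError::UnknownField` and `SchemaError::TypeMismatch` come from the rules themselves.

```rust
use std::collections::HashMap;

/// Simplified field types on the User entity (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FieldType { Keyword, Keywords, Numeric }

/// Hypothetical predicate tree; the real `Predicate` type may differ.
#[derive(Debug, Clone)]
pub enum Pred {
    Eq(String, String),
    Contains(String, String),
    Gt(String, f64),
    Not(Box<Pred>),
    And(Vec<Pred>),
    Or(Vec<Pred>),
}

#[derive(Debug, PartialEq)]
pub enum SchemaError {
    UnknownField(String),
    TypeMismatch(String),
    /// Hypothetical variant for rule 4 (nesting limit).
    TooDeep,
}

pub const MAX_DEPTH: usize = 8;

/// Validate a predicate against the user schema. Call with `depth = 1` for the root.
pub fn validate(
    p: &Pred,
    schema: &HashMap<String, FieldType>,
    depth: usize,
) -> Result<(), SchemaError> {
    if depth > MAX_DEPTH {
        return Err(SchemaError::TooDeep);
    }
    // Rule 1 (field must exist) and rule 2 (operator must match field type).
    let check = |field: &str, want: FieldType| match schema.get(field) {
        None => Err(SchemaError::UnknownField(field.to_string())),
        Some(t) if *t != want => Err(SchemaError::TypeMismatch(field.to_string())),
        Some(_) => Ok(()),
    };
    match p {
        Pred::Eq(f, _) => check(f, FieldType::Keyword),
        Pred::Contains(f, _) => check(f, FieldType::Keywords),
        Pred::Gt(f, _) => check(f, FieldType::Numeric),
        Pred::Not(inner) => validate(inner, schema, depth + 1),
        Pred::And(ps) | Pred::Or(ps) => {
            ps.iter().try_for_each(|c| validate(c, schema, depth + 1))
        }
    }
}
```

Validating at definition time keeps type errors out of the query path: a cohort that passes `validate` should always resolve to bitmap operations that exist for its field types.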

4.4 Predicate Type Reference

| Operator | Applicable Field Types | Bitmap Operation | Example |
|---|---|---|---|
| : (equality) | keyword, computed(keyword) | Direct bitmap lookup | region:US |
| IN | keyword, computed(keyword) | Union of bitmaps per value | region IN (US, CA, MX) |
| CONTAINS | keywords, computed(keywords) | Direct bitmap lookup per value | inferred_interests CONTAINS jazz |
| >, >=, <, <= | i64, f64, computed(i64), computed(f64) | Range scan on sorted numeric index | platform_tenure_days > 365 |
| range (a-b) | i64, f64, computed(i64), computed(f64) | Range scan on sorted numeric index | birth_year:1995-2005 |
| NOT | any predicate | Bitmap complement | NOT engagement_level:dormant |
| AND | predicates | Bitmap intersection | region:US AND age_range:18-24 |
| OR | predicates | Bitmap union | region:US OR region:CA |

5. Membership Resolution

5.1 Resolution Mechanism

Cohort membership is resolved using the roaring bitmap indexes maintained by the Entity Model (spec 02, Section: Cohort-Ready Design). Every keyword and keywords field on the User entity has a term-to-bitmap index. Every numeric field has a sorted numeric index that supports range predicate resolution to bitmaps.

Resolution of "region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz":

Step 1: region_bitmap["US"]              --> bitmap A  (all US users)
Step 2: age_range_bitmap["18-24"]        --> bitmap B  (all 18-24 users)
Step 3: interests_bitmap["jazz"]         --> bitmap C  (all jazz-interested users)
Step 4: A AND B AND C                    --> bitmap D  (the cohort)

Bitmap D is the cohort's resolved membership.
Cardinality: |D| = roaring::cardinality(D)
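
The four steps above amount to a lookup per term followed by an intersection. The sketch below uses `BTreeSet<u32>` as a stand-in for a roaring bitmap so that it runs with the standard library alone; `Bitmap`, `TermIndex` and `resolve_and` are illustrative names, not part of the tidalDB API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a roaring bitmap: a sorted set of user IDs.
pub type Bitmap = BTreeSet<u32>;

/// Term-to-bitmap index for one field, e.g. region: {"US" -> bitmap}.
pub type TermIndex = HashMap<String, Bitmap>;

/// Intersect per-term bitmaps. A term with no bitmap (`None`) matches no
/// users, so the whole AND resolves to an empty cohort.
pub fn resolve_and(terms: &[Option<&Bitmap>]) -> Bitmap {
    let mut maps: Vec<&Bitmap> = Vec::with_capacity(terms.len());
    for t in terms.iter().copied() {
        match t {
            Some(b) => maps.push(b),
            None => return Bitmap::new(),
        }
    }
    // Start from the smallest bitmap to keep intermediate results small.
    maps.sort_by_key(|b| b.len());
    let mut iter = maps.into_iter();
    let first = match iter.next() {
        Some(b) => b.clone(),
        None => return Bitmap::new(),
    };
    iter.fold(first, |acc, b| acc.intersection(b).copied().collect())
}
```

With bitmaps A = {1, 2, 3, 4}, B = {2, 3, 5} and C = {3, 4, 5}, `resolve_and` yields D = {3}, and the cohort cardinality is `D.len()`.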

5.2 Resolution Latency Targets

| Cohort Type | Resolution Target | Mechanism |
|---|---|---|
| Named static cohort | < 1ms | Pre-computed bitmap, cached in memory |
| Named dynamic cohort | < 1ms | Pre-computed bitmap, refreshed on schedule |
| Named hybrid cohort | < 2ms | Intersection of cached static + cached dynamic |
| Ad-hoc, 1 predicate term | < 1ms | Single bitmap lookup |
| Ad-hoc, 2-3 predicate terms (AND) | < 2ms | 2-3 bitmap intersections |
| Ad-hoc, 4+ predicate terms | < 5ms | Multiple bitmap operations |
| Ad-hoc with range predicates | < 5ms | Range scan + bitmap intersection |
| Ad-hoc with NOT | < 3ms | Bitmap complement + intersection |

These targets assume 10M users and the bitmap memory budget of ~630 MB from the Entity Model spec.

5.3 Bitmap Caching Strategy

Named cohorts: The resolved bitmap is cached in memory alongside the cohort definition. Cache lifetime depends on cohort type:

| Cohort Type | Cache Lifetime | Invalidation Trigger |
|---|---|---|
| Static | Indefinite | Any update_user() that changes a matching field |
| Dynamic | Matches computed field refresh interval | Background materializer recomputes the underlying field |
| Hybrid | Min(static lifetime, dynamic refresh interval) | Either trigger above |

Invalidation mechanism for static cohorts: When update_user() modifies a field referenced by any named cohort's predicate, the affected cohort bitmaps must be brought up to date. This happens either immediately on the write path (eager, default) or by marking the bitmap dirty and recomputing it on the next query that reads it (lazy). The choice is configurable:

CohortConfig {
    // Eager: recompute bitmap immediately on user attribute change.
    // Higher write-path cost, always-fresh bitmaps.
    // Lazy: mark dirty, recompute on next query.
    // Lower write-path cost, first query after change pays recomputation.
    invalidation: CohortInvalidation::Eager,  // default
}

In practice, for static cohorts, the invalidation cost is trivial: flipping one bit in a roaring bitmap per user update. Eager invalidation is the right default.
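
The difference between the two modes can be sketched with a small cache type. This is an illustration under assumptions: `CachedCohort` and its methods are hypothetical, a `BTreeSet<u32>` stands in for the roaring bitmap, and the lazy path queues per-user changes instead of recomputing the full predicate.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CohortInvalidation { Eager, Lazy }

/// Hypothetical in-memory cache entry for one named static cohort.
pub struct CachedCohort {
    pub members: BTreeSet<u32>,
    /// Queued (user_id, now_matches) changes; only used in lazy mode.
    pending: Vec<(u32, bool)>,
    mode: CohortInvalidation,
}

impl CachedCohort {
    pub fn new(mode: CohortInvalidation) -> Self {
        Self { members: BTreeSet::new(), pending: Vec::new(), mode }
    }

    /// Called on the update_user() path after re-evaluating the predicate
    /// for this one user.
    pub fn on_user_update(&mut self, user: u32, now_matches: bool) {
        match self.mode {
            // Eager: flip the bit immediately, on the write path.
            CohortInvalidation::Eager => self.apply(user, now_matches),
            // Lazy: remember the change; the next reader pays for it.
            CohortInvalidation::Lazy => self.pending.push((user, now_matches)),
        }
    }

    /// True while queued changes have not yet been applied.
    pub fn is_dirty(&self) -> bool {
        !self.pending.is_empty()
    }

    /// Query-side read: drains any queued changes first.
    pub fn read(&mut self) -> &BTreeSet<u32> {
        let pending = std::mem::take(&mut self.pending);
        for (user, now_matches) in pending {
            self.apply(user, now_matches);
        }
        &self.members
    }

    fn apply(&mut self, user: u32, now_matches: bool) {
        if now_matches {
            self.members.insert(user);
        } else {
            self.members.remove(&user);
        }
    }
}
```

In eager mode the bitmap is never dirty when a query arrives. In lazy mode the write path only appends to a queue, and the first `read()` after a change absorbs the cost.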

Dynamic cohort refresh: Dynamic cohort bitmaps are refreshed by the background materializer as a side effect of computed field updates. When engagement_level is recomputed for a batch of users, every named cohort with engagement_level in its predicate has its bitmap updated in the same pass. No separate cohort refresh job is needed.

5.4 Integration with Signal System Dimensional Hierarchy

The Signal System spec (Section 7) defines a dimensional hierarchy for cohort-scoped signal aggregation: three materialized levels (0-2) plus a composite level that is estimated at query time:

Level 0: GLOBAL                           -- one counter per item per signal per window
Level 1: PRIMARY DIMENSIONS               -- region (~20), language (~30), age_group (6)
Level 2: BEHAVIORAL SEGMENTS              -- up to 100 application-defined segments
Level 3: COMPOSITE (query-time estimate)  -- intersection of Level 1 and Level 2

Cohort membership resolution feeds directly into this hierarchy:

| Cohort Predicate | Dimensional Level | Signal Aggregation Path |
|---|---|---|
| Single Level 1 dimension (e.g., region:US) | Level 1 | Exact rollup lookup |
| Single Level 2 segment (e.g., engagement_level:power_user) | Level 2 | Exact rollup lookup |
| Multiple Level 1 dimensions (e.g., region:US AND age_range:18-24) | Level 3 | Independence estimation from Level 1 rollups |
| Level 1 + Level 2 (e.g., region:US AND jazz_fans) | Level 3 | Independence estimation from Level 1 + Level 2 |
| Named cohort registered as Level 2 segment | Level 2 | Exact rollup lookup |

The key design decision: Any named cohort can optionally be registered as a Level 2 behavioral segment, which activates exact counter tracking at signal write time. This trades write amplification for query accuracy. The threshold for when to promote a cohort to Level 2 is discussed in Section 13 (Accuracy Analysis).

db.define_cohort(CohortDef {
    name: "young_us_jazz",
    predicate: Predicate::and(vec![
        Predicate::eq("region", "US"),
        Predicate::eq("age_range", "18-24"),
        Predicate::contains("inferred_interests", "jazz"),
    ]),
    // Promote to Level 2 segment for exact signal tracking.
    // Costs ~1 additional counter increment per signal write
    // from users matching this cohort, but provides exact
    // cohort-scoped signal aggregates instead of estimates.
    exact_tracking: true,
})?;

6. The Three-Layer Trending Model

This is the organizing principle of the entire cohort system. Every feature, every API extension, and every storage decision exists to serve this three-layer model.

6.1 Layer 1: Global Trending

What is trending everywhere?

RETRIEVE items
USING PROFILE trending
WINDOW 24h
LIMIT 25

This query uses Level 0 (global) signal aggregates. It is already fully specified in the Signal System spec. No cohort resolution is involved. The ranking profile trending reads global velocity signals (share velocity, view velocity, engagement ratio) and ranks by pure signal momentum.

Signal path: Global counters in the hot tier and warm tier. O(1) per entity per signal. Exact.

Latency target: < 20ms for 25 results.

6.2 Layer 2: Cohort Trending

What is trending for users matching a profile?

RETRIEVE items
USING PROFILE trending
FOR COHORT young_us_jazz
WINDOW 24h
LIMIT 25

This query scopes signal aggregation to users matching the young_us_jazz cohort predicate. Instead of reading global view velocity, the query engine reads the cohort-scoped view velocity: "how many views did this item receive in the last 24 hours from users in this cohort?"

Signal path depends on how the cohort maps to the dimensional hierarchy:

Case A -- Single primary dimension (exact):

RETRIEVE items USING PROFILE trending FOR COHORT region:US WINDOW 24h LIMIT 25

Maps to Level 1 rollup for region:US. Direct counter lookup. Exact.

Case B -- Named cohort registered as Level 2 segment (exact):

RETRIEVE items USING PROFILE trending FOR COHORT young_us_jazz WINDOW 24h LIMIT 25

If young_us_jazz has exact_tracking: true, it is a Level 2 behavioral segment with its own counters. Direct counter lookup. Exact.

Case C -- Composite query (estimated):

RETRIEVE items USING PROFILE trending FOR COHORT region:US AND age_range:18-24 WINDOW 24h LIMIT 25

No exact counters for this intersection. Estimated from Level 1 rollups using the independence assumption:

C(region:US AND age_range:18-24) ~= C(region:US) * C(age_range:18-24) / C(global)

Accuracy: ~85-95% for weakly correlated dimensions (Section 13).
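
As a worked instance of the independence estimate, suppose an item received 10,000 views globally in the window, 4,000 of them from region:US and 2,500 from age_range:18-24. The estimate for the intersection is 4,000 * 2,500 / 10,000 = 1,000 views. The function below is a minimal sketch of that arithmetic; the name `estimate_composite` and the example counts are illustrative, not taken from the Signal System spec.

```rust
/// Estimate the signal count for the intersection of two dimensions from
/// their Level 1 rollups, assuming the dimensions are independent:
///     C(a AND b) ~= C(a) * C(b) / C(global)
pub fn estimate_composite(count_a: u64, count_b: u64, count_global: u64) -> f64 {
    if count_global == 0 {
        // No global events means no events in any cohort either.
        return 0.0;
    }
    (count_a as f64) * (count_b as f64) / (count_global as f64)
}
```

Because each Level 1 count is at most the global count, the estimate never exceeds the smaller of the two inputs. It drifts from the true value when the dimensions are correlated, which is the case exact_tracking is meant to cover.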

Latency target: < 50ms for 25 results (includes cohort resolution + signal aggregation + ranking).

6.3 Layer 3: Search Within Cohort Trending

Text or semantic search constrained to what is trending in a cohort.

SEARCH items
QUERY "piano"
WITHIN TRENDING FOR COHORT young_us_jazz
WINDOW 24h
LIMIT 20

This is the most complex query in the system. It composes three operations:

  1. Cohort resolution: Resolve young_us_jazz to a user bitmap.
  2. Cohort trending candidate generation: Identify items with high cohort-scoped velocity in the 24h window. This produces a candidate set (e.g., the top 500 items trending in this cohort).
  3. Search within candidates: Apply BM25 and/or semantic search for "piano" within the candidate set only. Rank by text relevance, re-weighted by cohort trending score.

Execution plan:

Step 1: Resolve cohort "young_us_jazz"           --> bitmap D (user set)
            Cost: < 2ms (cached bitmap intersection)

Step 2: Generate cohort trending candidates
            Read cohort-scoped velocity for all items with cohort tracking active
            Filter to items with velocity above threshold
            Sort by cohort velocity
            Take top 500 candidates
            Cost: < 20ms (scan 100K cohort-tracked items)

Step 3: Apply text search "piano" within 500 candidates
            BM25 score against inverted index, intersected with candidate set
            Optional: semantic search with query embedding
            Hybrid fusion (RRF or weighted) if both text and vector
            Cost: < 10ms (inverted index lookup + candidate intersection)

Step 4: Final ranking
            Combine text relevance score with cohort velocity score
            Apply diversity constraints
            Return top 20
            Cost: < 5ms

Total: < 37ms (within 50ms budget)

Query semantics: WITHIN TRENDING means "restrict the candidate set to items that are currently trending in this scope." It is not a filter (which would eliminate items from an existing candidate set) -- it is a candidate generation strategy. Items not trending in the cohort are never considered, regardless of their text relevance.
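
The candidate-generation semantics can be illustrated with a small sketch of Steps 2 and 3. It is a simplification under stated assumptions: `Item`, `trending_candidates` and `search_within` are hypothetical names, a case-insensitive substring match stands in for BM25, and matches are returned in cohort-velocity order without the combined scoring of Step 4.

```rust
/// One item with its cohort-scoped velocity (events/hour) for the window.
#[derive(Debug, Clone, PartialEq)]
pub struct Item {
    pub id: u32,
    pub title: String,
    pub cohort_velocity: f64,
}

/// Step 2: keep items at or above the velocity threshold, sort by cohort
/// velocity (descending), and take the top_k as the candidate set.
pub fn trending_candidates(items: &[Item], min_velocity: f64, top_k: usize) -> Vec<Item> {
    let mut candidates: Vec<Item> = items
        .iter()
        .filter(|i| i.cohort_velocity >= min_velocity)
        .cloned()
        .collect();
    candidates.sort_by(|a, b| b.cohort_velocity.total_cmp(&a.cohort_velocity));
    candidates.truncate(top_k);
    candidates
}

/// Step 3: search only inside the candidate set. Items outside it are never
/// scored, however well they match the query text.
pub fn search_within(candidates: &[Item], query: &str, limit: usize) -> Vec<u32> {
    let q = query.to_lowercase();
    candidates
        .iter()
        .filter(|i| i.title.to_lowercase().contains(&q))
        .take(limit)
        .map(|i| i.id)
        .collect()
}
```

An item titled "Piano basics" with negligible cohort velocity never reaches `search_within`, even though it matches the query. That is the difference between WITHIN TRENDING and an ordinary filter applied after text retrieval.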

Latency target: < 50ms for 20 results.


7. Integration Architecture

How Cohorts Connect the Three Subsystems

                    ┌──────────────────────────────────────────────┐
                    │              QUERY ENGINE                     │
                    │                                              │
                    │  RETRIEVE items                              │
                    │  USING PROFILE trending                      │
                    │  FOR COHORT young_us_jazz        ┌────────┐ │
                    │  WINDOW 24h                      │ Result │ │
                    │  LIMIT 25                        │  Set   │ │
                    │                                  └────┬───┘ │
                    └──────────┬───────────────────────────┬┘─────┘
                               │                           │
                    ┌──────────▼───────────┐    ┌──────────▼──────────┐
                    │   ENTITY MODEL       │    │   SIGNAL SYSTEM     │
                    │                      │    │                     │
                    │  User attributes:    │    │  Dimensional rollups:│
                    │  - region: "US"      │    │  Level 0: global    │
                    │  - age_range: "18-24"│    │  Level 1: region,   │
                    │  - inferred_interests│    │    language, age     │
                    │    ["jazz", ...]     │    │  Level 2: segments  │
                    │                      │    │  Level 3: composite │
                    │  Bitmap indexes:     │    │    (estimated)      │
                    │  region["US"] → bmp  │    │                     │
                    │  age["18-24"] → bmp  │    │  Cohort-scoped      │
                    │  interest["jazz"]→bmp│    │  velocity per item  │
                    │                      │    │                     │
                    │  Cohort resolution:  │    │  Write-time cohort  │
                    │  A ∩ B ∩ C → bitmap D│    │  attribution:       │
                    │                      │    │  user memberships → │
                    │  UserCohortMembership│    │  counter increments │
                    │  cached per user     │    │                     │
                    └──────────────────────┘    └─────────────────────┘
                               │                           ▲
                               │    UserCohortMemberships  │
                               └───────────────────────────┘
                                 (cached on user, used at
                                  signal write time for
                                  cohort counter attribution)

Data Flow: Signal Write with Cohort Attribution

When a signal event arrives (e.g., user_123 views item_abc):

1. Load user_123's UserCohortMemberships from hot-tier cache
       {region: US, language: en, age_group: 18-24, segments: [jazz_fans, power_users]}

2. Check if item_abc has cohort tracking active
       (global signal rate > COHORT_ACTIVATION_THRESHOLD)

3. If cohort tracking active:
       a. Increment global counter (Level 0)                       -- always
       b. Increment region:US counter (Level 1)                    -- from membership
       c. Increment language:en counter (Level 1)                  -- from membership
       d. Increment age_group:18-24 counter (Level 1)             -- from membership
       e. Increment jazz_fans segment counter (Level 2)            -- from membership
       f. Increment power_users segment counter (Level 2)          -- from membership
       g. If young_us_jazz has exact_tracking:                     -- named cohort
          Increment young_us_jazz segment counter (Level 2)

4. If cohort tracking not active:
       a. Increment global counter only (Level 0)
       b. Check if global counter crossed activation threshold
          If yes, activate cohort tracking for item_abc
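
The write path above can be sketched as a single function over per-item counters. This is an illustration, not the storage layout: `Memberships`, `ItemCounters` and `record_signal` are hypothetical, the threshold is compared against a raw event count where the spec speaks of a signal rate, and the value 3 is chosen only to keep the example small.

```rust
use std::collections::HashMap;

/// Illustrative value; the real threshold is a configuration parameter.
pub const COHORT_ACTIVATION_THRESHOLD: u64 = 3;

/// Cached cohort memberships for one user, as dimension keys.
pub struct Memberships {
    pub level1: Vec<String>, // e.g. "region:US", "age_group:18-24"
    pub level2: Vec<String>, // e.g. "jazz_fans", or a cohort with exact_tracking
}

/// Counters for one (item, signal, window) combination.
#[derive(Default)]
pub struct ItemCounters {
    pub global: u64,
    pub cohort_tracking: bool,
    pub scoped: HashMap<String, u64>,
}

pub fn record_signal(counters: &mut ItemCounters, user: &Memberships) {
    // Level 0 is always incremented.
    counters.global += 1;
    if counters.cohort_tracking {
        // Levels 1 and 2: one increment per membership of the acting user.
        for key in user.level1.iter().chain(user.level2.iter()) {
            *counters.scoped.entry(key.clone()).or_insert(0) += 1;
        }
    } else if counters.global > COHORT_ACTIVATION_THRESHOLD {
        // The item crossed the threshold: later events are attributed.
        counters.cohort_tracking = true;
    }
}
```

Events that arrive before activation contribute only to the global counter, so cohort-scoped counts for an item begin at the activation point and do not cover its full history.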

Data Flow: Cohort Query

When a FOR COHORT query arrives:

1. Resolve cohort predicate to query plan
       Parse "young_us_jazz" → lookup named cohort definition
       Determine dimensional mapping:
         - If exact_tracking: true → Level 2 segment lookup
         - If single Level 1 dimension → Level 1 rollup lookup
         - If composite → independence estimation

2. For each candidate item (items with cohort tracking active):
       Read cohort-scoped signal aggregates per the query plan
       Compute velocity within the requested window

3. Rank candidates by cohort-scoped velocity
       Apply ranking profile (trending: velocity-dominant)
       Apply diversity constraints
       Return top-K results
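
Step 1 of this flow, choosing the aggregation path, follows directly from the mapping table in Section 5.4. A minimal sketch, assuming the planner has already counted how many Level 1 and Level 2 terms the predicate contains; `AggregationPath`, `CohortPlanInput` and `choose_path` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum AggregationPath {
    /// No cohort terms: read Level 0 counters.
    Global,
    /// Exact Level 1 rollup lookup.
    Level1Rollup,
    /// Exact Level 2 segment lookup.
    Level2Segment,
    /// Level 3: estimate from Level 1 / Level 2 rollups at query time.
    IndependenceEstimate,
}

/// Summary of a resolved cohort predicate, as seen by the planner.
pub struct CohortPlanInput {
    pub exact_tracking: bool,
    pub level1_terms: usize,
    pub level2_terms: usize,
}

pub fn choose_path(c: &CohortPlanInput) -> AggregationPath {
    // A named cohort with exact_tracking has its own Level 2 counters,
    // regardless of how many terms its predicate contains.
    if c.exact_tracking {
        return AggregationPath::Level2Segment;
    }
    match (c.level1_terms, c.level2_terms) {
        (0, 0) => AggregationPath::Global,
        (1, 0) => AggregationPath::Level1Rollup,
        (0, 1) => AggregationPath::Level2Segment,
        _ => AggregationPath::IndependenceEstimate,
    }
}
```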

8. Cohort-Scoped Ranking Profiles

8.1 Cohort Trending as a Boost Signal

Ranking profiles can reference cohort trending as a boost signal. This enables "For You, weighted toward what is trending among people like you."

db.define_profile(ProfileDef {
    name: "for_you_cohort_aware",
    version: 1,
    candidate: Candidate::Ann {
        query_vector: VectorSource::UserPreference,
        index: EntityKind::Item,
        top_k: 500,
    },
    boosts: vec![
        Boost::signal("view", Window::hours(24), Velocity, 0.3),
        Boost::relationship("interaction_weight", 0.2),
        Boost::social_proof(0.15),
        // New: boost items trending in the querying user's cohort
        Boost::cohort_trending("auto", Window::hours(24), 0.2),
    ],
    // ...
})?;

Boost::cohort_trending("auto", ...) derives the querying user's primary cohort from their attributes (region + age_range + top inferred interest) and boosts items trending in that cohort. A specific cohort name can be used in place of "auto":

Boost::cohort_trending("young_us_jazz", Window::hours(24), 0.2)

8.2 Cohort-Relative Scoring

A powerful discovery signal: "this item is trending MORE in this cohort than globally." An item with global velocity of 100/hour and cohort velocity of 500/hour has a cohort-relative score of 5.0 -- it is 5x more popular among this cohort than the general population. This surfaces content that is specifically resonant with a population segment.

Boost::cohort_relative("young_us_jazz", Window::hours(24), 0.25)

The cohort-relative score is computed as:

cohort_relative_score = cohort_velocity / max(global_velocity, floor)

Where floor prevents division by zero and dampens noise for low-traffic items. Default floor: 10.0 events/hour.
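
The formula and its floor can be written out directly. The function below is a sketch of the arithmetic only, assuming both velocities are expressed on the same scale as in the example above; `cohort_relative_score` and `DEFAULT_FLOOR` are illustrative names.

```rust
/// Default floor in events/hour, as stated in the text.
pub const DEFAULT_FLOOR: f64 = 10.0;

/// cohort_relative_score = cohort_velocity / max(global_velocity, floor)
pub fn cohort_relative_score(cohort_velocity: f64, global_velocity: f64, floor: f64) -> f64 {
    cohort_velocity / global_velocity.max(floor)
}
```

With the numbers from the text, `cohort_relative_score(500.0, 100.0, DEFAULT_FLOOR)` is 5.0. For a low-traffic item with a global velocity of 2 events/hour and a cohort velocity of 20, the floor caps the ratio at 20 / 10 = 2.0 instead of 10.0.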

8.3 Cohort Trending as a Candidate Source

Instead of using ANN or scan for candidate generation, a ranking profile can use cohort trending as its candidate source:

db.define_profile(ProfileDef {
    name: "trending_for_you",
    version: 1,
    candidate: Candidate::CohortTrending {
        cohort: CohortSource::Auto, // derive from querying user
        window: Window::hours(24),
        top_k: 200,
    },
    boosts: vec![
        // Re-rank by user preference match
        Boost::preference_match(0.3),
        Boost::signal("completion", Window::all_time(), Value, 0.2),
    ],
    // ...
})?;

This generates candidates from "items trending in the user's cohort" and then re-ranks by personal preference. It answers the question: "Of the things trending among people like me, which ones match my specific taste?"

8.4 CohortSource Enum

pub enum CohortSource {
    /// Derive cohort from the querying user's attributes.
    /// Uses the user's region, age_range, and top inferred interest
    /// to construct an automatic cohort predicate.
    Auto,

    /// Use a specific named cohort.
    Named(String),

    /// Use an inline predicate (ad-hoc cohort).
    Predicate(Predicate),
}

9. Hierarchical Cohort Model

9.1 Natural Hierarchy

Cohorts form a natural hierarchy that mirrors the signal system's dimensional hierarchy:

Global (all users)
├── Region (US, EU, APAC, LATAM, ...)
│   ├── Locale (en-US, en-GB, es-MX, ...)
│   └── Region + Age (US:18-24, US:25-34, ...)
│       └── Region + Age + Interest (US:18-24:jazz, ...)
├── Language (en, es, ja, ...)
│   └── Language + Age (en:18-24, ...)
├── Age Group (13-17, 18-24, 25-34, ...)
└── Behavioral Segments (power_users, jazz_fans, ...)
    └── Region + Segment (US:jazz_fans, ...)

9.2 Roll-up and Drill-down

The hierarchy enables efficient navigation:

Roll-up: "Trending in US" is the parent of "Trending in US among 18-24." If the child cohort is too small to produce reliable trending data (fewer than 1000 active users), the system falls back to the parent cohort and applies a weaker cohort-relative boost.

Drill-down: "Trending in US" can be decomposed into "Trending in US among 18-24" vs "Trending in US among 25-34" for analytics or A/B comparison.

9.3 Mapping to Signal System Levels

| Hierarchy Level | Signal System Level | Counter Type | Accuracy |
|---|---|---|---|
| Global | Level 0 | Always maintained | Exact |
| Single primary dimension | Level 1 | Always maintained for active items | Exact |
| Single behavioral segment | Level 2 | Maintained for registered segments | Exact |
| Two primary dimensions | Level 3 | Estimated at query time | ~85-95% |
| Primary + behavioral | Level 3 | Estimated at query time | ~75-90% |
| Named cohort with exact_tracking | Level 2 | Maintained as explicit segment | Exact |

9.4 Minimum Population Threshold

Cohort-scoped trending is only meaningful when the cohort has sufficient active users to produce statistically reliable signal velocity. A cohort of 10 users cannot have meaningful "trending" content.

Minimum population for cohort trending queries:

| Query Type | Minimum Cohort Size | Rationale |
|---|---|---|
| Cohort trending (top 25) | 1,000 active users in window | Statistical reliability of velocity |
| Cohort trending (top 10) | 500 active users in window | Smaller result set needs less data |
| Search within cohort trending | 2,000 active users in window | Needs enough trending candidates to search within |
| Cohort-relative scoring | 500 active users in window | Ratio needs denominator stability |

"Active users in window" means users in the cohort who have generated at least one signal event within the query window.

When a cohort is below the minimum population threshold, the query engine:

  1. Returns a warning in the response: CohortWarning::InsufficientPopulation { cohort, size, minimum }.
  2. Falls back to the nearest parent cohort in the hierarchy that meets the threshold.
  3. Applies a cohort-relative boost from the original cohort (if any exact data exists) as a secondary signal.
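
Steps 1 and 2 of the fallback can be sketched as a walk up the hierarchy. The sketch assumes each cohort records a single parent and that the hierarchy is a tree rooted at the global population; `CohortNode` and `resolve_with_fallback` are hypothetical, while the fields of `CohortWarning::InsufficientPopulation` follow the warning named above.

```rust
/// One node in the cohort hierarchy (a tree rooted at the global population).
pub struct CohortNode {
    pub name: String,
    /// Users in the cohort with at least one signal event in the query window.
    pub active_users: u64,
    /// Index of the parent node; `None` for the global root.
    pub parent: Option<usize>,
}

#[derive(Debug, PartialEq)]
pub enum CohortWarning {
    InsufficientPopulation { cohort: String, size: u64, minimum: u64 },
}

/// Return the index of the cohort to query and a warning if a fallback
/// occurred. If no ancestor meets the minimum, the root is used.
pub fn resolve_with_fallback(
    nodes: &[CohortNode],
    start: usize,
    minimum: u64,
) -> (usize, Option<CohortWarning>) {
    let requested = &nodes[start];
    if requested.active_users >= minimum {
        return (start, None);
    }
    let warning = CohortWarning::InsufficientPopulation {
        cohort: requested.name.clone(),
        size: requested.active_users,
        minimum,
    };
    // Walk toward the root until a large-enough ancestor is found.
    let mut current = start;
    while let Some(parent) = nodes[current].parent {
        current = parent;
        if nodes[current].active_users >= minimum {
            break;
        }
    }
    (current, Some(warning))
}
```

For a chain global > region:US > US:18-24 > US:18-24:jazz with 120 active users in the leaf and 800 in its parent, a top-25 trending query (minimum 1,000) falls back to region:US and reports the original cohort in the warning.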

10. Cohort Analytics

Platform operators need inverse queries -- not "what is trending in this cohort" but "what cohorts is this item trending in." These are operator-facing analytics, not end-user queries.

10.1 Item Cohort Performance

"Which cohorts is this item performing best in?"

let analysis = db.analyze_item_cohorts(AnalyzeItemCohorts {
    item: "item_abc",
    signal: "view",
    window: Window::hours(24),
    // Return cohorts where this item's velocity is highest
    sort: CohortAnalysisSort::AbsoluteVelocity,
    limit: 20,
})?;

// Returns:
// [
//   { cohort: "region:BR", velocity: 1200/h, relative: 3.2 },
//   { cohort: "age_range:18-24", velocity: 800/h, relative: 2.1 },
//   { cohort: "jazz_fans", velocity: 600/h, relative: 8.5 },
//   ...
// ]

This query iterates over all Level 1 and Level 2 dimensional rollups for the given item and signal, ranks by velocity, and returns the top cohorts. It answers: "who is this content resonating with?"

10.2 Cohort Velocity Anomalies

"What cohorts are showing unusual velocity for this category?"

let anomalies = db.detect_cohort_anomalies(CohortAnomalyDetection {
    filter: Filter::eq("category", "jazz"),
    signal: "view",
    window: Window::hours(6),
    // Detect cohorts where category velocity is > 2 standard deviations
    // above that cohort's historical baseline for this category
    threshold: AnomalyThreshold::StdDev(2.0),
})?;

// Returns:
// [
//   { cohort: "region:JP", category: "jazz", velocity: 5000/h,
//     baseline: 800/h, z_score: 3.2, since: "2h ago" },
//   ...
// ]

This enables alerting on unusual engagement patterns -- "jazz content is suddenly blowing up in Japan" -- which is valuable for editorial teams and content strategy.
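
A minimal sketch of the threshold check behind `AnomalyThreshold::StdDev`, assuming the baseline is summarized as a mean and standard deviation of historical velocity. `Baseline`, `z_score`, and `is_anomalous` are hypothetical names, and the standard deviation of 1312.5 used in the comment is an assumed value chosen so the numbers line up with the example response above (5000/h against an 800/h baseline, z-score 3.2).

```rust
/// Hypothetical summary of a cohort's historical velocity for a category.
pub struct Baseline {
    pub mean: f64,
    pub std_dev: f64,
}

/// Standard score of the current velocity against the baseline.
/// Returns None when the baseline has no variance, since the score is undefined.
pub fn z_score(velocity: f64, baseline: &Baseline) -> Option<f64> {
    if baseline.std_dev > 0.0 {
        Some((velocity - baseline.mean) / baseline.std_dev)
    } else {
        None
    }
}

/// True when velocity exceeds the baseline by more than `threshold` standard deviations.
/// Example: velocity 5000.0, mean 800.0, assumed std_dev 1312.5, threshold 2.0.
pub fn is_anomalous(velocity: f64, baseline: &Baseline, threshold: f64) -> bool {
    z_score(velocity, baseline).map_or(false, |z| z > threshold)
}
```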

10.3 Cohort Comparison

"How does this item's performance in cohort A compare to cohort B?"

let comparison = db.compare_cohorts(CohortComparison {
    item: "item_abc",
    cohort_a: "young_us_jazz",
    cohort_b: "gen_z",  // broader cohort
    signals: vec!["view", "like", "share", "completion"],
    window: Window::hours(24),
})?;

// Returns:
// {
//   cohort_a: { view: 600/h, like: 120/h, share: 45/h, completion: 0.82 },
//   cohort_b: { view: 200/h, like: 30/h, share: 8/h, completion: 0.65 },
//   ratios: { view: 3.0, like: 4.0, share: 5.6, completion: 1.26 },
// }

This supports A/B analysis of content performance across audience segments.


11. API Surface

11.1 Schema Operations

Define a named cohort:

db.define_cohort(CohortDef {
    name: "young_us_jazz",
    predicate: Predicate::and(vec![
        Predicate::eq("region", "US"),
        Predicate::eq("age_range", "18-24"),
        Predicate::contains("inferred_interests", "jazz"),
    ]),
    exact_tracking: true,   // register as Level 2 segment
})?;

Text DSL:

DEFINE COHORT young_us_jazz
    AS region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz
    WITH EXACT TRACKING

List cohorts:

let cohorts = db.list_cohorts()?;
// Returns: Vec<CohortInfo> with name, predicate, type, cardinality, tracking mode

Describe cohort:

let info = db.describe_cohort("young_us_jazz")?;
// Returns: CohortInfo {
//   name: "young_us_jazz",
//   predicate: "region:US AND age_range:18-24 AND inferred_interests CONTAINS jazz",
//   cohort_type: CohortType::Hybrid,
//   cardinality: 42_350,
//   exact_tracking: true,
//   created_at: ...,
//   last_refreshed: ...,
// }

Drop cohort:

db.drop_cohort("young_us_jazz")?;

Dropping a cohort removes the definition and bitmap from the schema catalog. If the cohort had exact_tracking: true, the corresponding Level 2 segment counters are deallocated on the next background materializer cycle. Historical cohort-scoped signal data is retained in rollups but no longer receives new counter increments.

11.2 Query Extensions

FOR COHORT clause in RETRIEVE:

// Named cohort
let results = db.retrieve(Retrieve {
    entity: EntityKind::Item,
    profile: "trending",
    for_cohort: Some(CohortRef::Named("young_us_jazz")),
    window: Some(Window::hours(24)),
    limit: 25,
    ..Default::default()
})?;

// Ad-hoc cohort
let results = db.retrieve(Retrieve {
    entity: EntityKind::Item,
    profile: "trending",
    for_cohort: Some(CohortRef::Predicate(
        Predicate::and(vec![
            Predicate::eq("region", "JP"),
            Predicate::eq("age_range", "25-34"),
        ])
    )),
    window: Some(Window::hours(24)),
    limit: 25,
    ..Default::default()
})?;

Text DSL:

RETRIEVE items
USING PROFILE trending
FOR COHORT young_us_jazz
WINDOW 24h
LIMIT 25

RETRIEVE items
USING PROFILE trending
FOR COHORT region:JP AND age_range:25-34
WINDOW 24h
LIMIT 25

WITHIN TRENDING FOR COHORT in SEARCH:

let results = db.search(Search {
    query: "piano",
    within_trending: Some(WithinTrending {
        cohort: CohortRef::Named("young_us_jazz"),
        window: Window::hours(24),
        min_velocity: None,     // use default threshold
        max_candidates: 500,    // trending candidate pool size
    }),
    for_user: Some("user_123"),
    profile: "search",
    limit: 20,
    ..Default::default()
})?;

Text DSL:

SEARCH items
QUERY "piano"
WITHIN TRENDING FOR COHORT young_us_jazz
WINDOW 24h
FOR USER @user_123
USING PROFILE search
LIMIT 20

11.3 Write Path

No explicit cohort writes. There is no write_cohort_membership() or add_user_to_cohort() API. Membership is resolved from user attributes. The only writes that affect cohort membership are update_user() (which changes attributes) and the background materializer (which recomputes computed fields).

Signal writes interact with cohorts through the cohort attribution mechanism (Section 7): the user's UserCohortMemberships struct determines which cohort counters are incremented.

11.4 Admin Operations

// List all named cohorts with cardinality
let cohorts = db.list_cohorts()?;

// Describe a specific cohort (predicate, type, cardinality, freshness)
let info = db.describe_cohort("young_us_jazz")?;

// Force refresh a cohort bitmap (normally happens on schedule)
db.refresh_cohort("young_us_jazz")?;

// Drop a named cohort
db.drop_cohort("young_us_jazz")?;

// Get cohort cardinality without full resolution (approximate, from cached bitmap)
let size = db.cohort_cardinality("young_us_jazz")?;
// Returns: 42_350

// Validate a predicate without defining a cohort
// (useful for UI that lets operators build cohort predicates)
let validation = db.validate_predicate(Predicate::and(vec![
    Predicate::eq("region", "US"),
    Predicate::eq("nonexistent_field", "value"),
]));
// Returns: Err(SchemaError::UnknownField("nonexistent_field"))

12. Worked Example

This traces the complete lifecycle of a cohort query, from schema definition through signal writes to query execution and result delivery.

Step 1: Define the Cohort

db.define_cohort(CohortDef {
    name: "young_us_jazz",
    predicate: Predicate::and(vec![
        Predicate::eq("region", "US"),
        Predicate::eq("age_range", "18-24"),
        Predicate::contains("inferred_interests", "jazz"),
    ]),
    exact_tracking: true,
})?;

The database:

  1. Validates the predicate (all fields exist on User entity, types match operators).
  2. Resolves the initial bitmap: region_bitmap["US"] AND age_range_bitmap["18-24"] AND inferred_interests_bitmap["jazz"] = 42,350 users.
  3. Caches the bitmap in memory.
  4. Registers young_us_jazz as a Level 2 behavioral segment in the signal system.
  5. Updates UserCohortMemberships for all 42,350 matching users to include the young_us_jazz segment bit.
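
The bitmap resolution in step 2 can be sketched as an intersection of per-attribute user sets. This illustration uses `BTreeSet<u32>` from the standard library as a stand-in for a compressed bitmap; `AttributeIndex` and `resolve_and` are hypothetical names, and only equality terms joined by AND are handled.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a compressed bitmap of user ids.
pub type UserSet = BTreeSet<u32>;

/// Hypothetical attribute index: (field, value) -> users holding that value.
pub struct AttributeIndex {
    postings: HashMap<(String, String), UserSet>,
}

impl AttributeIndex {
    pub fn new() -> Self {
        Self { postings: HashMap::new() }
    }

    pub fn insert(&mut self, field: &str, value: &str, user: u32) {
        self.postings
            .entry((field.to_string(), value.to_string()))
            .or_default()
            .insert(user);
    }

    /// Resolve a conjunction of equality terms by intersecting their user sets.
    /// An unknown (field, value) pair or an empty term list yields an empty set.
    pub fn resolve_and(&self, terms: &[(&str, &str)]) -> UserSet {
        let mut sets: Vec<&UserSet> = Vec::with_capacity(terms.len());
        for (field, value) in terms {
            match self.postings.get(&(field.to_string(), value.to_string())) {
                Some(set) => sets.push(set),
                None => return UserSet::new(),
            }
        }
        // Start from the smallest set so the running intersection stays small.
        sets.sort_by_key(|s| s.len());
        let mut iter = sets.into_iter();
        let Some(first) = iter.next() else {
            return UserSet::new();
        };
        let mut acc = first.clone();
        for set in iter {
            acc.retain(|user| set.contains(user));
        }
        acc
    }
}
```

Calling `resolve_and(&[("region", "US"), ("age_range", "18-24"), ("inferred_interests", "jazz")])` mirrors the three-way intersection described above; the cardinality of the result is what would be cached alongside the bitmap.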

Step 2: Signal Events Flow In

Over the next hour, users interact with content. Consider one signal event:

db.signal(Signal {
    kind: "view",
    item: "jazz_piano_video_42",
    user: "user_8847",  // a 22-year-old US user who likes jazz
    timestamp: Utc::now(),
    weight: 1.0,
    context: None,
})?;

The signal write path:

  1. Load user_8847's UserCohortMemberships: {region: US, language: en, age_group: 18-24, segments: [jazz_fans, power_users, young_us_jazz]}.
  2. Check if jazz_piano_video_42 has cohort tracking active. It does (it crossed the 100 events/hour threshold 3 hours ago).
  3. Increment counters:
    • Level 0: global view counter for jazz_piano_video_42 (+1)
    • Level 1: region:US counter (+1)
    • Level 1: language:en counter (+1)
    • Level 1: age_group:18-24 counter (+1)
    • Level 2: jazz_fans segment counter (+1)
    • Level 2: power_users segment counter (+1)
    • Level 2: young_us_jazz segment counter (+1) -- exact tracking

Total counter increments for this event: 7 (write amplification: 7x for this event, but only because cohort tracking is active and the user is in 3 segments).
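
The attribution step can be sketched as a function from a user's memberships to the list of counters to increment. This is an illustration only: `CounterKey`, `Memberships`, and `counters_to_increment` are hypothetical names, and the real UserCohortMemberships layout is defined in the Signal System spec.

```rust
/// Hypothetical identifier for one counter attached to an (item, signal) pair.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CounterKey {
    /// Level 0: global counter.
    Global,
    /// Level 1: primary dimension value, e.g. region = US.
    Dimension { field: String, value: String },
    /// Level 2: behavioral segment or exact-tracked named cohort.
    Segment(String),
}

/// Simplified stand-in for a user's cached cohort memberships.
pub struct Memberships {
    pub dimensions: Vec<(String, String)>,
    pub segments: Vec<String>,
}

/// Counters that one signal event increments. Items below the cohort
/// activation threshold only touch the global counter.
pub fn counters_to_increment(
    memberships: &Memberships,
    cohort_tracking_active: bool,
) -> Vec<CounterKey> {
    let mut keys = vec![CounterKey::Global];
    if !cohort_tracking_active {
        return keys;
    }
    keys.extend(memberships.dimensions.iter().map(|(field, value)| {
        CounterKey::Dimension { field: field.clone(), value: value.clone() }
    }));
    keys.extend(memberships.segments.iter().cloned().map(CounterKey::Segment));
    keys
}
```

For user_8847 (three Level 1 dimensions, three Level 2 segments) on a cohort-tracked item this yields seven keys, matching the count above; on an untracked item it yields only the global key.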

Step 3: Query Execution

An application serves a "trending jazz for you" surface:

RETRIEVE items
USING PROFILE trending
FOR COHORT young_us_jazz
WINDOW 24h
LIMIT 25

Query plan:

Phase 1: Candidate Identification
    Source: all items with cohort tracking active (~100K items)
    Filter: items with young_us_jazz segment velocity > 0 in 24h window
    Result: ~2,400 candidate items with non-zero cohort velocity

Phase 2: Signal Read
    For each candidate, read from the young_us_jazz Level 2 segment counters:
    - view.velocity(24h) in young_us_jazz
    - share.velocity(24h) in young_us_jazz
    - like.velocity(24h) in young_us_jazz
    - engagement_ratio in young_us_jazz ((likes + comments + shares) / views)
    Cost: ~2,400 items * 4 signal reads * ~200ns = ~1.9ms

Phase 3: Ranking
    Apply trending profile scoring:
    - share_velocity weight 0.5
    - view_velocity weight 0.3
    - engagement_ratio weight 0.2
    Score each candidate
    Cost: ~2,400 * 50ns = ~120us

Phase 4: Diversity and Result Assembly
    Sort by score
    Apply max_per_creator:1
    Take top 25
    Cost: < 100us

Total: < 2ms for candidate scan + < 5ms for signal reads + < 1ms for ranking
     = < 8ms total (well within 50ms budget)

Result:

Results {
    results: vec![
        RankedItem {
            id: "jazz_piano_video_42",
            score: 0.89,
            signals: SignalSnapshot {
                values: {
                    "view": {"24h": 3420, "1h": 580},
                    "share": {"24h": 245, "1h": 67},
                    "like": {"24h": 890, "1h": 156},
                },
            },
            cohort_signals: Some(CohortSignalSnapshot {
                cohort: "young_us_jazz",
                values: {
                    "view": {"24h": 1850, "1h": 312},
                    "share": {"24h": 178, "1h": 52},
                    "like": {"24h": 620, "1h": 108},
                },
            }),
        },
        // ... 24 more items
    ],
    next_cursor: Some(...),
    total_candidates: 2400,
    cohort_info: Some(CohortQueryInfo {
        name: "young_us_jazz",
        cardinality: 42_350,
        active_in_window: 8_920,
        accuracy: CohortAccuracy::Exact,
    }),
}

The user types "piano" in the search bar on the same surface:

SEARCH items
QUERY "piano"
WITHIN TRENDING FOR COHORT young_us_jazz
WINDOW 24h
LIMIT 20

Query plan:

Phase 1: Cohort Trending Candidate Generation
    Same as Phase 1-2 above but with larger pool:
    Take top 500 items trending in young_us_jazz (24h window)
    Cost: ~10ms

Phase 2: Text Retrieval Within Candidates
    BM25 search for "piano" in inverted index
    Intersect BM25 result set with 500 trending candidates
    Matching items: ~35 (items containing "piano" that are also trending in cohort)
    Cost: ~3ms (inverted index lookup + bitmap intersection)

Phase 3: Hybrid Ranking
    For each of the 35 matching items:
    - text_relevance (BM25 score) * 0.5
    - cohort_trending_velocity * 0.3
    - cohort_relative_score * 0.2 (how much more popular in this cohort vs global)
    Cost: < 1ms

Phase 4: Diversity and Result Assembly
    Sort by hybrid score, apply diversity, take top 20
    Cost: < 1ms

Total: ~15ms (well within 50ms budget)
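
The Phase 3 weighting can be sketched as a plain weighted sum. The weights (0.5, 0.3, 0.2) come from the plan above; the `Candidate` struct, the function names, and the assumption that all three inputs are already normalized to a comparable [0, 1] range are illustrative and not specified by this document.

```rust
/// Hypothetical candidate after text retrieval; all three inputs are
/// assumed to be normalized to [0, 1] before scoring.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub id: &'static str,
    pub text_relevance: f64,
    pub cohort_velocity: f64,
    pub cohort_relative: f64,
}

/// Weighted sum using the Phase 3 weights.
pub fn hybrid_score(c: &Candidate) -> f64 {
    0.5 * c.text_relevance + 0.3 * c.cohort_velocity + 0.2 * c.cohort_relative
}

/// Sort descending by hybrid score and keep the top `limit` candidates.
/// Diversity constraints (Phase 4) are not modeled here.
pub fn rank(mut candidates: Vec<Candidate>, limit: usize) -> Vec<Candidate> {
    candidates.sort_by(|a, b| hybrid_score(b).total_cmp(&hybrid_score(a)));
    candidates.truncate(limit);
    candidates
}
```

With these weights, an item with weaker text relevance can still outrank a stronger text match when it is trending much harder in the cohort.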

13. Accuracy Analysis

13.1 Exact vs Estimated Cohort Aggregates

The accuracy of cohort-scoped signal aggregates depends on how the cohort maps to the dimensional hierarchy:

| Scenario | Accuracy | Error Source | Mitigation |
|---|---|---|---|
| Global (Level 0) | Exact | None | N/A |
| Single Level 1 dimension | Exact | None | N/A |
| Single Level 2 segment | Exact | None | N/A |
| Named cohort with exact_tracking | Exact | None | N/A |
| Two Level 1 dimensions (AND) | ~85-95% | Independence assumption | Promote to Level 2 |
| Three Level 1 dimensions (AND) | ~75-90% | Independence assumption compounds | Promote to Level 2 |
| Level 1 + Level 2 (AND) | ~80-92% | Cross-level independence assumption | Promote to Level 2 |
| OR predicates | ~90-98% | Inclusion-exclusion estimation | Exact union where possible |

13.2 Independence Assumption Error Analysis

The composite estimation formula assumes independence between dimensions:

C(A AND B) ~= C(A) * C(B) / C(global)

When dimensions are correlated, the estimate diverges from the true count. The direction of error depends on the correlation:

Positive correlation (e.g., region:US and language:en): The estimate undercounts. More US users speak English than the independence assumption predicts, so the true intersection is larger than the estimate. The ratio of signal events attributed is still correct to within the correlation factor.

Negative correlation (e.g., region:JP and language:en): The estimate overcounts. Fewer Japanese users speak English than independence predicts, so the true intersection is smaller than the estimate.
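
A small worked example of the formula, with illustrative numbers that are not measurements from any real workload. `independence_estimate` and `signed_relative_error` are hypothetical helper names. A negative error means the estimate is below the true count; a positive error means it is above.

```rust
/// C(A AND B) ~= C(A) * C(B) / C(global), guarded against an empty window.
pub fn independence_estimate(count_a: f64, count_b: f64, count_global: f64) -> f64 {
    if count_global <= 0.0 {
        0.0
    } else {
        count_a * count_b / count_global
    }
}

/// (estimate - actual) / actual. Callers must ensure `actual` is non-zero.
pub fn signed_relative_error(estimate: f64, actual: f64) -> f64 {
    (estimate - actual) / actual
}

// Illustrative case 1 (positive correlation), 10,000 events on one item:
//   region:US = 6,000, language:en = 5,000, true US AND en = 4,500
//   estimate = 6,000 * 5,000 / 10,000 = 3,000  -> error = -33% (too low)
//
// Illustrative case 2 (negative correlation), same item:
//   region:JP = 1,000, language:en = 5,000, true JP AND en = 100
//   estimate = 1,000 * 5,000 / 10,000 = 500    -> error = +400% (too high)
```

The second case also shows why small intersections are the most fragile: a modest absolute miss becomes a large relative error.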

Empirical correlation bounds for common dimension pairs:

| Dimension Pair | Correlation Strength | Estimated Error | Direction |
|---|---|---|---|
| region + language | Moderate-strong | 15-25% | Undercount for matching pairs (US+en), overcount for mismatched |
| region + age_range | Weak | 5-10% | Slight variation by region demographics |
| age_range + engagement_level | Moderate | 10-20% | Younger users skew toward power_user |
| language + age_range | Weak | 5-10% | Minimal correlation |
| region + inferred_interests | Moderate | 10-20% | Cultural preferences vary by region |
| age_range + inferred_interests | Moderate | 10-15% | Age influences interest patterns |

13.3 When to Promote to Exact Tracking

A named cohort should be promoted to exact tracking (exact_tracking: true) when:

  1. The cohort is queried frequently. If a cohort trending query runs more than 10 times per minute, the estimation overhead and accuracy loss justify the write-time cost of exact tracking.

  2. The cohort combines correlated dimensions. A cohort like region:US AND language:en has strong correlation and will have 15-25% estimation error. Exact tracking eliminates this.

  3. The cohort is used for business-critical surfaces. The "trending for you" surface on a homepage warrants exact tracking. An internal analytics dashboard does not.

  4. The cohort is small. Small cohorts (< 10,000 users) amplify estimation error because the independence assumption has higher relative variance with smaller populations.

The cost of exact tracking: One additional counter increment per signal write from a matching user to a cohort-tracked item. For a cohort of 42,350 users on a 10M-user platform (the sizing used in Section 15.1) with 50,000 signal events/second, approximately 0.42% of events (about 212/second) come from this cohort, assuming activity is spread evenly across users. Each of those events that targets a cohort-tracked item adds one counter increment. This is negligible write amplification.

Practical limit on exact-tracked cohorts: The Signal System spec (Section 7) allows up to 100 Level 2 behavioral segments. Named cohorts with exact_tracking consume segments from this pool. With 100 total segments minus the base behavioral segments (engagement_level: 5, content_format_preference: 3, session_pattern: 3 = 11), approximately 89 slots are available for exact-tracked named cohorts. This is sufficient for all high-value cohort definitions.

13.4 Error Impact on Ranking

Estimation error affects the absolute signal counts for a cohort but has a smaller effect on relative ranking within the cohort. If the estimation error is a roughly uniform multiplier across all items (which it is when the correlation factor is stable), then the ranking order of items by cohort velocity is preserved even with 15-25% absolute count error.

The scenario where estimation error distorts ranking is when different items have different cohort composition within the estimated population. For example, if item A is popular specifically among US English speakers and item B is popular among US Spanish speakers, and the cohort is estimated as region:US AND language:en, item B's signal counts will be overestimated (because the US population includes Spanish speakers, and the independence assumption does not subtract them). In practice, this distortion is small because the dimensional rollups already separate by language (Level 1), and the estimation only applies to the cross-dimension intersection.


14. Configuration and Defaults

14.1 Cohort System Configuration

pub struct CohortConfig {
    /// Maximum number of named cohorts.
    /// Default: 500.
    pub max_named_cohorts: usize,

    /// Maximum predicate depth (nesting levels).
    /// Default: 8.
    pub max_predicate_depth: usize,

    /// Cohort bitmap invalidation strategy.
    /// Eager: recompute bitmap on user attribute change.
    /// Lazy: mark dirty, recompute on next query.
    /// Default: Eager.
    pub invalidation: CohortInvalidation,

    /// Minimum cohort population for trending queries.
    /// Queries against cohorts smaller than this return a warning
    /// and fall back to the nearest parent cohort.
    /// Default: 1000.
    pub min_trending_population: u32,

    /// Maximum ad-hoc predicate terms per query.
    /// Limits query-time computation for inline cohort predicates.
    /// Default: 10.
    pub max_adhoc_predicate_terms: usize,

    /// Floor for cohort-relative scoring.
    /// Prevents division by near-zero global velocity.
    /// Default: 10.0 events per hour.
    pub relative_score_floor: f64,

    /// Maximum candidates for WITHIN TRENDING candidate generation.
    /// Default: 500.
    pub max_trending_candidates: usize,
}

14.2 Per-Cohort Configuration

pub struct CohortDef {
    /// Unique cohort name.
    pub name: String,

    /// Predicate over user attributes.
    pub predicate: Predicate,

    /// Whether to register as a Level 2 segment for exact signal tracking.
    /// Default: false.
    /// When true, consumes one Level 2 segment slot (max 89 available).
    pub exact_tracking: bool,
}

14.3 Default Thresholds

| Parameter | Default | Rationale |
|---|---|---|
| Cohort activation threshold (item level) | 100 events/hour | From Signal System spec Section 7. Below this, cohort breakdown adds no useful information. |
| Minimum cohort population for trending | 1,000 active users | Statistical reliability. With < 1000 users, velocity signals are too noisy for meaningful trending. |
| Maximum named cohorts | 500 | Schema catalog practical limit. Each cohort adds one bitmap (~few KB compressed) to memory. |
| Maximum Level 2 segments (exact tracking) | 89 available (100 total minus 11 base behavioral) | Signal System spec Section 7. Write amplification scales with segment count. |
| Relative score floor | 10.0 events/hour | Prevents extreme ratios from low-traffic items. An item with 1 cohort view / 0.1 global views should not score 10x. |
| WITHIN TRENDING candidate pool | 500 | Balances search recall with query latency. 500 candidates searched in < 5ms. |
| Bitmap cache refresh (dynamic cohorts) | Matches underlying field refresh | Hourly for inferred_interests, 6-hourly for engagement_level. No separate refresh cycle. |
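
A sketch of how the relative score floor might be applied, assuming the floor clamps the denominator (global velocity) before the ratio is taken. That reading follows from the "prevents division by near-zero global velocity" rationale, but the exact formula is an assumption here, as is the function name `cohort_relative_score`.

```rust
/// Ratio of cohort velocity to global velocity, with the global velocity
/// clamped to at least `floor` (events per hour). Assumes `floor` > 0.
pub fn cohort_relative_score(cohort_velocity: f64, global_velocity: f64, floor: f64) -> f64 {
    cohort_velocity / global_velocity.max(floor)
}

// With the default floor of 10.0 events/hour:
//   cohort 600/h, global 200/h -> 600 / 200 = 3.0  (floor not engaged)
//   cohort 1/h,   global 0.1/h -> 1 / 10    = 0.1  (instead of 10.0 unfloored)
```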

15. Scale Considerations

15.1 Resource Budget Summary

| Resource | Value | Source |
|---|---|---|
| Named cohort definitions | Up to 500 | Configuration limit |
| Level 2 exact-tracked cohorts | Up to 89 | Signal System spec (100 segments minus 11 base) |
| Level 1 primary dimension values | ~56 (20 regions + 30 languages + 6 age groups) | Signal System spec Section 7 |
| Bitmap memory (10M users) | ~630 MB | Entity Model spec Section: Cohort-Ready Design |
| UserCohortMemberships cache (10M users) | ~220 MB (22 bytes per user) | Signal System spec Section 7 |
| Dimensional rollup storage (7-day retention) | ~316 GB | Signal System spec Section 7 |
| Write amplification (average) | ~1.13x | Signal System spec Section 7 |
| Items with active cohort tracking | ~100K | Signal System spec Section 7 (threshold-gated) |

15.2 Query Latency Budget

| Operation | Budget | Components |
|---|---|---|
| Cohort resolution (named, cached) | < 1ms | Bitmap lookup from cache |
| Cohort resolution (ad-hoc, 3 terms) | < 3ms | 3 bitmap lookups + 2 intersections |
| Cohort trending (25 results) | < 50ms | Resolution (1ms) + candidate scan (20ms) + signal reads (10ms) + ranking (5ms) + diversity (1ms) |
| Search within cohort trending (20 results) | < 50ms | Resolution (1ms) + candidate gen (15ms) + text search (10ms) + ranking (5ms) + diversity (1ms) |
| Cohort analytics (item cohort analysis) | < 200ms | Scan all Level 1 + Level 2 dimensions for one item |
| Cohort comparison (2 cohorts, 4 signals) | < 20ms | 8 signal reads per item (2 cohorts * 4 signals) |

15.3 Write Path Impact

The cohort system's primary write-path cost is counter attribution at signal write time. The cost depends on:

  1. Whether the target item has cohort tracking active. 99% of items do not (below threshold). For these items, the cohort system adds zero write-path cost.

  2. How many cohort memberships the user has. Average: 3 Level 1 dimensions + 5-10 Level 2 segments = 8-13 cohort counter increments per event, on top of the global counter (for cohort-tracked items only).

  3. Whether any named exact-tracked cohorts match. Each matching exact-tracked cohort adds 1 counter increment.

Blended write amplification at 50,000 events/second:

  • 99% of events: 1x (global counter only) = 49,500 increments
  • 1% of events targeting cohort-tracked items: ~14x average = 7,000 increments
  • Total: 56,500 increments for 50,000 events = 1.13x write amplification

This matches the Signal System spec's analysis and is well within the performance budget.
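
The blended figure can be reproduced with a two-term weighted average. The function names below are hypothetical; the inputs (1% of events hitting cohort-tracked items, ~14 increments per such event, 50,000 events/second) are the ones used in the breakdown above.

```rust
/// Average counter increments per event when a fraction of events
/// targets cohort-tracked items and the rest only touch the global counter.
pub fn blended_write_amplification(
    tracked_fraction: f64,
    tracked_increments_per_event: f64,
) -> f64 {
    (1.0 - tracked_fraction) * 1.0 + tracked_fraction * tracked_increments_per_event
}

/// Total counter increments per second at a given event rate.
pub fn increments_per_second(events_per_second: f64, amplification: f64) -> f64 {
    events_per_second * amplification
}

// 0.99 * 1 + 0.01 * 14 = 1.13 increments per event
// 50,000 events/s * 1.13 = 56,500 increments/s
```

The result is most sensitive to the tracked fraction: doubling it to 2% would raise amplification to 1.26x with the same per-event cost.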


16. Invariants and Correctness Guarantees

Membership Invariants

INV-COH-1: Bitmap consistency. A named cohort's cached bitmap is consistent with the underlying attribute indexes at the time of its last refresh. Formally: for any user U, if bitmap.contains(U) then predicate.evaluate(attributes(U)) == true as of the last refresh timestamp. The converse (predicate match implies bitmap membership) holds only for static cohorts and may lag by the refresh interval for dynamic cohorts.

INV-COH-2: No stale membership in signal attribution. A user's UserCohortMemberships is refreshed before any signal event from that user is attributed to cohort counters. A user who was in cohort C but is no longer (due to attribute change) does not contribute to C's counters after the membership update propagates.

INV-COH-3: Cardinality consistency. The reported cardinality of a cohort bitmap matches the number of set bits. db.cohort_cardinality(name) equals bitmap.cardinality().

Signal Attribution Invariants

INV-COH-4: Attribution completeness. Every signal event from a user in cohort C targeting a cohort-tracked item increments C's counter exactly once. No double-counting, no missed attribution.

INV-COH-5: Level consistency. Exact-tracked cohort counters (Level 2) are consistent with what would be computed by filtering the global event stream by cohort membership. Formally: counter(item, signal, cohort, window) == count({event in events(item, signal, window) : event.user in cohort}).

INV-COH-6: Estimation bound. Composite cohort estimates (Level 3) satisfy: |estimate - true_count| / true_count < max_relative_error where max_relative_error is bounded by the mutual information between the constituent dimensions. The system does not guarantee a specific error bound but reports CohortAccuracy::Estimated { confidence } in query responses.

Query Invariants

INV-COH-7: Threshold enforcement. If a cohort's active population is below min_trending_population, the query engine never returns results ranked solely by that cohort's signal aggregates. It must fall back to a parent cohort or return the CohortWarning::InsufficientPopulation warning.

INV-COH-8: WITHIN TRENDING candidate containment. When a SEARCH ... WITHIN TRENDING FOR COHORT C query executes, every result item was a member of the cohort trending candidate set. No item outside the trending set appears in results, regardless of text relevance.

Property Tests

// P1: Bitmap matches predicate evaluation for all users.
proptest! {
    #[test]
    fn bitmap_matches_predicate(
        users in arb_user_set(100),
        predicate in arb_predicate(),
    ) {
        let bitmap = resolve_cohort(&users, &predicate);
        for user in &users {
            let in_bitmap = bitmap.contains(user.id);
            let matches = predicate.evaluate(&user.attributes);
            prop_assert_eq!(in_bitmap, matches,
                "user {} bitmap={} predicate={}", user.id, in_bitmap, matches);
        }
    }
}

// P2: Exact-tracked counter matches filtered event count.
proptest! {
    #[test]
    fn exact_counter_matches_events(
        events in arb_signal_events(1000),
        cohort in arb_cohort(),
    ) {
        let counter = cohort_counter(&events, &cohort);
        let filtered = events.iter()
            .filter(|e| cohort.contains(e.user_id))
            .count();
        prop_assert_eq!(counter, filtered as u64);
    }
}

// P3: Composite estimate is within expected error bounds.
proptest! {
    #[test]
    fn composite_estimate_bounded(
        events in arb_signal_events(10000),
        dim_a in arb_level1_dimension(),
        dim_b in arb_level1_dimension(),
    ) {
        let count_a = dimensional_count(&events, &dim_a);
        let count_b = dimensional_count(&events, &dim_b);
        let count_global = events.len() as f64;

        let estimate = count_a * count_b / count_global;
        let actual = events.iter()
            .filter(|e| dim_a.matches(e) && dim_b.matches(e))
            .count() as f64;

        // Allow up to 30% relative error for this test
        // (real error depends on correlation)
        if actual > 100.0 {
            let relative_error = (estimate - actual).abs() / actual;
            prop_assert!(relative_error < 0.30,
                "estimate={}, actual={}, error={:.1}%",
                estimate, actual, relative_error * 100.0);
        }
    }
}

// P4: WITHIN TRENDING results are subset of trending candidates.
proptest! {
    #[test]
    fn search_within_trending_containment(
        items in arb_items(500),
        cohort in arb_cohort(),
        query in arb_search_query(),
    ) {
        let trending_candidates = cohort_trending_candidates(&items, &cohort);
        let search_results = search_within_trending(&query, &cohort, &items);

        for result in &search_results {
            prop_assert!(trending_candidates.contains(&result.id),
                "result {} not in trending candidates", result.id);
        }
    }
}

Appendix A: Glossary

| Term | Definition |
|---|---|
| Cohort | A named predicate over user attributes that defines a population segment |
| Predicate | A boolean expression over user attribute fields (equality, range, set membership, compound) |
| Static cohort | A cohort whose predicate references only slow-changing app-set attributes (region, age_range) |
| Dynamic cohort | A cohort whose predicate references database-computed attributes (engagement_level, inferred_interests) |
| Hybrid cohort | A cohort combining static and dynamic predicate terms |
| Ad-hoc cohort | An inline predicate in a query, not named or saved |
| Cohort resolution | The process of evaluating a predicate against user attribute bitmaps to produce a user set |
| Exact tracking | Registering a cohort as a Level 2 behavioral segment with dedicated signal counters |
| Dimensional rollup | Pre-aggregated signal counters per dimension value per item (Level 1 and Level 2 in the signal hierarchy) |
| Independence assumption | The assumption that P(A AND B) = P(A) * P(B), used to estimate composite cohort counts |
| Cohort-relative score | Ratio of cohort velocity to global velocity for an item, measuring cohort-specific resonance |
| WITHIN TRENDING | A query clause that restricts search candidates to items trending in a specified cohort |
| Cohort activation threshold | The global signal rate above which an item begins tracking per-cohort counters (default: 100 events/hour) |
| Minimum population threshold | The minimum number of active cohort users required for cohort trending queries (default: 1,000) |

Appendix B: References

  1. Signal System Specification, Section 7: Cohort-Scoped Signal Aggregation. docs/specs/03-signal-system.md.
  2. Entity Model Specification, Section: Cohort-Ready Design. docs/specs/02-entity-model.md.
  3. Chambi, S., Lemire, D., Kaser, O., Godin, R. "Better bitmap performance with Roaring bitmaps." Software: Practice and Experience, 2016.
  4. Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C. "Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches." Foundations and Trends in Databases, 2012.