Roadmap Impact Analysis: Cohort-Based Architecture and Scale-Ready Design
Date: 2026-02-20 Author: @tidal-visionary
Context
The product owner identified five requirements the current roadmap (M1-M6) does not address:
- Cohorts as a first-class primitive -- named predicates over user attributes that partition the user base into addressable segments
- Three-layer trending model -- global trending, cohort-scoped trending, and search within cohort-scoped trending
- Rich user attribute model -- demographics, interest taxonomy, behavioral segments, engagement patterns (the current User entity has only `language` and `region`)
- Query composition -- RETRIEVE and SEARCH must compose in a single query
- Scale-ready architecture from day one -- storage engine, signal system, and key encoding must be designed for partitioning
1. What Changes in Milestone ORDER
1.1 The Rich User Model Must Move Before Personalization (M3)
The User entity in API.md has two metadata fields: language and region. Cohorts are predicates over user attributes. If the user model has only two fields, the only cohorts you can define are locale-based partitions. The product owner explicitly requires demographics, interest taxonomy, behavioral segments, and engagement patterns.
Recommendation: Introduce the rich user attribute model as Phase 3.0 -- the first phase of M3 (Personalized Ranking), before preference vectors and feedback loops. Moving it earlier than M3 is not justified because M1 and M2 prove the signal and ranking thesis without any user context.
What breaks if we do not do this: Cohorts become meaningless -- they can only segment by two dimensions. The three-layer trending model collapses to one layer (global). The entire cohort architecture becomes an expensive way to do locale filtering.
1.2 Cohorts Must Come After the Rich User Model but Before Full Surface Coverage
Analysis: Cohorts and personalization are complementary, not sequential. Personalization answers "what does this user want?" Cohorts answer "what do users like this one want?" The three-layer trending model requires both:
- Layer 1 (global trending) works at M2 -- no user context needed
- Layer 2 (cohort-scoped trending) requires rich user attributes + scoped signal aggregation
- Layer 3 (search within cohort-scoped trending) requires query composition -- SEARCH intersected with a RETRIEVE result set
Recommended new milestone order:
- M1: Signal Engine (unchanged)
- M2: Ranked Retrieval (unchanged)
- M3: Personalized Ranking (expanded with rich user model)
- M4 (new): Cohort-Scoped Ranking -- "Trending for users like you"
- M5: Hybrid Search (was M4, expanded with query composition)
- M6: Full Surface Coverage (was M5)
- M7: Production Hardening (was M6)
1.3 Scale Architecture Must Be a Concern From M1
The product owner says the position that "distribution is a later problem" is no longer acceptable. This does NOT mean building a distributed system. It means making design decisions in M1 that do not foreclose distribution later. CockroachDB learned this: the KV layer was designed for distribution from the start, even though it shipped single-node first.
For tidalDB, "scale-ready" means four things:
- Key encoding must support range-based partitioning. The current `[entity_id: u64 BE][0x00][TAG:suffix]` pattern is already correct. Entity_id prefix means all data for one entity is co-located, and you can split ranges at entity_id boundaries.
- Signal aggregation must support scoped rollups. Cohort-scoped trending requires aggregating signals across all entities matching a cohort predicate -- a fundamentally different data structure than per-entity running scores. The signal write path needs a `SignalObserver` trait.
- The WAL must support logical partitioning. WAL entries must include entity type and partition key alongside entity ID. Adding this later means a WAL format migration.
- Entity IDs must be partition-aware. u64 with big-endian encoding supports range-based partitioning naturally. Already correct.
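The key layout in the first point can be sketched as follows. This is a minimal illustration, not the project's actual encoder: the function names, the literal `:` between tag and suffix, and the `split_points` routing helper are assumptions made for the example. It shows why a big-endian entity ID prefix keeps byte order equal to numeric order, which is what makes splitting at entity_id boundaries possible.

```rust
/// Separator byte between the entity ID prefix and the tag, as in the
/// documented layout [entity_id: u64 BE][0x00][TAG:suffix].
const SEPARATOR: u8 = 0x00;

/// Builds a key in the documented layout. Treating ':' as a literal byte
/// between tag and suffix is an assumption of this sketch.
pub fn encode_key(entity_id: u64, tag: &[u8], suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 1 + tag.len() + 1 + suffix.len());
    // Big-endian bytes sort in the same order as the numeric IDs.
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(SEPARATOR);
    key.extend_from_slice(tag);
    key.push(b':');
    key.extend_from_slice(suffix);
    key
}

/// Recovers the entity ID from the first 8 bytes, or None if the key is too short.
pub fn entity_id_of(key: &[u8]) -> Option<u64> {
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(key.get(..8)?);
    Some(u64::from_be_bytes(prefix))
}

/// Hypothetical range router: `split_points` is sorted ascending and
/// partition i holds every entity ID below split_points[i] that is not
/// already covered by an earlier partition.
pub fn partition_for(entity_id: u64, split_points: &[u64]) -> usize {
    split_points.partition_point(|&split| split <= entity_id)
}
```

Because every key for one entity shares the same 8-byte prefix, a range split at an entity ID never separates that entity's signals from its metadata.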
Recommendation: Scale readiness is not a milestone -- it is an architectural constraint applied to every milestone starting with M1. The additions are small (S-complexity) but architecturally critical: partition key in WAL format, SignalObserver trait, aggregation_scope on SignalDef.
What breaks if we keep the old deferral: WAL format migration, key encoding redesign, and signal aggregation restructuring all come due when distribution ships. These are the three most expensive retrofits in a database. This analysis estimates the cost of retrofitting at 10-50x the cost of designing correctly.
2. What Changes in Milestone CONTENT
M1: Signal Engine
ADDED:
- Partition key in WAL entry format (initially `0x00` for single-node) -- prevents WAL format migration later
- `SignalObserver` trait in signal ledger (no-op implementation) -- extensibility hook for cohort aggregation
- `aggregation_scope` field on `SignalDef` (initially ignored) -- prevents schema migration later
These are S-complexity additions invisible to the M1 user but critical for M4.
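A rough sketch of what these three additions could look like is below. Only the names `SignalObserver`, `SignalDef`, and `aggregation_scope` come from this document; every field, the one-byte partition key width, the header byte order, and the `SignalLedger`, `NoopObserver`, and `CountingObserver` types are hypothetical and chosen only to make the example compile.

```rust
/// Partition key written by a single-node deployment.
pub const SINGLE_NODE_PARTITION: u8 = 0x00;

/// Hypothetical WAL entry: partition key and entity type travel with the entity ID.
#[derive(Debug, Clone, PartialEq)]
pub struct WalEntry {
    pub partition_key: u8,
    pub entity_type: u8,
    pub entity_id: u64,
    pub payload: Vec<u8>,
}

impl WalEntry {
    /// Serializes as [partition_key][entity_type][entity_id BE][payload].
    pub fn encode(&self) -> Vec<u8> {
        let mut buf = Vec::with_capacity(10 + self.payload.len());
        buf.push(self.partition_key);
        buf.push(self.entity_type);
        buf.extend_from_slice(&self.entity_id.to_be_bytes());
        buf.extend_from_slice(&self.payload);
        buf
    }
}

/// Assumed shape of the scope field; M1 would store it and ignore it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggregationScope {
    Entity,
    Cohort,
}

pub struct SignalDef {
    pub name: String,
    pub aggregation_scope: AggregationScope,
}

pub struct SignalEvent {
    pub entity_id: u64,
    pub user_id: u64,
    pub weight: f64,
}

/// Hook invoked on every signal write.
pub trait SignalObserver {
    fn on_signal(&mut self, event: &SignalEvent);
}

/// The M1 implementation: does nothing.
pub struct NoopObserver;

impl SignalObserver for NoopObserver {
    fn on_signal(&mut self, _event: &SignalEvent) {}
}

/// Stand-in for a later cohort aggregator; it only counts events.
pub struct CountingObserver {
    pub seen: u64,
}

impl SignalObserver for CountingObserver {
    fn on_signal(&mut self, _event: &SignalEvent) {
        self.seen += 1;
    }
}

/// Minimal ledger that forwards each write to its observer.
pub struct SignalLedger<O: SignalObserver> {
    pub observer: O,
    pub writes: u64,
}

impl<O: SignalObserver> SignalLedger<O> {
    pub fn new(observer: O) -> Self {
        SignalLedger { observer, writes: 0 }
    }

    pub fn write(&mut self, event: &SignalEvent) {
        self.writes += 1;
        self.observer.on_signal(event);
    }
}
```

Making the ledger generic over the observer means the no-op case compiles down to nothing, so the hook costs M1 users no runtime overhead.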
M2: Ranked Retrieval
ADDED:
- `Scoped` variant in the `Candidate` enum for `ProfileDef` -- allows candidate retrieval to be scoped to a pre-computed candidate set. Unused in M2 but makes the executor compositional from the start.
- `CandidateSet` intermediate type -- the scored, pre-diversity bitmap of entity IDs that currently exists as an anonymous intermediate. Making it a reusable type enables query composition in M5.
M-complexity additions that make the executor compositional.
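As a sketch of how these two types might fit together: `Candidate`, `Scoped`, and `CandidateSet` are named in this document, while the `All` variant, the `resolve` function, and the use of a `BTreeSet` in place of a real bitmap are assumptions for illustration. Scores are omitted to keep the example small.

```rust
use std::collections::BTreeSet;

/// Stand-in for the bitmap of entity IDs. A production version would
/// presumably use a compressed bitmap rather than a BTreeSet.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct CandidateSet {
    ids: BTreeSet<u64>,
}

impl CandidateSet {
    pub fn from_ids(ids: impl IntoIterator<Item = u64>) -> Self {
        CandidateSet { ids: ids.into_iter().collect() }
    }

    pub fn contains(&self, id: u64) -> bool {
        self.ids.contains(&id)
    }

    pub fn len(&self) -> usize {
        self.ids.len()
    }

    /// Returns the IDs present in both sets.
    pub fn intersect(&self, other: &CandidateSet) -> CandidateSet {
        CandidateSet {
            ids: self.ids.intersection(&other.ids).copied().collect(),
        }
    }
}

/// Candidate source for a profile. `All` is a hypothetical existing
/// variant; `Scoped` is the addition described above.
pub enum Candidate {
    All,
    Scoped(CandidateSet),
}

/// Resolves a candidate source against the entities that passed filtering.
pub fn resolve(candidate: &Candidate, filtered: &CandidateSet) -> CandidateSet {
    match candidate {
        Candidate::All => filtered.clone(),
        Candidate::Scoped(scope) => filtered.intersect(scope),
    }
}
```

In M2 every profile would resolve through the unscoped path; the `Scoped` arm exists so later milestones can feed one query's output into another without changing the executor's shape.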
M3: Personalized Ranking
ADDED (major):
- Rich user attribute model: Expand from 2 to 15+ fields. Demographics (age_range, locale), interest taxonomy (hierarchical keywords), behavioral segments (database-computed), engagement patterns (database-computed).
- Computed user fields materializer: Background process that derives behavioral segments from signal history -- `preferred_format`, `engagement_frequency`, `active_hours`, `power_user_score`. Analogous to signal rollup materializer but for user attributes.
- User attribute indexes: Same bitmap/B-tree pattern as item metadata indexes, applied to user entities.
RESTRUCTURED: Phase 3.1 splits into Phase 3.1a (Rich User and Creator Entity Model) and Phase 3.1b (Relationship Graph). The split matters because the rich user model is needed for cohorts (M4) while the relationship graph is needed for personalization -- different downstream consumers, can be built in parallel.
M5 (was M4: Hybrid Search)
ADDED:
- Query composition executor -- the `WITHIN` clause that restricts a SEARCH to a pre-computed candidate set
- Layer 3 integration: `SEARCH items QUERY "jazz piano" WITHIN TRENDING FOR COHORT @us_young_jazz LIMIT 20`
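One simple way to picture the `WITHIN` restriction is as a post-filter over ranked search hits. The sketch below assumes the hits arrive already sorted by descending score and that the inner query's result is available as a set of entity IDs; `SearchHit` and `search_within` are hypothetical names, and a real executor might instead push the candidate set down into the search engine.

```rust
use std::collections::HashSet;

/// A ranked search result (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct SearchHit {
    pub entity_id: u64,
    pub score: f32,
}

/// Keeps only hits whose entity is in the WITHIN set, preserving the
/// incoming rank order and stopping after `limit` results.
pub fn search_within(hits: &[SearchHit], within: &HashSet<u64>, limit: usize) -> Vec<SearchHit> {
    hits.iter()
        .filter(|hit| within.contains(&hit.entity_id))
        .take(limit)
        .cloned()
        .collect()
}
```

Post-filtering is easy to reason about but can waste work when the candidate set is small relative to the search result, which is one reason a reusable candidate set type matters to the executor design.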
M6 (was M5: Full Surface Coverage)
CHANGED: Signal rollups moved from "optional if benchmarks demand it" to required. Cohort-scoped 30d+ windowed aggregates across millions of entities cannot be computed from raw events in real time.
3. The New Milestone: M4 -- Cohort-Scoped Ranking
Milestone Thesis: "The database understands user segments as a query primitive. Trending for a cohort of US jazz fans produces different results than global trending."
Why this is a milestone and not a phase: It requires a new entity type (Cohort), a new signal aggregation path, a new candidate source, a new query clause, and background materialization. Too much for a phase, and independently user-testable.
Provisional Phases:
Phase 4.1: Cohort Definition and Membership (M complexity)
Cohort as a schema primitive. Named predicate over user attributes. Membership materialized as RoaringBitmap<UserId> with O(1) membership test. Incremental updates when user attributes change.
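A minimal sketch of a cohort as a named predicate with materialized membership might look like the following. The attribute fields, the `Predicate` variants, and the method names are all assumptions, and a `BTreeSet` stands in for the bitmap so the example needs no external crates.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of the rich user attribute model.
#[derive(Debug, Clone)]
pub struct UserAttributes {
    pub user_id: u64,
    pub region: String,
    pub age_range: String,
    pub interests: Vec<String>,
}

/// A small predicate language over user attributes (illustrative only).
pub enum Predicate {
    RegionIs(String),
    AgeRangeIs(String),
    HasInterest(String),
    And(Vec<Predicate>),
}

impl Predicate {
    pub fn matches(&self, user: &UserAttributes) -> bool {
        match self {
            Predicate::RegionIs(region) => user.region == *region,
            Predicate::AgeRangeIs(range) => user.age_range == *range,
            Predicate::HasInterest(want) => user.interests.iter().any(|i| i == want),
            Predicate::And(parts) => parts.iter().all(|p| p.matches(user)),
        }
    }
}

/// A named predicate plus its materialized member set.
pub struct Cohort {
    pub name: String,
    pub predicate: Predicate,
    members: BTreeSet<u64>,
}

impl Cohort {
    pub fn new(name: &str, predicate: Predicate) -> Self {
        Cohort { name: name.to_string(), predicate, members: BTreeSet::new() }
    }

    /// Incremental update: re-evaluates one user after an attribute change.
    /// Returns true if the user's membership changed.
    pub fn on_user_changed(&mut self, user: &UserAttributes) -> bool {
        if self.predicate.matches(user) {
            self.members.insert(user.user_id)
        } else {
            self.members.remove(&user.user_id)
        }
    }

    pub fn contains(&self, user_id: u64) -> bool {
        self.members.contains(&user_id)
    }
}
```

Re-evaluating only the changed user keeps membership maintenance proportional to attribute churn rather than to the size of the user base.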
Phase 4.2: Cohort-Scoped Signal Aggregation (XL complexity -- highest risk)
Signal write fan-out: when a signal arrives for an entity from a user in cohort C, update per-cohort running aggregates. Same decay/windowed pattern as entity signals but keyed by (cohort, entity). Sparse representation required to manage memory.
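The fan-out can be sketched with a sparse map keyed by (cohort, entity). This example assumes a single exponentially decayed running score per key with a configurable half-life; the document does not specify the decay formula, so that choice, along with the `CohortAggregates` type and its methods, is illustrative. The caller passes the cohorts the user belongs to at write time, so later membership changes never rewrite existing state.

```rust
use std::collections::HashMap;

/// Running decayed score and the timestamp (seconds) it was last updated.
#[derive(Debug, Clone, Copy)]
struct DecayedScore {
    value: f64,
    last_ts: u64,
}

/// Sparse per-cohort aggregates: an entry exists only after some cohort
/// member has signaled that entity.
pub struct CohortAggregates {
    half_life_secs: f64,
    state: HashMap<(u32, u64), DecayedScore>,
}

impl CohortAggregates {
    pub fn new(half_life_secs: f64) -> Self {
        CohortAggregates { half_life_secs, state: HashMap::new() }
    }

    /// Fans one signal out to every cohort in `user_cohorts` and returns
    /// the number of cohort states touched.
    pub fn record(&mut self, user_cohorts: &[u32], entity_id: u64, weight: f64, ts: u64) -> usize {
        for &cohort in user_cohorts {
            let entry = self
                .state
                .entry((cohort, entity_id))
                .or_insert(DecayedScore { value: 0.0, last_ts: ts });
            // Decay the old value to `ts`, then add the new weight.
            // Out-of-order timestamps are treated as zero elapsed time.
            let elapsed = ts.saturating_sub(entry.last_ts) as f64;
            entry.value = entry.value * 0.5_f64.powf(elapsed / self.half_life_secs) + weight;
            entry.last_ts = entry.last_ts.max(ts);
        }
        user_cohorts.len()
    }

    /// Reads the score for (cohort, entity) decayed to `now`; 0.0 if absent.
    pub fn score(&self, cohort: u32, entity_id: u64, now: u64) -> f64 {
        match self.state.get(&(cohort, entity_id)) {
            Some(s) => {
                let elapsed = now.saturating_sub(s.last_ts) as f64;
                s.value * 0.5_f64.powf(elapsed / self.half_life_secs)
            }
            None => 0.0,
        }
    }

    /// Number of (cohort, entity) pairs currently held in memory.
    pub fn tracked_pairs(&self) -> usize {
        self.state.len()
    }
}
```

Memory grows with the number of distinct (cohort, entity) pairs that actually receive signals, not with cohorts times total entities.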
Phase 4.3: Cohort-Scoped Query Execution (L complexity)
FOR COHORT @cohort_id clause in RETRIEVE queries. Signal references resolve to cohort-scoped aggregates. Composes with FOR USER for personalization on top.
Phase 4.4: Cohort Lifecycle and Diagnostics (S complexity)
List, inspect, delete cohorts. View cohort-scoped signal state for debugging.
Deferred from M4: Cohort-scoped search (Layer 3) deferred to M5 (needs Tantivy). Dynamic cohorts deferred to M6. Cohort-based A/B testing deferred to M7.
4. What Is Now Deferred That Should Not Be
Horizontal Distribution Design
The deferral of implementation is still correct. The deferral of design is now wrong. Storage engine, WAL format, key encoding, and signal aggregation must be designed so distribution can be added without restructuring. Distribution design constraints are applied from M1. Distribution implementation remains post-M7.
Signal Rollups
Now required in M6. Cohort-scoped 30d+ windows over millions of entities demand materialized rollups. The bucketed counter approach works for per-entity signals because each entity has bounded events. Cohort aggregates span millions of entities.
User Attribute Model
The 2-field model is a critical gap. Cannot answer "what is trending among young US jazz fans." Rich user model is now a required deliverable in M3.
5. Revised Milestone Theses
| Milestone | Original Thesis | Revised Thesis |
|---|---|---|
| M1 | Signals are a database primitive | Same, plus: signal system designed for future scoped aggregation |
| M2 | A single query retrieves, scores, and ranks | Same, plus: compositional executor supports scoped candidate sets |
| M3 | User context shapes ranking -- For You works | Same, plus: user model rich enough to define meaningful audience segments |
| M4 (new) | (did not exist) | Database understands user segments as query primitives |
| M5 (was M4) | Text + semantic + signals in one query | Same, plus: search within a scoped result set (query composition) |
| M6 (was M5) | Every use case works | Same, plus: cohort-scoped variants of trending/rising/browse |
| M7 (was M6) | Ready for real workloads | Same, plus: documented path to horizontal distribution |
6. Critical Path Analysis
Parallelization Opportunities
- M5 Phases (Tantivy, RRF, SEARCH parser) can start in parallel with M4. They depend on M2/M3, not M4. Only the query composition phase depends on M4.
- M3 Phase 3.0 (rich user model) can start as soon as M2 Phase 2.2 (metadata indexing) ships -- same bitmap/B-tree patterns applied to user entities.
- M4 Phase 4.1 (cohort definition) can start as soon as M3 Phase 3.0 ships -- without waiting for M3's feedback loop to complete.
Phases That Block the Most Downstream Work
| Phase | What It Blocks | Impact |
|---|---|---|
| Phase 1.4 (Signal Ledger) | Phase 1.5, 2.3, 4.2 | Everything after M1 |
| Phase 2.2 (Filters) | Phase 2.4, 2.5, 3.0, 3.1 | Everything after M2 |
| Phase 3.0 (Rich User Model) | Phase 4.1, 4.2, 4.3 | All of M4 and M5 composition |
| Phase 4.2 (Cohort Signals) | Phase 4.3, 5.X | M4 completion and query composition |
| Phase 2.5 (RETRIEVE Executor) | Phase 4.3, 5.X | Cohort queries and composition |
The Longest Pole
Phase 4.2 (Cohort-Scoped Signal Aggregation) at XL complexity is the highest-risk phase and blocks the most downstream work. Key risks:
- Memory budget: Naive per-cohort signal state for 50 cohorts * 10M entities is 40 GB, which implies roughly 80 bytes per (cohort, entity) entry. A sparse representation (only entities with signals from cohort members) is required; at 50 cohorts * 100K active entities each, this drops to ~400 MB.
- Write amplification: Each signal write fans out to 1 entity state update + N cohort state updates. At an average of 5 cohorts per user, that is a 6x write cost. Must be amortized via batching.
- Correctness: When a user's attributes change and they move between cohorts, historical signals must NOT retroactively move. Cohort aggregates reflect "signals from users who were in this cohort when the signal was written."
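The memory and write-amplification figures above can be reproduced with a few lines of arithmetic. The 80-byte entry size is not stated in this document; it is inferred here from 40 GB divided by 50 * 10M entries, using decimal gigabytes, and the function names are hypothetical.

```rust
/// Inferred, not stated in the source: 40 GB / (50 cohorts * 10M entities).
pub const BYTES_PER_ENTRY: u64 = 80;

/// Total bytes of per-cohort signal state.
pub fn cohort_state_bytes(cohorts: u64, entities_per_cohort: u64, bytes_per_entry: u64) -> u64 {
    cohorts * entities_per_cohort * bytes_per_entry
}

/// State updates per signal write: one entity update plus one per cohort.
pub fn write_cost_multiplier(avg_cohorts_per_user: u64) -> u64 {
    1 + avg_cohorts_per_user
}
```

With these definitions, `cohort_state_bytes(50, 10_000_000, BYTES_PER_ENTRY)` gives 40,000,000,000 bytes for the naive layout, `cohort_state_bytes(50, 100_000, BYTES_PER_ENTRY)` gives 400,000,000 bytes for the sparse one, and `write_cost_multiplier(5)` gives 6.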
Mitigation: Run a 2-3 day spike before committing to Phase 4.2 implementation to benchmark sparse cohort state memory, write amplification with fan-out, and cohort-scoped trending query latency.
7. What Does NOT Change
- M1 and M2 UAT scenarios -- signal correctness and ranked retrieval do not require cohorts
- Signal ledger architecture -- per-entity running decay scores unchanged; cohort aggregation is additional, not replacement
- USearch, Tantivy, fjall choices -- unaffected by cohort requirements
- Key encoding -- already supports range-based partitioning; cohort keys follow same pattern
- Query language structure -- `FOR COHORT` and `WITHIN` are additive clauses
- Embeddable Rust library deployment model -- cohorts are in-process primitives
8. Open Questions Requiring Resolution
- How many cohorts? 10 and 10,000 have radically different memory/write-amplification profiles.
- Static or dynamic predicates? Dynamic cohorts ("users who viewed jazz in last 7d") are dramatically more expensive.
- Point-in-time membership? "What was trending in this cohort yesterday?" requires historical snapshots.
- User attribute refresh cadence? Behavioral segments recomputed hourly? Daily?
- Automatic cohort assignment in M4 or M6? Auto-assignment requires a scoring function; manual is simpler.
This analysis should be reviewed by @tidal-engineer for technical feasibility assessment before the roadmap is revised.