

Roadmap Impact Analysis: Cohort-Based Architecture and Scale-Ready Design

Date: 2026-02-20
Author: @tidal-visionary

Context

The product owner identified five requirements the current roadmap (M1-M6) does not address:

Cohorts as a first-class primitive -- named predicates over user attributes that partition the user base into addressable segments
Three-layer trending model -- global trending, cohort-scoped trending, and search within cohort-scoped trending
Rich user attribute model -- demographics, interest taxonomy, behavioral segments, engagement patterns (the current User entity has only language and region)
Query composition -- RETRIEVE and SEARCH must compose in a single query
Scale-ready architecture from day one -- storage engine, signal system, and key encoding must be designed for partitioning

1. What Changes in Milestone ORDER

1.1 The Rich User Model Must Move Before Personalization (M3)

The User entity in API.md has two metadata fields: language and region. Cohorts are predicates over user attributes. If the user model has only two fields, the only cohorts you can define are locale-based partitions. The product owner explicitly requires demographics, interest taxonomy, behavioral segments, and engagement patterns.

Recommendation: Introduce the rich user attribute model as Phase 3.0 -- the first phase of M3 (Personalized Ranking), before preference vectors and feedback loops. Moving it earlier than M3 is not justified because M1 and M2 prove the signal and ranking thesis without any user context.

What breaks if we do not do this: Cohorts become meaningless -- they can only segment by two dimensions. The three-layer trending model collapses to one layer (global). The entire cohort architecture becomes an expensive way to do locale filtering.

1.2 Cohorts Must Come After the Rich User Model but Before Full Surface Coverage

Analysis: Cohorts and personalization are complementary, not sequential. Personalization answers "what does this user want?" Cohorts answer "what do users like this one want?" The three-layer trending model requires both:

Layer 1 (global trending) works at M2 -- no user context needed
Layer 2 (cohort-scoped trending) requires rich user attributes + scoped signal aggregation
Layer 3 (search within cohort-scoped trending) requires query composition -- SEARCH intersected with a RETRIEVE result set

Recommended new milestone order:

M1: Signal Engine (unchanged)
M2: Ranked Retrieval (unchanged)
M3: Personalized Ranking (expanded with rich user model)
M4 (new): Cohort-Scoped Ranking -- "Trending for users like you"
M5: Hybrid Search (was M4, expanded with query composition)
M6: Full Surface Coverage (was M5)
M7: Production Hardening (was M6)

1.3 Scale Architecture Must Be a Concern From M1

The product owner's position is that "distribution is a later problem" is no longer acceptable. This does NOT mean building a distributed system. It means making design decisions in M1 that do not foreclose distribution later. CockroachDB learned this: the KV layer was designed for distribution from the start, even though it shipped single-node first.

For tidalDB, "scale-ready" means four things:

Key encoding must support range-based partitioning. The current [entity_id: u64 BE][0x00][TAG:suffix] pattern is already correct. The entity_id prefix means all data for one entity is co-located, so ranges can be split at entity_id boundaries.
Signal aggregation must support scoped rollups. Cohort-scoped trending requires aggregating signals across all entities matching a cohort predicate -- a fundamentally different data structure than per-entity running scores. The signal write path needs a SignalObserver trait.
The WAL must support logical partitioning. WAL entries must include entity type and partition key alongside entity ID. Adding this later means a WAL format migration.
Entity IDs must be partition-aware. u64 with big-endian encoding supports range-based partitioning naturally. Already correct.
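
As an illustration of why the big-endian entity_id prefix supports range splits, here is a minimal sketch. The one-byte tag, the `TAG_SIGNAL` value, and the helper names `encode_key` and `entity_range` are assumptions made for this example, not tidalDB's actual API.

```rust
/// Hypothetical tag byte for signal state; the real tag values are not specified here.
const TAG_SIGNAL: u8 = b'S';

/// Encode a key as [entity_id: u64 BE][0x00][tag][suffix...].
/// Big-endian bytes sort in the same order as the numeric IDs.
fn encode_key(entity_id: u64, tag: u8, suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 1 + 1 + suffix.len());
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key.push(tag);
    key.extend_from_slice(suffix);
    key
}

/// Half-open byte range [start, end) covering every key of one entity.
/// `end` is None for u64::MAX, meaning "to the end of the keyspace".
fn entity_range(entity_id: u64) -> (Vec<u8>, Option<Vec<u8>>) {
    let start = entity_id.to_be_bytes().to_vec();
    let end = entity_id
        .checked_add(1)
        .map(|next| next.to_be_bytes().to_vec());
    (start, end)
}
```

With this layout, every key for entity 1 sorts before every key for entity 256, which would not hold for a little-endian prefix. A partition boundary can therefore be any entity_id.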

Recommendation: Scale readiness is not a milestone -- it is an architectural constraint applied to every milestone starting with M1. The additions are small (S-complexity) but architecturally critical: partition key in WAL format, SignalObserver trait, aggregation_scope on SignalDef.

What breaks if we keep the old deferral: WAL format migration, key encoding redesign, and signal aggregation restructuring when distribution ships. These are the three most expensive retrofits in a database. The cost of retrofitting is 10-50x the cost of designing correctly.

2. What Changes in Milestone CONTENT

M1: Signal Engine

ADDED:

Partition key in WAL entry format (initially 0x00 for single-node) -- prevents WAL format migration later
SignalObserver trait in signal ledger (no-op implementation) -- extensibility hook for cohort aggregation
aggregation_scope field on SignalDef (initially ignored) -- prevents schema migration later

These are S-complexity additions invisible to the M1 user but critical for M4.
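
The three M1 additions could look roughly like the sketch below. All field names, field widths, the `SignalLedger` wrapper, and the `on_signal` signature are assumptions for illustration; only the names `SignalObserver`, `SignalDef`, and `aggregation_scope` come from this analysis.

```rust
/// Hypothetical WAL record layout; field names and widths are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct WalEntry {
    partition_key: u8, // 0x00 on a single node
    entity_kind: u8,
    entity_id: u64,
    payload: Vec<u8>,
}

/// partition_key + entity_kind + entity_id + payload length.
const HEADER_LEN: usize = 1 + 1 + 8 + 4;

impl WalEntry {
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(HEADER_LEN + self.payload.len());
        out.push(self.partition_key);
        out.push(self.entity_kind);
        out.extend_from_slice(&self.entity_id.to_be_bytes());
        // Sketch only: payloads longer than u32::MAX are not handled.
        out.extend_from_slice(&(self.payload.len() as u32).to_be_bytes());
        out.extend_from_slice(&self.payload);
        out
    }

    fn decode(bytes: &[u8]) -> Option<WalEntry> {
        if bytes.len() < HEADER_LEN {
            return None;
        }
        let mut id = [0u8; 8];
        id.copy_from_slice(&bytes[2..10]);
        let mut len = [0u8; 4];
        len.copy_from_slice(&bytes[10..14]);
        let end = HEADER_LEN.checked_add(u32::from_be_bytes(len) as usize)?;
        let payload = bytes.get(HEADER_LEN..end)?.to_vec();
        Some(WalEntry {
            partition_key: bytes[0],
            entity_kind: bytes[1],
            entity_id: u64::from_be_bytes(id),
            payload,
        })
    }
}

/// Where a signal's aggregates are maintained. Ignored in M1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AggregationScope {
    Entity,
    Cohort,
}

struct SignalDef {
    name: String,
    aggregation_scope: AggregationScope,
}

/// Hook invoked on every signal write; M4 would plug cohort aggregation in here.
trait SignalObserver {
    fn on_signal(&mut self, entity_id: u64, signal_type: &str, weight: f64);
}

/// M1 default: does nothing.
struct NoopObserver;

impl SignalObserver for NoopObserver {
    fn on_signal(&mut self, _entity_id: u64, _signal_type: &str, _weight: f64) {}
}

struct SignalLedger<O: SignalObserver> {
    observer: O,
    writes: u64, // stand-in for the real per-entity state
}

impl<O: SignalObserver> SignalLedger<O> {
    fn record(&mut self, entity_id: u64, signal_type: &str, weight: f64) {
        self.writes += 1;
        self.observer.on_signal(entity_id, signal_type, weight);
    }
}
```

A generic parameter rather than a trait object keeps the no-op observer free of dynamic dispatch on the write path; whether that matters in practice is something a benchmark would have to show.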

M2: Ranked Retrieval

ADDED:

Scoped variant in the Candidate enum for ProfileDef -- allows candidate retrieval to be scoped to a pre-computed candidate set. Unused in M2 but makes the executor compositional from the start.
CandidateSet intermediate type -- the scored, pre-diversity bitmap of entity IDs that currently exists as an anonymous intermediate. Making it a reusable type enables query composition in M5.

These are M-complexity additions that make the executor compositional.
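
A minimal sketch of the two M2 types, assuming a `BTreeMap` as a stand-in for the bitmap-plus-scores representation. The `All` variant, the `retrieve` function, and the field names are hypothetical; only `Candidate`, `Scoped`, and `CandidateSet` are named in this analysis.

```rust
use std::collections::BTreeMap;

/// Scored, pre-diversity set of entity IDs.
/// A BTreeMap stands in for the real bitmap plus score storage.
#[derive(Debug, Clone, Default, PartialEq)]
struct CandidateSet {
    scores: BTreeMap<u64, f64>,
}

impl CandidateSet {
    fn ids(&self) -> Vec<u64> {
        self.scores.keys().copied().collect()
    }
}

enum Candidate {
    /// Hypothetical existing variant: consider every scored entity.
    All,
    /// New in M2: restrict retrieval to a pre-computed candidate set.
    Scoped(CandidateSet),
}

/// Build the candidate set for a query from already-scored entities.
fn retrieve(source: &Candidate, scored: &[(u64, f64)]) -> CandidateSet {
    let scores = scored
        .iter()
        .copied()
        .filter(|(id, _)| match source {
            Candidate::All => true,
            Candidate::Scoped(set) => set.scores.contains_key(id),
        })
        .collect();
    CandidateSet { scores }
}
```

Because `retrieve` both accepts and returns a `CandidateSet`, the output of one retrieval can feed the next, which is the property query composition relies on.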

M3: Personalized Ranking

ADDED (major):

Rich user attribute model: Expand from 2 to 15+ fields. Demographics (age_range, locale), interest taxonomy (hierarchical keywords), behavioral segments (database-computed), engagement patterns (database-computed).
Computed user fields materializer: Background process that derives behavioral segments from signal history -- preferred_format, engagement_frequency, active_hours, power_user_score. Analogous to signal rollup materializer but for user attributes.
User attribute indexes: Same bitmap/B-tree pattern as item metadata indexes, applied to user entities.

RESTRUCTURED: Phase 3.1 splits into Phase 3.1a (Rich User and Creator Entity Model) and Phase 3.1b (Relationship Graph). The split matters because the rich user model is needed for cohorts (M4) while the relationship graph is needed for personalization -- different downstream consumers, can be built in parallel.

M5 (was M4: Hybrid Search)

ADDED:

Query composition executor -- the WITHIN clause that restricts a SEARCH to a pre-computed candidate set
Layer 3 integration: SEARCH items QUERY "jazz piano" WITHIN TRENDING FOR COHORT @us_young_jazz LIMIT 20
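
Conceptually, WITHIN is an intersection followed by a re-rank. The sketch below assumes search hits arrive as (entity ID, text score) pairs and that the cohort-trending result is available as a set of IDs; `search_within` is a hypothetical name, and ranking by text score alone is a simplification.

```rust
use std::collections::HashSet;

/// Keep only search hits whose entity ID is in `within`,
/// order by descending text score (ties broken by ID), and apply the limit.
fn search_within(hits: &[(u64, f32)], within: &HashSet<u64>, limit: usize) -> Vec<(u64, f32)> {
    let mut kept: Vec<(u64, f32)> = hits
        .iter()
        .copied()
        .filter(|(id, _)| within.contains(id))
        .collect();
    kept.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    kept.truncate(limit);
    kept
}
```

In the example query, `hits` would come from the "jazz piano" text search and `within` from TRENDING FOR COHORT @us_young_jazz.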

M6 (was M5: Full Surface Coverage)

CHANGED: Signal rollups moved from "optional if benchmarks demand it" to required. Cohort-scoped 30d+ windowed aggregates across millions of entities cannot be computed from raw events in real time.

3. The New Milestone: M4 -- Cohort-Scoped Ranking

Milestone Thesis: "The database understands user segments as a query primitive. Trending for a cohort of US jazz fans produces different results than global trending."

Why this is a milestone and not a phase: It requires a new entity type (Cohort), a new signal aggregation path, a new candidate source, a new query clause, and background materialization. Too much for a phase, and independently user-testable.

Provisional Phases:

Phase 4.1: Cohort Definition and Membership (M complexity). Cohort as a schema primitive: a named predicate over user attributes. Membership is materialized as RoaringBitmap<UserId> with near-constant-time membership tests and is updated incrementally when user attributes change.
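
A minimal sketch of a cohort as a named predicate with materialized membership. It uses a `BTreeSet` from the standard library as a stand-in for the roaring bitmap so the example has no dependencies; the `UserAttrs` fields, the `Predicate` variants, and the method names are assumptions.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of the rich user attribute model.
#[derive(Debug, Clone)]
struct UserAttrs {
    region: String,
    age_range: String,
    interests: Vec<String>,
}

/// A predicate over user attributes; `All` is a conjunction.
enum Predicate {
    RegionIs(String),
    AgeRangeIs(String),
    HasInterest(String),
    All(Vec<Predicate>),
}

impl Predicate {
    fn matches(&self, user: &UserAttrs) -> bool {
        match self {
            Predicate::RegionIs(r) => user.region == *r,
            Predicate::AgeRangeIs(a) => user.age_range == *a,
            Predicate::HasInterest(i) => user.interests.iter().any(|x| x == i),
            Predicate::All(ps) => ps.iter().all(|p| p.matches(user)),
        }
    }
}

struct Cohort {
    name: String,
    predicate: Predicate,
    members: BTreeSet<u64>, // stand-in for the materialized bitmap
}

impl Cohort {
    fn new(name: &str, predicate: Predicate) -> Cohort {
        Cohort { name: name.to_string(), predicate, members: BTreeSet::new() }
    }

    /// Incremental update: re-evaluate a single user after an attribute change.
    fn on_user_changed(&mut self, user_id: u64, attrs: &UserAttrs) {
        if self.predicate.matches(attrs) {
            self.members.insert(user_id);
        } else {
            self.members.remove(&user_id);
        }
    }

    fn contains(&self, user_id: u64) -> bool {
        self.members.contains(&user_id)
    }
}
```

Only the changed user is re-evaluated, so an attribute update costs one predicate evaluation per cohort rather than a full membership rebuild.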

Phase 4.2: Cohort-Scoped Signal Aggregation (XL complexity -- highest risk). Signal write fan-out: when a signal arrives for an entity from a user in cohort C, update per-cohort running aggregates. Same decay/windowed pattern as entity signals but keyed by (cohort, entity). Sparse representation required to manage memory.
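
The fan-out can be sketched as a sparse map keyed by (cohort, entity), where an entry exists only once a cohort member has signaled on that entity. The exponential half-life decay used here is an assumed stand-in for whatever decay model the entity signals actually use, and all type and method names are hypothetical.

```rust
use std::collections::HashMap;

/// Running decayed score plus the timestamp it was last updated at.
#[derive(Debug, Clone, Copy)]
struct Decayed {
    score: f64,
    last_ts: f64,
}

impl Decayed {
    /// Decay the stored score to `ts`, then add the new weight.
    fn add(&mut self, weight: f64, ts: f64, half_life: f64) {
        let dt = (ts - self.last_ts).max(0.0);
        self.score = self.score * 0.5f64.powf(dt / half_life) + weight;
        self.last_ts = self.last_ts.max(ts);
    }
}

/// Sparse per-cohort aggregates: absent keys cost no memory.
struct CohortAggregates {
    half_life_secs: f64,
    state: HashMap<(u32, u64), Decayed>,
}

impl CohortAggregates {
    fn new(half_life_secs: f64) -> CohortAggregates {
        CohortAggregates { half_life_secs, state: HashMap::new() }
    }

    /// Fan one signal out to every cohort the user belongs to at write time.
    fn record(&mut self, entity_id: u64, user_cohorts: &[u32], weight: f64, ts: f64) {
        let half_life = self.half_life_secs;
        for &cohort in user_cohorts {
            self.state
                .entry((cohort, entity_id))
                .or_insert(Decayed { score: 0.0, last_ts: ts })
                .add(weight, ts, half_life);
        }
    }

    fn score(&self, cohort: u32, entity_id: u64) -> Option<f64> {
        self.state.get(&(cohort, entity_id)).map(|d| d.score)
    }
}
```

Passing `user_cohorts` in at write time means the aggregate records membership as it was when the signal arrived, so later attribute changes do not move historical signals between cohorts.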

Phase 4.3: Cohort-Scoped Query Execution (L complexity). FOR COHORT @cohort_id clause in RETRIEVE queries. Signal references resolve to cohort-scoped aggregates. Composes with FOR USER for personalization on top.

Phase 4.4: Cohort Lifecycle and Diagnostics (S complexity). List, inspect, delete cohorts. View cohort-scoped signal state for debugging.

Deferred from M4: Cohort-scoped search (Layer 3) deferred to M5 (needs Tantivy). Dynamic cohorts deferred to M6. Cohort-based A/B testing deferred to M7.

4. What Is Now Deferred That Should Not Be

Horizontal Distribution Design

The deferral of implementation is still correct. The deferral of design is now wrong. Storage engine, WAL format, key encoding, and signal aggregation must be designed so distribution can be added without restructuring. Distribution design constraints are applied from M1. Distribution implementation remains post-M7.

Signal Rollups

Now required in M6. Cohort-scoped 30d+ windows over millions of entities demand materialized rollups. The bucketed counter approach works for per-entity signals because each entity has bounded events. Cohort aggregates span millions of entities.

User Attribute Model

The 2-field model is a critical gap: it cannot answer "what is trending among young US jazz fans." The rich user model is now a required deliverable in M3.

5. Revised Milestone Theses

Milestone	Original Thesis	Revised Thesis
M1	Signals are a database primitive	Same, plus: signal system designed for future scoped aggregation
M2	A single query retrieves, scores, and ranks	Same, plus: compositional executor supports scoped candidate sets
M3	User context shapes ranking -- For You works	Same, plus: user model rich enough to define meaningful audience segments
M4 (new)	(did not exist)	Database understands user segments as query primitives
M5 (was M4)	Text + semantic + signals in one query	Same, plus: search within a scoped result set (query composition)
M6 (was M5)	Every use case works	Same, plus: cohort-scoped variants of trending/rising/browse
M7 (was M6)	Ready for real workloads	Same, plus: documented path to horizontal distribution

6. Critical Path Analysis

Parallelization Opportunities

M5 Phases (Tantivy, RRF, SEARCH parser) can start in parallel with M4. They depend on M2/M3, not M4. Only the query composition phase depends on M4.
M3 Phase 3.0 (rich user model) can start as soon as M2 Phase 2.2 (metadata indexing) ships -- same bitmap/B-tree patterns applied to user entities.
M4 Phase 4.1 (cohort definition) can start as soon as M3 Phase 3.0 ships -- without waiting for M3's feedback loop to complete.

Phases That Block the Most Downstream Work

Phase	What It Blocks	Impact
Phase 1.4 (Signal Ledger)	Phase 1.5, 2.3, 4.2	Everything after M1
Phase 2.2 (Filters)	Phase 2.4, 2.5, 3.0, 3.1	Everything after M2
Phase 3.0 (Rich User Model)	Phase 4.1, 4.2, 4.3	All of M4 and M5 composition
Phase 4.2 (Cohort Signals)	Phase 4.3, 5.X	M4 completion and query composition
Phase 2.5 (RETRIEVE Executor)	Phase 4.3, 5.X	Cohort queries and composition

The Longest Pole

Phase 4.2 (Cohort-Scoped Signal Aggregation) at XL complexity is the highest-risk phase and gates both M4 completion and M5 query composition. Key risks:

Memory budget: Naive per-cohort signal state for 50 cohorts * 10M entities is 40 GB, which implies roughly 80 bytes of state per (cohort, entity) pair. This requires a sparse representation (only entities with signals from cohort members), which reduces the footprint to ~400 MB at 50 cohorts * 100K active entities each.
Write amplification: Each signal write fans out to 1 entity state + N cohort state updates. At 5 cohorts per user average, 6x write cost. Must be amortized via batching.
Correctness: When a user's attributes change and they move between cohorts, historical signals must NOT retroactively move. Cohort aggregates reflect "signals from users who were in this cohort when the signal was written."
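
The memory and write-amplification figures above can be reproduced with simple arithmetic. The 80 bytes per (cohort, entity) pair is inferred from 40 GB / (50 * 10M) and is not a measured number; the function names are made up for this sketch.

```rust
/// Inferred from the figures above (40 GB / 500M pairs); not a measured value.
const BYTES_PER_PAIR: u64 = 80;

/// Total bytes of cohort signal state for a given shape.
fn cohort_state_bytes(cohorts: u64, entities_per_cohort: u64) -> u64 {
    cohorts * entities_per_cohort * BYTES_PER_PAIR
}

/// Writes per incoming signal: one entity update plus one per cohort.
fn write_fanout(cohorts_per_user: u64) -> u64 {
    1 + cohorts_per_user
}
```

Under these assumptions, `cohort_state_bytes(50, 10_000_000)` is 40,000,000,000 bytes (40 GB), `cohort_state_bytes(50, 100_000)` is 400,000,000 bytes (400 MB), and `write_fanout(5)` is 6. The proposed spike should replace the 80-byte estimate with a measured figure.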

Mitigation: Run a 2-3 day spike before committing to Phase 4.2 implementation to benchmark sparse cohort state memory, write amplification with fan-out, and cohort-scoped trending query latency.

7. What Does NOT Change

M1 and M2 UAT scenarios -- signal correctness and ranked retrieval do not require cohorts
Signal ledger architecture -- per-entity running decay scores unchanged; cohort aggregation is additional, not replacement
USearch, Tantivy, fjall choices -- unaffected by cohort requirements
Key encoding -- already supports range-based partitioning; cohort keys follow same pattern
Query language structure -- FOR COHORT and WITHIN are additive clauses
Embeddable Rust library deployment model -- cohorts are in-process primitives

8. Open Questions Requiring Resolution

How many cohorts? 10 and 10,000 have radically different memory/write-amplification profiles.
Static or dynamic predicates? Dynamic cohorts ("users who viewed jazz in last 7d") are dramatically more expensive.
Point-in-time membership? "What was trending in this cohort yesterday?" requires historical snapshots.
User attribute refresh cadence? Behavioral segments recomputed hourly? Daily?
Automatic cohort assignment in M4 or M6? Auto-assignment requires a scoring function; manual is simpler.

This analysis should be reviewed by @tidal-engineer for technical feasibility assessment before the roadmap is revised.
