# Milestone 2, Phase 4: Diversity Enforcement

## Phase Deliverable

A post-scoring diversity pass that selects results from a scored candidate list to satisfy diversity constraints (`max_per_creator`, `format_mix`) without reducing result count. Implemented as a single greedy O(n) selection pass over the sorted candidate list. When constraints cannot be fully satisfied, the selector relaxes them in a defined order and returns results with a warning flag rather than an error.

This is the phase that turns ranked results from "the top N by score" into "the top N by score that a user would actually want to scroll through." Without diversity, a trending creator dominates the feed. With diversity, the database enforces variety -- no application logic required.

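
The greedy pass described above can be sketched as follows. This is a minimal illustration, not the Task 01 implementation: `Candidate`, `Selection`, `select_diverse`, and the `limit` parameter are hypothetical names, only `max_per_creator` is handled, and relaxation is collapsed into a single backfill step instead of the three-stage order.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the m2p3 `ScoredCandidate`, reduced to the
/// fields this sketch reads.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub entity_id: u64,
    pub creator_id: Option<u64>,
    pub score: f64,
}

/// Hypothetical result type: the selected page plus a warning flag.
#[derive(Debug)]
pub struct Selection {
    pub items: Vec<Candidate>,
    /// True when the creator cap had to be relaxed to fill the page.
    pub relaxed: bool,
}

/// One greedy pass over candidates already sorted by score descending.
pub fn select_diverse(sorted: &[Candidate], limit: usize, max_per_creator: usize) -> Selection {
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut items = Vec::with_capacity(limit);
    let mut deferred = Vec::new();

    for c in sorted {
        if items.len() == limit {
            break;
        }
        // Assumption: candidates without a creator are never capped.
        let within_cap = match c.creator_id {
            Some(id) => {
                let n = per_creator.entry(id).or_insert(0);
                if *n < max_per_creator {
                    *n += 1;
                    true
                } else {
                    false
                }
            }
            None => true,
        };
        if within_cap {
            items.push(c.clone());
        } else {
            deferred.push(c.clone());
        }
    }

    // Simplified relaxation: backfill with the best deferred candidates
    // instead of returning a short page, and flag the result.
    let relaxed = items.len() < limit && !deferred.is_empty();
    if relaxed {
        let missing = limit - items.len();
        items.extend(deferred.into_iter().take(missing));
    }
    Selection { items, relaxed }
}
```

With three items from one creator, one from another, `limit = 4`, and `max_per_creator = 2`, the pass first takes the three compliant items, then backfills the deferred one and sets `relaxed`. In this sketch backfilled items are appended after the compliant ones, so the page is no longer strictly score-ordered; whether to re-sort is left open here.
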
## Acceptance Criteria

- [ ] `max_per_creator:N` enforced: no more than N items from any single creator in the result set
- [ ] `format_mix:true` enforced: no more than 60% of results from any single format
- [ ] Diversity pass does not reduce result count -- it selects the next-best candidate that satisfies constraints
- [ ] Diversity pass adds < 1ms for 200 candidates (benchmarked)
- [ ] When diversity constraints cannot be fully satisfied (too few creators), results are returned with a warning flag, not an error
- [ ] Property test: diversity constraints hold for 10,000 random candidate sets

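
The first two criteria can be written as plain predicates over a result page, which is roughly what a property test would assert. A minimal sketch, assuming a hypothetical `ResultItem` view and reading "60% of results" as a share of the returned page; both are assumptions, not definitions from the spec.

```rust
use std::collections::HashMap;

/// Hypothetical view of one result: only the two fields the checks read.
pub struct ResultItem<'a> {
    pub creator_id: u64,
    pub format: &'a str,
}

/// True when no creator appears more than `max_per_creator` times.
pub fn holds_max_per_creator(results: &[ResultItem<'_>], max_per_creator: usize) -> bool {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for r in results {
        *counts.entry(r.creator_id).or_insert(0) += 1;
    }
    counts.values().all(|&n| n <= max_per_creator)
}

/// True when no single format exceeds 60% of the page.
/// Uses the integer form of `n / total <= 0.6`, that is `5 * n <= 3 * total`,
/// to avoid floating-point rounding at the boundary.
pub fn holds_format_mix(results: &[ResultItem<'_>]) -> bool {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for r in results {
        *counts.entry(r.format).or_insert(0) += 1;
    }
    counts.values().all(|&n| n * 5 <= results.len() * 3)
}
```

Under this reading a one-item page can never pass the format check (1 of 1 is 100%), so very small pages would always hit relaxation. How the spec treats that case is not stated in this document.
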
## Dependencies

- **Requires:** m2p3 (profile executor produces `Vec<ScoredCandidate>` sorted by score descending, with `entity_id`, `score`, and `signal_snapshot`; `DiversitySpec` type defined on `RankingProfile` with `max_per_creator`, `format_mix`, `topic_diversity`, `category_min`)
- **Blocks:** m2p5 (RETRIEVE executor calls diversity enforcement as the penultimate step before result return)

## Research References

- [thoughts.md](../../../../thoughts.md) -- Part V.14 (MMR post-scoring diversity enforcement)
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Filtered search and post-retrieval reranking patterns

## Spec References

- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 9 (Diversity Enforcement):
  - Section 9.1 (DiversitySpec structure: max_per_creator, format_mix, topic_diversity, category_min)
  - Section 9.2 (Greedy MMR reranking algorithm pseudocode)
  - Section 9.3 (Constraint details: per-page enforcement, format bonus, category minimum)
  - Section 9.4 (Diversity and pagination: per-page, not global)
  - Section 9.5 (Diversity as reordering, not filtering; relaxation under pressure)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 4 (Scoring pipeline: diversity is Stage 8)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 16 (Invariants INV-RANK-5: diversity never reduces result count, INV-RANK-6: diversity preserves relative score order within same-constraint group)

## Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Diversity Types + Greedy Selector | `DiversityConstraints`, `DiversityResult`, `ConstraintViolation`, `DiversitySelector`, greedy selection algorithm with three-stage relaxation | None | M |
| 02 | Property Tests + Benchmarks | proptest property tests (10,000 random candidate sets), Criterion benchmarks (200-candidate < 1ms) | Task 01 | S |

## Task Dependency DAG

```
Task 01: Diversity Types + Greedy Selector
    |
    v
Task 02: Property Tests + Benchmarks
```

Task 01 delivers all types and the selection algorithm. Task 02 validates correctness via property tests and performance via benchmarks. Strictly sequential -- Task 02 tests the implementation from Task 01.

## File Layout

```
tidal/src/
  ranking/
    diversity.rs    -- DiversityConstraints, DiversityResult, DiversitySelector,
                       ConstraintViolation (Task 01)
    mod.rs          -- add `pub mod diversity;` and re-exports (Task 01)
tidal/benches/
  ranking.rs        -- add diversity benchmarks (Task 02) to the existing ranking bench file
```

## Open Questions

1. **Creator ID and format in ScoredCandidate**: The diversity selector needs each candidate's `creator_id` and `format` to apply constraints. These are entity metadata fields. `ScoredCandidate` from m2p3 has `entity_id`, `score`, and `signal_snapshot` but no general metadata map. Options:
   - (A) Add `creator_id: Option<EntityId>` and `format: Option<String>` fields to `ScoredCandidate` -- cleanest, no extra lookup
   - (B) `DiversitySelector` takes `&EntityStore` and loads metadata per candidate -- more flexible, extra lookup cost (~50ns per candidate)
   - **Decision for M2:** Option A. The executor adds `creator_id` and `format` to `ScoredCandidate` at scoring time (they are already loaded from entity metadata during scoring). This keeps diversity O(n) without extra I/O. The `ScoredCandidate` struct gains two optional fields. This change is made as part of Task 01 in this phase.

2. **`min_exploration` constraint**: The exploration budget (10% of results from unfollowed creators) is an M3 feature (Spec 09 Section 10). `DiversityConstraints` includes a `min_exploration: Option<f64>` field for forward compatibility, but the M2 selector ignores it if set. A `// TODO` comment is added in the selector: "M3: implement exploration budget after relationship graph is available." It is a comment rather than a `todo!()` macro call, because the macro would panic at runtime instead of ignoring the field.

3. **Relaxation order**: The three-stage relaxation (double `max_per_creator`, ignore `format_mix`, accept anything) is the default for M2. In future milestones the caller (the m2p5 RETRIEVE executor) may be able to configure a stricter relaxation policy. For M2, hardcode the three-stage order.

4. **`DiversitySpec` vs `DiversityConstraints`**: `DiversitySpec` is already defined on `RankingProfile` (m2p3 Task 01) with fields `max_per_creator`, `format_mix`, `topic_diversity`, `category_min`. The `DiversityConstraints` struct in this phase is the runtime representation used by the selector, derived from `DiversitySpec` plus query-level overrides (the `DIVERSITY` clause). For M2, `DiversityConstraints` is constructed from `DiversitySpec` with a `From` impl. Query-level overrides are wired in m2p5.

5. **`topic_diversity` and `category_min`**: These are fields on `DiversitySpec` from the spec (Section 9.1). For M2, only `max_per_creator` and `format_mix` are implemented. `topic_diversity` requires embedding distance computation (O(n*k), where k is the selected count), which changes the algorithm from the single-pass greedy selector to MMR. `category_min` requires category metadata on each candidate. Both are deferred to M6. The `DiversityConstraints` struct includes these fields as `Option` types, but the selector skips them with a `tracing::debug!` message when set.

6. **Diversity and pagination (Spec 09 Section 9.4)**: Diversity constraints apply per page, not globally across all pages. The selector operates on a single page's worth of candidates. The RETRIEVE executor (m2p5) handles pagination by passing the correct candidate slice to the selector. No pagination logic is needed in the diversity module itself.

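
Questions 3 and 4 above can be illustrated together: a `From` conversion from the profile-level spec to runtime constraints, and a function that yields the effective constraints at each relaxation stage. This is a sketch under stated assumptions. The field types, the `RelaxationStage` enum, and the `relaxed` method are hypothetical, and treating the stages as cumulative is one possible reading of the three-stage order.

```rust
/// Hypothetical reduction of the m2p3 `DiversitySpec`; field types are assumed.
#[derive(Debug, Clone, Default)]
pub struct DiversitySpec {
    pub max_per_creator: Option<usize>,
    pub format_mix: bool,
    pub topic_diversity: Option<f64>,
    pub category_min: Option<usize>,
}

/// Runtime constraints read by the selector.
#[derive(Debug, Clone, PartialEq)]
pub struct DiversityConstraints {
    pub max_per_creator: Option<usize>,
    pub format_mix: bool,
    pub topic_diversity: Option<f64>, // carried through, skipped in M2
    pub category_min: Option<usize>,  // carried through, skipped in M2
    pub min_exploration: Option<f64>, // ignored in M2
}

impl From<&DiversitySpec> for DiversityConstraints {
    fn from(spec: &DiversitySpec) -> Self {
        DiversityConstraints {
            max_per_creator: spec.max_per_creator,
            format_mix: spec.format_mix,
            topic_diversity: spec.topic_diversity,
            category_min: spec.category_min,
            // Not part of `DiversitySpec`; query-level overrides arrive in m2p5.
            min_exploration: None,
        }
    }
}

/// Relaxation stages in the hardcoded M2 order (names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RelaxationStage {
    Strict,
    DoubledCreatorCap,
    IgnoreFormatMix,
    AcceptAnything,
}

impl DiversityConstraints {
    /// Effective constraints at a given stage. Assumption: each stage keeps
    /// the relaxations of the stages before it.
    pub fn relaxed(&self, stage: RelaxationStage) -> DiversityConstraints {
        let mut c = self.clone();
        match stage {
            RelaxationStage::Strict => {}
            RelaxationStage::DoubledCreatorCap => {
                c.max_per_creator = c.max_per_creator.map(|n| n.saturating_mul(2));
            }
            RelaxationStage::IgnoreFormatMix => {
                c.max_per_creator = c.max_per_creator.map(|n| n.saturating_mul(2));
                c.format_mix = false;
            }
            RelaxationStage::AcceptAnything => {
                c.max_per_creator = None;
                c.format_mix = false;
            }
        }
        c
    }
}
```

A selector built this way would try `Strict` first and move to the next stage only while the page is still short, setting the warning flag once any stage past `Strict` is used.
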