tidaldb/docs/planning/milestone-2/phase-4/OVERVIEW.md

# Milestone 2, Phase 4: Diversity Enforcement

## Phase Deliverable

A post-scoring diversity pass that selects results from a scored candidate list so that the result set satisfies the diversity constraints (`max_per_creator`, `format_mix`) without reducing the result count. It is implemented as a single greedy selection pass, O(n) in the number of candidates, over the score-sorted candidate list. When the constraints cannot be fully satisfied, the selector relaxes them in a defined order and returns results with a warning flag rather than an error.

This is the phase that turns ranked results from "the top N by score" into "the top N by score that a user would actually want to scroll through." Without diversity, a trending creator dominates the feed. With diversity, the database enforces variety -- no application logic required.

## Acceptance Criteria

- [ ] `max_per_creator:N` enforced: no more than N items from any single creator in the result set
- [ ] `format_mix:true` enforced: no more than 60% of results from any single format
- [ ] Diversity pass does not reduce result count -- it selects the next-best candidate that satisfies constraints
- [ ] Diversity pass adds < 1ms for 200 candidates (benchmarked)
- [ ] When diversity constraints cannot be fully satisfied (too few creators), results are returned with a warning flag, not an error
- [ ] Property test: diversity constraints hold for 10,000 random candidate sets
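
The criteria above can be sketched as a small greedy selector. This is a minimal illustration, not the Task 01 implementation: the field types, the `select_diverse` name, and the `relaxed` flag are assumptions, and the relaxation stages are assumed to be cumulative (double the creator cap, then also drop the format cap, then accept anything). For simplicity the sketch re-scans the candidate list once per stage, which stays O(n) because the number of stages is constant.

```rust
use std::collections::HashMap;

// Assumption: EntityId is a plain integer here; the real type lives elsewhere in tidal.
pub type EntityId = u64;

#[derive(Debug, Clone, PartialEq)]
pub struct ScoredCandidate {
    pub entity_id: EntityId,
    pub score: f64,
    pub creator_id: Option<EntityId>,
    pub format: Option<String>,
}

#[derive(Debug, Clone, Copy)]
pub struct DiversityConstraints {
    pub max_per_creator: Option<usize>,
    pub format_mix: bool,
}

#[derive(Debug)]
pub struct DiversityResult {
    pub selected: Vec<ScoredCandidate>,
    /// Warning flag: true when at least one relaxation stage was needed.
    pub relaxed: bool,
}

/// `candidates` must already be sorted by score, descending.
pub fn select_diverse(
    candidates: &[ScoredCandidate],
    limit: usize,
    constraints: DiversityConstraints,
) -> DiversityResult {
    let target = limit.min(candidates.len());
    // "No more than 60% from one format", as an integer cap on the page.
    let format_cap = if constraints.format_mix { Some(target * 3 / 5) } else { None };
    let doubled = constraints.max_per_creator.map(|n| n.saturating_mul(2));
    // Stage 0 is strict; stages 1..=3 are the three relaxation steps.
    let stages = [
        (constraints.max_per_creator, format_cap),
        (doubled, format_cap),
        (doubled, None),
        (None, None),
    ];

    let mut taken = vec![false; candidates.len()];
    let mut count = 0usize;
    let mut per_creator: HashMap<EntityId, usize> = HashMap::new();
    let mut per_format: HashMap<&str, usize> = HashMap::new();
    let mut relaxed = false;

    for (stage, (creator_cap, fmt_cap)) in stages.iter().copied().enumerate() {
        if count >= target {
            break;
        }
        relaxed = stage > 0;
        for (i, cand) in candidates.iter().enumerate() {
            if count >= target {
                break;
            }
            if taken[i] {
                continue;
            }
            // Candidates without creator/format metadata are never blocked.
            let creator_ok = match (creator_cap, cand.creator_id) {
                (Some(cap), Some(id)) => per_creator.get(&id).copied().unwrap_or(0) < cap,
                _ => true,
            };
            let format_ok = match (fmt_cap, cand.format.as_deref()) {
                (Some(cap), Some(f)) => per_format.get(f).copied().unwrap_or(0) < cap,
                _ => true,
            };
            if creator_ok && format_ok {
                taken[i] = true;
                count += 1;
                if let Some(id) = cand.creator_id {
                    *per_creator.entry(id).or_insert(0) += 1;
                }
                if let Some(f) = cand.format.as_deref() {
                    *per_format.entry(f).or_insert(0) += 1;
                }
            }
        }
    }

    // Emit the chosen candidates in their original (score) order.
    let selected = candidates
        .iter()
        .enumerate()
        .filter(|(i, _)| taken[*i])
        .map(|(_, c)| c.clone())
        .collect();
    DiversityResult { selected, relaxed }
}
```

With `max_per_creator: Some(2)` and a pool dominated by one creator, the selector skips that creator's third item and takes the next-best candidate from another creator, so the page still fills to `limit`. Very small pages (for example a single result) cannot satisfy a 60% format cap, so under this sketch they fall through to relaxation and come back with `relaxed` set.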

## Dependencies

- **Requires:** m2p3 (profile executor produces `Vec<ScoredCandidate>` sorted by score descending, with `entity_id`, `score`, and `signal_snapshot`; `DiversitySpec` type defined on `RankingProfile` with `max_per_creator`, `format_mix`, `topic_diversity`, `category_min`)
- **Blocks:** m2p5 (RETRIEVE executor calls diversity enforcement as the penultimate step before result return)

## Research References

- [thoughts.md](../../../../thoughts.md) -- Part V.14 (MMR post-scoring diversity enforcement)
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Filtered search and post-retrieval reranking patterns

## Spec References

- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 9 (Diversity Enforcement):
  - Section 9.1 (DiversitySpec structure: max_per_creator, format_mix, topic_diversity, category_min)
  - Section 9.2 (Greedy MMR reranking algorithm pseudocode)
  - Section 9.3 (Constraint details: per-page enforcement, format bonus, category minimum)
  - Section 9.4 (Diversity and pagination: per-page, not global)
  - Section 9.5 (Diversity as reordering, not filtering; relaxation under pressure)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 4 (Scoring pipeline: diversity is Stage 8)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 16 (Invariants INV-RANK-5: diversity never reduces result count, INV-RANK-6: diversity preserves relative score order within same-constraint group)

## Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Diversity Types + Greedy Selector | `DiversityConstraints`, `DiversityResult`, `ConstraintViolation`, `DiversitySelector`, greedy selection algorithm with three-stage relaxation | None | M |
| 02 | Property Tests + Benchmarks | proptest property tests (10,000 random candidate sets), Criterion benchmarks (200-candidate < 1ms) | Task 01 | S |

## Task Dependency DAG

```
Task 01: Diversity Types + Greedy Selector
    |
    v
Task 02: Property Tests + Benchmarks
```

Task 01 delivers all types and the selection algorithm. Task 02 validates correctness via property tests and performance via benchmarks. The tasks are strictly sequential: Task 02 tests the implementation from Task 01.
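
Property tests of this kind usually need an oracle that is independent of the selector. The sketch below shows one possible shape for that check. The `check_page` function, the tuple layout of a page, and the `ConstraintViolation` variants are all hypothetical; the real `ConstraintViolation` type from Task 01 may look different. Generating the 10,000 random candidate sets would be left to proptest and is not shown.

```rust
use std::collections::BTreeMap;

// Hypothetical variants; the Task 01 type may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConstraintViolation {
    CreatorOverCap { creator_id: u64, count: usize, cap: usize },
    FormatOverShare { format: String, count: usize, total: usize },
}

/// Checks one result page, given as (creator_id, format) pairs.
/// Returns every violated constraint; an empty Vec means the page is valid.
pub fn check_page(
    page: &[(Option<u64>, Option<&str>)],
    max_per_creator: Option<usize>,
    format_mix: bool,
) -> Vec<ConstraintViolation> {
    // BTreeMap keeps the report order deterministic across runs.
    let mut per_creator: BTreeMap<u64, usize> = BTreeMap::new();
    let mut per_format: BTreeMap<&str, usize> = BTreeMap::new();
    for &(creator, format) in page {
        if let Some(id) = creator {
            *per_creator.entry(id).or_insert(0) += 1;
        }
        if let Some(f) = format {
            *per_format.entry(f).or_insert(0) += 1;
        }
    }

    let mut violations = Vec::new();
    if let Some(cap) = max_per_creator {
        for (&creator_id, &count) in &per_creator {
            if count > cap {
                violations.push(ConstraintViolation::CreatorOverCap { creator_id, count, cap });
            }
        }
    }
    if format_mix {
        let total = page.len();
        for (&format, &count) in &per_format {
            // count / total > 60%, written in integers to avoid float rounding.
            if 5 * count > 3 * total {
                violations.push(ConstraintViolation::FormatOverShare {
                    format: format.to_string(),
                    count,
                    total,
                });
            }
        }
    }
    violations
}
```

A property test would then assert that `check_page` returns no violations whenever the selector reports that no relaxation was needed, and separately that the result count equals `min(limit, candidates.len())` in every case.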

## File Layout

```
tidal/src/
  ranking/
    diversity.rs     -- DiversityConstraints, DiversityResult, DiversitySelector,
                        ConstraintViolation (Task 01)
    mod.rs           -- add `pub mod diversity;` and re-exports (Task 01)
tidal/benches/
  ranking.rs         -- add diversity benchmarks (Task 02) to the existing ranking bench file
```

## Open Questions

1. **Creator ID and format in ScoredCandidate**: The diversity selector needs each candidate's `creator_id` and `format` to apply constraints. These are entity metadata fields. `ScoredCandidate` from m2p3 has `entity_id`, `score`, and `signal_snapshot` but not a general metadata map. Options:
   - (A) Add `creator_id: Option<EntityId>` and `format: Option<String>` fields to `ScoredCandidate` -- cleanest, no extra lookup
   - (B) `DiversitySelector` takes `&EntityStore` and loads metadata per candidate -- more flexible, extra lookup cost (~50ns per candidate)
   - **Decision for M2:** Option A. The executor adds `creator_id` and `format` to `ScoredCandidate` at scoring time (they are already loaded from entity metadata during scoring). This keeps diversity O(n) without extra I/O. The `ScoredCandidate` struct gains two optional fields. This change is made as part of Task 01 in this phase.

2. **`min_exploration` constraint**: The exploration budget (10% of results from unfollowed creators) is an M3 feature (Spec 09 Section 10). `DiversityConstraints` includes a `min_exploration: Option<f64>` field for forward compatibility, but the M2 selector ignores it if set. A `// TODO` comment is added in the selector (a comment, not the `todo!()` macro, which would panic whenever the field is set): "M3: implement exploration budget after relationship graph is available."

3. **Relaxation order**: The three-stage relaxation (first double `max_per_creator`, then ignore `format_mix`, then accept any remaining candidate) is the default for M2. Letting the caller (the m2p5 RETRIEVE executor) configure a stricter relaxation policy is left to future milestones. For M2, the three-stage order is hardcoded.

4. **`DiversitySpec` vs `DiversityConstraints`**: `DiversitySpec` is already defined on `RankingProfile` (m2p3 Task 01) with fields `max_per_creator`, `format_mix`, `topic_diversity`, `category_min`. The `DiversityConstraints` struct in this phase is the runtime representation used by the selector, derived from `DiversitySpec` plus query-level overrides (the `DIVERSITY` clause). For M2, `DiversityConstraints` is constructed from `DiversitySpec` with a `From` impl. Query-level overrides are wired in m2p5.

5. **`topic_diversity` and `category_min`**: These are fields on `DiversitySpec` from the spec (Section 9.1). For M2, only `max_per_creator` and `format_mix` are implemented. `topic_diversity` requires embedding distance computation (O(n*k), where k is the number of selected results), which turns the single-pass greedy selection into full MMR reranking. `category_min` requires category metadata on each candidate. Both are deferred to M6. The `DiversityConstraints` struct includes these fields as `Option` types, but when they are set the selector skips them and emits a `tracing::debug!` message.

6. **Diversity and pagination (Spec 09 Section 9.4)**: Diversity constraints apply per page, not globally across all pages. The selector operates on a single page's worth of candidates. The RETRIEVE executor (m2p5) handles pagination by passing the correct candidate slice to the selector. No pagination logic is needed in the diversity module itself.
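
Questions 2, 4, and 5 together suggest the following shape for the conversion from `DiversitySpec` to `DiversityConstraints`. This is a sketch under stated assumptions: the field types (`Option<usize>`, `bool`, `Option<f64>`) are guesses, since the m2p3 definitions are not reproduced here, and `deferred_fields` is a hypothetical helper. It returns field names instead of logging so that the example has no dependency on `tracing`.

```rust
// Assumed field types; the real DiversitySpec is defined in m2p3 Task 01.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DiversitySpec {
    pub max_per_creator: Option<usize>,
    pub format_mix: bool,
    pub topic_diversity: Option<f64>,
    pub category_min: Option<usize>,
}

/// Runtime representation consumed by the selector.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DiversityConstraints {
    pub max_per_creator: Option<usize>,
    pub format_mix: bool,
    pub topic_diversity: Option<f64>, // deferred to M6
    pub category_min: Option<usize>,  // deferred to M6
    pub min_exploration: Option<f64>, // M3: exploration budget
}

impl From<&DiversitySpec> for DiversityConstraints {
    fn from(spec: &DiversitySpec) -> Self {
        Self {
            max_per_creator: spec.max_per_creator,
            format_mix: spec.format_mix,
            topic_diversity: spec.topic_diversity,
            category_min: spec.category_min,
            // Not part of DiversitySpec; query-level overrides arrive in m2p5.
            min_exploration: None,
        }
    }
}

impl DiversityConstraints {
    /// Names of fields that are set but that the M2 selector does not enforce.
    /// Hypothetical helper: the selector could log each name at debug level.
    pub fn deferred_fields(&self) -> Vec<&'static str> {
        let mut names = Vec::new();
        if self.topic_diversity.is_some() {
            names.push("topic_diversity");
        }
        if self.category_min.is_some() {
            names.push("category_min");
        }
        if self.min_exploration.is_some() {
            names.push("min_exploration");
        }
        names
    }
}
```

Implementing `From<&DiversitySpec>` rather than `From<DiversitySpec>` is a design choice of this sketch: it lets the executor derive constraints without consuming or cloning the `RankingProfile` that owns the spec.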