tidaldb/docs/planning/milestone-7/phase-3/task-01-scale-benchmark-suite.md

# Task 01: Scale Benchmark Suite

## Delivers

A Criterion benchmark suite operating at 1M items / 100K users / 10K creators that establishes performance baselines for RETRIEVE (for_you, trending, new with a category filter), SEARCH (text-only, text with a category filter), and signal write throughput. All subsequent m7p3 tasks measure the impact of their optimizations against these baselines.

## Complexity

L

## Dependencies

- m7p1 complete (TidalDb API, entity writes, signal writes, RETRIEVE, SEARCH all operational)
- Existing bench files: `tidal/benches/query.rs`, `tidal/benches/search.rs`, `tidal/benches/signals.rs`

## Technical Design

### 1. New bench target: `tidal/benches/scale.rs`

Register in `Cargo.toml`:

```toml
[[bench]]
name = "scale"
harness = false
```

### 2. Shared setup harness

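The 1M-item universe takes minutes to construct. Build it once per bench process behind a `static` `std::sync::LazyLock` (or `std::sync::OnceLock`) so every benchmark function shares the same populated `TidalDb` instance. The first function to dereference `SCALE_DB` pays the construction cost; each function does so before calling `b.iter()`, which keeps the build out of the timed closures.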

```rust
#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]

use std::collections::HashMap;
// std::hint::black_box replaces criterion::black_box, which newer Criterion
// releases deprecate.
use std::hint::black_box;
use std::sync::LazyLock;
use std::time::Duration;

use criterion::{Criterion, SamplingMode, criterion_group, criterion_main};
use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::query::search::Search;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};
use tidaldb::storage::indexes::filter::FilterExpr;

const ITEM_COUNT: u64 = 1_000_000;
const USER_COUNT: u64 = 100_000;
const CREATOR_COUNT: u64 = 10_000;

fn scale_schema() -> tidaldb::schema::Schema {
    let mut builder = SchemaBuilder::new();
    for sig in &["view", "like", "share", "skip", "completion"] {
        let _ = builder
            .signal(
                sig,
                EntityKind::Item,
                DecaySpec::Exponential {
                    half_life: Duration::from_secs(7 * 24 * 3600),
                },
            )
            .windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
            .velocity(true)
            .add();
    }
    builder.text_field("title", TextFieldType::Text);
    builder.text_field("description", TextFieldType::Text);
    builder.text_field("category", TextFieldType::Keyword);
    builder.build().unwrap()
}

/// Build a TidalDb with 1M items, text fields, and signal data.
///
/// Item distribution:
/// - 1M items, each assigned to one of 10K creators (100 items per creator)
/// - category: cycling through 20 categories (5% of items per category)
/// - title/description: templated text. Every title contains "tutorial guide",
///   so that query matches all 1M documents; descriptions cycle through 500
///   topic numbers.
/// - 10% of items have view signals, 5% have like signals
/// - Embeddings: planned as 128D random unit vectors for ANN, but not written
///   by this function yet. 128D keeps raw vector data near 0.5 GB, whereas
///   1536D would require ~5.7 GB of RAM for vectors alone.
fn build_scale_db() -> TidalDb {
    let db = TidalDb::builder()
        .ephemeral()
        .with_schema(scale_schema())
        .open()
        .unwrap();

    let categories = [
        "music", "programming", "cooking", "sports", "science",
        "art", "travel", "history", "math", "philosophy",
        "gaming", "fitness", "photography", "writing", "design",
        "finance", "health", "education", "nature", "technology",
    ];

    let ts = Timestamp::now();

    for i in 0..ITEM_COUNT {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Item {i} tutorial guide"));
        meta.insert(
            "description".to_string(),
            format!("A comprehensive guide about topic {} with examples", i % 500),
        );
        let cat = categories[(i % 20) as usize];
        meta.insert("category".to_string(), cat.to_string());
        meta.insert("creator_id".to_string(), (i % CREATOR_COUNT).to_string());

        db.write_item_with_metadata(EntityId::new(i), &meta).unwrap();

        // 10% of items get view signals (spread across the corpus)
        if i % 10 == 0 {
            db.signal("view", EntityId::new(i), 1.0, ts).unwrap();
        }
        // 5% get like signals
        if i % 20 == 0 {
            db.signal("like", EntityId::new(i), 1.0, ts).unwrap();
        }
    }

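    // Wait for the background text syncer to commit, then reload the index.
    // The fixed sleep is a placeholder: at 1M documents the commit may take
    // longer, so confirm that searches return results before trusting timings.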
    std::thread::sleep(Duration::from_secs(3));
    db.reload_text_index().unwrap();

    db
}

static SCALE_DB: LazyLock<TidalDb> = LazyLock::new(build_scale_db);
```
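
The doc comment on `build_scale_db` mentions 128D random unit vectors, but the setup code above does not generate them. The following is a minimal, std-only sketch of how such vectors could be produced deterministically, along with the arithmetic behind the memory estimates. The names `vector_bytes`, `SplitMix64`, and `unit_vector` are hypothetical helpers, and the call that would store an embedding in `TidalDb` is deliberately omitted because this document does not show that API.

```rust
/// Raw storage for `count` f32 vectors of `dims` dimensions (data only,
/// excluding any index overhead).
fn vector_bytes(count: u64, dims: u64) -> u64 {
    count * dims * std::mem::size_of::<f32>() as u64
}

/// SplitMix64: a small deterministic PRNG, used here only so the sketch
/// needs no external crates. Any seeded RNG would serve.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in (0, 1], so `ln` below never sees zero.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let u1 = self.next_unit();
        let u2 = self.next_unit();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Deterministic random unit vector seeded by `item_id`. Normalizing
/// Gaussian components gives a uniform direction on the sphere.
fn unit_vector(item_id: u64, dims: usize) -> Vec<f32> {
    let mut rng = SplitMix64(item_id);
    let raw: Vec<f64> = (0..dims).map(|_| rng.next_gaussian()).collect();
    let norm = raw.iter().map(|x| x * x).sum::<f64>().sqrt();
    raw.iter().map(|x| (x / norm) as f32).collect()
}
```

Seeding by item ID keeps the corpus reproducible across runs, which matters when later tasks compare against recorded baselines. `vector_bytes(1_000_000, 1536)` is 6,144,000,000 bytes (about 5.7 GiB) and `vector_bytes(1_000_000, 128)` is 512,000,000 bytes (about 0.5 GiB), matching the figures in the doc comment.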

### 3. RETRIEVE benchmarks

```rust
fn bench_retrieve_for_you_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("retrieve_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));
    group.sampling_mode(SamplingMode::Flat);

    // for_you: signal-ranked candidates + diversity enforcement
    let for_you = Retrieve::builder()
        .profile("for_you")
        .limit(20)
        .diversity(DiversityConstraints::new().max_per_creator(2))
        .build()
        .unwrap();

    group.bench_function("for_you", |b| {
        b.iter(|| db.retrieve(black_box(&for_you)).unwrap());
    });

    // trending: windowed count ranking, no diversity
    let trending = Retrieve::builder()
        .profile("trending")
        .limit(20)
        .build()
        .unwrap();

    group.bench_function("trending", |b| {
        b.iter(|| db.retrieve(black_box(&trending)).unwrap());
    });

    // new: creation-time sort, category filter (~5% selectivity)
    let new_filtered = Retrieve::builder()
        .profile("new")
        .limit(20)
        .filter(FilterExpr::CategoryEq("programming".into()))
        .build()
        .unwrap();

    group.bench_function("new_filtered", |b| {
        b.iter(|| db.retrieve(black_box(&new_filtered)).unwrap());
    });

    group.finish();
}
```

### 4. SEARCH benchmarks

```rust
fn bench_search_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("search_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));
    group.sampling_mode(SamplingMode::Flat);

    // Text-only search (BM25)
    let text_only = Search::builder()
        .query("tutorial guide")
        .limit(20)
        .build()
        .unwrap();

    group.bench_function("text_only", |b| {
        b.iter(|| db.search(black_box(&text_only)).unwrap());
    });

    // Text search with category filter
    let text_filtered = Search::builder()
        .query("tutorial guide")
        .limit(20)
        .filter(FilterExpr::CategoryEq("programming".into()))
        .build()
        .unwrap();

    group.bench_function("text_filtered", |b| {
        b.iter(|| db.search(black_box(&text_filtered)).unwrap());
    });

    group.finish();
}
```

### 5. Signal write throughput benchmark

```rust
fn bench_signal_write_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("signal_write_1m");

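    // Measure amortized write cost against a pre-populated 1M-item ledger.
    // Rotate the entity ID so writes spread across DashMap shards instead of
    // repeatedly hitting one hot entry. These writes mutate the shared
    // SCALE_DB, so this group must stay last in `criterion_group!`.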
    let ts = Timestamp::now();
    let mut entity_counter = 0u64;

    group.bench_function("view_write", |b| {
        b.iter(|| {
            let entity_id = EntityId::new(entity_counter % ITEM_COUNT);
            entity_counter += 1;
            db.signal(
                black_box("view"),
                black_box(entity_id),
                black_box(1.0),
                black_box(ts),
            )
            .unwrap();
        });
    });

    group.finish();
}

criterion_group!(
    scale_benches,
    bench_retrieve_for_you_1m,
    bench_search_1m,
    bench_signal_write_1m,
);
criterion_main!(scale_benches);
```

### 6. Measurement methodology

| Metric | Target | How measured |
|--------|--------|-------------|
| RETRIEVE for_you p99 | < 50ms | `criterion` flat sampling, 10 samples, 30s measurement |
| RETRIEVE trending p99 | < 50ms | Same |
| SEARCH text-only p99 | < 100ms | Same |
| SEARCH text+filter p99 | < 100ms | Same |
| Signal write amortized | < 100us | `criterion` default sampling, 1000+ iterations |

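Criterion does not report percentiles. Its `time: [low est, estimate, high est]` line is a confidence interval (95% by default) for the typical per-iteration time, and 10 samples are too few to resolve a p99. The `high est` bound therefore serves only as a coarse proxy for the p99 targets above. Criterion also does not fail a run when a threshold is exceeded: compare `high est` against the target when recording baselines, and treat any value above the target as a failed baseline.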
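
If a true tail-latency figure is needed for the p99 targets, it has to be computed from raw per-call latencies rather than from Criterion's summary. The sketch below shows one way to do that with the nearest-rank method, using only the standard library. `percentile` and `measure_latencies` are hypothetical helper names and are not part of Criterion or tidaldb.

```rust
use std::time::{Duration, Instant};

/// Nearest-rank percentile over latency samples, with `p` in (0, 100].
/// Sorts the slice in place and returns `None` when it is empty.
fn percentile(samples: &mut [Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    // Nearest rank: the smallest sample with at least p% of all samples
    // at or below it. Multiply before dividing to limit rounding error.
    let rank = (p * samples.len() as f64 / 100.0).ceil() as usize;
    Some(samples[rank.clamp(1, samples.len()) - 1])
}

/// Time `iterations` calls of `op` one by one and return the raw latencies.
fn measure_latencies<F: FnMut()>(iterations: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples
}
```

With fewer than 100 samples the nearest-rank p99 is simply the maximum, so a meaningful p99 needs at least a few hundred timed calls per query.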

### 7. Setup time management

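Building a 1M-item TidalDb is expensive. The `LazyLock` pattern ensures construction happens once per process. `#[ignore]` does not apply here, because a `harness = false` bench target contains no `#[test]` functions, and `cargo test --lib` never builds bench targets. The real risk is `cargo test --benches` or `cargo test --all-targets`, which run each Criterion benchmark once in test mode and would still build the full database. For CI, gate the target behind a Cargo feature (for example, `required-features` on the `[[bench]]` entry) so it runs only when requested explicitly.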
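
One further option for keeping smoke runs cheap is to let an environment variable shrink the corpus while the default stays at full scale. The sketch below assumes a variable named `SCALE_BENCH_ITEMS`; that name, along with `parse_item_count` and `EFFECTIVE_ITEM_COUNT`, is hypothetical and not something the existing benches define.

```rust
use std::sync::LazyLock;

const FULL_ITEM_COUNT: u64 = 1_000_000;

/// Parse an optional override. Missing, non-numeric, or zero values fall
/// back to the full 1M-item scale.
fn parse_item_count(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(FULL_ITEM_COUNT)
}

/// Resolved once per process, on first dereference.
/// `SCALE_BENCH_ITEMS` is a hypothetical variable name.
static EFFECTIVE_ITEM_COUNT: LazyLock<u64> = LazyLock::new(|| {
    parse_item_count(std::env::var("SCALE_BENCH_ITEMS").ok().as_deref())
});
```

The build loop would then iterate over `0..*EFFECTIVE_ITEM_COUNT`. Baselines should only ever be recorded from runs at the full default scale.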

## Acceptance Criteria

- [ ] `tidal/benches/scale.rs` registered in `Cargo.toml` as `[[bench]]` target
- [ ] `cargo bench --manifest-path tidal/Cargo.toml --bench scale` runs successfully
- [ ] RETRIEVE benchmarks at 1M items: for_you, trending, new_filtered all produce valid results
- [ ] SEARCH benchmarks at 1M items: text_only, text_filtered both return results (non-empty)
- [ ] Signal write benchmark at 1M items: amortized cost measured and recorded
- [ ] Baseline numbers documented in `docs/profiling/scale-baselines.md`
- [ ] All RETRIEVE and SEARCH benchmarks use `sample_size(10)` and `measurement_time(30s)`; the signal write benchmark uses Criterion's default sampling
- [ ] LazyLock or equivalent ensures 1M-item DB is built only once per bench run

## Test Strategy

This task is itself a test artifact -- the benchmarks are the deliverable. Validation:

1. **Smoke test:** Run `cargo bench --manifest-path tidal/Cargo.toml --bench scale -- --test` to verify benchmarks compile and can execute a single iteration without error.
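2. **Result validation:** Each query must return a non-empty result set (RETRIEVE: `items.len() > 0`, SEARCH: `items.len() > 0`). Check this with `assert!` once per benchmark, before the `b.iter()` call. `debug_assert!` would not fire here, because `cargo bench` builds with the optimized `bench` profile, where debug assertions are disabled by default. Asserting outside the closure also keeps the check out of the timed loop.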
3. **Baseline recording:** After the first successful run, record results in `docs/profiling/scale-baselines.md` with hardware specs, date, and exact Criterion output.