tidaldb/docs/planning/milestone-7/phase-3/task-01-scale-benchmark-suite.md
2026-02-23 22:41:16 -07:00

Task 01: Scale Benchmark Suite

Delivers

A Criterion benchmark suite operating at 1M items / 100K users / 10K creators that establishes performance baselines for RETRIEVE (for_you, trending, new_filtered), SEARCH (text-only, text+filter), and signal write throughput. All subsequent m7p3 tasks measure the impact of their optimizations against these baselines.

Complexity

L

Dependencies

  • m7p1 complete (TidalDb API, entity writes, signal writes, RETRIEVE, SEARCH all operational)
  • Existing bench files: tidal/benches/query.rs, tidal/benches/search.rs, tidal/benches/signals.rs

Technical Design

1. New bench target: tidal/benches/scale.rs

Register in Cargo.toml:

[[bench]]
name = "scale"
harness = false

2. Shared setup harness

The 1M-item universe takes minutes to construct. Build it once per bench run in a static LazyLock (std::sync::OnceLock also works) so that every bench function, in every benchmark group, shares the same populated TidalDb instance.

#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]

use std::collections::HashMap;
use std::sync::LazyLock;
use std::time::Duration;

use criterion::{
    Criterion, black_box, criterion_group, criterion_main,
    BenchmarkId, SamplingMode,
};
use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::query::search::Search;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldDef,
    TextFieldType, Timestamp, Window,
};
use tidaldb::storage::indexes::filter::FilterExpr;

const ITEM_COUNT: u64 = 1_000_000;
const USER_COUNT: u64 = 100_000;
const CREATOR_COUNT: u64 = 10_000;

fn scale_schema() -> tidaldb::schema::Schema {
    let mut builder = SchemaBuilder::new();
    for sig in &["view", "like", "share", "skip", "completion"] {
        let _ = builder
            .signal(
                sig,
                EntityKind::Item,
                DecaySpec::Exponential {
                    half_life: Duration::from_secs(7 * 24 * 3600),
                },
            )
            .windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
            .velocity(true)
            .add();
    }
    builder.text_field("title", TextFieldType::Text);
    builder.text_field("description", TextFieldType::Text);
    builder.text_field("category", TextFieldType::Keyword);
    builder.build().unwrap()
}

/// Build a TidalDb with 1M items, signal data, and text fields.
///
/// Item distribution:
/// - 1M items, each assigned to one of 10K creators (100 items per creator)
/// - category: cycling through 20 categories
/// - title/description: varied vocabulary for realistic BM25 IDF
/// - 10% of items have view signals, 5% have like signals
/// - Embeddings (not yet written by the loop below): 128D random unit
///   vectors for ANN. Assuming f32 components, 1536D vectors would need
///   ~6.1 GB (5.7 GiB) for 1M items; 128D needs ~0.5 GB and is sufficient
///   for benchmark fidelity.
fn build_scale_db() -> TidalDb {
    let db = TidalDb::builder()
        .ephemeral()
        .with_schema(scale_schema())
        .open()
        .unwrap();

    let categories = [
        "music", "programming", "cooking", "sports", "science",
        "art", "travel", "history", "math", "philosophy",
        "gaming", "fitness", "photography", "writing", "design",
        "finance", "health", "education", "nature", "technology",
    ];

    let ts = Timestamp::now();

    for i in 0..ITEM_COUNT {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Item {i} tutorial guide"));
        meta.insert(
            "description".to_string(),
            format!("A comprehensive guide about topic {} with examples", i % 500),
        );
        let cat = categories[(i % 20) as usize];
        meta.insert("category".to_string(), cat.to_string());
        meta.insert("creator_id".to_string(), (i % CREATOR_COUNT).to_string());

        db.write_item_with_metadata(EntityId::new(i), &meta).unwrap();

        // 10% of items get view signals (spread across the corpus)
        if i % 10 == 0 {
            db.signal("view", EntityId::new(i), 1.0, ts).unwrap();
        }
        // 5% get like signals
        if i % 20 == 0 {
            db.signal("like", EntityId::new(i), 1.0, ts).unwrap();
        }
    }

    // Wait for text syncer to commit, then reload
    std::thread::sleep(Duration::from_secs(3));
    db.reload_text_index().unwrap();

    db
}

static SCALE_DB: LazyLock<TidalDb> = LazyLock::new(build_scale_db);
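
The build-once behaviour that SCALE_DB relies on can be seen in isolation with a small standard-library sketch. Everything here is a stand-in: build_fixture, FIXTURE, bench_a, and bench_b are hypothetical names, and a Vec<u64> takes the place of the populated TidalDb.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the expensive builder actually runs.
static BUILD_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `build_scale_db`: costly to build, cheap to share.
fn build_fixture() -> Vec<u64> {
    BUILD_CALLS.fetch_add(1, Ordering::SeqCst);
    (0..1_000u64).collect()
}

/// Initialized on first dereference; later dereferences reuse the same value.
static FIXTURE: LazyLock<Vec<u64>> = LazyLock::new(build_fixture);

/// Stand-ins for two bench functions that each borrow the shared fixture.
fn bench_a() -> usize {
    FIXTURE.len()
}

fn bench_b() -> u64 {
    FIXTURE.iter().sum()
}

/// Number of times the builder has run so far.
fn build_calls() -> usize {
    BUILD_CALLS.load(Ordering::SeqCst)
}
```

Calling bench_a and then bench_b leaves build_calls() at 1: the first access pays the construction cost and the second reuses the result. Note that the first bench function to touch the static absorbs the build time, which happens outside Criterion's timed closure only if the static is dereferenced before b.iter(), as the `let db = &*SCALE_DB;` lines in this task do.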

3. RETRIEVE benchmarks

fn bench_retrieve_for_you_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("retrieve_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));
    group.sampling_mode(SamplingMode::Flat);

    // for_you: signal-ranked candidates + diversity enforcement
    let for_you = Retrieve::builder()
        .profile("for_you")
        .limit(20)
        .diversity(DiversityConstraints::new().max_per_creator(2))
        .build()
        .unwrap();

    group.bench_function("for_you", |b| {
        b.iter(|| db.retrieve(black_box(&for_you)).unwrap());
    });

    // trending: windowed count ranking, no diversity
    let trending = Retrieve::builder()
        .profile("trending")
        .limit(20)
        .build()
        .unwrap();

    group.bench_function("trending", |b| {
        b.iter(|| db.retrieve(black_box(&trending)).unwrap());
    });

    // new: creation-time sort, category filter (~5% selectivity)
    let new_filtered = Retrieve::builder()
        .profile("new")
        .limit(20)
        .filter(FilterExpr::CategoryEq("programming".into()))
        .build()
        .unwrap();

    group.bench_function("new_filtered", |b| {
        b.iter(|| db.retrieve(black_box(&new_filtered)).unwrap());
    });

    group.finish();
}
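
The ~5% selectivity quoted for new_filtered can be checked from the modulo scheme in build_scale_db without building the database. The sketch below is illustrative only: CorpusStats and corpus_stats are hypothetical names, and it assumes the assignment rules stay exactly as written in section 2 (category i % 20, creator i % 10,000, views on i % 10 == 0, likes on i % 20 == 0).

```rust
const ITEM_COUNT: u64 = 1_000_000;
const CREATOR_COUNT: u64 = 10_000;
const CATEGORY_COUNT: u64 = 20;
/// Index of "programming" in the `categories` array of `build_scale_db`.
const PROGRAMMING: u64 = 1;

/// Counts derived purely from the modulo assignment rules.
#[derive(Debug, Default, PartialEq)]
struct CorpusStats {
    programming_items: u64,
    items_for_creator_0: u64,
    viewed: u64,
    liked: u64,
    viewed_programming_items: u64,
}

/// Replays the assignment rules over every item index and tallies the results.
fn corpus_stats() -> CorpusStats {
    let mut s = CorpusStats::default();
    for i in 0..ITEM_COUNT {
        let is_programming = i % CATEGORY_COUNT == PROGRAMMING;
        let is_viewed = i % 10 == 0;
        if is_programming {
            s.programming_items += 1;
        }
        if i % CREATOR_COUNT == 0 {
            s.items_for_creator_0 += 1;
        }
        if is_viewed {
            s.viewed += 1;
        }
        if i % 20 == 0 {
            s.liked += 1;
        }
        if is_programming && is_viewed {
            s.viewed_programming_items += 1;
        }
    }
    s
}
```

Under these rules the "programming" slice holds 50,000 of 1,000,000 items, which is the stated 5%. The same tally shows a side effect worth knowing about: every viewed item has an index that is a multiple of 10, so views land only in categories 0 and 10 (music and gaming), and the "programming" slice carries no signals at all. If the filtered benchmarks are meant to touch signal-bearing items, a signal stride that is coprime with the category count would spread signals across all 20 categories.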

4. SEARCH benchmarks

fn bench_search_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("search_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));
    group.sampling_mode(SamplingMode::Flat);

    // Text-only search (BM25)
    let text_only = Search::builder()
        .query("tutorial guide")
        .limit(20)
        .build()
        .unwrap();

    group.bench_function("text_only", |b| {
        b.iter(|| db.search(black_box(&text_only)).unwrap());
    });

    // Text search with category filter
    let text_filtered = Search::builder()
        .query("tutorial guide")
        .limit(20)
        .filter(FilterExpr::CategoryEq("programming".into()))
        .build()
        .unwrap();

    group.bench_function("text_filtered", |b| {
        b.iter(|| db.search(black_box(&text_filtered)).unwrap());
    });

    group.finish();
}

5. Signal write throughput benchmark

fn bench_signal_write_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("signal_write_1m");

    // Measure amortized write cost against a pre-populated 1M-item ledger.
    // Use a rotating entity ID to avoid DashMap contention on a single shard.
    let ts = Timestamp::now();
    let mut entity_counter = 0u64;

    group.bench_function("view_write", |b| {
        b.iter(|| {
            let entity_id = EntityId::new(entity_counter % ITEM_COUNT);
            entity_counter += 1;
            db.signal(
                black_box("view"),
                black_box(entity_id),
                black_box(1.0),
                black_box(ts),
            )
            .unwrap();
        });
    });

    group.finish();
}

criterion_group!(
    scale_benches,
    bench_retrieve_for_you_1m,
    bench_search_1m,
    bench_signal_write_1m,
);
criterion_main!(scale_benches);

6. Measurement methodology

| Metric | Target | How measured |
|---|---|---|
| RETRIEVE for_you | p99 < 50ms | criterion flat sampling, 10 samples, 30s measurement |
| RETRIEVE trending | p99 < 50ms | Same |
| SEARCH text-only | p99 < 100ms | Same |
| SEARCH text+filter | p99 < 100ms | Same |
| Signal write | amortized < 100us | criterion default sampling, 1000+ iterations |

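Criterion does not report percentiles. Its time: [low est high] output is a confidence interval around the estimated typical per-iteration time, so the upper bound serves here only as a rough proxy for p99 and will usually understate the true tail. Criterion also does not enforce targets on its own: a benchmark counts as failing when its upper bound exceeds the target, and that comparison has to be made by hand or by a script over the saved output.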
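
If a directly measured p99 is wanted instead of an approximation, per-call latencies can be collected outside Criterion and ranked. The following is a minimal sketch using only the standard library; collect_latencies and percentile are hypothetical helper names, not part of tidaldb or Criterion, and the nearest-rank definition is one of several common percentile conventions.

```rust
use std::time::{Duration, Instant};

/// Times `iters` calls of `op` individually and returns one latency per call.
fn collect_latencies<F: FnMut()>(iters: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples
}

/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
/// Returns `None` for an empty slice; `pct` above 100 is treated as 100.
fn percentile(samples: &[Duration], pct: u32) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let n = sorted.len();
    // rank = ceil(pct * n / 100), computed in integers to avoid float rounding.
    let rank = (pct.min(100) as usize * n + 99) / 100;
    let rank = rank.clamp(1, n);
    Some(sorted[rank - 1])
}
```

With only 10 samples, the nearest-rank p99 is simply the slowest sample, so a meaningful tail estimate needs a few hundred calls or more per query shape.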

7. Setup time management

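Building a 1M-item TidalDb is expensive. The LazyLock pattern ensures construction happens once per bench process. Because the target sets harness = false, its functions are not #[test] items and #[ignore] has no effect on them. For CI, gate the bench target behind a Cargo feature (required-features on the [[bench]] entry) so that routine jobs, including cargo test --all-targets, skip it.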
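
One way to implement the feature-flag gating is Cargo's required-features key on the bench target. This is a sketch of the Cargo.toml fragment; the feature name scale-bench is a placeholder chosen for illustration, not an existing tidaldb feature.

```toml
[features]
# Hypothetical feature name; it must not be part of the default feature set.
scale-bench = []

[[bench]]
name = "scale"
harness = false
# Cargo skips this target unless the feature is enabled.
required-features = ["scale-bench"]
```

With this in place the suite would be run explicitly with cargo bench --manifest-path tidal/Cargo.toml --bench scale --features scale-bench.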

Acceptance Criteria

  • tidal/benches/scale.rs registered in Cargo.toml as [[bench]] target
  • cargo bench --manifest-path tidal/Cargo.toml --bench scale runs successfully
  • RETRIEVE benchmarks at 1M items: for_you, trending, new_filtered all produce valid results
  • SEARCH benchmarks at 1M items: text_only, text_filtered both return results (non-empty)
  • Signal write benchmark at 1M items: amortized cost measured and recorded
  • Baseline numbers documented in docs/profiling/scale-baselines.md
  • RETRIEVE and SEARCH benchmarks use sample_size(10), measurement_time(30s), and flat sampling; the signal write benchmark uses Criterion's defaults
  • LazyLock or equivalent ensures 1M-item DB is built only once per bench run

Test Strategy

This task is itself a test artifact -- the benchmarks are the deliverable. Validation:

  1. Smoke test: Run cargo bench --manifest-path tidal/Cargo.toml --bench scale -- --test to verify benchmarks compile and can execute a single iteration without error.
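  2. Result validation: Each benchmark must return a non-empty result set (RETRIEVE: items.len() > 0, SEARCH: items.len() > 0). Check this with assert!, not debug_assert!: cargo bench builds with the bench profile, where debug assertions are disabled by default, so debug_assert! compiles to nothing. Run the check once per benchmark before the b.iter() closure so it does not add to the measured time.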
  3. Baseline recording: After the first successful run, record results in docs/profiling/scale-baselines.md with hardware specs, date, and exact Criterion output.
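
The result-validation step can be kept out of the timed region by checking one call up front. The sketch below shows the shape of that pattern with a plain closure and std::time::Instant rather than Criterion; validate_then_time is a hypothetical helper, and a Vec stands in for whatever result type RETRIEVE and SEARCH return.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `query` once and checks its result before any timing starts, then
/// times `iters` further calls. `assert!` is used because `debug_assert!`
/// is compiled out when debug assertions are disabled.
fn validate_then_time<T, F>(iters: u32, mut query: F) -> Duration
where
    F: FnMut() -> Vec<T>,
{
    // Untimed validation call: fail fast on an empty result set.
    let first = query();
    assert!(
        !first.is_empty(),
        "benchmark query returned an empty result set"
    );

    // Timed region: no assertions, only the work being measured.
    let start = Instant::now();
    for _ in 0..iters {
        black_box(query());
    }
    start.elapsed()
}
```

In a Criterion bench function the same idea amounts to calling the query once and asserting on it just before group.bench_function, leaving the b.iter() closure free of checks.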