tidaldb/docs/planning/milestone-7/phase-3/task-01-scale-benchmark-suite.md
2026-02-23 22:41:16 -07:00

# Task 01: Scale Benchmark Suite
## Delivers
A Criterion benchmark suite operating at 1M items / 100K users / 10K creators that establishes performance baselines for RETRIEVE (for_you, trending, new with a category filter), SEARCH (text-only, text with a category filter), and signal write throughput. All subsequent m7p3 tasks use these baselines to measure the impact of their optimizations.
## Complexity
L
## Dependencies
- m7p1 complete (TidalDb API, entity writes, signal writes, RETRIEVE, SEARCH all operational)
- Existing bench files: `tidal/benches/query.rs`, `tidal/benches/search.rs`, `tidal/benches/signals.rs`
## Technical Design
### 1. New bench target: `tidal/benches/scale.rs`
Register in `Cargo.toml`:
```toml
[[bench]]
name = "scale"
harness = false
```
### 2. Shared setup harness
The 1M-item universe takes minutes to construct. Build it once per bench run in a `std::sync::LazyLock` static (stable since Rust 1.80; `std::sync::OnceLock` is an alternative on older toolchains) so that every bench function in the binary shares the same populated `TidalDb` instance.
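The build-once behaviour can be checked in isolation. The sketch below uses only the standard library; `build_expensive`, `BUILD_COUNT`, and `SHARED` are hypothetical stand-ins for `build_scale_db` and its static, and are not part of the TidalDb codebase. It shows several threads dereferencing the same `LazyLock` static while the initializer runs exactly once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;
use std::thread;

/// Counts how many times the initializer runs (illustration only).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the expensive `build_scale_db()` call.
fn build_expensive() -> Vec<u64> {
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    (0..1_000).collect()
}

/// Initialized on first dereference; later accesses reuse the same value.
static SHARED: LazyLock<Vec<u64>> = LazyLock::new(build_expensive);

fn main() {
    // Four threads race to touch the static; LazyLock blocks the losers
    // until the single initialization completes.
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| SHARED.len()))
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap(), 1_000);
    }
    assert_eq!(BUILD_COUNT.load(Ordering::SeqCst), 1);
    println!("initializer ran {} time(s)", BUILD_COUNT.load(Ordering::SeqCst));
}
```

The harness that follows applies the same pattern to the populated `TidalDb`.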
```rust
#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]

use std::collections::HashMap;
use std::sync::LazyLock;
use std::time::Duration;

use criterion::{
    Criterion, black_box, criterion_group, criterion_main,
    BenchmarkId, SamplingMode,
};
use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::query::search::Search;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldDef,
    TextFieldType, Timestamp, Window,
};
use tidaldb::storage::indexes::filter::FilterExpr;

const ITEM_COUNT: u64 = 1_000_000;
const USER_COUNT: u64 = 100_000;
const CREATOR_COUNT: u64 = 10_000;

fn scale_schema() -> tidaldb::schema::Schema {
    let mut builder = SchemaBuilder::new();
    for sig in &["view", "like", "share", "skip", "completion"] {
        let _ = builder
            .signal(
                sig,
                EntityKind::Item,
                DecaySpec::Exponential {
                    half_life: Duration::from_secs(7 * 24 * 3600),
                },
            )
            .windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
            .velocity(true)
            .add();
    }
    builder.text_field("title", TextFieldType::Text);
    builder.text_field("description", TextFieldType::Text);
    builder.text_field("category", TextFieldType::Keyword);
    builder.build().unwrap()
}
/// Build a TidalDb with 1M items, signal data, and text fields.
///
/// Item distribution:
/// - 1M items, each assigned to one of 10K creators (100 items per creator)
/// - category: cycling through 20 categories (5% of items each)
/// - title/description: templated text (every title contains "tutorial" and
///   "guide"; descriptions cycle through 500 topics), so BM25 IDF variety
///   is limited
/// - 10% of items have view signals, 5% have like signals
/// - Embeddings: intended to be 128D random unit vectors for ANN (not 1536D --
///   that would require ~5.7 GB of RAM for vectors alone; 128D is sufficient
///   for benchmark fidelity and uses ~0.5 GB). NOTE: this listing does not
///   write embeddings yet.
fn build_scale_db() -> TidalDb {
    let db = TidalDb::builder()
        .ephemeral()
        .with_schema(scale_schema())
        .open()
        .unwrap();

    let categories = [
        "music", "programming", "cooking", "sports", "science",
        "art", "travel", "history", "math", "philosophy",
        "gaming", "fitness", "photography", "writing", "design",
        "finance", "health", "education", "nature", "technology",
    ];
    let ts = Timestamp::now();

    for i in 0..ITEM_COUNT {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Item {i} tutorial guide"));
        meta.insert(
            "description".to_string(),
            format!("A comprehensive guide about topic {} with examples", i % 500),
        );
        let cat = categories[(i % 20) as usize];
        meta.insert("category".to_string(), cat.to_string());
        meta.insert("creator_id".to_string(), (i % CREATOR_COUNT).to_string());
        db.write_item_with_metadata(EntityId::new(i), &meta).unwrap();

        // 10% of items get view signals (spread across the corpus)
        if i % 10 == 0 {
            db.signal("view", EntityId::new(i), 1.0, ts).unwrap();
        }
        // 5% get like signals
        if i % 20 == 0 {
            db.signal("like", EntityId::new(i), 1.0, ts).unwrap();
        }
    }

    // Wait for text syncer to commit, then reload
    std::thread::sleep(Duration::from_secs(3));
    db.reload_text_index().unwrap();
    db
}

static SCALE_DB: LazyLock<TidalDb> = LazyLock::new(build_scale_db);
```
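The memory figures quoted in the doc comment follow from simple arithmetic. The sketch below assumes each embedding component is a 4-byte `f32` with no per-vector or index overhead (an assumption; the source does not state the element type), and the helper names are illustrative only.

```rust
/// Raw vector storage in bytes, assuming `f32` components and no index overhead.
fn embedding_bytes(items: u64, dims: u64) -> u64 {
    items * dims * std::mem::size_of::<f32>() as u64
}

/// Convert bytes to GiB (2^30 bytes) for readability.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}

fn main() {
    let items = 1_000_000;
    for dims in [1536, 128] {
        let bytes = embedding_bytes(items, dims);
        println!("{dims}D: {bytes} bytes = {:.2} GiB", to_gib(bytes));
    }
}
```

Under that assumption, 1536D comes to about 5.72 GiB and 128D to about 0.48 GiB, in line with the ~5.7 GB and ~0.5 GB quoted in the comment.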
### 3. RETRIEVE benchmarks
```rust
fn bench_retrieve_for_you_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("retrieve_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));
    group.sampling_mode(SamplingMode::Flat);

    // for_you: signal-ranked candidates + diversity enforcement
    let for_you = Retrieve::builder()
        .profile("for_you")
        .limit(20)
        .diversity(DiversityConstraints::new().max_per_creator(2))
        .build()
        .unwrap();
    group.bench_function("for_you", |b| {
        b.iter(|| db.retrieve(black_box(&for_you)).unwrap());
    });

    // trending: windowed count ranking, no diversity
    let trending = Retrieve::builder()
        .profile("trending")
        .limit(20)
        .build()
        .unwrap();
    group.bench_function("trending", |b| {
        b.iter(|| db.retrieve(black_box(&trending)).unwrap());
    });

    // new: creation-time sort, category filter (~5% selectivity)
    let new_filtered = Retrieve::builder()
        .profile("new")
        .limit(20)
        .filter(FilterExpr::CategoryEq("programming".into()))
        .build()
        .unwrap();
    group.bench_function("new_filtered", |b| {
        b.iter(|| db.retrieve(black_box(&new_filtered)).unwrap());
    });

    group.finish();
}
```
### 4. SEARCH benchmarks
```rust
fn bench_search_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("search_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));
    group.sampling_mode(SamplingMode::Flat);

    // Text-only search (BM25)
    let text_only = Search::builder()
        .query("tutorial guide")
        .limit(20)
        .build()
        .unwrap();
    group.bench_function("text_only", |b| {
        b.iter(|| db.search(black_box(&text_only)).unwrap());
    });

    // Text search with category filter
    let text_filtered = Search::builder()
        .query("tutorial guide")
        .limit(20)
        .filter(FilterExpr::CategoryEq("programming".into()))
        .build()
        .unwrap();
    group.bench_function("text_filtered", |b| {
        b.iter(|| db.search(black_box(&text_filtered)).unwrap());
    });

    group.finish();
}
```
### 5. Signal write throughput benchmark
```rust
fn bench_signal_write_1m(c: &mut Criterion) {
    let db = &*SCALE_DB;
    let mut group = c.benchmark_group("signal_write_1m");

    // Measure amortized write cost against a pre-populated 1M-item ledger.
    // Use a rotating entity ID to avoid DashMap contention on a single shard.
    let ts = Timestamp::now();
    let mut entity_counter = 0u64;
    group.bench_function("view_write", |b| {
        b.iter(|| {
            let entity_id = EntityId::new(entity_counter % ITEM_COUNT);
            entity_counter += 1;
            db.signal(
                black_box("view"),
                black_box(entity_id),
                black_box(1.0),
                black_box(ts),
            )
            .unwrap();
        });
    });

    group.finish();
}

criterion_group!(
    scale_benches,
    bench_retrieve_for_you_1m,
    bench_search_1m,
    bench_signal_write_1m,
);
criterion_main!(scale_benches);
```
### 6. Measurement methodology
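| Metric | Target | How measured |
|--------|--------|--------------|
| RETRIEVE for_you p99 | < 50 ms | `criterion` flat sampling, 10 samples, 30 s measurement |
| RETRIEVE trending p99 | < 50 ms | Same |
| SEARCH text-only p99 | < 100 ms | Same |
| SEARCH text+filter p99 | < 100 ms | Same |
| Signal write amortized | < 100 µs | `criterion` default sampling, 1000+ iterations |

Criterion does not report percentiles. Its `[lower bound, estimate, upper bound]` output is a confidence interval around the per-iteration time estimate, so the upper bound serves here only as a rough stand-in for p99, not as a measured tail latency. Criterion also does not fail a run against a threshold: compare the upper bound with the target by hand or with a script, and treat the benchmark as failing if the upper bound exceeds the target.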
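A measured p99 needs per-call timings rather than an interval around the typical time. The sketch below is one possible way to collect them with the standard library alone, using the nearest-rank percentile definition. `percentile` and `time_calls` are hypothetical helpers, not Criterion or TidalDb APIs, and the closure is a placeholder for a call such as `db.retrieve(&for_you)`.

```rust
use std::time::{Duration, Instant};

/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it. Sorts in place.
fn percentile(samples: &mut [Duration], pct: u32) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    assert!((1..=100).contains(&pct), "pct must be in 1..=100");
    samples.sort_unstable();
    // 1-based rank = ceil(pct * n / 100), computed in integers.
    let rank = (pct as usize * samples.len() + 99) / 100;
    samples[rank - 1]
}

/// Time `calls` individual invocations of `op`, one sample per call.
fn time_calls<F: FnMut()>(calls: usize, mut op: F) -> Vec<Duration> {
    (0..calls)
        .map(|_| {
            let start = Instant::now();
            op();
            start.elapsed()
        })
        .collect()
}

fn main() {
    // Placeholder workload; swap in the real query call when measuring.
    let mut samples = time_calls(1_000, || {
        std::hint::black_box((0..1_000u64).sum::<u64>());
    });
    let p99 = percentile(&mut samples, 99);
    let target = Duration::from_millis(50);
    println!("p99 = {p99:?}, within target: {}", p99 < target);
}
```

A loop like this would run alongside the Criterion groups rather than replace them: Criterion gives the stable baseline for regression tracking, while the per-call samples give the tail figure the targets are phrased in.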
### 7. Setup time management
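Building a 1M-item TidalDb is expensive. The `LazyLock` pattern ensures construction happens once per bench process. Because `scale` is a Criterion target with `harness = false`, `#[ignore]` does not apply to it: that attribute only affects `#[test]` functions. `cargo test --lib` never builds bench targets, but `cargo test --benches` and `cargo test --all-targets` run them in test mode, which would still trigger the 1M-item build. For CI, gate the target behind a Cargo feature using `required-features` so it runs only when requested explicitly.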
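One way to gate the target is Cargo's `required-features` key on the `[[bench]]` entry. The fragment below is a sketch; the feature name `scale-bench` is hypothetical and not part of the original plan.

```toml
[features]
# Hypothetical opt-in feature; the name is an assumption for illustration.
scale-bench = []

[[bench]]
name = "scale"
harness = false
# Cargo skips this target unless the feature is enabled.
required-features = ["scale-bench"]
```

With this in place the suite would be invoked as `cargo bench --manifest-path tidal/Cargo.toml --features scale-bench --bench scale`, and plain `cargo test --all-targets` would skip it.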
## Acceptance Criteria
- [ ] `tidal/benches/scale.rs` registered in `Cargo.toml` as `[[bench]]` target
- [ ] `cargo bench --manifest-path tidal/Cargo.toml --bench scale` runs successfully
- [ ] RETRIEVE benchmarks at 1M items: for_you, trending, new_filtered all produce valid results
- [ ] SEARCH benchmarks at 1M items: text_only, text_filtered both return results (non-empty)
- [ ] Signal write benchmark at 1M items: amortized cost measured and recorded
- [ ] Baseline numbers documented in `docs/profiling/scale-baselines.md`
- [ ] RETRIEVE and SEARCH groups use `sample_size(10)` and `measurement_time(Duration::from_secs(30))`; the signal write group keeps Criterion's default sampling
- [ ] LazyLock or equivalent ensures 1M-item DB is built only once per bench run
## Test Strategy
This task is itself a test artifact -- the benchmarks are the deliverable. Validation:
1. **Smoke test:** Run `cargo bench --manifest-path tidal/Cargo.toml --bench scale -- --test` to verify benchmarks compile and can execute a single iteration without error.
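2. **Result validation:** Each RETRIEVE and SEARCH query must return a non-empty result set (`items.len() > 0`). Check this with `assert!` once before the `b.iter()` call rather than with `debug_assert!` inside it: `debug_assert!` is compiled out under the optimized `bench` profile, and an assertion inside the timed closure would add to the measured time.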
3. **Baseline recording:** After the first successful run, record results in `docs/profiling/scale-baselines.md` with hardware specs, date, and exact Criterion output.