tidaldb/docs/planning/milestone-7/phase-3/task-02-usearch-parameter-tuning.md
2026-02-23 22:41:16 -07:00

# Task 02: USearch Parameter Tuning
## Delivers
A systematic benchmark of USearch HNSW parameters (M x ef_construction) at 1M vectors, documenting the recall/latency tradeoff. The optimal configuration is applied to `VectorIndexConfig::default()`. ANN recall@10 must exceed 0.95.
## Complexity
M
## Dependencies
- task-01 complete (scale benchmark infrastructure)
- `docs/research/ann_for_tidaldb.md` (parameter guidance)
## Technical Design
### 1. Parameter matrix
The research doc identifies M and ef_construction as the two critical HNSW parameters for the recall/latency tradeoff. Whether at 1536D (production) or 128D (benchmark), the effect of these parameters on recall must be measured, not assumed.
| Parameter | Values | Rationale |
|-----------|--------|-----------|
| M (connectivity) | 8, 16, 32 | M=16 is the USearch default; M=8 saves ~50% graph memory; M=32 improves recall under selective filters at 2x memory |
| ef_construction | 100, 200, 400 | Controls index build quality; diminishing returns past 200 in most benchmarks |
| ef_search | 200 (fixed) | Query-time expansion factor; held constant to isolate build-quality effects |
This produces a 3x3 = 9 configuration matrix.
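
The matrix above can be enumerated programmatically so that no combination is skipped. The following is a minimal sketch; `GridPoint` and `parameter_grid` are hypothetical names introduced for illustration and are not part of the tidaldb API.

```rust
/// One (M, ef_construction) point in the tuning grid (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GridPoint {
    m: usize,
    ef_construction: usize,
}

/// Swept values, copied from the parameter table.
const M_VALUES: [usize; 3] = [8, 16, 32];
const EF_CONSTRUCTION_VALUES: [usize; 3] = [100, 200, 400];

/// Cartesian product of the two swept parameters, in M-major order.
fn parameter_grid() -> Vec<GridPoint> {
    let mut grid = Vec::with_capacity(M_VALUES.len() * EF_CONSTRUCTION_VALUES.len());
    for &m in &M_VALUES {
        for &ef_construction in &EF_CONSTRUCTION_VALUES {
            grid.push(GridPoint { m, ef_construction });
        }
    }
    grid
}
```

Each `GridPoint` would then be passed to the evaluation routine, with `ef_search` held at 200 for every point.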
### 2. Benchmark implementation
Add to `tidal/benches/vector.rs` or a new `tidal/benches/usearch_tuning.rs`:
```rust
#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]
use criterion::{Criterion, black_box, criterion_group, criterion_main, BenchmarkId};
use rand::Rng;
use std::time::Duration;
use tidaldb::storage::vector::{
    AdaptiveQueryPlanner, BruteForceIndex, DistanceMetric,
    QuantizationLevel, VectorId, VectorIndex, VectorIndexConfig,
};

const DIM: usize = 128;
const N: u64 = 1_000_000;
const K: usize = 10;
const NUM_QUERIES: usize = 100;

/// Sample components uniformly from [-0.5, 0.5) and L2-normalize the result.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    let v: Vec<f32> = (0..dim).map(|_| rng.random::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Guard against a (vanishingly unlikely) near-zero vector.
    if norm < f32::EPSILON {
        let mut fallback = vec![0.0f32; dim];
        fallback[0] = 1.0;
        return fallback;
    }
    v.iter().map(|x| x / norm).collect()
}

struct TuningResult {
    m: usize,
    ef_construction: usize,
    recall_at_10: f64,
    mean_latency_us: f64,
    build_time_secs: f64,
    memory_bytes: usize,
}

/// Build an index with specific parameters, compute ground truth recall,
/// and measure search latency.
fn evaluate_config(m: usize, ef_construction: usize) -> TuningResult {
    let config = VectorIndexConfig {
        dimensions: DIM,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F32,
        connectivity: m,
        ef_construction,
        ef_search: 200,
    };
    let mut rng = rand::rng();

    // Build ground truth with brute force.
    // PLACEHOLDER: `BruteForceIndex` also stands in for the index under test.
    // Until the real USearch-backed index is built and searched here as well,
    // recall is trivially 1.0 and the timings do not reflect HNSW behavior.
    let brute = BruteForceIndex::new(config.clone());
    let build_start = std::time::Instant::now();
    for id in 0..N {
        let vec = random_unit_vector(DIM, &mut rng);
        brute.insert(id, &vec).unwrap();
    }
    let build_time = build_start.elapsed();

    // Generate query vectors
    let queries: Vec<Vec<f32>> = (0..NUM_QUERIES)
        .map(|_| random_unit_vector(DIM, &mut rng))
        .collect();

    // Compute ground truth (brute force top-K for each query)
    let ground_truths: Vec<Vec<VectorId>> = queries
        .iter()
        .map(|q| {
            brute
                .search(q, K, K * 2)
                .unwrap()
                .iter()
                .map(|r| r.id)
                .collect()
        })
        .collect();

    // Search and measure recall + latency
    let planner = AdaptiveQueryPlanner::with_defaults();
    let mut total_recall = 0.0;
    let mut total_latency = Duration::ZERO;
    for (query, gt) in queries.iter().zip(ground_truths.iter()) {
        // Time only the search call, not the recall bookkeeping.
        let start = std::time::Instant::now();
        let results = planner
            .execute(&brute, query, K, None, 1.0, None)
            .unwrap();
        total_latency += start.elapsed();

        let result_ids: Vec<VectorId> = results.iter().map(|r| r.id).collect();
        let hits = result_ids.iter().filter(|id| gt.contains(id)).count();
        total_recall += hits as f64 / gt.len() as f64;
    }

    TuningResult {
        m,
        ef_construction,
        recall_at_10: total_recall / NUM_QUERIES as f64,
        mean_latency_us: total_latency.as_micros() as f64 / NUM_QUERIES as f64,
        build_time_secs: build_time.as_secs_f64(),
        memory_bytes: 0, // Not measured yet; fill in via an index-specific API if available
    }
}
```
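
The task does not spell out how the "optimal" configuration is chosen from the nine results. One plausible rule, shown here purely as an assumption, is to take the lowest mean latency among configurations whose recall clears the threshold. `Candidate` is a hypothetical trimmed-down stand-in for `TuningResult`, and `select_optimal` is likewise illustrative.

```rust
/// Hypothetical subset of `TuningResult`, just enough to rank configurations.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    m: usize,
    ef_construction: usize,
    recall_at_10: f64,
    mean_latency_us: f64,
}

/// ASSUMED selection rule: among configurations with recall strictly above
/// `min_recall`, return the one with the lowest mean latency.
/// Returns `None` when no configuration clears the threshold.
fn select_optimal(results: &[Candidate], min_recall: f64) -> Option<&Candidate> {
    results
        .iter()
        .filter(|c| c.recall_at_10 > min_recall)
        .min_by(|a, b| a.mean_latency_us.total_cmp(&b.mean_latency_us))
}
```

A `None` result corresponds to the case where every configuration falls short of the recall target, which the acceptance criteria treat as a finding to document rather than a failure to hide. Other rules (for example, weighting memory or build time) may fit better once real measurements exist.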
### 3. Criterion benchmark for the optimal config
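After the evaluation has narrowed down the (M, ef_construction) candidates, add a Criterion benchmark that measures search latency for them. The sketch below lists seven of the nine combinations, omitting (8, 400) and (32, 100); trim the list to the chosen configuration once it is known: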
```rust
fn bench_usearch_optimal_1m(c: &mut Criterion) {
    let mut group = c.benchmark_group("usearch_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));

    // Shortlisted (M, ef_construction) candidates; reduce to the chosen
    // configuration once the evaluation has picked one.
    let configs = [
        (8, 100),
        (8, 200),
        (16, 100),
        (16, 200),
        (16, 400),
        (32, 200),
        (32, 400),
    ];
    let mut rng = rand::rng();
    let query = random_unit_vector(DIM, &mut rng);

    for &(m, ef_c) in &configs {
        let config = VectorIndexConfig {
            dimensions: DIM,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F32,
            connectivity: m,
            ef_construction: ef_c,
            ef_search: 200,
        };
        let index = BruteForceIndex::new(config);
        // Pre-populate. PLACEHOLDER: the real run uses the HNSW-backed index and
        // all `N` vectors; 10K keeps this sketch quick but is not a 1M measurement.
        for id in 0..10_000u64 {
            let vec = random_unit_vector(DIM, &mut rng);
            index.insert(id, &vec).unwrap();
        }
        group.bench_with_input(
            BenchmarkId::new("search", format!("M{m}_ef{ef_c}")),
            &(m, ef_c),
            |b, _| {
                b.iter(|| {
                    index.search(black_box(&query), black_box(K), black_box(200)).unwrap()
                });
            },
        );
    }
    group.finish();
}

// Register the benchmark (skip if the target bench file already declares its group).
criterion_group!(benches, bench_usearch_optimal_1m);
criterion_main!(benches);
```
### 4. Apply optimal config
Once the optimal (M, ef_construction) is determined, update `VectorIndexConfig` defaults:
```rust
// In tidal/src/storage/vector/mod.rs or config.rs
impl Default for VectorIndexConfig {
    fn default() -> Self {
        Self {
            dimensions: 128,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F16, // research doc recommends f16 default
            connectivity: OPTIMAL_M,              // determined by benchmark
            ef_construction: OPTIMAL_EF_C,        // determined by benchmark
            ef_search: 200,
        }
    }
}
```
### 5. Recall measurement methodology
Recall@K is computed as the fraction of brute-force top-K results that appear in the HNSW search results:
```
recall@K = |HNSW_top_K intersect BruteForce_top_K| / K
```
Recall is averaged over 100 random queries. The acceptance threshold is a mean recall@10 > 0.95.
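
The formula above can be sketched as a small standalone function. This version assumes vector IDs are `u64` (the benchmark inserts IDs from a `u64` range, but the concrete definition of `VectorId` is not shown here); `recall_at_k` and `mean_recall` are hypothetical helper names.

```rust
use std::collections::HashSet;

/// recall@K = |approx ∩ exact| / K, where `exact` is the brute-force top-K
/// and `approx` is the HNSW result list. Returns 0.0 for an empty ground truth.
fn recall_at_k(approx: &[u64], exact: &[u64]) -> f64 {
    if exact.is_empty() {
        return 0.0;
    }
    let truth: HashSet<u64> = exact.iter().copied().collect();
    let found: HashSet<u64> = approx.iter().copied().collect();
    found.intersection(&truth).count() as f64 / exact.len() as f64
}

/// Mean recall over (approx, exact) result pairs, one pair per query.
fn mean_recall(per_query: &[(Vec<u64>, Vec<u64>)]) -> f64 {
    if per_query.is_empty() {
        return 0.0;
    }
    let total: f64 = per_query
        .iter()
        .map(|(approx, exact)| recall_at_k(approx, exact))
        .sum();
    total / per_query.len() as f64
}
```

Using sets rather than a nested `contains` scan also guards against counting a duplicated ID twice in the approximate results.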
### 6. Memory estimation
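Per the research doc, HNSW graph overhead is ~300 bytes per node. The table applies that figure at M=16 and scales it linearly with M. Vector data for 1M vectors at 128D float32 is 1,000,000 × 128 × 4 bytes = 512 MB (488 MiB):
| M | Vector data | Graph overhead | Total |
|---|-------------|----------------|-------|
| 8 | 512 MB | ~150 MB | ~662 MB |
| 16 | 512 MB | ~300 MB | ~812 MB |
| 32 | 512 MB | ~600 MB | ~1.1 GB |
At 1536D (production), vector data grows 12x (1536 / 128) to about 6.1 GB at float32. The graph overhead stays the same because it depends on M and the node count, not on dimensionality.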
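
The arithmetic behind these estimates can be reproduced with a few helper functions. This is a rough sketch under two assumptions: the ~300 bytes per node figure applies at M = 16, and graph overhead scales linearly with M. All function and constant names are hypothetical.

```rust
/// Bytes per float32 component.
const BYTES_PER_F32: u64 = 4;
/// ~300 bytes of graph overhead per node, taken here as the M = 16 figure.
const GRAPH_BYTES_PER_NODE_AT_M16: u64 = 300;

/// Raw vector storage: n vectors x dim components x 4 bytes.
fn vector_bytes(n: u64, dim: u64) -> u64 {
    n * dim * BYTES_PER_F32
}

/// Graph overhead, ASSUMING linear scaling with M from the M = 16 figure.
fn graph_bytes(n: u64, m: u64) -> u64 {
    n * GRAPH_BYTES_PER_NODE_AT_M16 * m / 16
}

/// Estimated total index footprint in bytes.
fn total_bytes(n: u64, dim: u64, m: u64) -> u64 {
    vector_bytes(n, dim) + graph_bytes(n, m)
}
```

For 1M vectors at 128D, `vector_bytes` gives 512,000,000 bytes (about 488 MiB), and `graph_bytes` gives 150, 300, and 600 million bytes for M = 8, 16, and 32. These are planning estimates only; the benchmark should record the measured footprint where the index exposes it.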
## Acceptance Criteria
- [ ] All 9 (M, ef_construction) configurations benchmarked at 1M vectors (or a documented subset if CI time is limiting)
- [ ] Recall@10 > 0.95 for the selected optimal configuration
- [ ] Search latency over the 100 queries recorded per configuration: mean and p99
- [ ] Build time per configuration recorded
- [ ] Optimal (M, ef_construction) applied to `VectorIndexConfig` default
- [ ] Results documented in `docs/profiling/usearch-tuning.md` with a recall/latency tradeoff table
- [ ] If no configuration reaches recall@10 > 0.95, document the finding and propose mitigation (increase ef_search, ACORN-1, etc.)
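
The latency criterion asks for both mean and p99, which requires keeping per-query samples rather than only a running total. A minimal sketch follows, assuming the nearest-rank percentile definition (other definitions interpolate between samples); `mean` and `percentile` are hypothetical helpers.

```rust
use std::time::Duration;

/// Arithmetic mean of the samples, or `None` if there are none.
fn mean(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let total: Duration = samples.iter().sum();
    Some(total / samples.len() as u32)
}

/// Nearest-rank percentile for integer `p` in 0..=100: the smallest sample
/// such that at least p% of all samples are less than or equal to it.
fn percentile(samples: &[Duration], p: usize) -> Option<Duration> {
    if samples.is_empty() || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Ceiling division gives the 1-based rank; clamp so p = 0 maps to the minimum.
    let rank = ((p * sorted.len() + 99) / 100).max(1);
    Some(sorted[rank - 1])
}
```

With only 100 queries, the nearest-rank p99 is the second-largest sample, so a single outlier can move it noticeably. Recording more queries would make the tail estimate more stable.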
## Test Strategy
1. **Recall validation:** For the chosen config, run 100 queries and verify recall@10 > 0.95 against brute-force ground truth. This is a correctness test, not just a benchmark.
2. **Regression guard:** After applying the optimal config, re-run the existing `tidal/benches/vector.rs` benchmarks to ensure no regression at 10K scale.
3. **Config round-trip:** Verify that the new default config serializes and deserializes correctly if `VectorIndexConfig` is persisted.