# Task 02: USearch Parameter Tuning

## Delivers

A systematic benchmark of USearch HNSW parameters (M x ef_construction) at 1M vectors, documenting the recall/latency tradeoff. The optimal configuration is applied to `VectorIndexConfig::default()`. ANN recall@10 must exceed 0.95.

## Complexity

M

## Dependencies

- task-01 complete (scale benchmark infrastructure)
- `docs/research/ann_for_tidaldb.md` (parameter guidance)

## Technical Design

### 1. Parameter matrix

The research doc identifies M and ef_construction as the two critical HNSW parameters for the recall/latency tradeoff. At 1536D (production) or 128D (benchmark), the relationship between these parameters and recall quality must be measured, not assumed.

| Parameter | Values | Rationale |
|-----------|--------|-----------|
| M (connectivity) | 8, 16, 32 | M=16 is the USearch default; M=8 saves ~50% graph memory; M=32 improves recall under selective filters at 2x graph memory |
| ef_construction | 100, 200, 400 | Controls index build quality; diminishing returns past 200 in most benchmarks |
| ef_search | 200 (fixed) | Query-time expansion factor; held constant to isolate build-quality effects |

This produces a 3 x 3 matrix of 9 configurations.
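
As a minimal sketch, the matrix can be enumerated with two nested loops so that every pair is evaluated exactly once. The names `M_VALUES`, `EF_CONSTRUCTION_VALUES`, and `parameter_grid` are illustrative and are not part of the tidaldb codebase.

```rust
/// Candidate values taken from the parameter table.
const M_VALUES: [usize; 3] = [8, 16, 32];
const EF_CONSTRUCTION_VALUES: [usize; 3] = [100, 200, 400];

/// Enumerate every (M, ef_construction) pair, grouped by M.
fn parameter_grid() -> Vec<(usize, usize)> {
    let mut grid = Vec::with_capacity(M_VALUES.len() * EF_CONSTRUCTION_VALUES.len());
    for &m in M_VALUES.iter() {
        for &ef_c in EF_CONSTRUCTION_VALUES.iter() {
            grid.push((m, ef_c));
        }
    }
    grid
}
```

Each pair can then be handed to an evaluation routine such as `evaluate_config(m, ef_construction)` from the benchmark implementation.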

### 2. Benchmark implementation

Add to `tidal/benches/vector.rs` or a new `tidal/benches/usearch_tuning.rs`:

```rust
#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]

use criterion::{Criterion, black_box, criterion_group, criterion_main, BenchmarkId};
use rand::Rng;
use std::time::Duration;
use tidaldb::storage::vector::{
    AdaptiveQueryPlanner, BruteForceIndex, DistanceMetric,
    QuantizationLevel, VectorId, VectorIndex, VectorIndexConfig,
};

const DIM: usize = 128;
const N: u64 = 1_000_000;
const K: usize = 10;
const NUM_QUERIES: usize = 100;

/// Draw a random direction and normalize it to unit length.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    let v: Vec<f32> = (0..dim).map(|_| rng.random::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm < f32::EPSILON {
        // Degenerate draw (near-zero vector): fall back to a fixed unit vector.
        let mut fallback = vec![0.0f32; dim];
        fallback[0] = 1.0;
        return fallback;
    }
    v.iter().map(|x| x / norm).collect()
}

struct TuningResult {
    m: usize,
    ef_construction: usize,
    recall_at_10: f64,
    mean_latency_us: f64,
    build_time_secs: f64,
    memory_bytes: usize,
}

/// Build an index with specific parameters, compute ground truth recall,
/// and measure search latency.
fn evaluate_config(m: usize, ef_construction: usize) -> TuningResult {
    let config = VectorIndexConfig {
        dimensions: DIM,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F32,
        connectivity: m,
        ef_construction,
        ef_search: 200,
    };

    let mut rng = rand::rng();

    // Build ground truth with brute force.
    // NOTE: this sketch builds and queries the same BruteForceIndex, so recall
    // is trivially 1.0 and the build time is not an HNSW build time. In
    // practice, insert the same vectors into the real USearch-backed index
    // and time and query that index instead.
    let brute = BruteForceIndex::new(config.clone());
    let build_start = std::time::Instant::now();
    for id in 0..N {
        let vec = random_unit_vector(DIM, &mut rng);
        brute.insert(id, &vec).unwrap();
    }
    let build_time = build_start.elapsed();

    // Generate query vectors
    let queries: Vec<Vec<f32>> = (0..NUM_QUERIES)
        .map(|_| random_unit_vector(DIM, &mut rng))
        .collect();

    // Compute ground truth (brute force top-K for each query)
    let ground_truths: Vec<Vec<VectorId>> = queries
        .iter()
        .map(|q| {
            brute
                .search(q, K, K * 2)
                .unwrap()
                .iter()
                .map(|r| r.id)
                .collect()
        })
        .collect();

    // Search and measure recall + latency
    let planner = AdaptiveQueryPlanner::with_defaults();
    let mut total_recall = 0.0;
    let mut total_latency = Duration::ZERO;

    for (query, gt) in queries.iter().zip(ground_truths.iter()) {
        let start = std::time::Instant::now();
        let results = planner
            .execute(&brute, query, K, None, 1.0, None)
            .unwrap();
        total_latency += start.elapsed();

        // Per-query recall: share of ground-truth ids present in the results.
        let result_ids: Vec<VectorId> = results.iter().map(|r| r.id).collect();
        let hits = result_ids.iter().filter(|id| gt.contains(id)).count();
        total_recall += hits as f64 / gt.len() as f64;
    }

    TuningResult {
        m,
        ef_construction,
        recall_at_10: total_recall / NUM_QUERIES as f64,
        mean_latency_us: total_latency.as_micros() as f64 / NUM_QUERIES as f64,
        build_time_secs: build_time.as_secs_f64(),
        memory_bytes: 0, // Measured via index-specific API if available
    }
}
```
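
The acceptance criteria ask for both mean and p99 latency, whereas `TuningResult` carries only the mean. One way to obtain both, sketched here under the assumption that per-query `Duration` samples are kept in a slice rather than summed, is a nearest-rank percentile. `mean_us` and `percentile` are hypothetical helper names.

```rust
use std::time::Duration;

/// Mean of the samples in microseconds; `None` for an empty slice.
fn mean_us(samples: &[Duration]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let total: f64 = samples.iter().map(|d| d.as_secs_f64() * 1e6).sum();
    Some(total / samples.len() as f64)
}

/// Nearest-rank percentile for `p` in 0..=100; `None` for empty input
/// or an out-of-range `p`.
fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest rank is ceil(p * n / 100), clamped so that p = 0 maps to the minimum.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

With only 100 queries, the nearest-rank p99 is simply the second-largest sample, so a larger query set gives a more stable tail estimate.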

### 3. Criterion benchmark for the optimal config

After determining the optimal (M, ef_construction) from the evaluation, add a Criterion benchmark that measures search latency at the chosen parameters:

```rust
fn bench_usearch_optimal_1m(c: &mut Criterion) {
    let mut group = c.benchmark_group("usearch_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));

    // Build index with candidate-optimal config
    let configs = [
        (8, 100),
        (8, 200),
        (16, 100),
        (16, 200),
        (16, 400),
        (32, 200),
        (32, 400),
    ];

    let mut rng = rand::rng();
    let query = random_unit_vector(DIM, &mut rng);

    for &(m, ef_c) in &configs {
        let config = VectorIndexConfig {
            dimensions: DIM,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F32,
            connectivity: m,
            ef_construction: ef_c,
            ef_search: 200,
        };
        let index = BruteForceIndex::new(config);
        // Pre-populate (in real implementation, use the HNSW-backed index).
        // NOTE: 10K vectors is a placeholder; populate with N (1M) vectors so
        // the "usearch_1m" group name matches what is measured.
        for id in 0..10_000u64 {
            let vec = random_unit_vector(DIM, &mut rng);
            index.insert(id, &vec).unwrap();
        }

        group.bench_with_input(
            BenchmarkId::new("search", format!("M{m}_ef{ef_c}")),
            &(m, ef_c),
            |b, _| {
                b.iter(|| {
                    index.search(black_box(&query), black_box(K), black_box(200)).unwrap()
                });
            },
        );
    }

    group.finish();
}

// Register the benchmark with Criterion's generated entry point.
criterion_group!(benches, bench_usearch_optimal_1m);
criterion_main!(benches);
```
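
### 4. Apply optimal config

Once the optimal (M, ef_construction) is determined, update `VectorIndexConfig` defaults. The tuning runs use `QuantizationLevel::F32` while the default uses F16, so recall@10 should be re-checked at F16 before the new defaults are committed: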

```rust
// In tidal/src/storage/vector/mod.rs or config.rs
impl Default for VectorIndexConfig {
    fn default() -> Self {
        Self {
            dimensions: 128,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F16, // research doc recommends f16 default
            connectivity: OPTIMAL_M,              // determined by benchmark
            ef_construction: OPTIMAL_EF_C,        // determined by benchmark
            ef_search: 200,
        }
    }
}
```
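
The task does not fix a rule for what "optimal" means. One reasonable reading, assumed here for illustration only, is the lowest mean latency among configurations whose recall@10 exceeds 0.95. `Candidate`, `RECALL_THRESHOLD`, and `select_optimal` are hypothetical names; `Candidate` mirrors a subset of the `TuningResult` fields.

```rust
/// Subset of the tuning output needed to pick a configuration.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    m: usize,
    ef_construction: usize,
    recall_at_10: f64,
    mean_latency_us: f64,
}

/// Recall must strictly exceed this value, matching the recall@10 > 0.95 criterion.
const RECALL_THRESHOLD: f64 = 0.95;

/// Return the fastest candidate that clears the recall threshold,
/// or `None` when no candidate qualifies.
fn select_optimal(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.recall_at_10 > RECALL_THRESHOLD)
        .min_by(|a, b| a.mean_latency_us.total_cmp(&b.mean_latency_us))
}
```

A `None` result corresponds to the case where every configuration falls short and a mitigation has to be proposed instead. Other rules, such as breaking ties on memory or build time, are equally defensible.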

### 5. Recall measurement methodology

Recall@K is computed as the fraction of brute-force top-K results that appear in the HNSW search results:

```
recall@K = |HNSW_top_K intersect BruteForce_top_K| / K
```

Recall is averaged over 100 random queries. The acceptance threshold is recall@10 > 0.95.
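
The formula can be sketched as a small set intersection. This assumes vector ids are `u64` (the real `VectorId` type may differ) and deduplicates ids before intersecting; `recall_at_k` and `mean_recall` are illustrative names.

```rust
use std::collections::HashSet;

/// recall@K = |approx_top_K ∩ exact_top_K| / K, using at most the first
/// `k` ids of each list. Returns 0.0 when `k` is zero.
fn recall_at_k(approx: &[u64], exact: &[u64], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let truth: HashSet<u64> = exact.iter().take(k).copied().collect();
    let found: HashSet<u64> = approx.iter().take(k).copied().collect();
    found.intersection(&truth).count() as f64 / k as f64
}

/// Average recall@K over (approximate, exact) result pairs, one pair per query.
fn mean_recall(pairs: &[(Vec<u64>, Vec<u64>)], k: usize) -> f64 {
    if pairs.is_empty() {
        return 0.0;
    }
    let total: f64 = pairs.iter().map(|(a, e)| recall_at_k(a, e, k)).sum();
    total / pairs.len() as f64
}
```

Order within the top K does not matter here: a result list that contains the right ids in a different order still scores 1.0.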

### 6. Memory estimation

Per the research doc, HNSW graph overhead is ~300 bytes per node; the estimates below take that figure at M=16 and scale it linearly with M. At 1M vectors with 128D float32:

| M | Vector data | Graph overhead | Total |
|---|-------------|----------------|-------|
| 8 | 488 MB | ~150 MB | ~638 MB |
| 16 | 488 MB | ~300 MB | ~788 MB |
| 32 | 488 MB | ~600 MB | ~1.1 GB |

At 1536D (production), vector data grows 12x (1536 / 128) to roughly 5.7 GB. The graph overhead stays the same because it depends on M and the node count, not on dimensionality.
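
The arithmetic behind the table can be reproduced as follows. The sketch assumes 4 bytes per float32 component and, as in the table, graph overhead of ~300 bytes per node at M=16 scaled linearly with M; the function names are illustrative. Note that the table's 488 MB of vector data is 512,000,000 bytes expressed in binary megabytes.

```rust
const BYTES_PER_F32: u64 = 4;
/// Assumed per-node graph overhead at M = 16 (figure quoted from the research doc).
const GRAPH_BYTES_PER_NODE_AT_M16: u64 = 300;

/// Raw storage for `n` float32 vectors of `dim` components.
fn vector_bytes(n: u64, dim: u64) -> u64 {
    n * dim * BYTES_PER_F32
}

/// Estimated graph overhead, assuming it scales linearly with M.
fn graph_bytes(n: u64, m: u64) -> u64 {
    n * GRAPH_BYTES_PER_NODE_AT_M16 * m / 16
}

/// Convert a byte count to binary megabytes (MiB).
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

For example, `to_mib(vector_bytes(1_000_000, 128))` is about 488, and `vector_bytes(1_000_000, 1536)` is exactly 12 times the 128D figure.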

## Acceptance Criteria

- [ ] All 9 (M, ef_construction) configurations benchmarked at 1M vectors (or subset for CI time)
- [ ] Recall@10 > 0.95 for the selected optimal configuration
- [ ] Search latency for 100 queries recorded: mean and p99
- [ ] Build time per configuration recorded
- [ ] Optimal (M, ef_construction) applied to `VectorIndexConfig` default
- [ ] Results documented in `docs/profiling/usearch-tuning.md` with a recall/latency tradeoff table
- [ ] If no configuration reaches recall@10 > 0.95, document the finding and propose mitigation (increase ef_search, ACORN-1, etc.)

## Test Strategy

1. **Recall validation:** For the chosen config, run 100 queries and verify recall@10 > 0.95 against brute-force ground truth. This is a correctness test, not just a benchmark.
2. **Regression guard:** After applying the optimal config, re-run the existing `tidal/benches/vector.rs` benchmarks to ensure no regression at 10K scale.
3. **Config round-trip:** Verify that the new default config serializes and deserializes correctly if `VectorIndexConfig` is persisted.