jordan 213b8efcca feat: complete M6-M7 + Enterprise Readiness milestones; split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-23 22:41:16 -07:00

Task 02: USearch Parameter Tuning

Delivers

A systematic benchmark of USearch HNSW parameters (M x ef_construction) at 1M vectors, documenting the recall/latency tradeoff. The optimal configuration is applied to VectorIndexConfig::default(). ANN recall@10 must exceed 0.95.

Complexity

Dependencies

task-01 complete (scale benchmark infrastructure)
docs/research/ann_for_tidaldb.md (parameter guidance)

Technical Design

1. Parameter matrix

The research doc identifies M and ef_construction as the two critical HNSW parameters for recall/latency tradeoff. At 1536D (production) or 128D (benchmark), the relationship between these parameters and recall quality must be measured, not assumed.

Parameter	Values	Rationale
M (connectivity)	8, 16, 32	M=16 is the USearch default; M=8 saves ~50% graph memory; M=32 improves recall under selective filters at 2x memory
ef_construction	100, 200, 400	Controls index build quality; diminishing returns past 200 in most benchmarks
ef_search	200 (fixed)	Query-time expansion factor; held constant to isolate build-quality effects

This produces a 3x3 = 9 configuration matrix.
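A minimal sketch of how the grid could be enumerated so that every configuration is visited exactly once. The constant and function names are illustrative and not part of tidaldb:

```rust
/// Candidate values taken from the parameter table above.
const M_VALUES: [usize; 3] = [8, 16, 32];
const EF_CONSTRUCTION_VALUES: [usize; 3] = [100, 200, 400];

/// Enumerate every (M, ef_construction) pair, grouped by M.
fn parameter_grid() -> Vec<(usize, usize)> {
    let mut grid = Vec::with_capacity(M_VALUES.len() * EF_CONSTRUCTION_VALUES.len());
    for &m in M_VALUES.iter() {
        for &ef_c in EF_CONSTRUCTION_VALUES.iter() {
            grid.push((m, ef_c));
        }
    }
    grid
}
```

Each pair can then be passed to evaluate_config(m, ef_construction) from the next section.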

2. Benchmark implementation

Add to tidal/benches/vector.rs or a new tidal/benches/usearch_tuning.rs:

#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]

use criterion::{Criterion, black_box, criterion_group, criterion_main, BenchmarkId};
use rand::Rng;
use std::time::Duration;
use tidaldb::storage::vector::{
    AdaptiveQueryPlanner, BruteForceIndex, DistanceMetric,
    QuantizationLevel, VectorId, VectorIndex, VectorIndexConfig,
};

const DIM: usize = 128;
const N: u64 = 1_000_000;
const K: usize = 10;
const NUM_QUERIES: usize = 100;

fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    let v: Vec<f32> = (0..dim).map(|_| rng.random::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm < f32::EPSILON {
        let mut fallback = vec![0.0f32; dim];
        fallback[0] = 1.0;
        return fallback;
    }
    v.iter().map(|x| x / norm).collect()
}

struct TuningResult {
    m: usize,
    ef_construction: usize,
    recall_at_10: f64,
    mean_latency_us: f64,
    build_time_secs: f64,
    memory_bytes: usize,
}

/// Build an index with specific parameters, compute ground truth recall,
/// and measure search latency.
fn evaluate_config(m: usize, ef_construction: usize) -> TuningResult {
    let config = VectorIndexConfig {
        dimensions: DIM,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F32,
        connectivity: m,
        ef_construction,
        ef_search: 200,
    };

    let mut rng = rand::rng();

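    // Populate the brute-force index that supplies exact ground truth.
    // NOTE: this placeholder times BruteForceIndex inserts, so `build_time` does
    // not reflect `m` or `ef_construction`. For real tuning numbers, insert the
    // same vectors into the USearch-backed index and time that build instead.
    let brute = BruteForceIndex::new(config.clone());
    let build_start = std::time::Instant::now();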
    for id in 0..N {
        let vec = random_unit_vector(DIM, &mut rng);
        brute.insert(id, &vec).unwrap();
    }
    let build_time = build_start.elapsed();

    // Generate query vectors
    let queries: Vec<Vec<f32>> = (0..NUM_QUERIES)
        .map(|_| random_unit_vector(DIM, &mut rng))
        .collect();

    // Compute ground truth (brute force top-K for each query)
    let ground_truths: Vec<Vec<VectorId>> = queries
        .iter()
        .map(|q| {
            brute
                .search(q, K, K * 2)
                .unwrap()
                .iter()
                .map(|r| r.id)
                .collect()
        })
        .collect();

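    // Search and measure recall + latency.
    // NOTE: searching `brute` here compares the ground truth against itself, so
    // recall should come out at ~1.0. Pass the USearch-backed index to `execute`
    // to obtain a meaningful measurement.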
    let planner = AdaptiveQueryPlanner::with_defaults();
    let mut total_recall = 0.0;
    let mut total_latency = Duration::ZERO;

    for (query, gt) in queries.iter().zip(ground_truths.iter()) {
        let start = std::time::Instant::now();
        let results = planner
            .execute(&brute, query, K, None, 1.0, None)
            .unwrap();
        total_latency += start.elapsed();

        let result_ids: Vec<VectorId> = results.iter().map(|r| r.id).collect();
        let hits = result_ids.iter().filter(|id| gt.contains(id)).count();
        total_recall += hits as f64 / gt.len() as f64;
    }

    TuningResult {
        m,
        ef_construction,
        recall_at_10: total_recall / NUM_QUERIES as f64,
        mean_latency_us: total_latency.as_micros() as f64 / NUM_QUERIES as f64,
        build_time_secs: build_time.as_secs_f64(),
        memory_bytes: 0, // Measured via index-specific API if available
    }
}

3. Criterion benchmark for the optimal config

After the evaluation narrows the grid to a shortlist of (M, ef_construction) candidates, add a Criterion benchmark that measures search latency for each of them:

fn bench_usearch_optimal_1m(c: &mut Criterion) {
    let mut group = c.benchmark_group("usearch_1m");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(30));

    // Shortlisted (M, ef_construction) candidates: 7 of the 9 grid points,
    // omitting (8, 400) and (32, 100).
    let configs = [
        (8, 100),
        (8, 200),
        (16, 100),
        (16, 200),
        (16, 400),
        (32, 200),
        (32, 400),
    ];

    let mut rng = rand::rng();
    let query = random_unit_vector(DIM, &mut rng);

    for &(m, ef_c) in &configs {
        let config = VectorIndexConfig {
            dimensions: DIM,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F32,
            connectivity: m,
            ef_construction: ef_c,
            ef_search: 200,
        };
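        // Placeholder: a brute-force index does not use the HNSW parameters, and
        // 10_000 vectors keeps this sketch fast. The real benchmark should build
        // the HNSW-backed index with all N vectors so the "usearch_1m" name holds.
        let index = BruteForceIndex::new(config);
        for id in 0..10_000u64 {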
            let vec = random_unit_vector(DIM, &mut rng);
            index.insert(id, &vec).unwrap();
        }

        group.bench_with_input(
            BenchmarkId::new("search", format!("M{m}_ef{ef_c}")),
            &(m, ef_c),
            |b, _| {
                b.iter(|| {
                    index.search(black_box(&query), black_box(K), black_box(200)).unwrap()
                });
            },
        );
    }

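    group.finish();
}

// Register the benchmark with Criterion's generated `main`.
criterion_group!(benches, bench_usearch_optimal_1m);
criterion_main!(benches);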

4. Apply optimal config

Once the optimal (M, ef_construction) is determined, update VectorIndexConfig defaults:

// In tidal/src/storage/vector/mod.rs or config.rs
impl Default for VectorIndexConfig {
    fn default() -> Self {
        Self {
            dimensions: 128,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F16, // research doc recommends f16 default
            connectivity: OPTIMAL_M,               // determined by benchmark
            ef_construction: OPTIMAL_EF_C,          // determined by benchmark
            ef_search: 200,
        }
    }
}
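The task does not spell out how "optimal" is chosen from the benchmark results. One possible rule, stated here as an assumption, is to keep only configurations above the recall threshold and then take the lowest mean latency, breaking ties toward the smaller M because it uses less graph memory. `Candidate` is a trimmed, hypothetical stand-in for TuningResult:

```rust
use std::cmp::Ordering;

/// Hypothetical, trimmed copy of the fields that matter for selection.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    m: usize,
    ef_construction: usize,
    recall_at_10: f64,
    mean_latency_us: f64,
}

/// Pick the fastest configuration whose recall strictly exceeds `min_recall`.
/// Ties on latency go to the smaller M. Returns None if nothing qualifies.
fn select_optimal(results: &[Candidate], min_recall: f64) -> Option<&Candidate> {
    results
        .iter()
        .filter(|r| r.recall_at_10 > min_recall)
        .min_by(|a, b| {
            a.mean_latency_us
                .partial_cmp(&b.mean_latency_us)
                .unwrap_or(Ordering::Equal)
                .then(a.m.cmp(&b.m))
        })
}
```

A None result corresponds to the case in the acceptance criteria where no configuration meets the threshold and a mitigation has to be proposed.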

5. Recall measurement methodology

Recall@K is computed as the fraction of brute-force top-K results that appear in the HNSW search results:

recall@K = |HNSW_top_K intersect BruteForce_top_K| / K

The per-query score is averaged over 100 random queries (NUM_QUERIES in the benchmark). The selected configuration must achieve recall@10 > 0.95.
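The formula can be sketched as a small helper. This version assumes vector IDs are `u64` (a stand-in for the crate's VectorId) and uses set intersection so that duplicate IDs in a result list are not counted twice:

```rust
use std::collections::HashSet;

/// recall@K = |approx_top_K ∩ exact_top_K| / K.
/// Only the first `k` entries of each slice are considered.
fn recall_at_k(approx: &[u64], exact: &[u64], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let truth: HashSet<u64> = exact.iter().take(k).copied().collect();
    let found: HashSet<u64> = approx.iter().take(k).copied().collect();
    truth.intersection(&found).count() as f64 / k as f64
}

/// Mean recall@K over a batch of (approximate, exact) result pairs.
fn mean_recall_at_k(pairs: &[(Vec<u64>, Vec<u64>)], k: usize) -> f64 {
    if pairs.is_empty() {
        return 0.0;
    }
    let total: f64 = pairs.iter().map(|(a, e)| recall_at_k(a, e, k)).sum();
    total / pairs.len() as f64
}
```

Dividing by K rather than by the length of the ground-truth list keeps the score aligned with the formula above even if an index returns fewer than K results.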

6. Memory estimation

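Per the research doc, HNSW graph overhead is ~300 bytes per node; the table below treats that figure as the M=16 baseline and scales it linearly with M. At 1M vectors with 128D float32, vector data is 1,000,000 x 128 x 4 bytes = 512 MB (488 MiB):

M	Vector data	Graph overhead	Total
8	512 MB	~150 MB	~662 MB
16	512 MB	~300 MB	~812 MB
32	512 MB	~600 MB	~1.1 GB

At 1536D (production), multiply vector data by 12x (1536 / 128), to roughly 6.1 GB at float32. The graph overhead stays the same.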
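The arithmetic behind these estimates can be reproduced with a few helpers. This is a rough sketch that assumes graph overhead scales linearly with M from ~300 bytes per node at M=16; the function names are illustrative:

```rust
/// Raw vector storage: one scalar per dimension per vector.
fn vector_bytes(n: u64, dim: u64, bytes_per_scalar: u64) -> u64 {
    n * dim * bytes_per_scalar
}

/// Graph overhead, assuming ~300 bytes/node at M = 16 and linear scaling in M.
fn graph_bytes(n: u64, m: u64) -> u64 {
    n * 300 * m / 16
}

/// Estimated total index size in bytes.
fn estimate_index_bytes(n: u64, dim: u64, bytes_per_scalar: u64, m: u64) -> u64 {
    vector_bytes(n, dim, bytes_per_scalar) + graph_bytes(n, m)
}

/// Convert bytes to mebibytes (2^20 bytes) for reporting.
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

For 1M vectors at 128D float32, vector_bytes gives 512,000,000 bytes, which is about 488 MiB. Passing 2 as bytes_per_scalar models f16 storage.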

Acceptance Criteria

All 9 (M, ef_construction) configurations benchmarked at 1M vectors (or subset for CI time)
Recall@10 > 0.95 for the selected optimal configuration
Search latency for 100 queries recorded: mean and p99
Build time per configuration recorded
Optimal (M, ef_construction) applied to VectorIndexConfig default
Results documented in docs/profiling/usearch-tuning.md with a recall/latency tradeoff table
If no configuration reaches recall@10 > 0.95, document the finding and propose mitigation (increase ef_search, ACORN-1, etc.)
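The criteria ask for both mean and p99 latency, while TuningResult as sketched earlier only carries the mean. One way to derive both from per-query timings is shown below, using the nearest-rank percentile definition. `LatencyStats` and `latency_stats` are hypothetical names:

```rust
use std::time::Duration;

/// Hypothetical summary of per-query search latencies, in microseconds.
struct LatencyStats {
    mean_us: f64,
    p99_us: f64,
}

/// Compute mean and nearest-rank p99 from individual query durations.
/// Returns None when no samples were recorded.
fn latency_stats(samples: &[Duration]) -> Option<LatencyStats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();

    let total: Duration = sorted.iter().sum();
    let mean_us = total.as_secs_f64() * 1e6 / sorted.len() as f64;

    // Nearest rank: ceil(0.99 * n), computed in integers to avoid rounding error.
    let rank = (99 * sorted.len() + 99) / 100;
    let p99_us = sorted[rank - 1].as_secs_f64() * 1e6;

    Some(LatencyStats { mean_us, p99_us })
}
```

With only 100 queries the p99 is simply the 99th-smallest sample, so a single outlier can dominate it; a larger query set would give a more stable tail estimate.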

Test Strategy

Recall validation: For the chosen config, run 100 queries and verify recall@10 > 0.95 against brute-force ground truth. This is a correctness test, not just a benchmark.
Regression guard: After applying the optimal config, re-run the existing tidal/benches/vector.rs benchmarks to ensure no regression at 10K scale.
Config round-trip: Verify that the new default config serializes and deserializes correctly if VectorIndexConfig is persisted.
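For the recall validation test, exact ground truth can come from a plain linear scan. The sketch below is self-contained and does not use tidaldb's BruteForceIndex; it assumes squared L2 distance (which ranks results the same way as L2) and uses each vector's position in the slice as its ID:

```rust
use std::cmp::Ordering;

/// Squared Euclidean distance between two equal-length vectors.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact top-K nearest neighbours by scanning every vector.
/// Returns positions into `vectors`, nearest first; ties go to the lower index.
fn brute_force_top_k(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (l2_squared(v, query), i))
        .collect();
    scored.sort_by(|a, b| {
        a.0.partial_cmp(&b.0)
            .unwrap_or(Ordering::Equal)
            .then(a.1.cmp(&b.1))
    });
    scored.into_iter().take(k).map(|(_, i)| i).collect()
}
```

A full sort costs O(N log N) per query, which is acceptable for 100 queries; a bounded heap would cut this down if the ground-truth pass becomes the bottleneck at 1M vectors.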
