tidaldb/docs/planning/milestone-2/phase-1/task-02-usearch-backend.md
jordan 6fdaa1584b feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint
2026-02-20 22:45:10 -07:00


Task 02: USearch Backend

Context

  • Milestone: 2 -- Ranked Retrieval
  • Phase: m2p1 -- Vector Index Integration (USearch)
  • Depends On: Task 01 (VectorIndex trait, types, l2_distance_sq)
  • Blocks: Task 04 (Adaptive Query Planner -- needs USearch for benchmarking)
  • Complexity: L

Objective

Deliver UsearchIndex, the production HNSW implementation wrapping the usearch Rust crate (Apache-2.0, C++ FFI via cxx). This is the performance-critical vector index that tidalDB uses for approximate nearest-neighbor search at scale. Per the research doc (docs/research/ann_for_tidaldb.md), USearch reaches ~127K QPS at f32 and ~167K QPS at int8 on 10M vectors of dimension 1536, with recall@100 > 95% -- numbers reported as validated in production by ScyllaDB, ClickHouse, and DuckDB.

This is the only module in tidalDB where #![forbid(unsafe_code)] may need to be relaxed. The usearch crate uses CXX for C++ FFI, which requires unsafe at the binding boundary inside the crate. Whether usearch.rs itself needs unsafe is verified at implementation time (see Lint Configuration). If it does, the #[allow(unsafe_code)] attribute is scoped to this single file (storage/vector/usearch.rs), and every unsafe block must have a // SAFETY: comment explaining why the invariants hold.

The USearch backend implements the full VectorIndex trait: insert, search, filtered_search, delete, reserve, save, load, view. It uses f16 quantization by default, M=16, ef_construction=200, ef_search=200 -- parameters validated by the research doc as optimal for 1536-dimensional embeddings at tidalDB's target scale.
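To make the f16 default concrete, the following sketch works out the raw vector payload for each quantization level at the target scale of 10M vectors of dimension 1536. The `Quantization` enum and `raw_vector_bytes` function are hypothetical stand-ins for illustration, not the Task 01 types, and the figures cover only the stored scalars: HNSW graph links, keys, and allocator overhead are excluded.

```rust
/// Hypothetical stand-in for the quantization levels named in this task.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Quantization {
    F32,
    F16,
    Int8,
}

impl Quantization {
    /// Storage width of one vector component, in bytes.
    pub const fn bytes_per_scalar(self) -> u64 {
        match self {
            Quantization::F32 => 4,
            Quantization::F16 => 2,
            Quantization::Int8 => 1,
        }
    }
}

/// Raw vector payload in bytes: count * dimensions * bytes per scalar.
/// Excludes graph links, keys, and allocator overhead.
pub const fn raw_vector_bytes(count: u64, dimensions: u64, q: Quantization) -> u64 {
    count * dimensions * q.bytes_per_scalar()
}
```

Under these assumptions, 10M x 1536 components come to 61.44 GB at f32, 30.72 GB at f16, and 15.36 GB at int8 (decimal gigabytes), which is the "half memory" trade-off behind the f16 default.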

Requirements

  • UsearchIndex wraps usearch::Index from the usearch crate
  • Implements VectorIndex trait from Task 01
  • Default config: f16 quantization (usearch::ScalarKind::F16), M=16, ef_construction=200, ef_search=200, metric=L2sq
  • insert() delegates to usearch::Index::add(key, vector)
  • search() delegates to usearch::Index::search(query, k)
  • filtered_search() delegates to usearch::Index::filtered_search(query, k, predicate)
  • delete() delegates to usearch::Index::remove(key) (lazy tombstone)
  • reserve() delegates to usearch::Index::reserve(capacity)
  • save(), load(), view() delegate to USearch persistence methods
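  • len() uses USearch's size(); len_live() reports the number of non-tombstoned vectors (from USearch if it exposes this, otherwise from an internal counter; see Implementation Notes)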
  • #[allow(unsafe_code)], only if unsafe call sites are actually required, scoped to usearch.rs only, with // SAFETY: on every unsafe block
  • Integration test: insert 1000 random vectors, search for 10 query vectors, compare recall against BruteForceIndex
  • UsearchIndex is Send + Sync

Technical Design

Module Structure

tidal/src/storage/vector/
  usearch.rs  -- UsearchIndex, #[allow(unsafe_code)]

Cargo.toml Addition

[dependencies]
usearch = "2"  # or latest stable version supporting filtered_search

Note: The exact version must be verified at implementation time. The usearch crate must support filtered_search with a predicate callback. If the latest published version does not support this API, the implementation must either:

  1. Use a version that does (check crate changelog).
  2. Fall back to hnsw_rs (pure Rust, Filterable trait) -- see Open Question 1 in OVERVIEW.md.

Lint Configuration

Unsafe code: The usearch crate (v2.x) provides a safe Rust API at the Index level -- CXX bridge handles the FFI internally. At implementation time, verify that all Index methods (add, search, filtered_search, remove, save, load, view) have safe signatures. If confirmed safe, do NOT add #[allow(unsafe_code)] and keep crate-level forbid(unsafe_code). Only add #[allow(unsafe_code)] if specific call sites require it, with // SAFETY: comments. The current expectation is that no unsafe blocks are needed in usearch.rs.

Public API

// === storage/vector/usearch.rs ===
//! USearch HNSW backend for approximate nearest neighbor search.
//!
//! This module wraps the `usearch` crate (Apache-2.0, C++ FFI via CXX)
//! behind the `VectorIndex` trait. It is the ONLY module in tidalDB that
//! uses `unsafe` code, and only at the C++ FFI boundary.
//!
//! # Safety
//!
//! All unsafe blocks delegate to `usearch::Index` methods which perform
//! C++ interop via CXX. The safety invariants are:
//! - Vectors passed to USearch have the correct dimensionality (checked
//!   before the FFI call).
//! - The `usearch::Index` handle is valid for the lifetime of `UsearchIndex`.
//! - `reserve()` has been called with sufficient capacity before insertion.
#![allow(unsafe_code)]

use std::path::Path;
use super::{VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig, VectorError,
            DistanceMetric, QuantizationLevel};

/// Production HNSW index backed by USearch.
///
/// Uses f16 quantization by default, M=16, ef_construction=200, ef_search=200.
/// Supports concurrent reads and writes (validated by ScyllaDB at 1B vectors).
///
/// # Persistence
///
/// - `save(path)`: Full serialization to disk. Coordinated with WAL checkpoint.
/// - `load(path)`: Full deserialization into writable RAM.
/// - `view(path)`: Zero-copy mmap for read-only serving (instant restart).
pub struct UsearchIndex {
    inner: usearch::Index,
    config: VectorIndexConfig,
}

impl UsearchIndex {
    /// Create a new empty index with the given configuration.
    ///
    /// # Errors
    ///
    /// Returns `VectorError::Backend` if USearch fails to initialize.
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError>;
}

Internal Design

Index construction:

impl UsearchIndex {
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError> {
        let metric = match config.metric {
            DistanceMetric::L2 => usearch::MetricKind::L2sq,
            DistanceMetric::InnerProduct => usearch::MetricKind::IP,
        };
        let quantization = match config.quantization {
            QuantizationLevel::F32 => usearch::ScalarKind::F32,
            QuantizationLevel::F16 => usearch::ScalarKind::F16,
            QuantizationLevel::Int8 => usearch::ScalarKind::I8,
        };

        let options = usearch::IndexOptions {
            dimensions: config.dimensions,
            metric,
            quantization,
            connectivity: config.connectivity,
            expansion_add: config.ef_construction,
            expansion_search: config.ef_search,
            ..Default::default()
        };

        // SAFETY: usearch::new_index performs C++ allocation via CXX.
        // The returned Index handle is valid until dropped.
        let inner = usearch::new_index(&options)
            .map_err(|e| VectorError::Backend(format!("USearch init failed: {e}")))?;

        Ok(Self { inner, config })
    }
}

Insert implementation:

fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError> {
    if embedding.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: embedding.len(),
        });
    }

    // SAFETY: embedding slice has correct length (checked above).
    // USearch::add performs C++ FFI to insert the vector into the HNSW graph.
    // The key (u64) and vector data are copied into USearch's internal storage.
    self.inner.add(id, embedding)
        .map_err(|e| VectorError::Backend(format!("USearch insert failed: {e}")))?;

    Ok(())
}

Search implementation:

fn search(
    &self,
    query: &[f32],
    k: usize,
    _ef_search: usize, // unused here: USearch's search() takes no per-query ef (see the ef_search override note)
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }

    // SAFETY: query slice has correct length (checked above).
    // USearch::search performs HNSW traversal via C++ FFI.
    // Results are copied back into Rust-owned memory.
    let results = self.inner.search(query, k)
        .map_err(|e| VectorError::Backend(format!("USearch search failed: {e}")))?;

    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}

Filtered search implementation:

fn filtered_search(
    &self,
    query: &[f32],
    k: usize,
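    _ef_search: usize, // unused here: USearch's filtered_search() takes no per-query ef (see the ef_search override note)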
    filter: &dyn Fn(VectorId) -> bool,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }

    // SAFETY: query slice has correct length (checked above).
    // The predicate closure is called from C++ during HNSW traversal.
    // CXX marshals the u64 key to Rust and back. The closure captures
    // only the filter reference which outlives the search call.
    let results = self.inner.filtered_search(query, k, |key| filter(key))
        .map_err(|e| VectorError::Backend(format!("USearch filtered_search failed: {e}")))?;

    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}

Note on filtered_search args: USearch's filtered_search takes (query, count, filter) -- there is no ef_search parameter. To use a different ef_search for this query, call self.inner.change_expansion_search(ef) BEFORE filtered_search. See the ef_search override note below.

ef_search override: Calling change_expansion_search(ef) before a search changes a global index parameter. Under concurrent searches this is NOT safe. For M2 (single-threaded query path or low concurrency), wrap the (change_expansion_search, search) pair in a Mutex guard. For M7 and high concurrency, investigate USearch's thread-safe ef_search API or fix ef_search at construction time. Document this in the Open Questions.
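The Mutex-guarded pairing described above could look roughly like the sketch below. `MockIndex` is a hypothetical stand-in that models only the index-wide expansion setting, not the real usearch::Index, and its `search` simply returns the ef value in effect so the behaviour is observable. `GuardedIndex`, `search_with_ef`, and the convention that `ef_search == 0` means "use the index default" are likewise assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for the backend index: models only the
/// index-wide `expansion_search` setting, not a real HNSW graph.
pub struct MockIndex {
    expansion_search: AtomicUsize,
}

impl MockIndex {
    pub fn new(default_ef: usize) -> Self {
        Self { expansion_search: AtomicUsize::new(default_ef) }
    }

    pub fn change_expansion_search(&self, ef: usize) {
        self.expansion_search.store(ef, Ordering::SeqCst);
    }

    pub fn expansion_search(&self) -> usize {
        self.expansion_search.load(Ordering::SeqCst)
    }

    /// Pretend search: returns the ef value in effect during the call.
    pub fn search(&self, _query: &[f32], _k: usize) -> usize {
        self.expansion_search()
    }
}

/// Serializes the (change_expansion_search, search) pair behind one lock.
pub struct GuardedIndex {
    inner: MockIndex,
    default_ef: usize,
    ef_lock: Mutex<()>,
}

impl GuardedIndex {
    pub fn new(default_ef: usize) -> Self {
        Self { inner: MockIndex::new(default_ef), default_ef, ef_lock: Mutex::new(()) }
    }

    /// Assumption: `ef_search == 0` means "use the index default".
    pub fn search_with_ef(&self, query: &[f32], k: usize, ef_search: usize) -> usize {
        // Hold the guard across both calls so no other search sees our ef.
        let _guard = self.ef_lock.lock().unwrap_or_else(|e| e.into_inner());
        let ef = if ef_search == 0 { self.default_ef } else { ef_search };
        self.inner.change_expansion_search(ef);
        let result = self.inner.search(query, k);
        // Restore the default before releasing the lock.
        self.inner.change_expansion_search(self.default_ef);
        result
    }

    pub fn current_ef(&self) -> usize {
        self.inner.expansion_search()
    }
}
```

The cost of this approach is that every search routed through the guard is serialized, which is why it only suits the low-concurrency M2 query path. Searches that bypass the guard would still observe whatever ef is currently set.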

Delete implementation:

fn delete(&self, id: VectorId) -> Result<(), VectorError> {
    // SAFETY: USearch::remove performs lazy tombstoning via C++ FFI.
    // The node remains in the graph for navigation but is excluded from results.
    self.inner.remove(id)
        .map_err(|e| VectorError::Backend(format!("USearch delete failed: {e}")))?;
    Ok(())
}

Persistence implementation:

fn save(&self, path: &Path) -> Result<(), VectorError> {
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::save serializes the entire index to disk via C++ I/O.
    self.inner.save(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch save failed: {e}")))?;
    Ok(())
}

fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::load deserializes from disk into writable RAM via C++ I/O.
    index.inner.load(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch load failed: {e}")))?;
    Ok(index)
}

fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    // view() now receives config, matching the updated VectorIndex trait
    // signature from Task 01 (Fix 2a). Create an index with the config
    // options, then call USearch's view() to mmap the file.
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::view memory-maps the file for read-only access via C++ I/O.
    index.inner.view(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch view failed: {e}")))?;
    Ok(index)
}

len and len_live implementation:

fn len(&self) -> usize {
    self.inner.size()
}

fn len_live(&self) -> usize {
    // USearch tracks live vs tombstoned internally.
    // If the crate exposes this, use it. Otherwise, len() is the best estimate.
    // Investigate at implementation time.
    self.inner.size() // may need adjustment
}
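If the crate turns out not to expose a live count, one option is a small counter maintained alongside the index. The sketch below is a standalone illustration under that assumption. `LiveCounter` and its method names are hypothetical, and `record_delete` takes the number of entries the backend reported as removed, so deleting an absent key does not skew the count.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical live-vector counter kept next to the backend index.
#[derive(Debug, Default)]
pub struct LiveCounter {
    live: AtomicUsize,
}

impl LiveCounter {
    /// Call after a successful insert of a new key.
    pub fn record_insert(&self) {
        self.live.fetch_add(1, Ordering::Relaxed);
    }

    /// Call after a successful delete. `removed` is the number of entries
    /// the backend reported as removed (0 if the key was absent).
    pub fn record_delete(&self, removed: usize) {
        // Saturating subtraction so a bookkeeping bug cannot wrap around.
        let _ = self.live.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
            Some(n.saturating_sub(removed))
        });
    }

    /// Number of vectors currently considered live.
    pub fn len_live(&self) -> usize {
        self.live.load(Ordering::Relaxed)
    }
}
```

A counter like this is not persisted by save(), so it would need to be re-seeded after load() or view(), for example from the backend's reported size. Whether that value excludes tombstones is something to confirm at implementation time.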

Error Handling

  • All USearch errors are mapped to VectorError::Backend(String) with the original error message.
  • Dimension checks happen before any FFI call to provide clear Rust-side errors.
  • I/O errors from persistence are mapped to VectorError::Io when possible, VectorError::Backend otherwise.
  • If reserve() is not called before insertion and USearch fails, the error is VectorError::Backend with a message suggesting reserve().
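The mapping rules above repeat across insert, search, and the three persistence methods, so they could be factored into small helpers. This is a minimal sketch, not the Task 01 definition: the `VectorError` enum here is a hypothetical stand-in with only the variants this task mentions, and `path_to_str`, `check_dimensions`, and `backend_err` are illustrative names.

```rust
use std::fmt;
use std::io;
use std::path::Path;

/// Hypothetical minimal stand-in for Task 01's `VectorError`.
#[derive(Debug)]
pub enum VectorError {
    DimensionMismatch { expected: usize, got: usize },
    Io(io::Error),
    Backend(String),
}

impl fmt::Display for VectorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VectorError::DimensionMismatch { expected, got } => {
                write!(f, "dimension mismatch: expected {expected}, got {got}")
            }
            VectorError::Io(e) => write!(f, "I/O error: {e}"),
            VectorError::Backend(msg) => write!(f, "backend error: {msg}"),
        }
    }
}

impl std::error::Error for VectorError {}

/// Convert a path to `&str` for the backend, mapping non-UTF-8 paths to `Io`.
pub fn path_to_str(path: &Path) -> Result<&str, VectorError> {
    path.to_str().ok_or_else(|| {
        VectorError::Io(io::Error::new(io::ErrorKind::InvalidInput, "non-UTF-8 path"))
    })
}

/// Rust-side dimension check, run before any call into the backend.
pub fn check_dimensions(expected: usize, got: usize) -> Result<(), VectorError> {
    if expected == got {
        Ok(())
    } else {
        Err(VectorError::DimensionMismatch { expected, got })
    }
}

/// Wrap a backend error, keeping the original message.
pub fn backend_err(op: &str, e: impl fmt::Display) -> VectorError {
    VectorError::Backend(format!("USearch {op} failed: {e}"))
}
```

With helpers of this shape, a call site would read along the lines of `self.inner.save(path_to_str(path)?).map_err(|e| backend_err("save", e))`, keeping the error text consistent across methods.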

Test Strategy

Integration Tests

// === tests/vector_usearch.rs (integration test) ===

use tidaldb::storage::vector::*;
use rand::Rng;

/// Generate a random unit vector of the given dimension.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    let v: Vec<f32> = (0..dim).map(|_| rng.gen::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

#[test]
fn usearch_insert_and_search_1000_vectors() {
    let dim = 128; // smaller dim for test speed
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };

    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(2000).unwrap();

    let brute_index = BruteForceIndex::new(config.clone());

    let mut rng = rand::thread_rng();
    let vectors: Vec<(u64, Vec<f32>)> = (0..1000)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();

    // Insert into both indexes
    for (id, v) in &vectors {
        usearch_index.insert(*id, v).unwrap();
        brute_index.insert(*id, v).unwrap();
    }

    // Search with 10 random queries, measure recall
    let mut total_recall = 0.0;
    let k = 100;
    let n_queries = 10;

    for _ in 0..n_queries {
        let query = random_unit_vector(dim, &mut rng);

        let exact_results = brute_index.search(&query, k, 0).unwrap();
        let approx_results = usearch_index.search(&query, k, 0).unwrap();

        let exact_ids: std::collections::HashSet<u64> =
            exact_results.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> =
            approx_results.iter().map(|r| r.id).collect();

        let overlap = exact_ids.intersection(&approx_ids).count();
        let recall = overlap as f64 / k as f64;
        total_recall += recall;
    }

    let mean_recall = total_recall / n_queries as f64;
    assert!(mean_recall > 0.90,
        "recall@{k} should be > 0.90, got {mean_recall:.3}");
}

#[test]
fn usearch_filtered_search_excludes_non_matching() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::thread_rng();
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    // Only include even IDs
    let query = random_unit_vector(dim, &mut rng);
    let results = index.filtered_search(&query, 50, 0, &|id| id % 2 == 0).unwrap();

    for r in &results {
        assert!(r.id % 2 == 0, "filtered_search returned odd ID {}", r.id);
    }
}

#[test]
fn usearch_delete_excludes_from_results() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::thread_rng();
    let vectors: Vec<(u64, Vec<f32>)> = (0..50)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();

    for (id, v) in &vectors {
        index.insert(*id, v).unwrap();
    }

    // Delete ID 0
    index.delete(0).unwrap();

    // Search for the deleted vector -- it should not appear
    let results = index.search(&vectors[0].1, 50, 0).unwrap();
    assert!(results.iter().all(|r| r.id != 0),
        "deleted vector should not appear in results");
}

#[test]
fn usearch_save_load_roundtrip() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::thread_rng();
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");

    // Save
    index.save(&path).unwrap();

    // Load
    let loaded = UsearchIndex::load(&path, &config).unwrap();
    assert_eq!(loaded.len(), 100);

    // Search on loaded index should produce similar results
    let query = random_unit_vector(dim, &mut rng);
    let results_orig = index.search(&query, 10, 0).unwrap();
    let results_loaded = loaded.search(&query, 10, 0).unwrap();

    // Top-1 should match (high probability for exact same index)
    assert_eq!(results_orig[0].id, results_loaded[0].id);
}

#[test]
fn usearch_view_readonly() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(100).unwrap();

    let mut rng = rand::thread_rng();
    for id in 0..50u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");
    index.save(&path).unwrap();

    // View (mmap read-only)
    let viewed = UsearchIndex::view(&path, &config).unwrap();
    assert_eq!(viewed.len(), 50);

    // Search should work on view'd index
    let query = random_unit_vector(dim, &mut rng);
    let results = viewed.search(&query, 10, 0).unwrap();
    assert!(!results.is_empty());
}

#[test]
fn usearch_dimension_mismatch() {
    let config = VectorIndexConfig {
        dimensions: 64,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config).unwrap();
    index.reserve(10).unwrap();

    // Wrong dimension on insert
    let result = index.insert(1, &[1.0; 32]); // 32 dims instead of 64
    assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 64, got: 32 })));

    // Wrong dimension on search
    index.insert(1, &[0.0; 64]).unwrap();
    let result = index.search(&[1.0; 32], 1, 0);
    assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}

#[test]
fn usearch_is_send_and_sync() {
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<UsearchIndex>();
}

#[test]
fn usearch_recall_at_10k() {
    // Larger recall test at 10K vectors, matching the phase acceptance criteria.
    // Uses smaller dimensions (128) for test speed.
    let dim = 128;
    let n = 10_000;
    let k = 100;
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };

    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(n * 2).unwrap();

    let brute_index = BruteForceIndex::new(config);

    let mut rng = rand::thread_rng();
    for id in 0..n as u64 {
        let v = random_unit_vector(dim, &mut rng);
        usearch_index.insert(id, &v).unwrap();
        brute_index.insert(id, &v).unwrap();
    }

    // 10 queries, compute mean recall@100
    let mut total_recall = 0.0;
    for _ in 0..10 {
        let query = random_unit_vector(dim, &mut rng);
        let exact = brute_index.search(&query, k, 0).unwrap();
        let approx = usearch_index.search(&query, k, 0).unwrap();

        let exact_ids: std::collections::HashSet<u64> = exact.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> = approx.iter().map(|r| r.id).collect();
        let recall = exact_ids.intersection(&approx_ids).count() as f64 / k as f64;
        total_recall += recall;
    }

    let mean_recall = total_recall / 10.0;
    assert!(mean_recall > 0.95,
        "recall@{k} at {n} vectors should be > 0.95, got {mean_recall:.3}");
}
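Both recall tests above compute the same set overlap inline, so the calculation could be shared through a helper. The following is a sketch with a hypothetical name, `recall_at_k`. One deliberate difference from the inline version is that it divides by the number of exact results actually available rather than by k, which gives the same value whenever the index holds at least k vectors.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the exact top-k IDs that also appear in the
/// approximate top-k. Only the first `k` entries of each slice are used.
/// Returns 0.0 when there are no exact results to compare against.
pub fn recall_at_k(exact_ids: &[u64], approx_ids: &[u64], k: usize) -> f64 {
    let exact: HashSet<u64> = exact_ids.iter().take(k).copied().collect();
    let approx: HashSet<u64> = approx_ids.iter().take(k).copied().collect();
    if exact.is_empty() {
        return 0.0;
    }
    exact.intersection(&approx).count() as f64 / exact.len() as f64
}
```

In the tests, the ID slices would come from mapping each result list to its `id` field before calling the helper.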

Acceptance Criteria

  • UsearchIndex wraps usearch::Index from the usearch crate
  • UsearchIndex implements VectorIndex trait (all methods)
  • Default config: f16 quantization, M=16, ef_construction=200, ef_search=200, L2sq metric
  • insert() validates dimensions before FFI call
  • search() returns results sorted by ascending L2 distance
  • filtered_search() passes predicate closure to USearch's callback API; all returned results satisfy the predicate
  • delete() tombstones the vector; it is excluded from subsequent search results
  • reserve() pre-allocates capacity in USearch
  • save() persists the full index to disk
  • load() restores a writable index from disk; search produces identical results
  • view() memory-maps the index for read-only search
  • #[allow(unsafe_code)], if it is needed at all (see Lint Configuration), is scoped to usearch.rs only
  • Every unsafe block has a // SAFETY: comment
  • Integration test: 1000 vectors, 10 queries, recall@100 > 0.90
  • Integration test: 10K vectors, recall@100 > 0.95 (matching phase acceptance criteria)
  • Integration test: filtered_search returns only predicate-matching results
  • Integration test: save/load roundtrip preserves search results
  • UsearchIndex is Send + Sync
  • cargo clippy -- -D warnings passes
  • All integration tests pass

Research References

  • docs/research/ann_for_tidaldb.md -- USearch evaluation: 127K QPS at f32, 167K QPS at int8, ScyllaDB validates concurrent operation at 1B vectors, f16 as optimal default (half memory, < 1% recall loss), filtered_search(query, k, |key| predicate(key)) implements in-graph filtering, view() for zero-copy mmap serving

Spec References

  • docs/specs/07-vector-retrieval.md -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200, L2 distance over normalized vectors), Section 3 (filtered ANN: USearch predicate callback, in-graph filtering preserves graph navigation), Section 4 (quantization: f16 default, ScalarKind mapping), Section 7 (persistence: save/load/view lifecycle, checkpoint coordination), Section 11 (UsearchIndex implementation sketch), Section 12 (performance targets: < 10ms ANN at 10K, recall@100 > 95%)

Implementation Notes

  • Add usearch = "2" (or the latest stable version with filtered_search support) to tidal/Cargo.toml [dependencies].
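  • Only if unsafe call sites turn out to be required (see Lint Configuration): change [lints.rust] unsafe_code from "forbid" to "deny" in Cargo.toml, because forbid cannot be overridden by a module-level allow. Add a comment: # deny (not forbid) to allow #[allow(unsafe_code)] in usearch FFI module. Otherwise keep "forbid".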
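  • Add rand to [dev-dependencies] for random vector generation in tests. The test code in this task uses the rand 0.8 API (rand::thread_rng(), Rng::gen). Under rand 0.9 these are deprecated in favor of rand::rng() and Rng::random(), which would trip cargo clippy -- -D warnings, so either pin rand = "0.8" or update the calls when using rand = "0.9".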
  • The usearch crate depends on cxx for C++ interop. This adds a C++ compiler requirement to the build. Document this in a top-level build note.
  • If USearch does not expose a way to distinguish live vs tombstoned vectors, len_live() should track the live count via an internal AtomicUsize counter, incremented on each successful insert() and decremented on each successful delete().
  • The view() method signature in the VectorIndex trait now takes (path, config) per the updated trait definition in Task 01. USearch requires knowing the index dimensions/metric to initialize the mmap'd index, so the config parameter is passed through to USearch construction before calling view().
  • Do NOT implement per-query ef_search override in this task if the USearch crate does not support it cleanly. Accept the parameter, log a debug warning if it differs from the default, and use the index-level default. Per-query override can be added when the adaptive query planner (Task 04) needs it.
  • Do NOT wrap UsearchIndex in RwLock unless testing reveals that concurrent insert + search causes data races. USearch claims thread safety for concurrent reads and writes. Verify in the integration test by running searches and inserts from multiple threads.
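The multi-threaded check mentioned in the last note could be driven by a harness like the one below. This is a sketch under stated assumptions: `ConcurrentIndex`, `ToyIndex`, and `hammer` are hypothetical names, and `ToyIndex` is a lock-based brute-force stand-in used only so the harness runs without the usearch crate. It says nothing about USearch's own thread safety.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

/// Hypothetical minimal surface needed by the harness.
pub trait ConcurrentIndex: Sync {
    fn insert(&self, id: u64, v: &[f32]);
    fn nearest(&self, q: &[f32]) -> Option<u64>;
    fn count(&self) -> usize;
}

/// Toy brute-force index guarded by a lock; a stand-in for the real backend.
pub struct ToyIndex {
    data: RwLock<HashMap<u64, Vec<f32>>>,
}

impl ToyIndex {
    pub fn new() -> Self {
        Self { data: RwLock::new(HashMap::new()) }
    }
}

impl ConcurrentIndex for ToyIndex {
    fn insert(&self, id: u64, v: &[f32]) {
        self.data.write().unwrap().insert(id, v.to_vec());
    }

    fn nearest(&self, q: &[f32]) -> Option<u64> {
        let data = self.data.read().unwrap();
        data.iter()
            .map(|(id, v)| {
                // Squared L2 distance between the stored vector and the query.
                let d: f32 = v.iter().zip(q).map(|(a, b)| (a - b) * (a - b)).sum();
                (*id, d)
            })
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(id, _)| id)
    }

    fn count(&self) -> usize {
        self.data.read().unwrap().len()
    }
}

/// Run `writers` inserting threads and `readers` searching threads at once,
/// then return the final vector count. Each writer uses a disjoint ID range.
pub fn hammer<I: ConcurrentIndex>(index: &I, writers: u64, per_writer: u64, readers: usize) -> usize {
    thread::scope(|s| {
        for w in 0..writers {
            s.spawn(move || {
                for i in 0..per_writer {
                    index.insert(w * per_writer + i, &[w as f32, i as f32]);
                }
            });
        }
        for _ in 0..readers {
            s.spawn(move || {
                for _ in 0..100 {
                    let _ = index.nearest(&[0.0, 0.0]);
                }
            });
        }
    });
    index.count()
}
```

A passing run of such a harness is only a smoke test: it can surface crashes or lost inserts, but it cannot prove the absence of data races. Running it under a sanitizer or a similar tool would give stronger evidence before deciding against a lock.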