# Task 02: USearch Backend

## Context

**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** Task 01 (VectorIndex trait, types, `l2_distance_sq`)
**Blocks:** Task 04 (Adaptive Query Planner -- needs USearch for benchmarking)
**Complexity:** L

## Objective

Deliver `UsearchIndex`, the production HNSW implementation wrapping the `usearch` Rust crate (Apache-2.0, C++ FFI via `cxx`). This is the performance-critical vector index that tidalDB uses for approximate nearest neighbor search at scale. At 10M vectors of dimension 1536, USearch achieves ~127K QPS at f32 and ~167K QPS at int8, with recall@100 > 95% -- numbers validated by ScyllaDB, ClickHouse, and DuckDB in production.

This is the only module in tidalDB where `#![forbid(unsafe_code)]` is relaxed. The `usearch` crate uses CXX for C++ FFI, which requires `unsafe` at the binding boundary. Every `unsafe` block must have a `// SAFETY:` comment explaining why the invariants hold. The `#[allow(unsafe_code)]` attribute is scoped to this single file (`storage/vector/usearch.rs`).

The USearch backend implements the full `VectorIndex` trait: `insert`, `search`, `filtered_search`, `delete`, `reserve`, `save`, `load`, `view`. It uses f16 quantization by default, M=16, ef_construction=200, ef_search=200 -- parameters validated by the research doc as optimal for 1536-dimensional embeddings at tidalDB's target scale.
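As a rough sanity check on the f16 default, the raw vector payload at the scale quoted above can be worked out directly. The sketch below is back-of-the-envelope arithmetic only: it counts vector bytes and ignores HNSW neighbor lists and allocator overhead, and the helper name `raw_vector_bytes` is hypothetical (it is not part of the `usearch` crate or of Task 01).

```rust
/// Bytes needed to store `n` vectors of `dim` components at `bytes_per_scalar`.
/// Counts vector payload only; HNSW graph links are NOT included.
fn raw_vector_bytes(n: u64, dim: u64, bytes_per_scalar: u64) -> u64 {
    n * dim * bytes_per_scalar
}

fn main() {
    // Scale quoted in the Objective: 10M vectors of dimension 1536.
    let (n, dim) = (10_000_000u64, 1536u64);
    for (name, width) in [("f32", 4u64), ("f16", 2), ("i8", 1)] {
        let gb = raw_vector_bytes(n, dim, width) as f64 / 1e9; // decimal GB
        println!("{name}: {gb:.2} GB");
    }
}
```

Under these assumptions the payload is about 61.44 GB at f32, 30.72 GB at f16, and 15.36 GB at int8, which is why f16 is attractive as a default when its recall loss is small.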
## Requirements

- `UsearchIndex` wraps `usearch::Index` from the `usearch` crate
- Implements `VectorIndex` trait from Task 01
- Default config: f16 quantization (`usearch::ScalarKind::F16`), M=16, ef_construction=200, ef_search=200, metric=L2sq
- `insert()` delegates to `usearch::Index::add(key, vector)`
- `search()` delegates to `usearch::Index::search(query, k)`
- `filtered_search()` delegates to `usearch::Index::filtered_search(query, k, predicate)`
- `delete()` delegates to `usearch::Index::remove(key)` (lazy tombstone)
- `reserve()` delegates to `usearch::Index::reserve(capacity)`
- `save()`, `load()`, `view()` delegate to USearch persistence methods
- `len()` and `len_live()` use USearch's `size()` and capacity reporting
- `#[allow(unsafe_code)]` scoped to `usearch.rs` only, with `// SAFETY:` on every unsafe block
- Integration test: insert 1000 random vectors, search for 10 query vectors, compare recall against `BruteForceIndex`
- `UsearchIndex` is `Send + Sync`

## Technical Design

### Module Structure

```
tidal/src/storage/vector/
  usearch.rs    -- UsearchIndex, #[allow(unsafe_code)]
```

### Cargo.toml Addition

```toml
[dependencies]
usearch = "2"  # or latest stable version supporting filtered_search
```

Note: The exact version must be verified at implementation time. The `usearch` crate must support `filtered_search` with a predicate callback. If the latest published version does not support this API, the implementation must either:

1. Use a version that does (check crate changelog).
2. Fall back to `hnsw_rs` (pure Rust, `Filterable` trait) -- see Open Question 1 in OVERVIEW.md.

### Lint Configuration

**Unsafe code:** The `usearch` crate (v2.x) provides a safe Rust API at the `Index` level -- CXX bridge handles the FFI internally. At implementation time, verify that all `Index` methods (`add`, `search`, `filtered_search`, `remove`, `save`, `load`, `view`) have safe signatures.
If confirmed safe, **do NOT add `#[allow(unsafe_code)]`** and keep crate-level `forbid(unsafe_code)`. Only add `#[allow(unsafe_code)]` if specific call sites require it, with `// SAFETY:` comments. The current expectation is that no unsafe blocks are needed in `usearch.rs`.

### Public API

```rust
// === storage/vector/usearch.rs ===

//! USearch HNSW backend for approximate nearest neighbor search.
//!
//! This module wraps the `usearch` crate (Apache-2.0, C++ FFI via CXX)
//! behind the `VectorIndex` trait. It is the ONLY module in tidalDB that
//! uses `unsafe` code, and only at the C++ FFI boundary.
//!
//! # Safety
//!
//! All unsafe blocks delegate to `usearch::Index` methods which perform
//! C++ interop via CXX. The safety invariants are:
//! - Vectors passed to USearch have the correct dimensionality (checked
//!   before the FFI call).
//! - The `usearch::Index` handle is valid for the lifetime of `UsearchIndex`.
//! - `reserve()` has been called with sufficient capacity before insertion.

#![allow(unsafe_code)]

use std::path::Path;

use super::{
    VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig,
    VectorError, DistanceMetric, QuantizationLevel,
};

/// Production HNSW index backed by USearch.
///
/// Uses f16 quantization by default, M=16, ef_construction=200, ef_search=200.
/// Supports concurrent reads and writes (validated by ScyllaDB at 1B vectors).
///
/// # Persistence
///
/// - `save(path)`: Full serialization to disk. Coordinated with WAL checkpoint.
/// - `load(path)`: Full deserialization into writable RAM.
/// - `view(path)`: Zero-copy mmap for read-only serving (instant restart).
pub struct UsearchIndex {
    inner: usearch::Index,
    config: VectorIndexConfig,
}

impl UsearchIndex {
    /// Create a new empty index with the given configuration.
    ///
    /// # Errors
    ///
    /// Returns `VectorError::Backend` if USearch fails to initialize.
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError>;
}
```

### Internal Design

**Index construction:**

```rust
impl UsearchIndex {
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError> {
        let metric = match config.metric {
            DistanceMetric::L2 => usearch::MetricKind::L2sq,
            DistanceMetric::InnerProduct => usearch::MetricKind::IP,
        };
        let quantization = match config.quantization {
            QuantizationLevel::F32 => usearch::ScalarKind::F32,
            QuantizationLevel::F16 => usearch::ScalarKind::F16,
            QuantizationLevel::Int8 => usearch::ScalarKind::I8,
        };
        let options = usearch::IndexOptions {
            dimensions: config.dimensions,
            metric,
            quantization,
            connectivity: config.connectivity,
            expansion_add: config.ef_construction,
            expansion_search: config.ef_search,
            ..Default::default()
        };

        // SAFETY: usearch::new_index performs C++ allocation via CXX.
        // The returned Index handle is valid until dropped.
        let inner = usearch::new_index(&options)
            .map_err(|e| VectorError::Backend(format!("USearch init failed: {e}")))?;

        Ok(Self { inner, config })
    }
}
```

**Insert implementation:**

```rust
fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError> {
    if embedding.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: embedding.len(),
        });
    }

    // SAFETY: embedding slice has correct length (checked above).
    // USearch::add performs C++ FFI to insert the vector into the HNSW graph.
    // The key (u64) and vector data are copied into USearch's internal storage.
    self.inner.add(id, embedding)
        .map_err(|e| VectorError::Backend(format!("USearch insert failed: {e}")))?;
    Ok(())
}
```

**Search implementation:**

```rust
fn search(
    &self,
    query: &[f32],
    k: usize,
    ef_search: usize,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }

    // SAFETY: query slice has correct length (checked above).
    // USearch::search performs HNSW traversal via C++ FFI.
    // Results are copied back into Rust-owned memory.
    let results = self.inner.search(query, k)
        .map_err(|e| VectorError::Backend(format!("USearch search failed: {e}")))?;

    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}
```

**Filtered search implementation:**

```rust
fn filtered_search(
    &self,
    query: &[f32],
    k: usize,
    ef_search: usize,
    filter: &dyn Fn(VectorId) -> bool,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }

    // SAFETY: query slice has correct length (checked above).
    // The predicate closure is called from C++ during HNSW traversal.
    // CXX marshals the u64 key to Rust and back. The closure captures
    // only the filter reference which outlives the search call.
    let results = self.inner.filtered_search(query, k, |key| filter(key))
        .map_err(|e| VectorError::Backend(format!("USearch filtered_search failed: {e}")))?;

    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}
```

**Note on `filtered_search` args:** USearch's `filtered_search` takes (query, count, filter) -- there is no `ef_search` parameter. To use a different `ef_search` for this query, call `self.inner.change_expansion_search(ef)` BEFORE `filtered_search`. See the ef_search override note below.

**ef_search override:** Calling `change_expansion_search(ef)` before a search changes an index-wide parameter, so it is NOT safe under concurrent searches: one query's override can leak into another query that is running at the same time. For M2 (single-threaded query path or low concurrency), wrap the `(change_expansion_search, search)` pair in a `Mutex` guard. For M7 and high concurrency, investigate USearch's thread-safe ef_search API or fix ef_search at construction time. Document this in the Open Questions.
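The M2 guard described above can be sketched as follows. This is a minimal illustration, not the final implementation: `FakeIndex` is a hypothetical stand-in for `usearch::Index` that models only `change_expansion_search` and a `search` that reports which ef it ran with, `GuardedIndex` and `search_with_ef` are illustrative names, and treating `ef_search == 0` as "use the index default" is an assumption based on how the tests in this task call `search`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for `usearch::Index` (not the real crate type).
struct FakeIndex {
    expansion_search: AtomicUsize,
}

impl FakeIndex {
    fn change_expansion_search(&self, ef: usize) {
        self.expansion_search.store(ef, Ordering::SeqCst);
    }

    /// Returns the ef value this "search" actually ran with.
    fn search(&self, _query: &[f32], _k: usize) -> usize {
        self.expansion_search.load(Ordering::SeqCst)
    }
}

struct GuardedIndex {
    inner: FakeIndex,
    default_ef: usize,
    /// Serializes every (set ef, search, restore ef) sequence.
    ef_lock: Mutex<()>,
}

impl GuardedIndex {
    fn new(default_ef: usize) -> Self {
        Self {
            inner: FakeIndex { expansion_search: AtomicUsize::new(default_ef) },
            default_ef,
            ef_lock: Mutex::new(()),
        }
    }

    /// Assumption: `ef_search == 0` means "use the index default".
    fn search_with_ef(&self, query: &[f32], k: usize, ef_search: usize) -> usize {
        // Hold the lock across the whole sequence so no other query
        // can observe or overwrite the temporary ef value.
        let _guard = self.ef_lock.lock().unwrap_or_else(|e| e.into_inner());
        let ef = if ef_search == 0 { self.default_ef } else { ef_search };
        self.inner.change_expansion_search(ef);
        let used = self.inner.search(query, k);
        self.inner.change_expansion_search(self.default_ef);
        used
    }
}

fn main() {
    let index = GuardedIndex::new(200);
    // Two threads with different overrides must never see each other's ef.
    std::thread::scope(|s| {
        for ef in [50usize, 400] {
            let index = &index;
            s.spawn(move || {
                for _ in 0..1000 {
                    assert_eq!(index.search_with_ef(&[0.0; 4], 10, ef), ef);
                }
            });
        }
    });
    assert_eq!(index.search_with_ef(&[0.0; 4], 10, 0), 200);
}
```

The guard only helps if every search path takes the same lock, including searches that use the default ef. That serializes all queries, which is why this pattern is suggested for M2 only.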
**Delete implementation:**

```rust
fn delete(&self, id: VectorId) -> Result<(), VectorError> {
    // SAFETY: USearch::remove performs lazy tombstoning via C++ FFI.
    // The node remains in the graph for navigation but is excluded from results.
    self.inner.remove(id)
        .map_err(|e| VectorError::Backend(format!("USearch delete failed: {e}")))?;
    Ok(())
}
```

**Persistence implementation:**

```rust
fn save(&self, path: &Path) -> Result<(), VectorError> {
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;

    // SAFETY: USearch::save serializes the entire index to disk via C++ I/O.
    self.inner.save(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch save failed: {e}")))?;
    Ok(())
}

fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;

    // SAFETY: USearch::load deserializes from disk into writable RAM via C++ I/O.
    index.inner.load(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch load failed: {e}")))?;
    Ok(index)
}

fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    // view() now receives config, matching the updated VectorIndex trait
    // signature from Task 01 (Fix 2a). Create an index with the config
    // options, then call USearch's view() to mmap the file.
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;

    // SAFETY: USearch::view memory-maps the file for read-only access via C++ I/O.
    index.inner.view(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch view failed: {e}")))?;
    Ok(index)
}
```

**`len` and `len_live` implementation:**

```rust
fn len(&self) -> usize {
    self.inner.size()
}

fn len_live(&self) -> usize {
    // USearch tracks live vs tombstoned internally.
    // If the crate exposes this, use it. Otherwise, len() is the best estimate.
    // Investigate at implementation time.
    self.inner.size() // may need adjustment
}
```

### Error Handling

- All USearch errors are mapped to `VectorError::Backend(String)` with the original error message.
- Dimension checks happen before any FFI call to provide clear Rust-side errors.
- I/O errors from persistence are mapped to `VectorError::Io` when possible, `VectorError::Backend` otherwise.
- If `reserve()` is not called before insertion and USearch fails, the error is `VectorError::Backend` with a message suggesting `reserve()`.

## Test Strategy

### Integration Tests

```rust
// === tests/vector_usearch.rs (integration test) ===

use tidaldb::storage::vector::*;
use rand::Rng;

/// Generate a random unit vector of the given dimension.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    let v: Vec<f32> = (0..dim).map(|_| rng.gen::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

#[test]
fn usearch_insert_and_search_1000_vectors() {
    let dim = 128; // smaller dim for test speed
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };

    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(2000).unwrap();
    let brute_index = BruteForceIndex::new(config.clone());

    let mut rng = rand::thread_rng();
    let vectors: Vec<(u64, Vec<f32>)> = (0..1000)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();

    // Insert into both indexes
    for (id, v) in &vectors {
        usearch_index.insert(*id, v).unwrap();
        brute_index.insert(*id, v).unwrap();
    }

    // Search with 10 random queries, measure recall
    let mut total_recall = 0.0;
    let k = 100;
    let n_queries = 10;
    for _ in 0..n_queries {
        let query = random_unit_vector(dim, &mut rng);
        let exact_results = brute_index.search(&query, k, 0).unwrap();
        let approx_results = usearch_index.search(&query, k, 0).unwrap();

        let exact_ids: std::collections::HashSet<u64> =
            exact_results.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> =
            approx_results.iter().map(|r| r.id).collect();

        let overlap = exact_ids.intersection(&approx_ids).count();
        let recall = overlap as f64 / k as f64;
        total_recall += recall;
    }

    let mean_recall = total_recall / n_queries as f64;
    assert!(mean_recall > 0.90, "recall@{k} should be > 0.90, got {mean_recall:.3}");
}

#[test]
fn usearch_filtered_search_excludes_non_matching() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::thread_rng();
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    // Only include even IDs
    let query = random_unit_vector(dim, &mut rng);
    let results = index.filtered_search(&query, 50, 0, &|id| id % 2 == 0).unwrap();
    for r in &results {
        assert!(r.id % 2 == 0, "filtered_search returned odd ID {}", r.id);
    }
}

#[test]
fn usearch_delete_excludes_from_results() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::thread_rng();
    let vectors: Vec<(u64, Vec<f32>)> = (0..50)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();
    for (id, v) in &vectors {
        index.insert(*id, v).unwrap();
    }

    // Delete ID 0
    index.delete(0).unwrap();

    // Search for the deleted vector -- it should not appear
    let results = index.search(&vectors[0].1, 50, 0).unwrap();
    assert!(results.iter().all(|r| r.id != 0), "deleted vector should not appear in results");
}

#[test]
fn usearch_save_load_roundtrip() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::thread_rng();
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");

    // Save
    index.save(&path).unwrap();

    // Load
    let loaded = UsearchIndex::load(&path, &config).unwrap();
    assert_eq!(loaded.len(), 100);

    // Search on loaded index should produce similar results
    let query = random_unit_vector(dim, &mut rng);
    let results_orig = index.search(&query, 10, 0).unwrap();
    let results_loaded = loaded.search(&query, 10, 0).unwrap();

    // Top-1 should match (high probability for exact same index)
    assert_eq!(results_orig[0].id, results_loaded[0].id);
}

#[test]
fn usearch_view_readonly() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(100).unwrap();

    let mut rng = rand::thread_rng();
    for id in 0..50u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");
    index.save(&path).unwrap();

    // View (mmap read-only)
    let viewed = UsearchIndex::view(&path, &config).unwrap();
    assert_eq!(viewed.len(), 50);

    // Search should work on view'd index
    let query = random_unit_vector(dim, &mut rng);
    let results = viewed.search(&query, 10, 0).unwrap();
    assert!(!results.is_empty());
}

#[test]
fn usearch_dimension_mismatch() {
    let config = VectorIndexConfig {
        dimensions: 64,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config).unwrap();
    index.reserve(10).unwrap();

    // Wrong dimension on insert
    let result = index.insert(1, &[1.0; 32]); // 32 dims instead of 64
    assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 64, got: 32 })));

    // Wrong dimension on search
    index.insert(1, &[0.0; 64]).unwrap();
    let result = index.search(&[1.0; 32], 1, 0);
    assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}

#[test]
fn usearch_is_send_and_sync() {
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<UsearchIndex>();
}

#[test]
fn usearch_recall_at_10k() {
    // Larger recall test at 10K vectors, matching the phase acceptance criteria.
    // Uses smaller dimensions (128) for test speed.
    let dim = 128;
    let n = 10_000;
    let k = 100;
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };

    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(n * 2).unwrap();
    let brute_index = BruteForceIndex::new(config);

    let mut rng = rand::thread_rng();
    for id in 0..n as u64 {
        let v = random_unit_vector(dim, &mut rng);
        usearch_index.insert(id, &v).unwrap();
        brute_index.insert(id, &v).unwrap();
    }

    // 10 queries, compute mean recall@100
    let mut total_recall = 0.0;
    for _ in 0..10 {
        let query = random_unit_vector(dim, &mut rng);
        let exact = brute_index.search(&query, k, 0).unwrap();
        let approx = usearch_index.search(&query, k, 0).unwrap();

        let exact_ids: std::collections::HashSet<u64> =
            exact.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> =
            approx.iter().map(|r| r.id).collect();

        let recall = exact_ids.intersection(&approx_ids).count() as f64 / k as f64;
        total_recall += recall;
    }

    let mean_recall = total_recall / 10.0;
    assert!(mean_recall > 0.95,
        "recall@{k} at {n} vectors should be > 0.95, got {mean_recall:.3}");
}
```

## Acceptance Criteria

- [ ] `UsearchIndex` wraps `usearch::Index` from the `usearch` crate
- [ ] `UsearchIndex` implements `VectorIndex` trait (all methods)
- [ ] Default config: f16 quantization, M=16, ef_construction=200, ef_search=200, L2sq metric
- [ ] `insert()` validates dimensions before FFI call
- [ ] `search()` returns results sorted by ascending L2 distance
- [ ] `filtered_search()` passes predicate closure to USearch's callback API; all returned results satisfy the predicate
- [ ] `delete()` tombstones the vector; it is excluded from subsequent search results
- [ ] `reserve()` pre-allocates capacity in USearch
- [ ] `save()` persists the full index to disk
- [ ] `load()` restores a writable index from disk; search produces identical results
- [ ] `view()` memory-maps the index for read-only search
- [ ] `#[allow(unsafe_code)]` scoped to `usearch.rs` only
- [ ] Every `unsafe` block has a `// SAFETY:` comment
- [ ] Integration test: 1000 vectors, 10 queries, recall@100 > 0.90
- [ ] Integration test: 10K vectors, recall@100 > 0.95 (matching phase acceptance criteria)
- [ ] Integration test: filtered_search returns only predicate-matching results
- [ ] Integration test: save/load roundtrip preserves search results
- [ ] `UsearchIndex` is `Send + Sync`
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All integration tests pass

## Research References

- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- USearch evaluation: 127K QPS at f32, 167K QPS at int8, ScyllaDB validates concurrent operation at 1B vectors, f16 as optimal default (half memory, < 1% recall loss), `filtered_search(query, k, |key| predicate(key))` implements in-graph filtering, `view()` for zero-copy mmap serving

## Spec References

- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200, L2 distance over normalized vectors), Section 3 (filtered ANN: USearch predicate callback, in-graph filtering preserves graph navigation), Section 4 (quantization: f16 default, ScalarKind mapping), Section 7 (persistence: save/load/view lifecycle, checkpoint coordination), Section 11 (UsearchIndex implementation sketch), Section 12 (performance targets: < 10ms ANN at 10K, recall@100 > 95%)

## Implementation Notes

- Add `usearch = "2"` (or the latest stable version with `filtered_search` support) to `tidal/Cargo.toml` `[dependencies]`.
- If `usearch.rs` turns out to need `unsafe` blocks (see Lint Configuration), change `[lints.rust] unsafe_code` from `"forbid"` to `"deny"` in `Cargo.toml`. Add a comment: `# deny (not forbid) to allow #[allow(unsafe_code)] in usearch FFI module`. Otherwise keep `"forbid"`.
- Add `rand = "0.9"` to `[dev-dependencies]` for random vector generation in tests. Note that rand 0.9 deprecates `thread_rng()` and `Rng::gen()` in favor of `rand::rng()` and `Rng::random()`; either update the test helpers to the new names or pin `rand = "0.8"`.
- The `usearch` crate depends on `cxx` for C++ interop. This adds a C++ compiler requirement to the build. Document this in a top-level build note.
- If USearch does not expose a way to distinguish live vs tombstoned vectors, `len_live()` should track deletions via an internal `AtomicUsize` counter decremented on each successful `delete()` call.
- The `view()` method signature in the `VectorIndex` trait now takes `(path, config)` per the updated trait definition in Task 01. USearch requires knowing the index dimensions/metric to initialize the mmap'd index, so the config parameter is passed through to USearch construction before calling `view()`.
- Do NOT implement per-query `ef_search` override in this task if the USearch crate does not support it cleanly. Accept the parameter, log a debug warning if it differs from the default, and use the index-level default. Per-query override can be added when the adaptive query planner (Task 04) needs it.
- Do NOT wrap `UsearchIndex` in `RwLock` unless testing reveals that concurrent `insert` + `search` causes data races. USearch claims thread safety for concurrent reads and writes. Verify in the integration test by running searches and inserts from multiple threads.
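For the `len_live()` note above, a side counter could look roughly like the sketch below. `LiveCounter`, `on_insert`, and `on_delete` are hypothetical names, not part of the `usearch` crate or the `VectorIndex` trait. The sketch also assumes that the backend can report how many entries a delete actually removed, so that a repeated delete of the same key does not decrement twice.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical helper: counts live (non-tombstoned) vectors next to the index.
#[derive(Default)]
struct LiveCounter {
    live: AtomicUsize,
}

impl LiveCounter {
    /// Call after a successful insert of a key that was not already present.
    fn on_insert(&self) {
        self.live.fetch_add(1, Ordering::Relaxed);
    }

    /// Call after a delete. `removed` is the number of entries the backend
    /// reports it tombstoned (0 for an unknown or already-deleted key).
    fn on_delete(&self, removed: usize) {
        // Saturating update so a bookkeeping bug cannot wrap below zero.
        let _ = self.live.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
            Some(n.saturating_sub(removed))
        });
    }

    fn len_live(&self) -> usize {
        self.live.load(Ordering::Relaxed)
    }
}

fn main() {
    let counter = LiveCounter::default();

    // Four writer threads, 250 inserts each.
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..250 {
                    counter.on_insert();
                }
            });
        }
    });

    counter.on_delete(1); // a real delete
    counter.on_delete(0); // repeated delete of the same key: no change
    assert_eq!(counter.len_live(), 999);
    println!("live = {}", counter.len_live());
}
```

One caveat with this approach: the counter lives outside the serialized index, so after `load()` or `view()` it would need to be re-seeded, for example from `size()` if tombstones are not persisted. Whether that holds for USearch should be checked at implementation time.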