Task 02: USearch Backend
Context
Milestone: 2 -- Ranked Retrieval
Phase: m2p1 -- Vector Index Integration (USearch)
Depends On: Task 01 (VectorIndex trait, types, l2_distance_sq)
Blocks: Task 04 (Adaptive Query Planner -- needs USearch for benchmarking)
Complexity: L
Objective
Deliver UsearchIndex, the production HNSW implementation wrapping the usearch Rust crate (Apache-2.0, C++ FFI via cxx). This is the performance-critical vector index that tidalDB uses for approximate nearest neighbor search at scale. At 10M vectors of dimension 1536, USearch achieves ~127K QPS at f32 and ~167K QPS at int8, with recall@100 > 95% -- numbers validated by ScyllaDB, ClickHouse, and DuckDB in production.
This is the only module in tidalDB where #![forbid(unsafe_code)] may be relaxed. The usearch crate uses CXX for C++ FFI, which involves unsafe at the binding boundary. If any unsafe block is needed in tidalDB's own code, it must have a // SAFETY: comment explaining why the invariants hold, and the #[allow(unsafe_code)] attribute must stay scoped to this single file (storage/vector/usearch.rs). The Lint Configuration section below describes when the attribute can be omitted entirely.
The USearch backend implements the full VectorIndex trait: insert, search, filtered_search, delete, reserve, save, load, view. It uses f16 quantization by default, M=16, ef_construction=200, ef_search=200 -- parameters validated by the research doc as optimal for 1536-dimensional embeddings at tidalDB's target scale.
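To make the quantization default concrete, the raw vector payload at the target scale can be estimated with simple arithmetic. The sketch below is illustrative only: `Quantization` and `raw_vector_bytes` are hypothetical names, not the Task 01 types, and the figure deliberately excludes HNSW graph links and allocator overhead.

```rust
/// Hypothetical mirror of the three quantization levels named in this task.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Quantization {
    F32,
    F16,
    Int8,
}

impl Quantization {
    /// Bytes used to store one vector component.
    pub fn bytes_per_component(self) -> u64 {
        match self {
            Quantization::F32 => 4,
            Quantization::F16 => 2,
            Quantization::Int8 => 1,
        }
    }
}

/// Raw vector payload in bytes (graph edges and metadata are NOT included).
pub fn raw_vector_bytes(n_vectors: u64, dimensions: u64, q: Quantization) -> u64 {
    n_vectors * dimensions * q.bytes_per_component()
}
```

For 10M vectors of dimension 1536 this gives about 61.4 GB at f32, 30.7 GB at f16, and 15.4 GB at int8 for the vectors alone, which is the "half memory" trade-off behind the f16 default.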
Requirements
- UsearchIndex wraps usearch::Index from the usearch crate
- Implements VectorIndex trait from Task 01
- Default config: f16 quantization (usearch::ScalarKind::F16), M=16, ef_construction=200, ef_search=200, metric=L2sq
- insert() delegates to usearch::Index::add(key, vector)
- search() delegates to usearch::Index::search(query, k)
- filtered_search() delegates to usearch::Index::filtered_search(query, k, predicate)
- delete() delegates to usearch::Index::remove(key) (lazy tombstone)
- reserve() delegates to usearch::Index::reserve(capacity)
- save(), load(), view() delegate to USearch persistence methods
- len() and len_live() use USearch's size() and capacity reporting
- #[allow(unsafe_code)] scoped to usearch.rs only, with // SAFETY: on every unsafe block
- Integration test: insert 1000 random vectors, search for 10 query vectors, compare recall against BruteForceIndex
- UsearchIndex is Send + Sync
Technical Design
Module Structure
tidal/src/storage/vector/
usearch.rs -- UsearchIndex, #[allow(unsafe_code)]
Cargo.toml Addition
[dependencies]
usearch = "2" # or latest stable version supporting filtered_search
Note: The exact version must be verified at implementation time. The usearch crate must support filtered_search with a predicate callback. If the latest published version does not support this API, the implementation must either:
- Use a version that does (check crate changelog).
- Fall back to hnsw_rs (pure Rust, Filterable trait) -- see Open Question 1 in OVERVIEW.md.
Lint Configuration
Unsafe code: The usearch crate (v2.x) provides a safe Rust API at the Index level -- CXX bridge handles the FFI internally. At implementation time, verify that all Index methods (add, search, filtered_search, remove, save, load, view) have safe signatures. If confirmed safe, do NOT add #[allow(unsafe_code)] and keep crate-level forbid(unsafe_code). Only add #[allow(unsafe_code)] if specific call sites require it, with // SAFETY: comments. The current expectation is that no unsafe blocks are needed in usearch.rs.
Public API
// === storage/vector/usearch.rs ===
//! USearch HNSW backend for approximate nearest neighbor search.
//!
//! This module wraps the `usearch` crate (Apache-2.0, C++ FFI via CXX)
//! behind the `VectorIndex` trait. It is the ONLY module in tidalDB that
//! uses `unsafe` code, and only at the C++ FFI boundary.
//!
//! # Safety
//!
//! All unsafe blocks delegate to `usearch::Index` methods which perform
//! C++ interop via CXX. The safety invariants are:
//! - Vectors passed to USearch have the correct dimensionality (checked
//! before the FFI call).
//! - The `usearch::Index` handle is valid for the lifetime of `UsearchIndex`.
//! - `reserve()` has been called with sufficient capacity before insertion.
#![allow(unsafe_code)]
use std::path::Path;
use super::{VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig, VectorError,
DistanceMetric, QuantizationLevel};
/// Production HNSW index backed by USearch.
///
/// Uses f16 quantization by default, M=16, ef_construction=200, ef_search=200.
/// Supports concurrent reads and writes (validated by ScyllaDB at 1B vectors).
///
/// # Persistence
///
/// - `save(path)`: Full serialization to disk. Coordinated with WAL checkpoint.
/// - `load(path)`: Full deserialization into writable RAM.
/// - `view(path)`: Zero-copy mmap for read-only serving (instant restart).
pub struct UsearchIndex {
    inner: usearch::Index,
    config: VectorIndexConfig,
}

impl UsearchIndex {
    /// Create a new empty index with the given configuration.
    ///
    /// # Errors
    ///
    /// Returns `VectorError::Backend` if USearch fails to initialize.
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError>;
}
Internal Design
Index construction:
impl UsearchIndex {
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError> {
        let metric = match config.metric {
            DistanceMetric::L2 => usearch::MetricKind::L2sq,
            DistanceMetric::InnerProduct => usearch::MetricKind::IP,
        };
        let quantization = match config.quantization {
            QuantizationLevel::F32 => usearch::ScalarKind::F32,
            QuantizationLevel::F16 => usearch::ScalarKind::F16,
            QuantizationLevel::Int8 => usearch::ScalarKind::I8,
        };
        let options = usearch::IndexOptions {
            dimensions: config.dimensions,
            metric,
            quantization,
            connectivity: config.connectivity,
            expansion_add: config.ef_construction,
            expansion_search: config.ef_search,
            ..Default::default()
        };
        // SAFETY: usearch::new_index performs C++ allocation via CXX.
        // The returned Index handle is valid until dropped.
        let inner = usearch::new_index(&options)
            .map_err(|e| VectorError::Backend(format!("USearch init failed: {e}")))?;
        Ok(Self { inner, config })
    }
}
Insert implementation:
fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError> {
    if embedding.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: embedding.len(),
        });
    }
    // SAFETY: embedding slice has correct length (checked above).
    // USearch::add performs C++ FFI to insert the vector into the HNSW graph.
    // The key (u64) and vector data are copied into USearch's internal storage.
    self.inner.add(id, embedding)
        .map_err(|e| VectorError::Backend(format!("USearch insert failed: {e}")))?;
    Ok(())
}
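The same dimension check recurs in insert, search, and filtered_search, so it could be factored into one helper. The following is a minimal sketch, not part of the task's API: `check_dimensions` is a hypothetical name, and the `VectorError` enum here is a stand-in showing only the one variant used, since the real type is defined in Task 01.

```rust
/// Hypothetical stand-in for the `VectorError` type defined in Task 01;
/// only the variant needed for this sketch is shown.
#[derive(Debug, PartialEq)]
pub enum VectorError {
    DimensionMismatch { expected: usize, got: usize },
}

/// Returns `Ok(())` when `vector` has exactly `expected` components.
/// Intended to run before any call into the USearch bindings.
pub fn check_dimensions(expected: usize, vector: &[f32]) -> Result<(), VectorError> {
    if vector.len() == expected {
        Ok(())
    } else {
        Err(VectorError::DimensionMismatch {
            expected,
            got: vector.len(),
        })
    }
}
```

With such a helper, each method body could start with `check_dimensions(self.config.dimensions, embedding)?;` and the error shape stays identical across all three call sites.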
Search implementation:
fn search(
    &self,
    query: &[f32],
    k: usize,
    // Accepted for trait compatibility but not applied per query in this task;
    // the underscore avoids an unused-variable warning under -D warnings.
    _ef_search: usize,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }
    // SAFETY: query slice has correct length (checked above).
    // USearch::search performs HNSW traversal via C++ FFI.
    // Results are copied back into Rust-owned memory.
    let results = self.inner.search(query, k)
        .map_err(|e| VectorError::Backend(format!("USearch search failed: {e}")))?;
    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}
Filtered search implementation:
fn filtered_search(
    &self,
    query: &[f32],
    k: usize,
    // Accepted for trait compatibility but not applied per query in this task;
    // the underscore avoids an unused-variable warning under -D warnings.
    _ef_search: usize,
    filter: &dyn Fn(VectorId) -> bool,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }
    // SAFETY: query slice has correct length (checked above).
    // The predicate closure is called from C++ during HNSW traversal.
    // CXX marshals the u64 key to Rust and back. The closure captures
    // only the filter reference which outlives the search call.
    let results = self.inner.filtered_search(query, k, |key| filter(key))
        .map_err(|e| VectorError::Backend(format!("USearch filtered_search failed: {e}")))?;
    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}
Note on filtered_search args: USearch's filtered_search takes (query, count, filter) -- there is no ef_search parameter. To use a different ef_search for this query, call self.inner.change_expansion_search(ef) BEFORE filtered_search. See the ef_search override note below.
ef_search override: Calling change_expansion_search(ef) before a search changes a global index parameter. Under concurrent searches this is NOT safe. For M2 (single-threaded query path or low concurrency), wrap the (change_expansion_search, search) pair in a Mutex guard. For M7 and high concurrency, investigate USearch's thread-safe ef_search API or fix ef_search at construction time. Document this in the Open Questions.
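The Mutex-guarded pair described above could look roughly like the sketch below. It is self-contained and does not call the real usearch crate: `ExpansionSearch`, `EfGuarded`, and `EchoIndex` are hypothetical names, and treating `ef_search == 0` as "use the index default" is an assumption based on how the tests in this document pass 0.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for the two USearch calls this sketch needs.
pub trait ExpansionSearch {
    fn change_expansion_search(&self, ef: usize);
    fn search(&self, query: &[f32], k: usize) -> Vec<u64>;
}

/// Serializes "set ef_search, then search" so concurrent callers
/// cannot observe each other's ef_search value.
pub struct EfGuarded<I> {
    inner: I,
    default_ef: usize,
    ef_lock: Mutex<()>,
}

impl<I: ExpansionSearch> EfGuarded<I> {
    pub fn new(inner: I, default_ef: usize) -> Self {
        Self { inner, default_ef, ef_lock: Mutex::new(()) }
    }

    /// Assumption: `ef_search == 0` means "use the index-level default".
    pub fn search_with_ef(&self, query: &[f32], k: usize, ef_search: usize) -> Vec<u64> {
        let ef = if ef_search == 0 { self.default_ef } else { ef_search };
        // Hold the lock across both calls. A poisoned lock is still usable
        // because the guarded value is `()`.
        let _guard = self.ef_lock.lock().unwrap_or_else(|p| p.into_inner());
        self.inner.change_expansion_search(ef);
        self.inner.search(query, k)
    }
}

/// Test double: "search" returns the current ef value repeated k times.
pub struct EchoIndex {
    ef: AtomicUsize,
}

impl EchoIndex {
    pub fn new() -> Self {
        Self { ef: AtomicUsize::new(0) }
    }
}

impl ExpansionSearch for EchoIndex {
    fn change_expansion_search(&self, ef: usize) {
        self.ef.store(ef, Ordering::SeqCst);
    }
    fn search(&self, _query: &[f32], k: usize) -> Vec<u64> {
        vec![self.ef.load(Ordering::SeqCst) as u64; k]
    }
}
```

The guarantee only holds if every search, including those using the default ef_search, goes through the same lock. That serializes all queries, which is why the note above limits this approach to the low-concurrency M2 query path.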
Delete implementation:
fn delete(&self, id: VectorId) -> Result<(), VectorError> {
    // SAFETY: USearch::remove performs lazy tombstoning via C++ FFI.
    // The node remains in the graph for navigation but is excluded from results.
    self.inner.remove(id)
        .map_err(|e| VectorError::Backend(format!("USearch delete failed: {e}")))?;
    Ok(())
}
Persistence implementation:
fn save(&self, path: &Path) -> Result<(), VectorError> {
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::save serializes the entire index to disk via C++ I/O.
    self.inner.save(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch save failed: {e}")))?;
    Ok(())
}

fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::load deserializes from disk into writable RAM via C++ I/O.
    index.inner.load(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch load failed: {e}")))?;
    Ok(index)
}

fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    // view() now receives config, matching the updated VectorIndex trait
    // signature from Task 01 (Fix 2a). Create an index with the config
    // options, then call USearch's view() to mmap the file.
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::view memory-maps the file for read-only access via C++ I/O.
    index.inner.view(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch view failed: {e}")))?;
    Ok(index)
}
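All three persistence methods repeat the same path-to-string conversion. One way to remove the duplication is a small helper, sketched below using only the standard library. The name `path_to_utf8` is hypothetical and not part of the task's API.

```rust
use std::io;
use std::path::Path;

/// Converts a path to `&str`, failing with `InvalidInput` on non-UTF-8 paths.
/// Hypothetical helper; callers would wrap the error in `VectorError::Io`.
pub fn path_to_utf8(path: &Path) -> Result<&str, io::Error> {
    path.to_str()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "non-UTF-8 path"))
}
```

Inside save, load, and view the conversion would then read `let path_str = path_to_utf8(path).map_err(VectorError::Io)?;`, assuming `VectorError::Io` wraps `std::io::Error` as the code above suggests.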
len and len_live implementation:
fn len(&self) -> usize {
    self.inner.size()
}

fn len_live(&self) -> usize {
    // USearch tracks live vs tombstoned internally.
    // If the crate exposes this, use it. Otherwise, len() is the best estimate.
    // Investigate at implementation time.
    self.inner.size() // may need adjustment
}
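If USearch turns out not to expose a live count, the fallback mentioned in the Implementation Notes (an internal AtomicUsize adjusted on insert and delete) could be sketched as follows. `LiveCounter` and its method names are hypothetical. The sketch assumes the caller only invokes `on_insert` and `on_delete` after the corresponding USearch call succeeded.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical live-vector counter kept next to the USearch handle.
#[derive(Debug, Default)]
pub struct LiveCounter {
    live: AtomicUsize,
}

impl LiveCounter {
    /// Call after a successful insert of a new key.
    pub fn on_insert(&self) {
        self.live.fetch_add(1, Ordering::Relaxed);
    }

    /// Call after a successful delete. Saturates at zero instead of wrapping.
    pub fn on_delete(&self) {
        let _ = self
            .live
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| n.checked_sub(1));
    }

    /// Number of vectors inserted and not yet deleted.
    pub fn len_live(&self) -> usize {
        self.live.load(Ordering::Relaxed)
    }
}
```

A counter like this is not persisted with the index, so after load() or view() it would need to be re-seeded. How to do that is left open here.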
Error Handling
- All USearch errors are mapped to VectorError::Backend(String) with the original error message.
- Dimension checks happen before any FFI call to provide clear Rust-side errors.
- I/O errors from persistence are mapped to VectorError::Io when possible, VectorError::Backend otherwise.
- If reserve() is not called before insertion and USearch fails, the error is VectorError::Backend with a message suggesting reserve().
Test Strategy
Integration Tests
// === tests/vector_usearch.rs (integration test) ===
use tidaldb::storage::vector::*;
use rand::Rng;
/// Generate a random unit vector of the given dimension.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    // rand 0.9 API: Rng::random replaces the deprecated Rng::gen.
    let v: Vec<f32> = (0..dim).map(|_| rng.random::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}
#[test]
fn usearch_insert_and_search_1000_vectors() {
    let dim = 128; // smaller dim for test speed
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };
    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(2000).unwrap();
    let brute_index = BruteForceIndex::new(config.clone());
    let mut rng = rand::rng(); // rand 0.9 name for thread_rng()
    let vectors: Vec<(u64, Vec<f32>)> = (0..1000)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();
    // Insert into both indexes
    for (id, v) in &vectors {
        usearch_index.insert(*id, v).unwrap();
        brute_index.insert(*id, v).unwrap();
    }
    // Search with 10 random queries, measure recall
    let mut total_recall = 0.0;
    let k = 100;
    let n_queries = 10;
    for _ in 0..n_queries {
        let query = random_unit_vector(dim, &mut rng);
        let exact_results = brute_index.search(&query, k, 0).unwrap();
        let approx_results = usearch_index.search(&query, k, 0).unwrap();
        let exact_ids: std::collections::HashSet<u64> =
            exact_results.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> =
            approx_results.iter().map(|r| r.id).collect();
        let overlap = exact_ids.intersection(&approx_ids).count();
        let recall = overlap as f64 / k as f64;
        total_recall += recall;
    }
    let mean_recall = total_recall / n_queries as f64;
    assert!(mean_recall > 0.90,
        "recall@{k} should be > 0.90, got {mean_recall:.3}");
}
#[test]
fn usearch_filtered_search_excludes_non_matching() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();
    let mut rng = rand::rng(); // rand 0.9 name for thread_rng()
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }
    // Only include even IDs
    let query = random_unit_vector(dim, &mut rng);
    let results = index.filtered_search(&query, 50, 0, &|id| id % 2 == 0).unwrap();
    for r in &results {
        assert!(r.id % 2 == 0, "filtered_search returned odd ID {}", r.id);
    }
}
#[test]
fn usearch_delete_excludes_from_results() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();
    let mut rng = rand::rng(); // rand 0.9 name for thread_rng()
    let vectors: Vec<(u64, Vec<f32>)> = (0..50)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();
    for (id, v) in &vectors {
        index.insert(*id, v).unwrap();
    }
    // Delete ID 0
    index.delete(0).unwrap();
    // Search for the deleted vector -- it should not appear
    let results = index.search(&vectors[0].1, 50, 0).unwrap();
    assert!(results.iter().all(|r| r.id != 0),
        "deleted vector should not appear in results");
}
#[test]
fn usearch_save_load_roundtrip() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(200).unwrap();
    let mut rng = rand::rng(); // rand 0.9 name for thread_rng()
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }
    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");
    // Save
    index.save(&path).unwrap();
    // Load
    let loaded = UsearchIndex::load(&path, &config).unwrap();
    assert_eq!(loaded.len(), 100);
    // Search on loaded index should produce similar results
    let query = random_unit_vector(dim, &mut rng);
    let results_orig = index.search(&query, 10, 0).unwrap();
    let results_loaded = loaded.search(&query, 10, 0).unwrap();
    // Top-1 should match (high probability for exact same index)
    assert_eq!(results_orig[0].id, results_loaded[0].id);
}
#[test]
fn usearch_view_readonly() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(100).unwrap();
    let mut rng = rand::rng(); // rand 0.9 name for thread_rng()
    for id in 0..50u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }
    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");
    index.save(&path).unwrap();
    // View (mmap read-only)
    let viewed = UsearchIndex::view(&path, &config).unwrap();
    assert_eq!(viewed.len(), 50);
    // Search should work on view'd index
    let query = random_unit_vector(dim, &mut rng);
    let results = viewed.search(&query, 10, 0).unwrap();
    assert!(!results.is_empty());
}
#[test]
fn usearch_dimension_mismatch() {
    let config = VectorIndexConfig {
        dimensions: 64,
        ..VectorIndexConfig::default()
    };
    let index = UsearchIndex::new(config).unwrap();
    index.reserve(10).unwrap();
    // Wrong dimension on insert
    let result = index.insert(1, &[1.0; 32]); // 32 dims instead of 64
    assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 64, got: 32 })));
    // Wrong dimension on search
    index.insert(1, &[0.0; 64]).unwrap();
    let result = index.search(&[1.0; 32], 1, 0);
    assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}
#[test]
fn usearch_is_send_and_sync() {
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<UsearchIndex>();
}
#[test]
fn usearch_recall_at_10k() {
    // Larger recall test at 10K vectors, matching the phase acceptance criteria.
    // Uses smaller dimensions (128) for test speed.
    let dim = 128;
    let n = 10_000;
    let k = 100;
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };
    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(n * 2).unwrap();
    let brute_index = BruteForceIndex::new(config);
    let mut rng = rand::rng(); // rand 0.9 name for thread_rng()
    for id in 0..n as u64 {
        let v = random_unit_vector(dim, &mut rng);
        usearch_index.insert(id, &v).unwrap();
        brute_index.insert(id, &v).unwrap();
    }
    // 10 queries, compute mean recall@100
    let mut total_recall = 0.0;
    for _ in 0..10 {
        let query = random_unit_vector(dim, &mut rng);
        let exact = brute_index.search(&query, k, 0).unwrap();
        let approx = usearch_index.search(&query, k, 0).unwrap();
        let exact_ids: std::collections::HashSet<u64> = exact.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> = approx.iter().map(|r| r.id).collect();
        let recall = exact_ids.intersection(&approx_ids).count() as f64 / k as f64;
        total_recall += recall;
    }
    let mean_recall = total_recall / 10.0;
    assert!(mean_recall > 0.95,
        "recall@{k} at {n} vectors should be > 0.95, got {mean_recall:.3}");
}
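Both recall tests compute the same set-overlap ratio inline. A shared helper along the following lines would keep the definition in one place. This is a sketch: `recall_at_k` is a hypothetical name, and it works on plain ID slices rather than on VectorSearchResult so that it stays self-contained.

```rust
use std::collections::HashSet;

/// recall@k = |exact ∩ approx| / k, where both slices hold result IDs.
/// Only the first `k` entries of each slice are considered.
/// Returns 0.0 when `k` is zero to avoid dividing by zero.
pub fn recall_at_k(exact: &[u64], approx: &[u64], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let exact_ids: HashSet<u64> = exact.iter().take(k).copied().collect();
    let approx_ids: HashSet<u64> = approx.iter().take(k).copied().collect();
    exact_ids.intersection(&approx_ids).count() as f64 / k as f64
}
```

In the tests above, the IDs would first be extracted with `results.iter().map(|r| r.id).collect::<Vec<_>>()` before being passed to the helper.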
Acceptance Criteria
- UsearchIndex wraps usearch::Index from the usearch crate
- UsearchIndex implements VectorIndex trait (all methods)
- Default config: f16 quantization, M=16, ef_construction=200, ef_search=200, L2sq metric
- insert() validates dimensions before FFI call
- search() returns results sorted by ascending L2 distance
- filtered_search() passes predicate closure to USearch's callback API; all returned results satisfy the predicate
- delete() tombstones the vector; it is excluded from subsequent search results
- reserve() pre-allocates capacity in USearch
- save() persists the full index to disk
- load() restores a writable index from disk; search produces identical results
- view() memory-maps the index for read-only search
- #[allow(unsafe_code)] scoped to usearch.rs only
- Every unsafe block has a // SAFETY: comment
- Integration test: 1000 vectors, 10 queries, recall@100 > 0.90
- Integration test: 10K vectors, recall@100 > 0.95 (matching phase acceptance criteria)
- Integration test: filtered_search returns only predicate-matching results
- Integration test: save/load roundtrip preserves search results
- UsearchIndex is Send + Sync
- cargo clippy -- -D warnings passes
- All integration tests pass
Research References
- docs/research/ann_for_tidaldb.md -- USearch evaluation: 127K QPS at f32, 167K QPS at int8, ScyllaDB validates concurrent operation at 1B vectors, f16 as optimal default (half memory, < 1% recall loss), filtered_search(query, k, |key| predicate(key)) implements in-graph filtering, view() for zero-copy mmap serving
Spec References
- docs/specs/07-vector-retrieval.md -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200, L2 distance over normalized vectors), Section 3 (filtered ANN: USearch predicate callback, in-graph filtering preserves graph navigation), Section 4 (quantization: f16 default, ScalarKind mapping), Section 7 (persistence: save/load/view lifecycle, checkpoint coordination), Section 11 (UsearchIndex implementation sketch), Section 12 (performance targets: < 10ms ANN at 10K, recall@100 > 95%)
Implementation Notes
- Add usearch = "2" (or the latest stable version with filtered_search support) to tidal/Cargo.toml [dependencies].
- Change [lints.rust] unsafe_code from "forbid" to "deny" in Cargo.toml. Add a comment: # deny (not forbid) to allow #[allow(unsafe_code)] in usearch FFI module.
- Add rand = "0.9" to [dev-dependencies] for random vector generation in tests.
- The usearch crate depends on cxx for C++ interop. This adds a C++ compiler requirement to the build. Document this in a top-level build note.
- If USearch does not expose a way to distinguish live vs tombstoned vectors, len_live() should track deletions via an internal AtomicUsize counter decremented on each delete() call.
- The view() method signature in the VectorIndex trait now takes (path, config) per the updated trait definition in Task 01. USearch requires knowing the index dimensions/metric to initialize the mmap'd index, so the config parameter is passed through to USearch construction before calling view().
- Do NOT implement per-query ef_search override in this task if the USearch crate does not support it cleanly. Accept the parameter, log a debug warning if it differs from the default, and use the index-level default. Per-query override can be added when the adaptive query planner (Task 04) needs it.
- Do NOT wrap UsearchIndex in RwLock unless testing reveals that concurrent insert + search causes data races. USearch claims thread safety for concurrent reads and writes. Verify in the integration test by running searches and inserts from multiple threads.
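The multi-threaded verification in the last note could be driven by a small harness like the one sketched below. Everything here is hypothetical scaffolding: `ConcurrentIndex` is a reduced stand-in for the VectorIndex trait, `hammer` is an invented name, and `MockIndex` is a Mutex-backed test double rather than an ANN index. In the real integration test, UsearchIndex would take its place.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical minimal surface needed by the harness.
pub trait ConcurrentIndex: Sync {
    fn insert(&self, id: u64, v: &[f32]);
    fn search(&self, q: &[f32], k: usize) -> Vec<u64>;
}

/// Runs `writers` inserting threads and `readers` searching threads at once.
/// Returns the total number of search calls that completed.
pub fn hammer<I: ConcurrentIndex>(
    index: &I,
    writers: u64,
    readers: usize,
    per_thread: u64,
    dim: usize,
) -> usize {
    thread::scope(|s| {
        for w in 0..writers {
            s.spawn(move || {
                for i in 0..per_thread {
                    let id = w * per_thread + i; // disjoint ID range per writer
                    index.insert(id, &vec![id as f32; dim]);
                }
            });
        }
        let handles: Vec<_> = (0..readers)
            .map(|_| {
                s.spawn(move || {
                    let q = vec![0.0f32; dim];
                    (0..per_thread)
                        .map(|_| {
                            index.search(&q, 10);
                            1usize
                        })
                        .sum::<usize>()
                })
            })
            .collect();
        // A panic in any thread propagates when the scope exits.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

/// Test double backed by a `Mutex<Vec<u64>>`; not an ANN index.
pub struct MockIndex(pub Mutex<Vec<u64>>);

impl ConcurrentIndex for MockIndex {
    fn insert(&self, id: u64, _v: &[f32]) {
        self.0.lock().unwrap().push(id);
    }
    fn search(&self, _q: &[f32], k: usize) -> Vec<u64> {
        self.0.lock().unwrap().iter().take(k).copied().collect()
    }
}
```

A harness like this only shows that no thread panicked and that every insert landed. It cannot prove the absence of data races, so running it under a tool such as ThreadSanitizer would give stronger evidence.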