# Task 02: USearch Backend

## Context

**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** Task 01 (VectorIndex trait, types, `l2_distance_sq`)
**Blocks:** Task 04 (Adaptive Query Planner -- needs USearch for benchmarking)
**Complexity:** L

## Objective

Deliver `UsearchIndex`, the production HNSW implementation wrapping the `usearch` Rust crate (Apache-2.0, C++ FFI via `cxx`). This is the performance-critical vector index that tidalDB uses for approximate nearest neighbor search at scale. At 10M vectors of dimension 1536, USearch achieves ~127K QPS at f32 and ~167K QPS at int8, with recall@100 > 95% -- numbers validated by ScyllaDB, ClickHouse, and DuckDB in production.

This is the only module in tidalDB where `#![forbid(unsafe_code)]` may be relaxed. The `usearch` crate uses CXX for C++ FFI, so `unsafe` may be required at the binding boundary. If it is, every `unsafe` block must have a `// SAFETY:` comment explaining why the invariants hold, and the `#[allow(unsafe_code)]` attribute is scoped to this single file (`storage/vector/usearch.rs`). The Lint Configuration section below records the current expectation that the crate's `Index` API is safe and no `unsafe` blocks are needed.

The USearch backend implements the full `VectorIndex` trait: `insert`, `search`, `filtered_search`, `delete`, `reserve`, `save`, `load`, `view`. It uses f16 quantization by default, M=16, ef_construction=200, ef_search=200 -- parameters validated by the research doc as optimal for 1536-dimensional embeddings at tidalDB's target scale.

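To make the quantization trade-off concrete, here is a back-of-the-envelope sketch of the raw vector payload at the target scale named above (10M vectors, 1536 dimensions). The helper names `raw_vector_bytes` and `target_scale_payloads` are hypothetical, and the figures cover vector data only, not HNSW graph links or allocator overhead.

```rust
/// Bytes needed to store `count` vectors of `dims` scalars each,
/// ignoring HNSW graph links and any per-entry metadata.
fn raw_vector_bytes(count: u64, dims: u64, bytes_per_scalar: u64) -> u64 {
    count * dims * bytes_per_scalar
}

/// Raw payload for 10M x 1536 vectors at f32, f16, and int8 (in that order).
fn target_scale_payloads() -> [u64; 3] {
    let (count, dims) = (10_000_000, 1536);
    [
        raw_vector_bytes(count, dims, 4), // f32: 4 bytes per scalar
        raw_vector_bytes(count, dims, 2), // f16: 2 bytes per scalar
        raw_vector_bytes(count, dims, 1), // int8: 1 byte per scalar
    ]
}
```

Under these assumptions the raw data comes to about 61.4 GB at f32, 30.7 GB at f16, and 15.4 GB at int8, which is why f16 is attractive as a default when its recall loss is small.
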
## Requirements

- `UsearchIndex` wraps `usearch::Index` from the `usearch` crate
- Implements `VectorIndex` trait from Task 01
- Default config: f16 quantization (`usearch::ScalarKind::F16`), M=16, ef_construction=200, ef_search=200, metric=L2sq
- `insert()` delegates to `usearch::Index::add(key, vector)`
- `search()` delegates to `usearch::Index::search(query, k)`
- `filtered_search()` delegates to `usearch::Index::filtered_search(query, k, predicate)`
- `delete()` delegates to `usearch::Index::remove(key)` (lazy tombstone)
- `reserve()` delegates to `usearch::Index::reserve(capacity)`
- `save()`, `load()`, `view()` delegate to USearch persistence methods
- `len()` and `len_live()` use USearch's `size()` and capacity reporting
- `#[allow(unsafe_code)]` scoped to `usearch.rs` only, with `// SAFETY:` on every unsafe block
- Integration test: insert 1000 random vectors, search for 10 query vectors, compare recall against `BruteForceIndex`
- `UsearchIndex` is `Send + Sync`

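The `Send + Sync` requirement is only useful if concurrent inserts and searches actually behave. The following sketch shows the shape of a multi-threaded smoke test. `StandInIndex` is a hypothetical, lock-based stand-in so the example compiles without the `usearch` crate; in the real test `UsearchIndex` would take its place, and `concurrent_smoke` is an illustrative name.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

/// Hypothetical stand-in for `UsearchIndex`; not the real backend.
struct StandInIndex {
    vectors: RwLock<HashMap<u64, Vec<f32>>>,
}

impl StandInIndex {
    fn new() -> Self {
        Self { vectors: RwLock::new(HashMap::new()) }
    }

    fn insert(&self, id: u64, v: &[f32]) {
        self.vectors.write().unwrap().insert(id, v.to_vec());
    }

    /// Exact nearest neighbour by squared L2 distance.
    fn nearest(&self, q: &[f32]) -> Option<u64> {
        let guard = self.vectors.read().unwrap();
        guard
            .iter()
            .map(|(id, v)| {
                let d: f32 = v.iter().zip(q).map(|(a, b)| (a - b) * (a - b)).sum();
                (*id, d)
            })
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(id, _)| id)
    }

    fn len(&self) -> usize {
        self.vectors.read().unwrap().len()
    }
}

/// Runs `writers` inserting threads alongside one searching thread and
/// returns the final vector count. Compiles only if the index is `Sync`.
fn concurrent_smoke(index: &StandInIndex, writers: u64, per_writer: u64) -> usize {
    thread::scope(|s| {
        for w in 0..writers {
            s.spawn(move || {
                for i in 0..per_writer {
                    let id = w * per_writer + i;
                    index.insert(id, &[id as f32, 0.0]);
                }
            });
        }
        s.spawn(move || {
            for _ in 0..100 {
                // Results vary while writers run; the point is the absence
                // of panics or crashes under concurrent access.
                let _ = index.nearest(&[0.0, 0.0]);
            }
        });
    });
    index.len()
}
```

A test would assert that the final count equals `writers * per_writer` and that a search after the threads join returns the expected neighbour.
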
## Technical Design

### Module Structure

```
tidal/src/storage/vector/
    usearch.rs    -- UsearchIndex, #[allow(unsafe_code)]
```

### Cargo.toml Addition

```toml
[dependencies]
usearch = "2"  # or latest stable version supporting filtered_search
```

Note: The exact version must be verified at implementation time. The `usearch` crate must support `filtered_search` with a predicate callback. If the latest published version does not support this API, the implementation must either:
1. Use a version that does (check crate changelog).
2. Fall back to `hnsw_rs` (pure Rust, `Filterable` trait) -- see Open Question 1 in OVERVIEW.md.

### Lint Configuration

**Unsafe code:** The `usearch` crate (v2.x) provides a safe Rust API at the `Index` level -- CXX bridge handles the FFI internally. At implementation time, verify that all `Index` methods (`add`, `search`, `filtered_search`, `remove`, `save`, `load`, `view`) have safe signatures. If confirmed safe, **do NOT add `#[allow(unsafe_code)]`** and keep crate-level `forbid(unsafe_code)`. Only add `#[allow(unsafe_code)]` if specific call sites require it, with `// SAFETY:` comments. The current expectation is that no unsafe blocks are needed in `usearch.rs`.

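If `unsafe` does turn out to be necessary, the lint level matters: an inner `allow(unsafe_code)` can override `deny`, but under `forbid` the same `allow` is itself a compile error. The sketch below illustrates the `deny` plus scoped `allow` arrangement using module-level attributes in place of the crate-level setting. The module and function names are hypothetical and unrelated to the real `usearch.rs`.

```rust
// Stand-in for the crate-level lint: `deny` can be overridden further down,
// whereas `forbid` would turn the `allow` below into a compile error.
#[deny(unsafe_code)]
mod storage_sketch {
    /// Hypothetical FFI-boundary module; the only place `unsafe` is permitted.
    #[allow(unsafe_code)]
    pub mod ffi_boundary {
        /// Reads the first element of a slice without a bounds check.
        pub fn first_unchecked(values: &[f32]) -> Option<f32> {
            if values.is_empty() {
                return None;
            }
            // SAFETY: the slice is non-empty (checked above), so index 0
            // is in bounds.
            Some(unsafe { *values.get_unchecked(0) })
        }
    }
}
```

Any `unsafe` block outside `ffi_boundary` would still be rejected, which keeps the exception as narrow as this task intends.
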
### Public API

```rust
// === storage/vector/usearch.rs ===
//! USearch HNSW backend for approximate nearest neighbor search.
//!
//! This module wraps the `usearch` crate (Apache-2.0, C++ FFI via CXX)
//! behind the `VectorIndex` trait. It is the ONLY module in tidalDB that
//! uses `unsafe` code, and only at the C++ FFI boundary.
//!
//! # Safety
//!
//! All unsafe blocks delegate to `usearch::Index` methods which perform
//! C++ interop via CXX. The safety invariants are:
//! - Vectors passed to USearch have the correct dimensionality (checked
//!   before the FFI call).
//! - The `usearch::Index` handle is valid for the lifetime of `UsearchIndex`.
//! - `reserve()` has been called with sufficient capacity before insertion.
#![allow(unsafe_code)]

use std::path::Path;
use super::{VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig, VectorError,
            DistanceMetric, QuantizationLevel};

/// Production HNSW index backed by USearch.
///
/// Uses f16 quantization by default, M=16, ef_construction=200, ef_search=200.
/// Supports concurrent reads and writes (validated by ScyllaDB at 1B vectors).
///
/// # Persistence
///
/// - `save(path)`: Full serialization to disk. Coordinated with WAL checkpoint.
/// - `load(path)`: Full deserialization into writable RAM.
/// - `view(path)`: Zero-copy mmap for read-only serving (instant restart).
pub struct UsearchIndex {
    inner: usearch::Index,
    config: VectorIndexConfig,
}

impl UsearchIndex {
    /// Create a new empty index with the given configuration.
    ///
    /// # Errors
    ///
    /// Returns `VectorError::Backend` if USearch fails to initialize.
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError>;
}
```

### Internal Design

**Index construction:**

```rust
impl UsearchIndex {
    pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError> {
        let metric = match config.metric {
            DistanceMetric::L2 => usearch::MetricKind::L2sq,
            DistanceMetric::InnerProduct => usearch::MetricKind::IP,
        };
        let quantization = match config.quantization {
            QuantizationLevel::F32 => usearch::ScalarKind::F32,
            QuantizationLevel::F16 => usearch::ScalarKind::F16,
            QuantizationLevel::Int8 => usearch::ScalarKind::I8,
        };

        let options = usearch::IndexOptions {
            dimensions: config.dimensions,
            metric,
            quantization,
            connectivity: config.connectivity,
            expansion_add: config.ef_construction,
            expansion_search: config.ef_search,
            ..Default::default()
        };

        // SAFETY: usearch::new_index performs C++ allocation via CXX.
        // The returned Index handle is valid until dropped.
        let inner = usearch::new_index(&options)
            .map_err(|e| VectorError::Backend(format!("USearch init failed: {e}")))?;

        Ok(Self { inner, config })
    }
}
```

**Insert implementation:**

```rust
fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError> {
    if embedding.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: embedding.len(),
        });
    }

    // SAFETY: embedding slice has correct length (checked above).
    // USearch::add performs C++ FFI to insert the vector into the HNSW graph.
    // The key (u64) and vector data are copied into USearch's internal storage.
    self.inner.add(id, embedding)
        .map_err(|e| VectorError::Backend(format!("USearch insert failed: {e}")))?;

    Ok(())
}
```

**Search implementation:**

```rust
fn search(
    &self,
    query: &[f32],
    k: usize,
    // Accepted for trait compatibility but not applied per query in this task;
    // the index-level ef_search is used (see the ef_search override note).
    _ef_search: usize,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }

    // SAFETY: query slice has correct length (checked above).
    // USearch::search performs HNSW traversal via C++ FFI.
    // Results are copied back into Rust-owned memory.
    let results = self.inner.search(query, k)
        .map_err(|e| VectorError::Backend(format!("USearch search failed: {e}")))?;

    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}
```

**Filtered search implementation:**

```rust
fn filtered_search(
    &self,
    query: &[f32],
    k: usize,
    // Accepted for trait compatibility but not applied per query in this task;
    // see the notes following this block.
    _ef_search: usize,
    filter: &dyn Fn(VectorId) -> bool,
) -> Result<Vec<VectorSearchResult>, VectorError> {
    if query.len() != self.config.dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: self.config.dimensions,
            got: query.len(),
        });
    }

    // SAFETY: query slice has correct length (checked above).
    // The predicate closure is called from C++ during HNSW traversal.
    // CXX marshals the u64 key to Rust and back. The closure captures
    // only the filter reference which outlives the search call.
    let results = self.inner.filtered_search(query, k, |key| filter(key))
        .map_err(|e| VectorError::Backend(format!("USearch filtered_search failed: {e}")))?;

    Ok(results.keys.iter().zip(results.distances.iter())
        .map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
        .collect())
}
```

**Note on `filtered_search` args:** USearch's `filtered_search` takes (query, count, filter) -- there is no `ef_search` parameter. To use a different `ef_search` for this query, call `self.inner.change_expansion_search(ef)` BEFORE `filtered_search`. See the ef_search override note below.

**ef_search override:** Calling `change_expansion_search(ef)` before a search changes an index-wide parameter, so under concurrent searches one query's override can leak into another. This is NOT safe. For M2 (single-threaded query path or low concurrency), wrap the `(change_expansion_search, search)` pair in a `Mutex` guard. For M7 and high concurrency, investigate USearch's thread-safe ef_search API or fix ef_search at construction time. Document this in the Open Questions.

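A minimal sketch of the `Mutex` guard described for M2 follows. `StandInIndex` is a hypothetical stand-in for `usearch::Index` that models only the index-wide expansion setting, `GuardedIndex` and `search_with_ef` are illustrative names, and treating `ef_search == 0` as "use the index default" is an assumption (the integration tests below pass `0`). The guard only helps if every search path takes the same lock.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for `usearch::Index`: one index-wide expansion
/// setting that every search reads.
struct StandInIndex {
    expansion_search: AtomicUsize,
}

impl StandInIndex {
    fn change_expansion_search(&self, ef: usize) {
        self.expansion_search.store(ef, Ordering::SeqCst);
    }

    /// Returns the expansion value the "search" actually ran with.
    fn search(&self) -> usize {
        self.expansion_search.load(Ordering::SeqCst)
    }
}

/// Wrapper that serializes (set ef, search, restore ef) behind one lock.
struct GuardedIndex {
    inner: StandInIndex,
    default_ef: usize,
    ef_lock: Mutex<()>,
}

impl GuardedIndex {
    fn new(default_ef: usize) -> Self {
        Self {
            inner: StandInIndex { expansion_search: AtomicUsize::new(default_ef) },
            default_ef,
            ef_lock: Mutex::new(()),
        }
    }

    /// Assumption: `ef_search == 0` means "use the index default".
    fn search_with_ef(&self, ef_search: usize) -> usize {
        // Held for the whole set/search/restore sequence.
        let _guard = self.ef_lock.lock().unwrap();
        let ef = if ef_search == 0 { self.default_ef } else { ef_search };
        self.inner.change_expansion_search(ef);
        let used = self.inner.search();
        // Restore the default so the override does not leak into later queries.
        self.inner.change_expansion_search(self.default_ef);
        used
    }
}
```

The cost of this design is that all searches are serialized, which is why the note above limits it to the low-concurrency M2 query path.
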
**Delete implementation:**

```rust
fn delete(&self, id: VectorId) -> Result<(), VectorError> {
    // SAFETY: USearch::remove performs lazy tombstoning via C++ FFI.
    // The node remains in the graph for navigation but is excluded from results.
    self.inner.remove(id)
        .map_err(|e| VectorError::Backend(format!("USearch delete failed: {e}")))?;
    Ok(())
}
```

**Persistence implementation:**

```rust
fn save(&self, path: &Path) -> Result<(), VectorError> {
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::save serializes the entire index to disk via C++ I/O.
    self.inner.save(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch save failed: {e}")))?;
    Ok(())
}

fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::load deserializes from disk into writable RAM via C++ I/O.
    index.inner.load(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch load failed: {e}")))?;
    Ok(index)
}

fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
    // view() now receives config, matching the updated VectorIndex trait
    // signature from Task 01 (Fix 2a). Create an index with the config
    // options, then call USearch's view() to mmap the file.
    let index = Self::new(config.clone())?;
    let path_str = path.to_str()
        .ok_or_else(|| VectorError::Io(std::io::Error::new(
            std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
    // SAFETY: USearch::view memory-maps the file for read-only access via C++ I/O.
    index.inner.view(path_str)
        .map_err(|e| VectorError::Backend(format!("USearch view failed: {e}")))?;
    Ok(index)
}
```

**`len` and `len_live` implementation:**

```rust
fn len(&self) -> usize {
    self.inner.size()
}

fn len_live(&self) -> usize {
    // USearch tracks live vs tombstoned internally.
    // If the crate exposes this, use it. Otherwise, len() is the best estimate.
    // Investigate at implementation time.
    self.inner.size() // may need adjustment
}
```

### Error Handling

- All USearch errors are mapped to `VectorError::Backend(String)` with the original error message.
- Dimension checks happen before any FFI call to provide clear Rust-side errors.
- I/O errors from persistence are mapped to `VectorError::Io` when possible, `VectorError::Backend` otherwise.
- If `reserve()` is not called before insertion and USearch fails, the error is `VectorError::Backend` with a message suggesting `reserve()`.

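The mapping rules above repeat across `save`, `load`, and `view`, so they could be factored into small helpers. The sketch below assumes a stand-in `VectorError` with only the two variants named in this section (the real type comes from Task 01 and may differ), and the helper names `path_to_str` and `backend_err` are hypothetical.

```rust
use std::fmt::Display;
use std::io;
use std::path::Path;

/// Stand-in for the Task 01 error type; only the variants used here.
#[derive(Debug)]
enum VectorError {
    Io(io::Error),
    Backend(String),
}

impl From<io::Error> for VectorError {
    fn from(e: io::Error) -> Self {
        VectorError::Io(e)
    }
}

/// Converts a path to `&str`, failing with `InvalidInput` on non-UTF-8 paths.
fn path_to_str(path: &Path) -> Result<&str, io::Error> {
    path.to_str()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "non-UTF-8 path"))
}

/// Wraps a backend error, keeping the original message and naming the operation.
fn backend_err(op: &str, e: impl Display) -> VectorError {
    VectorError::Backend(format!("USearch {op} failed: {e}"))
}

/// Shape of a persistence call using both helpers; `write` stands in for
/// the backend's save routine.
fn save_with<E: Display>(
    path: &Path,
    write: impl FnOnce(&str) -> Result<(), E>,
) -> Result<(), VectorError> {
    let path_str = path_to_str(path)?; // io::Error -> VectorError::Io
    write(path_str).map_err(|e| backend_err("save", e))
}
```

With helpers like these, each persistence method reduces to a path conversion followed by one backend call, and the error messages stay uniform.
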
## Test Strategy

### Integration Tests

```rust
// === tests/vector_usearch.rs (integration test) ===

use tidaldb::storage::vector::*;
use rand::Rng;

/// Generate a random unit vector of the given dimension.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
    // rand 0.9 API: `random` replaces the deprecated `gen`.
    let v: Vec<f32> = (0..dim).map(|_| rng.random::<f32>() - 0.5).collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

#[test]
fn usearch_insert_and_search_1000_vectors() {
    let dim = 128; // smaller dim for test speed
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };

    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(2000).unwrap();

    let brute_index = BruteForceIndex::new(config.clone());

    // rand 0.9 API: `rng()` replaces the deprecated `thread_rng()`.
    let mut rng = rand::rng();
    let vectors: Vec<(u64, Vec<f32>)> = (0..1000)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();

    // Insert into both indexes
    for (id, v) in &vectors {
        usearch_index.insert(*id, v).unwrap();
        brute_index.insert(*id, v).unwrap();
    }

    // Search with 10 random queries, measure recall
    let mut total_recall = 0.0;
    let k = 100;
    let n_queries = 10;

    for _ in 0..n_queries {
        let query = random_unit_vector(dim, &mut rng);

        let exact_results = brute_index.search(&query, k, 0).unwrap();
        let approx_results = usearch_index.search(&query, k, 0).unwrap();

        let exact_ids: std::collections::HashSet<u64> =
            exact_results.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> =
            approx_results.iter().map(|r| r.id).collect();

        let overlap = exact_ids.intersection(&approx_ids).count();
        let recall = overlap as f64 / k as f64;
        total_recall += recall;
    }

    let mean_recall = total_recall / n_queries as f64;
    assert!(mean_recall > 0.90,
        "recall@{k} should be > 0.90, got {mean_recall:.3}");
}

#[test]
fn usearch_filtered_search_excludes_non_matching() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::rng();
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    // Only include even IDs
    let query = random_unit_vector(dim, &mut rng);
    let results = index.filtered_search(&query, 50, 0, &|id| id % 2 == 0).unwrap();

    for r in &results {
        assert!(r.id % 2 == 0, "filtered_search returned odd ID {}", r.id);
    }
}

#[test]
fn usearch_delete_excludes_from_results() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::rng();
    let vectors: Vec<(u64, Vec<f32>)> = (0..50)
        .map(|id| (id, random_unit_vector(dim, &mut rng)))
        .collect();

    for (id, v) in &vectors {
        index.insert(*id, v).unwrap();
    }

    // Delete ID 0
    index.delete(0).unwrap();

    // Search for the deleted vector -- it should not appear
    let results = index.search(&vectors[0].1, 50, 0).unwrap();
    assert!(results.iter().all(|r| r.id != 0),
        "deleted vector should not appear in results");
}

#[test]
fn usearch_save_load_roundtrip() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(200).unwrap();

    let mut rng = rand::rng();
    for id in 0..100u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");

    // Save
    index.save(&path).unwrap();

    // Load
    let loaded = UsearchIndex::load(&path, &config).unwrap();
    assert_eq!(loaded.len(), 100);

    // Search on loaded index should produce similar results
    let query = random_unit_vector(dim, &mut rng);
    let results_orig = index.search(&query, 10, 0).unwrap();
    let results_loaded = loaded.search(&query, 10, 0).unwrap();

    // Top-1 should match (high probability for exact same index)
    assert_eq!(results_orig[0].id, results_loaded[0].id);
}

#[test]
fn usearch_view_readonly() {
    let dim = 64;
    let config = VectorIndexConfig {
        dimensions: dim,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config.clone()).unwrap();
    index.reserve(100).unwrap();

    let mut rng = rand::rng();
    for id in 0..50u64 {
        let v = random_unit_vector(dim, &mut rng);
        index.insert(id, &v).unwrap();
    }

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.usearch");
    index.save(&path).unwrap();

    // View (mmap read-only)
    let viewed = UsearchIndex::view(&path, &config).unwrap();
    assert_eq!(viewed.len(), 50);

    // Search should work on view'd index
    let query = random_unit_vector(dim, &mut rng);
    let results = viewed.search(&query, 10, 0).unwrap();
    assert!(!results.is_empty());
}

#[test]
fn usearch_dimension_mismatch() {
    let config = VectorIndexConfig {
        dimensions: 64,
        ..VectorIndexConfig::default()
    };

    let index = UsearchIndex::new(config).unwrap();
    index.reserve(10).unwrap();

    // Wrong dimension on insert
    let result = index.insert(1, &[1.0; 32]); // 32 dims instead of 64
    assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 64, got: 32 })));

    // Wrong dimension on search
    index.insert(1, &[0.0; 64]).unwrap();
    let result = index.search(&[1.0; 32], 1, 0);
    assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}

#[test]
fn usearch_is_send_and_sync() {
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<UsearchIndex>();
}

#[test]
fn usearch_recall_at_10k() {
    // Larger recall test at 10K vectors, matching the phase acceptance criteria.
    // Uses smaller dimensions (128) for test speed.
    let dim = 128;
    let n = 10_000;
    let k = 100;
    let config = VectorIndexConfig {
        dimensions: dim,
        metric: DistanceMetric::L2,
        quantization: QuantizationLevel::F16,
        connectivity: 16,
        ef_construction: 200,
        ef_search: 200,
    };

    let usearch_index = UsearchIndex::new(config.clone()).unwrap();
    usearch_index.reserve(n * 2).unwrap();

    let brute_index = BruteForceIndex::new(config);

    let mut rng = rand::rng();
    for id in 0..n as u64 {
        let v = random_unit_vector(dim, &mut rng);
        usearch_index.insert(id, &v).unwrap();
        brute_index.insert(id, &v).unwrap();
    }

    // 10 queries, compute mean recall@100
    let mut total_recall = 0.0;
    for _ in 0..10 {
        let query = random_unit_vector(dim, &mut rng);
        let exact = brute_index.search(&query, k, 0).unwrap();
        let approx = usearch_index.search(&query, k, 0).unwrap();

        let exact_ids: std::collections::HashSet<u64> = exact.iter().map(|r| r.id).collect();
        let approx_ids: std::collections::HashSet<u64> = approx.iter().map(|r| r.id).collect();
        let recall = exact_ids.intersection(&approx_ids).count() as f64 / k as f64;
        total_recall += recall;
    }

    let mean_recall = total_recall / 10.0;
    assert!(mean_recall > 0.95,
        "recall@{k} at {n} vectors should be > 0.95, got {mean_recall:.3}");
}
```

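Both recall tests above compute the same overlap ratio inline. A small helper, sketched here under the assumption that result IDs are `u64` as in the tests, would keep that definition in one place. `recall_at_k` is a hypothetical name, and like the tests it divides by `k`, so recall is understated if the index holds fewer than `k` vectors.

```rust
use std::collections::HashSet;

/// Fraction of the exact top-k IDs that also appear in the approximate top-k.
/// Only the first `k` entries of each slice are considered.
fn recall_at_k(exact_ids: &[u64], approx_ids: &[u64], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let exact: HashSet<u64> = exact_ids.iter().take(k).copied().collect();
    let approx: HashSet<u64> = approx_ids.iter().take(k).copied().collect();
    exact.intersection(&approx).count() as f64 / k as f64
}
```

In the tests, the two ID slices would come from mapping `r.id` over the `BruteForceIndex` and `UsearchIndex` results respectively.
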
## Acceptance Criteria

- [ ] `UsearchIndex` wraps `usearch::Index` from the `usearch` crate
- [ ] `UsearchIndex` implements `VectorIndex` trait (all methods)
- [ ] Default config: f16 quantization, M=16, ef_construction=200, ef_search=200, L2sq metric
- [ ] `insert()` validates dimensions before FFI call
- [ ] `search()` returns results sorted by ascending L2 distance
- [ ] `filtered_search()` passes predicate closure to USearch's callback API; all returned results satisfy the predicate
- [ ] `delete()` tombstones the vector; it is excluded from subsequent search results
- [ ] `reserve()` pre-allocates capacity in USearch
- [ ] `save()` persists the full index to disk
- [ ] `load()` restores a writable index from disk; search produces identical results
- [ ] `view()` memory-maps the index for read-only search
- [ ] `#[allow(unsafe_code)]` scoped to `usearch.rs` only
- [ ] Every `unsafe` block has a `// SAFETY:` comment
- [ ] Integration test: 1000 vectors, 10 queries, recall@100 > 0.90
- [ ] Integration test: 10K vectors, recall@100 > 0.95 (matching phase acceptance criteria)
- [ ] Integration test: filtered_search returns only predicate-matching results
- [ ] Integration test: save/load roundtrip preserves search results
- [ ] `UsearchIndex` is `Send + Sync`
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All integration tests pass

## Research References

- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- USearch evaluation: 127K QPS at f32, 167K QPS at int8, ScyllaDB validates concurrent operation at 1B vectors, f16 as optimal default (half memory, < 1% recall loss), `filtered_search(query, k, |key| predicate(key))` implements in-graph filtering, `view()` for zero-copy mmap serving

## Spec References

- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200, L2 distance over normalized vectors), Section 3 (filtered ANN: USearch predicate callback, in-graph filtering preserves graph navigation), Section 4 (quantization: f16 default, ScalarKind mapping), Section 7 (persistence: save/load/view lifecycle, checkpoint coordination), Section 11 (UsearchIndex implementation sketch), Section 12 (performance targets: < 10ms ANN at 10K, recall@100 > 95%)

## Implementation Notes

- Add `usearch = "2"` (or the latest stable version with `filtered_search` support) to `tidal/Cargo.toml` `[dependencies]`.
- Change `[lints.rust] unsafe_code` from `"forbid"` to `"deny"` in `Cargo.toml`. Add a comment: `# deny (not forbid) to allow #[allow(unsafe_code)] in usearch FFI module`. This is only needed if `usearch.rs` ends up requiring `#[allow(unsafe_code)]` (see Lint Configuration).
- Add `rand = "0.9"` to `[dev-dependencies]` for random vector generation in tests. The persistence tests also use `tempfile`, so it must be in `[dev-dependencies]` as well if it is not already.
- The `usearch` crate depends on `cxx` for C++ interop. This adds a C++ compiler requirement to the build. Document this in a top-level build note.
- If USearch does not expose a way to distinguish live vs tombstoned vectors, `len_live()` should maintain its own count in an internal `AtomicUsize`, incremented on each successful `insert()` and decremented on each successful `delete()` call.
- The `view()` method signature in the `VectorIndex` trait now takes `(path, config)` per the updated trait definition in Task 01. USearch requires knowing the index dimensions/metric to initialize the mmap'd index, so the config parameter is passed through to USearch construction before calling `view()`.
- Do NOT implement per-query `ef_search` override in this task if the USearch crate does not support it cleanly. Accept the parameter, log a debug warning if it differs from the default, and use the index-level default. Per-query override can be added when the adaptive query planner (Task 04) needs it.
- Do NOT wrap `UsearchIndex` in `RwLock` unless testing reveals that concurrent `insert` + `search` causes data races. USearch claims thread safety for concurrent reads and writes. Verify in the integration test by running searches and inserts from multiple threads.
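
The `AtomicUsize` fallback for `len_live()` mentioned above could look roughly like this. `LiveCounter` and its method names are hypothetical. The sketch assumes the backend reports how many entries a delete actually removed, and it does not handle re-inserting an existing key, which the real implementation would need to decide on.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical live-vector bookkeeping kept alongside the USearch handle.
struct LiveCounter {
    live: AtomicUsize,
}

impl LiveCounter {
    fn new() -> Self {
        Self { live: AtomicUsize::new(0) }
    }

    /// Call after a successful insert of a new key.
    fn on_insert(&self) {
        self.live.fetch_add(1, Ordering::Relaxed);
    }

    /// Call after a delete; `removed` is the number of entries the backend
    /// reports as removed (0 if the key was absent). Saturates at zero.
    fn on_delete(&self, removed: usize) {
        let _ = self
            .live
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
                Some(n.saturating_sub(removed))
            });
    }

    fn len_live(&self) -> usize {
        self.live.load(Ordering::Relaxed)
    }
}
```

After `load()` or `view()`, the counter would need to be re-seeded from the index's reported size, since the on-disk file carries no record of this Rust-side count.
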