jordan 6fdaa1584b feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint

- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples
  (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test
  coverage for every public API surface
- m1p5: TidalDb public API — write_item, signal, read_decay_score,
  read_windowed_count, read_velocity; StorageBox enum routing memory vs
  fjall; WalSender/WalHandleWriter bridge; WAL replay on open
- Periodic checkpoint: 30s background thread for persistent+schema mode;
  FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful
  shutdown via Arc<AtomicBool> + join before final checkpoint
- ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing)
- Milestone 2 planning scaffolding added under docs/planning/milestone-2/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 22:45:10 -07:00


Task 01: VectorIndex Trait + BruteForceIndex

Context

Milestone: 2 -- Ranked Retrieval
Phase: m2p1 -- Vector Index Integration (USearch)
Depends On: None (uses types from m1p1 but no m2p1 tasks)
Blocks: Task 02 (USearch Backend), Task 03 (Embedding Lifecycle), Task 04 (Adaptive Query Planner)
Complexity: M

Objective

Deliver the VectorIndex trait -- the public interface for all ANN operations in tidalDB -- along with the full type system for vector search (VectorId, VectorSearchResult, VectorIndexConfig, DistanceMetric, QuantizationLevel, VectorError) and two pure-Rust implementations: BruteForceIndex (exact linear-scan search) and MockVectorIndex (predetermined results for unit tests).

The VectorIndex trait is the abstraction boundary. No module outside storage/vector/ will ever know whether USearch, hnsw_rs, or brute-force is behind it. This is the same pattern as StorageEngine in m1p3: define the trait first, implement brute-force for correctness, then add the production backend in the next task.

BruteForceIndex is not a throwaway. It serves three permanent roles:

Correctness oracle -- recall measurements compare HNSW results against BruteForceIndex exact results.
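Small datasets -- below roughly 10,000 vectors, brute-force can be faster than HNSW because it has no graph construction or traversal overhead. The exact breakeven depends on dimensions and hardware; the research notes listed under Research References put it at ~2,000-5,000 vectors.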
Pre-filter fallback -- the adaptive query planner (Task 04) uses BruteForceIndex-style linear scan over bitmap-filtered candidate sets when selectivity < 1%.
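
The correctness-oracle role can be sketched as a recall computation. This is a minimal illustration, not part of the task's API: `recall_at_k` is a hypothetical helper that compares the IDs returned by an approximate index against the exact IDs from a brute-force scan.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not in the spec): fraction of the exact top-k IDs
/// that the approximate result list also contains.
fn recall_at_k(exact: &[u64], approx: &[u64]) -> f64 {
    if exact.is_empty() {
        return 1.0; // nothing to find, so nothing was missed
    }
    let approx_ids: HashSet<u64> = approx.iter().copied().collect();
    let hits = exact.iter().filter(|id| approx_ids.contains(id)).count();
    hits as f64 / exact.len() as f64
}
```

In a recall test, `exact` would come from BruteForceIndex::search and `approx` from the HNSW backend, both queried with the same vector and the same k.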

No unsafe code in this task. Pure Rust throughout.

Requirements

VectorIndex trait: insert, search, filtered_search, delete, reserve, save, load, view, len, len_live, is_empty, tombstone_ratio
All trait methods match the signatures in Spec 07, Section 11
VectorIndex: Send + Sync bound
VectorId = u64 type alias
VectorSearchResult { id: VectorId, distance: f32 } with Debug, Clone
VectorIndexConfig with all HNSW parameters
DistanceMetric enum: L2, InnerProduct
QuantizationLevel enum: F32, F16, Int8
VectorError enum with Display, Debug, From<std::io::Error>
BruteForceIndex: RwLock<HashMap<VectorId, Vec<f32>>> for storage, linear scan for search
BruteForceIndex::search returns results sorted by ascending L2 squared distance
BruteForceIndex::filtered_search applies predicate during linear scan, returns only matching results
BruteForceIndex::delete removes the vector from the HashMap (true delete, not tombstone)
BruteForceIndex::save/load/view use a simple binary format for test persistence
MockVectorIndex: predetermined results, call recording for test assertions
No unsafe code

Technical Design

Module Structure

tidal/src/storage/vector/
  mod.rs      -- VectorIndex trait, all types, re-exports
  brute.rs    -- BruteForceIndex, MockVectorIndex

Public API

// === storage/vector/mod.rs ===

use std::path::Path;

/// A unique identifier for an entity in the vector index.
/// Corresponds to the u64 representation of the application-provided entity ID.
pub type VectorId = u64;

/// A scored search result from the vector index.
#[derive(Debug, Clone)]
pub struct VectorSearchResult {
    /// Entity ID in the vector index.
    pub id: VectorId,
    /// L2 squared distance from query vector. Lower = more similar.
    /// For L2-normalized vectors, range is [0.0, 4.0] where 0.0 = identical.
    pub distance: f32,
}

/// Configuration for vector index construction.
#[derive(Debug, Clone)]
pub struct VectorIndexConfig {
    /// Number of dimensions per vector.
    pub dimensions: usize,
    /// Distance metric.
    pub metric: DistanceMetric,
    /// Quantization level for stored vectors.
    pub quantization: QuantizationLevel,
    /// Maximum connections per node per layer (M parameter). Default: 16.
    pub connectivity: usize,
    /// Beam width during index construction. Default: 200.
    pub ef_construction: usize,
    /// Default beam width during search (overridable per query). Default: 200.
    pub ef_search: usize,
}

impl Default for VectorIndexConfig {
    fn default() -> Self {
        Self {
            dimensions: 1536,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F16,
            connectivity: 16,
            ef_construction: 200,
            ef_search: 200,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistanceMetric {
    /// L2 squared distance. Default for cosine over normalized vectors.
    L2,
    /// Inner product. For MIPS workloads (with XBOX transformation).
    InnerProduct,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantizationLevel {
    /// Full precision (4 bytes per dimension).
    F32,
    /// Half precision (2 bytes per dimension). Default.
    F16,
    /// Scalar quantization (1 byte per dimension).
    Int8,
}

/// Errors from vector index operations.
#[derive(Debug)]
pub enum VectorError {
    /// Vector dimensions do not match index configuration.
    DimensionMismatch { expected: usize, got: usize },
    /// Index is at capacity and cannot accept more vectors.
    CapacityExceeded { capacity: usize },
    /// Vector ID not found in the index.
    NotFound { id: VectorId },
    /// I/O error during persistence.
    Io(std::io::Error),
    /// Index file is corrupted or incompatible.
    CorruptedIndex(String),
    /// USearch or backend-specific error.
    Backend(String),
    /// Vector has zero L2 norm and cannot be normalized.
    ZeroNormVector,
}

// Note: `ZeroNormVector` is not in Spec 07 Section 11 but is required by `l2_normalize()` in Task 03. Spec 07 should be updated to include it.

impl std::fmt::Display for VectorError { /* variant-specific messages */ }
impl std::error::Error for VectorError {}
impl From<std::io::Error> for VectorError { /* wraps as VectorError::Io */ }

/// The vector index trait. All ANN operations go through this interface.
///
/// Implementations must be `Send + Sync` for concurrent search + insert.
///
/// # Contract
///
/// - Vectors passed to `insert()` must already be L2-normalized. The trait
///   does not normalize -- the caller (embedding lifecycle, Task 03) is
///   responsible for normalization before insertion.
/// - `search()` and `filtered_search()` return results sorted by ascending
///   distance (most similar first).
/// - `delete()` marks a vector as tombstoned. Tombstoned vectors are excluded
///   from search results but may remain in the index structure.
pub trait VectorIndex: Send + Sync {
    /// Insert a vector into the index.
    ///
    /// If a vector with this ID already exists, it is replaced (delete + insert).
    ///
    /// # Errors
    ///
    /// - `VectorError::CapacityExceeded` if the index is full.
    /// - `VectorError::DimensionMismatch` if `embedding.len() != config.dimensions`.
    fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError>;

    /// Search for the K nearest neighbors to the query vector.
    ///
    /// Results are ordered by ascending distance (most similar first).
    ///
    /// # Arguments
    ///
    /// * `query` -- The query vector. Must be L2-normalized.
    /// * `k` -- Number of results to return.
    /// * `ef_search` -- Beam width override. If 0, uses the index default.
    fn search(
        &self,
        query: &[f32],
        k: usize,
        ef_search: usize,
    ) -> Result<Vec<VectorSearchResult>, VectorError>;

    /// Search for the K nearest neighbors that satisfy a filter predicate.
    ///
    /// The predicate is evaluated during traversal. Nodes failing the predicate
    /// are used for navigation but excluded from results (in-graph filtering).
    ///
    /// # Arguments
    ///
    /// * `query` -- The query vector. Must be L2-normalized.
    /// * `k` -- Number of results to return.
    /// * `ef_search` -- Beam width override. If 0, uses the index default.
    /// * `filter` -- Predicate per candidate node. Return `true` to include.
    fn filtered_search(
        &self,
        query: &[f32],
        k: usize,
        ef_search: usize,
        filter: &dyn Fn(VectorId) -> bool,
    ) -> Result<Vec<VectorSearchResult>, VectorError>;

    /// Remove a vector from the index (lazy tombstone).
    ///
    /// # Errors
    ///
    /// - `VectorError::NotFound` if the ID is not in the index.
    fn delete(&self, id: VectorId) -> Result<(), VectorError>;

    /// Reserve capacity for at least `additional` more vectors.
    fn reserve(&self, additional: usize) -> Result<(), VectorError>;

    /// Persist the index to disk.
    fn save(&self, path: &Path) -> Result<(), VectorError>;

    /// Load an index from disk into writable memory.
    fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError>
    where
        Self: Sized;

    /// Memory-map an index from disk for read-only access.
    // config required by USearch to initialize the mmap'd index with correct parameters
    fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError>
    where
        Self: Sized;

    /// Number of vectors in the index (including tombstoned).
    fn len(&self) -> usize;

    /// Number of live (non-tombstoned) vectors.
    fn len_live(&self) -> usize;

    /// Whether the index is empty.
    fn is_empty(&self) -> bool {
        self.len_live() == 0
    }

    /// Ratio of tombstoned vectors to total vectors.
    fn tombstone_ratio(&self) -> f64 {
        if self.len() == 0 {
            0.0
        } else {
            (self.len() - self.len_live()) as f64 / self.len() as f64
        }
    }
}
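
The two default methods are easiest to check with concrete numbers. The sketch below copies them onto a cut-down trait; `Counts` and `Lazy` are hypothetical names used only for this illustration, standing in for a lazy-tombstone backend.

```rust
/// Cut-down trait with only the two counters and the two default methods.
trait Counts {
    fn len(&self) -> usize;
    fn len_live(&self) -> usize;

    fn is_empty(&self) -> bool {
        self.len_live() == 0
    }

    fn tombstone_ratio(&self) -> f64 {
        if self.len() == 0 {
            0.0
        } else {
            (self.len() - self.len_live()) as f64 / self.len() as f64
        }
    }
}

/// Hypothetical lazy-tombstone backend described only by its counters.
struct Lazy {
    total: usize,
    live: usize,
}

impl Counts for Lazy {
    fn len(&self) -> usize {
        self.total
    }
    fn len_live(&self) -> usize {
        self.live
    }
}
```

Note that `is_empty()` is defined on live vectors, so an index whose vectors are all tombstoned reports `is_empty() == true` while `len() > 0` and `tombstone_ratio() == 1.0`.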

BruteForceIndex

// === storage/vector/brute.rs ===

use std::collections::HashMap;
use std::sync::RwLock;
use std::path::Path;
use std::io::{Read, Write, BufReader, BufWriter};
use std::fs::File;
use super::{VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig, VectorError};

/// Exact nearest-neighbor search via linear scan.
///
/// Used for:
/// 1. Correctness verification (recall measurement against HNSW).
/// 2. Small datasets (< 10,000 vectors where brute-force is faster).
/// 3. Pre-filter fallback (adaptive query planner uses brute-force for
///    very selective filters where the filtered set is small).
pub struct BruteForceIndex {
    vectors: RwLock<HashMap<VectorId, Vec<f32>>>,
    config: VectorIndexConfig,
}

impl BruteForceIndex {
    pub fn new(config: VectorIndexConfig) -> Self;

    /// Number of vectors (HashMap length).
    fn vector_count(&self) -> usize;
}

Search implementation:

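Validate query.len() == config.dimensions; otherwise return VectorError::DimensionMismatch
Acquire read lock on vectors
Compute L2 squared distance between query and every stored vector
Collect (VectorId, f32) pairs into a Vec
Sort by ascending distance
Take the first k results (or all of them if fewer than k exist)
Return as Vec<VectorSearchResult>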
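
The steps above can be sketched as a free function over a plain HashMap. This is an illustration under assumptions, not the method itself: `linear_scan` is a hypothetical name, locking and dimension checks are left out, and the tie-break on ID is added here only to make the output deterministic.

```rust
use std::collections::HashMap;

/// Sum of squared differences, matching `l2_distance_sq` in this task.
fn l2_distance_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Illustrative free function (the real code is a method holding a read lock).
/// Returns up to `k` `(id, distance)` pairs, nearest first.
fn linear_scan(vectors: &HashMap<u64, Vec<f32>>, query: &[f32], k: usize) -> Vec<(u64, f32)> {
    // Score every stored vector against the query.
    let mut scored: Vec<(u64, f32)> = vectors
        .iter()
        .map(|(id, v)| (*id, l2_distance_sq(query, v)))
        .collect();
    // Sort ascending by distance; break ties by ID so the output does not
    // depend on HashMap iteration order (an assumption of this sketch).
    scored.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    // Keep at most k results; a shorter list is returned if fewer exist.
    scored.truncate(k);
    scored
}
```

A full sort is O(n log n); for small k a bounded heap would avoid sorting the whole list, but the simple form is easier to trust as an oracle.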

L2 squared distance function:

/// Compute L2 squared distance between two vectors of equal length.
///
/// For L2-normalized vectors, this is equivalent to `2 - 2 * cos(a, b)`.
/// Returns sum of squared differences.
pub(crate) fn l2_distance_sq(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| {
            let d = x - y;
            d * d
        })
        .sum()
}
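
The identity quoted in the doc comment follows from expanding the squared norm, assuming both a and b have unit length:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b = 1 + 1 - 2\cos(a, b) = 2 - 2\cos(a, b)
```

Since the cosine lies in [-1, 1], the distance lies in [0, 4]: 0 for identical vectors, 2 for orthogonal ones, and 4 for opposite ones.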

Persistence (save/load/view):

BruteForceIndex uses a simple binary format for test persistence:

Header:
  [magic: 4 bytes "BFVI"]
  [version: 1 byte (0x01)]
  [dimensions: 4 bytes LE]
  [count: 8 bytes LE]

Per vector:
  [id: 8 bytes LE]
  [vector: dimensions * 4 bytes, f32 LE]

view() loads the same file as load() (brute-force has no mmap mode -- it is always in-memory). This is acceptable because BruteForceIndex is not the production backend.
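
The layout above can be sketched as a pair of encode/decode functions over generic readers and writers. The function names and the use of `io::ErrorKind::InvalidData` are assumptions of this sketch; the task itself maps a bad header to VectorError::CorruptedIndex.

```rust
use std::io::{self, Read, Write};

const MAGIC: &[u8; 4] = b"BFVI";
const VERSION: u8 = 0x01;

/// Write header + records to any `Write` sink (a file in the real code).
fn encode<W: Write>(out: &mut W, dims: u32, vectors: &[(u64, Vec<f32>)]) -> io::Result<()> {
    out.write_all(MAGIC)?;
    out.write_all(&[VERSION])?;
    out.write_all(&dims.to_le_bytes())?;
    out.write_all(&(vectors.len() as u64).to_le_bytes())?;
    for (id, v) in vectors {
        out.write_all(&id.to_le_bytes())?;
        for x in v {
            out.write_all(&x.to_le_bytes())?;
        }
    }
    Ok(())
}

/// Read the same layout back, rejecting an unknown magic or version.
fn decode<R: Read>(input: &mut R) -> io::Result<(u32, Vec<(u64, Vec<f32>)>)> {
    let mut head = [0u8; 5];
    input.read_exact(&mut head)?;
    if head[..4] != MAGIC[..] || head[4] != VERSION {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic or version"));
    }
    let mut b4 = [0u8; 4];
    let mut b8 = [0u8; 8];
    input.read_exact(&mut b4)?;
    let dims = u32::from_le_bytes(b4);
    input.read_exact(&mut b8)?;
    let count = u64::from_le_bytes(b8);
    let mut vectors = Vec::new();
    for _ in 0..count {
        input.read_exact(&mut b8)?;
        let id = u64::from_le_bytes(b8);
        let mut v = Vec::with_capacity(dims as usize);
        for _ in 0..dims {
            input.read_exact(&mut b4)?;
            v.push(f32::from_le_bytes(b4));
        }
        vectors.push((id, v));
    }
    Ok((dims, vectors))
}
```

With this layout the header is 17 bytes and each record is 8 + dimensions * 4 bytes, so a truncated file surfaces as an `UnexpectedEof` error from `read_exact`.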

Filtered search: Same as search(), but the predicate runs first: vectors where filter(id) == false are skipped before any distance is computed. Brute-force filtered search therefore computes distances only for vectors that pass the filter, which is why it is fast for very selective filters.
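
A minimal sketch of that ordering, assuming a plain HashMap in place of the locked field. `filtered_scan` is a hypothetical name, and the returned evaluation counter exists only to make the saved work observable.

```rust
use std::collections::HashMap;

/// Illustrative free function; returns the hits plus the number of distance
/// evaluations performed, to make the cost saving visible.
fn filtered_scan(
    vectors: &HashMap<u64, Vec<f32>>,
    query: &[f32],
    k: usize,
    filter: &dyn Fn(u64) -> bool,
) -> (Vec<(u64, f32)>, usize) {
    let mut evaluated = 0usize;
    let mut scored: Vec<(u64, f32)> = vectors
        .iter()
        .filter(|(id, _)| filter(**id)) // predicate runs first
        .map(|(id, v)| {
            evaluated += 1; // only survivors reach the distance computation
            let d: f32 = query.iter().zip(v).map(|(x, y)| (x - y) * (x - y)).sum();
            (*id, d)
        })
        .collect();
    // Ascending distance, ties broken by ID for a deterministic order.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    scored.truncate(k);
    (scored, evaluated)
}
```

With a filter that passes 1% of IDs, only about 1% of the stored vectors incur a distance computation, although every ID is still visited once to evaluate the predicate.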

MockVectorIndex

/// Configurable mock for unit tests.
///
/// Returns predetermined results from search calls and records all method
/// invocations for verification.
pub struct MockVectorIndex {
    search_results: RwLock<Vec<Vec<VectorSearchResult>>>,
    call_log: RwLock<Vec<VectorIndexCall>>,
    config: VectorIndexConfig,
    inserted_count: RwLock<usize>,
}

#[derive(Debug, Clone)]
pub enum VectorIndexCall {
    Insert { id: VectorId },
    Delete { id: VectorId },
    Search { k: usize, ef_search: usize },
    FilteredSearch { k: usize, ef_search: usize },
    Reserve { additional: usize },
    Save,
    Load,
    View,
}

impl MockVectorIndex {
    /// Create a mock with predetermined search results.
    ///
    /// Each call to `search()` or `filtered_search()` pops the first element
    /// from `search_results`. If empty, returns an empty Vec.
    pub fn new(config: VectorIndexConfig, search_results: Vec<Vec<VectorSearchResult>>) -> Self;

    /// Get the recorded call log.
    pub fn calls(&self) -> Vec<VectorIndexCall>;

    /// Clear the call log.
    pub fn clear_calls(&self);
}
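
The queue-and-log behavior described in the doc comments can be sketched with a reduced mock. `MiniMock`, `Hit`, and `Call` are shortened, hypothetical stand-ins for MockVectorIndex, VectorSearchResult, and VectorIndexCall; only `search` is modeled.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, PartialEq)]
struct Hit {
    id: u64,
    distance: f32,
}

#[derive(Debug, Clone, PartialEq)]
enum Call {
    Search { k: usize, ef_search: usize },
}

/// Reduced stand-in for `MockVectorIndex` (names shortened for this sketch).
struct MiniMock {
    search_results: RwLock<Vec<Vec<Hit>>>,
    call_log: RwLock<Vec<Call>>,
}

impl MiniMock {
    fn new(search_results: Vec<Vec<Hit>>) -> Self {
        Self {
            search_results: RwLock::new(search_results),
            call_log: RwLock::new(Vec::new()),
        }
    }

    /// Record the call, then hand out the next queued result (front first).
    fn search(&self, k: usize, ef_search: usize) -> Vec<Hit> {
        self.call_log.write().unwrap().push(Call::Search { k, ef_search });
        let mut queue = self.search_results.write().unwrap();
        if queue.is_empty() {
            Vec::new()
        } else {
            queue.remove(0)
        }
    }

    /// Snapshot of every call recorded so far.
    fn calls(&self) -> Vec<Call> {
        self.call_log.read().unwrap().clone()
    }
}
```

Interior mutability through RwLock is what lets the mock mutate its queue and log behind the `&self` receivers that the trait requires.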

Error Handling

insert() with wrong dimensions: returns VectorError::DimensionMismatch { expected, got }.
search() with wrong query dimensions: returns VectorError::DimensionMismatch.
delete() for unknown ID: returns VectorError::NotFound { id }.
save()/load() I/O failures: returns VectorError::Io(e).
load() with corrupt file: returns VectorError::CorruptedIndex(msg).
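
A sketch of how these cases could map onto the elided Display and From impls. The message strings, the reduced variant set, and the helpers `check_dims` and `read_magic` are assumptions made for illustration; Spec 07 may word or structure them differently.

```rust
use std::fmt;
use std::io;

/// Reduced copy of the error enum, limited to the cases listed above.
#[derive(Debug)]
enum VectorError {
    DimensionMismatch { expected: usize, got: usize },
    NotFound { id: u64 },
    Io(io::Error),
    CorruptedIndex(String),
}

impl fmt::Display for VectorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::DimensionMismatch { expected, got } => {
                write!(f, "dimension mismatch: expected {expected}, got {got}")
            }
            Self::NotFound { id } => write!(f, "vector {id} not found"),
            Self::Io(e) => write!(f, "I/O error: {e}"),
            Self::CorruptedIndex(msg) => write!(f, "corrupted index: {msg}"),
        }
    }
}

impl std::error::Error for VectorError {}

impl From<io::Error> for VectorError {
    fn from(e: io::Error) -> Self {
        Self::Io(e)
    }
}

/// Hypothetical helper: the check shared by insert() and search().
fn check_dims(expected: usize, got: usize) -> Result<(), VectorError> {
    if expected == got {
        Ok(())
    } else {
        Err(VectorError::DimensionMismatch { expected, got })
    }
}

/// Hypothetical helper: distinguishes I/O failure from a corrupt header.
fn read_magic(mut input: impl io::Read) -> Result<[u8; 4], VectorError> {
    let mut magic = [0u8; 4];
    input.read_exact(&mut magic)?; // io::Error becomes VectorError::Io via From
    if &magic != b"BFVI" {
        return Err(VectorError::CorruptedIndex("bad magic".into()));
    }
    Ok(magic)
}
```

The From impl is what allows `?` on std::io calls inside save() and load() without manual mapping, while format-level problems are reported explicitly as CorruptedIndex.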

Test Strategy

Property Tests

use proptest::prelude::*;

// Insert + search roundtrip: every inserted vector is retrievable.
proptest! {
    #[test]
    fn insert_search_roundtrip(
        dim in 2usize..64,
        n_vectors in 1usize..200,
        k in 1usize..50,
    ) {
        let k = k.min(n_vectors);
        let config = VectorIndexConfig {
            dimensions: dim,
            ..VectorIndexConfig::default()
        };
        let index = BruteForceIndex::new(config);

        // Insert random unit vectors
        let mut rng = proptest::test_runner::TestRng::deterministic_rng(
            proptest::test_runner::RngAlgorithm::ChaCha
        );
        for id in 0..n_vectors as u64 {
            let v: Vec<f32> = (0..dim).map(|_| rng.gen::<f32>() - 0.5).collect();
            let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
            let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();
            index.insert(id, &unit).unwrap();
        }

        // Search for each inserted vector: it should be the top-1 result
        for id in 0..n_vectors as u64 {
            // Note: test must be in the module (or use pub(crate) vectors field) to access this private field.
            let v = index.vectors.read().unwrap()[&id].clone();
            let results = index.search(&v, 1, 0).unwrap();
            prop_assert!(!results.is_empty());
            prop_assert_eq!(results[0].id, id);
            prop_assert!(results[0].distance < 1e-6, "self-search should return distance ~0");
        }
    }
}

// Delete excludes tombstoned IDs from search results.
proptest! {
    #[test]
    fn delete_excludes_from_results(
        dim in 2usize..32,
        n_vectors in 5usize..100,
    ) {
        let config = VectorIndexConfig {
            dimensions: dim,
            ..VectorIndexConfig::default()
        };
        let index = BruteForceIndex::new(config);

        // Insert distinct vectors (the formula varies with the vector index n,
        // so the stored vectors are not all identical)
        let vectors: Vec<Vec<f32>> = (0..n_vectors).map(|n| {
            let v: Vec<f32> = (0..dim).map(|i| ((n * 3 + i * 7 + 13) % 100) as f32 / 100.0 - 0.5).collect();
            let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
            v.iter().map(|x| x / norm).collect()
        }).collect();
        for (id, v) in vectors.iter().enumerate() {
            index.insert(id as u64, v).unwrap();
        }

        // Delete the first vector
        index.delete(0).unwrap();

        // Search should not return deleted ID
        let query = &vectors[0];
        let results = index.search(query, n_vectors, 0).unwrap();
        prop_assert!(results.iter().all(|r| r.id != 0),
            "deleted vector should not appear in results");
        prop_assert_eq!(results.len(), n_vectors - 1);
    }
}

// filtered_search honors all predicates.
proptest! {
    #[test]
    fn filtered_search_honors_predicate(
        dim in 2usize..32,
        n_vectors in 10usize..100,
        k in 1usize..20,
    ) {
        let k = k.min(n_vectors / 2);
        let config = VectorIndexConfig {
            dimensions: dim,
            ..VectorIndexConfig::default()
        };
        let index = BruteForceIndex::new(config);

        for id in 0..n_vectors as u64 {
            let v: Vec<f32> = (0..dim).map(|i| ((id as usize * 3 + i * 7) % 100) as f32 / 100.0).collect();
            let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
            let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();
            index.insert(id, &unit).unwrap();
        }

        // Filter: only even IDs
        let predicate = |id: VectorId| id % 2 == 0;
        let query: Vec<f32> = (0..dim).map(|i| (i as f32) / dim as f32).collect();
        let norm: f32 = query.iter().map(|x| x * x).sum::<f32>().sqrt();
        let unit_query: Vec<f32> = query.iter().map(|x| x / norm).collect();

        let results = index.filtered_search(&unit_query, k, 0, &predicate).unwrap();
        for r in &results {
            prop_assert!(r.id % 2 == 0,
                "filtered_search returned odd ID {}", r.id);
        }
    }
}

// Search results are sorted by ascending distance.
proptest! {
    #[test]
    fn results_sorted_by_distance(
        dim in 2usize..32,
        n_vectors in 5usize..100,
        k in 2usize..50,
    ) {
        let k = k.min(n_vectors);
        let config = VectorIndexConfig {
            dimensions: dim,
            ..VectorIndexConfig::default()
        };
        let index = BruteForceIndex::new(config);

        for id in 0..n_vectors as u64 {
            let v: Vec<f32> = (0..dim).map(|i| ((id as usize + i) % 100) as f32 / 100.0).collect();
            let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
            let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();
            index.insert(id, &unit).unwrap();
        }

        let query: Vec<f32> = vec![1.0 / (dim as f32).sqrt(); dim];
        let results = index.search(&query, k, 0).unwrap();
        for w in results.windows(2) {
            prop_assert!(w[0].distance <= w[1].distance,
                "results not sorted: {} > {}", w[0].distance, w[1].distance);
        }
    }
}

Unit Tests

#[test]
fn brute_force_new_is_empty() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    assert_eq!(index.len(), 0);
    assert_eq!(index.len_live(), 0);
    assert!(index.is_empty());
    assert!((index.tombstone_ratio() - 0.0).abs() < f64::EPSILON);
}

#[test]
fn brute_force_insert_and_len() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
    assert_eq!(index.len(), 2);
    assert_eq!(index.len_live(), 2);
    assert!(!index.is_empty());
}

#[test]
fn brute_force_dimension_mismatch() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let result = index.insert(1, &[1.0, 0.0]); // 2 dims instead of 3
    assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 3, got: 2 })));
}

#[test]
fn brute_force_search_dimension_mismatch() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    let result = index.search(&[1.0, 0.0], 1, 0); // 2 dims query
    assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}

#[test]
fn brute_force_self_search_distance_zero() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let v = [1.0, 0.0, 0.0];
    index.insert(42, &v).unwrap();
    let results = index.search(&v, 1, 0).unwrap();
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].id, 42);
    assert!(results[0].distance < 1e-6);
}

#[test]
fn brute_force_search_empty_index() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let results = index.search(&[1.0, 0.0, 0.0], 10, 0).unwrap();
    assert!(results.is_empty());
}

#[test]
fn brute_force_search_k_larger_than_index() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
    let results = index.search(&[1.0, 0.0, 0.0], 100, 0).unwrap();
    assert_eq!(results.len(), 2); // returns all available, not error
}

#[test]
fn brute_force_orthogonal_vectors_distance() {
    // For unit vectors a, b: ||a - b||^2 = 2 - 2*cos(a,b)
    // Orthogonal unit vectors: cos = 0, so distance = 2.0
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    let results = index.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
    assert!((results[0].distance - 2.0).abs() < 1e-5,
        "orthogonal unit vectors should have L2^2 distance of 2.0, got {}", results[0].distance);
}

#[test]
fn brute_force_identical_vectors_distance() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let v = [0.577_350_3, 0.577_350_3, 0.577_350_3]; // unit vector
    index.insert(1, &v).unwrap();
    let results = index.search(&v, 1, 0).unwrap();
    assert!(results[0].distance < 1e-6);
}

#[test]
fn brute_force_delete_and_search() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
    index.insert(3, &[0.0, 0.0, 1.0]).unwrap();

    index.delete(2).unwrap();
    assert_eq!(index.len(), 2); // BruteForce does true delete
    assert_eq!(index.len_live(), 2);

    let results = index.search(&[0.0, 1.0, 0.0], 10, 0).unwrap();
    assert!(results.iter().all(|r| r.id != 2));
}

#[test]
fn brute_force_delete_not_found() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let result = index.delete(999);
    assert!(matches!(result, Err(VectorError::NotFound { id: 999 })));
}

#[test]
fn brute_force_insert_replaces_existing() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    index.insert(1, &[0.0, 1.0, 0.0]).unwrap(); // replace

    assert_eq!(index.len(), 1); // still 1 vector
    let results = index.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
    assert_eq!(results[0].id, 1);
    assert!(results[0].distance < 1e-6, "should match the replacement vector");
}

#[test]
fn brute_force_filtered_search_excludes_non_matching() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    for id in 0..10u64 {
        let v = [1.0, 0.0, 0.0]; // all same direction
        index.insert(id, &v).unwrap();
    }

    // Only include even IDs
    let results = index.filtered_search(&[1.0, 0.0, 0.0], 10, 0, &|id| id % 2 == 0).unwrap();
    assert_eq!(results.len(), 5);
    assert!(results.iter().all(|r| r.id % 2 == 0));
}

#[test]
fn brute_force_filtered_search_empty_result() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();

    // Predicate that matches nothing
    let results = index.filtered_search(&[1.0, 0.0, 0.0], 10, 0, &|_| false).unwrap();
    assert!(results.is_empty());
}

#[test]
fn brute_force_save_load_roundtrip() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config.clone());
    index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    index.insert(2, &[0.0, 1.0, 0.0]).unwrap();

    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("test.bfvi");
    index.save(&path).unwrap();

    let loaded = BruteForceIndex::load(&path, &config).unwrap();
    assert_eq!(loaded.len(), 2);

    // Search should produce identical results
    let results_orig = index.search(&[1.0, 0.0, 0.0], 2, 0).unwrap();
    let results_loaded = loaded.search(&[1.0, 0.0, 0.0], 2, 0).unwrap();
    assert_eq!(results_orig.len(), results_loaded.len());
    for (a, b) in results_orig.iter().zip(results_loaded.iter()) {
        assert_eq!(a.id, b.id);
        assert!((a.distance - b.distance).abs() < 1e-6);
    }
}

#[test]
fn brute_force_reserve_is_noop() {
    // BruteForce uses HashMap, which resizes automatically.
    // reserve() is a noop but must not error.
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    assert!(index.reserve(1_000_000).is_ok());
}

#[test]
fn l2_distance_sq_correctness() {
    let a = [1.0, 0.0, 0.0];
    let b = [0.0, 1.0, 0.0];
    let dist = l2_distance_sq(&a, &b);
    assert!((dist - 2.0).abs() < 1e-6);

    let c = [1.0, 0.0, 0.0];
    assert!(l2_distance_sq(&a, &c) < 1e-6);
}

#[test]
fn mock_vector_index_returns_predetermined() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let results = vec![
        vec![VectorSearchResult { id: 42, distance: 0.1 }],
        vec![VectorSearchResult { id: 99, distance: 0.5 }],
    ];
    let mock = MockVectorIndex::new(config, results);

    let r1 = mock.search(&[1.0, 0.0, 0.0], 1, 0).unwrap();
    assert_eq!(r1[0].id, 42);

    let r2 = mock.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
    assert_eq!(r2[0].id, 99);

    // Third call: no more results, returns empty
    let r3 = mock.search(&[0.0, 0.0, 1.0], 1, 0).unwrap();
    assert!(r3.is_empty());
}

#[test]
fn mock_vector_index_records_calls() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let mock = MockVectorIndex::new(config, vec![]);

    mock.insert(1, &[1.0, 0.0, 0.0]).unwrap();
    mock.delete(1).unwrap();
    mock.search(&[1.0, 0.0, 0.0], 10, 200).unwrap();
    mock.filtered_search(&[1.0, 0.0, 0.0], 5, 0, &|_| true).unwrap();

    let calls = mock.calls();
    assert_eq!(calls.len(), 4);
    assert!(matches!(calls[0], VectorIndexCall::Insert { id: 1 }));
    assert!(matches!(calls[1], VectorIndexCall::Delete { id: 1 }));
    assert!(matches!(calls[2], VectorIndexCall::Search { k: 10, ef_search: 200 }));
    assert!(matches!(calls[3], VectorIndexCall::FilteredSearch { k: 5, ef_search: 0 }));
}

#[test]
fn vector_index_is_send_and_sync() {
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<BruteForceIndex>();
    assert_send_sync::<MockVectorIndex>();
}

#[test]
fn vector_index_config_defaults() {
    let config = VectorIndexConfig::default();
    assert_eq!(config.dimensions, 1536);
    assert_eq!(config.metric, DistanceMetric::L2);
    assert_eq!(config.quantization, QuantizationLevel::F16);
    assert_eq!(config.connectivity, 16);
    assert_eq!(config.ef_construction, 200);
    assert_eq!(config.ef_search, 200);
}

Acceptance Criteria

VectorIndex trait with all methods from Spec 07, Section 11
VectorIndex: Send + Sync bound
VectorId = u64 type alias
VectorSearchResult, VectorIndexConfig, DistanceMetric, QuantizationLevel, VectorError types with correct derives
VectorIndexConfig::default() returns dimensions=1536, L2, F16, M=16, ef_construction=200, ef_search=200
VectorError implements Display, Error, From<std::io::Error>
l2_distance_sq() computes correct L2 squared distance
BruteForceIndex::search() returns exact nearest neighbors sorted by ascending distance
BruteForceIndex::filtered_search() returns only results where filter(id) == true
BruteForceIndex::insert() validates dimensions and rejects mismatches
BruteForceIndex::insert() replaces existing vectors with the same ID
BruteForceIndex::delete() removes vectors; they never appear in search results
BruteForceIndex::delete() returns NotFound for unknown IDs
BruteForceIndex::save() and load() roundtrip produces identical search results
MockVectorIndex returns predetermined results and records call history
All property tests pass: insert+search roundtrip, delete exclusion, filtered_search predicate honor, result ordering
BruteForceIndex and MockVectorIndex are Send + Sync
No unsafe code
cargo clippy -- -D warnings passes
All property tests and unit tests pass

Research References

docs/research/ann_for_tidaldb.md -- Section "Implementation recommendation: wrap USearch, build the planner": "A BruteForceIndex exists for correctness verification and small-dataset deployments", brute-force breakeven point (~2,000-5,000 vectors)

Spec References

docs/specs/07-vector-retrieval.md -- Section 11 (VectorIndex trait: full API signatures, VectorError variants, BruteForceIndex implementation sketch, MockVectorIndex), Section 12 (performance targets), Section 13 (invariants 1-3: insert retrievability, delete exclusion, filtered_search predicate compliance)

Implementation Notes

Add pub mod vector; to tidal/src/storage/mod.rs. The vector module is a submodule of storage because vector indexes are a storage concern (persistence, key encoding, entity store integration).
BruteForceIndex uses true deletion (HashMap::remove), not lazy tombstoning. This means len() and len_live() always return the same value. The tombstone_ratio() default implementation handles this correctly (returns 0.0). USearch (Task 02) uses lazy tombstoning, where len() > len_live().
The ef_search parameter is ignored by BruteForceIndex (exact search has no beam width). It is accepted for trait compliance but unused.
view() for BruteForceIndex delegates to load() since there is no mmap mode. This is documented on the method.
reserve() for BruteForceIndex is a no-op since HashMap resizes automatically. This is documented on the method.
Do NOT add the usearch crate dependency in this task. That is Task 02.
Do NOT implement l2_normalize() in this task. That is Task 03 (Embedding Lifecycle).
Do NOT implement the adaptive query planner in this task. That is Task 04.
The l2_distance_sq() function is pub(crate) -- it is used by BruteForceIndex and by Task 04's planner for brute-force fallback. It is not a public API.
