tidaldb/docs/planning/milestone-2/phase-1/OVERVIEW.md
jordan 6fdaa1584b feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint
- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples
  (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test
  coverage for every public API surface
- m1p5: TidalDb public API — write_item, signal, read_decay_score,
  read_windowed_count, read_velocity; StorageBox enum routing memory vs
  fjall; WalSender/WalHandleWriter bridge; WAL replay on open
- Periodic checkpoint: 30s background thread for persistent+schema mode;
  FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful
  shutdown via Arc<AtomicBool> + join before final checkpoint
- ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing)
- Milestone 2 planning scaffolding added under docs/planning/milestone-2/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 22:45:10 -07:00


Milestone 2, Phase 1: Vector Index Integration (USearch)

Phase Deliverable

The VectorIndex trait and two implementations: BruteForceIndex (pure Rust, exact search) and UsearchIndex (USearch C++ FFI, HNSW approximate search). Items can be inserted with embeddings and retrieved by approximate nearest neighbor similarity. Vectors are L2-normalized at insertion time, so ranking by L2 distance is equivalent to ranking by cosine similarity (for unit vectors, squared L2 distance equals 2 - 2 x cosine similarity). An adaptive query planner routes filtered ANN queries to the optimal strategy based on estimated selectivity: brute-force for very selective filters (< 1%), widened ef_search for the danger zone (1-20%), and standard in-graph predicate filtering for broad filters (> 20%). The USearch backend uses f16 quantization by default, mmap persistence via view() for instant restart, and full save()/load() for checkpoint coordination. BruteForceIndex serves three purposes: correctness verification, small datasets, and the pre-filter brute-force strategy.
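The normalization claim above can be checked numerically. The following is a minimal sketch, not the planned implementation: the single-variant `VectorError` and the helpers `squared_l2` and `dot` are illustrative assumptions, and only the `l2_normalize` signature is taken from the acceptance criteria.

```rust
/// Illustrative error type; the planned VectorError has more variants.
#[derive(Debug, PartialEq)]
pub enum VectorError {
    ZeroNormVector,
}

/// Scale `v` to unit Euclidean length. Fails on zero-norm input,
/// because a zero vector has no direction to preserve.
pub fn l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return Err(VectorError::ZeroNormVector);
    }
    Ok(v.iter().map(|x| x / norm).collect())
}

/// Squared Euclidean distance (hypothetical helper).
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Inner product; equals cosine similarity when both inputs are unit length.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

For unit vectors a and b, `squared_l2(a, b)` equals `2.0 - 2.0 * dot(a, b)` up to floating-point error, so sorting by ascending L2 distance yields the same order as sorting by descending cosine similarity.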

Acceptance Criteria

  • VectorIndex trait with insert(key, vector), delete(key), search(query, k, ef_search), filtered_search(query, k, ef_search, predicate), save(), load(), view(), reserve(), len(), len_live(), is_empty(), tombstone_ratio()
  • VectorSearchResult { id: VectorId, distance: f32 } and VectorIndexConfig { dimensions, metric, quantization, connectivity, ef_construction, ef_search } types
  • DistanceMetric enum: L2, InnerProduct
  • QuantizationLevel enum: F32, F16, Int8
  • VectorError enum: DimensionMismatch, CapacityExceeded, NotFound, Io, CorruptedIndex, Backend, ZeroNormVector
  • BruteForceIndex implements VectorIndex with exact linear-scan search, RwLock<HashMap<VectorId, Vec<f32>>> storage
  • MockVectorIndex returns predetermined results for unit tests and records call history
  • USearch backend implements the trait with f16 quantization (default), M=16, ef_construction=200, ef_search=200
  • USearch filtered_search passes predicate closure to USearch's predicate callback API
  • USearch reserve() for capacity management (2x over-provision)
  • USearch save(), load(), view() delegating to USearch persistence methods
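  • Unsafe code confined to storage/vector/usearch.rs via #[allow(unsafe_code)], with // SAFETY: comments on every unsafe block. Because a crate-level #![forbid(unsafe_code)] cannot be overridden by a nested allow, the crate-level lint must be #![deny(unsafe_code)] for this exception to compile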
  • l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> normalizes to unit length; fails on zero-norm vectors
  • EmbeddingSlotRegistry maps (EntityKind, slot_name) to EmbeddingSlotState { index, dimensions, quantization, params }
  • Insert path: validate dims, normalize, store f32 in entity store (META key with EMB:slot_name suffix), insert quantized into HNSW
  • Update path: tombstone in HNSW, insert new vector
  • Delete path: tombstone only
  • Adaptive query planner: selectivity < 1% triggers pre-filter + brute-force; 1-20% uses filtered_search with widened ef_search (2-3x); > 20% uses standard filtered_search; 100% (no filter) uses unfiltered search()
    • Note: ROADMAP.md acceptance criteria say selectivity < 2% -> brute-force. These docs use the refined spec thresholds from Spec 07 Section 9 (< 1% brute-force, 1-20% widened HNSW, > 20% in-graph filter). The roadmap threshold of 2% is superseded by the spec.
  • ANN retrieval at 10K vectors returns top-100 with recall@100 > 0.95 (measured against BruteForceIndex)
  • ANN retrieval latency < 10ms at 10K vectors (Criterion benchmark)
  • Persistence: save() on checkpoint, view() on restart for immediate read serving
  • Criterion benchmarks: unfiltered search, filtered search at 20% and 5% selectivity, brute-force search, recall@100
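The planner thresholds in the criteria above reduce to a small decision function. This is a sketch under stated assumptions: the `AnnStrategy` enum, the constant names, and `choose_strategy` are hypothetical, and the widening factor is fixed at 2x although the criteria allow 2-3x.

```rust
/// Strategy selected for one ANN query (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AnnStrategy {
    /// No filter present: plain search().
    Unfiltered,
    /// Selectivity < 1%: materialize matching ids, then scan them exactly.
    PreFilterBruteForce,
    /// Selectivity 1-20%: filtered_search with a widened ef_search.
    WidenedFiltered { ef_search: usize },
    /// Selectivity > 20%: filtered_search with the configured ef_search.
    StandardFiltered { ef_search: usize },
}

pub const BRUTE_FORCE_BELOW: f64 = 0.01;
pub const WIDEN_UP_TO: f64 = 0.20;
/// Assumed widening factor; the criteria give a 2-3x range.
pub const WIDEN_FACTOR: usize = 2;

/// `selectivity` is the estimated fraction of items passing the filter,
/// or None when the query carries no filter at all.
pub fn choose_strategy(selectivity: Option<f64>, base_ef_search: usize) -> AnnStrategy {
    match selectivity {
        None => AnnStrategy::Unfiltered,
        Some(s) if s < BRUTE_FORCE_BELOW => AnnStrategy::PreFilterBruteForce,
        Some(s) if s <= WIDEN_UP_TO => AnnStrategy::WidenedFiltered {
            ef_search: base_ef_search * WIDEN_FACTOR,
        },
        Some(_) => AnnStrategy::StandardFiltered {
            ef_search: base_ef_search,
        },
    }
}
```

With the default ef_search of 200, a filter estimated at 5% selectivity would run with ef_search = 400 in this sketch, while a 50% filter keeps ef_search = 200.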

Dependencies

  • Requires: m1p1 (types: EntityId, EntityKind), m1p3 (storage: StorageEngine trait, key encoding with Tag::Meta for embedding persistence), m1p5 (entity write API for storing embeddings in entity store)
  • Blocks: m2p2 (metadata indexes use the VectorIndex for filter bitmap selectivity estimation), m2p5 (RETRIEVE executor needs ANN search for candidate generation)

Research References

  • docs/research/ann_for_tidaldb.md -- USearch evaluation (127K QPS at f32, 167K at int8), filtered search callback architecture, ACORN-1 two-hop expansion, f16 as optimal default, mmap persistence strategy, memory budget analysis (31.5 GB at 10M x 1536d x f16), reserve() capacity planning
  • thoughts.md -- Part V.9 (hybrid storage: "vector index and text index are derived state, always rebuildable from the entity store")

Spec References

  • docs/specs/07-vector-retrieval.md -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200), Section 3 (filtered ANN: three strategies with selectivity thresholds), Section 4 (quantization: f16 default, < 1% recall loss), Section 5 (multiple embedding spaces, slot registry), Section 6 (embedding lifecycle: insert, update, delete paths), Section 7 (persistence: save/load/view, delta journal), Section 9 (adaptive query planner: decision tree, threshold reference, runtime statistics), Section 11 (VectorIndex trait, full API), Section 12 (performance targets), Section 13 (invariants: 10 correctness guarantees)
  • docs/specs/00-architecture-overview.md -- Module map showing storage/vector/

Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | VectorIndex Trait + BruteForceIndex | VectorIndex trait, all types, BruteForceIndex, MockVectorIndex, property tests | None | M |
| 02 | USearch Backend | UsearchIndex wrapping USearch via Rust crate, f16 quantization, mmap persistence, #[allow(unsafe_code)] | Task 01 | L |
| 03 | Embedding Lifecycle + Slot Registry | l2_normalize, EmbeddingSlotRegistry, insert/update/delete paths, entity store integration | Task 01 | M |
| 04 | Adaptive Query Planner + Benchmarks | AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats, Criterion benchmarks | Task 01, Task 02 | M |

Task Dependency DAG

Task 01: VectorIndex Trait + BruteForceIndex
    |                   \
    |                    \
    v                     v
Task 02: USearch Backend  Task 03: Embedding Lifecycle + Slot Registry
    |
    +----> Task 04: Adaptive Query Planner + Benchmarks
             (also depends on Task 01)

Task 01 is the foundation -- it defines the trait all other tasks implement or consume. Tasks 02 and 03 are parallelizable after Task 01. Task 04 requires both Task 01 (for trait types) and Task 02 (for USearch backend to benchmark against).
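To make the Task 01 foundation concrete, here is a reduced sketch of the trait and the exact-search implementation. It assumes `VectorId` is a `u64` alias, keeps only four of the planned trait methods and two of the planned error variants, and returns squared L2 distance; none of these choices are confirmed by the spec.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

pub type VectorId = u64; // assumption: the real id type is defined in Task 01

#[derive(Debug, Clone, PartialEq)]
pub struct VectorSearchResult {
    pub id: VectorId,
    pub distance: f32,
}

#[derive(Debug, PartialEq)]
pub enum VectorError {
    DimensionMismatch { expected: usize, got: usize },
    NotFound,
}

/// Reduced version of the planned trait (persistence methods omitted).
pub trait VectorIndex {
    fn insert(&self, key: VectorId, vector: &[f32]) -> Result<(), VectorError>;
    fn delete(&self, key: VectorId) -> Result<(), VectorError>;
    fn search(&self, query: &[f32], k: usize, ef_search: usize)
        -> Result<Vec<VectorSearchResult>, VectorError>;
    fn len(&self) -> usize;
}

pub struct BruteForceIndex {
    dimensions: usize,
    vectors: RwLock<HashMap<VectorId, Vec<f32>>>,
}

impl BruteForceIndex {
    pub fn new(dimensions: usize) -> Self {
        Self { dimensions, vectors: RwLock::new(HashMap::new()) }
    }

    fn check_dims(&self, v: &[f32]) -> Result<(), VectorError> {
        if v.len() == self.dimensions {
            Ok(())
        } else {
            Err(VectorError::DimensionMismatch { expected: self.dimensions, got: v.len() })
        }
    }
}

impl VectorIndex for BruteForceIndex {
    fn insert(&self, key: VectorId, vector: &[f32]) -> Result<(), VectorError> {
        self.check_dims(vector)?;
        self.vectors.write().unwrap().insert(key, vector.to_vec());
        Ok(())
    }

    fn delete(&self, key: VectorId) -> Result<(), VectorError> {
        self.vectors.write().unwrap().remove(&key).map(|_| ()).ok_or(VectorError::NotFound)
    }

    /// Exact linear scan; `ef_search` is accepted for trait parity and ignored.
    fn search(&self, query: &[f32], k: usize, _ef_search: usize)
        -> Result<Vec<VectorSearchResult>, VectorError> {
        self.check_dims(query)?;
        let guard = self.vectors.read().unwrap();
        let mut hits: Vec<VectorSearchResult> = guard
            .iter()
            .map(|(id, v)| {
                // Squared L2 distance preserves the ranking of true L2 distance.
                let distance = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum::<f32>();
                VectorSearchResult { id: *id, distance }
            })
            .collect();
        // Sort by distance, breaking ties by id so results are deterministic.
        hits.sort_by(|a, b| a.distance.total_cmp(&b.distance).then(a.id.cmp(&b.id)));
        hits.truncate(k);
        Ok(hits)
    }

    fn len(&self) -> usize {
        self.vectors.read().unwrap().len()
    }
}
```

Taking `&self` on the mutating methods and locking internally matches the RwLock-backed storage named in the acceptance criteria, and lets the index be shared across threads without an outer mutex.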

File Layout

tidal/src/
  storage/
    vector/
      mod.rs        -- VectorIndex trait, VectorError, VectorSearchResult, VectorIndexConfig,
                       DistanceMetric, QuantizationLevel, VectorId, types re-exports (Task 01)
      brute.rs      -- BruteForceIndex, MockVectorIndex (Task 01)
      usearch.rs    -- UsearchIndex, #[allow(unsafe_code)] (Task 02)
      lifecycle.rs  -- l2_normalize, embedding insert/update/delete path (Task 03)
      registry.rs   -- EmbeddingSlotRegistry, EmbeddingSlotState (Task 03)
      planner.rs    -- AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats (Task 04)
    mod.rs          -- add `pub mod vector;`
tidal/benches/
  vector.rs         -- Criterion benchmarks (Task 04)
tidal/Cargo.toml    -- add `usearch` and `rayon` dependencies
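
The recall@100 figure that benches/vector.rs is meant to report can be computed by comparing the ids returned by the approximate index against the ids returned by the exact index for the same query. A minimal sketch, assuming ids are plain `u64` values and unique within each result list; `recall_at_k` is a hypothetical helper name.

```rust
use std::collections::HashSet;

/// Fraction of the exact top-k ids that also appear in the approximate top-k.
/// Assumes ids within each slice are unique.
pub fn recall_at_k(exact: &[u64], approx: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        // Nothing to find, so nothing was missed.
        return 1.0;
    }
    let found = approx
        .iter()
        .take(k)
        .copied()
        .filter(|id| truth.contains(id))
        .count();
    found as f64 / truth.len() as f64
}
```

Averaging this value over a set of queries, with `exact` taken from BruteForceIndex and `approx` from the HNSW backend, would give the number to compare against the 0.95 target.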

Open Questions

  1. usearch crate version: The usearch crate (crates.io) wraps USearch via CXX. Verify the latest stable version supports filtered_search with a predicate callback. If not, the alternative is hnsw_rs which is pure Rust but lacks quantization and deletion support. The research doc recommends USearch but notes hnsw_rs as a fallback.

  2. Capacity planning: USearch reserve() must be called before first insertion. tidalDB should over-provision by 2x the schema-defined entity limit. It is not yet known what happens when the index fills up: benchmark whether a full rebuild is required or whether reserve() can be called again with a higher capacity.

  3. Concurrency model: USearch claims concurrent reads + writes. Verify that filtered_search and insert can truly run concurrently without a mutex wrapper. If not, add RwLock<UsearchIndex> and document the contention implication. ScyllaDB validates concurrent operation at 1B vectors but tidalDB's access patterns may differ.

  4. Delta journal vs full save: The spec recommends a delta journal for incremental persistence (Spec 07, Section 7). For M2 at 10K items (not 10M), a full save() on every checkpoint is fast enough (10K x 1536d x f16 = ~30 MB, writes in < 100ms). Defer delta journal implementation to M7 unless benchmarks show otherwise.

  5. Selectivity estimation without m2p2: Task 04 (planner) depends on selectivity estimates from metadata bitmap indexes (m2p2). For m2p1, the planner can keep the fixed thresholds and use a placeholder SelectivityEstimator that always returns 1.0. That value falls in the > 20% band, so every filtered query takes the standard in-graph filter path. Wire up the real estimator when m2p2 is implemented.
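
One way to keep the placeholder from question 5 swappable is to put the estimator behind a trait. This is only a sketch: the trait shape, the `FixedSelectivity` type, the string-typed filter argument, and `uses_in_graph_filter` are all assumptions made for illustration, not the planned API.

```rust
/// Estimates the fraction of items that pass a filter, in [0.0, 1.0].
pub trait SelectivityEstimator {
    /// `filter` is a stand-in for whatever predicate type m2p2 introduces.
    fn estimate(&self, filter: &str) -> f64;
}

/// Placeholder for m2p1: reports the same selectivity for every filter.
pub struct FixedSelectivity(pub f64);

impl SelectivityEstimator for FixedSelectivity {
    fn estimate(&self, _filter: &str) -> f64 {
        self.0.clamp(0.0, 1.0)
    }
}

/// True when the estimate falls in the > 20% band that selects the
/// standard in-graph filter strategy.
pub fn uses_in_graph_filter(estimator: &dyn SelectivityEstimator, filter: &str) -> bool {
    estimator.estimate(filter) > 0.20
}
```

With `FixedSelectivity(1.0)` every filtered query lands in the in-graph band; replacing it with a bitmap-backed estimator later would not require changes to the planner's call sites in this design.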