Milestone 2, Phase 1: Vector Index Integration (USearch)
Phase Deliverable
This phase delivers the VectorIndex trait and two implementations: BruteForceIndex (pure Rust, exact search) and UsearchIndex (USearch C++ FFI, HNSW approximate search). Items can be inserted with embeddings and retrieved by approximate nearest-neighbor similarity. Vectors are L2-normalized at insertion time, so ranking by L2 distance is equivalent to ranking by cosine similarity. An adaptive query planner routes filtered ANN queries to a strategy chosen from estimated selectivity: brute-force for very selective filters (< 1%), widened ef_search for the danger zone (1-20%), and standard in-graph predicate filtering for broad filters (> 20%). The USearch backend uses f16 quantization by default, mmap persistence via view() for instant restart, and full save()/load() for checkpoint coordination. BruteForceIndex serves correctness verification, small datasets, and the pre-filter brute-force strategy.
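The normalization claim rests on the identity that, for unit vectors a and b, the squared L2 distance equals 2 - 2(a . b), so sorting by L2 distance and sorting by cosine similarity give the same order. The sketch below illustrates that idea together with an exact linear scan and a recall measure. It is a minimal illustration, not the planned implementation: the trimmed `VectorError`, the `u64` ids, and the helper names `squared_l2`, `brute_force_search`, and `recall_at_k` are assumptions made for this example; only the `l2_normalize` signature comes from the acceptance criteria.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down error type; the planned `VectorError` has more variants.
#[derive(Debug, PartialEq)]
pub enum VectorError {
    DimensionMismatch { expected: usize, got: usize },
    ZeroNormVector,
}

/// Scale `v` to unit L2 length. A zero-norm vector has no direction, so it is rejected.
pub fn l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return Err(VectorError::ZeroNormVector);
    }
    Ok(v.iter().map(|x| x / norm).collect())
}

/// Squared Euclidean distance. Callers must pass slices of equal length.
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact top-k by linear scan. Returns (id, squared distance) pairs, nearest first.
pub fn brute_force_search(
    items: &[(u64, Vec<f32>)],
    query: &[f32],
    k: usize,
) -> Result<Vec<(u64, f32)>, VectorError> {
    let mut hits = Vec::with_capacity(items.len());
    for (id, v) in items {
        if v.len() != query.len() {
            return Err(VectorError::DimensionMismatch {
                expected: query.len(),
                got: v.len(),
            });
        }
        hits.push((*id, squared_l2(v, query)));
    }
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    Ok(hits)
}

/// Fraction of the exact top-k ids that the approximate result also returned.
pub fn recall_at_k(exact: &[u64], approx: &[u64]) -> f64 {
    if exact.is_empty() {
        return 1.0;
    }
    let truth: HashSet<u64> = exact.iter().copied().collect();
    let found: HashSet<u64> = approx
        .iter()
        .copied()
        .filter(|id| truth.contains(id))
        .collect();
    found.len() as f64 / truth.len() as f64
}
```

In a recall benchmark, the ids from `brute_force_search` would serve as ground truth and the ids returned by the HNSW backend as the approximate set.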
Acceptance Criteria
- `VectorIndex` trait with `insert(key, vector)`, `delete(key)`, `search(query, k, ef_search)`, `filtered_search(query, k, ef_search, predicate)`, `save()`, `load()`, `view()`, `reserve()`, `len()`, `len_live()`, `is_empty()`, `tombstone_ratio()`
- `VectorSearchResult { id: VectorId, distance: f32 }` and `VectorIndexConfig { dimensions, metric, quantization, connectivity, ef_construction, ef_search }` types
- `DistanceMetric` enum: `L2`, `InnerProduct`
- `QuantizationLevel` enum: `F32`, `F16`, `Int8`
- `VectorError` enum: `DimensionMismatch`, `CapacityExceeded`, `NotFound`, `Io`, `CorruptedIndex`, `Backend`, `ZeroNormVector`
- `BruteForceIndex` implements `VectorIndex` with exact linear-scan search, `RwLock<HashMap<VectorId, Vec<f32>>>` storage
- `MockVectorIndex` returns predetermined results for unit tests and records call history
- USearch backend implements the trait with f16 quantization (default), M=16, ef_construction=200, ef_search=200
- USearch `filtered_search` passes predicate closure to USearch's predicate callback API
- USearch `reserve()` for capacity management (2x over-provision)
- USearch `save()`, `load()`, `view()` delegating to USearch persistence methods
- `#![forbid(unsafe_code)]` relaxed only in `storage/vector/usearch.rs` with `// SAFETY:` comments on every unsafe block (a `forbid` lint cannot be overridden by a nested `#[allow]`, so the crate-level attribute must be `#![deny(unsafe_code)]` for the relaxation to compile)
- `l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError>` normalizes to unit length; fails on zero-norm vectors
- `EmbeddingSlotRegistry` maps `(EntityKind, slot_name)` to `EmbeddingSlotState { index, dimensions, quantization, params }`
- Insert path: validate dims, normalize, store f32 in entity store (`META` key with `EMB:slot_name` suffix), insert quantized into HNSW
- Update path: tombstone in HNSW, insert new vector
- Delete path: tombstone only
- Adaptive query planner: selectivity < 1% triggers pre-filter + brute-force; 1-20% uses `filtered_search` with widened `ef_search` (2-3x); > 20% uses standard `filtered_search`; 100% (no filter) uses unfiltered `search()`
  - Note: ROADMAP.md acceptance criteria say selectivity < 2% -> brute-force. These docs use the refined spec thresholds from Spec 07 Section 9 (< 1% brute-force, 1-20% widened HNSW, > 20% in-graph filter). The roadmap threshold of 2% is superseded by the spec.
- ANN retrieval at 10K vectors returns top-100 with recall@100 > 0.95 (measured against `BruteForceIndex`)
- ANN retrieval latency < 10ms at 10K vectors (Criterion benchmark)
- Persistence: `save()` on checkpoint, `view()` on restart for immediate read serving
- Criterion benchmarks: unfiltered search, filtered search at 20% and 5% selectivity, brute-force search, recall@100
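The planner thresholds in the criteria above amount to a small decision function. The following is a rough sketch under stated assumptions, not the planned `AdaptiveQueryPlanner`: the `AnnStrategy` enum, `choose_strategy`, `FixedSelectivity`, and the shape of the `SelectivityEstimator` trait are hypothetical, the widening factor is fixed at 2x (the low end of the 2-3x range), and "no filter" is modeled as `None` rather than as a selectivity of 100%.

```rust
/// Strategy picked for one ANN query. Variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AnnStrategy {
    /// No filter supplied: plain `search()`.
    Unfiltered { ef_search: usize },
    /// Selectivity below 1%: collect matching ids, then scan them exactly.
    PreFilterBruteForce,
    /// Selectivity from 1% through 20%: in-graph filter with a larger candidate list.
    WidenedFiltered { ef_search: usize },
    /// Selectivity above 20%: in-graph filter at the configured `ef_search`.
    StandardFiltered { ef_search: usize },
}

/// Estimates the fraction of items that pass a filter, in 0.0..=1.0.
/// Hypothetical shape; the real trait would take the filter as input.
pub trait SelectivityEstimator {
    fn estimate(&self) -> f64;
}

/// Placeholder for the period before metadata bitmap indexes exist:
/// always reports 1.0, which routes every filtered query in-graph.
pub struct FixedSelectivity;

impl SelectivityEstimator for FixedSelectivity {
    fn estimate(&self) -> f64 {
        1.0
    }
}

const BRUTE_FORCE_BELOW: f64 = 0.01; // spec threshold: < 1%
const WIDEN_UP_TO: f64 = 0.20; // spec threshold: 1-20%
const WIDEN_FACTOR: usize = 2; // assumption: low end of the 2-3x range

/// Map an optional selectivity estimate to a strategy.
/// `None` means the query carries no filter. A NaN estimate fails both
/// comparisons and therefore falls through to the standard filtered path.
pub fn choose_strategy(selectivity: Option<f64>, base_ef_search: usize) -> AnnStrategy {
    match selectivity {
        None => AnnStrategy::Unfiltered { ef_search: base_ef_search },
        Some(s) if s < BRUTE_FORCE_BELOW => AnnStrategy::PreFilterBruteForce,
        Some(s) if s <= WIDEN_UP_TO => AnnStrategy::WidenedFiltered {
            ef_search: base_ef_search.saturating_mul(WIDEN_FACTOR),
        },
        Some(_) => AnnStrategy::StandardFiltered { ef_search: base_ef_search },
    }
}
```

With the default `ef_search` of 200, a filter estimated at 5% selectivity would be searched with `ef_search` 400 under these assumptions, while the placeholder estimator always yields the standard filtered path.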
Dependencies
- Requires: m1p1 (types: `EntityId`, `EntityKind`), m1p3 (storage: `StorageEngine` trait, key encoding with `Tag::Meta` for embedding persistence), m1p5 (entity write API for storing embeddings in entity store)
- Blocks: m2p2 (metadata indexes use the `VectorIndex` for filter bitmap selectivity estimation), m2p5 (RETRIEVE executor needs ANN search for candidate generation)
Research References
- docs/research/ann_for_tidaldb.md -- USearch evaluation (127K QPS at f32, 167K at int8), filtered search callback architecture, ACORN-1 two-hop expansion, f16 as optimal default, mmap persistence strategy, memory budget analysis (31.5 GB at 10M x 1536d x f16), `reserve()` capacity planning
- thoughts.md -- Part V.9 (hybrid storage: "vector index and text index are derived state, always rebuildable from the entity store")
Spec References
- docs/specs/07-vector-retrieval.md -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200), Section 3 (filtered ANN: three strategies with selectivity thresholds), Section 4 (quantization: f16 default, < 1% recall loss), Section 5 (multiple embedding spaces, slot registry), Section 6 (embedding lifecycle: insert, update, delete paths), Section 7 (persistence: save/load/view, delta journal), Section 9 (adaptive query planner: decision tree, threshold reference, runtime statistics), Section 11 (VectorIndex trait, full API), Section 12 (performance targets), Section 13 (invariants: 10 correctness guarantees)
- docs/specs/00-architecture-overview.md -- Module map showing `storage/vector/`
Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|---|---|---|---|
| 01 | VectorIndex Trait + BruteForceIndex | VectorIndex trait, all types, BruteForceIndex, MockVectorIndex, property tests | None | M |
| 02 | USearch Backend | UsearchIndex wrapping USearch via Rust crate, f16 quantization, mmap persistence, #[allow(unsafe_code)] | Task 01 | L |
| 03 | Embedding Lifecycle + Slot Registry | l2_normalize, EmbeddingSlotRegistry, insert/update/delete paths, entity store integration | Task 01 | M |
| 04 | Adaptive Query Planner + Benchmarks | AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats, Criterion benchmarks | Task 01, Task 02 | M |
Task Dependency DAG
    Task 01: VectorIndex Trait + BruteForceIndex
        |                     \
        |                      \
        v                       v
    Task 02: USearch Backend    Task 03: Embedding Lifecycle + Slot Registry
        |
        +----> Task 04: Adaptive Query Planner + Benchmarks
               (also depends on Task 01)
Task 01 is the foundation -- it defines the trait all other tasks implement or consume. Tasks 02 and 03 are parallelizable after Task 01. Task 04 requires both Task 01 (for trait types) and Task 02 (for USearch backend to benchmark against).
File Layout
    tidal/src/
      storage/
        vector/
          mod.rs       -- VectorIndex trait, VectorError, VectorSearchResult, VectorIndexConfig,
                          DistanceMetric, QuantizationLevel, VectorId, types re-exports (Task 01)
          brute.rs     -- BruteForceIndex, MockVectorIndex (Task 01)
          usearch.rs   -- UsearchIndex, #[allow(unsafe_code)] (Task 02)
          lifecycle.rs -- l2_normalize, embedding insert/update/delete path (Task 03)
          registry.rs  -- EmbeddingSlotRegistry, EmbeddingSlotState (Task 03)
          planner.rs   -- AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats (Task 04)
        mod.rs         -- add `pub mod vector;`
    tidal/benches/
      vector.rs        -- Criterion benchmarks (Task 04)
    tidal/Cargo.toml   -- add `usearch` and `rayon` dependencies
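As a rough picture of what `storage/vector/mod.rs` is slated to hold, the plain data types from the acceptance criteria could be sketched as follows. The field types, the `u64` alias for `VectorId`, the `with_dimensions` constructor, and the choice of `L2` as the default metric are assumptions for illustration; the default values (f16, M=16, ef_construction=200, ef_search=200) are the ones this document states.

```rust
/// Assumption: ids are plain u64 here; the real `VectorId` may be a newtype.
pub type VectorId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistanceMetric {
    L2,
    InnerProduct,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantizationLevel {
    F32,
    F16,
    Int8,
}

/// One hit returned by a search, nearest results first.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct VectorSearchResult {
    pub id: VectorId,
    pub distance: f32,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VectorIndexConfig {
    pub dimensions: usize,
    pub metric: DistanceMetric,
    pub quantization: QuantizationLevel,
    /// HNSW `M`: maximum neighbors per node.
    pub connectivity: usize,
    pub ef_construction: usize,
    pub ef_search: usize,
}

impl VectorIndexConfig {
    /// Hypothetical constructor that fills in the defaults named in this document.
    pub fn with_dimensions(dimensions: usize) -> Self {
        Self {
            dimensions,
            metric: DistanceMetric::L2,
            quantization: QuantizationLevel::F16,
            connectivity: 16,
            ef_construction: 200,
            ef_search: 200,
        }
    }
}
```

Keeping the defaults in one constructor would let both `BruteForceIndex` and `UsearchIndex` be built from the same configuration value in tests and benchmarks.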
Open Questions
- `usearch` crate version: The `usearch` crate (crates.io) wraps USearch via CXX. Verify the latest stable version supports `filtered_search` with a predicate callback. If not, the alternative is `hnsw_rs` which is pure Rust but lacks quantization and deletion support. The research doc recommends USearch but notes hnsw_rs as a fallback.
- Capacity planning: USearch `reserve()` must be called before first insertion. tidalDB should over-provision by 2x the schema-defined entity limit. What happens if the index fills up? Need to benchmark whether a full rebuild is needed or if `reserve()` can be called again with higher capacity.
- Concurrency model: USearch claims concurrent reads + writes. Verify that `filtered_search` and `insert` can truly run concurrently without a mutex wrapper. If not, add `RwLock<UsearchIndex>` and document the contention implication. ScyllaDB validates concurrent operation at 1B vectors but tidalDB's access patterns may differ.
- Delta journal vs full save: The spec recommends a delta journal for incremental persistence (Spec 07, Section 7). For M2 at 10K items (not 10M), a full `save()` on every checkpoint is fast enough (10K x 1536d x f16 = ~30 MB, writes in < 100ms). Defer delta journal implementation to M7 unless benchmarks show otherwise.
- Selectivity estimation without m2p2: Task 04 (planner) depends on selectivity estimates from metadata bitmap indexes (m2p2). For the m2p1 phase, the planner can use a fixed threshold with a placeholder `SelectivityEstimator` that returns 1.0 (always use in-graph filter). Wire up the real estimator when m2p2 is implemented.
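The checkpoint size estimate in the delta-journal question can be reproduced with simple arithmetic. The helpers below are a hypothetical sketch (the names `raw_vector_bytes` and `required_mb_per_s` are made up for this example) and count only the raw vector payload, so they ignore HNSW graph links, keys, and file headers.

```rust
/// Raw vector payload in bytes: count x dimensions x bytes per component.
/// Does not include HNSW graph links, keys, or file headers.
pub fn raw_vector_bytes(count: u64, dimensions: u64, bytes_per_component: u64) -> u64 {
    count * dimensions * bytes_per_component
}

/// Sustained write rate, in decimal MB/s, needed to flush `bytes` within `budget_ms`.
pub fn required_mb_per_s(bytes: u64, budget_ms: u64) -> f64 {
    let megabytes = bytes as f64 / 1_000_000.0;
    let seconds = budget_ms as f64 / 1_000.0;
    megabytes / seconds
}
```

For 10K vectors at 1536 dimensions and 2 bytes per f16 component, `raw_vector_bytes(10_000, 1536, 2)` gives 30,720,000 bytes, in line with the ~30 MB estimate. Writing that within 100 ms requires roughly 307 MB/s of sustained throughput before any graph or header overhead, which is the figure a checkpoint benchmark would need to confirm.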