# Milestone 2, Phase 1: Vector Index Integration (USearch)

## Phase Deliverable

This phase delivers the `VectorIndex` trait and two implementations: `BruteForceIndex` (pure Rust, exact search) and `UsearchIndex` (USearch C++ FFI, HNSW approximate search). Items can be inserted with embeddings and retrieved by approximate nearest-neighbor similarity. Vectors are L2-normalized at insertion time, so ranking by L2 distance is equivalent to ranking by cosine similarity (for unit vectors, squared L2 distance equals 2 - 2 x cosine similarity). An adaptive query planner routes filtered ANN queries to the optimal strategy based on estimated selectivity: brute-force for very selective filters (< 1%), widened `ef_search` for the danger zone (1-20%), and standard in-graph predicate filtering for broad filters (> 20%). The USearch backend uses f16 quantization by default, mmap persistence via `view()` for instant restart, and full `save()`/`load()` for checkpoint coordination. `BruteForceIndex` serves correctness verification, small datasets, and the pre-filter brute-force strategy.

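
The normalization step and the distance identity it relies on can be sketched as follows. This is a minimal illustration, not the planned implementation: `VectorError` is reduced to the single variant used here, and `squared_l2` and `dot` are hypothetical helpers introduced only to check that, for unit vectors, squared L2 distance equals 2 - 2 x dot product.

```rust
/// Reduced stand-in for the planned `VectorError`; only the variant used here.
#[derive(Debug, PartialEq)]
pub enum VectorError {
    ZeroNormVector,
}

/// Scale `v` to unit L2 norm. A zero-norm input has no direction, so it is rejected.
pub fn l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return Err(VectorError::ZeroNormVector);
    }
    Ok(v.iter().map(|x| x / norm).collect())
}

/// Hypothetical helper: squared Euclidean distance between two equal-length vectors.
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical helper: dot product, which equals cosine similarity on unit vectors.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Because the identity is monotonic in the dot product, sorting neighbors by ascending L2 distance gives the same order as sorting by descending cosine similarity, which is why normalizing once at insertion time is sufficient.
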
## Acceptance Criteria

- [ ] `VectorIndex` trait with `insert(key, vector)`, `delete(key)`, `search(query, k, ef_search)`, `filtered_search(query, k, ef_search, predicate)`, `save()`, `load()`, `view()`, `reserve()`, `len()`, `len_live()`, `is_empty()`, `tombstone_ratio()`
- [ ] `VectorSearchResult { id: VectorId, distance: f32 }` and `VectorIndexConfig { dimensions, metric, quantization, connectivity, ef_construction, ef_search }` types
- [ ] `DistanceMetric` enum: `L2`, `InnerProduct`
- [ ] `QuantizationLevel` enum: `F32`, `F16`, `Int8`
- [ ] `VectorError` enum: `DimensionMismatch`, `CapacityExceeded`, `NotFound`, `Io`, `CorruptedIndex`, `Backend`, `ZeroNormVector`
- [ ] `BruteForceIndex` implements `VectorIndex` with exact linear-scan search, `RwLock<HashMap<VectorId, Vec<f32>>>` storage
- [ ] `MockVectorIndex` returns predetermined results for unit tests and records call history
- [ ] USearch backend implements the trait with f16 quantization (default), M=16, ef_construction=200, ef_search=200
- [ ] USearch `filtered_search` passes predicate closure to USearch's predicate callback API
- [ ] USearch `reserve()` for capacity management (2x over-provision)
- [ ] USearch `save()`, `load()`, `view()` delegating to USearch persistence methods
- [ ] `#![forbid(unsafe_code)]` relaxed only in `storage/vector/usearch.rs` with `// SAFETY:` comments on every unsafe block (a `forbid` lint level cannot be overridden by an inner `#[allow(unsafe_code)]`, so the crate-level attribute has to become `#![deny(unsafe_code)]` for the exemption to compile)
- [ ] `l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError>` normalizes to unit length; fails on zero-norm vectors
- [ ] `EmbeddingSlotRegistry` maps `(EntityKind, slot_name)` to `EmbeddingSlotState { index, dimensions, quantization, params }`
- [ ] Insert path: validate dims, normalize, store f32 in entity store (`META` key with `EMB:slot_name` suffix), insert quantized into HNSW
- [ ] Update path: tombstone in HNSW, insert new vector
- [ ] Delete path: tombstone only
- [ ] Adaptive query planner: selectivity < 1% triggers pre-filter + brute-force; 1-20% uses `filtered_search` with widened `ef_search` (2-3x); > 20% uses standard `filtered_search`; 100% (no filter) uses unfiltered `search()`
  - Note: the ROADMAP.md acceptance criteria say selectivity < 2% -> brute-force. These docs use the refined thresholds from Spec 07, Section 9 (< 1% brute-force, 1-20% widened HNSW, > 20% in-graph filter). The spec supersedes the roadmap threshold of 2%.
- [ ] ANN retrieval at 10K vectors returns top-100 with recall@100 > 0.95 (measured against `BruteForceIndex`)
- [ ] ANN retrieval latency < 10ms at 10K vectors (Criterion benchmark)
- [ ] Persistence: `save()` on checkpoint, `view()` on restart for immediate read serving
- [ ] Criterion benchmarks: unfiltered search, filtered search at 20% and 5% selectivity, brute-force search, recall@100

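
The exact-search baseline and the recall measurement named in the criteria above can be sketched as below. This is a reduced illustration under stated assumptions: `VectorId` is not defined in this document, so a `u64` alias is assumed; `search` omits the `ef_search` parameter and all error handling from the planned trait; and `recall_at_k` is a hypothetical helper, not a confirmed API.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

/// Assumption: this document does not define `VectorId`; a `u64` alias is used here.
pub type VectorId = u64;

#[derive(Debug, Clone, PartialEq)]
pub struct VectorSearchResult {
    pub id: VectorId,
    pub distance: f32,
}

/// Exact linear-scan index using the storage shape named in the criteria.
#[derive(Default)]
pub struct BruteForceIndex {
    vectors: RwLock<HashMap<VectorId, Vec<f32>>>,
}

impl BruteForceIndex {
    /// Store (or overwrite) the vector for `key`. Dimension checks are omitted in this sketch.
    pub fn insert(&self, key: VectorId, vector: Vec<f32>) {
        self.vectors.write().unwrap().insert(key, vector);
    }

    /// Score every stored vector by squared L2 distance and keep the `k` nearest.
    pub fn search(&self, query: &[f32], k: usize) -> Vec<VectorSearchResult> {
        let guard = self.vectors.read().unwrap();
        let mut hits: Vec<VectorSearchResult> = guard
            .iter()
            .map(|(id, v)| VectorSearchResult {
                id: *id,
                distance: v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum(),
            })
            .collect();
        // Sort by distance; break ties by id so results are deterministic.
        hits.sort_by(|a, b| a.distance.total_cmp(&b.distance).then(a.id.cmp(&b.id)));
        hits.truncate(k);
        hits
    }
}

/// Hypothetical helper: fraction of the exact top-k ids that the approximate result also contains.
pub fn recall_at_k(exact: &[VectorSearchResult], approx: &[VectorSearchResult]) -> f32 {
    if exact.is_empty() {
        return 1.0;
    }
    let found: HashSet<VectorId> = approx.iter().map(|r| r.id).collect();
    let overlap = exact.iter().filter(|r| found.contains(&r.id)).count();
    overlap as f32 / exact.len() as f32
}
```

In a benchmark, `exact` would come from `BruteForceIndex::search` and `approx` from the HNSW backend for the same query and `k = 100`; the recall criterion is then a check that the average of `recall_at_k` over the query set exceeds 0.95.
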
## Dependencies

- **Requires:** m1p1 (types: `EntityId`, `EntityKind`), m1p3 (storage: `StorageEngine` trait, key encoding with `Tag::Meta` for embedding persistence), m1p5 (entity write API for storing embeddings in entity store)
- **Blocks:** m2p2 (metadata indexes use the `VectorIndex` for filter bitmap selectivity estimation), m2p5 (RETRIEVE executor needs ANN search for candidate generation)

## Research References

- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- USearch evaluation (127K QPS at f32, 167K at int8), filtered search callback architecture, ACORN-1 two-hop expansion, f16 as optimal default, mmap persistence strategy, memory budget analysis (31.5 GB at 10M x 1536d x f16), `reserve()` capacity planning
- [thoughts.md](../../../../thoughts.md) -- Part V.9 (hybrid storage: "vector index and text index are derived state, always rebuildable from the entity store")

## Spec References

- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200), Section 3 (filtered ANN: three strategies with selectivity thresholds), Section 4 (quantization: f16 default, < 1% recall loss), Section 5 (multiple embedding spaces, slot registry), Section 6 (embedding lifecycle: insert, update, delete paths), Section 7 (persistence: save/load/view, delta journal), Section 9 (adaptive query planner: decision tree, threshold reference, runtime statistics), Section 11 (VectorIndex trait, full API), Section 12 (performance targets), Section 13 (invariants: 10 correctness guarantees)
- [docs/specs/00-architecture-overview.md](../../../specs/00-architecture-overview.md) -- Module map showing `storage/vector/`

## Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | VectorIndex Trait + BruteForceIndex | `VectorIndex` trait, all types, `BruteForceIndex`, `MockVectorIndex`, property tests | None | M |
| 02 | USearch Backend | `UsearchIndex` wrapping USearch via Rust crate, f16 quantization, mmap persistence, `#[allow(unsafe_code)]` | Task 01 | L |
| 03 | Embedding Lifecycle + Slot Registry | `l2_normalize`, `EmbeddingSlotRegistry`, insert/update/delete paths, entity store integration | Task 01 | M |
| 04 | Adaptive Query Planner + Benchmarks | `AdaptiveQueryPlanner`, `SelectivityEstimator`, `AnnQueryStats`, Criterion benchmarks | Task 01, Task 02 | M |

## Task Dependency DAG

```
Task 01: VectorIndex Trait + BruteForceIndex
    |                  \
    |                   \
    v                    v
Task 02: USearch Backend    Task 03: Embedding Lifecycle + Slot Registry
    |
    +----> Task 04: Adaptive Query Planner + Benchmarks
    |       (also depends on Task 01)
```

Task 01 is the foundation -- it defines the trait all other tasks implement or consume. Tasks 02 and 03 are parallelizable after Task 01. Task 04 requires both Task 01 (for trait types) and Task 02 (for the USearch backend to benchmark against).

## File Layout

```
tidal/src/
  storage/
    vector/
      mod.rs          -- VectorIndex trait, VectorError, VectorSearchResult, VectorIndexConfig,
                         DistanceMetric, QuantizationLevel, VectorId, types re-exports (Task 01)
      brute.rs        -- BruteForceIndex, MockVectorIndex (Task 01)
      usearch.rs      -- UsearchIndex, #[allow(unsafe_code)] (Task 02)
      lifecycle.rs    -- l2_normalize, embedding insert/update/delete path (Task 03)
      registry.rs     -- EmbeddingSlotRegistry, EmbeddingSlotState (Task 03)
      planner.rs      -- AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats (Task 04)
    mod.rs            -- add `pub mod vector;`
tidal/benches/
  vector.rs           -- Criterion benchmarks (Task 04)
tidal/Cargo.toml      -- add `usearch` and `rayon` dependencies
```

## Open Questions

1. **`usearch` crate version**: The `usearch` crate (crates.io) wraps USearch via CXX. Verify that the latest stable version supports `filtered_search` with a predicate callback. If it does not, the alternative is `hnsw_rs`, which is pure Rust but lacks quantization and deletion support. The research doc recommends USearch and notes `hnsw_rs` as a fallback.

2. **Capacity planning**: USearch `reserve()` must be called before first insertion. tidalDB should over-provision by 2x the schema-defined entity limit. What happens if the index fills up? Benchmark whether a full rebuild is needed or whether `reserve()` can be called again with a higher capacity.

3. **Concurrency model**: USearch claims concurrent reads + writes. Verify that `filtered_search` and `insert` can truly run concurrently without a mutex wrapper. If not, add `RwLock<UsearchIndex>` and document the contention implication. ScyllaDB validates concurrent operation at 1B vectors, but tidalDB's access patterns may differ.

4. **Delta journal vs full save**: The spec recommends a delta journal for incremental persistence (Spec 07, Section 7). For M2 at 10K items (not 10M), a full `save()` on every checkpoint is fast enough (10K x 1536d x f16 = ~30 MB, writes in < 100ms). Defer delta journal implementation to M7 unless benchmarks show otherwise.

5. **Selectivity estimation without m2p2**: Task 04 (planner) depends on selectivity estimates from metadata bitmap indexes (m2p2). For m2p1, the planner can use fixed thresholds with a placeholder `SelectivityEstimator` that returns 1.0 (always use the in-graph filter). Wire up the real estimator when m2p2 is implemented.
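
The placeholder estimator from question 5 and the threshold routing it feeds can be sketched as follows. Everything here is an assumed shape for illustration: the `AnnStrategy` enum, the `estimate` method, `PlaceholderEstimator`, `FixedEstimator`, and `choose_strategy` are hypothetical names, and the 2x widening factor is one point in the 2-3x range given in the acceptance criteria.

```rust
/// Hypothetical enum covering the four routes named in the acceptance criteria.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AnnStrategy {
    Unfiltered,
    PreFilterBruteForce,
    WidenedFilteredSearch { ef_search: usize },
    FilteredSearch { ef_search: usize },
}

/// Assumed trait shape: estimated fraction of items passing the filter, in [0.0, 1.0].
pub trait SelectivityEstimator {
    fn estimate(&self) -> f64;
}

/// m2p1 placeholder: always reports 1.0, so filtered queries take the in-graph filter path.
pub struct PlaceholderEstimator;

impl SelectivityEstimator for PlaceholderEstimator {
    fn estimate(&self) -> f64 {
        1.0
    }
}

/// Test double that reports a fixed selectivity.
pub struct FixedEstimator(pub f64);

impl SelectivityEstimator for FixedEstimator {
    fn estimate(&self) -> f64 {
        self.0
    }
}

/// Route a query using the spec thresholds: < 1%, 1-20%, > 20%, or no filter at all.
pub fn choose_strategy(
    has_filter: bool,
    estimator: &dyn SelectivityEstimator,
    base_ef_search: usize,
) -> AnnStrategy {
    if !has_filter {
        return AnnStrategy::Unfiltered;
    }
    let selectivity = estimator.estimate();
    if selectivity < 0.01 {
        AnnStrategy::PreFilterBruteForce
    } else if selectivity <= 0.20 {
        // Assumption: widen by 2x; the criteria allow 2-3x.
        AnnStrategy::WidenedFilteredSearch { ef_search: base_ef_search * 2 }
    } else {
        AnnStrategy::FilteredSearch { ef_search: base_ef_search }
    }
}
```

Keeping the estimator behind a trait means the m2p2 bitmap-based estimator can replace the placeholder without touching the routing logic, and the thresholds can be exercised in unit tests with a fixed-value double before real selectivity data exists.
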