jordan 6fdaa1584b feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint

- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples
  (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test
  coverage for every public API surface
- m1p5: TidalDb public API — write_item, signal, read_decay_score,
  read_windowed_count, read_velocity; StorageBox enum routing memory vs
  fjall; WalSender/WalHandleWriter bridge; WAL replay on open
- Periodic checkpoint: 30s background thread for persistent+schema mode;
  FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful
  shutdown via Arc<AtomicBool> + join before final checkpoint
- ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing)
- Milestone 2 planning scaffolding added under docs/planning/milestone-2/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 22:45:10 -07:00

9.0 KiB


Milestone 2, Phase 1: Vector Index Integration (USearch)

Phase Deliverable

The VectorIndex trait and two implementations: BruteForceIndex (pure Rust, exact search) and UsearchIndex (USearch C++ FFI, HNSW approximate search). Items can be inserted with embeddings and retrieved by approximate nearest-neighbor similarity. Vectors are L2-normalized at insertion time, so ranking by L2 distance is equivalent to ranking by cosine similarity (for unit vectors, squared L2 distance = 2 - 2 x cosine similarity). An adaptive query planner routes filtered ANN queries to the optimal strategy based on estimated selectivity: brute-force for very selective filters (< 1%), widened ef_search for the danger zone (1-20%), and standard in-graph predicate filtering for broad filters (> 20%). The USearch backend uses f16 quantization by default, mmap persistence via view() for instant restart, and full save()/load() for checkpoint coordination. A BruteForceIndex exists for correctness verification, small datasets, and the pre-filter brute-force strategy.
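The normalization step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: VectorError is reduced to the single ZeroNormVector variant, and squared_l2 and dot are hypothetical helpers that exist only to check that, for unit vectors, squared L2 distance equals 2 - 2 x dot product.

```rust
/// Reduced error type for this sketch; the planned VectorError has more variants.
#[derive(Debug, PartialEq)]
pub enum VectorError {
    ZeroNormVector,
}

/// Scale `v` to unit length. A zero-norm vector has no direction, so it is rejected.
pub fn l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return Err(VectorError::ZeroNormVector);
    }
    Ok(v.iter().map(|x| x / norm).collect())
}

/// Hypothetical helper: squared Euclidean distance between two equal-length vectors.
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical helper: dot product, which equals cosine similarity for unit vectors.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Because the mapping from cosine similarity to squared L2 distance is monotonically decreasing, the nearest neighbors under L2 are the most similar under cosine, which is why the index can store only normalized vectors and use the L2 metric.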

Acceptance Criteria

VectorIndex trait with insert(key, vector), delete(key), search(query, k, ef_search), filtered_search(query, k, ef_search, predicate), save(), load(), view(), reserve(), len(), len_live(), is_empty(), tombstone_ratio()
VectorSearchResult { id: VectorId, distance: f32 } and VectorIndexConfig { dimensions, metric, quantization, connectivity, ef_construction, ef_search } types
DistanceMetric enum: L2, InnerProduct
QuantizationLevel enum: F32, F16, Int8
VectorError enum: DimensionMismatch, CapacityExceeded, NotFound, Io, CorruptedIndex, Backend, ZeroNormVector
BruteForceIndex implements VectorIndex with exact linear-scan search, RwLock<HashMap<VectorId, Vec<f32>>> storage
MockVectorIndex returns predetermined results for unit tests and records call history
USearch backend implements the trait with f16 quantization (default), M=16, ef_construction=200, ef_search=200
USearch filtered_search passes predicate closure to USearch's predicate callback API
USearch reserve() for capacity management (2x over-provision)
USearch save(), load(), view() delegating to USearch persistence methods
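#![deny(unsafe_code)] at the crate root, relaxed with #[allow(unsafe_code)] only in storage/vector/usearch.rs, with // SAFETY: comments on every unsafe block (a crate-level #![forbid(unsafe_code)] cannot be overridden by an inner allow, so an existing forbid must be downgraded to deny)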
l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> normalizes to unit length; fails on zero-norm vectors
EmbeddingSlotRegistry maps (EntityKind, slot_name) to EmbeddingSlotState { index, dimensions, quantization, params }
Insert path: validate dims, normalize, store f32 in entity store (META key with EMB:slot_name suffix), insert quantized into HNSW
Update path: tombstone in HNSW, insert new vector
Delete path: tombstone only
Adaptive query planner: selectivity < 1% triggers pre-filter + brute-force; 1-20% uses filtered_search with widened ef_search (2-3x); > 20% uses standard filtered_search; 100% (no filter) uses unfiltered search()
- Note: ROADMAP.md acceptance criteria say selectivity < 2% -> brute-force. These docs use the refined spec thresholds from Spec 07 Section 9 (< 1% brute-force, 1-20% widened HNSW, > 20% in-graph filter). The roadmap threshold of 2% is superseded by the spec.
ANN retrieval at 10K vectors returns top-100 with recall@100 > 0.95 (measured against BruteForceIndex)
ANN retrieval latency < 10ms at 10K vectors (Criterion benchmark)
Persistence: save() on checkpoint, view() on restart for immediate read serving
Criterion benchmarks: unfiltered search, filtered search at 20% and 5% selectivity, brute-force search, recall@100
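The selectivity thresholds in the adaptive query planner criterion can be sketched as a single decision function. The names AnnStrategy and choose_strategy are hypothetical, the 2x widening factor is one assumed point in the stated 2-3x range, and treating exactly 20% as part of the widened band is an assumption about how the boundary is handled.

```rust
/// Hypothetical strategy enum; variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AnnStrategy {
    /// No filter present: plain search().
    Unfiltered,
    /// Selectivity < 1%: collect matching items, then exact scan.
    PreFilterBruteForce,
    /// Selectivity 1-20%: filtered_search with a widened ef_search.
    WidenedFilteredSearch { ef_search: usize },
    /// Selectivity > 20%: filtered_search with the configured ef_search.
    FilteredSearch { ef_search: usize },
}

/// Pick a strategy from an estimated selectivity (fraction of items passing the filter).
/// `None` means the query carries no filter at all.
pub fn choose_strategy(selectivity: Option<f64>, base_ef_search: usize) -> AnnStrategy {
    match selectivity {
        None => AnnStrategy::Unfiltered,
        Some(s) if s < 0.01 => AnnStrategy::PreFilterBruteForce,
        Some(s) if s <= 0.20 => AnnStrategy::WidenedFilteredSearch {
            // Assumed widening factor of 2x (the criterion allows 2-3x).
            ef_search: base_ef_search * 2,
        },
        Some(_) => AnnStrategy::FilteredSearch { ef_search: base_ef_search },
    }
}
```

Modelling "no filter" as None rather than as a selectivity of 1.0 keeps a filter that happens to match everything on the filtered path, so its predicate is still applied.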

Dependencies

Requires: m1p1 (types: EntityId, EntityKind), m1p3 (storage: StorageEngine trait, key encoding with Tag::Meta for embedding persistence), m1p5 (entity write API for storing embeddings in entity store)
Blocks: m2p2 (metadata indexes use the VectorIndex for filter bitmap selectivity estimation), m2p5 (RETRIEVE executor needs ANN search for candidate generation)

Research References

docs/research/ann_for_tidaldb.md -- USearch evaluation (127K QPS at f32, 167K at int8), filtered search callback architecture, ACORN-1 two-hop expansion, f16 as optimal default, mmap persistence strategy, memory budget analysis (31.5 GB at 10M x 1536d x f16), reserve() capacity planning
thoughts.md -- Part V.9 (hybrid storage: "vector index and text index are derived state, always rebuildable from the entity store")

Spec References

docs/specs/07-vector-retrieval.md -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200), Section 3 (filtered ANN: three strategies with selectivity thresholds), Section 4 (quantization: f16 default, < 1% recall loss), Section 5 (multiple embedding spaces, slot registry), Section 6 (embedding lifecycle: insert, update, delete paths), Section 7 (persistence: save/load/view, delta journal), Section 9 (adaptive query planner: decision tree, threshold reference, runtime statistics), Section 11 (VectorIndex trait, full API), Section 12 (performance targets), Section 13 (invariants: 10 correctness guarantees)
docs/specs/00-architecture-overview.md -- Module map showing storage/vector/

Task Index

#	Task	Delivers	Depends On	Complexity
01	VectorIndex Trait + BruteForceIndex	`VectorIndex` trait, all types, `BruteForceIndex`, `MockVectorIndex`, property tests	None	M
02	USearch Backend	`UsearchIndex` wrapping USearch via Rust crate, f16 quantization, mmap persistence, `#[allow(unsafe_code)]`	Task 01	L
03	Embedding Lifecycle + Slot Registry	`l2_normalize`, `EmbeddingSlotRegistry`, insert/update/delete paths, entity store integration	Task 01	M
04	Adaptive Query Planner + Benchmarks	`AdaptiveQueryPlanner`, `SelectivityEstimator`, `AnnQueryStats`, Criterion benchmarks	Task 01, Task 02	M

Task Dependency DAG

Task 01: VectorIndex Trait + BruteForceIndex
    |                   \
    |                    \
    v                     v
Task 02: USearch Backend  Task 03: Embedding Lifecycle + Slot Registry
    |
    v
Task 04: Adaptive Query Planner + Benchmarks
    (also depends on Task 01)

Task 01 is the foundation -- it defines the trait all other tasks implement or consume. Tasks 02 and 03 are parallelizable after Task 01. Task 04 requires both Task 01 (for trait types) and Task 02 (for USearch backend to benchmark against).
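To make the Task 01 foundation concrete, here is a reduced sketch of the trait and the exact-search implementation. It covers only insert, delete, search, and len. The VectorId alias (u64), the fields on DimensionMismatch, and the use of squared L2 distance are assumptions for illustration, and the full trait in Spec 07 Section 11 is larger than this.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Assumed key type for this sketch.
pub type VectorId = u64;

#[derive(Debug, Clone, PartialEq)]
pub struct VectorSearchResult {
    pub id: VectorId,
    pub distance: f32,
}

/// Reduced error type; variant payloads are assumptions.
#[derive(Debug, PartialEq)]
pub enum VectorError {
    DimensionMismatch { expected: usize, got: usize },
    NotFound,
}

/// Subset of the planned trait surface.
pub trait VectorIndex {
    fn insert(&self, key: VectorId, vector: &[f32]) -> Result<(), VectorError>;
    fn delete(&self, key: VectorId) -> Result<(), VectorError>;
    fn search(&self, query: &[f32], k: usize, ef_search: usize)
        -> Result<Vec<VectorSearchResult>, VectorError>;
    fn len(&self) -> usize;
}

/// Exact linear-scan index backed by RwLock<HashMap<VectorId, Vec<f32>>>.
pub struct BruteForceIndex {
    dimensions: usize,
    vectors: RwLock<HashMap<VectorId, Vec<f32>>>,
}

impl BruteForceIndex {
    pub fn new(dimensions: usize) -> Self {
        Self { dimensions, vectors: RwLock::new(HashMap::new()) }
    }

    fn check_dims(&self, v: &[f32]) -> Result<(), VectorError> {
        if v.len() == self.dimensions {
            Ok(())
        } else {
            Err(VectorError::DimensionMismatch { expected: self.dimensions, got: v.len() })
        }
    }
}

impl VectorIndex for BruteForceIndex {
    fn insert(&self, key: VectorId, vector: &[f32]) -> Result<(), VectorError> {
        self.check_dims(vector)?;
        self.vectors.write().unwrap().insert(key, vector.to_vec());
        Ok(())
    }

    fn delete(&self, key: VectorId) -> Result<(), VectorError> {
        self.vectors.write().unwrap().remove(&key).map(|_| ()).ok_or(VectorError::NotFound)
    }

    /// Scans every stored vector; `ef_search` is an HNSW knob and is ignored here.
    fn search(&self, query: &[f32], k: usize, _ef_search: usize)
        -> Result<Vec<VectorSearchResult>, VectorError>
    {
        self.check_dims(query)?;
        let guard = self.vectors.read().unwrap();
        let mut hits: Vec<VectorSearchResult> = guard
            .iter()
            .map(|(id, v)| VectorSearchResult {
                id: *id,
                // Squared L2 distance; ordering matches true L2 distance.
                distance: v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum(),
            })
            .collect();
        // Sort nearest first; break ties by id so results are deterministic.
        hits.sort_by(|a, b| a.distance.total_cmp(&b.distance).then(a.id.cmp(&b.id)));
        hits.truncate(k);
        Ok(hits)
    }

    fn len(&self) -> usize {
        self.vectors.read().unwrap().len()
    }
}
```

Taking &self on every method and keeping the lock inside the index lets callers share one instance across threads, which is one plausible way to keep the exact and approximate backends interchangeable behind the same trait.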

File Layout

tidal/src/
  storage/
    vector/
      mod.rs        -- VectorIndex trait, VectorError, VectorSearchResult, VectorIndexConfig,
                       DistanceMetric, QuantizationLevel, VectorId, types re-exports (Task 01)
      brute.rs      -- BruteForceIndex, MockVectorIndex (Task 01)
      usearch.rs    -- UsearchIndex, #[allow(unsafe_code)] (Task 02)
      lifecycle.rs  -- l2_normalize, embedding insert/update/delete path (Task 03)
      registry.rs   -- EmbeddingSlotRegistry, EmbeddingSlotState (Task 03)
      planner.rs    -- AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats (Task 04)
    mod.rs          -- add `pub mod vector;`
tidal/benches/
  vector.rs         -- Criterion benchmarks (Task 04)
tidal/Cargo.toml    -- add `usearch` and `rayon` dependencies

Open Questions

usearch crate version: The usearch crate (crates.io) wraps USearch via CXX. Verify that the latest stable version supports filtered_search with a predicate callback. If it does not, the alternative is hnsw_rs, which is pure Rust but lacks quantization and deletion support. The research doc recommends USearch but notes hnsw_rs as a fallback.
Capacity planning: USearch reserve() must be called before first insertion. tidalDB should over-provision by 2x the schema-defined entity limit. What happens if the index fills up? Need to benchmark whether a full rebuild is needed or if reserve() can be called again with higher capacity.
Concurrency model: USearch claims to support concurrent reads and writes. Verify that filtered_search and insert can truly run concurrently without a mutex wrapper. If they cannot, add RwLock<UsearchIndex> and document the contention implication. ScyllaDB validates concurrent operation at 1B vectors, but tidalDB's access patterns may differ.
Delta journal vs full save: The spec recommends a delta journal for incremental persistence (Spec 07, Section 7). For M2 at 10K items (not 10M), a full save() on every checkpoint is fast enough (10K x 1536d x f16 = ~30 MB, writes in < 100ms). Defer delta journal implementation to M7 unless benchmarks show otherwise.
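Selectivity estimation without m2p2: Task 04 (planner) depends on selectivity estimates from metadata bitmap indexes (m2p2). For the m2p1 phase, the planner can use the fixed thresholds with a placeholder SelectivityEstimator that returns 1.0, so every filtered query takes the standard in-graph filter path. The unfiltered search() path should be selected by the absence of a filter rather than by an estimate of 1.0; otherwise the placeholder would drop the predicate. Wire up the real estimator when m2p2 is implemented.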
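A placeholder estimator of the kind described in the last question might look like the sketch below. FilterExpr, the estimate method signature, PlaceholderEstimator, and uses_in_graph_filter are all hypothetical names; the real filter type and estimator interface would come from m2p2 and Spec 07.

```rust
/// Hypothetical filter representation; the real type arrives with m2p2.
pub struct FilterExpr(pub String);

/// Assumed shape of the estimator interface.
pub trait SelectivityEstimator {
    /// Estimated fraction of items that pass `filter`, in [0.0, 1.0].
    fn estimate(&self, filter: &FilterExpr) -> f64;
}

/// Stand-in used until metadata bitmap indexes exist: every filter is
/// reported as matching everything.
pub struct PlaceholderEstimator;

impl SelectivityEstimator for PlaceholderEstimator {
    fn estimate(&self, _filter: &FilterExpr) -> f64 {
        1.0
    }
}

/// True when the estimate falls in the "> 20%" band, i.e. the standard
/// in-graph filtered_search path.
pub fn uses_in_graph_filter(estimator: &dyn SelectivityEstimator, filter: &FilterExpr) -> bool {
    estimator.estimate(filter) > 0.20
}
```

Keeping the estimator behind a trait means the planner code written in m2p1 would not need to change when the bitmap-backed estimator replaces the placeholder.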
