# Milestone 2, Phase 2: Metadata Indexes and Filter Engine

## Phase Deliverable

Roaring bitmap indexes for categorical metadata fields (category, format, creator_id, tags) and B-tree range indexes for numeric/timestamp fields (created_at, duration). A composable filter engine that evaluates arbitrary filter combinations and produces either a `RoaringBitmap` (for pre-filtering ANN) or a `Fn(EntityId) -> bool` predicate closure (for in-graph filtering). Filter selectivity estimates for the adaptive query planner from m2p1.

This is the indexing layer that makes `FILTER category:jazz, format:video, duration_min:5m, created_within:7d` execute in microseconds instead of milliseconds. Without it, every metadata filter requires a full entity scan. With it, the query planner can estimate selectivity before choosing an ANN strategy (Spec 07 Section 9), and the RETRIEVE executor can intersect pre-computed bitmaps instead of loading entity metadata per candidate (Spec 08 Section 7).

## Acceptance Criteria

- [ ] Roaring bitmap per high-cardinality metadata value: category, format, creator_id, tags (multi-value)
- [ ] B-tree index for range attributes: created_at (nanosecond timestamps), duration (seconds)
- [ ] Filter expressions are composable: AND across dimensions, OR within a dimension, NOT for negation
- [ ] `filter.selectivity()` estimates the fraction of items matching (for query planner)
- [ ] `filter.to_bitmap()` returns a `RoaringBitmap` for pre-filtering
- [ ] `filter.to_predicate()` returns a `Fn(EntityId) -> bool` for in-graph filtering
- [ ] Filters tested: `category:jazz`, `format:video`, `duration_min:5m`, `created_within:7d`, and arbitrary AND/OR/NOT combinations
- [ ] Filter evaluation < 1 microsecond per candidate (benchmarked via bitmap containment check)
- [ ] Index insert and delete operations are correct (property tested)
- [ ] Selectivity estimates are in [0.0, 1.0] for all inputs (property tested)

## Dependencies

- **Requires:** m1p1 (types: `EntityId`,
`EntityKind`, `Timestamp`), m1p3 (storage: `StorageEngine` trait, key encoding with `Tag::Idx` for index persistence), m1p5 (entity write API -- bitmap indexes are updated when entities are written)
- **Blocks:** m2p1 Task 04 (adaptive query planner's `SelectivityEstimator` uses m2p2's bitmap cardinalities), m2p5 (RETRIEVE executor applies filters to candidate sets)

## Research References

- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Selectivity estimation via bitmap cardinality, pre-filter brute-force strategy for selective filters (<1%), danger zone (1-20%) requiring widened ef_search, bitmap intersection as the standard pre-filtering primitive across Qdrant/Weaviate/Pinecone
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (filtered ANN: three strategies with selectivity thresholds, selectivity estimation from bitmap cardinalities), Section 9 (adaptive query planner: decision tree using selectivity estimates)
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7 (filter evaluation: bitmap-based architecture, filter push-down, filter types, short-circuit evaluation, user-state filter implementation)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.2 (scan strategy: metadata-indexed scan resolves filters to roaring bitmaps before signal reads), Section 4 Stage 3 (filter evaluation using pre-computed roaring bitmaps for keyword fields and range scans for numeric fields)

## Spec References

- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"] intersect format_bitmap["video"]`), Section 7.2 (filter push-down into ANN predicate callback or pre-filter set), Section 7.3 (`Filter` enum: `Eq`, `Any`, `Range`, `Min`, `Max`, `Preset`, `CreatedWithin`, `CreatedAfter`, `CreatedBefore`), Section 7.4 (short-circuit evaluation: sort by ascending cardinality,
abort on empty intersection)
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: keyword equality uses `cardinality(bitmap[field][value]) / total_entities`; compound AND uses independence assumption `product(individual)`; compound OR uses `1 - product(1 - s_i)`)

## Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Roaring Bitmap Indexes | `BitmapIndex` struct, insert/delete/get/cardinality, persistence via `Tag::Idx`, multi-value field support (tags) | None | M |
| 02 | B-tree Range Indexes | `RangeIndex` struct, insert/delete/range query returning `RoaringBitmap`, selectivity estimation for ranges | None | S |
| 03 | Composable Filter Engine | `FilterExpr` AST, `FilterEvaluator`, `FilterResult` (bitmap or predicate), selectivity estimation, Criterion benchmarks | Task 01, Task 02 | M |

## Task Dependency DAG

```
Task 01: Roaring Bitmap Indexes     Task 02: B-tree Range Indexes
        |                                   |
        +-----------------------------------+
                          |
                          v
          Task 03: Composable Filter Engine
```

Tasks 01 and 02 are fully parallelizable -- they share no types or state beyond `EntityId`. Task 03 composes them into the filter evaluation pipeline.

## File Layout

```
tidal/src/
  storage/
    indexes/
      mod.rs      -- pub use re-exports, IndexError type
      bitmap.rs   -- BitmapIndex (Task 01)
      range.rs    -- RangeIndex (Task 02)
      filter.rs   -- FilterExpr, FilterEvaluator, FilterResult (Task 03)
    mod.rs        -- add `pub mod indexes;`
tidal/benches/
  filters.rs      -- Criterion benchmarks (Task 03)
tidal/Cargo.toml  -- add `roaring` dependency
```

## Open Questions

1. **`roaring` vs `croaring`**: The `roaring` crate (pure Rust, simple API) vs `croaring` (C bindings, faster bulk operations). For M2 with 10K items, `roaring` is sufficient and keeps the `#![forbid(unsafe_code)]` crate-level lint intact. Use `roaring`. If M7 benchmarks at 1M+ items show roaring is a bottleneck, switch to `croaring`.
2. **Index update on entity write**: When `db.write_item(id, metadata)` is called, the bitmap and range indexes must be updated atomically with the storage engine write. Define the update order: storage engine first (source of truth), then update in-memory indexes. If the process crashes between storage write and in-memory update, the indexes are rebuilt from the storage engine on restart. The m2p2 phase defines the index data structures and their in-memory operations; the wiring into the entity write path is done in m2p5 (RETRIEVE executor) or a dedicated integration task.
3. **Multi-value fields (tags)**: Tags are a multi-value field -- one entity can have multiple tags. The bitmap index must support this: `insert(entity_id, "jazz")` and `insert(entity_id, "piano")` for the same entity. The entity appears in the bitmap for EACH tag value. When deleting an entity, it must be removed from ALL tag bitmaps. The `BitmapIndex` API uses `insert(entity_id, field_value)` and `delete(entity_id, field_value)`, so multi-value is handled by calling insert once per value.
4. **Index warming on startup**: At startup, load all bitmap indexes from storage engine before accepting queries. Time to warm 10K items with 10 category values = ~10ms (acceptable). At 1M items this may take ~1s. At 10M items this becomes a concern -- defer index pre-warming optimization to M7.
5. **Persistence granularity**: Each `(field_name, field_value)` pair's bitmap is stored as a single key in the storage engine: `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")`. For M2 with 10K items and ~50 distinct metadata values, this means ~50 keys. At 10M items with 10K distinct values, this means ~10K keys -- still manageable. Serialization uses `RoaringBitmap::serialize_into()` / `RoaringBitmap::deserialize_from()`.
6. **Separate index keyspace vs entity keyspace**: Index bitmaps are global (not per-entity) -- they map field values to sets of entity IDs.
The subject-prefix key encoding (`[entity_id][NUL][tag][suffix]`) is entity-centric. Index keys need a different encoding since they are field-value-centric, not entity-centric. Solution: use a reserved entity ID (e.g., `EntityId(0)` or `EntityId(u64::MAX)`) as the "index root" with `Tag::Idx`, or use a dedicated prefix outside the entity keyspace. Decision: use `EntityId(0)` as the index root -- it is never a valid entity ID in practice, and keeps the key encoding uniform.
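
The multi-value behaviour described in Open Question 3 can be sketched as follows. This is a minimal illustration, not the planned `BitmapIndex`: it assumes `EntityId` wraps a `u64`, uses `BTreeSet` as a stand-in for `RoaringBitmap` so it compiles with the standard library only, and adds a hypothetical `delete_entity` helper that is not part of the API named above.

```rust
use std::collections::{BTreeSet, HashMap};

/// Assumption: stand-in for the `EntityId` type from m1p1.
type EntityId = u64;

/// Hypothetical sketch of a per-field index: one ID set per distinct value.
/// `BTreeSet<EntityId>` stands in for `RoaringBitmap`.
#[derive(Default)]
pub struct BitmapIndex {
    by_value: HashMap<String, BTreeSet<EntityId>>,
}

impl BitmapIndex {
    /// Add `id` to the set for `value`. Multi-value fields call this once per value.
    pub fn insert(&mut self, id: EntityId, value: &str) {
        self.by_value.entry(value.to_string()).or_default().insert(id);
    }

    /// Remove `id` from the set for `value`, dropping the set once it is empty.
    pub fn delete(&mut self, id: EntityId, value: &str) {
        let now_empty = match self.by_value.get_mut(value) {
            Some(set) => {
                set.remove(&id);
                set.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.by_value.remove(value);
        }
    }

    /// Hypothetical helper: remove `id` from every value's set.
    /// This scans all values; a real index might track values per entity instead.
    pub fn delete_entity(&mut self, id: EntityId) {
        self.by_value.retain(|_, set| {
            set.remove(&id);
            !set.is_empty()
        });
    }

    /// Number of entities carrying `value`; 0 for unknown values.
    pub fn cardinality(&self, value: &str) -> usize {
        self.by_value.get(value).map_or(0, |set| set.len())
    }

    /// Per-candidate containment check, the operation behind a predicate closure.
    pub fn contains(&self, id: EntityId, value: &str) -> bool {
        self.by_value.get(value).map_or(false, |set| set.contains(&id))
    }
}
```

Inserting entity 7 under both `"jazz"` and `"piano"` places it in two sets, and `delete_entity(7)` clears it from both. Whether entity deletion should scan all values or use a reverse map from entity to values is left open here.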
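
For the B-tree range index in Task 02, one possible shape is a `BTreeMap` from attribute value to the set of entities holding that value. The sketch below is illustrative only: it assumes keys fit in a `u64` (seconds for duration, nanoseconds for created_at), that each entity has one value per indexed field, and again uses `BTreeSet` in place of `RoaringBitmap`. Its selectivity is measured against indexed entries rather than `total_entities`.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::ops::RangeInclusive;

/// Assumption: stand-in for the `EntityId` type from m1p1.
type EntityId = u64;

/// Hypothetical sketch of a range index over one numeric field.
#[derive(Default)]
pub struct RangeIndex {
    by_key: BTreeMap<u64, BTreeSet<EntityId>>,
    len: usize, // number of (entity, key) entries currently indexed
}

impl RangeIndex {
    /// Index `id` under `key`; re-inserting the same pair is a no-op.
    pub fn insert(&mut self, id: EntityId, key: u64) {
        if self.by_key.entry(key).or_default().insert(id) {
            self.len += 1;
        }
    }

    /// Remove the `(id, key)` entry if present, dropping empty buckets.
    pub fn delete(&mut self, id: EntityId, key: u64) {
        let now_empty = match self.by_key.get_mut(&key) {
            Some(ids) => {
                if ids.remove(&id) {
                    self.len -= 1;
                }
                ids.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.by_key.remove(&key);
        }
    }

    /// All entities whose key lies in `bounds`.
    pub fn range(&self, bounds: RangeInclusive<u64>) -> BTreeSet<EntityId> {
        // `BTreeMap::range` panics when start > end, so guard first.
        if bounds.is_empty() {
            return BTreeSet::new();
        }
        self.by_key
            .range(bounds)
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }

    /// Fraction of indexed entries whose key lies in `bounds`, always in [0.0, 1.0].
    pub fn selectivity(&self, bounds: RangeInclusive<u64>) -> f64 {
        if self.len == 0 || bounds.is_empty() {
            return 0.0;
        }
        let hits: usize = self.by_key.range(bounds).map(|(_, ids)| ids.len()).sum();
        hits as f64 / self.len as f64
    }
}
```

Under these assumptions, `duration_min:5m` would map to `range(300..=u64::MAX)` on a duration index keyed in seconds.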
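
The short-circuit rule cited from Spec 08 Section 7.4 (sort by ascending cardinality, abort on empty intersection) can be illustrated with a small helper. The function name `intersect_all` is hypothetical, and `BTreeSet` again stands in for `RoaringBitmap`; with the `roaring` crate the inner step would be a bitmap intersection rather than `retain`.

```rust
use std::collections::BTreeSet;

/// Assumption: stand-in for the `EntityId` type from m1p1.
type EntityId = u64;

/// AND together per-dimension ID sets, smallest first, stopping once the
/// running result is empty. An empty input list yields an empty set here;
/// treating "no filters" as "match everything" is left to the caller.
pub fn intersect_all(mut sets: Vec<&BTreeSet<EntityId>>) -> BTreeSet<EntityId> {
    // Starting from the most selective set keeps the running result small.
    sets.sort_by_key(|set| set.len());

    let mut iter = sets.into_iter();
    let mut acc = match iter.next() {
        Some(first) => first.clone(),
        None => return BTreeSet::new(),
    };

    for set in iter {
        if acc.is_empty() {
            break; // nothing can match any more, skip the remaining sets
        }
        acc.retain(|id| set.contains(id));
    }
    acc
}
```

Sorting first means a highly selective filter such as a narrow `created_within` window bounds the cost of every later intersection.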
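
The selectivity formulas quoted from Spec 07 Section 3 compose recursively over a filter tree. The sketch below uses a hypothetical `FilterExpr` whose leaves carry a precomputed selectivity in place of an index lookup. The AND and OR rules follow the formulas cited in Spec References; the NOT rule (`1 - s`) and the final clamp are assumptions added so the result stays in [0.0, 1.0] for any input.

```rust
/// Hypothetical filter AST. A real leaf would name a field and value and
/// look up `cardinality / total_entities`; here it holds the estimate directly.
pub enum FilterExpr {
    Leaf(f64),
    And(Vec<FilterExpr>),
    Or(Vec<FilterExpr>),
    Not(Box<FilterExpr>),
}

impl FilterExpr {
    /// Estimated fraction of entities matching this expression.
    pub fn selectivity(&self) -> f64 {
        let s = match self {
            FilterExpr::Leaf(s) => *s,
            // Independence assumption: product of the children's selectivities.
            FilterExpr::And(children) => children
                .iter()
                .map(|c| c.selectivity())
                .product::<f64>(),
            // 1 - product(1 - s_i): probability that at least one child matches.
            FilterExpr::Or(children) => {
                1.0 - children
                    .iter()
                    .map(|c| 1.0 - c.selectivity())
                    .product::<f64>()
            }
            // Assumption: negation is the complement of the child estimate.
            FilterExpr::Not(child) => 1.0 - child.selectivity(),
        };
        // Guard against NaN and out-of-range leaf inputs.
        if s.is_nan() {
            0.0
        } else {
            s.clamp(0.0, 1.0)
        }
    }
}
```

For example, a `category:jazz` leaf at 0.1 ANDed with a `format:video` leaf at 0.5 gives an estimate of about 0.05, which falls in the 1-20% range the research note calls the danger zone.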