Milestone 2, Phase 2: Metadata Indexes and Filter Engine
Phase Deliverable
Roaring bitmap indexes for categorical metadata fields (category, format, creator_id, tags) and B-tree range indexes for numeric/timestamp fields (created_at, duration). A composable filter engine that evaluates arbitrary filter combinations and produces either a RoaringBitmap (for pre-filtering ANN) or a Fn(EntityId) -> bool predicate closure (for in-graph filtering). Filter selectivity estimates for the adaptive query planner from m2p1.
This is the indexing layer that makes FILTER category:jazz, format:video, duration_min:5m, created_within:7d execute in microseconds instead of milliseconds. Without it, every metadata filter requires a full entity scan. With it, the query planner can estimate selectivity before choosing an ANN strategy (Spec 07 Section 9), and the RETRIEVE executor can intersect pre-computed bitmaps instead of loading entity metadata per candidate (Spec 08 Section 7).
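To make the intersection idea concrete, here is a minimal sketch of resolving `category:jazz AND format:video` from pre-computed per-value sets. It is an illustration under stated assumptions, not the planned implementation: `BTreeSet<u64>` stands in for `RoaringBitmap` so the example compiles with only the standard library, and `PostingIndex`, `insert`, and `intersect` are hypothetical names.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a roaring bitmap: an ordered set of entity IDs (assumption).
type IdSet = BTreeSet<u64>;

/// Hypothetical index: (field, value) -> set of entity IDs carrying that value.
#[derive(Default)]
struct PostingIndex {
    postings: HashMap<(String, String), IdSet>,
}

impl PostingIndex {
    fn insert(&mut self, entity_id: u64, field: &str, value: &str) {
        self.postings
            .entry((field.to_string(), value.to_string()))
            .or_default()
            .insert(entity_id);
    }

    /// AND across dimensions: intersect the sets of every (field, value) term.
    /// A term with no set matches nothing, so the result is empty.
    /// In this sketch an empty term list also returns an empty set.
    fn intersect(&self, terms: &[(&str, &str)]) -> IdSet {
        let mut sets: Vec<&IdSet> = Vec::new();
        for (field, value) in terms {
            match self.postings.get(&(field.to_string(), value.to_string())) {
                Some(set) => sets.push(set),
                None => return IdSet::new(),
            }
        }
        // Start from the smallest set so the working set only shrinks.
        sets.sort_by_key(|s| s.len());
        let mut iter = sets.into_iter();
        let mut acc = match iter.next() {
            Some(first) => first.clone(),
            None => return IdSet::new(),
        };
        for set in iter {
            acc = acc.intersection(set).copied().collect();
            if acc.is_empty() {
                break; // nothing can match; skip the remaining terms
            }
        }
        acc
    }
}
```

No entity metadata is loaded during `intersect`; the cost depends only on the sizes of the sets involved, which is the property the paragraph above relies on.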
Acceptance Criteria
- Roaring bitmap per high-cardinality metadata value: category, format, creator_id, tags (multi-value)
- B-tree index for range attributes: created_at (nanosecond timestamps), duration (seconds)
- Filter expressions are composable: AND across dimensions, OR within a dimension, NOT for negation
- `filter.selectivity()` estimates the fraction of items matching (for query planner)
- `filter.to_bitmap()` returns a `RoaringBitmap` for pre-filtering
- `filter.to_predicate()` returns a `Fn(EntityId) -> bool` for in-graph filtering
- Filters tested: `category:jazz`, `format:video`, `duration_min:5m`, `created_within:7d`, and arbitrary AND/OR/NOT combinations
- Filter evaluation < 1 microsecond per candidate (benchmarked via bitmap containment check)
- Index insert and delete operations are correct (property tested)
- Selectivity estimates are in [0.0, 1.0] for all inputs (property tested)
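The composability criteria above (AND across dimensions, OR within a dimension, NOT for negation, plus the bitmap and predicate outputs) could look roughly like the sketch below. Everything here is an assumption for illustration: `EntityId` is shown as a plain `u64` alias rather than the m1p1 type, `BTreeSet` stands in for `RoaringBitmap`, and the `FilterExpr` variants and `FilterEvaluator` fields are hypothetical shapes, not the Task 03 design.

```rust
use std::collections::{BTreeSet, HashMap};

type EntityId = u64; // assumption: the real EntityId comes from m1p1
type IdSet = BTreeSet<EntityId>; // stand-in for RoaringBitmap

/// Hypothetical filter AST; leaves name a (field, value) pair.
enum FilterExpr {
    Eq(String, String),
    And(Vec<FilterExpr>),
    Or(Vec<FilterExpr>),
    Not(Box<FilterExpr>),
}

/// Hypothetical evaluator: per-value sets plus the set of all entity IDs,
/// which NOT needs in order to compute a complement.
struct FilterEvaluator {
    postings: HashMap<(String, String), IdSet>,
    universe: IdSet,
}

impl FilterEvaluator {
    /// Resolve the expression to the full set of matching IDs.
    fn to_bitmap(&self, expr: &FilterExpr) -> IdSet {
        match expr {
            FilterExpr::Eq(field, value) => self
                .postings
                .get(&(field.clone(), value.clone()))
                .cloned()
                .unwrap_or_default(),
            FilterExpr::And(children) => {
                let mut acc = self.universe.clone();
                for child in children {
                    let set = self.to_bitmap(child);
                    acc = acc.intersection(&set).copied().collect();
                }
                acc
            }
            FilterExpr::Or(children) => {
                let mut acc = IdSet::new();
                for child in children {
                    acc.extend(self.to_bitmap(child));
                }
                acc
            }
            FilterExpr::Not(inner) => {
                let set = self.to_bitmap(inner);
                self.universe.difference(&set).copied().collect()
            }
        }
    }

    /// Resolve once, then answer per-candidate checks by containment.
    fn to_predicate(&self, expr: &FilterExpr) -> impl Fn(EntityId) -> bool {
        let matches = self.to_bitmap(expr);
        move |id| matches.contains(&id)
    }
}
```

In this sketch the predicate is just a containment check against the resolved set, which mirrors the "bitmap containment check" named in the latency criterion.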
Dependencies
- Requires: m1p1 (types: `EntityId`, `EntityKind`, `Timestamp`), m1p3 (storage: `StorageEngine` trait, key encoding with `Tag::Idx` for index persistence), m1p5 (entity write API -- bitmap indexes are updated when entities are written)
- Blocks: m2p1 Task 04 (adaptive query planner's `SelectivityEstimator` uses m2p2's bitmap cardinalities), m2p5 (RETRIEVE executor applies filters to candidate sets)
Research References
- docs/research/ann_for_tidaldb.md -- Selectivity estimation via bitmap cardinality, pre-filter brute-force strategy for selective filters (<1%), danger zone (1-20%) requiring widened ef_search, bitmap intersection as the standard pre-filtering primitive across Qdrant/Weaviate/Pinecone
- docs/specs/07-vector-retrieval.md -- Section 3 (filtered ANN: three strategies with selectivity thresholds, selectivity estimation from bitmap cardinalities), Section 9 (adaptive query planner: decision tree using selectivity estimates)
- docs/specs/08-query-engine.md -- Section 7 (filter evaluation: bitmap-based architecture, filter push-down, filter types, short-circuit evaluation, user-state filter implementation)
- docs/specs/09-ranking-scoring.md -- Section 3.2 (scan strategy: metadata-indexed scan resolves filters to roaring bitmaps before signal reads), Section 4 Stage 3 (filter evaluation using pre-computed roaring bitmaps for keyword fields and range scans for numeric fields)
Spec References
- docs/specs/08-query-engine.md -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"] intersect format_bitmap["video"]`), Section 7.2 (filter push-down into ANN predicate callback or pre-filter set), Section 7.3 (`Filter` enum: `Eq`, `Any`, `Range`, `Min`, `Max`, `Preset`, `CreatedWithin`, `CreatedAfter`, `CreatedBefore`), Section 7.4 (short-circuit evaluation: sort by ascending cardinality, abort on empty intersection)
- docs/specs/07-vector-retrieval.md -- Section 3 (selectivity estimation: keyword equality uses `cardinality(bitmap[field][value]) / total_entities`; compound AND uses independence assumption `product(individual)`; compound OR uses `1 - product(1 - s_i)`)
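The three selectivity formulas cited from Spec 07 Section 3 can be written out as small functions. This is a sketch of the arithmetic only: the function names are hypothetical, returning 0.0 when there are no entities is a convention chosen here rather than something the spec states, and the clamp is added to honor the [0.0, 1.0] acceptance criterion.

```rust
/// Keyword equality: cardinality(bitmap[field][value]) / total_entities.
/// Returns 0.0 for an empty database (convention assumed by this sketch).
fn eq_selectivity(cardinality: u64, total_entities: u64) -> f64 {
    if total_entities == 0 {
        return 0.0;
    }
    (cardinality as f64 / total_entities as f64).clamp(0.0, 1.0)
}

/// Compound AND under the independence assumption: product of the parts.
/// An empty slice yields 1.0 (no constraint, everything matches).
fn and_selectivity(parts: &[f64]) -> f64 {
    parts.iter().product::<f64>().clamp(0.0, 1.0)
}

/// Compound OR: 1 - product(1 - s_i).
/// An empty slice yields 0.0 (no alternative, nothing matches).
fn or_selectivity(parts: &[f64]) -> f64 {
    let none_match: f64 = parts.iter().map(|s| 1.0 - s).product();
    (1.0 - none_match).clamp(0.0, 1.0)
}
```

As a worked case with made-up numbers: with 10,000 entities, 500 tagged `category:jazz` and 2,000 tagged `format:video`, the individual selectivities are 0.05 and 0.2. The AND estimate is 0.05 × 0.2 = 0.01, and the OR estimate is 1 − (0.95 × 0.8) = 0.24.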
Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|---|---|---|---|
| 01 | Roaring Bitmap Indexes | `BitmapIndex` struct, insert/delete/get/cardinality, persistence via `Tag::Idx`, multi-value field support (tags) | None | M |
| 02 | B-tree Range Indexes | `RangeIndex<V>` struct, insert/delete/range query returning `RoaringBitmap`, selectivity estimation for ranges | None | S |
| 03 | Composable Filter Engine | `FilterExpr` AST, `FilterEvaluator`, `FilterResult` (bitmap or predicate), selectivity estimation, Criterion benchmarks | Task 01, Task 02 | M |
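For Task 02, a range index over an ordered map might be sketched as follows. This is an assumption-laden illustration, not the Task 02 design: the field and method names are hypothetical, `BTreeSet` stands in for `RoaringBitmap`, selectivity is computed as an exact fraction of indexed entries (assuming one value per entity), and the reading of `duration_min:5m` as "duration of at least 300 seconds" is an interpretation made here.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::ops::RangeBounds;

type EntityId = u64; // assumption: the real EntityId comes from m1p1
type IdSet = BTreeSet<EntityId>; // stand-in for RoaringBitmap

/// Hypothetical range index: ordered value -> IDs holding that value.
struct RangeIndex<V: Ord> {
    by_value: BTreeMap<V, IdSet>,
    len: usize, // number of (id, value) entries currently indexed
}

impl<V: Ord> RangeIndex<V> {
    fn new() -> Self {
        Self { by_value: BTreeMap::new(), len: 0 }
    }

    fn insert(&mut self, id: EntityId, value: V) {
        if self.by_value.entry(value).or_default().insert(id) {
            self.len += 1;
        }
    }

    fn delete(&mut self, id: EntityId, value: V) {
        let mut now_empty = false;
        if let Some(ids) = self.by_value.get_mut(&value) {
            if ids.remove(&id) {
                self.len -= 1;
            }
            now_empty = ids.is_empty();
        }
        if now_empty {
            self.by_value.remove(&value); // drop empty buckets
        }
    }

    /// Union of all IDs whose value falls inside `bounds`.
    /// Note: BTreeMap::range panics if the start bound exceeds the end bound.
    fn range<R: RangeBounds<V>>(&self, bounds: R) -> IdSet {
        self.by_value
            .range(bounds)
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }

    /// Fraction of indexed entries inside `bounds`, always in [0.0, 1.0].
    fn selectivity<R: RangeBounds<V>>(&self, bounds: R) -> f64 {
        if self.len == 0 {
            return 0.0;
        }
        let hits: usize = self.by_value.range(bounds).map(|(_, ids)| ids.len()).sum();
        hits as f64 / self.len as f64
    }
}
```

With durations stored in seconds, a call such as `idx.range(300..)` would return the IDs of every entity at or above five minutes, and `idx.selectivity(300..)` the matching fraction.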
Task Dependency DAG
Task 01: Roaring Bitmap Indexes     Task 02: B-tree Range Indexes
            |                                   |
            +-----------------------------------+
                              |
                              v
              Task 03: Composable Filter Engine
Tasks 01 and 02 are fully parallelizable -- they share no types or state beyond EntityId. Task 03 composes them into the filter evaluation pipeline.
File Layout
tidal/src/
  storage/
    indexes/
      mod.rs     -- pub use re-exports, IndexError type
      bitmap.rs  -- BitmapIndex (Task 01)
      range.rs   -- RangeIndex<V> (Task 02)
      filter.rs  -- FilterExpr, FilterEvaluator, FilterResult (Task 03)
    mod.rs       -- add `pub mod indexes;`
tidal/benches/
  filters.rs     -- Criterion benchmarks (Task 03)
tidal/Cargo.toml -- add `roaring` dependency
Open Questions
- `roaring` vs `croaring`: The `roaring` crate (pure Rust, simple API) vs `croaring` (C bindings, faster bulk operations). For M2 with 10K items, `roaring` is sufficient and keeps the `#![forbid(unsafe_code)]` crate-level lint intact. Use `roaring`. If M7 benchmarks at 1M+ items show `roaring` is a bottleneck, switch to `croaring`.
- Index update on entity write: When `db.write_item(id, metadata)` is called, the bitmap and range indexes must be updated atomically with the storage engine write. Define the update order: storage engine first (source of truth), then update in-memory indexes. If the process crashes between storage write and in-memory update, the indexes are rebuilt from the storage engine on restart. The m2p2 phase defines the index data structures and their in-memory operations; the wiring into the entity write path is done in m2p5 (RETRIEVE executor) or a dedicated integration task.
- Multi-value fields (tags): Tags are a multi-value field -- one entity can have multiple tags. The bitmap index must support this: `insert(entity_id, "jazz")` and `insert(entity_id, "piano")` for the same entity. The entity appears in the bitmap for EACH tag value. When deleting an entity, it must be removed from ALL tag bitmaps. The `BitmapIndex` API uses `insert(entity_id, field_value)` and `delete(entity_id, field_value)`, so multi-value is handled by calling insert once per value.
- Index warming on startup: At startup, load all bitmap indexes from the storage engine before accepting queries. Warming 10K items with 10 category values takes ~10ms (acceptable). At 1M items this may take ~1s. At 10M items this becomes a concern -- defer index pre-warming optimization to M7.
- Persistence granularity: Each `(field_name, field_value)` pair's bitmap is stored as a single key in the storage engine: `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")`. For M2 with 10K items and ~50 distinct metadata values, this means ~50 keys. At 10M items with 10K distinct values, this means ~10K keys -- still manageable. Serialization uses `RoaringBitmap::serialize_into()` / `RoaringBitmap::deserialize_from()`.
- Separate index keyspace vs entity keyspace: Index bitmaps are global (not per-entity) -- they map field values to sets of entity IDs. The subject-prefix key encoding (`[entity_id][NUL][tag][suffix]`) is entity-centric. Index keys need a different encoding since they are field-value-centric, not entity-centric. Solution: use a reserved entity ID (e.g., `EntityId(0)` or `EntityId(u64::MAX)`) as the "index root" with `Tag::Idx`, or use a dedicated prefix outside the entity keyspace. Decision: use `EntityId(0)` as the index root -- it is never a valid entity ID in practice, and keeps the key encoding uniform.
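The multi-value and persistence-granularity points above can be sketched together. This is an illustration under assumptions: `BTreeSet` stands in for `RoaringBitmap`, the struct layout and the `cardinality` and `key_suffix` helpers are hypothetical, and `key_suffix` only builds the `BMP:{field_name}:{field_value}` byte string; the real `encode_key` call and bitmap serialization are left out.

```rust
use std::collections::{BTreeSet, HashMap};

type EntityId = u64; // assumption: the real EntityId comes from m1p1
type IdSet = BTreeSet<EntityId>; // stand-in for RoaringBitmap

/// Hypothetical per-field index: one ID set per distinct field value.
struct BitmapIndex {
    field_name: String,
    by_value: HashMap<String, IdSet>,
}

impl BitmapIndex {
    fn new(field_name: &str) -> Self {
        Self { field_name: field_name.to_string(), by_value: HashMap::new() }
    }

    /// Multi-value fields call this once per value.
    fn insert(&mut self, id: EntityId, field_value: &str) {
        self.by_value.entry(field_value.to_string()).or_default().insert(id);
    }

    /// Removing an entity from a multi-value field means calling this
    /// once for every value the entity carried.
    fn delete(&mut self, id: EntityId, field_value: &str) {
        let mut now_empty = false;
        if let Some(ids) = self.by_value.get_mut(field_value) {
            ids.remove(&id);
            now_empty = ids.is_empty();
        }
        if now_empty {
            self.by_value.remove(field_value); // drop empty sets
        }
    }

    fn cardinality(&self, field_value: &str) -> usize {
        self.by_value.get(field_value).map_or(0, |ids| ids.len())
    }

    /// Suffix bytes for the persisted key of one (field, value) set.
    fn key_suffix(&self, field_value: &str) -> Vec<u8> {
        format!("BMP:{}:{}", self.field_name, field_value).into_bytes()
    }
}
```

A caller indexing an entity tagged `jazz` and `piano` would loop over the tags and call `insert` for each. Deletion needs the same tag list, so the entity's stored metadata has to be read before its index entries are removed.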