tidaldb/docs/planning/milestone-2/phase-2/OVERVIEW.md
jordan 6fdaa1584b feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint
- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples
  (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test
  coverage for every public API surface
- m1p5: TidalDb public API — write_item, signal, read_decay_score,
  read_windowed_count, read_velocity; StorageBox enum routing memory vs
  fjall; WalSender/WalHandleWriter bridge; WAL replay on open
- Periodic checkpoint: 30s background thread for persistent+schema mode;
  FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful
  shutdown via Arc<AtomicBool> + join before final checkpoint
- ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing)
- Milestone 2 planning scaffolding added under docs/planning/milestone-2/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 22:45:10 -07:00


Milestone 2, Phase 2: Metadata Indexes and Filter Engine

Phase Deliverable

Roaring bitmap indexes for categorical metadata fields (category, format, creator_id, tags) and B-tree range indexes for numeric/timestamp fields (created_at, duration). A composable filter engine that evaluates arbitrary filter combinations and produces either a RoaringBitmap (for pre-filtering ANN) or a Fn(EntityId) -> bool predicate closure (for in-graph filtering). Filter selectivity estimates for the adaptive query planner from m2p1.

This is the indexing layer that makes FILTER category:jazz, format:video, duration_min:5m, created_within:7d execute in microseconds instead of milliseconds. Without it, every metadata filter requires a full entity scan. With it, the query planner can estimate selectivity before choosing an ANN strategy (Spec 07 Section 9), and the RETRIEVE executor can intersect pre-computed bitmaps instead of loading entity metadata per candidate (Spec 08 Section 7).
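The bitmap intersection described above can be sketched as follows. This is a minimal illustration, not the planned implementation: `Bitmap` is a `BTreeSet<u32>` stand-in for `RoaringBitmap` so the example needs only the standard library, and `intersect_all` is a hypothetical helper name. It applies the ascending-cardinality ordering and empty-intersection abort from Spec 08 Section 7.4.

```rust
use std::collections::BTreeSet;

/// Stand-in for a roaring bitmap of entity IDs (assumption: std-only sketch).
type Bitmap = BTreeSet<u32>;

/// Intersects per-filter bitmaps, smallest first, and stops as soon as the
/// running result is empty. An empty input yields an empty set in this sketch.
fn intersect_all(mut sets: Vec<&Bitmap>) -> Bitmap {
    // Ascending cardinality: the most selective filter bounds the work.
    sets.sort_by_key(|s| s.len());

    let mut iter = sets.into_iter();
    let mut acc: Bitmap = match iter.next() {
        Some(first) => first.clone(),
        None => return Bitmap::new(),
    };

    for s in iter {
        if acc.is_empty() {
            break; // short-circuit: nothing can match any more
        }
        acc = acc.intersection(s).copied().collect();
    }
    acc
}
```

With bitmaps for category:jazz, format:video, and a duration range, the result contains only the entity IDs present in all three, and no entity metadata is loaded per candidate.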

Acceptance Criteria

  • One roaring bitmap per distinct value of each categorical metadata field: category, format, creator_id, tags (multi-value)
  • B-tree index for range attributes: created_at (nanosecond timestamps), duration (seconds)
  • Filter expressions are composable: AND across dimensions, OR within a dimension, NOT for negation
  • filter.selectivity() estimates the fraction of items matching (for query planner)
  • filter.to_bitmap() returns a RoaringBitmap for pre-filtering
  • filter.to_predicate() returns a Fn(EntityId) -> bool for in-graph filtering
  • Filters tested: category:jazz, format:video, duration_min:5m, created_within:7d, and arbitrary AND/OR/NOT combinations
  • Filter evaluation takes < 1 microsecond per candidate (measured by a benchmark of the bitmap containment check)
  • Index insert and delete operations are correct (property tested)
  • Selectivity estimates are in [0.0, 1.0] for all inputs (property tested)
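A minimal sketch of how the composability, to_bitmap(), and to_predicate() criteria above could fit together. Everything here is illustrative: the `FilterExpr` variants, the `Indexes` struct, and `field_eq` are hypothetical names, `EntityId` is assumed to be a `u32` alias, and `BTreeSet<u32>` stands in for `RoaringBitmap`. The real types come from m1p1 and Task 03.

```rust
use std::collections::{BTreeSet, HashMap};

type EntityId = u32; // assumption: the real EntityId is defined in m1p1
type Bitmap = BTreeSet<EntityId>; // stand-in for RoaringBitmap

/// Hypothetical filter AST: AND across dimensions, OR within one, NOT to negate.
enum FilterExpr {
    Eq { field: String, value: String },
    And(Vec<FilterExpr>),
    Or(Vec<FilterExpr>),
    Not(Box<FilterExpr>),
}

/// Hypothetical index view: one bitmap per (field, value), plus all known IDs.
struct Indexes {
    postings: HashMap<(String, String), Bitmap>,
    universe: Bitmap,
}

impl FilterExpr {
    fn field_eq(field: &str, value: &str) -> FilterExpr {
        FilterExpr::Eq { field: field.to_string(), value: value.to_string() }
    }

    /// Resolves the expression to the set of matching entity IDs.
    fn to_bitmap(&self, ix: &Indexes) -> Bitmap {
        match self {
            FilterExpr::Eq { field, value } => ix
                .postings
                .get(&(field.clone(), value.clone()))
                .cloned()
                .unwrap_or_default(),
            FilterExpr::And(parts) => {
                let mut acc = ix.universe.clone();
                for p in parts {
                    acc = acc.intersection(&p.to_bitmap(ix)).copied().collect();
                }
                acc
            }
            FilterExpr::Or(parts) => {
                let mut acc = Bitmap::new();
                for p in parts {
                    acc.extend(p.to_bitmap(ix));
                }
                acc
            }
            // NOT needs the universe: everything known minus the inner matches.
            FilterExpr::Not(inner) => {
                ix.universe.difference(&inner.to_bitmap(ix)).copied().collect()
            }
        }
    }

    /// Wraps the resolved bitmap in a closure for in-graph filtering.
    fn to_predicate(&self, ix: &Indexes) -> impl Fn(EntityId) -> bool {
        let bitmap = self.to_bitmap(ix);
        move |id| bitmap.contains(&id)
    }
}
```

In this sketch the predicate is a containment check against a bitmap resolved once up front, which is the operation the per-candidate latency criterion measures.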

Dependencies

  • Requires: m1p1 (types: EntityId, EntityKind, Timestamp), m1p3 (storage: StorageEngine trait, key encoding with Tag::Idx for index persistence), m1p5 (entity write API -- bitmap indexes are updated when entities are written)
  • Blocks: m2p1 Task 04 (adaptive query planner's SelectivityEstimator uses m2p2's bitmap cardinalities), m2p5 (RETRIEVE executor applies filters to candidate sets)

Research References

  • docs/research/ann_for_tidaldb.md -- Selectivity estimation via bitmap cardinality, pre-filter brute-force strategy for selective filters (<1%), danger zone (1-20%) requiring widened ef_search, bitmap intersection as the standard pre-filtering primitive across Qdrant/Weaviate/Pinecone
  • docs/specs/07-vector-retrieval.md -- Section 3 (filtered ANN: three strategies with selectivity thresholds, selectivity estimation from bitmap cardinalities), Section 9 (adaptive query planner: decision tree using selectivity estimates)
  • docs/specs/08-query-engine.md -- Section 7 (filter evaluation: bitmap-based architecture, filter push-down, filter types, short-circuit evaluation, user-state filter implementation)
  • docs/specs/09-ranking-scoring.md -- Section 3.2 (scan strategy: metadata-indexed scan resolves filters to roaring bitmaps before signal reads), Section 4 Stage 3 (filter evaluation using pre-computed roaring bitmaps for keyword fields and range scans for numeric fields)

Spec References

  • docs/specs/08-query-engine.md -- Section 7.1 (bitmap-based architecture: category_bitmap["jazz"] intersect format_bitmap["video"]), Section 7.2 (filter push-down into ANN predicate callback or pre-filter set), Section 7.3 (Filter enum: Eq, Any, Range, Min, Max, Preset, CreatedWithin, CreatedAfter, CreatedBefore), Section 7.4 (short-circuit evaluation: sort by ascending cardinality, abort on empty intersection)
  • docs/specs/07-vector-retrieval.md -- Section 3 (selectivity estimation: keyword equality uses cardinality(bitmap[field][value]) / total_entities; compound AND uses independence assumption product(individual); compound OR uses 1 - product(1 - s_i))
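The three selectivity formulas cited from Spec 07 Section 3 can be written out directly. The function names below are hypothetical, and clamping each input to [0.0, 1.0] is an added assumption so the results stay in range. As a worked example, a 5% filter and a 20% filter give 0.05 × 0.20 = 0.01 under AND and 1 − (0.95 × 0.80) = 0.24 under OR.

```rust
/// Keyword equality: cardinality(bitmap[field][value]) / total_entities.
/// Returns 0.0 for an empty database instead of dividing by zero.
fn eq_selectivity(cardinality: u64, total_entities: u64) -> f64 {
    if total_entities == 0 {
        return 0.0;
    }
    (cardinality as f64 / total_entities as f64).clamp(0.0, 1.0)
}

/// Compound AND under the independence assumption: product of the parts.
/// An empty AND matches everything, so it yields 1.0.
fn and_selectivity(parts: &[f64]) -> f64 {
    parts.iter().map(|s| s.clamp(0.0, 1.0)).product::<f64>()
}

/// Compound OR under the independence assumption: 1 - product(1 - s_i).
/// An empty OR matches nothing, so it yields 0.0.
fn or_selectivity(parts: &[f64]) -> f64 {
    1.0 - parts.iter().map(|s| 1.0 - s.clamp(0.0, 1.0)).product::<f64>()
}
```

The independence assumption can misestimate correlated fields (for example, a creator who publishes in a single category), so these values are planner hints rather than exact counts.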

Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Roaring Bitmap Indexes | BitmapIndex struct, insert/delete/get/cardinality, persistence via Tag::Idx, multi-value field support (tags) | None | M |
| 02 | B-tree Range Indexes | RangeIndex<V> struct, insert/delete/range query returning RoaringBitmap, selectivity estimation for ranges | None | S |
| 03 | Composable Filter Engine | FilterExpr AST, FilterEvaluator, FilterResult (bitmap or predicate), selectivity estimation, Criterion benchmarks | Task 01, Task 02 | M |
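As a rough illustration of the Task 01 surface (insert/delete/cardinality with multi-value support), here is a minimal in-memory sketch. It assumes one `BitmapIndex` instance per metadata field, a `u32` alias for `EntityId`, and `BTreeSet` in place of `RoaringBitmap`; `contains` is a hypothetical helper, and persistence via Tag::Idx is omitted.

```rust
use std::collections::{BTreeSet, HashMap};

type EntityId = u32; // assumption: the real EntityId is defined in m1p1

/// Hypothetical per-field index: field value -> set of entity IDs.
#[derive(Default)]
struct BitmapIndex {
    postings: HashMap<String, BTreeSet<EntityId>>,
}

impl BitmapIndex {
    /// Adds the entity to the bitmap for `value`; returns true if newly added.
    /// Multi-value fields call this once per value.
    fn insert(&mut self, id: EntityId, value: &str) -> bool {
        self.postings.entry(value.to_string()).or_default().insert(id)
    }

    /// Removes the entity from the bitmap for `value`; returns true if present.
    fn delete(&mut self, id: EntityId, value: &str) -> bool {
        let removed = match self.postings.get_mut(value) {
            Some(set) => set.remove(&id),
            None => false,
        };
        // Drop empty bitmaps so stale values do not accumulate.
        if self.postings.get(value).map_or(false, |s| s.is_empty()) {
            self.postings.remove(value);
        }
        removed
    }

    /// Number of entities carrying `value` (0 for unknown values).
    fn cardinality(&self, value: &str) -> u64 {
        self.postings.get(value).map_or(0, |s| s.len() as u64)
    }

    /// Containment check used by predicate-style filtering.
    fn contains(&self, id: EntityId, value: &str) -> bool {
        self.postings.get(value).map_or(false, |s| s.contains(&id))
    }
}
```

Deleting an entity from a multi-value field means calling `delete` once for each value it carried, so the caller needs the entity's previous values at delete time.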

Task Dependency DAG

Task 01: Roaring Bitmap Indexes      Task 02: B-tree Range Indexes
    |                                   |
    +-----------------------------------+
                    |
                    v
    Task 03: Composable Filter Engine

Tasks 01 and 02 are fully parallelizable -- they share no types or state beyond EntityId. Task 03 composes them into the filter evaluation pipeline.

File Layout

tidal/src/
  storage/
    indexes/
      mod.rs        -- pub use re-exports, IndexError type
      bitmap.rs     -- BitmapIndex (Task 01)
      range.rs      -- RangeIndex<V> (Task 02)
      filter.rs     -- FilterExpr, FilterEvaluator, FilterResult (Task 03)
    mod.rs          -- add `pub mod indexes;`
tidal/benches/
  filters.rs        -- Criterion benchmarks (Task 03)
tidal/Cargo.toml    -- add `roaring` dependency
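For orientation, the `RangeIndex<V>` planned for range.rs could look roughly like the sketch below. It is an assumption-laden illustration: it is built on `std::collections::BTreeMap`, returns a `BTreeSet<u32>` where the real index would return a `RoaringBitmap`, and computes selectivity by exact count rather than by estimation.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::ops::RangeBounds;

type EntityId = u32; // assumption: the real EntityId is defined in m1p1

/// Hypothetical range index: ordered value -> entities holding that value.
struct RangeIndex<V: Ord> {
    by_value: BTreeMap<V, BTreeSet<EntityId>>,
    len: usize, // number of (entity, value) pairs stored
}

impl<V: Ord> RangeIndex<V> {
    fn new() -> Self {
        RangeIndex { by_value: BTreeMap::new(), len: 0 }
    }

    fn insert(&mut self, id: EntityId, value: V) {
        if self.by_value.entry(value).or_default().insert(id) {
            self.len += 1;
        }
    }

    fn delete(&mut self, id: EntityId, value: V) {
        let now_empty = match self.by_value.get_mut(&value) {
            Some(ids) => {
                if ids.remove(&id) {
                    self.len -= 1;
                }
                ids.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.by_value.remove(&value);
        }
    }

    /// All entities whose value falls in `bounds`, e.g. `300..` or `..=900`.
    /// Like BTreeMap::range, this panics if the start bound exceeds the end.
    fn range<R: RangeBounds<V>>(&self, bounds: R) -> BTreeSet<EntityId> {
        self.by_value
            .range(bounds)
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }

    /// Fraction of stored pairs inside `bounds` (exact count in this sketch).
    fn selectivity<R: RangeBounds<V>>(&self, bounds: R) -> f64 {
        if self.len == 0 {
            return 0.0;
        }
        let hits: usize = self.by_value.range(bounds).map(|(_, ids)| ids.len()).sum();
        hits as f64 / self.len as f64
    }
}
```

A filter such as duration_min:5m would map to `range(300..)` over a seconds-valued index in this sketch.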

Open Questions

  1. roaring vs croaring: The roaring crate (pure Rust, simple API) vs croaring (C bindings, faster bulk operations). For M2 with 10K items, roaring is sufficient, and as a pure-Rust crate it adds no C bindings or C toolchain to the build, which fits the crate-level #![forbid(unsafe_code)] policy. Use roaring. If M7 benchmarks at 1M+ items show roaring is a bottleneck, switch to croaring.

  2. Index update on entity write: When db.write_item(id, metadata) is called, the bitmap and range indexes must be kept consistent with the storage engine write. The two updates are not a single atomic operation, so the order is fixed: storage engine first (source of truth), then the in-memory indexes. If the process crashes between the storage write and the in-memory update, the indexes are rebuilt from the storage engine on restart, so the two cannot diverge permanently. The m2p2 phase defines the index data structures and their in-memory operations; the wiring into the entity write path is done in m2p5 (RETRIEVE executor) or a dedicated integration task.

  3. Multi-value fields (tags): Tags are a multi-value field -- one entity can have multiple tags. The bitmap index must support this: insert(entity_id, "jazz") and insert(entity_id, "piano") for the same entity. The entity appears in the bitmap for EACH tag value. When deleting an entity, it must be removed from ALL tag bitmaps. The BitmapIndex API uses insert(entity_id, field_value) and delete(entity_id, field_value), so multi-value is handled by calling insert once per value.

  4. Index warming on startup: At startup, load all bitmap indexes from the storage engine before accepting queries. Warming 10K items with 10 category values is estimated at ~10ms, which is acceptable. At 1M items this may take ~1s. At 10M items it becomes a concern -- defer index pre-warming optimization to M7.

  5. Persistence granularity: Each (field_name, field_value) pair's bitmap is stored under a single key in the storage engine: encode_key(EntityId(0), Tag::Idx, suffix), where suffix is the byte string BMP:{field_name}:{field_value} with both placeholders substituted. For M2 with 10K items and ~50 distinct metadata values, this means ~50 keys. At 10M items with 10K distinct values, this means ~10K keys -- still manageable. Serialization uses RoaringBitmap::serialize_into() / RoaringBitmap::deserialize_from().

  6. Separate index keyspace vs entity keyspace: Index bitmaps are global (not per-entity) -- they map field values to sets of entity IDs. The subject-prefix key encoding ([entity_id][NUL][tag][suffix]) is entity-centric. Index keys need a different encoding since they are field-value-centric, not entity-centric. Solution: use a reserved entity ID (e.g., EntityId(0) or EntityId(u64::MAX)) as the "index root" with Tag::Idx, or use a dedicated prefix outside the entity keyspace. Decision: use EntityId(0) as the index root. This keeps the key encoding uniform and relies on EntityId(0) never being assigned to a real entity.
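The key layout from questions 5 and 6 can be sketched as below. The real `encode_key` and `Tag::Idx` are defined in m1p3 and may differ. This sketch assumes an 8-byte big-endian entity ID and a placeholder tag byte (`TAG_IDX = 0x49`), and `bitmap_key` and `bitmap_prefix` are hypothetical helpers.

```rust
/// Placeholder byte for Tag::Idx (assumption; the real value comes from m1p3).
const TAG_IDX: u8 = 0x49;
/// Reserved "index root" entity ID, per the decision in question 6.
const INDEX_ROOT: u64 = 0;

/// Subject-prefix layout: [entity_id][NUL][tag][suffix].
/// Assumes the ID is encoded as 8 big-endian bytes.
fn encode_key(entity_id: u64, tag: u8, suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 2 + suffix.len());
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key.push(tag);
    key.extend_from_slice(suffix);
    key
}

/// Key for the bitmap of one (field_name, field_value) pair.
fn bitmap_key(field_name: &str, field_value: &str) -> Vec<u8> {
    let suffix = format!("BMP:{}:{}", field_name, field_value);
    encode_key(INDEX_ROOT, TAG_IDX, suffix.as_bytes())
}

/// Common prefix of every bitmap key for one field, usable for a prefix scan
/// when warming that field's bitmaps at startup.
fn bitmap_prefix(field_name: &str) -> Vec<u8> {
    let suffix = format!("BMP:{}:", field_name);
    encode_key(INDEX_ROOT, TAG_IDX, suffix.as_bytes())
}
```

Because all index keys share the EntityId(0) prefix, they sort together in the keyspace. The `:` delimiter stays unambiguous only if field names never contain `:`, which this sketch assumes.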