# Task 01: Roaring Bitmap Indexes ## Context **Milestone:** 2 -- Ranked Retrieval **Phase:** m2p2 -- Metadata Indexes and Filter Engine **Depends On:** None (uses types from m1p1 but no m2p2 tasks) **Blocks:** Task 03 (Composable Filter Engine) **Complexity:** M ## Objective Deliver `BitmapIndex`, the in-memory index structure that maps categorical metadata field values to `RoaringBitmap` sets of matching entity IDs. This is the data structure that makes `FILTER category:jazz` resolve in microseconds: look up `category_bitmap["jazz"]`, get a `RoaringBitmap` of all jazz item IDs, intersect with other filter bitmaps. No entity scan. No metadata loads. The bitmap index supports exact-match lookups (`category:jazz`), multi-value fields (`tags:classical` where one entity has multiple tags), and provides cardinality counts for the adaptive query planner's selectivity estimation (Spec 07 Section 3, Section 9). Every production system that solves filtered ANN -- Qdrant, Weaviate, Pinecone -- uses roaring bitmaps as the pre-filter primitive. This is table stakes. ## Requirements - `BitmapIndex` struct: maps `(field_name, field_value)` pairs to `RoaringBitmap` of matching entity IDs - `insert(entity_id, field_value)` -- adds entity to the bitmap for that value - `delete(entity_id, field_value)` -- removes entity from the bitmap for that value - `get(field_value) -> Option<&RoaringBitmap>` -- returns bitmap for exact-match filter - `cardinality(field_value) -> u64` -- returns number of entities matching that value - `total_count() -> u64` -- returns total number of distinct entity IDs indexed across all values - `values() -> impl Iterator` -- enumerate all indexed field values - `field_name()` -- returns the field this index covers - Multi-value field support: one entity can appear in multiple value bitmaps (e.g., tags) - `delete_entity(entity_id)` -- removes entity from ALL value bitmaps (for entity deletion) - Persistence: serialize/deserialize each value bitmap to/from the storage engine via `Tag::Idx` - On startup: load from storage engine, rebuild in-memory state - Pure Rust: use the `roaring` crate (crates.io), not `croaring` (C bindings) - `Send + Sync` (for concurrent read access during queries) ## Technical Design ### Module Structure ``` tidal/src/storage/ indexes/ mod.rs -- IndexError, pub use re-exports bitmap.rs -- BitmapIndex (this task) ``` ### Public API ```rust // === storage/indexes/bitmap.rs === use roaring::RoaringBitmap; use std::collections::HashMap; use std::sync::RwLock; /// Error type for index operations. #[derive(Debug, thiserror::Error)] pub enum IndexError { #[error("storage error: {0}")] Storage(String), #[error("serialization error: {0}")] Serialization(String), } /// A roaring bitmap index for a single categorical metadata field. /// /// Maps field values (strings) to `RoaringBitmap` sets of entity IDs /// that have that value. Supports exact-match lookups, multi-value /// fields (tags), and cardinality queries for selectivity estimation. /// /// # Concurrency /// /// Reads and writes are protected by an `RwLock`. Reads (get, cardinality, /// total_count) take the read lock. Writes (insert, delete, delete_entity) /// take the write lock. This is acceptable because: /// - Writes happen on the entity write path (not the hot query path). /// - Reads happen during filter evaluation (concurrent with other reads). /// - At M2 scale (10K entities), contention is negligible. /// /// At M7 scale (1M+ entities), if write contention becomes measurable, /// switch to a sharded index (one BitmapIndex per shard of the field /// value space). /// /// # Persistence /// /// Each `(field_name, field_value)` bitmap is stored as a key in the /// storage engine using `Tag::Idx`: /// /// ```text /// Key: [INDEX_ROOT_ID: 8 bytes BE][0x00][Tag::Idx][b"BMP:"][field_name][b":"][field_value] /// Value: RoaringBitmap serialized bytes /// ``` /// /// `INDEX_ROOT_ID` is `EntityId(0)` -- a reserved entity ID used as /// the root for all index keys. pub struct BitmapIndex { field_name: String, values: RwLock>, } /// The reserved entity ID used as the root for index storage keys. /// /// EntityId 0 is reserved for system-level keys across all subsystems. /// The signal system also uses EntityId(0) with `Tag::Sig` for checkpoint /// metadata (see `signals/checkpoint.rs`). The `Tag::Idx` discriminant /// ensures no key collision — the full key format `[entity_id][NUL][tag][suffix]` /// keeps each subsystem's keys in separate namespaces. pub const INDEX_ROOT_ID: u64 = 0; impl BitmapIndex { /// Create a new, empty bitmap index for the given field. pub fn new(field_name: impl Into) -> Self; /// The field name this index covers. pub fn field_name(&self) -> &str; /// Add an entity to the bitmap for the given field value. /// /// For multi-value fields (tags), call once per value for the /// same entity. /// /// Returns `true` if the entity was newly added (was not already /// in this value's bitmap). pub fn insert(&self, entity_id: u32, value: impl Into) -> bool; /// Remove an entity from the bitmap for the given field value. /// /// Returns `true` if the entity was present and removed. pub fn delete(&self, entity_id: u32, value: &str) -> bool; /// Remove an entity from ALL value bitmaps. /// /// Used when an entity is deleted entirely. Scans all values /// and removes the entity ID from each bitmap. /// /// Returns the number of bitmaps the entity was removed from. pub fn delete_entity(&self, entity_id: u32) -> usize; /// Look up the bitmap for an exact field value. /// /// Returns `None` if no entities have this value. /// /// The returned bitmap is a clone (cheap for roaring bitmaps /// due to copy-on-write internals). Callers can intersect, /// union, or iterate over it freely. pub fn get(&self, value: &str) -> Option; /// Look up bitmaps for multiple values and return their union. /// /// Implements OR-within-dimension: `category IN [jazz, blues]` /// returns the union of the jazz and blues bitmaps. pub fn get_union(&self, values: &[&str]) -> RoaringBitmap; /// Number of entities matching a specific field value. /// /// Used by the adaptive query planner for selectivity estimation. pub fn cardinality(&self, value: &str) -> u64; /// Total number of distinct entity IDs indexed across all values. /// /// Used as the denominator for selectivity: /// `selectivity = cardinality(value) / total_count()`. /// /// Note: this is the cardinality of the union of all value /// bitmaps, NOT the sum of individual cardinalities (which /// would double-count entities in multi-value fields). pub fn total_count(&self) -> u64; /// Enumerate all indexed field values. pub fn values(&self) -> Vec; /// Number of distinct field values in this index. pub fn distinct_values(&self) -> usize; /// Check if the index is empty (no entities indexed). pub fn is_empty(&self) -> bool; /// Serialize all bitmaps to storage engine key-value pairs. /// /// Returns a `Vec<(Vec, Vec)>` of `(key, value)` pairs /// suitable for `StorageEngine::write_batch()`. /// /// Key format: `encode_key(EntityId(INDEX_ROOT_ID), Tag::Idx, suffix)` /// where suffix = `b"BMP:" + field_name + b":" + field_value`. pub fn serialize_to_kv_pairs(&self) -> Result, Vec)>, IndexError>; /// Deserialize bitmaps from storage engine key-value pairs. /// /// Scans all keys with the `BMP:{field_name}:` prefix and /// deserializes each value as a `RoaringBitmap`. pub fn load_from_kv_pairs( field_name: impl Into, pairs: impl Iterator, Vec)>, ) -> Result; } ``` ### Internal Design **`RoaringBitmap` entity ID representation:** `RoaringBitmap` operates on `u32` values. `EntityId` is `u64`. For M2 (10K entities), all entity IDs fit in `u32`. For production at 10M+ entities, we need a strategy. Options: 1. **Truncate to u32**: Works if entity IDs are assigned sequentially from 0. This is the simplest approach and matches how most databases assign internal row IDs. 2. **Split high/low bits**: Use the high 32 bits as a partition key and the low 32 bits in the bitmap. Requires multiple bitmaps per value. 3. **Use `RoaringTreemap`**: The `roaring` crate provides `RoaringTreemap` which operates on `u64`. Slightly slower for small sets but handles the full ID space. Decision for M2: Use `u32` with an explicit conversion from `EntityId`. The `insert` and `delete` APIs accept `u32` directly. The caller (entity write path, m2p5) is responsible for the `EntityId::as_u64() as u32` conversion. A debug assertion verifies no truncation: `debug_assert!(entity_id.as_u64() <= u64::from(u32::MAX))`. At M7+, if entity IDs exceed u32, switch to `RoaringTreemap`. **HashMap vs BTreeMap for value storage:** `HashMap` is used because: - Field value lookups are exact-match (hash is O(1)). - No ordering needed for bitmap index values. - `BTreeMap` would be used if we needed prefix or range queries on field values, which we do not. **Persistence key encoding:** ``` Key: encode_key(EntityId(0), Tag::Idx, b"BMP:category:jazz") Value: RoaringBitmap::serialize_into() bytes ``` The `INDEX_ROOT_ID = EntityId(0)` anchors all index keys to the same entity prefix, making them scannable via `scan_prefix(entity_tag_prefix(EntityId(0), Tag::Idx))`. The suffix `BMP:{field_name}:{field_value}` distinguishes bitmap keys from future index types (e.g., `RNG:` for range indexes in Task 02). **Bitmap serialization:** The `roaring` crate provides `serialize_into(&mut writer)` and `deserialize_from(reader)` using the standard Roaring bitmap serialization format (compatible with CRoaring, Roaring Java, etc.). At 10K items with ~50 distinct category values, each bitmap is ~1-4 KB serialized. Total index persistence: ~100 KB. ### Error Handling - `insert()` and `delete()` are infallible -- they operate on in-memory data only. - `serialize_to_kv_pairs()` can fail if bitmap serialization fails (theoretically impossible for valid bitmaps, but the API is `Result` for correctness). - `load_from_kv_pairs()` can fail if the stored bytes are corrupted. Returns `IndexError::Serialization` with context. ## Test Strategy ### Property Tests ```rust use proptest::prelude::*; use roaring::RoaringBitmap; // Insert-query roundtrip: every inserted entity appears in get(). proptest! { #[test] fn insert_query_roundtrip( entries in prop::collection::vec( (0u32..100_000, "[a-z]{1,5}"), 1..500, ), ) { let index = BitmapIndex::new("test_field"); for &(id, ref value) in &entries { index.insert(id, value.clone()); } for &(id, ref value) in &entries { let bitmap = index.get(value).expect("value should exist"); prop_assert!(bitmap.contains(id), "entity {id} not found in bitmap for '{value}'"); } } } // Delete removes correctly: after delete, entity is absent. proptest! { #[test] fn delete_removes_entity( ids in prop::collection::vec(0u32..10_000, 1..100), value in "[a-z]{1,5}", ) { let index = BitmapIndex::new("test_field"); for &id in &ids { index.insert(id, value.clone()); } // Delete first half let half = ids.len() / 2; for &id in &ids[..half] { index.delete(id, &value); } // Verify first half absent, second half present let bitmap = index.get(&value).unwrap_or_default(); for &id in &ids[..half] { prop_assert!(!bitmap.contains(id), "deleted entity {id} still in bitmap"); } for &id in &ids[half..] { prop_assert!(bitmap.contains(id), "surviving entity {id} missing from bitmap"); } } } // Cardinality matches bitmap length. proptest! { #[test] fn cardinality_matches_bitmap( ids in prop::collection::hash_set(0u32..100_000, 1..1000), value in "[a-z]{1,3}", ) { let index = BitmapIndex::new("test_field"); for &id in &ids { index.insert(id, value.clone()); } prop_assert_eq!( index.cardinality(&value), ids.len() as u64, ); } } // Total count equals distinct entity IDs across all values. proptest! { #[test] fn total_count_no_double_counting( entries in prop::collection::vec( (0u32..1_000, "[a-z]{1,3}"), 1..200, ), ) { let index = BitmapIndex::new("tags"); let mut all_ids = RoaringBitmap::new(); for &(id, ref value) in &entries { index.insert(id, value.clone()); all_ids.insert(id); } prop_assert_eq!(index.total_count(), all_ids.len()); } } // get_union returns the union of individual bitmaps. proptest! { #[test] fn get_union_is_correct( a_ids in prop::collection::vec(0u32..1000, 1..50), b_ids in prop::collection::vec(0u32..1000, 1..50), ) { let index = BitmapIndex::new("test_field"); for &id in &a_ids { index.insert(id, "a".to_string()); } for &id in &b_ids { index.insert(id, "b".to_string()); } let union = index.get_union(&["a", "b"]); let manual_union = { let mut bm = index.get("a").unwrap_or_default(); bm |= &index.get("b").unwrap_or_default(); bm }; prop_assert_eq!(union, manual_union); } } // Serialize-deserialize roundtrip preserves all bitmaps. proptest! { #[test] fn serialize_deserialize_roundtrip( entries in prop::collection::vec( (0u32..10_000, "[a-z]{1,3}"), 1..100, ), ) { let index = BitmapIndex::new("test_field"); for &(id, ref value) in &entries { index.insert(id, value.clone()); } let kv_pairs = index.serialize_to_kv_pairs().unwrap(); let restored = BitmapIndex::load_from_kv_pairs( "test_field", kv_pairs.into_iter(), ).unwrap(); // Verify all values and bitmaps match for value in index.values() { let orig = index.get(&value).unwrap(); let rest = restored.get(&value).unwrap(); prop_assert_eq!(orig, rest, "mismatch for value '{value}'"); } prop_assert_eq!(index.total_count(), restored.total_count()); } } ``` ### Unit Tests ```rust #[test] fn new_index_is_empty() { let index = BitmapIndex::new("category"); assert!(index.is_empty()); assert_eq!(index.total_count(), 0); assert_eq!(index.distinct_values(), 0); assert!(index.get("jazz").is_none()); assert_eq!(index.cardinality("jazz"), 0); } #[test] fn insert_single_entity() { let index = BitmapIndex::new("category"); assert!(index.insert(1, "jazz")); assert!(!index.is_empty()); assert_eq!(index.cardinality("jazz"), 1); assert_eq!(index.total_count(), 1); let bitmap = index.get("jazz").unwrap(); assert!(bitmap.contains(1)); } #[test] fn insert_same_entity_same_value_is_idempotent() { let index = BitmapIndex::new("category"); assert!(index.insert(1, "jazz")); // first insert: true (newly added) assert!(!index.insert(1, "jazz")); // second insert: false (already present) assert_eq!(index.cardinality("jazz"), 1); } #[test] fn multi_value_field_tags() { let index = BitmapIndex::new("tags"); index.insert(1, "jazz"); index.insert(1, "piano"); index.insert(1, "classical"); // Entity 1 appears in all three tag bitmaps assert!(index.get("jazz").unwrap().contains(1)); assert!(index.get("piano").unwrap().contains(1)); assert!(index.get("classical").unwrap().contains(1)); // Total count is 1, not 3 (one distinct entity) assert_eq!(index.total_count(), 1); assert_eq!(index.distinct_values(), 3); } #[test] fn delete_entity_from_all_values() { let index = BitmapIndex::new("tags"); index.insert(1, "jazz"); index.insert(1, "piano"); index.insert(2, "jazz"); let removed_count = index.delete_entity(1); assert_eq!(removed_count, 2); // removed from "jazz" and "piano" // Entity 1 gone from both bitmaps assert!(!index.get("jazz").unwrap().contains(1)); assert!(index.get("piano").is_none()); // only had entity 1 // Entity 2 still in "jazz" assert!(index.get("jazz").unwrap().contains(2)); assert_eq!(index.total_count(), 1); } #[test] fn get_union_multiple_values() { let index = BitmapIndex::new("category"); index.insert(1, "jazz"); index.insert(2, "blues"); index.insert(3, "jazz"); let union = index.get_union(&["jazz", "blues"]); assert_eq!(union.len(), 3); assert!(union.contains(1)); assert!(union.contains(2)); assert!(union.contains(3)); } #[test] fn get_union_with_missing_value() { let index = BitmapIndex::new("category"); index.insert(1, "jazz"); let union = index.get_union(&["jazz", "nonexistent"]); assert_eq!(union.len(), 1); assert!(union.contains(1)); } #[test] fn get_union_empty_values() { let index = BitmapIndex::new("category"); index.insert(1, "jazz"); let union = index.get_union(&[]); assert!(union.is_empty()); } #[test] fn values_enumerates_all() { let index = BitmapIndex::new("category"); index.insert(1, "jazz"); index.insert(2, "blues"); index.insert(3, "rock"); let mut values = index.values(); values.sort(); assert_eq!(values, vec!["blues", "jazz", "rock"]); } #[test] fn delete_nonexistent_returns_false() { let index = BitmapIndex::new("category"); assert!(!index.delete(99, "jazz")); } #[test] fn field_name_accessor() { let index = BitmapIndex::new("category"); assert_eq!(index.field_name(), "category"); } #[test] fn empty_value_bitmaps_are_cleaned_up() { let index = BitmapIndex::new("category"); index.insert(1, "jazz"); index.delete(1, "jazz"); // After removing the only entity, the value bitmap should be // removed or empty. get() returns None for empty/removed. assert!(index.get("jazz").is_none() || index.get("jazz").unwrap().is_empty()); assert_eq!(index.total_count(), 0); } ``` ## Acceptance Criteria - [ ] `BitmapIndex` stores one `RoaringBitmap` per distinct field value, mapping field values to sets of `u32` entity IDs - [ ] `insert(entity_id, field_value)` adds the entity to the bitmap for that value; returns true if newly added - [ ] `delete(entity_id, field_value)` removes the entity from the bitmap; returns true if was present - [ ] `delete_entity(entity_id)` removes the entity from ALL value bitmaps; returns count of bitmaps modified - [ ] `get(field_value)` returns the bitmap for exact-match lookup, or `None` if no entities have that value - [ ] `get_union(values)` returns the union bitmap across multiple values (OR-within-dimension) - [ ] `cardinality(field_value)` returns the count of entities matching that value - [ ] `total_count()` returns the count of distinct entity IDs across all values (no double-counting for multi-value fields) - [ ] Multi-value fields work: one entity can appear in multiple value bitmaps (tested with tags) - [ ] `serialize_to_kv_pairs()` and `load_from_kv_pairs()` roundtrip preserves all bitmaps exactly (property tested) - [ ] Key encoding uses `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")` for persistence - [ ] `BitmapIndex` is `Send + Sync` - [ ] No `unsafe` code - [ ] `cargo clippy -- -D warnings` passes - [ ] All property tests and unit tests pass ## Research References - [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "resolve filter predicates to roaring bitmaps", "Weaviate produces a RoaringBitmap allow-list", "Pinecone intersects metadata bitmaps per field with IVF cluster assignments" - [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: `cardinality(bitmap[field][value]) / total_entities`) ## Spec References - [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"]`, `format_bitmap["video"]`, bitmap intersection for compound filters), Section 7.4 (short-circuit evaluation: "if any bitmap has cardinality 0, return empty Results immediately") - [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.2 (metadata-indexed scan: "filters on keyword fields are resolved to roaring bitmaps and intersected before any signal reads") ## Implementation Notes - Add `roaring = "0.10"` to `[dependencies]` in `tidal/Cargo.toml`. The `roaring` crate is pure Rust, no `unsafe`, well-maintained (~55M crates.io downloads), and implements the standard Roaring bitmap format. - `RoaringBitmap` uses `u32`. This limits entity IDs to ~4 billion, which is sufficient through M6 (100K items). If M7+ needs u64 entity IDs, switch to `RoaringTreemap` (same crate, same API, u64 keys). Add a `debug_assert!(entity_id <= u32::MAX as u64)` at the conversion boundary. - Empty bitmaps (all entities deleted for a value) should be removed from the `HashMap` to avoid accumulating dead entries. Check `bitmap.is_empty()` after `delete()` and remove the entry. - The `RwLock` is from `std::sync::RwLock`, not `parking_lot`. At M2 scale (10K entities, low contention), `std` RwLock is sufficient. If profiling shows contention at M7, switch to `parking_lot::RwLock` which is faster under high contention. - Do NOT implement bitmap compression tuning (run optimization) in this task. The `roaring` crate handles compression automatically. - Do NOT implement the correlation cache for co-occurring filter pairs (Spec 07 Section 3: "maintain joint statistics for common filter combinations"). This is deferred to M5 or later.