tidaldb/docs/planning/milestone-2/phase-2/task-01-roaring-bitmap-indexes.md
jordan 6fdaa1584b feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint
- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples
  (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test
  coverage for every public API surface
- m1p5: TidalDb public API — write_item, signal, read_decay_score,
  read_windowed_count, read_velocity; StorageBox enum routing memory vs
  fjall; WalSender/WalHandleWriter bridge; WAL replay on open
- Periodic checkpoint: 30s background thread for persistent+schema mode;
  FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful
  shutdown via Arc<AtomicBool> + join before final checkpoint
- ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing)
- Milestone 2 planning scaffolding added under docs/planning/milestone-2/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 22:45:10 -07:00

559 lines
22 KiB
Markdown

# Task 01: Roaring Bitmap Indexes
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p2 -- Metadata Indexes and Filter Engine
**Depends On:** None (uses types from m1p1 but no m2p2 tasks)
**Blocks:** Task 03 (Composable Filter Engine)
**Complexity:** M
## Objective
Deliver `BitmapIndex`, the in-memory index structure that maps categorical metadata field values to `RoaringBitmap` sets of matching entity IDs. This is the data structure that makes `FILTER category:jazz` resolve in microseconds: look up `category_bitmap["jazz"]`, get a `RoaringBitmap` of all jazz item IDs, intersect with other filter bitmaps. No entity scan. No metadata loads.
The bitmap index supports exact-match lookups (`category:jazz`), multi-value fields (`tags:classical` where one entity has multiple tags), and provides cardinality counts for the adaptive query planner's selectivity estimation (Spec 07 Section 3, Section 9). Every production system that solves filtered ANN -- Qdrant, Weaviate, Pinecone -- uses roaring bitmaps as the pre-filter primitive. This is table stakes.
## Requirements
- `BitmapIndex` struct: maps `(field_name, field_value)` pairs to `RoaringBitmap` of matching entity IDs
- `insert(entity_id, field_value)` -- adds entity to the bitmap for that value
- `delete(entity_id, field_value)` -- removes entity from the bitmap for that value
- `get(field_value) -> Option<&RoaringBitmap>` -- returns bitmap for exact-match filter
- `cardinality(field_value) -> u64` -- returns number of entities matching that value
- `total_count() -> u64` -- returns total number of distinct entity IDs indexed across all values
- `values() -> impl Iterator<Item = &str>` -- enumerate all indexed field values
- `field_name()` -- returns the field this index covers
- Multi-value field support: one entity can appear in multiple value bitmaps (e.g., tags)
- `delete_entity(entity_id)` -- removes entity from ALL value bitmaps (for entity deletion)
- Persistence: serialize/deserialize each value bitmap to/from the storage engine via `Tag::Idx`
- On startup: load from storage engine, rebuild in-memory state
- Pure Rust: use the `roaring` crate (crates.io), not `croaring` (C bindings)
- `Send + Sync` (for concurrent read access during queries)
## Technical Design
### Module Structure
```
tidal/src/storage/
indexes/
mod.rs -- IndexError, pub use re-exports
bitmap.rs -- BitmapIndex (this task)
```
### Public API
```rust
// === storage/indexes/bitmap.rs ===
use roaring::RoaringBitmap;
use std::collections::HashMap;
use std::sync::RwLock;
/// Error type for index operations.
#[derive(Debug, thiserror::Error)]
pub enum IndexError {
#[error("storage error: {0}")]
Storage(String),
#[error("serialization error: {0}")]
Serialization(String),
}
/// A roaring bitmap index for a single categorical metadata field.
///
/// Maps field values (strings) to `RoaringBitmap` sets of entity IDs
/// that have that value. Supports exact-match lookups, multi-value
/// fields (tags), and cardinality queries for selectivity estimation.
///
/// # Concurrency
///
/// Reads and writes are protected by an `RwLock`. Reads (get, cardinality,
/// total_count) take the read lock. Writes (insert, delete, delete_entity)
/// take the write lock. This is acceptable because:
/// - Writes happen on the entity write path (not the hot query path).
/// - Reads happen during filter evaluation (concurrent with other reads).
/// - At M2 scale (10K entities), contention is negligible.
///
/// At M7 scale (1M+ entities), if write contention becomes measurable,
/// switch to a sharded index (one BitmapIndex per shard of the field
/// value space).
///
/// # Persistence
///
/// Each `(field_name, field_value)` bitmap is stored as a key in the
/// storage engine using `Tag::Idx`:
///
/// ```text
/// Key: [INDEX_ROOT_ID: 8 bytes BE][0x00][Tag::Idx][b"BMP:"][field_name][b":"][field_value]
/// Value: RoaringBitmap serialized bytes
/// ```
///
/// `INDEX_ROOT_ID` is `EntityId(0)` -- a reserved entity ID used as
/// the root for all index keys.
pub struct BitmapIndex {
field_name: String,
values: RwLock<HashMap<String, RoaringBitmap>>,
}
/// The reserved entity ID used as the root for index storage keys.
///
/// EntityId 0 is reserved for system-level keys across all subsystems.
/// The signal system also uses EntityId(0) with `Tag::Sig` for checkpoint
/// metadata (see `signals/checkpoint.rs`). The `Tag::Idx` discriminant
/// ensures no key collision — the full key format `[entity_id][NUL][tag][suffix]`
/// keeps each subsystem's keys in separate namespaces.
pub const INDEX_ROOT_ID: u64 = 0;
impl BitmapIndex {
/// Create a new, empty bitmap index for the given field.
pub fn new(field_name: impl Into<String>) -> Self;
/// The field name this index covers.
pub fn field_name(&self) -> &str;
/// Add an entity to the bitmap for the given field value.
///
/// For multi-value fields (tags), call once per value for the
/// same entity.
///
/// Returns `true` if the entity was newly added (was not already
/// in this value's bitmap).
pub fn insert(&self, entity_id: u32, value: impl Into<String>) -> bool;
/// Remove an entity from the bitmap for the given field value.
///
/// Returns `true` if the entity was present and removed.
pub fn delete(&self, entity_id: u32, value: &str) -> bool;
/// Remove an entity from ALL value bitmaps.
///
/// Used when an entity is deleted entirely. Scans all values
/// and removes the entity ID from each bitmap.
///
/// Returns the number of bitmaps the entity was removed from.
pub fn delete_entity(&self, entity_id: u32) -> usize;
/// Look up the bitmap for an exact field value.
///
/// Returns `None` if no entities have this value.
///
/// The returned bitmap is a clone (cheap for roaring bitmaps
/// due to copy-on-write internals). Callers can intersect,
/// union, or iterate over it freely.
pub fn get(&self, value: &str) -> Option<RoaringBitmap>;
/// Look up bitmaps for multiple values and return their union.
///
/// Implements OR-within-dimension: `category IN [jazz, blues]`
/// returns the union of the jazz and blues bitmaps.
pub fn get_union(&self, values: &[&str]) -> RoaringBitmap;
/// Number of entities matching a specific field value.
///
/// Used by the adaptive query planner for selectivity estimation.
pub fn cardinality(&self, value: &str) -> u64;
/// Total number of distinct entity IDs indexed across all values.
///
/// Used as the denominator for selectivity:
/// `selectivity = cardinality(value) / total_count()`.
///
/// Note: this is the cardinality of the union of all value
/// bitmaps, NOT the sum of individual cardinalities (which
/// would double-count entities in multi-value fields).
pub fn total_count(&self) -> u64;
/// Enumerate all indexed field values.
pub fn values(&self) -> Vec<String>;
/// Number of distinct field values in this index.
pub fn distinct_values(&self) -> usize;
/// Check if the index is empty (no entities indexed).
pub fn is_empty(&self) -> bool;
/// Serialize all bitmaps to storage engine key-value pairs.
///
/// Returns a `Vec<(Vec<u8>, Vec<u8>)>` of `(key, value)` pairs
/// suitable for `StorageEngine::write_batch()`.
///
/// Key format: `encode_key(EntityId(INDEX_ROOT_ID), Tag::Idx, suffix)`
/// where suffix = `b"BMP:" + field_name + b":" + field_value`.
pub fn serialize_to_kv_pairs(&self) -> Result<Vec<(Vec<u8>, Vec<u8>)>, IndexError>;
/// Deserialize bitmaps from storage engine key-value pairs.
///
/// Scans all keys with the `BMP:{field_name}:` prefix and
/// deserializes each value as a `RoaringBitmap`.
pub fn load_from_kv_pairs(
field_name: impl Into<String>,
pairs: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<Self, IndexError>;
}
```
### Internal Design
**`RoaringBitmap` entity ID representation:**
`RoaringBitmap` operates on `u32` values. `EntityId` is `u64`. For M2 (10K entities), all entity IDs fit in `u32`. For production at 10M+ entities, we need a strategy. Options:
1. **Truncate to u32**: Works if entity IDs are assigned sequentially from 0. This is the simplest approach and matches how most databases assign internal row IDs.
2. **Split high/low bits**: Use the high 32 bits as a partition key and the low 32 bits in the bitmap. Requires multiple bitmaps per value.
3. **Use `RoaringTreemap`**: The `roaring` crate provides `RoaringTreemap` which operates on `u64`. Slightly slower for small sets but handles the full ID space.
Decision for M2: Use `u32` with an explicit conversion from `EntityId`. The `insert` and `delete` APIs accept `u32` directly. The caller (entity write path, m2p5) is responsible for the `EntityId::as_u64() as u32` conversion. A debug assertion verifies no truncation: `debug_assert!(entity_id.as_u64() <= u64::from(u32::MAX))`. At M7+, if entity IDs exceed u32, switch to `RoaringTreemap`.
**HashMap vs BTreeMap for value storage:**
`HashMap<String, RoaringBitmap>` is used because:
- Field value lookups are exact-match (hash is O(1)).
- No ordering needed for bitmap index values.
- `BTreeMap` would be used if we needed prefix or range queries on field values, which we do not.
**Persistence key encoding:**
```
Key: encode_key(EntityId(0), Tag::Idx, b"BMP:category:jazz")
Value: RoaringBitmap::serialize_into() bytes
```
The `INDEX_ROOT_ID = EntityId(0)` anchors all index keys to the same entity prefix, making them scannable via `scan_prefix(entity_tag_prefix(EntityId(0), Tag::Idx))`. The suffix `BMP:{field_name}:{field_value}` distinguishes bitmap keys from future index types (e.g., `RNG:` for range indexes in Task 02).
**Bitmap serialization:**
The `roaring` crate provides `serialize_into(&mut writer)` and `deserialize_from(reader)` using the standard Roaring bitmap serialization format (compatible with CRoaring, Roaring Java, etc.). At 10K items with ~50 distinct category values, each bitmap is ~1-4 KB serialized. Total index persistence: ~100 KB.
### Error Handling
- `insert()` and `delete()` are infallible -- they operate on in-memory data only.
- `serialize_to_kv_pairs()` can fail if bitmap serialization fails (theoretically impossible for valid bitmaps, but the API is `Result` for correctness).
- `load_from_kv_pairs()` can fail if the stored bytes are corrupted. Returns `IndexError::Serialization` with context.
## Test Strategy
### Property Tests
```rust
use proptest::prelude::*;
use roaring::RoaringBitmap;
// Insert-query roundtrip: every inserted entity appears in get().
proptest! {
#[test]
fn insert_query_roundtrip(
entries in prop::collection::vec(
(0u32..100_000, "[a-z]{1,5}"),
1..500,
),
) {
let index = BitmapIndex::new("test_field");
for &(id, ref value) in &entries {
index.insert(id, value.clone());
}
for &(id, ref value) in &entries {
let bitmap = index.get(value).expect("value should exist");
prop_assert!(bitmap.contains(id), "entity {id} not found in bitmap for '{value}'");
}
}
}
// Delete removes correctly: after delete, entity is absent.
proptest! {
#[test]
fn delete_removes_entity(
ids in prop::collection::vec(0u32..10_000, 1..100),
value in "[a-z]{1,5}",
) {
let index = BitmapIndex::new("test_field");
for &id in &ids {
index.insert(id, value.clone());
}
// Delete first half
let half = ids.len() / 2;
for &id in &ids[..half] {
index.delete(id, &value);
}
// Verify first half absent, second half present
let bitmap = index.get(&value).unwrap_or_default();
for &id in &ids[..half] {
prop_assert!(!bitmap.contains(id), "deleted entity {id} still in bitmap");
}
for &id in &ids[half..] {
prop_assert!(bitmap.contains(id), "surviving entity {id} missing from bitmap");
}
}
}
// Cardinality matches bitmap length.
proptest! {
#[test]
fn cardinality_matches_bitmap(
ids in prop::collection::hash_set(0u32..100_000, 1..1000),
value in "[a-z]{1,3}",
) {
let index = BitmapIndex::new("test_field");
for &id in &ids {
index.insert(id, value.clone());
}
prop_assert_eq!(
index.cardinality(&value),
ids.len() as u64,
);
}
}
// Total count equals distinct entity IDs across all values.
proptest! {
#[test]
fn total_count_no_double_counting(
entries in prop::collection::vec(
(0u32..1_000, "[a-z]{1,3}"),
1..200,
),
) {
let index = BitmapIndex::new("tags");
let mut all_ids = RoaringBitmap::new();
for &(id, ref value) in &entries {
index.insert(id, value.clone());
all_ids.insert(id);
}
prop_assert_eq!(index.total_count(), all_ids.len());
}
}
// get_union returns the union of individual bitmaps.
proptest! {
#[test]
fn get_union_is_correct(
a_ids in prop::collection::vec(0u32..1000, 1..50),
b_ids in prop::collection::vec(0u32..1000, 1..50),
) {
let index = BitmapIndex::new("test_field");
for &id in &a_ids {
index.insert(id, "a".to_string());
}
for &id in &b_ids {
index.insert(id, "b".to_string());
}
let union = index.get_union(&["a", "b"]);
let manual_union = {
let mut bm = index.get("a").unwrap_or_default();
bm |= &index.get("b").unwrap_or_default();
bm
};
prop_assert_eq!(union, manual_union);
}
}
// Serialize-deserialize roundtrip preserves all bitmaps.
proptest! {
#[test]
fn serialize_deserialize_roundtrip(
entries in prop::collection::vec(
(0u32..10_000, "[a-z]{1,3}"),
1..100,
),
) {
let index = BitmapIndex::new("test_field");
for &(id, ref value) in &entries {
index.insert(id, value.clone());
}
let kv_pairs = index.serialize_to_kv_pairs().unwrap();
let restored = BitmapIndex::load_from_kv_pairs(
"test_field",
kv_pairs.into_iter(),
).unwrap();
// Verify all values and bitmaps match
for value in index.values() {
let orig = index.get(&value).unwrap();
let rest = restored.get(&value).unwrap();
prop_assert_eq!(orig, rest, "mismatch for value '{value}'");
}
prop_assert_eq!(index.total_count(), restored.total_count());
}
}
```
### Unit Tests
```rust
#[test]
fn new_index_is_empty() {
let index = BitmapIndex::new("category");
assert!(index.is_empty());
assert_eq!(index.total_count(), 0);
assert_eq!(index.distinct_values(), 0);
assert!(index.get("jazz").is_none());
assert_eq!(index.cardinality("jazz"), 0);
}
#[test]
fn insert_single_entity() {
let index = BitmapIndex::new("category");
assert!(index.insert(1, "jazz"));
assert!(!index.is_empty());
assert_eq!(index.cardinality("jazz"), 1);
assert_eq!(index.total_count(), 1);
let bitmap = index.get("jazz").unwrap();
assert!(bitmap.contains(1));
}
#[test]
fn insert_same_entity_same_value_is_idempotent() {
let index = BitmapIndex::new("category");
assert!(index.insert(1, "jazz")); // first insert: true (newly added)
assert!(!index.insert(1, "jazz")); // second insert: false (already present)
assert_eq!(index.cardinality("jazz"), 1);
}
#[test]
fn multi_value_field_tags() {
let index = BitmapIndex::new("tags");
index.insert(1, "jazz");
index.insert(1, "piano");
index.insert(1, "classical");
// Entity 1 appears in all three tag bitmaps
assert!(index.get("jazz").unwrap().contains(1));
assert!(index.get("piano").unwrap().contains(1));
assert!(index.get("classical").unwrap().contains(1));
// Total count is 1, not 3 (one distinct entity)
assert_eq!(index.total_count(), 1);
assert_eq!(index.distinct_values(), 3);
}
#[test]
fn delete_entity_from_all_values() {
let index = BitmapIndex::new("tags");
index.insert(1, "jazz");
index.insert(1, "piano");
index.insert(2, "jazz");
let removed_count = index.delete_entity(1);
assert_eq!(removed_count, 2); // removed from "jazz" and "piano"
// Entity 1 gone from both bitmaps
assert!(!index.get("jazz").unwrap().contains(1));
assert!(index.get("piano").is_none()); // only had entity 1
// Entity 2 still in "jazz"
assert!(index.get("jazz").unwrap().contains(2));
assert_eq!(index.total_count(), 1);
}
#[test]
fn get_union_multiple_values() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.insert(2, "blues");
index.insert(3, "jazz");
let union = index.get_union(&["jazz", "blues"]);
assert_eq!(union.len(), 3);
assert!(union.contains(1));
assert!(union.contains(2));
assert!(union.contains(3));
}
#[test]
fn get_union_with_missing_value() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
let union = index.get_union(&["jazz", "nonexistent"]);
assert_eq!(union.len(), 1);
assert!(union.contains(1));
}
#[test]
fn get_union_empty_values() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
let union = index.get_union(&[]);
assert!(union.is_empty());
}
#[test]
fn values_enumerates_all() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.insert(2, "blues");
index.insert(3, "rock");
let mut values = index.values();
values.sort();
assert_eq!(values, vec!["blues", "jazz", "rock"]);
}
#[test]
fn delete_nonexistent_returns_false() {
let index = BitmapIndex::new("category");
assert!(!index.delete(99, "jazz"));
}
#[test]
fn field_name_accessor() {
let index = BitmapIndex::new("category");
assert_eq!(index.field_name(), "category");
}
#[test]
fn empty_value_bitmaps_are_cleaned_up() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.delete(1, "jazz");
// After removing the only entity, the value bitmap should be
// removed or empty. get() returns None for empty/removed.
assert!(index.get("jazz").is_none() || index.get("jazz").unwrap().is_empty());
assert_eq!(index.total_count(), 0);
}
```
## Acceptance Criteria
- [ ] `BitmapIndex` stores one `RoaringBitmap` per distinct field value, mapping field values to sets of `u32` entity IDs
- [ ] `insert(entity_id, field_value)` adds the entity to the bitmap for that value; returns true if newly added
- [ ] `delete(entity_id, field_value)` removes the entity from the bitmap; returns true if was present
- [ ] `delete_entity(entity_id)` removes the entity from ALL value bitmaps; returns count of bitmaps modified
- [ ] `get(field_value)` returns the bitmap for exact-match lookup, or `None` if no entities have that value
- [ ] `get_union(values)` returns the union bitmap across multiple values (OR-within-dimension)
- [ ] `cardinality(field_value)` returns the count of entities matching that value
- [ ] `total_count()` returns the count of distinct entity IDs across all values (no double-counting for multi-value fields)
- [ ] Multi-value fields work: one entity can appear in multiple value bitmaps (tested with tags)
- [ ] `serialize_to_kv_pairs()` and `load_from_kv_pairs()` roundtrip preserves all bitmaps exactly (property tested)
- [ ] Key encoding uses `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")` for persistence
- [ ] `BitmapIndex` is `Send + Sync`
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "resolve filter predicates to roaring bitmaps", "Weaviate produces a RoaringBitmap allow-list", "Pinecone intersects metadata bitmaps per field with IVF cluster assignments"
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: `cardinality(bitmap[field][value]) / total_entities`)
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"]`, `format_bitmap["video"]`, bitmap intersection for compound filters), Section 7.4 (short-circuit evaluation: "if any bitmap has cardinality 0, return empty Results immediately")
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.2 (metadata-indexed scan: "filters on keyword fields are resolved to roaring bitmaps and intersected before any signal reads")
## Implementation Notes
- Add `roaring = "0.10"` to `[dependencies]` in `tidal/Cargo.toml`. The `roaring` crate is pure Rust, no `unsafe`, well-maintained (~55M crates.io downloads), and implements the standard Roaring bitmap format.
- `RoaringBitmap` uses `u32`. This limits entity IDs to ~4 billion, which is sufficient through M6 (100K items). If M7+ needs u64 entity IDs, switch to `RoaringTreemap` (same crate, same API, u64 keys). Add a `debug_assert!(entity_id <= u32::MAX as u64)` at the conversion boundary.
- Empty bitmaps (all entities deleted for a value) should be removed from the `HashMap` to avoid accumulating dead entries. Check `bitmap.is_empty()` after `delete()` and remove the entry.
- The `RwLock` is from `std::sync::RwLock`, not `parking_lot`. At M2 scale (10K entities, low contention), `std` RwLock is sufficient. If profiling shows contention at M7, switch to `parking_lot::RwLock` which is faster under high contention.
- Do NOT implement bitmap compression tuning (run optimization) in this task. The `roaring` crate handles compression automatically.
- Do NOT implement the correlation cache for co-occurring filter pairs (Spec 07 Section 3: "maintain joint statistics for common filter combinations"). This is deferred to M5 or later.