- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test coverage for every public API surface - m1p5: TidalDb public API — write_item, signal, read_decay_score, read_windowed_count, read_velocity; StorageBox enum routing memory vs fjall; WalSender/WalHandleWriter bridge; WAL replay on open - Periodic checkpoint: 30s background thread for persistent+schema mode; FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful shutdown via Arc<AtomicBool> + join before final checkpoint - ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing) - Milestone 2 planning scaffolding added under docs/planning/milestone-2/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
22 KiB
Task 01: Roaring Bitmap Indexes
Context
Milestone: 2 -- Ranked Retrieval Phase: m2p2 -- Metadata Indexes and Filter Engine Depends On: None (uses types from m1p1 but no m2p2 tasks) Blocks: Task 03 (Composable Filter Engine) Complexity: M
Objective
Deliver BitmapIndex, the in-memory index structure that maps categorical metadata field values to RoaringBitmap sets of matching entity IDs. This is the data structure that makes FILTER category:jazz resolve in microseconds: look up category_bitmap["jazz"], get a RoaringBitmap of all jazz item IDs, intersect with other filter bitmaps. No entity scan. No metadata loads.
The bitmap index supports exact-match lookups (category:jazz), multi-value fields (tags:classical where one entity has multiple tags), and provides cardinality counts for the adaptive query planner's selectivity estimation (Spec 07 Section 3, Section 9). Every production system that solves filtered ANN -- Qdrant, Weaviate, Pinecone -- uses roaring bitmaps as the pre-filter primitive. This is table stakes.
Requirements
BitmapIndexstruct: maps(field_name, field_value)pairs toRoaringBitmapof matching entity IDsinsert(entity_id, field_value)-- adds entity to the bitmap for that valuedelete(entity_id, field_value)-- removes entity from the bitmap for that valueget(field_value) -> Option<&RoaringBitmap>-- returns bitmap for exact-match filtercardinality(field_value) -> u64-- returns number of entities matching that valuetotal_count() -> u64-- returns total number of distinct entity IDs indexed across all valuesvalues() -> impl Iterator<Item = &str>-- enumerate all indexed field valuesfield_name()-- returns the field this index covers- Multi-value field support: one entity can appear in multiple value bitmaps (e.g., tags)
delete_entity(entity_id)-- removes entity from ALL value bitmaps (for entity deletion)- Persistence: serialize/deserialize each value bitmap to/from the storage engine via
Tag::Idx - On startup: load from storage engine, rebuild in-memory state
- Pure Rust: use the
roaringcrate (crates.io), notcroaring(C bindings) Send + Sync(for concurrent read access during queries)
Technical Design
Module Structure
tidal/src/storage/
indexes/
mod.rs -- IndexError, pub use re-exports
bitmap.rs -- BitmapIndex (this task)
Public API
// === storage/indexes/bitmap.rs ===
use roaring::RoaringBitmap;
use std::collections::HashMap;
use std::sync::RwLock;
/// Error type for index operations.
#[derive(Debug, thiserror::Error)]
pub enum IndexError {
#[error("storage error: {0}")]
Storage(String),
#[error("serialization error: {0}")]
Serialization(String),
}
/// A roaring bitmap index for a single categorical metadata field.
///
/// Maps field values (strings) to `RoaringBitmap` sets of entity IDs
/// that have that value. Supports exact-match lookups, multi-value
/// fields (tags), and cardinality queries for selectivity estimation.
///
/// # Concurrency
///
/// Reads and writes are protected by an `RwLock`. Reads (get, cardinality,
/// total_count) take the read lock. Writes (insert, delete, delete_entity)
/// take the write lock. This is acceptable because:
/// - Writes happen on the entity write path (not the hot query path).
/// - Reads happen during filter evaluation (concurrent with other reads).
/// - At M2 scale (10K entities), contention is negligible.
///
/// At M7 scale (1M+ entities), if write contention becomes measurable,
/// switch to a sharded index (one BitmapIndex per shard of the field
/// value space).
///
/// # Persistence
///
/// Each `(field_name, field_value)` bitmap is stored as a key in the
/// storage engine using `Tag::Idx`:
///
/// ```text
/// Key: [INDEX_ROOT_ID: 8 bytes BE][0x00][Tag::Idx][b"BMP:"][field_name][b":"][field_value]
/// Value: RoaringBitmap serialized bytes
/// ```
///
/// `INDEX_ROOT_ID` is `EntityId(0)` -- a reserved entity ID used as
/// the root for all index keys.
pub struct BitmapIndex {
field_name: String,
values: RwLock<HashMap<String, RoaringBitmap>>,
}
/// The reserved entity ID used as the root for index storage keys.
///
/// EntityId 0 is reserved for system-level keys across all subsystems.
/// The signal system also uses EntityId(0) with `Tag::Sig` for checkpoint
/// metadata (see `signals/checkpoint.rs`). The `Tag::Idx` discriminant
/// ensures no key collision — the full key format `[entity_id][NUL][tag][suffix]`
/// keeps each subsystem's keys in separate namespaces.
pub const INDEX_ROOT_ID: u64 = 0;
impl BitmapIndex {
/// Create a new, empty bitmap index for the given field.
pub fn new(field_name: impl Into<String>) -> Self;
/// The field name this index covers.
pub fn field_name(&self) -> &str;
/// Add an entity to the bitmap for the given field value.
///
/// For multi-value fields (tags), call once per value for the
/// same entity.
///
/// Returns `true` if the entity was newly added (was not already
/// in this value's bitmap).
pub fn insert(&self, entity_id: u32, value: impl Into<String>) -> bool;
/// Remove an entity from the bitmap for the given field value.
///
/// Returns `true` if the entity was present and removed.
pub fn delete(&self, entity_id: u32, value: &str) -> bool;
/// Remove an entity from ALL value bitmaps.
///
/// Used when an entity is deleted entirely. Scans all values
/// and removes the entity ID from each bitmap.
///
/// Returns the number of bitmaps the entity was removed from.
pub fn delete_entity(&self, entity_id: u32) -> usize;
/// Look up the bitmap for an exact field value.
///
/// Returns `None` if no entities have this value.
///
/// The returned bitmap is a clone (cheap for roaring bitmaps
/// due to copy-on-write internals). Callers can intersect,
/// union, or iterate over it freely.
pub fn get(&self, value: &str) -> Option<RoaringBitmap>;
/// Look up bitmaps for multiple values and return their union.
///
/// Implements OR-within-dimension: `category IN [jazz, blues]`
/// returns the union of the jazz and blues bitmaps.
pub fn get_union(&self, values: &[&str]) -> RoaringBitmap;
/// Number of entities matching a specific field value.
///
/// Used by the adaptive query planner for selectivity estimation.
pub fn cardinality(&self, value: &str) -> u64;
/// Total number of distinct entity IDs indexed across all values.
///
/// Used as the denominator for selectivity:
/// `selectivity = cardinality(value) / total_count()`.
///
/// Note: this is the cardinality of the union of all value
/// bitmaps, NOT the sum of individual cardinalities (which
/// would double-count entities in multi-value fields).
pub fn total_count(&self) -> u64;
/// Enumerate all indexed field values.
pub fn values(&self) -> Vec<String>;
/// Number of distinct field values in this index.
pub fn distinct_values(&self) -> usize;
/// Check if the index is empty (no entities indexed).
pub fn is_empty(&self) -> bool;
/// Serialize all bitmaps to storage engine key-value pairs.
///
/// Returns a `Vec<(Vec<u8>, Vec<u8>)>` of `(key, value)` pairs
/// suitable for `StorageEngine::write_batch()`.
///
/// Key format: `encode_key(EntityId(INDEX_ROOT_ID), Tag::Idx, suffix)`
/// where suffix = `b"BMP:" + field_name + b":" + field_value`.
pub fn serialize_to_kv_pairs(&self) -> Result<Vec<(Vec<u8>, Vec<u8>)>, IndexError>;
/// Deserialize bitmaps from storage engine key-value pairs.
///
/// Scans all keys with the `BMP:{field_name}:` prefix and
/// deserializes each value as a `RoaringBitmap`.
pub fn load_from_kv_pairs(
field_name: impl Into<String>,
pairs: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<Self, IndexError>;
}
Internal Design
RoaringBitmap entity ID representation:
RoaringBitmap operates on u32 values. EntityId is u64. For M2 (10K entities), all entity IDs fit in u32. For production at 10M+ entities, we need a strategy. Options:
- Truncate to u32: Works if entity IDs are assigned sequentially from 0. This is the simplest approach and matches how most databases assign internal row IDs.
- Split high/low bits: Use the high 32 bits as a partition key and the low 32 bits in the bitmap. Requires multiple bitmaps per value.
- Use
RoaringTreemap: Theroaringcrate providesRoaringTreemapwhich operates onu64. Slightly slower for small sets but handles the full ID space.
Decision for M2: Use u32 with an explicit conversion from EntityId. The insert and delete APIs accept u32 directly. The caller (entity write path, m2p5) is responsible for the EntityId::as_u64() as u32 conversion. A debug assertion verifies no truncation: debug_assert!(entity_id.as_u64() <= u64::from(u32::MAX)). At M7+, if entity IDs exceed u32, switch to RoaringTreemap.
HashMap vs BTreeMap for value storage:
HashMap<String, RoaringBitmap> is used because:
- Field value lookups are exact-match (hash is O(1)).
- No ordering needed for bitmap index values.
BTreeMapwould be used if we needed prefix or range queries on field values, which we do not.
Persistence key encoding:
Key: encode_key(EntityId(0), Tag::Idx, b"BMP:category:jazz")
Value: RoaringBitmap::serialize_into() bytes
The INDEX_ROOT_ID = EntityId(0) anchors all index keys to the same entity prefix, making them scannable via scan_prefix(entity_tag_prefix(EntityId(0), Tag::Idx)). The suffix BMP:{field_name}:{field_value} distinguishes bitmap keys from future index types (e.g., RNG: for range indexes in Task 02).
Bitmap serialization:
The roaring crate provides serialize_into(&mut writer) and deserialize_from(reader) using the standard Roaring bitmap serialization format (compatible with CRoaring, Roaring Java, etc.). At 10K items with ~50 distinct category values, each bitmap is ~1-4 KB serialized. Total index persistence: ~100 KB.
Error Handling
insert()anddelete()are infallible -- they operate on in-memory data only.serialize_to_kv_pairs()can fail if bitmap serialization fails (theoretically impossible for valid bitmaps, but the API isResultfor correctness).load_from_kv_pairs()can fail if the stored bytes are corrupted. ReturnsIndexError::Serializationwith context.
Test Strategy
Property Tests
use proptest::prelude::*;
use roaring::RoaringBitmap;
// Insert-query roundtrip: every inserted entity appears in get().
proptest! {
#[test]
fn insert_query_roundtrip(
entries in prop::collection::vec(
(0u32..100_000, "[a-z]{1,5}"),
1..500,
),
) {
let index = BitmapIndex::new("test_field");
for &(id, ref value) in &entries {
index.insert(id, value.clone());
}
for &(id, ref value) in &entries {
let bitmap = index.get(value).expect("value should exist");
prop_assert!(bitmap.contains(id), "entity {id} not found in bitmap for '{value}'");
}
}
}
// Delete removes correctly: after delete, entity is absent.
proptest! {
#[test]
fn delete_removes_entity(
ids in prop::collection::vec(0u32..10_000, 1..100),
value in "[a-z]{1,5}",
) {
let index = BitmapIndex::new("test_field");
for &id in &ids {
index.insert(id, value.clone());
}
// Delete first half
let half = ids.len() / 2;
for &id in &ids[..half] {
index.delete(id, &value);
}
// Verify first half absent, second half present
let bitmap = index.get(&value).unwrap_or_default();
for &id in &ids[..half] {
prop_assert!(!bitmap.contains(id), "deleted entity {id} still in bitmap");
}
for &id in &ids[half..] {
prop_assert!(bitmap.contains(id), "surviving entity {id} missing from bitmap");
}
}
}
// Cardinality matches bitmap length.
proptest! {
#[test]
fn cardinality_matches_bitmap(
ids in prop::collection::hash_set(0u32..100_000, 1..1000),
value in "[a-z]{1,3}",
) {
let index = BitmapIndex::new("test_field");
for &id in &ids {
index.insert(id, value.clone());
}
prop_assert_eq!(
index.cardinality(&value),
ids.len() as u64,
);
}
}
// Total count equals distinct entity IDs across all values.
proptest! {
#[test]
fn total_count_no_double_counting(
entries in prop::collection::vec(
(0u32..1_000, "[a-z]{1,3}"),
1..200,
),
) {
let index = BitmapIndex::new("tags");
let mut all_ids = RoaringBitmap::new();
for &(id, ref value) in &entries {
index.insert(id, value.clone());
all_ids.insert(id);
}
prop_assert_eq!(index.total_count(), all_ids.len());
}
}
// get_union returns the union of individual bitmaps.
proptest! {
#[test]
fn get_union_is_correct(
a_ids in prop::collection::vec(0u32..1000, 1..50),
b_ids in prop::collection::vec(0u32..1000, 1..50),
) {
let index = BitmapIndex::new("test_field");
for &id in &a_ids {
index.insert(id, "a".to_string());
}
for &id in &b_ids {
index.insert(id, "b".to_string());
}
let union = index.get_union(&["a", "b"]);
let manual_union = {
let mut bm = index.get("a").unwrap_or_default();
bm |= &index.get("b").unwrap_or_default();
bm
};
prop_assert_eq!(union, manual_union);
}
}
// Serialize-deserialize roundtrip preserves all bitmaps.
proptest! {
#[test]
fn serialize_deserialize_roundtrip(
entries in prop::collection::vec(
(0u32..10_000, "[a-z]{1,3}"),
1..100,
),
) {
let index = BitmapIndex::new("test_field");
for &(id, ref value) in &entries {
index.insert(id, value.clone());
}
let kv_pairs = index.serialize_to_kv_pairs().unwrap();
let restored = BitmapIndex::load_from_kv_pairs(
"test_field",
kv_pairs.into_iter(),
).unwrap();
// Verify all values and bitmaps match
for value in index.values() {
let orig = index.get(&value).unwrap();
let rest = restored.get(&value).unwrap();
prop_assert_eq!(orig, rest, "mismatch for value '{value}'");
}
prop_assert_eq!(index.total_count(), restored.total_count());
}
}
Unit Tests
#[test]
fn new_index_is_empty() {
let index = BitmapIndex::new("category");
assert!(index.is_empty());
assert_eq!(index.total_count(), 0);
assert_eq!(index.distinct_values(), 0);
assert!(index.get("jazz").is_none());
assert_eq!(index.cardinality("jazz"), 0);
}
#[test]
fn insert_single_entity() {
let index = BitmapIndex::new("category");
assert!(index.insert(1, "jazz"));
assert!(!index.is_empty());
assert_eq!(index.cardinality("jazz"), 1);
assert_eq!(index.total_count(), 1);
let bitmap = index.get("jazz").unwrap();
assert!(bitmap.contains(1));
}
#[test]
fn insert_same_entity_same_value_is_idempotent() {
let index = BitmapIndex::new("category");
assert!(index.insert(1, "jazz")); // first insert: true (newly added)
assert!(!index.insert(1, "jazz")); // second insert: false (already present)
assert_eq!(index.cardinality("jazz"), 1);
}
#[test]
fn multi_value_field_tags() {
let index = BitmapIndex::new("tags");
index.insert(1, "jazz");
index.insert(1, "piano");
index.insert(1, "classical");
// Entity 1 appears in all three tag bitmaps
assert!(index.get("jazz").unwrap().contains(1));
assert!(index.get("piano").unwrap().contains(1));
assert!(index.get("classical").unwrap().contains(1));
// Total count is 1, not 3 (one distinct entity)
assert_eq!(index.total_count(), 1);
assert_eq!(index.distinct_values(), 3);
}
#[test]
fn delete_entity_from_all_values() {
let index = BitmapIndex::new("tags");
index.insert(1, "jazz");
index.insert(1, "piano");
index.insert(2, "jazz");
let removed_count = index.delete_entity(1);
assert_eq!(removed_count, 2); // removed from "jazz" and "piano"
// Entity 1 gone from both bitmaps
assert!(!index.get("jazz").unwrap().contains(1));
assert!(index.get("piano").is_none()); // only had entity 1
// Entity 2 still in "jazz"
assert!(index.get("jazz").unwrap().contains(2));
assert_eq!(index.total_count(), 1);
}
#[test]
fn get_union_multiple_values() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.insert(2, "blues");
index.insert(3, "jazz");
let union = index.get_union(&["jazz", "blues"]);
assert_eq!(union.len(), 3);
assert!(union.contains(1));
assert!(union.contains(2));
assert!(union.contains(3));
}
#[test]
fn get_union_with_missing_value() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
let union = index.get_union(&["jazz", "nonexistent"]);
assert_eq!(union.len(), 1);
assert!(union.contains(1));
}
#[test]
fn get_union_empty_values() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
let union = index.get_union(&[]);
assert!(union.is_empty());
}
#[test]
fn values_enumerates_all() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.insert(2, "blues");
index.insert(3, "rock");
let mut values = index.values();
values.sort();
assert_eq!(values, vec!["blues", "jazz", "rock"]);
}
#[test]
fn delete_nonexistent_returns_false() {
let index = BitmapIndex::new("category");
assert!(!index.delete(99, "jazz"));
}
#[test]
fn field_name_accessor() {
let index = BitmapIndex::new("category");
assert_eq!(index.field_name(), "category");
}
#[test]
fn empty_value_bitmaps_are_cleaned_up() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.delete(1, "jazz");
// After removing the only entity, the value bitmap should be
// removed or empty. get() returns None for empty/removed.
assert!(index.get("jazz").is_none() || index.get("jazz").unwrap().is_empty());
assert_eq!(index.total_count(), 0);
}
Acceptance Criteria
BitmapIndexstores oneRoaringBitmapper distinct field value, mapping field values to sets ofu32entity IDsinsert(entity_id, field_value)adds the entity to the bitmap for that value; returns true if newly addeddelete(entity_id, field_value)removes the entity from the bitmap; returns true if was presentdelete_entity(entity_id)removes the entity from ALL value bitmaps; returns count of bitmaps modifiedget(field_value)returns the bitmap for exact-match lookup, orNoneif no entities have that valueget_union(values)returns the union bitmap across multiple values (OR-within-dimension)cardinality(field_value)returns the count of entities matching that valuetotal_count()returns the count of distinct entity IDs across all values (no double-counting for multi-value fields)- Multi-value fields work: one entity can appear in multiple value bitmaps (tested with tags)
serialize_to_kv_pairs()andload_from_kv_pairs()roundtrip preserves all bitmaps exactly (property tested)- Key encoding uses
encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")for persistence BitmapIndexisSend + Sync- No
unsafecode cargo clippy -- -D warningspasses- All property tests and unit tests pass
Research References
- docs/research/ann_for_tidaldb.md -- "resolve filter predicates to roaring bitmaps", "Weaviate produces a RoaringBitmap allow-list", "Pinecone intersects metadata bitmaps per field with IVF cluster assignments"
- docs/specs/07-vector-retrieval.md -- Section 3 (selectivity estimation:
cardinality(bitmap[field][value]) / total_entities)
Spec References
- docs/specs/08-query-engine.md -- Section 7.1 (bitmap-based architecture:
category_bitmap["jazz"],format_bitmap["video"], bitmap intersection for compound filters), Section 7.4 (short-circuit evaluation: "if any bitmap has cardinality 0, return empty Results immediately") - docs/specs/09-ranking-scoring.md -- Section 3.2 (metadata-indexed scan: "filters on keyword fields are resolved to roaring bitmaps and intersected before any signal reads")
Implementation Notes
- Add
roaring = "0.10"to[dependencies]intidal/Cargo.toml. Theroaringcrate is pure Rust, nounsafe, well-maintained (~55M crates.io downloads), and implements the standard Roaring bitmap format. RoaringBitmapusesu32. This limits entity IDs to ~4 billion, which is sufficient through M6 (100K items). If M7+ needs u64 entity IDs, switch toRoaringTreemap(same crate, same API, u64 keys). Add adebug_assert!(entity_id <= u32::MAX as u64)at the conversion boundary.- Empty bitmaps (all entities deleted for a value) should be removed from the
HashMapto avoid accumulating dead entries. Checkbitmap.is_empty()afterdelete()and remove the entry. - The
RwLockis fromstd::sync::RwLock, notparking_lot. At M2 scale (10K entities, low contention),stdRwLock is sufficient. If profiling shows contention at M7, switch toparking_lot::RwLockwhich is faster under high contention. - Do NOT implement bitmap compression tuning (run optimization) in this task. The
roaringcrate handles compression automatically. - Do NOT implement the correlation cache for co-occurring filter pairs (Spec 07 Section 3: "maintain joint statistics for common filter combinations"). This is deferred to M5 or later.