- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test coverage for every public API surface - m1p5: TidalDb public API — write_item, signal, read_decay_score, read_windowed_count, read_velocity; StorageBox enum routing memory vs fjall; WalSender/WalHandleWriter bridge; WAL replay on open - Periodic checkpoint: 30s background thread for persistent+schema mode; FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful shutdown via Arc<AtomicBool> + join before final checkpoint - ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing) - Milestone 2 planning scaffolding added under docs/planning/milestone-2/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
559 lines
22 KiB
Markdown
559 lines
22 KiB
Markdown
# Task 01: Roaring Bitmap Indexes
|
|
|
|
## Context
|
|
|
|
**Milestone:** 2 -- Ranked Retrieval
|
|
**Phase:** m2p2 -- Metadata Indexes and Filter Engine
|
|
**Depends On:** None (uses types from m1p1 but no m2p2 tasks)
|
|
**Blocks:** Task 03 (Composable Filter Engine)
|
|
**Complexity:** M
|
|
|
|
## Objective
|
|
|
|
Deliver `BitmapIndex`, the in-memory index structure that maps categorical metadata field values to `RoaringBitmap` sets of matching entity IDs. This is the data structure that makes `FILTER category:jazz` resolve in microseconds: look up `category_bitmap["jazz"]`, get a `RoaringBitmap` of all jazz item IDs, intersect with other filter bitmaps. No entity scan. No metadata loads.
|
|
|
|
The bitmap index supports exact-match lookups (`category:jazz`), multi-value fields (`tags:classical` where one entity has multiple tags), and provides cardinality counts for the adaptive query planner's selectivity estimation (Spec 07 Section 3, Section 9). Every production system that solves filtered ANN -- Qdrant, Weaviate, Pinecone -- uses roaring bitmaps as the pre-filter primitive. This is table stakes.
|
|
|
|
## Requirements
|
|
|
|
- `BitmapIndex` struct: maps `(field_name, field_value)` pairs to `RoaringBitmap` of matching entity IDs
|
|
- `insert(entity_id, field_value)` -- adds entity to the bitmap for that value
|
|
- `delete(entity_id, field_value)` -- removes entity from the bitmap for that value
|
|
- `get(field_value) -> Option<&RoaringBitmap>` -- returns bitmap for exact-match filter
|
|
- `cardinality(field_value) -> u64` -- returns number of entities matching that value
|
|
- `total_count() -> u64` -- returns total number of distinct entity IDs indexed across all values
|
|
- `values() -> impl Iterator<Item = &str>` -- enumerate all indexed field values
|
|
- `field_name()` -- returns the field this index covers
|
|
- Multi-value field support: one entity can appear in multiple value bitmaps (e.g., tags)
|
|
- `delete_entity(entity_id)` -- removes entity from ALL value bitmaps (for entity deletion)
|
|
- Persistence: serialize/deserialize each value bitmap to/from the storage engine via `Tag::Idx`
|
|
- On startup: load from storage engine, rebuild in-memory state
|
|
- Pure Rust: use the `roaring` crate (crates.io), not `croaring` (C bindings)
|
|
- `Send + Sync` (for concurrent read access during queries)
|
|
|
|
## Technical Design
|
|
|
|
### Module Structure
|
|
|
|
```
|
|
tidal/src/storage/
|
|
indexes/
|
|
mod.rs -- IndexError, pub use re-exports
|
|
bitmap.rs -- BitmapIndex (this task)
|
|
```
|
|
|
|
### Public API
|
|
|
|
```rust
|
|
// === storage/indexes/bitmap.rs ===
|
|
|
|
use roaring::RoaringBitmap;
|
|
use std::collections::HashMap;
|
|
use std::sync::RwLock;
|
|
|
|
/// Error type for index operations.
|
|
#[derive(Debug, thiserror::Error)]
|
|
pub enum IndexError {
|
|
#[error("storage error: {0}")]
|
|
Storage(String),
|
|
#[error("serialization error: {0}")]
|
|
Serialization(String),
|
|
}
|
|
|
|
/// A roaring bitmap index for a single categorical metadata field.
|
|
///
|
|
/// Maps field values (strings) to `RoaringBitmap` sets of entity IDs
|
|
/// that have that value. Supports exact-match lookups, multi-value
|
|
/// fields (tags), and cardinality queries for selectivity estimation.
|
|
///
|
|
/// # Concurrency
|
|
///
|
|
/// Reads and writes are protected by an `RwLock`. Reads (get, cardinality,
|
|
/// total_count) take the read lock. Writes (insert, delete, delete_entity)
|
|
/// take the write lock. This is acceptable because:
|
|
/// - Writes happen on the entity write path (not the hot query path).
|
|
/// - Reads happen during filter evaluation (concurrent with other reads).
|
|
/// - At M2 scale (10K entities), contention is negligible.
|
|
///
|
|
/// At M7 scale (1M+ entities), if write contention becomes measurable,
|
|
/// switch to a sharded index (one BitmapIndex per shard of the field
|
|
/// value space).
|
|
///
|
|
/// # Persistence
|
|
///
|
|
/// Each `(field_name, field_value)` bitmap is stored as a key in the
|
|
/// storage engine using `Tag::Idx`:
|
|
///
|
|
/// ```text
|
|
/// Key: [INDEX_ROOT_ID: 8 bytes BE][0x00][Tag::Idx][b"BMP:"][field_name][b":"][field_value]
|
|
/// Value: RoaringBitmap serialized bytes
|
|
/// ```
|
|
///
|
|
/// `INDEX_ROOT_ID` is `EntityId(0)` -- a reserved entity ID used as
|
|
/// the root for all index keys.
|
|
pub struct BitmapIndex {
|
|
field_name: String,
|
|
values: RwLock<HashMap<String, RoaringBitmap>>,
|
|
}
|
|
|
|
/// The reserved entity ID used as the root for index storage keys.
|
|
///
|
|
/// EntityId 0 is reserved for system-level keys across all subsystems.
|
|
/// The signal system also uses EntityId(0) with `Tag::Sig` for checkpoint
|
|
/// metadata (see `signals/checkpoint.rs`). The `Tag::Idx` discriminant
|
|
/// ensures no key collision — the full key format `[entity_id][NUL][tag][suffix]`
|
|
/// keeps each subsystem's keys in separate namespaces.
|
|
pub const INDEX_ROOT_ID: u64 = 0;
|
|
|
|
impl BitmapIndex {
|
|
/// Create a new, empty bitmap index for the given field.
|
|
pub fn new(field_name: impl Into<String>) -> Self;
|
|
|
|
/// The field name this index covers.
|
|
pub fn field_name(&self) -> &str;
|
|
|
|
/// Add an entity to the bitmap for the given field value.
|
|
///
|
|
/// For multi-value fields (tags), call once per value for the
|
|
/// same entity.
|
|
///
|
|
/// Returns `true` if the entity was newly added (was not already
|
|
/// in this value's bitmap).
|
|
pub fn insert(&self, entity_id: u32, value: impl Into<String>) -> bool;
|
|
|
|
/// Remove an entity from the bitmap for the given field value.
|
|
///
|
|
/// Returns `true` if the entity was present and removed.
|
|
pub fn delete(&self, entity_id: u32, value: &str) -> bool;
|
|
|
|
/// Remove an entity from ALL value bitmaps.
|
|
///
|
|
/// Used when an entity is deleted entirely. Scans all values
|
|
/// and removes the entity ID from each bitmap.
|
|
///
|
|
/// Returns the number of bitmaps the entity was removed from.
|
|
pub fn delete_entity(&self, entity_id: u32) -> usize;
|
|
|
|
/// Look up the bitmap for an exact field value.
|
|
///
|
|
/// Returns `None` if no entities have this value.
|
|
///
|
|
/// The returned bitmap is a clone (cheap for roaring bitmaps
|
|
/// due to copy-on-write internals). Callers can intersect,
|
|
/// union, or iterate over it freely.
|
|
pub fn get(&self, value: &str) -> Option<RoaringBitmap>;
|
|
|
|
/// Look up bitmaps for multiple values and return their union.
|
|
///
|
|
/// Implements OR-within-dimension: `category IN [jazz, blues]`
|
|
/// returns the union of the jazz and blues bitmaps.
|
|
pub fn get_union(&self, values: &[&str]) -> RoaringBitmap;
|
|
|
|
/// Number of entities matching a specific field value.
|
|
///
|
|
/// Used by the adaptive query planner for selectivity estimation.
|
|
pub fn cardinality(&self, value: &str) -> u64;
|
|
|
|
/// Total number of distinct entity IDs indexed across all values.
|
|
///
|
|
/// Used as the denominator for selectivity:
|
|
/// `selectivity = cardinality(value) / total_count()`.
|
|
///
|
|
/// Note: this is the cardinality of the union of all value
|
|
/// bitmaps, NOT the sum of individual cardinalities (which
|
|
/// would double-count entities in multi-value fields).
|
|
pub fn total_count(&self) -> u64;
|
|
|
|
/// Enumerate all indexed field values.
|
|
pub fn values(&self) -> Vec<String>;
|
|
|
|
/// Number of distinct field values in this index.
|
|
pub fn distinct_values(&self) -> usize;
|
|
|
|
/// Check if the index is empty (no entities indexed).
|
|
pub fn is_empty(&self) -> bool;
|
|
|
|
/// Serialize all bitmaps to storage engine key-value pairs.
|
|
///
|
|
/// Returns a `Vec<(Vec<u8>, Vec<u8>)>` of `(key, value)` pairs
|
|
/// suitable for `StorageEngine::write_batch()`.
|
|
///
|
|
/// Key format: `encode_key(EntityId(INDEX_ROOT_ID), Tag::Idx, suffix)`
|
|
/// where suffix = `b"BMP:" + field_name + b":" + field_value`.
|
|
pub fn serialize_to_kv_pairs(&self) -> Result<Vec<(Vec<u8>, Vec<u8>)>, IndexError>;
|
|
|
|
/// Deserialize bitmaps from storage engine key-value pairs.
|
|
///
|
|
/// Scans all keys with the `BMP:{field_name}:` prefix and
|
|
/// deserializes each value as a `RoaringBitmap`.
|
|
pub fn load_from_kv_pairs(
|
|
field_name: impl Into<String>,
|
|
pairs: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
|
|
) -> Result<Self, IndexError>;
|
|
}
|
|
```
|
|
|
|
### Internal Design
|
|
|
|
**`RoaringBitmap` entity ID representation:**
|
|
|
|
`RoaringBitmap` operates on `u32` values. `EntityId` is `u64`. For M2 (10K entities), all entity IDs fit in `u32`. For production at 10M+ entities, we need a strategy. Options:
|
|
|
|
1. **Truncate to u32**: Works if entity IDs are assigned sequentially from 0. This is the simplest approach and matches how most databases assign internal row IDs.
|
|
2. **Split high/low bits**: Use the high 32 bits as a partition key and the low 32 bits in the bitmap. Requires multiple bitmaps per value.
|
|
3. **Use `RoaringTreemap`**: The `roaring` crate provides `RoaringTreemap` which operates on `u64`. Slightly slower for small sets but handles the full ID space.
|
|
|
|
Decision for M2: Use `u32` with an explicit conversion from `EntityId`. The `insert` and `delete` APIs accept `u32` directly. The caller (entity write path, m2p5) is responsible for the `EntityId::as_u64() as u32` conversion. A debug assertion verifies no truncation: `debug_assert!(entity_id.as_u64() <= u64::from(u32::MAX))`. At M7+, if entity IDs exceed u32, switch to `RoaringTreemap`.
|
|
|
|
**HashMap vs BTreeMap for value storage:**
|
|
|
|
`HashMap<String, RoaringBitmap>` is used because:
|
|
- Field value lookups are exact-match (hash is O(1)).
|
|
- No ordering needed for bitmap index values.
|
|
- `BTreeMap` would be used if we needed prefix or range queries on field values, which we do not.
|
|
|
|
**Persistence key encoding:**
|
|
|
|
```
|
|
Key: encode_key(EntityId(0), Tag::Idx, b"BMP:category:jazz")
|
|
Value: RoaringBitmap::serialize_into() bytes
|
|
```
|
|
|
|
The `INDEX_ROOT_ID = EntityId(0)` anchors all index keys to the same entity prefix, making them scannable via `scan_prefix(entity_tag_prefix(EntityId(0), Tag::Idx))`. The suffix `BMP:{field_name}:{field_value}` distinguishes bitmap keys from future index types (e.g., `RNG:` for range indexes in Task 02).
|
|
|
|
**Bitmap serialization:**
|
|
|
|
The `roaring` crate provides `serialize_into(&mut writer)` and `deserialize_from(reader)` using the standard Roaring bitmap serialization format (compatible with CRoaring, Roaring Java, etc.). At 10K items with ~50 distinct category values, each bitmap is ~1-4 KB serialized. Total index persistence: ~100 KB.
|
|
|
|
### Error Handling
|
|
|
|
- `insert()` and `delete()` are infallible -- they operate on in-memory data only.
|
|
- `serialize_to_kv_pairs()` can fail if bitmap serialization fails (theoretically impossible for valid bitmaps, but the API is `Result` for correctness).
|
|
- `load_from_kv_pairs()` can fail if the stored bytes are corrupted. Returns `IndexError::Serialization` with context.
|
|
|
|
## Test Strategy
|
|
|
|
### Property Tests
|
|
|
|
```rust
|
|
use proptest::prelude::*;
|
|
use roaring::RoaringBitmap;
|
|
|
|
// Insert-query roundtrip: every inserted entity appears in get().
|
|
proptest! {
|
|
#[test]
|
|
fn insert_query_roundtrip(
|
|
entries in prop::collection::vec(
|
|
(0u32..100_000, "[a-z]{1,5}"),
|
|
1..500,
|
|
),
|
|
) {
|
|
let index = BitmapIndex::new("test_field");
|
|
for &(id, ref value) in &entries {
|
|
index.insert(id, value.clone());
|
|
}
|
|
|
|
for &(id, ref value) in &entries {
|
|
let bitmap = index.get(value).expect("value should exist");
|
|
prop_assert!(bitmap.contains(id), "entity {id} not found in bitmap for '{value}'");
|
|
}
|
|
}
|
|
}
|
|
|
|
// Delete removes correctly: after delete, entity is absent.
|
|
proptest! {
|
|
#[test]
|
|
fn delete_removes_entity(
|
|
ids in prop::collection::vec(0u32..10_000, 1..100),
|
|
value in "[a-z]{1,5}",
|
|
) {
|
|
let index = BitmapIndex::new("test_field");
|
|
for &id in &ids {
|
|
index.insert(id, value.clone());
|
|
}
|
|
|
|
// Delete first half
|
|
let half = ids.len() / 2;
|
|
for &id in &ids[..half] {
|
|
index.delete(id, &value);
|
|
}
|
|
|
|
// Verify first half absent, second half present
|
|
let bitmap = index.get(&value).unwrap_or_default();
|
|
for &id in &ids[..half] {
|
|
prop_assert!(!bitmap.contains(id), "deleted entity {id} still in bitmap");
|
|
}
|
|
for &id in &ids[half..] {
|
|
prop_assert!(bitmap.contains(id), "surviving entity {id} missing from bitmap");
|
|
}
|
|
}
|
|
}
|
|
|
|
// Cardinality matches bitmap length.
|
|
proptest! {
|
|
#[test]
|
|
fn cardinality_matches_bitmap(
|
|
ids in prop::collection::hash_set(0u32..100_000, 1..1000),
|
|
value in "[a-z]{1,3}",
|
|
) {
|
|
let index = BitmapIndex::new("test_field");
|
|
for &id in &ids {
|
|
index.insert(id, value.clone());
|
|
}
|
|
prop_assert_eq!(
|
|
index.cardinality(&value),
|
|
ids.len() as u64,
|
|
);
|
|
}
|
|
}
|
|
|
|
// Total count equals distinct entity IDs across all values.
|
|
proptest! {
|
|
#[test]
|
|
fn total_count_no_double_counting(
|
|
entries in prop::collection::vec(
|
|
(0u32..1_000, "[a-z]{1,3}"),
|
|
1..200,
|
|
),
|
|
) {
|
|
let index = BitmapIndex::new("tags");
|
|
let mut all_ids = RoaringBitmap::new();
|
|
for &(id, ref value) in &entries {
|
|
index.insert(id, value.clone());
|
|
all_ids.insert(id);
|
|
}
|
|
prop_assert_eq!(index.total_count(), all_ids.len());
|
|
}
|
|
}
|
|
|
|
// get_union returns the union of individual bitmaps.
|
|
proptest! {
|
|
#[test]
|
|
fn get_union_is_correct(
|
|
a_ids in prop::collection::vec(0u32..1000, 1..50),
|
|
b_ids in prop::collection::vec(0u32..1000, 1..50),
|
|
) {
|
|
let index = BitmapIndex::new("test_field");
|
|
for &id in &a_ids {
|
|
index.insert(id, "a".to_string());
|
|
}
|
|
for &id in &b_ids {
|
|
index.insert(id, "b".to_string());
|
|
}
|
|
|
|
let union = index.get_union(&["a", "b"]);
|
|
let manual_union = {
|
|
let mut bm = index.get("a").unwrap_or_default();
|
|
bm |= &index.get("b").unwrap_or_default();
|
|
bm
|
|
};
|
|
prop_assert_eq!(union, manual_union);
|
|
}
|
|
}
|
|
|
|
// Serialize-deserialize roundtrip preserves all bitmaps.
|
|
proptest! {
|
|
#[test]
|
|
fn serialize_deserialize_roundtrip(
|
|
entries in prop::collection::vec(
|
|
(0u32..10_000, "[a-z]{1,3}"),
|
|
1..100,
|
|
),
|
|
) {
|
|
let index = BitmapIndex::new("test_field");
|
|
for &(id, ref value) in &entries {
|
|
index.insert(id, value.clone());
|
|
}
|
|
|
|
let kv_pairs = index.serialize_to_kv_pairs().unwrap();
|
|
let restored = BitmapIndex::load_from_kv_pairs(
|
|
"test_field",
|
|
kv_pairs.into_iter(),
|
|
).unwrap();
|
|
|
|
// Verify all values and bitmaps match
|
|
for value in index.values() {
|
|
let orig = index.get(&value).unwrap();
|
|
let rest = restored.get(&value).unwrap();
|
|
prop_assert_eq!(orig, rest, "mismatch for value '{value}'");
|
|
}
|
|
prop_assert_eq!(index.total_count(), restored.total_count());
|
|
}
|
|
}
|
|
```
|
|
|
|
### Unit Tests
|
|
|
|
```rust
|
|
#[test]
|
|
fn new_index_is_empty() {
|
|
let index = BitmapIndex::new("category");
|
|
assert!(index.is_empty());
|
|
assert_eq!(index.total_count(), 0);
|
|
assert_eq!(index.distinct_values(), 0);
|
|
assert!(index.get("jazz").is_none());
|
|
assert_eq!(index.cardinality("jazz"), 0);
|
|
}
|
|
|
|
#[test]
|
|
fn insert_single_entity() {
|
|
let index = BitmapIndex::new("category");
|
|
assert!(index.insert(1, "jazz"));
|
|
assert!(!index.is_empty());
|
|
assert_eq!(index.cardinality("jazz"), 1);
|
|
assert_eq!(index.total_count(), 1);
|
|
let bitmap = index.get("jazz").unwrap();
|
|
assert!(bitmap.contains(1));
|
|
}
|
|
|
|
#[test]
|
|
fn insert_same_entity_same_value_is_idempotent() {
|
|
let index = BitmapIndex::new("category");
|
|
assert!(index.insert(1, "jazz")); // first insert: true (newly added)
|
|
assert!(!index.insert(1, "jazz")); // second insert: false (already present)
|
|
assert_eq!(index.cardinality("jazz"), 1);
|
|
}
|
|
|
|
#[test]
|
|
fn multi_value_field_tags() {
|
|
let index = BitmapIndex::new("tags");
|
|
index.insert(1, "jazz");
|
|
index.insert(1, "piano");
|
|
index.insert(1, "classical");
|
|
|
|
// Entity 1 appears in all three tag bitmaps
|
|
assert!(index.get("jazz").unwrap().contains(1));
|
|
assert!(index.get("piano").unwrap().contains(1));
|
|
assert!(index.get("classical").unwrap().contains(1));
|
|
|
|
// Total count is 1, not 3 (one distinct entity)
|
|
assert_eq!(index.total_count(), 1);
|
|
assert_eq!(index.distinct_values(), 3);
|
|
}
|
|
|
|
#[test]
|
|
fn delete_entity_from_all_values() {
|
|
let index = BitmapIndex::new("tags");
|
|
index.insert(1, "jazz");
|
|
index.insert(1, "piano");
|
|
index.insert(2, "jazz");
|
|
|
|
let removed_count = index.delete_entity(1);
|
|
assert_eq!(removed_count, 2); // removed from "jazz" and "piano"
|
|
|
|
// Entity 1 gone from both bitmaps
|
|
assert!(!index.get("jazz").unwrap().contains(1));
|
|
assert!(index.get("piano").is_none()); // only had entity 1
|
|
|
|
// Entity 2 still in "jazz"
|
|
assert!(index.get("jazz").unwrap().contains(2));
|
|
assert_eq!(index.total_count(), 1);
|
|
}
|
|
|
|
#[test]
|
|
fn get_union_multiple_values() {
|
|
let index = BitmapIndex::new("category");
|
|
index.insert(1, "jazz");
|
|
index.insert(2, "blues");
|
|
index.insert(3, "jazz");
|
|
|
|
let union = index.get_union(&["jazz", "blues"]);
|
|
assert_eq!(union.len(), 3);
|
|
assert!(union.contains(1));
|
|
assert!(union.contains(2));
|
|
assert!(union.contains(3));
|
|
}
|
|
|
|
#[test]
|
|
fn get_union_with_missing_value() {
|
|
let index = BitmapIndex::new("category");
|
|
index.insert(1, "jazz");
|
|
|
|
let union = index.get_union(&["jazz", "nonexistent"]);
|
|
assert_eq!(union.len(), 1);
|
|
assert!(union.contains(1));
|
|
}
|
|
|
|
#[test]
|
|
fn get_union_empty_values() {
|
|
let index = BitmapIndex::new("category");
|
|
index.insert(1, "jazz");
|
|
|
|
let union = index.get_union(&[]);
|
|
assert!(union.is_empty());
|
|
}
|
|
|
|
#[test]
|
|
fn values_enumerates_all() {
|
|
let index = BitmapIndex::new("category");
|
|
index.insert(1, "jazz");
|
|
index.insert(2, "blues");
|
|
index.insert(3, "rock");
|
|
|
|
let mut values = index.values();
|
|
values.sort();
|
|
assert_eq!(values, vec!["blues", "jazz", "rock"]);
|
|
}
|
|
|
|
#[test]
|
|
fn delete_nonexistent_returns_false() {
|
|
let index = BitmapIndex::new("category");
|
|
assert!(!index.delete(99, "jazz"));
|
|
}
|
|
|
|
#[test]
|
|
fn field_name_accessor() {
|
|
let index = BitmapIndex::new("category");
|
|
assert_eq!(index.field_name(), "category");
|
|
}
|
|
|
|
#[test]
|
|
fn empty_value_bitmaps_are_cleaned_up() {
|
|
let index = BitmapIndex::new("category");
|
|
index.insert(1, "jazz");
|
|
index.delete(1, "jazz");
|
|
|
|
// After removing the only entity, the value bitmap should be
|
|
// removed or empty. get() returns None for empty/removed.
|
|
assert!(index.get("jazz").is_none() || index.get("jazz").unwrap().is_empty());
|
|
assert_eq!(index.total_count(), 0);
|
|
}
|
|
```
|
|
|
|
## Acceptance Criteria
|
|
|
|
- [ ] `BitmapIndex` stores one `RoaringBitmap` per distinct field value, mapping field values to sets of `u32` entity IDs
|
|
- [ ] `insert(entity_id, field_value)` adds the entity to the bitmap for that value; returns true if newly added
|
|
- [ ] `delete(entity_id, field_value)` removes the entity from the bitmap; returns true if was present
|
|
- [ ] `delete_entity(entity_id)` removes the entity from ALL value bitmaps; returns count of bitmaps modified
|
|
- [ ] `get(field_value)` returns the bitmap for exact-match lookup, or `None` if no entities have that value
|
|
- [ ] `get_union(values)` returns the union bitmap across multiple values (OR-within-dimension)
|
|
- [ ] `cardinality(field_value)` returns the count of entities matching that value
|
|
- [ ] `total_count()` returns the count of distinct entity IDs across all values (no double-counting for multi-value fields)
|
|
- [ ] Multi-value fields work: one entity can appear in multiple value bitmaps (tested with tags)
|
|
- [ ] `serialize_to_kv_pairs()` and `load_from_kv_pairs()` roundtrip preserves all bitmaps exactly (property tested)
|
|
- [ ] Key encoding uses `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")` for persistence
|
|
- [ ] `BitmapIndex` is `Send + Sync`
|
|
- [ ] No `unsafe` code
|
|
- [ ] `cargo clippy -- -D warnings` passes
|
|
- [ ] All property tests and unit tests pass
|
|
|
|
## Research References
|
|
|
|
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "resolve filter predicates to roaring bitmaps", "Weaviate produces a RoaringBitmap allow-list", "Pinecone intersects metadata bitmaps per field with IVF cluster assignments"
|
|
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: `cardinality(bitmap[field][value]) / total_entities`)
|
|
|
|
## Spec References
|
|
|
|
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"]`, `format_bitmap["video"]`, bitmap intersection for compound filters), Section 7.4 (short-circuit evaluation: "if any bitmap has cardinality 0, return empty Results immediately")
|
|
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.2 (metadata-indexed scan: "filters on keyword fields are resolved to roaring bitmaps and intersected before any signal reads")
|
|
|
|
## Implementation Notes
|
|
|
|
- Add `roaring = "0.10"` to `[dependencies]` in `tidal/Cargo.toml`. The `roaring` crate is pure Rust, no `unsafe`, well-maintained (~55M crates.io downloads), and implements the standard Roaring bitmap format.
|
|
- `RoaringBitmap` uses `u32`. This limits entity IDs to ~4 billion, which is sufficient through M6 (100K items). If M7+ needs u64 entity IDs, switch to `RoaringTreemap` (same crate, same API, u64 keys). Add a `debug_assert!(entity_id <= u32::MAX as u64)` at the conversion boundary.
|
|
- Empty bitmaps (all entities deleted for a value) should be removed from the `HashMap` to avoid accumulating dead entries. Check `bitmap.is_empty()` after `delete()` and remove the entry.
|
|
- The `RwLock` is from `std::sync::RwLock`, not `parking_lot`. At M2 scale (10K entities, low contention), `std` RwLock is sufficient. If profiling shows contention at M7, switch to `parking_lot::RwLock` which is faster under high contention.
|
|
- Do NOT implement bitmap compression tuning (run optimization) in this task. The `roaring` crate handles compression automatically.
|
|
- Do NOT implement the correlation cache for co-occurring filter pairs (Spec 07 Section 3: "maintain joint statistics for common filter combinations"). This is deferred to M5 or later.
|