tidaldb/docs/planning/milestone-3/phase-1/task-03-user-state-bitmap-indexes.md
jordan 39ada28c6e feat: complete Milestones 2–4 — RETRIEVE query, vector index, ranking profiles, diversity, entity system, sessions
M2: RETRIEVE query pipeline with 5-stage execution (candidate → filter → score → diversify → limit),
    usearch HNSW vector index, bitmap/range/universe filters, ranking profiles with signal scoring,
    MMR diversity enforcement, and m2_uat integration tests.

M3: Entity system with typed metadata, relationship graph (follows/blocks/interactions),
    creator entities, session tracking, and m3_uat integration tests.

M4: Advanced ranking with builtin functions (freshness, trending, controversy, wilson),
    ranking executor with explain mode, query executor integration, benchmarks for
    query/ranking/vector/filters/diversity, and m4_uat integration tests.

Includes: 9 new blog posts, marketing site updates, updated roadmap, and updated vision doc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 16:24:48 -07:00

# Task 03: User-State Bitmap Indexes
## Context
**Milestone:** 3 -- Personalized Ranking
**Phase:** m3p1 -- User and Creator Entities with Relationships
**Depends On:** Task 01 (User/Creator entities, `CreatorItemsBitmap`), Task 02 (Relationship graph: follows, blocks, hide edges)
**Blocks:** m3p2 (Feedback Loop updates seen/hide bitmaps), m3p3 (Personalized Profiles use follows bitmap for `following` profile), m3p4 (User State Filters compose these bitmaps with metadata filters)
**Complexity:** M
## Objective
Deliver three user-state bitmap structures that power the `unseen`, `unblocked`, and `relationship:follows` filters in the RETRIEVE executor:
1. **`UserSeenBitmap`**: per-user roaring bitmap of item IDs the user has viewed. Updated on `view` signals. Used by the `unseen` filter to exclude already-seen items.
2. **`UserBlockedSet`**: per-user set of blocked creator IDs and hidden item IDs. Built from `blocks` and `hide` relationship edges. Used by the `unblocked` filter to exclude all items from blocked creators and all hidden items.
3. **`FollowsBitmap`**: per-user roaring bitmap of item IDs from followed creators. Built by taking the union, over the user's `follows` edges, of each followed creator's item bitmap in the `CreatorItemsBitmap` from Task 01. Used by `FILTER relationship:follows` to restrict candidates to followed creators' items.
These bitmaps are maintained in-memory for hot-path query performance and reconstructed from the storage engine on restart. They compose with the existing `FilterExpr` / `FilterResult` system from m2p2.
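As a rough illustration of how the follows bitmap can be derived, the sketch below unions per-creator item sets over a user's followed creators. It uses only the standard library: `BTreeSet<u32>` stands in for `RoaringBitmap`, and `CreatorItems` and `follows_items` are hypothetical stand-ins for Task 01's `CreatorItemsBitmap` and its lookup, whose real internals are not shown in this document.

```rust
use std::collections::{BTreeSet, HashMap, HashSet};

/// Hypothetical stand-in for Task 01's `CreatorItemsBitmap`:
/// creator ID -> set of item IDs (BTreeSet replaces RoaringBitmap here).
#[derive(Default)]
pub struct CreatorItems {
    by_creator: HashMap<u64, BTreeSet<u32>>,
}

impl CreatorItems {
    /// Record that `item_id` belongs to `creator_id`.
    pub fn add_item(&mut self, creator_id: u64, item_id: u32) {
        self.by_creator.entry(creator_id).or_default().insert(item_id);
    }
}

/// Union the item sets of every followed creator.
/// Followed creators with no recorded items contribute nothing.
pub fn follows_items(followed: &HashSet<u64>, creator_items: &CreatorItems) -> BTreeSet<u32> {
    let mut out = BTreeSet::new();
    for creator_id in followed {
        if let Some(items) = creator_items.by_creator.get(creator_id) {
            out.extend(items.iter().copied());
        }
    }
    out
}
```

With a real `RoaringBitmap` the same loop would presumably use a bitmap union instead of `extend`, but the shape of the computation is the same.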
## Requirements
- `UserSeenBitmap`: per-user `RoaringBitmap` of viewed item IDs
- `UserBlockedSet`: per-user `HashSet<u64>` of blocked creator IDs plus a `RoaringBitmap` of hidden item IDs (the `BlockedState` struct in Core Types)
- `FollowsBitmap`: per-user `RoaringBitmap` of item IDs from followed creators, derived at query time from the user's followed-creator set and the `CreatorItemsBitmap`
- `UserStateIndex`: container holding the seen, blocked, and follows state for all users, with one `DashMap` per structure
- `user_state.mark_seen(user_id, item_id)` adds to seen bitmap
- `user_state.is_seen(user_id, item_id)` checks membership
- `user_state.add_block(user_id, creator_id)` adds to blocked set
- `user_state.add_hide(user_id, item_id)` adds to hidden set
- `user_state.add_follow(user_id, creator_id)` records the followed creator; the follows bitmap is derived from this set
- `user_state.remove_follow(user_id, creator_id)` removes the creator from the followed set
- `user_state.unseen_predicate(user_id)` returns a predicate closure (for `FilterResult::Predicate`) that excludes seen items
- `user_state.unblocked_predicate(user_id)` returns a predicate closure (for `FilterResult::Predicate`) over `(item_id, creator_id)` that excludes items from blocked creators and hidden items
- `user_state.follows_bitmap(user_id, creator_items)` returns the bitmap (for `FilterResult::Bitmap`) of followed creators' items
- Memory budget: roughly 125 KB per user for the seen bitmap at 1M items in the dense worst case (1,000,000 bits / 8); sparser seen sets compress below this
- Reconstruction on restart via `rebuild_from_relationships(storage)`, covering follows, blocks, and hides; seen bitmaps start empty in M3 (see Implementation Notes)
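
The memory budget can be sanity-checked with a little arithmetic. The sketch below is an estimate only, assuming the standard roaring layout in which each container covers 65,536 consecutive IDs and a fully dense container costs 8 KiB of payload. The function names are illustrative, not part of the task's API, and per-container headers are ignored.

```rust
/// Number of IDs covered by one roaring container (the low 16 bits of the ID).
const CONTAINER_SPAN: u64 = 1 << 16;
/// Payload of a fully dense bitmap container: 65,536 bits = 8 KiB.
const DENSE_CONTAINER_BYTES: u64 = CONTAINER_SPAN / 8;

/// Size in bytes of a flat, uncompressed bitset over `id_space` item IDs.
pub fn flat_bitset_bytes(id_space: u64) -> u64 {
    (id_space + 7) / 8
}

/// Rough upper bound on roaring payload when item IDs are drawn from
/// `0..id_space`: every touched container costs at most one dense bitmap.
pub fn roaring_payload_upper_bound(id_space: u64) -> u64 {
    let containers = (id_space + CONTAINER_SPAN - 1) / CONTAINER_SPAN;
    containers * DENSE_CONTAINER_BYTES
}
```

For 1M items this gives 125,000 bytes for the flat bitset and 16 containers (131,072 bytes) as the roaring upper bound, both comfortably under the 200 KB acceptance threshold. A user who has seen only a small fraction of the catalogue should land well below either figure.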
## Technical Design
### Module Structure
```
tidal/src/
  entities/
    user_state.rs   -- UserStateIndex, all bitmap types
```
### Core Types
```rust
// === entities/user_state.rs ===
use dashmap::DashMap;
use roaring::RoaringBitmap;
use std::collections::HashSet;

use crate::schema::EntityId;
use super::CreatorItemsBitmap;

/// Per-user blocked creators and hidden items.
#[derive(Debug, Default, Clone)]
pub struct BlockedState {
    /// Creator IDs the user has blocked.
    pub blocked_creators: HashSet<u64>,
    /// Item IDs the user has hidden.
    pub hidden_items: RoaringBitmap,
}

/// Centralized user-state index for fast query-time filtering.
///
/// All structures are in-memory for hot-path performance. They are
/// rebuilt from the storage engine on startup and incrementally
/// maintained on signal writes and relationship changes.
///
/// Item IDs are narrowed with `as u32` because `RoaringBitmap` stores
/// `u32` values; IDs above `u32::MAX` would be silently truncated.
#[derive(Default)]
pub struct UserStateIndex {
    /// Per-user seen item bitmaps. Key: user_id as u64.
    seen: DashMap<u64, RoaringBitmap>,
    /// Per-user blocked/hidden state. Key: user_id as u64.
    blocked: DashMap<u64, BlockedState>,
    /// Per-user followed creator IDs. Key: user_id as u64.
    follows: DashMap<u64, HashSet<u64>>,
}

impl UserStateIndex {
    pub fn new() -> Self {
        Self::default()
    }

    // ── Seen ─────────────────────────────────────────────────

    /// Mark an item as seen by a user.
    pub fn mark_seen(&self, user_id: EntityId, item_id: EntityId) {
        self.seen
            .entry(user_id.as_u64())
            .or_default()
            .insert(item_id.as_u64() as u32);
    }

    /// Check if a user has seen an item.
    pub fn is_seen(&self, user_id: EntityId, item_id: EntityId) -> bool {
        self.seen
            .get(&user_id.as_u64())
            .is_some_and(|bm| bm.contains(item_id.as_u64() as u32))
    }

    /// Get the count of seen items for a user.
    pub fn seen_count(&self, user_id: EntityId) -> u64 {
        self.seen
            .get(&user_id.as_u64())
            .map_or(0, |bm| bm.len())
    }

    // ── Blocked / Hidden ──────────────────────────────────────

    /// Add a creator to the user's blocked set.
    pub fn add_block(&self, user_id: EntityId, creator_id: EntityId) {
        self.blocked
            .entry(user_id.as_u64())
            .or_default()
            .blocked_creators
            .insert(creator_id.as_u64());
    }

    /// Add an item to the user's hidden set.
    pub fn add_hide(&self, user_id: EntityId, item_id: EntityId) {
        self.blocked
            .entry(user_id.as_u64())
            .or_default()
            .hidden_items
            .insert(item_id.as_u64() as u32);
    }

    /// Check if a creator is blocked by a user.
    pub fn is_blocked(&self, user_id: EntityId, creator_id: EntityId) -> bool {
        self.blocked
            .get(&user_id.as_u64())
            .is_some_and(|s| s.blocked_creators.contains(&creator_id.as_u64()))
    }

    /// Check if an item is hidden by a user.
    pub fn is_hidden(&self, user_id: EntityId, item_id: EntityId) -> bool {
        self.blocked
            .get(&user_id.as_u64())
            .is_some_and(|s| s.hidden_items.contains(item_id.as_u64() as u32))
    }

    // ── Follows ──────────────────────────────────────────────

    /// Add a follow relationship.
    pub fn add_follow(&self, user_id: EntityId, creator_id: EntityId) {
        self.follows
            .entry(user_id.as_u64())
            .or_default()
            .insert(creator_id.as_u64());
    }

    /// Remove a follow relationship. No-op if the user follows nobody.
    pub fn remove_follow(&self, user_id: EntityId, creator_id: EntityId) {
        if let Some(mut set) = self.follows.get_mut(&user_id.as_u64()) {
            set.remove(&creator_id.as_u64());
        }
    }

    /// Get the creator IDs a user follows (in unspecified order).
    pub fn followed_creators(&self, user_id: EntityId) -> Vec<EntityId> {
        self.follows
            .get(&user_id.as_u64())
            .map_or_else(Vec::new, |set| {
                set.iter().map(|&id| EntityId::new(id)).collect()
            })
    }

    // ── Filter builders ──────────────────────────────────────

    /// Build an "unseen" filter predicate for a user.
    ///
    /// Returns a closure that returns `true` for items the user has NOT seen.
    /// The closure owns a clone of the bitmap taken at call time, so later
    /// `mark_seen` calls do not affect a predicate that was already built.
    pub fn unseen_predicate(
        &self,
        user_id: EntityId,
    ) -> Box<dyn Fn(u64) -> bool + Send + Sync> {
        let seen_bitmap = self.seen
            .get(&user_id.as_u64())
            .map(|bm| bm.value().clone());
        Box::new(move |item_id: u64| {
            match &seen_bitmap {
                Some(bm) => !bm.contains(item_id as u32),
                None => true, // no seen data = everything is unseen
            }
        })
    }

    /// Build an "unblocked" filter predicate for a user.
    ///
    /// Returns a closure that returns `true` for items that are:
    /// - NOT from a blocked creator
    /// - NOT in the user's hidden set
    ///
    /// The caller supplies the item's creator_id as the second argument
    /// (`None` when the creator is unknown).
    pub fn unblocked_predicate(
        &self,
        user_id: EntityId,
    ) -> Box<dyn Fn(u64, Option<u64>) -> bool + Send + Sync> {
        let state = self.blocked
            .get(&user_id.as_u64())
            .map(|s| s.value().clone());
        Box::new(move |item_id: u64, creator_id: Option<u64>| {
            match &state {
                Some(s) => {
                    // Hidden items are excluded regardless of creator.
                    if s.hidden_items.contains(item_id as u32) {
                        return false;
                    }
                    // Blocked creators are excluded; an unknown creator
                    // is not treated as blocked.
                    !creator_id.is_some_and(|cid| s.blocked_creators.contains(&cid))
                }
                None => true,
            }
        })
    }

    /// Build a "follows" filter bitmap for a user.
    ///
    /// Returns the union of all item bitmaps for creators the user follows.
    pub fn follows_bitmap(
        &self,
        user_id: EntityId,
        creator_items: &CreatorItemsBitmap,
    ) -> RoaringBitmap {
        let creators = self.followed_creators(user_id);
        creator_items.items_for_creators(&creators)
    }

    // ── Reconstruction ───────────────────────────────────────

    /// Rebuild relationship-derived user state from storage.
    ///
    /// Scans all relationship edges to reconstruct:
    /// - Blocked/hidden sets from blocks/hide relationship edges
    /// - Follows sets from follows relationship edges
    ///
    /// Seen bitmaps are not rebuilt here in M3 (see Implementation Notes).
    pub fn rebuild_from_relationships(
        &self,
        _storage: &dyn crate::storage::StorageEngine,
    ) -> crate::Result<()> {
        // Stub: the scan is not implemented yet.
        // Scan all Rel-tagged keys in users keyspace.
        // For each key, decode the relationship type and update the
        // appropriate bitmap/set.
        // Implementation detail: use entity_tag_prefix scanning.
        // This is called once on startup.
        Ok(())
    }
}
```
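The `as u32` casts in the block above keep only the low 32 bits of an ID, so two distinct 64-bit IDs can collide on the same bitmap entry. As a minimal sketch of a stricter alternative, the hypothetical helper `item_key` below (not part of the task's API) uses a checked conversion and reports out-of-range IDs instead of wrapping. `truncating_key` is included only to show what the plain cast does.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: convert a 64-bit item ID to a 32-bit bitmap key,
/// returning `None` instead of silently wrapping when the ID does not fit.
pub fn item_key(item_id: u64) -> Option<u32> {
    u32::try_from(item_id).ok()
}

/// For comparison, the behaviour of the plain `as u32` cast:
/// the high 32 bits are discarded.
pub fn truncating_key(item_id: u64) -> u32 {
    item_id as u32
}
```

Whether an out-of-range ID should be rejected, logged, or routed to a partitioned bitmap is a design decision this task leaves open. The sketch only shows how to detect the condition.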
### Integration with FilterExpr
The `unseen`, `unblocked`, and `follows` filters are new `FilterExpr` variants that the query executor evaluates using the `UserStateIndex`:
```rust
// Extend FilterExpr in storage/indexes/filter.rs
pub enum FilterExpr {
    // ... existing variants ...

    /// Exclude items the user has seen. Requires user context.
    Unseen,
    /// Exclude items from blocked creators and hidden items. Requires user context.
    Unblocked,
    /// Only items from followed creators. Requires user context.
    Follows,
}
```
The executor resolves these variants at query time by consulting the `UserStateIndex` attached to the `TidalDb` instance.
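To make the resolution step concrete, here is a minimal standard-library sketch of how an executor might evaluate the three variants against a candidate list. Everything in it is an assumption made for illustration: `UserFilter` mirrors only the three new variants, `UserSnapshot` is a hypothetical per-query copy of one user's state, `Candidate` is a stripped-down stand-in for `ScoredCandidate`, and `BTreeSet<u32>` replaces `RoaringBitmap`.

```rust
use std::collections::{BTreeSet, HashSet};

/// The three user-state variants only (existing `FilterExpr` variants omitted).
pub enum UserFilter {
    Unseen,
    Unblocked,
    Follows,
}

/// Hypothetical per-query snapshot of one user's state.
#[derive(Default)]
pub struct UserSnapshot {
    pub seen: BTreeSet<u32>,
    pub hidden: BTreeSet<u32>,
    pub blocked_creators: HashSet<u64>,
    /// Items from followed creators (the resolved follows bitmap).
    pub follows_items: BTreeSet<u32>,
}

/// Minimal candidate: item ID plus creator ID when known.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Candidate {
    pub item_id: u32,
    pub creator_id: Option<u64>,
}

/// Returns true when the candidate survives the given filter.
pub fn passes(filter: &UserFilter, user: &UserSnapshot, c: &Candidate) -> bool {
    match filter {
        UserFilter::Unseen => !user.seen.contains(&c.item_id),
        UserFilter::Unblocked => {
            !user.hidden.contains(&c.item_id)
                && !c.creator_id.is_some_and(|cid| user.blocked_creators.contains(&cid))
        }
        UserFilter::Follows => user.follows_items.contains(&c.item_id),
    }
}

/// Keep only candidates that pass every filter.
pub fn apply_filters(
    filters: &[UserFilter],
    user: &UserSnapshot,
    candidates: &[Candidate],
) -> Vec<Candidate> {
    candidates
        .iter()
        .copied()
        .filter(|c| filters.iter().all(|f| passes(f, user, c)))
        .collect()
}
```

The exclusion filters are evaluated per candidate, while `Follows` is a plain membership test against a precomputed set. That matches the split between predicate-style and bitmap-style results described in the Requirements.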
## Test Strategy
### Unit Tests
```rust
#[test]
fn mark_seen_and_check() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    let item = EntityId::new(42);
    assert!(!index.is_seen(user, item));
    index.mark_seen(user, item);
    assert!(index.is_seen(user, item));
}

#[test]
fn seen_count_increments() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    assert_eq!(index.seen_count(user), 0);
    index.mark_seen(user, EntityId::new(1));
    index.mark_seen(user, EntityId::new(2));
    index.mark_seen(user, EntityId::new(2)); // duplicate
    assert_eq!(index.seen_count(user), 2);
}

#[test]
fn block_and_check() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    let creator = EntityId::new(10);
    assert!(!index.is_blocked(user, creator));
    index.add_block(user, creator);
    assert!(index.is_blocked(user, creator));
}

#[test]
fn hide_and_check() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    let item = EntityId::new(42);
    assert!(!index.is_hidden(user, item));
    index.add_hide(user, item);
    assert!(index.is_hidden(user, item));
}

#[test]
fn follow_and_list() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    index.add_follow(user, EntityId::new(10));
    index.add_follow(user, EntityId::new(20));
    let creators = index.followed_creators(user);
    assert_eq!(creators.len(), 2);
}

#[test]
fn unfollow_removes_creator() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    index.add_follow(user, EntityId::new(10));
    index.add_follow(user, EntityId::new(20));
    index.remove_follow(user, EntityId::new(10));
    let creators = index.followed_creators(user);
    assert_eq!(creators.len(), 1);
    assert!(creators.contains(&EntityId::new(20)));
}

#[test]
fn unseen_predicate_excludes_seen_items() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    index.mark_seen(user, EntityId::new(5));
    index.mark_seen(user, EntityId::new(10));
    let pred = index.unseen_predicate(user);
    assert!(!pred(5));  // seen -> excluded
    assert!(!pred(10)); // seen -> excluded
    assert!(pred(15));  // unseen -> included
    assert!(pred(1));   // unseen -> included
}

#[test]
fn unseen_predicate_for_unknown_user_includes_all() {
    let index = UserStateIndex::new();
    let pred = index.unseen_predicate(EntityId::new(999));
    assert!(pred(1));
    assert!(pred(100));
}

#[test]
fn unblocked_predicate_excludes_blocked_and_hidden() {
    let index = UserStateIndex::new();
    let user = EntityId::new(1);
    index.add_block(user, EntityId::new(77)); // block creator 77
    index.add_hide(user, EntityId::new(42));  // hide item 42
    let pred = index.unblocked_predicate(user);
    // Item from blocked creator -> excluded
    assert!(!pred(100, Some(77)));
    // Hidden item -> excluded regardless of creator
    assert!(!pred(42, Some(1)));
    assert!(!pred(42, None));
    // Normal item from unblocked creator -> included
    assert!(pred(50, Some(10)));
    // Item with unknown creator -> included (not blocked)
    assert!(pred(50, None));
}

#[test]
fn follows_bitmap_union_of_creator_items() {
    let index = UserStateIndex::new();
    let creator_items = CreatorItemsBitmap::new();
    // Creator 10 has items 100, 101
    creator_items.add_item(EntityId::new(10), EntityId::new(100));
    creator_items.add_item(EntityId::new(10), EntityId::new(101));
    // Creator 20 has items 200, 201
    creator_items.add_item(EntityId::new(20), EntityId::new(200));
    creator_items.add_item(EntityId::new(20), EntityId::new(201));
    // Creator 30 has items 300 (not followed)
    creator_items.add_item(EntityId::new(30), EntityId::new(300));
    let user = EntityId::new(1);
    index.add_follow(user, EntityId::new(10));
    index.add_follow(user, EntityId::new(20));
    let bitmap = index.follows_bitmap(user, &creator_items);
    assert!(bitmap.contains(100));
    assert!(bitmap.contains(101));
    assert!(bitmap.contains(200));
    assert!(bitmap.contains(201));
    assert!(!bitmap.contains(300)); // not followed
    assert_eq!(bitmap.len(), 4);
}

#[test]
fn different_users_have_independent_state() {
    let index = UserStateIndex::new();
    let user_a = EntityId::new(1);
    let user_b = EntityId::new(2);
    index.mark_seen(user_a, EntityId::new(42));
    index.add_block(user_b, EntityId::new(77));
    assert!(index.is_seen(user_a, EntityId::new(42)));
    assert!(!index.is_seen(user_b, EntityId::new(42)));
    assert!(!index.is_blocked(user_a, EntityId::new(77)));
    assert!(index.is_blocked(user_b, EntityId::new(77)));
}
```
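One behaviour the unit tests do not pin down is that `unseen_predicate` clones the user's bitmap when it is built, so the returned closure is a snapshot: a `mark_seen` issued afterwards is invisible to it. The sketch below models that with the standard library only. `SeenModel` is a hypothetical stand-in, with `Mutex<HashMap>` in place of `DashMap` and `BTreeSet<u32>` in place of `RoaringBitmap`.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::Mutex;

/// Std-only model of the seen index (hypothetical, for illustration).
#[derive(Default)]
pub struct SeenModel {
    seen: Mutex<HashMap<u64, BTreeSet<u32>>>,
}

impl SeenModel {
    /// Record that `user_id` has viewed `item_id`.
    pub fn mark_seen(&self, user_id: u64, item_id: u32) {
        self.seen
            .lock()
            .unwrap()
            .entry(user_id)
            .or_default()
            .insert(item_id);
    }

    /// Mirrors `unseen_predicate`: clone the user's set once, then answer
    /// membership from the clone without touching the shared map again.
    pub fn unseen_predicate(&self, user_id: u64) -> Box<dyn Fn(u32) -> bool + Send + Sync> {
        let snapshot = self.seen.lock().unwrap().get(&user_id).cloned();
        Box::new(move |item_id| match &snapshot {
            Some(set) => !set.contains(&item_id),
            None => true, // no seen data = everything is unseen
        })
    }
}
```

If the executor builds the predicate once per query, this snapshot behaviour is probably what is wanted, since the result set stays consistent even while views are being recorded concurrently. It may be worth an explicit test if m3p2 relies on it.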
### Property Tests
```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn seen_items_never_pass_unseen_filter(
        user_id in 1u64..100,
        seen_items in proptest::collection::vec(1u64..10000, 1..100),
        test_item in 1u64..10000,
    ) {
        let index = UserStateIndex::new();
        let user = EntityId::new(user_id);
        for &item in &seen_items {
            index.mark_seen(user, EntityId::new(item));
        }
        let pred = index.unseen_predicate(user);
        if seen_items.contains(&test_item) {
            prop_assert!(!pred(test_item),
                "seen item {} should be excluded by unseen filter", test_item);
        } else {
            prop_assert!(pred(test_item),
                "unseen item {} should pass unseen filter", test_item);
        }
    }

    #[test]
    fn blocked_creators_items_never_pass_unblocked_filter(
        user_id in 1u64..100,
        blocked_creators in proptest::collection::vec(1u64..100, 1..10),
        test_creator in 1u64..100,
        test_item in 1u64..10000,
    ) {
        let index = UserStateIndex::new();
        let user = EntityId::new(user_id);
        for &cid in &blocked_creators {
            index.add_block(user, EntityId::new(cid));
        }
        let pred = index.unblocked_predicate(user);
        if blocked_creators.contains(&test_creator) {
            prop_assert!(!pred(test_item, Some(test_creator)),
                "item from blocked creator {} should be excluded", test_creator);
        } else {
            prop_assert!(pred(test_item, Some(test_creator)),
                "item from unblocked creator {} should pass", test_creator);
        }
    }
}
```
## Acceptance Criteria
- [ ] `UserStateIndex` with `DashMap`-backed seen, blocked, and follows structures
- [ ] `mark_seen` / `is_seen` / `seen_count` work correctly
- [ ] `add_block` / `is_blocked` / `add_hide` / `is_hidden` work correctly
- [ ] `add_follow` / `remove_follow` / `followed_creators` work correctly
- [ ] `unseen_predicate` returns closure excluding all seen items
- [ ] `unseen_predicate` for unknown user includes all items
- [ ] `unblocked_predicate` excludes items from blocked creators AND hidden items
- [ ] `follows_bitmap` returns union of item sets for followed creators
- [ ] Different users have fully independent state (no cross-contamination)
- [ ] `FilterExpr` extended with `Unseen`, `Unblocked`, `Follows` variants
- [ ] Memory: roaring bitmap for 1M items is < 200KB per user in typical usage
- [ ] Property test: seen items NEVER pass unseen filter
- [ ] Property test: blocked creators' items NEVER pass unblocked filter
- [ ] All unit and property tests pass
- [ ] `cargo clippy -- -D warnings` passes
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Roaring bitmap selectivity estimation
- [VISION.md](../../../../VISION.md) -- "unseen" and "unblocked" as first-class filter primitives
## Implementation Notes
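- The `UserStateIndex` is stored as a field on `TidalDb`, allocated during `open()`. It is `Send + Sync` because every field is a `DashMap` whose keys and values (`u64`, `RoaringBitmap`, `BlockedState`, `HashSet<u64>`) are themselves `Send + Sync`.
- `RoaringBitmap` stores `u32` values, so item IDs are narrowed with `as u32`. For M3 at up to 1M items, `u32` is sufficient. An item ID above `u32::MAX` would be silently truncated by the cast and alias a different item, so before IDs can exceed that range the bitmap must be partitioned or replaced with a 64-bit structure (for example the `roaring` crate's `RoaringTreemap`). This is unlikely before M7 (production hardening). Document the `u32` limitation.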
- The `unblocked_predicate` takes an `(item_id, Option<creator_id>)` pair because the predicate needs to know the creator for each item. The executor must look up creator_id per item when evaluating this filter. In the RETRIEVE executor, creator_id is already available from `ScoredCandidate::creator_id` (set in m2p3).
- On startup, `rebuild_from_relationships` scans all `Tag::Rel` keys in the users keyspace and populates the `follows`, `blocked`, and `hidden` structures. Seen bitmaps are NOT rebuilt from storage on startup for M3 -- they start empty and are populated from signal writes during the session. Full seen-state persistence (checkpoint + restore) is deferred to m3p4 Task 01 where it is implemented properly.
- The `CreatorItemsBitmap` (from Task 01) must be updated when new items are written. A materialized `FollowsBitmap` then becomes stale whenever a new item arrives for a followed creator. Two approaches: (a) rebuild the affected follows bitmaps eagerly on every item write (expensive), or (b) rebuild lazily at query time and cache the result. Recommendation: (b) -- cache the follows bitmap per user, stamped with a generation counter that increments on item writes. A write invalidates the cache by bumping the counter; the next query sees the stale stamp and rebuilds. Follow and unfollow changes must invalidate the cached bitmap as well.
- Do NOT implement signal-triggered updates to these bitmaps in this task. That wiring is done in m3p2 (Feedback Loop) where the signal dispatch atomically calls `mark_seen`, `add_hide`, `add_block` as part of the signal write path.
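
The generation-counter cache recommended above could look roughly like the following. This is a sketch under stated assumptions rather than the intended implementation: `FollowsCache` and its methods are hypothetical names, a single global counter is used (so any write invalidates every user's entry), `Mutex<HashMap>` stands in for `DashMap`, and `BTreeSet<u32>` stands in for `RoaringBitmap`.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical cache for per-user follows bitmaps.
#[derive(Default)]
pub struct FollowsCache {
    /// Bumped on every item write and on every follow/unfollow.
    generation: AtomicU64,
    /// user_id -> (generation at build time, cached item set).
    entries: Mutex<HashMap<u64, (u64, BTreeSet<u32>)>>,
}

impl FollowsCache {
    /// Invalidate all cached entries by advancing the generation.
    pub fn bump(&self) {
        self.generation.fetch_add(1, Ordering::SeqCst);
    }

    /// Return the cached set if its stamp matches the current generation;
    /// otherwise call `rebuild`, cache the result, and return it.
    pub fn get_or_rebuild<F>(&self, user_id: u64, rebuild: F) -> BTreeSet<u32>
    where
        F: FnOnce() -> BTreeSet<u32>,
    {
        let current = self.generation.load(Ordering::SeqCst);
        let mut entries = self.entries.lock().unwrap();
        if let Some((built_at, items)) = entries.get(&user_id) {
            if *built_at == current {
                return items.clone();
            }
        }
        let fresh = rebuild();
        entries.insert(user_id, (current, fresh.clone()));
        fresh
    }
}
```

In this sketch `rebuild` would wrap a call such as `follows_bitmap(user_id, &creator_items)`. A global counter is the simplest scheme but over-invalidates under heavy write load; per-creator or per-user generations would be a finer-grained variant if that turns out to matter.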