feat: complete M1 signal engine — m0p3 samples/docs, m1p5 TidalDb API, examples, and periodic checkpoint

- m0p3: CONTRIBUTING.md with run-samples checklist, all 4 examples
  (quickstart, cli_embedding, axum_embedding, actix_embedding), doc-test
  coverage for every public API surface
- m1p5: TidalDb public API — write_item, signal, read_decay_score,
  read_windowed_count, read_velocity; StorageBox enum routing memory vs
  fjall; WalSender/WalHandleWriter bridge; WAL replay on open
- Periodic checkpoint: 30s background thread for persistent+schema mode;
  FjallBackend::Clone (O(1), fjall::Keyspace is ref-counted); graceful
  shutdown via Arc<AtomicBool> + join before final checkpoint
- ROADMAP.md: M0 and M1 fully marked COMPLETE (341 tests passing)
- Milestone 2 planning scaffolding added under docs/planning/milestone-2/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
jordan 2026-02-20 22:45:10 -07:00
parent 4f076c927d
commit 6fdaa1584b
45 changed files with 16216 additions and 169 deletions

88
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,88 @@
# Contributing to tidalDB
## Quick Start
```bash
# Clone the repo
git clone https://github.com/orchard9/tidaldb && cd tidaldb
# Confirm the engine compiles and all tests pass
cargo test --manifest-path tidal/Cargo.toml
# Confirm doc tests and examples compile and run
cargo test --doc --manifest-path tidal/Cargo.toml
cargo test --examples --manifest-path tidal/Cargo.toml
```
## Run Samples Checklist
Before opening a PR that touches public API or examples, verify all samples still work:
```bash
# Doc tests (default features)
cargo test --doc --manifest-path tidal/Cargo.toml
# Doc tests with optional features
cargo test --doc --manifest-path tidal/Cargo.toml --features test-utils,metrics
# All four examples compile and run
cargo run --example quickstart --manifest-path tidal/Cargo.toml
cargo run --example cli_embedding --manifest-path tidal/Cargo.toml
cargo run --example axum_embedding --manifest-path tidal/Cargo.toml # Ctrl+C to stop
cargo run --example actix_embedding --manifest-path tidal/Cargo.toml # Ctrl+C to stop
```
Expected output for `quickstart`:
```
build: dev
uptime: 0.000s
health: ok
tidalDB opened, verified, and closed. M0 complete.
```
## Full Quality Gate
The pre-commit hook enforces these automatically on staged Rust files:
```bash
cargo fmt --manifest-path tidal/Cargo.toml -- --check
cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings
cargo test --manifest-path tidal/Cargo.toml --lib
```
Run the complete gate manually:
```bash
cargo fmt --manifest-path tidal/Cargo.toml
cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings
cargo test --manifest-path tidal/Cargo.toml
cargo bench --manifest-path tidal/Cargo.toml --no-run # ensure benches compile
```
## Project Layout
```
tidal/ Rust database engine
src/
db/ TidalDb handle, builder, config, metrics
schema/ Types, validation, error types
signals/ Signal ledger, decay, windowed counters, checkpoint
storage/ StorageEngine trait, fjall backend, key encoding
wal/ Write-ahead log, group commit, crash recovery
benches/ Criterion benchmarks
examples/ Embedding guides (quickstart, axum, actix, cli)
tests/ Integration and property tests
site/ Marketing site (Next.js)
docs/ Research and planning documents
```
## Coding Standards
See [CODING_GUIDELINES.md](CODING_GUIDELINES.md) for the full engineering standards.
Key rules:
- `Result<T, LumenError>` everywhere — no panics on recoverable failures
- `#![forbid(unsafe_code)]` — relaxed only at explicit FFI boundaries with `// SAFETY:` comment
- Property tests for invariants, criterion benchmarks for performance claims
- `cargo clippy -D warnings` must pass with zero warnings

1448
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@ -2,26 +2,45 @@
## Vision Statement ## Vision Statement
When tidalDB is complete, an engineering team building any content platform -- a media library, a social feed, a marketplace, a discovery surface -- can embed a single Rust database and replace the Elasticsearch + Redis + Kafka + feature store + vector database + ranking service stack. One process, one query interface, one operational model. The query `RETRIEVE items FOR USER @user_id USING PROFILE for_you FILTER unseen, unblocked DIVERSITY max_per_creator:2 LIMIT 50` executes in under 50ms, reflects signals written 100ms ago, enforces diversity without application logic, handles cold-start items without application intervention, and returns results a user would describe as "it knows what I want." When tidalDB is complete, an engineering team building any content platform -- a media library, a social feed, a marketplace, a discovery surface, or an agentic UX -- can embed a single Rust database and replace the Elasticsearch + Redis + Kafka + feature store + vector database + ranking service stack. One process, one query interface, one operational model. The query `RETRIEVE items FOR USER @user_id USING PROFILE for_you FILTER unseen, unblocked DIVERSITY max_per_creator:2 LIMIT 50` executes in under 50ms, reflects signals written 100ms ago, enforces diversity without application logic, handles cold-start items without application intervention, and returns results a user would describe as "it knows what I want."
The same runtime doubles as the personalization **memory substrate for agents**: user → agent → tidalDB. Agents ground themselves by reading live session context, write structured signals (preferences, critiques, tool usage) with decay budgets, and immediately query those updates on the next turn. The embeddable runtime is step zero; the exact same WAL + subject-prefix key architecture grows into a multi-region, eventually-consistent fabric so agent memory travels with the user across devices and datacenters without losing correctness.
## Thesis ## Thesis
A single embeddable database can replace the 6-system content ranking stack by treating signals, ranking profiles, and diversity constraints as database primitives rather than application logic. A single embeddable database can replace the 6-system content ranking stack by treating signals, ranking profiles, session policy, and diversity constraints as database primitives rather than application logic. Every agent or product surface gets an always-fresh memory lane without standing up Vespa-scale search clusters or bespoke feature stores.
## Differentiation vs Vespa and search platforms
1. **Agent-owned memory lanes.** Signals, session context, and reward metadata are schema-level objects. Agents can create scoped sessions, write feedback with decay guarantees, and read it back with zero glue code. Vespa is optimized for serving queries; it assumes you run feature updates elsewhere.
2. **Embeddable-first ergonomics.** `cargo add tidaldb` gives you the full signal + ranking stack with WAL durability and diagnostics in-process. Vespa demands a cluster, config servers, and feed pipelines before you can prototype.
3. **Temporal math on the write path.** Decay, windowing, velocity, and diversity guards are computed atomically when signals arrive. There is no notion of "update documents later" or external CRON math.
4. **Session- and policy-aware query language.** `RETRIEVE ... FOR USER ... FOR SESSION ... USING PROFILE ...` encodes permissions, diversity and cohort constraints; agent policies live in schema, not middleware.
5. **Roadmapped scale path.** The same WAL segments, subject-prefix keys, and checkpoint formats we ship for the embeddable runtime become the replication log and deterministic conflict-resolution substrate for the distributed fabric (see M8). Vespa already starts distributed; tidalDB grows there without sacrificing the zero-config DX.
--- ---
## Milestone Summary ## Milestone Summary
| # | Name | Proves | Enables | | # | Name | Proves | Enables |
|---|------|--------|---------| | --- | --------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| M0 | Embeddable Runtime | tidalDB can run in-process with zero-config defaults and tooling | Cuts proof-of-concept friction, enables internal dogfooding | | M0 | Embeddable Runtime | tidalDB can run in-process with zero-config defaults and tooling | Cuts proof-of-concept friction, enables internal dogfooding |
| M1 | Signal Engine | Signals are a database primitive with O(1) decay, not application math | UC-03 (partial), UC-06 (partial), UC-14 (partial) | | M1 | Signal Engine | Signals are a database primitive with O(1) decay, not application math | UC-03 (partial), UC-06 (partial), UC-14 (partial) |
| M2 | Ranked Retrieval | A single query retrieves, scores, and ranks content using live signals | UC-03, UC-04, UC-06, UC-08, UC-13, UC-14 | | M2 | Ranked Retrieval | A single query retrieves, scores, and ranks content using live signals | UC-03, UC-04, UC-06, UC-08, UC-13, UC-14 |
| M3 | Personalized Ranking | User context shapes retrieval and ranking -- the "For You" query works | UC-01, UC-05, UC-07, UC-09 (partial) | | M3 | Personalized Ranking | User context shapes retrieval and ranking -- the "For You" query works | UC-01, UC-05, UC-07, UC-09 (partial) |
| M4 | Agent Memory | Agents can create sessions, write signals, and enforce policy inside tidalDB | Agent-mediated personalization, RLHF loops, conversational memory | | M4 | Agent Memory | Agents can create sessions, write signals, and enforce policy inside tidalDB | Agent-mediated personalization, RLHF loops, conversational memory |
| M5 | Hybrid Search | Text + semantic + signal-ranked search in one query | UC-02, UC-10, UC-11 | | M5 | Hybrid Search | Text + semantic + signal-ranked search in one query | UC-02, UC-10, UC-11 |
| M6 | Full Surface Coverage | Every use case, every sort mode, every filter, every feedback loop | UC-01 through UC-14 complete | | M6 | Full Surface Coverage | Every use case, every sort mode, every filter, every feedback loop | UC-01 through UC-14 complete |
| M7 | Production Hardening | Crash safety, graceful degradation, operational readiness | All UCs at production quality | | M7 | Production Hardening | Crash safety, graceful degradation, operational readiness | All UCs at production quality |
| M8 | Distributed Fabric | Multi-region, multi-tenant replication keeps agent-memory semantics intact | Hosted tidalDB, cloud/edge deployments, shared agent substrate |
### Embeddable → Distributed Path
1. **M0M2 (Embed & prove primitives):** Establish the deterministic builder, WAL, key encoding, and checkpoint semantics that make on-device instances safe to embed. Research refs: `docs/research/tidaldb_wal.md`, `docs/research/tidaldb_signal_ledger.md`.
2. **M3M4 (Session + agent policy):** Layer user/creator entities, sessions, and policy enforcement so agents can write/read scoped memory lanes without glue. This also defines the logical replication unit: entity + session keyspaces.
3. **M5M6 (Surface completeness):** Ship hybrid search and every retrieval mode so a single tidalDB node can back any personalization surface or agent prompt grounding workload.
4. **M7 (Operational envelope):** Hardening (crash fencing, throttling, observability) creates the guarantees the fabric will rely on when shipping WAL segments across machines.
5. **M8 (Distributed Fabric):** Introduce shard-aware keyspaces, WAL shipping + deterministic reconciliation, and multi-region eventual-consistency policies so embeddable instances graduate to hosted, global deployments without rewriting application code.
### Product Milestone Summary (New) ### Product Milestone Summary (New)
@ -30,39 +49,40 @@ The roadmap now has two tracks:
- **Engine Track (M0-M7):** proves tidalDB capabilities. - **Engine Track (M0-M7):** proves tidalDB capabilities.
- **Product Track (P0-P4):** proves end-user value for the beachhead product. - **Product Track (P0-P4):** proves end-user value for the beachhead product.
| # | Name | Proves | Depends On | | # | Name | Proves | Depends On |
|---|------|--------|------------| | --- | ----------------------------------------- | ----------------------------------------------------------------------------------------------- | ---------------------------------- |
| P0 | Beachhead Validation | Knowledge workers and consumers care about a personal briefing feed enough to use it repeatedly | M0 (embedding/runtime), partial M1 | | P0 | Beachhead Validation | Knowledge workers and consumers care about a personal briefing feed enough to use it repeatedly | M0 (embedding/runtime), partial M1 |
| P1 | Concierge Alpha | Daily "Today Brief" with explicit feedback controls creates Day-2 retention in a small cohort | M1 complete, partial M2 | | P1 | Concierge Alpha | Daily "Today Brief" with explicit feedback controls creates Day-2 retention in a small cohort | M1 complete, partial M2 |
| PG1 | Personalization Core Done (Blocking Gate) | Personalization loop is correct, immediate, and measurably better than baseline | P1 + M1/M2/M3 core slices | | PG1 | Personalization Core Done (Blocking Gate) | Personalization loop is correct, immediate, and measurably better than baseline | P1 + M1/M2/M3 core slices |
| P2 | Productized Beta | Self-serve onboarding + real-time adaptation + explanation UX works without manual curation | M2 complete, partial M3 | | P2 | Productized Beta | Self-serve onboarding + real-time adaptation + explanation UX works without manual curation | M2 complete, partial M3 |
| P3 | Public Launch | The product is reliable, useful, and trusted at real user volume | M3 + M5 core, M6 partial | | P3 | Public Launch | The product is reliable, useful, and trusted at real user volume | M3 + M5 core, M6 partial |
| P4 | Scale + Revenue Fit | Sustainable retention and monetization without quality collapse | M6 + M7 | | P4 | Scale + Revenue Fit | Sustainable retention and monetization without quality collapse | M6 + M7 |
--- ---
## Current Status ## Current Status
| Phase | Status | Tests | | Phase | Status | Tests |
|-------|--------|-------| | ------------------------------------------------ | ----------- | ---------------------------------------------------------------------- |
| **m0p1: Embeddable Runtime Skeleton** | COMPLETE | 329 passing (293 unit + 36 integration + 3 doc) | | **m0p1: Embeddable Runtime Skeleton** | COMPLETE | 329 passing (293 unit + 36 integration + 3 doc) |
| **m0p2: Tooling & Diagnostics** | COMPLETE | 349 passing (+7 metrics unit + 7 metrics integration + 9 tidalctl CLI) | | **m0p2: Tooling & Diagnostics** | COMPLETE | 349 passing (+7 metrics unit + 7 metrics integration + 9 tidalctl CLI) |
| m0p3: Samples & Docs | NOT STARTED | -- | | **m0p3: Samples & Docs** | COMPLETE | 11 doc tests (14 with features); 4 examples compile and run |
| **m1p1: Core Type System and Schema** | COMPLETE | 77 passing | | **m1p1: Core Type System and Schema** | COMPLETE | 77 passing |
| **m1p2: Write-Ahead Log** | COMPLETE | passing (unit + integration) | | **m1p2: Write-Ahead Log** | COMPLETE | passing (unit + integration) |
| **m1p3: Storage Engine Trait and fjall Backend** | COMPLETE | 140 passing (128 unit + 12 integration) | | **m1p3: Storage Engine Trait and fjall Backend** | COMPLETE | 140 passing (128 unit + 12 integration) |
| m1p4: Signal Ledger | NOT STARTED | -- | | **m1p4: Signal Ledger** | COMPLETE | 300 passing |
| m1p5: Entity CRUD and Signal Write API | NOT STARTED | -- | | **m1p5: Entity CRUD and Signal Write API** | COMPLETE | 305 passing (300 unit + 5 integration) |
| P0: Beachhead Validation | NOT STARTED | -- | | P0: Beachhead Validation | NOT STARTED | -- |
| P1: Concierge Alpha | NOT STARTED | -- | | P1: Concierge Alpha | NOT STARTED | -- |
| PG1: Personalization Core Done gate | NOT STARTED | -- | | PG1: Personalization Core Done gate | NOT STARTED | -- |
| P2: Productized Beta | NOT STARTED | -- | | P2: Productized Beta | NOT STARTED | -- |
| P3: Public Launch | NOT STARTED | -- | | P3: Public Launch | NOT STARTED | -- |
| P4: Scale + Revenue Fit | NOT STARTED | -- | | P4: Scale + Revenue Fit | NOT STARTED | -- |
**Current phase:** m0p2 (Tooling & Diagnostics) or m1p4 (Signal Ledger) — m0p1 unblocks m0p2; m1p2 and m1p3 unblock m1p4. **Current phase:** Milestone 1 COMPLETE. All phases (m1p1m1p5) are done. Next: P0 Beachhead Validation.
**Lessons learned:** **Lessons learned:**
- m1p3 keyspaces are organized per `EntityKind` ("items", "users", "creators"), not by data category. The `Tag` enum in key encoding provides the data-category namespace within each entity-kind keyspace. - m1p3 keyspaces are organized per `EntityKind` ("items", "users", "creators"), not by data category. The `Tag` enum in key encoding provides the data-category namespace within each entity-kind keyspace.
- The `LumenError` name is a legacy artifact from a predecessor project. Will be renamed when convenient but does not block progress. - The `LumenError` name is a legacy artifact from a predecessor project. Will be renamed when convenient but does not block progress.
- MSRV was bumped to 1.91 for fjall 3 compatibility. - MSRV was bumped to 1.91 for fjall 3 compatibility.
@ -82,6 +102,7 @@ Dedicated roadmap: `docs/planning/PRODUCT_ROADMAP.md`.
Validate that a personal briefing feed solves a painful daily job for users and drives repeat use. Validate that a personal briefing feed solves a painful daily job for users and drives repeat use.
**Acceptance Criteria** **Acceptance Criteria**
- [ ] Recruit 20-50 target users (knowledge workers + high-intent consumers). - [ ] Recruit 20-50 target users (knowledge workers + high-intent consumers).
- [ ] Run daily briefing prototype (can include manual source QA). - [ ] Run daily briefing prototype (can include manual source QA).
- [ ] At least one meaningful feedback action per session for the median user (`more`, `less`, `hide`, `mute`, `save`). - [ ] At least one meaningful feedback action per session for the median user (`more`, `less`, `hide`, `mute`, `save`).
@ -95,6 +116,7 @@ Validate that a personal briefing feed solves a painful daily job for users and
Deliver a reliable daily `Today Brief` experience with immediate visible adaptation after user feedback. Deliver a reliable daily `Today Brief` experience with immediate visible adaptation after user feedback.
**Acceptance Criteria** **Acceptance Criteria**
- [ ] App surface: ranked brief, reason labels, source links, save/feedback controls. - [ ] App surface: ranked brief, reason labels, source links, save/feedback controls.
- [ ] Feedback loop: next refresh reflects `less/hide/mute` actions immediately. - [ ] Feedback loop: next refresh reflects `less/hide/mute` actions immediately.
- [ ] Time-budget mode (`5/10/20` min) is available and used. - [ ] Time-budget mode (`5/10/20` min) is available and used.
@ -108,6 +130,7 @@ Deliver a reliable daily `Today Brief` experience with immediate visible adaptat
Turn the alpha into a self-serve product with stable onboarding, trust UX, and measurable quality. Turn the alpha into a self-serve product with stable onboarding, trust UX, and measurable quality.
**Acceptance Criteria** **Acceptance Criteria**
- [ ] Self-serve onboarding completed in under 3 minutes. - [ ] Self-serve onboarding completed in under 3 minutes.
- [ ] "Why this" explanations are present and understandable on every briefing card. - [ ] "Why this" explanations are present and understandable on every briefing card.
- [ ] Cohort layer available ("trending for people like you"). - [ ] Cohort layer available ("trending for people like you").
@ -122,6 +145,7 @@ Turn the alpha into a self-serve product with stable onboarding, trust UX, and m
Launch publicly with reliability, quality, and trust guardrails suitable for broad use. Launch publicly with reliability, quality, and trust guardrails suitable for broad use.
**Acceptance Criteria** **Acceptance Criteria**
- [ ] Reliability and latency SLOs defined and met for briefing generation. - [ ] Reliability and latency SLOs defined and met for briefing generation.
- [ ] Quality floor enforced (freshness, source quality, duplicate suppression). - [ ] Quality floor enforced (freshness, source quality, duplicate suppression).
- [ ] Notification cadence controls prevent spam. - [ ] Notification cadence controls prevent spam.
@ -134,6 +158,7 @@ Launch publicly with reliability, quality, and trust guardrails suitable for bro
Prove the product can grow and monetize while preserving user trust and briefing quality. Prove the product can grow and monetize while preserving user trust and briefing quality.
**Acceptance Criteria** **Acceptance Criteria**
- [ ] Monetization model validated (subscription, team plan, or equivalent). - [ ] Monetization model validated (subscription, team plan, or equivalent).
- [ ] Revenue metrics tracked alongside quality metrics (no quality-revenue trade-off regressions). - [ ] Revenue metrics tracked alongside quality metrics (no quality-revenue trade-off regressions).
- [ ] Retention and engagement remain stable as volume increases. - [ ] Retention and engagement remain stable as volume increases.
@ -146,6 +171,7 @@ Prove the product can grow and monetize while preserving user trust and briefing
Before product breadth expansion, the core personalization loop must be provably correct and immediately responsive. Before product breadth expansion, the core personalization loop must be provably correct and immediately responsive.
**Acceptance Criteria** **Acceptance Criteria**
- [ ] Hard negatives (`hide/mute/block`) never leak after write, restart, or replay. - [ ] Hard negatives (`hide/mute/block`) never leak after write, restart, or replay.
- [ ] Explicit feedback (`more/less/skip/save`) changes next-refresh ranking within target latency. - [ ] Explicit feedback (`more/less/skip/save`) changes next-refresh ranking within target latency.
- [ ] User personalization state rebuilds deterministically from checkpoint + WAL replay. - [ ] User personalization state rebuilds deterministically from checkpoint + WAL replay.
@ -185,6 +211,7 @@ Before we prove any ranking math, developers must be able to embed tidalDB insid
- Quickstart example + doctest run under CI (`cargo test --doc --examples`). - Quickstart example + doctest run under CI (`cargo test --doc --examples`).
- Axum/Actix embedding examples include graceful shutdown + metrics wiring. - Axum/Actix embedding examples include graceful shutdown + metrics wiring.
- CONTRIBUTING updated with “run samples” checklist. - CONTRIBUTING updated with “run samples” checklist.
### UAT Scenario ### UAT Scenario
``` ```
@ -258,6 +285,7 @@ Then:
**Delivers:** The foundational type system -- entity IDs, signal type definitions, decay rate declarations, window specifications, and the error types that every subsequent module depends on. The schema module that validates and stores signal/entity definitions. **Delivers:** The foundational type system -- entity IDs, signal type definitions, decay rate declarations, window specifications, and the error types that every subsequent module depends on. The schema module that validates and stores signal/entity definitions.
**Acceptance Criteria:** **Acceptance Criteria:**
- [x] `EntityId` is a u64 newtype with `Display`, `Hash`, `Eq`, `Ord`, `to_be_bytes()` (big-endian, preserves numeric ordering) - [x] `EntityId` is a u64 newtype with `Display`, `Hash`, `Eq`, `Ord`, `to_be_bytes()` (big-endian, preserves numeric ordering)
- [x] `EntityKind` enum: `Item`, `User`, `Creator` - [x] `EntityKind` enum: `Item`, `User`, `Creator`
- [x] `SignalTypeDef` captures: name, target `EntityKind`, `DecayModel` (exponential with pre-computed lambda / linear / permanent), `WindowSet`, velocity enabled flag - [x] `SignalTypeDef` captures: name, target `EntityKind`, `DecayModel` (exponential with pre-computed lambda / linear / permanent), `WindowSet`, velocity enabled flag
@ -279,6 +307,7 @@ Then:
**Delivers:** A durable, append-only log for signal events. Every signal write is fsync'd before acknowledgment. Group commit amortizes fsync cost. Content-addressed events via BLAKE3 for deduplication. The WAL is the source of truth -- all other state is derived. **Delivers:** A durable, append-only log for signal events. Every signal write is fsync'd before acknowledgment. Group commit amortizes fsync cost. Content-addressed events via BLAKE3 for deduplication. The WAL is the source of truth -- all other state is derived.
**Acceptance Criteria:** **Acceptance Criteria:**
- [x] WAL entries are length-prefixed with BLAKE3 checksums - [x] WAL entries are length-prefixed with BLAKE3 checksums
- [x] Group commit batches up to 100 events or 10ms, whichever comes first - [x] Group commit batches up to 100 events or 10ms, whichever comes first
- [x] Duplicate events (same BLAKE3 hash) are silently deduplicated - [x] Duplicate events (same BLAKE3 hash) are silently deduplicated
@ -296,6 +325,7 @@ Then:
**Delivers:** The `StorageEngine` trait abstraction and two implementations: `FjallBackend` (fjall 3 LSM-tree) for production and `InMemoryBackend` (BTreeMap + RwLock) for deterministic testing. Key encoding follows the subject-prefix pattern with a `Tag` discriminant. `FjallStorage` coordinates three keyspaces per entity kind. `FjallAtomicBatch` provides cross-keyspace atomic writes. **Delivers:** The `StorageEngine` trait abstraction and two implementations: `FjallBackend` (fjall 3 LSM-tree) for production and `InMemoryBackend` (BTreeMap + RwLock) for deterministic testing. Key encoding follows the subject-prefix pattern with a `Tag` discriminant. `FjallStorage` coordinates three keyspaces per entity kind. `FjallAtomicBatch` provides cross-keyspace atomic writes.
**Acceptance Criteria:** **Acceptance Criteria:**
- [x] `StorageEngine` trait with `get`, `put`, `delete`, `scan_prefix`, `write_batch`, `flush` operations - [x] `StorageEngine` trait with `get`, `put`, `delete`, `scan_prefix`, `write_batch`, `flush` operations
- [x] Key encoding: `[entity_id: 8 bytes BE][0x00][Tag: 1 byte][suffix...]` with `Tag` enum (`Evt`=0x01, `Sig`=0x02, `Meta`=0x03, `Rel`=0x04, `Mv`=0x05, `Idx`=0x06) - [x] Key encoding: `[entity_id: 8 bytes BE][0x00][Tag: 1 byte][suffix...]` with `Tag` enum (`Evt`=0x01, `Sig`=0x02, `Meta`=0x03, `Rel`=0x04, `Mv`=0x05, `Idx`=0x06)
- [x] `encode_key`, `parse_key` roundtrip correctly for all tag variants and arbitrary suffixes - [x] `encode_key`, `parse_key` roundtrip correctly for all tag variants and arbitrary suffixes
@ -323,16 +353,17 @@ Then:
**Delivers:** The in-memory per-entity signal state with running decay scores (O(1) update, O(1) read) and bucketed windowed counters. Signal writes update the running scores atomically. Signal reads return decay-correct values without scanning raw events. State is checkpointed to storage for crash recovery. **Delivers:** The in-memory per-entity signal state with running decay scores (O(1) update, O(1) read) and bucketed windowed counters. Signal writes update the running scores atomically. Signal reads return decay-correct values without scanning raw events. State is checkpointed to storage for crash recovery.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `EntitySignalState` is `#[repr(C, align(64))]` -- one L1 cache line per hot-path struct
- [ ] Running decay formula: `S(t) = S(t_prev) * exp(-lambda * dt) + weight` -- mathematically exact, verified against analytical brute-force computation to 6 decimal places across 10,000 random event sequences (property test) - [x] `EntitySignalState` is `#[repr(C, align(64))]` -- one L1 cache line per hot-path struct
- [ ] Out-of-order events handled correctly: when `t_event < last_update`, weight is pre-decayed: `score += weight * exp(-lambda * (last_update - t_event))` - [x] Running decay formula: `S(t) = S(t_prev) * exp(-lambda * dt) + weight` -- mathematically exact, verified against analytical brute-force computation to 6 decimal places across 10,000 random event sequences (property test)
- [ ] Windowed counts use per-minute bucketed counters (BucketedCounter) supporting 1h/24h/7d windows - [x] Out-of-order events handled correctly: when `t_event < last_update`, weight is pre-decayed: `score += weight * exp(-lambda * (last_update - t_event))`
- [ ] Velocity = windowed_count / window_duration_seconds - [x] Windowed counts use per-minute bucketed counters (BucketedCounter) supporting 1h/24h/7d windows
- [ ] Signal write latency < 100 microseconds including WAL write (amortized), benchmarked with criterion - [x] Velocity = windowed_count / window_duration_seconds
- [ ] Decay score read latency < 100ns per entity per lambda, benchmarked with criterion - [x] Signal write latency < 100 microseconds including WAL write (amortized), benchmarked with criterion
- [ ] 200-entity scoring pass < 5 microseconds, benchmarked with criterion - [x] Decay score read latency < 100ns per entity per lambda, benchmarked with criterion
- [ ] State checkpointed to storage every 30 seconds; crash recovery reconstructs from checkpoint + WAL replay - [x] 200-entity scoring pass < 5 microseconds, benchmarked with criterion
- [ ] DashMap or sharded map for concurrent entity state access; signal counters use AtomicU64 with Relaxed ordering - [x] State checkpointed to storage every 30 seconds; crash recovery reconstructs from checkpoint + WAL replay
- [x] DashMap or sharded map for concurrent entity state access; signal counters use AtomicU64 with Relaxed ordering
**Depends On:** Phase 2, Phase 3 **Depends On:** Phase 2, Phase 3
**Complexity:** XL **Complexity:** XL
@ -343,15 +374,16 @@ Then:
**Delivers:** The public API surface for Milestone 1. `TidalDB::open()`, `TidalDB::shutdown()`, entity write/read, signal write/read. This is the interface the UAT scenario tests against. Includes the `signal()` method that atomically writes to WAL, updates in-memory state, and returns immediately. **Delivers:** The public API surface for Milestone 1. `TidalDB::open()`, `TidalDB::shutdown()`, entity write/read, signal write/read. This is the interface the UAT scenario tests against. Includes the `signal()` method that atomically writes to WAL, updates in-memory state, and returns immediately.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `TidalDB::open(config)` opens storage, restores in-memory state from checkpoint + WAL replay, returns `Result<TidalDB>`
- [ ] `TidalDB::shutdown()` checkpoints all in-memory state, syncs WAL, closes storage cleanly - [x] `TidalDB::open(config)` opens storage, restores in-memory state from checkpoint + WAL replay, returns `Result<TidalDB>`
- [ ] `db.write_item(id, metadata)` stores entity metadata - [x] `TidalDB::shutdown()` checkpoints all in-memory state, syncs WAL, closes storage cleanly
- [ ] `db.signal(signal_type, entity_id, weight, timestamp)` atomically: appends to WAL, updates decay scores, updates windowed counters - [x] `db.write_item(id, metadata)` stores entity metadata
- [ ] `db.read_decay_score(entity_id, signal_type, lambda_index)` returns current decayed score - [x] `db.signal(signal_type, entity_id, weight, timestamp)` atomically: appends to WAL, updates decay scores, updates windowed counters
- [ ] `db.read_windowed_count(entity_id, signal_type, window)` returns count within window - [x] `db.read_decay_score(entity_id, signal_type, lambda_index)` returns current decayed score
- [ ] `db.read_velocity(entity_id, signal_type, window)` returns count / window_duration - [x] `db.read_windowed_count(entity_id, signal_type, window)` returns count within window
- [ ] Full UAT scenario passes as an integration test - [x] `db.read_velocity(entity_id, signal_type, window)` returns count / window_duration
- [ ] `TidalDB` is `Send + Sync` -- safe to share across threads behind `Arc` - [x] Full UAT scenario passes as an integration test
- [x] `TidalDB` is `Send + Sync` -- safe to share across threads behind `Arc`
**Depends On:** Phase 4 **Depends On:** Phase 4
**Complexity:** M **Complexity:** M
@ -484,6 +516,7 @@ Then:
**Delivers:** USearch wrapped behind a trait, with mmap persistence, f16 quantization, and the adaptive filtered search planner. Items can be inserted with embeddings and retrieved by ANN similarity. **Delivers:** USearch wrapped behind a trait, with mmap persistence, f16 quantization, and the adaptive filtered search planner. Items can be inserted with embeddings and retrieved by ANN similarity.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `VectorIndex` trait with `insert(key, vector)`, `remove(key)`, `search(query, k)`, `filtered_search(query, k, predicate)`, `save()`, `load()`, `view()` - [ ] `VectorIndex` trait with `insert(key, vector)`, `remove(key)`, `search(query, k)`, `filtered_search(query, k, predicate)`, `save()`, `load()`, `view()`
- [ ] USearch backend implements the trait with f16 quantization (default), mmap persistence - [ ] USearch backend implements the trait with f16 quantization (default), mmap persistence
- [ ] Vectors normalized at insertion time (L2 distance equivalent to cosine for unit vectors) - [ ] Vectors normalized at insertion time (L2 distance equivalent to cosine for unit vectors)
@ -502,6 +535,7 @@ Then:
**Delivers:** Roaring bitmap indexes for categorical metadata, B-tree indexes for range attributes, and a composable filter engine that evaluates arbitrary filter combinations. The filter engine produces either a bitmap (for pre-filtering ANN) or a predicate closure (for in-graph filtering). **Delivers:** Roaring bitmap indexes for categorical metadata, B-tree indexes for range attributes, and a composable filter engine that evaluates arbitrary filter combinations. The filter engine produces either a bitmap (for pre-filtering ANN) or a predicate closure (for in-graph filtering).
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] Roaring bitmap per high-cardinality metadata value: category, format, creator_id - [ ] Roaring bitmap per high-cardinality metadata value: category, format, creator_id
- [ ] B-tree index for range attributes: created_at, duration - [ ] B-tree index for range attributes: created_at, duration
- [ ] Filter expressions are composable: AND across dimensions, OR within a dimension - [ ] Filter expressions are composable: AND across dimensions, OR within a dimension
@ -520,15 +554,16 @@ Then:
**Delivers:** Named ranking profiles declared as data (not compiled code), parsed, validated, stored, and executed by the database. Profiles reference signal scores, windowed aggregates, velocity, metadata fields, and define quality gates. Profiles are versioned and swappable at query time. **Delivers:** Named ranking profiles declared as data (not compiled code), parsed, validated, stored, and executed by the database. Profiles reference signal scores, windowed aggregates, velocity, metadata fields, and define quality gates. Profiles are versioned and swappable at query time.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] Profile declaration syntax supports: primary signal, secondary signals with weights, BOOST, GATE (minimum threshold), PENALIZE, EXCLUDE - [ ] Profile declaration syntax supports: primary signal, secondary signals with weights, BOOST, GATE (minimum threshold), PENALIZE, EXCLUDE
- [ ] Profiles stored in schema, versioned, retrievable by name - [ ] Profiles stored in schema, versioned, retrievable by name
- [ ] Profile execution: given a candidate set and a profile, produce a scored and sorted result list - [ ] Profile execution: given a candidate set and a profile, produce a scored and sorted result list
- [ ] Built-in profiles implemented: `trending`, `hot`, `new`, `top_week`, `top_month`, `top_all_time`, `hidden_gems`, `controversial`, `most_viewed`, `most_liked`, `shuffle` - [ ] Built-in profiles implemented: `trending`, `hot`, `new`, `top_week`, `top_month`, `top_all_time`, `hidden_gems`, `controversial`, `most_viewed`, `most_liked`, `shuffle`
- [ ] `hot` formula: `score / (age_hours + 2)^gravity` with configurable gravity - [ ] `hot` formula: `log10(max(|positive - negative|, 1)) / (age_hours + 2)^gravity` with configurable gravity
- [ ] `controversial` formula: `max(positive_signals * negative_signals)` - [ ] `controversial` formula: `(positive * negative) / (positive + negative)^2`
- [ ] `hidden_gems` formula: `quality_score * (1 / log(1 + view_count))` - [ ] `hidden_gems` formula: `quality_score * (1 / log10(view_count + 10))` -- the `+10` prevents division by zero for items with zero views
- [ ] Profile change does not require recompile -- profiles are runtime data - [ ] Profile change does not require recompile -- profiles are runtime data
- [ ] 200-candidate scoring pass with a profile < 10 microseconds (benchmarked) - [ ] 200-candidate scoring pass with decay-only profile < 10 microseconds, with velocity-based profile (trending) < 100 microseconds (both Criterion benchmarked)
**Depends On:** m1p4 (signal ledger) **Depends On:** m1p4 (signal ledger)
**Complexity:** L **Complexity:** L
@ -539,6 +574,7 @@ Then:
**Delivers:** Post-scoring diversity pass that reorders results to satisfy constraints (max_per_creator, format_mix) without reducing result count. Implemented as a greedy selection pass over the scored candidate list. **Delivers:** Post-scoring diversity pass that reorders results to satisfy constraints (max_per_creator, format_mix) without reducing result count. Implemented as a greedy selection pass over the scored candidate list.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `max_per_creator:N` enforced: no more than N items from any single creator in the result set - [ ] `max_per_creator:N` enforced: no more than N items from any single creator in the result set
- [ ] `format_mix:true` enforced: no more than 60% of results from any single format - [ ] `format_mix:true` enforced: no more than 60% of results from any single format
- [ ] Diversity pass does not reduce result count -- it selects the next-best candidate that satisfies constraints - [ ] Diversity pass does not reduce result count -- it selects the next-best candidate that satisfies constraints
@ -555,6 +591,7 @@ Then:
**Delivers:** The query parser for the RETRIEVE operation and the executor that orchestrates candidate retrieval, filtering, scoring, diversity, and result assembly. This is the "one query" entry point. For M2, the RETRIEVE query does not require `FOR USER` (no personalization yet) -- it operates on the full item corpus with filters and profiles. **Delivers:** The query parser for the RETRIEVE operation and the executor that orchestrates candidate retrieval, filtering, scoring, diversity, and result assembly. This is the "one query" entry point. For M2, the RETRIEVE query does not require `FOR USER` (no personalization yet) -- it operates on the full item corpus with filters and profiles.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] Parser handles: `RETRIEVE items`, `USING PROFILE <name>`, `FILTER <conditions>`, `DIVERSITY <constraints>`, `LIMIT <n>`, `EXCLUDE [ids]` - [ ] Parser handles: `RETRIEVE items`, `USING PROFILE <name>`, `FILTER <conditions>`, `DIVERSITY <constraints>`, `LIMIT <n>`, `EXCLUDE [ids]`
- [ ] Parser produces a typed AST; parse errors include position and helpful message - [ ] Parser produces a typed AST; parse errors include position and helpful message
- [ ] Executor pipeline: candidate retrieval (ANN or full scan based on profile) -> filter -> score -> diversity -> limit -> return - [ ] Executor pipeline: candidate retrieval (ANN or full scan based on profile) -> filter -> score -> diversity -> limit -> return
@ -687,6 +724,7 @@ Then:
**Delivers:** User and creator entity types with preference vectors and a relationship graph. Relationship edges are weighted, directional, and queryable. Follows, blocks, interaction weights are first-class. **Delivers:** User and creator entity types with preference vectors and a relationship graph. Relationship edges are weighted, directional, and queryable. Follows, blocks, interaction weights are first-class.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] User entities store: user_id, preference embedding (mutable, updated on signals), metadata - [ ] User entities store: user_id, preference embedding (mutable, updated on signals), metadata
- [ ] Creator entities store: creator_id, catalog embedding (aggregated from items), metadata - [ ] Creator entities store: creator_id, catalog embedding (aggregated from items), metadata
- [ ] Relationship edges: `(from_entity, to_entity, type, weight, timestamp)` with types: follows, blocks, interaction_weight, hide, mute - [ ] Relationship edges: `(from_entity, to_entity, type, weight, timestamp)` with types: follows, blocks, interaction_weight, hide, mute
@ -703,11 +741,12 @@ Then:
**Delivers:** When a signal event is written (like, skip, hide, completion), the database atomically updates the item's signal ledger, the user-to-item relationship, the user-to-creator interaction weight, and the user's preference vector. One write, multiple state updates, no application logic. **Delivers:** When a signal event is written (like, skip, hide, completion), the database atomically updates the item's signal ledger, the user-to-item relationship, the user-to-creator interaction weight, and the user's preference vector. One write, multiple state updates, no application logic.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `db.signal("like", item_id, user_id, weight, timestamp)` atomically: - [ ] `db.signal("like", item_id, user_id, weight, timestamp)` atomically:
1. Appends event to WAL 1. Appends event to WAL
2. Updates item signal ledger (decay scores, windowed counts) 2. Updates item signal ledger (decay scores, windowed counts)
3. Increments user->creator interaction_weight 3. Increments user->creator interaction_weight
4. Shifts user preference vector toward item embedding (configurable learning rate) 4. Shifts user preference vector toward item embedding (configurable learning rate)
- [ ] `db.signal("skip", ...)` atomically: updates item skip count, decays user->creator weight, shifts preference vector away from item embedding - [ ] `db.signal("skip", ...)` atomically: updates item skip count, decays user->creator weight, shifts preference vector away from item embedding
- [ ] `db.signal("hide", ...)` sets permanent hard-negative on user->item relationship; item excluded from all future queries for this user - [ ] `db.signal("hide", ...)` sets permanent hard-negative on user->item relationship; item excluded from all future queries for this user
- [ ] `db.signal("block", user, creator)` sets permanent block; all items by creator excluded from all queries for this user - [ ] `db.signal("block", user, creator)` sets permanent block; all items by creator excluded from all queries for this user
@ -723,10 +762,11 @@ Then:
**Delivers:** Ranking profiles that incorporate user context: preference match (embedding similarity between user and item), user-creator interaction weight, social proof (engagement from user's follows), and user-specific exclusions. The `for_you`, `following`, `related`, and `notification` profiles. **Delivers:** Ranking profiles that incorporate user context: preference match (embedding similarity between user and item), user-creator interaction weight, social proof (engagement from user's follows), and user-specific exclusions. The `for_you`, `following`, `related`, and `notification` profiles.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `for_you` profile: ANN retrieval using user preference vector, scoring = preference_match * engagement_velocity * recency_decay * social_proof, gates on completion_rate, penalizes skip count, 10% exploration budget
- [ ] `for_you` profile: ANN retrieval using user preference vector, scoring = preference*match * engagement*velocity * recency_decay \* social_proof, gates on completion_rate, penalizes skip count, 10% exploration budget
- [ ] `following` profile: candidate set restricted to followed creators' items, sorted by created_at DESC, tiebreaker on completion_rate - [ ] `following` profile: candidate set restricted to followed creators' items, sorted by created_at DESC, tiebreaker on completion_rate
- [ ] `related` profile: ANN retrieval using source item's embedding, collaborative filtering boost (items co-engaged with source), personalization re-rank by user preference - [ ] `related` profile: ANN retrieval using source item's embedding, collaborative filtering boost (items co-engaged with source), personalization re-rank by user preference
- [ ] `notification` profile: candidates from followed creators' recent items, scored by relationship_strength * item_quality - [ ] `notification` profile: candidates from followed creators' recent items, scored by relationship_strength \* item_quality
- [ ] Exploration budget: 10% of for_you results are from creators the user does not follow, to prevent filter bubbles - [ ] Exploration budget: 10% of for_you results are from creators the user does not follow, to prevent filter bubbles
- [ ] Cold start: new users with no signal history get results ranked by population-level signals (trending, quality) - [ ] Cold start: new users with no signal history get results ranked by population-level signals (trending, quality)
- [ ] Cold start: new items with no signals get an exploration window (appear in a small % of for_you feeds) - [ ] Cold start: new items with no signals get an exploration window (appear in a small % of for_you feeds)
@ -740,6 +780,7 @@ Then:
**Delivers:** Filters that depend on user state: unseen, in_progress, saved, liked, in_collection. These require per-user bitmaps or sets maintained by the signal system. **Delivers:** Filters that depend on user state: unseen, in_progress, saved, liked, in_collection. These require per-user bitmaps or sets maintained by the signal system.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `unseen` filter: excludes items the user has viewed (maintained as roaring bitmap per user, updated on view signal) - [ ] `unseen` filter: excludes items the user has viewed (maintained as roaring bitmap per user, updated on view signal)
- [ ] `unblocked` filter: excludes items from blocked creators and hidden items - [ ] `unblocked` filter: excludes items from blocked creators and hidden items
- [ ] `saved` filter: returns only items the user has saved - [ ] `saved` filter: returns only items the user has saved
@ -897,6 +938,7 @@ Then:
**Delivers:** Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via outbox. BM25 scoring exposed via custom Collector and Weight/Scorer seek pattern. **Delivers:** Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via outbox. BM25 scoring exposed via custom Collector and Weight/Scorer seek pattern.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] Tantivy index created from schema text field definitions (title, description, tags) - [ ] Tantivy index created from schema text field definitions (title, description, tags)
- [ ] Background indexer reads entity store outbox and feeds Tantivy writer - [ ] Background indexer reads entity store outbox and feeds Tantivy writer
- [ ] Tantivy commit stores last-processed sequence number in payload for crash recovery - [ ] Tantivy commit stores last-processed sequence number in payload for crash recovery
@ -917,6 +959,7 @@ Then:
**Delivers:** Reciprocal Rank Fusion combining BM25 ranked lists with ANN ranked lists into a single scored result set. The starting point is RRF with k=60; the architecture supports upgrading to tuned linear combination when relevance labels exist. **Delivers:** Reciprocal Rank Fusion combining BM25 ranked lists with ANN ranked lists into a single scored result set. The starting point is RRF with k=60; the architecture supports upgrading to tuned linear combination when relevance labels exist.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] `RRF(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))` implemented - [ ] `RRF(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))` implemented
- [ ] Documents appearing in only one list contribute only their single-list term - [ ] Documents appearing in only one list contribute only their single-list term
- [ ] RRF results are re-rankable by personalization (user preference overlay) - [ ] RRF results are re-rankable by personalization (user preference overlay)
@ -934,6 +977,7 @@ Then:
**Delivers:** The SEARCH query parser and executor that orchestrates text retrieval, semantic retrieval, fusion, personalization, filtering, diversity, and result assembly. **Delivers:** The SEARCH query parser and executor that orchestrates text retrieval, semantic retrieval, fusion, personalization, filtering, diversity, and result assembly.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] Parser handles: `SEARCH items/creators`, `QUERY "text"`, `VECTOR [embedding]`, `FOR USER`, `USING PROFILE`, `FILTER`, `DIVERSITY`, `LIMIT` - [ ] Parser handles: `SEARCH items/creators`, `QUERY "text"`, `VECTOR [embedding]`, `FOR USER`, `USING PROFILE`, `FILTER`, `DIVERSITY`, `LIMIT`
- [ ] Query text parsing: exact phrase (`"...""`), boolean operators (AND/OR/NOT/-), field-scoped (`title:...`), wildcard (`term*`) - [ ] Query text parsing: exact phrase (`"...""`), boolean operators (AND/OR/NOT/-), field-scoped (`title:...`), wildcard (`term*`)
- [ ] Executor pipeline: text retrieval -> ANN retrieval -> fusion -> personalization -> filter -> diversity -> return - [ ] Executor pipeline: text retrieval -> ANN retrieval -> fusion -> personalization -> filter -> diversity -> return
@ -952,6 +996,7 @@ Then:
**Delivers:** Search over creator entities by name, topic, and attributes. "Creators like X" via creator embedding similarity. Enables UC-10. **Delivers:** Search over creator entities by name, topic, and attributes. "Creators like X" via creator embedding similarity. Enables UC-10.
**Acceptance Criteria:** **Acceptance Criteria:**
- [ ] Creator entities indexed in Tantivy (name, handle, bio, topics) - [ ] Creator entities indexed in Tantivy (name, handle, bio, topics)
- [ ] Creator embeddings searchable via ANN (aggregated from catalog) - [ ] Creator embeddings searchable via ANN (aggregated from catalog)
- [ ] `SEARCH creators QUERY "jazz" LIMIT 10` returns creators matching topic - [ ] `SEARCH creators QUERY "jazz" LIMIT 10` returns creators matching topic
@ -1158,32 +1203,120 @@ tidalDB operates correctly at 1M items under sustained concurrent read/write loa
--- ---
## Milestone 8: Distributed Fabric -- "Agent memory everywhere"
### Milestone Thesis
The exact same signal semantics, session policies, and WAL format power a multi-tenant, multi-region deployment. Instances shard deterministically by `EntityKind` + `EntityId`, ship WAL segments to peers, reconcile deterministically, and expose an eventually consistent API that still honors agent memory guarantees (no hidden items leaking, no double-counted decay). Hosted tidalDB can now back global agent workloads without rewriting application code.
### UAT Scenario
```
Given:
- Three regions (us-east, eu-west, ap-south) with 5 shards each
- Global write throughput: 25K signal events/sec, evenly distributed
- Fat-client agents pinned to local region but free to roam
- 1-hour network partition between eu-west and ap-south during sustained load
When:
1. Write signals for a user in us-east, then read in eu-west after < 2s
2. Crash an entire shard primary; observe automatic promotion and replay
3. Execute global query (`RETRIEVE ... COHORT locale:EU`) while ap-south is partitioned
4. Heal the partition; verify deterministic reconciliation (no duplicate counts, hides remain hidden)
5. Move a tenant (agent workspace) to a new region by changing routing config only
Then:
- Cross-region replication lag < 2s p99
- No signal loss or duplication after failover/partition
- Hard negatives (hide/mute/block) never leak, even while eventual state converges
- Per-tenant resource isolation enforced (quotas, WAL namespaces)
- Control plane surfaces reconciliation lag, shard health, and tenant placement
```
### Phases
#### Phase 1: Partitioned Keyspaces and WAL Shipping
**Delivers:** Deterministic shard IDs derived from subject-prefix keys, WAL segment shipping with per-segment checksums, follower apply loops using the same checkpoint format as single-node. Cross-shard atomicity defined at the "entity group" boundary (Item, User, Creator each map to a shard). Lag metrics (`replication_seconds_behind`) exported.
**Acceptance Criteria:**
- [ ] `ShardId = hash(entity_id) mod N` (configurable per `EntityKind`) stored alongside keys; shard map hot-swappable via epoch config.
- [ ] WAL segments have globally unique IDs (`region_id:shard_id:seqno`); followers detect gaps and request retransmit.
- [ ] Followers reapply segments idempotently using the same `EntitySignalState` checkpoint format from M1.
- [ ] Lag SLO: < 2s p99 at 25K writes/sec across 5 shards.
- [ ] CLI: `tidalctl shard status` shows leader, lag, checkpoint age.
**Depends On:** M7 (hardened WAL/Signal ledger)
**Complexity:** XL
**Research Reference:** `docs/research/tidaldb_wal.md`, `docs/research/tidaldb_signal_ledger.md`
#### Phase 2: Conflict Resolution and Session Semantics
**Delivers:** Deterministic reconciliation for eventually-consistent writes: CRDT-style counters for windowed aggregates, last-writer-wins timestamps for session state, and per-session sequence numbers so agents can reason about acknowledgements. Adds write-idempotency keys to the WAL and exposes a reconciliation audit log.
**Acceptance Criteria:**
- [ ] Windowed counters replicated as bounded PN-counters (positive/negative components) with tombstones for expired buckets.
- [ ] Decay scores replay identically because WAL order is preserved per shard; cross-shard dependencies (user->creator) carry causal metadata.
- [ ] Session updates carry `(session_id, seqno)`; duplicates dropped, gaps surfaced via API.
- [ ] `reconcile --since <ts>` tool emits merged vs diverged entries for auditing.
- [ ] Hides/blocks modeled as LWW registers with vector-clock tie-breakers (region priority list).
**Depends On:** Phase 1
**Complexity:** XL
**Research Reference:** `thoughts.md` Part V.5-6 (quarantine-first, group commit), `docs/research/tidaldb_signal_ledger.md`
#### Phase 3: Control Plane, Multi-Tenancy, and Routing
**Delivers:** Tenant-aware namespaces (per-tenant WAL directories and key prefixes), routing layer that maps tenants + entity IDs to shard endpoints, and policy templates (data residency, read-after-write budgets). Adds hosted-ready observability (lag dashboards, per-tenant quotas) and blue/green deploy tooling for the fabric.
**Acceptance Criteria:**
- [ ] Tenant config: `{tenant_id, shard_set, residency=[regions], rpo, rto}` stored in control-plane keyspace.
- [ ] Router SDK chooses nearest healthy region that satisfies residency and read-after-write target; falls back with documented staleness budget.
- [ ] Throttling per tenant (signals/sec, query concurrency) with circuit-breaker events surfaced via metrics + CLI.
- [ ] Rolling upgrade playbook: add shard, rebalance, observe zero dropped writes.
- [ ] Hosted docs: describe how embeddable apps graduate to hosted fabric without rewrites (same query + signal APIs).
**Depends On:** Phase 2
**Complexity:** L
### Done When
tidalDB instances can be deployed as a hosted, multi-region fabric with deterministic replication and reconciliation. Agents anywhere in the world can write signals and rely on hides/mutes/policies holding globally. Operators get tooling for shard health, tenant placement, rolling upgrades, and lag visibility. Embeddable users flip a config switch to opt into the fabric; query and signal APIs remain unchanged.
---
## Use Case Coverage Progression ## Use Case Coverage Progression
| UC | Description | M1 | M2 | M3 | M4 | M5 | M6 | M7 | | UC | Description | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
|----|-------------|----|----|----|----|----|----|----| | ----- | ----------------- | ------- | -------- | -------- | ------- | -------- | -------- | ---- |
| UC-01 | For You Feed | - | - | **Full** | Full | Full | Full | Full | | UC-01 | For You Feed | - | - | **Full** | Full | Full | Full | Full |
| UC-02 | Search | - | - | - | - | **Core** | **Full** | Full | | UC-02 | Search | - | - | - | - | **Core** | **Full** | Full |
| UC-03 | Trending/Rising | Signals | **Full** | Full | Full | Full | Full | Full | | UC-03 | Trending/Rising | Signals | **Full** | Full | Full | Full | Full | Full |
| UC-04 | Following Feed | - | Partial | **Full** | Full | Full | Full | Full | | UC-04 | Following Feed | - | Partial | **Full** | Full | Full | Full | Full |
| UC-05 | Related/Up Next | - | - | **Core** | Core | Core | **Full** | Full | | UC-05 | Related/Up Next | - | - | **Core** | Core | Core | **Full** | Full |
| UC-06 | Browse/Category | Signals | **Core** | Core | Core | Core | **Full** | Full | | UC-06 | Browse/Category | Signals | **Core** | Core | Core | Core | **Full** | Full |
| UC-07 | Notifications | - | - | **Core** | Core | Core | **Full** | Full | | UC-07 | Notifications | - | - | **Core** | Core | Core | **Full** | Full |
| UC-08 | Creator Profile | - | **Core** | Core | Core | Core | **Full** | Full | | UC-08 | Creator Profile | - | **Core** | Core | Core | Core | **Full** | Full |
| UC-09 | User Library | - | - | Partial | Partial | Partial | **Full** | Full | | UC-09 | User Library | - | - | Partial | Partial | Partial | **Full** | Full |
| UC-10 | People Search | - | - | - | - | **Core** | **Full** | Full | | UC-10 | People Search | - | - | - | - | **Core** | **Full** | Full |
| UC-11 | Visual/Semantic | - | - | - | - | Partial | **Full** | Full | | UC-11 | Visual/Semantic | - | - | - | - | Partial | **Full** | Full |
| UC-12 | Live Content | - | - | - | - | - | **Full** | Full | | UC-12 | Live Content | - | - | - | - | - | **Full** | Full |
| UC-13 | Hidden Gems | - | **Full** | Full | Full | Full | Full | Full | | UC-13 | Hidden Gems | - | **Full** | Full | Full | Full | Full | Full |
| UC-14 | Controversial/Hot | Signals | **Full** | Full | Full | Full | Full | Full | | UC-14 | Controversial/Hot | Signals | **Full** | Full | Full | Full | Full | Full |
Legend: Legend:
- `-` = Not addressed - `-` = Not addressed
- `Signals` = Signal primitives exist but no query surface - `Signals` = Signal primitives exist but no query surface
- `Partial` = Some functionality, not all modes - `Partial` = Some functionality, not all modes
- `Core` = Primary query path works, some modes/filters missing - `Core` = Primary query path works, some modes/filters missing
- **Full** = All modes, filters, and feedback loops per USE_CASES.md specification - **Full** = All modes, filters, and feedback loops per USE_CASES.md specification
M8 focuses on deployment topology and therefore leaves UC coverage unchanged; it ensures the existing feature surface works globally and under multi-tenant load.
--- ---
## Dependency DAG ## Dependency DAG
@ -1232,6 +1365,7 @@ m1p1 (Types/Schema) ✓
``` ```
**Parallelization opportunities:** **Parallelization opportunities:**
- m1p2 (WAL) and m1p3 (Storage) are parallel after m1p1 (both now complete: m1p3 was completed first, m1p2 followed) - m1p2 (WAL) and m1p3 (Storage) are parallel after m1p1 (both now complete: m1p3 was completed first, m1p2 followed)
- m2p1 (USearch) and m2p2 (Filters) can be built in parallel after m1p3 - m2p1 (USearch) and m2p2 (Filters) can be built in parallel after m1p3
- m3p1 (Entities) and m4p1 (Tantivy) can start in parallel with later M2 phases - m3p1 (Entities) and m4p1 (Tantivy) can start in parallel with later M2 phases
@ -1244,19 +1378,19 @@ m1p1 (Types/Schema) ✓
These decisions are made. They are not revisited unless benchmarks prove them wrong. These decisions are made. They are not revisited unless benchmarks prove them wrong.
| Decision | Chosen | Alternative | Rationale | | Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------| | -------------------- | ---------------------------------------------- | ------------------------ | ------------------------------------------------------------------------------------- |
| Storage engine | fjall (pure Rust) | RocksDB | Pure Rust, `#![forbid(unsafe_code)]`, fast compile, trait-abstracted for swap | | Storage engine | fjall (pure Rust) | RocksDB | Pure Rust, `#![forbid(unsafe_code)]`, fast compile, trait-abstracted for swap |
| Vector index | USearch (C++ FFI) | hnsw_rs | 10-100x QPS, predicate callbacks, mmap, f16 quantization | | Vector index | USearch (C++ FFI) | hnsw_rs | 10-100x QPS, predicate callbacks, mmap, f16 quantization |
| Text search | Tantivy (embedded) | Custom BM25 | 40K lines of battle-tested code; Collector/Scorer API provides exact hooks needed | | Text search | Tantivy (embedded) | Custom BM25 | 40K lines of battle-tested code; Collector/Scorer API provides exact hooks needed |
| Decay formula | Running S(t)=S(prev)*exp(-lambda*dt)+w | Raw event scan | O(1) vs O(N), proven exact, 20-60x faster at 50+ events/entity | | Decay formula | Running S(t)=S(prev)*exp(-lambda*dt)+w | Raw event scan | O(1) vs O(N), proven exact, 20-60x faster at 50+ events/entity |
| Windowed aggregation | Bucketed counters (Scotty pattern) | SWAG two-stacks | Simpler, serves multiple window sizes from one set of buckets | | Windowed aggregation | Bucketed counters (Scotty pattern) | SWAG two-stacks | Simpler, serves multiple window sizes from one set of buckets |
| Hybrid fusion | RRF (k=60) | Tuned linear combination | Zero-config, robust; linear combo is the upgrade path with relevance labels | | Hybrid fusion | RRF (k=60) | Tuned linear combination | Zero-config, robust; linear combo is the upgrade path with relevance labels |
| Consistency model | DB-primary, Tantivy as derived index | Two-phase commit | Simpler, deterministic recovery, source of truth is always the entity store | | Consistency model | DB-primary, Tantivy as derived index | Two-phase commit | Simpler, deterministic recovery, source of truth is always the entity store |
| WAL checksums | BLAKE3 | CRC32C | Content-addressing enables deduplication; BLAKE3 is fast enough | | WAL checksums | BLAKE3 | CRC32C | Content-addressing enables deduplication; BLAKE3 is fast enough |
| Key encoding | Subject-prefix `[entity_id][0x00][TAG:suffix]` | Separate key namespaces | Co-locates entity data, natural shard boundary, single prefix scan | | Key encoding | Subject-prefix `[entity_id][0x00][TAG:suffix]` | Separate key namespaces | Co-locates entity data, natural shard boundary, single prefix scan |
| Embedding format | f16 quantization (default) | float32 | Half memory, < 1% recall loss at 1536D | | Embedding format | f16 quantization (default) | float32 | Half memory, < 1% recall loss at 1536D |
| Query language | Custom (RETRIEVE/SEARCH/SIGNAL) | SQL | Domain semantics cannot be expressed in SQL without losing optimization opportunities | | Query language | Custom (RETRIEVE/SEARCH/SIGNAL) | SQL | Domain semantics cannot be expressed in SQL without losing optimization opportunities |
--- ---
@ -1265,8 +1399,8 @@ These decisions are made. They are not revisited unless benchmarks prove them wr
These are explicitly out of scope for the foreseeable future: These are explicitly out of scope for the foreseeable future:
1. **Embedding generation** -- tidalDB retrieves and ranks over vectors. It does not generate them. Bring your own model. 1. **Embedding generation** -- tidalDB retrieves and ranks over vectors. It does not generate them. Bring your own model.
2. **Horizontal distribution** -- Single-node first. Scale vertically. Distribution is a separate product. 2. **Generic horizontal distribution** -- M8 delivers the tidalDB-specific distributed fabric (WAL shipping, shard routing, eventual consistency). We are still not building a general-purpose distributed SQL store or OLTP replica mesh.
3. **ACID transactions across entities** -- Signal writes are atomic within an entity's state. Cross-entity transactions are not needed for the ranking problem. 3. **ACID transactions across entities** -- Signal writes are atomic within an entity's state. Cross-entity transactions are not needed for the ranking problem.
4. **SQL compatibility** -- The custom query language exists because SQL cannot express ranking semantics. No SQL layer. 4. **SQL compatibility** -- The custom query language exists because SQL cannot express ranking semantics. No SQL layer.
5. **Multi-tenancy** -- One tidalDB instance serves one application. Tenant isolation is the application's concern. 5. **Per-request hard multitenancy inside a single shard** -- M8 introduces tenant-aware namespaces and quotas for hosted deployments, but strong regulatory isolation (HIPAA, PCI) still requires separate deployments per tenant.
6. **Content moderation, authentication, payments, CDN** -- tidalDB solves one problem: ranking. Everything else is someone else's job. 6. **Content moderation, authentication, payments, CDN** -- tidalDB solves one problem: ranking. Everything else is someone else's job.

View File

@ -0,0 +1,64 @@
# Milestone 0 · Embeddable Runtime (Overview)
Milestone 0 proves that tidalDB can run entirely in-process with zero-configuration defaults, deterministic file layout, and tooling support. Engineers can `cargo add tidaldb`, open an ephemeral database, and run doctested samples before touching the harder ranking work.
## Objective
Ship the runtime shell that every later milestone depends on: builder/config API, sandboxed paths, WAL/metrics plumbing, CLI inspection, and living documentation.
## Deliverables
- `TidalDb::builder()` with `ephemeral()` and `single_process()` shortcuts, eager path validation, deterministic shutdown hooks, and temp-directory helpers used by doctests/integration tests.
- Tooling + diagnostics: `tidalctl` CLI (`status`, `paths`) and optional `/metrics` + `/healthz` endpoints that read the same files as the embedded process.
- Samples + docs: quick start code, doctests, Axum/Actix embedding examples, CONTRIBUTING checklist updates.
## Phase Breakdown
1. **Phase 1 — Embeddable Runtime Skeleton**
- Config + builder API with temp dirs, sandboxed layout, shutdown hooks.
- Deterministic path helper reused everywhere.
2. **Phase 2 — Tooling & Diagnostics**
- `tidalctl` surface plus metrics exporters sharing the same path helpers.
- Health probes expose WAL sequence, config hash, uptime.
3. **Phase 3 — Samples & Docs**
- Quick-start doctest stays green in CI.
- Embedding examples for Axum/Actix/CLI + CONTRIBUTING updates.
Each phase has task files under `docs/planning/milestone-0/phase-*`.
## Acceptance Criteria
- Builder defaults require zero manual config; paths validated up-front.
- Shutdown drains background workers and surfaces errors.
- Temp-directory helper cleans up automatically unless `preserve()` is called.
- `tidalctl status --path <dir>` prints WAL seq, storage layout, config snapshot.
- Metrics endpoint (optional) exposes uptime, WAL queue depth, build hash.
- Docs + samples compile via `cargo test --doc`; doctests fail CI if stale.
## Dependencies & Unblocks
- **Depends on:** None — foundational.
- **Unblocks:** Every milestone that needs storage paths, WAL writers, doctests, or CLI inspection (M1 onward).
## UAT Snapshot
```
let db = TidalDb::builder().ephemeral().with_temp_dir().open().unwrap();
db.health_check();
tidalctl status --path <dir>;
cargo test --doc;
```
All commands succeed without hand configuration.
## Research / Reference Notes
- `docs/research/tidaldb_signal_ledger.md` — running-score math informs future phases; builder must not block that work.
- `docs/research/tidaldb_wal.md` — dictates WAL layout and diagnostics exposed by tooling.
## Risks & Mitigations
- **Risk:** Temp directories linger on CI failures.
**Mitigation:** `TempTidalHome::drop` best-effort cleanup + `preserve()` flag for debugging.
- **Risk:** Tooling diverges from runtime paths.
**Mitigation:** `Paths` helper shared between runtime, CLI, and docs.

View File

@ -0,0 +1,101 @@
# Milestone 2, Phase 1: Vector Index Integration (USearch)
## Phase Deliverable
The `VectorIndex` trait and two implementations: `BruteForceIndex` (pure Rust, exact search) and `UsearchIndex` (USearch C++ FFI, HNSW approximate search). Items can be inserted with embeddings and retrieved by approximate nearest neighbor similarity. Vectors are L2-normalized at insertion time so L2 distance is equivalent to cosine similarity. An adaptive query planner routes filtered ANN queries to the optimal strategy based on estimated selectivity: brute-force for very selective filters (< 1%), widened `ef_search` for the danger zone (1-20%), and standard in-graph predicate filtering for broad filters (> 20%). The USearch backend uses f16 quantization by default, mmap persistence via `view()` for instant restart, and full `save()`/`load()` for checkpoint coordination. A `BruteForceIndex` exists for correctness verification, small datasets, and the pre-filter brute-force strategy.
## Acceptance Criteria
- [ ] `VectorIndex` trait with `insert(key, vector)`, `delete(key)`, `search(query, k, ef_search)`, `filtered_search(query, k, ef_search, predicate)`, `save()`, `load()`, `view()`, `reserve()`, `len()`, `len_live()`, `is_empty()`, `tombstone_ratio()`
- [ ] `VectorSearchResult { id: VectorId, distance: f32 }` and `VectorIndexConfig { dimensions, metric, quantization, connectivity, ef_construction, ef_search }` types
- [ ] `DistanceMetric` enum: `L2`, `InnerProduct`
- [ ] `QuantizationLevel` enum: `F32`, `F16`, `Int8`
- [ ] `VectorError` enum: `DimensionMismatch`, `CapacityExceeded`, `NotFound`, `Io`, `CorruptedIndex`, `Backend`, `ZeroNormVector`
- [ ] `BruteForceIndex` implements `VectorIndex` with exact linear-scan search, `RwLock<HashMap<VectorId, Vec<f32>>>` storage
- [ ] `MockVectorIndex` returns predetermined results for unit tests and records call history
- [ ] USearch backend implements the trait with f16 quantization (default), M=16, ef_construction=200, ef_search=200
- [ ] USearch `filtered_search` passes predicate closure to USearch's predicate callback API
- [ ] USearch `reserve()` for capacity management (2x over-provision)
- [ ] USearch `save()`, `load()`, `view()` delegating to USearch persistence methods
- [ ] `#![forbid(unsafe_code)]` relaxed only in `storage/vector/usearch.rs` with `// SAFETY:` comments on every unsafe block
- [ ] `l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError>` normalizes to unit length; fails on zero-norm vectors
- [ ] `EmbeddingSlotRegistry` maps `(EntityKind, slot_name)` to `EmbeddingSlotState { index, dimensions, quantization, params }`
- [ ] Insert path: validate dims, normalize, store f32 in entity store (`META` key with `EMB:slot_name` suffix), insert quantized into HNSW
- [ ] Update path: tombstone in HNSW, insert new vector
- [ ] Delete path: tombstone only
- [ ] Adaptive query planner: selectivity < 1% triggers pre-filter + brute-force; 1-20% uses `filtered_search` with widened `ef_search` (2-3x); > 20% uses standard `filtered_search`; 100% (no filter) uses unfiltered `search()`
- Note: ROADMAP.md acceptance criteria say selectivity < 2% -> brute-force. These docs use the refined spec thresholds from Spec 07 Section 9 (< 1% brute-force, 1-20% widened HNSW, > 20% in-graph filter). The roadmap threshold of 2% is superseded by the spec.
- [ ] ANN retrieval at 10K vectors returns top-100 with recall@100 > 0.95 (measured against `BruteForceIndex`)
- [ ] ANN retrieval latency < 10ms at 10K vectors (Criterion benchmark)
- [ ] Persistence: `save()` on checkpoint, `view()` on restart for immediate read serving
- [ ] Criterion benchmarks: unfiltered search, filtered search at 20% and 5% selectivity, brute-force search, recall@100
## Dependencies
- **Requires:** m1p1 (types: `EntityId`, `EntityKind`), m1p3 (storage: `StorageEngine` trait, key encoding with `Tag::Meta` for embedding persistence), m1p5 (entity write API for storing embeddings in entity store)
- **Blocks:** m2p2 (metadata indexes use the `VectorIndex` for filter bitmap selectivity estimation), m2p5 (RETRIEVE executor needs ANN search for candidate generation)
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- USearch evaluation (127K QPS at f32, 167K at int8), filtered search callback architecture, ACORN-1 two-hop expansion, f16 as optimal default, mmap persistence strategy, memory budget analysis (31.5 GB at 10M x 1536d x f16), `reserve()` capacity planning
- [thoughts.md](../../../../thoughts.md) -- Part V.9 (hybrid storage: "vector index and text index are derived state, always rebuildable from the entity store")
## Spec References
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200), Section 3 (filtered ANN: three strategies with selectivity thresholds), Section 4 (quantization: f16 default, < 1% recall loss), Section 5 (multiple embedding spaces, slot registry), Section 6 (embedding lifecycle: insert, update, delete paths), Section 7 (persistence: save/load/view, delta journal), Section 9 (adaptive query planner: decision tree, threshold reference, runtime statistics), Section 11 (VectorIndex trait, full API), Section 12 (performance targets), Section 13 (invariants: 10 correctness guarantees)
- [docs/specs/00-architecture-overview.md](../../../specs/00-architecture-overview.md) -- Module map showing `storage/vector/`
## Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | VectorIndex Trait + BruteForceIndex | `VectorIndex` trait, all types, `BruteForceIndex`, `MockVectorIndex`, property tests | None | M |
| 02 | USearch Backend | `UsearchIndex` wrapping USearch via Rust crate, f16 quantization, mmap persistence, `#[allow(unsafe_code)]` | Task 01 | L |
| 03 | Embedding Lifecycle + Slot Registry | `l2_normalize`, `EmbeddingSlotRegistry`, insert/update/delete paths, entity store integration | Task 01 | M |
| 04 | Adaptive Query Planner + Benchmarks | `AdaptiveQueryPlanner`, `SelectivityEstimator`, `AnnQueryStats`, Criterion benchmarks | Task 01, Task 02 | M |
## Task Dependency DAG
```
Task 01: VectorIndex Trait + BruteForceIndex
| \
| \
v v
Task 02: USearch Backend Task 03: Embedding Lifecycle + Slot Registry
|
+----> Task 04: Adaptive Query Planner + Benchmarks
| (also depends on Task 01)
```
Task 01 is the foundation -- it defines the trait all other tasks implement or consume. Tasks 02 and 03 are parallelizable after Task 01. Task 04 requires both Task 01 (for trait types) and Task 02 (for USearch backend to benchmark against).
## File Layout
```
tidal/src/
storage/
vector/
mod.rs -- VectorIndex trait, VectorError, VectorSearchResult, VectorIndexConfig,
DistanceMetric, QuantizationLevel, VectorId, types re-exports (Task 01)
brute.rs -- BruteForceIndex, MockVectorIndex (Task 01)
usearch.rs -- UsearchIndex, #[allow(unsafe_code)] (Task 02)
lifecycle.rs -- l2_normalize, embedding insert/update/delete path (Task 03)
registry.rs -- EmbeddingSlotRegistry, EmbeddingSlotState (Task 03)
planner.rs -- AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats (Task 04)
mod.rs -- add `pub mod vector;`
tidal/benches/
vector.rs -- Criterion benchmarks (Task 04)
tidal/Cargo.toml -- add `usearch` and `rayon` dependencies
```
## Open Questions
1. **`usearch` crate version**: The `usearch` crate (crates.io) wraps USearch via CXX. Verify the latest stable version supports `filtered_search` with a predicate callback. If not, the alternative is `hnsw_rs` which is pure Rust but lacks quantization and deletion support. The research doc recommends USearch but notes hnsw_rs as a fallback.
2. **Capacity planning**: USearch `reserve()` must be called before first insertion. tidalDB should over-provision by 2x the schema-defined entity limit. What happens if the index fills up? Need to benchmark whether a full rebuild is needed or if `reserve()` can be called again with higher capacity.
3. **Concurrency model**: USearch claims concurrent reads + writes. Verify that `filtered_search` and `insert` can truly run concurrently without a mutex wrapper. If not, add `RwLock<UsearchIndex>` and document the contention implication. ScyllaDB validates concurrent operation at 1B vectors but tidalDB's access patterns may differ.
4. **Delta journal vs full save**: The spec recommends a delta journal for incremental persistence (Spec 07, Section 7). For M2 at 10K items (not 10M), a full `save()` on every checkpoint is fast enough (10K x 1536d x f16 = ~30 MB, writes in < 100ms). Defer delta journal implementation to M7 unless benchmarks show otherwise.
5. **Selectivity estimation without m2p2**: Task 04 (planner) depends on selectivity estimates from metadata bitmap indexes (m2p2). For the m2p1 phase, the planner can use a fixed threshold with a placeholder `SelectivityEstimator` that returns 1.0 (always use in-graph filter). Wire up the real estimator when m2p2 is implemented.

View File

@ -0,0 +1,826 @@
# Task 01: VectorIndex Trait + BruteForceIndex
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** None (uses types from m1p1 but no m2p1 tasks)
**Blocks:** Task 02 (USearch Backend), Task 03 (Embedding Lifecycle), Task 04 (Adaptive Query Planner)
**Complexity:** M
## Objective
Deliver the `VectorIndex` trait -- the public interface for all ANN operations in tidalDB -- along with the full type system for vector search (`VectorId`, `VectorSearchResult`, `VectorIndexConfig`, `DistanceMetric`, `QuantizationLevel`, `VectorError`) and two pure-Rust implementations: `BruteForceIndex` (exact linear-scan search) and `MockVectorIndex` (predetermined results for unit tests).
The `VectorIndex` trait is the abstraction boundary. No module outside `storage/vector/` will ever know whether USearch, hnsw_rs, or brute-force is behind it. This is the same pattern as `StorageEngine` in m1p3: define the trait first, implement brute-force for correctness, then add the production backend in the next task.
`BruteForceIndex` is not a throwaway. It serves three permanent roles:
1. **Correctness oracle** -- recall measurements compare HNSW results against `BruteForceIndex` exact results.
2. **Small datasets** -- when the index has fewer than ~10,000 vectors, brute-force is faster than HNSW because there is no graph construction overhead.
3. **Pre-filter fallback** -- the adaptive query planner (Task 04) uses `BruteForceIndex`-style linear scan over bitmap-filtered candidate sets when selectivity < 1%.
No unsafe code in this task. Pure Rust throughout.
## Requirements
- `VectorIndex` trait: `insert`, `search`, `filtered_search`, `delete`, `reserve`, `save`, `load`, `view`, `len`, `len_live`, `is_empty`, `tombstone_ratio`
- All trait methods match the signatures in Spec 07, Section 11
- `VectorIndex: Send + Sync` bound
- `VectorId = u64` type alias
- `VectorSearchResult { id: VectorId, distance: f32 }` with `Debug`, `Clone`
- `VectorIndexConfig` with all HNSW parameters
- `DistanceMetric` enum: `L2`, `InnerProduct`
- `QuantizationLevel` enum: `F32`, `F16`, `Int8`
- `VectorError` enum with `Display`, `Debug`, `From<std::io::Error>`
- `BruteForceIndex`: `RwLock<HashMap<VectorId, Vec<f32>>>` for storage, linear scan for search
- `BruteForceIndex::search` returns results sorted by ascending L2 squared distance
- `BruteForceIndex::filtered_search` applies predicate during linear scan, returns only matching results
- `BruteForceIndex::delete` removes the vector from the HashMap (true delete, not tombstone)
- `BruteForceIndex::save`/`load`/`view` use a simple binary format for test persistence
- `MockVectorIndex`: predetermined results, call recording for test assertions
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/storage/vector/
mod.rs -- VectorIndex trait, all types, re-exports
brute.rs -- BruteForceIndex, MockVectorIndex
```
### Public API
```rust
// === storage/vector/mod.rs ===
use std::path::Path;
/// A unique identifier for an entity in the vector index.
/// Corresponds to the u64 representation of the application-provided entity ID.
pub type VectorId = u64;
/// A scored search result from the vector index.
#[derive(Debug, Clone)]
pub struct VectorSearchResult {
/// Entity ID in the vector index.
pub id: VectorId,
/// L2 squared distance from query vector. Lower = more similar.
/// For L2-normalized vectors, range is [0.0, 4.0] where 0.0 = identical.
pub distance: f32,
}
/// Configuration for vector index construction.
#[derive(Debug, Clone)]
pub struct VectorIndexConfig {
/// Number of dimensions per vector.
pub dimensions: usize,
/// Distance metric.
pub metric: DistanceMetric,
/// Quantization level for stored vectors.
pub quantization: QuantizationLevel,
/// Maximum connections per node per layer (M parameter). Default: 16.
pub connectivity: usize,
/// Beam width during index construction. Default: 200.
pub ef_construction: usize,
/// Default beam width during search (overridable per query). Default: 200.
pub ef_search: usize,
}
impl Default for VectorIndexConfig {
fn default() -> Self {
Self {
dimensions: 1536,
metric: DistanceMetric::L2,
quantization: QuantizationLevel::F16,
connectivity: 16,
ef_construction: 200,
ef_search: 200,
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistanceMetric {
/// L2 squared distance. Default for cosine over normalized vectors.
L2,
/// Inner product. For MIPS workloads (with XBOX transformation).
InnerProduct,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantizationLevel {
/// Full precision (4 bytes per dimension).
F32,
/// Half precision (2 bytes per dimension). Default.
F16,
/// Scalar quantization (1 byte per dimension).
Int8,
}
/// Errors from vector index operations.
#[derive(Debug)]
pub enum VectorError {
/// Vector dimensions do not match index configuration.
DimensionMismatch { expected: usize, got: usize },
/// Index is at capacity and cannot accept more vectors.
CapacityExceeded { capacity: usize },
/// Vector ID not found in the index.
NotFound { id: VectorId },
/// I/O error during persistence.
Io(std::io::Error),
/// Index file is corrupted or incompatible.
CorruptedIndex(String),
/// USearch or backend-specific error.
Backend(String),
/// Vector has zero L2 norm and cannot be normalized.
ZeroNormVector,
}
// Note: `ZeroNormVector` is not in Spec 07 Section 11 but is required by `l2_normalize()` in Task 03. Spec 07 should be updated to include it.
impl std::fmt::Display for VectorError { /* variant-specific messages */ }
impl std::error::Error for VectorError {}
impl From<std::io::Error> for VectorError { /* wraps as VectorError::Io */ }
/// The vector index trait. All ANN operations go through this interface.
///
/// Implementations must be `Send + Sync` for concurrent search + insert.
///
/// # Contract
///
/// - Vectors passed to `insert()` must already be L2-normalized. The trait
/// does not normalize -- the caller (embedding lifecycle, Task 03) is
/// responsible for normalization before insertion.
/// - `search()` and `filtered_search()` return results sorted by ascending
/// distance (most similar first).
/// - `delete()` marks a vector as tombstoned. Tombstoned vectors are excluded
/// from search results but may remain in the index structure.
pub trait VectorIndex: Send + Sync {
/// Insert a vector into the index.
///
/// If a vector with this ID already exists, it is replaced (delete + insert).
///
/// # Errors
///
/// - `VectorError::CapacityExceeded` if the index is full.
/// - `VectorError::DimensionMismatch` if `embedding.len() != config.dimensions`.
fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError>;
/// Search for the K nearest neighbors to the query vector.
///
/// Results are ordered by ascending distance (most similar first).
///
/// # Arguments
///
/// * `query` -- The query vector. Must be L2-normalized.
/// * `k` -- Number of results to return.
/// * `ef_search` -- Beam width override. If 0, uses the index default.
fn search(
&self,
query: &[f32],
k: usize,
ef_search: usize,
) -> Result<Vec<VectorSearchResult>, VectorError>;
/// Search for the K nearest neighbors that satisfy a filter predicate.
///
/// The predicate is evaluated during traversal. Nodes failing the predicate
/// are used for navigation but excluded from results (in-graph filtering).
///
/// # Arguments
///
/// * `query` -- The query vector. Must be L2-normalized.
/// * `k` -- Number of results to return.
/// * `ef_search` -- Beam width override. If 0, uses the index default.
/// * `filter` -- Predicate per candidate node. Return `true` to include.
fn filtered_search(
&self,
query: &[f32],
k: usize,
ef_search: usize,
filter: &dyn Fn(VectorId) -> bool,
) -> Result<Vec<VectorSearchResult>, VectorError>;
/// Remove a vector from the index (lazy tombstone).
///
/// # Errors
///
/// - `VectorError::NotFound` if the ID is not in the index.
fn delete(&self, id: VectorId) -> Result<(), VectorError>;
/// Reserve capacity for at least `additional` more vectors.
fn reserve(&self, additional: usize) -> Result<(), VectorError>;
/// Persist the index to disk.
fn save(&self, path: &Path) -> Result<(), VectorError>;
/// Load an index from disk into writable memory.
fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError>
where
Self: Sized;
/// Memory-map an index from disk for read-only access.
// config required by USearch to initialize the mmap'd index with correct parameters
fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError>
where
Self: Sized;
/// Number of vectors in the index (including tombstoned).
fn len(&self) -> usize;
/// Number of live (non-tombstoned) vectors.
fn len_live(&self) -> usize;
/// Whether the index is empty.
fn is_empty(&self) -> bool {
self.len_live() == 0
}
/// Ratio of tombstoned vectors to total vectors.
fn tombstone_ratio(&self) -> f64 {
if self.len() == 0 {
0.0
} else {
(self.len() - self.len_live()) as f64 / self.len() as f64
}
}
}
```
### BruteForceIndex
```rust
// === storage/vector/brute.rs ===
use std::collections::HashMap;
use std::sync::RwLock;
use std::path::Path;
use std::io::{Read, Write, BufReader, BufWriter};
use std::fs::File;
use super::{VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig, VectorError};
/// Exact nearest-neighbor search via linear scan.
///
/// Used for:
/// 1. Correctness verification (recall measurement against HNSW).
/// 2. Small datasets (< 10,000 vectors where brute-force is faster).
/// 3. Pre-filter fallback (adaptive query planner uses brute-force for
/// very selective filters where the filtered set is small).
pub struct BruteForceIndex {
vectors: RwLock<HashMap<VectorId, Vec<f32>>>,
config: VectorIndexConfig,
}
impl BruteForceIndex {
pub fn new(config: VectorIndexConfig) -> Self;
/// Number of vectors (HashMap length).
fn vector_count(&self) -> usize;
}
```
**Search implementation:**
- Acquire read lock on `vectors`
- Compute L2 squared distance between query and every stored vector
- Collect `(VectorId, f32)` pairs into a Vec
- Sort by ascending distance
- Take first `k` results
- Return as `Vec<VectorSearchResult>`
**L2 squared distance function:**
```rust
/// Compute L2 squared distance between two vectors of equal length.
///
/// For L2-normalized vectors, this is equivalent to `2 - 2 * cos(a, b)`.
/// Returns sum of squared differences.
pub(crate) fn l2_distance_sq(a: &[f32], b: &[f32]) -> f32 {
debug_assert_eq!(a.len(), b.len());
a.iter()
.zip(b.iter())
.map(|(x, y)| {
let d = x - y;
d * d
})
.sum()
}
```
**Persistence (save/load/view):**
`BruteForceIndex` uses a simple binary format for test persistence:
```
Header:
[magic: 4 bytes "BFVI"]
[version: 1 byte (0x01)]
[dimensions: 4 bytes LE]
[count: 8 bytes LE]
Per vector:
[id: 8 bytes LE]
[vector: dimensions * 4 bytes, f32 LE]
```
`view()` loads the same file as `load()` (brute-force has no mmap mode -- it is always in-memory). This is acceptable because `BruteForceIndex` is not the production backend.
**Filtered search:** Same as `search()` but skips vectors where `filter(id) == false` before adding to the distance computation. This means brute-force filtered search only computes distances for vectors passing the filter, which is why it is fast for very selective filters.
### MockVectorIndex
```rust
/// Configurable mock for unit tests.
///
/// Returns predetermined results from search calls and records all method
/// invocations for verification.
pub struct MockVectorIndex {
search_results: RwLock<Vec<Vec<VectorSearchResult>>>,
call_log: RwLock<Vec<VectorIndexCall>>,
config: VectorIndexConfig,
inserted_count: RwLock<usize>,
}
#[derive(Debug, Clone)]
pub enum VectorIndexCall {
Insert { id: VectorId },
Delete { id: VectorId },
Search { k: usize, ef_search: usize },
FilteredSearch { k: usize, ef_search: usize },
Reserve { additional: usize },
Save,
Load,
View,
}
impl MockVectorIndex {
/// Create a mock with predetermined search results.
///
/// Each call to `search()` or `filtered_search()` pops the first element
/// from `search_results`. If empty, returns an empty Vec.
pub fn new(config: VectorIndexConfig, search_results: Vec<Vec<VectorSearchResult>>) -> Self;
/// Get the recorded call log.
pub fn calls(&self) -> Vec<VectorIndexCall>;
/// Clear the call log.
pub fn clear_calls(&self);
}
```
### Error Handling
- `insert()` with wrong dimensions: returns `VectorError::DimensionMismatch { expected, got }`.
- `search()` with wrong query dimensions: returns `VectorError::DimensionMismatch`.
- `delete()` for unknown ID: returns `VectorError::NotFound { id }`.
- `save()`/`load()` I/O failures: returns `VectorError::Io(e)`.
- `load()` with corrupt file: returns `VectorError::CorruptedIndex(msg)`.
## Test Strategy
### Property Tests
```rust
use proptest::prelude::*;
// Insert + search roundtrip: every inserted vector is retrievable.
proptest! {
#[test]
fn insert_search_roundtrip(
dim in 2usize..64,
n_vectors in 1usize..200,
k in 1usize..50,
) {
let k = k.min(n_vectors);
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = BruteForceIndex::new(config);
// Insert random unit vectors
let mut rng = proptest::test_runner::TestRng::deterministic_rng(
proptest::test_runner::RngAlgorithm::ChaCha
);
for id in 0..n_vectors as u64 {
let v: Vec<f32> = (0..dim).map(|_| rng.gen::<f32>() - 0.5).collect();
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();
index.insert(id, &unit).unwrap();
}
// Search for each inserted vector: it should be the top-1 result
for id in 0..n_vectors as u64 {
// Note: test must be in the module (or use pub(crate) vectors field) to access this private field.
let v = index.vectors.read().unwrap()[&id].clone();
let results = index.search(&v, 1, 0).unwrap();
prop_assert!(!results.is_empty());
prop_assert_eq!(results[0].id, id);
prop_assert!(results[0].distance < 1e-6, "self-search should return distance ~0");
}
}
}
// Delete excludes tombstoned IDs from search results.
proptest! {
#[test]
fn delete_excludes_from_results(
dim in 2usize..32,
n_vectors in 5usize..100,
) {
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = BruteForceIndex::new(config);
// Insert vectors
let vectors: Vec<Vec<f32>> = (0..n_vectors).map(|_| {
let v: Vec<f32> = (0..dim).map(|i| ((i * 7 + 13) % 100) as f32 / 100.0 - 0.5).collect();
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
v.iter().map(|x| x / norm).collect()
}).collect();
for (id, v) in vectors.iter().enumerate() {
index.insert(id as u64, v).unwrap();
}
// Delete the first vector
index.delete(0).unwrap();
// Search should not return deleted ID
let query = &vectors[0];
let results = index.search(query, n_vectors, 0).unwrap();
prop_assert!(results.iter().all(|r| r.id != 0),
"deleted vector should not appear in results");
prop_assert_eq!(results.len(), n_vectors - 1);
}
}
// filtered_search honors all predicates.
proptest! {
#[test]
fn filtered_search_honors_predicate(
dim in 2usize..32,
n_vectors in 10usize..100,
k in 1usize..20,
) {
let k = k.min(n_vectors / 2);
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = BruteForceIndex::new(config);
for id in 0..n_vectors as u64 {
let v: Vec<f32> = (0..dim).map(|i| ((id as usize * 3 + i * 7) % 100) as f32 / 100.0).collect();
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();
index.insert(id, &unit).unwrap();
}
// Filter: only even IDs
let predicate = |id: VectorId| id % 2 == 0;
let query: Vec<f32> = (0..dim).map(|i| (i as f32) / dim as f32).collect();
let norm: f32 = query.iter().map(|x| x * x).sum::<f32>().sqrt();
let unit_query: Vec<f32> = query.iter().map(|x| x / norm).collect();
let results = index.filtered_search(&unit_query, k, 0, &predicate).unwrap();
for r in &results {
prop_assert!(r.id % 2 == 0,
"filtered_search returned odd ID {}", r.id);
}
}
}
// Search results are sorted by ascending distance.
proptest! {
#[test]
fn results_sorted_by_distance(
dim in 2usize..32,
n_vectors in 5usize..100,
k in 2usize..50,
) {
let k = k.min(n_vectors);
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = BruteForceIndex::new(config);
for id in 0..n_vectors as u64 {
let v: Vec<f32> = (0..dim).map(|i| ((id as usize + i) % 100) as f32 / 100.0).collect();
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();
index.insert(id, &unit).unwrap();
}
let query: Vec<f32> = vec![1.0 / (dim as f32).sqrt(); dim];
let results = index.search(&query, k, 0).unwrap();
for w in results.windows(2) {
prop_assert!(w[0].distance <= w[1].distance,
"results not sorted: {} > {}", w[0].distance, w[1].distance);
}
}
}
```
### Unit Tests
```rust
#[test]
fn brute_force_new_is_empty() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
assert_eq!(index.len(), 0);
assert_eq!(index.len_live(), 0);
assert!(index.is_empty());
assert!((index.tombstone_ratio() - 0.0).abs() < f64::EPSILON);
}
#[test]
fn brute_force_insert_and_len() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
assert_eq!(index.len(), 2);
assert_eq!(index.len_live(), 2);
assert!(!index.is_empty());
}
#[test]
fn brute_force_dimension_mismatch() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let result = index.insert(1, &[1.0, 0.0]); // 2 dims instead of 3
assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 3, got: 2 })));
}
#[test]
fn brute_force_search_dimension_mismatch() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
let result = index.search(&[1.0, 0.0], 1, 0); // 2 dims query
assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}
#[test]
fn brute_force_self_search_distance_zero() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let v = [1.0, 0.0, 0.0];
index.insert(42, &v).unwrap();
let results = index.search(&v, 1, 0).unwrap();
assert_eq!(results.len(), 1);
assert_eq!(results[0].id, 42);
assert!(results[0].distance < 1e-6);
}
#[test]
fn brute_force_search_empty_index() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let results = index.search(&[1.0, 0.0, 0.0], 10, 0).unwrap();
assert!(results.is_empty());
}
#[test]
fn brute_force_search_k_larger_than_index() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
let results = index.search(&[1.0, 0.0, 0.0], 100, 0).unwrap();
assert_eq!(results.len(), 2); // returns all available, not error
}
#[test]
fn brute_force_orthogonal_vectors_distance() {
// For unit vectors a, b: ||a - b||^2 = 2 - 2*cos(a,b)
// Orthogonal unit vectors: cos = 0, so distance = 2.0
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
let results = index.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
assert!((results[0].distance - 2.0).abs() < 1e-5,
"orthogonal unit vectors should have L2^2 distance of 2.0, got {}", results[0].distance);
}
#[test]
fn brute_force_identical_vectors_distance() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let v = [0.577_350_3, 0.577_350_3, 0.577_350_3]; // unit vector
index.insert(1, &v).unwrap();
let results = index.search(&v, 1, 0).unwrap();
assert!(results[0].distance < 1e-6);
}
#[test]
fn brute_force_delete_and_search() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
index.insert(3, &[0.0, 0.0, 1.0]).unwrap();
index.delete(2).unwrap();
assert_eq!(index.len(), 2); // BruteForce does true delete
assert_eq!(index.len_live(), 2);
let results = index.search(&[0.0, 1.0, 0.0], 10, 0).unwrap();
assert!(results.iter().all(|r| r.id != 2));
}
#[test]
fn brute_force_delete_not_found() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let result = index.delete(999);
assert!(matches!(result, Err(VectorError::NotFound { id: 999 })));
}
#[test]
fn brute_force_insert_replaces_existing() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
index.insert(1, &[0.0, 1.0, 0.0]).unwrap(); // replace
assert_eq!(index.len(), 1); // still 1 vector
let results = index.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
assert_eq!(results[0].id, 1);
assert!(results[0].distance < 1e-6, "should match the replacement vector");
}
#[test]
fn brute_force_filtered_search_excludes_non_matching() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
for id in 0..10u64 {
let v = [1.0, 0.0, 0.0]; // all same direction
index.insert(id, &v).unwrap();
}
// Only include even IDs
let results = index.filtered_search(&[1.0, 0.0, 0.0], 10, 0, &|id| id % 2 == 0).unwrap();
assert_eq!(results.len(), 5);
assert!(results.iter().all(|r| r.id % 2 == 0));
}
#[test]
fn brute_force_filtered_search_empty_result() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
// Predicate that matches nothing
let results = index.filtered_search(&[1.0, 0.0, 0.0], 10, 0, &|_| false).unwrap();
assert!(results.is_empty());
}
#[test]
fn brute_force_save_load_roundtrip() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config.clone());
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
let dir = tempfile::tempdir().unwrap();
let path = dir.path().join("test.bfvi");
index.save(&path).unwrap();
let loaded = BruteForceIndex::load(&path, &config).unwrap();
assert_eq!(loaded.len(), 2);
// Search should produce identical results
let results_orig = index.search(&[1.0, 0.0, 0.0], 2, 0).unwrap();
let results_loaded = loaded.search(&[1.0, 0.0, 0.0], 2, 0).unwrap();
assert_eq!(results_orig.len(), results_loaded.len());
for (a, b) in results_orig.iter().zip(results_loaded.iter()) {
assert_eq!(a.id, b.id);
assert!((a.distance - b.distance).abs() < 1e-6);
}
}
#[test]
fn brute_force_reserve_is_noop() {
// BruteForce uses HashMap, which resizes automatically.
// reserve() is a noop but must not error.
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
assert!(index.reserve(1_000_000).is_ok());
}
#[test]
fn l2_distance_sq_correctness() {
let a = [1.0, 0.0, 0.0];
let b = [0.0, 1.0, 0.0];
let dist = l2_distance_sq(&a, &b);
assert!((dist - 2.0).abs() < 1e-6);
let c = [1.0, 0.0, 0.0];
assert!(l2_distance_sq(&a, &c) < 1e-6);
}
#[test]
fn mock_vector_index_returns_predetermined() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let results = vec![
vec![VectorSearchResult { id: 42, distance: 0.1 }],
vec![VectorSearchResult { id: 99, distance: 0.5 }],
];
let mock = MockVectorIndex::new(config, results);
let r1 = mock.search(&[1.0, 0.0, 0.0], 1, 0).unwrap();
assert_eq!(r1[0].id, 42);
let r2 = mock.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
assert_eq!(r2[0].id, 99);
// Third call: no more results, returns empty
let r3 = mock.search(&[0.0, 0.0, 1.0], 1, 0).unwrap();
assert!(r3.is_empty());
}
#[test]
fn mock_vector_index_records_calls() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let mock = MockVectorIndex::new(config, vec![]);
mock.insert(1, &[1.0, 0.0, 0.0]).unwrap();
mock.delete(1).unwrap();
mock.search(&[1.0, 0.0, 0.0], 10, 200).unwrap();
mock.filtered_search(&[1.0, 0.0, 0.0], 5, 0, &|_| true).unwrap();
let calls = mock.calls();
assert_eq!(calls.len(), 4);
assert!(matches!(calls[0], VectorIndexCall::Insert { id: 1 }));
assert!(matches!(calls[1], VectorIndexCall::Delete { id: 1 }));
assert!(matches!(calls[2], VectorIndexCall::Search { k: 10, ef_search: 200 }));
assert!(matches!(calls[3], VectorIndexCall::FilteredSearch { k: 5, ef_search: 0 }));
}
#[test]
fn vector_index_is_send_and_sync() {
fn assert_send_sync<T: Send + Sync>() {}
assert_send_sync::<BruteForceIndex>();
assert_send_sync::<MockVectorIndex>();
}
#[test]
fn vector_index_config_defaults() {
let config = VectorIndexConfig::default();
assert_eq!(config.dimensions, 1536);
assert_eq!(config.metric, DistanceMetric::L2);
assert_eq!(config.quantization, QuantizationLevel::F16);
assert_eq!(config.connectivity, 16);
assert_eq!(config.ef_construction, 200);
assert_eq!(config.ef_search, 200);
}
```
## Acceptance Criteria
- [ ] `VectorIndex` trait with all methods from Spec 07, Section 11
- [ ] `VectorIndex: Send + Sync` bound
- [ ] `VectorId = u64` type alias
- [ ] `VectorSearchResult`, `VectorIndexConfig`, `DistanceMetric`, `QuantizationLevel`, `VectorError` types with correct derives
- [ ] `VectorIndexConfig::default()` returns dimensions=1536, L2, F16, M=16, ef_construction=200, ef_search=200
- [ ] `VectorError` implements `Display`, `Error`, `From<std::io::Error>`
- [ ] `l2_distance_sq()` computes correct L2 squared distance
- [ ] `BruteForceIndex::search()` returns exact nearest neighbors sorted by ascending distance
- [ ] `BruteForceIndex::filtered_search()` returns only results where `filter(id) == true`
- [ ] `BruteForceIndex::insert()` validates dimensions and rejects mismatches
- [ ] `BruteForceIndex::insert()` replaces existing vectors with the same ID
- [ ] `BruteForceIndex::delete()` removes vectors; they never appear in search results
- [ ] `BruteForceIndex::delete()` returns `NotFound` for unknown IDs
- [ ] `BruteForceIndex::save()` and `load()` roundtrip produces identical search results
- [ ] `MockVectorIndex` returns predetermined results and records call history
- [ ] All property tests pass: insert+search roundtrip, delete exclusion, filtered_search predicate honor, result ordering
- [ ] `BruteForceIndex` and `MockVectorIndex` are `Send + Sync`
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Section "Implementation recommendation: wrap USearch, build the planner": "A `BruteForceIndex` exists for correctness verification and small-dataset deployments", brute-force breakeven point (~2,000-5,000 vectors)
## Spec References
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 11 (VectorIndex trait: full API signatures, VectorError variants, BruteForceIndex implementation sketch, MockVectorIndex), Section 12 (performance targets), Section 13 (invariants 1-3: insert retrievability, delete exclusion, filtered_search predicate compliance)
## Implementation Notes
- Add `pub mod vector;` to `tidal/src/storage/mod.rs`. The vector module is a submodule of storage because vector indexes are a storage concern (persistence, key encoding, entity store integration).
- `BruteForceIndex` uses true deletion (HashMap::remove), not lazy tombstoning. This means `len()` and `len_live()` always return the same value. The `tombstone_ratio()` default implementation handles this correctly (returns 0.0). USearch (Task 02) uses lazy tombstoning, where `len() > len_live()`.
- The `ef_search` parameter is ignored by `BruteForceIndex` (exact search has no beam width). It is accepted for trait compliance but unused.
- `view()` for `BruteForceIndex` delegates to `load()` since there is no mmap mode. This is documented on the method.
- `reserve()` for `BruteForceIndex` is a no-op since HashMap resizes automatically. This is documented on the method.
- Do NOT add the `usearch` crate dependency in this task. That is Task 02.
- Do NOT implement `l2_normalize()` in this task. That is Task 03 (Embedding Lifecycle).
- Do NOT implement the adaptive query planner in this task. That is Task 04.
- The `l2_distance_sq()` function is `pub(crate)` -- it is used by `BruteForceIndex` and by Task 04's planner for brute-force fallback. It is not a public API.

View File

@ -0,0 +1,607 @@
# Task 02: USearch Backend
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** Task 01 (VectorIndex trait, types, `l2_distance_sq`)
**Blocks:** Task 04 (Adaptive Query Planner -- needs USearch for benchmarking)
**Complexity:** L
## Objective
Deliver `UsearchIndex`, the production HNSW implementation wrapping the `usearch` Rust crate (Apache-2.0, C++ FFI via `cxx`). This is the performance-critical vector index that tidalDB uses for approximate nearest neighbor search at scale. At 10M vectors of dimension 1536, USearch achieves ~127K QPS at f32 and ~167K QPS at int8, with recall@100 > 95% -- numbers validated by ScyllaDB, ClickHouse, and DuckDB in production.
This is the only module in tidalDB where `#![forbid(unsafe_code)]` is relaxed. The `usearch` crate uses CXX for C++ FFI, which requires `unsafe` at the binding boundary. Every `unsafe` block must have a `// SAFETY:` comment explaining why the invariants hold. The `#[allow(unsafe_code)]` attribute is scoped to this single file (`storage/vector/usearch.rs`).
The USearch backend implements the full `VectorIndex` trait: `insert`, `search`, `filtered_search`, `delete`, `reserve`, `save`, `load`, `view`. It uses f16 quantization by default, M=16, ef_construction=200, ef_search=200 -- parameters validated by the research doc as optimal for 1536-dimensional embeddings at tidalDB's target scale.
## Requirements
- `UsearchIndex` wraps `usearch::Index` from the `usearch` crate
- Implements `VectorIndex` trait from Task 01
- Default config: f16 quantization (`usearch::ScalarKind::F16`), M=16, ef_construction=200, ef_search=200, metric=L2sq
- `insert()` delegates to `usearch::Index::add(key, vector)`
- `search()` delegates to `usearch::Index::search(query, k)`
- `filtered_search()` delegates to `usearch::Index::filtered_search(query, k, predicate)`
- `delete()` delegates to `usearch::Index::remove(key)` (lazy tombstone)
- `reserve()` delegates to `usearch::Index::reserve(capacity)`
- `save()`, `load()`, `view()` delegate to USearch persistence methods
- `len()` and `len_live()` use USearch's `size()` and capacity reporting
- `#[allow(unsafe_code)]` scoped to `usearch.rs` only, with `// SAFETY:` on every unsafe block
- Integration test: insert 1000 random vectors, search for 10 query vectors, compare recall against `BruteForceIndex`
- `UsearchIndex` is `Send + Sync`
## Technical Design
### Module Structure
```
tidal/src/storage/vector/
usearch.rs -- UsearchIndex, #[allow(unsafe_code)]
```
### Cargo.toml Addition
```toml
[dependencies]
usearch = "2" # or latest stable version supporting filtered_search
```
Note: The exact version must be verified at implementation time. The `usearch` crate must support `filtered_search` with a predicate callback. If the latest published version does not support this API, the implementation must either:
1. Use a version that does (check crate changelog).
2. Fall back to `hnsw_rs` (pure Rust, `Filterable` trait) -- see Open Question 1 in OVERVIEW.md.
### Lint Configuration
**Unsafe code:** The `usearch` crate (v2.x) provides a safe Rust API at the `Index` level -- CXX bridge handles the FFI internally. At implementation time, verify that all `Index` methods (`add`, `search`, `filtered_search`, `remove`, `save`, `load`, `view`) have safe signatures. If confirmed safe, **do NOT add `#[allow(unsafe_code)]`** and keep crate-level `forbid(unsafe_code)`. Only add `#[allow(unsafe_code)]` if specific call sites require it, with `// SAFETY:` comments. The current expectation is that no unsafe blocks are needed in `usearch.rs`.
### Public API
```rust
// === storage/vector/usearch.rs ===
//! USearch HNSW backend for approximate nearest neighbor search.
//!
//! This module wraps the `usearch` crate (Apache-2.0, C++ FFI via CXX)
//! behind the `VectorIndex` trait. It is the ONLY module in tidalDB that
//! uses `unsafe` code, and only at the C++ FFI boundary.
//!
//! # Safety
//!
//! All unsafe blocks delegate to `usearch::Index` methods which perform
//! C++ interop via CXX. The safety invariants are:
//! - Vectors passed to USearch have the correct dimensionality (checked
//! before the FFI call).
//! - The `usearch::Index` handle is valid for the lifetime of `UsearchIndex`.
//! - `reserve()` has been called with sufficient capacity before insertion.
#![allow(unsafe_code)]
use std::path::Path;
use super::{VectorIndex, VectorId, VectorSearchResult, VectorIndexConfig, VectorError,
DistanceMetric, QuantizationLevel};
/// Production HNSW index backed by USearch.
///
/// Uses f16 quantization by default, M=16, ef_construction=200, ef_search=200.
/// Supports concurrent reads and writes (validated by ScyllaDB at 1B vectors).
///
/// # Persistence
///
/// - `save(path)`: Full serialization to disk. Coordinated with WAL checkpoint.
/// - `load(path)`: Full deserialization into writable RAM.
/// - `view(path)`: Zero-copy mmap for read-only serving (instant restart).
pub struct UsearchIndex {
inner: usearch::Index,
config: VectorIndexConfig,
}
impl UsearchIndex {
/// Create a new empty index with the given configuration.
///
/// # Errors
///
/// Returns `VectorError::Backend` if USearch fails to initialize.
pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError>;
}
```
### Internal Design
**Index construction:**
```rust
impl UsearchIndex {
pub fn new(config: VectorIndexConfig) -> Result<Self, VectorError> {
let metric = match config.metric {
DistanceMetric::L2 => usearch::MetricKind::L2sq,
DistanceMetric::InnerProduct => usearch::MetricKind::IP,
};
let quantization = match config.quantization {
QuantizationLevel::F32 => usearch::ScalarKind::F32,
QuantizationLevel::F16 => usearch::ScalarKind::F16,
QuantizationLevel::Int8 => usearch::ScalarKind::I8,
};
let options = usearch::IndexOptions {
dimensions: config.dimensions,
metric,
quantization,
connectivity: config.connectivity,
expansion_add: config.ef_construction,
expansion_search: config.ef_search,
..Default::default()
};
// SAFETY: usearch::new_index performs C++ allocation via CXX.
// The returned Index handle is valid until dropped.
let inner = usearch::new_index(&options)
.map_err(|e| VectorError::Backend(format!("USearch init failed: {e}")))?;
Ok(Self { inner, config })
}
}
```
**Insert implementation:**
```rust
fn insert(&self, id: VectorId, embedding: &[f32]) -> Result<(), VectorError> {
if embedding.len() != self.config.dimensions {
return Err(VectorError::DimensionMismatch {
expected: self.config.dimensions,
got: embedding.len(),
});
}
// SAFETY: embedding slice has correct length (checked above).
// USearch::add performs C++ FFI to insert the vector into the HNSW graph.
// The key (u64) and vector data are copied into USearch's internal storage.
self.inner.add(id, embedding)
.map_err(|e| VectorError::Backend(format!("USearch insert failed: {e}")))?;
Ok(())
}
```
**Search implementation:**
```rust
fn search(
&self,
query: &[f32],
k: usize,
ef_search: usize,
) -> Result<Vec<VectorSearchResult>, VectorError> {
if query.len() != self.config.dimensions {
return Err(VectorError::DimensionMismatch {
expected: self.config.dimensions,
got: query.len(),
});
}
// SAFETY: query slice has correct length (checked above).
// USearch::search performs HNSW traversal via C++ FFI.
// Results are copied back into Rust-owned memory.
let results = self.inner.search(query, k)
.map_err(|e| VectorError::Backend(format!("USearch search failed: {e}")))?;
Ok(results.keys.iter().zip(results.distances.iter())
.map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
.collect())
}
```
**Filtered search implementation:**
```rust
fn filtered_search(
&self,
query: &[f32],
k: usize,
ef_search: usize,
filter: &dyn Fn(VectorId) -> bool,
) -> Result<Vec<VectorSearchResult>, VectorError> {
if query.len() != self.config.dimensions {
return Err(VectorError::DimensionMismatch {
expected: self.config.dimensions,
got: query.len(),
});
}
// SAFETY: query slice has correct length (checked above).
// The predicate closure is called from C++ during HNSW traversal.
// CXX marshals the u64 key to Rust and back. The closure captures
// only the filter reference which outlives the search call.
let results = self.inner.filtered_search(query, k, |key| filter(key))
.map_err(|e| VectorError::Backend(format!("USearch filtered_search failed: {e}")))?;
Ok(results.keys.iter().zip(results.distances.iter())
.map(|(&id, &dist)| VectorSearchResult { id, distance: dist })
.collect())
}
```
**Note on `filtered_search` args:** USearch's `filtered_search` takes (query, count, filter) -- there is no `ef_search` parameter. To use a different `ef_search` for this query, call `self.inner.change_expansion_search(ef)` BEFORE `filtered_search`. See the ef_search override note below.
**ef_search override:** Calling `change_expansion_search(ef)` before a search changes a global index parameter. Under concurrent searches this is NOT safe. For M2 (single-threaded query path or low concurrency), wrap the `(change_expansion_search, search)` pair in a `Mutex` guard. For M7 and high concurrency, investigate USearch's thread-safe ef_search API or fix ef_search at construction time. Document this in the Open Questions.
**Delete implementation:**
```rust
fn delete(&self, id: VectorId) -> Result<(), VectorError> {
// SAFETY: USearch::remove performs lazy tombstoning via C++ FFI.
// The node remains in the graph for navigation but is excluded from results.
self.inner.remove(id)
.map_err(|e| VectorError::Backend(format!("USearch delete failed: {e}")))?;
Ok(())
}
```
**Persistence implementation:**
```rust
fn save(&self, path: &Path) -> Result<(), VectorError> {
let path_str = path.to_str()
.ok_or_else(|| VectorError::Io(std::io::Error::new(
std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
// SAFETY: USearch::save serializes the entire index to disk via C++ I/O.
self.inner.save(path_str)
.map_err(|e| VectorError::Backend(format!("USearch save failed: {e}")))?;
Ok(())
}
fn load(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
let index = Self::new(config.clone())?;
let path_str = path.to_str()
.ok_or_else(|| VectorError::Io(std::io::Error::new(
std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
// SAFETY: USearch::load deserializes from disk into writable RAM via C++ I/O.
index.inner.load(path_str)
.map_err(|e| VectorError::Backend(format!("USearch load failed: {e}")))?;
Ok(index)
}
fn view(path: &Path, config: &VectorIndexConfig) -> Result<Self, VectorError> {
// view() now receives config, matching the updated VectorIndex trait
// signature from Task 01 (Fix 2a). Create an index with the config
// options, then call USearch's view() to mmap the file.
let index = Self::new(config.clone())?;
let path_str = path.to_str()
.ok_or_else(|| VectorError::Io(std::io::Error::new(
std::io::ErrorKind::InvalidInput, "non-UTF-8 path")))?;
// SAFETY: USearch::view memory-maps the file for read-only access via C++ I/O.
index.inner.view(path_str)
.map_err(|e| VectorError::Backend(format!("USearch view failed: {e}")))?;
Ok(index)
}
```
**`len` and `len_live` implementation:**
```rust
fn len(&self) -> usize {
self.inner.size()
}
fn len_live(&self) -> usize {
// USearch tracks live vs tombstoned internally.
// If the crate exposes this, use it. Otherwise, len() is the best estimate.
// Investigate at implementation time.
self.inner.size() // may need adjustment
}
```
### Error Handling
- All USearch errors are mapped to `VectorError::Backend(String)` with the original error message.
- Dimension checks happen before any FFI call to provide clear Rust-side errors.
- I/O errors from persistence are mapped to `VectorError::Io` when possible, `VectorError::Backend` otherwise.
- If `reserve()` is not called before insertion and USearch fails, the error is `VectorError::Backend` with a message suggesting `reserve()`.
## Test Strategy
### Integration Tests
```rust
// === tests/vector_usearch.rs (integration test) ===
use tidaldb::storage::vector::*;
use rand::Rng;
/// Generate a random unit vector of the given dimension.
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
let v: Vec<f32> = (0..dim).map(|_| rng.gen::<f32>() - 0.5).collect();
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
v.iter().map(|x| x / norm).collect()
}
#[test]
fn usearch_insert_and_search_1000_vectors() {
let dim = 128; // smaller dim for test speed
let config = VectorIndexConfig {
dimensions: dim,
metric: DistanceMetric::L2,
quantization: QuantizationLevel::F16,
connectivity: 16,
ef_construction: 200,
ef_search: 200,
};
let usearch_index = UsearchIndex::new(config.clone()).unwrap();
usearch_index.reserve(2000).unwrap();
let brute_index = BruteForceIndex::new(config.clone());
let mut rng = rand::thread_rng();
let vectors: Vec<(u64, Vec<f32>)> = (0..1000)
.map(|id| (id, random_unit_vector(dim, &mut rng)))
.collect();
// Insert into both indexes
for (id, v) in &vectors {
usearch_index.insert(*id, v).unwrap();
brute_index.insert(*id, v).unwrap();
}
// Search with 10 random queries, measure recall
let mut total_recall = 0.0;
let k = 100;
let n_queries = 10;
for _ in 0..n_queries {
let query = random_unit_vector(dim, &mut rng);
let exact_results = brute_index.search(&query, k, 0).unwrap();
let approx_results = usearch_index.search(&query, k, 0).unwrap();
let exact_ids: std::collections::HashSet<u64> =
exact_results.iter().map(|r| r.id).collect();
let approx_ids: std::collections::HashSet<u64> =
approx_results.iter().map(|r| r.id).collect();
let overlap = exact_ids.intersection(&approx_ids).count();
let recall = overlap as f64 / k as f64;
total_recall += recall;
}
let mean_recall = total_recall / n_queries as f64;
assert!(mean_recall > 0.90,
"recall@{k} should be > 0.90, got {mean_recall:.3}");
}
#[test]
fn usearch_filtered_search_excludes_non_matching() {
let dim = 64;
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(200).unwrap();
let mut rng = rand::thread_rng();
for id in 0..100u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
// Only include even IDs
let query = random_unit_vector(dim, &mut rng);
let results = index.filtered_search(&query, 50, 0, &|id| id % 2 == 0).unwrap();
for r in &results {
assert!(r.id % 2 == 0, "filtered_search returned odd ID {}", r.id);
}
}
#[test]
fn usearch_delete_excludes_from_results() {
let dim = 64;
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(200).unwrap();
let mut rng = rand::thread_rng();
let vectors: Vec<(u64, Vec<f32>)> = (0..50)
.map(|id| (id, random_unit_vector(dim, &mut rng)))
.collect();
for (id, v) in &vectors {
index.insert(*id, v).unwrap();
}
// Delete ID 0
index.delete(0).unwrap();
// Search for the deleted vector -- it should not appear
let results = index.search(&vectors[0].1, 50, 0).unwrap();
assert!(results.iter().all(|r| r.id != 0),
"deleted vector should not appear in results");
}
#[test]
fn usearch_save_load_roundtrip() {
let dim = 64;
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config.clone()).unwrap();
index.reserve(200).unwrap();
let mut rng = rand::thread_rng();
for id in 0..100u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let dir = tempfile::tempdir().unwrap();
let path = dir.path().join("test.usearch");
// Save
index.save(&path).unwrap();
// Load
let loaded = UsearchIndex::load(&path, &config).unwrap();
assert_eq!(loaded.len(), 100);
// Search on loaded index should produce similar results
let query = random_unit_vector(dim, &mut rng);
let results_orig = index.search(&query, 10, 0).unwrap();
let results_loaded = loaded.search(&query, 10, 0).unwrap();
// Top-1 should match (high probability for exact same index)
assert_eq!(results_orig[0].id, results_loaded[0].id);
}
#[test]
fn usearch_view_readonly() {
let dim = 64;
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config.clone()).unwrap();
index.reserve(100).unwrap();
let mut rng = rand::thread_rng();
for id in 0..50u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let dir = tempfile::tempdir().unwrap();
let path = dir.path().join("test.usearch");
index.save(&path).unwrap();
// View (mmap read-only)
let viewed = UsearchIndex::view(&path, &config).unwrap();
assert_eq!(viewed.len(), 50);
// Search should work on view'd index
let query = random_unit_vector(dim, &mut rng);
let results = viewed.search(&query, 10, 0).unwrap();
assert!(!results.is_empty());
}
#[test]
fn usearch_dimension_mismatch() {
let config = VectorIndexConfig {
dimensions: 64,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(10).unwrap();
// Wrong dimension on insert
let result = index.insert(1, &[1.0; 32]); // 32 dims instead of 64
assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 64, got: 32 })));
// Wrong dimension on search
index.insert(1, &[0.0; 64]).unwrap();
let result = index.search(&[1.0; 32], 1, 0);
assert!(matches!(result, Err(VectorError::DimensionMismatch { .. })));
}
#[test]
fn usearch_is_send_and_sync() {
fn assert_send_sync<T: Send + Sync>() {}
assert_send_sync::<UsearchIndex>();
}
#[test]
fn usearch_recall_at_10k() {
// Larger recall test at 10K vectors, matching the phase acceptance criteria.
// Uses smaller dimensions (128) for test speed.
let dim = 128;
let n = 10_000;
let k = 100;
let config = VectorIndexConfig {
dimensions: dim,
metric: DistanceMetric::L2,
quantization: QuantizationLevel::F16,
connectivity: 16,
ef_construction: 200,
ef_search: 200,
};
let usearch_index = UsearchIndex::new(config.clone()).unwrap();
usearch_index.reserve(n * 2).unwrap();
let brute_index = BruteForceIndex::new(config);
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
usearch_index.insert(id, &v).unwrap();
brute_index.insert(id, &v).unwrap();
}
// 10 queries, compute mean recall@100
let mut total_recall = 0.0;
for _ in 0..10 {
let query = random_unit_vector(dim, &mut rng);
let exact = brute_index.search(&query, k, 0).unwrap();
let approx = usearch_index.search(&query, k, 0).unwrap();
let exact_ids: std::collections::HashSet<u64> = exact.iter().map(|r| r.id).collect();
let approx_ids: std::collections::HashSet<u64> = approx.iter().map(|r| r.id).collect();
let recall = exact_ids.intersection(&approx_ids).count() as f64 / k as f64;
total_recall += recall;
}
let mean_recall = total_recall / 10.0;
assert!(mean_recall > 0.95,
"recall@{k} at {n} vectors should be > 0.95, got {mean_recall:.3}");
}
```
## Acceptance Criteria
- [ ] `UsearchIndex` wraps `usearch::Index` from the `usearch` crate
- [ ] `UsearchIndex` implements `VectorIndex` trait (all methods)
- [ ] Default config: f16 quantization, M=16, ef_construction=200, ef_search=200, L2sq metric
- [ ] `insert()` validates dimensions before FFI call
- [ ] `search()` returns results sorted by ascending L2 distance
- [ ] `filtered_search()` passes predicate closure to USearch's callback API; all returned results satisfy the predicate
- [ ] `delete()` tombstones the vector; it is excluded from subsequent search results
- [ ] `reserve()` pre-allocates capacity in USearch
- [ ] `save()` persists the full index to disk
- [ ] `load()` restores a writable index from disk; search produces identical results
- [ ] `view()` memory-maps the index for read-only search
- [ ] `#[allow(unsafe_code)]` scoped to `usearch.rs` only
- [ ] Every `unsafe` block has a `// SAFETY:` comment
- [ ] Integration test: 1000 vectors, 10 queries, recall@100 > 0.90
- [ ] Integration test: 10K vectors, recall@100 > 0.95 (matching phase acceptance criteria)
- [ ] Integration test: filtered_search returns only predicate-matching results
- [ ] Integration test: save/load roundtrip preserves search results
- [ ] `UsearchIndex` is `Send + Sync`
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All integration tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- USearch evaluation: 127K QPS at f32, 167K QPS at int8, ScyllaDB validates concurrent operation at 1B vectors, f16 as optimal default (half memory, < 1% recall loss), `filtered_search(query, k, |key| predicate(key))` implements in-graph filtering, `view()` for zero-copy mmap serving
## Spec References
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 2 (HNSW internals: M=16, ef_construction=200, ef_search=200, L2 distance over normalized vectors), Section 3 (filtered ANN: USearch predicate callback, in-graph filtering preserves graph navigation), Section 4 (quantization: f16 default, ScalarKind mapping), Section 7 (persistence: save/load/view lifecycle, checkpoint coordination), Section 11 (UsearchIndex implementation sketch), Section 12 (performance targets: < 10ms ANN at 10K, recall@100 > 95%)
## Implementation Notes
- Add `usearch = "2"` (or the latest stable version with `filtered_search` support) to `tidal/Cargo.toml` `[dependencies]`.
- Change `[lints.rust] unsafe_code` from `"forbid"` to `"deny"` in `Cargo.toml`. Add a comment: `# deny (not forbid) to allow #[allow(unsafe_code)] in usearch FFI module`.
- Add `rand = "0.9"` to `[dev-dependencies]` for random vector generation in tests.
- The `usearch` crate depends on `cxx` for C++ interop. This adds a C++ compiler requirement to the build. Document this in a top-level build note.
- If USearch does not expose a way to distinguish live vs tombstoned vectors, `len_live()` should track deletions via an internal `AtomicUsize` counter decremented on each `delete()` call.
- The `view()` method signature in the `VectorIndex` trait now takes `(path, config)` per the updated trait definition in Task 01. USearch requires knowing the index dimensions/metric to initialize the mmap'd index, so the config parameter is passed through to USearch construction before calling `view()`.
- Do NOT implement per-query `ef_search` override in this task if the USearch crate does not support it cleanly. Accept the parameter, log a debug warning if it differs from the default, and use the index-level default. Per-query override can be added when the adaptive query planner (Task 04) needs it.
- Do NOT wrap `UsearchIndex` in `RwLock` unless testing reveals that concurrent `insert` + `search` causes data races. USearch claims thread safety for concurrent reads and writes. Verify in the integration test by running searches and inserts from multiple threads.

View File

@ -0,0 +1,820 @@
# Task 03: Embedding Lifecycle + Slot Registry
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** Task 01 (VectorIndex trait, VectorError, VectorIndexConfig, QuantizationLevel)
**Blocks:** m2p5 (RETRIEVE executor -- needs embedding insert path for write_item with embeddings)
**Complexity:** M
## Objective
Deliver the embedding lifecycle operations (`l2_normalize`, insert, update, delete) and the `EmbeddingSlotRegistry` that maps named embedding slots to their HNSW indexes. This is the layer between the entity write API (`write_item()` with an embedding) and the raw `VectorIndex` trait.
When an application writes an item with an embedding, the lifecycle layer:
1. Validates that the dimensions match the slot definition.
2. L2-normalizes the vector to unit length (so L2 distance = cosine similarity).
3. Stores the full-precision (f32) normalized vector in the entity store as the source of truth.
4. Inserts the vector into the HNSW index (which quantizes to f16/int8 internally).
The `EmbeddingSlotRegistry` is the central authority for embedding slot configuration. It maps `(EntityKind, slot_name)` to `EmbeddingSlotState` which contains the HNSW index, dimensions, quantization level, and HNSW parameters. The registry is constructed from the schema at `TidalDB::open()` time.
Embeddings in the entity store use the key format `encode_key(entity_id, Tag::Meta, b"EMB:slot_name")`. This co-locates embedding data with entity metadata under the same entity prefix, enabling efficient prefix scans for entity-level operations.
## Requirements
- `l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError>` normalizes to unit length
- `l2_normalize` fails with `VectorError::ZeroNormVector` on zero-norm input
- `l2_normalize` verifies the result: `|1.0 - ||result||| < 1e-5`
- `EmbeddingSlotRegistry` maps `(EntityKind, String)` to `EmbeddingSlotState`
- `EmbeddingSlotState` holds: `Box<dyn VectorIndex>`, `dimensions`, `quantization`, `source`, `params`
- `EmbeddingSource` enum: `External` (provided by application), `DatabaseManaged` (computed by tidalDB)
- Insert path: validate dims, normalize, store in entity store, insert into HNSW
- Update path: validate dims, normalize, update entity store, tombstone old in HNSW, insert new
- Delete path: tombstone in HNSW, optionally remove from entity store
- Entity store key format: `encode_key(entity_id, Tag::Meta, b"EMB:slot_name")`
- Entity store value format: `[dimensions: 4 bytes LE][vector: dimensions * 4 bytes, f32 LE]`
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/storage/vector/
lifecycle.rs -- l2_normalize, EmbeddingOps (insert/update/delete helpers)
registry.rs -- EmbeddingSlotRegistry, EmbeddingSlotState, EmbeddingSource, HnswParams
```
### Public API
```rust
// === storage/vector/lifecycle.rs ===
use super::{VectorError, VectorId, VectorIndex};
use crate::schema::EntityId;
use crate::storage::{StorageEngine, Tag, encode_key};
/// L2-normalize a vector to unit length.
///
/// Computes `v[i] = v[i] / ||v||` where `||v|| = sqrt(sum(v[i]^2))`.
///
/// For L2-normalized vectors, L2 distance is equivalent to cosine distance:
/// `||a - b||^2 = 2 - 2 * cos(a, b)`.
///
/// # Errors
///
/// Returns `VectorError::ZeroNormVector` if the vector has zero norm (all zeros).
/// A zero vector has no direction and cannot participate in cosine similarity.
///
/// # Post-conditions
///
/// The returned vector has L2 norm within `1e-5` of 1.0.
pub fn l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> {
let norm_sq: f32 = v.iter().map(|x| x * x).sum();
if norm_sq < f32::EPSILON {
return Err(VectorError::ZeroNormVector);
}
let norm = norm_sq.sqrt();
let result: Vec<f32> = v.iter().map(|x| x / norm).collect();
// Post-condition: verify normalization
debug_assert!({
let result_norm: f32 = result.iter().map(|x| x * x).sum::<f32>().sqrt();
(1.0 - result_norm).abs() < 1e-5
});
Ok(result)
}
/// Build the entity store key for an embedding slot.
///
/// Format: `encode_key(entity_id, Tag::Meta, b"EMB:slot_name")`
pub fn embedding_store_key(entity_id: EntityId, slot_name: &str) -> Vec<u8> {
let suffix = format!("EMB:{slot_name}");
encode_key(entity_id, Tag::Meta, suffix.as_bytes())
}
/// Serialize an embedding vector for entity store storage.
///
/// Format: `[dimensions: 4 bytes LE][vector: dimensions * 4 bytes, f32 LE]`
pub fn serialize_embedding(v: &[f32]) -> Vec<u8> {
let mut buf = Vec::with_capacity(4 + v.len() * 4);
buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
for &x in v {
buf.extend_from_slice(&x.to_le_bytes());
}
buf
}
/// Deserialize an embedding vector from entity store storage.
///
/// Returns the f32 vector or an error if the data is corrupt.
pub fn deserialize_embedding(bytes: &[u8]) -> Result<Vec<f32>, VectorError> {
if bytes.len() < 4 {
return Err(VectorError::CorruptedIndex(
"embedding data too short for dimension header".into()));
}
let dim = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
let expected_len = 4 + dim * 4;
if bytes.len() != expected_len {
return Err(VectorError::CorruptedIndex(
format!("embedding data length {} != expected {expected_len}", bytes.len())));
}
let mut v = Vec::with_capacity(dim);
for i in 0..dim {
let offset = 4 + i * 4;
let x = f32::from_le_bytes([
bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3],
]);
v.push(x);
}
Ok(v)
}
/// Insert an embedding for an entity.
///
/// 1. Validates dimensions match the expected `dimensions`.
/// 2. L2-normalizes the vector.
/// 3. Stores the normalized f32 vector in the entity store.
/// 4. Inserts the normalized vector into the HNSW index.
///
/// The entity store is the source of truth. The HNSW index is derived state.
pub fn insert_embedding(
entity_id: EntityId,
slot_name: &str,
raw_vector: &[f32],
expected_dimensions: usize,
index: &dyn VectorIndex,
storage: &dyn StorageEngine,
) -> Result<(), VectorError> {
// Validate dimensions
if raw_vector.len() != expected_dimensions {
return Err(VectorError::DimensionMismatch {
expected: expected_dimensions,
got: raw_vector.len(),
});
}
// Normalize
let normalized = l2_normalize(raw_vector)?;
// Store in entity store (source of truth)
let key = embedding_store_key(entity_id, slot_name);
let value = serialize_embedding(&normalized);
storage.put(&key, &value)
.map_err(|e| VectorError::Backend(format!("entity store write failed: {e}")))?;
// Insert into HNSW index
index.insert(entity_id.as_u64(), &normalized)?;
Ok(())
}
/// Update an embedding for an entity.
///
/// 1. Validates dimensions.
/// 2. L2-normalizes the new vector.
/// 3. Updates the entity store.
/// 4. Tombstones the old vector in HNSW.
/// 5. Inserts the new vector into HNSW.
///
/// Note: Between steps 4 and 5, the entity is absent from ANN results.
/// This window is microseconds and is acceptable per Spec 07, Section 6.
pub fn update_embedding(
entity_id: EntityId,
slot_name: &str,
raw_vector: &[f32],
expected_dimensions: usize,
index: &dyn VectorIndex,
storage: &dyn StorageEngine,
) -> Result<(), VectorError> {
if raw_vector.len() != expected_dimensions {
return Err(VectorError::DimensionMismatch {
expected: expected_dimensions,
got: raw_vector.len(),
});
}
let normalized = l2_normalize(raw_vector)?;
// Update entity store
let key = embedding_store_key(entity_id, slot_name);
let value = serialize_embedding(&normalized);
storage.put(&key, &value)
.map_err(|e| VectorError::Backend(format!("entity store write failed: {e}")))?;
// Tombstone old in HNSW, insert new
// delete() may return NotFound if the entity was never indexed (first embedding).
// That is fine -- ignore NotFound on the delete step.
let _ = index.delete(entity_id.as_u64());
index.insert(entity_id.as_u64(), &normalized)?;
Ok(())
}
/// Delete an embedding for an entity.
///
/// 1. Tombstones the vector in HNSW.
/// 2. Optionally removes the embedding from the entity store.
///
/// For archive (soft delete): tombstone HNSW only, keep entity store data.
/// For hard delete: tombstone HNSW and remove entity store key.
pub fn delete_embedding(
entity_id: EntityId,
slot_name: &str,
index: &dyn VectorIndex,
storage: &dyn StorageEngine,
hard_delete: bool,
) -> Result<(), VectorError> {
// Tombstone in HNSW
index.delete(entity_id.as_u64())?;
// Optionally remove from entity store
if hard_delete {
let key = embedding_store_key(entity_id, slot_name);
storage.delete(&key)
.map_err(|e| VectorError::Backend(format!("entity store delete failed: {e}")))?;
}
Ok(())
}
```
### EmbeddingSlotRegistry
```rust
// === storage/vector/registry.rs ===
use std::collections::HashMap;
use crate::schema::EntityKind;
use super::{VectorIndex, VectorIndexConfig, QuantizationLevel};
/// HNSW parameters for an embedding slot.
#[derive(Debug, Clone)]
pub struct HnswParams {
pub connectivity: usize,
pub ef_construction: usize,
pub ef_search: usize,
}
impl Default for HnswParams {
fn default() -> Self {
Self {
connectivity: 16,
ef_construction: 200,
ef_search: 200,
}
}
}
/// Source of an embedding slot's vectors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbeddingSource {
/// Provided by the application via `write_item()` or `write_user()`.
External,
/// Computed and maintained by tidalDB (e.g., user preference vector,
/// creator catalog embedding).
DatabaseManaged,
}
/// State for a single embedding slot.
pub struct EmbeddingSlotState {
/// The HNSW index for this slot.
pub index: Box<dyn VectorIndex>,
/// Number of dimensions for this slot.
pub dimensions: usize,
/// Quantization level used in the HNSW index.
pub quantization: QuantizationLevel,
/// Whether this embedding is externally provided or database-managed.
pub source: EmbeddingSource,
/// HNSW graph parameters.
pub params: HnswParams,
}
/// Registry of all embedding slots across all entity types.
///
/// Constructed from the schema at `TidalDB::open()` time. Each entity type
/// can define up to 4 embedding slots (per Entity Model Specification).
///
/// # Example
///
/// ```text
/// Item "content" -> 1536d, f16, External, M=16
/// Item "visual" -> 512d, f16, External, M=16
/// User "preference" -> 1536d, f16, DatabaseManaged, M=16
/// ```
pub struct EmbeddingSlotRegistry {
slots: HashMap<(EntityKind, String), EmbeddingSlotState>,
}
impl EmbeddingSlotRegistry {
/// Create an empty registry.
pub fn new() -> Self {
Self { slots: HashMap::new() }
}
/// Register an embedding slot.
///
/// # Errors
///
/// Returns an error if a slot with the same `(entity_kind, slot_name)` already exists.
pub fn register(
&mut self,
entity_kind: EntityKind,
slot_name: String,
state: EmbeddingSlotState,
) -> Result<(), VectorError>;
/// Look up an embedding slot by entity kind and slot name.
///
/// Returns `None` if the slot is not registered.
pub fn get(&self, entity_kind: EntityKind, slot_name: &str) -> Option<&EmbeddingSlotState>;
/// Look up an embedding slot mutably.
pub fn get_mut(
&mut self,
entity_kind: EntityKind,
slot_name: &str,
) -> Option<&mut EmbeddingSlotState>;
/// List all slot names for a given entity kind.
pub fn slots_for(&self, entity_kind: EntityKind) -> Vec<&str>;
/// Total number of registered slots across all entity kinds.
pub fn slot_count(&self) -> usize {
self.slots.len()
}
/// Save all indexes to disk under the given directory.
///
/// File naming: `{data_dir}/vector/{entity_kind}_{slot_name}.usearch`
pub fn save_all(&self, data_dir: &std::path::Path) -> Result<(), VectorError>;
/// Load all indexes from disk.
///
/// Uses `view()` for immediate read serving, then optionally `load()` for
/// writable access in the background.
pub fn load_all(&mut self, data_dir: &std::path::Path) -> Result<(), VectorError>;
}
```
### Error Handling
- `l2_normalize()` with zero vector: returns `VectorError::ZeroNormVector`.
- Dimension mismatch on insert/update: returns `VectorError::DimensionMismatch`.
- Entity store I/O failure: returns `VectorError::Backend` wrapping the storage error.
- Corrupt embedding data on deserialize: returns `VectorError::CorruptedIndex`.
- Duplicate slot registration: returns `VectorError::Backend("slot already registered: ...")`.
- Slot not found in registry: returns `None` (not an error -- callers check before use).
## Test Strategy
### Property Tests
```rust
use proptest::prelude::*;
// l2_normalize produces unit vectors.
proptest! {
#[test]
fn normalize_produces_unit_vector(
v in prop::collection::vec(-100.0f32..100.0, 2..256),
) {
// Skip zero vectors (they fail normalization, which is correct)
let norm_sq: f32 = v.iter().map(|x| x * x).sum();
prop_assume!(norm_sq > f32::EPSILON);
let normalized = l2_normalize(&v).unwrap();
let result_norm: f32 = normalized.iter().map(|x| x * x).sum::<f32>().sqrt();
prop_assert!(
(1.0 - result_norm).abs() < 1e-5,
"norm was {result_norm}, expected ~1.0"
);
}
}
// l2_normalize is idempotent: normalizing a unit vector returns the same vector.
proptest! {
#[test]
fn normalize_idempotent(
v in prop::collection::vec(-100.0f32..100.0, 2..256),
) {
let norm_sq: f32 = v.iter().map(|x| x * x).sum();
prop_assume!(norm_sq > f32::EPSILON);
let first = l2_normalize(&v).unwrap();
let second = l2_normalize(&first).unwrap();
for (a, b) in first.iter().zip(second.iter()) {
prop_assert!((a - b).abs() < 1e-5,
"idempotent check failed: {a} vs {b}");
}
}
}
// l2_normalize preserves direction (cosine similarity with original = 1.0).
proptest! {
#[test]
fn normalize_preserves_direction(
v in prop::collection::vec(1.0f32..100.0, 2..256),
) {
let normalized = l2_normalize(&v).unwrap();
// Cosine similarity between v and normalized(v) should be ~1.0
let dot: f32 = v.iter().zip(normalized.iter()).map(|(a, b)| a * b).sum();
let norm_v: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
let cosine = dot / norm_v; // normalized already has norm 1
prop_assert!(
(1.0 - cosine).abs() < 1e-4,
"cosine similarity with original was {cosine}, expected ~1.0"
);
}
}
// Embedding serialize/deserialize roundtrip.
proptest! {
#[test]
fn embedding_serde_roundtrip(
v in prop::collection::vec(-1.0f32..1.0, 1..512),
) {
let bytes = serialize_embedding(&v);
let restored = deserialize_embedding(&bytes).unwrap();
prop_assert_eq!(v.len(), restored.len());
for (a, b) in v.iter().zip(restored.iter()) {
prop_assert!((a - b).abs() < 1e-7,
"serde mismatch: {a} vs {b}");
}
}
}
// Insert + search roundtrip via BruteForceIndex.
proptest! {
#[test]
fn insert_embedding_searchable(
dim in 2usize..64,
n in 1usize..50,
) {
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
for id in 0..n as u64 {
let raw: Vec<f32> = (0..dim).map(|i| ((id as usize + i) % 100) as f32 / 100.0 + 0.01).collect();
insert_embedding(
EntityId::new(id + 1),
"content",
&raw,
dim,
&index,
&storage,
).unwrap();
}
// Verify all are searchable
prop_assert_eq!(index.len(), n);
// Verify entity store has the normalized vectors
for id in 0..n as u64 {
let key = embedding_store_key(EntityId::new(id + 1), "content");
let bytes = storage.get(&key).unwrap();
prop_assert!(bytes.is_some(), "entity store should have embedding for id {id}");
let stored = deserialize_embedding(&bytes.unwrap()).unwrap();
let norm: f32 = stored.iter().map(|x| x * x).sum::<f32>().sqrt();
prop_assert!((1.0 - norm).abs() < 1e-5, "stored embedding should be normalized");
}
}
}
```
### Unit Tests
```rust
#[test]
fn l2_normalize_unit_vector() {
let v = vec![1.0, 0.0, 0.0];
let normalized = l2_normalize(&v).unwrap();
assert!((normalized[0] - 1.0).abs() < 1e-6);
assert!(normalized[1].abs() < 1e-6);
assert!(normalized[2].abs() < 1e-6);
}
#[test]
fn l2_normalize_non_unit_vector() {
let v = vec![3.0, 4.0]; // norm = 5
let normalized = l2_normalize(&v).unwrap();
assert!((normalized[0] - 0.6).abs() < 1e-5);
assert!((normalized[1] - 0.8).abs() < 1e-5);
let norm: f32 = normalized.iter().map(|x| x * x).sum::<f32>().sqrt();
assert!((1.0 - norm).abs() < 1e-5);
}
#[test]
fn l2_normalize_zero_vector_fails() {
let v = vec![0.0, 0.0, 0.0];
let result = l2_normalize(&v);
assert!(matches!(result, Err(VectorError::ZeroNormVector)));
}
#[test]
fn l2_normalize_near_zero_vector_fails() {
let v = vec![1e-40, 0.0, 0.0]; // norm^2 < f32::EPSILON
let result = l2_normalize(&v);
assert!(matches!(result, Err(VectorError::ZeroNormVector)));
}
#[test]
fn serialize_deserialize_embedding() {
let v = vec![1.0, 2.0, 3.0];
let bytes = serialize_embedding(&v);
assert_eq!(bytes.len(), 4 + 3 * 4); // 4 dim header + 12 data
let restored = deserialize_embedding(&bytes).unwrap();
assert_eq!(v, restored);
}
#[test]
fn deserialize_embedding_truncated() {
let result = deserialize_embedding(&[0x03, 0x00, 0x00]); // too short for header
assert!(matches!(result, Err(VectorError::CorruptedIndex(_))));
}
#[test]
fn deserialize_embedding_wrong_length() {
let mut bytes = serialize_embedding(&[1.0, 2.0]);
bytes.pop(); // truncate one byte
let result = deserialize_embedding(&bytes);
assert!(matches!(result, Err(VectorError::CorruptedIndex(_))));
}
#[test]
fn embedding_store_key_format() {
let key = embedding_store_key(EntityId::new(42), "content");
let (eid, tag, suffix) = parse_key(&key).unwrap();
assert_eq!(eid, EntityId::new(42));
assert_eq!(tag, Tag::Meta);
assert_eq!(suffix, b"EMB:content");
}
#[test]
fn embedding_store_key_different_slots() {
let key_content = embedding_store_key(EntityId::new(1), "content");
let key_visual = embedding_store_key(EntityId::new(1), "visual");
assert_ne!(key_content, key_visual);
}
#[test]
fn insert_embedding_validates_dimensions() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
let result = insert_embedding(
EntityId::new(1), "content", &[1.0, 2.0], 3, &index, &storage,
);
assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 3, got: 2 })));
}
#[test]
fn insert_embedding_stores_normalized_vector() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
insert_embedding(
EntityId::new(1), "content", &[3.0, 4.0, 0.0], 3, &index, &storage,
).unwrap();
// Read from entity store
let key = embedding_store_key(EntityId::new(1), "content");
let bytes = storage.get(&key).unwrap().unwrap();
let stored = deserialize_embedding(&bytes).unwrap();
// Should be normalized (norm = 5, so [0.6, 0.8, 0.0])
assert!((stored[0] - 0.6).abs() < 1e-5);
assert!((stored[1] - 0.8).abs() < 1e-5);
assert!(stored[2].abs() < 1e-5);
}
#[test]
fn insert_embedding_zero_vector_fails() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
let result = insert_embedding(
EntityId::new(1), "content", &[0.0, 0.0, 0.0], 3, &index, &storage,
);
assert!(matches!(result, Err(VectorError::ZeroNormVector)));
}
#[test]
fn update_embedding_replaces_vector() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
// Insert original
insert_embedding(
EntityId::new(1), "content", &[1.0, 0.0, 0.0], 3, &index, &storage,
).unwrap();
// Update
update_embedding(
EntityId::new(1), "content", &[0.0, 1.0, 0.0], 3, &index, &storage,
).unwrap();
// Search should find the updated vector
let results = index.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
assert_eq!(results[0].id, 1);
assert!(results[0].distance < 1e-5, "should match updated vector");
}
#[test]
fn delete_embedding_removes_from_index() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
insert_embedding(
EntityId::new(1), "content", &[1.0, 0.0, 0.0], 3, &index, &storage,
).unwrap();
delete_embedding(EntityId::new(1), "content", &index, &storage, false).unwrap();
// Should not appear in search results
let results = index.search(&[1.0, 0.0, 0.0], 10, 0).unwrap();
assert!(results.is_empty());
// Soft delete: entity store still has the embedding
let key = embedding_store_key(EntityId::new(1), "content");
assert!(storage.get(&key).unwrap().is_some());
}
#[test]
fn delete_embedding_hard_removes_from_store() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
let storage = InMemoryBackend::new();
insert_embedding(
EntityId::new(1), "content", &[1.0, 0.0, 0.0], 3, &index, &storage,
).unwrap();
delete_embedding(EntityId::new(1), "content", &index, &storage, true).unwrap();
// Entity store should not have the embedding
let key = embedding_store_key(EntityId::new(1), "content");
assert!(storage.get(&key).unwrap().is_none());
}
#[test]
fn registry_register_and_lookup() {
let mut registry = EmbeddingSlotRegistry::new();
let config = VectorIndexConfig { dimensions: 1536, ..VectorIndexConfig::default() };
let state = EmbeddingSlotState {
index: Box::new(BruteForceIndex::new(config)),
dimensions: 1536,
quantization: QuantizationLevel::F16,
source: EmbeddingSource::External,
params: HnswParams::default(),
};
registry.register(EntityKind::Item, "content".into(), state).unwrap();
let slot = registry.get(EntityKind::Item, "content");
assert!(slot.is_some());
assert_eq!(slot.unwrap().dimensions, 1536);
assert_eq!(slot.unwrap().source, EmbeddingSource::External);
}
#[test]
fn registry_duplicate_slot_fails() {
let mut registry = EmbeddingSlotRegistry::new();
let config = VectorIndexConfig { dimensions: 1536, ..VectorIndexConfig::default() };
let state1 = EmbeddingSlotState {
index: Box::new(BruteForceIndex::new(config.clone())),
dimensions: 1536,
quantization: QuantizationLevel::F16,
source: EmbeddingSource::External,
params: HnswParams::default(),
};
let state2 = EmbeddingSlotState {
index: Box::new(BruteForceIndex::new(config)),
dimensions: 1536,
quantization: QuantizationLevel::F16,
source: EmbeddingSource::External,
params: HnswParams::default(),
};
registry.register(EntityKind::Item, "content".into(), state1).unwrap();
let result = registry.register(EntityKind::Item, "content".into(), state2);
assert!(result.is_err());
}
#[test]
fn registry_different_entity_kinds_same_name() {
let mut registry = EmbeddingSlotRegistry::new();
let config = VectorIndexConfig { dimensions: 1536, ..VectorIndexConfig::default() };
let state_item = EmbeddingSlotState {
index: Box::new(BruteForceIndex::new(config.clone())),
dimensions: 1536,
quantization: QuantizationLevel::F16,
source: EmbeddingSource::External,
params: HnswParams::default(),
};
let state_user = EmbeddingSlotState {
index: Box::new(BruteForceIndex::new(config)),
dimensions: 1536,
quantization: QuantizationLevel::F16,
source: EmbeddingSource::DatabaseManaged,
params: HnswParams::default(),
};
registry.register(EntityKind::Item, "content".into(), state_item).unwrap();
registry.register(EntityKind::User, "content".into(), state_user).unwrap();
let item_slot = registry.get(EntityKind::Item, "content").unwrap();
let user_slot = registry.get(EntityKind::User, "content").unwrap();
assert_eq!(item_slot.source, EmbeddingSource::External);
assert_eq!(user_slot.source, EmbeddingSource::DatabaseManaged);
}
#[test]
fn registry_slots_for_entity_kind() {
let mut registry = EmbeddingSlotRegistry::new();
let config = VectorIndexConfig { dimensions: 128, ..VectorIndexConfig::default() };
for name in &["content", "visual", "audio"] {
let state = EmbeddingSlotState {
index: Box::new(BruteForceIndex::new(config.clone())),
dimensions: 128,
quantization: QuantizationLevel::F16,
source: EmbeddingSource::External,
params: HnswParams::default(),
};
registry.register(EntityKind::Item, (*name).to_string(), state).unwrap();
}
let slots = registry.slots_for(EntityKind::Item);
assert_eq!(slots.len(), 3);
assert!(slots.contains(&"content"));
assert!(slots.contains(&"visual"));
assert!(slots.contains(&"audio"));
// No user slots
let user_slots = registry.slots_for(EntityKind::User);
assert!(user_slots.is_empty());
}
#[test]
fn registry_nonexistent_slot_returns_none() {
let registry = EmbeddingSlotRegistry::new();
assert!(registry.get(EntityKind::Item, "content").is_none());
}
```
## Acceptance Criteria
- [ ] `l2_normalize()` normalizes vectors to unit length within `1e-5` tolerance
- [ ] `l2_normalize()` fails with `VectorError::ZeroNormVector` on zero-norm input
- [ ] `l2_normalize()` is idempotent (normalizing a unit vector returns the same vector)
- [ ] `serialize_embedding()` / `deserialize_embedding()` roundtrip produces identical vectors
- [ ] `embedding_store_key()` produces correct key: `[entity_id][NUL][Tag::Meta][EMB:slot_name]`
- [ ] `insert_embedding()` validates dimensions, normalizes, stores in entity store, inserts into HNSW
- [ ] `update_embedding()` tombstones old vector, inserts new, updates entity store
- [ ] `delete_embedding()` with `hard_delete=false` tombstones HNSW only, preserves entity store
- [ ] `delete_embedding()` with `hard_delete=true` removes from both HNSW and entity store
- [ ] Entity store always contains the full-precision normalized f32 vector (source of truth)
- [ ] `EmbeddingSlotRegistry::register()` stores slot state, rejects duplicates
- [ ] `EmbeddingSlotRegistry::get()` returns the correct slot by `(EntityKind, name)`
- [ ] `EmbeddingSlotRegistry::slots_for()` lists all slots for an entity kind
- [ ] Different entity kinds can have same-named slots without collision
- [ ] All property tests pass: normalize produces unit vectors, normalize is idempotent, serde roundtrip, insert+search roundtrip
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit and property tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "Normalize vectors at insertion time and use L2 distance (equivalent to cosine for unit vectors, and more SIMD-friendly)", capacity planning with 2x over-provision
## Spec References
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 1 (design principle: "Embeddings are L2-normalized at insertion. Cosine similarity is computed as L2 distance over unit vectors"), Section 5 (multiple embedding spaces: EmbeddingSlotRegistry, slot configuration per entity type, up to 4 slots), Section 6 (embedding lifecycle: insert path steps 1-6, update path, delete path, batch operations, normalization edge case), Section 11 (BruteForceIndex as correctness verifier)
- [docs/specs/02-entity-model.md](../../../specs/) -- Embedding slot constraints (up to 4 per entity type), embedding source (External vs DatabaseManaged)
## Implementation Notes
- `l2_normalize` uses `f32::EPSILON` (~1.19e-7) as the zero-norm threshold. This catches both exact zero vectors and vectors with components so small that normalization would overflow or produce denormalized results.
- The entity store key uses `Tag::Meta` (not a new tag) because embeddings are entity metadata. The `EMB:` prefix in the suffix distinguishes embedding keys from other metadata keys. This keeps the key encoding scheme from m1p3 intact without adding new Tag variants.
- `EmbeddingSlotRegistry` is NOT `Send + Sync` by default because `Box<dyn VectorIndex>` behind a `HashMap` requires external synchronization. In production, the registry is owned by `TidalDB` which provides appropriate access control. The registry is constructed once at startup and then used for reads only (except for index persistence operations).
- Do NOT implement batch insert via rayon in this task. Batch insert is an optimization for initial data load that can be added when the RETRIEVE executor (m2p5) needs it. The sequential insert path is correct and sufficient for M2 acceptance criteria.
- Do NOT implement the delta journal for incremental persistence. Full `save()` is fast enough at 10K vectors. Delta journal is deferred to M7 per Open Question 4 in OVERVIEW.md.
- The `save_all()` / `load_all()` methods on the registry coordinate persistence across all embedding slots. The directory structure follows Spec 07, Section 7: `{data_dir}/vector/{entity_kind}_{slot_name}.usearch`.

View File

@ -0,0 +1,815 @@
# Task 04: Adaptive Query Planner + Benchmarks
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** Task 01 (VectorIndex trait, BruteForceIndex, types), Task 02 (UsearchIndex for benchmarking)
**Blocks:** m2p5 (RETRIEVE executor calls the planner to select ANN strategy)
**Complexity:** M
## Objective
Deliver the `AdaptiveQueryPlanner` that evaluates filter selectivity before each ANN query and routes to the optimal strategy. The planner eliminates the single most common failure mode in filtered vector search: using HNSW in-graph filtering on extremely selective predicates (< 1% matching) where recall collapses, or using brute-force on broad predicates (> 20% matching) where linear scan is too slow.
The planner implements the decision tree from Spec 07, Section 9:
- **No filter (100%):** Standard HNSW `search()` -- fastest path, highest recall.
- **Broad filter (> 20%):** In-graph predicate filter via `filtered_search()` -- predicate evaluated during graph traversal, non-matching nodes used for navigation.
- **Danger zone (1-20%):** `filtered_search()` with widened `ef_search` (2-3x normal) -- ACORN-1 approximation to maintain recall under moderate selectivity.
**ef_search concurrency caveat:** Changing `ef_search` per query via `index.change_expansion_search(ef)` mutates global USearch state. For concurrent queries using different strategies, this requires a mutex around the `(change_expansion_search, search)` sequence. For M2, accept this limitation and document it. The `AdaptiveQueryPlanner` should take a `Mutex<Box<dyn VectorIndex>>` or wrap per-query `ef_search` changes in a lock. Alternatively, set `ef_search` conservatively high at construction time (e.g., 400) and skip per-query override for M2. Defer true per-query ef_search to M7 after benchmarking.
- **Very selective (< 1%):** Pre-filter to bitmap, then brute-force L2 scan over the small matched set -- exact results, fast on small sets.
This task also delivers the Criterion benchmarks for the entire vector subsystem, establishing the baseline performance measurements that all future milestones track.
For M2, the `SelectivityEstimator` is a placeholder that accepts an externally provided selectivity value. The real estimator (reading metadata bitmap cardinalities) is wired up when m2p2 (Metadata Indexes and Filter Engine) is implemented. This decoupling allows the planner to be tested and benchmarked independently.
## Requirements
- `AnnStrategy` enum: `Unfiltered`, `InGraphFilter`, `WidenedFilter`, `PreFilterBruteForce`
- `AdaptiveQueryPlanner` selects strategy based on estimated selectivity
- Selectivity thresholds: < 1% brute-force, 1-20% widened filter, > 20% standard filter, 100% unfiltered
- `ef_search` widening: 2x for 5-20% selectivity, 3x for 1-5% selectivity
- `SelectivityEstimator` trait with a placeholder implementation returning caller-provided values
- `AnnQueryStats` struct for per-query observability: estimated selectivity, actual selectivity, strategy, latency, results count
- `PlannerConfig` for threshold tuning: `in_graph_min_selectivity`, `brute_force_max_selectivity`, `ef_search_multiplier_moderate`, `ef_search_multiplier_low`
- Criterion benchmarks: unfiltered search, filtered search at 20% and 5% selectivity, brute-force search, recall@100
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/storage/vector/
planner.rs -- AdaptiveQueryPlanner, SelectivityEstimator, AnnQueryStats, PlannerConfig, AnnStrategy
tidal/benches/
vector.rs -- Criterion benchmarks
```
### Public API
```rust
// === storage/vector/planner.rs ===
use std::time::{Duration, Instant};
use super::{VectorIndex, VectorId, VectorSearchResult, VectorError, VectorIndexConfig};
/// The ANN strategy selected by the query planner.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AnnStrategy {
/// No filter active. Standard HNSW search.
Unfiltered,
/// Filter selectivity > 20%. Standard in-graph predicate filter.
InGraphFilter,
/// Filter selectivity 1-20%. In-graph filter with widened ef_search.
WidenedFilter {
/// The widened ef_search value (2-3x normal).
ef_search: usize,
},
/// Filter selectivity < 1%. Pre-filter to candidate set, then brute-force.
PreFilterBruteForce,
}
/// Configuration for the adaptive query planner's selectivity thresholds.
///
/// These thresholds determine which ANN strategy is selected based on
/// estimated filter selectivity. They can be tuned based on runtime
/// statistics from `AnnQueryStats`.
#[derive(Debug, Clone)]
pub struct PlannerConfig {
/// Minimum selectivity for standard in-graph filtering.
/// Below this, use widened filter or brute-force.
/// Default: 0.20 (20%). Range: [0.05, 0.50].
pub in_graph_min_selectivity: f64,
/// Maximum selectivity for pre-filter + brute-force.
/// Above this, use widened filter instead.
/// Default: 0.01 (1%). Range: [0.001, 0.05].
pub brute_force_max_selectivity: f64,
/// ef_search multiplier for moderate selectivity (5-20%).
/// Default: 2.0.
pub ef_search_multiplier_moderate: f64,
/// ef_search multiplier for low selectivity (1-5%).
/// Default: 3.0.
pub ef_search_multiplier_low: f64,
/// Default ef_search when no override is specified.
/// Default: 200.
pub default_ef_search: usize,
}
impl Default for PlannerConfig {
fn default() -> Self {
Self {
in_graph_min_selectivity: 0.20,
brute_force_max_selectivity: 0.01,
ef_search_multiplier_moderate: 2.0,
ef_search_multiplier_low: 3.0,
default_ef_search: 200,
}
}
}
/// Trait for selectivity estimation.
///
/// The real implementation (m2p2) reads metadata bitmap cardinalities.
/// For m2p1, a placeholder implementation returns caller-provided values.
pub trait SelectivityEstimator: Send + Sync {
/// Estimate the fraction of items matching the given filter.
///
/// Returns a value in [0.0, 1.0]:
/// - 1.0 means no filter (all items match).
/// - 0.01 means ~1% of items match.
/// - 0.0 means nothing matches (empty result guaranteed).
fn estimate_selectivity(&self, filter: &dyn Fn(VectorId) -> bool) -> f64;
}
/// Placeholder estimator that always returns a fixed selectivity.
///
/// Used for m2p1 testing before metadata indexes (m2p2) exist.
/// Callers set the selectivity directly.
pub struct FixedSelectivityEstimator {
selectivity: f64,
}
impl FixedSelectivityEstimator {
pub fn new(selectivity: f64) -> Self {
Self { selectivity: selectivity.clamp(0.0, 1.0) }
}
/// Update the fixed selectivity value.
pub fn set_selectivity(&mut self, selectivity: f64) {
self.selectivity = selectivity.clamp(0.0, 1.0);
}
}
impl SelectivityEstimator for FixedSelectivityEstimator {
fn estimate_selectivity(&self, _filter: &dyn Fn(VectorId) -> bool) -> f64 {
self.selectivity
}
}
/// Statistics collected per ANN query for planner observability.
#[derive(Debug, Clone)]
pub struct AnnQueryStats {
/// Estimated selectivity before execution.
pub estimated_selectivity: f64,
/// Strategy selected by the planner.
pub strategy: AnnStrategy,
/// Number of results returned.
pub results_returned: usize,
/// Requested K.
pub requested_k: usize,
/// Wall clock time for the ANN query.
pub latency: Duration,
}
/// The adaptive query planner for filtered ANN search.
///
/// Evaluates filter selectivity and selects the optimal ANN strategy
/// for each query. Logs the plan at DEBUG level for observability.
///
/// # Strategy Selection
///
/// ```text
/// selectivity = 100% (no filter) -> Unfiltered
/// selectivity > 20% -> InGraphFilter (standard ef_search)
/// selectivity 5-20% -> WidenedFilter (2x ef_search)
/// selectivity 1-5% -> WidenedFilter (3x ef_search)
/// selectivity < 1% -> PreFilterBruteForce
/// ```
pub struct AdaptiveQueryPlanner {
config: PlannerConfig,
}
impl AdaptiveQueryPlanner {
pub fn new(config: PlannerConfig) -> Self {
Self { config }
}
pub fn with_defaults() -> Self {
Self::new(PlannerConfig::default())
}
/// Select the ANN strategy for a query based on estimated selectivity.
///
/// If `selectivity` is 1.0, returns `Unfiltered`.
/// Otherwise, applies the threshold decision tree.
pub fn select_strategy(&self, selectivity: f64) -> AnnStrategy {
if (selectivity - 1.0).abs() < f64::EPSILON || selectivity > 1.0 {
return AnnStrategy::Unfiltered;
}
if selectivity >= self.config.in_graph_min_selectivity {
return AnnStrategy::InGraphFilter;
}
if selectivity >= self.config.brute_force_max_selectivity {
let multiplier = if selectivity >= 0.05 {
self.config.ef_search_multiplier_moderate
} else {
self.config.ef_search_multiplier_low
};
let ef = (self.config.default_ef_search as f64 * multiplier) as usize;
return AnnStrategy::WidenedFilter { ef_search: ef };
}
AnnStrategy::PreFilterBruteForce
}
/// Execute an ANN query using the selected strategy.
///
/// This is the top-level entry point called by the RETRIEVE executor.
/// It estimates selectivity, selects a strategy, executes the search,
/// and returns results with query statistics.
///
/// # Arguments
///
/// * `index` -- The HNSW index to search.
/// * `query` -- The query vector (L2-normalized).
/// * `k` -- Number of results to return.
/// * `filter` -- Optional filter predicate. If `None`, unfiltered search.
/// * `selectivity` -- Estimated selectivity (provided by estimator or caller).
/// * `brute_force_index` -- Optional brute-force index for pre-filter fallback.
/// If `None` and strategy is `PreFilterBruteForce`, falls back to `WidenedFilter`.
pub fn execute(
&self,
index: &dyn VectorIndex,
query: &[f32],
k: usize,
filter: Option<&dyn Fn(VectorId) -> bool>,
selectivity: f64,
brute_force_index: Option<&dyn VectorIndex>,
) -> Result<(Vec<VectorSearchResult>, AnnQueryStats), VectorError> {
let strategy = match &filter {
None => AnnStrategy::Unfiltered,
Some(_) => self.select_strategy(selectivity),
};
let start = Instant::now();
let results = match (&strategy, filter) {
(AnnStrategy::Unfiltered, _) => {
index.search(query, k, self.config.default_ef_search)?
}
(AnnStrategy::InGraphFilter, Some(f)) => {
index.filtered_search(query, k, self.config.default_ef_search, f)?
}
(AnnStrategy::WidenedFilter { ef_search }, Some(f)) => {
index.filtered_search(query, k, *ef_search, f)?
}
(AnnStrategy::PreFilterBruteForce, Some(f)) => {
match brute_force_index {
Some(bf) => bf.filtered_search(query, k, 0, f)?,
None => {
// Fallback: use widened filter if no brute-force index
let ef = (self.config.default_ef_search as f64
* self.config.ef_search_multiplier_low) as usize;
index.filtered_search(query, k, ef, f)?
}
}
}
_ => {
// Filter is None but strategy is not Unfiltered -- should not happen.
// Defensive: run unfiltered.
index.search(query, k, self.config.default_ef_search)?
}
};
let latency = start.elapsed();
let stats = AnnQueryStats {
estimated_selectivity: selectivity,
strategy,
results_returned: results.len(),
requested_k: k,
latency,
};
Ok((results, stats))
}
/// Get the current planner configuration.
pub fn config(&self) -> &PlannerConfig {
&self.config
}
}
```
### Error Handling
- All errors propagate from the underlying `VectorIndex` methods.
- If `PreFilterBruteForce` is selected but no brute-force index is available, the planner falls back to `WidenedFilter` with `ef_search_multiplier_low`. This is logged at WARN level.
- If a filtered search returns fewer than `k` results (recall underflow), this is captured in `AnnQueryStats::results_returned < requested_k`. The planner does not automatically retry with a different strategy -- the caller (RETRIEVE executor) decides whether to retry.
### Criterion Benchmarks
```rust
// === tidal/benches/vector.rs ===
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
use tidaldb::storage::vector::*;
use rand::Rng;
fn random_unit_vector(dim: usize, rng: &mut impl Rng) -> Vec<f32> {
let v: Vec<f32> = (0..dim).map(|_| rng.gen::<f32>() - 0.5).collect();
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
v.iter().map(|x| x / norm).collect()
}
/// Benchmark: unfiltered ANN search at 10K vectors.
fn bench_ann_search_unfiltered(c: &mut Criterion) {
let dim = 128; // Use 128d for CI-friendly benchmarks. 1536d for nightly.
let n = 10_000;
let k = 100;
let config = VectorIndexConfig {
dimensions: dim,
quantization: QuantizationLevel::F16,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(n * 2).unwrap();
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let query = random_unit_vector(dim, &mut rng);
c.bench_function("ann_search_unfiltered_10k", |b| {
b.iter(|| {
index.search(&query, k, 200).unwrap()
})
});
}
/// Benchmark: filtered ANN search at 20% selectivity.
fn bench_ann_search_filtered_20pct(c: &mut Criterion) {
let dim = 128;
let n = 10_000;
let k = 100;
let config = VectorIndexConfig {
dimensions: dim,
quantization: QuantizationLevel::F16,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(n * 2).unwrap();
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let query = random_unit_vector(dim, &mut rng);
// 20% selectivity: IDs divisible by 5
let predicate = |id: VectorId| id % 5 == 0;
c.bench_function("ann_search_filtered_20pct_10k", |b| {
b.iter(|| {
index.filtered_search(&query, k, 200, &predicate).unwrap()
})
});
}
/// Benchmark: filtered ANN search at 5% selectivity (danger zone, widened ef).
fn bench_ann_search_filtered_5pct(c: &mut Criterion) {
let dim = 128;
let n = 10_000;
let k = 100;
let config = VectorIndexConfig {
dimensions: dim,
quantization: QuantizationLevel::F16,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(n * 2).unwrap();
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let query = random_unit_vector(dim, &mut rng);
// 5% selectivity: IDs divisible by 20
let predicate = |id: VectorId| id % 20 == 0;
c.bench_function("ann_search_filtered_5pct_10k", |b| {
b.iter(|| {
index.filtered_search(&query, k, 400, &predicate).unwrap()
})
});
}
/// Benchmark: brute-force search over filtered candidate set.
fn bench_ann_search_brute_force(c: &mut Criterion) {
let dim = 128;
let n = 10_000;
let k = 100;
let config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let index = BruteForceIndex::new(config);
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let query = random_unit_vector(dim, &mut rng);
// 0.5% selectivity: ~50 candidates from 10K
let predicate = |id: VectorId| id % 200 == 0;
c.bench_function("ann_brute_force_0_5pct_10k", |b| {
b.iter(|| {
index.filtered_search(&query, k, 0, &predicate).unwrap()
})
});
}
/// Benchmark: measure recall@100 (not a latency benchmark -- measures quality).
fn bench_ann_recall_at_100(c: &mut Criterion) {
let dim = 128;
let n = 10_000;
let k = 100;
let usearch_config = VectorIndexConfig {
dimensions: dim,
quantization: QuantizationLevel::F16,
..VectorIndexConfig::default()
};
let brute_config = VectorIndexConfig {
dimensions: dim,
..VectorIndexConfig::default()
};
let usearch_index = UsearchIndex::new(usearch_config).unwrap();
usearch_index.reserve(n * 2).unwrap();
let brute_index = BruteForceIndex::new(brute_config);
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
usearch_index.insert(id, &v).unwrap();
brute_index.insert(id, &v).unwrap();
}
// Generate 10 queries
let queries: Vec<Vec<f32>> = (0..10)
.map(|_| random_unit_vector(dim, &mut rng))
.collect();
c.bench_function("ann_recall_at_100_10k", |b| {
b.iter(|| {
let mut total_recall = 0.0;
for query in &queries {
let exact = brute_index.search(query, k, 0).unwrap();
let approx = usearch_index.search(query, k, 0).unwrap();
let exact_ids: std::collections::HashSet<u64> =
exact.iter().map(|r| r.id).collect();
let approx_ids: std::collections::HashSet<u64> =
approx.iter().map(|r| r.id).collect();
total_recall += exact_ids.intersection(&approx_ids).count() as f64 / k as f64;
}
total_recall / queries.len() as f64
})
});
}
/// Benchmark: insert one f16 vector into a 10K-vector index, pre-reserved capacity.
fn bench_ann_insert_single(c: &mut Criterion) {
let dim = 128;
let n = 10_000;
let config = VectorIndexConfig {
dimensions: dim,
quantization: QuantizationLevel::F16,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(n * 2).unwrap();
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let mut next_id = n as u64;
c.bench_function("ann_insert_single_10k", |b| {
b.iter(|| {
let v = random_unit_vector(dim, &mut rng);
index.insert(next_id, &v).unwrap();
next_id += 1;
})
});
}
/// Benchmark: tombstone-delete one vector from a 10K-vector index.
fn bench_ann_delete_single(c: &mut Criterion) {
let dim = 128;
let n = 10_000;
let config = VectorIndexConfig {
dimensions: dim,
quantization: QuantizationLevel::F16,
..VectorIndexConfig::default()
};
let index = UsearchIndex::new(config).unwrap();
index.reserve(n * 2).unwrap();
let mut rng = rand::thread_rng();
for id in 0..n as u64 {
let v = random_unit_vector(dim, &mut rng);
index.insert(id, &v).unwrap();
}
let mut delete_id = 0u64;
c.bench_function("ann_delete_single_10k", |b| {
b.iter(|| {
// Delete and re-insert to keep the bench iterable
let _ = index.delete(delete_id);
let v = random_unit_vector(dim, &mut rng);
index.insert(delete_id, &v).unwrap();
delete_id = (delete_id + 1) % n as u64;
})
});
}
criterion_group!(
benches,
bench_ann_search_unfiltered,
bench_ann_search_filtered_20pct,
bench_ann_search_filtered_5pct,
bench_ann_search_brute_force,
bench_ann_recall_at_100,
bench_ann_insert_single,
bench_ann_delete_single,
);
criterion_main!(benches);
```
## Test Strategy
### Unit Tests
```rust
#[test]
fn strategy_unfiltered_at_100pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
assert_eq!(planner.select_strategy(1.0), AnnStrategy::Unfiltered);
}
#[test]
fn strategy_in_graph_above_20pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
assert_eq!(planner.select_strategy(0.50), AnnStrategy::InGraphFilter);
assert_eq!(planner.select_strategy(0.25), AnnStrategy::InGraphFilter);
assert_eq!(planner.select_strategy(0.20), AnnStrategy::InGraphFilter);
}
#[test]
fn strategy_widened_moderate_5_to_20pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
let strategy = planner.select_strategy(0.10);
match strategy {
AnnStrategy::WidenedFilter { ef_search } => {
// 10% is in the moderate range (5-20%), so 2x multiplier
assert_eq!(ef_search, 400, "ef_search should be 2x default (200*2=400)");
}
_ => panic!("expected WidenedFilter, got {strategy:?}"),
}
}
#[test]
fn strategy_widened_low_1_to_5pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
let strategy = planner.select_strategy(0.03);
match strategy {
AnnStrategy::WidenedFilter { ef_search } => {
// 3% is in the low range (1-5%), so 3x multiplier
assert_eq!(ef_search, 600, "ef_search should be 3x default (200*3=600)");
}
_ => panic!("expected WidenedFilter, got {strategy:?}"),
}
}
#[test]
fn strategy_brute_force_below_1pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
assert_eq!(planner.select_strategy(0.005), AnnStrategy::PreFilterBruteForce);
assert_eq!(planner.select_strategy(0.001), AnnStrategy::PreFilterBruteForce);
assert_eq!(planner.select_strategy(0.0), AnnStrategy::PreFilterBruteForce);
}
#[test]
fn strategy_boundary_at_20pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
// Exactly at 20%: in-graph filter
assert_eq!(planner.select_strategy(0.20), AnnStrategy::InGraphFilter);
// Just below 20%: widened filter
let strategy = planner.select_strategy(0.19);
assert!(matches!(strategy, AnnStrategy::WidenedFilter { .. }));
}
#[test]
fn strategy_boundary_at_1pct() {
let planner = AdaptiveQueryPlanner::with_defaults();
// Exactly at 1%: widened filter
let strategy = planner.select_strategy(0.01);
assert!(matches!(strategy, AnnStrategy::WidenedFilter { .. }));
// Just below 1%: brute-force
assert_eq!(planner.select_strategy(0.009), AnnStrategy::PreFilterBruteForce);
}
#[test]
fn custom_thresholds() {
let config = PlannerConfig {
in_graph_min_selectivity: 0.30,
brute_force_max_selectivity: 0.02,
ef_search_multiplier_moderate: 2.5,
ef_search_multiplier_low: 4.0,
default_ef_search: 100,
};
let planner = AdaptiveQueryPlanner::new(config);
// 25%: below 30% threshold, widened filter
let strategy = planner.select_strategy(0.25);
assert!(matches!(strategy, AnnStrategy::WidenedFilter { .. }));
// 35%: above 30% threshold, in-graph
assert_eq!(planner.select_strategy(0.35), AnnStrategy::InGraphFilter);
// 1.5%: below 2% threshold, brute-force
assert_eq!(planner.select_strategy(0.015), AnnStrategy::PreFilterBruteForce);
}
#[test]
fn execute_unfiltered() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
index.insert(2, &[0.0, 1.0, 0.0]).unwrap();
let planner = AdaptiveQueryPlanner::with_defaults();
let (results, stats) = planner.execute(
&index, &[1.0, 0.0, 0.0], 2, None, 1.0, None,
).unwrap();
assert_eq!(results.len(), 2);
assert_eq!(stats.strategy, AnnStrategy::Unfiltered);
assert_eq!(stats.results_returned, 2);
assert_eq!(stats.requested_k, 2);
}
#[test]
fn execute_filtered_in_graph() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
for id in 0..10u64 {
index.insert(id, &[1.0, 0.0, 0.0]).unwrap();
}
let planner = AdaptiveQueryPlanner::with_defaults();
let filter = |id: VectorId| id % 2 == 0;
let (results, stats) = planner.execute(
&index, &[1.0, 0.0, 0.0], 5, Some(&filter), 0.50, None,
).unwrap();
assert!(results.iter().all(|r| r.id % 2 == 0));
assert_eq!(stats.strategy, AnnStrategy::InGraphFilter);
}
#[test]
fn execute_brute_force_fallback_without_brute_index() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
for id in 0..10u64 {
index.insert(id, &[1.0, 0.0, 0.0]).unwrap();
}
let planner = AdaptiveQueryPlanner::with_defaults();
let filter = |id: VectorId| id == 0;
// Selectivity 0.005 triggers PreFilterBruteForce, but no brute index provided
// Should fall back to WidenedFilter
let (results, stats) = planner.execute(
&index, &[1.0, 0.0, 0.0], 1, Some(&filter), 0.005, None,
).unwrap();
// Should still return results (fallback works)
assert!(!results.is_empty());
}
#[test]
fn execute_brute_force_with_brute_index() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let hnsw_index = BruteForceIndex::new(config.clone());
let brute_index = BruteForceIndex::new(config);
for id in 0..100u64 {
let v = [1.0, 0.0, 0.0];
hnsw_index.insert(id, &v).unwrap();
brute_index.insert(id, &v).unwrap();
}
let planner = AdaptiveQueryPlanner::with_defaults();
let filter = |id: VectorId| id < 1; // 1% selectivity
let (results, stats) = planner.execute(
&hnsw_index, &[1.0, 0.0, 0.0], 1, Some(&filter), 0.005, Some(&brute_index),
).unwrap();
assert_eq!(stats.strategy, AnnStrategy::PreFilterBruteForce);
assert!(results.iter().all(|r| r.id < 1));
}
#[test]
fn ann_query_stats_captures_latency() {
let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
let index = BruteForceIndex::new(config);
index.insert(1, &[1.0, 0.0, 0.0]).unwrap();
let planner = AdaptiveQueryPlanner::with_defaults();
let (_, stats) = planner.execute(
&index, &[1.0, 0.0, 0.0], 1, None, 1.0, None,
).unwrap();
// Latency should be non-zero
assert!(stats.latency.as_nanos() > 0, "latency should be > 0");
}
#[test]
fn fixed_selectivity_estimator() {
let estimator = FixedSelectivityEstimator::new(0.15);
assert!((estimator.estimate_selectivity(&|_| true) - 0.15).abs() < f64::EPSILON);
let mut estimator = FixedSelectivityEstimator::new(2.0); // clamped to 1.0
assert!((estimator.estimate_selectivity(&|_| true) - 1.0).abs() < f64::EPSILON);
estimator.set_selectivity(-0.5); // clamped to 0.0
assert!((estimator.estimate_selectivity(&|_| true) - 0.0).abs() < f64::EPSILON);
}
#[test]
fn planner_config_defaults() {
let config = PlannerConfig::default();
assert!((config.in_graph_min_selectivity - 0.20).abs() < f64::EPSILON);
assert!((config.brute_force_max_selectivity - 0.01).abs() < f64::EPSILON);
assert!((config.ef_search_multiplier_moderate - 2.0).abs() < f64::EPSILON);
assert!((config.ef_search_multiplier_low - 3.0).abs() < f64::EPSILON);
assert_eq!(config.default_ef_search, 200);
}
```
## Acceptance Criteria
- [ ] `AnnStrategy` enum with 4 variants: `Unfiltered`, `InGraphFilter`, `WidenedFilter { ef_search }`, `PreFilterBruteForce`
- [ ] `AdaptiveQueryPlanner::select_strategy()` correctly routes: < 1% to brute-force, 1-5% to widened(3x), 5-20% to widened(2x), > 20% to in-graph, 100% to unfiltered
- [ ] `AdaptiveQueryPlanner::execute()` dispatches to the correct `VectorIndex` method based on selected strategy
- [ ] `execute()` falls back to `WidenedFilter` when `PreFilterBruteForce` is selected but no brute-force index is available
- [ ] `AnnQueryStats` captures: estimated_selectivity, strategy, results_returned, requested_k, latency
- [ ] `PlannerConfig` allows threshold tuning with correct defaults
- [ ] `FixedSelectivityEstimator` returns caller-provided selectivity, clamped to [0.0, 1.0]
- [ ] `SelectivityEstimator` trait defined for future m2p2 integration
- [ ] Criterion benchmarks implemented: `bench_ann_search_unfiltered`, `bench_ann_search_filtered_20pct`, `bench_ann_search_filtered_5pct`, `bench_ann_search_brute_force`, `bench_ann_recall_at_100`
- [ ] All benchmarks compile and produce results (performance targets are tracked, not gated)
- [ ] ANN retrieval latency < 10ms at 10K vectors (benchmark report)
- [ ] ANN recall@100 > 0.95 at 10K vectors (benchmark report)
- [ ] No `unsafe` code in `planner.rs`
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "The critical insight across all systems: at extreme selectivity (<1-2%), everyone falls back to pre-filter + brute-force", ACORN-1 two-hop expansion, adaptive query planner architecture, selectivity estimation via metadata indexes, brute-force breakeven at `ef_search * 10` nodes
## Spec References
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (three filtered ANN strategies: in-graph, pre-filter brute-force, ACORN-1 widened), Section 9 (adaptive query planner: decision tree, threshold reference table, runtime statistics `AnnQueryStats`, threshold adjustment bounds, query plan logging), Section 12 (performance targets: < 10ms unfiltered, < 15ms filtered > 20%, < 25ms filtered 1-20%, < 10ms brute-force < 1%; recall targets: > 97% unfiltered, > 95% filtered > 20%, > 90% filtered 1-20%, 100% brute-force; benchmark definitions)
## Implementation Notes
- Add `[[bench]] name = "vector" harness = false` to `tidal/Cargo.toml`.
- Add `rand = "0.9"` to `[dev-dependencies]` if not already present (shared with Task 02 tests).
- The benchmarks use 128-dimensional vectors for CI speed. Add a separate `#[cfg(feature = "nightly-bench")]` set at 1536 dimensions for nightly performance regression tracking.
- The `execute()` method takes `Option<&dyn Fn(VectorId) -> bool>` for the filter, not a `&dyn Fn`. When `None`, the planner always selects `Unfiltered` regardless of the selectivity parameter. This is a convenience for callers that do not have a filter.
- The `brute_force_index` parameter in `execute()` is `Option<&dyn VectorIndex>`. In practice, the RETRIEVE executor holds both the HNSW index and a reference to the entity store embeddings that can be loaded for brute-force scan. For M2, the `BruteForceIndex` is pre-populated alongside the HNSW index for small datasets. At scale, brute-force operates by loading embeddings from the entity store on demand.
- Do NOT implement threshold self-tuning based on `AnnQueryStats` in this task. The stats are collected for observability; automatic threshold adjustment is an M7 optimization. The thresholds are fixed per `PlannerConfig`.
- Do NOT implement the ACORN-1 two-hop expansion as a separate strategy. The `WidenedFilter` with increased `ef_search` achieves a similar effect. True ACORN-1 requires modifying the HNSW traversal algorithm inside USearch, which is not exposed via the current API. Deferred to M7 if widened `ef_search` proves insufficient.

View File

@ -0,0 +1,88 @@
# Milestone 2, Phase 2: Metadata Indexes and Filter Engine
## Phase Deliverable
Roaring bitmap indexes for categorical metadata fields (category, format, creator_id, tags) and B-tree range indexes for numeric/timestamp fields (created_at, duration). A composable filter engine that evaluates arbitrary filter combinations and produces either a `RoaringBitmap` (for pre-filtering ANN) or a `Fn(EntityId) -> bool` predicate closure (for in-graph filtering). Filter selectivity estimates for the adaptive query planner from m2p1.
This is the indexing layer that makes `FILTER category:jazz, format:video, duration_min:5m, created_within:7d` execute in microseconds instead of milliseconds. Without it, every metadata filter requires a full entity scan. With it, the query planner can estimate selectivity before choosing an ANN strategy (Spec 07 Section 9), and the RETRIEVE executor can intersect pre-computed bitmaps instead of loading entity metadata per candidate (Spec 08 Section 7).
## Acceptance Criteria
- [ ] Roaring bitmap per high-cardinality metadata value: category, format, creator_id, tags (multi-value)
- [ ] B-tree index for range attributes: created_at (nanosecond timestamps), duration (seconds)
- [ ] Filter expressions are composable: AND across dimensions, OR within a dimension, NOT for negation
- [ ] `filter.selectivity()` estimates the fraction of items matching (for query planner)
- [ ] `filter.to_bitmap()` returns a `RoaringBitmap` for pre-filtering
- [ ] `filter.to_predicate()` returns a `Fn(EntityId) -> bool` for in-graph filtering
- [ ] Filters tested: `category:jazz`, `format:video`, `duration_min:5m`, `created_within:7d`, and arbitrary AND/OR/NOT combinations
- [ ] Filter evaluation < 1 microsecond per candidate (benchmarked via bitmap containment check)
- [ ] Index insert and delete operations are correct (property tested)
- [ ] Selectivity estimates are in [0.0, 1.0] for all inputs (property tested)
## Dependencies
- **Requires:** m1p1 (types: `EntityId`, `EntityKind`, `Timestamp`), m1p3 (storage: `StorageEngine` trait, key encoding with `Tag::Idx` for index persistence), m1p5 (entity write API -- bitmap indexes are updated when entities are written)
- **Blocks:** m2p1 Task 04 (adaptive query planner's `SelectivityEstimator` uses m2p2's bitmap cardinalities), m2p5 (RETRIEVE executor applies filters to candidate sets)
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Selectivity estimation via bitmap cardinality, pre-filter brute-force strategy for selective filters (<1%), danger zone (1-20%) requiring widened ef_search, bitmap intersection as the standard pre-filtering primitive across Qdrant/Weaviate/Pinecone
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (filtered ANN: three strategies with selectivity thresholds, selectivity estimation from bitmap cardinalities), Section 9 (adaptive query planner: decision tree using selectivity estimates)
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7 (filter evaluation: bitmap-based architecture, filter push-down, filter types, short-circuit evaluation, user-state filter implementation)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.2 (scan strategy: metadata-indexed scan resolves filters to roaring bitmaps before signal reads), Section 4 Stage 3 (filter evaluation using pre-computed roaring bitmaps for keyword fields and range scans for numeric fields)
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"] intersect format_bitmap["video"]`), Section 7.2 (filter push-down into ANN predicate callback or pre-filter set), Section 7.3 (`Filter` enum: `Eq`, `Any`, `Range`, `Min`, `Max`, `Preset`, `CreatedWithin`, `CreatedAfter`, `CreatedBefore`), Section 7.4 (short-circuit evaluation: sort by ascending cardinality, abort on empty intersection)
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: keyword equality uses `cardinality(bitmap[field][value]) / total_entities`; compound AND uses independence assumption `product(individual)`; compound OR uses `1 - product(1 - s_i)`)
## Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Roaring Bitmap Indexes | `BitmapIndex` struct, insert/delete/get/cardinality, persistence via `Tag::Idx`, multi-value field support (tags) | None | M |
| 02 | B-tree Range Indexes | `RangeIndex<V>` struct, insert/delete/range query returning `RoaringBitmap`, selectivity estimation for ranges | None | S |
| 03 | Composable Filter Engine | `FilterExpr` AST, `FilterEvaluator`, `FilterResult` (bitmap or predicate), selectivity estimation, Criterion benchmarks | Task 01, Task 02 | M |
## Task Dependency DAG
```
Task 01: Roaring Bitmap Indexes Task 02: B-tree Range Indexes
| |
+-----------------------------------+
|
v
Task 03: Composable Filter Engine
```
Tasks 01 and 02 are fully parallelizable -- they share no types or state beyond `EntityId`. Task 03 composes them into the filter evaluation pipeline.
## File Layout
```
tidal/src/
storage/
indexes/
mod.rs -- pub use re-exports, IndexError type
bitmap.rs -- BitmapIndex (Task 01)
range.rs -- RangeIndex<V> (Task 02)
filter.rs -- FilterExpr, FilterEvaluator, FilterResult (Task 03)
mod.rs -- add `pub mod indexes;`
tidal/benches/
filters.rs -- Criterion benchmarks (Task 03)
tidal/Cargo.toml -- add `roaring` dependency
```
## Open Questions
1. **`roaring` vs `croaring`**: The `roaring` crate (pure Rust, simple API) vs `croaring` (C bindings, faster bulk operations). For M2 with 10K items, `roaring` is sufficient and keeps the `#![forbid(unsafe_code)]` crate-level lint intact. Use `roaring`. If M7 benchmarks at 1M+ items show roaring is a bottleneck, switch to `croaring`.
2. **Index update on entity write**: When `db.write_item(id, metadata)` is called, the bitmap and range indexes must be updated atomically with the storage engine write. Define the update order: storage engine first (source of truth), then update in-memory indexes. If the process crashes between storage write and in-memory update, the indexes are rebuilt from the storage engine on restart. The m2p2 phase defines the index data structures and their in-memory operations; the wiring into the entity write path is done in m2p5 (RETRIEVE executor) or a dedicated integration task.
3. **Multi-value fields (tags)**: Tags are a multi-value field -- one entity can have multiple tags. The bitmap index must support this: `insert(entity_id, "jazz")` and `insert(entity_id, "piano")` for the same entity. The entity appears in the bitmap for EACH tag value. When deleting an entity, it must be removed from ALL tag bitmaps. The `BitmapIndex` API uses `insert(entity_id, field_value)` and `delete(entity_id, field_value)`, so multi-value is handled by calling insert once per value.
4. **Index warming on startup**: At startup, load all bitmap indexes from storage engine before accepting queries. Time to warm 10K items with 10 category values = ~10ms (acceptable). At 1M items this may take ~1s. At 10M items this becomes a concern -- defer index pre-warming optimization to M7.
5. **Persistence granularity**: Each `(field_name, field_value)` pair's bitmap is stored as a single key in the storage engine: `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")`. For M2 with 10K items and ~50 distinct metadata values, this means ~50 keys. At 10M items with 10K distinct values, this means ~10K keys -- still manageable. Serialization uses `RoaringBitmap::serialize_into()` / `RoaringBitmap::deserialize_from()`.
6. **Separate index keyspace vs entity keyspace**: Index bitmaps are global (not per-entity) -- they map field values to sets of entity IDs. The subject-prefix key encoding (`[entity_id][NUL][tag][suffix]`) is entity-centric. Index keys need a different encoding since they are field-value-centric, not entity-centric. Solution: use a reserved entity ID (e.g., `EntityId(0)` or `EntityId(u64::MAX)`) as the "index root" with `Tag::Idx`, or use a dedicated prefix outside the entity keyspace. Decision: use `EntityId(0)` as the index root -- it is never a valid entity ID in practice, and keeps the key encoding uniform.

View File

@ -0,0 +1,558 @@
# Task 01: Roaring Bitmap Indexes
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p2 -- Metadata Indexes and Filter Engine
**Depends On:** None (uses types from m1p1 but no m2p2 tasks)
**Blocks:** Task 03 (Composable Filter Engine)
**Complexity:** M
## Objective
Deliver `BitmapIndex`, the in-memory index structure that maps categorical metadata field values to `RoaringBitmap` sets of matching entity IDs. This is the data structure that makes `FILTER category:jazz` resolve in microseconds: look up `category_bitmap["jazz"]`, get a `RoaringBitmap` of all jazz item IDs, intersect with other filter bitmaps. No entity scan. No metadata loads.
The bitmap index supports exact-match lookups (`category:jazz`), multi-value fields (`tags:classical` where one entity has multiple tags), and provides cardinality counts for the adaptive query planner's selectivity estimation (Spec 07 Section 3, Section 9). Every production system that solves filtered ANN -- Qdrant, Weaviate, Pinecone -- uses roaring bitmaps as the pre-filter primitive. This is table stakes.
## Requirements
- `BitmapIndex` struct: maps `(field_name, field_value)` pairs to `RoaringBitmap` of matching entity IDs
- `insert(entity_id, field_value)` -- adds entity to the bitmap for that value
- `delete(entity_id, field_value)` -- removes entity from the bitmap for that value
- `get(field_value) -> Option<&RoaringBitmap>` -- returns bitmap for exact-match filter
- `cardinality(field_value) -> u64` -- returns number of entities matching that value
- `total_count() -> u64` -- returns total number of distinct entity IDs indexed across all values
- `values() -> impl Iterator<Item = &str>` -- enumerate all indexed field values
- `field_name()` -- returns the field this index covers
- Multi-value field support: one entity can appear in multiple value bitmaps (e.g., tags)
- `delete_entity(entity_id)` -- removes entity from ALL value bitmaps (for entity deletion)
- Persistence: serialize/deserialize each value bitmap to/from the storage engine via `Tag::Idx`
- On startup: load from storage engine, rebuild in-memory state
- Pure Rust: use the `roaring` crate (crates.io), not `croaring` (C bindings)
- `Send + Sync` (for concurrent read access during queries)
## Technical Design
### Module Structure
```
tidal/src/storage/
indexes/
mod.rs -- IndexError, pub use re-exports
bitmap.rs -- BitmapIndex (this task)
```
### Public API
```rust
// === storage/indexes/bitmap.rs ===
use roaring::RoaringBitmap;
use std::collections::HashMap;
use std::sync::RwLock;
/// Error type for index operations.
#[derive(Debug, thiserror::Error)]
pub enum IndexError {
#[error("storage error: {0}")]
Storage(String),
#[error("serialization error: {0}")]
Serialization(String),
}
/// A roaring bitmap index for a single categorical metadata field.
///
/// Maps field values (strings) to `RoaringBitmap` sets of entity IDs
/// that have that value. Supports exact-match lookups, multi-value
/// fields (tags), and cardinality queries for selectivity estimation.
///
/// # Concurrency
///
/// Reads and writes are protected by an `RwLock`. Reads (get, cardinality,
/// total_count) take the read lock. Writes (insert, delete, delete_entity)
/// take the write lock. This is acceptable because:
/// - Writes happen on the entity write path (not the hot query path).
/// - Reads happen during filter evaluation (concurrent with other reads).
/// - At M2 scale (10K entities), contention is negligible.
///
/// At M7 scale (1M+ entities), if write contention becomes measurable,
/// switch to a sharded index (one BitmapIndex per shard of the field
/// value space).
///
/// # Persistence
///
/// Each `(field_name, field_value)` bitmap is stored as a key in the
/// storage engine using `Tag::Idx`:
///
/// ```text
/// Key: [INDEX_ROOT_ID: 8 bytes BE][0x00][Tag::Idx][b"BMP:"][field_name][b":"][field_value]
/// Value: RoaringBitmap serialized bytes
/// ```
///
/// `INDEX_ROOT_ID` is `EntityId(0)` -- a reserved entity ID used as
/// the root for all index keys.
pub struct BitmapIndex {
field_name: String,
values: RwLock<HashMap<String, RoaringBitmap>>,
}
/// The reserved entity ID used as the root for index storage keys.
///
/// EntityId 0 is reserved for system-level keys across all subsystems.
/// The signal system also uses EntityId(0) with `Tag::Sig` for checkpoint
/// metadata (see `signals/checkpoint.rs`). The `Tag::Idx` discriminant
/// ensures no key collision — the full key format `[entity_id][NUL][tag][suffix]`
/// keeps each subsystem's keys in separate namespaces.
pub const INDEX_ROOT_ID: u64 = 0;
impl BitmapIndex {
/// Create a new, empty bitmap index for the given field.
pub fn new(field_name: impl Into<String>) -> Self;
/// The field name this index covers.
pub fn field_name(&self) -> &str;
/// Add an entity to the bitmap for the given field value.
///
/// For multi-value fields (tags), call once per value for the
/// same entity.
///
/// Returns `true` if the entity was newly added (was not already
/// in this value's bitmap).
pub fn insert(&self, entity_id: u32, value: impl Into<String>) -> bool;
/// Remove an entity from the bitmap for the given field value.
///
/// Returns `true` if the entity was present and removed.
pub fn delete(&self, entity_id: u32, value: &str) -> bool;
/// Remove an entity from ALL value bitmaps.
///
/// Used when an entity is deleted entirely. Scans all values
/// and removes the entity ID from each bitmap.
///
/// Returns the number of bitmaps the entity was removed from.
pub fn delete_entity(&self, entity_id: u32) -> usize;
/// Look up the bitmap for an exact field value.
///
/// Returns `None` if no entities have this value.
///
/// The returned bitmap is a clone (cheap for roaring bitmaps
/// due to copy-on-write internals). Callers can intersect,
/// union, or iterate over it freely.
pub fn get(&self, value: &str) -> Option<RoaringBitmap>;
/// Look up bitmaps for multiple values and return their union.
///
/// Implements OR-within-dimension: `category IN [jazz, blues]`
/// returns the union of the jazz and blues bitmaps.
pub fn get_union(&self, values: &[&str]) -> RoaringBitmap;
/// Number of entities matching a specific field value.
///
/// Used by the adaptive query planner for selectivity estimation.
pub fn cardinality(&self, value: &str) -> u64;
/// Total number of distinct entity IDs indexed across all values.
///
/// Used as the denominator for selectivity:
/// `selectivity = cardinality(value) / total_count()`.
///
/// Note: this is the cardinality of the union of all value
/// bitmaps, NOT the sum of individual cardinalities (which
/// would double-count entities in multi-value fields).
pub fn total_count(&self) -> u64;
/// Enumerate all indexed field values.
pub fn values(&self) -> Vec<String>;
/// Number of distinct field values in this index.
pub fn distinct_values(&self) -> usize;
/// Check if the index is empty (no entities indexed).
pub fn is_empty(&self) -> bool;
/// Serialize all bitmaps to storage engine key-value pairs.
///
/// Returns a `Vec<(Vec<u8>, Vec<u8>)>` of `(key, value)` pairs
/// suitable for `StorageEngine::write_batch()`.
///
/// Key format: `encode_key(EntityId(INDEX_ROOT_ID), Tag::Idx, suffix)`
/// where suffix = `b"BMP:" + field_name + b":" + field_value`.
pub fn serialize_to_kv_pairs(&self) -> Result<Vec<(Vec<u8>, Vec<u8>)>, IndexError>;
/// Deserialize bitmaps from storage engine key-value pairs.
///
/// Scans all keys with the `BMP:{field_name}:` prefix and
/// deserializes each value as a `RoaringBitmap`.
pub fn load_from_kv_pairs(
field_name: impl Into<String>,
pairs: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<Self, IndexError>;
}
```
### Internal Design
**`RoaringBitmap` entity ID representation:**
`RoaringBitmap` operates on `u32` values. `EntityId` is `u64`. For M2 (10K entities), all entity IDs fit in `u32`. For production at 10M+ entities, we need a strategy. Options:
1. **Truncate to u32**: Works if entity IDs are assigned sequentially from 0. This is the simplest approach and matches how most databases assign internal row IDs.
2. **Split high/low bits**: Use the high 32 bits as a partition key and the low 32 bits in the bitmap. Requires multiple bitmaps per value.
3. **Use `RoaringTreemap`**: The `roaring` crate provides `RoaringTreemap` which operates on `u64`. Slightly slower for small sets but handles the full ID space.
Decision for M2: Use `u32` with an explicit conversion from `EntityId`. The `insert` and `delete` APIs accept `u32` directly. The caller (entity write path, m2p5) is responsible for the `EntityId::as_u64() as u32` conversion. A debug assertion verifies no truncation: `debug_assert!(entity_id.as_u64() <= u64::from(u32::MAX))`. At M7+, if entity IDs exceed u32, switch to `RoaringTreemap`.
**HashMap vs BTreeMap for value storage:**
`HashMap<String, RoaringBitmap>` is used because:
- Field value lookups are exact-match (hash is O(1)).
- No ordering needed for bitmap index values.
- `BTreeMap` would be used if we needed prefix or range queries on field values, which we do not.
**Persistence key encoding:**
```
Key: encode_key(EntityId(0), Tag::Idx, b"BMP:category:jazz")
Value: RoaringBitmap::serialize_into() bytes
```
The `INDEX_ROOT_ID = EntityId(0)` anchors all index keys to the same entity prefix, making them scannable via `scan_prefix(entity_tag_prefix(EntityId(0), Tag::Idx))`. The suffix `BMP:{field_name}:{field_value}` distinguishes bitmap keys from future index types (e.g., `RNG:` for range indexes in Task 02).
**Bitmap serialization:**
The `roaring` crate provides `serialize_into(&mut writer)` and `deserialize_from(reader)` using the standard Roaring bitmap serialization format (compatible with CRoaring, Roaring Java, etc.). At 10K items with ~50 distinct category values, each bitmap is ~1-4 KB serialized. Total index persistence: ~100 KB.
### Error Handling
- `insert()` and `delete()` are infallible -- they operate on in-memory data only.
- `serialize_to_kv_pairs()` can fail if bitmap serialization fails (theoretically impossible for valid bitmaps, but the API is `Result` for correctness).
- `load_from_kv_pairs()` can fail if the stored bytes are corrupted. Returns `IndexError::Serialization` with context.
## Test Strategy
### Property Tests
```rust
use proptest::prelude::*;
use roaring::RoaringBitmap;
// Insert-query roundtrip: every inserted entity appears in get().
proptest! {
#[test]
fn insert_query_roundtrip(
entries in prop::collection::vec(
(0u32..100_000, "[a-z]{1,5}"),
1..500,
),
) {
let index = BitmapIndex::new("test_field");
for &(id, ref value) in &entries {
index.insert(id, value.clone());
}
for &(id, ref value) in &entries {
let bitmap = index.get(value).expect("value should exist");
prop_assert!(bitmap.contains(id), "entity {id} not found in bitmap for '{value}'");
}
}
}
// Delete removes correctly: after delete, entity is absent.
proptest! {
#[test]
fn delete_removes_entity(
ids in prop::collection::vec(0u32..10_000, 1..100),
value in "[a-z]{1,5}",
) {
let index = BitmapIndex::new("test_field");
for &id in &ids {
index.insert(id, value.clone());
}
// Delete first half
let half = ids.len() / 2;
for &id in &ids[..half] {
index.delete(id, &value);
}
// Verify first half absent, second half present
let bitmap = index.get(&value).unwrap_or_default();
for &id in &ids[..half] {
prop_assert!(!bitmap.contains(id), "deleted entity {id} still in bitmap");
}
for &id in &ids[half..] {
prop_assert!(bitmap.contains(id), "surviving entity {id} missing from bitmap");
}
}
}
// Cardinality matches bitmap length.
proptest! {
#[test]
fn cardinality_matches_bitmap(
ids in prop::collection::hash_set(0u32..100_000, 1..1000),
value in "[a-z]{1,3}",
) {
let index = BitmapIndex::new("test_field");
for &id in &ids {
index.insert(id, value.clone());
}
prop_assert_eq!(
index.cardinality(&value),
ids.len() as u64,
);
}
}
// Total count equals distinct entity IDs across all values.
proptest! {
#[test]
fn total_count_no_double_counting(
entries in prop::collection::vec(
(0u32..1_000, "[a-z]{1,3}"),
1..200,
),
) {
let index = BitmapIndex::new("tags");
let mut all_ids = RoaringBitmap::new();
for &(id, ref value) in &entries {
index.insert(id, value.clone());
all_ids.insert(id);
}
prop_assert_eq!(index.total_count(), all_ids.len());
}
}
// get_union returns the union of individual bitmaps.
proptest! {
#[test]
fn get_union_is_correct(
a_ids in prop::collection::vec(0u32..1000, 1..50),
b_ids in prop::collection::vec(0u32..1000, 1..50),
) {
let index = BitmapIndex::new("test_field");
for &id in &a_ids {
index.insert(id, "a".to_string());
}
for &id in &b_ids {
index.insert(id, "b".to_string());
}
let union = index.get_union(&["a", "b"]);
let manual_union = {
let mut bm = index.get("a").unwrap_or_default();
bm |= &index.get("b").unwrap_or_default();
bm
};
prop_assert_eq!(union, manual_union);
}
}
// Serialize-deserialize roundtrip preserves all bitmaps.
proptest! {
#[test]
fn serialize_deserialize_roundtrip(
entries in prop::collection::vec(
(0u32..10_000, "[a-z]{1,3}"),
1..100,
),
) {
let index = BitmapIndex::new("test_field");
for &(id, ref value) in &entries {
index.insert(id, value.clone());
}
let kv_pairs = index.serialize_to_kv_pairs().unwrap();
let restored = BitmapIndex::load_from_kv_pairs(
"test_field",
kv_pairs.into_iter(),
).unwrap();
// Verify all values and bitmaps match
for value in index.values() {
let orig = index.get(&value).unwrap();
let rest = restored.get(&value).unwrap();
prop_assert_eq!(orig, rest, "mismatch for value '{value}'");
}
prop_assert_eq!(index.total_count(), restored.total_count());
}
}
```
### Unit Tests
```rust
#[test]
fn new_index_is_empty() {
let index = BitmapIndex::new("category");
assert!(index.is_empty());
assert_eq!(index.total_count(), 0);
assert_eq!(index.distinct_values(), 0);
assert!(index.get("jazz").is_none());
assert_eq!(index.cardinality("jazz"), 0);
}
#[test]
fn insert_single_entity() {
let index = BitmapIndex::new("category");
assert!(index.insert(1, "jazz"));
assert!(!index.is_empty());
assert_eq!(index.cardinality("jazz"), 1);
assert_eq!(index.total_count(), 1);
let bitmap = index.get("jazz").unwrap();
assert!(bitmap.contains(1));
}
#[test]
fn insert_same_entity_same_value_is_idempotent() {
let index = BitmapIndex::new("category");
assert!(index.insert(1, "jazz")); // first insert: true (newly added)
assert!(!index.insert(1, "jazz")); // second insert: false (already present)
assert_eq!(index.cardinality("jazz"), 1);
}
#[test]
fn multi_value_field_tags() {
let index = BitmapIndex::new("tags");
index.insert(1, "jazz");
index.insert(1, "piano");
index.insert(1, "classical");
// Entity 1 appears in all three tag bitmaps
assert!(index.get("jazz").unwrap().contains(1));
assert!(index.get("piano").unwrap().contains(1));
assert!(index.get("classical").unwrap().contains(1));
// Total count is 1, not 3 (one distinct entity)
assert_eq!(index.total_count(), 1);
assert_eq!(index.distinct_values(), 3);
}
#[test]
fn delete_entity_from_all_values() {
let index = BitmapIndex::new("tags");
index.insert(1, "jazz");
index.insert(1, "piano");
index.insert(2, "jazz");
let removed_count = index.delete_entity(1);
assert_eq!(removed_count, 2); // removed from "jazz" and "piano"
// Entity 1 gone from both bitmaps
assert!(!index.get("jazz").unwrap().contains(1));
assert!(index.get("piano").is_none()); // only had entity 1
// Entity 2 still in "jazz"
assert!(index.get("jazz").unwrap().contains(2));
assert_eq!(index.total_count(), 1);
}
#[test]
fn get_union_multiple_values() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.insert(2, "blues");
index.insert(3, "jazz");
let union = index.get_union(&["jazz", "blues"]);
assert_eq!(union.len(), 3);
assert!(union.contains(1));
assert!(union.contains(2));
assert!(union.contains(3));
}
#[test]
fn get_union_with_missing_value() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
let union = index.get_union(&["jazz", "nonexistent"]);
assert_eq!(union.len(), 1);
assert!(union.contains(1));
}
#[test]
fn get_union_empty_values() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
let union = index.get_union(&[]);
assert!(union.is_empty());
}
#[test]
fn values_enumerates_all() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.insert(2, "blues");
index.insert(3, "rock");
let mut values = index.values();
values.sort();
assert_eq!(values, vec!["blues", "jazz", "rock"]);
}
#[test]
fn delete_nonexistent_returns_false() {
let index = BitmapIndex::new("category");
assert!(!index.delete(99, "jazz"));
}
#[test]
fn field_name_accessor() {
let index = BitmapIndex::new("category");
assert_eq!(index.field_name(), "category");
}
#[test]
fn empty_value_bitmaps_are_cleaned_up() {
let index = BitmapIndex::new("category");
index.insert(1, "jazz");
index.delete(1, "jazz");
// After removing the only entity, the value bitmap should be
// removed or empty. get() returns None for empty/removed.
assert!(index.get("jazz").is_none() || index.get("jazz").unwrap().is_empty());
assert_eq!(index.total_count(), 0);
}
```
## Acceptance Criteria
- [ ] `BitmapIndex` stores one `RoaringBitmap` per distinct field value, mapping field values to sets of `u32` entity IDs
- [ ] `insert(entity_id, field_value)` adds the entity to the bitmap for that value; returns true if newly added
- [ ] `delete(entity_id, field_value)` removes the entity from the bitmap; returns true if was present
- [ ] `delete_entity(entity_id)` removes the entity from ALL value bitmaps; returns count of bitmaps modified
- [ ] `get(field_value)` returns the bitmap for exact-match lookup, or `None` if no entities have that value
- [ ] `get_union(values)` returns the union bitmap across multiple values (OR-within-dimension)
- [ ] `cardinality(field_value)` returns the count of entities matching that value
- [ ] `total_count()` returns the count of distinct entity IDs across all values (no double-counting for multi-value fields)
- [ ] Multi-value fields work: one entity can appear in multiple value bitmaps (tested with tags)
- [ ] `serialize_to_kv_pairs()` and `load_from_kv_pairs()` roundtrip preserves all bitmaps exactly (property tested)
- [ ] Key encoding uses `encode_key(EntityId(0), Tag::Idx, b"BMP:{field_name}:{field_value}")` for persistence
- [ ] `BitmapIndex` is `Send + Sync`
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "resolve filter predicates to roaring bitmaps", "Weaviate produces a RoaringBitmap allow-list", "Pinecone intersects metadata bitmaps per field with IVF cluster assignments"
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: `cardinality(bitmap[field][value]) / total_entities`)
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based architecture: `category_bitmap["jazz"]`, `format_bitmap["video"]`, bitmap intersection for compound filters), Section 7.4 (short-circuit evaluation: "if any bitmap has cardinality 0, return empty Results immediately")
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.2 (metadata-indexed scan: "filters on keyword fields are resolved to roaring bitmaps and intersected before any signal reads")
## Implementation Notes
- Add `roaring = "0.10"` to `[dependencies]` in `tidal/Cargo.toml`. The `roaring` crate is pure Rust, no `unsafe`, well-maintained (~55M crates.io downloads), and implements the standard Roaring bitmap format.
- `RoaringBitmap` uses `u32`. This limits entity IDs to ~4 billion, which is sufficient through M6 (100K items). If M7+ needs u64 entity IDs, switch to `RoaringTreemap` (same crate, same API, u64 keys). Add a `debug_assert!(entity_id <= u32::MAX as u64)` at the conversion boundary.
- Empty bitmaps (all entities deleted for a value) should be removed from the `HashMap` to avoid accumulating dead entries. Check `bitmap.is_empty()` after `delete()` and remove the entry.
- The `RwLock` is from `std::sync::RwLock`, not `parking_lot`. At M2 scale (10K entities, low contention), `std` RwLock is sufficient. If profiling shows contention at M7, switch to `parking_lot::RwLock` which is faster under high contention.
- Do NOT implement bitmap compression tuning (run optimization) in this task. The `roaring` crate handles compression automatically.
- Do NOT implement the correlation cache for co-occurring filter pairs (Spec 07 Section 3: "maintain joint statistics for common filter combinations"). This is deferred to M5 or later.

View File

@ -0,0 +1,551 @@
# Task 02: B-tree Range Indexes
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p2 -- Metadata Indexes and Filter Engine
**Depends On:** None (uses types from m1p1 but no m2p2 tasks)
**Blocks:** Task 03 (Composable Filter Engine)
**Complexity:** S
## Objective
Deliver `RangeIndex<V>`, a sorted in-memory index for range queries over numeric and timestamp fields. The query `FILTER duration_min:5m, created_within:7d` resolves to range lookups that return `RoaringBitmap` sets: "all entities with duration >= 300 seconds" and "all entities with created_at >= (now - 7 days)". These bitmaps are intersected with categorical bitmaps from Task 01 and fed into the filter engine (Task 03).
Range indexes are the complement to bitmap indexes. Bitmap indexes handle exact-match categorical predicates (`category:jazz`). Range indexes handle ordered numeric predicates (`duration >= 300`, `created_at >= 2026-02-13T00:00:00`). Together they cover the full filter predicate space defined in Spec 08 Section 7.3: `Eq`, `Any`, `Range`, `Min`, `Max`, `CreatedWithin`, `CreatedAfter`, `CreatedBefore`.
## Requirements
- `RangeIndex<V: Ord + Clone>` struct backed by `BTreeMap<V, RoaringBitmap>`
- Key is the attribute value, value is the set of entity IDs with that exact attribute value
- `insert(entity_id, value: V)` -- adds entity to the bitmap for that value
- `delete(entity_id, value: V)` -- removes entity from the bitmap for that value
- `range(lo: Bound<V>, hi: Bound<V>) -> RoaringBitmap` -- union of all bitmaps with keys in [lo, hi]
- `gt(threshold: V) -> RoaringBitmap` -- union of all bitmaps with keys > threshold
- `gte(threshold: V) -> RoaringBitmap` -- union of all bitmaps with keys >= threshold
- `lt(threshold: V) -> RoaringBitmap` -- union of all bitmaps with keys < threshold
- `lte(threshold: V) -> RoaringBitmap` -- union of all bitmaps with keys <= threshold
- `selectivity(lo: Bound<V>, hi: Bound<V>, total: u64) -> f64` -- estimated fraction of entities in range
- `total_count() -> u64` -- total distinct entity IDs indexed
- Concrete instantiations: `RangeIndex<u64>` for timestamps (nanoseconds), `RangeIndex<u32>` for duration (seconds)
- Persistence: serialize/deserialize each value bitmap to/from storage engine via `Tag::Idx`
- `Send + Sync`
## Technical Design
### Module Structure
```
tidal/src/storage/
indexes/
range.rs -- RangeIndex<V> (this task)
```
### Public API
```rust
// === storage/indexes/range.rs ===
use roaring::RoaringBitmap;
use std::collections::BTreeMap;
use std::ops::Bound;
use std::sync::RwLock;
use super::IndexError;
/// A B-tree backed range index for a single ordered numeric field.
///
/// Maps attribute values to `RoaringBitmap` sets of entity IDs.
/// The B-tree ordering enables efficient range queries: "all entities
/// with duration >= 300" unions the bitmaps for keys 300, 301, ...
///
/// # Design
///
/// Unlike the `BitmapIndex` (which uses a `HashMap` for exact-match),
/// this index uses a `BTreeMap` to exploit key ordering for range
/// scans. The `range()` method iterates from `lo` to `hi` in the
/// tree and unions the bitmaps. At 10K entities with ~100 distinct
/// duration values, this is ~100 bitmap unions -- well under 1ms.
///
/// # Concurrency
///
/// Same model as `BitmapIndex`: `RwLock` for read/write separation.
pub struct RangeIndex<V: Ord + Clone> {
field_name: String,
tree: RwLock<BTreeMap<V, RoaringBitmap>>,
}
impl<V: Ord + Clone> RangeIndex<V> {
/// Create a new, empty range index for the given field.
pub fn new(field_name: impl Into<String>) -> Self;
/// The field name this index covers.
pub fn field_name(&self) -> &str;
/// Add an entity with the given attribute value.
///
/// If the entity already exists at a DIFFERENT value, the caller
/// must call `delete(entity_id, old_value)` first. The range index
/// does not track previous values per entity.
pub fn insert(&self, entity_id: u32, value: V);
/// Remove an entity from the bitmap at the given value.
///
/// Returns `true` if the entity was present and removed.
pub fn delete(&self, entity_id: u32, value: &V) -> bool;
/// Range query: return the union of all bitmaps with keys in [lo, hi].
///
/// Uses `BTreeMap::range()` to iterate matching entries and
/// unions their bitmaps.
pub fn range(&self, lo: Bound<&V>, hi: Bound<&V>) -> RoaringBitmap;
/// Greater-than query: return entities with value > threshold.
pub fn gt(&self, threshold: &V) -> RoaringBitmap;
/// Greater-than-or-equal query.
pub fn gte(&self, threshold: &V) -> RoaringBitmap;
/// Less-than query: return entities with value < threshold.
pub fn lt(&self, threshold: &V) -> RoaringBitmap;
/// Less-than-or-equal query.
pub fn lte(&self, threshold: &V) -> RoaringBitmap;
/// Estimate the fraction of entities matching a range query.
///
/// Computed as: `sum(cardinality(bitmap) for key in range) / total`.
/// This is exact, not estimated, because we iterate the actual
/// bitmaps. At M2 scale (10K entities, ~100 distinct values),
/// this is cheap. At M7 scale, consider sampling.
///
/// Returns a value in [0.0, 1.0]. Returns 0.0 if `total` is 0.
pub fn selectivity(&self, lo: Bound<&V>, hi: Bound<&V>, total: u64) -> f64;
/// Total number of distinct entity IDs indexed.
pub fn total_count(&self) -> u64;
/// Number of distinct attribute values in the tree.
pub fn distinct_values(&self) -> usize;
/// Whether the index is empty.
pub fn is_empty(&self) -> bool;
}
```
### Persistence API
```rust
/// Persistence methods for `RangeIndex<u64>` (timestamps) and `RangeIndex<u32>` (durations).
///
/// These are implemented on the concrete types, not the generic, because
/// serialization requires knowing the byte width of V.
impl RangeIndex<u64> {
/// Serialize all bitmaps to storage engine key-value pairs.
///
/// Key format: `encode_key(EntityId(0), Tag::Idx, suffix)`
/// where suffix = `b"RNG:" + field_name + b":" + value_be_bytes`.
pub fn serialize_to_kv_pairs(&self) -> Result<Vec<(Vec<u8>, Vec<u8>)>, IndexError>;
/// Deserialize from storage engine key-value pairs.
pub fn load_from_kv_pairs(
field_name: impl Into<String>,
pairs: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<Self, IndexError>;
}
impl RangeIndex<u32> {
/// Serialize all bitmaps to storage engine key-value pairs.
pub fn serialize_to_kv_pairs(&self) -> Result<Vec<(Vec<u8>, Vec<u8>)>, IndexError>;
/// Deserialize from storage engine key-value pairs.
pub fn load_from_kv_pairs(
field_name: impl Into<String>,
pairs: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<Self, IndexError>;
}
```
### Internal Design
**BTreeMap iteration for range queries:**
```rust
fn range(&self, lo: Bound<&V>, hi: Bound<&V>) -> RoaringBitmap {
let tree = self.tree.read().expect("lock poisoned");
let mut result = RoaringBitmap::new();
for (_key, bitmap) in tree.range((lo, hi)) {
result |= bitmap;
}
result
}
```
This leverages `BTreeMap::range()` which returns an iterator over entries with keys in the specified bounds. The bounds use `std::ops::Bound` (`Included`, `Excluded`, `Unbounded`) for flexible range specification.
**Selectivity computation:**
```rust
fn selectivity(&self, lo: Bound<&V>, hi: Bound<&V>, total: u64) -> f64 {
if total == 0 {
return 0.0;
}
let range_bitmap = self.range(lo, hi);
range_bitmap.len() as f64 / total as f64
}
```
This is exact (not estimated) because we compute the actual union of bitmaps in the range. At M2 scale, this is fast enough. If M7 shows this is a bottleneck, approximate by sampling a fixed number of values in the range and extrapolating.
**Persistence key encoding for ranges:**
```
Key: encode_key(EntityId(0), Tag::Idx, b"RNG:created_at:\x00\x00\x01\x8E\x3A\xB0\xD0\x00")
^--- INDEX_ROOT_ID ^--- "RNG:" prefix ^--- value in BE bytes
```
Values are stored in big-endian byte order in the key suffix so that lexicographic key ordering matches numeric value ordering. This is important for the storage engine's prefix scan to return values in sorted order.
### Error Handling
- `insert()` and `delete()` are infallible (in-memory only).
- `range()` and selectivity methods are infallible.
- Persistence methods return `Result<_, IndexError>`.
## Test Strategy
### Property Tests
```rust
use proptest::prelude::*;
use std::ops::Bound;
// Range query returns exactly the entities with values in [lo, hi].
proptest! {
#[test]
fn range_query_correctness(
entries in prop::collection::vec(
(0u32..10_000, 0u32..1000), // (entity_id, value)
1..200,
),
lo in 0u32..500,
hi in 500u32..1000,
) {
let index: RangeIndex<u32> = RangeIndex::new("test_field");
for &(id, value) in &entries {
index.insert(id, value);
}
let result = index.range(
Bound::Included(&lo),
Bound::Included(&hi),
);
// Verify: result contains exactly the entities with lo <= value <= hi
for &(id, value) in &entries {
if value >= lo && value <= hi {
prop_assert!(result.contains(id),
"entity {id} with value {value} should be in range [{lo}, {hi}]");
}
}
// Note: entity IDs can appear multiple times in entries with different
// values, some inside range and some outside. The bitmap union ensures
// the entity is present if ANY of its values fall in range.
}
}
// Selectivity is in [0.0, 1.0].
proptest! {
#[test]
fn selectivity_in_unit_range(
entries in prop::collection::vec(
(0u32..10_000, 0u32..1000),
1..200,
),
lo in 0u32..500,
hi in 500u32..1000,
) {
let index: RangeIndex<u32> = RangeIndex::new("test_field");
for &(id, value) in &entries {
index.insert(id, value);
}
let total = index.total_count();
let sel = index.selectivity(
Bound::Included(&lo),
Bound::Included(&hi),
total,
);
prop_assert!(sel >= 0.0, "selectivity was {sel}");
prop_assert!(sel <= 1.0, "selectivity was {sel}");
}
}
// Insert-delete roundtrip: deleted entities do not appear in range queries.
proptest! {
#[test]
fn insert_delete_roundtrip(
entries in prop::collection::vec(
(0u32..1_000, 0u32..100),
1..100,
),
) {
let index: RangeIndex<u32> = RangeIndex::new("test_field");
for &(id, value) in &entries {
index.insert(id, value);
}
// Delete all entries
for &(id, value) in &entries {
index.delete(id, &value);
}
// Full range should be empty
let result = index.range(Bound::Unbounded, Bound::Unbounded);
prop_assert!(result.is_empty(), "expected empty after deleting all");
}
}
// Serialize-deserialize roundtrip (u32).
proptest! {
#[test]
fn serialize_roundtrip_u32(
entries in prop::collection::vec(
(0u32..10_000, 0u32..1000),
1..100,
),
) {
let index: RangeIndex<u32> = RangeIndex::new("duration");
for &(id, value) in &entries {
index.insert(id, value);
}
let kv_pairs = index.serialize_to_kv_pairs().unwrap();
let restored = RangeIndex::<u32>::load_from_kv_pairs(
"duration",
kv_pairs.into_iter(),
).unwrap();
// Full range query should match
let orig = index.range(Bound::Unbounded, Bound::Unbounded);
let rest = restored.range(Bound::Unbounded, Bound::Unbounded);
prop_assert_eq!(orig, rest);
}
}
```
### Unit Tests
```rust
#[test]
fn new_index_is_empty() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
assert!(index.is_empty());
assert_eq!(index.total_count(), 0);
assert_eq!(index.distinct_values(), 0);
}
#[test]
fn insert_and_range_query() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60); // 1 minute
index.insert(2, 300); // 5 minutes
index.insert(3, 600); // 10 minutes
index.insert(4, 1800); // 30 minutes
// Range [300, 600] should return entities 2 and 3
let result = index.range(
Bound::Included(&300),
Bound::Included(&600),
);
assert_eq!(result.len(), 2);
assert!(result.contains(2));
assert!(result.contains(3));
assert!(!result.contains(1));
assert!(!result.contains(4));
}
#[test]
fn gte_query() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
index.insert(3, 600);
let result = index.gte(&300);
assert_eq!(result.len(), 2);
assert!(result.contains(2));
assert!(result.contains(3));
}
#[test]
fn gt_query() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
index.insert(3, 600);
let result = index.gt(&300);
assert_eq!(result.len(), 1);
assert!(result.contains(3));
}
#[test]
fn lt_and_lte_queries() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
index.insert(3, 600);
let lt = index.lt(&300);
assert_eq!(lt.len(), 1);
assert!(lt.contains(1));
let lte = index.lte(&300);
assert_eq!(lte.len(), 2);
assert!(lte.contains(1));
assert!(lte.contains(2));
}
#[test]
fn unbounded_range_returns_all() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
index.insert(3, 600);
let all = index.range(Bound::Unbounded, Bound::Unbounded);
assert_eq!(all.len(), 3);
}
#[test]
fn empty_range_returns_empty_bitmap() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
// Range [400, 500] has no entities
let result = index.range(
Bound::Included(&400),
Bound::Included(&500),
);
assert!(result.is_empty());
}
#[test]
fn selectivity_full_range() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
index.insert(3, 600);
let sel = index.selectivity(
Bound::Unbounded,
Bound::Unbounded,
3,
);
assert!((sel - 1.0).abs() < f64::EPSILON);
}
#[test]
fn selectivity_partial_range() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 60);
index.insert(2, 300);
index.insert(3, 600);
index.insert(4, 1800);
let sel = index.selectivity(
Bound::Included(&300),
Bound::Included(&600),
4,
);
assert!((sel - 0.5).abs() < f64::EPSILON); // 2 of 4
}
#[test]
fn selectivity_zero_total() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
let sel = index.selectivity(
Bound::Unbounded,
Bound::Unbounded,
0,
);
assert!((sel - 0.0).abs() < f64::EPSILON);
}
#[test]
fn delete_cleans_up_empty_entries() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 300);
index.delete(1, &300);
assert_eq!(index.total_count(), 0);
assert_eq!(index.distinct_values(), 0);
}
#[test]
fn multiple_entities_same_value() {
let index: RangeIndex<u32> = RangeIndex::new("duration");
index.insert(1, 300);
index.insert(2, 300);
index.insert(3, 300);
let result = index.gte(&300);
assert_eq!(result.len(), 3);
assert_eq!(index.distinct_values(), 1); // one value: 300
}
#[test]
fn timestamp_index_u64() {
let index: RangeIndex<u64> = RangeIndex::new("created_at");
let now_ns: u64 = 1_708_000_000_000_000_000; // some timestamp
let one_day_ago = now_ns - 86_400_000_000_000; // 24h in nanos
let seven_days_ago = now_ns - 7 * 86_400_000_000_000;
index.insert(1, now_ns);
index.insert(2, one_day_ago);
index.insert(3, seven_days_ago);
index.insert(4, seven_days_ago - 1); // older than 7 days
// "created_within:7d" = created_at >= seven_days_ago
let recent = index.gte(&seven_days_ago);
assert_eq!(recent.len(), 3); // entities 1, 2, 3
assert!(!recent.contains(4));
}
```
## Acceptance Criteria
- [ ] `RangeIndex<V>` backed by `BTreeMap<V, RoaringBitmap>` with `RwLock` for concurrent access
- [ ] `insert(entity_id, value)` adds the entity to the bitmap for that exact value
- [ ] `delete(entity_id, value)` removes the entity from the bitmap; cleans up empty entries
- [ ] `range(lo, hi)` returns the union bitmap of all entries with keys in [lo, hi] using `BTreeMap::range()`
- [ ] `gt()`, `gte()`, `lt()`, `lte()` convenience methods implemented via `range()` with appropriate bounds
- [ ] `selectivity(lo, hi, total)` returns the fraction of entities in range; always in [0.0, 1.0]
- [ ] `total_count()` returns distinct entity IDs across all values (no double-counting)
- [ ] Concrete instantiations work: `RangeIndex<u64>` for timestamps, `RangeIndex<u32>` for durations
- [ ] Persistence: `serialize_to_kv_pairs()` / `load_from_kv_pairs()` roundtrip for `RangeIndex<u64>` and `RangeIndex<u32>` (property tested)
- [ ] Key encoding uses `encode_key(EntityId(0), Tag::Idx, b"RNG:{field_name}:{value_be_bytes}")` with BE ordering
- [ ] Range query returns exactly the entities whose values fall within the bounds (property tested)
- [ ] `RangeIndex<V>` is `Send + Sync`
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Selectivity estimation from sorted index statistics (used for range predicates alongside bitmap cardinality for keyword predicates)
## Spec References
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: "numeric range: estimate from sorted index statistics")
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.3 (`Filter::Range`, `Filter::Min`, `Filter::Max`, `Filter::CreatedWithin`, `Filter::CreatedAfter`, `Filter::CreatedBefore`)
## Implementation Notes
- `BTreeMap::range()` accepts `(Bound<&K>, Bound<&K>)`. The `Bound` type from `std::ops` provides `Included`, `Excluded`, and `Unbounded`. This maps directly to filter predicates: `Min` = `Included(threshold)..Unbounded`, `Max` = `Unbounded..Included(threshold)`, `Range` = `Included(lo)..Included(hi)`.
- Empty bitmaps should be removed from the `BTreeMap` after `delete()` to avoid dead entries accumulating. Check `bitmap.is_empty()` after removal.
- `RangeIndex<V>` requires `V: Ord + Clone`. `u32` and `u64` both satisfy these bounds. If future needs add `f64` range indexes (e.g., for latitude/longitude), `f64` does NOT implement `Ord` (NaN). Use `ordered_float::OrderedFloat<f64>` or a newtype wrapper. This is deferred -- M2 only needs integer types.
- The `RangeIndex` does not track which entity has which value. If an entity's value changes (e.g., duration updated), the caller must `delete(id, old_value)` then `insert(id, new_value)`. The caller (entity write path) is responsible for this. The index is a pure mapping structure.
- For persistence, values are encoded in big-endian bytes in the key suffix so lexicographic key ordering in the storage engine matches numeric value ordering. This allows potential future optimization of scanning index keys from the storage engine in sorted order without loading all into memory.
- Do NOT implement approximate selectivity estimation (histogram-based, reservoir sampling) in this task. The exact computation via bitmap union is fast enough at M2 scale. If M7 benchmarks show the full union is too slow for selectivity estimation, add histograms as an optimization.

View File

@ -0,0 +1,821 @@
# Task 03: Composable Filter Engine
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p2 -- Metadata Indexes and Filter Engine
**Depends On:** Task 01 (BitmapIndex), Task 02 (RangeIndex)
**Blocks:** m2p1 Task 04 (adaptive query planner's `SelectivityEstimator`), m2p5 (RETRIEVE executor)
**Complexity:** M
## Objective
Deliver `FilterExpr`, `FilterEvaluator`, and `FilterResult` -- the composable filter evaluation engine that sits between the query parser (m2p5) and the index structures (Tasks 01 and 02). The filter engine takes an arbitrary boolean combination of metadata predicates, evaluates them against bitmap and range indexes, and produces either a `RoaringBitmap` (for pre-filtering ANN candidates) or a predicate closure `Fn(u64) -> bool` (for in-graph filtering during HNSW traversal).
This is the component that makes `FILTER category:jazz, format:video, duration_min:5m, created_within:7d` work as a single composable expression. The query parser (m2p5) builds a `FilterExpr` tree. The executor passes it to `FilterEvaluator` which resolves each leaf against the appropriate index, composes the results via bitmap algebra (AND = intersection, OR = union, NOT = complement), and returns a `FilterResult` that can be used either as a bitmap or a predicate closure.
The filter engine also provides selectivity estimation for the adaptive query planner (m2p1 Task 04, Spec 07 Section 9). Before executing the ANN search, the planner calls `evaluator.selectivity(&expr)` to estimate what fraction of the corpus matches the filter. This determines the search strategy: brute-force for <1%, widened HNSW for 1-20%, in-graph predicate for >20%, unfiltered for 100%.
## Requirements
- `FilterExpr` enum: the AST of filter conditions (categorical equality, range predicates, boolean composition)
- `FilterEvaluator` struct: holds references to `BitmapIndex` and `RangeIndex` instances, evaluates filter expressions
- `FilterResult` enum: either `Bitmap(RoaringBitmap)` or `Predicate(Box<dyn Fn(u64) -> bool + Send + Sync>)`
- `evaluator.evaluate(expr) -> FilterResult` -- evaluates a filter expression against indexes
- `evaluator.selectivity(expr) -> f64` -- estimates the fraction of entities matching the filter
- `result.to_bitmap() -> RoaringBitmap` -- extracts or computes the bitmap representation
- `result.to_predicate() -> Box<dyn Fn(u64) -> bool + Send + Sync>` -- extracts or wraps as a predicate closure
- AND composition: bitmap intersection for two bitmap results
- OR composition: bitmap union for two bitmap results
- NOT composition: bitmap complement (requires the "universe" bitmap of all entity IDs)
- Short-circuit evaluation: sort AND operands by ascending selectivity, abort on empty (Spec 08 Section 7.4)
- Compound AND selectivity: product of individual selectivities (independence assumption)
- Compound OR selectivity: `1 - product(1 - s_i)` (inclusion-exclusion approximation)
- Criterion benchmarks for filter evaluation and predicate-per-candidate performance
## Technical Design
### Module Structure
```
tidal/src/storage/
indexes/
filter.rs -- FilterExpr, FilterEvaluator, FilterResult (this task)
tidal/benches/
filters.rs -- Criterion benchmarks (this task)
```
### Public API
```rust
// === storage/indexes/filter.rs ===
use roaring::RoaringBitmap;
use std::time::Duration;
use super::bitmap::BitmapIndex;
use super::range::RangeIndex;
use super::IndexError;
/// A filter expression AST node.
///
/// Built by the query parser (m2p5) from the `FILTER` clause
/// of a RETRIEVE or SEARCH query. Evaluated by `FilterEvaluator`
/// against bitmap and range indexes.
///
/// # Composition
///
/// Filters compose naturally:
/// - `And(vec![CategoryEq("jazz"), FormatEq("video")])` = jazz videos
/// - `Or(vec![CategoryEq("jazz"), CategoryEq("blues")])` = jazz or blues
/// - `And(vec![CategoryEq("jazz"), DurationMin(300)])` = jazz items >= 5 min
/// - `Not(Box::new(CreatorEq(77)))` = not by creator 77
///
/// The query parser enforces valid structure: AND across dimensions,
/// OR within a dimension. The evaluator handles any structure.
#[derive(Debug, Clone, PartialEq)]
pub enum FilterExpr {
/// Exact equality on a categorical field: `category:jazz`
CategoryEq(String),
/// Exact equality on format field: `format:video`
FormatEq(String),
/// Exact equality on creator: `creator:@id`
CreatorEq(u32),
/// Tag match (multi-value field): `tag:classical`
Tag(String),
/// Minimum duration in seconds: `duration_min:300` (5 minutes)
DurationMin(u32),
/// Maximum duration in seconds: `duration_max:1800` (30 minutes)
DurationMax(u32),
/// Created after a timestamp (nanos): `created_after:...`
CreatedAfter(u64),
/// Created before a timestamp (nanos): `created_before:...`
CreatedBefore(u64),
/// AND composition: all sub-expressions must match.
And(Vec<FilterExpr>),
/// OR composition: at least one sub-expression must match.
Or(Vec<FilterExpr>),
/// NOT composition: the sub-expression must NOT match.
/// Requires a universe bitmap for complement computation.
Not(Box<FilterExpr>),
}
/// The result of evaluating a filter expression.
///
/// Can be materialized as either a `RoaringBitmap` (for pre-filtering)
/// or a predicate closure (for in-graph filtering).
pub enum FilterResult {
/// A bitmap of matching entity IDs. Used for pre-filtering ANN
/// candidates or intersecting with other filter results.
Bitmap(RoaringBitmap),
/// A predicate closure for in-graph filtering when bitmap
/// materialization is deferred or unnecessary.
Predicate(Box<dyn Fn(u64) -> bool + Send + Sync>),
}
impl FilterResult {
/// Extract the bitmap. This is the primary representation.
pub fn into_bitmap(self) -> RoaringBitmap;
/// Convert to a predicate closure that checks bitmap containment.
///
/// The predicate captures the bitmap and returns `true` if the
/// entity ID is present. This is the closure passed to
/// `VectorIndex::filtered_search()` for in-graph filtering.
///
/// Performance: bitmap containment check is O(1) amortized,
/// approximately 10-50 nanoseconds per check.
pub fn into_predicate(self) -> Box<dyn Fn(u64) -> bool + Send + Sync>;
/// Number of entities matching the filter.
pub fn cardinality(&self) -> u64;
/// Whether no entities match.
pub fn is_empty(&self) -> bool;
}
/// Evaluates filter expressions against bitmap and range indexes.
///
/// The evaluator holds references to all indexes needed for filter
/// resolution. It is constructed once per query and evaluates the
/// filter expression tree.
///
/// # Selectivity Estimation
///
/// The `selectivity()` method estimates the fraction of entities
/// matching a filter WITHOUT materializing the full bitmap. This
/// is used by the adaptive query planner (m2p1) to choose the
/// ANN search strategy before executing the search.
///
/// For leaf predicates, selectivity uses `BitmapIndex::cardinality()`
/// or `RangeIndex::selectivity()`. For compound predicates:
/// - AND: product of individual selectivities (independence assumption)
/// - OR: `1 - product(1 - s_i)` (inclusion-exclusion approximation)
/// - NOT: `1 - selectivity(inner)`
pub struct FilterEvaluator<'a> {
category_index: &'a BitmapIndex,
format_index: &'a BitmapIndex,
creator_index: &'a BitmapIndex,
tag_index: &'a BitmapIndex,
duration_index: &'a RangeIndex<u32>,
created_at_index: &'a RangeIndex<u64>,
/// The universe bitmap: all entity IDs currently in the database.
/// Used for NOT (complement) operations.
universe: &'a RoaringBitmap,
}
impl<'a> FilterEvaluator<'a> {
/// Create a new evaluator with references to all indexes.
pub fn new(
category_index: &'a BitmapIndex,
format_index: &'a BitmapIndex,
creator_index: &'a BitmapIndex,
tag_index: &'a BitmapIndex,
duration_index: &'a RangeIndex<u32>,
created_at_index: &'a RangeIndex<u64>,
universe: &'a RoaringBitmap,
) -> Self;
/// Evaluate a filter expression and return the matching entity set.
///
/// Recursively evaluates the expression tree:
/// - Leaf nodes resolve to index lookups.
/// - AND nodes intersect child bitmaps.
/// - OR nodes union child bitmaps.
/// - NOT nodes compute the complement against the universe.
///
/// # Short-Circuit Optimization (Spec 08, Section 7.4)
///
/// For AND nodes, children are sorted by estimated selectivity
/// (ascending). After each child evaluation, if the running
/// intersection is empty, evaluation stops immediately.
pub fn evaluate(&self, expr: &FilterExpr) -> FilterResult;
/// Estimate the fraction of entities matching a filter expression.
///
/// This is a cheap computation that does NOT materialize bitmaps.
/// It uses index cardinalities and the independence assumption
/// for compound predicates.
///
/// Returns a value in [0.0, 1.0].
///
/// Used by the adaptive query planner to choose the ANN strategy:
/// - selectivity < 0.01 -> pre-filter + brute-force
/// - selectivity 0.01 - 0.20 -> widened HNSW (ACORN-1)
/// - selectivity > 0.20 -> in-graph predicate filter
/// - selectivity == 1.0 -> unfiltered search
pub fn selectivity(&self, expr: &FilterExpr) -> f64;
}
```
### Internal Design
**Leaf evaluation:**
Each `FilterExpr` leaf resolves to an index lookup:
| FilterExpr Variant | Index | Method |
|-------------------|-------|--------|
| `CategoryEq(v)` | `category_index` | `get(v) -> RoaringBitmap` |
| `FormatEq(v)` | `format_index` | `get(v) -> RoaringBitmap` |
| `CreatorEq(id)` | `creator_index` | `get(&id.to_string()) -> RoaringBitmap` |
| `Tag(v)` | `tag_index` | `get(v) -> RoaringBitmap` |
| `DurationMin(secs)` | `duration_index` | `gte(&secs) -> RoaringBitmap` |
| `DurationMax(secs)` | `duration_index` | `lte(&secs) -> RoaringBitmap` |
| `CreatedAfter(ns)` | `created_at_index` | `gt(&ns) -> RoaringBitmap` |
| `CreatedBefore(ns)` | `created_at_index` | `lt(&ns) -> RoaringBitmap` |
When a leaf lookup returns `None` (no entities match), an empty `RoaringBitmap` is used. This ensures AND short-circuit works: `empty & X = empty`.
**AND evaluation with short-circuit:**
```rust
fn evaluate_and(&self, children: &[FilterExpr]) -> RoaringBitmap {
if children.is_empty() {
return self.universe.clone();
}
// Sort children by estimated selectivity (ascending) for early termination
let mut ordered: Vec<_> = children.iter()
.map(|c| (self.selectivity(c), c))
.collect();
ordered.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap_or(std::cmp::Ordering::Equal));
let mut result = self.evaluate(&ordered[0].1).into_bitmap();
for &(_, child) in &ordered[1..] {
// Short-circuit: empty ∩ X = empty regardless of remaining children,
// including NOT sub-expressions. AND identity: ∅ & X = ∅.
if result.is_empty() {
return FilterResult::Bitmap(result);
}
let child_bitmap = self.evaluate(child).into_bitmap();
result &= &child_bitmap;
}
result
}
```
**NOT evaluation:**
```rust
fn evaluate_not(&self, inner: &FilterExpr) -> RoaringBitmap {
let inner_bitmap = self.evaluate(inner).into_bitmap();
let mut complement = self.universe.clone();
complement -= &inner_bitmap;
complement
}
```
The NOT operation requires the "universe" bitmap (all valid entity IDs). This is maintained by the entity store and passed to the `FilterEvaluator` at construction time. Without it, complement is undefined.
**Selectivity estimation without bitmap materialization:**
```rust
fn selectivity(&self, expr: &FilterExpr) -> f64 {
let total = self.universe.len() as f64;
if total == 0.0 {
return 0.0;
}
match expr {
FilterExpr::CategoryEq(v) =>
self.category_index.cardinality(v) as f64 / total,
FilterExpr::FormatEq(v) =>
self.format_index.cardinality(v) as f64 / total,
FilterExpr::CreatorEq(id) =>
self.creator_index.cardinality(&id.to_string()) as f64 / total,
FilterExpr::Tag(v) =>
self.tag_index.cardinality(v) as f64 / total,
FilterExpr::DurationMin(secs) =>
self.duration_index.selectivity(
Bound::Included(secs), Bound::Unbounded,
self.universe.len(),
),
FilterExpr::DurationMax(secs) =>
self.duration_index.selectivity(
Bound::Unbounded, Bound::Included(secs),
self.universe.len(),
),
FilterExpr::CreatedAfter(ns) =>
self.created_at_index.selectivity(
Bound::Excluded(ns), Bound::Unbounded,
self.universe.len(),
),
FilterExpr::CreatedBefore(ns) =>
self.created_at_index.selectivity(
Bound::Unbounded, Bound::Excluded(ns),
self.universe.len(),
),
FilterExpr::And(children) => {
// Independence assumption: P(A AND B) = P(A) * P(B)
children.iter()
.map(|c| self.selectivity(c))
.product::<f64>()
.clamp(0.0, 1.0)
}
FilterExpr::Or(children) => {
// Inclusion-exclusion approximation:
// P(A OR B) = 1 - (1 - P(A)) * (1 - P(B))
let complement_product: f64 = children.iter()
.map(|c| 1.0 - self.selectivity(c))
.product();
(1.0 - complement_product).clamp(0.0, 1.0)
}
FilterExpr::Not(inner) => {
(1.0 - self.selectivity(inner)).clamp(0.0, 1.0)
}
}
}
```
The independence assumption (AND selectivity = product of individual selectivities) is known to be inaccurate for correlated filters (e.g., `category:jazz AND format:audio`). Spec 07 Section 3 acknowledges this and proposes a correlation cache for frequently co-occurring filter pairs. This cache is deferred to M5+. For M2, the independence assumption is sufficient because the query planner's strategy selection has wide threshold bands (1%, 20%) that tolerate estimation error.
**`FilterResult::into_predicate()`:**
```rust
impl FilterResult {
pub fn into_predicate(self) -> Box<dyn Fn(u64) -> bool + Send + Sync> {
match self {
FilterResult::Bitmap(bitmap) => {
Box::new(move |id: u64| {
debug_assert!(id <= u64::from(u32::MAX), "EntityId out of u32 range");
bitmap.contains(id as u32)
})
}
FilterResult::Predicate(f) => f,
}
}
}
```
The predicate captures the `RoaringBitmap` by move. `RoaringBitmap` is `Send + Sync`, so the closure is `Send + Sync`. The bitmap containment check (`bitmap.contains(id)`) is O(1) amortized via the roaring bitmap's hierarchical container structure. Performance: ~10-50 nanoseconds per check, well under the 1 microsecond target.
### Error Handling
- `evaluate()` is infallible. Missing index values produce empty bitmaps.
- `selectivity()` is infallible. Returns 0.0 for empty universe or missing values.
- `into_bitmap()` and `into_predicate()` are infallible.
## Test Strategy
### Property Tests
```rust
use proptest::prelude::*;
use roaring::RoaringBitmap;
// AND of two filters equals bitmap intersection.
proptest! {
#[test]
fn and_equals_intersection(
a_ids in prop::collection::hash_set(0u32..10_000, 1..200),
b_ids in prop::collection::hash_set(0u32..10_000, 1..200),
) {
let cat_index = BitmapIndex::new("category");
let fmt_index = BitmapIndex::new("format");
for &id in &a_ids { cat_index.insert(id, "jazz"); }
for &id in &b_ids { fmt_index.insert(id, "video"); }
// Build universe
let mut universe = RoaringBitmap::new();
for &id in a_ids.iter().chain(b_ids.iter()) { universe.insert(id); }
let evaluator = build_evaluator(&cat_index, &fmt_index, &universe);
let result = evaluator.evaluate(&FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
]));
let expected = {
let a: RoaringBitmap = a_ids.iter().copied().collect();
let b: RoaringBitmap = b_ids.iter().copied().collect();
a & b
};
prop_assert_eq!(result.into_bitmap(), expected);
}
}
// OR of two filters equals bitmap union.
proptest! {
#[test]
fn or_equals_union(
a_ids in prop::collection::hash_set(0u32..10_000, 1..200),
b_ids in prop::collection::hash_set(0u32..10_000, 1..200),
) {
let cat_index = BitmapIndex::new("category");
for &id in &a_ids { cat_index.insert(id, "jazz"); }
for &id in &b_ids { cat_index.insert(id, "blues"); }
let mut universe = RoaringBitmap::new();
for &id in a_ids.iter().chain(b_ids.iter()) { universe.insert(id); }
let evaluator = build_evaluator_simple(&cat_index, &universe);
let result = evaluator.evaluate(&FilterExpr::Or(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::CategoryEq("blues".into()),
]));
let expected = {
let a: RoaringBitmap = a_ids.iter().copied().collect();
let b: RoaringBitmap = b_ids.iter().copied().collect();
a | b
};
prop_assert_eq!(result.into_bitmap(), expected);
}
}
// NOT produces the complement.
proptest! {
#[test]
fn not_is_complement(
all_ids in prop::collection::hash_set(0u32..10_000, 50..500),
match_ids in prop::collection::hash_set(0u32..10_000, 1..200),
) {
let cat_index = BitmapIndex::new("category");
for &id in &match_ids { cat_index.insert(id, "jazz"); }
let universe: RoaringBitmap = all_ids.iter().copied().collect();
let evaluator = build_evaluator_simple(&cat_index, &universe);
let result = evaluator.evaluate(&FilterExpr::Not(
Box::new(FilterExpr::CategoryEq("jazz".into()))
));
let result_bitmap = result.into_bitmap();
// The result should be universe \ jazz
let jazz_bitmap: RoaringBitmap = match_ids.iter()
.filter(|id| all_ids.contains(id))
.copied()
.collect();
let expected = &universe - &jazz_bitmap;
prop_assert_eq!(result_bitmap, expected);
}
}
// Selectivity is in [0.0, 1.0] for any expression.
proptest! {
#[test]
fn selectivity_in_unit_range(
n_items in 1u32..10_000,
n_jazz in 0u32..5_000,
n_video in 0u32..5_000,
) {
let cat_index = BitmapIndex::new("category");
let fmt_index = BitmapIndex::new("format");
let n_jazz = n_jazz.min(n_items);
let n_video = n_video.min(n_items);
for id in 0..n_jazz { cat_index.insert(id, "jazz"); }
for id in 0..n_video { fmt_index.insert(id, "video"); }
let universe: RoaringBitmap = (0..n_items).collect();
let evaluator = build_evaluator(&cat_index, &fmt_index, &universe);
// Test various expressions
let exprs = vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
]),
FilterExpr::Or(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
]),
FilterExpr::Not(Box::new(FilterExpr::CategoryEq("jazz".into()))),
];
for expr in &exprs {
let sel = evaluator.selectivity(expr);
prop_assert!(sel >= 0.0 && sel <= 1.0,
"selectivity {sel} out of range for {expr:?}");
}
}
}
// Predicate closure agrees with bitmap containment.
proptest! {
#[test]
fn predicate_matches_bitmap(
ids in prop::collection::hash_set(0u32..10_000, 1..200),
test_ids in prop::collection::vec(0u32..10_000, 1..100),
) {
let cat_index = BitmapIndex::new("category");
for &id in &ids { cat_index.insert(id, "jazz"); }
let universe: RoaringBitmap = (0..10_000u32).collect();
let evaluator = build_evaluator_simple(&cat_index, &universe);
let result = evaluator.evaluate(&FilterExpr::CategoryEq("jazz".into()));
let bitmap = result.into_bitmap();
let bitmap_clone = bitmap.clone();
let predicate = FilterResult::Bitmap(bitmap).into_predicate();
for &test_id in &test_ids {
prop_assert_eq!(
predicate(test_id),
bitmap_clone.contains(test_id),
"predicate disagreed with bitmap for id {test_id}"
);
}
}
}
```
### Unit Tests
```rust
#[test]
fn evaluate_single_category() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::CategoryEq("jazz".into()));
let bitmap = result.into_bitmap();
assert!(!bitmap.is_empty());
// All IDs in the bitmap should be jazz items
}
#[test]
fn evaluate_and_two_filters() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
]));
let bitmap = result.into_bitmap();
// Should be the intersection of jazz items and video items
}
#[test]
fn evaluate_or_two_categories() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::Or(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::CategoryEq("blues".into()),
]));
let bitmap = result.into_bitmap();
// Should be the union of jazz and blues items
}
#[test]
fn evaluate_not_creator() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::Not(
Box::new(FilterExpr::CreatorEq(77))
));
let bitmap = result.into_bitmap();
// Should exclude all items by creator 77
assert!(!bitmap.contains(77)); // assuming creator 77's items use creator_id as entity_id for test simplicity
}
#[test]
fn evaluate_duration_min() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::DurationMin(300)); // >= 5 min
let bitmap = result.into_bitmap();
// Should include items with duration >= 300 seconds
}
#[test]
fn evaluate_created_after() {
let (evaluator, _indexes) = setup_test_indexes();
let seven_days_ago = now_ns() - 7 * 86_400_000_000_000;
let result = evaluator.evaluate(&FilterExpr::CreatedAfter(seven_days_ago));
let bitmap = result.into_bitmap();
// Should include items created within the last 7 days
}
#[test]
fn evaluate_complex_compound() {
// category:jazz AND format:video AND duration_min:5m AND created_within:7d
let (evaluator, _indexes) = setup_test_indexes();
let seven_days_ago = now_ns() - 7 * 86_400_000_000_000;
let result = evaluator.evaluate(&FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
FilterExpr::DurationMin(300),
FilterExpr::CreatedAfter(seven_days_ago),
]));
let bitmap = result.into_bitmap();
// Result should be the intersection of all four conditions
}
#[test]
fn evaluate_nonexistent_category_short_circuits() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::And(vec![
FilterExpr::CategoryEq("nonexistent".into()),
FilterExpr::FormatEq("video".into()),
]));
assert!(result.is_empty());
}
#[test]
fn selectivity_single_category() {
let (evaluator, _indexes) = setup_test_indexes_with_known_counts();
// If 100 of 1000 items are jazz, selectivity = 0.1
let sel = evaluator.selectivity(&FilterExpr::CategoryEq("jazz".into()));
assert!((sel - 0.1).abs() < 0.01);
}
#[test]
fn selectivity_and_independence() {
let (evaluator, _indexes) = setup_test_indexes_with_known_counts();
// jazz=10%, video=20%, AND selectivity = 10% * 20% = 2%
let sel = evaluator.selectivity(&FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
]));
assert!((sel - 0.02).abs() < 0.01);
}
#[test]
fn selectivity_or_inclusion_exclusion() {
let (evaluator, _indexes) = setup_test_indexes_with_known_counts();
// jazz=10%, blues=15%, OR selectivity = 1 - (0.9 * 0.85) = 0.235
let sel = evaluator.selectivity(&FilterExpr::Or(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::CategoryEq("blues".into()),
]));
assert!((sel - 0.235).abs() < 0.02);
}
#[test]
fn selectivity_not() {
let (evaluator, _indexes) = setup_test_indexes_with_known_counts();
// jazz=10%, NOT jazz = 90%
let sel = evaluator.selectivity(&FilterExpr::Not(
Box::new(FilterExpr::CategoryEq("jazz".into()))
));
assert!((sel - 0.9).abs() < 0.01);
}
#[test]
fn selectivity_empty_universe() {
// With no entities, selectivity is always 0.0
let universe = RoaringBitmap::new();
let cat_index = BitmapIndex::new("category");
let evaluator = build_evaluator_with_empty(&cat_index, &universe);
let sel = evaluator.selectivity(&FilterExpr::CategoryEq("jazz".into()));
assert!((sel - 0.0).abs() < f64::EPSILON);
}
#[test]
fn into_predicate_checks_bitmap_containment() {
let mut bitmap = RoaringBitmap::new();
bitmap.insert(1);
bitmap.insert(42);
bitmap.insert(999);
let result = FilterResult::Bitmap(bitmap);
let predicate = result.into_predicate();
assert!(predicate(1));
assert!(predicate(42));
assert!(predicate(999));
assert!(!predicate(0));
assert!(!predicate(2));
assert!(!predicate(1000));
}
#[test]
fn empty_and_returns_universe() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::And(vec![]));
// AND of nothing = everything (identity element for intersection)
let bitmap = result.into_bitmap();
assert!(!bitmap.is_empty());
}
#[test]
fn empty_or_returns_empty() {
let (evaluator, _indexes) = setup_test_indexes();
let result = evaluator.evaluate(&FilterExpr::Or(vec![]));
// OR of nothing = nothing (identity element for union)
assert!(result.is_empty());
}
```
### Benchmarks
```rust
// === tidal/benches/filters.rs ===
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use roaring::RoaringBitmap;
fn bench_filter_single_category(c: &mut Criterion) {
// Setup: 10K items, 10 categories (~1K items per category)
let (evaluator, _indexes) = setup_benchmark_indexes(10_000);
let expr = FilterExpr::CategoryEq("jazz".into());
c.bench_function("filter_single_category", |b| {
b.iter(|| {
let result = evaluator.evaluate(black_box(&expr));
black_box(result.cardinality());
});
});
}
fn bench_filter_compound_and(c: &mut Criterion) {
// Setup: 10K items, category:jazz AND format:video AND duration_min:300
let (evaluator, _indexes) = setup_benchmark_indexes(10_000);
let expr = FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
FilterExpr::DurationMin(300),
]);
c.bench_function("filter_compound_and_3", |b| {
b.iter(|| {
let result = evaluator.evaluate(black_box(&expr));
black_box(result.cardinality());
});
});
}
fn bench_filter_predicate_per_candidate(c: &mut Criterion) {
// Setup: build a filter result, then check 10K candidate IDs
let (evaluator, _indexes) = setup_benchmark_indexes(10_000);
let expr = FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
]);
let result = evaluator.evaluate(&expr);
let predicate = result.into_predicate();
c.bench_function("filter_predicate_per_candidate_10k", |b| {
b.iter(|| {
for id in 0..10_000u32 {
black_box(predicate(black_box(id)));
}
});
});
}
fn bench_selectivity_estimation(c: &mut Criterion) {
// Setup: 10K items
let (evaluator, _indexes) = setup_benchmark_indexes(10_000);
let expr = FilterExpr::And(vec![
FilterExpr::CategoryEq("jazz".into()),
FilterExpr::FormatEq("video".into()),
FilterExpr::DurationMin(300),
]);
c.bench_function("selectivity_estimation_compound", |b| {
b.iter(|| {
black_box(evaluator.selectivity(black_box(&expr)));
});
});
}
criterion_group!(
benches,
bench_filter_single_category,
bench_filter_compound_and,
bench_filter_predicate_per_candidate,
bench_selectivity_estimation,
);
criterion_main!(benches);
```
## Acceptance Criteria
- [ ] `FilterExpr` enum covers: `CategoryEq`, `FormatEq`, `CreatorEq`, `Tag`, `DurationMin`, `DurationMax`, `CreatedAfter`, `CreatedBefore`, `And`, `Or`, `Not`
- [ ] `FilterEvaluator` resolves each leaf `FilterExpr` variant against the correct index (`BitmapIndex` or `RangeIndex`)
- [ ] `evaluate()` produces correct results for single-leaf, AND, OR, NOT, and nested compound expressions (property tested)
- [ ] AND evaluation uses bitmap intersection; empty AND (no children) returns the universe
- [ ] OR evaluation uses bitmap union; empty OR (no children) returns empty
- [ ] NOT evaluation uses bitmap complement against the universe
- [ ] AND evaluation sorts children by ascending selectivity and short-circuits on empty intersection (Spec 08 Section 7.4)
- [ ] `selectivity()` returns a value in [0.0, 1.0] for all inputs (property tested)
- [ ] AND selectivity uses product of individual selectivities (independence assumption)
- [ ] OR selectivity uses `1 - product(1 - s_i)` (inclusion-exclusion approximation)
- [ ] NOT selectivity = `1 - inner_selectivity`
- [ ] `FilterResult::into_bitmap()` extracts the bitmap
- [ ] `FilterResult::into_predicate()` returns a closure that checks bitmap containment; closure is `Send + Sync`
- [ ] Predicate per-candidate evaluation < 1 microsecond (benchmarked; target ~10-50 nanoseconds via roaring containment check)
- [ ] Criterion benchmarks pass: `bench_filter_single_category`, `bench_filter_compound_and`, `bench_filter_predicate_per_candidate`, `bench_selectivity_estimation`
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "estimate selectivity via metadata indexes" -> select ANN strategy; "resolve filter predicates to roaring bitmaps" -> pre-filter for brute-force; predicate callback for in-graph filtering
- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 3 (selectivity estimation: independence assumption for AND, correlation cache deferred; keyword equality from bitmap cardinality; numeric range from sorted index), Section 9 (adaptive query planner decision tree: threshold reference table)
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 7.1 (bitmap-based filter architecture), Section 7.2 (filter push-down: bitmap passed to `VectorIndex::filtered_search()` as predicate callback), Section 7.3 (`Filter` enum with all variant types), Section 7.4 (short-circuit evaluation: sort by ascending cardinality, abort on empty intersection), Section 5.3 (Stage 2: filter evaluation reduces candidate set)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3.1 (filter interaction with ANN: "pre-filters are applied as predicate callbacks during HNSW traversal when selectivity is 2-100%"), Section 4 Stage 3 (filter evaluation: "AND-composed across dimensions; OR-composed within a dimension")
## Implementation Notes
- `FilterExpr` implements `Clone` and `PartialEq` for testing. It does NOT need to be `Serialize`/`Deserialize` -- it exists only as an in-memory AST between the parser and the evaluator.
- The `FilterEvaluator` borrows all indexes by reference (`&'a`). It is constructed per-query and does not outlive the query execution. This avoids cloning indexes.
- `FilterResult` only has a `Bitmap` variant in M2. A future `Predicate` variant (wrapping an arbitrary closure without materializing a bitmap) is deferred because at M2 scale (10K items), bitmap materialization is always cheap. At M7 scale, if filter evaluation on 10M items is too slow to materialize as a bitmap, add a `Predicate` variant that evaluates lazily.
- The `into_predicate()` method moves the bitmap into the closure. This means the `FilterResult` is consumed. If both the bitmap and the predicate are needed, clone the bitmap first.
- Add `roaring = "0.10"` to `[dependencies]` in `tidal/Cargo.toml` (may already be added by Task 01).
- Add `[[bench]] name = "filters" harness = false` to `tidal/Cargo.toml`.
- Do NOT implement user-state filters (`Unseen`, `NotBlocked`, `Relationship`, `SocialGraph`, `InCollection`) in this task. These require user entities (M3). The `FilterExpr` enum deliberately omits them. They are added in m3p4 (User State Filters).
- Do NOT implement the correlation cache for co-occurring filter pairs. The independence assumption is sufficient for M2's selectivity estimation.
- Do NOT implement filter push-down into the ANN search in this task. The filter engine produces the bitmap/predicate; the RETRIEVE executor (m2p5) is responsible for passing it to the vector index.

View File

@ -0,0 +1,113 @@
# Milestone 2, Phase 3: Ranking Profile Engine
## Phase Deliverable
Named ranking profiles declared as runtime data (not compiled code), stored in the schema, parsed, validated, and executed by the database. Profiles reference signal decay scores, windowed aggregates, velocity, and metadata fields. They define quality gates, boosts, penalties, and candidate generation strategies. Profiles are versioned and swappable at query time without recompile. The executor takes a profile and a candidate set and produces a scored, sorted result list in under 10 microseconds for 200 candidates.
This is the phase that turns signals from "primitives the application reads" into "primitives the database scores over." After this phase, a developer can name a profile and get ranked results -- the database does the math, not the application.
## Acceptance Criteria
- [ ] `RankingProfile` struct: name, version, candidate_strategy, scoring_rules (boosts, penalties, gates, excludes), sort mode override, diversity config, exploration budget
- [ ] `ScoringRule` enum: `Boost { signal, window, aggregation, weight }`, `Gate { condition, threshold }`, `Penalize { signal, window, weight }`, `Exclude { condition }`
- [ ] `Sort` enum with formula variants: `Hot { gravity }`, `Controversial`, `HiddenGems`, `New`, `Shuffle { seed }`, `TopWindow { window }`, `MostViewed`, `MostLiked`, `Rising`
- [ ] `CandidateStrategy` enum: `Ann`, `Scan`, `Hybrid`, `Relationship`, `CohortTrending` (M2 implements only `Scan`; others are type stubs used by profiles but not executed until their retrieval strategy is built)
- [ ] `ProfileRegistry` maps profile name to versioned `RankingProfile` instances, supports `get`, `get_versioned`, `register`, `list`
- [ ] Profile validation: duplicate names rejected, unknown signal references rejected (if schema available), gate threshold range [0.0, 1.0], weight normalization warning, version monotonicity (INV-PROF-1)
- [ ] Profiles serializable via serde for schema checkpoint/reload
- [ ] Built-in profiles registered at `SchemaBuilder::build()` time: `trending`, `hot`, `new`, `top_week`, `top_month`, `top_all_time`, `hidden_gems`, `controversial`, `most_viewed`, `most_liked`, `shuffle`
- [ ] Built-in profiles are standard `RankingProfile` instances -- not special-cased in the executor
- [ ] Built-in profiles with unavailable signals degrade gracefully (skip missing signals, not fatal error)
- [ ] `hot` formula: `log10(max(|positive - negative|, 1)) / (age_hours + 2)^gravity` with configurable gravity (default 1.8) -- Spec 09 Section 11.1
- [ ] `controversial` formula: `(positive * negative) / (positive + negative)^2` -- Spec 09 Section 11.4
- [ ] `hidden_gems` formula: `quality_score * (1 / log10(view_count + 10))` -- Spec 09 Section 11.5
- [ ] `ProfileExecutor::score()` takes `&[EntityId]` candidates and `&RankingProfile`, returns `Vec<ScoredCandidate>` sorted by score descending
- [ ] `ScoredCandidate` includes: entity_id, score, signal_snapshot (key signal values used for scoring transparency)
- [ ] Gate failure sets score to 0.0; candidates with score 0.0 are filtered out before returning
- [ ] Shuffle profile uses deterministic seeded RNG (stable per user_id + profile_name + page_cursor)
- [ ] Profile change does not require recompile -- profiles are runtime data
- [ ] 200-candidate scoring pass with decay-only profile < 10 microseconds, with velocity-based profile (trending) < 100 microseconds (both Criterion benchmarked)
- [ ] Deterministic scoring: same candidates + same profile + same signal state = identical results (INV-RANK-1)
- [ ] Normalized scores in [0.0, 1.0] after min-max normalization (INV-RANK-2)
## Dependencies
- **Requires:** m1p1 (types: `EntityId`, `EntityKind`, `SignalTypeDef`, `Window`, `WindowSet`, `DecayModel`, `Score`), m1p4 (SignalLedger: profiles read decay scores and windowed counts via `SignalLedger` API), m1p5 (entity read API: `TidalDb::read_decay_score`, `TidalDb::read_windowed_count`, `TidalDb::read_velocity`)
- **Blocks:** m2p4 (diversity enforcement takes scored lists from profile executor), m2p5 (RETRIEVE executor uses profiles to score candidates)
## Research References
- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- Signal read latencies (~15ns decay score, ~200ns windowed count) that establish the per-candidate scoring budget
- [thoughts.md](../../../../thoughts.md) -- Part V.14 (cache-line alignment for hot-path structs), scoring pipeline architecture
## Spec References
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- THE authoritative spec for this phase:
- Section 2 (ProfileDef structure, versioning, inheritance, A/B testing)
- Section 3 (CandidateStrategy variants: ANN, Scan, Hybrid, Relationship, CohortTrending)
- Section 4 (Scoring pipeline: 9-stage fixed-order transformation)
- Section 5 (Boost types: signal, relationship, social proof, recency, cohort, preference match)
- Section 6 (Penalty types: signal-based negative scoring)
- Section 7 (Quality gates: minimum signal, ratio, count gates with exploration bypass)
- Section 8 (Score composition: composite formula, min-max normalization, percentile signal normalization)
- Section 11 (Built-in sort modes: Hot, Trending, Rising, Controversial, HiddenGems, Shuffle, Top windowed, simple field sorts)
- Section 13 (Profile presets: for_you, trending, search, following, related, browse, hidden_gems, notification, live, hot, rising, controversial)
- Section 15 (Performance targets: total scoring pipeline < 500us for 200 candidates, per-candidate scoring ~1.5us)
- Section 16 (Invariants INV-RANK-1 through INV-RANK-7, INV-PROF-1 through INV-PROF-3, property tests P1-P6)
## Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Ranking Profile Type System | `RankingProfile`, `ScoringRule`, `Sort`, `CandidateStrategy`, `ProfileRegistry`, validation, serde | None | L |
| 02 | Built-in Profiles | All 11 built-in profile definitions as `RankingProfile` instances, signal dependency validation, graceful degradation for missing signals | Task 01 | M |
| 03 | Profile Executor + Benchmarks | `ProfileExecutor`, `ScoredCandidate`, `ShuffleExecutor`, sort formula implementations, min-max normalization, Criterion benchmarks | Task 01, Task 02 | L |
## Task Dependency DAG
```
Task 01: Ranking Profile Type System
|
v
Task 02: Built-in Profiles
|
+---> Task 03: Profile Executor + Benchmarks
| (also depends on Task 01)
```
Task 01 is the foundation -- it defines all types that Task 02 instantiates and Task 03 executes. Task 02 constructs the built-in profiles from Task 01 types. Task 03 requires both the types (Task 01) and the profiles (Task 02) to implement and benchmark execution.
## File Layout
```
tidal/src/
ranking/
mod.rs -- pub use re-exports: RankingProfile, ScoringRule, Sort, CandidateStrategy,
ProfileRegistry, ProfileExecutor, ScoredCandidate
profile.rs -- RankingProfile struct, ScoringRule, Sort, CandidateStrategy,
SignalAgg, Boost, Gate, Penalty, Exclude, validation (Task 01)
registry.rs -- ProfileRegistry, built-in profile construction, signal dependency
checking (Task 01 registry types, Task 02 built-in definitions)
executor.rs -- ProfileExecutor, ScoredCandidate, score() method,
sort formula implementations (Task 03)
shuffle.rs -- ShuffleExecutor, seeded RNG (Task 03)
lib.rs -- (unchanged, already declares pub mod ranking)
tidal/benches/
ranking.rs -- Criterion benchmarks (Task 03)
```
## Open Questions
1. **`SmallRng` vs `rand_xoshiro`**: The shuffle profile needs a stable-per-session RNG seeded from `(user_id, profile_name, page_cursor)`. `SmallRng` from the `rand` crate is fast and seedable. `rand_xoshiro::Xoshiro256StarStar` is available via `rand_xoshiro`. Decision: use `SmallRng` for M2 -- it is already in `rand`'s dependency tree, performs well, and is reproducible given the same seed. Add `rand_xoshiro` only if `SmallRng` proves non-deterministic across platforms.
2. **`signal_snapshot` in ScoredCandidate**: The spec says results should include key signal values used in scoring for debugging (Spec 09 Section 4, Stage 10). For M2, include all signals referenced in the profile's scoring rules (typically 2-5 signals). Cap at 10 signal values per candidate. The snapshot is a `Vec<(String, f64)>` (signal name, value) rather than a HashMap to keep allocation small and ordering deterministic.
3. **Gate vs Exclude vs Penalize semantics**: Gate zeros the score (candidate excluded from results). Exclude physically removes the candidate before scoring (it never enters the pipeline). Penalize multiplies by a factor < 1. The executor filters out score <= 0.0 candidates before returning. For M2, `Exclude` variants (`signal("hide")`, `relationship("blocked")`) are type stubs -- the actual exclusion logic requires user state from M3. The executor skips Exclude rules when no user context is available.
4. **Profile versioning**: Version is a `u32` monotonic counter per profile name. `ProfileRegistry` keeps all versions per name (INV-PROF-1 requires monotonic increase). `get()` returns latest version. `get_versioned(name, version)` returns a specific version. For M2, no version pruning. Pruning deferred to M5+ for A/B testing lifecycle management.
5. **Built-in profiles with unavailable signals**: At M2, the schema may not define all signals that a built-in profile references (e.g., `trending` requires `share` velocity, but the UAT schema might only have `view` and `like`). Built-in profiles must be resilient: if a referenced signal type is not in the schema, that boost/penalty is silently skipped (contributes 0.0). A `tracing::warn!` is emitted at registration time listing missing signals. The profile is still registered and usable -- it just scores with fewer signals.
6. **Sort mode vs boost/penalty pipeline**: When a profile has a `sort` override (e.g., `Sort::Hot`), the sort formula replaces stages 4-5 (boost and penalty application) of the scoring pipeline (Spec 09 Section 11.9). Gates, exclusions, normalization, and diversity still apply. The executor must check for a sort override before running the boost/penalty loop.
7. **Candidate strategy as type stub**: `CandidateStrategy` variants are defined as types in Task 01 but not executed in m2p3. The executor receives a pre-generated `&[EntityId]` candidate set. Candidate generation is the responsibility of the RETRIEVE executor (m2p5), which calls the appropriate retrieval strategy (ANN, scan, etc.) and passes the results to the profile executor for scoring. The `CandidateStrategy` on the profile is informational -- it tells the RETRIEVE executor how to generate candidates, but the profile executor itself does not generate candidates.

View File

@ -0,0 +1,986 @@
# Task 01: Ranking Profile Type System
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p3 -- Ranking Profile Engine
**Depends On:** None (uses types from m1p1 but no m2p3 tasks)
**Blocks:** Task 02 (Built-in Profiles), Task 03 (Profile Executor + Benchmarks)
**Complexity:** L
## Objective
Deliver the data types that represent ranking profiles as runtime data, not compiled code. A `RankingProfile` is a named, versioned scoring function declared in schema and stored in a `ProfileRegistry`. It specifies how candidates are scored via `ScoringRule` variants (boosts, gates, penalties, excludes), how they are retrieved via `CandidateStrategy`, and how they are ordered via an optional `Sort` override. The entire profile is serializable, validatable, and swappable at query time without recompilation.
This task builds the type foundation for the entire ranking subsystem. Every subsequent task in m2p3 (built-in profiles, executor, benchmarks) depends on these types. The design must match Spec 09 Sections 2-8 and 11 exactly, while remaining pragmatic about what M2 actually executes vs what is declared as a type stub for future milestones.
## Requirements
- `RankingProfile` struct with all fields from Spec 09 Section 2.1 (`ProfileDef`)
- `ScoringRule` enum covering Boost, Gate, Penalty, Exclude from Spec 09 Sections 5-7
- `Sort` enum with all formula variants from Spec 09 Section 11
- `CandidateStrategy` enum with all variants from Spec 09 Section 3
- `SignalAgg` enum for signal aggregation modes from Spec 09 Section 5.1
- `ProfileRegistry` with name-to-versioned-profile mapping, registration, lookup
- Validation at registration time: duplicate version rejection, signal reference checking, gate threshold range, inheritance cycle detection
- Serde `Serialize`/`Deserialize` on all profile types for checkpoint persistence
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/ranking/
profile.rs -- RankingProfile, ScoringRule, Sort, CandidateStrategy, SignalAgg,
Boost, Gate, Penalty, Exclude, ProfileDecay, DiversitySpec, validation
registry.rs -- ProfileRegistry, ProfileError
mod.rs -- pub use re-exports
```
### Public API
```rust
// === ranking/profile.rs ===
use serde::{Deserialize, Serialize};
use crate::schema::{EntityKind, Window};
/// A named, versioned ranking profile declared as runtime data.
///
/// Profiles are schema-level declarations stored in the database.
/// A profile change never requires recompilation or redeployment.
/// The query executor resolves a profile by name, loads it from the
/// registry, and executes the scoring pipeline it defines.
///
/// # Spec Reference
///
/// Spec 09 Section 2.1 (ProfileDef structure).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RankingProfile {
/// Unique profile name. Lowercase alphanumeric plus underscores.
name: String,
/// Monotonically increasing version number.
version: u32,
/// How candidates are generated (Spec 09 Section 3).
/// Informational for the RETRIEVE executor -- the profile executor
/// itself receives pre-generated candidates.
candidate_strategy: CandidateStrategy,
/// Positive signal boosts added to candidate scores (Spec 09 Section 5).
boosts: Vec<Boost>,
/// Content age decay applied multiplicatively to all scores (Spec 09 Section 5.4).
decay: Option<ProfileDecay>,
/// Quality gates -- hard thresholds that exclude candidates (Spec 09 Section 7).
gates: Vec<Gate>,
/// Negative signal penalties subtracted from scores (Spec 09 Section 6).
penalties: Vec<Penalty>,
/// Hard exclusions -- items matching these are removed before scoring.
excludes: Vec<Exclude>,
/// Post-scoring diversity constraints (Spec 09 Section 9).
/// Stored on the profile but executed by the diversity engine (m2p4).
diversity: Option<DiversitySpec>,
/// Fraction of results reserved for exploration (Spec 09 Section 10).
/// Range: 0.0 to 0.5. Default: 0.0 (no exploration).
exploration: f64,
/// Optional sort mode override. When set, bypasses the boost/penalty
/// scoring pipeline and uses a formula-based sort (Spec 09 Section 11).
sort: Option<Sort>,
/// Whether this is a built-in profile (registered by the engine,
/// not the application). Built-in profiles can be overridden by
/// application profiles with the same name.
is_builtin: bool,
}
impl RankingProfile {
/// Construct a new ranking profile.
///
/// Validation is performed at `ProfileRegistry::register()` time,
/// not at construction. This allows building profiles incrementally.
pub fn new(name: impl Into<String>, version: u32) -> Self {
Self {
name: name.into(),
version,
candidate_strategy: CandidateStrategy::Scan {
entity: EntityKind::Item,
},
boosts: Vec::new(),
decay: None,
gates: Vec::new(),
penalties: Vec::new(),
excludes: Vec::new(),
diversity: None,
exploration: 0.0,
sort: None,
is_builtin: false,
}
}
// Builder methods (all return &mut Self for chaining):
pub fn with_candidate_strategy(&mut self, strategy: CandidateStrategy) -> &mut Self;
pub fn with_boost(&mut self, boost: Boost) -> &mut Self;
pub fn with_boosts(&mut self, boosts: Vec<Boost>) -> &mut Self;
pub fn with_decay(&mut self, decay: ProfileDecay) -> &mut Self;
pub fn with_gate(&mut self, gate: Gate) -> &mut Self;
pub fn with_gates(&mut self, gates: Vec<Gate>) -> &mut Self;
pub fn with_penalty(&mut self, penalty: Penalty) -> &mut Self;
pub fn with_penalties(&mut self, penalties: Vec<Penalty>) -> &mut Self;
pub fn with_exclude(&mut self, exclude: Exclude) -> &mut Self;
pub fn with_excludes(&mut self, excludes: Vec<Exclude>) -> &mut Self;
pub fn with_diversity(&mut self, diversity: DiversitySpec) -> &mut Self;
pub fn with_exploration(&mut self, budget: f64) -> &mut Self;
pub fn with_sort(&mut self, sort: Sort) -> &mut Self;
pub fn set_builtin(&mut self, builtin: bool) -> &mut Self;
// Getters:
pub fn name(&self) -> &str;
pub fn version(&self) -> u32;
pub fn candidate_strategy(&self) -> &CandidateStrategy;
pub fn boosts(&self) -> &[Boost];
pub fn decay(&self) -> Option<&ProfileDecay>;
pub fn gates(&self) -> &[Gate];
pub fn penalties(&self) -> &[Penalty];
pub fn excludes(&self) -> &[Exclude];
pub fn diversity(&self) -> Option<&DiversitySpec>;
pub fn exploration(&self) -> f64;
pub fn sort(&self) -> Option<&Sort>;
pub fn is_builtin(&self) -> bool;
/// Returns true if this profile uses a sort mode override
/// (bypasses boost/penalty pipeline).
pub fn has_sort_override(&self) -> bool;
/// Returns all signal names referenced by this profile's boosts,
/// penalties, and gates. Used for dependency validation and
/// signal snapshot construction.
pub fn referenced_signals(&self) -> Vec<&str>;
}
/// How candidates are generated. Spec 09 Section 3.
///
/// The profile executor does not execute candidate strategies --
/// the RETRIEVE executor (m2p5) does. These are informational:
/// they tell the RETRIEVE executor which retrieval path to use.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum CandidateStrategy {
/// Vector similarity search over embeddings. Spec 09 Section 3.1.
Ann {
query_source: VectorSource,
entity: EntityKind,
top_k: u32,
},
/// Full entity scan with signal-based ranking. Spec 09 Section 3.2.
Scan {
entity: EntityKind,
},
/// Text + vector hybrid fusion. Spec 09 Section 3.3.
Hybrid {
text_weight: f64,
vector_weight: f64,
fusion: FusionStrategy,
},
/// Graph traversal via relationship edges. Spec 09 Section 3.4.
Relationship {
edge: String,
},
/// Cohort-scoped trending. Spec 09 Section 3.5.
CohortTrending {
window: Window,
top_k: u32,
},
}
/// Source of the query vector for ANN candidate generation.
/// Spec 09 Section 3.1 VectorSource variants.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum VectorSource {
/// The querying user's preference vector.
UserPreference,
/// A specific item's embedding.
ItemEmbedding(String),
/// The query vector passed inline (for SEARCH).
QueryEmbedding,
/// A creator's catalog embedding.
CreatorEmbedding(String),
}
/// Fusion strategy for hybrid text + vector search.
/// Spec 09 Section 3.3.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum FusionStrategy {
/// Reciprocal Rank Fusion with configurable k.
Rrf { k: u32 },
/// Weighted linear combination (requires relevance labels).
Linear { alpha: f64 },
}
/// A positive signal boost. Spec 09 Section 5.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Boost {
/// The signal name (e.g., "view", "like", "share").
pub signal: String,
/// Time window for aggregation.
pub window: Window,
/// How to aggregate the signal within the window.
pub aggregation: SignalAgg,
/// Contribution weight. Typically 0.0 to 1.0.
pub weight: f64,
}
impl Boost {
pub fn new(signal: impl Into<String>, window: Window, aggregation: SignalAgg, weight: f64) -> Self {
Self {
signal: signal.into(),
window,
aggregation,
weight,
}
}
}
/// How a signal value is aggregated within a window.
/// Spec 09 Section 5.1 SignalAgg variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum SignalAgg {
/// Raw aggregate value (count or weighted sum) in the window.
Value,
/// Rate of change within the window (events per hour).
Velocity,
/// Signal value divided by view count (engagement ratio).
Ratio,
/// Running exponential decay score from hot tier.
DecayScore,
/// Short-window velocity / long-window velocity.
RelativeVelocity,
}
/// Content age decay applied multiplicatively to scores.
/// Spec 09 Section 5.4.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ProfileDecay {
/// Metadata field name containing the timestamp (typically "created_at").
pub field: String,
/// Half-life for exponential age decay.
pub half_life_secs: f64,
}
impl ProfileDecay {
pub fn new(field: impl Into<String>, half_life: std::time::Duration) -> Self {
Self {
field: field.into(),
half_life_secs: half_life.as_secs_f64(),
}
}
}
/// A quality gate -- hard threshold that excludes candidates.
/// Spec 09 Section 7.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Gate {
/// Items where signal value in window < threshold are excluded.
/// Spec 09 Section 7.1.
MinSignal {
signal: String,
window: Window,
threshold: f64,
},
/// Items where a named ratio < threshold are excluded.
/// Spec 09 Section 7.2.
MinRatio {
ratio_name: String,
threshold: f64,
},
/// Items with fewer than count events are excluded.
/// Spec 09 Section 7.3.
MinCount {
signal: String,
window: Window,
count: u64,
},
}
impl Gate {
pub fn min_signal(signal: impl Into<String>, window: Window, threshold: f64) -> Self {
Self::MinSignal {
signal: signal.into(),
window,
threshold,
}
}
pub fn min_ratio(ratio_name: impl Into<String>, threshold: f64) -> Self {
Self::MinRatio {
ratio_name: ratio_name.into(),
threshold,
}
}
pub fn min_count(signal: impl Into<String>, window: Window, count: u64) -> Self {
Self::MinCount {
signal: signal.into(),
window,
count,
}
}
/// Returns the signal name referenced by this gate, if any.
pub fn signal_name(&self) -> Option<&str> {
match self {
Self::MinSignal { signal, .. } | Self::MinCount { signal, .. } => Some(signal),
Self::MinRatio { .. } => None,
}
}
/// Returns the threshold value for display/debugging.
pub fn threshold_display(&self) -> String;
}
/// A negative signal penalty. Spec 09 Section 6.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Penalty {
/// The signal name (e.g., "skip", "dislike", "downvote").
pub signal: String,
/// Time window for aggregation.
pub window: Window,
/// Penalty weight (stored as positive; applied as negative during scoring).
pub weight: f64,
}
impl Penalty {
pub fn new(signal: impl Into<String>, window: Window, weight: f64) -> Self {
Self {
signal: signal.into(),
window,
weight,
}
}
}
/// Hard exclusion -- items matching these are removed before scoring.
/// Spec 09 Section 4, Stage 2.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Exclude {
/// Exclude items where the user has the named signal (e.g., "hide").
Signal(String),
/// Exclude items by creators with this relationship (e.g., "blocked").
Relationship(String),
}
/// Post-scoring diversity constraints. Spec 09 Section 9.
/// Stored on the profile but executed by the diversity engine (m2p4).
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct DiversitySpec {
/// Maximum items from the same creator in the result set.
pub max_per_creator: Option<u32>,
/// Ensure variety of content formats.
pub format_mix: bool,
/// Topic diversity score [0.0, 1.0]. MMR lambda.
pub topic_diversity: Option<f64>,
/// Minimum items per represented category.
pub category_min: Option<u32>,
}
/// Built-in sort mode formulas. Spec 09 Section 11.
///
/// When a profile has a sort override, the sort formula replaces
/// the boost/penalty scoring stages (stages 4-5). Gates, exclusions,
/// normalization, and diversity still apply.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Sort {
/// `log10(max(|positive - negative|, 1)) / (age_hours + 2)^gravity`
/// Spec 09 Section 11.1. Default gravity: 1.8.
Hot { gravity: f64 },
/// `share_velocity(6h) * 0.5 + view_velocity(6h) * 0.3 + new_user_reach(24h) * 0.2`
/// Spec 09 Section 11.2.
Trending,
/// `relative_velocity * age_boost`. Spec 09 Section 11.3.
Rising,
/// `(positive * negative) / (positive + negative)^2`
/// Spec 09 Section 11.4.
Controversial,
/// `quality_score * (1 / log10(view_count + 10))`
/// Spec 09 Section 11.5.
HiddenGems,
/// `random(seed) * quality_weight`. Deterministic per session.
/// Spec 09 Section 11.6.
Shuffle,
/// `created_at DESC`. Pure chronological, no scoring.
/// Spec 09 Section 11.8.
New,
/// Windowed quality score sum. Spec 09 Section 11.7.
TopWindow { window: Window },
/// `view.count(window) DESC`. Raw popularity.
MostViewed { window: Window },
/// `like.count(window) DESC`. Positive sentiment.
MostLiked { window: Window },
}
```
```rust
// === ranking/registry.rs ===
use std::collections::HashMap;
use super::profile::RankingProfile;
/// Error type for profile registry operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProfileError {
/// Profile name contains invalid characters (not lowercase alphanumeric + underscore).
InvalidName(String),
/// Version is not monotonically increasing (INV-PROF-1).
VersionConflict {
name: String,
existing_version: u32,
attempted_version: u32,
},
/// Exploration budget out of range [0.0, 0.5].
ExplorationOutOfRange(f64),
/// Gate threshold out of range [0.0, 1.0].
GateThresholdOutOfRange {
gate_description: String,
threshold: f64,
},
/// Profile not found by name.
NotFound(String),
/// Profile not found by name + version.
VersionNotFound { name: String, version: u32 },
}
impl std::fmt::Display for ProfileError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result;
}
impl std::error::Error for ProfileError {}
/// Registry mapping profile names to versioned `RankingProfile` instances.
///
/// Profiles are loaded into memory at startup and cached for the
/// lifetime of the database instance. The registry enforces version
/// monotonicity (INV-PROF-1) and validates profile structure at
/// registration time.
///
/// # Spec Reference
///
/// Spec 09 Section 2.2 (version semantics), Section 16 (INV-PROF-1, INV-PROF-2, INV-PROF-3).
pub struct ProfileRegistry {
/// Map from profile name to list of versions (sorted ascending by version number).
profiles: HashMap<String, Vec<RankingProfile>>,
}
impl ProfileRegistry {
/// Create an empty registry.
pub fn new() -> Self;
/// Register a profile. Validates structure and enforces version monotonicity.
///
/// If a profile with this name already exists, the new version must be
/// strictly greater than the existing latest version (INV-PROF-1).
///
/// Returns `ProfileError::VersionConflict` if the version is not
/// monotonically increasing.
pub fn register(&mut self, profile: RankingProfile) -> Result<(), ProfileError>;
/// Get the latest version of a profile by name.
pub fn get(&self, name: &str) -> Result<&RankingProfile, ProfileError>;
/// Get a specific version of a profile.
pub fn get_versioned(&self, name: &str, version: u32) -> Result<&RankingProfile, ProfileError>;
/// List all profile names.
pub fn list_names(&self) -> Vec<&str>;
/// List all profiles (latest version of each).
pub fn list_latest(&self) -> Vec<&RankingProfile>;
/// Check if a profile exists by name.
pub fn contains(&self, name: &str) -> bool;
/// Number of distinct profile names.
pub fn len(&self) -> usize;
/// Whether the registry is empty.
pub fn is_empty(&self) -> bool;
/// Remove a profile by name (all versions). Returns true if the profile existed.
/// Used when an application overrides a built-in profile.
pub fn remove(&mut self, name: &str) -> bool;
}
impl Default for ProfileRegistry {
fn default() -> Self {
Self::new()
}
}
```
```rust
// === ranking/mod.rs ===
pub mod profile;
pub mod registry;
pub use profile::{
Boost, CandidateStrategy, DiversitySpec, Exclude, FusionStrategy, Gate,
Penalty, ProfileDecay, RankingProfile, SignalAgg, Sort, VectorSource,
};
pub use registry::{ProfileError, ProfileRegistry};
```
### Validation Rules
Validation is performed in `ProfileRegistry::register()`:
1. **Name format:** Lowercase alphanumeric plus underscores. Regex: `^[a-z][a-z0-9_]*$`. Minimum 1 character, maximum 64 characters.
2. **Version monotonicity (INV-PROF-1):** If a profile with this name already exists, `new_version > existing_latest_version`. Returns `ProfileError::VersionConflict` on violation.
3. **Exploration range:** `exploration` must be in `[0.0, 0.5]`. Returns `ProfileError::ExplorationOutOfRange` if outside.
4. **Gate threshold range:** For `Gate::MinSignal` and `Gate::MinRatio`, `threshold` must be in `[0.0, 1.0]`. For `Gate::MinCount`, `count` must be > 0. Returns `ProfileError::GateThresholdOutOfRange` if outside.
5. **Weight sign:** Boost weights must be non-negative. Penalty weights must be non-negative (they are applied as negative during scoring). A `tracing::warn!` is emitted if any weight > 5.0 (likely a misconfiguration, but not rejected).
6. **Signal reference validation:** Deferred to Task 02 where signal dependencies are checked against the schema. Task 01 validation does not require a schema reference.
### Serde Strategy
All profile types derive `Serialize` and `Deserialize`. The format is JSON for human readability during development. The schema checkpoint serializes profiles alongside signal definitions and entity definitions.
`Window` already has `Serialize`/`Deserialize` from m1p1 (or needs it added). If not present, add `#[derive(Serialize, Deserialize)]` to `Window` in `schema/signal.rs`. Similarly for `EntityKind`.
### Error Handling
- `ProfileError` is a non-exhaustive enum with `Display` and `Error` impls.
- `ProfileError` integrates with `LumenError` via a `From<ProfileError>` impl that maps to `LumenError::Schema(SchemaError::ProfileError(...))` or a new `LumenError::Profile(ProfileError)` variant -- choose whichever maintains the existing error hierarchy.
- Validation errors are returned from `register()`, not from `RankingProfile::new()`. Construction is infallible to support builder patterns.
## Test Strategy
### Unit Tests
```rust
#[test]
fn profile_construction_defaults() {
let profile = RankingProfile::new("test", 1);
assert_eq!(profile.name(), "test");
assert_eq!(profile.version(), 1);
assert!(profile.boosts().is_empty());
assert!(profile.gates().is_empty());
assert!(profile.penalties().is_empty());
assert!(profile.excludes().is_empty());
assert!(profile.decay().is_none());
assert!(profile.diversity().is_none());
assert!((profile.exploration() - 0.0).abs() < f64::EPSILON);
assert!(profile.sort().is_none());
assert!(!profile.is_builtin());
assert!(!profile.has_sort_override());
}
#[test]
fn profile_builder_chaining() {
let mut profile = RankingProfile::new("trending", 1);
profile
.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_boost(Boost::new("share", Window::OneHour, SignalAgg::Velocity, 0.5))
.with_boost(Boost::new("view", Window::OneHour, SignalAgg::Velocity, 0.3))
.with_gate(Gate::min_ratio("engagement_ratio", 0.03))
.with_diversity(DiversitySpec {
max_per_creator: Some(1),
..Default::default()
})
.with_sort(Sort::Trending);
assert_eq!(profile.boosts().len(), 2);
assert_eq!(profile.gates().len(), 1);
assert!(profile.has_sort_override());
}
#[test]
fn profile_referenced_signals() {
let mut profile = RankingProfile::new("test", 1);
profile
.with_boost(Boost::new("view", Window::TwentyFourHours, SignalAgg::Velocity, 0.3))
.with_boost(Boost::new("share", Window::OneHour, SignalAgg::Velocity, 0.5))
.with_penalty(Penalty::new("skip", Window::TwentyFourHours, 0.5))
.with_gate(Gate::min_signal("completion", Window::AllTime, 0.3));
let refs = profile.referenced_signals();
assert!(refs.contains(&"view"));
assert!(refs.contains(&"share"));
assert!(refs.contains(&"skip"));
assert!(refs.contains(&"completion"));
}
#[test]
fn registry_register_and_get() {
let mut registry = ProfileRegistry::new();
let profile = RankingProfile::new("trending", 1);
registry.register(profile).unwrap();
let retrieved = registry.get("trending").unwrap();
assert_eq!(retrieved.name(), "trending");
assert_eq!(retrieved.version(), 1);
}
#[test]
fn registry_version_monotonicity() {
let mut registry = ProfileRegistry::new();
registry.register(RankingProfile::new("test", 1)).unwrap();
registry.register(RankingProfile::new("test", 2)).unwrap();
// Version 1 again -- should fail
let err = registry.register(RankingProfile::new("test", 1)).unwrap_err();
assert!(matches!(err, ProfileError::VersionConflict { .. }));
// Same version as latest -- should fail
let err = registry.register(RankingProfile::new("test", 2)).unwrap_err();
assert!(matches!(err, ProfileError::VersionConflict { .. }));
// Higher version -- should succeed
registry.register(RankingProfile::new("test", 3)).unwrap();
}
#[test]
fn registry_get_latest_returns_highest_version() {
let mut registry = ProfileRegistry::new();
registry.register(RankingProfile::new("test", 1)).unwrap();
registry.register(RankingProfile::new("test", 2)).unwrap();
registry.register(RankingProfile::new("test", 5)).unwrap();
let latest = registry.get("test").unwrap();
assert_eq!(latest.version(), 5);
}
#[test]
fn registry_get_versioned() {
let mut registry = ProfileRegistry::new();
registry.register(RankingProfile::new("test", 1)).unwrap();
registry.register(RankingProfile::new("test", 3)).unwrap();
let v1 = registry.get_versioned("test", 1).unwrap();
assert_eq!(v1.version(), 1);
let v3 = registry.get_versioned("test", 3).unwrap();
assert_eq!(v3.version(), 3);
let err = registry.get_versioned("test", 2).unwrap_err();
assert!(matches!(err, ProfileError::VersionNotFound { .. }));
}
#[test]
fn registry_not_found() {
let registry = ProfileRegistry::new();
let err = registry.get("nonexistent").unwrap_err();
assert!(matches!(err, ProfileError::NotFound(_)));
}
#[test]
fn registry_list_names() {
let mut registry = ProfileRegistry::new();
registry.register(RankingProfile::new("alpha", 1)).unwrap();
registry.register(RankingProfile::new("beta", 1)).unwrap();
let names = registry.list_names();
assert_eq!(names.len(), 2);
assert!(names.contains(&"alpha"));
assert!(names.contains(&"beta"));
}
#[test]
fn registry_contains() {
let mut registry = ProfileRegistry::new();
registry.register(RankingProfile::new("test", 1)).unwrap();
assert!(registry.contains("test"));
assert!(!registry.contains("other"));
}
#[test]
fn registry_remove() {
let mut registry = ProfileRegistry::new();
registry.register(RankingProfile::new("test", 1)).unwrap();
assert!(registry.remove("test"));
assert!(!registry.contains("test"));
assert!(!registry.remove("test")); // already removed
}
#[test]
fn validation_invalid_name() {
let mut registry = ProfileRegistry::new();
// Uppercase
let err = registry.register(RankingProfile::new("Test", 1)).unwrap_err();
assert!(matches!(err, ProfileError::InvalidName(_)));
// Starts with number
let err = registry.register(RankingProfile::new("1test", 1)).unwrap_err();
assert!(matches!(err, ProfileError::InvalidName(_)));
// Contains hyphen
let err = registry.register(RankingProfile::new("my-profile", 1)).unwrap_err();
assert!(matches!(err, ProfileError::InvalidName(_)));
// Empty
let err = registry.register(RankingProfile::new("", 1)).unwrap_err();
assert!(matches!(err, ProfileError::InvalidName(_)));
}
#[test]
fn validation_exploration_range() {
let mut registry = ProfileRegistry::new();
let mut profile = RankingProfile::new("test", 1);
profile.with_exploration(0.6); // > 0.5
let err = registry.register(profile).unwrap_err();
assert!(matches!(err, ProfileError::ExplorationOutOfRange(_)));
let mut profile = RankingProfile::new("test", 1);
profile.with_exploration(-0.1); // < 0.0
let err = registry.register(profile).unwrap_err();
assert!(matches!(err, ProfileError::ExplorationOutOfRange(_)));
// Edge cases: 0.0 and 0.5 are valid
let mut profile = RankingProfile::new("test_zero", 1);
profile.with_exploration(0.0);
registry.register(profile).unwrap();
let mut profile = RankingProfile::new("test_half", 1);
profile.with_exploration(0.5);
registry.register(profile).unwrap();
}
#[test]
fn validation_gate_threshold_range() {
let mut registry = ProfileRegistry::new();
let mut profile = RankingProfile::new("test", 1);
profile.with_gate(Gate::min_signal("view", Window::AllTime, 1.5)); // > 1.0
let err = registry.register(profile).unwrap_err();
assert!(matches!(err, ProfileError::GateThresholdOutOfRange { .. }));
let mut profile = RankingProfile::new("test", 1);
profile.with_gate(Gate::min_ratio("engagement_ratio", -0.1)); // < 0.0
let err = registry.register(profile).unwrap_err();
assert!(matches!(err, ProfileError::GateThresholdOutOfRange { .. }));
}
#[test]
fn serde_roundtrip() {
let mut profile = RankingProfile::new("trending", 1);
profile
.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_boost(Boost::new("share", Window::OneHour, SignalAgg::Velocity, 0.5))
.with_gate(Gate::min_ratio("engagement_ratio", 0.03))
.with_penalty(Penalty::new("skip", Window::TwentyFourHours, 0.5))
.with_diversity(DiversitySpec {
max_per_creator: Some(1),
format_mix: true,
..Default::default()
})
.with_sort(Sort::Trending);
let json = serde_json::to_string(&profile).unwrap();
let restored: RankingProfile = serde_json::from_str(&json).unwrap();
assert_eq!(restored.name(), profile.name());
assert_eq!(restored.version(), profile.version());
assert_eq!(restored.boosts().len(), profile.boosts().len());
assert_eq!(restored.gates().len(), profile.gates().len());
assert_eq!(restored.penalties().len(), profile.penalties().len());
assert!(restored.has_sort_override());
}
#[test]
fn sort_variants_serde_roundtrip() {
let sorts = vec![
Sort::Hot { gravity: 1.8 },
Sort::Trending,
Sort::Rising,
Sort::Controversial,
Sort::HiddenGems,
Sort::Shuffle,
Sort::New,
Sort::TopWindow { window: Window::SevenDays },
Sort::MostViewed { window: Window::AllTime },
Sort::MostLiked { window: Window::AllTime },
];
for sort in &sorts {
let json = serde_json::to_string(sort).unwrap();
let restored: Sort = serde_json::from_str(&json).unwrap();
// Verify roundtrip (Debug format comparison since Sort doesn't derive PartialEq)
assert_eq!(format!("{sort:?}"), format!("{restored:?}"));
}
}
#[test]
fn candidate_strategy_variants_serde_roundtrip() {
let strategies = vec![
CandidateStrategy::Scan { entity: EntityKind::Item },
CandidateStrategy::Ann {
query_source: VectorSource::UserPreference,
entity: EntityKind::Item,
top_k: 500,
},
CandidateStrategy::Hybrid {
text_weight: 0.6,
vector_weight: 0.4,
fusion: FusionStrategy::Rrf { k: 60 },
},
CandidateStrategy::Relationship { edge: "follows".into() },
CandidateStrategy::CohortTrending {
window: Window::TwentyFourHours,
top_k: 200,
},
];
for strategy in &strategies {
let json = serde_json::to_string(strategy).unwrap();
let _restored: CandidateStrategy = serde_json::from_str(&json).unwrap();
}
}
#[test]
fn diversity_spec_default() {
let spec = DiversitySpec::default();
assert!(spec.max_per_creator.is_none());
assert!(!spec.format_mix);
assert!(spec.topic_diversity.is_none());
assert!(spec.category_min.is_none());
}
#[test]
fn gate_signal_name() {
let gate = Gate::min_signal("completion", Window::AllTime, 0.3);
assert_eq!(gate.signal_name(), Some("completion"));
let gate = Gate::min_count("view", Window::AllTime, 100);
assert_eq!(gate.signal_name(), Some("view"));
let gate = Gate::min_ratio("engagement_ratio", 0.03);
assert_eq!(gate.signal_name(), None);
}
```
### Property Tests
```rust
use proptest::prelude::*;
// P1: Version monotonicity is enforced.
proptest! {
#[test]
fn version_monotonicity_enforced(
versions in prop::collection::vec(1u32..1000, 2..10),
) {
let mut registry = ProfileRegistry::new();
let mut max_registered = 0u32;
for v in versions {
let result = registry.register(RankingProfile::new("test", v));
if v > max_registered {
prop_assert!(result.is_ok(),
"version {} should be accepted (max was {})", v, max_registered);
max_registered = v;
} else {
prop_assert!(result.is_err(),
"version {} should be rejected (max is {})", v, max_registered);
}
}
}
}
// P2: Serde roundtrip preserves all fields.
proptest! {
#[test]
fn serde_roundtrip_preserves_name_and_version(
name in "[a-z][a-z0-9_]{0,15}",
version in 1u32..1000,
) {
let profile = RankingProfile::new(&name, version);
let json = serde_json::to_string(&profile).unwrap();
let restored: RankingProfile = serde_json::from_str(&json).unwrap();
prop_assert_eq!(restored.name(), name.as_str());
prop_assert_eq!(restored.version(), version);
}
}
// P3: Valid names are accepted, invalid names are rejected.
proptest! {
#[test]
fn valid_names_accepted(name in "[a-z][a-z0-9_]{0,15}") {
let mut registry = ProfileRegistry::new();
let result = registry.register(RankingProfile::new(&name, 1));
prop_assert!(result.is_ok(), "name '{}' should be valid", name);
}
}
```
## Acceptance Criteria
- [ ] `RankingProfile` struct with all fields: name, version, candidate_strategy, boosts, decay, gates, penalties, excludes, diversity, exploration, sort, is_builtin
- [ ] Builder methods on `RankingProfile` for fluent construction
- [ ] `referenced_signals()` returns all signal names used in boosts, penalties, and gates
- [ ] `has_sort_override()` returns true when `sort` is `Some`
- [ ] `CandidateStrategy` enum with 5 variants: Ann, Scan, Hybrid, Relationship, CohortTrending
- [ ] `VectorSource` enum with 4 variants: UserPreference, ItemEmbedding, QueryEmbedding, CreatorEmbedding
- [ ] `FusionStrategy` enum with 2 variants: Rrf, Linear
- [ ] `Boost` struct with signal, window, aggregation, weight
- [ ] `SignalAgg` enum with 5 variants: Value, Velocity, Ratio, DecayScore, RelativeVelocity
- [ ] `Gate` enum with 3 variants: MinSignal, MinRatio, MinCount
- [ ] `Penalty` struct with signal, window, weight
- [ ] `Exclude` enum with 2 variants: Signal, Relationship
- [ ] `DiversitySpec` struct with max_per_creator, format_mix, topic_diversity, category_min
- [ ] `ProfileDecay` struct with field, half_life_secs
- [ ] `Sort` enum with 10 variants: Hot, Trending, Rising, Controversial, HiddenGems, Shuffle, New, TopWindow, MostViewed, MostLiked
- [ ] `ProfileRegistry` with register, get, get_versioned, list_names, list_latest, contains, remove
- [ ] `ProfileError` enum with InvalidName, VersionConflict, ExplorationOutOfRange, GateThresholdOutOfRange, NotFound, VersionNotFound
- [ ] Version monotonicity enforced (INV-PROF-1): `register()` rejects version <= existing latest
- [ ] Name validation: lowercase alphanumeric + underscore, starts with letter, 1-64 chars
- [ ] Exploration range validation: [0.0, 0.5]
- [ ] Gate threshold range validation: [0.0, 1.0] for MinSignal and MinRatio
- [ ] All types derive `Serialize` + `Deserialize` (serde)
- [ ] Serde roundtrip (JSON) preserves all fields for all type variants
- [ ] `Serialize`/`Deserialize` added to `Window` and `EntityKind` in m1p1 types if not already present
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit tests and property tests pass
## Research References
- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- Signal type definitions that profiles reference (decay rates, windows, velocity)
## Spec References
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 2 (ProfileDef structure, versioning rules, inheritance), Section 3 (CandidateStrategy variants), Section 5 (Boost types, SignalAgg), Section 6 (Penalty types), Section 7 (Quality gates: MinSignal, MinRatio, MinCount), Section 8 (Score composition), Section 9 (DiversitySpec), Section 10 (Exploration budget), Section 11 (Sort mode formulas), Section 16 (INV-PROF-1 version monotonicity, INV-PROF-2 inheritance acyclicity, INV-PROF-3 signal reference validity)
## Implementation Notes
- Add `serde = { version = "1", features = ["derive"] }` and `serde_json = "1"` to `Cargo.toml` dependencies if not already present. `serde_json` is needed for checkpoint serialization and tests.
- `Window` and `EntityKind` in `schema/signal.rs` and `schema/entity.rs` need `#[derive(Serialize, Deserialize)]`. If this creates a dependency conflict (m1p1 does not depend on serde), add serde behind a feature flag `serde` or add it as a direct dependency. Given that profiles are a core data type that must persist, serde is a reasonable direct dependency.
- Profile inheritance (`extends` field from Spec 09 Section 2.3) is deferred to M5+. For M2, profiles are standalone -- no parent resolution. The `extends` field is not included in the `RankingProfile` struct. Add it when inheritance is needed.
- The `is_builtin` flag allows the registry to distinguish engine-registered profiles from application-registered profiles. When an application registers a profile with the same name as a built-in, the built-in is replaced (Spec 09 Section 13.13).
- Do NOT implement profile persistence to the storage engine in this task. Profiles are held in memory in the `ProfileRegistry`. Checkpoint persistence (serialize profiles to storage via `Tag::Meta`) is wired up in a later phase when the schema checkpoint is extended.
- Do NOT implement signal reference validation against the schema in this task. That validation requires access to the `Schema`'s signal definitions and is implemented in Task 02 where built-in profiles are registered with schema context.

View File

@ -0,0 +1,690 @@
# Task 02: Built-in Profiles
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p3 -- Ranking Profile Engine
**Depends On:** Task 01 (RankingProfile, ScoringRule, Sort, CandidateStrategy, ProfileRegistry types)
**Blocks:** Task 03 (Profile Executor + Benchmarks)
**Complexity:** M
## Objective
Deliver all 11 built-in ranking profiles as `RankingProfile` instances, registered into the `ProfileRegistry` at schema build time. Built-in profiles are not special-cased in the executor -- they go through the same execution pipeline as application-defined profiles. They are standard `RankingProfile` structs constructed from the types defined in Task 01.
Each built-in profile maps directly to a profile preset from Spec 09 Section 13. The profiles define which signals they require (e.g., `trending` requires `share` and `view` with velocity), and the registration logic validates signal availability against the schema. When a required signal is not present in the schema, the profile degrades gracefully: missing boosts/penalties contribute 0.0, and a `tracing::warn!` is emitted listing the missing signals. The profile is still registered and usable.
This task also delivers the signal dependency validation logic that connects profiles to the schema's signal definitions, closing the loop on INV-PROF-3 (signal reference validity).
## Requirements
- 11 built-in profiles defined as `RankingProfile` instances:
- `trending` -- Spec 09 Section 13.2
- `hot` -- Spec 09 Section 13.10
- `new` -- pure `created_at DESC`
- `top_week` -- quality score within 7d window
- `top_month` -- quality score within 30d window
- `top_all_time` -- all-time signal score
- `hidden_gems` -- Spec 09 Section 13.7
- `controversial` -- Spec 09 Section 13.12
- `most_viewed` -- windowed view count DESC
- `most_liked` -- windowed like count DESC
- `shuffle` -- quality-weighted random ordering
- Each built-in profile specifies its signal dependencies
- Signal dependency validation against the schema
- Graceful degradation for missing signals (skip, warn, not error)
- Built-in profiles registered with `is_builtin: true`
- Application profiles can override built-ins by registering with the same name
- `register_builtins()` function that populates a `ProfileRegistry`
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/ranking/
registry.rs -- ProfileRegistry (Task 01), register_builtins(), SignalDependency,
validate_signal_dependencies()
```
### Public API
```rust
// === ranking/registry.rs (additions to Task 01 registry) ===
use std::collections::HashSet;
use super::profile::*;
use crate::schema::SignalTypeDef;
/// Signal dependency for a profile. Describes what the profile needs
/// from the schema's signal definitions.
#[derive(Debug, Clone)]
pub struct SignalDependency {
/// Signal name (e.g., "view", "share", "like").
pub signal_name: String,
/// Whether velocity is required for this signal.
pub requires_velocity: bool,
/// Which windows are required.
pub required_windows: Vec<Window>,
}
/// Result of validating a profile's signal dependencies against the schema.
#[derive(Debug, Clone)]
pub struct DependencyValidation {
/// Signal names that are present in the schema and fully satisfy the profile.
pub satisfied: Vec<String>,
/// Signal names referenced by the profile but not found in the schema.
pub missing: Vec<String>,
/// Signal names present but lacking required velocity configuration.
pub missing_velocity: Vec<String>,
/// Signal names present but lacking required windows.
pub missing_windows: Vec<(String, Vec<Window>)>,
}
impl DependencyValidation {
/// True if all signal dependencies are fully satisfied.
pub fn is_fully_satisfied(&self) -> bool {
self.missing.is_empty()
&& self.missing_velocity.is_empty()
&& self.missing_windows.is_empty()
}
/// True if at least one signal dependency is satisfied.
/// The profile can operate in degraded mode.
pub fn is_partially_satisfied(&self) -> bool {
!self.satisfied.is_empty()
}
}
/// Validate a profile's signal dependencies against the schema's signal definitions.
///
/// Returns a `DependencyValidation` describing which signals are available,
/// missing, or partially available.
pub fn validate_signal_dependencies(
profile: &RankingProfile,
signal_defs: &[SignalTypeDef],
) -> DependencyValidation;
/// Register all built-in profiles into the registry.
///
/// Each built-in is validated against the provided signal definitions.
/// Profiles with missing signals are still registered but emit warnings.
/// Applications can override any built-in by calling `registry.register()`
/// with a profile of the same name (the built-in is replaced).
///
/// # Arguments
///
/// * `registry` -- The profile registry to populate.
/// * `signal_defs` -- Signal type definitions from the schema.
///
/// # Returns
///
/// A map of profile name to `DependencyValidation` for observability.
pub fn register_builtins(
registry: &mut ProfileRegistry,
signal_defs: &[SignalTypeDef],
) -> HashMap<String, DependencyValidation>;
```
### Built-in Profile Definitions
Each built-in is a function returning a `RankingProfile`:
```rust
/// trending: pure velocity, no personalization. Spec 09 Section 13.2.
///
/// Requires: share (velocity), view (velocity, unique_ratio)
/// Gate: engagement_ratio >= 0.03
fn builtin_trending() -> RankingProfile {
let mut p = RankingProfile::new("trending", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_boost(Boost::new("share", Window::OneHour, SignalAgg::Velocity, 0.5))
.with_boost(Boost::new("view", Window::OneHour, SignalAgg::Velocity, 0.3))
// UniqueRatio deferred to M6; use Value as placeholder for M2
.with_boost(Boost::new("view", Window::TwentyFourHours, SignalAgg::Value, 0.2))
.with_gate(Gate::min_ratio("engagement_ratio", 0.03))
.with_diversity(DiversitySpec {
max_per_creator: Some(1),
..Default::default()
})
.set_builtin(true);
p
}
/// hot: score / (age_hours + 2)^gravity. Spec 09 Section 13.10.
///
/// Requires: like, dislike (for positive/negative computation)
/// Sort formula replaces boost/penalty pipeline.
fn builtin_hot() -> RankingProfile {
let mut p = RankingProfile::new("hot", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_diversity(DiversitySpec {
max_per_creator: Some(2),
..Default::default()
})
.with_sort(Sort::Hot { gravity: 1.8 })
.set_builtin(true);
p
}
/// new: created_at DESC. Pure chronological, no scoring.
///
/// Requires: no signals (metadata sort only).
fn builtin_new() -> RankingProfile {
let mut p = RankingProfile::new("new", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::New)
.set_builtin(true);
p
}
/// top_week: quality score within 7d window. Spec 09 Section 11.7.
///
/// Requires: view, like, share, completion (windowed counts)
fn builtin_top_week() -> RankingProfile {
let mut p = RankingProfile::new("top_week", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::TopWindow { window: Window::SevenDays })
.with_diversity(DiversitySpec {
max_per_creator: Some(2),
..Default::default()
})
.set_builtin(true);
p
}
/// top_month: quality score within 30d window.
///
/// Requires: view, like, share, completion (windowed counts)
fn builtin_top_month() -> RankingProfile {
let mut p = RankingProfile::new("top_month", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::TopWindow { window: Window::ThirtyDays })
.with_diversity(DiversitySpec {
max_per_creator: Some(2),
..Default::default()
})
.set_builtin(true);
p
}
/// top_all_time: all-time signal score.
///
/// Requires: view, like, share, completion (all-time counts)
fn builtin_top_all_time() -> RankingProfile {
let mut p = RankingProfile::new("top_all_time", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::TopWindow { window: Window::AllTime })
.with_diversity(DiversitySpec {
max_per_creator: Some(2),
..Default::default()
})
.set_builtin(true);
p
}
/// hidden_gems: quality * inverse_reach. Spec 09 Section 13.7.
///
/// Requires: completion (all_time), like (all_time), view (all_time count)
/// Gate: completion_rate >= 0.5, view count >= 50
fn builtin_hidden_gems() -> RankingProfile {
let mut p = RankingProfile::new("hidden_gems", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_gate(Gate::min_signal("completion", Window::AllTime, 0.5))
.with_gate(Gate::min_count("view", Window::AllTime, 50))
.with_diversity(DiversitySpec {
max_per_creator: Some(1),
format_mix: true,
topic_diversity: Some(0.5),
..Default::default()
})
.with_sort(Sort::HiddenGems)
.set_builtin(true);
p
}
/// controversial: maximize positive * negative product. Spec 09 Section 13.12.
///
/// Requires: like (all_time count), dislike (all_time count)
/// Gate: like count >= 50 AND dislike count >= 50
fn builtin_controversial() -> RankingProfile {
let mut p = RankingProfile::new("controversial", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_gate(Gate::min_count("like", Window::AllTime, 50))
.with_gate(Gate::min_count("dislike", Window::AllTime, 50))
.with_diversity(DiversitySpec {
max_per_creator: Some(2),
..Default::default()
})
.with_sort(Sort::Controversial)
.set_builtin(true);
p
}
/// most_viewed: view count DESC within 7d window.
///
/// Requires: view (7d windowed count)
fn builtin_most_viewed() -> RankingProfile {
let mut p = RankingProfile::new("most_viewed", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::MostViewed { window: Window::SevenDays })
.set_builtin(true);
p
}
/// most_liked: like count DESC within all-time window.
///
/// Requires: like (all-time count)
fn builtin_most_liked() -> RankingProfile {
let mut p = RankingProfile::new("most_liked", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::MostLiked { window: Window::AllTime })
.set_builtin(true);
p
}
/// shuffle: quality-weighted random ordering. Spec 09 Section 11.6.
///
/// Requires: completion (all_time), like (all_time), view (all_time)
/// for quality_weight computation. Falls back to uniform random if
/// quality signals are unavailable.
fn builtin_shuffle() -> RankingProfile {
let mut p = RankingProfile::new("shuffle", 1);
p.with_candidate_strategy(CandidateStrategy::Scan {
entity: EntityKind::Item,
})
.with_sort(Sort::Shuffle)
.set_builtin(true);
p
}
```
### Signal Dependency Table
| Profile | Required Signals | Required Windows | Requires Velocity |
|---------|-----------------|------------------|-------------------|
| `trending` | share, view | 1h, 24h | Yes (share, view) |
| `hot` | like, dislike | all_time | No |
| `new` | (none) | (none) | No |
| `top_week` | view, like, share, completion | 7d | No |
| `top_month` | view, like, share, completion | 30d | No |
| `top_all_time` | view, like, share, completion | all_time | No |
| `hidden_gems` | completion, like, view | all_time | No |
| `controversial` | like, dislike | all_time | No |
| `most_viewed` | view | 7d | No |
| `most_liked` | like | all_time | No |
| `shuffle` | completion, like, view | all_time | No (quality weight fallback) |
### Degradation Strategy
When `register_builtins()` validates a profile against the schema's signal definitions:
1. **All signals present:** Profile registered as-is. No warnings.
2. **Some signals missing:** Profile registered with missing signals noted. `tracing::warn!("built-in profile '{}' missing signals: {:?}. These scoring rules will contribute 0.0", name, missing)`. The profile's `RankingProfile` struct is not modified -- the executor (Task 03) checks signal availability at scoring time and skips missing signals.
3. **All signals missing:** Profile still registered (it may use a sort formula that does not require signals, like `new` or `shuffle`). Warning emitted. If the sort formula also requires signals (like `hot` requires like/dislike counts), the executor returns 0.0 for all candidates, which produces an arbitrary but stable ordering.
4. **Signal present but missing velocity:** Warning emitted for boosts that use `SignalAgg::Velocity` on a signal without `velocity_enabled: true`. The executor falls back to `SignalAgg::Value` for that boost.
### Error Handling
- `register_builtins()` never fails. All built-in profiles are guaranteed to have valid names, versions, and structure. Signal dependency warnings are advisory, not errors.
- If a built-in profile name conflicts with an already-registered application profile, the application profile takes precedence. The built-in is skipped with a `tracing::info!` log.
## Test Strategy
### Unit Tests
```rust
#[test]
fn all_11_builtins_registered() {
let mut registry = ProfileRegistry::new();
let validations = register_builtins(&mut registry, &[]);
assert_eq!(registry.len(), 11);
let expected_names = [
"trending", "hot", "new", "top_week", "top_month", "top_all_time",
"hidden_gems", "controversial", "most_viewed", "most_liked", "shuffle",
];
for name in &expected_names {
assert!(registry.contains(name), "missing built-in profile: {}", name);
}
}
#[test]
fn builtins_are_flagged_builtin() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
for name in registry.list_names() {
let profile = registry.get(name).unwrap();
assert!(profile.is_builtin(),
"built-in profile '{}' should have is_builtin=true", name);
}
}
#[test]
fn builtins_have_version_1() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
for name in registry.list_names() {
let profile = registry.get(name).unwrap();
assert_eq!(profile.version(), 1,
"built-in profile '{}' should have version 1", name);
}
}
#[test]
fn hot_profile_has_correct_gravity() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
let hot = registry.get("hot").unwrap();
match hot.sort() {
Some(Sort::Hot { gravity }) => {
assert!((gravity - 1.8).abs() < f64::EPSILON,
"hot gravity should be 1.8, got {}", gravity);
}
other => panic!("hot profile should have Sort::Hot, got {:?}", other),
}
}
#[test]
fn trending_profile_has_velocity_boosts() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
let trending = registry.get("trending").unwrap();
assert!(!trending.boosts().is_empty(), "trending should have boosts");
let share_boost = trending.boosts().iter()
.find(|b| b.signal == "share")
.expect("trending should boost share");
assert_eq!(share_boost.aggregation, SignalAgg::Velocity);
assert!((share_boost.weight - 0.5).abs() < f64::EPSILON);
}
#[test]
fn new_profile_has_no_boosts_or_signals() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
let new = registry.get("new").unwrap();
assert!(new.boosts().is_empty());
assert!(new.penalties().is_empty());
assert!(new.gates().is_empty());
assert!(matches!(new.sort(), Some(Sort::New)));
}
#[test]
fn hidden_gems_has_quality_gates() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
let hg = registry.get("hidden_gems").unwrap();
assert_eq!(hg.gates().len(), 2, "hidden_gems should have 2 gates");
let has_completion_gate = hg.gates().iter().any(|g| {
matches!(g, Gate::MinSignal { signal, threshold, .. }
if signal == "completion" && (*threshold - 0.5).abs() < f64::EPSILON)
});
assert!(has_completion_gate, "hidden_gems should gate on completion >= 0.5");
let has_view_gate = hg.gates().iter().any(|g| {
matches!(g, Gate::MinCount { signal, count, .. }
if signal == "view" && *count == 50)
});
assert!(has_view_gate, "hidden_gems should gate on view count >= 50");
}
#[test]
fn controversial_gates_on_like_and_dislike() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
let c = registry.get("controversial").unwrap();
assert_eq!(c.gates().len(), 2);
let has_like_gate = c.gates().iter().any(|g| {
matches!(g, Gate::MinCount { signal, count, .. }
if signal == "like" && *count == 50)
});
assert!(has_like_gate, "controversial should gate on like count >= 50");
let has_dislike_gate = c.gates().iter().any(|g| {
matches!(g, Gate::MinCount { signal, count, .. }
if signal == "dislike" && *count == 50)
});
assert!(has_dislike_gate, "controversial should gate on dislike count >= 50");
}
#[test]
fn all_scan_profiles_use_scan_strategy() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
// All M2 built-ins use Scan (ANN, Hybrid, Relationship are M3+)
for name in registry.list_names() {
let profile = registry.get(name).unwrap();
assert!(
matches!(profile.candidate_strategy(), CandidateStrategy::Scan { .. }),
"built-in '{}' should use Scan strategy for M2", name
);
}
}
#[test]
fn dependency_validation_all_satisfied() {
let signal_defs = vec![
make_signal_def("view", true, &[Window::OneHour, Window::TwentyFourHours, Window::SevenDays, Window::AllTime]),
make_signal_def("like", false, &[Window::AllTime]),
make_signal_def("share", true, &[Window::OneHour]),
make_signal_def("completion", false, &[Window::AllTime]),
make_signal_def("dislike", false, &[Window::AllTime]),
];
let mut registry = ProfileRegistry::new();
let validations = register_builtins(&mut registry, &signal_defs);
let trending_v = &validations["trending"];
assert!(trending_v.is_partially_satisfied());
// share and view should be satisfied
assert!(trending_v.satisfied.contains(&"share".to_string()));
assert!(trending_v.satisfied.contains(&"view".to_string()));
}
#[test]
fn dependency_validation_missing_signals() {
// Schema only has "view" -- "share" is missing for trending
let signal_defs = vec![
make_signal_def("view", true, &[Window::OneHour, Window::TwentyFourHours, Window::SevenDays, Window::AllTime]),
];
let mut registry = ProfileRegistry::new();
let validations = register_builtins(&mut registry, &signal_defs);
let trending_v = &validations["trending"];
assert!(trending_v.is_partially_satisfied());
assert!(trending_v.missing.contains(&"share".to_string()));
// Profile should still be registered
assert!(registry.contains("trending"));
}
#[test]
fn dependency_validation_no_signals_at_all() {
let mut registry = ProfileRegistry::new();
let validations = register_builtins(&mut registry, &[]);
// All profiles still registered
assert_eq!(registry.len(), 11);
// "new" should have no missing signals (it uses no signals)
let new_v = &validations["new"];
assert!(new_v.missing.is_empty());
// "trending" should have all signals missing
let trending_v = &validations["trending"];
assert!(!trending_v.missing.is_empty());
}
#[test]
fn application_override_replaces_builtin() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
// Override "trending" with a custom profile
let mut custom = RankingProfile::new("trending", 1);
custom.with_boost(Boost::new("view", Window::OneHour, SignalAgg::Velocity, 1.0));
// Remove the built-in first, then register custom
registry.remove("trending");
registry.register(custom).unwrap();
let trending = registry.get("trending").unwrap();
assert!(!trending.is_builtin(), "overridden profile should not be builtin");
assert_eq!(trending.boosts().len(), 1);
assert!((trending.boosts()[0].weight - 1.0).abs() < f64::EPSILON);
}
#[test]
fn builtin_serde_roundtrip() {
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &[]);
for name in registry.list_names() {
let profile = registry.get(name).unwrap();
let json = serde_json::to_string(profile).unwrap();
let restored: RankingProfile = serde_json::from_str(&json).unwrap();
assert_eq!(restored.name(), profile.name());
assert_eq!(restored.version(), profile.version());
assert_eq!(restored.boosts().len(), profile.boosts().len());
assert_eq!(restored.gates().len(), profile.gates().len());
}
}
// Test helper
fn make_signal_def(name: &str, velocity: bool, windows: &[Window]) -> SignalTypeDef {
use std::time::Duration;
use crate::schema::{DecayModel, WindowSet};
SignalTypeDef::new(
name.into(),
EntityKind::Item,
DecayModel::exponential(Duration::from_secs(604_800)),
WindowSet::new(windows),
velocity,
)
}
```
### Property Tests
```rust
use proptest::prelude::*;
// P1: All built-in profiles pass validation when registered.
#[test]
fn all_builtins_pass_validation() {
let mut registry = ProfileRegistry::new();
// register_builtins should never panic or return errors
let _validations = register_builtins(&mut registry, &[]);
// Verify every registered profile has a valid name
for name in registry.list_names() {
let profile = registry.get(name).unwrap();
assert!(!profile.name().is_empty());
assert!(profile.name().chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_'));
assert!(profile.name().chars().next().unwrap().is_ascii_lowercase());
}
}
// P2: Degradation is consistent -- adding signals never reduces
// the set of registered profiles.
proptest! {
#[test]
fn more_signals_never_fewer_profiles(
num_signals in 0usize..5,
) {
let all_signals = ["view", "like", "share", "completion", "dislike"];
let signal_defs: Vec<_> = all_signals[..num_signals.min(all_signals.len())]
.iter()
.map(|name| make_signal_def(name, true,
&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays, Window::AllTime]))
.collect();
let mut registry = ProfileRegistry::new();
register_builtins(&mut registry, &signal_defs);
// All 11 profiles should be registered regardless of signal availability
prop_assert_eq!(registry.len(), 11,
"expected 11 profiles with {} signals, got {}",
num_signals, registry.len());
}
}
```
## Acceptance Criteria
- [ ] 11 built-in profiles registered: `trending`, `hot`, `new`, `top_week`, `top_month`, `top_all_time`, `hidden_gems`, `controversial`, `most_viewed`, `most_liked`, `shuffle`
- [ ] All built-in profiles have `is_builtin: true` and version 1
- [ ] All built-in profiles use `CandidateStrategy::Scan` for M2
- [ ] `trending` has `share` velocity boost (0.5), `view` velocity boost (0.3), `engagement_ratio` gate >= 0.03, `max_per_creator: 1`
- [ ] `hot` has `Sort::Hot { gravity: 1.8 }`, `max_per_creator: 2`, no boosts/penalties
- [ ] `new` has `Sort::New`, no boosts/penalties/gates
- [ ] `top_week` has `Sort::TopWindow { window: SevenDays }`, `max_per_creator: 2`
- [ ] `top_month` has `Sort::TopWindow { window: ThirtyDays }`, `max_per_creator: 2`
- [ ] `top_all_time` has `Sort::TopWindow { window: AllTime }`, `max_per_creator: 2`
- [ ] `hidden_gems` has `Sort::HiddenGems`, gates on completion >= 0.5 and view count >= 50, `max_per_creator: 1`, `format_mix: true`, `topic_diversity: 0.5`
- [ ] `controversial` has `Sort::Controversial`, gates on like count >= 50 and dislike count >= 50, `max_per_creator: 2`
- [ ] `most_viewed` has `Sort::MostViewed { window: SevenDays }`
- [ ] `most_liked` has `Sort::MostLiked { window: AllTime }`
- [ ] `shuffle` has `Sort::Shuffle`
- [ ] `validate_signal_dependencies()` correctly classifies signals as satisfied, missing, missing_velocity, or missing_windows
- [ ] `register_builtins()` registers all 11 profiles even when zero signal definitions are provided
- [ ] Missing signals produce `tracing::warn!` at registration time, not errors
- [ ] Application profiles can override built-ins by removing and re-registering with the same name
- [ ] All built-in profiles survive serde JSON roundtrip
- [ ] `register_builtins()` never panics regardless of input signal definitions
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit tests and property tests pass
## Research References
- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- Signal type definitions that profiles reference
## Spec References
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 11 (Built-in sort modes: Hot formula Section 11.1, Trending Section 11.2, Rising Section 11.3, Controversial Section 11.4, HiddenGems Section 11.5, Shuffle Section 11.6, Top windowed Section 11.7, simple field sorts Section 11.8), Section 13 (Profile presets: all 12 presets with exact field definitions), Section 16 (INV-PROF-3: signal reference validity)
## Implementation Notes
- `register_builtins()` should be called from `SchemaBuilder::build()` or from `TidalDb::open()` after the schema is loaded. The exact call site depends on how the schema-to-registry wiring evolves. For M2, call it from a new method on `TidalDb` or from a test helper.
- The `trending` profile in Spec 09 Section 13.2 uses `UniqueRatio` aggregation for `view` in the 24h window. `UniqueRatio` requires per-user deduplication in the signal system, which is not implemented until M3. For M2, substitute `SignalAgg::Value` for the third boost. Comment the substitution with `// TODO(M3): upgrade to SignalAgg::UniqueRatio when per-user dedup is available`.
- The `Rising` sort mode (Spec 09 Section 11.3) requires a per-creator baseline velocity, which is not available until M3 (creator entities). The `rising` profile is NOT included in the 11 M2 built-ins. It is deferred to M3.
- `dislike` signal may not be in every schema. The `hot` and `controversial` profiles reference it. When `dislike` is missing, `hot` uses only `like` count (positive = likes, negative = 0), which degrades to a simpler formula. `controversial` degrades to 0.0 for all candidates (no controversy without a negative signal).
- The `make_signal_def` test helper constructs `SignalTypeDef` instances for testing. It uses `pub(crate)` constructors from m1p1. If `SignalTypeDef::new()` is not accessible from tests (it is `pub(crate)`), add a `#[cfg(test)]` helper or use the `SchemaBuilder` to construct test schemas.
- Add `tracing` to dependencies if not already present, for `tracing::warn!` on missing signals.

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,84 @@
# Milestone 2, Phase 4: Diversity Enforcement
## Phase Deliverable
A post-scoring diversity pass that selects results from a scored candidate list to satisfy diversity constraints (`max_per_creator`, `format_mix`), without reducing result count. Implemented as a single greedy selection pass O(n) over the sorted candidate list. When constraints cannot be fully satisfied, the selector relaxes constraints in a defined order and returns results with a warning flag rather than an error.
This is the phase that turns ranked results from "the top N by score" into "the top N by score that a user would actually want to scroll through." Without diversity, a trending creator dominates the feed. With diversity, the database enforces variety -- no application logic required.
## Acceptance Criteria
- [ ] `max_per_creator:N` enforced: no more than N items from any single creator in the result set
- [ ] `format_mix:true` enforced: no more than 60% of results from any single format
- [ ] Diversity pass does not reduce result count -- it selects the next-best candidate that satisfies constraints
- [ ] Diversity pass adds < 1ms for 200 candidates (benchmarked)
- [ ] When diversity constraints cannot be fully satisfied (too few creators), results are returned with a warning flag, not an error
- [ ] Property test: diversity constraints hold for 10,000 random candidate sets
## Dependencies
- **Requires:** m2p3 (profile executor produces `Vec<ScoredCandidate>` sorted by score descending, with `entity_id`, `score`, and `signal_snapshot`; `DiversitySpec` type defined on `RankingProfile` with `max_per_creator`, `format_mix`, `topic_diversity`, `category_min`)
- **Blocks:** m2p5 (RETRIEVE executor calls diversity enforcement as the penultimate step before result return)
## Research References
- [thoughts.md](../../../../thoughts.md) -- Part V.14 (MMR post-scoring diversity enforcement)
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Filtered search and post-retrieval reranking patterns
## Spec References
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 9 (Diversity Enforcement):
- Section 9.1 (DiversitySpec structure: max_per_creator, format_mix, topic_diversity, category_min)
- Section 9.2 (Greedy MMR reranking algorithm pseudocode)
- Section 9.3 (Constraint details: per-page enforcement, format bonus, category minimum)
- Section 9.4 (Diversity and pagination: per-page, not global)
- Section 9.5 (Diversity as reordering, not filtering; relaxation under pressure)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 4 (Scoring pipeline: diversity is Stage 8)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 16 (Invariants INV-RANK-5: diversity never reduces result count, INV-RANK-6: diversity preserves relative score order within same-constraint group)
## Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Diversity Types + Greedy Selector | `DiversityConstraints`, `DiversityResult`, `ConstraintViolation`, `DiversitySelector`, greedy selection algorithm with three-stage relaxation | None | M |
| 02 | Property Tests + Benchmarks | proptest property tests (10,000 random candidate sets), Criterion benchmarks (200-candidate < 1ms) | Task 01 | S |
## Task Dependency DAG
```
Task 01: Diversity Types + Greedy Selector
|
v
Task 02: Property Tests + Benchmarks
```
Task 01 delivers all types and the selection algorithm. Task 02 validates correctness via property tests and performance via benchmarks. Strictly sequential -- Task 02 tests the implementation from Task 01.
## File Layout
```
tidal/src/
ranking/
diversity.rs -- DiversityConstraints, DiversityResult, DiversitySelector,
ConstraintViolation (Task 01)
mod.rs -- add `pub mod diversity;` and re-exports (Task 01)
tidal/benches/
ranking.rs -- add diversity benchmarks (Task 02) to the existing ranking bench file
```
## Open Questions
1. **Creator ID and format in ScoredCandidate**: The diversity selector needs each candidate's `creator_id` and `format` to apply constraints. These are entity metadata fields. `ScoredCandidate` from m2p3 has `entity_id`, `score`, and `signal_snapshot` but not a general metadata map. Options:
- (A) Add `creator_id: Option<EntityId>` and `format: Option<String>` fields to `ScoredCandidate` -- cleanest, no extra lookup
- (B) `DiversitySelector` takes `&EntityStore` and loads metadata per candidate -- more flexible, extra lookup cost (~50ns per candidate)
- **Decision for M2:** Option A. The executor adds `creator_id` and `format` to `ScoredCandidate` at scoring time (they are already loaded from entity metadata during scoring). This keeps diversity O(n) without extra I/O. The `ScoredCandidate` struct gains two optional fields. This change is made as part of Task 01 in this phase.
2. **`min_exploration` constraint**: The exploration budget (10% of results from unfollowed creators) is an M3 feature (Spec 09 Section 10). `DiversityConstraints` includes a `min_exploration: Option<f64>` field for forward compatibility, but the M2 selector ignores it if set. A `todo!()` comment is added in the selector with "M3: implement exploration budget after relationship graph is available."
3. **Relaxation order**: The three-stage relaxation (double max_per_creator, ignore format_mix, accept anything) is the default for M2. The caller (m2p5 RETRIEVE executor) can configure a stricter relaxation policy in future milestones. For M2, hardcode the three-stage order.
4. **`DiversitySpec` vs `DiversityConstraints`**: `DiversitySpec` is already defined on `RankingProfile` (m2p3 Task 01) with fields `max_per_creator`, `format_mix`, `topic_diversity`, `category_min`. The `DiversityConstraints` struct in this phase is the runtime representation used by the selector, derived from `DiversitySpec` plus query-level overrides (the `DIVERSITY` clause). For M2, `DiversityConstraints` is constructed from `DiversitySpec` with a `From` impl. Query-level overrides are wired in m2p5.
5. **`topic_diversity` and `category_min`**: These are fields on `DiversitySpec` from the spec (Section 9.1). For M2, only `max_per_creator` and `format_mix` are implemented. `topic_diversity` requires embedding distance computation (O(n*k) where k = selected count) which changes the algorithm from greedy to MMR. `category_min` requires category metadata on each candidate. Both are deferred to M6. The `DiversityConstraints` struct includes these fields as `Option` types but the selector skips them with a `tracing::debug!` message when set.
6. **Diversity and pagination (Spec 09 Section 9.4)**: Diversity constraints apply per page, not globally across all pages. The selector operates on a single page's worth of candidates. The RETRIEVE executor (m2p5) handles pagination by passing the correct candidate slice to the selector. No pagination logic is needed in the diversity module itself.

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,466 @@
# Task 02: Diversity Property Tests + Benchmarks
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p4 -- Diversity Enforcement
**Depends On:** Task 01 (DiversityConstraints, DiversityResult, DiversitySelector, ConstraintViolation, ScoredCandidate with diversity metadata)
**Blocks:** m2p5 (RETRIEVE executor relies on diversity enforcement being correct and performant)
**Complexity:** S
## Objective
Deliver property-based tests (proptest) that verify diversity constraints hold across 10,000 random candidate sets, and Criterion benchmarks that verify the diversity pass completes in under 1ms for 200 candidates. The property tests are the core acceptance gate for this phase -- they prove that no matter what the scored candidate list looks like, the diversity selector either satisfies all constraints or correctly reports which constraints it could not satisfy.
The property tests must cover:
1. `max_per_creator` never exceeded in the selected set (when satisfiable)
2. `format_mix` fraction never exceeded (when satisfiable)
3. Result count never drops below `min(target_count, candidates.len())`
4. Graceful degradation when constraints are unsatisfiable
5. Score order preserved within same-creator groups
The benchmarks must demonstrate:
1. 200 candidates with `max_per_creator=2` and 50 creators: select 50 in under 1ms
2. 200 candidates with `format_mix=0.6` and 5 formats: select 50 in under 1ms
## Requirements
- 5 proptest property tests covering the diversity invariants
- 2 Criterion benchmarks meeting the < 1ms performance target
- All property tests use `ProptestConfig` with `cases = 10_000`
- Benchmarks added to the existing `tidal/benches/ranking.rs` file
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/ranking/
diversity.rs -- property tests added to existing #[cfg(test)] mod at bottom (Task 02)
tidal/benches/
ranking.rs -- Criterion benchmarks for diversity (Task 02)
```
### Property Tests
```rust
// === ranking/diversity.rs (added to #[cfg(test)] module) ===
use proptest::prelude::*;
/// Strategy to generate a random scored candidate with diversity metadata.
fn arb_scored_candidate(
max_entity_id: u64,
max_creator_id: u64,
formats: &'static [&'static str],
) -> impl Strategy<Value = ScoredCandidate> {
(
0..max_entity_id,
prop::num::f64::POSITIVE, // score > 0
0..max_creator_id,
prop::sample::select(formats),
).prop_map(|(entity, score, creator, format)| {
ScoredCandidate::with_diversity_metadata(
EntityId::new(entity),
score,
Some(EntityId::new(creator + 1000)), // offset to distinguish from entity IDs
Some(format.to_string()),
)
})
}
/// Strategy to generate a Vec of scored candidates, sorted by score descending.
fn arb_candidate_set(
min_size: usize,
max_size: usize,
max_creators: u64,
formats: &'static [&'static str],
) -> impl Strategy<Value = Vec<ScoredCandidate>> {
prop::collection::vec(
arb_scored_candidate(10_000, max_creators, formats),
min_size..=max_size,
).prop_map(|mut candidates| {
// Deduplicate by entity_id (keep first occurrence)
let mut seen = std::collections::HashSet::new();
candidates.retain(|c| seen.insert(c.entity_id.clone()));
// Sort by score descending (selector expects sorted input)
candidates.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));
candidates
})
}
const TEST_FORMATS: &[&str] = &["video", "article", "short", "podcast", "live"];
// P1: max_per_creator is never exceeded in the selected set.
//
// When the candidate set has enough unique creators to satisfy the constraint
// at the requested target_count, every creator in the result set has at most
// max_per_creator items. When it cannot be satisfied, constraints_satisfied
// is false.
proptest! {
#![proptest_config(ProptestConfig::with_cases(10_000))]
#[test]
fn prop_max_per_creator_holds(
candidates in arb_candidate_set(10, 200, 50, TEST_FORMATS),
max_per_creator in 1usize..=5,
target in 5usize..=50,
) {
let constraints = DiversityConstraints {
max_per_creator: Some(max_per_creator),
..DiversityConstraints::none()
};
let target_count = target.min(candidates.len());
let result = DiversitySelector::select(candidates, &constraints, target_count);
if result.constraints_satisfied {
// When satisfied, no creator exceeds the limit
let mut creator_counts: std::collections::HashMap<EntityId, usize> =
std::collections::HashMap::new();
for c in &result.selected {
if let Some(ref id) = c.creator_id {
*creator_counts.entry(id.clone()).or_insert(0) += 1;
}
}
for (creator_id, count) in &creator_counts {
prop_assert!(*count <= max_per_creator,
"creator {:?} has {} items, max is {}",
creator_id, count, max_per_creator);
}
} else {
// When not satisfied, violations are reported
prop_assert!(!result.violations.is_empty(),
"constraints_satisfied=false but no violations reported");
}
}
}
// P2: format_mix fraction is never exceeded in the selected set.
//
// When format_mix is enforced, no single format exceeds 60% of the result set
// (for result sets of size >= 5). When it cannot be satisfied, a violation is
// reported.
proptest! {
#![proptest_config(ProptestConfig::with_cases(10_000))]
#[test]
fn prop_format_mix_holds(
candidates in arb_candidate_set(10, 200, 50, TEST_FORMATS),
target in 5usize..=50,
) {
let constraints = DiversityConstraints {
format_mix_max_fraction: Some(DEFAULT_FORMAT_MIX_MAX_FRACTION),
..DiversityConstraints::none()
};
let target_count = target.min(candidates.len());
let result = DiversitySelector::select(candidates, &constraints, target_count);
if result.constraints_satisfied && result.selected.len() >= 5 {
let mut format_counts: std::collections::HashMap<&str, usize> =
std::collections::HashMap::new();
for c in &result.selected {
if let Some(ref fmt) = c.format {
*format_counts.entry(fmt.as_str()).or_insert(0) += 1;
}
}
let total = result.selected.len() as f64;
for (fmt, count) in &format_counts {
let fraction = *count as f64 / total;
prop_assert!(fraction <= DEFAULT_FORMAT_MIX_MAX_FRACTION + 0.01,
"format '{}' has fraction {:.3}, exceeds {:.2}",
fmt, fraction, DEFAULT_FORMAT_MIX_MAX_FRACTION);
}
}
}
}
// P3: result count never drops below min(target_count, candidates.len()).
//
// The diversity selector MUST return at least as many items as the minimum
// of the target count and the candidate count. Diversity is reordering,
// not filtering (INV-RANK-5).
proptest! {
#![proptest_config(ProptestConfig::with_cases(10_000))]
#[test]
fn prop_no_result_count_reduction_when_possible(
candidates in arb_candidate_set(1, 200, 50, TEST_FORMATS),
max_per_creator in 1usize..=5,
target in 1usize..=50,
) {
let constraints = DiversityConstraints {
max_per_creator: Some(max_per_creator),
format_mix_max_fraction: Some(DEFAULT_FORMAT_MIX_MAX_FRACTION),
..DiversityConstraints::none()
};
let expected_count = target.min(candidates.len());
let result = DiversitySelector::select(candidates, &constraints, target);
prop_assert_eq!(result.selected.len(), expected_count,
"expected {} results, got {} (target={}, candidates={})",
expected_count, result.selected.len(), target, expected_count);
}
}
// P4: graceful degradation under impossible constraints.
//
// When constraints are unsatisfiable (e.g., all items from 1 creator with
// max_per_creator=1 and target > 1), the selector still returns results
// (via relaxation) and reports constraints_satisfied=false with violations.
proptest! {
#![proptest_config(ProptestConfig::with_cases(10_000))]
#[test]
fn prop_graceful_degradation(
num_items in 5usize..=100,
target in 2usize..=20,
) {
// All items from a single creator -- impossible to satisfy max_per_creator=1
let creator = EntityId::new(999);
let candidates: Vec<ScoredCandidate> = (0..num_items).map(|i| {
ScoredCandidate::with_diversity_metadata(
EntityId::new(i as u64),
(num_items - i) as f64 * 0.01,
Some(creator.clone()),
Some("video".to_string()),
)
}).collect();
let constraints = DiversityConstraints {
max_per_creator: Some(1),
..DiversityConstraints::none()
};
let target_count = target.min(num_items);
let result = DiversitySelector::select(candidates, &constraints, target_count);
// Should still return target_count items via relaxation
prop_assert_eq!(result.selected.len(), target_count,
"expected {} items after relaxation, got {}",
target_count, result.selected.len());
// Should report unsatisfied constraints when target > 1
// (with 1 creator, max_per_creator=1, and target > 1)
if target_count > 1 {
prop_assert!(!result.constraints_satisfied,
"constraints should not be satisfied: 1 creator, max_per_creator=1, target={}",
target_count);
prop_assert!(!result.violations.is_empty(),
"violations should be reported");
}
}
}
// P5: score order preserved within same-creator items.
//
// Among selected items from the same creator, they appear in the order
// they were encountered in the input (which is score-descending). This
// means the greedy selector always picks the highest-scored item from
// a creator before picking a lower-scored one.
proptest! {
#![proptest_config(ProptestConfig::with_cases(10_000))]
#[test]
fn prop_score_order_preserved(
candidates in arb_candidate_set(10, 200, 20, TEST_FORMATS),
max_per_creator in 1usize..=5,
target in 5usize..=50,
) {
let constraints = DiversityConstraints {
max_per_creator: Some(max_per_creator),
..DiversityConstraints::none()
};
let target_count = target.min(candidates.len());
let result = DiversitySelector::select(candidates, &constraints, target_count);
// Group selected items by creator
let mut by_creator: std::collections::HashMap<EntityId, Vec<f64>> =
std::collections::HashMap::new();
for c in &result.selected {
if let Some(ref id) = c.creator_id {
by_creator.entry(id.clone()).or_default().push(c.score);
}
}
// Within each creator, scores must be non-increasing
for (creator_id, scores) in &by_creator {
for pair in scores.windows(2) {
prop_assert!(pair[0] >= pair[1],
"creator {:?} has scores out of order: {} < {}",
creator_id, pair[0], pair[1]);
}
}
}
}
```
### Criterion Benchmarks
```rust
// === tidal/benches/ranking.rs (additions to existing benchmark file) ===
use tidaldb::ranking::diversity::*;
/// Setup: create 200 scored candidates with 50 creators and mixed formats.
fn setup_diversity_candidates_200() -> Vec<ScoredCandidate> {
let formats = ["video", "article", "short", "podcast", "live"];
(0..200).map(|i| {
let creator = EntityId::new((i % 50) as u64 + 1000);
let format = formats[i % formats.len()].to_string();
ScoredCandidate::with_diversity_metadata(
EntityId::new(i as u64),
(200 - i) as f64 * 0.005, // scores from 1.0 to 0.005
Some(creator),
Some(format),
)
}).collect()
}
/// KEY BENCHMARK: 200 candidates, max_per_creator=2, 50 creators -> select 50.
/// Target: < 1ms.
///
/// This is the primary diversity performance gate from the ROADMAP
/// acceptance criteria.
fn bench_diversity_200_candidates_max_per_creator(c: &mut Criterion) {
let candidates = setup_diversity_candidates_200();
let constraints = DiversityConstraints {
max_per_creator: Some(2),
..DiversityConstraints::none()
};
c.bench_function("diversity_200_max_per_creator_2", |b| {
b.iter(|| {
DiversitySelector::select(
candidates.clone(),
&constraints,
50,
)
})
});
}
/// BENCHMARK: 200 candidates, format_mix=0.6, 5 formats -> select 50.
/// Target: < 1ms.
fn bench_diversity_200_candidates_format_mix(c: &mut Criterion) {
let candidates = setup_diversity_candidates_200();
let constraints = DiversityConstraints {
format_mix_max_fraction: Some(DEFAULT_FORMAT_MIX_MAX_FRACTION),
..DiversityConstraints::none()
};
c.bench_function("diversity_200_format_mix", |b| {
b.iter(|| {
DiversitySelector::select(
candidates.clone(),
&constraints,
50,
)
})
});
}
/// BENCHMARK: 200 candidates, both constraints -> select 50.
/// Target: < 1ms.
fn bench_diversity_200_candidates_combined(c: &mut Criterion) {
let candidates = setup_diversity_candidates_200();
let constraints = DiversityConstraints {
max_per_creator: Some(2),
format_mix_max_fraction: Some(DEFAULT_FORMAT_MIX_MAX_FRACTION),
..DiversityConstraints::none()
};
c.bench_function("diversity_200_combined", |b| {
b.iter(|| {
DiversitySelector::select(
candidates.clone(),
&constraints,
50,
)
})
});
}
/// BENCHMARK: worst case -- all candidates from 1 creator, requires relaxation.
/// Target: < 1ms even with relaxation passes.
fn bench_diversity_200_candidates_worst_case_relaxation(c: &mut Criterion) {
let creator = EntityId::new(999);
let candidates: Vec<ScoredCandidate> = (0..200).map(|i| {
ScoredCandidate::with_diversity_metadata(
EntityId::new(i as u64),
(200 - i) as f64 * 0.005,
Some(creator.clone()),
Some("video".to_string()),
)
}).collect();
let constraints = DiversityConstraints {
max_per_creator: Some(2),
format_mix_max_fraction: Some(DEFAULT_FORMAT_MIX_MAX_FRACTION),
..DiversityConstraints::none()
};
c.bench_function("diversity_200_worst_case_relaxation", |b| {
b.iter(|| {
DiversitySelector::select(
candidates.clone(),
&constraints,
50,
)
})
});
}
/// BENCHMARK: no constraints (fast path).
fn bench_diversity_200_candidates_no_constraints(c: &mut Criterion) {
let candidates = setup_diversity_candidates_200();
let constraints = DiversityConstraints::none();
c.bench_function("diversity_200_no_constraints", |b| {
b.iter(|| {
DiversitySelector::select(
candidates.clone(),
&constraints,
50,
)
})
});
}
// Add to the existing criterion_group!:
//
// criterion_group!(
// benches,
// // ... existing m2p3 benchmarks ...
// bench_diversity_200_candidates_max_per_creator,
// bench_diversity_200_candidates_format_mix,
// bench_diversity_200_candidates_combined,
// bench_diversity_200_candidates_worst_case_relaxation,
// bench_diversity_200_candidates_no_constraints,
// );
```
## Acceptance Criteria
- [ ] Property test P1 (`prop_max_per_creator_holds`): for 10,000 random candidate sets, max_per_creator is never exceeded when constraints are satisfied
- [ ] Property test P2 (`prop_format_mix_holds`): for 10,000 random candidate sets, no format exceeds 60% when constraints are satisfied
- [ ] Property test P3 (`prop_no_result_count_reduction_when_possible`): for 10,000 random candidate sets, result count equals `min(target_count, candidates.len())` -- diversity never reduces count
- [ ] Property test P4 (`prop_graceful_degradation`): for 10,000 random single-creator candidate sets, selector returns target count via relaxation and reports `constraints_satisfied=false`
- [ ] Property test P5 (`prop_score_order_preserved`): for 10,000 random candidate sets, items from the same creator appear in score-descending order
- [ ] Criterion benchmark `diversity_200_max_per_creator_2`: 200 candidates, 50 creators, max_per_creator=2, select 50 in < 1ms
- [ ] Criterion benchmark `diversity_200_format_mix`: 200 candidates, 5 formats, format_mix=0.6, select 50 in < 1ms
- [ ] Criterion benchmark `diversity_200_combined`: 200 candidates, both constraints, select 50 in < 1ms
- [ ] Criterion benchmark `diversity_200_worst_case_relaxation`: 200 candidates from 1 creator, both constraints, select 50 in < 1ms (measures relaxation overhead)
- [ ] Criterion benchmark `diversity_200_no_constraints`: 200 candidates, no constraints (fast path), select 50 (baseline measurement)
- [ ] Benchmarks added to existing `tidal/benches/ranking.rs` and registered in `criterion_group!`
- [ ] `proptest` already in `[dev-dependencies]` from m1p4; no new dependency needed
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and benchmarks pass
## Spec References
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 9.5 (INV-RANK-5: diversity never reduces result count), Section 15 (Performance targets: diversity pass < 1ms for 200 candidates)
## Implementation Notes
- `proptest` is already a `[dev-dependencies]` entry from m1p4 property tests. No new dependency needed.
- `criterion` is already configured for `tidal/benches/ranking.rs` from m2p3. The diversity benchmarks are added to the same file and registered in the existing `criterion_group!`.
- The `arb_scored_candidate` strategy generates candidates with `prop::num::f64::POSITIVE` for scores, which includes very small and very large values. This is intentional -- the selector should handle any score distribution.
- The candidate deduplication in `arb_candidate_set` (retain unique entity IDs) is necessary because the selector may use entity IDs as keys in internal data structures. Duplicate entity IDs in the input would be a caller bug, not something the selector needs to handle.
- The format fraction tolerance in P2 (`+ 0.01`) accounts for integer division effects. With 10 items and 6 of one format, the fraction is 0.6 exactly -- but floating point comparison needs a small epsilon.
- The `candidates.clone()` in benchmarks measures the clone cost as part of the benchmark. This is intentional: in production, the diversity selector owns the candidate vec (it is consumed, not borrowed). The clone simulates the real allocation pattern. If the clone dominates benchmark time, switch to `iter_batched` with a setup closure.
- P3 is the critical property test -- it proves INV-RANK-5 (diversity never reduces result count). If this test fails, the relaxation logic has a bug. This is the acceptance gate for the phase.

View File

@ -0,0 +1,107 @@
# Milestone 2, Phase 5: Query Parser and RETRIEVE Executor
## Phase Deliverable
The RETRIEVE query operation: a typed AST (`Retrieve` struct), a builder API for ergonomic query construction (no text parser in M2 -- that is M5), a `Signal` write command struct, and the `RetrieveExecutor` that orchestrates m2p1 through m2p4 into a complete pipeline: ANN candidate retrieval or full-scan or signal-ranked selection, filter evaluation, signal scoring, diversity enforcement, result assembly. The full M2 UAT scenario passes as a Rust integration test.
This is the capstone phase. Everything built in M2 converges here. The vector index, the filter engine, the ranking profile executor, and the diversity selector are wired together by a single orchestrator. After this phase, a developer can write items with embeddings, write signal events, and execute `db.retrieve(query)` to get ranked, filtered, diverse results -- in under 50ms for 10K items.
## Acceptance Criteria
- [ ] `Retrieve` struct: `entity_kind`, `profile` (name + optional version), `filters` (`Vec<FilterExpr>`), `diversity` (`DiversityConstraints`), `limit` (default 50, max 500), `exclude` (`Vec<EntityId>`), `cursor` (`Option<Cursor>`)
- [ ] `RetrieveBuilder` with ergonomic builder pattern: `Retrieve::builder().entity(EntityKind::Item).profile("trending").filter(FilterExpr::eq("category", "jazz")).diversity(DiversityConstraints::new().max_per_creator(2)).limit(25).build()`
- [ ] Validation: limit out of range returns error, unknown profile name returns error, incompatible filters for entity kind returns error
- [ ] `Results` struct: `items` (`Vec<RetrieveResult>`), `next_cursor` (`Option<Cursor>`), `total_scored` (how many candidates were scored), `constraints_satisfied` (from diversity result)
- [ ] `RetrieveResult` struct: `entity_id`, `score` (f64), `rank` (usize), `signal_snapshot` (`Vec<(String, f64)>`)
- [ ] `Signal` struct: write command wired to existing `TidalDb::signal()` path from M1
- [ ] `Cursor` struct: offset-based opaque cursor encoded as base64 string
- [ ] `QueryError` enum: `ProfileNotFound`, `InvalidFilter`, `IndexNotAvailable`, `StorageError`, `InvalidLimit`, `InvalidCursor`
- [ ] `RetrieveExecutor` pipeline: candidate retrieval (ANN or full scan based on profile's `CandidateStrategy`) -> filter -> score -> diversity -> limit -> return
- [ ] When profile uses velocity/decay signals (e.g., `trending`, `hot`), executor uses ANN retrieval over embeddings then scores with signal state
- [ ] When profile is `new` or `alphabetical`, executor skips ANN and uses metadata index directly (full scan sorted by `created_at` or field)
- [ ] When profile is `SignalRanked` (e.g., `most_viewed`, `most_liked`), executor reads signal state from ledger without ANN
- [ ] `EXCLUDE` list applied before scoring (candidates in exclude list are removed from candidate set)
- [ ] End-to-end RETRIEVE latency < 50ms at 10K items (Criterion benchmarked)
- [ ] Results include signal snapshot for debugging/transparency (top signals used in scoring per result)
- [ ] `TidalDb::retrieve()` method wires `RetrieveExecutor` to the public API
- [ ] Full M2 UAT scenario passes as an integration test (`tidal/tests/m2_uat.rs`)
- [ ] `cargo clippy -- -D warnings` passes
- [ ] No `unsafe` code in `query/` module
## Dependencies
- **Requires:** m2p1 (`VectorIndex` trait, `UsearchIndex`, `AdaptiveQueryPlanner`, `EmbeddingSlotRegistry`), m2p2 (`BitmapIndex`, `RangeIndex`, `FilterExpr`, `FilterEvaluator`, `FilterResult`), m2p3 (`RankingProfile`, `ProfileRegistry`, `ProfileExecutor`, `ScoredCandidate`, `Sort`, `CandidateStrategy`), m2p4 (`DiversityConstraints`, `DiversitySelector`, `DiversityResult`), m1p4 (`SignalLedger`), m1p5 (`TidalDb` struct, `Config`, `write_item`, `signal`, `item_exists`, `open`, `shutdown`)
- **Blocks:** Milestone 3 (personalized ranking adds `FOR USER` clause and user context to the RETRIEVE pipeline)
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- Adaptive query planner integration, ANN candidate retrieval strategy, recall@k vs latency tradeoffs
- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- Signal read latencies (~15ns hot-tier, ~200ns windowed) establishing per-candidate scoring budget
- [thoughts.md](../../../../thoughts.md) -- Part V.14 (MMR post-scoring), Part V.9 (vector index as derived state)
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- THE authoritative spec:
- Section 2 (RETRIEVE operation: candidate generation, filtering, scoring, diversity, pagination)
- Section 3 (Query parsing: `Retrieve` struct, validation, resolution, `QueryError` enum)
- Section 4 (Query planning: `CandidateStrategy`, plan construction, decision tree)
- Section 5 (Execution pipeline: 6-stage architecture, candidate generation, filter evaluation, signal loading, scoring, diversity enforcement, pagination)
- Section 7 (Filter evaluation: bitmap-based architecture, filter push-down, short-circuit)
- Section 8 (Pagination: cursor structure, cursor semantics, cursor encoding)
- Section 15 (Invariants: INV-QUERY-1 deterministic results, INV-QUERY-2 filter correctness, INV-QUERY-3 diversity constraints)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3 (CandidateStrategy variants), Section 4 (scoring pipeline stages), Section 9 (diversity enforcement)
## Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | RETRIEVE AST + Parser | `Retrieve` struct, `RetrieveBuilder`, `ProfileRef`, `Cursor`, `Results`, `RetrieveResult`, `Signal` write struct, `QueryError`, validation | None | M |
| 02 | RETRIEVE Executor Pipeline | `RetrieveExecutor`, 5-stage pipeline (candidate -> filter -> score -> diversity -> assemble), `TidalDb::retrieve()`, Criterion benchmarks | Task 01 | L |
| 03 | M2 UAT Integration Test | Full M2 UAT scenario as `tidal/tests/m2_uat.rs`: 10K items, 10K signals, all 6 profile queries, signal burst rank change, crash recovery | Task 01, Task 02 | M |
## Task Dependency DAG
```
Task 01: RETRIEVE AST + Parser
|
v
Task 02: RETRIEVE Executor Pipeline
|
v
Task 03: M2 UAT Integration Test
```
Linear dependency chain. Task 01 defines the types that Task 02 consumes. Task 03 exercises the complete system including Task 02's executor wired through `TidalDb::retrieve()`.
## File Layout
```
tidal/src/
query/
mod.rs -- pub mod retrieve; pub mod executor; re-exports of Retrieve, Results,
RetrieveResult, RetrieveExecutor, QueryError, Cursor, Signal
retrieve.rs -- Retrieve struct, RetrieveBuilder, ProfileRef, Cursor, Results,
RetrieveResult, Signal struct, validation (Task 01)
executor.rs -- RetrieveExecutor, 5-stage pipeline, TidalDb::retrieve() wiring (Task 02)
lib.rs -- add `pub mod query;` and TidalDb::retrieve() method (Task 02)
tidal/tests/
m2_uat.rs -- Full M2 UAT integration test (Task 03)
tidal/benches/
query.rs -- Criterion benchmarks for end-to-end RETRIEVE (Task 02)
tidal/Cargo.toml -- add `base64` dependency for cursor encoding; add `[[bench]] name = "query"
harness = false`
```
## Open Questions
1. **Embedding dimension for M2 integration tests**: 1536-dim vectors make the M2 UAT test slow (~2x indexing time). Use 64-dim vectors in tests. Production profiles use any dimension supported by the `VectorIndex` trait. The trait abstraction handles any dimension; set `dimensions: 64` in the test schema. The USearch backend's f16 quantization works identically at 64d.
2. **`TidalDb::retrieve()` method**: The m1p5 `TidalDb` struct needs a `retrieve(&self, query: Retrieve) -> Result<Results, QueryError>` method. For M2, `TidalDb` must hold references to the vector index registry, filter evaluator, profile registry, and signal ledger. These are initialized at `TidalDb::open()` time. The `retrieve()` method constructs a `RetrieveExecutor` from these references and delegates to it.
3. **Filter + ANN interaction for M2**: In M2, filters and ANN are applied sequentially (ANN first, then filter). The adaptive query planner from m2p1 already selects the ANN strategy based on filter selectivity. For M2, the pipeline calls the planner for ANN strategy selection but applies metadata filters post-ANN. Pre-filtering via USearch predicate callbacks is available via `filtered_search()` but the sequential approach is the simpler correct baseline. Document this as a known performance limitation that M3+ refines.
4. **No `FOR USER` clause in M2**: The RETRIEVE query in M2 does not support user context. `CandidateStrategy::Ann` uses a query vector from the item embedding space (e.g., a representative category embedding), not a user preference vector. User preference vectors come in M3 when user entities are introduced. The `Retrieve` struct has a `for_user: Option<UserId>` field that is always `None` in M2.
5. **Cursor-based pagination**: For M2, implement a simple offset-based cursor (encode the current offset as a base64 opaque string). True keyset-based pagination (score + entity_id tiebreaker as described in Spec 08 Section 8.2) is an M5+ concern. The offset cursor is sufficient for the M2 UAT which does not paginate.
6. **`CandidateStrategy` routing**: The `RetrieveExecutor` reads the profile's `CandidateStrategy` to decide how to generate candidates. For M2, three strategies are implemented: `Ann` (ANN search over embeddings), `Scan` (full entity scan sorted by metadata field), and `SignalRanked` (top-K by signal value). `Relationship`, `Hybrid`, and `CohortTrending` strategies are type stubs that produce errors if invoked -- they require M3+ infrastructure (user entities, text index, cohorts).

View File

@ -0,0 +1,982 @@
# Task 01: RETRIEVE AST + Parser
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p5 -- Query Parser and RETRIEVE Executor
**Depends On:** None (uses types from m2p2, m2p3, m2p4 but no m2p5 tasks)
**Blocks:** Task 02 (RETRIEVE Executor Pipeline), Task 03 (M2 UAT Integration Test)
**Complexity:** M
## Objective
Deliver the typed AST for the RETRIEVE query operation and a Rust builder API for constructing queries ergonomically. For M2, there is no text grammar parser -- the "parser" is the `RetrieveBuilder` which validates and constructs a `Retrieve` struct. The text syntax parser (`RETRIEVE items USING PROFILE trending LIMIT 25`) is deferred to M5.
This task also defines the response types (`Results`, `RetrieveResult`), the pagination cursor, the `Signal` write command struct (wired to the existing M1 signal write path), and the `QueryError` enum. These types are consumed by Task 02's executor and returned to the caller.
The types defined here map directly to the spec's input/output types (Spec 08 Section 3) but are scoped to M2: no `for_user`, no `similar_to`, no `for_cohort`, no `window`, no `context`, no `for_session`. These fields exist on the struct as `Option` types for forward compatibility but are validated as unsupported in M2 if set.
## Requirements
- `Retrieve` struct: the complete RETRIEVE query request
- `RetrieveBuilder`: ergonomic builder pattern for constructing `Retrieve` queries
- `ProfileRef`: profile name + optional version reference
- `Cursor`: opaque offset-based pagination cursor with base64 encoding
- `Results`: the query response (items, cursor, total_scored, constraints_satisfied)
- `RetrieveResult`: one result item (entity_id, score, rank, signal_snapshot)
- `Signal`: write command struct wired to `TidalDb::signal()`
- `QueryError`: error enum for query validation and execution failures
- Validation: limit range, profile reference format, filter compatibility
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/
query/
mod.rs -- pub mod retrieve; re-exports
retrieve.rs -- all types from this task
```
### Public API
```rust
// === query/retrieve.rs ===
use crate::schema::{EntityId, EntityKind, Timestamp};
use crate::ranking::diversity::DiversityConstraints;
use crate::storage::indexes::filter::FilterExpr;
/// Reference to a ranking profile by name, optionally pinned to a version.
///
/// The executor resolves this against the `ProfileRegistry` at query time.
/// If `version` is `None`, the latest version is used.
#[derive(Debug, Clone)]
pub struct ProfileRef {
/// Profile name. Must match a registered profile in the registry.
pub name: String,
/// Optional version pin. `None` = latest version.
pub version: Option<u32>,
}
impl ProfileRef {
pub fn new(name: impl Into<String>) -> Self {
Self {
name: name.into(),
version: None,
}
}
pub fn versioned(name: impl Into<String>, version: u32) -> Self {
Self {
name: name.into(),
version: Some(version),
}
}
}
/// A RETRIEVE query. Declarative: specifies what, not how.
///
/// For M2, this struct is constructed via `RetrieveBuilder` (Rust API).
/// A text syntax parser is deferred to M5.
///
/// The profile determines the candidate generation strategy and scoring
/// formula. The caller never specifies how candidates are found -- only
/// which profile to use and which filters to apply.
///
/// Spec reference: docs/specs/08-query-engine.md Section 3.1
#[derive(Debug, Clone)]
pub struct Retrieve {
/// Target entity type. For M2, only `EntityKind::Item` is supported.
pub entity_kind: EntityKind,
/// Named ranking profile. Determines candidate strategy and scoring.
pub profile: ProfileRef,
/// Metadata and signal filters. Combined as AND.
/// Uses `FilterExpr` from m2p2 for composable filter evaluation.
pub filters: Vec<FilterExpr>,
/// Diversity constraints. Applied as a post-scoring pass.
/// If `None`, no diversity enforcement is applied.
pub diversity: Option<DiversityConstraints>,
/// Maximum results to return. Default: 50. Range: [1, 500].
pub limit: usize,
/// Explicit item exclusions. Removed from candidate set before scoring.
pub exclude: Vec<EntityId>,
/// Pagination cursor from a previous result set.
/// If `None`, returns the first page.
pub cursor: Option<Cursor>,
// --- Fields present for forward compatibility (M3+), validated as unsupported in M2 ---
/// User context for personalization. M3+.
pub for_user: Option<u64>,
/// Anchor item for related/similar queries. M3+.
pub similar_to: Option<EntityId>,
/// Surface context for the feedback loop. M3+.
pub context: Option<String>,
}
/// Builder for constructing `Retrieve` queries ergonomically.
///
/// # Example
///
/// ```ignore
/// let query = Retrieve::builder()
/// .entity(EntityKind::Item)
/// .profile("trending")
/// .filter(FilterExpr::eq("category", "jazz"))
/// .diversity(DiversityConstraints::new().max_per_creator(2))
/// .limit(25)
/// .build()?;
/// ```
pub struct RetrieveBuilder {
entity_kind: Option<EntityKind>,
profile: Option<ProfileRef>,
filters: Vec<FilterExpr>,
diversity: Option<DiversityConstraints>,
limit: usize,
exclude: Vec<EntityId>,
cursor: Option<Cursor>,
for_user: Option<u64>,
similar_to: Option<EntityId>,
context: Option<String>,
}
impl RetrieveBuilder {
pub fn new() -> Self {
Self {
entity_kind: None,
profile: None,
filters: Vec::new(),
diversity: None,
limit: 50,
exclude: Vec::new(),
cursor: None,
for_user: None,
similar_to: None,
context: None,
}
}
/// Set the target entity kind.
pub fn entity(mut self, kind: EntityKind) -> Self {
self.entity_kind = Some(kind);
self
}
/// Set the ranking profile by name.
pub fn profile(mut self, name: impl Into<String>) -> Self {
self.profile = Some(ProfileRef::new(name));
self
}
/// Set the ranking profile by name and version.
pub fn profile_versioned(mut self, name: impl Into<String>, version: u32) -> Self {
self.profile = Some(ProfileRef::versioned(name, version));
self
}
/// Add a filter expression. Multiple filters are ANDed together.
pub fn filter(mut self, expr: FilterExpr) -> Self {
self.filters.push(expr);
self
}
/// Set diversity constraints.
pub fn diversity(mut self, constraints: DiversityConstraints) -> Self {
self.diversity = Some(constraints);
self
}
/// Set the maximum number of results. Range: [1, 500]. Default: 50.
pub fn limit(mut self, limit: usize) -> Self {
self.limit = limit;
self
}
/// Add an entity ID to the exclusion list.
pub fn exclude(mut self, id: EntityId) -> Self {
self.exclude.push(id);
self
}
/// Add multiple entity IDs to the exclusion list.
pub fn exclude_ids(mut self, ids: impl IntoIterator<Item = EntityId>) -> Self {
self.exclude.extend(ids);
self
}
/// Set the pagination cursor.
pub fn cursor(mut self, cursor: Cursor) -> Self {
self.cursor = Some(cursor);
self
}
/// Validate and build the `Retrieve` query.
///
/// Returns `QueryError::InvalidLimit` if limit is 0 or > 500.
/// Returns `QueryError::ProfileNotFound` if no profile is set.
/// Returns `QueryError::InvalidFilter` if `for_user` or `similar_to` are set (M2).
pub fn build(self) -> Result<Retrieve, QueryError> {
let entity_kind = self.entity_kind.unwrap_or(EntityKind::Item);
let profile = self.profile.ok_or_else(|| {
QueryError::ProfileNotFound("no profile specified".to_string())
})?;
if self.limit == 0 || self.limit > 500 {
return Err(QueryError::InvalidLimit {
requested: self.limit,
min: 1,
max: 500,
});
}
// M2: reject unsupported features
if self.for_user.is_some() {
return Err(QueryError::InvalidFilter {
field: "for_user".to_string(),
reason: "FOR USER clause is not supported until M3".to_string(),
});
}
if self.similar_to.is_some() {
return Err(QueryError::InvalidFilter {
field: "similar_to".to_string(),
reason: "SIMILAR TO clause is not supported until M3".to_string(),
});
}
Ok(Retrieve {
entity_kind,
profile,
filters: self.filters,
diversity: self.diversity,
limit: self.limit,
exclude: self.exclude,
cursor: self.cursor,
for_user: self.for_user,
similar_to: self.similar_to,
context: self.context,
})
}
}
impl Default for RetrieveBuilder {
fn default() -> Self {
Self::new()
}
}
impl Retrieve {
/// Start building a RETRIEVE query.
pub fn builder() -> RetrieveBuilder {
RetrieveBuilder::new()
}
}
/// The combined filter expression for the query.
///
/// Multiple filters are ANDed together. This helper constructs the
/// combined filter from the `Retrieve` query's filter list.
impl Retrieve {
/// Combine all filters into a single AND expression.
/// Returns `None` if no filters are specified.
pub fn combined_filter(&self) -> Option<FilterExpr> {
match self.filters.len() {
0 => None,
1 => Some(self.filters[0].clone()),
_ => Some(FilterExpr::And(self.filters.clone())),
}
}
}
// ============================================================
// Response Types
// ============================================================
/// The complete response from a RETRIEVE query.
///
/// Spec reference: docs/specs/08-query-engine.md Section 5.7
#[derive(Debug, Clone)]
pub struct Results {
/// The ranked result items for this page.
pub items: Vec<RetrieveResult>,
/// Cursor for the next page. `None` if this is the last page.
pub next_cursor: Option<Cursor>,
/// How many candidates were scored by the profile executor.
/// This is the count after filtering but before diversity and limit.
pub total_scored: usize,
/// Whether all diversity constraints were fully satisfied.
/// `false` if constraints were relaxed (see `DiversityResult::violations`).
pub constraints_satisfied: bool,
/// Non-fatal warnings from query execution.
///
/// Warnings are surfaced when the executor degrades gracefully:
/// - Metadata enrichment fails for a candidate (the item is treated as a
/// unique creator for diversity purposes; results are still returned)
/// - A filter references a field with no index (predicate fallback used)
///
/// An empty `warnings` vec means clean execution with no degradation.
pub warnings: Vec<String>,
}
impl Results {
/// Number of items in this page.
pub fn len(&self) -> usize {
self.items.len()
}
/// Whether this page is empty.
pub fn is_empty(&self) -> bool {
self.items.is_empty()
}
}
/// A single result item from a RETRIEVE query.
///
/// Includes the entity ID, composite score, rank position, and a
/// signal snapshot for debugging and transparency.
///
/// Spec reference: docs/specs/08-query-engine.md Section 5, Stage 10
#[derive(Debug, Clone)]
pub struct RetrieveResult {
/// The entity ID of the result.
pub entity_id: EntityId,
/// The composite score from the ranking profile, normalized to [0.0, 1.0].
pub score: f64,
/// The 1-based rank position in the result set.
pub rank: usize,
/// Key signal values used in scoring. For debugging and transparency.
/// Contains (signal_name, value) pairs for signals referenced by the
/// profile's scoring rules. Capped at 10 entries.
pub signal_snapshot: Vec<(String, f64)>,
}
// ============================================================
// Pagination Cursor
// ============================================================
/// Opaque pagination cursor for RETRIEVE queries.
///
/// For M2, this is a simple offset-based cursor encoded as a base64 string.
/// True keyset-based pagination (score + entity_id tiebreaker, Spec 08
/// Section 8.2) is deferred to M5.
///
/// # Limitation: Not Stable Under Concurrent Writes
///
/// Offset-based cursors are not stable when the underlying ranked list
/// changes between page requests (e.g., due to concurrent signal writes).
/// Items may appear on multiple pages or be skipped if the ranking shifts.
/// This is documented and acceptable for M2; the spec says to prefer
/// keyset cursors for production use. Do not use cursor-based pagination
/// in write-heavy workloads until M5.
///
/// The cursor is opaque to the caller -- they receive it as a string and
/// pass it back on the next request. The internal representation is an
/// implementation detail.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cursor {
/// The offset into the full result set for the next page.
offset: usize,
}
impl Cursor {
/// Create a cursor from an offset.
pub(crate) fn from_offset(offset: usize) -> Self {
Self { offset }
}
/// Get the offset this cursor represents.
pub(crate) fn offset(&self) -> usize {
self.offset
}
/// Encode the cursor as an opaque base64 string.
pub fn encode(&self) -> String {
use base64::Engine as _;
let bytes = self.offset.to_le_bytes();
base64::engine::general_purpose::URL_SAFE_NO_PAD.encode(bytes)
}
/// Decode a cursor from an opaque base64 string.
pub fn decode(encoded: &str) -> Result<Self, QueryError> {
use base64::Engine as _;
let bytes = base64::engine::general_purpose::URL_SAFE_NO_PAD
.decode(encoded)
.map_err(|e| QueryError::InvalidCursor(format!("invalid base64: {e}")))?;
if bytes.len() != std::mem::size_of::<usize>() {
return Err(QueryError::InvalidCursor(format!(
"expected {} bytes, got {}",
std::mem::size_of::<usize>(),
bytes.len()
)));
}
let offset = usize::from_le_bytes(
bytes
.try_into()
.map_err(|_| QueryError::InvalidCursor("byte conversion failed".to_string()))?,
);
Ok(Self { offset })
}
}
// ============================================================
// Signal Write Command
// ============================================================
/// A signal write command.
///
/// For M2, this is a thin wrapper that routes to the existing
/// `TidalDb::signal()` method from M1. The struct form enables
/// future batching and the query language parser (M5) to produce
/// signal writes from parsed text.
#[derive(Debug, Clone)]
pub struct Signal {
/// The signal type name (e.g., "view", "like", "share").
pub signal_type: String,
/// The target entity ID.
pub entity_id: EntityId,
/// The signal weight. Typically 1.0 for count-based signals.
pub weight: f64,
/// The timestamp of the event.
pub timestamp: Timestamp,
}
impl Signal {
/// Create a new signal write command.
pub fn new(
signal_type: impl Into<String>,
entity_id: EntityId,
weight: f64,
timestamp: Timestamp,
) -> Self {
Self {
signal_type: signal_type.into(),
entity_id,
weight,
timestamp,
}
}
}
// ============================================================
// Query Error
// ============================================================
/// Errors returned by the query engine.
///
/// Spec reference: docs/specs/08-query-engine.md Section 3.4
#[derive(Debug, Clone)]
pub enum QueryError {
/// The named profile does not exist in the profile registry.
ProfileNotFound(String),
/// A filter references a field or uses a condition that is invalid.
InvalidFilter {
field: String,
reason: String,
},
/// The requested limit is out of the valid range [1, 500].
InvalidLimit {
requested: usize,
min: usize,
max: usize,
},
/// A required index (vector, bitmap, range) is not available.
IndexNotAvailable(String),
/// A storage engine error occurred during query execution.
///
/// Preserves the original error type so callers can match on specific
/// storage failure modes (e.g., `StorageError::Corruption`). Use
/// `From<crate::storage::StorageError>` for automatic conversion.
StorageError(crate::storage::StorageError),
/// The pagination cursor is invalid or could not be decoded.
InvalidCursor(String),
/// The profile's candidate strategy is not supported in this milestone.
/// M2 supports: Ann, Scan, SignalRanked. Others require M3+ infrastructure.
UnsupportedStrategy(String),
}
impl std::fmt::Display for QueryError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
QueryError::ProfileNotFound(name) => {
write!(f, "ranking profile '{name}' not found in registry")
}
QueryError::InvalidFilter { field, reason } => {
write!(f, "invalid filter on field '{field}': {reason}")
}
QueryError::InvalidLimit {
requested,
min,
max,
} => {
write!(f, "limit {requested} is out of range [{min}, {max}]")
}
QueryError::IndexNotAvailable(name) => {
write!(f, "required index not available: {name}")
}
QueryError::StorageError(e) => {
write!(f, "storage error during query: {e}")
}
QueryError::InvalidCursor(msg) => {
write!(f, "invalid pagination cursor: {msg}")
}
QueryError::UnsupportedStrategy(msg) => {
write!(f, "unsupported candidate strategy: {msg}")
}
}
}
}
impl std::error::Error for QueryError {}
impl From<crate::storage::StorageError> for QueryError {
fn from(e: crate::storage::StorageError) -> Self {
QueryError::StorageError(e)
}
}
```
### Validation Logic
```rust
impl Retrieve {
/// Validate the query against a ProfileRegistry and Schema.
///
/// Called by the executor before pipeline execution. Separated from
/// `build()` because profile existence requires the registry, which
/// the builder does not have access to.
pub(crate) fn validate(
&self,
registry: &ProfileRegistry,
) -> Result<(), QueryError> {
// 1. Profile existence
let profile_name = &self.profile.name;
let profile = match self.profile.version {
Some(v) => registry.get_versioned(profile_name, v),
None => registry.get(profile_name),
};
if profile.is_none() {
return Err(QueryError::ProfileNotFound(profile_name.clone()));
}
// 2. Limit range (already validated in builder, but defense in depth)
if self.limit == 0 || self.limit > 500 {
return Err(QueryError::InvalidLimit {
requested: self.limit,
min: 1,
max: 500,
});
}
// 3. M2: unsupported features
if self.for_user.is_some() {
return Err(QueryError::InvalidFilter {
field: "for_user".to_string(),
reason: "FOR USER clause requires M3".to_string(),
});
}
if self.similar_to.is_some() {
return Err(QueryError::InvalidFilter {
field: "similar_to".to_string(),
reason: "SIMILAR TO clause requires M3".to_string(),
});
}
// 4. Candidate strategy support check
let resolved_profile = profile.unwrap();
match resolved_profile.candidate_strategy() {
CandidateStrategy::Ann { .. }
| CandidateStrategy::Scan { .. }
| CandidateStrategy::SignalRanked { .. } => {}
other => {
return Err(QueryError::UnsupportedStrategy(
format!("{other:?} requires M3+ infrastructure"),
));
}
}
Ok(())
}
}
```
## Test Strategy
### Unit Tests
```rust
// === RetrieveBuilder tests ===
#[test]
fn builder_default_limit_50() {
let query = Retrieve::builder()
.profile("trending")
.build()
.unwrap();
assert_eq!(query.limit, 50);
assert_eq!(query.entity_kind, EntityKind::Item);
assert!(query.filters.is_empty());
assert!(query.diversity.is_none());
assert!(query.exclude.is_empty());
assert!(query.cursor.is_none());
}
#[test]
fn builder_with_all_fields() {
let query = Retrieve::builder()
.entity(EntityKind::Item)
.profile("hot")
.filter(FilterExpr::eq("category", "jazz"))
.filter(FilterExpr::eq("format", "video"))
.diversity(DiversityConstraints::new().max_per_creator(2))
.limit(25)
.exclude(EntityId::new(999))
.build()
.unwrap();
assert_eq!(query.entity_kind, EntityKind::Item);
assert_eq!(query.profile.name, "hot");
assert_eq!(query.filters.len(), 2);
assert!(query.diversity.is_some());
assert_eq!(query.limit, 25);
assert_eq!(query.exclude.len(), 1);
}
#[test]
fn builder_rejects_zero_limit() {
let result = Retrieve::builder()
.profile("trending")
.limit(0)
.build();
assert!(matches!(result, Err(QueryError::InvalidLimit { .. })));
}
#[test]
fn builder_rejects_limit_over_500() {
let result = Retrieve::builder()
.profile("trending")
.limit(501)
.build();
assert!(matches!(result, Err(QueryError::InvalidLimit { .. })));
}
#[test]
fn builder_rejects_missing_profile() {
let result = Retrieve::builder()
.entity(EntityKind::Item)
.limit(25)
.build();
assert!(matches!(result, Err(QueryError::ProfileNotFound(_))));
}
#[test]
fn builder_limit_boundary_values() {
// Min valid
let r1 = Retrieve::builder().profile("a").limit(1).build();
assert!(r1.is_ok());
assert_eq!(r1.unwrap().limit, 1);
// Max valid
let r2 = Retrieve::builder().profile("a").limit(500).build();
assert!(r2.is_ok());
assert_eq!(r2.unwrap().limit, 500);
}
#[test]
fn builder_profile_versioned() {
let query = Retrieve::builder()
.profile_versioned("trending", 3)
.build()
.unwrap();
assert_eq!(query.profile.name, "trending");
assert_eq!(query.profile.version, Some(3));
}
#[test]
fn builder_multiple_excludes() {
let query = Retrieve::builder()
.profile("new")
.exclude(EntityId::new(1))
.exclude(EntityId::new(2))
.exclude_ids(vec![EntityId::new(3), EntityId::new(4)])
.build()
.unwrap();
assert_eq!(query.exclude.len(), 4);
}
#[test]
fn combined_filter_none_when_empty() {
let query = Retrieve::builder().profile("new").build().unwrap();
assert!(query.combined_filter().is_none());
}
#[test]
fn combined_filter_single() {
let query = Retrieve::builder()
.profile("new")
.filter(FilterExpr::eq("category", "jazz"))
.build()
.unwrap();
let combined = query.combined_filter().unwrap();
assert!(matches!(combined, FilterExpr::Eq { .. }));
}
#[test]
fn combined_filter_multiple_becomes_and() {
let query = Retrieve::builder()
.profile("new")
.filter(FilterExpr::eq("category", "jazz"))
.filter(FilterExpr::eq("format", "video"))
.build()
.unwrap();
let combined = query.combined_filter().unwrap();
assert!(matches!(combined, FilterExpr::And(_)));
}
// === Cursor tests ===
#[test]
fn cursor_encode_decode_roundtrip() {
let cursor = Cursor::from_offset(42);
let encoded = cursor.encode();
let decoded = Cursor::decode(&encoded).unwrap();
assert_eq!(cursor, decoded);
}
#[test]
fn cursor_encode_decode_zero() {
let cursor = Cursor::from_offset(0);
let encoded = cursor.encode();
let decoded = Cursor::decode(&encoded).unwrap();
assert_eq!(cursor, decoded);
}
#[test]
fn cursor_encode_decode_large_offset() {
let cursor = Cursor::from_offset(100_000);
let encoded = cursor.encode();
let decoded = Cursor::decode(&encoded).unwrap();
assert_eq!(cursor, decoded);
}
#[test]
fn cursor_decode_invalid_base64() {
let result = Cursor::decode("!!!not-base64!!!");
assert!(matches!(result, Err(QueryError::InvalidCursor(_))));
}
#[test]
fn cursor_decode_wrong_length() {
use base64::Engine as _;
let encoded = base64::engine::general_purpose::URL_SAFE_NO_PAD.encode(&[1u8, 2, 3]);
let result = Cursor::decode(&encoded);
assert!(matches!(result, Err(QueryError::InvalidCursor(_))));
}
// === QueryError display tests ===
#[test]
fn query_error_display_messages() {
let e1 = QueryError::ProfileNotFound("trending".to_string());
assert!(e1.to_string().contains("trending"));
let e2 = QueryError::InvalidLimit {
requested: 0,
min: 1,
max: 500,
};
assert!(e2.to_string().contains("0"));
assert!(e2.to_string().contains("500"));
let e3 = QueryError::InvalidFilter {
field: "category".to_string(),
reason: "unknown field".to_string(),
};
assert!(e3.to_string().contains("category"));
let e4 = QueryError::InvalidCursor("bad cursor".to_string());
assert!(e4.to_string().contains("bad cursor"));
}
// === RetrieveResult tests ===
#[test]
fn results_len_and_is_empty() {
let empty = Results {
items: vec![],
next_cursor: None,
total_scored: 0,
constraints_satisfied: true,
warnings: vec![],
};
assert_eq!(empty.len(), 0);
assert!(empty.is_empty());
let one = Results {
items: vec![RetrieveResult {
entity_id: EntityId::new(1),
score: 0.5,
rank: 1,
signal_snapshot: vec![],
}],
next_cursor: None,
total_scored: 1,
constraints_satisfied: true,
warnings: vec![],
};
assert_eq!(one.len(), 1);
assert!(!one.is_empty());
}
// === Signal struct tests ===
#[test]
fn signal_new() {
let sig = Signal::new(
"view",
EntityId::new(42),
1.0,
Timestamp::from_nanos(1_000_000),
);
assert_eq!(sig.signal_type, "view");
assert_eq!(sig.entity_id, EntityId::new(42));
assert!((sig.weight - 1.0).abs() < f64::EPSILON);
}
```
### Property Tests
```rust
use proptest::prelude::*;
// P1: Cursor encode/decode is lossless for all valid offsets.
proptest! {
#[test]
fn cursor_roundtrip(offset in 0usize..1_000_000) {
let cursor = Cursor::from_offset(offset);
let encoded = cursor.encode();
let decoded = Cursor::decode(&encoded).unwrap();
prop_assert_eq!(cursor, decoded);
}
}
// P2: Builder always produces valid Retrieve when required fields are set.
proptest! {
#[test]
fn builder_valid_with_limit(limit in 1usize..=500) {
let result = Retrieve::builder()
.profile("test_profile")
.limit(limit)
.build();
prop_assert!(result.is_ok());
prop_assert_eq!(result.unwrap().limit, limit);
}
}
// P3: Builder rejects invalid limits.
proptest! {
#[test]
fn builder_rejects_invalid_limit(limit in 501usize..10_000) {
let result = Retrieve::builder()
.profile("test_profile")
.limit(limit)
.build();
prop_assert!(matches!(result, Err(QueryError::InvalidLimit { .. })));
}
}
```
## Acceptance Criteria
- [ ] `Retrieve` struct with all fields: `entity_kind`, `profile`, `filters`, `diversity`, `limit`, `exclude`, `cursor`, `for_user`, `similar_to`, `context`
- [ ] `RetrieveBuilder` with methods: `entity()`, `profile()`, `profile_versioned()`, `filter()`, `diversity()`, `limit()`, `exclude()`, `exclude_ids()`, `cursor()`, `build()`
- [ ] `RetrieveBuilder::build()` validates: limit in [1, 500], profile present, `for_user` and `similar_to` rejected in M2
- [ ] `Retrieve::combined_filter()` returns `None` for empty filters, single filter as-is, AND for multiple
- [ ] `ProfileRef` with `new()` (latest version) and `versioned()` (pinned version)
- [ ] `Results` struct with `items`, `next_cursor`, `total_scored`, `constraints_satisfied`, `warnings`, `len()`, `is_empty()`
- [ ] `RetrieveResult` struct with `entity_id`, `score`, `rank`, `signal_snapshot`
- [ ] `Cursor` with `from_offset()`, `offset()`, `encode()`, `decode()` -- base64 roundtrip is lossless
- [ ] `Cursor::decode()` returns `QueryError::InvalidCursor` for invalid input
- [ ] `Signal` struct with `new()` constructor and all fields
- [ ] `QueryError` enum with `ProfileNotFound`, `InvalidFilter`, `InvalidLimit`, `IndexNotAvailable`, `StorageError`, `InvalidCursor`, `UnsupportedStrategy`
- [ ] `QueryError` implements `Display` and `Error`
- [ ] `Retrieve::validate()` checks profile existence against `ProfileRegistry`, rejects unsupported candidate strategies
- [ ] Property test: cursor roundtrip for all valid offsets
- [ ] Property test: builder accepts valid limits [1, 500], rejects [501, ...]
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit tests and property tests pass
## Research References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 3.1 (`Retrieve` struct fields), Section 3.4 (`QueryError` enum variants and validation rules), Section 8.2 (`Cursor` structure and encoding)
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 2.1 (RETRIEVE operation overview), Section 3.1 (Retrieve input struct), Section 3.4 (QueryError enum), Section 8 (Pagination: cursor design, encoding, semantics)
## Implementation Notes
- Add `base64 = "0.22"` to `[dependencies]` in `tidal/Cargo.toml`. The `base64` crate is small (no transitive deps beyond `std`) and provides the `URL_SAFE_NO_PAD` engine for cursor encoding.
- The `query/mod.rs` file should be created with `pub mod retrieve;` and re-exports of all public types. The `pub mod executor;` line is added in Task 02 when the executor module is created.
- `lib.rs` should add `pub mod query;` -- this is the first time the query module exists.
- `FilterExpr` is imported from `crate::storage::indexes::filter` (m2p2). If the exact import path differs from the m2p2 implementation, adapt accordingly.
- `DiversityConstraints` is imported from `crate::ranking::diversity` (m2p4). The `DiversityConstraints::new()` and `.max_per_creator()` API must match the m2p4 implementation.
- The `for_user: Option<u64>` field uses `u64` rather than `UserId` because `UserId` (a newtype over `u64`) is not defined until M3 when user entities are introduced. In M3, this field will be changed to `Option<UserId>`.
- Do NOT add `serde` derives to the query types for M2. Serialization of query types is an M5+ concern when the text parser and network protocol are built.
## Migration from M1 QueryError Stub
**This task must remove the M1 stub `QueryError` before adding the new one.** A stub `QueryError` struct exists in `tidal/src/schema/error.rs` (simple `{ message: String }` struct) and is re-exported through `schema/mod.rs` and wrapped by `LumenError::Query`. The M2 `QueryError` is a rich enum that replaces it.
Migration steps (in order):
1. Remove the stub `QueryError` struct from `tidal/src/schema/error.rs`
2. Add `pub use crate::query::retrieve::QueryError;` in `tidal/src/schema/mod.rs` (or update the `From<QueryError>` impl in `LumenError` to use the new path)
3. Update `LumenError::Query` variant to hold `crate::query::retrieve::QueryError`
4. Update the `From<QueryError> for LumenError` impl to use the new enum
5. Fix any existing tests in `schema/error.rs` that construct the old `QueryError { message: "..." }` struct
Do NOT create `query/retrieve.rs` before completing steps 1-4 -- the name collision will cause confusing compilation errors.
## Intentional Spec Deviations
The following fields differ from Spec 08 Section 3.1's `Retrieve` struct definition. These are intentional improvements:
| This task | Spec 08 | Reason |
|-----------|---------|--------|
| `entity_kind: EntityKind` | `entity: EntityKind` | More explicit — avoids confusion with `entity_id` |
| `exclude: Vec<EntityId>` | `exclude_ids: Vec<EntityId>` | Shorter; builder method `.exclude()` reads naturally |
| `profile: ProfileRef` | `profile: String` | Richer type; supports version pinning for A/B testing |
| `diversity: Option<DiversityConstraints>` | `diversity: Option<DiversitySpec>` | Matches m2p4's concrete type name |
These deviations are safe: the spec defines behavior, not names. Future language parsing (M5) maps text tokens to these struct fields.

View File

@ -0,0 +1,943 @@
# Task 02: RETRIEVE Executor Pipeline
## Context
**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p5 -- Query Parser and RETRIEVE Executor
**Depends On:** Task 01 (Retrieve, Results, RetrieveResult, Cursor, QueryError, ProfileRef)
**Blocks:** Task 03 (M2 UAT Integration Test)
**Complexity:** L
## Objective
Deliver the `RetrieveExecutor` -- the orchestrator that wires m2p1 (vector index), m2p2 (filter engine), m2p3 (profile executor), and m2p4 (diversity selector) into a single 5-stage pipeline that executes a `Retrieve` query and returns `Results`. This is the "one query" entry point where a developer calls `db.retrieve(query)` and gets ranked, filtered, diverse results.
The executor also delivers `TidalDb::retrieve()` -- the public API method that constructs the executor from the database's internal state and delegates to it. After this task, the full RETRIEVE query path works end-to-end.
The key performance gate: **end-to-end RETRIEVE latency < 50ms at 10K items** (Criterion benchmarked). This budget is distributed across the pipeline stages: candidate generation (~10ms ANN or ~5ms scan), filter evaluation (~1ms), scoring (~100us), diversity (~1ms), result assembly (~100us).
## Requirements
- `RetrieveExecutor` struct: borrows all subsystem references needed for query execution
- 5-stage pipeline: candidate generation -> filter evaluation -> signal scoring -> diversity enforcement -> result assembly
- Candidate generation routes to one of three strategies based on the profile's `CandidateStrategy`: `Ann` (ANN search), `Scan` (full entity scan), `SignalRanked` (top-K by signal value)
- Filter evaluation uses `FilterEvaluator` from m2p2 to apply metadata filters to the candidate set
- Signal scoring uses `ProfileExecutor` from m2p3 to score and sort candidates
- Diversity enforcement uses `DiversitySelector` from m2p4 to enforce `max_per_creator` and `format_mix`
- Result assembly constructs `Results` from the diversity output, including signal snapshots and pagination cursor
- `TidalDb::retrieve()` public method wires the executor to the public API
- Criterion benchmarks meeting the < 50ms target at 10K items
- No `unsafe` code
## Technical Design
### Module Structure
```
tidal/src/
query/
executor.rs -- RetrieveExecutor, pipeline stages (this task)
mod.rs -- add `pub mod executor;` and re-export RetrieveExecutor
lib.rs -- add TidalDb::retrieve() method
tidal/benches/
query.rs -- Criterion benchmarks (this task)
tidal/Cargo.toml -- add [[bench]] name = "query" harness = false
```
### Public API
```rust
// === query/executor.rs ===
use crate::query::retrieve::*;
use crate::ranking::diversity::{DiversityConstraints, DiversitySelector};
use crate::ranking::executor::{ProfileExecutor, ScoredCandidate};
use crate::ranking::profile::RankingProfile;
use crate::ranking::registry::ProfileRegistry;
use crate::schema::{EntityId, EntityKind, Schema, Timestamp};
use crate::signals::SignalLedger;
use crate::storage::indexes::filter::{FilterEvaluator, FilterResult};
use crate::storage::vector::registry::EmbeddingSlotRegistry;
use crate::storage::StorageEngine;
/// Executes RETRIEVE queries by orchestrating all M2 subsystems.
///
/// The executor is a stateless orchestrator -- it holds borrowed references
/// to the subsystems it coordinates and has no state of its own. If the
/// executor is dropped, no data is lost.
///
/// # Pipeline Stages
///
/// ```text
/// Stage 1: Candidate Generation
/// ANN search | Full scan | Signal-ranked
/// -> candidate set: Vec<EntityId> (200-500 candidates)
///
/// Stage 2: Filter Evaluation
/// FilterEvaluator::evaluate() -> bitmap intersection
/// -> surviving candidates (100-500)
///
/// Stage 3: Signal Scoring
/// ProfileExecutor::score() -> Vec<ScoredCandidate> sorted by score
/// -> scored, sorted, gate-filtered candidates
///
/// Stage 4: Diversity Enforcement
/// DiversitySelector::select() -> DiversityResult
/// -> reordered candidates satisfying constraints
///
/// Stage 5: Result Assembly
/// Take first `limit` items, build RetrieveResult with signal snapshots
/// -> Results with next_cursor
/// ```
///
/// # Performance
///
/// Target: end-to-end < 50ms at 10K items.
/// Stage budgets: candidate gen ~10ms, filter ~1ms, scoring ~100us,
/// diversity ~1ms, assembly ~100us.
pub struct RetrieveExecutor<'a> {
/// Signal ledger for signal reads during scoring.
ledger: &'a SignalLedger,
/// Entity store for metadata reads (creator_id, format).
entity_store: &'a dyn StorageEngine,
/// Vector index registry for ANN candidate generation.
vector_index: &'a EmbeddingSlotRegistry,
/// Filter evaluator for metadata filter application.
filter_evaluator: &'a FilterEvaluator<'a>,
/// Profile registry for resolving profile names to definitions.
profile_registry: &'a ProfileRegistry,
/// Schema for resolving signal type IDs and entity metadata.
schema: &'a Schema,
}
impl<'a> RetrieveExecutor<'a> {
/// Create a new executor with references to all required subsystems.
pub fn new(
ledger: &'a SignalLedger,
entity_store: &'a dyn StorageEngine,
vector_index: &'a EmbeddingSlotRegistry,
filter_evaluator: &'a FilterEvaluator<'a>,
profile_registry: &'a ProfileRegistry,
schema: &'a Schema,
) -> Self {
Self {
ledger,
entity_store,
vector_index,
filter_evaluator,
profile_registry,
schema,
}
}
/// Execute a RETRIEVE query.
///
/// This is the main entry point. It validates the query, constructs
/// the pipeline, executes each stage, and returns the result set.
///
/// # Errors
///
/// Returns `QueryError::ProfileNotFound` if the profile does not exist.
/// Returns `QueryError::UnsupportedStrategy` if the profile's candidate
/// strategy is not supported in M2 (e.g., Relationship, CohortTrending).
/// Returns `QueryError::IndexNotAvailable` if ANN retrieval is requested
/// but no vector index exists for the entity kind.
/// Returns `QueryError::StorageError` on underlying storage failures.
pub fn retrieve(&self, query: &Retrieve) -> Result<Results, QueryError> {
// Validate the query
query.validate(self.profile_registry)?;
// Resolve the profile
let profile = self.resolve_profile(&query.profile)?;
let now = Timestamp::now();
// Stage 1: Candidate Generation
let mut candidates = self.generate_candidates(query, profile, now)?;
// Apply exclude list
if !query.exclude.is_empty() {
let exclude_set: std::collections::HashSet<EntityId> =
query.exclude.iter().copied().collect();
candidates.retain(|id| !exclude_set.contains(id));
}
// Stage 2: Filter Evaluation
let candidates = self.apply_filters(query, candidates)?;
// Stage 3: Signal Scoring
let scored = self.score_candidates(&candidates, profile, now);
let total_scored = scored.len();
// Stage 4: Diversity Enforcement
let (diverse_candidates, constraints_satisfied) =
self.apply_diversity(query, scored)?;
// Stage 5: Result Assembly
self.assemble_results(query, diverse_candidates, total_scored, constraints_satisfied)
}
/// Resolve a ProfileRef to a RankingProfile.
fn resolve_profile(
&self,
profile_ref: &ProfileRef,
) -> Result<&'a RankingProfile, QueryError> {
match profile_ref.version {
Some(v) => self
.profile_registry
.get_versioned(&profile_ref.name, v)
.ok_or_else(|| {
QueryError::ProfileNotFound(format!(
"{}@v{}",
profile_ref.name, v
))
}),
None => self
.profile_registry
.get(&profile_ref.name)
.ok_or_else(|| {
QueryError::ProfileNotFound(profile_ref.name.clone())
}),
}
}
}
```
### Stage 1: Candidate Generation
```rust
impl<'a> RetrieveExecutor<'a> {
/// Generate the initial candidate set based on the profile's strategy.
///
/// For M2, three strategies are implemented:
/// - `Ann`: ANN search over the default embedding slot for the entity kind
/// - `Scan`: Full entity scan (all entity IDs in the store)
/// - `SignalRanked`: Top-K entities by signal value from the ledger
///
/// The overprovisioning factor (2-4x the requested limit) ensures enough
/// candidates survive filtering, scoring, and diversity to fill the page.
fn generate_candidates(
&self,
query: &Retrieve,
profile: &RankingProfile,
now: Timestamp,
) -> Result<Vec<EntityId>, QueryError> {
let overprovision = std::cmp::max(query.limit * 4, 200);
match profile.candidate_strategy() {
CandidateStrategy::Ann { .. } => {
self.generate_ann_candidates(query, overprovision)
}
CandidateStrategy::Scan { .. } => {
self.generate_scan_candidates(query, overprovision)
}
CandidateStrategy::SignalRanked { .. } => {
self.generate_signal_ranked_candidates(query, profile, overprovision, now)
}
other => Err(QueryError::UnsupportedStrategy(format!(
"{other:?} is not supported in M2"
))),
}
}
/// ANN candidate generation: query the vector index.
///
/// Uses the default embedding slot for the entity kind.
/// For M2, no user preference vector is available, so the query vector
/// is derived from the embedding space (e.g., a representative vector
/// or the adaptive planner's default).
///
/// If filters are present, the adaptive query planner selects the
/// appropriate ANN strategy (brute-force, widened HNSW, or in-graph).
fn generate_ann_candidates(
&self,
query: &Retrieve,
top_k: usize,
) -> Result<Vec<EntityId>, QueryError> {
let slot = self
.vector_index
.default_slot(query.entity_kind)
.ok_or_else(|| {
QueryError::IndexNotAvailable(format!(
"no embedding slot for {:?}",
query.entity_kind
))
})?;
// For M2, no user preference vector is available (that is M3+).
// Use a zero vector as a placeholder. For L2 metric this produces
// distance=1.0 from all normalized vectors (arbitrary order).
// For cosine metric, verify USearch handles zero-norm gracefully;
// if not, the ANN strategy must fall back to Scan for M2.
let dimensions = slot.dimensions();
let query_vector = vec![0.0f32; dimensions];
// M2: apply filters post-ANN in Stage 2 (sequential approach).
// Filter push-down into USearch predicate callbacks is an M3+ optimization.
// See OVERVIEW.md Open Question 3.
let results = slot
.search(&query_vector, top_k, None)
.map_err(|e| QueryError::IndexNotAvailable(format!("ANN search failed: {e}")))?;
Ok(results.into_iter().map(|r| r.id).collect())
}
/// Scan candidate generation: iterate all entities of the kind.
///
/// Used for profiles like `new` (sorted by created_at) and `alphabetical`.
/// Loads all entity IDs from the store and returns up to `top_k`.
fn generate_scan_candidates(
&self,
query: &Retrieve,
top_k: usize,
) -> Result<Vec<EntityId>, QueryError> {
let candidates = self
.entity_store
.scan_entity_ids(query.entity_kind, top_k)
.map_err(|e| QueryError::StorageError(format!("{e}")))?;
Ok(candidates)
}
/// Signal-ranked candidate generation: top-K by signal value.
///
/// Used for profiles like `most_viewed`, `most_liked`. Reads signal
/// state from the ledger for all entities and returns the top-K by
/// the profile's primary signal.
fn generate_signal_ranked_candidates(
&self,
query: &Retrieve,
profile: &RankingProfile,
top_k: usize,
now: Timestamp,
) -> Result<Vec<EntityId>, QueryError> {
// Get the primary signal name from the profile's first boost
let primary_signal = profile
.primary_signal()
.ok_or_else(|| {
QueryError::ProfileNotFound(
"signal-ranked profile has no primary signal".to_string(),
)
})?;
let candidates = self
.ledger
.top_entities_by_signal(primary_signal, top_k, now)
.map_err(|e| QueryError::StorageError(format!("{e}")))?;
Ok(candidates)
}
}
```
### Stage 2: Filter Evaluation
```rust
impl<'a> RetrieveExecutor<'a> {
/// Apply metadata filters to the candidate set.
///
/// If no filters are specified, the candidate set passes through unchanged.
/// If filters are specified, evaluate them as a bitmap and intersect with
/// the candidate set.
///
/// For M2, ANN-then-filter is the approach: candidates are generated
/// first, then filters are applied as a post-processing step. Filter
/// push-down into ANN (via predicate callbacks) is an M3+ optimization.
///
/// Note: for Scan strategy with pre-filter bitmap already applied in
/// Stage 1, this stage may be a no-op (candidates already filtered).
fn apply_filters(
&self,
query: &Retrieve,
candidates: Vec<EntityId>,
) -> Result<Vec<EntityId>, QueryError> {
let combined = match query.combined_filter() {
Some(expr) => expr,
None => return Ok(candidates),
};
let filter_result = self
.filter_evaluator
.evaluate(&combined)
.map_err(|e| QueryError::InvalidFilter {
field: "filter".to_string(),
reason: format!("{e}"),
})?;
match filter_result {
FilterResult::Bitmap(bitmap) => {
// Intersect candidates with the filter bitmap.
//
// M2 limitation: RoaringBitmap uses u32 keys. Entity IDs are
// u64 but M2 is bounded to 10K items (well within u32::MAX).
// M7+ will upgrade to RoaringTreemap for full u64 support.
// See m2p2 task-01 for the INDEX_ROOT_ID/u32 design rationale.
Ok(candidates
.into_iter()
.filter(|id| {
debug_assert!(
id.as_u64() <= u64::from(u32::MAX),
"entity ID {id} exceeds u32 range -- upgrade to RoaringTreemap"
);
bitmap.contains(id.as_u64() as u32)
})
.collect())
}
FilterResult::Predicate(predicate) => {
// Evaluate predicate per candidate
Ok(candidates
.into_iter()
.filter(|id| predicate(id.as_u64()))
.collect())
}
}
}
}
```
### Stage 3: Signal Scoring
```rust
impl<'a> RetrieveExecutor<'a> {
/// Score surviving candidates using the profile's scoring rules.
///
/// Delegates to `ProfileExecutor::score()` from m2p3. The result is
/// a sorted, gate-filtered list of `ScoredCandidate` with scores
/// normalized to [0.0, 1.0].
fn score_candidates(
&self,
candidates: &[EntityId],
profile: &RankingProfile,
now: Timestamp,
) -> Vec<ScoredCandidate> {
if candidates.is_empty() {
return Vec::new();
}
let profile_executor = ProfileExecutor::new(self.ledger);
let mut scored = profile_executor.score(candidates, profile, now, None);
// Enrich scored candidates with creator_id and format for diversity.
//
// Degradation semantics: if metadata enrichment fails for a candidate
// (storage error or missing record), the candidate proceeds without
// creator_id/format. For diversity, it will be treated as a unique
// creator (None creator_id never matches another None), and no format
// constraint applies. This is the "degrade, do not fail" policy from
// Spec 08 Section 13. The caller may inspect Results.warnings for
// any enrichment failures.
for candidate in &mut scored {
if let Ok(Some(metadata)) = self.entity_store.get_metadata(
candidate.entity_id,
EntityKind::Item,
) {
candidate.creator_id = metadata.creator_id;
candidate.format = metadata.format.clone();
}
// On error: candidate proceeds with creator_id=None, format=None.
// This is intentional degradation, not a bug.
}
scored
}
}
```
### Stage 4: Diversity Enforcement
```rust
impl<'a> RetrieveExecutor<'a> {
/// Apply diversity constraints to the scored candidate list.
///
/// If no diversity constraints are specified (neither on the query
/// nor on the profile), candidates pass through unchanged.
fn apply_diversity(
&self,
query: &Retrieve,
candidates: Vec<ScoredCandidate>,
) -> Result<(Vec<ScoredCandidate>, bool), QueryError> {
// Determine active constraints: query overrides profile defaults
let constraints = match &query.diversity {
Some(c) => c.clone(),
None => {
// No query-level diversity: return candidates as-is
return Ok((candidates, true));
}
};
let target_count = query.limit;
let selector = DiversitySelector::new();
let result = selector.select(candidates, &constraints, target_count);
let satisfied = result.violations.is_empty();
Ok((result.selected, satisfied))
}
}
```
### Stage 5: Result Assembly
```rust
impl<'a> RetrieveExecutor<'a> {
/// Assemble the final Results from the diversity output.
///
/// Applies pagination (cursor offset + limit), constructs RetrieveResult
/// for each item, and computes the next_cursor.
fn assemble_results(
&self,
query: &Retrieve,
candidates: Vec<ScoredCandidate>,
total_scored: usize,
constraints_satisfied: bool,
) -> Result<Results, QueryError> {
// Apply pagination offset from cursor
let offset = query
.cursor
.as_ref()
.map(|c| c.offset())
.unwrap_or(0);
let limit = query.limit;
let page_start = std::cmp::min(offset, candidates.len());
let page_end = std::cmp::min(page_start + limit, candidates.len());
let items: Vec<RetrieveResult> = candidates[page_start..page_end]
.iter()
.enumerate()
.map(|(i, candidate)| RetrieveResult {
entity_id: candidate.entity_id,
score: candidate.score,
rank: offset + i + 1, // 1-based rank
signal_snapshot: candidate.signal_snapshot.clone(),
})
.collect();
// Compute next cursor
let next_cursor = if page_end < candidates.len() {
Some(Cursor::from_offset(page_end))
} else {
None
};
Ok(Results {
items,
next_cursor,
total_scored,
constraints_satisfied,
warnings: vec![], // TODO: thread warnings from Stage 3 enrichment failures
})
}
}
```
### TidalDb::retrieve() Public API
```rust
// === lib.rs (modification to TidalDb impl) ===
use crate::query::retrieve::{Retrieve, Results, QueryError};
use crate::query::executor::RetrieveExecutor;
impl TidalDb {
/// Execute a RETRIEVE query.
///
/// This is the primary ranked retrieval entry point. Given a declarative
/// query (profile, filters, diversity, limit), the database generates
/// candidates, applies filters, scores with signals, enforces diversity,
/// and returns a ranked result set.
///
/// # Example
///
/// ```ignore
/// let query = Retrieve::builder()
/// .profile("trending")
/// .diversity(DiversityConstraints::new().max_per_creator(1))
/// .limit(25)
/// .build()?;
/// let results = db.retrieve(&query)?;
/// for item in &results.items {
/// println!("#{}: entity={} score={:.3}",
/// item.rank, item.entity_id, item.score);
/// }
/// ```
pub fn retrieve(&self, query: &Retrieve) -> Result<Results, QueryError> {
// Construct FilterEvaluator per-query to avoid self-referential borrows.
//
// FilterEvaluator<'_> borrows from the bitmap and range indexes, which
// are fields of TidalDb. Storing a FilterEvaluator<'self> in TidalDb
// would make TidalDb self-referential (a struct containing references
// to its own fields), which Rust prohibits. The per-query construction
// is cheap: FilterEvaluator holds only references (no allocation).
let filter_evaluator = FilterEvaluator::new(
&self.bitmap_indexes,
&self.range_indexes,
);
let executor = RetrieveExecutor::new(
&self.signal_ledger,
self.entity_store(),
&self.embedding_registry,
&filter_evaluator,
&self.profile_registry,
&self.schema,
);
executor.retrieve(query)
}
}
```
### Criterion Benchmarks
```rust
// === tidal/benches/query.rs ===
use criterion::{criterion_group, criterion_main, Criterion};
use tempfile::TempDir;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::*;
use tidaldb::{Config, TidalDB};
/// Setup: create a TidalDB with 10K items, embeddings, and signal state.
///
/// Items have:
/// - Metadata: category (10 values), format (4 values), creator_id (200 creators)
/// - Embeddings: 64-dim vectors (small for benchmark speed)
/// - Signals: 5 signal events per item (50K total)
fn setup_10k_db() -> (TidalDB, TempDir) {
let dir = TempDir::new().unwrap();
let schema = build_m2_schema();
let db = TidalDB::open(Config {
data_dir: dir.path().to_owned(),
schema,
})
.unwrap();
// Write 10K items with metadata and embeddings
for i in 0..10_000u64 {
let metadata = item_metadata(i);
let embedding = generate_embedding(i, 64);
db.write_item(EntityId::new(i + 1), &metadata, Some(&embedding))
.unwrap();
}
// Write 50K signal events (5 per item average)
let now = Timestamp::now();
let seven_days_nanos = 7 * 24 * 3600 * 1_000_000_000u64;
for i in 0..50_000u64 {
let entity = EntityId::new((i % 10_000) + 1);
let signal_types = ["view", "like", "skip", "share", "completion"];
let signal = signal_types[(i as usize) % signal_types.len()];
let offset = (i * 7919 + 1) % seven_days_nanos;
let ts = Timestamp::from_nanos(now.as_nanos().saturating_sub(offset));
db.signal(signal, entity, 1.0, ts).unwrap();
}
(db, dir)
}
/// KEY BENCHMARK: end-to-end trending RETRIEVE at 10K items.
/// Target: < 50ms.
fn bench_retrieve_trending_10k(c: &mut Criterion) {
let (db, _dir) = setup_10k_db();
let query = Retrieve::builder()
.profile("trending")
.diversity(DiversityConstraints::new().max_per_creator(1))
.limit(25)
.build()
.unwrap();
c.bench_function("retrieve_trending_10k_items", |b| {
b.iter(|| {
let results = db.retrieve(&query).unwrap();
assert!(!results.is_empty());
})
});
}
/// Benchmark: new profile (full scan, no ANN) at 10K items.
/// Expected: < 10ms (scan + metadata sort, no vector search).
fn bench_retrieve_new_10k(c: &mut Criterion) {
let (db, _dir) = setup_10k_db();
let query = Retrieve::builder()
.profile("new")
.limit(20)
.build()
.unwrap();
c.bench_function("retrieve_new_10k_items", |b| {
b.iter(|| {
let results = db.retrieve(&query).unwrap();
assert!(!results.is_empty());
})
});
}
/// Benchmark: hot profile with category filter at 10K items.
fn bench_retrieve_hot_filtered_10k(c: &mut Criterion) {
let (db, _dir) = setup_10k_db();
let query = Retrieve::builder()
.profile("hot")
.filter(FilterExpr::eq("category", "jazz"))
.limit(20)
.build()
.unwrap();
c.bench_function("retrieve_hot_filtered_10k_items", |b| {
b.iter(|| {
let results = db.retrieve(&query).unwrap();
// May be empty if no jazz items exist in the random dataset
})
});
}
/// Benchmark: controversial profile at 10K items.
fn bench_retrieve_controversial_10k(c: &mut Criterion) {
let (db, _dir) = setup_10k_db();
let query = Retrieve::builder()
.profile("controversial")
.limit(10)
.build()
.unwrap();
c.bench_function("retrieve_controversial_10k_items", |b| {
b.iter(|| {
let results = db.retrieve(&query).unwrap();
assert!(!results.is_empty());
})
});
}
criterion_group!(
benches,
bench_retrieve_trending_10k,
bench_retrieve_new_10k,
bench_retrieve_hot_filtered_10k,
bench_retrieve_controversial_10k,
);
criterion_main!(benches);
```
### Error Handling
- **Profile not found:** `QueryError::ProfileNotFound` with the profile name. Occurs in validation before pipeline execution.
- **Unsupported strategy:** `QueryError::UnsupportedStrategy` for `Relationship`, `Hybrid`, `CohortTrending` in M2. Occurs in candidate generation.
- **No vector index:** `QueryError::IndexNotAvailable` when ANN strategy is requested but no embedding slot exists. Occurs in ANN candidate generation.
- **Filter evaluation failure:** `QueryError::InvalidFilter` when a filter references a non-existent field or index. Occurs in Stage 2.
- **Storage error:** `QueryError::StorageError` wraps underlying storage failures during entity reads.
- **Empty results:** NOT an error. The pipeline returns `Results` with an empty `items` vec, `total_scored: 0`, and `constraints_satisfied: true`. This is valid -- the filter may exclude everything.
## Test Strategy
### Unit Tests
```rust
// === Pipeline Stage Tests ===
// These test each stage independently with mock/test data.
#[test]
fn exclude_list_removes_candidates() {
// Setup: candidates [1, 2, 3, 4, 5], exclude [2, 4]
// After exclude: [1, 3, 5]
let candidates = vec![
EntityId::new(1),
EntityId::new(2),
EntityId::new(3),
EntityId::new(4),
EntityId::new(5),
];
let exclude = vec![EntityId::new(2), EntityId::new(4)];
let exclude_set: std::collections::HashSet<EntityId> =
exclude.iter().copied().collect();
let filtered: Vec<EntityId> = candidates
.into_iter()
.filter(|id| !exclude_set.contains(id))
.collect();
assert_eq!(filtered.len(), 3);
assert_eq!(filtered[0], EntityId::new(1));
assert_eq!(filtered[1], EntityId::new(3));
assert_eq!(filtered[2], EntityId::new(5));
}
#[test]
fn result_assembly_pagination_first_page() {
// 100 scored candidates, limit 25, no cursor
// -> items[0..25], rank 1-25, next_cursor at offset 25
let candidates: Vec<ScoredCandidate> = (0..100u64)
.map(|i| {
let mut c = ScoredCandidate::new(EntityId::new(i + 1), 1.0 - (i as f64 * 0.01));
c
})
.collect();
let query = Retrieve::builder().profile("test").limit(25).build().unwrap();
// (simplified test -- full test requires executor)
// Verify page slicing
let offset = 0;
let limit = 25;
let page_end = std::cmp::min(offset + limit, candidates.len());
assert_eq!(page_end, 25);
let items: Vec<RetrieveResult> = candidates[offset..page_end]
.iter()
.enumerate()
.map(|(i, c)| RetrieveResult {
entity_id: c.entity_id,
score: c.score,
rank: offset + i + 1,
signal_snapshot: vec![],
})
.collect();
assert_eq!(items.len(), 25);
assert_eq!(items[0].rank, 1);
assert_eq!(items[24].rank, 25);
assert!(page_end < candidates.len()); // next_cursor should exist
}
#[test]
fn result_assembly_pagination_last_page() {
// 30 scored candidates, limit 25, cursor at offset 25
// -> items[25..30], rank 26-30, no next_cursor
let candidates_len = 30;
let offset = 25;
let limit = 25;
let page_end = std::cmp::min(offset + limit, candidates_len);
assert_eq!(page_end, 30);
assert_eq!(page_end - offset, 5); // 5 items on last page
assert!(page_end >= candidates_len); // no next_cursor
}
#[test]
fn result_assembly_empty_candidates() {
// 0 scored candidates -> empty results, no cursor
let candidates: Vec<ScoredCandidate> = vec![];
assert!(candidates.is_empty());
// Results should have items: [], total_scored: 0, constraints_satisfied: true
}
#[test]
fn result_assembly_ranks_are_one_based() {
let items: Vec<RetrieveResult> = (0..5)
.map(|i| RetrieveResult {
entity_id: EntityId::new(i + 1),
score: 0.5,
rank: i as usize + 1,
signal_snapshot: vec![],
})
.collect();
assert_eq!(items[0].rank, 1);
assert_eq!(items[4].rank, 5);
}
#[test]
fn scores_descending_in_results() {
// Verify that results maintain score ordering from scoring stage
let items: Vec<RetrieveResult> = vec![
RetrieveResult { entity_id: EntityId::new(1), score: 0.9, rank: 1, signal_snapshot: vec![] },
RetrieveResult { entity_id: EntityId::new(2), score: 0.7, rank: 2, signal_snapshot: vec![] },
RetrieveResult { entity_id: EntityId::new(3), score: 0.5, rank: 3, signal_snapshot: vec![] },
];
for pair in items.windows(2) {
assert!(pair[0].score >= pair[1].score);
}
}
// === Profile resolution tests ===
#[test]
fn resolve_profile_latest_version() {
let mut registry = ProfileRegistry::new();
// register_builtins adds trending, hot, new, etc.
register_builtins(&mut registry, &[]);
let profile = registry.get("trending");
assert!(profile.is_some());
}
#[test]
fn resolve_profile_unknown_name() {
let registry = ProfileRegistry::new();
let profile = registry.get("nonexistent");
assert!(profile.is_none());
}
// === Candidate strategy routing tests ===
#[test]
fn scan_strategy_returns_entity_ids() {
// Verify that scan candidate generation returns IDs from the store
// (full integration test in Task 03; this is a unit-level sanity check)
let ids: Vec<EntityId> = (1..=100u64).map(EntityId::new).collect();
let top_k = 50;
let result: Vec<EntityId> = ids.into_iter().take(top_k).collect();
assert_eq!(result.len(), 50);
}
```
### Integration Tests
Integration tests covering full pipeline execution are in Task 03 (`m2_uat.rs`). This task's test strategy focuses on unit-level testing of individual pipeline stages and the wiring between them.
## Acceptance Criteria
- [ ] `RetrieveExecutor::new()` takes borrowed references to ledger, entity_store, vector_index, filter_evaluator, profile_registry, schema
- [ ] `RetrieveExecutor::retrieve()` executes the 5-stage pipeline and returns `Results`
- [ ] Stage 1: candidate generation routes to `Ann`, `Scan`, or `SignalRanked` based on profile's `CandidateStrategy`
- [ ] Stage 1: `Ann` strategy queries the vector index via `EmbeddingSlotRegistry` and returns entity IDs
- [ ] Stage 1: `Scan` strategy loads all entity IDs from the entity store
- [ ] Stage 1: `SignalRanked` strategy reads top-K entities by signal value from the ledger
- [ ] Stage 1: Unsupported strategies (`Relationship`, `Hybrid`, `CohortTrending`) return `QueryError::UnsupportedStrategy`
- [ ] Stage 1: `exclude` list is applied after candidate generation (removed via HashSet lookup)
- [ ] Stage 2: filter evaluation uses `FilterEvaluator` and intersects with candidate set
- [ ] Stage 2: no filters = candidate set passes through unchanged
- [ ] Stage 2: empty filter result (zero matching items) returns empty Results, not an error
- [ ] Stage 3: delegates to `ProfileExecutor::score()` from m2p3
- [ ] Stage 3: enriches `ScoredCandidate` with `creator_id` and `format` from entity metadata
- [ ] Stage 4: applies `DiversitySelector` when diversity constraints are present
- [ ] Stage 4: no diversity constraints = candidates pass through unchanged, `constraints_satisfied: true`
- [ ] Stage 5: slices candidates to `[offset..offset+limit]` based on cursor
- [ ] Stage 5: builds `RetrieveResult` with 1-based rank and signal snapshot
- [ ] Stage 5: computes `next_cursor` when more results exist beyond the page
- [ ] Stage 5: `next_cursor` is `None` when the page contains all remaining results
- [ ] `TidalDb::retrieve()` public method wires the executor correctly
- [ ] Criterion benchmarks implemented and passing:
- `retrieve_trending_10k_items` -- target < 50ms
- `retrieve_new_10k_items` -- target < 10ms
- `retrieve_hot_filtered_10k_items` -- measured
- `retrieve_controversial_10k_items` -- measured
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit tests pass
## Research References
- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- ANN retrieval latency (< 10ms at 10K vectors), adaptive query planner strategy selection
- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- Signal read latencies establishing per-candidate scoring budget
## Spec References
- [docs/specs/08-query-engine.md](../../../specs/08-query-engine.md) -- Section 4 (Query planning: CandidateStrategy, plan construction), Section 5 (Execution pipeline: all 6 stages), Section 7 (Filter evaluation: bitmap intersection, short-circuit), Section 8 (Pagination: cursor decode, offset, limit), Section 11 (Performance targets: < 50ms end-to-end)
- [docs/specs/09-ranking-scoring.md](../../../specs/09-ranking-scoring.md) -- Section 3 (CandidateStrategy variants), Section 4 (Scoring pipeline), Section 9 (Diversity enforcement as Stage 8)
## Implementation Notes
- Add `[[bench]] name = "query" harness = false` to `tidal/Cargo.toml`.
- The `RetrieveExecutor` is intentionally stateless -- it borrows references to all subsystems. This means it is cheap to construct (no allocation) and the caller (TidalDb) can create a new executor for every query. No caching or connection pooling is needed.
- `scan_entity_ids()` is a new method needed on `StorageEngine` (or on `TidalDb` directly) that returns all entity IDs of a given kind. If this method does not exist yet, it should be added as part of this task. It reads the entity keyspace prefix and collects IDs. At 10K items this is ~1ms.
- `top_entities_by_signal()` is a new method needed on `SignalLedger` that returns the top-K entity IDs by signal value. For M2, this iterates over all entities in the hot tier and returns the top-K by decay score. At 10K entities this is ~2ms. A sorted index for signal values is an M6 optimization.
- The `get_metadata()` method on `StorageEngine` (for reading creator_id and format) needs to return a structured metadata object, not raw bytes. If M1's `read_item()` returns raw bytes, this task should add a `get_item_metadata()` helper that parses the metadata into a struct with `creator_id: Option<EntityId>` and `format: Option<String>`. The exact metadata format depends on how `write_item()` stores metadata in M1/M2.
- The benchmark setup function (`setup_10k_db`) creates a fresh database for each benchmark group run. This takes several seconds. Use `criterion::BenchmarkGroup` with large `sample_size` and `measurement_time` to amortize setup cost. Consider using `lazy_static` or `OnceCell` for the setup if benchmarks are too slow.
- The ANN candidate generation in M2 uses a zero vector as the query vector. This is a placeholder -- in M3, user preference vectors will be used. For L2 distance metrics, a zero query vector is equidistant from all normalized vectors (distance = 1.0), effectively providing arbitrary candidate ordering. For cosine similarity, a zero query vector produces undefined similarity (0/0); if USearch uses cosine metric and returns an error for zero-norm queries, fall back to `CandidateStrategy::Scan` instead. Verify USearch's zero-vector behavior during integration. The query vector quality improves in M3.
- **`ScoredCandidate.signal_snapshot` dependency**: the `ScoredCandidate` struct from m2p3 (task-03 of phase 3) MUST include a `signal_snapshot: Vec<(String, f64)>` field. The result assembly stage (Stage 5) reads this field directly. Verify this field exists in m2p3's `ScoredCandidate` before implementing Stage 5; if missing, add it as part of this task.
- **`FilterEvaluator` is NOT stored in `TidalDb`**: construct it per-query in `TidalDb::retrieve()` by passing references to `self.bitmap_indexes` and `self.range_indexes`. See the `TidalDb::retrieve()` code snippet above for the correct wiring. Do not add `filter_evaluator` as a field on `TidalDb` -- it would create a self-referential struct.
- **`Results.warnings` accumulation**: the executor should accumulate warnings into a `Vec<String>` and pass it into the `Results` struct. Metadata enrichment failures (in Stage 3) are one source. Start with an empty vec and push warnings as they occur during pipeline execution. Do not propagate degradation warnings as errors.

File diff suppressed because it is too large Load Diff

View File

@ -10,12 +10,13 @@ export default function BlogIndex() {
Blog Blog
</p> </p>
<h1 className="font-serif text-4xl font-bold leading-tight md:text-5xl"> <h1 className="font-serif text-4xl font-bold leading-tight md:text-5xl">
Building the agent memory substrate. Memory substrate for agent-driven personalization.
</h1> </h1>
<p className="mt-4 text-lg text-muted"> <p className="mt-4 text-lg text-muted">
Architecture decisions, engineering insights, and progress updates as Architecture decisions, engineering insights, and field notes about
we turn tidalDB into the personalization layer agents can read and replacing the six-system ranking stack (and heavyweight Vespa-style
write in real time. deployments) with an embeddable database that agents can read and
write in real timethen scale to the distributed fabric in M8.
</p> </p>
<div className="mt-16 space-y-12"> <div className="mt-16 space-y-12">

View File

@ -19,9 +19,13 @@ tempfile = { version = "3", optional = true }
tracing = "0.1" tracing = "0.1"
[dev-dependencies] [dev-dependencies]
actix-web = "4"
axum = "0.8"
criterion = { version = "0.5", features = ["html_reports"] } criterion = { version = "0.5", features = ["html_reports"] }
proptest = "1" proptest = "1"
tempfile = "3" tempfile = "3"
tokio = { version = "1", features = ["macros", "rt-multi-thread", "signal"] }
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
[lints.rust] [lints.rust]
unsafe_code = "forbid" unsafe_code = "forbid"
@ -34,6 +38,18 @@ cast_possible_truncation = "allow"
module_name_repetitions = "allow" module_name_repetitions = "allow"
unwrap_used = "deny" unwrap_used = "deny"
[[example]]
name = "quickstart"
[[example]]
name = "axum_embedding"
[[example]]
name = "actix_embedding"
[[example]]
name = "cli_embedding"
[[test]] [[test]]
name = "sandboxed_storage" name = "sandboxed_storage"
required-features = ["test-utils"] required-features = ["test-utils"]

View File

@ -1,9 +1,128 @@
use criterion::{Criterion, criterion_group, criterion_main}; #![allow(clippy::unwrap_used)]
#[allow(clippy::missing_const_for_fn)] use std::time::Duration;
fn signal_benchmarks(_c: &mut Criterion) {
// Placeholder — benchmarks added as signal system is implemented. use criterion::{Criterion, black_box, criterion_group, criterion_main};
use tidaldb::schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window};
use tidaldb::signals::{NoopWalWriter, SignalLedger, SignalTypeId};
fn view_ledger() -> (SignalLedger, SignalTypeId) {
let mut builder = SchemaBuilder::new();
let _ = builder
.signal(
"view",
EntityKind::Item,
DecaySpec::Exponential {
half_life: Duration::from_secs(7 * 24 * 3600),
},
)
.windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
.velocity(false)
.add();
let schema = builder.build().unwrap();
let type_id = {
// Alphabetical sort assigns "view" → id 0 (only signal).
SignalTypeId::new(0)
};
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
(ledger, type_id)
} }
criterion_group!(benches, signal_benchmarks); // Pre-computed lambda for a 7-day half-life exponential decay.
const LAMBDA_7D: f64 = std::f64::consts::LN_2 / (7.0 * 24.0 * 3600.0);
/// Benchmark: single signal write (excluding WAL — `NoopWalWriter`).
/// Target: < 100ns.
fn bench_single_signal_write(c: &mut Criterion) {
let (ledger, _type_id) = view_ledger();
let entity_id = EntityId::new(1);
// Fixed timestamp avoids SystemTime::now() overhead per iteration.
let fixed_ns = Timestamp::now().as_nanos();
let ts = Timestamp::from_nanos(fixed_ns);
c.bench_function("signal_write_single", |b| {
b.iter(|| {
ledger
.record_signal(
black_box("view"),
black_box(entity_id),
black_box(1.0_f64),
black_box(ts),
)
.unwrap();
});
});
}
/// Benchmark: decay score read for a single entity.
/// Setup: 100 pre-written signals to ensure a non-trivial hot state.
/// Target: < 100ns.
fn bench_decay_score_read(c: &mut Criterion) {
let (ledger, _type_id) = view_ledger();
let entity_id = EntityId::new(42);
// Pre-warm: 100 signals spread over the past 7 days.
let base_ns = Timestamp::now().as_nanos();
let seven_days_ns: u64 = 7 * 24 * 3600 * 1_000_000_000;
for i in 0u64..100 {
let ts = Timestamp::from_nanos(
base_ns.saturating_sub(seven_days_ns) + i * (seven_days_ns / 100),
);
ledger.record_signal("view", entity_id, 1.0, ts).unwrap();
}
c.bench_function("signal_decay_score_read", |b| {
b.iter(|| {
ledger
.read_decay_score(black_box(entity_id), black_box("view"), black_box(0))
.unwrap()
});
});
}
/// Benchmark: 200-entity scoring pass using direct `DashMap` access to isolate
/// the hot-path read from schema lookup overhead.
/// Setup: 200 entities, each with 50 pre-written signals.
/// Target: < 5µs total.
fn bench_200_entity_scoring_pass(c: &mut Criterion) {
let (ledger, type_id) = view_ledger();
// Pre-warm: 200 entities × 50 signals each.
let base_ns = Timestamp::now().as_nanos();
let entity_ids: Vec<EntityId> = (0u64..200).map(EntityId::new).collect();
for &entity_id in &entity_ids {
for j in 0u64..50 {
let ts = Timestamp::from_nanos(
base_ns.saturating_sub(3_600_000_000_000) + j * 72_000_000_000,
);
ledger.record_signal("view", entity_id, 1.0, ts).unwrap();
}
}
let now_ns = Timestamp::now().as_nanos();
c.bench_function("signal_200_entity_scoring_pass", |b| {
b.iter(|| {
let mut sum = 0.0_f64;
for &entity_id in black_box(&entity_ids) {
if let Some(entry) = ledger.entries().get(&(entity_id, type_id)) {
sum += entry.hot.current_score(
black_box(0),
black_box(now_ns),
black_box(LAMBDA_7D),
);
}
}
black_box(sum)
});
});
}
criterion_group!(
benches,
bench_single_signal_write,
bench_decay_score_read,
bench_200_entity_scoring_pass,
);
criterion_main!(benches); criterion_main!(benches);

View File

@ -0,0 +1,63 @@
//! tidalDB + Actix-web: embedding a tidalDB instance in an Actix-web server.
//!
//! Demonstrates:
//! - Wrapping `TidalDb` in `Arc` for shared ownership
//! - Passing the instance via `web::Data<Arc<TidalDb>>`
//! - A `/health` route that calls `health_check()`
//! - Graceful shutdown via Actix's built-in signal handling
//!
//! `TidalDb` is not `Clone`, so it is wrapped in `Arc`. `web::Data` then
//! wraps the `Arc`, giving each handler a cheap clone of the pointer.
//!
//! # Running
//!
//! ```bash
//! cargo run --example actix_embedding --manifest-path tidal/Cargo.toml
//! # Then: curl http://127.0.0.1:3001/health
//! ```
use std::sync::Arc;
use actix_web::{App, HttpResponse, HttpServer, web};
use tidaldb::TidalDb;
async fn health(db: web::Data<Arc<TidalDb>>) -> HttpResponse {
match db.health_check() {
Ok(()) => HttpResponse::Ok().body("ok"),
Err(_) => HttpResponse::ServiceUnavailable().body("degraded"),
}
}
#[actix_web::main]
async fn main() -> std::io::Result<()> {
tracing_subscriber::fmt()
.with_env_filter("tidaldb=debug")
.init();
// Open tidalDB; map LumenError -> io::Error so Actix's Result type aligns.
let db = Arc::new(
TidalDb::builder()
.ephemeral()
.open()
.map_err(std::io::Error::other)?,
);
// Wrap in web::Data so Actix can clone it cheaply into each worker thread.
let db_data = web::Data::new(Arc::clone(&db));
println!("listening on http://127.0.0.1:3001");
println!(" GET /health -> tidalDB health check");
println!("press Ctrl+C to stop");
HttpServer::new(move || {
App::new()
.app_data(db_data.clone())
.route("/health", web::get().to(health))
})
.bind("127.0.0.1:3001")?
.run()
.await?;
// `db` drops here, triggering TidalDb::drop() -> shutdown_inner().
Ok(())
}

View File

@ -0,0 +1,66 @@
//! tidalDB + Axum: embedding a tidalDB instance in an Axum web server.
//!
//! Demonstrates:
//! - Wrapping `TidalDb` in `Arc` for shared ownership across handler threads
//! - Passing the instance via Axum `State<Arc<TidalDb>>`
//! - A `/health` route that calls `health_check()`
//! - Graceful shutdown via `tokio::signal::ctrl_c()`
//!
//! `TidalDb` is not `Clone`, so it must be wrapped in `Arc` for use with
//! Axum's `State<T>`, which requires `T: Clone + Send + Sync + 'static`.
//!
//! # Running
//!
//! ```bash
//! cargo run --example axum_embedding --manifest-path tidal/Cargo.toml
//! # Then: curl http://127.0.0.1:3000/health
//! ```
use std::sync::Arc;
use axum::{Router, extract::State, routing::get};
use tidaldb::TidalDb;
async fn health(State(db): State<Arc<TidalDb>>) -> &'static str {
match db.health_check() {
Ok(()) => "ok",
Err(_) => "degraded",
}
}
/// Resolves when Ctrl+C is received.
async fn shutdown_signal() {
tokio::signal::ctrl_c()
.await
.unwrap_or_else(|e| eprintln!("ctrl-c error: {e}"));
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
tracing_subscriber::fmt()
.with_env_filter("tidaldb=debug")
.init();
// `TidalDb` is not Clone — wrap in Arc so the router and main both hold a
// reference. The router clone gets the weak reference; main retains the
// strong reference for explicit shutdown ordering.
let db = Arc::new(TidalDb::builder().ephemeral().open()?);
let app = Router::new()
.route("/health", get(health))
.with_state(Arc::clone(&db));
let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await?;
let addr = listener.local_addr()?;
println!("listening on http://{addr}");
println!(" GET /health -> tidalDB health check");
println!("press Ctrl+C to stop");
axum::serve(listener, app)
.with_graceful_shutdown(shutdown_signal())
.await?;
// `db` drops here, calling TidalDb::drop() -> shutdown_inner().
// Future milestones: WAL drain and storage flush happen in close().
Ok(())
}

View File

@ -0,0 +1,40 @@
//! tidalDB CLI embedding: open a persistent database and print status.
//!
//! Demonstrates:
//! - Opening a persistent `TidalDb` with an explicit data directory
//! - Printing build hash, uptime, and debug info
//! - Explicit `close()` before exit
//!
//! In a real CLI this data directory would come from a flag or config file.
//! Here we use a temporary directory so the example runs safely in CI.
//!
//! # Running
//!
//! ```bash
//! cargo run --example cli_embedding --manifest-path tidal/Cargo.toml
//! ```
fn main() -> Result<(), Box<dyn std::error::Error>> {
tracing_subscriber::fmt()
.with_env_filter("tidaldb=debug")
.init();
// Create a temporary directory to act as the data root.
// In a production CLI, replace this with the user-supplied path.
let tmp = tempfile::tempdir()?;
let data_dir = tmp.path();
let db = tidaldb::TidalDb::builder().with_data_dir(data_dir).open()?;
println!("build: {}", tidaldb::BUILD_HASH);
println!("data: {}", data_dir.display());
println!("uptime: {:.3}s", db.metrics().uptime_seconds());
println!("health: {:?}", db.health_check());
println!("debug: {db:?}");
// Explicit close — ensures any future WAL flush completes before exit.
db.close()?;
println!("shutdown complete.");
Ok(())
}

View File

@ -0,0 +1,39 @@
//! tidalDB quickstart: open an ephemeral database, verify health, and close.
//!
//! This example is the canonical first look at the tidalDB API. It compiles
//! as a live contract — if the builder or health-check API changes, this
//! example fails in CI.
//!
//! At M0 the database validates configuration and proves the builder works.
//! Future milestones will add signal writes, queries, and ranking here.
//!
//! # Running
//!
//! ```bash
//! cargo run --example quickstart --manifest-path tidal/Cargo.toml
//! ```
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize tracing so spans emitted by tidalDB are visible.
tracing_subscriber::fmt()
.with_env_filter("tidaldb=debug")
.init();
// Open an ephemeral (in-memory) database — no filesystem access.
// Use TidalDb::builder().with_data_dir(path) for persistent storage.
let db = tidaldb::TidalDb::builder().ephemeral().open()?;
// Verify the database handle is operational.
db.health_check()?;
println!("build: {}", tidaldb::BUILD_HASH);
println!("uptime: {:.3}s", db.metrics().uptime_seconds());
println!("health: ok");
// Explicit shutdown — also triggered automatically on drop.
// Future milestones: flushes WAL and storage engine here.
db.close()?;
println!("tidalDB opened, verified, and closed. M0 complete.");
Ok(())
}

View File

@ -6,21 +6,33 @@ use super::config::{Config, ConfigError, StorageMode};
use super::metrics::MetricsState; use super::metrics::MetricsState;
use super::paths::Paths; use super::paths::Paths;
use crate::schema::Schema;
/// Fluent builder for constructing a [`TidalDb`] instance. /// Fluent builder for constructing a [`TidalDb`] instance.
/// ///
/// # Examples /// # Examples
/// ///
/// ```rust,no_run /// ```
/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
/// use tidaldb::TidalDb; /// use tidaldb::TidalDb;
/// ///
/// // Ephemeral (in-memory) database for tests: /// // Ephemeral (in-memory) database — no filesystem access:
/// let db = TidalDb::builder().ephemeral().open().unwrap(); /// let db = TidalDb::builder().ephemeral().open()?;
/// db.health_check()?;
/// # Ok(())
/// # }
/// ```
/// ///
/// // Persistent database with explicit data directory: /// ```no_run
/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
/// use tidaldb::TidalDb;
///
/// // Persistent database with an explicit data directory:
/// let db = TidalDb::builder() /// let db = TidalDb::builder()
/// .with_data_dir("/var/lib/tidaldb") /// .with_data_dir("/var/lib/tidaldb")
/// .open() /// .open()?;
/// .unwrap(); /// # Ok(())
/// # }
/// ``` /// ```
#[derive(Debug)] #[derive(Debug)]
pub struct TidalDbBuilder { pub struct TidalDbBuilder {
@ -29,6 +41,9 @@ pub struct TidalDbBuilder {
/// Only used when the `metrics` feature is enabled. /// Only used when the `metrics` feature is enabled.
#[allow(dead_code)] #[allow(dead_code)]
metrics_addr: Option<String>, metrics_addr: Option<String>,
/// Optional schema for signal types. When set, `open()` wires the storage
/// engine, WAL, and signal ledger into the returned `TidalDb`.
schema: Option<Schema>,
} }
impl TidalDbBuilder { impl TidalDbBuilder {
@ -38,9 +53,24 @@ impl TidalDbBuilder {
Self { Self {
config: Config::default(), config: Config::default(),
metrics_addr: None, metrics_addr: None,
schema: None,
} }
} }
/// Attach a validated [`Schema`] to this builder.
///
/// When a schema is provided, [`open`](Self::open) wires in the storage
/// engine, write-ahead log, and signal ledger, enabling the full M1 API
/// (`write_item`, `signal`, `read_decay_score`, etc.).
///
/// Without a schema the database still opens in M0 compatibility mode
/// and only `health_check` and metrics are available.
#[must_use]
pub fn with_schema(mut self, schema: Schema) -> Self {
self.schema = Some(schema);
self
}
/// Switch to ephemeral (in-memory) mode, clearing any directory paths. /// Switch to ephemeral (in-memory) mode, clearing any directory paths.
/// ///
/// This is the default mode. Calling this is only necessary to reset /// This is the default mode. Calling this is only necessary to reset
@ -132,13 +162,20 @@ impl TidalDbBuilder {
/// ///
/// # Examples /// # Examples
/// ///
/// ```rust,no_run /// ```no_run
/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
/// # #[cfg(feature = "metrics")] /// # #[cfg(feature = "metrics")]
/// # {
/// let db = tidaldb::TidalDb::builder() /// let db = tidaldb::TidalDb::builder()
/// .ephemeral() /// .ephemeral()
/// .enable_metrics("127.0.0.1:9090") /// .enable_metrics("127.0.0.1:9090")
/// .open() /// .open()?;
/// .unwrap(); /// if let Some(addr) = db.metrics_addr() {
/// println!("metrics at http://{addr}/metrics");
/// }
/// # }
/// # Ok(())
/// # }
/// ``` /// ```
#[cfg(feature = "metrics")] #[cfg(feature = "metrics")]
#[must_use] #[must_use]
@ -196,12 +233,29 @@ impl TidalDbBuilder {
None None
}; };
Ok(TidalDb::from_config( if let Some(schema) = self.schema {
self.config, // Wire in storage, WAL, and signal ledger.
metrics, let (storage, ledger, wal, last_seq) = TidalDb::open_with_schema(&self.config, schema)?;
#[cfg(feature = "metrics")]
metrics_handle, Ok(TidalDb::from_parts(
)) self.config,
metrics,
#[cfg(feature = "metrics")]
metrics_handle,
Some(ledger),
Some(storage),
wal,
last_seq,
))
} else {
// M0 compatibility mode: no storage, no ledger, no WAL.
Ok(TidalDb::from_config(
self.config,
metrics,
#[cfg(feature = "metrics")]
metrics_handle,
))
}
} }
} }
@ -239,6 +293,7 @@ mod tests {
cache_dir: None, cache_dir: None,
}, },
metrics_addr: None, metrics_addr: None,
schema: None,
}; };
let result = builder.validate(); let result = builder.validate();
assert!(result.is_err()); assert!(result.is_err());

View File

@ -5,6 +5,15 @@ use std::path::PathBuf;
/// ///
/// `Ephemeral` keeps everything in memory -- ideal for tests and short-lived /// `Ephemeral` keeps everything in memory -- ideal for tests and short-lived
/// processes. `Persistent` writes to an LSM-tree on disk (fjall). /// processes. `Persistent` writes to an LSM-tree on disk (fjall).
///
/// # Examples
///
/// ```
/// use tidaldb::StorageMode;
///
/// let mode = StorageMode::Ephemeral;
/// assert_ne!(mode, StorageMode::Persistent);
/// ```
#[derive(Debug, Clone, PartialEq, Eq)] #[derive(Debug, Clone, PartialEq, Eq)]
pub enum StorageMode { pub enum StorageMode {
/// In-memory only. No filesystem access. Data is lost on drop. /// In-memory only. No filesystem access. Data is lost on drop.
@ -21,6 +30,16 @@ pub enum StorageMode {
/// ///
/// The default configuration is ephemeral (in-memory) with no directory paths. /// The default configuration is ephemeral (in-memory) with no directory paths.
/// Persistent mode requires at least `data_dir` to be set. /// Persistent mode requires at least `data_dir` to be set.
///
/// # Examples
///
/// ```
/// use tidaldb::Config;
///
/// let cfg = Config::default();
/// // Default is ephemeral — no data directory required.
/// assert!(cfg.data_dir.is_none());
/// ```
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
pub struct Config { pub struct Config {
/// Storage backend selection. /// Storage backend selection.
@ -48,6 +67,15 @@ impl Default for Config {
/// ///
/// These are always caller errors -- the configuration is invalid and must /// These are always caller errors -- the configuration is invalid and must
/// be corrected before a tidalDB instance can be opened. /// be corrected before a tidalDB instance can be opened.
///
/// # Examples
///
/// ```
/// use tidaldb::ConfigError;
///
/// let err = ConfigError::MissingDataDir;
/// assert!(err.to_string().contains("data directory"));
/// ```
#[derive(Debug)] #[derive(Debug)]
pub enum ConfigError { pub enum ConfigError {
/// Persistent mode was selected but no data directory was provided. /// Persistent mode was selected but no data directory was provided.

View File

@ -20,6 +20,28 @@ use super::metrics::MetricsState;
/// Handle to the background metrics HTTP server. /// Handle to the background metrics HTTP server.
/// ///
/// The server thread runs until [`stop`] is called or this handle is dropped. /// The server thread runs until [`stop`] is called or this handle is dropped.
///
/// # Examples
///
/// ```no_run
/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
/// // Enable the metrics server when opening the database:
/// let db = tidaldb::TidalDb::builder()
/// .ephemeral()
/// .enable_metrics("127.0.0.1:0")
/// .open()?;
///
/// // Discover the OS-assigned port when "0" was requested:
/// if let Some(addr) = db.metrics_addr() {
/// println!("GET http://{addr}/metrics (Prometheus)");
/// println!("GET http://{addr}/healthz (JSON)");
/// }
///
/// // Shutdown: server stops when `db` is dropped or `db.close()` is called.
/// db.close()?;
/// # Ok(())
/// # }
/// ```
pub struct MetricsHandle { pub struct MetricsHandle {
shutdown: Arc<AtomicBool>, shutdown: Arc<AtomicBool>,
thread: Option<thread::JoinHandle<()>>, thread: Option<thread::JoinHandle<()>>,

View File

@ -15,6 +15,18 @@ use std::time::Instant;
/// Shared runtime metrics for a `TidalDb` instance. /// Shared runtime metrics for a `TidalDb` instance.
/// ///
/// Cheap to clone (`Arc` inside). Thread-safe. /// Cheap to clone (`Arc` inside). Thread-safe.
///
/// # Examples
///
/// ```
/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
/// let db = tidaldb::TidalDb::builder().ephemeral().open()?;
/// let metrics = db.metrics();
/// assert!(metrics.uptime_seconds() >= 0.0);
/// assert!((metrics.health_ok_value() - 1.0).abs() < f64::EPSILON);
/// # Ok(())
/// # }
/// ```
pub struct MetricsState { pub struct MetricsState {
/// Time the database was opened. /// Time the database was opened.
pub(crate) opened_at: Instant, pub(crate) opened_at: Instant,

View File

@ -6,12 +6,16 @@
//! //!
//! # Quick Start //! # Quick Start
//! //!
//! ```rust,no_run //! ```
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! use tidaldb::TidalDb; //! use tidaldb::TidalDb;
//! //!
//! // In-memory database for tests: //! // In-memory database — no filesystem access, perfect for tests:
//! let db = TidalDb::builder().ephemeral().open().unwrap(); //! let db = TidalDb::builder().ephemeral().open()?;
//! assert!(db.health_check().is_ok()); //! db.health_check()?;
//! db.close()?;
//! # Ok(())
//! # }
//! ``` //! ```
pub mod builder; pub mod builder;
@ -22,6 +26,7 @@ pub mod metrics;
pub mod paths; pub mod paths;
#[cfg(any(test, feature = "test-utils"))] #[cfg(any(test, feature = "test-utils"))]
pub mod temp; pub mod temp;
pub(crate) mod wal_bridge;
pub use builder::TidalDbBuilder; pub use builder::TidalDbBuilder;
pub use config::{Config, ConfigError, StorageMode}; pub use config::{Config, ConfigError, StorageMode};
@ -30,21 +35,131 @@ pub use paths::Paths;
#[cfg(any(test, feature = "test-utils"))] #[cfg(any(test, feature = "test-utils"))]
pub use temp::TempTidalHome; pub use temp::TempTidalHome;
use std::collections::HashMap;
use std::sync::Arc; use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering}; use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::Duration;
use crate::schema::{DurabilityError, EntityId, EntityKind, LumenError, Schema, Timestamp, Window};
use crate::signals::{NoopWalWriter, SignalLedger, SignalTypeId};
use crate::storage::{InMemoryBackend, StorageEngine, Tag, encode_key};
use crate::wal::{WalConfig, WalHandle};
use self::wal_bridge::WalHandleWriter;
// ── Storage abstraction ───────────────────────────────────────────────────────
/// Wraps either an in-memory backend (ephemeral mode) or a fjall storage
/// (persistent mode) behind a uniform interface.
///
/// Only the items backend is used in M1 — user and creator entity support
/// is deferred to M3.
pub(crate) enum StorageBox {
Memory(InMemoryBackend),
Fjall(crate::storage::FjallStorage),
}
impl StorageBox {
/// Reference to the items storage engine.
fn items_engine(&self) -> &dyn StorageEngine {
match self {
Self::Memory(m) => m,
Self::Fjall(f) => f.backend(EntityKind::Item),
}
}
/// Flush all buffered writes to durable storage.
fn flush(&self) -> crate::Result<()> {
match self {
Self::Memory(_) => Ok(()),
Self::Fjall(f) => f.flush_all().map_err(LumenError::from),
}
}
}
// ── Metadata serialization ────────────────────────────────────────────────────
/// Serialize `HashMap<String, String>` as length-prefixed binary pairs.
///
/// Format (all lengths little-endian u32):
/// ```text
/// [num_entries: 4 bytes]
/// for each entry:
/// [key_len: 4 bytes][key bytes]
/// [val_len: 4 bytes][value bytes]
/// ```
fn serialize_metadata(map: &HashMap<String, String>) -> Vec<u8> {
#[allow(clippy::cast_possible_truncation)]
let mut buf = Vec::new();
buf.extend_from_slice(&(map.len() as u32).to_le_bytes());
for (k, v) in map {
buf.extend_from_slice(&(k.len() as u32).to_le_bytes());
buf.extend_from_slice(k.as_bytes());
buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
buf.extend_from_slice(v.as_bytes());
}
buf
}
// ── Periodic checkpoint ───────────────────────────────────────────────────────
/// Background thread body: checkpoint signal state to storage every 30 seconds.
///
/// Polls the shutdown flag every 500ms so the thread exits promptly when
/// `shutdown_inner()` is called. Only runs in persistent mode (ephemeral opens
/// never spawn this thread).
///
/// The `Arc` arguments are intentionally passed by value: the thread must own
/// them for its entire lifetime (references cannot satisfy the `'static` bound
/// required by `std::thread::spawn`).
#[allow(clippy::needless_pass_by_value)]
fn run_checkpoint_thread(
shutdown: Arc<AtomicBool>,
ledger: Arc<SignalLedger>,
storage: Box<dyn StorageEngine + Send + Sync>,
last_wal_seq: Arc<AtomicU64>,
) {
const CHECKPOINT_INTERVAL: Duration = Duration::from_secs(30);
const POLL_INTERVAL: Duration = Duration::from_millis(500);
let mut elapsed = Duration::ZERO;
loop {
std::thread::sleep(POLL_INTERVAL);
if shutdown.load(Ordering::Acquire) {
break;
}
elapsed += POLL_INTERVAL;
if elapsed >= CHECKPOINT_INTERVAL {
elapsed = Duration::ZERO;
let meta = crate::signals::checkpoint::CheckpointMeta {
checkpoint_time_ns: Timestamp::now().as_nanos(),
wal_sequence: last_wal_seq.load(Ordering::Relaxed),
};
if let Err(e) = ledger.checkpoint(storage.as_ref(), meta) {
tracing::error!(error = %e, "periodic checkpoint failed");
} else {
tracing::debug!("periodic checkpoint written");
}
}
}
}
// ── TidalDb ───────────────────────────────────────────────────────────────────
/// A tidalDB database instance. /// A tidalDB database instance.
/// ///
/// Created via [`TidalDb::builder()`]. At M0 this is a thin handle that /// Created via [`TidalDb::builder()`]. After M1p5, the database wires in the
/// validates configuration and proves the builder API works. Future /// storage engine, signal ledger, and WAL behind this facade.
/// milestones will wire in the storage engine, signal ledger, and query
/// executor behind this facade.
/// ///
/// # Shutdown /// # Shutdown
/// ///
/// Call [`close`](Self::close) for explicit shutdown. If dropped without /// Call [`close`](Self::close) for explicit, checked shutdown. If dropped
/// calling `close`, the [`Drop`] implementation will run cleanup and log /// without calling `close`, the [`Drop`] implementation will run best-effort
/// any errors via `tracing::error!`. /// cleanup and log any errors via `tracing::error!`.
///
/// # Thread Safety
///
/// `TidalDb` is `Send + Sync`. Wrap it in an `Arc` to share across threads.
pub struct TidalDb { pub struct TidalDb {
config: Config, config: Config,
/// Whether `close()` has been called. Prevents double-shutdown. /// Whether `close()` has been called. Prevents double-shutdown.
@ -54,6 +169,24 @@ pub struct TidalDb {
/// Handle to the metrics HTTP server thread (metrics feature only). /// Handle to the metrics HTTP server thread (metrics feature only).
#[cfg(feature = "metrics")] #[cfg(feature = "metrics")]
metrics_handle: Option<http::MetricsHandle>, metrics_handle: Option<http::MetricsHandle>,
/// Signal ledger: in-memory hot + warm tier state.
/// `None` if no schema was provided at open time.
ledger: Option<Arc<SignalLedger>>,
/// Storage engine: routes to the correct backend by entity kind.
/// `None` in ephemeral mode without a schema.
storage: Option<StorageBox>,
/// The live WAL handle. Wrapped in `Mutex<Option<...>>` so
/// `shutdown_inner(&self)` can take ownership for graceful shutdown.
wal: std::sync::Mutex<Option<WalHandle>>,
/// Highest WAL sequence number committed by `WalHandleWriter`.
/// Shared with the bridge; read at checkpoint time.
last_wal_seq: Arc<AtomicU64>,
/// Shutdown flag for the periodic checkpoint background thread.
/// Set to `true` in `shutdown_inner()` to stop the thread.
shutdown_checkpoint: Arc<AtomicBool>,
/// Periodic checkpoint background thread (persistent mode only).
/// Wrapped in `Mutex<Option<...>>` so `shutdown_inner(&self)` can join it.
checkpoint_thread: std::sync::Mutex<Option<std::thread::JoinHandle<()>>>,
} }
impl TidalDb { impl TidalDb {
@ -63,9 +196,10 @@ impl TidalDb {
TidalDbBuilder::new() TidalDbBuilder::new()
} }
/// Construct a `TidalDb` from a validated configuration. /// Construct a `TidalDb` from a validated configuration (no schema).
/// ///
/// This is `pub(crate)` -- external callers use the builder. /// Used by the builder when no schema is provided — backwards-compatible
/// with M0 usage where signal/storage APIs are not called.
#[allow(clippy::missing_const_for_fn)] // Arc field prevents const in practice #[allow(clippy::missing_const_for_fn)] // Arc field prevents const in practice
pub(crate) fn from_config( pub(crate) fn from_config(
config: Config, config: Config,
@ -78,6 +212,63 @@ impl TidalDb {
metrics, metrics,
#[cfg(feature = "metrics")] #[cfg(feature = "metrics")]
metrics_handle, metrics_handle,
ledger: None,
storage: None,
wal: std::sync::Mutex::new(None),
last_wal_seq: Arc::new(AtomicU64::new(0)),
shutdown_checkpoint: Arc::new(AtomicBool::new(false)),
checkpoint_thread: std::sync::Mutex::new(None),
}
}
/// Construct a `TidalDb` from a validated configuration with wired-in
/// storage, ledger, and WAL.
#[allow(clippy::too_many_arguments, clippy::missing_const_for_fn)]
pub(crate) fn from_parts(
config: Config,
metrics: Arc<MetricsState>,
#[cfg(feature = "metrics")] metrics_handle: Option<http::MetricsHandle>,
ledger: Option<Arc<SignalLedger>>,
storage: Option<StorageBox>,
wal: Option<WalHandle>,
last_wal_seq: Arc<AtomicU64>,
) -> Self {
// Spawn a periodic checkpoint thread in persistent mode when a ledger
// is present. The thread checkpoints every 30 seconds so that crash
// recovery replays a bounded WAL tail regardless of shutdown timing.
let shutdown_checkpoint = Arc::new(AtomicBool::new(false));
let checkpoint_thread = {
let thread_handle = match (storage.as_ref(), ledger.as_ref()) {
(Some(StorageBox::Fjall(f)), Some(ledger_arc)) => {
let items = Box::new(f.backend(EntityKind::Item).clone())
as Box<dyn StorageEngine + Send + Sync>;
let shutdown = Arc::clone(&shutdown_checkpoint);
let ledger_clone = Arc::clone(ledger_arc);
let seq_clone = Arc::clone(&last_wal_seq);
std::thread::Builder::new()
.name("tidaldb-checkpoint".to_string())
.spawn(move || {
run_checkpoint_thread(shutdown, ledger_clone, items, seq_clone);
})
.ok()
}
_ => None,
};
std::sync::Mutex::new(thread_handle)
};
Self {
config,
closed: AtomicBool::new(false),
metrics,
#[cfg(feature = "metrics")]
metrics_handle,
ledger,
storage,
wal: std::sync::Mutex::new(wal),
last_wal_seq,
shutdown_checkpoint,
checkpoint_thread,
} }
} }
@ -108,10 +299,6 @@ impl TidalDb {
/// Returns `Ok(())` if the database is initialized and operational. /// Returns `Ok(())` if the database is initialized and operational.
/// ///
/// At M0 this simply confirms the handle was constructed successfully.
/// Future milestones will verify storage engine connectivity, WAL
/// integrity, and index health.
///
/// # Errors /// # Errors
/// ///
/// Returns an error if the database has been closed or an internal /// Returns an error if the database has been closed or an internal
@ -119,8 +306,6 @@ impl TidalDb {
#[tracing::instrument(skip(self))] #[tracing::instrument(skip(self))]
pub fn health_check(&self) -> crate::Result<()> { pub fn health_check(&self) -> crate::Result<()> {
if self.closed.load(Ordering::Acquire) { if self.closed.load(Ordering::Acquire) {
// Ordering::Release: ensures ranking queries that load with
// Acquire see the degraded state after we mark it here.
self.metrics.health_ok.store(false, Ordering::Release); self.metrics.health_ok.store(false, Ordering::Release);
return Err(crate::LumenError::Internal( return Err(crate::LumenError::Internal(
"database is closed".to_string(), "database is closed".to_string(),
@ -129,34 +314,155 @@ impl TidalDb {
Ok(()) Ok(())
} }
/// Cleanly shut down the database. // ── M1p5 API ─────────────────────────────────────────────────────────────
/// Write (or overwrite) item metadata.
/// ///
/// At M0 this is a no-op beyond marking the instance as closed. /// Stores the `metadata` key-value map under the entity's `Tag::Meta` key
/// Future milestones will drain the WAL, flush the storage engine, /// in the items storage backend.
/// and persist index state.
/// ///
/// # Errors /// # Errors
/// ///
/// Returns an error if shutdown encounters a failure (e.g., WAL flush /// - `LumenError::Internal` if no storage backend is wired (use `with_schema()`).
/// fails in future milestones). /// - `LumenError::Storage` on storage engine failure.
pub fn write_item(
&self,
id: EntityId,
metadata: &HashMap<String, String>,
) -> crate::Result<()> {
let storage = self
.storage
.as_ref()
.ok_or_else(|| LumenError::Internal("no storage: open with with_schema()".into()))?;
let key = encode_key(id, Tag::Meta, b"");
let value = serialize_metadata(metadata);
storage
.items_engine()
.put(&key, &value)
.map_err(LumenError::from)
}
/// Record a signal event for an entity.
///
/// Atomically:
/// 1. Appends the event to the WAL (WAL-first durability).
/// 2. Updates the in-memory decay score (hot tier).
/// 3. Updates the in-memory windowed counter (warm tier).
///
/// # Errors
///
/// - `LumenError::Internal` if no ledger is wired (use `with_schema()`).
/// - `LumenError::Schema` if `signal_type` is not defined in the schema.
/// - `LumenError::Durability` if the WAL write fails.
pub fn signal(
&self,
signal_type: &str,
entity_id: EntityId,
weight: f64,
timestamp: Timestamp,
) -> crate::Result<()> {
let ledger = self
.ledger
.as_ref()
.ok_or_else(|| LumenError::Internal("no ledger: open with with_schema()".into()))?;
ledger.record_signal(signal_type, entity_id, weight, timestamp)
}
/// Read the current decay score for an entity-signal pair.
///
/// Applies lazy decay from the stored timestamp to the current wall-clock
/// time. Returns `None` if no signals have been recorded.
///
/// `decay_rate_idx` selects the lambda index from the signal definition.
/// For exponential signals with one rate, use `0`.
///
/// # Errors
///
/// - `LumenError::Internal` if no ledger is wired.
/// - `LumenError::Schema` if `signal_type` is not defined.
pub fn read_decay_score(
&self,
entity_id: EntityId,
signal_type: &str,
decay_rate_idx: usize,
) -> crate::Result<Option<f64>> {
let ledger = self
.ledger
.as_ref()
.ok_or_else(|| LumenError::Internal("no ledger: open with with_schema()".into()))?;
ledger.read_decay_score(entity_id, signal_type, decay_rate_idx)
}
/// Read the windowed event count for an entity-signal pair.
///
/// Returns `0` if no signals have been recorded.
///
/// # Errors
///
/// - `LumenError::Internal` if no ledger is wired.
/// - `LumenError::Schema` if `signal_type` is not defined.
pub fn read_windowed_count(
&self,
entity_id: EntityId,
signal_type: &str,
window: Window,
) -> crate::Result<u64> {
let ledger = self
.ledger
.as_ref()
.ok_or_else(|| LumenError::Internal("no ledger: open with with_schema()".into()))?;
ledger.read_windowed_count(entity_id, signal_type, window)
}
/// Read the velocity (events per second) for an entity-signal-window.
///
/// Velocity = `windowed_count / window_duration_seconds`.
/// Returns `0.0` for `AllTime` windows or if no signals recorded.
///
/// # Errors
///
/// - `LumenError::Internal` if no ledger is wired.
/// - `LumenError::Schema` if `signal_type` is not defined.
pub fn read_velocity(
&self,
entity_id: EntityId,
signal_type: &str,
window: Window,
) -> crate::Result<f64> {
let ledger = self
.ledger
.as_ref()
.ok_or_else(|| LumenError::Internal("no ledger: open with with_schema()".into()))?;
ledger.read_velocity(entity_id, signal_type, window)
}
// ── Lifecycle ─────────────────────────────────────────────────────────────
/// Cleanly shut down the database.
///
/// 1. Checkpoints all in-memory signal state to durable storage.
/// 2. Flushes the storage engine.
/// 3. Writes a WAL checkpoint marker and truncates old segments.
/// 4. Shuts down the WAL writer thread.
///
/// # Errors
///
/// Returns an error if WAL shutdown fails.
#[tracing::instrument(skip(self))] #[tracing::instrument(skip(self))]
pub fn close(self) -> crate::Result<()> { pub fn close(self) -> crate::Result<()> {
self.shutdown_inner() self.shutdown_inner()
} }
/// Internal shutdown logic shared by `close()` and `Drop`. /// Internal shutdown logic shared by `close()` and `Drop`.
///
/// Returns `Result` even though M0 is infallible -- future milestones
/// add WAL drain and storage flush which can fail.
#[allow(clippy::unnecessary_wraps)]
fn shutdown_inner(&self) -> crate::Result<()> { fn shutdown_inner(&self) -> crate::Result<()> {
// Swap from false to true. If it was already true, we already shut down. // CAS: first caller to flip false → true executes the shutdown body.
if self if self
.closed .closed
.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire) .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
.is_err() .is_err()
{ {
// Already closed -- idempotent, not an error. // Already closed — idempotent.
return Ok(()); return Ok(());
} }
@ -165,12 +471,32 @@ impl TidalDb {
// Mark health as degraded so the metrics endpoint reflects shutdown. // Mark health as degraded so the metrics endpoint reflects shutdown.
self.metrics.health_ok.store(false, Ordering::Release); self.metrics.health_ok.store(false, Ordering::Release);
// Signal the periodic checkpoint thread to stop, then join it.
// Must happen before the final checkpoint below to avoid a race where
// both the background thread and shutdown_inner write simultaneously.
self.shutdown_checkpoint.store(true, Ordering::Release);
// Use into_inner() on poisoned mutex so the thread is always joined
// even if the checkpoint thread panicked. Without this, a panicking
// checkpoint thread would leave the join skipped, and the thread would
// keep running after close() returns, racing with ledger/storage drop.
// The guard is dropped immediately after take() to release the lock.
let checkpoint_handle = {
let mut guard = match self.checkpoint_thread.lock() {
Ok(g) => g,
Err(poisoned) => poisoned.into_inner(),
};
guard.take()
};
if let Some(handle) = checkpoint_handle {
let _ = handle.join();
}
// Stop the metrics HTTP server if running. // Stop the metrics HTTP server if running.
#[cfg(feature = "metrics")] #[cfg(feature = "metrics")]
{ {
// SAFETY: We need &mut to stop the handle, but we only have &self. // SAFETY: We need &mut to stop the handle, but we only have &self.
// This is safe because shutdown_inner is guarded by the closed // This is safe because shutdown_inner is guarded by the closed
// compare_exchange above -- only one thread will ever reach this // compare_exchange above only one thread will ever reach this
// point. We use a raw pointer to get interior mutability for // point. We use a raw pointer to get interior mutability for
// the Option<MetricsHandle> field. // the Option<MetricsHandle> field.
// //
@ -179,11 +505,131 @@ impl TidalDb {
// logic. In practice this runs exactly once due to the CAS guard. // logic. In practice this runs exactly once due to the CAS guard.
} }
// M0: nothing to flush. Future milestones add WAL drain, storage // 1. Checkpoint signal state to storage.
// engine flush, and index persistence here. // Shutdown ordering: (1) stop checkpoint thread → (2) checkpoint ledger
// → (3) shutdown WAL. This is safe because the `closed` flag prevents
// new writes before shutdown begins, the checkpoint below captures the
// current in-memory state + last WAL seq, and WAL replay on next open
// re-applies any post-checkpoint events still in the WAL file.
if let (Some(ledger), Some(storage)) = (&self.ledger, &self.storage) {
let meta = crate::signals::checkpoint::CheckpointMeta {
checkpoint_time_ns: Timestamp::now().as_nanos(),
// Acquire pairs with the Release in WalHandleWriter::append_signal,
// ensuring we see the highest seq committed by any WAL writer thread.
wal_sequence: self.last_wal_seq.load(Ordering::Acquire),
};
if let Err(e) = ledger.checkpoint(storage.items_engine(), meta) {
tracing::error!(error = %e, "signal ledger checkpoint failed during shutdown");
}
if let Err(e) = storage.flush() {
tracing::error!(error = %e, "storage flush failed during shutdown");
}
}
// 2. Shut down WAL (needs ownership via take()).
let wal_opt = self
.wal
.lock()
.map_err(|_| LumenError::Internal("WAL mutex poisoned".into()))?
.take();
if let Some(wal) = wal_opt {
let seq = self.last_wal_seq.load(Ordering::Acquire);
if seq > 0 {
// Write WAL checkpoint marker so the next open knows the replay
// start point.
if let Err(e) = wal.checkpoint(seq) {
tracing::error!(error = %e, "WAL checkpoint marker failed during shutdown");
}
// Truncate segments that precede the checkpoint.
if let Err(e) = wal.truncate_before(seq) {
tracing::error!(error = %e, "WAL truncation failed during shutdown");
}
}
wal.shutdown().map_err(|e| {
LumenError::Durability(DurabilityError {
message: format!("WAL shutdown failed: {e}"),
})
})?;
}
Ok(()) Ok(())
} }
// ── Internal construction helper ──────────────────────────────────────────
/// Open storage and ledger components for the given schema.
///
/// Called from `TidalDbBuilder::open()` when `with_schema()` was called.
#[allow(clippy::type_complexity)]
pub(crate) fn open_with_schema(
config: &Config,
schema: Schema,
) -> crate::Result<(
StorageBox,
Arc<SignalLedger>,
Option<WalHandle>,
Arc<AtomicU64>,
)> {
let last_seq = Arc::new(AtomicU64::new(0));
match config.mode {
StorageMode::Ephemeral => {
let storage = StorageBox::Memory(InMemoryBackend::default());
let ledger = Arc::new(SignalLedger::new(schema, Box::new(NoopWalWriter)));
Ok((storage, ledger, None, last_seq))
}
StorageMode::Persistent => {
let data_dir = config.data_dir.as_ref().ok_or_else(|| {
LumenError::Internal("persistent mode requires data_dir".into())
})?;
// Open fjall storage.
let fjall_storage =
crate::storage::FjallStorage::open(data_dir).map_err(LumenError::from)?;
let storage = StorageBox::Fjall(fjall_storage);
// Build WAL config. The WAL directory sits inside data_dir
// but `WalConfig.dir` is the *parent* of the "wal/" subdirectory.
let wal_config = WalConfig {
dir: data_dir.clone(),
..WalConfig::default()
};
let (wal, replayed_events) = WalHandle::open(wal_config).map_err(|e| {
LumenError::Durability(DurabilityError {
message: format!("WAL open failed: {e}"),
})
})?;
// Build the WAL bridge for the ledger.
let wal_writer =
Box::new(WalHandleWriter::new(wal.sender(), Arc::clone(&last_seq)));
// Construct the ledger.
let ledger = Arc::new(SignalLedger::new(schema, wal_writer));
// Restore signal state from the last checkpoint.
if let Err(e) = ledger.restore(storage.items_engine()) {
tracing::warn!(
error = %e,
"signal ledger restore failed; starting from empty state"
);
}
// Replay WAL events that post-date the checkpoint.
for event in replayed_events {
let type_id = SignalTypeId::new(u16::from(event.signal_type));
let entity_id = EntityId::new(event.entity_id);
let weight = f64::from(event.weight);
let timestamp = Timestamp::from_nanos(event.timestamp_nanos);
ledger.apply_wal_event(type_id, entity_id, weight, timestamp);
}
Ok((storage, ledger, Some(wal), last_seq))
}
}
}
} }
impl Drop for TidalDb { impl Drop for TidalDb {

View File

@ -21,6 +21,16 @@ use std::path::{Path, PathBuf};
/// | `{base}/users` | fjall keyspace for user entities | /// | `{base}/users` | fjall keyspace for user entities |
/// | `{base}/creators` | fjall keyspace for creator entities | /// | `{base}/creators` | fjall keyspace for creator entities |
/// | `{base}/cache` | Materialized views and secondary indexes (future) | /// | `{base}/cache` | Materialized views and secondary indexes (future) |
///
/// # Examples
///
/// ```
/// use tidaldb::Paths;
///
/// let paths = Paths::new("/data/tidaldb");
/// assert!(paths.wal_dir().ends_with("wal"));
/// assert!(paths.items_dir().ends_with("items"));
/// ```
pub struct Paths { pub struct Paths {
base: PathBuf, base: PathBuf,
} }

View File

@ -16,13 +16,16 @@ use super::paths::Paths;
/// ///
/// # Examples /// # Examples
/// ///
/// ```rust,no_run /// ```
/// # use tidaldb::db::temp::TempTidalHome; /// # fn main() -> Result<(), std::io::Error> {
/// let home = TempTidalHome::new().unwrap(); /// use tidaldb::TempTidalHome;
///
/// let home = TempTidalHome::new()?;
/// let paths = home.paths(); /// let paths = home.paths();
/// paths.ensure_all().unwrap(); /// paths.ensure_all()?;
/// // ... run test ... /// // `home` drops here, removing the temporary directory.
/// // `home` dropped here -> directory removed /// # Ok(())
/// # }
/// ``` /// ```
pub struct TempTidalHome { pub struct TempTidalHome {
base: PathBuf, base: PathBuf,

View File

@ -0,0 +1,96 @@
//! Bridge from `WalWriter` trait to a live `WalHandle`.
//!
//! Lives in `db/` because it depends on both the `wal` layer (for `WalSender`)
//! and the `signals` layer (for `WalWriter`). Placing it here keeps the
//! `wal` and `signals` modules free of cross-layer dependencies.
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use crate::schema::{DurabilityError, EntityId, Timestamp};
use crate::signals::{SignalTypeId, WalWriter};
use crate::wal::{SignalEvent, WalSender};
/// Implements [`WalWriter`] by forwarding signal events to a live WAL.
///
/// Holds a [`WalSender`] (a cloneable, `Send + Sync` channel handle) rather
/// than the full `WalHandle` so that the bridge can be shared across threads
/// without preventing the owning `TidalDb` from shutting down the WAL.
///
/// After each successful append, the bridge updates the shared `last_seq`
/// counter so `TidalDb::shutdown()` knows the highest WAL sequence to include
/// in the ledger checkpoint marker.
pub struct WalHandleWriter {
sender: WalSender,
last_seq: Arc<AtomicU64>,
}
impl WalHandleWriter {
#[allow(clippy::missing_const_for_fn)] // Arc field prevents const in practice
pub fn new(sender: WalSender, last_seq: Arc<AtomicU64>) -> Self {
Self { sender, last_seq }
}
}
impl WalWriter for WalHandleWriter {
fn append_signal(
&self,
signal_type_id: SignalTypeId,
entity_id: EntityId,
weight: f64,
timestamp: Timestamp,
) -> crate::Result<()> {
// Signal type IDs are u16 but WAL wire format stores them as u8.
// Schemas with > 255 signal types are disallowed (checked here at runtime).
let signal_type_u8 = u8::try_from(signal_type_id.as_u16()).map_err(|_| {
crate::LumenError::Internal(format!(
"signal type id {} exceeds u8 max (255); reduce signal count",
signal_type_id.as_u16()
))
})?;
let event = SignalEvent {
entity_id: entity_id.as_u64(),
signal_type: signal_type_u8,
// f64 → f32 precision loss is acceptable for WAL storage;
// the in-memory hot tier retains f64 precision.
#[allow(clippy::cast_possible_truncation)]
weight: weight as f32,
timestamp_nanos: timestamp.as_nanos(),
};
let seq = self.sender.append(event).map_err(|e| {
crate::LumenError::Durability(DurabilityError {
message: e.to_string(),
})
})?;
// Track the highest committed WAL sequence for checkpoint metadata.
// seq == 0 means the event was deduplicated; do not update last_seq.
if seq > 0 {
// Release on success establishes a happens-before edge with the
// Acquire load in shutdown_inner. Without this, on weakly-ordered
// architectures (ARM, RISC-V) the checkpoint thread could observe
// a stale sequence and write a checkpoint marker that references
// an older WAL seq than the last committed event, causing spurious
// event re-replay on the next open.
let mut current = self.last_seq.load(Ordering::Relaxed);
loop {
if seq <= current {
break;
}
match self.last_seq.compare_exchange_weak(
current,
seq,
Ordering::Release, // publish new seq to shutdown thread
Ordering::Relaxed, // failure: no sync needed
) {
Ok(_) => break,
Err(actual) => current = actual,
}
}
}
Ok(())
}
}

View File

@ -9,6 +9,13 @@ pub mod wal;
/// Build hash compiled in from the `GIT_HASH` environment variable. /// Build hash compiled in from the `GIT_HASH` environment variable.
/// ///
/// Falls back to `"dev"` if `GIT_HASH` is unset or `build.rs` is not invoked. /// Falls back to `"dev"` if `GIT_HASH` is unset or `build.rs` is not invoked.
///
/// # Examples
///
/// ```
/// let hash = tidaldb::BUILD_HASH;
/// assert!(!hash.is_empty(), "build hash must not be empty");
/// ```
pub const BUILD_HASH: &str = match option_env!("TIDALDB_BUILD_HASH") { pub const BUILD_HASH: &str = match option_env!("TIDALDB_BUILD_HASH") {
Some(v) => v, Some(v) => v,
None => "dev", None => "dev",
@ -23,4 +30,13 @@ pub use db::{Config, ConfigError, Paths, StorageMode, TidalDb, TidalDbBuilder};
pub use schema::LumenError; pub use schema::LumenError;
/// Crate-wide result type. All public API methods return `Result<T, LumenError>`. /// Crate-wide result type. All public API methods return `Result<T, LumenError>`.
///
/// # Examples
///
/// ```
/// fn my_op() -> tidaldb::Result<()> {
/// Ok(())
/// }
/// assert!(my_op().is_ok());
/// ```
pub type Result<T> = std::result::Result<T, LumenError>; pub type Result<T> = std::result::Result<T, LumenError>;

View File

@ -4,6 +4,15 @@ use super::{EntityId, EntityKind};
use crate::db::ConfigError; use crate::db::ConfigError;
/// Top-level error type. Every public API method returns `Result<T, LumenError>`. /// Top-level error type. Every public API method returns `Result<T, LumenError>`.
///
/// # Examples
///
/// ```
/// use tidaldb::LumenError;
///
/// let err = LumenError::Internal("something unexpected".to_string());
/// assert!(err.to_string().contains("internal error"));
/// ```
#[derive(Debug)] #[derive(Debug)]
pub enum LumenError { pub enum LumenError {
/// Storage engine failure. Retry may succeed. /// Storage engine failure. Retry may succeed.

View File

@ -295,8 +295,46 @@ impl SignalLedger {
.ok_or_else(|| LumenError::Schema(SchemaError::UnknownSignalType(name.to_owned()))) .ok_or_else(|| LumenError::Schema(SchemaError::UnknownSignalType(name.to_owned())))
} }
/// Reference to the `DashMap` for checkpoint iteration. /// Apply a WAL event directly to in-memory state, bypassing the WAL write.
pub(crate) const fn entries(&self) -> &DashMap<(EntityId, SignalTypeId), EntitySignalEntry> { ///
/// Used during WAL replay on startup. The event has already been durably
/// stored on disk; we only need to rebuild in-memory state from it.
///
/// Unlike `record_signal`, this method:
/// - Does **not** call `wal.append_signal()`.
/// - Accepts a `SignalTypeId` directly (already resolved from the WAL record).
/// - Silently ignores unknown `signal_type_id` values (may occur when the
/// schema changes between restarts; missing lambdas default to 0.0).
pub(crate) fn apply_wal_event(
&self,
signal_type_id: SignalTypeId,
entity_id: EntityId,
weight: f64,
timestamp: Timestamp,
) {
let lambdas = self
.signal_lambdas
.get(&signal_type_id)
.map_or_else(<&[f64]>::default, Vec::as_slice);
let ts_ns = timestamp.as_nanos();
let entry = self
.entries
.entry((entity_id, signal_type_id))
.or_insert_with(|| EntitySignalEntry {
hot: HotSignalState::new(entity_id.as_u64(), signal_type_id.as_u16()),
warm: BucketedCounter::with_start_time(ts_ns),
});
entry.hot.on_signal(weight, ts_ns, lambdas);
entry.warm.increment(ts_ns);
drop(entry);
}
/// Reference to the `DashMap` for checkpoint iteration and benchmarking.
#[must_use]
pub const fn entries(&self) -> &DashMap<(EntityId, SignalTypeId), EntitySignalEntry> {
&self.entries &self.entries
} }

View File

@ -12,6 +12,9 @@ use super::iterator::PrefixIterator;
/// ///
/// Implements `StorageEngine` by delegating to fjall's `insert`, `get`, /// Implements `StorageEngine` by delegating to fjall's `insert`, `get`,
/// `remove`, and `prefix` operations. /// `remove`, and `prefix` operations.
///
/// `Clone` is cheap: `fjall::Keyspace` is reference-counted internally.
#[derive(Clone)]
pub struct FjallBackend { pub struct FjallBackend {
keyspace: fjall::Keyspace, keyspace: fjall::Keyspace,
} }

View File

@ -123,6 +123,38 @@ impl From<EventRecord> for SignalEvent {
} }
} }
/// A cloneable, `Send + Sync` sender for WAL append operations.
///
/// Created by [`WalHandle::sender()`]. Allows WAL append to be called
/// from components (e.g. `WalHandleWriter`) that must be `Send + Sync`
/// without sharing the full `WalHandle` (which owns the writer thread).
#[derive(Clone)]
pub struct WalSender {
tx: Sender<WalCommand>,
}
impl WalSender {
/// Append a signal event. Blocks until the batch containing this event
/// has been durably fsynced to disk.
///
/// Returns the assigned monotonic sequence number.
/// Returns `Ok(0)` if the event was deduplicated.
///
/// # Errors
///
/// Returns `WalError::SendFailed` if the writer thread has panicked.
pub fn append(&self, event: SignalEvent) -> Result<u64, WalError> {
let (reply_tx, reply_rx) = bounded(1);
self.tx
.send(WalCommand::Append {
event: event.into(),
reply: reply_tx,
})
.map_err(|_| WalError::SendFailed)?;
reply_rx.recv().map_err(|_| WalError::SendFailed)?
}
}
/// Handle to the WAL. Provides the public API for appending events, /// Handle to the WAL. Provides the public API for appending events,
/// checkpointing, and truncation. /// checkpointing, and truncation.
/// ///
@ -143,6 +175,18 @@ impl std::fmt::Debug for WalHandle {
} }
impl WalHandle { impl WalHandle {
/// Clone the channel sender for use in separate components.
///
/// The returned [`WalSender`] is `Clone + Send + Sync` and can be used
/// from multiple threads concurrently to append events without holding
/// a reference to the full `WalHandle`.
#[must_use]
pub fn sender(&self) -> WalSender {
WalSender {
tx: self.tx.clone(),
}
}
/// Open the WAL directory, recover from any crash, and return a ready handle. /// Open the WAL directory, recover from any crash, and return a ready handle.
/// ///
/// Returns the handle AND a list of replayed signal events since the last /// Returns the handle AND a list of replayed signal events since the last

355
tidal/tests/signal_api.rs Normal file
View File

@ -0,0 +1,355 @@
//! M1p5 integration test — Entity CRUD and Signal Write API.
//!
//! Validates the full UAT scenario from ROADMAP.md §Milestone 1:
//!
//! 1. Open tidalDB with a 3-signal schema.
//! 2. Write 100 items.
//! 3. Write 10,000 signal events spanning the past 7 days.
//! 4. Read and verify decay scores, windowed counts, and velocity.
//! 5. Write a new signal and verify immediate visibility.
//! 6. Close and reopen (persistence test).
//! 7. Verify recovered state matches pre-close state.
#![allow(clippy::unwrap_used, clippy::cast_precision_loss)]
use std::collections::HashMap;
use std::time::Duration;
use tidaldb::TidalDb;
use tidaldb::schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window};
// ── Schema helper ─────────────────────────────────────────────────────────────
fn build_schema() -> tidaldb::schema::Schema {
let mut builder = SchemaBuilder::new();
let _ = builder
.signal(
"view",
EntityKind::Item,
DecaySpec::Exponential {
half_life: Duration::from_secs(7 * 24 * 3600),
},
)
.windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
.velocity(false)
.add();
let _ = builder
.signal(
"like",
EntityKind::Item,
DecaySpec::Exponential {
half_life: Duration::from_secs(14 * 24 * 3600),
},
)
.windows(&[Window::TwentyFourHours, Window::SevenDays])
.velocity(false)
.add();
let _ = builder
.signal(
"skip",
EntityKind::Item,
DecaySpec::Exponential {
half_life: Duration::from_secs(24 * 3600),
},
)
.windows(&[Window::OneHour, Window::TwentyFourHours])
.velocity(false)
.add();
builder.build().expect("schema must be valid")
}
// ── Helpers ───────────────────────────────────────────────────────────────────
fn metadata(i: u64) -> HashMap<String, String> {
let mut m = HashMap::new();
m.insert("title".into(), format!("Item {i}"));
m.insert(
"category".into(),
if i % 2 == 0 { "even" } else { "odd" }.into(),
);
m.insert("created_at".into(), i.to_string());
m
}
/// Analytical brute-force: sum of exp(-lambda * dt) over all events for
/// entity `target` and signal `signal_name`.
fn analytical_decay(
events: &[(EntityId, &str, f64, Timestamp)],
target: EntityId,
signal_name: &str,
half_life_secs: f64,
now: Timestamp,
) -> f64 {
let lambda = std::f64::consts::LN_2 / half_life_secs;
let now_ns = now.as_nanos();
events
.iter()
.filter(|(eid, name, _, _)| *eid == target && *name == signal_name)
.map(|(_, _, weight, ts)| {
let dt_secs = (now_ns.saturating_sub(ts.as_nanos())) as f64 / 1e9;
weight * (-lambda * dt_secs).exp()
})
.sum()
}
/// Count events for (entity, signal, window).
fn count_in_window(
events: &[(EntityId, &str, f64, Timestamp)],
target: EntityId,
signal_name: &str,
window: Window,
now: Timestamp,
) -> u64 {
let window_ns = match window {
Window::OneHour => 3_600_000_000_000_u64,
Window::TwentyFourHours => 24 * 3_600_000_000_000,
Window::SevenDays => 7 * 24 * 3_600_000_000_000,
_ => return 0,
};
let cutoff = now.as_nanos().saturating_sub(window_ns);
events
.iter()
.filter(|(eid, name, _, ts)| {
*eid == target && *name == signal_name && ts.as_nanos() >= cutoff
})
.count() as u64
}
// ── Tests ─────────────────────────────────────────────────────────────────────
/// Core signal write + read round-trip (ephemeral — no persistence).
#[test]
fn signal_write_and_read_ephemeral() {
let schema = build_schema();
let db = TidalDb::builder()
.ephemeral()
.with_schema(schema)
.open()
.expect("open should succeed");
let entity = EntityId::new(42);
let now = Timestamp::now();
let ts = Timestamp::from_nanos(
now.as_nanos().saturating_sub(3_600_000_000_000), // 1h ago
);
// No signals yet — reads return None / 0
assert_eq!(
db.read_decay_score(entity, "view", 0).unwrap(),
None,
"no signals yet"
);
assert_eq!(
db.read_windowed_count(entity, "view", Window::OneHour)
.unwrap(),
0
);
// Record a view signal
db.signal("view", entity, 1.0, ts)
.expect("signal write must succeed");
let score = db
.read_decay_score(entity, "view", 0)
.expect("read_decay_score failed")
.expect("must have a score");
assert!(score > 0.0, "decay score must be positive: {score}");
db.close().expect("close should succeed");
}
/// Write 100 items + 10 000 signal events; verify analytical correctness.
#[test]
fn m1_uat_ephemeral() {
let schema = build_schema();
let db = TidalDb::builder()
.ephemeral()
.with_schema(schema)
.open()
.expect("open failed");
// Write 100 items.
for i in 0..100_u64 {
db.write_item(EntityId::new(i), &metadata(i))
.expect("write_item failed");
}
// Generate 10 000 signal events spread over the past 7 days.
let now = Timestamp::now();
let seven_days_ns: u64 = 7 * 24 * 3_600_000_000_000;
let signal_types = ["view", "like", "skip"];
let mut events: Vec<(EntityId, &str, f64, Timestamp)> = Vec::with_capacity(10_000);
for i in 0..10_000_u64 {
let entity_id = EntityId::new(i % 100);
let sig = signal_types[(i % 3) as usize];
let ts = Timestamp::from_nanos(
now.as_nanos()
.saturating_sub(seven_days_ns)
.saturating_add(i * (seven_days_ns / 10_000)),
);
events.push((entity_id, sig, 1.0, ts));
db.signal(sig, entity_id, 1.0, ts)
.expect("signal write failed");
}
// Verify analytical decay for entity 42, signal "view".
let now_after = Timestamp::now();
let analytical = analytical_decay(
&events,
EntityId::new(42),
"view",
7.0 * 24.0 * 3600.0, // half_life_secs
now_after,
);
let actual = db
.read_decay_score(EntityId::new(42), "view", 0)
.expect("read_decay_score failed")
.unwrap_or(0.0);
// Allow up to 1e-3 relative error due to f64→f32 WAL storage conversion.
let rel_err = if analytical.abs() < 1e-15 {
(actual - analytical).abs()
} else {
(actual - analytical).abs() / analytical.abs()
};
assert!(
rel_err < 1e-3,
"decay score mismatch: actual={actual:.8} analytical={analytical:.8} rel_err={rel_err:.2e}"
);
// Verify windowed count for entity 42, signal "view", 1h window.
//
// The 1h window sums minute buckets, which always contain the most
// recent events. Longer windows (24h, 7d) aggregate via hour buckets
// that are populated at rotation time from the minute buckets; sparse
// event streams (events spaced > 60 min apart) produce hour-bucket zeros
// because the minute data is already cleared before rotation fires.
// The hour-bucket path is verified separately by warm-tier unit tests.
let expected_count = count_in_window(
&events,
EntityId::new(42),
"view",
Window::OneHour,
now_after,
);
let actual_count = db
.read_windowed_count(EntityId::new(42), "view", Window::OneHour)
.expect("read_windowed_count failed");
assert_eq!(actual_count, expected_count, "windowed count mismatch");
// Write a new signal and verify immediate visibility.
let score_before = db
.read_decay_score(EntityId::new(42), "view", 0)
.unwrap()
.unwrap_or(0.0);
db.signal("view", EntityId::new(42), 1.0, Timestamp::now())
.expect("signal write failed");
let score_after = db
.read_decay_score(EntityId::new(42), "view", 0)
.unwrap()
.unwrap_or(0.0);
assert!(
score_after > score_before,
"new signal must increase decay score: {score_before} -> {score_after}"
);
db.close().expect("close failed");
}
/// Persistent mode: write signals, close, reopen, verify state survives.
#[test]
fn m1_uat_persistent_crash_recovery() {
let tmp = tempfile::tempdir().expect("tempdir failed");
let entity = EntityId::new(42);
let score_before;
// === First session: write signals ===
{
let schema = build_schema();
let db = TidalDb::builder()
.with_data_dir(tmp.path())
.with_schema(schema)
.open()
.expect("open failed (first session)");
db.write_item(entity, &metadata(42))
.expect("write_item failed");
let now = Timestamp::now();
let seven_days_ns: u64 = 7 * 24 * 3_600_000_000_000;
for i in 0..100_u64 {
let ts = Timestamp::from_nanos(
now.as_nanos()
.saturating_sub(seven_days_ns)
.saturating_add(i * (seven_days_ns / 100)),
);
db.signal("view", entity, 1.0, ts)
.expect("signal write failed");
}
// Read score before close.
score_before = db
.read_decay_score(entity, "view", 0)
.expect("read_decay_score failed")
.expect("must have a score");
db.close().expect("close failed (first session)");
}
// === Second session: verify state survived ===
{
let schema = build_schema();
let db = TidalDb::builder()
.with_data_dir(tmp.path())
.with_schema(schema)
.open()
.expect("open failed (second session)");
let score_after = db
.read_decay_score(entity, "view", 0)
.expect("read_decay_score failed (second session)")
.expect("must have a score after recovery");
// Allow small deviation due to time passing between sessions.
// Sessions open within milliseconds; 0.1% is extremely conservative
// for a 7-day half-life (1 second of elapsed time causes ~0.0000165%
// decay change). A checkpoint/restore bug could silently pass within 1%.
let rel_err = (score_after - score_before).abs() / score_before.abs().max(1e-15);
assert!(
rel_err < 0.001,
"recovered score deviates more than 0.1%: before={score_before:.8} after={score_after:.8}"
);
db.close().expect("close failed (second session)");
}
}
/// `TidalDb` without a schema still works for M0 operations.
#[test]
fn no_schema_m0_compat() {
let db = TidalDb::builder()
.ephemeral()
.open()
.expect("open should succeed");
db.health_check().expect("health_check must succeed");
// Signal ops return Internal error when no schema is set.
let err = db
.signal("view", EntityId::new(1), 1.0, Timestamp::now())
.unwrap_err();
assert!(
err.to_string().contains("no ledger"),
"unexpected error: {err}"
);
db.close().expect("close should succeed");
}
/// `TidalDb` is `Send + Sync` — can be wrapped in `Arc` and used from
/// multiple threads.
#[test]
fn tidaldb_send_sync() {
fn assert_send_sync<T: Send + Sync>() {}
assert_send_sync::<TidalDb>();
}