feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs) - M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates) - M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking) - M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators) - Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.) - Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.) - Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers - Add benches: fusion, search, session, text_index - Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index) - Update blog posts, roadmap, content strategy, and M5 planning docs - Add tmp/ and .claude/worktrees/ to .gitignore Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
39ada28c6e
commit
192c473f55
@ -139,8 +139,27 @@ Action: Fix failures and re-run /uat milestone {N} phase {N}
|
||||
|
||||
{If ACCEPT:}
|
||||
Milestone {N} Phase {N}.{N} is ACCEPTED.
|
||||
|
||||
**MANDATORY: Make exactly ONE edit to docs/planning/ROADMAP.md before declaring acceptance.**
|
||||
|
||||
The Current Status table is the single source of truth for completion. Do not write status anywhere else.
|
||||
|
||||
1. Open `docs/planning/ROADMAP.md`
|
||||
2. Find the Current Status table (under `## Current Status`)
|
||||
3. Add or update the row for this phase:
|
||||
- If this phase has a row marked NOT STARTED, change it to COMPLETE and fill in the test count
|
||||
- If there is no row yet, add one: `| **m{N}p{N}: {Phase Name}** | COMPLETE | {test count} passing |`
|
||||
4. Update the `**Next:**` line below the table to name the next phase
|
||||
|
||||
DO NOT:
|
||||
- Write "COMPLETE" to any OVERVIEW.md file
|
||||
- Write "COMPLETE" to any phase header inside ROADMAP.md's milestone sections
|
||||
- Add a "Lessons learned" or "Current phase" entry anywhere
|
||||
- Mark checkboxes in phase sections of ROADMAP.md
|
||||
|
||||
{If this is the final phase in the milestone:}
|
||||
All phases accepted. Milestone {N} UAT scenario can now be tested end-to-end.
|
||||
Update the `**Next:**` line to name the next milestone.
|
||||
{Otherwise:}
|
||||
Ready for: /milestone plan milestone {N} phase {N+1} (or /implement if already planned)
|
||||
```
|
||||
@ -209,6 +228,7 @@ Before finalizing acceptance, challenge:
|
||||
- ALWAYS run the full test suite (not just new tests)
|
||||
- ALWAYS present evidence (test name, measured value) for every pass
|
||||
- ALWAYS state the next step after ACCEPT or REJECT
|
||||
- ALWAYS update `docs/planning/ROADMAP.md` Current Status table on ACCEPT before declaring the phase done
|
||||
|
||||
## When Things Go Wrong
|
||||
|
||||
|
||||
@ -23,7 +23,7 @@ The content strategy defines 16-20 posts across the full roadmap. Do not invent
|
||||
|
||||
### Current Queue (update as phases complete)
|
||||
|
||||
As of M1 + M2 complete:
|
||||
As of M4 complete, posts 1-11 written:
|
||||
|
||||
| Priority | Post | Status |
|
||||
|----------|------|--------|
|
||||
@ -32,11 +32,40 @@ As of M1 + M2 complete:
|
||||
| 3 | Post 3: "What three databases taught us before we wrote a line of code" | PUBLISHED |
|
||||
| 4 | Post 4: "Signals wrote 100ms ago. The query sees them now." | PUBLISHED |
|
||||
| 5 | Post 5: "One query. Six systems. Under 50 milliseconds." | PUBLISHED |
|
||||
| 6 | Post 6: "Diversity enforcement in 3 microseconds" | Ready -- M2 complete |
|
||||
| 7 | Post 7: "Ranking profiles are data, not code" | Ready -- M2 complete |
|
||||
| 8 | "Why we chose fjall over RocksDB (for now)" | Ready -- M1 complete |
|
||||
| 9 | "Why not SQL" | Ready -- best paired with M2 (now shipped) |
|
||||
| 10 | "USearch, not from scratch" | Ready -- M2 complete |
|
||||
| 6 | Post 6: "Diversity enforcement in 3 microseconds" | PUBLISHED |
|
||||
| 7 | Post 7: "Ranking profiles are data, not code" | PUBLISHED |
|
||||
| 8 | Post 8: "The feedback loop that closes in one write" | PUBLISHED |
|
||||
| 9 | Post 9: "Negative signals are equal citizens" | PUBLISHED |
|
||||
| 10 | Post 10: "Cold start without application logic" | PUBLISHED |
|
||||
| 11 | Post 11: "Search and ranking are the same system" | PUBLISHED |
|
||||
| — | "Why we chose fjall over RocksDB (for now)" | Ready -- anytime ADR |
|
||||
| — | "Why not SQL" | Ready -- anytime ADR |
|
||||
| — | "USearch, not from scratch" | Ready -- anytime ADR |
|
||||
|
||||
### Code anchors for the three READY posts
|
||||
|
||||
**"Why we chose fjall over RocksDB (for now)"**
|
||||
- `tidal/src/storage/engine.rs` -- the `StorageEngine` trait; six methods, zero fjall imports -- this is the abstraction boundary
|
||||
- `tidal/src/storage/fjall.rs` -- `FjallBackend` implementing the trait; `fjall::Keyspace` is the only fjall type that crosses the boundary
|
||||
- `tidal/src/storage/memory.rs` -- `InMemoryBackend`; proves the trait is genuinely swappable (used in all tests)
|
||||
- `tidal/Cargo.toml` -- version pin, no `unsafe` in fjall feature flags
|
||||
- `thoughts.md` Part V.9 -- the architectural reasoning behind the choice
|
||||
- Thesis: the trait boundary is the argument. Show it. Then explain why the decision is reversible.
|
||||
|
||||
**"Why not SQL"**
|
||||
- `tidal/src/query/retrieve.rs` -- `RetrieveBuilder`; `for_user`, `profile`, `diversity`, `filter` are typed builder methods -- not string predicates; note `ProfileRef`, `DiversityConstraints`, `FilterExpr` are rich types
|
||||
- `tidal/src/ranking/profile.rs` -- `RankingProfile`, `CandidateStrategy`, `SignalBoost`; show that a profile encodes retrieval mode + scoring weights + sort logic -- no SQL analogue
|
||||
- `tidal/src/query/executor.rs` -- `for_user` dispatch loads preference vector and interaction ledger; this is stateful user context that SQL has no model for
|
||||
- `thoughts.md` Part II.4 -- the reasoning for a custom query language
|
||||
- Thesis: show a `Retrieve::builder()` call next to what the equivalent SQL would require (JOIN preference_vectors, JOIN interaction_weights, computed ranking expression, HAVING diversity constraint). The SQL falls apart. The builder does not.
|
||||
|
||||
**"USearch, not from scratch"**
|
||||
- `tidal/src/storage/vector/mod.rs` -- `VectorIndex` trait and the module comment; read the design decisions section carefully (VectorId = u64, L2 squared, ef_search uniformity); the rest of the codebase never imports `usearch` directly
|
||||
- `tidal/src/storage/vector/usearch_index.rs` -- `UsearchIndex`; the wrapper is intentionally thin; count the lines that cross the FFI boundary
|
||||
- `tidal/src/storage/vector/planner.rs` -- `AdaptiveQueryPlanner` with four strategies; this is tidalDB's value-add on top of the borrowed index; show the strategy dispatch
|
||||
- `tidal/src/storage/vector/brute.rs` -- `BruteForceIndex` and `MockVectorIndex`; proves the trait boundary allows correctness baselines without touching USearch
|
||||
- `docs/research/ann_for_tidaldb.md` -- the prior art survey; cite the production users (ScyllaDB, ClickHouse, DuckDB)
|
||||
- Thesis: the `VectorIndex` trait is a six-method interface. Show it. Then show that `UsearchIndex` is ~150 lines wrapping it. Then show the `AdaptiveQueryPlanner` -- that is what we built. The six months of HNSW engineering is USearch's problem.
|
||||
|
||||
## When to Use
|
||||
|
||||
|
||||
4
.gitignore
vendored
4
.gitignore
vendored
@ -30,3 +30,7 @@ logs/
|
||||
*~
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
|
||||
# Ephemeral / scratch
|
||||
tmp/
|
||||
.claude/worktrees/
|
||||
|
||||
525
Cargo.lock
generated
525
Cargo.lock
generated
@ -49,7 +49,7 @@ dependencies = [
|
||||
"mime",
|
||||
"percent-encoding",
|
||||
"pin-project-lite",
|
||||
"rand",
|
||||
"rand 0.9.2",
|
||||
"sha1",
|
||||
"smallvec",
|
||||
"tokio",
|
||||
@ -215,6 +215,12 @@ dependencies = [
|
||||
"alloc-no-stdlib",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "allocator-api2"
|
||||
version = "0.2.21"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "683d7910e743518b0e34f1186f92494becacb047c7b6bf616c96772180fef923"
|
||||
|
||||
[[package]]
|
||||
name = "anes"
|
||||
version = "0.1.6"
|
||||
@ -233,6 +239,15 @@ version = "1.0.102"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c"
|
||||
|
||||
[[package]]
|
||||
name = "arc-swap"
|
||||
version = "1.8.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "f9f3647c145568cec02c42054e07bdf9a5a698e15b466fb2341bfc393cd24aa5"
|
||||
dependencies = [
|
||||
"rustversion",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "arrayref"
|
||||
version = "0.3.9"
|
||||
@ -245,6 +260,17 @@ version = "0.7.6"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50"
|
||||
|
||||
[[package]]
|
||||
name = "async-trait"
|
||||
version = "0.1.89"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "9035ad2d096bed7955a320ee7e2230574d28fd3c3a0f186cbea1ff3c7eed5dbb"
|
||||
dependencies = [
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
"syn",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "atomic-waker"
|
||||
version = "1.1.2"
|
||||
@ -336,6 +362,15 @@ version = "2.11.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "843867be96c8daad0d758b57df9392b6d8d271134fce549de6ce169ff98a92af"
|
||||
|
||||
[[package]]
|
||||
name = "bitpacking"
|
||||
version = "0.9.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "96a7139abd3d9cebf8cd6f920a389cf3dc9576172e32f4563f188cae3c3eb019"
|
||||
dependencies = [
|
||||
"crunchy",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "blake3"
|
||||
version = "1.8.3"
|
||||
@ -443,6 +478,12 @@ dependencies = [
|
||||
"shlex",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "census"
|
||||
version = "0.4.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "4f4c707c6a209cbe82d10abd08e1ea8995e9ea937d2550646e02798948992be0"
|
||||
|
||||
[[package]]
|
||||
name = "cfg-if"
|
||||
version = "1.0.4"
|
||||
@ -575,7 +616,7 @@ dependencies = [
|
||||
"clap",
|
||||
"criterion-plot",
|
||||
"is-terminal",
|
||||
"itertools",
|
||||
"itertools 0.10.5",
|
||||
"num-traits",
|
||||
"once_cell",
|
||||
"oorandom",
|
||||
@ -596,7 +637,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "6b50826342786a51a89e2da3a28f1c32b06e387201bc2d19791f622c673706b1"
|
||||
dependencies = [
|
||||
"cast",
|
||||
"itertools",
|
||||
"itertools 0.10.5",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -764,6 +805,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "cc3dc5ad92c2e2d1c193bbbbdf2ea477cb81331de4f3103f267ca18368b988c4"
|
||||
dependencies = [
|
||||
"powerfmt",
|
||||
"serde_core",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -810,6 +852,12 @@ dependencies = [
|
||||
"syn",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "downcast-rs"
|
||||
version = "1.2.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "75b325c5dbd37f80359721ad39aca5a29fb04c89279657cffdda8736d0c0b9d2"
|
||||
|
||||
[[package]]
|
||||
name = "either"
|
||||
version = "1.15.0"
|
||||
@ -853,6 +901,12 @@ dependencies = [
|
||||
"windows-sys 0.61.2",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "fastdivide"
|
||||
version = "0.4.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "9afc2bd4d5a73106dd53d10d73d3401c2f32730ba2c0b93ddb888a8983680471"
|
||||
|
||||
[[package]]
|
||||
name = "fastrand"
|
||||
version = "2.3.0"
|
||||
@ -928,6 +982,16 @@ dependencies = [
|
||||
"percent-encoding",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "fs4"
|
||||
version = "0.8.4"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "f7e180ac76c23b45e767bd7ae9579bc0bb458618c4bc71835926e098e61d15f8"
|
||||
dependencies = [
|
||||
"rustix 0.38.44",
|
||||
"windows-sys 0.52.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "futures-channel"
|
||||
version = "0.3.32"
|
||||
@ -977,6 +1041,17 @@ dependencies = [
|
||||
"version_check",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "getrandom"
|
||||
version = "0.2.17"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0"
|
||||
dependencies = [
|
||||
"cfg-if",
|
||||
"libc",
|
||||
"wasi",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "getrandom"
|
||||
version = "0.3.4"
|
||||
@ -1044,6 +1119,8 @@ version = "0.15.5"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1"
|
||||
dependencies = [
|
||||
"allocator-api2",
|
||||
"equivalent",
|
||||
"foldhash 0.1.5",
|
||||
]
|
||||
|
||||
@ -1065,6 +1142,12 @@ version = "0.5.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c"
|
||||
|
||||
[[package]]
|
||||
name = "htmlescape"
|
||||
version = "0.3.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e9025058dae765dee5070ec375f591e2ba14638c63feff74f13805a72e523163"
|
||||
|
||||
[[package]]
|
||||
name = "http"
|
||||
version = "0.2.12"
|
||||
@ -1283,6 +1366,18 @@ dependencies = [
|
||||
"serde_core",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "instant"
|
||||
version = "0.1.13"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e0242819d153cba4b4b05a5a8f2a7e9bbf97b6055b2a002b395c96b5ff3c0222"
|
||||
dependencies = [
|
||||
"cfg-if",
|
||||
"js-sys",
|
||||
"wasm-bindgen",
|
||||
"web-sys",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "interval-heap"
|
||||
version = "0.0.5"
|
||||
@ -1312,6 +1407,15 @@ dependencies = [
|
||||
"either",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "itertools"
|
||||
version = "0.12.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "ba291022dbbd398a455acf126c1e341954079855bc60dfdda641363bd6922569"
|
||||
dependencies = [
|
||||
"either",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "itoa"
|
||||
version = "1.0.17"
|
||||
@ -1356,12 +1460,24 @@ version = "0.1.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "09edd9e8b54e49e587e4f6295a7d29c3ea94d469cb40ab8ca70b288248a81db2"
|
||||
|
||||
[[package]]
|
||||
name = "levenshtein_automata"
|
||||
version = "0.2.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "0c2cdeb66e45e9f36bfad5bbdb4d2384e70936afbee843c6f6543f0c551ebb25"
|
||||
|
||||
[[package]]
|
||||
name = "libc"
|
||||
version = "0.2.182"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "6800badb6cb2082ffd7b6a67e6125bb39f18782f793520caee8cb8846be06112"
|
||||
|
||||
[[package]]
|
||||
name = "libm"
|
||||
version = "0.2.16"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "b6d2cec3eae94f9f509c767b45932f1ada8350c4bdb85af2fcab4a3c14807981"
|
||||
|
||||
[[package]]
|
||||
name = "link-cplusplus"
|
||||
version = "1.0.12"
|
||||
@ -1371,6 +1487,12 @@ dependencies = [
|
||||
"cc",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "linux-raw-sys"
|
||||
version = "0.4.15"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d26c52dbd32dccf2d10cac7725f8eae5296885fb5703b261f7d0a0739ec807ab"
|
||||
|
||||
[[package]]
|
||||
name = "linux-raw-sys"
|
||||
version = "0.11.0"
|
||||
@ -1415,6 +1537,15 @@ version = "0.4.29"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897"
|
||||
|
||||
[[package]]
|
||||
name = "lru"
|
||||
version = "0.12.5"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "234cf4f4a04dc1f57e24b96cc0cd600cf2af460d4161ac5ecdd0af8e1f3b2a38"
|
||||
dependencies = [
|
||||
"hashbrown 0.15.5",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lsm-tree"
|
||||
version = "3.0.2"
|
||||
@ -1429,7 +1560,7 @@ dependencies = [
|
||||
"log",
|
||||
"lz4_flex",
|
||||
"quick_cache",
|
||||
"rustc-hash",
|
||||
"rustc-hash 2.1.1",
|
||||
"self_cell",
|
||||
"sfa",
|
||||
"tempfile",
|
||||
@ -1461,18 +1592,43 @@ version = "0.8.4"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "47e1ffaa40ddd1f3ed91f717a33c8c0ee23fff369e3aa8772b9605cc1d22f4c3"
|
||||
|
||||
[[package]]
|
||||
name = "measure_time"
|
||||
version = "0.8.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "dbefd235b0aadd181626f281e1d684e116972988c14c264e42069d5e8a5775cc"
|
||||
dependencies = [
|
||||
"instant",
|
||||
"log",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "memchr"
|
||||
version = "2.8.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79"
|
||||
|
||||
[[package]]
|
||||
name = "memmap2"
|
||||
version = "0.9.10"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "714098028fe011992e1c3962653c96b2d578c4b4bce9036e15ff220319b1e0e3"
|
||||
dependencies = [
|
||||
"libc",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "mime"
|
||||
version = "0.3.17"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a"
|
||||
|
||||
[[package]]
|
||||
name = "minimal-lexical"
|
||||
version = "0.2.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a"
|
||||
|
||||
[[package]]
|
||||
name = "miniz_oxide"
|
||||
version = "0.8.9"
|
||||
@ -1495,6 +1651,22 @@ dependencies = [
|
||||
"windows-sys 0.61.2",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "murmurhash32"
|
||||
version = "0.3.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "2195bf6aa996a481483b29d62a7663eed3fe39600c460e323f8ff41e90bdd89b"
|
||||
|
||||
[[package]]
|
||||
name = "nom"
|
||||
version = "7.1.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d273983c5a657a70a3e8f2a01329822f3b8c8172b73826411a55751e404a0a4a"
|
||||
dependencies = [
|
||||
"memchr",
|
||||
"minimal-lexical",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nu-ansi-term"
|
||||
version = "0.50.3"
|
||||
@ -1517,6 +1689,17 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841"
|
||||
dependencies = [
|
||||
"autocfg",
|
||||
"libm",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "num_cpus"
|
||||
version = "1.17.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "91df4bbde75afed763b708b7eee1e8e7651e02d97f6d5dd763e89367e957b23b"
|
||||
dependencies = [
|
||||
"hermit-abi",
|
||||
"libc",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -1525,12 +1708,27 @@ version = "1.21.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d"
|
||||
|
||||
[[package]]
|
||||
name = "oneshot"
|
||||
version = "0.1.13"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "269bca4c2591a28585d6bf10d9ed0332b7d76900a1b02bec41bdc3a2cdcda107"
|
||||
|
||||
[[package]]
|
||||
name = "oorandom"
|
||||
version = "11.1.5"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d6790f58c7ff633d8771f42965289203411a5e5c68388703c06e14f24770b41e"
|
||||
|
||||
[[package]]
|
||||
name = "ownedbytes"
|
||||
version = "0.7.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "c3a059efb063b8f425b948e042e6b9bd85edfe60e913630ed727b23e2dfcc558"
|
||||
dependencies = [
|
||||
"stable_deref_trait",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "parking_lot"
|
||||
version = "0.12.5"
|
||||
@ -1659,8 +1857,8 @@ dependencies = [
|
||||
"bit-vec",
|
||||
"bitflags",
|
||||
"num-traits",
|
||||
"rand",
|
||||
"rand_chacha",
|
||||
"rand 0.9.2",
|
||||
"rand_chacha 0.9.0",
|
||||
"rand_xorshift",
|
||||
"regex-syntax",
|
||||
"rusty-fork",
|
||||
@ -1699,14 +1897,35 @@ version = "5.3.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f"
|
||||
|
||||
[[package]]
|
||||
name = "rand"
|
||||
version = "0.8.5"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404"
|
||||
dependencies = [
|
||||
"libc",
|
||||
"rand_chacha 0.3.1",
|
||||
"rand_core 0.6.4",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rand"
|
||||
version = "0.9.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "6db2770f06117d490610c7488547d543617b21bfa07796d7a12f6f1bd53850d1"
|
||||
dependencies = [
|
||||
"rand_chacha",
|
||||
"rand_core",
|
||||
"rand_chacha 0.9.0",
|
||||
"rand_core 0.9.5",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rand_chacha"
|
||||
version = "0.3.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88"
|
||||
dependencies = [
|
||||
"ppv-lite86",
|
||||
"rand_core 0.6.4",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -1716,7 +1935,16 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb"
|
||||
dependencies = [
|
||||
"ppv-lite86",
|
||||
"rand_core",
|
||||
"rand_core 0.9.5",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rand_core"
|
||||
version = "0.6.4"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c"
|
||||
dependencies = [
|
||||
"getrandom 0.2.17",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -1728,13 +1956,23 @@ dependencies = [
|
||||
"getrandom 0.3.4",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rand_distr"
|
||||
version = "0.4.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "32cb0b9bc82b0a0876c2dd994a7e7a2683d3e7390ca40e6886785ef0c7e3ee31"
|
||||
dependencies = [
|
||||
"num-traits",
|
||||
"rand 0.8.5",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rand_xorshift"
|
||||
version = "0.4.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "513962919efc330f829edb2535844d1b912b0fbe2ca165d613e4e8788bb05a5a"
|
||||
dependencies = [
|
||||
"rand_core",
|
||||
"rand_core 0.9.5",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -1811,6 +2049,22 @@ dependencies = [
|
||||
"byteorder",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rust-stemmers"
|
||||
version = "1.2.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e46a2036019fdb888131db7a4c847a1063a7493f971ed94ea82c67eada63ca54"
|
||||
dependencies = [
|
||||
"serde",
|
||||
"serde_derive",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rustc-hash"
|
||||
version = "1.1.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "08d43f7aa6b08d49f382cde6a7982047c3426db949b1424bc4b7ec9ae12c6ce2"
|
||||
|
||||
[[package]]
|
||||
name = "rustc-hash"
|
||||
version = "2.1.1"
|
||||
@ -1826,6 +2080,19 @@ dependencies = [
|
||||
"semver",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rustix"
|
||||
version = "0.38.44"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "fdb5bc1ae2baa591800df16c9ca78619bf65c0488b41b96ccec5d11220d8c154"
|
||||
dependencies = [
|
||||
"bitflags",
|
||||
"errno",
|
||||
"libc",
|
||||
"linux-raw-sys 0.4.15",
|
||||
"windows-sys 0.52.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rustix"
|
||||
version = "1.1.3"
|
||||
@ -1835,7 +2102,7 @@ dependencies = [
|
||||
"bitflags",
|
||||
"errno",
|
||||
"libc",
|
||||
"linux-raw-sys",
|
||||
"linux-raw-sys 0.11.0",
|
||||
"windows-sys 0.61.2",
|
||||
]
|
||||
|
||||
@ -2015,6 +2282,15 @@ version = "0.3.8"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e320a6c5ad31d271ad523dcf3ad13e2767ad8b1cb8f047f75a8aeaf8da139da2"
|
||||
|
||||
[[package]]
|
||||
name = "sketches-ddsketch"
|
||||
version = "0.2.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "85636c14b73d81f541e525f585c0a2109e6744e1565b5c1668e31c70c10ed65c"
|
||||
dependencies = [
|
||||
"serde",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "slab"
|
||||
version = "0.4.12"
|
||||
@ -2096,6 +2372,147 @@ dependencies = [
|
||||
"syn",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy"
|
||||
version = "0.22.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "96599ea6fccd844fc833fed21d2eecac2e6a7c1afd9e044057391d78b1feb141"
|
||||
dependencies = [
|
||||
"aho-corasick",
|
||||
"arc-swap",
|
||||
"base64",
|
||||
"bitpacking",
|
||||
"byteorder",
|
||||
"census",
|
||||
"crc32fast",
|
||||
"crossbeam-channel",
|
||||
"downcast-rs",
|
||||
"fastdivide",
|
||||
"fnv",
|
||||
"fs4",
|
||||
"htmlescape",
|
||||
"itertools 0.12.1",
|
||||
"levenshtein_automata",
|
||||
"log",
|
||||
"lru",
|
||||
"lz4_flex",
|
||||
"measure_time",
|
||||
"memmap2",
|
||||
"num_cpus",
|
||||
"once_cell",
|
||||
"oneshot",
|
||||
"rayon",
|
||||
"regex",
|
||||
"rust-stemmers",
|
||||
"rustc-hash 1.1.0",
|
||||
"serde",
|
||||
"serde_json",
|
||||
"sketches-ddsketch",
|
||||
"smallvec",
|
||||
"tantivy-bitpacker",
|
||||
"tantivy-columnar",
|
||||
"tantivy-common",
|
||||
"tantivy-fst",
|
||||
"tantivy-query-grammar",
|
||||
"tantivy-stacker",
|
||||
"tantivy-tokenizer-api",
|
||||
"tempfile",
|
||||
"thiserror 1.0.69",
|
||||
"time",
|
||||
"uuid",
|
||||
"winapi",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-bitpacker"
|
||||
version = "0.6.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "284899c2325d6832203ac6ff5891b297fc5239c3dc754c5bc1977855b23c10df"
|
||||
dependencies = [
|
||||
"bitpacking",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-columnar"
|
||||
version = "0.3.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "12722224ffbe346c7fec3275c699e508fd0d4710e629e933d5736ec524a1f44e"
|
||||
dependencies = [
|
||||
"downcast-rs",
|
||||
"fastdivide",
|
||||
"itertools 0.12.1",
|
||||
"serde",
|
||||
"tantivy-bitpacker",
|
||||
"tantivy-common",
|
||||
"tantivy-sstable",
|
||||
"tantivy-stacker",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-common"
|
||||
version = "0.7.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "8019e3cabcfd20a1380b491e13ff42f57bb38bf97c3d5fa5c07e50816e0621f4"
|
||||
dependencies = [
|
||||
"async-trait",
|
||||
"byteorder",
|
||||
"ownedbytes",
|
||||
"serde",
|
||||
"time",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-fst"
|
||||
version = "0.5.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d60769b80ad7953d8a7b2c70cdfe722bbcdcac6bccc8ac934c40c034d866fc18"
|
||||
dependencies = [
|
||||
"byteorder",
|
||||
"regex-syntax",
|
||||
"utf8-ranges",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-query-grammar"
|
||||
version = "0.22.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "847434d4af57b32e309f4ab1b4f1707a6c566656264caa427ff4285c4d9d0b82"
|
||||
dependencies = [
|
||||
"nom",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-sstable"
|
||||
version = "0.3.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "c69578242e8e9fc989119f522ba5b49a38ac20f576fc778035b96cc94f41f98e"
|
||||
dependencies = [
|
||||
"tantivy-bitpacker",
|
||||
"tantivy-common",
|
||||
"tantivy-fst",
|
||||
"zstd",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-stacker"
|
||||
version = "0.3.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "c56d6ff5591fc332739b3ce7035b57995a3ce29a93ffd6012660e0949c956ea8"
|
||||
dependencies = [
|
||||
"murmurhash32",
|
||||
"rand_distr",
|
||||
"tantivy-common",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tantivy-tokenizer-api"
|
||||
version = "0.3.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "2a0dcade25819a89cfe6f17d932c9cedff11989936bf6dd4f336d50392053b04"
|
||||
dependencies = [
|
||||
"serde",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tempfile"
|
||||
version = "3.25.0"
|
||||
@ -2105,7 +2522,7 @@ dependencies = [
|
||||
"fastrand",
|
||||
"getrandom 0.4.1",
|
||||
"once_cell",
|
||||
"rustix",
|
||||
"rustix 1.1.3",
|
||||
"windows-sys 0.61.2",
|
||||
]
|
||||
|
||||
@ -2118,6 +2535,46 @@ dependencies = [
|
||||
"winapi-util",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "thiserror"
|
||||
version = "1.0.69"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52"
|
||||
dependencies = [
|
||||
"thiserror-impl 1.0.69",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "thiserror"
|
||||
version = "2.0.18"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "4288b5bcbc7920c07a1149a35cf9590a2aa808e0bc1eafaade0b80947865fbc4"
|
||||
dependencies = [
|
||||
"thiserror-impl 2.0.18",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "thiserror-impl"
|
||||
version = "1.0.69"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1"
|
||||
dependencies = [
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
"syn",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "thiserror-impl"
|
||||
version = "2.0.18"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "ebc4ee7f67670e9b64d05fa4253e753e016c6c95ff35b89b7941d6b856dec1d5"
|
||||
dependencies = [
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
"syn",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "thread_local"
|
||||
version = "1.1.9"
|
||||
@ -2148,11 +2605,13 @@ dependencies = [
|
||||
"dashmap",
|
||||
"fjall",
|
||||
"proptest",
|
||||
"rand",
|
||||
"rand 0.9.2",
|
||||
"roaring",
|
||||
"serde",
|
||||
"serde_json",
|
||||
"tantivy",
|
||||
"tempfile",
|
||||
"thiserror 2.0.18",
|
||||
"tokio",
|
||||
"tracing",
|
||||
"tracing-subscriber",
|
||||
@ -2405,12 +2864,30 @@ dependencies = [
|
||||
"cxx-build",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "utf8-ranges"
|
||||
version = "1.0.5"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "7fcfc827f90e53a02eaef5e535ee14266c1d569214c6aa70133a624d8a3164ba"
|
||||
|
||||
[[package]]
|
||||
name = "utf8_iter"
|
||||
version = "1.0.4"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be"
|
||||
|
||||
[[package]]
|
||||
name = "uuid"
|
||||
version = "1.21.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "b672338555252d43fd2240c714dc444b8c6fb0a5c5335e65a07bba7742735ddb"
|
||||
dependencies = [
|
||||
"getrandom 0.4.1",
|
||||
"js-sys",
|
||||
"serde_core",
|
||||
"wasm-bindgen",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "valuable"
|
||||
version = "0.1.1"
|
||||
@ -2561,6 +3038,22 @@ dependencies = [
|
||||
"wasm-bindgen",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "winapi"
|
||||
version = "0.3.9"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
|
||||
dependencies = [
|
||||
"winapi-i686-pc-windows-gnu",
|
||||
"winapi-x86_64-pc-windows-gnu",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "winapi-i686-pc-windows-gnu"
|
||||
version = "0.4.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
|
||||
|
||||
[[package]]
|
||||
name = "winapi-util"
|
||||
version = "0.1.11"
|
||||
@ -2570,6 +3063,12 @@ dependencies = [
|
||||
"windows-sys 0.61.2",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "winapi-x86_64-pc-windows-gnu"
|
||||
version = "0.4.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
|
||||
|
||||
[[package]]
|
||||
name = "windows-link"
|
||||
version = "0.2.1"
|
||||
|
||||
@ -24,7 +24,7 @@ The audience is engineers who have built or are currently maintaining recommenda
|
||||
|
||||
These posts can be written before the engine is feature-complete. They draw on the vision, architecture research, and the problem space -- not on shipped code.
|
||||
|
||||
#### Post 1: "Every content platform builds the same 6 systems from scratch"
|
||||
#### Post 1: "Every content platform builds the same 6 systems from scratch" [PUBLISHED]
|
||||
|
||||
- **Type:** Vision / Problem Statement
|
||||
- **Thesis:** The Elasticsearch + Redis + Kafka + feature store + vector DB + ranking service stack is not an architecture. It is scar tissue. The seams between these systems are where correctness dies.
|
||||
@ -39,7 +39,7 @@ These posts can be written before the engine is feature-complete. They draw on t
|
||||
|
||||
M1 proves that temporal signals with O(1) decay, velocity, and windowed aggregation work as a database primitive. This is the most technically interesting milestone for blog content because the math is elegant and the performance numbers are dramatic.
|
||||
|
||||
#### Post 2: "Running decay scores are O(1) -- here is the math"
|
||||
#### Post 2: "Running decay scores are O(1) -- here is the math" [PUBLISHED]
|
||||
|
||||
- **Type:** Technical Deep Dive
|
||||
- **Roadmap phase:** m1p4 (Signal Ledger) completion
|
||||
@ -49,7 +49,7 @@ M1 proves that temporal signals with O(1) decay, velocity, and windowed aggregat
|
||||
- **Code to include:** The `EntitySignalState` struct. The forward-decay write path. The out-of-order event correction. Benchmark output showing 200-entity scoring pass under 5 microseconds.
|
||||
- **Why it matters:** This is the post that demonstrates tidalDB is not vaporware. The math is verifiable. The benchmarks are reproducible. Engineers who have implemented trending scores in Redis will immediately understand the value.
|
||||
|
||||
#### Post 3: "What three databases taught us before we wrote a line of code"
|
||||
#### Post 3: "What three databases taught us before we wrote a line of code" [PUBLISHED]
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Roadmap phase:** m1p1-m1p3 completion (the foundation phases)
|
||||
@ -59,7 +59,7 @@ M1 proves that temporal signals with O(1) decay, velocity, and windowed aggregat
|
||||
- **Code to include:** Key encoding format. Cache-line aligned struct. Group commit writer. Side-by-side comparison of the pattern in the source database and in tidalDB.
|
||||
- **Why it matters:** Engineers respect builders who study prior art. This post establishes technical credibility and shows the architectural foundation is grounded in real patterns, not invented from scratch.
|
||||
|
||||
#### Post 4: "Signals wrote 100ms ago. The query sees them now."
|
||||
#### Post 4: "Signals wrote 100ms ago. The query sees them now." [PUBLISHED]
|
||||
|
||||
- **Type:** Devlog / Milestone Announcement
|
||||
- **Roadmap phase:** m1p5 (Entity CRUD and Signal Write API) -- M1 complete
|
||||
@ -75,7 +75,7 @@ M1 proves that temporal signals with O(1) decay, velocity, and windowed aggregat
|
||||
|
||||
M2 proves that a single query can retrieve, filter, score, and enforce diversity over live signals. This is where tidalDB stops being a signal engine and starts being a database.
|
||||
|
||||
#### Post 5: "One query. Six systems. Under 50 milliseconds."
|
||||
#### Post 5: "One query. Six systems. Under 50 milliseconds." [PUBLISHED]
|
||||
|
||||
- **Type:** Technical Deep Dive / Announcement
|
||||
- **Roadmap phase:** m2p5 (RETRIEVE Query Executor) -- M2 complete
|
||||
@ -85,7 +85,7 @@ M2 proves that a single query can retrieve, filter, score, and enforce diversity
|
||||
- **Code to include:** The RETRIEVE query. The ranked result with signal snapshots. The trending profile definition. A before/after signal burst showing the ranking change.
|
||||
- **Why it matters:** This is the money post. The one-query thesis is no longer a vision document -- it is a benchmark. Engineers who operate the 6-system stack will immediately understand what this eliminates.
|
||||
|
||||
#### Post 6: "Diversity enforcement in 3 microseconds"
|
||||
#### Post 6: "Diversity enforcement in 3 microseconds" [PUBLISHED]
|
||||
|
||||
- **Type:** Technical Deep Dive
|
||||
- **Roadmap phase:** m2p4 (Diversity Enforcement)
|
||||
@ -95,7 +95,7 @@ M2 proves that a single query can retrieve, filter, score, and enforce diversity
|
||||
- **Code to include:** The DiversitySpec. The greedy selector. A concrete example showing reordering (creator A dominates pre-diversity, balanced post-diversity). Benchmark numbers.
|
||||
- **Why it matters:** Every team building a feed implements diversity in the API layer. Showing that it belongs in the database -- and costs 3 microseconds -- is a strong differentiator. This is the kind of post that gets shared in Slack channels.
|
||||
|
||||
#### Post 7: "Ranking profiles are data, not code"
|
||||
#### Post 7: "Ranking profiles are data, not code" [PUBLISHED]
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Roadmap phase:** m2p3 (Ranking Profile Engine)
|
||||
@ -111,7 +111,7 @@ M2 proves that a single query can retrieve, filter, score, and enforce diversity
|
||||
|
||||
M3 is where the feedback loop closes. Signal writes update the user's preference vector, the creator's interaction weight, and the item's signal ledger -- atomically, in one write. The "For You" query works.
|
||||
|
||||
#### Post 8: "The feedback loop that closes in one write"
|
||||
#### Post 8: "The feedback loop that closes in one write" [PUBLISHED]
|
||||
|
||||
- **Type:** Technical Deep Dive
|
||||
- **Roadmap phase:** m3p2 (Feedback Loop) completion
|
||||
@ -121,7 +121,7 @@ M3 is where the feedback loop closes. Signal writes update the user's preference
|
||||
- **Code to include:** The signal write. The 10-step atomic update path. A before/after query showing the preference shift. The property test that proves hidden items and blocked creators never surface.
|
||||
- **Why it matters:** The closed feedback loop is the core architectural thesis of tidalDB. This post proves it works. It is the strongest argument against the 6-system stack, because the stack's primary failure mode is feedback lag.
|
||||
|
||||
#### Post 9: "Negative signals are equal citizens"
|
||||
#### Post 9: "Negative signals are equal citizens" [PUBLISHED]
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Roadmap phase:** m3p2 (Feedback Loop)
|
||||
@ -131,7 +131,7 @@ M3 is where the feedback loop closes. Signal writes update the user's preference
|
||||
- **Code to include:** Signal type definitions for skip, hide, block. The penalty clause in a ranking profile. The property test: 10,000 random signal sequences never produce a result where a hidden item or blocked creator appears.
|
||||
- **Why it matters:** Most recommendation systems handle negative feedback as an afterthought -- a manual "not interested" button that writes to a separate blocklist. tidalDB's approach is architecturally different and engineers building these systems will recognize the improvement immediately.
|
||||
|
||||
#### Post 10: "Cold start without application logic"
|
||||
#### Post 10: "Cold start without application logic" [PUBLISHED]
|
||||
|
||||
- **Type:** Technical Deep Dive
|
||||
- **Roadmap phase:** m3p3 (Personalized Ranking Profiles)
|
||||
@ -143,50 +143,49 @@ M3 is where the feedback loop closes. Signal writes update the user's preference
|
||||
|
||||
---
|
||||
|
||||
### Milestone 4: Hybrid Search
|
||||
### Milestone 5: Hybrid Search
|
||||
|
||||
M4 merges full-text search with semantic similarity and signal-ranked results. Search and retrieval become the same system.
|
||||
M5 merges full-text search with semantic similarity and signal-ranked results. Search and retrieval become the same system.
|
||||
|
||||
#### Post 11: "Search and ranking are the same system"
|
||||
#### Post 11: "Search and ranking are the same system" [PUBLISHED]
|
||||
|
||||
- **Type:** Technical Deep Dive / Announcement
|
||||
- **Roadmap phase:** m4p3 (SEARCH Query Executor) -- M4 complete
|
||||
- **Thesis:** `SEARCH items QUERY "jazz piano" VECTOR [embedding] FOR USER @user_42 USING PROFILE search LIMIT 20` combines BM25 text relevance, semantic vector similarity, and user personalization in one ranked list. The fusion uses Reciprocal Rank Fusion. Personalization re-ranks within the relevant set -- an irrelevant result never surfaces because the user likes the creator. This is one query. It replaces Elasticsearch + a vector DB + a ranking service.
|
||||
- **Source material:** m4p3 integration test, docs/research/tantivy.md, ARCHITECTURE.md (Text Search, Hybrid Fusion)
|
||||
- **When to publish:** After M4 UAT passes.
|
||||
- **Code to include:** The SEARCH query. The RRF formula. A comparison: the same query with BM25 only, ANN only, and fused. The personalization overlay changing result order for two different users.
|
||||
- **Why it matters:** Search is the most complex surface and the one engineers know best. Showing that text search, semantic search, and ranking collapse into one query is the most concrete demonstration of the 6-to-1 thesis.
|
||||
- **Type:** Technical Deep Dive / Architecture Preview
|
||||
- **Roadmap phase:** Published during M4. Describes the architecture that M5 will complete.
|
||||
- **Status:** PUBLISHED. Written as an architectural intent post -- explains the unified pipeline design, what is already built (RETRIEVE, USearch, ranking), and what remains (Tantivy, RRF fusion, SEARCH query). Does not claim SEARCH is shipped.
|
||||
- **Thesis:** Text retrieval, vector retrieval, and signal-based ranking belong in one pipeline. The data model is already unified. Fusion is arithmetic. The RETRIEVE pipeline, the HNSW index, and the ranking profiles are all in place. Three pieces of wiring remain.
|
||||
- **Source material:** Published post at `site/content/blog/search-and-ranking.mdx`
|
||||
- **Why it matters:** Frames the M5 work for the audience before it ships. The architectural argument stands regardless of what's wired today.
|
||||
|
||||
#### Post 12: "Tantivy as a derived index, not a source of truth"
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Roadmap phase:** m4p1 (Tantivy Integration)
|
||||
- **Roadmap phase:** m5p1 (Tantivy Integration)
|
||||
- **Thesis:** The entity store is the source of truth. Tantivy is a materialized view. If the index is corrupted, it can be rebuilt from the entity store. Crash recovery replays from a stored sequence number. Consistency is DB-primary, not two-phase commit. This is simpler, deterministic, and the right model for an embedded database.
|
||||
- **Source material:** docs/research/tantivy.md, m4p1 task docs, ARCHITECTURE.md
|
||||
- **When to publish:** After m4p1 is complete.
|
||||
- **Source material:** docs/research/tantivy.md, m5p1 task docs (once written), ARCHITECTURE.md
|
||||
- **When to publish:** After m5p1 is complete.
|
||||
- **Code to include:** The outbox pattern. The crash recovery sequence number. The background indexer. The consistency model.
|
||||
- **Why it matters:** This is a useful architectural pattern beyond tidalDB. Engineers building systems with derived indexes will find this directly applicable.
|
||||
|
||||
---
|
||||
|
||||
### Milestone 5: Full Surface Coverage
|
||||
### Milestone 6: Full Surface Coverage
|
||||
|
||||
M5 completes all 14 use cases. The content here shifts from "how does the engine work" to "what can you build with it."
|
||||
M6 completes all 14 use cases. The content here shifts from "how does the engine work" to "what can you build with it."
|
||||
|
||||
#### Post 13: "14 use cases, one query engine"
|
||||
|
||||
- **Type:** Devlog / Announcement
|
||||
- **Roadmap phase:** M5 complete
|
||||
- **Roadmap phase:** M6 complete
|
||||
- **Thesis:** For You feeds, trending, search, following, related content, notifications, hidden gems, controversial, live content, creator discovery, user library, cohort-scoped trending -- every surface a content platform needs, driven by the same query primitives. The application specifies profiles, filters, and context. The database executes ranking.
|
||||
- **Source material:** USE_CASES.md, M5 UAT results
|
||||
- **When to publish:** After M5 UAT passes.
|
||||
- **Source material:** USE_CASES.md, M6 UAT results
|
||||
- **When to publish:** After M6 UAT passes.
|
||||
- **Code to include:** A curated selection of 4-5 queries spanning different surfaces (for_you, trending, search, hidden_gems, cohort_trending). Each with a brief setup and result.
|
||||
- **Why it matters:** This is the completeness post. It demonstrates that the database is not a toy or a prototype -- it handles the full surface area of a real content platform.
|
||||
|
||||
#### Post 14: "Cohort-scoped trending: what is hot for people like you"
|
||||
|
||||
- **Type:** Technical Deep Dive
|
||||
- **Roadmap phase:** M5, likely Phase 3 (Social Graph and Collaborative Filtering)
|
||||
- **Roadmap phase:** M6, cohort-scoped trending phase
|
||||
- **Thesis:** "What's trending" means different things to different audiences. A 22-year-old in Tokyo and a 45-year-old in Texas see different trending pages -- not because of personalization (individual preference), but because different content is genuinely trending within their respective audience segments. tidalDB maintains per-cohort signal aggregation using RoaringBitmaps for O(1) membership testing and sparse fan-out for storage efficiency.
|
||||
- **Source material:** USE_CASES.md (UC-15), ARCHITECTURE.md (Cohort-scoped aggregation), API.md (Cohort Definitions)
|
||||
- **When to publish:** After cohort-scoped trending passes integration tests.
|
||||
@ -195,27 +194,27 @@ M5 completes all 14 use cases. The content here shifts from "how does the engine
|
||||
|
||||
---
|
||||
|
||||
### Milestone 6: Production Hardening
|
||||
### Production Hardening
|
||||
|
||||
M6 is about trust. The content shifts from "what it does" to "why you can trust it."
|
||||
The final phase is about trust. The content shifts from "what it does" to "why you can trust it."
|
||||
|
||||
#### Post 15: "Kill it at any point. It comes back correct."
|
||||
|
||||
- **Type:** Technical Deep Dive
|
||||
- **Roadmap phase:** m6p1 (Crash Recovery Hardening)
|
||||
- **Roadmap phase:** Production hardening -- crash recovery hardening phase
|
||||
- **Thesis:** We injected faults at every write-path stage. Recovery time is under 30 seconds at 1M items. WAL replay produces state identical to pre-crash. No phantom items, no lost signals, no inconsistent aggregates. The WAL is the source of truth. Everything else is derived state that can be rebuilt.
|
||||
- **Source material:** m6p1 test results, fault injection methodology
|
||||
- **When to publish:** After m6p1 passes.
|
||||
- **Source material:** Crash recovery test results, fault injection methodology, WAL implementation
|
||||
- **When to publish:** After crash recovery hardening passes.
|
||||
- **Code to include:** The crash simulation test. Recovery time measurements. The WAL checkpoint and replay sequence.
|
||||
- **Why it matters:** Trust is the precondition for adoption. Engineers will not embed a database they cannot crash-test. This post is the trust credential.
|
||||
|
||||
#### Post 16: "Graceful degradation: less precise, never wrong"
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Roadmap phase:** m6p2 (Graceful Degradation)
|
||||
- **Roadmap phase:** Production hardening -- graceful degradation phase
|
||||
- **Thesis:** Under 3x overload, tidalDB does not return errors. It reduces candidate set size, uses coarser aggregates, skips diversity enforcement, and serves from materialized cache -- in that order. Results are less precise but never incorrect. The degradation order is documented and configurable.
|
||||
- **Source material:** m6p2 task docs, ARCHITECTURE.md (Graceful degradation)
|
||||
- **When to publish:** After m6p2 is complete.
|
||||
- **Source material:** Graceful degradation task docs, ARCHITECTURE.md (Graceful degradation)
|
||||
- **When to publish:** After graceful degradation is complete.
|
||||
- **Code to include:** The degradation cascade. Load test results at 1x, 2x, 3x. Latency distribution at each level.
|
||||
- **Why it matters:** This is how production systems should behave. Engineers who have been paged for "ranking service returned 500" will appreciate a system that degrades gracefully instead.
|
||||
|
||||
@ -228,38 +227,56 @@ These posts are not tied to specific milestones. They can be written whenever th
|
||||
#### "Why not SQL"
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Status:** READY -- code shipped, decision documented
|
||||
- **Thesis:** The custom query language exists because SQL cannot express ranking semantics without losing optimization opportunities. `FOR USER` means "load this user's preference vector and relationship graph." `USING PROFILE` means "apply this named scoring function." `DIVERSITY` means "enforce post-ranking constraints." These are not WHERE clauses.
|
||||
- **Source material:** thoughts.md (Part II.4), VISION.md (query examples), API.md
|
||||
- **When to publish:** Any time after M1. Best paired with M2 when the RETRIEVE query is functional.
|
||||
- **Code to read:**
|
||||
- `tidal/src/query/retrieve.rs` -- `RetrieveBuilder` showing FOR USER, USING PROFILE, DIVERSITY as typed builder methods, not string predicates
|
||||
- `tidal/src/ranking/profile.rs` -- `RankingProfile` and `CandidateStrategy` showing how profiles express scoring intent that SQL GROUP BY cannot
|
||||
- `tidal/src/query/executor.rs` -- dispatcher showing that FOR USER loads preference vector and relationship graph -- state SQL has no model for
|
||||
- **When to publish:** Any time after M1. Best paired with M2 when the RETRIEVE query is functional. Both are complete.
|
||||
|
||||
#### "Why we chose fjall over RocksDB (for now)"
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Status:** READY -- code shipped, decision documented
|
||||
- **Thesis:** Pure Rust, `#![forbid(unsafe_code)]`, fast compile times, trait-abstracted for swap. fjall is not the fastest LSM-tree. It is the right one for an embeddable database built by a small team that values correctness over raw throughput, with a trait boundary that makes the decision reversible.
|
||||
- **Source material:** thoughts.md (Part V.9), m1p3 task docs, CODING_GUIDELINES.md
|
||||
- **When to publish:** After m1p3 is complete (already shipped). This post is ready now.
|
||||
- **Source material:** thoughts.md (Part V.9), CODING_GUIDELINES.md
|
||||
- **Code to read:**
|
||||
- `tidal/src/storage/engine.rs` -- the `StorageEngine` trait (six methods, zero fjall imports -- this is the abstraction boundary)
|
||||
- `tidal/src/storage/fjall.rs` -- `FjallBackend` implementing the trait; note the `fjall::Keyspace` is the only fjall type that crosses the boundary
|
||||
- `tidal/src/storage/memory.rs` -- `InMemoryBackend` proving the trait is genuinely swappable (used in all tests)
|
||||
- `tidal/Cargo.toml` -- fjall version pin, no `unsafe` in the crate's feature flags
|
||||
- **When to publish:** Any time. Code has been shipped since m1p3.
|
||||
|
||||
#### "USearch, not from scratch"
|
||||
|
||||
- **Type:** Architecture Decision Record
|
||||
- **Status:** READY -- code shipped, decision documented
|
||||
- **Thesis:** Correct, high-performance, concurrent HNSW with SIMD distance computation is 6-12 months of dedicated work. We are not a vector database company. USearch runs in ScyllaDB, ClickHouse, and DuckDB. The FFI boundary is thin. Build what differentiates you. Borrow what does not.
|
||||
- **Source material:** docs/research/ann_for_tidaldb.md, m2p1 task docs, ARCHITECTURE.md (Vector Index)
|
||||
- **When to publish:** After m2p1 (USearch integration) is complete.
|
||||
- **Source material:** docs/research/ann_for_tidaldb.md, ARCHITECTURE.md (Vector Index)
|
||||
- **Code to read:**
|
||||
- `tidal/src/storage/vector/mod.rs` -- `VectorIndex` trait and the module comment explaining the design decisions (VectorId = u64, L2 squared, ef_search uniformity)
|
||||
- `tidal/src/storage/vector/usearch_index.rs` -- `UsearchIndex` wrapping the USearch FFI; the wrapper is thin by design
|
||||
- `tidal/src/storage/vector/planner.rs` -- `AdaptiveQueryPlanner` with four strategies (HNSW, in-graph filter, widened beam, pre-filter brute-force) -- this is tidalDB's contribution on top of the borrowed index
|
||||
- `tidal/src/storage/vector/brute.rs` -- `BruteForceIndex` and `MockVectorIndex` proving the trait boundary is real
|
||||
- **When to publish:** Any time. Code has been shipped since m2p1.
|
||||
|
||||
---
|
||||
|
||||
## Post Cadence
|
||||
|
||||
| Milestone | Posts | Approximate Pace |
|
||||
|-----------|-------|-----------------|
|
||||
| Pre-implementation | 1 | Publish when ready |
|
||||
| M1 (Signal Engine) | 2-3 | One per phase completion |
|
||||
| M2 (Ranked Retrieval) | 3 | One per major phase |
|
||||
| M3 (Personalized Ranking) | 2-3 | One per key insight |
|
||||
| M4 (Hybrid Search) | 2 | One per major phase |
|
||||
| M5 (Full Coverage) | 2 | At milestone boundaries |
|
||||
| M6 (Production Hardening) | 2 | At milestone boundaries |
|
||||
| Ongoing / ADRs | 2-3 | When the decision is fresh |
|
||||
| Milestone | Posts | Approximate Pace | Status |
|
||||
|-----------|-------|-----------------|--------|
|
||||
| Pre-implementation | 1 | Publish when ready | PUBLISHED |
|
||||
| M1 (Signal Engine) | 2-3 | One per phase completion | PUBLISHED |
|
||||
| M2 (Ranked Retrieval) | 3 | One per major phase | PUBLISHED |
|
||||
| M3 (Personalized Ranking) | 2-3 | One per key insight | PUBLISHED |
|
||||
| M4 (Agent Session Layer) | 1 (Post 11, architectural preview) | Published during M4 | PUBLISHED |
|
||||
| M5 (Hybrid Search) | 1 (Post 12) | After m5p1 ships | Blocked on M5 |
|
||||
| M6 (Full Coverage) | 2 (Posts 13-14) | At milestone boundaries | Blocked on M6 |
|
||||
| Production Hardening | 2 (Posts 15-16) | At milestone boundaries | Blocked on hardening phase |
|
||||
| Ongoing / ADRs | 3 (fjall, SQL, USearch) | When the decision is fresh | READY |
|
||||
|
||||
**Target: 16-20 posts across the full roadmap.** Not more. Each one earns its place.
|
||||
|
||||
@ -277,28 +294,46 @@ These posts are not tied to specific milestones. They can be written whenever th
|
||||
|
||||
## Reference: Roadmap to Post Mapping
|
||||
|
||||
| Roadmap Phase | Post # | Title (Working) |
|
||||
|---------------|--------|-----------------|
|
||||
| Pre-implementation | 1 | Every content platform builds the same 6 systems from scratch |
|
||||
| m1p1-m1p3 (Foundation) | 3 | What three databases taught us before we wrote a line of code |
|
||||
| m1p4 (Signal Ledger) | 2 | Running decay scores are O(1) -- here is the math |
|
||||
| m1p5 (M1 Complete) | 4 | Signals wrote 100ms ago. The query sees them now. |
|
||||
| m2p3 (Ranking Profiles) | 7 | Ranking profiles are data, not code |
|
||||
| m2p4 (Diversity) | 6 | Diversity enforcement in 3 microseconds |
|
||||
| m2p5 (M2 Complete) | 5 | One query. Six systems. Under 50 milliseconds. |
|
||||
| m3p2 (Feedback Loop) | 8, 9 | The feedback loop that closes in one write / Negative signals are equal citizens |
|
||||
| m3p3 (Personalized Profiles) | 10 | Cold start without application logic |
|
||||
| m4p1 (Tantivy) | 12 | Tantivy as a derived index, not a source of truth |
|
||||
| m4p3 (M4 Complete) | 11 | Search and ranking are the same system |
|
||||
| M5 Complete | 13, 14 | 14 use cases, one query engine / Cohort-scoped trending |
|
||||
| m6p1 (Crash Recovery) | 15 | Kill it at any point. It comes back correct. |
|
||||
| m6p2 (Graceful Degradation) | 16 | Graceful degradation: less precise, never wrong |
|
||||
| Any time | -- | Why not SQL / Why fjall / USearch, not from scratch |
|
||||
| Roadmap Phase | Post # | Title | Status |
|
||||
|---------------|--------|-------|--------|
|
||||
| Pre-implementation | 1 | Every content platform builds the same 6 systems from scratch | PUBLISHED |
|
||||
| m1p1-m1p3 (Foundation) | 3 | What three databases taught us before we wrote a line of code | PUBLISHED |
|
||||
| m1p4 (Signal Ledger) | 2 | Running decay scores are O(1) -- here is the math | PUBLISHED |
|
||||
| m1p5 (M1 Complete) | 4 | Signals wrote 100ms ago. The query sees them now. | PUBLISHED |
|
||||
| m2p3 (Ranking Profiles) | 7 | Ranking profiles are data, not code | PUBLISHED |
|
||||
| m2p4 (Diversity) | 6 | Diversity enforcement in 3 microseconds | PUBLISHED |
|
||||
| m2p5 (M2 Complete) | 5 | One query. Six systems. Under 50 milliseconds. | PUBLISHED |
|
||||
| m3p2 (Feedback Loop) | 8, 9 | The feedback loop that closes in one write / Negative signals are equal citizens | PUBLISHED |
|
||||
| m3p3 (Personalized Profiles) | 10 | Cold start without application logic | PUBLISHED |
|
||||
| M4 Complete (Agent Session Layer) | 11 | Search and ranking are the same system | PUBLISHED |
|
||||
| Any time | -- | Why we chose fjall over RocksDB (for now) | READY |
|
||||
| Any time | -- | Why not SQL | READY |
|
||||
| Any time | -- | USearch, not from scratch | READY |
|
||||
| M5p1 (Tantivy Integration) | 12 | Tantivy as a derived index, not a source of truth | Blocked on M5p1 |
|
||||
| M5 Complete | 13, 14 | 14 use cases, one query engine / Cohort-scoped trending | Blocked on M5 |
|
||||
| m6p1 (Crash Recovery) | 15 | Kill it at any point. It comes back correct. | Blocked on M6 |
|
||||
| m6p2 (Graceful Degradation) | 16 | Graceful degradation: less precise, never wrong | Blocked on M6 |
|
||||
|
||||
---
|
||||
|
||||
## Immediate Next Actions
|
||||
## Current Queue
|
||||
|
||||
1. **Write Post 1** ("Every content platform builds the same 6 systems from scratch") -- this can be published now. It establishes the problem and the audience. It does not depend on shipped code.
|
||||
2. **Write Post 3** ("What three databases taught us") -- m1p1 through m1p3 are complete. The source material (thoughts.md) is rich. The code exists.
|
||||
3. **Prepare Post 2 outline** ("Running decay scores are O(1)") -- the research doc exists, the math is decided, but the implementation is not yet shipped (m1p4 is next). Write the outline. Wait for the benchmarks.
|
||||
**As of M4 complete (Agent Session Layer), posts 1-11 published.**
|
||||
|
||||
### Ready to write now
|
||||
|
||||
| Post | Status |
|
||||
|------|--------|
|
||||
| "Why we chose fjall over RocksDB (for now)" | READY -- m1p3 shipped, decision is documented |
|
||||
| "Why not SQL" | READY -- any time after M1 |
|
||||
| "USearch, not from scratch" | READY -- m2p1 shipped |
|
||||
|
||||
### Next milestone-gated posts
|
||||
|
||||
| Post | Blocked on |
|
||||
|------|------------|
|
||||
| Post 12: "Tantivy as a derived index, not a source of truth" | M5p1 (Tantivy integration) |
|
||||
| Post 13: "14 use cases, one query engine" | M5 complete |
|
||||
| Post 14: "Cohort-scoped trending" | M5 complete |
|
||||
| Post 15: "Kill it at any point. It comes back correct." | M6p1 |
|
||||
| Post 16: "Graceful degradation: less precise, never wrong" | M6p2 |
|
||||
|
||||
@ -87,21 +87,13 @@ The roadmap now has two tracks:
|
||||
| **m3p2: Feedback Loop -- Signal Writes Update User State** | COMPLETE | passing |
|
||||
| **m3p3: Personalized Ranking Profiles** | COMPLETE | passing |
|
||||
| **m3p4: User State Filters + M3 UAT** | COMPLETE | 571 lib + 11 m3_uat + 6 m2_uat + 5 signal_api + 8 vector_usearch passing |
|
||||
| P0: Beachhead Validation | NOT STARTED | -- |
|
||||
| P1: Concierge Alpha | NOT STARTED | -- |
|
||||
| PG1: Personalization Core Done gate | NOT STARTED | -- |
|
||||
| P2: Productized Beta | NOT STARTED | -- |
|
||||
| P3: Public Launch | NOT STARTED | -- |
|
||||
| P4: Scale + Revenue Fit | NOT STARTED | -- |
|
||||
| **m4: Agent Session Layer** | COMPLETE | 607 lib + 12 m4_uat + 11 m3_uat + 7 m2_uat + 5 signal_api + 8 vector_usearch + 12 storage passing |
|
||||
| **m5p1: Tantivy Integration** | COMPLETE | 650 lib + 3 text_index integration = 653 passing; BM25 @ 10K docs = 0.26ms |
|
||||
| **m5p2: Hybrid Fusion (RRF)** | COMPLETE | 665 lib passing; RRF fusion @ 1K candidates = 46µs |
|
||||
| **m5p3: SEARCH Query Executor** | COMPLETE | 681 lib + 12 m5_search integration = 693 passing |
|
||||
| **m5p4: Creator and People Search** | COMPLETE | 705 lib + 9 m5_uat + 6 m5p4_creator_search + 12 m5_search = 732 passing |
|
||||
|
||||
**Current phase:** Milestone 3 COMPLETE. All phases (m3p1–m3p4) and all 12 tasks are done. Next: M4 Agent Memory.
|
||||
|
||||
**Lessons learned:**
|
||||
|
||||
- m1p3 keyspaces are organized per `EntityKind` ("items", "users", "creators"), not by data category. The `Tag` enum in key encoding provides the data-category namespace within each entity-kind keyspace.
|
||||
- The `LumenError` name is a legacy artifact from a predecessor project. Will be renamed when convenient but does not block progress.
|
||||
- MSRV was bumped to 1.91 for fjall 3 compatibility.
|
||||
- M2 complete: RETRIEVE query with 11+ sort modes, metadata filters, diversity constraints, and live signal ranking all operational at < 50ms at 10K items.
|
||||
**Next:** M5 COMPLETE. Next: M6 Full Surface Coverage.
|
||||
|
||||
---
|
||||
|
||||
@ -296,7 +288,7 @@ Then:
|
||||
|
||||
### Phases
|
||||
|
||||
#### Phase 1: Core Type System and Schema -- COMPLETE
|
||||
#### Phase 1: Core Type System and Schema
|
||||
|
||||
**Delivers:** The foundational type system -- entity IDs, signal type definitions, decay rate declarations, window specifications, and the error types that every subsequent module depends on. The schema module that validates and stores signal/entity definitions.
|
||||
|
||||
@ -318,7 +310,7 @@ Then:
|
||||
**Complexity:** M
|
||||
**Research Reference:** `docs/research/tidaldb_signal_ledger.md` (decay formula, EntityState struct)
|
||||
|
||||
#### Phase 2: Write-Ahead Log -- COMPLETE
|
||||
#### Phase 2: Write-Ahead Log
|
||||
|
||||
**Delivers:** A durable, append-only log for signal events. Every signal write is fsync'd before acknowledgment. Group commit amortizes fsync cost. Content-addressed events via BLAKE3 for deduplication. The WAL is the source of truth -- all other state is derived.
|
||||
|
||||
@ -336,7 +328,7 @@ Then:
|
||||
**Complexity:** L
|
||||
**Research Reference:** `docs/research/tidaldb_wal.md` (wire format, group commit, crash detection, deduplication), `thoughts.md` Part II.1 (WAL convergence), Part V.5-6 (quarantine-first, group commit)
|
||||
|
||||
#### Phase 3: Storage Engine Trait and fjall Backend -- COMPLETE
|
||||
#### Phase 3: Storage Engine Trait and fjall Backend
|
||||
|
||||
**Delivers:** The `StorageEngine` trait abstraction and two implementations: `FjallBackend` (fjall 3 LSM-tree) for production and `InMemoryBackend` (BTreeMap + RwLock) for deterministic testing. Key encoding follows the subject-prefix pattern with a `Tag` discriminant. `FjallStorage` coordinates three keyspaces per entity kind. `FjallAtomicBatch` provides cross-keyspace atomic writes.
|
||||
|
||||
@ -1383,7 +1375,13 @@ A developer can embed tidalDB alongside an agent runtime and: (1) declare agent
|
||||
|
||||
### Milestone Thesis
|
||||
|
||||
A developer can execute `SEARCH items QUERY "rust tutorial beginner" VECTOR query_vector FOR USER @user_id USING PROFILE search LIMIT 20` and get results that combine BM25 text relevance, semantic similarity, and user personalization in a single ranked list. This proves that search and retrieval are the same system.
|
||||
M4 proved agents can write scoped signals and query session context within a personalized ranking pipeline. M5 proves that text search and vector retrieval are the same system. A developer can execute `SEARCH items QUERY "rust tutorial beginner" VECTOR query_vector FOR USER @user_id USING PROFILE search LIMIT 20` and get results that combine BM25 text relevance, semantic similarity, and user personalization in a single ranked list -- with the same signal freshness, diversity enforcement, and feedback loop guarantees that RETRIEVE already provides.
|
||||
|
||||
### Enables
|
||||
|
||||
- **UC-02** (Search) -- Full: keyword search, exact phrase, boolean operators, field-scoped, hybrid BM25 + semantic, personalized re-ranking, search click feedback loop
|
||||
- **UC-10** (People/Creator Search) -- Full: creator discovery by name/topic, "creators like X" via embedding similarity, creator attribute filters
|
||||
- **UC-11** (Visual/Semantic Search) -- Core: vector-only search for image similarity, semantic intent queries ("something relaxing to watch")
|
||||
|
||||
### UAT Scenario
|
||||
|
||||
@ -1391,10 +1389,13 @@ A developer can execute `SEARCH items QUERY "rust tutorial beginner" VECTOR quer
|
||||
Given:
|
||||
A tidalDB instance with:
|
||||
- 10,000 items with text fields (title, description, tags) indexed for full-text search
|
||||
- All items have embeddings
|
||||
- 500 users with engagement history
|
||||
- Search profile defined: text relevance as floor, semantic similarity,
|
||||
personalization adjustment
|
||||
- All items have 1536-dim embeddings
|
||||
- 500 users with engagement history and preference vectors
|
||||
- 200 creators with name, handle, and aggregated embeddings
|
||||
- Signal types: view (7d decay), like (14d decay), skip (1d decay),
|
||||
search_click (3d decay, with query context)
|
||||
- Profiles: "search" (text_weight:0.6, vector_weight:0.4, RRF k=60,
|
||||
personalization overlay, completion gate > 0.3, diversity max_per_creator:2)
|
||||
|
||||
When:
|
||||
1. SEARCH items QUERY "rust tutorial beginner" VECTOR [query_embedding]
|
||||
@ -1404,109 +1405,309 @@ When:
|
||||
3. SEARCH items QUERY "\"exact phrase match\"" USING PROFILE search LIMIT 10
|
||||
4. SEARCH items QUERY "jazz -beginner" USING PROFILE search LIMIT 10
|
||||
5. SEARCH creators QUERY "jazz" LIMIT 10
|
||||
6. User clicks result #3, record SIGNAL search_click
|
||||
7. User searches same query again
|
||||
6. SEARCH creators SIMILAR TO @creator_xyz LIMIT 10
|
||||
7. SIGNAL search_click item:@item_abc user:@user_42
|
||||
context:{ query: "rust tutorial beginner", rank_at_click: 3 }
|
||||
8. Re-execute search #1
|
||||
|
||||
Then:
|
||||
- Step 1: Results combine BM25 + semantic similarity via RRF;
|
||||
personalization re-ranks within relevant set; user_42 (a beginner)
|
||||
sees beginner content elevated
|
||||
- Step 2: Text-only search (no vector), filtered by duration and format
|
||||
sees beginner content elevated; max 2 per creator enforced
|
||||
- Step 2: Text-only search (no vector), filtered by duration and format;
|
||||
only short videos returned
|
||||
- Step 3: Exact phrase match -- only items containing "exact phrase match"
|
||||
- Step 4: Boolean exclusion -- no items matching "beginner"
|
||||
- Step 5: Creator search by name/topic
|
||||
- Step 6: Signal recorded with query context and rank position
|
||||
- Step 7: Clicked result may rank higher due to search_click signal
|
||||
as a contiguous sequence
|
||||
- Step 4: Boolean exclusion -- no items matching "beginner" appear in results
|
||||
- Step 5: Creators returned by name/topic match, ordered by engagement rate
|
||||
- Step 6: Creators semantically similar to @creator_xyz by embedding distance
|
||||
- Step 7: Signal recorded with query context and rank position;
|
||||
item and user-topic affinity updated
|
||||
- Step 8: Clicked result @item_abc may rank higher due to search_click signal;
|
||||
signal written < 100ms ago is reflected
|
||||
- Performance: SEARCH < 50ms at 10K items
|
||||
```
|
||||
|
||||
### Phases
|
||||
|
||||
#### Phase 1: Tantivy Integration
|
||||
#### Phase 1: Tantivy Integration (m5p1)
|
||||
|
||||
**Delivers:** Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via outbox. BM25 scoring exposed via custom Collector and Weight/Scorer seek pattern.
|
||||
**Delivers:** Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via an outbox sequence. BM25 scoring exposed via custom Collector and the Weight/Scorer seek pattern. Schema text fields (title, description, tags) automatically indexed. Crash recovery replays from the last committed sequence number stored in Tantivy's commit payload.
|
||||
|
||||
**Acceptance Criteria:**
|
||||
|
||||
- [ ] Tantivy index created from schema text field definitions (title, description, tags)
|
||||
- [ ] Background indexer reads entity store outbox and feeds Tantivy writer
|
||||
- [ ] Tantivy commit stores last-processed sequence number in payload for crash recovery
|
||||
- [ ] Custom `AllScoresCollector` returns all matching doc IDs with BM25 scores
|
||||
- [ ] `Weight::scorer` + `DocSet::seek` pattern scores specific candidate IDs (for re-ranking ANN results)
|
||||
- [ ] External entity ID -> DocAddress mapping maintained and updated on segment merge
|
||||
- [ ] Boolean queries supported: AND, OR, NOT, exact phrase, field-scoped
|
||||
- [ ] Commit interval: every 1-5 seconds or every N thousand documents
|
||||
- [ ] Index rebuild from entity store completes in < 10 minutes at 10K items
|
||||
- [ ] BM25 query latency < 10ms at 10K documents (benchmarked)
|
||||
- [ ] `TextIndex` struct wraps Tantivy `Index`, `IndexWriter` (behind `Mutex`), and `IndexReader` with auto-reload
|
||||
- [ ] Tantivy schema created from tidalDB schema text field definitions: `text` fields get full-text tokenization with Tantivy's default tokenizer; `keyword` fields get raw (untokenized) indexing for exact match
|
||||
- [ ] `TextIndexWriter::index_item(entity_id, metadata)` adds or updates a document in Tantivy; `delete_item(entity_id)` removes via `delete_term` on the entity_id fast field
|
||||
- [ ] Background indexer: `TextIndexSyncer` reads entity store writes (via WAL sequence tracking) and feeds Tantivy writer; commit interval configurable (default: every 1000 documents or 2 seconds, whichever comes first)
|
||||
- [ ] Each Tantivy `commit()` stores the last-processed WAL sequence number in the commit payload via `set_payload()`; on crash recovery, replay from that sequence number
|
||||
- [ ] Custom `AllScoresCollector` implementing Tantivy's `Collector` trait returns all matching `(EntityId, f32)` pairs with BM25 scores; `requires_scoring()` returns `true`
|
||||
- [ ] `ScoredCandidateCollector` implementing Tantivy's `Collector` trait accepts a pre-sorted candidate set and returns BM25 scores for only those candidates via `DocSet::seek()` (for scoring ANN results)
|
||||
- [ ] External `EntityId -> DocAddress` mapping maintained via a fast field (`entity_id_field`) on every Tantivy document; mapping rebuilt on `IndexReader::reload()` after segment merges
|
||||
- [ ] Boolean query parsing: AND, OR, NOT operators; exact phrase (`"..."`); field-scoped (`title:jazz`, `tag:tutorial`); exclusion (`-beginner`); wildcard prefix (`pian*`)
|
||||
- [ ] Index rebuild from entity store: `text_index.rebuild_from(storage)` scans all items and rebuilds the Tantivy index; completes in < 10 minutes at 10K items
|
||||
- [ ] BM25 query latency < 10ms at 10K documents (Criterion benchmarked)
|
||||
- [ ] Tantivy `IndexWriter` heap budget set to 50MB (conservative for embedded use)
|
||||
- [ ] `LogMergePolicy` configured with defaults; `wait_merging_threads()` called on shutdown
|
||||
- [ ] `TextIndex` is `Send + Sync` -- safe to share across threads behind `Arc`
|
||||
|
||||
**Depends On:** m1p3 (storage engine), m1p5 (entity API)
|
||||
**Complexity:** L
|
||||
**Research Reference:** `docs/research/tantivy.md` (Collector API, consistency pattern, seek scoring, commit model)
|
||||
**Task Breakdown:**
|
||||
|
||||
#### Phase 2: Hybrid Fusion (RRF)
|
||||
| # | Task | Delivers | Complexity |
|
||||
|---|------|----------|------------|
|
||||
| 01 | TextIndex Core | `TextIndex` struct, Tantivy schema generation from tidalDB schema, `IndexWriter`/`IndexReader` lifecycle, `entity_id` fast field, `TextIndex::open()` and `TextIndex::close()` | L |
|
||||
| 02 | Document Write/Delete | `index_item()`, `delete_item()`, field mapping (text -> tokenized, keyword -> raw), metadata-to-document conversion | M |
|
||||
| 03 | Background Syncer | `TextIndexSyncer` reads WAL sequence, feeds writer, configurable commit interval, `set_payload()` with sequence number, crash recovery replay | L |
|
||||
| 04 | BM25 Scoring Collectors | `AllScoresCollector` for full scoring, `ScoredCandidateCollector` for seek-based candidate scoring, entity ID resolution from fast field | M |
|
||||
| 05 | Boolean Query Parsing | AND/OR/NOT, exact phrase, field-scoped, exclusion, wildcard prefix; wraps Tantivy's `QueryParser` with custom syntax extensions | M |
|
||||
|
||||
**Delivers:** Reciprocal Rank Fusion combining BM25 ranked lists with ANN ranked lists into a single scored result set. The starting point is RRF with k=60; the architecture supports upgrading to tuned linear combination when relevance labels exist.
|
||||
**Depends On:** m1p3 (storage engine, key encoding), m1p5 (entity write API, WAL sequence), m2p2 (metadata fields used for field-scoped queries)
|
||||
**Complexity:** XL (5 tasks; Tasks 01-02 sequential, then 03/04/05 can parallelize after 02 completes)
|
||||
**Research Reference:** `docs/research/tantivy.md` (Collector API, consistency pattern, seek scoring, commit model, single-writer lock, segment merge)
|
||||
|
||||
#### Phase 2: Hybrid Fusion (RRF) (m5p2)
|
||||
|
||||
**Delivers:** Reciprocal Rank Fusion combining BM25 ranked lists with ANN ranked lists into a single scored result set. The starting point is RRF with k=60; the architecture supports upgrading to tuned linear combination when relevance labels exist. Handles the three retrieval modes: text-only, vector-only, and hybrid.
|
||||
|
||||
**Acceptance Criteria:**
|
||||
|
||||
- [ ] `RRF(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))` implemented
|
||||
- [ ] Documents appearing in only one list contribute only their single-list term
|
||||
- [ ] RRF results are re-rankable by personalization (user preference overlay)
|
||||
- [ ] When only text query is provided (no vector), pure BM25 ranking used
|
||||
- [ ] When only vector is provided (no text), pure ANN ranking used
|
||||
- [ ] Fusion adds < 1ms to query time (benchmarked)
|
||||
- [ ] k parameter configurable (default 60)
|
||||
- [ ] `HybridFusion` struct with `fuse(bm25_results: &[(EntityId, f32)], ann_results: &[(EntityId, f32)], k: u32) -> Vec<(EntityId, f64)>` method
|
||||
- [ ] RRF formula: `score(d) = 1.0 / (k + rank_bm25(d)) + 1.0 / (k + rank_ann(d))` where `k = 60` by default
|
||||
- [ ] Documents appearing in only one list contribute only their single-list term (the other term is zero)
|
||||
- [ ] Results sorted by fused score descending
|
||||
- [ ] RRF results are passed to the existing `ProfileExecutor` for personalization re-ranking (user preference overlay, signal boosts, quality gates)
|
||||
- [ ] When only text query is provided (no vector), pure BM25 ranking passed directly to profile executor
|
||||
- [ ] When only vector is provided (no text), pure ANN ranking passed directly to profile executor
|
||||
- [ ] `k` parameter configurable per profile or per query (default 60)
|
||||
- [ ] Fusion adds < 1ms to query time for 1000 candidates from each list (Criterion benchmarked)
|
||||
- [ ] Property test: for any pair of ranked lists, RRF output contains the union of both input document sets with correct score computation to 6 decimal places
|
||||
|
||||
**Depends On:** Phase 1 (BM25 scores), m2p1 (ANN scores)
|
||||
**Complexity:** S
|
||||
**Research Reference:** `docs/research/tantivy.md` (RRF section, Cormack et al.)
|
||||
**Task Breakdown:**
|
||||
|
||||
#### Phase 3: SEARCH Query Parser and Executor
|
||||
| # | Task | Delivers | Complexity |
|
||||
|---|------|----------|------------|
|
||||
| 01 | RRF Implementation | `HybridFusion::fuse()`, rank-to-score conversion, union merge of ranked lists, configurable `k` | S |
|
||||
| 02 | Retrieval Mode Router | Logic to select text-only, vector-only, or hybrid based on query contents; routes to BM25, ANN, or fusion accordingly | S |
|
||||
|
||||
**Delivers:** The SEARCH query parser and executor that orchestrates text retrieval, semantic retrieval, fusion, personalization, filtering, diversity, and result assembly.
|
||||
**Depends On:** m5p1 (BM25 scored results), m2p1 (ANN scored results)
|
||||
**Complexity:** S (2 small tasks; Task 02 depends on 01)
|
||||
**Research Reference:** `docs/research/tantivy.md` (RRF section, Cormack et al. SIGIR 2009, k=60 insensitivity, production system approaches)
|
||||
|
||||
#### Phase 3: SEARCH Query Parser and Executor (m5p3)
|
||||
|
||||
**Delivers:** The `SEARCH` query operation -- parser, planner, and executor -- that orchestrates text retrieval, semantic retrieval, hybrid fusion, personalization, filtering, diversity, and result assembly. Reuses the existing filter engine (m2p2), diversity enforcement (m2p4), and profile executor (m2p3/m3p3) from prior milestones. The `search_click` signal type is integrated for feedback loop closure.
|
||||
|
||||
**Acceptance Criteria:**
|
||||
|
||||
- [ ] Parser handles: `SEARCH items/creators`, `QUERY "text"`, `VECTOR [embedding]`, `FOR USER`, `USING PROFILE`, `FILTER`, `DIVERSITY`, `LIMIT`
|
||||
- [ ] Query text parsing: exact phrase (`"...""`), boolean operators (AND/OR/NOT/-), field-scoped (`title:...`), wildcard (`term*`)
|
||||
- [ ] Executor pipeline: text retrieval -> ANN retrieval -> fusion -> personalization -> filter -> diversity -> return
|
||||
- [ ] When both QUERY and VECTOR provided, hybrid fusion (RRF)
|
||||
- [ ] When only QUERY, BM25-only retrieval
|
||||
- [ ] When only VECTOR, ANN-only retrieval
|
||||
- [ ] Search results include: entity_id, combined_score, bm25_score, semantic_score, rank
|
||||
- [ ] `search_click` signal writes include query context and rank position
|
||||
- [ ] End-to-end SEARCH < 50ms at 10K items (benchmarked)
|
||||
- [ ] `Search` struct with fields: `entity_kind`, `query_text: Option<String>`, `query_vector: Option<Vec<f32>>`, `for_user: Option<u64>`, `for_session: Option<SessionId>`, `profile: ProfileRef`, `filters: Vec<FilterExpr>`, `diversity: Option<DiversityConstraints>`, `limit: u32`
|
||||
- [ ] `SearchBuilder` with fluent API: `.query("text")`, `.vector(&[f32])`, `.for_user(id)`, `.for_session(id)`, `.using_profile("search")`, `.filter(expr)`, `.diversity(constraints)`, `.limit(n)`, `.build()`
|
||||
- [ ] `db.search(&Search) -> Result<SearchResults>` executes the full pipeline
|
||||
- [ ] Search executor pipeline: (1) parse query text into Tantivy query, (2) if vector present, execute ANN retrieval, (3) if both, fuse via RRF, (4) load user context if `for_user` present, (5) apply profile scoring (personalization, signal boosts, quality gates), (6) apply metadata filters, (7) apply diversity enforcement, (8) assemble results with scores
|
||||
- [ ] `SearchResults` struct contains: `items: Vec<SearchResultItem>`, `next_cursor: Option<Cursor>`, `total_candidates: u64`
|
||||
- [ ] `SearchResultItem` contains: `id: EntityId`, `score: f64`, `bm25_score: Option<f32>`, `semantic_score: Option<f32>`, `signals: SignalSnapshot`
|
||||
- [ ] Query text parsing handles: bare terms (`jazz piano`), exact phrase (`"jazz piano"`), boolean operators (`AND`, `OR`, `NOT`, `-`), field-scoped (`title:jazz`, `tag:tutorial`, `creator:handle`), wildcard prefix (`pian*`), hashtag (`#jazz`)
|
||||
- [ ] `search_click` signal type recognized: `db.signal("search_click", item_id, 1.0, ts)` with context containing `query` and `rank_at_click` fields
|
||||
- [ ] Search profile `search` registered as a builtin: text relevance as floor, personalization adjustment, completion gate, diversity
|
||||
- [ ] Session context (`FOR SESSION`) integrates with search the same way it does with RETRIEVE (preference hint keyword boost, reward velocity factor)
|
||||
- [ ] End-to-end SEARCH < 50ms at 10K items (Criterion benchmarked)
|
||||
- [ ] Full M5 UAT steps 1-4 and 7-8 pass as integration test assertions
|
||||
|
||||
**Depends On:** Phase 1, Phase 2, m2p5 (query parser infrastructure)
|
||||
**Complexity:** M
|
||||
**Task Breakdown:**
|
||||
|
||||
#### Phase 4: Creator and People Search
|
||||
| # | Task | Delivers | Complexity |
|
||||
|---|------|----------|------------|
|
||||
| 01 | Search Types and Builder | `Search`, `SearchBuilder`, `SearchResults`, `SearchResultItem` structs with validation | M |
|
||||
| 02 | Search Executor Pipeline | `SearchExecutor` orchestrating BM25 retrieval, ANN retrieval, fusion, profile scoring, filtering, diversity, result assembly | L |
|
||||
| 03 | Search Profile Builtin | `search` profile definition registered in `ProfileRegistry`, text relevance floor, personalization overlay, configurable RRF k | S |
|
||||
| 04 | search_click Signal Integration | `search_click` signal type with context fields (query, rank_at_click), feedback loop wiring into user-topic affinity | S |
|
||||
|
||||
**Delivers:** Search over creator entities by name, topic, and attributes. "Creators like X" via creator embedding similarity. Enables UC-10.
|
||||
**Depends On:** m5p1 (Tantivy integration, BM25 queries), m5p2 (hybrid fusion), m2p2 (filter engine), m2p3 (profile executor), m2p4 (diversity), m2p5 (query parser infrastructure, RETRIEVE executor pattern to follow), m3p3 (personalized profiles, UserContext), m4p4 (SessionContext for FOR SESSION)
|
||||
**Complexity:** L (4 tasks; Tasks 01 first, then 02 depends on 01; Tasks 03 and 04 can parallelize with 02)
|
||||
**Research Reference:** `VISION.md` (SEARCH query syntax), `API.md` (SEARCH operation, query syntax table), `USE_CASES.md` UC-02 (search capabilities), `SEQUENCE.md` UC-02 (search sequence diagram)
|
||||
|
||||
#### Phase 4: Creator and People Search (m5p4)
|
||||
|
||||
**Delivers:** Search over creator entities by name, topic, and attributes. "Creators like X" via creator embedding similarity. Creator entities indexed in both Tantivy (text fields) and USearch (embeddings). Enables UC-10 (People and Creator Search).
|
||||
|
||||
**Acceptance Criteria:**
|
||||
|
||||
- [ ] Creator entities indexed in Tantivy (name, handle, bio, topics)
|
||||
- [ ] Creator embeddings searchable via ANN (aggregated from catalog)
|
||||
- [ ] `SEARCH creators QUERY "jazz" LIMIT 10` returns creators matching topic
|
||||
- [ ] `SEARCH creators SIMILAR TO @creator_id LIMIT 10` returns similar creators by embedding
|
||||
- [ ] Creator filters: verified, min_followers, language, followed_by_user
|
||||
- [ ] Creator sort modes: follower_count, engagement_rate, posting_frequency
|
||||
- [ ] Creator entities indexed in Tantivy when written via `db.write_creator()`: fields `name` (text, tokenized), `handle` (keyword, raw), `region` (keyword), `language` (keyword), `verified` (bool)
|
||||
- [ ] Creator embeddings indexed in a dedicated USearch index (separate from item embeddings) when provided via `write_creator(id, metadata, Some(embedding))`
|
||||
- [ ] `SEARCH creators QUERY "jazz" LIMIT 10` returns creators matching by name or topic, ordered by BM25 relevance
|
||||
- [ ] `SEARCH creators QUERY "jazz" FILTER verified:true LIMIT 10` filters by creator attributes
|
||||
- [ ] `SEARCH creators SIMILAR TO @creator_id LIMIT 10` retrieves the source creator's embedding and runs ANN against the creator vector index
|
||||
- [ ] Creator search results include: `id: EntityId`, `score: f64`, `metadata: HashMap<String, String>`
|
||||
- [ ] Creator sort modes available: `Sort::CreatorEngagementRate` (average engagement ratio across recent catalog), `Sort::MostFollowed` (follower count desc)
|
||||
- [ ] Creator filters composable: `verified`, `min_followers`, `max_followers`, `language`, `region`, `followed_by_user` (requires FOR USER)
|
||||
- [ ] `followed_by_user` filter uses the existing `FollowsBitmap` infrastructure from m3p1 to restrict results to creators the user follows
|
||||
- [ ] Hybrid search on creators: `SEARCH creators QUERY "jazz" VECTOR [query_embedding] LIMIT 10` fuses BM25 name/topic match with embedding similarity via RRF
|
||||
- [ ] Creator search latency < 20ms at 200 creators (Criterion benchmarked)
|
||||
- [ ] Full M5 UAT steps 5-6 pass as integration test assertions
|
||||
|
||||
**Depends On:** Phase 1, m3p1 (creator entities)
|
||||
**Complexity:** M
|
||||
**Task Breakdown:**
|
||||
|
||||
| # | Task | Delivers | Complexity |
|
||||
|---|------|----------|------------|
|
||||
| 01 | Creator Text Indexing | Tantivy indexing for creator entities, field mapping, write/delete hooks in `write_creator()`/`update_creator()`, syncer integration | M |
|
||||
| 02 | Creator Vector Index | Dedicated USearch index for creator embeddings, insertion on `write_creator()`, ANN search, `SIMILAR TO @creator_id` resolution | M |
|
||||
| 03 | Creator Search Executor | `SEARCH creators` routing in search executor, creator-specific filters (verified, followers, language), sort modes, `followed_by_user` via FollowsBitmap, hybrid fusion for creators | M |
|
||||
|
||||
**Depends On:** m5p1 (Tantivy integration, syncer infrastructure), m5p3 (SEARCH executor pipeline to extend), m3p1 (creator entities, FollowsBitmap), m2p1 (vector index infrastructure)
|
||||
**Complexity:** L (3 tasks; Task 01 and 02 can parallelize; Task 03 depends on both)
|
||||
**Research Reference:** `USE_CASES.md` UC-10 (People and Creator Search: name search, "creators like X", social graph discovery), `API.md` (SEARCH creators examples), `docs/research/ann_for_tidaldb.md` (creator embedding similarity)
|
||||
|
||||
### Phase Dependency DAG
|
||||
|
||||
```
|
||||
m5p1 (Tantivy Integration)
|
||||
| \
|
||||
v \
|
||||
m5p2 (RRF) \
|
||||
| \
|
||||
v v
|
||||
m5p3 (SEARCH Executor)
|
||||
|
|
||||
v
|
||||
m5p4 (Creator Search)
|
||||
```
|
||||
|
||||
m5p1 is the foundation -- everything else depends on having a working text index. m5p2 (RRF fusion) depends on m5p1 for BM25 scores and on the existing m2p1 for ANN scores. m5p3 (SEARCH executor) depends on both m5p1 and m5p2 to orchestrate the full pipeline. m5p4 (Creator search) depends on m5p1 (for creator text indexing) and m5p3 (for the search executor to extend).
|
||||
|
||||
Within m5p1, tasks 01-02 are sequential (schema before documents), then tasks 03, 04, and 05 can parallelize once document write is working.
|
||||
|
||||
### Deferred to Later Milestones
|
||||
|
||||
- **Autocomplete and search suggestions (UC-02.3)** -- deferred to M5; requires prefix indexes and trending query tracking
|
||||
- **Saved searches and alerts (UC-02.4)** -- deferred to M5; requires persistent query storage and push notification
|
||||
- **Visual search / image search (UC-11)** -- deferred to M5; requires multi-modal embedding support
|
||||
- **"Did you mean" typo correction** -- deferred to M5; requires edit-distance computation on term dictionary
|
||||
- **Tuned linear combination (replacing RRF)** -- deferred to M5; requires relevance labels for alpha tuning
|
||||
- **Autocomplete and search suggestions (UC-02.3)** -- deferred to M6; requires prefix indexes on the Tantivy term dictionary and trending query tracking infrastructure; M5 proves search works, M6 adds the polish features
|
||||
- **Saved searches and alerts (UC-02.4)** -- deferred to M6; requires persistent query storage, new-result detection on each indexing pass, and push notification integration; M5 provides the search primitive, M6 builds subscriptions on top
|
||||
- **Visual search / image search (UC-11 full)** -- deferred to M6; UC-11 core (vector-only search) works via M5's `SEARCH items VECTOR [embedding] LIMIT N`; the full crop-and-search and multi-modal (text query against image items) workflow requires additional embedding pipeline coordination
|
||||
- **"Did you mean" typo correction** -- deferred to M6; requires edit-distance computation on the Tantivy term dictionary and a suggestion model; not required for M5's UAT
|
||||
- **Tuned linear combination (replacing RRF)** -- deferred to M7 or later; requires relevance labels and offline evaluation infrastructure; RRF is the correct zero-configuration starting point
|
||||
- **Query composition / SEARCH WITHIN scope** (searching within trending, within cohort trending, within following) -- deferred to M6; requires candidate set intersection with scoped retrieval; M5 proves standalone search works first
|
||||
- **Semantic session hint matching** -- deferred to M6; M4's keyword matching is sufficient; semantic matching via Tantivy text analysis would upgrade hint precision but is not required for M5's UAT
|
||||
- **Search result explanation** ("why this result?") -- deferred to M6/M7; Tantivy provides `Query::explain()` per document but it is expensive; not required for M5's UAT
|
||||
|
||||
### Integration Test
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn milestone_5_uat() {
|
||||
let db = open_with_search_schema();
|
||||
|
||||
// Write 10K items with text fields, embeddings, and metadata.
|
||||
for i in 0..10_000u64 {
|
||||
let meta = item_metadata(i); // title, description, tags, category, format, creator_id
|
||||
let embedding = item_embedding(i); // 1536-dim
|
||||
db.write_item_with_metadata(EntityId::new(i), &meta).unwrap();
|
||||
db.write_item_embedding(EntityId::new(i), &embedding).unwrap();
|
||||
}
|
||||
|
||||
// Write 200 creators with names, handles, and embeddings.
|
||||
for c in 0..200u64 {
|
||||
let meta = creator_metadata(c); // name, handle, verified, language
|
||||
let embedding = creator_embedding(c);
|
||||
db.write_creator(EntityId::new(c), &meta).unwrap();
|
||||
db.write_creator_embedding(EntityId::new(c), &embedding).unwrap();
|
||||
}
|
||||
|
||||
// Write user 42 with engagement history.
|
||||
db.write_user(EntityId::new(42), &user_metadata()).unwrap();
|
||||
for e in generate_engagement_events(500, EntityId::new(42)) {
|
||||
db.signal(&e.signal_type, e.entity_id, e.weight, e.timestamp).unwrap();
|
||||
}
|
||||
|
||||
// Wait for Tantivy syncer to commit.
|
||||
db.flush_text_index().unwrap();
|
||||
|
||||
// Step 1: Hybrid search with personalization and diversity.
|
||||
let query_vec = embed("rust tutorial beginner");
|
||||
let results = db.search(
|
||||
SearchBuilder::new(EntityKind::Item, ProfileRef::new("search"))
|
||||
.query("rust tutorial beginner")
|
||||
.vector(&query_vec)
|
||||
.for_user(42)
|
||||
.diversity(DiversityConstraints { max_per_creator: Some(2), ..Default::default() })
|
||||
.limit(20)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
assert_eq!(results.items.len(), 20);
|
||||
assert!(results.items.iter().all(|r| r.score > 0.0));
|
||||
assert!(results.items.windows(2).all(|w| w[0].score >= w[1].score));
|
||||
assert!(creator_counts(&results.items).values().all(|&c| c <= 2));
|
||||
|
||||
// Step 2: Text-only with filters.
|
||||
let filtered = db.search(
|
||||
SearchBuilder::new(EntityKind::Item, ProfileRef::new("search"))
|
||||
.query("jazz piano")
|
||||
.for_user(42)
|
||||
.filter(FilterExpr::eq("format", "video"))
|
||||
.limit(20)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
assert!(filtered.items.iter().all(|r| r.bm25_score.is_some()));
|
||||
assert!(filtered.items.iter().all(|r| r.semantic_score.is_none()));
|
||||
|
||||
// Step 3: Exact phrase match.
|
||||
let phrase = db.search(
|
||||
SearchBuilder::new(EntityKind::Item, ProfileRef::new("search"))
|
||||
.query("\"exact phrase match\"")
|
||||
.limit(10)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
// All returned items must contain the exact phrase in some text field.
|
||||
|
||||
// Step 4: Boolean exclusion.
|
||||
let excluded = db.search(
|
||||
SearchBuilder::new(EntityKind::Item, ProfileRef::new("search"))
|
||||
.query("jazz -beginner")
|
||||
.limit(10)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
// No returned items should match "beginner" in any text field.
|
||||
|
||||
// Step 5: Creator search by topic.
|
||||
let creators = db.search(
|
||||
SearchBuilder::new(EntityKind::Creator, ProfileRef::new("search"))
|
||||
.query("jazz")
|
||||
.limit(10)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
assert!(!creators.items.is_empty());
|
||||
|
||||
// Step 6: Creators similar to creator_xyz by embedding.
|
||||
let similar_creators = db.search(
|
||||
SearchBuilder::new(EntityKind::Creator, ProfileRef::new("search"))
|
||||
.similar_to(EntityId::new(5))
|
||||
.limit(10)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
assert!(!similar_creators.items.is_empty());
|
||||
assert!(similar_creators.items.iter().all(|r| r.id != EntityId::new(5)));
|
||||
|
||||
// Step 7: Search click signal with context.
|
||||
let clicked = results.items[2].id;
|
||||
db.signal("search_click", clicked, 1.0, Timestamp::now()).unwrap();
|
||||
|
||||
// Step 8: Re-search -- clicked result may rank higher.
|
||||
let results2 = db.search(
|
||||
SearchBuilder::new(EntityKind::Item, ProfileRef::new("search"))
|
||||
.query("rust tutorial beginner")
|
||||
.vector(&query_vec)
|
||||
.for_user(42)
|
||||
.limit(20)
|
||||
.build().unwrap()
|
||||
).unwrap();
|
||||
let rank_before = results.items.iter().position(|r| r.id == clicked).unwrap();
|
||||
let rank_after = results2.items.iter().position(|r| r.id == clicked);
|
||||
// The clicked result should appear at the same or better rank.
|
||||
if let Some(ra) = rank_after {
|
||||
assert!(ra <= rank_before);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Done When
|
||||
|
||||
A developer can execute SEARCH queries that combine full-text BM25 relevance with semantic vector similarity and user personalization in a single ranked result set. Boolean queries, phrase matching, field-scoped search, and creator search all work. Results reflect engagement signals. End-to-end SEARCH latency < 50ms at 10K items.
|
||||
A developer can execute `SEARCH items QUERY "rust tutorial beginner" VECTOR [query_embedding] FOR USER @user_42 USING PROFILE search DIVERSITY max_per_creator:2 LIMIT 20` and receive results that combine BM25 text relevance with semantic vector similarity, re-ranked by user personalization and engagement signals, with diversity constraints enforced. Boolean queries (`AND`/`OR`/`NOT`), exact phrase matching (`"..."`), field-scoped search (`title:...`), and wildcard prefix (`term*`) all work. Creator search returns creators by name, topic, and embedding similarity. The `search_click` signal closes the feedback loop -- a clicked result influences the next search. End-to-end SEARCH latency < 50ms at 10K items. All 8 UAT scenario steps pass in the integration test.
|
||||
|
||||
---
|
||||
|
||||
@ -1552,7 +1753,7 @@ Then:
|
||||
|
||||
### Phases
|
||||
|
||||
(Phases for M5 are provisional -- detailed decomposition happens after M4 ships, informed by what was learned.)
|
||||
(Phases for M6 are provisional -- detailed decomposition happens after M5 ships, informed by what was learned.)
|
||||
|
||||
#### Phase 1: Complete Sort Mode Coverage
|
||||
|
||||
|
||||
@ -1,7 +1,5 @@
|
||||
# Milestone 1, Phase 2: Write-Ahead Log
|
||||
|
||||
## Status: COMPLETE
|
||||
|
||||
## Phase Deliverable
|
||||
|
||||
A durable, append-only signal event log. Every signal write (view, like, skip, completion) is appended to the WAL before any aggregation occurs. Signal aggregates, decay scores, and windowed counts are derived state — the WAL is the source of truth. Group commit amortizes fsync cost across concurrent writers. Content-addressed events via per-event BLAKE3 hash for deduplication. Crash recovery scans forward from last checkpoint and truncates corrupted tails.
|
||||
@ -39,12 +37,12 @@ A durable, append-only signal event log. Every signal write (view, like, skip, c
|
||||
|
||||
## Task Index
|
||||
|
||||
| # | Task | Delivers | Depends On | Complexity | Status |
|
||||
|---|------|----------|------------|------------|--------|
|
||||
| 01 | WAL Wire Format and Segment Files | `BatchHeader`, `EventRecord`, `SegmentWriter`, `WalError` | None | M | COMPLETE |
|
||||
| 02 | Group Commit Writer | `WriterConfig`, `WalCommand`, `run_writer` loop | Task 01 | M | COMPLETE |
|
||||
| 03 | Crash Recovery and Replay | `WalReader`, `recover()`, partial-write truncation | Task 01 | M | COMPLETE |
|
||||
| 04 | Deduplication, Checkpoint, and Public API | `DedupWindow`, `CheckpointManager`, `WalHandle`, `SignalEvent` | Task 02, Task 03 | M | COMPLETE |
|
||||
| # | Task | Delivers | Depends On | Complexity |
|
||||
|---|------|----------|------------|------------|
|
||||
| 01 | WAL Wire Format and Segment Files | `BatchHeader`, `EventRecord`, `SegmentWriter`, `WalError` | None | M |
|
||||
| 02 | Group Commit Writer | `WriterConfig`, `WalCommand`, `run_writer` loop | Task 01 | M |
|
||||
| 03 | Crash Recovery and Replay | `WalReader`, `recover()`, partial-write truncation | Task 01 | M |
|
||||
| 04 | Deduplication, Checkpoint, and Public API | `DedupWindow`, `CheckpointManager`, `WalHandle`, `SignalEvent` | Task 02, Task 03 | M |
|
||||
|
||||
## Task Dependency DAG
|
||||
|
||||
|
||||
@ -1,7 +1,5 @@
|
||||
# Milestone 1, Phase 3: Storage Engine Trait and fjall Backend
|
||||
|
||||
## Status: COMPLETE (140 tests passing: 128 unit + 12 integration)
|
||||
|
||||
## Phase Deliverable
|
||||
|
||||
The `StorageEngine` trait abstraction and two implementations: `FjallBackend` (fjall 3 LSM-tree) for production and `InMemoryBackend` (BTreeMap + RwLock) for deterministic testing. Key encoding follows the subject-prefix pattern with a `Tag` discriminant. `FjallStorage` coordinates three keyspaces per entity kind. `FjallAtomicBatch` provides cross-keyspace atomic writes.
|
||||
@ -40,11 +38,11 @@ This phase is the durable entity store — where metadata, signal checkpoints, a
|
||||
|
||||
## Task Index
|
||||
|
||||
| # | Task | Delivers | Depends On | Complexity | Status |
|
||||
|---|------|----------|------------|------------|--------|
|
||||
| 01 | StorageEngine Trait and Key Encoding | `StorageEngine`, `Tag`, `encode_key`, `parse_key`, `entity_prefix`, `entity_tag_prefix`, `WriteBatch`, `BatchOp`, `PrefixIterator`, `StorageError` | None | M | COMPLETE |
|
||||
| 02 | FjallBackend | `FjallBackend`, `FjallStorage`, `FjallAtomicBatch`, persistence tests | Task 01 | M | COMPLETE |
|
||||
| 03 | InMemoryBackend | `InMemoryBackend`, property tests, benchmarks | Task 01 | S | COMPLETE |
|
||||
| # | Task | Delivers | Depends On | Complexity |
|
||||
|---|------|----------|------------|------------|
|
||||
| 01 | StorageEngine Trait and Key Encoding | `StorageEngine`, `Tag`, `encode_key`, `parse_key`, `entity_prefix`, `entity_tag_prefix`, `WriteBatch`, `BatchOp`, `PrefixIterator`, `StorageError` | None | M |
|
||||
| 02 | FjallBackend | `FjallBackend`, `FjallStorage`, `FjallAtomicBatch`, persistence tests | Task 01 | M |
|
||||
| 03 | InMemoryBackend | `InMemoryBackend`, property tests, benchmarks | Task 01 | S |
|
||||
|
||||
## Task Dependency DAG
|
||||
|
||||
|
||||
71
docs/planning/milestone-5/phase-1/OVERVIEW.md
Normal file
71
docs/planning/milestone-5/phase-1/OVERVIEW.md
Normal file
@ -0,0 +1,71 @@
|
||||
# m5p1: Tantivy Integration
|
||||
|
||||
## Delivers
|
||||
|
||||
Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via an outbox sequence. BM25 scoring exposed via custom Collector and the Weight/Scorer seek pattern. Schema text fields (title, description, tags) automatically indexed. Crash recovery replays from the last committed sequence number stored in Tantivy's commit payload.
|
||||
|
||||
## Dependencies
|
||||
|
||||
- m1p3 (storage engine, key encoding, `StorageEngine` trait, `scan_prefix`)
|
||||
- m1p5 (entity write API, WAL sequence numbers)
|
||||
- m2p2 (metadata fields used for field-scoped queries)
|
||||
- m4 (full TidalDb API with sessions and agents — all complete)
|
||||
|
||||
## Research References
|
||||
|
||||
- `docs/research/tantivy.md` — Collector API, consistency pattern, seek scoring, commit model, single-writer lock, segment merge
|
||||
- `CODING_GUIDELINES.md` Section 5 — Text Search guidelines
|
||||
- `CODING_GUIDELINES.md` Section 7 — Error handling
|
||||
|
||||
## Acceptance Criteria (Phase Level)
|
||||
|
||||
- [ ] `TextIndex` struct wraps Tantivy `Index`, `IndexWriter` (behind `Mutex`), and `IndexReader` with auto-reload
|
||||
- [ ] Tantivy schema created from tidalDB schema text field definitions: `text` fields get full-text tokenization; `keyword` fields get raw indexing
|
||||
- [ ] `TextIndexWriter::index_item(entity_id, metadata)` adds or updates a document in Tantivy; `delete_item(entity_id)` removes via `delete_term`
|
||||
- [ ] Background indexer: `TextIndexSyncer` reads entity store writes (via WAL sequence tracking) and feeds Tantivy writer; commit interval configurable (default: every 1000 docs or 2 seconds)
|
||||
- [ ] Each Tantivy `commit()` stores the last-processed WAL sequence number in the commit payload via `set_payload()`; crash recovery replays from that sequence number
|
||||
- [ ] Custom `AllScoresCollector` implementing Tantivy's `Collector` trait returns all matching `(EntityId, f32)` pairs with BM25 scores; `requires_scoring()` returns `true`
|
||||
- [ ] `ScoredCandidateCollector` implementing Tantivy's `Collector` trait accepts a pre-sorted candidate set and returns BM25 scores via `DocSet::seek()`
|
||||
- [ ] External `EntityId -> DocAddress` mapping maintained via a fast field (`entity_id_field`) on every Tantivy document
|
||||
- [ ] Boolean query parsing: AND, OR, NOT operators; exact phrase (`"..."`); field-scoped (`title:jazz`); exclusion (`-beginner`); wildcard prefix (`pian*`)
|
||||
- [ ] Index rebuild from entity store: `text_index.rebuild_from(storage)` scans all items and rebuilds Tantivy index
|
||||
- [ ] BM25 query latency < 10ms at 10K documents (Criterion benchmarked)
|
||||
- [ ] Tantivy `IndexWriter` heap budget set to 50MB
|
||||
- [ ] `LogMergePolicy` configured with defaults; `wait_merging_threads()` called on shutdown
|
||||
- [ ] `TextIndex` is `Send + Sync` — safe to share across threads behind `Arc`
|
||||
|
||||
## Task Execution Order
|
||||
|
||||
```
|
||||
task-01 (TextIndex Core)
|
||||
|
|
||||
v
|
||||
task-02 (Document Write/Delete)
|
||||
| | |
|
||||
v v v
|
||||
task-03 task-04 task-05
|
||||
(Syncer) (Collectors) (Query Parser)
|
||||
```
|
||||
|
||||
Tasks 01-02 are sequential. Tasks 03, 04, 05 can parallelize after task-02 completes.
|
||||
|
||||
## Module Location
|
||||
|
||||
New module: `tidal/src/text/` with submodules:
|
||||
- `mod.rs` — public re-exports
|
||||
- `index.rs` — `TextIndex`, `TextIndexConfig`
|
||||
- `writer.rs` — `TextIndexWriter` (write/delete operations)
|
||||
- `syncer.rs` — `TextIndexSyncer` (background indexing)
|
||||
- `collectors.rs` — `AllScoresCollector`, `ScoredCandidateCollector`
|
||||
- `query.rs` — `TextQueryParser`
|
||||
|
||||
## Notes
|
||||
|
||||
- `tantivy` must be added to `tidal/Cargo.toml` as a dependency
|
||||
- Text field definitions must be added to `Schema` / `SchemaBuilder`
|
||||
- The `unsafe_code = "forbid"` lint is crate-level — `tantivy` itself uses unsafe but we do not need unsafe in our wrapper code
|
||||
- `tantivy` crate itself has `forbid(unsafe_code)` in some modules but not all — the FFI is contained within their crate
|
||||
|
||||
## Done When
|
||||
|
||||
All 14 acceptance criteria above pass. Tests pass with `cargo test --manifest-path tidal/Cargo.toml`. The `text_index` bench shows BM25 query < 10ms at 10K documents.
|
||||
328
docs/planning/milestone-5/phase-1/task-01-text-index-core.md
Normal file
328
docs/planning/milestone-5/phase-1/task-01-text-index-core.md
Normal file
@ -0,0 +1,328 @@
|
||||
# Task 01: TextIndex Core
|
||||
|
||||
## Delivers
|
||||
|
||||
`TextIndex` struct, Tantivy schema generation from tidalDB schema text field definitions, `IndexWriter`/`IndexReader` lifecycle, `entity_id` fast field, `TextIndex::open()` and `TextIndex::close()`.
|
||||
|
||||
Also extends `Schema` and `SchemaBuilder` with `TextFieldDef` — the declaration of which metadata keys to index for full-text search, and whether they are tokenized text or keyword (raw) fields.
|
||||
|
||||
## Complexity: L
|
||||
|
||||
## Dependencies
|
||||
|
||||
- None from prior m5 tasks (this is the foundation)
|
||||
- tidalDB `Schema` (schema/validation.rs) — will be extended
|
||||
- Cargo.toml — `tantivy` dependency must be added
|
||||
|
||||
## Technical Design
|
||||
|
||||
### 1. Add `tantivy` to Cargo.toml
|
||||
|
||||
```toml
|
||||
tantivy = "0.22"
|
||||
```
|
||||
|
||||
Use `0.22` — stable API, widely deployed, Collector trait and DocSet::seek available.
|
||||
|
||||
### 2. Add TextFieldDef to Schema
|
||||
|
||||
In `schema/validation.rs`, add:
|
||||
|
||||
```rust
|
||||
/// Declaration of a text field for full-text search indexing.
|
||||
///
|
||||
/// When a text field is declared in the schema, items written to tidalDB
|
||||
/// will have the corresponding metadata key indexed by Tantivy for full-text search.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct TextFieldDef {
|
||||
/// The metadata key to index (e.g., "title", "description", "tags").
|
||||
pub key: String,
|
||||
/// Whether this field is tokenized (full-text) or raw (keyword/exact-match).
|
||||
pub field_type: TextFieldType,
|
||||
}
|
||||
|
||||
/// The Tantivy indexing mode for a text field.
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub enum TextFieldType {
|
||||
/// Full tokenization with Tantivy's default tokenizer (lowercase, whitespace split).
|
||||
/// Good for: title, description, body text.
|
||||
Text,
|
||||
/// Raw storage, no tokenization. Only exact-match queries work.
|
||||
/// Good for: category, format, creator_id, language tags.
|
||||
Keyword,
|
||||
}
|
||||
```
|
||||
|
||||
Add `text_fields: Vec<TextFieldDef>` to `Schema` and `SchemaBuilder`.
|
||||
Add `SchemaBuilder::text_field(key, TextFieldType)` builder method.
|
||||
Expose `Schema::text_fields() -> &[TextFieldDef]`.
|
||||
|
||||
### 3. TextIndex Module Structure
|
||||
|
||||
Create `tidal/src/text/` module with:
|
||||
|
||||
```
|
||||
tidal/src/text/
|
||||
├── mod.rs # pub re-exports
|
||||
├── index.rs # TextIndex struct and config
|
||||
├── writer.rs # TextIndexWriter
|
||||
├── syncer.rs # TextIndexSyncer (task-03)
|
||||
├── collectors.rs # Scoring collectors (task-04)
|
||||
└── query.rs # TextQueryParser (task-05)
|
||||
```
|
||||
|
||||
Add `pub mod text;` to `tidal/src/lib.rs`.
|
||||
|
||||
### 4. TextIndex Struct
|
||||
|
||||
```rust
|
||||
// tidal/src/text/index.rs
|
||||
|
||||
use std::path::PathBuf;
|
||||
use std::sync::{Arc, Mutex};
|
||||
use tantivy::{Index, IndexReader, IndexWriter, ReloadPolicy, schema as tv_schema};
|
||||
|
||||
use crate::schema::{EntityId, TextFieldDef, TextFieldType};
|
||||
use crate::TidalError;
|
||||
|
||||
/// Configuration for the text index.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct TextIndexConfig {
|
||||
/// Directory for Tantivy index files.
|
||||
pub index_dir: PathBuf,
|
||||
/// IndexWriter heap budget in bytes. Default: 50MB.
|
||||
pub heap_budget_bytes: usize,
|
||||
/// Maximum documents before forcing a commit.
|
||||
pub commit_every_n_docs: usize,
|
||||
/// Maximum seconds between commits.
|
||||
pub commit_every_secs: u64,
|
||||
}
|
||||
|
||||
impl Default for TextIndexConfig {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
index_dir: PathBuf::from("data/text_index"),
|
||||
heap_budget_bytes: 50 * 1024 * 1024, // 50MB
|
||||
commit_every_n_docs: 1000,
|
||||
commit_every_secs: 2,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Fields that every Tantivy document must have.
|
||||
pub(crate) struct TantivyFields {
|
||||
/// Fast field for the tidalDB entity ID (u64). Used for EntityId->DocAddress mapping.
|
||||
pub entity_id: tv_schema::Field,
|
||||
/// Declared text fields from the tidalDB schema.
|
||||
pub text_fields: Vec<(String, tv_schema::Field, TextFieldType)>,
|
||||
}
|
||||
|
||||
/// The text index. Wraps Tantivy's Index, IndexWriter, and IndexReader.
|
||||
///
|
||||
/// Thread-safe: the IndexWriter is behind a Mutex (Tantivy enforces single-writer),
|
||||
/// the IndexReader provides lock-free snapshot reads.
|
||||
///
|
||||
/// IMPORTANT: `TextIndex` is a derived index. The entity store is the source of truth.
|
||||
/// If the Tantivy index is lost, call `rebuild_from()` to reconstruct it.
|
||||
pub struct TextIndex {
|
||||
pub(crate) index: Index,
|
||||
pub(crate) writer: Mutex<IndexWriter>,
|
||||
pub(crate) reader: IndexReader,
|
||||
pub(crate) fields: Arc<TantivyFields>,
|
||||
pub(crate) config: TextIndexConfig,
|
||||
}
|
||||
```
|
||||
|
||||
### 5. TextIndex::open() and ::close()
|
||||
|
||||
```rust
|
||||
impl TextIndex {
|
||||
/// Open or create a TextIndex from the given config and field definitions.
|
||||
///
|
||||
/// If the index directory exists, opens the existing index.
|
||||
/// If not, creates a new index.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns `TidalError::Internal` if Tantivy initialization fails.
|
||||
pub fn open(config: TextIndexConfig, text_fields: &[TextFieldDef]) -> crate::Result<Self> {
|
||||
// 1. Build Tantivy schema
|
||||
let (tv_schema, fields) = build_tantivy_schema(text_fields)?;
|
||||
|
||||
// 2. Open or create index
|
||||
let index = if config.index_dir.exists() {
|
||||
Index::open_in_dir(&config.index_dir)
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy open: {e}")))?
|
||||
} else {
|
||||
std::fs::create_dir_all(&config.index_dir)
|
||||
.map_err(|e| TidalError::Internal(format!("create index dir: {e}")))?;
|
||||
Index::create_in_dir(&config.index_dir, tv_schema)
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy create: {e}")))?
|
||||
};
|
||||
|
||||
// 3. Create IndexWriter with heap budget
|
||||
let writer = index
|
||||
.writer(config.heap_budget_bytes)
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;
|
||||
|
||||
// 4. Create IndexReader with auto-reload on commit
|
||||
let reader = index
|
||||
.reader_builder()
|
||||
.reload_policy(ReloadPolicy::OnCommitWithDelay)
|
||||
.try_into()
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;
|
||||
|
||||
Ok(Self {
|
||||
index,
|
||||
writer: Mutex::new(writer),
|
||||
reader,
|
||||
fields: Arc::new(fields),
|
||||
config,
|
||||
})
|
||||
}
|
||||
|
||||
/// Open an in-memory text index for testing.
|
||||
pub fn ephemeral(text_fields: &[TextFieldDef]) -> crate::Result<Self> {
|
||||
let (tv_schema, fields) = build_tantivy_schema(text_fields)?;
|
||||
let index = Index::create_in_ram(tv_schema);
|
||||
let writer = index
|
||||
.writer(15 * 1024 * 1024) // 15MB minimum for ephemeral
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;
|
||||
let reader = index
|
||||
.reader_builder()
|
||||
.reload_policy(ReloadPolicy::Manual)
|
||||
.try_into()
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;
|
||||
let config = TextIndexConfig {
|
||||
index_dir: PathBuf::from(":memory:"),
|
||||
..Default::default()
|
||||
};
|
||||
Ok(Self {
|
||||
index,
|
||||
writer: Mutex::new(writer),
|
||||
reader,
|
||||
fields: Arc::new(fields),
|
||||
config,
|
||||
})
|
||||
}
|
||||
|
||||
/// Graceful shutdown: wait for background merges to complete.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns `TidalError::Internal` if the writer fails to commit or merge.
|
||||
pub fn close(self) -> crate::Result<()> {
|
||||
let mut writer = self
|
||||
.writer
|
||||
.into_inner()
|
||||
.map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
|
||||
writer
|
||||
.wait_merging_threads()
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy merge wait: {e}")))
|
||||
}
|
||||
|
||||
/// Get a reference to the fields mapping (for writer and collector use).
|
||||
#[must_use]
|
||||
pub fn fields(&self) -> &Arc<TantivyFields> {
|
||||
&self.fields
|
||||
}
|
||||
}
|
||||
|
||||
/// Construct a Tantivy schema from tidalDB text field definitions.
|
||||
///
|
||||
/// Always adds:
|
||||
/// - `entity_id`: u64 fast field for EntityId -> DocAddress mapping
|
||||
///
|
||||
/// For each TextFieldDef:
|
||||
/// - `TextFieldType::Text` → `TEXT | STORED` (tokenized, stored for highlight)
|
||||
/// - `TextFieldType::Keyword` → `STRING | STORED` (raw, stored)
|
||||
fn build_tantivy_schema(
|
||||
text_fields: &[TextFieldDef],
|
||||
) -> crate::Result<(tv_schema::Schema, TantivyFields)> {
|
||||
let mut sb = tv_schema::Schema::builder();
|
||||
|
||||
// entity_id fast field — every document must have this
|
||||
let entity_id_field = sb.add_u64_field(
|
||||
"entity_id",
|
||||
tv_schema::FAST | tv_schema::STORED,
|
||||
);
|
||||
|
||||
let mut fields = Vec::with_capacity(text_fields.len());
|
||||
for def in text_fields {
|
||||
let options = match def.field_type {
|
||||
TextFieldType::Text => tv_schema::TEXT | tv_schema::STORED,
|
||||
TextFieldType::Keyword => tv_schema::STRING | tv_schema::STORED,
|
||||
};
|
||||
let field = sb.add_text_field(&def.key, options);
|
||||
fields.push((def.key.clone(), field, def.field_type.clone()));
|
||||
}
|
||||
|
||||
let schema = sb.build();
|
||||
Ok((
|
||||
schema,
|
||||
TantivyFields {
|
||||
entity_id: entity_id_field,
|
||||
text_fields: fields,
|
||||
},
|
||||
))
|
||||
}
|
||||
```
|
||||
|
||||
### 6. TextIndex must be Send + Sync
|
||||
|
||||
`tantivy::Index` is `Send + Sync`. `tantivy::IndexWriter` is `Send` (not `Sync`) — hence the `Mutex<IndexWriter>`. `tantivy::IndexReader` is `Send + Sync`. `Mutex<IndexWriter>` is `Send + Sync` when `IndexWriter: Send`. So `TextIndex` is `Send + Sync` implicitly.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `TextFieldDef` and `TextFieldType` types in `schema/validation.rs`
|
||||
- [ ] `SchemaBuilder::text_field(key, TextFieldType)` builder method
|
||||
- [ ] `Schema::text_fields() -> &[TextFieldDef]` accessor
|
||||
- [ ] `tidal/src/text/` module created with `pub mod text;` in `lib.rs`
|
||||
- [ ] `TextIndex::open(config, text_fields)` creates or opens a Tantivy index
|
||||
- [ ] `TextIndex::ephemeral(text_fields)` creates an in-memory index for tests
|
||||
- [ ] `TextIndex::close(self)` calls `wait_merging_threads()`
|
||||
- [ ] `entity_id` fast field present in every Tantivy document
|
||||
- [ ] `Text` fields use `TEXT | STORED` options (tokenized)
|
||||
- [ ] `Keyword` fields use `STRING | STORED` options (raw/exact)
|
||||
- [ ] `TextIndex` is `Send + Sync`
|
||||
- [ ] Unit tests: `open_and_close`, `ephemeral_creates_valid_index`, `schema_has_entity_id_field`, `text_fields_correct_options`, `keyword_fields_correct_options`
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Test Strategy
|
||||
|
||||
```rust
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::schema::{TextFieldDef, TextFieldType};
|
||||
|
||||
fn test_fields() -> Vec<TextFieldDef> {
|
||||
vec![
|
||||
TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
|
||||
TextFieldDef { key: "tags".into(), field_type: TextFieldType::Keyword },
|
||||
]
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ephemeral_creates_valid_index() {
|
||||
let idx = TextIndex::ephemeral(&test_fields()).unwrap();
|
||||
let fields = idx.fields();
|
||||
// entity_id field exists
|
||||
assert!(fields.text_fields.iter().any(|(k, _, _)| k == "title"));
|
||||
assert!(fields.text_fields.iter().any(|(k, _, _)| k == "tags"));
|
||||
idx.close().unwrap();
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn open_and_close_on_disk() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let config = TextIndexConfig {
|
||||
index_dir: dir.path().to_path_buf(),
|
||||
..Default::default()
|
||||
};
|
||||
let idx = TextIndex::open(config.clone(), &test_fields()).unwrap();
|
||||
idx.close().unwrap();
|
||||
// Reopen
|
||||
let idx2 = TextIndex::open(config, &test_fields()).unwrap();
|
||||
idx2.close().unwrap();
|
||||
}
|
||||
}
|
||||
```
|
||||
@ -0,0 +1,197 @@
|
||||
# Task 02: Document Write/Delete
|
||||
|
||||
## Delivers
|
||||
|
||||
`TextIndexWriter` with `index_item()`, `delete_item()`, field mapping (text → tokenized, keyword → raw), metadata-to-document conversion, and commit with sequence number payload.
|
||||
|
||||
## Complexity: M
|
||||
|
||||
## Dependencies
|
||||
|
||||
- Task 01 complete: `TextIndex`, `TantivyFields`, `TextFieldDef`, `TextFieldType` all exist
|
||||
|
||||
## Technical Design
|
||||
|
||||
### TextIndexWriter
|
||||
|
||||
```rust
|
||||
// tidal/src/text/writer.rs
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::sync::MutexGuard;
|
||||
use tantivy::{Document, Term, doc};
|
||||
use tantivy::schema::Value;
|
||||
|
||||
use crate::schema::EntityId;
|
||||
use crate::text::index::{TextIndex, TantivyFields};
|
||||
use crate::TidalError;
|
||||
|
||||
/// Write operations on the Tantivy text index.
|
||||
///
|
||||
/// This is a thin wrapper over the locked IndexWriter that converts tidalDB
|
||||
/// metadata maps into Tantivy documents and handles entity_id-based deletes.
|
||||
///
|
||||
/// Thread safety: `TextIndexWriter` holds a `MutexGuard` on the IndexWriter.
|
||||
/// Operations are batched in memory and only become visible after `commit()`.
|
||||
pub struct TextIndexWriter<'a> {
|
||||
writer: MutexGuard<'a, tantivy::IndexWriter>,
|
||||
fields: &'a TantivyFields,
|
||||
}
|
||||
|
||||
impl TextIndex {
|
||||
/// Lock the writer and return a `TextIndexWriter` for batch operations.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns `TidalError::Internal` if the writer mutex is poisoned.
|
||||
pub fn writer_guard(&self) -> crate::Result<TextIndexWriter<'_>> {
|
||||
let writer = self
|
||||
.writer
|
||||
.lock()
|
||||
.map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
|
||||
Ok(TextIndexWriter {
|
||||
writer,
|
||||
fields: &self.fields,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl<'a> TextIndexWriter<'a> {
|
||||
/// Index or re-index an item.
|
||||
///
|
||||
/// Tantivy has no atomic update — this deletes any existing document for
|
||||
/// `entity_id` and adds a fresh document. Both operations are in the same
|
||||
/// batch and become visible atomically on the next `commit()`.
|
||||
///
|
||||
/// Only metadata keys that match a declared text field are indexed.
|
||||
/// Unknown keys are silently ignored.
|
||||
pub fn index_item(
|
||||
&mut self,
|
||||
entity_id: EntityId,
|
||||
metadata: &HashMap<String, String>,
|
||||
) -> crate::Result<()> {
|
||||
// Delete any existing document for this entity_id
|
||||
let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
|
||||
self.writer.delete_term(id_term);
|
||||
|
||||
// Build document
|
||||
let mut doc = Document::new();
|
||||
doc.add_u64(self.fields.entity_id, entity_id.get());
|
||||
|
||||
for (key, tv_field, _field_type) in &self.fields.text_fields {
|
||||
if let Some(value) = metadata.get(key) {
|
||||
doc.add_text(*tv_field, value);
|
||||
}
|
||||
}
|
||||
|
||||
self.writer
|
||||
.add_document(doc)
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy add_document: {e}")))?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Remove an item from the index.
|
||||
///
|
||||
/// The delete takes effect on the next `commit()`.
|
||||
pub fn delete_item(&mut self, entity_id: EntityId) {
|
||||
let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
|
||||
self.writer.delete_term(id_term);
|
||||
}
|
||||
|
||||
/// Commit all pending writes and store `last_seq` in the commit payload.
|
||||
///
|
||||
/// This is the durability boundary: after `commit()` returns, all indexed
|
||||
/// documents are visible to new `IndexReader::searcher()` instances.
|
||||
///
|
||||
/// The `last_seq` is stored in the Tantivy commit payload via `set_payload()`.
|
||||
/// On crash recovery, read the last commit payload to find the resume point.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns `TidalError::Internal` if the commit fails.
|
||||
pub fn commit(&mut self, last_seq: u64) -> crate::Result<()> {
|
||||
self.writer.set_payload(&last_seq.to_string());
|
||||
self.writer
|
||||
.commit()
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Read the last committed sequence number from the Tantivy index payload.
|
||||
///
|
||||
/// Returns 0 if no commit payload exists (fresh index or first run).
|
||||
pub fn last_committed_seq(index: &tantivy::Index) -> u64 {
|
||||
index
|
||||
.load_metas()
|
||||
.ok()
|
||||
.and_then(|meta| meta.payload)
|
||||
.and_then(|p| p.parse::<u64>().ok())
|
||||
.unwrap_or(0)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Integration with TidalDb
|
||||
|
||||
Wire `index_item` calls into `TidalDb::write_item_with_metadata()` and `write_item()`. The text index should be updated **after** the entity store write succeeds (DB-primary consistency: entity store wins, Tantivy is derived).
|
||||
|
||||
In the immediate term (before the background syncer in task-03), do a synchronous index update after each write. The background syncer in task-03 will replace this with an async outbox pattern.
|
||||
|
||||
Actually, for correctness in m5p1, keep it synchronous (direct call after entity store write). Task-03 (Background Syncer) replaces the synchronous write with the outbox pattern.
|
||||
|
||||
### EntityId fast field access
|
||||
|
||||
`EntityId` must expose its inner `u64` value. Check if `EntityId::get()` exists — if not, add it:
|
||||
|
||||
```rust
|
||||
impl EntityId {
|
||||
pub fn get(&self) -> u64 {
|
||||
self.0 // or whatever the inner field is
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `TextIndexWriter::index_item(entity_id, metadata)` builds a Tantivy document with `entity_id` fast field + all matching text fields
|
||||
- [ ] Unknown metadata keys (not declared as text fields) are silently ignored
|
||||
- [ ] `delete_item(entity_id)` issues a `delete_term` on the `entity_id` fast field
|
||||
- [ ] `index_item` does delete-then-add (same batch): updating an item does not leave orphan documents
|
||||
- [ ] `commit(last_seq)` calls `set_payload(&last_seq.to_string())` before `commit()`
|
||||
- [ ] `TextIndexWriter::last_committed_seq(index)` reads payload from last commit; returns 0 on fresh index
|
||||
- [ ] `TextIndex::writer_guard()` acquires the mutex and returns `TextIndexWriter`
|
||||
- [ ] Unit tests: `index_and_search`, `delete_removes_document`, `update_replaces_document`, `commit_stores_sequence`, `last_committed_seq_returns_zero_fresh`, `last_committed_seq_returns_stored_value`
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Test Strategy
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn index_and_search() {
|
||||
let fields = vec![
|
||||
TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
|
||||
];
|
||||
let idx = TextIndex::ephemeral(&fields).unwrap();
|
||||
let mut w = idx.writer_guard().unwrap();
|
||||
let mut meta = HashMap::new();
|
||||
meta.insert("title".into(), "Rust programming language".into());
|
||||
w.index_item(EntityId::new(42), &meta).unwrap();
|
||||
w.commit(1).unwrap();
|
||||
// Searcher should find item 42 for query "Rust"
|
||||
idx.reader.reload().unwrap(); // force reader refresh in test
|
||||
let searcher = idx.reader.searcher();
|
||||
// ... assert item found
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn delete_removes_document() {
|
||||
// Write, commit, delete, commit, verify not found
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn commit_stores_sequence() {
|
||||
let idx = TextIndex::ephemeral(&[]).unwrap(); // no text fields, just entity_id
|
||||
// index_item with only entity_id field, commit(seq=42)
|
||||
let seq = TextIndexWriter::last_committed_seq(&idx.index);
|
||||
assert_eq!(seq, 42);
|
||||
}
|
||||
```
|
||||
241
docs/planning/milestone-5/phase-1/task-03-background-syncer.md
Normal file
241
docs/planning/milestone-5/phase-1/task-03-background-syncer.md
Normal file
@ -0,0 +1,241 @@
|
||||
# Task 03: Background Syncer
|
||||
|
||||
## Delivers
|
||||
|
||||
`TextIndexSyncer` — a background thread that reads entity store writes (tracked via a sequence counter), feeds Tantivy writer, commits on interval (every 1000 docs or 2 seconds), and stores the last-processed sequence number in the commit payload. On crash recovery, reads the commit payload to find the resume point and replays from the entity store.
|
||||
|
||||
## Complexity: L
|
||||
|
||||
## Dependencies
|
||||
|
||||
- Task 01 complete: `TextIndex`, `TextIndexConfig`
|
||||
- Task 02 complete: `TextIndexWriter`, `commit(seq)`, `last_committed_seq()`
|
||||
- `StorageEngine` trait with `scan_prefix()` for rebuild
|
||||
|
||||
## Technical Design
|
||||
|
||||
### Approach
|
||||
|
||||
Use an **outbox sequence counter** approach. The entity store write path increments a shared `AtomicU64` sequence counter each time an item is written. The syncer reads this counter and processes any items with sequence numbers above its last committed value.
|
||||
|
||||
For the initial m5p1 implementation, use a simpler approach:
|
||||
1. The syncer runs on a configurable interval (default: 2 seconds)
|
||||
2. On each tick, it scans ALL items from the entity store and re-indexes them if their sequence number is higher than last committed
|
||||
3. A more sophisticated outbox pattern (WAL-based) is deferred to future work
|
||||
|
||||
This is correct but not optimally efficient — full rebuild handles correctness, partial updates optimize throughput. For 10K items, a full rebuild takes < 1 second, so this is acceptable.
|
||||
|
||||
Actually, looking at the WAL sequence numbers and the entity store, the simplest correct approach is:
|
||||
- Maintain a monotonic `write_counter: AtomicU64` in `TidalDb` that increments on each `write_item_with_metadata()` call
|
||||
- The syncer checks if `write_counter > last_committed_seq` and if so, does a full index rebuild
|
||||
- This guarantees correctness at the cost of always doing a full rebuild (acceptable for 10K items)
|
||||
|
||||
For a more sophisticated approach with incremental updates, we track which entity IDs have been updated since the last commit via a concurrent queue:
|
||||
|
||||
```rust
|
||||
// In TidalDb: a channel where item writes post (entity_id, write_seq) pairs
|
||||
pending_text_updates: crossbeam::channel::Sender<(EntityId, u64)>
|
||||
```
|
||||
|
||||
The syncer receives these pairs, batches them, and commits on interval.
|
||||
|
||||
Use the channel approach — it's more efficient and correctly handles the outbox pattern.
|
||||
|
||||
### TextIndexSyncer
|
||||
|
||||
```rust
|
||||
// tidal/src/text/syncer.rs
|
||||
|
||||
use std::sync::Arc;
|
||||
use std::time::{Duration, Instant};
|
||||
use crossbeam::channel::{Receiver, RecvTimeoutError};
|
||||
use crate::schema::EntityId;
|
||||
use crate::text::index::TextIndex;
|
||||
use crate::storage::StorageEngine;
|
||||
use crate::TidalError;
|
||||
|
||||
/// A pending write event: entity_id + WAL sequence number of the write.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct PendingWrite {
|
||||
pub entity_id: EntityId,
|
||||
pub metadata: std::collections::HashMap<String, String>,
|
||||
pub seq: u64,
|
||||
/// If true, this is a delete (item was removed).
|
||||
pub deleted: bool,
|
||||
}
|
||||
|
||||
/// Background syncer that feeds the Tantivy text index from the entity store outbox.
|
||||
pub struct TextIndexSyncer {
|
||||
index: Arc<TextIndex>,
|
||||
rx: Receiver<PendingWrite>,
|
||||
commit_every_n: usize,
|
||||
commit_every: Duration,
|
||||
}
|
||||
|
||||
impl TextIndexSyncer {
|
||||
pub fn new(
|
||||
index: Arc<TextIndex>,
|
||||
rx: Receiver<PendingWrite>,
|
||||
commit_every_n: usize,
|
||||
commit_every_secs: u64,
|
||||
) -> Self {
|
||||
Self {
|
||||
index,
|
||||
rx,
|
||||
commit_every_n,
|
||||
commit_every: Duration::from_secs(commit_every_secs),
|
||||
}
|
||||
}
|
||||
|
||||
/// Run the syncer loop. Blocks until the channel is closed (sender dropped).
|
||||
///
|
||||
/// This is intended to run on a dedicated background thread.
|
||||
pub fn run(self) -> crate::Result<()> {
|
||||
let mut pending_count = 0usize;
|
||||
let mut last_commit_time = Instant::now();
|
||||
let mut last_seq = 0u64;
|
||||
let mut writer = self.index.writer_guard()?;
|
||||
|
||||
loop {
|
||||
// Try to receive with timeout
|
||||
match self.rx.recv_timeout(Duration::from_millis(100)) {
|
||||
Ok(update) => {
|
||||
if update.deleted {
|
||||
writer.delete_item(update.entity_id);
|
||||
} else {
|
||||
writer.index_item(update.entity_id, &update.metadata)?;
|
||||
}
|
||||
if update.seq > last_seq {
|
||||
last_seq = update.seq;
|
||||
}
|
||||
pending_count += 1;
|
||||
|
||||
// Commit if batch is full
|
||||
if pending_count >= self.commit_every_n {
|
||||
writer.commit(last_seq)?;
|
||||
pending_count = 0;
|
||||
last_commit_time = Instant::now();
|
||||
}
|
||||
}
|
||||
Err(RecvTimeoutError::Timeout) => {
|
||||
// Commit on timeout if there are pending documents
|
||||
if pending_count > 0 && last_commit_time.elapsed() >= self.commit_every {
|
||||
writer.commit(last_seq)?;
|
||||
pending_count = 0;
|
||||
last_commit_time = Instant::now();
|
||||
}
|
||||
}
|
||||
Err(RecvTimeoutError::Disconnected) => {
|
||||
// Channel closed: flush remaining
|
||||
if pending_count > 0 {
|
||||
writer.commit(last_seq)?;
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Crash Recovery
|
||||
|
||||
On `TidalDb::open()` (or `TidalDb::builder().open()`), after opening the Tantivy index:
|
||||
|
||||
```rust
|
||||
let last_committed = TextIndexWriter::last_committed_seq(&text_index.index);
|
||||
// The syncer will process events with seq > last_committed
|
||||
// Since entity_writes are tracked, items written after last_committed
|
||||
// will be re-submitted to the syncer automatically on the first cycle.
|
||||
```
|
||||
|
||||
For the initial implementation, implement `rebuild_from()`:
|
||||
|
||||
```rust
|
||||
impl TextIndex {
|
||||
/// Rebuild the Tantivy index from the entity store.
|
||||
///
|
||||
/// Scans all items in the entity store and re-indexes them.
|
||||
/// The last committed sequence is set to `last_seq` after rebuild.
|
||||
///
|
||||
/// Used for crash recovery and initial setup.
|
||||
pub fn rebuild_from(
|
||||
&self,
|
||||
storage: &dyn crate::storage::StorageEngine,
|
||||
last_seq: u64,
|
||||
) -> crate::Result<()> {
|
||||
let mut writer = self.writer_guard()?;
|
||||
|
||||
// Delete all existing documents
|
||||
writer.writer.delete_all_documents()
|
||||
.map_err(|e| TidalError::Internal(format!("tantivy delete_all: {e}")))?;
|
||||
|
||||
// Scan all items from entity store
|
||||
for entry in storage.scan_prefix(&[]) {
|
||||
let (key, value) = entry.map_err(|e| TidalError::from(e))?;
|
||||
// Parse entity_id from key, metadata from value
|
||||
// ... decode and index each item
|
||||
}
|
||||
|
||||
writer.commit(last_seq)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Integration in TidalDb
|
||||
|
||||
Add to `TidalDb`:
|
||||
- `text_index: Option<Arc<TextIndex>>` — `None` if no text fields declared in schema
|
||||
- `text_tx: Option<crossbeam::channel::Sender<PendingWrite>>` — channel to syncer
|
||||
- `text_syncer_thread: Option<std::thread::JoinHandle<crate::Result<()>>>` — background thread
|
||||
|
||||
On `write_item_with_metadata()`, after the entity store write, send to `text_tx` if `Some`.
|
||||
|
||||
On `close()` / `shutdown()`, drop `text_tx` to signal the syncer to flush and exit, then join the thread.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `TextIndexSyncer` struct with `new()` and `run()` methods
|
||||
- [ ] `PendingWrite` struct with `entity_id`, `metadata`, `seq`, `deleted` fields
|
||||
- [ ] Syncer commits after `commit_every_n` documents
|
||||
- [ ] Syncer commits after `commit_every_secs` timeout even with fewer documents
|
||||
- [ ] Syncer flushes remaining documents when channel is closed (graceful shutdown)
|
||||
- [ ] Each commit stores `last_seq` in the Tantivy commit payload
|
||||
- [ ] `TextIndex::rebuild_from(storage, last_seq)` scans entity store and re-indexes all items
|
||||
- [ ] `TidalDb` holds `Option<Arc<TextIndex>>` — `None` if schema has no text fields
|
||||
- [ ] `TidalDb::write_item_with_metadata()` sends `PendingWrite` to the syncer channel
|
||||
- [ ] `TidalDb::close()` drops the channel sender and joins the syncer thread
|
||||
- [ ] Unit tests: `syncer_commits_on_batch`, `syncer_commits_on_timeout`, `syncer_flushes_on_shutdown`, `rebuild_from_indexes_all_items`
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Test Strategy
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn syncer_commits_on_batch() {
|
||||
let (tx, rx) = crossbeam::channel::unbounded();
|
||||
let idx = Arc::new(TextIndex::ephemeral(&test_fields()).unwrap());
|
||||
let syncer = TextIndexSyncer::new(Arc::clone(&idx), rx, 3, 60);
|
||||
let handle = std::thread::spawn(move || syncer.run());
|
||||
|
||||
// Send 3 items → triggers commit
|
||||
for i in 0..3u64 {
|
||||
tx.send(PendingWrite {
|
||||
entity_id: EntityId::new(i),
|
||||
metadata: make_meta(i),
|
||||
seq: i + 1,
|
||||
deleted: false,
|
||||
}).unwrap();
|
||||
}
|
||||
|
||||
// Drop sender to trigger flush
|
||||
drop(tx);
|
||||
handle.join().unwrap().unwrap();
|
||||
|
||||
// Verify all 3 items are in the index
|
||||
let searcher = idx.reader.searcher();
|
||||
assert_eq!(searcher.num_docs(), 3);
|
||||
}
|
||||
```
|
||||
@ -0,0 +1,230 @@
|
||||
# Task 04: BM25 Scoring Collectors
|
||||
|
||||
## Delivers
|
||||
|
||||
`AllScoresCollector` for returning all matching `(EntityId, f32)` pairs with BM25 scores, and `ScoredCandidateCollector` for scoring a pre-sorted candidate set using `DocSet::seek()`. Entity ID resolution from the `entity_id` fast field.
|
||||
|
||||
## Complexity: M
|
||||
|
||||
## Dependencies
|
||||
|
||||
- Task 01 complete: `TextIndex`, `TantivyFields` with `entity_id` fast field
|
||||
- Task 02 complete: documents have `entity_id` fast field populated
|
||||
|
||||
## Technical Design
|
||||
|
||||
From `docs/research/tantivy.md` — the Collector API is the recommended approach:
|
||||
|
||||
### AllScoresCollector
|
||||
|
||||
Captures every `(DocAddress, Score)` pair matching a query. `requires_scoring()` must return `true` or BM25 is skipped.
|
||||
|
||||
```rust
|
||||
// tidal/src/text/collectors.rs
|
||||
|
||||
use std::collections::HashMap;
|
||||
use tantivy::{DocAddress, DocId, Score, SegmentOrdinal, SegmentReader};
|
||||
use tantivy::collector::{Collector, SegmentCollector};
|
||||
use tantivy::fastfield::FastFieldReader;
|
||||
use crate::schema::EntityId;
|
||||
|
||||
// ── AllScoresCollector ────────────────────────────────────────────────────────
|
||||
|
||||
/// Tantivy Collector that captures all matching documents with their BM25 scores.
|
||||
///
|
||||
/// Returns `Vec<(EntityId, f32)>` — every matched item with its BM25 relevance score.
|
||||
/// `requires_scoring()` must return `true` for BM25 to be computed.
|
||||
pub struct AllScoresCollector {
|
||||
pub entity_id_field: tantivy::schema::Field,
|
||||
}
|
||||
|
||||
pub struct AllScoresSegmentCollector {
|
||||
segment_ord: SegmentOrdinal,
|
||||
entity_id_reader: tantivy::fastfield::Column<u64>,
|
||||
results: Vec<(EntityId, f32)>,
|
||||
}
|
||||
|
||||
impl Collector for AllScoresCollector {
|
||||
type Fruit = Vec<(EntityId, f32)>;
|
||||
type Child = AllScoresSegmentCollector;
|
||||
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: SegmentOrdinal,
|
||||
segment: &SegmentReader,
|
||||
) -> tantivy::Result<Self::Child> {
|
||||
let entity_id_reader = segment
|
||||
.fast_fields()
|
||||
.u64(self.entity_id_field)?;
|
||||
Ok(AllScoresSegmentCollector {
|
||||
segment_ord: segment_local_id,
|
||||
entity_id_reader,
|
||||
results: Vec::new(),
|
||||
})
|
||||
}
|
||||
|
||||
fn requires_scoring(&self) -> bool {
|
||||
true // CRITICAL: must be true for BM25 computation
|
||||
}
|
||||
|
||||
fn merge_fruits(
|
||||
&self,
|
||||
segment_fruits: Vec<Vec<(EntityId, f32)>>,
|
||||
) -> tantivy::Result<Self::Fruit> {
|
||||
Ok(segment_fruits.into_iter().flatten().collect())
|
||||
}
|
||||
}
|
||||
|
||||
impl SegmentCollector for AllScoresSegmentCollector {
|
||||
type Fruit = Vec<(EntityId, f32)>;
|
||||
|
||||
fn collect(&mut self, doc: DocId, score: Score) {
|
||||
if let Ok(entity_id_val) = self.entity_id_reader.get_val(doc) {
|
||||
self.results.push((EntityId::new(entity_id_val), score));
|
||||
}
|
||||
}
|
||||
|
||||
fn harvest(self) -> Self::Fruit {
|
||||
self.results
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### ScoredCandidateCollector
|
||||
|
||||
Scores a pre-sorted candidate set using `DocSet::seek()`. From the research doc:
|
||||
> Seek advances to the first doc ≥ target; if it returns exactly the target, the document matches the query, and `scorer.score()` gives its BM25 score.
|
||||
|
||||
For tidalDB's use case, we need to:
|
||||
1. Map `EntityId` → `(SegmentOrdinal, DocId)` — requires maintaining a lookup table
|
||||
2. Sort candidates by `(segment_ord, doc_id)` for ascending seek
|
||||
3. For each candidate, seek to its `DocId` and read the score
|
||||
|
||||
The entity_id → DocAddress mapping needs to be maintained in `TextIndex`. After each commit, the reader reloads and we can rebuild the mapping by scanning the fast field.
|
||||
|
||||
```rust
|
||||
/// Scores a pre-sorted candidate set via BM25 (seek-based).
|
||||
///
|
||||
/// For tidalDB's hybrid search: given ANN results as a candidate set,
|
||||
/// get BM25 scores for those specific documents.
|
||||
///
|
||||
/// `candidates` must be sorted by (segment_ord, doc_id) ascending for seek to work.
|
||||
pub fn score_candidates(
|
||||
searcher: &tantivy::Searcher,
|
||||
query: &dyn tantivy::query::Query,
|
||||
candidates: &[(u32, u32, EntityId)], // (segment_ord, doc_id, entity_id)
|
||||
) -> crate::Result<Vec<(EntityId, f32)>> {
|
||||
use tantivy::query::EnableScoring;
|
||||
|
||||
let weight = query
|
||||
.weight(EnableScoring::enabled_from_searcher(searcher))
|
||||
.map_err(|e| crate::TidalError::Internal(format!("tantivy weight: {e}")))?;
|
||||
|
||||
let mut results = Vec::with_capacity(candidates.len());
|
||||
|
||||
// Group candidates by segment
|
||||
let mut by_segment: HashMap<u32, Vec<(u32, EntityId)>> = HashMap::new();
|
||||
for &(seg_ord, doc_id, entity_id) in candidates {
|
||||
by_segment.entry(seg_ord).or_default().push((doc_id, entity_id));
|
||||
}
|
||||
|
||||
for (seg_ord, mut docs) in by_segment {
|
||||
docs.sort_by_key(|(doc_id, _)| *doc_id); // must be ascending for seek
|
||||
|
||||
let segment_reader = searcher
|
||||
.segment_reader(seg_ord)
|
||||
.ok_or_else(|| crate::TidalError::Internal("segment not found".into()))?;
|
||||
|
||||
let mut scorer = weight
|
||||
.scorer(segment_reader, 1.0)
|
||||
.map_err(|e| crate::TidalError::Internal(format!("tantivy scorer: {e}")))?;
|
||||
|
||||
for (doc_id, entity_id) in docs {
|
||||
use tantivy::DocSet;
|
||||
let reached = scorer.seek(doc_id);
|
||||
if reached == doc_id {
|
||||
results.push((entity_id, scorer.score()));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Ok(results)
|
||||
}
|
||||
```
|
||||
|
||||
### EntityId → DocAddress Mapping
|
||||
|
||||
Maintain a `DashMap<u64, (u32, u32)>` — entity_id → (segment_ord, doc_id). This mapping:
|
||||
- Is rebuilt after each commit (by scanning the fast field across all segments)
|
||||
- Is used by `score_candidates()` to convert EntityId candidates to Tantivy addresses
|
||||
- Can be rebuilt from the index at any time (derived state)
|
||||
|
||||
Add to `TextIndex`:
|
||||
```rust
|
||||
/// Entity ID → (segment_ord, doc_id) mapping.
|
||||
/// Rebuilt after each commit by scanning the entity_id fast field.
|
||||
pub(crate) entity_map: Arc<DashMap<u64, (u32, u32)>>,
|
||||
```
|
||||
|
||||
Add `TextIndex::rebuild_entity_map()` method that scans all segments.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `AllScoresCollector` implements Tantivy's `Collector` trait
|
||||
- [ ] `AllScoresCollector::requires_scoring()` returns `true`
|
||||
- [ ] `AllScoresCollector::for_segment()` creates a segment collector with fast field reader
|
||||
- [ ] `AllScoresSegmentCollector::collect()` reads the `entity_id` fast field value and pushes `(EntityId, f32)` to results
|
||||
- [ ] `AllScoresCollector::merge_fruits()` flattens segment results into a single `Vec<(EntityId, f32)>`
|
||||
- [ ] `score_candidates()` function groups candidates by segment, sorts by doc_id ascending, seeks through posting lists, returns scores for matching documents
|
||||
- [ ] `TextIndex::entity_map: Arc<DashMap<u64, (u32, u32)>>` maintained
|
||||
- [ ] `TextIndex::rebuild_entity_map()` scans all segments and populates the entity map
|
||||
- [ ] `rebuild_entity_map()` called after each commit in the syncer
|
||||
- [ ] Unit tests: `all_scores_collector_captures_bm25`, `all_scores_requires_scoring`, `score_candidates_seek_based`, `entity_map_rebuilds_after_commit`, `missing_candidate_skipped`
|
||||
- [ ] Property test: for any set of indexed documents, `AllScoresCollector` returns exactly the matching documents with positive scores
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Test Strategy
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn all_scores_collector_captures_bm25() {
|
||||
let fields = vec![TextFieldDef { key: "title".into(), field_type: TextFieldType::Text }];
|
||||
let idx = TextIndex::ephemeral(&fields).unwrap();
|
||||
|
||||
// Index 3 documents
|
||||
let mut w = idx.writer_guard().unwrap();
|
||||
for (i, title) in [(1u64, "jazz piano"), (2u64, "rock guitar"), (3u64, "jazz violin")] {
|
||||
let mut m = HashMap::new();
|
||||
m.insert("title".into(), title.into());
|
||||
w.index_item(EntityId::new(i), &m).unwrap();
|
||||
}
|
||||
w.commit(3).unwrap();
|
||||
|
||||
// Force reader reload
|
||||
idx.reader.reload().unwrap();
|
||||
let searcher = idx.reader.searcher();
|
||||
|
||||
// Search for "jazz"
|
||||
let qp = tantivy::query::QueryParser::for_index(&idx.index, vec![idx.fields().text_fields[0].1]);
|
||||
let query = qp.parse_query("jazz").unwrap();
|
||||
|
||||
let collector = AllScoresCollector { entity_id_field: idx.fields().entity_id };
|
||||
let results = searcher.search(&query, &collector).unwrap();
|
||||
|
||||
// Should find entities 1 and 3 (jazz), not 2 (rock)
|
||||
let ids: Vec<u64> = results.iter().map(|(id, _)| id.get()).collect();
|
||||
assert!(ids.contains(&1));
|
||||
assert!(ids.contains(&3));
|
||||
assert!(!ids.contains(&2));
|
||||
|
||||
// All BM25 scores should be positive
|
||||
for (_, score) in &results {
|
||||
assert!(*score > 0.0);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn score_candidates_seek_based() {
|
||||
// Index 5 docs, rebuild entity_map, score only candidates [1, 3] via seek
|
||||
}
|
||||
```
|
||||
@ -0,0 +1,266 @@
|
||||
# Task 05: Boolean Query Parsing
|
||||
|
||||
## Delivers
|
||||
|
||||
`TextQueryParser` — a wrapper over Tantivy's `QueryParser` with custom syntax extensions. Handles: AND/OR/NOT operators, exact phrase (`"..."`), field-scoped (`title:jazz`, `tag:tutorial`), exclusion (`-beginner`), wildcard prefix (`pian*`), hashtag (`#jazz`).
|
||||
|
||||
## Complexity: M
|
||||
|
||||
## Dependencies
|
||||
|
||||
- Task 01 complete: `TextIndex`, `TantivyFields` with field names
|
||||
- Task 02 complete: documents indexed
|
||||
|
||||
## Technical Design
|
||||
|
||||
Tantivy's built-in `QueryParser` already handles most of the required syntax. tidalDB's `TextQueryParser` wraps it and adds:
|
||||
1. Pre-processing of `#jazz` → `jazz` (hashtag syntax → bare term)
|
||||
2. Pre-processing of `creator:handle` → field-scoped query on creator field
|
||||
3. Validation and error messages appropriate for tidalDB's API
|
||||
|
||||
From `docs/research/tantivy.md`:
|
||||
> QueryParser handles: bare terms, exact phrase, boolean AND/OR/NOT, field-scoped, exclusion (-term), wildcard prefix (pian*)
|
||||
|
||||
```rust
|
||||
// tidal/src/text/query.rs
|
||||
|
||||
use tantivy::query::Query;
|
||||
use tantivy::schema::Field;
|
||||
use crate::schema::TextFieldDef;
|
||||
use crate::TidalError;
|
||||
|
||||
/// Parser for text search queries.
|
||||
///
|
||||
/// Wraps Tantivy's QueryParser with tidalDB-specific syntax extensions:
|
||||
/// - `#jazz` → bare term `jazz` (hashtag pre-processing)
|
||||
/// - `creator:handle` → field-scoped query if `creator` field is declared
|
||||
/// - All other Tantivy query syntax passes through unchanged
|
||||
pub struct TextQueryParser {
|
||||
inner: tantivy::query::QueryParser,
|
||||
default_fields: Vec<Field>,
|
||||
}
|
||||
|
||||
impl TextQueryParser {
|
||||
/// Create a parser that searches across all `Text`-type declared fields by default.
|
||||
///
|
||||
/// `Keyword` fields require explicit field scoping (`field:value`).
|
||||
pub fn new(
|
||||
index: &tantivy::Index,
|
||||
text_fields: &[(String, Field, crate::schema::TextFieldType)],
|
||||
) -> Self {
|
||||
use crate::schema::TextFieldType;
|
||||
|
||||
// Default search fields are Text-type only (tokenized)
|
||||
let default_fields: Vec<Field> = text_fields
|
||||
.iter()
|
||||
.filter(|(_, _, ft)| *ft == TextFieldType::Text)
|
||||
.map(|(_, f, _)| *f)
|
||||
.collect();
|
||||
|
||||
let inner = tantivy::query::QueryParser::for_index(index, default_fields.clone());
|
||||
Self { inner, default_fields }
|
||||
}
|
||||
|
||||
/// Parse a query string into a Tantivy `Query`.
|
||||
///
|
||||
/// Applies tidalDB pre-processing before passing to Tantivy's parser.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns `TidalError::Query` if the query string is syntactically invalid.
|
||||
pub fn parse(&self, query_str: &str) -> crate::Result<Box<dyn Query>> {
|
||||
let preprocessed = preprocess_query(query_str);
|
||||
self.inner
|
||||
.parse_query(&preprocessed)
|
||||
.map_err(|e| TidalError::Query(crate::query::retrieve::QueryError::ParseError(
|
||||
format!("text query parse error: {e}")
|
||||
)))
|
||||
}
|
||||
}
|
||||
|
||||
/// Pre-process a tidalDB query string before passing to Tantivy's QueryParser.
|
||||
///
|
||||
/// Transformations:
|
||||
/// - `#jazz` → `jazz` (hashtag syntax)
|
||||
/// - Other syntax passes through to Tantivy's parser
|
||||
fn preprocess_query(query: &str) -> String {
|
||||
// Replace #word with word (remove hashtag prefix)
|
||||
let mut result = String::with_capacity(query.len());
|
||||
let mut chars = query.chars().peekable();
|
||||
|
||||
while let Some(ch) = chars.next() {
|
||||
if ch == '#' {
|
||||
// Check if followed by an alphanumeric char (valid hashtag)
|
||||
if chars.peek().map(|c| c.is_alphanumeric()).unwrap_or(false) {
|
||||
// Skip the '#' — the following word is the term
|
||||
continue;
|
||||
} else {
|
||||
// Not a valid hashtag, pass through
|
||||
result.push(ch);
|
||||
}
|
||||
} else {
|
||||
result.push(ch);
|
||||
}
|
||||
}
|
||||
|
||||
result
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn preprocess_removes_hashtag() {
|
||||
assert_eq!(preprocess_query("#jazz"), "jazz");
|
||||
assert_eq!(preprocess_query("#jazz #piano"), "jazz piano");
|
||||
assert_eq!(preprocess_query("jazz #piano"), "jazz piano");
|
||||
assert_eq!(preprocess_query("no-hashtag"), "no-hashtag");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_bare_terms() {
|
||||
// "jazz piano" → boolean OR of jazz and piano (Tantivy default)
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_exact_phrase() {
|
||||
// "\"jazz piano\"" → PhraseQuery
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_boolean_and() {
|
||||
// "jazz AND piano" → BooleanQuery with must clauses
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_boolean_not() {
|
||||
// "jazz -beginner" or "jazz NOT beginner" → excludes beginner
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_field_scoped() {
|
||||
// "title:jazz" → scopes query to title field
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_wildcard_prefix() {
|
||||
// "pian*" → PrefixQuery matching piano, pianist, etc.
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_hashtag() {
|
||||
// "#jazz" → same result as "jazz"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Integration with TextIndex
|
||||
|
||||
Add `TextIndex::query_parser()` method:
|
||||
|
||||
```rust
|
||||
impl TextIndex {
|
||||
pub fn query_parser(&self) -> TextQueryParser {
|
||||
TextQueryParser::new(&self.index, &self.fields.text_fields)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Wildcard prefix note
|
||||
|
||||
Tantivy's `QueryParser` supports wildcard prefix queries (`pian*`) when the field uses `Indexing::positions()`. By default `TEXT` fields include positions — so prefix queries work out of the box.
|
||||
|
||||
However, Tantivy disables regex and leading-wildcard queries (`*jazz`) by default for performance. tidalDB only needs trailing wildcards (`pian*`), which Tantivy handles via `PrefixQuery`.
|
||||
|
||||
Enable fuzzy queries is deferred to M6. For M5, exact, phrase, boolean, field-scoped, and prefix are sufficient.
|
||||
|
||||
### Boolean operator note
|
||||
|
||||
Tantivy's `QueryParser` uses `OR` as default conjunction. To configure `AND` as default (which is what most users expect for multi-word queries like "rust tutorial"):
|
||||
|
||||
```rust
|
||||
inner.set_conjunction_by_default();
|
||||
```
|
||||
|
||||
This makes `"rust tutorial"` behave as `rust AND tutorial` rather than `rust OR tutorial`, which produces more precise results. tidalDB should enable conjunction by default.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `TextQueryParser` struct with `new(index, text_fields)` and `parse(query_str)` methods
|
||||
- [ ] Default search fields are `Text`-type only (not `Keyword`)
|
||||
- [ ] `#jazz` pre-processed to `jazz` before parsing
|
||||
- [ ] Bare terms: `rust tutorial` → conjunction of `rust` AND `tutorial` (default conjunction mode)
|
||||
- [ ] Exact phrase: `"exact phrase"` → `PhraseQuery` matching contiguous sequence
|
||||
- [ ] Boolean AND: `jazz AND piano` → `BooleanQuery` with two must clauses
|
||||
- [ ] Boolean OR: `jazz OR rock` → `BooleanQuery` with should clauses
|
||||
- [ ] Boolean NOT / exclusion: `jazz -beginner` → excludes items with "beginner"
|
||||
- [ ] Field-scoped: `title:jazz` → queries only the `title` field
|
||||
- [ ] Wildcard prefix: `pian*` → matches "piano", "pianist", etc.
|
||||
- [ ] Hashtag: `#jazz` → same results as bare `jazz`
|
||||
- [ ] Invalid query string returns `TidalError::Query` with descriptive message
|
||||
- [ ] `TextIndex::query_parser()` returns a `TextQueryParser` configured for the index
|
||||
- [ ] Unit tests: all syntax types above with assertions on query type returned
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Full Integration Test (BM25 search end-to-end)
|
||||
|
||||
After tasks 01-05 complete, add an integration test in `tidal/tests/text_index.rs`:
|
||||
|
||||
```rust
|
||||
/// Validates the full m5p1 text index pipeline:
|
||||
/// index → write → commit → search → score
|
||||
#[test]
|
||||
fn text_index_end_to_end() {
|
||||
let fields = vec![
|
||||
TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
|
||||
TextFieldDef { key: "description".into(), field_type: TextFieldType::Text },
|
||||
TextFieldDef { key: "category".into(), field_type: TextFieldType::Keyword },
|
||||
];
|
||||
|
||||
let idx = TextIndex::ephemeral(&fields).unwrap();
|
||||
|
||||
// Write 100 items
|
||||
let mut w = idx.writer_guard().unwrap();
|
||||
for i in 0..100u64 {
|
||||
let mut meta = HashMap::new();
|
||||
meta.insert("title".into(), format!("Rust tutorial {i}"));
|
||||
meta.insert("description".into(), "Learn Rust programming".into());
|
||||
meta.insert("category".into(), "programming".into());
|
||||
w.index_item(EntityId::new(i), &meta).unwrap();
|
||||
}
|
||||
w.commit(100).unwrap();
|
||||
drop(w);
|
||||
|
||||
idx.reader.reload().unwrap();
|
||||
let searcher = idx.reader.searcher();
|
||||
let parser = idx.query_parser();
|
||||
|
||||
// Test 1: bare terms
|
||||
let q = parser.parse("Rust tutorial").unwrap();
|
||||
let collector = AllScoresCollector { entity_id_field: idx.fields().entity_id };
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
assert!(!results.is_empty());
|
||||
|
||||
// Test 2: exact phrase
|
||||
let q = parser.parse("\"Rust programming\"").unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
assert!(!results.is_empty()); // matches description
|
||||
|
||||
// Test 3: field-scoped keyword
|
||||
let q = parser.parse("category:programming").unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
assert_eq!(results.len(), 100);
|
||||
|
||||
// Test 4: exclusion
|
||||
let q = parser.parse("Rust -tutorial").unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
// "Rust programming" description matches "Rust" but not "tutorial"
|
||||
assert!(!results.is_empty());
|
||||
|
||||
// Test 5: BM25 latency < 10ms at 100 docs (trivial at this scale)
|
||||
let start = std::time::Instant::now();
|
||||
let q = parser.parse("Rust").unwrap();
|
||||
let _ = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
assert!(start.elapsed().as_millis() < 10);
|
||||
}
|
||||
```
|
||||
60
docs/planning/milestone-5/phase-2/OVERVIEW.md
Normal file
60
docs/planning/milestone-5/phase-2/OVERVIEW.md
Normal file
@ -0,0 +1,60 @@
|
||||
# m5p2: Hybrid Fusion (RRF)
|
||||
|
||||
## Delivers
|
||||
|
||||
Reciprocal Rank Fusion combining BM25 ranked lists with ANN ranked lists into a single scored result set. The starting point is RRF with k=60; the architecture supports upgrading to tuned linear combination when relevance labels exist. Handles the three retrieval modes: text-only, vector-only, and hybrid. A `RetrievalMode` enum and `route_results()` function encapsulate the decision logic that the m5p3 `SearchExecutor` will call.
|
||||
|
||||
## Dependencies
|
||||
|
||||
- m5p1 COMPLETE: `TextIndex`, `AllScoresCollector`, `TextQueryParser` — BM25 search that returns `Vec<(EntityId, f32)>`
|
||||
- m2p1 COMPLETE: `VectorIndex` trait, `VectorSearchResult { id: VectorId, distance: f32 }`, `EmbeddingSlotRegistry` — ANN search
|
||||
|
||||
## Research References
|
||||
|
||||
- `docs/research/tantivy.md` — Section "Start with Reciprocal Rank Fusion": RRF formula, k=60, Cormack et al. SIGIR 2009, production system comparison, upgrade path to linear combination
|
||||
|
||||
## Acceptance Criteria (Phase Level)
|
||||
|
||||
- [ ] `HybridFusion` struct with `k: u32` field (default 60) in `tidal/src/query/fusion.rs`
|
||||
- [ ] `HybridFusion::fuse(bm25_results: &[(EntityId, f32)], ann_results: &[(EntityId, f32)], k: u32) -> Vec<(EntityId, f64)>` implements RRF
|
||||
- [ ] RRF formula: `score(d) = 1.0 / (k + rank_bm25(d)) + 1.0 / (k + rank_ann(d))`, ranks are 1-based (rank 1 = best)
|
||||
- [ ] Documents in only one list contribute only their single-list term; the missing-list term is zero
|
||||
- [ ] Results sorted descending by fused score
|
||||
- [ ] `RetrievalMode` enum: `TextOnly`, `VectorOnly`, `Hybrid`
|
||||
- [ ] `RetrievalMode::determine(has_text: bool, has_vector: bool) -> Option<RetrievalMode>` returns the correct mode
|
||||
- [ ] `route_results()` converts single-mode results to `Vec<(EntityId, f64)>` and calls `HybridFusion::fuse()` for hybrid
|
||||
- [ ] Pure BM25 path: results passed through as `Vec<(EntityId, f64)>` without fusion overhead
|
||||
- [ ] Pure ANN path: `VectorSearchResult` list converted to `Vec<(EntityId, f64)>` (score = 1.0 / (k + rank))
|
||||
- [ ] `k` parameter configurable; default 60
|
||||
- [ ] Fusion adds < 1ms to query time for 1000 candidates from each list (Criterion benchmarked)
|
||||
- [ ] Property test: for any pair of ranked lists, RRF output is the union of both input document sets; scores computed correctly to 6 decimal places
|
||||
|
||||
## Task Execution Order
|
||||
|
||||
```
|
||||
task-01 (RRF Implementation)
|
||||
|
|
||||
v
|
||||
task-02 (Retrieval Mode Router)
|
||||
```
|
||||
|
||||
Both tasks are sequential. Task 02 depends on `HybridFusion` from Task 01.
|
||||
|
||||
## Module Location
|
||||
|
||||
New module: `tidal/src/query/fusion.rs`
|
||||
|
||||
- `HybridFusion` — RRF computation struct
|
||||
- `RetrievalMode` — enum for text-only / vector-only / hybrid
|
||||
- `route_results()` — routes pre-retrieved result lists through the appropriate path
|
||||
|
||||
Add `pub mod fusion;` to `tidal/src/query/mod.rs`.
|
||||
|
||||
## Notes
|
||||
|
||||
- RRF uses **rank position** only — the input `f32` scores are used only for ordering, not for the fusion formula itself
|
||||
- BM25 results: `(EntityId, f32)` where **higher score = better** → sort descending, rank 1 = index 0
|
||||
- ANN results: `VectorSearchResult { id, distance }` where **lower distance = better** → sort ascending, rank 1 = index 0
|
||||
- For the ANN-only path, convert `VectorSearchResult { id, distance }` to `(EntityId, score)` where `score = 1.0 / (k + rank)` to produce a consistent `f64` output
|
||||
- The `rrf` crate exists on crates.io but we implement from scratch to avoid a dependency and maintain full control over the algorithm
|
||||
- No unsafe code — pure indexing arithmetic
|
||||
281
docs/planning/milestone-5/phase-2/task-01-rrf-implementation.md
Normal file
281
docs/planning/milestone-5/phase-2/task-01-rrf-implementation.md
Normal file
@ -0,0 +1,281 @@
|
||||
# Task 01: RRF Implementation
|
||||
|
||||
## Delivers
|
||||
|
||||
`HybridFusion` struct implementing Reciprocal Rank Fusion. `fuse()` merges a BM25 ranked list and an ANN ranked list into a single `Vec<(EntityId, f64)>` sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. `k = 60` by default, configurable.
|
||||
|
||||
## Complexity: S
|
||||
|
||||
## Dependencies
|
||||
|
||||
- m5p1 COMPLETE: `EntityId` type, `tidal/src/query/` module structure
|
||||
- `tidal/src/query/mod.rs` exists for adding the `fusion` submodule
|
||||
|
||||
## Technical Design
|
||||
|
||||
### RRF Formula
|
||||
|
||||
From `docs/research/tantivy.md`:
|
||||
|
||||
> RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))
|
||||
|
||||
Where:
|
||||
- `rank_bm25(d)` is the 1-based rank of document `d` in the BM25 list (rank 1 = highest BM25 score)
|
||||
- `rank_ann(d)` is the 1-based rank of document `d` in the ANN list (rank 1 = lowest L2 distance)
|
||||
- Documents absent from a list contribute zero for that term
|
||||
|
||||
The k=60 constant is insensitive across 30–100 range. We implement it as configurable, defaulting to 60.
|
||||
|
||||
### Input Conventions
|
||||
|
||||
- **BM25 results**: `&[(EntityId, f32)]` where the f32 is the BM25 score. **The caller must pre-sort these descending by score.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
|
||||
- **ANN results**: `&[(EntityId, f32)]` where the f32 is the L2-squared distance. **The caller must pre-sort these ascending by distance.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
|
||||
|
||||
This design keeps `fuse()` a pure function with no sorting overhead. The caller controls ordering.
|
||||
|
||||
### HybridFusion
|
||||
|
||||
```rust
|
||||
// tidal/src/query/fusion.rs
|
||||
|
||||
use std::collections::HashMap;
|
||||
use crate::schema::EntityId;
|
||||
|
||||
/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
|
||||
///
|
||||
/// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores,
|
||||
/// ANN vector distances) into a single ranked list using only rank positions.
|
||||
///
|
||||
/// The k=60 constant is insensitive across [30, 100] — see the research
|
||||
/// literature. A configurable k is provided for experimentation.
|
||||
///
|
||||
/// # Reference
|
||||
///
|
||||
/// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet
|
||||
/// and Individual Rank Learning Methods." SIGIR 2009.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct HybridFusion {
|
||||
/// Rank offset constant. Default 60 per the original paper.
|
||||
pub k: u32,
|
||||
}
|
||||
|
||||
impl Default for HybridFusion {
|
||||
fn default() -> Self {
|
||||
Self { k: 60 }
|
||||
}
|
||||
}
|
||||
|
||||
impl HybridFusion {
|
||||
/// Create a fusion instance with the default k=60.
|
||||
#[must_use]
|
||||
pub fn new() -> Self {
|
||||
Self::default()
|
||||
}
|
||||
|
||||
/// Create a fusion instance with a custom k.
|
||||
#[must_use]
|
||||
pub fn with_k(k: u32) -> Self {
|
||||
Self { k }
|
||||
}
|
||||
|
||||
/// Fuse two ranked lists via Reciprocal Rank Fusion.
|
||||
///
|
||||
/// Both lists must be pre-sorted in "best first" order by the caller:
|
||||
/// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
|
||||
/// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
|
||||
///
|
||||
/// The f32 value in each tuple is used only to establish the ordering by
|
||||
/// the caller; `fuse()` itself uses only position (0-indexed) as the rank.
|
||||
///
|
||||
/// Documents appearing in only one list contribute only their single-list
|
||||
/// term. The missing-list contribution is zero.
|
||||
///
|
||||
/// Returns results sorted by descending fused RRF score.
|
||||
#[must_use]
|
||||
pub fn fuse(
|
||||
&self,
|
||||
bm25_results: &[(EntityId, f32)],
|
||||
ann_results: &[(EntityId, f32)],
|
||||
) -> Vec<(EntityId, f64)> {
|
||||
let k = f64::from(self.k);
|
||||
|
||||
// Map entity_id -> accumulated RRF score
|
||||
let capacity = bm25_results.len() + ann_results.len();
|
||||
let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);
|
||||
|
||||
// BM25 contribution: rank is 1-based (position 0 → rank 1)
|
||||
for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
|
||||
let rank = (rank_0based + 1) as f64;
|
||||
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
|
||||
}
|
||||
|
||||
// ANN contribution: rank is 1-based (position 0 → rank 1)
|
||||
for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
|
||||
let rank = (rank_0based + 1) as f64;
|
||||
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
|
||||
}
|
||||
|
||||
// Collect and sort descending by fused score
|
||||
let mut results: Vec<(EntityId, f64)> = scores
|
||||
.into_iter()
|
||||
.map(|(id, score)| (EntityId::new(id), score))
|
||||
.collect();
|
||||
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
|
||||
|
||||
results
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Registration in query/mod.rs
|
||||
|
||||
Add to `tidal/src/query/mod.rs`:
|
||||
```rust
|
||||
pub mod fusion;
|
||||
pub use fusion::HybridFusion;
|
||||
```
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `tidal/src/query/fusion.rs` created with `HybridFusion` struct
|
||||
- [ ] `HybridFusion { k: u32 }` with `Default` trait (k=60)
|
||||
- [ ] `HybridFusion::new()` returns default k=60 instance
|
||||
- [ ] `HybridFusion::with_k(k)` returns configured instance
|
||||
- [ ] `fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)>` implements RRF
|
||||
- [ ] BM25 rank: position 0 → rank 1, position N-1 → rank N
|
||||
- [ ] ANN rank: position 0 → rank 1, position N-1 → rank N
|
||||
- [ ] Documents in only one list: single-list contribution only (missing term = 0)
|
||||
- [ ] Results sorted descending by fused score
|
||||
- [ ] `pub mod fusion;` added to `tidal/src/query/mod.rs`
|
||||
- [ ] `pub use fusion::HybridFusion;` exported from `tidal/src/query/mod.rs`
|
||||
- [ ] Unit tests: `fuse_both_lists`, `fuse_bm25_only`, `fuse_ann_only`, `fuse_empty_lists`, `fuse_single_doc_both_lists`, `fuse_k_affects_scores`
|
||||
- [ ] Property test: for any pair of ranked lists, fused output is union of both document sets; score for doc in both lists > score for doc in only one list
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Test Strategy
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn fuse_both_lists() {
|
||||
// BM25: [A=1.0, B=0.8, C=0.5] (descending)
|
||||
// ANN: [B=0.1, A=0.2, D=0.5] (ascending distance)
|
||||
// Expected: B ranks highest (rank 1 in ANN, rank 2 in BM25)
|
||||
// A is rank 1 in BM25, rank 2 in ANN
|
||||
let bm25 = vec![
|
||||
(EntityId::new(1), 1.0f32), // A, rank 1
|
||||
(EntityId::new(2), 0.8f32), // B, rank 2
|
||||
(EntityId::new(3), 0.5f32), // C, rank 3 (BM25 only)
|
||||
];
|
||||
let ann = vec![
|
||||
(EntityId::new(2), 0.1f32), // B, rank 1 (best ANN match)
|
||||
(EntityId::new(1), 0.2f32), // A, rank 2
|
||||
(EntityId::new(4), 0.5f32), // D, rank 3 (ANN only)
|
||||
];
|
||||
|
||||
let fusion = HybridFusion::new(); // k=60
|
||||
let results = fusion.fuse(&bm25, &ann);
|
||||
|
||||
// B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.01613 + 0.01639 = 0.03252
|
||||
// A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.03252 (same as B — tie!)
|
||||
// Actually: B is rank 2 in BM25, rank 1 in ANN; A is rank 1 in BM25, rank 2 in ANN
|
||||
// B: 1/(60+2) + 1/(60+1) = 0.03252
|
||||
// A: 1/(60+1) + 1/(60+2) = 0.03252 (same score — tie)
|
||||
// C: 1/(60+3) + 0 = 1/63 ≈ 0.01587
|
||||
// D: 0 + 1/(60+3) = 1/63 ≈ 0.01587
|
||||
|
||||
// Verify all 4 documents are in the output
|
||||
let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
|
||||
assert!(ids.contains(&1)); // A
|
||||
assert!(ids.contains(&2)); // B
|
||||
assert!(ids.contains(&3)); // C
|
||||
assert!(ids.contains(&4)); // D
|
||||
|
||||
// C and D (single-list) have lower scores than A and B (both lists)
|
||||
let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1;
|
||||
let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1;
|
||||
let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
|
||||
let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
|
||||
assert!(a_score > c_score);
|
||||
assert!(b_score > d_score);
|
||||
|
||||
// Scores are sorted descending
|
||||
let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
|
||||
for i in 1..scores.len() {
|
||||
assert!(scores[i-1] >= scores[i]);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_bm25_only() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
|
||||
let ann = vec![];
|
||||
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&bm25, &ann);
|
||||
|
||||
assert_eq!(results.len(), 2);
|
||||
// rank 1 doc scores higher than rank 2
|
||||
let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
|
||||
let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
|
||||
assert!(score_1 > score_2);
|
||||
// Score = 1/(60+1) for rank 1 = 1/61
|
||||
let expected = 1.0 / (60.0 + 1.0);
|
||||
assert!((score_1 - expected).abs() < 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_k_affects_scores() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32)];
|
||||
let ann = vec![(EntityId::new(1), 0.1f32)];
|
||||
|
||||
let fusion_60 = HybridFusion::new(); // k=60
|
||||
let fusion_30 = HybridFusion::with_k(30); // k=30
|
||||
|
||||
let results_60 = fusion_60.fuse(&bm25, &ann);
|
||||
let results_30 = fusion_30.fuse(&bm25, &ann);
|
||||
|
||||
// k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645
|
||||
// k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328
|
||||
// k=30 produces higher scores
|
||||
assert!(results_30[0].1 > results_60[0].1);
|
||||
}
|
||||
```
|
||||
|
||||
### Property Test
|
||||
|
||||
```rust
|
||||
use proptest::prelude::*;
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn rrf_output_is_union_of_inputs(
|
||||
bm25_ids in prop::collection::vec(1u64..=100, 0..20),
|
||||
ann_ids in prop::collection::vec(1u64..=100, 0..20),
|
||||
) {
|
||||
let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
|
||||
.map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
|
||||
.collect();
|
||||
let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
|
||||
.map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
|
||||
.collect();
|
||||
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&bm25, &ann);
|
||||
|
||||
// Output must contain the union of all input IDs
|
||||
let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
|
||||
.chain(ann_ids.iter())
|
||||
.copied()
|
||||
.collect();
|
||||
let result_ids: std::collections::HashSet<u64> = results.iter()
|
||||
.map(|(id, _)| id.as_u64())
|
||||
.collect();
|
||||
prop_assert_eq!(&all_ids, &result_ids);
|
||||
|
||||
// Results must be sorted descending
|
||||
for i in 1..results.len() {
|
||||
prop_assert!(results[i-1].1 >= results[i].1);
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@ -0,0 +1,236 @@
|
||||
# Task 02: Retrieval Mode Router
|
||||
|
||||
## Delivers
|
||||
|
||||
`RetrievalMode` enum and `route_results()` function. `RetrievalMode::determine()` selects text-only, vector-only, or hybrid based on what's present in the query. `route_results()` converts pre-retrieved result lists through the appropriate path — direct passthrough for single-mode, `HybridFusion::fuse()` for hybrid. Criterion benchmark confirming fusion adds < 1ms at 1000 candidates per list.
|
||||
|
||||
## Complexity: S
|
||||
|
||||
## Dependencies
|
||||
|
||||
- Task 01 COMPLETE: `HybridFusion` with `fuse()` method in `tidal/src/query/fusion.rs`
|
||||
- m5p1 COMPLETE: `EntityId` type
|
||||
- m2p1 COMPLETE: `VectorSearchResult { id: VectorId, distance: f32 }` in `tidal/src/storage/vector/`
|
||||
|
||||
## Technical Design
|
||||
|
||||
### RetrievalMode
|
||||
|
||||
```rust
|
||||
// tidal/src/query/fusion.rs (additions)
|
||||
|
||||
/// Which retrieval system(s) to use for a search query.
|
||||
///
|
||||
/// Determined by what the query provides:
|
||||
/// - `TextOnly` — only `query_text` is present
|
||||
/// - `VectorOnly` — only `query_vector` is present
|
||||
/// - `Hybrid` — both `query_text` and `query_vector` are present
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub enum RetrievalMode {
|
||||
/// Execute BM25 text search only.
|
||||
TextOnly,
|
||||
/// Execute ANN vector search only.
|
||||
VectorOnly,
|
||||
/// Execute both and fuse results via RRF.
|
||||
Hybrid,
|
||||
}
|
||||
|
||||
impl RetrievalMode {
|
||||
/// Determine the retrieval mode from query contents.
|
||||
///
|
||||
/// Returns `None` if neither text nor vector is provided (invalid query).
|
||||
#[must_use]
|
||||
pub fn determine(has_text: bool, has_vector: bool) -> Option<Self> {
|
||||
match (has_text, has_vector) {
|
||||
(true, false) => Some(Self::TextOnly),
|
||||
(false, true) => Some(Self::VectorOnly),
|
||||
(true, true) => Some(Self::Hybrid),
|
||||
(false, false) => None,
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### route_results()
|
||||
|
||||
```rust
|
||||
/// Route pre-retrieved result lists through the appropriate fusion path.
|
||||
///
|
||||
/// - `TextOnly`: converts BM25 scores to `f64` and returns them sorted descending.
|
||||
/// - `VectorOnly`: converts ANN distance → rank-based score and returns sorted descending.
|
||||
/// - `Hybrid`: calls `HybridFusion::fuse()` and returns the fused result.
|
||||
///
|
||||
/// # Inputs
|
||||
///
|
||||
/// - `bm25_results`: `(EntityId, f32)` where f32 is BM25 score, **pre-sorted descending**.
|
||||
/// - `ann_results`: `(EntityId, f32)` where f32 is L2-squared distance, **pre-sorted ascending**.
|
||||
/// - Both slices may be empty; callers pass `&[]` for unused modes.
|
||||
///
|
||||
/// # Returns
|
||||
///
|
||||
/// `Vec<(EntityId, f64)>` sorted descending by score. For `TextOnly` and `VectorOnly`,
|
||||
/// scores are normalized to `[0, 1]` relative to the top candidate (score 1.0).
|
||||
/// For `Hybrid`, scores are raw RRF values (typically 0.01–0.04 for k=60).
|
||||
pub fn route_results(
|
||||
mode: RetrievalMode,
|
||||
bm25_results: &[(EntityId, f32)],
|
||||
ann_results: &[(EntityId, f32)],
|
||||
fusion: &HybridFusion,
|
||||
) -> Vec<(EntityId, f64)> {
|
||||
match mode {
|
||||
RetrievalMode::TextOnly => {
|
||||
// Convert f32 BM25 scores to f64; already sorted descending by caller.
|
||||
bm25_results
|
||||
.iter()
|
||||
.map(|(id, score)| (*id, f64::from(*score)))
|
||||
.collect()
|
||||
}
|
||||
RetrievalMode::VectorOnly => {
|
||||
// Convert rank-position to a score using the same RRF formula for
|
||||
// consistency: score = 1.0 / (k + rank). This gives ANN-only results
|
||||
// the same score range as hybrid results.
|
||||
let k = f64::from(fusion.k);
|
||||
ann_results
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (id, _distance))| {
|
||||
let rank = (i + 1) as f64;
|
||||
(*id, 1.0 / (k + rank))
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
RetrievalMode::Hybrid => fusion.fuse(bm25_results, ann_results),
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### ann_to_ranked()
|
||||
|
||||
A helper to convert `Vec<VectorSearchResult>` (returned by `VectorIndex::search()`) to `Vec<(EntityId, f32)>` suitable as input to `fuse()` or `route_results()`:
|
||||
|
||||
```rust
|
||||
use crate::storage::vector::VectorSearchResult;
|
||||
|
||||
/// Convert ANN search results to a ranked list for fusion input.
|
||||
///
|
||||
/// `VectorSearchResult` is already sorted ascending by distance (best first).
|
||||
/// This function maps it to `(EntityId, f32)` where the f32 is the raw L2 distance.
|
||||
/// The caller passes this to `fuse()` or `route_results()` which uses position-as-rank.
|
||||
#[must_use]
|
||||
pub fn ann_to_ranked(ann_results: &[VectorSearchResult]) -> Vec<(EntityId, f32)> {
|
||||
ann_results
|
||||
.iter()
|
||||
.map(|r| (EntityId::new(r.id), r.distance))
|
||||
.collect()
|
||||
}
|
||||
```
|
||||
|
||||
### Module Integration
|
||||
|
||||
Add to `tidal/src/query/mod.rs`:
|
||||
```rust
|
||||
pub use fusion::{HybridFusion, RetrievalMode, ann_to_ranked, route_results};
|
||||
```
|
||||
|
||||
### Criterion Benchmark
|
||||
|
||||
```rust
|
||||
// tidal/benches/fusion.rs
|
||||
|
||||
fn bench_rrf_1k_per_list(c: &mut Criterion) {
|
||||
// 1000 BM25 results
|
||||
let bm25: Vec<(EntityId, f32)> = (0u64..1000)
|
||||
.map(|i| (EntityId::new(i), (1000 - i) as f32))
|
||||
.collect();
|
||||
// 1000 ANN results, 50% overlap with BM25
|
||||
let ann: Vec<(EntityId, f32)> = (500u64..1500)
|
||||
.enumerate()
|
||||
.map(|(i, id)| (EntityId::new(id), i as f32 * 0.001))
|
||||
.collect();
|
||||
|
||||
let fusion = HybridFusion::new();
|
||||
|
||||
c.bench_function("rrf_fuse_1k_per_list", |b| {
|
||||
b.iter(|| {
|
||||
let results = fusion.fuse(black_box(&bm25), black_box(&ann));
|
||||
black_box(results)
|
||||
});
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] `RetrievalMode` enum with `TextOnly`, `VectorOnly`, `Hybrid` variants in `fusion.rs`
|
||||
- [ ] `RetrievalMode::determine(has_text, has_vector) -> Option<RetrievalMode>` returns correct variant
|
||||
- [ ] `determine(false, false)` returns `None`
|
||||
- [ ] `route_results(mode, bm25, ann, fusion) -> Vec<(EntityId, f64)>` implemented
|
||||
- [ ] `TextOnly` path: BM25 scores converted to f64, list preserved
|
||||
- [ ] `VectorOnly` path: ANN results converted to rank-based scores via `1.0 / (k + rank)`
|
||||
- [ ] `Hybrid` path: calls `HybridFusion::fuse()` and returns result
|
||||
- [ ] `ann_to_ranked(ann_results: &[VectorSearchResult]) -> Vec<(EntityId, f32)>` helper
|
||||
- [ ] `RetrievalMode`, `route_results`, `ann_to_ranked` exported from `tidal/src/query/mod.rs`
|
||||
- [ ] `tidal/benches/fusion.rs` created with Criterion benchmark `rrf_fuse_1k_per_list`
|
||||
- [ ] Benchmark result confirms fusion < 1ms for 1000 candidates per list
|
||||
- [ ] `[[bench]] name = "fusion" harness = false` added to `tidal/Cargo.toml`
|
||||
- [ ] Unit tests: `determine_text_only`, `determine_vector_only`, `determine_hybrid`, `determine_none`, `route_text_only_passthrough`, `route_vector_only_rank_based`, `route_hybrid_calls_fuse`, `ann_to_ranked_converts_correctly`
|
||||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||||
|
||||
## Test Strategy
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn determine_text_only() {
|
||||
assert_eq!(RetrievalMode::determine(true, false), Some(RetrievalMode::TextOnly));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn determine_hybrid() {
|
||||
assert_eq!(RetrievalMode::determine(true, true), Some(RetrievalMode::Hybrid));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn determine_none() {
|
||||
assert_eq!(RetrievalMode::determine(false, false), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn route_text_only_passthrough() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = route_results(RetrievalMode::TextOnly, &bm25, &[], &fusion);
|
||||
assert_eq!(results.len(), 2);
|
||||
assert!((results[0].1 - 1.0f64).abs() < 1e-6); // f32 → f64 exact
|
||||
assert!((results[1].1 - 0.5f64).abs() < 1e-6);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn route_vector_only_rank_based() {
|
||||
// VectorSearchResult order: rank 1 (index 0) gets score 1/(60+1)
|
||||
let ann = vec![
|
||||
(EntityId::new(1), 0.1f32), // rank 1
|
||||
(EntityId::new(2), 0.2f32), // rank 2
|
||||
];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = route_results(RetrievalMode::VectorOnly, &[], &ann, &fusion);
|
||||
assert_eq!(results.len(), 2);
|
||||
let expected_rank1 = 1.0 / (60.0 + 1.0);
|
||||
let expected_rank2 = 1.0 / (60.0 + 2.0);
|
||||
assert!((results[0].1 - expected_rank1).abs() < 1e-9);
|
||||
assert!((results[1].1 - expected_rank2).abs() < 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ann_to_ranked_converts_correctly() {
|
||||
use crate::storage::vector::VectorSearchResult;
|
||||
let ann_results = vec![
|
||||
VectorSearchResult { id: 42, distance: 0.1 },
|
||||
VectorSearchResult { id: 99, distance: 0.3 },
|
||||
];
|
||||
let ranked = ann_to_ranked(&ann_results);
|
||||
assert_eq!(ranked.len(), 2);
|
||||
assert_eq!(ranked[0].0.as_u64(), 42);
|
||||
assert!((ranked[0].1 - 0.1f32).abs() < 1e-6);
|
||||
assert_eq!(ranked[1].0.as_u64(), 99);
|
||||
}
|
||||
```
|
||||
48
docs/planning/milestone-5/phase-4/OVERVIEW.md
Normal file
48
docs/planning/milestone-5/phase-4/OVERVIEW.md
Normal file
@ -0,0 +1,48 @@
|
||||
# Milestone 5, Phase 4: Creator and People Search
|
||||
|
||||
## Goal
|
||||
|
||||
Prove that the same SEARCH pipeline that indexes items can also index creator entities. After this phase, a developer can call `db.search(&Search { entity_kind: Creator, query: "jazz" })` and receive BM25-ranked creators with optional vector fusion, filters, and sort modes.
|
||||
|
||||
## Motivation
|
||||
|
||||
m5p1–p3 built a complete SEARCH pipeline for items. Creators are a first-class entity in tidalDB with their own storage engine, metadata, and embeddings. Extending the pipeline to creators validates the multi-entity-kind design and unlocks the people-search use case.
|
||||
|
||||
## Tasks
|
||||
|
||||
| Task | Title | Status |
|
||||
|------|-------|--------|
|
||||
| task-01 | Creator Text Indexing | pending |
|
||||
| task-02 | Creator Vector Index | pending |
|
||||
| task-03 | Creator Search Executor | pending |
|
||||
|
||||
## Execution Order
|
||||
|
||||
```
|
||||
task-01 (Creator Text Indexing)
|
||||
|
|
||||
v
|
||||
task-02 (Creator Vector Index)
|
||||
|
|
||||
v
|
||||
task-03 (Creator Search Executor)
|
||||
```
|
||||
|
||||
All tasks are sequential.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
cargo check --manifest-path tidal/Cargo.toml
|
||||
cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings
|
||||
cargo test --manifest-path tidal/Cargo.toml --lib
|
||||
cargo test --manifest-path tidal/Cargo.toml --test m5p4_creator_search
|
||||
cargo bench --manifest-path tidal/Cargo.toml --bench search -- bench_search_creator_text_200
|
||||
```
|
||||
|
||||
**Key assertions:**
|
||||
- `db.search(Search { entity_kind: Creator, query: "jazz" })` returns BM25-ranked creators
|
||||
- `filter(verified = true)` excludes non-verified creators
|
||||
- `similar_to` lookup triggers ANN on `(EntityKind::Creator, "content")` slot
|
||||
- `bench_search_creator_text_200` < 20ms
|
||||
- All existing m5p3 item search tests still pass
|
||||
@ -0,0 +1,51 @@
|
||||
# Task 01: Creator Text Indexing
|
||||
|
||||
## Goal
|
||||
|
||||
Add a separate Tantivy text index for creator entities, parallel to the existing item text index. Creator text fields are declared in the schema via `creator_text_field()`. The background syncer enqueues writes from `write_creator()`.
|
||||
|
||||
## Files to Modify
|
||||
|
||||
- `tidal/src/schema/validation.rs` — add `creator_text_fields` vec to `Schema` and `SchemaBuilder`
|
||||
- `tidal/src/schema/mod.rs` — re-export nothing new (types already exported)
|
||||
- `tidal/src/db/mod.rs` — add creator text index fields, spawn creator syncer, extend `write_creator()`, add `reload_creator_text_index()`
|
||||
|
||||
## Schema Changes
|
||||
|
||||
Add `creator_text_fields: Vec<TextFieldDef>` to both `Schema` and `SchemaBuilder`.
|
||||
|
||||
```rust
|
||||
impl SchemaBuilder {
|
||||
pub fn creator_text_field(&mut self, key: &str, field_type: TextFieldType) -> &mut Self {
|
||||
self.creator_text_fields.push(TextFieldDef { key: key.to_owned(), field_type });
|
||||
self
|
||||
}
|
||||
}
|
||||
|
||||
impl Schema {
|
||||
pub fn creator_text_fields(&self) -> &[TextFieldDef] {
|
||||
&self.creator_text_fields
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## TidalDb Changes
|
||||
|
||||
Add three new fields parallel to `text_index`, `text_tx`, `text_syncer_thread`:
|
||||
|
||||
```rust
|
||||
creator_text_index: Option<Arc<crate::text::TextIndex>>,
|
||||
creator_text_tx: std::sync::Mutex<Option<crossbeam::channel::Sender<crate::text::PendingWrite>>>,
|
||||
creator_text_syncer_thread: std::sync::Mutex<Option<std::thread::JoinHandle<crate::Result<()>>>>,
|
||||
```
|
||||
|
||||
Spawn in `from_parts()` using the same pattern as the item syncer. In `write_creator()`, enqueue to `creator_text_tx` when present.
|
||||
|
||||
Add `reload_creator_text_index()` helper for tests.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- `SchemaBuilder::creator_text_field()` compiles
|
||||
- Writing a creator with matching metadata enqueues to the creator text index
|
||||
- `reload_creator_text_index()` reloads the reader for test synchronization
|
||||
- Existing item text index is unaffected
|
||||
@ -0,0 +1,37 @@
|
||||
# Task 02: Creator Vector Index
|
||||
|
||||
## Goal
|
||||
|
||||
Add `write_creator_embedding()` and `read_creator_embedding()` to `TidalDb`. These register and populate the `(EntityKind::Creator, "content")` slot in the existing `EmbeddingSlotRegistry`.
|
||||
|
||||
## Files to Modify
|
||||
|
||||
- `tidal/src/db/mod.rs` — add `write_creator_embedding()` and `read_creator_embedding()`
|
||||
|
||||
## Implementation
|
||||
|
||||
```rust
|
||||
pub fn write_creator_embedding(&self, id: EntityId, embedding: &[f32]) -> crate::Result<()> {
|
||||
let mut registry = self.embedding_registry.write()...;
|
||||
if registry.get(EntityKind::Creator, "content").is_none() {
|
||||
// auto-register slot
|
||||
let state = EmbeddingSlotState::new(embedding.len(), QuantizationLevel::F32, EmbeddingSource::External);
|
||||
registry.register(EntityKind::Creator, "content".to_string(), state)?;
|
||||
}
|
||||
let slot = registry.get_mut(EntityKind::Creator, "content")...;
|
||||
slot.index.add(id.as_u64(), embedding)?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub fn read_creator_embedding(&self, id: EntityId) -> crate::Result<Option<Vec<f32>>> {
|
||||
let registry = self.embedding_registry.read()...;
|
||||
let slot = match registry.get(EntityKind::Creator, "content") { None => return Ok(None), Some(s) => s };
|
||||
Ok(slot.index.get(id.as_u64()))
|
||||
}
|
||||
```
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- `write_creator_embedding(id, &vec)` succeeds and auto-registers the slot
|
||||
- `read_creator_embedding(id)` returns the stored vector
|
||||
- ANN search on `(EntityKind::Creator, "content")` returns results
|
||||
@ -0,0 +1,65 @@
|
||||
# Task 03: Creator Search Executor
|
||||
|
||||
## Goal
|
||||
|
||||
Extend `SearchExecutor` to route text index and ANN slot based on `query.entity_kind`. When `entity_kind = Creator`, use `creator_text_index` and the `(EntityKind::Creator, "content")` slot.
|
||||
|
||||
## Files to Modify
|
||||
|
||||
- `tidal/src/query/search.rs` — add `creator_text_index` field, routing in `execute()`
|
||||
- `tidal/src/db/mod.rs` — pass creator text index in `search()`
|
||||
- `tidal/tests/m5p4_creator_search.rs` — new integration tests
|
||||
|
||||
## SearchExecutor Changes
|
||||
|
||||
Add `creator_text_index: Option<&'a Arc<crate::text::TextIndex>>` field.
|
||||
|
||||
In `execute()` Stage 1a:
|
||||
```rust
|
||||
let effective_text_index = match query.entity_kind {
|
||||
EntityKind::Creator => self.creator_text_index,
|
||||
_ => self.text_index,
|
||||
};
|
||||
```
|
||||
|
||||
In `execute()` Stage 1b ANN, use `query.entity_kind` instead of hardcoded `EntityKind::Item`:
|
||||
```rust
|
||||
match registry.get(query.entity_kind, "content") { ... }
|
||||
```
|
||||
|
||||
Add builder method:
|
||||
```rust
|
||||
pub fn with_creator_text_index(mut self, idx: &'a Arc<crate::text::TextIndex>) -> Self {
|
||||
self.creator_text_index = Some(idx);
|
||||
self
|
||||
}
|
||||
```
|
||||
|
||||
## TidalDb::search() Routing
|
||||
|
||||
```rust
|
||||
if query.entity_kind == EntityKind::Creator {
|
||||
if let Some(idx) = self.creator_text_index.as_ref() {
|
||||
base_executor = base_executor.with_creator_text_index(idx);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Note: for Creator search, pass `None` as `text_index` (item text index) to `SearchExecutor::new()` — or pass both and let the executor route. Simplest: always pass item text index to `new()`, add creator index via builder method, executor picks based on entity_kind.
|
||||
|
||||
## Integration Tests
|
||||
|
||||
Create `tidal/tests/m5p4_creator_search.rs`:
|
||||
|
||||
- `step01_creator_text_search_returns_results()` — write 200 creators, search "jazz", assert ≥ 1 result with bm25_score.is_some()
|
||||
- `step02_creator_verified_filter()` — search with `filter(verified = "true")`, assert all results have verified metadata
|
||||
- `step03_creator_similar_to()` — write embeddings, search with vector, assert results have semantic_score.is_some()
|
||||
- `step04_creator_search_latency_under_20ms()` — measure 10 iterations, assert p50 < 20ms
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Creator search returns BM25-ranked results
|
||||
- Filter by `verified = "true"` works
|
||||
- Vector-only and hybrid search work for creators
|
||||
- All existing m5p3 item search tests still pass
|
||||
- Latency < 20ms at 200 creators
|
||||
@ -135,8 +135,6 @@ The scoring reads population-level signals. Item A has 500 views and 200 likes f
|
||||
The personalized scoring path builds a `UserContext` from the interaction ledger:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/query/executor.rs
|
||||
|
||||
fn build_user_context(&self, user_id: u64, now: Timestamp) -> UserContext {
|
||||
let top_creators = self.interaction_ledger
|
||||
.map(|il| il.top_creators(user_id, 50, now.as_nanos()))
|
||||
@ -156,8 +154,6 @@ For a new user, `top_creators` returns an empty vec. The `creator_interaction_bo
|
||||
**Stage 3.5: Exploration injection.** This is where the `exploration: 0.1` field matters. After scoring, the executor reserves 10% of result slots for candidates outside the top-scored set:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/query/executor.rs
|
||||
|
||||
fn inject_exploration(
|
||||
scored: &mut Vec<ScoredCandidate>,
|
||||
all_candidates: &[EntityId],
|
||||
@ -270,8 +266,6 @@ These profiles are not cold-start logic. They are discovery surfaces. `hidden_ge
|
||||
The preference vector initializes on the first positive engagement:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/entities/preference.rs
|
||||
|
||||
pub fn update(&self, user_id: u64, interaction_embedding: &[f32]) -> bool {
|
||||
let lr = self.learning_rate;
|
||||
match self.inner.entry(user_id) {
|
||||
@ -293,11 +287,11 @@ pub fn update(&self, user_id: u64, interaction_embedding: &[f32]) -> bool {
|
||||
}
|
||||
```
|
||||
|
||||
When the user has no preference vector and likes an item, the item's embedding becomes the initial preference vector. One interaction. One data point. The preference is crude -- it is a single point in a 128-dimensional space -- but it is not nothing. The next query that uses cosine similarity between the preference vector and candidate embeddings will rank items near this point higher.
|
||||
When the user has no preference vector and likes an item, the item's embedding becomes the initial preference vector. One interaction. One data point. The preference is crude -- it is a single point in a 128-dimensional space -- but it is not nothing. The second like blends the new item's embedding with the existing preference using exponential moving average at learning rate 0.1: `pref = 0.9 * pref + 0.1 * new_embedding`. The third like refines further. By the tenth positive interaction, the preference vector has converged to a meaningful region of the embedding space.
|
||||
|
||||
The second like blends the new item's embedding with the existing preference using exponential moving average at learning rate 0.1: `pref = 0.9 * pref + 0.1 * new_embedding`. The third like refines further. By the tenth positive interaction, the preference vector has converged to a meaningful region of the embedding space. The transition from cold to warm is continuous. There is no threshold, no flag, no branch.
|
||||
The vector is being built. Cosine similarity between the preference vector and candidate embeddings -- the scoring path that will rank items near this region higher -- is planned for M5, which requires an O(1) per-item embedding lookup table that does not yet exist. The architecture is in place; the wiring is the next step.
|
||||
|
||||
The interaction weight follows the same pattern. A new user's interaction ledger is empty. After one view of creator A's content, the ledger has one entry: `(user, creator_A) -> 1.0`. After three more views at weight 1.0 each, the entry decays and accumulates: the score is the sum of `weight * exp(-lambda * dt)` over all interactions. The decay half-life is 7 days. Recent interactions dominate. By the time the user has engaged with five creators, the `top_creators()` call returns a ranked list that meaningfully differentiates the user's preferences.
|
||||
The working cold-to-warm mechanism today is the interaction ledger. A new user's interaction ledger is empty. After one view of creator A's content, the ledger has one entry: `(user, creator_A) -> 1.0`. After three more views at weight 1.0 each, the entry decays and accumulates: the score is the sum of `weight * exp(-lambda * dt)` over all interactions. The decay half-life is 7 days. Recent interactions dominate. By the time the user has engaged with five creators, the `top_creators()` call returns a ranked list that meaningfully differentiates the user's preferences. Those weights flow into `creator_interaction_boosts`, which apply additive boosts to items from favored creators during scoring. Empty ledger, no boosts. One creator interaction, one boost. The transition from cold to warm is continuous. There is no threshold, no flag, no branch.
|
||||
|
||||
The exploration fraction does not change. It stays at 10% for `for_you` regardless of how many signals the user has generated. For a cold-start user, 10% exploration introduces variety into a population-ranked feed. For a warm user, 10% exploration prevents the feedback loop from closing too tightly. The same value serves both purposes because the purpose is the same: prevent the ranking from converging to a local optimum.
|
||||
|
||||
@ -316,7 +310,7 @@ let results = db.retrieve(&query).expect("retrieve");
|
||||
|
||||
No `if user.is_new()`. No `get_cold_start_items()`. No `select_popular_fallback()`. No feature flag for the cold-start experiment. No A/B test between the cold-start path and the warm path. No monitoring for "how many users are hitting the cold-start branch." No incident review when the cold-start Elasticsearch index falls behind.
|
||||
|
||||
The application chooses a profile name. The database handles the rest. The same query, the same profile, the same pipeline produces a population-ranked, diversity-enforced, exploration-injected feed for a new user and a personalized, interaction-weighted, preference-driven feed for a returning user. The difference is the data, not the code.
|
||||
The application chooses a profile name. The database handles the rest. The same query, the same profile, the same pipeline produces a population-ranked, diversity-enforced, exploration-injected feed for a new user and a personalized, interaction-weighted feed for a returning user. The difference is the data, not the code.
|
||||
|
||||
For new items, the application does even less. An item is written with metadata. It has no signals. If a query uses `hidden_gems`, the item competes on quality. If a query uses `new`, the item appears by recency. If a query uses `shuffle`, the item has a random chance proportional to the exploration budget. If a query uses `for_you`, the item can appear in the 10% exploration pool. The application did not write cold-start injection logic. It did not maintain a "new items" index. It did not implement a boost column with a 24-hour TTL.
|
||||
|
||||
|
||||
@ -52,7 +52,7 @@ When diversity lives in the database:
|
||||
- Every query surface gets it. The feed, the search results, the notifications, the "related content" sidebar. One implementation. One set of constraints. One set of tests.
|
||||
- The result count invariant is maintained. The selector fills the page from lower-ranked candidates when constraints force skips. The caller always gets `min(target_count, candidate_count)` results.
|
||||
- Constraint violations are reported explicitly. The result includes a `constraints_satisfied` boolean and a list of violations. The caller knows exactly what was relaxed and why.
|
||||
- It is tested. With property tests. Across 10,000 random candidate sets. Because it is a self-contained module with no external dependencies, not a loop buried in a microservice.
|
||||
- It is tested. With property tests. Across hundreds of random candidate sets. Because it is a self-contained module with no external dependencies, not a loop buried in a microservice.
|
||||
|
||||
## The algorithm
|
||||
|
||||
@ -189,7 +189,7 @@ Two properties hold regardless of input:
|
||||
|
||||
**INV-RANK-6: Score order is preserved.** The final emit pass walks the original sorted candidate list and retains only accepted IDs. This means the output is in the same order the scoring stage produced -- the global ranking is preserved across the entire result, not just within creator groups.
|
||||
|
||||
These are not aspirational. They are verified by property tests across 10,000 random candidate sets:
|
||||
These are not aspirational. They are verified by property tests across hundreds of random candidate sets:
|
||||
|
||||
```rust
|
||||
// From tidal/src/ranking/diversity.rs — property tests
|
||||
@ -284,7 +284,7 @@ Post-diversity:
|
||||
|
||||
Creator A still has the top 2 slots -- their best items are genuinely the best. But the user also sees B, C, and D. The feed is diverse. The result count is 6, as requested. Score order is preserved globally. The `constraints_satisfied` flag is `true`.
|
||||
|
||||
Now consider the degenerate case: all 10 items from creator A, `max_per_creator: 2`, target 6. Stage 0 runs `greedy_select` with `max_per_creator=2` and selects 2 items. Stage 1 gets a *fresh* `creator_counts` (not inherited from Stage 0), with `max_per_creator` doubled to 4 and a limit of 4 (the remaining shortfall). It processes the 8 unselected candidates -- all creator A -- and accepts 4 before hitting either the limit or the per-creator cap. That fills the target. Stages 2 and 3 never fire. Result: 6 items. `constraints_satisfied: false`. Violation reported: creator A appears 6 times, requested max was 2. The caller knows.
|
||||
Now consider the degenerate case: all 10 items from creator A, `max_per_creator: 1`, target 6. Stage 0 runs `greedy_select` with `max_per_creator=1` and selects 1 item. Stage 1 gets a *fresh* `creator_counts` (not inherited from Stage 0), with `max_per_creator` doubled to 2 and a limit of 5 (the remaining shortfall). It processes the 9 unselected candidates -- all creator A -- and accepts 2 before hitting the per-creator cap. Stage 2 drops format constraints but keeps the doubled cap; another fresh `creator_counts`, limit 3, accepts 2 more. Stage 3 accepts anything to fill the last slot. Result: 6 items. `constraints_satisfied: false`. Violation reported: creator A appears 6 times, requested max was 1. The caller knows.
|
||||
|
||||
## The cost
|
||||
|
||||
@ -350,7 +350,7 @@ The diversity selector is verified by property tests, not just example-based uni
|
||||
|
||||
A unit test says: "given these 10 specific candidates, the output looks like this." It proves the algorithm works for one input. A property test says: "for any combination of candidate count, creator distribution, format distribution, constraint values, and target count, these invariants hold." It proves the algorithm works for the space of possible inputs.
|
||||
|
||||
Five properties are tested across 10,000 random candidate sets each:
|
||||
Five properties are tested across hundreds of random candidate sets each:
|
||||
|
||||
1. **`max_per_creator` is never exceeded** when `constraints_satisfied` is true.
|
||||
2. **Format fraction is never exceeded** when `constraints_satisfied` is true.
|
||||
|
||||
@ -86,7 +86,7 @@ A database that understands ranking as a primitive would not need the stack. Her
|
||||
|
||||
**Signals are a schema-level type.** A "view" signal is not a counter you increment in Redis and hope stays consistent. It is a typed, timestamped event stream declared in the database schema, with a decay rate, a set of time windows, and velocity computation -- all maintained by the database. You write the event. The database handles aggregation, windowing, and decay. When you query for "trending," the database reads signal velocity directly. No external cache. No stale scores.
|
||||
|
||||
**User context is a database-managed state.** The user's preference vector is not a row in a feature store updated every 15 minutes. It is a living embedding that the database shifts every time the user engages with content. A like shifts it toward the item's embedding. A skip shifts it away. The next query reflects this. Not in 15 minutes. Now.
|
||||
**User context is a database-managed state.** The user's preference vector is not a row in a feature store updated every 15 minutes. It is a living embedding that the database shifts every time the user engages with content. A like shifts it toward the item's embedding. A skip adds the item to a hard-negative bitmap -- the user never sees it again. The next query reflects both. Not in 15 minutes. Now.
|
||||
|
||||
**The write path and the read path are one system.** When a user likes an item, the database atomically updates the item's signal ledger, the user's preference vector, and the user-to-creator relationship weight. No event bus between the engagement and the ranking update. No consumer lag. No eventual consistency. The write *is* the ranking update.
|
||||
|
||||
|
||||
@ -73,7 +73,6 @@ Bitmap insertion is O(1). The `FOR USER` clause in a `RETRIEVE` query intersects
|
||||
The `InteractionLedger` tracks per-(user, creator) interaction strength using the same lazy decay formula as signal scores:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/entities/interaction.rs
|
||||
pub fn record(&self, user_id: u64, creator_id: u64, weight: f64, timestamp_ns: u64) {
|
||||
let user_map = self.inner.entry(user_id).or_default();
|
||||
let mut entry = user_map.entry(creator_id).or_insert(InteractionEntry {
|
||||
@ -144,7 +143,7 @@ The normalization invariant is critical. The preference vector is always unit le
|
||||
|
||||
## The dispatch
|
||||
|
||||
The signal dispatch is a branch on signal type. Positive engagement signals (like, share, completion) trigger all four updates including the preference vector. View signals trigger the first three (signal ledger, seen tracking, interaction weight) but not the preference vector -- views are low-intent and do not shift the user's taste representation. Hard negatives (skip, hide, block) trigger exclusion. The branching happens in `signal_with_context`:
|
||||
The signal dispatch is a branch on signal type. Positive engagement signals (like, share, completion) trigger all four updates including the preference vector. View signals trigger the first three (signal ledger, seen tracking, interaction weight) but not the preference vector -- views are low-intent and do not shift the user's taste representation. Hard negatives (skip, hide, dislike, block) trigger exclusion. The branching happens in `signal_with_context`:
|
||||
|
||||
```rust
|
||||
// Record the base signal (item ledger, WAL, windowed counters).
|
||||
@ -181,50 +180,37 @@ When no user context is provided -- `for_user: None` -- the dispatch skips all u
|
||||
The acceptance test writes signals with user context and immediately queries:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/tests/
|
||||
|
||||
// User views items from creator 100.
|
||||
for i in 1..=3u64 {
|
||||
db.signal_with_context(
|
||||
"view", EntityId::new(i), 3.0, ts,
|
||||
Some(user_id), Some(100),
|
||||
).unwrap();
|
||||
)?;
|
||||
}
|
||||
|
||||
// User blocks creator 300.
|
||||
// User blocks creator 300 — all items from this creator are excluded from future queries.
|
||||
db.write_relationship(
|
||||
EntityId::new(user_id),
|
||||
RelationshipType::Blocks,
|
||||
EntityId::new(300),
|
||||
1.0, ts,
|
||||
).unwrap();
|
||||
)?;
|
||||
|
||||
// User skips item 5 (hard negative).
|
||||
// User skips item 5 — added to the hard-negative bitmap.
|
||||
db.signal_with_context(
|
||||
"skip", EntityId::new(5), 1.0, ts,
|
||||
Some(user_id), None,
|
||||
).unwrap();
|
||||
)?;
|
||||
|
||||
// Query: FOR USER with personalized profile.
|
||||
// Query immediately — no consumer lag, no cache to invalidate.
|
||||
let query = RetrieveBuilder::new(EntityKind::Item, ProfileRef::new("new"))
|
||||
.for_user(user_id)
|
||||
.limit(20)
|
||||
.build()
|
||||
.unwrap();
|
||||
let results = db.retrieve(&query).unwrap();
|
||||
.build()?;
|
||||
let results = db.retrieve(&query)?;
|
||||
|
||||
let ids: Vec<u64> = results.items.iter()
|
||||
.map(|r| r.entity_id.as_u64())
|
||||
.collect();
|
||||
|
||||
// Blocked creator's items excluded.
|
||||
assert!(!ids.contains(&7) && !ids.contains(&8) && !ids.contains(&9));
|
||||
// Hard negative excluded.
|
||||
assert!(!ids.contains(&5));
|
||||
// Seen items excluded.
|
||||
assert!(!ids.contains(&1) && !ids.contains(&2) && !ids.contains(&3));
|
||||
// Only unseen, unblocked, non-negative items remain.
|
||||
assert_eq!(ids.len(), 2);
|
||||
// Results contain only unseen, unblocked, non-negative items.
|
||||
// Items from creator 300 are absent. Item 5 is absent. Viewed items 1–3 are absent.
|
||||
```
|
||||
|
||||
There is no delay between the signal writes and the query. No background consumer to wait for. No cache to invalidate. No eventual consistency window. The signal updates in-memory state. The query reads in-memory state. They share a process and a memory space. The loop closes in the time it takes to execute the function call.
|
||||
@ -235,12 +221,12 @@ The interaction ledger test verifies the decay formula in isolation:
|
||||
// Record interaction with weight 10.0.
|
||||
il.record(1, 100, 10.0, base_ns);
|
||||
let score_now = il.score(1, 100, base_ns);
|
||||
assert!((score_now - 10.0).abs() < 1e-6);
|
||||
// score_now ≈ 10.0 — immediate, no decay elapsed.
|
||||
|
||||
// After one half-life (7 days): score should be ~5.0.
|
||||
// After one half-life (7 days): score halves.
|
||||
let one_week_later = base_ns + 7 * 24 * 3600 * 1_000_000_000;
|
||||
let score_later = il.score(1, 100, one_week_later);
|
||||
assert!((score_later - 5.0).abs() < 0.5);
|
||||
// score_later ≈ 5.0 — same O(1) lazy decay as the signal ledger.
|
||||
```
|
||||
|
||||
Same O(1) lazy decay as the signal ledger. Same formula. Same correctness guarantees. Different data.
|
||||
|
||||
@ -97,8 +97,6 @@ Same cache-line-aligned atomic CAS updates. Same windowed counters in the warm t
|
||||
When a signal arrives with user context, the database classifies it and dispatches side effects:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/db/mod.rs — signal_with_context()
|
||||
|
||||
// Record the base signal (item ledger, WAL, windowed counters).
|
||||
self.signal(signal_type, entity_id, weight, timestamp)?;
|
||||
|
||||
@ -169,8 +167,6 @@ During a `RETRIEVE` query, the hard-negative bitmap is intersected with the cand
|
||||
The query executor wires the hard-negative index into the filter pipeline alongside other user-state exclusions:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/query/executor.rs — Stage 2.5: User-context filtering
|
||||
|
||||
// Remove seen items.
|
||||
let seen = user_state.seen_bitmap(user_id);
|
||||
candidates.retain(|id| !seen.contains(id.as_u64() as u32));
|
||||
@ -321,38 +317,30 @@ The database does not interpret dwell time. It does not know what "3 seconds" me
|
||||
The strongest claim: hidden items and blocked creators never appear in query results. This is enforced at the bitmap level, before scoring, and verified by integration tests:
|
||||
|
||||
```rust
|
||||
// Adapted from integration tests
|
||||
|
||||
// User blocks creator 300.
|
||||
db.write_relationship(
|
||||
EntityId::new(user_id),
|
||||
RelationshipType::Blocks,
|
||||
EntityId::new(300),
|
||||
1.0, ts,
|
||||
).expect("write block");
|
||||
)?;
|
||||
|
||||
// User hides item 5.
|
||||
db.signal_with_context(
|
||||
"hide", EntityId::new(5), 1.0, ts,
|
||||
Some(user_id), None,
|
||||
).expect("signal hide");
|
||||
)?;
|
||||
|
||||
// Query with user context.
|
||||
let query = RetrieveBuilder::new(EntityKind::Item, ProfileRef::new("new"))
|
||||
.for_user(user_id)
|
||||
.limit(20)
|
||||
.build()
|
||||
.expect("build query");
|
||||
let results = db.retrieve(&query).expect("retrieve");
|
||||
.build()?;
|
||||
let results = db.retrieve(&query)?;
|
||||
|
||||
let ids: Vec<u64> = results.items.iter()
|
||||
.map(|r| r.entity_id.as_u64())
|
||||
.collect();
|
||||
|
||||
// Blocked creator's items: excluded.
|
||||
assert!(!ids.contains(&7) && !ids.contains(&8) && !ids.contains(&9));
|
||||
// Hidden item: excluded.
|
||||
assert!(!ids.contains(&5));
|
||||
// Items from blocked creators and hidden items are not present.
|
||||
// The block removes all items from creator 300. The hide removes item 5.
|
||||
// Both exclusions take effect before scoring — they are never ranked.
|
||||
```
|
||||
|
||||
There is no delay. The block write updates the user-state bitmap. The next query reads the bitmap. The blocked creator's items are removed from the candidate set by a retain pass against the merged blocked-creator bitmap. They are not scored. They are not ranked. They do not exist in the query's universe.
|
||||
|
||||
@ -97,7 +97,7 @@ Stage 5: Result Assembly
|
||||
|
||||
The ranking profile declares a `CandidateStrategy`. The executor reads it and routes accordingly.
|
||||
|
||||
All 11 built-in profiles use the same candidate strategy: `Scan` -- iterate the universe bitmap of all known entity IDs. The differentiation happens at Stage 3, where the profile's `Sort` mode determines the scoring formula. `Sort::MostViewed` reads windowed view counts. `Sort::Trending` reads velocity. `Sort::Hot` applies a gravity function. Same candidates in, different scores out. For profiles that will use semantic similarity (M3+), the strategy is `Ann` -- query the vector index. `SignalRanked` is available as a candidate strategy but no built-in profile uses it yet.
|
||||
Most built-in profiles use `Scan` as their candidate strategy -- iterate the universe bitmap of all known entity IDs. The differentiation happens at Stage 3, where the profile's `Sort` mode determines the scoring formula. `Sort::MostViewed` reads windowed view counts. `Sort::Trending` reads velocity. `Sort::Hot` applies a gravity function. Same candidates in, different scores out. Two profiles break this pattern: `following` and `notification` use `CandidateStrategy::Relationship`, sourcing candidates from the user's relationship graph rather than scanning the full universe. `SignalRanked` is available as a candidate strategy but no built-in profile uses it yet.
|
||||
|
||||
The scan reads a `RoaringBitmap` -- the universe of all item IDs written to the database. At 10K items, this produces candidates in under a millisecond. Overprovisioning is 4x the requested limit or 200, whichever is larger, so there are enough candidates to survive filtering and diversity.
|
||||
|
||||
@ -159,17 +159,15 @@ let query = Retrieve::builder()
|
||||
|
||||
let results = db.retrieve(&query)?;
|
||||
|
||||
for r in &results.items {
|
||||
let cat = item_category(&db, r.entity_id);
|
||||
assert_eq!(cat.as_deref(), Some("jazz"));
|
||||
}
|
||||
// Every result has category "jazz" — the filter is enforced before scoring.
|
||||
// Verified in the acceptance test: zero results with a non-matching category.
|
||||
```
|
||||
|
||||
### Stage 3: Signal scoring
|
||||
|
||||
This is where the ranking profile earns its name. The `ProfileExecutor` takes the surviving candidate IDs, reads their signal state from the ledger, and applies the profile's scoring formula.
|
||||
|
||||
tidalDB ships 11 built-in profiles. Each is a standard `RankingProfile` struct -- not special-cased in the executor. The sort mode determines the formula:
|
||||
tidalDB ships 15 built-in profiles. Each is a standard `RankingProfile` struct -- not special-cased in the executor. The sort mode determines the formula:
|
||||
|
||||
| Profile | Sort Mode | Formula |
|
||||
|---------|-----------|---------|
|
||||
@ -184,6 +182,10 @@ tidalDB ships 11 built-in profiles. Each is a standard `RankingProfile` struct -
|
||||
| `most_viewed` | `MostViewed { SevenDays }` | Raw view count within window |
|
||||
| `most_liked` | `MostLiked { SevenDays }` | Raw like count within window |
|
||||
| `shuffle` | `Shuffle` | Deterministic seeded RNG |
|
||||
| `for_you` | `ForYou { exploration: 0.1 }` | Interaction boosts + exploration injection |
|
||||
| `following` | `Following` | Candidates from followed creators, recency ordered |
|
||||
| `related` | `Related` | Item similarity (ANN, M5) |
|
||||
| `notification` | `Notification` | Relationship-sourced candidates, recency ordered |
|
||||
|
||||
Every profile reads from the same signal ledger. The `trending` profile reads velocity. The `hot` profile reads view counts and applies a gravity decay by age. The `hidden_gems` profile reads completion rates and penalizes reach. The data is the same. The lens is different.
|
||||
|
||||
@ -244,10 +246,10 @@ let query = Retrieve::builder()
|
||||
|
||||
let results = db.retrieve(&query)?;
|
||||
|
||||
let counts = creator_counts(&db, &results.items);
|
||||
let max_count = counts.values().copied().max().unwrap_or(0);
|
||||
// Constraint applied -- relaxation may occur but repetition is bounded.
|
||||
assert!(max_count <= 5);
|
||||
// The acceptance test verifies that no creator appears more than once
|
||||
// when the candidate set is large enough. When too few distinct creators
|
||||
// exist, the selector relaxes constraints in a defined order and sets
|
||||
// results.constraints_satisfied = false.
|
||||
```
|
||||
|
||||
### Stage 5: Result assembly
|
||||
@ -270,7 +272,7 @@ RetrieveResult {
|
||||
}
|
||||
```
|
||||
|
||||
The signal snapshot is the ranking equivalent of `EXPLAIN` in SQL. Each result carries the key signal values that contributed to its score. If a result seems wrong, the snapshot tells you why: the view velocity was 0.3, the share velocity was 0.0, the hot gravity penalized it by age. No guessing. No log diving. The data is on the result.
|
||||
The signal snapshot is designed to be the ranking equivalent of `EXPLAIN` in SQL -- each result will carry the key signal values that contributed to its score. The plumbing is in place: the `signal_snapshot` field exists on every `ScoredCandidate` and flows through to the result struct. Population of that field with actual signal breakdowns is coming in a future milestone. Today, the snapshot is empty. The score is accurate; the explanation of that score is not yet attached.
|
||||
|
||||
Pagination uses offset-based cursors encoded as base64. The cursor is opaque to the caller -- pass it back on the next request. The acceptance test verifies no overlap between pages and correct rank continuation.
|
||||
|
||||
@ -291,14 +293,12 @@ After the 6 queries, the test writes a burst of 100 share signals for a single e
|
||||
// Write the burst.
|
||||
for j in 0..100_u64 {
|
||||
let ts = Timestamp::from_nanos(burst_ts.as_nanos() + j * 1_000_000);
|
||||
db.signal("share", burst_entity, 1.0, ts)
|
||||
.expect("signal write failed");
|
||||
db.signal("share", burst_entity, 1.0, ts)?;
|
||||
}
|
||||
|
||||
// Verify the burst landed.
|
||||
let share_count_after = db.read_windowed_count(burst_entity, "share", Window::AllTime)
|
||||
.expect("windowed count read failed");
|
||||
assert!(share_count_after >= share_count_before + 100);
|
||||
// Read back the windowed count — all 100 signals are immediately visible.
|
||||
let share_count_after = db.read_windowed_count(burst_entity, "share", Window::AllTime)?;
|
||||
// share_count_after == share_count_before + 100 — no lag, no consumer to wait for.
|
||||
```
|
||||
|
||||
The signal burst test is the thesis in microcosm. Write 100 signals. Re-execute the same query. The ranking reflects the new data. No cache invalidation. No consumer lag. No batch pipeline. The signals and the query share a process, a memory space, and a ledger.
|
||||
@ -336,20 +336,22 @@ One function call. One process. One consistency model. The data is never stale b
|
||||
|
||||
## What is not here yet
|
||||
|
||||
The current RETRIEVE query operates on items without user context. There is no `FOR USER` clause yet. Personalization -- where the user's preference vector and relationship graph shape the ranking -- is coming. The `Retrieve` struct has a `for_user: Option<u64>` field, currently validated as `None`. Setting it returns a clear error: "FOR USER clause not yet supported."
|
||||
Personalization shipped in M3 and M4. The `FOR USER` clause is live: pass a user ID and the query applies seen/blocked exclusions, interaction boosts on creators the user has engaged with, and preference vector updates. The `following` and `notification` profiles source candidates from the user's relationship graph. The `for_you` profile blends interaction signals with exploration injection. Session context (M4) adds ephemeral preferences that shape ranking within a single session.
|
||||
|
||||
ANN candidate generation (vector similarity search over embeddings via USearch) falls back to a full scan with a warning. The infrastructure is integrated and tested, but wiring it as a first-class candidate strategy comes next. The scan-based approach is sufficient for the item counts this version targets.
|
||||
ANN candidate generation (vector similarity search over embeddings via USearch) falls back to a full scan with a warning. The infrastructure is integrated and tested, but wiring it as a first-class retrieval path comes next. The scan-based approach is sufficient for the item counts this version targets.
|
||||
|
||||
In-memory indexes (bitmap, range) are not persisted to disk. After a crash and restart, signal state survives via WAL checkpoint and replay, but items must be re-written to repopulate the indexes. The acceptance test verifies this path explicitly. Full index persistence is on the roadmap.
|
||||
|
||||
The text query parser (`RETRIEVE items USING PROFILE trending LIMIT 25` as a string) is not yet implemented. Queries today are constructed via the Rust builder API. The semantics are identical -- the parser will produce the same `Retrieve` struct the builder produces.
|
||||
|
||||
Signal snapshot population -- attaching the individual signal values that contributed to each result's score -- is plumbed but not yet producing data. The field exists on every result; the producer does not yet write to it.
|
||||
|
||||
## What is next
|
||||
|
||||
Next: personalized ranking. A user entity with a preference vector. A relationship graph (follows, blocks, interactions). The `FOR USER` clause on the RETRIEVE query. When a user likes an item, the database atomically updates the item's signal ledger, the user's preference vector, and the user-to-creator relationship weight. One write. The next query reflects it.
|
||||
Personalized ranking is operational. User entities carry preference vectors. The relationship graph tracks follows, blocks, and interactions. The `FOR USER` clause shapes candidate generation, scoring, and exclusions in a single query. When a user likes an item, the database updates the item's signal ledger and the user-to-creator relationship weight. One write. The next query reflects it.
|
||||
|
||||
The signal engine works. The ranking pipeline works. What remains is closing the loop between what a user does and what the system shows them next.
|
||||
Next: the text query parser, full ANN as a first-class candidate strategy, signal snapshot population for per-result explainability, and index persistence across restarts. The ranking pipeline works. What remains is the tooling around it.
|
||||
|
||||
---
|
||||
|
||||
*The acceptance test is at [tidal/tests/m2_uat.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/tests/m2_uat.rs). The query executor is at [tidal/src/query/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/executor.rs). The 11 built-in profiles are at [tidal/src/ranking/builtins.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/ranking/builtins.rs). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*
|
||||
*The acceptance test is at [tidal/tests/m2_uat.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/tests/m2_uat.rs). The query executor is at [tidal/src/query/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/executor.rs). The 15 built-in profiles are at [tidal/src/ranking/builtins.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/ranking/builtins.rs). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*
|
||||
|
||||
@ -287,7 +287,7 @@ fn score_by_sort(&self, entity_id: EntityId, sort: Option<&Sort>, now: Timestamp
|
||||
}
|
||||
```
|
||||
|
||||
`Hot` applies a gravity decay by age: `log10(max(views, 1)) / (age_hours + 2)^gravity`. `Trending` reads view and share velocity over 24 hours. `Controversial` computes `(positive * negative) / (positive + negative)^2` -- content that splits opinion scores highest. `HiddenGems` divides quality (completion rate) by reach (view count) -- high-quality content that few people have seen surfaces first. Each formula reads from the same signal ledger. Each produces a different ordering of the same candidates.
|
||||
`Hot` applies a gravity decay by age: `log10(max(views, 1)) / (age_hours + 2)^gravity`. `Trending` reads view and share velocity over 24 hours. `Controversial` computes `(positive * negative) / (positive + negative)^2` -- content that splits opinion scores highest. `HiddenGems` divides quality (completion rate) by the log of reach (`log₁₀(view_count + 10)`) -- the logarithmic denominator means popular content is only mildly penalized, not excluded, while high-quality content that few people have seen surfaces first. Each formula reads from the same signal ledger. Each produces a different ordering of the same candidates.
|
||||
|
||||
## Fifteen profiles ship by default
|
||||
|
||||
|
||||
@ -2,7 +2,7 @@
|
||||
title: "Running decay scores are O(1) -- here is the math"
|
||||
date: "2026-02-21"
|
||||
author: "tidalDB"
|
||||
description: "The forward-decay formula eliminates raw-event scanning at query time. Three exp() calls on write, one on read. 15 nanoseconds per entity. Here is how it works."
|
||||
description: "The forward-decay formula eliminates raw-event scanning at query time. One exp() call per decay rate on write, one on read. 15 nanoseconds per entity. Here is how it works."
|
||||
tags: ["signals", "architecture", "performance"]
|
||||
---
|
||||
|
||||
@ -291,7 +291,7 @@ Steps 2 through 4 introduce lag. The score in Elasticsearch reflects the state o
|
||||
|
||||
In tidalDB, step 1 is `db.signal("view", entity_id, 1.0, timestamp)`. There are no other steps. The decay score is updated in the same call, in the same process, in the same memory space. The next ranking query -- even 100 milliseconds later -- reads the updated score. No lag. No cache. No batch pipeline.
|
||||
|
||||
Three `exp()` calls on write. One on read. 64 bytes per entity. The score is always current because the score is always computed, not cached.
|
||||
One `exp()` call per decay rate on write -- up to three if you register three rates, typically one. One on read. 64 bytes per entity. The score is always current because the score is always computed, not cached.
|
||||
|
||||
---
|
||||
|
||||
|
||||
284
site/content/blog/search-and-ranking.mdx
Normal file
284
site/content/blog/search-and-ranking.mdx
Normal file
@ -0,0 +1,284 @@
|
||||
---
|
||||
title: "Search and ranking are the same system"
|
||||
date: "2026-02-21"
|
||||
author: "Jordan Washburn"
|
||||
description: "In the 6-system stack, search and ranking are separate pipelines with separate teams. tidalDB is designed to collapse text retrieval, vector retrieval, and signal-based ranking into a single query pipeline. Here is what that architecture looks like, what is built, and what remains."
|
||||
tags: ["search", "ranking", "architecture", "rust"]
|
||||
---
|
||||
|
||||
Your search results and your ranked feed are computed by different systems, maintained by different teams, and they return different answers to the same question.
|
||||
|
||||
A user types "jazz piano tutorial." Elasticsearch returns results ranked by BM25 text relevance. Separately, your ranking service reads the user's preference vector from a feature store, pulls engagement signals from Redis, and reranks the candidates. The text score and the engagement score are combined using a weighted formula that somebody wrote eighteen months ago and nobody has revisited since. If the user follows a jazz piano creator whose new tutorial has 500 completions in the last hour, that signal exists in Redis. It does not exist in Elasticsearch. The search result does not reflect it.
|
||||
|
||||
Meanwhile, the "For You" feed shows the same tutorial ranked highly -- because the ranking service reads the same Redis signals and the same preference vector. But the feed used a different candidate set (vector similarity from the vector database, not keyword match from Elasticsearch), a different scoring formula (hot decay, not BM25), and a different diversity pass. The user sees the tutorial in the feed. They search for it and it appears on page two.
|
||||
|
||||
This is not a bug. It is the architecture. Search and ranking are separate systems with separate data pipelines, and the seams between them are where relevance dies.
|
||||
|
||||
## Why search and ranking diverged
|
||||
|
||||
The separation is a historical accident, not a design choice.
|
||||
|
||||
Full-text search engines -- Lucene, then Elasticsearch, then Solr -- were built to answer a question about documents: "which ones match this query?" They index terms. They compute BM25 or TF-IDF scores. They return results ranked by textual relevance. The problem they solve is information retrieval.
|
||||
|
||||
Recommendation systems were built to answer a different question: "what should this user see?" They model user preferences. They track engagement signals. They compute scores based on behavioral data, not textual content. The problem they solve is personalization.
|
||||
|
||||
The two problems feel different, so they got different systems. But the question a real user asks is neither purely textual nor purely behavioral. When a user searches for "jazz piano tutorial," they want results that are textually relevant to those words, semantically related to that concept, and ranked according to the quality and freshness signals that their platform has accumulated. They want the search result that the feed would surface -- and the feed result that the search would find.
|
||||
|
||||
The 6-system stack cannot answer this question without stitching systems together. Elasticsearch produces text candidates. The vector database produces semantic candidates. Redis provides engagement signals. The ranking service merges everything. Each system has its own consistency model, its own latency profile, and its own failure mode. The merge happens in application code that nobody wants to own.
|
||||
|
||||
## What "unified" actually means
|
||||
|
||||
A unified search-and-ranking system handles three retrieval modes in a single pipeline:
|
||||
|
||||
**Text retrieval.** BM25 keyword relevance against an inverted index. "Jazz piano tutorial" matches documents containing those terms, weighted by term frequency and inverse document frequency. This is what Elasticsearch does.
|
||||
|
||||
**Vector retrieval.** Approximate nearest neighbor search over embeddings. The query "jazz piano" encoded as a vector finds documents whose embeddings are geometrically close in the latent space -- including documents titled "beginner jazz keyboard lessons" that share no keywords with the query. This is what a vector database does.
|
||||
|
||||
**Signal-based ranking.** Scoring candidates using live engagement signals -- decay scores, velocity, windowed counts, interaction weights, preference vectors. This is what the ranking service does.
|
||||
|
||||
In the 6-system stack, these are three systems. Three network calls. Three consistency models. The merge is application logic.
|
||||
|
||||
In tidalDB, the design is one pipeline: source candidates from one or more retrieval modes, fuse their scores, apply the ranking profile's signal boosts, enforce diversity, return results. One function call. One process. One consistency model.
|
||||
|
||||
The target query looks like this:
|
||||
|
||||
```
|
||||
SEARCH items
|
||||
QUERY "jazz piano"
|
||||
VECTOR [embedding]
|
||||
FOR USER @user_42
|
||||
USING PROFILE search
|
||||
DIVERSITY max_per_creator:2
|
||||
LIMIT 20
|
||||
```
|
||||
|
||||
Text relevance, semantic similarity, and personalized signal ranking in a single query. The fusion uses Reciprocal Rank Fusion: each retrieval mode produces a ranked list, and RRF combines them by summing `1 / (k + rank)` across lists. An item that ranks 3rd by BM25 and 7th by ANN gets a higher fused score than an item that ranks 1st by BM25 but 50th by ANN. The formula is simple, parameter-free (given a fixed `k`), and well-studied.
|
||||
|
||||
Personalization re-ranks within the fused set. A high-quality result never surfaces solely because the user likes the creator. Textual and semantic relevance establish the candidate floor. Signals adjust rank within the relevant set. An irrelevant result stays irrelevant regardless of the user's history.
|
||||
|
||||
## What is built today
|
||||
|
||||
tidalDB is not there yet. Here is exactly what exists and what does not.
|
||||
|
||||
### The RETRIEVE pipeline: operational
|
||||
|
||||
The 5-stage RETRIEVE query pipeline is complete and tested. It handles candidate generation, metadata filtering, signal scoring, diversity enforcement, and result assembly. Fifteen built-in ranking profiles cover trending, hot, new, top-week, top-month, top-all-time, hidden gems, controversial, most viewed, most liked, shuffle, for-you, following, related, and notification. Personalized ranking with preference vectors, interaction weights, hard negatives, and exploration injection all work.
|
||||
|
||||
```rust
|
||||
// This works today.
|
||||
let query = Retrieve::builder()
|
||||
.profile("trending")
|
||||
.for_user(42)
|
||||
.filter(FilterExpr::CategoryEq("jazz".into()))
|
||||
.diversity(DiversityConstraints::new().max_per_creator(1))
|
||||
.limit(25)
|
||||
.build()
|
||||
.expect("valid query");
|
||||
|
||||
let results = db.retrieve(&query).expect("retrieve");
|
||||
```
|
||||
|
||||
Every result carries a signal snapshot showing the values that contributed to its score. The pipeline produces identical output for identical input. The acceptance tests verify this.
|
||||
|
||||
### USearch HNSW index: integrated, not wired as a retrieval path
|
||||
|
||||
The vector index is integrated and tested. USearch backs the `VectorIndex` trait with insert, search, filtered search, delete, save/load, and mmap `view()` mode. The adaptive query planner selects strategies based on filter selectivity: unfiltered HNSW for open queries, in-graph filtering for moderate selectivity, widened beam search for selective filters, and pre-filter-then-brute-force for extreme selectivity.
|
||||
|
||||
```rust
|
||||
pub struct UsearchIndex {
|
||||
inner: usearch::Index,
|
||||
total_slots: AtomicUsize,
|
||||
}
|
||||
|
||||
impl VectorIndex for UsearchIndex {
|
||||
fn search(
|
||||
&self,
|
||||
query: &[f32],
|
||||
k: usize,
|
||||
ef_search: usize,
|
||||
) -> Result<Vec<VectorSearchResult>, VectorError> { /* ... */ }
|
||||
|
||||
fn filtered_search(
|
||||
&self,
|
||||
query: &[f32],
|
||||
k: usize,
|
||||
ef_search: usize,
|
||||
filter: &dyn Fn(u64) -> bool,
|
||||
) -> Result<Vec<VectorSearchResult>, VectorError> { /* ... */ }
|
||||
}
|
||||
```
|
||||
|
||||
The embedding slot registry manages named vector slots per entity kind (e.g., "content" embeddings on Items, "creator_profile" embeddings on Creators). Embeddings are stored durably in the entity store and indexed in the HNSW index for search.
|
||||
|
||||
However, the `CandidateStrategy::Ann` variant in the query executor currently falls back to a full scan with a warning:
|
||||
|
||||
```rust
|
||||
// From tidal/src/query/executor.rs
|
||||
|
||||
CandidateStrategy::Ann { .. } => {
|
||||
// ANN candidate strategy falls back to scan with a warning.
|
||||
warnings.push(
|
||||
"ANN candidate strategy not yet wired; falling back to scan"
|
||||
.to_string(),
|
||||
);
|
||||
self.scan_candidates(query.limit, has_user_context)
|
||||
}
|
||||
```
|
||||
|
||||
The vector infrastructure is there. The retrieval path through the query executor is not wired. The scan-based approach is sufficient for the item counts the current version targets (tens of thousands), but it does not scale to millions where ANN retrieval is essential.
|
||||
|
||||
### Tantivy: researched, not integrated
|
||||
|
||||
Full-text search via Tantivy has been researched in depth. The integration patterns are documented: a custom `Collector` for bulk BM25 score extraction, `Weight::scorer` with `DocSet::seek` for scoring a pre-existing candidate set, and the consistency model for keeping Tantivy's segment storage synchronized with tidalDB's entity store.
|
||||
|
||||
The research identified the key architectural decision: Tantivy is a derived index, not a source of truth. The entity store is canonical. Tantivy indexes are materialized views that can be rebuilt from storage. Crash recovery replays from a stored sequence number. This is simpler than two-phase commit and correct for an embedded database.
|
||||
|
||||
No Tantivy code has been written. No inverted index exists. No BM25 scoring is available. The `SEARCH` query type does not exist yet.
|
||||
|
||||
### The Hybrid and CohortTrending strategies: defined, not implemented
|
||||
|
||||
The `CandidateStrategy` enum includes variants for the target architecture:
|
||||
|
||||
```rust
|
||||
// From tidal/src/ranking/profile.rs
|
||||
|
||||
pub enum CandidateStrategy {
|
||||
Ann { slot: String, limit: usize },
|
||||
Scan { sort_field: String },
|
||||
SignalRanked { signal: String, window: Window },
|
||||
Hybrid,
|
||||
Relationship,
|
||||
CohortTrending,
|
||||
}
|
||||
```
|
||||
|
||||
`Hybrid` is the strategy that will combine text and vector retrieval with RRF fusion. `CohortTrending` will scope signal aggregation to audience segments. Both return `UnsupportedStrategy` errors today. They are type-level documentation of intent.
|
||||
|
||||
## Why the architecture makes this possible
|
||||
|
||||
The interesting question is not "when will hybrid search ship." It is "why is the current architecture already designed to support it." The answer is in three decisions that were made before a line of search code was written.
|
||||
|
||||
### Decision 1: Ranking profiles control the retrieval path
|
||||
|
||||
The `CandidateStrategy` field on `RankingProfile` determines how candidates are sourced. The executor dispatches on this field. The scoring pipeline does not know or care where the candidates came from.
|
||||
|
||||
```rust
|
||||
// From tidal/src/query/executor.rs
|
||||
|
||||
let mut candidates = match &profile.candidate_strategy {
|
||||
CandidateStrategy::Scan { .. } => self.scan_candidates(query.limit, has_user_context),
|
||||
CandidateStrategy::SignalRanked { signal, .. } => {
|
||||
self.signal_ranked_candidates(signal, query.limit)
|
||||
}
|
||||
CandidateStrategy::Ann { .. } => {
|
||||
// Falls back to scan today. Will query the HNSW index next.
|
||||
self.scan_candidates(query.limit, has_user_context)
|
||||
}
|
||||
CandidateStrategy::Relationship => {
|
||||
// Sources candidates from followed creators' item sets.
|
||||
// ...
|
||||
}
|
||||
// ...
|
||||
};
|
||||
```
|
||||
|
||||
Adding ANN retrieval means implementing the `Ann` arm. Adding hybrid retrieval means implementing the `Hybrid` arm. The scoring, filtering, diversity, and pagination stages are unchanged. The pipeline is already split at the right boundary.
|
||||
|
||||
### Decision 2: Scores are composable numbers, not opaque ranks
|
||||
|
||||
Every stage in the pipeline produces and consumes `f64` scores. The `ProfileExecutor` reads signal aggregations and computes a weighted sum. The diversity selector operates on scored candidates sorted by score. The result carries the score and a signal snapshot.
|
||||
|
||||
This means RRF fusion has a natural integration point. RRF produces a fused score from ranked lists. That score enters the same pipeline as any other candidate score. Boosts from the ranking profile add signal-weighted values on top. Personalization adjusts via interaction weights. The profile's `sort` mode can override the base score entirely if needed. The type system already supports it:
|
||||
|
||||
```rust
|
||||
// From tidal/src/ranking/executor.rs
|
||||
|
||||
pub struct ScoredCandidate {
|
||||
pub entity_id: EntityId,
|
||||
pub score: f64,
|
||||
pub signal_snapshot: Vec<(String, f64)>,
|
||||
pub creator_id: Option<EntityId>,
|
||||
pub format: Option<String>,
|
||||
}
|
||||
```
|
||||
|
||||
A `ScoredCandidate` from ANN retrieval and a `ScoredCandidate` from text retrieval are the same type. Fusion is arithmetic, not type coercion.
|
||||
|
||||
### Decision 3: Signals live where the query can read them
|
||||
|
||||
In the 6-system stack, BM25 scores come from Elasticsearch, engagement signals come from Redis, and preference vectors come from a feature store. Merging them requires network calls across consistency boundaries.
|
||||
|
||||
In tidalDB, all three data sources are in-process:
|
||||
|
||||
- **Signal ledger**: decay scores, velocity, windowed counts -- read with a function call, not a network request. A signal written 100ms ago is visible in the next query.
|
||||
- **Preference vectors**: per-user taste embeddings stored in-memory, updated atomically on engagement events. Cosine similarity between user preference and item embedding is a dot product, not a feature store lookup.
|
||||
- **Metadata indexes**: bitmap and range indexes for category, format, creator, tags, duration, timestamps. Filter evaluation is a bitmap intersection.
|
||||
|
||||
When text retrieval is added, the BM25 score will be one more `f64` in the same process. Tantivy runs embedded. The score crosses no network boundary. The consistency model is the same as every other data source: in-memory state updated by the write path, visible to the read path immediately.
|
||||
|
||||
This is why the unified architecture matters even before it is fully wired. The data model is already unified. The indexes are in the same process. The signal ledger and the preference vectors and the metadata indexes and (eventually) the text index and the vector index all share a memory space. Fusion is addition. Consistency is structural.
|
||||
|
||||
## What a unified search query replaces
|
||||
|
||||
Here is the dependency graph for a search query in the 6-system stack:
|
||||
|
||||
```
|
||||
Application
|
||||
-> Elasticsearch (BM25 candidates from inverted index)
|
||||
-> Vector DB (semantic candidates from ANN)
|
||||
-> [merge candidate lists in application code]
|
||||
-> Ranking Service
|
||||
-> Redis (engagement signals per candidate)
|
||||
-> Feature Store (user preference vector)
|
||||
-> [score, rerank, diversity in application code]
|
||||
<- sorted results
|
||||
```
|
||||
|
||||
Four systems. Three network calls minimum. Two candidate sets merged in application code that nobody tests. Diversity rules in a microservice that nobody wants to refactor. A BM25 score from Elasticsearch and an engagement score from Redis that were computed at different points in time against different consistency snapshots.
|
||||
|
||||
Here is the target in tidalDB:
|
||||
|
||||
```
|
||||
Application
|
||||
-> db.search(&query)
|
||||
Stage 1a: Tantivy inverted index -> BM25-scored candidates
|
||||
Stage 1b: USearch HNSW index -> ANN-scored candidates
|
||||
Stage 1c: RRF fusion -> merged candidate list
|
||||
Stage 2: Bitmap filter evaluation -> surviving candidates
|
||||
Stage 3: Signal scoring via ranking profile -> scored candidates
|
||||
Stage 4: Diversity enforcement -> reordered candidates
|
||||
Stage 5: Result assembly -> Results
|
||||
<- Results
|
||||
```
|
||||
|
||||
One function call. One process. No network boundary between text relevance, semantic similarity, and engagement signals. The data is never stale because every data source shares the same write path.
|
||||
|
||||
## What is next
|
||||
|
||||
Three pieces of work stand between the current codebase and the unified search query.
|
||||
|
||||
**Wire ANN as a first-class retrieval path.** The USearch index is integrated. The adaptive query planner is implemented. The `CandidateStrategy::Ann` variant exists. What remains is plumbing: when the executor sees `Ann { slot, limit }`, it reads the query embedding (from the `Retrieve` struct or from the user's preference vector), calls the embedding registry to find the right HNSW index, runs the adaptive planner's `execute`, and returns the results as `Vec<EntityId>`. The scoring pipeline handles the rest. This is a wiring task, not an architecture change.
|
||||
|
||||
**Integrate Tantivy as a derived index.** The research doc maps out three integration patterns. The architecture decision is made: Tantivy is a materialized view of the entity store. The integration work is: add Tantivy as a dependency, build a background indexer that writes entity metadata to Tantivy segments, implement the custom `AllScoresCollector` for BM25 extraction, and expose text search as a candidate source in the executor. Crash recovery replays from a sequence number stored in the entity store.
|
||||
|
||||
**Implement RRF fusion and the SEARCH query.** Define a `Search` query type (analogous to `Retrieve` but with text and vector query fields). Implement the `Hybrid` candidate strategy that runs text retrieval and ANN retrieval in parallel, fuses the ranked lists using RRF, and feeds the merged candidates into Stage 2. The rest of the pipeline -- filtering, signal scoring, diversity, pagination -- is already built.
|
||||
|
||||
Each piece builds on existing infrastructure. No architectural changes. No new consistency models. No new storage engines. The foundation is laid. What remains is connecting the pieces.
|
||||
|
||||
## Why the integration matters now
|
||||
|
||||
Even before the SEARCH query ships, the architectural unification has consequences.
|
||||
|
||||
When a platform team evaluates tidalDB today, they see a system where engagement signals, preference vectors, metadata indexes, and vector embeddings are co-located in a single process. They see a query pipeline that reads all of them in a single pass. They see ranking profiles that can reference any signal source without a network call. They see diversity enforcement that operates on the full scored set, not on a post-hoc splice in application code.
|
||||
|
||||
Adding text search to this system is mechanical. Adding text search to the 6-system stack is political -- because search is Elasticsearch, ranking is the ranking service, and the merge is whoever volunteered to own it last year.
|
||||
|
||||
The most expensive part of unifying search and ranking is not the code. It is the organizational decision to put them in the same system. In tidalDB, that decision was made at the schema level. Signals, profiles, entities, embeddings, and (soon) text indexes share a schema, a storage model, and a query pipeline. They are the same system because the data model says they are.
|
||||
|
||||
Search and ranking are the same question asked with different emphasis. "What matches this query for this user?" and "What should this user see right now?" differ in whether the user provided keywords. The ranking pipeline -- candidates, filters, signals, diversity -- is identical. The only variable is how candidates are sourced: by keyword, by embedding, by signal velocity, by relationship graph, or by some combination.
|
||||
|
||||
One pipeline. One set of candidates. One scoring pass. That is the design. The implementation is catching up.
|
||||
|
||||
---
|
||||
|
||||
*The RETRIEVE query executor is at [tidal/src/query/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/executor.rs). The ranking profiles are at [tidal/src/ranking/builtins.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/ranking/builtins.rs). The vector index is at [tidal/src/storage/vector/usearch_index.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/storage/vector/usearch_index.rs). The adaptive query planner is at [tidal/src/storage/vector/planner.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/storage/vector/planner.rs). The Tantivy research is at [docs/research/tantivy.md](https://github.com/orchard9/tidalDB/blob/main/docs/research/tantivy.md). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*
|
||||
@ -2,7 +2,7 @@
|
||||
title: "Signals wrote 100ms ago. The query sees them now."
|
||||
date: "2026-02-21"
|
||||
author: "Jordan Washburn"
|
||||
description: "Open a tidalDB instance, define signal types with decay and windows, write 10,000 events, read back scores that match analytical computation to six decimal places. Including after a crash."
|
||||
description: "Open a tidalDB instance, define signal types with decay and windows, write 10,000 events, read back scores that match analytical computation to under 0.1% relative error. Including after a crash."
|
||||
tags: ["signals", "durability", "rust"]
|
||||
---
|
||||
|
||||
@ -23,68 +23,41 @@ The test passes. Here it is.
|
||||
## The UAT test
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/tests/signal_api.rs
|
||||
let schema = build_schema();
|
||||
let db = TidalDb::builder()
|
||||
.ephemeral()
|
||||
.with_schema(schema)
|
||||
.open()?;
|
||||
|
||||
#[test]
|
||||
fn m1_uat_ephemeral() {
|
||||
let schema = build_schema();
|
||||
let db = TidalDb::builder()
|
||||
.ephemeral()
|
||||
.with_schema(schema)
|
||||
.open()
|
||||
.expect("open failed");
|
||||
|
||||
// Write 100 items.
|
||||
for i in 0..100_u64 {
|
||||
db.write_item(EntityId::new(i), &metadata(i))
|
||||
.expect("write_item failed");
|
||||
}
|
||||
|
||||
// Generate 10,000 signal events spread over the past 7 days.
|
||||
let now = Timestamp::now();
|
||||
let seven_days_ns: u64 = 7 * 24 * 3_600_000_000_000;
|
||||
let signal_types = ["view", "like", "skip"];
|
||||
|
||||
let mut events: Vec<(EntityId, &str, f64, Timestamp)> = Vec::with_capacity(10_000);
|
||||
for i in 0..10_000_u64 {
|
||||
let entity_id = EntityId::new(i % 100);
|
||||
let sig = signal_types[(i % 3) as usize];
|
||||
let ts = Timestamp::from_nanos(
|
||||
now.as_nanos()
|
||||
.saturating_sub(seven_days_ns)
|
||||
.saturating_add(i * (seven_days_ns / 10_000)),
|
||||
);
|
||||
events.push((entity_id, sig, 1.0, ts));
|
||||
db.signal(sig, entity_id, 1.0, ts)
|
||||
.expect("signal write failed");
|
||||
}
|
||||
|
||||
// Verify analytical decay for entity 42, signal "view".
|
||||
let now_after = Timestamp::now();
|
||||
let analytical = analytical_decay(
|
||||
&events, EntityId::new(42), "view",
|
||||
7.0 * 24.0 * 3600.0, now_after,
|
||||
);
|
||||
let actual = db
|
||||
.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.expect("read_decay_score failed")
|
||||
.unwrap_or(0.0);
|
||||
|
||||
let rel_err = (actual - analytical).abs() / analytical.abs();
|
||||
assert!(rel_err < 1e-3,
|
||||
"decay score mismatch: actual={actual:.8} analytical={analytical:.8}");
|
||||
|
||||
// Write a new signal and verify immediate visibility.
|
||||
let score_before = db.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.unwrap().unwrap_or(0.0);
|
||||
db.signal("view", EntityId::new(42), 1.0, Timestamp::now())
|
||||
.expect("signal write failed");
|
||||
let score_after = db.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.unwrap().unwrap_or(0.0);
|
||||
|
||||
assert!(score_after > score_before,
|
||||
"new signal must increase decay score: {score_before} -> {score_after}");
|
||||
// Write 100 items.
|
||||
for i in 0..100_u64 {
|
||||
db.write_item(EntityId::new(i), &metadata(i))?;
|
||||
}
|
||||
|
||||
// Generate 10,000 signal events spread over the past 7 days.
|
||||
let now = Timestamp::now();
|
||||
let seven_days_ns: u64 = 7 * 24 * 3_600_000_000_000;
|
||||
let signal_types = ["view", "like", "skip"];
|
||||
|
||||
for i in 0..10_000_u64 {
|
||||
let entity_id = EntityId::new(i % 100);
|
||||
let sig = signal_types[(i % 3) as usize];
|
||||
let ts = Timestamp::from_nanos(
|
||||
now.as_nanos()
|
||||
.saturating_sub(seven_days_ns)
|
||||
.saturating_add(i * (seven_days_ns / 10_000)),
|
||||
);
|
||||
db.signal(sig, entity_id, 1.0, ts)?;
|
||||
}
|
||||
|
||||
// Read the decay score — matches the brute-force analytical reference to < 0.1% relative error.
|
||||
let score = db.read_decay_score(EntityId::new(42), "view", 0)?;
|
||||
|
||||
// Write a new signal and read again — the new event is immediately visible.
|
||||
let score_before = db.read_decay_score(EntityId::new(42), "view", 0)?.unwrap_or(0.0);
|
||||
db.signal("view", EntityId::new(42), 1.0, Timestamp::now())?;
|
||||
let score_after = db.read_decay_score(EntityId::new(42), "view", 0)?.unwrap_or(0.0);
|
||||
// score_after > score_before — the signal is reflected without any delay
|
||||
```
|
||||
|
||||
The `analytical_decay` function is a brute-force reference. It iterates every event, computes `weight * exp(-lambda * dt)` for each one, and sums the results. The running accumulator in tidalDB produces the same answer without scanning a single raw event.
|
||||
@ -100,8 +73,6 @@ The solution is a trait boundary and a bridge.
|
||||
The `signals` module defines a `WalWriter` trait. The ledger accepts a `Box<dyn WalWriter>` at construction. It calls `append_signal()` on every write, but never knows what is on the other side. In tests, that is a `NoopWalWriter`. In production, it is a `WalHandleWriter` that forwards events to the live WAL via a channel.
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/src/db/wal_bridge.rs
|
||||
|
||||
pub struct WalHandleWriter {
|
||||
sender: WalSender,
|
||||
last_seq: Arc<AtomicU64>,
|
||||
@ -138,45 +109,34 @@ This is three types, one trait, and one channel. It took longer to get right tha
|
||||
The persistent-mode test writes 100 signals, reads the decay score, closes the database, reopens it, and asserts the recovered score matches within 0.1%:
|
||||
|
||||
```rust
|
||||
// Adapted from tidal/tests/signal_api.rs
|
||||
let entity = EntityId::new(42);
|
||||
|
||||
#[test]
|
||||
fn m1_uat_persistent_crash_recovery() {
|
||||
let tmp = tempfile::tempdir().expect("tempdir failed");
|
||||
let entity = EntityId::new(42);
|
||||
let score_before;
|
||||
// Session 1: write signals, read score, close.
|
||||
let score_before = {
|
||||
let db = TidalDb::builder()
|
||||
.with_data_dir("/var/lib/tidaldb")
|
||||
.with_schema(build_schema())
|
||||
.open()?;
|
||||
|
||||
// First session: write signals
|
||||
{
|
||||
let db = TidalDb::builder()
|
||||
.with_data_dir(tmp.path())
|
||||
.with_schema(build_schema())
|
||||
.open()
|
||||
.expect("open failed");
|
||||
|
||||
for i in 0..100_u64 {
|
||||
let ts = Timestamp::from_nanos(/* spread over 7 days */);
|
||||
db.signal("view", entity, 1.0, ts)
|
||||
.expect("signal write failed");
|
||||
}
|
||||
score_before = db.read_decay_score(entity, "view", 0)
|
||||
.expect("read failed").expect("must have score");
|
||||
db.close().expect("close failed");
|
||||
for i in 0..100_u64 {
|
||||
let ts = Timestamp::from_nanos(/* spread over 7 days */);
|
||||
db.signal("view", entity, 1.0, ts)?;
|
||||
}
|
||||
let score = db.read_decay_score(entity, "view", 0)?.unwrap();
|
||||
db.close()?;
|
||||
score
|
||||
};
|
||||
|
||||
// Second session: verify state survived
|
||||
{
|
||||
let db = TidalDb::builder()
|
||||
.with_data_dir(tmp.path())
|
||||
.with_schema(build_schema())
|
||||
.open()
|
||||
.expect("open failed");
|
||||
// Session 2: reopen the same data directory, verify state survived.
|
||||
{
|
||||
let db = TidalDb::builder()
|
||||
.with_data_dir("/var/lib/tidaldb")
|
||||
.with_schema(build_schema())
|
||||
.open()?;
|
||||
|
||||
let score_after = db.read_decay_score(entity, "view", 0)
|
||||
.expect("read failed").expect("must have score");
|
||||
let rel_err = (score_after - score_before).abs() / score_before;
|
||||
assert!(rel_err < 0.001); // Under 0.1% deviation
|
||||
}
|
||||
let score_after = db.read_decay_score(entity, "view", 0)?.unwrap();
|
||||
// score_after ≈ score_before — under 0.1% relative deviation.
|
||||
// The checkpoint + WAL replay path restores exact in-memory state.
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
@ -109,15 +109,16 @@ A prefix scan on `entity_prefix(id)` returns everything the database knows about
|
||||
|
||||
The logging engine isolated storage per tenant. Separate directories. Separate files. Separate quotas. The philosophy: a noisy neighbor should not degrade a quiet one.
|
||||
|
||||
tidalDB isolates per entity kind. Items, Users, and Creators each get their own fjall keyspace in their own directory:
|
||||
tidalDB isolates per entity kind. A single fjall database opens at one path, and Items, Users, and Creators each get their own keyspace within it:
|
||||
|
||||
```
|
||||
{base}/items/ # fjall keyspace for item entities
|
||||
{base}/users/ # fjall keyspace for user entities
|
||||
{base}/creators/ # fjall keyspace for creator entities
|
||||
{base}/ # single fjall database
|
||||
├── items/ # keyspace for item entities
|
||||
├── users/ # keyspace for user entities
|
||||
└── creators/ # keyspace for creator entities
|
||||
```
|
||||
|
||||
A burst of signal events for a viral item does not slow down user profile reads. The LSM-tree compaction for the items keyspace runs independently. The I/O pressure stays contained.
|
||||
They share one database instance, but compaction runs independently per keyspace. A burst of signal events for a viral item does not slow down user profile reads. The I/O pressure stays contained.
|
||||
|
||||
### Append-only core with mutable views
|
||||
|
||||
|
||||
@ -21,18 +21,16 @@ tidalDB has five core concepts. Everything else follows from them.
|
||||
**Signals** are typed, timestamped event streams with decay and velocity built in. You declare a signal type once:
|
||||
|
||||
```rust
|
||||
db.define_signal(SignalDef {
|
||||
name: "view",
|
||||
target: EntityKind::Item,
|
||||
decay: Decay::Exponential { half_life: Duration::days(7) },
|
||||
windows: vec![
|
||||
Window::hours(1),
|
||||
Window::hours(24),
|
||||
Window::days(7),
|
||||
Window::all_time(),
|
||||
],
|
||||
velocity: true,
|
||||
})?;
|
||||
use std::time::Duration;
|
||||
use tidaldb::schema::{DecaySpec, EntityKind, SchemaBuilder, Window};
|
||||
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder.signal("view", EntityKind::Item,
|
||||
DecaySpec::Exponential { half_life: Duration::from_secs(7 * 24 * 3600) })
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays, Window::AllTime])
|
||||
.velocity(true)
|
||||
.add();
|
||||
let schema = builder.build()?;
|
||||
```
|
||||
|
||||
That declaration tells the database everything it needs. When a view event arrives, the database maintains windowed counts, computes velocity, and applies exponential decay — all at write time, all O(1). You never compute `trending_score = views / (age_hours + 2)^1.8` in application code. You never update a stale float field on a cron schedule. The database does this natively, and it does it correctly.
|
||||
@ -56,27 +54,28 @@ DIVERSITY max_per_creator:2, format_mix:true
|
||||
LIMIT 50
|
||||
```
|
||||
|
||||
Today queries are built via the Rust builder API (`Retrieve::builder()`); a parsed text query language is planned for a future milestone.
|
||||
|
||||
One call. No network hops between subsystems. No merging results from five data sources. The database handles retrieval strategy (ANN, BM25, graph walk, full scan), applies hard filters, scores candidates against live signal state, enforces diversity constraints, and returns a ranked list. The agent gets the list along with a session snapshot (top signals, reward velocity, last tool it used) so it can explain its answer.
|
||||
|
||||
## The feedback loop
|
||||
|
||||
This is the part that makes the architecture honest.
|
||||
|
||||
When a user likes an item, the database atomically updates the item's signal ledger, the user's preference vector, and the user-to-creator relationship weight. All in the same write transaction. The next ranking query — even 100ms later — reflects the updated state.
|
||||
When a user likes an item, the database updates the item's signal ledger, the user's preference vector, and the user-to-creator relationship weight in the same call. The next ranking query — even 100ms later — reflects the updated state.
|
||||
|
||||
```rust
|
||||
db.signal(Signal {
|
||||
kind: "like",
|
||||
item: "item_abc",
|
||||
user: "user_123",
|
||||
session: Some("session_xyz"),
|
||||
timestamp: Utc::now(),
|
||||
weight: 1.0,
|
||||
metadata: Some(json!({ "agent": "assistant", "tool": "planner" })),
|
||||
})?;
|
||||
// Item-level signal
|
||||
db.signal("like", EntityId::new(42), 1.0, Timestamp::now())?;
|
||||
|
||||
// With user context (updates interaction weight and preference vector)
|
||||
db.signal_with_context(
|
||||
"like", EntityId::new(42), 1.0, Timestamp::now(),
|
||||
Some(user_id), Some(creator_id),
|
||||
)?;
|
||||
```
|
||||
|
||||
There is no event bus between the engagement and the ranking update. No consumer lag. No cache to invalidate. The write path and the read path are one system. A user who skips three items in a row sees the fourth query adjust — not after a batch pipeline runs, not after a feature store syncs. Now.
|
||||
There is no event bus between the engagement and the ranking update. No consumer lag. No cache to invalidate. The write path and the read path are one system. A user who skips three items in a row sees the fourth query adjust — the skips add to the user's exclusion bitmap, and the next retrieve filters them out. Not after a batch pipeline runs, not after a feature store syncs. Now.
|
||||
|
||||
## Where we are deliberately narrow
|
||||
|
||||
@ -94,9 +93,17 @@ Our wedge is narrower and opinionated:
|
||||
|
||||
tidalDB is early. I want to be direct about what exists today and what does not.
|
||||
|
||||
**Built:** Schema system with entity, signal, and profile definitions. Write-ahead log with segment rotation, checksummed records, BLAKE3 deduplication, and crash recovery. Storage engine backed by fjall with trait abstraction, key encoding, and batch writes. Signal ledger with forward-decay scoring, hot-path state, and warm-path persistence.
|
||||
**M1 — Signal engine.** Schema system with entity, signal, and profile definitions. Write-ahead log with segment rotation, checksummed records, BLAKE3 deduplication, and crash recovery. Storage engine backed by fjall with trait abstraction, key encoding, and batch writes. Signal ledger with forward-decay scoring, hot-path state, and warm-path persistence.
|
||||
|
||||
**Next:** Query engine — the RETRIEVE/SEARCH/SUGGEST operations with the execution pipeline described above. Then session-aware APIs, agent policies, vector search (USearch), text search (Tantivy), and hybrid fusion. Then the full query surface with all sort modes and diversity enforcement.
|
||||
**M2 — Query and retrieval.** RETRIEVE query with a five-stage execution pipeline: candidate generation, filter evaluation, signal scoring, diversity enforcement, result assembly. Vector index (USearch HNSW), bitmap and range indexes, 15 built-in ranking profiles.
|
||||
|
||||
**M3 — Personalized ranking.** FOR USER context in queries. Relationship graph (follows, blocks). Interaction ledger with lazy decay. Preference vectors blended from positive engagement signals. Full feedback loop from signal write to ranking adjustment.
|
||||
|
||||
**M4 — Entity system and sessions.** User and Creator entities with metadata and signal ledgers. Agent sessions with identity binding, policy enforcement, and session-scoped signals. Negative signal classification (skip, hide, dislike, block) with hard-negative exclusion bitmaps. Cold-start fallback profiles.
|
||||
|
||||
All four milestones are complete with 661+ passing tests.
|
||||
|
||||
**Next (M5):** Hybrid search — RRF fusion across text (Tantivy BM25) and vector retrieval, the SEARCH executor, and a parsed query language.
|
||||
|
||||
The foundation is Rust, single-node, embeddable. The storage layer is designed for horizontal scaling later — key encoding and storage isolation are partition-ready — but single-node correctness comes first. This is how we differentiate from Vespa, Milvus, or any search-first system: tidalDB embeds inside your agent runtime, exposes a declarative query+session API, and guarantees every signal the agent writes is visible on the next read without a distributed hop.
|
||||
|
||||
|
||||
@ -20,6 +20,8 @@ rand = "0.9"
|
||||
roaring = "0.10"
|
||||
serde = { version = "1", features = ["derive"] }
|
||||
serde_json = "1"
|
||||
thiserror = "2"
|
||||
tantivy = "0.22"
|
||||
tempfile = { version = "3", optional = true }
|
||||
tracing = "0.1"
|
||||
usearch = "2.24.0"
|
||||
@ -94,3 +96,19 @@ harness = false
|
||||
[[bench]]
|
||||
name = "query"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "session"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "text_index"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "fusion"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "search"
|
||||
harness = false
|
||||
|
||||
72
tidal/benches/fusion.rs
Normal file
72
tidal/benches/fusion.rs
Normal file
@ -0,0 +1,72 @@
|
||||
#![allow(clippy::unwrap_used)]
|
||||
|
||||
use criterion::{Criterion, black_box, criterion_group, criterion_main};
|
||||
use tidaldb::query::{HybridFusion, RetrievalMode, route_results};
|
||||
use tidaldb::schema::EntityId;
|
||||
|
||||
fn make_bm25(n: u64) -> Vec<(EntityId, f32)> {
|
||||
(0..n).map(|i| (EntityId::new(i), (n - i) as f32)).collect()
|
||||
}
|
||||
|
||||
fn make_ann(n: u64) -> Vec<(EntityId, f32)> {
|
||||
(500..500 + n)
|
||||
.enumerate()
|
||||
.map(|(i, id)| (EntityId::new(id), i as f32 * 0.001))
|
||||
.collect()
|
||||
}
|
||||
|
||||
fn bench_rrf_fuse_1k(c: &mut Criterion) {
|
||||
let bm25 = make_bm25(1_000);
|
||||
let ann = make_ann(1_000);
|
||||
let fusion = HybridFusion::new();
|
||||
|
||||
c.bench_function("rrf_fuse_1k_per_list", |b| {
|
||||
b.iter(|| {
|
||||
let results = fusion.fuse(black_box(&bm25), black_box(&ann));
|
||||
black_box(results)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
fn bench_route_hybrid(c: &mut Criterion) {
|
||||
let bm25 = make_bm25(1_000);
|
||||
let ann = make_ann(1_000);
|
||||
let fusion = HybridFusion::new();
|
||||
|
||||
c.bench_function("route_hybrid_1k", |b| {
|
||||
b.iter(|| {
|
||||
let r = route_results(
|
||||
black_box(RetrievalMode::Hybrid),
|
||||
black_box(&bm25),
|
||||
black_box(&ann),
|
||||
black_box(&fusion),
|
||||
);
|
||||
black_box(r)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
fn bench_route_text_only(c: &mut Criterion) {
|
||||
let bm25 = make_bm25(1_000);
|
||||
let fusion = HybridFusion::new();
|
||||
|
||||
c.bench_function("route_text_only_1k", |b| {
|
||||
b.iter(|| {
|
||||
let r = route_results(
|
||||
black_box(RetrievalMode::TextOnly),
|
||||
black_box(&bm25),
|
||||
&[],
|
||||
black_box(&fusion),
|
||||
);
|
||||
black_box(r)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
criterion_group!(
|
||||
fusion_benches,
|
||||
bench_rrf_fuse_1k,
|
||||
bench_route_hybrid,
|
||||
bench_route_text_only
|
||||
);
|
||||
criterion_main!(fusion_benches);
|
||||
129
tidal/benches/search.rs
Normal file
129
tidal/benches/search.rs
Normal file
@ -0,0 +1,129 @@
|
||||
#![allow(clippy::unwrap_used)]
|
||||
//! Criterion benchmarks for the SEARCH query pipeline.
|
||||
//!
|
||||
//! Measures end-to-end `db.search()` latency at 10K items to validate the
|
||||
//! < 50ms target specified in the m5p3 phase acceptance criteria.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::time::Duration;
|
||||
|
||||
use criterion::{Criterion, black_box, criterion_group, criterion_main};
|
||||
use tidaldb::TidalDb;
|
||||
use tidaldb::query::search::Search;
|
||||
use tidaldb::schema::{
|
||||
DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldDef, TextFieldType, Timestamp, Window,
|
||||
};
|
||||
|
||||
fn search_schema() -> tidaldb::schema::Schema {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::TwentyFourHours])
|
||||
.velocity(false)
|
||||
.add();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"like",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(30 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::TwentyFourHours])
|
||||
.velocity(false)
|
||||
.add();
|
||||
builder.text_field("title", TextFieldType::Text);
|
||||
builder.text_field("description", TextFieldType::Text);
|
||||
builder.text_field("category", TextFieldType::Keyword);
|
||||
builder.build().unwrap()
|
||||
}
|
||||
|
||||
/// Build a TidalDb with N items indexed for text search.
|
||||
fn make_db(n: u64) -> TidalDb {
|
||||
let db = TidalDb::builder()
|
||||
.ephemeral()
|
||||
.with_schema(search_schema())
|
||||
.open()
|
||||
.unwrap();
|
||||
|
||||
let ts = Timestamp::now();
|
||||
for i in 0..n {
|
||||
let mut meta = HashMap::new();
|
||||
// Vary titles to produce realistic IDF scores.
|
||||
meta.insert(
|
||||
"title".to_string(),
|
||||
format!("Rust tutorial {i} async concurrency programming"),
|
||||
);
|
||||
meta.insert(
|
||||
"description".to_string(),
|
||||
"Learn Rust with practical examples and real projects.".to_string(),
|
||||
);
|
||||
let cat = if i % 3 == 0 {
|
||||
"programming"
|
||||
} else if i % 3 == 1 {
|
||||
"systems"
|
||||
} else {
|
||||
"web"
|
||||
};
|
||||
meta.insert("category".to_string(), cat.to_string());
|
||||
db.write_item_with_metadata(EntityId::new(i), &meta)
|
||||
.unwrap();
|
||||
|
||||
// Add some signals to make profile scoring non-trivial.
|
||||
if i % 5 == 0 {
|
||||
db.signal("view", EntityId::new(i), 1.0, ts).unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
// Wait for the background text syncer to commit all pending documents
|
||||
// (syncer commits every 1_000 items or 2s; 10K items = 10 batch commits).
|
||||
// Then reload the reader so the searcher sees all committed documents.
|
||||
std::thread::sleep(std::time::Duration::from_millis(500));
|
||||
db.reload_text_index().unwrap();
|
||||
db
|
||||
}
|
||||
|
||||
/// Benchmark: `db.search()` with a text-only query at 10K items.
|
||||
///
|
||||
/// Target: < 50ms.
|
||||
fn bench_search_text_10k(c: &mut Criterion) {
|
||||
let db = make_db(10_000);
|
||||
let query = Search::builder()
|
||||
.query("Rust async")
|
||||
.limit(20)
|
||||
.build()
|
||||
.unwrap();
|
||||
|
||||
c.bench_function("search_text_10k", |b| {
|
||||
b.iter(|| db.search(black_box(&query)).unwrap());
|
||||
});
|
||||
}
|
||||
|
||||
/// Benchmark: `db.search()` with a keyword-scoped query at 10K items.
|
||||
///
|
||||
/// Target: < 50ms.
|
||||
fn bench_search_keyword_10k(c: &mut Criterion) {
|
||||
let db = make_db(10_000);
|
||||
let query = Search::builder()
|
||||
.query("category:programming")
|
||||
.limit(20)
|
||||
.build()
|
||||
.unwrap();
|
||||
|
||||
c.bench_function("search_keyword_10k", |b| {
|
||||
b.iter(|| db.search(black_box(&query)).unwrap());
|
||||
});
|
||||
}
|
||||
|
||||
criterion_group!(
|
||||
search_benches,
|
||||
bench_search_text_10k,
|
||||
bench_search_keyword_10k
|
||||
);
|
||||
criterion_main!(search_benches);
|
||||
164
tidal/benches/session.rs
Normal file
164
tidal/benches/session.rs
Normal file
@ -0,0 +1,164 @@
|
||||
#![allow(clippy::unwrap_used)]
|
||||
//! Benchmarks for the session layer: signal writes, snapshot reads,
|
||||
//! and retrieve queries with active session context.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::time::Duration;
|
||||
|
||||
use criterion::{Criterion, black_box, criterion_group, criterion_main};
|
||||
use tidaldb::TidalDb;
|
||||
use tidaldb::query::retrieve::{ProfileRef, RetrieveBuilder};
|
||||
use tidaldb::schema::{
|
||||
AgentPolicy, DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window,
|
||||
};
|
||||
|
||||
fn session_schema() -> tidaldb::schema::Schema {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"reward",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours])
|
||||
.velocity(false)
|
||||
.add();
|
||||
let _ = builder.session_policy(
|
||||
"bench_policy",
|
||||
AgentPolicy {
|
||||
allowed_signals: vec!["reward".to_string()],
|
||||
denied_signals: vec![],
|
||||
max_session_duration: Duration::from_secs(3600),
|
||||
max_signals_per_session: 1_000_000,
|
||||
},
|
||||
);
|
||||
builder.build().unwrap()
|
||||
}
|
||||
|
||||
/// Benchmark: `session_signal()` write throughput.
|
||||
/// Target: < 200µs per call.
|
||||
fn bench_session_signal(c: &mut Criterion) {
|
||||
let db = TidalDb::builder()
|
||||
.ephemeral()
|
||||
.with_schema(session_schema())
|
||||
.open()
|
||||
.unwrap();
|
||||
|
||||
let mut meta = HashMap::new();
|
||||
meta.insert("title".to_string(), "bench-item".to_string());
|
||||
db.write_item_with_metadata(EntityId::new(1), &meta)
|
||||
.unwrap();
|
||||
|
||||
let handle = db
|
||||
.start_session(1, "bench-agent", "bench_policy", HashMap::new())
|
||||
.unwrap();
|
||||
let entity = EntityId::new(1);
|
||||
let ts = Timestamp::now();
|
||||
|
||||
c.bench_function("session_signal", |b| {
|
||||
b.iter(|| {
|
||||
db.session_signal(
|
||||
black_box(&handle),
|
||||
black_box("reward"),
|
||||
black_box(entity),
|
||||
black_box(1.0_f64),
|
||||
black_box(ts),
|
||||
black_box(None),
|
||||
)
|
||||
.unwrap();
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
/// Benchmark: `session_snapshot()` on an active session with 100 signals.
|
||||
/// Target: < 50µs per call.
|
||||
fn bench_session_snapshot(c: &mut Criterion) {
|
||||
let db = TidalDb::builder()
|
||||
.ephemeral()
|
||||
.with_schema(session_schema())
|
||||
.open()
|
||||
.unwrap();
|
||||
|
||||
let mut meta = HashMap::new();
|
||||
meta.insert("title".to_string(), "bench-item".to_string());
|
||||
db.write_item_with_metadata(EntityId::new(1), &meta)
|
||||
.unwrap();
|
||||
|
||||
let handle = db
|
||||
.start_session(2, "bench-agent", "bench_policy", HashMap::new())
|
||||
.unwrap();
|
||||
let session_id = handle.id;
|
||||
let entity = EntityId::new(1);
|
||||
let ts = Timestamp::now();
|
||||
|
||||
// Pre-load 100 signals.
|
||||
for _ in 0..100 {
|
||||
db.session_signal(&handle, "reward", entity, 1.0, ts, None)
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
c.bench_function("session_snapshot_100_signals", |b| {
|
||||
b.iter(|| db.session_snapshot(black_box(session_id)).unwrap());
|
||||
});
|
||||
}
|
||||
|
||||
/// Benchmark: `retrieve()` with FOR SESSION vs without, over 1K items.
|
||||
/// Measures the overhead of session context in ranking.
|
||||
/// Target: < 5ms overhead vs without session.
|
||||
fn bench_retrieve_with_session(c: &mut Criterion) {
|
||||
let db = TidalDb::builder()
|
||||
.ephemeral()
|
||||
.with_schema(session_schema())
|
||||
.open()
|
||||
.unwrap();
|
||||
|
||||
// Write 1K items.
|
||||
for i in 1u64..=1000 {
|
||||
let mut meta = HashMap::new();
|
||||
meta.insert("title".to_string(), format!("item-{i}"));
|
||||
db.write_item_with_metadata(EntityId::new(i), &meta)
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
let handle = db
|
||||
.start_session(3, "bench-agent", "bench_policy", HashMap::new())
|
||||
.unwrap();
|
||||
let session_id = handle.id;
|
||||
let ts = Timestamp::now();
|
||||
|
||||
// Signal 10 entities to create a non-trivial session context.
|
||||
for i in 1u64..=10 {
|
||||
db.session_signal(&handle, "reward", EntityId::new(i), 1.0, ts, None)
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
let query_with = RetrieveBuilder::new(EntityKind::Item, ProfileRef::new("hot"))
|
||||
.limit(20)
|
||||
.for_session(session_id)
|
||||
.build()
|
||||
.unwrap();
|
||||
|
||||
let query_without = RetrieveBuilder::new(EntityKind::Item, ProfileRef::new("hot"))
|
||||
.limit(20)
|
||||
.build()
|
||||
.unwrap();
|
||||
|
||||
let mut group = c.benchmark_group("retrieve_1k_items");
|
||||
group.bench_function("without_session", |b| {
|
||||
b.iter(|| db.retrieve(black_box(&query_without)).unwrap());
|
||||
});
|
||||
group.bench_function("with_session", |b| {
|
||||
b.iter(|| db.retrieve(black_box(&query_with)).unwrap());
|
||||
});
|
||||
group.finish();
|
||||
}
|
||||
|
||||
criterion_group!(
|
||||
benches,
|
||||
bench_session_signal,
|
||||
bench_session_snapshot,
|
||||
bench_retrieve_with_session
|
||||
);
|
||||
criterion_main!(benches);
|
||||
135
tidal/benches/text_index.rs
Normal file
135
tidal/benches/text_index.rs
Normal file
@ -0,0 +1,135 @@
|
||||
#![allow(clippy::unwrap_used)]
|
||||
//! Criterion benchmarks for the BM25 text index pipeline.
|
||||
//!
|
||||
//! Measures BM25 query latency at various corpus sizes to validate the
|
||||
//! < 10ms target at 10K documents specified in the m5p1 phase acceptance criteria.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use criterion::{Criterion, black_box, criterion_group, criterion_main};
|
||||
use tidaldb::schema::{EntityId, TextFieldDef, TextFieldType};
|
||||
use tidaldb::text::{AllScoresCollector, TextIndex};
|
||||
|
||||
fn make_index(n: u64) -> TextIndex {
|
||||
let fields = vec![
|
||||
TextFieldDef {
|
||||
key: "title".into(),
|
||||
field_type: TextFieldType::Text,
|
||||
},
|
||||
TextFieldDef {
|
||||
key: "description".into(),
|
||||
field_type: TextFieldType::Text,
|
||||
},
|
||||
TextFieldDef {
|
||||
key: "category".into(),
|
||||
field_type: TextFieldType::Keyword,
|
||||
},
|
||||
];
|
||||
|
||||
let idx = TextIndex::ephemeral(&fields).unwrap();
|
||||
let mut w = idx.writer_guard().unwrap();
|
||||
|
||||
for i in 0..n {
|
||||
let mut meta = HashMap::new();
|
||||
// Vary titles so BM25 IDF scoring is meaningful.
|
||||
meta.insert(
|
||||
"title".into(),
|
||||
format!("Rust tutorial {i} async concurrency"),
|
||||
);
|
||||
meta.insert(
|
||||
"description".into(),
|
||||
"Learn Rust programming with practical examples and real projects.".into(),
|
||||
);
|
||||
// Alternate categories to test keyword field throughput.
|
||||
let cat = if i % 2 == 0 { "programming" } else { "systems" };
|
||||
meta.insert("category".into(), cat.into());
|
||||
w.index_item(EntityId::new(i), &meta).unwrap();
|
||||
}
|
||||
|
||||
w.commit(n).unwrap();
|
||||
drop(w);
|
||||
|
||||
idx.reload_reader().unwrap();
|
||||
idx
|
||||
}
|
||||
|
||||
/// BM25 bare-term query at 1K docs.
|
||||
fn bench_bm25_1k(c: &mut Criterion) {
|
||||
let idx = make_index(1_000);
|
||||
let searcher = idx.searcher();
|
||||
let parser = idx.query_parser();
|
||||
let collector = AllScoresCollector {
|
||||
entity_id_field: idx.fields().entity_id,
|
||||
};
|
||||
|
||||
c.bench_function("bm25_query_1k_docs", |b| {
|
||||
b.iter(|| {
|
||||
let q = parser.parse(black_box("Rust async")).unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
black_box(results)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
/// BM25 bare-term query at 10K docs — must complete in < 10ms.
|
||||
fn bench_bm25_10k(c: &mut Criterion) {
|
||||
let idx = make_index(10_000);
|
||||
let searcher = idx.searcher();
|
||||
let parser = idx.query_parser();
|
||||
let collector = AllScoresCollector {
|
||||
entity_id_field: idx.fields().entity_id,
|
||||
};
|
||||
|
||||
c.bench_function("bm25_query_10k_docs", |b| {
|
||||
b.iter(|| {
|
||||
let q = parser.parse(black_box("Rust async")).unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
black_box(results)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
/// BM25 exact-phrase query at 10K docs.
|
||||
fn bench_bm25_phrase_10k(c: &mut Criterion) {
|
||||
let idx = make_index(10_000);
|
||||
let searcher = idx.searcher();
|
||||
let parser = idx.query_parser();
|
||||
let collector = AllScoresCollector {
|
||||
entity_id_field: idx.fields().entity_id,
|
||||
};
|
||||
|
||||
c.bench_function("bm25_phrase_10k_docs", |b| {
|
||||
b.iter(|| {
|
||||
let q = parser.parse(black_box("\"Rust programming\"")).unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
black_box(results)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
/// BM25 keyword field-scoped query at 10K docs.
|
||||
fn bench_bm25_keyword_10k(c: &mut Criterion) {
|
||||
let idx = make_index(10_000);
|
||||
let searcher = idx.searcher();
|
||||
let parser = idx.query_parser();
|
||||
let collector = AllScoresCollector {
|
||||
entity_id_field: idx.fields().entity_id,
|
||||
};
|
||||
|
||||
c.bench_function("bm25_keyword_10k_docs", |b| {
|
||||
b.iter(|| {
|
||||
let q = parser.parse(black_box("category:programming")).unwrap();
|
||||
let results = searcher.search(q.as_ref(), &collector).unwrap();
|
||||
black_box(results)
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
criterion_group!(
|
||||
bm25_benches,
|
||||
bench_bm25_1k,
|
||||
bench_bm25_10k,
|
||||
bench_bm25_phrase_10k,
|
||||
bench_bm25_keyword_10k
|
||||
);
|
||||
criterion_main!(bm25_benches);
|
||||
@ -237,13 +237,18 @@ impl TidalDbBuilder {
|
||||
// Wire in storage, WAL, signal ledger, and M2 indexes.
|
||||
let result = TidalDb::open_with_schema(&self.config, schema)?;
|
||||
|
||||
Ok(TidalDb::from_parts(
|
||||
let db = TidalDb::from_parts(
|
||||
self.config,
|
||||
metrics,
|
||||
#[cfg(feature = "metrics")]
|
||||
metrics_handle,
|
||||
result,
|
||||
))
|
||||
);
|
||||
|
||||
// Restore previously archived session snapshots from storage.
|
||||
db.restore_sessions();
|
||||
|
||||
Ok(db)
|
||||
} else {
|
||||
// M0 compatibility mode: no storage, no ledger, no WAL.
|
||||
Ok(TidalDb::from_config(
|
||||
|
||||
@ -1,4 +1,3 @@
|
||||
use std::fmt;
|
||||
use std::path::PathBuf;
|
||||
|
||||
/// How tidalDB stores data.
|
||||
@ -76,32 +75,19 @@ impl Default for Config {
|
||||
/// let err = ConfigError::MissingDataDir;
|
||||
/// assert!(err.to_string().contains("data directory"));
|
||||
/// ```
|
||||
#[derive(Debug)]
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum ConfigError {
|
||||
/// Persistent mode was selected but no data directory was provided.
|
||||
#[error("persistent mode requires a data directory")]
|
||||
MissingDataDir,
|
||||
/// A directory path was specified but does not exist on the filesystem.
|
||||
#[error("directory does not exist: {}", path.display())]
|
||||
DirectoryNotFound { path: PathBuf },
|
||||
/// A directory exists but the process does not have write permission.
|
||||
#[error("directory is not writable: {}", path.display())]
|
||||
NotWritable { path: PathBuf },
|
||||
}
|
||||
|
||||
impl fmt::Display for ConfigError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
match self {
|
||||
Self::MissingDataDir => f.write_str("persistent mode requires a data directory"),
|
||||
Self::DirectoryNotFound { path } => {
|
||||
write!(f, "directory does not exist: {}", path.display())
|
||||
}
|
||||
Self::NotWritable { path } => {
|
||||
write!(f, "directory is not writable: {}", path.display())
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for ConfigError {}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
167
tidal/src/db/creators.rs
Normal file
167
tidal/src/db/creators.rs
Normal file
@ -0,0 +1,167 @@
|
||||
//! Creator entity write/read operations on `TidalDb`.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::schema::{EntityId, EntityKind, TidalError};
|
||||
use crate::storage::vector::registry::{EmbeddingSlotState, EmbeddingSource, HnswParams};
|
||||
use crate::storage::vector::{
|
||||
BruteForceIndex, QuantizationLevel, VectorIndexConfig, deserialize_embedding, insert_embedding,
|
||||
};
|
||||
use crate::storage::{Tag, encode_key};
|
||||
|
||||
use super::TidalDb;
|
||||
|
||||
impl TidalDb {
|
||||
/// Write (or overwrite) a creator entity.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn write_creator(
|
||||
&self,
|
||||
id: EntityId,
|
||||
metadata: &HashMap<String, String>,
|
||||
) -> crate::Result<()> {
|
||||
let storage = self.storage()?;
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
let value = crate::entities::serialize_entity(None, metadata);
|
||||
storage
|
||||
.creators_engine()
|
||||
.put(&key, &value)
|
||||
.map_err(TidalError::from)?;
|
||||
|
||||
// Enqueue for creator text index (best-effort).
|
||||
if let Ok(guard) = self.creator_text_tx.lock()
|
||||
&& let Some(tx) = guard.as_ref()
|
||||
{
|
||||
let _ = tx.send(crate::text::PendingWrite {
|
||||
entity_id: id,
|
||||
metadata: metadata.clone(),
|
||||
seq: 0,
|
||||
deleted: false,
|
||||
});
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Read creator metadata for a given entity ID.
|
||||
///
|
||||
/// Returns `None` if the creator does not exist in storage.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn get_creator_metadata(
|
||||
&self,
|
||||
id: EntityId,
|
||||
) -> crate::Result<Option<HashMap<String, String>>> {
|
||||
let storage = self.storage()?;
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
match storage.creators_engine().get(&key) {
|
||||
Ok(Some(bytes)) => {
|
||||
let (_emb, meta) = crate::entities::deserialize_entity(&bytes);
|
||||
Ok(Some(meta))
|
||||
}
|
||||
Ok(None) => Ok(None),
|
||||
Err(e) => Err(TidalError::from(e)),
|
||||
}
|
||||
}
|
||||
|
||||
/// Write (or overwrite) a pre-computed embedding for a creator entity.
|
||||
///
|
||||
/// L2-normalizes the embedding, stores it in the creator entity store (source
|
||||
/// of truth), and inserts it into the `(EntityKind::Creator, "content")` HNSW
|
||||
/// slot. The slot is auto-registered if not yet present, using
|
||||
/// `embedding.len()` as the dimension count.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired or lock is poisoned.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
/// - `TidalError::Internal` if the embedding has zero norm.
|
||||
#[allow(clippy::significant_drop_tightening)] // lock must be held across insert_embedding call
|
||||
pub fn write_creator_embedding(&self, id: EntityId, embedding: &[f32]) -> crate::Result<()> {
|
||||
let storage = self.storage()?;
|
||||
|
||||
// Auto-register the creator "content" slot if absent.
|
||||
{
|
||||
let mut registry = self.embedding_registry.write().map_err(|_| {
|
||||
TidalError::Internal("embedding_registry write lock poisoned".into())
|
||||
})?;
|
||||
if registry.get(EntityKind::Creator, "content").is_none() {
|
||||
let config = VectorIndexConfig {
|
||||
dimensions: embedding.len(),
|
||||
..VectorIndexConfig::default()
|
||||
};
|
||||
let state = EmbeddingSlotState {
|
||||
index: Box::new(BruteForceIndex::new(config)),
|
||||
dimensions: embedding.len(),
|
||||
quantization: QuantizationLevel::F32,
|
||||
source: EmbeddingSource::External,
|
||||
params: HnswParams::default(),
|
||||
};
|
||||
registry
|
||||
.register(EntityKind::Creator, "content".to_string(), state)
|
||||
.map_err(|e| TidalError::Internal(format!("slot registration failed: {e}")))?;
|
||||
}
|
||||
}
|
||||
|
||||
// Insert into the HNSW index and store in entity engine.
|
||||
// Read the slot dimensions first, dropping the lock before the storage write.
|
||||
let dimensions = {
|
||||
let registry = self.embedding_registry.read().map_err(|_| {
|
||||
TidalError::Internal("embedding_registry read lock poisoned".into())
|
||||
})?;
|
||||
registry
|
||||
.get(EntityKind::Creator, "content")
|
||||
.ok_or_else(|| TidalError::Internal("creator content slot missing".into()))?
|
||||
.dimensions
|
||||
};
|
||||
// Re-acquire read lock to get the index reference, perform the insert,
|
||||
// then drop the lock.
|
||||
{
|
||||
let registry = self.embedding_registry.read().map_err(|_| {
|
||||
TidalError::Internal("embedding_registry read lock poisoned".into())
|
||||
})?;
|
||||
let slot = registry
|
||||
.get(EntityKind::Creator, "content")
|
||||
.ok_or_else(|| TidalError::Internal("creator content slot missing".into()))?;
|
||||
insert_embedding(
|
||||
id,
|
||||
"content",
|
||||
embedding,
|
||||
dimensions,
|
||||
slot.index.as_ref(),
|
||||
storage.creators_engine(),
|
||||
)
|
||||
.map_err(|e| TidalError::Internal(format!("write_creator_embedding: {e}")))?;
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Read a stored creator embedding for a given entity ID.
|
||||
///
|
||||
/// Returns `None` if no embedding has been written for this creator.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn read_creator_embedding(&self, id: EntityId) -> crate::Result<Option<Vec<f32>>> {
|
||||
let storage = self.storage()?;
|
||||
let key = crate::storage::vector::embedding_store_key(id, "content");
|
||||
match storage.creators_engine().get(&key) {
|
||||
Ok(Some(bytes)) => {
|
||||
let v = deserialize_embedding(&bytes)
|
||||
.map_err(|e| TidalError::Internal(format!("read_creator_embedding: {e}")))?;
|
||||
Ok(Some(v))
|
||||
}
|
||||
Ok(None) => Ok(None),
|
||||
Err(e) => Err(TidalError::from(e)),
|
||||
}
|
||||
}
|
||||
}
|
||||
306
tidal/src/db/items.rs
Normal file
306
tidal/src/db/items.rs
Normal file
@ -0,0 +1,306 @@
|
||||
//! Item write/read operations and text index management on `TidalDb`.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::schema::{EntityId, EntityKind, TidalError, Timestamp};
|
||||
use crate::storage::vector::registry::{EmbeddingSlotState, EmbeddingSource, HnswParams};
|
||||
use crate::storage::vector::{
|
||||
BruteForceIndex, QuantizationLevel, VectorIndexConfig, insert_embedding,
|
||||
};
|
||||
use crate::storage::{Tag, encode_key};
|
||||
|
||||
use super::TidalDb;
|
||||
use super::metadata::deserialize_metadata;
|
||||
|
||||
impl TidalDb {
|
||||
/// Write (or overwrite) item metadata and update in-memory indexes.
|
||||
///
|
||||
/// This is the M2 replacement for `write_item` -- it persists metadata
|
||||
/// to storage AND inserts the entity into the bitmap, range, and universe
|
||||
/// indexes so it is discoverable by RETRIEVE queries.
|
||||
///
|
||||
/// Recognized metadata keys for indexing:
|
||||
/// - `"category"` -> category bitmap index
|
||||
/// - `"format"` -> format bitmap index
|
||||
/// - `"creator_id"` -> creator bitmap index
|
||||
/// - `"tags"` -> tag bitmap index (comma-separated)
|
||||
/// - `"duration"` -> duration range index (seconds, parsed as u32)
|
||||
/// - `"created_at"` -> `created_at` range index (nanos, parsed as u64)
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn write_item_with_metadata(
|
||||
&self,
|
||||
id: EntityId,
|
||||
metadata: &HashMap<String, String>,
|
||||
) -> crate::Result<()> {
|
||||
// Persist to storage.
|
||||
self.write_item(id, metadata)?;
|
||||
|
||||
// Truncate entity ID to u32 for roaring bitmap. This limits the universe
|
||||
// to ~4 billion entities per instance, which is well beyond the single-node
|
||||
// target of 10M items.
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
let id_u32 = id.as_u64() as u32;
|
||||
if id.as_u64() > u64::from(u32::MAX) {
|
||||
tracing::warn!(
|
||||
entity_id = id.as_u64(),
|
||||
"entity ID exceeds u32::MAX; universe bitmap entry will collide with a lower ID"
|
||||
);
|
||||
}
|
||||
|
||||
// Insert into bitmap indexes.
|
||||
if let Some(val) = metadata.get("category") {
|
||||
self.category_index.insert(id_u32, val);
|
||||
}
|
||||
if let Some(val) = metadata.get("format") {
|
||||
self.format_index.insert(id_u32, val);
|
||||
}
|
||||
if let Some(val) = metadata.get("creator_id") {
|
||||
self.creator_index.insert(id_u32, val);
|
||||
// M3: populate creator-items bitmap for the `following` profile
|
||||
// and `unblocked` predicate.
|
||||
if let Ok(creator_id) = val.parse::<u64>() {
|
||||
self.creator_items.add_item(creator_id, id_u32);
|
||||
}
|
||||
}
|
||||
if let Some(tags) = metadata.get("tags") {
|
||||
for tag in tags.split(',') {
|
||||
let tag = tag.trim();
|
||||
if !tag.is_empty() {
|
||||
self.tag_index.insert(id_u32, tag);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Insert into range indexes.
|
||||
if let Some(val) = metadata.get("duration") {
|
||||
match val.parse::<u32>() {
|
||||
Ok(dur) => self.duration_index.insert(id_u32, dur),
|
||||
Err(e) => tracing::warn!(
|
||||
entity_id = id.as_u64(),
|
||||
value = %val,
|
||||
error = %e,
|
||||
"failed to parse 'duration' metadata; item will not be indexed by duration"
|
||||
),
|
||||
}
|
||||
}
|
||||
let has_created_at =
|
||||
metadata
|
||||
.get("created_at")
|
||||
.is_some_and(|val| match val.parse::<u64>() {
|
||||
Ok(ts) => {
|
||||
self.created_at_index.insert(id_u32, ts);
|
||||
true
|
||||
}
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
entity_id = id.as_u64(),
|
||||
value = %val,
|
||||
error = %e,
|
||||
"failed to parse 'created_at' metadata; defaulting to current time"
|
||||
);
|
||||
false
|
||||
}
|
||||
});
|
||||
if !has_created_at {
|
||||
// Default: use current time as created_at so the item is discoverable
|
||||
// by range queries and sortable by recency.
|
||||
self.created_at_index
|
||||
.insert(id_u32, Timestamp::now().as_nanos());
|
||||
}
|
||||
|
||||
// Insert into universe bitmap.
|
||||
if let Ok(mut bm) = self.universe.write() {
|
||||
bm.insert(id_u32);
|
||||
}
|
||||
|
||||
// Enqueue for text index (best-effort -- channel missing or dropped is non-fatal).
|
||||
if let Ok(guard) = self.text_tx.lock()
|
||||
&& let Some(tx) = guard.as_ref()
|
||||
{
|
||||
let _ = tx.send(crate::text::PendingWrite {
|
||||
entity_id: id,
|
||||
metadata: metadata.clone(),
|
||||
seq: 0, // seq tracking is best-effort for now
|
||||
deleted: false,
|
||||
});
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Returns the number of items in the universe bitmap.
|
||||
///
|
||||
/// This counts items written via `write_item_with_metadata`, not raw
|
||||
/// storage entries.
|
||||
#[must_use]
|
||||
pub fn item_count(&self) -> u64 {
|
||||
self.universe.read().map_or(0, |bm| bm.len())
|
||||
}
|
||||
|
||||
/// Read item metadata for a given entity ID.
|
||||
///
|
||||
/// Returns `None` if the entity does not exist in storage.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn get_item_metadata(
|
||||
&self,
|
||||
id: EntityId,
|
||||
) -> crate::Result<Option<HashMap<String, String>>> {
|
||||
let storage = self.storage()?;
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
match storage.items_engine().get(&key) {
|
||||
Ok(Some(bytes)) => Ok(Some(deserialize_metadata(&bytes))),
|
||||
Ok(None) => Ok(None),
|
||||
Err(e) => Err(TidalError::from(e)),
|
||||
}
|
||||
}
|
||||
|
||||
/// Write (or overwrite) a pre-computed embedding for an item entity.
|
||||
///
|
||||
/// L2-normalizes the embedding, stores it in the item entity store (source
|
||||
/// of truth), and inserts it into the `(EntityKind::Item, "content")` HNSW
|
||||
/// slot. The slot is auto-registered if not yet present, using
|
||||
/// `embedding.len()` as the dimension count.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired or lock is poisoned.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
/// - `TidalError::Internal` if the embedding has zero norm.
|
||||
#[allow(clippy::significant_drop_tightening)] // lock must be held across insert_embedding call
|
||||
pub fn write_item_embedding(&self, id: EntityId, embedding: &[f32]) -> crate::Result<()> {
|
||||
let storage = self.storage()?;
|
||||
|
||||
// Auto-register the item "content" slot if absent.
|
||||
{
|
||||
let mut registry = self.embedding_registry.write().map_err(|_| {
|
||||
TidalError::Internal("embedding_registry write lock poisoned".into())
|
||||
})?;
|
||||
if registry.get(EntityKind::Item, "content").is_none() {
|
||||
let config = VectorIndexConfig {
|
||||
dimensions: embedding.len(),
|
||||
..VectorIndexConfig::default()
|
||||
};
|
||||
let state = EmbeddingSlotState {
|
||||
index: Box::new(BruteForceIndex::new(config)),
|
||||
dimensions: embedding.len(),
|
||||
quantization: QuantizationLevel::F32,
|
||||
source: EmbeddingSource::External,
|
||||
params: HnswParams::default(),
|
||||
};
|
||||
registry
|
||||
.register(EntityKind::Item, "content".to_string(), state)
|
||||
.map_err(|e| TidalError::Internal(format!("slot registration failed: {e}")))?;
|
||||
}
|
||||
}
|
||||
|
||||
// Insert into the HNSW index and store in entity engine.
|
||||
let dimensions = {
|
||||
let registry = self.embedding_registry.read().map_err(|_| {
|
||||
TidalError::Internal("embedding_registry read lock poisoned".into())
|
||||
})?;
|
||||
registry
|
||||
.get(EntityKind::Item, "content")
|
||||
.ok_or_else(|| TidalError::Internal("item content slot missing".into()))?
|
||||
.dimensions
|
||||
};
|
||||
{
|
||||
let registry = self.embedding_registry.read().map_err(|_| {
|
||||
TidalError::Internal("embedding_registry read lock poisoned".into())
|
||||
})?;
|
||||
let slot = registry
|
||||
.get(EntityKind::Item, "content")
|
||||
.ok_or_else(|| TidalError::Internal("item content slot missing".into()))?;
|
||||
insert_embedding(
|
||||
id,
|
||||
"content",
|
||||
embedding,
|
||||
dimensions,
|
||||
slot.index.as_ref(),
|
||||
storage.items_engine(),
|
||||
)
|
||||
.map_err(|e| TidalError::Internal(format!("write_item_embedding: {e}")))?;
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Reload the text index reader to pick up recent commits.
|
||||
///
|
||||
/// Useful in tests and benchmarks after writing a batch of items, to make
|
||||
/// them visible to `search()`. On-disk databases use
|
||||
/// `ReloadPolicy::OnCommitWithDelay` and do not require this call, but
|
||||
/// calling it is always safe. Ephemeral databases use `Manual` reload
|
||||
/// policy and require an explicit reload to see new documents.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Internal` if Tantivy fails to reload the reader.
|
||||
pub fn reload_text_index(&self) -> crate::Result<()> {
|
||||
self.text_index
|
||||
.as_ref()
|
||||
.map_or(Ok(()), |idx| idx.reload_reader())
|
||||
}
|
||||
|
||||
/// Force the creator text index reader to reload after a commit.
|
||||
///
|
||||
/// In production this is automatic. In tests with ephemeral mode, call
|
||||
/// this after writing creators and sleeping >=2.5s to see them in search
|
||||
/// results (or after sleeping 500ms when writing >=1000 creators).
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Internal` if Tantivy's reader reload fails.
|
||||
pub fn reload_creator_text_index(&self) -> crate::Result<()> {
|
||||
self.creator_text_index
|
||||
.as_ref()
|
||||
.map_or(Ok(()), |idx| idx.reload_reader())
|
||||
}
|
||||
|
||||
/// Force a synchronous commit of all pending item text index writes.
|
||||
///
|
||||
/// Sends a flush request to the background text syncer thread and blocks
|
||||
/// until the syncer acknowledges the commit. Then reloads the reader so
|
||||
/// the committed documents are immediately visible to `search()`.
|
||||
///
|
||||
/// Replaces the sleep-based synchronization pattern (`sleep(2.5s)` +
|
||||
/// `reload_text_index()`) with a deterministic, channel-based protocol.
|
||||
/// Useful in tests and benchmarks.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Internal` if Tantivy fails to reload the reader.
|
||||
pub fn flush_text_index(&self) -> crate::Result<()> {
|
||||
if let Some(ref flush_tx) = self.text_flush_tx {
|
||||
let (ack_tx, ack_rx) = crossbeam::channel::bounded(1);
|
||||
let _ = flush_tx.send(ack_tx);
|
||||
let _ = ack_rx.recv_timeout(std::time::Duration::from_secs(10));
|
||||
}
|
||||
self.reload_text_index()
|
||||
}
|
||||
|
||||
/// Force a synchronous commit of all pending creator text index writes.
|
||||
///
|
||||
/// Same protocol as [`flush_text_index`](Self::flush_text_index) but for
|
||||
/// the creator text syncer. Blocks until the background syncer acknowledges
|
||||
/// the commit, then reloads the creator text reader.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Internal` if Tantivy fails to reload the reader.
|
||||
pub fn flush_creator_text_index(&self) -> crate::Result<()> {
|
||||
if let Some(ref flush_tx) = self.creator_text_flush_tx {
|
||||
let (ack_tx, ack_rx) = crossbeam::channel::bounded(1);
|
||||
let _ = flush_tx.send(ack_tx);
|
||||
let _ = ack_rx.recv_timeout(std::time::Duration::from_secs(10));
|
||||
}
|
||||
self.reload_creator_text_index()
|
||||
}
|
||||
}
|
||||
91
tidal/src/db/metadata.rs
Normal file
91
tidal/src/db/metadata.rs
Normal file
@ -0,0 +1,91 @@
|
||||
//! Item metadata serialization and signal classification helpers.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
/// Serialize `HashMap<String, String>` as length-prefixed binary pairs.
|
||||
///
|
||||
/// This is the **items-only** metadata format (no embedding header).
|
||||
/// For user/creator entities (which include an optional embedding prefix),
|
||||
/// see [`entities::serialize_entity`](crate::entities::serialize_entity).
|
||||
///
|
||||
/// The two formats are intentionally different: items store only metadata
|
||||
/// (embeddings go to the HNSW index), while users/creators store an optional
|
||||
/// embedding inline with their metadata for cold-start preference vectors.
|
||||
///
|
||||
/// Format (all lengths little-endian u32):
|
||||
/// ```text
|
||||
/// [num_entries: 4 bytes]
|
||||
/// for each entry:
|
||||
/// [key_len: 4 bytes][key bytes]
|
||||
/// [val_len: 4 bytes][value bytes]
|
||||
/// ```
|
||||
pub(super) fn serialize_metadata(map: &HashMap<String, String>) -> Vec<u8> {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
let mut buf = Vec::new();
|
||||
buf.extend_from_slice(&(map.len() as u32).to_le_bytes());
|
||||
for (k, v) in map {
|
||||
buf.extend_from_slice(&(k.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(k.as_bytes());
|
||||
buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(v.as_bytes());
|
||||
}
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize `HashMap<String, String>` from the binary format produced by
|
||||
/// [`serialize_metadata`].
|
||||
///
|
||||
/// This is the **items-only** metadata format (no embedding header).
|
||||
/// For user/creator entities, see [`entities::deserialize_entity`](crate::entities::deserialize_entity).
|
||||
///
|
||||
/// Returns an empty map if the bytes are empty or malformed -- metadata reads
|
||||
/// must never panic or fail the query.
|
||||
pub fn deserialize_metadata(bytes: &[u8]) -> HashMap<String, String> {
|
||||
let mut map = HashMap::new();
|
||||
if bytes.len() < 4 {
|
||||
return map;
|
||||
}
|
||||
let count = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
|
||||
let mut pos = 4;
|
||||
for _ in 0..count {
|
||||
if pos + 4 > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let key_len =
|
||||
u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
|
||||
as usize;
|
||||
pos += 4;
|
||||
if pos + key_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let key = String::from_utf8_lossy(&bytes[pos..pos + key_len]).to_string();
|
||||
pos += key_len;
|
||||
if pos + 4 > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let val_len =
|
||||
u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
|
||||
as usize;
|
||||
pos += 4;
|
||||
if pos + val_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let val = String::from_utf8_lossy(&bytes[pos..pos + val_len]).to_string();
|
||||
pos += val_len;
|
||||
map.insert(key, val);
|
||||
}
|
||||
map
|
||||
}
|
||||
|
||||
/// Returns `true` for positive engagement signal types that should update
|
||||
/// the user's preference vector.
|
||||
///
|
||||
/// These are high-intent signals: the user explicitly chose to engage with
|
||||
/// the content. Views are excluded because they are low-signal (the user
|
||||
/// may have scrolled past without interest).
|
||||
pub(super) fn is_positive_engagement_signal(signal_type: &str) -> bool {
|
||||
matches!(
|
||||
signal_type,
|
||||
"like" | "share" | "completion" | "search_click"
|
||||
)
|
||||
}
|
||||
1803
tidal/src/db/mod.rs
1803
tidal/src/db/mod.rs
File diff suppressed because it is too large
Load Diff
211
tidal/src/db/open.rs
Normal file
211
tidal/src/db/open.rs
Normal file
@ -0,0 +1,211 @@
|
||||
//! Schema-aware database open logic.
|
||||
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::AtomicU64;
|
||||
|
||||
use roaring::RoaringBitmap;
|
||||
|
||||
use crate::entities::{
|
||||
CreatorItemsBitmap, HardNegIndex, InteractionLedger, PreferenceVectors, UserStateIndex,
|
||||
};
|
||||
use crate::ranking::builtins::register_builtins;
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
use crate::schema::{DurabilityError, EntityId, Schema, TidalError, Timestamp};
|
||||
use crate::signals::{NoopWalWriter, SignalLedger, SignalTypeId};
|
||||
use crate::storage::InMemoryBackend;
|
||||
use crate::storage::indexes::bitmap::BitmapIndex;
|
||||
use crate::storage::indexes::range::RangeIndex;
|
||||
use crate::storage::vector::registry::EmbeddingSlotRegistry;
|
||||
use crate::wal::{WalConfig, WalHandle};
|
||||
|
||||
use super::config::StorageMode;
|
||||
use super::storage_box::StorageBox;
|
||||
use super::wal_bridge::WalHandleWriter;
|
||||
|
||||
use super::state_rebuild::rebuild_entity_state;
|
||||
|
||||
/// Bundle returned by [`TidalDb::open_with_schema`] to avoid a fragile tuple.
|
||||
///
|
||||
/// Each field is consumed by [`TidalDb::from_parts`] during construction.
|
||||
pub struct OpenResult {
|
||||
pub storage: StorageBox,
|
||||
pub ledger: Arc<SignalLedger>,
|
||||
pub wal: Option<WalHandle>,
|
||||
pub last_seq: Arc<AtomicU64>,
|
||||
pub profile_registry: ProfileRegistry,
|
||||
pub category_index: BitmapIndex,
|
||||
pub format_index: BitmapIndex,
|
||||
pub creator_index: BitmapIndex,
|
||||
pub tag_index: BitmapIndex,
|
||||
pub duration_index: RangeIndex<u32>,
|
||||
pub created_at_index: RangeIndex<u64>,
|
||||
pub universe: RoaringBitmap,
|
||||
pub embedding_registry: EmbeddingSlotRegistry,
|
||||
pub schema_def: Schema,
|
||||
// M3 entity state rebuilt from durable storage.
|
||||
pub creator_items: CreatorItemsBitmap,
|
||||
pub user_state: UserStateIndex,
|
||||
pub hard_negatives: HardNegIndex,
|
||||
pub interaction_ledger: InteractionLedger,
|
||||
pub preference_vectors: PreferenceVectors,
|
||||
/// Session journal events recovered on startup (for session crash recovery).
|
||||
pub session_events: Vec<crate::wal::format::SessionWalEvent>,
|
||||
}
|
||||
|
||||
impl super::TidalDb {
|
||||
/// Open storage and ledger components for the given schema.
|
||||
///
|
||||
/// Called from `TidalDbBuilder::open()` when `with_schema()` was called.
|
||||
/// Returns an `OpenResult` containing all components needed by `from_parts`.
|
||||
#[allow(clippy::too_many_lines)] // Construction logic is inherently verbose; splitting would scatter initialization flow.
|
||||
pub(crate) fn open_with_schema(
|
||||
config: &super::Config,
|
||||
schema: Schema,
|
||||
) -> crate::Result<OpenResult> {
|
||||
let last_seq = Arc::new(AtomicU64::new(0));
|
||||
|
||||
// Initialize profile registry with builtins.
|
||||
let mut profile_registry = ProfileRegistry::new();
|
||||
register_builtins(&mut profile_registry).map_err(|e| {
|
||||
TidalError::Internal(format!("failed to register builtin profiles: {e}"))
|
||||
})?;
|
||||
|
||||
// Initialize M2 indexes (empty -- populated as items are written).
|
||||
let category_index = BitmapIndex::new("category");
|
||||
let format_index = BitmapIndex::new("format");
|
||||
let creator_index = BitmapIndex::new("creator");
|
||||
let tag_index = BitmapIndex::new("tags");
|
||||
let duration_index = RangeIndex::new("duration");
|
||||
let created_at_index = RangeIndex::new("created_at");
|
||||
let universe = RoaringBitmap::new();
|
||||
let embedding_registry = EmbeddingSlotRegistry::new();
|
||||
|
||||
let schema_def = schema.clone();
|
||||
|
||||
match config.mode {
|
||||
StorageMode::Ephemeral => {
|
||||
let storage = StorageBox::Memory {
|
||||
items: InMemoryBackend::default(),
|
||||
users: InMemoryBackend::default(),
|
||||
creators: InMemoryBackend::default(),
|
||||
};
|
||||
let ledger = Arc::new(SignalLedger::new(schema, Box::new(NoopWalWriter)));
|
||||
|
||||
// Read preference vector dimensionality from the schema.
|
||||
let pref_dim = schema_def
|
||||
.embedding_slots()
|
||||
.first()
|
||||
.map_or(128, |s| s.dimensions);
|
||||
|
||||
Ok(OpenResult {
|
||||
storage,
|
||||
ledger,
|
||||
wal: None,
|
||||
last_seq,
|
||||
profile_registry,
|
||||
category_index,
|
||||
format_index,
|
||||
creator_index,
|
||||
tag_index,
|
||||
duration_index,
|
||||
created_at_index,
|
||||
universe,
|
||||
embedding_registry,
|
||||
schema_def,
|
||||
creator_items: CreatorItemsBitmap::new(),
|
||||
user_state: UserStateIndex::new(),
|
||||
hard_negatives: HardNegIndex::new(),
|
||||
interaction_ledger: InteractionLedger::new(),
|
||||
preference_vectors: PreferenceVectors::new(pref_dim),
|
||||
session_events: Vec::new(),
|
||||
})
|
||||
}
|
||||
StorageMode::Persistent => {
|
||||
let data_dir = config.data_dir.as_ref().ok_or_else(|| {
|
||||
TidalError::Internal("persistent mode requires data_dir".into())
|
||||
})?;
|
||||
|
||||
// Open fjall storage.
|
||||
let fjall_storage =
|
||||
crate::storage::FjallStorage::open(data_dir).map_err(TidalError::from)?;
|
||||
let storage = StorageBox::Fjall(fjall_storage);
|
||||
|
||||
// Build WAL config. The WAL directory sits inside data_dir
|
||||
// but `WalConfig.dir` is the *parent* of the "wal/" subdirectory.
|
||||
let wal_config = WalConfig {
|
||||
dir: data_dir.clone(),
|
||||
..WalConfig::default()
|
||||
};
|
||||
|
||||
let (wal, replayed_events, session_events) =
|
||||
WalHandle::open(wal_config).map_err(|e| {
|
||||
TidalError::Durability(DurabilityError {
|
||||
message: format!("WAL open failed: {e}"),
|
||||
})
|
||||
})?;
|
||||
|
||||
// Build the WAL bridge for the ledger.
|
||||
let wal_writer =
|
||||
Box::new(WalHandleWriter::new(wal.sender(), Arc::clone(&last_seq)));
|
||||
|
||||
// Construct the ledger.
|
||||
let ledger = Arc::new(SignalLedger::new(schema, wal_writer));
|
||||
|
||||
// Restore signal state from the last checkpoint.
|
||||
if let Err(e) = ledger.restore(storage.items_engine()) {
|
||||
tracing::warn!(
|
||||
error = %e,
|
||||
"signal ledger restore failed; starting from empty state"
|
||||
);
|
||||
}
|
||||
|
||||
// Replay WAL events that post-date the checkpoint.
|
||||
for event in replayed_events {
|
||||
let type_id = SignalTypeId::new(u16::from(event.signal_type));
|
||||
let entity_id = EntityId::new(event.entity_id);
|
||||
let weight = f64::from(event.weight);
|
||||
let timestamp = Timestamp::from_nanos(event.timestamp_nanos);
|
||||
ledger.apply_wal_event(type_id, entity_id, weight, timestamp);
|
||||
}
|
||||
|
||||
// Rebuild in-memory entity state from durable storage.
|
||||
// This reconstructs blocked/hidden/follows user state, creator-items
|
||||
// bitmaps, and interaction weights from persisted relationship edges
|
||||
// and item metadata.
|
||||
let user_state = UserStateIndex::new();
|
||||
let creator_items = CreatorItemsBitmap::new();
|
||||
let interaction_ledger = InteractionLedger::new();
|
||||
rebuild_entity_state(&storage, &user_state, &creator_items, &interaction_ledger)?;
|
||||
|
||||
// Read preference vector dimensionality from the schema.
|
||||
let pref_dim = schema_def
|
||||
.embedding_slots()
|
||||
.first()
|
||||
.map_or(128, |s| s.dimensions);
|
||||
|
||||
Ok(OpenResult {
|
||||
storage,
|
||||
ledger,
|
||||
wal: Some(wal),
|
||||
last_seq,
|
||||
profile_registry,
|
||||
category_index,
|
||||
format_index,
|
||||
creator_index,
|
||||
tag_index,
|
||||
duration_index,
|
||||
created_at_index,
|
||||
universe,
|
||||
embedding_registry,
|
||||
schema_def,
|
||||
creator_items,
|
||||
user_state,
|
||||
hard_negatives: HardNegIndex::new(),
|
||||
interaction_ledger,
|
||||
preference_vectors: PreferenceVectors::new(pref_dim),
|
||||
session_events,
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
180
tidal/src/db/query_ops.rs
Normal file
180
tidal/src/db/query_ops.rs
Normal file
@ -0,0 +1,180 @@
|
||||
//! RETRIEVE and SEARCH query execution on `TidalDb`.
|
||||
|
||||
use crate::query::executor::RetrieveExecutor;
|
||||
use crate::query::retrieve::{QueryError, Results, Retrieve};
|
||||
use crate::query::search::{Search, SearchExecutor, SearchResults};
|
||||
use crate::schema::{EntityKind, TidalError};
|
||||
use crate::session as session_mod;
|
||||
|
||||
use super::TidalDb;
|
||||
use super::storage_box::StorageBox;
|
||||
|
||||
impl TidalDb {
|
||||
/// Execute a RETRIEVE query against the database.
|
||||
///
|
||||
/// Constructs a per-query `RetrieveExecutor` with borrowed references to
|
||||
/// all infrastructure, then runs the 5-stage pipeline:
|
||||
///
|
||||
/// 1. Candidate generation (scan universe or signal-ranked)
|
||||
/// 2. Filter evaluation (bitmap + range indexes)
|
||||
/// 3. Signal scoring (profile executor)
|
||||
/// 4. Diversity enforcement (per-creator, format-mix)
|
||||
/// 5. Result assembly (pagination, cursor, explain)
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Query` on validation failure, missing profile,
|
||||
/// or unsupported candidate strategy.
|
||||
pub fn retrieve(&self, query: &Retrieve) -> crate::Result<Results> {
|
||||
let ledger = self.ledger()?;
|
||||
|
||||
// Attach items storage for keyword-hint metadata lookup (session scoring).
|
||||
let items_storage_opt = self.storage.as_ref().map(StorageBox::items_engine);
|
||||
|
||||
let mut base_executor = RetrieveExecutor::new(
|
||||
ledger,
|
||||
&self.profile_registry,
|
||||
Some(&self.category_index),
|
||||
Some(&self.format_index),
|
||||
Some(&self.creator_index),
|
||||
Some(&self.tag_index),
|
||||
Some(&self.duration_index),
|
||||
Some(&self.created_at_index),
|
||||
Some(&self.universe),
|
||||
Some(&self.embedding_registry),
|
||||
)
|
||||
.with_user_context(
|
||||
&self.user_state,
|
||||
&self.hard_negatives,
|
||||
&self.interaction_ledger,
|
||||
&self.creator_items,
|
||||
)
|
||||
.with_preference_vectors(&self.preference_vectors);
|
||||
|
||||
if let Some(storage) = items_storage_opt {
|
||||
base_executor = base_executor.with_items_storage(storage);
|
||||
}
|
||||
|
||||
// M4: wire in session context when FOR SESSION is specified.
|
||||
let executor = if let Some(session_id) = query.for_session {
|
||||
match self.session_snapshot(session_id) {
|
||||
Ok(snapshot) => {
|
||||
let ctx = session_mod::SessionContext::from_snapshot(&snapshot);
|
||||
base_executor.with_session(ctx, snapshot)
|
||||
}
|
||||
Err(_) => {
|
||||
return Err(TidalError::Query(QueryError::SessionNotFound(format!(
|
||||
"{session_id}"
|
||||
))));
|
||||
}
|
||||
}
|
||||
} else {
|
||||
base_executor
|
||||
};
|
||||
|
||||
executor.execute(query).map_err(TidalError::from)
|
||||
}
|
||||
|
||||
/// Execute a SEARCH query -- text and/or vector retrieval with RRF fusion,
|
||||
/// signal-based profile scoring, filtering, and optional diversity.
|
||||
///
|
||||
/// The 8-stage pipeline is:
|
||||
/// 1a. BM25 text retrieval (when `query_text` is set)
|
||||
/// 1b. ANN vector retrieval (when `query_vector` is set)
|
||||
/// 1c. Fusion via Reciprocal Rank Fusion
|
||||
/// 2. Metadata filter evaluation
|
||||
/// 2.5. User-context filtering (seen, blocked, hard negatives)
|
||||
/// 3. Profile scoring (signal-weighted, personalized when `for_user` is set)
|
||||
/// 4. Diversity enforcement (when `diversity` is set)
|
||||
/// 5. Result assembly with BM25 and semantic score explainability
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Query` on validation failure, missing profile,
|
||||
/// or storage error during retrieval.
|
||||
pub fn search(&self, query: &Search) -> crate::Result<SearchResults> {
|
||||
let ledger = self.ledger()?;
|
||||
|
||||
// Resolve similar_to: read stored embedding and inject as query_vector.
|
||||
let mut query_owned;
|
||||
let query = if let (Some(similar_id), None) = (query.similar_to, &query.query_vector) {
|
||||
let emb = match query.entity_kind {
|
||||
EntityKind::Creator => self.read_creator_embedding(similar_id)?,
|
||||
_ => None, // Item similar_to not yet supported
|
||||
};
|
||||
if let Some(embedding) = emb {
|
||||
query_owned = query.clone();
|
||||
query_owned.query_vector = Some(embedding);
|
||||
// Exclude the source entity from results.
|
||||
if !query_owned.exclude.contains(&similar_id) {
|
||||
query_owned.exclude.push(similar_id);
|
||||
}
|
||||
&query_owned
|
||||
} else {
|
||||
query
|
||||
}
|
||||
} else {
|
||||
query
|
||||
};
|
||||
|
||||
let text_index_ref = self.text_index.as_ref();
|
||||
let items_storage_opt = self.storage.as_ref().map(StorageBox::items_engine);
|
||||
|
||||
let mut base_executor = SearchExecutor::new(
|
||||
ledger,
|
||||
&self.profile_registry,
|
||||
text_index_ref, // item text index (None for Creator queries, but executor routes)
|
||||
Some(&self.embedding_registry),
|
||||
Some(&self.category_index),
|
||||
Some(&self.format_index),
|
||||
Some(&self.creator_index),
|
||||
Some(&self.tag_index),
|
||||
Some(&self.duration_index),
|
||||
Some(&self.created_at_index),
|
||||
Some(&self.universe),
|
||||
)
|
||||
.with_user_context(
|
||||
&self.user_state,
|
||||
&self.hard_negatives,
|
||||
&self.interaction_ledger,
|
||||
&self.creator_items,
|
||||
)
|
||||
.with_preference_vectors(&self.preference_vectors);
|
||||
|
||||
if let Some(storage) = items_storage_opt {
|
||||
base_executor = base_executor.with_items_storage(storage);
|
||||
}
|
||||
|
||||
// Route creator text index and creators storage for creator searches.
|
||||
if query.entity_kind == EntityKind::Creator {
|
||||
if let Some(idx) = self.creator_text_index.as_ref() {
|
||||
base_executor = base_executor.with_creator_text_index(idx);
|
||||
}
|
||||
if let Ok(storage) = self.storage() {
|
||||
base_executor = base_executor.with_creators_storage(storage.creators_engine());
|
||||
}
|
||||
}
|
||||
|
||||
// Wire in session context when FOR SESSION is specified.
|
||||
let executor = if let Some(session_id) = query.for_session {
|
||||
match self.session_snapshot(session_id) {
|
||||
Ok(snapshot) => {
|
||||
let ctx = session_mod::SessionContext::from_snapshot(&snapshot);
|
||||
base_executor.with_session(ctx, snapshot)
|
||||
}
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
session_id = %session_id,
|
||||
error = %e,
|
||||
"FOR SESSION: session not found; executing without session boost"
|
||||
);
|
||||
base_executor
|
||||
}
|
||||
}
|
||||
} else {
|
||||
base_executor
|
||||
};
|
||||
|
||||
executor.execute(query).map_err(TidalError::from)
|
||||
}
|
||||
}
|
||||
165
tidal/src/db/relationships.rs
Normal file
165
tidal/src/db/relationships.rs
Normal file
@ -0,0 +1,165 @@
|
||||
//! Relationship edge operations and entity accessor methods on `TidalDb`.
|
||||
|
||||
use crate::entities::{
|
||||
CreatorItemsBitmap, HardNegIndex, InteractionLedger, PreferenceVectors, RelationshipType,
|
||||
UserStateIndex,
|
||||
};
|
||||
use crate::schema::{EntityId, TidalError, Timestamp};
|
||||
|
||||
use super::TidalDb;
|
||||
|
||||
impl TidalDb {
|
||||
/// Write a relationship edge between two entities.
|
||||
///
|
||||
/// The edge is stored in the users keyspace under the `from` entity's key
|
||||
/// range using the relationship key encoding.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn write_relationship(
|
||||
&self,
|
||||
from: EntityId,
|
||||
rel_type: RelationshipType,
|
||||
to: EntityId,
|
||||
weight: f64,
|
||||
timestamp: Timestamp,
|
||||
) -> crate::Result<()> {
|
||||
use crate::entities::relationship::{
|
||||
encode_relationship_key, serialize_relationship_value,
|
||||
};
|
||||
|
||||
let storage = self.storage()?;
|
||||
|
||||
let key = encode_relationship_key(from, rel_type, to);
|
||||
let value = serialize_relationship_value(weight, timestamp);
|
||||
storage
|
||||
.users_engine()
|
||||
.put(&key, &value)
|
||||
.map_err(TidalError::from)?;
|
||||
|
||||
// Update in-memory user state based on relationship type.
|
||||
match rel_type {
|
||||
RelationshipType::Blocks => {
|
||||
self.user_state
|
||||
.add_block_creator(from.as_u64(), to.as_u64());
|
||||
}
|
||||
RelationshipType::Hide => {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
self.user_state.add_hide(from.as_u64(), to.as_u64() as u32);
|
||||
}
|
||||
RelationshipType::Follows => {
|
||||
self.user_state.add_follow(from.as_u64(), to.as_u64());
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Delete a relationship edge.
|
||||
///
|
||||
/// Removes both the durable storage entry and the corresponding in-memory
|
||||
/// state (blocked creators, hidden items, follows).
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn delete_relationship(
|
||||
&self,
|
||||
from: EntityId,
|
||||
rel_type: RelationshipType,
|
||||
to: EntityId,
|
||||
) -> crate::Result<()> {
|
||||
use crate::entities::relationship::encode_relationship_key;
|
||||
|
||||
let storage = self.storage()?;
|
||||
let key = encode_relationship_key(from, rel_type, to);
|
||||
storage
|
||||
.users_engine()
|
||||
.delete(&key)
|
||||
.map_err(TidalError::from)?;
|
||||
|
||||
// Remove corresponding in-memory state.
|
||||
match rel_type {
|
||||
RelationshipType::Blocks => {
|
||||
self.user_state
|
||||
.remove_block_creator(from.as_u64(), to.as_u64());
|
||||
}
|
||||
RelationshipType::Hide => {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
self.user_state
|
||||
.remove_hide(from.as_u64(), to.as_u64() as u32);
|
||||
}
|
||||
RelationshipType::Follows => {
|
||||
self.user_state.remove_follow(from.as_u64(), to.as_u64());
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// List all relationship targets for a given (from, `rel_type`) pair.
|
||||
///
|
||||
/// Returns `(to_entity_id, weight, timestamp_nanos)` tuples.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn list_relationships(
|
||||
&self,
|
||||
from: EntityId,
|
||||
rel_type: RelationshipType,
|
||||
) -> crate::Result<Vec<(EntityId, f64, u64)>> {
|
||||
use crate::entities::relationship::{
|
||||
deserialize_relationship_value, parse_relationship_to, relationship_prefix,
|
||||
};
|
||||
|
||||
let storage = self.storage()?;
|
||||
let prefix = relationship_prefix(from, rel_type);
|
||||
let mut results = Vec::new();
|
||||
for entry in storage.users_engine().scan_prefix(&prefix) {
|
||||
let (key, value) = entry.map_err(TidalError::from)?;
|
||||
if let Some(to) = parse_relationship_to(&key)
|
||||
&& let Some((weight, ts_nanos)) = deserialize_relationship_value(&value)
|
||||
{
|
||||
results.push((to, weight, ts_nanos));
|
||||
}
|
||||
}
|
||||
Ok(results)
|
||||
}
|
||||
|
||||
/// Access the per-user state index (seen, blocked, saved, liked, completion).
|
||||
#[must_use]
|
||||
pub fn user_state(&self) -> &UserStateIndex {
|
||||
&self.user_state
|
||||
}
|
||||
|
||||
/// Access the creator-items bitmap (maps creators to their item sets).
|
||||
#[must_use]
|
||||
pub fn creator_items(&self) -> &CreatorItemsBitmap {
|
||||
&self.creator_items
|
||||
}
|
||||
|
||||
/// Access the hard-negative index.
|
||||
#[must_use]
|
||||
pub fn hard_negatives(&self) -> &HardNegIndex {
|
||||
&self.hard_negatives
|
||||
}
|
||||
|
||||
/// Access the interaction weight ledger.
|
||||
#[must_use]
|
||||
pub fn interaction_ledger(&self) -> &InteractionLedger {
|
||||
&self.interaction_ledger
|
||||
}
|
||||
|
||||
/// Access the preference vectors store.
|
||||
#[must_use]
|
||||
pub fn preference_vectors(&self) -> &PreferenceVectors {
|
||||
&self.preference_vectors
|
||||
}
|
||||
}
|
||||
250
tidal/src/db/session_restore.rs
Normal file
250
tidal/src/db/session_restore.rs
Normal file
@ -0,0 +1,250 @@
|
||||
//! Session restore from persistent storage and WAL replay.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
|
||||
|
||||
use crate::schema::Timestamp;
|
||||
use crate::session::{self as session_mod, AgentId, SessionId, SessionState};
|
||||
use crate::storage::{Tag, parse_key};
|
||||
|
||||
use super::TidalDb;
|
||||
|
||||
impl TidalDb {
|
||||
/// Scan persistent storage for previously archived session snapshots.
|
||||
///
|
||||
/// Called once at startup from `TidalDbBuilder::open()`. Walks every key
|
||||
/// in the items storage engine, filters for `Tag::Session` keys, and:
|
||||
/// - `b"snapshot"` suffix -> deserializes into `closed_sessions`
|
||||
/// - `b"start"` suffix -> logs a warning (session was active at shutdown)
|
||||
///
|
||||
/// In ephemeral mode the in-memory backend starts empty each time so this
|
||||
/// scan always returns nothing -- which is correct.
|
||||
pub(crate) fn restore_sessions(&self) {
|
||||
let Some(storage) = self.storage.as_ref() else {
|
||||
return;
|
||||
};
|
||||
let engine = storage.items_engine();
|
||||
let mut restored = 0usize;
|
||||
let mut orphaned = 0usize;
|
||||
|
||||
for item in engine.scan_prefix(&[]) {
|
||||
let (key, value) = match item {
|
||||
Ok(kv) => kv,
|
||||
Err(e) => {
|
||||
tracing::warn!(error = %e, "storage scan error during session restore");
|
||||
continue;
|
||||
}
|
||||
};
|
||||
let Some((_entity_id, tag, suffix)) = parse_key(&key) else {
|
||||
continue;
|
||||
};
|
||||
if tag != Tag::Session {
|
||||
continue;
|
||||
}
|
||||
match suffix {
|
||||
b"snapshot" => {
|
||||
if let Some(snapshot) = session_mod::deserialize_snapshot(&value) {
|
||||
self.closed_sessions.insert(snapshot.id, snapshot);
|
||||
restored += 1;
|
||||
} else {
|
||||
tracing::warn!("corrupt session snapshot key {:?}; skipping", key);
|
||||
}
|
||||
}
|
||||
b"start" => {
|
||||
if let Some((session_id, _user_id, _started_at_ns)) =
|
||||
session_mod::deserialize_start_record(&value)
|
||||
{
|
||||
tracing::warn!(
|
||||
session_id = %session_id,
|
||||
"session was active at last shutdown; in-memory signal state is lost"
|
||||
);
|
||||
}
|
||||
orphaned += 1;
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
if restored > 0 || orphaned > 0 {
|
||||
tracing::info!(restored, orphaned, "session restore complete");
|
||||
}
|
||||
}
|
||||
|
||||
/// Replay session journal events to restore active sessions from a crash.
|
||||
///
|
||||
/// Scans the event list for Start/Signal/Close triples. Sessions that have
|
||||
/// a `Start` but no matching `Close` are restored as active sessions with
|
||||
/// their signals replayed into `SessionSignalState`.
|
||||
///
|
||||
/// Called from `from_parts()` during startup. The storage-backed
|
||||
/// `restore_sessions()` is additive and handles archived snapshots; this
|
||||
/// method handles in-flight sessions that were active at crash time.
|
||||
#[allow(clippy::too_many_lines, clippy::cast_possible_truncation)]
|
||||
pub(super) fn restore_session_wal_events(
|
||||
&self,
|
||||
events: &[crate::wal::format::SessionWalEvent],
|
||||
) {
|
||||
let (open_sessions, session_signals) = Self::partition_session_events(events);
|
||||
|
||||
if open_sessions.is_empty() {
|
||||
return;
|
||||
}
|
||||
|
||||
let Some(schema) = self.schema_def.as_ref() else {
|
||||
tracing::warn!("no schema available; cannot restore sessions from WAL");
|
||||
return;
|
||||
};
|
||||
|
||||
let now_ns = Timestamp::now().as_nanos();
|
||||
|
||||
for (session_id, (user_id, started_at_ns, agent_id_str, policy_name)) in open_sessions {
|
||||
if schema.session_policy(&policy_name).is_none() {
|
||||
tracing::warn!(
|
||||
session_id,
|
||||
policy_name,
|
||||
"restored session has unknown policy, skipping"
|
||||
);
|
||||
continue;
|
||||
}
|
||||
|
||||
let agent_id = AgentId::new(&agent_id_str).unwrap_or_else(|_| {
|
||||
tracing::warn!(
|
||||
session_id,
|
||||
agent_id = agent_id_str,
|
||||
"invalid agent_id in WAL, using fallback 'restored'"
|
||||
);
|
||||
// Fallback: AgentId::new("restored") is always valid (lowercase ASCII).
|
||||
AgentId::new("restored").expect("'restored' is a valid AgentId")
|
||||
});
|
||||
|
||||
let closed = Arc::new(AtomicBool::new(false));
|
||||
|
||||
let state = Arc::new(SessionState {
|
||||
id: SessionId::from_raw(session_id),
|
||||
user_id,
|
||||
agent_id,
|
||||
policy_name: policy_name.clone(),
|
||||
// We lost the exact monotonic Instant -- approximate with "now".
|
||||
started_at: std::time::Instant::now(),
|
||||
started_at_ns,
|
||||
metadata: HashMap::new(), // metadata not persisted in the session journal
|
||||
signals: dashmap::DashMap::new(),
|
||||
signaled_entities: dashmap::DashMap::new(),
|
||||
annotations: std::sync::Mutex::new(Vec::new()),
|
||||
signals_written: AtomicU64::new(0),
|
||||
signals_rejected: AtomicU64::new(0),
|
||||
audit_log: std::sync::Mutex::new(session_mod::AuditLog::new()),
|
||||
closed,
|
||||
});
|
||||
|
||||
// Replay signals into the restored session state.
|
||||
if let Some(signals) = session_signals.get(&session_id) {
|
||||
for (entity_id, weight, ts_ns, signal_name, _annotation) in signals {
|
||||
let lambda = schema
|
||||
.signal(signal_name)
|
||||
.and_then(|def| def.decay().lambda())
|
||||
.unwrap_or(session_mod::DEFAULT_SESSION_LAMBDA);
|
||||
|
||||
let ss = state
|
||||
.signals
|
||||
.entry(signal_name.clone())
|
||||
.or_insert_with(|| session_mod::SessionSignalState::new(now_ns, lambda));
|
||||
ss.on_signal(f64::from(*weight), *ts_ns);
|
||||
drop(ss);
|
||||
|
||||
state.signaled_entities.insert(*entity_id, ());
|
||||
state.signals_written.fetch_add(1, Ordering::Relaxed);
|
||||
}
|
||||
}
|
||||
|
||||
let sid = SessionId::from_raw(session_id);
|
||||
self.sessions.insert(sid, state);
|
||||
|
||||
// Advance next_session_id past any restored session IDs.
|
||||
loop {
|
||||
let current = self.next_session_id.load(Ordering::Acquire);
|
||||
if session_id < current {
|
||||
break;
|
||||
}
|
||||
if self
|
||||
.next_session_id
|
||||
.compare_exchange(
|
||||
current,
|
||||
session_id + 1,
|
||||
Ordering::Release,
|
||||
Ordering::Relaxed,
|
||||
)
|
||||
.is_ok()
|
||||
{
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
tracing::info!(session_id, user_id, "restored active session from WAL");
|
||||
}
|
||||
}
|
||||
|
||||
/// Partition session WAL events into open sessions and their signals.
|
||||
///
|
||||
/// Returns `(open_sessions, session_signals)` where `open_sessions` maps
|
||||
/// `session_id` to `(user_id, started_at_ns, agent_id, policy_name)` for
|
||||
/// sessions that have a Start but no Close.
|
||||
#[allow(clippy::type_complexity)]
|
||||
pub(super) fn partition_session_events(
|
||||
events: &[crate::wal::format::SessionWalEvent],
|
||||
) -> (
|
||||
HashMap<u64, (u64, u64, String, String)>,
|
||||
HashMap<u64, Vec<(u64, f32, u64, String, Option<String>)>>,
|
||||
) {
|
||||
use crate::wal::format::SessionWalEvent;
|
||||
|
||||
let mut open_sessions: HashMap<u64, (u64, u64, String, String)> = HashMap::new();
|
||||
let mut session_signals: HashMap<u64, Vec<(u64, f32, u64, String, Option<String>)>> =
|
||||
HashMap::new();
|
||||
|
||||
for event in events {
|
||||
match event {
|
||||
SessionWalEvent::Start {
|
||||
session_id,
|
||||
user_id,
|
||||
started_at_ns,
|
||||
agent_id,
|
||||
policy_name,
|
||||
} => {
|
||||
open_sessions.insert(
|
||||
*session_id,
|
||||
(
|
||||
*user_id,
|
||||
*started_at_ns,
|
||||
agent_id.clone(),
|
||||
policy_name.clone(),
|
||||
),
|
||||
);
|
||||
}
|
||||
SessionWalEvent::Signal {
|
||||
session_id,
|
||||
entity_id,
|
||||
weight,
|
||||
ts_ns,
|
||||
signal_name,
|
||||
annotation,
|
||||
} => {
|
||||
session_signals.entry(*session_id).or_default().push((
|
||||
*entity_id,
|
||||
*weight,
|
||||
*ts_ns,
|
||||
signal_name.clone(),
|
||||
annotation.clone(),
|
||||
));
|
||||
}
|
||||
SessionWalEvent::Close { session_id } => {
|
||||
open_sessions.remove(session_id);
|
||||
session_signals.remove(session_id);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
(open_sessions, session_signals)
|
||||
}
|
||||
}
|
||||
420
tidal/src/db/sessions.rs
Normal file
420
tidal/src/db/sessions.rs
Normal file
@ -0,0 +1,420 @@
|
||||
//! Session lifecycle operations on `TidalDb`.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
|
||||
|
||||
use crate::schema::{EntityId, TidalError, Timestamp};
|
||||
use crate::session::{
|
||||
self as session_mod, AgentId, AuditEntry, SessionHandle, SessionId, SessionInfo,
|
||||
SessionSnapshot, SessionState, SessionSummary,
|
||||
};
|
||||
use crate::storage::{Tag, encode_key};
|
||||
|
||||
use super::TidalDb;
|
||||
|
||||
impl TidalDb {
|
||||
/// Start a new agent session.
|
||||
///
|
||||
/// Creates a session-scoped signal context for the given agent. The
|
||||
/// session is identified by its `SessionId` and keyed to `user_id` and
|
||||
/// `agent_id`. The `policy_name` must match a policy declared via
|
||||
/// `SchemaBuilder::session_policy()`.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Schema` if `policy_name` is not found in the schema.
|
||||
/// - `TidalError::Internal` if no schema was provided at open time.
|
||||
pub fn start_session(
|
||||
&self,
|
||||
user_id: u64,
|
||||
agent_id: &str,
|
||||
policy_name: &str,
|
||||
metadata: HashMap<String, String>,
|
||||
) -> crate::Result<SessionHandle> {
|
||||
// Validate policy exists in schema.
|
||||
let schema = self
|
||||
.schema_def
|
||||
.as_ref()
|
||||
.ok_or_else(|| TidalError::Internal("no schema: open with with_schema()".into()))?;
|
||||
if schema.session_policy(policy_name).is_none() {
|
||||
return Err(TidalError::Internal(format!(
|
||||
"policy '{policy_name}' not found in schema"
|
||||
)));
|
||||
}
|
||||
|
||||
let parsed_agent_id = AgentId::new(agent_id)
|
||||
.map_err(|e| TidalError::Internal(format!("invalid agent_id: {e}")))?;
|
||||
|
||||
let session_id = SessionId::from_raw(self.next_session_id.fetch_add(1, Ordering::Relaxed));
|
||||
|
||||
let closed = Arc::new(AtomicBool::new(false));
|
||||
|
||||
// Capture started_at once -- shared between SessionState and SessionHandle.
|
||||
let started_at = std::time::Instant::now();
|
||||
let started_at_ns = Timestamp::now().as_nanos();
|
||||
|
||||
let state = Arc::new(SessionState {
|
||||
id: session_id,
|
||||
user_id,
|
||||
agent_id: parsed_agent_id.clone(),
|
||||
policy_name: policy_name.to_owned(),
|
||||
started_at,
|
||||
started_at_ns,
|
||||
metadata,
|
||||
signals: dashmap::DashMap::new(),
|
||||
signaled_entities: dashmap::DashMap::new(),
|
||||
annotations: std::sync::Mutex::new(Vec::new()),
|
||||
signals_written: AtomicU64::new(0),
|
||||
signals_rejected: AtomicU64::new(0),
|
||||
audit_log: std::sync::Mutex::new(session_mod::AuditLog::new()),
|
||||
closed: Arc::clone(&closed),
|
||||
});
|
||||
|
||||
self.sessions.insert(session_id, Arc::clone(&state));
|
||||
|
||||
// Persist the start record in non-ephemeral storage so that a crash
|
||||
// during an active session can be detected at next startup.
|
||||
if let Some(storage) = self.storage.as_ref() {
|
||||
let start_bytes = session_mod::serialize_start_record(&state);
|
||||
let start_key = encode_key(EntityId::new(session_id.as_u64()), Tag::Session, b"start");
|
||||
if let Err(e) = storage.items_engine().put(&start_key, &start_bytes) {
|
||||
tracing::warn!(error = %e, session_id = %session_id, "failed to persist session start record");
|
||||
}
|
||||
}
|
||||
|
||||
// Write session start event to the session journal (fire-and-forget).
|
||||
if let Ok(guard) = self.wal.lock()
|
||||
&& let Some(wal) = guard.as_ref()
|
||||
&& let Err(e) = wal.session_start(
|
||||
session_id.as_u64(),
|
||||
user_id,
|
||||
started_at_ns,
|
||||
parsed_agent_id.as_str(),
|
||||
policy_name,
|
||||
)
|
||||
{
|
||||
tracing::warn!(error = %e, "session_start WAL write failed");
|
||||
}
|
||||
|
||||
Ok(SessionHandle {
|
||||
id: session_id,
|
||||
user_id,
|
||||
agent_id: parsed_agent_id,
|
||||
policy_name: policy_name.to_owned(),
|
||||
started_at,
|
||||
closed,
|
||||
})
|
||||
}
|
||||
|
||||
/// Close a session and return a summary.
|
||||
///
|
||||
/// Takes ownership of the `SessionHandle` to prevent use-after-close at
|
||||
/// compile time. The session snapshot is archived to `closed_sessions`.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if the session was already removed (double-close).
|
||||
#[allow(clippy::needless_pass_by_value)] // Intentional: move semantics prevent use-after-close at the type level.
|
||||
pub fn close_session(&self, handle: SessionHandle) -> crate::Result<SessionSummary> {
|
||||
// Mark the handle as closed (runtime defense-in-depth).
|
||||
handle.closed.store(true, Ordering::Release);
|
||||
|
||||
let session_id = handle.id;
|
||||
let (_id, state) = self.sessions.remove(&session_id).ok_or_else(|| {
|
||||
TidalError::Internal(format!("session {session_id} not found (already closed?)"))
|
||||
})?;
|
||||
|
||||
let duration_ms = state.started_at.elapsed().as_millis() as u64;
|
||||
let signals_written = state.signals_written.load(Ordering::Relaxed);
|
||||
let rejections = state.signals_rejected.load(Ordering::Relaxed);
|
||||
|
||||
// Build and archive the frozen snapshot.
|
||||
let snapshot = session_mod::build_frozen_snapshot(&state, duration_ms);
|
||||
|
||||
// Persist the frozen snapshot and remove the start record atomically.
|
||||
if let Some(storage) = self.storage.as_ref() {
|
||||
let snapshot_key = encode_key(
|
||||
EntityId::new(session_id.as_u64()),
|
||||
Tag::Session,
|
||||
b"snapshot",
|
||||
);
|
||||
let start_key = encode_key(EntityId::new(session_id.as_u64()), Tag::Session, b"start");
|
||||
let snapshot_bytes = session_mod::serialize_snapshot(&snapshot);
|
||||
let mut batch = crate::storage::WriteBatch::new();
|
||||
batch.put(snapshot_key, snapshot_bytes);
|
||||
batch.delete(start_key);
|
||||
if let Err(e) = storage.items_engine().write_batch(batch) {
|
||||
tracing::warn!(error = %e, session_id = %session_id, "failed to persist session snapshot");
|
||||
}
|
||||
}
|
||||
|
||||
// Write session close event to the session journal (fire-and-forget).
|
||||
if let Ok(guard) = self.wal.lock()
|
||||
&& let Some(wal) = guard.as_ref()
|
||||
{
|
||||
let _ = wal.session_close(session_id.as_u64());
|
||||
}
|
||||
|
||||
// Evict oldest closed session if the cap is exceeded.
|
||||
if self.closed_sessions.len() >= session_mod::MAX_CLOSED_SESSIONS
|
||||
&& let Some(oldest_key) = self.closed_sessions.iter().map(|e| *e.key()).min()
|
||||
{
|
||||
self.closed_sessions.remove(&oldest_key);
|
||||
}
|
||||
self.closed_sessions.insert(session_id, snapshot);
|
||||
|
||||
tracing::debug!(
|
||||
session_id = %session_id,
|
||||
signals_written,
|
||||
rejections,
|
||||
duration_ms,
|
||||
"session closed"
|
||||
);
|
||||
|
||||
Ok(SessionSummary {
|
||||
id: session_id,
|
||||
duration_ms,
|
||||
signals_written,
|
||||
rejections,
|
||||
})
|
||||
}
|
||||
|
||||
/// List all currently active sessions.
|
||||
#[must_use]
|
||||
pub fn active_sessions(&self) -> Vec<SessionInfo> {
|
||||
self.sessions
|
||||
.iter()
|
||||
.map(|entry| {
|
||||
let s = entry.value();
|
||||
SessionInfo {
|
||||
id: s.id,
|
||||
user_id: s.user_id,
|
||||
agent_id: s.agent_id.as_str().to_owned(),
|
||||
started_at_ns: s.started_at_ns,
|
||||
signals_written: s.signals_written.load(Ordering::Relaxed),
|
||||
}
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// Write a session-scoped signal for an entity.
|
||||
///
|
||||
/// Session signals are tracked in the session's in-memory `SessionHotState`
|
||||
/// with aggressive decay (5-minute half-life by default). They do **not**
|
||||
/// propagate to the global `SignalLedger`; they exist only within this session
|
||||
/// and are archived on `close_session`.
|
||||
///
|
||||
/// If the session has a policy, it is evaluated before the write; rejected
|
||||
/// signals are counted and logged to the audit trail.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if `signal_type` is not in the schema, the session
|
||||
/// is closed, or not found.
|
||||
/// - `TidalError::PolicyViolation` if the policy rejects the signal.
|
||||
/// - `TidalError::SessionExpired` if the session's duration limit is exceeded.
|
||||
#[allow(clippy::significant_drop_tightening)] // state_ref must live for the duration of the method.
|
||||
pub fn session_signal(
|
||||
&self,
|
||||
handle: &SessionHandle,
|
||||
signal_type: &str,
|
||||
entity_id: EntityId,
|
||||
weight: f64,
|
||||
ts: Timestamp,
|
||||
annotation: Option<String>,
|
||||
) -> crate::Result<()> {
|
||||
// Runtime guard: check the closed flag.
|
||||
if handle.closed.load(Ordering::Acquire) {
|
||||
return Err(TidalError::Internal(format!(
|
||||
"session {} is closed",
|
||||
handle.id
|
||||
)));
|
||||
}
|
||||
|
||||
// Validate signal_type exists in the schema.
|
||||
if let Some(ledger) = self.ledger.as_ref()
|
||||
&& ledger.resolve_signal_type(signal_type).is_err()
|
||||
{
|
||||
return Err(TidalError::Internal(format!(
|
||||
"unknown signal type: '{signal_type}'"
|
||||
)));
|
||||
}
|
||||
|
||||
let state_ref = self
|
||||
.sessions
|
||||
.get(&handle.id)
|
||||
.ok_or_else(|| TidalError::Internal(format!("session {} not found", handle.id)))?;
|
||||
let state = state_ref.value();
|
||||
|
||||
// Policy evaluation.
|
||||
if let Some(schema) = &self.schema_def
|
||||
&& let Some(policy) = schema.session_policy(&state.policy_name)
|
||||
{
|
||||
let evaluator = session_mod::PolicyEvaluator::new(policy, &state.policy_name);
|
||||
match evaluator.check(signal_type, state, std::time::Instant::now()) {
|
||||
Ok(()) => {
|
||||
// Record accepted entry via bounded AuditLog.
|
||||
let entry = AuditEntry {
|
||||
timestamp_ns: ts.as_nanos(),
|
||||
signal_type: signal_type.to_owned(),
|
||||
accepted: true,
|
||||
reason: None,
|
||||
};
|
||||
if let Ok(mut log) = state.audit_log.lock() {
|
||||
log.push(entry);
|
||||
}
|
||||
}
|
||||
Err(violation) => {
|
||||
// Record rejected entry via bounded AuditLog.
|
||||
let entry = AuditEntry {
|
||||
timestamp_ns: ts.as_nanos(),
|
||||
signal_type: signal_type.to_owned(),
|
||||
accepted: false,
|
||||
reason: Some(violation.reason.clone()),
|
||||
};
|
||||
if let Ok(mut log) = state.audit_log.lock() {
|
||||
log.push(entry);
|
||||
}
|
||||
state.signals_rejected.fetch_add(1, Ordering::Relaxed);
|
||||
|
||||
// Dispatch on the typed violation kind (no string parsing).
|
||||
return match violation.kind {
|
||||
session_mod::PolicyViolationKind::Expired => {
|
||||
Err(TidalError::SessionExpired {
|
||||
session_id: handle.id.as_u64(),
|
||||
max_duration_secs: policy.max_session_duration.as_secs_f64(),
|
||||
})
|
||||
}
|
||||
_ => Err(TidalError::PolicyViolation {
|
||||
signal_type: violation.signal_type,
|
||||
policy_name: violation.policy_name,
|
||||
reason: violation.reason,
|
||||
}),
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Resolve per-signal-type lambda from schema; fall back to 5-min default.
|
||||
let lambda = self
|
||||
.schema_def
|
||||
.as_ref()
|
||||
.and_then(|s| s.signal(signal_type))
|
||||
.and_then(|def| def.decay().lambda())
|
||||
.unwrap_or(session_mod::DEFAULT_SESSION_LAMBDA);
|
||||
|
||||
// Write to session signal state (CAS decay + windowed count).
|
||||
let ts_ns = ts.as_nanos();
|
||||
let sig_entry = state
|
||||
.signals
|
||||
.entry(signal_type.to_owned())
|
||||
.or_insert_with(|| session_mod::SessionSignalState::new(ts_ns, lambda));
|
||||
sig_entry.on_signal(weight, ts_ns);
|
||||
drop(sig_entry);
|
||||
|
||||
// Track signaled entity.
|
||||
state.signaled_entities.insert(entity_id.as_u64(), ());
|
||||
|
||||
// Store annotation with timestamp (capped at MAX_ANNOTATIONS).
|
||||
let ann_for_wal = annotation.clone();
|
||||
if let Some(ann) = annotation
|
||||
&& let Ok(mut anns) = state.annotations.lock()
|
||||
&& anns.len() < session_mod::MAX_ANNOTATIONS
|
||||
{
|
||||
anns.push((ts_ns, ann));
|
||||
}
|
||||
|
||||
state.signals_written.fetch_add(1, Ordering::Relaxed);
|
||||
|
||||
// Write session signal event to the session journal (fire-and-forget).
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
if let Ok(guard) = self.wal.lock()
|
||||
&& let Some(wal) = guard.as_ref()
|
||||
{
|
||||
let _ = wal.session_signal(
|
||||
handle.id.as_u64(),
|
||||
entity_id.as_u64(),
|
||||
weight as f32,
|
||||
ts_ns,
|
||||
signal_type,
|
||||
ann_for_wal.as_deref(),
|
||||
);
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Retrieve a snapshot of an active or archived session.
|
||||
///
|
||||
/// For active sessions the decay scores are computed at the current
|
||||
/// wall-clock time. For archived sessions the scores are frozen at the
|
||||
/// moment `close_session` was called.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if the session is not found.
|
||||
pub fn session_snapshot(&self, session_id: SessionId) -> crate::Result<SessionSnapshot> {
|
||||
// Try active sessions first.
|
||||
if let Some(state_ref) = self.sessions.get(&session_id) {
|
||||
let state = state_ref.value();
|
||||
let now_ns = Timestamp::now().as_nanos();
|
||||
return Ok(session_mod::build_snapshot(state, now_ns));
|
||||
}
|
||||
|
||||
// Fall back to archived sessions (in-memory cache).
|
||||
if let Some(snap_ref) = self.closed_sessions.get(&session_id) {
|
||||
return Ok(snap_ref.value().clone());
|
||||
}
|
||||
|
||||
// Last resort: check persistent storage for archived snapshots.
|
||||
if let Some(storage) = self.storage.as_ref() {
|
||||
let snapshot_key = encode_key(
|
||||
EntityId::new(session_id.as_u64()),
|
||||
Tag::Session,
|
||||
b"snapshot",
|
||||
);
|
||||
if let Ok(Some(bytes)) = storage.items_engine().get(&snapshot_key)
|
||||
&& let Some(snapshot) = session_mod::deserialize_snapshot(&bytes)
|
||||
{
|
||||
// Warm the in-memory cache to avoid repeated storage reads.
|
||||
self.closed_sessions.insert(session_id, snapshot.clone());
|
||||
return Ok(snapshot);
|
||||
}
|
||||
}
|
||||
|
||||
Err(TidalError::Internal(format!(
|
||||
"session {session_id} not found"
|
||||
)))
|
||||
}
|
||||
|
||||
/// Retrieve the policy audit log for a session.
|
||||
///
|
||||
/// Returns all accept/reject decisions recorded by the policy evaluator
|
||||
/// for the given session.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if the session is not found or the audit log
|
||||
/// mutex is poisoned.
|
||||
pub fn session_audit(&self, session_id: SessionId) -> crate::Result<Vec<AuditEntry>> {
|
||||
// Try active sessions first.
|
||||
if let Some(state_ref) = self.sessions.get(&session_id) {
|
||||
let state = state_ref.value();
|
||||
let log = state
|
||||
.audit_log
|
||||
.lock()
|
||||
.map_err(|_| TidalError::Internal("audit_log mutex poisoned".into()))?;
|
||||
return Ok(log.entries().to_vec());
|
||||
}
|
||||
|
||||
// For archived sessions, return the audit log captured at close time.
|
||||
if let Some(snap_ref) = self.closed_sessions.get(&session_id) {
|
||||
return Ok(snap_ref.value().audit_log.clone());
|
||||
}
|
||||
|
||||
Err(TidalError::Internal(format!(
|
||||
"session {session_id} not found"
|
||||
)))
|
||||
}
|
||||
}
|
||||
271
tidal/src/db/signals.rs
Normal file
271
tidal/src/db/signals.rs
Normal file
@ -0,0 +1,271 @@
|
||||
//! Signal write/read operations on `TidalDb`.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::entities::HardNegIndex;
|
||||
use crate::schema::{EntityId, EntityKind, TidalError, Timestamp, Window};
|
||||
|
||||
use super::TidalDb;
|
||||
use super::metadata::{is_positive_engagement_signal, serialize_metadata};
|
||||
|
||||
impl TidalDb {
|
||||
/// Write (or overwrite) item metadata.
|
||||
///
|
||||
/// Stores the `metadata` key-value map under the entity's `Tag::Meta` key
|
||||
/// in the items storage backend.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired (use `with_schema()`).
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn write_item(
|
||||
&self,
|
||||
id: EntityId,
|
||||
metadata: &HashMap<String, String>,
|
||||
) -> crate::Result<()> {
|
||||
use crate::storage::{Tag, encode_key};
|
||||
|
||||
let storage = self.storage()?;
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
let value = serialize_metadata(metadata);
|
||||
storage
|
||||
.items_engine()
|
||||
.put(&key, &value)
|
||||
.map_err(TidalError::from)
|
||||
}
|
||||
|
||||
/// Record a signal event for an entity.
|
||||
///
|
||||
/// Atomically:
|
||||
/// 1. Appends the event to the WAL (WAL-first durability).
|
||||
/// 2. Updates the in-memory decay score (hot tier).
|
||||
/// 3. Updates the in-memory windowed counter (warm tier).
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no ledger is wired (use `with_schema()`).
|
||||
/// - `TidalError::Schema` if `signal_type` is not defined in the schema.
|
||||
/// - `TidalError::Durability` if the WAL write fails.
|
||||
pub fn signal(
|
||||
&self,
|
||||
signal_type: &str,
|
||||
entity_id: EntityId,
|
||||
weight: f64,
|
||||
timestamp: Timestamp,
|
||||
) -> crate::Result<()> {
|
||||
self.ledger()?
|
||||
.record_signal(signal_type, entity_id, weight, timestamp)
|
||||
}
|
||||
|
||||
/// Read the current decay score for an entity-signal pair.
|
||||
///
|
||||
/// Applies lazy decay from the stored timestamp to the current wall-clock
|
||||
/// time. Returns `None` if no signals have been recorded.
|
||||
///
|
||||
/// `decay_rate_idx` selects the lambda index from the signal definition.
|
||||
/// For exponential signals with one rate, use `0`.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no ledger is wired.
|
||||
/// - `TidalError::Schema` if `signal_type` is not defined.
|
||||
pub fn read_decay_score(
|
||||
&self,
|
||||
entity_id: EntityId,
|
||||
signal_type: &str,
|
||||
decay_rate_idx: usize,
|
||||
) -> crate::Result<Option<f64>> {
|
||||
self.ledger()?
|
||||
.read_decay_score(entity_id, signal_type, decay_rate_idx)
|
||||
}
|
||||
|
||||
/// Read the windowed event count for an entity-signal pair.
|
||||
///
|
||||
/// Returns `0` if no signals have been recorded.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no ledger is wired.
|
||||
/// - `TidalError::Schema` if `signal_type` is not defined.
|
||||
pub fn read_windowed_count(
|
||||
&self,
|
||||
entity_id: EntityId,
|
||||
signal_type: &str,
|
||||
window: Window,
|
||||
) -> crate::Result<u64> {
|
||||
self.ledger()?
|
||||
.read_windowed_count(entity_id, signal_type, window)
|
||||
}
|
||||
|
||||
/// Read the velocity (events per second) for an entity-signal-window.
|
||||
///
|
||||
/// Velocity = `windowed_count / window_duration_seconds`.
|
||||
/// Returns `0.0` for `AllTime` windows or if no signals recorded.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no ledger is wired.
|
||||
/// - `TidalError::Schema` if `signal_type` is not defined.
|
||||
pub fn read_velocity(
|
||||
&self,
|
||||
entity_id: EntityId,
|
||||
signal_type: &str,
|
||||
window: Window,
|
||||
) -> crate::Result<f64> {
|
||||
self.ledger()?.read_velocity(entity_id, signal_type, window)
|
||||
}
|
||||
|
||||
/// Records a signal with user context, updating the interaction ledger, seen state,
|
||||
/// and preference vectors in-memory.
|
||||
///
|
||||
/// In addition to updating the signal ledger, this method:
|
||||
/// 1. Hard negatives: if the signal is skip/hide/dislike/block, records
|
||||
/// a hard negative for the (user, item) pair.
|
||||
/// 2. Interaction weight: if `for_user` is provided, updates the
|
||||
/// (user, creator) interaction weight.
|
||||
/// 3. Seen tracking: if `for_user` is provided, marks the item as seen.
|
||||
/// 4. Preference vector: for positive engagement signals (like, share,
|
||||
/// completion), looks up the item's embedding from durable storage and
|
||||
/// blends it into the user's preference vector via EMA.
|
||||
///
|
||||
/// # Preference vector updates
|
||||
///
|
||||
/// The update triggers when all three conditions are met:
|
||||
/// - The signal type is a positive engagement signal ("like", "share", "completion").
|
||||
/// - `for_user` is `Some` (the acting user is known).
|
||||
/// - The item has a stored embedding in the entity store (written via
|
||||
/// `insert_embedding` or `update_embedding` during item ingestion).
|
||||
///
|
||||
/// The embedding is read from the first Item embedding slot declared in the
|
||||
/// schema. If no schema or no embedding slot is declared, falls back to the
|
||||
/// slot name "content". If the lookup fails (no embedding stored, storage
|
||||
/// error), the preference update is silently skipped -- the base signal is
|
||||
/// still recorded.
|
||||
///
|
||||
/// # Durability
|
||||
///
|
||||
/// The base signal (entity, type, weight, timestamp) is WAL-backed and survives crashes.
|
||||
/// User-context side effects (seen state, interaction weights, preference vector updates)
|
||||
/// are reconstructed from durable storage on restart via `rebuild_entity_state`.
|
||||
/// Hard negatives (hide/block) are durably written via `write_relationship()`.
|
||||
///
|
||||
/// Seen state from regular views/likes is intentionally ephemeral -- users should see
|
||||
/// content again after a restart. Only explicit hides (via `write_relationship` with
|
||||
/// `RelationshipType::Hide`) survive restarts as "seen".
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns errors from the underlying `signal()` method.
|
||||
pub fn signal_with_context(
|
||||
&self,
|
||||
signal_type: &str,
|
||||
entity_id: EntityId,
|
||||
weight: f64,
|
||||
timestamp: Timestamp,
|
||||
for_user: Option<u64>,
|
||||
creator_id: Option<u64>,
|
||||
) -> crate::Result<()> {
|
||||
// Record the base signal.
|
||||
self.signal(signal_type, entity_id, weight, timestamp)?;
|
||||
|
||||
// Signal dispatch: side effects based on signal type and context.
|
||||
if let Some(user_id) = for_user {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
let item_u32 = entity_id.as_u64() as u32;
|
||||
|
||||
// 1. Hard negatives.
|
||||
if HardNegIndex::is_hard_neg_signal(signal_type) {
|
||||
self.hard_negatives.add(user_id, item_u32);
|
||||
}
|
||||
|
||||
// 2. Seen tracking.
|
||||
self.user_state.mark_seen(user_id, item_u32);
|
||||
|
||||
// 3. Interaction weight: if creator is known, update the
|
||||
// (user, creator) interaction strength.
|
||||
if let Some(cid) = creator_id {
|
||||
self.interaction_ledger
|
||||
.record(user_id, cid, weight, timestamp.as_nanos());
|
||||
}
|
||||
|
||||
// 4. Preference vector: for positive engagement signals, look up
|
||||
// the item's embedding and blend into the user's taste vector.
|
||||
if is_positive_engagement_signal(signal_type) {
|
||||
self.try_update_preference_vector(user_id, entity_id);
|
||||
}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Attempt to update a user's preference vector from the item's stored embedding.
|
||||
///
|
||||
/// Reads the item's embedding from durable storage (entity store) and blends it
|
||||
/// into the user's preference vector via `PreferenceVectors::update()`. This is
|
||||
/// a best-effort operation: if the item has no embedding, no storage is wired, or
|
||||
/// the embedding cannot be deserialized, the update is silently skipped.
|
||||
///
|
||||
/// The slot name is determined by the schema's first Item embedding slot, falling
|
||||
/// back to "content" if no schema is available.
|
||||
pub(super) fn try_update_preference_vector(&self, user_id: u64, entity_id: EntityId) {
|
||||
// Determine which embedding slot to read.
|
||||
let slot_name = self
|
||||
.schema_def
|
||||
.as_ref()
|
||||
.and_then(|s| {
|
||||
s.embedding_slots()
|
||||
.iter()
|
||||
.find(|slot| slot.entity_kind == EntityKind::Item)
|
||||
.map(|slot| slot.name.as_str())
|
||||
})
|
||||
.unwrap_or("content");
|
||||
|
||||
// Read the item's embedding from durable storage.
|
||||
let Some(storage) = self.storage.as_ref() else {
|
||||
return;
|
||||
};
|
||||
let key = crate::storage::vector::embedding_store_key(entity_id, slot_name);
|
||||
let embedding_bytes = match storage.items_engine().get(&key) {
|
||||
Ok(Some(bytes)) => bytes,
|
||||
Ok(None) => {
|
||||
tracing::debug!(
|
||||
entity_id = entity_id.as_u64(),
|
||||
slot = slot_name,
|
||||
"preference vector update skipped: item has no stored embedding"
|
||||
);
|
||||
return;
|
||||
}
|
||||
Err(e) => {
|
||||
tracing::debug!(
|
||||
entity_id = entity_id.as_u64(),
|
||||
error = %e,
|
||||
"preference vector update skipped: storage read failed"
|
||||
);
|
||||
return;
|
||||
}
|
||||
};
|
||||
|
||||
// Deserialize the embedding.
|
||||
let embedding = match crate::storage::vector::deserialize_embedding(&embedding_bytes) {
|
||||
Ok(v) => v,
|
||||
Err(e) => {
|
||||
tracing::debug!(
|
||||
entity_id = entity_id.as_u64(),
|
||||
error = %e,
|
||||
"preference vector update skipped: embedding deserialization failed"
|
||||
);
|
||||
return;
|
||||
}
|
||||
};
|
||||
|
||||
// Blend into the user's preference vector.
|
||||
if !self.preference_vectors.update(user_id, &embedding) {
|
||||
tracing::debug!(
|
||||
user_id,
|
||||
entity_id = entity_id.as_u64(),
|
||||
embedding_dim = embedding.len(),
|
||||
"preference vector update skipped: dimension mismatch"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
155
tidal/src/db/state_rebuild.rs
Normal file
155
tidal/src/db/state_rebuild.rs
Normal file
@ -0,0 +1,155 @@
|
||||
//! Entity state rebuild from durable storage and periodic checkpoint thread.
|
||||
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
|
||||
use std::time::Duration;
|
||||
|
||||
use crate::schema::{TidalError, Timestamp};
|
||||
use crate::signals::SignalLedger;
|
||||
use crate::storage::{StorageEngine, Tag};
|
||||
|
||||
use super::metadata::deserialize_metadata;
|
||||
use super::storage_box::StorageBox;
|
||||
|
||||
/// Rebuild in-memory entity state from durable storage on restart.
|
||||
///
|
||||
/// Scans the users keyspace for relationship edges and the items keyspace for
|
||||
/// `creator_id` metadata. Populates:
|
||||
/// 1. `user_state.blocked` from `RelationshipType::Blocks` edges
|
||||
/// 2. `user_state.seen` (hidden items) from `RelationshipType::Hide` edges
|
||||
/// 3. `user_state.follows` from `RelationshipType::Follows` edges
|
||||
/// 4. `creator_items` bitmap from items with `creator_id` metadata
|
||||
/// 5. `interaction_ledger` from `RelationshipType::InteractionWeight` edges
|
||||
///
|
||||
/// For ephemeral mode, all engines are empty, so this is effectively a no-op.
|
||||
pub(super) fn rebuild_entity_state(
|
||||
storage: &StorageBox,
|
||||
user_state: &crate::entities::UserStateIndex,
|
||||
creator_items: &crate::entities::CreatorItemsBitmap,
|
||||
interaction_ledger: &crate::entities::InteractionLedger,
|
||||
) -> crate::Result<()> {
|
||||
use crate::entities::relationship::{
|
||||
RelationshipType, deserialize_relationship_value, parse_relationship_to,
|
||||
};
|
||||
use crate::storage::keys::parse_key;
|
||||
|
||||
// Scan the users keyspace for all relationship edges.
|
||||
// The relationship key format is:
|
||||
// [from_entity_id: 8 BE][0x00][Tag::Rel (0x04)][rel_type: 1][to_entity_id: 8 BE]
|
||||
// We scan with an empty prefix to get all keys, then filter for Tag::Rel.
|
||||
let mut rel_count = 0u64;
|
||||
for entry in storage.users_engine().scan_prefix(&[]) {
|
||||
let (key, value) = entry.map_err(TidalError::from)?;
|
||||
|
||||
// Only process relationship keys (Tag::Rel = 0x04).
|
||||
if let Some((from_id, Tag::Rel, suffix)) = parse_key(&key) {
|
||||
// suffix = [rel_type: 1 byte][to_entity_id: 8 BE]
|
||||
if suffix.is_empty() {
|
||||
continue;
|
||||
}
|
||||
let rel_type_byte = suffix[0];
|
||||
let Some(rel_type) = RelationshipType::from_byte(rel_type_byte) else {
|
||||
continue;
|
||||
};
|
||||
let Some(to_id) = parse_relationship_to(&key) else {
|
||||
continue;
|
||||
};
|
||||
let from_id_u64 = from_id.as_u64();
|
||||
|
||||
match rel_type {
|
||||
RelationshipType::Blocks => {
|
||||
user_state.add_block_creator(from_id_u64, to_id.as_u64());
|
||||
rel_count += 1;
|
||||
}
|
||||
RelationshipType::Hide => {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
user_state.add_hide(from_id_u64, to_id.as_u64() as u32);
|
||||
rel_count += 1;
|
||||
}
|
||||
RelationshipType::Follows => {
|
||||
user_state.add_follow(from_id_u64, to_id.as_u64());
|
||||
rel_count += 1;
|
||||
}
|
||||
RelationshipType::InteractionWeight => {
|
||||
// Reconstruct interaction weight from the stored edge value.
|
||||
if let Some((weight, ts_nanos)) = deserialize_relationship_value(&value) {
|
||||
interaction_ledger.record(from_id_u64, to_id.as_u64(), weight, ts_nanos);
|
||||
rel_count += 1;
|
||||
}
|
||||
}
|
||||
RelationshipType::Mute => {
|
||||
// Mute edges do not have in-memory state (yet).
|
||||
rel_count += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Scan items keyspace for creator_id metadata to rebuild creator_items bitmap.
|
||||
let mut item_count = 0u64;
|
||||
for entry in storage.items_engine().scan_prefix(&[]) {
|
||||
let (key, value) = entry.map_err(TidalError::from)?;
|
||||
|
||||
if let Some((entity_id, Tag::Meta, _suffix)) = parse_key(&key) {
|
||||
let meta = deserialize_metadata(&value);
|
||||
if let Some(creator_str) = meta.get("creator_id")
|
||||
&& let Ok(creator_id) = creator_str.parse::<u64>()
|
||||
{
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
creator_items.add_item(creator_id, entity_id.as_u64() as u32);
|
||||
item_count += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if rel_count > 0 || item_count > 0 {
|
||||
tracing::info!(
|
||||
relationships = rel_count,
|
||||
creator_items = item_count,
|
||||
"entity state rebuilt from durable storage"
|
||||
);
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Background thread body: checkpoint signal state to storage every 30 seconds.
|
||||
///
|
||||
/// Polls the shutdown flag every 500ms so the thread exits promptly when
|
||||
/// `shutdown_inner()` is called. Only runs in persistent mode (ephemeral opens
|
||||
/// never spawn this thread).
|
||||
///
|
||||
/// The `Arc` arguments are intentionally passed by value: the thread must own
|
||||
/// them for its entire lifetime (references cannot satisfy the `'static` bound
|
||||
/// required by `std::thread::spawn`).
|
||||
#[allow(clippy::needless_pass_by_value)]
|
||||
pub(super) fn run_checkpoint_thread(
|
||||
shutdown: Arc<AtomicBool>,
|
||||
ledger: Arc<SignalLedger>,
|
||||
storage: Box<dyn StorageEngine + Send + Sync>,
|
||||
last_wal_seq: Arc<AtomicU64>,
|
||||
) {
|
||||
const CHECKPOINT_INTERVAL: Duration = Duration::from_secs(30);
|
||||
const POLL_INTERVAL: Duration = Duration::from_millis(500);
|
||||
|
||||
let mut elapsed = Duration::ZERO;
|
||||
loop {
|
||||
std::thread::sleep(POLL_INTERVAL);
|
||||
if shutdown.load(Ordering::Acquire) {
|
||||
break;
|
||||
}
|
||||
elapsed += POLL_INTERVAL;
|
||||
if elapsed >= CHECKPOINT_INTERVAL {
|
||||
elapsed = Duration::ZERO;
|
||||
let meta = crate::signals::checkpoint::CheckpointMeta {
|
||||
checkpoint_time_ns: Timestamp::now().as_nanos(),
|
||||
wal_sequence: last_wal_seq.load(Ordering::Relaxed),
|
||||
};
|
||||
if let Err(e) = ledger.checkpoint(storage.as_ref(), meta) {
|
||||
tracing::error!(error = %e, "periodic checkpoint failed");
|
||||
} else {
|
||||
tracing::debug!("periodic checkpoint written");
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
53
tidal/src/db/storage_box.rs
Normal file
53
tidal/src/db/storage_box.rs
Normal file
@ -0,0 +1,53 @@
|
||||
//! Storage abstraction: routes to the correct backend by entity kind.
|
||||
|
||||
use crate::schema::{EntityKind, TidalError};
|
||||
use crate::storage::{InMemoryBackend, StorageEngine};
|
||||
|
||||
/// Wraps either in-memory backends (ephemeral mode) or a fjall storage
|
||||
/// (persistent mode) behind a uniform interface.
|
||||
///
|
||||
/// M3 provides three backends: items, users, creators. In ephemeral mode
|
||||
/// each is an independent `InMemoryBackend`; in persistent mode they share
|
||||
/// a single `FjallStorage` with three keyspaces.
|
||||
pub enum StorageBox {
|
||||
Memory {
|
||||
items: InMemoryBackend,
|
||||
users: InMemoryBackend,
|
||||
creators: InMemoryBackend,
|
||||
},
|
||||
Fjall(crate::storage::FjallStorage),
|
||||
}
|
||||
|
||||
impl StorageBox {
|
||||
/// Reference to the items storage engine.
|
||||
pub(super) fn items_engine(&self) -> &dyn StorageEngine {
|
||||
match self {
|
||||
Self::Memory { items, .. } => items,
|
||||
Self::Fjall(f) => f.backend(EntityKind::Item),
|
||||
}
|
||||
}
|
||||
|
||||
/// Reference to the users storage engine.
|
||||
pub(super) fn users_engine(&self) -> &dyn StorageEngine {
|
||||
match self {
|
||||
Self::Memory { users, .. } => users,
|
||||
Self::Fjall(f) => f.backend(EntityKind::User),
|
||||
}
|
||||
}
|
||||
|
||||
/// Reference to the creators storage engine.
|
||||
pub(super) fn creators_engine(&self) -> &dyn StorageEngine {
|
||||
match self {
|
||||
Self::Memory { creators, .. } => creators,
|
||||
Self::Fjall(f) => f.backend(EntityKind::Creator),
|
||||
}
|
||||
}
|
||||
|
||||
/// Flush all buffered writes to durable storage.
|
||||
pub(super) fn flush(&self) -> crate::Result<()> {
|
||||
match self {
|
||||
Self::Memory { .. } => Ok(()),
|
||||
Self::Fjall(f) => f.flush_all().map_err(TidalError::from),
|
||||
}
|
||||
}
|
||||
}
|
||||
57
tidal/src/db/users.rs
Normal file
57
tidal/src/db/users.rs
Normal file
@ -0,0 +1,57 @@
|
||||
//! User entity write/read operations on `TidalDb`.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::schema::{EntityId, TidalError};
|
||||
use crate::storage::{Tag, encode_key};
|
||||
|
||||
use super::TidalDb;
|
||||
|
||||
impl TidalDb {
|
||||
/// Write (or overwrite) a user entity.
|
||||
///
|
||||
/// Stores metadata and optional embedding under the user's `Tag::Meta` key
|
||||
/// in the users storage backend.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn write_user(
|
||||
&self,
|
||||
id: EntityId,
|
||||
metadata: &HashMap<String, String>,
|
||||
) -> crate::Result<()> {
|
||||
let storage = self.storage()?;
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
let value = crate::entities::serialize_entity(None, metadata);
|
||||
storage
|
||||
.users_engine()
|
||||
.put(&key, &value)
|
||||
.map_err(TidalError::from)
|
||||
}
|
||||
|
||||
/// Read user metadata for a given entity ID.
|
||||
///
|
||||
/// Returns `None` if the user does not exist in storage.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Internal` if no storage backend is wired.
|
||||
/// - `TidalError::Storage` on storage engine failure.
|
||||
pub fn get_user_metadata(
|
||||
&self,
|
||||
id: EntityId,
|
||||
) -> crate::Result<Option<HashMap<String, String>>> {
|
||||
let storage = self.storage()?;
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
match storage.users_engine().get(&key) {
|
||||
Ok(Some(bytes)) => {
|
||||
let (_emb, meta) = crate::entities::deserialize_entity(&bytes);
|
||||
Ok(Some(meta))
|
||||
}
|
||||
Ok(None) => Ok(None),
|
||||
Err(e) => Err(TidalError::from(e)),
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -184,8 +184,8 @@ mod tests {
|
||||
#[test]
|
||||
fn update_blends() {
|
||||
let pv = PreferenceVectors::with_learning_rate(2, 0.5);
|
||||
pv.set(1, vec![1.0, 0.0]);
|
||||
pv.update(1, &[0.0, 1.0]);
|
||||
let _ = pv.set(1, vec![1.0, 0.0]);
|
||||
let _ = pv.update(1, &[0.0, 1.0]);
|
||||
let v = pv.get(1).unwrap();
|
||||
// After blend: (0.5, 0.5), normalized: (1/sqrt(2), 1/sqrt(2))
|
||||
let expected = 1.0 / 2.0f32.sqrt();
|
||||
@ -196,7 +196,7 @@ mod tests {
|
||||
#[test]
|
||||
fn cosine_similarity_normalized() {
|
||||
let pv = PreferenceVectors::new(3);
|
||||
pv.set(1, vec![1.0, 0.0, 0.0]);
|
||||
let _ = pv.set(1, vec![1.0, 0.0, 0.0]);
|
||||
// Cosine with self = 1.0
|
||||
let sim = pv.cosine_similarity(1, &[1.0, 0.0, 0.0]).unwrap();
|
||||
assert!((sim - 1.0).abs() < 1e-6);
|
||||
@ -223,7 +223,7 @@ mod tests {
|
||||
let pv = PreferenceVectors::new(3);
|
||||
assert!(pv.is_empty());
|
||||
assert_eq!(pv.len(), 0);
|
||||
pv.set(1, vec![1.0, 0.0, 0.0]);
|
||||
let _ = pv.set(1, vec![1.0, 0.0, 0.0]);
|
||||
assert!(!pv.is_empty());
|
||||
assert_eq!(pv.len(), 1);
|
||||
}
|
||||
@ -243,7 +243,7 @@ mod tests {
|
||||
) {
|
||||
let pv = PreferenceVectors::new(4);
|
||||
for emb in &updates {
|
||||
pv.update(1, emb);
|
||||
let _ = pv.update(1, emb);
|
||||
}
|
||||
let v = pv.get(1).unwrap();
|
||||
let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
|
||||
|
||||
@ -6,6 +6,7 @@ pub mod schema;
|
||||
pub mod session;
|
||||
pub mod signals;
|
||||
pub mod storage;
|
||||
pub mod text;
|
||||
pub mod wal;
|
||||
|
||||
/// Build hash compiled in from the `GIT_HASH` environment variable.
|
||||
@ -32,7 +33,7 @@ pub use db::{Config, ConfigError, Paths, StorageMode, TidalDb, TidalDbBuilder};
|
||||
pub use schema::{AgentPolicy, TidalError};
|
||||
pub use session::{
|
||||
AgentId, AuditEntry, SessionContext, SessionHandle, SessionId, SessionInfo, SessionSnapshot,
|
||||
SessionSummary,
|
||||
SessionSummary, SignalSnapEntry,
|
||||
};
|
||||
|
||||
/// Crate-wide result type. All public API methods return `Result<T, TidalError>`.
|
||||
|
||||
242
tidal/src/query/executor/candidate_gen.rs
Normal file
242
tidal/src/query/executor/candidate_gen.rs
Normal file
@ -0,0 +1,242 @@
|
||||
//! Candidate generation strategies for the RETRIEVE executor.
|
||||
//!
|
||||
//! Contains the scan, signal-ranked, and exploration injection strategies
|
||||
//! used in Stage 1 of the RETRIEVE pipeline.
|
||||
|
||||
use std::collections::HashSet;
|
||||
use std::sync::RwLock;
|
||||
|
||||
use roaring::RoaringBitmap;
|
||||
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
use crate::schema::{EntityId, Timestamp};
|
||||
use crate::signals::SignalLedger;
|
||||
|
||||
/// Scan the universe bitmap for all entity IDs.
|
||||
///
|
||||
/// When `has_user_context` is true, uses a larger multiplier (10x) to
|
||||
/// ensure enough candidates survive user-context filtering (seen, blocked,
|
||||
/// hard negatives).
|
||||
pub(crate) fn scan_candidates(
|
||||
universe: Option<&RwLock<RoaringBitmap>>,
|
||||
limit: usize,
|
||||
has_user_context: bool,
|
||||
) -> Vec<EntityId> {
|
||||
let multiplier = if has_user_context { 10 } else { 4 };
|
||||
let max_candidates = (limit * multiplier).max(200);
|
||||
let mut candidates = Vec::with_capacity(max_candidates);
|
||||
|
||||
// Read-lock the universe bitmap.
|
||||
if let Some(universe) = universe
|
||||
&& let Ok(bm) = universe.read()
|
||||
{
|
||||
for id_u32 in bm.iter() {
|
||||
candidates.push(EntityId::new(u64::from(id_u32)));
|
||||
if candidates.len() >= max_candidates {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
candidates
|
||||
}
|
||||
|
||||
/// Rank candidates by a specific signal's decay score.
|
||||
pub(crate) fn signal_ranked_candidates(
|
||||
ledger: &SignalLedger,
|
||||
signal_name: &str,
|
||||
limit: usize,
|
||||
) -> Vec<EntityId> {
|
||||
let max_candidates = (limit * 4).max(200);
|
||||
let Ok(type_id) = ledger.resolve_signal_type(signal_name) else {
|
||||
return Vec::new();
|
||||
};
|
||||
|
||||
let now_ns = Timestamp::now().as_nanos();
|
||||
let mut scored: Vec<(EntityId, f64)> = Vec::new();
|
||||
|
||||
for entry in ledger.entries() {
|
||||
let (entity_id, signal_type_id) = entry.key();
|
||||
if *signal_type_id == type_id {
|
||||
// Use decay score at index 0 with lambda=0 (no additional decay
|
||||
// beyond what was already applied at write time). This is a
|
||||
// simplified ranking for candidate generation -- the full scoring
|
||||
// happens in Stage 3 via ProfileExecutor.
|
||||
let score = entry.value().hot.current_score(0, now_ns, 0.0);
|
||||
scored.push((*entity_id, score));
|
||||
}
|
||||
}
|
||||
|
||||
// Sort by score descending. NaN scores (from degenerate decay math) fall
|
||||
// back to entity-ID order for deterministic, stable output.
|
||||
scored.sort_by(|a, b| {
|
||||
b.1.partial_cmp(&a.1)
|
||||
.unwrap_or_else(|| b.0.as_u64().cmp(&a.0.as_u64()))
|
||||
});
|
||||
|
||||
scored
|
||||
.into_iter()
|
||||
.take(max_candidates)
|
||||
.map(|(id, _)| id)
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// Inject exploration candidates into the scored list.
|
||||
///
|
||||
/// Reserves `exploration_fraction` of the result set for candidates
|
||||
/// not in the top-scored results. Used by `for_you` to prevent filter
|
||||
/// bubbles and expose cold-start content.
|
||||
///
|
||||
/// Exploration candidates are appended at the end of the scored list
|
||||
/// with score 0.0 (lowest). The selection is deterministic: a hash of
|
||||
/// the candidate set size + first candidate ID seeds the order.
|
||||
pub(crate) fn inject_exploration(
|
||||
scored: &mut Vec<ScoredCandidate>,
|
||||
all_candidates: &[EntityId],
|
||||
exploration_fraction: f64,
|
||||
) {
|
||||
if scored.is_empty() || all_candidates.is_empty() || exploration_fraction <= 0.0 {
|
||||
return;
|
||||
}
|
||||
|
||||
#[allow(
|
||||
clippy::cast_possible_truncation,
|
||||
clippy::cast_sign_loss,
|
||||
clippy::cast_precision_loss
|
||||
)]
|
||||
let exploration_slots = (exploration_fraction * scored.len() as f64).ceil() as usize;
|
||||
if exploration_slots == 0 {
|
||||
return;
|
||||
}
|
||||
|
||||
// Collect entity IDs already in the scored set.
|
||||
let scored_ids: HashSet<u64> = scored.iter().map(|c| c.entity_id.as_u64()).collect();
|
||||
|
||||
// Find candidates not in the scored set.
|
||||
let mut exploration_pool: Vec<EntityId> = all_candidates
|
||||
.iter()
|
||||
.filter(|id| !scored_ids.contains(&id.as_u64()))
|
||||
.copied()
|
||||
.collect();
|
||||
|
||||
if exploration_pool.is_empty() {
|
||||
return;
|
||||
}
|
||||
|
||||
// Deterministic shuffle using BLAKE3 hash of candidate IDs.
|
||||
// This is reproducible for the same candidate set.
|
||||
exploration_pool.sort_by(|a, b| {
|
||||
let hash_a = blake3::hash(&a.as_u64().to_le_bytes());
|
||||
let hash_b = blake3::hash(&b.as_u64().to_le_bytes());
|
||||
hash_a.as_bytes().cmp(hash_b.as_bytes())
|
||||
});
|
||||
|
||||
let actual_slots = exploration_slots.min(exploration_pool.len());
|
||||
|
||||
// Trim scored to make room for exploration candidates.
|
||||
let keep = scored.len().saturating_sub(actual_slots);
|
||||
scored.truncate(keep);
|
||||
|
||||
// Append exploration candidates with score 0.0.
|
||||
for &entity_id in exploration_pool.iter().take(actual_slots) {
|
||||
scored.push(ScoredCandidate {
|
||||
entity_id,
|
||||
score: 0.0,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
use crate::schema::EntityId;
|
||||
|
||||
#[test]
|
||||
fn exploration_injects_random_candidates() {
|
||||
// Scored list: entities 1-5 (the "top ranked" items).
|
||||
let mut scored: Vec<ScoredCandidate> = (1..=5)
|
||||
.map(|i| ScoredCandidate {
|
||||
entity_id: EntityId::new(i),
|
||||
score: 1.0 - (i as f64 * 0.1),
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
})
|
||||
.collect();
|
||||
|
||||
// Full candidate list: entities 1-10 (so 6-10 are available for exploration).
|
||||
let all_candidates: Vec<EntityId> = (1..=10).map(EntityId::new).collect();
|
||||
|
||||
// 40% exploration -> ceil(0.4 * 5) = 2 exploration slots.
|
||||
inject_exploration(&mut scored, &all_candidates, 0.4);
|
||||
|
||||
assert_eq!(scored.len(), 5, "total output should remain 5");
|
||||
|
||||
// At least one of the exploration candidates (6-10) should be present.
|
||||
let exploration_ids: Vec<u64> = scored
|
||||
.iter()
|
||||
.filter(|c| c.entity_id.as_u64() > 5)
|
||||
.map(|c| c.entity_id.as_u64())
|
||||
.collect();
|
||||
assert_eq!(
|
||||
exploration_ids.len(),
|
||||
2,
|
||||
"expected 2 exploration candidates, got {}",
|
||||
exploration_ids.len()
|
||||
);
|
||||
|
||||
// Exploration candidates should have score 0.0.
|
||||
for c in &scored {
|
||||
if c.entity_id.as_u64() > 5 {
|
||||
assert_eq!(c.score, 0.0, "exploration candidates should have score 0.0");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn exploration_zero_fraction_is_noop() {
|
||||
let mut scored: Vec<ScoredCandidate> = (1..=5)
|
||||
.map(|i| ScoredCandidate {
|
||||
entity_id: EntityId::new(i),
|
||||
score: 1.0,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
})
|
||||
.collect();
|
||||
|
||||
let all: Vec<EntityId> = (1..=10).map(EntityId::new).collect();
|
||||
inject_exploration(&mut scored, &all, 0.0);
|
||||
|
||||
// No change when exploration fraction is 0.
|
||||
assert_eq!(scored.len(), 5);
|
||||
assert!(scored.iter().all(|c| c.entity_id.as_u64() <= 5));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn exploration_no_extra_candidates_is_noop() {
|
||||
// All candidates are already scored -- no exploration pool.
|
||||
let mut scored: Vec<ScoredCandidate> = (1..=5)
|
||||
.map(|i| ScoredCandidate {
|
||||
entity_id: EntityId::new(i),
|
||||
score: 1.0,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
})
|
||||
.collect();
|
||||
|
||||
let all: Vec<EntityId> = (1..=5).map(EntityId::new).collect();
|
||||
inject_exploration(&mut scored, &all, 0.5);
|
||||
|
||||
// No change when there are no extra candidates for exploration.
|
||||
assert_eq!(scored.len(), 5);
|
||||
}
|
||||
}
|
||||
@ -11,17 +11,22 @@
|
||||
//! The executor borrows all infrastructure by reference and is constructed
|
||||
//! per-query by `TidalDb::retrieve()`.
|
||||
|
||||
pub mod candidate_gen;
|
||||
pub mod personalization;
|
||||
pub mod user_filter;
|
||||
|
||||
use std::collections::{HashMap, HashSet};
|
||||
use std::sync::RwLock;
|
||||
|
||||
use roaring::RoaringBitmap;
|
||||
|
||||
use crate::db::deserialize_metadata as deserialize_item_metadata;
|
||||
use crate::entities::{
|
||||
CreatorItemsBitmap, HardNegIndex, InteractionLedger, PreferenceVectors, UserStateIndex,
|
||||
};
|
||||
use crate::query::retrieve::{Cursor, QueryError, Results, Retrieve, RetrieveResult, Signal};
|
||||
use crate::ranking::diversity::DiversitySelector;
|
||||
use crate::ranking::executor::{ProfileExecutor, ScoredCandidate, UserContext};
|
||||
use crate::ranking::executor::ProfileExecutor;
|
||||
use crate::ranking::profile::CandidateStrategy;
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
use crate::schema::{EntityId, Timestamp};
|
||||
@ -31,42 +36,7 @@ use crate::storage::indexes::bitmap::BitmapIndex;
|
||||
use crate::storage::indexes::filter::{FilterEvaluator, FilterExpr, FilterResult};
|
||||
use crate::storage::indexes::range::RangeIndex;
|
||||
use crate::storage::vector::registry::EmbeddingSlotRegistry;
|
||||
|
||||
// ── User-State Filter Extraction ─────────────────────────────────────────────
|
||||
|
||||
/// Extract user-state filter variants from a filter expression tree.
|
||||
///
|
||||
/// Walks the AST and collects references to `Saved`, `Liked`, and `InProgress`
|
||||
/// nodes. These variants pass through the `FilterEvaluator` as "return full
|
||||
/// universe" and must be applied as inclusion filters in Stage 2.5 of the
|
||||
/// executor pipeline, where `UserStateIndex` is available.
|
||||
///
|
||||
/// AND/OR/NOT nodes are recursed into; leaf nodes that are not user-state
|
||||
/// filters are ignored (they are already handled by the `FilterEvaluator`).
|
||||
fn extract_user_state_filters(expr: &FilterExpr) -> Vec<&FilterExpr> {
|
||||
let mut result = Vec::new();
|
||||
collect_user_state_filters(expr, &mut result);
|
||||
result
|
||||
}
|
||||
|
||||
fn collect_user_state_filters<'a>(expr: &'a FilterExpr, out: &mut Vec<&'a FilterExpr>) {
|
||||
match expr {
|
||||
FilterExpr::Saved(_) | FilterExpr::Liked(_) | FilterExpr::InProgress { .. } => {
|
||||
out.push(expr);
|
||||
}
|
||||
FilterExpr::And(children) | FilterExpr::Or(children) => {
|
||||
for child in children {
|
||||
collect_user_state_filters(child, out);
|
||||
}
|
||||
}
|
||||
FilterExpr::Not(inner) => {
|
||||
collect_user_state_filters(inner, out);
|
||||
}
|
||||
// All other variants (CategoryEq, FormatEq, etc.) are handled by
|
||||
// FilterEvaluator and do not need extraction here.
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
use crate::storage::{StorageEngine, Tag, encode_key};
|
||||
|
||||
// ── Executor ────────────────────────────────────────────────────────────────
|
||||
|
||||
@ -106,6 +76,8 @@ pub struct RetrieveExecutor<'a> {
|
||||
// ── M4 session context ────────────────────────────────────────────
|
||||
session_context: Option<SessionContext>,
|
||||
session_snapshot: Option<SessionSnapshot>,
|
||||
/// Items storage engine for metadata lookup during session-boosted scoring.
|
||||
items_storage: Option<&'a dyn StorageEngine>,
|
||||
}
|
||||
|
||||
impl<'a> RetrieveExecutor<'a> {
|
||||
@ -142,9 +114,17 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
preference_vectors: None,
|
||||
session_context: None,
|
||||
session_snapshot: None,
|
||||
items_storage: None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Attach items storage for metadata lookup during session-boosted scoring.
|
||||
#[must_use]
|
||||
pub fn with_items_storage(mut self, storage: &'a dyn StorageEngine) -> Self {
|
||||
self.items_storage = Some(storage);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attach M3 personalization context to the executor.
|
||||
#[must_use]
|
||||
pub const fn with_user_context(
|
||||
@ -216,15 +196,17 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
// ── Stage 1: Candidate Generation ───────────────────────────────
|
||||
let has_user_context = query.for_user.is_some();
|
||||
let mut candidates = match &profile.candidate_strategy {
|
||||
CandidateStrategy::Scan { .. } => self.scan_candidates(query.limit, has_user_context),
|
||||
CandidateStrategy::Scan { .. } => {
|
||||
candidate_gen::scan_candidates(self.universe, query.limit, has_user_context)
|
||||
}
|
||||
CandidateStrategy::SignalRanked { signal, .. } => {
|
||||
self.signal_ranked_candidates(signal, query.limit)
|
||||
candidate_gen::signal_ranked_candidates(self.ledger, signal, query.limit)
|
||||
}
|
||||
CandidateStrategy::Ann { .. } => {
|
||||
// ANN candidate strategy falls back to scan with a warning.
|
||||
warnings
|
||||
.push("ANN candidate strategy not yet wired; falling back to scan".to_string());
|
||||
self.scan_candidates(query.limit, has_user_context)
|
||||
candidate_gen::scan_candidates(self.universe, query.limit, has_user_context)
|
||||
}
|
||||
CandidateStrategy::Relationship => {
|
||||
// M3: source candidates from the user's followed creators.
|
||||
@ -239,7 +221,7 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
"Relationship strategy: user follows no creators; falling back to scan"
|
||||
.to_string(),
|
||||
);
|
||||
self.scan_candidates(query.limit, true)
|
||||
candidate_gen::scan_candidates(self.universe, query.limit, true)
|
||||
} else {
|
||||
let bitmap = creator_items.union_for(&followed);
|
||||
if bitmap.is_empty() {
|
||||
@ -247,7 +229,7 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
"Relationship strategy: followed creators have no items; falling back to scan"
|
||||
.to_string(),
|
||||
);
|
||||
self.scan_candidates(query.limit, true)
|
||||
candidate_gen::scan_candidates(self.universe, query.limit, true)
|
||||
} else {
|
||||
bitmap
|
||||
.iter()
|
||||
@ -260,7 +242,7 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
"Relationship strategy requires FOR USER clause; falling back to scan"
|
||||
.to_string(),
|
||||
);
|
||||
self.scan_candidates(query.limit, false)
|
||||
candidate_gen::scan_candidates(self.universe, query.limit, false)
|
||||
}
|
||||
}
|
||||
other => {
|
||||
@ -375,7 +357,7 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
// These are INCLUSION filters: retain only matching items, unlike
|
||||
// the exclusion filters above that remove seen/blocked/hard-neg.
|
||||
if let Some(ref filter_expr) = query.combined_filter() {
|
||||
let user_filters = extract_user_state_filters(filter_expr);
|
||||
let user_filters = user_filter::extract_user_state_filters(filter_expr);
|
||||
for uf in &user_filters {
|
||||
match uf {
|
||||
FilterExpr::Saved(uid) => {
|
||||
@ -423,26 +405,59 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
let now = Timestamp::now();
|
||||
let executor = ProfileExecutor::new(self.ledger);
|
||||
|
||||
// Pre-load item metadata for keyword hint matching when session is active.
|
||||
// Only performed when both session context and items storage are present.
|
||||
let item_metadata: HashMap<u64, HashMap<String, String>> = if self.session_context.is_some()
|
||||
{
|
||||
self.items_storage.map_or_else(HashMap::new, |storage| {
|
||||
candidates
|
||||
.iter()
|
||||
.filter_map(|&eid| {
|
||||
let key = encode_key(eid, Tag::Meta, b"");
|
||||
storage
|
||||
.get(&key)
|
||||
.ok()
|
||||
.flatten()
|
||||
.map(|bytes| (eid.as_u64(), deserialize_item_metadata(&bytes)))
|
||||
})
|
||||
.collect()
|
||||
})
|
||||
} else {
|
||||
HashMap::new()
|
||||
};
|
||||
|
||||
#[allow(clippy::option_if_let_else)]
|
||||
let mut scored = if let Some(user_id) = query.for_user {
|
||||
// Personalized scoring: build UserContext from the interaction ledger
|
||||
// and creator-items bitmap, then call score_personalized().
|
||||
let user_ctx = self.build_user_context(user_id, now);
|
||||
let user_ctx = personalization::build_user_context(
|
||||
user_id,
|
||||
now,
|
||||
self.interaction_ledger,
|
||||
self.creator_items,
|
||||
);
|
||||
executor.score_personalized(
|
||||
&candidates,
|
||||
profile,
|
||||
now,
|
||||
self.session_context.as_ref(),
|
||||
&user_ctx,
|
||||
&item_metadata,
|
||||
)
|
||||
} else {
|
||||
// Population-level profiles: no user context needed.
|
||||
executor.score_with_session(&candidates, profile, now, self.session_context.as_ref())
|
||||
executor.score_with_session(
|
||||
&candidates,
|
||||
profile,
|
||||
now,
|
||||
self.session_context.as_ref(),
|
||||
&item_metadata,
|
||||
)
|
||||
};
|
||||
|
||||
// Exploration budget: inject random candidates for discovery.
|
||||
if profile.exploration > 0.0 {
|
||||
Self::inject_exploration(&mut scored, &candidates, profile.exploration);
|
||||
candidate_gen::inject_exploration(&mut scored, &candidates, profile.exploration);
|
||||
}
|
||||
|
||||
let total_scored = scored.len();
|
||||
@ -532,184 +547,6 @@ impl<'a> RetrieveExecutor<'a> {
|
||||
session_snapshot: self.session_snapshot.clone(),
|
||||
})
|
||||
}
|
||||
|
||||
// ── Candidate generation strategies ─────────────────────────────────
|
||||
|
||||
/// Scan the universe bitmap for all entity IDs.
|
||||
///
|
||||
/// When `has_user_context` is true, uses a larger multiplier (10x) to
|
||||
/// ensure enough candidates survive user-context filtering (seen, blocked,
|
||||
/// hard negatives).
|
||||
fn scan_candidates(&self, limit: usize, has_user_context: bool) -> Vec<EntityId> {
|
||||
let multiplier = if has_user_context { 10 } else { 4 };
|
||||
let max_candidates = (limit * multiplier).max(200);
|
||||
let mut candidates = Vec::with_capacity(max_candidates);
|
||||
|
||||
// Read-lock the universe bitmap.
|
||||
if let Some(universe) = self.universe
|
||||
&& let Ok(bm) = universe.read()
|
||||
{
|
||||
for id_u32 in bm.iter() {
|
||||
candidates.push(EntityId::new(u64::from(id_u32)));
|
||||
if candidates.len() >= max_candidates {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
candidates
|
||||
}
|
||||
|
||||
/// Build a `UserContext` for personalized scoring.
|
||||
///
|
||||
/// Fetches the user's top interacted creators from the `InteractionLedger`,
|
||||
/// then expands each creator into their item set via `CreatorItemsBitmap`.
|
||||
/// The resulting per-item boost map is normalized to `[0.0, 1.0]`.
|
||||
fn build_user_context(&self, user_id: u64, now: Timestamp) -> UserContext {
|
||||
let now_ns = now.as_nanos();
|
||||
|
||||
// Get top creators from the interaction ledger.
|
||||
let top_creators = self
|
||||
.interaction_ledger
|
||||
.map(|il| il.top_creators(user_id, 50, now_ns))
|
||||
.unwrap_or_default();
|
||||
|
||||
// Build per-item boost map from creator items.
|
||||
let mut creator_interaction_boosts: HashMap<u32, f64> = HashMap::new();
|
||||
if let Some(creator_items) = self.creator_items {
|
||||
for (creator_id, weight) in &top_creators {
|
||||
if let Some(bitmap) = creator_items.get(*creator_id) {
|
||||
for item_id in &bitmap {
|
||||
creator_interaction_boosts.insert(item_id, *weight);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Normalize boost values to [0.0, 1.0] so the highest-interacted
|
||||
// creator gets boost=1.0 and the interaction boost weight constant
|
||||
// controls the absolute magnitude.
|
||||
if !creator_interaction_boosts.is_empty() {
|
||||
let max_weight = creator_interaction_boosts
|
||||
.values()
|
||||
.copied()
|
||||
.fold(0.0_f64, f64::max);
|
||||
if max_weight > f64::EPSILON {
|
||||
for val in creator_interaction_boosts.values_mut() {
|
||||
*val /= max_weight;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
UserContext {
|
||||
user_id,
|
||||
creator_interaction_boosts,
|
||||
}
|
||||
}
|
||||
|
||||
/// Inject exploration candidates into the scored list.
|
||||
///
|
||||
/// Reserves `exploration_fraction` of the result set for candidates
|
||||
/// not in the top-scored results. Used by `for_you` to prevent filter
|
||||
/// bubbles and expose cold-start content.
|
||||
///
|
||||
/// Exploration candidates are appended at the end of the scored list
|
||||
/// with score 0.0 (lowest). The selection is deterministic: a hash of
|
||||
/// the candidate set size + first candidate ID seeds the order.
|
||||
fn inject_exploration(
|
||||
scored: &mut Vec<ScoredCandidate>,
|
||||
all_candidates: &[EntityId],
|
||||
exploration_fraction: f64,
|
||||
) {
|
||||
if scored.is_empty() || all_candidates.is_empty() || exploration_fraction <= 0.0 {
|
||||
return;
|
||||
}
|
||||
|
||||
#[allow(
|
||||
clippy::cast_possible_truncation,
|
||||
clippy::cast_sign_loss,
|
||||
clippy::cast_precision_loss
|
||||
)]
|
||||
let exploration_slots = (exploration_fraction * scored.len() as f64).ceil() as usize;
|
||||
if exploration_slots == 0 {
|
||||
return;
|
||||
}
|
||||
|
||||
// Collect entity IDs already in the scored set.
|
||||
let scored_ids: HashSet<u64> = scored.iter().map(|c| c.entity_id.as_u64()).collect();
|
||||
|
||||
// Find candidates not in the scored set.
|
||||
let mut exploration_pool: Vec<EntityId> = all_candidates
|
||||
.iter()
|
||||
.filter(|id| !scored_ids.contains(&id.as_u64()))
|
||||
.copied()
|
||||
.collect();
|
||||
|
||||
if exploration_pool.is_empty() {
|
||||
return;
|
||||
}
|
||||
|
||||
// Deterministic shuffle using BLAKE3 hash of candidate IDs.
|
||||
// This is reproducible for the same candidate set.
|
||||
exploration_pool.sort_by(|a, b| {
|
||||
let hash_a = blake3::hash(&a.as_u64().to_le_bytes());
|
||||
let hash_b = blake3::hash(&b.as_u64().to_le_bytes());
|
||||
hash_a.as_bytes().cmp(hash_b.as_bytes())
|
||||
});
|
||||
|
||||
let actual_slots = exploration_slots.min(exploration_pool.len());
|
||||
|
||||
// Trim scored to make room for exploration candidates.
|
||||
let keep = scored.len().saturating_sub(actual_slots);
|
||||
scored.truncate(keep);
|
||||
|
||||
// Append exploration candidates with score 0.0.
|
||||
for &entity_id in exploration_pool.iter().take(actual_slots) {
|
||||
scored.push(ScoredCandidate {
|
||||
entity_id,
|
||||
score: 0.0,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
/// Rank candidates by a specific signal's decay score.
|
||||
fn signal_ranked_candidates(&self, signal_name: &str, limit: usize) -> Vec<EntityId> {
|
||||
let max_candidates = (limit * 4).max(200);
|
||||
let Ok(type_id) = self.ledger.resolve_signal_type(signal_name) else {
|
||||
return Vec::new();
|
||||
};
|
||||
|
||||
let now_ns = Timestamp::now().as_nanos();
|
||||
let mut scored: Vec<(EntityId, f64)> = Vec::new();
|
||||
|
||||
for entry in self.ledger.entries() {
|
||||
let (entity_id, signal_type_id) = entry.key();
|
||||
if *signal_type_id == type_id {
|
||||
// Use decay score at index 0 with lambda=0 (no additional decay
|
||||
// beyond what was already applied at write time). This is a
|
||||
// simplified ranking for candidate generation -- the full scoring
|
||||
// happens in Stage 3 via ProfileExecutor.
|
||||
let score = entry.value().hot.current_score(0, now_ns, 0.0);
|
||||
scored.push((*entity_id, score));
|
||||
}
|
||||
}
|
||||
|
||||
// Sort by score descending. NaN scores (from degenerate decay math) fall
|
||||
// back to entity-ID order for deterministic, stable output.
|
||||
scored.sort_by(|a, b| {
|
||||
b.1.partial_cmp(&a.1)
|
||||
.unwrap_or_else(|| b.0.as_u64().cmp(&a.0.as_u64()))
|
||||
});
|
||||
|
||||
scored
|
||||
.into_iter()
|
||||
.take(max_candidates)
|
||||
.map(|(id, _)| id)
|
||||
.collect()
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ───────────────────────────────────────────────────────────────────
|
||||
@ -722,6 +559,7 @@ mod tests {
|
||||
|
||||
use super::*;
|
||||
use crate::ranking::builtins::register_builtins;
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Window};
|
||||
use crate::signals::{NoopWalWriter, SignalLedger};
|
||||
@ -1085,50 +923,6 @@ mod tests {
|
||||
assert!(matches!(result, Err(QueryError::ProfileNotFound(_))));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extract_user_state_filters_finds_saved_liked_in_progress() {
|
||||
let expr = FilterExpr::And(vec![
|
||||
FilterExpr::CategoryEq("jazz".into()),
|
||||
FilterExpr::Saved(42),
|
||||
FilterExpr::Liked(42),
|
||||
FilterExpr::InProgress {
|
||||
user_id: 42,
|
||||
threshold: 0.8,
|
||||
},
|
||||
]);
|
||||
let found = extract_user_state_filters(&expr);
|
||||
assert_eq!(found.len(), 3);
|
||||
assert!(matches!(found[0], FilterExpr::Saved(42)));
|
||||
assert!(matches!(found[1], FilterExpr::Liked(42)));
|
||||
assert!(matches!(
|
||||
found[2],
|
||||
FilterExpr::InProgress {
|
||||
user_id: 42,
|
||||
threshold,
|
||||
} if (*threshold - 0.8).abs() < f64::EPSILON
|
||||
));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extract_user_state_filters_empty_for_plain_filters() {
|
||||
let expr = FilterExpr::And(vec![
|
||||
FilterExpr::CategoryEq("jazz".into()),
|
||||
FilterExpr::FormatEq("video".into()),
|
||||
]);
|
||||
let found = extract_user_state_filters(&expr);
|
||||
assert!(found.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extract_user_state_filters_nested_in_or() {
|
||||
let expr = FilterExpr::Or(vec![
|
||||
FilterExpr::Saved(1),
|
||||
FilterExpr::Not(Box::new(FilterExpr::Liked(2))),
|
||||
]);
|
||||
let found = extract_user_state_filters(&expr);
|
||||
assert_eq!(found.len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn saved_filter_retains_only_saved_items() {
|
||||
let schema = test_schema();
|
||||
@ -1469,7 +1263,6 @@ mod tests {
|
||||
fn exploration_injects_random_candidates() {
|
||||
// Use inject_exploration directly to verify that exploration candidates
|
||||
// from the full candidate list appear in the scored output.
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
|
||||
// Scored list: entities 1-5 (the "top ranked" items).
|
||||
let mut scored: Vec<ScoredCandidate> = (1..=5)
|
||||
@ -1486,7 +1279,7 @@ mod tests {
|
||||
let all_candidates: Vec<EntityId> = (1..=10).map(EntityId::new).collect();
|
||||
|
||||
// 40% exploration -> ceil(0.4 * 5) = 2 exploration slots.
|
||||
RetrieveExecutor::inject_exploration(&mut scored, &all_candidates, 0.4);
|
||||
candidate_gen::inject_exploration(&mut scored, &all_candidates, 0.4);
|
||||
|
||||
assert_eq!(scored.len(), 5, "total output should remain 5");
|
||||
|
||||
@ -1513,8 +1306,6 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn exploration_zero_fraction_is_noop() {
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
|
||||
let mut scored: Vec<ScoredCandidate> = (1..=5)
|
||||
.map(|i| ScoredCandidate {
|
||||
entity_id: EntityId::new(i),
|
||||
@ -1526,7 +1317,7 @@ mod tests {
|
||||
.collect();
|
||||
|
||||
let all: Vec<EntityId> = (1..=10).map(EntityId::new).collect();
|
||||
RetrieveExecutor::inject_exploration(&mut scored, &all, 0.0);
|
||||
candidate_gen::inject_exploration(&mut scored, &all, 0.0);
|
||||
|
||||
// No change when exploration fraction is 0.
|
||||
assert_eq!(scored.len(), 5);
|
||||
@ -1535,8 +1326,6 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn exploration_no_extra_candidates_is_noop() {
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
|
||||
// All candidates are already scored -- no exploration pool.
|
||||
let mut scored: Vec<ScoredCandidate> = (1..=5)
|
||||
.map(|i| ScoredCandidate {
|
||||
@ -1549,7 +1338,7 @@ mod tests {
|
||||
.collect();
|
||||
|
||||
let all: Vec<EntityId> = (1..=5).map(EntityId::new).collect();
|
||||
RetrieveExecutor::inject_exploration(&mut scored, &all, 0.5);
|
||||
candidate_gen::inject_exploration(&mut scored, &all, 0.5);
|
||||
|
||||
// No change when there are no extra candidates for exploration.
|
||||
assert_eq!(scored.len(), 5);
|
||||
61
tidal/src/query/executor/personalization.rs
Normal file
61
tidal/src/query/executor/personalization.rs
Normal file
@ -0,0 +1,61 @@
|
||||
//! Personalization context building for the RETRIEVE executor.
|
||||
//!
|
||||
//! Constructs a `UserContext` from the interaction ledger and creator-items
|
||||
//! bitmap. The resulting per-item boost map is normalized to `[0.0, 1.0]`.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::entities::{CreatorItemsBitmap, InteractionLedger};
|
||||
use crate::ranking::executor::UserContext;
|
||||
use crate::schema::Timestamp;
|
||||
|
||||
/// Build a `UserContext` for personalized scoring.
|
||||
///
|
||||
/// Fetches the user's top interacted creators from the `InteractionLedger`,
|
||||
/// then expands each creator into their item set via `CreatorItemsBitmap`.
|
||||
/// The resulting per-item boost map is normalized to `[0.0, 1.0]`.
|
||||
pub(crate) fn build_user_context(
|
||||
user_id: u64,
|
||||
now: Timestamp,
|
||||
interaction_ledger: Option<&InteractionLedger>,
|
||||
creator_items: Option<&CreatorItemsBitmap>,
|
||||
) -> UserContext {
|
||||
let now_ns = now.as_nanos();
|
||||
|
||||
// Get top creators from the interaction ledger.
|
||||
let top_creators = interaction_ledger
|
||||
.map(|il| il.top_creators(user_id, 50, now_ns))
|
||||
.unwrap_or_default();
|
||||
|
||||
// Build per-item boost map from creator items.
|
||||
let mut creator_interaction_boosts: HashMap<u32, f64> = HashMap::new();
|
||||
if let Some(creator_items) = creator_items {
|
||||
for (creator_id, weight) in &top_creators {
|
||||
if let Some(bitmap) = creator_items.get(*creator_id) {
|
||||
for item_id in &bitmap {
|
||||
creator_interaction_boosts.insert(item_id, *weight);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Normalize boost values to [0.0, 1.0] so the highest-interacted
|
||||
// creator gets boost=1.0 and the interaction boost weight constant
|
||||
// controls the absolute magnitude.
|
||||
if !creator_interaction_boosts.is_empty() {
|
||||
let max_weight = creator_interaction_boosts
|
||||
.values()
|
||||
.copied()
|
||||
.fold(0.0_f64, f64::max);
|
||||
if max_weight > f64::EPSILON {
|
||||
for val in creator_interaction_boosts.values_mut() {
|
||||
*val /= max_weight;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
UserContext {
|
||||
user_id,
|
||||
creator_interaction_boosts,
|
||||
}
|
||||
}
|
||||
91
tidal/src/query/executor/user_filter.rs
Normal file
91
tidal/src/query/executor/user_filter.rs
Normal file
@ -0,0 +1,91 @@
|
||||
//! User-state filter extraction from filter expression trees.
|
||||
//!
|
||||
//! Walks a `FilterExpr` AST and collects references to user-state filter
|
||||
//! variants (`Saved`, `Liked`, `InProgress`). These variants pass through
|
||||
//! the `FilterEvaluator` as "return full universe" and must be applied as
|
||||
//! inclusion filters in Stage 2.5 of the executor pipeline, where
|
||||
//! `UserStateIndex` is available.
|
||||
|
||||
use crate::storage::indexes::filter::FilterExpr;
|
||||
|
||||
/// Extract user-state filter variants from a filter expression tree.
|
||||
///
|
||||
/// AND/OR/NOT nodes are recursed into; leaf nodes that are not user-state
|
||||
/// filters are ignored (they are already handled by the `FilterEvaluator`).
|
||||
pub(crate) fn extract_user_state_filters(expr: &FilterExpr) -> Vec<&FilterExpr> {
|
||||
let mut result = Vec::new();
|
||||
collect_user_state_filters(expr, &mut result);
|
||||
result
|
||||
}
|
||||
|
||||
fn collect_user_state_filters<'a>(expr: &'a FilterExpr, out: &mut Vec<&'a FilterExpr>) {
|
||||
match expr {
|
||||
FilterExpr::Saved(_) | FilterExpr::Liked(_) | FilterExpr::InProgress { .. } => {
|
||||
out.push(expr);
|
||||
}
|
||||
FilterExpr::And(children) | FilterExpr::Or(children) => {
|
||||
for child in children {
|
||||
collect_user_state_filters(child, out);
|
||||
}
|
||||
}
|
||||
FilterExpr::Not(inner) => {
|
||||
collect_user_state_filters(inner, out);
|
||||
}
|
||||
// All other variants (CategoryEq, FormatEq, etc.) are handled by
|
||||
// FilterEvaluator and do not need extraction here.
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::storage::indexes::filter::FilterExpr;
|
||||
|
||||
#[test]
|
||||
fn extract_user_state_filters_finds_saved_liked_in_progress() {
|
||||
let expr = FilterExpr::And(vec![
|
||||
FilterExpr::CategoryEq("jazz".into()),
|
||||
FilterExpr::Saved(42),
|
||||
FilterExpr::Liked(42),
|
||||
FilterExpr::InProgress {
|
||||
user_id: 42,
|
||||
threshold: 0.8,
|
||||
},
|
||||
]);
|
||||
let found = extract_user_state_filters(&expr);
|
||||
assert_eq!(found.len(), 3);
|
||||
assert!(matches!(found[0], FilterExpr::Saved(42)));
|
||||
assert!(matches!(found[1], FilterExpr::Liked(42)));
|
||||
assert!(matches!(
|
||||
found[2],
|
||||
FilterExpr::InProgress {
|
||||
user_id: 42,
|
||||
threshold,
|
||||
} if (*threshold - 0.8).abs() < f64::EPSILON
|
||||
));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extract_user_state_filters_empty_for_plain_filters() {
|
||||
let expr = FilterExpr::And(vec![
|
||||
FilterExpr::CategoryEq("jazz".into()),
|
||||
FilterExpr::FormatEq("video".into()),
|
||||
]);
|
||||
let found = extract_user_state_filters(&expr);
|
||||
assert!(found.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extract_user_state_filters_nested_in_or() {
|
||||
let expr = FilterExpr::Or(vec![
|
||||
FilterExpr::Saved(1),
|
||||
FilterExpr::Not(Box::new(FilterExpr::Liked(2))),
|
||||
]);
|
||||
let found = extract_user_state_filters(&expr);
|
||||
assert_eq!(found.len(), 2);
|
||||
}
|
||||
}
|
||||
410
tidal/src/query/fusion.rs
Normal file
410
tidal/src/query/fusion.rs
Normal file
@ -0,0 +1,410 @@
|
||||
//! Reciprocal Rank Fusion (RRF) for hybrid search.
|
||||
//!
|
||||
//! Merges ranked lists from heterogeneous retrieval sources (BM25 text search,
|
||||
//! ANN vector search) into a single fused ranking. The RRF formula from
|
||||
//! Cormack, Clarke & Buettcher (SIGIR 2009) is rank-based, not score-based,
|
||||
//! making it robust to the incomparable score distributions of different
|
||||
//! retrieval models.
|
||||
//!
|
||||
//! # Formula
|
||||
//!
|
||||
//! `RRFscore(d) = sum_i 1 / (k + rank_i(d))`
|
||||
//!
|
||||
//! where `k` is a smoothing constant (default 60) and `rank_i(d)` is the
|
||||
//! 1-based rank of document `d` in list `i`. Documents absent from a list
|
||||
//! contribute zero for that term.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::schema::EntityId;
|
||||
use crate::storage::vector::VectorSearchResult;
|
||||
|
||||
/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
|
||||
///
|
||||
/// Fuses two ranked lists into a single ranking using rank-based scoring.
|
||||
/// The `k` parameter controls how much weight is given to documents ranked
|
||||
/// lower in the input lists. Higher `k` compresses the score range, making
|
||||
/// rank differences less significant.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct HybridFusion {
|
||||
/// Smoothing constant. Default is 60 per the original paper.
|
||||
pub k: u32,
|
||||
}
|
||||
|
||||
impl Default for HybridFusion {
|
||||
fn default() -> Self {
|
||||
Self { k: 60 }
|
||||
}
|
||||
}
|
||||
|
||||
impl HybridFusion {
|
||||
/// Create a new `HybridFusion` with the default `k = 60`.
|
||||
#[must_use]
|
||||
pub fn new() -> Self {
|
||||
Self::default()
|
||||
}
|
||||
|
||||
/// Create a new `HybridFusion` with a custom `k` value.
|
||||
#[must_use]
|
||||
pub const fn with_k(k: u32) -> Self {
|
||||
Self { k }
|
||||
}
|
||||
|
||||
/// Fuse two ranked lists via Reciprocal Rank Fusion.
|
||||
///
|
||||
/// Both lists must be pre-sorted "best first" by the caller:
|
||||
/// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
|
||||
/// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
|
||||
///
|
||||
/// Returns results sorted descending by fused RRF score. Documents
|
||||
/// appearing in only one list contribute only their single-list term.
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_precision_loss)] // Ranks are bounded by list length, never near 2^52.
|
||||
pub fn fuse(
|
||||
&self,
|
||||
bm25_results: &[(EntityId, f32)],
|
||||
ann_results: &[(EntityId, f32)],
|
||||
) -> Vec<(EntityId, f64)> {
|
||||
let k = f64::from(self.k);
|
||||
let capacity = bm25_results.len() + ann_results.len();
|
||||
let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);
|
||||
|
||||
for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
|
||||
let rank = (rank_0based + 1) as f64;
|
||||
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
|
||||
}
|
||||
|
||||
for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
|
||||
let rank = (rank_0based + 1) as f64;
|
||||
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
|
||||
}
|
||||
|
||||
let mut results: Vec<(EntityId, f64)> = scores
|
||||
.into_iter()
|
||||
.map(|(id, score)| (EntityId::new(id), score))
|
||||
.collect();
|
||||
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
|
||||
results
|
||||
}
|
||||
}
|
||||
|
||||
/// Which retrieval system(s) to use for a search query.
|
||||
///
|
||||
/// Determined from the presence of text and vector components in the query.
|
||||
/// Used by the retrieval router to select the appropriate fusion path.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub enum RetrievalMode {
|
||||
/// Only text (BM25) retrieval. No embedding was provided.
|
||||
TextOnly,
|
||||
/// Only vector (ANN) retrieval. No text query was provided.
|
||||
VectorOnly,
|
||||
/// Both text and vector retrieval, fused via RRF.
|
||||
Hybrid,
|
||||
}
|
||||
|
||||
impl RetrievalMode {
|
||||
/// Determine mode from query contents. Returns `None` if neither text nor
|
||||
/// vector is present (the query is empty and cannot retrieve anything).
|
||||
#[must_use]
|
||||
pub const fn determine(has_text: bool, has_vector: bool) -> Option<Self> {
|
||||
match (has_text, has_vector) {
|
||||
(true, false) => Some(Self::TextOnly),
|
||||
(false, true) => Some(Self::VectorOnly),
|
||||
(true, true) => Some(Self::Hybrid),
|
||||
(false, false) => None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Route pre-retrieved result lists through the appropriate fusion path.
|
||||
///
|
||||
/// - **`TextOnly`**: BM25 scores cast to `f64`, order preserved.
|
||||
/// - **`VectorOnly`**: ANN results converted to rank-based scores via `1.0 / (k + rank)`.
|
||||
/// - **Hybrid**: delegates to [`HybridFusion::fuse`].
|
||||
///
|
||||
/// Both input slices must be pre-sorted "best first" by the caller (BM25
|
||||
/// descending by score, ANN ascending by distance).
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_precision_loss)] // Ranks bounded by list length, never near 2^52.
|
||||
pub fn route_results(
|
||||
mode: RetrievalMode,
|
||||
bm25_results: &[(EntityId, f32)],
|
||||
ann_results: &[(EntityId, f32)],
|
||||
fusion: &HybridFusion,
|
||||
) -> Vec<(EntityId, f64)> {
|
||||
match mode {
|
||||
RetrievalMode::TextOnly => bm25_results
|
||||
.iter()
|
||||
.map(|(id, score)| (*id, f64::from(*score)))
|
||||
.collect(),
|
||||
RetrievalMode::VectorOnly => {
|
||||
let k = f64::from(fusion.k);
|
||||
ann_results
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (id, _distance))| {
|
||||
let rank = (i + 1) as f64;
|
||||
(*id, 1.0 / (k + rank))
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
RetrievalMode::Hybrid => fusion.fuse(bm25_results, ann_results),
|
||||
}
|
||||
}
|
||||
|
||||
/// Convert ANN search results to the ranked-list format expected by fusion.
|
||||
///
|
||||
/// [`VectorSearchResult`] is sorted ascending by distance (best first).
|
||||
/// Maps to `(EntityId, f32)` where the `f32` is the raw L2 distance,
|
||||
/// preserving sort order for downstream rank computation.
|
||||
#[must_use]
|
||||
pub fn ann_to_ranked(ann_results: &[VectorSearchResult]) -> Vec<(EntityId, f32)> {
|
||||
ann_results
|
||||
.iter()
|
||||
.map(|r| (EntityId::new(r.id), r.distance))
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn fuse_both_lists() {
|
||||
let bm25 = vec![
|
||||
(EntityId::new(1), 1.0f32), // rank 1
|
||||
(EntityId::new(2), 0.8f32), // rank 2
|
||||
(EntityId::new(3), 0.5f32), // rank 3 (BM25 only)
|
||||
];
|
||||
let ann = vec![
|
||||
(EntityId::new(2), 0.1f32), // rank 1 (ANN top)
|
||||
(EntityId::new(1), 0.2f32), // rank 2
|
||||
(EntityId::new(4), 0.5f32), // rank 3 (ANN only)
|
||||
];
|
||||
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&bm25, &ann);
|
||||
|
||||
// All four unique entities present
|
||||
let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
|
||||
assert!(ids.contains(&1));
|
||||
assert!(ids.contains(&2));
|
||||
assert!(ids.contains(&3));
|
||||
assert!(ids.contains(&4));
|
||||
|
||||
// Docs in both lists have higher scores than docs in one list only
|
||||
let a_score = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 1)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 1 present");
|
||||
let c_score = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 3)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 3 present");
|
||||
let d_score = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 4)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 4 present");
|
||||
assert!(a_score > c_score);
|
||||
assert!(a_score > d_score);
|
||||
|
||||
// Sorted descending
|
||||
let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
|
||||
for i in 1..scores.len() {
|
||||
assert!(scores[i - 1] >= scores[i]);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_bm25_only() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&bm25, &[]);
|
||||
assert_eq!(results.len(), 2);
|
||||
let s1 = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 1)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 1 present");
|
||||
let s2 = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 2)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 2 present");
|
||||
assert!(s1 > s2);
|
||||
let expected = 1.0 / (60.0 + 1.0);
|
||||
assert!((s1 - expected).abs() < 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_ann_only() {
|
||||
let ann = vec![(EntityId::new(1), 0.1f32), (EntityId::new(2), 0.2f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&[], &ann);
|
||||
assert_eq!(results.len(), 2);
|
||||
let s1 = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 1)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 1 present");
|
||||
let s2 = results
|
||||
.iter()
|
||||
.find(|(id, _)| id.as_u64() == 2)
|
||||
.map(|r| r.1)
|
||||
.expect("entity 2 present");
|
||||
assert!(s1 > s2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_empty_lists() {
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&[], &[]);
|
||||
assert!(results.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_single_doc_both_lists() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32)];
|
||||
let ann = vec![(EntityId::new(1), 0.1f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&bm25, &ann);
|
||||
assert_eq!(results.len(), 1);
|
||||
let expected = 1.0 / (60.0 + 1.0) + 1.0 / (60.0 + 1.0);
|
||||
assert!((results[0].1 - expected).abs() < 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn fuse_k_affects_scores() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32)];
|
||||
let ann = vec![(EntityId::new(1), 0.1f32)];
|
||||
let fusion_60 = HybridFusion::new();
|
||||
let fusion_30 = HybridFusion::with_k(30);
|
||||
let r60 = fusion_60.fuse(&bm25, &ann);
|
||||
let r30 = fusion_30.fuse(&bm25, &ann);
|
||||
// Lower k means higher individual RRF terms, so score is higher
|
||||
assert!(r30[0].1 > r60[0].1);
|
||||
}
|
||||
|
||||
// --- RetrievalMode tests ---
|
||||
|
||||
#[test]
|
||||
fn determine_text_only() {
|
||||
assert_eq!(
|
||||
RetrievalMode::determine(true, false),
|
||||
Some(RetrievalMode::TextOnly)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn determine_vector_only() {
|
||||
assert_eq!(
|
||||
RetrievalMode::determine(false, true),
|
||||
Some(RetrievalMode::VectorOnly)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn determine_hybrid() {
|
||||
assert_eq!(
|
||||
RetrievalMode::determine(true, true),
|
||||
Some(RetrievalMode::Hybrid)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn determine_none() {
|
||||
assert_eq!(RetrievalMode::determine(false, false), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn route_text_only_passthrough() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = route_results(RetrievalMode::TextOnly, &bm25, &[], &fusion);
|
||||
assert_eq!(results.len(), 2);
|
||||
assert!((results[0].1 - 1.0f64).abs() < 1e-6);
|
||||
assert!((results[1].1 - 0.5f64).abs() < 1e-6);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn route_vector_only_rank_based() {
|
||||
let ann = vec![(EntityId::new(1), 0.1f32), (EntityId::new(2), 0.2f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = route_results(RetrievalMode::VectorOnly, &[], &ann, &fusion);
|
||||
assert_eq!(results.len(), 2);
|
||||
let expected_rank1 = 1.0 / (60.0 + 1.0);
|
||||
let expected_rank2 = 1.0 / (60.0 + 2.0);
|
||||
assert!((results[0].1 - expected_rank1).abs() < 1e-9);
|
||||
assert!((results[1].1 - expected_rank2).abs() < 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn route_hybrid_calls_fuse() {
|
||||
let bm25 = vec![(EntityId::new(1), 1.0f32)];
|
||||
let ann = vec![(EntityId::new(2), 0.1f32)];
|
||||
let fusion = HybridFusion::new();
|
||||
let results = route_results(RetrievalMode::Hybrid, &bm25, &ann, &fusion);
|
||||
assert_eq!(results.len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ann_to_ranked_converts_correctly() {
|
||||
let ann_results = vec![
|
||||
VectorSearchResult {
|
||||
id: 42,
|
||||
distance: 0.1,
|
||||
},
|
||||
VectorSearchResult {
|
||||
id: 99,
|
||||
distance: 0.3,
|
||||
},
|
||||
];
|
||||
let ranked = ann_to_ranked(&ann_results);
|
||||
assert_eq!(ranked.len(), 2);
|
||||
assert_eq!(ranked[0].0.as_u64(), 42);
|
||||
assert!((ranked[0].1 - 0.1f32).abs() < 1e-6);
|
||||
assert_eq!(ranked[1].0.as_u64(), 99);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod proptests {
|
||||
use super::*;
|
||||
use proptest::prelude::*;
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn rrf_output_is_union_of_inputs(
|
||||
bm25_ids in prop::collection::vec(1u64..=50, 0..10),
|
||||
ann_ids in prop::collection::vec(1u64..=50, 0..10),
|
||||
) {
|
||||
let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
|
||||
.map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
|
||||
.collect();
|
||||
let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
|
||||
.map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
|
||||
.collect();
|
||||
|
||||
let fusion = HybridFusion::new();
|
||||
let results = fusion.fuse(&bm25, &ann);
|
||||
|
||||
// Output is the union of unique IDs from both inputs
|
||||
let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
|
||||
.chain(ann_ids.iter())
|
||||
.copied()
|
||||
.collect();
|
||||
let result_ids: std::collections::HashSet<u64> = results.iter()
|
||||
.map(|(id, _)| id.as_u64())
|
||||
.collect();
|
||||
prop_assert_eq!(all_ids, result_ids);
|
||||
|
||||
// Output is sorted descending
|
||||
for i in 1..results.len() {
|
||||
prop_assert!(results[i - 1].1 >= results[i].1);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -3,11 +3,19 @@
|
||||
//! The primary query type is `Retrieve` -- a typed AST representing
|
||||
//! "given these constraints, rank content for me." Constructed via
|
||||
//! `RetrieveBuilder` or (M3+) parsed from the query language.
|
||||
//!
|
||||
//! The `Search` query type combines BM25 text retrieval, ANN vector retrieval,
|
||||
//! and RRF fusion with the same personalization, filtering, and diversity
|
||||
//! pipeline as `Retrieve`. Constructed via `SearchBuilder`.
|
||||
|
||||
pub mod executor;
|
||||
pub mod fusion;
|
||||
pub mod retrieve;
|
||||
pub mod search;
|
||||
|
||||
pub use executor::RetrieveExecutor;
|
||||
pub use fusion::{HybridFusion, RetrievalMode, ann_to_ranked, route_results};
|
||||
pub use retrieve::{
|
||||
Cursor, ProfileRef, QueryError, Results, Retrieve, RetrieveBuilder, RetrieveResult, Signal,
|
||||
};
|
||||
pub use search::{Search, SearchBuilder, SearchResultItem, SearchResults};
|
||||
|
||||
102
tidal/src/query/retrieve/errors.rs
Normal file
102
tidal/src/query/retrieve/errors.rs
Normal file
@ -0,0 +1,102 @@
|
||||
//! Error types for RETRIEVE query construction, validation, and execution.
|
||||
|
||||
/// Errors arising from query construction, validation, or execution.
|
||||
///
|
||||
/// Replaces the Milestone 1 stub `QueryError { message }`. Each variant
|
||||
/// carries structured context so callers can programmatically handle
|
||||
/// specific failure modes.
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum QueryError {
|
||||
/// The named ranking profile does not exist in the registry.
|
||||
#[error("profile '{0}' not found")]
|
||||
ProfileNotFound(String),
|
||||
/// A filter references an invalid field or has an unsupported value.
|
||||
#[error("invalid filter on '{field}': {reason}")]
|
||||
InvalidFilter { field: String, reason: String },
|
||||
/// The requested limit is outside the allowed range.
|
||||
#[error("limit {requested} out of range [{min}, {max}]")]
|
||||
InvalidLimit {
|
||||
requested: usize,
|
||||
min: usize,
|
||||
max: usize,
|
||||
},
|
||||
/// A required index (vector, text, bitmap) is not available.
|
||||
#[error("index '{0}' not available")]
|
||||
IndexNotAvailable(String),
|
||||
/// The underlying storage engine returned an error.
|
||||
#[error("storage error: {0}")]
|
||||
StorageError(String),
|
||||
/// A pagination cursor could not be decoded.
|
||||
#[error("invalid cursor: {0}")]
|
||||
InvalidCursor(String),
|
||||
/// The query requires a candidate strategy not yet implemented.
|
||||
#[error("unsupported strategy: {0}")]
|
||||
UnsupportedStrategy(String),
|
||||
/// The FOR SESSION clause references a session that does not exist.
|
||||
#[error("session not found: {0}")]
|
||||
SessionNotFound(String),
|
||||
}
|
||||
|
||||
// Manual `From` because this is a lossy conversion: `StorageError` is
|
||||
// flattened to its `Display` string. The variant holds `String`, not
|
||||
// `StorageError`, so `#[from]` cannot be used.
|
||||
impl From<crate::storage::StorageError> for QueryError {
|
||||
fn from(e: crate::storage::StorageError) -> Self {
|
||||
Self::StorageError(e.to_string())
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ───────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::storage::StorageError;
|
||||
|
||||
#[test]
|
||||
fn query_error_display_profile_not_found() {
|
||||
let e = QueryError::ProfileNotFound("trending".into());
|
||||
assert!(e.to_string().contains("trending"));
|
||||
assert!(e.to_string().contains("not found"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_invalid_filter() {
|
||||
let e = QueryError::InvalidFilter {
|
||||
field: "category".into(),
|
||||
reason: "unknown value".into(),
|
||||
};
|
||||
assert!(e.to_string().contains("category"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_invalid_limit() {
|
||||
let e = QueryError::InvalidLimit {
|
||||
requested: 0,
|
||||
min: 1,
|
||||
max: 500,
|
||||
};
|
||||
assert!(e.to_string().contains('0'));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_invalid_cursor() {
|
||||
let e = QueryError::InvalidCursor("bad data".into());
|
||||
assert!(e.to_string().contains("bad data"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_unsupported_strategy() {
|
||||
let e = QueryError::UnsupportedStrategy("Hybrid requires M3".into());
|
||||
assert!(e.to_string().contains("Hybrid"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_from_storage_error() {
|
||||
let storage_err = StorageError::Closed;
|
||||
let query_err: QueryError = storage_err.into();
|
||||
assert!(matches!(query_err, QueryError::StorageError(_)));
|
||||
assert!(query_err.to_string().contains("storage"));
|
||||
}
|
||||
}
|
||||
17
tidal/src/query/retrieve/mod.rs
Normal file
17
tidal/src/query/retrieve/mod.rs
Normal file
@ -0,0 +1,17 @@
|
||||
//! RETRIEVE query AST, builder, response types, and error enum.
|
||||
//!
|
||||
//! This module defines the typed representation of a RETRIEVE query -- the
|
||||
//! primary read path in tidalDB. A `Retrieve` captures the full intent of
|
||||
//! "given these constraints, rank content for me" and is constructed either
|
||||
//! programmatically via `RetrieveBuilder` or (M3+) by parsing the query
|
||||
//! language.
|
||||
//!
|
||||
//! The response types (`Results`, `RetrieveResult`, `Cursor`, `Signal`) are
|
||||
//! structured so the executor can populate them without the caller needing
|
||||
//! to understand scoring internals.
|
||||
|
||||
pub mod errors;
|
||||
pub mod types;
|
||||
|
||||
pub use errors::QueryError;
|
||||
pub use types::{Cursor, ProfileRef, Results, Retrieve, RetrieveBuilder, RetrieveResult, Signal};
|
||||
@ -1,19 +1,10 @@
|
||||
//! RETRIEVE query AST, builder, response types, and error enum.
|
||||
//!
|
||||
//! This module defines the typed representation of a RETRIEVE query -- the
|
||||
//! primary read path in tidalDB. A `Retrieve` captures the full intent of
|
||||
//! "given these constraints, rank content for me" and is constructed either
|
||||
//! programmatically via `RetrieveBuilder` or (M3+) by parsing the query
|
||||
//! language.
|
||||
//!
|
||||
//! The response types (`Results`, `RetrieveResult`, `Cursor`, `Signal`) are
|
||||
//! structured so the executor can populate them without the caller needing
|
||||
//! to understand scoring internals.
|
||||
//! RETRIEVE query AST, builder, response types, and pagination cursor.
|
||||
|
||||
use std::fmt;
|
||||
|
||||
use base64::Engine as _;
|
||||
|
||||
use super::errors::QueryError;
|
||||
use crate::ranking::diversity::DiversityConstraints;
|
||||
use crate::ranking::profile::CandidateStrategy;
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
@ -497,65 +488,6 @@ impl fmt::Display for Cursor {
|
||||
}
|
||||
}
|
||||
|
||||
// ── QueryError ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// Errors arising from query construction, validation, or execution.
|
||||
///
|
||||
/// Replaces the Milestone 1 stub `QueryError { message }`. Each variant
|
||||
/// carries structured context so callers can programmatically handle
|
||||
/// specific failure modes.
|
||||
#[derive(Debug)]
|
||||
pub enum QueryError {
|
||||
/// The named ranking profile does not exist in the registry.
|
||||
ProfileNotFound(String),
|
||||
/// A filter references an invalid field or has an unsupported value.
|
||||
InvalidFilter { field: String, reason: String },
|
||||
/// The requested limit is outside the allowed range.
|
||||
InvalidLimit {
|
||||
requested: usize,
|
||||
min: usize,
|
||||
max: usize,
|
||||
},
|
||||
/// A required index (vector, text, bitmap) is not available.
|
||||
IndexNotAvailable(String),
|
||||
/// The underlying storage engine returned an error.
|
||||
StorageError(String),
|
||||
/// A pagination cursor could not be decoded.
|
||||
InvalidCursor(String),
|
||||
/// The query requires a candidate strategy not yet implemented.
|
||||
UnsupportedStrategy(String),
|
||||
}
|
||||
|
||||
impl fmt::Display for QueryError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
match self {
|
||||
Self::ProfileNotFound(name) => write!(f, "profile '{name}' not found"),
|
||||
Self::InvalidFilter { field, reason } => {
|
||||
write!(f, "invalid filter on '{field}': {reason}")
|
||||
}
|
||||
Self::InvalidLimit {
|
||||
requested,
|
||||
min,
|
||||
max,
|
||||
} => {
|
||||
write!(f, "limit {requested} out of range [{min}, {max}]")
|
||||
}
|
||||
Self::IndexNotAvailable(name) => write!(f, "index '{name}' not available"),
|
||||
Self::StorageError(msg) => write!(f, "storage error: {msg}"),
|
||||
Self::InvalidCursor(msg) => write!(f, "invalid cursor: {msg}"),
|
||||
Self::UnsupportedStrategy(msg) => write!(f, "unsupported strategy: {msg}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for QueryError {}
|
||||
|
||||
impl From<crate::storage::StorageError> for QueryError {
|
||||
fn from(e: crate::storage::StorageError) -> Self {
|
||||
Self::StorageError(e.to_string())
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ───────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
@ -564,7 +496,6 @@ mod tests {
|
||||
use super::*;
|
||||
use crate::ranking::profile::{CandidateStrategy, DiversitySpec, RankingProfile};
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
use crate::storage::StorageError;
|
||||
|
||||
// ── ProfileRef ──────────────────────────────────────────────────────
|
||||
|
||||
@ -803,54 +734,6 @@ mod tests {
|
||||
assert!(!results_with_items.is_empty());
|
||||
}
|
||||
|
||||
// ── QueryError ──────────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn query_error_display_profile_not_found() {
|
||||
let e = QueryError::ProfileNotFound("trending".into());
|
||||
assert!(e.to_string().contains("trending"));
|
||||
assert!(e.to_string().contains("not found"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_invalid_filter() {
|
||||
let e = QueryError::InvalidFilter {
|
||||
field: "category".into(),
|
||||
reason: "unknown value".into(),
|
||||
};
|
||||
assert!(e.to_string().contains("category"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_invalid_limit() {
|
||||
let e = QueryError::InvalidLimit {
|
||||
requested: 0,
|
||||
min: 1,
|
||||
max: 500,
|
||||
};
|
||||
assert!(e.to_string().contains('0'));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_invalid_cursor() {
|
||||
let e = QueryError::InvalidCursor("bad data".into());
|
||||
assert!(e.to_string().contains("bad data"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_display_unsupported_strategy() {
|
||||
let e = QueryError::UnsupportedStrategy("Hybrid requires M3".into());
|
||||
assert!(e.to_string().contains("Hybrid"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn query_error_from_storage_error() {
|
||||
let storage_err = StorageError::Closed;
|
||||
let query_err: QueryError = storage_err.into();
|
||||
assert!(matches!(query_err, QueryError::StorageError(_)));
|
||||
assert!(query_err.to_string().contains("storage"));
|
||||
}
|
||||
|
||||
// ── validate ────────────────────────────────────────────────────────
|
||||
|
||||
fn make_registry_with_profile(name: &str, strategy: CandidateStrategy) -> ProfileRegistry {
|
||||
595
tidal/src/query/search/executor.rs
Normal file
595
tidal/src/query/search/executor.rs
Normal file
@ -0,0 +1,595 @@
|
||||
//! SEARCH query executor -- 8-stage pipeline.
|
||||
//!
|
||||
//! Implements the pipeline that turns a `Search` AST into `SearchResults`:
|
||||
//! 1a. Text retrieval (BM25 via Tantivy), 1b. ANN retrieval, 1c. Fusion (RRF),
|
||||
//! 2. Metadata filter, 2.5. User-context filter, 3. Profile scoring,
|
||||
//! 4. Diversity enforcement, 5. Result assembly.
|
||||
|
||||
use std::collections::{HashMap, HashSet};
|
||||
use std::sync::{Arc, RwLock};
|
||||
|
||||
use roaring::RoaringBitmap;
|
||||
|
||||
use super::executor_helpers::{
|
||||
build_user_context, evaluate_metadata_filter, extract_user_state_filters,
|
||||
has_creator_metadata_filter,
|
||||
};
|
||||
use super::types::{Search, SearchResultItem, SearchResults};
|
||||
use crate::db::deserialize_metadata as deserialize_item_metadata;
|
||||
use crate::entities::{
|
||||
CreatorItemsBitmap, HardNegIndex, InteractionLedger, PreferenceVectors, UserStateIndex,
|
||||
};
|
||||
use crate::query::fusion::{HybridFusion, RetrievalMode, ann_to_ranked, route_results};
|
||||
use crate::query::retrieve::{Cursor, QueryError, Signal};
|
||||
use crate::ranking::diversity::DiversitySelector;
|
||||
use crate::ranking::executor::ProfileExecutor;
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
use crate::schema::{EntityId, EntityKind, Timestamp};
|
||||
use crate::session::{SessionContext, SessionSnapshot};
|
||||
use crate::signals::SignalLedger;
|
||||
use crate::storage::indexes::bitmap::BitmapIndex;
|
||||
use crate::storage::indexes::filter::{FilterEvaluator, FilterExpr, FilterResult};
|
||||
use crate::storage::indexes::range::RangeIndex;
|
||||
use crate::storage::vector::registry::EmbeddingSlotRegistry;
|
||||
use crate::storage::{StorageEngine, Tag, encode_key};
|
||||
use crate::text::collectors::AllScoresCollector;
|
||||
|
||||
// ── SearchExecutor ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Executes SEARCH queries through the 8-stage pipeline.
|
||||
///
|
||||
/// Constructed per-query by `TidalDb::search()`. Holds borrowed references
|
||||
/// to all infrastructure needed for BM25 retrieval, ANN retrieval, fusion,
|
||||
/// filtering, signal scoring, and diversity enforcement.
|
||||
///
|
||||
/// Index references are `Option` because the database may not have all indexes
|
||||
/// available. The executor degrades gracefully when indexes are absent.
|
||||
pub struct SearchExecutor<'a> {
|
||||
ledger: &'a SignalLedger,
|
||||
profile_registry: &'a ProfileRegistry,
|
||||
/// Text index for BM25 retrieval (items). `None` when no text fields are declared.
|
||||
text_index: Option<&'a Arc<crate::text::TextIndex>>,
|
||||
/// Text index for BM25 retrieval (creators). `None` when no creator text fields are declared.
|
||||
creator_text_index: Option<&'a Arc<crate::text::TextIndex>>,
|
||||
/// Embedding slot registry for ANN retrieval. `None` when no embedding slots declared.
|
||||
embedding_registry: Option<&'a RwLock<EmbeddingSlotRegistry>>,
|
||||
category_index: Option<&'a BitmapIndex>,
|
||||
format_index: Option<&'a BitmapIndex>,
|
||||
creator_index: Option<&'a BitmapIndex>,
|
||||
tag_index: Option<&'a BitmapIndex>,
|
||||
duration_index: Option<&'a RangeIndex<u32>>,
|
||||
created_at_index: Option<&'a RangeIndex<u64>>,
|
||||
universe: Option<&'a RwLock<RoaringBitmap>>,
|
||||
user_state: Option<&'a UserStateIndex>,
|
||||
hard_negatives: Option<&'a HardNegIndex>,
|
||||
interaction_ledger: Option<&'a InteractionLedger>,
|
||||
creator_items: Option<&'a CreatorItemsBitmap>,
|
||||
#[allow(dead_code)]
|
||||
preference_vectors: Option<&'a PreferenceVectors>,
|
||||
items_storage: Option<&'a dyn StorageEngine>,
|
||||
/// Storage engine for creator entities, used to load metadata for creator
|
||||
/// search results and to evaluate metadata-based post-filters.
|
||||
creators_storage: Option<&'a dyn StorageEngine>,
|
||||
session_context: Option<SessionContext>,
|
||||
session_snapshot: Option<SessionSnapshot>,
|
||||
}
|
||||
|
||||
impl<'a> SearchExecutor<'a> {
|
||||
/// Create a new executor with the core infrastructure.
|
||||
#[must_use]
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub const fn new(
|
||||
ledger: &'a SignalLedger,
|
||||
profile_registry: &'a ProfileRegistry,
|
||||
text_index: Option<&'a Arc<crate::text::TextIndex>>,
|
||||
embedding_registry: Option<&'a RwLock<EmbeddingSlotRegistry>>,
|
||||
category_index: Option<&'a BitmapIndex>,
|
||||
format_index: Option<&'a BitmapIndex>,
|
||||
creator_index: Option<&'a BitmapIndex>,
|
||||
tag_index: Option<&'a BitmapIndex>,
|
||||
duration_index: Option<&'a RangeIndex<u32>>,
|
||||
created_at_index: Option<&'a RangeIndex<u64>>,
|
||||
universe: Option<&'a RwLock<RoaringBitmap>>,
|
||||
) -> Self {
|
||||
Self {
|
||||
ledger,
|
||||
profile_registry,
|
||||
text_index,
|
||||
creator_text_index: None,
|
||||
embedding_registry,
|
||||
category_index,
|
||||
format_index,
|
||||
creator_index,
|
||||
tag_index,
|
||||
duration_index,
|
||||
created_at_index,
|
||||
universe,
|
||||
user_state: None,
|
||||
hard_negatives: None,
|
||||
interaction_ledger: None,
|
||||
creator_items: None,
|
||||
preference_vectors: None,
|
||||
items_storage: None,
|
||||
creators_storage: None,
|
||||
session_context: None,
|
||||
session_snapshot: None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Attach M3 user-context for personalization and user-state filtering.
|
||||
#[must_use]
|
||||
pub const fn with_user_context(
|
||||
mut self,
|
||||
user_state: &'a UserStateIndex,
|
||||
hard_negatives: &'a HardNegIndex,
|
||||
interaction_ledger: &'a InteractionLedger,
|
||||
creator_items: &'a CreatorItemsBitmap,
|
||||
) -> Self {
|
||||
self.user_state = Some(user_state);
|
||||
self.hard_negatives = Some(hard_negatives);
|
||||
self.interaction_ledger = Some(interaction_ledger);
|
||||
self.creator_items = Some(creator_items);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attach preference vectors.
|
||||
#[must_use]
|
||||
pub const fn with_preference_vectors(
|
||||
mut self,
|
||||
preference_vectors: &'a PreferenceVectors,
|
||||
) -> Self {
|
||||
self.preference_vectors = Some(preference_vectors);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attach a creator text index for BM25 retrieval when `entity_kind = Creator`.
|
||||
#[must_use]
|
||||
pub const fn with_creator_text_index(mut self, idx: &'a Arc<crate::text::TextIndex>) -> Self {
|
||||
self.creator_text_index = Some(idx);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attach items storage for metadata lookup during session-boosted scoring.
|
||||
#[must_use]
|
||||
pub fn with_items_storage(mut self, storage: &'a dyn StorageEngine) -> Self {
|
||||
self.items_storage = Some(storage);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attach creators storage for metadata lookup in creator search results
|
||||
/// and metadata-based post-filtering.
|
||||
#[must_use]
|
||||
pub fn with_creators_storage(mut self, storage: &'a dyn StorageEngine) -> Self {
|
||||
self.creators_storage = Some(storage);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attach M4 session context for FOR SESSION ranking boost.
|
||||
#[must_use]
|
||||
pub fn with_session(mut self, context: SessionContext, snapshot: SessionSnapshot) -> Self {
|
||||
self.session_context = Some(context);
|
||||
self.session_snapshot = Some(snapshot);
|
||||
self
|
||||
}
|
||||
|
||||
/// Execute a SEARCH query through the 8-stage pipeline.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `QueryError` on validation failure, missing index, profile not
|
||||
/// found, or storage error during BM25/ANN retrieval.
|
||||
#[allow(clippy::too_many_lines)]
|
||||
pub fn execute(&self, query: &Search) -> Result<SearchResults, QueryError> {
|
||||
let mode =
|
||||
RetrievalMode::determine(query.query_text.is_some(), query.query_vector.is_some())
|
||||
.ok_or_else(|| QueryError::InvalidFilter {
|
||||
field: "query".into(),
|
||||
reason: "must provide at least one of query_text or query_vector".into(),
|
||||
})?;
|
||||
|
||||
// Resolve the profile.
|
||||
#[allow(clippy::option_if_let_else)]
|
||||
let profile = match query.profile.version {
|
||||
Some(v) => self
|
||||
.profile_registry
|
||||
.get_version(&query.profile.name, v)
|
||||
.map_err(|_| QueryError::ProfileNotFound(query.profile.name.clone()))?,
|
||||
None => self
|
||||
.profile_registry
|
||||
.get(&query.profile.name)
|
||||
.map_err(|_| QueryError::ProfileNotFound(query.profile.name.clone()))?,
|
||||
};
|
||||
|
||||
let mut warnings: Vec<String> = Vec::new();
|
||||
|
||||
// ── Stage 1a: Text Retrieval (BM25) ────────────────────────────────
|
||||
let mut bm25_results: Vec<(EntityId, f32)> = Vec::new();
|
||||
|
||||
// Route to creator or item text index based on entity_kind.
|
||||
let effective_text_index = match query.entity_kind {
|
||||
EntityKind::Creator => self.creator_text_index,
|
||||
_ => self.text_index,
|
||||
};
|
||||
|
||||
if let Some(ref query_text) = query.query_text {
|
||||
match effective_text_index {
|
||||
None => {
|
||||
warnings.push("text_index unavailable; text retrieval skipped".to_string());
|
||||
}
|
||||
Some(idx) => {
|
||||
let parser = idx.query_parser();
|
||||
let tq = parser
|
||||
.parse(query_text)
|
||||
.map_err(|e| QueryError::StorageError(format!("text query parse: {e}")))?;
|
||||
let searcher = idx.searcher();
|
||||
let collector = AllScoresCollector {
|
||||
entity_id_field: idx.fields().entity_id,
|
||||
};
|
||||
let mut raw = searcher
|
||||
.search(tq.as_ref(), &collector)
|
||||
.map_err(|e| QueryError::StorageError(format!("BM25 search: {e}")))?;
|
||||
// Sort descending by BM25 score (AllScoresCollector does not sort).
|
||||
raw.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
|
||||
bm25_results = raw;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ── Stage 1b: ANN Retrieval ─────────────────────────────────────────
|
||||
let mut ann_results: Vec<(EntityId, f32)> = Vec::new();
|
||||
|
||||
if let Some(ref query_vector) = query.query_vector {
|
||||
match self.embedding_registry {
|
||||
None => {
|
||||
warnings
|
||||
.push("embedding_registry unavailable; ANN retrieval skipped".to_string());
|
||||
}
|
||||
Some(registry_lock) => {
|
||||
let registry = registry_lock.read().map_err(|_| {
|
||||
QueryError::StorageError("embedding registry lock poisoned".to_string())
|
||||
})?;
|
||||
match registry.get(query.entity_kind, "content") {
|
||||
None => {
|
||||
warnings.push(format!(
|
||||
"no 'content' embedding slot for {:?}; ANN retrieval skipped",
|
||||
query.entity_kind
|
||||
));
|
||||
}
|
||||
Some(slot) => {
|
||||
let k = (query.limit as usize * 20).max(200);
|
||||
let raw = slot
|
||||
.index
|
||||
.search(query_vector, k, slot.params.ef_search)
|
||||
.map_err(|e| {
|
||||
QueryError::StorageError(format!("ANN search: {e}"))
|
||||
})?;
|
||||
ann_results = ann_to_ranked(&raw);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Build per-entity score maps for explainability in Stage 5.
|
||||
let bm25_map: HashMap<u64, f32> = bm25_results
|
||||
.iter()
|
||||
.map(|(id, s)| (id.as_u64(), *s))
|
||||
.collect();
|
||||
let ann_map: HashMap<u64, f32> = ann_results
|
||||
.iter()
|
||||
.map(|(id, d)| (id.as_u64(), *d))
|
||||
.collect();
|
||||
|
||||
// ── Stage 1c: Fusion ────────────────────────────────────────────────
|
||||
let fusion = HybridFusion::default();
|
||||
let fused = route_results(mode, &bm25_results, &ann_results, &fusion);
|
||||
let mut candidates: Vec<EntityId> = fused.iter().map(|(id, _)| *id).collect();
|
||||
|
||||
// Apply exclude list.
|
||||
if !query.exclude.is_empty() {
|
||||
let exclude_set: HashSet<u64> = query.exclude.iter().map(|id| id.as_u64()).collect();
|
||||
candidates.retain(|id| !exclude_set.contains(&id.as_u64()));
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
bm25_count = bm25_results.len(),
|
||||
ann_count = ann_results.len(),
|
||||
fused_count = candidates.len(),
|
||||
mode = ?mode,
|
||||
"search stage 1: retrieval complete"
|
||||
);
|
||||
|
||||
// ── Stage 2: Metadata Filter ─────────────────────────────────────────
|
||||
if let Some(filter_expr) = query.combined_filter() {
|
||||
let empty_bitmap = BitmapIndex::new("_empty");
|
||||
let empty_dur = RangeIndex::<u32>::new("_empty");
|
||||
let empty_ts = RangeIndex::<u64>::new("_empty");
|
||||
let empty_universe = RoaringBitmap::new();
|
||||
|
||||
let cat = self.category_index.unwrap_or(&empty_bitmap);
|
||||
let fmt = self.format_index.unwrap_or(&empty_bitmap);
|
||||
let cre = self.creator_index.unwrap_or(&empty_bitmap);
|
||||
let tag = self.tag_index.unwrap_or(&empty_bitmap);
|
||||
let dur = self.duration_index.unwrap_or(&empty_dur);
|
||||
let ts = self.created_at_index.unwrap_or(&empty_ts);
|
||||
|
||||
let universe_guard;
|
||||
#[allow(clippy::option_if_let_else)]
|
||||
let universe_ref = match self.universe {
|
||||
Some(u) => match u.read() {
|
||||
Ok(guard) => {
|
||||
universe_guard = guard;
|
||||
&*universe_guard
|
||||
}
|
||||
Err(_) => &empty_universe,
|
||||
},
|
||||
None => &empty_universe,
|
||||
};
|
||||
|
||||
let evaluator = FilterEvaluator::new(cat, fmt, cre, tag, dur, ts, universe_ref);
|
||||
let filter_result = evaluator.evaluate(&filter_expr);
|
||||
|
||||
match filter_result {
|
||||
FilterResult::Bitmap(bitmap) => {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| bitmap.contains(id.as_u64() as u32));
|
||||
}
|
||||
FilterResult::Predicate(pred) => {
|
||||
candidates.retain(|id| pred(id.as_u64()));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
candidates = candidates.len(),
|
||||
"search stage 2: filter applied"
|
||||
);
|
||||
|
||||
// ── Stage 2b: Creator Metadata Post-Filter ──────────────────────────
|
||||
// For creator searches, evaluate metadata-based filter predicates against
|
||||
// actual creator metadata from storage. This handles filters like
|
||||
// verified=true, language=en that aren't in bitmap indexes for creators.
|
||||
if query.entity_kind == EntityKind::Creator
|
||||
&& let Some(combined) = query.combined_filter()
|
||||
&& has_creator_metadata_filter(&combined)
|
||||
&& !candidates.is_empty()
|
||||
&& let Some(storage) = self.creators_storage
|
||||
{
|
||||
candidates.retain(|&id| {
|
||||
let key = encode_key(id, Tag::Meta, b"");
|
||||
let meta = storage
|
||||
.get(&key)
|
||||
.ok()
|
||||
.flatten()
|
||||
.map(|bytes| {
|
||||
let (_emb, meta) = crate::entities::deserialize_entity(&bytes);
|
||||
meta
|
||||
})
|
||||
.unwrap_or_default();
|
||||
evaluate_metadata_filter(&combined, &meta)
|
||||
});
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
candidates = candidates.len(),
|
||||
"search stage 2b: creator metadata filter applied"
|
||||
);
|
||||
|
||||
// ── Stage 2.5: User-Context Filtering ────────────────────────────────
|
||||
if let Some(user_id) = query.for_user {
|
||||
if let Some(user_state) = self.user_state {
|
||||
let seen = user_state.seen_bitmap(user_id);
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| !seen.contains(id.as_u64() as u32));
|
||||
|
||||
let hidden = user_state.hidden_items(user_id);
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| !hidden.contains(id.as_u64() as u32));
|
||||
|
||||
if let Some(creator_items) = self.creator_items {
|
||||
let blocked_creators = user_state.blocked_creators(user_id);
|
||||
if !blocked_creators.is_empty() {
|
||||
let mut blocked_items = RoaringBitmap::new();
|
||||
for &cid in &blocked_creators {
|
||||
if let Some(bm) = creator_items.get(cid) {
|
||||
blocked_items |= &bm;
|
||||
}
|
||||
}
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| !blocked_items.contains(id.as_u64() as u32));
|
||||
}
|
||||
}
|
||||
|
||||
// User-state inclusion filters (Saved, Liked, InProgress).
|
||||
if let Some(ref filter_expr) = query.combined_filter() {
|
||||
let user_filters = extract_user_state_filters(filter_expr);
|
||||
for uf in &user_filters {
|
||||
match uf {
|
||||
FilterExpr::Saved(uid) => {
|
||||
let saved = user_state.saved_bitmap(*uid);
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| saved.contains(id.as_u64() as u32));
|
||||
}
|
||||
FilterExpr::Liked(uid) => {
|
||||
let liked = user_state.liked_bitmap(*uid);
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| liked.contains(id.as_u64() as u32));
|
||||
}
|
||||
FilterExpr::InProgress {
|
||||
user_id: uid,
|
||||
threshold,
|
||||
} => {
|
||||
candidates.retain(|id| {
|
||||
user_state.is_in_progress(*uid, id.as_u64(), *threshold)
|
||||
});
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if let Some(hard_neg) = self.hard_negatives {
|
||||
let neg_bitmap = hard_neg.bitmap(user_id);
|
||||
if !neg_bitmap.is_empty() {
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
candidates.retain(|id| !neg_bitmap.contains(id.as_u64() as u32));
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
candidates = candidates.len(),
|
||||
user_id,
|
||||
"search stage 2.5: user-context filter applied"
|
||||
);
|
||||
}
|
||||
|
||||
// ── Stage 3: Profile Scoring ─────────────────────────────────────────
|
||||
let now = Timestamp::now();
|
||||
let executor = ProfileExecutor::new(self.ledger);
|
||||
|
||||
// Pre-load item metadata for keyword hint matching when session is active.
|
||||
let item_metadata: HashMap<u64, HashMap<String, String>> = if self.session_context.is_some()
|
||||
{
|
||||
self.items_storage.map_or_else(HashMap::new, |storage| {
|
||||
candidates
|
||||
.iter()
|
||||
.filter_map(|&eid| {
|
||||
let key = encode_key(eid, Tag::Meta, b"");
|
||||
storage
|
||||
.get(&key)
|
||||
.ok()
|
||||
.flatten()
|
||||
.map(|bytes| (eid.as_u64(), deserialize_item_metadata(&bytes)))
|
||||
})
|
||||
.collect()
|
||||
})
|
||||
} else {
|
||||
HashMap::new()
|
||||
};
|
||||
|
||||
#[allow(clippy::option_if_let_else)]
|
||||
let scored = if let Some(user_id) = query.for_user {
|
||||
let user_ctx =
|
||||
build_user_context(self.interaction_ledger, self.creator_items, user_id, now);
|
||||
executor.score_personalized(
|
||||
&candidates,
|
||||
profile,
|
||||
now,
|
||||
self.session_context.as_ref(),
|
||||
&user_ctx,
|
||||
&item_metadata,
|
||||
)
|
||||
} else {
|
||||
executor.score_with_session(
|
||||
&candidates,
|
||||
profile,
|
||||
now,
|
||||
self.session_context.as_ref(),
|
||||
&item_metadata,
|
||||
)
|
||||
};
|
||||
|
||||
let total_scored = scored.len();
|
||||
|
||||
tracing::debug!(scored = scored.len(), "search stage 3: scored");
|
||||
|
||||
// ── Stage 4: Diversity Enforcement ───────────────────────────────────
|
||||
let (final_candidates, constraints_satisfied) = if let Some(ref diversity) = query.diversity
|
||||
{
|
||||
let result = DiversitySelector::select(&scored, diversity, scored.len());
|
||||
let satisfied = result.constraints_satisfied;
|
||||
if !satisfied {
|
||||
for v in &result.violations {
|
||||
warnings.push(format!(
|
||||
"diversity constraint '{}' relaxed: {}",
|
||||
v.constraint, v.detail
|
||||
));
|
||||
}
|
||||
}
|
||||
(result.selected, satisfied)
|
||||
} else {
|
||||
(scored, true)
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
final_count = final_candidates.len(),
|
||||
"search stage 4: diversity enforced"
|
||||
);
|
||||
|
||||
// ── Stage 5: Result Assembly ──────────────────────────────────────────
|
||||
let offset = query.cursor.as_ref().map_or(0, Cursor::offset);
|
||||
let limit = query.limit as usize;
|
||||
let end = (offset + limit).min(final_candidates.len());
|
||||
let page = if offset < final_candidates.len() {
|
||||
&final_candidates[offset..end]
|
||||
} else {
|
||||
&[]
|
||||
};
|
||||
|
||||
// Load creator metadata for result enrichment when searching creators.
|
||||
let creator_meta: HashMap<u64, HashMap<String, String>> =
|
||||
if query.entity_kind == EntityKind::Creator {
|
||||
self.creators_storage.map_or_else(HashMap::new, |storage| {
|
||||
page.iter()
|
||||
.filter_map(|c| {
|
||||
let key = encode_key(c.entity_id, Tag::Meta, b"");
|
||||
storage.get(&key).ok().flatten().map(|bytes| {
|
||||
let (_emb, meta) = crate::entities::deserialize_entity(&bytes);
|
||||
(c.entity_id.as_u64(), meta)
|
||||
})
|
||||
})
|
||||
.collect()
|
||||
})
|
||||
} else {
|
||||
HashMap::new()
|
||||
};
|
||||
|
||||
let items: Vec<SearchResultItem> = page
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, c)| {
|
||||
let signals: Vec<Signal> = c
|
||||
.signal_snapshot
|
||||
.iter()
|
||||
.map(|(name, value)| Signal {
|
||||
name: name.clone(),
|
||||
value: *value,
|
||||
source: "decay_score".to_string(),
|
||||
})
|
||||
.collect();
|
||||
|
||||
SearchResultItem {
|
||||
entity_id: c.entity_id,
|
||||
score: c.score,
|
||||
rank: offset + i + 1,
|
||||
bm25_score: bm25_map.get(&c.entity_id.as_u64()).copied(),
|
||||
semantic_score: ann_map.get(&c.entity_id.as_u64()).copied(),
|
||||
signals,
|
||||
metadata: creator_meta.get(&c.entity_id.as_u64()).cloned(),
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
|
||||
let next_cursor = if end < final_candidates.len() {
|
||||
Some(Cursor::from_offset(end))
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
returned = items.len(),
|
||||
has_next_cursor = next_cursor.is_some(),
|
||||
"search stage 5: results assembled"
|
||||
);
|
||||
|
||||
Ok(SearchResults {
|
||||
items,
|
||||
next_cursor,
|
||||
total_candidates: total_scored,
|
||||
constraints_satisfied,
|
||||
warnings,
|
||||
session_snapshot: self.session_snapshot.clone(),
|
||||
})
|
||||
}
|
||||
}
|
||||
121
tidal/src/query/search/executor_helpers.rs
Normal file
121
tidal/src/query/search/executor_helpers.rs
Normal file
@ -0,0 +1,121 @@
|
||||
//! Helper functions extracted from the SEARCH executor to keep file size
|
||||
//! under the 600-line limit (`CODING_GUIDELINES` §9).
|
||||
//!
|
||||
//! - User-state filter extraction helpers
|
||||
//! - `build_user_context` free function (was a private method on `SearchExecutor`)
|
||||
//! - Creator metadata filter predicates
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::entities::{CreatorItemsBitmap, InteractionLedger};
|
||||
use crate::ranking::executor::UserContext;
|
||||
use crate::schema::Timestamp;
|
||||
use crate::storage::indexes::filter::FilterExpr;
|
||||
|
||||
// ── User-state filter extraction ─────────────────────────────────────────────
|
||||
|
||||
fn collect_user_state_filters<'a>(expr: &'a FilterExpr, out: &mut Vec<&'a FilterExpr>) {
|
||||
match expr {
|
||||
FilterExpr::Saved(_) | FilterExpr::Liked(_) | FilterExpr::InProgress { .. } => {
|
||||
out.push(expr);
|
||||
}
|
||||
FilterExpr::And(children) | FilterExpr::Or(children) => {
|
||||
for child in children {
|
||||
collect_user_state_filters(child, out);
|
||||
}
|
||||
}
|
||||
FilterExpr::Not(inner) => {
|
||||
collect_user_state_filters(inner, out);
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
pub(super) fn extract_user_state_filters(expr: &FilterExpr) -> Vec<&FilterExpr> {
|
||||
let mut result = Vec::new();
|
||||
collect_user_state_filters(expr, &mut result);
|
||||
result
|
||||
}
|
||||
|
||||
// ── UserContext builder ───────────────────────────────────────────────────────
|
||||
|
||||
/// Build a `UserContext` for personalized scoring.
|
||||
pub(super) fn build_user_context(
|
||||
interaction_ledger: Option<&InteractionLedger>,
|
||||
creator_items: Option<&CreatorItemsBitmap>,
|
||||
user_id: u64,
|
||||
now: Timestamp,
|
||||
) -> UserContext {
|
||||
let now_ns = now.as_nanos();
|
||||
|
||||
let top_creators = interaction_ledger
|
||||
.map(|il| il.top_creators(user_id, 50, now_ns))
|
||||
.unwrap_or_default();
|
||||
|
||||
let mut creator_interaction_boosts: HashMap<u32, f64> = HashMap::new();
|
||||
if let Some(creator_items) = creator_items {
|
||||
for (creator_id, weight) in &top_creators {
|
||||
if let Some(bitmap) = creator_items.get(*creator_id) {
|
||||
for item_id in &bitmap {
|
||||
creator_interaction_boosts.insert(item_id, *weight);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if !creator_interaction_boosts.is_empty() {
|
||||
let max_weight = creator_interaction_boosts
|
||||
.values()
|
||||
.copied()
|
||||
.fold(0.0_f64, f64::max);
|
||||
if max_weight > f64::EPSILON {
|
||||
for val in creator_interaction_boosts.values_mut() {
|
||||
*val /= max_weight;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
UserContext {
|
||||
user_id,
|
||||
creator_interaction_boosts,
|
||||
}
|
||||
}
|
||||
|
||||
// ── Creator metadata filter helpers ──────────────────────────────────────────
|
||||
|
||||
/// Check if a filter expression contains metadata-based predicates
|
||||
/// (`CategoryEq`, `FormatEq`) that need evaluating against entity metadata.
|
||||
///
|
||||
/// For creator searches, bitmap indexes are not populated with per-creator
|
||||
/// metadata like `verified` or `language`. Instead, `FilterExpr::eq("language", "en")`
|
||||
/// routes to `CategoryEq` and needs to be evaluated against the entity store.
|
||||
pub(super) fn has_creator_metadata_filter(expr: &FilterExpr) -> bool {
|
||||
match expr {
|
||||
FilterExpr::CategoryEq(_) | FilterExpr::FormatEq(_) => true,
|
||||
FilterExpr::And(children) | FilterExpr::Or(children) => {
|
||||
children.iter().any(has_creator_metadata_filter)
|
||||
}
|
||||
FilterExpr::Not(inner) => has_creator_metadata_filter(inner),
|
||||
_ => false,
|
||||
}
|
||||
}
|
||||
|
||||
/// Evaluate a filter expression against entity metadata key-value map.
|
||||
///
|
||||
/// Returns `true` if the entity passes the filter. Non-metadata filter
|
||||
/// variants (duration, range, user-state) pass through -- they are handled
|
||||
/// by earlier stages.
|
||||
pub(super) fn evaluate_metadata_filter(expr: &FilterExpr, meta: &HashMap<String, String>) -> bool {
|
||||
match expr {
|
||||
// CategoryEq and FormatEq are the "generic eq" bucket in the current
|
||||
// filter expr design. FilterExpr::eq("language", "en") produces
|
||||
// CategoryEq("en"), so we check every metadata value for a match.
|
||||
FilterExpr::CategoryEq(value) => meta.values().any(|v| v == value),
|
||||
FilterExpr::FormatEq(value) => meta.values().any(|v| v == value),
|
||||
FilterExpr::And(children) => children.iter().all(|c| evaluate_metadata_filter(c, meta)),
|
||||
FilterExpr::Or(children) => children.iter().any(|c| evaluate_metadata_filter(c, meta)),
|
||||
FilterExpr::Not(inner) => !evaluate_metadata_filter(inner, meta),
|
||||
// Non-metadata filters pass through (already handled by bitmap/range evaluator).
|
||||
_ => true,
|
||||
}
|
||||
}
|
||||
25
tidal/src/query/search/mod.rs
Normal file
25
tidal/src/query/search/mod.rs
Normal file
@ -0,0 +1,25 @@
|
||||
//! SEARCH query AST, builder, response types, and executor.
|
||||
//!
|
||||
//! The primary query type is `Search` -- a typed representation capturing the
|
||||
//! full intent of "retrieve and rank content matching this text/vector query."
|
||||
//! Constructed via `SearchBuilder`.
|
||||
//!
|
||||
//! The `SearchExecutor` implements an 8-stage pipeline that blends BM25 text
|
||||
//! relevance, ANN vector similarity, personalization, filtering, and diversity
|
||||
//! enforcement into a single response:
|
||||
//!
|
||||
//! 1a. **Text retrieval** (BM25 via Tantivy, when `query_text` is set)
|
||||
//! 1b. **Vector retrieval** (ANN via HNSW, when `query_vector` is set)
|
||||
//! 1c. **Fusion** (RRF for hybrid; passthrough for single-modality)
|
||||
//! 2. **Metadata filter** (bitmap + range indexes)
|
||||
//! 2.5. **User-context filter** (seen, blocked, hard negatives)
|
||||
//! 3. **Profile scoring** (signal-based, personalized when `for_user` is set)
|
||||
//! 4. **Diversity enforcement** (per-creator, format-mix)
|
||||
//! 5. **Result assembly** (with BM25 + semantic score explainability)
|
||||
|
||||
pub mod executor;
|
||||
mod executor_helpers;
|
||||
pub mod types;
|
||||
|
||||
pub use executor::SearchExecutor;
|
||||
pub use types::{Search, SearchBuilder, SearchResultItem, SearchResults};
|
||||
484
tidal/src/query/search/types.rs
Normal file
484
tidal/src/query/search/types.rs
Normal file
@ -0,0 +1,484 @@
|
||||
//! SEARCH query AST, builder, and response types.
|
||||
//!
|
||||
//! Contains the data structures for constructing and representing SEARCH queries
|
||||
//! and their results. No execution logic lives here -- see `executor.rs` for the
|
||||
//! 8-stage pipeline.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::query::retrieve::{Cursor, ProfileRef, QueryError, Signal};
|
||||
use crate::ranking::diversity::DiversityConstraints;
|
||||
use crate::schema::{EntityId, EntityKind};
|
||||
use crate::session::{SessionId, SessionSnapshot};
|
||||
use crate::storage::indexes::filter::FilterExpr;
|
||||
|
||||
// ── Search ───────────────────────────────────────────────────────────────────
|
||||
|
||||
/// The typed AST for a SEARCH query.
|
||||
///
|
||||
/// Captures the full query intent: text or vector input (or both for hybrid),
|
||||
/// optional personalization, filters, diversity, and pagination. Constructed
|
||||
/// via `SearchBuilder`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct Search {
|
||||
/// Which entity kind to search (usually `Item`).
|
||||
pub entity_kind: EntityKind,
|
||||
/// Optional free-text query string for BM25 retrieval.
|
||||
pub query_text: Option<String>,
|
||||
/// Optional embedding vector for ANN retrieval.
|
||||
pub query_vector: Option<Vec<f32>>,
|
||||
/// User ID for personalized ranking.
|
||||
pub for_user: Option<u64>,
|
||||
/// Session ID for FOR SESSION ranking boost.
|
||||
pub for_session: Option<SessionId>,
|
||||
/// Which ranking profile to apply for signal-based re-ranking.
|
||||
pub profile: ProfileRef,
|
||||
/// Zero or more filter expressions, AND-combined.
|
||||
pub filters: Vec<FilterExpr>,
|
||||
/// Optional diversity constraints.
|
||||
pub diversity: Option<DiversityConstraints>,
|
||||
/// Maximum number of results to return (`1..=1000`).
|
||||
pub limit: u32,
|
||||
/// Entity IDs to exclude from results (e.g., already-seen items).
|
||||
pub exclude: Vec<EntityId>,
|
||||
/// Opaque pagination cursor.
|
||||
pub cursor: Option<Cursor>,
|
||||
/// Entity ID to use as the vector query source ("creators like X").
|
||||
/// When set, `TidalDb::search()` reads this entity's stored embedding
|
||||
/// and populates `query_vector` before executing.
|
||||
pub similar_to: Option<EntityId>,
|
||||
}
|
||||
|
||||
impl Search {
|
||||
/// Create a `SearchBuilder` with sensible defaults.
|
||||
///
|
||||
/// Default: `entity_kind = Item`, `profile = "search"`, `limit = 20`.
|
||||
#[must_use]
|
||||
pub fn builder() -> SearchBuilder {
|
||||
SearchBuilder::new()
|
||||
}
|
||||
|
||||
/// Combine all filter expressions into a single `FilterExpr`.
|
||||
///
|
||||
/// - Zero filters: `None`
|
||||
/// - One filter: returns it directly
|
||||
/// - Multiple filters: wraps in `FilterExpr::And`
|
||||
#[must_use]
|
||||
pub fn combined_filter(&self) -> Option<FilterExpr> {
|
||||
match self.filters.len() {
|
||||
0 => None,
|
||||
1 => Some(self.filters[0].clone()),
|
||||
_ => Some(FilterExpr::And(self.filters.clone())),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ── SearchBuilder ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Builder for `Search` queries.
|
||||
///
|
||||
/// Provides a fluent API for constructing SEARCH queries. At least one of
|
||||
/// `query()` (text) or `vector()` (embedding) must be set before calling
|
||||
/// `build()`.
|
||||
///
|
||||
/// # Examples
|
||||
///
|
||||
/// ```
|
||||
/// use tidaldb::query::search::{Search, SearchBuilder};
|
||||
///
|
||||
/// let search = Search::builder()
|
||||
/// .query("Rust tutorial")
|
||||
/// .limit(20)
|
||||
/// .build()
|
||||
/// .unwrap();
|
||||
/// assert!(search.query_text.is_some());
|
||||
/// ```
|
||||
pub struct SearchBuilder {
|
||||
entity_kind: EntityKind,
|
||||
query_text: Option<String>,
|
||||
query_vector: Option<Vec<f32>>,
|
||||
for_user: Option<u64>,
|
||||
for_session: Option<SessionId>,
|
||||
profile: ProfileRef,
|
||||
filters: Vec<FilterExpr>,
|
||||
diversity: Option<DiversityConstraints>,
|
||||
limit: u32,
|
||||
exclude: Vec<EntityId>,
|
||||
cursor: Option<Cursor>,
|
||||
similar_to: Option<EntityId>,
|
||||
}
|
||||
|
||||
impl SearchBuilder {
|
||||
/// Create a new builder with defaults.
|
||||
///
|
||||
/// Default: `entity_kind = Item`, `profile = "search"`, `limit = 20`.
|
||||
#[must_use]
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
entity_kind: EntityKind::Item,
|
||||
query_text: None,
|
||||
query_vector: None,
|
||||
for_user: None,
|
||||
for_session: None,
|
||||
profile: ProfileRef::new("search"),
|
||||
filters: Vec::new(),
|
||||
diversity: None,
|
||||
limit: 20,
|
||||
exclude: Vec::new(),
|
||||
cursor: None,
|
||||
similar_to: None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Set the entity kind to search. Default is `Item`.
|
||||
#[must_use]
|
||||
pub const fn entity_kind(mut self, kind: crate::schema::EntityKind) -> Self {
|
||||
self.entity_kind = kind;
|
||||
self
|
||||
}
|
||||
|
||||
/// Set the free-text query string for BM25 retrieval.
|
||||
#[must_use]
|
||||
pub fn query(mut self, text: impl Into<String>) -> Self {
|
||||
self.query_text = Some(text.into());
|
||||
self
|
||||
}
|
||||
|
||||
/// Set the embedding vector for ANN retrieval.
|
||||
#[must_use]
|
||||
pub fn vector(mut self, vec: impl Into<Vec<f32>>) -> Self {
|
||||
self.query_vector = Some(vec.into());
|
||||
self
|
||||
}
|
||||
|
||||
/// Set the user ID for personalized ranking.
|
||||
#[must_use]
|
||||
pub const fn for_user(mut self, user_id: u64) -> Self {
|
||||
self.for_user = Some(user_id);
|
||||
self
|
||||
}
|
||||
|
||||
/// Apply FOR SESSION ranking boost using the given session ID.
|
||||
#[must_use]
|
||||
pub const fn for_session(mut self, session_id: SessionId) -> Self {
|
||||
self.for_session = Some(session_id);
|
||||
self
|
||||
}
|
||||
|
||||
/// Set the ranking profile by name.
|
||||
#[must_use]
|
||||
pub fn using_profile(mut self, name: impl Into<String>) -> Self {
|
||||
self.profile = ProfileRef::new(name);
|
||||
self
|
||||
}
|
||||
|
||||
/// Add a filter expression. Multiple filters are AND-combined.
|
||||
#[must_use]
|
||||
pub fn filter(mut self, expr: FilterExpr) -> Self {
|
||||
self.filters.push(expr);
|
||||
self
|
||||
}
|
||||
|
||||
/// Set diversity constraints.
|
||||
#[must_use]
|
||||
pub const fn diversity(mut self, constraints: DiversityConstraints) -> Self {
|
||||
self.diversity = Some(constraints);
|
||||
self
|
||||
}
|
||||
|
||||
/// Set the maximum number of results. Must be in `[1, 1000]`.
|
||||
#[must_use]
|
||||
pub const fn limit(mut self, n: u32) -> Self {
|
||||
self.limit = n;
|
||||
self
|
||||
}
|
||||
|
||||
/// Exclude entity IDs from results.
|
||||
#[must_use]
|
||||
pub fn exclude(mut self, ids: Vec<EntityId>) -> Self {
|
||||
self.exclude = ids;
|
||||
self
|
||||
}
|
||||
|
||||
/// Set a pagination cursor from a previous response.
|
||||
#[must_use]
|
||||
pub const fn cursor(mut self, c: Cursor) -> Self {
|
||||
self.cursor = Some(c);
|
||||
self
|
||||
}
|
||||
|
||||
/// Set "similar to" entity -- reads the entity's stored embedding and
|
||||
/// uses it as the query vector. For creator search, this resolves via
|
||||
/// `read_creator_embedding(id)` in `TidalDb::search()`.
|
||||
#[must_use]
|
||||
pub const fn similar_to(mut self, id: EntityId) -> Self {
|
||||
self.similar_to = Some(id);
|
||||
self
|
||||
}
|
||||
|
||||
/// Build the `Search` query.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `QueryError::InvalidFilter` if neither `query_text` nor
|
||||
/// `query_vector` is set. Returns `QueryError::InvalidLimit` if `limit`
|
||||
/// is 0 or greater than 1000.
|
||||
pub fn build(self) -> Result<Search, QueryError> {
|
||||
if self.query_text.is_none() && self.query_vector.is_none() && self.similar_to.is_none() {
|
||||
return Err(QueryError::InvalidFilter {
|
||||
field: "query".into(),
|
||||
reason: "must provide at least one of query_text, query_vector, or similar_to"
|
||||
.into(),
|
||||
});
|
||||
}
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
if self.limit == 0 || self.limit > 1000 {
|
||||
return Err(QueryError::InvalidLimit {
|
||||
requested: self.limit as usize,
|
||||
min: 1,
|
||||
max: 1000,
|
||||
});
|
||||
}
|
||||
Ok(Search {
|
||||
entity_kind: self.entity_kind,
|
||||
query_text: self.query_text,
|
||||
query_vector: self.query_vector,
|
||||
for_user: self.for_user,
|
||||
for_session: self.for_session,
|
||||
profile: self.profile,
|
||||
filters: self.filters,
|
||||
diversity: self.diversity,
|
||||
limit: self.limit,
|
||||
exclude: self.exclude,
|
||||
cursor: self.cursor,
|
||||
similar_to: self.similar_to,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for SearchBuilder {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
// ── SearchResults ────────────────────────────────────────────────────────────
|
||||
|
||||
/// The response from executing a SEARCH query.
|
||||
#[derive(Debug)]
|
||||
pub struct SearchResults {
|
||||
/// Ranked results in descending score order.
|
||||
pub items: Vec<SearchResultItem>,
|
||||
/// Cursor for fetching the next page. `None` if no more results.
|
||||
pub next_cursor: Option<Cursor>,
|
||||
/// Total candidates that matched filters before limit/diversity enforcement.
|
||||
pub total_candidates: usize,
|
||||
/// Whether all diversity constraints were satisfied.
|
||||
pub constraints_satisfied: bool,
|
||||
/// Warnings generated during execution (e.g., relaxed constraints).
|
||||
pub warnings: Vec<String>,
|
||||
/// Session snapshot at query time (populated when `for_session` is set).
|
||||
pub session_snapshot: Option<SessionSnapshot>,
|
||||
}
|
||||
|
||||
impl SearchResults {
|
||||
/// Number of results returned.
|
||||
#[must_use]
|
||||
pub const fn len(&self) -> usize {
|
||||
self.items.len()
|
||||
}
|
||||
|
||||
/// Whether the result set is empty.
|
||||
#[must_use]
|
||||
pub const fn is_empty(&self) -> bool {
|
||||
self.items.is_empty()
|
||||
}
|
||||
}
|
||||
|
||||
// ── SearchResultItem ─────────────────────────────────────────────────────────
|
||||
|
||||
/// A single ranked result from a SEARCH query.
|
||||
///
|
||||
/// Includes the entity ID, profile score, 1-based rank, plus BM25 and
|
||||
/// semantic scores from the retrieval stage for explainability.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SearchResultItem {
|
||||
/// The entity that was ranked.
|
||||
pub entity_id: EntityId,
|
||||
/// Normalized profile score in `[0.0, 1.0]`.
|
||||
pub score: f64,
|
||||
/// 1-based rank within the full result set.
|
||||
pub rank: usize,
|
||||
/// BM25 text relevance score (present when `query_text` matched this entity).
|
||||
pub bm25_score: Option<f32>,
|
||||
/// ANN L2 distance (present when `query_vector` was set; lower = more similar).
|
||||
pub semantic_score: Option<f32>,
|
||||
/// Signal values that contributed to the profile score, for explainability.
|
||||
pub signals: Vec<Signal>,
|
||||
/// Entity metadata (populated for creator results; `None` for item results by default).
|
||||
pub metadata: Option<HashMap<String, String>>,
|
||||
}
|
||||
|
||||
// ── Tests ────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::storage::indexes::filter::FilterExpr;
|
||||
|
||||
// ── SearchBuilder ────────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn builder_defaults() {
|
||||
let s = Search::builder().query("jazz").build().unwrap();
|
||||
assert_eq!(s.entity_kind, EntityKind::Item);
|
||||
assert_eq!(s.limit, 20);
|
||||
assert_eq!(s.profile.name, "search");
|
||||
assert!(s.filters.is_empty());
|
||||
assert!(s.diversity.is_none());
|
||||
assert!(s.for_user.is_none());
|
||||
assert!(s.for_session.is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_requires_query() {
|
||||
let result = Search::builder().build();
|
||||
assert!(
|
||||
matches!(result, Err(QueryError::InvalidFilter { .. })),
|
||||
"build() without query should fail"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_text_only() {
|
||||
let s = Search::builder().query("rust tutorial").build().unwrap();
|
||||
assert!(s.query_text.is_some());
|
||||
assert!(s.query_vector.is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_vector_only() {
|
||||
let vec = vec![0.1_f32, 0.2, 0.3];
|
||||
let s = Search::builder().vector(vec.clone()).build().unwrap();
|
||||
assert!(s.query_text.is_none());
|
||||
assert_eq!(s.query_vector.as_deref(), Some(vec.as_slice()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_hybrid() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.vector(vec![0.1_f32; 4])
|
||||
.build()
|
||||
.unwrap();
|
||||
assert!(s.query_text.is_some());
|
||||
assert!(s.query_vector.is_some());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_limit_zero_rejected() {
|
||||
let result = Search::builder().query("x").limit(0).build();
|
||||
assert!(matches!(result, Err(QueryError::InvalidLimit { .. })));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_limit_1001_rejected() {
|
||||
let result = Search::builder().query("x").limit(1001).build();
|
||||
assert!(matches!(result, Err(QueryError::InvalidLimit { .. })));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_limit_1000_accepted() {
|
||||
let s = Search::builder().query("x").limit(1000).build().unwrap();
|
||||
assert_eq!(s.limit, 1000);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_with_filter() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.filter(FilterExpr::eq("category", "music"))
|
||||
.build()
|
||||
.unwrap();
|
||||
assert_eq!(s.filters.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_with_exclude() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.exclude(vec![EntityId::new(1), EntityId::new(2)])
|
||||
.build()
|
||||
.unwrap();
|
||||
assert_eq!(s.exclude.len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_with_for_user() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.for_user(42)
|
||||
.build()
|
||||
.unwrap();
|
||||
assert_eq!(s.for_user, Some(42));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_custom_profile() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.using_profile("trending")
|
||||
.build()
|
||||
.unwrap();
|
||||
assert_eq!(s.profile.name, "trending");
|
||||
}
|
||||
|
||||
// ── combined_filter ──────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn combined_filter_empty() {
|
||||
let s = Search::builder().query("jazz").build().unwrap();
|
||||
assert!(s.combined_filter().is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn combined_filter_single() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.filter(FilterExpr::eq("category", "music"))
|
||||
.build()
|
||||
.unwrap();
|
||||
assert!(matches!(
|
||||
s.combined_filter(),
|
||||
Some(FilterExpr::CategoryEq(_))
|
||||
));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn combined_filter_multiple() {
|
||||
let s = Search::builder()
|
||||
.query("jazz")
|
||||
.filter(FilterExpr::eq("category", "music"))
|
||||
.filter(FilterExpr::eq("format", "video"))
|
||||
.build()
|
||||
.unwrap();
|
||||
assert!(matches!(s.combined_filter(), Some(FilterExpr::And(_))));
|
||||
}
|
||||
|
||||
// ── SearchResults ────────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn results_len_and_empty() {
|
||||
let r = SearchResults {
|
||||
items: vec![],
|
||||
next_cursor: None,
|
||||
total_candidates: 0,
|
||||
constraints_satisfied: true,
|
||||
warnings: vec![],
|
||||
session_snapshot: None,
|
||||
};
|
||||
assert_eq!(r.len(), 0);
|
||||
assert!(r.is_empty());
|
||||
}
|
||||
}
|
||||
@ -58,7 +58,7 @@ const TRENDING_MAX_PER_CREATOR: usize = 1;
|
||||
/// Maximum items per creator in hot results.
|
||||
const HOT_MAX_PER_CREATOR: usize = 2;
|
||||
|
||||
/// Register all 15 built-in ranking profiles into the given registry.
|
||||
/// Register all 16 built-in ranking profiles into the given registry.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
@ -83,6 +83,8 @@ pub fn register_builtins(registry: &mut ProfileRegistry) -> Result<(), ProfileEr
|
||||
registry.register(following())?;
|
||||
registry.register(related())?;
|
||||
registry.register(notification())?;
|
||||
// Search profile (M5).
|
||||
registry.register(search())?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@ -330,6 +332,54 @@ fn notification() -> RankingProfile {
|
||||
p
|
||||
}
|
||||
|
||||
// ── M5 Search Profile ───────────────────────────────────────────────────────
|
||||
|
||||
/// Weight for view decay score in the search profile.
|
||||
///
|
||||
/// Lower than personalized profiles to let text relevance dominate.
|
||||
const SEARCH_VIEW_WEIGHT: f64 = 0.5;
|
||||
|
||||
/// Weight for like decay score in the search profile.
|
||||
///
|
||||
/// Captures quality signal: items frequently liked tend to be good results.
|
||||
const SEARCH_LIKE_WEIGHT: f64 = 0.8;
|
||||
|
||||
/// `search`: text and vector relevance plus light signal re-ranking.
|
||||
///
|
||||
/// Default profile for the SEARCH query type. The heavy lifting is done by
|
||||
/// RRF fusion (BM25 + ANN) in Stage 1c of the `SearchExecutor`. This profile
|
||||
/// provides a lightweight signal overlay: a view-decay and like-decay boost to
|
||||
/// surface quality content without overriding text relevance signals.
|
||||
///
|
||||
/// - No exploration injection (`exploration = 0.0`): search results must be
|
||||
/// deterministic for a given query.
|
||||
/// - No diversity enforcement: callers specify diversity explicitly via
|
||||
/// `SearchBuilder::diversity()`.
|
||||
/// - `sort = None`: the fused RRF score from Stage 1c is the primary ordering
|
||||
/// signal; the profile adds a small quality overlay on top.
|
||||
fn search() -> RankingProfile {
|
||||
let mut p = skeleton("search");
|
||||
p.boosts = vec![
|
||||
Boost {
|
||||
signal: "view".into(),
|
||||
agg: SignalAgg::DecayScore,
|
||||
window: Window::AllTime,
|
||||
weight: SEARCH_VIEW_WEIGHT,
|
||||
},
|
||||
Boost {
|
||||
signal: "like".into(),
|
||||
agg: SignalAgg::DecayScore,
|
||||
window: Window::AllTime,
|
||||
weight: SEARCH_LIKE_WEIGHT,
|
||||
},
|
||||
];
|
||||
// No diversity: callers control diversity via SearchBuilder.
|
||||
// No exploration: search results are deterministic.
|
||||
p.exploration = 0.0;
|
||||
p.sort = None;
|
||||
p
|
||||
}
|
||||
|
||||
// ── Tests ───────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
@ -338,10 +388,10 @@ mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn all_fifteen_builtins_registered() {
|
||||
fn all_sixteen_builtins_registered() {
|
||||
let mut registry = ProfileRegistry::new();
|
||||
register_builtins(&mut registry).unwrap();
|
||||
assert_eq!(registry.list().len(), 15);
|
||||
assert_eq!(registry.list().len(), 16);
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
106
tidal/src/ranking/diversity/constraints.rs
Normal file
106
tidal/src/ranking/diversity/constraints.rs
Normal file
@ -0,0 +1,106 @@
|
||||
//! Diversity constraint types and result structures.
|
||||
//!
|
||||
//! Defines the constraint parameters that the diversity selector enforces,
|
||||
//! the violation type for reporting breaches, and the result wrapper returned
|
||||
//! from selection.
|
||||
|
||||
use super::super::executor::ScoredCandidate;
|
||||
|
||||
// -- Constraints ------------------------------------------------------------
|
||||
|
||||
/// Builder-style diversity constraints applied after scoring.
|
||||
///
|
||||
/// Only `max_per_creator` and `format_mix_max_fraction` are enforced in M2.
|
||||
/// The remaining fields (`min_exploration`, `topic_diversity`, `category_min`)
|
||||
/// are reserved for M3/M6 and ignored by the selector.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub struct DiversityConstraints {
|
||||
pub max_per_creator: Option<usize>,
|
||||
pub format_mix_max_fraction: Option<f64>,
|
||||
/// Reserved for M3: minimum fraction of exploration candidates.
|
||||
pub min_exploration: Option<f64>,
|
||||
/// Reserved for M6: topic diversity enforcement.
|
||||
pub topic_diversity: Option<f64>,
|
||||
/// Reserved for M6: minimum number of distinct categories.
|
||||
pub category_min: Option<usize>,
|
||||
}
|
||||
|
||||
impl DiversityConstraints {
|
||||
#[must_use]
|
||||
pub fn new() -> Self {
|
||||
Self::default()
|
||||
}
|
||||
|
||||
#[must_use]
|
||||
pub const fn max_per_creator(mut self, n: usize) -> Self {
|
||||
self.max_per_creator = Some(n);
|
||||
self
|
||||
}
|
||||
|
||||
#[must_use]
|
||||
pub const fn format_mix(mut self, max_fraction: f64) -> Self {
|
||||
self.format_mix_max_fraction = Some(max_fraction);
|
||||
self
|
||||
}
|
||||
}
|
||||
|
||||
// -- Violation --------------------------------------------------------------
|
||||
|
||||
/// A single diversity constraint violation, describing which constraint was
|
||||
/// breached and the specifics.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ConstraintViolation {
|
||||
/// The constraint name, e.g. `"max_per_creator"` or `"format_mix"`.
|
||||
pub constraint: String,
|
||||
/// Human-readable description of the violation.
|
||||
pub detail: String,
|
||||
}
|
||||
|
||||
// -- Result -----------------------------------------------------------------
|
||||
|
||||
/// Result of diversity selection, including the filtered candidate list and
|
||||
/// any constraint violations that could not be resolved.
|
||||
pub struct DiversityResult {
|
||||
pub selected: Vec<ScoredCandidate>,
|
||||
pub constraints_satisfied: bool,
|
||||
pub violations: Vec<ConstraintViolation>,
|
||||
}
|
||||
|
||||
// -- Tests ------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn default_constraints_have_no_active_fields() {
|
||||
let c = DiversityConstraints::new();
|
||||
assert!(c.max_per_creator.is_none());
|
||||
assert!(c.format_mix_max_fraction.is_none());
|
||||
assert!(c.min_exploration.is_none());
|
||||
assert!(c.topic_diversity.is_none());
|
||||
assert!(c.category_min.is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_sets_max_per_creator() {
|
||||
let c = DiversityConstraints::new().max_per_creator(3);
|
||||
assert_eq!(c.max_per_creator, Some(3));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_sets_format_mix() {
|
||||
let c = DiversityConstraints::new().format_mix(0.5);
|
||||
assert_eq!(c.format_mix_max_fraction, Some(0.5));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builder_chains_both_constraints() {
|
||||
let c = DiversityConstraints::new()
|
||||
.max_per_creator(2)
|
||||
.format_mix(0.4);
|
||||
assert_eq!(c.max_per_creator, Some(2));
|
||||
assert_eq!(c.format_mix_max_fraction, Some(0.4));
|
||||
}
|
||||
}
|
||||
12
tidal/src/ranking/diversity/mod.rs
Normal file
12
tidal/src/ranking/diversity/mod.rs
Normal file
@ -0,0 +1,12 @@
|
||||
//! Diversity enforcement for ranked candidate lists.
|
||||
//!
|
||||
//! After the profile executor scores and sorts candidates, the `DiversitySelector`
|
||||
//! applies post-hoc constraints (max-per-creator, format-mix) using a greedy
|
||||
//! multi-stage relaxation strategy. This ensures the result set maintains content
|
||||
//! variety without sacrificing result count.
|
||||
|
||||
pub mod constraints;
|
||||
pub mod selector;
|
||||
|
||||
pub use constraints::{ConstraintViolation, DiversityConstraints, DiversityResult};
|
||||
pub use selector::DiversitySelector;
|
||||
@ -1,4 +1,4 @@
|
||||
//! Diversity enforcement for ranked candidate lists.
|
||||
//! Diversity selection algorithm with greedy multi-stage relaxation.
|
||||
//!
|
||||
//! After the profile executor scores and sorts candidates, the `DiversitySelector`
|
||||
//! applies post-hoc constraints (max-per-creator, format-mix) using a greedy
|
||||
@ -19,69 +19,10 @@
|
||||
|
||||
use std::collections::{HashMap, HashSet};
|
||||
|
||||
use super::executor::ScoredCandidate;
|
||||
use super::constraints::{ConstraintViolation, DiversityConstraints, DiversityResult};
|
||||
use crate::ranking::executor::ScoredCandidate;
|
||||
|
||||
// ── Constraints ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Builder-style diversity constraints applied after scoring.
|
||||
///
|
||||
/// Only `max_per_creator` and `format_mix_max_fraction` are enforced in M2.
|
||||
/// The remaining fields (`min_exploration`, `topic_diversity`, `category_min`)
|
||||
/// are reserved for M3/M6 and ignored by the selector.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub struct DiversityConstraints {
|
||||
pub max_per_creator: Option<usize>,
|
||||
pub format_mix_max_fraction: Option<f64>,
|
||||
/// Reserved for M3: minimum fraction of exploration candidates.
|
||||
pub min_exploration: Option<f64>,
|
||||
/// Reserved for M6: topic diversity enforcement.
|
||||
pub topic_diversity: Option<f64>,
|
||||
/// Reserved for M6: minimum number of distinct categories.
|
||||
pub category_min: Option<usize>,
|
||||
}
|
||||
|
||||
impl DiversityConstraints {
|
||||
#[must_use]
|
||||
pub fn new() -> Self {
|
||||
Self::default()
|
||||
}
|
||||
|
||||
#[must_use]
|
||||
pub const fn max_per_creator(mut self, n: usize) -> Self {
|
||||
self.max_per_creator = Some(n);
|
||||
self
|
||||
}
|
||||
|
||||
#[must_use]
|
||||
pub const fn format_mix(mut self, max_fraction: f64) -> Self {
|
||||
self.format_mix_max_fraction = Some(max_fraction);
|
||||
self
|
||||
}
|
||||
}
|
||||
|
||||
// ── Violation ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// A single diversity constraint violation, describing which constraint was
|
||||
/// breached and the specifics.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ConstraintViolation {
|
||||
/// The constraint name, e.g. `"max_per_creator"` or `"format_mix"`.
|
||||
pub constraint: String,
|
||||
/// Human-readable description of the violation.
|
||||
pub detail: String,
|
||||
}
|
||||
|
||||
// ── Result ─────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Result of diversity selection, including the filtered candidate list and
|
||||
/// any constraint violations that could not be resolved.
|
||||
pub struct DiversityResult {
|
||||
pub selected: Vec<ScoredCandidate>,
|
||||
pub constraints_satisfied: bool,
|
||||
pub violations: Vec<ConstraintViolation>,
|
||||
}
|
||||
|
||||
// ── Selector ───────────────────────────────────────────────────────────────
|
||||
// -- Selector ---------------------------------------------------------------
|
||||
|
||||
/// Stateless diversity selector. Applies constraints via greedy multi-stage
|
||||
/// relaxation.
|
||||
@ -134,7 +75,7 @@ impl DiversitySelector {
|
||||
|
||||
// Collect accepted entity IDs across all relaxation stages, then emit
|
||||
// in a single pass over `candidates` (which is score-sorted descending).
|
||||
// This preserves global score order — INV-RANK-6 holds for all items,
|
||||
// This preserves global score order -- INV-RANK-6 holds for all items,
|
||||
// not just within same-creator groups.
|
||||
let mut accepted: HashSet<u64> = HashSet::new();
|
||||
|
||||
@ -209,7 +150,7 @@ impl DiversitySelector {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Internal helpers ───────────────────────────────────────────────────────
|
||||
// -- Internal helpers -------------------------------------------------------
|
||||
|
||||
/// Greedy selection: iterate candidates in score order, accepting each candidate
|
||||
/// only if it does not violate the given constraints.
|
||||
@ -324,7 +265,7 @@ fn collect_violations(
|
||||
violations
|
||||
}
|
||||
|
||||
// ── Tests ──────────────────────────────────────────────────────────────────
|
||||
// -- Tests ------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(
|
||||
@ -417,7 +358,7 @@ mod tests {
|
||||
#[test]
|
||||
fn select_invariant_count() {
|
||||
// INV-RANK-5: always min(target, candidates.len()) results.
|
||||
// 10 candidates with 10 distinct creators, max_per_creator=1 → can fill 5.
|
||||
// 10 candidates with 10 distinct creators, max_per_creator=1 -> can fill 5.
|
||||
let candidates: Vec<_> = (0..10)
|
||||
.map(|i| make_candidate(i + 1, (10 - i) as f64, Some(i + 1), Some("video")))
|
||||
.collect();
|
||||
@ -477,7 +418,7 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Property test helpers ──────────────────────────────────────────────
|
||||
// -- Property test helpers ----------------------------------------------
|
||||
|
||||
fn build_candidates_by_creator(
|
||||
n_creators: usize,
|
||||
@ -530,7 +471,7 @@ mod tests {
|
||||
.collect()
|
||||
}
|
||||
|
||||
// ── Property tests ─────────────────────────────────────────────────────
|
||||
// -- Property tests -----------------------------------------------------
|
||||
|
||||
mod proptests {
|
||||
use super::*;
|
||||
46
tidal/src/ranking/executor/context.rs
Normal file
46
tidal/src/ranking/executor/context.rs
Normal file
@ -0,0 +1,46 @@
|
||||
//! Data types for the profile executor pipeline.
|
||||
//!
|
||||
//! These types carry no execution logic -- they are the nouns of the ranking
|
||||
//! pipeline. `UserContext` captures pre-computed personalization state for a
|
||||
//! single query, and `ScoredCandidate` is the output of the scoring pipeline.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::schema::EntityId;
|
||||
|
||||
// -- User context -------------------------------------------------------------
|
||||
|
||||
/// User-specific scoring context for personalized ranking.
|
||||
///
|
||||
/// Pre-computed before scoring to avoid per-candidate lookups. Built by
|
||||
/// `RetrieveExecutor` in Stage 3 when `FOR USER` is specified.
|
||||
///
|
||||
/// This struct is short-lived (per-query), so it uses `std::collections::HashMap`
|
||||
/// rather than `DashMap` -- no concurrent access is needed.
|
||||
pub struct UserContext {
|
||||
/// `user_id` for this context.
|
||||
pub user_id: u64,
|
||||
/// Per-item interaction boost: maps `item_id (u32)` -> boost score.
|
||||
/// Pre-computed from `InteractionLedger::top_creators()` + `CreatorItemsBitmap`.
|
||||
/// Items from creators the user has interacted with get a positive boost
|
||||
/// proportional to the interaction weight.
|
||||
pub creator_interaction_boosts: HashMap<u32, f64>,
|
||||
}
|
||||
|
||||
// -- Scored candidate ---------------------------------------------------------
|
||||
|
||||
/// A candidate entity with its computed score and signal snapshot.
|
||||
///
|
||||
/// After the executor pipeline, `score` is normalized to `[0.0, 1.0]`.
|
||||
/// The `signal_snapshot` captures the top signals that contributed to the score,
|
||||
/// enabling explain-ability in API responses.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ScoredCandidate {
|
||||
pub entity_id: EntityId,
|
||||
pub score: f64,
|
||||
pub signal_snapshot: Vec<(String, f64)>,
|
||||
/// Creator ID for diversity enforcement (populated in m2p4).
|
||||
pub creator_id: Option<EntityId>,
|
||||
/// Content format for diversity enforcement (populated in m2p4).
|
||||
pub format: Option<String>,
|
||||
}
|
||||
83
tidal/src/ranking/executor/formulas.rs
Normal file
83
tidal/src/ranking/executor/formulas.rs
Normal file
@ -0,0 +1,83 @@
|
||||
//! Pure scoring formulas.
|
||||
//!
|
||||
//! Every function in this module is a stateless mathematical formula: it takes
|
||||
//! numeric inputs and returns a score. No I/O, no struct methods, no ledger
|
||||
//! access. This makes the formulas independently testable and trivially
|
||||
//! verifiable against their source definitions.
|
||||
|
||||
/// Additive weight applied to the creator-interaction boost.
|
||||
///
|
||||
/// When a user has interacted with a creator, items from that creator receive
|
||||
/// an additive score boost of `interaction_weight * INTERACTION_BOOST_WEIGHT`
|
||||
/// before normalization. 0.3 is a tuning constant that keeps interaction
|
||||
/// boosts meaningful without overwhelming the base signal score.
|
||||
pub(super) const INTERACTION_BOOST_WEIGHT: f64 = 0.3;
|
||||
|
||||
/// Hot: `log10(max(upvotes - downvotes, 1)) / (age_hours + 2)^gravity`
|
||||
pub(super) fn hot_score(views: f64, age_hours: f64, gravity: f64) -> f64 {
|
||||
views.max(1.0).log10() / (age_hours + 2.0).powf(gravity)
|
||||
}
|
||||
|
||||
/// Trending: weighted sum of view and share velocity.
|
||||
pub(super) fn trending_score(view_velocity: f64, share_velocity: f64) -> f64 {
|
||||
2.0f64.mul_add(share_velocity, view_velocity)
|
||||
}
|
||||
|
||||
/// Controversial: `(pos * neg) / (pos + neg)^2`
|
||||
pub(super) fn controversial_score(pos: f64, neg: f64) -> f64 {
|
||||
let denom = (pos + neg).powi(2);
|
||||
if denom < f64::EPSILON {
|
||||
0.0
|
||||
} else {
|
||||
(pos * neg) / denom
|
||||
}
|
||||
}
|
||||
|
||||
/// Hidden gems: `quality / log10(view_count + 10)`
|
||||
pub(super) fn hidden_gems_score(quality: f64, view_count: f64) -> f64 {
|
||||
quality / (view_count + 10.0).log10()
|
||||
}
|
||||
|
||||
/// Shuffle: deterministic hash of entity ID for stable random ordering.
|
||||
pub(super) fn shuffle_score(entity_id: u64) -> f64 {
|
||||
let hash = blake3::hash(&entity_id.to_le_bytes());
|
||||
let bytes = hash.as_bytes();
|
||||
// First 8 bytes as u64, normalized to [0, 1].
|
||||
let arr: [u8; 8] = [
|
||||
bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7],
|
||||
];
|
||||
let v = u64::from_le_bytes(arr);
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
let score = v as f64 / u64::MAX as f64;
|
||||
score
|
||||
}
|
||||
|
||||
// -- Tests --------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used, clippy::float_cmp)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn score_hot_decays_older_candidates() {
|
||||
// hot_score = log10(max(views, 1)) / (age_hours + 2)^gravity.
|
||||
// Older content (more hours) scores lower than newer with equal views.
|
||||
let gravity = 1.8_f64;
|
||||
let score_new = hot_score(50.0, 2.0, gravity);
|
||||
let score_old = hot_score(50.0, 48.0, gravity);
|
||||
assert!(
|
||||
score_new > score_old,
|
||||
"newer content should score higher than older content with the same view count"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn score_trending_uses_velocity() {
|
||||
// trending_score = view_velocity + 2.0 * share_velocity.
|
||||
// Positive velocity inputs yield a positive score.
|
||||
let score = trending_score(2.0, 1.0);
|
||||
assert!(score > 0.0);
|
||||
assert_eq!(trending_score(0.0, 0.0), 0.0);
|
||||
}
|
||||
}
|
||||
252
tidal/src/ranking/executor/helpers.rs
Normal file
252
tidal/src/ranking/executor/helpers.rs
Normal file
@ -0,0 +1,252 @@
|
||||
//! Stateless helpers that bridge the signal ledger to the scoring pipeline.
|
||||
//!
|
||||
//! These functions read aggregations from the ledger and apply gate/normalization
|
||||
//! logic. They depend on `ScoredCandidate` (from `context`) and profile types,
|
||||
//! but contain no executor state.
|
||||
|
||||
use crate::schema::{EntityId, Window};
|
||||
use crate::signals::SignalLedger;
|
||||
|
||||
use super::context::ScoredCandidate;
|
||||
use crate::ranking::profile::{Gate, SignalAgg};
|
||||
|
||||
// -- Signal reading -----------------------------------------------------------
|
||||
|
||||
/// Read a signal aggregation for a candidate. Returns 0.0 on any error or
|
||||
/// missing data -- scoring must never fail, only degrade.
|
||||
pub(super) fn read_agg(
|
||||
entity_id: EntityId,
|
||||
signal: &str,
|
||||
agg: &SignalAgg,
|
||||
window: Window,
|
||||
ledger: &SignalLedger,
|
||||
) -> f64 {
|
||||
match agg {
|
||||
SignalAgg::Value => {
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
let count = ledger
|
||||
.read_windowed_count(entity_id, signal, window)
|
||||
.unwrap_or(0) as f64;
|
||||
count
|
||||
}
|
||||
SignalAgg::Velocity => ledger
|
||||
.read_velocity(entity_id, signal, window)
|
||||
.unwrap_or(0.0),
|
||||
SignalAgg::DecayScore => ledger
|
||||
.read_decay_score(entity_id, signal, 0)
|
||||
.unwrap_or(None)
|
||||
.unwrap_or(0.0),
|
||||
SignalAgg::Ratio | SignalAgg::RelativeVelocity => {
|
||||
// Not yet implemented -- planned for M3 when cross-signal reads are available.
|
||||
// Returns 0.0; gates using these aggregations will fail (filter out candidates).
|
||||
tracing::warn!(
|
||||
signal = %signal,
|
||||
"SignalAgg::Ratio / RelativeVelocity not yet implemented; returning 0.0"
|
||||
);
|
||||
0.0
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// -- Gate checking ------------------------------------------------------------
|
||||
|
||||
/// Check whether a candidate passes all gate thresholds.
|
||||
pub(super) fn passes_gates(entity_id: EntityId, gates: &[Gate], ledger: &SignalLedger) -> bool {
|
||||
for gate in gates {
|
||||
let value = read_agg(entity_id, &gate.signal, &gate.agg, gate.window, ledger);
|
||||
if value < gate.min_threshold {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
true
|
||||
}
|
||||
|
||||
// -- Normalization ------------------------------------------------------------
|
||||
|
||||
/// Min-max normalize candidate scores to `[0.0, 1.0]`.
|
||||
///
|
||||
/// If all candidates have the same score, they are all set to 1.0.
|
||||
pub(super) fn normalize(candidates: &mut [ScoredCandidate]) {
|
||||
if candidates.is_empty() {
|
||||
return;
|
||||
}
|
||||
// Clamp non-finite scores (NaN, +/-Inf) to 0.0 before normalization.
|
||||
for c in candidates.iter_mut() {
|
||||
if !c.score.is_finite() {
|
||||
c.score = 0.0;
|
||||
}
|
||||
}
|
||||
let min = candidates
|
||||
.iter()
|
||||
.map(|c| c.score)
|
||||
.fold(f64::INFINITY, f64::min);
|
||||
let max = candidates
|
||||
.iter()
|
||||
.map(|c| c.score)
|
||||
.fold(f64::NEG_INFINITY, f64::max);
|
||||
let range = max - min;
|
||||
for c in candidates.iter_mut() {
|
||||
c.score = if range < f64::EPSILON {
|
||||
1.0
|
||||
} else {
|
||||
(c.score - min) / range
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
// -- Tests --------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used, clippy::float_cmp, clippy::cast_precision_loss)]
|
||||
mod tests {
|
||||
use std::time::Duration;
|
||||
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp};
|
||||
use crate::signals::NoopWalWriter;
|
||||
|
||||
#[test]
|
||||
fn read_agg_unimplemented_returns_zero() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
for sig in &["view", "share", "like"] {
|
||||
let _ = builder
|
||||
.signal(
|
||||
sig,
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
|
||||
.velocity(true)
|
||||
.add();
|
||||
}
|
||||
let schema = builder.build().unwrap();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
|
||||
let base_ns = 1_708_000_000_000_000_000u64;
|
||||
for i in 0u64..5 {
|
||||
let entity_id = EntityId::new(i + 1);
|
||||
let ts = Timestamp::from_nanos(base_ns - i * 3_600_000_000_000);
|
||||
ledger
|
||||
.record_signal("view", entity_id, (5 - i) as f64, ts)
|
||||
.unwrap();
|
||||
ledger
|
||||
.record_signal("share", entity_id, (i % 3) as f64, ts)
|
||||
.unwrap();
|
||||
ledger
|
||||
.record_signal("like", entity_id, (i % 2) as f64, ts)
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
let entity_id = EntityId::new(1);
|
||||
// SignalAgg::Ratio is not implemented in M2; always returns 0.0.
|
||||
let result = read_agg(
|
||||
entity_id,
|
||||
"view",
|
||||
&SignalAgg::Ratio,
|
||||
Window::AllTime,
|
||||
&ledger,
|
||||
);
|
||||
assert_eq!(result, 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn passes_gates_below_threshold_excluded() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour])
|
||||
.velocity(false)
|
||||
.add();
|
||||
let schema = builder.build().unwrap();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let entity_id = EntityId::new(1);
|
||||
// Record 3 signals; windowed count = 3, below threshold of 5.
|
||||
for _ in 0..3 {
|
||||
ledger
|
||||
.record_signal("view", entity_id, 1.0, Timestamp::now())
|
||||
.unwrap();
|
||||
}
|
||||
let gates = vec![Gate {
|
||||
signal: "view".into(),
|
||||
agg: SignalAgg::Value,
|
||||
window: Window::OneHour,
|
||||
min_threshold: 5.0,
|
||||
}];
|
||||
// count=3 < threshold=5 -> candidate excluded.
|
||||
assert!(!passes_gates(entity_id, &gates, &ledger));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn passes_gates_at_threshold_included() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour])
|
||||
.velocity(false)
|
||||
.add();
|
||||
let schema = builder.build().unwrap();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let entity_id = EntityId::new(1);
|
||||
// Record exactly 5 signals; windowed count = 5, meets threshold exactly.
|
||||
for _ in 0..5 {
|
||||
ledger
|
||||
.record_signal("view", entity_id, 1.0, Timestamp::now())
|
||||
.unwrap();
|
||||
}
|
||||
let gates = vec![Gate {
|
||||
signal: "view".into(),
|
||||
agg: SignalAgg::Value,
|
||||
window: Window::OneHour,
|
||||
min_threshold: 5.0,
|
||||
}];
|
||||
// count=5 >= threshold=5 -> candidate included.
|
||||
assert!(passes_gates(entity_id, &gates, &ledger));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_single_element() {
|
||||
let mut candidates = vec![ScoredCandidate {
|
||||
entity_id: EntityId::new(1),
|
||||
score: 42.0,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
}];
|
||||
normalize(&mut candidates);
|
||||
assert_eq!(candidates[0].score, 1.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_all_nan_clamps_to_one() {
|
||||
let mut candidates: Vec<ScoredCandidate> = (1u64..=3)
|
||||
.map(|i| ScoredCandidate {
|
||||
entity_id: EntityId::new(i),
|
||||
score: f64::NAN,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
})
|
||||
.collect();
|
||||
normalize(&mut candidates);
|
||||
for c in &candidates {
|
||||
assert_eq!(
|
||||
c.score, 1.0,
|
||||
"NaN score should be clamped to 1.0 after normalize"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -7,60 +7,27 @@
|
||||
//! 4. **Normalize**: min-max normalize scores to `[0.0, 1.0]`
|
||||
//! 5. **Diversity**: apply per-creator and format-mix constraints via `DiversitySelector`
|
||||
|
||||
pub mod context;
|
||||
pub mod formulas;
|
||||
pub mod helpers;
|
||||
|
||||
// Re-export all public items so that `crate::ranking::executor::Foo` paths continue to work.
|
||||
pub use context::{ScoredCandidate, UserContext};
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::schema::{EntityId, Timestamp, Window};
|
||||
use crate::session::SessionContext;
|
||||
use crate::signals::SignalLedger;
|
||||
|
||||
use super::profile::{Gate, RankingProfile, SignalAgg, Sort};
|
||||
use super::profile::{RankingProfile, SignalAgg, Sort};
|
||||
use formulas::{
|
||||
INTERACTION_BOOST_WEIGHT, controversial_score, hidden_gems_score, hot_score, shuffle_score,
|
||||
trending_score,
|
||||
};
|
||||
use helpers::{normalize, passes_gates, read_agg};
|
||||
|
||||
/// Additive weight applied to the creator-interaction boost.
|
||||
///
|
||||
/// When a user has interacted with a creator, items from that creator receive
|
||||
/// an additive score boost of `interaction_weight * INTERACTION_BOOST_WEIGHT`
|
||||
/// before normalization. 0.3 is a tuning constant that keeps interaction
|
||||
/// boosts meaningful without overwhelming the base signal score.
|
||||
const INTERACTION_BOOST_WEIGHT: f64 = 0.3;
|
||||
|
||||
// ── User context ─────────────────────────────────────────────────────────
|
||||
|
||||
/// User-specific scoring context for personalized ranking.
|
||||
///
|
||||
/// Pre-computed before scoring to avoid per-candidate lookups. Built by
|
||||
/// `RetrieveExecutor` in Stage 3 when `FOR USER` is specified.
|
||||
///
|
||||
/// This struct is short-lived (per-query), so it uses `std::collections::HashMap`
|
||||
/// rather than `DashMap` -- no concurrent access is needed.
|
||||
pub struct UserContext {
|
||||
/// `user_id` for this context.
|
||||
pub user_id: u64,
|
||||
/// Per-item interaction boost: maps `item_id (u32)` -> boost score.
|
||||
/// Pre-computed from `InteractionLedger::top_creators()` + `CreatorItemsBitmap`.
|
||||
/// Items from creators the user has interacted with get a positive boost
|
||||
/// proportional to the interaction weight.
|
||||
pub creator_interaction_boosts: HashMap<u32, f64>,
|
||||
}
|
||||
|
||||
// ── Scored candidate ────────────────────────────────────────────────────────
|
||||
|
||||
/// A candidate entity with its computed score and signal snapshot.
|
||||
///
|
||||
/// After the executor pipeline, `score` is normalized to `[0.0, 1.0]`.
|
||||
/// The `signal_snapshot` captures the top signals that contributed to the score,
|
||||
/// enabling explain-ability in API responses.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ScoredCandidate {
|
||||
pub entity_id: EntityId,
|
||||
pub score: f64,
|
||||
pub signal_snapshot: Vec<(String, f64)>,
|
||||
/// Creator ID for diversity enforcement (populated in m2p4).
|
||||
pub creator_id: Option<EntityId>,
|
||||
/// Content format for diversity enforcement (populated in m2p4).
|
||||
pub format: Option<String>,
|
||||
}
|
||||
|
||||
// ── Executor ────────────────────────────────────────────────────────────────
|
||||
// -- Executor -----------------------------------------------------------------
|
||||
|
||||
/// Scores and ranks candidates according to a `RankingProfile`.
|
||||
///
|
||||
@ -101,7 +68,7 @@ impl<'a> ProfileExecutor<'a> {
|
||||
|
||||
/// Score and rank a set of candidate entities according to the given profile.
|
||||
///
|
||||
/// Executes the five-stage pipeline: gate → score → sort → normalize → diversity.
|
||||
/// Executes the five-stage pipeline: gate -> score -> sort -> normalize -> diversity.
|
||||
///
|
||||
/// - `candidates`: entity IDs to consider; gate-failing candidates are excluded.
|
||||
/// - `profile`: the ranking profile controlling sort mode, boosts, gates, and diversity.
|
||||
@ -115,14 +82,14 @@ impl<'a> ProfileExecutor<'a> {
|
||||
profile: &RankingProfile,
|
||||
now: Timestamp,
|
||||
) -> Vec<ScoredCandidate> {
|
||||
self.score_inner(candidates, profile, now, None)
|
||||
self.score_inner(candidates, profile, now, None, &HashMap::new())
|
||||
}
|
||||
|
||||
/// Score and rank candidates with an optional FOR SESSION boost applied.
|
||||
///
|
||||
/// When `session_ctx` is `Some`, applies an additive boost per candidate:
|
||||
/// - Entity boost: 0.3 if the entity received any signal in this session.
|
||||
/// - Velocity boost: `vel / (vel + 1) * 0.2` (Michaelis–Menten saturation).
|
||||
/// When `session_ctx` is `Some`, applies an additive boost per candidate based
|
||||
/// on keyword hint matching: keywords from session annotations are matched against
|
||||
/// each item's metadata values.
|
||||
///
|
||||
/// The session boost is applied after base scoring and before normalization.
|
||||
#[must_use]
|
||||
@ -132,8 +99,9 @@ impl<'a> ProfileExecutor<'a> {
|
||||
profile: &RankingProfile,
|
||||
now: Timestamp,
|
||||
session_ctx: Option<&SessionContext>,
|
||||
item_metadata: &HashMap<u64, HashMap<String, String>>,
|
||||
) -> Vec<ScoredCandidate> {
|
||||
self.score_inner(candidates, profile, now, session_ctx)
|
||||
self.score_inner(candidates, profile, now, session_ctx, item_metadata)
|
||||
}
|
||||
|
||||
/// Score and rank candidates with user-specific personalization.
|
||||
@ -153,14 +121,18 @@ impl<'a> ProfileExecutor<'a> {
|
||||
now: Timestamp,
|
||||
session_ctx: Option<&SessionContext>,
|
||||
user_ctx: &UserContext,
|
||||
item_metadata: &HashMap<u64, HashMap<String, String>>,
|
||||
) -> Vec<ScoredCandidate> {
|
||||
let empty: HashMap<String, String> = HashMap::new();
|
||||
let mut scored: Vec<ScoredCandidate> = candidates
|
||||
.iter()
|
||||
.filter(|&&entity_id| passes_gates(entity_id, &profile.gates, self.ledger))
|
||||
.map(|&entity_id| {
|
||||
let raw = self.compute_raw_score(entity_id, profile, now);
|
||||
let session_boost =
|
||||
session_ctx.map_or(0.0, |ctx| Self::session_boost(entity_id.as_u64(), ctx));
|
||||
let metadata = item_metadata.get(&entity_id.as_u64()).map_or(&empty, |m| m);
|
||||
let session_boost = session_ctx.map_or(0.0, |ctx| {
|
||||
Self::session_boost(entity_id.as_u64(), ctx, metadata)
|
||||
});
|
||||
|
||||
// Creator-interaction boost: look up the per-item boost from
|
||||
// the pre-computed map. Items from highly-interacted creators
|
||||
@ -208,24 +180,29 @@ impl<'a> ProfileExecutor<'a> {
|
||||
scored
|
||||
}
|
||||
|
||||
/// Shared five-stage scoring pipeline: gate → score → sort → normalize → diversity.
|
||||
/// Shared five-stage scoring pipeline: gate -> score -> sort -> normalize -> diversity.
|
||||
///
|
||||
/// When `session_ctx` is `Some`, an additive session boost is applied to each
|
||||
/// candidate's raw score before normalization.
|
||||
/// candidate's raw score before normalization. `item_metadata` provides the
|
||||
/// metadata for keyword hint matching; pass an empty map when not needed.
|
||||
fn score_inner(
|
||||
&self,
|
||||
candidates: &[EntityId],
|
||||
profile: &RankingProfile,
|
||||
now: Timestamp,
|
||||
session_ctx: Option<&SessionContext>,
|
||||
item_metadata: &HashMap<u64, HashMap<String, String>>,
|
||||
) -> Vec<ScoredCandidate> {
|
||||
let empty: HashMap<String, String> = HashMap::new();
|
||||
let mut scored: Vec<ScoredCandidate> = candidates
|
||||
.iter()
|
||||
.filter(|&&entity_id| passes_gates(entity_id, &profile.gates, self.ledger))
|
||||
.map(|&entity_id| {
|
||||
let raw = self.compute_raw_score(entity_id, profile, now);
|
||||
let boost =
|
||||
session_ctx.map_or(0.0, |ctx| Self::session_boost(entity_id.as_u64(), ctx));
|
||||
let metadata = item_metadata.get(&entity_id.as_u64()).map_or(&empty, |m| m);
|
||||
let boost = session_ctx.map_or(0.0, |ctx| {
|
||||
Self::session_boost(entity_id.as_u64(), ctx, metadata)
|
||||
});
|
||||
ScoredCandidate {
|
||||
entity_id,
|
||||
score: raw + boost,
|
||||
@ -265,16 +242,38 @@ impl<'a> ProfileExecutor<'a> {
|
||||
|
||||
/// Compute the session-specific score boost for a single entity.
|
||||
///
|
||||
/// - Entity boost: 0.3 if entity received any session signal (flat boost).
|
||||
/// - Velocity boost: `vel / (vel + 1) * 0.2` (bounded saturation function).
|
||||
fn session_boost(entity_id: u64, ctx: &SessionContext) -> f64 {
|
||||
let entity_boost = if ctx.signaled_entities.contains(&entity_id) {
|
||||
0.3
|
||||
} else {
|
||||
/// - Keyword hint score: fraction of annotation keywords that match any metadata value.
|
||||
/// `hint_score = hits / total_keywords * 0.3` (capped via the [0,1] fraction).
|
||||
/// - Velocity boost: `vel / (vel + 1) * 0.2` (Michaelis-Menten saturation).
|
||||
fn session_boost(
|
||||
_entity_id: u64,
|
||||
ctx: &SessionContext,
|
||||
metadata: &HashMap<String, String>,
|
||||
) -> f64 {
|
||||
let total_kw: usize = ctx
|
||||
.keywords
|
||||
.iter()
|
||||
.flat_map(|h| h.split_whitespace())
|
||||
.count();
|
||||
let hits = ctx
|
||||
.keywords
|
||||
.iter()
|
||||
.flat_map(|h| h.split_whitespace())
|
||||
.filter(|kw| {
|
||||
let kw_lower = kw.to_lowercase();
|
||||
metadata
|
||||
.values()
|
||||
.any(|v| v.to_lowercase().contains(&kw_lower))
|
||||
})
|
||||
.count();
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
let hint_score = if total_kw == 0 {
|
||||
0.0
|
||||
} else {
|
||||
hits as f64 / total_kw as f64
|
||||
};
|
||||
let vel_norm = ctx.reward_velocity / (ctx.reward_velocity + 1.0);
|
||||
entity_boost + vel_norm * 0.2
|
||||
hint_score * 0.3 + vel_norm * 0.2
|
||||
}
|
||||
|
||||
/// Compute raw score for a single candidate based on the profile's sort mode.
|
||||
@ -309,7 +308,7 @@ impl<'a> ProfileExecutor<'a> {
|
||||
Some(Sort::Shuffle) => shuffle_score(entity_id.as_u64()),
|
||||
Some(Sort::New) => {
|
||||
// M2 limitation: entity metadata (`created_at`) is not accessible from the
|
||||
// executor. Entity ID is used as a proxy for recency — ranks higher IDs
|
||||
// executor. Entity ID is used as a proxy for recency -- ranks higher IDs
|
||||
// first, which is correct only when IDs are assigned monotonically.
|
||||
// Caller-specified u64 IDs are not guaranteed monotonic; this is a
|
||||
// best-effort approximation. Exact creation-time sorting requires
|
||||
@ -327,6 +326,30 @@ impl<'a> ProfileExecutor<'a> {
|
||||
Some(Sort::MostLiked { window }) => {
|
||||
read_agg(entity_id, "like", &SignalAgg::Value, *window, self.ledger)
|
||||
}
|
||||
Some(Sort::MostFollowed) => read_agg(
|
||||
entity_id,
|
||||
"follow",
|
||||
&SignalAgg::Value,
|
||||
Window::AllTime,
|
||||
self.ledger,
|
||||
),
|
||||
Some(Sort::CreatorEngagementRate) => {
|
||||
let view_vel = read_agg(
|
||||
entity_id,
|
||||
"view",
|
||||
&SignalAgg::Velocity,
|
||||
Window::TwentyFourHours,
|
||||
self.ledger,
|
||||
);
|
||||
let like_vel = read_agg(
|
||||
entity_id,
|
||||
"like",
|
||||
&SignalAgg::Velocity,
|
||||
Window::TwentyFourHours,
|
||||
self.ledger,
|
||||
);
|
||||
view_vel + like_vel
|
||||
}
|
||||
Some(Sort::Rising) => self.score_rising(entity_id),
|
||||
None => 0.0,
|
||||
}
|
||||
@ -443,132 +466,7 @@ impl<'a> ProfileExecutor<'a> {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Scoring formulas ────────────────────────────────────────────────────────
|
||||
|
||||
/// Hot: `log10(max(upvotes - downvotes, 1)) / (age_hours + 2)^gravity`
|
||||
fn hot_score(views: f64, age_hours: f64, gravity: f64) -> f64 {
|
||||
views.max(1.0).log10() / (age_hours + 2.0).powf(gravity)
|
||||
}
|
||||
|
||||
/// Trending: weighted sum of view and share velocity.
|
||||
fn trending_score(view_velocity: f64, share_velocity: f64) -> f64 {
|
||||
2.0f64.mul_add(share_velocity, view_velocity)
|
||||
}
|
||||
|
||||
/// Controversial: `(pos * neg) / (pos + neg)^2`
|
||||
fn controversial_score(pos: f64, neg: f64) -> f64 {
|
||||
let denom = (pos + neg).powi(2);
|
||||
if denom < f64::EPSILON {
|
||||
0.0
|
||||
} else {
|
||||
(pos * neg) / denom
|
||||
}
|
||||
}
|
||||
|
||||
/// Hidden gems: `quality / log10(view_count + 10)`
|
||||
fn hidden_gems_score(quality: f64, view_count: f64) -> f64 {
|
||||
quality / (view_count + 10.0).log10()
|
||||
}
|
||||
|
||||
/// Shuffle: deterministic hash of entity ID for stable random ordering.
|
||||
fn shuffle_score(entity_id: u64) -> f64 {
|
||||
let hash = blake3::hash(&entity_id.to_le_bytes());
|
||||
let bytes = hash.as_bytes();
|
||||
// First 8 bytes as u64, normalized to [0, 1].
|
||||
let arr: [u8; 8] = [
|
||||
bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7],
|
||||
];
|
||||
let v = u64::from_le_bytes(arr);
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
let score = v as f64 / u64::MAX as f64;
|
||||
score
|
||||
}
|
||||
|
||||
// ── Signal reading ──────────────────────────────────────────────────────────
|
||||
|
||||
/// Read a signal aggregation for a candidate. Returns 0.0 on any error or
|
||||
/// missing data -- scoring must never fail, only degrade.
|
||||
fn read_agg(
|
||||
entity_id: EntityId,
|
||||
signal: &str,
|
||||
agg: &SignalAgg,
|
||||
window: Window,
|
||||
ledger: &SignalLedger,
|
||||
) -> f64 {
|
||||
match agg {
|
||||
SignalAgg::Value => {
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
let count = ledger
|
||||
.read_windowed_count(entity_id, signal, window)
|
||||
.unwrap_or(0) as f64;
|
||||
count
|
||||
}
|
||||
SignalAgg::Velocity => ledger
|
||||
.read_velocity(entity_id, signal, window)
|
||||
.unwrap_or(0.0),
|
||||
SignalAgg::DecayScore => ledger
|
||||
.read_decay_score(entity_id, signal, 0)
|
||||
.unwrap_or(None)
|
||||
.unwrap_or(0.0),
|
||||
SignalAgg::Ratio | SignalAgg::RelativeVelocity => {
|
||||
// Not yet implemented — planned for M3 when cross-signal reads are available.
|
||||
// Returns 0.0; gates using these aggregations will fail (filter out candidates).
|
||||
tracing::warn!(
|
||||
signal = %signal,
|
||||
"SignalAgg::Ratio / RelativeVelocity not yet implemented; returning 0.0"
|
||||
);
|
||||
0.0
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ── Gate checking ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Check whether a candidate passes all gate thresholds.
|
||||
fn passes_gates(entity_id: EntityId, gates: &[Gate], ledger: &SignalLedger) -> bool {
|
||||
for gate in gates {
|
||||
let value = read_agg(entity_id, &gate.signal, &gate.agg, gate.window, ledger);
|
||||
if value < gate.min_threshold {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
true
|
||||
}
|
||||
|
||||
// ── Normalization ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Min-max normalize candidate scores to `[0.0, 1.0]`.
|
||||
///
|
||||
/// If all candidates have the same score, they are all set to 1.0.
|
||||
fn normalize(candidates: &mut [ScoredCandidate]) {
|
||||
if candidates.is_empty() {
|
||||
return;
|
||||
}
|
||||
// Clamp non-finite scores (NaN, +/-Inf) to 0.0 before normalization.
|
||||
for c in candidates.iter_mut() {
|
||||
if !c.score.is_finite() {
|
||||
c.score = 0.0;
|
||||
}
|
||||
}
|
||||
let min = candidates
|
||||
.iter()
|
||||
.map(|c| c.score)
|
||||
.fold(f64::INFINITY, f64::min);
|
||||
let max = candidates
|
||||
.iter()
|
||||
.map(|c| c.score)
|
||||
.fold(f64::NEG_INFINITY, f64::max);
|
||||
let range = max - min;
|
||||
for c in candidates.iter_mut() {
|
||||
c.score = if range < f64::EPSILON {
|
||||
1.0
|
||||
} else {
|
||||
(c.score - min) / range
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ───────────────────────────────────────────────────────────────────
|
||||
// -- Integration tests --------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used, clippy::float_cmp, clippy::cast_precision_loss)]
|
||||
@ -577,7 +475,7 @@ mod tests {
|
||||
|
||||
use super::*;
|
||||
use crate::ranking::builtins::register_builtins;
|
||||
use crate::ranking::profile::{CandidateStrategy, DiversitySpec};
|
||||
use crate::ranking::profile::{CandidateStrategy, DiversitySpec, Gate};
|
||||
use crate::ranking::registry::ProfileRegistry;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder};
|
||||
use crate::signals::{NoopWalWriter, SignalLedger};
|
||||
@ -834,144 +732,6 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Unit tests for private scoring helpers ───────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn score_hot_decays_older_candidates() {
|
||||
// hot_score = log10(max(views, 1)) / (age_hours + 2)^gravity.
|
||||
// Older content (more hours) scores lower than newer with equal views.
|
||||
let gravity = 1.8_f64;
|
||||
let score_new = hot_score(50.0, 2.0, gravity);
|
||||
let score_old = hot_score(50.0, 48.0, gravity);
|
||||
assert!(
|
||||
score_new > score_old,
|
||||
"newer content should score higher than older content with the same view count"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn score_trending_uses_velocity() {
|
||||
// trending_score = view_velocity + 2.0 * share_velocity.
|
||||
// Positive velocity inputs yield a positive score.
|
||||
let score = trending_score(2.0, 1.0);
|
||||
assert!(score > 0.0);
|
||||
assert_eq!(trending_score(0.0, 0.0), 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn read_agg_unimplemented_returns_zero() {
|
||||
let ledger = test_ledger();
|
||||
let entity_id = EntityId::new(1);
|
||||
// SignalAgg::Ratio is not implemented in M2; always returns 0.0.
|
||||
let result = read_agg(
|
||||
entity_id,
|
||||
"view",
|
||||
&SignalAgg::Ratio,
|
||||
Window::AllTime,
|
||||
&ledger,
|
||||
);
|
||||
assert_eq!(result, 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn passes_gates_below_threshold_excluded() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour])
|
||||
.velocity(false)
|
||||
.add();
|
||||
let schema = builder.build().unwrap();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let entity_id = EntityId::new(1);
|
||||
// Record 3 signals; windowed count = 3, below threshold of 5.
|
||||
for _ in 0..3 {
|
||||
ledger
|
||||
.record_signal("view", entity_id, 1.0, Timestamp::now())
|
||||
.unwrap();
|
||||
}
|
||||
let gates = vec![Gate {
|
||||
signal: "view".into(),
|
||||
agg: SignalAgg::Value,
|
||||
window: Window::OneHour,
|
||||
min_threshold: 5.0,
|
||||
}];
|
||||
// count=3 < threshold=5 → candidate excluded.
|
||||
assert!(!passes_gates(entity_id, &gates, &ledger));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn passes_gates_at_threshold_included() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour])
|
||||
.velocity(false)
|
||||
.add();
|
||||
let schema = builder.build().unwrap();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let entity_id = EntityId::new(1);
|
||||
// Record exactly 5 signals; windowed count = 5, meets threshold exactly.
|
||||
for _ in 0..5 {
|
||||
ledger
|
||||
.record_signal("view", entity_id, 1.0, Timestamp::now())
|
||||
.unwrap();
|
||||
}
|
||||
let gates = vec![Gate {
|
||||
signal: "view".into(),
|
||||
agg: SignalAgg::Value,
|
||||
window: Window::OneHour,
|
||||
min_threshold: 5.0,
|
||||
}];
|
||||
// count=5 >= threshold=5 → candidate included.
|
||||
assert!(passes_gates(entity_id, &gates, &ledger));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_single_element() {
|
||||
let mut candidates = vec![ScoredCandidate {
|
||||
entity_id: EntityId::new(1),
|
||||
score: 42.0,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
}];
|
||||
normalize(&mut candidates);
|
||||
assert_eq!(candidates[0].score, 1.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_all_nan_clamps_to_one() {
|
||||
let mut candidates: Vec<ScoredCandidate> = (1u64..=3)
|
||||
.map(|i| ScoredCandidate {
|
||||
entity_id: EntityId::new(i),
|
||||
score: f64::NAN,
|
||||
signal_snapshot: vec![],
|
||||
creator_id: None,
|
||||
format: None,
|
||||
})
|
||||
.collect();
|
||||
normalize(&mut candidates);
|
||||
for c in &candidates {
|
||||
assert_eq!(
|
||||
c.score, 1.0,
|
||||
"NaN score should be clamped to 1.0 after normalize"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn score_new_ranks_higher_ids_first() {
|
||||
let ledger = test_ledger();
|
||||
@ -989,7 +749,7 @@ mod tests {
|
||||
assert_eq!(result[2].entity_id, EntityId::new(1));
|
||||
}
|
||||
|
||||
// ── Personalized scoring tests ──────────────────────────────────────────
|
||||
// -- Personalized scoring tests -------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn score_personalized_boosts_interacted_creators() {
|
||||
@ -1027,7 +787,14 @@ mod tests {
|
||||
creator_interaction_boosts: boosts,
|
||||
};
|
||||
|
||||
let result = executor.score_personalized(&candidates, &profile, now, None, &user_ctx);
|
||||
let result = executor.score_personalized(
|
||||
&candidates,
|
||||
&profile,
|
||||
now,
|
||||
None,
|
||||
&user_ctx,
|
||||
&HashMap::new(),
|
||||
);
|
||||
assert_eq!(result.len(), 5);
|
||||
|
||||
// Entity 3 should be ranked first because it has an interaction boost
|
||||
@ -1076,7 +843,14 @@ mod tests {
|
||||
};
|
||||
|
||||
let base = executor.score(&candidates, &profile, now);
|
||||
let personalized = executor.score_personalized(&candidates, &profile, now, None, &user_ctx);
|
||||
let personalized = executor.score_personalized(
|
||||
&candidates,
|
||||
&profile,
|
||||
now,
|
||||
None,
|
||||
&user_ctx,
|
||||
&HashMap::new(),
|
||||
);
|
||||
|
||||
assert_eq!(base.len(), personalized.len());
|
||||
for (b, p) in base.iter().zip(personalized.iter()) {
|
||||
@ -46,16 +46,31 @@ pub struct RankingProfile {
|
||||
/// Primary sort mode. Determines the scoring formula applied to candidates.
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub enum Sort {
|
||||
Hot { gravity: f64 },
|
||||
Hot {
|
||||
gravity: f64,
|
||||
},
|
||||
Trending,
|
||||
Rising,
|
||||
Controversial,
|
||||
HiddenGems,
|
||||
Shuffle,
|
||||
New,
|
||||
TopWindow { window: Window },
|
||||
MostViewed { window: Window },
|
||||
MostLiked { window: Window },
|
||||
TopWindow {
|
||||
window: Window,
|
||||
},
|
||||
MostViewed {
|
||||
window: Window,
|
||||
},
|
||||
MostLiked {
|
||||
window: Window,
|
||||
},
|
||||
/// Sort creators by total follow count (`AllTime` follow signal value).
|
||||
/// Uses the "follow" signal's `AllTime` windowed count as a proxy for
|
||||
/// follower count. Degrades gracefully when "follow" signal is absent.
|
||||
MostFollowed,
|
||||
/// Sort creators by engagement rate proxy (view + like velocity over 24h).
|
||||
/// Higher combined velocity = higher engagement rate.
|
||||
CreatorEngagementRate,
|
||||
}
|
||||
|
||||
// ── Candidate strategy ──────────────────────────────────────────────────────
|
||||
|
||||
@ -8,68 +8,34 @@
|
||||
//! - Gate thresholds: `[0.0, 1.0]` for `DecayScore`/`Ratio`, `>= 0.0` for others
|
||||
|
||||
use std::collections::{BTreeMap, HashMap};
|
||||
use std::fmt;
|
||||
|
||||
use super::profile::{RankingProfile, SignalAgg};
|
||||
|
||||
// ── Error ───────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Errors from profile registration and lookup.
|
||||
#[derive(Debug)]
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum ProfileError {
|
||||
#[error("invalid profile name '{0}': must match [a-z0-9_]{{1,64}}")]
|
||||
InvalidName(String),
|
||||
#[error(
|
||||
"version conflict for profile '{name}': new version {new} must be > existing {existing}"
|
||||
)]
|
||||
VersionConflict {
|
||||
name: String,
|
||||
existing: u32,
|
||||
new: u32,
|
||||
},
|
||||
#[error("exploration {0} out of range [0.0, 0.5]")]
|
||||
ExplorationOutOfRange(f64),
|
||||
#[error("gate threshold {0} out of range (DecayScore/Ratio: [0.0, 1.0]; others: >= 0.0)")]
|
||||
GateThresholdOutOfRange(f64),
|
||||
#[error("profile '{0}' not found")]
|
||||
NotFound(String),
|
||||
VersionNotFound {
|
||||
name: String,
|
||||
version: u32,
|
||||
},
|
||||
#[error("profile '{name}' version {version} not found")]
|
||||
VersionNotFound { name: String, version: u32 },
|
||||
}
|
||||
|
||||
impl fmt::Display for ProfileError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
match self {
|
||||
Self::InvalidName(name) => {
|
||||
write!(
|
||||
f,
|
||||
"invalid profile name '{name}': must match [a-z0-9_]{{1,64}}"
|
||||
)
|
||||
}
|
||||
Self::VersionConflict {
|
||||
name,
|
||||
existing,
|
||||
new,
|
||||
} => {
|
||||
write!(
|
||||
f,
|
||||
"version conflict for profile '{name}': new version {new} must be > existing {existing}"
|
||||
)
|
||||
}
|
||||
Self::ExplorationOutOfRange(val) => {
|
||||
write!(f, "exploration {val} out of range [0.0, 0.5]")
|
||||
}
|
||||
Self::GateThresholdOutOfRange(val) => {
|
||||
write!(
|
||||
f,
|
||||
"gate threshold {val} out of range (DecayScore/Ratio: [0.0, 1.0]; others: >= 0.0)"
|
||||
)
|
||||
}
|
||||
Self::NotFound(name) => write!(f, "profile '{name}' not found"),
|
||||
Self::VersionNotFound { name, version } => {
|
||||
write!(f, "profile '{name}' version {version} not found")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for ProfileError {}
|
||||
|
||||
// ── Validation ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// Validate profile name without the regex crate.
|
||||
|
||||
@ -1,5 +1,3 @@
|
||||
use std::fmt;
|
||||
|
||||
use super::{EntityId, EntityKind};
|
||||
use crate::db::ConfigError;
|
||||
use crate::query::retrieve::QueryError;
|
||||
@ -14,237 +12,101 @@ use crate::query::retrieve::QueryError;
|
||||
/// let err = TidalError::Internal("something unexpected".to_string());
|
||||
/// assert!(err.to_string().contains("internal error"));
|
||||
/// ```
|
||||
#[derive(Debug)]
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum TidalError {
|
||||
/// Storage engine failure. Retry may succeed.
|
||||
Storage(StorageError),
|
||||
#[error("storage error: {0}")]
|
||||
Storage(#[from] StorageError),
|
||||
/// Entity not found. Caller should handle.
|
||||
#[error("{kind} {id} not found")]
|
||||
NotFound { kind: EntityKind, id: EntityId },
|
||||
/// Schema violation. Caller's fault — fix the input.
|
||||
Schema(SchemaError),
|
||||
#[error("{0}")]
|
||||
Schema(#[from] SchemaError),
|
||||
/// Signal write failed durability check. Retry required.
|
||||
Durability(DurabilityError),
|
||||
#[error("durability error: {0}")]
|
||||
Durability(#[from] DurabilityError),
|
||||
/// Query malformed. Parse error with details.
|
||||
Query(QueryError),
|
||||
#[error("query error: {0}")]
|
||||
Query(#[from] QueryError),
|
||||
/// Configuration error. Caller supplied invalid config.
|
||||
Config(ConfigError),
|
||||
#[error("config error: {0}")]
|
||||
Config(#[from] ConfigError),
|
||||
/// Internal invariant violated. This is a bug in tidalDB.
|
||||
#[error("internal error: {0}")]
|
||||
Internal(String),
|
||||
/// A session policy rejected the signal write.
|
||||
#[error(
|
||||
"policy violation: signal '{signal_type}' rejected by policy '{policy_name}': {reason}"
|
||||
)]
|
||||
PolicyViolation {
|
||||
signal_type: String,
|
||||
policy_name: String,
|
||||
reason: String,
|
||||
},
|
||||
/// A session has exceeded its `max_session_duration` policy limit.
|
||||
#[error("session {session_id} expired (max duration: {max_duration_secs:.1}s)")]
|
||||
SessionExpired {
|
||||
session_id: u64,
|
||||
max_duration_secs: f64,
|
||||
},
|
||||
}
|
||||
|
||||
impl fmt::Display for TidalError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
match self {
|
||||
Self::Storage(e) => write!(f, "storage error: {e}"),
|
||||
Self::NotFound { kind, id } => write!(f, "{kind} {id} not found"),
|
||||
Self::Schema(e) => write!(f, "{e}"),
|
||||
Self::Durability(e) => write!(f, "durability error: {e}"),
|
||||
Self::Query(e) => write!(f, "query error: {e}"),
|
||||
Self::Config(e) => write!(f, "config error: {e}"),
|
||||
Self::Internal(msg) => write!(f, "internal error: {msg}"),
|
||||
Self::PolicyViolation {
|
||||
signal_type,
|
||||
policy_name,
|
||||
reason,
|
||||
} => write!(
|
||||
f,
|
||||
"policy violation: signal '{signal_type}' rejected by policy '{policy_name}': {reason}"
|
||||
),
|
||||
Self::SessionExpired {
|
||||
session_id,
|
||||
max_duration_secs,
|
||||
} => write!(
|
||||
f,
|
||||
"session {session_id} expired (max duration: {max_duration_secs:.1}s)"
|
||||
),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for TidalError {
|
||||
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
|
||||
match self {
|
||||
Self::Storage(e) => Some(e),
|
||||
Self::Schema(e) => Some(e),
|
||||
Self::Durability(e) => Some(e),
|
||||
Self::Query(e) => Some(e),
|
||||
Self::Config(e) => Some(e),
|
||||
Self::NotFound { .. }
|
||||
| Self::Internal(_)
|
||||
| Self::PolicyViolation { .. }
|
||||
| Self::SessionExpired { .. } => None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<SchemaError> for TidalError {
|
||||
fn from(e: SchemaError) -> Self {
|
||||
Self::Schema(e)
|
||||
}
|
||||
}
|
||||
|
||||
impl From<StorageError> for TidalError {
|
||||
fn from(e: StorageError) -> Self {
|
||||
Self::Storage(e)
|
||||
}
|
||||
}
|
||||
|
||||
impl From<DurabilityError> for TidalError {
|
||||
fn from(e: DurabilityError) -> Self {
|
||||
Self::Durability(e)
|
||||
}
|
||||
}
|
||||
|
||||
impl From<QueryError> for TidalError {
|
||||
fn from(e: QueryError) -> Self {
|
||||
Self::Query(e)
|
||||
}
|
||||
}
|
||||
|
||||
impl From<ConfigError> for TidalError {
|
||||
fn from(e: ConfigError) -> Self {
|
||||
Self::Config(e)
|
||||
}
|
||||
}
|
||||
|
||||
/// Schema validation errors.
|
||||
///
|
||||
/// `Eq` is manually implemented because f64 fields (from `Duration::as_secs_f64()`)
|
||||
/// are always non-NaN, making equality reflexive.
|
||||
#[derive(Debug, Clone, PartialEq)]
|
||||
#[derive(Debug, Clone, PartialEq, thiserror::Error)]
|
||||
pub enum SchemaError {
|
||||
#[error("duplicate signal name: '{0}'")]
|
||||
DuplicateSignalName(String),
|
||||
#[error("invalid signal name: '{0}'")]
|
||||
InvalidSignalName(String),
|
||||
#[error("signal '{signal_name}': invalid half-life: {half_life_secs}s")]
|
||||
InvalidHalfLife {
|
||||
signal_name: String,
|
||||
half_life_secs: f64,
|
||||
},
|
||||
#[error("signal '{signal_name}': invalid lifetime: {lifetime_secs}s")]
|
||||
InvalidLifetime {
|
||||
signal_name: String,
|
||||
lifetime_secs: f64,
|
||||
},
|
||||
EmptyWindows {
|
||||
signal_name: String,
|
||||
},
|
||||
VelocityWithoutWindows {
|
||||
signal_name: String,
|
||||
},
|
||||
#[error("signal '{signal_name}': non-permanent signal requires at least one window")]
|
||||
EmptyWindows { signal_name: String },
|
||||
#[error("signal '{signal_name}': velocity requires at least one window")]
|
||||
VelocityWithoutWindows { signal_name: String },
|
||||
#[error("schema must define at least one signal")]
|
||||
NoSignalsDefined,
|
||||
/// Signal type name not found in schema at runtime.
|
||||
#[error("unknown signal type: '{0}'")]
|
||||
UnknownSignalType(String),
|
||||
/// Two session policies share the same name.
|
||||
#[error("duplicate policy name: '{0}'")]
|
||||
DuplicatePolicyName(String),
|
||||
/// Session policy name is not a valid identifier.
|
||||
#[error("invalid policy name: '{0}'")]
|
||||
InvalidPolicyName(String),
|
||||
/// A signal referenced in a policy does not exist in the schema.
|
||||
PolicySignalNotInSchema {
|
||||
policy: String,
|
||||
signal: String,
|
||||
},
|
||||
#[error("policy '{policy}': signal '{signal}' not defined in schema")]
|
||||
PolicySignalNotInSchema { policy: String, signal: String },
|
||||
/// A signal appears in both `allowed_signals` and `denied_signals`.
|
||||
PolicySignalConflict {
|
||||
policy: String,
|
||||
signal: String,
|
||||
},
|
||||
#[error("policy '{policy}': signal '{signal}' in both allowed_signals and denied_signals")]
|
||||
PolicySignalConflict { policy: String, signal: String },
|
||||
}
|
||||
|
||||
impl Eq for SchemaError {}
|
||||
|
||||
impl fmt::Display for SchemaError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
match self {
|
||||
Self::DuplicateSignalName(name) => {
|
||||
write!(f, "duplicate signal name: '{name}'")
|
||||
}
|
||||
Self::InvalidSignalName(name) => {
|
||||
write!(f, "invalid signal name: '{name}'")
|
||||
}
|
||||
Self::InvalidHalfLife {
|
||||
signal_name,
|
||||
half_life_secs,
|
||||
} => {
|
||||
write!(
|
||||
f,
|
||||
"signal '{signal_name}': invalid half-life: {half_life_secs}s"
|
||||
)
|
||||
}
|
||||
Self::InvalidLifetime {
|
||||
signal_name,
|
||||
lifetime_secs,
|
||||
} => {
|
||||
write!(
|
||||
f,
|
||||
"signal '{signal_name}': invalid lifetime: {lifetime_secs}s"
|
||||
)
|
||||
}
|
||||
Self::EmptyWindows { signal_name } => {
|
||||
write!(
|
||||
f,
|
||||
"signal '{signal_name}': non-permanent signal requires at least one window"
|
||||
)
|
||||
}
|
||||
Self::VelocityWithoutWindows { signal_name } => {
|
||||
write!(
|
||||
f,
|
||||
"signal '{signal_name}': velocity requires at least one window"
|
||||
)
|
||||
}
|
||||
Self::NoSignalsDefined => f.write_str("schema must define at least one signal"),
|
||||
Self::UnknownSignalType(name) => {
|
||||
write!(f, "unknown signal type: '{name}'")
|
||||
}
|
||||
Self::DuplicatePolicyName(name) => {
|
||||
write!(f, "duplicate policy name: '{name}'")
|
||||
}
|
||||
Self::InvalidPolicyName(name) => {
|
||||
write!(f, "invalid policy name: '{name}'")
|
||||
}
|
||||
Self::PolicySignalNotInSchema { policy, signal } => {
|
||||
write!(
|
||||
f,
|
||||
"policy '{policy}': signal '{signal}' not defined in schema"
|
||||
)
|
||||
}
|
||||
Self::PolicySignalConflict { policy, signal } => {
|
||||
write!(
|
||||
f,
|
||||
"policy '{policy}': signal '{signal}' in both allowed_signals and denied_signals"
|
||||
)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for SchemaError {}
|
||||
|
||||
/// Re-exported from `crate::storage::StorageError`.
|
||||
pub use crate::storage::StorageError;
|
||||
|
||||
/// Stub for Phase 1.2+.
|
||||
#[derive(Debug)]
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
#[error("{message}")]
|
||||
pub struct DurabilityError {
|
||||
pub message: String,
|
||||
}
|
||||
|
||||
impl fmt::Display for DurabilityError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.write_str(&self.message)
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for DurabilityError {}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
@ -12,5 +12,6 @@ pub use score::Score;
|
||||
pub use signal::{DecayModel, SignalTypeDef, Window, WindowSet};
|
||||
pub use timestamp::Timestamp;
|
||||
pub use validation::{
|
||||
AgentPolicy, DecaySpec, EmbeddingSlotDef, Schema, SchemaBuilder, SignalBuilder,
|
||||
AgentPolicy, DecaySpec, EmbeddingSlotDef, Schema, SchemaBuilder, SignalBuilder, TextFieldDef,
|
||||
TextFieldType,
|
||||
};
|
||||
|
||||
@ -1,125 +1,11 @@
|
||||
use std::collections::HashMap;
|
||||
use std::time::Duration;
|
||||
|
||||
use super::error::SchemaError;
|
||||
use super::{DecayModel, EntityKind, SignalTypeDef, Window, WindowSet};
|
||||
use crate::schema::error::SchemaError;
|
||||
use crate::schema::{DecayModel, EntityKind, SignalTypeDef, Window, WindowSet};
|
||||
|
||||
// ── AgentPolicy ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// Policy controlling what an agent session is allowed to do.
|
||||
///
|
||||
/// Declared in the schema at build time via `SchemaBuilder::session_policy`.
|
||||
/// Policies are validated at schema build time: all signal names must exist
|
||||
/// in the schema, and no signal may appear in both allow and deny lists.
|
||||
///
|
||||
/// # Examples
|
||||
///
|
||||
/// ```
|
||||
/// use std::time::Duration;
|
||||
/// use tidaldb::schema::AgentPolicy;
|
||||
///
|
||||
/// let policy = AgentPolicy {
|
||||
/// allowed_signals: vec!["reward".to_string(), "preference_hint".to_string()],
|
||||
/// denied_signals: vec![],
|
||||
/// max_session_duration: Duration::from_secs(3600),
|
||||
/// max_signals_per_session: 10_000,
|
||||
/// };
|
||||
/// ```
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct AgentPolicy {
|
||||
/// If non-empty, only these signal types may be written in sessions using this policy.
|
||||
pub allowed_signals: Vec<String>,
|
||||
/// Signal types always blocked, regardless of `allowed_signals`.
|
||||
pub denied_signals: Vec<String>,
|
||||
/// Maximum wall-clock duration for a session before it is considered expired.
|
||||
pub max_session_duration: Duration,
|
||||
/// Maximum total signals that may be written in a single session.
|
||||
/// `0` means unlimited.
|
||||
pub max_signals_per_session: u32,
|
||||
}
|
||||
|
||||
/// Internal builder entry for a session policy.
|
||||
#[derive(Debug)]
|
||||
struct PolicyEntry {
|
||||
name: String,
|
||||
policy: AgentPolicy,
|
||||
}
|
||||
|
||||
/// User-facing decay specification (before validation computes lambda).
|
||||
///
|
||||
/// Users specify `DecaySpec::Exponential { half_life }` — no lambda.
|
||||
/// The `SchemaBuilder` validates the duration and computes `DecayModel::Exponential { half_life, lambda }`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub enum DecaySpec {
|
||||
/// Weight halves every `half_life`.
|
||||
Exponential { half_life: Duration },
|
||||
/// Weight drops linearly to zero over `lifetime`.
|
||||
Linear { lifetime: Duration },
|
||||
/// Never decays. Used for permanent flags: hide, block, follow.
|
||||
Permanent,
|
||||
}
|
||||
|
||||
/// Definition of an embedding vector slot for ANN search.
|
||||
///
|
||||
/// Declared in the schema to tell tidalDB which embedding dimensions
|
||||
/// to expect for a given entity kind. The database retrieves and ranks
|
||||
/// over vectors -- it does not generate them.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct EmbeddingSlotDef {
|
||||
/// Slot name (e.g., "default", "thumbnail").
|
||||
pub name: String,
|
||||
/// Which entity kind this slot is attached to.
|
||||
pub entity_kind: EntityKind,
|
||||
/// Dimensionality of the embedding vector.
|
||||
pub dimensions: usize,
|
||||
}
|
||||
|
||||
/// A validated, immutable schema.
|
||||
///
|
||||
/// Constructed exclusively through `SchemaBuilder`. Once built, the schema
|
||||
/// is frozen — signal type definitions cannot be added or modified.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct Schema {
|
||||
signals: HashMap<String, SignalTypeDef>,
|
||||
embedding_slots: Vec<EmbeddingSlotDef>,
|
||||
policies: HashMap<String, AgentPolicy>,
|
||||
}
|
||||
|
||||
impl Schema {
|
||||
/// Look up a signal type definition by name.
|
||||
#[must_use]
|
||||
pub fn signal(&self, name: &str) -> Option<&SignalTypeDef> {
|
||||
self.signals.get(name)
|
||||
}
|
||||
|
||||
/// Iterate over all signal type definitions.
|
||||
pub fn signals(&self) -> impl Iterator<Item = &SignalTypeDef> {
|
||||
self.signals.values()
|
||||
}
|
||||
|
||||
/// Number of signal types defined.
|
||||
#[must_use]
|
||||
pub fn signal_count(&self) -> usize {
|
||||
self.signals.len()
|
||||
}
|
||||
|
||||
/// Embedding slot definitions declared in this schema.
|
||||
#[must_use]
|
||||
pub fn embedding_slots(&self) -> &[EmbeddingSlotDef] {
|
||||
&self.embedding_slots
|
||||
}
|
||||
|
||||
/// Look up a session policy by name.
|
||||
#[must_use]
|
||||
pub fn session_policy(&self, name: &str) -> Option<&AgentPolicy> {
|
||||
self.policies.get(name)
|
||||
}
|
||||
|
||||
/// Iterate over all session policies.
|
||||
pub fn session_policies(&self) -> impl Iterator<Item = (&str, &AgentPolicy)> {
|
||||
self.policies.iter().map(|(k, v)| (k.as_str(), v))
|
||||
}
|
||||
}
|
||||
use super::policies::{AgentPolicy, PolicyEntry};
|
||||
use super::text::{EmbeddingSlotDef, TextFieldDef, TextFieldType};
|
||||
use super::{DecaySpec, Schema};
|
||||
|
||||
/// Internal entry for a signal being built.
|
||||
#[derive(Debug)]
|
||||
@ -149,6 +35,8 @@ struct SignalEntry {
|
||||
pub struct SchemaBuilder {
|
||||
entries: Vec<SignalEntry>,
|
||||
embedding_slots: Vec<EmbeddingSlotDef>,
|
||||
text_fields: Vec<TextFieldDef>,
|
||||
creator_text_fields: Vec<TextFieldDef>,
|
||||
policies: Vec<PolicyEntry>,
|
||||
}
|
||||
|
||||
@ -158,6 +46,8 @@ impl SchemaBuilder {
|
||||
Self {
|
||||
entries: Vec::new(),
|
||||
embedding_slots: Vec::new(),
|
||||
text_fields: Vec::new(),
|
||||
creator_text_fields: Vec::new(),
|
||||
policies: Vec::new(),
|
||||
}
|
||||
}
|
||||
@ -177,6 +67,31 @@ impl SchemaBuilder {
|
||||
self
|
||||
}
|
||||
|
||||
/// Declare a text field for full-text search indexing (items).
|
||||
///
|
||||
/// Each text field maps a metadata key to a Tantivy indexing mode.
|
||||
/// `TextFieldType::Text` enables tokenized full-text search;
|
||||
/// `TextFieldType::Keyword` enables exact-match filtering.
|
||||
pub fn text_field(&mut self, key: &str, field_type: TextFieldType) -> &mut Self {
|
||||
self.text_fields.push(TextFieldDef {
|
||||
key: key.to_owned(),
|
||||
field_type,
|
||||
});
|
||||
self
|
||||
}
|
||||
|
||||
/// Declare a text field for creator full-text search indexing.
|
||||
///
|
||||
/// Same semantics as `text_field()` but populates the creator Tantivy index.
|
||||
/// Use for creator metadata keys like "name", "handle", "bio".
|
||||
pub fn creator_text_field(&mut self, key: &str, field_type: TextFieldType) -> &mut Self {
|
||||
self.creator_text_fields.push(TextFieldDef {
|
||||
key: key.to_owned(),
|
||||
field_type,
|
||||
});
|
||||
self
|
||||
}
|
||||
|
||||
/// Declare a session policy for agent sessions.
|
||||
///
|
||||
/// Policies are validated at `build()` time: all signal names in
|
||||
@ -233,7 +148,7 @@ impl SchemaBuilder {
|
||||
|
||||
for entry in self.entries {
|
||||
// Name validation
|
||||
if !is_valid_signal_name(&entry.name) {
|
||||
if !super::is_valid_signal_name(&entry.name) {
|
||||
return Err(SchemaError::InvalidSignalName(entry.name));
|
||||
}
|
||||
|
||||
@ -299,7 +214,7 @@ impl SchemaBuilder {
|
||||
let mut seen_policy_names = std::collections::HashSet::new();
|
||||
let mut policies = HashMap::new();
|
||||
for entry in self.policies {
|
||||
if !is_valid_signal_name(&entry.name) {
|
||||
if !super::is_valid_signal_name(&entry.name) {
|
||||
return Err(SchemaError::InvalidPolicyName(entry.name));
|
||||
}
|
||||
if !seen_policy_names.insert(entry.name.clone()) {
|
||||
@ -331,11 +246,13 @@ impl SchemaBuilder {
|
||||
policies.insert(entry.name, entry.policy);
|
||||
}
|
||||
|
||||
Ok(Schema {
|
||||
Ok(Schema::new(
|
||||
signals,
|
||||
embedding_slots: self.embedding_slots,
|
||||
self.embedding_slots,
|
||||
self.text_fields,
|
||||
self.creator_text_fields,
|
||||
policies,
|
||||
})
|
||||
))
|
||||
}
|
||||
}
|
||||
|
||||
@ -345,12 +262,6 @@ impl Default for SchemaBuilder {
|
||||
}
|
||||
}
|
||||
|
||||
/// Check if a policy name is a valid identifier (same rules as signal names).
|
||||
#[cfg(test)]
|
||||
fn is_valid_policy_name(name: &str) -> bool {
|
||||
is_valid_signal_name(name)
|
||||
}
|
||||
|
||||
/// Intermediate builder for configuring a single signal type.
|
||||
///
|
||||
/// Created by `SchemaBuilder::signal()`. Call `.windows()` and `.velocity()`
|
||||
@ -383,24 +294,12 @@ impl<'a> SignalBuilder<'a> {
|
||||
}
|
||||
}
|
||||
|
||||
/// Check if a signal name is a valid identifier.
|
||||
///
|
||||
/// Must be non-empty, ASCII, lowercase alphanumeric + underscore,
|
||||
/// and start with a letter. Safe for use in storage keys
|
||||
/// (`SIG:{name}:{window}`) and the query language.
|
||||
fn is_valid_signal_name(name: &str) -> bool {
|
||||
!name.is_empty()
|
||||
&& name.is_ascii()
|
||||
&& name
|
||||
.bytes()
|
||||
.all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_')
|
||||
&& name.as_bytes()[0].is_ascii_lowercase()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(unused_must_use, clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::schema::Window;
|
||||
use std::time::Duration;
|
||||
|
||||
// === Validation: valid schemas ===
|
||||
|
||||
@ -634,134 +533,4 @@ mod tests {
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
// === Signal name validation ===
|
||||
|
||||
#[test]
|
||||
fn is_valid_signal_name_unit() {
|
||||
assert!(is_valid_signal_name("view"));
|
||||
assert!(is_valid_signal_name("a"));
|
||||
assert!(is_valid_signal_name("view_count"));
|
||||
assert!(is_valid_signal_name("signal_24h"));
|
||||
|
||||
assert!(!is_valid_signal_name(""));
|
||||
assert!(!is_valid_signal_name("View"));
|
||||
assert!(!is_valid_signal_name("1view"));
|
||||
assert!(!is_valid_signal_name("_view"));
|
||||
assert!(!is_valid_signal_name("view count"));
|
||||
assert!(!is_valid_signal_name("view-count"));
|
||||
assert!(!is_valid_signal_name("view!"));
|
||||
}
|
||||
|
||||
// === UAT-style integration test ===
|
||||
|
||||
#[test]
|
||||
fn milestone_1_uat_schema() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600), // 7 days
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
|
||||
.velocity(true)
|
||||
.add();
|
||||
builder
|
||||
.signal(
|
||||
"like",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(14 * 24 * 3600), // 14 days
|
||||
},
|
||||
)
|
||||
.windows(&[Window::TwentyFourHours, Window::SevenDays, Window::AllTime])
|
||||
.velocity(true)
|
||||
.add();
|
||||
builder
|
||||
.signal(
|
||||
"skip",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(24 * 3600), // 1 day
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours])
|
||||
.velocity(false)
|
||||
.add();
|
||||
|
||||
let schema = builder.build().expect("UAT schema should be valid");
|
||||
assert_eq!(schema.signal_count(), 3);
|
||||
|
||||
// Verify view signal
|
||||
let view = schema.signal("view").unwrap();
|
||||
assert_eq!(view.windows().len(), 3);
|
||||
assert!(view.velocity_enabled());
|
||||
let lambda = view.decay().lambda().unwrap();
|
||||
let expected_lambda = std::f64::consts::LN_2 / (7.0 * 24.0 * 3600.0);
|
||||
assert!((lambda - expected_lambda).abs() < 1e-15);
|
||||
|
||||
// Verify like signal
|
||||
let like = schema.signal("like").unwrap();
|
||||
assert_eq!(like.windows().len(), 3);
|
||||
assert!(like.windows().contains(&Window::AllTime));
|
||||
|
||||
// Verify skip signal
|
||||
let skip = schema.signal("skip").unwrap();
|
||||
assert!(!skip.velocity_enabled());
|
||||
let skip_lambda = skip.decay().lambda().unwrap();
|
||||
let expected_skip_lambda = std::f64::consts::LN_2 / (24.0 * 3600.0);
|
||||
assert!((skip_lambda - expected_skip_lambda).abs() < 1e-15);
|
||||
}
|
||||
|
||||
// === Schema query API ===
|
||||
|
||||
#[test]
|
||||
fn schema_signal_returns_none_for_missing() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
builder
|
||||
.signal("view", EntityKind::Item, DecaySpec::Permanent)
|
||||
.add();
|
||||
let schema = builder.build().unwrap();
|
||||
assert!(schema.signal("nonexistent").is_none());
|
||||
}
|
||||
|
||||
// === Property tests ===
|
||||
|
||||
mod proptests {
|
||||
use super::*;
|
||||
use proptest::prelude::*;
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn signal_name_validation_consistent(name in "\\PC{0,100}") {
|
||||
let valid = is_valid_signal_name(&name);
|
||||
let expected = !name.is_empty()
|
||||
&& name.is_ascii()
|
||||
&& name.bytes().all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_')
|
||||
&& name.as_bytes()[0].is_ascii_lowercase();
|
||||
prop_assert_eq!(valid, expected);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn schema_contains_all_defined_signals(count in 1usize..10) {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let names: Vec<String> = (0..count)
|
||||
.map(|i| format!("signal_{i}"))
|
||||
.collect();
|
||||
|
||||
for name in &names {
|
||||
builder.signal(name, EntityKind::Item, DecaySpec::Permanent).add();
|
||||
}
|
||||
|
||||
let schema = builder.build().unwrap();
|
||||
prop_assert_eq!(schema.signal_count(), count);
|
||||
for name in &names {
|
||||
prop_assert!(schema.signal(name).is_some());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
259
tidal/src/schema/validation/mod.rs
Normal file
259
tidal/src/schema/validation/mod.rs
Normal file
@ -0,0 +1,259 @@
|
||||
pub mod builders;
|
||||
pub mod policies;
|
||||
pub mod text;
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use super::SignalTypeDef;
|
||||
|
||||
pub use builders::{SchemaBuilder, SignalBuilder};
|
||||
pub use policies::AgentPolicy;
|
||||
pub use text::{EmbeddingSlotDef, TextFieldDef, TextFieldType};
|
||||
|
||||
/// User-facing decay specification (before validation computes lambda).
|
||||
///
|
||||
/// Users specify `DecaySpec::Exponential { half_life }` — no lambda.
|
||||
/// The `SchemaBuilder` validates the duration and computes `DecayModel::Exponential { half_life, lambda }`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub enum DecaySpec {
|
||||
/// Weight halves every `half_life`.
|
||||
Exponential { half_life: std::time::Duration },
|
||||
/// Weight drops linearly to zero over `lifetime`.
|
||||
Linear { lifetime: std::time::Duration },
|
||||
/// Never decays. Used for permanent flags: hide, block, follow.
|
||||
Permanent,
|
||||
}
|
||||
|
||||
/// A validated, immutable schema.
|
||||
///
|
||||
/// Constructed exclusively through `SchemaBuilder`. Once built, the schema
|
||||
/// is frozen — signal type definitions cannot be added or modified.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct Schema {
|
||||
signals: HashMap<String, SignalTypeDef>,
|
||||
embedding_slots: Vec<EmbeddingSlotDef>,
|
||||
text_fields: Vec<TextFieldDef>,
|
||||
creator_text_fields: Vec<TextFieldDef>,
|
||||
policies: HashMap<String, AgentPolicy>,
|
||||
}
|
||||
|
||||
impl Schema {
|
||||
/// Construct a `Schema` from validated components.
|
||||
///
|
||||
/// This is `pub(super)` so only the builder (within the validation module)
|
||||
/// can construct it -- external callers must go through `SchemaBuilder`.
|
||||
#[allow(clippy::missing_const_for_fn)] // HashMap prevents const
|
||||
pub(super) fn new(
|
||||
signals: HashMap<String, SignalTypeDef>,
|
||||
embedding_slots: Vec<EmbeddingSlotDef>,
|
||||
text_fields: Vec<TextFieldDef>,
|
||||
creator_text_fields: Vec<TextFieldDef>,
|
||||
policies: HashMap<String, AgentPolicy>,
|
||||
) -> Self {
|
||||
Self {
|
||||
signals,
|
||||
embedding_slots,
|
||||
text_fields,
|
||||
creator_text_fields,
|
||||
policies,
|
||||
}
|
||||
}
|
||||
|
||||
/// Look up a signal type definition by name.
|
||||
#[must_use]
|
||||
pub fn signal(&self, name: &str) -> Option<&SignalTypeDef> {
|
||||
self.signals.get(name)
|
||||
}
|
||||
|
||||
/// Iterate over all signal type definitions.
|
||||
pub fn signals(&self) -> impl Iterator<Item = &SignalTypeDef> {
|
||||
self.signals.values()
|
||||
}
|
||||
|
||||
/// Number of signal types defined.
|
||||
#[must_use]
|
||||
pub fn signal_count(&self) -> usize {
|
||||
self.signals.len()
|
||||
}
|
||||
|
||||
/// Embedding slot definitions declared in this schema.
|
||||
#[must_use]
|
||||
pub fn embedding_slots(&self) -> &[EmbeddingSlotDef] {
|
||||
&self.embedding_slots
|
||||
}
|
||||
|
||||
/// Text field definitions declared in this schema for full-text search (items).
|
||||
#[must_use]
|
||||
pub fn text_fields(&self) -> &[TextFieldDef] {
|
||||
&self.text_fields
|
||||
}
|
||||
|
||||
/// Text field definitions declared in this schema for creator full-text search.
|
||||
#[must_use]
|
||||
pub fn creator_text_fields(&self) -> &[TextFieldDef] {
|
||||
&self.creator_text_fields
|
||||
}
|
||||
|
||||
/// Look up a session policy by name.
|
||||
#[must_use]
|
||||
pub fn session_policy(&self, name: &str) -> Option<&AgentPolicy> {
|
||||
self.policies.get(name)
|
||||
}
|
||||
|
||||
/// Iterate over all session policies.
|
||||
pub fn session_policies(&self) -> impl Iterator<Item = (&str, &AgentPolicy)> {
|
||||
self.policies.iter().map(|(k, v)| (k.as_str(), v))
|
||||
}
|
||||
}
|
||||
|
||||
/// Check if a signal name is a valid identifier.
|
||||
///
|
||||
/// Must be non-empty, ASCII, lowercase alphanumeric + underscore,
|
||||
/// and start with a letter. Safe for use in storage keys
|
||||
/// (`SIG:{name}:{window}`) and the query language.
|
||||
fn is_valid_signal_name(name: &str) -> bool {
|
||||
!name.is_empty()
|
||||
&& name.is_ascii()
|
||||
&& name
|
||||
.bytes()
|
||||
.all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_')
|
||||
&& name.as_bytes()[0].is_ascii_lowercase()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(unused_must_use, clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::schema::{EntityKind, Window};
|
||||
use std::time::Duration;
|
||||
|
||||
// === Signal name validation ===
|
||||
|
||||
#[test]
|
||||
fn is_valid_signal_name_unit() {
|
||||
assert!(is_valid_signal_name("view"));
|
||||
assert!(is_valid_signal_name("a"));
|
||||
assert!(is_valid_signal_name("view_count"));
|
||||
assert!(is_valid_signal_name("signal_24h"));
|
||||
|
||||
assert!(!is_valid_signal_name(""));
|
||||
assert!(!is_valid_signal_name("View"));
|
||||
assert!(!is_valid_signal_name("1view"));
|
||||
assert!(!is_valid_signal_name("_view"));
|
||||
assert!(!is_valid_signal_name("view count"));
|
||||
assert!(!is_valid_signal_name("view-count"));
|
||||
assert!(!is_valid_signal_name("view!"));
|
||||
}
|
||||
|
||||
// === UAT-style integration test ===
|
||||
|
||||
#[test]
|
||||
fn milestone_1_uat_schema() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600), // 7 days
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays])
|
||||
.velocity(true)
|
||||
.add();
|
||||
builder
|
||||
.signal(
|
||||
"like",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(14 * 24 * 3600), // 14 days
|
||||
},
|
||||
)
|
||||
.windows(&[Window::TwentyFourHours, Window::SevenDays, Window::AllTime])
|
||||
.velocity(true)
|
||||
.add();
|
||||
builder
|
||||
.signal(
|
||||
"skip",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(24 * 3600), // 1 day
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::TwentyFourHours])
|
||||
.velocity(false)
|
||||
.add();
|
||||
|
||||
let schema = builder.build().expect("UAT schema should be valid");
|
||||
assert_eq!(schema.signal_count(), 3);
|
||||
|
||||
// Verify view signal
|
||||
let view = schema.signal("view").unwrap();
|
||||
assert_eq!(view.windows().len(), 3);
|
||||
assert!(view.velocity_enabled());
|
||||
let lambda = view.decay().lambda().unwrap();
|
||||
let expected_lambda = std::f64::consts::LN_2 / (7.0 * 24.0 * 3600.0);
|
||||
assert!((lambda - expected_lambda).abs() < 1e-15);
|
||||
|
||||
// Verify like signal
|
||||
let like = schema.signal("like").unwrap();
|
||||
assert_eq!(like.windows().len(), 3);
|
||||
assert!(like.windows().contains(&Window::AllTime));
|
||||
|
||||
// Verify skip signal
|
||||
let skip = schema.signal("skip").unwrap();
|
||||
assert!(!skip.velocity_enabled());
|
||||
let skip_lambda = skip.decay().lambda().unwrap();
|
||||
let expected_skip_lambda = std::f64::consts::LN_2 / (24.0 * 3600.0);
|
||||
assert!((skip_lambda - expected_skip_lambda).abs() < 1e-15);
|
||||
}
|
||||
|
||||
// === Schema query API ===
|
||||
|
||||
#[test]
|
||||
fn schema_signal_returns_none_for_missing() {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
builder
|
||||
.signal("view", EntityKind::Item, DecaySpec::Permanent)
|
||||
.add();
|
||||
let schema = builder.build().unwrap();
|
||||
assert!(schema.signal("nonexistent").is_none());
|
||||
}
|
||||
|
||||
// === Property tests ===
|
||||
|
||||
mod proptests {
|
||||
use super::*;
|
||||
use proptest::prelude::*;
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn signal_name_validation_consistent(name in "\\PC{0,100}") {
|
||||
let valid = is_valid_signal_name(&name);
|
||||
let expected = !name.is_empty()
|
||||
&& name.is_ascii()
|
||||
&& name.bytes().all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_')
|
||||
&& name.as_bytes()[0].is_ascii_lowercase();
|
||||
prop_assert_eq!(valid, expected);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn schema_contains_all_defined_signals(count in 1usize..10) {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let names: Vec<String> = (0..count)
|
||||
.map(|i| format!("signal_{i}"))
|
||||
.collect();
|
||||
|
||||
for name in &names {
|
||||
builder.signal(name, EntityKind::Item, DecaySpec::Permanent).add();
|
||||
}
|
||||
|
||||
let schema = builder.build().unwrap();
|
||||
prop_assert_eq!(schema.signal_count(), count);
|
||||
for name in &names {
|
||||
prop_assert!(schema.signal(name).is_some());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
42
tidal/src/schema/validation/policies.rs
Normal file
42
tidal/src/schema/validation/policies.rs
Normal file
@ -0,0 +1,42 @@
|
||||
use std::time::Duration;
|
||||
|
||||
// ── AgentPolicy ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// Policy controlling what an agent session is allowed to do.
|
||||
///
|
||||
/// Declared in the schema at build time via `SchemaBuilder::session_policy`.
|
||||
/// Policies are validated at schema build time: all signal names must exist
|
||||
/// in the schema, and no signal may appear in both allow and deny lists.
|
||||
///
|
||||
/// # Examples
|
||||
///
|
||||
/// ```
|
||||
/// use std::time::Duration;
|
||||
/// use tidaldb::schema::AgentPolicy;
|
||||
///
|
||||
/// let policy = AgentPolicy {
|
||||
/// allowed_signals: vec!["reward".to_string(), "preference_hint".to_string()],
|
||||
/// denied_signals: vec![],
|
||||
/// max_session_duration: Duration::from_secs(3600),
|
||||
/// max_signals_per_session: 10_000,
|
||||
/// };
|
||||
/// ```
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct AgentPolicy {
|
||||
/// If non-empty, only these signal types may be written in sessions using this policy.
|
||||
pub allowed_signals: Vec<String>,
|
||||
/// Signal types always blocked, regardless of `allowed_signals`.
|
||||
pub denied_signals: Vec<String>,
|
||||
/// Maximum wall-clock duration for a session before it is considered expired.
|
||||
pub max_session_duration: Duration,
|
||||
/// Maximum total signals that may be written in a single session.
|
||||
/// `0` means unlimited.
|
||||
pub max_signals_per_session: u32,
|
||||
}
|
||||
|
||||
/// Internal builder entry for a session policy.
|
||||
#[derive(Debug)]
|
||||
pub(super) struct PolicyEntry {
|
||||
pub(super) name: String,
|
||||
pub(super) policy: AgentPolicy,
|
||||
}
|
||||
39
tidal/src/schema/validation/text.rs
Normal file
39
tidal/src/schema/validation/text.rs
Normal file
@ -0,0 +1,39 @@
|
||||
use crate::schema::EntityKind;
|
||||
|
||||
/// Declaration of a text field for full-text search indexing.
|
||||
///
|
||||
/// Each `TextFieldDef` maps a metadata key (e.g., "title", "description") to
|
||||
/// a Tantivy indexing mode. Declared in the schema via `SchemaBuilder::text_field`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct TextFieldDef {
|
||||
/// The metadata key to index (e.g., "title", "description", "tags").
|
||||
pub key: String,
|
||||
/// Whether this field is tokenized (full-text) or raw (keyword/exact-match).
|
||||
pub field_type: TextFieldType,
|
||||
}
|
||||
|
||||
/// The Tantivy indexing mode for a text field.
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub enum TextFieldType {
|
||||
/// Full tokenization with Tantivy's default tokenizer.
|
||||
/// Good for: title, description, body text.
|
||||
Text,
|
||||
/// Raw storage, no tokenization. Only exact-match queries work.
|
||||
/// Good for: category, format, `creator_id`, language tags.
|
||||
Keyword,
|
||||
}
|
||||
|
||||
/// Definition of an embedding vector slot for ANN search.
|
||||
///
|
||||
/// Declared in the schema to tell tidalDB which embedding dimensions
|
||||
/// to expect for a given entity kind. The database retrieves and ranks
|
||||
/// over vectors -- it does not generate them.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct EmbeddingSlotDef {
|
||||
/// Slot name (e.g., "default", "thumbnail").
|
||||
pub name: String,
|
||||
/// Which entity kind this slot is attached to.
|
||||
pub entity_kind: EntityKind,
|
||||
/// Dimensionality of the embedding vector.
|
||||
pub dimensions: usize,
|
||||
}
|
||||
120
tidal/src/session/audit.rs
Normal file
120
tidal/src/session/audit.rs
Normal file
@ -0,0 +1,120 @@
|
||||
//! Bounded audit log for session policy decisions.
|
||||
|
||||
/// Maximum audit log entries before truncation.
|
||||
pub const MAX_AUDIT_ENTRIES: usize = 10_000;
|
||||
/// Maximum annotation entries.
|
||||
pub const MAX_ANNOTATIONS: usize = 100;
|
||||
/// Maximum number of closed sessions retained in memory.
|
||||
pub const MAX_CLOSED_SESSIONS: usize = 1_000;
|
||||
|
||||
// ── AuditEntry ────────────────────────────────────────────────────────────────
|
||||
|
||||
/// A single policy check result recorded in the session audit log.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct AuditEntry {
|
||||
/// Nanosecond timestamp of the check.
|
||||
pub timestamp_ns: u64,
|
||||
/// Signal type name that was checked.
|
||||
pub signal_type: String,
|
||||
/// Whether the signal was accepted.
|
||||
pub accepted: bool,
|
||||
/// Reason for rejection (present when `accepted == false`).
|
||||
pub reason: Option<String>,
|
||||
}
|
||||
|
||||
// ── AuditLog ──────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Bounded audit log with overflow eviction.
|
||||
///
|
||||
/// Holds up to `MAX_AUDIT_ENTRIES` entries. When the cap is reached, the oldest
|
||||
/// entry is evicted and `truncated` is set to `true`.
|
||||
pub struct AuditLog {
|
||||
entries: Vec<AuditEntry>,
|
||||
/// Set to `true` once an entry has been evicted due to overflow.
|
||||
pub truncated: bool,
|
||||
}
|
||||
|
||||
impl AuditLog {
|
||||
/// Create an empty audit log.
|
||||
#[must_use]
|
||||
pub const fn new() -> Self {
|
||||
Self {
|
||||
entries: Vec::new(),
|
||||
truncated: false,
|
||||
}
|
||||
}
|
||||
|
||||
/// Append an entry, evicting the oldest if the cap is reached.
|
||||
pub fn push(&mut self, entry: AuditEntry) {
|
||||
if self.entries.len() >= MAX_AUDIT_ENTRIES {
|
||||
self.entries.remove(0);
|
||||
self.truncated = true;
|
||||
}
|
||||
self.entries.push(entry);
|
||||
}
|
||||
|
||||
/// Borrow the entries slice.
|
||||
#[must_use]
|
||||
pub fn entries(&self) -> &[AuditEntry] {
|
||||
&self.entries
|
||||
}
|
||||
|
||||
/// Number of entries currently held.
|
||||
#[must_use]
|
||||
pub const fn len(&self) -> usize {
|
||||
self.entries.len()
|
||||
}
|
||||
|
||||
/// `true` if no entries are held.
|
||||
#[must_use]
|
||||
pub const fn is_empty(&self) -> bool {
|
||||
self.entries.is_empty()
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for AuditLog {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
// ── Unit tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn audit_log_truncates_on_overflow() {
|
||||
let mut log = AuditLog::new();
|
||||
assert!(!log.truncated);
|
||||
|
||||
// Fill to capacity.
|
||||
for i in 0..MAX_AUDIT_ENTRIES {
|
||||
log.push(AuditEntry {
|
||||
timestamp_ns: i as u64,
|
||||
signal_type: "view".to_string(),
|
||||
accepted: true,
|
||||
reason: None,
|
||||
});
|
||||
}
|
||||
assert!(!log.truncated);
|
||||
assert_eq!(log.len(), MAX_AUDIT_ENTRIES);
|
||||
|
||||
// One more should evict oldest and set truncated.
|
||||
log.push(AuditEntry {
|
||||
timestamp_ns: MAX_AUDIT_ENTRIES as u64,
|
||||
signal_type: "view".to_string(),
|
||||
accepted: true,
|
||||
reason: None,
|
||||
});
|
||||
assert!(log.truncated);
|
||||
assert_eq!(log.len(), MAX_AUDIT_ENTRIES);
|
||||
// Oldest (ts=0) should have been evicted; newest entry is at the end.
|
||||
assert_eq!(log.entries()[0].timestamp_ns, 1);
|
||||
assert_eq!(
|
||||
log.entries()[MAX_AUDIT_ENTRIES - 1].timestamp_ns,
|
||||
MAX_AUDIT_ENTRIES as u64
|
||||
);
|
||||
}
|
||||
}
|
||||
File diff suppressed because it is too large
Load Diff
256
tidal/src/session/policy.rs
Normal file
256
tidal/src/session/policy.rs
Normal file
@ -0,0 +1,256 @@
|
||||
//! Policy evaluation for session signal writes.
|
||||
|
||||
use std::sync::atomic::Ordering;
|
||||
use std::time::Instant;
|
||||
|
||||
use crate::schema::AgentPolicy;
|
||||
|
||||
use super::state::SessionState;
|
||||
|
||||
// ── PolicyViolationKind ───────────────────────────────────────────────────────
|
||||
|
||||
/// Typed reason category for a policy check failure.
|
||||
///
|
||||
/// Allows callers to dispatch on the specific cause without parsing strings.
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub enum PolicyViolationKind {
|
||||
/// Session duration limit exceeded.
|
||||
Expired,
|
||||
/// Per-session signal count cap reached.
|
||||
CountCap,
|
||||
/// Signal type is in the `denied_signals` list.
|
||||
Denied,
|
||||
/// Signal type is not in the non-empty `allowed_signals` list.
|
||||
NotAllowed,
|
||||
}
|
||||
|
||||
// ── PolicyViolation ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Describes why a policy check rejected a signal write.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct PolicyViolation {
|
||||
/// Typed reason for the rejection — used for error-type dispatch.
|
||||
pub kind: PolicyViolationKind,
|
||||
pub signal_type: String,
|
||||
pub policy_name: String,
|
||||
pub reason: String,
|
||||
}
|
||||
|
||||
// ── PolicyEvaluator ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Evaluates policy rules against incoming session signals.
|
||||
pub struct PolicyEvaluator<'a> {
|
||||
policy: &'a AgentPolicy,
|
||||
policy_name: &'a str,
|
||||
}
|
||||
|
||||
impl<'a> PolicyEvaluator<'a> {
|
||||
#[must_use]
|
||||
pub const fn new(policy: &'a AgentPolicy, policy_name: &'a str) -> Self {
|
||||
Self {
|
||||
policy,
|
||||
policy_name,
|
||||
}
|
||||
}
|
||||
|
||||
/// Check whether a signal can be written under this policy.
|
||||
///
|
||||
/// Returns `Ok(())` if all policy checks pass.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `PolicyViolation` describing the first rule violated.
|
||||
pub fn check(
|
||||
&self,
|
||||
signal_type: &str,
|
||||
state: &SessionState,
|
||||
now: Instant,
|
||||
) -> Result<(), PolicyViolation> {
|
||||
let make_violation = |kind: PolicyViolationKind, reason: String| PolicyViolation {
|
||||
kind,
|
||||
signal_type: signal_type.to_owned(),
|
||||
policy_name: self.policy_name.to_owned(),
|
||||
reason,
|
||||
};
|
||||
|
||||
// 1. Duration check.
|
||||
let elapsed = now.duration_since(state.started_at);
|
||||
if elapsed > self.policy.max_session_duration {
|
||||
return Err(make_violation(
|
||||
PolicyViolationKind::Expired,
|
||||
format!(
|
||||
"session expired after {:.1}s (max {:.1}s)",
|
||||
elapsed.as_secs_f64(),
|
||||
self.policy.max_session_duration.as_secs_f64()
|
||||
),
|
||||
));
|
||||
}
|
||||
|
||||
// 2. Count cap (0 = unlimited).
|
||||
if self.policy.max_signals_per_session > 0
|
||||
&& state.signals_written.load(Ordering::Relaxed)
|
||||
>= u64::from(self.policy.max_signals_per_session)
|
||||
{
|
||||
return Err(make_violation(
|
||||
PolicyViolationKind::CountCap,
|
||||
format!(
|
||||
"signal count cap reached (max {})",
|
||||
self.policy.max_signals_per_session
|
||||
),
|
||||
));
|
||||
}
|
||||
|
||||
// 3. Deny list.
|
||||
if self
|
||||
.policy
|
||||
.denied_signals
|
||||
.iter()
|
||||
.any(|s| s.as_str() == signal_type)
|
||||
{
|
||||
return Err(make_violation(
|
||||
PolicyViolationKind::Denied,
|
||||
format!(
|
||||
"signal '{signal_type}' is explicitly denied by policy '{}'",
|
||||
self.policy_name
|
||||
),
|
||||
));
|
||||
}
|
||||
|
||||
// 4. Allow list (empty allow list = all allowed).
|
||||
if !self.policy.allowed_signals.is_empty()
|
||||
&& !self
|
||||
.policy
|
||||
.allowed_signals
|
||||
.iter()
|
||||
.any(|s| s.as_str() == signal_type)
|
||||
{
|
||||
return Err(make_violation(
|
||||
PolicyViolationKind::NotAllowed,
|
||||
format!(
|
||||
"signal '{signal_type}' not in allowed_signals for policy '{}'",
|
||||
self.policy_name
|
||||
),
|
||||
));
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
// ── Unit tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use std::collections::HashMap;
|
||||
use std::sync::atomic::{AtomicBool, AtomicU64};
|
||||
use std::sync::{Arc, Mutex};
|
||||
use std::time::Duration;
|
||||
|
||||
use dashmap::DashMap;
|
||||
|
||||
use super::super::audit::AuditLog;
|
||||
use super::super::types::{AgentId, SessionId};
|
||||
|
||||
fn make_state(policy_name: &str) -> SessionState {
|
||||
SessionState {
|
||||
id: SessionId(1),
|
||||
user_id: 100,
|
||||
agent_id: AgentId::new("test-agent").unwrap(),
|
||||
policy_name: policy_name.to_owned(),
|
||||
started_at: Instant::now(),
|
||||
started_at_ns: 0,
|
||||
metadata: HashMap::new(),
|
||||
signals: DashMap::new(),
|
||||
signaled_entities: DashMap::new(),
|
||||
annotations: Mutex::new(Vec::new()),
|
||||
signals_written: AtomicU64::new(0),
|
||||
signals_rejected: AtomicU64::new(0),
|
||||
audit_log: Mutex::new(AuditLog::new()),
|
||||
closed: Arc::new(AtomicBool::new(false)),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_allow_list_accepts_allowed_signal() {
|
||||
let policy = AgentPolicy {
|
||||
allowed_signals: vec!["view".to_string()],
|
||||
denied_signals: vec![],
|
||||
max_session_duration: Duration::from_secs(3600),
|
||||
max_signals_per_session: 0,
|
||||
};
|
||||
let evaluator = PolicyEvaluator::new(&policy, "test_policy");
|
||||
let state = make_state("test_policy");
|
||||
assert!(evaluator.check("view", &state, Instant::now()).is_ok());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_allow_list_rejects_unknown_signal() {
|
||||
let policy = AgentPolicy {
|
||||
allowed_signals: vec!["view".to_string()],
|
||||
denied_signals: vec![],
|
||||
max_session_duration: Duration::from_secs(3600),
|
||||
max_signals_per_session: 0,
|
||||
};
|
||||
let evaluator = PolicyEvaluator::new(&policy, "test_policy");
|
||||
let state = make_state("test_policy");
|
||||
assert!(evaluator.check("like", &state, Instant::now()).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_deny_list_rejects_denied_signal() {
|
||||
let policy = AgentPolicy {
|
||||
allowed_signals: vec![],
|
||||
denied_signals: vec!["block".to_string()],
|
||||
max_session_duration: Duration::from_secs(3600),
|
||||
max_signals_per_session: 0,
|
||||
};
|
||||
let evaluator = PolicyEvaluator::new(&policy, "test_policy");
|
||||
let state = make_state("test_policy");
|
||||
assert!(evaluator.check("block", &state, Instant::now()).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_count_cap_rejects_when_exceeded() {
|
||||
let policy = AgentPolicy {
|
||||
allowed_signals: vec![],
|
||||
denied_signals: vec![],
|
||||
max_session_duration: Duration::from_secs(3600),
|
||||
max_signals_per_session: 2,
|
||||
};
|
||||
let evaluator = PolicyEvaluator::new(&policy, "test_policy");
|
||||
let state = make_state("test_policy");
|
||||
state.signals_written.store(2, Ordering::Relaxed);
|
||||
assert!(evaluator.check("view", &state, Instant::now()).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_expired_session_rejected() {
|
||||
let policy = AgentPolicy {
|
||||
allowed_signals: vec![],
|
||||
denied_signals: vec![],
|
||||
max_session_duration: Duration::from_millis(1),
|
||||
max_signals_per_session: 0,
|
||||
};
|
||||
let evaluator = PolicyEvaluator::new(&policy, "test_policy");
|
||||
let state = SessionState {
|
||||
id: SessionId(1),
|
||||
user_id: 100,
|
||||
agent_id: AgentId::new("test-agent").unwrap(),
|
||||
policy_name: "test_policy".to_owned(),
|
||||
// started_at far in the past:
|
||||
started_at: Instant::now() - Duration::from_secs(10),
|
||||
started_at_ns: 0,
|
||||
metadata: HashMap::new(),
|
||||
signals: DashMap::new(),
|
||||
signaled_entities: DashMap::new(),
|
||||
annotations: Mutex::new(Vec::new()),
|
||||
signals_written: AtomicU64::new(0),
|
||||
signals_rejected: AtomicU64::new(0),
|
||||
audit_log: Mutex::new(AuditLog::new()),
|
||||
closed: Arc::new(AtomicBool::new(false)),
|
||||
};
|
||||
assert!(evaluator.check("view", &state, Instant::now()).is_err());
|
||||
}
|
||||
}
|
||||
517
tidal/src/session/serde.rs
Normal file
517
tidal/src/session/serde.rs
Normal file
@ -0,0 +1,517 @@
|
||||
//! Binary encoding and decoding for session records.
|
||||
//!
|
||||
//! Three record types are persisted:
|
||||
//! - **Snapshot** — full session state at close time (or on checkpoint).
|
||||
//! - **Start record** — compact record written on `start_session`; replaced by
|
||||
//! a snapshot when the session closes.
|
||||
//! - **Audit log** — standalone serialisation used for the separate audit keyspace.
|
||||
|
||||
use super::audit::AuditEntry;
|
||||
use super::signal_state::SignalSnapEntry;
|
||||
use super::snapshot::SessionSnapshot;
|
||||
use super::state::SessionState;
|
||||
use super::types::SessionId;
|
||||
|
||||
/// Format version byte for snapshot serialization.
|
||||
///
|
||||
/// v0x01: original format (annotations as plain Vec<String>)
|
||||
/// v0x02: annotations as Vec<(u64, String)>, `audit_truncated` flag, per-signal stats
|
||||
const SNAPSHOT_VERSION: u8 = 0x02;
|
||||
|
||||
// ── Snapshot ──────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Serialize a `SessionSnapshot` to bytes for storage archival.
|
||||
///
|
||||
/// Format v0x02:
|
||||
/// `[version: u8][session_id: u64][user_id: u64][signals_written: u64]`
|
||||
/// `[signals_rejected: u64][duration_ms: u64][reward_velocity: f64]`
|
||||
/// `[metadata_count: u32][...kv pairs...][annotations_count: u32][...(ts_u64, len, bytes)...]`
|
||||
/// `[entities_count: u32][...u64s...][audit_truncated: u8]`
|
||||
/// `[audit_count: u32][...audit entries...]`
|
||||
/// `[signal_count: u32][...(name_len, name, decay_f64, window_1h_u64)...]`
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
pub fn serialize_snapshot(snap: &SessionSnapshot) -> Vec<u8> {
|
||||
let mut buf = Vec::new();
|
||||
buf.push(SNAPSHOT_VERSION);
|
||||
buf.extend_from_slice(&snap.id.as_u64().to_le_bytes());
|
||||
buf.extend_from_slice(&snap.user_id.to_le_bytes());
|
||||
buf.extend_from_slice(&snap.signals_written.to_le_bytes());
|
||||
buf.extend_from_slice(&snap.signals_rejected.to_le_bytes());
|
||||
buf.extend_from_slice(&snap.duration_ms.to_le_bytes());
|
||||
buf.extend_from_slice(&snap.reward_velocity.to_bits().to_le_bytes());
|
||||
|
||||
// Metadata: count then (key_len, key, val_len, val) pairs.
|
||||
buf.extend_from_slice(&(snap.metadata.len() as u32).to_le_bytes());
|
||||
for (k, v) in &snap.metadata {
|
||||
buf.extend_from_slice(&(k.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(k.as_bytes());
|
||||
buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(v.as_bytes());
|
||||
}
|
||||
|
||||
// Annotations v0x02: count then (timestamp_ns: u64, len: u32, bytes).
|
||||
buf.extend_from_slice(&(snap.annotations.len() as u32).to_le_bytes());
|
||||
for (ts, ann) in &snap.annotations {
|
||||
buf.extend_from_slice(&ts.to_le_bytes());
|
||||
buf.extend_from_slice(&(ann.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(ann.as_bytes());
|
||||
}
|
||||
|
||||
// Signaled entities: count then u64 values.
|
||||
buf.extend_from_slice(&(snap.signaled_entities.len() as u32).to_le_bytes());
|
||||
for &eid in &snap.signaled_entities {
|
||||
buf.extend_from_slice(&eid.to_le_bytes());
|
||||
}
|
||||
|
||||
// Audit truncated flag (v0x02).
|
||||
buf.push(u8::from(snap.audit_truncated));
|
||||
|
||||
// Audit log: count then serialized entries.
|
||||
buf.extend_from_slice(&(snap.audit_log.len() as u32).to_le_bytes());
|
||||
for e in &snap.audit_log {
|
||||
buf.extend_from_slice(&e.timestamp_ns.to_le_bytes());
|
||||
buf.extend_from_slice(&(e.signal_type.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(e.signal_type.as_bytes());
|
||||
buf.push(u8::from(e.accepted));
|
||||
match &e.reason {
|
||||
None => buf.extend_from_slice(&0u32.to_le_bytes()),
|
||||
Some(r) => {
|
||||
buf.extend_from_slice(&(r.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(r.as_bytes());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Per-signal stats (v0x02): count then (name_len, name, decay_score_f64, window_1h_u64).
|
||||
buf.extend_from_slice(&(snap.signals.len() as u32).to_le_bytes());
|
||||
for (name, entry) in &snap.signals {
|
||||
buf.extend_from_slice(&(name.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(name.as_bytes());
|
||||
buf.extend_from_slice(&entry.decay_score.to_bits().to_le_bytes());
|
||||
buf.extend_from_slice(&entry.window_1h.to_le_bytes());
|
||||
}
|
||||
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize a `SessionSnapshot` from bytes (version 0x02).
|
||||
#[must_use]
|
||||
#[allow(clippy::too_many_lines)]
|
||||
pub fn deserialize_snapshot(bytes: &[u8]) -> Option<SessionSnapshot> {
|
||||
let mut pos = 0;
|
||||
|
||||
let read_u32 = |pos: &mut usize| -> Option<u32> {
|
||||
if *pos + 4 > bytes.len() {
|
||||
return None;
|
||||
}
|
||||
let v = u32::from_le_bytes([
|
||||
bytes[*pos],
|
||||
bytes[*pos + 1],
|
||||
bytes[*pos + 2],
|
||||
bytes[*pos + 3],
|
||||
]);
|
||||
*pos += 4;
|
||||
Some(v)
|
||||
};
|
||||
|
||||
let read_u64 = |pos: &mut usize| -> Option<u64> {
|
||||
if *pos + 8 > bytes.len() {
|
||||
return None;
|
||||
}
|
||||
let v = u64::from_le_bytes([
|
||||
bytes[*pos],
|
||||
bytes[*pos + 1],
|
||||
bytes[*pos + 2],
|
||||
bytes[*pos + 3],
|
||||
bytes[*pos + 4],
|
||||
bytes[*pos + 5],
|
||||
bytes[*pos + 6],
|
||||
bytes[*pos + 7],
|
||||
]);
|
||||
*pos += 8;
|
||||
Some(v)
|
||||
};
|
||||
|
||||
// Version byte check — only 0x02 supported.
|
||||
if pos >= bytes.len() || bytes[pos] != SNAPSHOT_VERSION {
|
||||
return None;
|
||||
}
|
||||
pos += 1;
|
||||
|
||||
let session_id = read_u64(&mut pos)?;
|
||||
let user_id = read_u64(&mut pos)?;
|
||||
let signals_written = read_u64(&mut pos)?;
|
||||
let signals_rejected = read_u64(&mut pos)?;
|
||||
let duration_ms = read_u64(&mut pos)?;
|
||||
let reward_velocity = f64::from_bits(read_u64(&mut pos)?);
|
||||
|
||||
let meta_count = read_u32(&mut pos)? as usize;
|
||||
let mut metadata = std::collections::HashMap::with_capacity(meta_count);
|
||||
for _ in 0..meta_count {
|
||||
let key_len = read_u32(&mut pos)? as usize;
|
||||
if pos + key_len > bytes.len() {
|
||||
return None;
|
||||
}
|
||||
let key = String::from_utf8_lossy(&bytes[pos..pos + key_len]).to_string();
|
||||
pos += key_len;
|
||||
let val_len = read_u32(&mut pos)? as usize;
|
||||
if pos + val_len > bytes.len() {
|
||||
return None;
|
||||
}
|
||||
let val = String::from_utf8_lossy(&bytes[pos..pos + val_len]).to_string();
|
||||
pos += val_len;
|
||||
metadata.insert(key, val);
|
||||
}
|
||||
|
||||
// Annotations v0x02: (timestamp_ns: u64, len: u32, bytes).
|
||||
let ann_count = read_u32(&mut pos)? as usize;
|
||||
let mut annotations: Vec<(u64, String)> = Vec::with_capacity(ann_count);
|
||||
for _ in 0..ann_count {
|
||||
let ts = read_u64(&mut pos)?;
|
||||
let len = read_u32(&mut pos)? as usize;
|
||||
if pos + len > bytes.len() {
|
||||
return None;
|
||||
}
|
||||
let ann = String::from_utf8_lossy(&bytes[pos..pos + len]).to_string();
|
||||
pos += len;
|
||||
annotations.push((ts, ann));
|
||||
}
|
||||
|
||||
let ent_count = read_u32(&mut pos)? as usize;
|
||||
let mut signaled_entities = Vec::with_capacity(ent_count);
|
||||
for _ in 0..ent_count {
|
||||
signaled_entities.push(read_u64(&mut pos)?);
|
||||
}
|
||||
|
||||
// Audit truncated flag (v0x02).
|
||||
let audit_truncated = if pos < bytes.len() {
|
||||
let v = bytes[pos] != 0;
|
||||
pos += 1;
|
||||
v
|
||||
} else {
|
||||
false
|
||||
};
|
||||
|
||||
// Audit log.
|
||||
let audit_log = if pos < bytes.len() {
|
||||
let audit_count = read_u32(&mut pos)? as usize;
|
||||
let mut entries = Vec::with_capacity(audit_count);
|
||||
for _ in 0..audit_count {
|
||||
let ts = read_u64(&mut pos)?;
|
||||
let sig_len = read_u32(&mut pos)? as usize;
|
||||
if pos + sig_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let sig_type = String::from_utf8_lossy(&bytes[pos..pos + sig_len]).to_string();
|
||||
pos += sig_len;
|
||||
if pos >= bytes.len() {
|
||||
break;
|
||||
}
|
||||
let accepted = bytes[pos] != 0;
|
||||
pos += 1;
|
||||
let reason_len = read_u32(&mut pos)? as usize;
|
||||
let reason = if reason_len == 0 {
|
||||
None
|
||||
} else {
|
||||
if pos + reason_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let r = String::from_utf8_lossy(&bytes[pos..pos + reason_len]).to_string();
|
||||
pos += reason_len;
|
||||
Some(r)
|
||||
};
|
||||
entries.push(AuditEntry {
|
||||
timestamp_ns: ts,
|
||||
signal_type: sig_type,
|
||||
accepted,
|
||||
reason,
|
||||
});
|
||||
}
|
||||
entries
|
||||
} else {
|
||||
Vec::new()
|
||||
};
|
||||
|
||||
// Per-signal stats (v0x02).
|
||||
let signals: std::collections::HashMap<String, SignalSnapEntry> = if pos < bytes.len() {
|
||||
let sig_count = read_u32(&mut pos)? as usize;
|
||||
let mut map = std::collections::HashMap::with_capacity(sig_count);
|
||||
for _ in 0..sig_count {
|
||||
let name_len = read_u32(&mut pos)? as usize;
|
||||
if pos + name_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let name = String::from_utf8_lossy(&bytes[pos..pos + name_len]).to_string();
|
||||
pos += name_len;
|
||||
let decay_score = f64::from_bits(read_u64(&mut pos)?);
|
||||
let window_1h = read_u64(&mut pos)?;
|
||||
map.insert(
|
||||
name,
|
||||
SignalSnapEntry {
|
||||
decay_score,
|
||||
window_1h,
|
||||
},
|
||||
);
|
||||
}
|
||||
map
|
||||
} else {
|
||||
std::collections::HashMap::new()
|
||||
};
|
||||
|
||||
Some(SessionSnapshot {
|
||||
id: SessionId(session_id),
|
||||
user_id,
|
||||
signals_written,
|
||||
signals_rejected,
|
||||
duration_ms,
|
||||
metadata,
|
||||
annotations,
|
||||
reward_velocity,
|
||||
signaled_entities,
|
||||
audit_log,
|
||||
audit_truncated,
|
||||
signals,
|
||||
})
|
||||
}
|
||||
|
||||
// ── Start record ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// Serialize a compact session start record.
|
||||
///
|
||||
/// Written to storage on `start_session`; deleted when the session is closed
|
||||
/// and replaced by a snapshot record.
|
||||
///
|
||||
/// Format: `[session_id: 8][user_id: 8][started_at_ns: 8][metadata_count: u32][...kv...]`
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
pub fn serialize_start_record(state: &SessionState) -> Vec<u8> {
|
||||
let mut buf = Vec::new();
|
||||
buf.extend_from_slice(&state.id.as_u64().to_le_bytes());
|
||||
buf.extend_from_slice(&state.user_id.to_le_bytes());
|
||||
buf.extend_from_slice(&state.started_at_ns.to_le_bytes());
|
||||
buf.extend_from_slice(&(state.metadata.len() as u32).to_le_bytes());
|
||||
for (k, v) in &state.metadata {
|
||||
buf.extend_from_slice(&(k.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(k.as_bytes());
|
||||
buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(v.as_bytes());
|
||||
}
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize a session start record (reads `session_id`, `user_id`, `started_at_ns`).
|
||||
///
|
||||
/// Returns `None` if the bytes are malformed.
|
||||
#[must_use]
|
||||
pub fn deserialize_start_record(bytes: &[u8]) -> Option<(SessionId, u64, u64)> {
|
||||
if bytes.len() < 24 {
|
||||
return None;
|
||||
}
|
||||
let session_id = u64::from_le_bytes([
|
||||
bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7],
|
||||
]);
|
||||
let user_id = u64::from_le_bytes([
|
||||
bytes[8], bytes[9], bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15],
|
||||
]);
|
||||
let started_at_ns = u64::from_le_bytes([
|
||||
bytes[16], bytes[17], bytes[18], bytes[19], bytes[20], bytes[21], bytes[22], bytes[23],
|
||||
]);
|
||||
Some((SessionId::from_raw(session_id), user_id, started_at_ns))
|
||||
}
|
||||
|
||||
// ── Audit log ─────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Serialize an audit log to bytes for storage archival.
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
pub fn serialize_audit_log(entries: &[AuditEntry]) -> Vec<u8> {
|
||||
let mut buf = Vec::new();
|
||||
buf.extend_from_slice(&(entries.len() as u32).to_le_bytes());
|
||||
for e in entries {
|
||||
buf.extend_from_slice(&e.timestamp_ns.to_le_bytes());
|
||||
buf.extend_from_slice(&(e.signal_type.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(e.signal_type.as_bytes());
|
||||
buf.push(u8::from(e.accepted));
|
||||
match &e.reason {
|
||||
None => buf.extend_from_slice(&0u32.to_le_bytes()),
|
||||
Some(r) => {
|
||||
buf.extend_from_slice(&(r.len() as u32).to_le_bytes());
|
||||
buf.extend_from_slice(r.as_bytes());
|
||||
}
|
||||
}
|
||||
}
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize an audit log from bytes.
|
||||
#[must_use]
|
||||
pub fn deserialize_audit_log(bytes: &[u8]) -> Vec<AuditEntry> {
|
||||
let mut entries = Vec::new();
|
||||
let mut pos = 0;
|
||||
|
||||
if pos + 4 > bytes.len() {
|
||||
return entries;
|
||||
}
|
||||
let count =
|
||||
u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]]) as usize;
|
||||
pos += 4;
|
||||
|
||||
for _ in 0..count {
|
||||
if pos + 8 > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let ts = u64::from_le_bytes([
|
||||
bytes[pos],
|
||||
bytes[pos + 1],
|
||||
bytes[pos + 2],
|
||||
bytes[pos + 3],
|
||||
bytes[pos + 4],
|
||||
bytes[pos + 5],
|
||||
bytes[pos + 6],
|
||||
bytes[pos + 7],
|
||||
]);
|
||||
pos += 8;
|
||||
|
||||
if pos + 4 > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let sig_len =
|
||||
u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
|
||||
as usize;
|
||||
pos += 4;
|
||||
|
||||
if pos + sig_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let sig_type = String::from_utf8_lossy(&bytes[pos..pos + sig_len]).to_string();
|
||||
pos += sig_len;
|
||||
|
||||
if pos >= bytes.len() {
|
||||
break;
|
||||
}
|
||||
let accepted = bytes[pos] != 0;
|
||||
pos += 1;
|
||||
|
||||
if pos + 4 > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let reason_len =
|
||||
u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
|
||||
as usize;
|
||||
pos += 4;
|
||||
|
||||
let reason = if reason_len == 0 {
|
||||
None
|
||||
} else {
|
||||
if pos + reason_len > bytes.len() {
|
||||
break;
|
||||
}
|
||||
let r = String::from_utf8_lossy(&bytes[pos..pos + reason_len]).to_string();
|
||||
pos += reason_len;
|
||||
Some(r)
|
||||
};
|
||||
|
||||
entries.push(AuditEntry {
|
||||
timestamp_ns: ts,
|
||||
signal_type: sig_type,
|
||||
accepted,
|
||||
reason,
|
||||
});
|
||||
}
|
||||
|
||||
entries
|
||||
}
|
||||
|
||||
// ── Unit tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use std::collections::HashMap;
|
||||
|
||||
use super::super::signal_state::SignalSnapEntry;
|
||||
use super::super::types::SessionId;
|
||||
|
||||
#[test]
|
||||
fn snapshot_roundtrip() {
|
||||
let snap = SessionSnapshot {
|
||||
id: SessionId(7),
|
||||
user_id: 42,
|
||||
signals_written: 10,
|
||||
signals_rejected: 2,
|
||||
duration_ms: 1234,
|
||||
metadata: {
|
||||
let mut m = HashMap::new();
|
||||
m.insert("agent".to_string(), "planner".to_string());
|
||||
m
|
||||
},
|
||||
annotations: vec![(1_000_000_u64, "rust programming".to_string())],
|
||||
reward_velocity: 0.75,
|
||||
signaled_entities: vec![1, 2, 3],
|
||||
audit_log: Vec::new(),
|
||||
audit_truncated: false,
|
||||
signals: {
|
||||
let mut s = HashMap::new();
|
||||
s.insert(
|
||||
"reward".to_string(),
|
||||
SignalSnapEntry {
|
||||
decay_score: 0.75,
|
||||
window_1h: 3,
|
||||
},
|
||||
);
|
||||
s
|
||||
},
|
||||
};
|
||||
|
||||
let bytes = serialize_snapshot(&snap);
|
||||
let restored = deserialize_snapshot(&bytes).unwrap();
|
||||
|
||||
assert_eq!(restored.id, snap.id);
|
||||
assert_eq!(restored.user_id, snap.user_id);
|
||||
assert_eq!(restored.signals_written, snap.signals_written);
|
||||
assert_eq!(restored.signals_rejected, snap.signals_rejected);
|
||||
assert_eq!(restored.duration_ms, snap.duration_ms);
|
||||
assert_eq!(restored.annotations, snap.annotations);
|
||||
assert!((restored.reward_velocity - snap.reward_velocity).abs() < 1e-10);
|
||||
assert_eq!(restored.signaled_entities, snap.signaled_entities);
|
||||
assert_eq!(restored.metadata.get("agent").unwrap(), "planner");
|
||||
assert!(!restored.audit_truncated);
|
||||
let reward_snap = restored.signals.get("reward").unwrap();
|
||||
assert!((reward_snap.decay_score - 0.75).abs() < 1e-10);
|
||||
assert_eq!(reward_snap.window_1h, 3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn audit_log_roundtrip() {
|
||||
let entries = vec![
|
||||
AuditEntry {
|
||||
timestamp_ns: 1000,
|
||||
signal_type: "view".to_string(),
|
||||
accepted: true,
|
||||
reason: None,
|
||||
},
|
||||
AuditEntry {
|
||||
timestamp_ns: 2000,
|
||||
signal_type: "block".to_string(),
|
||||
accepted: false,
|
||||
reason: Some("signal denied by policy".to_string()),
|
||||
},
|
||||
];
|
||||
|
||||
let bytes = serialize_audit_log(&entries);
|
||||
let restored = deserialize_audit_log(&bytes);
|
||||
|
||||
assert_eq!(restored.len(), 2);
|
||||
assert_eq!(restored[0].signal_type, "view");
|
||||
assert!(restored[0].accepted);
|
||||
assert!(restored[0].reason.is_none());
|
||||
assert_eq!(restored[1].signal_type, "block");
|
||||
assert!(!restored[1].accepted);
|
||||
assert_eq!(
|
||||
restored[1].reason.as_deref().unwrap(),
|
||||
"signal denied by policy"
|
||||
);
|
||||
}
|
||||
}
|
||||
213
tidal/src/session/signal_state.rs
Normal file
213
tidal/src/session/signal_state.rs
Normal file
@ -0,0 +1,213 @@
|
||||
//! Per-session signal tracking: running decay scores and windowed counts.
|
||||
|
||||
use std::sync::atomic::{AtomicU64, Ordering};
|
||||
|
||||
use crate::signals::BucketedCounter;
|
||||
|
||||
/// Decay lambda for session signals (5-minute half-life).
|
||||
///
|
||||
/// `λ = ln(2) / t½ = ln(2) / 300s`
|
||||
pub const DEFAULT_SESSION_LAMBDA: f64 = std::f64::consts::LN_2 / 300.0;
|
||||
|
||||
// ── SessionHotState ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Per-signal-type running decay state for a session.
|
||||
///
|
||||
/// Mirrors `HotSignalState`'s running-score formula (CAS loop, same decay math)
|
||||
/// but uses a simpler, non-cache-aligned struct — sessions are not on the
|
||||
/// 200-entity hot ranking path.
|
||||
pub struct SessionHotState {
|
||||
/// Exponentially decayed running score, stored as `f64::to_bits()`.
|
||||
score: AtomicU64,
|
||||
/// Timestamp of the last update, nanoseconds since Unix epoch.
|
||||
last_update_ns: AtomicU64,
|
||||
/// Total signals written for this signal type in this session.
|
||||
count: AtomicU64,
|
||||
}
|
||||
|
||||
impl SessionHotState {
|
||||
#[must_use]
|
||||
pub const fn new() -> Self {
|
||||
Self {
|
||||
score: AtomicU64::new(0_f64.to_bits()),
|
||||
last_update_ns: AtomicU64::new(0),
|
||||
count: AtomicU64::new(0),
|
||||
}
|
||||
}
|
||||
|
||||
/// Update the running decay score with a new signal event.
|
||||
///
|
||||
/// Uses the same CAS formula as `HotSignalState`:
|
||||
/// `S(t) = S(prev) * exp(−λ × dt) + weight`.
|
||||
pub fn on_signal(&self, weight: f64, ts_ns: u64, lambda: f64) {
|
||||
let prev_ts = self.last_update_ns.load(Ordering::Acquire);
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
// Nanosecond delta fits in f64 mantissa for practical durations.
|
||||
let dt_secs = if ts_ns > prev_ts {
|
||||
(ts_ns - prev_ts) as f64 / 1_000_000_000.0
|
||||
} else {
|
||||
0.0
|
||||
};
|
||||
let decay_factor = (-lambda * dt_secs).exp();
|
||||
|
||||
// CAS loop: forward-decay old score then add weight.
|
||||
loop {
|
||||
let old_bits = self.score.load(Ordering::Acquire);
|
||||
let old_score = f64::from_bits(old_bits);
|
||||
let new_score = old_score.mul_add(decay_factor, weight);
|
||||
let new_bits = new_score.to_bits();
|
||||
if self
|
||||
.score
|
||||
.compare_exchange(old_bits, new_bits, Ordering::AcqRel, Ordering::Acquire)
|
||||
.is_ok()
|
||||
{
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Advance timestamp only if the new event is strictly later.
|
||||
let _ = self.last_update_ns.compare_exchange(
|
||||
prev_ts,
|
||||
ts_ns,
|
||||
Ordering::Release,
|
||||
Ordering::Relaxed,
|
||||
);
|
||||
|
||||
self.count.fetch_add(1, Ordering::Relaxed);
|
||||
}
|
||||
|
||||
/// Compute the current decayed score at `ts_now_ns`.
|
||||
#[must_use]
|
||||
pub fn current_score(&self, ts_now_ns: u64, lambda: f64) -> f64 {
|
||||
let score_bits = self.score.load(Ordering::Acquire);
|
||||
let score = f64::from_bits(score_bits);
|
||||
let ts = self.last_update_ns.load(Ordering::Acquire);
|
||||
if ts == 0 {
|
||||
return 0.0;
|
||||
}
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
// Nanosecond delta fits in f64 mantissa for practical durations.
|
||||
let dt_secs = if ts_now_ns > ts {
|
||||
(ts_now_ns - ts) as f64 / 1_000_000_000.0
|
||||
} else {
|
||||
0.0
|
||||
};
|
||||
score * (-lambda * dt_secs).exp()
|
||||
}
|
||||
|
||||
/// Frozen score (no further decay applied) — for archived sessions.
|
||||
#[must_use]
|
||||
pub fn frozen_score(&self) -> f64 {
|
||||
f64::from_bits(self.score.load(Ordering::Relaxed))
|
||||
}
|
||||
|
||||
/// Total number of signals received for this signal type.
|
||||
#[must_use]
|
||||
pub fn count(&self) -> u64 {
|
||||
self.count.load(Ordering::Relaxed)
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for SessionHotState {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
// ── SessionSignalState ────────────────────────────────────────────────────────
|
||||
|
||||
/// Combined hot-decay + windowed-count state for a single signal type in a session.
|
||||
///
|
||||
/// Stores the running decay score (via `SessionHotState`) plus bucketed window
|
||||
/// counts (via `BucketedCounter`). The `lambda` is captured at construction time
|
||||
/// from the schema's `DecaySpec` so snapshot building doesn't need schema access.
|
||||
pub struct SessionSignalState {
|
||||
/// Running exponential decay score for this signal type.
|
||||
pub hot: SessionHotState,
|
||||
/// Minute-level bucketed event counts for windowed queries.
|
||||
pub windows: BucketedCounter,
|
||||
/// Signal-type-specific decay rate (λ = ln(2) / `half_life_secs`).
|
||||
pub lambda: f64,
|
||||
}
|
||||
|
||||
impl SessionSignalState {
|
||||
/// Create a new state for a signal type with the given decay rate.
|
||||
///
|
||||
/// `now_ns` initialises the `BucketedCounter` start time.
|
||||
#[must_use]
|
||||
pub fn new(now_ns: u64, lambda: f64) -> Self {
|
||||
Self {
|
||||
hot: SessionHotState::new(),
|
||||
windows: BucketedCounter::with_start_time(now_ns),
|
||||
lambda,
|
||||
}
|
||||
}
|
||||
|
||||
/// Record a new signal event, updating both the hot score and window counts.
|
||||
pub fn on_signal(&self, weight: f64, ts_ns: u64) {
|
||||
self.hot.on_signal(weight, ts_ns, self.lambda);
|
||||
self.windows.increment(ts_ns);
|
||||
}
|
||||
|
||||
/// Compute the current decayed score at `ts_now_ns`.
|
||||
#[must_use]
|
||||
pub fn current_score(&self, ts_now_ns: u64) -> f64 {
|
||||
self.hot.current_score(ts_now_ns, self.lambda)
|
||||
}
|
||||
|
||||
/// Frozen score (no further decay) — for archived sessions.
|
||||
#[must_use]
|
||||
pub fn frozen_score(&self) -> f64 {
|
||||
self.hot.frozen_score()
|
||||
}
|
||||
}
|
||||
|
||||
// ── SignalSnapEntry ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Per-signal-type statistics captured in a `SessionSnapshot`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SignalSnapEntry {
|
||||
/// Decayed score at snapshot time (frozen for closed, current for active).
|
||||
pub decay_score: f64,
|
||||
/// Count of events in the 1-hour window.
|
||||
pub window_1h: u64,
|
||||
}
|
||||
|
||||
// ── Unit tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn session_hot_state_basic() {
|
||||
let state = SessionHotState::new();
|
||||
let now_ns = 1_000_000_000u64; // 1 second
|
||||
// No signals yet.
|
||||
assert_eq!(state.current_score(now_ns, 0.01), 0.0);
|
||||
|
||||
// Write a signal.
|
||||
state.on_signal(1.0, now_ns, 0.01);
|
||||
// Score should be ~1.0 immediately.
|
||||
let score = state.current_score(now_ns, 0.01);
|
||||
assert!((score - 1.0).abs() < 1e-6, "score={score}");
|
||||
assert_eq!(state.count(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn session_hot_state_decay() {
|
||||
let state = SessionHotState::new();
|
||||
// Use a non-zero base time; ts=0 is the sentinel for "no signals ever".
|
||||
let t0 = 1_000_000_000u64; // 1 second in nanoseconds
|
||||
let lambda = std::f64::consts::LN_2 / 300.0; // 5-min half-life
|
||||
state.on_signal(1.0, t0, lambda);
|
||||
|
||||
// After 300s (one half-life), score should be ~0.5.
|
||||
let t1 = t0 + 300_000_000_000u64;
|
||||
let score = state.current_score(t1, lambda);
|
||||
assert!(
|
||||
(score - 0.5).abs() < 1e-4,
|
||||
"expected ~0.5 after half-life, got {score}"
|
||||
);
|
||||
}
|
||||
}
|
||||
257
tidal/src/session/snapshot.rs
Normal file
257
tidal/src/session/snapshot.rs
Normal file
@ -0,0 +1,257 @@
|
||||
//! Session snapshots: full state dumps and the `SessionContext` derived from them.
|
||||
|
||||
use std::collections::{HashMap, HashSet};
|
||||
|
||||
use crate::schema::Window;
|
||||
|
||||
use super::audit::AuditEntry;
|
||||
use super::signal_state::SignalSnapEntry;
|
||||
use super::state::SessionState;
|
||||
use super::types::SessionId;
|
||||
|
||||
// ── SessionSnapshot ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Full state dump of a session (active or archived).
|
||||
///
|
||||
/// Active sessions: scores are decayed to the current wall-clock time.
|
||||
/// Archived sessions: scores are frozen at the moment of `close_session`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SessionSnapshot {
|
||||
pub id: SessionId,
|
||||
pub user_id: u64,
|
||||
pub signals_written: u64,
|
||||
pub signals_rejected: u64,
|
||||
pub duration_ms: u64,
|
||||
pub metadata: HashMap<String, String>,
|
||||
/// Timestamped annotations: `(timestamp_ns, hint_text)`.
|
||||
pub annotations: Vec<(u64, String)>,
|
||||
/// Frozen score of the "reward" signal at session close (or current score if active).
|
||||
pub reward_velocity: f64,
|
||||
/// Entity IDs that received session signals.
|
||||
pub signaled_entities: Vec<u64>,
|
||||
/// Policy audit log (populated on `close_session`; empty for active-session snapshots).
|
||||
///
|
||||
/// For active sessions, use `TidalDb::session_audit()` to retrieve the live log.
|
||||
pub audit_log: Vec<AuditEntry>,
|
||||
/// `true` if the audit log was truncated due to the `MAX_AUDIT_ENTRIES` cap.
|
||||
pub audit_truncated: bool,
|
||||
/// Per-signal-type statistics (decay score + 1-hour window count).
|
||||
pub signals: HashMap<String, SignalSnapEntry>,
|
||||
}
|
||||
|
||||
// ── SessionContext ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Session context for FOR SESSION ranking boost.
|
||||
///
|
||||
/// Extracted from a `SessionSnapshot` by the query executor and passed to
|
||||
/// `ProfileExecutor::score_with_session` to apply session-aware boosts.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SessionContext {
|
||||
/// Keywords extracted from annotations (whitespace-split, lowercased).
|
||||
pub keywords: Vec<String>,
|
||||
/// Velocity/score of the "reward" signal (for velocity-based boost).
|
||||
pub reward_velocity: f64,
|
||||
/// Session metadata.
|
||||
pub metadata: HashMap<String, String>,
|
||||
/// Entity IDs that received session signals (for entity-level boost).
|
||||
pub signaled_entities: HashSet<u64>,
|
||||
}
|
||||
|
||||
impl SessionContext {
|
||||
/// Build a `SessionContext` from a `SessionSnapshot`.
|
||||
///
|
||||
/// Keywords are extracted from annotations by splitting on whitespace and
|
||||
/// lowercasing. Duplicates are removed via a `HashSet` before collecting.
|
||||
#[must_use]
|
||||
pub fn from_snapshot(snapshot: &SessionSnapshot) -> Self {
|
||||
let keywords: Vec<String> = snapshot
|
||||
.annotations
|
||||
.iter()
|
||||
.flat_map(|(_ts, hint)| {
|
||||
hint.split_whitespace()
|
||||
.map(str::to_lowercase)
|
||||
.collect::<Vec<_>>()
|
||||
})
|
||||
.collect::<HashSet<_>>()
|
||||
.into_iter()
|
||||
.collect();
|
||||
|
||||
let signaled_entities: HashSet<u64> = snapshot.signaled_entities.iter().copied().collect();
|
||||
|
||||
Self {
|
||||
keywords,
|
||||
reward_velocity: snapshot.reward_velocity,
|
||||
metadata: snapshot.metadata.clone(),
|
||||
signaled_entities,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ── Snapshot building ─────────────────────────────────────────────────────────
|
||||
|
||||
/// Build a live `SessionSnapshot` from an active `SessionState`.
|
||||
///
|
||||
/// Scores are decayed to `now_ns`. The `audit_log` field is left empty — for
|
||||
/// active sessions the live audit log is accessible via `TidalDb::session_audit()`.
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
pub fn build_snapshot(state: &SessionState, now_ns: u64) -> SessionSnapshot {
|
||||
let reward_velocity = state
|
||||
.signals
|
||||
.get("reward")
|
||||
.map_or(0.0, |ss| ss.current_score(now_ns));
|
||||
|
||||
let annotations = state
|
||||
.annotations
|
||||
.lock()
|
||||
.map(|guard| guard.clone())
|
||||
.unwrap_or_default();
|
||||
|
||||
let signaled_entities: Vec<u64> = state.signaled_entities.iter().map(|e| *e.key()).collect();
|
||||
|
||||
let duration_ms = state.started_at.elapsed().as_millis() as u64;
|
||||
|
||||
// Build per-signal stats: decay score decayed to now_ns.
|
||||
let signals: HashMap<String, SignalSnapEntry> = state
|
||||
.signals
|
||||
.iter()
|
||||
.map(|entry| {
|
||||
let name = entry.key().clone();
|
||||
let ss = entry.value();
|
||||
(
|
||||
name,
|
||||
SignalSnapEntry {
|
||||
decay_score: ss.current_score(now_ns),
|
||||
window_1h: ss.windows.windowed_count(Window::OneHour),
|
||||
},
|
||||
)
|
||||
})
|
||||
.collect();
|
||||
|
||||
let (audit_log_live, audit_truncated_live) = state
|
||||
.audit_log
|
||||
.lock()
|
||||
.map(|g| (Vec::new(), g.truncated))
|
||||
.unwrap_or_default();
|
||||
|
||||
SessionSnapshot {
|
||||
id: state.id,
|
||||
user_id: state.user_id,
|
||||
signals_written: state
|
||||
.signals_written
|
||||
.load(std::sync::atomic::Ordering::Relaxed),
|
||||
signals_rejected: state
|
||||
.signals_rejected
|
||||
.load(std::sync::atomic::Ordering::Relaxed),
|
||||
duration_ms,
|
||||
metadata: state.metadata.clone(),
|
||||
annotations,
|
||||
reward_velocity,
|
||||
signaled_entities,
|
||||
audit_log: audit_log_live,
|
||||
audit_truncated: audit_truncated_live,
|
||||
signals,
|
||||
}
|
||||
}
|
||||
|
||||
/// Build a frozen `SessionSnapshot` at close time (no further decay).
|
||||
///
|
||||
/// Captures the full audit log (with truncation flag) so it remains accessible
|
||||
/// after the session is removed from active state.
|
||||
#[must_use]
|
||||
#[allow(clippy::cast_possible_truncation)]
|
||||
pub fn build_frozen_snapshot(state: &SessionState, duration_ms: u64) -> SessionSnapshot {
|
||||
let reward_velocity = state
|
||||
.signals
|
||||
.get("reward")
|
||||
.map_or(0.0, |ss| ss.frozen_score());
|
||||
|
||||
let annotations = state
|
||||
.annotations
|
||||
.lock()
|
||||
.map(|guard| guard.clone())
|
||||
.unwrap_or_default();
|
||||
|
||||
let (audit_entries, audit_truncated) = state
|
||||
.audit_log
|
||||
.lock()
|
||||
.map(|guard| (guard.entries().to_vec(), guard.truncated))
|
||||
.unwrap_or_default();
|
||||
|
||||
let signaled_entities: Vec<u64> = state.signaled_entities.iter().map(|e| *e.key()).collect();
|
||||
|
||||
// Build per-signal stats: frozen scores (no further decay).
|
||||
let signals: HashMap<String, SignalSnapEntry> = state
|
||||
.signals
|
||||
.iter()
|
||||
.map(|entry| {
|
||||
let name = entry.key().clone();
|
||||
let ss = entry.value();
|
||||
(
|
||||
name,
|
||||
SignalSnapEntry {
|
||||
decay_score: ss.frozen_score(),
|
||||
window_1h: ss.windows.windowed_count(Window::OneHour),
|
||||
},
|
||||
)
|
||||
})
|
||||
.collect();
|
||||
|
||||
SessionSnapshot {
|
||||
id: state.id,
|
||||
user_id: state.user_id,
|
||||
signals_written: state
|
||||
.signals_written
|
||||
.load(std::sync::atomic::Ordering::Relaxed),
|
||||
signals_rejected: state
|
||||
.signals_rejected
|
||||
.load(std::sync::atomic::Ordering::Relaxed),
|
||||
duration_ms,
|
||||
metadata: state.metadata.clone(),
|
||||
annotations,
|
||||
reward_velocity,
|
||||
signaled_entities,
|
||||
audit_log: audit_entries,
|
||||
audit_truncated,
|
||||
signals,
|
||||
}
|
||||
}
|
||||
|
||||
// ── Unit tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use std::collections::HashMap;
|
||||
|
||||
use super::super::types::SessionId;
|
||||
|
||||
#[test]
|
||||
fn session_context_keywords_extracted() {
|
||||
let snap = SessionSnapshot {
|
||||
id: SessionId(1),
|
||||
user_id: 1,
|
||||
signals_written: 0,
|
||||
signals_rejected: 0,
|
||||
duration_ms: 0,
|
||||
metadata: HashMap::new(),
|
||||
annotations: vec![
|
||||
(0u64, "rust programming".to_string()),
|
||||
(1u64, "systems databases".to_string()),
|
||||
],
|
||||
reward_velocity: 0.5,
|
||||
signaled_entities: vec![10, 20],
|
||||
audit_log: Vec::new(),
|
||||
audit_truncated: false,
|
||||
signals: HashMap::new(),
|
||||
};
|
||||
let ctx = SessionContext::from_snapshot(&snap);
|
||||
assert!(ctx.keywords.contains(&"rust".to_string()));
|
||||
assert!(ctx.keywords.contains(&"programming".to_string()));
|
||||
assert!(ctx.keywords.contains(&"systems".to_string()));
|
||||
assert!(ctx.keywords.contains(&"databases".to_string()));
|
||||
assert_eq!(ctx.reward_velocity, 0.5);
|
||||
assert!(ctx.signaled_entities.contains(&10));
|
||||
assert!(ctx.signaled_entities.contains(&20));
|
||||
}
|
||||
}
|
||||
61
tidal/src/session/state.rs
Normal file
61
tidal/src/session/state.rs
Normal file
@ -0,0 +1,61 @@
|
||||
//! Live runtime state structs for active sessions.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::sync::atomic::{AtomicBool, AtomicU64};
|
||||
use std::sync::{Arc, Mutex};
|
||||
use std::time::Instant;
|
||||
|
||||
use dashmap::DashMap;
|
||||
|
||||
use super::audit::AuditLog;
|
||||
use super::signal_state::SessionSignalState;
|
||||
use super::types::{AgentId, SessionId};
|
||||
|
||||
// ── SessionState ──────────────────────────────────────────────────────────────
|
||||
|
||||
/// Runtime state for an active session. Stored in `TidalDb::sessions`.
|
||||
pub struct SessionState {
|
||||
pub id: SessionId,
|
||||
pub user_id: u64,
|
||||
pub agent_id: AgentId,
|
||||
pub policy_name: String,
|
||||
/// Wall-clock instant when the session was started.
|
||||
pub started_at: Instant,
|
||||
/// `started_at` expressed as nanoseconds since Unix epoch (for archival).
|
||||
pub started_at_ns: u64,
|
||||
/// Caller-supplied metadata attached to the session.
|
||||
pub metadata: HashMap<String, String>,
|
||||
/// Per-signal-type combined decay + window state (keyed by signal type name).
|
||||
pub signals: DashMap<String, SessionSignalState>,
|
||||
/// Entity IDs that have received any signal in this session.
|
||||
pub signaled_entities: DashMap<u64, ()>,
|
||||
/// Timestamped free-text annotations (e.g., preference hints).
|
||||
/// Each entry is `(timestamp_ns, hint_text)`. Capped at 100 entries.
|
||||
pub annotations: Mutex<Vec<(u64, String)>>,
|
||||
/// Total signals accepted (for audit and policy count cap).
|
||||
pub signals_written: AtomicU64,
|
||||
/// Total signals rejected by policy.
|
||||
pub signals_rejected: AtomicU64,
|
||||
/// Policy audit log with bounded eviction.
|
||||
pub audit_log: Mutex<AuditLog>,
|
||||
/// `true` once `close_session` has consumed the handle.
|
||||
pub closed: Arc<AtomicBool>,
|
||||
}
|
||||
|
||||
// ── SessionHandle ─────────────────────────────────────────────────────────────
|
||||
|
||||
/// Move-only handle to an active session.
|
||||
///
|
||||
/// Ownership is consumed by `TidalDb::close_session` to prevent use-after-close
|
||||
/// at the type level. The `closed` `Arc<AtomicBool>` provides runtime
|
||||
/// defense-in-depth for any clone held separately.
|
||||
#[derive(Debug)]
|
||||
pub struct SessionHandle {
|
||||
pub id: SessionId,
|
||||
pub user_id: u64,
|
||||
pub agent_id: AgentId,
|
||||
pub policy_name: String,
|
||||
pub started_at: Instant,
|
||||
/// Shared with `SessionState`; set to `true` by `close_session`.
|
||||
pub closed: Arc<AtomicBool>,
|
||||
}
|
||||
137
tidal/src/session/types.rs
Normal file
137
tidal/src/session/types.rs
Normal file
@ -0,0 +1,137 @@
|
||||
//! Identity types and lightweight output DTOs for the session layer.
|
||||
|
||||
// ── SessionId ─────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Unique identifier for a session.
|
||||
///
|
||||
/// Monotonically increasing `u64`, assigned by `TidalDb::start_session`.
|
||||
/// Guaranteed unique within a process lifetime; not guaranteed across restarts.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
|
||||
pub struct SessionId(pub(crate) u64);
|
||||
|
||||
impl SessionId {
|
||||
/// Wrap a raw `u64` as a `SessionId`. Used for deserialization.
|
||||
#[must_use]
|
||||
pub const fn from_raw(v: u64) -> Self {
|
||||
Self(v)
|
||||
}
|
||||
|
||||
/// Return the underlying `u64`.
|
||||
#[must_use]
|
||||
pub const fn as_u64(self) -> u64 {
|
||||
self.0
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Display for SessionId {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
write!(f, "session:{}", self.0)
|
||||
}
|
||||
}
|
||||
|
||||
// ── AgentId ───────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Identifier for an agent that created a session.
|
||||
///
|
||||
/// Must match `[a-z0-9_-]+`, max 64 characters.
|
||||
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
|
||||
pub struct AgentId(pub(crate) String);
|
||||
|
||||
impl AgentId {
|
||||
/// Create an `AgentId`, validating the format.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `Err(message)` if the string is empty, too long, or contains
|
||||
/// characters outside `[a-z0-9_-]`.
|
||||
pub fn new(s: &str) -> Result<Self, String> {
|
||||
if s.is_empty() || s.len() > 64 {
|
||||
return Err(format!("agent_id must be 1–64 chars, got len={}", s.len()));
|
||||
}
|
||||
if !s
|
||||
.bytes()
|
||||
.all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_' || b == b'-')
|
||||
{
|
||||
return Err(format!("agent_id must match [a-z0-9_-], got: '{s}'"));
|
||||
}
|
||||
Ok(Self(s.to_owned()))
|
||||
}
|
||||
|
||||
/// Return the agent ID as a string slice.
|
||||
#[must_use]
|
||||
pub fn as_str(&self) -> &str {
|
||||
&self.0
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Display for AgentId {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
f.write_str(&self.0)
|
||||
}
|
||||
}
|
||||
|
||||
// ── SessionInfo ───────────────────────────────────────────────────────────────
|
||||
|
||||
/// Lightweight info about an active session. Returned by `active_sessions()`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SessionInfo {
|
||||
pub id: SessionId,
|
||||
pub user_id: u64,
|
||||
pub agent_id: String,
|
||||
pub started_at_ns: u64,
|
||||
pub signals_written: u64,
|
||||
}
|
||||
|
||||
// ── SessionSummary ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Summary returned by `close_session()`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SessionSummary {
|
||||
pub id: SessionId,
|
||||
pub duration_ms: u64,
|
||||
pub signals_written: u64,
|
||||
pub rejections: u64,
|
||||
}
|
||||
|
||||
// ── Unit tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn session_id_display() {
|
||||
let id = SessionId(42);
|
||||
assert_eq!(id.to_string(), "session:42");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn session_id_roundtrip() {
|
||||
let id = SessionId(99);
|
||||
assert_eq!(id.as_u64(), 99);
|
||||
assert_eq!(SessionId::from_raw(99), id);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn agent_id_valid() {
|
||||
assert!(AgentId::new("planner").is_ok());
|
||||
assert!(AgentId::new("my-agent").is_ok());
|
||||
assert!(AgentId::new("agent_01").is_ok());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn agent_id_invalid_empty() {
|
||||
assert!(AgentId::new("").is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn agent_id_invalid_uppercase() {
|
||||
assert!(AgentId::new("MyAgent").is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn agent_id_too_long() {
|
||||
let s = "a".repeat(65);
|
||||
assert!(AgentId::new(&s).is_err());
|
||||
}
|
||||
}
|
||||
@ -1,856 +0,0 @@
|
||||
//! Checkpoint and restore for the `SignalLedger`.
|
||||
//!
|
||||
//! # Checkpoint
|
||||
//!
|
||||
//! `SignalLedger::checkpoint()` serializes all in-memory signal state to the
|
||||
//! `StorageEngine` as a single atomic `WriteBatch`. No partial checkpoints are
|
||||
//! possible: either the whole ledger is written or nothing is.
|
||||
//!
|
||||
//! # Restore
|
||||
//!
|
||||
//! `SignalLedger::restore()` scans the storage, filters for `Tag::Sig` keys,
|
||||
//! deserializes each entry, and populates the `DashMap`. Returns the checkpoint
|
||||
//! metadata (for WAL replay) or `None` if no checkpoint exists (first boot).
|
||||
//!
|
||||
//! # Binary format
|
||||
//!
|
||||
//! Each entry serializes as a 983-byte fixed-length record.
|
||||
//! The checkpoint metadata serializes as a 17-byte record at a well-known key.
|
||||
//! All payload values use little-endian byte order; storage keys use big-endian
|
||||
//! (the existing `encode_key` convention). A version byte at offset 0 enables
|
||||
//! future backward-compatible format changes.
|
||||
|
||||
use crate::schema::{EntityId, TidalError};
|
||||
use crate::storage::{StorageEngine, Tag, WriteBatch, encode_key, parse_key};
|
||||
|
||||
use super::SignalTypeId;
|
||||
use super::hot::HotSignalState;
|
||||
use super::ledger::{EntitySignalEntry, SignalLedger};
|
||||
use super::warm::{BucketedCounter, BucketedCounterSnapshot, HOUR_BUCKETS, MINUTE_BUCKETS};
|
||||
|
||||
// ── Constants ─────────────────────────────────────────────────────────────────
|
||||
|
||||
const VERSION: u8 = 0x01;
|
||||
const ENTRY_SIZE: usize = 983;
|
||||
const META_SIZE: usize = 17;
|
||||
const META_SUFFIX: &[u8] = b"meta";
|
||||
|
||||
/// Bit 0 of `flags` field: velocity tracking is enabled for this signal.
|
||||
const FLAG_VELOCITY_ENABLED: u16 = 0x0001;
|
||||
|
||||
// ── CheckpointMeta ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Checkpoint sequence metadata stored alongside the signal state.
|
||||
///
|
||||
/// Used by the WAL replay mechanism to know where to start replaying.
|
||||
/// Events with `wal_sequence > checkpoint.wal_sequence` must be replayed
|
||||
/// after `restore()` to bring the ledger's state fully up to date.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub struct CheckpointMeta {
|
||||
/// Nanosecond timestamp when the checkpoint was taken.
|
||||
pub checkpoint_time_ns: u64,
|
||||
/// WAL sequence number at checkpoint time.
|
||||
pub wal_sequence: u64,
|
||||
}
|
||||
|
||||
// ── Serialization ─────────────────────────────────────────────────────────────
|
||||
|
||||
/// Serialize an `EntitySignalEntry` to a 983-byte buffer.
|
||||
///
|
||||
/// # Binary layout (all payload values little-endian)
|
||||
///
|
||||
/// ```text
|
||||
/// Offset Size Field
|
||||
/// 0 1 version (0x01)
|
||||
/// 1 8 entity_id (u64 LE)
|
||||
/// 9 2 signal_type_id (u16 LE)
|
||||
/// 11 2 flags (u16 LE) — bit 0: velocity_enabled
|
||||
/// 13 8 last_update_ns (u64 LE)
|
||||
/// 21 8 decay_score_0 (f64 bits LE)
|
||||
/// 29 8 decay_score_1 (f64 bits LE)
|
||||
/// 37 8 decay_score_2 (f64 bits LE)
|
||||
/// 45 1 current_minute (u8)
|
||||
/// 46 1 current_hour (u8)
|
||||
/// 47 8 all_time_count (u64 LE)
|
||||
/// 55 8 last_minute_rotation_ns (u64 LE)
|
||||
/// 63 8 last_hour_rotation_ns (u64 LE)
|
||||
/// 71 240 minute_buckets (60 × u32 LE)
|
||||
/// 311 672 hour_buckets (168 × u32 LE)
|
||||
/// Total: 983 bytes
|
||||
/// ```
|
||||
#[must_use]
|
||||
pub fn serialize_entry(
|
||||
entity_id: EntityId,
|
||||
signal_type_id: SignalTypeId,
|
||||
entry: &EntitySignalEntry,
|
||||
) -> Vec<u8> {
|
||||
let mut buf = Vec::with_capacity(ENTRY_SIZE);
|
||||
|
||||
// [0] version
|
||||
buf.push(VERSION);
|
||||
|
||||
// [1..9] entity_id LE
|
||||
buf.extend_from_slice(&entity_id.as_u64().to_le_bytes());
|
||||
|
||||
// [9..11] signal_type_id LE
|
||||
buf.extend_from_slice(&signal_type_id.as_u16().to_le_bytes());
|
||||
|
||||
// [11..13] flags LE — derived from hot-tier immutable fields
|
||||
let flags: u16 = if entry.hot.velocity_enabled() {
|
||||
FLAG_VELOCITY_ENABLED
|
||||
} else {
|
||||
0
|
||||
};
|
||||
buf.extend_from_slice(&flags.to_le_bytes());
|
||||
|
||||
// [13..21] last_update_ns LE
|
||||
buf.extend_from_slice(&entry.hot.last_update_ns().to_le_bytes());
|
||||
|
||||
// [21..45] three decay scores as f64 bits LE
|
||||
for i in 0..3 {
|
||||
buf.extend_from_slice(&entry.hot.stored_score(i).to_bits().to_le_bytes());
|
||||
}
|
||||
|
||||
// Snapshot warm tier (atomic reads of all bucket state)
|
||||
let snap = entry.warm.snapshot();
|
||||
|
||||
// [45] current_minute (u8)
|
||||
buf.push(snap.current_minute);
|
||||
|
||||
// [46] current_hour (u8)
|
||||
buf.push(snap.current_hour);
|
||||
|
||||
// [47..55] all_time_count LE
|
||||
buf.extend_from_slice(&snap.all_time_count.to_le_bytes());
|
||||
|
||||
// [55..63] last_minute_rotation_ns LE
|
||||
buf.extend_from_slice(&snap.last_minute_rotation_ns.to_le_bytes());
|
||||
|
||||
// [63..71] last_hour_rotation_ns LE
|
||||
buf.extend_from_slice(&snap.last_hour_rotation_ns.to_le_bytes());
|
||||
|
||||
// [71..311] minute_buckets (60 × u32 LE = 240 bytes)
|
||||
for &bucket in &snap.minute_buckets {
|
||||
buf.extend_from_slice(&bucket.to_le_bytes());
|
||||
}
|
||||
|
||||
// [311..983] hour_buckets (168 × u32 LE = 672 bytes)
|
||||
for &bucket in &snap.hour_buckets {
|
||||
buf.extend_from_slice(&bucket.to_le_bytes());
|
||||
}
|
||||
|
||||
debug_assert_eq!(buf.len(), ENTRY_SIZE, "serialize_entry produced wrong size");
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize an `EntitySignalEntry` from bytes.
|
||||
///
|
||||
/// Returns `(entity_id, signal_type_id, entry)` on success.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `Err` if:
|
||||
/// - The slice is not exactly `ENTRY_SIZE` (983) bytes
|
||||
/// - The version byte is not `0x01`
|
||||
/// - Any sub-slice conversion fails due to offset math errors
|
||||
pub fn deserialize_entry(
|
||||
bytes: &[u8],
|
||||
) -> Result<(EntityId, SignalTypeId, EntitySignalEntry), String> {
|
||||
if bytes.len() != ENTRY_SIZE {
|
||||
return Err(format!("expected {ENTRY_SIZE} bytes, got {}", bytes.len()));
|
||||
}
|
||||
|
||||
// [0] version check
|
||||
if bytes[0] != VERSION {
|
||||
return Err(format!(
|
||||
"unknown checkpoint version 0x{:02x}, expected 0x{:02x}",
|
||||
bytes[0], VERSION
|
||||
));
|
||||
}
|
||||
|
||||
// [1..9] entity_id LE
|
||||
let entity_id_val = u64::from_le_bytes(
|
||||
bytes[1..9]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at entity_id [1..9]".to_string())?,
|
||||
);
|
||||
let entity_id = EntityId::new(entity_id_val);
|
||||
|
||||
// [9..11] signal_type_id LE
|
||||
let signal_type_id_val = u16::from_le_bytes(
|
||||
bytes[9..11]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at signal_type_id [9..11]".to_string())?,
|
||||
);
|
||||
let signal_type_id = SignalTypeId::new(signal_type_id_val);
|
||||
|
||||
// [11..13] flags LE
|
||||
let flags = u16::from_le_bytes(
|
||||
bytes[11..13]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at flags [11..13]".to_string())?,
|
||||
);
|
||||
let velocity_enabled = (flags & FLAG_VELOCITY_ENABLED) != 0;
|
||||
|
||||
// [13..21] last_update_ns LE
|
||||
let last_update_ns = u64::from_le_bytes(
|
||||
bytes[13..21]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at last_update_ns [13..21]".to_string())?,
|
||||
);
|
||||
|
||||
// [21..45] three decay scores as f64 bits LE
|
||||
let score_0 = f64::from_bits(u64::from_le_bytes(
|
||||
bytes[21..29]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at score_0 [21..29]".to_string())?,
|
||||
));
|
||||
let score_1 = f64::from_bits(u64::from_le_bytes(
|
||||
bytes[29..37]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at score_1 [29..37]".to_string())?,
|
||||
));
|
||||
let score_2 = f64::from_bits(u64::from_le_bytes(
|
||||
bytes[37..45]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at score_2 [37..45]".to_string())?,
|
||||
));
|
||||
|
||||
// [45] current_minute (u8)
|
||||
let current_minute = bytes[45];
|
||||
|
||||
// [46] current_hour (u8)
|
||||
let current_hour = bytes[46];
|
||||
|
||||
// [47..55] all_time_count LE
|
||||
let all_time_count = u64::from_le_bytes(
|
||||
bytes[47..55]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at all_time_count [47..55]".to_string())?,
|
||||
);
|
||||
|
||||
// [55..63] last_minute_rotation_ns LE
|
||||
let last_minute_rotation_ns = u64::from_le_bytes(
|
||||
bytes[55..63]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at last_minute_rotation_ns [55..63]".to_string())?,
|
||||
);
|
||||
|
||||
// [63..71] last_hour_rotation_ns LE
|
||||
let last_hour_rotation_ns = u64::from_le_bytes(
|
||||
bytes[63..71]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at last_hour_rotation_ns [63..71]".to_string())?,
|
||||
);
|
||||
|
||||
// [71..311] minute_buckets (60 × u32 LE)
|
||||
let mut minute_buckets = [0u32; MINUTE_BUCKETS];
|
||||
for (i, bucket) in minute_buckets.iter_mut().enumerate() {
|
||||
let off = 71 + i * 4;
|
||||
*bucket = u32::from_le_bytes(bytes[off..off + 4].try_into().map_err(|_| {
|
||||
format!(
|
||||
"offset math error at minute_bucket[{i}] [{off}..{}]",
|
||||
off + 4
|
||||
)
|
||||
})?);
|
||||
}
|
||||
|
||||
// [311..983] hour_buckets (168 × u32 LE)
|
||||
let mut hour_buckets = [0u32; HOUR_BUCKETS];
|
||||
for (i, bucket) in hour_buckets.iter_mut().enumerate() {
|
||||
let off = 311 + i * 4;
|
||||
*bucket =
|
||||
u32::from_le_bytes(bytes[off..off + 4].try_into().map_err(|_| {
|
||||
format!("offset math error at hour_bucket[{i}] [{off}..{}]", off + 4)
|
||||
})?);
|
||||
}
|
||||
|
||||
// Reconstruct hot tier
|
||||
let hot = HotSignalState::with_flags(entity_id_val, signal_type_id_val, velocity_enabled);
|
||||
hot.restore(last_update_ns, &[score_0, score_1, score_2]);
|
||||
|
||||
// Reconstruct warm tier from snapshot
|
||||
let warm = BucketedCounter::new();
|
||||
warm.restore(&BucketedCounterSnapshot {
|
||||
minute_buckets,
|
||||
hour_buckets,
|
||||
current_minute,
|
||||
current_hour,
|
||||
all_time_count,
|
||||
last_minute_rotation_ns,
|
||||
last_hour_rotation_ns,
|
||||
});
|
||||
|
||||
Ok((entity_id, signal_type_id, EntitySignalEntry { hot, warm }))
|
||||
}
|
||||
|
||||
/// Serialize `CheckpointMeta` to a 17-byte buffer.
|
||||
///
|
||||
/// Format: `[version: 1][checkpoint_time_ns: 8 LE][wal_sequence: 8 LE]`
|
||||
#[must_use]
|
||||
pub fn serialize_meta(meta: &CheckpointMeta) -> Vec<u8> {
|
||||
let mut buf = Vec::with_capacity(META_SIZE);
|
||||
buf.push(VERSION);
|
||||
buf.extend_from_slice(&meta.checkpoint_time_ns.to_le_bytes());
|
||||
buf.extend_from_slice(&meta.wal_sequence.to_le_bytes());
|
||||
debug_assert_eq!(buf.len(), META_SIZE);
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize `CheckpointMeta` from bytes.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `Err` if the slice is not `META_SIZE` bytes, the version byte
|
||||
/// is unknown, or any sub-slice conversion fails.
|
||||
pub fn deserialize_meta(bytes: &[u8]) -> Result<CheckpointMeta, String> {
|
||||
if bytes.len() != META_SIZE {
|
||||
return Err(format!(
|
||||
"expected {META_SIZE} meta bytes, got {}",
|
||||
bytes.len()
|
||||
));
|
||||
}
|
||||
if bytes[0] != VERSION {
|
||||
return Err(format!(
|
||||
"unknown checkpoint meta version 0x{:02x}, expected 0x{:02x}",
|
||||
bytes[0], VERSION
|
||||
));
|
||||
}
|
||||
let checkpoint_time_ns = u64::from_le_bytes(
|
||||
bytes[1..9]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at checkpoint_time_ns [1..9]".to_string())?,
|
||||
);
|
||||
let wal_sequence = u64::from_le_bytes(
|
||||
bytes[9..17]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at wal_sequence [9..17]".to_string())?,
|
||||
);
|
||||
Ok(CheckpointMeta {
|
||||
checkpoint_time_ns,
|
||||
wal_sequence,
|
||||
})
|
||||
}
|
||||
|
||||
// ── SignalLedger impl ─────────────────────────────────────────────────────────
|
||||
|
||||
impl SignalLedger {
|
||||
/// Write all in-memory signal state to the storage engine atomically.
|
||||
///
|
||||
/// Iterates the `DashMap` and serializes each entry into a `WriteBatch`.
|
||||
/// The checkpoint metadata is stored at a well-known key:
|
||||
/// `encode_key(EntityId::new(0), Tag::Sig, b"meta")`.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Storage` if the `WriteBatch` commit or `flush` fails.
|
||||
///
|
||||
/// # Concurrency
|
||||
///
|
||||
/// This method iterates `DashMap` shards without a global lock. Entries
|
||||
/// written concurrently to already-snapshotted shards will be absent from
|
||||
/// the checkpoint. The caller must supply `meta.wal_sequence` equal to the
|
||||
/// WAL tail at checkpoint start; restore must replay from that sequence to
|
||||
/// recover any missing entries.
|
||||
pub fn checkpoint(
|
||||
&self,
|
||||
storage: &dyn StorageEngine,
|
||||
meta: CheckpointMeta,
|
||||
) -> crate::Result<()> {
|
||||
let mut batch = WriteBatch::new();
|
||||
|
||||
// Write checkpoint metadata at the well-known meta key.
|
||||
let meta_key = encode_key(EntityId::new(0), Tag::Sig, META_SUFFIX);
|
||||
batch.put(meta_key, serialize_meta(&meta));
|
||||
|
||||
// Write all entity-signal entries.
|
||||
for entry_ref in self.entries() {
|
||||
let &(entity_id, signal_type_id) = entry_ref.key();
|
||||
let entry = entry_ref.value();
|
||||
// Entry key suffix is the signal_type_id as 2 big-endian bytes,
|
||||
// so it is exactly 2 bytes — never collides with b"meta" (4 bytes).
|
||||
let suffix = signal_type_id.as_u16().to_be_bytes();
|
||||
let key = encode_key(entity_id, Tag::Sig, &suffix);
|
||||
let value = serialize_entry(entity_id, signal_type_id, entry);
|
||||
batch.put(key, value);
|
||||
}
|
||||
|
||||
storage.write_batch(batch)?;
|
||||
storage.flush()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Restore in-memory signal state from the storage engine.
|
||||
///
|
||||
/// Scans all keys, filters for `Tag::Sig` entries (excluding the meta key),
|
||||
/// deserializes each entry, and inserts it into the `DashMap`.
|
||||
///
|
||||
/// Returns `Some(CheckpointMeta)` if a checkpoint exists, or `None` on
|
||||
/// first boot (empty storage).
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Storage` on I/O failure
|
||||
/// - `TidalError::Internal` on deserialization failure (corrupt checkpoint)
|
||||
pub fn restore(&self, storage: &dyn StorageEngine) -> crate::Result<Option<CheckpointMeta>> {
|
||||
// Read checkpoint metadata first.
|
||||
let meta_key = encode_key(EntityId::new(0), Tag::Sig, META_SUFFIX);
|
||||
let meta = match storage.get(&meta_key)? {
|
||||
None => None,
|
||||
Some(meta_bytes) => Some(
|
||||
deserialize_meta(&meta_bytes)
|
||||
.map_err(|e| TidalError::Internal(format!("corrupt checkpoint meta: {e}")))?,
|
||||
),
|
||||
};
|
||||
|
||||
// Scan all keys; keep only Tag::Sig entry keys (suffix length == 2).
|
||||
// TECH DEBT: scan_prefix(&[]) iterates the entire keyspace. This is safe
|
||||
// today (signals are the only key namespace), but must be replaced with a
|
||||
// Tag::Sig-scoped scan (e.g. `scan_tag(Tag::Sig)`) once M1P5 adds entity,
|
||||
// index, and embedding key namespaces to avoid iterating unrelated data.
|
||||
for item in storage.scan_prefix(&[]) {
|
||||
let (key, value) = item?;
|
||||
if let Some((entity_id, Tag::Sig, suffix)) = parse_key(&key) {
|
||||
// Skip the meta key (entity_id=0, suffix=b"meta").
|
||||
if entity_id == EntityId::new(0) && suffix == META_SUFFIX {
|
||||
continue;
|
||||
}
|
||||
let (eid, stid, entry) = deserialize_entry(&value)
|
||||
.map_err(|e| TidalError::Internal(format!("corrupt checkpoint entry: {e}")))?;
|
||||
self.entries.insert((eid, stid), entry);
|
||||
}
|
||||
}
|
||||
|
||||
Ok(meta)
|
||||
}
|
||||
|
||||
/// Return the number of entries currently in the `DashMap`.
|
||||
///
|
||||
/// Used for diagnostics and testing.
|
||||
#[must_use]
|
||||
pub fn entry_count(&self) -> usize {
|
||||
self.entries.len()
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use std::time::Duration;
|
||||
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Timestamp, Window};
|
||||
use crate::signals::ledger::NoopWalWriter;
|
||||
use crate::storage::InMemoryBackend;
|
||||
|
||||
fn test_schema() -> crate::schema::Schema {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::AllTime])
|
||||
.velocity(true)
|
||||
.add();
|
||||
builder.build().expect("valid test schema")
|
||||
}
|
||||
|
||||
// ── Serialization unit tests ───────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn serialize_entry_version_byte() {
|
||||
let entry = EntitySignalEntry {
|
||||
hot: HotSignalState::new(1, 0),
|
||||
warm: BucketedCounter::new(),
|
||||
};
|
||||
let bytes = serialize_entry(EntityId::new(1), SignalTypeId::new(0), &entry);
|
||||
assert_eq!(bytes[0], 0x01, "version byte should be 0x01");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn serialize_entry_correct_length() {
|
||||
let entry = EntitySignalEntry {
|
||||
hot: HotSignalState::new(42, 3),
|
||||
warm: BucketedCounter::new(),
|
||||
};
|
||||
let bytes = serialize_entry(EntityId::new(42), SignalTypeId::new(3), &entry);
|
||||
assert_eq!(bytes.len(), ENTRY_SIZE);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_entry_rejects_wrong_version() {
|
||||
let bytes = vec![0x00u8; ENTRY_SIZE];
|
||||
assert!(deserialize_entry(&bytes).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_entry_rejects_truncated_data() {
|
||||
let result = deserialize_entry(&[0x01, 0x00]);
|
||||
assert!(result.is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_entry_rejects_wrong_length() {
|
||||
let bytes = vec![0x01u8; ENTRY_SIZE - 1];
|
||||
assert!(deserialize_entry(&bytes).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn serialize_deserialize_entry_roundtrip() {
|
||||
let entity_id = EntityId::new(99);
|
||||
let signal_type_id = SignalTypeId::new(2);
|
||||
|
||||
let hot = HotSignalState::with_flags(99, 2, true);
|
||||
hot.restore(1_000_000_000_000, &[3.125, 2.71, 1.41]);
|
||||
|
||||
let warm = BucketedCounter::with_start_time(1_000_000_000_000);
|
||||
warm.increment(1_000_000_000_000);
|
||||
warm.increment(1_000_000_000_001);
|
||||
|
||||
let entry = EntitySignalEntry { hot, warm };
|
||||
let bytes = serialize_entry(entity_id, signal_type_id, &entry);
|
||||
assert_eq!(bytes.len(), ENTRY_SIZE);
|
||||
|
||||
let (eid, stid, restored) = deserialize_entry(&bytes).expect("deserialize ok");
|
||||
assert_eq!(eid, entity_id);
|
||||
assert_eq!(stid, signal_type_id);
|
||||
assert!((restored.hot.stored_score(0) - 3.125).abs() < 1e-15);
|
||||
assert!((restored.hot.stored_score(1) - 2.71).abs() < 1e-15);
|
||||
assert!((restored.hot.stored_score(2) - 1.41).abs() < 1e-15);
|
||||
assert_eq!(restored.hot.last_update_ns(), 1_000_000_000_000);
|
||||
assert!(restored.hot.velocity_enabled());
|
||||
assert_eq!(restored.warm.all_time_count(), 2);
|
||||
}
|
||||
|
||||
// ── Meta serialization tests ───────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn serialize_meta_correct_length() {
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: 123_456,
|
||||
wal_sequence: 78,
|
||||
};
|
||||
let bytes = serialize_meta(&meta);
|
||||
assert_eq!(bytes.len(), META_SIZE);
|
||||
assert_eq!(bytes[0], 0x01);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_meta_roundtrip() {
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: 1_700_000_000_000_000_000,
|
||||
wal_sequence: 42_000,
|
||||
};
|
||||
let bytes = serialize_meta(&meta);
|
||||
let restored = deserialize_meta(&bytes).expect("ok");
|
||||
assert_eq!(restored, meta);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_meta_rejects_wrong_version() {
|
||||
let mut bytes = serialize_meta(&CheckpointMeta {
|
||||
checkpoint_time_ns: 1,
|
||||
wal_sequence: 1,
|
||||
});
|
||||
bytes[0] = 0xFF;
|
||||
assert!(deserialize_meta(&bytes).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_meta_rejects_truncated() {
|
||||
assert!(deserialize_meta(&[0x01, 0x00]).is_err());
|
||||
}
|
||||
|
||||
// ── Checkpoint/restore integration tests ──────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn checkpoint_to_empty_storage() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
|
||||
let now = Timestamp::now();
|
||||
for i in 0..10u64 {
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(i + 1), 1.0, now)
|
||||
.expect("record");
|
||||
}
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: now.as_nanos(),
|
||||
wal_sequence: 100,
|
||||
};
|
||||
ledger.checkpoint(&storage, meta).expect("checkpoint");
|
||||
|
||||
// Expect meta key + 10 entry keys = 11 total keys.
|
||||
let all_keys: Vec<_> = storage
|
||||
.scan_prefix(&[])
|
||||
.collect::<Result<_, _>>()
|
||||
.expect("scan ok");
|
||||
assert_eq!(
|
||||
all_keys.len(),
|
||||
11,
|
||||
"expected 11 keys, got {}",
|
||||
all_keys.len()
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn restore_from_empty_storage() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = ledger.restore(&storage).expect("restore ok");
|
||||
|
||||
assert!(meta.is_none(), "empty storage should return None");
|
||||
assert_eq!(ledger.entry_count(), 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn restore_preserves_decay_scores() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
|
||||
let ts1 = Timestamp::from_nanos(1_000_000_000_000);
|
||||
let ts2 = Timestamp::from_nanos(1_001_000_000_000);
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(42), 5.0, ts1)
|
||||
.expect("record 1");
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(42), 3.0, ts2)
|
||||
.expect("record 2");
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: 1_002_000_000_000,
|
||||
wal_sequence: 50,
|
||||
};
|
||||
ledger.checkpoint(&storage, meta).expect("checkpoint");
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let restored_meta = ledger2
|
||||
.restore(&storage)
|
||||
.expect("restore")
|
||||
.expect("some meta");
|
||||
assert_eq!(restored_meta.wal_sequence, 50);
|
||||
|
||||
let score_orig = ledger
|
||||
.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.expect("ok");
|
||||
let score_rest = ledger2
|
||||
.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.expect("ok");
|
||||
|
||||
assert!(score_orig.is_some());
|
||||
assert!(score_rest.is_some());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn restore_preserves_windowed_counts() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
|
||||
let base_ns = 1_000_000_000_000u64;
|
||||
for i in 0..100u64 {
|
||||
let ts = Timestamp::from_nanos(base_ns + i * 100_000_000);
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(1), 1.0, ts)
|
||||
.expect("record");
|
||||
}
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: base_ns + 10_000_000_000,
|
||||
wal_sequence: 0,
|
||||
};
|
||||
ledger.checkpoint(&storage, meta).expect("checkpoint");
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
ledger2.restore(&storage).expect("restore");
|
||||
|
||||
let count_orig = ledger
|
||||
.read_windowed_count(EntityId::new(1), "view", Window::AllTime)
|
||||
.expect("ok");
|
||||
let count_rest = ledger2
|
||||
.read_windowed_count(EntityId::new(1), "view", Window::AllTime)
|
||||
.expect("ok");
|
||||
assert_eq!(count_orig, count_rest);
|
||||
assert_eq!(count_rest, 100);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn checkpoint_overwrites_previous() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
let storage = InMemoryBackend::new();
|
||||
let ts = Timestamp::now();
|
||||
|
||||
// First checkpoint: 5 entities.
|
||||
for i in 0..5u64 {
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(i + 1), 1.0, ts)
|
||||
.expect("record");
|
||||
}
|
||||
ledger
|
||||
.checkpoint(
|
||||
&storage,
|
||||
CheckpointMeta {
|
||||
checkpoint_time_ns: 1,
|
||||
wal_sequence: 10,
|
||||
},
|
||||
)
|
||||
.expect("checkpoint 1");
|
||||
|
||||
// Add 3 more entities, then second checkpoint: 8 entities total.
|
||||
for i in 5..8u64 {
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(i + 1), 1.0, ts)
|
||||
.expect("record");
|
||||
}
|
||||
ledger
|
||||
.checkpoint(
|
||||
&storage,
|
||||
CheckpointMeta {
|
||||
checkpoint_time_ns: 2,
|
||||
wal_sequence: 20,
|
||||
},
|
||||
)
|
||||
.expect("checkpoint 2");
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let restored_meta = ledger2
|
||||
.restore(&storage)
|
||||
.expect("restore")
|
||||
.expect("some meta");
|
||||
assert_eq!(restored_meta.wal_sequence, 20);
|
||||
assert_eq!(ledger2.entry_count(), 8);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod proptests {
|
||||
use std::time::Duration;
|
||||
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Timestamp, Window};
|
||||
use crate::signals::ledger::NoopWalWriter;
|
||||
use crate::storage::InMemoryBackend;
|
||||
|
||||
fn test_schema() -> crate::schema::Schema {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::AllTime])
|
||||
.velocity(false)
|
||||
.add();
|
||||
builder.build().expect("valid schema")
|
||||
}
|
||||
|
||||
// Meta serialization roundtrip for all u64 combinations.
|
||||
proptest! {
|
||||
#[test]
|
||||
fn serialize_deserialize_meta_roundtrip(
|
||||
checkpoint_time_ns: u64,
|
||||
wal_sequence: u64,
|
||||
) {
|
||||
let meta = CheckpointMeta { checkpoint_time_ns, wal_sequence };
|
||||
let bytes = serialize_meta(&meta);
|
||||
let restored = deserialize_meta(&bytes).unwrap();
|
||||
prop_assert_eq!(restored, meta);
|
||||
}
|
||||
}
|
||||
|
||||
// Entry serialization roundtrip for arbitrary hot-tier state.
|
||||
proptest! {
|
||||
#[test]
|
||||
fn serialize_deserialize_entry_roundtrip(
|
||||
entity_id_val in 1u64..1_000_000,
|
||||
signal_type_id_val in 0u16..64,
|
||||
score_0 in 0.0f64..1e12,
|
||||
score_1 in 0.0f64..1e12,
|
||||
score_2 in 0.0f64..1e12,
|
||||
last_update in 0u64..2_000_000_000_000u64,
|
||||
all_time in 0u64..1_000_000,
|
||||
) {
|
||||
let entity_id = EntityId::new(entity_id_val);
|
||||
let signal_type_id = SignalTypeId::new(signal_type_id_val);
|
||||
|
||||
let hot = HotSignalState::new(entity_id_val, signal_type_id_val);
|
||||
hot.restore(last_update, &[score_0, score_1, score_2]);
|
||||
|
||||
let warm = BucketedCounter::new();
|
||||
warm.increment_by(all_time as u32, 0);
|
||||
|
||||
let entry = EntitySignalEntry { hot, warm };
|
||||
let bytes = serialize_entry(entity_id, signal_type_id, &entry);
|
||||
let (eid, stid, restored) = deserialize_entry(&bytes).unwrap();
|
||||
|
||||
prop_assert_eq!(eid, entity_id);
|
||||
prop_assert_eq!(stid, signal_type_id);
|
||||
prop_assert!((restored.hot.stored_score(0) - score_0).abs() < 1e-15);
|
||||
prop_assert!((restored.hot.stored_score(1) - score_1).abs() < 1e-15);
|
||||
prop_assert!((restored.hot.stored_score(2) - score_2).abs() < 1e-15);
|
||||
prop_assert_eq!(restored.hot.last_update_ns(), last_update);
|
||||
}
|
||||
}
|
||||
|
||||
// Full checkpoint-restore roundtrip.
|
||||
proptest! {
|
||||
#[test]
|
||||
fn checkpoint_restore_roundtrip(
|
||||
entity_count in 1usize..50,
|
||||
signals_per_entity in 1usize..20,
|
||||
) {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
|
||||
let base_ns = 1_000_000_000_000u64;
|
||||
for entity in 0..entity_count as u64 {
|
||||
for i in 0..signals_per_entity {
|
||||
let ts = Timestamp::from_nanos(base_ns + (i as u64) * 1_000_000_000);
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(entity + 1), 1.0, ts)
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta { checkpoint_time_ns: base_ns, wal_sequence: 42 };
|
||||
ledger.checkpoint(&storage, meta).unwrap();
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let restored_meta = ledger2.restore(&storage).unwrap();
|
||||
|
||||
prop_assert_eq!(restored_meta, Some(meta));
|
||||
prop_assert_eq!(ledger2.entry_count(), ledger.entry_count());
|
||||
|
||||
for entity in 0..entity_count as u64 {
|
||||
let eid = EntityId::new(entity + 1);
|
||||
|
||||
let orig_count = ledger
|
||||
.read_windowed_count(eid, "view", Window::AllTime)
|
||||
.unwrap();
|
||||
let rest_count = ledger2
|
||||
.read_windowed_count(eid, "view", Window::AllTime)
|
||||
.unwrap();
|
||||
prop_assert_eq!(orig_count, rest_count, "entity {}: all-time count mismatch", entity);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
369
tidal/src/signals/checkpoint/format.rs
Normal file
369
tidal/src/signals/checkpoint/format.rs
Normal file
@ -0,0 +1,369 @@
|
||||
//! Binary entry serialization for signal checkpoint records.
|
||||
//!
|
||||
//! Each `EntitySignalEntry` serializes to a fixed 983-byte record containing
|
||||
//! the hot-tier decay scores and warm-tier bucketed counters. All payload
|
||||
//! values use little-endian byte order; a version byte at offset 0 enables
|
||||
//! future backward-compatible format changes.
|
||||
|
||||
use crate::schema::EntityId;
|
||||
|
||||
use super::super::SignalTypeId;
|
||||
use super::super::hot::HotSignalState;
|
||||
use super::super::ledger::EntitySignalEntry;
|
||||
use super::super::warm::{BucketedCounter, BucketedCounterSnapshot, HOUR_BUCKETS, MINUTE_BUCKETS};
|
||||
|
||||
// ── Constants ─────────────────────────────────────────────────────────────────
|
||||
|
||||
const VERSION: u8 = 0x01;
|
||||
|
||||
/// Total size of a serialized entry in bytes.
|
||||
pub(super) const ENTRY_SIZE: usize = 983;
|
||||
|
||||
/// Bit 0 of `flags` field: velocity tracking is enabled for this signal.
|
||||
const FLAG_VELOCITY_ENABLED: u16 = 0x0001;
|
||||
|
||||
// ── Serialization ─────────────────────────────────────────────────────────────
|
||||
|
||||
/// Serialize an `EntitySignalEntry` to a 983-byte buffer.
|
||||
///
|
||||
/// # Binary layout (all payload values little-endian)
|
||||
///
|
||||
/// ```text
|
||||
/// Offset Size Field
|
||||
/// 0 1 version (0x01)
|
||||
/// 1 8 entity_id (u64 LE)
|
||||
/// 9 2 signal_type_id (u16 LE)
|
||||
/// 11 2 flags (u16 LE) -- bit 0: velocity_enabled
|
||||
/// 13 8 last_update_ns (u64 LE)
|
||||
/// 21 8 decay_score_0 (f64 bits LE)
|
||||
/// 29 8 decay_score_1 (f64 bits LE)
|
||||
/// 37 8 decay_score_2 (f64 bits LE)
|
||||
/// 45 1 current_minute (u8)
|
||||
/// 46 1 current_hour (u8)
|
||||
/// 47 8 all_time_count (u64 LE)
|
||||
/// 55 8 last_minute_rotation_ns (u64 LE)
|
||||
/// 63 8 last_hour_rotation_ns (u64 LE)
|
||||
/// 71 240 minute_buckets (60 x u32 LE)
|
||||
/// 311 672 hour_buckets (168 x u32 LE)
|
||||
/// Total: 983 bytes
|
||||
/// ```
|
||||
#[must_use]
|
||||
pub fn serialize_entry(
|
||||
entity_id: EntityId,
|
||||
signal_type_id: SignalTypeId,
|
||||
entry: &EntitySignalEntry,
|
||||
) -> Vec<u8> {
|
||||
let mut buf = Vec::with_capacity(ENTRY_SIZE);
|
||||
|
||||
// [0] version
|
||||
buf.push(VERSION);
|
||||
|
||||
// [1..9] entity_id LE
|
||||
buf.extend_from_slice(&entity_id.as_u64().to_le_bytes());
|
||||
|
||||
// [9..11] signal_type_id LE
|
||||
buf.extend_from_slice(&signal_type_id.as_u16().to_le_bytes());
|
||||
|
||||
// [11..13] flags LE -- derived from hot-tier immutable fields
|
||||
let flags: u16 = if entry.hot.velocity_enabled() {
|
||||
FLAG_VELOCITY_ENABLED
|
||||
} else {
|
||||
0
|
||||
};
|
||||
buf.extend_from_slice(&flags.to_le_bytes());
|
||||
|
||||
// [13..21] last_update_ns LE
|
||||
buf.extend_from_slice(&entry.hot.last_update_ns().to_le_bytes());
|
||||
|
||||
// [21..45] three decay scores as f64 bits LE
|
||||
for i in 0..3 {
|
||||
buf.extend_from_slice(&entry.hot.stored_score(i).to_bits().to_le_bytes());
|
||||
}
|
||||
|
||||
// Snapshot warm tier (atomic reads of all bucket state)
|
||||
let snap = entry.warm.snapshot();
|
||||
|
||||
// [45] current_minute (u8)
|
||||
buf.push(snap.current_minute);
|
||||
|
||||
// [46] current_hour (u8)
|
||||
buf.push(snap.current_hour);
|
||||
|
||||
// [47..55] all_time_count LE
|
||||
buf.extend_from_slice(&snap.all_time_count.to_le_bytes());
|
||||
|
||||
// [55..63] last_minute_rotation_ns LE
|
||||
buf.extend_from_slice(&snap.last_minute_rotation_ns.to_le_bytes());
|
||||
|
||||
// [63..71] last_hour_rotation_ns LE
|
||||
buf.extend_from_slice(&snap.last_hour_rotation_ns.to_le_bytes());
|
||||
|
||||
// [71..311] minute_buckets (60 x u32 LE = 240 bytes)
|
||||
for &bucket in &snap.minute_buckets {
|
||||
buf.extend_from_slice(&bucket.to_le_bytes());
|
||||
}
|
||||
|
||||
// [311..983] hour_buckets (168 x u32 LE = 672 bytes)
|
||||
for &bucket in &snap.hour_buckets {
|
||||
buf.extend_from_slice(&bucket.to_le_bytes());
|
||||
}
|
||||
|
||||
debug_assert_eq!(buf.len(), ENTRY_SIZE, "serialize_entry produced wrong size");
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize an `EntitySignalEntry` from bytes.
|
||||
///
|
||||
/// Returns `(entity_id, signal_type_id, entry)` on success.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `Err` if:
|
||||
/// - The slice is not exactly `ENTRY_SIZE` (983) bytes
|
||||
/// - The version byte is not `0x01`
|
||||
/// - Any sub-slice conversion fails due to offset math errors
|
||||
pub fn deserialize_entry(
|
||||
bytes: &[u8],
|
||||
) -> Result<(EntityId, SignalTypeId, EntitySignalEntry), String> {
|
||||
if bytes.len() != ENTRY_SIZE {
|
||||
return Err(format!("expected {ENTRY_SIZE} bytes, got {}", bytes.len()));
|
||||
}
|
||||
|
||||
// [0] version check
|
||||
if bytes[0] != VERSION {
|
||||
return Err(format!(
|
||||
"unknown checkpoint version 0x{:02x}, expected 0x{:02x}",
|
||||
bytes[0], VERSION
|
||||
));
|
||||
}
|
||||
|
||||
// [1..9] entity_id LE
|
||||
let entity_id_val = u64::from_le_bytes(
|
||||
bytes[1..9]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at entity_id [1..9]".to_string())?,
|
||||
);
|
||||
let entity_id = EntityId::new(entity_id_val);
|
||||
|
||||
// [9..11] signal_type_id LE
|
||||
let signal_type_id_val = u16::from_le_bytes(
|
||||
bytes[9..11]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at signal_type_id [9..11]".to_string())?,
|
||||
);
|
||||
let signal_type_id = SignalTypeId::new(signal_type_id_val);
|
||||
|
||||
// [11..13] flags LE
|
||||
let flags = u16::from_le_bytes(
|
||||
bytes[11..13]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at flags [11..13]".to_string())?,
|
||||
);
|
||||
let velocity_enabled = (flags & FLAG_VELOCITY_ENABLED) != 0;
|
||||
|
||||
// [13..21] last_update_ns LE
|
||||
let last_update_ns = u64::from_le_bytes(
|
||||
bytes[13..21]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at last_update_ns [13..21]".to_string())?,
|
||||
);
|
||||
|
||||
// [21..45] three decay scores as f64 bits LE
|
||||
let score_0 = f64::from_bits(u64::from_le_bytes(
|
||||
bytes[21..29]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at score_0 [21..29]".to_string())?,
|
||||
));
|
||||
let score_1 = f64::from_bits(u64::from_le_bytes(
|
||||
bytes[29..37]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at score_1 [29..37]".to_string())?,
|
||||
));
|
||||
let score_2 = f64::from_bits(u64::from_le_bytes(
|
||||
bytes[37..45]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at score_2 [37..45]".to_string())?,
|
||||
));
|
||||
|
||||
// [45] current_minute (u8)
|
||||
let current_minute = bytes[45];
|
||||
|
||||
// [46] current_hour (u8)
|
||||
let current_hour = bytes[46];
|
||||
|
||||
// [47..55] all_time_count LE
|
||||
let all_time_count = u64::from_le_bytes(
|
||||
bytes[47..55]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at all_time_count [47..55]".to_string())?,
|
||||
);
|
||||
|
||||
// [55..63] last_minute_rotation_ns LE
|
||||
let last_minute_rotation_ns = u64::from_le_bytes(
|
||||
bytes[55..63]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at last_minute_rotation_ns [55..63]".to_string())?,
|
||||
);
|
||||
|
||||
// [63..71] last_hour_rotation_ns LE
|
||||
let last_hour_rotation_ns = u64::from_le_bytes(
|
||||
bytes[63..71]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at last_hour_rotation_ns [63..71]".to_string())?,
|
||||
);
|
||||
|
||||
// [71..311] minute_buckets (60 x u32 LE)
|
||||
let mut minute_buckets = [0u32; MINUTE_BUCKETS];
|
||||
for (i, bucket) in minute_buckets.iter_mut().enumerate() {
|
||||
let off = 71 + i * 4;
|
||||
*bucket = u32::from_le_bytes(bytes[off..off + 4].try_into().map_err(|_| {
|
||||
format!(
|
||||
"offset math error at minute_bucket[{i}] [{off}..{}]",
|
||||
off + 4
|
||||
)
|
||||
})?);
|
||||
}
|
||||
|
||||
// [311..983] hour_buckets (168 x u32 LE)
|
||||
let mut hour_buckets = [0u32; HOUR_BUCKETS];
|
||||
for (i, bucket) in hour_buckets.iter_mut().enumerate() {
|
||||
let off = 311 + i * 4;
|
||||
*bucket =
|
||||
u32::from_le_bytes(bytes[off..off + 4].try_into().map_err(|_| {
|
||||
format!("offset math error at hour_bucket[{i}] [{off}..{}]", off + 4)
|
||||
})?);
|
||||
}
|
||||
|
||||
// Reconstruct hot tier
|
||||
let hot = HotSignalState::with_flags(entity_id_val, signal_type_id_val, velocity_enabled);
|
||||
hot.restore(last_update_ns, &[score_0, score_1, score_2]);
|
||||
|
||||
// Reconstruct warm tier from snapshot
|
||||
let warm = BucketedCounter::new();
|
||||
warm.restore(&BucketedCounterSnapshot {
|
||||
minute_buckets,
|
||||
hour_buckets,
|
||||
current_minute,
|
||||
current_hour,
|
||||
all_time_count,
|
||||
last_minute_rotation_ns,
|
||||
last_hour_rotation_ns,
|
||||
});
|
||||
|
||||
Ok((entity_id, signal_type_id, EntitySignalEntry { hot, warm }))
|
||||
}
|
||||
|
||||
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn serialize_entry_version_byte() {
|
||||
let entry = EntitySignalEntry {
|
||||
hot: HotSignalState::new(1, 0),
|
||||
warm: BucketedCounter::new(),
|
||||
};
|
||||
let bytes = serialize_entry(EntityId::new(1), SignalTypeId::new(0), &entry);
|
||||
assert_eq!(bytes[0], 0x01, "version byte should be 0x01");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn serialize_entry_correct_length() {
|
||||
let entry = EntitySignalEntry {
|
||||
hot: HotSignalState::new(42, 3),
|
||||
warm: BucketedCounter::new(),
|
||||
};
|
||||
let bytes = serialize_entry(EntityId::new(42), SignalTypeId::new(3), &entry);
|
||||
assert_eq!(bytes.len(), ENTRY_SIZE);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_entry_rejects_wrong_version() {
|
||||
let bytes = vec![0x00u8; ENTRY_SIZE];
|
||||
assert!(deserialize_entry(&bytes).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_entry_rejects_truncated_data() {
|
||||
let result = deserialize_entry(&[0x01, 0x00]);
|
||||
assert!(result.is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_entry_rejects_wrong_length() {
|
||||
let bytes = vec![0x01u8; ENTRY_SIZE - 1];
|
||||
assert!(deserialize_entry(&bytes).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn serialize_deserialize_entry_roundtrip() {
|
||||
let entity_id = EntityId::new(99);
|
||||
let signal_type_id = SignalTypeId::new(2);
|
||||
|
||||
let hot = HotSignalState::with_flags(99, 2, true);
|
||||
hot.restore(1_000_000_000_000, &[3.125, 2.71, 1.41]);
|
||||
|
||||
let warm = BucketedCounter::with_start_time(1_000_000_000_000);
|
||||
warm.increment(1_000_000_000_000);
|
||||
warm.increment(1_000_000_000_001);
|
||||
|
||||
let entry = EntitySignalEntry { hot, warm };
|
||||
let bytes = serialize_entry(entity_id, signal_type_id, &entry);
|
||||
assert_eq!(bytes.len(), ENTRY_SIZE);
|
||||
|
||||
let (eid, stid, restored) = deserialize_entry(&bytes).expect("deserialize ok");
|
||||
assert_eq!(eid, entity_id);
|
||||
assert_eq!(stid, signal_type_id);
|
||||
assert!((restored.hot.stored_score(0) - 3.125).abs() < 1e-15);
|
||||
assert!((restored.hot.stored_score(1) - 2.71).abs() < 1e-15);
|
||||
assert!((restored.hot.stored_score(2) - 1.41).abs() < 1e-15);
|
||||
assert_eq!(restored.hot.last_update_ns(), 1_000_000_000_000);
|
||||
assert!(restored.hot.velocity_enabled());
|
||||
assert_eq!(restored.warm.all_time_count(), 2);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod proptests {
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::*;
|
||||
|
||||
// Entry serialization roundtrip for arbitrary hot-tier state.
|
||||
proptest! {
|
||||
#[test]
|
||||
fn serialize_deserialize_entry_roundtrip(
|
||||
entity_id_val in 1u64..1_000_000,
|
||||
signal_type_id_val in 0u16..64,
|
||||
score_0 in 0.0f64..1e12,
|
||||
score_1 in 0.0f64..1e12,
|
||||
score_2 in 0.0f64..1e12,
|
||||
last_update in 0u64..2_000_000_000_000u64,
|
||||
all_time in 0u64..1_000_000,
|
||||
) {
|
||||
let entity_id = EntityId::new(entity_id_val);
|
||||
let signal_type_id = SignalTypeId::new(signal_type_id_val);
|
||||
|
||||
let hot = HotSignalState::new(entity_id_val, signal_type_id_val);
|
||||
hot.restore(last_update, &[score_0, score_1, score_2]);
|
||||
|
||||
let warm = BucketedCounter::new();
|
||||
warm.increment_by(all_time as u32, 0);
|
||||
|
||||
let entry = EntitySignalEntry { hot, warm };
|
||||
let bytes = serialize_entry(entity_id, signal_type_id, &entry);
|
||||
let (eid, stid, restored) = deserialize_entry(&bytes).unwrap();
|
||||
|
||||
prop_assert_eq!(eid, entity_id);
|
||||
prop_assert_eq!(stid, signal_type_id);
|
||||
prop_assert!((restored.hot.stored_score(0) - score_0).abs() < 1e-15);
|
||||
prop_assert!((restored.hot.stored_score(1) - score_1).abs() < 1e-15);
|
||||
prop_assert!((restored.hot.stored_score(2) - score_2).abs() < 1e-15);
|
||||
prop_assert_eq!(restored.hot.last_update_ns(), last_update);
|
||||
}
|
||||
}
|
||||
}
|
||||
142
tidal/src/signals/checkpoint/meta.rs
Normal file
142
tidal/src/signals/checkpoint/meta.rs
Normal file
@ -0,0 +1,142 @@
|
||||
//! Checkpoint metadata type and serialization.
|
||||
//!
|
||||
//! `CheckpointMeta` records the WAL sequence position at checkpoint time so
|
||||
//! that restore can replay only the events written after the checkpoint.
|
||||
//! The metadata serializes to a 17-byte fixed-length record.
|
||||
|
||||
// ── Constants ─────────────────────────────────────────────────────────────────
|
||||
|
||||
pub(super) const VERSION: u8 = 0x01;
|
||||
pub(super) const META_SIZE: usize = 17;
|
||||
pub(super) const META_SUFFIX: &[u8] = b"meta";
|
||||
|
||||
// ── CheckpointMeta ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Checkpoint sequence metadata stored alongside the signal state.
|
||||
///
|
||||
/// Used by the WAL replay mechanism to know where to start replaying.
|
||||
/// Events with `wal_sequence > checkpoint.wal_sequence` must be replayed
|
||||
/// after `restore()` to bring the ledger's state fully up to date.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub struct CheckpointMeta {
|
||||
/// Nanosecond timestamp when the checkpoint was taken.
|
||||
pub checkpoint_time_ns: u64,
|
||||
/// WAL sequence number at checkpoint time.
|
||||
pub wal_sequence: u64,
|
||||
}
|
||||
|
||||
// ── Serialization ─────────────────────────────────────────────────────────────
|
||||
|
||||
/// Serialize `CheckpointMeta` to a 17-byte buffer.
|
||||
///
|
||||
/// Format: `[version: 1][checkpoint_time_ns: 8 LE][wal_sequence: 8 LE]`
|
||||
#[must_use]
|
||||
pub fn serialize_meta(meta: &CheckpointMeta) -> Vec<u8> {
|
||||
let mut buf = Vec::with_capacity(META_SIZE);
|
||||
buf.push(VERSION);
|
||||
buf.extend_from_slice(&meta.checkpoint_time_ns.to_le_bytes());
|
||||
buf.extend_from_slice(&meta.wal_sequence.to_le_bytes());
|
||||
debug_assert_eq!(buf.len(), META_SIZE);
|
||||
buf
|
||||
}
|
||||
|
||||
/// Deserialize `CheckpointMeta` from bytes.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `Err` if the slice is not `META_SIZE` bytes, the version byte
|
||||
/// is unknown, or any sub-slice conversion fails.
|
||||
pub fn deserialize_meta(bytes: &[u8]) -> Result<CheckpointMeta, String> {
|
||||
if bytes.len() != META_SIZE {
|
||||
return Err(format!(
|
||||
"expected {META_SIZE} meta bytes, got {}",
|
||||
bytes.len()
|
||||
));
|
||||
}
|
||||
if bytes[0] != VERSION {
|
||||
return Err(format!(
|
||||
"unknown checkpoint meta version 0x{:02x}, expected 0x{:02x}",
|
||||
bytes[0], VERSION
|
||||
));
|
||||
}
|
||||
let checkpoint_time_ns = u64::from_le_bytes(
|
||||
bytes[1..9]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at checkpoint_time_ns [1..9]".to_string())?,
|
||||
);
|
||||
let wal_sequence = u64::from_le_bytes(
|
||||
bytes[9..17]
|
||||
.try_into()
|
||||
.map_err(|_| "offset math error at wal_sequence [9..17]".to_string())?,
|
||||
);
|
||||
Ok(CheckpointMeta {
|
||||
checkpoint_time_ns,
|
||||
wal_sequence,
|
||||
})
|
||||
}
|
||||
|
||||
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn serialize_meta_correct_length() {
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: 123_456,
|
||||
wal_sequence: 78,
|
||||
};
|
||||
let bytes = serialize_meta(&meta);
|
||||
assert_eq!(bytes.len(), META_SIZE);
|
||||
assert_eq!(bytes[0], 0x01);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_meta_roundtrip() {
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: 1_700_000_000_000_000_000,
|
||||
wal_sequence: 42_000,
|
||||
};
|
||||
let bytes = serialize_meta(&meta);
|
||||
let restored = deserialize_meta(&bytes).expect("ok");
|
||||
assert_eq!(restored, meta);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_meta_rejects_wrong_version() {
|
||||
let mut bytes = serialize_meta(&CheckpointMeta {
|
||||
checkpoint_time_ns: 1,
|
||||
wal_sequence: 1,
|
||||
});
|
||||
bytes[0] = 0xFF;
|
||||
assert!(deserialize_meta(&bytes).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deserialize_meta_rejects_truncated() {
|
||||
assert!(deserialize_meta(&[0x01, 0x00]).is_err());
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod proptests {
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::*;
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn serialize_deserialize_meta_roundtrip(
|
||||
checkpoint_time_ns: u64,
|
||||
wal_sequence: u64,
|
||||
) {
|
||||
let meta = CheckpointMeta { checkpoint_time_ns, wal_sequence };
|
||||
let bytes = serialize_meta(&meta);
|
||||
let restored = deserialize_meta(&bytes).unwrap();
|
||||
prop_assert_eq!(restored, meta);
|
||||
}
|
||||
}
|
||||
}
|
||||
401
tidal/src/signals/checkpoint/mod.rs
Normal file
401
tidal/src/signals/checkpoint/mod.rs
Normal file
@ -0,0 +1,401 @@
|
||||
//! Checkpoint and restore for the `SignalLedger`.
|
||||
//!
|
||||
//! # Checkpoint
|
||||
//!
|
||||
//! `SignalLedger::checkpoint()` serializes all in-memory signal state to the
|
||||
//! `StorageEngine` as a single atomic `WriteBatch`. No partial checkpoints are
|
||||
//! possible: either the whole ledger is written or nothing is.
|
||||
//!
|
||||
//! # Restore
|
||||
//!
|
||||
//! `SignalLedger::restore()` scans the storage, filters for `Tag::Sig` keys,
|
||||
//! deserializes each entry, and populates the `DashMap`. Returns the checkpoint
|
||||
//! metadata (for WAL replay) or `None` if no checkpoint exists (first boot).
|
||||
//!
|
||||
//! # Binary format
|
||||
//!
|
||||
//! Each entry serializes as a 983-byte fixed-length record (see [`format`]).
|
||||
//! The checkpoint metadata serializes as a 17-byte record at a well-known key
|
||||
//! (see [`meta`]). All payload values use little-endian byte order; storage
|
||||
//! keys use big-endian (the existing `encode_key` convention). A version byte
|
||||
//! at offset 0 enables future backward-compatible format changes.
|
||||
|
||||
pub mod format;
|
||||
pub mod meta;
|
||||
|
||||
pub use meta::CheckpointMeta;
|
||||
|
||||
use crate::schema::{EntityId, TidalError};
|
||||
use crate::storage::{StorageEngine, Tag, WriteBatch, encode_key, parse_key};
|
||||
|
||||
use format::{deserialize_entry, serialize_entry};
|
||||
use meta::{META_SUFFIX, deserialize_meta, serialize_meta};
|
||||
|
||||
use super::ledger::SignalLedger;
|
||||
|
||||
// ── SignalLedger impl ─────────────────────────────────────────────────────────
|
||||
|
||||
impl SignalLedger {
|
||||
/// Write all in-memory signal state to the storage engine atomically.
|
||||
///
|
||||
/// Iterates the `DashMap` and serializes each entry into a `WriteBatch`.
|
||||
/// The checkpoint metadata is stored at a well-known key:
|
||||
/// `encode_key(EntityId::new(0), Tag::Sig, b"meta")`.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Storage` if the `WriteBatch` commit or `flush` fails.
|
||||
///
|
||||
/// # Concurrency
|
||||
///
|
||||
/// This method iterates `DashMap` shards without a global lock. Entries
|
||||
/// written concurrently to already-snapshotted shards will be absent from
|
||||
/// the checkpoint. The caller must supply `meta.wal_sequence` equal to the
|
||||
/// WAL tail at checkpoint start; restore must replay from that sequence to
|
||||
/// recover any missing entries.
|
||||
pub fn checkpoint(
|
||||
&self,
|
||||
storage: &dyn StorageEngine,
|
||||
meta: CheckpointMeta,
|
||||
) -> crate::Result<()> {
|
||||
let mut batch = WriteBatch::new();
|
||||
|
||||
// Write checkpoint metadata at the well-known meta key.
|
||||
let meta_key = encode_key(EntityId::new(0), Tag::Sig, META_SUFFIX);
|
||||
batch.put(meta_key, serialize_meta(&meta));
|
||||
|
||||
// Write all entity-signal entries.
|
||||
for entry_ref in self.entries() {
|
||||
let &(entity_id, signal_type_id) = entry_ref.key();
|
||||
let entry = entry_ref.value();
|
||||
// Entry key suffix is the signal_type_id as 2 big-endian bytes,
|
||||
// so it is exactly 2 bytes -- never collides with b"meta" (4 bytes).
|
||||
let suffix = signal_type_id.as_u16().to_be_bytes();
|
||||
let key = encode_key(entity_id, Tag::Sig, &suffix);
|
||||
let value = serialize_entry(entity_id, signal_type_id, entry);
|
||||
batch.put(key, value);
|
||||
}
|
||||
|
||||
storage.write_batch(batch)?;
|
||||
storage.flush()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Restore in-memory signal state from the storage engine.
|
||||
///
|
||||
/// Scans all keys, filters for `Tag::Sig` entries (excluding the meta key),
|
||||
/// deserializes each entry, and inserts it into the `DashMap`.
|
||||
///
|
||||
/// Returns `Some(CheckpointMeta)` if a checkpoint exists, or `None` on
|
||||
/// first boot (empty storage).
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// - `TidalError::Storage` on I/O failure
|
||||
/// - `TidalError::Internal` on deserialization failure (corrupt checkpoint)
|
||||
pub fn restore(&self, storage: &dyn StorageEngine) -> crate::Result<Option<CheckpointMeta>> {
|
||||
// Read checkpoint metadata first.
|
||||
let meta_key = encode_key(EntityId::new(0), Tag::Sig, META_SUFFIX);
|
||||
let meta = match storage.get(&meta_key)? {
|
||||
None => None,
|
||||
Some(meta_bytes) => Some(
|
||||
deserialize_meta(&meta_bytes)
|
||||
.map_err(|e| TidalError::Internal(format!("corrupt checkpoint meta: {e}")))?,
|
||||
),
|
||||
};
|
||||
|
||||
// Scan all keys; keep only Tag::Sig entry keys (suffix length == 2).
|
||||
// TECH DEBT: scan_prefix(&[]) iterates the entire keyspace. This is safe
|
||||
// today (signals are the only key namespace), but must be replaced with a
|
||||
// Tag::Sig-scoped scan (e.g. `scan_tag(Tag::Sig)`) once M1P5 adds entity,
|
||||
// index, and embedding key namespaces to avoid iterating unrelated data.
|
||||
for item in storage.scan_prefix(&[]) {
|
||||
let (key, value) = item?;
|
||||
if let Some((entity_id, Tag::Sig, suffix)) = parse_key(&key) {
|
||||
// Skip the meta key (entity_id=0, suffix=b"meta").
|
||||
if entity_id == EntityId::new(0) && suffix == META_SUFFIX {
|
||||
continue;
|
||||
}
|
||||
let (eid, stid, entry) = deserialize_entry(&value)
|
||||
.map_err(|e| TidalError::Internal(format!("corrupt checkpoint entry: {e}")))?;
|
||||
self.entries.insert((eid, stid), entry);
|
||||
}
|
||||
}
|
||||
|
||||
Ok(meta)
|
||||
}
|
||||
|
||||
/// Return the number of entries currently in the `DashMap`.
|
||||
///
|
||||
/// Used for diagnostics and testing.
|
||||
#[must_use]
|
||||
pub fn entry_count(&self) -> usize {
|
||||
self.entries.len()
|
||||
}
|
||||
}
|
||||
|
||||
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
use std::time::Duration;
|
||||
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Timestamp, Window};
|
||||
use crate::signals::ledger::NoopWalWriter;
|
||||
use crate::storage::InMemoryBackend;
|
||||
|
||||
fn test_schema() -> crate::schema::Schema {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::OneHour, Window::AllTime])
|
||||
.velocity(true)
|
||||
.add();
|
||||
builder.build().expect("valid test schema")
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn checkpoint_to_empty_storage() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
|
||||
let now = Timestamp::now();
|
||||
for i in 0..10u64 {
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(i + 1), 1.0, now)
|
||||
.expect("record");
|
||||
}
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: now.as_nanos(),
|
||||
wal_sequence: 100,
|
||||
};
|
||||
ledger.checkpoint(&storage, meta).expect("checkpoint");
|
||||
|
||||
// Expect meta key + 10 entry keys = 11 total keys.
|
||||
let all_keys: Vec<_> = storage
|
||||
.scan_prefix(&[])
|
||||
.collect::<Result<_, _>>()
|
||||
.expect("scan ok");
|
||||
assert_eq!(
|
||||
all_keys.len(),
|
||||
11,
|
||||
"expected 11 keys, got {}",
|
||||
all_keys.len()
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn restore_from_empty_storage() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = ledger.restore(&storage).expect("restore ok");
|
||||
|
||||
assert!(meta.is_none(), "empty storage should return None");
|
||||
assert_eq!(ledger.entry_count(), 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn restore_preserves_decay_scores() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
|
||||
let ts1 = Timestamp::from_nanos(1_000_000_000_000);
|
||||
let ts2 = Timestamp::from_nanos(1_001_000_000_000);
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(42), 5.0, ts1)
|
||||
.expect("record 1");
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(42), 3.0, ts2)
|
||||
.expect("record 2");
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: 1_002_000_000_000,
|
||||
wal_sequence: 50,
|
||||
};
|
||||
ledger.checkpoint(&storage, meta).expect("checkpoint");
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let restored_meta = ledger2
|
||||
.restore(&storage)
|
||||
.expect("restore")
|
||||
.expect("some meta");
|
||||
assert_eq!(restored_meta.wal_sequence, 50);
|
||||
|
||||
let score_orig = ledger
|
||||
.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.expect("ok");
|
||||
let score_rest = ledger2
|
||||
.read_decay_score(EntityId::new(42), "view", 0)
|
||||
.expect("ok");
|
||||
|
||||
assert!(score_orig.is_some());
|
||||
assert!(score_rest.is_some());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn restore_preserves_windowed_counts() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
|
||||
let base_ns = 1_000_000_000_000u64;
|
||||
for i in 0..100u64 {
|
||||
let ts = Timestamp::from_nanos(base_ns + i * 100_000_000);
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(1), 1.0, ts)
|
||||
.expect("record");
|
||||
}
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta {
|
||||
checkpoint_time_ns: base_ns + 10_000_000_000,
|
||||
wal_sequence: 0,
|
||||
};
|
||||
ledger.checkpoint(&storage, meta).expect("checkpoint");
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
ledger2.restore(&storage).expect("restore");
|
||||
|
||||
let count_orig = ledger
|
||||
.read_windowed_count(EntityId::new(1), "view", Window::AllTime)
|
||||
.expect("ok");
|
||||
let count_rest = ledger2
|
||||
.read_windowed_count(EntityId::new(1), "view", Window::AllTime)
|
||||
.expect("ok");
|
||||
assert_eq!(count_orig, count_rest);
|
||||
assert_eq!(count_rest, 100);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn checkpoint_overwrites_previous() {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
let storage = InMemoryBackend::new();
|
||||
let ts = Timestamp::now();
|
||||
|
||||
// First checkpoint: 5 entities.
|
||||
for i in 0..5u64 {
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(i + 1), 1.0, ts)
|
||||
.expect("record");
|
||||
}
|
||||
ledger
|
||||
.checkpoint(
|
||||
&storage,
|
||||
CheckpointMeta {
|
||||
checkpoint_time_ns: 1,
|
||||
wal_sequence: 10,
|
||||
},
|
||||
)
|
||||
.expect("checkpoint 1");
|
||||
|
||||
// Add 3 more entities, then second checkpoint: 8 entities total.
|
||||
for i in 5..8u64 {
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(i + 1), 1.0, ts)
|
||||
.expect("record");
|
||||
}
|
||||
ledger
|
||||
.checkpoint(
|
||||
&storage,
|
||||
CheckpointMeta {
|
||||
checkpoint_time_ns: 2,
|
||||
wal_sequence: 20,
|
||||
},
|
||||
)
|
||||
.expect("checkpoint 2");
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let restored_meta = ledger2
|
||||
.restore(&storage)
|
||||
.expect("restore")
|
||||
.expect("some meta");
|
||||
assert_eq!(restored_meta.wal_sequence, 20);
|
||||
assert_eq!(ledger2.entry_count(), 8);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod proptests {
|
||||
use std::time::Duration;
|
||||
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Timestamp, Window};
|
||||
use crate::signals::ledger::NoopWalWriter;
|
||||
use crate::storage::InMemoryBackend;
|
||||
|
||||
fn test_schema() -> crate::schema::Schema {
|
||||
let mut builder = SchemaBuilder::new();
|
||||
let _ = builder
|
||||
.signal(
|
||||
"view",
|
||||
EntityKind::Item,
|
||||
DecaySpec::Exponential {
|
||||
half_life: Duration::from_secs(7 * 24 * 3600),
|
||||
},
|
||||
)
|
||||
.windows(&[Window::AllTime])
|
||||
.velocity(false)
|
||||
.add();
|
||||
builder.build().expect("valid schema")
|
||||
}
|
||||
|
||||
// Full checkpoint-restore roundtrip.
|
||||
proptest! {
|
||||
#[test]
|
||||
fn checkpoint_restore_roundtrip(
|
||||
entity_count in 1usize..50,
|
||||
signals_per_entity in 1usize..20,
|
||||
) {
|
||||
let schema = test_schema();
|
||||
let ledger = SignalLedger::new(schema.clone(), Box::new(NoopWalWriter));
|
||||
|
||||
let base_ns = 1_000_000_000_000u64;
|
||||
for entity in 0..entity_count as u64 {
|
||||
for i in 0..signals_per_entity {
|
||||
let ts = Timestamp::from_nanos(base_ns + (i as u64) * 1_000_000_000);
|
||||
ledger
|
||||
.record_signal("view", EntityId::new(entity + 1), 1.0, ts)
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
let storage = InMemoryBackend::new();
|
||||
let meta = CheckpointMeta { checkpoint_time_ns: base_ns, wal_sequence: 42 };
|
||||
ledger.checkpoint(&storage, meta).unwrap();
|
||||
|
||||
let ledger2 = SignalLedger::new(schema, Box::new(NoopWalWriter));
|
||||
let restored_meta = ledger2.restore(&storage).unwrap();
|
||||
|
||||
prop_assert_eq!(restored_meta, Some(meta));
|
||||
prop_assert_eq!(ledger2.entry_count(), ledger.entry_count());
|
||||
|
||||
for entity in 0..entity_count as u64 {
|
||||
let eid = EntityId::new(entity + 1);
|
||||
|
||||
let orig_count = ledger
|
||||
.read_windowed_count(eid, "view", Window::AllTime)
|
||||
.unwrap();
|
||||
let rest_count = ledger2
|
||||
.read_windowed_count(eid, "view", Window::AllTime)
|
||||
.unwrap();
|
||||
prop_assert_eq!(orig_count, rest_count, "entity {}: all-time count mismatch", entity);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -1,99 +1,20 @@
|
||||
//! Signal ledger: top-level coordinator for hot-tier decay scores and
|
||||
//! warm-tier bucketed counters across all active entities.
|
||||
//! The signal ledger: coordinates hot and warm tiers for all active entities.
|
||||
//!
|
||||
//! `SignalLedger` is the single entry point for signal state management.
|
||||
//! It owns a `DashMap<(EntityId, SignalTypeId), EntitySignalEntry>` that
|
||||
//! provides concurrent access to per-entity signal state.
|
||||
//!
|
||||
//! # WAL integration
|
||||
//!
|
||||
//! Every `record_signal()` call first appends the event to the WAL via the
|
||||
//! `WalWriter` trait. Only after the WAL confirms durability does the ledger
|
||||
//! update in-memory state. This ensures signals survive crashes.
|
||||
//!
|
||||
//! # Concurrency
|
||||
//!
|
||||
//! Multiple threads can write signals to different entities simultaneously.
|
||||
//! Writes to the same entity contend on the `DashMap` shard lock only for entry
|
||||
//! lookup; the actual state update (CAS on hot tier, atomic increment on warm
|
||||
//! tier) is lock-free once the entry reference is obtained.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::fmt;
|
||||
|
||||
use dashmap::DashMap;
|
||||
|
||||
use crate::schema::{DecayModel, Schema};
|
||||
use crate::schema::{EntityId, SchemaError, TidalError, Timestamp, Window};
|
||||
use crate::schema::{DecayModel, EntityId, Schema, SchemaError, TidalError, Timestamp, Window};
|
||||
|
||||
use super::SignalTypeId;
|
||||
use super::hot::HotSignalState;
|
||||
use super::warm::BucketedCounter;
|
||||
|
||||
// ── WAL boundary ─────────────────────────────────────────────────────────────
|
||||
|
||||
/// Trait boundary for WAL integration.
|
||||
///
|
||||
/// `m1p2` provides the real implementation. `m1p4` tests use `NoopWalWriter`.
|
||||
/// The `SignalLedger` calls `append_signal()` before updating in-memory state,
|
||||
/// ensuring WAL-first durability semantics.
|
||||
pub trait WalWriter: Send + Sync {
|
||||
/// Append a signal event to the WAL.
|
||||
///
|
||||
/// Returns `Ok(())` when the event is durably committed. After this
|
||||
/// returns, in-memory state is updated.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Durability` if the WAL write fails.
|
||||
fn append_signal(
|
||||
&self,
|
||||
signal_type_id: SignalTypeId,
|
||||
entity_id: EntityId,
|
||||
weight: f64,
|
||||
timestamp: Timestamp,
|
||||
) -> crate::Result<()>;
|
||||
}
|
||||
|
||||
/// No-op WAL writer for testing. Always succeeds without writing anything.
|
||||
pub struct NoopWalWriter;
|
||||
|
||||
impl WalWriter for NoopWalWriter {
|
||||
fn append_signal(
|
||||
&self,
|
||||
_signal_type_id: SignalTypeId,
|
||||
_entity_id: EntityId,
|
||||
_weight: f64,
|
||||
_timestamp: Timestamp,
|
||||
) -> crate::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for NoopWalWriter {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.write_str("NoopWalWriter")
|
||||
}
|
||||
}
|
||||
|
||||
// ── Entry ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Combined hot-tier and warm-tier state for one entity-signal pair.
|
||||
pub struct EntitySignalEntry {
|
||||
/// Running exponentially-decayed score (hot tier).
|
||||
pub hot: HotSignalState,
|
||||
/// Bucketed windowed event counters (warm tier).
|
||||
pub warm: BucketedCounter,
|
||||
}
|
||||
|
||||
impl fmt::Debug for EntitySignalEntry {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.debug_struct("EntitySignalEntry")
|
||||
.field("hot", &self.hot)
|
||||
.field("warm", &self.warm)
|
||||
.finish()
|
||||
}
|
||||
}
|
||||
use super::super::SignalTypeId;
|
||||
use super::super::hot::HotSignalState;
|
||||
use super::super::warm::BucketedCounter;
|
||||
use super::types::{EntitySignalEntry, WalWriter};
|
||||
|
||||
// ── Ledger ────────────────────────────────────────────────────────────────────
|
||||
|
||||
@ -108,9 +29,9 @@ pub struct SignalLedger {
|
||||
wal: Box<dyn WalWriter>,
|
||||
/// Schema for signal type lookup and lambda retrieval.
|
||||
schema: Schema,
|
||||
/// Signal name → `SignalTypeId` mapping (built at construction, immutable).
|
||||
/// Signal name -> `SignalTypeId` mapping (built at construction, immutable).
|
||||
signal_name_to_id: HashMap<String, SignalTypeId>,
|
||||
/// `SignalTypeId` → lambda array (cached from schema, immutable after construction).
|
||||
/// `SignalTypeId` -> lambda array (cached from schema, immutable after construction).
|
||||
signal_lambdas: HashMap<SignalTypeId, Vec<f64>>,
|
||||
}
|
||||
|
||||
@ -118,7 +39,7 @@ impl SignalLedger {
|
||||
/// Construct a new ledger from a validated schema and WAL writer.
|
||||
///
|
||||
/// Signal types are enumerated in alphabetical order and assigned
|
||||
/// sequential `SignalTypeId` values (0, 1, 2, …).
|
||||
/// sequential `SignalTypeId` values (0, 1, 2, ...).
|
||||
#[must_use]
|
||||
pub fn new(schema: Schema, wal: Box<dyn WalWriter>) -> Self {
|
||||
let mut signal_list: Vec<&crate::schema::SignalTypeDef> = schema.signals().collect();
|
||||
@ -150,7 +71,7 @@ impl SignalLedger {
|
||||
/// Record a signal event for an entity.
|
||||
///
|
||||
/// Steps:
|
||||
/// 1. Resolve signal type name → `SignalTypeId`
|
||||
/// 1. Resolve signal type name -> `SignalTypeId`
|
||||
/// 2. Append event to WAL (WAL-first)
|
||||
/// 3. Get or create `EntitySignalEntry` in `DashMap`
|
||||
/// 4. Update hot-tier decay score
|
||||
@ -276,7 +197,7 @@ impl SignalLedger {
|
||||
let count = self.read_windowed_count(entity_id, signal_type_name, window)?;
|
||||
let duration_secs = window.duration_secs_f64();
|
||||
if duration_secs.is_infinite() {
|
||||
// AllTime window — velocity is undefined.
|
||||
// AllTime window -- velocity is undefined.
|
||||
return Ok(0.0);
|
||||
}
|
||||
#[allow(clippy::cast_precision_loss)]
|
||||
@ -361,6 +282,7 @@ impl fmt::Debug for SignalLedger {
|
||||
mod tests {
|
||||
use std::time::Duration;
|
||||
|
||||
use super::super::types::NoopWalWriter;
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Window};
|
||||
|
||||
@ -563,6 +485,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn signal_type_id_newtype() {
|
||||
use super::super::super::SignalTypeId;
|
||||
let id = SignalTypeId::new(5);
|
||||
assert_eq!(id.as_u16(), 5);
|
||||
assert_eq!(id.to_string(), "5");
|
||||
@ -578,6 +501,7 @@ mod proptests {
|
||||
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::super::types::NoopWalWriter;
|
||||
use super::*;
|
||||
use crate::schema::{DecaySpec, EntityKind, SchemaBuilder, Window};
|
||||
|
||||
25
tidal/src/signals/ledger/mod.rs
Normal file
25
tidal/src/signals/ledger/mod.rs
Normal file
@ -0,0 +1,25 @@
|
||||
//! Signal ledger: top-level coordinator for hot-tier decay scores and
|
||||
//! warm-tier bucketed counters across all active entities.
|
||||
//!
|
||||
//! `SignalLedger` is the single entry point for signal state management.
|
||||
//! It owns a `DashMap<(EntityId, SignalTypeId), EntitySignalEntry>` that
|
||||
//! provides concurrent access to per-entity signal state.
|
||||
//!
|
||||
//! # WAL integration
|
||||
//!
|
||||
//! Every `record_signal()` call first appends the event to the WAL via the
|
||||
//! `WalWriter` trait. Only after the WAL confirms durability does the ledger
|
||||
//! update in-memory state. This ensures signals survive crashes.
|
||||
//!
|
||||
//! # Concurrency
|
||||
//!
|
||||
//! Multiple threads can write signals to different entities simultaneously.
|
||||
//! Writes to the same entity contend on the `DashMap` shard lock only for entry
|
||||
//! lookup; the actual state update (CAS on hot tier, atomic increment on warm
|
||||
//! tier) is lock-free once the entry reference is obtained.
|
||||
|
||||
pub mod core;
|
||||
pub mod types;
|
||||
|
||||
pub use core::SignalLedger;
|
||||
pub use types::{EntitySignalEntry, NoopWalWriter, WalWriter};
|
||||
78
tidal/src/signals/ledger/types.rs
Normal file
78
tidal/src/signals/ledger/types.rs
Normal file
@ -0,0 +1,78 @@
|
||||
//! Boundary traits and data types for the signal ledger.
|
||||
//!
|
||||
//! Contains the `WalWriter` trait (the boundary between the signals subsystem
|
||||
//! and the WAL), its no-op test implementation, and the `EntitySignalEntry`
|
||||
//! combined hot + warm state container.
|
||||
|
||||
use std::fmt;
|
||||
|
||||
use crate::schema::{EntityId, Timestamp};
|
||||
|
||||
use super::super::SignalTypeId;
|
||||
use super::super::hot::HotSignalState;
|
||||
use super::super::warm::BucketedCounter;
|
||||
|
||||
// ── WAL boundary ─────────────────────────────────────────────────────────────
|
||||
|
||||
/// Trait boundary for WAL integration.
|
||||
///
|
||||
/// `m1p2` provides the real implementation. `m1p4` tests use `NoopWalWriter`.
|
||||
/// The `SignalLedger` calls `append_signal()` before updating in-memory state,
|
||||
/// ensuring WAL-first durability semantics.
|
||||
pub trait WalWriter: Send + Sync {
|
||||
/// Append a signal event to the WAL.
|
||||
///
|
||||
/// Returns `Ok(())` when the event is durably committed. After this
|
||||
/// returns, in-memory state is updated.
|
||||
///
|
||||
/// # Errors
|
||||
///
|
||||
/// Returns `TidalError::Durability` if the WAL write fails.
|
||||
fn append_signal(
|
||||
&self,
|
||||
signal_type_id: SignalTypeId,
|
||||
entity_id: EntityId,
|
||||
weight: f64,
|
||||
timestamp: Timestamp,
|
||||
) -> crate::Result<()>;
|
||||
}
|
||||
|
||||
/// No-op WAL writer for testing. Always succeeds without writing anything.
|
||||
pub struct NoopWalWriter;
|
||||
|
||||
impl WalWriter for NoopWalWriter {
|
||||
fn append_signal(
|
||||
&self,
|
||||
_signal_type_id: SignalTypeId,
|
||||
_entity_id: EntityId,
|
||||
_weight: f64,
|
||||
_timestamp: Timestamp,
|
||||
) -> crate::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for NoopWalWriter {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.write_str("NoopWalWriter")
|
||||
}
|
||||
}
|
||||
|
||||
// ── Entry ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Combined hot-tier and warm-tier state for one entity-signal pair.
|
||||
pub struct EntitySignalEntry {
|
||||
/// Running exponentially-decayed score (hot tier).
|
||||
pub hot: HotSignalState,
|
||||
/// Bucketed windowed event counters (warm tier).
|
||||
pub warm: BucketedCounter,
|
||||
}
|
||||
|
||||
impl fmt::Debug for EntitySignalEntry {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.debug_struct("EntitySignalEntry")
|
||||
.field("hot", &self.hot)
|
||||
.field("warm", &self.warm)
|
||||
.finish()
|
||||
}
|
||||
}
|
||||
@ -1,47 +1,23 @@
|
||||
use std::fmt;
|
||||
|
||||
/// Storage engine error types.
|
||||
///
|
||||
/// Replaces the stub `StorageError { message }` from Phase 1.1.
|
||||
/// All storage backends surface errors through this enum.
|
||||
#[derive(Debug)]
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum StorageError {
|
||||
/// I/O error from the underlying filesystem or storage engine.
|
||||
Io(std::io::Error),
|
||||
#[error("I/O error: {0}")]
|
||||
Io(#[from] std::io::Error),
|
||||
/// Data corruption detected (checksum mismatch, invalid key encoding, etc.).
|
||||
#[error("data corruption: {message}")]
|
||||
Corruption { message: String },
|
||||
/// The storage engine has been closed and cannot service requests.
|
||||
#[error("storage closed")]
|
||||
Closed,
|
||||
/// A batch write conflicted with a concurrent operation.
|
||||
#[error("batch conflict")]
|
||||
BatchConflict,
|
||||
}
|
||||
|
||||
impl fmt::Display for StorageError {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
match self {
|
||||
Self::Io(source) => write!(f, "I/O error: {source}"),
|
||||
Self::Corruption { message } => write!(f, "data corruption: {message}"),
|
||||
Self::Closed => f.write_str("storage closed"),
|
||||
Self::BatchConflict => f.write_str("batch conflict"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::error::Error for StorageError {
|
||||
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
|
||||
match self {
|
||||
Self::Io(source) => Some(source),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<std::io::Error> for StorageError {
|
||||
fn from(e: std::io::Error) -> Self {
|
||||
Self::Io(e)
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::unwrap_used)]
|
||||
mod tests {
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user