# tidalDB

**An embeddable Rust database for the personalized content ranking problem.**

> Pre-release. API is stabilizing. Not yet recommended for production.

---

Every content platform eventually builds the same distributed system from scratch: Elasticsearch for retrieval, Redis for hot signals, Kafka for event ingestion, a feature store for user profiles, a vector database for semantic search, and a ranking service that stitches them together. The seams between those systems are where correctness dies: stale signals, inconsistent ranking, cache invalidation bugs, ETL lag.

The root cause: existing databases treat ranking as an afterthought. They have no native concept of signals that evolve over time, no understanding of user context, no diversity as a query constraint.

**Ranking is not a feature. It is a primitive.**

tidalDB is a single-node, embeddable Rust library built for one question: *given a user and a context, what content should they see, and in what order?* No server, no network protocol, no client SDK. Link it into your process.

---

## What it looks like

```rust
use std::collections::HashMap;
use std::time::Duration;
use tidaldb::{
    TidalDb,
    query::retrieve::Retrieve,
    schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window},
};
// `Search` must also be in scope; its module path is not shown in this README.
// `user_id`, `creator_id`, and `your_model` are placeholders supplied by your application.
// The `?` operators assume an enclosing function that returns a `Result`.

// Declare signals with native decay: no application formulas.
let mut schema = SchemaBuilder::new();
let _ = schema.signal("view", EntityKind::Item, DecaySpec::Exponential {
    half_life: Duration::from_secs(7 * 24 * 3600),
}).windows(&[Window::OneHour, Window::TwentyFourHours, Window::AllTime]).velocity(true).add();
let _ = schema.signal("like", EntityKind::Item, DecaySpec::Exponential {
    half_life: Duration::from_secs(30 * 24 * 3600),
}).windows(&[Window::AllTime]).velocity(false).add();
let schema = schema.build()?;

// Open: ephemeral for tests, persistent for production.
let db = TidalDb::builder().ephemeral().with_schema(schema).open()?;

// Ingest content with metadata.
let mut meta = HashMap::new();
meta.insert("title".to_string(), "Introduction to Jazz Piano".to_string());
meta.insert("category".to_string(), "music".to_string());
db.write_item_with_metadata(EntityId::new(1), &meta)?;

// Write an embedding (you generate it, tidalDB indexes and ranks over it).
db.write_item_embedding(EntityId::new(1), &your_model.embed("Introduction to Jazz Piano"))?;

// Record engagement. The feedback loop closes here; no ETL required.
db.signal("view", EntityId::new(1), 1.0, Timestamp::now())?;
db.signal_with_context("like", EntityId::new(1), 1.0, Timestamp::now(), Some(user_id), Some(creator_id))?;

// Retrieve a ranked feed. Name the profile. tidalDB executes the pipeline.
let results = db.retrieve(&Retrieve::builder().for_user(user_id).profile("for_you").limit(50).build()?)?;

// Search: BM25 + semantic similarity fused via RRF.
let results = db.search(&Search::builder().query("jazz piano tutorial").for_user(user_id).limit(20).build()?)?;

db.close()?;
```
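The search call above fuses a BM25 ranking and a semantic ranking with Reciprocal Rank Fusion (RRF). As a rough illustration of the general technique, not of tidalDB's internals, the standalone sketch below scores each item as the sum of `1 / (k + rank)` over the lists it appears in. The function name `rrf_fuse`, the plain `u64` item ids, and the choice of `k` are assumptions made for this example; `k = 60` is a commonly used constant, and the value tidalDB uses is not stated here.

```rust
use std::collections::HashMap;

/// Fuse ranked lists of item ids (each ordered best-first) with
/// Reciprocal Rank Fusion. An item's fused score is the sum, over every
/// list that contains it, of 1 / (k + rank), where rank starts at 1.
/// Assumes each list contains an id at most once.
pub fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (idx, id) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```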

---

## What it replaces

| System | tidalDB equivalent |
|--------|--------------------|
| Elasticsearch | Tantivy BM25 text index (derived, crash-recoverable) |
| Redis | Lock-free in-memory signal ledger: decay scores, windowed counters |
| Kafka | Write-ahead log: durable, ordered, replayable |
| Feature store | Signal aggregates + user preference vectors (updated at write time) |
| Vector DB | USearch HNSW: embedded, f16 quantized, predicate-filtered ANN |
| Ranking service | 25 named profiles, scored at query time, swappable by name |
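To make the "decay scores" in the signal ledger row concrete, here is a minimal sketch of exponential decay with a half-life, the arithmetic a `DecaySpec::Exponential` declaration implies. It is an illustration under stated assumptions, not tidalDB's ledger: the helper names `decay_weight` and `decayed_score` are invented for this example, and events are modeled as plain `(value, age)` pairs. An event's weight halves every half-life, so a 7-day half-life gives a weight of 0.5 after one week and 0.25 after two.

```rust
use std::time::Duration;

/// Weight of one event of the given age under exponential decay:
/// 0.5 raised to (age / half_life). Assumes `half_life` is non-zero.
pub fn decay_weight(age: Duration, half_life: Duration) -> f64 {
    0.5_f64.powf(age.as_secs_f64() / half_life.as_secs_f64())
}

/// Decayed score of a signal: each event's value multiplied by its
/// decay weight, summed over all events. Events are (value, age) pairs.
pub fn decayed_score(events: &[(f64, Duration)], half_life: Duration) -> f64 {
    events
        .iter()
        .map(|(value, age)| value * decay_weight(*age, half_life))
        .sum()
}
```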

---

## Key capabilities

- **Signals with native decay**: declare `view` with a 7-day half-life and the database applies the decay at query time. No `trending_score_7d` field to maintain.
- **25 built-in ranking profiles**: `trending`, `hot`, `for_you`, `following`, `related`, `hidden_gems`, `top_week`, `shuffle`, `controversial`, and more. Name the profile; the database executes the full pipeline.
- **Hybrid search**: BM25 full-text + ANN semantic similarity, fused via Reciprocal Rank Fusion, personalized by user preference vector.
- **Composable filters**: filter by category, format, duration, language, engagement threshold, location, collection membership, and more, in any combination.
- **Diversity as a query constraint**: `max_per_creator: 2` belongs in the query, not your API layer.
- **Feedback loop in the write path**: a signal write atomically updates the item's ledger, the user's preference vector, and relationship weights. The next ranking query, 100 ms later, reflects it.
- **Cold start handled**: new content gets an exploration budget; new users get sensible defaults. No application logic required.
- **Cohort-scoped trending**: "trending among US users aged 18-24 who engage with jazz" is one query, not a pipeline.
- **Embeddable first**: runs in your process. `Arc<TidalDb>` is `Send + Sync`. No operational overhead.
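The diversity bullet is easier to judge with the mechanics in view. The sketch below shows one simple way a `max_per_creator` cap can be applied while walking a ranked candidate list. It is a hypothetical illustration, not tidalDB's implementation: the `Candidate` struct and `cap_per_creator` function are names invented here. Applying the cap during selection, before the limit is reached, lets lower-ranked items from other creators backfill the page. Trimming in an API layer after the query has already been cut to `limit` would return a short page instead.

```rust
use std::collections::HashMap;

/// A ranked candidate; the caller supplies candidates ordered best-first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Candidate {
    pub item: u64,
    pub creator: u64,
}

/// Walk the ranked list in order, admitting at most `max_per_creator`
/// items from any one creator, and stop once `limit` items are selected.
pub fn cap_per_creator(
    ranked: &[Candidate],
    max_per_creator: usize,
    limit: usize,
) -> Vec<Candidate> {
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut out = Vec::with_capacity(limit.min(ranked.len()));
    for candidate in ranked {
        if out.len() == limit {
            break;
        }
        let count = per_creator.entry(candidate.creator).or_insert(0);
        if *count < max_per_creator {
            *count += 1;
            out.push(*candidate);
        }
    }
    out
}
```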

---

## Getting started

tidalDB is not yet published to crates.io. Add it as a git dependency:

```toml
[dependencies]
tidaldb = { git = "https://github.com/your-org/tidalDB", rev = "..." }
```

Then follow the **[Quickstart](QUICKSTART.md)** to get a working ranked feed in 10 minutes, or run the included example:

```bash
cargo run --manifest-path tidal/Cargo.toml --example quickstart
```

**MSRV:** Rust 1.91

---

## Documentation

| Document | Contents |
|----------|----------|
| [QUICKSTART.md](QUICKSTART.md) | Step-by-step guide: schema, ingest, signals, ranking, search |
| [API.md](API.md) | Full API reference with code examples |
| [VISION.md](VISION.md) | Problem statement and design thesis |
| [ARCHITECTURE.md](ARCHITECTURE.md) | Storage, signal system, vector index, query pipeline |
| [USE_CASES.md](USE_CASES.md) | 14 content discovery surfaces, filter and sort references |

---

## Status

Milestones completed:

- Storage engine, WAL, entity store, signal ledger
- RETRIEVE query: candidate retrieval, filtering, scoring, diversity, pagination
- Vector index (USearch HNSW) with adaptive filtered search
- 25 built-in ranking profiles
- BM25 full-text search (Tantivy) + hybrid RRF fusion
- Creator search and creator profiles
- Cohort-scoped signal aggregation and trending
- Social graph (follows, blocks, following feed)
- Collections, saved searches, autocomplete suggestions
- Session and agent context (short-lived signals, preference decay)
- Crash recovery, graceful degradation, rate limiting, diagnostics
- Scale: tested to 1M items; scale benchmarks passing

The API surface is stable for the implemented features. Breaking changes are possible before 1.0.

---

## License

MIT