# tidalDB

**An embeddable Rust database for the personalized content ranking problem.**

> Pre-release. API is stabilizing. Not yet recommended for production.

---

Every content platform eventually builds the same distributed system from scratch: Elasticsearch for retrieval, Redis for hot signals, Kafka for event ingestion, a feature store for user profiles, a vector database for semantic search, and a ranking service that stitches them together. The seams between those systems are where correctness dies — stale signals, inconsistent ranking, cache invalidation bugs, ETL lag.

The root cause: existing databases treat ranking as an afterthought. They have no native concept of signals that evolve over time, no understanding of user context, no diversity as a query constraint.

**Ranking is not a feature. It is a primitive.**

tidalDB is a single-node, embeddable Rust library built for one question: *given a user and a context, what content should they see, and in what order?* No server, no network protocol, no client SDK. Link it into your process.

---

## What it looks like

```rust
use std::collections::HashMap;
use std::time::Duration;
use tidaldb::{TidalDb, query::retrieve::Retrieve, schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window}};

// NOTE: `Search`, `user_id`, `creator_id`, and `your_model` are assumed to be in
// scope; this snippet does not define or import them.

// Declare signals with native decay — no application formulas.
let mut schema = SchemaBuilder::new();
let _ = schema.signal("view", EntityKind::Item, DecaySpec::Exponential {
    half_life: Duration::from_secs(7 * 24 * 3600),
}).windows(&[Window::OneHour, Window::TwentyFourHours, Window::AllTime]).velocity(true).add();
let _ = schema.signal("like", EntityKind::Item, DecaySpec::Exponential {
    half_life: Duration::from_secs(30 * 24 * 3600),
}).windows(&[Window::AllTime]).velocity(false).add();
let schema = schema.build()?;

// Open — ephemeral for tests, persistent for production.
let db = TidalDb::builder().ephemeral().with_schema(schema).open()?;

// Ingest content with metadata.
let mut meta = HashMap::new();
meta.insert("title".to_string(), "Introduction to Jazz Piano".to_string());
meta.insert("category".to_string(), "music".to_string());
db.write_item_with_metadata(EntityId::new(1), &meta)?;

// Write an embedding (you generate it, tidalDB indexes and ranks over it).
db.write_item_embedding(EntityId::new(1), &your_model.embed("Introduction to Jazz Piano"))?;

// Record engagement — the feedback loop closes here, no ETL required.
db.signal("view", EntityId::new(1), 1.0, Timestamp::now())?;
db.signal_with_context("like", EntityId::new(1), 1.0, Timestamp::now(), Some(user_id), Some(creator_id))?;

// Retrieve a ranked feed. Name the profile. tidalDB executes the pipeline.
let results = db.retrieve(&Retrieve::builder().for_user(user_id).profile("for_you").limit(50).build()?)?;

// Search: BM25 + semantic similarity fused via RRF.
let results = db.search(&Search::builder().query("jazz piano tutorial").for_user(user_id).limit(20).build()?)?;

db.close()?;
```
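
The search call above fuses BM25 and vector results with Reciprocal Rank Fusion (RRF). For readers who have not met the technique, here is a minimal standalone sketch of it. This is not tidalDB's implementation: the function name `rrf_fuse`, the `u64` item ids, and the constant `k` (60 is a common choice in the RRF literature) are all assumptions made for illustration. Each ranked list contributes `1 / (k + rank)` per item, and items are ordered by the summed score.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of item ids with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank) per item, with rank starting at 1.
/// (Hypothetical helper for illustration; not part of the tidalDB API.)
fn rrf_fuse(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, &id) in ranking.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + (idx as f64 + 1.0));
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```
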

---

## What it replaces

| System | tidalDB equivalent |
|--------|--------------------|
| Elasticsearch | Tantivy BM25 text index (derived, crash-recoverable) |
| Redis | Lock-free in-memory signal ledger — decay scores, windowed counters |
| Kafka | Write-ahead log — durable, ordered, replayable |
| Feature store | Signal aggregates + user preference vectors (updated at write time) |
| Vector DB | USearch HNSW — embedded, f16 quantized, predicate-filtered ANN |
| Ranking service | 25 named profiles, scored at query time, swappable by name |
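
To make "decay scores" concrete, the sketch below shows the standard half-life formula: an event that is `age` old counts for `weight * 0.5^(age / half_life)`. It is a rough illustration under stated assumptions, not tidalDB's ledger code; the names `decay_factor` and `decayed_score` are hypothetical, and the real engine may compute or cache this differently.

```rust
use std::time::Duration;

/// Fraction of an event's weight that remains after `age`, given a half-life.
/// After one half-life the factor is 0.5, after two it is 0.25, and so on.
/// (Hypothetical helper for illustration; not part of the tidalDB API.)
fn decay_factor(age: Duration, half_life: Duration) -> f64 {
    0.5_f64.powf(age.as_secs_f64() / half_life.as_secs_f64())
}

/// Sum of decayed weights over a set of `(weight, age)` events.
fn decayed_score(events: &[(f64, Duration)], half_life: Duration) -> f64 {
    events
        .iter()
        .map(|&(weight, age)| weight * decay_factor(age, half_life))
        .sum()
}
```
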

---

## Key capabilities

- **Signals with native decay** — declare `view` with a 7-day half-life; the database applies it at query time. No `trending_score_7d` field to maintain.
- **25 built-in ranking profiles** — `trending`, `hot`, `for_you`, `following`, `related`, `hidden_gems`, `top_week`, `shuffle`, `controversial`, and more. Name the profile; the database executes the full pipeline.
- **Hybrid search** — BM25 full-text + ANN semantic similarity, fused via Reciprocal Rank Fusion, personalized by user preference vector.
- **Composable filters** — filter by category, format, duration, language, engagement threshold, location, collection membership, and more — any combination, all composable.
- **Diversity as a query constraint** — `max_per_creator: 2` belongs in the query, not your API layer.
- **Feedback loop in the write path** — a signal write atomically updates the item's ledger, the user's preference vector, and relationship weights. The next ranking query — 100 ms later — reflects it.
- **Cold start handled** — new content gets an exploration budget; new users get sensible defaults. No application logic required.
- **Cohort-scoped trending** — "trending among US users aged 18-24 who engage with jazz" is one query, not a pipeline.
- **Embeddable first** — runs in your process. `Arc<TidalDb>` is `Send + Sync`. No operational overhead.
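
As an illustration of what a `max_per_creator` constraint means, here is a minimal sketch of a per-creator cap applied to already-scored candidates. It is one plausible greedy approach, not a description of how tidalDB enforces diversity internally; the `Candidate` struct and the `diversify` function are hypothetical names.

```rust
use std::collections::HashMap;

/// A scored candidate; `creator` is the id of the item's creator.
/// (Hypothetical type for illustration; not part of the tidalDB API.)
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    item: u64,
    creator: u64,
    score: f64,
}

/// Walk candidates in descending score order and keep at most
/// `max_per_creator` items per creator, stopping at `limit` results.
fn diversify(mut candidates: Vec<Candidate>, max_per_creator: usize, limit: usize) -> Vec<Candidate> {
    candidates.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut out = Vec::with_capacity(limit.min(candidates.len()));
    for candidate in candidates {
        if out.len() == limit {
            break;
        }
        let count = per_creator.entry(candidate.creator).or_insert(0);
        if *count < max_per_creator {
            *count += 1;
            out.push(candidate);
        }
    }
    out
}
```
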

---

## Getting started

Pick the path that matches how you plan to use tidalDB today. Every option below is self-contained and ships in this repo.

### 1. Embed tidalDB inside your Rust service (library mode)

**Setup**

1. Add the git dependency:
   ```toml
   [dependencies]
   tidaldb = { git = "https://github.com/your-org/tidalDB", rev = "..." }
   ```
2. Define your schema before opening the database (decay, windows, text fields, embeddings). The snippet in **[Quickstart, Step 2](QUICKSTART.md#step-2-define-a-schema)** is a ready-to-copy template.
3. Choose storage mode when building:
   ```rust
   let db = tidaldb::TidalDb::builder()
       .with_schema(schema)
       .ephemeral() // in-memory for tests
       // .with_data_dir("/var/lib/tidaldb") // persistent deployment
       .open()?;
   ```
4. Run the end-to-end sample:
   ```bash
   cargo run --manifest-path tidal/Cargo.toml --example quickstart
   ```

**Usage**

- Call `db.signal(...)`, `db.signal_with_context(...)`, and `db.retrieve(...)` / `db.search(...)` from the same process; no network stack required.
- Wrap the instance in `Arc<TidalDb>` to share it across threads or tasks.
- Persisted deployments can be inspected with the CLI tool: `cargo run -p tidalctl -- status --path /var/lib/tidaldb`.
- Full walkthrough: **[QUICKSTART.md](QUICKSTART.md)** and **[API.md](API.md)**.
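
The `Arc` sharing pattern mentioned above can be sketched without the crate itself. The example below uses a stand-in type, `FakeDb`, in place of `TidalDb` (an assumption made so the snippet compiles on its own); the point is only the shape of the pattern: methods take `&self`, each thread gets its own `Arc` clone, and no outer `Mutex` is needed when the handle is `Send + Sync`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Stand-in for `TidalDb` (hypothetical): any `Send + Sync` handle is shared the same way.
#[derive(Default)]
struct FakeDb {
    signals_written: AtomicU64,
}

impl FakeDb {
    /// Takes `&self`, so callers never need exclusive access to the handle.
    fn signal(&self) {
        self.signals_written.fetch_add(1, Ordering::Relaxed);
    }
}

/// Spawn `workers` threads that each write `per_worker` signals through a
/// cloned `Arc`, then return the total observed after all threads have joined.
fn write_from_threads(workers: usize, per_worker: u64) -> u64 {
    let db = Arc::new(FakeDb::default());
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let db = Arc::clone(&db);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    db.signal();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    db.signals_written.load(Ordering::Relaxed)
}
```

With the real library, the same structure would apply with `Arc<TidalDb>` in place of `Arc<FakeDb>`; consult API.md for the actual method signatures.
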

### 2. Run the standalone HTTP server (`tidal-server`)

**Why:** you want a ready-to-run HTTP facade without writing Axum/Actix glue.

```bash
cargo run -p tidal-server -- \
  standalone \
  --listen 127.0.0.1:9400 \
  --schema tidal-server/config/default-schema.yaml
```

Options:
- `--data-dir /var/lib/tidaldb` switches to persistent storage.
- Provide your own schema file (YAML) to match your signal mix.

Usage:

```bash
# register metadata + embedding
curl -X POST http://127.0.0.1:9400/items \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "metadata": { "title": "Jazz Piano", "category": "music" } }'
curl -X POST http://127.0.0.1:9400/embeddings \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "values": [0.1, 0.2, 0.3] }'

# write engagement (supports user/creator context)
curl -X POST http://127.0.0.1:9400/signals \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "signal": "view", "weight": 1.0, "user_id": 42 }'

# query
curl "http://127.0.0.1:9400/feed?user_id=42&profile=for_you&limit=20"
curl "http://127.0.0.1:9400/search?query=jazz%20piano&user_id=42&limit=5"
curl http://127.0.0.1:9400/health
```

The default schema lives at `tidal-server/config/default-schema.yaml`. Edit
it (or provide your own path) to align with your application’s signals,
text fields, and embedding slots.

### 3. Wrap it in an HTTP service you control

Expose tidalDB through your favorite web framework; the repo ships runnable templates.

- **Axum sample (`tidal/examples/axum_embedding.rs`)**
  ```bash
  cargo run --example axum_embedding --manifest-path tidal/Cargo.toml
  ```
  Usage:
  ```bash
  curl -X POST http://127.0.0.1:3000/signal \
    -H 'Content-Type: application/json' \
    -d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'
  curl "http://127.0.0.1:3000/feed?user_id=42"
  curl http://127.0.0.1:3000/health
  ```
  The example handles schema setup, wraps `Arc<TidalDb>` in Axum `State`, and maps `TidalError` to HTTP responses.

- **Actix sample (`tidal/examples/actix_embedding.rs`)**
  ```bash
  cargo run --example actix_embedding --manifest-path tidal/Cargo.toml
  # curl http://127.0.0.1:3001/health
  ```
  Demonstrates sharing `Arc<TidalDb>` through `web::Data` and using Actix’s shutdown hooks.

Use either sample as a starting point for microservices that prefer a client/server boundary.
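
When adapting either sample, the error-to-response mapping is usually the first piece to customize. The sketch below shows the general shape with a framework-free function. The `AppError` enum and its variants are hypothetical and chosen only for illustration; the real `TidalError` variants may differ, so check API.md before copying the status codes.

```rust
/// Hypothetical error categories; tidalDB's real `TidalError` variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AppError {
    UnknownSignal,
    UnknownEntity,
    RateLimited,
    Storage,
}

/// Map an error category to an HTTP status code and a short client-facing message.
fn http_response(err: AppError) -> (u16, &'static str) {
    match err {
        AppError::UnknownSignal => (400, "unknown signal name"),
        AppError::UnknownEntity => (404, "entity not found"),
        AppError::RateLimited => (429, "rate limit exceeded"),
        AppError::Storage => (500, "internal storage error"),
    }
}
```

Keeping the mapping in one plain function makes it easy to reuse from either an Axum or an Actix handler and to unit-test without starting a server.
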

### 4. Run the Forage demo server (Axum + UI)

Want to see tidalDB powering a live personalization surface? Forage is a thin Axum server + feed UI that talks to a tidalDB instance embedded in-process.

```bash
cargo run -p forage-server --manifest-path applications/forage/server/Cargo.toml
open http://localhost:4242   # macOS; use xdg-open on Linux
```

Flags:
- `--ephemeral` to keep everything in-memory.
- `--data-dir ~/.forage/data` to point at a custom persistent directory.

Usage:
```bash
curl -X POST http://localhost:4242/signal \
  -H "Content-Type: application/json" \
  -d '{ "user_id": 1, "item_id": 42, "signal_type": "view" }'
curl "http://localhost:4242/feed?user=1&limit=7"
```
The UI shows seeded users, exploration labels, and real-time adaptation; see `applications/forage/readme.md` for the full loop.

### 5. Run the cluster server + Docker image

Need a single endpoint that fronts the built-in simulated cluster? Use
`tidal-server` in `cluster` mode. It spins up the multi-region fabric,
ships WAL batches between regions, and exposes `/signals`, `/feed`, and
`/search`, plus cluster-management routes.

```bash
cargo run -p tidal-server -- \
  cluster \
  --listen 0.0.0.0:9500 \
  --schema tidal-server/config/default-schema.yaml \
  --topology tidal-server/config/default-cluster.yaml
```

Key endpoints:

```bash
curl http://127.0.0.1:9500/health
curl -X POST http://127.0.0.1:9500/signals \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'
curl "http://127.0.0.1:9500/feed?profile=trending&region=eu-west"
curl http://127.0.0.1:9500/cluster/status
curl -X POST http://127.0.0.1:9500/cluster/promote \
  -H 'Content-Type: application/json' \
  -d '{ "region": "eu-west" }'
```

Cluster mode currently replicates global signals (no `user_id` /
`creator_id` contexts) so that followers can stay in sync with the leader’s
WAL stream. See **[docs/runbooks/cluster.md](docs/runbooks/cluster.md)** for
operational steps, failure drills, and API references.
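
To give a feel for what "staying in sync with the leader's WAL stream" involves, here is a toy model of a follower that missed some batches (for example during a partition) and then catches up. It is a sketch under stated assumptions, not the cluster code: `Batch`, `Follower`, `lag`, and `catch_up` are hypothetical names, and the model assumes the leader assigns consecutive sequence numbers starting at 1.

```rust
/// One replicated WAL batch; `seq` increases by one per batch on the leader.
/// (All types here are hypothetical and exist only for this illustration.)
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    seq: u64,
    payload: String,
}

/// Follower state: the highest sequence number applied so far (0 = nothing yet).
#[derive(Debug, Default)]
struct Follower {
    applied_seq: u64,
    applied: Vec<String>,
}

impl Follower {
    /// Apply a batch only if it is the next one in order.
    /// Duplicates and out-of-order batches are rejected and leave state unchanged.
    fn apply(&mut self, batch: &Batch) -> bool {
        if batch.seq != self.applied_seq + 1 {
            return false;
        }
        self.applied.push(batch.payload.clone());
        self.applied_seq = batch.seq;
        true
    }
}

/// Number of leader batches the follower has not applied yet.
fn lag(leader_log: &[Batch], follower: &Follower) -> u64 {
    leader_log
        .last()
        .map_or(0, |b| b.seq.saturating_sub(follower.applied_seq))
}

/// Re-send the leader's log in order; already-applied batches are skipped by
/// `apply`. Returns how many batches were newly applied.
fn catch_up(leader_log: &[Batch], follower: &mut Follower) -> usize {
    let mut delivered = 0;
    for batch in leader_log {
        if follower.apply(batch) {
            delivered += 1;
        }
    }
    delivered
}
```
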

Prefer containers? Build the provided image and run it anywhere:

```bash
docker build -f docker/cluster/Dockerfile -t tidal-cluster .
docker run --rm -p 9500:9500 tidal-cluster
```

Mount your own schema/topology files with `-v` if you want different regions
or signal definitions.

### 6. Simulate a multi-region cluster in tests

The raw `SimulatedCluster` harness (no HTTP) remains available for property
tests and fuzzing.

```bash
cargo test --test m8_uat
cargo test --test m8_uat uat_step3 -- --nocapture   # run a single scenario
```

Tweak `tidal/tests/m8_uat.rs` to script specific replication, failover, and
migration scenarios inside your own test suites.

**MSRV:** Rust 1.91

---

## Documentation

| Document | Contents |
|----------|----------|
| [QUICKSTART.md](QUICKSTART.md) | Step-by-step guide: schema, ingest, signals, ranking, search |
| [API.md](API.md) | Full API reference with code examples |
| [VISION.md](VISION.md) | Problem statement and design thesis |
| [ARCHITECTURE.md](ARCHITECTURE.md) | Storage, signal system, vector index, query pipeline |
| [USE_CASES.md](USE_CASES.md) | 14 content discovery surfaces, filter and sort references |

---

## Status

Milestones completed:

- Storage engine, WAL, entity store, signal ledger
- RETRIEVE query: candidate retrieval, filtering, scoring, diversity, pagination
- Vector index (USearch HNSW) with adaptive filtered search
- 25 built-in ranking profiles
- BM25 full-text search (Tantivy) + hybrid RRF fusion
- Creator search and creator profiles
- Cohort-scoped signal aggregation and trending
- Social graph (follows, blocks, following feed)
- Collections, saved searches, autocomplete suggestions
- Session and agent context (short-lived signals, preference decay)
- Crash recovery, graceful degradation, rate limiting, diagnostics
- Scale: tested to 1M items; scale benchmarks passing

The API surface is stable for the implemented features. Breaking changes are possible before 1.0.

---

## License

MIT