
# tidalDB

**An embeddable Rust database for the personalized content ranking problem.**

> Pre-release. API is stabilizing. Not yet recommended for production.

---

Every content platform eventually builds the same distributed system from scratch: Elasticsearch for retrieval, Redis for hot signals, Kafka for event ingestion, a feature store for user profiles, a vector database for semantic search, and a ranking service that stitches them together. The seams between those systems are where correctness dies — stale signals, inconsistent ranking, cache invalidation bugs, ETL lag.

The root cause: existing databases treat ranking as an afterthought. They have no native concept of signals that evolve over time, no understanding of user context, and no way to express diversity as a query constraint.

**Ranking is not a feature. It is a primitive.**

tidalDB is a single-node, embeddable Rust library built for one question: *given a user and a context, what content should they see, and in what order?* The core needs no server, network protocol, or client SDK; you link it into your process.

---

## What it looks like

```rust
use std::collections::HashMap;
use std::time::Duration;
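use tidaldb::{
    query::retrieve::Retrieve,
    schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window},
    TidalDb,
};
// `Search` (used below) must be imported as well; see API.md for its module path.
// `your_model`, `user_id`, and `creator_id` are placeholders supplied by your application.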

// Declare signals with native decay — no application formulas.
let mut schema = SchemaBuilder::new();
let _ = schema.signal("view", EntityKind::Item, DecaySpec::Exponential {
    half_life: Duration::from_secs(7 * 24 * 3600),
}).windows(&[Window::OneHour, Window::TwentyFourHours, Window::AllTime]).velocity(true).add();
let _ = schema.signal("like", EntityKind::Item, DecaySpec::Exponential {
    half_life: Duration::from_secs(30 * 24 * 3600),
}).windows(&[Window::AllTime]).velocity(false).add();
let schema = schema.build()?;

// Open — ephemeral for tests, persistent for production.
let db = TidalDb::builder().ephemeral().with_schema(schema).open()?;

// Ingest content with metadata.
let mut meta = HashMap::new();
meta.insert("title".to_string(), "Introduction to Jazz Piano".to_string());
meta.insert("category".to_string(), "music".to_string());
db.write_item_with_metadata(EntityId::new(1), &meta)?;

// Write an embedding (you generate it, tidalDB indexes and ranks over it).
db.write_item_embedding(EntityId::new(1), &your_model.embed("Introduction to Jazz Piano"))?;

// Record engagement — the feedback loop closes here, no ETL required.
db.signal("view", EntityId::new(1), 1.0, Timestamp::now())?;
db.signal_with_context("like", EntityId::new(1), 1.0, Timestamp::now(), Some(user_id), Some(creator_id))?;

// Retrieve a ranked feed. Name the profile. tidalDB executes the pipeline.
let results = db.retrieve(&Retrieve::builder().for_user(user_id).profile("for_you").limit(50).build()?)?;

// Search: BM25 + semantic similarity fused via RRF.
let results = db.search(&Search::builder().query("jazz piano tutorial").for_user(user_id).limit(20).build()?)?;

db.close()?;
```
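
To make the half-lives in the snippet concrete, here is a minimal sketch of exponential decay, assuming the textbook formula `weight * 0.5^(age / half_life)`. `decayed_weight` is a hypothetical helper written for illustration; it is not a tidalDB function, and the library's internal computation may differ.

```rust
use std::time::Duration;

/// Contribution of a single event after `age` has elapsed, under
/// exponential decay with the given half-life.
/// Hypothetical helper for illustration; not part of the tidalDB API.
fn decayed_weight(weight: f64, age: Duration, half_life: Duration) -> f64 {
    let half_lives_elapsed = age.as_secs_f64() / half_life.as_secs_f64();
    weight * 0.5_f64.powf(half_lives_elapsed)
}

fn main() {
    let day = 24 * 3600;
    // Same half-lives as the `view` and `like` signals declared above.
    let view_half_life = Duration::from_secs(7 * day);
    let like_half_life = Duration::from_secs(30 * day);

    for age_days in [0u64, 7, 14, 30] {
        let age = Duration::from_secs(age_days * day);
        println!(
            "age {:>2}d  view {:.3}  like {:.3}",
            age_days,
            decayed_weight(1.0, age, view_half_life),
            decayed_weight(1.0, age, like_half_life),
        );
    }
}
```

Under this formula a `view` counts for half its original weight after 7 days, while a `like` takes 30 days to lose the same amount, which is why the two signals are declared with different half-lives.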

---

## What it replaces

| System | tidalDB equivalent |
|--------|--------------------|
| Elasticsearch | Tantivy BM25 text index (derived, crash-recoverable) |
| Redis | Lock-free in-memory signal ledger — decay scores, windowed counters |
| Kafka | Write-ahead log — durable, ordered, replayable |
| Feature store | Signal aggregates + user preference vectors (updated at write time) |
| Vector DB | USearch HNSW — embedded, f16 quantized, predicate-filtered ANN |
| Ranking service | 25 named profiles, scored at query time, swappable by name |
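
As a rough illustration of what "lock-free" can mean for a signal ledger, the sketch below updates a floating-point score without a mutex by storing its bits in an `AtomicU64` and retrying a compare-and-swap. This shows the general technique only; `ScoreCell` and `concurrent_total` are hypothetical names, and tidalDB's actual ledger layout is not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// A score cell that many writers can update without taking a lock.
/// The f64 is stored as its raw bits inside an AtomicU64.
struct ScoreCell {
    bits: AtomicU64,
}

impl ScoreCell {
    fn new(initial: f64) -> Self {
        Self { bits: AtomicU64::new(initial.to_bits()) }
    }

    fn load(&self) -> f64 {
        f64::from_bits(self.bits.load(Ordering::Acquire))
    }

    /// Add `delta` using a compare-and-swap retry loop.
    fn add(&self, delta: f64) {
        let mut current = self.bits.load(Ordering::Relaxed);
        loop {
            let next = (f64::from_bits(current) + delta).to_bits();
            match self.bits.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return,
                // Another writer got in first; retry against the value it left.
                Err(observed) => current = observed,
            }
        }
    }
}

/// Four threads each add 1.0 a thousand times to one shared cell.
fn concurrent_total() -> f64 {
    let cell = Arc::new(ScoreCell::new(0.0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || {
                for _ in 0..1000 {
                    cell.add(1.0);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("writer thread panicked");
    }
    cell.load()
}

fn main() {
    println!("total = {}", concurrent_total());
}
```

No update is lost even though no writer ever blocks: a failed compare-and-swap simply retries with the freshly observed value.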

---

## Key capabilities

- **Signals with native decay** — declare `view` with a 7-day half-life; the database applies it at query time. No `trending_score_7d` field to maintain.
- **25 built-in ranking profiles** — `trending`, `hot`, `for_you`, `following`, `related`, `hidden_gems`, `top_week`, `shuffle`, `controversial`, and more. Name the profile; the database executes the full pipeline.
- **Hybrid search** — BM25 full-text + ANN semantic similarity, fused via Reciprocal Rank Fusion, personalized by user preference vector.
- **Composable filters** — filter by category, format, duration, language, engagement threshold, location, collection membership, and more, in any combination.
- **Diversity as a query constraint** — `max_per_creator: 2` belongs in the query, not your API layer.
- **Feedback loop in the write path** — a signal write atomically updates the item's ledger, the user's preference vector, and relationship weights. The next ranking query — 100ms later — reflects it.
- **Cold start handled** — new content gets an exploration budget; new users get sensible defaults. No application logic required.
- **Cohort-scoped trending** — "trending among US users aged 18-24 who engage with jazz" is one query, not a pipeline.
- **Embeddable first** — runs in your process. `Arc<TidalDb>` is `Send + Sync`. No operational overhead.
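
Two of the capabilities above, Reciprocal Rank Fusion and the per-creator diversity cap, are small enough to sketch in plain Rust. The code below is a standalone illustration under stated assumptions: `rrf_fuse` and `cap_per_creator` are hypothetical names, `k = 60` is a commonly used RRF constant rather than a documented tidalDB default, and the personalization step is omitted.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item,
/// with rank starting at 1.
fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (index, item) in list.iter().enumerate() {
            *scores.entry(*item).or_insert(0.0) += 1.0 / (k + (index as f64 + 1.0));
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by item id for determinism.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}

/// Keep the ranked order but allow at most `max_per_creator` items per creator.
fn cap_per_creator(
    ranked: &[u64],
    creator_of: &HashMap<u64, u64>,
    max_per_creator: usize,
) -> Vec<u64> {
    let mut seen: HashMap<u64, usize> = HashMap::new();
    let mut kept = Vec::new();
    for item in ranked {
        // Items with no known creator are kept unchanged in this sketch.
        let Some(creator) = creator_of.get(item) else {
            kept.push(*item);
            continue;
        };
        let count = seen.entry(*creator).or_insert(0);
        if *count < max_per_creator {
            *count += 1;
            kept.push(*item);
        }
    }
    kept
}

fn main() {
    // Hypothetical result lists from a text index and a vector index.
    let bm25: Vec<u64> = vec![1, 2, 3, 4];
    let ann: Vec<u64> = vec![3, 1, 5, 2];
    let fused = rrf_fuse(&[bm25, ann], 60.0);
    let ranked: Vec<u64> = fused.iter().map(|(id, _)| *id).collect();

    // Items 1, 2, and 3 share creator 100; items 4 and 5 have other creators.
    let creator_of: HashMap<u64, u64> =
        HashMap::from([(1, 100), (2, 100), (3, 100), (4, 200), (5, 300)]);
    println!("fused:  {:?}", ranked);
    println!("capped: {:?}", cap_per_creator(&ranked, &creator_of, 2));
}
```

Items that rank well in both lists rise to the top of the fused order, and the cap then drops the third item from creator 100 while preserving the relative order of everything else.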

---

## Getting started

Pick the path that matches how you plan to use tidalDB today. Every option below is self-contained and ships in this repo.

### 1. Embed tidalDB inside your Rust service (library mode)

**Setup**

1. Add the git dependency:
   ```toml
   [dependencies]
   tidaldb = { git = "https://github.com/your-org/tidalDB", rev = "..." }
   ```
2. Define your schema before opening the database (decay, windows, text fields, embeddings). The snippet in **[Quickstart, Step 2](QUICKSTART.md#step-2-define-a-schema)** is a ready-to-copy template.
3. Choose storage mode when building:
   ```rust
   let db = tidaldb::TidalDb::builder()
       .with_schema(schema)
       .ephemeral()               // in-memory for tests
       // .with_data_dir("/var/lib/tidaldb") // persistent deployment
       .open()?;
   ```
4. Run the end-to-end sample:
   ```bash
   cargo run --manifest-path tidal/Cargo.toml --example quickstart
   ```

**Usage**

- Call `db.signal(...)`, `db.signal_with_context(...)`, and `db.retrieve(...)` / `db.search(...)` from the same process; no network stack required.
- Wrap the instance in `Arc<TidalDb>` to share it across threads or tasks.
- Persisted deployments can be inspected with the CLI tool: `cargo run -p tidalctl -- status --path /var/lib/tidaldb`.
- Full walkthrough: **[QUICKSTART.md](QUICKSTART.md)** and **[API.md](API.md)**.
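
The `Arc` sharing pattern mentioned above looks roughly like the following. Because this sketch has to compile on its own, it uses a hypothetical `StandInDb` type in place of `TidalDb`; the only property assumed of the real type is that it is `Send + Sync` and that its methods take `&self`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for `TidalDb`, used only so this example is
/// self-contained. Methods take `&self`, so one instance can be shared
/// across threads without an outer lock.
struct StandInDb {
    views: Mutex<HashMap<u64, f64>>,
}

impl StandInDb {
    fn new() -> Self {
        Self { views: Mutex::new(HashMap::new()) }
    }

    /// Record a weighted signal for an item.
    fn signal(&self, item: u64, weight: f64) {
        let mut views = self.views.lock().expect("lock poisoned");
        *views.entry(item).or_insert(0.0) += weight;
    }

    /// Read back the accumulated weight for an item (0.0 if unseen).
    fn score(&self, item: u64) -> f64 {
        let views = self.views.lock().expect("lock poisoned");
        views.get(&item).copied().unwrap_or(0.0)
    }
}

/// Spawn `workers` threads that each record `per_worker` views of item 1.
fn record_views(db: &Arc<StandInDb>, workers: usize, per_worker: usize) {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own cheap handle to the same instance.
            let db = Arc::clone(db);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    db.signal(1, 1.0);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
}

fn main() {
    let db = Arc::new(StandInDb::new());
    record_views(&db, 4, 250);
    println!("item 1 score = {}", db.score(1));
}
```

With the real library, the same shape should apply: build the database once, wrap it in `Arc`, and clone the handle into each thread or async task.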

### 2. Run the standalone HTTP server (`tidal-server`)

**Why:** you want a ready-to-run HTTP facade without writing Axum/Actix glue.

```bash
cargo run -p tidal-server -- \
  standalone \
  --listen 127.0.0.1:9400 \
  --schema tidal-server/config/default-schema.yaml
```

Options:
- `--data-dir /var/lib/tidaldb` switches to persistent storage.
- Provide your own schema file (YAML) to match your signal mix.

Usage:

```bash
# register metadata + embedding
curl -X POST http://127.0.0.1:9400/items \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "metadata": { "title": "Jazz Piano", "category": "music" } }'
curl -X POST http://127.0.0.1:9400/embeddings \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "values": [0.1, 0.2, 0.3] }'

# write engagement (supports user/creator context)
curl -X POST http://127.0.0.1:9400/signals \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "signal": "view", "weight": 1.0, "user_id": 42 }'

# query
curl "http://127.0.0.1:9400/feed?user_id=42&profile=for_you&limit=20"
curl "http://127.0.0.1:9400/search?query=jazz%20piano&user_id=42&limit=5"
curl http://127.0.0.1:9400/health
```

The default schema lives at `tidal-server/config/default-schema.yaml`. Edit
it (or provide your own path) to align with your application’s signals,
text fields, and embedding slots.

### 3. Wrap it in an HTTP service you control

Expose tidalDB through your favorite web framework; the repo ships runnable templates.

- **Axum sample (`tidal/examples/axum_embedding.rs`)**
  ```bash
  cargo run --example axum_embedding --manifest-path tidal/Cargo.toml
  ```
  Usage:
  ```bash
  curl -X POST http://127.0.0.1:3000/signal \
       -H 'Content-Type: application/json' \
       -d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'
  curl "http://127.0.0.1:3000/feed?user_id=42"
  curl http://127.0.0.1:3000/health
  ```
  The example handles schema setup, wraps `Arc<TidalDb>` in Axum `State`, and maps `TidalError` to HTTP responses.

- **Actix sample (`tidal/examples/actix_embedding.rs`)**
  ```bash
  cargo run --example actix_embedding --manifest-path tidal/Cargo.toml
  # curl http://127.0.0.1:3001/health
  ```
  Demonstrates sharing `Arc<TidalDb>` through `web::Data` and using Actix’s shutdown hooks.

Use either sample as a starting point for microservices that prefer a client/server boundary.

### 4. Run the Forage demo server (Axum + UI)

Want to see tidalDB powering a live personalization surface? Forage is a thin Axum server + feed UI that talks to a tidalDB instance embedded in-process.

```bash
cargo run -p forage-server --manifest-path applications/forage/server/Cargo.toml
open http://localhost:4242   # macOS; use xdg-open on Linux
```

Flags:
- `--ephemeral` to keep everything in-memory.
- `--data-dir ~/.forage/data` to point at a custom persistent directory.

Usage:
```bash
curl -X POST http://localhost:4242/signal \
     -H "Content-Type: application/json" \
     -d '{ "user_id": 1, "item_id": 42, "signal_type": "view" }'
curl "http://localhost:4242/feed?user=1&limit=7"
```
The UI shows seeded users, exploration labels, and real-time adaptation; see `applications/forage/readme.md` for the full loop.

### 5. Run the cluster server + Docker image

Need a single endpoint that fronts the built-in simulated cluster? Use
`tidal-server` in `cluster` mode. It spins up the multi-region fabric,
ships WAL batches between regions, and exposes `/signals`, `/feed`,
`/search` plus cluster-management routes.

```bash
cargo run -p tidal-server -- \
  cluster \
  --listen 0.0.0.0:9500 \
  --schema tidal-server/config/default-schema.yaml \
  --topology tidal-server/config/default-cluster.yaml
```

Key endpoints:

```bash
curl http://127.0.0.1:9500/health
curl -X POST http://127.0.0.1:9500/signals \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'
curl "http://127.0.0.1:9500/feed?profile=trending&region=eu-west"
curl http://127.0.0.1:9500/cluster/status
curl -X POST http://127.0.0.1:9500/cluster/promote \
  -H 'Content-Type: application/json' \
  -d '{ "region": "eu-west" }'
```

Cluster mode currently replicates global signals (no `user_id` /
`creator_id` contexts) so that followers can stay in sync with the leader’s
WAL stream. See **[docs/runbooks/cluster.md](docs/runbooks/cluster.md)** for
operational steps, failure drills, and API references.

Prefer containers? Build the provided image and run it anywhere:

```bash
docker build -f docker/cluster/Dockerfile -t tidal-cluster .
docker run --rm -p 9500:9500 tidal-cluster
```

Mount your own schema/topology files with `-v` if you want different regions
or signal definitions.

### 6. Simulate a multi-region cluster in tests

The raw `SimulatedCluster` harness (no HTTP) remains available for property
tests and fuzzing.

```bash
cargo test --test m8_uat
cargo test --test m8_uat uat_step3 -- --nocapture   # run a single scenario
```

Tweak `tidal/tests/m8_uat.rs` to script specific replication, failover, and
migration scenarios inside your own test suites.

**MSRV:** Rust 1.91

---

## Documentation

| Document | Contents |
|----------|----------|
| [QUICKSTART.md](QUICKSTART.md) | Step-by-step guide: schema, ingest, signals, ranking, search |
| [API.md](API.md) | Full API reference with code examples |
| [VISION.md](VISION.md) | Problem statement and design thesis |
| [ARCHITECTURE.md](ARCHITECTURE.md) | Storage, signal system, vector index, query pipeline |
| [USE_CASES.md](USE_CASES.md) | 14 content discovery surfaces, filter and sort references |

---

## Status

Milestones completed:

- Storage engine, WAL, entity store, signal ledger
- RETRIEVE query: candidate retrieval, filtering, scoring, diversity, pagination
- Vector index (USearch HNSW) with adaptive filtered search
- 25 built-in ranking profiles
- BM25 full-text search (Tantivy) + hybrid RRF fusion
- Creator search and creator profiles
- Cohort-scoped signal aggregation and trending
- Social graph (follows, blocks, following feed)
- Collections, saved searches, autocomplete suggestions
- Session and agent context (short-lived signals, preference decay)
- Crash recovery, graceful degradation, rate limiting, diagnostics
- Scale: tested to 1M items; scale benchmarks passing

The API surface is stable for the implemented features. Breaking changes are possible before 1.0.

---

## License

MIT