# tidalDB Cluster Runbook

This runbook describes how to operate the simulated multi-region tidalDB
cluster that ships with `tidal-server`. The cluster reuses the
`SimulatedCluster` fabric: it runs multiple in-process nodes, replays the
real WAL + CRDT reconciliation paths, and exposes a single HTTP surface
for microservices.

> **Important limitations**
>
> - Cluster mode currently replicates global signals only. `user_id` /
>   `creator_id` contexts are rejected so followers stay consistent with the
>   leader’s WAL stream.
> - All metadata and embedding writes are broadcast to every region up front.
>   There is no separate replication log for items yet.

## Prerequisites

- Rust toolchain ≥ 1.91 if running directly.
- Docker 25+ if running via container.
- Port 9500 available (default cluster listener).

## 1. Launch the cluster locally

```bash
cargo run -p tidal-server -- \
  cluster \
  --listen 127.0.0.1:9500 \
  --schema tidal-server/config/default-schema.yaml \
  --topology tidal-server/config/default-cluster.yaml
```

The default topology spins up three regions (`us-east`, `eu-west`,
`ap-south`) with `us-east` as leader.

## 2. Launch via Docker

```bash
# Build the image once
docker build -f docker/cluster/Dockerfile -t tidal-cluster .

# Run (press Ctrl+C to stop)
docker run --rm -p 9500:9500 tidal-cluster
```

To supply custom schema/topology files:

```bash
# Quote the bind mounts so a working directory containing spaces still works
docker run --rm -p 9500:9500 \
  -v "$PWD/configs/my-schema.yaml:/srv/schema.yaml" \
  -v "$PWD/configs/my-topology.yaml:/srv/topology.yaml" \
  tidal-cluster \
  tidal-server cluster \
    --listen 0.0.0.0:9500 \
    --schema /srv/schema.yaml \
    --topology /srv/topology.yaml
```

## 3. Core API calls

All routes are JSON unless noted.

### Health

```bash
curl http://localhost:9500/health
```

Returns overall status and item count on the leader.

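When the launch step is scripted, it can help to wait for the health route before seeding data. The following is a minimal sketch, not part of `tidal-server`: the helper name `wait_for_health` and its retry and delay parameters are hypothetical, and it assumes only that `GET /health` answers `200` once the server is ready.

```bash
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls URL until it answers HTTP 200,
# or gives up after ATTEMPTS tries (default 30, one second apart).
wait_for_health() {
  local url=$1 attempts=${2:-30} delay=${3:-1} i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -s hides progress output, -o discards the body,
    # -w prints only the HTTP status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) || code=000
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no 200 from $url after $attempts attempts" >&2
  return 1
}

# Example (run once the server has been started):
#   wait_for_health http://localhost:9500/health 30 1 && echo "cluster is up"
```

The function returns non-zero on timeout, so a deployment script can stop before it posts items to a server that never came up.
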
### Register items & embeddings

```bash
curl -X POST http://localhost:9500/items \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "metadata": { "title": "Jazz Piano", "category": "music" } }'

curl -X POST http://localhost:9500/embeddings \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "values": [0.1, 0.2, 0.3, 0.4] }'
```

### Record signals (cluster mode = global only)

```bash
curl -X POST http://localhost:9500/signals \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'
```

### Retrieve and search

```bash
curl "http://localhost:9500/feed?user_id=42&profile=trending&limit=10"
curl "http://localhost:9500/search?query=jazz%20piano&limit=5"

# Target a specific region (followers may lag during partitions)
curl "http://localhost:9500/feed?profile=trending&region=eu-west"
```

## 4. Cluster operations

### Check cluster status

```bash
curl http://localhost:9500/cluster/status | jq
```

Sample response:

```json
{
  "leader": "us-east",
  "relay_log_len": 125,
  "regions": [
    { "name": "us-east", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "eu-west", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "ap-south", "applied_events": 124, "lag_events": 1, "partitioned": false }
  ]
}
```

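For alerting or quick checks, the status document can be reduced to two numbers: the worst follower lag and the count of partitioned regions. This is a rough sketch that assumes the field names and number formatting shown in the sample response. The helper names `max_lag` and `partitioned_count` are hypothetical, and the helpers use `grep` rather than `jq` only so they run where `jq` is not installed.

```bash
# max_lag: reads a /cluster/status JSON document on stdin and prints the
# largest lag_events value it finds (prints nothing if the field is absent).
max_lag() {
  grep -o '"lag_events"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$' \
    | sort -n \
    | tail -n 1
}

# partitioned_count: reads the same document on stdin and prints how many
# regions currently report "partitioned": true.
partitioned_count() {
  grep -o '"partitioned"[[:space:]]*:[[:space:]]*true' | wc -l | tr -d ' '
}

# Example (against a running cluster):
#   curl -s http://localhost:9500/cluster/status | max_lag
#   curl -s http://localhost:9500/cluster/status | partitioned_count
```

Pattern matching on JSON is fragile. Where `jq` is available, `jq '[.regions[].lag_events] | max'` gives the maximum lag more robustly.
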
### Promote a new leader

```bash
curl -X POST http://localhost:9500/cluster/promote \
  -H 'Content-Type: application/json' \
  -d '{ "region": "eu-west" }'
```

`/cluster/status` will now report `eu-west` as leader. New writes are routed
there and replayed to the other regions.

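A failover script may want to confirm the promotion before sending more writes. Here is a minimal sketch, assuming the `leader` field appears in `/cluster/status` as in the sample response. The helper names `current_leader` and `assert_leader` are hypothetical.

```bash
# current_leader: reads /cluster/status JSON on stdin and prints the value
# of the top-level "leader" field.
current_leader() {
  sed -n 's/.*"leader"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# assert_leader EXPECTED: reads the same JSON on stdin and fails (non-zero
# exit) unless the reported leader equals EXPECTED.
assert_leader() {
  local expected=$1 actual
  actual=$(current_leader)
  if [ "$actual" = "$expected" ]; then
    echo "leader is $actual"
  else
    echo "expected leader $expected, got ${actual:-none}" >&2
    return 1
  fi
}

# Example (after POST /cluster/promote):
#   curl -s http://localhost:9500/cluster/status | assert_leader eu-west
```
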
### Simulate a partition & heal

```bash
# Isolate ap-south (writes will skip this follower)
curl -X POST http://localhost:9500/cluster/partition \
  -H 'Content-Type: application/json' \
  -d '{ "region": "ap-south" }'

# Heal the partition (missed batches are replayed automatically)
curl -X POST http://localhost:9500/cluster/heal \
  -H 'Content-Type: application/json' \
  -d '{ "region": "ap-south" }'
```

Monitor `/cluster/status` to confirm lag drops back to zero after healing.

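The monitoring step can be automated with a small polling loop. This is a sketch under the assumption that convergence means every region reports `lag_events` of `0` in `/cluster/status`. The helper name `wait_for_zero_lag` and its retry and delay parameters are hypothetical.

```bash
# wait_for_zero_lag URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls the status URL until the largest lag_events
# value across all regions is 0, or gives up after ATTEMPTS polls.
wait_for_zero_lag() {
  local url=$1 attempts=${2:-30} delay=${3:-1} i=1 lag
  while [ "$i" -le "$attempts" ]; do
    # Extract every lag_events number and keep the largest one.
    lag=$(curl -s "$url" \
      | grep -o '"lag_events"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
      | grep -o '[0-9][0-9]*$' \
      | sort -n \
      | tail -n 1)
    if [ "$lag" = "0" ]; then
      echo "converged after $i poll(s)"
      return 0
    fi
    # An empty value means the call failed or returned no lag fields.
    echo "max lag_events=${lag:-unknown}, retrying" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (after POST /cluster/heal):
#   wait_for_zero_lag http://localhost:9500/cluster/status 30 1
```

A bounded number of attempts keeps a partition drill from hanging forever if a follower never catches up. The non-zero exit status can then fail the drill.
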
## 5. Runbook checklist

1. **Startup**: launch `tidal-server cluster …` (or Docker). Confirm the log
   line `listening on http://…`.
2. **Baseline health**: `GET /health` and `GET /cluster/status` return `200`.
3. **Seed data**: `POST /items`, `/embeddings`, `/signals` for initial items.
4. **Traffic**: microservices call `/signals`, `/feed`, `/search`. Add the
   `region` query param to pin reads to a follower for canary reads.
5. **Failover**: to move traffic during maintenance, `POST /cluster/promote`
   to the target region. Verify status before proceeding.
6. **Partition drill**: `POST /cluster/partition` to isolate a follower,
   observe lag, then `POST /cluster/heal`.
7. **Shutdown**: send SIGINT (Ctrl+C) or stop the container. The server logs
   `shutdown signal received` and exits cleanly.

Refer to `docs/planning/ROADMAP.md` for the underlying distributed
fabric guarantees and property tests.