tidaldb/docs/runbooks/cluster.md
jordan eca7765e8d fix: heal_region re-delivers missed WAL batches so partitioned followers converge immediately after heal
- Extract redeliver_missed(tx, db, log) helper into cluster_transport.rs
- heal_region now removes partition then immediately ships any missed
  batch-log entries to the healed follower's channel
- await_convergence refactored to call the same helper (no logic change)
- tidal-server: reload_text_index before search in cluster mode
- tidal-server: write_signal returns Result instead of panicking on unknown signal
- tidal-server: leader shows lag_events=0 (writes directly, no receiver thread)
- tidal-server: fix cluster mode error propagation (ServerError::from)
- docs/runbooks/cluster.md: add full cluster operations runbook
- docker/: add Dockerfile for containerised cluster deployment
- README.md: add tidal-server HTTP API getting-started section
- Split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 11:57:01 -07:00

# tidalDB Cluster Runbook
This runbook describes how to operate the simulated multi-region tidalDB
cluster that ships with `tidal-server`. The cluster reuses the
`SimulatedCluster` fabric — it runs multiple in-process nodes, replays the
real WAL + CRDT reconciliation paths, and exposes a single HTTP surface
for microservices.
> **Important limitations**
>
> - Cluster mode currently replicates global signals only. `user_id` /
> `creator_id` contexts are rejected so followers stay consistent with the
> leader's WAL stream.
> - All metadata and embedding writes are broadcast to every region up front.
> There is no separate replication log for items yet.
## Prerequisites
- Rust toolchain ≥ 1.91 if running directly.
- Docker 25+ if running via container.
- Port 9500 available (default cluster listener).
## 1. Launch the cluster locally
```bash
cargo run -p tidal-server -- \
cluster \
--listen 127.0.0.1:9500 \
--schema tidal-server/config/default-schema.yaml \
--topology tidal-server/config/default-cluster.yaml
```
The default topology spins up three regions (`us-east`, `eu-west`,
`ap-south`) with `us-east` as leader.
## 2. Launch via Docker
```bash
# Build the image once
docker build -f docker/cluster/Dockerfile -t tidal-cluster .
# Run (press Ctrl+C to stop)
docker run --rm -p 9500:9500 tidal-cluster
```
To supply custom schema/topology files:
```bash
docker run --rm -p 9500:9500 \
-v "$PWD/configs/my-schema.yaml:/srv/schema.yaml" \
-v "$PWD/configs/my-topology.yaml:/srv/topology.yaml" \
tidal-cluster \
tidal-server cluster \
--listen 0.0.0.0:9500 \
--schema /srv/schema.yaml \
--topology /srv/topology.yaml
```
## 3. Core API calls
All routes are JSON unless noted.
### Health
```bash
curl http://localhost:9500/health
```
Returns overall status and item count on the leader.
### Register items & embeddings
```bash
curl -X POST http://localhost:9500/items \
-H 'Content-Type: application/json' \
-d '{ "entity_id": 1, "metadata": { "title": "Jazz Piano", "category": "music" } }'
curl -X POST http://localhost:9500/embeddings \
-H 'Content-Type: application/json' \
-d '{ "entity_id": 1, "values": [0.1, 0.2, 0.3, 0.4] }'
```
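For services that seed data programmatically, the two calls above could be built with Python's standard library roughly as follows. This is only a sketch: `BASE_URL`, `build_post`, `seed_requests`, and `send` are illustrative names rather than part of tidal-server, and the payload shapes are copied from the `curl` examples above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9500"  # assumption: default listener from this runbook


def build_post(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for the given route."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def seed_requests(entity_id, metadata, values, base_url=BASE_URL):
    """Return the /items and /embeddings requests for one entity, in order."""
    return [
        build_post("/items", {"entity_id": entity_id, "metadata": metadata}, base_url),
        build_post("/embeddings", {"entity_id": entity_id, "values": values}, base_url),
    ]


def send(request, timeout=5.0):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from `send` lets the payloads be inspected or unit-tested without a running cluster.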
### Record signals (cluster mode = global only)
```bash
curl -X POST http://localhost:9500/signals \
-H 'Content-Type: application/json' \
-d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'
```
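Because cluster mode rejects `user_id` / `creator_id` contexts, a client may prefer to catch such payloads before they reach the server. The following is a minimal client-side sketch; `validate_global_signal` is a hypothetical helper, and how the server itself reports a rejected signal (status code, error body) is not specified in this runbook.

```python
# Context keys that this runbook says cluster mode rejects.
USER_SCOPED_KEYS = ("user_id", "creator_id")


def validate_global_signal(payload):
    """Return the payload unchanged if it is a global signal, else raise ValueError."""
    scoped = [key for key in USER_SCOPED_KEYS if key in payload]
    if scoped:
        raise ValueError(
            "cluster mode replicates global signals only; remove: " + ", ".join(scoped)
        )
    # Field names below are taken from the curl example above.
    for required in ("entity_id", "signal"):
        if required not in payload:
            raise ValueError("missing field: " + required)
    return payload
```

For example, `validate_global_signal({"entity_id": 1, "signal": "view", "weight": 1.0})` passes, while adding `"user_id": 42` to the same payload raises `ValueError`.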
### Retrieve and search
```bash
curl "http://localhost:9500/feed?user_id=42&profile=trending&limit=10"
curl "http://localhost:9500/search?query=jazz%20piano&limit=5"
# Target a specific region (followers may lag during partitions)
curl "http://localhost:9500/feed?profile=trending&region=eu-west"
```
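The same read URLs can be assembled in code, with the optional `region` parameter added only when a follower is targeted. This is a sketch assuming the query parameters shown above; `read_url` is an illustrative helper, not a tidal-server API.

```python
from urllib.parse import quote, urlencode


def read_url(route, params, region=None, base_url="http://localhost:9500"):
    """Build a GET URL for /feed or /search, optionally pinned to one region."""
    query = dict(params)
    if region is not None:
        query["region"] = region
    # quote_via=quote encodes spaces as %20, matching the curl example.
    return f"{base_url}{route}?{urlencode(query, quote_via=quote)}"


search = read_url("/search", {"query": "jazz piano", "limit": 5})
pinned = read_url("/feed", {"profile": "trending"}, region="eu-west")
```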
## 4. Cluster operations
### Check cluster status
```bash
curl http://localhost:9500/cluster/status | jq
```
Sample response:
```json
{
"leader": "us-east",
"relay_log_len": 125,
"regions": [
{ "name": "us-east", "applied_events": 125, "lag_events": 0, "partitioned": false },
{ "name": "eu-west", "applied_events": 125, "lag_events": 0, "partitioned": false },
{ "name": "ap-south", "applied_events": 124, "lag_events": 1, "partitioned": false }
]
}
```
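For dashboards or scripted checks, the status document can be reduced to the fields an operator usually cares about. The sketch below assumes the response shape shown in the sample; `summarize_status` is a hypothetical helper.

```python
import json

SAMPLE = json.loads("""
{
  "leader": "us-east",
  "relay_log_len": 125,
  "regions": [
    { "name": "us-east", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "eu-west", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "ap-south", "applied_events": 124, "lag_events": 1, "partitioned": false }
  ]
}
""")


def summarize_status(status):
    """Return (leader, {region: lag} for lagging regions, [partitioned regions])."""
    lagging = {
        region["name"]: region["lag_events"]
        for region in status["regions"]
        if region["lag_events"] > 0
    }
    partitioned = [r["name"] for r in status["regions"] if r["partitioned"]]
    return status["leader"], lagging, partitioned


leader, lagging, partitioned = summarize_status(SAMPLE)
print(leader, lagging, partitioned)  # us-east {'ap-south': 1} []
```

In the sample, `applied_events + lag_events` equals `relay_log_len` for every region. That relationship is only an observation from this one response, not a documented guarantee.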
### Promote a new leader
```bash
curl -X POST http://localhost:9500/cluster/promote \
-H 'Content-Type: application/json' \
-d '{ "region": "eu-west" }'
```
`/cluster/status` will now report `eu-west` as leader. New writes are routed
there and replayed to the other regions.
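A scripted failover might check the target before promoting it and confirm the result afterwards. The sketch below takes the HTTP transport as injected callables (`get_json`, `post_json`) so it stays testable offline. Refusing to promote a partitioned or lagging region is an assumed operational policy, not something tidal-server is documented to enforce.

```python
def promote_with_checks(region, get_json, post_json):
    """Promote `region` to leader, verifying status before and after.

    get_json(path) -> dict and post_json(path, payload) are caller-supplied
    transport functions (hypothetical; wire them to your HTTP client).
    """
    before = get_json("/cluster/status")
    target = next((r for r in before["regions"] if r["name"] == region), None)
    if target is None:
        raise ValueError(f"unknown region: {region}")
    # Assumed policy: only promote a follower that is reachable and caught up.
    if target["partitioned"] or target["lag_events"] > 0:
        raise RuntimeError(f"{region} is partitioned or lagging; not promoting")

    post_json("/cluster/promote", {"region": region})

    after = get_json("/cluster/status")
    if after["leader"] != region:
        raise RuntimeError(f"expected leader {region}, got {after['leader']}")
    return after
```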
### Simulate a partition & heal
```bash
# Isolate ap-south (writes will skip this follower)
curl -X POST http://localhost:9500/cluster/partition \
-H 'Content-Type: application/json' \
-d '{ "region": "ap-south" }'
# Heal the partition (missed batches are replayed automatically)
curl -X POST http://localhost:9500/cluster/heal \
-H 'Content-Type: application/json' \
-d '{ "region": "ap-south" }'
```
Monitor `/cluster/status` to confirm lag drops back to zero after healing.
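The monitoring step can be automated with a small polling loop. This is a sketch under stated assumptions: `get_status` is any callable returning the parsed `/cluster/status` document, and the timeout and interval defaults are arbitrary. It also assumes that `lag_events` for a partitioned follower reflects the batches it has missed.

```python
import time


def wait_for_zero_lag(region, get_status, timeout_s=30.0, interval_s=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll status until `region` is healed and caught up, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        entry = next((r for r in status["regions"] if r["name"] == region), None)
        if entry is None:
            raise KeyError(region)
        if not entry["partitioned"] and entry["lag_events"] == 0:
            return entry  # converged
        if clock() >= deadline:
            raise TimeoutError(
                f"{region} still has lag_events={entry['lag_events']} after {timeout_s}s"
            )
        sleep(interval_s)
```

Since the heal path is described as replaying missed batches automatically, a timeout here would suggest the redelivery did not complete and is worth investigating.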
## 5. Runbook checklist
1. **Startup**: launch `tidal-server cluster …` (or Docker). Confirm the log line
`listening on http://…`.
2. **Baseline health**: `GET /health` and `GET /cluster/status` return `200`.
3. **Seed data**: `POST /items`, `/embeddings`, `/signals` for initial items.
4. **Traffic**: microservices call `/signals`, `/feed`, `/search`. Add the `region`
query param to pin to a follower for canary reads.
5. **Failover**: to move traffic during maintenance, `POST /cluster/promote`
to the target region. Verify status before proceeding.
6. **Partition drill**: `POST /cluster/partition` to isolate a follower,
observe lag, then `POST /cluster/heal`.
7. **Shutdown**: send SIGINT (Ctrl+C) or stop the container. The server logs
`shutdown signal received` and exits cleanly.

Refer to `docs/planning/ROADMAP.md` for the underlying distributed
fabric guarantees and property tests.