tidaldb/docs/runbooks/cluster.md
jordan eca7765e8d fix: heal_region re-delivers missed WAL batches so partitioned followers converge immediately after heal
- Extract redeliver_missed(tx, db, log) helper into cluster_transport.rs
- heal_region now removes partition then immediately ships any missed
  batch-log entries to the healed follower's channel
- await_convergence refactored to call the same helper (no logic change)
- tidal-server: reload_text_index before search in cluster mode
- tidal-server: write_signal returns Result instead of panicking on unknown signal
- tidal-server: leader shows lag_events=0 (writes directly, no receiver thread)
- tidal-server: fix cluster mode error propagation (ServerError::from)
- docs/runbooks/cluster.md: add full cluster operations runbook
- docker/: add Dockerfile for containerised cluster deployment
- README.md: add tidal-server HTTP API getting-started section
- Split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 11:57:01 -07:00

tidalDB Cluster Runbook

This runbook describes how to operate the simulated multi-region tidalDB cluster that ships with tidal-server. The cluster reuses the SimulatedCluster fabric — it runs multiple in-process nodes, replays the real WAL + CRDT reconciliation paths, and exposes a single HTTP surface for microservices.

Important limitations

  • Cluster mode currently replicates global signals only. Signals that carry a user_id or creator_id context are rejected so that followers stay consistent with the leader's WAL stream.
  • All metadata and embedding writes are broadcast to every region up front. There is no separate replication log for items yet.
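
A client can catch the first limitation before a request ever leaves the process. The following is a minimal sketch, not part of tidal-server: it assumes the per-user context appears as top-level user_id / creator_id keys in the signal payload, and the helper name check_global_signal is hypothetical. The server remains the source of truth for what is rejected.

```python
# Hypothetical client-side guard. ASSUMPTION: per-user contexts show up as
# top-level "user_id" / "creator_id" keys in the JSON payload.
PER_USER_CONTEXT_KEYS = ("user_id", "creator_id")


def check_global_signal(payload: dict) -> dict:
    """Return the payload unchanged, or raise if it carries a per-user context."""
    offending = [key for key in PER_USER_CONTEXT_KEYS if key in payload]
    if offending:
        raise ValueError(
            f"cluster mode accepts global signals only; remove {offending}"
        )
    return payload


# A global signal passes through untouched.
global_signal = check_global_signal({"entity_id": 1, "signal": "view", "weight": 1.0})
```

Failing fast on the client keeps the rejection out of the request path, but it does not replace the server-side check.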

Prerequisites

  • Rust toolchain ≥ 1.91 if running directly.
  • Docker 25+ if running via container.
  • Port 9500 available (default cluster listener).

1. Launch the cluster locally

cargo run -p tidal-server -- \
  cluster \
  --listen 127.0.0.1:9500 \
  --schema tidal-server/config/default-schema.yaml \
  --topology tidal-server/config/default-cluster.yaml

The default topology spins up three regions (us-east, eu-west, ap-south) with us-east as leader.

2. Launch via Docker

# Build the image once
docker build -f docker/cluster/Dockerfile -t tidal-cluster .

# Run (press Ctrl+C to stop)
docker run --rm -p 9500:9500 tidal-cluster

To supply custom schema/topology files:

docker run --rm -p 9500:9500 \
  -v $PWD/configs/my-schema.yaml:/srv/schema.yaml \
  -v $PWD/configs/my-topology.yaml:/srv/topology.yaml \
  tidal-cluster \
  tidal-server cluster \
    --listen 0.0.0.0:9500 \
    --schema /srv/schema.yaml \
    --topology /srv/topology.yaml

3. Core API calls

All routes are JSON unless noted.

Health

curl http://localhost:9500/health

Returns overall status and item count on the leader.

Register items & embeddings

curl -X POST http://localhost:9500/items \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "metadata": { "title": "Jazz Piano", "category": "music" } }'

curl -X POST http://localhost:9500/embeddings \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "values": [0.1, 0.2, 0.3, 0.4] }'
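
For services that seed items from code rather than curl, the same two requests can be built with the Python standard library. This is a sketch under assumptions: build_post is a hypothetical helper, and the payloads simply mirror the curl bodies above. The requests are constructed but not sent, so the snippet runs without a live cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9500"  # default cluster listener from this runbook


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given route."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


item = build_post(
    "/items",
    {"entity_id": 1, "metadata": {"title": "Jazz Piano", "category": "music"}},
)
embedding = build_post("/embeddings", {"entity_id": 1, "values": [0.1, 0.2, 0.3, 0.4]})
# Sending is left to the caller, e.g. urllib.request.urlopen(item, timeout=5).
```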

Record signals (cluster mode = global only)

curl -X POST http://localhost:9500/signals \
  -H 'Content-Type: application/json' \
  -d '{ "entity_id": 1, "signal": "view", "weight": 1.0 }'

Query feed & search

curl "http://localhost:9500/feed?user_id=42&profile=trending&limit=10"
curl "http://localhost:9500/search?query=jazz%20piano&limit=5"

# Target a specific region (followers may lag during partitions)
curl "http://localhost:9500/feed?profile=trending&region=eu-west"
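
When the read URLs above are assembled in code, the query string is easy to get wrong (spaces, optional region pinning). A small sketch using only urllib.parse follows; read_url is a hypothetical helper name, and the parameter names are the ones shown in the curl examples.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:9500"  # default cluster listener from this runbook


def read_url(route: str, base_url: str = BASE_URL, **params) -> str:
    """Build a GET URL, dropping parameters whose value is None."""
    present = {key: value for key, value in params.items() if value is not None}
    # quote_via=quote encodes spaces as %20, matching the curl examples.
    return f"{base_url}{route}?{urlencode(present, quote_via=quote)}"


search = read_url("/search", query="jazz piano", limit=5)
pinned_feed = read_url("/feed", profile="trending", region="eu-west")
leader_feed = read_url("/feed", profile="trending", region=None)
print(search)       # http://localhost:9500/search?query=jazz%20piano&limit=5
print(pinned_feed)  # http://localhost:9500/feed?profile=trending&region=eu-west
```

Passing region=None omits the parameter entirely, which leaves region selection to the server.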

4. Cluster operations

Check cluster status

curl http://localhost:9500/cluster/status | jq

Sample response:

{
  "leader": "us-east",
  "relay_log_len": 125,
  "regions": [
    { "name": "us-east", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "eu-west", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "ap-south", "applied_events": 124, "lag_events": 1, "partitioned": false }
  ]
}
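
For alerting or scripted checks, the status document can be reduced to just the regions that are behind. The sketch below relies only on the fields visible in the sample response (regions, name, lag_events); lagging_regions is a hypothetical helper name.

```python
import json


def lagging_regions(status: dict) -> dict:
    """Map region name -> lag_events for every region that is behind."""
    return {
        region["name"]: region["lag_events"]
        for region in status["regions"]
        if region["lag_events"] > 0
    }


# The sample response from this runbook, inlined so the sketch runs offline.
sample = json.loads("""
{
  "leader": "us-east",
  "relay_log_len": 125,
  "regions": [
    { "name": "us-east", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "eu-west", "applied_events": 125, "lag_events": 0, "partitioned": false },
    { "name": "ap-south", "applied_events": 124, "lag_events": 1, "partitioned": false }
  ]
}
""")

print(lagging_regions(sample))  # {'ap-south': 1}
```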

Promote a new leader

curl -X POST http://localhost:9500/cluster/promote \
  -H 'Content-Type: application/json' \
  -d '{ "region": "eu-west" }'

/cluster/status will now report eu-west as leader. New writes are routed there and replayed to the other regions.
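
A scripted failover can combine the promote call with the status check described above. This is a sketch under assumptions: http_json and promote_and_verify are hypothetical helpers, and the body of the promote response is not documented here, so it is ignored and any non-JSON body is tolerated. The transport is injectable so the logic can be exercised without a running cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9500"  # default cluster listener from this runbook


def http_json(method, path, payload=None, base_url=BASE_URL):
    """Send one request; return the decoded JSON body, or None if it is not JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        body = response.read()
    try:
        return json.loads(body)
    except ValueError:
        return None  # ASSUMPTION: some routes may answer with an empty or non-JSON body


def promote_and_verify(region, call=http_json):
    """Promote `region`, then confirm /cluster/status reports it as leader."""
    call("POST", "/cluster/promote", {"region": region})
    status = call("GET", "/cluster/status")
    if status["leader"] != region:
        raise RuntimeError(f"expected leader {region!r}, got {status['leader']!r}")
    return status


# Usage against a live cluster: promote_and_verify("eu-west")
```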

Simulate a partition & heal

# Isolate ap-south (writes will skip this follower)
curl -X POST http://localhost:9500/cluster/partition \
  -H 'Content-Type: application/json' \
  -d '{ "region": "ap-south" }'

# Heal the partition (missed batches are replayed automatically)
curl -X POST http://localhost:9500/cluster/heal \
  -H 'Content-Type: application/json' \
  -d '{ "region": "ap-south" }'

Monitor /cluster/status to confirm lag drops back to zero after healing.
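
Watching the status by hand works for a drill; in automation the same check is a bounded polling loop. The sketch below assumes only the lag_events and partitioned fields from the sample status response. wait_for_zero_lag is a hypothetical helper, and the timeout and interval defaults are arbitrary. fetch_status is any callable that returns the decoded /cluster/status document.

```python
import time


def wait_for_zero_lag(fetch_status, region, timeout_s=30.0, interval_s=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until `region` reports lag_events == 0 and is no longer partitioned."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        entry = next((r for r in status["regions"] if r["name"] == region), None)
        if entry is None:
            raise KeyError(f"region {region!r} not present in cluster status")
        if entry["lag_events"] == 0 and not entry["partitioned"]:
            return entry
        if clock() >= deadline:
            raise TimeoutError(
                f"{region} still lagging by {entry['lag_events']} events"
            )
        sleep(interval_s)
```

The sleep and clock parameters exist so the loop can be tested without real delays; in normal use they keep their defaults.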

5. Runbook checklist

  1. Startup: launch tidal-server cluster … (or Docker). Confirm the log line listening on http://….
  2. Baseline health: GET /health and GET /cluster/status return 200.
  3. Seed data: POST /items, /embeddings, /signals for initial items.
  4. Traffic: microservices call /signals, /feed, /search. Add the region query param to pin reads to a follower for canary reads.
  5. Failover: to move traffic during maintenance, POST /cluster/promote to the target region. Verify that /cluster/status reports the new leader before proceeding.
  6. Partition drill: POST /cluster/partition to isolate a follower, observe lag, then POST /cluster/heal.
  7. Shutdown: send SIGINT (Ctrl+C) or stop the container. The server logs shutdown signal received and exits cleanly.

Refer to docs/planning/ROADMAP.md for the underlying distributed fabric guarantees and property tests.