3 Commits

Author SHA1 Message Date
jordan
a0a33f4d9a feat: harden tidal-server for production (Weeks 1–3)
Week 1 — deployment prerequisites:
- Add TIDAL_API_KEY Bearer auth middleware (constant-time comparison)
- Handle SIGTERM alongside Ctrl-C for graceful shutdown
- Remove test-utils feature from production tidal-server binary
- Fix standalone Dockerfile; add cluster Dockerfile and docker-compose
- Extract MultiRegionState into state.rs with per-region TidalDb map

Week 2 — operational middleware and observability:
- Add body limit (2MB), request timeout (30s), concurrency limit (100)
- Add SetRequestIdLayer + PropagateRequestIdLayer (x-request-id header)
- Add TraceLayer with structured spans including request ID
- Activate Prometheus /metrics endpoint via --metrics flag
- Add monitoring.md, recovery.md, prometheus-alerts.yaml, grafana-dashboard.json

Week 3 — query latency histograms and middleware integration tests:
- Add QUERY_LATENCY_BOUNDS (100µs–10s) histogram to tidal library
- Instrument retrieve() and search() with tidaldb_retrieve_latency_us and tidaldb_search_latency_us
- Fix: search() latency now recorded on error paths (was skipped via ?)
- Lib+bin split in tidal-server enabling integration tests
- Add 8 middleware integration tests (auth, body limit, request ID)
- Add 2 Prometheus alert rules and 2 Grafana latency panels
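
The error-path fix in the Week 3 list (latency skipped when `?` returned early) can be illustrated with a small wrapper that records the elapsed time before handing the result back. This is only a sketch under assumptions: `timed` and its `record` callback are hypothetical names, and the real code records into a histogram rather than a closure.

```rust
use std::time::Instant;

/// Run `f`, report the elapsed microseconds to `record`, then return the
/// result unchanged. Because the measurement happens before the result is
/// returned, an `Err` is timed exactly like an `Ok`.
/// (Hypothetical helper, not the tidal library's API.)
fn timed<T, E>(
    record: impl FnOnce(u64),
    f: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    // Do not use `?` here: an early return would skip the recording below.
    let result = f();
    record(start.elapsed().as_micros() as u64);
    result
}
```

A caller applies `?` to the value returned by `timed`, so the early return happens only after the sample has been recorded.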

Post-review fixes:
- Fix SIGTERM handler compilation on non-Unix targets (#[cfg(unix)] guard)
- Exempt /health from TimeoutLayer + ConcurrencyLimitLayer (prevents false liveness failures under load)
- Case-insensitive Bearer scheme matching per RFC 7235 §2.1
2026-02-27 20:32:39 -07:00
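
For readers unfamiliar with the two auth details named in this commit (constant-time comparison and case-insensitive `Bearer` matching), the following is a minimal standard-library sketch. The function names are hypothetical and the commit's actual middleware is not shown in this log; the sketch works on a raw header string instead of a request type.

```rust
/// Compare two byte slices without stopping at the first mismatch, so the
/// running time does not reveal how many leading bytes matched.
/// Note: the early length check still reveals whether the lengths differ.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences instead of branching
    }
    diff == 0
}

/// Extract the token from an `Authorization` header value, accepting the
/// scheme name in any letter case ("Bearer", "bearer", "BEARER", ...).
fn bearer_token(header: &str) -> Option<&str> {
    let (scheme, rest) = header.split_once(' ')?;
    if !scheme.eq_ignore_ascii_case("bearer") {
        return None;
    }
    let token = rest.trim();
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}

/// Return true only when the header carries a Bearer token equal to
/// `expected` (assumed here to come from something like TIDAL_API_KEY).
fn is_authorized(header: Option<&str>, expected: &str) -> bool {
    match header.and_then(bearer_token) {
        Some(token) => constant_time_eq(token.as_bytes(), expected.as_bytes()),
        None => false,
    }
}
```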
jordan
eca7765e8d fix: heal_region re-delivers missed WAL batches so partitioned followers converge immediately after heal
- Extract redeliver_missed(tx, db, log) helper into cluster_transport.rs
- heal_region now removes partition then immediately ships any missed
  batch-log entries to the healed follower's channel
- await_convergence refactored to call the same helper (no logic change)
- tidal-server: reload_text_index before search in cluster mode
- tidal-server: write_signal returns Result instead of panicking on unknown signal
- tidal-server: leader shows lag_events=0 (writes directly, no receiver thread)
- tidal-server: fix cluster mode error propagation (ServerError::from)
- docs/runbooks/cluster.md: add full cluster operations runbook
- docker/: add Dockerfile for containerised cluster deployment
- README.md: add tidal-server HTTP API getting-started section
- Split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 11:57:01 -07:00
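
The re-delivery idea in this commit (ship the batch-log entries a partitioned follower missed as soon as the partition is removed) might look roughly like the sketch below. It is an assumption-laden illustration: the `Batch` type and the `applied_seq` parameter are hypothetical stand-ins, whereas the real helper is described as taking `(tx, db, log)`.

```rust
use std::sync::mpsc::{SendError, Sender};

/// Hypothetical WAL batch, identified by a monotonically increasing sequence.
#[derive(Clone, Debug, PartialEq)]
struct Batch {
    seq: u64,
}

/// Send every logged batch newer than `applied_seq` to the follower's
/// channel and return how many were shipped. `applied_seq` stands in for
/// whatever the follower's database reports as its last applied batch.
fn redeliver_missed(
    tx: &Sender<Batch>,
    applied_seq: u64,
    log: &[Batch],
) -> Result<usize, SendError<Batch>> {
    let mut sent = 0;
    for batch in log.iter().filter(|b| b.seq > applied_seq) {
        tx.send(batch.clone())?;
        sent += 1;
    }
    Ok(sent)
}
```

Sharing one helper between the heal path and the convergence wait, as the commit describes, keeps the definition of "missed" in a single place.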
jordan
51b4d1bbd6 fix: repair tidal-server compilation and verify standalone HTTP server
Fix 9 compilation errors across tidal-server and testing/cluster.rs so
that `cargo run -p tidal-server -- standalone` works end-to-end.

Bugs fixed:
- cluster.rs: wrong return types `RetrieveResult`→`Results` and
  `SearchResult`→`SearchResults` on retrieve/search helpers
- state.rs: `RegionId` imported from private path; now uses
  `tidaldb::replication::RegionId`
- state.rs: missing `Ok()` wrapper on `ServerState::cluster()` return
- state.rs: cluster match arms returned `TidalError` where `ServerError`
  required; added `.map_err(ServerError::from)` on write_item,
  write_embedding, retrieve, search
- error.rs: `Result<T>` alias lacked default E param; callers in router
  used two-arg form `Result<T, AppError>` — changed to
  `Result<T, E = ServerError>`
- router.rs: `with_state()` called before cluster routes were added,
  making `app` `Router<()>`; restructured to call `with_state` once at end
- router.rs: `TidalErrorWrapper(TidalError)` was applied directly to a `QueryError`;
  fixed with `|e| TidalErrorWrapper(e.into())`
- router.rs: `Search::limit()` takes `u32` but code cast to `usize`
- router.rs: `bm25_score`/`semantic_score` are `f32` in SearchResultItem
  but `f64` in response struct; added `.map(f64::from)` conversion

Also split cluster.rs into cluster.rs + cluster_transport.rs to stay
under the 600-line limit required by CODING_GUIDELINES §9.

Verified all README curl examples work:
  POST /items, POST /embeddings, POST /signals, GET /feed, GET /search,
  GET /health all return correct HTTP status codes and JSON responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 01:45:09 -07:00
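
The `error.rs` fix in this commit relies on a default type parameter in a type alias. A minimal sketch of the pattern follows, with placeholder error types; the real `ServerError` and `AppError` definitions are not shown in this log, and `load` and `route` are hypothetical functions.

```rust
/// Placeholder error types standing in for the server's real ones.
#[derive(Debug, PartialEq)]
pub struct ServerError(pub String);

#[derive(Debug, PartialEq)]
pub struct AppError(pub u16);

/// With `E = ServerError` as a default, both `Result<T>` and
/// `Result<T, OtherError>` are valid spellings of the alias.
pub type Result<T, E = ServerError> = std::result::Result<T, E>;

/// One-argument form: the error type defaults to `ServerError`.
fn load() -> Result<u32> {
    Ok(7)
}

/// Two-argument form: the caller overrides the error type.
fn route() -> Result<u32, AppError> {
    Err(AppError(404))
}
```

Without the default, an alias declared as `Result<T>` accepts exactly one type argument, so the two-argument form fails to compile.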