tidaldb/docs/planning/milestone-8/phase-5/OVERVIEW.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
2026-02-24 13:17:19 -07:00


m8p5: Control Plane, Multi-Tenancy, and Routing

Delivers

Tenant isolation, routing configuration, and operational tooling for a hosted multi-tenant deployment. Each tenant (agent workspace) gets its own WAL namespace and resource quotas. The control plane manages shard-to-region assignment, tenant placement, and rolling upgrades. From the operator's side, migrating a tenant to a new region requires only a routing configuration change; TenantMigration performs the underlying data movement.

Deliverables:

  • TenantId(u64): tenant identity type; WAL segments namespaced by tenant
  • TenantConfig: per-tenant quota (max signals/sec, max entities, max storage bytes), residency policy (required regions)
  • TenantRouter: extends ShardRouter with tenant-aware routing; tenant -> shard mapping
  • ControlPlane: manages cluster topology (shard assignments, tenant placement, region health)
  • TenantMigration: moves a tenant to a new shard/region by shipping WAL segments + state snapshot; zero-downtime via dual-write window
  • RollingUpgradeCoordinator: upgrades nodes one at a time with drain + upgrade + rejoin; uses WAL shipping to keep followers current during the window
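
To make the first two deliverables concrete, here is a minimal sketch of the identity and configuration types. The derive list and the directory layout follow the acceptance criteria later in this document. The field names, the `RegionId` type, the `tenant_wal_dir` helper, and the decimal formatting of the tenant id in the path are assumptions made for illustration, not the actual tidalDB API.

```rust
use std::path::{Path, PathBuf};

/// Tenant identity. The derive list mirrors the phase-level acceptance criteria.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct TenantId(pub u64);

/// Hypothetical region identifier; the real type may differ.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct RegionId(pub u16);

/// Per-tenant quota and residency policy. Field names are assumptions.
#[derive(Clone, Debug, PartialEq)]
pub struct TenantConfig {
    pub max_signals_per_sec: u32,
    pub max_entities: u64,
    pub max_storage_bytes: u64,
    /// Regions the tenant's data must stay in.
    /// In this sketch an empty list means "no residency constraint".
    pub required_regions: Vec<RegionId>,
}

/// Builds {data_dir}/tenants/{tenant_id}/wal for a tenant.
/// Formatting the id as a decimal string is an assumption.
pub fn tenant_wal_dir(data_dir: &Path, tenant: TenantId) -> PathBuf {
    data_dir
        .join("tenants")
        .join(tenant.0.to_string())
        .join("wal")
}
```

Deriving the WAL path in one helper keeps the namespacing rule in a single place, so segment naming and tenant-scoped initialization cannot drift apart.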

Dependencies

  • Requires: Phase 8.2 (WAL shipping), Phase 8.3 (reconciliation), Phase 8.4 (session continuity)
  • Files modified:
    • tidal/src/db/config.rs -- add tenant configuration fields
    • tidal/src/replication/shard.rs -- extend ShardRouter with tenant routing
    • tidal/src/wal/segment.rs -- tenant-namespaced segment directories
    • tidal/src/db/open.rs -- tenant-scoped initialization
  • Files created:
    • tidal/src/replication/tenant.rs -- TenantId, TenantConfig, TenantRouter
    • tidal/src/replication/control.rs -- ControlPlane, topology management
    • tidal/src/replication/migration.rs -- TenantMigration
    • tidal/src/replication/upgrade.rs -- RollingUpgradeCoordinator

Research References

  • thoughts.md -- Part I/Citadel (per-tenant filesystem isolation: "every tenant is an island")

Acceptance Criteria (Phase Level)

  • TenantId(u64) is Copy + Clone + Debug + Eq + Hash + Ord; WAL segment directories are namespaced as {data_dir}/tenants/{tenant_id}/wal/
  • TenantConfig enforces quotas: signals/sec (token bucket), max entities (hard cap), max storage bytes (checked on write); violations return TidalError::QuotaExceeded
  • TenantRouter maps (TenantId, EntityId) -> (RegionId, ShardId); default is hash-based; residency policy constrains which regions a tenant's data can reside in
  • ControlPlane exposes cluster health: per-shard entity count, signal throughput, replication lag, disk usage; serializable to JSON for monitoring integration
  • Tenant migration test: move tenant from shard A to shard B; during migration, dual-write ensures no signal loss; after migration, shard A's tenant data is garbage-collected; total downtime = 0 (reads served from both shards during migration window)
  • Rolling upgrade: upgrade 1 of 3 nodes; WAL shipping continues to remaining 2; upgraded node rejoins and catches up from WAL; total query availability = 100% during the upgrade window
  • Per-tenant WAL isolation: a misbehaving tenant (burst of 100K signals/sec) is throttled without affecting other tenants on the same shard; rate limiter returns TidalError::QuotaExceeded within 1ms
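
The routing criterion above (hash-based default, constrained by residency policy) could look roughly like the following sketch. All type definitions here are stand-ins, the `route` signature is hypothetical, and `DefaultHasher` is used only to keep the example dependency-free: its output is not guaranteed to be stable across Rust releases, so a real router would need a hash with a fixed, documented algorithm.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in identifier types for illustration only.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct TenantId(pub u64);
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct EntityId(pub u64);
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct RegionId(pub u16);
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct ShardId(pub u32);

pub struct TenantRouter {
    /// Every (region, shard) pair known to the cluster.
    shards: Vec<(RegionId, ShardId)>,
}

impl TenantRouter {
    pub fn new(shards: Vec<(RegionId, ShardId)>) -> Self {
        Self { shards }
    }

    /// Maps (TenantId, EntityId) to (RegionId, ShardId).
    /// `allowed_regions` is the tenant's residency policy; an empty slice
    /// means any region is acceptable (an assumption of this sketch).
    /// Returns None when no shard satisfies the policy.
    pub fn route(
        &self,
        tenant: TenantId,
        entity: EntityId,
        allowed_regions: &[RegionId],
    ) -> Option<(RegionId, ShardId)> {
        // Residency policy narrows the candidate set before hashing.
        let candidates: Vec<(RegionId, ShardId)> = self
            .shards
            .iter()
            .copied()
            .filter(|(region, _)| {
                allowed_regions.is_empty() || allowed_regions.contains(region)
            })
            .collect();
        if candidates.is_empty() {
            return None;
        }
        // Hash-based default placement among the remaining candidates.
        let mut hasher = DefaultHasher::new();
        (tenant, entity).hash(&mut hasher);
        let idx = (hasher.finish() % candidates.len() as u64) as usize;
        Some(candidates[idx])
    }
}
```

Filtering before hashing means a residency change can move a tenant's entities to different shards, which is one reason a migration step is needed whenever the policy changes.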

Task Execution Order

Task 01: TenantId + TenantConfig ──────────┐
                                            ├──> Task 03: ControlPlane
Task 02: TenantRouter ────────────────────┤
                                            ├──> Task 04: TenantMigration
                                            │
                                            └──> Task 05: RollingUpgrade
                                                      │
                                                      v
                                            Task 06: Multi-Tenancy Integration Tests

Tasks 01 and 02 can run in parallel. Tasks 03, 04, and 05 each depend on both of them. Task 06 depends on all five.

Module Location

| File | Status | Contains |
|------|--------|----------|
| tidal/src/replication/tenant.rs | NEW | TenantId, TenantConfig, TenantRouter, quota enforcement |
| tidal/src/replication/control.rs | NEW | ControlPlane, cluster topology, health metrics |
| tidal/src/replication/migration.rs | NEW | TenantMigration, dual-write protocol |
| tidal/src/replication/upgrade.rs | NEW | RollingUpgradeCoordinator |
| tidal/src/db/config.rs | MODIFIED | Tenant config fields |
| tidal/src/replication/shard.rs | MODIFIED | Tenant-aware routing |
| tidal/src/wal/segment.rs | MODIFIED | Tenant-namespaced directories |
| tidal/src/db/open.rs | MODIFIED | Tenant-scoped initialization |

Notes

Tenant isolation follows Citadel's model

Each tenant gets its own filesystem directory, its own WAL files, and its own rate limiter. The OS enforces the filesystem boundary; the rate limiter is enforced in-process. A misbehaving tenant cannot affect others because its writes go to separate files and its rate limiter is checked before the WAL write.
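
A rough sketch of the "rate limiter checked before the WAL write" idea, using one token bucket per tenant. Only `TidalError::QuotaExceeded` comes from this document; `TokenBucket`, `TenantLimiters`, their methods, and the choice of a one-second burst capacity are assumptions. Time is passed in explicitly so the behavior is easy to test.

```rust
use std::collections::HashMap;
use std::time::Instant;

#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct TenantId(pub u64);

#[derive(Debug, PartialEq, Eq)]
pub enum TidalError {
    QuotaExceeded,
}

/// Token bucket whose burst capacity equals one second of quota (an assumption).
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(signals_per_sec: u32, now: Instant) -> Self {
        let rate = f64::from(signals_per_sec);
        Self { capacity: rate, tokens: rate, refill_per_sec: rate, last_refill: now }
    }

    /// Refills according to elapsed time, then takes one token if available.
    pub fn try_acquire(&mut self, now: Instant) -> Result<(), TidalError> {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            Err(TidalError::QuotaExceeded)
        }
    }
}

/// One bucket per tenant, so draining one bucket never touches another.
pub struct TenantLimiters {
    buckets: HashMap<TenantId, TokenBucket>,
}

impl TenantLimiters {
    pub fn new() -> Self {
        Self { buckets: HashMap::new() }
    }

    pub fn register(&mut self, tenant: TenantId, signals_per_sec: u32, now: Instant) {
        self.buckets.insert(tenant, TokenBucket::new(signals_per_sec, now));
    }

    /// Call this before appending to the tenant's WAL; skip the write on Err.
    /// Unregistered tenants are rejected, a simplification of this sketch.
    pub fn admit(&mut self, tenant: TenantId, now: Instant) -> Result<(), TidalError> {
        match self.buckets.get_mut(&tenant) {
            Some(bucket) => bucket.try_acquire(now),
            None => Err(TidalError::QuotaExceeded),
        }
    }
}
```

Because the check is a handful of arithmetic operations on in-memory state, rejecting an over-quota write avoids any disk I/O, which is what makes a tight rejection latency target plausible.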

Migration via dual-write

During migration, writes for the migrating tenant go to both the old shard and the new shard. After the new shard has caught up (verified by seqno matching), reads are switched to the new shard, and the old shard's tenant data is garbage-collected. This is the CockroachDB range-split model adapted for tenant migration.
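
The sequence described above can be sketched as a small state machine. The phase names, method names, and seqno bookkeeping are hypothetical; the sketch also assumes that reads stay on the old shard until cut-over and that writes go only to the new shard afterwards, which this document does not spell out.

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct ShardId(pub u32);

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum MigrationPhase {
    /// Writes go to both shards; reads are still served by the source.
    DualWrite,
    /// Target has caught up; reads are served by the target.
    CutOver,
    /// The source shard's copy of the tenant data has been garbage-collected.
    Complete,
}

pub struct TenantMigration {
    pub source: ShardId,
    pub target: ShardId,
    pub phase: MigrationPhase,
    source_seqno: u64,
    target_seqno: u64,
}

impl TenantMigration {
    /// Starts in the dual-write window; the target has applied nothing yet.
    pub fn start(source: ShardId, target: ShardId, source_seqno: u64) -> Self {
        Self { source, target, phase: MigrationPhase::DualWrite, source_seqno, target_seqno: 0 }
    }

    /// Shards that must receive each new write in the current phase.
    pub fn write_targets(&self) -> Vec<ShardId> {
        match self.phase {
            MigrationPhase::DualWrite => vec![self.source, self.target],
            MigrationPhase::CutOver | MigrationPhase::Complete => vec![self.target],
        }
    }

    /// Shard that serves reads in the current phase.
    pub fn read_shard(&self) -> ShardId {
        match self.phase {
            MigrationPhase::DualWrite => self.source,
            MigrationPhase::CutOver | MigrationPhase::Complete => self.target,
        }
    }

    /// Records a write accepted by the source and returns its seqno.
    pub fn on_source_write(&mut self) -> u64 {
        self.source_seqno += 1;
        self.source_seqno
    }

    /// Records the highest seqno the target has applied (snapshot, shipped WAL, or dual-write).
    pub fn on_target_applied(&mut self, seqno: u64) {
        self.target_seqno = self.target_seqno.max(seqno);
    }

    /// Switches reads to the target only once the seqnos match.
    pub fn try_cut_over(&mut self) -> bool {
        if self.phase == MigrationPhase::DualWrite && self.target_seqno == self.source_seqno {
            self.phase = MigrationPhase::CutOver;
            true
        } else {
            false
        }
    }

    /// Marks the source copy as garbage-collected; valid only after cut-over.
    pub fn finish_gc(&mut self) -> bool {
        if self.phase == MigrationPhase::CutOver {
            self.phase = MigrationPhase::Complete;
            true
        } else {
            false
        }
    }
}
```

Gating cut-over on seqno equality is what prevents signal loss: the target never serves reads while it is missing a write the source has acknowledged.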

Control plane is embedded, not external

The ControlPlane runs within the leader node's process (or a designated coordinator node). It is not a separate service. This matches tidalDB's embeddable philosophy.

Done When

A developer can configure 3 tenants on a 3-shard cluster, apply per-tenant rate limits, migrate a tenant from one shard to another with zero downtime, perform a rolling upgrade of all nodes, and observe that per-tenant isolation prevents noisy-neighbor effects throughout.