This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# Three-Node Cluster Architecture Diagram

## High-Level Topology

```
┌──────────────────────────────────────────────────────────────────────┐
│                             Client Layer                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐              │
│  │    Agents    │   │  Dashboard   │   │  CLI Tools   │              │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘              │
│         │                  │                  │                      │
│         └──────────────────┴──────────────────┘                      │
│                            │                                         │
│                            │ HTTPS (443)                             │
│                            ▼                                         │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                         Load Balancer Layer                          │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    Nginx / Envoy / AWS ALB                     │  │
│  │  • Round-robin distribution                                    │  │
│  │  • Health checks (5s interval)                                 │  │
│  │  • TLS termination                                             │  │
│  │  • Removes failed nodes automatically                          │  │
│  └────────┬─────────────────────┬─────────────────────┬───────────┘  │
│           │                     │                     │ HTTP (18180) │
│           ▼                     ▼                     ▼              │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                        StemeDB Cluster Nodes                         │
│                                                                      │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐     │
│  │  Node 1         │   │  Node 2         │   │  Node 3         │     │
│  │  10.0.1.51      │   │  10.0.1.52      │   │  10.0.1.53      │     │
│  │                 │   │                 │   │                 │     │
│  │  stemedb-api    │   │  stemedb-api    │   │  stemedb-api    │     │
│  │  :18180 (API)   │   │  :18180 (API)   │   │  :18180 (API)   │     │
│  │  :18181 (Gate)  │   │  :18181 (Gate)  │   │  :18181 (Gate)  │     │
│  │  :18182 (RPC)   │   │  :18182 (RPC)   │   │  :18182 (RPC)   │     │
│  │  :18183 (SWIM)  │   │  :18183 (SWIM)  │   │  :18183 (SWIM)  │     │
│  │                 │   │                 │   │                 │     │
│  │  /data/wal/     │   │  /data/wal/     │   │  /data/wal/     │     │
│  │  /data/db/      │   │  /data/db/      │   │  /data/db/      │     │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘     │
│           │                     │                     │              │
│           └─────────────────────┴─────────────────────┘              │
│                                 │                                    │
│                   SWIM Gossip + gRPC Replication                     │
│                      (UDP 18183 + TCP 18182)                         │
│                       Replication Factor: 2                          │
└──────────────────────────────────────────────────────────────────────┘
```

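The load balancer behaviour in the topology (round-robin distribution, with failed nodes removed automatically) can be illustrated with a small sketch. This is not how Nginx, Envoy, or an ALB is implemented; the `RoundRobinBalancer` class and its method names are hypothetical, and health-check results are assumed to be fed in from outside.

```python
class RoundRobinBalancer:
    """Hypothetical sketch: rotate over backends, skipping unhealthy ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to try

    def mark_unhealthy(self, backend):
        # Called when a health check (every 5s in the diagram) fails.
        self.healthy.discard(backend)

    def mark_healthy(self, backend):
        # Called when a previously failed backend passes its check again.
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, in rotation order.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["10.0.1.51", "10.0.1.52", "10.0.1.53"])
first_four = [lb.pick() for _ in range(4)]
lb.mark_unhealthy("10.0.1.52")
after_failure = [lb.pick() for _ in range(3)]
```

With all nodes healthy the picks rotate through the three addresses; once `10.0.1.52` is marked unhealthy it is skipped until it is marked healthy again.
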
## Inter-Node Communication

```
Node 1 ◀──────────────────────────────────────────────────▶ Node 2
  │                                                            │
  │  SWIM Gossip (UDP 18183)                                   │
  │  • Membership: "Node 2 is UP"                              │
  │  • Failure detection: ping/ack                             │
  │  • Frequency: every 1 second                               │
  │                                                            │
  │  gRPC Replication (TCP 18182)                              │
  │  • Push assertions: "Assert X written to Node 1"           │
  │  • Pull sync: Merkle tree comparison                       │
  │  • Frequency: continuous                                   │
  │                                                            │
  │                                                            │
  ▼                                                            ▼
  ◀────────────────────────────────────────────────────────────▶
                              Node 3
                 (Same protocol with Node 1 & 2)
```

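The ping/ack failure detection above can be sketched as a small membership table. This is a deliberately simplified stand-in, not the SWIM protocol itself: it omits indirect probes and gossip dissemination, and the `Membership` class and its thresholds (`suspect_after`, `down_after`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

ALIVE, SUSPECT, DOWN = "alive", "suspect", "down"


@dataclass
class Membership:
    """Hypothetical membership table driven by ping/ack results."""

    suspect_after: int = 1  # missed acks before a node is suspected
    down_after: int = 3     # missed acks before a node is declared down
    missed: dict = field(default_factory=dict)

    def add(self, node):
        self.missed[node] = 0

    def record_probe(self, node, acked):
        # One probe per period (every 1 second in the diagram).
        # An ack clears the counter; a miss increments it.
        self.missed[node] = 0 if acked else self.missed[node] + 1

    def state(self, node):
        count = self.missed[node]
        if count >= self.down_after:
            return DOWN
        if count >= self.suspect_after:
            return SUSPECT
        return ALIVE


members = Membership()
members.add("node2")
members.record_probe("node2", acked=False)
state_after_one_miss = members.state("node2")
```

A single missed ack only moves the node to `suspect`; it takes `down_after` consecutive misses to declare it down, and any ack resets it to `alive`.
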
## Write Path (Replication Factor 2)

```
Client submits assertion
        │
        ▼
Load Balancer (routes to Node 1)
        │
        ▼
┌───────────────────────────────────────┐
│  Node 1 (Coordinator)                 │
│                                       │
│  1. Validate assertion                │
│  2. Write to local WAL (fsync)        │
│  3. Return 201 Created to client      │
│  4. Async replicate to Node 2         │
│     (background, no blocking)         │
└───────────────┬───────────────────────┘
                │
                │ gRPC (async)
                ▼
    ┌────────────────────────┐
    │  Node 2 (Replica)      │
    │  1. Receive assertion  │
    │  2. Write to WAL       │
    │  3. ACK to Node 1      │
    └────────────────────────┘

    (Node 3 may also receive a replica,
     depending on hash-based shard assignment)
```

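The coordinator steps in the write path (validate, append to the WAL with fsync, acknowledge, then replicate in the background) can be sketched as follows. This is an illustrative assumption, not StemeDB's implementation: the `Coordinator` class, the JSON-lines WAL format, and the `outbox` queue standing in for the async gRPC replicator are all hypothetical.

```python
import json
import os
import queue
import tempfile


class Coordinator:
    """Hypothetical sketch of the coordinator side of the write path."""

    def __init__(self, wal_path):
        self.wal = open(wal_path, "ab")
        # Drained by a background replicator (not shown in this sketch).
        self.outbox = queue.Queue()

    def submit(self, assertion):
        # 1. Validate (assumed rule: a concept_path is required).
        if not assertion.get("concept_path"):
            return 400
        # 2. Append to the local WAL and force it to disk.
        record = json.dumps(assertion, sort_keys=True) + "\n"
        self.wal.write(record.encode("utf-8"))
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 4. Queue for asynchronous replication; this does not block.
        self.outbox.put(assertion)
        # 3. Acknowledge once the local write is durable.
        return 201


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
node1 = Coordinator(wal_path)
status = node1.submit({"concept_path": "drug/aspirin/safety", "value": "safe"})
```

Because the client is acknowledged after the local fsync and before any replica confirms, the write is durable on one node at the moment `201` is returned; the second copy arrives later via the queue.
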
## Read Path (Eventually Consistent)

```
Client queries concept_path: "drug/aspirin/safety"
        │
        ▼
Load Balancer (routes to any node, e.g., Node 2)
        │
        ▼
┌───────────────────────────────────────┐
│  Node 2 (Query Handler)               │
│                                       │
│  1. Check local KV store              │
│  2. Apply lens (RecencyLens)          │
│  3. Resolve conflicts (CRDTs)         │
│  4. Return result to client           │
│                                       │
│  No coordination with other nodes!    │
└───────────────────────────────────────┘
        │
        ▼
Client receives result (may be slightly stale if replication lags)
```

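A local-only read with a recency lens might look like the sketch below. The assertion fields (`id`, `timestamp`, `value`), the in-memory `local_store`, and the tie-break on `id` are assumptions for illustration; the deterministic tie-break is only a stand-in for the CRDT conflict resolution named in the diagram, which is not specified here.

```python
def recency_lens(assertions):
    """Return the most recent assertion, or None if there are none.

    Ties on timestamp are broken by id so that every node that holds the
    same assertions picks the same winner without coordinating.
    """
    return max(assertions, key=lambda a: (a["timestamp"], a["id"]), default=None)


# Hypothetical local KV store: concept path -> assertions held on this node.
local_store = {
    "drug/aspirin/safety": [
        {"id": "a1", "timestamp": 100, "value": "safe"},
        {"id": "a2", "timestamp": 250, "value": "safe with caveats"},
    ]
}


def query(concept_path):
    # Reads touch only local state; no other node is contacted.
    return recency_lens(local_store.get(concept_path, []))


result = query("drug/aspirin/safety")
```

If a newer assertion has been written on another node but has not yet replicated, this query still returns `a2`, which is the staleness the diagram refers to.
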
## Failure Scenario: Node 2 Down

```
Initial State (All UP):
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│   UP   │  │   UP   │  │   UP   │
└───┬────┘  └───┬────┘  └───┬────┘
    │           │           │
    └───────────┴───────────┘
       SWIM: All healthy


Node 2 Failure:
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│   UP   │  │ ❌ DOWN│  │   UP   │
└───┬────┘  └────────┘  └───┬────┘
    │                       │
    └───────────────────────┘
SWIM: Node 2 detected as DOWN
Load Balancer: Health check fails, routes to Node 1 & 3
Replication: Factor 2 maintained (data on Node 1 & 3)


Recovery (Automatic):
┌────────┐              ┌────────┐
│ Node 1 │              │ Node 3 │
│   UP   │──────────────│   UP   │
└────────┘              └────────┘
Cluster continues operating
No data loss (replicated)
No manual intervention

RTO: <1 minute (automatic)
RPO: 0 (no data loss)
```

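The document mentions hash-based shard assignment but does not say which scheme is used. As one possible illustration, the sketch below uses rendezvous (highest-random-weight) hashing, a technique swapped in here for the example, to pick two replica nodes per key from whichever nodes are currently up. The function name `replicas_for` and the node names are hypothetical.

```python
import hashlib


def replicas_for(key, live_nodes, factor=2):
    """Pick `factor` nodes for `key` by rendezvous hashing (illustrative only)."""

    def score(node):
        # Each (node, key) pair gets a stable pseudo-random score.
        return hashlib.sha256(f"{node}:{key}".encode("utf-8")).hexdigest()

    # The highest-scoring live nodes own the key.
    return sorted(live_nodes, key=score, reverse=True)[:factor]


key = "drug/aspirin/safety"
all_up = replicas_for(key, ["node1", "node2", "node3"])
node2_down = replicas_for(key, ["node1", "node3"])
```

With this scheme, keys that were not placed on Node 2 keep exactly the same owners when it fails, and keys that were on Node 2 are reassigned to the remaining two nodes, which is consistent with the "data on Node 1 & 3" state in the diagram.
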
## Merkle Sync (Convergence)

```
Node 1                           Node 2
┌──────────────┐                 ┌──────────────┐
│ Merkle Tree  │                 │ Merkle Tree  │
│ Root: ABC123 │◀────────────────│ Root: DEF456 │
│              │  Compare roots  │              │
│ /drug/       │    (differ!)    │ /drug/       │
│ /treatment/  │────────────────▶│ /treatment/  │
└──────────────┘                 └──────────────┘
        │                                │
        │    Descend tree, find diffs    │
        ▼                                ▼
Node 1 has:                      Node 2 has:
- Assert X (missing on Node 2)   - Assert Y (missing on Node 1)
- Assert Z (both have)           - Assert Z (both have)

        │                                │
        ▼                                ▼
           Exchange missing assertions
        │                                │
        ▼                                ▼
           Both nodes now have: X, Y, Z
            Root hash: GHI789 (same!)

              Convergence achieved.
```

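The compare-roots-then-descend procedure above can be sketched with a two-level hash tree: one leaf hash per top-level path prefix (such as `drug` or `treatment`) and a root hash over the leaves. The tree shape, the `build_tree` and `missing_from` helpers, and the use of SHA-256 are assumptions for illustration; the document does not specify StemeDB's actual tree layout.

```python
import hashlib


def digest(*parts):
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def build_tree(assertions):
    """assertions: dict of assertion id -> concept path."""
    buckets = {}
    for assertion_id, path in assertions.items():
        prefix = path.split("/")[0]
        buckets.setdefault(prefix, set()).add(assertion_id)
    # One leaf hash per prefix, then a root hash over the sorted leaves.
    leaves = {p: digest(*sorted(ids)) for p, ids in buckets.items()}
    root = digest(*(f"{p}={leaves[p]}" for p in sorted(leaves)))
    return root, leaves, buckets


def missing_from(local, remote):
    """Ids present in `remote` but absent from `local`."""
    l_root, l_leaves, l_buckets = build_tree(local)
    r_root, r_leaves, r_buckets = build_tree(remote)
    if l_root == r_root:
        return set()  # roots match, nothing to exchange
    missing = set()
    for prefix, leaf in r_leaves.items():
        # Descend only into subtrees whose hashes differ.
        if l_leaves.get(prefix) != leaf:
            missing |= r_buckets[prefix] - l_buckets.get(prefix, set())
    return missing


node1 = {"X": "drug/aspirin/safety", "Z": "treatment/headache"}
node2 = {"Y": "drug/ibuprofen/safety", "Z": "treatment/headache"}
node1_needs = missing_from(node1, node2)
node2_needs = missing_from(node2, node1)
```

Here only the `drug` subtree differs, so the `treatment` bucket is never compared item by item; after each side fetches what it is missing, both trees hash to the same root.
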
## Cluster Health Monitoring

```
┌─────────────────────────────────────────────────┐
│                   Prometheus                    │
│          Scrapes all 3 nodes every 15s          │
│                                                 │
│  Metrics:                                       │
│  - up{node="node1"} = 1                         │
│  - up{node="node2"} = 1                         │
│  - up{node="node3"} = 1                         │
│  - replication_lag_seconds{node="node2"} = 0.5  │
│  - stemedb_query_latency_seconds{node="node1"}  │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
         ┌─────────────────┐
         │     Grafana     │
         │    Dashboard    │
         │  • Cluster map  │
         │  • Latency p99  │
         │  • Repl lag     │
         └─────────────────┘
```

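To show how the scraped metrics could drive alerts, here is a minimal sketch that reads samples in Prometheus text exposition form and flags down nodes and excessive replication lag. In practice this logic would live in Prometheus alerting rules rather than application code; the `check_cluster` function and the 5-second lag threshold are assumptions, and only the metric names come from the diagram.

```python
import re

# Matches samples such as: up{node="node1"} 1
SAMPLE = re.compile(r'^(\w+)\{node="([^"]+)"\}\s+(\S+)$')


def check_cluster(exposition, max_lag_seconds=5.0):
    """Return alert strings for down nodes and lagging replicas."""
    alerts = []
    for line in exposition.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and other metrics
        name, node, value = match.group(1), match.group(2), float(match.group(3))
        if name == "up" and value == 0:
            alerts.append(f"{node} is down")
        elif name == "replication_lag_seconds" and value > max_lag_seconds:
            alerts.append(f"{node} replication lag {value}s exceeds {max_lag_seconds}s")
    return alerts


scrape = """
up{node="node1"} 1
up{node="node2"} 0
up{node="node3"} 1
replication_lag_seconds{node="node2"} 0.5
replication_lag_seconds{node="node3"} 12
"""
alerts = check_cluster(scrape)
```
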
## Key Characteristics

- **High Availability:** Survives 1 node failure (99.9% uptime)
- **Replication:** Factor 2 (each assertion on 2 nodes)
- **Consistency:** Eventual (CRDTs + Merkle sync)
- **Recovery:** Automatic (<5 minute RTO)
- **Capacity:** <100K assertions, <1K queries/sec
- **Cost:** ~$425/month (AWS t3.xlarge × 3)
- **Use Case:** Production deployments, enterprise pilots

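As a quick check on what the 99.9% uptime figure implies, the arithmetic below converts an availability target into a monthly downtime budget and relates it to the 5-minute RTO bound listed above. The 30-day month and the helper name `downtime_budget_minutes` are assumptions for the example.

```python
def downtime_budget_minutes(availability, days=30):
    """Minutes of allowed downtime over `days` at the given availability."""
    return (1 - availability) * days * 24 * 60


budget = downtime_budget_minutes(0.999)   # about 43.2 minutes per 30 days
worst_case_recoveries = int(budget // 5)  # full 5-minute recoveries that fit
```

At 99.9%, a 30-day month allows roughly 43 minutes of downtime, which leaves room for 8 recoveries that each take the full 5 minutes.
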
✅ RECOMMENDED FOR PRODUCTION