stemedb/docs/operations/reference-architecture/diagrams/three-node.txt
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00


# Three-Node Cluster Architecture Diagram
## High-Level Topology
```
┌──────────────────────────────────────────────────────────────────────┐
│                             Client Layer                             │
│      ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│      │    Agents    │     │  Dashboard   │     │  CLI Tools   │      │
│      └──────┬───────┘     └──────┬───────┘     └──────┬───────┘      │
│             │                    │                    │              │
│             └────────────────────┴────────────────────┘              │
│                                  │                                   │
│                                  │ HTTPS (443)                       │
│                                  ▼                                   │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                         Load Balancer Layer                          │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    Nginx / Envoy / AWS ALB                     │  │
│  │  • Round-robin distribution                                    │  │
│  │  • Health checks (5s interval)                                 │  │
│  │  • TLS termination                                             │  │
│  │  • Removes failed nodes automatically                          │  │
│  └──────────┬────────────────────┬────────────────────┬───────────┘  │
│             │                    │                    │ HTTP (18180) │
│             ▼                    ▼                    ▼              │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                        StemeDB Cluster Nodes                         │
│                                                                      │
│    ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐     │
│    │  Node 1         │  │  Node 2         │  │  Node 3         │     │
│    │  10.0.1.51      │  │  10.0.1.52      │  │  10.0.1.53      │     │
│    │                 │  │                 │  │                 │     │
│    │  stemedb-api    │  │  stemedb-api    │  │  stemedb-api    │     │
│    │  :18180 (API)   │  │  :18180 (API)   │  │  :18180 (API)   │     │
│    │  :18181 (Gate)  │  │  :18181 (Gate)  │  │  :18181 (Gate)  │     │
│    │  :18182 (RPC)   │  │  :18182 (RPC)   │  │  :18182 (RPC)   │     │
│    │  :18183 (SWIM)  │  │  :18183 (SWIM)  │  │  :18183 (SWIM)  │     │
│    │                 │  │                 │  │                 │     │
│    │  /data/wal/     │  │  /data/wal/     │  │  /data/wal/     │     │
│    │  /data/db/      │  │  /data/db/      │  │  /data/db/      │     │
│    └────────┬────────┘  └────────┬────────┘  └────────┬────────┘     │
│             │                    │                    │              │
│             └────────────────────┴────────────────────┘              │
│                                  │                                   │
│                    SWIM Gossip + gRPC Replication                    │
│                       (UDP 18183 + TCP 18182)                        │
│                        Replication Factor: 2                         │
└──────────────────────────────────────────────────────────────────────┘
```
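
The load balancer behaviour in the diagram (round-robin distribution, health checks, automatic removal of failed nodes) can be illustrated with a small simulation. This is only a sketch: the class `RoundRobinBalancer` and its `mark` and `pick` methods are hypothetical names, and a real deployment would rely on Nginx, Envoy, or an ALB rather than code like this.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark(self, backend, is_healthy):
        """Record the outcome of one health check for a backend."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend in rotation, or None if all are down."""
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        return None


# Addresses and API port taken from the diagram above.
nodes = ["10.0.1.51:18180", "10.0.1.52:18180", "10.0.1.53:18180"]
lb = RoundRobinBalancer(nodes)

first_round = [lb.pick() for _ in range(3)]    # visits every node once
lb.mark("10.0.1.52:18180", False)              # Node 2 fails its health check
after_failure = [lb.pick() for _ in range(4)]  # alternates between Node 1 and Node 3
```

Once Node 2 passes a health check again, calling `lb.mark("10.0.1.52:18180", True)` puts it back into rotation.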
## Inter-Node Communication
```
Node 1 ◀──────────────────────────────────────────────────▶ Node 2
   │                                                          │
   │   SWIM Gossip (UDP 18183)                                │
   │     • Membership: "Node 2 is UP"                         │
   │     • Failure detection: ping/ack                        │
   │     • Frequency: every 1 second                          │
   │                                                          │
   │   gRPC Replication (TCP 18182)                           │
   │     • Push assertions: "Assert X written to Node 1"      │
   │     • Pull sync: Merkle tree comparison                  │
   │     • Frequency: continuous                              │
   │                                                          │
   │                                                          │
   ▼                                                          ▼
   ◀──────────────────────────────────────────────────────────▶
                              Node 3
                (Same protocol with Node 1 & 2)
```
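
The ping/ack failure detection named above can be sketched as one SWIM-style probe round: ping the target directly, and if that fails, ask other members to ping it before marking it suspect. This is a simplified illustration, not StemeDB's implementation. The function `probe`, the `can_reach` callback (standing in for a UDP ping/ack exchange), and the fan-out `k` are all assumptions made for the example.

```python
import random


def probe(target, members, can_reach, k=2, rng=random):
    """Run one probe round for `target` from the point of view of members[0].

    can_reach(src, dst) -> bool stands in for a UDP ping/ack exchange.
    Returns "alive" or "suspect".
    """
    me = members[0]
    if can_reach(me, target):
        return "alive"
    # Direct ping failed: ask up to k other members to ping on our behalf.
    helpers = [m for m in members[1:] if m != target]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if can_reach(me, helper) and can_reach(helper, target):
            return "alive"
    return "suspect"


members = ["node1", "node2", "node3"]
down = {"node2"}


def can_reach(src, dst):
    """A message gets through only if both endpoints are up."""
    return src not in down and dst not in down


node2_state = probe("node2", members, can_reach)  # "suspect"
node3_state = probe("node3", members, can_reach)  # "alive"
```

The indirect step is what lets a node tell a peer that is really down apart from a single broken link between two otherwise healthy nodes.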
## Write Path (Replication Factor 2)
```
Client submits assertion
                │
                ▼
Load Balancer (routes to Node 1)
                │
                ▼
┌───────────────────────────────────────┐
│  Node 1 (Coordinator)                 │
│                                       │
│  1. Validate assertion                │
│  2. Write to local WAL (fsync)        │
│  3. Return 201 Created to client      │
│  4. Async replicate to Node 2         │
│     (background, no blocking)         │
└───────────────┬───────────────────────┘
                │ gRPC (async)
                ▼
    ┌───────────────────────┐
    │  Node 2 (Replica)     │
    │  1. Receive assertion │
    │  2. Write to WAL      │
    │  3. ACK to Node 1     │
    └───────────────────────┘

(Node 3 may also receive a replica,
 depending on hash-based shard assignment)
```
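
The ordering in the write path matters: the client is acknowledged after the local WAL write and before replication. The sketch below models that ordering with in-memory lists. Everything here is hypothetical: the `Node` class, the `outbox` queue, and `pick_replica` (a guess at what "hash-based shard assignment" might look like, using SHA-256 of the concept path) are illustration only.

```python
import hashlib
from collections import deque


def pick_replica(concept_path, peers):
    """Hypothetical hash-based placement: a stable choice among the other nodes."""
    digest = hashlib.sha256(concept_path.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]


class Node:
    def __init__(self, name):
        self.name = name
        self.wal = []          # stands in for the fsync'd write-ahead log
        self.outbox = deque()  # replication work that has not been sent yet

    def handle_write(self, assertion, peers):
        """Coordinator path: validate, log locally, acknowledge, replicate later."""
        if not assertion.get("concept_path"):
            return 400                                # step 1: validation failed
        self.wal.append(assertion)                    # step 2: local WAL write
        replica = pick_replica(assertion["concept_path"], peers)
        self.outbox.append((replica, assertion))      # step 4: queued, not awaited
        return 201                                    # step 3: client sees success now

    def flush_outbox(self):
        """Background replication: deliver queued assertions to their replicas."""
        while self.outbox:
            replica, assertion = self.outbox.popleft()
            replica.wal.append(assertion)


node1, node2, node3 = Node("node1"), Node("node2"), Node("node3")
status = node1.handle_write({"concept_path": "drug/aspirin/safety"}, [node2, node3])
copies_at_ack = len(node1.wal) + len(node2.wal) + len(node3.wal)   # only Node 1 so far
node1.flush_outbox()
copies_after_sync = len(node1.wal) + len(node2.wal) + len(node3.wal)
```

Between the acknowledgement and `flush_outbox()`, the assertion exists on a single node. That window is the replication lag that the read path and the monitoring section refer to.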
## Read Path (Eventually Consistent)
```
Client queries concept_path: "drug/aspirin/safety"
                │
                ▼
Load Balancer (routes to any node, e.g., Node 2)
                │
                ▼
┌───────────────────────────────────────┐
│  Node 2 (Query Handler)               │
│                                       │
│  1. Check local KV store              │
│  2. Apply lens (RecencyLens)          │
│  3. Resolve conflicts (CRDTs)         │
│  4. Return result to client           │
│                                       │
│  No coordination with other nodes!    │
└───────────────┬───────────────────────┘
                │
                ▼
Client receives result (may be slightly stale if replication is lagging)
```
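
A read served from local state can be sketched as a lookup followed by a deterministic "newest wins" choice. The field names `timestamp` and `writer`, and the tie-break on writer id, are assumptions for this example; the actual `RecencyLens` and CRDT rules may differ.

```python
def recency_lens(assertions):
    """Pick the newest assertion; ties break on writer id so every node agrees."""
    if not assertions:
        return None
    return max(assertions, key=lambda a: (a["timestamp"], a["writer"]))


def query(local_store, concept_path):
    """Answer from this node's local state only; no calls to other nodes."""
    return recency_lens(local_store.get(concept_path, []))


# Local KV store on the node that happens to serve the request.
local_store = {
    "drug/aspirin/safety": [
        {"value": "v1", "timestamp": 100, "writer": "node1"},
        {"value": "v2", "timestamp": 105, "writer": "node3"},
    ]
}

result = query(local_store, "drug/aspirin/safety")  # the timestamp-105 assertion
missing = query(local_store, "drug/unknown")        # None: nothing stored locally
```

Because the choice depends only on the assertions themselves, two nodes holding the same set return the same answer. A node that has not yet received the latest assertion returns an older one, which is the staleness noted in the diagram.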
## Failure Scenario: Node 2 Down
```
Initial State (All UP):
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│   UP   │  │   UP   │  │   UP   │
└───┬────┘  └───┬────┘  └───┬────┘
    │           │           │
    └───────────┴───────────┘
        SWIM: All healthy

Node 2 Failure:
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│   UP   │  │ ❌ DOWN│  │   UP   │
└───┬────┘  └────────┘  └───┬────┘
    │                       │
    └───────────────────────┘
  SWIM: Node 2 detected as DOWN
  Load Balancer: Health check fails, routes to Node 1 & 3
  Replication: Factor 2 maintained (data on Node 1 & 3)

Recovery (Automatic):
┌────────┐              ┌────────┐
│ Node 1 │              │ Node 3 │
│   UP   │──────────────│   UP   │
└────────┘              └────────┘
  Cluster continues operating
  No data loss (replicated)
  No manual intervention
  RTO: <1 minute (automatic)
  RPO: 0 for replicated writes
       (writes not yet replicated stay in Node 2's WAL)
```
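
Why a replication factor of 2 on three nodes tolerates exactly one failure can be checked by enumerating every possible replica pair. The sketch below assumes each assertion lives on exactly two distinct nodes; the names are illustrative.

```python
from itertools import combinations

nodes = ["node1", "node2", "node3"]
placements = list(combinations(nodes, 2))  # every possible replica pair


def surviving_copies(placement, failed):
    """Replicas of one assertion that are still on live nodes."""
    return [n for n in placement if n not in failed]


# Any single node failure leaves at least one copy of every assertion.
single_failure_ok = all(
    surviving_copies(p, {f}) for p in placements for f in nodes
)

# Two simultaneous failures can remove both copies of some assertion.
double_failure_ok = all(
    surviving_copies(p, set(pair))
    for p in placements
    for pair in combinations(nodes, 2)
)

# With Node 2 down, these placements are reduced to a single live copy.
degraded = [p for p in placements if "node2" in p]
```

Here `single_failure_ok` is `True` and `double_failure_ok` is `False`. Two of the three placements run on one live copy while Node 2 is down, so a second failure before Node 2 rejoins could make some assertions unavailable.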
## Merkle Sync (Convergence)
```
     Node 1                            Node 2
┌──────────────┐                  ┌──────────────┐
│ Merkle Tree  │                  │ Merkle Tree  │
│ Root: ABC123 │◀─────────────────│ Root: DEF456 │
│              │  Compare roots   │              │
│ /drug/       │    (differ!)     │ /drug/       │
│ /treatment/  │─────────────────▶│ /treatment/  │
└──────────────┘                  └──────────────┘
       │                                 │
       │    Descend tree, find diffs     │
       ▼                                 ▼
Node 1 has:                       Node 2 has:
- Assert X (missing on Node 2)    - Assert Y (missing on Node 1)
- Assert Z (both have)            - Assert Z (both have)
       │                                 │
       ▼                                 ▼
           Exchange missing assertions
       │                                 │
       ▼                                 ▼
           Both nodes now have: X, Y, Z
           Root hash: GHI789 (same!)
           Convergence achieved.
```
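
The compare, descend, and exchange steps can be sketched with a two-level hash tree: one leaf hash per top-level prefix and a root hash over the leaves. This is a deliberately small model. A real Merkle tree would have more levels, and the helper names (`bucket_hashes`, `root_hash`, `sync`) and the use of SHA-256 are assumptions.

```python
import hashlib


def h(data):
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def bucket_hashes(store):
    """Leaf level: one hash per top-level prefix, over that prefix's sorted ids."""
    buckets = {}
    for assertion_id in store:
        prefix = assertion_id.split("/", 1)[0]
        buckets.setdefault(prefix, []).append(assertion_id)
    return {p: h("\n".join(sorted(ids))) for p, ids in buckets.items()}


def root_hash(store):
    """Root: hash over the (prefix, leaf hash) pairs in sorted order."""
    leaves = bucket_hashes(store)
    return h("\n".join(f"{p}:{leaves[p]}" for p in sorted(leaves)))


def sync(a, b):
    """Compare roots, descend into differing buckets, exchange what is missing."""
    if root_hash(a) == root_hash(b):
        return
    ha, hb = bucket_hashes(a), bucket_hashes(b)
    for prefix in set(ha) | set(hb):
        if ha.get(prefix) == hb.get(prefix):
            continue  # identical subtree: nothing to transfer
        in_a = {i for i in a if i.split("/", 1)[0] == prefix}
        in_b = {i for i in b if i.split("/", 1)[0] == prefix}
        a.update(in_b - in_a)
        b.update(in_a - in_b)


node1 = {"drug/X", "drug/Z", "treatment/T"}
node2 = {"drug/Y", "drug/Z", "treatment/T"}
roots_differ_before = root_hash(node1) != root_hash(node2)
sync(node1, node2)
roots_match_after = root_hash(node1) == root_hash(node2)
```

Only the `drug` bucket is examined in detail here. The `treatment` bucket hashes match on both sides and is skipped, which is the saving a Merkle comparison offers over sending full key lists.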
## Cluster Health Monitoring
```
┌─────────────────────────────────────────────────┐
│  Prometheus                                     │
│  Scrapes all 3 nodes every 15s                  │
│                                                 │
│  Metrics:                                       │
│  - up{node="node1"} = 1                         │
│  - up{node="node2"} = 1                         │
│  - up{node="node3"} = 1                         │
│  - replication_lag_seconds{node="node2"} = 0.5  │
│  - stemedb_query_latency_seconds{node="node1"}  │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
         ┌─────────────────┐
         │  Grafana        │
         │  Dashboard      │
         │  • Cluster map  │
         │  • Latency p99  │
         │  • Repl lag     │
         └─────────────────┘
```
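
To show how the metrics above could drive alerts, the sketch below parses a few sample lines in the Prometheus text format and flags down nodes and high replication lag. The metric names come from the diagram; the 5-second lag threshold, the sample values, and the helper names are assumptions for illustration. In practice this logic would live in Prometheus alerting rules rather than application code.

```python
import re

# Matches lines such as: up{node="node1"} 1
SAMPLE = re.compile(r'^(\w+)\{node="([^"]+)"\}\s+([0-9.eE+-]+)$')


def parse(text):
    """Parse `metric{node="..."} value` lines into {(metric, node): value}."""
    samples = {}
    for line in text.splitlines():
        match = SAMPLE.match(line.strip())
        if match:
            metric, node, value = match.groups()
            samples[(metric, node)] = float(value)
    return samples


def alerts(samples, max_lag_seconds=5.0):
    """Return alert strings for down nodes and for lag above the assumed threshold."""
    found = []
    for (metric, node), value in sorted(samples.items()):
        if metric == "up" and value == 0:
            found.append(f"{node} is down")
        if metric == "replication_lag_seconds" and value > max_lag_seconds:
            found.append(f"{node} replication lag {value}s")
    return found


# Made-up scrape output for the example.
SCRAPE = """
up{node="node1"} 1
up{node="node2"} 0
up{node="node3"} 1
replication_lag_seconds{node="node3"} 12.5
"""

active = alerts(parse(SCRAPE))
```

With this sample, `active` contains one lag alert for `node3` and one down alert for `node2`.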
## Key Characteristics
- **High Availability:** Survives 1 node failure (99.9% uptime)
- **Replication:** Factor 2 (each assertion on 2 nodes)
- **Consistency:** Eventual (CRDTs + Merkle sync)
- **Recovery:** Automatic (<5 minute RTO; the single-node failover scenario above is stated as <1 minute)
- **Capacity:** <100K assertions, <1K queries/sec
- **Cost:** ~$425/month (AWS t3.xlarge × 3)
- **Use Case:** Production deployments, enterprise pilots
✅ RECOMMENDED FOR PRODUCTION
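
As a rough check on what the availability and recovery figures above imply, the arithmetic below converts 99.9% uptime into a downtime budget. It uses only the numbers stated in this document; the function name is illustrative.

```python
def downtime_budget_minutes(availability, period_hours):
    """Minutes of allowed downtime for a given availability over a period."""
    return (1.0 - availability) * period_hours * 60.0


per_year = downtime_budget_minutes(0.999, 365 * 24)    # about 525.6 minutes (8.76 hours)
per_30_days = downtime_budget_minutes(0.999, 30 * 24)  # about 43.2 minutes
five_minute_recoveries = per_year / 5.0                # about 105 recoveries at the 5-minute RTO
```

In other words, the 99.9% figure leaves room for roughly a hundred worst-case five-minute recoveries per year, provided nothing else causes downtime.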