This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# Network Topology Diagram

## Port Scheme Overview

```
┌────────────────────────────────────────────────────────────────┐
│                StemeDB Port Allocation (181XX)                 │
├────────┬──────────┬─────────────────────┬──────────────────────┤
│ Port   │ Protocol │ Service             │ Purpose              │
├────────┼──────────┼─────────────────────┼──────────────────────┤
│ 18180  │ TCP/HTTP │ API Server          │ Queries, ingest      │
│ 18181  │ TCP/HTTP │ Cluster Gateway     │ Coordination         │
│ 18182  │ TCP/gRPC │ Cluster RPC         │ Replication          │
│ 18183  │ UDP      │ SWIM Gossip         │ Membership           │
│ 18184  │ -        │ (Reserved)          │ Future metrics       │
│ 18185  │ -        │ (Reserved)          │ Future admin         │
│ 18186  │ TCP/HTTP │ Latent Signal       │ AE detection         │
│ 18187  │ TCP/HTTP │ Community App       │ Community corpus     │
│ 18188  │ TCP/HTTP │ StemeDB Dashboard   │ Web UI               │
│ 18189  │ TCP/HTTP │ Aphoria Dashboard   │ Aphoria UI           │
└────────┴──────────┴─────────────────────┴──────────────────────┘
```
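For tooling that needs the port scheme in machine-readable form, the table can be mirrored as a small lookup. This is only a sketch: the names `PortEntry`, `STEMEDB_PORTS`, and `ports_for` are illustrative and do not come from the StemeDB codebase.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class PortEntry(NamedTuple):
    protocol: Optional[str]  # None marks a reserved port
    service: str


# Hypothetical mirror of the port allocation table above.
STEMEDB_PORTS = {
    18180: PortEntry("TCP/HTTP", "API Server"),
    18181: PortEntry("TCP/HTTP", "Cluster Gateway"),
    18182: PortEntry("TCP/gRPC", "Cluster RPC"),
    18183: PortEntry("UDP", "SWIM Gossip"),
    18184: PortEntry(None, "(Reserved)"),
    18185: PortEntry(None, "(Reserved)"),
    18186: PortEntry("TCP/HTTP", "Latent Signal"),
    18187: PortEntry("TCP/HTTP", "Community App"),
    18188: PortEntry("TCP/HTTP", "StemeDB Dashboard"),
    18189: PortEntry("TCP/HTTP", "Aphoria Dashboard"),
}


def ports_for(transport: str) -> list[int]:
    """Return the allocated ports whose protocol starts with `transport` (e.g. "TCP")."""
    return sorted(
        port
        for port, entry in STEMEDB_PORTS.items()
        if entry.protocol is not None and entry.protocol.startswith(transport)
    )
```

Calling `ports_for("UDP")` returns only the gossip port, which is a convenient input for firewall or health-check scripts.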

## Single-Node Network Topology

```
┌─────────────────────────────────────────────────────────────────┐
│                            Internet                             │
│                                │                                │
│                                │ HTTPS (443)                    │
│                                ▼                                │
│                        ┌───────────────┐                        │
│                        │ Reverse Proxy │                        │
│                        │ (Nginx/Envoy) │                        │
│                        │ • TLS term    │                        │
│                        │ • Rate limit  │                        │
│                        └───────┬───────┘                        │
│                                │                                │
│                                │ HTTP (18180)                   │
└────────────────────────────────┼────────────────────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │    Internal Network (10.0.0.0/8)    │
              │                  ▼                  │
              │         ┌─────────────────┐         │
              │         │ StemeDB Node    │         │
              │         │ 10.0.1.50       │         │
              │         │                 │         │
              │         │ :18180 (API)    │◀────────┼─── Clients (internal)
              │         │ :18188 (Dash)   │         │
              │         └────────┬────────┘         │
              │                  │                  │
              │                  ▼                  │
              │         ┌─────────────────┐         │
              │         │ Prometheus      │         │
              │         │ 10.0.1.100      │         │
              │         │ Scrapes :18180  │         │
              │         └─────────────────┘         │
              └─────────────────────────────────────┘

Security Zones:
- Public:   Internet → Reverse Proxy (443)
- DMZ:      Reverse Proxy → StemeDB (18180)
- Internal: Prometheus → StemeDB (18180/metrics)
```

## Three-Node Cluster Network Topology

```
┌─────────────────────────────────────────────────────────────────┐
│                            Internet                             │
│                                │                                │
│                                │ HTTPS (443)                    │
│                                ▼                                │
│                        ┌───────────────┐                        │
│                        │ Load Balancer │                        │
│                        │ (ALB/ELB)     │                        │
│                        │ • TLS term    │                        │
│                        │ • Health chks │                        │
│                        └───────┬───────┘                        │
│                                │                                │
│                                │ HTTP (18180)                   │
└────────────────────────────────┼────────────────────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 │                               │
┌────────────────┼───────────────────────────────┼────────────────┐
│                │ Private Network (10.0.1.0/24) │                │
│                ▼                               ▼                │
│       ┌─────────────────┐             ┌─────────────────┐       │
│       │ Node 1          │             │ Node 2          │       │
│       │ 10.0.1.51       │             │ 10.0.1.52       │       │
│       │                 │             │                 │       │
│       │ :18180 (API)    │             │ :18180 (API)    │       │
│       │ :18181 (Gate)   │             │ :18181 (Gate)   │       │
│       │ :18182 (RPC)────┼─────────────┼────:18182 (RPC) │       │
│       │ :18183 (SWIM)···┼·····UDP·····┼···:18183 (SWIM) │       │
│       └────────┬────────┘             └────────┬────────┘       │
│                │                               │                │
│                │                               │                │
│                │      ┌─────────────────┐      │                │
│                │      │ Node 3          │      │                │
│                │      │ 10.0.1.53       │      │                │
│                │      │                 │      │                │
│                │      │ :18180 (API)    │      │                │
│                │      │ :18181 (Gate)   │      │                │
│                └──────┼──:18182 (RPC)───┼──────┘                │
│                 ···UDP┼···:18183 (SWIM) │                       │
│                       └────────┬────────┘                       │
│                                │                                │
│                                ▼                                │
│                       ┌─────────────────┐                       │
│                       │ Prometheus      │                       │
│                       │ 10.0.1.100      │                       │
│                       │ Scrapes all 3   │                       │
│                       └─────────────────┘                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Security Zones:
- Public:   Internet → Load Balancer (443)
- DMZ:      Load Balancer → Nodes (18180)
- Cluster:  Node ↔ Node (18181-18183)
- Internal: Prometheus → Nodes (18180/metrics)

Firewall Rules:
- Allow 18180 from Load Balancer to all nodes
- Allow 18181-18183 within cluster (node ↔ node)
- Allow 18180/metrics from Prometheus only
- Block 18181 from outside (admin endpoints)
```
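The firewall rules above can be expressed as a small decision function, which is handy for unit-testing a policy before it is rolled out. This is a sketch under assumptions: the load balancer address `10.0.1.10` is taken from the iptables example in this document, and the function name `is_allowed` is hypothetical. It models ports only, so the `/metrics` path restriction is not captured here.

```python
# Addresses from the topology diagrams and the iptables example in this document.
NODES = {"10.0.1.51", "10.0.1.52", "10.0.1.53"}
LOAD_BALANCER = "10.0.1.10"
PROMETHEUS = "10.0.1.100"


def is_allowed(src_ip: str, protocol: str, port: int) -> bool:
    """Port-level model of the firewall rules listed above (illustrative only)."""
    if protocol == "tcp" and port == 18180:
        # API traffic from the load balancer, metrics scrapes from Prometheus.
        return src_ip in {LOAD_BALANCER, PROMETHEUS}
    if protocol == "tcp" and port in (18181, 18182):
        # Cluster gateway and RPC are node-to-node only.
        return src_ip in NODES
    if protocol == "udp" and port == 18183:
        # SWIM gossip is node-to-node only.
        return src_ip in NODES
    return False
```

For example, `is_allowed("203.0.113.7", "tcp", 18181)` is `False`, matching the rule that blocks 18181 from outside the cluster.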

## Inter-Node Communication Detail

```
Node 1 (10.0.1.51)               Node 2 (10.0.1.52)

Port 18182 (TCP/gRPC)
    │
    ├─────────────────────────────────────▶ :18182
    │  Push Replication                     (receive assertions)
    │    • Assertion payload
    │    • BLAKE3 hash
    │    • Signature
    │
    ◀─────────────────────────────────────┤
       ACK (received)                     │
                                          │
Port 18183 (UDP)
    │
    ├─────────────────────────────────────▶ :18183
    │  SWIM Gossip (every 1s)               (membership)
    │    • Ping: "Are you alive?"
    │    • Membership: "Node 3 is UP"
    │
    ◀─────────────────────────────────────┤
       Ack: "I'm alive"                   │
       Membership: "Node 1 is UP"         │

Port 18181 (TCP/HTTP)
    │
    ├─────────────────────────────────────▶ :18181
    │  Merkle Sync (periodic)               (compare trees)
    │  GET /cluster/merkle
    │    • Root hash: ABC123
    │
    ◀─────────────────────────────────────┤
       Merkle tree response               │
         • Root hash: ABC123 (same!)      │
         • No sync needed                 │
```
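The Merkle sync exchange above boils down to comparing root hashes: if both nodes compute the same root over their assertions, no sync is needed. The sketch below illustrates the idea only and is not StemeDB's implementation. It uses `hashlib.blake2b` from the standard library as a stand-in for BLAKE3 (which is not in the standard library), and the tree layout, sorting, and function names are assumptions.

```python
import hashlib
from typing import Iterable, List


def _digest(data: bytes) -> bytes:
    # Stand-in hash: BLAKE2b, because BLAKE3 is not in the standard library.
    return hashlib.blake2b(data, digest_size=32).digest()


def merkle_root(assertions: Iterable[bytes]) -> bytes:
    """Compute a root hash over assertion payloads, independent of input order."""
    level: List[bytes] = [_digest(a) for a in sorted(assertions)]
    if not level:
        return _digest(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [_digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def needs_sync(local: Iterable[bytes], remote: Iterable[bytes]) -> bool:
    """True when the two nodes' root hashes differ."""
    return merkle_root(local) != merkle_root(remote)
```

In a real exchange only the root (and, on mismatch, subtree hashes) would cross the wire; the function takes both sets directly here to keep the example self-contained.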

## Firewall Configuration (iptables)

```
# On each cluster node:

# Allow API from load balancer
-A INPUT -s 10.0.1.10 -p tcp --dport 18180 -j ACCEPT

# Allow cluster gateway and RPC (TCP) from other nodes
-A INPUT -s 10.0.1.51 -p tcp --dport 18181:18182 -j ACCEPT
-A INPUT -s 10.0.1.52 -p tcp --dport 18181:18182 -j ACCEPT
-A INPUT -s 10.0.1.53 -p tcp --dport 18181:18182 -j ACCEPT

# Allow SWIM gossip (UDP) from other nodes
-A INPUT -s 10.0.1.51 -p udp --dport 18183 -j ACCEPT
-A INPUT -s 10.0.1.52 -p udp --dport 18183 -j ACCEPT
-A INPUT -s 10.0.1.53 -p udp --dport 18183 -j ACCEPT

# Allow metrics from Prometheus
-A INPUT -s 10.0.1.100 -p tcp --dport 18180 -j ACCEPT

# Allow SSH from bastion
-A INPUT -s 10.0.1.200 -p tcp --dport 22 -j ACCEPT

# Drop all other traffic to the StemeDB port range
-A INPUT -p tcp --dport 18180:18189 -j DROP
-A INPUT -p udp --dport 18183 -j DROP
```
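Because the per-node rules repeat for every cluster member, they can be generated from a node list rather than maintained by hand. The following is a minimal sketch; the function name `cluster_rules` and its parameters are hypothetical, and the output simply reproduces the rule lines shown above.

```python
from typing import List, Sequence


def cluster_rules(
    nodes: Sequence[str],
    load_balancer: str,
    prometheus: str,
    bastion: str,
) -> List[str]:
    """Build the iptables rule lines shown above for an arbitrary node list."""
    rules = [f"-A INPUT -s {load_balancer} -p tcp --dport 18180 -j ACCEPT"]
    # Cluster gateway and RPC (TCP) from every node.
    rules += [f"-A INPUT -s {n} -p tcp --dport 18181:18182 -j ACCEPT" for n in nodes]
    # SWIM gossip (UDP) from every node.
    rules += [f"-A INPUT -s {n} -p udp --dport 18183 -j ACCEPT" for n in nodes]
    rules.append(f"-A INPUT -s {prometheus} -p tcp --dport 18180 -j ACCEPT")
    rules.append(f"-A INPUT -s {bastion} -p tcp --dport 22 -j ACCEPT")
    # Final drops for the StemeDB port range.
    rules.append("-A INPUT -p tcp --dport 18180:18189 -j DROP")
    rules.append("-A INPUT -p udp --dport 18183 -j DROP")
    return rules


RULES = cluster_rules(
    nodes=["10.0.1.51", "10.0.1.52", "10.0.1.53"],
    load_balancer="10.0.1.10",
    prometheus="10.0.1.100",
    bastion="10.0.1.200",
)
```

Order matters: the ACCEPT lines must precede the DROP lines, which is why the drops are appended last.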

## AWS Security Group Example

```
Security Group: sg-stemedb-cluster

Inbound Rules:
┌──────────┬──────────┬─────────────────┬─────────────────────────┐
│ Type     │ Protocol │ Port Range      │ Source                  │
├──────────┼──────────┼─────────────────┼─────────────────────────┤
│ HTTP     │ TCP      │ 18180           │ sg-load-balancer        │
│ Custom   │ TCP      │ 18181-18182     │ sg-stemedb-cluster      │
│ Custom   │ UDP      │ 18183           │ sg-stemedb-cluster      │
│ SSH      │ TCP      │ 22              │ sg-bastion              │
└──────────┴──────────┴─────────────────┴─────────────────────────┘

Outbound Rules:
┌──────────┬──────────┬─────────────────┬─────────────────────────┐
│ All      │ All      │ All             │ 0.0.0.0/0               │
└──────────┴──────────┴─────────────────┴─────────────────────────┘
```
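The inbound rules can also be kept as data. The sketch below builds dictionaries in the `IpPermissions` shape accepted by the EC2 `AuthorizeSecurityGroupIngress` call (for example through boto3), without making any AWS call itself. The group identifiers are the placeholder names from the table, not real security group IDs, and the helper name `ingress_rule` is hypothetical.

```python
from typing import Any, Dict, List

# Placeholder identifiers copied from the table above; real IDs look different.
SG_CLUSTER = "sg-stemedb-cluster"
SG_LOAD_BALANCER = "sg-load-balancer"
SG_BASTION = "sg-bastion"


def ingress_rule(protocol: str, from_port: int, to_port: int, source_group: str) -> Dict[str, Any]:
    """One inbound rule that admits traffic from another security group."""
    return {
        "IpProtocol": protocol,
        "FromPort": from_port,
        "ToPort": to_port,
        "UserIdGroupPairs": [{"GroupId": source_group}],
    }


INBOUND: List[Dict[str, Any]] = [
    ingress_rule("tcp", 18180, 18180, SG_LOAD_BALANCER),  # API from load balancer
    ingress_rule("tcp", 18181, 18182, SG_CLUSTER),        # gateway + RPC, node to node
    ingress_rule("udp", 18183, 18183, SG_CLUSTER),        # SWIM gossip, node to node
    ingress_rule("tcp", 22, 22, SG_BASTION),              # SSH from bastion
]
```

Referencing the cluster group as its own source is what lets nodes talk to each other without listing individual IP addresses.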

## Network Latency Requirements

```
Client → Load Balancer: <100ms (internet typical)
   │
   ▼
Load Balancer → Node: <10ms (same region)
   │
   ├───────────────────────────────────────┐
   ▼                                       ▼
Node 1 ◀─────<5ms (CRITICAL)─────────▶ Node 2
   ▲                                       ▲
   │                                       │
   └───────────<5ms (CRITICAL)─────────────┘
                    Node 3

Why <5ms inter-node?
- SWIM gossip requires fast ping/ack
- Replication lag increases with latency
- Merkle sync performance degrades

Test: ping -c 100 node2 (should show avg <5ms)
```
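The `ping` test above can be checked automatically by parsing the summary line. This sketch assumes the common `min/avg/max/...` summary format printed by Linux and macOS `ping`; the function names and the sample output string are illustrative.

```python
import re
from typing import Optional

# Matches summaries such as "rtt min/avg/max/mdev = 0.321/0.402/0.611/0.054 ms".
_SUMMARY = re.compile(r"=\s*[\d.]+/([\d.]+)/[\d.]+/[\d.]+\s*ms")


def average_rtt_ms(ping_output: str) -> Optional[float]:
    """Extract the average round-trip time in milliseconds, or None if absent."""
    match = _SUMMARY.search(ping_output)
    return float(match.group(1)) if match else None


def meets_inter_node_budget(ping_output: str, budget_ms: float = 5.0) -> bool:
    """True when the average RTT is under the inter-node latency budget."""
    avg = average_rtt_ms(ping_output)
    return avg is not None and avg < budget_ms


SAMPLE = "rtt min/avg/max/mdev = 0.321/0.402/0.611/0.054 ms"
```

Feeding it the captured output of `ping -c 100 node2` gives a pass/fail signal that can gate a deployment check.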

## Bandwidth Usage

```
┌─────────────────────────────────────────────────────────────┐
│                     Bandwidth Breakdown                     │
├─────────────────┬───────────────────────────────────────────┤
│ Direction       │ Usage (per node)                          │
├─────────────────┼───────────────────────────────────────────┤
│ Inbound (API)   │ 100 assertions/sec × 1KB = 0.8 Mbps       │
│ Outbound (API)  │ 100 queries/sec × 5KB = 4 Mbps            │
│ Replication     │ 100 assertions/sec × 1KB × 2 = 1.6 Mbps   │
│ SWIM Gossip     │ ~10 KB/sec (negligible)                   │
├─────────────────┼───────────────────────────────────────────┤
│ Total           │ ~6.5 Mbps per node                        │
│ Recommended     │ 1 Gbps NIC (>100× headroom)               │
└─────────────────┴───────────────────────────────────────────┘
```
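The figures in the table follow from a single conversion: kilobytes per second times 8 bits, divided by 1000, gives megabits per second. The sketch below reruns that arithmetic so the estimate can be redone for other workloads. It assumes 1 KB = 1000 bytes, and the function name `mbps` is illustrative.

```python
def mbps(events_per_sec: float, kb_per_event: float, copies: int = 1) -> float:
    """Convert an event rate and payload size to megabits per second (1 KB = 1000 bytes)."""
    return events_per_sec * kb_per_event * copies * 8 / 1000


inbound = mbps(100, 1)                 # 100 assertions/sec × 1 KB
outbound = mbps(100, 5)                # 100 queries/sec × 5 KB
replication = mbps(100, 1, copies=2)   # each assertion pushed to 2 peers
gossip = mbps(1, 10)                   # ~10 KB/sec of SWIM traffic

total = inbound + outbound + replication + gossip
headroom = 1000 / total                # against a 1 Gbps NIC
```

With the table's inputs the total comes to about 6.5 Mbps per node, leaving well over 100× headroom on a 1 Gbps NIC.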

## Monitoring Endpoints

```
┌─────────────────────────────────────────────────────────────┐
│                  Prometheus Scrape Targets                  │
├─────────────────┬───────────────────────────────────────────┤
│ Target          │ URL                                       │
├─────────────────┼───────────────────────────────────────────┤
│ Node 1          │ http://10.0.1.51:18180/metrics            │
│ Node 2          │ http://10.0.1.52:18180/metrics            │
│ Node 3          │ http://10.0.1.53:18180/metrics            │
├─────────────────┼───────────────────────────────────────────┤
│ Scrape Interval │ 15 seconds                                │
│ Timeout         │ 10 seconds                                │
└─────────────────┴───────────────────────────────────────────┘

Key Metrics:
- up{job="stemedb", instance="node1"} = 1
- stemedb_query_latency_seconds{quantile="0.99", instance="node1"}
- replication_lag_seconds{instance="node1"}
- process_resident_memory_bytes{instance="node1"}
```
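A Prometheus scrape job matching the table can be rendered from the node list. This is a sketch rather than the project's shipped monitoring config: the function name `scrape_config` is hypothetical, while the YAML keys (`scrape_configs`, `job_name`, `scrape_interval`, `scrape_timeout`, `metrics_path`, `static_configs`, `targets`) are standard Prometheus configuration fields.

```python
from typing import Sequence


def scrape_config(
    hosts: Sequence[str],
    job: str = "stemedb",
    interval: str = "15s",
    timeout: str = "10s",
    port: int = 18180,
) -> str:
    """Render a Prometheus scrape job for the given hosts as YAML text."""
    lines = [
        "scrape_configs:",
        f"  - job_name: {job}",
        f"    scrape_interval: {interval}",
        f"    scrape_timeout: {timeout}",
        "    metrics_path: /metrics",
        "    static_configs:",
        "      - targets:",
    ]
    # Quote each target so the host:port pair is always read as a string.
    lines += [f"          - '{host}:{port}'" for host in hosts]
    return "\n".join(lines) + "\n"


CONFIG = scrape_config(["10.0.1.51", "10.0.1.52", "10.0.1.53"])
```

Note that with these targets the `instance` label would be the `host:port` pair; producing labels such as `instance="node1"`, as in the key metrics list, would need relabeling or DNS names as targets.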

## DNS Configuration

```
Public DNS (example.com):
┌────────────────────────────────────────────────────────────┐
│ stemedb.example.com.  300  IN  CNAME  stemedb-lb.example.  │
│ stemedb-lb.example.    60  IN  A      203.0.113.10         │
└────────────────────────────────────────────────────────────┘

Private DNS (cluster.local):
┌────────────────────────────────────────────────────────────┐
│ node1.cluster.local.  300  IN  A      10.0.1.51            │
│ node2.cluster.local.  300  IN  A      10.0.1.52            │
│ node3.cluster.local.  300  IN  A      10.0.1.53            │
└────────────────────────────────────────────────────────────┘

TTL Recommendations:
- Public:  300s (5 min) - balance caching vs failover speed
- Private: 60s (1 min) - faster convergence within cluster
```