Network Requirements
Network configuration for StemeDB deployments
Port Scheme (181XX)
StemeDB uses ports in the 181XX range for all services:
| Port | Protocol | Service | Purpose | Expose To |
|---|---|---|---|---|
| 18180 | TCP/HTTP | API Server | Queries, ingest, metrics | Clients (via reverse proxy) |
| 18181 | TCP/HTTP | Cluster Gateway | Cluster coordination, admin endpoints | Internal network only |
| 18182 | TCP/gRPC | Cluster RPC | Assertion replication | Cluster nodes only |
| 18183 | UDP | SWIM Gossip | Membership, failure detection | Cluster nodes only |
| 18184 | TCP/HTTP | (Reserved for future metrics) | - | - |
| 18185 | TCP/HTTP | (Reserved for future admin) | - | - |
| 18186-18189 | - | (Reserved for applications) | - | - |
Firewall Rules
Single-Node Deployment
Allow inbound:
- Port 18180 from load balancer/reverse proxy (or internal network)
- Port 22 (SSH) from bastion host
Block:
- Port 18180 from public internet (use reverse proxy)
- Ports 18181-18183 (not used in single-node)
AWS Security Group:
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
--group-id sg-stemedb \
--source-group sg-load-balancer \
--protocol tcp \
--port 18180
# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
--group-id sg-stemedb \
--source-group sg-bastion \
--protocol tcp \
--port 22
iptables:
# Allow API from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP
# Save rules
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
Three-Node Cluster
Allow inbound:
- Port 18180 from load balancer (API traffic)
- Ports 18181-18183 from cluster nodes (inter-node)
- Port 22 (SSH) from bastion host
Block:
- Ports 18180-18183 from public internet
- Port 18181 from outside internal network (admin endpoint security)
AWS Security Group:
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
--group-id sg-stemedb \
--source-group sg-load-balancer \
--protocol tcp \
--port 18180
# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
--group-id sg-stemedb \
--source-group sg-stemedb \
--protocol tcp \
--port 18181-18182
# Allow SWIM gossip (UDP)
aws ec2 authorize-security-group-ingress \
--group-id sg-stemedb \
--source-group sg-stemedb \
--protocol udp \
--port 18183
# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
--group-id sg-stemedb \
--source-group sg-bastion \
--protocol tcp \
--port 22
iptables (on each node):
# Allow API from load balancer
sudo iptables -A INPUT -p tcp -s 10.0.1.10 --dport 18180 -j ACCEPT
# Allow cluster traffic from other nodes
sudo iptables -A INPUT -p tcp -s 10.0.1.51 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.52 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.53 --dport 18181:18182 -j ACCEPT
# Allow SWIM gossip
sudo iptables -A INPUT -p udp -s 10.0.1.0/24 --dport 18183 -j ACCEPT
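# Drop all other traffic to the StemeDB port range (TCP and UDP)
sudo iptables -A INPUT -p tcp --dport 18180:18189 -j DROP
sudo iptables -A INPUT -p udp --dport 18180:18189 -j DROP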
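The per-node rules repeat one line per cluster peer, so they are easy to get wrong when a node is added. A minimal sketch of generating them from a node list follows. The function name `gen_cluster_rules` is hypothetical, the IP addresses are the example values used in this section, and the final UDP drop rule is an assumption added so that gossip from outside the subnet is rejected as well. The function only prints commands; review the output before running it with sudo.

```shell
#!/bin/sh
# Hypothetical helper: print (not apply) the iptables rules for one cluster node.
# Usage: gen_cluster_rules LB_IP GOSSIP_CIDR NODE_IP...
gen_cluster_rules() (
    lb_ip=$1
    gossip_cidr=$2
    shift 2

    # API traffic is accepted from the load balancer only
    echo "iptables -A INPUT -p tcp -s $lb_ip --dport 18180 -j ACCEPT"

    # Gateway and RPC ports (18181-18182) are accepted from each cluster node
    for node_ip in "$@"; do
        echo "iptables -A INPUT -p tcp -s $node_ip --dport 18181:18182 -j ACCEPT"
    done

    # SWIM gossip is UDP
    echo "iptables -A INPUT -p udp -s $gossip_cidr --dport 18183 -j ACCEPT"

    # Drop any other traffic to the StemeDB port range (UDP rule is an assumption)
    echo "iptables -A INPUT -p tcp --dport 18180:18189 -j DROP"
    echo "iptables -A INPUT -p udp --dport 18180:18189 -j DROP"
)

# Example values from this section: load balancer, gossip subnet, three nodes
gen_cluster_rules 10.0.1.10 10.0.1.0/24 10.0.1.51 10.0.1.52 10.0.1.53
```

Printing instead of applying keeps the sketch safe to run anywhere and makes the rule set reviewable in a pull request.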
TLS Configuration
Requirements
- Minimum TLS version: 1.3
- Certificate validity: ≤90 days (automate renewal)
- Key algorithm: RSA 2048-bit or ECDSA P-256
- Termination: At reverse proxy (recommended) or at StemeDB API
Let's Encrypt Automation
Certbot with nginx:
# Install certbot
sudo apt install certbot python3-certbot-nginx
# Obtain certificate
sudo certbot --nginx -d stemedb.example.com
# Auto-renewal (cron)
sudo crontab -e
# Add:
0 3 * * * certbot renew --quiet && systemctl reload nginx
Manual certificate (for testing):
# Generate self-signed (NOT for production)
sudo openssl req -x509 -newkey rsa:2048 -nodes \
-keyout /etc/stemedb/tls/key.pem \
-out /etc/stemedb/tls/cert.pem \
-days 365 \
-subj "/CN=stemedb.local"
# Set permissions
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem
TLS at Reverse Proxy (Recommended)
Nginx example:
server {
listen 443 ssl http2;
server_name stemedb.example.com;
ssl_certificate /etc/letsencrypt/live/stemedb.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/stemedb.example.com/privkey.pem;
ssl_protocols TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
location / {
proxy_pass http://stemedb_cluster;
}
}
See: Nginx Config for complete example.
DNS Configuration
Single-Node
Simple A record:
stemedb.example.com. 300 IN A 10.0.1.50
Health check: None at the DNS level. Failover is manual: repoint the A record to a healthy server.
Three-Node Cluster
Option 1: Load balancer with CNAME
stemedb.example.com. 300 IN CNAME stemedb-lb.example.com.
stemedb-lb.example.com. 60 IN A 10.0.1.10
node1.example.com. 300 IN A 10.0.1.51
node2.example.com. 300 IN A 10.0.1.52
node3.example.com. 300 IN A 10.0.1.53
Option 2: Multiple A records (DNS round-robin)
stemedb.example.com. 60 IN A 10.0.1.51
stemedb.example.com. 60 IN A 10.0.1.52
stemedb.example.com. 60 IN A 10.0.1.53
⚠️ Note: DNS round-robin doesn't detect failed nodes. Use a load balancer instead.
Internal DNS (Private Network)
For cluster communication:
# Private hosted zone: cluster.local
node1.cluster.local. 300 IN A 10.0.1.51
node2.cluster.local. 300 IN A 10.0.1.52
node3.cluster.local. 300 IN A 10.0.1.53
Latency Requirements
Single-Node
- Client → Server: <100ms (typical internet)
- No inter-node requirements
Three-Node Cluster
- Client → Load Balancer: <100ms
- Load Balancer → Node: <10ms (same region)
- Node ↔ Node: <5ms (CRITICAL)
Why <5ms inter-node?
- SWIM gossip requires fast responses
- Replication lag increases with latency
- Merkle sync performance degrades
Test latency:
# From node1 to node2
ping -c 100 node2.cluster.local
# Expected:
# rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms
# If avg >5ms → Nodes too far apart (different regions?)
Deployment recommendations:
- ✅ Same availability zone: <1ms typical
- ⚠️ Same region, different AZs: 1-5ms (acceptable)
- ❌ Different regions: >10ms (not supported)
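The latency check above can be automated by parsing the summary line that ping prints. The sketch below is a minimal example, assuming the Linux `rtt min/avg/max/mdev = ...` summary format shown earlier (the macOS `round-trip` variant has the same field layout). The function name `check_rtt` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin, extract the average RTT from
# the summary line, and compare it with a limit in milliseconds (default 5).
# Prints "ok <avg>" (exit 0), "too-slow <avg>" (exit 1), or "no-summary" (exit 2).
check_rtt() {
    awk -F'/' -v limit="${1:-5}" '
        /^(rtt|round-trip)/ {
            # Summary: rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms
            # Splitting on "/" puts the average in field 5.
            avg = $5
            found = 1
        }
        END {
            if (!found) { print "no-summary"; exit 2 }
            if (avg + 0 > limit + 0) { print "too-slow " avg; exit 1 }
            print "ok " avg
        }'
}

# Sample summary line from this section
echo "rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms" | check_rtt 5   # prints: ok 1.2
```

Against a live node the same function can be fed directly, for example `ping -c 100 node2.cluster.local | check_rtt 5`, and its exit status used in a pre-deployment check.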
Bandwidth Requirements
Single-Node
- Ingest: ~1 KB per assertion → 100 assertions/sec = 100 KB/sec = 0.8 Mbps
- Queries: ~5 KB per query → 100 queries/sec = 500 KB/sec = 4 Mbps
- Total: ~5 Mbps typical, 10 Mbps recommended
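The figures above follow from payload size × request rate × 8 bits per byte. A small sketch of that arithmetic, useful for re-running the estimate with different rates, is shown here. The helper name `to_mbps` is hypothetical, and the conversion assumes 1 KB = 1000 bytes and 1 Mbps = 1,000,000 bits per second.

```shell
#!/bin/sh
# Hypothetical helper: convert payload size (KB) and rate (per second) to Mbps.
# KB/s x 8 = kbit/s; dividing by 1000 gives Mbit/s.
to_mbps() {
    awk -v kb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", kb * rate * 8 / 1000 }'
}

ingest=$(to_mbps 1 100)   # ~1 KB per assertion, 100 assertions/sec
query=$(to_mbps 5 100)    # ~5 KB per query, 100 queries/sec
echo "ingest=${ingest} Mbps query=${query} Mbps"   # prints: ingest=0.8 Mbps query=4.0 Mbps
```

The two values sum to 4.8 Mbps, which matches the ~5 Mbps typical figure.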
Three-Node Cluster
Per node:
- Client traffic: Same as single-node (~5 Mbps)
- Replication traffic: ~1 MB per 1K assertions (~1 KB each); provision a 1 Gbps private link for high-throughput ingest
Total cluster:
- Client traffic: 15 Mbps (3× single-node)
- Replication traffic: ~10 Mbps typical, 100 Mbps burst
Recommended:
- Public bandwidth: 100 Mbps per node
- Private bandwidth: 1 Gbps per node (10 Gbps for production)
Load Balancer Configuration
Health Checks
HTTP health check configuration:
Endpoint: /v1/health
Method: GET
Interval: 5 seconds
Timeout: 3 seconds
Healthy threshold: 2
Unhealthy threshold: 3
Expected response:
{
"status": "healthy",
"version": "0.1.0",
"uptime_seconds": 12345
}
Mark unhealthy if:
- HTTP status != 200
- Response time >3 seconds
- status field != "healthy"
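For load balancers or scripts that need the rules above spelled out, the following sketch applies the three conditions to a single probe result. The function name `is_healthy` is hypothetical, and the body match uses grep for brevity; a JSON parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: apply the three "mark unhealthy" rules to one probe result.
# Usage: is_healthy HTTP_STATUS RESPONSE_SECONDS BODY
# Returns 0 if healthy, non-zero otherwise.
is_healthy() {
    # Rule 1: HTTP status must be 200
    [ "$1" = "200" ] || return 1

    # Rule 2: response time must not exceed the 3 second timeout
    awk -v t="$2" 'BEGIN { exit !(t + 0 <= 3) }' || return 1

    # Rule 3: the status field must be "healthy" (whitespace-tolerant match)
    printf '%s' "$3" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Possible usage against a live node (not executed here):
#   probe=$(curl -s -o /tmp/health.json -w '%{http_code} %{time_total}' \
#                --max-time 3 http://localhost:18180/v1/health)
#   is_healthy "${probe% *}" "${probe#* }" "$(cat /tmp/health.json)"
```

Keeping the decision in a pure function separates the probe transport (curl, a load balancer, a monitoring agent) from the health policy itself.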
Load Balancing Algorithm
Recommended: Round-robin
- Simple
- Evenly distributes load
- No sticky sessions needed (CRDTs handle conflicts)
Not recommended: Least connections
- Can cause hotspots
- Unnecessary complexity
Session Affinity
Not required - StemeDB uses CRDTs, so queries can hit any node
Security Considerations
Admin Endpoints
⚠️ CRITICAL: Admin endpoints have NO authentication in Pilot 5
Endpoints to restrict:
- /v1/admin/quarantine - Manage quarantine queue
- /v1/admin/circuit_breakers - Ban/unban agents
- /v1/admin/indexes/rebuild - Trigger index rebuild
- /v1/admin/compact - Trigger compaction
Restriction methods:
Option 1: Firewall (recommended)
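# Block /v1/admin/ from public
# iptables example (the string match inspects plain-HTTP packets only, so it
# cannot filter requests if TLS terminates at StemeDB itself):
sudo iptables -A INPUT -p tcp --dport 18180 -m string --string "/v1/admin/" --algo bm -j DROP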
# Or in nginx:
location /v1/admin/ {
deny all;
return 403;
}
Option 2: VPN-only access
- Require VPN connection to reach port 18181 (cluster gateway)
- Use /v1/admin/ endpoints via cluster gateway only
Option 3: IP allowlist
# Nginx example
location /v1/admin/ {
allow 10.0.0.0/8; # Internal network
deny all;
}
Metrics Endpoint
The /metrics endpoint exposes sensitive information:
- Assertion counts
- Query patterns
- Agent IDs
- Performance data
Restriction:
# Allow only from monitoring systems
location /metrics {
allow 10.0.1.100; # Prometheus server
deny all;
}
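Once the restrictions are in place, it is worth confirming from a host outside the internal network that the admin and metrics paths are really unreachable. The sketch below is one way to do that. The function names `fetch_status` and `verify_blocked` are hypothetical, and treating both 403 (nginx deny) and 000 (curl received no response, for example after a firewall drop) as "blocked" is an assumption of this example.

```shell
#!/bin/sh
# Paths this section says must not be reachable from outside
RESTRICTED_PATHS="/v1/admin/quarantine /v1/admin/circuit_breakers /v1/admin/indexes/rebuild /v1/admin/compact /metrics"

# Print the HTTP status code for a URL; curl prints 000 if no response arrives
fetch_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Hypothetical check: every restricted path must be denied (403) or dropped (000).
# Usage: verify_blocked BASE_URL
# Prints one line per exposed path and returns 1 if any path is exposed.
verify_blocked() {
    vb_base=$1
    vb_failed=0
    for vb_path in $RESTRICTED_PATHS; do
        vb_code=$(fetch_status "$vb_base$vb_path")
        case "$vb_code" in
            403|000) ;;   # blocked as intended
            *)
                echo "EXPOSED $vb_path (HTTP $vb_code)"
                vb_failed=1
                ;;
        esac
    done
    return "$vb_failed"
}
```

Run it from an external host, for example `verify_blocked https://stemedb.example.com`. A non-zero exit status means at least one path answered with something other than a denial.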
Network Topology Examples
Single-Node with Reverse Proxy
Internet
│
▼
[Nginx/Envoy] (TLS termination, port 443)
│
▼
[StemeDB API] (port 18180, HTTP)
│
▼
[Data] (/data/wal, /data/db)
Three-Node Cluster
Internet
│
▼
[Load Balancer] (TLS, port 443)
│
├─────────┬─────────┐
▼ ▼ ▼
[Node 1] [Node 2] [Node 3] (port 18180, HTTP)
│ │ │
└─────────┴─────────┘ (ports 18182-18183, replication)
See: diagrams/network-topology.txt for ASCII diagram.
Troubleshooting
Connection Refused
Symptom: curl: (7) Failed to connect to localhost port 18180: Connection refused
Diagnosis:
# Check if port is listening
sudo lsof -i :18180
# Should show: stemedb-api
# Check firewall
sudo iptables -L -n | grep 18180
# Check service status
sudo systemctl status stemedb-api
Resolution: See Server Won't Start Runbook
High Latency Between Nodes
Symptom: replication_lag_seconds >5
Diagnosis:
# Test inter-node latency
ping -c 100 node2
# If avg >5ms → Network issue
# Check bandwidth
iperf3 -c node2
# Should show >100 Mbps
Resolution: See High Query Latency Runbook
SWIM Gossip Not Working
Symptom: Nodes not discovering each other
Diagnosis:
# Check UDP port 18183
sudo tcpdump -i eth0 udp port 18183
# Should show periodic SWIM messages
# Check firewall (UDP!)
sudo iptables -L -n | grep 18183
Resolution: Open UDP port 18183 between cluster nodes
Related Documentation
- Single-Node Architecture - Network for single-node
- Three-Node Cluster - Network for cluster
- Deployment Examples - Nginx and Envoy configs
- Add Node Runbook - Cluster network setup
Last Updated: 2026-02-11