# Network Requirements

**Network configuration for StemeDB deployments**

---

## Port Scheme (181XX)

StemeDB uses ports in the `181XX` range for all services:

| Port | Protocol | Service | Purpose | Expose To |
|------|----------|---------|---------|-----------|
| **18180** | TCP/HTTP | API Server | Queries, ingest, metrics | Clients (via reverse proxy) |
| **18181** | TCP/HTTP | Cluster Gateway | Cluster coordination, admin endpoints | Internal network only |
| **18182** | TCP/gRPC | Cluster RPC | Assertion replication | Cluster nodes only |
| **18183** | UDP | SWIM Gossip | Membership, failure detection | Cluster nodes only |
| 18184 | TCP/HTTP | (Reserved for future metrics) | - | - |
| 18185 | TCP/HTTP | (Reserved for future admin) | - | - |
| 18186-18189 | - | (Reserved for applications) | - | - |

---

## Firewall Rules

### Single-Node Deployment

**Allow inbound:**
- Port 18180 from load balancer/reverse proxy (or internal network)
- Port 22 (SSH) from bastion host

**Block:**
- Port 18180 from public internet (use reverse proxy)
- Ports 18181-18183 (not used in single-node)

**AWS Security Group:**
```bash
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```

**iptables:**
```bash
# Allow API from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

# Save rules (the redirect must run as root, so pipe through sudo tee)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null
```

---

### Three-Node Cluster

**Allow inbound:**
- Port 18180 from load balancer (API traffic)
- Ports 18181-18183 from cluster nodes (inter-node)
- Port 22 (SSH) from bastion host

**Block:**
- Ports 18180-18183 from public internet
- Port 18181 from outside internal network (admin endpoint security)

**AWS Security Group:**
```bash
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol tcp \
  --port 18181-18182

# Allow SWIM gossip (UDP)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol udp \
  --port 18183

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```

**iptables (on each node):**
```bash
# Allow API from load balancer
sudo iptables -A INPUT -p tcp -s 10.0.1.10 --dport 18180 -j ACCEPT

# Allow cluster traffic from other nodes
sudo iptables -A INPUT -p tcp -s 10.0.1.51 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.52 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.53 --dport 18181:18182 -j ACCEPT

# Allow SWIM gossip
sudo iptables -A INPUT -p udp -s 10.0.1.0/24 --dport 18183 -j ACCEPT

# Drop everything else (TCP on the StemeDB range, UDP on the gossip port)
sudo iptables -A INPUT -p tcp --dport 18180:18189 -j DROP
sudo iptables -A INPUT -p udp --dport 18183 -j DROP
```
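
The per-node rules follow a fixed pattern, so they can be generated from a peer list instead of being typed by hand. The sketch below is one way to do that. `gen_cluster_rules` is a hypothetical helper (not part of StemeDB); it only prints the commands for review and never applies them, and it assumes gossip is allowed per peer IP rather than for the whole `10.0.1.0/24` subnet. For example, `gen_cluster_rules 10.0.1.10 10.0.1.51 10.0.1.52 10.0.1.53` prints the rules for a three-node cluster behind a load balancer at `10.0.1.10`.

```bash
# Print (do not apply) iptables rules for one cluster node.
# Usage: gen_cluster_rules <lb_ip> <peer_ip>...
# Hypothetical helper for illustration; review the output before running it.
gen_cluster_rules() {
  local lb_ip="$1"
  shift
  local peer

  # API traffic is accepted from the load balancer only.
  echo "iptables -A INPUT -p tcp -s ${lb_ip} --dport 18180 -j ACCEPT"

  for peer in "$@"; do
    # Cluster gateway + RPC (TCP) and SWIM gossip (UDP) from each peer.
    echo "iptables -A INPUT -p tcp -s ${peer} --dport 18181:18182 -j ACCEPT"
    echo "iptables -A INPUT -p udp -s ${peer} --dport 18183 -j ACCEPT"
  done

  # Drop anything else aimed at the StemeDB TCP range or the gossip port.
  echo "iptables -A INPUT -p tcp --dport 18180:18189 -j DROP"
  echo "iptables -A INPUT -p udp --dport 18183 -j DROP"
}
```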

---

## TLS Configuration

### Requirements

- **Minimum TLS version:** 1.3
- **Certificate validity:** 90 days or less (automate renewal)
- **Key algorithm:** RSA 2048-bit or ECDSA P-256
- **Termination:** At reverse proxy (recommended) or at StemeDB API

### Let's Encrypt Automation

**Certbot with nginx:**
```bash
# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain certificate
sudo certbot --nginx -d stemedb.example.com

# Auto-renewal (root crontab)
sudo crontab -e
# Add this line:
0 3 * * * certbot renew --quiet && systemctl reload nginx
```

**Manual certificate (for testing):**
```bash
# Generate self-signed (NOT for production)
sudo mkdir -p /etc/stemedb/tls
sudo openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /etc/stemedb/tls/key.pem \
  -out /etc/stemedb/tls/cert.pem \
  -days 365 \
  -subj "/CN=stemedb.local"

# Set permissions
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem
```

### TLS at Reverse Proxy (Recommended)

**Nginx example:**
```nginx
server {
    listen 443 ssl http2;
    server_name stemedb.example.com;

    ssl_certificate /etc/letsencrypt/live/stemedb.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stemedb.example.com/privkey.pem;

    ssl_protocols TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    location / {
        # stemedb_cluster must be defined in an upstream block
        proxy_pass http://stemedb_cluster;
    }
}
```

**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) for complete example.

---

## DNS Configuration

### Single-Node

**Simple A record:**
```
stemedb.example.com.  300  IN  A  10.0.1.50
```

**Health check:** None at the DNS level. Failover is manual: repoint the A record at a healthy server.

### Three-Node Cluster

**Option 1: Load balancer with CNAME**
```
stemedb.example.com.     300  IN  CNAME  stemedb-lb.example.com.
stemedb-lb.example.com.   60  IN  A      10.0.1.10

node1.example.com.       300  IN  A      10.0.1.51
node2.example.com.       300  IN  A      10.0.1.52
node3.example.com.       300  IN  A      10.0.1.53
```

**Option 2: Multiple A records (DNS round-robin)**
```
stemedb.example.com.  60  IN  A  10.0.1.51
stemedb.example.com.  60  IN  A  10.0.1.52
stemedb.example.com.  60  IN  A  10.0.1.53
```

⚠️ **Note:** DNS round-robin doesn't detect failed nodes, so clients keep being sent to a dead node until its record is removed. Use a load balancer instead.

### Internal DNS (Private Network)

**For cluster communication:**
```
# Private hosted zone: cluster.local
node1.cluster.local.  300  IN  A  10.0.1.51
node2.cluster.local.  300  IN  A  10.0.1.52
node3.cluster.local.  300  IN  A  10.0.1.53
```

---

## Latency Requirements

### Single-Node

- **Client → Server:** <100ms (typical internet)
- **No inter-node requirements**

### Three-Node Cluster

- **Client → Load Balancer:** <100ms
- **Load Balancer → Node:** <10ms (same region)
- **Node ↔ Node:** **<5ms (CRITICAL)**

**Why <5ms inter-node?**
- SWIM failure detection relies on prompt probe responses; slow replies can make healthy nodes look suspect
- Replication lag increases with latency
- Merkle sync performance degrades

**Test latency:**
```bash
# From node1 to node2
ping -c 100 node2.cluster.local

# Expected:
# rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms

# If avg >5ms → Nodes too far apart (different regions?)
```

**Deployment recommendations:**
- ✅ Same availability zone: <1ms typical
- ⚠️ Same region, different AZs: 1-5ms (acceptable)
- ❌ Different regions: >10ms (not supported)
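
The 5 ms check can be scripted so it runs the same way on every node pair. The following is a minimal sketch, assuming the `ping` summary line has the usual `min/avg/max/...` layout; `ping_avg_ms` and `check_internode_latency` are hypothetical helper names, not StemeDB tooling. A typical invocation would be `check_internode_latency "$(ping -c 100 node2.cluster.local | ping_avg_ms)"`, which exits non-zero when the average exceeds the budget.

```bash
# Read `ping` output on stdin and print the average RTT in ms.
# Handles both the Linux ("rtt ...") and BSD/macOS ("round-trip ...") summary lines.
ping_avg_ms() {
  awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}

# Classify an average RTT against the 5 ms inter-node budget.
# Exit codes: 0 = within budget, 1 = over budget, 2 = no value supplied.
check_internode_latency() {
  local avg="$1"
  awk -v avg="$avg" 'BEGIN {
    if (avg == "")   { print "UNKNOWN: no RTT summary found"; exit 2 }
    if (avg + 0 > 5) { print "FAIL: avg " avg " ms exceeds the 5 ms budget"; exit 1 }
    print "OK: avg " avg " ms"
    exit 0
  }'
}
```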

---

## Bandwidth Requirements

### Single-Node

- **Ingest:** ~1 KB per assertion → 100 assertions/sec = 100 KB/sec = 0.8 Mbps
- **Queries:** ~5 KB per query → 100 queries/sec = 500 KB/sec = 4 Mbps
- **Total:** ~5 Mbps typical, 10 Mbps recommended
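
To redo this arithmetic for a different workload, the conversion is message size × rate × 8 bits. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 Mbps = 10^6 bits/sec), which is what the figures above use; `to_mbps` is a hypothetical helper name.

```bash
# Convert a message size (KB) and a rate (messages/sec) to Mbps.
# Decimal units: KB * msgs/sec = KB/sec; * 8 = kbit/sec; / 1000 = Mbps.
to_mbps() {
  awk -v kb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", kb * rate * 8 / 1000 }'
}

to_mbps 1 100   # ingest:  1 KB x 100/sec -> 0.8
to_mbps 5 100   # queries: 5 KB x 100/sec -> 4.0
```

Together the two results come to 4.8 Mbps, which is where the ~5 Mbps typical figure comes from.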

### Three-Node Cluster

**Per node:**
- **Client traffic:** Same as single-node (~5 Mbps)
- **Replication traffic:** ~1 MB per 1,000 assertions (about 1 KB each); provision a 1 Gbps link for high-throughput ingest

**Total cluster:**
- **Client traffic:** 15 Mbps (3× single-node)
- **Replication traffic:** ~10 Mbps typical, 100 Mbps burst

**Recommended:**
- **Public bandwidth:** 100 Mbps per node
- **Private bandwidth:** 1 Gbps per node (10 Gbps for production)

---

## Load Balancer Configuration

### Health Checks

**HTTP health check configuration:**
```
Endpoint: /v1/health
Method: GET
Interval: 5 seconds
Timeout: 3 seconds
Healthy threshold: 2
Unhealthy threshold: 3
```

**Expected response:**
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 12345
}
```

**Mark unhealthy if:**
- HTTP status != 200
- Response time >3 seconds
- `status` field != "healthy"
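
For load balancers that run an external check script, or for a quick manual probe, the three conditions above can be expressed as follows. This is a sketch under stated assumptions: `is_healthy` and `probe_node` are hypothetical names, the `status` field is matched with a simple regular expression rather than a JSON parser, and the URL passed to `probe_node` is whatever address the node's API listens on (for example `http://10.0.1.51:18180/v1/health`).

```bash
# Decide node health from an HTTP status code, a response time in seconds,
# and the JSON response body. Returns 0 if healthy, 1 otherwise.
is_healthy() {
  local http_code="$1" elapsed="$2" body="$3"

  # Condition 1: HTTP status must be 200.
  [ "$http_code" = "200" ] || return 1

  # Condition 2: response time must not exceed 3 seconds.
  awk -v t="$elapsed" 'BEGIN { exit !(t + 0 <= 3) }' || return 1

  # Condition 3: "status" must be "healthy".
  # Naive pattern match; a stricter check would parse the JSON (e.g. with jq).
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' || return 1

  return 0
}

# Fetch the health endpoint with curl and apply is_healthy to the result.
probe_node() {
  local url="$1" tmp out rc
  tmp="$(mktemp)"
  # -w prints "<status code> <total seconds>" after the body is written to $tmp.
  if out="$(curl -s -o "$tmp" -w '%{http_code} %{time_total}' --max-time 3 "$url")"; then
    is_healthy "${out%% *}" "${out##* }" "$(cat "$tmp")"
    rc=$?
  else
    rc=1   # connection failure or timeout
  fi
  rm -f "$tmp"
  return "$rc"
}
```

A single failed probe should not remove a node on its own; the healthy/unhealthy thresholds listed earlier still apply to consecutive results.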

### Load Balancing Algorithm

**Recommended:** Round-robin

- Simple
- Evenly distributes load
- No sticky sessions needed (CRDTs handle conflicts)

**Not recommended:** Least connections

- Can cause hotspots
- Unnecessary complexity

### Session Affinity

**Not required** - StemeDB uses CRDTs, so queries can hit any node

---

## Security Considerations

### Admin Endpoints

⚠️ **CRITICAL:** Admin endpoints have NO authentication in Pilot 5

**Endpoints to restrict:**
- `/v1/admin/quarantine` - Manage quarantine queue
- `/v1/admin/circuit_breakers` - Ban/unban agents
- `/v1/admin/indexes/rebuild` - Trigger index rebuild
- `/v1/admin/compact` - Trigger compaction

**Restriction methods:**

**Option 1: Firewall (recommended)**
```bash
# Block /v1/admin/ from public
# iptables example (string match only sees plaintext HTTP, not paths inside TLS):
sudo iptables -A INPUT -p tcp --dport 18180 -m string --string "/v1/admin/" --algo bm -j DROP

# Or in nginx (goes inside the server block of the nginx config, not in a shell):
location /v1/admin/ {
    deny all;
    return 403;
}
```

**Option 2: VPN-only access**
- Require VPN connection to reach port 18181 (cluster gateway)
- Use `/v1/admin/` endpoints via cluster gateway only

**Option 3: IP allowlist**
```nginx
# Nginx example
location /v1/admin/ {
    allow 10.0.0.0/8;  # Internal network
    deny all;
}
```

### Metrics Endpoint

**`/metrics` endpoint exposes sensitive information:**
- Assertion counts
- Query patterns
- Agent IDs
- Performance data

**Restriction:**
```nginx
# Allow only from monitoring systems
location /metrics {
    allow 10.0.1.100;  # Prometheus server
    deny all;
}
```

---

## Network Topology Examples

### Single-Node with Reverse Proxy

```
    Internet
        │
        ▼
  [Nginx/Envoy] (TLS termination, port 443)
        │
        ▼
  [StemeDB API] (port 18180, HTTP)
        │
        ▼
     [Data] (/data/wal, /data/db)
```

### Three-Node Cluster

```
    Internet
        │
        ▼
 [Load Balancer] (TLS, port 443)
        │
        ├─────────┬─────────┐
        ▼         ▼         ▼
    [Node 1]  [Node 2]  [Node 3]  (port 18180, HTTP)
        │         │         │
        └─────────┴─────────┘  (ports 18182-18183, replication)
```

**See:** [diagrams/network-topology.txt](./diagrams/network-topology.txt) for ASCII diagram.

---

## Troubleshooting

### Connection Refused

**Symptom:** `curl: (7) Failed to connect to localhost port 18180: Connection refused`

**Diagnosis:**
```bash
# Check if port is listening
sudo lsof -i :18180
# Should show: stemedb-api

# Check firewall
sudo iptables -L -n | grep 18180

# Check service status
sudo systemctl status stemedb-api
```

**Resolution:** See [Server Won't Start Runbook](../../runbooks/server-wont-start.md)

### High Latency Between Nodes

**Symptom:** `replication_lag_seconds` >5

**Diagnosis:**
```bash
# Test inter-node latency
ping -c 100 node2
# If avg >5ms → Network issue

# Check bandwidth (start a server on node2 first with: iperf3 -s)
iperf3 -c node2
# Should show >100 Mbps
```

**Resolution:** See [High Query Latency Runbook](../../runbooks/high-query-latency.md#1-replication-lag)

### SWIM Gossip Not Working

**Symptom:** Nodes not discovering each other

**Diagnosis:**
```bash
# Check UDP port 18183
sudo tcpdump -i eth0 udp port 18183
# Should show periodic SWIM messages

# Check firewall (UDP!)
sudo iptables -L -n | grep 18183
```

**Resolution:** Open UDP port 18183 between cluster nodes

---

## Related Documentation

- [Single-Node Architecture](./single-node-pilot.md) - Network for single-node
- [Three-Node Cluster](./three-node-cluster.md) - Network for cluster
- [Deployment Examples](../../deployment/) - Nginx and Envoy configs
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster network setup

---

**Last Updated:** 2026-02-11