# Network Requirements
**Network configuration for StemeDB deployments**

---
## Port Scheme (181XX)
StemeDB uses ports in the `181XX` range for all services; current assignments fall within 18180-18189:

| Port | Protocol | Service | Purpose | Expose To |
|------|----------|---------|---------|-----------|
| **18180** | TCP/HTTP | API Server | Queries, ingest, metrics | Clients (via reverse proxy) |
| **18181** | TCP/HTTP | Cluster Gateway | Cluster coordination, admin endpoints | Internal network only |
| **18182** | TCP/gRPC | Cluster RPC | Assertion replication | Cluster nodes only |
| **18183** | UDP | SWIM Gossip | Membership, failure detection | Cluster nodes only |
| 18184 | TCP/HTTP | (Reserved for future metrics) | - | - |
| 18185 | TCP/HTTP | (Reserved for future admin) | - | - |
| 18186-18189 | - | (Reserved for applications) | - | - |
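
For scripted checks, the assignments in the table can be encoded as a small lookup. The sketch below is illustrative only: `stemedb_port_proto` and `stemedb_listening` are hypothetical helper names, not part of StemeDB, and the listening check assumes a Linux host with `ss` (iproute2) available.

```bash
#!/bin/sh
# Hypothetical helpers; port assignments are copied from the table above.

# Print the transport protocol ("tcp" or "udp") for an assigned StemeDB port.
# Returns 1 for reserved or unknown ports.
stemedb_port_proto() {
    case "$1" in
        18180|18181|18182) echo tcp ;;
        18183)             echo udp ;;
        *)                 return 1 ;;
    esac
}

# Succeed if a socket on this host is listening on the given assigned port.
# Returns 2 if the port is not an assigned StemeDB port.
stemedb_listening() {
    proto=$(stemedb_port_proto "$1") || return 2
    if [ "$proto" = udp ]; then flag=-lun; else flag=-ltn; fi
    # Column 4 of `ss` output is "Local Address:Port".
    ss -H "$flag" | awk -v p=":$1" '$4 ~ p"$" { found = 1 } END { exit !found }'
}

# Example: stemedb_listening 18180 || echo "API server is not listening"
```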
---
## Firewall Rules
### Single-Node Deployment
**Allow inbound:**
- Port 18180 from load balancer/reverse proxy (or internal network)
- Port 22 (SSH) from bastion host

**Block:**
- Port 18180 from public internet (use reverse proxy)
- Ports 18181-18183 (not used in single-node)

**AWS Security Group:**
```bash
# sg-stemedb, sg-load-balancer, and sg-bastion are placeholders;
# substitute your real security group IDs.
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180
# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```
**iptables:**
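```bash
# Allow API from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP
# Save rules (a plain "sudo iptables-save > file" fails because the redirect
# runs as the calling user, so write the file through sudo tee)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null
```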
---
### Three-Node Cluster
**Allow inbound:**
- Port 18180 from load balancer (API traffic)
- Ports 18181-18183 from cluster nodes (inter-node)
- Port 22 (SSH) from bastion host

**Block:**
- Ports 18180-18183 from public internet
- Port 18181 from outside internal network (admin endpoint security)

**AWS Security Group:**
```bash
# sg-stemedb, sg-load-balancer, and sg-bastion are placeholders;
# substitute your real security group IDs.
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180
# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol tcp \
  --port 18181-18182
# Allow SWIM gossip (UDP)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol udp \
  --port 18183
# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```
**iptables (on each node):**
```bash
# Allow API from load balancer
sudo iptables -A INPUT -p tcp -s 10.0.1.10 --dport 18180 -j ACCEPT
# Allow cluster traffic from other nodes
sudo iptables -A INPUT -p tcp -s 10.0.1.51 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.52 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.53 --dport 18181:18182 -j ACCEPT
# Allow SWIM gossip
sudo iptables -A INPUT -p udp -s 10.0.1.0/24 --dport 18183 -j ACCEPT
# Drop everything else aimed at the StemeDB port range.
# The gossip port is UDP, so it needs its own DROP rule.
sudo iptables -A INPUT -p tcp --dport 18180:18189 -j DROP
sudo iptables -A INPUT -p udp --dport 18183 -j DROP
```
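Because the per-node rules differ only in the peer addresses, they can be generated rather than hand-edited. The following is a minimal sketch, not a supported tool: `gen_cluster_rules` is a hypothetical function name, and it only prints the commands (a dry run) so they can be reviewed before being piped to a root shell. It drops unmatched UDP as well as TCP traffic to the port range, since the gossip port is UDP. The addresses are the example values used in this section.

```bash
#!/bin/sh
# gen_cluster_rules LB_IP GOSSIP_CIDR NODE_IP...
# Prints (does not execute) the iptables commands for one cluster node.
gen_cluster_rules() {
    lb_ip=$1
    gossip_cidr=$2
    shift 2
    # API traffic is accepted from the load balancer only.
    echo "iptables -A INPUT -p tcp -s $lb_ip --dport 18180 -j ACCEPT"
    # Gateway and RPC ports are accepted from each listed peer node.
    for node_ip in "$@"; do
        echo "iptables -A INPUT -p tcp -s $node_ip --dport 18181:18182 -j ACCEPT"
    done
    # SWIM gossip (UDP) is accepted from the cluster subnet.
    echo "iptables -A INPUT -p udp -s $gossip_cidr --dport 18183 -j ACCEPT"
    # Everything else aimed at the StemeDB port range is dropped.
    echo "iptables -A INPUT -p tcp --dport 18180:18189 -j DROP"
    echo "iptables -A INPUT -p udp --dport 18180:18189 -j DROP"
}

# Example: review the output first, then append "| sudo sh" to apply it.
gen_cluster_rules 10.0.1.10 10.0.1.0/24 10.0.1.51 10.0.1.52 10.0.1.53
```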
---
## TLS Configuration
### Requirements
- **Minimum TLS version:** 1.3
- **Certificate validity:** 90 days or less (automate renewal)
- **Key algorithm:** RSA 2048-bit or ECDSA P-256
- **Termination:** At reverse proxy (recommended) or at StemeDB API
### Let's Encrypt Automation
**Certbot with nginx:**
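```bash
# Install certbot
sudo apt install certbot python3-certbot-nginx
# Obtain certificate
sudo certbot --nginx -d stemedb.example.com
# Auto-renewal (cron); skip this if `systemctl list-timers` already shows a certbot timer
sudo crontab -e
# Add (the deploy hook reloads nginx only when a certificate was actually renewed):
0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"
```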
**Manual certificate (for testing):**
```bash
# Generate self-signed (NOT for production)
# Writing under /etc requires root, so create the directory and run openssl with sudo
sudo mkdir -p /etc/stemedb/tls
sudo openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /etc/stemedb/tls/key.pem \
  -out /etc/stemedb/tls/cert.pem \
  -days 365 \
  -subj "/CN=stemedb.local"
# Set permissions
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem
```
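Since the requirements call for short-lived certificates and automated renewal, it helps to alert before a certificate lapses. Below is a minimal sketch using `openssl x509 -checkend`; the function names and the 14-day threshold in the example are illustrative assumptions, not StemeDB defaults.

```bash
#!/bin/sh
# Convert whole days to seconds, the unit expected by openssl's -checkend.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# cert_valid_for_days CERT_FILE DAYS
# Succeeds if the certificate will still be valid DAYS days from now.
# openssl exits non-zero when the certificate expires within the window.
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(days_to_seconds "$2")" >/dev/null
}

# Example (certificate path taken from the self-signed example above):
# cert_valid_for_days /etc/stemedb/tls/cert.pem 14 || echo "certificate expires within 14 days"
```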
### TLS at Reverse Proxy (Recommended)
**Nginx example:**
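```nginx
server {
    # On nginx 1.25.1 and later, prefer "listen 443 ssl;" plus a separate "http2 on;"
    listen 443 ssl http2;
    server_name stemedb.example.com;
    ssl_certificate /etc/letsencrypt/live/stemedb.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stemedb.example.com/privkey.pem;
    ssl_protocols TLSv1.3;
    # ssl_ciphers governs TLS 1.2 and below; it does not select TLS 1.3 cipher suites
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    location / {
        # "stemedb_cluster" must be defined in an upstream block (see the complete config)
        proxy_pass http://stemedb_cluster;
    }
}
```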
**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) for complete example.

---
## DNS Configuration
### Single-Node
**Simple A record:**
```
stemedb.example.com. 300 IN A 10.0.1.50
```
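**Health check:** DNS itself performs no health checking. Failover is manual: repoint the A record to a healthy server, and expect clients to keep the old address for up to the 300-second TTL.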
### Three-Node Cluster
**Option 1: Load balancer with CNAME**
```
stemedb.example.com. 300 IN CNAME stemedb-lb.example.com.
stemedb-lb.example.com. 60 IN A 10.0.1.10
node1.example.com. 300 IN A 10.0.1.51
node2.example.com. 300 IN A 10.0.1.52
node3.example.com. 300 IN A 10.0.1.53
```
**Option 2: Multiple A records (DNS round-robin)**
```
stemedb.example.com. 60 IN A 10.0.1.51
stemedb.example.com. 60 IN A 10.0.1.52
stemedb.example.com. 60 IN A 10.0.1.53
```
**Note:** DNS round-robin does not detect failed nodes, so clients continue to be sent to a dead address. Use a load balancer instead.
### Internal DNS (Private Network)
**For cluster communication:**
```
# Private hosted zone: cluster.local
node1.cluster.local. 300 IN A 10.0.1.51
node2.cluster.local. 300 IN A 10.0.1.52
node3.cluster.local. 300 IN A 10.0.1.53
```
---
## Latency Requirements
### Single-Node
- **Client → Server:** <100ms (typical internet)
- **No inter-node requirements**
### Three-Node Cluster
- **Client → Load Balancer:** <100ms
- **Load Balancer → Node:** <10ms (same region)
- **Node ↔ Node:** **<5ms (CRITICAL)**

**Why <5ms inter-node?**
- SWIM gossip requires fast responses
- Replication lag increases with latency
- Merkle sync performance degrades

**Test latency:**
```bash
# From node1 to node2
ping -c 100 node2.cluster.local
# Expected:
# rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms
# If avg >5ms → Nodes too far apart (different regions?)
```
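The manual check can be automated by parsing the summary line that `ping` prints. This is a sketch that assumes the `min/avg/max` summary format shown in the expected output; `ping_avg_ms` and `latency_ok` are hypothetical helper names.

```bash
#!/bin/sh
# Read ping output on stdin and print the average RTT in milliseconds.
# Fails if no "min/avg/max" summary line is present.
ping_avg_ms() {
    # Splitting on "=", "/" and spaces puts the average in field 7:
    # rtt min avg max mdev 0.5 1.2 3.5 0.8 ms
    awk -F'[=/ ]+' '/min\/avg\/max/ { print $7; found = 1 } END { exit !found }'
}

# latency_ok AVG_MS LIMIT_MS
# Succeeds if AVG_MS is strictly below LIMIT_MS (numeric comparison).
latency_ok() {
    awk -v avg="$1" -v limit="$2" 'BEGIN { exit !(avg + 0 < limit + 0) }'
}

# Example, using the 5 ms inter-node limit from this section:
# avg=$(ping -c 100 node2.cluster.local | ping_avg_ms) \
#   && latency_ok "$avg" 5 || echo "inter-node latency check failed"
```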
**Deployment recommendations:**
- Same availability zone: <1ms typical
- Same region, different AZs: 1-5ms (acceptable)
- Different regions: >10ms (not supported)
---
## Bandwidth Requirements
### Single-Node
- **Ingest:** ~1 KB per assertion → 100 assertions/sec = 100 KB/sec = 0.8 Mbps
- **Queries:** ~5 KB per query → 100 queries/sec = 500 KB/sec = 4 Mbps
- **Total:** ~5 Mbps typical, 10 Mbps recommended
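
The estimates above follow from payload size × request rate × 8 bits per byte. Here is a small sketch for re-running the arithmetic with your own numbers; `mbps` is a hypothetical helper, and it assumes decimal units (1 KB = 1000 bytes, 1 Mbps = 1,000,000 bits/sec), which is what the figures above use.

```bash
#!/bin/sh
# mbps KB_PER_ITEM ITEMS_PER_SEC
# Print the required bandwidth in Mbps, rounded to one decimal place.
mbps() {
    # KB * items/sec = KB/sec; * 8 = kilobits/sec; / 1000 = megabits/sec
    awk -v kb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", kb * rate * 8 / 1000 }'
}

mbps 1 100   # ingest:  1 KB x 100 assertions/sec, prints 0.8
mbps 5 100   # queries: 5 KB x 100 queries/sec,    prints 4.0
```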
### Three-Node Cluster
**Per node:**
- **Client traffic:** Same as single-node (~5 Mbps)
- **Replication traffic:** ~1 MB per 1K assertions (about 1 KB each), so 100 assertions/sec is roughly 0.8 Mbps per peer; provision 1 Gbps for high-throughput ingest

**Total cluster:**
- **Client traffic:** 15 Mbps (3× single-node)
- **Replication traffic:** ~10 Mbps typical, 100 Mbps burst

**Recommended:**
- **Public bandwidth:** 100 Mbps per node
- **Private bandwidth:** 1 Gbps per node (10 Gbps for production)
---
## Load Balancer Configuration
### Health Checks
**HTTP health check configuration:**
```
Endpoint: /v1/health
Method: GET
Interval: 5 seconds
Timeout: 3 seconds
Healthy threshold: 2
Unhealthy threshold: 3
```
**Expected response:**
```json
{
"status": "healthy",
"version": "0.1.0",
"uptime_seconds": 12345
}
```
**Mark unhealthy if:**
- HTTP status != 200
- Response time >3 seconds
- `status` field != "healthy"
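
The three criteria above can be combined into a single probe. The sketch below is an illustration under stated assumptions, not the configuration of any particular load balancer: `is_healthy` and `probe` are hypothetical names, the JSON check is a simple pattern match rather than a real parser, and `probe` assumes `curl` is installed.

```bash
#!/bin/sh
# is_healthy HTTP_STATUS ELAPSED_SECONDS BODY
# Applies the criteria above: status 200, response within 3 seconds,
# and a "status" field equal to "healthy".
is_healthy() {
    [ "$1" = "200" ] || return 1
    awk -v t="$2" 'BEGIN { exit !(t + 0 <= 3) }' || return 1
    printf '%s' "$3" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# probe BASE_URL
# Fetches /v1/health with a 3-second timeout and evaluates the response.
probe() {
    # curl appends a final line "<http_code> <time_total>" after the body.
    out=$(curl -s --max-time 3 -w '\n%{http_code} %{time_total}' "$1/v1/health") || return 1
    meta=$(printf '%s\n' "$out" | tail -n 1)
    body=$(printf '%s\n' "$out" | sed '$d')
    is_healthy "${meta%% *}" "${meta#* }" "$body"
}

# Example: probe http://10.0.1.51:18180 || echo "node1 unhealthy"
```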
### Load Balancing Algorithm
**Recommended:** Round-robin
- Simple
- Evenly distributes load
- No sticky sessions needed (CRDTs handle conflicts)

**Not recommended:** Least connections
- Can cause hotspots
- Unnecessary complexity
### Session Affinity
**Not required** - StemeDB uses CRDTs, so queries can hit any node

---
## Security Considerations
### Admin Endpoints
⚠️ **CRITICAL:** Admin endpoints have NO authentication in Pilot 5

**Endpoints to restrict:**
- `/v1/admin/quarantine` - Manage quarantine queue
- `/v1/admin/circuit_breakers` - Ban/unban agents
- `/v1/admin/indexes/rebuild` - Trigger index rebuild
- `/v1/admin/compact` - Trigger compaction

**Restriction methods:**

**Option 1: Firewall (recommended)**
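```bash
# Block /v1/admin/ from public
# iptables example. Best effort only: the match is per packet, so it cannot
# inspect TLS traffic and can miss a path split across packets. The rule has
# no source filter, so it drops admin requests from every source on port 18180.
sudo iptables -A INPUT -p tcp --dport 18180 -m string --string "/v1/admin/" --algo bm -j DROP
# Or in nginx (more reliable; place inside the server block):
location /v1/admin/ {
    deny all;
    return 403;
}
```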
**Option 2: VPN-only access**
- Require VPN connection to reach port 18181 (cluster gateway)
- Use `/v1/admin/` endpoints via cluster gateway only

**Option 3: IP allowlist**
```nginx
# Nginx example
location /v1/admin/ {
allow 10.0.0.0/8; # Internal network
deny all;
}
```
### Metrics Endpoint
**`/metrics` endpoint exposes sensitive information:**
- Assertion counts
- Query patterns
- Agent IDs
- Performance data

**Restriction:**
```nginx
# Allow only from monitoring systems
location /metrics {
allow 10.0.1.100; # Prometheus server
deny all;
}
```
---
## Network Topology Examples
### Single-Node with Reverse Proxy
```
Internet
   │
   ▼
[Nginx/Envoy] (TLS termination, port 443)
   │
   ▼
[StemeDB API] (port 18180, HTTP)
   │
   ▼
[Data] (/data/wal, /data/db)
```
### Three-Node Cluster
```
Internet
   │
   ▼
[Load Balancer] (TLS, port 443)
   │
   ├─────────┬─────────┐
   ▼         ▼         ▼
[Node 1]  [Node 2]  [Node 3]   (port 18180, HTTP)
   │         │         │
   └─────────┴─────────┘       (ports 18182-18183, replication)
```
**See:** [diagrams/network-topology.txt](./diagrams/network-topology.txt) for ASCII diagram.

---
## Troubleshooting
### Connection Refused
**Symptom:** `curl: (7) Failed to connect to localhost port 18180: Connection refused`

**Diagnosis:**
```bash
# Check if port is listening
sudo lsof -i :18180
# Should show: stemedb-api
# Check firewall
sudo iptables -L -n | grep 18180
# Check service status
sudo systemctl status stemedb-api
```
**Resolution:** See [Server Won't Start Runbook](../../runbooks/server-wont-start.md)
### High Latency Between Nodes
**Symptom:** `replication_lag_seconds` >5

**Diagnosis:**
```bash
# Test inter-node latency
ping -c 100 node2.cluster.local
# If avg >5ms → Network issue
# Check bandwidth (requires `iperf3 -s` to be running on node2)
iperf3 -c node2.cluster.local
# Should show >100 Mbps
```
**Resolution:** See [High Query Latency Runbook](../../runbooks/high-query-latency.md#1-replication-lag)
### SWIM Gossip Not Working
**Symptom:** Nodes not discovering each other

**Diagnosis:**
```bash
# Check UDP port 18183 (replace eth0 with the node's cluster-facing interface)
sudo tcpdump -i eth0 udp port 18183
# Should show periodic SWIM messages
# Check firewall (UDP!)
sudo iptables -L -n | grep 18183
```
**Resolution:** Open UDP port 18183 between cluster nodes

---
## Related Documentation
- [Single-Node Architecture](./single-node-pilot.md) - Network for single-node
- [Three-Node Cluster](./three-node-cluster.md) - Network for cluster
- [Deployment Examples](../../deployment/) - Nginx and Envoy configs
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster network setup
---
**Last Updated:** 2026-02-11