Network Requirements

Network configuration for StemeDB deployments

Port Scheme (181XX)

StemeDB uses ports in the 181XX range for all services:

Port	Protocol	Service	Purpose	Expose To
18180	TCP/HTTP	API Server	Queries, ingest, metrics	Clients (via reverse proxy)
18181	TCP/HTTP	Cluster Gateway	Cluster coordination, admin endpoints	Internal network only
18182	TCP/gRPC	Cluster RPC	Assertion replication	Cluster nodes only
18183	UDP	SWIM Gossip	Membership, failure detection	Cluster nodes only
18184	TCP/HTTP	(Reserved for future metrics)	-	-
18185	TCP/HTTP	(Reserved for future admin)	-	-
18186-18189	-	(Reserved for applications)	-	-

Firewall Rules

Single-Node Deployment

Allow inbound:

Port 18180 from load balancer/reverse proxy (or internal network)
Port 22 (SSH) from bastion host

Block:

Port 18180 from public internet (use reverse proxy)
Ports 18181-18183 (not used in single-node)

AWS Security Group:

# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22

iptables:

# Allow API from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

# Save rules
# (the redirect must also run as root, so pipe through sudo tee)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null

Three-Node Cluster

Allow inbound:

Port 18180 from load balancer (API traffic)
Ports 18181-18183 from cluster nodes (inter-node)
Port 22 (SSH) from bastion host

Block:

Ports 18180-18183 from public internet
Port 18181 from outside internal network (admin endpoint security)

AWS Security Group:

# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol tcp \
  --port 18181-18182

# Allow SWIM gossip (UDP)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol udp \
  --port 18183

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22

iptables (on each node):

# Allow API from load balancer
sudo iptables -A INPUT -p tcp -s 10.0.1.10 --dport 18180 -j ACCEPT

# Allow cluster traffic from other nodes
sudo iptables -A INPUT -p tcp -s 10.0.1.51 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.52 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.53 --dport 18181:18182 -j ACCEPT

# Allow SWIM gossip
sudo iptables -A INPUT -p udp -s 10.0.1.0/24 --dport 18183 -j ACCEPT

# Drop everything else on the StemeDB ports (TCP and UDP)
sudo iptables -A INPUT -p tcp --dport 18180:18189 -j DROP
sudo iptables -A INPUT -p udp --dport 18180:18189 -j DROP
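
The per-node rules above repeat one ACCEPT line per peer, which is easy to get wrong as nodes are added. A minimal sketch that prints (but does not apply) rules of the same shape from a list of peers is shown here. The function name `stemedb_cluster_rules` and its argument order are hypothetical; the ports are the ones from the port scheme, and the addresses in the usage note are the example addresses from this page.

```shell
# Print (do not apply) the cluster firewall rules for one node.
# Usage: stemedb_cluster_rules LB_IP GOSSIP_CIDR PEER_IP...
# Hypothetical helper: review the output before piping it to a root shell.
stemedb_cluster_rules() {
  lb_ip="$1"; gossip_cidr="$2"; shift 2

  # API traffic from the load balancer only
  echo "iptables -A INPUT -p tcp -s $lb_ip --dport 18180 -j ACCEPT"

  # Cluster gateway and RPC, one rule per peer
  for peer in "$@"; do
    echo "iptables -A INPUT -p tcp -s $peer --dport 18181:18182 -j ACCEPT"
  done

  # SWIM gossip is UDP
  echo "iptables -A INPUT -p udp -s $gossip_cidr --dport 18183 -j ACCEPT"

  # Everything else on the 181XX range is dropped
  echo "iptables -A INPUT -p tcp --dport 18180:18189 -j DROP"
  echo "iptables -A INPUT -p udp --dport 18180:18189 -j DROP"
}
```

For example, `stemedb_cluster_rules 10.0.1.10 10.0.1.0/24 10.0.1.51 10.0.1.52 10.0.1.53` prints the rule set for the three example nodes. Printing rather than applying keeps the helper safe to run anywhere and leaves the decision to apply with the operator.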

TLS Configuration

Requirements

Minimum TLS version: 1.3
Certificate validity: 90 days or less (automate renewal)
Key algorithm: RSA 2048-bit or ECDSA P-256
Termination: At reverse proxy (recommended) or at StemeDB API
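
Short-lived certificates only help if an expiring one is noticed in time. As a hedged sketch, the check below uses `openssl x509 -checkend` to report whether a certificate expires within a given number of days. The function name `cert_needs_renewal` and the 30-day default are assumptions for illustration, not part of StemeDB.

```shell
# Succeed (exit 0) when the certificate expires within the next DAYS days
# (default 30) or cannot be read; fail (exit 1) when it is still good.
# Hypothetical helper name; the 30-day default is an assumption.
cert_needs_renewal() {
  cert="$1"; days="${2:-30}"
  # `openssl x509 -checkend N` exits 0 only if the certificate is still
  # valid N seconds from now.
  if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

A monitoring job could call `cert_needs_renewal /etc/letsencrypt/live/stemedb.example.com/fullchain.pem 30 && echo "certificate expires within 30 days"`. Treating an unreadable file as "needs renewal" is a deliberate choice here, so that a missing certificate is reported rather than silently ignored.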

Let's Encrypt Automation

Certbot with nginx:

# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain certificate
sudo certbot --nginx -d stemedb.example.com

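# Auto-renewal (root crontab)
sudo crontab -e
# Add this line (the deploy hook reloads nginx only when a certificate was actually renewed):
0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"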

Manual certificate (for testing):

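# Generate self-signed (NOT for production)
# /etc/stemedb/tls is root-owned, so create it and write the files with sudo
sudo mkdir -p /etc/stemedb/tls
sudo openssl req -x509 -newkey rsa:2048 -nodes \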
  -keyout /etc/stemedb/tls/key.pem \
  -out /etc/stemedb/tls/cert.pem \
  -days 365 \
  -subj "/CN=stemedb.local"

# Set permissions
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

TLS at Reverse Proxy (Recommended)

Nginx example:

server {
    listen 443 ssl http2;
    server_name stemedb.example.com;

    ssl_certificate /etc/letsencrypt/live/stemedb.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stemedb.example.com/privkey.pem;

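    ssl_protocols TLSv1.3;
    # ssl_ciphers applies to TLS 1.2 and earlier only; it takes effect if older protocols are ever enabled
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;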

    location / {
        proxy_pass http://stemedb_cluster;
    }
}

See: Nginx Config for complete example.

DNS Configuration

Single-Node

Simple A record:

stemedb.example.com.  300  IN  A  10.0.1.50

Failover: Manual. DNS performs no health check, so repoint the A record to a healthy server if the node fails.

Three-Node Cluster

Option 1: Load balancer with CNAME

stemedb.example.com.     300  IN  CNAME  stemedb-lb.example.com.
stemedb-lb.example.com.  60   IN  A      10.0.1.10

node1.example.com.       300  IN  A      10.0.1.51
node2.example.com.       300  IN  A      10.0.1.52
node3.example.com.       300  IN  A      10.0.1.53

Option 2: Multiple A records (DNS round-robin)

stemedb.example.com.  60  IN  A  10.0.1.51
stemedb.example.com.  60  IN  A  10.0.1.52
stemedb.example.com.  60  IN  A  10.0.1.53

⚠️ Note: DNS round-robin doesn't detect failed nodes, so clients keep being sent to a dead node until its record is removed. Use a load balancer instead (Option 1).

Internal DNS (Private Network)

For cluster communication:

# Private hosted zone: cluster.local
node1.cluster.local.  300  IN  A  10.0.1.51
node2.cluster.local.  300  IN  A  10.0.1.52
node3.cluster.local.  300  IN  A  10.0.1.53

Latency Requirements

Single-Node

Client → Server: <100ms (typical internet)
No inter-node requirements

Three-Node Cluster

Client → Load Balancer: <100ms
Load Balancer → Node: <10ms (same region)
Node ↔ Node: <5ms (CRITICAL)

Why <5ms inter-node?

SWIM gossip requires fast responses
Replication lag increases with latency
Merkle sync performance degrades

Test latency:

# From node1 to node2
ping -c 100 node2.cluster.local

# Expected:
# rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms

# If avg >5ms → Nodes too far apart (different regions?)

Deployment recommendations:

✅ Same availability zone: <1ms typical
⚠️ Same region, different AZs: 1-5ms (acceptable)
❌ Different regions: >10ms (not supported)
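
The manual ping check above can be turned into a pass/fail test. The sketch below extracts the average RTT from the `ping` summary line and compares it with the 5 ms inter-node limit. The function names `ping_avg_ms` and `latency_ok` are hypothetical.

```shell
# Read `ping` output on stdin and print the average RTT in ms.
# Handles the Linux ("rtt ...") and BSD/macOS ("round-trip ...") summary lines.
ping_avg_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Succeed only when the average RTT is at or below the limit (default 5 ms).
# An empty value (no summary line found) counts as a failure.
latency_ok() {
  avg="$1"; limit="${2:-5}"
  awk -v a="$avg" -v l="$limit" 'BEGIN { exit !(a != "" && a + 0 <= l + 0) }'
}
```

Typical use from node1 would be `avg="$(ping -c 100 node2.cluster.local | ping_avg_ms)"` followed by `latency_ok "$avg" || echo "inter-node latency too high: ${avg} ms"`. The comparison runs in awk because POSIX shell arithmetic does not handle fractional milliseconds.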

Bandwidth Requirements

Single-Node

Ingest: ~1 KB per assertion → 100 assertions/sec = 100 KB/sec = 0.8 Mbps
Queries: ~5 KB per query → 100 queries/sec = 500 KB/sec = 4 Mbps
Total: ~5 Mbps typical, 10 Mbps recommended
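
The arithmetic behind these figures can be reproduced for other workloads. This is a small sketch using decimal units (1 KB = 1000 bytes, 1 Mbps = 10^6 bits per second), which is how the numbers above work out. The function name `to_mbps` is hypothetical.

```shell
# Convert a payload size (KB per item) and a rate (items per second) to Mbps.
# Decimal units: KB * items/sec = KB/sec; * 8 = kilobits/sec; / 1000 = Mbps.
to_mbps() {
  awk -v kb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", kb * rate * 8 / 1000 }'
}
```

With the values above, `to_mbps 1 100` gives 0.8 (ingest) and `to_mbps 5 100` gives 4.0 (queries), which sum to the ~5 Mbps total. The result excludes protocol overhead, so treat it as a lower bound when sizing links.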

Three-Node Cluster

Per node:

Client traffic: Same as single-node (~5 Mbps)
Replication traffic: ~1 MB per 1K assertions (about 1 KB each); provision 1 Gbps links for high-throughput ingest

Total cluster:

Client traffic: 15 Mbps (3× single-node)
Replication traffic: ~10 Mbps typical, 100 Mbps burst

Recommended:

Public bandwidth: 100 Mbps per node
Private bandwidth: 1 Gbps per node (10 Gbps for production)

Load Balancer Configuration

Health Checks

HTTP health check configuration:

Endpoint: /v1/health
Method: GET
Interval: 5 seconds
Timeout: 3 seconds
Healthy threshold: 2
Unhealthy threshold: 3

Expected response:

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 12345
}

Mark unhealthy if:

HTTP status != 200
Response time >3 seconds
status field != "healthy"
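
For load balancers that run a script as the health probe, or for ad hoc checks, the three conditions above can be sketched as follows. The function names `is_healthy` and `probe_node` are hypothetical; the endpoint, port, and 3-second timeout come from this page. The body check is a plain pattern match, not a JSON parser.

```shell
# Decide health from an HTTP status code and a response body.
# Covers the status-code and status-field conditions; the caller enforces the timeout.
is_healthy() {
  code="$1"; body="$2"
  [ "$code" = "200" ] || return 1
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Probe one node over HTTP. `-m 3` enforces the 3-second limit; `-w` makes
# curl append the status code on its own line after the body.
probe_node() {
  resp="$(curl -s -m 3 -w '\n%{http_code}' "http://$1:18180/v1/health")" || return 1
  code="$(printf '%s\n' "$resp" | tail -n 1)"
  body="$(printf '%s\n' "$resp" | sed '$d')"
  is_healthy "$code" "$body"
}
```

For example, `probe_node node1.cluster.local || echo "node1 unhealthy"`. Splitting the decision (`is_healthy`) from the network call (`probe_node`) keeps the rules checkable without a running server.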

Load Balancing Algorithm

Recommended: Round-robin

Simple
Evenly distributes load
No sticky sessions needed (CRDTs handle conflicts)

Not recommended: Least connections

Can cause hotspots
Unnecessary complexity

Session Affinity

Not required - StemeDB uses CRDTs, so queries can hit any node

Security Considerations

Admin Endpoints

⚠️ CRITICAL: Admin endpoints have NO authentication in Pilot 5

Endpoints to restrict:

/v1/admin/quarantine - Manage quarantine queue
/v1/admin/circuit_breakers - Ban/unban agents
/v1/admin/indexes/rebuild - Trigger index rebuild
/v1/admin/compact - Trigger compaction

Restriction methods:

Option 1: Firewall (recommended)

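# Drop packets to the API port whose payload contains /v1/admin/ (applies to all sources)
# Note: payload matching only sees plaintext HTTP; it cannot inspect TLS traffic
# iptables example:
sudo iptables -A INPUT -p tcp --dport 18180 -m string --string "/v1/admin/" --algo bm -j DROP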

# Or in nginx:
location /v1/admin/ {
    deny all;
    return 403;
}

Option 2: VPN-only access

Require VPN connection to reach port 18181 (cluster gateway)
Use /v1/admin/ endpoints via cluster gateway only

Option 3: IP allowlist

# Nginx example
location /v1/admin/ {
    allow 10.0.0.0/8;  # Internal network
    deny all;
}
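
Whichever restriction is chosen, it is worth confirming from outside the internal network that it holds. The sketch below requests each admin endpoint listed above and reports any that does not answer 403. It assumes one of the nginx-based restrictions, which reject with 403; with the iptables DROP rule the request times out instead and is reported as HTTP 000. The function names `fetch_status` and `check_admin_blocked` are hypothetical.

```shell
# Print the HTTP status code for a URL (000 when the connection fails or times out).
fetch_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Print every admin endpoint that is NOT rejected with 403.
# Returns 0 when all four are blocked, 1 otherwise.
check_admin_blocked() {
  base="$1"; failed=0
  for path in /v1/admin/quarantine /v1/admin/circuit_breakers \
              /v1/admin/indexes/rebuild /v1/admin/compact; do
    code="$(fetch_status "$base$path" || true)"
    if [ "$code" != "403" ]; then
      echo "NOT BLOCKED: $path (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}
```

Run it from a host outside the allowlist, for example `check_admin_blocked https://stemedb.example.com`. The check sends GET requests only, which is enough to confirm the path is rejected before it reaches StemeDB.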

Metrics Endpoint

The /metrics endpoint exposes sensitive information:

Assertion counts
Query patterns
Agent IDs
Performance data

Restriction:

# Allow only from monitoring systems
location /metrics {
    allow 10.0.1.100;  # Prometheus server
    deny all;
}

Network Topology Examples

Single-Node with Reverse Proxy

Internet
    │
    ▼
[Nginx/Envoy]  (TLS termination, port 443)
    │
    ▼
[StemeDB API]  (port 18180, HTTP)
    │
    ▼
[Data]  (/data/wal, /data/db)

Three-Node Cluster

Internet
    │
    ▼
[Load Balancer]  (TLS, port 443)
    │
    ├─────────┬─────────┐
    ▼         ▼         ▼
[Node 1]  [Node 2]  [Node 3]  (port 18180, HTTP)
    │         │         │
    └─────────┴─────────┘  (ports 18181-18183, inter-node traffic)

See: diagrams/network-topology.txt for ASCII diagram.

Troubleshooting

Connection Refused

Symptom: curl: (7) Failed to connect to localhost port 18180: Connection refused

Diagnosis:

# Check if port is listening
sudo lsof -i :18180
# Should show: stemedb-api

# Check firewall
sudo iptables -L -n | grep 18180

# Check service status
sudo systemctl status stemedb-api

Resolution: See Server Won't Start Runbook

High Latency Between Nodes

Symptom: replication_lag_seconds >5

Diagnosis:

# Test inter-node latency
ping -c 100 node2
# If avg >5ms → Network issue

# Check bandwidth
iperf3 -c node2
# Should show >100 Mbps

Resolution: See High Query Latency Runbook

SWIM Gossip Not Working

Symptom: Nodes not discovering each other

Diagnosis:

# Check UDP port 18183
sudo tcpdump -i eth0 udp port 18183
# Should show periodic SWIM messages

# Check firewall (UDP!)
sudo iptables -L -n | grep 18183

Resolution: Open UDP port 18183 between cluster nodes

Related Documentation

Single-Node Architecture - Network for single-node
Three-Node Cluster - Network for cluster
Deployment Examples - Nginx and Envoy configs
Add Node Runbook - Cluster network setup

Last Updated: 2026-02-11
