stemedb/docs/operations/reference-architecture/network-requirements.md

Network Requirements

Network configuration for StemeDB deployments


Port Scheme (181XX)

StemeDB uses ports in the 181XX range for all services:

| Port | Protocol | Service | Purpose | Expose To |
|------|----------|---------|---------|-----------|
| 18180 | TCP/HTTP | API Server | Queries, ingest, metrics | Clients (via reverse proxy) |
| 18181 | TCP/HTTP | Cluster Gateway | Cluster coordination, admin endpoints | Internal network only |
| 18182 | TCP/gRPC | Cluster RPC | Assertion replication | Cluster nodes only |
| 18183 | UDP | SWIM Gossip | Membership, failure detection | Cluster nodes only |
| 18184 | TCP/HTTP | (Reserved for future metrics) | - | - |
| 18185 | TCP/HTTP | (Reserved for future admin) | - | - |
| 18186-18189 | - | (Reserved for applications) | - | - |
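
To confirm that a cluster node is actually bound to the ports in this scheme, the output of `ss` can be compared against the table. The following is a minimal sketch, assuming iproute2's `ss` is installed; `check_stemedb_ports` is a hypothetical helper name, not something shipped with StemeDB.

```shell
# Report which StemeDB cluster ports are bound, given `ss -H -tuln` output on stdin.
# check_stemedb_ports is a hypothetical helper, not part of StemeDB.
check_stemedb_ports() {
    awk '
        {
            n = split($5, parts, ":")      # column 5 is the local address; the port follows the last ":"
            seen[$1 "/" parts[n]] = 1      # e.g. "tcp/18180", "udp/18183"
        }
        END {
            split("tcp/18180 tcp/18181 tcp/18182 udp/18183", want, " ")
            for (i = 1; i <= 4; i++)
                printf "%s %s\n", want[i], ((want[i] in seen) ? "listening" : "missing")
        }
    '
}

# Usage on a cluster node:
#   ss -H -tuln | check_stemedb_ports
```

On a single-node deployment, ports 18181-18183 are expected to show as missing.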

Firewall Rules

Single-Node Deployment

Allow inbound:

  • Port 18180 from load balancer/reverse proxy (or internal network)
  • Port 22 (SSH) from bastion host

Block:

  • Port 18180 from public internet (use reverse proxy)
  • Ports 18181-18183 (not used in single-node)

AWS Security Group:

# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22

iptables:

# Allow API from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

# Save rules (the redirect must run as root, so pipe through sudo tee)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null

Three-Node Cluster

Allow inbound:

  • Port 18180 from load balancer (API traffic)
  • Ports 18181-18183 from cluster nodes (inter-node)
  • Port 22 (SSH) from bastion host

Block:

  • Ports 18180-18183 from public internet
  • Port 18181 from outside internal network (admin endpoint security)

AWS Security Group:

# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol tcp \
  --port 18181-18182

# Allow SWIM gossip (UDP)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol udp \
  --port 18183

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22

iptables (on each node):

# Allow API from load balancer
sudo iptables -A INPUT -p tcp -s 10.0.1.10 --dport 18180 -j ACCEPT

# Allow cluster traffic from other nodes
sudo iptables -A INPUT -p tcp -s 10.0.1.51 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.52 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.53 --dport 18181:18182 -j ACCEPT

# Allow SWIM gossip
sudo iptables -A INPUT -p udp -s 10.0.1.0/24 --dport 18183 -j ACCEPT

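# Drop everything else on the StemeDB ports (TCP range and the UDP gossip port)
sudo iptables -A INPUT -p tcp --dport 18180:18189 -j DROP
sudo iptables -A INPUT -p udp --dport 18183 -j DROP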
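
Because the same rule set has to be repeated on every node, it can help to generate it from the node list rather than type it by hand. The sketch below only prints the commands so they can be reviewed first; `gen_cluster_rules` is a hypothetical helper, and the trailing UDP DROP is an addition that rejects gossip packets arriving from outside the allowed subnet.

```shell
# Print (do not apply) the per-node iptables rules for a cluster.
# Arguments: load balancer IP, gossip subnet, then one IP per cluster node.
# gen_cluster_rules is a hypothetical helper; review its output before running it.
gen_cluster_rules() {
    lb_ip=$1
    gossip_cidr=$2
    shift 2
    echo "iptables -A INPUT -p tcp -s $lb_ip --dport 18180 -j ACCEPT"
    for node_ip in "$@"; do
        echo "iptables -A INPUT -p tcp -s $node_ip --dport 18181:18182 -j ACCEPT"
    done
    echo "iptables -A INPUT -p udp -s $gossip_cidr --dport 18183 -j ACCEPT"
    echo "iptables -A INPUT -p tcp --dport 18180:18189 -j DROP"
    echo "iptables -A INPUT -p udp --dport 18183 -j DROP"
}

# Usage, with the example addresses from this section:
#   gen_cluster_rules 10.0.1.10 10.0.1.0/24 10.0.1.51 10.0.1.52 10.0.1.53
```

The ACCEPT rules are emitted before the DROP rules because iptables evaluates a chain in order and stops at the first match.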

TLS Configuration

Requirements

  • Minimum TLS version: 1.3
  • Certificate validity: ≤90 days (automate renewal)
  • Key algorithm: RSA 2048-bit or ECDSA P-256
  • Termination: At reverse proxy (recommended) or at StemeDB API

Let's Encrypt Automation

Certbot with nginx:

# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain certificate
sudo certbot --nginx -d stemedb.example.com

# Auto-renewal (root crontab)
sudo crontab -e
# Add (the deploy hook reloads nginx only when a certificate was actually renewed):
0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"
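
Automated renewal can fail silently, so it is worth checking the remaining validity independently. The following is a minimal sketch using `openssl x509 -checkend`; `cert_expires_within` is a hypothetical helper name, and the 14-day threshold in the usage note is an assumption, not a StemeDB requirement.

```shell
# Succeed if the certificate at $1 expires within $2 days.
# An unreadable or missing file also counts as "expiring", which errs toward alerting.
# cert_expires_within is a hypothetical helper name.
cert_expires_within() {
    cert_file=$1
    days=$2
    # -checkend N exits 0 when the certificate is still valid N seconds from now,
    # so the result is inverted to answer "does it expire within N seconds?"
    ! openssl x509 -in "$cert_file" -noout -checkend $((days * 86400)) >/dev/null 2>&1
}

# Usage: warn when fewer than 14 days remain.
#   if cert_expires_within /etc/letsencrypt/live/stemedb.example.com/fullchain.pem 14; then
#       echo "certificate expires soon; check certbot renew" >&2
#   fi
```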

Manual certificate (for testing):

# Generate self-signed (NOT for production)
sudo mkdir -p /etc/stemedb/tls
sudo openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /etc/stemedb/tls/key.pem \
  -out /etc/stemedb/tls/cert.pem \
  -days 365 \
  -subj "/CN=stemedb.local"

# Set permissions
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

Nginx example:

server {
    listen 443 ssl http2;
    server_name stemedb.example.com;

    ssl_certificate /etc/letsencrypt/live/stemedb.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stemedb.example.com/privkey.pem;

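    ssl_protocols TLSv1.3;
    # ssl_ciphers only affects TLS 1.2 and earlier; TLS 1.3 suites use the OpenSSL defaults
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    location / {
        # stemedb_cluster must be defined as an upstream block elsewhere in the config
        proxy_pass http://stemedb_cluster;
    }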
}

See: Nginx Config for complete example.


DNS Configuration

Single-Node

Simple A record:

stemedb.example.com.  300  IN  A  10.0.1.50

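Failover: manual. If the server fails, repoint the A record at a healthy replacement; the 300-second TTL bounds how long clients keep resolving the old address.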

Three-Node Cluster

Option 1: Load balancer with CNAME

stemedb.example.com.     300  IN  CNAME  stemedb-lb.example.com.
stemedb-lb.example.com.  60   IN  A      10.0.1.10

node1.example.com.       300  IN  A      10.0.1.51
node2.example.com.       300  IN  A      10.0.1.52
node3.example.com.       300  IN  A      10.0.1.53

Option 2: Multiple A records (DNS round-robin)

stemedb.example.com.  60  IN  A  10.0.1.51
stemedb.example.com.  60  IN  A  10.0.1.52
stemedb.example.com.  60  IN  A  10.0.1.53

⚠️ Note: DNS round-robin does not detect failed nodes; clients keep receiving a dead node's address until its record is removed and the TTL expires. Use a load balancer instead.

Internal DNS (Private Network)

For cluster communication:

# Private hosted zone: cluster.local
node1.cluster.local.  300  IN  A  10.0.1.51
node2.cluster.local.  300  IN  A  10.0.1.52
node3.cluster.local.  300  IN  A  10.0.1.53

Latency Requirements

Single-Node

  • Client → Server: <100ms (typical internet)
  • No inter-node requirements

Three-Node Cluster

  • Client → Load Balancer: <100ms
  • Load Balancer → Node: <10ms (same region)
  • Node ↔ Node: <5ms (CRITICAL)

Why <5ms inter-node?

  • SWIM gossip requires fast responses
  • Replication lag increases with latency
  • Merkle sync performance degrades

Test latency:

# From node1 to node2
ping -c 100 node2.cluster.local

# Expected:
# rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms

# If avg >5ms → Nodes too far apart (different regions?)
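
The manual check above can be scripted so that it runs from monitoring or a pre-deployment test. This is a minimal sketch that parses the summary line printed by ping; the helper names `ping_avg_ms` and `latency_ok` are hypothetical, and an average of exactly 5 ms is treated as passing.

```shell
# Extract the average RTT in ms from ping's summary line on stdin.
# Handles the Linux ("rtt ...") and BSD/macOS ("round-trip ...") formats.
ping_avg_ms() {
    awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}

# Succeed when the average RTT is at or below the limit (default 5 ms).
# An empty value (ping failed or printed no summary) counts as a failure.
latency_ok() {
    avg=$1
    limit=${2:-5}
    [ -n "$avg" ] || return 1
    awk -v a="$avg" -v l="$limit" 'BEGIN { exit !(a + 0 <= l + 0) }'
}

# Usage:
#   avg=$(ping -c 100 node2.cluster.local | ping_avg_ms)
#   latency_ok "$avg" || echo "inter-node latency too high: ${avg:-unknown} ms" >&2
```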

Deployment recommendations:

  • Same availability zone: <1ms typical
  • ⚠️ Same region, different AZs: 1-5ms (acceptable)
  • Different regions: >10ms (not supported)

Bandwidth Requirements

Single-Node

  • Ingest: ~1 KB per assertion → 100 assertions/sec = 100 KB/sec = 0.8 Mbps
  • Queries: ~5 KB per query → 100 queries/sec = 500 KB/sec = 4 Mbps
  • Total: ~5 Mbps typical, 10 Mbps recommended
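
The same arithmetic can be reused for other workloads. A small sketch, assuming 1 KB = 1000 bytes and 1 Mbps = 10^6 bits/s as in the figures above; `mbps` is a hypothetical helper name.

```shell
# Convert a request rate and a payload size into megabits per second.
# Arguments: requests per second, payload size in KB.
mbps() {
    per_sec=$1
    kb_each=$2
    awk -v r="$per_sec" -v k="$kb_each" 'BEGIN { printf "%.1f\n", r * k * 8 / 1000 }'
}

mbps 100 1   # ingest: 100 assertions/sec at ~1 KB each, prints 0.8
mbps 100 5   # queries: 100 queries/sec at ~5 KB each, prints 4.0
```

The two results sum to 4.8 Mbps, which is where the ~5 Mbps typical figure comes from.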

Three-Node Cluster

Per node:

  • Client traffic: Same as single-node (~5 Mbps)
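  • Replication traffic: ~1 MB per 1,000 assertions (≈1 KB each); provision a 1 Gbps private link for high-throughput ingest, which at that size saturates at roughly 125,000 assertions/sec before protocol overhead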

Total cluster:

  • Client traffic: 15 Mbps (3× single-node)
  • Replication traffic: ~10 Mbps typical, 100 Mbps burst

Recommended:

  • Public bandwidth: 100 Mbps per node
  • Private bandwidth: 1 Gbps per node (10 Gbps for production)

Load Balancer Configuration

Health Checks

HTTP health check configuration:

Endpoint: /v1/health
Method: GET
Interval: 5 seconds
Timeout: 3 seconds
Healthy threshold: 2
Unhealthy threshold: 3

Expected response:

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 12345
}

Mark unhealthy if:

  • HTTP status != 200
  • Response time >3 seconds
  • status field != "healthy"
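
For load balancers that run an external check script, or for ad hoc verification, the three conditions above can be expressed as follows. This is a sketch rather than the check StemeDB ships: `is_healthy` and `probe_node` are hypothetical names, curl is assumed to be installed, and the body is matched with grep instead of a JSON parser.

```shell
# Decide health from an HTTP status code and a response body, per the rules above.
is_healthy() {
    http_status=$1
    body=$2
    [ "$http_status" = "200" ] || return 1
    printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Probe one node; the 3-second limit mirrors the configured timeout.
# Argument: base URL of the node, e.g. http://10.0.1.51:18180
probe_node() {
    url=$1
    # -w appends the status code on its own line after the body
    resp=$(curl -s --max-time 3 -w '\n%{http_code}' "$url/v1/health") || return 1
    http_status=$(printf '%s\n' "$resp" | tail -n 1)
    body=$(printf '%s\n' "$resp" | sed '$d')
    is_healthy "$http_status" "$body"
}

# Usage:
#   probe_node http://10.0.1.51:18180 && echo healthy || echo unhealthy
```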

Load Balancing Algorithm

Recommended: Round-robin

  • Simple
  • Evenly distributes load
  • No sticky sessions needed (CRDTs handle conflicts)

Not recommended: Least connections

  • Can cause hotspots
  • Unnecessary complexity

Session Affinity

Not required - StemeDB uses CRDTs, so queries can hit any node


Security Considerations

Admin Endpoints

⚠️ CRITICAL: Admin endpoints have NO authentication in Pilot 5

Endpoints to restrict:

  • /v1/admin/quarantine - Manage quarantine queue
  • /v1/admin/circuit_breakers - Ban/unban agents
  • /v1/admin/indexes/rebuild - Trigger index rebuild
  • /v1/admin/compact - Trigger compaction

Restriction methods:

Option 1: Firewall (recommended)

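# Block /v1/admin/ requests on the API port (this rule applies to all sources).
# iptables example (best effort: string matching only sees plaintext HTTP
# and can be evaded, so prefer the nginx rule when a reverse proxy is in place):
sudo iptables -A INPUT -p tcp --dport 18180 -m string --string "/v1/admin/" --algo bm -j DROP

# Or in nginx (deny all already answers with 403):
location /v1/admin/ {
    deny all;
}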

Option 2: VPN-only access

  • Require VPN connection to reach port 18181 (cluster gateway)
  • Use /v1/admin/ endpoints via cluster gateway only

Option 3: IP allowlist

# Nginx example
location /v1/admin/ {
    allow 10.0.0.0/8;  # Internal network
    deny all;
}
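
Whichever option is chosen, it is worth verifying from a host outside the internal network that the admin endpoints are unreachable. A minimal sketch, assuming curl is installed; `admin_endpoints_blocked` is a hypothetical helper, it sends plain GET requests, and it treats only 403 or no response at all (curl reports `000`) as blocked.

```shell
# Check that each admin endpoint listed above is rejected.
# Prints one line per endpoint and returns non-zero if any is reachable.
# Argument: the base URL that external clients use.
admin_endpoints_blocked() {
    base_url=$1
    exposed=0
    for path in /v1/admin/quarantine /v1/admin/circuit_breakers \
                /v1/admin/indexes/rebuild /v1/admin/compact; do
        # curl prints 000 when the connection fails or times out
        code=$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$base_url$path") || true
        case $code in
            403|000) echo "blocked  $path ($code)" ;;
            *)       echo "EXPOSED  $path ($code)"; exposed=1 ;;
        esac
    done
    return "$exposed"
}

# Usage, from outside the internal network:
#   admin_endpoints_blocked https://stemedb.example.com || echo "admin endpoints reachable" >&2
```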

Metrics Endpoint

The /metrics endpoint on the API server (port 18180) exposes sensitive information:

  • Assertion counts
  • Query patterns
  • Agent IDs
  • Performance data

Restriction:

# Allow only from monitoring systems
location /metrics {
    allow 10.0.1.100;  # Prometheus server
    deny all;
}

Network Topology Examples

Single-Node with Reverse Proxy

Internet
    │
    ▼
[Nginx/Envoy]  (TLS termination, port 443)
    │
    ▼
[StemeDB API]  (port 18180, HTTP)
    │
    ▼
[Data]  (/data/wal, /data/db)

Three-Node Cluster

Internet
    │
    ▼
[Load Balancer]  (TLS, port 443)
    │
    ├─────────┬─────────┐
    ▼         ▼         ▼
[Node 1]  [Node 2]  [Node 3]  (port 18180, HTTP)
    │         │         │
    └─────────┴─────────┘  (ports 18181-18183, inter-node traffic)

See: diagrams/network-topology.txt for ASCII diagram.


Troubleshooting

Connection Refused

Symptom: curl: (7) Failed to connect to localhost port 18180: Connection refused

Diagnosis:

# Check if port is listening
sudo lsof -i :18180
# Should show: stemedb-api

# Check firewall
sudo iptables -L -n | grep 18180

# Check service status
sudo systemctl status stemedb-api

Resolution: See Server Won't Start Runbook

High Latency Between Nodes

Symptom: replication_lag_seconds >5

Diagnosis:

# Test inter-node latency
ping -c 100 node2
# If avg >5ms → Network issue

# Check bandwidth
iperf3 -c node2
# Should show >100 Mbps

Resolution: See High Query Latency Runbook

SWIM Gossip Not Working

Symptom: Nodes not discovering each other

Diagnosis:

# Check UDP port 18183
sudo tcpdump -i eth0 udp port 18183
# Should show periodic SWIM messages

# Check firewall (UDP!)
sudo iptables -L -n | grep 18183

Resolution: Open UDP port 18183 between cluster nodes



Last Updated: 2026-02-11