
# Network Requirements

**Network configuration for StemeDB deployments**

---

## Port Scheme (181XX)

StemeDB uses ports in the `181XX` range for all services:

| Port | Protocol | Service | Purpose | Expose To |
|------|----------|---------|---------|-----------|
| **18180** | TCP/HTTP | API Server | Queries, ingest, metrics | Clients (via reverse proxy) |
| **18181** | TCP/HTTP | Cluster Gateway | Cluster coordination, admin endpoints | Internal network only |
| **18182** | TCP/gRPC | Cluster RPC | Assertion replication | Cluster nodes only |
| **18183** | UDP | SWIM Gossip | Membership, failure detection | Cluster nodes only |
| 18184 | TCP/HTTP | (Reserved for future metrics) | - | - |
| 18185 | TCP/HTTP | (Reserved for future admin) | - | - |
| 18186-18189 | - | (Reserved for applications) | - | - |
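
As a quick sanity check of the table above, the sketch below lists the listeners each deployment mode should have and reports any that are missing. The function names are hypothetical, and it assumes the column layout of `ss -Hlntu` (protocol in column 1, `address:port` in column 5); adjust if your `ss` prints differently.

```bash
# Expected listeners per deployment mode, taken from the port table.
stemedb_expected_ports() {
    case "$1" in
        single)  echo "tcp:18180" ;;
        cluster) echo "tcp:18180 tcp:18181 tcp:18182 udp:18183" ;;
        *)       echo "usage: stemedb_expected_ports single|cluster" >&2; return 2 ;;
    esac
}

# Reads `ss -Hlntu` output on stdin and prints each expected listener
# that does not appear in it. Prints nothing when all are present.
stemedb_missing_ports() {
    expected=$(stemedb_expected_ports "$1") || return 2
    awk -v expected="$expected" '
        {
            # The port is the last colon-separated piece of column 5,
            # which also handles IPv6 forms such as [::]:18180.
            n = split($5, parts, ":")
            seen[$1 ":" parts[n]] = 1
        }
        END {
            count = split(expected, want, " ")
            for (i = 1; i <= count; i++)
                if (!(want[i] in seen)) print want[i]
        }'
}
```

On a cluster node, `ss -Hlntu | stemedb_missing_ports cluster` should print nothing once all four services are up.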

---

## Firewall Rules

### Single-Node Deployment

**Allow inbound:**
- Port 18180 from load balancer/reverse proxy (or internal network)
- Port 22 (SSH) from bastion host

**Block:**
- Port 18180 from public internet (use reverse proxy)
- Ports 18181-18183 (not used in single-node)

**AWS Security Group:**
```bash
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```

**iptables:**
```bash
# Allow API from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

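# Save rules (tee runs as root; a plain `>` redirect would run as the calling user and fail)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null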
```

---

### Three-Node Cluster

**Allow inbound:**
- Port 18180 from load balancer (API traffic)
- Ports 18181-18183 from cluster nodes (inter-node)
- Port 22 (SSH) from bastion host

**Block:**
- Ports 18180-18183 from public internet
- Port 18181 from outside internal network (admin endpoint security)

**AWS Security Group:**
```bash
# Allow API from load balancer
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-load-balancer \
  --protocol tcp \
  --port 18180

# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol tcp \
  --port 18181-18182

# Allow SWIM gossip (UDP)
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-stemedb \
  --protocol udp \
  --port 18183

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-stemedb \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```

**iptables (on each node):**
```bash
# Allow API from load balancer
sudo iptables -A INPUT -p tcp -s 10.0.1.10 --dport 18180 -j ACCEPT

# Allow cluster traffic from other nodes
sudo iptables -A INPUT -p tcp -s 10.0.1.51 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.52 --dport 18181:18182 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 10.0.1.53 --dport 18181:18182 -j ACCEPT

# Allow SWIM gossip
sudo iptables -A INPUT -p udp -s 10.0.1.0/24 --dport 18183 -j ACCEPT

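# Drop all other traffic to StemeDB ports (TCP and the UDP gossip port)
sudo iptables -A INPUT -p tcp --dport 18180:18189 -j DROP
sudo iptables -A INPUT -p udp --dport 18183 -j DROP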
```
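
Hand-writing one rule per peer gets error-prone as the cluster grows. The sketch below prints (it does not run) the same rule layout for any number of peers. The function name and argument order are hypothetical. The final UDP drop is included so the gossip port is closed to hosts outside the given CIDR.

```bash
# Usage: stemedb_cluster_rules LB_IP GOSSIP_CIDR PEER_IP...
# Prints iptables commands only. Review the output before running it.
stemedb_cluster_rules() {
    if [ "$#" -lt 3 ]; then
        echo "usage: stemedb_cluster_rules LB_IP GOSSIP_CIDR PEER_IP..." >&2
        return 2
    fi
    lb_ip=$1
    gossip_cidr=$2
    shift 2

    # API traffic from the load balancer
    echo "iptables -A INPUT -p tcp -s $lb_ip --dport 18180 -j ACCEPT"

    # Gateway and RPC traffic from each peer
    for peer in "$@"; do
        echo "iptables -A INPUT -p tcp -s $peer --dport 18181:18182 -j ACCEPT"
    done

    # SWIM gossip from the cluster subnet
    echo "iptables -A INPUT -p udp -s $gossip_cidr --dport 18183 -j ACCEPT"

    # Everything else aimed at StemeDB ports is dropped
    echo "iptables -A INPUT -p tcp --dport 18180:18189 -j DROP"
    echo "iptables -A INPUT -p udp --dport 18183 -j DROP"
}
```

For the example addresses used in this section: `stemedb_cluster_rules 10.0.1.10 10.0.1.0/24 10.0.1.51 10.0.1.52 10.0.1.53`. Once the output looks right, it can be piped to `sudo sh`.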

---

## TLS Configuration

### Requirements

- **Minimum TLS version:** 1.3
- **Certificate validity:** 90 days or less (automate renewal)
- **Key algorithm:** RSA 2048-bit or ECDSA P-256
- **Termination:** At reverse proxy (recommended) or at StemeDB API

### Let's Encrypt Automation

**Certbot with nginx:**
```bash
# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain certificate
sudo certbot --nginx -d stemedb.example.com

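# Auto-renewal (root crontab): open the editor, then add the entry shown below.
# --deploy-hook reloads nginx only when a certificate was actually renewed.
sudo crontab -e
# 0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"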
```

**Manual certificate (for testing):**
```bash
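# Generate self-signed (NOT for production)
# Writing under /etc requires root, so create the directory and run openssl with sudo
sudo mkdir -p /etc/stemedb/tls
sudo openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /etc/stemedb/tls/key.pem \
  -out /etc/stemedb/tls/cert.pem \
  -days 365 \
  -subj "/CN=stemedb.local"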

# Set permissions
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem
```

### TLS at Reverse Proxy (Recommended)

**Nginx example:**
```nginx
server {
    listen 443 ssl http2;
    server_name stemedb.example.com;

    ssl_certificate /etc/letsencrypt/live/stemedb.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stemedb.example.com/privkey.pem;

    ssl_protocols TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    location / {
        # stemedb_cluster must be defined in an upstream block (see the full config linked below)
        proxy_pass http://stemedb_cluster;
    }
}
```

**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) for complete example.

---

## DNS Configuration

### Single-Node

**Simple A record:**
```
stemedb.example.com.  300  IN  A  10.0.1.50
```

**Health check:** None at the DNS level. Failover is manual: repoint the A record to a healthy server.

### Three-Node Cluster

**Option 1: Load balancer with CNAME**
```
stemedb.example.com.     300  IN  CNAME  stemedb-lb.example.com.
stemedb-lb.example.com.  60   IN  A      10.0.1.10

node1.example.com.       300  IN  A      10.0.1.51
node2.example.com.       300  IN  A      10.0.1.52
node3.example.com.       300  IN  A      10.0.1.53
```

**Option 2: Multiple A records (DNS round-robin)**
```
stemedb.example.com.  60  IN  A  10.0.1.51
stemedb.example.com.  60  IN  A  10.0.1.52
stemedb.example.com.  60  IN  A  10.0.1.53
```

⚠️ **Note:** DNS round-robin doesn't detect failed nodes. Use a load balancer instead.

### Internal DNS (Private Network)

**For cluster communication:**
```
# Private hosted zone: cluster.local
node1.cluster.local.  300  IN  A  10.0.1.51
node2.cluster.local.  300  IN  A  10.0.1.52
node3.cluster.local.  300  IN  A  10.0.1.53
```

---

## Latency Requirements

### Single-Node

- **Client → Server:** <100ms (typical internet)
- **No inter-node requirements**

### Three-Node Cluster

- **Client → Load Balancer:** <100ms
- **Load Balancer → Node:** <10ms (same region)
- **Node ↔ Node:** **<5ms (CRITICAL)**

**Why <5ms inter-node?**
- SWIM gossip requires fast responses
- Replication lag increases with latency
- Merkle sync performance degrades

**Test latency:**
```bash
# From node1 to node2
ping -c 100 node2.cluster.local

# Expected:
# rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms

# If avg >5ms → Nodes too far apart (different regions?)
```
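
To make the check above scriptable, the sketch below reads `ping` output, extracts the average RTT from the summary line, and compares it with a threshold. `check_rtt` is a hypothetical helper name. It assumes the `min/avg/max` summary format shown in the example above.

```bash
# Usage: ping -c 100 node2.cluster.local | check_rtt [LIMIT_MS]
# Exit status: 0 = under the limit, 1 = too slow, 2 = no summary line found.
check_rtt() {
    awk -v limit="${1:-5}" '
        /min\/avg\/max/ {
            # Summary looks like: rtt min/avg/max/mdev = 0.5/1.2/3.5/0.8 ms
            split($0, halves, "=")
            split(halves[2], rtt, "/")
            avg = rtt[2] + 0
            found = 1
        }
        END {
            if (!found) { print "no ping summary found"; exit 2 }
            if (avg < limit + 0) { printf "ok avg=%sms\n", avg; exit 0 }
            printf "too-slow avg=%sms\n", avg
            exit 1
        }'
}
```

The non-zero exit status on a slow link makes it easy to call from a pre-deployment check or a cron job.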

**Deployment recommendations:**
- ✅ Same availability zone: <1ms typical
- ⚠️ Same region, different AZs: 1-5ms (acceptable)
- ❌ Different regions: >10ms (not supported)

---

## Bandwidth Requirements

### Single-Node

- **Ingest:** ~1 KB per assertion → 100 assertions/sec = 100 KB/sec = 0.8 Mbps
- **Queries:** ~5 KB per query → 100 queries/sec = 500 KB/sec = 4 Mbps
- **Total:** ~5 Mbps typical, 10 Mbps recommended
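
The arithmetic above can be redone for other workloads with a small helper. This is a sketch with a hypothetical function name. It uses decimal units (1 KB = 1000 bytes, 1 Mbps = 1,000,000 bits/sec), which is what the figures above imply.

```bash
# Usage: required_mbps BYTES_PER_ITEM ITEMS_PER_SEC
# Prints the sustained bandwidth in Mbps, to one decimal place.
required_mbps() {
    awk -v bytes="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", bytes * rate * 8 / 1000000 }'
}

# required_mbps 1000 100   # ingest:  1 KB x 100/sec
# required_mbps 5000 100   # queries: 5 KB x 100/sec
```

Summing the two results gives the combined figure; leave headroom on top for bursts.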

### Three-Node Cluster

**Per node:**
- **Client traffic:** Same as single-node (~5 Mbps)
- **Replication traffic:** ~1 MB per 1K assertions (about 1 KB each); provision 1 Gbps for high-throughput ingest

**Total cluster:**
- **Client traffic:** 15 Mbps (3× single-node)
- **Replication traffic:** ~10 Mbps typical, 100 Mbps burst

**Recommended:**
- **Public bandwidth:** 100 Mbps per node
- **Private bandwidth:** 1 Gbps per node (10 Gbps for production)

---

## Load Balancer Configuration

### Health Checks

**HTTP health check configuration:**
```
Endpoint: /v1/health
Method: GET
Interval: 5 seconds
Timeout: 3 seconds
Healthy threshold: 2
Unhealthy threshold: 3
```
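
The two thresholds interact with the interval: a node is only marked unhealthy after three consecutive failed probes, and only returns after two consecutive passes. The sketch below replays a sequence of probe results to show that behavior. It is an illustration of how load balancers commonly apply such thresholds, not a description of any specific product, and `replay_health` is a hypothetical name.

```bash
# Usage: replay_health pass|fail ...
# Starts in the healthy state and prints the state after each probe.
replay_health() {
    state=healthy
    passes=0
    fails=0
    for result in "$@"; do
        if [ "$result" = pass ]; then
            passes=$((passes + 1))
            fails=0
            # Healthy threshold: 2 consecutive passes
            if [ "$passes" -ge 2 ]; then state=healthy; fi
        else
            fails=$((fails + 1))
            passes=0
            # Unhealthy threshold: 3 consecutive failures
            if [ "$fails" -ge 3 ]; then state=unhealthy; fi
        fi
        echo "$result -> $state"
    done
}
```

With a 5-second interval, three consecutive failures mean a dead node keeps receiving traffic for roughly 10 to 15 seconds before it is removed, depending on when the failure falls relative to the probes.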

**Expected response:**
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 12345
}
```

**Mark unhealthy if:**
- HTTP status != 200
- Response time >3 seconds
- `status` field != "healthy"
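
For load balancers that support scripted checks, or for ad-hoc testing, the three conditions above can be sketched as a single function. `probe_is_healthy` is a hypothetical name, and the JSON check is a deliberately crude pattern match rather than a real parser.

```bash
# Usage: probe_is_healthy HTTP_STATUS ELAPSED_SECONDS BODY
# Returns 0 only if all three conditions hold.
probe_is_healthy() {
    status_code=$1
    elapsed=$2
    body=$3

    # Condition 1: HTTP status must be 200
    [ "$status_code" = "200" ] || return 1

    # Condition 2: response time must not exceed 3 seconds
    awk -v t="$elapsed" 'BEGIN { exit !(t <= 3) }' || return 1

    # Condition 3: body must contain "status": "healthy"
    printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}
```

The first two arguments can come from curl's `-w '%{http_code} %{time_total}'` output when probing `/v1/health` by hand.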

### Load Balancing Algorithm

**Recommended:** Round-robin

- Simple
- Evenly distributes load
- No sticky sessions needed (CRDTs handle conflicts)

**Not recommended:** Least connections

- Can cause hotspots
- Unnecessary complexity

### Session Affinity

**Not required** - StemeDB uses CRDTs, so queries can hit any node

---

## Security Considerations

### Admin Endpoints

⚠️ **CRITICAL:** Admin endpoints have NO authentication in Pilot 5

**Endpoints to restrict:**
- `/v1/admin/quarantine` - Manage quarantine queue
- `/v1/admin/circuit_breakers` - Ban/unban agents
- `/v1/admin/indexes/rebuild` - Trigger index rebuild
- `/v1/admin/compact` - Trigger compaction

**Restriction methods:**

**Option 1: Firewall (recommended)**
```bash
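# Block /v1/admin/ from public
# iptables example (matches plaintext HTTP only; it cannot see inside TLS,
# and dropped requests time out rather than fail fast):
sudo iptables -A INPUT -p tcp --dport 18180 -m string --string "/v1/admin/" --algo bm -j DROP

# Or in nginx (add to the server block; this is nginx config, not shell):
location /v1/admin/ {
    deny all;
    return 403;
}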
```

**Option 2: VPN-only access**
- Require VPN connection to reach port 18181 (cluster gateway)
- Use `/v1/admin/` endpoints via cluster gateway only

**Option 3: IP allowlist**
```nginx
# Nginx example
location /v1/admin/ {
    allow 10.0.0.0/8;  # Internal network
    deny all;
}
```
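
Whichever option is used, it is worth verifying the restriction from a host that should not have admin access. The sketch below probes each admin path listed above and reports any that respond. The function names are hypothetical. It treats HTTP 403 and curl's `000` (no HTTP response, for example a dropped or refused connection) as blocked, which is an assumption about how your proxy or firewall rejects requests.

```bash
# The admin paths listed in this section, one per line.
admin_paths() {
    printf '%s\n' /v1/admin/quarantine /v1/admin/circuit_breakers \
                  /v1/admin/indexes/rebuild /v1/admin/compact
}

# Returns 0 when a status code indicates the request was refused.
is_blocked_status() {
    case "$1" in
        403|000) return 0 ;;
        *)       return 1 ;;
    esac
}

# Usage: check_admin_blocked https://stemedb.example.com
# Returns 1 if any admin path is reachable.
check_admin_blocked() {
    base_url=$1
    exposed=0
    for path in $(admin_paths); do
        # -m 5 caps each probe at 5 seconds so dropped packets do not hang
        code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$base_url$path") || code=000
        if ! is_blocked_status "$code"; then
            echo "EXPOSED: $path returned HTTP $code"
            exposed=1
        fi
    done
    return "$exposed"
}
```

Run it from outside the internal network after every proxy or firewall change.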

### Metrics Endpoint

**`/metrics` endpoint exposes sensitive information:**
- Assertion counts
- Query patterns
- Agent IDs
- Performance data

**Restriction:**
```nginx
# Allow only from monitoring systems
location /metrics {
    allow 10.0.1.100;  # Prometheus server
    deny all;
}
```

---

## Network Topology Examples

### Single-Node with Reverse Proxy

```
Internet
    │
    ▼
[Nginx/Envoy]  (TLS termination, port 443)
    │
    ▼
[StemeDB API]  (port 18180, HTTP)
    │
    ▼
[Data]  (/data/wal, /data/db)
```

### Three-Node Cluster

```
Internet
    │
    ▼
[Load Balancer]  (TLS, port 443)
    │
    ├─────────┬─────────┐
    ▼         ▼         ▼
[Node 1]  [Node 2]  [Node 3]  (port 18180, HTTP)
    │         │         │
    └─────────┴─────────┘  (ports 18181-18183, inter-node)
```

**See:** [diagrams/network-topology.txt](./diagrams/network-topology.txt) for ASCII diagram.

---

## Troubleshooting

### Connection Refused

**Symptom:** `curl: (7) Failed to connect to localhost port 18180: Connection refused`

**Diagnosis:**
```bash
# Check if port is listening
sudo lsof -i :18180
# Should show: stemedb-api

# Check firewall
sudo iptables -L -n | grep 18180

# Check service status
sudo systemctl status stemedb-api
```

**Resolution:** See [Server Won't Start Runbook](../../runbooks/server-wont-start.md)

### High Latency Between Nodes

**Symptom:** `replication_lag_seconds` >5

**Diagnosis:**
```bash
# Test inter-node latency
ping -c 100 node2
# If avg >5ms → Network issue

# Check bandwidth
iperf3 -c node2
# Should show >100 Mbps
```

**Resolution:** See [High Query Latency Runbook](../../runbooks/high-query-latency.md#1-replication-lag)

### SWIM Gossip Not Working

**Symptom:** Nodes not discovering each other

**Diagnosis:**
```bash
# Check UDP port 18183
sudo tcpdump -i eth0 udp port 18183
# Should show periodic SWIM messages

# Check firewall (UDP!)
sudo iptables -L -n | grep 18183
```

**Resolution:** Open UDP port 18183 between cluster nodes

---

## Related Documentation

- [Single-Node Architecture](./single-node-pilot.md) - Network for single-node
- [Three-Node Cluster](./three-node-cluster.md) - Network for cluster
- [Deployment Examples](../../deployment/) - Nginx and Envoy configs
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster network setup

---

**Last Updated:** 2026-02-11