stemedb/docs/operations/reference-architecture/diagrams/network-topology.txt
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

# Network Topology Diagram
## Port Scheme Overview
```
┌────────────────────────────────────────────────────────────────┐
│ StemeDB Port Allocation (1818X) │
├────────┬──────────┬─────────────────────┬──────────────────────┤
│ Port │ Protocol │ Service │ Purpose │
├────────┼──────────┼─────────────────────┼──────────────────────┤
│ 18180 │ TCP/HTTP │ API Server │ Queries, ingest │
│ 18181 │ TCP/HTTP │ Cluster Gateway │ Coordination │
│ 18182 │ TCP/gRPC │ Cluster RPC │ Replication │
│ 18183 │ UDP │ SWIM Gossip │ Membership │
│ 18184 │ - │ (Reserved) │ Future metrics │
│ 18185 │ - │ (Reserved) │ Future admin │
│ 18186 │ TCP/HTTP │ Latent Signal │ AE detection │
│ 18187 │ TCP/HTTP │ Community App │ Community corpus │
│ 18188 │ TCP/HTTP │ StemeDB Dashboard │ Web UI │
│ 18189 │ TCP/HTTP │ Aphoria Dashboard │ Aphoria UI │
└────────┴──────────┴─────────────────────┴──────────────────────┘
```
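The allocation above can be kept as data so that firewall rules and health checks derive from one table. A minimal Python sketch follows; the `PortSpec` type and the `ports_for` helper are illustrative names, not part of StemeDB.

```python
from typing import Dict, List, NamedTuple, Optional


class PortSpec(NamedTuple):
    protocol: Optional[str]  # "tcp", "udp", or None for reserved ports
    service: str
    purpose: str


# Mirrors the port table above, one entry per port in 18180-18189.
PORTS: Dict[int, PortSpec] = {
    18180: PortSpec("tcp", "API Server", "Queries, ingest"),
    18181: PortSpec("tcp", "Cluster Gateway", "Coordination"),
    18182: PortSpec("tcp", "Cluster RPC", "Replication"),
    18183: PortSpec("udp", "SWIM Gossip", "Membership"),
    18184: PortSpec(None, "(Reserved)", "Future metrics"),
    18185: PortSpec(None, "(Reserved)", "Future admin"),
    18186: PortSpec("tcp", "Latent Signal", "AE detection"),
    18187: PortSpec("tcp", "Community App", "Community corpus"),
    18188: PortSpec("tcp", "StemeDB Dashboard", "Web UI"),
    18189: PortSpec("tcp", "Aphoria Dashboard", "Aphoria UI"),
}


def ports_for(protocol: Optional[str]) -> List[int]:
    """Return the sorted ports that use the given transport protocol."""
    return sorted(p for p, spec in PORTS.items() if spec.protocol == protocol)
```

For example, `ports_for("udp")` yields only the SWIM gossip port, and `ports_for(None)` yields the two reserved ports.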
## Single-Node Network Topology
```
┌─────────────────────────────────────────────────────────────────┐
│ Internet │
│ │ │
│ │ HTTPS (443) │
│ ▼ │
│ ┌───────────────┐ │
│ │ Reverse Proxy │ │
│ │ (Nginx/Envoy) │ │
│ │ • TLS term │ │
│ │ • Rate limit │ │
│ └───────┬───────┘ │
│ │ │
│ │ HTTP (18180) │
└────────────────────────────┼─────────────────────────────────────┘
┌──────────────────┼──────────────────┐
│ Internal Network (10.0.0.0/8) │
│ ▼ │
│ ┌─────────────────┐ │
│ │ StemeDB Node │ │
│ │ 10.0.1.50 │ │
│ │ │ │
│ │ :18180 (API) │◀────────┼─── Clients (internal)
│ │ :18188 (Dash) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Prometheus │ │
│ │ 10.0.1.100 │ │
│ │ Scrapes :18180 │ │
│ └─────────────────┘ │
└─────────────────────────────────────┘
Security Zones:
- Public: Internet → Reverse Proxy (443)
- DMZ: Reverse Proxy → StemeDB (18180)
- Internal: Prometheus → StemeDB (18180/metrics)
```
## Three-Node Cluster Network Topology
```
┌──────────────────────────────────────────────────────────────────┐
│ Internet │
│ │ │
│ │ HTTPS (443) │
│ ▼ │
│ ┌───────────────┐ │
│ │ Load Balancer │ │
│ │ (ALB/ELB) │ │
│ │ • TLS term │ │
│ │ • Health chks │ │
│ └───────┬───────┘ │
│ │ │
│ │ HTTP (18180) │
└─────────────────────────────┼──────────────────────────────────────┘
┌───────────────┴───────────────┐
│ │
┌─────────────┼───────────────────────────────┼──────────────────┐
│ Private Network (10.0.1.0/24) │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Node 1 │ │ Node 2 │ │
│ │ 10.0.1.51 │ │ 10.0.1.52 │ │
│ │ │ │ │ │
│ │ :18180 (API) │ │ :18180 (API) │ │
│ │ :18181 (Gate) │ │ :18181 (Gate) │ │
│ │ :18182 (RPC)────┼────────────┼────:18182 (RPC) │ │
│ │ :18183 (SWIM)···┼···········UDP···:18183 (SWIM)│ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Node 3 │ │ │
│ │ │ 10.0.1.53 │ │ │
│ │ │ │ │ │
│ │ │ :18180 (API) │ │ │
│ │ │ :18181 (Gate) │ │ │
│ └─────────┼────:18182 (RPC) │──┘ │
│ ···UDP···:18183 (SWIM)│ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Prometheus │ │
│ │ 10.0.1.100 │ │
│ │ Scrapes all 3 │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Security Zones:
- Public: Internet → Load Balancer (443)
- DMZ: Load Balancer → Nodes (18180)
- Cluster: Node ↔ Node (18181-18183)
- Internal: Prometheus → Nodes (18180/metrics)
Firewall Rules:
- Allow 18180 from Load Balancer to all nodes
- Allow 18181-18183 within cluster (node ↔ node)
- Allow 18180/metrics from Prometheus only
- Block 18181-18183 from outside the cluster (gateway/admin, RPC, gossip)
```
## Inter-Node Communication Detail
```
Node 1 (10.0.1.51) Node 2 (10.0.1.52)
Port 18182 (TCP/gRPC)
├─────────────────────────────────────▶ :18182
│ Push Replication (receive assertions)
│ • Assertion payload
│ • BLAKE3 hash
│ • Signature
◀─────────────────────────────────────┤
ACK (received) │
Port 18183 (UDP)
├───────────────────────────────────▶ :18183
│ SWIM Gossip (every 1s) (membership)
│ • Ping: "Are you alive?"
│ • Membership: "Node 3 is UP"
◀───────────────────────────────────┤
Ack: "I'm alive" │
Membership: "Node 1 is UP" │
Port 18181 (TCP/HTTP)
├─────────────────────────────────────▶ :18181
│ Merkle Sync (periodic) (compare trees)
│ GET /cluster/merkle
│ • Root hash: ABC123
◀─────────────────────────────────────┤
Merkle tree response │
• Root hash: ABC123 (same!) │
• No sync needed │
```
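The Merkle sync exchange above reduces to comparing root hashes: equal roots mean no sync is needed. The sketch below illustrates that comparison in Python. It is an assumption-laden illustration, not StemeDB's tree format: the leaf ordering, the pairing scheme, and the function names are hypothetical, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib
from typing import Iterable, List


def _h(prefix: bytes, data: bytes) -> bytes:
    # blake2b is a stand-in here; the diagram above names BLAKE3.
    return hashlib.blake2b(prefix + data, digest_size=32).digest()


def merkle_root(assertions: Iterable[bytes]) -> bytes:
    """Compute a root hash over a set of assertion payloads.

    Leaves are sorted first so that two nodes holding the same set
    produce the same root regardless of arrival order (an assumption
    made for this sketch).
    """
    level: List[bytes] = [_h(b"\x00", a) for a in sorted(assertions)]
    if not level:
        return _h(b"\x00", b"")
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(b"\x01", level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def needs_sync(local_assertions: Iterable[bytes], remote_root: bytes) -> bool:
    """True when the peer's root differs from the locally computed root."""
    return merkle_root(local_assertions) != remote_root
```

With this layout, a node that receives the same root it computed locally can skip the transfer entirely, which matches the "No sync needed" branch in the diagram.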
## Firewall Configuration (iptables)
```
# On each cluster node:
# Allow API from load balancer
-A INPUT -s 10.0.1.10 -p tcp --dport 18180 -j ACCEPT
# Allow cluster gateway and RPC (TCP 18181-18182) from cluster nodes
-A INPUT -s 10.0.1.51 -p tcp --dport 18181:18182 -j ACCEPT
-A INPUT -s 10.0.1.52 -p tcp --dport 18181:18182 -j ACCEPT
-A INPUT -s 10.0.1.53 -p tcp --dport 18181:18182 -j ACCEPT
# Allow SWIM gossip (UDP) from other nodes
-A INPUT -s 10.0.1.51 -p udp --dport 18183 -j ACCEPT
-A INPUT -s 10.0.1.52 -p udp --dport 18183 -j ACCEPT
-A INPUT -s 10.0.1.53 -p udp --dport 18183 -j ACCEPT
# Allow metrics from Prometheus
-A INPUT -s 10.0.1.100 -p tcp --dport 18180 -j ACCEPT
# Allow SSH from bastion
-A INPUT -s 10.0.1.200 -p tcp --dport 22 -j ACCEPT
# Drop all other traffic to StemeDB ports
-A INPUT -p tcp --dport 18180:18189 -j DROP
-A INPUT -p udp --dport 18183 -j DROP
```
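Because the per-node rules repeat for every cluster member, they can be generated from the node list instead of maintained by hand. A minimal Python sketch, assuming the example addresses used above; the `node_rules` function is a hypothetical helper and emits the same rule syntax as the listing.

```python
from typing import List


def node_rules(lb_ip: str, node_ips: List[str],
               prometheus_ip: str, bastion_ip: str) -> List[str]:
    """Build the INPUT rules for one cluster node, ACCEPTs before DROPs."""
    rules = [f"-A INPUT -s {lb_ip} -p tcp --dport 18180 -j ACCEPT"]
    # Cluster gateway and RPC (TCP) from every cluster node.
    rules += [f"-A INPUT -s {ip} -p tcp --dport 18181:18182 -j ACCEPT"
              for ip in node_ips]
    # SWIM gossip (UDP) from every cluster node.
    rules += [f"-A INPUT -s {ip} -p udp --dport 18183 -j ACCEPT"
              for ip in node_ips]
    # Metrics scrape and SSH access.
    rules.append(f"-A INPUT -s {prometheus_ip} -p tcp --dport 18180 -j ACCEPT")
    rules.append(f"-A INPUT -s {bastion_ip} -p tcp --dport 22 -j ACCEPT")
    # Anything else aimed at the StemeDB port range is dropped.
    rules.append("-A INPUT -p tcp --dport 18180:18189 -j DROP")
    rules.append("-A INPUT -p udp --dport 18183 -j DROP")
    return rules


if __name__ == "__main__":
    for rule in node_rules("10.0.1.10",
                           ["10.0.1.51", "10.0.1.52", "10.0.1.53"],
                           "10.0.1.100", "10.0.1.200"):
        print(rule)
```

Order matters in the output: iptables evaluates rules top to bottom, so the DROP rules must stay last.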
## AWS Security Group Example
```
Security Group: sg-stemedb-cluster
Inbound Rules:
┌──────────┬──────────┬─────────────────┬─────────────────────────┐
│ Type │ Protocol │ Port Range │ Source │
├──────────┼──────────┼─────────────────┼─────────────────────────┤
│ HTTP │ TCP │ 18180 │ sg-load-balancer │
│ Custom │ TCP │ 18181-18182 │ sg-stemedb-cluster │
│ Custom │ UDP │ 18183 │ sg-stemedb-cluster │
│ SSH │ TCP │ 22 │ sg-bastion │
└──────────┴──────────┴─────────────────┴─────────────────────────┘
Outbound Rules:
┌──────────┬──────────┬─────────────────┬─────────────────────────┐
│ All │ All │ All │ 0.0.0.0/0 │
└──────────┴──────────┴─────────────────┴─────────────────────────┘
```
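The inbound rules above can also be expressed as data. The Python sketch below builds them in the `IpPermissions` list shape that the EC2 API accepts (for example through boto3's `authorize_security_group_ingress`). It only constructs the structure and sends nothing to AWS. The group identifiers are the placeholder names from the table, not real security group IDs, and `ingress_permissions` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def _rule(protocol: str, from_port: int, to_port: int,
          source_group: str) -> Dict[str, Any]:
    """One ingress rule whose source is another security group."""
    return {
        "IpProtocol": protocol,
        "FromPort": from_port,
        "ToPort": to_port,
        "UserIdGroupPairs": [{"GroupId": source_group}],
    }


def ingress_permissions(cluster_sg: str = "sg-stemedb-cluster",
                        lb_sg: str = "sg-load-balancer",
                        bastion_sg: str = "sg-bastion") -> List[Dict[str, Any]]:
    """Inbound rules for the cluster group, in the same order as the table."""
    return [
        _rule("tcp", 18180, 18180, lb_sg),       # API from the load balancer
        _rule("tcp", 18181, 18182, cluster_sg),  # gateway + RPC, node to node
        _rule("udp", 18183, 18183, cluster_sg),  # SWIM gossip, node to node
        _rule("tcp", 22, 22, bastion_sg),        # SSH from the bastion
    ]
```

Referencing the cluster group as its own source is what lets nodes reach each other without listing individual IP addresses.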
## Network Latency Requirements
```
Client → Load Balancer: <100ms (internet typical)
Load Balancer → Node: <10ms (same region)
├───────────────────────────────────────┐
▼ ▼
Node 1 ◀─────<5ms (CRITICAL)─────────▶ Node 2
▲ ▲
│ │
└───────────<5ms (CRITICAL)─────────────┘
Node 3
Why <5ms inter-node?
- SWIM gossip requires fast ping/ack
- Replication lag increases with latency
- Merkle sync performance degrades
Test: ping -c 100 node2 (should show avg <5ms)
```
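The ping test above can be automated by reading the average from ping's summary line and comparing it with the 5 ms budget. A minimal Python sketch, assuming the summary format printed by common Linux (`rtt min/avg/max/mdev`) and BSD/macOS (`round-trip min/avg/max/stddev`) ping implementations; the function names are illustrative.

```python
import re

# Matches the final summary line of ping output on Linux and BSD/macOS.
_SUMMARY = re.compile(
    r"(?:rtt|round-trip) min/avg/max/(?:mdev|stddev) = "
    r"([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def avg_rtt_ms(ping_output: str) -> float:
    """Extract the average round-trip time in milliseconds."""
    match = _SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no RTT summary line found in ping output")
    return float(match.group(2))


def within_budget(ping_output: str, budget_ms: float = 5.0) -> bool:
    """True when the average RTT is below the inter-node latency budget."""
    return avg_rtt_ms(ping_output) < budget_ms
```

The text passed in would typically be the captured output of `ping -c 100 node2`; how it is captured (for example with `subprocess.run`) is left to the caller.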
## Bandwidth Usage
```
┌─────────────────────────────────────────────────────────────┐
│ Bandwidth Breakdown │
├─────────────────┬───────────────────────────────────────────┤
│ Direction │ Usage (per node) │
├─────────────────┼───────────────────────────────────────────┤
│ Inbound (API) │ 100 assertions/sec × 1KB = 0.8 Mbps │
│ Outbound (API) │ 100 queries/sec × 5KB = 4 Mbps │
│ Replication │ 100 assertions/sec × 1KB × 2 = 1.6 Mbps │
│ SWIM Gossip │ ~10 KB/sec (negligible) │
├─────────────────┼───────────────────────────────────────────┤
│ Total │ ~7 Mbps per node │
│ Recommended │ 1 Gbps NIC (>100× headroom) │
└─────────────────┴───────────────────────────────────────────┘
```
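The figures in the table follow from rate × payload size × 8 bits. The Python sketch below reproduces the arithmetic, assuming 1 KB = 1000 bytes and the workload numbers shown in the table; the `mbps` helper is an illustrative name.

```python
def mbps(events_per_sec: float, kb_per_event: float, copies: int = 1) -> float:
    """Convert an event rate and payload size to megabits per second."""
    bits_per_sec = events_per_sec * kb_per_event * 1000 * 8 * copies
    return bits_per_sec / 1_000_000


inbound = mbps(100, 1)                # 100 assertions/sec x 1 KB
outbound = mbps(100, 5)               # 100 queries/sec x 5 KB
replication = mbps(100, 1, copies=2)  # each assertion pushed to 2 peers
gossip = mbps(1, 10)                  # ~10 KB/sec of SWIM traffic

total = inbound + outbound + replication + gossip
headroom = 1000 / total               # against a 1 Gbps NIC

print(f"total: {total:.2f} Mbps, headroom: {headroom:.0f}x")
```

The unrounded total is about 6.5 Mbps, which the table rounds up to ~7 Mbps; either figure leaves well over 100× headroom on a 1 Gbps link.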
## Monitoring Endpoints
```
┌─────────────────────────────────────────────────────────────┐
│ Prometheus Scrape Targets │
├─────────────────┬───────────────────────────────────────────┤
│ Target │ URL │
├─────────────────┼───────────────────────────────────────────┤
│ Node 1 │ http://10.0.1.51:18180/metrics │
│ Node 2 │ http://10.0.1.52:18180/metrics │
│ Node 3 │ http://10.0.1.53:18180/metrics │
├─────────────────┼───────────────────────────────────────────┤
│ Scrape Interval │ 15 seconds │
│ Timeout │ 10 seconds │
└─────────────────┴───────────────────────────────────────────┘
Key Metrics:
- up{job="stemedb", instance="node1"} = 1
- stemedb_query_latency_seconds{quantile="0.99", instance="node1"}
- replication_lag_seconds{instance="node1"}
- process_resident_memory_bytes{instance="node1"}
```
## DNS Configuration
```
Public DNS (example.com):
┌────────────────────────────────────────────────────────────┐
│ stemedb.example.com. 300 IN CNAME stemedb-lb.example. │
│ stemedb-lb.example. 60 IN A 203.0.113.10 │
└────────────────────────────────────────────────────────────┘
Private DNS (cluster.local):
┌────────────────────────────────────────────────────────────┐
│ node1.cluster.local. 60 IN A 10.0.1.51 │
│ node2.cluster.local. 60 IN A 10.0.1.52 │
│ node3.cluster.local. 60 IN A 10.0.1.53 │
└────────────────────────────────────────────────────────────┘
TTL Recommendations:
- Public: 300s (5 min) - balance caching vs failover speed
- Private: 60s (1 min) - faster convergence within cluster
```