Single-Node Pilot Architecture

Target: Proof of concept, friendly pilot, development environments

⚠️ NOT RECOMMENDED FOR PRODUCTION - Single point of failure, manual recovery required

Overview

The single-node architecture is the simplest StemeDB deployment: one server running stemedb-api with local storage. It is suitable for early pilots, development, and demonstrations where availability is not critical.

[See: diagrams/single-node.txt for ASCII diagram]

Target Specifications

| Metric | Value |
| --- | --- |
| Assertions | <10,000 |
| Queries/sec | <100 |
| Concurrent users | <50 |
| Availability | Best effort (single point of failure) |
| RTO | 2 hours (manual restore) |
| RPO | 24 hours (daily backup) |

Hardware Requirements

Minimum (Pilot <5K assertions)

CPU: 2 vCPUs
RAM: 4GB
Disk: 50GB SSD (30GB WAL + 20GB DB)
Network: 100 Mbps

Example instances:

AWS: t3.medium (2 vCPU, 4GB)
GCP: e2-medium (2 vCPU, 4GB)
Azure: Standard_B2s (2 vCPU, 4GB)

Recommended (Pilot <10K assertions)

CPU: 4 vCPUs
RAM: 8GB
Disk: 100GB SSD (50GB WAL + 50GB DB)
Network: 1 Gbps

Example instances (these 8GB sizes provide 2 vCPUs; pick the next size up if the workload needs the full 4 vCPUs):

AWS: t3.large (2 vCPU, 8GB)
GCP: n2-standard-2 (2 vCPU, 8GB)
Azure: Standard_D2s_v3 (2 vCPU, 8GB)

See: Resource Sizing Guide for calculations.
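
As a quick illustration of how the two tiers map to pilot size, the sketch below picks a tier from an expected assertion count. The 5,000 and 10,000 thresholds come from the tier headings above; the function name `sizing_tier` and the output wording are hypothetical and not part of StemeDB.

```shell
# Map an expected assertion count to the hardware tier described above.
# sizing_tier is an illustrative helper, not a StemeDB command.
sizing_tier() {
    if [ "$1" -lt 5000 ]; then
        echo "minimum: 2 vCPU, 4GB RAM, 50GB SSD"
    elif [ "$1" -lt 10000 ]; then
        echo "recommended: 4 vCPU, 8GB RAM, 100GB SSD"
    else
        echo "beyond single-node pilot targets: plan a three-node cluster"
    fi
}

sizing_tier 4000   # prints the minimum tier
sizing_tier 8000   # prints the recommended tier
```

Treat the result as a starting point only; the Resource Sizing Guide remains the reference for actual calculations.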

Architecture Diagram

Component layout:

┌─────────────────────────────────────────────────────┐
│                  StemeDB Server                     │
│  ┌───────────────────────────────────────────────┐  │
│  │          stemedb-api (Port 18180)            │  │
│  │  ┌─────────────┐    ┌──────────────┐         │  │
│  │  │ HTTP Router │───▶│ Ingest       │         │  │
│  │  │ (Axum)      │    │ Pipeline     │         │  │
│  │  └─────────────┘    └──────┬───────┘         │  │
│  │                            │                  │  │
│  │  ┌──────────────────┐     ▼                  │  │
│  │  │ Query Engine     │  ┌────────────┐        │  │
│  │  │ (Lenses)         │  │ WAL        │        │  │
│  │  └────────┬─────────┘  └────────────┘        │  │
│  │           │              /data/wal/           │  │
│  │           ▼                                   │  │
│  │  ┌──────────────────┐                        │  │
│  │  │ HybridStore      │                        │  │
│  │  │ • KV Store       │                        │  │
│  │  │ • Indexes        │                        │  │
│  │  └──────────────────┘                        │  │
│  │     /data/db/                                │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
        ▲                           │
        │                           ▼
   ┌─────────┐            ┌──────────────────┐
   │ Clients │            │ Backups (daily)  │
   │ (Agents,│            │ /backups/        │
   │ Dash)   │            │ (rsync-based)    │
   └─────────┘            └──────────────────┘

Deployment Steps

Prerequisites

Ubuntu 22.04 or RHEL 9 server
stemedb-api binary installed
systemd service configured
Firewall rules applied

Step 1: Install StemeDB

# Download binary (replace with your release URL)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Verify installation
stemedb-api --version
# Expected: stemedb-api 0.1.0

Step 2: Create Data Directories

# Create directories
sudo mkdir -p /data/{wal,db}
sudo mkdir -p /backups

# Create stemedb user
sudo useradd -r -s /bin/false stemedb

# Set permissions
sudo chown -R stemedb:stemedb /data
sudo chown -R stemedb:stemedb /backups
# Owner and group only; other local users do not need to read the data files
sudo chmod 750 /data/{wal,db}
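
Before moving on, it can help to confirm that the directories exist and are writable by the account that will run the server. The following is a minimal sketch; `check_dir` and `check_dirs` are hypothetical helper names, and the check only reflects the permissions of whichever user runs it (root can write almost anywhere, so run it as the stemedb user for a meaningful result).

```shell
# Report whether a single directory exists and is writable by the current user.
check_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "ok: $1"
    else
        echo "FAIL: $1 is missing or not writable"
        return 1
    fi
}

# Check every directory given as an argument; non-zero if any check fails.
check_dirs() {
    status=0
    for dir in "$@"; do
        check_dir "$dir" || status=1
    done
    return "$status"
}

# Example with the paths from this step:
#   check_dirs /data/wal /data/db /backups
```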

Step 3: Configure Environment

# Create config directory and file
sudo mkdir -p /etc/stemedb
sudo tee /etc/stemedb/config.env > /dev/null <<EOF
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
STEMEDB_DB_DIR=/data/db
STEMEDB_METER_ENABLED=true
RUST_LOG=info
EOF

# Set permissions
sudo chmod 600 /etc/stemedb/config.env
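
A typo in the env file usually only surfaces when the service fails to start, so a small pre-check can save a restart cycle. This sketch assumes the bind address and the two directory variables are the ones the server cannot run without; that is an assumption drawn from this step, not a documented list of required settings, and `check_env_file` is a hypothetical helper.

```shell
# Verify that an env file sets a non-empty value for each key listed below.
# Prints every missing key and returns non-zero if any is absent.
check_env_file() {
    missing=0
    for key in STEMEDB_BIND_ADDR STEMEDB_WAL_DIR STEMEDB_DB_DIR; do
        if ! grep -q "^${key}=." "$1"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Example (the file is mode 600, so run as root):
#   check_env_file /etc/stemedb/config.env
```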

Step 4: Create systemd Service

# Create service file
sudo tee /etc/systemd/system/stemedb-api.service <<EOF
[Unit]
Description=StemeDB API Server
After=network.target

[Service]
Type=simple
User=stemedb
Group=stemedb
EnvironmentFile=/etc/stemedb/config.env
ExecStart=/usr/local/bin/stemedb-api
Restart=on-failure
RestartSec=5s

# Resource limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd
sudo systemctl daemon-reload

# Enable service
sudo systemctl enable stemedb-api

Step 5: Start Server

# Start service
sudo systemctl start stemedb-api

# Check status
sudo systemctl status stemedb-api

# Verify health
curl http://localhost:18180/v1/health
# Expected: {"status": "healthy", "version": "0.1.0", ...}
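
The server may take a moment to come up (for example, while it replays the WAL), so a single curl right after start can fail spuriously. A possible approach is to poll the health endpoint a few times. In this sketch, `wait_for_health` is a hypothetical helper, and the attempt count and delay defaults are arbitrary.

```shell
# Poll a health URL until it returns an HTTP success or the attempts run out.
# Arguments: URL, attempts (default 10), delay between attempts in seconds (default 2).
wait_for_health() {
    attempt=1
    while [ "$attempt" -le "${2:-10}" ]; do
        # -f makes curl exit non-zero on HTTP error responses
        if curl -fsS --max-time 5 "$1" > /dev/null 2>&1; then
            echo "healthy after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "${3:-2}"
    done
    echo "not healthy after ${2:-10} attempt(s)"
    return 1
}

# Example:
#   wait_for_health http://localhost:18180/v1/health 15 2
```

The function only checks the HTTP status, not the JSON body, so a response with a non-healthy status field would still need a separate check.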

Step 6: Configure Reverse Proxy (Optional)

For TLS termination and external access:

See: Nginx Config for complete example.

# Install nginx (Ubuntu; on RHEL 9 use dnf, and note that the sites-available layout below is Debian-specific)
sudo apt install -y nginx

# Copy config
sudo cp docs/operations/deployment/nginx/stemedb.conf /etc/nginx/sites-available/stemedb

# Enable site
sudo ln -s /etc/nginx/sites-available/stemedb /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

Step 7: Set Up Daily Backups

# Copy backup script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh

# Open root's crontab for editing
sudo crontab -e

# Add this line to the crontab (daily backup at 2 AM):
0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Test backup
sudo /usr/local/bin/backup-stemedb.sh
ls -lh /backups/
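
The contents of backup-stemedb.sh are not shown in this guide. Purely as an illustration of what an rsync-based snapshot of the two data directories could look like, here is a minimal sketch. It is not the shipped script: `backup_snapshot`, the `stemedb-<timestamp>` folder naming, and the `cp` fallback are all assumptions, and it does not pause or quiesce the server before copying.

```shell
# Copy wal/ and db/ from a data root into a timestamped folder under a backup root.
# Arguments: data root (contains wal/ and db/), backup root.
# Prints the path of the snapshot it created.
backup_snapshot() {
    target="$2/stemedb-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$target" || return 1
    for sub in wal db; do
        if command -v rsync > /dev/null 2>&1; then
            rsync -a "$1/$sub/" "$target/$sub/" || return 1
        else
            # Fallback when rsync is not installed
            mkdir -p "$target/$sub" && cp -a "$1/$sub/." "$target/$sub/" || return 1
        fi
    done
    echo "$target"
}

# Example with the paths from this guide:
#   backup_snapshot /data /backups
```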

Estimated deployment time: 1-2 hours

Network Configuration

Ports

| Port | Protocol | Purpose | Expose To |
| --- | --- | --- | --- |
| 18180 | TCP/HTTP | API queries, ingest | Clients (via reverse proxy) |
| 18180 | TCP/HTTP | Metrics endpoint (/metrics) | Internal monitoring |

Firewall Rules

AWS Security Group:

# Allow HTTP from load balancer only
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22

iptables:

# Allow HTTP from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

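# Persist rules (the file write must run as root; on Ubuntu, /etc/iptables comes with iptables-persistent)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null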

See: Network Requirements for full details.

Monitoring

Prometheus

Scrape configuration:

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']
    metrics_path: '/metrics'
    scrape_interval: 15s

Key Metrics to Monitor

# Query latency (should be <200ms p99)
stemedb_query_latency_seconds{quantile="0.99"}

# Ingest rate (assertions/sec)
rate(stemedb_assertions_total[1m])

# WAL fsync latency (should be <10ms)
stemedb_wal_fsync_latency_seconds

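# Disk usage as a fraction (alert at 0.80; these metrics come from node_exporter, not stemedb-api)
1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}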

# Memory usage
process_resident_memory_bytes
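
If Prometheus is not yet in place during an early pilot, the 80% disk threshold can also be checked directly on the host. The sketch below reads usage from `df -P`; `disk_used_pct` and `disk_ok` are hypothetical helper names, and the default threshold simply mirrors the alert level mentioned above.

```shell
# Print the used-space percentage (integer) of the filesystem holding a path.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Return 0 while usage is below the threshold (default 80), 1 once it is reached.
disk_ok() {
    used=$(disk_used_pct "$1")
    [ -n "$used" ] || { echo "unknown: cannot read usage for $1"; return 2; }
    if [ "$used" -ge "${2:-80}" ]; then
        echo "ALERT: $1 at ${used}% (threshold ${2:-80}%)"
        return 1
    fi
    echo "ok: $1 at ${used}%"
}

# Example, e.g. from cron:
#   disk_ok /data 80 || echo "disk space low on /data"
```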

Grafana Dashboard

See: Example dashboard in docker-compose/pilot-with-monitoring.yml stack.

Key panels:

Query latency (p50, p95, p99)
Ingest rate (assertions/sec)
Disk usage (WAL, DB, total)
Error rate (4xx, 5xx responses)

Failure Scenarios

Server Failure

Impact: Complete outage, all queries and writes fail

Recovery:

1. Provision a new server
2. Restore from backup (see Restore Runbook)
3. Update DNS to point to the new server
4. Validate with test queries

Estimated RTO: 2 hours (manual)

Data loss: Up to the last 24 hours (with daily backups)
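
The 24-hour data-loss bound only holds if the daily backup actually ran. One way to catch a silently failing cron job is to check that the backup directory contains something recent. This is a sketch under assumptions: `backup_is_fresh` is a hypothetical helper, and it looks only at modification times of top-level entries, not at whether a backup is complete or restorable.

```shell
# Succeed only if the directory holds at least one entry modified within max_hours.
# Arguments: backup directory, max age in hours (default 24).
backup_is_fresh() {
    find "$1" -mindepth 1 -maxdepth 1 -mmin "-$(( ${2:-24} * 60 ))" 2> /dev/null | grep -q .
}

# Example:
#   backup_is_fresh /backups 24 || echo "no backup in the last 24 hours"
```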

Disk Failure

Impact: Data loss, server won't start

Recovery:

1. Replace the disk
2. Restore from backup
3. Restart the server

Estimated RTO: 2 hours

Data loss: Up to the last 24 hours (with daily backups)

Process Crash (OOM, segfault)

Impact: Temporary outage, automatic restart via systemd

Recovery:

Automatic (systemd restart after 5s)
WAL replay recovers in-flight data

Estimated RTO: 10-30 seconds

Data loss: None (WAL preserves writes)

Limitations

Single-node architecture has these limitations:

No High Availability:
- Server failure = complete outage
- No automatic failover
- Manual recovery required
No Horizontal Scaling:
- Single CPU/RAM/disk bottleneck
- Can't add capacity by adding nodes
Manual Recovery:
- Restoring from backup is a manual process
- Typical downtime is 1-2 hours
Limited Throughput:
- ~100 queries/sec typical
- ~100 assertions/sec write capacity
Data Loss Risk:
- Daily backups = up to 24hr data loss
- No real-time replication

For production deployments, use the Three-Node Cluster architecture instead.

When to Migrate

Migrate to a three-node cluster when any of the following applies:

Assertion count approaching 10,000
Query latency p99 >500ms sustained
Availability requirements tighten (need <5min RTO)
Pilot validated, moving to production
Compliance requires redundancy
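
The two quantitative triggers in this list can be checked mechanically. The sketch below is illustrative only: `should_migrate` is a hypothetical helper, the inputs are supplied by hand rather than read from StemeDB, and treating 8,000 assertions (80% of the limit) as "approaching" is an assumption, since the list does not define a margin.

```shell
# Arguments: current assertion count, sustained p99 query latency in milliseconds.
# Returns 0 and prints the reason when a quantitative migration trigger is met.
should_migrate() {
    # Assumption: "approaching 10,000" is read as 80% of the limit
    if [ "$1" -ge 8000 ]; then
        echo "migrate: assertion count $1 is approaching 10,000"
        return 0
    fi
    if [ "$2" -gt 500 ]; then
        echo "migrate: p99 latency ${2}ms exceeds 500ms"
        return 0
    fi
    echo "stay: within single-node targets"
    return 1
}

# Example:
#   should_migrate 9000 100
```

The remaining triggers (availability requirements, production rollout, compliance) are organizational decisions and are not covered by the check.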

Migration procedure: Add Node Runbook

Cost Estimate

AWS example (t3.large, us-east-1):

| Resource | Monthly Cost |
| --- | --- |
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| Total | ~$87/month |

GCP example (n2-standard-2, us-central1):

| Resource | Monthly Cost |
| --- | --- |
| Compute (n2-standard-2) | $65 |
| Storage (100GB SSD) | $17 |
| Backup (500GB Cloud Storage) | $10 |
| Total | ~$92/month |
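
To adapt the estimate to other instance sizes or regions, the totals can be recomputed from the line items. The sketch below just sums whole-dollar amounts; `monthly_total` is a hypothetical helper, and the figures passed in are the ones listed in the two tables above.

```shell
# Sum whole-dollar monthly line items passed as arguments.
monthly_total() {
    total=0
    for cost in "$@"; do
        total=$((total + cost))
    done
    echo "$total"
}

echo "AWS: \$$(monthly_total 60 10 12 5)/month"   # prints: AWS: $87/month
echo "GCP: \$$(monthly_total 65 17 10)/month"     # prints: GCP: $92/month
```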

Related Documentation

Three-Node Cluster - Production architecture
Resource Sizing - Hardware calculations
Network Requirements - Firewall rules
Pilot Success Criteria - Validation checklist
Deployment Example - Docker Compose stack

Last Updated: 2026-02-11
