Single-Node Pilot Architecture
Target: Proof of concept, friendly pilot, development environments
⚠️ NOT RECOMMENDED FOR PRODUCTION - Single point of failure, manual recovery required
Overview
The single-node architecture is the simplest StemeDB deployment: one server running stemedb-api with local storage. Suitable for early pilots, development, and demonstrations where availability is not critical.
[See: diagrams/single-node.txt for ASCII diagram]
Target Specifications
| Metric | Value |
|---|---|
| Assertions | <10,000 |
| Queries/sec | <100 |
| Concurrent users | <50 |
| Availability | Best effort (single point of failure) |
| RTO | 2 hours (manual restore) |
| RPO | 24 hours (daily backup) |
Hardware Requirements
Minimum (Pilot <5K assertions)
- CPU: 2 vCPUs
- RAM: 4GB
- Disk: 50GB SSD (30GB WAL + 20GB DB)
- Network: 100 Mbps
Example instances:
- AWS: t3.medium (2 vCPU, 4GB)
- GCP: n1-standard-1 (1 vCPU, 3.75GB); note that this is below the 2 vCPU / 4GB minimum listed above
- Azure: Standard_B2s (2 vCPU, 4GB)
Recommended (Pilot <10K assertions)
- CPU: 4 vCPUs
- RAM: 8GB
- Disk: 100GB SSD (50GB WAL + 50GB DB)
- Network: 1 Gbps
Example instances:
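- AWS: t3.large (2 vCPU, 8GB)
- GCP: n2-standard-2 (2 vCPU, 8GB)
- Azure: Standard_D2s_v3 (2 vCPU, 8GB)
Note: these instances meet the 8GB RAM figure but provide 2 vCPUs rather than the 4 vCPUs recommended above.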
See: Resource Sizing Guide for calculations.
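A quick preflight check against the figures above could be sketched as follows. The function name `meets_minimum` and the 3500 MiB RAM threshold are illustrative assumptions, not part of StemeDB; the gathering commands assume a Linux host with GNU coreutils.

```shell
#!/bin/sh
# Hypothetical preflight helper; names and thresholds are illustrative.
# Usage: meets_minimum <cpus> <mem_mib> <disk_gib> <min_cpus> <min_mem_mib> <min_disk_gib>
# Prints one line per shortfall and returns non-zero if any check fails.
meets_minimum() {
    ok=0
    [ "$1" -ge "$4" ] || { echo "CPU: have $1, need $4"; ok=1; }
    [ "$2" -ge "$5" ] || { echo "RAM: have $2 MiB, need $5 MiB"; ok=1; }
    [ "$3" -ge "$6" ] || { echo "Disk: have $3 GiB, need $6 GiB"; ok=1; }
    return "$ok"
}

# Gathering the inputs on a Linux host. A nominal 4GB instance reports a bit
# less than 4096 MiB, so the RAM threshold here is relaxed to 3500 MiB.
# Run df against the filesystem that will hold /data.
#   cpus=$(nproc)
#   mem_mib=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
#   disk_gib=$(df -BG --output=size /data | tail -n 1 | tr -dc '0-9')
#   meets_minimum "$cpus" "$mem_mib" "$disk_gib" 2 3500 50
```

Keeping the comparison separate from the gathering commands makes it easy to reuse the same check for the recommended tier by passing different thresholds.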
Architecture Diagram
Component layout:
┌─────────────────────────────────────────────────────┐
│                   StemeDB Server                    │
│  ┌───────────────────────────────────────────────┐  │
│  │           stemedb-api (Port 18180)            │  │
│  │  ┌─────────────┐    ┌──────────────┐          │  │
│  │  │ HTTP Router │───▶│    Ingest    │          │  │
│  │  │   (Axum)    │    │   Pipeline   │          │  │
│  │  └─────────────┘    └──────┬───────┘          │  │
│  │                            │                  │  │
│  │  ┌──────────────────┐      ▼                  │  │
│  │  │   Query Engine   │ ┌────────────┐          │  │
│  │  │     (Lenses)     │ │    WAL     │          │  │
│  │  └────────┬─────────┘ └────────────┘          │  │
│  │           │             /data/wal/            │  │
│  │           ▼                                   │  │
│  │  ┌──────────────────┐                         │  │
│  │  │   HybridStore    │                         │  │
│  │  │   • KV Store     │                         │  │
│  │  │   • Indexes      │                         │  │
│  │  └──────────────────┘                         │  │
│  │       /data/db/                               │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
         ▲                             │
         │                             ▼
    ┌─────────┐               ┌──────────────────┐
    │ Clients │               │ Backups (daily)  │
    │ (Agents,│               │ /backups/        │
    │  Dash)  │               │ (rsync-based)    │
    └─────────┘               └──────────────────┘
Deployment Steps
Prerequisites
- Ubuntu 22.04 or RHEL 9 server
- stemedb-api binary installed
- systemd service configured
- Firewall rules applied
Step 1: Install StemeDB
# Download binary (replace with your release URL)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
# Verify installation
stemedb-api --version
# Expected: stemedb-api 0.1.0
Step 2: Create Data Directories
# Create directories
sudo mkdir -p /data/{wal,db}
sudo mkdir -p /backups
# Create stemedb user
sudo useradd -r -s /bin/false stemedb
# Set permissions
sudo chown -R stemedb:stemedb /data
sudo chown -R stemedb:stemedb /backups
sudo chmod 755 /data/{wal,db}
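To confirm the directories ended up with the intended owner and mode, a small check along these lines may help. `check_dir` is a hypothetical helper name; it relies on GNU `stat`, which is what Ubuntu and RHEL ship.

```shell
#!/bin/sh
# check_dir <path> <owner> <octal-mode>
# Succeeds when the directory exists with the expected owner and mode;
# otherwise prints what it found and returns non-zero.
check_dir() {
    [ -d "$1" ] || { echo "missing: $1"; return 1; }
    actual=$(stat -c '%U %a' "$1")
    [ "$actual" = "$2 $3" ] || { echo "$1: expected '$2 $3', got '$actual'"; return 1; }
}

# Example, matching the commands in this step:
#   check_dir /data/wal stemedb 755 && check_dir /data/db stemedb 755
```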
Step 3: Configure Environment
# Create config directory and file
sudo mkdir -p /etc/stemedb
sudo tee /etc/stemedb/config.env >/dev/null <<EOF
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
STEMEDB_DB_DIR=/data/db
STEMEDB_METER_ENABLED=true
RUST_LOG=info
EOF
# Set permissions
sudo chmod 600 /etc/stemedb/config.env
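Before starting the service it can be worth checking that the variables set above are actually present in the file. The sketch below is one way to do that; `missing_vars` is a hypothetical helper, and treating the three address and directory variables as required is an assumption based on the config shown in this step.

```shell
#!/bin/sh
# missing_vars <env-file> <VAR>...
# Prints each VAR that has no non-empty KEY=value line in the file and
# returns non-zero if anything is missing.
missing_vars() {
    file=$1
    shift
    rc=0
    for var in "$@"; do
        grep -Eq "^${var}=.+" "$file" || { echo "$var"; rc=1; }
    done
    return "$rc"
}

# Example (run as root, since the file is mode 600):
#   missing_vars /etc/stemedb/config.env \
#       STEMEDB_BIND_ADDR STEMEDB_WAL_DIR STEMEDB_DB_DIR
```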
Step 4: Create systemd Service
# Create service file
sudo tee /etc/systemd/system/stemedb-api.service <<EOF
[Unit]
Description=StemeDB API Server
After=network.target
[Service]
Type=simple
User=stemedb
Group=stemedb
EnvironmentFile=/etc/stemedb/config.env
ExecStart=/usr/local/bin/stemedb-api
Restart=on-failure
RestartSec=5s
# Resource limits
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd
sudo systemctl daemon-reload
# Enable service
sudo systemctl enable stemedb-api
Step 5: Start Server
# Start service
sudo systemctl start stemedb-api
# Check status
sudo systemctl status stemedb-api
# Verify health
curl http://localhost:18180/v1/health
# Expected: {"status": "healthy", "version": "0.1.0", ...}
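The server may need a moment before the health endpoint responds, so in scripted deployments a retry loop is often more reliable than a single `curl`. A minimal sketch, where `retry` is a hypothetical helper and the attempt count and delay are arbitrary choices:

```shell
#!/bin/sh
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example: poll the health endpoint for up to about 30 seconds.
#   retry 30 1 curl -fsS --max-time 2 -o /dev/null http://localhost:18180/v1/health
```

Because `curl -f` exits non-zero on HTTP errors, the loop only stops once the endpoint returns a successful response.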
Step 6: Configure Reverse Proxy (Optional)
For TLS termination and external access:
See: Nginx Config for complete example.
# Install nginx
sudo apt install nginx
# Copy config
sudo cp docs/operations/deployment/nginx/stemedb.conf /etc/nginx/sites-available/stemedb
# Enable site
sudo ln -s /etc/nginx/sites-available/stemedb /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
Step 7: Set Up Daily Backups
# Copy backup script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh
# Open root's crontab for editing
sudo crontab -e
# Add this line to run a daily backup at 2 AM:
0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1
# Test backup
sudo /usr/local/bin/backup-stemedb.sh
ls -lh /backups/
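Since the RPO depends on the daily job actually running, it may be useful to check that a recent backup exists. The sketch below assumes each backup appears as a file or directory directly under /backups, which is a guess about the layout produced by backup-stemedb.sh; `backup_is_fresh` is a hypothetical helper name.

```shell
#!/bin/sh
# backup_is_fresh <dir> <max-age-minutes>
# Succeeds when the directory has at least one direct entry modified
# within the given window (uses -mmin, available in GNU and BSD find).
backup_is_fresh() {
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 -mmin "-$2" 2>/dev/null | head -n 1)" ]
}

# Example: allow 25 hours so a slightly late 2 AM run does not alarm.
#   backup_is_fresh /backups 1500 || echo "no backup in the last 25 hours"
```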
Estimated deployment time: 1-2 hours
Network Configuration
Ports
| Port | Protocol | Purpose | Expose To |
|---|---|---|---|
| 18180 | TCP/HTTP | API queries, ingest | Clients (via reverse proxy) |
| 18180 | TCP/HTTP | Metrics endpoint | Internal monitoring |
Firewall Rules
AWS Security Group:
# Allow HTTP from load balancer only
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-lb \
--protocol tcp \
--port 18180
# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-bastion \
--protocol tcp \
--port 22
iptables:
# Allow HTTP from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP
# Persist rules (the file write must run as root, so pipe through sudo tee)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 >/dev/null
See: Network Requirements for full details.
Monitoring
Prometheus
Scrape configuration:
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']
    metrics_path: '/metrics'
    scrape_interval: 15s
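Before pointing Prometheus at the server, the endpoint can be checked by hand. The sketch below reads the Prometheus text exposition format on stdin and looks for a sample line with a given metric name; `has_metric` is a hypothetical helper, and the metric name in the example is taken from the list in this document.

```shell
#!/bin/sh
# has_metric <name>
# Reads Prometheus text format on stdin and succeeds when at least one
# sample line starts with the metric name (followed by labels or a value).
has_metric() {
    grep -Eq "^$1(\{| )"
}

# Example:
#   curl -fsS http://localhost:18180/metrics | has_metric stemedb_assertions_total
```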
Key Metrics to Monitor
# Query latency (should be <200ms p99)
stemedb_query_latency_seconds{quantile="0.99"}
# Ingest rate (assertions/sec)
rate(stemedb_assertions_total[1m])
# WAL fsync latency (should be <10ms)
stemedb_wal_fsync_latency_seconds
# Disk usage as a fraction of capacity (alert above 0.80)
1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}
# Memory usage
process_resident_memory_bytes
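Where Prometheus is not yet in place, the disk threshold can be approximated locally. This is a minimal sketch using POSIX `df -P`; the helper names are hypothetical and the 80 in the example mirrors the alert level mentioned above.

```shell
#!/bin/sh
# disk_used_pct <path>: print the used percentage of the filesystem
# holding <path>, as an integer without the % sign.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# over_threshold <used-percent> <limit-percent>: succeed at or above the limit.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Example:
#   pct=$(disk_used_pct /data)
#   over_threshold "$pct" 80 && echo "disk usage on /data is ${pct}%"
```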
Grafana Dashboard
See: Example dashboard in docker-compose/pilot-with-monitoring.yml stack.
Key panels:
- Query latency (p50, p95, p99)
- Ingest rate (assertions/sec)
- Disk usage (WAL, DB, total)
- Error rate (4xx, 5xx responses)
Failure Scenarios
Server Failure
Impact: Complete outage, all queries and writes fail
Recovery:
- Provision new server
- Restore from backup (see Restore Runbook)
- Update DNS to point to new server
- Validate with test queries
Estimated RTO: 2 hours (manual)
Data loss: Up to 24 hours (everything written since the last daily backup)
Disk Failure
Impact: Data loss, server won't start
Recovery:
- Replace disk
- Restore from backup
- Restart server
Estimated RTO: 2 hours
Data loss: Up to 24 hours (everything written since the last daily backup)
Process Crash (OOM, segfault)
Impact: Temporary outage, automatic restart via systemd
Recovery:
- Automatic (systemd restart after 5s)
- WAL replay recovers in-flight data
Estimated RTO: 10-30 seconds
Data loss: None for acknowledged writes (the WAL preserves them across the restart)
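Because systemd restarts the process automatically, crashes can go unnoticed. One way to spot them is the `NRestarts` property that systemd tracks per unit; the parsing helper below is a hypothetical sketch.

```shell
#!/bin/sh
# parse_restarts: read "NRestarts=<n>" on stdin (as printed by
# systemctl show) and print just the number.
parse_restarts() {
    sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p'
}

# Example:
#   systemctl show stemedb-api -p NRestarts | parse_restarts
# Recent errors from the service, to see why it restarted:
#   journalctl -u stemedb-api --since "1 hour ago" -p err
```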
Limitations
Single-node architecture has these limitations:
- No High Availability:
  - Server failure = complete outage
  - No automatic failover
  - Manual recovery required
- No Horizontal Scaling:
  - Single CPU/RAM/disk bottleneck
  - Can't add capacity by adding nodes
- Manual Recovery:
  - Restore from backup is manual process
  - Downtime 1-2 hours typical
- Limited Throughput:
  - ~100 queries/sec typical
  - ~100 assertions/sec write capacity
- Data Loss Risk:
  - Daily backups = up to 24hr data loss
  - No real-time replication
For production deployments, use Three-Node Cluster instead.
When to Migrate
Migrate to three-node cluster when:
- Assertion count approaching 10,000
- Query latency p99 >500ms sustained
- Availability requirements tighten (need <5min RTO)
- Pilot validated, moving to production
- Compliance requires redundancy
Migration procedure: Add Node Runbook
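The two measurable criteria in this list can be turned into a simple check. In the sketch below, `should_migrate` is a hypothetical helper, and the 8000 cutoff for "approaching 10,000" is an illustrative assumption. It evaluates a single latency reading, so judging whether the latency is sustained is left to the caller.

```shell
#!/bin/sh
# should_migrate <assertion-count> <p99-latency-ms>
# Prints each triggered criterion; succeeds when at least one applies.
# The 8000 cutoff is an assumed reading of "approaching 10,000".
should_migrate() {
    hit=1
    if [ "$1" -ge 8000 ]; then
        echo "assertion count $1 is approaching 10,000"
        hit=0
    fi
    if [ "$2" -gt 500 ]; then
        echo "p99 latency ${2}ms exceeds 500ms"
        hit=0
    fi
    return "$hit"
}

# Example:
#   should_migrate 8500 120 && echo "plan the three-node migration"
```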
Cost Estimate
AWS example (t3.large, us-east-1):
| Resource | Monthly Cost |
|---|---|
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| Total | ~$87/month |
GCP example (n2-standard-2, us-central1):
| Resource | Monthly Cost |
|---|---|
| Compute (n2-standard-2) | $65 |
| Storage (100GB SSD) | $17 |
| Backup (500GB Cloud Storage) | $10 |
| Total | ~$92/month |
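The totals in both tables are plain sums of the line items, which can be reproduced as follows when adjusting the estimate for a different instance or backup size. `monthly_total` is a hypothetical helper; the figures are the ones from the tables above, in USD.

```shell
#!/bin/sh
# monthly_total <cost>...: sum whole-dollar monthly line items.
monthly_total() {
    total=0
    for item in "$@"; do
        total=$((total + item))
    done
    echo "$total"
}

monthly_total 60 10 12 5   # AWS: compute, storage, backup, data transfer
monthly_total 65 17 10     # GCP: compute, storage, backup
```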
Related Documentation
- Three-Node Cluster - Production architecture
- Resource Sizing - Hardware calculations
- Network Requirements - Firewall rules
- Pilot Success Criteria - Validation checklist
- Deployment Example - Docker Compose stack
Last Updated: 2026-02-11