stemedb/docs/operations/reference-architecture/single-node-pilot.md

Single-Node Pilot Architecture

Target: Proof of concept, friendly pilot, development environments

⚠️ NOT RECOMMENDED FOR PRODUCTION - Single point of failure, manual recovery required


Overview

The single-node architecture is the simplest StemeDB deployment: one server running stemedb-api with local storage. It is suitable for early pilots, development, and demonstrations where availability is not critical.

[See: diagrams/single-node.txt for ASCII diagram]

Target Specifications

| Metric | Value |
|--------|-------|
| Assertions | <10,000 |
| Queries/sec | <100 |
| Concurrent users | <50 |
| Availability | Best effort (single point of failure) |
| RTO | 2 hours (manual restore) |
| RPO | 24 hours (daily backup) |

Hardware Requirements

Minimum (Pilot <5K assertions)

  • CPU: 2 vCPUs
  • RAM: 4GB
  • Disk: 50GB SSD (30GB WAL + 20GB DB)
  • Network: 100 Mbps

Example instances:

  • AWS: t3.medium (2 vCPU, 4GB)
  • GCP: e2-medium (2 vCPU, 4GB)
  • Azure: Standard_B2s (2 vCPU, 4GB)

Recommended

  • CPU: 4 vCPUs
  • RAM: 8GB
  • Disk: 100GB SSD (50GB WAL + 50GB DB)
  • Network: 1 Gbps

Example instances (8GB tier; each provides only 2 vCPUs, so choose the next size up if the pilot is CPU-bound):

  • AWS: t3.large (2 vCPU, 8GB)
  • GCP: n2-standard-2 (2 vCPU, 8GB)
  • Azure: Standard_D2s_v3 (2 vCPU, 8GB)

See: Resource Sizing Guide for calculations.
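
As a quick pre-flight check, the minimum figures above (2 vCPUs, 4GB RAM, 50GB disk) can be compared against a candidate host. The following is a minimal sketch; the function name `meets_pilot_minimum` is hypothetical and not part of StemeDB, and it accepts whole numbers only.

```shell
#!/usr/bin/env bash
# Sketch: compare a host's resources against the pilot minimum
# (2 vCPUs, 4GB RAM, 50GB disk). The function name is hypothetical.

# meets_pilot_minimum <vcpus> <ram_gb> <disk_gb>
# Prints one line per shortfall; returns non-zero if any value is too low.
meets_pilot_minimum() {
  local vcpus=$1 ram_gb=$2 disk_gb=$3 short=0
  if [ "$vcpus" -lt 2 ]; then echo "CPU: $vcpus vCPUs < 2"; short=1; fi
  if [ "$ram_gb" -lt 4 ]; then echo "RAM: ${ram_gb}GB < 4GB"; short=1; fi
  if [ "$disk_gb" -lt 50 ]; then echo "Disk: ${disk_gb}GB < 50GB"; short=1; fi
  return "$short"
}
```

For example, `meets_pilot_minimum "$(nproc)" 4 50` checks the local CPU count against nominal RAM and disk sizes. Pass the nominal instance sizes rather than values read from the kernel, which usually reports slightly less memory than the advertised figure.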


Architecture Diagram

Component layout:

┌─────────────────────────────────────────────────────┐
│                  StemeDB Server                     │
│  ┌───────────────────────────────────────────────┐  │
│  │          stemedb-api (Port 18180)            │  │
│  │  ┌─────────────┐    ┌──────────────┐         │  │
│  │  │ HTTP Router │───▶│ Ingest       │         │  │
│  │  │ (Axum)      │    │ Pipeline     │         │  │
│  │  └─────────────┘    └──────┬───────┘         │  │
│  │                            │                  │  │
│  │  ┌──────────────────┐     ▼                  │  │
│  │  │ Query Engine     │  ┌────────────┐        │  │
│  │  │ (Lenses)         │  │ WAL        │        │  │
│  │  └────────┬─────────┘  └────────────┘        │  │
│  │           │              /data/wal/           │  │
│  │           ▼                                   │  │
│  │  ┌──────────────────┐                        │  │
│  │  │ HybridStore      │                        │  │
│  │  │ • KV Store       │                        │  │
│  │  │ • Indexes        │                        │  │
│  │  └──────────────────┘                        │  │
│  │     /data/db/                                │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
        ▲                           │
        │                           ▼
   ┌─────────┐            ┌──────────────────┐
   │ Clients │            │ Backups (daily)  │
   │ (Agents,│            │ /backups/        │
   │ Dash)   │            │ (rsync-based)    │
   └─────────┘            └──────────────────┘

Deployment Steps

Prerequisites

  • Ubuntu 22.04 or RHEL 9 server
  • stemedb-api binary installed
  • systemd service configured
  • Firewall rules applied

Step 1: Install StemeDB

# Download binary (replace with your release URL)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Verify installation
stemedb-api --version
# Expected: stemedb-api 0.1.0

Step 2: Create Data Directories

# Create directories
sudo mkdir -p /data/{wal,db}
sudo mkdir -p /backups

# Create stemedb user
sudo useradd -r -s /bin/false stemedb

# Set permissions
sudo chown -R stemedb:stemedb /data
sudo chown -R stemedb:stemedb /backups
sudo chmod 755 /data/{wal,db}
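
Before moving on, it can help to confirm that each directory exists and is owned by the service user. This is a minimal sketch assuming GNU `stat` (as shipped on Ubuntu and RHEL); the helper name `check_data_dir` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify that a directory exists and has the expected owner.
# Assumes GNU coreutils stat; the helper name is hypothetical.

# check_data_dir <dir> <expected_owner>
check_data_dir() {
  local dir=$1 owner=$2 actual
  if [ ! -d "$dir" ]; then
    echo "$dir: missing"
    return 1
  fi
  actual=$(stat -c '%U' "$dir") || return 1
  if [ "$actual" != "$owner" ]; then
    echo "$dir: owned by $actual, expected $owner"
    return 1
  fi
  echo "$dir: ok"
}
```

A typical invocation would be `for d in /data/wal /data/db /backups; do check_data_dir "$d" stemedb; done`.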

Step 3: Configure Environment

# Create config directory and file
sudo mkdir -p /etc/stemedb
sudo tee /etc/stemedb/config.env <<EOF
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
STEMEDB_DB_DIR=/data/db
STEMEDB_METER_ENABLED=true
RUST_LOG=info
EOF

# Set permissions
sudo chmod 600 /etc/stemedb/config.env
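
A typo in the environment file only surfaces when the service starts, so a small check beforehand may save a restart cycle. The sketch below treats the bind address and the two directory variables as required; that choice is an assumption, and the helper name `missing_config_keys` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report required keys that are absent or empty in an env file.
# Which keys are strictly required is an assumption; adjust the list as needed.

# missing_config_keys <env_file>
# Prints each missing key; returns non-zero if any are missing.
missing_config_keys() {
  local file=$1 key missing=0
  for key in STEMEDB_BIND_ADDR STEMEDB_WAL_DIR STEMEDB_DB_DIR; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "$key"
      missing=1
    fi
  done
  return "$missing"
}
```

Because the file is mode 600, run it with elevated privileges, for example from a root shell: `missing_config_keys /etc/stemedb/config.env`.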

Step 4: Create systemd Service

# Create service file
sudo tee /etc/systemd/system/stemedb-api.service <<EOF
[Unit]
Description=StemeDB API Server
After=network.target

[Service]
Type=simple
User=stemedb
Group=stemedb
EnvironmentFile=/etc/stemedb/config.env
ExecStart=/usr/local/bin/stemedb-api
Restart=on-failure
RestartSec=5s

# Resource limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd
sudo systemctl daemon-reload

# Enable service
sudo systemctl enable stemedb-api

Step 5: Start Server

# Start service
sudo systemctl start stemedb-api

# Check status
sudo systemctl status stemedb-api

# Verify health
curl http://localhost:18180/v1/health
# Expected: {"status": "healthy", "version": "0.1.0", ...}
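
The server may need a moment after `systemctl start` before it answers, so a single `curl` can fail spuriously. One way to handle this is a generic retry wrapper, sketched below; the function name `retry` is illustrative, and the attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: run a command until it succeeds or the attempts run out.

# retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For the health check above, this could be used as `retry 10 2 curl -fsS http://localhost:18180/v1/health`, where `-f` makes `curl` exit non-zero on HTTP error responses.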

Step 6: Configure Reverse Proxy (Optional)

For TLS termination and external access:

See: Nginx Config for complete example.

# Install nginx (Ubuntu; on RHEL 9 use `sudo dnf install nginx` and place the config in /etc/nginx/conf.d/)
sudo apt install nginx

# Copy config
sudo cp docs/operations/deployment/nginx/stemedb.conf /etc/nginx/sites-available/stemedb

# Enable site
sudo ln -s /etc/nginx/sites-available/stemedb /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

Step 7: Set Up Daily Backups

# Copy backup script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh

# Open root's crontab for editing
sudo crontab -e

# Add this line to run the backup daily at 2 AM
0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Test backup
sudo /usr/local/bin/backup-stemedb.sh
ls -lh /backups/
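
Since the 24-hour RPO depends on the cron job actually producing a backup each day, a freshness check is a useful companion. This is a minimal sketch assuming the backup script writes a new file or directory directly under /backups on each run (the script's actual output layout is not shown here); the helper name `backup_is_fresh` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the backup directory holds at least one entry
# modified within the last <max_age_hours> hours.
# Assumes each backup run creates a new top-level entry in the directory.

# backup_is_fresh <backup_dir> <max_age_hours>
backup_is_fresh() {
  local dir=$1 max_age_hours=$2 recent
  recent=$(find "$dir" -mindepth 1 -maxdepth 1 \
    -mmin "-$((max_age_hours * 60))" -print -quit 2>/dev/null)
  [ -n "$recent" ]
}
```

For example, `backup_is_fresh /backups 26 || echo "No recent backup"` leaves a two-hour margin beyond the daily schedule.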

Estimated deployment time: 1-2 hours


Network Configuration

Ports

| Port | Protocol | Purpose | Expose To |
|------|----------|---------|-----------|
| 18180 | TCP/HTTP | API queries, ingest | Clients (via reverse proxy) |
| 18180 | TCP/HTTP | Metrics endpoint | Internal monitoring |

Firewall Rules

AWS Security Group:

# Allow HTTP from load balancer only
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22

iptables:

# Allow HTTP from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

# Persist rules (tee runs as root; a plain `>` redirect would run as the calling user)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null

See: Network Requirements for full details.


Monitoring

Prometheus

Scrape configuration:

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']
    metrics_path: '/metrics'
    scrape_interval: 15s

Key Metrics to Monitor

# Query latency (should be <200ms p99)
stemedb_query_latency_seconds{quantile="0.99"}

# Ingest rate (assertions/sec)
rate(stemedb_assertions_total[1m])

# WAL fsync latency (should be <10ms)
stemedb_wal_fsync_latency_seconds

# Disk usage fraction on /data (alert above 0.80)
1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}

# Memory usage
process_resident_memory_bytes
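
Before Prometheus is wired up, the same metrics can be spot-checked from the command line. The sketch below extracts one value from Prometheus text exposition format on stdin. It assumes the endpoint emits that format and that label values contain no spaces; the helper name `metric_value` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the value of the first sample whose metric name matches.
# Reads Prometheus text exposition format on stdin; labels are ignored.
# Assumes label values contain no spaces.

# metric_value <metric_name>
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      series = $1
      sub(/[{].*/, "", series)    # strip the {label="..."} suffix, if any
      if (series == name) { print $NF; exit }
    }
  '
}
```

For example: `curl -fsS http://localhost:18180/metrics | metric_value stemedb_assertions_total`.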

Grafana Dashboard

See: Example dashboard in docker-compose/pilot-with-monitoring.yml stack.

Key panels:

  • Query latency (p50, p95, p99)
  • Ingest rate (assertions/sec)
  • Disk usage (WAL, DB, total)
  • Error rate (4xx, 5xx responses)

Failure Scenarios

Server Failure

Impact: Complete outage, all queries and writes fail

Recovery:

  1. Provision new server
  2. Restore from backup (see Restore Runbook)
  3. Update DNS to point to new server
  4. Validate with test queries

Estimated RTO: 2 hours (manual)

Data loss: Up to 24 hours (everything written since the last daily backup)

Disk Failure

Impact: Data loss, server won't start

Recovery:

  1. Replace disk
  2. Restore from backup
  3. Restart server

Estimated RTO: 2 hours

Data loss: Up to 24 hours (everything written since the last daily backup)

Process Crash (OOM, segfault)

Impact: Temporary outage, automatic restart via systemd

Recovery:

  • Automatic (systemd restart after 5s)
  • WAL replay recovers in-flight data

Estimated RTO: 10-30 seconds

Data loss: None (WAL preserves writes)


Limitations

Single-node architecture has these limitations:

  1. No High Availability:

    • Server failure = complete outage
    • No automatic failover
    • Manual recovery required
  2. No Horizontal Scaling:

    • Single CPU/RAM/disk bottleneck
    • Can't add capacity by adding nodes
  3. Manual Recovery:

    • Restore from backup is manual process
    • Downtime 1-2 hours typical
  4. Limited Throughput:

    • ~100 queries/sec typical
    • ~100 assertions/sec write capacity
  5. Data Loss Risk:

    • Daily backups = up to 24hr data loss
    • No real-time replication

For production deployments, use Three-Node Cluster instead.


When to Migrate

Migrate to three-node cluster when:

  • Assertion count approaching 10,000
  • Query latency p99 >500ms sustained
  • Availability requirements tighten (need <5min RTO)
  • Pilot validated, moving to production
  • Compliance requires redundancy

Migration procedure: Add Node Runbook
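
The two quantitative triggers above can be checked mechanically. The following is a rough sketch: it interprets "approaching 10,000" as 90% of the limit (9,000), which is an assumption rather than a documented threshold, and the function name `should_migrate` is hypothetical. Whether latency is "sustained" still needs to be judged from monitoring history.

```shell
#!/usr/bin/env bash
# Sketch: flag the quantitative migration triggers.
# The 9,000 threshold (90% of 10,000) is an assumed reading of "approaching".

# should_migrate <assertion_count> <p99_latency_ms>
# Prints each trigger that fired; returns 0 if at least one fired.
should_migrate() {
  local assertions=$1 p99_ms=$2 migrate=1
  if [ "$assertions" -ge 9000 ]; then
    echo "assertion count $assertions is approaching 10,000"
    migrate=0
  fi
  if [ "$p99_ms" -gt 500 ]; then
    echo "p99 latency ${p99_ms}ms exceeds 500ms"
    migrate=0
  fi
  return "$migrate"
}
```

For example, `should_migrate 9500 120` reports the assertion-count trigger and returns success, while `should_migrate 3000 120` prints nothing and returns non-zero.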


Cost Estimate

AWS example (t3.large, us-east-1):

| Resource | Monthly Cost |
|----------|--------------|
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| Total | ~$87/month |

GCP example (n2-standard-2, us-central1):

| Resource | Monthly Cost |
|----------|--------------|
| Compute (n2-standard-2) | $65 |
| Storage (100GB SSD) | $17 |
| Backup (500GB Cloud Storage) | $10 |
| Total | ~$92/month |
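
When adapting these estimates to another region or instance type, the totals are simply the sum of the line items. A minimal sketch for re-adding them follows; the function name `monthly_total` is hypothetical, and the inputs are whole-dollar figures.

```shell
#!/usr/bin/env bash
# Sketch: sum whole-dollar monthly line items.

# monthly_total <cost>...
monthly_total() {
  local total=0 item
  for item in "$@"; do
    total=$((total + item))
  done
  echo "$total"
}
```

With the figures above, `monthly_total 60 10 12 5` gives the AWS total and `monthly_total 65 17 10` the GCP total.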


Last Updated: 2026-02-11