

Runbook: Server Won't Start

Symptom

stemedb-api process exits immediately after startup
Port binding fails with "Address already in use"
TLS certificate errors in logs
"No space left on device" errors
WAL magic byte validation failures
Permission denied errors on data directories

Metrics Alerts:

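N/A for StemeDB's own metrics (the server never starts, so nothing is exported). An external `up{job="stemedb"} == 0` alert, such as StemeDBDown under Prevention, is the only signal.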

Quick Diagnosis

Server won't start
    │
    ├─► Check: lsof -i :18180
    │   └─► Port in use? → §1 Port Conflict
    │
    ├─► Check: journalctl -u stemedb-api | grep -i tls
    │   └─► TLS errors? → §2 TLS Error
    │
    ├─► Check: df -h
    │   └─► Disk full? → [Disk Full Runbook](./disk-full.md)
    │
    ├─► Check: journalctl -u stemedb-api | grep -i magic
    │   └─► WAL corruption? → §3 WAL Corruption
    │
    └─► Check: ls -la data/wal/
        └─► Permission denied? → §5 Permissions
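The decision tree above can be sketched as a small log classifier. This is a minimal sketch, not part of StemeDB: the function name `classify_startup_error` and the category labels are hypothetical, and the match patterns are taken from the error strings quoted in this runbook, so the real log wording may differ.

```shell
# Map one startup log line to the runbook section that handles it.
# Patterns come from the error strings quoted in this runbook (assumption:
# the server logs them roughly verbatim).
classify_startup_error() {
    # Lower-case the input so matching is case-insensitive.
    line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$line" in
        *"address already in use"*)     echo "port-conflict" ;;   # §1
        *"no space left on device"*)    echo "disk-full" ;;       # §4
        *"permission denied"*)          echo "permissions" ;;     # §5
        *magic*|*"checksum mismatch"*)  echo "wal-corruption" ;;  # §3
        *tls*|*certificate*)            echo "tls" ;;             # §2
        *)                              echo "unknown" ;;
    esac
}

# Example: summarize the last 50 log lines by category.
# journalctl -u stemedb-api -n 50 --no-pager |
#     while IFS= read -r l; do classify_startup_error "$l"; done | sort | uniq -c
```

The order of the patterns matters: the more specific messages are tested before the broad `tls` match, so a "permission denied" on a TLS key file is routed to §5 first.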

Common Causes

Port already in use (likelihood: 40%)
- Previous instance didn't shut down cleanly
- Another service using port 18180
- Development server still running

TLS certificate issues (likelihood: 25%)
- Certificate expired
- Wrong file paths in config
- Certificate/key mismatch

WAL corruption (likelihood: 15%)
- Unclean shutdown (power loss, OOM kill)
- Disk corruption
- Version mismatch after upgrade

Disk full (likelihood: 10%)
- WAL directory out of space
- DB directory out of space
- No inodes available

Permission issues (likelihood: 10%)
- Wrong ownership on data directories
- SELinux/AppArmor blocking access
- Container user mismatch

Resolution Steps

§1. Port Conflict

Diagnostic:

# Check if port 18180 is in use
lsof -i :18180

# Expected output if port in use:
# COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
# stemedb- 1234 root   10u  IPv4  12345      0t0  TCP *:18180 (LISTEN)

Resolution A: Kill stale process

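# Find the PID listening on the port (-sTCP:LISTEN excludes client connections)
sudo lsof -t -iTCP:18180 -sTCP:LISTEN

# Kill gracefully (SIGTERM)
sudo kill $(sudo lsof -t -iTCP:18180 -sTCP:LISTEN)

# Wait 5 seconds
sleep 5

# Verify port is free
sudo lsof -i :18180
# (Should return empty)

# Start server
sudo systemctl start stemedb-api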
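A fixed `sleep 5` can be too short or needlessly long. A minimal sketch of a polling alternative follows; `wait_until` and `port_is_free` are hypothetical helper names, not part of StemeDB, and `port_is_free` assumes `lsof` is installed (if it is missing, the check reports the port as free).

```shell
# Retry a command once per second until it succeeds or the timeout expires.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Succeeds when nothing is listening on the given TCP port.
port_is_free() {
    ! lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Example: wait up to 10 seconds for the port before starting the server.
# wait_until 10 port_is_free 18180 || echo "port 18180 still in use" >&2
```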

Resolution B: Change port

# Set custom port via environment variable
# (only affects a server launched from this shell, not the systemd service)
export STEMEDB_BIND_ADDR="127.0.0.1:18280"

# For the systemd-managed service, add an override instead
sudo systemctl edit stemedb-api

# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"

sudo systemctl daemon-reload
sudo systemctl start stemedb-api

If failed: Port still in use after kill → Check for multiple instances or a conflicting service; a process that ignores SIGTERM may need `kill -9`. Reboot the host only as a last resort.

§2. TLS Certificate Error

Diagnostic:

# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls

# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"

# Verify certificate files exist
ls -lh /etc/stemedb/tls/

Resolution A: Certificate expired

# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate

# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com

# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem

# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

# Restart server
sudo systemctl start stemedb-api
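To see how close a certificate is to expiry without reading the date by eye, the check above can be wrapped in a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and `cert_days_left` relies on GNU `date -d` to parse the `notAfter=` value printed by `openssl x509 -enddate`.

```shell
# Whole days between an expiry time and "now", both as Unix epoch seconds.
# A negative result means the certificate has already expired.
days_left() {
    echo $(( ($1 - $2) / 86400 ))
}

# Days until the certificate file in $1 expires.
# Assumes GNU date; BSD/macOS date needs different parsing flags.
cert_days_left() {
    end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
    days_left "$(date -d "$end" +%s)" "$(date +%s)"
}

# Example:
# cert_days_left /etc/stemedb/tls/cert.pem
```

For a plain yes/no answer, `openssl x509 -checkend 604800 -noout -in /etc/stemedb/tls/cert.pem` exits non-zero if the certificate expires within the next seven days.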

Resolution B: Wrong file paths

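# Check environment variables in the current shell
env | grep STEMEDB_TLS

# Check what the systemd service actually receives
systemctl show stemedb-api -p Environment

# Set correct paths (only affects a server launched from this shell)
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"

# Or update systemd service
sudo systemctl edit stemedb-api
# Add under [Service]:
# Environment="STEMEDB_TLS_CERT=/path/to/cert.pem"
# Environment="STEMEDB_TLS_KEY=/path/to/key.pem"

sudo systemctl daemon-reload
sudo systemctl start stemedb-api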

Resolution C: Certificate/key mismatch

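# Verify certificate and key match (modulus comparison works for RSA keys only)
openssl x509 -noout -modulus -in /etc/stemedb/tls/cert.pem | openssl md5
sudo openssl rsa -noout -modulus -in /etc/stemedb/tls/key.pem | openssl md5

# For any key type (RSA, ECDSA, Ed25519), compare the public keys instead
openssl x509 -noout -pubkey -in /etc/stemedb/tls/cert.pem | openssl sha256
sudo openssl pkey -pubout -in /etc/stemedb/tls/key.pem | openssl sha256

# Hashes should match. If not, regenerate certificate or find matching pair.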

If failed: TLS still failing → Temporarily disable TLS for debugging (NOT for production):

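# Disable TLS (debugging only); a plain `export` does not reach the systemd service
sudo systemctl edit stemedb-api
# Add under [Service]:
# Environment="STEMEDB_TLS_ENABLED=false"
sudo systemctl start stemedb-api
# Remove the override once the TLS problem is fixed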

§3. WAL Corruption

Diagnostic:

# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"

# Check WAL directory
ls -lh data/wal/

Resolution: Restore from backup

⚠️ WARNING: This destroys current WAL data. Only proceed if backup is available and data loss is acceptable.

# Stop server (if running)
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# List available backups
ls -lh backups/

# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS

# Verify restoration
cat data/metadata.json

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health

Expected output after restore:

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 5,
  "assertion_count": 10234
}
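Right after a restore the server may need a moment before it answers, so a single `curl` can fail spuriously. The following is a minimal polling sketch; `health_status` and `wait_healthy` are hypothetical helpers, and the parsing assumes the response shape shown above.

```shell
# Extract the "status" field from a health response on stdin.
# Uses sed rather than jq so it runs on a minimal host; this is a simple
# pattern match, not a full JSON parser.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll the health endpoint until it reports "healthy" or attempts run out.
# Usage: wait_healthy <url> <attempts>
wait_healthy() {
    attempts=$2
    while [ "$attempts" -gt 0 ]; do
        status=$(curl -fsS --max-time 2 "$1" 2>/dev/null | health_status)
        [ "$status" = "healthy" ] && return 0
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Example: allow up to 30 seconds after the restore.
# wait_healthy http://localhost:18180/v1/health 30 || echo "server not healthy" >&2
```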

If failed: Restore failed → Check backup integrity. See Restore from Backup Runbook.

§4. Disk Full

See: Disk Full Runbook for full procedure.

Quick emergency fix:

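# Check disk usage (space and inodes)
df -h
df -i

# If >98%, emergency cleanup
# WARNING: deletes WAL segments older than 7 days; confirm they are archived or backed up first
sudo find data/wal -name "*.log" -mtime +7 -delete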

# Start server
sudo systemctl start stemedb-api
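The thresholds used in this runbook (monitor at 80%, alert at 90%, emergency cleanup above 98%) can be checked mechanically. This is a sketch with hypothetical function names; it parses the POSIX `df -P` output format, where the fifth column of the second line is the usage percentage.

```shell
# Read `df -P <path>` output on stdin and print the use percentage as a number.
df_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Classify a usage percentage with the thresholds used in this runbook.
disk_severity() {
    if   [ "$1" -gt 98 ]; then echo "emergency"
    elif [ "$1" -ge 90 ]; then echo "alert"
    elif [ "$1" -ge 80 ]; then echo "warn"
    else                       echo "ok"
    fi
}

# Example: classify the filesystem that holds the WAL directory.
# disk_severity "$(df -P data/wal | df_use_pct)"
```

The same helpers work for inode exhaustion by feeding them `df -Pi` output instead.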

§5. Permission Issues

Diagnostic:

# Check directory permissions
ls -la data/

# Expected ownership:
# drwxr-xr-x stemedb stemedb wal/
# drwxr-xr-x stemedb stemedb db/

# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent

Resolution A: Fix ownership

# Fix ownership recursively
sudo chown -R stemedb:stemedb data/

# Fix permissions: directories 755, files 644 (chmod -R 755 would also mark data files executable)
sudo find data/ -type d -exec chmod 755 {} +
sudo find data/ -type f -exec chmod 644 {} +

# Start server
sudo systemctl start stemedb-api
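Before running a recursive `chown`, it can help to see exactly which paths have the wrong owner. A minimal sketch, assuming the service user is `stemedb` as shown in the expected ownership above; the function name is hypothetical.

```shell
# List everything under a directory that is NOT owned by the expected user.
# Empty output means ownership is already consistent.
# Usage: find_wrong_owner <dir> <user-name-or-uid>
find_wrong_owner() {
    find "$1" ! -user "$2" -print
}

# Example:
# sudo sh -c 'find data/ ! -user stemedb -print'   # same check as a one-liner
# find_wrong_owner data/ stemedb
```

A single root-owned file left behind by a manual `sudo` command is a common finding here, and fixing just that path is less invasive than a full recursive change.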

Resolution B: SELinux context

# Restore SELinux context
sudo restorecon -Rv data/

# Or set permissive for debugging (NOT for production)
sudo setenforce 0

# Start server
sudo systemctl start stemedb-api

# If this works, re-enable enforcing mode (sudo setenforce 1) and add an SELinux policy instead of leaving it disabled

Resolution C: Container user mismatch

# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000"  # Match host UID
#     volumes:
#       - ./data:/data

# Or use chown in entrypoint (the container must start as root for chown to succeed):
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]

If failed: Permissions correct but still denied → Check AppArmor profiles or mandatory access controls.

Validation

After applying resolution, validate server is healthy:

Server starts successfully

systemctl status stemedb-api
# Should show "active (running)"

Health endpoint returns 200

curl http://localhost:18180/v1/health
# Should return: {"status":"healthy", ...}

Port is bound

lsof -i :18180
# Should show stemedb-api listening

Logs show successful startup

journalctl -u stemedb-api -n 20
# Should show 10 startup steps completed

Test query succeeds

curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path":"test/health","lens":"recency"}'
# Should return 200 (even if empty results)

Metrics endpoint works

curl http://localhost:18180/metrics | head -20
# Should return Prometheus metrics
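The checks above can be bundled into one pass/fail report. This is a sketch, not a shipped script: `run_check` is a hypothetical helper, and the commented commands simply repeat the ones listed in this section.

```shell
failures=0

# Run one named check, print PASS/FAIL, and count failures.
# Usage: run_check <name> <command> [args...]
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $name"
    else
        echo "FAIL  $name"
        failures=$((failures + 1))
    fi
}

# Example:
# run_check "service active"   systemctl is-active --quiet stemedb-api
# run_check "health endpoint"  curl -fsS http://localhost:18180/v1/health
# run_check "port bound"       lsof -nP -iTCP:18180 -sTCP:LISTEN
# run_check "metrics endpoint" curl -fsS http://localhost:18180/metrics
# [ "$failures" -eq 0 ] || echo "$failures check(s) failed" >&2
```

Call `run_check` directly rather than inside `$(...)`, since a subshell would lose the failure count.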

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"

      - alert: StemeDBRestartLoop
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"

Configuration Changes

To prevent recurrence:

Port conflicts: Reserve port 18180 in your infrastructure registry
TLS expiry: Automate certificate renewal with certbot + systemd timer
WAL corruption: Enable daily backups via cron
Disk full: Monitor disk at 80% threshold, alert at 90%
Permissions: Document correct UID/GID in deployment guide

Example: Automated TLS renewal

# /etc/systemd/system/certbot-renewal.timer
[Unit]
Description=Certbot renewal timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
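A timer only activates a service unit of the same name, so `certbot-renewal.timer` needs a matching `certbot-renewal.service` that runs `certbot renew`. Renewal alone does not update the copies under `/etc/stemedb/tls/`, so a deploy hook is one way to close that gap. The sketch below is an assumption about how this could be wired, not the project's shipped tooling: the function name is hypothetical, and it relies on certbot setting `RENEWED_LINEAGE` to the renewed certificate's live directory when it runs a `--deploy-hook` script.

```shell
# Copy a renewed certificate pair into the directory the server reads from.
# Usage: install_renewed_cert <lineage_dir> <dest_dir>
install_renewed_cert() {
    src=$1
    dest=$2
    # Refuse to continue unless both source files exist.
    [ -f "$src/fullchain.pem" ] && [ -f "$src/privkey.pem" ] || return 1
    # Create the key copy with restrictive permissions from the start.
    ( umask 077 && cp "$src/privkey.pem" "$dest/key.pem" ) || return 1
    cp "$src/fullchain.pem" "$dest/cert.pem" || return 1
}

# Example deploy hook body (run by: certbot renew --deploy-hook <this script>):
# install_renewed_cert "$RENEWED_LINEAGE" /etc/stemedb/tls \
#     && chown stemedb:stemedb /etc/stemedb/tls/*.pem \
#     && systemctl restart stemedb-api
```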

Startup Sequence Reference

Normal startup takes 2-5 seconds and includes 10 steps:

1. Initialize logging (tracing subscriber)
2. Start metrics registry
3. Load configuration (env vars)
4. Verify data directories exist
5. Open WAL journal (crash recovery if needed)
6. Initialize HybridStore (KV + indexes)
7. Start IngestWorker (background thread)
8. Build HTTP router (axum)
9. Bind TCP listener on configured address
10. Start accepting connections

If the server hangs at a specific step, check:

Step 5 (WAL): Corruption or disk full
Step 6 (HybridStore): Database corruption
Step 9 (Bind): Port already in use

Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `STEMEDB_BIND_ADDR` | `127.0.0.1:18180` | HTTP API listen address |
| `STEMEDB_WAL_DIR` | `data/wal` | Write-ahead log directory |
| `STEMEDB_DB_DIR` | `data/db` | Database directory |
| `STEMEDB_TLS_ENABLED` | `false` | Enable TLS termination |
| `STEMEDB_TLS_CERT` | (none) | Path to TLS certificate |
| `STEMEDB_TLS_KEY` | (none) | Path to TLS private key |
| `STEMEDB_METER_ENABLED` | `true` | Enable Prometheus metrics |
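Several of the failures in this runbook come down to a bad value in one of these variables, so a pre-flight check can catch them before the server is started. This is a minimal sketch: `validate_env` is a hypothetical helper, and it assumes that both TLS paths are required once `STEMEDB_TLS_ENABLED` is `true`, which the table implies but does not state.

```shell
# Check the StemeDB environment before starting the server.
# Prints one line per problem and returns non-zero if any were found.
validate_env() {
    problems=0
    addr=${STEMEDB_BIND_ADDR:-127.0.0.1:18180}
    port=${addr##*:}   # everything after the last colon
    case "$port" in
        ''|*[!0-9]*)
            echo "STEMEDB_BIND_ADDR has no numeric port: $addr"; problems=1 ;;
        *)
            [ "$port" -ge 1 ] && [ "$port" -le 65535 ] ||
                { echo "port out of range: $port"; problems=1; } ;;
    esac
    if [ "${STEMEDB_TLS_ENABLED:-false}" = "true" ]; then
        # Assumption: both paths are required once TLS is enabled.
        for var in STEMEDB_TLS_CERT STEMEDB_TLS_KEY; do
            eval "path=\${$var:-}"
            [ -n "$path" ] && [ -r "$path" ] ||
                { echo "$var is unset or unreadable: $path"; problems=1; }
        done
    fi
    return "$problems"
}

# Example:
# validate_env && sudo systemctl start stemedb-api
```

Run it in the same environment the server will see; for a systemd-managed service that means checking the unit's `Environment=` values rather than the interactive shell.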

Related Runbooks

Disk Full - Storage management
Restore from Backup - WAL corruption recovery
High Query Latency - Performance issues after startup

Last Updated

2026-02-11
