stemedb/docs/operations/runbooks/server-wont-start.md

Runbook: Server Won't Start

Symptom

  • stemedb-api process exits immediately after startup
  • Port binding fails with "Address already in use"
  • TLS certificate errors in logs
  • "No space left on device" errors
  • WAL magic byte validation failures
  • Permission denied errors on data directories

Metrics Alerts:

  • N/A (server never starts, metrics unavailable)

Quick Diagnosis

Server won't start
    │
    ├─► Check: lsof -i :18180
    │   └─► Port in use? → §1 Port Conflict
    │
    ├─► Check: journalctl -u stemedb-api | grep -i tls
    │   └─► TLS errors? → §2 TLS Error
    │
    ├─► Check: df -h
    │   └─► Disk full? → [Disk Full Runbook](./disk-full.md)
    │
    ├─► Check: journalctl -u stemedb-api | grep -i magic
    │   └─► WAL corruption? → §3 WAL Corruption
    │
    └─► Check: ls -la data/wal/
        └─► Permission denied? → §5 Permission Issues
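
The decision tree above can be approximated in a script. This is a minimal sketch, not part of StemeDB: `classify_startup_error` is a hypothetical helper, and its patterns are taken from the Symptom list, so the wording of real log lines may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map one log line to the runbook section that covers it.
# Patterns come from the Symptom list; actual log wording is an assumption.
classify_startup_error() {
  local line
  # Lowercase the line so matching is case-insensitive.
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"address already in use"*)     echo "§1 Port Conflict" ;;
    *"no space left on device"*)    echo "§4 Disk Full" ;;
    *"permission denied"*)          echo "§5 Permission Issues" ;;
    *magic*|*"checksum mismatch"*)  echo "§3 WAL Corruption" ;;
    *tls*|*certificate*)            echo "§2 TLS Certificate Error" ;;
    *)                              echo "unclassified" ;;
  esac
}

# Usage sketch (not run here): summarize the most recent log lines by section.
# journalctl -u stemedb-api -n 50 --no-pager | while IFS= read -r l; do
#   classify_startup_error "$l"
# done | sort | uniq -c
```

The first matching pattern wins, so a "Permission denied" error on a TLS key file is routed to §5 rather than §2.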

Common Causes

  1. Port already in use — Likelihood: 40%

    • Previous instance didn't shut down cleanly
    • Another service using port 18180
    • Development server still running
  2. TLS certificate issues — Likelihood: 25%

    • Certificate expired
    • Wrong file paths in config
    • Certificate/key mismatch
  3. WAL corruption — Likelihood: 15%

    • Unclean shutdown (power loss, OOM kill)
    • Disk corruption
    • Version mismatch after upgrade
  4. Disk full — Likelihood: 10%

    • WAL directory out of space
    • DB directory out of space
    • No inodes available
  5. Permission issues — Likelihood: 10%

    • Wrong ownership on data directories
    • SELinux/AppArmor blocking access
    • Container user mismatch

Resolution Steps

§1. Port Conflict

Diagnostic:

# Check if port 18180 is in use (sudo is needed to see processes owned by other users)
sudo lsof -i :18180

# Expected output if port in use:
# COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
# stemedb- 1234 root   10u  IPv4  12345      0t0  TCP *:18180 (LISTEN)

Resolution A: Kill stale process

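# Find the PID listening on the port (sudo is needed to see other users' processes)
sudo lsof -ti tcp:18180 -sTCP:LISTEN

# Kill gracefully (SIGTERM); xargs -r skips kill when nothing is listening
sudo lsof -ti tcp:18180 -sTCP:LISTEN | xargs -r sudo kill

# Wait 5 seconds
sleep 5

# Verify port is free
sudo lsof -i :18180
# (Should return empty)

# Start server
sudo systemctl start stemedb-api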
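
A fixed five-second sleep can be too short for a slow shutdown or longer than needed. The sketch below polls until the port is released or a timeout expires. Both functions are hypothetical helpers, and `port_in_use` relies on bash's `/dev/tcp` redirection, so it only detects listeners reachable on 127.0.0.1.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; neither function ships with StemeDB.

# Succeeds when something accepts TCP connections on 127.0.0.1:$1.
# Uses bash's /dev/tcp, so it only sees listeners reachable via loopback.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Poll once per second until the port is free or the timeout (seconds) expires.
wait_for_port_free() {
  local port=$1 timeout=${2:-10} waited=0
  while port_in_use "$port"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "port $port still in use after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage sketch (not run here):
# wait_for_port_free 18180 10 && sudo systemctl start stemedb-api
```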

Resolution B: Change port

# Set custom port via environment variable
# (takes effect only when stemedb-api is launched from this shell, not under systemd)
export STEMEDB_BIND_ADDR="127.0.0.1:18280"

# For the systemd service, set it in a drop-in override instead
sudo systemctl edit stemedb-api

# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"

sudo systemctl daemon-reload
sudo systemctl start stemedb-api

If failed: Port still in use after kill → Check for multiple instances or a conflicting service holding the port. Reboot the host only as a last resort, if the outage is critical.


§2. TLS Certificate Error

Diagnostic:

# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls

# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"

# Verify certificate files exist
ls -lh /etc/stemedb/tls/

Resolution A: Certificate expired

# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate

# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com

# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem

# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

# Restart server
sudo systemctl start stemedb-api
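
The expiry check can also be scripted so that it returns an exit status rather than a date to read. This is a minimal sketch: `cert_valid_for_days` is a hypothetical helper, and the 14-day default is an arbitrary assumption. It relies on `openssl x509 -checkend`, which exits 0 when the certificate will still be valid after the given number of seconds.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the certificate is still valid N days from now.
# `openssl x509 -checkend SECONDS` exits 0 if the certificate will not have
# expired after SECONDS, and non-zero if it will have.
cert_valid_for_days() {
  local cert=$1 days=${2:-14}
  openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage sketch (not run here; the path follows the examples in this runbook):
# cert_valid_for_days /etc/stemedb/tls/cert.pem 14 || echo "renew soon"
```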

Resolution B: Wrong file paths

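# Check environment variables in the current shell
env | grep STEMEDB_TLS

# Check what the systemd service actually receives
systemctl show stemedb-api -p Environment

# Set correct paths (when launching stemedb-api directly from this shell)
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"

# Or, for the systemd service, set them in a drop-in override
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_TLS_CERT=/path/to/cert.pem"
# Environment="STEMEDB_TLS_KEY=/path/to/key.pem"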

sudo systemctl daemon-reload
sudo systemctl start stemedb-api

Resolution C: Certificate/key mismatch

# Verify certificate and key match by comparing their public keys
# (works for RSA and ECDSA keys; a modulus comparison only works for RSA)
sudo openssl x509 -in /etc/stemedb/tls/cert.pem -noout -pubkey | openssl sha256
sudo openssl pkey -in /etc/stemedb/tls/key.pem -pubout | openssl sha256

# Hashes should match. If not, regenerate the certificate or find the matching pair.

If failed: TLS still failing → Temporarily disable TLS for debugging (NOT for production):

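# Disable TLS (debugging only). A shell `export` does not reach the systemd unit;
# add Environment="STEMEDB_TLS_ENABLED=false" under [Service] in a drop-in:
sudo systemctl edit stemedb-api
sudo systemctl start stemedb-api
# Remove the override once TLS is fixed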

§3. WAL Corruption

Diagnostic:

# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"

# Check WAL directory
ls -lh data/wal/

Resolution: Restore from backup

⚠️ WARNING: This destroys current WAL data. Only proceed if backup is available and data loss is acceptable.

# Stop server (if running)
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# List available backups
ls -lh backups/

# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS

# Verify restoration
cat data/metadata.json

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health

Expected output after restore:

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 5,
  "assertion_count": 10234
}

If failed: Restore failed → Check backup integrity. See Restore from Backup Runbook.


§4. Disk Full

See: Disk Full Runbook for full procedure.

Quick emergency fix:

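# Check disk usage and inode usage
df -h
df -i

# If >98%, emergency cleanup. Preview the files first: deleting WAL segments
# that have not yet been applied or archived may lose data.
sudo find data/wal -type f -name "*.log" -mtime +7 -print
sudo find data/wal -type f -name "*.log" -mtime +7 -delete

# Start server
sudo systemctl start stemedb-api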
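
The 98% check can be made mechanical, and the same parsing covers inode exhaustion, which `df -h` does not show. This is a minimal sketch: `df_use_pct` and `over_threshold` are hypothetical helpers that assume the POSIX `df -P` column layout, where the fifth column is the usage percentage.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` (or `df -Pi`) output on stdin and print the
# usage percentage of the first filesystem listed, without the % sign.
df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when the percentage is at or above the limit.
over_threshold() {
  local pct=$1 limit=$2
  [ "$pct" -ge "$limit" ]
}

# Usage sketch (not run here; the 98 threshold mirrors the step above):
# blocks=$(df -P data/wal | df_use_pct)
# inodes=$(df -Pi data/wal | df_use_pct)
# over_threshold "$blocks" 98 && echo "disk full"
# over_threshold "$inodes" 98 && echo "inodes exhausted"
```

Some filesystems report `-` for inode usage; in that case `over_threshold` fails its integer comparison and returns non-zero, so treat the inode result as unknown rather than healthy.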

§5. Permission Issues

Diagnostic:

# Check directory permissions
ls -la data/

# Expected ownership:
# drwxr-xr-x stemedb stemedb wal/
# drwxr-xr-x stemedb stemedb db/

# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent
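
Directory listings show ownership, but the quickest confirmation is to test access as the service user. This is a minimal sketch: `check_data_dir` is a hypothetical helper, and the `stemedb` user name is taken from the expected ownership shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a directory exists and is writable and
# searchable by the user running this function.
check_data_dir() {
  local dir=$1
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  elif [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
    echo "not writable: $dir"
    return 1
  fi
  echo "ok: $dir"
}

# Usage sketch (not run here): the same test, run as the service user.
# sudo -u stemedb test -w data/wal && echo "wal ok"
# sudo -u stemedb test -w data/db  && echo "db ok"
```

When run as root, the writability test always passes, so run it as the service user to get a meaningful result.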

Resolution A: Fix ownership

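# Fix ownership recursively
sudo chown -R stemedb:stemedb data/

# Fix permissions: directories 755, files 644
# (a blanket chmod -R 755 would also mark every data file executable)
sudo find data/ -type d -exec chmod 755 {} +
sudo find data/ -type f -exec chmod 644 {} +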

# Start server
sudo systemctl start stemedb-api

Resolution B: SELinux context

# Restore SELinux context
sudo restorecon -Rv data/

# Or set permissive for debugging (NOT for production)
sudo setenforce 0

# Start server
sudo systemctl start stemedb-api

# If that works, re-enable enforcing mode and add an SELinux policy instead of disabling it
sudo setenforce 1

Resolution C: Container user mismatch

# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000"  # Match host UID
#     volumes:
#       - ./data:/data

# Or use chown in entrypoint:
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]

If failed: Permissions correct but still denied → Check AppArmor profiles or mandatory access controls.


Validation

After applying a resolution, validate that the server is healthy:

  • Server starts successfully

    systemctl status stemedb-api
    # Should show "active (running)"
    
  • Health endpoint returns 200

    curl http://localhost:18180/v1/health
    # Should return: {"status":"healthy", ...}
    
  • Port is bound

    lsof -i :18180
    # Should show stemedb-api listening
    
  • Logs show successful startup

    journalctl -u stemedb-api -n 20
    # Should show 10 startup steps completed
    
  • Test query succeeds

    curl -X POST http://localhost:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path":"test/health","lens":"recency"}'
    # Should return 200 (even if empty results)
    
  • Metrics endpoint works

    curl http://localhost:18180/metrics | head -20
    # Should return Prometheus metrics
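
The server may need a few seconds after `systemctl start` before the health endpoint responds, so a single `curl` can fail spuriously. The sketch below retries the health check. All three functions are hypothetical helpers, the URL is the one used in this runbook, and the `HEALTH_URL` override is an assumption added for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the validation step; not part of StemeDB.

# Fetch the health document. HEALTH_URL is an illustrative override.
fetch_health() {
  curl -fsS --max-time 2 "${HEALTH_URL:-http://localhost:18180/v1/health}"
}

# Succeed when the JSON on stdin reports "status": "healthy".
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Retry up to $1 times, one second apart, until the server reports healthy.
wait_for_healthy() {
  local tries=${1:-10} i=1
  while [ "$i" -le "$tries" ]; do
    if fetch_health 2>/dev/null | is_healthy; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not healthy after $tries attempts" >&2
  return 1
}

# Usage sketch (not run here):
# sudo systemctl start stemedb-api && wait_for_healthy 10
```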
    

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"

      - alert: StemeDBRestartLoop
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"

Configuration Changes

To prevent recurrence:

  1. Port conflicts: Reserve port 18180 in your infrastructure registry
  2. TLS expiry: Automate certificate renewal with certbot + systemd timer
  3. WAL corruption: Enable daily backups via cron
  4. Disk full: Monitor disk usage; warn at 80% and alert at 90%
  5. Permissions: Document correct UID/GID in deployment guide

Example: Automated TLS renewal

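# /etc/systemd/system/certbot-renewal.timer
[Unit]
Description=Certbot renewal timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/certbot-renewal.service
# (assumed companion unit: a timer triggers the service of the same name;
# the certbot path is an assumption, adjust it to your install)
[Unit]
Description=Certbot renewal

[Service]
Type=oneshot
ExecStart=/usr/bin/certbot renew --quiet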

Startup Sequence Reference

Normal startup takes 2-5 seconds and includes 10 steps:

  1. Initialize logging (tracing subscriber)
  2. Start metrics registry
  3. Load configuration (env vars)
  4. Verify data directories exist
  5. Open WAL journal (crash recovery if needed)
  6. Initialize HybridStore (KV + indexes)
  7. Start IngestWorker (background thread)
  8. Build HTTP router (axum)
  9. Bind TCP listener on configured address
  10. Start accepting connections

If the server hangs at a specific step, check:

  • Step 5 (WAL): Corruption or disk full
  • Step 6 (HybridStore): Database corruption
  • Step 9 (Bind): Port already in use

Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| STEMEDB_BIND_ADDR | 127.0.0.1:18180 | HTTP API listen address |
| STEMEDB_WAL_DIR | data/wal | Write-ahead log directory |
| STEMEDB_DB_DIR | data/db | Database directory |
| STEMEDB_TLS_ENABLED | false | Enable TLS termination |
| STEMEDB_TLS_CERT | (none) | Path to TLS certificate |
| STEMEDB_TLS_KEY | (none) | Path to TLS private key |
| STEMEDB_METER_ENABLED | true | Enable Prometheus metrics |


Last Updated

2026-02-11