Runbook: Server Won't Start
Symptom
- stemedb-api process exits immediately after startup
- Port binding fails with "Address already in use"
- TLS certificate errors in logs
- "No space left on device" errors
- WAL magic byte validation failures
- Permission denied errors on data directories
Metrics Alerts:
- N/A (server never starts, metrics unavailable)
Quick Diagnosis
Server won't start
│
├─► Check: lsof -i :18180
│ └─► Port in use? → §1 Port Conflict
│
├─► Check: journalctl -u stemedb-api | grep -i tls
│ └─► TLS errors? → §2 TLS Error
│
├─► Check: df -h
│ └─► Disk full? → [Disk Full Runbook](./disk-full.md)
│
├─► Check: journalctl -u stemedb-api | grep -i magic
│ └─► WAL corruption? → §3 WAL Corruption
│
└─► Check: ls -la data/wal/
    └─► Permission denied? → §5 Permissions
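The checks in the tree above can be partly automated by matching recent log output against the symptom strings listed in this runbook. The following is a minimal sketch, not part of StemeDB: the function name `classify_startup_error` and its output labels are hypothetical, and the patterns are only the ones quoted in the Symptom section.

```shell
# Hypothetical helper (not shipped with StemeDB): read startup log text on
# stdin and print a label for the runbook section that most likely applies.
# Arms are tested in order, so the first arm whose pattern appears anywhere
# in the log wins.
classify_startup_error() {
  log=$(cat)
  case "$log" in
    *"Address already in use"*)          echo "port-conflict" ;;
    *"No space left on device"*)         echo "disk-full" ;;
    *"magic byte"*|*"Magic byte"*)       echo "wal-corruption" ;;
    *"Permission denied"*)               echo "permissions" ;;
    *TLS*|*tls*|*certificate*)           echo "tls" ;;
    *)                                   echo "unknown" ;;
  esac
}
```

A possible invocation is `journalctl -u stemedb-api -n 50 | classify_startup_error`. The result is a hint for where to start, not a replacement for reading the logs.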
Common Causes
- Port already in use (likelihood: 40%)
  - Previous instance didn't shut down cleanly
  - Another service using port 18180
  - Development server still running
- TLS certificate issues (likelihood: 25%)
  - Certificate expired
  - Wrong file paths in config
  - Certificate/key mismatch
- WAL corruption (likelihood: 15%)
  - Unclean shutdown (power loss, OOM kill)
  - Disk corruption
  - Version mismatch after upgrade
- Disk full (likelihood: 10%)
  - WAL directory out of space
  - DB directory out of space
  - No inodes available
- Permission issues (likelihood: 10%)
  - Wrong ownership on data directories
  - SELinux/AppArmor blocking access
  - Container user mismatch
Resolution Steps
§1. Port Conflict
Diagnostic:
# Check if port 18180 is in use
lsof -i :18180
# Expected output if port in use:
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# stemedb- 1234 root 10u IPv4 12345 0t0 TCP *:18180 (LISTEN)
Resolution A: Kill stale process
# Find process using port
lsof -ti :18180
# Kill gracefully (SIGTERM)
kill $(lsof -ti :18180)
# Wait 5 seconds
sleep 5
# Verify port is free
lsof -i :18180
# (Should return empty)
# Start server
systemctl start stemedb-api
Resolution B: Change port
# Set custom port via environment variable (only affects a server launched from this shell)
export STEMEDB_BIND_ADDR="127.0.0.1:18280"
# For the systemd service, add an override instead
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"
sudo systemctl daemon-reload
sudo systemctl start stemedb-api
If failed: Port still in use after kill → Check for multiple instances or conflicting services. Proceed to reboot if critical.
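If the port was changed as in Resolution B, the diagnostic commands in this runbook that hardcode 18180 need the new value. A small sketch of deriving the port from the bind address follows; the helper name `port_of_addr` is hypothetical.

```shell
# Hypothetical helper: extract the port from a host:port bind address.
port_of_addr() {
  # ${1##*:} removes everything up to and including the last colon,
  # which also handles bracketed IPv6 addresses such as [::1]:18180.
  printf '%s\n' "${1##*:}"
}

# Example (commented out so that sourcing this file has no side effects):
# port=$(port_of_addr "${STEMEDB_BIND_ADDR:-127.0.0.1:18180}")
# lsof -i ":$port"
```

The fallback `127.0.0.1:18180` in the example is the documented default for `STEMEDB_BIND_ADDR`.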
§2. TLS Certificate Error
Diagnostic:
# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls
# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"
# Verify certificate files exist
ls -lh /etc/stemedb/tls/
Resolution A: Certificate expired
# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate
# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com
# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem
# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem
# Restart server
sudo systemctl start stemedb-api
Resolution B: Wrong file paths
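# Check environment variables in the current shell
env | grep STEMEDB_TLS
# Check what the systemd unit actually receives
systemctl show stemedb-api -p Environment
# Set correct paths (only affects a server launched from this shell)
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"
# Or update systemd service
sudo systemctl edit stemedb-api
# Add under [Service]:
# Environment="STEMEDB_TLS_CERT=/path/to/cert.pem"
# Environment="STEMEDB_TLS_KEY=/path/to/key.pem"
sudo systemctl daemon-reload
sudo systemctl start stemedb-api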
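Before restarting, it can help to confirm that the configured paths point at real, readable files. The sketch below assumes nothing about StemeDB beyond the two environment variables named above; `check_tls_paths` is a hypothetical helper.

```shell
# Hypothetical helper: report which of the given TLS files is unset,
# missing, or unreadable. Prints one line per problem and returns
# non-zero if any file fails. Readability is checked for the user
# running the script, so run it as the service user for a meaningful result.
check_tls_paths() {
  status=0
  for f in "$@"; do
    if [ -z "$f" ]; then
      echo "unset: a TLS path is empty"
      status=1
    elif [ ! -f "$f" ]; then
      echo "missing: $f"
      status=1
    elif [ ! -r "$f" ]; then
      echo "unreadable: $f"
      status=1
    fi
  done
  return "$status"
}
```

A possible invocation is `check_tls_paths "$STEMEDB_TLS_CERT" "$STEMEDB_TLS_KEY"`. Running it as root hides permission problems, because root can read any file.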
Resolution C: Certificate/key mismatch
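# Verify certificate and key match (modulus comparison works for RSA keys only)
openssl x509 -noout -modulus -in /etc/stemedb/tls/cert.pem | openssl md5
openssl rsa -noout -modulus -in /etc/stemedb/tls/key.pem | openssl md5
# Hashes should match. If not, regenerate the certificate or find the matching pair.
# For non-RSA keys (e.g. ECDSA), compare public keys instead:
# openssl x509 -noout -pubkey -in /etc/stemedb/tls/cert.pem | openssl md5
# openssl pkey -pubout -in /etc/stemedb/tls/key.pem | openssl md5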
If failed: TLS still failing → Temporarily disable TLS for debugging (NOT for production):
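# Disable TLS (debugging only). systemd units do not inherit exported shell variables,
# so set Environment="STEMEDB_TLS_ENABLED=false" under [Service] in an override:
sudo systemctl edit stemedb-api
sudo systemctl daemon-reload
sudo systemctl start stemedb-api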
§3. WAL Corruption
Diagnostic:
# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal
# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"
# Check WAL directory
ls -lh data/wal/
Resolution: Restore from backup
⚠️ WARNING: This destroys current WAL data. Only proceed if backup is available and data loss is acceptable.
# Stop server (if running)
sudo systemctl stop stemedb-api
# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)
# List available backups
ls -lh backups/
# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS
# Verify restoration
cat data/metadata.json
# Start server
sudo systemctl start stemedb-api
# Verify health
curl http://localhost:18180/v1/health
Expected output after restore:
{
"status": "healthy",
"version": "0.1.0",
"uptime_seconds": 5,
"assertion_count": 10234
}
If failed: Restore failed → Check backup integrity. See Restore from Backup Runbook.
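Picking the most recent backup by eye is error-prone under pressure. The sketch below selects it by name, assuming backups follow the `stemedb-backup-YYYYMMDD-HHMMSS` pattern shown in the restore command, so that lexicographic order matches chronological order. The helper `latest_backup` is hypothetical.

```shell
# Hypothetical helper: print the newest backup under a directory
# (default: backups). Relies on the stemedb-backup-YYYYMMDD-HHMMSS naming
# pattern, which sorts chronologically when sorted as text.
latest_backup() {
  dir="${1:-backups}"
  latest=""
  for b in "$dir"/stemedb-backup-*; do
    [ -e "$b" ] || continue   # the glob matched nothing
    latest="$b"               # glob results are sorted; keep the last one
  done
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}
```

One way to use it: `sudo ./scripts/restore-stemedb.sh "$(latest_backup backups)"`. The function returns non-zero when no backup is found, so check the result before restoring.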
§4. Disk Full
See: Disk Full Runbook for full procedure.
Quick emergency fix:
# Check disk usage
df -h
# If >98%, emergency cleanup
sudo find data/wal -name "*.log" -mtime +7 -delete
# Start server
sudo systemctl start stemedb-api
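To decide whether emergency cleanup is warranted, the usage check can be scripted. This is a rough sketch: the helper names are hypothetical, and the default thresholds are the 80% monitoring and 90% alerting levels from the Prevention section of this runbook.

```shell
# Hypothetical helper: classify a disk-usage percentage (integer, no % sign).
# Optional second and third arguments override the warning and critical thresholds.
disk_status() {
  used="$1"
  warn="${2:-80}"
  crit="${3:-90}"
  if [ "$used" -ge "$crit" ]; then
    echo "critical"
  elif [ "$used" -ge "$warn" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# Hypothetical helper: usage percentage of the filesystem holding a path,
# read from the Capacity column of POSIX-format `df -P` output.
disk_use_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `disk_status "$(disk_use_percent data/wal)"` prints one of `ok`, `warning`, or `critical` for the WAL filesystem. Inode exhaustion is not covered; check it separately with `df -i`.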
§5. Permission Issues
Diagnostic:
# Check directory permissions
ls -la data/
# Expected ownership:
# drwxr-xr-x stemedb stemedb wal/
# drwxr-xr-x stemedb stemedb db/
# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent
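The ownership comparison can also be scripted so that it is easy to repeat for each data directory. This is a minimal sketch with hypothetical helper names. The expected owner `stemedb:stemedb` comes from the listing above.

```shell
# Hypothetical helper: print "user:group" for a path, parsed from `ls -ld`.
owner_of() {
  ls -ld "$1" | awk '{ print $3 ":" $4 }'
}

# Hypothetical helper: return 0 if the path has the expected user:group,
# otherwise print what was found and return 1.
check_owner() {
  actual=$(owner_of "$1")
  if [ "$actual" = "$2" ]; then
    return 0
  fi
  echo "$1 is owned by $actual, expected $2"
  return 1
}
```

Example: `check_owner data/wal stemedb:stemedb && check_owner data/db stemedb:stemedb`.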
Resolution A: Fix ownership
# Fix ownership recursively
sudo chown -R stemedb:stemedb data/
# Fix permissions
sudo chmod -R 755 data/
sudo chmod 644 data/wal/*.log
sudo chmod 644 data/db/*.kv
# Start server
sudo systemctl start stemedb-api
Resolution B: SELinux context
# Restore SELinux context
sudo restorecon -Rv data/
# Or set permissive for debugging (NOT for production)
sudo setenforce 0
# Start server
sudo systemctl start stemedb-api
# If works, add SELinux policy instead of disabling
Resolution C: Container user mismatch
# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000" # Match host UID
#     volumes:
#       - ./data:/data
# Or use chown in entrypoint:
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]
If failed: Permissions correct but still denied → Check AppArmor profiles or mandatory access controls.
Validation
After applying resolution, validate server is healthy:
- Server starts successfully
systemctl status stemedb-api
# Should show "active (running)"
- Health endpoint returns 200
curl http://localhost:18180/v1/health
# Should return: {"status":"healthy", ...}
- Port is bound
lsof -i :18180
# Should show stemedb-api listening
- Logs show successful startup
journalctl -u stemedb-api -n 20
# Should show 10 startup steps completed
- Test query succeeds
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path":"test/health","lens":"recency"}'
# Should return 200 (even if empty results)
- Metrics endpoint works
curl http://localhost:18180/metrics | head -20
# Should return Prometheus metrics
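An HTTP 200 from the health endpoint does not by itself confirm the reported status, so the body check can be scripted as well. The sketch below assumes the response contains a `"status": "healthy"` field as in the expected output shown in §3. The helper `health_is_healthy` is hypothetical and uses `grep` rather than a JSON parser to avoid extra dependencies.

```shell
# Hypothetical helper: succeed only if the health response body on stdin
# contains "status": "healthy". Whitespace around the colon is tolerated.
health_is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}
```

Example: `curl -fsS http://localhost:18180/v1/health | health_is_healthy && echo "health OK"`. The `-f` flag makes curl exit non-zero on HTTP errors, so a failing status code and an unhealthy body are both caught.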
Prevention
Monitoring
Set up alerts for:
# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"
      - alert: StemeDBRestartLoop
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"
Configuration Changes
To prevent recurrence:
- Port conflicts: Reserve port 18180 in your infrastructure registry
- TLS expiry: Automate certificate renewal with certbot + systemd timer
- WAL corruption: Enable daily backups via cron
- Disk full: Monitor disk at 80% threshold, alert at 90%
- Permissions: Document correct UID/GID in deployment guide
Example: Automated TLS renewal
# /etc/systemd/system/certbot-renewal.timer
# (activates certbot-renewal.service, which must exist and run the renewal command)
[Unit]
Description=Certbot renewal timer
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
Startup Sequence Reference
Normal startup takes 2-5 seconds and includes 10 steps:
1. Initialize logging (tracing subscriber)
2. Start metrics registry
3. Load configuration (env vars)
4. Verify data directories exist
5. Open WAL journal (crash recovery if needed)
6. Initialize HybridStore (KV + indexes)
7. Start IngestWorker (background thread)
8. Build HTTP router (axum)
9. Bind TCP listener on configured address
10. Start accepting connections
If server hangs at specific step, check:
- Step 5 (WAL): Corruption or disk full
- Step 6 (HybridStore): Database corruption
- Step 9 (Bind): Port already in use
Environment Variables Reference
| Variable | Default | Description |
|---|---|---|
| STEMEDB_BIND_ADDR | 127.0.0.1:18180 | HTTP API listen address |
| STEMEDB_WAL_DIR | data/wal | Write-ahead log directory |
| STEMEDB_DB_DIR | data/db | Database directory |
| STEMEDB_TLS_ENABLED | false | Enable TLS termination |
| STEMEDB_TLS_CERT | (none) | Path to TLS certificate |
| STEMEDB_TLS_KEY | (none) | Path to TLS private key |
| STEMEDB_METER_ENABLED | true | Enable Prometheus metrics |
Related Runbooks
- Disk Full - Storage management
- Restore from Backup - WAL corruption recovery
- High Query Latency - Performance issues after startup
Last Updated
2026-02-11