# Runbook: Server Won't Start

## Symptom

- `stemedb-api` process exits immediately after startup
- Port binding fails with "Address already in use"
- TLS certificate errors in logs
- "No space left on device" errors
- WAL magic byte validation failures
- Permission denied errors on data directories

**Metrics Alerts:**

- N/A (server never starts, metrics unavailable)

---

## Quick Diagnosis

```
Server won't start
│
├─► Check: lsof -i :18180
│   └─► Port in use? → §1 Port Conflict
│
├─► Check: journalctl -u stemedb-api | grep -i tls
│   └─► TLS errors? → §2 TLS Certificate Error
│
├─► Check: df -h
│   └─► Disk full? → [Disk Full Runbook](./disk-full.md)
│
├─► Check: journalctl -u stemedb-api | grep -i magic
│   └─► WAL corruption? → §3 WAL Corruption
│
└─► Check: ls -la data/wal/
    └─► Permission denied? → §5 Permission Issues
```

---

## Common Causes

1. **Port already in use** (likelihood: **40%**)
   - Previous instance didn't shut down cleanly
   - Another service using port 18180
   - Development server still running

2. **TLS certificate issues** (likelihood: **25%**)
   - Certificate expired
   - Wrong file paths in config
   - Certificate/key mismatch

3. **WAL corruption** (likelihood: **15%**)
   - Unclean shutdown (power loss, OOM kill)
   - Disk corruption
   - Version mismatch after upgrade

4. **Disk full** (likelihood: **10%**)
   - WAL directory out of space
   - DB directory out of space
   - No inodes available

5. **Permission issues** (likelihood: **10%**)
   - Wrong ownership on data directories
   - SELinux/AppArmor blocking access
   - Container user mismatch

---

## Resolution Steps

### §1. Port Conflict

**Diagnostic:**

```bash
# Check if port 18180 is in use
lsof -i :18180

# Expected output if port in use:
# COMMAND   PID  USER  FD   TYPE  DEVICE  SIZE/OFF  NODE  NAME
# stemedb-  1234 root  10u  IPv4  12345   0t0       TCP   *:18180 (LISTEN)
```

**Resolution A: Kill stale process**

```bash
# Find process using port
lsof -ti :18180

# Kill gracefully (SIGTERM)
kill $(lsof -ti :18180)

# Wait 5 seconds
sleep 5

# Verify port is free
lsof -i :18180
# (Should return empty)

# Start server
sudo systemctl start stemedb-api
```

**Resolution B: Change port**

```bash
# Set custom port via environment variable
export STEMEDB_BIND_ADDR="127.0.0.1:18280"

# Or in systemd service file
sudo systemctl edit stemedb-api

# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"

sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```

**If failed:** Port still in use after kill → Check for multiple instances or conflicting services. Reboot only as a last resort if the outage is critical.

---

### §2. TLS Certificate Error

**Diagnostic:**

```bash
# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls

# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"

# Verify certificate files exist
ls -lh /etc/stemedb/tls/
```

**Resolution A: Certificate expired**

```bash
# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate

# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com

# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem

# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

# Start server
sudo systemctl start stemedb-api
```

**Resolution B: Wrong file paths**

```bash
# Check environment variables in the current shell
env | grep STEMEDB_TLS

# Check the environment the systemd unit actually uses
systemctl show stemedb-api -p Environment

# Set correct paths
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"

# Or update systemd service
sudo systemctl edit stemedb-api
# Add correct paths

sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```

**Resolution C: Certificate/key mismatch**

```bash
# Verify certificate and key match (the modulus comparison applies to RSA keys only)
openssl x509 -noout -modulus -in /etc/stemedb/tls/cert.pem | openssl md5
openssl rsa -noout -modulus -in /etc/stemedb/tls/key.pem | openssl md5

# Hashes should match. If not, regenerate certificate or find matching pair.
```

**If failed:** TLS still failing → Temporarily disable TLS for debugging (NOT for production):

```bash
# Disable TLS (debugging only)
export STEMEDB_TLS_ENABLED=false
sudo systemctl start stemedb-api
```

---

### §3. WAL Corruption

**Diagnostic:**

```bash
# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"

# Check WAL directory
ls -lh data/wal/
```

**Resolution: Restore from backup**

⚠️ **WARNING:** This discards the current WAL data. Only proceed if a backup is available and data loss is acceptable.

```bash
# Stop server (if running)
sudo systemctl stop stemedb-api

# Move corrupted WAL aside for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# List available backups
ls -lh backups/

# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS

# Verify restoration
cat data/metadata.json

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

**Expected output after restore:**

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 5,
  "assertion_count": 10234
}
```

**If failed:** Restore failed → Check backup integrity. See [Restore from Backup Runbook](./restore-from-backup.md).

---

### §4. Disk Full

**See:** [Disk Full Runbook](./disk-full.md) for full procedure.

**Quick emergency fix:**

```bash
# Check disk usage
df -h

# If >98%, emergency cleanup: delete WAL .log files older than 7 days
sudo find data/wal -name "*.log" -mtime +7 -delete

# Start server
sudo systemctl start stemedb-api
```

---

### §5. Permission Issues

**Diagnostic:**

```bash
# Check directory permissions
ls -la data/

# Expected ownership:
# drwxr-xr-x  stemedb stemedb  wal/
# drwxr-xr-x  stemedb stemedb  db/

# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent
```

**Resolution A: Fix ownership**

```bash
# Fix ownership recursively
sudo chown -R stemedb:stemedb data/

# Fix permissions: 755 on directories, 644 on files
# (a blanket chmod -R 755 would also mark every data file executable)
sudo find data/ -type d -exec chmod 755 {} +
sudo find data/ -type f -exec chmod 644 {} +

# Start server
sudo systemctl start stemedb-api
```

**Resolution B: SELinux context**

```bash
# Restore SELinux context
sudo restorecon -Rv data/

# Or set permissive for debugging (NOT for production)
sudo setenforce 0

# Start server
sudo systemctl start stemedb-api

# If this works, add an SELinux policy instead of leaving enforcement disabled
```

**Resolution C: Container user mismatch**

```bash
# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000"  # Match host UID
#     volumes:
#       - ./data:/data

# Or use chown in entrypoint:
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]
```

**If failed:** Permissions correct but still denied → Check AppArmor profiles or other mandatory access controls.

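
The ownership diagnostic in §5 can be scripted so it runs before each start attempt. The following is a minimal sketch, not part of StemeDB: the helper names `owner_of` and `check_owner` are hypothetical, and the example invocation assumes the `stemedb` service user and the default `data/wal` and `data/db` paths used in this runbook.

```shell
#!/bin/sh
# Hypothetical preflight helpers (illustrative only, not shipped with stemedb-api).

# Print the owning user of a path by parsing `ls -ld`,
# which avoids the GNU-only `stat -c` flag.
owner_of() {
  ls -ld -- "$1" | awk '{print $3}'
}

# Usage: check_owner EXPECTED_USER DIR...
# Prints one line per problem and returns 1 if any directory
# is missing or owned by a different user; returns 0 otherwise.
check_owner() {
  expected="$1"
  shift
  rc=0
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "MISSING $dir"
      rc=1
      continue
    fi
    owner="$(owner_of "$dir")"
    if [ "$owner" != "$expected" ]; then
      echo "WRONG_OWNER $dir (owner=$owner, expected=$expected)"
      rc=1
    fi
  done
  return "$rc"
}

# Example invocation with the runbook's defaults (assumed service user: stemedb):
# check_owner stemedb data/wal data/db || echo "Fix ownership before starting (see §5)"
```

A non-zero return points at Resolution A; a zero return with the server still failing on permissions suggests SELinux, AppArmor, or a container UID mismatch instead.
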
---

## Validation

After applying a resolution, validate that the server is healthy:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Should show "active (running)"
  ```

- [ ] **Health endpoint returns 200**
  ```bash
  curl http://localhost:18180/v1/health
  # Should return: {"status":"healthy", ...}
  ```

- [ ] **Port is bound**
  ```bash
  lsof -i :18180
  # Should show stemedb-api listening
  ```

- [ ] **Logs show successful startup**
  ```bash
  journalctl -u stemedb-api -n 20
  # Should show 10 startup steps completed
  ```

- [ ] **Test query succeeds**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/health","lens":"recency"}'
  # Should return 200 (even if empty results)
  ```

- [ ] **Metrics endpoint works**
  ```bash
  curl http://localhost:18180/metrics | head -20
  # Should return Prometheus metrics
  ```

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"

      - alert: StemeDBRestartLoop
        # increase() counts restarts over the window; rate() would give a per-second value
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"
```

### Configuration Changes

**To prevent recurrence:**

1. **Port conflicts:** Reserve port 18180 in your infrastructure registry
2. **TLS expiry:** Automate certificate renewal with certbot + systemd timer
3. **WAL corruption:** Enable daily backups via cron
4. **Disk full:** Warn at 80% disk usage, alert at 90%
5. **Permissions:** Document correct UID/GID in deployment guide

**Example: Automated TLS renewal**

```ini
# /etc/systemd/system/certbot-renewal.timer
# A timer activates the service unit with the same base name
# (certbot-renewal.service) unless Unit= is set.
[Unit]
Description=Certbot renewal timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

---

## Startup Sequence Reference

**Normal startup takes 2-5 seconds and includes 10 steps:**

1. Initialize logging (tracing subscriber)
2. Start metrics registry
3. Load configuration (env vars)
4. Verify data directories exist
5. Open WAL journal (crash recovery if needed)
6. Initialize HybridStore (KV + indexes)
7. Start IngestWorker (background thread)
8. Build HTTP router (axum)
9. Bind TCP listener on configured address
10. Start accepting connections

**If the server hangs at a specific step, check:**

- Step 5 (WAL): Corruption or disk full
- Step 6 (HybridStore): Database corruption
- Step 9 (Bind): Port already in use

---

## Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `STEMEDB_BIND_ADDR` | `127.0.0.1:18180` | HTTP API listen address |
| `STEMEDB_WAL_DIR` | `data/wal` | Write-ahead log directory |
| `STEMEDB_DB_DIR` | `data/db` | Database directory |
| `STEMEDB_TLS_ENABLED` | `false` | Enable TLS termination |
| `STEMEDB_TLS_CERT` | (none) | Path to TLS certificate |
| `STEMEDB_TLS_KEY` | (none) | Path to TLS private key |
| `STEMEDB_METER_ENABLED` | `true` | Enable Prometheus metrics |

---

## Related Runbooks

- [Disk Full](./disk-full.md) - Storage management
- [Restore from Backup](./restore-from-backup.md) - WAL corruption recovery
- [High Query Latency](./high-query-latency.md) - Performance issues after startup

---

## Last Updated

2026-02-11