# Runbook: Server Won't Start
## Symptom
- `stemedb-api` process exits immediately after startup
- Port binding fails with "Address already in use"
- TLS certificate errors in logs
- "No space left on device" errors
- WAL magic byte validation failures
- Permission denied errors on data directories
**Metrics Alerts:**
- N/A (server never starts, metrics unavailable)
---
## Quick Diagnosis
```
Server won't start
├─► Check: lsof -i :18180
│   └─► Port in use? → §1 Port Conflict
├─► Check: journalctl -u stemedb-api | grep -i tls
│   └─► TLS errors? → §2 TLS Certificate Error
├─► Check: df -h
│   └─► Disk full? → §4 Disk Full (full procedure: ./disk-full.md)
├─► Check: journalctl -u stemedb-api | grep -i magic
│   └─► WAL corruption? → §3 WAL Corruption
└─► Check: ls -la data/wal/
    └─► Permission denied? → §5 Permission Issues
```
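
The decision tree above can be approximated by a small shell helper that maps a single log line to the section that handles it. This is only a sketch: the function name `classify_startup_error` is hypothetical, and the patterns are taken from the error strings quoted in this runbook, so the real log messages may need different patterns.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map one log line to the runbook section that handles it.
# Patterns are assumptions based on the error strings quoted in this runbook.
classify_startup_error() {
  local line
  # Lowercase the input so matching is case-insensitive.
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"address already in use"*)   echo "§1 Port Conflict" ;;
    *"no space left on device"*)  echo "§4 Disk Full" ;;
    *"permission denied"*)        echo "§5 Permission Issues" ;;
    *tls*|*certificate*)          echo "§2 TLS Certificate Error" ;;
    *"magic byte"*|*"wal segment"*|*"checksum mismatch in wal"*)
                                  echo "§3 WAL Corruption" ;;
    *)                            echo "unknown" ;;
  esac
}

# Usage: classify the last 50 log lines and keep the ones that matched.
# journalctl -u stemedb-api -n 50 -o cat |
#   while IFS= read -r l; do classify_startup_error "$l"; done | grep -v unknown
```

Disk and permission errors are checked before the TLS and WAL patterns so that a message such as "Permission denied" on a TLS key file is routed to the permissions section rather than the TLS one.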
---
## Common Causes
1. **Port already in use** — Likelihood: **40%**
- Previous instance didn't shut down cleanly
- Another service using port 18180
- Development server still running
2. **TLS certificate issues** — Likelihood: **25%**
- Certificate expired
- Wrong file paths in config
- Certificate/key mismatch
3. **WAL corruption** — Likelihood: **15%**
- Unclean shutdown (power loss, OOM kill)
- Disk corruption
- Version mismatch after upgrade
4. **Disk full** — Likelihood: **10%**
- WAL directory out of space
- DB directory out of space
- No inodes available
5. **Permission issues** — Likelihood: **10%**
- Wrong ownership on data directories
- SELinux/AppArmor blocking access
- Container user mismatch
---
## Resolution Steps
### §1. Port Conflict
**Diagnostic:**
```bash
# Check if port 18180 is in use
lsof -i :18180
# Expected output if port in use:
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# stemedb- 1234 root 10u IPv4 12345 0t0 TCP *:18180 (LISTEN)
```
**Resolution A: Kill stale process**
```bash
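# Find the process listening on the port (sudo is needed to see other users' processes)
sudo lsof -ti TCP:18180 -sTCP:LISTEN
# Kill gracefully (SIGTERM); matching only LISTEN avoids killing clients connected to the port
sudo kill $(sudo lsof -ti TCP:18180 -sTCP:LISTEN)
# Wait 5 seconds
sleep 5
# Verify port is free
sudo lsof -i :18180
# (Should return empty)
# Start server
sudo systemctl start stemedb-api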
```
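
If the stale process ignores SIGTERM, the fixed five-second wait above is not enough. A minimal sketch of a stop-with-escalation helper follows; the function name `stop_pid_gracefully` and its default timeout are assumptions for illustration, not part of StemeDB.

```bash
#!/usr/bin/env bash
# Hypothetical helper: send SIGTERM, poll until the process exits,
# and escalate to SIGKILL once the timeout (in seconds) is reached.
stop_pid_gracefully() {
  local pid=$1 timeout=${2:-5} waited=0
  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  # kill -0 sends no signal; it only tests whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "PID $pid still alive after ${timeout}s; sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null || true
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage (run as root so other users' processes can be seen and signalled):
# for pid in $(lsof -ti TCP:18180 -sTCP:LISTEN); do stop_pid_gracefully "$pid" 5; done
```

SIGKILL gives the server no chance to flush its WAL, so escalation should stay a last resort.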
**Resolution B: Change port**
```bash
# Set custom port via environment variable
# (only affects a server launched from this shell, not the systemd service)
export STEMEDB_BIND_ADDR="127.0.0.1:18280"
# Or, for the systemd service, add a drop-in override
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"
sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```
**If failed:** Port still in use after kill → Check for multiple instances, a supervisor that respawns the process, or a conflicting service. Reboot the host only as a last resort if the outage is critical.
---
### §2. TLS Certificate Error
**Diagnostic:**
```bash
# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls
# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"
# Verify certificate files exist
ls -lh /etc/stemedb/tls/
```
**Resolution A: Certificate expired**
```bash
# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate
# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com
# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem
# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem
# Restart server
sudo systemctl start stemedb-api
```
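
Expiry can also be caught before it blocks startup. The sketch below wraps `openssl x509 -checkend`, which exits non-zero when a certificate expires within the given number of seconds. The function name `cert_expires_within` and the 14-day default are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper.
# Returns 0 if the certificate expires (or has already expired) within N days,
# 1 if it stays valid beyond that window, 2 if the file cannot be read.
cert_expires_within() {
  local cert=$1 days=${2:-14}
  if [ ! -r "$cert" ]; then
    echo "cannot read certificate: $cert" >&2
    return 2
  fi
  # -checkend takes seconds; exit status 0 means "will not expire in that time".
  if openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null; then
    return 1
  fi
  return 0
}

# Usage:
# if cert_expires_within /etc/stemedb/tls/cert.pem 14; then
#   echo "certificate expires within 14 days; renew now"
# fi
```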
**Resolution B: Wrong file paths**
```bash
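# Check environment variables in the current shell
env | grep STEMEDB_TLS
# Check what the systemd service sees (Environment= entries only)
systemctl show stemedb-api --property=Environment
# Set correct paths (for a server launched from this shell)
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"
# Or update systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_TLS_CERT=/path/to/cert.pem"
# Environment="STEMEDB_TLS_KEY=/path/to/key.pem"
sudo systemctl daemon-reload
sudo systemctl start stemedb-api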
```
**Resolution C: Certificate/key mismatch**
```bash
# Verify certificate and key match by comparing their public keys
# (works for RSA and ECDSA keys; a modulus comparison is RSA-only)
openssl x509 -noout -pubkey -in /etc/stemedb/tls/cert.pem | openssl sha256
sudo openssl pkey -pubout -in /etc/stemedb/tls/key.pem | openssl sha256
# Hashes should match. If not, regenerate certificate or find matching pair.
```
**If failed:** TLS still failing → Temporarily disable TLS for debugging (NOT for production):
```bash
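# Disable TLS (debugging only); a shell export does not reach the systemd service
sudo systemctl edit stemedb-api
# Add under [Service]: Environment="STEMEDB_TLS_ENABLED=false"
sudo systemctl daemon-reload
sudo systemctl start stemedb-api
# Remove the override once the TLS problem is fixed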
```
---
### §3. WAL Corruption
**Diagnostic:**
```bash
# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal
# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"
# Check WAL directory
ls -lh data/wal/
```
**Resolution: Restore from backup**
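⚠️ **WARNING:** Restoring replaces the current WAL, so writes made after the backup was taken are lost. The corrupted WAL is moved aside for forensics, not deleted. Only proceed if a backup is available and that data loss is acceptable.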
```bash
# Stop server (if running)
sudo systemctl stop stemedb-api
# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)
# List available backups
ls -lh backups/
# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS
# Verify restoration
cat data/metadata.json
# Start server
sudo systemctl start stemedb-api
# Verify health
curl http://localhost:18180/v1/health
```
**Expected output after restore:**
```json
{
"status": "healthy",
"version": "0.1.0",
"uptime_seconds": 5,
"assertion_count": 10234
}
```
**If failed:** Restore failed → Check backup integrity. See [Restore from Backup Runbook](./restore-from-backup.md).
---
### §4. Disk Full
**See:** [Disk Full Runbook](./disk-full.md) for full procedure.
**Quick emergency fix:**
```bash
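# Check disk usage and inode usage
df -h
df -i
# If >98%, emergency cleanup. Preview first; this assumes WAL segments older
# than 7 days are already checkpointed or archived
sudo find data/wal -name "*.log" -mtime +7 -print
sudo find data/wal -name "*.log" -mtime +7 -delete
# Start server
sudo systemctl start stemedb-api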
```
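
The usage check above can be scripted so the same threshold is applied every time. A minimal sketch, assuming POSIX `df -P` output; the function names `disk_usage_pct` and `check_disk` and the default 90% limit are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers built on POSIX `df -P`.

# Print the used-capacity percentage (digits only) of the filesystem holding $1.
disk_usage_pct() {
  # With -P, line 2 is the data line and column 5 is "Capacity" (for example 42%).
  df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the limit, 1 if at or above it, 2 if it cannot be read.
check_disk() {
  local path=$1 limit=${2:-90} pct
  pct=$(disk_usage_pct "$path")
  if [ -z "$pct" ]; then
    echo "UNKNOWN: cannot read disk usage for $path" >&2
    return 2
  fi
  if [ "$pct" -ge "$limit" ]; then
    echo "CRITICAL: $path at ${pct}% (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path at ${pct}%"
}

# Usage:
# check_disk data/wal 90 && check_disk data/db 90
```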
---
### §5. Permission Issues
**Diagnostic:**
```bash
# Check directory permissions
ls -la data/
# Expected ownership:
# drwxr-xr-x stemedb stemedb wal/
# drwxr-xr-x stemedb stemedb db/
# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent
```
**Resolution A: Fix ownership**
```bash
# Fix ownership recursively
sudo chown -R stemedb:stemedb data/
# Fix permissions: 755 on directories, 644 on files
# (chmod -R 755 would also mark every data file executable)
sudo find data/ -type d -exec chmod 755 {} +
sudo find data/ -type f -exec chmod 644 {} +
# Start server
sudo systemctl start stemedb-api
```
**Resolution B: SELinux context**
```bash
# Restore SELinux context
sudo restorecon -Rv data/
# Or set permissive for debugging (NOT for production)
sudo setenforce 0
# Start server
sudo systemctl start stemedb-api
# If this works, re-enable enforcing mode and add an SELinux policy instead of disabling
sudo setenforce 1
```
**Resolution C: Container user mismatch**
```bash
# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000" # Match host UID
#     volumes:
#       - ./data:/data
# Or use chown in entrypoint:
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]
```
**If failed:** Permissions correct but still denied → Check AppArmor profiles or mandatory access controls.
---
## Validation
After applying a resolution, validate that the server is healthy:
- [ ] **Server starts successfully**
```bash
systemctl status stemedb-api
# Should show "active (running)"
```
- [ ] **Health endpoint returns 200**
```bash
curl http://localhost:18180/v1/health
# Should return: {"status":"healthy", ...}
```
- [ ] **Port is bound**
```bash
lsof -i :18180
# Should show stemedb-api listening
```
- [ ] **Logs show successful startup**
```bash
journalctl -u stemedb-api -n 20
# Should show 10 startup steps completed
```
- [ ] **Test query succeeds**
```bash
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path":"test/health","lens":"recency"}'
# Should return 200 (even if empty results)
```
- [ ] **Metrics endpoint works**
```bash
curl http://localhost:18180/metrics | head -20
# Should return Prometheus metrics
```
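
Right after a restart the health endpoint may not answer on the first request, so a short retry loop is more reliable than a single `curl`. A minimal sketch; the function name `wait_for_health` and its defaults are assumptions, and the URL is the health endpoint used elsewhere in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll the health endpoint until it returns a 2xx status.
# Arguments: URL, number of attempts, delay between attempts in seconds.
wait_for_health() {
  local url=${1:-http://localhost:18180/v1/health} attempts=${2:-10} delay=${3:-1}
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors (status >= 400).
    if curl --silent --fail --max-time 2 --output /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempt(s): $url" >&2
  return 1
}

# Usage:
# sudo systemctl start stemedb-api && wait_for_health
```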
---
## Prevention
### Monitoring
**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"
      - alert: StemeDBRestartLoop
        # increase() counts restarts over the window; rate() would be per second
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"
```
### Configuration Changes
**To prevent recurrence:**
1. **Port conflicts:** Reserve port 18180 in your infrastructure registry
2. **TLS expiry:** Automate certificate renewal with certbot + systemd timer
3. **WAL corruption:** Enable daily backups via cron
4. **Disk full:** Warn at 80% disk usage, alert at 90%
5. **Permissions:** Document correct UID/GID in deployment guide
**Example: Automated TLS renewal**
```ini
# /etc/systemd/system/certbot-renewal.timer
# (activates certbot-renewal.service, which must exist and run the renewal)
[Unit]
Description=Certbot renewal timer
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
```
---
## Startup Sequence Reference
**Normal startup takes 2-5 seconds and includes 10 steps:**
1. Initialize logging (tracing subscriber)
2. Start metrics registry
3. Load configuration (env vars)
4. Verify data directories exist
5. Open WAL journal (crash recovery if needed)
6. Initialize HybridStore (KV + indexes)
7. Start IngestWorker (background thread)
8. Build HTTP router (axum)
9. Bind TCP listener on configured address
10. Start accepting connections
**If the server hangs at a specific step, check:**
- Step 5 (WAL): Corruption or disk full
- Step 6 (HybridStore): Database corruption
- Step 9 (Bind): Port already in use
---
## Environment Variables Reference
| Variable | Default | Description |
|----------|---------|-------------|
| `STEMEDB_BIND_ADDR` | `127.0.0.1:18180` | HTTP API listen address |
| `STEMEDB_WAL_DIR` | `data/wal` | Write-ahead log directory |
| `STEMEDB_DB_DIR` | `data/db` | Database directory |
| `STEMEDB_TLS_ENABLED` | `false` | Enable TLS termination |
| `STEMEDB_TLS_CERT` | (none) | Path to TLS certificate |
| `STEMEDB_TLS_KEY` | (none) | Path to TLS private key |
| `STEMEDB_METER_ENABLED` | `true` | Enable Prometheus metrics |
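
The variables in this table can be checked before the server is started. The sketch below is a hypothetical preflight function: it reads the same variable names and defaults as the table, and it assumes TLS is enabled by the literal value `true`, which may not match how the server parses the flag.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check. Variable names and defaults come from the table
# above; treating "true" as the enabled value for TLS is an assumption.
preflight() {
  local wal_dir=${STEMEDB_WAL_DIR:-data/wal} db_dir=${STEMEDB_DB_DIR:-data/db}
  local rc=0 d f
  # Both data directories must exist and be writable by the current user.
  for d in "$wal_dir" "$db_dir"; do
    if [ ! -d "$d" ] || [ ! -w "$d" ]; then
      echo "FAIL: $d is missing or not writable" >&2
      rc=1
    fi
  done
  # With TLS enabled, both the certificate and the key must be set and readable.
  if [ "${STEMEDB_TLS_ENABLED:-false}" = "true" ]; then
    for f in "${STEMEDB_TLS_CERT:-}" "${STEMEDB_TLS_KEY:-}"; do
      if [ -z "$f" ] || [ ! -r "$f" ]; then
        echo "FAIL: TLS file '${f:-<unset>}' is unset or not readable" >&2
        rc=1
      fi
    done
  fi
  return "$rc"
}

# Usage: run as the service user, from the service's working directory,
# so the checks reflect what the server itself will see.
```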
---
## Related Runbooks
- [Disk Full](./disk-full.md) - Storage management
- [Restore from Backup](./restore-from-backup.md) - WAL corruption recovery
- [High Query Latency](./high-query-latency.md) - Performance issues after startup
---
## Last Updated
2026-02-11