stemedb/docs/operations/runbooks/server-wont-start.md

# Runbook: Server Won't Start

## Symptom

- `stemedb-api` process exits immediately after startup
- Port binding fails with "Address already in use"
- TLS certificate errors in logs
- "No space left on device" errors
- WAL magic byte validation failures
- Permission denied errors on data directories

**Metrics Alerts:**
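- N/A: the server never starts, so it exports no metrics. An external scrape alert such as `up{job="stemedb"} == 0` (see Prevention) is the only metric-based signal.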

---

## Quick Diagnosis

```
Server won't start
    │
    ├─► Check: lsof -i :18180
    │   └─► Port in use? → §1 Port Conflict
    │
    ├─► Check: journalctl -u stemedb-api | grep -i tls
    │   └─► TLS errors? → §2 TLS Error
    │
    ├─► Check: df -h
    │   └─► Disk full? → §4 Disk Full (see also ./disk-full.md)
    │
    ├─► Check: journalctl -u stemedb-api | grep -i magic
    │   └─► WAL corruption? → §3 WAL Corruption
    │
    └─► Check: ls -la data/wal/
        └─► Permission denied? → §5 Permission Issues
```
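The decision tree above can also be driven from captured log text. The following is a minimal sketch, not part of StemeDB: `classify_startup_failure` is a hypothetical helper, and its patterns are taken from the Symptom list, so the actual log wording may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read startup log text on stdin and print the
# runbook section that most likely applies. Patterns are assumptions
# based on the Symptom list, not confirmed StemeDB log messages.
classify_startup_failure() {
  local log
  log=$(cat)
  if grep -qi 'address already in use' <<<"$log"; then
    echo '§1 Port Conflict'
  elif grep -qi 'no space left on device' <<<"$log"; then
    echo '§4 Disk Full'
  elif grep -qi 'permission denied' <<<"$log"; then
    echo '§5 Permission Issues'
  elif grep -qi 'magic' <<<"$log"; then
    echo '§3 WAL Corruption'
  elif grep -qiE 'tls|certificate' <<<"$log"; then
    echo '§2 TLS Certificate Error'
  else
    echo 'unknown'
    return 1
  fi
}
```

Typical use would be `journalctl -u stemedb-api -n 50 --no-pager | classify_startup_failure`. The first matching pattern wins, so a log containing several error types still needs a manual read.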

---

## Common Causes

1. **Port already in use** — Likelihood: **40%**
   - Previous instance didn't shut down cleanly
   - Another service using port 18180
   - Development server still running

2. **TLS certificate issues** — Likelihood: **25%**
   - Certificate expired
   - Wrong file paths in config
   - Certificate/key mismatch

3. **WAL corruption** — Likelihood: **15%**
   - Unclean shutdown (power loss, OOM kill)
   - Disk corruption
   - Version mismatch after upgrade

4. **Disk full** — Likelihood: **10%**
   - WAL directory out of space
   - DB directory out of space
   - No inodes available

5. **Permission issues** — Likelihood: **10%**
   - Wrong ownership on data directories
   - SELinux/AppArmor blocking access
   - Container user mismatch

---

## Resolution Steps

### §1. Port Conflict

**Diagnostic:**
```bash
# Check if port 18180 is in use
lsof -i :18180

# Expected output if port in use:
# COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
# stemedb- 1234 root   10u  IPv4  12345      0t0  TCP *:18180 (LISTEN)
```

**Resolution A: Kill stale process**
```bash
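# Find the process listening on the port
# (-sTCP:LISTEN skips clients that merely hold a connection to it)
sudo lsof -t -i tcp:18180 -sTCP:LISTEN

# Kill gracefully (SIGTERM)
sudo kill $(sudo lsof -t -i tcp:18180 -sTCP:LISTEN)

# Wait 5 seconds
sleep 5

# Verify port is free
sudo lsof -i :18180
# (Should return empty)

# Start server
sudo systemctl start stemedb-api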
```

**Resolution B: Change port**
```bash
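# Set a custom port via environment variable
# (affects only a server launched from this shell, not the systemd unit)
export STEMEDB_BIND_ADDR="127.0.0.1:18280"

# For the systemd-managed service, use a drop-in override instead
sudo systemctl edit stemedb-api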

# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"

sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```

**If failed:** Port still in use after kill → Check for multiple instances, a supervisor that restarts the process, or a conflicting service. Reboot the host only as a last resort.
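The fixed `sleep 5` in Resolution A can be replaced by a polling loop that returns as soon as the listener is gone. This is a sketch under the assumption that `lsof` is installed; `wait_port_free` is a hypothetical helper name, and the 10-second default timeout is arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll until nothing listens on the given TCP port.
# Returns 0 once the port is free, 1 if it is still bound after the timeout.
wait_port_free() {
  local port=$1 timeout=${2:-10} waited=0
  # lsof exits 0 while at least one listener exists, non-zero otherwise.
  while lsof -t -i "tcp:${port}" -sTCP:LISTEN >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For example, `wait_port_free 18180 10 && sudo systemctl start stemedb-api` starts the server only after the port has been released. Run it as root (or via `sudo bash -c`) if the listener belongs to another user, since `lsof` otherwise may not see it.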

---

### §2. TLS Certificate Error

**Diagnostic:**
```bash
# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls

# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"

# Verify certificate files exist
ls -lh /etc/stemedb/tls/
```

**Resolution A: Certificate expired**
```bash
# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate

# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com

# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem

# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

# Restart server
sudo systemctl start stemedb-api
```

**Resolution B: Wrong file paths**
```bash
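# Check the variables in the current shell
env | grep STEMEDB_TLS

# Check the variables the systemd unit actually receives
systemctl show stemedb-api --property=Environment

# Set correct paths (for a server launched from this shell)
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"

# Or set them in a systemd drop-in override
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_TLS_CERT=/path/to/cert.pem"
# Environment="STEMEDB_TLS_KEY=/path/to/key.pem"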

sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```

**Resolution C: Certificate/key mismatch**
```bash
# Verify certificate and key match by comparing their public keys
# (works for RSA and ECDSA keys; a modulus comparison is RSA-only)
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -pubkey | openssl sha256
sudo openssl pkey -in /etc/stemedb/tls/key.pem -pubout | openssl sha256

# Hashes should match. If not, regenerate the certificate or find the matching pair.
```

**If failed:** TLS still failing → Temporarily disable TLS for debugging (NOT for production):
```bash
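# Disable TLS (debugging only). An exported shell variable does not reach a
# systemd-managed service, so set it in a drop-in override:
sudo systemctl edit stemedb-api
#   [Service]
#   Environment="STEMEDB_TLS_ENABLED=false"
sudo systemctl restart stemedb-api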
```

---

### §3. WAL Corruption

**Diagnostic:**
```bash
# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"

# Check WAL directory
ls -lh data/wal/
```

**Resolution: Restore from backup**

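⚠️ **WARNING:** This removes the current WAL from the live data directory (it is moved aside for forensics, not replayed). Any writes not captured in the backup are lost. Only proceed if a backup is available and that data loss is acceptable.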

```bash
# Stop server (if running)
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# List available backups
ls -lh backups/

# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS

# Verify restoration
cat data/metadata.json

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

**Expected output after restore:**
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 5,
  "assertion_count": 10234
}
```

**If failed:** Restore failed → Check backup integrity. See [Restore from Backup Runbook](./restore-from-backup.md).
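Picking the most recent backup by eye is error-prone during an incident. The sketch below assumes backups are directories or files named `stemedb-backup-YYYYMMDD-HHMMSS`, as in the restore command above, so that lexicographic order equals chronological order. `latest_backup` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest entry named stemedb-backup-* in a
# directory. Relies on the YYYYMMDD-HHMMSS suffix sorting chronologically.
latest_backup() {
  local dir=${1:-backups} newest= entry
  for entry in "$dir"/stemedb-backup-*; do
    [ -e "$entry" ] || continue   # glob matched nothing
    if [[ -z $newest || $entry > $newest ]]; then
      newest=$entry
    fi
  done
  # Return non-zero when no backup was found.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

It could then feed the restore step, for instance `sudo ./scripts/restore-stemedb.sh "$(latest_backup backups)"`, after confirming the printed path is the backup you intend to restore.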

---

### §4. Disk Full

**See:** [Disk Full Runbook](./disk-full.md) for full procedure.

**Quick emergency fix:**
```bash
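# Check disk usage (blocks and inodes)
df -h
df -i

# If >98%, emergency cleanup. Preview the matches first: deleting WAL files
# that have not been applied or captured in a backup can lose data.
sudo find data/wal -name "*.log" -mtime +7 -print
sudo find data/wal -name "*.log" -mtime +7 -delete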

# Start server
sudo systemctl start stemedb-api
```
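`df -h` reports block usage only, while Common Causes also lists inode exhaustion. The following sketch checks both against a threshold. It assumes a Linux `df` that supports `-P` and `-i`; `fs_use_pct` and `fs_is_full` are hypothetical helper names, and the 98% default mirrors the threshold used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the usage percentage (without '%') of the
# filesystem holding PATH. KIND is "blocks" (default) or "inodes".
fs_use_pct() {
  local path=$1 kind=${2:-blocks} flags=-P
  if [ "$kind" = inodes ]; then
    flags=-Pi
  fi
  # With -P, line 2 holds the data and column 5 is the percentage.
  df "$flags" "$path" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Returns 0 if usage is at or above THRESHOLD, 1 if below,
# 2 if df did not report a number (some filesystems print "-").
fs_is_full() {
  local pct
  pct=$(fs_use_pct "$1" "${2:-blocks}")
  case $pct in
    '' | *[!0-9]*) return 2 ;;
  esac
  [ "$pct" -ge "${3:-98}" ]
}
```

For example, `fs_is_full data/wal inodes 98 && echo "out of inodes"` catches the case where `df -h` still shows free space but no new files can be created.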

---

### §5. Permission Issues

**Diagnostic:**
```bash
# Check directory permissions
ls -la data/

# Expected ownership:
# drwxr-xr-x stemedb stemedb wal/
# drwxr-xr-x stemedb stemedb db/

# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent
```

**Resolution A: Fix ownership**
```bash
# Fix ownership recursively
sudo chown -R stemedb:stemedb data/

# Fix permissions: 755 on directories, 644 on files
# (chmod -R 755 would also mark every data file executable)
sudo find data/ -type d -exec chmod 755 {} +
sudo find data/ -type f -exec chmod 644 {} +

# Start server
sudo systemctl start stemedb-api
```

**Resolution B: SELinux context**
```bash
# Restore SELinux context
sudo restorecon -Rv data/

# Or set permissive for debugging (NOT for production)
sudo setenforce 0

# Start server
sudo systemctl start stemedb-api

# If it works, re-enable enforcing mode (sudo setenforce 1) and add an SELinux policy instead of disabling it
```

**Resolution C: Container user mismatch**
```bash
# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000"  # Match host UID
#     volumes:
#       - ./data:/data

# Or use chown in entrypoint:
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]
```

**If failed:** Permissions correct but still denied → Check AppArmor profiles (denials appear in the kernel log, e.g. `sudo dmesg | grep -i 'apparmor.*denied'`) or other mandatory access controls.
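Before running a recursive `chown`, it can help to see exactly which entries have the wrong owner. A minimal sketch, assuming the service user is `stemedb` as in the expected ownership above; `list_not_owned_by` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list every file or directory under DIR that is not
# owned by OWNER (a user name or numeric UID). Empty output means all good.
list_not_owned_by() {
  local dir=$1 owner=$2
  find "$dir" ! -user "$owner" -print
}
```

Running `list_not_owned_by data/ stemedb` prints only the offending paths, which also makes a container UID mismatch visible: in that case every entry is listed.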

---

## Validation

After applying resolution, validate server is healthy:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Should show "active (running)"
  ```

- [ ] **Health endpoint returns 200**
  ```bash
  curl http://localhost:18180/v1/health
  # Should return: {"status":"healthy", ...}
  ```

- [ ] **Port is bound**
  ```bash
  lsof -i :18180
  # Should show stemedb-api listening
  ```

- [ ] **Logs show successful startup**
  ```bash
  journalctl -u stemedb-api -n 20
  # Should show 10 startup steps completed
  ```

- [ ] **Test query succeeds**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/health","lens":"recency"}'
  # Should return 200 (even if empty results)
  ```

- [ ] **Metrics endpoint works**
  ```bash
  curl http://localhost:18180/metrics | head -20
  # Should return Prometheus metrics
  ```
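Because normal startup takes a few seconds, a single `curl` right after `systemctl start` can fail spuriously. The sketch below retries the health check. It assumes the `/v1/health` response contains `"status":"healthy"` as in the expected output shown earlier; `wait_healthy` is a hypothetical helper name, and the retry defaults are arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll the health endpoint until it reports healthy.
# Usage: wait_healthy [URL] [ATTEMPTS] [DELAY_SECONDS]
wait_healthy() {
  local url=${1:-http://localhost:18180/v1/health}
  local attempts=${2:-10} delay=${3:-1} i
  for ((i = 1; i <= attempts; i++)); do
    # -f fails on HTTP errors, -sS stays quiet, --max-time bounds each try.
    if curl -fsS --max-time 2 "$url" 2>/dev/null |
      grep -q '"status" *: *"healthy"'; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, `sudo systemctl start stemedb-api && wait_healthy || journalctl -u stemedb-api -n 20 --no-pager` prints the recent log lines only when the server does not become healthy in time. Use an `https://` URL if TLS is enabled.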

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"

      - alert: StemeDBRestartLoop
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"
```

### Configuration Changes

**To prevent recurrence:**

1. **Port conflicts:** Reserve port 18180 in your infrastructure registry
2. **TLS expiry:** Automate certificate renewal with certbot + systemd timer
3. **WAL corruption:** Enable daily backups via cron
4. **Disk full:** Monitor disk at 80% threshold, alert at 90%
5. **Permissions:** Document correct UID/GID in deployment guide

**Example: Automated TLS renewal**
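```ini
# /etc/systemd/system/certbot-renewal.timer
# A timer activates the service unit with the same base name, so a
# certbot-renewal.service that runs the renewal must also exist (not shown).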
[Unit]
Description=Certbot renewal timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```
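A renewal timer is easier to trust when paired with an independent expiry check. The following is a minimal sketch, assuming `openssl` is installed; `cert_expires_within` is a hypothetical helper name and the 14-day default is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the certificate expires within
# DAYS days, or if it cannot be read at all.
cert_expires_within() {
  local cert=$1 days=${2:-14}
  # `openssl x509 -checkend N` exits 0 when the certificate is still valid
  # N seconds from now, so the result is inverted here.
  ! openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}
```

A cron job or monitoring probe could run `cert_expires_within /etc/stemedb/tls/cert.pem 14 && echo "renew soon"`. Note that an unreadable or missing file is also reported as expiring, which is usually the safer failure mode for an alert.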

---

## Startup Sequence Reference

**Normal startup takes 2-5 seconds and includes 10 steps:**

1. Initialize logging (tracing subscriber)
2. Start metrics registry
3. Load configuration (env vars)
4. Verify data directories exist
5. Open WAL journal (crash recovery if needed)
6. Initialize HybridStore (KV + indexes)
7. Start IngestWorker (background thread)
8. Build HTTP router (axum)
9. Bind TCP listener on configured address
10. Start accepting connections

**If the server hangs at a specific step, check:**
- Step 5 (WAL): Corruption or disk full
- Step 6 (HybridStore): Database corruption
- Step 9 (Bind): Port already in use

---

## Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `STEMEDB_BIND_ADDR` | `127.0.0.1:18180` | HTTP API listen address |
| `STEMEDB_WAL_DIR` | `data/wal` | Write-ahead log directory |
| `STEMEDB_DB_DIR` | `data/db` | Database directory |
| `STEMEDB_TLS_ENABLED` | `false` | Enable TLS termination |
| `STEMEDB_TLS_CERT` | (none) | Path to TLS certificate |
| `STEMEDB_TLS_KEY` | (none) | Path to TLS private key |
| `STEMEDB_METER_ENABLED` | `true` | Enable Prometheus metrics |
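
When debugging a startup failure it is useful to see which values would apply once the defaults are filled in. The sketch below prints each variable from the table with its documented default. `print_effective_config` is a hypothetical helper, and it reflects the current shell only; for the systemd unit, `systemctl show stemedb-api --property=Environment` shows what the service receives.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each documented variable, falling back to the
# default listed in the table above when it is unset or empty.
print_effective_config() {
  printf 'STEMEDB_BIND_ADDR=%s\n' "${STEMEDB_BIND_ADDR:-127.0.0.1:18180}"
  printf 'STEMEDB_WAL_DIR=%s\n' "${STEMEDB_WAL_DIR:-data/wal}"
  printf 'STEMEDB_DB_DIR=%s\n' "${STEMEDB_DB_DIR:-data/db}"
  printf 'STEMEDB_TLS_ENABLED=%s\n' "${STEMEDB_TLS_ENABLED:-false}"
  printf 'STEMEDB_TLS_CERT=%s\n' "${STEMEDB_TLS_CERT:-(none)}"
  printf 'STEMEDB_TLS_KEY=%s\n' "${STEMEDB_TLS_KEY:-(none)}"
  printf 'STEMEDB_METER_ENABLED=%s\n' "${STEMEDB_METER_ENABLED:-true}"
}
```

A quick sanity check is `print_effective_config | grep TLS`: TLS enabled with a cert or key still showing `(none)` points at Resolution B in §2.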

---

## Related Runbooks

- [Disk Full](./disk-full.md) - Storage management
- [Restore from Backup](./restore-from-backup.md) - WAL corruption recovery
- [High Query Latency](./high-query-latency.md) - Performance issues after startup

---

## Last Updated

2026-02-11