# Runbook: Server Won't Start

## Symptom

- `stemedb-api` process exits immediately after startup
- Port binding fails with "Address already in use"
- TLS certificate errors in logs
- "No space left on device" errors
- WAL magic byte validation failures
- Permission denied errors on data directories

**Metrics Alerts:**
- N/A (server never starts, metrics unavailable)

---

## Quick Diagnosis

```
Server won't start
│
├─► Check: lsof -i :18180
│   └─► Port in use? → §1 Port Conflict
│
├─► Check: journalctl -u stemedb-api | grep -i tls
│   └─► TLS errors? → §2 TLS Error
│
├─► Check: df -h
│   └─► Disk full? → [Disk Full Runbook](./disk-full.md)
│
├─► Check: journalctl -u stemedb-api | grep -i magic
│   └─► WAL corruption? → §3 WAL Corruption
│
└─► Check: ls -la data/wal/
    └─► Permission denied? → §5 Permissions
```
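
The log-based checks in the tree can also be run in one pass over captured log text. The following is a minimal sketch, not part of StemeDB: `classify_startup_log` is a hypothetical helper that reads log lines on stdin and prints the matching section of this runbook. The patterns are taken from the Symptom list, and the exact wording in real logs is an assumption. A typical call would be `journalctl -u stemedb-api -n 50 | classify_startup_log`.

```bash
# Hypothetical helper: map startup log text (stdin) to a section of this runbook.
# Patterns come from the Symptom list; real log wording may differ.
classify_startup_log() {
  _log=$(cat)
  _has() { printf '%s\n' "$_log" | grep -qiE "$1"; }
  if   _has 'address already in use';  then echo '§1 Port Conflict'
  elif _has 'tls|certificate';         then echo '§2 TLS Certificate Error'
  elif _has 'no space left on device'; then echo '§4 Disk Full'
  elif _has 'magic';                   then echo '§3 WAL Corruption'
  elif _has 'permission denied';       then echo '§5 Permission Issues'
  else echo 'unknown'; return 1
  fi
}
```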

---

## Common Causes

1. **Port already in use** (Likelihood: **40%**)
   - Previous instance didn't shut down cleanly
   - Another service using port 18180
   - Development server still running

2. **TLS certificate issues** (Likelihood: **25%**)
   - Certificate expired
   - Wrong file paths in config
   - Certificate/key mismatch

3. **WAL corruption** (Likelihood: **15%**)
   - Unclean shutdown (power loss, OOM kill)
   - Disk corruption
   - Version mismatch after upgrade

4. **Disk full** (Likelihood: **10%**)
   - WAL directory out of space
   - DB directory out of space
   - No inodes available

5. **Permission issues** (Likelihood: **10%**)
   - Wrong ownership on data directories
   - SELinux/AppArmor blocking access
   - Container user mismatch

---

## Resolution Steps

### §1. Port Conflict

**Diagnostic:**
```bash
# Check if port 18180 is in use
lsof -i :18180

# Expected output if port in use:
# COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
# stemedb- 1234 root  10u  IPv4  12345      0t0  TCP *:18180 (LISTEN)
```

**Resolution A: Kill stale process**
```bash
# Find the process listening on the port
# (-sTCP:LISTEN excludes clients that merely have a connection to it)
lsof -t -iTCP:18180 -sTCP:LISTEN

# Kill gracefully (SIGTERM); use sudo if the process belongs to another user
kill $(lsof -t -iTCP:18180 -sTCP:LISTEN)

# Wait 5 seconds
sleep 5

# Verify port is free
lsof -i :18180
# (Should return empty)

# Start server
sudo systemctl start stemedb-api
```
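
A fixed `sleep 5` can be too short for a slow shutdown and too long for a fast one. As a sketch only, the wait can be replaced by polling: `wait_until` and `port_is_free` below are hypothetical helpers, not StemeDB tooling, and `port_is_free` assumes `lsof` is installed (if it is missing, the check reports the port as free).

```bash
# Poll a command once per second until it succeeds or the timeout (seconds) expires.
wait_until() {
  _timeout=$1; shift
  _elapsed=0
  until "$@"; do
    [ "$_elapsed" -ge "$_timeout" ] && return 1
    sleep 1
    _elapsed=$((_elapsed + 1))
  done
}

# True when nothing is listening on the given TCP port (requires lsof).
port_is_free() {
  ! lsof -t -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}
```

For example, `wait_until 10 port_is_free 18180 || echo "port 18180 still in use"` gives the stale process up to ten seconds to exit before reporting a problem.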

**Resolution B: Change port**
```bash
# Set custom port via environment variable
# (only affects a server launched from this shell, not the systemd unit)
export STEMEDB_BIND_ADDR="127.0.0.1:18280"

# Or, for the systemd service, add a drop-in override
sudo systemctl edit stemedb-api

# Add:
# [Service]
# Environment="STEMEDB_BIND_ADDR=127.0.0.1:18280"

sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```

**If failed:** Port still in use after kill → Check for multiple instances or conflicting services. Reboot the host only if the outage is critical and nothing else frees the port.

---

### §2. TLS Certificate Error

**Diagnostic:**
```bash
# Check logs for TLS errors
journalctl -u stemedb-api -n 50 | grep -i tls

# Common errors:
# - "certificate has expired"
# - "No such file or directory: /etc/stemedb/tls/cert.pem"
# - "key values mismatch"

# Verify certificate files exist
ls -lh /etc/stemedb/tls/
```

**Resolution A: Certificate expired**
```bash
# Check expiration date
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -enddate

# Renew with Let's Encrypt (example)
sudo certbot renew --cert-name stemedb.example.com

# Copy renewed certificates
sudo cp /etc/letsencrypt/live/stemedb.example.com/fullchain.pem /etc/stemedb/tls/cert.pem
sudo cp /etc/letsencrypt/live/stemedb.example.com/privkey.pem /etc/stemedb/tls/key.pem

# Set correct permissions
sudo chown stemedb:stemedb /etc/stemedb/tls/*.pem
sudo chmod 600 /etc/stemedb/tls/key.pem
sudo chmod 644 /etc/stemedb/tls/cert.pem

# Restart server
sudo systemctl start stemedb-api
```
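
Expiry can also be caught before it blocks a startup. The sketch below assumes the certificate path used above; `cert_expires_within` is a hypothetical helper built on `openssl x509 -checkend`, which exits non-zero when the certificate expires within the given number of seconds. An unreadable or missing file is also reported as expiring, which is arguably the safer default.

```bash
# Hypothetical helper: succeed if the certificate expires within N days
# (or cannot be read at all).
cert_expires_within() {
  _cert=$1
  _days=$2
  # -checkend exits 0 if the certificate is still valid N seconds from now
  ! openssl x509 -checkend "$((_days * 86400))" -noout -in "$_cert" >/dev/null 2>&1
}
```

For example, `cert_expires_within /etc/stemedb/tls/cert.pem 14 && echo "renew soon"` flags a certificate with less than two weeks left.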

**Resolution B: Wrong file paths**
```bash
# Check environment variables in the current shell
env | grep STEMEDB_TLS

# Check what the systemd unit actually receives
systemctl show stemedb-api --property=Environment

# Set correct paths (for a server launched from this shell)
export STEMEDB_TLS_CERT="/path/to/cert.pem"
export STEMEDB_TLS_KEY="/path/to/key.pem"

# Or update systemd service
sudo systemctl edit stemedb-api
# Add correct paths

sudo systemctl daemon-reload
sudo systemctl start stemedb-api
```

**Resolution C: Certificate/key mismatch**
```bash
# Verify certificate and key match (RSA keys only; the modulus check
# does not apply to EC or Ed25519 keys)
openssl x509 -noout -modulus -in /etc/stemedb/tls/cert.pem | openssl md5
openssl rsa -noout -modulus -in /etc/stemedb/tls/key.pem | openssl md5

# Hashes should match. If not, regenerate certificate or find matching pair.
```
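
The modulus comparison above works for RSA keys. For other key types, one alternative is to compare the public key embedded in the certificate with the one derived from the private key. This is a sketch under that assumption; `cert_matches_key` is a hypothetical helper, and it returns 2 when either file cannot be parsed.

```bash
# Hypothetical helper: succeed if the certificate and private key share a public key.
# Returns 1 on mismatch, 2 if either file cannot be read or parsed.
cert_matches_key() {
  _cert_pub=$(openssl x509 -noout -pubkey -in "$1" 2>/dev/null) || return 2
  _key_pub=$(openssl pkey -pubout -in "$2" 2>/dev/null) || return 2
  [ "$_cert_pub" = "$_key_pub" ]
}
```

For example: `cert_matches_key /etc/stemedb/tls/cert.pem /etc/stemedb/tls/key.pem || echo "certificate and key do not match"`.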
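
**If failed:** TLS still failing → Temporarily disable TLS for debugging (NOT for production):
```bash
# Disable TLS (debugging only)
# When launching the binary directly from this shell:
export STEMEDB_TLS_ENABLED=false

# Under systemd an exported shell variable is not passed to the unit;
# set Environment="STEMEDB_TLS_ENABLED=false" via `sudo systemctl edit stemedb-api` first
sudo systemctl start stemedb-api
```

---
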
### §3. WAL Corruption

**Diagnostic:**
```bash
# Check logs for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors:
# - "WAL magic byte validation failed"
# - "Failed to recover WAL segment"
# - "Checksum mismatch in WAL"

# Check WAL directory
ls -lh data/wal/
```

**Resolution: Restore from backup**

⚠️ **WARNING:** This destroys current WAL data. Only proceed if backup is available and data loss is acceptable.

```bash
# Stop server (if running)
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# List available backups
ls -lh backups/

# Restore from most recent backup
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-YYYYMMDD-HHMMSS

# Verify restoration
cat data/metadata.json

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

**Expected output after restore:**
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 5,
  "assertion_count": 10234
}
```

**If failed:** Restore failed → Check backup integrity. See [Restore from Backup Runbook](./restore-from-backup.md).
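
Picking the most recent backup by eye is error-prone during an incident. The sketch below assumes backups are named `stemedb-backup-YYYYMMDD-HHMMSS` as in the restore command above, so that name order equals chronological order; `latest_backup` is a hypothetical helper, not one of the shipped scripts.

```bash
# Hypothetical helper: print the newest backup under a directory (default: backups),
# or fail if none exist. Relies on the timestamped naming sorting chronologically.
latest_backup() {
  _latest=
  for _b in "${1:-backups}"/stemedb-backup-*; do
    [ -e "$_b" ] || continue   # the glob matched nothing
    _latest=$_b                # globs expand in sorted order, so the last match is newest
  done
  [ -n "$_latest" ] && printf '%s\n' "$_latest"
}

# Example: sudo ./scripts/restore-stemedb.sh "$(latest_backup backups)"
```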

---

### §4. Disk Full

**See:** [Disk Full Runbook](./disk-full.md) for full procedure.

**Quick emergency fix:**
```bash
# Check disk usage
df -h

# If >98%, emergency cleanup
sudo find data/wal -name "*.log" -mtime +7 -delete

# Start server
sudo systemctl start stemedb-api
```
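
The ">98%" decision above can be made from a number rather than by reading `df -h` output. A minimal sketch: `disk_use_pct` and `inode_use_pct` are hypothetical helpers that parse `df -P`. The inode variant relies on `df -i`, which GNU df provides but POSIX does not specify, and some filesystems report `-` instead of a percentage there.

```bash
# Hypothetical helper: print the percentage of space used on the filesystem holding a path.
disk_use_pct() {
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. "42%"
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Same for inodes (GNU df: column 5 of `df -Pi` is IUse%).
inode_use_pct() {
  df -Pi "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example: [ "$(disk_use_pct data/wal)" -gt 98 ] && echo "emergency cleanup needed"
```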

---

### §5. Permission Issues

**Diagnostic:**
```bash
# Check directory permissions
ls -la data/

# Expected ownership:
# drwxr-xr-x  stemedb stemedb  wal/
# drwxr-xr-x  stemedb stemedb  db/

# Check SELinux denials (RHEL/CentOS)
sudo ausearch -m avc -ts recent
```

**Resolution A: Fix ownership**
```bash
# Fix ownership recursively
sudo chown -R stemedb:stemedb data/

# Fix permissions
sudo chmod -R 755 data/
sudo chmod -R 644 data/wal/*.log
sudo chmod -R 644 data/db/*.kv

# Start server
sudo systemctl start stemedb-api
```
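
`ls -la data/` only shows the top level, so a single wrongly owned file deeper in the tree is easy to miss. As a sketch, `find` can list every entry not owned by the expected user; `find_wrong_owner` is a hypothetical helper, and the `stemedb` user name is the one used in this runbook.

```bash
# Hypothetical helper: list entries under a directory NOT owned by the expected user.
# Prints nothing and succeeds when ownership is consistent; returns 1 when
# offenders are found and 2 if find itself fails (e.g. unknown user).
find_wrong_owner() {
  _found=$(find "$1" ! -user "$2" -print) || return 2
  [ -z "$_found" ] && return 0
  printf '%s\n' "$_found"
  return 1
}
```

For example, `sudo sh -c '. ./helpers.sh; find_wrong_owner data stemedb'` (with the function saved in a file of your choosing) prints the paths that still need `chown`.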

**Resolution B: SELinux context**
```bash
# Restore SELinux context
sudo restorecon -Rv data/

# Or set permissive for debugging (NOT for production)
sudo setenforce 0

# Start server
sudo systemctl start stemedb-api

# If works, add SELinux policy instead of disabling
```

**Resolution C: Container user mismatch**
```bash
# In Docker/Kubernetes, ensure volumes have correct UID
# docker-compose.yml example:
# services:
#   stemedb:
#     user: "1000:1000"  # Match host UID
#     volumes:
#       - ./data:/data

# Or use chown in entrypoint:
# entrypoint: ["sh", "-c", "chown -R stemedb:stemedb /data && exec stemedb-api"]
```

**If failed:** Permissions correct but still denied → Check AppArmor profiles or mandatory access controls.

---

## Validation

After applying a resolution, validate that the server is healthy:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Should show "active (running)"
  ```

- [ ] **Health endpoint returns 200**
  ```bash
  curl http://localhost:18180/v1/health
  # Should return: {"status":"healthy", ...}
  ```

- [ ] **Port is bound**
  ```bash
  lsof -i :18180
  # Should show stemedb-api listening
  ```

- [ ] **Logs show successful startup**
  ```bash
  journalctl -u stemedb-api -n 20
  # Should show 10 startup steps completed
  ```

- [ ] **Test query succeeds**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/health","lens":"recency"}'
  # Should return 200 (even if empty results)
  ```

- [ ] **Metrics endpoint works**
  ```bash
  curl http://localhost:18180/metrics | head -20
  # Should return Prometheus metrics
  ```
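
The health check in the list above can be scripted so that a restart is only declared successful once the endpoint reports healthy. This is a sketch, assuming the `/v1/health` URL and the `"status": "healthy"` field shown in this runbook; `health_is_ok` and `wait_for_healthy` are hypothetical helpers. A call such as `wait_for_healthy || journalctl -u stemedb-api -n 20` shows recent logs when the server does not come up in time.

```bash
# Hypothetical helper: succeed if a health response on stdin reports "status": "healthy".
health_is_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Poll the health endpoint once per second, for up to N tries (default 30).
wait_for_healthy() {
  _url=${1:-http://localhost:18180/v1/health}
  _tries=${2:-30}
  while [ "$_tries" -gt 0 ]; do
    # -f: fail on HTTP errors, -sS: quiet but still report failures
    curl -fsS "$_url" 2>/dev/null | health_is_ok && return 0
    sleep 1
    _tries=$((_tries - 1))
  done
  return 1
}
```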

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_availability
    rules:
      - alert: StemeDBDown
        expr: up{job="stemedb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB server is down"
          description: "Server has been down for >1 minute"

      - alert: StemeDBRestartLoop
        # increase() counts restarts over the window; rate() would be per second
        expr: increase(stemedb_restarts_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StemeDB restarting frequently"
          description: "Server has restarted >2 times in 5 minutes"
```

### Configuration Changes

**To prevent recurrence:**

1. **Port conflicts:** Reserve port 18180 in your infrastructure registry
2. **TLS expiry:** Automate certificate renewal with certbot + systemd timer
3. **WAL corruption:** Enable daily backups via cron
4. **Disk full:** Monitor disk at 80% threshold, alert at 90%
5. **Permissions:** Document correct UID/GID in deployment guide

**Example: Automated TLS renewal**
```ini
# /etc/systemd/system/certbot-renewal.timer
# By default a timer activates the service unit of the same name,
# here certbot-renewal.service, which must exist separately.
[Unit]
Description=Certbot renewal timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

---

## Startup Sequence Reference

**Normal startup takes 2-5 seconds and includes 10 steps:**

1. Initialize logging (tracing subscriber)
2. Start metrics registry
3. Load configuration (env vars)
4. Verify data directories exist
5. Open WAL journal (crash recovery if needed)
6. Initialize HybridStore (KV + indexes)
7. Start IngestWorker (background thread)
8. Build HTTP router (axum)
9. Bind TCP listener on configured address
10. Start accepting connections

**If the server hangs or exits at a specific step, check:**
- Step 5 (WAL): Corruption or disk full
- Step 6 (HybridStore): Database corruption
- Step 9 (Bind): Port already in use

---

## Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `STEMEDB_BIND_ADDR` | `127.0.0.1:18180` | HTTP API listen address |
| `STEMEDB_WAL_DIR` | `data/wal` | Write-ahead log directory |
| `STEMEDB_DB_DIR` | `data/db` | Database directory |
| `STEMEDB_TLS_ENABLED` | `false` | Enable TLS termination |
| `STEMEDB_TLS_CERT` | (none) | Path to TLS certificate |
| `STEMEDB_TLS_KEY` | (none) | Path to TLS private key |
| `STEMEDB_METER_ENABLED` | `true` | Enable Prometheus metrics |

---

## Related Runbooks

- [Disk Full](./disk-full.md) - Storage management
- [Restore from Backup](./restore-from-backup.md) - WAL corruption recovery
- [High Query Latency](./high-query-latency.md) - Performance issues after startup

---

## Last Updated

2026-02-11