# StemeDB Troubleshooting Flowchart

**Decision tree: Symptom → Cause → Runbook**

Use this flowchart to quickly identify the right runbook for your incident.

---

## Start Here: What's the Symptom?

```
┌─────────────────────────────────────────┐
│ What observable problem are you seeing? │
└─────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
   ┌─────▼──────┐          ┌─────▼──────┐
   │ Server     │          │ Service is │
   │ won't      │          │ running    │
   │ start      │          │ but slow   │
   └─────┬──────┘          └─────┬──────┘
         │                       │
         │                ┌──────┴──────┐
         │                │             │
         │         ┌──────▼──────┐   ┌──▼────────┐
         │         │ Queries     │   │ Admin     │
         │         │ slow/fail   │   │ panel     │
         │         └──────┬──────┘   │ issues    │
         │                │          └──┬────────┘
         │                │             │
```

---

## Decision Tree

### 1️⃣ Server Won't Start

**Symptom:** `stemedb-api` process exits immediately or won't bind to port

```
Server won't start
    │
    ├─► Port already in use?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Port Conflict"
    │
    ├─► TLS certificate error?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "TLS Error"
    │
    ├─► "No space left on device"?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md)
    │
    ├─► WAL magic byte validation failed?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "WAL Corruption"
    │
    └─► Permission denied errors?
        └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Permissions"
```

**Quick Diagnostic:**
```bash
# Check if port is in use
lsof -i :18180

# Check disk space
df -h

# Check WAL directory permissions
ls -la data/wal/

# View startup logs
journalctl -u stemedb-api -n 50
```
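The branches of the "Server won't start" tree can be sketched as a small log classifier. This is an illustration only, not part of StemeDB: the function name `classify_startup_failure` is hypothetical. The first three patterns are standard OS error strings, but the WAL and TLS patterns are assumptions about how `stemedb-api` words its log messages, so adjust them to match your actual logs.

```bash
# Hypothetical helper: read startup log text on stdin and print the runbook
# (and section) that the decision tree points to. First match wins.
# ASSUMPTION: the 'magic byte', 'tls' and 'certificate' patterns are guesses
# at stemedb-api's log wording, not confirmed strings.
classify_startup_failure() {
  # Lowercase the whole log so matching is case-insensitive.
  _log=$(tr '[:upper:]' '[:lower:]')
  case "$_log" in
    *'address already in use'*)  echo 'server-wont-start.md: Port Conflict' ;;
    *'no space left on device'*) echo 'disk-full.md' ;;
    *'permission denied'*)       echo 'server-wont-start.md: Permissions' ;;
    *'magic byte'*)              echo 'server-wont-start.md: WAL Corruption' ;;
    *tls*|*certificate*)         echo 'server-wont-start.md: TLS Error' ;;
    *)                           echo 'unknown: read the full log' ;;
  esac
}

# Example usage:
#   journalctl -u stemedb-api -n 50 --no-pager | classify_startup_failure
```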

---

### 2️⃣ Queries Are Slow or Failing

**Symptom:** API returns 200 but p99 latency >1s, or queries time out (504)

```
High query latency
    │
    ├─► Metrics show replication_lag_seconds >5?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Replication Lag"
    │
    ├─► Queries to specific shard failing?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Shard Hotspot"
    │
    ├─► Memory usage >90%?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Memory Pressure"
    │
    └─► Random queries fail with "index error"?
        └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Index Corruption"
```

**Quick Diagnostic:**
```bash
# Check query latency metrics
curl http://localhost:18180/metrics | grep stemedb_query_latency_seconds

# Check replication lag (cluster only)
curl http://localhost:18180/metrics | grep replication_lag_seconds

# Check memory usage
free -h
```
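The first branch of the latency tree (replication lag above 5 seconds) can be checked mechanically rather than by eye. The sketch below is a hypothetical helper, not a StemeDB tool: `replication_lag_exceeds` is an invented name, and it assumes the `/metrics` endpoint emits Prometheus text format without trailing timestamps, so that the sample value is the last field on each line.

```bash
# Hypothetical helper: exit 0 if any replication_lag_seconds sample on stdin
# is above the threshold (default 5, the value used in the tree), else exit 1.
# ASSUMPTION: Prometheus text format, no trailing timestamps, no spaces
# inside label values.
replication_lag_exceeds() {
  awk -v limit="${1:-5}" '
    /^#/ { next }    # skip HELP/TYPE comment lines
    $1 ~ /replication_lag_seconds([{]|$)/ && $NF + 0 > limit + 0 { over = 1 }
    END { exit (over ? 0 : 1) }
  '
}

# Example usage:
#   curl -s http://localhost:18180/metrics | replication_lag_exceeds \
#     && echo 'See high-query-latency.md, section "Replication Lag"'
```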

---

### 3️⃣ Admin Dashboard Issues

**Symptom:** Quarantine queue growing, circuit breakers stuck, agents banned

```
Admin issues
    │
    ├─► Quarantine panel shows 100+ pending items?
    │   └─► [Runbook: Quarantine Overflow](./runbooks/quarantine-overflow.md)
    │
    ├─► Circuit breaker shows agent as "OPEN" (banned)?
    │   └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
    │
    └─► Agent getting 429 responses?
        └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
```

**Quick Diagnostic:**
```bash
# Check quarantine queue size
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'

# Check circuit breaker states
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'

# Check metrics
curl http://localhost:18180/metrics | grep -E 'quarantine_pending|circuit_breaker_state'
```

---

### 4️⃣ Disk Space Issues

**Symptom:** Writes fail, "No space left on device" errors, disk >95%

```
Disk full
    │
    ├─► Disk >98% (emergency)?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Emergency Cleanup"
    │
    ├─► WAL directory growing rapidly?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "WAL Cleanup"
    │
    └─► Normal growth, need expansion?
        └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Volume Expansion"
```

**Quick Diagnostic:**
```bash
# Check disk usage
df -h

# Check WAL size
du -sh data/wal/

# Check DB size
du -sh data/db/
```
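The two disk thresholds in this section (>95% in the symptom, >98% for the emergency branch) can be turned into a quick check. This is a minimal sketch under stated assumptions: both function names are hypothetical, and the labels it prints are illustrative shorthand for the tree, not output of any StemeDB tool.

```bash
# Print the usage percentage (integer, no % sign) of the filesystem holding
# the given path. df -P uses the portable POSIX layout, where capacity is
# column 5 of the second line.
disk_usage_pct() {
  df -P "${1:-.}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: map a usage percentage to the thresholds used above.
classify_disk_usage() {
  if [ "$1" -gt 98 ]; then
    echo 'emergency: disk-full.md, Emergency Cleanup'
  elif [ "$1" -gt 95 ]; then
    echo 'high: disk-full.md'
  else
    echo 'ok'
  fi
}

# Example usage:
#   classify_disk_usage "$(disk_usage_pct data/)"
```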

---

### 5️⃣ Data Loss / Corruption

**Symptom:** Need to restore from backup, data inconsistency, WAL corruption

```
Data issues
    │
    ├─► Need to restore from backup?
    │   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
    │
    ├─► WAL corruption detected on startup?
    │   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
    │
    └─► Assertion count doesn't match expectations?
        └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md) - Validate backup integrity
```

**Quick Diagnostic:**
```bash
# Check health endpoint
curl http://localhost:18180/v1/health

# List available backups
ls -lh backups/

# Inspect backup metadata (replace YYYYMMDD-HHMMSS with a real backup timestamp)
cat backups/stemedb-backup-YYYYMMDD-HHMMSS/metadata.json
```

---

### 6️⃣ Cluster Operations

**Symptom:** Need to add node, node failed, rebalancing needed

```
Cluster ops
    │
    ├─► Adding first cluster nodes (1→3 migration)?
    │   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Bootstrap Cluster"
    │
    ├─► Adding node to existing cluster?
    │   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Join Existing"
    │
    └─► Replacing failed node?
        └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Replace Failed"
```

**Quick Diagnostic:**
```bash
# Check cluster membership (SWIM)
curl http://localhost:18181/cluster/members

# Check replication status
curl http://localhost:18180/metrics | grep replication

# Check SWIM gossip health
curl http://localhost:18183/swim/health
```

---

## Incident Priority Matrix

| Priority | Response Time | Examples |
|----------|---------------|----------|
| **P0 - Critical** | <15 min | Server down, data loss, complete outage |
| **P1 - High** | <1 hour | High latency (p99 >1s), circuit breakers stuck, disk >95% |
| **P2 - Medium** | <4 hours | Quarantine overflow, single node down (cluster), replication lag |
| **P3 - Low** | <24 hours | Performance tuning, proactive capacity planning |

---

## Common Metrics to Check

**Always check these first:**

```bash
# Health endpoint
curl http://localhost:18180/v1/health

# Key metrics
curl http://localhost:18180/metrics | grep -E '(stemedb_query_latency|wal_fsync_latency|quarantine_pending|circuit_breaker_state|replication_lag)'

# Recent logs
journalctl -u stemedb-api -n 100 --no-pager
```

---

## Escalation Path

**If the runbook doesn't resolve the incident:**

1. **Document what you tried** - Commands run, outputs observed
2. **Collect diagnostic bundle:**
   ```bash
   # Create diagnostic bundle (run from the StemeDB working directory)
   BUNDLE_DIR="incident-$(date +%Y%m%d-%H%M%S)"
   mkdir "$BUNDLE_DIR"
   cd "$BUNDLE_DIR"

   # Collect logs
   journalctl -u stemedb-api -n 1000 > logs.txt

   # Collect metrics
   curl http://localhost:18180/metrics > metrics.txt

   # Collect health
   curl http://localhost:18180/v1/health > health.json

   # Collect config
   env | grep STEMEDB > config.env

   # Collect disk usage (data/ is one level up from the bundle directory)
   df -h > disk.txt
   du -sh ../data/* > data-usage.txt
   ```
3. **Escalate** with diagnostic bundle to:
   - Engineering team Slack channel
   - On-call engineer (PagerDuty/Opsgenie)
   - Support ticket with bundle attached

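Attaching the bundle to a ticket or chat message is easier when it is a single file. The sketch below is one possible way to pack it, assuming a directory such as `incident-20260211-101500` created by the collection step; the function name `archive_bundle` is hypothetical and not part of StemeDB.

```bash
# Hypothetical helper: pack a bundle directory into <dir>.tar.gz next to it
# and print the archive path on success.
archive_bundle() {
  _dir="${1%/}"    # strip one trailing slash, if any
  [ -d "$_dir" ] || { echo "not a directory: $_dir" >&2; return 1; }
  # -C switches to the parent directory so the archive holds relative paths.
  tar -czf "$_dir.tar.gz" -C "$(dirname "$_dir")" "$(basename "$_dir")" &&
    echo "$_dir.tar.gz"
}

# Example usage (from the directory that contains the bundle):
#   archive_bundle incident-20260211-101500
```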
---

## Related Documentation

- [Operations Hub](./README.md) - Main operations documentation
- [All Runbooks](./runbooks/) - Incident response procedures
- [Reference Architectures](./reference-architecture/) - Deployment models
- [Production Readiness](../../uat/production-readiness/README.md) - Pre-deployment validation

---

**Last Updated:** 2026-02-11