# StemeDB Troubleshooting Flowchart

**Decision tree: Symptom → Cause → Runbook**

Use this flowchart to quickly identify the right runbook for your incident.

---

## Start Here: What's the Symptom?

```
┌─────────────────────────────────────────┐
│ What observable problem are you seeing? │
└─────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
   ┌─────▼──────┐          ┌─────▼──────┐
   │ Server     │          │ Service is │
   │ won't      │          │ running    │
   │ start      │          │ but slow   │
   └─────┬──────┘          └─────┬──────┘
         │                       │
         │                ┌──────┴──────┐
         │                │             │
         │         ┌──────▼──────┐   ┌──▼────────┐
         │         │ Queries     │   │ Admin     │
         │         │ slow/fail   │   │ panel     │
         │         └──────┬──────┘   │ issues    │
         │                │          └──┬────────┘
         │                │             │
```

---

## Decision Tree

### 1️⃣ Server Won't Start

**Symptom:** `stemedb-api` process exits immediately or won't bind to its port

```
Server won't start
│
├─► Port already in use?
│   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Port Conflict"
│
├─► TLS certificate error?
│   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "TLS Error"
│
├─► "No space left on device"?
│   └─► [Runbook: Disk Full](./runbooks/disk-full.md)
│
├─► WAL magic byte validation failed?
│   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "WAL Corruption"
│
└─► Permission denied errors?
    └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Permissions"
```

**Quick Diagnostic:**

```bash
# Check if port is in use
lsof -i :18180

# Check disk space
df -h

# Check WAL directory permissions
ls -la data/wal/

# View startup logs
journalctl -u stemedb-api -n 50
```

---

### 2️⃣ Queries Are Slow or Failing

**Symptom:** API returns 200 but p99 latency is >1s, or queries time out (504)

```
High query latency
│
├─► Metrics show replication_lag_seconds >5?
│   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Replication Lag"
│
├─► Queries to specific shard failing?
│   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Shard Hotspot"
│
├─► Memory usage >90%?
│   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Memory Pressure"
│
└─► Random queries fail with "index error"?
    └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Index Corruption"
```

**Quick Diagnostic:**

```bash
# Check query latency metrics
curl -s http://localhost:18180/metrics | grep stemedb_query_latency_seconds

# Check replication lag (cluster only)
curl -s http://localhost:18180/metrics | grep replication_lag_seconds

# Check memory usage
free -h
```

---

### 3️⃣ Admin Dashboard Issues

**Symptom:** Quarantine queue growing, circuit breakers stuck, agents banned

```
Admin issues
│
├─► Quarantine panel shows 100+ pending items?
│   └─► [Runbook: Quarantine Overflow](./runbooks/quarantine-overflow.md)
│
├─► Circuit breaker shows agent as "OPEN" (banned)?
│   └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
│
└─► Agent getting 429 responses?
    └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
```

**Quick Diagnostic:**

```bash
# Check quarantine queue size
curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | length'

# List circuit breakers that are not CLOSED
curl -s http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'

# Check metrics
curl -s http://localhost:18180/metrics | grep -E 'quarantine_pending|circuit_breaker_state'
```

---

### 4️⃣ Disk Space Issues

**Symptom:** Writes fail, "No space left on device" errors, disk >95%

```
Disk full
│
├─► Disk >98% (emergency)?
│   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Emergency Cleanup"
│
├─► WAL directory growing rapidly?
│   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "WAL Cleanup"
│
└─► Normal growth, need expansion?
    └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Volume Expansion"
```

**Quick Diagnostic:**

```bash
# Check disk usage
df -h

# Check WAL size
du -sh data/wal/

# Check DB size
du -sh data/db/
```

---

### 5️⃣ Data Loss / Corruption

**Symptom:** Need to restore from backup, data inconsistency, WAL corruption

```
Data issues
│
├─► Need to restore from backup?
│   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
│
├─► WAL corruption detected on startup?
│   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
│
└─► Assertion count doesn't match expectations?
    └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md) - Validate backup integrity
```

**Quick Diagnostic:**

```bash
# Check health endpoint
curl http://localhost:18180/v1/health

# List available backups
ls -lh backups/

# Inspect backup metadata (substitute the backup's actual timestamp)
cat backups/stemedb-backup-YYYYMMDD-HHMMSS/metadata.json
```

---

### 6️⃣ Cluster Operations

**Symptom:** Need to add a node, a node failed, or rebalancing is needed

```
Cluster ops
│
├─► Adding first cluster nodes (1→3 migration)?
│   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Bootstrap Cluster"
│
├─► Adding node to existing cluster?
│   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Join Existing"
│
└─► Replacing failed node?
    └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Replace Failed"
```

**Quick Diagnostic:**

```bash
# Check cluster membership (SWIM)
curl http://localhost:18181/cluster/members

# Check replication status
curl -s http://localhost:18180/metrics | grep replication

# Check SWIM gossip health
curl http://localhost:18183/swim/health
```

---

## Incident Priority Matrix

| Priority | Response Time | Examples |
|----------|---------------|----------|
| **P0 - Critical** | <15 min | Server down, data loss, complete outage |
| **P1 - High** | <1 hour | High latency (p99 >1s), circuit breakers stuck, disk >95% |
| **P2 - Medium** | <4 hours | Quarantine overflow, single node down (cluster), replication lag |
| **P3 - Low** | <24 hours | Performance tuning, proactive capacity planning |

---

## Common Metrics to Check

**Always check these first:**

```bash
# Health endpoint
curl http://localhost:18180/v1/health

# Key metrics
curl -s http://localhost:18180/metrics | grep -E '(stemedb_query_latency|wal_fsync_latency|quarantine_pending|circuit_breaker_state|replication_lag)'

# Recent logs
journalctl -u stemedb-api -n 100 --no-pager
```

---

## Escalation Path

**If the runbook doesn't resolve the incident:**

1. **Document what you tried** - Commands run, outputs observed
2. **Collect diagnostic bundle:**
   ```bash
   # Create the bundle directory and keep its name for the commands below.
   # Run everything from the directory that contains data/.
   bundle="incident-$(date +%Y%m%d-%H%M%S)"
   mkdir -p "$bundle"

   # Collect logs
   journalctl -u stemedb-api -n 1000 --no-pager > "$bundle/logs.txt"

   # Collect metrics
   curl -s http://localhost:18180/metrics > "$bundle/metrics.txt"

   # Collect health
   curl -s http://localhost:18180/v1/health > "$bundle/health.json"

   # Collect config
   env | grep STEMEDB > "$bundle/config.env"

   # Collect disk usage
   df -h > "$bundle/disk.txt"
   du -sh data/* > "$bundle/data-usage.txt"
   ```
3. **Escalate** with diagnostic bundle to:
   - Engineering team Slack channel
   - On-call engineer (PagerDuty/Opsgenie)
   - Support ticket with bundle attached

---

## Related Documentation

- [Operations Hub](./README.md) - Main operations documentation
- [All Runbooks](./runbooks/) - Incident response procedures
- [Reference Architectures](./reference-architecture/) - Deployment models
- [Production Readiness](../../uat/production-readiness/README.md) - Pre-deployment validation

---

**Last Updated:** 2026-02-11
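
The numeric thresholds in the decision tree (disk >98% and >95%, memory >90%, `replication_lag_seconds` >5, 100+ quarantined items) can be sketched as a small lookup helper. This is an illustrative sketch, not part of StemeDB: the function names `num_test` and `suggest_runbook` and the signal labels `disk_pct` and `memory_pct` are assumptions made up for this example, and you would need to supply the observed values yourself.

```bash
#!/usr/bin/env bash
# Sketch only: disk_pct and memory_pct are hypothetical labels for this example;
# the thresholds and runbook paths are the ones listed in the decision tree.

# Succeeds when "<a> <op> <b>" holds numerically, e.g. num_test 5.5 ">" 5.
# awk performs the comparison so fractional values work.
num_test() {
  awk -v a="$1" -v b="$3" "BEGIN { exit (a $2 b) ? 0 : 1 }"
}

# Prints the runbook for one observed signal; returns 1 if no threshold is crossed.
suggest_runbook() {
  local signal="$1" value="$2"
  case "$signal" in
    disk_pct)
      if num_test "$value" ">" 98; then
        echo './runbooks/disk-full.md - Section "Emergency Cleanup"'; return 0
      elif num_test "$value" ">" 95; then
        echo './runbooks/disk-full.md'; return 0
      fi ;;
    memory_pct)
      if num_test "$value" ">" 90; then
        echo './runbooks/high-query-latency.md - Section "Memory Pressure"'; return 0
      fi ;;
    replication_lag_seconds)
      if num_test "$value" ">" 5; then
        echo './runbooks/high-query-latency.md - Section "Replication Lag"'; return 0
      fi ;;
    quarantine_pending)
      if num_test "$value" ">=" 100; then
        echo './runbooks/quarantine-overflow.md'; return 0
      fi ;;
  esac
  echo 'no threshold crossed'
  return 1
}
```

For example, `suggest_runbook replication_lag_seconds 7.2` prints the "Replication Lag" section of the High Query Latency runbook. Symptoms without a numeric threshold (TLS errors, permission errors, 429 responses) still need the tree itself.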