# StemeDB Troubleshooting Flowchart

**Decision tree: Symptom → Cause → Runbook**

Use this flowchart to quickly identify the right runbook for your incident.

---

## Start Here: What's the Symptom?

```
┌─────────────────────────────────────────┐
│ What observable problem are you seeing? │
└─────────────────────────────────────────┘
                    │
        ┌───────────┴───────────┐
        │                       │
  ┌─────▼──────┐         ┌─────▼──────┐
  │ Server     │         │ Service is │
  │ won't      │         │ running    │
  │ start      │         │ but slow   │
  └─────┬──────┘         └─────┬──────┘
        │                       │
        │                ┌──────┴──────┐
        │                │             │
        │         ┌──────▼──────┐  ┌──▼────────┐
        │         │ Queries     │  │ Admin     │
        │         │ slow/fail   │  │ panel     │
        │         └──────┬──────┘  │ issues    │
        │                │         └──┬────────┘
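        ▼                ▼            ▼
    Section 1        Section 2    Section 3
```

Disk space problems, data loss or corruption, and cluster operations are not shown above; go directly to sections 4, 5, and 6 of the decision tree.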

---

## Decision Tree

### 1️⃣ Server Won't Start

**Symptom:** `stemedb-api` process exits immediately or fails to bind to its port

```
Server won't start
    │
    ├─► Port already in use?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Port Conflict"
    │
    ├─► TLS certificate error?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "TLS Error"
    │
    ├─► "No space left on device"?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md)
    │
    ├─► WAL magic byte validation failed?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "WAL Corruption"
    │
    └─► Permission denied errors?
        └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Permissions"
```

**Quick Diagnostic:**
```bash
# Check if port is in use
lsof -i :18180

# Check disk space
df -h

# Check WAL directory permissions
ls -la data/wal/

# View startup logs
journalctl -u stemedb-api -n 50
```
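
To triage a batch of startup log lines against the tree above, a small classifier can help. This is a minimal sketch: the function name `classify_startup_error` is hypothetical, and the match patterns are assumptions about how the errors are worded in your logs, not confirmed StemeDB output.

```bash
# Map one startup log line to a branch of the "Server Won't Start" tree.
# The patterns below are ASSUMED log wordings; adjust them to match the
# messages your stemedb-api build actually emits.
classify_startup_error() {
  local line="$1"
  case "$line" in
    *"ddress already in use"*)    echo "Port Conflict" ;;
    *"No space left on device"*)  echo "Disk Full" ;;
    *[Pp]"ermission denied"*)     echo "Permissions" ;;
    *"WAL"*"magic"*)              echo "WAL Corruption" ;;
    *TLS*|*certificate*)          echo "TLS Error" ;;
    *)                            echo "unknown" ;;
  esac
}

# Usage: summarize the last 50 startup log lines by category
# journalctl -u stemedb-api -n 50 --no-pager | while IFS= read -r l; do
#   classify_startup_error "$l"
# done | sort | uniq -c
```

Most lines will fall into `unknown`; any other category that appears points at the runbook section to open first.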

---

### 2️⃣ Queries Are Slow or Failing

**Symptom:** API returns 200 but p99 latency >1s, or queries time out (504)

```
High query latency
    │
    ├─► Metrics show replication_lag_seconds >5?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Replication Lag"
    │
    ├─► Queries to specific shard failing?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Shard Hotspot"
    │
    ├─► Memory usage >90%?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Memory Pressure"
    │
    └─► Random queries fail with "index error"?
        └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Index Corruption"
```

**Quick Diagnostic:**
```bash
# Check query latency metrics
curl -s http://localhost:18180/metrics | grep stemedb_query_latency_seconds

# Check replication lag (cluster only)
curl -s http://localhost:18180/metrics | grep replication_lag_seconds

# Check memory usage
free -h
```
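
The `replication_lag_seconds >5` branch can be checked without reading the raw metrics by eye. The sketch below assumes the `/metrics` endpoint serves the usual Prometheus text format (`name{labels} value`, with no spaces inside label values); `metric_exceeds` is a hypothetical helper name, not part of StemeDB.

```bash
# Read Prometheus-style text on stdin; exit 0 if any sample of the named
# metric is above the threshold, exit 1 otherwise.
# ASSUMPTION: "name{labels} value" lines with no spaces inside the labels.
metric_exceeds() {
  local name="$1" threshold="$2"
  awk -v name="$name" -v t="$threshold" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the label set, if any
      if (metric == name && $2 + 0 > t + 0) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
# curl -s http://localhost:18180/metrics \
#   | metric_exceeds replication_lag_seconds 5 \
#   && echo 'Lag above 5s: see section "Replication Lag"'
```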

---

### 3️⃣ Admin Dashboard Issues

**Symptom:** Quarantine queue growing, circuit breakers stuck, agents banned

```
Admin issues
    │
    ├─► Quarantine panel shows 100+ pending items?
    │   └─► [Runbook: Quarantine Overflow](./runbooks/quarantine-overflow.md)
    │
    ├─► Circuit breaker shows agent as "OPEN" (banned)?
    │   └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
    │
    └─► Agent getting 429 responses?
        └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
```

**Quick Diagnostic:**
```bash
# Check quarantine queue size
curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | length'

# Check circuit breaker states
curl -s http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'

# Check metrics
curl -s http://localhost:18180/metrics | grep -E 'quarantine_pending|circuit_breaker_state'
```

---

### 4️⃣ Disk Space Issues

**Symptom:** Writes fail, "No space left on device" errors, disk >95%

```
Disk full
    │
    ├─► Disk >98% (emergency)?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Emergency Cleanup"
    │
    ├─► WAL directory growing rapidly?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "WAL Cleanup"
    │
    └─► Normal growth, need expansion?
        └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Volume Expansion"
```

**Quick Diagnostic:**
```bash
# Check disk usage
df -h

# Check WAL size
du -sh data/wal/

# Check DB size
du -sh data/db/
```
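
The thresholds in this section (above 98% is an emergency, above 95% is the symptom level) can be turned into a quick check. A minimal sketch, with `disk_severity` as a hypothetical helper name and the labels `emergency`, `high`, and `ok` chosen only for illustration:

```bash
# Classify a disk usage percentage using this section's thresholds:
# above 98% -> emergency, above 95% -> high, otherwise ok.
disk_severity() {
  local pct="${1%\%}"          # accept "97" or "97%"
  if   [ "$pct" -gt 98 ]; then echo "emergency"
  elif [ "$pct" -gt 95 ]; then echo "high"
  else                         echo "ok"
  fi
}

# Usage (in POSIX `df -P` output, column 5 is the capacity percentage):
# disk_severity "$(df -P data/ | awk 'NR==2 {print $5}')"
```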

---

### 5️⃣ Data Loss / Corruption

**Symptom:** Need to restore from backup, data inconsistency, WAL corruption

```
Data issues
    │
    ├─► Need to restore from backup?
    │   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
    │
    ├─► WAL corruption detected on startup?
    │   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
    │
    └─► Assertion count doesn't match expectations?
        └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md) - Validate backup integrity
```

**Quick Diagnostic:**
```bash
# Check health endpoint
curl http://localhost:18180/v1/health

# List available backups
ls -lh backups/

# Inspect backup metadata (replace YYYYMMDD-HHMMSS with the backup's timestamp)
cat backups/stemedb-backup-YYYYMMDD-HHMMSS/metadata.json
```
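
Because the backup directory names embed a `YYYYMMDD-HHMMSS` timestamp, plain text sorting orders them chronologically. The sketch below uses that to find the newest backup; `latest_backup` is a hypothetical helper, and it assumes every backup follows the `stemedb-backup-` naming shown above.

```bash
# Print the newest backup directory under the given path (default: backups).
# Relies on stemedb-backup-YYYYMMDD-HHMMSS names sorting chronologically.
# Returns non-zero if no backup directory is found.
latest_backup() {
  local dir="${1:-backups}" newest="" d
  for d in "$dir"/stemedb-backup-*/; do
    [ -d "$d" ] || continue        # glob matched nothing
    newest="${d%/}"                # globs expand in sorted order; keep the last
  done
  [ -n "$newest" ] && echo "$newest"
}

# Usage:
# cat "$(latest_backup)/metadata.json"
```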

---

### 6️⃣ Cluster Operations

**Symptom:** Need to add node, node failed, rebalancing needed

```
Cluster ops
    │
    ├─► Adding first cluster nodes (1→3 migration)?
    │   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Bootstrap Cluster"
    │
    ├─► Adding node to existing cluster?
    │   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Join Existing"
    │
    └─► Replacing failed node?
        └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Replace Failed"
```

**Quick Diagnostic:**
```bash
# Check cluster membership (SWIM)
curl http://localhost:18181/cluster/members

# Check replication status
curl -s http://localhost:18180/metrics | grep replication

# Check SWIM gossip health
curl http://localhost:18183/swim/health
```

---

## Incident Priority Matrix

| Priority | Response Time | Examples |
|----------|---------------|----------|
| **P0 - Critical** | <15 min | Server down, data loss, complete outage |
| **P1 - High** | <1 hour | High latency (p99 >1s), circuit breakers stuck, disk >95% |
| **P2 - Medium** | <4 hours | Quarantine overflow, single node down (cluster), replication lag |
| **P3 - Low** | <24 hours | Performance tuning, proactive capacity planning |

---

## Common Metrics to Check

**Always check these first:**

```bash
# Health endpoint
curl http://localhost:18180/v1/health

# Key metrics
curl -s http://localhost:18180/metrics | grep -E '(stemedb_query_latency|wal_fsync_latency|quarantine_pending|circuit_breaker_state|replication_lag)'

# Recent logs
journalctl -u stemedb-api -n 100 --no-pager
```

---

## Escalation Path

**If the runbook doesn't resolve the incident:**

1. **Document what you tried** - Commands run, outputs observed
2. **Collect diagnostic bundle:**
   ```bash
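   # Create the bundle directory. Run this from the StemeDB working
   # directory so the relative data/ path below resolves.
   BUNDLE="incident-$(date +%Y%m%d-%H%M%S)"
   mkdir -p "$BUNDLE"

   # Collect logs
   journalctl -u stemedb-api -n 1000 --no-pager > "$BUNDLE/logs.txt"

   # Collect metrics
   curl -s http://localhost:18180/metrics > "$BUNDLE/metrics.txt"

   # Collect health
   curl -s http://localhost:18180/v1/health > "$BUNDLE/health.json"

   # Collect config (may contain secrets; review before sharing)
   env | grep STEMEDB > "$BUNDLE/config.env"

   # Collect disk usage
   df -h > "$BUNDLE/disk.txt"
   du -sh data/* > "$BUNDLE/data-usage.txt"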
   ```
3. **Escalate** with diagnostic bundle to:
   - Engineering team Slack channel
   - On-call engineer (PagerDuty/Opsgenie)
   - Support ticket with bundle attached
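
To attach the bundle from step 2 to a ticket, it helps to pack the directory into a single archive. A minimal sketch using standard `tar`; `pack_bundle` is a hypothetical helper name, not a StemeDB tool.

```bash
# Pack a diagnostic bundle directory into <dir>.tar.gz next to it and print
# the archive path. Returns non-zero if the directory does not exist.
pack_bundle() {
  local dir="${1%/}"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" &&
    echo "$dir.tar.gz"
}

# Usage:
# pack_bundle incident-20260211-101500
```

Using `-C` with the parent directory keeps paths inside the archive relative, so the recipient extracts a single `incident-...` folder.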

---

## Related Documentation

- [Operations Hub](./README.md) - Main operations documentation
- [All Runbooks](./runbooks/) - Incident response procedures
- [Reference Architectures](./reference-architecture/) - Deployment models
- [Production Readiness](../../uat/production-readiness/README.md) - Pre-deployment validation

---

**Last Updated:** 2026-02-11