stemedb/docs/operations/troubleshooting-flowchart.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

308 lines
8.8 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# StemeDB Troubleshooting Flowchart
**Decision tree: Symptom → Cause → Runbook**
Use this flowchart to quickly identify the right runbook for your incident.
---
## Start Here: What's the Symptom?
```
┌─────────────────────────────────────────┐
│ What observable problem are you seeing? │
└─────────────────────────────────────────┘
┌───────────┴───────────┐
│ │
┌─────▼──────┐ ┌─────▼──────┐
│ Server │ │ Service is │
│ won't │ │ running │
│ start │ │ but slow │
└─────┬──────┘ └─────┬──────┘
│ │
│ ┌──────┴──────┐
│ │ │
│ ┌──────▼──────┐ ┌──▼────────┐
│ │ Queries │ │ Admin │
│ │ slow/fail │ │ panel │
│ └──────┬──────┘ │ issues │
│ │ └──┬────────┘
│ │ │
```
---
## Decision Tree
### 1⃣ Server Won't Start
**Symptom:** `stemedb-api` process exits immediately or won't bind to port
```
Server won't start
├─► Port already in use?
│ └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Port Conflict"
├─► TLS certificate error?
│ └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "TLS Error"
├─► "No space left on device"?
│ └─► [Runbook: Disk Full](./runbooks/disk-full.md)
├─► WAL magic byte validation failed?
│ └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "WAL Corruption"
└─► Permission denied errors?
└─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Permissions"
```
**Quick Diagnostic:**
```bash
# Check if port is in use
lsof -i :18180
# Check disk space
df -h
# Check WAL directory permissions
ls -la data/wal/
# View startup logs
journalctl -u stemedb-api -n 50
```
---
### 2⃣ Queries Are Slow or Failing
**Symptom:** API returns 200 but p99 latency >1s, or queries timeout (504)
```
High query latency
├─► Metrics show replication_lag_seconds >5?
│ └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Replication Lag"
├─► Queries to specific shard failing?
│ └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Shard Hotspot"
├─► Memory usage >90%?
│ └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Memory Pressure"
└─► Random queries fail with "index error"?
└─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Index Corruption"
```
**Quick Diagnostic:**
```bash
# Check query latency metrics
curl http://localhost:18180/metrics | grep stemedb_query_latency_seconds
# Check replication lag (cluster only)
curl http://localhost:18180/metrics | grep replication_lag_seconds
# Check memory usage
free -h
```
---
### 3⃣ Admin Dashboard Issues
**Symptom:** Quarantine queue growing, circuit breakers stuck, agents banned
```
Admin issues
├─► Quarantine panel shows 100+ pending items?
│ └─► [Runbook: Quarantine Overflow](./runbooks/quarantine-overflow.md)
├─► Circuit breaker shows agent as "OPEN" (banned)?
│ └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
└─► Agent getting 429 responses?
└─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
```
**Quick Diagnostic:**
```bash
# Check quarantine queue size
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
# Check circuit breaker states
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
# Check metrics
curl http://localhost:18180/metrics | grep -E 'quarantine_pending|circuit_breaker_state'
```
---
### 4⃣ Disk Space Issues
**Symptom:** Writes fail, "No space left on device" errors, disk >95%
```
Disk full
├─► Disk >98% (emergency)?
│ └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Emergency Cleanup"
├─► WAL directory growing rapidly?
│ └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "WAL Cleanup"
└─► Normal growth, need expansion?
└─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Volume Expansion"
```
**Quick Diagnostic:**
```bash
# Check disk usage
df -h
# Check WAL size
du -sh data/wal/
# Check DB size
du -sh data/db/
```
---
### 5⃣ Data Loss / Corruption
**Symptom:** Need to restore from backup, data inconsistency, WAL corruption
```
Data issues
├─► Need to restore from backup?
│ └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
├─► WAL corruption detected on startup?
│ └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
└─► Assertion count doesn't match expectations?
└─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md) - Validate backup integrity
```
**Quick Diagnostic:**
```bash
# Check health endpoint
curl http://localhost:18180/v1/health
# List available backups
ls -lh backups/
# Verify backup integrity
cat backups/stemedb-backup-YYYYMMDD-HHMMSS/metadata.json
```
---
### 6⃣ Cluster Operations
**Symptom:** Need to add node, node failed, rebalancing needed
```
Cluster ops
├─► Adding first cluster nodes (1→3 migration)?
│ └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Bootstrap Cluster"
├─► Adding node to existing cluster?
│ └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Join Existing"
└─► Replacing failed node?
└─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Replace Failed"
```
**Quick Diagnostic:**
```bash
# Check cluster membership (SWIM)
curl http://localhost:18181/cluster/members
# Check replication status
curl http://localhost:18180/metrics | grep replication
# Check SWIM gossip health
curl http://localhost:18183/swim/health
```
---
## Incident Priority Matrix
| Priority | Response Time | Examples |
|----------|---------------|----------|
| **P0 - Critical** | <15 min | Server down, data loss, complete outage |
| **P1 - High** | <1 hour | High latency (p99 >1s), circuit breakers stuck, disk >95% |
| **P2 - Medium** | <4 hours | Quarantine overflow, single node down (cluster), replication lag |
| **P3 - Low** | <24 hours | Performance tuning, proactive capacity planning |
---
## Common Metrics to Check
**Always check these first:**
```bash
# Health endpoint
curl http://localhost:18180/v1/health
# Key metrics
curl http://localhost:18180/metrics | grep -E '(stemedb_query_latency|wal_fsync_latency|quarantine_pending|circuit_breaker_state|replication_lag)'
# Recent logs
journalctl -u stemedb-api -n 100 --no-pager
```
---
## Escalation Path
**If runbook doesn't resolve incident:**
1. **Document what you tried** - Commands run, outputs observed
2. **Collect diagnostic bundle:**
```bash
# Create diagnostic bundle
mkdir incident-$(date +%Y%m%d-%H%M%S)
cd incident-*
# Collect logs
journalctl -u stemedb-api -n 1000 > logs.txt
# Collect metrics
curl http://localhost:18180/metrics > metrics.txt
# Collect health
curl http://localhost:18180/v1/health > health.json
# Collect config
env | grep STEMEDB > config.env
# Collect disk usage
df -h > disk.txt
du -sh data/* > data-usage.txt
```
3. **Escalate** with diagnostic bundle to:
- Engineering team Slack channel
- On-call engineer (PagerDuty/Opsgenie)
- Support ticket with bundle attached
---
## Related Documentation
- [Operations Hub](./README.md) - Main operations documentation
- [All Runbooks](./runbooks/) - Incident response procedures
- [Reference Architectures](./reference-architecture/) - Deployment models
- [Production Readiness](../../uat/production-readiness/README.md) - Pre-deployment validation
---
**Last Updated:** 2026-02-11