stemedb/docs/operations/troubleshooting-flowchart.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00


StemeDB Troubleshooting Flowchart

Decision tree: Symptom → Cause → Runbook

Use this flowchart to quickly identify the right runbook for your incident.


Start Here: What's the Symptom?

┌─────────────────────────────────────────┐
│ What observable problem are you seeing? │
└─────────────────────────────────────────┘
                    │
        ┌───────────┴───────────┐
        │                       │
  ┌─────▼──────┐         ┌─────▼──────┐
  │ Server     │         │ Service is │
  │ won't      │         │ running    │
  │ start      │         │ but slow   │
  └─────┬──────┘         └─────┬──────┘
        │                       │
        │                ┌──────┴──────┐
        │                │             │
        │         ┌──────▼──────┐  ┌──▼────────┐
        │         │ Queries     │  │ Admin     │
        │         │ slow/fail   │  │ panel     │
        │         └──────┬──────┘  │ issues    │
        │                │         └──┬────────┘
        │                │            │

Decision Tree

1. Server Won't Start

Symptom: The stemedb-api process exits immediately or fails to bind to its port

Server won't start
    │
    ├─► Port already in use?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Port Conflict"
    │
    ├─► TLS certificate error?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "TLS Error"
    │
    ├─► "No space left on device"?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md)
    │
    ├─► WAL magic byte validation failed?
    │   └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "WAL Corruption"
    │
    └─► Permission denied errors?
        └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Permissions"

Quick Diagnostic:

# Check if port is in use
lsof -i :18180

# Check disk space
df -h

# Check WAL directory permissions
ls -la data/wal/

# View startup logs
journalctl -u stemedb-api -n 50
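
The branches above can be applied as a first-pass triage of the startup log. The sketch below is hypothetical: `classify_startup_error` is not part of StemeDB, and the TLS and WAL patterns are assumptions about how those log lines are worded. Only the OS-level messages ("Address already in use", "No space left on device", "Permission denied") are standard strings.

```shell
# Hypothetical helper: map one startup log line to the runbook section
# named in the decision tree. The TLS and WAL patterns are illustrative
# guesses, not confirmed StemeDB log output.
classify_startup_error() {
  case "$1" in
    *"Address already in use"*)   echo "server-wont-start.md: Port Conflict" ;;
    *"No space left on device"*)  echo "disk-full.md" ;;
    *"Permission denied"*)        echo "server-wont-start.md: Permissions" ;;
    *[Cc]ertificate*|*TLS*)       echo "server-wont-start.md: TLS Error" ;;
    *[Mm]agic*)                   echo "server-wont-start.md: WAL Corruption" ;;
    *)                            echo "unclassified" ;;
  esac
}
```

One way to use it is to feed it the recent log and count the matches: `journalctl -u stemedb-api -n 50 --no-pager | while IFS= read -r line; do classify_startup_error "$line"; done | sort | uniq -c`. Treat the result as a hint about which runbook section to open first, not as a diagnosis.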

2. Queries Are Slow or Failing

Symptom: API returns 200 but p99 latency is >1s, or queries time out (504)

High query latency
    │
    ├─► Metrics show replication_lag_seconds >5?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Replication Lag"
    │
    ├─► Queries to specific shard failing?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Shard Hotspot"
    │
    ├─► Memory usage >90%?
    │   └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Memory Pressure"
    │
    └─► Random queries fail with "index error"?
        └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Index Corruption"

Quick Diagnostic:

# Check query latency metrics
curl -s http://localhost:18180/metrics | grep stemedb_query_latency_seconds

# Check replication lag (cluster only)
curl -s http://localhost:18180/metrics | grep replication_lag_seconds

# Check memory usage
free -h
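
The "replication_lag_seconds >5" branch can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the `/metrics` endpoint returns standard Prometheus text format; `metric_exceeds` is a hypothetical helper name, and label values containing `}` are not handled.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# exit 0 if any sample of metric $1 has a value greater than threshold $2.
metric_exceeds() {
  awk -v name="$1" -v limit="$2" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      line = $0
      sub(/[{][^}]*[}]/, "", line)      # drop the {label="..."} part
      split(line, field, " ")           # field[1] = name, field[2] = value
      if (field[1] == name && field[2] + 0 > limit + 0) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

For example, `curl -s http://localhost:18180/metrics | metric_exceeds replication_lag_seconds 5 && echo "open the Replication Lag section"`. The metric name must match exactly, so histogram series such as `_sum` or `_bucket` need to be named in full.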

3. Admin Dashboard Issues

Symptom: Quarantine queue growing, circuit breakers stuck, agents banned

Admin issues
    │
    ├─► Quarantine panel shows 100+ pending items?
    │   └─► [Runbook: Quarantine Overflow](./runbooks/quarantine-overflow.md)
    │
    ├─► Circuit breaker shows agent as "OPEN" (banned)?
    │   └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
    │
    └─► Agent getting 429 responses?
        └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)

Quick Diagnostic:

# Check quarantine queue size
curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | length'

# Check circuit breaker states
curl -s http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'

# Check metrics
curl -s http://localhost:18180/metrics | grep -E 'quarantine_pending|circuit_breaker_state'

4. Disk Space Issues

Symptom: Writes fail, "No space left on device" errors, disk >95%

Disk full
    │
    ├─► Disk >98% (emergency)?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Emergency Cleanup"
    │
    ├─► WAL directory growing rapidly?
    │   └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "WAL Cleanup"
    │
    └─► Normal growth, need expansion?
        └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Volume Expansion"

Quick Diagnostic:

# Check disk usage
df -h

# Check WAL size
du -sh data/wal/

# Check DB size
du -sh data/db/
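
The disk thresholds in this section (>98% emergency, >95% symptomatic) can be turned into a small check. This is a sketch only: both function names are hypothetical, and `disk_used_pct` assumes the POSIX `df -P` column layout and a filesystem name without spaces.

```shell
# Hypothetical helper: map a disk-usage percentage to the thresholds used
# in this section (>98% emergency, >95% P1).
disk_severity() {
  if [ "$1" -gt 98 ]; then
    echo "emergency"
  elif [ "$1" -gt 95 ]; then
    echo "p1"
  else
    echo "ok"
  fi
}

# Usage percentage of the filesystem holding path $1: fifth column of the
# second line of `df -P` output, with the trailing % removed.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A typical call would be `disk_severity "$(disk_used_pct data/)"`, run from the directory that contains `data/`.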

5. Data Loss / Corruption

Symptom: Need to restore from backup, data inconsistency, WAL corruption

Data issues
    │
    ├─► Need to restore from backup?
    │   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
    │
    ├─► WAL corruption detected on startup?
    │   └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
    │
    └─► Assertion count doesn't match expectations?
        └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md) - Validate backup integrity

Quick Diagnostic:

# Check health endpoint
curl http://localhost:18180/v1/health

# List available backups
ls -lh backups/

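# Inspect backup metadata (replace YYYYMMDD-HHMMSS with the backup's timestamp);
# full integrity validation is covered in the restore runbook
cat backups/stemedb-backup-YYYYMMDD-HHMMSS/metadata.json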
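
When several backups exist, picking the newest one by hand is error-prone. The sketch below assumes the `stemedb-backup-YYYYMMDD-HHMMSS` directory naming shown above, which sorts chronologically as plain text. `latest_backup` is a hypothetical helper. It only checks that `metadata.json` is present, which is not an integrity check, since the contents of that file are not specified here.

```shell
# Hypothetical helper: print the newest backup directory under $1 that
# contains a metadata.json. Relies on the YYYYMMDD-HHMMSS suffix sorting
# lexicographically in chronological order.
latest_backup() {
  for dir in "$1"/stemedb-backup-*/; do
    # Skip directories without metadata (for example, interrupted backups)
    [ -f "${dir}metadata.json" ] && printf '%s\n' "${dir%/}"
  done | sort | tail -n 1
}
```

For example, `cat "$(latest_backup backups)/metadata.json"` prints the metadata of the most recent backup that has one. The function prints nothing if no such backup exists.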

6. Cluster Operations

Symptom: Need to add node, node failed, rebalancing needed

Cluster ops
    │
    ├─► Adding first cluster nodes (1→3 migration)?
    │   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Bootstrap Cluster"
    │
    ├─► Adding node to existing cluster?
    │   └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Join Existing"
    │
    └─► Replacing failed node?
        └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Replace Failed"

Quick Diagnostic:

# Check cluster membership (SWIM)
curl http://localhost:18181/cluster/members

# Check replication status
curl -s http://localhost:18180/metrics | grep replication

# Check SWIM gossip health
curl http://localhost:18183/swim/health

Incident Priority Matrix

| Priority | Response Time | Examples |
|----------|---------------|----------|
| P0 - Critical | <15 min | Server down, data loss, complete outage |
| P1 - High | <1 hour | High latency (p99 >1s), circuit breakers stuck, disk >95% |
| P2 - Medium | <4 hours | Quarantine overflow, single node down (cluster), replication lag |
| P3 - Low | <24 hours | Performance tuning, proactive capacity planning |

Common Metrics to Check

Always check these first:

# Health endpoint
curl http://localhost:18180/v1/health

# Key metrics
curl -s http://localhost:18180/metrics | grep -E '(stemedb_query_latency|wal_fsync_latency|quarantine_pending|circuit_breaker_state|replication_lag)'

# Recent logs
journalctl -u stemedb-api -n 100 --no-pager

Escalation Path

If the runbook doesn't resolve the incident:

  1. Document what you tried: commands run and outputs observed
  2. Collect diagnostic bundle:
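    # Create the bundle directory and remember its path. Stay in the current
    # directory so that the relative data/ path below still resolves.
    BUNDLE="incident-$(date +%Y%m%d-%H%M%S)"
    mkdir "$BUNDLE"
    
    # Collect logs
    journalctl -u stemedb-api -n 1000 --no-pager > "$BUNDLE/logs.txt"
    
    # Collect metrics
    curl -s http://localhost:18180/metrics > "$BUNDLE/metrics.txt"
    
    # Collect health
    curl -s http://localhost:18180/v1/health > "$BUNDLE/health.json"
    
    # Collect config (review for secrets before sharing)
    env | grep STEMEDB > "$BUNDLE/config.env"
    
    # Collect disk usage
    df -h > "$BUNDLE/disk.txt"
    du -sh data/* > "$BUNDLE/data-usage.txt"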
    
  3. Escalate with diagnostic bundle to:
    • Engineering team Slack channel
    • On-call engineer (PagerDuty/Opsgenie)
    • Support ticket with bundle attached
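
Attaching the bundle to a ticket is easier as a single archive. A minimal sketch, assuming the bundle is a directory like the `incident-YYYYMMDD-HHMMSS` one created in step 2; `pack_bundle` is a hypothetical helper, not an existing StemeDB script.

```shell
# Hypothetical helper: pack bundle directory $1 into $1.tar.gz next to it
# and print the archive path on success.
pack_bundle() {
  bundle="${1%/}"   # tolerate a trailing slash
  tar -czf "$bundle.tar.gz" -C "$(dirname "$bundle")" "$(basename "$bundle")" &&
    printf '%s\n' "$bundle.tar.gz"
}
```

Using `-C` with the parent directory keeps the paths inside the archive relative, so the recipient gets a single `incident-...` folder on extraction. Because the bundle includes environment variables, check it for credentials before sending it outside the team.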


Last Updated: 2026-02-11