StemeDB Troubleshooting Flowchart
Decision tree: Symptom → Cause → Runbook
Use this flowchart to quickly identify the right runbook for your incident.
Start Here: What's the Symptom?
┌─────────────────────────────────────────┐
│ What observable problem are you seeing? │
└─────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
   ┌─────▼──────┐          ┌─────▼──────┐
   │ Server     │          │ Service is │
   │ won't      │          │ running    │
   │ start      │          │ but slow   │
   └─────┬──────┘          └─────┬──────┘
         │                       │
         │                ┌──────┴──────┐
         │                │             │
         │         ┌──────▼──────┐   ┌──▼────────┐
         │         │ Queries     │   │ Admin     │
         │         │ slow/fail   │   │ panel     │
         │         └──────┬──────┘   │ issues    │
         │                │          └──┬────────┘
         │                │             │
Decision Tree
1️⃣ Server Won't Start
Symptom: stemedb-api process exits immediately or won't bind to port
Server won't start
│
├─► Port already in use?
│ └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Port Conflict"
│
├─► TLS certificate error?
│ └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "TLS Error"
│
├─► "No space left on device"?
│ └─► [Runbook: Disk Full](./runbooks/disk-full.md)
│
├─► WAL magic byte validation failed?
│ └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "WAL Corruption"
│
└─► Permission denied errors?
  └─► [Runbook: Server Won't Start](./runbooks/server-wont-start.md) - Section "Permissions"
Quick Diagnostic:
# Check if port is in use
lsof -i :18180
# Check disk space
df -h
# Check WAL directory permissions
ls -la data/wal/
# View startup logs
journalctl -u stemedb-api -n 50
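The branches above can be approximated by matching recent startup log output against the error strings the tree names. The following is a minimal sketch, not part of StemeDB: `classify_startup_error` is a hypothetical helper, and every match pattern except "No space left on device" is an assumption about what the server actually logs, so adjust them to the real messages.

```shell
# Hypothetical helper: read startup log text on stdin and print the runbook
# (and section) that the decision tree above points to.
# The patterns are assumptions about log wording; the first match wins.
classify_startup_error() {
  # Lowercase the input so matching is case-insensitive.
  _log=$(tr '[:upper:]' '[:lower:]')
  case "$_log" in
    *"address already in use"*)  echo 'server-wont-start.md: Port Conflict' ;;
    *"no space left on device"*) echo 'disk-full.md' ;;
    *"permission denied"*)       echo 'server-wont-start.md: Permissions' ;;
    *"magic"*)                   echo 'server-wont-start.md: WAL Corruption' ;;
    *"tls"*|*"certificate"*)     echo 'server-wont-start.md: TLS Error' ;;
    *)                           echo 'unknown' ;;
  esac
}

# Usage (assumes the systemd unit name used elsewhere in this document):
#   journalctl -u stemedb-api -n 50 --no-pager | classify_startup_error
```

An `unknown` result only means none of the assumed patterns matched; read the raw logs before concluding anything.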
2️⃣ Queries Are Slow or Failing
Symptom: API returns 200 but p99 latency is >1s, or queries time out (504)
High query latency
│
├─► Metrics show replication_lag_seconds >5?
│ └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Replication Lag"
│
├─► Queries to specific shard failing?
│ └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Shard Hotspot"
│
├─► Memory usage >90%?
│ └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Memory Pressure"
│
└─► Random queries fail with "index error"?
  └─► [Runbook: High Query Latency](./runbooks/high-query-latency.md) - Section "Index Corruption"
Quick Diagnostic:
# Check query latency metrics
curl http://localhost:18180/metrics | grep stemedb_query_latency_seconds
# Check replication lag (cluster only)
curl http://localhost:18180/metrics | grep replication_lag_seconds
# Check memory usage
free -h
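To turn the "replication_lag_seconds >5" branch into a yes/no check, the metrics output can be filtered with `awk`. This is a sketch under assumptions: `check_replication_lag` is a hypothetical helper, and it assumes the metric appears in Prometheus text format as `replication_lag_seconds` (optionally prefixed `stemedb_`), with no spaces inside label values and no trailing timestamps.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and report
# the largest replication_lag_seconds sample against the 5 s threshold above.
# Assumes the value is the second whitespace-separated field on each line.
check_replication_lag() {
  awk -v limit=5 '
    /^#/ { next }                                  # skip HELP/TYPE comments
    $1 ~ /^(stemedb_)?replication_lag_seconds([{]|$)/ {
      if ($2 + 0 > max) max = $2 + 0               # track the worst sample
      seen = 1
    }
    END {
      if (!seen)       { print "no replication_lag_seconds samples"; exit 2 }
      if (max > limit) { printf "LAG %s\n", max; exit 1 }
      printf "OK %s\n", max
    }'
}

# Usage:
#   curl -s http://localhost:18180/metrics | check_replication_lag
```

Exit status 0 means lag is within the threshold, 1 means the "Replication Lag" section applies, and 2 means no samples were found (expected on a single-node deployment).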
3️⃣ Admin Dashboard Issues
Symptom: Quarantine queue growing, circuit breakers stuck, agents banned
Admin issues
│
├─► Quarantine panel shows 100+ pending items?
│ └─► [Runbook: Quarantine Overflow](./runbooks/quarantine-overflow.md)
│
├─► Circuit breaker shows agent as "OPEN" (banned)?
│ └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
│
└─► Agent getting 429 responses?
  └─► [Runbook: Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)
Quick Diagnostic:
# Check quarantine queue size
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
# Check circuit breaker states
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
# Check metrics
curl http://localhost:18180/metrics | grep -E 'quarantine_pending|circuit_breaker_state'
4️⃣ Disk Space Issues
Symptom: Writes fail, "No space left on device" errors, disk >95%
Disk full
│
├─► Disk >98% (emergency)?
│ └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Emergency Cleanup"
│
├─► WAL directory growing rapidly?
│ └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "WAL Cleanup"
│
└─► Normal growth, need expansion?
  └─► [Runbook: Disk Full](./runbooks/disk-full.md) - Section "Volume Expansion"
Quick Diagnostic:
# Check disk usage
df -h
# Check WAL size
du -sh data/wal/
# Check DB size
du -sh data/db/
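The percentage thresholds in this section (>95% as the symptom, >98% as the emergency branch) can be checked mechanically. The sketch below uses two hypothetical helpers, `disk_use_percent` and `disk_severity`; neither ships with StemeDB, and the severity labels are illustrative names rather than runbook terms.

```shell
# Hypothetical helper: print the used-capacity percentage (digits only) of the
# filesystem holding the given path. Relies on POSIX `df -P` column layout and
# assumes the device name contains no spaces.
disk_use_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: map a whole-number percentage to the branch to follow.
disk_severity() {
  if [ "$1" -gt 98 ]; then
    echo 'emergency'      # go to "Emergency Cleanup"
  elif [ "$1" -gt 95 ]; then
    echo 'investigate'    # check WAL growth, then cleanup or expansion
  else
    echo 'ok'
  fi
}

# Usage:
#   disk_severity "$(disk_use_percent data/)"
```

Telling "WAL Cleanup" from "Volume Expansion" still requires comparing the `du` output above over time; a single percentage cannot show which directory is growing.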
5️⃣ Data Loss / Corruption
Symptom: Need to restore from backup, data inconsistency, WAL corruption
Data issues
│
├─► Need to restore from backup?
│ └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
│
├─► WAL corruption detected on startup?
│ └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md)
│
└─► Assertion count doesn't match expectations?
  └─► [Runbook: Restore from Backup](./runbooks/restore-from-backup.md) - Validate backup integrity
Quick Diagnostic:
# Check health endpoint
curl http://localhost:18180/v1/health
# List available backups
ls -lh backups/
# Inspect backup metadata (replace YYYYMMDD-HHMMSS with an actual backup timestamp)
cat backups/stemedb-backup-YYYYMMDD-HHMMSS/metadata.json
6️⃣ Cluster Operations
Symptom: Need to add node, node failed, rebalancing needed
Cluster ops
│
├─► Adding first cluster nodes (1→3 migration)?
│ └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Bootstrap Cluster"
│
├─► Adding node to existing cluster?
│ └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Join Existing"
│
└─► Replacing failed node?
  └─► [Runbook: Add Node](./runbooks/add-node.md) - Section "Replace Failed"
Quick Diagnostic:
# Check cluster membership (SWIM)
curl http://localhost:18181/cluster/members
# Check replication status
curl http://localhost:18180/metrics | grep replication
# Check SWIM gossip health
curl http://localhost:18183/swim/health
Incident Priority Matrix
| Priority | Response Time | Examples |
|---|---|---|
| P0 - Critical | <15 min | Server down, data loss, complete outage |
| P1 - High | <1 hour | High latency (p99 >1s), circuit breakers stuck, disk >95% |
| P2 - Medium | <4 hours | Quarantine overflow, single node down (cluster), replication lag |
| P3 - Low | <24 hours | Performance tuning, proactive capacity planning |
Common Metrics to Check
Always check these first:
# Health endpoint
curl http://localhost:18180/v1/health
# Key metrics
curl http://localhost:18180/metrics | grep -E '(stemedb_query_latency|wal_fsync_latency|quarantine_pending|circuit_breaker_state|replication_lag)'
# Recent logs
journalctl -u stemedb-api -n 100 --no-pager
Escalation Path
If the runbook doesn't resolve the incident:
- Document what you tried - Commands run, outputs observed
- Collect diagnostic bundle:
# Create the diagnostic bundle directory (run from the StemeDB working directory)
BUNDLE="incident-$(date +%Y%m%d-%H%M%S)"
mkdir "$BUNDLE"
# Collect logs
journalctl -u stemedb-api -n 1000 --no-pager > "$BUNDLE/logs.txt"
# Collect metrics
curl http://localhost:18180/metrics > "$BUNDLE/metrics.txt"
# Collect health
curl http://localhost:18180/v1/health > "$BUNDLE/health.json"
# Collect config (review for secrets before sharing)
env | grep STEMEDB > "$BUNDLE/config.env"
# Collect disk usage
df -h > "$BUNDLE/disk.txt"
du -sh data/* > "$BUNDLE/data-usage.txt"
- Escalate with diagnostic bundle to:
- Engineering team Slack channel
- On-call engineer (PagerDuty/Opsgenie)
- Support ticket with bundle attached
Related Documentation
- Operations Hub - Main operations documentation
- All Runbooks - Incident response procedures
- Reference Architectures - Deployment models
- Production Readiness - Pre-deployment validation
Last Updated: 2026-02-11