stemedb/docs/operations/runbooks/high-query-latency.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

456 lines
12 KiB
Markdown

# Runbook: High Query Latency
## Symptom
- API queries return 200 but take >1 second (p99 >1000ms)
- Queries timeout with 504 Gateway Timeout
- Dashboard slow to load or shows stale data
- Users report "sluggish" performance
**Metrics Alerts:**
- `stemedb_query_latency_seconds{quantile="0.99"}` > 1.0 for 5 minutes
- `replication_lag_seconds` > 5.0 (cluster only)
- `stemedb_query_timeout_total` increasing
---
## Quick Diagnosis
```
High query latency
├─► Check: curl .../metrics | grep replication_lag
│ └─► Lag >5s? → §1 Replication Lag
├─► Check: curl .../metrics | grep query_latency_seconds
│ └─► Single shard slow? → §2 Shard Hotspot
├─► Check: free -h
│ └─► Memory >90%? → §3 Memory Pressure
└─► Check: journalctl | grep "index error"
└─► Index errors? → §4 Index Corruption
```
---
## Common Causes
1. **Replication lag** (cluster only) — Likelihood: **35%**
- Network latency between nodes
- Single node overloaded
- Merkle sync backlog
2. **Shard hotspot** (cluster only) — Likelihood: **25%**
- Popular concept_path on single shard
- Unbalanced shard assignment
- Single node handling all queries
3. **Memory pressure** — Likelihood: **20%**
- Cache evictions due to low memory
- Swap thrashing
- Large result sets
4. **Index corruption** — Likelihood: **10%**
- Partial index rebuild needed
- Corrupted predicate index
- Version mismatch after upgrade
5. **Query complexity** — Likelihood: **10%**
- Complex lens logic (e.g., AuthorityLens with deep chains)
- Large result sets (>10K assertions)
- Inefficient query patterns
---
## Resolution Steps
### §1. Replication Lag (Cluster Only)
**Diagnostic:**
```bash
# Check replication lag on all nodes
for node in node1 node2 node3; do
echo "=== $node ==="
curl http://$node:18180/metrics | grep replication_lag_seconds
done
# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234
# Check Merkle sync status
curl http://localhost:18181/cluster/sync_status | jq '.'
```
**Resolution A: Manual Merkle sync**
```bash
# Identify lagging node
curl http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'
# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
-H "Content-Type: application/json" \
-d '{"target_node": "lagging-node-id", "force": true}'
# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'
# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)
```
**Resolution B: Restart lagging node**
⚠️ **WARNING:** Cluster must have at least 2 nodes healthy. Don't restart if only 1 node up.
```bash
# Check cluster health first
curl http://localhost:18181/cluster/health
# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"
# Monitor rejoin
watch -n 2 'curl -s http://localhost:18181/cluster/members | jq ".members[] | select(.id==\"$LAGGING_NODE_ID\")"'
# Wait for status: "UP" and replication_lag <1s
```
**Resolution C: Network diagnosis**
```bash
# Check inter-node latency
for node in node1 node2 node3; do
echo "=== Ping $node ==="
ping -c 5 $node
done
# Expected: <5ms avg latency within cluster
# Check for packet loss
sudo tcpdump -i eth0 host node2 and port 18182
# Should show steady RPC traffic, no retransmits
```
**If failed:** Lag persists >15 minutes → Check network issues, consider removing lagging node and re-adding. See [Add Node Runbook](./add-node.md).
---
### §2. Shard Hotspot (Cluster Only)
**Diagnostic:**
```bash
# Check query distribution by node
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:18180/metrics | grep stemedb_query_total
done
# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501
# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234 <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989
# Identify hot shard
curl http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
```
**Resolution: Manual shard rebalance**
⚠️ **NOTE:** Automatic rebalancing is roadmap item P6.3. Manual process required for Pilot 5.
```bash
# View current shard assignment
curl http://localhost:18181/cluster/shards | jq '.'
# Identify hot concept_path
curl http://localhost:18180/metrics | grep concept_path_query_rate | sort -t'=' -k2 -nr | head -5
# Move shard to different node (manual)
curl -X POST http://localhost:18181/admin/shards/rebalance \
-H "Content-Type: application/json" \
-d '{
"shard_id": "abc123",
"target_node": "node2-id",
"reason": "hotspot_mitigation"
}'
# Monitor rebalance progress
curl http://localhost:18181/cluster/shards/$SHARD_ID | jq '.rebalance_status'
# Wait for status: "COMPLETE"
```
**Temporary workaround: Load balancer weights**
```bash
# If using nginx load balancer, reduce weight of hot node
# /etc/nginx/conf.d/stemedb-upstream.conf
upstream stemedb {
server node1:18180 weight=1; # Reduce from weight=3
server node2:18180 weight=3;
server node3:18180 weight=3;
}
sudo nginx -t
sudo systemctl reload nginx
```
**If failed:** Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See [Add Node Runbook](./add-node.md).
---
### §3. Memory Pressure
**Diagnostic:**
```bash
# Check memory usage
free -h
# Expected output (healthy):
# total used free shared buff/cache available
# Mem: 16Gi 4.2Gi 10Gi 128Mi 1.8Gi 11Gi
# Swap: 0B 0B 0B
# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate
# Check for swap usage
cat /proc/swaps
# Check OOM killer logs
journalctl -k | grep -i "out of memory"
# Check StemeDB memory metrics
curl http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
```
**Resolution A: Increase cache size limit**
⚠️ **NOTE:** Default cache: 1GB. Increase if available memory >8GB.
```bash
# Set cache size to 2GB (if 16GB RAM available)
export STEMEDB_CACHE_SIZE_MB=2048
# Or in systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"
sudo systemctl daemon-reload
sudo systemctl restart stemedb-api
# Verify new limit
curl http://localhost:18180/metrics | grep stemedb_cache_size_bytes
```
**Resolution B: Add swap (emergency only)**
⚠️ **NOT RECOMMENDED for production.** Swap causes unpredictable latency. Upgrade RAM instead.
```bash
# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Verify
free -h
```
**Resolution C: Scale vertically**
```bash
# Upgrade to larger instance (AWS example)
# Stop server
sudo systemctl stop stemedb-api
# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"
# Stop instance, change instance type
aws ec2 stop-instances --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type t3.2xlarge
# Start instance
aws ec2 start-instances --instance-ids i-xxx
# Verify memory upgrade
ssh instance "free -h"
# Start server
sudo systemctl start stemedb-api
```
**If failed:** Memory pressure persists after scaling → Investigate memory leaks. Collect heap profile and escalate to engineering.
---
### §4. Index Corruption
**Diagnostic:**
```bash
# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"
# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"
# Check index metrics
curl http://localhost:18180/metrics | grep stemedb_index_
```
**Resolution: Rebuild indexes**
⚠️ **WARNING:** Index rebuild is blocking operation. Queries will fail during rebuild (typically 1-5 minutes for <100K assertions).
```bash
# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api
# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"
# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"
# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild
# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
```
**If failed:** Rebuild fails or corruption persists Restore from backup. See [Restore from Backup Runbook](./restore-from-backup.md).
---
## Validation
After applying resolution, validate performance is restored:
- [ ] **Query latency back to baseline**
```bash
curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
# Should be <0.2 (200ms)
```
- [ ] **Test query succeeds with low latency**
```bash
time curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path":"test/performance","lens":"recency"}'
# Should complete in <1 second
```
- [ ] **Replication lag <1s** (cluster only)
```bash
curl http://localhost:18180/metrics | grep replication_lag_seconds
# All nodes should show <1.0
```
- [ ] **No query timeouts**
```bash
curl http://localhost:18180/metrics | grep stemedb_query_timeout_total
# Counter should stop increasing
```
- [ ] **Dashboard loads quickly**
- Open http://localhost:18188/
- Quarantine panel should load in <2 seconds
---
## Prevention
### Monitoring
**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
- name: stemedb_performance
rules:
- alert: StemeDBHighLatency
expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "Query latency high (p99 >1s)"
description: "p99 latency: {{ $value }}s"
- alert: StemeDBReplicationLag
expr: replication_lag_seconds > 5.0
for: 5m
labels:
severity: warning
annotations:
summary: "Replication lag high (>5s)"
description: "Node {{ $labels.node }}: {{ $value }}s"
- alert: StemeDBMemoryPressure
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Memory available <10%"
```
### Configuration Changes
**To prevent recurrence:**
1. **Replication lag:** Ensure <5ms inter-node latency (same region)
2. **Shard hotspot:** Implement read replicas for popular concept_paths (roadmap P6.3)
3. **Memory pressure:** Right-size instances based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)
4. **Index corruption:** Enable daily backups, test restore procedures monthly
---
## Performance Targets
**From production readiness UAT:**
| Metric | Pilot Target | Production Target |
|--------|--------------|-------------------|
| **Query latency (p50)** | <50ms | <20ms |
| **Query latency (p99)** | <200ms | <100ms |
| **Ingest rate** | 100/sec | 1K/sec |
| **Concurrent queries** | 100 | 1K |
| **Replication lag** | <1s | <200ms |
---
## Related Runbooks
- [Add Node](./add-node.md) - Horizontal scaling
- [Restore from Backup](./restore-from-backup.md) - Index corruption recovery
- [Disk Full](./disk-full.md) - Storage capacity issues
---
## Last Updated
2026-02-11