# Runbook: High Query Latency

## Symptom

- API queries return 200 but take >1 second (p99 >1000ms)
- Queries time out with 504 Gateway Timeout
- Dashboard is slow to load or shows stale data
- Users report "sluggish" performance

**Metrics Alerts:**
- `stemedb_query_latency_seconds{quantile="0.99"}` > 1.0 for 5 minutes
- `replication_lag_seconds` > 5.0 (cluster only)
- `stemedb_query_timeout_total` increasing

---

## Quick Diagnosis

```
High query latency
│
├─► Check: curl .../metrics | grep replication_lag
│   └─► Lag >5s? → §1 Replication Lag
│
├─► Check: curl .../metrics | grep query_latency_seconds
│   └─► Single shard slow? → §2 Shard Hotspot
│
├─► Check: free -h
│   └─► Memory >90%? → §3 Memory Pressure
│
└─► Check: journalctl | grep "index error"
    └─► Index errors? → §4 Index Corruption
```
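
The decision tree can be sketched as a small shell function. This is a minimal illustration, not part of StemeDB: `triage` is a hypothetical helper, the 5s and 90% thresholds are copied from the tree, and the shard check is left as the fallback because it needs a per-node comparison rather than a single number.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps three observations to the runbook section to open.
# Arguments: replication lag (s), memory used (%), count of index errors in logs.
triage() {
  local lag_s="$1" mem_used_pct="$2" index_errors="$3"
  # awk does the floating-point comparisons that [ ] cannot.
  if awk -v v="$lag_s" 'BEGIN { exit !(v > 5) }'; then
    echo "§1 Replication Lag"
  elif awk -v v="$mem_used_pct" 'BEGIN { exit !(v > 90) }'; then
    echo "§3 Memory Pressure"
  elif [ "$index_errors" -gt 0 ]; then
    echo "§4 Index Corruption"
  else
    echo "§2 Shard Hotspot (compare per-node metrics)"
  fi
}

# Example: triage 7.2 40 0
```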

---

## Common Causes

1. **Replication lag** (cluster only) - Likelihood: **35%**
   - Network latency between nodes
   - Single node overloaded
   - Merkle sync backlog

2. **Shard hotspot** (cluster only) - Likelihood: **25%**
   - Popular concept_path on single shard
   - Unbalanced shard assignment
   - Single node handling all queries

3. **Memory pressure** - Likelihood: **20%**
   - Cache evictions due to low memory
   - Swap thrashing
   - Large result sets

4. **Index corruption** - Likelihood: **10%**
   - Partial index rebuild needed
   - Corrupted predicate index
   - Version mismatch after upgrade

5. **Query complexity** - Likelihood: **10%**
   - Complex lens logic (e.g., AuthorityLens with deep chains)
   - Large result sets (>10K assertions)
   - Inefficient query patterns

---

## Resolution Steps

### §1. Replication Lag (Cluster Only)

**Diagnostic:**
```bash
# Check replication lag on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/metrics" | grep replication_lag_seconds
done

# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234

# Check Merkle sync status
curl -s http://localhost:18181/cluster/sync_status | jq '.'
```

**Resolution A: Manual Merkle sync**
```bash
# Identify lagging node
curl -s http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'

# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_node": "lagging-node-id", "force": true}'

# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'

# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)
```
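
As an unattended alternative to `watch`, the wait can be scripted. The sketch below assumes the metrics endpoint and metric name used in the examples above; `max_replication_lag` and `wait_for_sync` are hypothetical helper names, and the 900-second default mirrors the 15-minute escalation threshold in this section.

```bash
#!/usr/bin/env bash
# Reads Prometheus text exposition on stdin and prints the largest
# replication_lag_seconds sample (0 if no sample is present).
max_replication_lag() {
  awk '/^replication_lag_seconds/ { if ($NF + 0 > max) max = $NF + 0 }
       END { print max + 0 }'
}

# Polls a node every 5s until its lag drops below 1s or the timeout expires.
wait_for_sync() {
  local node="$1" timeout_s="${2:-900}" waited=0 body lag="unknown"
  while [ "$waited" -lt "$timeout_s" ]; do
    # A failed scrape is retried, not treated as "lag is zero".
    if body=$(curl -sf --max-time 5 "http://$node:18180/metrics"); then
      lag=$(printf '%s\n' "$body" | max_replication_lag)
      if awk -v v="$lag" 'BEGIN { exit !(v < 1) }'; then
        echo "lag ${lag}s: in sync"
        return 0
      fi
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "lag still ${lag}s after ${timeout_s}s" >&2
  return 1
}

# Example: wait_for_sync lagging-node 900 || echo "escalate"
```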

**Resolution B: Restart lagging node**

⚠️ **WARNING:** The cluster must have at least 2 healthy nodes. Do not restart if only 1 node is up.

```bash
# Check cluster health first
curl http://localhost:18181/cluster/health

# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"

# Monitor rejoin (set LAGGING_NODE_ID to the id reported by /cluster/members)
watch -n 2 "curl -s http://localhost:18181/cluster/members | jq '.members[] | select(.id==\"$LAGGING_NODE_ID\")'"

# Wait for status: "UP" and replication_lag <1s
```

**Resolution C: Network diagnosis**

```bash
# Check inter-node latency and packet loss
for node in node1 node2 node3; do
  echo "=== Ping $node ==="
  ping -c 5 "$node"
done

# Expected: <5ms avg latency within cluster, 0% packet loss in the ping summary

# Inspect replication RPC traffic (Ctrl-C to stop)
sudo tcpdump -i eth0 host node2 and port 18182
# Should show steady RPC traffic, no retransmits
```

**If failed:** Lag persists >15 minutes → Check for network issues, and consider removing the lagging node and re-adding it. See [Add Node Runbook](./add-node.md).

---

### §2. Shard Hotspot (Cluster Only)

**Diagnostic:**
```bash
# Check query distribution by node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/metrics" | grep stemedb_query_total
done

# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501

# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234  <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989

# Identify hot shard
curl -s http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
```
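
To put a number on the imbalance, the per-node counters can be reduced to the busiest node's share of all queries. This is a rough sketch: `busiest_node_share` is a hypothetical helper, and it works on cumulative counters since process start, so a recent restart on one node skews the result. With three nodes, an even split is about 33%.

```bash
#!/usr/bin/env bash
# Reads stemedb_query_total samples (one per node) on stdin and prints the
# busiest series followed by its share of all queries, in percent.
busiest_node_share() {
  awk '/^stemedb_query_total/ {
         total += $NF
         if ($NF + 0 > max) { max = $NF + 0; series = $1 }
       }
       END {
         if (total > 0) printf "%s %.1f\n", series, 100 * max / total
       }'
}

# Example, using the loop above as the data source:
#   for node in node1 node2 node3; do
#     curl -s "http://$node:18180/metrics"
#   done | busiest_node_share
```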

**Resolution: Manual shard rebalance**

⚠️ **NOTE:** Automatic rebalancing is roadmap item P6.3. A manual process is required for Pilot 5.

```bash
# View current shard assignment
curl -s http://localhost:18181/cluster/shards | jq '.'

# Identify hot concept_path (sorts by sample value, the last field on each line)
curl -s http://localhost:18180/metrics \
  | awk '/concept_path_query_rate/ && !/^#/ { print $NF, $0 }' | sort -nr | head -5

# Move shard to different node (manual); replace abc123 with the hot shard's id
curl -X POST http://localhost:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "abc123",
    "target_node": "node2-id",
    "reason": "hotspot_mitigation"
  }'

# Monitor rebalance progress (SHARD_ID is the same id as in the request above)
SHARD_ID="abc123"
curl -s "http://localhost:18181/cluster/shards/$SHARD_ID" | jq '.rebalance_status'

# Wait for status: "COMPLETE"
```

**Temporary workaround: Load balancer weights**

```bash
# If using nginx load balancer, reduce weight of hot node
# /etc/nginx/conf.d/stemedb-upstream.conf
upstream stemedb {
  server node1:18180 weight=1;  # Reduce from weight=3
  server node2:18180 weight=3;
  server node3:18180 weight=3;
}

sudo nginx -t
sudo systemctl reload nginx
```

**If failed:** Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See [Add Node Runbook](./add-node.md).

---

### §3. Memory Pressure

**Diagnostic:**
```bash
# Check memory usage
free -h

# Expected output (healthy):
#               total        used        free      shared  buff/cache   available
# Mem:           16Gi       4.2Gi        10Gi       128Mi       1.8Gi        11Gi
# Swap:            0B          0B          0B

# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate

# Check for swap usage
cat /proc/swaps

# Check OOM killer logs
journalctl -k | grep -i "out of memory"

# Check StemeDB memory metrics
curl -s http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
```
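
The "available <10% of total" indicator can be checked without reading `free -h` by eye. A minimal sketch, assuming a Linux host that exposes `MemTotal` and `MemAvailable` in `/proc/meminfo`; `mem_available_pct` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Prints MemAvailable as a whole-number percentage of MemTotal (rounded down).
# Takes a meminfo-format file; defaults to /proc/meminfo.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", 100 * avail / total }' \
      "${1:-/proc/meminfo}"
}

# Example:
#   pct=$(mem_available_pct)
#   [ "$pct" -lt 10 ] && echo "memory pressure: only ${pct}% available"
```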

**Resolution A: Increase cache size limit**

⚠️ **NOTE:** The default cache size is 1GB. Increase it if more than 8GB of memory is available.

```bash
# Set cache size to 2GB (if 16GB RAM available)
export STEMEDB_CACHE_SIZE_MB=2048

# Or in systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"

sudo systemctl daemon-reload
sudo systemctl restart stemedb-api

# Verify new limit
curl -s http://localhost:18180/metrics | grep stemedb_cache_size_bytes
```

**Resolution B: Add swap (emergency only)**

⚠️ **NOT RECOMMENDED for production.** Swap causes unpredictable latency. Upgrade RAM instead.

```bash
# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify
free -h
```

**Resolution C: Scale vertically**

```bash
# Upgrade to larger instance (AWS example)
# Stop server
sudo systemctl stop stemedb-api

# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"

# Stop instance, wait until it is fully stopped, then change instance type
aws ec2 stop-instances --instance-ids i-xxx
aws ec2 wait instance-stopped --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type Value=t3.2xlarge

# Start instance
aws ec2 start-instances --instance-ids i-xxx

# Verify memory upgrade
ssh instance "free -h"

# Start server (on the instance)
sudo systemctl start stemedb-api
```

**If failed:** Memory pressure persists after scaling → Investigate memory leaks. Collect a heap profile and escalate to engineering.

---

### §4. Index Corruption

**Diagnostic:**
```bash
# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"

# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"

# Check index metrics
curl -s http://localhost:18180/metrics | grep stemedb_index_
```

**Resolution: Rebuild indexes**

⚠️ **WARNING:** Index rebuild is a blocking operation. Queries will fail during the rebuild (typically 1-5 minutes for <100K assertions).

```bash
# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"

# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"

# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild

# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
```

**If failed:** Rebuild fails or corruption persists → Restore from backup. See [Restore from Backup Runbook](./restore-from-backup.md).

---

## Validation

After applying a resolution, validate that performance is restored:

- [ ] **Query latency back to baseline**
  ```bash
  curl -s http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
  # Should be <0.2 (200ms)
  ```

- [ ] **Test query succeeds with low latency**
  ```bash
  time curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/performance","lens":"recency"}'
  # Should complete in <1 second
  ```

- [ ] **Replication lag <1s** (cluster only)
  ```bash
  curl -s http://localhost:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
  ```

- [ ] **No query timeouts**
  ```bash
  curl -s http://localhost:18180/metrics | grep stemedb_query_timeout_total
  # Counter should stop increasing
  ```

- [ ] **Dashboard loads quickly**
  - Open http://localhost:18188/
  - Quarantine panel should load in <2 seconds

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_performance
    rules:
      - alert: StemeDBHighLatency
        expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency high (p99 >1s)"
          description: "p99 latency: {{ $value }}s"

      - alert: StemeDBReplicationLag
        expr: replication_lag_seconds > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high (>5s)"
          description: "Node {{ $labels.node }}: {{ $value }}s"

      - alert: StemeDBMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory available <10%"
```

### Configuration Changes

**To prevent recurrence:**

1. **Replication lag:** Ensure <5ms inter-node latency (same region)
2. **Shard hotspot:** Implement read replicas for popular concept_paths (roadmap P6.3)
3. **Memory pressure:** Right-size instances based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)
4. **Index corruption:** Enable daily backups, test restore procedures monthly

---

## Performance Targets

**From production readiness UAT:**

| Metric | Pilot Target | Production Target |
|--------|--------------|-------------------|
| **Query latency (p50)** | <50ms | <20ms |
| **Query latency (p99)** | <200ms | <100ms |
| **Ingest rate** | 100/sec | 1K/sec |
| **Concurrent queries** | 100 | 1K |
| **Replication lag** | <1s | <200ms |
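

The p99 row of the table can be applied to a measured value with a short function. This is an illustrative sketch only: `classify_p99` is a hypothetical helper, and it expects the latency in seconds, as reported by the `stemedb_query_latency_seconds` metric.

```bash
#!/usr/bin/env bash
# Classifies a p99 query latency (in seconds) against the p99 targets above:
# production <100ms, pilot <200ms.
classify_p99() {
  awk -v s="$1" 'BEGIN {
    ms = s * 1000
    if (ms < 100)      print "meets production target"
    else if (ms < 200) print "meets pilot target"
    else               print "misses pilot target"
  }'
}

# Example: classify_p99 0.15
```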

---

## Related Runbooks

- [Add Node](./add-node.md) - Horizontal scaling
- [Restore from Backup](./restore-from-backup.md) - Index corruption recovery
- [Disk Full](./disk-full.md) - Storage capacity issues

---

## Last Updated

2026-02-11