# Runbook: High Query Latency

## Symptom

- API queries return 200 but take >1 second (p99 >1000ms)
- Queries time out with 504 Gateway Timeout
- Dashboard is slow to load or shows stale data
- Users report "sluggish" performance

**Metrics Alerts:**

- `stemedb_query_latency_seconds{quantile="0.99"}` > 1.0 for 5 minutes
- `replication_lag_seconds` > 5.0 (cluster only)
- `stemedb_query_timeout_total` increasing

---

## Quick Diagnosis

```
High query latency
│
├─► Check: curl .../metrics | grep replication_lag
│   └─► Lag >5s? → §1 Replication Lag
│
├─► Check: curl .../metrics | grep query_latency_seconds
│   └─► Single shard slow? → §2 Shard Hotspot
│
├─► Check: free -h
│   └─► Memory >90%? → §3 Memory Pressure
│
└─► Check: journalctl | grep "index error"
    └─► Index errors? → §4 Index Corruption
```

---

## Common Causes

1. **Replication lag** (cluster only) - Likelihood: **35%**
   - Network latency between nodes
   - Single node overloaded
   - Merkle sync backlog

2. **Shard hotspot** (cluster only) - Likelihood: **25%**
   - Popular concept_path on single shard
   - Unbalanced shard assignment
   - Single node handling all queries

3. **Memory pressure** - Likelihood: **20%**
   - Cache evictions due to low memory
   - Swap thrashing
   - Large result sets

4. **Index corruption** - Likelihood: **10%**
   - Partial index rebuild needed
   - Corrupted predicate index
   - Version mismatch after upgrade

5. **Query complexity** - Likelihood: **10%**
   - Complex lens logic (e.g., AuthorityLens with deep chains)
   - Large result sets (>10K assertions)
   - Inefficient query patterns

---

## Resolution Steps

### §1. Replication Lag (Cluster Only)

**Diagnostic:**

```bash
# Check replication lag on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/metrics" | grep replication_lag_seconds
done

# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234

# Check Merkle sync status
curl http://localhost:18181/cluster/sync_status | jq '.'
```

**Resolution A: Manual Merkle sync**

```bash
# Identify lagging node
curl http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'

# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_node": "lagging-node-id", "force": true}'

# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'

# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)
```

**Resolution B: Restart lagging node**

⚠️ **WARNING:** The cluster must have at least 2 healthy nodes. Do not restart if only 1 node is up.

```bash
# Check cluster health first
curl http://localhost:18181/cluster/health

# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"

# Monitor rejoin
# LAGGING_NODE_ID must be exported: watch runs the command in a child shell,
# which expands the variable from its environment.
export LAGGING_NODE_ID="lagging-node-id"
watch -n 2 'curl -s http://localhost:18181/cluster/members | jq ".members[] | select(.id==\"$LAGGING_NODE_ID\")"'

# Wait for status: "UP" and replication_lag <1s
```

**Resolution C: Network diagnosis**

```bash
# Check inter-node latency
for node in node1 node2 node3; do
  echo "=== Ping $node ==="
  ping -c 5 "$node"
done

# Expected: <5ms avg latency within cluster

# Check for packet loss
sudo tcpdump -i eth0 host node2 and port 18182

# Should show steady RPC traffic, no retransmits
```

**If failed:** Lag persists >15 minutes → Check for network issues and consider removing the lagging node and re-adding it. See [Add Node Runbook](./add-node.md).

---

### §2. Shard Hotspot (Cluster Only)

**Diagnostic:**

```bash
# Check query distribution by node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/metrics" | grep stemedb_query_total
done

# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501

# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234  <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989

# Identify hot shard
curl http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
```

**Resolution: Manual shard rebalance**

⚠️ **NOTE:** Automatic rebalancing is roadmap item P6.3. A manual process is required for Pilot 5.

```bash
# View current shard assignment
curl http://localhost:18181/cluster/shards | jq '.'

# Identify hot concept_path
# Sort by sample value (field 2; assumes label values contain no spaces)
curl -s http://localhost:18180/metrics | grep concept_path_query_rate | sort -k2,2nr | head -5

# Move shard to different node (manual)
curl -X POST http://localhost:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "abc123",
    "target_node": "node2-id",
    "reason": "hotspot_mitigation"
  }'

# Monitor rebalance progress (SHARD_ID is the shard moved above)
SHARD_ID="abc123"
curl "http://localhost:18181/cluster/shards/$SHARD_ID" | jq '.rebalance_status'

# Wait for status: "COMPLETE"
```

**Temporary workaround: Load balancer weights**

```bash
# If using nginx load balancer, reduce weight of hot node
# Contents of /etc/nginx/conf.d/stemedb-upstream.conf:
upstream stemedb {
    server node1:18180 weight=1;  # Reduce from weight=3
    server node2:18180 weight=3;
    server node3:18180 weight=3;
}

# Then validate the config and reload nginx:
sudo nginx -t
sudo systemctl reload nginx
```

**If failed:** Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See [Add Node Runbook](./add-node.md).

---

### §3. Memory Pressure

**Diagnostic:**

```bash
# Check memory usage
free -h

# Expected output (healthy):
#        total   used    free   shared   buff/cache   available
# Mem:   16Gi    4.2Gi   10Gi   128Mi    1.8Gi        11Gi
# Swap:  0B      0B      0B

# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate

# Check for swap usage
cat /proc/swaps

# Check OOM killer logs
journalctl -k | grep -i "out of memory"

# Check StemeDB memory metrics
curl http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
```

**Resolution A: Increase cache size limit**

⚠️ **NOTE:** Default cache: 1GB. Increase if available memory >8GB.

```bash
# Set cache size to 2GB (if 16GB RAM available)
export STEMEDB_CACHE_SIZE_MB=2048

# Or in systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"

sudo systemctl daemon-reload
sudo systemctl restart stemedb-api

# Verify new limit
curl http://localhost:18180/metrics | grep stemedb_cache_size_bytes
```

**Resolution B: Add swap (emergency only)**

⚠️ **NOT RECOMMENDED for production.** Swap causes unpredictable latency. Upgrade RAM instead.

```bash
# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify
free -h
```

**Resolution C: Scale vertically**

```bash
# Upgrade to larger instance (AWS example)

# Stop server
sudo systemctl stop stemedb-api

# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"

# Stop instance, change instance type
aws ec2 stop-instances --instance-ids i-xxx
# The instance type can only be changed once the instance is fully stopped
aws ec2 wait instance-stopped --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type t3.2xlarge

# Start instance
aws ec2 start-instances --instance-ids i-xxx

# Verify memory upgrade
ssh instance "free -h"

# Start server
sudo systemctl start stemedb-api
```

**If failed:** Memory pressure persists after scaling → Investigate memory leaks.
Collect a heap profile and escalate to engineering.

---

### §4. Index Corruption

**Diagnostic:**

```bash
# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"

# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"

# Check index metrics
curl http://localhost:18180/metrics | grep stemedb_index_
```

**Resolution: Rebuild indexes**

⚠️ **WARNING:** Index rebuild is a blocking operation. Queries will fail during the rebuild (typically 1-5 minutes for <100K assertions).

```bash
# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"

# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"

# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild

# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
```

**If failed:** Rebuild fails or corruption persists → Restore from backup. See [Restore from Backup Runbook](./restore-from-backup.md).

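
The "wait for completion" step of the manual rebuild can be scripted rather than re-run by hand. The following is a minimal sketch, assuming the status endpoint returns single-line JSON with a `status` field that becomes `"ready"`, as in the sample response above. The function names, the 300-second timeout, and the 5-second poll interval are illustrative assumptions, not part of StemeDB. It uses `sed` instead of `jq` only to avoid an extra dependency.

```shell
# Sketch: poll the index status endpoint until the rebuild reports "ready".
# STATUS_URL matches the endpoint used in this runbook; override it if needed.
STATUS_URL="${STATUS_URL:-http://localhost:18180/v1/admin/indexes/status}"

# Read JSON on stdin and print the value of its "status" field.
# Prints "unknown" when the field is missing or the input is empty.
index_status() {
  local status
  status="$(sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)"
  echo "${status:-unknown}"
}

# Poll until the status is "ready" or the timeout expires.
# $1 = timeout in seconds (hypothetical default 300)
# $2 = poll interval in seconds (hypothetical default 5)
wait_for_index_ready() {
  local timeout="${1:-300}" interval="${2:-5}" elapsed=0 status
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s: silent; -f: fail on HTTP errors, which yields empty output -> "unknown"
    status="$(curl -sf "$STATUS_URL" | index_status)"
    if [ "$status" = "ready" ]; then
      echo "index ready after ${elapsed}s"
      return 0
    fi
    echo "index status: ${status} (waited ${elapsed}s)"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "index not ready after ${timeout}s" >&2
  return 1
}
```

After triggering the rebuild, `wait_for_index_ready 600 10` would poll every 10 seconds for up to 10 minutes. A non-zero exit code is the cue to move on to the restore-from-backup path.
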

---

## Validation

After applying resolution, validate performance is restored:

- [ ] **Query latency back to baseline**
  ```bash
  curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
  # Should be <0.2 (200ms)
  ```

- [ ] **Test query succeeds with low latency**
  ```bash
  time curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/performance","lens":"recency"}'
  # Should complete in <1 second
  ```

- [ ] **Replication lag <1s** (cluster only)
  ```bash
  curl http://localhost:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
  ```

- [ ] **No query timeouts**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_query_timeout_total
  # Counter should stop increasing
  ```

- [ ] **Dashboard loads quickly**
  - Open http://localhost:18188/
  - Quarantine panel should load in <2 seconds

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_performance
    rules:
      - alert: StemeDBHighLatency
        expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency high (p99 >1s)"
          description: "p99 latency: {{ $value }}s"

      - alert: StemeDBReplicationLag
        expr: replication_lag_seconds > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high (>5s)"
          description: "Node {{ $labels.node }}: {{ $value }}s"

      - alert: StemeDBMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory available <10%"
```

### Configuration Changes

**To prevent recurrence:**

1. **Replication lag:** Ensure <5ms inter-node latency (same region)
2. **Shard hotspot:** Implement read replicas for popular concept_paths (roadmap P6.3)
3. **Memory pressure:** Right-size instances based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)
4. **Index corruption:** Enable daily backups, test restore procedures monthly

---

## Performance Targets

**From production readiness UAT:**

| Metric | Pilot Target | Production Target |
|--------|--------------|-------------------|
| **Query latency (p50)** | <50ms | <20ms |
| **Query latency (p99)** | <200ms | <100ms |
| **Ingest rate** | 100/sec | 1K/sec |
| **Concurrent queries** | 100 | 1K |
| **Replication lag** | <1s | <200ms |

---

## Related Runbooks

- [Add Node](./add-node.md) - Horizontal scaling
- [Restore from Backup](./restore-from-backup.md) - Index corruption recovery
- [Disk Full](./disk-full.md) - Storage capacity issues

---

## Last Updated

2026-02-11