
# Runbook: High Query Latency

## Symptom

- API queries return 200 but take >1 second (p99 >1000ms)
- Queries time out with 504 Gateway Timeout
- Dashboard slow to load or shows stale data
- Users report "sluggish" performance

**Metrics Alerts:**
- `stemedb_query_latency_seconds{quantile="0.99"}` > 1.0 for 5 minutes
- `replication_lag_seconds` > 5.0 (cluster only)
- `stemedb_query_timeout_total` increasing

---

## Quick Diagnosis

```
High query latency
    │
    ├─► Check: curl .../metrics | grep replication_lag
    │   └─► Lag >5s? → §1 Replication Lag
    │
    ├─► Check: curl .../metrics | grep query_latency_seconds
    │   └─► Single shard slow? → §2 Shard Hotspot
    │
    ├─► Check: free -h
    │   └─► Memory >90%? → §3 Memory Pressure
    │
    └─► Check: journalctl | grep "index error"
        └─► Index errors? → §4 Index Corruption
```

---

## Common Causes

1. **Replication lag** (cluster only) — Likelihood: **35%**
   - Network latency between nodes
   - Single node overloaded
   - Merkle sync backlog

2. **Shard hotspot** (cluster only) — Likelihood: **25%**
   - Popular concept_path on single shard
   - Unbalanced shard assignment
   - Single node handling all queries

3. **Memory pressure** — Likelihood: **20%**
   - Cache evictions due to low memory
   - Swap thrashing
   - Large result sets

4. **Index corruption** — Likelihood: **10%**
   - Partial index rebuild needed
   - Corrupted predicate index
   - Version mismatch after upgrade

5. **Query complexity** — Likelihood: **10%**
   - Complex lens logic (e.g., AuthorityLens with deep chains)
   - Large result sets (>10K assertions)
   - Inefficient query patterns

---

## Resolution Steps

### §1. Replication Lag (Cluster Only)

**Diagnostic:**
```bash
# Check replication lag on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/metrics" | grep replication_lag_seconds
done

# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234

# Check Merkle sync status
curl http://localhost:18181/cluster/sync_status | jq '.'
```
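
With more than a few nodes, scanning the grep output by eye becomes error-prone. The sketch below filters the same metric lines down to the nodes above a threshold. It assumes the label is named `node`, as in the expected output above; `lagging_nodes` is a hypothetical helper name, not part of StemeDB.

```bash
# Print "<node> <lag>" for every node whose replication lag exceeds a threshold.
# Reads Prometheus text-format lines on stdin, e.g.:
#   replication_lag_seconds{node="node1"} 0.123
# Threshold defaults to 5 seconds, matching the alert in this runbook.
lagging_nodes() {
  awk -v t="${1:-5}" '
    /^replication_lag_seconds[{ ]/ {
      lag = $NF                       # sample value is the last field
      node = "unknown"
      if (match($0, /node="[^"]*"/)) {
        # strip the leading node=" (6 chars) and the trailing quote
        node = substr($0, RSTART + 6, RLENGTH - 7)
      }
      if (lag + 0 > t + 0) print node, lag
    }'
}
```

For example, `curl -s http://node1:18180/metrics | lagging_nodes 5` prints nothing when the node is healthy.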

**Resolution A: Manual Merkle sync**
```bash
# Identify lagging node
curl http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'

# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_node": "lagging-node-id", "force": true}'

# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'

# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)
```
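
The wait can also be scripted with a deadline instead of watching the output by hand. This is a minimal sketch: `wait_for_lag` and the `LAG_CMD` hook are hypothetical names introduced here, and `LAG_CMD` can be any command that prints the current lag in seconds.

```bash
# Poll until replication lag drops below a threshold or a deadline passes.
# Usage: wait_for_lag <threshold_s> <timeout_s> [interval_s]
# LAG_CMD (hypothetical hook) must print the current lag in seconds.
wait_for_lag() {
  threshold="$1"; timeout_s="$2"; interval_s="${3:-5}"
  elapsed=0
  lag=""
  while [ "$elapsed" -le "$timeout_s" ]; do
    lag="$($LAG_CMD)"
    # An empty reading (e.g. the node is unreachable) never counts as success.
    if [ -n "$lag" ] && awk -v l="$lag" -v t="$threshold" 'BEGIN { exit !(l + 0 < t + 0) }'; then
      echo "lag ${lag}s is below ${threshold}s after ${elapsed}s"
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "timed out after ${timeout_s}s (last lag: ${lag:-unknown})" >&2
  return 1
}

# Example hook (assumes the first replication_lag_seconds sample is the one of interest):
#   current_lag() { curl -s http://lagging-node:18180/metrics | awk '/^replication_lag_seconds/ {print $NF; exit}'; }
#   LAG_CMD=current_lag wait_for_lag 1 900
```

A 900-second timeout lines up with the 15-minute escalation point used in this section.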

**Resolution B: Restart lagging node**

⚠️ **WARNING:** The cluster must have at least 2 healthy nodes. Don't restart a node if only 1 node is up.

```bash
# Check cluster health first
curl http://localhost:18181/cluster/health

# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"

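# Monitor rejoin (export the ID so the shell that watch spawns can expand it)
export LAGGING_NODE_ID="lagging-node-id"
watch -n 2 'curl -s http://localhost:18181/cluster/members | jq ".members[] | select(.id==\"$LAGGING_NODE_ID\")"'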

# Wait for status: "UP" and replication_lag <1s
```

**Resolution C: Network diagnosis**

```bash
# Check inter-node latency
for node in node1 node2 node3; do
  echo "=== Ping $node ==="
  ping -c 5 $node
done

# Expected: <5ms avg latency and 0% packet loss within cluster

# Inspect inter-node RPC traffic (Ctrl-C to stop)
sudo tcpdump -i eth0 host node2 and port 18182
# Should show steady RPC traffic; repeated sequence numbers indicate retransmits
```

**If failed:** Lag persists >15 minutes → Investigate network issues; consider removing the lagging node and re-adding it. See [Add Node Runbook](./add-node.md).

---

### §2. Shard Hotspot (Cluster Only)

**Diagnostic:**
```bash
# Check query distribution by node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/metrics | grep stemedb_query_total
done

# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501

# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234  <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989

# Identify hot shard
curl http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
```
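
To put a number on "imbalanced", the per-node counters can be converted into each node's share of total queries. The sketch below does that with `awk`. `query_share` is a hypothetical helper, and the 50% default cutoff is an assumption chosen for illustration, not a StemeDB-defined threshold.

```bash
# Report each node's share of total queries and flag nodes above a cutoff.
# Reads stemedb_query_total lines (from one or more nodes) on stdin.
# Usage: query_share [cutoff_percent]   (default 50, an illustrative value)
query_share() {
  awk -v cutoff="${1:-50}" '
    /^stemedb_query_total[{ ]/ {
      node = "unknown"
      if (match($0, /node="[^"]*"/)) node = substr($0, RSTART + 6, RLENGTH - 7)
      if (!(node in seen)) { seen[node] = 1; order[++n] = node }  # keep input order
      count[node] += $NF
      total += $NF
    }
    END {
      if (total == 0) exit 1          # no samples found
      for (i = 1; i <= n; i++) {
        pct = 100 * count[order[i]] / total
        printf "%s %.1f%%%s\n", order[i], pct, (pct > cutoff ? " HOTSPOT" : "")
      }
    }'
}
```

Piping the output of the per-node loop above into `query_share` would flag `node1` in the imbalanced example, since it handles roughly 96% of all queries.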

**Resolution: Manual shard rebalance**

⚠️ **NOTE:** Automatic rebalancing is roadmap item P6.3. Manual process required for Pilot 5.

```bash
# View current shard assignment
curl http://localhost:18181/cluster/shards | jq '.'

# Identify hot concept_path (sort on the sample value, the last field of each line)
curl -s http://localhost:18180/metrics | grep concept_path_query_rate | awk '{print $NF, $0}' | sort -nr | head -5 | cut -d' ' -f2-

# Move shard to different node (manual)
curl -X POST http://localhost:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "abc123",
    "target_node": "node2-id",
    "reason": "hotspot_mitigation"
  }'

# Monitor rebalance progress (use the shard_id from the request above)
SHARD_ID="abc123"
curl -s "http://localhost:18181/cluster/shards/$SHARD_ID" | jq '.rebalance_status'

# Wait for status: "COMPLETE"
```

**Temporary workaround: Load balancer weights**

```bash
# If using nginx load balancer, reduce weight of hot node.
# Edit /etc/nginx/conf.d/stemedb-upstream.conf so the upstream block reads:
#
#   upstream stemedb {
#       server node1:18180 weight=1;  # Reduce from weight=3
#       server node2:18180 weight=3;
#       server node3:18180 weight=3;
#   }

# Validate the config, then reload
sudo nginx -t
sudo systemctl reload nginx
```

**If failed:** Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See [Add Node Runbook](./add-node.md).

---

### §3. Memory Pressure

**Diagnostic:**
```bash
# Check memory usage
free -h

# Expected output (healthy):
#               total        used        free      shared  buff/cache   available
# Mem:           16Gi        4.2Gi       10Gi        128Mi       1.8Gi        11Gi
# Swap:           0B          0B          0B

# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate

# Check for swap usage
cat /proc/swaps

# Check OOM killer logs
journalctl -k | grep -i "out of memory"

# Check StemeDB memory metrics
curl http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
```
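
The "available <10% of total" indicator can be computed directly rather than read off the `free -h` table. The following sketch parses `/proc/meminfo`-style input. `mem_available_pct` and `under_memory_pressure` are hypothetical helper names, and the 10% default mirrors the indicator listed above.

```bash
# Print available memory as a percentage of total.
# Reads /proc/meminfo-style lines on stdin (values in kB).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1          # MemTotal missing or unreadable
      printf "%.1f\n", 100 * avail / total
    }'
}

# Exit status 0 means "under pressure": available memory is below the cutoff.
# Usage: under_memory_pressure [cutoff_percent] < /proc/meminfo
under_memory_pressure() {
  pct="$(mem_available_pct)" || return 2
  awk -v p="$pct" -v c="${1:-10}" 'BEGIN { exit !(p + 0 < c + 0) }'
}
```

On a Linux host, `under_memory_pressure 10 < /proc/meminfo && echo "memory pressure"` gives a yes/no answer that is easy to use in a script.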

**Resolution A: Increase cache size limit**

⚠️ **NOTE:** Default cache: 1GB. Increase if available memory >8GB.

```bash
# Set cache size to 2GB (if 16GB RAM available)
export STEMEDB_CACHE_SIZE_MB=2048

# Or in systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"

sudo systemctl daemon-reload
sudo systemctl restart stemedb-api

# Verify new limit
curl http://localhost:18180/metrics | grep stemedb_cache_size_bytes
```

**Resolution B: Add swap (emergency only)**

⚠️ **NOT RECOMMENDED for production.** Swap causes unpredictable latency. Upgrade RAM instead.

```bash
# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify
free -h
```

**Resolution C: Scale vertically**

```bash
# Upgrade to larger instance (AWS example)
# Stop server (run on the instance)
sudo systemctl stop stemedb-api

# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"

# Stop instance and wait until it is stopped (the type can only be changed while stopped)
aws ec2 stop-instances --instance-ids i-xxx
aws ec2 wait instance-stopped --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type t3.2xlarge

# Start instance and wait until it is running
aws ec2 start-instances --instance-ids i-xxx
aws ec2 wait instance-running --instance-ids i-xxx

# Verify memory upgrade
ssh instance "free -h"

# Start server (run on the instance)
sudo systemctl start stemedb-api
```

**If failed:** Memory pressure persists after scaling → Investigate memory leaks. Collect heap profile and escalate to engineering.

---

### §4. Index Corruption

**Diagnostic:**
```bash
# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"

# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"

# Check index metrics
curl http://localhost:18180/metrics | grep stemedb_index_
```
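
When the log is noisy, a count per error type shows at a glance which index is affected. This sketch tallies the three messages listed above, case-insensitively. `index_error_counts` is a hypothetical helper, and the message strings are taken from this runbook rather than from an authoritative list of StemeDB log lines.

```bash
# Count occurrences of the known index error messages in log text on stdin.
# Prints "<count> <message>" for each message, including zero counts.
index_error_counts() {
  awk '
    BEGIN {
      msg[1] = "predicate index lookup failed"
      msg[2] = "concept_path not found in index"
      msg[3] = "index checksum mismatch"
    }
    {
      line = tolower($0)
      for (i = 1; i <= 3; i++) if (index(line, msg[i])) hits[i]++
    }
    END {
      for (i = 1; i <= 3; i++) printf "%d %s\n", hits[i] + 0, msg[i]
    }'
}
```

A typical invocation would be `journalctl -u stemedb-api --since "1 hour ago" | index_error_counts`.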

**Resolution: Rebuild indexes**

⚠️ **WARNING:** Index rebuild is a blocking operation. Queries will fail during the rebuild (typically 1-5 minutes for <100K assertions).

```bash
# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"

# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"

# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild

# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
```

**If failed:** Rebuild fails or corruption persists → Restore from backup. See [Restore from Backup Runbook](./restore-from-backup.md).

---

## Validation

After applying a resolution, validate that performance is restored:

- [ ] **Query latency back to baseline**
  ```bash
  curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
  # Should be <0.2 (200ms)
  ```

- [ ] **Test query succeeds with low latency**
  ```bash
  time curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/performance","lens":"recency"}'
  # Should complete in <1 second
  ```

- [ ] **Replication lag <1s** (cluster only)
  ```bash
  curl http://localhost:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
  ```

- [ ] **No query timeouts**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_query_timeout_total
  # Counter should stop increasing
  ```

- [ ] **Dashboard loads quickly**
  - Open http://localhost:18188/
  - Quarantine panel should load in <2 seconds
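
A single timed query says little about tail latency. As a rough client-side cross-check of the server's p99 metric, the sketch below times repeated test queries and computes a nearest-rank percentile. `percentile` and `sample_latency` are hypothetical helpers; the query body is the test query from the checklist above, and client-side timings include network overhead, so they will read somewhat higher than the server metric.

```bash
# Nearest-rank percentile of numbers on stdin (one per line).
# Usage: percentile <p>   e.g. percentile 99
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1             # no samples
      rank = int(p * NR / 100)
      if (rank * 100 < p * NR) rank++ # round up to the next rank
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Time N test queries; prints one total request time (seconds) per line.
# Usage: sample_latency [n]   (default 50)
sample_latency() {
  n="${1:-50}"
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' -X POST http://localhost:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path":"test/performance","lens":"recency"}'
    i=$((i + 1))
  done
}
```

Running `sample_latency 100 | percentile 99` then gives a value to compare against the p99 target.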

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_performance
    rules:
      - alert: StemeDBHighLatency
        expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency high (p99 >1s)"
          description: "p99 latency: {{ $value }}s"

      - alert: StemeDBReplicationLag
        expr: replication_lag_seconds > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high (>5s)"
          description: "Node {{ $labels.node }}: {{ $value }}s"

      - alert: StemeDBMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory available <10%"
```

### Configuration Changes

**To prevent recurrence:**

1. **Replication lag:** Ensure <5ms inter-node latency (same region)
2. **Shard hotspot:** Implement read replicas for popular concept_paths (roadmap P6.3)
3. **Memory pressure:** Right-size instances based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)
4. **Index corruption:** Enable daily backups, test restore procedures monthly

---

## Performance Targets

**From production readiness UAT:**

| Metric | Pilot Target | Production Target |
|--------|--------------|-------------------|
| **Query latency (p50)** | <50ms | <20ms |
| **Query latency (p99)** | <200ms | <100ms |
| **Ingest rate** | 100/sec | 1K/sec |
| **Concurrent queries** | 100 | 1K |
| **Replication lag** | <1s | <200ms |

---

## Related Runbooks

- [Add Node](./add-node.md) - Horizontal scaling
- [Restore from Backup](./restore-from-backup.md) - Index corruption recovery
- [Disk Full](./disk-full.md) - Storage capacity issues

---

## Last Updated

2026-02-11