stemedb/docs/operations/runbooks/high-query-latency.md

Runbook: High Query Latency

Symptom

  • API queries return 200 but take >1 second (p99 >1000ms)
  • Queries time out with 504 Gateway Timeout
  • Dashboard slow to load or shows stale data
  • Users report "sluggish" performance

Metrics Alerts:

  • stemedb_query_latency_seconds{quantile="0.99"} > 1.0 for 5 minutes
  • replication_lag_seconds > 5.0 (cluster only)
  • stemedb_query_timeout_total increasing

Quick Diagnosis

High query latency
    │
    ├─► Check: curl .../metrics | grep replication_lag
    │   └─► Lag >5s? → §1 Replication Lag
    │
    ├─► Check: curl .../metrics | grep query_latency_seconds
    │   └─► Single shard slow? → §2 Shard Hotspot
    │
    ├─► Check: free -h
    │   └─► Memory >90%? → §3 Memory Pressure
    │
    └─► Check: journalctl | grep "index error"
        └─► Index errors? → §4 Index Corruption
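
The decision tree above can be sketched as a small shell function once the four readings have been collected by hand. This is only an illustration: `triage` and its argument order are hypothetical, and the thresholds (lag >5s, memory >90%) are copied from the tree.

```shell
# Hypothetical helper (not part of StemeDB): maps readings to a runbook section.
#   $1  highest replication lag seen, in seconds
#   $2  1 if one shard/node is clearly slower than the others, else 0
#   $3  memory used, as a percentage of total
#   $4  number of "index error" lines found in the logs
triage() {
  local lag="$1"
  local hot_shard="$2"
  local mem_used_pct="$3"
  local index_errors="$4"
  # awk does the floating-point comparisons that [ ] cannot.
  if awk -v v="$lag" 'BEGIN { exit !(v > 5) }'; then
    echo "§1 Replication Lag"
  elif [ "$hot_shard" -eq 1 ]; then
    echo "§2 Shard Hotspot"
  elif awk -v v="$mem_used_pct" 'BEGIN { exit !(v > 90) }'; then
    echo "§3 Memory Pressure"
  elif [ "$index_errors" -gt 0 ]; then
    echo "§4 Index Corruption"
  else
    echo "No match: review query complexity (cause 5)"
  fi
}
```

The checks run in the same order as the tree, so a node with both high lag and high memory use is routed to §1 first.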

Common Causes

  1. Replication lag (cluster only) — Likelihood: 35%

    • Network latency between nodes
    • Single node overloaded
    • Merkle sync backlog
  2. Shard hotspot (cluster only) — Likelihood: 25%

    • Popular concept_path on single shard
    • Unbalanced shard assignment
    • Single node handling all queries
  3. Memory pressure — Likelihood: 20%

    • Cache evictions due to low memory
    • Swap thrashing
    • Large result sets
  4. Index corruption — Likelihood: 10%

    • Partial index rebuild needed
    • Corrupted predicate index
    • Version mismatch after upgrade
  5. Query complexity — Likelihood: 10%

    • Complex lens logic (e.g., AuthorityLens with deep chains)
    • Large result sets (>10K assertions)
    • Inefficient query patterns

Resolution Steps

§1. Replication Lag (Cluster Only)

Diagnostic:

# Check replication lag on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/metrics" | grep replication_lag_seconds
done

# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234

# Check Merkle sync status
curl http://localhost:18181/cluster/sync_status | jq '.'
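
To avoid reading the metric lines by eye, the lag check can be filtered with awk. The sketch below assumes the metric is exposed exactly as in the expected output above (a single `node` label); `lagging_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads Prometheus text from stdin and prints
# "<node> <lag>" for every node whose lag exceeds the threshold (default 5s).
lagging_nodes() {
  local threshold="${1:-5.0}"
  awk -v t="$threshold" '
    /^replication_lag_seconds[{]/ {
      # Pull the value of the node="..." label out of the line.
      if (match($0, /node="[^"]*"/)) {
        node = substr($0, RSTART + 6, RLENGTH - 7)
        # The sample value is the last field on the line.
        if ($NF + 0 > t + 0) print node, $NF
      }
    }'
}
```

Typical use would be `curl -s http://node1:18180/metrics | lagging_nodes 5`; no output means no node is over the threshold.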

Resolution A: Manual Merkle sync

# Identify lagging node
curl http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'

# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_node": "lagging-node-id", "force": true}'

# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'

# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)

Resolution B: Restart lagging node

⚠️ WARNING: Restart a node only while at least 2 nodes are healthy. Do not restart if only 1 node is up.

# Check cluster health first
curl http://localhost:18181/cluster/health

# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"

# Monitor rejoin (export LAGGING_NODE_ID first: the shell spawned by watch expands it)
watch -n 2 'curl -s http://localhost:18181/cluster/members | jq ".members[] | select(.id==\"$LAGGING_NODE_ID\")"'

# Wait for status: "UP" and replication_lag <1s
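
The "2+ nodes healthy" precondition can be turned into a guard that runs before the restart. This is a sketch under an assumption: that each entry in the `/cluster/members` response carries a `"status": "UP"` field, as the wait condition above suggests. `safe_to_restart` is a hypothetical name, and the count mirrors the warning literally (it includes the lagging node if that node still reports UP).

```shell
# Hypothetical guard: reads the members JSON from stdin and succeeds only
# when at least 2 members report status "UP".
safe_to_restart() {
  local healthy
  # grep -o emits one line per match, so wc -l counts healthy members
  # even when the JSON arrives on a single line.
  healthy=$(grep -o '"status"[[:space:]]*:[[:space:]]*"UP"' | wc -l)
  healthy=$((healthy))   # normalise any whitespace padding from wc
  echo "healthy nodes: $healthy"
  [ "$healthy" -ge 2 ]
}
```

It could be chained as `curl -s http://localhost:18181/cluster/members | safe_to_restart && ssh lagging-node "sudo systemctl restart stemedb-api"`, so the restart is skipped when the check fails.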

Resolution C: Network diagnosis

# Check inter-node latency
for node in node1 node2 node3; do
  echo "=== Ping $node ==="
  ping -c 5 $node
done

# Expected: <5ms avg latency within cluster

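# Check for packet loss: read the "% packet loss" figure in the ping summaries above

# Inspect inter-node RPC traffic (stops after 200 packets)
sudo tcpdump -i eth0 -c 200 host node2 and port 18182
# Should show steady RPC traffic; repeated sequence numbers indicate retransmits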

If failed: Lag persists >15 minutes → Check network issues, consider removing lagging node and re-adding. See Add Node Runbook.


§2. Shard Hotspot (Cluster Only)

Diagnostic:

# Check query distribution by node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/metrics | grep stemedb_query_total
done

# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501

# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234  <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989

# Identify hot shard
curl http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
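
The balanced-versus-hotspot comparison above can be reduced to one number: the share of all queries served by the busiest node. The sketch below assumes the `stemedb_query_total{node="..."}` format shown in the expected output; `hottest_node` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the stemedb_query_total lines of all nodes from
# stdin and prints "<busiest node> <its share of all queries, in percent>".
hottest_node() {
  awk '
    /^stemedb_query_total[{]/ {
      if (match($0, /node="[^"]*"/)) {
        node = substr($0, RSTART + 6, RLENGTH - 7)
        count = $NF + 0
        total += count
        if (count > max) { max = count; hot = node }
      }
    }
    END {
      if (total > 0) printf "%s %.1f\n", hot, 100 * max / total
      else exit 1   # no matching metric lines were seen
    }'
}
```

With three evenly loaded nodes the share sits near 33%. Fed the imbalanced figures shown above, it prints `node1 95.7`.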

Resolution: Manual shard rebalance

⚠️ NOTE: Automatic rebalancing is roadmap item P6.3. Manual process required for Pilot 5.

# View current shard assignment
curl http://localhost:18181/cluster/shards | jq '.'

# Identify hot concept_path (sort by the sample value, which is the last field of each metric line)
curl -s http://localhost:18180/metrics | grep concept_path_query_rate | awk '{print $NF, $0}' | sort -nr | head -5

# Move shard to different node (manual)
curl -X POST http://localhost:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "abc123",
    "target_node": "node2-id",
    "reason": "hotspot_mitigation"
  }'

# Monitor rebalance progress
curl http://localhost:18181/cluster/shards/$SHARD_ID | jq '.rebalance_status'

# Wait for status: "COMPLETE"

Temporary workaround: Load balancer weights

# If using nginx load balancer, reduce weight of hot node
# /etc/nginx/conf.d/stemedb-upstream.conf
upstream stemedb {
    server node1:18180 weight=1;  # Reduce from weight=3
    server node2:18180 weight=3;
    server node3:18180 weight=3;
}

sudo nginx -t
sudo systemctl reload nginx

If failed: Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See Add Node Runbook.


§3. Memory Pressure

Diagnostic:

# Check memory usage
free -h

# Expected output (healthy):
#               total        used        free      shared  buff/cache   available
# Mem:           16Gi        4.2Gi       10Gi        128Mi       1.8Gi        11Gi
# Swap:           0B          0B          0B

# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate

# Check for swap usage
cat /proc/swaps

# Check OOM killer logs
journalctl -k | grep -i "out of memory"

# Check StemeDB memory metrics
curl http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
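
The `"available" <10% of total` indicator can be computed directly rather than estimated from `free -h`. A minimal sketch, assuming a Linux host whose `/proc/meminfo` exposes `MemTotal` and `MemAvailable`; `mem_available_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: reads /proc/meminfo-formatted text from stdin and
# prints available memory as a percentage of total, to one decimal place.
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", 100 * avail / total
      else exit 1   # input did not contain MemTotal
    }'
}
```

Run it as `mem_available_pct < /proc/meminfo`; a result below 10 matches the memory-pressure indicator listed above.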

Resolution A: Increase cache size limit

⚠️ NOTE: Default cache: 1GB. Increase if available memory >8GB.

# Set cache size to 2GB (if 16GB RAM available); this only affects a server started from this shell
export STEMEDB_CACHE_SIZE_MB=2048

# For the systemd-managed service, set it in a drop-in instead
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"

sudo systemctl daemon-reload
sudo systemctl restart stemedb-api

# Verify new limit
curl http://localhost:18180/metrics | grep stemedb_cache_size_bytes

Resolution B: Add swap (emergency only)

⚠️ NOT RECOMMENDED for production. Swap causes unpredictable latency. Upgrade RAM instead.

# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify
free -h

Resolution C: Scale vertically

# Upgrade to larger instance (AWS example)
# Stop server
sudo systemctl stop stemedb-api

# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"

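# Stop instance, wait until it is fully stopped, then change instance type
aws ec2 stop-instances --instance-ids i-xxx
aws ec2 wait instance-stopped --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type "{\"Value\": \"t3.2xlarge\"}"

# Start instance and wait until it is running
aws ec2 start-instances --instance-ids i-xxx
aws ec2 wait instance-running --instance-ids i-xxx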

# Verify memory upgrade
ssh instance "free -h"

# Start server
sudo systemctl start stemedb-api

If failed: Memory pressure persists after scaling → Investigate memory leaks. Collect heap profile and escalate to engineering.


§4. Index Corruption

Diagnostic:

# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"

# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"

# Check index metrics
curl http://localhost:18180/metrics | grep stemedb_index_

Resolution: Rebuild indexes

⚠️ WARNING: Index rebuild is a blocking operation. Queries will fail during the rebuild (typically 1-5 minutes for <100K assertions).

# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"

# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"

# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild

# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
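
The "wait for completion" step can be scripted as a bounded polling loop instead of re-running the status call by hand. This is a sketch: it assumes the status endpoint returns `"status": "ready"` as shown above, and `wait_for_index_ready` is a hypothetical helper that takes the status command as its trailing arguments.

```shell
# Hypothetical helper: wait_for_index_ready <attempts> <interval-seconds> <command...>
# Runs the command repeatedly until its output contains "status": "ready".
wait_for_index_ready() {
  local attempts="$1"
  local interval="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" 2>/dev/null | grep -q '"status"[[:space:]]*:[[:space:]]*"ready"'; then
      echo "index ready after $i check(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "index not ready after $attempts check(s)" >&2
  return 1
}
```

For example, `wait_for_index_ready 60 5 curl -s http://localhost:18180/v1/admin/indexes/status` checks every 5 seconds for up to 5 minutes, which covers the rebuild time quoted in the warning. A non-zero exit is the cue to move to the "If failed" step.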

If failed: Rebuild fails or corruption persists → Restore from backup. See Restore from Backup Runbook.


Validation

After applying resolution, validate performance is restored:

  • Query latency back to baseline

    curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
    # Should be <0.2 (200ms)
    
  • Test query succeeds with low latency

    time curl -X POST http://localhost:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path":"test/performance","lens":"recency"}'
    # Should complete in <1 second
    
  • Replication lag <1s (cluster only)

    curl http://localhost:18180/metrics | grep replication_lag_seconds
    # All nodes should show <1.0
    
  • No query timeouts

    curl http://localhost:18180/metrics | grep stemedb_query_timeout_total
    # Counter should stop increasing
    
  • Dashboard loads quickly
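
The two metric-based checks in this list can be combined into a single pass/fail filter. The sketch below uses the thresholds stated above (p99 <0.2s, replication lag <1.0s) and assumes the p99 series carries only the `quantile` label; `validate_metrics` is a hypothetical helper name.

```shell
# Hypothetical helper: reads Prometheus text from stdin, prints PASS/FAIL
# lines, and exits non-zero if any check fails or the p99 metric is missing.
validate_metrics() {
  awk '
    /^stemedb_query_latency_seconds[{]quantile="0.99"[}]/ {
      seen_p99 = 1
      if ($NF + 0 >= 0.2) { print "FAIL p99 latency " $NF "s (want <0.2)"; bad = 1 }
      else print "PASS p99 latency " $NF "s"
    }
    /^replication_lag_seconds/ {
      if ($NF + 0 >= 1.0) { print "FAIL replication lag " $NF "s (want <1.0)"; bad = 1 }
    }
    END {
      if (!seen_p99) { print "FAIL p99 latency metric not found"; bad = 1 }
      exit (bad ? 1 : 0)
    }'
}
```

Usage would be `curl -s http://localhost:18180/metrics | validate_metrics`, repeated per node in a cluster. The timeout counter and dashboard checks still need to be observed over time.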


Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_performance
    rules:
      - alert: StemeDBHighLatency
        expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency high (p99 >1s)"
          description: "p99 latency: {{ $value }}s"

      - alert: StemeDBReplicationLag
        expr: replication_lag_seconds > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high (>5s)"
          description: "Node {{ $labels.node }}: {{ $value }}s"

      - alert: StemeDBMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory available <10%"

Configuration Changes

To prevent recurrence:

  1. Replication lag: Ensure <5ms inter-node latency (same region)
  2. Shard hotspot: Implement read replicas for popular concept_paths (roadmap P6.3)
  3. Memory pressure: Right-size instances based on Resource Sizing Guide
  4. Index corruption: Enable daily backups, test restore procedures monthly

Performance Targets

From production readiness UAT:

| Metric | Pilot Target | Production Target |
|---|---|---|
| Query latency (p50) | <50ms | <20ms |
| Query latency (p99) | <200ms | <100ms |
| Ingest rate | 100/sec | 1K/sec |
| Concurrent queries | 100 | 1K |
| Replication lag | <1s | <200ms |


Last Updated

2026-02-11