Runbook: High Query Latency
Symptom
- API queries return 200 but take >1 second (p99 >1000ms)
- Queries timeout with 504 Gateway Timeout
- Dashboard slow to load or shows stale data
- Users report "sluggish" performance
Metrics Alerts:
- stemedb_query_latency_seconds{quantile="0.99"} > 1.0 for 5 minutes
- replication_lag_seconds > 5.0 (cluster only)
- stemedb_query_timeout_total increasing
Quick Diagnosis
High query latency
│
├─► Check: curl .../metrics | grep replication_lag
│   └─► Lag >5s? → §1 Replication Lag
│
├─► Check: curl .../metrics | grep query_latency_seconds
│   └─► Single shard slow? → §2 Shard Hotspot
│
├─► Check: free -h
│   └─► Memory >90%? → §3 Memory Pressure
│
└─► Check: journalctl | grep "index error"
    └─► Index errors? → §4 Index Corruption
Common Causes
- Replication lag (cluster only), likelihood 35%
  - Network latency between nodes
  - Single node overloaded
  - Merkle sync backlog
- Shard hotspot (cluster only), likelihood 25%
  - Popular concept_path on single shard
  - Unbalanced shard assignment
  - Single node handling all queries
- Memory pressure, likelihood 20%
  - Cache evictions due to low memory
  - Swap thrashing
  - Large result sets
- Index corruption, likelihood 10%
  - Partial index rebuild needed
  - Corrupted predicate index
  - Version mismatch after upgrade
- Query complexity, likelihood 10%
  - Complex lens logic (e.g., AuthorityLens with deep chains)
  - Large result sets (>10K assertions)
  - Inefficient query patterns
Resolution Steps
§1. Replication Lag (Cluster Only)
Diagnostic:
# Check replication lag on all nodes
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s "http://$node:18180/metrics" | grep replication_lag_seconds
done
# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234
# Check Merkle sync status
curl http://localhost:18181/cluster/sync_status | jq '.'
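When the cluster has more than a handful of nodes, scanning the raw metrics by eye gets tedious. The following is a minimal sketch that filters Prometheus text output down to the nodes whose lag exceeds a threshold. The function name `lagging_nodes` is hypothetical, and the sketch assumes the `replication_lag_seconds{node="..."}` sample format shown in the expected output above.

```shell
# lagging_nodes THRESHOLD
# Reads Prometheus text-format metrics on stdin and prints "<node> <lag>"
# for every replication_lag_seconds sample strictly above THRESHOLD.
# Assumption: each sample carries a node="..." label, as in the example output.
lagging_nodes() {
  awk -v limit="$1" '
    /^replication_lag_seconds[{ ]/ {
      lag = $NF + 0                       # sample value is the last field
      if (lag > limit + 0) {
        node = "unknown"
        if (match($0, /node="[^"]*"/)) {
          # skip the 6 characters of node=" and drop the closing quote
          node = substr($0, RSTART + 6, RLENGTH - 7)
        }
        print node, lag
      }
    }'
}

# Usage (not executed here):
#   for node in node1 node2 node3; do curl -s "http://$node:18180/metrics"; done | lagging_nodes 5
```

Empty output means no node is above the threshold.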
Resolution A: Manual Merkle sync
# Identify lagging node
curl http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'
# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
-H "Content-Type: application/json" \
-d '{"target_node": "lagging-node-id", "force": true}'
# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'
# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)
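Instead of watching the metric by hand, the wait can be scripted with a timeout so that a stalled sync is noticed. This is a sketch only: `wait_until_below` is a hypothetical helper, and the command passed to it is assumed to print a single number (for example, a wrapper that extracts the lag value from the metrics endpoint).

```shell
# wait_until_below THRESHOLD TIMEOUT_SECS INTERVAL_SECS CMD [ARGS...]
# Runs CMD every INTERVAL_SECS until it prints a number below THRESHOLD.
# Returns 0 on success, 1 once TIMEOUT_SECS has elapsed.
# INTERVAL_SECS must be >= 1, otherwise the loop never times out.
wait_until_below() {
  threshold=$1; timeout=$2; interval=$3
  shift 3
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    value=$("$@") || value=""
    if [ -n "$value" ] &&
       awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 < t + 0) }'; then
      echo "ok: $value"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timeout after ${timeout}s (last value: ${value:-none})" >&2
  return 1
}

# Usage (not executed here; fetch_lag is a hypothetical function printing the lag):
#   wait_until_below 1.0 300 5 fetch_lag || echo "sync did not converge, see Resolution B"
```

A 300-second timeout matches the upper end of the 1-5 minute sync estimate above; adjust it for larger datasets.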
Resolution B: Restart lagging node
⚠️ WARNING: The cluster must have at least 2 healthy nodes. Do not restart the lagging node if only 1 node is up.
# Check cluster health first
curl http://localhost:18181/cluster/health
# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"
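# Monitor rejoin
# (export the variable first: watch runs the quoted command in a new shell)
export LAGGING_NODE_ID="lagging-node-id"
watch -n 2 'curl -s http://localhost:18181/cluster/members | jq ".members[] | select(.id==\"$LAGGING_NODE_ID\")"'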
# Wait for status: "UP" and replication_lag <1s
Resolution C: Network diagnosis
# Check inter-node latency
for node in node1 node2 node3; do
echo "=== Ping $node ==="
ping -c 5 $node
done
# Expected: <5ms avg latency within cluster
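# Check for packet loss: see the "% packet loss" line in the ping summary above
# Inspect inter-node RPC traffic (adjust eth0 to the node's interface)
sudo tcpdump -i eth0 host node2 and port 18182
# Should show steady RPC traffic; repeated sequence numbers indicate retransmits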
If failed: Lag persists >15 minutes → Check network issues, consider removing lagging node and re-adding. See Add Node Runbook.
§2. Shard Hotspot (Cluster Only)
Diagnostic:
# Check query distribution by node
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:18180/metrics | grep stemedb_query_total
done
# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501
# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234 <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989
# Identify hot shard
curl http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
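To put a number on "balanced" versus "hotspot", the per-node counters can be reduced to a skew ratio: the busiest node's count divided by the mean across nodes. The sketch below assumes the `stemedb_query_total{node="..."}` format shown above; `query_skew` is a hypothetical name, and no particular ratio is an official threshold.

```shell
# query_skew
# Reads stemedb_query_total samples on stdin and prints
# "<busiest node> <max count / mean count>".
# A ratio near 1.00 means balanced; the hotspot example above gives about 2.87.
query_skew() {
  awk '
    /^stemedb_query_total[{ ]/ {
      node = "unknown"
      if (match($0, /node="[^"]*"/)) node = substr($0, RSTART + 6, RLENGTH - 7)
      count = $NF + 0
      total += count
      n++
      if (n == 1 || count > max) { max = count; hot = node }
    }
    END {
      if (n == 0 || total == 0) { print "no samples"; exit 1 }
      printf "%s %.2f\n", hot, max / (total / n)
    }'
}

# Usage (not executed here):
#   for node in node1 node2 node3; do curl -s "http://$node:18180/metrics"; done | query_skew
```

These counters are cumulative since process start, so the ratio reflects lifetime distribution. To see the current distribution, take two readings a minute apart and compare the differences.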
Resolution: Manual shard rebalance
⚠️ NOTE: Automatic rebalancing is roadmap item P6.3. Manual process required for Pilot 5.
# View current shard assignment
curl http://localhost:18181/cluster/shards | jq '.'
# Identify hot concept_path
# (sort on the sample value, which is the last field of each metrics line)
curl -s http://localhost:18180/metrics | grep concept_path_query_rate | awk '{print $NF, $0}' | sort -rn | head -5
# Move shard to different node (manual)
curl -X POST http://localhost:18181/admin/shards/rebalance \
-H "Content-Type: application/json" \
-d '{
"shard_id": "abc123",
"target_node": "node2-id",
"reason": "hotspot_mitigation"
}'
# Monitor rebalance progress
curl http://localhost:18181/cluster/shards/$SHARD_ID | jq '.rebalance_status'
# Wait for status: "COMPLETE"
Temporary workaround: Load balancer weights
# If using nginx load balancer, reduce weight of hot node
# /etc/nginx/conf.d/stemedb-upstream.conf
upstream stemedb {
server node1:18180 weight=1; # Reduce from weight=3
server node2:18180 weight=3;
server node3:18180 weight=3;
}
sudo nginx -t
sudo systemctl reload nginx
If failed: Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See Add Node Runbook.
§3. Memory Pressure
Diagnostic:
# Check memory usage
free -h
# Expected output (healthy):
# total used free shared buff/cache available
# Mem: 16Gi 4.2Gi 10Gi 128Mi 1.8Gi 11Gi
# Swap: 0B 0B 0B
# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate
# Check for swap usage
cat /proc/swaps
# Check OOM killer logs
journalctl -k | grep -i "out of memory"
# Check StemeDB memory metrics
curl http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
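The "available <10% of total" indicator can be computed directly from `/proc/meminfo` rather than read off `free -h`. A minimal sketch follows; the function name `mem_available_pct` is hypothetical, and it relies on the `MemTotal` and `MemAvailable` fields that Linux exposes there.

```shell
# mem_available_pct [FILE]
# Prints MemAvailable as a percentage of MemTotal (one decimal place).
# FILE defaults to /proc/meminfo; passing a file makes the function easy to test.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total == 0) exit 1              # field missing or unreadable input
      printf "%.1f\n", avail * 100 / total
    }' "${1:-/proc/meminfo}"
}

# Usage (not executed here):
#   pct=$(mem_available_pct)
#   awk -v p="$pct" 'BEGIN { exit !(p < 10) }' && echo "memory pressure: ${pct}% available"
```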
Resolution A: Increase cache size limit
⚠️ NOTE: Default cache: 1GB. Increase if available memory >8GB.
# Set cache size to 2GB (if 16GB RAM available)
export STEMEDB_CACHE_SIZE_MB=2048
# Or in systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"
sudo systemctl daemon-reload
sudo systemctl restart stemedb-api
# Verify new limit
curl http://localhost:18180/metrics | grep stemedb_cache_size_bytes
Resolution B: Add swap (emergency only)
⚠️ NOT RECOMMENDED for production. Swap causes unpredictable latency. Upgrade RAM instead.
# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Verify
free -h
Resolution C: Scale vertically
# Upgrade to larger instance (AWS example)
# Stop server
sudo systemctl stop stemedb-api
# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"
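# Stop instance and wait until it is fully stopped
# (the instance type cannot be changed while the instance is running)
aws ec2 stop-instances --instance-ids i-xxx
aws ec2 wait instance-stopped --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type t3.2xlarge
# Start instance and wait until it is running
aws ec2 start-instances --instance-ids i-xxx
aws ec2 wait instance-running --instance-ids i-xxx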
# Verify memory upgrade
ssh instance "free -h"
# Start server
sudo systemctl start stemedb-api
If failed: Memory pressure persists after scaling → Investigate memory leaks. Collect heap profile and escalate to engineering.
§4. Index Corruption
Diagnostic:
# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"
# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"
# Check index metrics
curl http://localhost:18180/metrics | grep stemedb_index_
Resolution: Rebuild indexes
⚠️ WARNING: Index rebuild is a blocking operation. Queries will fail during the rebuild (typically 1-5 minutes for <100K assertions).
# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api
# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"
# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"
# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild
# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
If failed: Rebuild fails or corruption persists → Restore from backup. See Restore from Backup Runbook.
Validation
After applying resolution, validate performance is restored:
- Query latency back to baseline
  curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
  # Should be <0.2 (200ms)
- Test query succeeds with low latency
  time curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path":"test/performance","lens":"recency"}'
  # Should complete in <1 second
- Replication lag <1s (cluster only)
  curl http://localhost:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
- No query timeouts
  curl http://localhost:18180/metrics | grep stemedb_query_timeout_total
  # Counter should stop increasing
- Dashboard loads quickly
  - Open http://localhost:18188/
  - Quarantine panel should load in <2 seconds
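The latency check in the validation list can be turned into a pass/fail gate, which is useful when validation is run from a script or a DR drill. This is a sketch under the assumption that the metrics endpoint emits `stemedb_query_latency_seconds{quantile="0.99"}` samples as referenced above; `check_p99` is a hypothetical name.

```shell
# check_p99 LIMIT
# Reads Prometheus text-format metrics on stdin, prints each p99 sample,
# and exits 0 only if at least one p99 sample exists and all are below LIMIT.
check_p99() {
  awk -v limit="$1" '
    /^stemedb_query_latency_seconds[{]/ && /quantile="0\.99"/ {
      seen = 1
      if ($NF + 0 >= limit + 0) bad = 1
      print "p99=" $NF
    }
    END { exit ((seen && !bad) ? 0 : 1) }'
}

# Usage (not executed here):
#   curl -s http://localhost:18180/metrics | check_p99 0.2 && echo "latency OK"
```

The function fails when no p99 sample is present, so a missing metric is not mistaken for a healthy one.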
Prevention
Monitoring
Set up alerts for:
# Prometheus alert rules
groups:
  - name: stemedb_performance
    rules:
      - alert: StemeDBHighLatency
        expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency high (p99 >1s)"
          description: "p99 latency: {{ $value }}s"
      - alert: StemeDBReplicationLag
        expr: replication_lag_seconds > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high (>5s)"
          description: "Node {{ $labels.node }}: {{ $value }}s"
      - alert: StemeDBMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory available <10%"
Configuration Changes
To prevent recurrence:
- Replication lag: Ensure <5ms inter-node latency (same region)
- Shard hotspot: Implement read replicas for popular concept_paths (roadmap P6.3)
- Memory pressure: Right-size instances based on Resource Sizing Guide
- Index corruption: Enable daily backups, test restore procedures monthly
Performance Targets
From production readiness UAT:
| Metric | Pilot Target | Production Target |
|---|---|---|
| Query latency (p50) | <50ms | <20ms |
| Query latency (p99) | <200ms | <100ms |
| Ingest rate | 100/sec | 1K/sec |
| Concurrent queries | 100 | 1K |
| Replication lag | <1s | <200ms |
Related Runbooks
- Add Node - Horizontal scaling
- Restore from Backup - Index corruption recovery
- Disk Full - Storage capacity issues
Last Updated
2026-02-11