jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

Runbook: High Query Latency

Symptom

API queries return 200 but take >1 second (p99 >1000ms)
Queries time out with 504 Gateway Timeout
Dashboard slow to load or shows stale data
Users report "sluggish" performance

Metrics Alerts:

stemedb_query_latency_seconds{quantile="0.99"} > 1.0 for 5 minutes
replication_lag_seconds > 5.0 (cluster only)
stemedb_query_timeout_total increasing

Quick Diagnosis

High query latency
    │
    ├─► Check: curl .../metrics | grep replication_lag
    │   └─► Lag >5s? → §1 Replication Lag
    │
    ├─► Check: curl .../metrics | grep query_latency_seconds
    │   └─► Single shard slow? → §2 Shard Hotspot
    │
    ├─► Check: free -h
    │   └─► Memory >90%? → §3 Memory Pressure
    │
    └─► Check: journalctl | grep "index error"
        └─► Index errors? → §4 Index Corruption
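The two branches of the tree that reduce to a single number (replication lag and memory use) can be scripted. The sketch below is illustrative only: `triage` is a hypothetical helper, not part of StemeDB, and it assumes the metrics arrive in Prometheus text format with the sample value as the last field on each line. The shard and index branches still need the manual checks above.

```shell
# triage MEM_USED_PCT: read Prometheus text-format metrics on stdin and print
# the runbook section to start with. Only the two checks that reduce to a
# single number are automated; thresholds mirror the tree above.
triage() {
  awk -v mem="${1:-0}" '
    # Any replication_lag_seconds sample above 5s points at section 1.
    index($0, "replication_lag_seconds") == 1 && $NF + 0 > 5.0 { lag = 1 }
    END {
      if (lag)               print "section 1: Replication Lag"
      else if (mem + 0 > 90) print "section 3: Memory Pressure"
      else                   print "no match: continue with section 2 and section 4 checks"
    }'
}
```

Usage, assuming the caller has already worked out the memory-used percentage (for example from `free`): `curl -s .../metrics | triage 42`.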

Common Causes

Replication lag (cluster only) — Likelihood: 35%
- Network latency between nodes
- Single node overloaded
- Merkle sync backlog
Shard hotspot (cluster only) — Likelihood: 25%
- Popular concept_path on single shard
- Unbalanced shard assignment
- Single node handling all queries
Memory pressure — Likelihood: 20%
- Cache evictions due to low memory
- Swap thrashing
- Large result sets
Index corruption — Likelihood: 10%
- Partial index rebuild needed
- Corrupted predicate index
- Version mismatch after upgrade
Query complexity — Likelihood: 10%
- Complex lens logic (e.g., AuthorityLens with deep chains)
- Large result sets (>10K assertions)
- Inefficient query patterns

Resolution Steps

§1. Replication Lag (Cluster Only)

Diagnostic:

# Check replication lag on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/metrics | grep replication_lag_seconds
done

# Expected output (healthy):
# replication_lag_seconds{node="node1"} 0.123
# replication_lag_seconds{node="node2"} 0.089
# replication_lag_seconds{node="node3"} 0.234

# Check Merkle sync status
curl http://localhost:18181/cluster/sync_status | jq '.'
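To avoid reading the per-node output by eye, the lag samples can be filtered against the 5-second alert threshold. This is a minimal sketch: `lagging_nodes` is a hypothetical helper, and it assumes each sample carries a `node="..."` label as in the expected output above.

```shell
# lagging_nodes THRESHOLD: read metrics text on stdin and print "node lag"
# for every replication_lag_seconds sample above THRESHOLD seconds.
lagging_nodes() {
  awk -v max="${1:-5}" '
    index($0, "replication_lag_seconds{") == 1 {
      node = $0
      sub(/^.*node="/, "", node)   # strip everything up to the node label value
      sub(/".*$/, "", node)        # strip everything after it
      if ($NF + 0 > max + 0) print node, $NF
    }'
}
```

Usage: pipe the output of the loop above into it, for example `curl -s http://$node:18180/metrics | lagging_nodes 5`. No output means no node is above the threshold.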

Resolution A: Manual Merkle sync

# Identify lagging node
curl http://localhost:18181/cluster/members | jq '.members[] | select(.replication_lag > 5)'

# Trigger manual sync from healthy node
curl -X POST http://healthy-node:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_node": "lagging-node-id", "force": true}'

# Monitor progress
watch -n 5 'curl -s http://lagging-node:18180/metrics | grep replication_lag'

# Wait for lag <1s
# (Sync typically takes 1-5 minutes for <100K assertions)

Resolution B: Restart lagging node

⚠️ WARNING: The cluster must have at least 2 healthy nodes. Do not restart if only 1 node is up.

# Check cluster health first
curl http://localhost:18181/cluster/health

# If 2+ nodes healthy, restart lagging node
ssh lagging-node "sudo systemctl restart stemedb-api"

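# Monitor rejoin (LAGGING_NODE_ID must be exported so the shell that watch spawns can expand it)
export LAGGING_NODE_ID="lagging-node-id"
watch -n 2 'curl -s http://localhost:18181/cluster/members | jq ".members[] | select(.id==\"$LAGGING_NODE_ID\")"'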

# Wait for status: "UP" and replication_lag <1s

Resolution C: Network diagnosis

# Check inter-node latency
for node in node1 node2 node3; do
  echo "=== Ping $node ==="
  ping -c 5 $node
done

# Expected: <5ms avg latency within cluster

# Check for packet loss
sudo tcpdump -i eth0 host node2 and port 18182
# Should show steady RPC traffic, no retransmits

If failed: Lag persists >15 minutes → Check for network issues; consider removing the lagging node and re-adding it. See Add Node Runbook.

§2. Shard Hotspot (Cluster Only)

Diagnostic:

# Check query distribution by node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/metrics | grep stemedb_query_total
done

# Expected (balanced):
# stemedb_query_total{node="node1"} 12453
# stemedb_query_total{node="node2"} 12389
# stemedb_query_total{node="node3"} 12501

# Imbalanced (hotspot):
# stemedb_query_total{node="node1"} 45234  <-- Hotspot!
# stemedb_query_total{node="node2"} 1023
# stemedb_query_total{node="node3"} 989

# Identify hot shard
curl http://localhost:18181/cluster/shards | jq '.shards[] | select(.query_rate > 1000)'
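The balanced and imbalanced examples above can be told apart mechanically by comparing each node's count with the mean. The sketch below assumes a factor of 2x the mean as the hotspot cutoff; that factor and the `hot_nodes` name are illustrative choices, not StemeDB defaults.

```shell
# hot_nodes FACTOR: read stemedb_query_total samples on stdin and print
# "node count" for each node whose count exceeds FACTOR times the mean.
hot_nodes() {
  awk -v factor="${1:-2}" '
    index($0, "stemedb_query_total{") == 1 {
      node = $0
      sub(/^.*node="/, "", node)   # keep only the node label value
      sub(/".*$/, "", node)
      n++
      order[n] = node              # remember input order for stable output
      count[node] = $NF + 0
      total += $NF
    }
    END {
      if (n == 0) exit
      mean = total / n
      for (i = 1; i <= n; i++)
        if (count[order[i]] > factor * mean)
          printf "%s %d\n", order[i], count[order[i]]
    }'
}
```

Usage: feed it the concatenated output of the per-node loop above. With the imbalanced sample it reports `node1`; with the balanced sample it prints nothing.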

Resolution: Manual shard rebalance

⚠️ NOTE: Automatic rebalancing is roadmap item P6.3. Manual process required for Pilot 5.

# View current shard assignment
curl http://localhost:18181/cluster/shards | jq '.'

# Identify hot concept_path (sort on the sample value, which is the last field of each metric line)
curl -s http://localhost:18180/metrics | grep concept_path_query_rate | awk '{print $NF, $0}' | sort -rn | head -5

# Move shard to different node (manual)
curl -X POST http://localhost:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "abc123",
    "target_node": "node2-id",
    "reason": "hotspot_mitigation"
  }'

# Monitor rebalance progress (SHARD_ID is the shard_id sent in the rebalance request)
SHARD_ID="abc123"
curl -s http://localhost:18181/cluster/shards/$SHARD_ID | jq '.rebalance_status'

# Wait for status: "COMPLETE"

Temporary workaround: Load balancer weights

# If using nginx load balancer, reduce weight of hot node
# /etc/nginx/conf.d/stemedb-upstream.conf
upstream stemedb {
    server node1:18180 weight=1;  # Reduce from weight=3
    server node2:18180 weight=3;
    server node3:18180 weight=3;
}

sudo nginx -t
sudo systemctl reload nginx

If failed: Hotspot persists → Consider scaling horizontally (add node) or caching popular queries. See Add Node Runbook.

§3. Memory Pressure

Diagnostic:

# Check memory usage
free -h

# Expected output (healthy):
#               total        used        free      shared  buff/cache   available
# Mem:           16Gi        4.2Gi       10Gi        128Mi       1.8Gi        11Gi
# Swap:           0B          0B          0B

# Memory pressure indicators:
# - "available" <10% of total
# - Swap used (should be 0 for databases)
# - High "buff/cache" eviction rate

# Check for swap usage
cat /proc/swaps

# Check OOM killer logs
journalctl -k | grep -i "out of memory"

# Check StemeDB memory metrics
curl http://localhost:18180/metrics | grep -E '(process_resident_memory|stemedb_cache_size)'
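The "available <10% of total" indicator can be computed directly from `/proc/meminfo`, which is where `free` gets its numbers on Linux. A minimal sketch, with `mem_available_pct` as a hypothetical helper name:

```shell
# mem_available_pct: read /proc/meminfo-format text on stdin and print
# MemAvailable as an integer percentage of MemTotal (truncated).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

Usage: `mem_available_pct < /proc/meminfo`. A result below 10 matches the memory pressure indicator listed above.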

Resolution A: Increase cache size limit

⚠️ NOTE: Default cache: 1GB. Increase if available memory >8GB.

# Set cache size to 2GB (if 16GB RAM available)
# Note: export only affects a server started from this shell, not the systemd service
export STEMEDB_CACHE_SIZE_MB=2048

# Or in systemd service
sudo systemctl edit stemedb-api
# Add:
# [Service]
# Environment="STEMEDB_CACHE_SIZE_MB=2048"

sudo systemctl daemon-reload
sudo systemctl restart stemedb-api

# Verify new limit
curl http://localhost:18180/metrics | grep stemedb_cache_size_bytes

Resolution B: Add swap (emergency only)

⚠️ NOT RECOMMENDED for production. Swap causes unpredictable latency. Upgrade RAM instead.

# Emergency swap for demo/pilot (4GB)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify
free -h

Resolution C: Scale vertically

# Upgrade to larger instance (AWS example)
# Stop server
sudo systemctl stop stemedb-api

# Snapshot volumes
aws ec2 create-snapshot --volume-id vol-xxx --description "pre-upgrade"

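# Stop instance, wait until it is fully stopped, then change instance type
aws ec2 stop-instances --instance-ids i-xxx
aws ec2 wait instance-stopped --instance-ids i-xxx
aws ec2 modify-instance-attribute --instance-id i-xxx --instance-type t3.2xlarge

# Start instance and wait until it is running
aws ec2 start-instances --instance-ids i-xxx
aws ec2 wait instance-running --instance-ids i-xxx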

# Verify memory upgrade
ssh instance "free -h"

# Start server
sudo systemctl start stemedb-api

If failed: Memory pressure persists after scaling → Investigate memory leaks. Collect heap profile and escalate to engineering.

§4. Index Corruption

Diagnostic:

# Check logs for index errors
journalctl -u stemedb-api -n 100 | grep -i "index"

# Common errors:
# - "predicate index lookup failed"
# - "concept_path not found in index"
# - "index checksum mismatch"

# Check index metrics
curl http://localhost:18180/metrics | grep stemedb_index_

Resolution: Rebuild indexes

⚠️ WARNING: Index rebuild is a blocking operation. Queries will fail during the rebuild (typically 1-5 minutes for <100K assertions).

# Option 1: Restart server (triggers automatic rebuild)
sudo systemctl restart stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f | grep -i "index rebuild"

# Expected log:
# "Starting index rebuild from WAL"
# "Rebuilt predicate index: 45123 entries"
# "Rebuilt concept index: 23456 entries"
# "Index rebuild complete in 127ms"

# Option 2: Trigger manual rebuild via admin endpoint
curl -X POST http://localhost:18180/v1/admin/indexes/rebuild

# Wait for completion
curl http://localhost:18180/v1/admin/indexes/status
# Should return: {"status": "ready", "last_rebuild": "2026-02-11T10:23:45Z"}
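Waiting for completion usually means polling the status endpoint until it reports `ready`. The sketch below shows one way to do that with a bounded number of attempts. `wait_until_ready` is a hypothetical helper; it only assumes the response contains `"status": "ready"` as in the sample above, and it matches that with `grep` so it does not depend on `jq`.

```shell
# wait_until_ready ATTEMPTS DELAY CMD...: run CMD until its output contains
# "status": "ready", sleeping DELAY seconds between tries.
# Returns 0 once ready, 1 if ATTEMPTS is exhausted.
wait_until_ready() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" 2>/dev/null | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ready"'; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

Usage, polling every 5 seconds for up to 5 minutes: `wait_until_ready 60 5 curl -s http://localhost:18180/v1/admin/indexes/status || echo "rebuild did not finish"`.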

If failed: Rebuild fails or corruption persists → Restore from backup. See Restore from Backup Runbook.

Validation

After applying a resolution, validate that performance is restored:

Query latency back to baseline

curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'
# Should be <0.2 (200ms)

Test query succeeds with low latency

time curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path":"test/performance","lens":"recency"}'
# Should complete in <1 second

Replication lag <1s (cluster only)

curl http://localhost:18180/metrics | grep replication_lag_seconds
# All nodes should show <1.0

No query timeouts

curl http://localhost:18180/metrics | grep stemedb_query_timeout_total
# Counter should stop increasing

Dashboard loads quickly
- Open http://localhost:18188/
- Quarantine panel should load in <2 seconds
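The metric-based checks above can be turned into pass/fail helpers for a post-incident script. This is a sketch under assumptions: the function names are hypothetical, and the p99 check expects the sample line to start with exactly `stemedb_query_latency_seconds{quantile="0.99"}`, as in the `grep` pattern above.

```shell
# p99_ok LIMIT: read metrics text on stdin; succeed only if a
# stemedb_query_latency_seconds{quantile="0.99"} sample exists and is < LIMIT.
p99_ok() {
  awk -v limit="${1:-0.2}" '
    index($0, "stemedb_query_latency_seconds{quantile=\"0.99\"}") == 1 {
      seen = 1
      if ($NF + 0 >= limit + 0) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }'
}

# timeouts_stable BEFORE AFTER: succeed if the stemedb_query_timeout_total
# counter did not grow between two readings taken some time apart.
timeouts_stable() {
  awk -v before="$1" -v after="$2" 'BEGIN { exit (after + 0 <= before + 0) ? 0 : 1 }'
}
```

Usage: `curl -s http://localhost:18180/metrics | p99_ok 0.2 && echo "latency OK"`. For the timeout counter, read the value twice a minute or so apart and pass both readings to `timeouts_stable`.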

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_performance
    rules:
      - alert: StemeDBHighLatency
        expr: stemedb_query_latency_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency high (p99 >1s)"
          description: "p99 latency: {{ $value }}s"

      - alert: StemeDBReplicationLag
        expr: replication_lag_seconds > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high (>5s)"
          description: "Node {{ $labels.node }}: {{ $value }}s"

      - alert: StemeDBMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory available <10%"

Configuration Changes

To prevent recurrence:

Replication lag: Ensure <5ms inter-node latency (same region)
Shard hotspot: Implement read replicas for popular concept_paths (roadmap P6.3)
Memory pressure: Right-size instances based on Resource Sizing Guide
Index corruption: Enable daily backups, test restore procedures monthly

Performance Targets

From production readiness UAT:

Metric	Pilot Target	Production Target
Query latency (p50)	<50ms	<20ms
Query latency (p99)	<200ms	<100ms
Ingest rate	100/sec	1K/sec
Concurrent queries	100	1K
Replication lag	<1s	<200ms
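For load-test scripts, the latency rows of the table can be encoded as a simple gate. The sketch below hard-codes the p50 and p99 targets from the table; `meets_target` is a hypothetical helper and covers only those two rows.

```shell
# meets_target TIER P50_MS P99_MS: compare measured latencies (milliseconds)
# against the table above. TIER is "pilot" or "production".
# Exit status: 0 = targets met, 1 = targets missed, 2 = unknown tier.
meets_target() {
  awk -v tier="$1" -v p50="$2" -v p99="$3" 'BEGIN {
    if (tier == "pilot")           { max50 = 50; max99 = 200 }
    else if (tier == "production") { max50 = 20; max99 = 100 }
    else exit 2
    exit (p50 + 0 < max50 && p99 + 0 < max99) ? 0 : 1
  }'
}
```

Usage: `meets_target pilot 35 150 && echo "within pilot targets"`. The same measurements fail the production tier because the p50 value exceeds 20ms.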

Related Runbooks

Add Node - Horizontal scaling
Restore from Backup - Index corruption recovery
Disk Full - Storage capacity issues

Last Updated

2026-02-11
