jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

5.8 KiB

Raw Blame History

High Replication Lag

Severity: CRITICAL

Alert Rule

Alert: ReplicationLagCritical Trigger: Replica lag exceeds 10 seconds Duration: 3m

Symptom

Query results from replicas are stale (missing recent assertions)
Replication metrics show increasing lag (e.g., stemedb_replication_lag_seconds > 10)
Merkle tree sync reports large diffs between primary and replica
Clients reading from replicas see inconsistent data

Impact

User Impact:

Queries to replicas return outdated results
Reads may miss assertions written in the last 10+ seconds
Eventual consistency SLAs violated

System Impact:

Replica may fall too far behind to catch up (cascading failure)
Increased Merkle tree diff volume (bandwidth spike)
Risk of replica demotion or rebuild

Investigation Steps

1. Check Replication Status

# Query replication lag metric
curl -s http://localhost:18180/metrics | grep replication_lag

# Expected output (example):
# stemedb_replication_lag_seconds{replica="node2"} 12.5

2. Identify Bottleneck

A. Network latency:

# Ping replica from primary
ping -c 10 <replica-ip>

# Check bandwidth usage
iftop -i eth0 -f "port 18182"

B. Replica disk I/O:

# SSH to replica
iostat -x 1 10

# Look for high %util on WAL partition

C. Replica CPU saturation:

# SSH to replica
top -b -n 1 | grep stemedb

3. Check for Merkle Sync Errors

# Primary logs
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20

# Replica logs
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"

4. Compare Assertion Counts

# Primary assertion count
curl -s http://localhost:18180/metrics | grep assertions_indexed_total

# Replica assertion count
curl -s http://<replica>:18180/metrics | grep assertions_indexed_total

Resolution

If Network Latency is High

1. Check network path:

traceroute <replica-ip>
mtr -r -c 10 <replica-ip>

2. Verify firewall rules:

# RPC port 18182 should be open
telnet <replica-ip> 18182

3. Increase RPC timeout if needed:

Edit /etc/stemedb/api.toml on primary:

[cluster]
rpc_timeout_ms = 10000  # Increase from default 5000

Restart primary:

systemctl restart stemedb-api

If Replica Disk I/O is Saturated

1. Verify WAL write performance:

# SSH to replica
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
rm test.dat

Expected: >100 MB/s on SSD.

2. Check for competing I/O:

iotop -o

3. Temporarily reduce ingestion rate on primary:

# Apply rate limit via admin endpoint
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"max_assertions_per_sec": 1000}'

If Replica is Falling Further Behind

1. Initiate manual Merkle sync:

curl -X POST http://localhost:18180/v1/admin/cluster/sync \
  -H 'Content-Type: application/json' \
  -d '{"replica_id": "node2", "force": true}'

2. Monitor sync progress:

watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'

3. If sync fails repeatedly, rebuild replica:

See docs/operations/runbooks/rebuild-replica.md.

If Replication Stream is Blocked

1. Check for circuit breaker trip:

curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq

2. Reset circuit breaker if needed:

curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<replica_agent_id>"}'

Prevention

Monitoring and Alerting

1. Add warning-level lag alert:

# Prometheus alert rule
- alert: ReplicationLagWarning
  expr: stemedb_replication_lag_seconds > 5
  for: 5m
  annotations:
    summary: "Replica lag exceeds 5 seconds"

2. Monitor Merkle sync errors:

- alert: MerkleSyncFailures
  expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
  annotations:
    summary: "Frequent Merkle sync failures detected"

Capacity Planning

1. Ensure replica hardware matches primary:

Same or better disk I/O (IOPS)
Same network bandwidth
Sufficient CPU headroom

2. Set replication backpressure threshold:

# /etc/stemedb/api.toml
[cluster]
max_replication_lag_seconds = 30  # Pause ingestion if lag exceeds

Operational Best Practices

1. Gradual rollout of high-volume ingestion:

# Ramp up assertion rate slowly
for rate in 100 500 1000 2000; do
  echo "Testing rate: $rate/sec"
  # Apply rate via API
  curl -X POST http://localhost:18180/v1/admin/rate-limit \
    -d "{\"max_assertions_per_sec\": $rate}"
  sleep 300  # Monitor for 5 minutes
  # Check lag
  curl -s http://localhost:18180/metrics | grep replication_lag
done

2. Pre-provision replicas before traffic spikes:

Add replicas 24 hours before expected load increase.

Escalation

Escalate immediately if:

Lag exceeds 60 seconds (replica rebuild likely needed)
Replica is stuck in crash loop during sync
Merkle sync reports corruption (data integrity issue)
Multiple replicas lagging simultaneously (primary overload)

Escalation path:

Primary on-call: Cluster SRE
Secondary: Distributed systems engineer
Final escalation: Principal engineer (data corruption suspected)

References

Dashboard: StemeDB Cluster Overview
Related alerts: ClusterSplitBrain, MerkleSyncFailure, HighNetworkUtilization
Metrics to check:
- stemedb_replication_lag_seconds (lag duration)
- stemedb_merkle_sync_duration_seconds (sync timing)
- stemedb_assertions_indexed_total (ingestion rate)
- stemedb_network_bytes_sent_total (replication bandwidth)
Runbooks: rebuild-replica.md, split-brain.md

5.8 KiB Raw Blame History