jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

6.9 KiB

Raw Blame History

Slow WAL Fsync

Severity: WARNING

Alert Rule

Alert: WALFsyncSlow Trigger: WAL fsync p99 latency > 100ms Duration: 10m

Symptom

Metrics show stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
API write latency increasing (p99 > 200ms)
Logs may show "slow fsync" warnings
Ingestion throughput degrading

Impact

User Impact:

Slower API responses for write operations
Reduced ingestion throughput (assertions/sec)
Client timeouts if latency exceeds configured limits

System Impact:

Write pipeline backpressure
Increased memory usage (buffered writes)
Risk of WAL segment rotation delays

Investigation Steps

1. Check Fsync Latency Metrics

# Current p50, p90, p99 latency
curl -s http://localhost:18180/metrics | grep wal_fsync_duration_seconds

# Expected output:
# stemedb_wal_fsync_duration_seconds{quantile="0.5"} 0.001
# stemedb_wal_fsync_duration_seconds{quantile="0.9"} 0.01
# stemedb_wal_fsync_duration_seconds{quantile="0.99"} 0.15  # ← HIGH

2. Check Disk I/O Utilization

# Disk stats
iostat -x 2 10

# Look for:
# - High %util on WAL partition (>80% sustained)
# - High await (>50ms indicates congestion)

3. Check for Competing I/O

# Processes doing disk I/O
iotop -o -b -n 5

# Look for other processes writing to same disk

4. Check Disk Write Cache

# Verify write cache is enabled (should be for durability)
hdparm -W /dev/sda
# write-caching =  1 (on)

5. Test Raw Disk Performance

# Benchmark fsync performance
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=10000 && sync"
rm test.dat

# Expected: <5 seconds on SSD, <15 seconds on spinning disk

Resolution

If Disk I/O is Saturated

1. Identify competing workload:

# Top I/O consumers
iotop -o -b -n 1 | head -20

2. Reduce competing I/O:

# Pause non-critical I/O (backups, log compression, etc.)
systemctl stop backup.service
systemctl stop log-archiver.timer

3. Monitor improvement:

watch -n 5 'curl -s http://localhost:18180/metrics | grep wal_fsync_duration'

If Disk is Slow (Hardware Issue)

1. Check SMART status:

smartctl -a /dev/sda | grep -E "(Seek_Error|Reallocated_Sector)"

2. If disk is failing, prepare for migration:

# Mark node for draining
curl -X POST http://localhost:18180/v1/admin/node/drain

# Schedule maintenance window for disk replacement

3. Temporarily reduce write rate:

# Apply rate limit to reduce I/O pressure
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -d '{"max_writes_per_sec": 500}'

If Filesystem is Misconfigured

1. Check mount options:

mount | grep /var/lib/stemedb/wal

Expected: data=ordered or data=writeback (not data=journal which is slower)

2. If using wrong mount options, remount:

# Edit /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,noatime 0 2

# Remount (requires downtime)
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
mount /var/lib/stemedb/wal
systemctl start stemedb-api

If Group Commit Not Optimal

1. Tune group commit settings:

Edit /etc/stemedb/api.toml:

[wal]
group_commit_max_wait_ms = 10  # Increase batching window
group_commit_max_bytes = 1048576  # 1MB batches

2. Restart service:

systemctl restart stemedb-api

3. Monitor fsync frequency:

# Fsync count should decrease with larger batches
curl -s http://localhost:18180/metrics | grep wal_fsync_total

If Cloud Provider Throttling

1. Check for IOPS throttling (AWS EBS example):

# CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-abc123 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

2. Increase provisioned IOPS:

# Modify EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-abc123 \
  --iops 3000 --volume-type gp3

3. Wait for optimization to complete:

watch aws ec2 describe-volumes-modifications \
  --volume-ids vol-abc123 \
  --query 'VolumesModifications[0].ModificationState'

Prevention

Monitoring

1. Alert on sustained high latency:

- alert: WALFsyncDegrading
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.05
  for: 15m
  annotations:
    summary: "WAL fsync p99 latency degrading (>50ms)"

2. Monitor disk queue depth:

- alert: DiskQueueDepthHigh
  expr: node_disk_io_weighted_seconds_total > 100
  for: 10m
  annotations:
    summary: "Disk queue depth indicates congestion"

Capacity Planning

1. Use dedicated disk for WAL:

NVMe SSD with capacitor-backed cache
Separate physical disk from KV store
Provisioned IOPS (cloud deployments)

2. Benchmark before production:

# Test fsync performance under load
fio --name=fsync-test --rw=write --bs=4k --size=1G \
  --fsync=1 --numjobs=4 --runtime=60 \
  --filename=/var/lib/stemedb/wal/test.dat

Expected: p99 latency <10ms on NVMe, <50ms on SATA SSD.

3. Right-size provisioned IOPS (cloud):

IOPS needed = (writes_per_sec * 1.5)  # 1.5x for overhead

Example:
- 1000 writes/sec → 1500 IOPS minimum
- Use 3000 IOPS for headroom (2x)

Operational Best Practices

1. Regular disk health checks:

# Weekly SMART check
smartctl -a /dev/sda | grep -E "(PASSED|FAILED)"

# Alert on pending sectors
smartctl -a /dev/sda | awk '/Current_Pending_Sector/ {if($10>0) print "WARNING: Pending sectors detected"}'

2. Monitor filesystem age:

# Check filesystem age (ext4)
tune2fs -l /dev/sdb1 | grep "Filesystem created"

# Consider reformatting if >2 years old (fragmentation)

3. Test I/O performance quarterly:

# Benchmark and compare to baseline
fio --name=seq-write --rw=write --bs=1M --size=10G \
  --filename=/var/lib/stemedb/wal/bench.dat \
  --output-format=json > /tmp/fio-$(date +%Y%m%d).json

Escalation

Escalate if:

Fsync latency exceeds 200ms for >30 minutes
Disk errors appear in logs (hardware failure)
Tuning and optimization has no effect
Cloud provider throttling cannot be resolved

Escalation path:

Primary on-call: Storage SRE
Secondary: Infrastructure engineer
Final escalation: Cloud vendor TAM (if cloud-related)

References

Dashboard: StemeDB WAL Performance
Related alerts: WALFsyncFailure, HighStorageErrorRate, DiskUtilizationHigh
Metrics:
- stemedb_wal_fsync_duration_seconds (latency distribution)
- stemedb_wal_fsync_total (fsync count)
- node_disk_io_time_weighted_seconds_total (disk queue time)
Runbooks: wal-fsync-failure.md, disk-full.md

6.9 KiB Raw Blame History