jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

5.3 KiB


WAL Fsync Failure

Severity: CRITICAL

Alert Rule

Alert: WALFsyncFailure
Trigger: WAL fsync operations failing (error rate > 0)
Duration: 1m

Symptom

Metrics show stemedb_wal_fsync_errors_total increasing
Logs contain "fsync failed" or "WAL write error"
Write operations return 500 errors
API logs show: Error: Failed to fsync WAL segment
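
The log symptoms above can be confirmed from the journal. This is a minimal sketch, assuming the API runs under the `stemedb-api` systemd unit used elsewhere in this runbook; the function name `count_wal_errors` is illustrative, not part of StemeDB, and the one-hour window is arbitrary.

```shell
# Count log lines that match either symptom string from this runbook.
# Reads log text on stdin, so it works with journalctl or a saved log file.
count_wal_errors() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that exit code.
  grep -c -E 'fsync failed|WAL write error' || true
}

# Example (commented out so the snippet has no side effects):
# journalctl -u stemedb-api --since "1 hour ago" --no-pager | count_wal_errors
```

A non-zero count alongside a rising `stemedb_wal_fsync_errors_total` suggests the alert is genuine rather than a metrics glitch.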

Impact

User Impact:

All writes fail immediately (assertions, votes, epochs)
API returns HTTP 500 on POST/PUT operations
Data loss risk if errors persist (WAL not durable)

System Impact:

Write pipeline completely blocked
Risk of WAL corruption if partial writes occurred
Potential need for WAL rebuild from replicas

Investigation Steps

1. Check Fsync Error Count

curl -s http://localhost:18180/metrics | grep wal_fsync_errors
# stemedb_wal_fsync_errors_total{segment="segment.001.wal"} 15
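
Because the counter is labelled per segment and the alert fires on its rate of increase, it can help to sum across segments and compare two samples. The sketch below assumes the metrics endpoint shown above and integer counter values; `sum_fsync_errors` is a hypothetical helper name.

```shell
# Sum stemedb_wal_fsync_errors_total across all segment labels.
# Reads Prometheus text-format metrics on stdin; prints 0 if the metric is absent.
sum_fsync_errors() {
  awk '/^stemedb_wal_fsync_errors_total[{ ]/ { total += $NF }
       END { print total + 0 }'
}

# Example (commented out so the snippet has no side effects):
# before=$(curl -s http://localhost:18180/metrics | sum_fsync_errors)
# sleep 60
# after=$(curl -s http://localhost:18180/metrics | sum_fsync_errors)
# echo "fsync errors in the last minute: $((after - before))"
```

A delta of zero means the errors have stopped, even though the absolute counter stays non-zero until the process restarts.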

2. Check Disk Status

# I/O errors
dmesg | grep -i "i/o error" | tail -20

# Filesystem errors
journalctl --dmesg | grep -i "ext4.*error"

# SMART status (replace /dev/sda with the disk that backs the WAL partition)
smartctl -a /dev/sda

3. Check WAL Partition Health

# Disk space
df -h /var/lib/stemedb/wal

# Mount options (must include sync or data=ordered)
mount | grep /var/lib/stemedb

# Test write + fsync (conv=fsync makes dd fsync the output file before exiting)
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=4k count=1000 conv=fsync
rm test.dat

4. Check for Read-Only Filesystem

# Attempt write (the probe file is removed again on success)
touch /var/lib/stemedb/wal/test.file && rm /var/lib/stemedb/wal/test.file
# If this fails with "Read-only file system", a remount is needed
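
The read-only state can also be checked without writing to the partition, by reading the mount options the kernel reports. This is a sketch under the assumption that `/var/lib/stemedb/wal` is itself a mount point (as the fstab example in this runbook implies); `mount_state` is a hypothetical helper.

```shell
# Print "ro" or "rw" for the filesystem mounted at the given mount point.
# $1: mount point; stdin: text in /proc/mounts format
#     (device mountpoint fstype options dump pass).
# Exits 1 and prints nothing if the path is not a mount point.
mount_state() {
  awk -v mp="$1" '
    $2 == mp {
      state = "rw"
      n = split($4, opts, ",")
      # Match the exact "ro" option only, not e.g. errors=remount-ro.
      for (i = 1; i <= n; i++) if (opts[i] == "ro") state = "ro"
    }
    END { if (state == "") exit 1; print state }'
}

# Example (commented out so the snippet has no side effects):
# mount_state /var/lib/stemedb/wal < /proc/mounts
```

A result of `ro` on a partition that fstab mounts read-write usually means the kernel remounted it after detecting filesystem errors, so check `dmesg` before remounting.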

Resolution

If Filesystem is Read-Only

1. Remount as read-write:

mount -o remount,rw /var/lib/stemedb/wal

2. Check for underlying errors:

dmesg | tail -50

3. If errors persist, run filesystem check:

systemctl stop stemedb-api
umount /var/lib/stemedb/wal
fsck -y /dev/sdb1  # Replace with actual device
mount /var/lib/stemedb/wal
systemctl start stemedb-api

If Disk is Failing

1. Verify hardware status:

smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable)"

2. If bad sectors detected, initiate failover:

# Mark node as unhealthy
curl -X POST http://localhost:18180/v1/admin/node/drain

# Failover to replica
# See: docs/operations/runbooks/failover-to-replica.md
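
The SMART check in step 1 prints the attributes but leaves the reader to spot non-zero raw values. The following sketch filters for them, assuming an ATA/SATA disk whose `smartctl -A` table has RAW_VALUE in the tenth column; NVMe devices report a different format and are not covered. `bad_smart_attrs` is a hypothetical helper.

```shell
# Print the SMART attributes named in this runbook whose raw value is non-zero.
# Reads `smartctl -A` attribute-table output on stdin.
bad_smart_attrs() {
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ \
         && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Example (commented out; replace /dev/sda with the WAL device):
# smartctl -A /dev/sda | bad_smart_attrs
```

Empty output means none of the three attributes report bad sectors; any output is a reason to proceed with the drain and failover in step 2.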

If WAL Segment is Corrupted

1. Identify corrupted segment:

journalctl -u stemedb-api | grep "WAL.*corrupt" | tail -10

2. Attempt recovery:

systemctl stop stemedb-api

# Backup corrupted segment
mv /var/lib/stemedb/wal/segment.001.wal \
   /var/lib/stemedb/wal/segment.001.wal.corrupted

# Truncate at last known good position (if identified in logs)
stemedb-wal-repair \
  --segment /var/lib/stemedb/wal/segment.001.wal.corrupted \
  --output /var/lib/stemedb/wal/segment.001.wal \
  --truncate-at <byte-offset>

systemctl start stemedb-api

3. If repair fails, restore from backup:

See docs/operations/runbooks/restore-from-backup.md.

If No Hardware/FS Issues Found

1. Check for kernel/driver bugs:

# Kernel version
uname -r

# Recent kernel package installs/upgrades (Debian/Ubuntu)
grep -E " (install|upgrade) linux-image" /var/log/dpkg.log | tail -10

2. Enable WAL fsync debug logging:

Edit /etc/stemedb/api.toml:

[wal]
log_fsync_errors = true

Restart:

systemctl restart stemedb-api

3. Collect diagnostic data:

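# -f attaches to every thread of the process, not only the main one
timeout 30 strace -f -p "$(pgrep -o stemedb-api)" -e trace=fsync,fdatasync -o /tmp/fsync-trace.txt
# Failed calls return -1 with an errno, e.g. "= -1 EIO (Input/output error)"
grep -- "= -1" /tmp/fsync-trace.txt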
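
Grouping the failed calls in the trace by errno makes it easier to tell a device problem (EIO) from a capacity problem (ENOSPC) or a read-only remount (EROFS). This is a sketch that parses standard strace output as written to `/tmp/fsync-trace.txt` above; `fsync_failures_by_errno` is a hypothetical helper.

```shell
# Count failed fsync/fdatasync calls per errno in strace output read from stdin.
# A failed call looks like:  fsync(7) = -1 EIO (Input/output error)
fsync_failures_by_errno() {
  awk '/(fsync|fdatasync).*= -1 / {
         # The errno name is the field right after the -1 return value.
         for (i = 1; i < NF; i++) if ($i == "-1") { count[$(i + 1)]++; break }
       }
       END { for (e in count) print e, count[e] }' | sort
}

# Example (commented out so the snippet has no side effects):
# fsync_failures_by_errno < /tmp/fsync-trace.txt
```

Include the per-errno counts when escalating, since they narrow down which layer to investigate.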

Prevention

Monitoring

1. Alert on fsync latency degradation:

- alert: WALFsyncSlow
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
  for: 5m
  annotations:
    summary: "WAL fsync latency degrading (p99 > 100ms)"

2. Monitor disk health:

# Daily SMART check (a crontab entry must stay on a single line)
0 2 * * * smartctl -a /dev/sda | grep -q "FAILING_NOW" && curl -X POST -H "Content-Type: application/json" http://alertmanager/api/v2/alerts -d @disk-alert.json

Capacity Planning

1. Use enterprise-grade SSDs with power-loss protection:

NVMe with capacitor-backed write cache
Avoid consumer SSDs in production

2. Configure filesystem for durability:

# /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,barrier=1 0 2

Operational Best Practices

1. Regular WAL health checks:

# Weekly verification
cd /var/lib/stemedb/wal
for segment in segment.*.wal; do
  stemedb-wal-verify --file "$segment" || echo "ERROR: $segment corrupted"
done

2. Automate disk replacement:

Set up alerts to trigger replacement before failure.

Escalation

Escalate immediately if:

Fsync errors continue after remount
Disk SMART status shows imminent failure
WAL corruption cannot be repaired
Multiple nodes affected (infrastructure issue)

Escalation path:

Primary on-call: Storage SRE
Secondary: Kernel/systems engineer
Final escalation: VP Engineering (if data loss is imminent)

References

Dashboard: StemeDB WAL Health
Related alerts: WALDiskNearlyFull, WALFsyncSlow, HighStorageErrorRate
Metrics:
- stemedb_wal_fsync_errors_total
- stemedb_wal_fsync_duration_seconds
- stemedb_wal_segment_rotations_total
Runbooks: disk-full.md, storage-errors.md, failover-to-replica.md
