This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
5.3 KiB
WAL Fsync Failure
Severity: CRITICAL
Alert Rule
Alert: WALFsyncFailure
Trigger: WAL fsync operations failing (error rate > 0)
Duration: 1m
Symptom
- Metrics show
stemedb_wal_fsync_errors_totalincreasing - Logs contain "fsync failed" or "WAL write error"
- Write operations return 500 errors
- API logs show:
Error: Failed to fsync WAL segment
Impact
User Impact:
- All writes fail immediately (assertions, votes, epochs)
- API returns HTTP 500 on POST/PUT operations
- Data loss risk if errors persist (WAL not durable)
System Impact:
- Write pipeline completely blocked
- Risk of WAL corruption if partial writes occurred
- Potential need for WAL rebuild from replicas
Investigation Steps
1. Check Fsync Error Count
curl -s http://localhost:18180/metrics | grep wal_fsync_errors
# stemedb_wal_fsync_errors_total{segment="segment.001.wal"} 15
2. Check Disk Status
# I/O errors
dmesg | grep -i "i/o error" | tail -20
# Filesystem errors
journalctl --dmesg | grep -i "ext4.*error"
# SMART status
smartctl -a /dev/sda
3. Check WAL Partition Health
# Disk space
df -h /var/lib/stemedb/wal
# Mount options (must include sync or data=ordered)
mount | grep /var/lib/stemedb
# Test write + fsync
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=1000 && sync"
rm test.dat
4. Check for Read-Only Filesystem
# Attempt write
touch /var/lib/stemedb/wal/test.file
# If fails with "Read-only file system", remount needed
Resolution
If Filesystem is Read-Only
1. Remount as read-write:
mount -o remount,rw /var/lib/stemedb/wal
2. Check for underlying errors:
dmesg | tail -50
3. If errors persist, run filesystem check:
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
fsck -y /dev/sdb1 # Replace with actual device
mount /var/lib/stemedb/wal
systemctl start stemedb-api
If Disk is Failing
1. Verify hardware status:
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable)"
2. If bad sectors detected, initiate failover:
# Mark node as unhealthy
curl -X POST http://localhost:18180/v1/admin/node/drain
# Failover to replica
# See: docs/operations/runbooks/failover-to-replica.md
If WAL Segment is Corrupted
1. Identify corrupted segment:
journalctl -u stemedb-api | grep "WAL.*corrupt" | tail -10
2. Attempt recovery:
systemctl stop stemedb-api
# Backup corrupted segment
mv /var/lib/stemedb/wal/segment.001.wal \
/var/lib/stemedb/wal/segment.001.wal.corrupted
# Truncate at last known good position (if identified in logs)
stemedb-wal-repair \
--segment /var/lib/stemedb/wal/segment.001.wal.corrupted \
--output /var/lib/stemedb/wal/segment.001.wal \
--truncate-at <byte-offset>
systemctl start stemedb-api
3. If repair fails, restore from replica:
See docs/operations/runbooks/restore-from-backup.md.
If No Hardware/FS Issues Found
1. Check for kernel/driver bugs:
# Kernel version
uname -r
# Recent kernel updates
grep -i "kernel.*upgrade" /var/log/dpkg.log | tail -10
2. Enable WAL fsync debug logging:
Edit /etc/stemedb/api.toml:
[wal]
log_fsync_errors = true
Restart:
systemctl restart stemedb-api
3. Collect diagnostic data:
strace -p $(pgrep stemedb-api) -e fsync,fdatasync -o /tmp/fsync-trace.txt &
sleep 30
kill %1
grep -i error /tmp/fsync-trace.txt
Prevention
Monitoring
1. Alert on fsync latency degradation:
- alert: WALFsyncSlow
expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
for: 5m
annotations:
summary: "WAL fsync latency degrading (p99 > 100ms)"
2. Monitor disk health:
# Daily SMART check
0 2 * * * smartctl -a /dev/sda | grep -q "FAILING_NOW" && \
curl -X POST http://alertmanager/api/v1/alerts -d @disk-alert.json
Capacity Planning
1. Use enterprise-grade SSDs with power-loss protection:
- NVMe with capacitor-backed write cache
- Avoid consumer SSDs in production
2. Configure filesystem for durability:
# /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,barrier=1 0 2
Operational Best Practices
1. Regular WAL health checks:
# Weekly verification
cd /var/lib/stemedb/wal
for segment in segment.*.wal; do
stemedb-wal-verify --file $segment || echo "ERROR: $segment corrupted"
done
2. Automate disk replacement:
Set up alerts to trigger replacement before failure.
Escalation
Escalate immediately if:
- Fsync errors continue after remount
- Disk SMART status shows imminent failure
- WAL corruption cannot be repaired
- Multiple nodes affected (infrastructure issue)
Escalation path:
- Primary on-call: Storage SRE
- Secondary: Kernel/systems engineer
- Final escalation: VP Engineering (data loss imminent)
References
- Dashboard: StemeDB WAL Health
- Related alerts:
WALDiskNearlyFull,WALFsyncSlow,HighStorageErrorRate - Metrics:
stemedb_wal_fsync_errors_totalstemedb_wal_fsync_duration_secondsstemedb_wal_segment_rotations_total
- Runbooks:
disk-full.md,storage-errors.md,failover-to-replica.md