stemedb/docs/operations/runbooks/wal-fsync-failure.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

261 lines
5.3 KiB
Markdown

# WAL Fsync Failure
## Severity: CRITICAL
## Alert Rule
**Alert:** `WALFsyncFailure`
**Trigger:** WAL fsync operations failing (error rate > 0)
**Duration:** 1m
## Symptom
- Metrics show `stemedb_wal_fsync_errors_total` increasing
- Logs contain "fsync failed" or "WAL write error"
- Write operations return 500 errors
- API logs show: `Error: Failed to fsync WAL segment`
## Impact
**User Impact:**
- All writes fail immediately (assertions, votes, epochs)
- API returns HTTP 500 on POST/PUT operations
- Data loss risk if errors persist (WAL not durable)
**System Impact:**
- Write pipeline completely blocked
- Risk of WAL corruption if partial writes occurred
- Potential need for WAL rebuild from replicas
## Investigation Steps
### 1. Check Fsync Error Count
```bash
curl -s http://localhost:18180/metrics | grep wal_fsync_errors
# stemedb_wal_fsync_errors_total{segment="segment.001.wal"} 15
```
### 2. Check Disk Status
```bash
# I/O errors
dmesg | grep -i "i/o error" | tail -20
# Filesystem errors
journalctl --dmesg | grep -i "ext4.*error"
# SMART status
smartctl -a /dev/sda
```
### 3. Check WAL Partition Health
```bash
# Disk space
df -h /var/lib/stemedb/wal
# Mount options (must include sync or data=ordered)
mount | grep /var/lib/stemedb
# Test write + fsync
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=1000 && sync"
rm test.dat
```
### 4. Check for Read-Only Filesystem
```bash
# Attempt write
touch /var/lib/stemedb/wal/test.file
# If fails with "Read-only file system", remount needed
```
## Resolution
### If Filesystem is Read-Only
**1. Remount as read-write:**
```bash
mount -o remount,rw /var/lib/stemedb/wal
```
**2. Check for underlying errors:**
```bash
dmesg | tail -50
```
**3. If errors persist, run filesystem check:**
```bash
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
fsck -y /dev/sdb1 # Replace with actual device
mount /var/lib/stemedb/wal
systemctl start stemedb-api
```
### If Disk is Failing
**1. Verify hardware status:**
```bash
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable)"
```
**2. If bad sectors detected, initiate failover:**
```bash
# Mark node as unhealthy
curl -X POST http://localhost:18180/v1/admin/node/drain
# Failover to replica
# See: docs/operations/runbooks/failover-to-replica.md
```
### If WAL Segment is Corrupted
**1. Identify corrupted segment:**
```bash
journalctl -u stemedb-api | grep "WAL.*corrupt" | tail -10
```
**2. Attempt recovery:**
```bash
systemctl stop stemedb-api
# Backup corrupted segment
mv /var/lib/stemedb/wal/segment.001.wal \
/var/lib/stemedb/wal/segment.001.wal.corrupted
# Truncate at last known good position (if identified in logs)
stemedb-wal-repair \
--segment /var/lib/stemedb/wal/segment.001.wal.corrupted \
--output /var/lib/stemedb/wal/segment.001.wal \
--truncate-at <byte-offset>
systemctl start stemedb-api
```
**3. If repair fails, restore from replica:**
See `docs/operations/runbooks/restore-from-backup.md`.
### If No Hardware/FS Issues Found
**1. Check for kernel/driver bugs:**
```bash
# Kernel version
uname -r
# Recent kernel updates
grep -i "kernel.*upgrade" /var/log/dpkg.log | tail -10
```
**2. Enable WAL fsync debug logging:**
Edit `/etc/stemedb/api.toml`:
```toml
[wal]
log_fsync_errors = true
```
Restart:
```bash
systemctl restart stemedb-api
```
**3. Collect diagnostic data:**
```bash
strace -p $(pgrep stemedb-api) -e fsync,fdatasync -o /tmp/fsync-trace.txt &
sleep 30
kill %1
grep -i error /tmp/fsync-trace.txt
```
## Prevention
### Monitoring
**1. Alert on fsync latency degradation:**
```yaml
- alert: WALFsyncSlow
expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
for: 5m
annotations:
summary: "WAL fsync latency degrading (p99 > 100ms)"
```
**2. Monitor disk health:**
```bash
# Daily SMART check
0 2 * * * smartctl -a /dev/sda | grep -q "FAILING_NOW" && \
curl -X POST http://alertmanager/api/v1/alerts -d @disk-alert.json
```
### Capacity Planning
**1. Use enterprise-grade SSDs with power-loss protection:**
- NVMe with capacitor-backed write cache
- Avoid consumer SSDs in production
**2. Configure filesystem for durability:**
```bash
# /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,barrier=1 0 2
```
### Operational Best Practices
**1. Regular WAL health checks:**
```bash
# Weekly verification
cd /var/lib/stemedb/wal
for segment in segment.*.wal; do
stemedb-wal-verify --file $segment || echo "ERROR: $segment corrupted"
done
```
**2. Automate disk replacement:**
Set up alerts to trigger replacement before failure.
## Escalation
**Escalate immediately if:**
- Fsync errors continue after remount
- Disk SMART status shows imminent failure
- WAL corruption cannot be repaired
- Multiple nodes affected (infrastructure issue)
**Escalation path:**
1. **Primary on-call:** Storage SRE
2. **Secondary:** Kernel/systems engineer
3. **Final escalation:** VP Engineering (data loss imminent)
## References
- **Dashboard:** [StemeDB WAL Health](http://grafana.example.com/d/stemedb-wal)
- **Related alerts:** `WALDiskNearlyFull`, `WALFsyncSlow`, `HighStorageErrorRate`
- **Metrics:**
- `stemedb_wal_fsync_errors_total`
- `stemedb_wal_fsync_duration_seconds`
- `stemedb_wal_segment_rotations_total`
- **Runbooks:** `disk-full.md`, `storage-errors.md`, `failover-to-replica.md`