# WAL Fsync Failure ## Severity: CRITICAL ## Alert Rule **Alert:** `WALFsyncFailure` **Trigger:** WAL fsync operations failing (error rate > 0) **Duration:** 1m ## Symptom - Metrics show `stemedb_wal_fsync_errors_total` increasing - Logs contain "fsync failed" or "WAL write error" - Write operations return 500 errors - API logs show: `Error: Failed to fsync WAL segment` ## Impact **User Impact:** - All writes fail immediately (assertions, votes, epochs) - API returns HTTP 500 on POST/PUT operations - Data loss risk if errors persist (WAL not durable) **System Impact:** - Write pipeline completely blocked - Risk of WAL corruption if partial writes occurred - Potential need for WAL rebuild from replicas ## Investigation Steps ### 1. Check Fsync Error Count ```bash curl -s http://localhost:18180/metrics | grep wal_fsync_errors # stemedb_wal_fsync_errors_total{segment="segment.001.wal"} 15 ``` ### 2. Check Disk Status ```bash # I/O errors dmesg | grep -i "i/o error" | tail -20 # Filesystem errors journalctl --dmesg | grep -i "ext4.*error" # SMART status smartctl -a /dev/sda ``` ### 3. Check WAL Partition Health ```bash # Disk space df -h /var/lib/stemedb/wal # Mount options (must include sync or data=ordered) mount | grep /var/lib/stemedb # Test write + fsync cd /var/lib/stemedb/wal time sh -c "dd if=/dev/zero of=test.dat bs=4k count=1000 && sync" rm test.dat ``` ### 4. Check for Read-Only Filesystem ```bash # Attempt write touch /var/lib/stemedb/wal/test.file # If fails with "Read-only file system", remount needed ``` ## Resolution ### If Filesystem is Read-Only **1. Remount as read-write:** ```bash mount -o remount,rw /var/lib/stemedb/wal ``` **2. Check for underlying errors:** ```bash dmesg | tail -50 ``` **3. If errors persist, run filesystem check:** ```bash systemctl stop stemedb-api umount /var/lib/stemedb/wal fsck -y /dev/sdb1 # Replace with actual device mount /var/lib/stemedb/wal systemctl start stemedb-api ``` ### If Disk is Failing **1. Verify hardware status:** ```bash smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable)" ``` **2. If bad sectors detected, initiate failover:** ```bash # Mark node as unhealthy curl -X POST http://localhost:18180/v1/admin/node/drain # Failover to replica # See: docs/operations/runbooks/failover-to-replica.md ``` ### If WAL Segment is Corrupted **1. Identify corrupted segment:** ```bash journalctl -u stemedb-api | grep "WAL.*corrupt" | tail -10 ``` **2. Attempt recovery:** ```bash systemctl stop stemedb-api # Backup corrupted segment mv /var/lib/stemedb/wal/segment.001.wal \ /var/lib/stemedb/wal/segment.001.wal.corrupted # Truncate at last known good position (if identified in logs) stemedb-wal-repair \ --segment /var/lib/stemedb/wal/segment.001.wal.corrupted \ --output /var/lib/stemedb/wal/segment.001.wal \ --truncate-at systemctl start stemedb-api ``` **3. If repair fails, restore from replica:** See `docs/operations/runbooks/restore-from-backup.md`. ### If No Hardware/FS Issues Found **1. Check for kernel/driver bugs:** ```bash # Kernel version uname -r # Recent kernel updates grep -i "kernel.*upgrade" /var/log/dpkg.log | tail -10 ``` **2. Enable WAL fsync debug logging:** Edit `/etc/stemedb/api.toml`: ```toml [wal] log_fsync_errors = true ``` Restart: ```bash systemctl restart stemedb-api ``` **3. Collect diagnostic data:** ```bash strace -p $(pgrep stemedb-api) -e fsync,fdatasync -o /tmp/fsync-trace.txt & sleep 30 kill %1 grep -i error /tmp/fsync-trace.txt ``` ## Prevention ### Monitoring **1. Alert on fsync latency degradation:** ```yaml - alert: WALFsyncSlow expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1 for: 5m annotations: summary: "WAL fsync latency degrading (p99 > 100ms)" ``` **2. Monitor disk health:** ```bash # Daily SMART check 0 2 * * * smartctl -a /dev/sda | grep -q "FAILING_NOW" && \ curl -X POST http://alertmanager/api/v1/alerts -d @disk-alert.json ``` ### Capacity Planning **1. Use enterprise-grade SSDs with power-loss protection:** - NVMe with capacitor-backed write cache - Avoid consumer SSDs in production **2. Configure filesystem for durability:** ```bash # /etc/fstab /dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,barrier=1 0 2 ``` ### Operational Best Practices **1. Regular WAL health checks:** ```bash # Weekly verification cd /var/lib/stemedb/wal for segment in segment.*.wal; do stemedb-wal-verify --file $segment || echo "ERROR: $segment corrupted" done ``` **2. Automate disk replacement:** Set up alerts to trigger replacement before failure. ## Escalation **Escalate immediately if:** - Fsync errors continue after remount - Disk SMART status shows imminent failure - WAL corruption cannot be repaired - Multiple nodes affected (infrastructure issue) **Escalation path:** 1. **Primary on-call:** Storage SRE 2. **Secondary:** Kernel/systems engineer 3. **Final escalation:** VP Engineering (data loss imminent) ## References - **Dashboard:** [StemeDB WAL Health](http://grafana.example.com/d/stemedb-wal) - **Related alerts:** `WALDiskNearlyFull`, `WALFsyncSlow`, `HighStorageErrorRate` - **Metrics:** - `stemedb_wal_fsync_errors_total` - `stemedb_wal_fsync_duration_seconds` - `stemedb_wal_segment_rotations_total` - **Runbooks:** `disk-full.md`, `storage-errors.md`, `failover-to-replica.md`