stemedb/docs/operations/runbooks/wal-fsync-failure.md

# WAL Fsync Failure

## Severity: CRITICAL

## Alert Rule

**Alert:** `WALFsyncFailure`
**Trigger:** WAL fsync operations failing (error rate > 0)
**Duration:** 1m

## Symptom

- Metrics show `stemedb_wal_fsync_errors_total` increasing
- Logs contain "fsync failed" or "WAL write error"
- Write operations return 500 errors
- API logs show: `Error: Failed to fsync WAL segment`

## Impact

**User Impact:**
- All writes fail immediately (assertions, votes, epochs)
- API returns HTTP 500 on POST/PUT operations
- Data loss risk if errors persist (WAL not durable)

**System Impact:**
- Write pipeline completely blocked
- Risk of WAL corruption if partial writes occurred
- Potential need for WAL rebuild from replicas

## Investigation Steps

### 1. Check Fsync Error Count

```bash
curl -s http://localhost:18180/metrics | grep wal_fsync_errors
# stemedb_wal_fsync_errors_total{segment="segment.001.wal"} 15
```

### 2. Check Disk Status

```bash
# I/O errors
dmesg | grep -i "i/o error" | tail -20

# Filesystem errors
journalctl --dmesg | grep -i "ext4.*error"

# SMART status
smartctl -a /dev/sda
```

### 3. Check WAL Partition Health

```bash
# Disk space
df -h /var/lib/stemedb/wal

# Mount options (must include sync or data=ordered)
mount | grep /var/lib/stemedb

# Test write + fsync
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=1000 && sync"
rm test.dat
```

### 4. Check for Read-Only Filesystem

```bash
# Attempt write
touch /var/lib/stemedb/wal/test.file
# If fails with "Read-only file system", remount needed
```

## Resolution

### If Filesystem is Read-Only

**1. Remount as read-write:**

```bash
mount -o remount,rw /var/lib/stemedb/wal
```

**2. Check for underlying errors:**

```bash
dmesg | tail -50
```

**3. If errors persist, run filesystem check:**

```bash
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
fsck -y /dev/sdb1  # Replace with actual device
mount /var/lib/stemedb/wal
systemctl start stemedb-api
```

### If Disk is Failing

**1. Verify hardware status:**

```bash
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable)"
```

**2. If bad sectors detected, initiate failover:**

```bash
# Mark node as unhealthy
curl -X POST http://localhost:18180/v1/admin/node/drain

# Failover to replica
# See: docs/operations/runbooks/failover-to-replica.md
```

### If WAL Segment is Corrupted

**1. Identify corrupted segment:**

```bash
journalctl -u stemedb-api | grep "WAL.*corrupt" | tail -10
```

**2. Attempt recovery:**

```bash
systemctl stop stemedb-api

# Backup corrupted segment
mv /var/lib/stemedb/wal/segment.001.wal \
   /var/lib/stemedb/wal/segment.001.wal.corrupted

# Truncate at last known good position (if identified in logs)
stemedb-wal-repair \
  --segment /var/lib/stemedb/wal/segment.001.wal.corrupted \
  --output /var/lib/stemedb/wal/segment.001.wal \
  --truncate-at <byte-offset>

systemctl start stemedb-api
```

**3. If repair fails, restore from replica:**

See `docs/operations/runbooks/restore-from-backup.md`.

### If No Hardware/FS Issues Found

**1. Check for kernel/driver bugs:**

```bash
# Kernel version
uname -r

# Recent kernel updates
grep -i "kernel.*upgrade" /var/log/dpkg.log | tail -10
```

**2. Enable WAL fsync debug logging:**

Edit `/etc/stemedb/api.toml`:

```toml
[wal]
log_fsync_errors = true
```

Restart:

```bash
systemctl restart stemedb-api
```

**3. Collect diagnostic data:**

```bash
strace -p $(pgrep stemedb-api) -e fsync,fdatasync -o /tmp/fsync-trace.txt &
sleep 30
kill %1
grep -i error /tmp/fsync-trace.txt
```

## Prevention

### Monitoring

**1. Alert on fsync latency degradation:**

```yaml
- alert: WALFsyncSlow
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
  for: 5m
  annotations:
    summary: "WAL fsync latency degrading (p99 > 100ms)"
```

**2. Monitor disk health:**

```bash
# Daily SMART check
0 2 * * * smartctl -a /dev/sda | grep -q "FAILING_NOW" && \
  curl -X POST http://alertmanager/api/v1/alerts -d @disk-alert.json
```

### Capacity Planning

**1. Use enterprise-grade SSDs with power-loss protection:**

- NVMe with capacitor-backed write cache
- Avoid consumer SSDs in production

**2. Configure filesystem for durability:**

```bash
# /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,barrier=1 0 2
```

### Operational Best Practices

**1. Regular WAL health checks:**

```bash
# Weekly verification
cd /var/lib/stemedb/wal
for segment in segment.*.wal; do
  stemedb-wal-verify --file $segment || echo "ERROR: $segment corrupted"
done
```

**2. Automate disk replacement:**

Set up alerts to trigger replacement before failure.

## Escalation

**Escalate immediately if:**

- Fsync errors continue after remount
- Disk SMART status shows imminent failure
- WAL corruption cannot be repaired
- Multiple nodes affected (infrastructure issue)

**Escalation path:**

1. **Primary on-call:** Storage SRE
2. **Secondary:** Kernel/systems engineer
3. **Final escalation:** VP Engineering (data loss imminent)

## References

- **Dashboard:** [StemeDB WAL Health](http://grafana.example.com/d/stemedb-wal)
- **Related alerts:** `WALDiskNearlyFull`, `WALFsyncSlow`, `HighStorageErrorRate`
- **Metrics:**
  - `stemedb_wal_fsync_errors_total`
  - `stemedb_wal_fsync_duration_seconds`
  - `stemedb_wal_segment_rotations_total`
- **Runbooks:** `disk-full.md`, `storage-errors.md`, `failover-to-replica.md`