stemedb/docs/operations/runbooks/slow-fsync.md

# Slow WAL Fsync

## Severity: WARNING

## Alert Rule

**Alert:** `WALFsyncSlow`
**Trigger:** WAL fsync p99 latency > 100ms
**Duration:** 10m

## Symptom

- Metrics show `stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1`
- API write latency increasing (p99 > 200ms)
- Logs may show "slow fsync" warnings
- Ingestion throughput degrading

## Impact

**User Impact:**
- Slower API responses for write operations
- Reduced ingestion throughput (assertions/sec)
- Client timeouts if latency exceeds configured limits

**System Impact:**
- Write pipeline backpressure
- Increased memory usage (buffered writes)
- Risk of WAL segment rotation delays

## Investigation Steps

### 1. Check Fsync Latency Metrics

```bash
# Current p50, p90, p99 latency
curl -s http://localhost:18180/metrics | grep wal_fsync_duration_seconds

# Expected output:
# stemedb_wal_fsync_duration_seconds{quantile="0.5"} 0.001
# stemedb_wal_fsync_duration_seconds{quantile="0.9"} 0.01
# stemedb_wal_fsync_duration_seconds{quantile="0.99"} 0.15  # ← HIGH
```

### 2. Check Disk I/O Utilization

```bash
# Disk stats
iostat -x 2 10

# Look for:
# - High %util on WAL partition (>80% sustained)
# - High await (>50ms indicates congestion)
```

### 3. Check for Competing I/O

```bash
# Processes doing disk I/O
iotop -o -b -n 5

# Look for other processes writing to same disk
```

### 4. Check Disk Write Cache

```bash
# Verify write cache is enabled (should be for durability)
hdparm -W /dev/sda
# write-caching =  1 (on)
```

### 5. Test Raw Disk Performance

```bash
# Benchmark fsync performance
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=10000 && sync"
rm test.dat

# Expected: <5 seconds on SSD, <15 seconds on spinning disk
```

## Resolution

### If Disk I/O is Saturated

**1. Identify competing workload:**

```bash
# Top I/O consumers
iotop -o -b -n 1 | head -20
```

**2. Reduce competing I/O:**

```bash
# Pause non-critical I/O (backups, log compression, etc.)
systemctl stop backup.service
systemctl stop log-archiver.timer
```

**3. Monitor improvement:**

```bash
watch -n 5 'curl -s http://localhost:18180/metrics | grep wal_fsync_duration'
```

### If Disk is Slow (Hardware Issue)

**1. Check SMART status:**

```bash
smartctl -a /dev/sda | grep -E "(Seek_Error|Reallocated_Sector)"
```

**2. If disk is failing, prepare for migration:**

```bash
# Mark node for draining
curl -X POST http://localhost:18180/v1/admin/node/drain

# Schedule maintenance window for disk replacement
```

**3. Temporarily reduce write rate:**

```bash
# Apply rate limit to reduce I/O pressure
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -d '{"max_writes_per_sec": 500}'
```

### If Filesystem is Misconfigured

**1. Check mount options:**

```bash
mount | grep /var/lib/stemedb/wal
```

**Expected:** `data=ordered` or `data=writeback` (not `data=journal` which is slower)

**2. If using wrong mount options, remount:**

```bash
# Edit /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,noatime 0 2

# Remount (requires downtime)
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
mount /var/lib/stemedb/wal
systemctl start stemedb-api
```

### If Group Commit Not Optimal

**1. Tune group commit settings:**

Edit `/etc/stemedb/api.toml`:

```toml
[wal]
group_commit_max_wait_ms = 10  # Increase batching window
group_commit_max_bytes = 1048576  # 1MB batches
```

**2. Restart service:**

```bash
systemctl restart stemedb-api
```

**3. Monitor fsync frequency:**

```bash
# Fsync count should decrease with larger batches
curl -s http://localhost:18180/metrics | grep wal_fsync_total
```

### If Cloud Provider Throttling

**1. Check for IOPS throttling (AWS EBS example):**

```bash
# CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-abc123 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average
```

**2. Increase provisioned IOPS:**

```bash
# Modify EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-abc123 \
  --iops 3000 --volume-type gp3
```

**3. Wait for optimization to complete:**

```bash
watch aws ec2 describe-volumes-modifications \
  --volume-ids vol-abc123 \
  --query 'VolumesModifications[0].ModificationState'
```

## Prevention

### Monitoring

**1. Alert on sustained high latency:**

```yaml
- alert: WALFsyncDegrading
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.05
  for: 15m
  annotations:
    summary: "WAL fsync p99 latency degrading (>50ms)"
```

**2. Monitor disk queue depth:**

```yaml
- alert: DiskQueueDepthHigh
  expr: node_disk_io_weighted_seconds_total > 100
  for: 10m
  annotations:
    summary: "Disk queue depth indicates congestion"
```

### Capacity Planning

**1. Use dedicated disk for WAL:**

- NVMe SSD with capacitor-backed cache
- Separate physical disk from KV store
- Provisioned IOPS (cloud deployments)

**2. Benchmark before production:**

```bash
# Test fsync performance under load
fio --name=fsync-test --rw=write --bs=4k --size=1G \
  --fsync=1 --numjobs=4 --runtime=60 \
  --filename=/var/lib/stemedb/wal/test.dat
```

Expected: p99 latency <10ms on NVMe, <50ms on SATA SSD.

**3. Right-size provisioned IOPS (cloud):**

```
IOPS needed = (writes_per_sec * 1.5)  # 1.5x for overhead

Example:
- 1000 writes/sec → 1500 IOPS minimum
- Use 3000 IOPS for headroom (2x)
```

### Operational Best Practices

**1. Regular disk health checks:**

```bash
# Weekly SMART check
smartctl -a /dev/sda | grep -E "(PASSED|FAILED)"

# Alert on pending sectors
smartctl -a /dev/sda | awk '/Current_Pending_Sector/ {if($10>0) print "WARNING: Pending sectors detected"}'
```

**2. Monitor filesystem age:**

```bash
# Check filesystem age (ext4)
tune2fs -l /dev/sdb1 | grep "Filesystem created"

# Consider reformatting if >2 years old (fragmentation)
```

**3. Test I/O performance quarterly:**

```bash
# Benchmark and compare to baseline
fio --name=seq-write --rw=write --bs=1M --size=10G \
  --filename=/var/lib/stemedb/wal/bench.dat \
  --output-format=json > /tmp/fio-$(date +%Y%m%d).json
```

## Escalation

**Escalate if:**

- Fsync latency exceeds 200ms for >30 minutes
- Disk errors appear in logs (hardware failure)
- Tuning and optimization has no effect
- Cloud provider throttling cannot be resolved

**Escalation path:**

1. **Primary on-call:** Storage SRE
2. **Secondary:** Infrastructure engineer
3. **Final escalation:** Cloud vendor TAM (if cloud-related)

## References

- **Dashboard:** [StemeDB WAL Performance](http://grafana.example.com/d/stemedb-wal)
- **Related alerts:** `WALFsyncFailure`, `HighStorageErrorRate`, `DiskUtilizationHigh`
- **Metrics:**
  - `stemedb_wal_fsync_duration_seconds` (latency distribution)
  - `stemedb_wal_fsync_total` (fsync count)
  - `node_disk_io_time_weighted_seconds_total` (disk queue time)
- **Runbooks:** `wal-fsync-failure.md`, `disk-full.md`