stemedb/docs/operations/runbooks/slow-fsync.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

Slow WAL Fsync

Severity: WARNING

Alert Rule

Alert: WALFsyncSlow
Trigger: WAL fsync p99 latency > 100ms
Duration: 10m

Symptom

  • Metrics show stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
  • API write latency increasing (p99 > 200ms)
  • Logs may show "slow fsync" warnings
  • Ingestion throughput degrading

Impact

User Impact:

  • Slower API responses for write operations
  • Reduced ingestion throughput (assertions/sec)
  • Client timeouts if latency exceeds configured limits

System Impact:

  • Write pipeline backpressure
  • Increased memory usage (buffered writes)
  • Risk of WAL segment rotation delays

Investigation Steps

1. Check Fsync Latency Metrics

# Current p50, p90, p99 latency
curl -s http://localhost:18180/metrics | grep wal_fsync_duration_seconds

# Example output when the alert condition is met:
# stemedb_wal_fsync_duration_seconds{quantile="0.5"} 0.001
# stemedb_wal_fsync_duration_seconds{quantile="0.9"} 0.01
# stemedb_wal_fsync_duration_seconds{quantile="0.99"} 0.15  # ← HIGH
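
The threshold comparison can be scripted rather than read by eye. The following is a minimal sketch, assuming the metric is exposed in Prometheus text format as in the sample output above, with no labels other than quantile. The name check_fsync_p99 is a hypothetical helper, not part of StemeDB.

```shell
# Hypothetical helper (not part of StemeDB): reads Prometheus text-format
# metrics on stdin and prints the p99 fsync latency plus OK or SLOW.
check_fsync_p99() {
  # $1: threshold in seconds; defaults to the WALFsyncSlow trigger (0.1)
  awk -v t="${1:-0.1}" '
    $1 == "stemedb_wal_fsync_duration_seconds{quantile=\"0.99\"}" {
      print $2, (($2 + 0 > t + 0) ? "SLOW" : "OK")
      found = 1
    }
    END { if (!found) print "missing" }
  '
}

# Usage against the metrics endpoint used in this runbook:
# curl -s http://localhost:18180/metrics | check_fsync_p99 0.1
```

A result of "missing" means the series was not found, which may indicate that the metric carries extra labels in your deployment.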

2. Check Disk I/O Utilization

# Disk stats
iostat -x 2 10

# Look for:
# - High %util on WAL partition (>80% sustained)
# - High await (shown as r_await/w_await on newer sysstat): >50ms indicates congestion
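
To pick out saturated devices from a long iostat run, the output can be filtered by the %util column. This is a sketch only: busy_devices is a hypothetical helper name, and it assumes the header row starts with "Device" and contains a column named "%util", which holds for the sysstat versions I am aware of but should be checked against yours.

```shell
# Hypothetical helper: reads `iostat -x` output on stdin and prints each
# device whose %util exceeds the limit (default 80, as in the note above).
busy_devices() {
  awk -v limit="${1:-80}" '
    # Header row: locate the %util column by name, since its position
    # differs between sysstat versions.
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # Data rows: print device name and utilization when above the limit.
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}

# Usage:
# iostat -x 2 10 | busy_devices 80
```

The first report printed by iostat covers the time since boot, so a device that appears only in that first report is not necessarily busy now.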

3. Check for Competing I/O

# Processes doing disk I/O
iotop -o -b -n 5

# Look for other processes writing to same disk

4. Check Disk Write Cache

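# Check whether the drive's write cache is enabled. It lowers fsync latency,
# but is only safe for durability if the drive honors cache flushes or has
# power-loss protection (e.g. capacitor-backed cache).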
hdparm -W /dev/sda
# write-caching =  1 (on)

5. Test Raw Disk Performance

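# Rough write + flush benchmark (conv=fsync makes dd fsync the file before
# exiting, instead of flushing every filesystem with a global sync)
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=4k count=10000 conv=fsync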
rm test.dat

# Expected: <5 seconds on SSD, <15 seconds on spinning disk

Resolution

If Disk I/O is Saturated

1. Identify competing workload:

# Top I/O consumers
iotop -o -b -n 1 | head -20

2. Reduce competing I/O:

# Pause non-critical I/O (backups, log compression, etc.)
systemctl stop backup.service
systemctl stop log-archiver.timer

3. Monitor improvement:

watch -n 5 'curl -s http://localhost:18180/metrics | grep wal_fsync_duration'

If Disk is Slow (Hardware Issue)

1. Check SMART status:

smartctl -a /dev/sda | grep -E "(Seek_Error|Reallocated_Sector)"

2. If disk is failing, prepare for migration:

# Mark node for draining
curl -X POST http://localhost:18180/v1/admin/node/drain

# Schedule maintenance window for disk replacement

3. Temporarily reduce write rate:

# Apply rate limit to reduce I/O pressure
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"max_writes_per_sec": 500}'

If Filesystem is Misconfigured

1. Check mount options:

mount | grep /var/lib/stemedb/wal

Expected: data=ordered or data=writeback. Avoid data=journal, which writes all file data to the journal before its final location and is therefore slower.
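
Reading the journaling mode out of a long option string is easy to get wrong, so the check can be scripted. This is a sketch: ext4_data_mode is a hypothetical helper name, and it assumes that a missing data= entry means the ext4 default (ordered) is in effect, since recent kernels typically omit default options from the mount listing.

```shell
# Hypothetical helper: extracts the ext4 journaling mode from a
# comma-separated mount option string.
ext4_data_mode() {
  # $1: mount options, e.g. "rw,noatime,data=writeback"
  mode=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p' | head -n 1)
  # Assumption: no data= entry means the ext4 default (ordered) is in effect.
  printf '%s\n' "${mode:-ordered}"
}

# Usage: findmnt prints only the options column for the WAL mount point.
# ext4_data_mode "$(findmnt -no OPTIONS /var/lib/stemedb/wal)"
```

If the helper prints "journal", continue with the remount step that follows.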

2. If using wrong mount options, remount:

# Edit /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,noatime 0 2

# Remount (requires downtime)
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
mount /var/lib/stemedb/wal
systemctl start stemedb-api

If Group Commit Not Optimal

1. Tune group commit settings:

Edit /etc/stemedb/api.toml:

[wal]
group_commit_max_wait_ms = 10  # Increase batching window
group_commit_max_bytes = 1048576  # 1MB batches

2. Restart service:

systemctl restart stemedb-api

3. Monitor fsync frequency:

# wal_fsync_total is a counter: its rate of increase (fsyncs/sec) should drop with larger batches
curl -s http://localhost:18180/metrics | grep wal_fsync_total
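
Because stemedb_wal_fsync_total only ever grows, comparing two readings is what shows whether batching helped. A minimal sketch follows; fsync_rate is a hypothetical helper name, and the usage lines assume the counter is exposed as a single series without labels.

```shell
# Hypothetical helper: average fsyncs per second between two readings of
# the stemedb_wal_fsync_total counter.
fsync_rate() {
  # $1: first counter value, $2: second counter value, $3: seconds between them
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Usage (assumes one unlabeled stemedb_wal_fsync_total series):
# before=$(curl -s http://localhost:18180/metrics | awk '/^stemedb_wal_fsync_total/ {print $2}')
# sleep 60
# after=$(curl -s http://localhost:18180/metrics | awk '/^stemedb_wal_fsync_total/ {print $2}')
# fsync_rate "$before" "$after" 60
```

Take one measurement before changing the group commit settings and one after the restart, under comparable write load, so the two rates can be compared.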

If Cloud Provider Throttling

1. Check for IOPS throttling (AWS EBS example):

# CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-abc123 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

2. Increase provisioned IOPS:

# Modify EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-abc123 \
  --iops 3000 --volume-type gp3

3. Wait for optimization to complete:

watch aws ec2 describe-volumes-modifications \
  --volume-ids vol-abc123 \
  --query 'VolumesModifications[0].ModificationState'

Prevention

Monitoring

1. Alert on sustained high latency:

- alert: WALFsyncDegrading
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.05
  for: 15m
  annotations:
    summary: "WAL fsync p99 latency degrading (>50ms)"

2. Monitor disk queue depth:

- alert: DiskQueueDepthHigh
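  # rate() of this counter approximates the average I/O queue depth;
  # tune the threshold for your device.
  expr: rate(node_disk_io_time_weighted_seconds_total[5m]) > 100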
  for: 10m
  annotations:
    summary: "Disk queue depth indicates congestion"

Capacity Planning

1. Use dedicated disk for WAL:

  • NVMe SSD with capacitor-backed cache
  • Separate physical disk from KV store
  • Provisioned IOPS (cloud deployments)

2. Benchmark before production:

# Test fsync performance under load
fio --name=fsync-test --rw=write --bs=4k --size=1G \
  --fsync=1 --numjobs=4 --runtime=60 \
  --filename=/var/lib/stemedb/wal/test.dat

Expected: p99 latency <10ms on NVMe, <50ms on SATA SSD.

3. Right-size provisioned IOPS (cloud):

IOPS needed = (writes_per_sec * 1.5)  # 1.5x for overhead

Example:
- 1000 writes/sec → 1500 IOPS minimum
- Use 3000 IOPS for headroom (2x)
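
The sizing rule above can be applied to other write rates with a few lines of shell. This is a sketch of the arithmetic only; iops_for_writes is a hypothetical helper name, and the 1.5x and 2x factors are the ones stated in this section.

```shell
# Hypothetical helper: applies the sizing rule above using integer math.
iops_for_writes() {
  # $1: expected sustained writes per second
  writes_per_sec=$1
  minimum=$(( writes_per_sec * 3 / 2 ))   # 1.5x for overhead
  recommended=$(( minimum * 2 ))          # 2x headroom on top of the minimum
  echo "minimum=$minimum recommended=$recommended"
}

# Usage:
# iops_for_writes 1000
```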

Operational Best Practices

1. Regular disk health checks:

# Weekly SMART check
smartctl -a /dev/sda | grep -E "(PASSED|FAILED)"

# Alert on pending sectors
smartctl -a /dev/sda | awk '/Current_Pending_Sector/ {if($10>0) print "WARNING: Pending sectors detected"}'

2. Monitor filesystem age:

# Check filesystem age (ext4)
tune2fs -l /dev/sdb1 | grep "Filesystem created"

# If >2 years old, check fragmentation (e.g. e4defrag -c on the mount point)
# before considering a reformat

3. Test I/O performance quarterly:

# Benchmark and compare to baseline
fio --name=seq-write --rw=write --bs=1M --size=10G \
  --filename=/var/lib/stemedb/wal/bench.dat \
  --output-format=json > /tmp/fio-$(date +%Y%m%d).json
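
Comparing a new result to the baseline comes down to a percentage change. The sketch below shows that calculation; bw_change_pct is a hypothetical helper name, and the usage lines assume jq is installed, that the fio JSON report stores write bandwidth at .jobs[0].write.bw, and that the baseline file path shown is a placeholder for wherever you keep it.

```shell
# Hypothetical helper: percentage change of the current bandwidth relative
# to a baseline; negative values mean a regression.
bw_change_pct() {
  # $1: baseline bandwidth, $2: current bandwidth (same units)
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - base) * 100 / base }'
}

# Usage (file names and JSON path are assumptions, adjust to your setup):
# base=$(jq '.jobs[0].write.bw' /tmp/fio-baseline.json)
# cur=$(jq '.jobs[0].write.bw' "/tmp/fio-$(date +%Y%m%d).json")
# bw_change_pct "$base" "$cur"
```

Remember to delete bench.dat after the run so the benchmark file does not consume space on the WAL partition.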

Escalation

Escalate if:

  • Fsync latency exceeds 200ms for >30 minutes
  • Disk errors appear in logs (hardware failure)
  • Tuning and optimization have no effect
  • Cloud provider throttling cannot be resolved

Escalation path:

  1. Primary on-call: Storage SRE
  2. Secondary: Infrastructure engineer
  3. Final escalation: Cloud vendor TAM (if cloud-related)

References

  • Dashboard: StemeDB WAL Performance
  • Related alerts: WALFsyncFailure, HighStorageErrorRate, DiskUtilizationHigh
  • Metrics:
    • stemedb_wal_fsync_duration_seconds (latency distribution)
    • stemedb_wal_fsync_total (fsync count)
    • node_disk_io_time_weighted_seconds_total (disk queue time)
  • Runbooks: wal-fsync-failure.md, disk-full.md