stemedb/docs/operations/runbooks/slow-fsync.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

320 lines
6.9 KiB
Markdown

# Slow WAL Fsync
## Severity: WARNING
## Alert Rule
**Alert:** `WALFsyncSlow`
**Trigger:** WAL fsync p99 latency > 100ms
**Duration:** 10m
## Symptom
- Metrics show `stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1`
- API write latency increasing (p99 > 200ms)
- Logs may show "slow fsync" warnings
- Ingestion throughput degrading
## Impact
**User Impact:**
- Slower API responses for write operations
- Reduced ingestion throughput (assertions/sec)
- Client timeouts if latency exceeds configured limits
**System Impact:**
- Write pipeline backpressure
- Increased memory usage (buffered writes)
- Risk of WAL segment rotation delays
## Investigation Steps
### 1. Check Fsync Latency Metrics
```bash
# Current p50, p90, p99 latency
curl -s http://localhost:18180/metrics | grep wal_fsync_duration_seconds
# Expected output:
# stemedb_wal_fsync_duration_seconds{quantile="0.5"} 0.001
# stemedb_wal_fsync_duration_seconds{quantile="0.9"} 0.01
# stemedb_wal_fsync_duration_seconds{quantile="0.99"} 0.15 # ← HIGH
```
### 2. Check Disk I/O Utilization
```bash
# Disk stats
iostat -x 2 10
# Look for:
# - High %util on WAL partition (>80% sustained)
# - High await (>50ms indicates congestion)
```
### 3. Check for Competing I/O
```bash
# Processes doing disk I/O
iotop -o -b -n 5
# Look for other processes writing to same disk
```
### 4. Check Disk Write Cache
```bash
# Verify write cache is enabled (should be for durability)
hdparm -W /dev/sda
# write-caching = 1 (on)
```
### 5. Test Raw Disk Performance
```bash
# Benchmark fsync performance
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=10000 && sync"
rm test.dat
# Expected: <5 seconds on SSD, <15 seconds on spinning disk
```
## Resolution
### If Disk I/O is Saturated
**1. Identify competing workload:**
```bash
# Top I/O consumers
iotop -o -b -n 1 | head -20
```
**2. Reduce competing I/O:**
```bash
# Pause non-critical I/O (backups, log compression, etc.)
systemctl stop backup.service
systemctl stop log-archiver.timer
```
**3. Monitor improvement:**
```bash
watch -n 5 'curl -s http://localhost:18180/metrics | grep wal_fsync_duration'
```
### If Disk is Slow (Hardware Issue)
**1. Check SMART status:**
```bash
smartctl -a /dev/sda | grep -E "(Seek_Error|Reallocated_Sector)"
```
**2. If disk is failing, prepare for migration:**
```bash
# Mark node for draining
curl -X POST http://localhost:18180/v1/admin/node/drain
# Schedule maintenance window for disk replacement
```
**3. Temporarily reduce write rate:**
```bash
# Apply rate limit to reduce I/O pressure
curl -X POST http://localhost:18180/v1/admin/rate-limit \
-d '{"max_writes_per_sec": 500}'
```
### If Filesystem is Misconfigured
**1. Check mount options:**
```bash
mount | grep /var/lib/stemedb/wal
```
**Expected:** `data=ordered` or `data=writeback` (not `data=journal` which is slower)
**2. If using wrong mount options, remount:**
```bash
# Edit /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,noatime 0 2
# Remount (requires downtime)
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
mount /var/lib/stemedb/wal
systemctl start stemedb-api
```
### If Group Commit Not Optimal
**1. Tune group commit settings:**
Edit `/etc/stemedb/api.toml`:
```toml
[wal]
group_commit_max_wait_ms = 10 # Increase batching window
group_commit_max_bytes = 1048576 # 1MB batches
```
**2. Restart service:**
```bash
systemctl restart stemedb-api
```
**3. Monitor fsync frequency:**
```bash
# Fsync count should decrease with larger batches
curl -s http://localhost:18180/metrics | grep wal_fsync_total
```
### If Cloud Provider Throttling
**1. Check for IOPS throttling (AWS EBS example):**
```bash
# CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeQueueLength \
--dimensions Name=VolumeId,Value=vol-abc123 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
```
**2. Increase provisioned IOPS:**
```bash
# Modify EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-abc123 \
--iops 3000 --volume-type gp3
```
**3. Wait for optimization to complete:**
```bash
watch aws ec2 describe-volumes-modifications \
--volume-ids vol-abc123 \
--query 'VolumesModifications[0].ModificationState'
```
## Prevention
### Monitoring
**1. Alert on sustained high latency:**
```yaml
- alert: WALFsyncDegrading
expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.05
for: 15m
annotations:
summary: "WAL fsync p99 latency degrading (>50ms)"
```
**2. Monitor disk queue depth:**
```yaml
- alert: DiskQueueDepthHigh
expr: node_disk_io_weighted_seconds_total > 100
for: 10m
annotations:
summary: "Disk queue depth indicates congestion"
```
### Capacity Planning
**1. Use dedicated disk for WAL:**
- NVMe SSD with capacitor-backed cache
- Separate physical disk from KV store
- Provisioned IOPS (cloud deployments)
**2. Benchmark before production:**
```bash
# Test fsync performance under load
fio --name=fsync-test --rw=write --bs=4k --size=1G \
--fsync=1 --numjobs=4 --runtime=60 \
--filename=/var/lib/stemedb/wal/test.dat
```
Expected: p99 latency <10ms on NVMe, <50ms on SATA SSD.
**3. Right-size provisioned IOPS (cloud):**
```
IOPS needed = (writes_per_sec * 1.5) # 1.5x for overhead
Example:
- 1000 writes/sec → 1500 IOPS minimum
- Use 3000 IOPS for headroom (2x)
```
### Operational Best Practices
**1. Regular disk health checks:**
```bash
# Weekly SMART check
smartctl -a /dev/sda | grep -E "(PASSED|FAILED)"
# Alert on pending sectors
smartctl -a /dev/sda | awk '/Current_Pending_Sector/ {if($10>0) print "WARNING: Pending sectors detected"}'
```
**2. Monitor filesystem age:**
```bash
# Check filesystem age (ext4)
tune2fs -l /dev/sdb1 | grep "Filesystem created"
# Consider reformatting if >2 years old (fragmentation)
```
**3. Test I/O performance quarterly:**
```bash
# Benchmark and compare to baseline
fio --name=seq-write --rw=write --bs=1M --size=10G \
--filename=/var/lib/stemedb/wal/bench.dat \
--output-format=json > /tmp/fio-$(date +%Y%m%d).json
```
## Escalation
**Escalate if:**
- Fsync latency exceeds 200ms for >30 minutes
- Disk errors appear in logs (hardware failure)
- Tuning and optimization has no effect
- Cloud provider throttling cannot be resolved
**Escalation path:**
1. **Primary on-call:** Storage SRE
2. **Secondary:** Infrastructure engineer
3. **Final escalation:** Cloud vendor TAM (if cloud-related)
## References
- **Dashboard:** [StemeDB WAL Performance](http://grafana.example.com/d/stemedb-wal)
- **Related alerts:** `WALFsyncFailure`, `HighStorageErrorRate`, `DiskUtilizationHigh`
- **Metrics:**
- `stemedb_wal_fsync_duration_seconds` (latency distribution)
- `stemedb_wal_fsync_total` (fsync count)
- `node_disk_io_time_weighted_seconds_total` (disk queue time)
- **Runbooks:** `wal-fsync-failure.md`, `disk-full.md`