# Slow WAL Fsync

## Severity: WARNING

## Alert Rule

**Alert:** `WALFsyncSlow`
**Trigger:** WAL fsync p99 latency > 100ms
**Duration:** 10m

## Symptom

- Metrics show `stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1`
- API write latency increasing (p99 > 200ms)
- Logs may show "slow fsync" warnings
- Ingestion throughput degrading

## Impact

**User Impact:**
- Slower API responses for write operations
- Reduced ingestion throughput (assertions/sec)
- Client timeouts if latency exceeds configured limits

**System Impact:**
- Write pipeline backpressure
- Increased memory usage (buffered writes)
- Risk of WAL segment rotation delays

## Investigation Steps

### 1. Check Fsync Latency Metrics

```bash
# Current p50, p90, p99 latency
curl -s http://localhost:18180/metrics | grep wal_fsync_duration_seconds

# Expected output:
# stemedb_wal_fsync_duration_seconds{quantile="0.5"} 0.001
# stemedb_wal_fsync_duration_seconds{quantile="0.9"} 0.01
# stemedb_wal_fsync_duration_seconds{quantile="0.99"} 0.15   # ← HIGH
```

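To avoid eyeballing the numbers, the p99 sample can be compared against the alert threshold in a small helper. This is a minimal sketch: `check_fsync_p99` is a hypothetical function name, and it assumes the metric line carries only the `quantile` label, exactly as in the sample output above.

```bash
#!/usr/bin/env bash
# Sketch: read Prometheus text format on stdin and classify the WAL fsync p99.
# The default threshold of 0.1s mirrors the WALFsyncSlow trigger (100ms).
# check_fsync_p99 is a hypothetical helper, not part of StemeDB.
check_fsync_p99() {
  local threshold="${1:-0.1}"
  awk -v t="$threshold" '
    # Match the p99 sample by literal prefix to avoid regex escaping issues.
    index($0, "stemedb_wal_fsync_duration_seconds{quantile=\"0.99\"}") == 1 {
      found = 1
      if ($2 + 0 > t + 0) { print "SLOW p99=" $2; bad = 1 }
      else                { print "OK p99=" $2 }
    }
    END {
      if (!found) { print "UNKNOWN: p99 sample not found"; exit 2 }
      exit bad + 0   # exit status 1 when the threshold is exceeded
    }'
}
```

Typical use would be `curl -s http://localhost:18180/metrics | check_fsync_p99`, optionally passing a different threshold in seconds as the first argument.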
### 2. Check Disk I/O Utilization

```bash
# Disk stats
iostat -x 2 10

# Look for:
# - High %util on WAL partition (>80% sustained)
# - High await (>50ms indicates congestion)
```

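The two thresholds above can be applied mechanically by filtering the `iostat -x` output. The sketch below assumes the device table has a header row starting with `Device` and a `%util` column, plus either `w_await` (newer sysstat) or `await` (older sysstat). `flag_busy_disks` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: flag devices whose %util or write await exceed the runbook thresholds.
# Column positions are discovered from the header row because they differ
# between sysstat versions. flag_busy_disks is a hypothetical helper.
flag_busy_disks() {
  awk -v max_util="${1:-80}" -v max_await="${2:-50}" '
    NF == 0 { util = 0; next }            # blank line ends a device table
    $1 ~ /^Device/ {                      # header row: locate the columns
      util = 0; await = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "%util") util = i
        if ($i == "w_await" || $i == "await") await = i
      }
      next
    }
    util && NF >= util {                  # data row inside a device table
      if ($util + 0 > max_util + 0)
        print $1 ": %util " $util " exceeds " max_util
      if (await && $await + 0 > max_await + 0)
        print $1 ": await " $await "ms exceeds " max_await
    }'
}
```

For example, `iostat -x 2 10 | flag_busy_disks`. The first report from `iostat` covers the time since boot, so only repeated hits in later reports indicate sustained saturation.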
### 3. Check for Competing I/O

```bash
# Processes doing disk I/O
iotop -o -b -n 5

# Look for other processes writing to same disk
```

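### 4. Check Disk Write Cache

```bash
# Check whether the drive's write cache is enabled. An enabled cache lowers
# fsync latency, but it is only safe for durability if the cache is
# battery- or capacitor-backed or the filesystem issues cache flushes.
hdparm -W /dev/sda
# write-caching = 1 (on)
```
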
### 5. Test Raw Disk Performance

```bash
# Rough write-and-flush benchmark: 10,000 x 4 KiB blocks followed by a sync
cd /var/lib/stemedb/wal
time sh -c "dd if=/dev/zero of=test.dat bs=4k count=10000 && sync"
rm test.dat

# Expected: <5 seconds on SSD, <15 seconds on spinning disk
```

## Resolution

### If Disk I/O is Saturated

**1. Identify competing workload:**

```bash
# Top I/O consumers
iotop -o -b -n 1 | head -20
```

**2. Reduce competing I/O:**

```bash
# Pause non-critical I/O (backups, log compression, etc.)
systemctl stop backup.service
systemctl stop log-archiver.timer
```

**3. Monitor improvement:**

```bash
watch -n 5 'curl -s http://localhost:18180/metrics | grep wal_fsync_duration'
```

### If Disk is Slow (Hardware Issue)

**1. Check SMART status:**

```bash
smartctl -a /dev/sda | grep -E "(Seek_Error|Reallocated_Sector)"
```

**2. If disk is failing, prepare for migration:**

```bash
# Mark node for draining
curl -X POST http://localhost:18180/v1/admin/node/drain

# Schedule maintenance window for disk replacement
```

**3. Temporarily reduce write rate:**

```bash
# Apply rate limit to reduce I/O pressure
curl -X POST http://localhost:18180/v1/admin/rate-limit \
    -H 'Content-Type: application/json' \
    -d '{"max_writes_per_sec": 500}'
```

### If Filesystem is Misconfigured

**1. Check mount options:**

```bash
mount | grep /var/lib/stemedb/wal
```

**Expected:** `data=ordered` or `data=writeback` (not `data=journal`, which is slower)

**2. If using wrong mount options, remount:**

```bash
# Edit /etc/fstab so the WAL entry reads:
#   /dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,noatime 0 2

# Remount (requires downtime)
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
mount /var/lib/stemedb/wal
systemctl start stemedb-api
```

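The mount-option check can be scripted so that it reports the journaling mode directly. This is a sketch with a hypothetical helper name, `wal_journal_mode`, and it assumes the usual `mount` output layout (`DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)`). When no `data=` option is listed, the filesystem is using its default, which for ext4 is `ordered`, so the helper prints `default` in that case.

```bash
#!/usr/bin/env bash
# Sketch: print the data= journaling mode for a mount point, reading
# `mount` output on stdin. Prints "default" when no data= option is shown.
# wal_journal_mode is a hypothetical helper; the default path is the WAL
# directory used throughout this runbook.
wal_journal_mode() {
  local mountpoint="${1:-/var/lib/stemedb/wal}"
  awk -v mp="$mountpoint" '
    $2 == "on" && $3 == mp {
      mode = "default"
      opts = $NF                       # e.g. "(rw,noatime,data=ordered)"
      gsub(/[()]/, "", opts)
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++)
        if (o[i] ~ /^data=/) mode = substr(o[i], 6)
      print mode
      found = 1
    }
    END { if (!found) exit 1 }         # mount point not present in the input
  '
}
```

Usage would be `mount | wal_journal_mode`. A result of `journal` points to the slower mode described above.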
### If Group Commit Not Optimal

**1. Tune group commit settings:**

Edit `/etc/stemedb/api.toml`:

```toml
[wal]
group_commit_max_wait_ms = 10    # Increase batching window
group_commit_max_bytes = 1048576 # 1MB batches
```

**2. Restart service:**

```bash
systemctl restart stemedb-api
```

**3. Monitor fsync frequency:**

```bash
# wal_fsync_total is a counter, so compare two readings over time:
# it should grow more slowly with larger batches
curl -s http://localhost:18180/metrics | grep wal_fsync_total
```

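Because `stemedb_wal_fsync_total` only ever increases, the useful number is its rate of change. The sketch below turns two counter readings into fsyncs per second. `fsync_rate` is a hypothetical helper, and the commented sampling commands assume the metric is exposed without labels.

```bash
#!/usr/bin/env bash
# Sketch: fsyncs per second from two readings of a monotonically increasing
# counter. fsync_rate is a hypothetical helper, not part of StemeDB.
# usage: fsync_rate COUNT_BEFORE COUNT_AFTER INTERVAL_SECONDS
fsync_rate() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN {
    # A smaller second reading usually means the service restarted in between.
    if (dt <= 0 || b < a) { print "invalid samples"; exit 1 }
    printf "%.1f\n", (b - a) / dt
  }'
}

# Example sampling (assumes an unlabelled metric line):
#   read_total() { curl -s http://localhost:18180/metrics |
#                  awk '$1 == "stemedb_wal_fsync_total" { print $2 }'; }
#   before=$(read_total); sleep 60; after=$(read_total)
#   fsync_rate "$before" "$after" 60
```

Comparing the rate before and after the configuration change shows whether the larger batching window is actually reducing the number of fsync calls.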
### If Cloud Provider Throttling

**1. Check for IOPS throttling (AWS EBS example):**

```bash
# CloudWatch metrics
aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name VolumeQueueLength \
    --dimensions Name=VolumeId,Value=vol-abc123 \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 300 \
    --statistics Average
```

**2. Increase provisioned IOPS:**

```bash
# Modify EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-abc123 \
    --iops 3000 --volume-type gp3
```

**3. Wait for optimization to complete:**

```bash
watch aws ec2 describe-volumes-modifications \
    --volume-ids vol-abc123 \
    --query 'VolumesModifications[0].ModificationState'
```

## Prevention

### Monitoring

**1. Alert on sustained high latency:**

```yaml
- alert: WALFsyncDegrading
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.05
  for: 15m
  annotations:
    summary: "WAL fsync p99 latency degrading (>50ms)"
```

**2. Monitor disk queue depth:**

```yaml
- alert: DiskQueueDepthHigh
  # The metric is a counter; its rate approximates the average queue depth
  expr: rate(node_disk_io_time_weighted_seconds_total[5m]) > 100
  for: 10m
  annotations:
    summary: "Disk queue depth indicates congestion"
```

### Capacity Planning

**1. Use dedicated disk for WAL:**

- NVMe SSD with capacitor-backed cache
- Separate physical disk from KV store
- Provisioned IOPS (cloud deployments)

**2. Benchmark before production:**

```bash
# Test fsync performance under load
fio --name=fsync-test --rw=write --bs=4k --size=1G \
    --fsync=1 --numjobs=4 --runtime=60 \
    --filename=/var/lib/stemedb/wal/test.dat
```

Expected: p99 latency <10ms on NVMe, <50ms on SATA SSD.

**3. Right-size provisioned IOPS (cloud):**

```
IOPS needed = (writes_per_sec * 1.5)  # 1.5x for overhead

Example:
- 1000 writes/sec → 1500 IOPS minimum
- Use 3000 IOPS for headroom (2x)
```

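The sizing rule above is simple enough to wrap in a helper for planning different write rates. This is a sketch only: `required_iops` is a hypothetical name, the 1.5x overhead factor and 2x default headroom are taken from the formula and example above, and results are rounded up to whole IOPS.

```bash
#!/usr/bin/env bash
# Sketch: apply "IOPS needed = writes_per_sec * 1.5" plus a headroom factor.
# required_iops is a hypothetical helper, not part of StemeDB.
# usage: required_iops WRITES_PER_SEC [HEADROOM_FACTOR]
required_iops() {
  awk -v w="$1" -v h="${2:-2}" '
    function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
    BEGIN {
      min = w * 1.5                    # 1.5x for overhead
      printf "minimum=%d recommended=%d\n", ceil(min), ceil(min * h)
    }'
}

required_iops 1000   # prints: minimum=1500 recommended=3000
```

Passing a second argument changes the headroom factor, for example `required_iops 2500 3` for a deployment that expects bursty ingestion.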
### Operational Best Practices

**1. Regular disk health checks:**

```bash
# Weekly SMART check
smartctl -a /dev/sda | grep -E "(PASSED|FAILED)"

# Alert on pending sectors
smartctl -a /dev/sda | awk '/Current_Pending_Sector/ {if($10>0) print "WARNING: Pending sectors detected"}'
```

**2. Monitor filesystem age:**

```bash
# Check filesystem age (ext4)
tune2fs -l /dev/sdb1 | grep "Filesystem created"

# Consider reformatting if >2 years old (fragmentation)
```

**3. Test I/O performance quarterly:**

```bash
# Benchmark and compare to baseline
fio --name=seq-write --rw=write --bs=1M --size=10G \
    --filename=/var/lib/stemedb/wal/bench.dat \
    --output-format=json > /tmp/fio-$(date +%Y%m%d).json
```

## Escalation

**Escalate if:**

- Fsync latency exceeds 200ms for >30 minutes
- Disk errors appear in logs (hardware failure)
- Tuning and optimization have no effect
- Cloud provider throttling cannot be resolved

**Escalation path:**

1. **Primary on-call:** Storage SRE
2. **Secondary:** Infrastructure engineer
3. **Final escalation:** Cloud vendor TAM (if cloud-related)

## References

- **Dashboard:** [StemeDB WAL Performance](http://grafana.example.com/d/stemedb-wal)
- **Related alerts:** `WALFsyncFailure`, `HighStorageErrorRate`, `DiskUtilizationHigh`
- **Metrics:**
  - `stemedb_wal_fsync_duration_seconds` (latency distribution)
  - `stemedb_wal_fsync_total` (fsync count)
  - `node_disk_io_time_weighted_seconds_total` (disk queue time)
- **Runbooks:** `wal-fsync-failure.md`, `disk-full.md`