Slow WAL Fsync
Severity: WARNING
Alert Rule
Alert: WALFsyncSlow
Trigger: WAL fsync p99 latency > 100ms
Duration: 10m
Symptom
- Metrics show stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
- API write latency increasing (p99 > 200ms)
- Logs may show "slow fsync" warnings
- Ingestion throughput degrading
Impact
User Impact:
- Slower API responses for write operations
- Reduced ingestion throughput (assertions/sec)
- Client timeouts if latency exceeds configured limits
System Impact:
- Write pipeline backpressure
- Increased memory usage (buffered writes)
- Risk of WAL segment rotation delays
Investigation Steps
1. Check Fsync Latency Metrics
# Current p50, p90, p99 latency
curl -s http://localhost:18180/metrics | grep wal_fsync_duration_seconds
# Expected output:
# stemedb_wal_fsync_duration_seconds{quantile="0.5"} 0.001
# stemedb_wal_fsync_duration_seconds{quantile="0.9"} 0.01
# stemedb_wal_fsync_duration_seconds{quantile="0.99"} 0.15 # ← HIGH
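For scripted checks, the comparison against the alert threshold can be automated. The sketch below assumes the metric is exposed as a Prometheus summary with a quantile label, as in the expected output above; the function name check_fsync_p99 is hypothetical and not part of StemeDB.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# compares the WAL fsync p99 against a threshold given in seconds.
# Exit status: 0 = OK, 1 = slow, 2 = p99 series not found.
check_fsync_p99() {
    threshold="${1:-0.1}"   # 0.1 s matches the 100ms WALFsyncSlow trigger
    awk -v limit="$threshold" '
        # Match only the p99 series; assumes no spaces inside the label set,
        # so the sample value is the second field.
        /^stemedb_wal_fsync_duration_seconds[{].*quantile="0[.]99"/ {
            p99 = $2; found = 1
        }
        END {
            if (!found) { print "UNKNOWN: p99 series not found"; exit 2 }
            if (p99 + 0 > limit + 0) {
                printf "SLOW: p99=%ss (limit %ss)\n", p99, limit
                exit 1
            }
            printf "OK: p99=%ss (limit %ss)\n", p99, limit
        }'
}

# Usage on a live node:
#   curl -s http://localhost:18180/metrics | check_fsync_p99 0.1
```

The non-zero exit status makes the helper usable in cron jobs or health-check scripts.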
2. Check Disk I/O Utilization
# Disk stats
iostat -x 2 10
# Look for:
# - High %util on WAL partition (>80% sustained)
# - High await (>50ms indicates congestion)
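Reading ten iostat reports by eye is error-prone, so the %util check can be filtered with awk. This is a minimal sketch: flag_busy_disks is a hypothetical name, and the column is located by its header because column order differs between sysstat versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `iostat -x` output on stdin and prints every
# device line whose %util exceeds the given limit (default 80).
flag_busy_disks() {
    util_limit="${1:-80}"
    awk -v limit="$util_limit" '
        # A blank line ends a device table; forget the column until the next header.
        NF == 0 { util_col = 0; next }
        # Header row: find which column holds %util.
        $1 ~ /^Device/ {
            util_col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util_col = i
            next
        }
        # Device rows: report those above the limit.
        util_col && NF >= util_col && $util_col + 0 > limit + 0 {
            printf "%s %%util=%s\n", $1, $util_col
        }'
}

# Usage on a live node:
#   iostat -x 2 10 | flag_busy_disks 80
```

The first iostat report shows averages since boot, so a device that appears only in that first report is not necessarily saturated right now.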
3. Check for Competing I/O
# Processes doing disk I/O
iotop -o -b -n 5
# Look for other processes writing to same disk
4. Check Disk Write Cache
# Check drive write cache state (lowers fsync latency; durable only if the cache has power-loss protection)
hdparm -W /dev/sda
# write-caching = 1 (on)
5. Test Raw Disk Performance
# Rough write-and-flush benchmark (about 40 MB in 4 KiB blocks, fsync before dd exits)
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=4k count=10000 conv=fsync
rm test.dat
# Expected: <5 seconds on SSD, <15 seconds on spinning disk
Resolution
If Disk I/O is Saturated
1. Identify competing workload:
# Top I/O consumers
iotop -o -b -n 1 | head -20
2. Reduce competing I/O:
# Pause non-critical I/O (backups, log compression, etc.)
systemctl stop backup.service
systemctl stop log-archiver.timer
3. Monitor improvement:
watch -n 5 'curl -s http://localhost:18180/metrics | grep wal_fsync_duration'
If Disk is Slow (Hardware Issue)
1. Check SMART status:
smartctl -a /dev/sda | grep -E "(Seek_Error|Reallocated_Sector)"
2. If disk is failing, prepare for migration:
# Mark node for draining
curl -X POST http://localhost:18180/v1/admin/node/drain
# Schedule maintenance window for disk replacement
3. Temporarily reduce write rate:
# Apply rate limit to reduce I/O pressure
curl -X POST http://localhost:18180/v1/admin/rate-limit \
-H 'Content-Type: application/json' \
-d '{"max_writes_per_sec": 500}'
If Filesystem is Misconfigured
1. Check mount options:
mount | grep /var/lib/stemedb/wal
Expected: data=ordered or data=writeback (not data=journal which is slower)
2. If using wrong mount options, remount:
# Edit /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,noatime 0 2
# Remount (requires downtime)
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
mount /var/lib/stemedb/wal
systemctl start stemedb-api
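The mount-option check in step 1 can be scripted so it does not depend on reading mount output by eye. A minimal sketch, assuming the options are passed in as a comma-separated string; journal_mode_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: inspects a comma-separated mount option string and
# fails when the slower data=journal mode is in use.
journal_mode_ok() {
    # Wrap the string in commas so each option can be matched as a whole token.
    case ",$1," in
        *,data=journal,*)
            echo "SLOW: data=journal in use"
            return 1
            ;;
        *)
            echo "OK: data=journal not set"
            return 0
            ;;
    esac
}

# Usage on a live node (findmnt ships with util-linux):
#   journal_mode_ok "$(findmnt -n -o OPTIONS -T /var/lib/stemedb/wal)"
```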
If Group Commit Not Optimal
1. Tune group commit settings:
Edit /etc/stemedb/api.toml:
[wal]
group_commit_max_wait_ms = 10 # Increase batching window
group_commit_max_bytes = 1048576 # 1MB batches
2. Restart service:
systemctl restart stemedb-api
3. Monitor fsync frequency:
# wal_fsync_total is a cumulative counter: its rate of increase (fsyncs/sec) should drop with larger batches
curl -s http://localhost:18180/metrics | grep wal_fsync_total
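Because the fsync counter only ever grows, the useful number is the per-second rate between two samples. The sketch below shows one way to compute it; fsync_total and fsync_rate are hypothetical helper names, and the parsing assumes the counter has no spaces inside its label set.

```shell
#!/bin/sh
# Hypothetical helper: extracts the first stemedb_wal_fsync_total sample
# value from Prometheus text-format metrics on stdin.
fsync_total() {
    awk '$1 ~ /^stemedb_wal_fsync_total/ { print $2; exit }'
}

# Hypothetical helper: fsyncs per second between two counter samples taken
# the given number of seconds apart.
fsync_rate() {
    awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.1f\n", (b - a) / dt }'
}

# Usage on a live node:
#   before=$(curl -s http://localhost:18180/metrics | fsync_total)
#   sleep 60
#   after=$(curl -s http://localhost:18180/metrics | fsync_total)
#   fsync_rate "$before" "$after" 60
```

Comparing the rate before and after the config change, at a similar write load, shows whether the larger batching window is taking effect.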
If Cloud Provider Throttling
1. Check for IOPS throttling (AWS EBS example):
# CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeQueueLength \
--dimensions Name=VolumeId,Value=vol-abc123 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
2. Increase provisioned IOPS:
# Modify EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-abc123 \
--iops 3000 --volume-type gp3
3. Wait for optimization to complete:
watch aws ec2 describe-volumes-modifications \
--volume-ids vol-abc123 \
--query 'VolumesModifications[0].ModificationState'
Prevention
Monitoring
1. Alert on sustained high latency:
- alert: WALFsyncDegrading
expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.05
for: 15m
annotations:
summary: "WAL fsync p99 latency degrading (>50ms)"
2. Monitor disk queue depth:
- alert: DiskQueueDepthHigh
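expr: rate(node_disk_io_time_weighted_seconds_total[5m]) > 100 # counter, so alert on its rate (approximate average queue depth); tune the threshold per device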
for: 10m
annotations:
summary: "Disk queue depth indicates congestion"
Capacity Planning
1. Use dedicated disk for WAL:
- NVMe SSD with capacitor-backed cache
- Separate physical disk from KV store
- Provisioned IOPS (cloud deployments)
2. Benchmark before production:
# Test fsync performance under load
fio --name=fsync-test --rw=write --bs=4k --size=1G \
--fsync=1 --numjobs=4 --runtime=60 \
--filename=/var/lib/stemedb/wal/test.dat
Expected: p99 latency <10ms on NVMe, <50ms on SATA SSD.
3. Right-size provisioned IOPS (cloud):
IOPS needed = (writes_per_sec * 1.5) # 1.5x for overhead
Example:
- 1000 writes/sec → 1500 IOPS minimum
- Use 3000 IOPS for headroom (2x)
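The sizing rule above is simple arithmetic and can be scripted for different write rates. A minimal sketch: required_iops is a hypothetical helper, with the 1.5x overhead and 2x headroom factors taken from the example as defaults.

```shell
#!/bin/sh
# Hypothetical helper: applies the sizing rule
#   minimum     = writes_per_sec * overhead   (default overhead 1.5)
#   provisioned = minimum * headroom          (default headroom 2)
# and rounds both results up to whole IOPS.
required_iops() {
    awk -v w="$1" -v o="${2:-1.5}" -v h="${3:-2}" '
        function ceil(x) { return (x > int(x)) ? int(x) + 1 : int(x) }
        BEGIN {
            min = ceil(w * o)
            printf "minimum=%d provisioned=%d\n", min, ceil(min * h)
        }'
}

# Usage:
#   required_iops 1000          # default factors
#   required_iops 2500 1.5 2    # explicit overhead and headroom
```

For 1000 writes/sec with the default factors this prints minimum=1500 provisioned=3000, matching the worked example.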
Operational Best Practices
1. Regular disk health checks:
# Weekly SMART check
smartctl -a /dev/sda | grep -E "(PASSED|FAILED)"
# Alert on pending sectors
smartctl -a /dev/sda | awk '/Current_Pending_Sector/ {if($10>0) print "WARNING: Pending sectors detected"}'
2. Monitor filesystem age:
# Check filesystem age (ext4)
tune2fs -l /dev/sdb1 | grep "Filesystem created"
# Consider reformatting if >2 years old (fragmentation)
3. Test I/O performance quarterly:
# Benchmark and compare to baseline
fio --name=seq-write --rw=write --bs=1M --size=10G \
--filename=/var/lib/stemedb/wal/bench.dat \
--output-format=json > /tmp/fio-$(date +%Y%m%d).json
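Comparing a quarterly result against the baseline comes down to a percentage change. The sketch below computes it from two bandwidth figures; bw_drop_pct is a hypothetical helper, and the commented jq usage assumes jq is installed, that the fio JSON exposes write bandwidth at .jobs[0].write.bw, and that a baseline was saved at the illustrative path /tmp/fio-baseline.json.

```shell
#!/bin/sh
# Hypothetical helper: percentage drop of a current measurement relative to
# a baseline. Positive output means the current run is slower.
bw_drop_pct() {
    awk -v base="$1" -v cur="$2" 'BEGIN {
        if (base + 0 <= 0) { print "invalid baseline"; exit 2 }
        printf "%.1f\n", (base - cur) * 100 / base
    }'
}

# Usage with fio JSON results (field path and baseline file are assumptions):
#   base=$(jq '.jobs[0].write.bw' /tmp/fio-baseline.json)
#   cur=$(jq '.jobs[0].write.bw' "/tmp/fio-$(date +%Y%m%d).json")
#   bw_drop_pct "$base" "$cur"
```

What counts as a meaningful regression depends on the hardware, so pick the investigation threshold from your own baseline variance rather than a fixed number.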
Escalation
Escalate if:
- Fsync latency exceeds 200ms for >30 minutes
- Disk errors appear in logs (hardware failure)
- Tuning and optimization have no effect
- Cloud provider throttling cannot be resolved
Escalation path:
- Primary on-call: Storage SRE
- Secondary: Infrastructure engineer
- Final escalation: Cloud vendor TAM (if cloud-related)
References
- Dashboard: StemeDB WAL Performance
- Related alerts: WALFsyncFailure, HighStorageErrorRate, DiskUtilizationHigh
- Metrics:
  - stemedb_wal_fsync_duration_seconds (latency distribution)
  - stemedb_wal_fsync_total (fsync count)
  - node_disk_io_time_weighted_seconds_total (disk queue time)
- Runbooks: wal-fsync-failure.md, disk-full.md