# WAL Fsync Failure

## Severity: CRITICAL

## Alert Rule

**Alert:** `WALFsyncFailure`
**Trigger:** WAL fsync operations failing (error rate > 0)
**Duration:** 1m

## Symptom

- Metrics show `stemedb_wal_fsync_errors_total` increasing
- Logs contain "fsync failed" or "WAL write error"
- Write operations return 500 errors
- API logs show: `Error: Failed to fsync WAL segment`

## Impact

**User Impact:**
- All writes fail immediately (assertions, votes, epochs)
- API returns HTTP 500 on POST/PUT operations
- Data loss risk if errors persist (WAL not durable)

**System Impact:**
- Write pipeline completely blocked
- Risk of WAL corruption if partial writes occurred
- Potential need for WAL rebuild from replicas

## Investigation Steps

### 1. Check Fsync Error Count

```bash
curl -s http://localhost:18180/metrics | grep wal_fsync_errors
# stemedb_wal_fsync_errors_total{segment="segment.001.wal"} 15
```

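When the counter is split across several segments, a single total is easier to compare between two samples. The helper below is a minimal sketch, not part of StemeDB: the function name `sum_fsync_errors` is hypothetical, and it assumes the metrics endpoint emits plain Prometheus text format without trailing timestamps.

```bash
# Sum stemedb_wal_fsync_errors_total across all label sets.
# Reads Prometheus text-format metrics on stdin and prints the total.
# Assumption: sample lines end with the value (no trailing timestamp).
sum_fsync_errors() {
  awk '
    # Match the exact metric name, followed by "{" (labels) or a space.
    /^stemedb_wal_fsync_errors_total[{ ]/ { total += $NF }
    END { printf "%d\n", total }
  '
}
```

Usage: run `curl -s http://localhost:18180/metrics | sum_fsync_errors` twice, about a minute apart. A rising total means fsync is still failing; a flat total suggests the errors have stopped.
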
### 2. Check Disk Status

```bash
# I/O errors
dmesg | grep -i "i/o error" | tail -20

# Filesystem errors
journalctl --dmesg | grep -i "ext4.*error"

# SMART status (replace /dev/sda with the device backing the WAL)
smartctl -a /dev/sda
```

### 3. Check WAL Partition Health

```bash
# Disk space
df -h /var/lib/stemedb/wal

# Mount options (must include sync or data=ordered)
mount | grep /var/lib/stemedb

# Test write + fsync (conv=fsync makes dd fsync the file before exiting)
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=4k count=1000 conv=fsync
rm test.dat
```

### 4. Check for Read-Only Filesystem

```bash
# Attempt a write (the probe file is removed again on success)
touch /var/lib/stemedb/wal/test.file && rm /var/lib/stemedb/wal/test.file
# If this fails with "Read-only file system", a remount is needed
```

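The read-only state can also be checked without writing to the WAL directory, by reading the mount table. This is a sketch under assumptions: the function name `mount_is_readonly` is hypothetical, and it expects input in `/proc/mounts` format, with the WAL directory as its own mount point.

```bash
# Report whether a mount point is mounted read-only.
# Reads /proc/mounts-format text on stdin.
# Exit status: 0 = read-only, 1 = read-write, 2 = mount point not found.
mount_is_readonly() {
  local mount_point=$1
  awk -v mp="$mount_point" '
    $2 == mp { opts = $4 }            # last matching entry wins (stacked mounts)
    END {
      if (opts == "") exit 2          # mount point not listed
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
      exit 1
    }
  '
}
```

Usage: `mount_is_readonly /var/lib/stemedb/wal < /proc/mounts && echo "WAL is read-only"`. Exit status 2 means the path is not a separate mount, in which case check the parent filesystem instead.
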
## Resolution

### If Filesystem is Read-Only

**1. Remount as read-write:**

```bash
mount -o remount,rw /var/lib/stemedb/wal
```

**2. Check for underlying errors:**

```bash
dmesg | tail -50
```

**3. If errors persist, run filesystem check:**

```bash
systemctl stop stemedb-api
umount /var/lib/stemedb/wal
fsck -y /dev/sdb1  # Replace with actual device
mount /var/lib/stemedb/wal
systemctl start stemedb-api
```

### If Disk is Failing

**1. Verify hardware status:**

```bash
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable)"
```

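The grep above prints the attribute rows, but the raw counts still have to be read by eye. The sketch below flags only attributes whose raw value is non-zero. The function name `smart_bad_sectors` is hypothetical, and it assumes the ATA-style attribute table printed by `smartctl -A`, where the raw value is the tenth column. NVMe devices report health in a different format, which this does not parse.

```bash
# Print sector-health SMART attributes whose raw value is non-zero.
# Reads the `smartctl -A` attribute table on stdin.
# Exit status: 1 if any flagged attribute is non-zero, 0 otherwise.
smart_bad_sectors() {
  awk '
    # Column 2 is the attribute name, column 10 the raw value.
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
      printf "%s raw=%s\n", $2, $10
      bad = 1
    }
    END { exit bad + 0 }
  '
}
```

Usage: `smartctl -A /dev/sda | smart_bad_sectors || echo "bad sectors detected, consider failover"`.
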
**2. If bad sectors detected, initiate failover:**

```bash
# Mark node as unhealthy
curl -X POST http://localhost:18180/v1/admin/node/drain

# Failover to replica
# See: docs/operations/runbooks/failover-to-replica.md
```

### If WAL Segment is Corrupted

**1. Identify corrupted segment:**

```bash
journalctl -u stemedb-api | grep "WAL.*corrupt" | tail -10
```

**2. Attempt recovery:**

```bash
systemctl stop stemedb-api

# Backup corrupted segment
mv /var/lib/stemedb/wal/segment.001.wal \
   /var/lib/stemedb/wal/segment.001.wal.corrupted

# Truncate at last known good position (if identified in logs)
stemedb-wal-repair \
  --segment /var/lib/stemedb/wal/segment.001.wal.corrupted \
  --output /var/lib/stemedb/wal/segment.001.wal \
  --truncate-at <byte-offset>

systemctl start stemedb-api
```

**3. If repair fails, restore from replica:**

See `docs/operations/runbooks/restore-from-backup.md`.

### If No Hardware/FS Issues Found

**1. Check for kernel/driver bugs:**

```bash
# Kernel version
uname -r

# Recent kernel package installs/upgrades (Debian/Ubuntu)
grep -E " (install|upgrade) linux-image" /var/log/dpkg.log | tail -10
```

**2. Enable WAL fsync debug logging:**

Edit `/etc/stemedb/api.toml`:

```toml
[wal]
log_fsync_errors = true
```

Restart:

```bash
systemctl restart stemedb-api
```

**3. Collect diagnostic data:**

```bash
# Trace fsync/fdatasync on all threads for 30 seconds, then detach
timeout 30 strace -f -p "$(pgrep -o stemedb-api)" -e trace=fsync,fdatasync -o /tmp/fsync-trace.txt
grep -i error /tmp/fsync-trace.txt
```

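The errno attached to each failed call usually narrows the cause faster than reading the raw trace: `EIO` points at the device or filesystem, `ENOSPC` at capacity, `EROFS` at a read-only remount. The helper below is a rough sketch that tallies failures by errno. The function name `fsync_failures_by_errno` is hypothetical, and it assumes the standard strace line format, in which failed calls end with `= -1 ERRNO (description)`.

```bash
# Count failed fsync/fdatasync calls in strace output, grouped by errno.
# Reads a strace log on stdin; prints "ERRNO count" lines sorted by name.
fsync_failures_by_errno() {
  awk '
    /(fsync|fdatasync)/ && / = -1 E[A-Z]+/ {
      # Locate the "= -1" pair; the following field is the errno name.
      for (i = 1; i < NF; i++)
        if ($i == "=" && $(i + 1) == "-1") { count[$(i + 2)]++; break }
    }
    END { for (e in count) printf "%s %d\n", e, count[e] }
  ' | sort
}
```

Usage: `fsync_failures_by_errno < /tmp/fsync-trace.txt`. Empty output means no fsync or fdatasync call failed during the capture window.
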
## Prevention

### Monitoring

**1. Alert on fsync latency degradation:**

```yaml
- alert: WALFsyncSlow
  expr: stemedb_wal_fsync_duration_seconds{quantile="0.99"} > 0.1
  for: 5m
  annotations:
    summary: "WAL fsync latency degrading (p99 > 100ms)"
```

**2. Monitor disk health:**

```bash
# Daily SMART check (a crontab entry must stay on a single line)
0 2 * * * smartctl -a /dev/sda | grep -q "FAILING_NOW" && curl -X POST http://alertmanager/api/v1/alerts -d @disk-alert.json
```

### Capacity Planning

**1. Use enterprise-grade SSDs with power-loss protection:**

- NVMe with capacitor-backed write cache
- Avoid consumer SSDs in production

**2. Configure filesystem for durability:**

```bash
# /etc/fstab
/dev/sdb1 /var/lib/stemedb/wal ext4 data=ordered,barrier=1 0 2
```

### Operational Best Practices

**1. Regular WAL health checks:**

```bash
# Weekly verification
cd /var/lib/stemedb/wal
for segment in segment.*.wal; do
  stemedb-wal-verify --file "$segment" || echo "ERROR: $segment corrupted"
done
```

**2. Automate disk replacement:**

Set up alerts to trigger replacement before failure.

## Escalation

**Escalate immediately if:**

- Fsync errors continue after remount
- Disk SMART status shows imminent failure
- WAL corruption cannot be repaired
- Multiple nodes are affected (infrastructure issue)

**Escalation path:**

1. **Primary on-call:** Storage SRE
2. **Secondary:** Kernel/systems engineer
3. **Final escalation:** VP Engineering (if data loss is imminent)

## References

- **Dashboard:** [StemeDB WAL Health](http://grafana.example.com/d/stemedb-wal)
- **Related alerts:** `WALDiskNearlyFull`, `WALFsyncSlow`, `HighStorageErrorRate`
- **Metrics:**
  - `stemedb_wal_fsync_errors_total`
  - `stemedb_wal_fsync_duration_seconds`
  - `stemedb_wal_segment_rotations_total`
- **Runbooks:** `disk-full.md`, `storage-errors.md`, `failover-to-replica.md`