This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
354 lines
7.8 KiB
Markdown
354 lines
7.8 KiB
Markdown
# High Storage Error Rate
|
|
|
|
## Severity: CRITICAL
|
|
|
|
## Alert Rule
|
|
|
|
**Alert:** `HighStorageErrorRate`
|
|
**Trigger:** Storage operation errors > 1% of total operations
|
|
**Duration:** 5m
|
|
|
|
## Symptom
|
|
|
|
- API returns 500 Internal Server Error on write operations
|
|
- Metrics show `stemedb_storage_operation_errors_total` increasing
|
|
- Logs contain `StorageError` or failed `put/get` operations
|
|
- Specific error patterns:
|
|
- "Failed to write to KV store"
|
|
- "LSM tree compaction failed"
|
|
- "Index update failed"
|
|
|
|
## Impact
|
|
|
|
**User Impact:**
|
|
- Assertion writes fail silently or return errors
|
|
- Query results may be incomplete (missing recent data)
|
|
- Votes and supersessions not persisted
|
|
|
|
**System Impact:**
|
|
- Data loss if errors persist (WAL entries not indexed)
|
|
- Index corruption possible (partial writes)
|
|
- Performance degradation (retry storms)
|
|
|
|
## Investigation Steps
|
|
|
|
### 1. Check Error Metrics
|
|
|
|
```bash
|
|
# Get error rate by operation type
|
|
curl -s http://localhost:18180/metrics | grep storage_operation_errors
|
|
|
|
# Expected output showing errors by operation:
|
|
# stemedb_storage_operation_errors_total{operation="put"} 42
|
|
# stemedb_storage_operation_errors_total{operation="get"} 5
|
|
```
|
|
|
|
### 2. Identify Error Pattern in Logs
|
|
|
|
```bash
|
|
# Recent storage errors
|
|
journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | tail -50
|
|
```
|
|
|
|
**Common error patterns:**
|
|
|
|
**A. Disk I/O errors:**
|
|
```
|
|
Error: Custom { kind: Other, error: "IO error: No space left on device" }
|
|
Error: Custom { kind: Other, error: "Input/output error" }
|
|
```
|
|
|
|
**B. LSM tree corruption:**
|
|
```
|
|
Error: Corruption: block checksum mismatch
|
|
Error: Corruption: invalid SST file header
|
|
```
|
|
|
|
**C. Lock contention:**
|
|
```
|
|
Error: Failed to acquire write lock within timeout
|
|
Error: Deadlock detected in KV store
|
|
```
|
|
|
|
### 3. Check Disk Health
|
|
|
|
```bash
|
|
# Disk space
|
|
df -h /var/lib/stemedb
|
|
|
|
# I/O errors (check dmesg for hardware failures)
|
|
dmesg | grep -i "i/o error" | tail -20
|
|
|
|
# SMART status (if available)
|
|
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector)"
|
|
```
|
|
|
|
### 4. Check LSM Tree Health
|
|
|
|
```bash
|
|
# SSH to server, check LSM stats
|
|
cd /var/lib/stemedb/kv
|
|
du -sh ./*
|
|
|
|
# Check for large number of files (compaction falling behind)
|
|
ls -1 | wc -l
|
|
```
|
|
|
|
Expected: <100 SST files. If >500, compaction is failing.
|
|
|
|
### 5. Check for Lock Contention
|
|
|
|
```bash
|
|
# Look for lock timeout messages
|
|
journalctl -u stemedb-api | grep -i "lock.*timeout" | tail -20
|
|
|
|
# Check write throughput (should be consistent)
|
|
curl -s http://localhost:18180/metrics | grep stemedb_storage_put_duration
|
|
```
|
|
|
|
## Resolution
|
|
|
|
### If Disk Space Exhausted
|
|
|
|
**1. Free up space immediately:**
|
|
|
|
```bash
|
|
# Compress old WAL segments
|
|
cd /var/lib/stemedb/wal
|
|
gzip $(ls -t segment.*.wal | tail -n +20)
|
|
|
|
# Or move to backup
|
|
mkdir -p /backup/wal-$(date +%Y%m%d)
|
|
mv segment.00[0-5]*.wal /backup/wal-$(date +%Y%m%d)/
|
|
```
|
|
|
|
**2. Trigger manual LSM compaction:**
|
|
|
|
```bash
|
|
curl -X POST http://localhost:18180/v1/admin/storage/compact \
|
|
-H 'Content-Type: application/json' \
|
|
-d '{"force": true}'
|
|
```
|
|
|
|
**3. Monitor compaction progress:**
|
|
|
|
```bash
|
|
journalctl -u stemedb-api -f | grep compaction
|
|
```
|
|
|
|
### If Disk Hardware Failure Suspected
|
|
|
|
**1. Verify I/O errors:**
|
|
|
|
```bash
|
|
dmesg | grep -i "sd[a-z].*error"
|
|
```
|
|
|
|
**2. Run filesystem check (requires downtime):**
|
|
|
|
```bash
|
|
systemctl stop stemedb-api
|
|
umount /var/lib/stemedb
|
|
fsck -y /dev/sdb1 # Replace with actual device
|
|
mount /var/lib/stemedb
|
|
systemctl start stemedb-api
|
|
```
|
|
|
|
**3. If hardware is failing, initiate failover:**
|
|
|
|
See `docs/operations/runbooks/failover-to-replica.md`.
|
|
|
|
### If LSM Tree Corruption Detected
|
|
|
|
**1. Attempt recovery from WAL:**
|
|
|
|
```bash
|
|
systemctl stop stemedb-api
|
|
|
|
# Backup corrupted KV store
|
|
mv /var/lib/stemedb/kv /var/lib/stemedb/kv.corrupted.$(date +%Y%m%d)
|
|
|
|
# Rebuild from WAL
|
|
stemedb-api --rebuild-from-wal \
|
|
--wal-path /var/lib/stemedb/wal \
|
|
--kv-path /var/lib/stemedb/kv
|
|
|
|
systemctl start stemedb-api
|
|
```
|
|
|
|
**2. Verify rebuild succeeded:**
|
|
|
|
```bash
|
|
journalctl -u stemedb-api | grep -i "rebuild complete"
|
|
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
|
|
```
|
|
|
|
**3. If rebuild fails, restore from backup:**
|
|
|
|
See `docs/operations/runbooks/restore-from-backup.md`.
|
|
|
|
### If Lock Contention Detected
|
|
|
|
**1. Check for long-running transactions:**
|
|
|
|
```bash
|
|
# Look for slow queries
|
|
curl -s http://localhost:18180/v1/admin/slow-queries | jq
|
|
```
|
|
|
|
**2. Increase lock timeout temporarily:**
|
|
|
|
```bash
|
|
# Restart with increased timeout
|
|
systemctl stop stemedb-api
|
|
|
|
# Edit /etc/stemedb/api.toml:
|
|
# [storage]
|
|
# lock_timeout_ms = 10000 # Increase from default 5000
|
|
|
|
systemctl start stemedb-api
|
|
```
|
|
|
|
**3. Monitor lock acquisition time:**
|
|
|
|
```bash
|
|
curl -s http://localhost:18180/metrics | grep lock_wait_duration
|
|
```
|
|
|
|
### If Errors Persist Despite Above Steps
|
|
|
|
**1. Enable debug logging:**
|
|
|
|
Edit `/etc/stemedb/api.toml`:
|
|
|
|
```toml
|
|
[logging]
|
|
level = "debug"
|
|
```
|
|
|
|
Restart:
|
|
|
|
```bash
|
|
systemctl restart stemedb-api
|
|
```
|
|
|
|
**2. Capture detailed error trace:**
|
|
|
|
```bash
|
|
journalctl -u stemedb-api -f --output=json | jq 'select(.level=="ERROR")'
|
|
```
|
|
|
|
**3. Escalate with logs:**
|
|
|
|
Collect logs and metrics for engineering team.
|
|
|
|
## Prevention
|
|
|
|
### Monitoring and Alerting
|
|
|
|
**1. Set up disk space warning alerts:**
|
|
|
|
```yaml
|
|
# Prometheus alert
|
|
- alert: DiskSpaceWarning
|
|
expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"} /
|
|
node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"}) < 0.2
|
|
for: 10m
|
|
annotations:
|
|
summary: "Disk space below 20% on StemeDB partition"
|
|
```
|
|
|
|
**2. Monitor LSM compaction lag:**
|
|
|
|
```yaml
|
|
- alert: LSMCompactionLag
|
|
expr: stemedb_lsm_pending_compaction_bytes > 10e9 # 10GB
|
|
for: 15m
|
|
annotations:
|
|
summary: "LSM tree compaction falling behind"
|
|
```
|
|
|
|
**3. Alert on I/O errors:**
|
|
|
|
```yaml
|
|
- alert: DiskIOErrors
|
|
expr: rate(node_disk_io_errors_total[5m]) > 0.1
|
|
annotations:
|
|
summary: "Disk I/O errors detected on StemeDB node"
|
|
```
|
|
|
|
### Capacity Planning
|
|
|
|
**1. Set up automated disk cleanup:**
|
|
|
|
```bash
|
|
# Cron job to archive old WAL segments
|
|
# /etc/cron.daily/stemedb-cleanup
|
|
|
|
#!/bin/bash
|
|
cd /var/lib/stemedb/wal
|
|
# Keep 30 days of WAL
|
|
find . -name "segment.*.wal" -mtime +30 -exec gzip {} \;
|
|
find . -name "segment.*.wal.gz" -mtime +90 -exec rm {} \;
|
|
```
|
|
|
|
**2. Enable LSM auto-compaction:**
|
|
|
|
```toml
|
|
# /etc/stemedb/api.toml
|
|
[storage]
|
|
enable_auto_compaction = true
|
|
compaction_trigger_mb = 1024 # Trigger at 1GB
|
|
```
|
|
|
|
**3. Monitor write amplification:**
|
|
|
|
Track `stemedb_storage_write_amplification` metric (should be <10).
|
|
|
|
### Operational Best Practices
|
|
|
|
**1. Regular LSM health checks:**
|
|
|
|
```bash
|
|
# Weekly compaction report
|
|
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
|
|
sst_files: .sst_file_count,
|
|
total_size_mb: (.total_bytes / 1e6),
|
|
pending_compaction_mb: (.pending_compaction_bytes / 1e6)
|
|
}'
|
|
```
|
|
|
|
**2. Backup before major operations:**
|
|
|
|
Always snapshot KV store before:
|
|
- Major version upgrades
|
|
- Manual compaction
|
|
- Schema migrations
|
|
|
|
## Escalation
|
|
|
|
**Escalate immediately if:**
|
|
|
|
- Error rate exceeds 10% (critical data loss risk)
|
|
- LSM corruption cannot be repaired from WAL
|
|
- Disk I/O errors persist after reboot (hardware failure)
|
|
- Lock contention causes cascading failures (deadlock)
|
|
|
|
**Escalation path:**
|
|
|
|
1. **Primary on-call:** Storage SRE
|
|
2. **Secondary:** Database engineer
|
|
3. **Final escalation:** Principal engineer + on-call manager
|
|
|
|
## References
|
|
|
|
- **Dashboard:** [StemeDB Storage Health](http://grafana.example.com/d/stemedb-storage)
|
|
- **Related alerts:** `WALDiskNearlyFull`, `WALFsyncFailure`, `MemoryExhaustion`
|
|
- **Metrics to check:**
|
|
- `stemedb_storage_operation_errors_total` (error count by type)
|
|
- `stemedb_lsm_compaction_duration_seconds` (compaction timing)
|
|
- `stemedb_storage_put_duration_seconds` (write latency)
|
|
- `node_disk_io_errors_total` (hardware errors)
|
|
- **Logs:** `/var/log/stemedb/storage.log` or `journalctl -u stemedb-api`
|
|
- **Runbooks:** `restore-from-backup.md`, `disk-full.md`, `failover-to-replica.md`
|