stemedb/docs/operations/runbooks/storage-errors.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

# High Storage Error Rate
## Severity: CRITICAL
## Alert Rule
**Alert:** `HighStorageErrorRate`
**Trigger:** Storage operation errors > 1% of total operations
**Duration:** 5m
## Symptom
- API returns 500 Internal Server Error on write operations
- Metrics show `stemedb_storage_operation_errors_total` increasing
- Logs contain `StorageError` or failed `put/get` operations
- Specific error patterns:
  - "Failed to write to KV store"
  - "LSM tree compaction failed"
  - "Index update failed"
## Impact
**User Impact:**
- Assertion writes fail silently or return errors
- Query results may be incomplete (missing recent data)
- Votes and supersessions not persisted
**System Impact:**
- Data loss if errors persist (WAL entries not indexed)
- Index corruption possible (partial writes)
- Performance degradation (retry storms)
## Investigation Steps
### 1. Check Error Metrics
```bash
# Get error rate by operation type
curl -s http://localhost:18180/metrics | grep storage_operation_errors
# Expected output showing errors by operation:
# stemedb_storage_operation_errors_total{operation="put"} 42
# stemedb_storage_operation_errors_total{operation="get"} 5
```
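The alert is defined as a percentage of total operations, while the command above prints raw counters. A minimal sketch of turning the counters into that percentage follows. It assumes a total-operations counter named `stemedb_storage_operations_total`, which is a hypothetical name not confirmed by this runbook; substitute whichever total counter the service actually exports. The 1% and 10% thresholds come from the alert rule and the escalation criteria in this document.

```bash
#!/usr/bin/env bash
# Reads Prometheus text-format metrics on stdin and prints "<percent> <status>".
# ASSUMPTION: stemedb_storage_operations_total is a hypothetical counter of
# all storage operations; replace it with the real total-operations metric.
storage_error_status() {
  awk '
    /^stemedb_storage_operation_errors_total/ { errors += $NF }
    /^stemedb_storage_operations_total/       { total  += $NF }
    END {
      if (total == 0) { print "0.00 no-data"; exit }
      pct = 100 * errors / total
      status = "ok"
      if (pct > 1)  status = "alert"     # HighStorageErrorRate trigger
      if (pct > 10) status = "escalate"  # immediate-escalation threshold
      printf "%.2f %s\n", pct, status
    }'
}
```

Usage: `curl -s http://localhost:18180/metrics | storage_error_status`. The counters are cumulative since process start, so this is a lifetime ratio rather than the 5-minute window the alert evaluates.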
### 2. Identify Error Pattern in Logs
```bash
# Recent storage errors
journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | tail -50
```
**Common error patterns:**
**A. Disk I/O errors:**
```
Error: Custom { kind: Other, error: "IO error: No space left on device" }
Error: Custom { kind: Other, error: "Input/output error" }
```
**B. LSM tree corruption:**
```
Error: Corruption: block checksum mismatch
Error: Corruption: invalid SST file header
```
**C. Lock contention:**
```
Error: Failed to acquire write lock within timeout
Error: Deadlock detected in KV store
```
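To see which of the three families dominates, the log lines can be bucketed by the substrings shown above. This is a rough sketch: the function name and category labels are illustrative, and the match strings are taken only from the sample messages in this runbook, so real log wording may need extra patterns.

```bash
#!/usr/bin/env bash
# Maps one log line to pattern family A, B, or C from this runbook.
# The labels (disk-io, lsm-corruption, lock-contention) are illustrative.
classify_storage_error() {
  case "$1" in
    *"No space left on device"*|*"Input/output error"*) echo "disk-io" ;;
    *Corruption:*)                                       echo "lsm-corruption" ;;
    *"write lock"*|*Deadlock*)                           echo "lock-contention" ;;
    *)                                                   echo "unknown" ;;
  esac
}
```

Usage: `journalctl -u stemedb-api --since "5 min ago" | while IFS= read -r line; do classify_storage_error "$line"; done | sort | uniq -c`. The largest bucket points to the matching Resolution section.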
### 3. Check Disk Health
```bash
# Disk space
df -h /var/lib/stemedb
# I/O errors (check dmesg for hardware failures)
dmesg | grep -i "i/o error" | tail -20
# SMART status (if smartctl is installed; replace /dev/sda with the actual device)
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector)"
```
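The disk-space alert in the Prevention section fires when less than 20% of the partition is free. A small sketch of the same check run locally, computed from `df -P` output, is shown here. The function name is illustrative; the 20% threshold is the one used by that alert.

```bash
#!/usr/bin/env bash
# Reads `df -P <path>` output on stdin and prints "<free-percent> <status>".
# POSIX df columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
disk_free_status() {
  awk 'NR == 2 {
    free = 100 * $4 / $2   # Available blocks as a share of total blocks
    printf "%.0f %s\n", free, (free < 20 ? "low" : "ok")
  }'
}
```

Usage: `df -P /var/lib/stemedb | disk_free_status`.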
### 4. Check LSM Tree Health
```bash
# SSH to server, check LSM stats
cd /var/lib/stemedb/kv
du -sh ./*
# Check for large number of files (compaction falling behind)
ls -1 | wc -l
```
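Expected: fewer than 100 SST files. More than 500 indicates that compaction is failing. The count above includes every file in the directory, so treat it as an upper bound on the number of SST files.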
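The thresholds can be encoded in a small helper for use in scripts. This is a sketch: the function name and the `elevated` label for counts between the two stated thresholds are assumptions, since the runbook only defines the values below 100 and above 500.

```bash
#!/usr/bin/env bash
# Classifies a file count against the thresholds stated above.
# ASSUMPTION: counts from 100 to 500 are reported as "elevated"; the runbook
# does not name this middle band.
sst_compaction_status() {
  local count=$1
  if   [ "$count" -lt 100 ]; then echo "healthy"
  elif [ "$count" -le 500 ]; then echo "elevated"
  else                            echo "compaction-failing"
  fi
}
```

Usage: `sst_compaction_status "$(ls -1 /var/lib/stemedb/kv | wc -l)"`.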
### 5. Check for Lock Contention
```bash
# Look for lock timeout messages
journalctl -u stemedb-api | grep -i "lock.*timeout" | tail -20
# Check write throughput (should be consistent)
curl -s http://localhost:18180/metrics | grep stemedb_storage_put_duration
```
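The raw duration series are hard to read at a glance. Assuming `stemedb_storage_put_duration_seconds` is exported as a Prometheus histogram or summary (which would provide `_sum` and `_count` series; this is not confirmed here), a mean write latency can be sketched as:

```bash
#!/usr/bin/env bash
# Reads Prometheus text-format metrics on stdin and prints mean put latency in ms.
# ASSUMPTION: the metric exposes _sum and _count series, as histograms and
# summaries do by convention.
avg_put_latency_ms() {
  awk '
    /^stemedb_storage_put_duration_seconds_sum/   { sum   += $NF }
    /^stemedb_storage_put_duration_seconds_count/ { count += $NF }
    END {
      if (count == 0) { print "no-data"; exit }
      printf "%.1f\n", 1000 * sum / count
    }'
}
```

Usage: `curl -s http://localhost:18180/metrics | avg_put_latency_ms`. The result is a mean over the process lifetime; run it twice a minute apart and compare if a recent trend is needed.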
## Resolution
### If Disk Space Exhausted
**1. Free up space immediately:**
```bash
# Compress all but the 20 newest WAL segments
# (xargs -r skips gzip when there is nothing to compress)
cd /var/lib/stemedb/wal
ls -t segment.*.wal | tail -n +21 | xargs -r gzip
# Or move to backup
backup_dir="/backup/wal-$(date +%Y%m%d)"
mkdir -p "$backup_dir"
mv segment.00[0-5]*.wal "$backup_dir"/
```
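Before compressing or moving anything, it can help to preview which segments would be affected. The following is a sketch of a dry-run helper; the function name and the default of keeping 20 segments are assumptions, and the `segment.*.wal` naming is taken from the commands above.

```bash
#!/usr/bin/env bash
# Prints WAL segments older than the N newest (by mtime), one per line.
# Prints nothing when the directory holds N or fewer segments.
wal_segments_to_archive() {
  local dir=$1 keep=${2:-20}
  ( cd "$dir" && ls -t segment.*.wal 2>/dev/null | tail -n +"$((keep + 1))" )
}
```

Usage: `wal_segments_to_archive /var/lib/stemedb/wal 20`. This runbook does not state whether the running server tolerates compressed or relocated segments, so review the list before acting on it.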
**2. Trigger manual LSM compaction:**
```bash
curl -X POST http://localhost:18180/v1/admin/storage/compact \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'
```
**3. Monitor compaction progress:**
```bash
journalctl -u stemedb-api -f | grep compaction
```
### If Disk Hardware Failure Suspected
**1. Verify I/O errors:**
```bash
dmesg | grep -i "sd[a-z].*error"
```
**2. Run filesystem check (requires downtime):**
```bash
systemctl stop stemedb-api
umount /var/lib/stemedb
fsck -y /dev/sdb1 # Replace with actual device
mount /var/lib/stemedb
systemctl start stemedb-api
```
**3. If hardware is failing, initiate failover:**
See `docs/operations/runbooks/failover-to-replica.md`.
### If LSM Tree Corruption Detected
**1. Attempt recovery from WAL:**
```bash
systemctl stop stemedb-api
# Backup corrupted KV store
mv /var/lib/stemedb/kv /var/lib/stemedb/kv.corrupted.$(date +%Y%m%d)
# Rebuild from WAL
stemedb-api --rebuild-from-wal \
  --wal-path /var/lib/stemedb/wal \
  --kv-path /var/lib/stemedb/kv
systemctl start stemedb-api
```
**2. Verify rebuild succeeded:**
```bash
journalctl -u stemedb-api | grep -i "rebuild complete"
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
```
**3. If rebuild fails, restore from backup:**
See `docs/operations/runbooks/restore-from-backup.md`.
### If Lock Contention Detected
**1. Check for long-running transactions:**
```bash
# Look for slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq .
```
**2. Increase lock timeout temporarily:**
```bash
# Restart with increased timeout
systemctl stop stemedb-api
# Edit /etc/stemedb/api.toml:
# [storage]
# lock_timeout_ms = 10000 # Increase from default 5000
systemctl start stemedb-api
```
**3. Monitor lock acquisition time:**
```bash
curl -s http://localhost:18180/metrics | grep lock_wait_duration
```
### If Errors Persist Despite Above Steps
**1. Enable debug logging:**
Edit `/etc/stemedb/api.toml`:
```toml
[logging]
level = "debug"
```
Restart:
```bash
systemctl restart stemedb-api
```
**2. Capture detailed error trace:**
```bash
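# Assumes the service logs JSON lines with a "level" field; -o cat emits only the message text
journalctl -u stemedb-api -f -o cat | jq -R 'fromjson? | select(.level=="ERROR")'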
```
**3. Escalate with logs:**
Collect logs and metrics for engineering team.
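One way to gather this material into a single archive is sketched below. It reuses only the commands from the investigation steps; the function name, the one-hour log window, and the archive naming are assumptions to adapt to local practice.

```bash
#!/usr/bin/env bash
# Bundles recent logs, metrics, and disk state into one tarball.
# Prints the path of the archive on success.
# ASSUMPTION: a one-hour journal window is enough context for the incident.
collect_storage_diagnostics() {
  local out_dir=${1:-/tmp} stamp work
  stamp=$(date +%Y%m%d-%H%M%S)
  work=$(mktemp -d) || return 1
  # Each capture is best-effort: a missing tool or unreachable endpoint
  # leaves an empty or error-only file instead of aborting the collection.
  journalctl -u stemedb-api --since "1 hour ago" > "$work/journal.log" 2>&1 || true
  curl -s --max-time 5 http://localhost:18180/metrics > "$work/metrics.txt" 2>&1 || true
  df -h /var/lib/stemedb > "$work/df.txt" 2>&1 || true
  dmesg 2>&1 | grep -i "i/o error" | tail -20 > "$work/dmesg-io.txt" || true
  tar -czf "$out_dir/stemedb-diag-$stamp.tar.gz" -C "$work" . || return 1
  rm -rf "$work"
  echo "$out_dir/stemedb-diag-$stamp.tar.gz"
}
```

Usage: `collect_storage_diagnostics /tmp`, then attach the printed archive path to the escalation ticket.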
## Prevention
### Monitoring and Alerting
**1. Set up disk space warning alerts:**
```yaml
# Prometheus alert
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"}) < 0.2
  for: 10m
  annotations:
    summary: "Disk space below 20% on StemeDB partition"
```
**2. Monitor LSM compaction lag:**
```yaml
- alert: LSMCompactionLag
  expr: stemedb_lsm_pending_compaction_bytes > 10e9 # 10GB
  for: 15m
  annotations:
    summary: "LSM tree compaction falling behind"
```
**3. Alert on I/O errors:**
```yaml
- alert: DiskIOErrors
  expr: rate(node_disk_io_errors_total[5m]) > 0.1
  annotations:
    summary: "Disk I/O errors detected on StemeDB node"
```
### Capacity Planning
**1. Set up automated disk cleanup:**
```bash
#!/bin/bash
# /etc/cron.daily/stemedb-cleanup
# Daily cron job to archive old WAL segments.
cd /var/lib/stemedb/wal || exit 1
# Compress segments older than 30 days
find . -name "segment.*.wal" -mtime +30 -exec gzip {} \;
# Delete compressed segments older than 90 days
find . -name "segment.*.wal.gz" -mtime +90 -exec rm {} \;
```
**2. Enable LSM auto-compaction:**
```toml
# /etc/stemedb/api.toml
[storage]
enable_auto_compaction = true
compaction_trigger_mb = 1024 # Trigger at 1GB
```
**3. Monitor write amplification:**
Track `stemedb_storage_write_amplification` metric (should be <10).
### Operational Best Practices
**1. Regular LSM health checks:**
```bash
# Weekly compaction report
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
sst_files: .sst_file_count,
total_size_mb: (.total_bytes / 1e6),
pending_compaction_mb: (.pending_compaction_bytes / 1e6)
}'
```
**2. Backup before major operations:**
Always snapshot the KV store before:
- Major version upgrades
- Manual compaction
- Schema migrations
## Escalation
**Escalate immediately if:**
- Error rate exceeds 10% (critical data loss risk)
- LSM corruption cannot be repaired from WAL
- Disk I/O errors persist after reboot (hardware failure)
- Lock contention causes cascading failures (deadlock)
**Escalation path:**
1. **Primary on-call:** Storage SRE
2. **Secondary:** Database engineer
3. **Final escalation:** Principal engineer + on-call manager
## References
- **Dashboard:** [StemeDB Storage Health](http://grafana.example.com/d/stemedb-storage)
- **Related alerts:** `WALDiskNearlyFull`, `WALFsyncFailure`, `MemoryExhaustion`
- **Metrics to check:**
- `stemedb_storage_operation_errors_total` (error count by type)
- `stemedb_lsm_compaction_duration_seconds` (compaction timing)
- `stemedb_storage_put_duration_seconds` (write latency)
- `node_disk_io_errors_total` (hardware errors)
- **Logs:** `/var/log/stemedb/storage.log` or `journalctl -u stemedb-api`
- **Runbooks:** `restore-from-backup.md`, `disk-full.md`, `failover-to-replica.md`