# High Storage Error Rate ## Severity: CRITICAL ## Alert Rule **Alert:** `HighStorageErrorRate` **Trigger:** Storage operation errors > 1% of total operations **Duration:** 5m ## Symptom - API returns 500 Internal Server Error on write operations - Metrics show `stemedb_storage_operation_errors_total` increasing - Logs contain `StorageError` or failed `put/get` operations - Specific error patterns: - "Failed to write to KV store" - "LSM tree compaction failed" - "Index update failed" ## Impact **User Impact:** - Assertion writes fail silently or return errors - Query results may be incomplete (missing recent data) - Votes and supersessions not persisted **System Impact:** - Data loss if errors persist (WAL entries not indexed) - Index corruption possible (partial writes) - Performance degradation (retry storms) ## Investigation Steps ### 1. Check Error Metrics ```bash # Get error rate by operation type curl -s http://localhost:18180/metrics | grep storage_operation_errors # Expected output showing errors by operation: # stemedb_storage_operation_errors_total{operation="put"} 42 # stemedb_storage_operation_errors_total{operation="get"} 5 ``` ### 2. Identify Error Pattern in Logs ```bash # Recent storage errors journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | tail -50 ``` **Common error patterns:** **A. Disk I/O errors:** ``` Error: Custom { kind: Other, error: "IO error: No space left on device" } Error: Custom { kind: Other, error: "Input/output error" } ``` **B. LSM tree corruption:** ``` Error: Corruption: block checksum mismatch Error: Corruption: invalid SST file header ``` **C. Lock contention:** ``` Error: Failed to acquire write lock within timeout Error: Deadlock detected in KV store ``` ### 3. Check Disk Health ```bash # Disk space df -h /var/lib/stemedb # I/O errors (check dmesg for hardware failures) dmesg | grep -i "i/o error" | tail -20 # SMART status (if available) smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector)" ``` ### 4. Check LSM Tree Health ```bash # SSH to server, check LSM stats cd /var/lib/stemedb/kv du -sh ./* # Check for large number of files (compaction falling behind) ls -1 | wc -l ``` Expected: <100 SST files. If >500, compaction is failing. ### 5. Check for Lock Contention ```bash # Look for lock timeout messages journalctl -u stemedb-api | grep -i "lock.*timeout" | tail -20 # Check write throughput (should be consistent) curl -s http://localhost:18180/metrics | grep stemedb_storage_put_duration ``` ## Resolution ### If Disk Space Exhausted **1. Free up space immediately:** ```bash # Compress old WAL segments cd /var/lib/stemedb/wal gzip $(ls -t segment.*.wal | tail -n +20) # Or move to backup mkdir -p /backup/wal-$(date +%Y%m%d) mv segment.00[0-5]*.wal /backup/wal-$(date +%Y%m%d)/ ``` **2. Trigger manual LSM compaction:** ```bash curl -X POST http://localhost:18180/v1/admin/storage/compact \ -H 'Content-Type: application/json' \ -d '{"force": true}' ``` **3. Monitor compaction progress:** ```bash journalctl -u stemedb-api -f | grep compaction ``` ### If Disk Hardware Failure Suspected **1. Verify I/O errors:** ```bash dmesg | grep -i "sd[a-z].*error" ``` **2. Run filesystem check (requires downtime):** ```bash systemctl stop stemedb-api umount /var/lib/stemedb fsck -y /dev/sdb1 # Replace with actual device mount /var/lib/stemedb systemctl start stemedb-api ``` **3. If hardware is failing, initiate failover:** See `docs/operations/runbooks/failover-to-replica.md`. ### If LSM Tree Corruption Detected **1. Attempt recovery from WAL:** ```bash systemctl stop stemedb-api # Backup corrupted KV store mv /var/lib/stemedb/kv /var/lib/stemedb/kv.corrupted.$(date +%Y%m%d) # Rebuild from WAL stemedb-api --rebuild-from-wal \ --wal-path /var/lib/stemedb/wal \ --kv-path /var/lib/stemedb/kv systemctl start stemedb-api ``` **2. Verify rebuild succeeded:** ```bash journalctl -u stemedb-api | grep -i "rebuild complete" curl -s http://localhost:18180/metrics | grep assertions_indexed_total ``` **3. If rebuild fails, restore from backup:** See `docs/operations/runbooks/restore-from-backup.md`. ### If Lock Contention Detected **1. Check for long-running transactions:** ```bash # Look for slow queries curl -s http://localhost:18180/v1/admin/slow-queries | jq ``` **2. Increase lock timeout temporarily:** ```bash # Restart with increased timeout systemctl stop stemedb-api # Edit /etc/stemedb/api.toml: # [storage] # lock_timeout_ms = 10000 # Increase from default 5000 systemctl start stemedb-api ``` **3. Monitor lock acquisition time:** ```bash curl -s http://localhost:18180/metrics | grep lock_wait_duration ``` ### If Errors Persist Despite Above Steps **1. Enable debug logging:** Edit `/etc/stemedb/api.toml`: ```toml [logging] level = "debug" ``` Restart: ```bash systemctl restart stemedb-api ``` **2. Capture detailed error trace:** ```bash journalctl -u stemedb-api -f --output=json | jq 'select(.level=="ERROR")' ``` **3. Escalate with logs:** Collect logs and metrics for engineering team. ## Prevention ### Monitoring and Alerting **1. Set up disk space warning alerts:** ```yaml # Prometheus alert - alert: DiskSpaceWarning expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"} / node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"}) < 0.2 for: 10m annotations: summary: "Disk space below 20% on StemeDB partition" ``` **2. Monitor LSM compaction lag:** ```yaml - alert: LSMCompactionLag expr: stemedb_lsm_pending_compaction_bytes > 10e9 # 10GB for: 15m annotations: summary: "LSM tree compaction falling behind" ``` **3. Alert on I/O errors:** ```yaml - alert: DiskIOErrors expr: rate(node_disk_io_errors_total[5m]) > 0.1 annotations: summary: "Disk I/O errors detected on StemeDB node" ``` ### Capacity Planning **1. Set up automated disk cleanup:** ```bash # Cron job to archive old WAL segments # /etc/cron.daily/stemedb-cleanup #!/bin/bash cd /var/lib/stemedb/wal # Keep 30 days of WAL find . -name "segment.*.wal" -mtime +30 -exec gzip {} \; find . -name "segment.*.wal.gz" -mtime +90 -exec rm {} \; ``` **2. Enable LSM auto-compaction:** ```toml # /etc/stemedb/api.toml [storage] enable_auto_compaction = true compaction_trigger_mb = 1024 # Trigger at 1GB ``` **3. Monitor write amplification:** Track `stemedb_storage_write_amplification` metric (should be <10). ### Operational Best Practices **1. Regular LSM health checks:** ```bash # Weekly compaction report curl -s http://localhost:18180/v1/admin/storage/stats | jq '{ sst_files: .sst_file_count, total_size_mb: (.total_bytes / 1e6), pending_compaction_mb: (.pending_compaction_bytes / 1e6) }' ``` **2. Backup before major operations:** Always snapshot KV store before: - Major version upgrades - Manual compaction - Schema migrations ## Escalation **Escalate immediately if:** - Error rate exceeds 10% (critical data loss risk) - LSM corruption cannot be repaired from WAL - Disk I/O errors persist after reboot (hardware failure) - Lock contention causes cascading failures (deadlock) **Escalation path:** 1. **Primary on-call:** Storage SRE 2. **Secondary:** Database engineer 3. **Final escalation:** Principal engineer + on-call manager ## References - **Dashboard:** [StemeDB Storage Health](http://grafana.example.com/d/stemedb-storage) - **Related alerts:** `WALDiskNearlyFull`, `WALFsyncFailure`, `MemoryExhaustion` - **Metrics to check:** - `stemedb_storage_operation_errors_total` (error count by type) - `stemedb_lsm_compaction_duration_seconds` (compaction timing) - `stemedb_storage_put_duration_seconds` (write latency) - `node_disk_io_errors_total` (hardware errors) - **Logs:** `/var/log/stemedb/storage.log` or `journalctl -u stemedb-api` - **Runbooks:** `restore-from-backup.md`, `disk-full.md`, `failover-to-replica.md`