stemedb/docs/operations/runbooks/storage-errors.md

# High Storage Error Rate

## Severity: CRITICAL

## Alert Rule

**Alert:** `HighStorageErrorRate`
**Trigger:** Storage operation errors > 1% of total operations
**Duration:** 5m

## Symptom

- API returns 500 Internal Server Error on write operations
- Metrics show `stemedb_storage_operation_errors_total` increasing
- Logs contain `StorageError` or failed `put/get` operations
- Specific error patterns:
  - "Failed to write to KV store"
  - "LSM tree compaction failed"
  - "Index update failed"

## Impact

**User Impact:**
- Assertion writes fail silently or return errors
- Query results may be incomplete (missing recent data)
- Votes and supersessions not persisted

**System Impact:**
- Data loss if errors persist (WAL entries not indexed)
- Index corruption possible (partial writes)
- Performance degradation (retry storms)

## Investigation Steps

### 1. Check Error Metrics

```bash
# Get error rate by operation type
curl -s http://localhost:18180/metrics | grep storage_operation_errors

# Expected output showing errors by operation:
# stemedb_storage_operation_errors_total{operation="put"} 42
# stemedb_storage_operation_errors_total{operation="get"} 5
```

### 2. Identify Error Pattern in Logs

```bash
# Recent storage errors
journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | tail -50
```

**Common error patterns:**

**A. Disk I/O errors:**
```
Error: Custom { kind: Other, error: "IO error: No space left on device" }
Error: Custom { kind: Other, error: "Input/output error" }
```

**B. LSM tree corruption:**
```
Error: Corruption: block checksum mismatch
Error: Corruption: invalid SST file header
```

**C. Lock contention:**
```
Error: Failed to acquire write lock within timeout
Error: Deadlock detected in KV store
```

### 3. Check Disk Health

```bash
# Disk space
df -h /var/lib/stemedb

# I/O errors (check dmesg for hardware failures)
dmesg | grep -i "i/o error" | tail -20

# SMART status (if available)
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector)"
```

### 4. Check LSM Tree Health

```bash
# SSH to server, check LSM stats
cd /var/lib/stemedb/kv
du -sh ./*

# Check for large number of files (compaction falling behind)
ls -1 | wc -l
```

Expected: <100 SST files. If >500, compaction is failing.

### 5. Check for Lock Contention

```bash
# Look for lock timeout messages
journalctl -u stemedb-api | grep -i "lock.*timeout" | tail -20

# Check write throughput (should be consistent)
curl -s http://localhost:18180/metrics | grep stemedb_storage_put_duration
```

## Resolution

### If Disk Space Exhausted

**1. Free up space immediately:**

```bash
# Compress old WAL segments
cd /var/lib/stemedb/wal
gzip $(ls -t segment.*.wal | tail -n +20)

# Or move to backup
mkdir -p /backup/wal-$(date +%Y%m%d)
mv segment.00[0-5]*.wal /backup/wal-$(date +%Y%m%d)/
```

**2. Trigger manual LSM compaction:**

```bash
curl -X POST http://localhost:18180/v1/admin/storage/compact \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'
```

**3. Monitor compaction progress:**

```bash
journalctl -u stemedb-api -f | grep compaction
```

### If Disk Hardware Failure Suspected

**1. Verify I/O errors:**

```bash
dmesg | grep -i "sd[a-z].*error"
```

**2. Run filesystem check (requires downtime):**

```bash
systemctl stop stemedb-api
umount /var/lib/stemedb
fsck -y /dev/sdb1  # Replace with actual device
mount /var/lib/stemedb
systemctl start stemedb-api
```

**3. If hardware is failing, initiate failover:**

See `docs/operations/runbooks/failover-to-replica.md`.

### If LSM Tree Corruption Detected

**1. Attempt recovery from WAL:**

```bash
systemctl stop stemedb-api

# Backup corrupted KV store
mv /var/lib/stemedb/kv /var/lib/stemedb/kv.corrupted.$(date +%Y%m%d)

# Rebuild from WAL
stemedb-api --rebuild-from-wal \
  --wal-path /var/lib/stemedb/wal \
  --kv-path /var/lib/stemedb/kv

systemctl start stemedb-api
```

**2. Verify rebuild succeeded:**

```bash
journalctl -u stemedb-api | grep -i "rebuild complete"
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
```

**3. If rebuild fails, restore from backup:**

See `docs/operations/runbooks/restore-from-backup.md`.

### If Lock Contention Detected

**1. Check for long-running transactions:**

```bash
# Look for slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq
```

**2. Increase lock timeout temporarily:**

```bash
# Restart with increased timeout
systemctl stop stemedb-api

# Edit /etc/stemedb/api.toml:
# [storage]
# lock_timeout_ms = 10000  # Increase from default 5000

systemctl start stemedb-api
```

**3. Monitor lock acquisition time:**

```bash
curl -s http://localhost:18180/metrics | grep lock_wait_duration
```

### If Errors Persist Despite Above Steps

**1. Enable debug logging:**

Edit `/etc/stemedb/api.toml`:

```toml
[logging]
level = "debug"
```

Restart:

```bash
systemctl restart stemedb-api
```

**2. Capture detailed error trace:**

```bash
journalctl -u stemedb-api -f --output=json | jq 'select(.level=="ERROR")'
```

**3. Escalate with logs:**

Collect logs and metrics for engineering team.

## Prevention

### Monitoring and Alerting

**1. Set up disk space warning alerts:**

```yaml
# Prometheus alert
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"} /
         node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"}) < 0.2
  for: 10m
  annotations:
    summary: "Disk space below 20% on StemeDB partition"
```

**2. Monitor LSM compaction lag:**

```yaml
- alert: LSMCompactionLag
  expr: stemedb_lsm_pending_compaction_bytes > 10e9  # 10GB
  for: 15m
  annotations:
    summary: "LSM tree compaction falling behind"
```

**3. Alert on I/O errors:**

```yaml
- alert: DiskIOErrors
  expr: rate(node_disk_io_errors_total[5m]) > 0.1
  annotations:
    summary: "Disk I/O errors detected on StemeDB node"
```

### Capacity Planning

**1. Set up automated disk cleanup:**

```bash
# Cron job to archive old WAL segments
# /etc/cron.daily/stemedb-cleanup

#!/bin/bash
cd /var/lib/stemedb/wal
# Keep 30 days of WAL
find . -name "segment.*.wal" -mtime +30 -exec gzip {} \;
find . -name "segment.*.wal.gz" -mtime +90 -exec rm {} \;
```

**2. Enable LSM auto-compaction:**

```toml
# /etc/stemedb/api.toml
[storage]
enable_auto_compaction = true
compaction_trigger_mb = 1024  # Trigger at 1GB
```

**3. Monitor write amplification:**

Track `stemedb_storage_write_amplification` metric (should be <10).

### Operational Best Practices

**1. Regular LSM health checks:**

```bash
# Weekly compaction report
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
  sst_files: .sst_file_count,
  total_size_mb: (.total_bytes / 1e6),
  pending_compaction_mb: (.pending_compaction_bytes / 1e6)
}'
```

**2. Backup before major operations:**

Always snapshot KV store before:
- Major version upgrades
- Manual compaction
- Schema migrations

## Escalation

**Escalate immediately if:**

- Error rate exceeds 10% (critical data loss risk)
- LSM corruption cannot be repaired from WAL
- Disk I/O errors persist after reboot (hardware failure)
- Lock contention causes cascading failures (deadlock)

**Escalation path:**

1. **Primary on-call:** Storage SRE
2. **Secondary:** Database engineer
3. **Final escalation:** Principal engineer + on-call manager

## References

- **Dashboard:** [StemeDB Storage Health](http://grafana.example.com/d/stemedb-storage)
- **Related alerts:** `WALDiskNearlyFull`, `WALFsyncFailure`, `MemoryExhaustion`
- **Metrics to check:**
  - `stemedb_storage_operation_errors_total` (error count by type)
  - `stemedb_lsm_compaction_duration_seconds` (compaction timing)
  - `stemedb_storage_put_duration_seconds` (write latency)
  - `node_disk_io_errors_total` (hardware errors)
- **Logs:** `/var/log/stemedb/storage.log` or `journalctl -u stemedb-api`
- **Runbooks:** `restore-from-backup.md`, `disk-full.md`, `failover-to-replica.md`