stemedb/docs/operations/runbooks/storage-errors.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00

High Storage Error Rate

Severity: CRITICAL

Alert Rule

Alert: HighStorageErrorRate
Trigger: Storage operation errors > 1% of total operations
Duration: 5m

Symptom

  • API returns 500 Internal Server Error on write operations
  • Metrics show stemedb_storage_operation_errors_total increasing
  • Logs contain StorageError or failed put/get operations
  • Specific error patterns:
    • "Failed to write to KV store"
    • "LSM tree compaction failed"
    • "Index update failed"

Impact

User Impact:

  • Assertion writes fail silently or return errors
  • Query results may be incomplete (missing recent data)
  • Votes and supersessions not persisted

System Impact:

  • Data loss if errors persist (WAL entries not indexed)
  • Index corruption possible (partial writes)
  • Performance degradation (retry storms)

Investigation Steps

1. Check Error Metrics

# Get error rate by operation type
curl -s http://localhost:18180/metrics | grep storage_operation_errors

# Expected output showing errors by operation:
# stemedb_storage_operation_errors_total{operation="put"} 42
# stemedb_storage_operation_errors_total{operation="get"} 5
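The per-operation counters above can be rolled up into one error percentage and compared against the thresholds this runbook uses (alert above 1%, escalate above 10%). The following is a minimal sketch, not part of StemeDB: the function names are illustrative, and the total operation count is passed in by the caller because this runbook does not name a total-operations metric.

```shell
# Sum every stemedb_storage_operation_errors_total sample read from stdin.
# Assumes the value is the last field on the line (no trailing timestamp).
sum_storage_errors() {
  awk '/^stemedb_storage_operation_errors_total[{ ]/ { sum += $NF }
       END { printf "%d\n", sum }'
}

# Print ERRORS ($1) as a percentage of TOTAL operations ($2), to two decimals.
error_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * e / t
  }'
}

# Map a percentage onto the runbook thresholds: >1% alerts, >10% escalates.
# The 5-minute duration condition of the alert is not modelled here.
classify_error_pct() {
  awk -v p="$1" 'BEGIN {
    if (p !~ /^[0-9.]+$/) { print "unknown"; exit }
    if (p > 10)     print "escalate"
    else if (p > 1) print "alert"
    else            print "ok"
  }'
}
```

For example, `curl -s http://localhost:18180/metrics | sum_storage_errors` prints the combined error count, which can then be passed to `error_pct` together with a total taken from whichever request counter the deployment exposes.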

2. Identify Error Pattern in Logs

# Recent storage errors
journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | tail -50

Common error patterns:

A. Disk I/O errors:

Error: Custom { kind: Other, error: "IO error: No space left on device" }
Error: Custom { kind: Other, error: "Input/output error" }

B. LSM tree corruption:

Error: Corruption: block checksum mismatch
Error: Corruption: invalid SST file header

C. Lock contention:

Error: Failed to acquire write lock within timeout
Error: Deadlock detected in KV store
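To see which of the three categories dominates, the log lines can be tallied by the patterns listed above. This is a rough sketch: the function name is illustrative, and it matches only the example messages shown in this runbook, so real logs may need additional patterns.

```shell
# Tally storage errors on stdin by category:
#   A. disk I/O, B. LSM corruption, C. lock contention.
classify_storage_errors() {
  awk '
    /No space left on device|Input\/output error/     { disk++;    next }
    /Corruption:/                                     { corrupt++; next }
    /Failed to acquire write lock|Deadlock detected/  { lock++;    next }
    END { printf "disk_io=%d corruption=%d lock=%d\n", disk, corrupt, lock }
  '
}
```

Usage: `journalctl -u stemedb-api --since "5 min ago" | classify_storage_errors`. The category with the highest count indicates which Resolution section to start with.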

3. Check Disk Health

# Disk space
df -h /var/lib/stemedb

# I/O errors (check dmesg for hardware failures)
dmesg | grep -i "i/o error" | tail -20

# SMART status (if available)
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector)"
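The disk space check can be turned into a pass/fail result by parsing the POSIX-format output of `df -P`. This is a sketch with illustrative function names. The default limit of 80% used is an assumption chosen to roughly mirror the 20%-free warning in the Prevention section; `df` computes its percentage slightly differently from the Prometheus expression when the filesystem reserves blocks.

```shell
# Read `df -P <path>` output on stdin and print the used percentage as a bare integer.
disk_used_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Print "low" when usage ($1) is at or above the limit ($2, default 80), else "ok".
disk_space_status() {
  used="$1"
  limit="${2:-80}"
  if [ "$used" -ge "$limit" ]; then
    echo "low"
  else
    echo "ok"
  fi
}
```

Usage: `disk_space_status "$(df -P /var/lib/stemedb | disk_used_pct)"`.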

4. Check LSM Tree Health

# SSH to server, check LSM stats
cd /var/lib/stemedb/kv
du -sh ./*

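# Count entries in the KV directory (a large count suggests compaction is falling behind)
# Note: this counts every entry in the directory, not only SST files
ls -1 | wc -l

Expected: fewer than 100 SST files. More than 500 indicates that compaction is failing.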
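The file-count thresholds can be wrapped in a small helper. This is a sketch with illustrative function names. It counts all regular files in the directory because this runbook does not state the SST file extension, and the "elevated" label for counts between 100 and 500 is an assumption: the runbook defines only the two outer thresholds.

```shell
# Count regular files directly inside DIR ($1), ignoring subdirectories.
count_kv_files() {
  find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Classify a file count: <100 is expected, >500 means compaction is failing.
classify_sst_count() {
  if [ "$1" -lt 100 ]; then
    echo "healthy"
  elif [ "$1" -gt 500 ]; then
    echo "failing"
  else
    echo "elevated"
  fi
}
```

Usage: `classify_sst_count "$(count_kv_files /var/lib/stemedb/kv)"`.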

5. Check for Lock Contention

# Look for lock timeout messages
journalctl -u stemedb-api | grep -i "lock.*timeout" | tail -20

# Check write throughput (should be consistent)
curl -s http://localhost:18180/metrics | grep stemedb_storage_put_duration
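A single number is easier to compare between checks than raw metric lines. The sketch below derives a mean write latency, assuming `stemedb_storage_put_duration_seconds` is exposed as a Prometheus histogram or summary with `_sum` and `_count` series; this runbook names the metric but not its type, so verify against the actual `/metrics` output. The function name is illustrative.

```shell
# Read Prometheus text-format metrics on stdin and print the mean put latency in seconds.
# Sums across all label sets; prints "n/a" when no samples have been recorded.
mean_put_latency() {
  awk '
    /^stemedb_storage_put_duration_seconds_sum[{ ]/   { s += $NF }
    /^stemedb_storage_put_duration_seconds_count[{ ]/ { c += $NF }
    END {
      if (c > 0) printf "%.6f\n", s / c
      else       print "n/a"
    }'
}
```

Usage: `curl -s http://localhost:18180/metrics | mean_put_latency`. Because the underlying counters are cumulative, this is the mean since process start; taking two readings a minute apart and comparing them gives a better picture of current behaviour.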

Resolution

If Disk Space Exhausted

1. Free up space immediately:

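# Compress all but the 20 newest WAL segments
# (tail -n +21 starts at the 21st-newest file; xargs -r skips gzip when nothing is left)
cd /var/lib/stemedb/wal
ls -t segment.*.wal | tail -n +21 | xargs -r gzip

# Or move to backup
backup_dir="/backup/wal-$(date +%Y%m%d)"
mkdir -p "$backup_dir"
mv segment.00[0-5]*.wal "$backup_dir"/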

2. Trigger manual LSM compaction:

curl -X POST http://localhost:18180/v1/admin/storage/compact \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'

3. Monitor compaction progress:

journalctl -u stemedb-api -f | grep compaction
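When selecting WAL segments to compress in step 1, note that `tail -n +K` begins output at line K, so leaving the newest N segments untouched means starting at line N+1. The sketch below prints the candidates without modifying anything, so the list can be reviewed first. The function name is illustrative, and it assumes segment filenames contain no whitespace.

```shell
# Print the paths of all WAL segments in DIR ($1) except the KEEP ($2) newest,
# ordered newest-first by modification time. Prints nothing if none qualify.
segments_to_archive() {
  ls -1t "$1"/segment.*.wal 2>/dev/null | tail -n +"$(( $2 + 1 ))"
}
```

Usage: `segments_to_archive /var/lib/stemedb/wal 20` to preview, then append `| xargs -r gzip` to compress the listed files.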

If Disk Hardware Failure Suspected

1. Verify I/O errors:

dmesg | grep -i "sd[a-z].*error"

2. Run filesystem check (requires downtime):

systemctl stop stemedb-api
umount /var/lib/stemedb
fsck -y /dev/sdb1  # Replace with actual device
mount /var/lib/stemedb
systemctl start stemedb-api

3. If hardware is failing, initiate failover:

See docs/operations/runbooks/failover-to-replica.md.

If LSM Tree Corruption Detected

1. Attempt recovery from WAL:

systemctl stop stemedb-api

# Backup corrupted KV store
mv /var/lib/stemedb/kv /var/lib/stemedb/kv.corrupted.$(date +%Y%m%d)

# Rebuild from WAL
stemedb-api --rebuild-from-wal \
  --wal-path /var/lib/stemedb/wal \
  --kv-path /var/lib/stemedb/kv

systemctl start stemedb-api

2. Verify rebuild succeeded:

journalctl -u stemedb-api | grep -i "rebuild complete"
curl -s http://localhost:18180/metrics | grep assertions_indexed_total

3. If rebuild fails, restore from backup:

See docs/operations/runbooks/restore-from-backup.md.

If Lock Contention Detected

1. Check for long-running transactions:

# Look for slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq

2. Increase lock timeout temporarily:

# Restart with increased timeout
systemctl stop stemedb-api

# Edit /etc/stemedb/api.toml:
# [storage]
# lock_timeout_ms = 10000  # Increase from default 5000

systemctl start stemedb-api

3. Monitor lock acquisition time:

curl -s http://localhost:18180/metrics | grep lock_wait_duration

If Errors Persist Despite Above Steps

1. Enable debug logging:

Edit /etc/stemedb/api.toml:

[logging]
level = "debug"

Restart:

systemctl restart stemedb-api

2. Capture detailed error trace:

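# journald JSON has no "level" field; filter by syslog priority instead (requires the service to log with priorities)
journalctl -u stemedb-api -f -p err --output=json | jq .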

3. Escalate with logs:

Collect logs and metrics for engineering team.
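One way to gather the logs and metrics for escalation is to bundle them into a single archive. This is a sketch under stated assumptions: the function name, the file names inside the bundle, and the one-hour log window are all illustrative choices, not an established StemeDB procedure. Each capture tolerates failure, so a missing tool leaves its error message in the corresponding file instead of aborting the bundle.

```shell
# Collect recent logs, metrics, and disk state into OUT_DIR ($1, default /tmp)
# and print the path of the resulting tarball.
collect_diagnostics() {
  out_dir="${1:-/tmp}"
  stamp="$(date +%Y%m%d-%H%M%S)"
  work="$out_dir/stemedb-diag-$stamp"
  mkdir -p "$work" || return 1

  # Service logs from the last hour (window is an assumption; widen as needed).
  journalctl -u stemedb-api --since "1 hour ago" --no-pager > "$work/journal.log" 2>&1 || true
  # Current metrics snapshot from the endpoint used throughout this runbook.
  curl -s --max-time 5 http://localhost:18180/metrics > "$work/metrics.txt" 2>&1 || true
  # Disk usage and recent kernel messages for hardware-level context.
  df -h /var/lib/stemedb > "$work/df.txt" 2>&1 || true
  dmesg 2>&1 | tail -n 200 > "$work/dmesg.txt" || true

  tar -czf "$work.tar.gz" -C "$out_dir" "stemedb-diag-$stamp" && echo "$work.tar.gz"
}
```

Usage: `collect_diagnostics /tmp` prints the archive path to attach to the escalation. Review the contents before sharing, since logs may contain sensitive data.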

Prevention

Monitoring and Alerting

1. Set up disk space warning alerts:

# Prometheus alert
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"} /
         node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"}) < 0.2
  for: 10m
  annotations:
    summary: "Disk space below 20% on StemeDB partition"

2. Monitor LSM compaction lag:

- alert: LSMCompactionLag
  expr: stemedb_lsm_pending_compaction_bytes > 10e9  # 10GB
  for: 15m
  annotations:
    summary: "LSM tree compaction falling behind"

3. Alert on I/O errors:

- alert: DiskIOErrors
  expr: rate(node_disk_io_errors_total[5m]) > 0.1
  annotations:
    summary: "Disk I/O errors detected on StemeDB node"

Capacity Planning

1. Set up automated disk cleanup:

#!/bin/bash
# /etc/cron.daily/stemedb-cleanup
# Daily cron job: compress WAL segments older than 30 days,
# then delete compressed segments older than 90 days.
cd /var/lib/stemedb/wal || exit 1
find . -name "segment.*.wal" -mtime +30 -exec gzip {} \;
find . -name "segment.*.wal.gz" -mtime +90 -exec rm {} \;

2. Enable LSM auto-compaction:

# /etc/stemedb/api.toml
[storage]
enable_auto_compaction = true
compaction_trigger_mb = 1024  # Trigger at 1GB

3. Monitor write amplification:

Track stemedb_storage_write_amplification metric (should be <10).
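The write amplification check can be scripted against the metrics endpoint. This is a sketch with an illustrative function name. It assumes `stemedb_storage_write_amplification` is a gauge whose value is the last field on its line; if several label sets are present, the last one read wins.

```shell
# Read Prometheus text-format metrics on stdin and report write amplification
# against the runbook guideline (should be below 10).
# Prints "ok <value>", "high <value>", or "missing".
check_write_amplification() {
  awk '
    $1 == "stemedb_storage_write_amplification" ||
    index($1, "stemedb_storage_write_amplification{") == 1 { found = 1; v = $NF }
    END {
      if (!found) { print "missing"; exit }
      verdict = (v + 0 < 10) ? "ok" : "high"
      print verdict, v
    }'
}
```

Usage: `curl -s http://localhost:18180/metrics | check_write_amplification`.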

Operational Best Practices

1. Regular LSM health checks:

# Weekly compaction report
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
  sst_files: .sst_file_count,
  total_size_mb: (.total_bytes / 1e6),
  pending_compaction_mb: (.pending_compaction_bytes / 1e6)
}'

2. Backup before major operations:

Always snapshot KV store before:

  • Major version upgrades
  • Manual compaction
  • Schema migrations

Escalation

Escalate immediately if:

  • Error rate exceeds 10% (critical data loss risk)
  • LSM corruption cannot be repaired from WAL
  • Disk I/O errors persist after reboot (hardware failure)
  • Lock contention causes cascading failures (deadlock)

Escalation path:

  1. Primary on-call: Storage SRE
  2. Secondary: Database engineer
  3. Final escalation: Principal engineer + on-call manager

References

  • Dashboard: StemeDB Storage Health
  • Related alerts: WALDiskNearlyFull, WALFsyncFailure, MemoryExhaustion
  • Metrics to check:
    • stemedb_storage_operation_errors_total (error count by type)
    • stemedb_lsm_compaction_duration_seconds (compaction timing)
    • stemedb_storage_put_duration_seconds (write latency)
    • node_disk_io_errors_total (hardware errors)
  • Logs: /var/log/stemedb/storage.log or journalctl -u stemedb-api
  • Runbooks: restore-from-backup.md, disk-full.md, failover-to-replica.md