High Storage Error Rate
Severity: CRITICAL
Alert Rule
Alert: HighStorageErrorRate
Trigger: Storage operation errors > 1% of total operations
Duration: 5m
Symptom
- API returns 500 Internal Server Error on write operations
- Metrics show stemedb_storage_operation_errors_total increasing
- Logs contain StorageError or failed put/get operations
- Specific error patterns:
- "Failed to write to KV store"
- "LSM tree compaction failed"
- "Index update failed"
Impact
User Impact:
- Assertion writes fail silently or return errors
- Query results may be incomplete (missing recent data)
- Votes and supersessions not persisted
System Impact:
- Data loss if errors persist (WAL entries not indexed)
- Index corruption possible (partial writes)
- Performance degradation (retry storms)
Investigation Steps
1. Check Error Metrics
# Get error rate by operation type
curl -s http://localhost:18180/metrics | grep storage_operation_errors
# Expected output showing errors by operation:
# stemedb_storage_operation_errors_total{operation="put"} 42
# stemedb_storage_operation_errors_total{operation="get"} 5
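The per-operation counters can be condensed into a quick summary. The following is a minimal sketch, assuming the metric lines have the shape shown in the expected output above; the helper name sum_storage_errors is hypothetical and not part of StemeDB.

```shell
# Summarize stemedb_storage_operation_errors_total samples read from stdin.
# Prints one "operation count" line per operation, then a "total N" line.
sum_storage_errors() {
  awk '
    /^stemedb_storage_operation_errors_total[{]/ {
      op = "unknown"
      # Pull the value of the operation="..." label, if present.
      if (match($0, /operation="[^"]*"/)) {
        op = substr($0, RSTART + 11, RLENGTH - 12)
      }
      counts[op] += $NF
      total += $NF
    }
    END {
      for (op in counts) printf "%s %d\n", op, counts[op]
      printf "total %d\n", total
    }
  '
}
```

Usage: curl -s http://localhost:18180/metrics | sum_storage_errors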
2. Identify Error Pattern in Logs
# Recent storage errors
journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | tail -50
Common error patterns:
A. Disk I/O errors:
Error: Custom { kind: Other, error: "IO error: No space left on device" }
Error: Custom { kind: Other, error: "Input/output error" }
B. LSM tree corruption:
Error: Corruption: block checksum mismatch
Error: Corruption: invalid SST file header
C. Lock contention:
Error: Failed to acquire write lock within timeout
Error: Deadlock detected in KV store
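To see which of the three patterns dominates, the log lines can be tallied by class. This is a rough sketch: the match strings are taken from the example messages above and real log output may differ, and the function name classify_storage_errors is hypothetical.

```shell
# Tally log lines from stdin into the three error classes described above.
# Lines that match none of the patterns are counted as "other".
classify_storage_errors() {
  awk '
    BEGIN { io = 0; corrupt = 0; lock = 0; other = 0 }
    /No space left on device|Input\/output error/ { io++; next }
    /Corruption:/                                  { corrupt++; next }
    /write lock within timeout|Deadlock detected/  { lock++; next }
    { other++ }
    END {
      printf "disk_io=%d corruption=%d lock_contention=%d other=%d\n",
             io, corrupt, lock, other
    }
  '
}
```

Usage: journalctl -u stemedb-api --since "5 min ago" | grep -i "storage.*error" | classify_storage_errors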
3. Check Disk Health
# Disk space
df -h /var/lib/stemedb
# I/O errors (check dmesg for hardware failures)
dmesg | grep -i "i/o error" | tail -20
# SMART status (if available)
smartctl -a /dev/sda | grep -E "(Reallocated_Sector|Current_Pending_Sector)"
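The disk space check can be turned into a pass/fail test. The sketch below assumes POSIX-format df output (df -P) and reuses the 20% free-space threshold from the disk space warning in the Prevention section; both function names are hypothetical.

```shell
# Read `df -P` output on stdin and print the percentage of space still
# available for the first filesystem line (the line after the header).
df_free_percent() {
  awk 'NR == 2 && $2 > 0 { printf "%d\n", ($4 * 100) / $2 }'
}

# Return 0 when free space is at or above the threshold (default 20),
# 1 otherwise.
disk_space_ok() {
  local free_pct="$1" threshold="${2:-20}"
  [ "$free_pct" -ge "$threshold" ]
}
```

Usage: disk_space_ok "$(df -P /var/lib/stemedb | df_free_percent)" || echo "low disk space"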
4. Check LSM Tree Health
# SSH to server, check LSM stats
cd /var/lib/stemedb/kv
du -sh ./*
# Check for large number of files (compaction falling behind)
ls -1 | wc -l
Expected: <100 SST files. If >500, compaction is failing.
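The thresholds above can be applied mechanically. In this sketch the label for the range between 100 and 500 ("elevated") is an illustrative choice, and both function names are hypothetical. Unlike ls -1 | wc -l, count_files counts only regular files, so subdirectories do not inflate the number.

```shell
# Classify an SST file count against the thresholds in this runbook:
# under 100 is expected, over 500 means compaction is failing.
classify_sst_count() {
  local count="$1"
  if [ "$count" -lt 100 ]; then
    echo "ok"
  elif [ "$count" -gt 500 ]; then
    echo "compaction_failing"
  else
    echo "elevated"
  fi
}

# Count regular files directly inside a directory, without parsing ls.
count_files() {
  find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

Usage: classify_sst_count "$(count_files /var/lib/stemedb/kv)"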
5. Check for Lock Contention
# Look for lock timeout messages
journalctl -u stemedb-api | grep -i "lock.*timeout" | tail -20
# Check write throughput (should be consistent)
curl -s http://localhost:18180/metrics | grep stemedb_storage_put_duration
Resolution
If Disk Space Exhausted
1. Free up space immediately:
# Compress all but the 19 most recently modified WAL segments
cd /var/lib/stemedb/wal
gzip $(ls -t segment.*.wal | tail -n +20)
# Or move to backup
mkdir -p /backup/wal-$(date +%Y%m%d)
mv segment.00[0-5]*.wal /backup/wal-$(date +%Y%m%d)/
2. Trigger manual LSM compaction:
curl -X POST http://localhost:18180/v1/admin/storage/compact \
-H 'Content-Type: application/json' \
-d '{"force": true}'
3. Monitor compaction progress:
journalctl -u stemedb-api -f | grep compaction
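Step 1 selects old segments by parsing ls output, which is fragile with unusual filenames. A possible alternative is sketched below. It assumes GNU find (for -printf) and the segment.*.wal naming used in this runbook; the function name old_wal_segments is hypothetical.

```shell
# Print every WAL segment in a directory except the `keep` most recently
# modified ones (default 19, matching `tail -n +20` in step 1).
old_wal_segments() {
  local dir="$1" keep="${2:-19}"
  find "$dir" -maxdepth 1 -type f -name 'segment.*.wal' -printf '%T@ %p\n' |
    sort -rn |                    # newest first, by modification time
    tail -n "+$((keep + 1))" |    # skip the newest `keep` entries
    cut -d' ' -f2-                # drop the timestamp, keep the path
}
```

Review the list before acting on it. With GNU xargs, the result can then be compressed: old_wal_segments /var/lib/stemedb/wal | xargs -r -d '\n' gzip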
If Disk Hardware Failure Suspected
1. Verify I/O errors:
dmesg | grep -i "sd[a-z].*error"
2. Run filesystem check (requires downtime):
systemctl stop stemedb-api
umount /var/lib/stemedb
fsck -y /dev/sdb1 # Replace with actual device
mount /var/lib/stemedb
systemctl start stemedb-api
3. If hardware is failing, initiate failover:
See docs/operations/runbooks/failover-to-replica.md.
If LSM Tree Corruption Detected
1. Attempt recovery from WAL:
systemctl stop stemedb-api
# Backup corrupted KV store
mv /var/lib/stemedb/kv /var/lib/stemedb/kv.corrupted.$(date +%Y%m%d)
# Rebuild from WAL
stemedb-api --rebuild-from-wal \
--wal-path /var/lib/stemedb/wal \
--kv-path /var/lib/stemedb/kv
systemctl start stemedb-api
2. Verify rebuild succeeded:
journalctl -u stemedb-api | grep -i "rebuild complete"
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
3. If rebuild fails, restore from backup:
See docs/operations/runbooks/restore-from-backup.md.
If Lock Contention Detected
1. Check for long-running transactions:
# Look for slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq
2. Increase lock timeout temporarily:
# Restart with increased timeout
systemctl stop stemedb-api
# Edit /etc/stemedb/api.toml:
# [storage]
# lock_timeout_ms = 10000 # Increase from default 5000
systemctl start stemedb-api
3. Monitor lock acquisition time:
curl -s http://localhost:18180/metrics | grep lock_wait_duration
If Errors Persist Despite Above Steps
1. Enable debug logging:
Edit /etc/stemedb/api.toml:
[logging]
level = "debug"
Restart:
systemctl restart stemedb-api
2. Capture detailed error trace:
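# journalctl's JSON envelope has no "level" field, so read the raw messages
# and keep those that parse as JSON with level ERROR (assumes the service
# emits JSON log lines)
journalctl -u stemedb-api -f --output=cat | jq -R 'fromjson? | select(.level=="ERROR")'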
3. Escalate with logs:
Collect the debug logs and a snapshot of the metrics endpoint, and attach both to the escalation for the engineering team.
Prevention
Monitoring and Alerting
1. Set up disk space warning alerts:
# Prometheus alert
- alert: DiskSpaceWarning
expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"} /
node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"}) < 0.2
for: 10m
annotations:
summary: "Disk space below 20% on StemeDB partition"
2. Monitor LSM compaction lag:
- alert: LSMCompactionLag
expr: stemedb_lsm_pending_compaction_bytes > 10e9 # 10GB
for: 15m
annotations:
summary: "LSM tree compaction falling behind"
3. Alert on I/O errors:
- alert: DiskIOErrors
expr: rate(node_disk_io_errors_total[5m]) > 0.1
annotations:
summary: "Disk I/O errors detected on StemeDB node"
Capacity Planning
1. Set up automated disk cleanup:
#!/bin/bash
# /etc/cron.daily/stemedb-cleanup
# Cron job to compress and expire old WAL segments
cd /var/lib/stemedb/wal || exit 1
# Compress segments older than 30 days
find . -name "segment.*.wal" -mtime +30 -exec gzip {} \;
# Delete compressed segments older than 90 days
find . -name "segment.*.wal.gz" -mtime +90 -exec rm {} \;
2. Enable LSM auto-compaction:
# /etc/stemedb/api.toml
[storage]
enable_auto_compaction = true
compaction_trigger_mb = 1024 # Trigger at 1GB
3. Monitor write amplification:
Track stemedb_storage_write_amplification metric (should be <10).
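For context on the figure above, write amplification is the ratio of bytes physically written to storage to bytes the application asked to write. The sketch below shows the arithmetic for cases where only raw byte counters are at hand. The function names are hypothetical, and how StemeDB computes its own metric is not documented here.

```shell
# Print write amplification (physical bytes / logical bytes) to two
# decimals, or "n/a" when no logical bytes were written.
write_amplification() {
  local disk_bytes="$1" logical_bytes="$2"
  awk -v d="$disk_bytes" -v l="$logical_bytes" 'BEGIN {
    if (l <= 0) { print "n/a"; exit }
    printf "%.2f\n", d / l
  }'
}

# Return 0 when the ratio is below the limit (default 10, per this runbook).
write_amplification_ok() {
  local ratio="$1" limit="${2:-10}"
  awk -v r="$ratio" -v m="$limit" 'BEGIN { exit !(r < m) }'
}
```

For instance, 3000 bytes written to disk for 1000 bytes of application writes gives a ratio of 3.00, which is within the limit.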
Operational Best Practices
1. Regular LSM health checks:
# Weekly compaction report
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
sst_files: .sst_file_count,
total_size_mb: (.total_bytes / 1e6),
pending_compaction_mb: (.pending_compaction_bytes / 1e6)
}'
2. Backup before major operations:
Always snapshot KV store before:
- Major version upgrades
- Manual compaction
- Schema migrations
Escalation
Escalate immediately if:
- Error rate exceeds 10% (critical data loss risk)
- LSM corruption cannot be repaired from WAL
- Disk I/O errors persist after reboot (hardware failure)
- Lock contention causes cascading failures (deadlock)
Escalation path:
- Primary on-call: Storage SRE
- Secondary: Database engineer
- Final escalation: Principal engineer + on-call manager
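The error-rate thresholds in this runbook (above 1% triggers the alert, above 10% calls for immediate escalation) can be encoded in a small helper for scripts or dashboards. This is a sketch only: the function name and the severity labels are hypothetical.

```shell
# Map an error count and a total operation count to a severity label.
# Prints "unknown" when there were no operations to measure against.
storage_error_severity() {
  local errors="$1" total="$2"
  awk -v e="$errors" -v t="$total" 'BEGIN {
    if (t <= 0) { print "unknown"; exit }
    pct = 100 * e / t
    if (pct > 10)     print "escalate"
    else if (pct > 1) print "alert"
    else              print "ok"
  }'
}
```

Usage: storage_error_severity 150 1000 reports escalate, because 15% of operations failed.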
References
- Dashboard: StemeDB Storage Health
- Related alerts: WALDiskNearlyFull, WALFsyncFailure, MemoryExhaustion
- Metrics to check:
  - stemedb_storage_operation_errors_total (error count by type)
  - stemedb_lsm_compaction_duration_seconds (compaction timing)
  - stemedb_storage_put_duration_seconds (write latency)
  - node_disk_io_errors_total (hardware errors)
- Logs: /var/log/stemedb/storage.log or journalctl -u stemedb-api
- Runbooks: restore-from-backup.md, disk-full.md, failover-to-replica.md