stemedb/docs/operations/runbooks/disk-full.md
# Runbook: Disk Full
## Symptom
- Writes fail with "No space left on device"
- Server won't start due to disk space
- Disk usage >95%
- WAL segments filling disk rapidly
- "No inodes available" errors
**Metrics Alerts:**
- `node_filesystem_avail_bytes` < 5% of total
- `node_filesystem_files_free` < 1000 (inode exhaustion)
---
## Quick Diagnosis
```
Disk full
├─► Check: df -h
│   └─► >98%? → §1 Emergency Cleanup
├─► Check: du -sh data/wal/
│   └─► WAL using most space? → §2 WAL Cleanup
├─► Check: du -sh data/db/
│   └─► Database using most space? → §3 Compaction
├─► Check: df -i
│   └─► Inodes exhausted? → §4 Inode Exhaustion
└─► Normal growth, no cleanup options?
    └─► §5 Volume Expansion
```
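The tree above can be sketched as a small shell function. This is an illustrative helper, not part of StemeDB: the name `pick_section` is hypothetical, and "using most space" is assumed to mean "more than half of the used space". Checks are applied in the order the tree lists them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map measurements to the runbook section to follow.
# Arguments: disk-used %, used KB, WAL KB, database KB, inode-used %.
pick_section() {
  local disk_pct=$1 used_kb=$2 wal_kb=$3 db_kb=$4 inode_pct=$5
  if [ "$disk_pct" -gt 98 ]; then
    echo "§1 Emergency Cleanup"
  elif [ $(( wal_kb * 2 )) -gt "$used_kb" ]; then
    # Assumption: "most space" means more than half of used space
    echo "§2 WAL Cleanup"
  elif [ $(( db_kb * 2 )) -gt "$used_kb" ]; then
    echo "§3 Compaction"
  elif [ "$inode_pct" -ge 100 ]; then
    echo "§4 Inode Exhaustion"
  else
    echo "§5 Volume Expansion"
  fi
}

# Gathering the inputs (GNU coreutils assumed):
#   disk_pct=$(df --output=pcent /data | tail -n 1 | tr -dc '0-9')
#   used_kb=$(df -k --output=used /data | tail -n 1 | tr -dc '0-9')
#   wal_kb=$(du -sk data/wal | cut -f1)
#   db_kb=$(du -sk data/db | cut -f1)
#   inode_pct=$(df --output=ipcent /data | tail -n 1 | tr -dc '0-9')
#   pick_section "$disk_pct" "$used_kb" "$wal_kb" "$db_kb" "$inode_pct"
```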
---
## Common Causes
1. **WAL segments not being cleaned up** (likelihood: **50%**)
   - WAL retention too long
   - Backup process holding references
   - Compaction not running
2. **Database growth** (likelihood: **25%**)
   - High ingest rate
   - No compaction configured
   - Expected growth, undersized volume
3. **Log files accumulating** (likelihood: **15%**)
   - Application logs not rotated
   - systemd journal filling disk
   - Old backups not deleted
4. **Inode exhaustion** (likelihood: **5%**)
   - Many small WAL segments
   - Temporary files not cleaned
   - Filesystem fragmentation
5. **Unexpected data** (likelihood: **5%**)
   - Core dumps
   - Large test datasets
   - Temporary files from failed operations
---
## Resolution Steps
### §1. Emergency Cleanup (Disk >98%)
**Diagnostic:**
```bash
# Check disk usage
df -h
# Expected output (critical):
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 100G 99G 500M 99% /
# Find largest directories
sudo du -h /data | sort -rh | head -20
```
**Resolution: Immediate cleanup**
**WARNING:** Only perform this cleanup when disk usage is >98%. Always back up first if possible.
```bash
# Step 1: Delete old WAL segments (>7 days)
# ONLY if you have a recent backup!
sudo find data/wal -name "*.log" -mtime +7 -exec ls -lh {} \;
# Review list, then delete:
sudo find data/wal -name "*.log" -mtime +7 -delete
# Step 2: Delete old backups
# (-maxdepth 1 stops find from descending into directories it has just removed)
sudo find backups/ -maxdepth 1 -name "stemedb-backup-*" -mtime +30 -exec rm -rf {} +
# Step 3: Delete old logs
sudo journalctl --vacuum-time=7d
# Step 4: Delete core dumps
sudo find /var/lib/systemd/coredump -name "core.*" -mtime +1 -delete
# Step 5: Verify space freed
df -h
# Should show >10% free now
```
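Before running the deletions above, it can help to know how much space each step would actually free. The following is a minimal dry-run sketch, assuming GNU `find` (for `-printf`); the function name `reclaimable_bytes` is hypothetical and it deletes nothing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: total size in bytes of regular files under DIR that
# match PATTERN and were last modified more than DAYS days ago.
# Read-only: it only reports what a matching "find ... -delete" would remove.
reclaimable_bytes() {
  local dir=$1 pattern=$2 days=$3
  find "$dir" -type f -name "$pattern" -mtime +"$days" -printf '%s\n' |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Example (run with sudo if the directory is not readable):
#   reclaimable_bytes data/wal '*.log' 7
```

If the reported total is small compared with the space needed, cleanup alone is unlikely to help and §5 Volume Expansion is the better next step.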
**Start server:**
```bash
sudo systemctl start stemedb-api
# Verify startup
curl http://localhost:18180/v1/health
```
**If failed:** Still >95% after cleanup → Proceed to §5 Volume Expansion immediately.
---
### §2. WAL Cleanup (Planned)
**Diagnostic:**
```bash
# Check WAL directory size
du -sh data/wal/
# Count WAL segments
ls data/wal/*.log | wc -l
# Check oldest segment
ls -lt data/wal/*.log | tail -1
# Expected: Oldest segment <7 days for pilot workloads
```
**Resolution: Configure WAL retention**
```bash
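# Set WAL retention to 7 days (default: unlimited)
# Note: an export only affects the current shell; for the systemd-managed
# service, set the variable in the unit's environment instead
export STEMEDB_WAL_RETENTION_DAYS=7
# Or in config file (if a [wal] table already exists, edit it instead of appending)
sudo tee -a /etc/stemedb/config.toml > /dev/null <<EOF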
[wal]
retention_days = 7
max_segments = 100 # Cap at 100 segments
segment_size_mb = 64 # 64MB per segment
EOF
# Restart server to apply
sudo systemctl restart stemedb-api
# Verify WAL cleanup runs
journalctl -u stemedb-api | grep "WAL cleanup"
# Expected log:
# "WAL cleanup: removed 15 segments older than 7 days"
```
**Manual WAL cleanup (safe):**
```bash
# Stop server (required for safe WAL cleanup)
sudo systemctl stop stemedb-api
# Backup current WAL first
sudo ./scripts/backup-stemedb.sh
# Archive WAL segments to backup storage
# (backups/ should be on a volume with free space, not the full one)
sudo tar czf backups/wal-archive-$(date +%Y%m%d).tar.gz data/wal/*.log
# Delete segments older than 7 days
sudo find data/wal -name "*.log" -mtime +7 -delete
# Start server
sudo systemctl start stemedb-api
# Verify health
curl http://localhost:18180/v1/health
```
**If failed:** WAL still growing rapidly → Check ingest rate, may need larger volume or WAL archival to S3 (roadmap P6.4).
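To judge how urgent a rapidly growing WAL is, a rough estimate of days until the volume fills can be derived from recent segment activity. This is a sketch under stated assumptions: GNU `find` is available, WAL segments match `*.log` as in the commands above, and the last 24 hours are representative. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: disk usage (KB) of WAL segments modified in the last 24h.
wal_kb_last_day() {
  find "$1" -type f -name '*.log' -mtime -1 -printf '%k\n' |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Hypothetical helper: whole days until AVAIL_KB is consumed at GROWTH_KB per day.
days_until_full() {
  local avail_kb=$1 growth_kb=$2
  if [ "$growth_kb" -le 0 ]; then
    echo "n/a"   # no measurable growth in the sample window
  else
    echo $(( avail_kb / growth_kb ))
  fi
}

# Example:
#   avail=$(df -k --output=avail /data | tail -n 1 | tr -dc '0-9')
#   days_until_full "$avail" "$(wal_kb_last_day data/wal)"
```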
---
### §3. Database Compaction
**Diagnostic:**
```bash
# Check database size
du -sh data/db/
# Sum the on-disk size of the .kv files (ls -l prints bytes; -h would break the sum)
ls -l data/db/*.kv | awk '{sum+=$5} END {print sum/1024/1024 " MB"}'
# Check compaction metrics
curl http://localhost:18180/metrics | grep stemedb_compaction_
```
**Resolution: Trigger manual compaction**
⚠️ **NOTE:** Compaction is I/O intensive. Run during low-traffic periods.
```bash
# Trigger compaction via admin endpoint
curl -X POST http://localhost:18180/v1/admin/compact \
  -H "Content-Type: application/json" \
  -d '{"aggressive": false}'
# Monitor progress
watch -n 5 'curl -s http://localhost:18180/metrics | grep compaction_progress'
# Expected duration: 5-30 minutes for <100K assertions
# Verify space freed
df -h
du -sh data/db/
```
**Automatic compaction (recommended):**
```toml
# /etc/stemedb/config.toml
[storage]
compaction_enabled = true
compaction_interval_hours = 24 # Daily
compaction_threshold_mb = 1000 # Trigger at 1GB growth
```
**If failed:** Compaction doesn't free space → Database growth is legitimate. Proceed to §5 Volume Expansion.
---
### §4. Inode Exhaustion
**Diagnostic:**
```bash
# Check inode usage
df -i
# Expected output (exhausted):
# Filesystem Inodes IUsed IFree IUse% Mounted on
# /dev/sda1 6.2M 6.2M 0 100% /
# Find directories with most files
sudo find /data -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n | tail -20
```
**Resolution: Delete small files**
```bash
# Find temp files
sudo find data/ -name "*.tmp" -delete
# Find empty files
sudo find data/ -type f -empty -delete
# Consolidate small WAL segments (if many tiny files)
sudo systemctl stop stemedb-api
# Archive and consolidate
cd data/wal
# Remove the originals only if the archive was created successfully
sudo tar czf consolidated-$(date +%Y%m%d).tar.gz segment-*.log && sudo rm segment-*.log
# (Server will recreate on startup)
sudo systemctl start stemedb-api
# Verify inodes freed
df -i
```
**If failed:** Can't free inodes → May need to increase inode ratio (requires filesystem recreation) or migrate to larger volume.
---
### §5. Volume Expansion
**Diagnostic:**
```bash
# Check current volume size
df -h /data
# Check if volume is expandable
# AWS EBS example:
aws ec2 describe-volumes --volume-ids vol-xxx | jq '.Volumes[].Size'
```
**Resolution A: Expand existing volume (AWS EBS)**
```bash
# Step 1: Expand EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-xxx --size 200
# (Doubles from 100GB to 200GB)
# Step 2: Wait for modification to complete
aws ec2 describe-volumes-modifications --volume-ids vol-xxx
# Step 3: Expand filesystem
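# If growpart fails with "No space left on device" on a completely full root
# volume, mounting a small tmpfs on /tmp first gives it working space
sudo growpart /dev/nvme0n1 1 # Expand partition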
sudo resize2fs /dev/nvme0n1p1 # Resize ext4
# (For XFS: sudo xfs_growfs /data)
# Step 4: Verify expansion
df -h
# Should show new size
# No restart needed, server continues running
```
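Step 2 above can be automated with a polling loop. The sketch below assumes the AWS CLI is configured and uses its standard `--query`/`--output text` options; the function names are hypothetical. EBS allows the filesystem to be extended once the modification reaches the `optimizing` state, so the loop does not wait for `completed`.

```bash
#!/usr/bin/env bash
# Succeeds (exit 0) when the modification state allows growing the filesystem.
ready_to_resize() {
  case "$1" in
    optimizing|completed) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: poll every 10 seconds until the volume is ready or failed.
wait_for_volume() {
  local volume_id=$1 state
  while :; do
    state=$(aws ec2 describe-volumes-modifications \
      --volume-ids "$volume_id" \
      --query 'VolumesModifications[0].ModificationState' \
      --output text) || return 1
    echo "modification state: $state"
    ready_to_resize "$state" && return 0
    [ "$state" = "failed" ] && return 1
    sleep 10
  done
}

# Example:
#   wait_for_volume vol-xxx && sudo growpart /dev/nvme0n1 1
```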
**Resolution B: Add secondary volume**
```bash
# Step 1: Attach new volume (AWS example)
aws ec2 attach-volume --volume-id vol-yyy --instance-id i-xxx --device /dev/sdf
# Step 2: Format new volume
sudo mkfs.ext4 /dev/sdf
# Step 3: Mount temporarily (create the mount point first)
sudo mkdir -p /mnt/newdata
sudo mount /dev/sdf /mnt/newdata
# Step 4: Stop server and migrate
sudo systemctl stop stemedb-api
sudo rsync -av /data/ /mnt/newdata/
# Step 5: Update fstab
echo "/dev/sdf /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
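# Step 6: Remount (release the temporary mount, then mount the new volume at /data)
sudo umount /mnt/newdata
# /data is only unmounted if it is currently its own mount point
mountpoint -q /data && sudo umount /data
sudo mount /data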
# Step 7: Start server
sudo systemctl start stemedb-api
# Verify health
curl http://localhost:18180/v1/health
```
**Resolution C: Archive old data to S3**
⚠️ **NOTE:** Requires roadmap P6.4 (WAL archival). Workaround: Manual archival.
```bash
# Archive WAL segments older than 30 days to S3
sudo find data/wal -name "*.log" -mtime +30 -exec echo {} \; > wal-to-archive.txt
# Upload to S3
cat wal-to-archive.txt | xargs -I {} aws s3 cp {} s3://stemedb-archive/wal/
# Verify upload, then delete local copies
cat wal-to-archive.txt | xargs -I {} sudo rm {}
# Verify space freed
df -h
```
**If failed:** Can't expand volume → Migrate to new server with larger storage. See [Add Node Runbook](./add-node.md) for cluster migration.
---
## Validation
After applying resolution, validate disk health:
- [ ] **Disk usage <80%**
```bash
df -h
# Should show <80% used
```
- [ ] **Inodes available**
```bash
df -i
# Should show >10% inodes free
```
- [ ] **Server running**
```bash
systemctl status stemedb-api
# Should show: active (running)
```
- [ ] **Writes succeed**
```bash
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/disk", "predicate": "space_ok", "value": true}'
# Should return: 201 Created
```
- [ ] **No disk errors in logs**
```bash
# Limit to recent entries so errors from before the fix are not matched
journalctl -u stemedb-api --since "15 minutes ago" | grep -i "no space"
# Should return empty
```
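The first two checklist items can be scripted so they produce a clear pass or fail. This is a minimal sketch assuming GNU `df` (for `--output`); the function names are hypothetical, and the limits mirror the checklist: less than 80% disk used and more than 10% of inodes free (that is, less than 90% used).

```bash
#!/usr/bin/env bash
# Percentage of space / inodes used on the filesystem holding PATH (digits only).
used_pct()       { df --output=pcent  "$1" | tail -n 1 | tr -dc '0-9'; }
inode_used_pct() { df --output=ipcent "$1" | tail -n 1 | tr -dc '0-9'; }

# check_below LABEL VALUE LIMIT: pass when VALUE is strictly below LIMIT.
check_below() {
  if [ "$2" -lt "$3" ]; then
    echo "PASS $1: $2% (limit <$3%)"
  else
    echo "FAIL $1: $2% (limit <$3%)"
    return 1
  fi
}

# Hypothetical wrapper: run both checks against one path.
validate_disk() {
  local path=$1 rc=0
  check_below "disk used"   "$(used_pct "$path")"       80 || rc=1
  check_below "inodes used" "$(inode_used_pct "$path")" 90 || rc=1
  return "$rc"
}

# Example:
#   validate_disk /data
```

Some filesystems do not report inode counts, in which case `df` prints `-` and the inode check cannot be evaluated this way.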
---
## Prevention
### Monitoring
**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
  - name: stemedb_disk
    rules:
      - alert: StemeDBDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <20% on /data"
          description: "Available: {{ $value | humanizePercentage }}"
      - alert: StemeDBDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space <10% on /data"
          description: "Available: {{ $value | humanizePercentage }}"
      - alert: StemeDBInodeExhaustion
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inodes <10% available"
```
### Configuration Changes
**To prevent recurrence:**
1. **WAL retention:** Set to 7 days for pilot, 3 days for production with frequent backups
2. **Compaction:** Enable automatic daily compaction
3. **Backup cleanup:** Retain last 7 daily backups only
4. **Log rotation:** Configure systemd journal vacuum
5. **Capacity planning:** Right-size volumes based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)
**Example: Comprehensive disk management**
```toml
# /etc/stemedb/config.toml
[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64
[storage]
compaction_enabled = true
compaction_interval_hours = 24
compaction_threshold_mb = 1000
[backup]
retention_days = 7
compression_enabled = true
```
**Systemd journal vacuum:**
```bash
# Limit journal to 500MB
sudo journalctl --vacuum-size=500M
# Or limit to 7 days
sudo journalctl --vacuum-time=7d
# Make permanent
sudo mkdir -p /etc/systemd/journald.conf.d/
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/vacuum.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7day
EOF
sudo systemctl restart systemd-journald
```
---
## Capacity Planning
**Disk growth formula:**
| Component | Growth Rate | Calculation |
|-----------|-------------|-------------|
| **WAL** | ~10MB per 1K assertions | retention_days × daily_assertions × 10MB / 1000 |
| **Database** | ~50MB per 10K assertions | (total_assertions / 10000) × 50MB |
| **Indexes** | ~10% of database size | database_size × 0.1 |
| **Backups** | 1x data size per backup | (wal_size + db_size) × retention_count |
**Example: Pilot with 100K total assertions, ingesting 1K assertions/day, with 7-day WAL retention and 7 retained backups:**
- WAL: 7 days × 1K/day × 10MB / 1000 = 70MB
- Database: (100K / 10K) × 50MB = 500MB
- Indexes: 500MB × 0.1 = 50MB
- Backups: (70MB + 500MB) × 7 = 3,990MB ≈ 4GB
- **Total: 4,610MB ≈ 5GB** (provision 20GB for 4x headroom)
**See:** [Resource Sizing Guide](../reference-architecture/resource-sizing.md) for detailed calculations.
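The growth formulas in the table can be turned into a quick calculator for other workloads. This is a sketch using integer shell arithmetic; the function name `estimate_disk_mb` and its argument order are hypothetical, and the per-assertion rates are the approximate figures from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate disk needs in MB from the table's growth rates.
# Arguments: total assertions, assertions per day, WAL retention days, backups kept.
estimate_disk_mb() {
  local total=$1 daily=$2 wal_days=$3 backups=$4
  local wal=$(( wal_days * daily * 10 / 1000 ))   # ~10MB per 1K assertions
  local db=$(( total * 50 / 10000 ))              # ~50MB per 10K assertions
  local idx=$(( db / 10 ))                        # ~10% of database size
  local bak=$(( (wal + db) * backups ))           # 1x data size per backup
  echo "wal=$wal db=$db indexes=$idx backups=$bak total=$(( wal + db + idx + bak ))"
}

# Pilot example: 100K assertions, 1K/day, 7-day retention, 7 backups
estimate_disk_mb 100000 1000 7 7
# prints: wal=70 db=500 indexes=50 backups=3990 total=4610
```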
---
## Related Runbooks
- [Server Won't Start](./server-wont-start.md) - Disk full preventing startup
- [Restore from Backup](./restore-from-backup.md) - Need space for restore operations
- [High Query Latency](./high-query-latency.md) - Performance impact of disk pressure
---
## Last Updated
2026-02-11