# Runbook: Disk Full

## Symptom

- Writes fail with "No space left on device"
- Server won't start due to disk space
- Disk usage >95%
- WAL segments filling disk rapidly
- "No inodes available" errors

**Metrics Alerts:**

- `node_filesystem_avail_bytes` < 5% of total
- `node_filesystem_files_free` < 1000 (inode exhaustion)

---

## Quick Diagnosis

```
Disk full
│
├─► Check: df -h
│   └─► >98%? → §1 Emergency Cleanup
│
├─► Check: du -sh data/wal/
│   └─► WAL using most space? → §2 WAL Cleanup
│
├─► Check: du -sh data/db/
│   └─► Database using most space? → §3 Compaction
│
├─► Check: df -i
│   └─► Inodes exhausted? → §4 Inode Exhaustion
│
└─► Normal growth, no cleanup options?
    └─► §5 Volume Expansion
```

---

## Common Causes

1. **WAL segments not being cleaned up** — Likelihood: **50%**
   - WAL retention too long
   - Backup process holding references
   - Compaction not running

2. **Database growth** — Likelihood: **25%**
   - High ingest rate
   - No compaction configured
   - Expected growth, undersized volume

3. **Log files accumulating** — Likelihood: **15%**
   - Application logs not rotated
   - systemd journal filling disk
   - Old backups not deleted

4. **Inode exhaustion** — Likelihood: **5%**
   - Many small WAL segments
   - Temporary files not cleaned
   - Filesystem fragmentation

5. **Unexpected data** — Likelihood: **5%**
   - Core dumps
   - Large test datasets
   - Temporary files from failed operations

---

## Resolution Steps

### §1. Emergency Cleanup (Disk >98%)

**Diagnostic:**

```bash
# Check disk usage
df -h

# Expected output (critical):
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1       100G   99G  500M  99% /

# Find largest directories
sudo du -h /data | sort -rh | head -20
```

**Resolution: Immediate cleanup**

⚠️ **WARNING:** Only perform when disk >98%. Always back up first if possible.

```bash
# Step 1: Delete old WAL segments (>7 days)
# ONLY if you have a recent backup!
sudo find data/wal -name "*.log" -mtime +7 -exec ls -lh {} \;

# Review list, then delete:
sudo find data/wal -name "*.log" -mtime +7 -delete

# Step 2: Delete old backups
sudo find backups/ -name "stemedb-backup-*" -mtime +30 -exec rm -rf {} \;

# Step 3: Delete old logs
sudo journalctl --vacuum-time=7d

# Step 4: Delete core dumps
sudo find /var/lib/systemd/coredump -name "core.*" -mtime +1 -delete

# Step 5: Verify space freed
df -h
# Should show >10% free now
```

**Start server:**

```bash
sudo systemctl start stemedb-api

# Verify startup
curl http://localhost:18180/v1/health
```

**If failed:** Still >95% after cleanup → Proceed to §5 Volume Expansion immediately.

---

### §2. WAL Cleanup (Planned)

**Diagnostic:**

```bash
# Check WAL directory size
du -sh data/wal/

# Count WAL segments
ls data/wal/*.log | wc -l

# Check oldest segment
ls -lt data/wal/*.log | tail -1
# Expected: Oldest segment <7 days for pilot workloads
```

**Resolution: Configure WAL retention**

```bash
# Set WAL retention to 7 days (default: unlimited)
export STEMEDB_WAL_RETENTION_DAYS=7

# Or in config file
cat >> /etc/stemedb/config.toml < wal-to-archive.txt

# Upload to S3
cat wal-to-archive.txt | xargs -I {} aws s3 cp {} s3://stemedb-archive/wal/

# Verify upload, then delete local copies
cat wal-to-archive.txt | xargs -I {} sudo rm {}

# Verify space freed
df -h
```

**If failed:** Can't expand volume → Migrate to new server with larger storage. See [Add Node Runbook](./add-node.md) for cluster migration.

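
The archive step above says to verify the upload before deleting local copies, but the `rm` command as written runs unconditionally. The following is a minimal sketch of a per-file check, not the project's own tooling. It assumes each segment was uploaded under `wal/<basename>` in the `stemedb-archive` bucket (as the `aws s3 cp` command above would do) and that `wal-to-archive.txt` holds one local path per line. The function names `archived` and `delete_archived` and the two environment variables are hypothetical. Run `delete_archived wal-to-archive.txt` from a root shell if the WAL files need elevated permissions to remove.

```bash
#!/usr/bin/env bash
# Sketch: remove a local WAL segment only after its archived copy is confirmed.
# Assumptions (not from the runbook): helper names and the ARCHIVE_* variables.

ARCHIVE_BUCKET="${ARCHIVE_BUCKET:-stemedb-archive}"
ARCHIVE_KEY_PREFIX="${ARCHIVE_KEY_PREFIX:-wal}"

# Succeeds only if an object with this exact key exists in the bucket.
# head-object exits non-zero when the key is missing or the call fails.
archived() {
  aws s3api head-object \
    --bucket "$ARCHIVE_BUCKET" \
    --key "${ARCHIVE_KEY_PREFIX}/$(basename "$1")" >/dev/null 2>&1
}

# Reads local paths from the list file and deletes only confirmed ones.
delete_archived() {
  local list=$1 path
  while IFS= read -r path; do
    [ -n "$path" ] || continue          # skip blank lines
    if archived "$path"; then
      rm -- "$path" && echo "deleted $path"
    else
      echo "kept $path (not found in archive)" >&2
    fi
  done < "$list"
}
```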

---

## Validation

After applying resolution, validate disk health:

- [ ] **Disk usage <80%**
  ```bash
  df -h
  # Should show <80% used
  ```

- [ ] **Inodes available**
  ```bash
  df -i
  # Should show >10% inodes free
  ```

- [ ] **Server running**
  ```bash
  systemctl status stemedb-api
  # Should show: active (running)
  ```

- [ ] **Writes succeed**
  ```bash
  curl -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/disk", "predicate": "space_ok", "value": true}'
  # Should return: 201 Created
  ```

- [ ] **No disk errors in logs**
  ```bash
  journalctl -u stemedb-api | grep -i "no space"
  # Should return empty
  ```

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_disk
    rules:
      - alert: StemeDBDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <20% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space <10% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBInodeExhaustion
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inodes <10% available"
```

### Configuration Changes

**To prevent recurrence:**

1. **WAL retention:** Set to 7 days for pilot, 3 days for production with frequent backups
2. **Compaction:** Enable automatic daily compaction
3. **Backup cleanup:** Retain last 7 daily backups only
4. **Log rotation:** Configure systemd journal vacuum
5. **Capacity planning:** Right-size volumes based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)

**Example: Comprehensive disk management**

```toml
# /etc/stemedb/config.toml

[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64

[storage]
compaction_enabled = true
compaction_interval_hours = 24
compaction_threshold_mb = 1000

[backup]
retention_days = 7
compression_enabled = true
```

**Systemd journal vacuum:**

```bash
# Limit journal to 500MB
sudo journalctl --vacuum-size=500M

# Or limit to 7 days
sudo journalctl --vacuum-time=7d

# Make permanent
sudo mkdir -p /etc/systemd/journald.conf.d/
cat <