# Runbook: Disk Full

## Symptom

- Writes fail with "No space left on device"
- Server won't start due to disk space
- Disk usage >95%
- WAL segments filling disk rapidly
- "No inodes available" errors

**Metrics Alerts:**
- `node_filesystem_avail_bytes` < 5% of total
- `node_filesystem_files_free` < 1000 (inode exhaustion)

---

## Quick Diagnosis

```
Disk full
    │
    ├─► Check: df -h
    │   └─► >98%? → §1 Emergency Cleanup
    │
    ├─► Check: du -sh data/wal/
    │   └─► WAL using most space? → §2 WAL Cleanup
    │
    ├─► Check: du -sh data/db/
    │   └─► Database using most space? → §3 Compaction
    │
    ├─► Check: df -i
    │   └─► Inodes exhausted? → §4 Inode Exhaustion
    │
    └─► Normal growth, no cleanup options?
        └─► §5 Volume Expansion
```
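
The tree above can be sketched as a small helper that maps measurements to the sections worth checking. This is a minimal sketch, not part of StemeDB: the function name `triage_disk_full` is hypothetical, and reading "using most space" as "more than half of the used bytes" is an assumption made only so the fall-through to §5 is reachable.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper mirroring the decision tree.
# Inputs are passed explicitly so the logic can be checked without a real disk:
#   $1 = disk usage percent (integer)
#   $2 = used bytes on the volume
#   $3 = WAL directory size in bytes
#   $4 = database directory size in bytes
#   $5 = inode usage percent (integer)
triage_disk_full() {
  local used_pct=$1 used_bytes=$2 wal_bytes=$3 db_bytes=$4 inode_pct=$5
  local matched=0

  if (( used_pct > 98 )); then echo "§1 Emergency Cleanup"; matched=1; fi
  # Assumption: "most space" means more than half of the used bytes
  if (( wal_bytes * 2 > used_bytes )); then echo "§2 WAL Cleanup"; matched=1; fi
  if (( db_bytes * 2 > used_bytes )); then echo "§3 Compaction"; matched=1; fi
  if (( inode_pct >= 100 )); then echo "§4 Inode Exhaustion"; matched=1; fi
  # No check matched: treat as normal growth
  if (( matched == 0 )); then echo "§5 Volume Expansion"; fi
}
```

In practice the inputs would come from `df`, `du` and `df -i`. The helper prints every section that matches, since the checks in the tree are independent of each other.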

---

## Common Causes

1. **WAL segments not being cleaned up** — Likelihood: **50%**
   - WAL retention too long
   - Backup process holding references
   - Compaction not running

2. **Database growth** — Likelihood: **25%**
   - High ingest rate
   - No compaction configured
   - Expected growth, undersized volume

3. **Log files accumulating** — Likelihood: **15%**
   - Application logs not rotated
   - systemd journal filling disk
   - Old backups not deleted

4. **Inode exhaustion** — Likelihood: **5%**
   - Many small WAL segments
   - Temporary files not cleaned
   - Filesystem fragmentation

5. **Unexpected data** — Likelihood: **5%**
   - Core dumps
   - Large test datasets
   - Temporary files from failed operations

---

## Resolution Steps

### §1. Emergency Cleanup (Disk >98%)

**Diagnostic:**
```bash
# Check disk usage
df -h

# Expected output (critical):
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1       100G   99G  500M  99% /

# Find largest directories
sudo du -h /data | sort -rh | head -20
```

**Resolution: Immediate cleanup**

⚠️ **WARNING:** Only perform this when disk usage is >98%. Always back up first if possible.

```bash
# Step 1: Delete old WAL segments (>7 days)
# ONLY if you have a recent backup!
sudo find data/wal -name "*.log" -mtime +7 -exec ls -lh {} \;
# Review list, then delete:
sudo find data/wal -name "*.log" -mtime +7 -delete

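# Step 2: Delete old backups (assumes backups sit directly under backups/)
# -maxdepth 1 stops find from descending into directories it has just removed
sudo find backups/ -maxdepth 1 -name "stemedb-backup-*" -mtime +30 -exec rm -rf {} +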

# Step 3: Delete old logs
sudo journalctl --vacuum-time=7d

# Step 4: Delete core dumps
sudo find /var/lib/systemd/coredump -name "core.*" -mtime +1 -delete

# Step 5: Verify space freed
df -h
# Should show >10% free now
```
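
Before running any of the `-delete` commands above, it can help to know how much space a deletion would actually free. The following is a minimal dry-run sketch; `reclaimable_bytes` is a hypothetical helper name, and it assumes GNU `find` (for `-printf`).

```bash
#!/usr/bin/env bash
# Print the total size in bytes of files under DIR matching PATTERN that
# find considers older than DAYS days (-mtime +DAYS). Nothing is deleted.
# Assumes GNU find for -printf.
reclaimable_bytes() {
  local dir=$1 pattern=$2 days=$3
  find "$dir" -type f -name "$pattern" -mtime +"$days" -printf '%s\n' |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}
```

For example, `reclaimable_bytes data/wal '*.log' 7` reports what Step 1 would free. If the number is small, deleting WAL segments is unlikely to resolve the incident and the risk may not be worth taking.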

**Start server:**
```bash
sudo systemctl start stemedb-api

# Verify startup
curl http://localhost:18180/v1/health
```
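
The server may need a few seconds before the health endpoint responds, so a single `curl` right after `systemctl start` can fail spuriously. A minimal retry sketch follows; `wait_until` is a hypothetical helper, not a StemeDB tool.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget is spent.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

A possible invocation is `wait_until 30 2 curl -fsS http://localhost:18180/v1/health`, where `-f` makes `curl` exit non-zero on HTTP error responses. The attempt count and delay here are illustrative values, not tuned recommendations.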

**If failed:** Still >95% after cleanup → Proceed to §5 Volume Expansion immediately.

---

### §2. WAL Cleanup (Planned)

**Diagnostic:**
```bash
# Check WAL directory size
du -sh data/wal/

# Count WAL segments
ls data/wal/*.log | wc -l

# Check oldest segment
ls -lt data/wal/*.log | tail -1

# Expected: Oldest segment <7 days for pilot workloads
```
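
Reading the age of the oldest segment off `ls -lt` output is error-prone under pressure. As a sketch, the check can be reduced to a single number; `oldest_age_days` is a hypothetical helper and assumes GNU `find`.

```bash
#!/usr/bin/env bash
# Print the age in whole days of the oldest file directly under DIR that
# matches PATTERN. Prints nothing and returns 1 if no file matches.
# Assumes GNU find for -printf.
oldest_age_days() {
  local dir=$1 pattern=$2 oldest now
  oldest=$(find "$dir" -maxdepth 1 -type f -name "$pattern" -printf '%T@\n' |
    sort -n | head -n 1)
  [ -n "$oldest" ] || return 1
  now=$(date +%s)
  # %T@ has a fractional part; strip it before integer arithmetic
  echo $(( (now - ${oldest%.*}) / 86400 ))
}
```

For example, `oldest_age_days data/wal '*.log'` printing a value well above 7 would suggest that retention is not being applied.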

**Resolution: Configure WAL retention**

```bash
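# Set WAL retention to 7 days (default: unlimited)
# Note: an exported variable only affects processes started from this shell,
# not a service already managed by systemd
export STEMEDB_WAL_RETENTION_DAYS=7

# Or in config file (sudo tee, because a plain >> redirect runs without root).
# If a [wal] table already exists, edit it instead: TOML does not allow
# defining the same table twice.
cat <<EOF | sudo tee -a /etc/stemedb/config.toml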
[wal]
retention_days = 7
max_segments = 100  # Cap at 100 segments
segment_size_mb = 64  # 64MB per segment
EOF

# Restart server to apply
sudo systemctl restart stemedb-api

# Verify WAL cleanup runs
journalctl -u stemedb-api | grep "WAL cleanup"

# Expected log:
# "WAL cleanup: removed 15 segments older than 7 days"
```

**Manual WAL cleanup (safe):**
```bash
# Stop server (required for safe WAL cleanup)
sudo systemctl stop stemedb-api

# Backup current WAL first
sudo ./scripts/backup-stemedb.sh

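# Archive all current WAL segments, then move the archive to backup storage
# (the archive is written to the current directory first, so it needs free space there)
sudo tar czf wal-archive-$(date +%Y%m%d).tar.gz data/wal/*.log
sudo mv wal-archive-*.tar.gz backups/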

# Delete segments older than 7 days
sudo find data/wal -name "*.log" -mtime +7 -delete

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

**If failed:** WAL still growing rapidly → Check ingest rate, may need larger volume or WAL archival to S3 (roadmap P6.4).

---

### §3. Database Compaction

**Diagnostic:**
```bash
# Check database size
du -sh data/db/

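# Check for fragmentation: total size of the .kv files
# (plain ls -l prints sizes in bytes, which awk can sum; -h output cannot be summed)
ls -l data/db/*.kv | awk '{sum+=$5} END {print sum/1024/1024 " MB"}'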

# Check compaction metrics
curl http://localhost:18180/metrics | grep stemedb_compaction_
```

**Resolution: Trigger manual compaction**

⚠️ **NOTE:** Compaction is I/O intensive. Run during low-traffic periods.

```bash
# Trigger compaction via admin endpoint
curl -X POST http://localhost:18180/v1/admin/compact \
  -H "Content-Type: application/json" \
  -d '{"aggressive": false}'

# Monitor progress
watch -n 5 'curl -s http://localhost:18180/metrics | grep compaction_progress'

# Expected duration: 5-30 minutes for <100K assertions

# Verify space freed
df -h
du -sh data/db/
```

**Automatic compaction (recommended):**
```toml
# /etc/stemedb/config.toml
[storage]
compaction_enabled = true
compaction_interval_hours = 24  # Daily
compaction_threshold_mb = 1000  # Trigger at 1GB growth
```

**If failed:** Compaction doesn't free space → Database growth is legitimate. Proceed to §5 Volume Expansion.

---

### §4. Inode Exhaustion

**Diagnostic:**
```bash
# Check inode usage
df -i

# Expected output (exhausted):
# Filesystem     Inodes  IUsed  IFree IUse% Mounted on
# /dev/sda1      6.2M    6.2M      0  100% /

# Find directories with most files
sudo find /data -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n | tail -20
```

**Resolution: Delete small files**

```bash
# Delete temp files
sudo find data/ -type f -name "*.tmp" -delete

# Delete empty files
sudo find data/ -type f -empty -delete

# Consolidate small WAL segments (if many tiny files)
sudo systemctl stop stemedb-api

# Archive and consolidate
cd data/wal
sudo tar czf consolidated-$(date +%Y%m%d).tar.gz segment-*.log
sudo rm segment-*.log
# (Server will recreate on startup)

sudo systemctl start stemedb-api

# Verify inodes freed
df -i
```

**If failed:** Can't free inodes → May need to increase inode ratio (requires filesystem recreation) or migrate to larger volume.

---

### §5. Volume Expansion

**Diagnostic:**
```bash
# Check current volume size
df -h /data

# Check if volume is expandable
# AWS EBS example:
aws ec2 describe-volumes --volume-ids vol-xxx | jq '.Volumes[].Size'
```

**Resolution A: Expand existing volume (AWS EBS)**

```bash
# Step 1: Expand EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-xxx --size 200
# (Doubles from 100GB to 200GB)

# Step 2: Wait for modification to complete (state "optimizing" or "completed")
aws ec2 describe-volumes-modifications --volume-ids vol-xxx

# Step 3: Expand filesystem
sudo growpart /dev/nvme0n1 1  # Expand partition
sudo resize2fs /dev/nvme0n1p1  # Resize ext4
# (For XFS: sudo xfs_growfs /data)

# Step 4: Verify expansion
df -h
# Should show new size

# No restart needed, server continues running
```

**Resolution B: Add secondary volume**

```bash
# Step 1: Attach new volume (AWS example)
aws ec2 attach-volume --volume-id vol-yyy --instance-id i-xxx --device /dev/sdf

# Step 2: Format new volume
sudo mkfs.ext4 /dev/sdf

# Step 3: Mount temporarily
sudo mkdir -p /mnt/newdata
sudo mount /dev/sdf /mnt/newdata

# Step 4: Stop server and migrate
sudo systemctl stop stemedb-api
sudo rsync -av /data/ /mnt/newdata/

# Step 5: Update fstab
echo "/dev/sdf /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab

# Step 6: Remount
sudo umount /data
sudo mount /data

# Step 7: Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

**Resolution C: Archive old data to S3**

⚠️ **NOTE:** Requires roadmap P6.4 (WAL archival). Workaround: Manual archival.

```bash
# Archive WAL segments older than 30 days to S3
sudo find data/wal -name "*.log" -mtime +30 -exec echo {} \; > wal-to-archive.txt

# Upload to S3
cat wal-to-archive.txt | xargs -I {} aws s3 cp {} s3://stemedb-archive/wal/

# Verify upload, then delete local copies
cat wal-to-archive.txt | xargs -I {} sudo rm {}

# Verify space freed
df -h
```
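
The commands above delete every listed file regardless of whether its upload succeeded. A more defensive sketch deletes a local copy only when its upload command exits successfully. `archive_and_delete` and `UPLOAD_CMD` are hypothetical names introduced for illustration, and a zero exit status is only a basic check, not a substitute for listing the bucket before relying on the archive.

```bash
#!/usr/bin/env bash
# Upload command as an array so it can be swapped out; the default matches
# the `aws s3 cp` call used in this section.
UPLOAD_CMD=(aws s3 cp)

# Upload each path listed in LIST_FILE to DEST. A local copy is removed only
# when its upload exits with status 0. Returns non-zero if any upload failed.
archive_and_delete() {
  local list_file=$1 dest=$2 path failed=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    # stdin is redirected so the uploader cannot consume the file list
    if "${UPLOAD_CMD[@]}" "$path" "$dest" < /dev/null; then
      rm -- "$path"
    else
      echo "upload failed, keeping $path" >&2
      failed=1
    fi
  done < "$list_file"
  return "$failed"
}
```

A possible invocation is `archive_and_delete wal-to-archive.txt s3://stemedb-archive/wal/`, run as a user with permission to remove the segments.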

**If failed:** Can't expand volume → Migrate to new server with larger storage. See [Add Node Runbook](./add-node.md) for cluster migration.

---

## Validation

After applying resolution, validate disk health:

- [ ] **Disk usage <80%**
  ```bash
  df -h
  # Should show <80% used
  ```

- [ ] **Inodes available**
  ```bash
  df -i
  # Should show >10% inodes free
  ```

- [ ] **Server running**
  ```bash
  systemctl status stemedb-api
  # Should show: active (running)
  ```

- [ ] **Writes succeed**
  ```bash
  curl -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/disk", "predicate": "space_ok", "value": true}'
  # Should return: 201 Created
  ```

- [ ] **No disk errors in logs**
  ```bash
  journalctl -u stemedb-api --since "10 minutes ago" | grep -i "no space"
  # Should return empty (without --since, errors from before the fix still match)
  ```
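
The disk usage check in the list above relies on reading `df -h` by eye. As a sketch, it can be turned into a pass/fail command suitable for a script; `usage_below` is a hypothetical helper that parses the portable `df -P` output format.

```bash
#!/usr/bin/env bash
# Succeed if the filesystem holding PATH is below THRESHOLD percent used.
# In `df -P` output, the second line's fifth column is the capacity, e.g. "42%".
usage_below() {
  local path=$1 threshold=$2 pct
  pct=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  [ -n "$pct" ] && [ "$pct" -lt "$threshold" ]
}
```

For example, `usage_below /data 80 && echo "disk OK"` matches the <80% target in the checklist.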

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_disk
    rules:
      - alert: StemeDBDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <20% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space <10% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBInodeExhaustion
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inodes <10% available"
```

### Configuration Changes

**To prevent recurrence:**

1. **WAL retention:** Set to 7 days for pilot, 3 days for production with frequent backups
2. **Compaction:** Enable automatic daily compaction
3. **Backup cleanup:** Retain last 7 daily backups only
4. **Log rotation:** Configure systemd journal vacuum
5. **Capacity planning:** Right-size volumes based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)

**Example: Comprehensive disk management**
```toml
# /etc/stemedb/config.toml
[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64

[storage]
compaction_enabled = true
compaction_interval_hours = 24
compaction_threshold_mb = 1000

[backup]
retention_days = 7
compression_enabled = true
```

**Systemd journal vacuum:**
```bash
# Limit journal to 500MB
sudo journalctl --vacuum-size=500M

# Or limit to 7 days
sudo journalctl --vacuum-time=7d

# Make permanent
sudo mkdir -p /etc/systemd/journald.conf.d/
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/vacuum.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7day
EOF

sudo systemctl restart systemd-journald
```

---

## Capacity Planning

**Disk growth formula:**

| Component | Growth Rate | Calculation |
|-----------|-------------|-------------|
| **WAL** | ~10MB per 1K assertions | retention_days × daily_assertions × 10MB / 1000 |
| **Database** | ~50MB per 10K assertions | (total_assertions / 10000) × 50MB |
| **Indexes** | ~10% of database size | database_size × 0.1 |
| **Backups** | 1x data size per backup | (wal_size + db_size) × retention_count |

**Example: Pilot with 100K assertions, 1K assertions/day ingest, 7-day WAL retention, 7 backups retained:**
- WAL: 7 days × 1K/day × 10MB / 1000 = 70MB
- Database: (100K / 10K) × 50MB = 500MB
- Indexes: 500MB × 0.1 = 50MB
- Backups: (70MB + 500MB) × 7 = 4GB
- **Total: ~5GB** (provision 20GB for 4x headroom)
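
The arithmetic above can be reproduced with a short script, which makes it easier to rerun with different inputs. This is a minimal sketch using the growth rates from the table; `estimate_disk_mb` is a hypothetical helper, and integer division means small inputs round down.

```bash
#!/usr/bin/env bash
# Estimate disk needs in MB from the growth-rate table.
#   $1 = total assertions stored
#   $2 = assertions ingested per day
#   $3 = WAL retention in days
#   $4 = number of backups retained
estimate_disk_mb() {
  local total=$1 daily=$2 retention_days=$3 backup_count=$4
  local wal db idx backups
  wal=$(( retention_days * daily * 10 / 1000 ))   # ~10MB per 1K assertions
  db=$(( total * 50 / 10000 ))                    # ~50MB per 10K assertions
  idx=$(( db / 10 ))                              # ~10% of database size
  backups=$(( (wal + db) * backup_count ))        # one full copy per backup
  echo "wal=$wal db=$db indexes=$idx backups=$backups total=$(( wal + db + idx + backups ))"
}

estimate_disk_mb 100000 1000 7 7
# prints: wal=70 db=500 indexes=50 backups=3990 total=4610
```

The total of 4610MB is the unrounded figure behind the ~5GB estimate in the example.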

**See:** [Resource Sizing Guide](../reference-architecture/resource-sizing.md) for detailed calculations.

---

## Related Runbooks

- [Server Won't Start](./server-wont-start.md) - Disk full preventing startup
- [Restore from Backup](./restore-from-backup.md) - Need space for restore operations
- [High Query Latency](./high-query-latency.md) - Performance impact of disk pressure

---

## Last Updated

2026-02-11