# Runbook: Disk Full

## Symptom

- Writes fail with "No space left on device"
- Server won't start due to disk space
- Disk usage >95%
- WAL segments filling disk rapidly
- "No inodes available" errors

**Metrics Alerts:**
- `node_filesystem_avail_bytes` < 5% of total
- `node_filesystem_files_free` < 1000 (inode exhaustion)

---
## Quick Diagnosis

```
Disk full
    │
    ├─► Check: df -h
    │   └─► >98%? → §1 Emergency Cleanup
    │
    ├─► Check: du -sh data/wal/
    │   └─► WAL using most space? → §2 WAL Cleanup
    │
    ├─► Check: du -sh data/db/
    │   └─► Database using most space? → §3 Compaction
    │
    ├─► Check: df -i
    │   └─► Inodes exhausted? → §4 Inode Exhaustion
    │
    └─► Normal growth, no cleanup options?
        └─► §5 Volume Expansion
```

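The decision tree above can be sketched as a small shell function. This is an illustration only: `triage` is a hypothetical helper, and reading "using most space" as "more than half of the used space" is an assumption, not something the runbook defines.

```bash
# triage <disk_use_pct> <wal_kb> <db_kb> <used_kb> <inode_use_pct>
# Prints the runbook section number (1-5) suggested by the tree above.
# Hypothetical helper; "most space" is assumed to mean more than half of used_kb.
triage() {
  local disk_pct=$1 wal_kb=$2 db_kb=$3 used_kb=$4 inode_pct=$5
  if [ "$disk_pct" -gt 98 ]; then
    echo 1                                  # Emergency Cleanup
  elif [ $(( wal_kb * 2 )) -gt "$used_kb" ]; then
    echo 2                                  # WAL Cleanup
  elif [ $(( db_kb * 2 )) -gt "$used_kb" ]; then
    echo 3                                  # Compaction
  elif [ "$inode_pct" -ge 100 ]; then
    echo 4                                  # Inode Exhaustion
  else
    echo 5                                  # Volume Expansion
  fi
}

# Inputs could be gathered with GNU coreutils, for example:
#   df --output=pcent,ipcent,used /data    (Use%, IUse%, used 1K-blocks)
#   du -sk data/wal data/db                (sizes in KiB)
```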
---

## Common Causes

1. **WAL segments not being cleaned up** (likelihood: **50%**)
   - WAL retention too long
   - Backup process holding references
   - Compaction not running

2. **Database growth** (likelihood: **25%**)
   - High ingest rate
   - No compaction configured
   - Expected growth, undersized volume

3. **Log files accumulating** (likelihood: **15%**)
   - Application logs not rotated
   - systemd journal filling disk
   - Old backups not deleted

4. **Inode exhaustion** (likelihood: **5%**)
   - Many small WAL segments
   - Temporary files not cleaned
   - Filesystem fragmentation

5. **Unexpected data** (likelihood: **5%**)
   - Core dumps
   - Large test datasets
   - Temporary files from failed operations

---
## Resolution Steps

### §1. Emergency Cleanup (Disk >98%)

**Diagnostic:**
```bash
# Check disk usage
df -h

# Expected output (critical):
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1       100G   99G  500M  99% /

# Find largest directories
sudo du -h /data | sort -rh | head -20
```

**Resolution: Immediate cleanup**

⚠️ **WARNING:** Only perform when disk >98%. Always back up first if possible.

```bash
# Step 1: Delete old WAL segments (>7 days)
# ONLY if you have a recent backup!
sudo find data/wal -name "*.log" -mtime +7 -exec ls -lh {} +
# Review list, then delete:
sudo find data/wal -name "*.log" -mtime +7 -delete

# Step 2: Delete old backups
# (-prune stops find from descending into directories it is about to remove)
sudo find backups/ -name "stemedb-backup-*" -mtime +30 -prune -exec rm -rf {} +

# Step 3: Delete old logs
sudo journalctl --vacuum-time=7d

# Step 4: Delete core dumps
sudo find /var/lib/systemd/coredump -name "core.*" -mtime +1 -delete

# Step 5: Verify space freed
df -h
# Should show >10% free now
```

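Before running the destructive steps above, it can help to estimate how much space each one would free. The following is a minimal sketch, assuming GNU `find` (for `-printf`); `reclaimable_bytes` is a hypothetical helper name, not part of StemeDB.

```bash
# reclaimable_bytes <dir> <name_pattern> <days>   (hypothetical helper)
# Prints the total size in bytes of regular files under <dir> that match
# <name_pattern> and were last modified more than <days> days ago.
reclaimable_bytes() {
  local dir=$1 pattern=$2 days=$3
  # -printf '%s\n' is GNU find: it prints each file's size in bytes.
  find "$dir" -type f -name "$pattern" -mtime +"$days" -printf '%s\n' \
    | awk '{ total += $1 } END { print total + 0 }'
}

# Example: how much would Step 1 free?
#   reclaimable_bytes data/wal '*.log' 7
```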
**Start server:**
```bash
sudo systemctl start stemedb-api

# Verify startup
curl http://localhost:18180/v1/health
```

**If failed:** Still >95% after cleanup → Proceed to §5 Volume Expansion immediately.

---
### §2. WAL Cleanup (Planned)

**Diagnostic:**
```bash
# Check WAL directory size
du -sh data/wal/

# Count WAL segments
ls data/wal/*.log | wc -l

# Check oldest segment
ls -lt data/wal/*.log | tail -1

# Expected: Oldest segment <7 days for pilot workloads
```

**Resolution: Configure WAL retention**

```bash
# Set WAL retention to 7 days (default: unlimited)
# (An exported variable only affects processes started from this shell;
#  for the systemd service, set it in the unit's environment instead.)
export STEMEDB_WAL_RETENTION_DAYS=7

# Or in config file
# (Append only if the file has no [wal] table yet; TOML does not allow
#  the same table to be defined twice.)
sudo tee -a /etc/stemedb/config.toml >/dev/null <<EOF
[wal]
retention_days = 7
max_segments = 100        # Cap at 100 segments
segment_size_mb = 64      # 64MB per segment
EOF

# Restart server to apply
sudo systemctl restart stemedb-api

# Verify WAL cleanup runs
journalctl -u stemedb-api | grep "WAL cleanup"

# Expected log:
# "WAL cleanup: removed 15 segments older than 7 days"
```

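Appending a `[wal]` table to a config file that already has one produces invalid TOML. A guarded append could look like the sketch below; `append_wal_config` is a hypothetical helper, and the keys are simply the ones shown in the config example above.

```bash
# append_wal_config <config_file>   (hypothetical helper)
# Appends the [wal] settings only when the file does not already
# contain a [wal] table header; otherwise leaves the file untouched.
append_wal_config() {
  local cfg=$1
  if grep -Eq '^[[:space:]]*\[wal\][[:space:]]*(#.*)?$' "$cfg" 2>/dev/null; then
    echo "[wal] already defined in $cfg; edit it in place" >&2
    return 1
  fi
  cat >> "$cfg" <<'EOF'

[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64
EOF
}

# Example (run as a user allowed to write the file):
#   append_wal_config /etc/stemedb/config.toml
```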
**Manual WAL cleanup (safe):**
```bash
# Stop server (required for safe WAL cleanup)
sudo systemctl stop stemedb-api

# Backup current WAL first
sudo ./scripts/backup-stemedb.sh

# Archive WAL segments to S3/backup storage
# (this archives every segment, not only the old ones)
sudo tar czf wal-archive-$(date +%Y%m%d).tar.gz data/wal/*.log
sudo mv wal-archive-*.tar.gz backups/

# Delete segments older than 7 days
sudo find data/wal -name "*.log" -mtime +7 -delete

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

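The manual procedure above archives and deletes in separate steps, so nothing ties the deletion to a successful archive. One way to link them is sketched below, assuming GNU `find`, `tar`, and `xargs`; `archive_old_segments` is a hypothetical helper, and it checks only that the archive lists as many members as were selected, not their contents.

```bash
# archive_old_segments <wal_dir> <days> <archive_path>   (hypothetical helper)
# Archives *.log files older than <days> days, verifies the archive lists
# the same number of members, and only then deletes the originals.
archive_old_segments() {
  local wal_dir=$1 days=$2 archive=$3 list count archived
  list=$(mktemp) || return 1
  # Collect candidates once so archiving and deletion use the same file set.
  find "$wal_dir" -maxdepth 1 -type f -name '*.log' -mtime +"$days" -print0 > "$list"
  count=$(( $(tr -cd '\0' < "$list" | wc -c) ))
  if [ "$count" -eq 0 ]; then
    rm -f "$list"
    echo "nothing to archive"
    return 0
  fi
  tar czf "$archive" --null -T "$list" || { rm -f "$list"; return 1; }
  # Verify: the archive must list exactly as many members as were selected.
  archived=$(( $(tar tzf "$archive" | wc -l) ))
  if [ "$archived" -ne "$count" ]; then
    rm -f "$list"
    echo "archive verification failed" >&2
    return 1
  fi
  xargs -0 rm -f -- < "$list"
  rm -f "$list"
  echo "archived and removed $count segment(s)"
}

# Example (server stopped, archive written to a different volume):
#   archive_old_segments data/wal 7 /mnt/backup/wal-archive-$(date +%Y%m%d).tar.gz
```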
**If failed:** WAL still growing rapidly → Check ingest rate, may need larger volume or WAL archival to S3 (roadmap P6.4).

---

### §3. Database Compaction

**Diagnostic:**
```bash
# Check database size
du -sh data/db/

# Check for fragmentation
# (plain ls -l so the size column is in bytes and can be summed)
ls -l data/db/*.kv | awk '{sum+=$5} END {print sum/1024/1024 " MB"}'

# Check compaction metrics
curl http://localhost:18180/metrics | grep stemedb_compaction_
```

**Resolution: Trigger manual compaction**

⚠️ **NOTE:** Compaction is I/O intensive. Run during low-traffic periods.

```bash
# Trigger compaction via admin endpoint
curl -X POST http://localhost:18180/v1/admin/compact \
  -H "Content-Type: application/json" \
  -d '{"aggressive": false}'

# Monitor progress
watch -n 5 'curl -s http://localhost:18180/metrics | grep compaction_progress'

# Expected duration: 5-30 minutes for <100K assertions

# Verify space freed
df -h
du -sh data/db/
```

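To script around the `watch` loop above, a single metric value can be pulled out of the `/metrics` output. The sketch below parses the Prometheus text format from stdin; `metric_value` is a hypothetical helper, the metric name `stemedb_compaction_progress` in the example is an assumption based on the grep patterns above, and label values containing spaces are not handled.

```bash
# metric_value <metric_name>   (hypothetical helper)
# Reads Prometheus text exposition format on stdin and prints the value
# of the first sample whose metric name matches exactly.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    { split($1, parts, "{") }            # drop any {label="..."} suffix
    parts[1] == name { print $2; exit }
  '
}

# Example (metric name is assumed, not confirmed by the runbook):
#   curl -s http://localhost:18180/metrics | metric_value stemedb_compaction_progress
```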
**Automatic compaction (recommended):**
```toml
# /etc/stemedb/config.toml
[storage]
compaction_enabled = true
compaction_interval_hours = 24  # Daily
compaction_threshold_mb = 1000  # Trigger at 1GB growth
```

**If failed:** Compaction doesn't free space → Database growth is legitimate. Proceed to §5 Volume Expansion.

---
### §4. Inode Exhaustion

**Diagnostic:**
```bash
# Check inode usage
df -i

# Expected output (exhausted):
# Filesystem      Inodes  IUsed  IFree IUse% Mounted on
# /dev/sda1         6.2M   6.2M      0  100% /

# Find directories with most files
sudo find /data -xdev -printf '%h\n' | sort | uniq -c | sort -n | tail -20
```

**Resolution: Delete small files**

```bash
# Find temp files (review the list first; some may be in use while the
# server is running), then delete
sudo find data/ -name "*.tmp" -print
sudo find data/ -name "*.tmp" -delete

# Find empty files
sudo find data/ -type f -empty -delete

# Consolidate small WAL segments (if many tiny files)
sudo systemctl stop stemedb-api

# Archive and consolidate
# CAUTION: this removes every WAL segment. Take a backup first and keep
# the archive until the server has restarted cleanly.
cd data/wal
sudo tar czf consolidated-$(date +%Y%m%d).tar.gz segment-*.log \
  && sudo rm segment-*.log
# (Server will recreate on startup)

sudo systemctl start stemedb-api

# Verify inodes freed
df -i
```

**If failed:** Can't free inodes → May need to increase inode ratio (requires filesystem recreation) or migrate to larger volume.

---

### §5. Volume Expansion

**Diagnostic:**
```bash
# Check current volume size
df -h /data

# Check if volume is expandable
# AWS EBS example:
aws ec2 describe-volumes --volume-ids vol-xxx | jq '.Volumes[].Size'
```

**Resolution A: Expand existing volume (AWS EBS)**

```bash
# Step 1: Expand EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-xxx --size 200
# (Doubles from 100GB to 200GB)

# Step 2: Wait for modification to complete
# (this subcommand takes the plural --volume-ids)
aws ec2 describe-volumes-modifications --volume-ids vol-xxx

# Step 3: Expand filesystem
# (device names are examples; confirm yours with lsblk)
sudo growpart /dev/nvme0n1 1    # Expand partition
sudo resize2fs /dev/nvme0n1p1   # Resize ext4
# (For XFS: sudo xfs_growfs /data)

# Step 4: Verify expansion
df -h
# Should show new size

# No restart needed, server continues running
```

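Step 2 above means polling until the volume modification has progressed far enough to grow the filesystem. A generic polling helper is sketched below; `wait_until` and `volume_resize_ready` are hypothetical names, and the AWS predicate assumes a configured AWS CLI. It treats the `optimizing` and `completed` modification states as ready.

```bash
# wait_until <max_attempts> <sleep_seconds> <command...>   (hypothetical helper)
# Re-runs <command...> until it succeeds or the attempts are used up.
wait_until() {
  local max=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}

# Example predicate (requires the AWS CLI; vol-xxx is the placeholder used above).
volume_resize_ready() {
  local state
  state=$(aws ec2 describe-volumes-modifications --volume-ids "$1" \
            --query 'VolumesModifications[0].ModificationState' \
            --output text) || return 1
  [ "$state" = "optimizing" ] || [ "$state" = "completed" ]
}

# Usage: poll every 10 seconds, for up to 10 minutes.
#   wait_until 60 10 volume_resize_ready vol-xxx && echo "ready to grow filesystem"
```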
**Resolution B: Add secondary volume**

```bash
# Step 1: Attach new volume (AWS example)
aws ec2 attach-volume --volume-id vol-yyy --instance-id i-xxx --device /dev/sdf

# Step 2: Format new volume
# (on Nitro instances the device may appear as /dev/nvme*n1; confirm with lsblk)
sudo mkfs.ext4 /dev/sdf

# Step 3: Mount temporarily
sudo mkdir -p /mnt/newdata
sudo mount /dev/sdf /mnt/newdata

# Step 4: Stop server and migrate
sudo systemctl stop stemedb-api
sudo rsync -av /data/ /mnt/newdata/

# Step 5: Update fstab
# (remove or comment out any existing /data entry first)
echo "/dev/sdf /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab

# Step 6: Remount
sudo umount /mnt/newdata
sudo umount /data   # only needed if /data is currently its own mount point
sudo mount /data

# Step 7: Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health
```

**Resolution C: Archive old data to S3**

⚠️ **NOTE:** Requires roadmap P6.4 (WAL archival). Workaround: Manual archival.

```bash
# List WAL segments older than 30 days
sudo find data/wal -name "*.log" -mtime +30 -print > wal-to-archive.txt

# Upload to S3
xargs -I {} aws s3 cp {} s3://stemedb-archive/wal/ < wal-to-archive.txt

# Verify upload (e.g. aws s3 ls s3://stemedb-archive/wal/), then delete local copies
xargs -I {} sudo rm {} < wal-to-archive.txt

# Verify space freed
df -h
```

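In the manual workaround above, the delete step runs regardless of whether each upload succeeded. A per-file variant that deletes a segment only after its own upload command exits successfully is sketched here; `archive_each` and `s3_put` are hypothetical names, and the bucket path is the one used in the example above.

```bash
# archive_each <list_file> <upload_cmd...>   (hypothetical helper)
# Runs "<upload_cmd...> <file>" for every path in <list_file> and deletes
# the local file only when that upload command exits 0.
archive_each() {
  local list=$1 file failed=0
  shift
  while IFS= read -r file; do
    [ -n "$file" ] || continue
    # </dev/null keeps the upload command from consuming the file list.
    if "$@" "$file" </dev/null; then
      rm -f -- "$file"
    else
      echo "upload failed, keeping $file" >&2
      failed=1
    fi
  done < "$list"
  return "$failed"
}

# Example uploader (requires the AWS CLI):
#   s3_put() { aws s3 cp "$1" s3://stemedb-archive/wal/; }
#   archive_each wal-to-archive.txt s3_put
```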
**If failed:** Can't expand volume → Migrate to new server with larger storage. See [Add Node Runbook](./add-node.md) for cluster migration.

---
## Validation

After applying resolution, validate disk health:

- [ ] **Disk usage <80%**
  ```bash
  df -h
  # Should show <80% used
  ```

- [ ] **Inodes available**
  ```bash
  df -i
  # Should show >10% inodes free
  ```

- [ ] **Server running**
  ```bash
  systemctl status stemedb-api
  # Should show: active (running)
  ```

- [ ] **Writes succeed**
  ```bash
  curl -s -o /dev/null -w '%{http_code}\n' -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/disk", "predicate": "space_ok", "value": true}'
  # Should print: 201
  ```

- [ ] **No disk errors in logs**
  ```bash
  journalctl -u stemedb-api --since "10 min ago" | grep -i "no space"
  # Should return empty
  ```

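The first two checks above can be automated by parsing `df` output rather than reading it by eye. A minimal sketch follows; `use_pct` and `below_limit` are hypothetical helpers, and the usage lines assume `/data` is its own mount point.

```bash
# use_pct <mountpoint>   (hypothetical helper)
# Reads `df -P` (or `df -iP`) output on stdin and prints the Use%
# (or IUse%) of the row mounted at <mountpoint>, digits only.
use_pct() {
  awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%$/, "", $(NF-1)); print $(NF-1); exit }'
}

# below_limit <value> <limit>: succeeds when value < limit.
below_limit() {
  [ "$1" -lt "$2" ]
}

# Example (thresholds taken from the checklist above):
#   below_limit "$(df -P /data | use_pct /data)" 80 && echo "disk OK"
#   below_limit "$(df -iP /data | use_pct /data)" 90 && echo "inodes OK"
```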
---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_disk
    rules:
      - alert: StemeDBDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <20% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space <10% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBInodeExhaustion
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inodes <10% available"
```

### Configuration Changes

**To prevent recurrence:**

1. **WAL retention:** Set to 7 days for pilot, 3 days for production with frequent backups
2. **Compaction:** Enable automatic daily compaction
3. **Backup cleanup:** Retain last 7 daily backups only
4. **Log rotation:** Configure systemd journal vacuum
5. **Capacity planning:** Right-size volumes based on [Resource Sizing Guide](../reference-architecture/resource-sizing.md)

**Example: Comprehensive disk management**
```toml
# /etc/stemedb/config.toml
[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64

[storage]
compaction_enabled = true
compaction_interval_hours = 24
compaction_threshold_mb = 1000

[backup]
retention_days = 7
compression_enabled = true
```

**Systemd journal vacuum:**
```bash
# Limit journal to 500MB
sudo journalctl --vacuum-size=500M

# Or limit to 7 days
sudo journalctl --vacuum-time=7d

# Make permanent
sudo mkdir -p /etc/systemd/journald.conf.d/
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/vacuum.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7day
EOF

sudo systemctl restart systemd-journald
```

---
## Capacity Planning

**Disk growth formula:**

| Component | Growth Rate | Calculation |
|-----------|-------------|-------------|
| **WAL** | ~10 MB per 1K assertions | retention_days × daily_assertions × 10 MB / 1000 |
| **Database** | ~50 MB per 10K assertions | (total_assertions / 10000) × 50 MB |
| **Indexes** | ~10% of database size | database_size × 0.1 |
| **Backups** | 1x data size per backup | (wal_size + db_size) × retention_count |

**Example: Pilot with 100K assertions, 7-day retention (assuming 1K new assertions per day and 7 retained backups):**
- WAL: 7 days × 1K/day × 10 MB / 1000 = 70 MB
- Database: (100K / 10K) × 50 MB = 500 MB
- Indexes: 500 MB × 0.1 = 50 MB
- Backups: (70 MB + 500 MB) × 7 = 3,990 MB ≈ 4 GB
- **Total: ~4.6 GB, rounded up to ~5 GB** (provision 20 GB for 4x headroom)

**See:** [Resource Sizing Guide](../reference-architecture/resource-sizing.md) for detailed calculations.

---
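The capacity-planning formulas above can be turned into a quick estimate in shell integer arithmetic. This is a rough sketch that applies the table's growth rates as given; `estimate_disk_mb` is a hypothetical helper, and all results are in MB.

```bash
# estimate_disk_mb <wal_retention_days> <daily_assertions> <total_assertions> <backup_count>
# Hypothetical helper; applies the growth rates from the capacity table (MB).
estimate_disk_mb() {
  local days=$1 daily=$2 total=$3 backups=$4
  local wal=$(( days * daily * 10 / 1000 ))   # ~10 MB per 1K assertions
  local db=$(( total * 50 / 10000 ))          # ~50 MB per 10K assertions
  local idx=$(( db / 10 ))                    # ~10% of database size
  local bak=$(( (wal + db) * backups ))       # one full copy per retained backup
  echo "wal=$wal db=$db indexes=$idx backups=$bak total=$(( wal + db + idx + bak ))"
}

# Pilot example from above: 7-day retention, 1K assertions/day, 100K total, 7 backups
estimate_disk_mb 7 1000 100000 7
# prints: wal=70 db=500 indexes=50 backups=3990 total=4610
```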
## Related Runbooks

- [Server Won't Start](./server-wont-start.md) - Disk full preventing startup
- [Restore from Backup](./restore-from-backup.md) - Need space for restore operations
- [High Query Latency](./high-query-latency.md) - Performance impact of disk pressure

---

## Last Updated

2026-02-11