Runbook: Disk Full
Symptom
- Writes fail with "No space left on device"
- Server won't start due to disk space
- Disk usage >95%
- WAL segments filling disk rapidly
- "No inodes available" errors
Metrics Alerts:
- node_filesystem_avail_bytes < 5% of total
- node_filesystem_files_free < 1000 (inode exhaustion)
Quick Diagnosis
Disk full
│
├─► Check: df -h
│ └─► >98%? → §1 Emergency Cleanup
│
├─► Check: du -sh data/wal/
│ └─► WAL using most space? → §2 WAL Cleanup
│
├─► Check: du -sh data/db/
│ └─► Database using most space? → §3 Compaction
│
├─► Check: df -i
│ └─► Inodes exhausted? → §4 Inode Exhaustion
│
└─► Normal growth, no cleanup options?
└─► §5 Volume Expansion
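The decision tree above can be sketched as a small shell function. This is only an illustration, not part of StemeDB: the name `pick_section` is hypothetical, the branches are checked in the order the tree lists them, and "using most space" is assumed to mean "more than half of the used space on the volume".

```shell
# pick_section USE_PCT INODE_PCT USED_KB WAL_KB DB_KB
# Prints the runbook section suggested by the Quick Diagnosis tree.
# Hypothetical helper; "most space" is assumed to mean more than half of USED_KB.
pick_section() (
  use_pct=$1 inode_pct=$2 used_kb=$3 wal_kb=$4 db_kb=$5
  if [ "$use_pct" -gt 98 ]; then
    echo "§1 Emergency Cleanup"
  elif [ $((wal_kb * 2)) -gt "$used_kb" ]; then
    echo "§2 WAL Cleanup"
  elif [ $((db_kb * 2)) -gt "$used_kb" ]; then
    echo "§3 Compaction"
  elif [ "$inode_pct" -ge 100 ]; then
    echo "§4 Inode Exhaustion"
  else
    echo "§5 Volume Expansion"
  fi
)
```

The inputs are plain integers read from `df -P`, `df -Pi`, and `du -sk data/wal data/db`; for example, `pick_section 96 10 1000 700 200` points at §2.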
Common Causes
- WAL segments not being cleaned up (likelihood: 50%)
  - WAL retention too long
  - Backup process holding references
  - Compaction not running
- Database growth (likelihood: 25%)
  - High ingest rate
  - No compaction configured
  - Expected growth, undersized volume
- Log files accumulating (likelihood: 15%)
  - Application logs not rotated
  - systemd journal filling disk
  - Old backups not deleted
- Inode exhaustion (likelihood: 5%)
  - Many small WAL segments
  - Temporary files not cleaned
  - Filesystem fragmentation
- Unexpected data (likelihood: 5%)
  - Core dumps
  - Large test datasets
  - Temporary files from failed operations
Resolution Steps
§1. Emergency Cleanup (Disk >98%)
Diagnostic:
# Check disk usage
df -h
# Expected output (critical):
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 100G 99G 500M 99% /
# Find largest directories
sudo du -h /data | sort -rh | head -20
Resolution: Immediate cleanup
⚠️ WARNING: Only perform when disk >98%. Always back up first if possible.
# Step 1: Delete old WAL segments (>7 days)
# ONLY if you have a recent backup!
sudo find data/wal -name "*.log" -mtime +7 -exec ls -lh {} \;
# Review list, then delete:
sudo find data/wal -name "*.log" -mtime +7 -delete
# Step 2: Delete old backups (-prune stops find from descending into directories it is about to remove)
sudo find backups/ -name "stemedb-backup-*" -mtime +30 -prune -exec rm -rf {} +
# Step 3: Delete old logs
sudo journalctl --vacuum-time=7d
# Step 4: Delete core dumps
sudo find /var/lib/systemd/coredump -name "core.*" -mtime +1 -delete
# Step 5: Verify space freed
df -h
# Should show >10% free now
Start server:
sudo systemctl start stemedb-api
# Verify startup
curl http://localhost:18180/v1/health
If failed: Still >95% after cleanup → Proceed to §5 Volume Expansion immediately.
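Before running the delete steps above, it can help to know how much space each one would actually free. A minimal dry-run sketch, assuming GNU find (for `-printf`); the helper name `reclaimable_bytes` is hypothetical and nothing is deleted:

```shell
# reclaimable_bytes DIR PATTERN DAYS
# Prints the total size in bytes of regular files under DIR that match PATTERN
# and pass find's "-mtime +DAYS" test. Deletes nothing.
# Hypothetical helper; relies on GNU find's -printf.
reclaimable_bytes() {
  find "$1" -type f -name "$2" -mtime +"$3" -printf '%s\n' 2>/dev/null |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}
```

For example, `sudo sh -c '. ./helpers.sh; reclaimable_bytes data/wal "*.log" 7'` (with `helpers.sh` standing in for wherever the function is saved) reports what Step 1 would reclaim. If the number is small, deleting WAL segments is unlikely to help and §5 is the faster path.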
§2. WAL Cleanup (Planned)
Diagnostic:
# Check WAL directory size
du -sh data/wal/
# Count WAL segments
ls data/wal/*.log | wc -l
# Check oldest segment
ls -lt data/wal/*.log | tail -1
# Expected: Oldest segment <7 days for pilot workloads
Resolution: Configure WAL retention
# Set WAL retention to 7 days (default: unlimited)
export STEMEDB_WAL_RETENTION_DAYS=7
# Or in the config file (if a [wal] table already exists, edit it instead of appending a duplicate)
sudo tee -a /etc/stemedb/config.toml <<EOF
[wal]
retention_days = 7
max_segments = 100 # Cap at 100 segments
segment_size_mb = 64 # 64MB per segment
EOF
# Restart server to apply
sudo systemctl restart stemedb-api
# Verify WAL cleanup runs
journalctl -u stemedb-api | grep "WAL cleanup"
# Expected log:
# "WAL cleanup: removed 15 segments older than 7 days"
Manual WAL cleanup (safe):
# Stop server (required for safe WAL cleanup)
sudo systemctl stop stemedb-api
# Back up current WAL first
sudo ./scripts/backup-stemedb.sh
# Archive all current WAL segments into backups/ (copy the archive to S3 or other backup storage if available)
sudo tar czf wal-archive-$(date +%Y%m%d).tar.gz data/wal/*.log
sudo mv wal-archive-*.tar.gz backups/
# Delete segments older than 7 days
sudo find data/wal -name "*.log" -mtime +7 -delete
# Start server
sudo systemctl start stemedb-api
# Verify health
curl http://localhost:18180/v1/health
If failed: WAL still growing rapidly → Check ingest rate, may need larger volume or WAL archival to S3 (roadmap P6.4).
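The "oldest segment <7 days" expectation from the diagnostic can be checked mechanically. A minimal sketch, assuming GNU find and that segments match `*.log` directly inside the WAL directory; the helper name is hypothetical:

```shell
# oldest_segment_age_days DIR
# Prints the age, in whole days, of the oldest *.log file directly inside DIR.
# Prints nothing if DIR contains no *.log files.
# Hypothetical helper; relies on GNU find's -printf.
oldest_segment_age_days() (
  oldest=$(find "$1" -maxdepth 1 -type f -name '*.log' -printf '%T@\n' |
    sort -n | head -n 1)
  [ -n "$oldest" ] || exit 0
  now=$(date +%s)
  # %T@ prints seconds with a fractional part; strip it before integer math.
  echo $(( (now - ${oldest%.*}) / 86400 ))
)
```

Running `oldest_segment_age_days data/wal` after the restart and seeing a value above the configured retention suggests cleanup is not running, or that something is still holding the old segments.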
§3. Database Compaction
Diagnostic:
# Check database size
du -sh data/db/
# Check for fragmentation: total size of .kv files in MB (ls -l reports bytes, so the sum is valid)
ls -l data/db/*.kv | awk '{sum+=$5} END {print sum/1024/1024 " MB"}'
# Check compaction metrics
curl http://localhost:18180/metrics | grep stemedb_compaction_
Resolution: Trigger manual compaction
⚠️ NOTE: Compaction is I/O intensive. Run during low-traffic periods.
# Trigger compaction via admin endpoint
curl -X POST http://localhost:18180/v1/admin/compact \
-H "Content-Type: application/json" \
-d '{"aggressive": false}'
# Monitor progress
watch -n 5 'curl -s http://localhost:18180/metrics | grep compaction_progress'
# Expected duration: 5-30 minutes for <100K assertions
# Verify space freed
df -h
du -sh data/db/
Automatic compaction (recommended):
# /etc/stemedb/config.toml
[storage]
compaction_enabled = true
compaction_interval_hours = 24 # Daily
compaction_threshold_mb = 1000 # Trigger at 1GB growth
If failed: Compaction doesn't free space → Database growth is legitimate. Proceed to §5 Volume Expansion.
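When scripting around the compaction metrics, grepping the raw `/metrics` output gets awkward. A small sketch that extracts one value from Prometheus text output; the helper name is hypothetical, and so is the full metric name in the usage line, since the runbook shows only the `stemedb_compaction_` prefix and the `compaction_progress` substring:

```shell
# metric_value NAME
# Reads Prometheus text exposition format on stdin and prints the value of the
# first sample whose name is exactly NAME (samples without labels only).
# Hypothetical helper.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Usage, assuming a gauge named `stemedb_compaction_progress` exists: `curl -s http://localhost:18180/metrics | metric_value stemedb_compaction_progress`.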
§4. Inode Exhaustion
Diagnostic:
# Check inode usage
df -i
# Expected output (exhausted):
# Filesystem Inodes IUsed IFree IUse% Mounted on
# /dev/sda1 6.2M 6.2M 0 100% /
# Find directories with most files
sudo find /data -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n | tail -20
Resolution: Delete small files
# Delete temp files
sudo find data/ -name "*.tmp" -delete
# Delete empty files
sudo find data/ -type f -empty -delete
# Consolidate small WAL segments (if many tiny files)
sudo systemctl stop stemedb-api
# Archive and consolidate (originals are removed only if the archive succeeds)
cd data/wal
sudo tar czf consolidated-$(date +%Y%m%d).tar.gz segment-*.log && sudo rm segment-*.log
# (Server will recreate on startup)
sudo systemctl start stemedb-api
# Verify inodes freed
df -i
If failed: Can't free inodes → May need to increase inode ratio (requires filesystem recreation) or migrate to larger volume.
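To judge whether small files are what is consuming inodes, count them against a size threshold. A minimal sketch; the helper name and the 4096-byte threshold in the usage line are illustrative assumptions:

```shell
# count_small_files DIR MAX_BYTES
# Prints how many regular files under DIR (same filesystem only) are smaller
# than MAX_BYTES bytes. Hypothetical helper; file names containing newlines
# would be counted more than once.
count_small_files() {
  find "$1" -xdev -type f -size -"$2"c | wc -l | tr -d ' '
}
```

For example, `sudo sh -c '. ./helpers.sh; count_small_files data/wal 4096'` (with `helpers.sh` standing in for wherever the function is saved). A count close to the total number of files in that directory suggests many tiny segments, which matches the consolidation case above.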
§5. Volume Expansion
Diagnostic:
# Check current volume size
df -h /data
# Check if volume is expandable
# AWS EBS example:
aws ec2 describe-volumes --volume-ids vol-xxx | jq '.Volumes[].Size'
Resolution A: Expand existing volume (AWS EBS)
# Step 1: Expand EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-xxx --size 200
# (Doubles from 100GB to 200GB)
# Step 2: Wait for modification to complete
aws ec2 describe-volumes-modifications --volume-ids vol-xxx
# Step 3: Expand filesystem
sudo growpart /dev/nvme0n1 1 # Expand partition
sudo resize2fs /dev/nvme0n1p1 # Resize ext4
# (For XFS: sudo xfs_growfs /data)
# Step 4: Verify expansion
df -h
# Should show new size
# No restart needed, server continues running
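Step 2 can be automated by polling until the modification reaches a state in which the filesystem can be grown. A sketch assuming the AWS CLI is configured; the function name, poll interval, and retry count are illustrative. EBS reports the states `modifying`, `optimizing`, `completed`, and `failed`, and the new size is usable from `optimizing` onward.

```shell
# wait_for_volume_modification VOLUME_ID [INTERVAL_SECONDS] [MAX_TRIES]
# Polls the EBS modification state. Prints the state and returns 0 once it is
# "optimizing" or "completed"; returns 1 on "failed" or after MAX_TRIES polls.
# Hypothetical helper; defaults (15s, 40 tries) are arbitrary.
wait_for_volume_modification() (
  vol=$1 interval=${2:-15} tries=${3:-40}
  while [ "$tries" -gt 0 ]; do
    state=$(aws ec2 describe-volumes-modifications --volume-ids "$vol" \
      --query 'VolumesModifications[0].ModificationState' --output text)
    case $state in
      optimizing|completed) echo "$state"; exit 0 ;;
      failed) echo "$state" >&2; exit 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$interval"
  done
  echo "timed out waiting for $vol" >&2
  exit 1
)
```

Usage: `wait_for_volume_modification vol-xxx && sudo growpart /dev/nvme0n1 1`, so the partition is grown only after the larger size is available.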
Resolution B: Add secondary volume
# Step 1: Attach new volume (AWS example)
aws ec2 attach-volume --volume-id vol-yyy --instance-id i-xxx --device /dev/sdf
# Step 2: Format new volume
sudo mkfs.ext4 /dev/sdf
# Step 3: Mount temporarily
sudo mkdir -p /mnt/newdata
sudo mount /dev/sdf /mnt/newdata
# Step 4: Stop server and migrate
sudo systemctl stop stemedb-api
sudo rsync -av /data/ /mnt/newdata/
# Step 5: Update fstab
echo "/dev/sdf /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
# Step 6: Remount (release the temporary mount point first)
sudo umount /mnt/newdata
sudo umount /data
sudo mount /data
# Step 7: Start server
sudo systemctl start stemedb-api
# Verify health
curl http://localhost:18180/v1/health
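Between the rsync in Step 4 and the fstab change in Step 5, it is worth confirming that the copy is complete while the server is still stopped. A minimal sketch using `diff`; the helper name is hypothetical, and it compares file contents only, not ownership or permissions:

```shell
# trees_match SRC DST
# Returns 0 if both directory trees contain the same files with identical
# contents, non-zero otherwise. Hypothetical helper; reads every file, so it
# can take a while on large databases.
trees_match() {
  diff -rq "$1" "$2" >/dev/null
}
```

Usage: `sudo sh -c '. ./helpers.sh; trees_match /data /mnt/newdata' || echo "copy incomplete, re-run rsync"` (with `helpers.sh` standing in for wherever the function is saved). A freshly formatted ext4 volume contains a `lost+found` directory, which will show up as a difference unless the source has one too.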
Resolution C: Archive old data to S3
⚠️ NOTE: Requires roadmap P6.4 (WAL archival). Workaround: Manual archival.
# Archive WAL segments older than 30 days to S3
sudo find data/wal -name "*.log" -mtime +30 -print > wal-to-archive.txt
# Upload to S3
cat wal-to-archive.txt | xargs -I {} aws s3 cp {} s3://stemedb-archive/wal/
# Verify upload, then delete local copies
cat wal-to-archive.txt | xargs -I {} sudo rm {}
# Verify space freed
df -h
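The "verify upload, then delete" step above can be made explicit per file. A sketch assuming the AWS CLI is configured; the function name is hypothetical, and comparing object size is a weak check (a checksum comparison would be stronger):

```shell
# archive_and_remove FILE BUCKET PREFIX
# Uploads FILE to s3://BUCKET/PREFIX/<basename>, then deletes the local copy
# only if the object size reported by S3 equals the local file size.
# Hypothetical helper.
archive_and_remove() (
  file=$1 bucket=$2 key="$3/$(basename "$1")"
  [ -f "$file" ] || exit 1
  local_size=$(wc -c < "$file" | tr -d ' ')
  aws s3 cp "$file" "s3://$bucket/$key" >/dev/null || exit 1
  remote_size=$(aws s3api head-object --bucket "$bucket" --key "$key" \
    --query ContentLength --output text) || exit 1
  if [ "$local_size" = "$remote_size" ]; then
    rm -- "$file"
  else
    echo "size mismatch for $file; keeping local copy" >&2
    exit 1
  fi
)
```

Usage with the list built above: `while IFS= read -r f; do archive_and_remove "$f" stemedb-archive wal; done < wal-to-archive.txt`. Run it as a user that can both read the segments and delete them.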
If failed: Can't expand volume → Migrate to new server with larger storage. See Add Node Runbook for cluster migration.
Validation
After applying resolution, validate disk health:
- Disk usage <80%
  df -h # Should show <80% used
- Inodes available
  df -i # Should show >10% inodes free
- Server running
  systemctl status stemedb-api # Should show: active (running)
- Writes succeed
  curl -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/disk", "predicate": "space_ok", "value": true}'
  # Should return: 201 Created
- No disk errors in logs
  journalctl -u stemedb-api | grep -i "no space" # Should return empty
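The two `df` checks can be turned into pass/fail tests instead of being read by eye. A minimal sketch; both helper names are hypothetical, and the parser assumes the POSIX `df -P` layout with the percentage in the fifth column:

```shell
# use_pct_from_df
# Reads `df -P` (or `df -Pi`) output on stdin and prints the percentage in the
# fifth column of the first filesystem row as a bare integer.
# Hypothetical helper; assumes the filesystem name contains no spaces.
use_pct_from_df() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# below_threshold VALUE LIMIT
# Succeeds when VALUE is strictly less than LIMIT.
below_threshold() {
  [ "$1" -lt "$2" ]
}
```

Usage: `below_threshold "$(df -P /data | use_pct_from_df)" 80 && echo "disk ok"` for the <80% check, and `below_threshold "$(df -Pi /data | use_pct_from_df)" 90 && echo "inodes ok"` for the >10% free inodes check.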
Prevention
Monitoring
Set up alerts for:
# Prometheus alert rules
groups:
  - name: stemedb_disk
    rules:
      - alert: StemeDBDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <20% on /data"
          description: "Available: {{ $value | humanizePercentage }}"
      - alert: StemeDBDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space <10% on /data"
          description: "Available: {{ $value | humanizePercentage }}"
      - alert: StemeDBInodeExhaustion
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inodes <10% available"
Configuration Changes
To prevent recurrence:
- WAL retention: Set to 7 days for pilot, 3 days for production with frequent backups
- Compaction: Enable automatic daily compaction
- Backup cleanup: Retain last 7 daily backups only
- Log rotation: Configure systemd journal vacuum
- Capacity planning: Right-size volumes based on Resource Sizing Guide
Example: Comprehensive disk management
# /etc/stemedb/config.toml
[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64
[storage]
compaction_enabled = true
compaction_interval_hours = 24
compaction_threshold_mb = 1000
[backup]
retention_days = 7
compression_enabled = true
Systemd journal vacuum:
# Limit journal to 500MB
sudo journalctl --vacuum-size=500M
# Or limit to 7 days
sudo journalctl --vacuum-time=7d
# Make permanent
sudo mkdir -p /etc/systemd/journald.conf.d/
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/vacuum.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7day
EOF
sudo systemctl restart systemd-journald
Capacity Planning
Disk growth formula:
| Component | Growth Rate | Calculation |
|---|---|---|
| WAL | ~10MB per 1K assertions | retention_days × daily_assertions × 10MB / 1000 |
| Database | ~50MB per 10K assertions | (total_assertions / 10000) × 50MB |
| Indexes | ~10% of database size | database_size × 0.1 |
| Backups | 1x data size per backup | (wal_size + db_size) × retention_count |
Example: Pilot with 100K total assertions, ingesting 1K assertions/day, with 7-day WAL retention and 7 retained backups:
- WAL: 7 days × 1K/day × 10MB / 1000 = 70MB
- Database: (100K / 10K) × 50MB = 500MB
- Indexes: 500MB × 0.1 = 50MB
- Backups: (70MB + 500MB) × 7 = 3,990MB ≈ 4GB
- Total: ~4.6GB, rounded up to ~5GB (provision 20GB for 4x headroom)
See: Resource Sizing Guide for detailed calculations.
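The growth formulas in the table can be evaluated directly in shell. A minimal sketch using integer megabytes; the function name is hypothetical and the constants are taken from the table above:

```shell
# estimate_disk_mb TOTAL_ASSERTIONS DAILY_ASSERTIONS WAL_RETENTION_DAYS BACKUP_COUNT
# Applies the capacity-planning formulas and prints each component in MB.
# Hypothetical helper; integer arithmetic, so results are rounded down.
estimate_disk_mb() (
  total=$1 daily=$2 days=$3 backups=$4
  wal=$(( days * daily * 10 / 1000 ))   # ~10MB per 1K assertions
  db=$(( total * 50 / 10000 ))          # ~50MB per 10K assertions
  idx=$(( db / 10 ))                    # ~10% of database size
  bak=$(( (wal + db) * backups ))       # 1x (WAL + DB) per retained backup
  echo "wal=$wal db=$db indexes=$idx backups=$bak total=$(( wal + db + idx + bak ))"
)
```

For the pilot example, `estimate_disk_mb 100000 1000 7 7` prints `wal=70 db=500 indexes=50 backups=3990 total=4610`, which matches the ~5GB figure before headroom.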
Related Runbooks
- Server Won't Start - Disk full preventing startup
- Restore from Backup - Need space for restore operations
- High Query Latency - Performance impact of disk pressure
Last Updated
2026-02-11