jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


Runbook: Disk Full

Symptom

Writes fail with "No space left on device"
Server won't start due to disk space
Disk usage >95%
WAL segments filling disk rapidly
"No inodes available" errors

Metrics Alerts:

node_filesystem_avail_bytes < 5% of total
node_filesystem_files_free < 1000 (inode exhaustion)

Quick Diagnosis

Disk full
    │
    ├─► Check: df -h
    │   └─► >98%? → §1 Emergency Cleanup
    │
    ├─► Check: du -sh data/wal/
    │   └─► WAL using most space? → §2 WAL Cleanup
    │
    ├─► Check: du -sh data/db/
    │   └─► Database using most space? → §3 Compaction
    │
    ├─► Check: df -i
    │   └─► Inodes exhausted? → §4 Inode Exhaustion
    │
    └─► Normal growth, no cleanup options?
        └─► §5 Volume Expansion
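
The decision tree above can be sketched as a small shell function. This is only an illustration, not part of StemeDB: the name `triage_disk` and its five arguments are hypothetical, and "using most space" is assumed here to mean more than half of the used space on the volume.

```shell
# Hypothetical helper: map disk measurements to a runbook section.
# Arguments (all integers):
#   $1 disk use percent   (the "Use%" column of df -h)
#   $2 inode use percent  (the "IUse%" column of df -i)
#   $3 KiB used on the volume
#   $4 KiB used by data/wal (du -sk data/wal)
#   $5 KiB used by data/db  (du -sk data/db)
triage_disk() {
  use_pct=$1; inode_pct=$2; used_kb=$3; wal_kb=$4; db_kb=$5
  if [ "$use_pct" -gt 98 ]; then
    echo "§1 Emergency Cleanup"
  elif [ $((wal_kb * 2)) -gt "$used_kb" ]; then
    # Assumption: "most space" means more than half of what is used
    echo "§2 WAL Cleanup"
  elif [ $((db_kb * 2)) -gt "$used_kb" ]; then
    echo "§3 Compaction"
  elif [ "$inode_pct" -ge 100 ]; then
    echo "§4 Inode Exhaustion"
  else
    echo "§5 Volume Expansion"
  fi
}
```

For example, `triage_disk 90 40 1000 600 100` prints `§2 WAL Cleanup`, because the WAL accounts for more than half of the used space while usage is still below 98%.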

Common Causes

WAL segments not being cleaned up — Likelihood: 50%
- WAL retention too long
- Backup process holding references
- Compaction not running
Database growth — Likelihood: 25%
- High ingest rate
- No compaction configured
- Expected growth, undersized volume
Log files accumulating — Likelihood: 15%
- Application logs not rotated
- systemd journal filling disk
- Old backups not deleted
Inode exhaustion — Likelihood: 5%
- Many small WAL segments
- Temporary files not cleaned
- Filesystem fragmentation
Unexpected data — Likelihood: 5%
- Core dumps
- Large test datasets
- Temporary files from failed operations

Resolution Steps

§1. Emergency Cleanup (Disk >98%)

Diagnostic:

# Check disk usage
df -h

# Expected output (critical):
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1       100G   99G  500M  99% /

# Find largest directories
sudo du -h /data | sort -rh | head -20

Resolution: Immediate cleanup

⚠️ WARNING: Only perform when disk >98%. Always back up first if possible.

# Step 1: Delete old WAL segments (>7 days)
# ONLY if you have a recent backup!
sudo find data/wal -name "*.log" -mtime +7 -exec ls -lh {} \;
# Review list, then delete:
sudo find data/wal -name "*.log" -mtime +7 -delete

# Step 2: Delete old backups
# (-prune stops find from descending into directories it is about to remove)
sudo find backups/ -name "stemedb-backup-*" -mtime +30 -prune -exec rm -rf {} +

# Step 3: Delete old logs
sudo journalctl --vacuum-time=7d

# Step 4: Delete core dumps
sudo find /var/lib/systemd/coredump -name "core.*" -mtime +1 -delete

# Step 5: Verify space freed
df -h
# Should show >10% free now

Start server:

sudo systemctl start stemedb-api

# Verify startup
curl http://localhost:18180/v1/health

If failed: Still >95% after cleanup → Proceed to §5 Volume Expansion immediately.
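
Before running the delete in Step 1, it can help to know how much space the cleanup would actually reclaim. The following is a minimal dry-run sketch, assuming GNU `find`; the function name `wal_reclaimable_kb` is hypothetical, and it reports apparent file sizes, which may differ slightly from the blocks freed on disk.

```shell
# Hypothetical helper: report how many KiB (apparent size) are held by
# *.log files older than N days, without deleting anything.
# Usage: wal_reclaimable_kb <wal_dir> <days>
wal_reclaimable_kb() {
  # -printf '%s\n' prints each matching file's size in bytes (GNU find)
  find "$1" -name '*.log' -type f -mtime +"$2" -printf '%s\n' |
    awk '{ bytes += $1 } END { printf "%d\n", bytes / 1024 }'
}
```

For example, `wal_reclaimable_kb data/wal 7` previews the effect of the 7-day cleanup above. If the number is small compared with what `df -h` reports as used, deleting old segments alone is unlikely to get the volume back above 10% free.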

§2. WAL Cleanup (Planned)

Diagnostic:

# Check WAL directory size
du -sh data/wal/

# Count WAL segments
ls data/wal/*.log | wc -l

# Check oldest segment
ls -lt data/wal/*.log | tail -1

# Expected: Oldest segment <7 days for pilot workloads

Resolution: Configure WAL retention

# Set WAL retention to 7 days (default: unlimited)
export STEMEDB_WAL_RETENTION_DAYS=7

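# Or in config file (requires root; if a [wal] section already exists,
# edit it instead, because TOML does not allow defining a table twice)
sudo tee -a /etc/stemedb/config.toml >/dev/null <<EOF
[wal]
retention_days = 7
max_segments = 100  # Cap at 100 segments
segment_size_mb = 64  # 64MB per segment
EOF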

# Restart server to apply
sudo systemctl restart stemedb-api

# Verify WAL cleanup runs
journalctl -u stemedb-api | grep "WAL cleanup"

# Expected log:
# "WAL cleanup: removed 15 segments older than 7 days"

Manual WAL cleanup (safe):

# Stop server (required for safe WAL cleanup)
sudo systemctl stop stemedb-api

# Backup current WAL first
sudo ./scripts/backup-stemedb.sh

# Archive all current WAL segments into backups/ before deleting any
sudo tar czf "backups/wal-archive-$(date +%Y%m%d).tar.gz" data/wal/*.log

# Delete segments older than 7 days
sudo find data/wal -name "*.log" -mtime +7 -delete

# Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health

If failed: WAL still growing rapidly → Check ingest rate, may need larger volume or WAL archival to S3 (roadmap P6.4).
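
The diagnostic above expects the oldest segment to be less than 7 days old. A minimal sketch of that check follows, assuming GNU `find`; the function name `oldest_segment_age_days` is hypothetical and it relies only on file modification times, not on anything StemeDB reports.

```shell
# Hypothetical helper: age in whole days of the oldest *.log file in a
# directory. Prints nothing and returns 1 when there are no segments.
# Usage: oldest_segment_age_days <wal_dir>
oldest_segment_age_days() {
  # %T@ is the modification time as seconds since the epoch (GNU find)
  oldest=$(find "$1" -name '*.log' -type f -printf '%T@\n' | sort -n | head -n 1)
  [ -n "$oldest" ] || return 1
  now=$(date +%s)
  # Strip the fractional seconds before doing integer arithmetic
  echo $(( (now - ${oldest%.*}) / 86400 ))
}
```

A possible use after enabling retention is `[ "$(oldest_segment_age_days data/wal)" -le 7 ]`, which fails if segments older than the configured window are still present.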

§3. Database Compaction

Diagnostic:

# Check database size
du -sh data/db/

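# Sum the apparent size of the .kv files and compare with the du figure above
# (use ls -l, not ls -lh: human-readable sizes such as "1.2G" cannot be summed)
ls -l data/db/*.kv | awk '{sum+=$5} END {print sum/1024/1024 " MB"}'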

# Check compaction metrics
curl http://localhost:18180/metrics | grep stemedb_compaction_

Resolution: Trigger manual compaction

⚠️ NOTE: Compaction is I/O intensive. Run during low-traffic periods.

# Trigger compaction via admin endpoint
curl -X POST http://localhost:18180/v1/admin/compact \
  -H "Content-Type: application/json" \
  -d '{"aggressive": false}'

# Monitor progress
watch -n 5 'curl -s http://localhost:18180/metrics | grep compaction_progress'

# Expected duration: 5-30 minutes for <100K assertions

# Verify space freed
df -h
du -sh data/db/

Automatic compaction (recommended):

# /etc/stemedb/config.toml
[storage]
compaction_enabled = true
compaction_interval_hours = 24  # Daily
compaction_threshold_mb = 1000  # Trigger at 1GB growth

If failed: Compaction doesn't free space → Database growth is legitimate. Proceed to §5 Volume Expansion.

§4. Inode Exhaustion

Diagnostic:

# Check inode usage
df -i

# Expected output (exhausted):
# Filesystem     Inodes  IUsed  IFree IUse% Mounted on
# /dev/sda1      6.2M    6.2M      0  100% /

# Find directories with most files
sudo find /data -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n | tail -20

Resolution: Delete small files

# Find temp files
sudo find data/ -name "*.tmp" -delete

# Find empty files
sudo find data/ -type f -empty -delete

# Consolidate small WAL segments (if many tiny files)
sudo systemctl stop stemedb-api

# Archive and consolidate
cd data/wal
sudo tar czf consolidated-$(date +%Y%m%d).tar.gz segment-*.log
sudo rm segment-*.log
# (Server will recreate on startup)

sudo systemctl start stemedb-api

# Verify inodes freed
df -i

If failed: Can't free inodes → May need to increase inode ratio (requires filesystem recreation) or migrate to larger volume.
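
To preview how many inodes the temp-file and empty-file cleanup above would free before deleting anything, a dry-run count can be sketched as follows. The function name `count_reclaimable_inodes` is hypothetical, GNU `find` is assumed, and unlike the commands above it only counts regular files.

```shell
# Hypothetical helper: count regular files that are either *.tmp or empty,
# which is roughly the number of inodes the cleanup would release.
# Usage: count_reclaimable_inodes <dir>
count_reclaimable_inodes() {
  # Print one dot per match and count the dots, so that file names
  # containing newlines are not counted twice
  find "$1" -type f \( -name '*.tmp' -o -empty \) -printf '.' | wc -c
}
```

If `count_reclaimable_inodes data/` returns only a handful of files while `df -i` shows millions in use, the cleanup will not help much and the WAL consolidation or a larger volume is the more likely fix.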

§5. Volume Expansion

Diagnostic:

# Check current volume size
df -h /data

# Check if volume is expandable
# AWS EBS example:
aws ec2 describe-volumes --volume-ids vol-xxx | jq '.Volumes[].Size'

Resolution A: Expand existing volume (AWS EBS)

# Step 1: Expand EBS volume (AWS example)
aws ec2 modify-volume --volume-id vol-xxx --size 200
# (Doubles from 100GB to 200GB)

# Step 2: Wait for modification to complete
aws ec2 describe-volumes-modifications --volume-ids vol-xxx

# Step 3: Expand filesystem
sudo growpart /dev/nvme0n1 1  # Expand partition
sudo resize2fs /dev/nvme0n1p1  # Resize ext4
# (For XFS: sudo xfs_growfs /data)

# Step 4: Verify expansion
df -h
# Should show new size

# No restart needed, server continues running
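
Step 2 above is a manual check. A polling loop along the following lines could automate the wait. This is a sketch under assumptions: `wait_until` and `ebs_resize_ready` are hypothetical names, the timeout and interval are arbitrary, and the check assumes that the filesystem can be grown once the modification state is `optimizing` or `completed`.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  timeout_s=$1; interval_s=$2; shift 2
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout_s" ] && return 1
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Hypothetical check: succeeds once the volume modification has progressed
# far enough for the partition and filesystem to be extended.
# Usage: ebs_resize_ready <volume-id>
ebs_resize_ready() {
  state=$(aws ec2 describe-volumes-modifications --volume-ids "$1" \
    --query 'VolumesModifications[0].ModificationState' --output text)
  [ "$state" = "optimizing" ] || [ "$state" = "completed" ]
}
```

For example, `wait_until 600 15 ebs_resize_ready vol-xxx && sudo growpart /dev/nvme0n1 1` waits up to ten minutes before moving on to Step 3.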

Resolution B: Add secondary volume

# Step 1: Attach new volume (AWS example)
aws ec2 attach-volume --volume-id vol-yyy --instance-id i-xxx --device /dev/sdf

# Step 2: Format new volume
sudo mkfs.ext4 /dev/sdf

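# Step 3: Mount temporarily
sudo mkdir -p /mnt/newdata
sudo mount /dev/sdf /mnt/newdata

# Step 4: Stop server and migrate
sudo systemctl stop stemedb-api
sudo rsync -av /data/ /mnt/newdata/

# Step 5: Update fstab (replace any existing /data entry rather than adding a second one)
echo "/dev/sdf /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab

# Step 6: Remount
sudo umount /mnt/newdata
sudo umount /data  # Only needed if /data is currently its own mount
sudo mount /data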

# Step 7: Start server
sudo systemctl start stemedb-api

# Verify health
curl http://localhost:18180/v1/health

Resolution C: Archive old data to S3

⚠️ NOTE: Requires roadmap P6.4 (WAL archival). Workaround: Manual archival.

# Archive WAL segments older than 30 days to S3
sudo find data/wal -name "*.log" -mtime +30 -exec echo {} \; > wal-to-archive.txt

# Upload to S3
cat wal-to-archive.txt | xargs -I {} aws s3 cp {} s3://stemedb-archive/wal/

# Verify upload, then delete local copies
cat wal-to-archive.txt | xargs -I {} sudo rm {}

# Verify space freed
df -h

If failed: Can't expand volume → Migrate to new server with larger storage. See Add Node Runbook for cluster migration.
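
The manual archival in Resolution C deletes every listed file in one pass, even if some uploads failed. A more defensive variant removes each local copy only after its own upload succeeded. This is a sketch, not an official script: `archive_and_remove` and the `UPLOAD_CMD` variable are hypothetical, and it must run as a user that is allowed to delete the segments.

```shell
# Hypothetical helper: upload each file named in <list_file> and delete the
# local copy only when that upload succeeded.
# UPLOAD_CMD is an assumption that lets the uploader be swapped or stubbed;
# it defaults to "aws s3 cp".
# Usage: archive_and_remove <list_file> <destination_prefix>
archive_and_remove() {
  failed=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    # </dev/null keeps the uploader from consuming the rest of the list
    if ${UPLOAD_CMD:-aws s3 cp} "$path" "$2" </dev/null; then
      rm -- "$path"
    else
      echo "upload failed, keeping $path" >&2
      failed=1
    fi
  done < "$1"
  return "$failed"
}
```

With the list produced above, `archive_and_remove wal-to-archive.txt s3://stemedb-archive/wal/` returns non-zero if any segment could not be uploaded, and those segments stay on disk.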

Validation

After applying resolution, validate disk health:

Disk usage <80%

df -h
# Should show <80% used

Inodes available

df -i
# Should show >10% inodes free

Server running

systemctl status stemedb-api
# Should show: active (running)

Writes succeed

curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/disk", "predicate": "space_ok", "value": true}'
# Should return: 201 Created

No disk errors in logs

journalctl -u stemedb-api | grep -i "no space"
# Should return empty
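
The disk usage check above can be turned into a pass/fail test for use in scripts or drills. A minimal sketch, assuming a hypothetical function name `disk_usage_below` and the POSIX output format of `df -P`:

```shell
# Hypothetical helper: succeed only if the filesystem holding <path>
# is below <limit_pct> percent used.
# Usage: disk_usage_below <path> <limit_pct>
disk_usage_below() {
  # With -P, the second line holds the data and column 5 is the use percentage
  pct=$(df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }')
  echo "$1 is ${pct}% used (limit ${2}%)"
  [ "$pct" -lt "$2" ]
}
```

For example, `disk_usage_below /data 80 || echo "still above target"` mirrors the first validation item.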

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_disk
    rules:
      - alert: StemeDBDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <20% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space <10% on /data"
          description: "Available: {{ $value | humanizePercentage }}"

      - alert: StemeDBInodeExhaustion
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inodes <10% available"

Configuration Changes

To prevent recurrence:

WAL retention: Set to 7 days for pilot, 3 days for production with frequent backups
Compaction: Enable automatic daily compaction
Backup cleanup: Retain last 7 daily backups only
Log rotation: Configure systemd journal vacuum
Capacity planning: Right-size volumes based on Resource Sizing Guide

Example: Comprehensive disk management

# /etc/stemedb/config.toml
[wal]
retention_days = 7
max_segments = 100
segment_size_mb = 64

[storage]
compaction_enabled = true
compaction_interval_hours = 24
compaction_threshold_mb = 1000

[backup]
retention_days = 7
compression_enabled = true

Systemd journal vacuum:

# Limit journal to 500MB
sudo journalctl --vacuum-size=500M

# Or limit to 7 days
sudo journalctl --vacuum-time=7d

# Make permanent
sudo mkdir -p /etc/systemd/journald.conf.d/
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/vacuum.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7day
EOF

sudo systemctl restart systemd-journald

Capacity Planning

Disk growth formula:

Component	Growth Rate	Calculation
WAL	~10MB per 1K assertions	retention_days × daily_assertions × 10MB / 1000
Database	~50MB per 10K assertions	(total_assertions / 10000) × 50MB
Indexes	~10% of database size	database_size × 0.1
Backups	1x data size per backup	(wal_size + db_size) × retention_count

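Example: Pilot with 100K total assertions, ingesting 1K assertions/day, 7-day WAL retention, 7 retained backups:

WAL: 7 days × 1K/day × 10MB / 1000 = 70MB
Database: (100K / 10K) × 50MB = 500MB
Indexes: 500MB × 0.1 = 50MB
Backups: (70MB + 500MB) × 7 = 3,990MB ≈ 4GB
Total: 70MB + 500MB + 50MB + 3,990MB ≈ 4.6GB, rounded up to ~5GB (provision 20GB for 4x headroom)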

See: Resource Sizing Guide for detailed calculations.
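
The growth formulas in the table can be evaluated with a few lines of shell arithmetic. This is a sketch of the table as written, using integer megabytes; the function name `estimate_disk_mb` and its argument order are hypothetical, and the 4x headroom factor is taken from the example above.

```shell
# Hypothetical helper: apply the growth formulas from the table (integer MB).
# Usage: estimate_disk_mb <total_assertions> <daily_assertions> \
#                         <wal_retention_days> <backup_count>
estimate_disk_mb() {
  wal=$(( $3 * $2 * 10 / 1000 ))       # ~10MB per 1K assertions retained
  db=$(( $1 * 50 / 10000 ))            # ~50MB per 10K assertions
  idx=$(( db / 10 ))                   # ~10% of database size
  backups=$(( (wal + db) * $4 ))       # one full copy per retained backup
  total=$(( wal + db + idx + backups ))
  echo "wal=$wal db=$db indexes=$idx backups=$backups total=$total provision=$(( total * 4 ))"
}
```

Running `estimate_disk_mb 100000 1000 7 7` prints `wal=70 db=500 indexes=50 backups=3990 total=4610 provision=18440`, which matches the pilot example before rounding.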

Related Runbooks

Server Won't Start - Disk full preventing startup
Restore from Backup - Need space for restore operations
High Query Latency - Performance impact of disk pressure

Last Updated

2026-02-11
