jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

StemeDB Systemd Units

Systemd service and timer units for automated StemeDB operations.

Installation

1. Copy Units to System Directory

sudo cp docs/operations/deployment/systemd/stemedb-*.{service,timer} /etc/systemd/system/

2. Copy Backup Script

sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh

3. Create Configuration File

Create /etc/default/stemedb-backup:

# AWS S3 Configuration
AWS_REGION=us-east-1
AWS_S3_BUCKET=stemedb-backups-prod
# AWS credentials: use IAM instance profile (preferred) or specify below
# AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
# AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Backup Configuration
BACKUP_OUTPUT_DIR=/var/backups/stemedb
BACKUP_RETENTION=30d

# StemeDB Data Directories
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db

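Security Note: Prefer IAM instance profiles over static credentials in the config file. If credentials must be stored there, restrict the file's permissions so that only the account that reads it can access it.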
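
Before enabling the timers, it can help to confirm that the configuration file defines every variable shown in the example above. The following is a minimal sketch, not part of the shipped scripts: the function name check_backup_config is hypothetical, the list of required variables is assumed from the example file, and it assumes the file contains only simple KEY=value lines that a POSIX shell can source.

```shell
#!/bin/sh
# Hypothetical helper: verify that a StemeDB backup config file sets the
# variables shown in the example config. Prints each missing name and
# returns non-zero if any are absent or empty.
check_backup_config() {
    config_file="$1"
    required="AWS_REGION AWS_S3_BUCKET BACKUP_OUTPUT_DIR BACKUP_RETENTION STEMEDB_WAL_DIR STEMEDB_DB_DIR"
    [ -r "$config_file" ] || { echo "cannot read $config_file" >&2; return 2; }
    (
        # Work in a subshell so the caller's environment is not modified,
        # and clear inherited values so only the file's contents count.
        for name in $required; do unset "$name"; done
        . "$config_file"
        missing=0
        for name in $required; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "missing: $name"
                missing=1
            fi
        done
        exit "$missing"
    )
}
```

For example, check_backup_config /etc/default/stemedb-backup prints nothing and returns 0 when all six variables are set. Pass an absolute path, since the shell's `.` builtin searches PATH for names without a slash.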

4. Create Backup Directory

sudo mkdir -p /var/backups/stemedb
sudo chown stemedb:stemedb /var/backups/stemedb

5. Enable and Start Timers

# Reload systemd configuration
sudo systemctl daemon-reload

# Enable backup timer (starts on boot)
sudo systemctl enable stemedb-backup.timer

# Start backup timer immediately
sudo systemctl start stemedb-backup.timer

# Enable verification timer
sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer

# Enable WAL archival timer
sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer

Verification

Check Timer Status

# List all StemeDB timers
systemctl list-timers 'stemedb-*'

# Example output (times will differ; entries are sorted by next run):
# NEXT                        LEFT          LAST PASSED UNIT                        ACTIVATES
# Thu 2026-02-12 02:30:00 UTC 15min left    n/a  n/a    stemedb-archive-wal.timer   stemedb-archive-wal.service
# Thu 2026-02-12 06:00:00 UTC 3h 45min left n/a  n/a    stemedb-backup.timer        stemedb-backup.service
# Sun 2026-02-15 03:00:00 UTC 3 days left   n/a  n/a    stemedb-verify-backup.timer stemedb-verify-backup.service
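
For scripted checks (for example, in a post-install step), the same verification can be done per timer with systemctl is-active. This is a minimal sketch under stated assumptions: the function name check_stemedb_timers and the SYSTEMCTL override variable are hypothetical and exist only to make the helper easy to test; the timer names are the three units from this directory.

```shell
#!/bin/sh
# Hypothetical helper: report any StemeDB timer that is not active.
# SYSTEMCTL can be overridden (for example, in tests); it defaults to systemctl.
check_stemedb_timers() {
    systemctl_cmd="${SYSTEMCTL:-systemctl}"
    failed=0
    for timer in stemedb-backup.timer stemedb-verify-backup.timer \
                 stemedb-archive-wal.timer; do
        # is-active --quiet prints nothing and signals state via exit status.
        if "$systemctl_cmd" is-active --quiet "$timer"; then
            echo "ok: $timer"
        else
            echo "NOT ACTIVE: $timer"
            failed=1
        fi
    done
    return "$failed"
}
```

The function returns non-zero if any timer is inactive, so it can gate a deployment script.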

Check Service Status

# View backup service status
sudo systemctl status stemedb-backup.service

# View recent logs
sudo journalctl -u stemedb-backup.service -n 50

# Follow logs in real-time
sudo journalctl -u stemedb-backup.service -f

Manual Trigger

# Trigger backup manually (without waiting for timer)
sudo systemctl start stemedb-backup.service

# Watch progress
sudo journalctl -u stemedb-backup.service -f

Units Reference

stemedb-backup.timer

  • Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
  • Persistent: Yes (a run missed while the host was down is triggered at next boot)
  • Randomized Delay: 0-5 minutes to avoid thundering herd

stemedb-backup.service

  • What it does:
    • Backs up WAL and DB directories
    • Enforces retention policy (default: 30 days)
    • Uploads to S3 (if the --upload-s3 flag is set)
    • Writes Prometheus metrics
  • Timeout: 1 hour
  • Retries: 3 attempts with 5-minute backoff

stemedb-verify-backup.timer

  • Schedule: Weekly on Sunday at 03:00 UTC
  • Persistent: Yes

stemedb-verify-backup.service

  • What it does:
    • Validates latest backup checksums
    • Checks magic bytes, CRC32C, BLAKE3
    • Writes verification status to metrics
  • Timeout: 30 minutes

stemedb-archive-wal.timer

  • Schedule: Every 15 minutes
  • Persistent: Yes

stemedb-archive-wal.service

  • What it does:
    • Ships WAL segments to S3
    • Tracks archival state
    • Targets an RPO of about 15 minutes (bounded by the timer interval plus upload time)
  • Timeout: 10 minutes

Monitoring

All services write metrics to /var/lib/node_exporter/textfile_collector/stemedb_backup.prom for Prometheus scraping.

Key metrics:

  • stemedb_backup_age_seconds - Time since last successful backup
  • stemedb_backup_last_success_timestamp - Unix timestamp of last successful backup
  • stemedb_backup_verification_status - 1 = verified, 0 = failed/pending
  • stemedb_wal_archival_lag_seconds - Delay between WAL creation and S3 upload

See docs/operations/deployment/prometheus/backup-alerts.yml for alert rules.
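
The metrics file can also be inspected directly, which is useful when Prometheus is not yet scraping the host. The sketch below is illustrative only: the function names metric_value and backup_is_fresh are hypothetical, and it assumes each metric appears as an unlabelled "name value" line in the textfile-collector format.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one metric from a Prometheus
# textfile-collector file. Comment lines (# HELP / # TYPE) never match
# because their first field is "#". Assumes the metric has no labels.
metric_value() {
    awk -v name="$2" '$1 == name { print $2; found = 1; exit }
                      END { exit !found }' "$1"
}

# Hypothetical check: succeed only if stemedb_backup_age_seconds is at most
# max_age seconds. Returns 2 if the metric is missing from the file.
backup_is_fresh() {
    prom_file="$1"
    max_age="$2"
    age=$(metric_value "$prom_file" stemedb_backup_age_seconds) || return 2
    # Metric values may be floats, so let awk do the numeric comparison.
    awk -v age="$age" -v max="$max_age" 'BEGIN { exit !(age + 0 <= max + 0) }'
}
```

For example, backup_is_fresh /var/lib/node_exporter/textfile_collector/stemedb_backup.prom 21600 succeeds when the last successful backup is no older than one 6-hour backup interval.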

Troubleshooting

Timer Not Running

# Check if timer is enabled
systemctl is-enabled stemedb-backup.timer

# Check timer status
systemctl status stemedb-backup.timer

# View timer logs
journalctl -u stemedb-backup.timer

Service Failing

# View service logs
sudo journalctl -u stemedb-backup.service -n 100

# Common issues:
# - Permission denied: check user/group in service file
# - AWS credentials: verify /etc/default/stemedb-backup or IAM role
# - Disk full: check df -h /var/backups/stemedb
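
The disk-space check in the list above can be scripted so it runs before a manual backup. This is a minimal sketch: the function name disk_used_percent is hypothetical, and any warning threshold (such as 90%) is an assumption rather than a value taken from the backup script.

```shell
#!/bin/sh
# Hypothetical helper: print the percentage of space used on the
# filesystem that holds the given path (digits only, no % sign).
disk_used_percent() {
    # -P forces the POSIX one-line-per-filesystem format, so the
    # "Capacity" value is always the fifth field of the second line.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A caller could compare the result against its own threshold, for example: [ "$(disk_used_percent /var/backups/stemedb)" -lt 90 ] || echo "backup volume nearly full".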

S3 Upload Failing

# Test AWS credentials
sudo -u stemedb aws s3 ls s3://stemedb-backups-prod/

# Check bucket permissions
aws s3api get-bucket-policy --bucket stemedb-backups-prod

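# Verify the service has the AWS environment variables
# (variables loaded via EnvironmentFile= are not listed under Environment;
# EnvironmentFiles shows which file the unit reads)
sudo systemctl show stemedb-backup.service --property=Environment --property=EnvironmentFiles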

Maintenance

Update Timer Schedule

Edit /etc/systemd/system/stemedb-backup.timer, change OnCalendar, then:

sudo systemctl daemon-reload
sudo systemctl restart stemedb-backup.timer

Change Retention Policy

Edit /etc/default/stemedb-backup and change BACKUP_RETENTION. No restart is needed; the new value takes effect on the next backup run.
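
Before shortening the retention period, it may be worth previewing which backups would fall outside the new window. The sketch below is an assumption-heavy illustration, not the backup script's actual logic: the function name list_expired_backups is hypothetical, it assumes BACKUP_RETENTION has the form "<N>d" as in the example config, and it assumes retention is judged by each entry's modification time. It only lists entries and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: list entries directly under a backup directory whose
# modification time is older than the retention period. Nothing is deleted.
list_expired_backups() {
    backup_dir="$1"
    retention="$2"
    # Strip the trailing "d" and require the remainder to be all digits.
    days="${retention%d}"
    case "$days" in
        ''|*[!0-9]*) echo "unsupported retention: $retention" >&2; return 2 ;;
    esac
    # -mtime +N matches entries last modified more than N whole days ago
    # (find truncates age to whole days, so the cutoff is (N+1)*24 hours).
    find "$backup_dir" -mindepth 1 -maxdepth 1 -mtime "+$days" -print
}
```

For example, list_expired_backups /var/backups/stemedb 14d shows what a 14-day policy would no longer cover under these assumptions.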

Disable Backups Temporarily

# Stop timer (prevents new scheduled backups until it is started again;
# the timer stays enabled, so it also resumes after a reboot)
sudo systemctl stop stemedb-backup.timer

# Resume later
sudo systemctl start stemedb-backup.timer