StemeDB Systemd Units
Systemd service and timer units for automated StemeDB operations.
Installation
1. Copy Units to System Directory
sudo cp docs/operations/deployment/systemd/stemedb-*.{service,timer} /etc/systemd/system/
2. Copy Backup Script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh
3. Create Configuration File
Create /etc/default/stemedb-backup:
# AWS S3 Configuration
AWS_REGION=us-east-1
AWS_S3_BUCKET=stemedb-backups-prod
# AWS credentials: use IAM instance profile (preferred) or specify below
# AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
# AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Backup Configuration
BACKUP_OUTPUT_DIR=/var/backups/stemedb
BACKUP_RETENTION=30d
# StemeDB Data Directories
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
Security Note: When possible, use an IAM instance profile instead of storing credentials in the config file.
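Before enabling the timers, it can help to confirm that the config file defines every setting listed above. The following is a minimal sketch, not part of the shipped scripts: the function name check_backup_config is hypothetical, and it assumes the file uses plain KEY=VALUE lines as in the example.

```shell
# Hypothetical helper (not shipped with StemeDB): report which of the
# settings from the example config are missing or empty in a given file.
# Assumes plain KEY=VALUE lines; commented-out lines do not count.
check_backup_config() {
    local config_file="$1"
    local required="AWS_REGION AWS_S3_BUCKET BACKUP_OUTPUT_DIR BACKUP_RETENTION STEMEDB_WAL_DIR STEMEDB_DB_DIR"
    local missing=0 name

    if [ ! -r "$config_file" ]; then
        echo "cannot read $config_file" >&2
        return 2
    fi

    for name in $required; do
        # Match an uncommented assignment with a non-empty value.
        if ! grep -Eq "^${name}=.+" "$config_file"; then
            echo "missing or empty: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_backup_config /etc/default/stemedb-backup returns 0 when all six settings are present, 1 when any are missing, and 2 when the file cannot be read.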
4. Create Backup Directory
sudo mkdir -p /var/backups/stemedb
sudo chown stemedb:stemedb /var/backups/stemedb
5. Enable and Start Timers
# Reload systemd configuration
sudo systemctl daemon-reload
# Enable backup timer (starts on boot)
sudo systemctl enable stemedb-backup.timer
# Start backup timer immediately
sudo systemctl start stemedb-backup.timer
# Enable verification timer
sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer
# Enable WAL archival timer
sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer
Verification
Check Timer Status
# List all StemeDB timers
systemctl list-timers 'stemedb-*'
# Expected output (example; dates and times will differ):
# NEXT LEFT LAST PASSED UNIT ACTIVATES
# Wed 2026-02-11 06:00:00 UTC 3h 45min left n/a n/a stemedb-backup.timer stemedb-backup.service
# Sun 2026-02-15 03:00:00 UTC 3d 23h left n/a n/a stemedb-verify-backup.timer stemedb-verify-backup.service
# Wed 2026-02-11 02:30:00 UTC 15min left n/a n/a stemedb-archive-wal.timer stemedb-archive-wal.service
Check Service Status
# View backup service status
sudo systemctl status stemedb-backup.service
# View recent logs
sudo journalctl -u stemedb-backup.service -n 50
# Follow logs in real-time
sudo journalctl -u stemedb-backup.service -f
Manual Trigger
# Trigger backup manually (without waiting for timer)
sudo systemctl start stemedb-backup.service
# Watch progress
sudo journalctl -u stemedb-backup.service -f
Units Reference
stemedb-backup.timer
- Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
- Persistent: Runs on boot if missed
- Randomized Delay: 0-5 minutes to avoid thundering herd
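For orientation, the three properties above map onto standard systemd timer directives roughly as follows. This is an illustrative sketch only: the shipped stemedb-backup.timer may use different directives or values, and the function name print_backup_timer_sketch is hypothetical.

```shell
# Print a timer unit that matches the documented schedule.
# Assumption: this is NOT the shipped unit file, only an illustration of
# how the three documented properties can be expressed in systemd.
print_backup_timer_sketch() {
    cat <<'EOF'
[Unit]
Description=StemeDB backup timer (illustrative sketch)

[Timer]
# Every 6 hours starting at midnight UTC: 00:00, 06:00, 12:00, 18:00
OnCalendar=*-*-* 00/6:00:00 UTC
# Catch up after boot if a scheduled run was missed while the host was off
Persistent=true
# Delay each run by a random amount of up to 5 minutes
RandomizedDelaySec=5min

[Install]
WantedBy=timers.target
EOF
}
```

On a host with systemd, systemd-analyze calendar '*-*-* 00/6:00:00 UTC' previews when such an expression next elapses.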
stemedb-backup.service
- What it does:
  - Backs up WAL and DB directories
  - Enforces retention policy (default: 30 days)
  - Uploads to S3 (if --upload-s3 flag enabled)
  - Writes Prometheus metrics
- Timeout: 1 hour
- Retries: 3 attempts with 5-minute backoff
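This README does not say whether the retries are implemented with systemd restart directives or inside the backup script. As a sketch of the script-level variant, a fixed-delay retry loop could look like the following; the function name retry_with_backoff is hypothetical.

```shell
# retry_with_backoff ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND until it succeeds or ATTEMPTS tries have
# been made, sleeping DELAY_SECONDS between tries. Returns 0 on success,
# otherwise the exit status of the last failed try.
retry_with_backoff() {
    local attempts="$1" delay="$2"
    shift 2
    local try=1 status=0

    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$try" -ge "$attempts" ]; then
            echo "giving up after $try attempts (exit $status)" >&2
            return "$status"
        fi
        echo "attempt $try failed (exit $status); retrying in ${delay}s" >&2
        sleep "$delay"
        try=$((try + 1))
    done
}
```

With the documented values, a call would be retry_with_backoff 3 300 /usr/local/bin/backup-stemedb.sh, that is, three attempts spaced 300 seconds apart.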
stemedb-verify-backup.timer
- Schedule: Weekly on Sunday at 03:00 UTC
- Persistent: Yes
stemedb-verify-backup.service
- What it does:
  - Validates latest backup checksums
  - Checks magic bytes, CRC32C, BLAKE3
  - Writes verification status to metrics
- Timeout: 30 minutes
stemedb-archive-wal.timer
- Schedule: Every 15 minutes
- Persistent: Yes
stemedb-archive-wal.service
- What it does:
  - Ships WAL segments to S3
  - Tracks archival state
  - Achieves RPO=15min
- Timeout: 10 minutes
Monitoring
All services write metrics to /var/lib/node_exporter/textfile_collector/stemedb_backup.prom for Prometheus scraping.
Key metrics:
- stemedb_backup_age_seconds - Time since last successful backup
- stemedb_backup_last_success_timestamp - Unix timestamp of last backup
- stemedb_backup_verification_status - 1 = verified, 0 = failed/pending
- stemedb_wal_archival_lag_seconds - Delay between WAL creation and S3 upload
See docs/operations/deployment/prometheus/backup-alerts.yml for alert rules.
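The node_exporter textfile collector reads files ending in .prom, so a writer should avoid exposing a half-written file. A common pattern is to write to a temporary file in the same directory and rename it into place. The sketch below shows that pattern for three of the metrics listed above; the function name write_backup_metrics and its argument order are assumptions, not the shipped script's interface.

```shell
# write_backup_metrics PROM_FILE LAST_SUCCESS_EPOCH VERIFIED [NOW_EPOCH]
# Hypothetical helper: write backup metrics in Prometheus text format.
# VERIFIED is 1 (verified) or 0 (failed/pending). NOW_EPOCH defaults to
# the current time and exists mainly to make the function testable.
write_backup_metrics() {
    local prom_file="$1" last_success="$2" verified="$3"
    local now="${4:-$(date +%s)}"
    local tmp

    # The temp file does not end in .prom, so the collector ignores it.
    tmp="$(mktemp "${prom_file}.XXXXXX")" || return 1
    {
        echo "# TYPE stemedb_backup_age_seconds gauge"
        echo "stemedb_backup_age_seconds $((now - last_success))"
        echo "# TYPE stemedb_backup_last_success_timestamp gauge"
        echo "stemedb_backup_last_success_timestamp $last_success"
        echo "# TYPE stemedb_backup_verification_status gauge"
        echo "stemedb_backup_verification_status $verified"
    } > "$tmp"

    # mktemp creates the file with mode 600; make it readable to the exporter.
    chmod 644 "$tmp"
    # A rename within one directory replaces the target atomically.
    mv "$tmp" "$prom_file"
}
```

Because age is computed at write time, stemedb_backup_age_seconds only stays current if the file is rewritten regularly; alerting on time() minus stemedb_backup_last_success_timestamp avoids that dependency.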
Troubleshooting
Timer Not Running
# Check if timer is enabled
systemctl is-enabled stemedb-backup.timer
# Check timer status
systemctl status stemedb-backup.timer
# View timer logs
journalctl -u stemedb-backup.timer
Service Failing
# View service logs
sudo journalctl -u stemedb-backup.service -n 100
# Common issues:
# - Permission denied: check user/group in service file
# - AWS credentials: verify /etc/default/stemedb-backup or IAM role
# - Disk full: check df -h /var/backups/stemedb
S3 Upload Failing
# Test AWS credentials
sudo -u stemedb aws s3 ls s3://stemedb-backups-prod/
# Check bucket permissions
aws s3api get-bucket-policy --bucket stemedb-backups-prod
# Verify which environment settings the service loads
# (Environment= values and any EnvironmentFile= paths)
sudo systemctl show stemedb-backup.service --property=Environment --property=EnvironmentFiles
Maintenance
Update Timer Schedule
Edit /etc/systemd/system/stemedb-backup.timer, change OnCalendar, then:
sudo systemctl daemon-reload
sudo systemctl restart stemedb-backup.timer
Change Retention Policy
Edit /etc/default/stemedb-backup and change BACKUP_RETENTION. No restart is needed; the new value takes effect on the next backup.
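How the backup script interprets BACKUP_RETENTION is not shown in this README. As a rough sketch, assuming values of the form Nd (such as the 30d in the example config) and one file or directory per backup under the output directory, a dry-run listing of expired backups could look like this. Both function names are hypothetical.

```shell
# retention_to_days VALUE
# Hypothetical helper: convert a value such as "30d" to a day count.
# Only the "<digits>d" form is accepted; anything else is rejected.
retention_to_days() {
    case "$1" in
        ""|d|*[!0-9d]*|*d?*)
            echo "unsupported retention value: $1" >&2
            return 1 ;;
        *d)
            printf '%s\n' "${1%d}" ;;
        *)
            echo "unsupported retention value: $1" >&2
            return 1 ;;
    esac
}

# list_expired_backups DIR RETENTION
# Print top-level entries in DIR older than the retention window.
# Nothing is deleted; this only lists candidates.
list_expired_backups() {
    local dir="$1" days
    days="$(retention_to_days "$2")" || return 1
    # -mtime +N matches entries whose age in whole days is greater than N.
    find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -print
}
```

For example, list_expired_backups /var/backups/stemedb 30d prints the entries that a 30-day policy would treat as expired under these assumptions.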
Disable Backups Temporarily
# Stop timer (prevents new scheduled backups; an enabled timer starts again at next boot)
sudo systemctl stop stemedb-backup.timer
# Resume scheduled backups later
sudo systemctl start stemedb-backup.timer