This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# StemeDB Systemd Units

Systemd service and timer units for automated StemeDB operations.

## Installation

### 1. Copy Units to System Directory

```bash
sudo cp docs/operations/deployment/systemd/stemedb-*.{service,timer} /etc/systemd/system/
```

### 2. Copy Backup Script

```bash
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh
```

### 3. Create Configuration File

Create `/etc/default/stemedb-backup`:

```bash
# AWS S3 Configuration
AWS_REGION=us-east-1
AWS_S3_BUCKET=stemedb-backups-prod
# AWS credentials: use IAM instance profile (preferred) or specify below
# AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
# AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Backup Configuration
BACKUP_OUTPUT_DIR=/var/backups/stemedb
BACKUP_RETENTION=30d

# StemeDB Data Directories
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
```

**Security Note:** Prefer IAM instance profiles. Store credentials in this file only when an instance profile is not available.

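Before enabling the timers, it can help to confirm that the file defines every variable shown in the example. The following is a minimal sketch, not part of the shipped scripts: `check_backup_config` is a hypothetical helper, and treating all six variables as required is an assumption inferred from the example file.

```bash
# Hypothetical helper: report any variable from the example config that is
# missing from (or commented out in) the given file.
check_backup_config() {
  local file="$1" missing=0 key
  for key in AWS_REGION AWS_S3_BUCKET BACKUP_OUTPUT_DIR BACKUP_RETENTION \
             STEMEDB_WAL_DIR STEMEDB_DB_DIR; do
    # Match only uncommented KEY=value lines with a non-empty value.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_backup_config /etc/default/stemedb-backup && echo "config looks complete"` prints the confirmation only when no variable is reported missing.
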
### 4. Create Backup Directory

```bash
sudo mkdir -p /var/backups/stemedb
sudo chown stemedb:stemedb /var/backups/stemedb
```

### 5. Enable and Start Timers

```bash
# Reload systemd configuration
sudo systemctl daemon-reload

# Enable backup timer (starts on boot)
sudo systemctl enable stemedb-backup.timer

# Start backup timer immediately
sudo systemctl start stemedb-backup.timer

# Enable and start the verification timer
sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer

# Enable and start the WAL archival timer
sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer
```

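The three enable/start pairs can also be collapsed into a loop, since `systemctl enable --now` enables a unit and starts it in one step. This is a sketch rather than a shipped script: the function name and the `SYSTEMCTL` override are conventions invented here so the loop can be dry-run.

```bash
# Enable and start all three StemeDB timers in one pass.
# SYSTEMCTL can be overridden (e.g. SYSTEMCTL=echo) for a dry run; that
# override is a convention of this sketch, not something systemd provides.
enable_stemedb_timers() {
  local systemctl_cmd="${SYSTEMCTL:-systemctl}" timer
  for timer in stemedb-backup stemedb-verify-backup stemedb-archive-wal; do
    # Left unquoted on purpose so "sudo systemctl" splits into two words.
    $systemctl_cmd enable --now "${timer}.timer" || return 1
  done
}
```

After `sudo systemctl daemon-reload`, running `SYSTEMCTL="sudo systemctl" enable_stemedb_timers` applies the change, while `SYSTEMCTL=echo enable_stemedb_timers` only prints the arguments that would be passed.
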
## Verification

### Check Timer Status

```bash
# List all StemeDB timers
systemctl list-timers 'stemedb-*'

# Example output (dates and times are illustrative):
# NEXT                        LEFT          LAST PASSED UNIT                        ACTIVATES
# Thu 2026-02-12 02:30:00 UTC 15min left    n/a  n/a    stemedb-archive-wal.timer   stemedb-archive-wal.service
# Thu 2026-02-12 06:00:00 UTC 3h 45min left n/a  n/a    stemedb-backup.timer        stemedb-backup.service
# Sun 2026-02-15 03:00:00 UTC 3 days left   n/a  n/a    stemedb-verify-backup.timer stemedb-verify-backup.service
```

### Check Service Status

```bash
# View backup service status
sudo systemctl status stemedb-backup.service

# View recent logs
sudo journalctl -u stemedb-backup.service -n 50

# Follow logs in real-time
sudo journalctl -u stemedb-backup.service -f
```

### Manual Trigger

```bash
# Trigger backup manually (without waiting for timer)
sudo systemctl start stemedb-backup.service

# Watch progress
sudo journalctl -u stemedb-backup.service -f
```

## Units Reference

### stemedb-backup.timer

- **Schedule:** Every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
- **Persistent:** Yes; a run missed while the host was down starts at the next boot
- **Randomized Delay:** 0-5 minutes to avoid a thundering herd

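The three settings above map onto standard `[Timer]` directives. The sketch below prints what such a unit might look like. It is an assumption reconstructed from the bullet list, not a copy of the shipped `stemedb-backup.timer`, and `print_example_backup_timer` is a hypothetical name.

```bash
# Print a hypothetical timer unit matching the schedule described above.
# The real unit in the repository may differ.
print_example_backup_timer() {
  cat <<'EOF'
[Unit]
Description=StemeDB backup timer (illustrative sketch)

[Timer]
# Every 6 hours starting at midnight UTC: 00:00, 06:00, 12:00, 18:00
OnCalendar=*-*-* 00/6:00:00 UTC
# Run once after boot if a scheduled run was missed
Persistent=true
# Spread start times over 0-5 minutes
RandomizedDelaySec=5min

[Install]
WantedBy=timers.target
EOF
}
```

On a host with systemd, `systemd-analyze calendar '*-*-* 00/6:00:00 UTC'` shows when such an expression next elapses, which is a quick way to check an `OnCalendar` value before editing a unit.
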
### stemedb-backup.service

- **What it does:**
  - Backs up WAL and DB directories
  - Enforces retention policy (default: 30 days)
  - Uploads to S3 (if the `--upload-s3` flag is enabled)
  - Writes Prometheus metrics
- **Timeout:** 1 hour
- **Retries:** 3 attempts with 5-minute backoff

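For the metrics step, a common pattern with node_exporter's textfile collector is to build the file under a temporary name and rename it into place, so a scrape never sees a half-written file. A minimal sketch of that pattern follows. `write_backup_metrics` and its arguments are hypothetical, and the shipped backup script may write its metrics differently.

```bash
# Hypothetical helper: publish the last-success timestamp for the textfile
# collector. Arguments: target .prom file, Unix timestamp (defaults to now).
write_backup_metrics() {
  local target="$1" ts="${2:-$(date +%s)}" tmp
  # Create the temp file next to the target so the final mv is an atomic rename.
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  {
    echo '# HELP stemedb_backup_last_success_timestamp Unix timestamp of last backup'
    echo '# TYPE stemedb_backup_last_success_timestamp gauge'
    echo "stemedb_backup_last_success_timestamp ${ts}"
  } > "$tmp"
  chmod 644 "$tmp"   # mktemp creates the file 0600; the collector must be able to read it
  mv "$tmp" "$target"
}
```

Because the temporary name does not end in `.prom`, the collector ignores it until the rename completes.
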
### stemedb-verify-backup.timer

- **Schedule:** Weekly on Sunday at 03:00 UTC
- **Persistent:** Yes

### stemedb-verify-backup.service

- **What it does:**
  - Validates latest backup checksums
  - Checks magic bytes, CRC32C, BLAKE3
  - Writes verification status to metrics
- **Timeout:** 30 minutes

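As an illustration of the first of those checks, a magic-byte test compares the leading bytes of a file with a known signature. The sketch below is generic: `has_magic` is a hypothetical helper, and the signature is passed in as hex because the actual StemeDB magic value is not given in this document. CRC32C and BLAKE3 validation are not covered here.

```bash
# Hypothetical helper: succeed if the file starts with the given bytes.
# Arguments: file path, expected signature as a hex string (e.g. 5354454d).
has_magic() {
  local file="$1" expected actual n
  # Normalise to lowercase because od prints lowercase hex digits.
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  n=$(( ${#expected} / 2 ))          # two hex digits per byte
  actual=$(od -An -tx1 -N "$n" "$file" | tr -d '[:space:]')
  [ "$actual" = "$expected" ]
}
```

A call has the form `has_magic "$backup_file" "$expected_magic_hex"`, where both values are placeholders to be supplied by the operator.
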
### stemedb-archive-wal.timer

- **Schedule:** Every 15 minutes
- **Persistent:** Yes

### stemedb-archive-wal.service

- **What it does:**
  - Ships WAL segments to S3
  - Tracks archival state
  - Supports a 15-minute recovery point objective (RPO)
- **Timeout:** 10 minutes

## Monitoring

All services write metrics to `/var/lib/node_exporter/textfile_collector/stemedb_backup.prom`, which node_exporter's textfile collector exposes for Prometheus to scrape.

**Key metrics:**
- `stemedb_backup_age_seconds` - Time since last successful backup
- `stemedb_backup_last_success_timestamp` - Unix timestamp of last backup
- `stemedb_backup_verification_status` - 1 = verified, 0 = failed/pending
- `stemedb_wal_archival_lag_seconds` - Delay between WAL creation and S3 upload

See `docs/operations/deployment/prometheus/backup-alerts.yml` for alert rules.

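The metrics file can also be inspected directly on the host, for instance during an incident when Prometheus is unavailable. The helpers below are a minimal sketch: the function names are hypothetical, and the parsing assumes unlabelled samples of the form `name value`, which may not match what the services actually emit.

```bash
# Print the value of a metric from a textfile-collector file.
# Handles only unlabelled samples of the form "name value".
metric_value() {
  local file="$1" name="$2"
  awk -v m="$name" '$1 == m { print $2; found = 1; exit } END { exit !found }' "$file"
}

# Succeed if the last backup is no older than max_age seconds.
backup_is_fresh() {
  local file="$1" max_age="$2" age
  age=$(metric_value "$file" stemedb_backup_age_seconds) || return 2
  # Compare in awk so float-formatted values (e.g. 1.2e+03) also work.
  awk -v a="$age" -v m="$max_age" 'BEGIN { exit !(a + 0 <= m + 0) }'
}
```

For example, `backup_is_fresh /var/lib/node_exporter/textfile_collector/stemedb_backup.prom 25200 || echo "backup is stale"` uses an illustrative 7-hour threshold; the authoritative thresholds are the ones in the alert rules file.
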
## Troubleshooting

### Timer Not Running

```bash
# Check if timer is enabled
systemctl is-enabled stemedb-backup.timer

# Check timer status
systemctl status stemedb-backup.timer

# View timer logs
journalctl -u stemedb-backup.timer
```

### Service Failing

```bash
# View service logs
sudo journalctl -u stemedb-backup.service -n 100

# Common issues:
# - Permission denied: check user/group in service file
# - AWS credentials: verify /etc/default/stemedb-backup or IAM role
# - Disk full: check df -h /var/backups/stemedb
```

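The disk-full check can be scripted when it needs to run repeatedly. A small sketch, with `disk_used_pct` as a hypothetical helper name:

```bash
# Print the used-space percentage of the filesystem holding a path.
disk_used_pct() {
  # -P forces the POSIX single-line format so column 5 is always "Capacity".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `[ "$(disk_used_pct /var/backups/stemedb)" -ge 90 ] && echo "backup volume is nearly full"`, where the 90% threshold is only illustrative.
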
### S3 Upload Failing

```bash
# Test AWS credentials
sudo -u stemedb aws s3 ls s3://stemedb-backups-prod/

# Check bucket permissions
aws s3api get-bucket-policy --bucket stemedb-backups-prod

# Verify the service's AWS environment variables
# (Environment lists inline variables only; EnvironmentFiles lists any files loaded via EnvironmentFile=)
sudo systemctl show stemedb-backup.service --property=Environment,EnvironmentFiles
```

## Maintenance

### Update Timer Schedule

Edit `/etc/systemd/system/stemedb-backup.timer`, change `OnCalendar`, then:

```bash
sudo systemctl daemon-reload
sudo systemctl restart stemedb-backup.timer
```

### Change Retention Policy

Edit `/etc/default/stemedb-backup` and change `BACKUP_RETENTION`. No restart is needed; the new value takes effect on the next backup.

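How the backup script interprets `BACKUP_RETENTION` is not shown in this document. As a rough illustration of how a value such as `30d` could be turned into a day count for previewing what a retention pass would affect, consider the sketch below. `retention_days` is a hypothetical helper, and accepting only the `d` suffix is an assumption.

```bash
# Hypothetical helper: convert a retention value such as "30d" to whole days.
# Only the "<digits>d" form is accepted; anything else is rejected.
retention_days() {
  local value="$1" number="${1%d}"
  case "$number" in
    "$value"|''|*[!0-9]*) echo "unsupported retention: $value" >&2; return 1 ;;
  esac
  echo "$number"
}
```

With the config file's variables loaded into the shell, `find "$BACKUP_OUTPUT_DIR" -type f -mtime +"$(retention_days "$BACKUP_RETENTION")" -print` lists files that `find` considers older than the retention window, without deleting anything.
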
### Disable Backups Temporarily

```bash
# Stop the timer (no new scheduled backups until it is started again;
# if the timer is still enabled, it also starts again at the next boot)
sudo systemctl stop stemedb-backup.timer

# Resume scheduled backups
sudo systemctl start stemedb-backup.timer
```

## Related Documentation

- [Backup Script Reference](../../../../scripts/backup-stemedb.sh)
- [Restore Runbook](../../runbooks/restore-from-backup.md)
- [Disaster Recovery](../../runbooks/disaster-recovery.md)
- [Prometheus Alerts](../prometheus/backup-alerts.yml)