stemedb/docs/operations/deployment/systemd/README.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

240 lines
5.8 KiB
Markdown

# StemeDB Systemd Units
Systemd service and timer units for automated StemeDB operations.
## Installation
### 1. Copy Units to System Directory
```bash
sudo cp docs/operations/deployment/systemd/stemedb-*.{service,timer} /etc/systemd/system/
```
### 2. Copy Backup Script
```bash
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh
```
### 3. Create Configuration File
Create `/etc/default/stemedb-backup`:
```bash
# AWS S3 Configuration
AWS_REGION=us-east-1
AWS_S3_BUCKET=stemedb-backups-prod
# AWS credentials: use IAM instance profile (preferred) or specify below
# AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
# AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Backup Configuration
BACKUP_OUTPUT_DIR=/var/backups/stemedb
BACKUP_RETENTION=30d
# StemeDB Data Directories
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
```
**Security Note:** Use IAM instance profiles instead of credentials in config file when possible.
### 4. Create Backup Directory
```bash
sudo mkdir -p /var/backups/stemedb
sudo chown stemedb:stemedb /var/backups/stemedb
```
### 5. Enable and Start Timers
```bash
# Reload systemd configuration
sudo systemctl daemon-reload
# Enable backup timer (starts on boot)
sudo systemctl enable stemedb-backup.timer
# Start backup timer immediately
sudo systemctl start stemedb-backup.timer
# Enable verification timer
sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer
# Enable WAL archival timer
sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer
```
## Verification
### Check Timer Status
```bash
# List all StemeDB timers
systemctl list-timers 'stemedb-*'
# Expected output:
# NEXT LEFT LAST PASSED UNIT ACTIVATES
# Wed 2026-02-12 06:00:00 UTC 3h 45min left n/a n/a stemedb-backup.timer stemedb-backup.service
# Sun 2026-02-16 03:00:00 UTC 3d 23h left n/a n/a stemedb-verify-backup.timer stemedb-verify-backup.service
# Wed 2026-02-12 02:30:00 UTC 15min left n/a n/a stemedb-archive-wal.timer stemedb-archive-wal.service
```
### Check Service Status
```bash
# View backup service status
sudo systemctl status stemedb-backup.service
# View recent logs
sudo journalctl -u stemedb-backup.service -n 50
# Follow logs in real-time
sudo journalctl -u stemedb-backup.service -f
```
### Manual Trigger
```bash
# Trigger backup manually (without waiting for timer)
sudo systemctl start stemedb-backup.service
# Watch progress
sudo journalctl -u stemedb-backup.service -f
```
## Units Reference
### stemedb-backup.timer
- **Schedule:** Every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
- **Persistent:** Runs on boot if missed
- **Randomized Delay:** 0-5 minutes to avoid thundering herd
### stemedb-backup.service
- **What it does:**
- Backs up WAL and DB directories
- Enforces retention policy (default: 30 days)
- Uploads to S3 (if `--upload-s3` flag enabled)
- Writes Prometheus metrics
- **Timeout:** 1 hour
- **Retries:** 3 attempts with 5-minute backoff
### stemedb-verify-backup.timer
- **Schedule:** Weekly on Sunday at 03:00 UTC
- **Persistent:** Yes
### stemedb-verify-backup.service
- **What it does:**
- Validates latest backup checksums
- Checks magic bytes, CRC32C, BLAKE3
- Writes verification status to metrics
- **Timeout:** 30 minutes
### stemedb-archive-wal.timer
- **Schedule:** Every 15 minutes
- **Persistent:** Yes
### stemedb-archive-wal.service
- **What it does:**
- Ships WAL segments to S3
- Tracks archival state
- Achieves RPO=15min
- **Timeout:** 10 minutes
## Monitoring
All services write metrics to `/var/lib/node_exporter/textfile_collector/stemedb_backup.prom` for Prometheus scraping.
**Key metrics:**
- `stemedb_backup_age_seconds` - Time since last successful backup
- `stemedb_backup_last_success_timestamp` - Unix timestamp of last backup
- `stemedb_backup_verification_status` - 1 = verified, 0 = failed/pending
- `stemedb_wal_archival_lag_seconds` - Delay between WAL creation and S3 upload
See `docs/operations/deployment/prometheus/backup-alerts.yml` for alert rules.
## Troubleshooting
### Timer Not Running
```bash
# Check if timer is enabled
systemctl is-enabled stemedb-backup.timer
# Check timer status
systemctl status stemedb-backup.timer
# View timer logs
journalctl -u stemedb-backup.timer
```
### Service Failing
```bash
# View service logs
sudo journalctl -u stemedb-backup.service -n 100
# Common issues:
# - Permission denied: check user/group in service file
# - AWS credentials: verify /etc/default/stemedb-backup or IAM role
# - Disk full: check df -h /var/backups/stemedb
```
### S3 Upload Failing
```bash
# Test AWS credentials
sudo -u stemedb aws s3 ls s3://stemedb-backups-prod/
# Check bucket permissions
aws s3api get-bucket-policy --bucket stemedb-backups-prod
# Verify service has AWS environment variables
sudo systemctl show stemedb-backup.service --property=Environment
```
## Maintenance
### Update Timer Schedule
Edit `/etc/systemd/system/stemedb-backup.timer`, change `OnCalendar`, then:
```bash
sudo systemctl daemon-reload
sudo systemctl restart stemedb-backup.timer
```
### Change Retention Policy
Edit `/etc/default/stemedb-backup`, change `BACKUP_RETENTION`, then:
```bash
# No restart needed - takes effect on next backup
```
### Disable Backups Temporarily
```bash
# Stop timer (prevents new backups)
sudo systemctl stop stemedb-backup.timer
# Re-enable later
sudo systemctl start stemedb-backup.timer
```
## Related Documentation
- [Backup Script Reference](../../../../scripts/backup-stemedb.sh)
- [Restore Runbook](../../runbooks/restore-from-backup.md)
- [Disaster Recovery](../../runbooks/disaster-recovery.md)
- [Prometheus Alerts](../prometheus/backup-alerts.yml)