Certificate Expiring Soon
Severity: CRITICAL
Alert Rule
Alert: CertificateExpiringSoon
Trigger: TLS certificate expires within 7 days
Duration: 1h
Symptom
- Alert fires: "TLS certificate expires in X days"
- Metrics show stemedb_tls_cert_expiry_seconds < 604800 (7 days)
- Logs contain certificate expiry warnings
- openssl commands show an approaching expiration date
Impact
User Impact (if cert expires):
- All HTTPS/TLS connections fail immediately
- API becomes unreachable for external clients
- Dashboard shows "Certificate Invalid" errors
- Inter-node cluster communication fails (if using mTLS)
Business Impact:
- Complete service outage for external users
- SLA breach
- Customer trust erosion (security warnings in browsers)
Investigation Steps
1. Check Certificate Expiration
# Check certificate expiry date
echo | openssl s_client -servername stemedb.example.com \
-connect localhost:18180 2>/dev/null | \
openssl x509 -noout -dates
# notBefore=Jan 1 00:00:00 2025 GMT
# notAfter=Apr 1 23:59:59 2026 GMT
# Check whether the certificate expires within 7 days (exit status 1 if it does)
echo | openssl s_client -servername stemedb.example.com \
-connect localhost:18180 2>/dev/null | \
openssl x509 -noout -checkend $((7 * 86400))
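The -checkend check above only answers yes or no. To see the remaining lifetime as a number, the notAfter date can be converted to days. The following is a minimal sketch, assuming GNU date is available for parsing; the helper names days_left and cert_expiry_epoch are illustrative and not part of StemeDB.

```shell
# days_left EXPIRY_EPOCH [NOW_EPOCH]
# Prints the whole days between now and the expiry time (truncated toward zero).
days_left() {
  expiry=$1
  now=${2:-$(date +%s)}
  echo $(( (expiry - now) / 86400 ))
}

# cert_expiry_epoch SERVERNAME HOST:PORT
# Prints the served certificate's notAfter date as a Unix timestamp.
# Assumption: GNU date is installed (the -d option parses the openssl date format).
cert_expiry_epoch() {
  end=$(echo | openssl s_client -servername "$1" -connect "$2" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null) || return 1
  date -d "${end#notAfter=}" +%s
}
```

Example usage: days_left "$(cert_expiry_epoch stemedb.example.com localhost:18180)"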
2. Check Certificate Details
# View full certificate
openssl s_client -servername stemedb.example.com \
-connect localhost:18180 </dev/null 2>/dev/null | \
openssl x509 -text -noout | grep -A 3 "Subject:\|Issuer:\|Validity"
3. Check Certificate Source
# Check if using Let's Encrypt
cat /etc/stemedb/tls/cert.pem | openssl x509 -noout -issuer
# issuer=C = US, O = Let's Encrypt, CN = R3
# Check certbot renewal status (if using Let's Encrypt)
certbot certificates | grep -A 10 stemedb.example.com
4. Check Renewal Automation
# Check certbot timer (systemd)
systemctl status certbot.timer
# Check cron jobs
crontab -l | grep certbot
# Check recent renewal attempts
journalctl -u certbot --since "7 days ago" | grep -i "renew"
Resolution
If Using Let's Encrypt
1. Attempt manual renewal:
# Dry run first
certbot renew --dry-run --cert-name stemedb.example.com
# If successful, perform actual renewal
certbot renew --cert-name stemedb.example.com --force-renewal
2. Reload certificate in stemedb-api:
# Option A: Graceful reload (no downtime)
systemctl reload stemedb-api
# Option B: Restart (brief downtime)
systemctl restart stemedb-api
3. Verify new certificate:
echo | openssl s_client -servername stemedb.example.com \
-connect localhost:18180 2>/dev/null | \
openssl x509 -noout -dates | grep notAfter
If Using Custom CA
1. Generate new certificate signing request (CSR):
# Generate new private key
openssl genrsa -out /etc/stemedb/tls/new-key.pem 4096
# Generate CSR
openssl req -new -key /etc/stemedb/tls/new-key.pem \
-out /tmp/stemedb.csr \
-subj "/C=US/ST=CA/O=StemeDB/CN=stemedb.example.com"
2. Submit CSR to CA:
# Send CSR to CA for signing
# (Process varies by CA - follow CA-specific procedures)
cat /tmp/stemedb.csr | mail -s "Certificate Renewal Request" ca@example.com
3. After receiving signed certificate, install:
# Backup old certificate
cp /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.old.$(date +%Y%m%d)
cp /etc/stemedb/tls/key.pem /etc/stemedb/tls/key.pem.old.$(date +%Y%m%d)
# Install new certificate
mv /tmp/new-cert.pem /etc/stemedb/tls/cert.pem
mv /etc/stemedb/tls/new-key.pem /etc/stemedb/tls/key.pem
# Set correct permissions
chmod 600 /etc/stemedb/tls/key.pem
chmod 644 /etc/stemedb/tls/cert.pem
chown stemedb:stemedb /etc/stemedb/tls/*.pem
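Before moving the files into place in step 3, it can help to confirm that the signed certificate actually corresponds to the new private key, since a mismatched pair will prevent the service from loading TLS. The following is a sketch under that assumption; cert_matches_key is a hypothetical helper that compares the public key embedded in the certificate with the one derived from the private key.

```shell
# cert_matches_key CERT_FILE KEY_FILE
# Succeeds only when both files are readable and carry the same public key.
# Returns 2 when either file cannot be parsed, so errors never look like a match.
cert_matches_key() {
  cert_pub=$(openssl x509 -noout -pubkey -in "$1" 2>/dev/null) || return 2
  key_pub=$(openssl pkey -pubout -in "$2" 2>/dev/null) || return 2
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}
```

Example usage: cert_matches_key /tmp/new-cert.pem /etc/stemedb/tls/new-key.pem || echo "Certificate and key do not match; do not install"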
4. Reload service:
systemctl reload stemedb-api
# Verify service accepted new cert
journalctl -u stemedb-api --since "1 min ago" | grep -i "tls\|certificate"
If Renewal Fails
1. Check common failure reasons:
# DNS validation issues (Let's Encrypt)
dig _acme-challenge.stemedb.example.com TXT
# HTTP validation issues
curl -v http://stemedb.example.com/.well-known/acme-challenge/test
# Rate limits
certbot renew --dry-run 2>&1 | grep -i "rate limit"
2. Switch to DNS validation (if HTTP fails):
certbot certonly --manual --preferred-challenges dns \
-d stemedb.example.com \
--email ops@example.com
3. Use staging CA to test (doesn't count against rate limits):
certbot renew --cert-name stemedb.example.com \
--server https://acme-staging-v02.api.letsencrypt.org/directory \
--dry-run
If Certificate Already Expired
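1. Generate temporary self-signed certificate (clients that validate the certificate chain will still see trust warnings until a CA-signed certificate is installed):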
openssl req -x509 -nodes -days 30 -newkey rsa:4096 \
-keyout /etc/stemedb/tls/temp-key.pem \
-out /etc/stemedb/tls/temp-cert.pem \
-subj "/CN=stemedb.example.com"
2. Install temporary cert:
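mv /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.expired
# Keep a copy of the old key so it is not lost when key.pem is overwritten
cp /etc/stemedb/tls/key.pem /etc/stemedb/tls/key.pem.expired
cp /etc/stemedb/tls/temp-cert.pem /etc/stemedb/tls/cert.pem
cp /etc/stemedb/tls/temp-key.pem /etc/stemedb/tls/key.pem
systemctl reload stemedb-api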
3. Fix renewal and replace with valid cert:
Follow renewal steps above, then replace temporary cert.
Prevention
Automated Renewal
1. Enable certbot timer (Let's Encrypt):
# Enable automatic renewal
systemctl enable certbot.timer
systemctl start certbot.timer
# Verify timer is active
systemctl list-timers | grep certbot
2. Configure deploy hook:
Create /etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh:
#!/bin/bash
systemctl reload stemedb-api
journalctl -u stemedb-api -n 5 | grep -i "certificate reloaded" || \
echo "WARNING: Certificate reload may have failed"
Make executable:
chmod +x /etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh
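Certbot runs every script in the deploy hook directory for each certificate it renews on the host, and it exports the renewed names in the space-separated RENEWED_DOMAINS environment variable. If the host holds more than one certificate, the hook could reload stemedb-api only when its own certificate was renewed. This is a minimal sketch; renewed_includes is a hypothetical helper name.

```shell
# renewed_includes DOMAIN
# Succeeds when DOMAIN appears in certbot's space-separated RENEWED_DOMAINS.
renewed_includes() {
  case " ${RENEWED_DOMAINS:-} " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# In the hook: reload only when the StemeDB certificate was renewed.
if renewed_includes stemedb.example.com; then
  systemctl reload stemedb-api
fi
```

Padding both sides with spaces keeps the match exact, so a name such as notstemedb.example.com does not trigger a reload.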
3. Test renewal automation:
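# Dry run validates the renewal path (certbot does not call deploy hooks during --dry-run)
certbot renew --dry-run
# Run the deploy hook directly to test the reload
/etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh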
Monitoring
1. Alert at 30 days (warning) and 7 days (critical):
# Prometheus alert
- alert: CertificateExpiringWarning
  expr: stemedb_tls_cert_expiry_seconds < (30 * 86400)
  annotations:
    summary: "TLS certificate expires within 30 days"
- alert: CertificateExpiringSoon
  expr: stemedb_tls_cert_expiry_seconds < (7 * 86400)
  annotations:
    summary: "TLS certificate expires within 7 days - RENEW NOW"
2. Export certificate expiry metric:
Ensure /metrics endpoint includes:
stemedb_tls_cert_expiry_seconds{domain="stemedb.example.com"} 2592000
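To spot-check the exported value by hand, the metric can be read from the /metrics output and converted to days (2592000 seconds is 30 days). A minimal sketch, assuming the sample line carries no trailing timestamp; expiry_days_from_metrics is a hypothetical helper and METRICS_URL is a placeholder for wherever the endpoint is served.

```shell
# expiry_days_from_metrics
# Reads Prometheus text-format metrics on stdin and prints, for each
# stemedb_tls_cert_expiry_seconds sample, the value converted to whole days.
# Comment lines (# HELP, # TYPE) are skipped because the match is anchored.
expiry_days_from_metrics() {
  awk '/^stemedb_tls_cert_expiry_seconds/ { printf "%d\n", $NF / 86400 }'
}
```

Example usage: curl -s "$METRICS_URL" | expiry_days_from_metrics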
3. Set up external monitoring:
# Monitor from outside (catches firewall issues)
# Cron job on monitoring server:
0 */6 * * * /usr/local/bin/check-cert.sh stemedb.example.com
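The contents of check-cert.sh are not shown in this runbook. One possible shape, reusing the 30-day and 7-day thresholds from the alerts above, is sketched below. The function names are hypothetical, and parsing the expiry date assumes GNU date.

```shell
#!/bin/sh
# status_for SECONDS_LEFT
# Maps the remaining certificate lifetime to a status using the runbook thresholds.
status_for() {
  if [ "$1" -lt $((7 * 86400)) ]; then
    echo CRITICAL
  elif [ "$1" -lt $((30 * 86400)) ]; then
    echo WARNING
  else
    echo OK
  fi
}

# seconds_left HOST [PORT]
# Prints the remaining lifetime of the certificate served by HOST.
# Assumption: GNU date is installed to parse the openssl notAfter format.
seconds_left() {
  end=$(echo | openssl s_client -servername "$1" -connect "$1:${2:-443}" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null) || return 1
  expiry=$(date -d "${end#notAfter=}" +%s) || return 1
  echo $(( expiry - $(date +%s) ))
}

# check_cert HOST [PORT]
# Prints one status line; returns 3 when the certificate cannot be read.
check_cert() {
  left=$(seconds_left "$1" "${2:-443}") || {
    echo "UNKNOWN: could not read certificate for $1"
    return 3
  }
  echo "$(status_for "$left"): $1 certificate has $((left / 86400)) days left"
}
```

To use it as the cron target, end the script with check_cert "$@" and route the output to whatever notification channel the monitoring server uses.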
Operational Best Practices
1. Renew at day 60 of the 90-day Let's Encrypt lifetime (30 days before expiry):
Edit /etc/letsencrypt/renewal/stemedb.example.com.conf:
renew_before_expiry = 30 days
2. Document certificate renewal procedures:
Maintain runbook with:
- CA contact information
- DNS/domain registrar access
- Escalation path if renewal fails
3. Test renewal quarterly:
# Quarterly manual test
certbot renew --cert-name stemedb.example.com --force-renewal --dry-run
Escalation
Escalate immediately if:
- Certificate expires in <48 hours and renewal is failing
- CA rate limits prevent renewal
- DNS validation requires domain registrar access (not available)
- Certificate already expired and affecting production
Escalation path:
- Primary on-call: Infrastructure SRE
- Secondary: Security engineer (CA coordination)
- Final escalation: VP Engineering + Legal (CA contract issues)
References
- Dashboard: StemeDB TLS Health
- Related alerts: TLSHandshakeFailures, ClientAuthenticationErrors
- Metrics:
- stemedb_tls_cert_expiry_seconds (seconds until expiry)
- stemedb_tls_handshake_errors_total (TLS failures)
- Docs:
- Let's Encrypt: https://letsencrypt.org/docs/
- Certbot renewal: https://eff-certbot.readthedocs.io/en/stable/using.html#renewal