stemedb/docs/operations/runbooks/certificate-renewal.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

338 lines
7.9 KiB
Markdown

# Certificate Expiring Soon
## Severity: CRITICAL
## Alert Rule
**Alert:** `CertificateExpiringSoon`
**Trigger:** TLS certificate expires within 7 days
**Duration:** 1h
## Symptom
- Alert fires: "TLS certificate expires in X days"
- Metrics show `stemedb_tls_cert_expiry_seconds < 604800` (7 days)
- Logs contain certificate expiry warnings
- `openssl` commands show approaching expiration date
## Impact
**User Impact (if cert expires):**
- All HTTPS/TLS connections fail immediately
- API becomes unreachable for external clients
- Dashboard shows "Certificate Invalid" errors
- Inter-node cluster communication fails (if using mTLS)
**Business Impact:**
- Complete service outage for external users
- SLA breach
- Customer trust erosion (security warnings in browsers)
## Investigation Steps
### 1. Check Certificate Expiration
```bash
# Check certificate expiry date
echo | openssl s_client -servername stemedb.example.com \
-connect localhost:18180 2>/dev/null | \
openssl x509 -noout -dates
# notBefore=Jan 1 00:00:00 2025 GMT
# notAfter=Apr 1 23:59:59 2026 GMT
# Days until expiry
echo | openssl s_client -servername stemedb.example.com \
-connect localhost:18180 2>/dev/null | \
openssl x509 -noout -checkend $((7 * 86400))
```
### 2. Check Certificate Details
```bash
# View full certificate
openssl s_client -servername stemedb.example.com \
-connect localhost:18180 </dev/null 2>/dev/null | \
openssl x509 -text -noout | grep -A 3 "Subject:\|Issuer:\|Validity"
```
### 3. Check Certificate Source
```bash
# Check if using Let's Encrypt
cat /etc/stemedb/tls/cert.pem | openssl x509 -noout -issuer
# issuer=C = US, O = Let's Encrypt, CN = R3
# Check certbot renewal status (if using Let's Encrypt)
certbot certificates | grep -A 10 stemedb.example.com
```
### 4. Check Renewal Automation
```bash
# Check certbot timer (systemd)
systemctl status certbot.timer
# Check cron jobs
crontab -l | grep certbot
# Check recent renewal attempts
journalctl -u certbot --since "7 days ago" | grep -i "renew"
```
## Resolution
### If Using Let's Encrypt
**1. Attempt manual renewal:**
```bash
# Dry run first
certbot renew --dry-run --cert-name stemedb.example.com
# If successful, perform actual renewal
certbot renew --cert-name stemedb.example.com --force-renewal
```
**2. Reload certificate in stemedb-api:**
```bash
# Option A: Graceful reload (no downtime)
systemctl reload stemedb-api
# Option B: Restart (brief downtime)
systemctl restart stemedb-api
```
**3. Verify new certificate:**
```bash
echo | openssl s_client -servername stemedb.example.com \
-connect localhost:18180 2>/dev/null | \
openssl x509 -noout -dates | grep notAfter
```
### If Using Custom CA
**1. Generate new certificate signing request (CSR):**
```bash
# Generate new private key
openssl genrsa -out /etc/stemedb/tls/new-key.pem 4096
# Generate CSR
openssl req -new -key /etc/stemedb/tls/new-key.pem \
-out /tmp/stemedb.csr \
-subj "/C=US/ST=CA/O=StemeDB/CN=stemedb.example.com"
```
**2. Submit CSR to CA:**
```bash
# Send CSR to CA for signing
# (Process varies by CA - follow CA-specific procedures)
cat /tmp/stemedb.csr | mail -s "Certificate Renewal Request" ca@example.com
```
**3. After receiving signed certificate, install:**
```bash
# Backup old certificate
cp /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.old.$(date +%Y%m%d)
cp /etc/stemedb/tls/key.pem /etc/stemedb/tls/key.pem.old.$(date +%Y%m%d)
# Install new certificate
mv /tmp/new-cert.pem /etc/stemedb/tls/cert.pem
mv /etc/stemedb/tls/new-key.pem /etc/stemedb/tls/key.pem
# Set correct permissions
chmod 600 /etc/stemedb/tls/key.pem
chmod 644 /etc/stemedb/tls/cert.pem
chown stemedb:stemedb /etc/stemedb/tls/*.pem
```
**4. Reload service:**
```bash
systemctl reload stemedb-api
# Verify service accepted new cert
journalctl -u stemedb-api --since "1 min ago" | grep -i "tls\|certificate"
```
### If Renewal Fails
**1. Check common failure reasons:**
```bash
# DNS validation issues (Let's Encrypt)
dig _acme-challenge.stemedb.example.com TXT
# HTTP validation issues
curl -v http://stemedb.example.com/.well-known/acme-challenge/test
# Rate limits
certbot renew --dry-run 2>&1 | grep -i "rate limit"
```
**2. Switch to DNS validation (if HTTP fails):**
```bash
certbot certonly --manual --preferred-challenges dns \
-d stemedb.example.com \
--email ops@example.com
```
**3. Use staging CA to test (doesn't count against rate limits):**
```bash
certbot renew --cert-name stemedb.example.com \
--server https://acme-staging-v02.api.letsencrypt.org/directory \
--dry-run
```
### If Certificate Already Expired
**1. Generate temporary self-signed certificate:**
```bash
openssl req -x509 -nodes -days 30 -newkey rsa:4096 \
-keyout /etc/stemedb/tls/temp-key.pem \
-out /etc/stemedb/tls/temp-cert.pem \
-subj "/CN=stemedb.example.com"
```
**2. Install temporary cert:**
```bash
mv /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.expired
cp /etc/stemedb/tls/temp-cert.pem /etc/stemedb/tls/cert.pem
cp /etc/stemedb/tls/temp-key.pem /etc/stemedb/tls/key.pem
systemctl reload stemedb-api
```
**3. Fix renewal and replace with valid cert:**
Follow renewal steps above, then replace temporary cert.
## Prevention
### Automated Renewal
**1. Enable certbot timer (Let's Encrypt):**
```bash
# Enable automatic renewal
systemctl enable certbot.timer
systemctl start certbot.timer
# Verify timer is active
systemctl list-timers | grep certbot
```
**2. Configure deploy hook:**
Create `/etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh`:
```bash
#!/bin/bash
systemctl reload stemedb-api
journalctl -u stemedb-api -n 5 | grep -i "certificate reloaded" || \
echo "WARNING: Certificate reload may have failed"
```
Make executable:
```bash
chmod +x /etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh
```
**3. Test renewal automation:**
```bash
# Dry run triggers deploy hook
certbot renew --dry-run
```
### Monitoring
**1. Alert at 30 days (warning) and 7 days (critical):**
```yaml
# Prometheus alert
- alert: CertificateExpiringWarning
expr: stemedb_tls_cert_expiry_seconds < (30 * 86400)
annotations:
summary: "TLS certificate expires in 30 days"
- alert: CertificateExpiringSoon
expr: stemedb_tls_cert_expiry_seconds < (7 * 86400)
annotations:
summary: "TLS certificate expires in 7 days - RENEW NOW"
```
**2. Export certificate expiry metric:**
Ensure `/metrics` endpoint includes:
```
stemedb_tls_cert_expiry_seconds{domain="stemedb.example.com"} 2592000
```
**3. Set up external monitoring:**
```bash
# Monitor from outside (catches firewall issues)
# Cron job on monitoring server:
0 */6 * * * /usr/local/bin/check-cert.sh stemedb.example.com
```
### Operational Best Practices
**1. Renew at 60 days (Let's Encrypt expires at 90):**
Edit `/etc/letsencrypt/renewal/stemedb.example.com.conf`:
```ini
renew_before_expiry = 30 days
```
**2. Document certificate renewal procedures:**
Maintain runbook with:
- CA contact information
- DNS/domain registrar access
- Escalation path if renewal fails
**3. Test renewal quarterly:**
```bash
# Quarterly manual test
certbot renew --cert-name stemedb.example.com --force-renewal --dry-run
```
## Escalation
**Escalate immediately if:**
- Certificate expires in <48 hours and renewal failing
- CA rate limits prevent renewal
- DNS validation requires domain registrar access (not available)
- Certificate already expired and affecting production
**Escalation path:**
1. **Primary on-call:** Infrastructure SRE
2. **Secondary:** Security engineer (CA coordination)
3. **Final escalation:** VP Engineering + Legal (CA contract issues)
## References
- **Dashboard:** [StemeDB TLS Health](http://grafana.example.com/d/stemedb-tls)
- **Related alerts:** `TLSHandshakeFailures`, `ClientAuthenticationErrors`
- **Metrics:**
- `stemedb_tls_cert_expiry_seconds` (days until expiry)
- `stemedb_tls_handshake_errors_total` (TLS failures)
- **Docs:**
- Let's Encrypt: https://letsencrypt.org/docs/
- Certbot renewal: https://eff-certbot.readthedocs.io/en/stable/using.html#renewal