This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
338 lines
7.9 KiB
Markdown
338 lines
7.9 KiB
Markdown
# Certificate Expiring Soon
|
|
|
|
## Severity: CRITICAL
|
|
|
|
## Alert Rule
|
|
|
|
**Alert:** `CertificateExpiringSoon`
|
|
**Trigger:** TLS certificate expires within 7 days
|
|
**Duration:** 1h
|
|
|
|
## Symptom
|
|
|
|
- Alert fires: "TLS certificate expires in X days"
|
|
- Metrics show `stemedb_tls_cert_expiry_seconds < 604800` (7 days)
|
|
- Logs contain certificate expiry warnings
|
|
- `openssl` commands show approaching expiration date
|
|
|
|
## Impact
|
|
|
|
**User Impact (if cert expires):**
|
|
- All HTTPS/TLS connections fail immediately
|
|
- API becomes unreachable for external clients
|
|
- Dashboard shows "Certificate Invalid" errors
|
|
- Inter-node cluster communication fails (if using mTLS)
|
|
|
|
**Business Impact:**
|
|
- Complete service outage for external users
|
|
- SLA breach
|
|
- Customer trust erosion (security warnings in browsers)
|
|
|
|
## Investigation Steps
|
|
|
|
### 1. Check Certificate Expiration
|
|
|
|
```bash
|
|
# Check certificate expiry date
|
|
echo | openssl s_client -servername stemedb.example.com \
|
|
-connect localhost:18180 2>/dev/null | \
|
|
openssl x509 -noout -dates
|
|
# notBefore=Jan 1 00:00:00 2025 GMT
|
|
# notAfter=Apr 1 23:59:59 2026 GMT
|
|
|
|
# Days until expiry
|
|
echo | openssl s_client -servername stemedb.example.com \
|
|
-connect localhost:18180 2>/dev/null | \
|
|
openssl x509 -noout -checkend $((7 * 86400))
|
|
```
|
|
|
|
### 2. Check Certificate Details
|
|
|
|
```bash
|
|
# View full certificate
|
|
openssl s_client -servername stemedb.example.com \
|
|
-connect localhost:18180 </dev/null 2>/dev/null | \
|
|
openssl x509 -text -noout | grep -A 3 "Subject:\|Issuer:\|Validity"
|
|
```
|
|
|
|
### 3. Check Certificate Source
|
|
|
|
```bash
|
|
# Check if using Let's Encrypt
|
|
cat /etc/stemedb/tls/cert.pem | openssl x509 -noout -issuer
|
|
# issuer=C = US, O = Let's Encrypt, CN = R3
|
|
|
|
# Check certbot renewal status (if using Let's Encrypt)
|
|
certbot certificates | grep -A 10 stemedb.example.com
|
|
```
|
|
|
|
### 4. Check Renewal Automation
|
|
|
|
```bash
|
|
# Check certbot timer (systemd)
|
|
systemctl status certbot.timer
|
|
|
|
# Check cron jobs
|
|
crontab -l | grep certbot
|
|
|
|
# Check recent renewal attempts
|
|
journalctl -u certbot --since "7 days ago" | grep -i "renew"
|
|
```
|
|
|
|
## Resolution
|
|
|
|
### If Using Let's Encrypt
|
|
|
|
**1. Attempt manual renewal:**
|
|
|
|
```bash
|
|
# Dry run first
|
|
certbot renew --dry-run --cert-name stemedb.example.com
|
|
|
|
# If successful, perform actual renewal
|
|
certbot renew --cert-name stemedb.example.com --force-renewal
|
|
```
|
|
|
|
**2. Reload certificate in stemedb-api:**
|
|
|
|
```bash
|
|
# Option A: Graceful reload (no downtime)
|
|
systemctl reload stemedb-api
|
|
|
|
# Option B: Restart (brief downtime)
|
|
systemctl restart stemedb-api
|
|
```
|
|
|
|
**3. Verify new certificate:**
|
|
|
|
```bash
|
|
echo | openssl s_client -servername stemedb.example.com \
|
|
-connect localhost:18180 2>/dev/null | \
|
|
openssl x509 -noout -dates | grep notAfter
|
|
```
|
|
|
|
### If Using Custom CA
|
|
|
|
**1. Generate new certificate signing request (CSR):**
|
|
|
|
```bash
|
|
# Generate new private key
|
|
openssl genrsa -out /etc/stemedb/tls/new-key.pem 4096
|
|
|
|
# Generate CSR
|
|
openssl req -new -key /etc/stemedb/tls/new-key.pem \
|
|
-out /tmp/stemedb.csr \
|
|
-subj "/C=US/ST=CA/O=StemeDB/CN=stemedb.example.com"
|
|
```
|
|
|
|
**2. Submit CSR to CA:**
|
|
|
|
```bash
|
|
# Send CSR to CA for signing
|
|
# (Process varies by CA - follow CA-specific procedures)
|
|
cat /tmp/stemedb.csr | mail -s "Certificate Renewal Request" ca@example.com
|
|
```
|
|
|
|
**3. After receiving signed certificate, install:**
|
|
|
|
```bash
|
|
# Backup old certificate
|
|
cp /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.old.$(date +%Y%m%d)
|
|
cp /etc/stemedb/tls/key.pem /etc/stemedb/tls/key.pem.old.$(date +%Y%m%d)
|
|
|
|
# Install new certificate
|
|
mv /tmp/new-cert.pem /etc/stemedb/tls/cert.pem
|
|
mv /etc/stemedb/tls/new-key.pem /etc/stemedb/tls/key.pem
|
|
|
|
# Set correct permissions
|
|
chmod 600 /etc/stemedb/tls/key.pem
|
|
chmod 644 /etc/stemedb/tls/cert.pem
|
|
chown stemedb:stemedb /etc/stemedb/tls/*.pem
|
|
```
|
|
|
|
**4. Reload service:**
|
|
|
|
```bash
|
|
systemctl reload stemedb-api
|
|
|
|
# Verify service accepted new cert
|
|
journalctl -u stemedb-api --since "1 min ago" | grep -i "tls\|certificate"
|
|
```
|
|
|
|
### If Renewal Fails
|
|
|
|
**1. Check common failure reasons:**
|
|
|
|
```bash
|
|
# DNS validation issues (Let's Encrypt)
|
|
dig _acme-challenge.stemedb.example.com TXT
|
|
|
|
# HTTP validation issues
|
|
curl -v http://stemedb.example.com/.well-known/acme-challenge/test
|
|
|
|
# Rate limits
|
|
certbot renew --dry-run 2>&1 | grep -i "rate limit"
|
|
```
|
|
|
|
**2. Switch to DNS validation (if HTTP fails):**
|
|
|
|
```bash
|
|
certbot certonly --manual --preferred-challenges dns \
|
|
-d stemedb.example.com \
|
|
--email ops@example.com
|
|
```
|
|
|
|
**3. Use staging CA to test (doesn't count against rate limits):**
|
|
|
|
```bash
|
|
certbot renew --cert-name stemedb.example.com \
|
|
--server https://acme-staging-v02.api.letsencrypt.org/directory \
|
|
--dry-run
|
|
```
|
|
|
|
### If Certificate Already Expired
|
|
|
|
**1. Generate temporary self-signed certificate:**
|
|
|
|
```bash
|
|
openssl req -x509 -nodes -days 30 -newkey rsa:4096 \
|
|
-keyout /etc/stemedb/tls/temp-key.pem \
|
|
-out /etc/stemedb/tls/temp-cert.pem \
|
|
-subj "/CN=stemedb.example.com"
|
|
```
|
|
|
|
**2. Install temporary cert:**
|
|
|
|
```bash
|
|
mv /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.expired
|
|
cp /etc/stemedb/tls/temp-cert.pem /etc/stemedb/tls/cert.pem
|
|
cp /etc/stemedb/tls/temp-key.pem /etc/stemedb/tls/key.pem
|
|
systemctl reload stemedb-api
|
|
```
|
|
|
|
**3. Fix renewal and replace with valid cert:**
|
|
|
|
Follow renewal steps above, then replace temporary cert.
|
|
|
|
## Prevention
|
|
|
|
### Automated Renewal
|
|
|
|
**1. Enable certbot timer (Let's Encrypt):**
|
|
|
|
```bash
|
|
# Enable automatic renewal
|
|
systemctl enable certbot.timer
|
|
systemctl start certbot.timer
|
|
|
|
# Verify timer is active
|
|
systemctl list-timers | grep certbot
|
|
```
|
|
|
|
**2. Configure deploy hook:**
|
|
|
|
Create `/etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh`:
|
|
|
|
```bash
|
|
#!/bin/bash
|
|
systemctl reload stemedb-api
|
|
journalctl -u stemedb-api -n 5 | grep -i "certificate reloaded" || \
|
|
echo "WARNING: Certificate reload may have failed"
|
|
```
|
|
|
|
Make executable:
|
|
|
|
```bash
|
|
chmod +x /etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh
|
|
```
|
|
|
|
**3. Test renewal automation:**
|
|
|
|
```bash
|
|
# Dry run triggers deploy hook
|
|
certbot renew --dry-run
|
|
```
|
|
|
|
### Monitoring
|
|
|
|
**1. Alert at 30 days (warning) and 7 days (critical):**
|
|
|
|
```yaml
|
|
# Prometheus alert
|
|
- alert: CertificateExpiringWarning
|
|
expr: stemedb_tls_cert_expiry_seconds < (30 * 86400)
|
|
annotations:
|
|
summary: "TLS certificate expires in 30 days"
|
|
|
|
- alert: CertificateExpiringSoon
|
|
expr: stemedb_tls_cert_expiry_seconds < (7 * 86400)
|
|
annotations:
|
|
summary: "TLS certificate expires in 7 days - RENEW NOW"
|
|
```
|
|
|
|
**2. Export certificate expiry metric:**
|
|
|
|
Ensure `/metrics` endpoint includes:
|
|
|
|
```
|
|
stemedb_tls_cert_expiry_seconds{domain="stemedb.example.com"} 2592000
|
|
```
|
|
|
|
**3. Set up external monitoring:**
|
|
|
|
```bash
|
|
# Monitor from outside (catches firewall issues)
|
|
# Cron job on monitoring server:
|
|
0 */6 * * * /usr/local/bin/check-cert.sh stemedb.example.com
|
|
```
|
|
|
|
### Operational Best Practices
|
|
|
|
**1. Renew at 60 days (Let's Encrypt expires at 90):**
|
|
|
|
Edit `/etc/letsencrypt/renewal/stemedb.example.com.conf`:
|
|
|
|
```ini
|
|
renew_before_expiry = 30 days
|
|
```
|
|
|
|
**2. Document certificate renewal procedures:**
|
|
|
|
Maintain runbook with:
|
|
- CA contact information
|
|
- DNS/domain registrar access
|
|
- Escalation path if renewal fails
|
|
|
|
**3. Test renewal quarterly:**
|
|
|
|
```bash
|
|
# Quarterly manual test
|
|
certbot renew --cert-name stemedb.example.com --force-renewal --dry-run
|
|
```
|
|
|
|
## Escalation
|
|
|
|
**Escalate immediately if:**
|
|
|
|
- Certificate expires in <48 hours and renewal failing
|
|
- CA rate limits prevent renewal
|
|
- DNS validation requires domain registrar access (not available)
|
|
- Certificate already expired and affecting production
|
|
|
|
**Escalation path:**
|
|
|
|
1. **Primary on-call:** Infrastructure SRE
|
|
2. **Secondary:** Security engineer (CA coordination)
|
|
3. **Final escalation:** VP Engineering + Legal (CA contract issues)
|
|
|
|
## References
|
|
|
|
- **Dashboard:** [StemeDB TLS Health](http://grafana.example.com/d/stemedb-tls)
|
|
- **Related alerts:** `TLSHandshakeFailures`, `ClientAuthenticationErrors`
|
|
- **Metrics:**
|
|
- `stemedb_tls_cert_expiry_seconds` (days until expiry)
|
|
- `stemedb_tls_handshake_errors_total` (TLS failures)
|
|
- **Docs:**
|
|
- Let's Encrypt: https://letsencrypt.org/docs/
|
|
- Certbot renewal: https://eff-certbot.readthedocs.io/en/stable/using.html#renewal
|