stemedb/docs/operations/runbooks/certificate-renewal.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00

Certificate Expiring Soon

Severity: CRITICAL

Alert Rule

Alert: CertificateExpiringSoon
Trigger: TLS certificate expires within 7 days
Duration: 1h

Symptom

  • Alert fires: "TLS certificate expires in X days"
  • Metrics show stemedb_tls_cert_expiry_seconds < 604800 (7 days)
  • Logs contain certificate expiry warnings
  • openssl commands show approaching expiration date

Impact

User Impact (if cert expires):

  • All HTTPS/TLS connections fail immediately
  • API becomes unreachable for external clients
  • Dashboard shows "Certificate Invalid" errors
  • Inter-node cluster communication fails (if using mTLS)

Business Impact:

  • Complete service outage for external users
  • SLA breach
  • Customer trust erosion (security warnings in browsers)

Investigation Steps

1. Check Certificate Expiration

# Check certificate expiry date
echo | openssl s_client -servername stemedb.example.com \
  -connect localhost:18180 2>/dev/null | \
  openssl x509 -noout -dates
# notBefore=Jan  1 00:00:00 2025 GMT
# notAfter=Apr  1 23:59:59 2026 GMT

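# Check whether the certificate expires within 7 days
# (prints "Certificate will expire" and exits with status 1 if it does)
echo | openssl s_client -servername stemedb.example.com \
  -connect localhost:18180 2>/dev/null | \
  openssl x509 -noout -checkend $((7 * 86400))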
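
The -checkend test above only answers yes or no. To get an actual day count, one option is to parse the notAfter date. The sketch below assumes GNU date (for -d parsing); days_left is a hypothetical helper name, not part of StemeDB or OpenSSL.

```shell
# days_left: hypothetical helper; prints whole days between "now" and a
# certificate's notAfter date. Assumes GNU date (-d parsing).
#   $1  notAfter string as printed by openssl, e.g. "Apr  1 23:59:59 2026 GMT"
#   $2  optional "now" in epoch seconds (defaults to the current time)
days_left() {
  local not_after="$1"
  local now="${2:-$(date -u +%s)}"
  local expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  # Integer division truncates toward zero
  echo $(( (expiry - now) / 86400 ))
}

# Example against the endpoint used in this runbook:
#   end=$(echo | openssl s_client -servername stemedb.example.com \
#           -connect localhost:18180 2>/dev/null | \
#         openssl x509 -noout -enddate | cut -d= -f2)
#   days_left "$end"
```

A value at or below 0 means less than a day remains or the certificate has already expired.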

2. Check Certificate Details

# View full certificate
openssl s_client -servername stemedb.example.com \
  -connect localhost:18180 </dev/null 2>/dev/null | \
  openssl x509 -text -noout | grep -A 3 "Subject:\|Issuer:\|Validity"

3. Check Certificate Source

# Check if using Let's Encrypt
openssl x509 -in /etc/stemedb/tls/cert.pem -noout -issuer
# issuer=C = US, O = Let's Encrypt, CN = R3   (intermediate name varies)

# Check certbot renewal status (if using Let's Encrypt)
certbot certificates | grep -A 10 stemedb.example.com

4. Check Renewal Automation

# Check certbot timer (systemd)
systemctl status certbot.timer

# Check cron jobs
crontab -l | grep certbot

# Check recent renewal attempts
journalctl -u certbot --since "7 days ago" | grep -i "renew"

Resolution

If Using Let's Encrypt

1. Attempt manual renewal:

# Dry run first
certbot renew --dry-run --cert-name stemedb.example.com

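# If successful, perform the actual renewal
# (--force-renewal is needed only if certbot does not yet consider the cert due;
# forced renewals count against Let's Encrypt rate limits)
certbot renew --cert-name stemedb.example.com --force-renewal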

2. Reload certificate in stemedb-api:

# Option A: Graceful reload (no downtime)
systemctl reload stemedb-api

# Option B: Restart (brief downtime)
systemctl restart stemedb-api

3. Verify new certificate:

echo | openssl s_client -servername stemedb.example.com \
  -connect localhost:18180 2>/dev/null | \
  openssl x509 -noout -dates | grep notAfter
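
Checking notAfter confirms a date, but not that the running process picked up the file on disk. One way to confirm the reload took effect is to compare SHA-256 fingerprints of the served and on-disk certificates. A minimal sketch; cert_fingerprint and same_cert are hypothetical helper names.

```shell
# cert_fingerprint: print the SHA-256 fingerprint of the first certificate
# in a PEM file (the leaf, if the file holds a chain).
cert_fingerprint() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 2>/dev/null | cut -d= -f2
}

# same_cert: succeed only if both PEM files hold the same certificate.
same_cert() {
  local a b
  a=$(cert_fingerprint "$1")
  b=$(cert_fingerprint "$2")
  # An empty fingerprint means the file could not be parsed
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example: compare what the server presents with the file on disk.
#   echo | openssl s_client -servername stemedb.example.com \
#     -connect localhost:18180 2>/dev/null | openssl x509 > /tmp/served.pem
#   same_cert /tmp/served.pem /etc/stemedb/tls/cert.pem && echo "new cert is live"
```

If the fingerprints differ after a reload, the process is probably still holding the old certificate, and the restart option above may be needed.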

If Using Custom CA

1. Generate new certificate signing request (CSR):

# Generate new private key
openssl genrsa -out /etc/stemedb/tls/new-key.pem 4096

# Generate CSR
openssl req -new -key /etc/stemedb/tls/new-key.pem \
  -out /tmp/stemedb.csr \
  -subj "/C=US/ST=CA/O=StemeDB/CN=stemedb.example.com"

2. Submit CSR to CA:

# Send CSR to CA for signing
# (Process varies by CA - follow CA-specific procedures)
mail -s "Certificate Renewal Request" ca@example.com < /tmp/stemedb.csr

3. After receiving signed certificate, install:

# Backup old certificate
cp /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.old.$(date +%Y%m%d)
cp /etc/stemedb/tls/key.pem /etc/stemedb/tls/key.pem.old.$(date +%Y%m%d)

# Install new certificate
mv /tmp/new-cert.pem /etc/stemedb/tls/cert.pem
mv /etc/stemedb/tls/new-key.pem /etc/stemedb/tls/key.pem

# Set correct permissions
chmod 600 /etc/stemedb/tls/key.pem
chmod 644 /etc/stemedb/tls/cert.pem
chown stemedb:stemedb /etc/stemedb/tls/*.pem
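
A signed certificate that does not correspond to the new private key will not be usable by the service, so it can be worth checking the pair before reloading. The sketch below compares the public key embedded in the certificate with the one derived from the private key. It assumes an unencrypted key, as generated above; cert_matches_key is a hypothetical helper name.

```shell
# cert_matches_key: hypothetical helper; succeeds only if the certificate's
# public key is the one derived from the given private key.
#   $1  certificate (PEM)
#   $2  private key (PEM, unencrypted)
cert_matches_key() {
  local cert_pub key_pub
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null)
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null)
  # An empty value means one of the files could not be parsed
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Example with the paths used in this runbook:
#   cert_matches_key /etc/stemedb/tls/cert.pem /etc/stemedb/tls/key.pem \
#     || echo "ERROR: certificate and key do not match; restore the .old files"
```

The same check can be run against /tmp/new-cert.pem and /etc/stemedb/tls/new-key.pem before moving them into place.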

4. Reload service:

systemctl reload stemedb-api

# Verify service accepted new cert
journalctl -u stemedb-api --since "1 min ago" | grep -i "tls\|certificate"

If Renewal Fails

1. Check common failure reasons:

# DNS validation issues (Let's Encrypt)
dig _acme-challenge.stemedb.example.com TXT

# HTTP validation issues
curl -v http://stemedb.example.com/.well-known/acme-challenge/test

# Rate limits
certbot renew --dry-run 2>&1 | grep -i "rate limit"

2. Switch to DNS validation (if HTTP fails):

certbot certonly --manual --preferred-challenges dns \
  -d stemedb.example.com \
  --email ops@example.com

3. Test against the staging CA (doesn't count against production rate limits):

# --dry-run already targets the staging server; --server makes that explicit
certbot renew --cert-name stemedb.example.com \
  --server https://acme-staging-v02.api.letsencrypt.org/directory \
  --dry-run

If Certificate Already Expired

1. Generate temporary self-signed certificate:

openssl req -x509 -nodes -days 30 -newkey rsa:4096 \
  -keyout /etc/stemedb/tls/temp-key.pem \
  -out /etc/stemedb/tls/temp-cert.pem \
  -subj "/CN=stemedb.example.com"

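2. Install temporary cert (clients that verify the certificate chain will still reject it unless they explicitly trust it):

mv /etc/stemedb/tls/cert.pem /etc/stemedb/tls/cert.pem.expired
# Keep the old key too; the next line would otherwise overwrite it
cp /etc/stemedb/tls/key.pem /etc/stemedb/tls/key.pem.bak
cp /etc/stemedb/tls/temp-cert.pem /etc/stemedb/tls/cert.pem
cp /etc/stemedb/tls/temp-key.pem /etc/stemedb/tls/key.pem
systemctl reload stemedb-api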

3. Fix renewal and replace with valid cert:

Follow renewal steps above, then replace temporary cert.

Prevention

Automated Renewal

1. Enable certbot timer (Let's Encrypt):

# Enable automatic renewal
systemctl enable certbot.timer
systemctl start certbot.timer

# Verify timer is active
systemctl list-timers | grep certbot

2. Configure deploy hook:

Create /etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh:

#!/bin/bash
systemctl reload stemedb-api
journalctl -u stemedb-api -n 5 | grep -i "certificate reloaded" || \
  echo "WARNING: Certificate reload may have failed"

Make executable:

chmod +x /etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh
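
The hook above only reloads the service. Certbot writes renewed files under /etc/letsencrypt/live/, while this runbook reads from /etc/stemedb/tls/. If those paths are not symlinked (an assumption; the runbook does not say), the hook would also need to copy the files. Certbot exports RENEWED_LINEAGE to deploy hooks, pointing at the live directory of the renewed certificate. A sketch of such a hook, with install_renewed as a hypothetical helper name:

```shell
#!/bin/bash
# install_renewed: hypothetical helper; copies a renewed certbot lineage
# into the directory the service reads from.
#   $1  lineage directory (certbot sets RENEWED_LINEAGE for deploy hooks)
#   $2  destination directory (this runbook uses /etc/stemedb/tls)
install_renewed() {
  local lineage="$1" dest="$2"
  [ -r "$lineage/fullchain.pem" ] && [ -r "$lineage/privkey.pem" ] || return 1
  # install(1) copies the file and sets its mode in one step
  install -m 0644 "$lineage/fullchain.pem" "$dest/cert.pem" &&
    install -m 0600 "$lineage/privkey.pem" "$dest/key.pem"
}

# Act only when certbot invokes this script as a deploy hook.
if [ -n "${RENEWED_LINEAGE:-}" ]; then
  install_renewed "$RENEWED_LINEAGE" /etc/stemedb/tls &&
    chown stemedb:stemedb /etc/stemedb/tls/cert.pem /etc/stemedb/tls/key.pem &&
    systemctl reload stemedb-api
fi
```

Chaining the steps with && means the service is reloaded only if both files were copied and re-owned successfully.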

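3. Test renewal automation:

# Dry run exercises the renewal path but skips deploy hooks
certbot renew --dry-run

# Run the deploy hook by hand to confirm the reload works
/etc/letsencrypt/renewal-hooks/deploy/reload-stemedb.sh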

Monitoring

1. Alert at 30 days (warning) and 7 days (critical):

# Prometheus alert
- alert: CertificateExpiringWarning
  expr: stemedb_tls_cert_expiry_seconds < (30 * 86400)
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate expires within 30 days"

- alert: CertificateExpiringSoon
  expr: stemedb_tls_cert_expiry_seconds < (7 * 86400)
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "TLS certificate expires within 7 days - RENEW NOW"

2. Export certificate expiry metric:

Ensure /metrics endpoint includes:

stemedb_tls_cert_expiry_seconds{domain="stemedb.example.com"} 2592000
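
The metric is reported in seconds, which is awkward to read during an incident. A small sketch that converts it to whole days follows. It assumes the metric appears in the Prometheus text format shown above, with domain as the first label; expiry_days_from_metrics is a hypothetical helper name.

```shell
# expiry_days_from_metrics: hypothetical helper; reads Prometheus text
# format on stdin and prints "<domain> <whole days left>" per matching series.
expiry_days_from_metrics() {
  awk '/^stemedb_tls_cert_expiry_seconds[{ ]/ {
         label = $1
         sub(/^[^{]*[{]domain="/, "", label)   # strip up to the label value
         sub(/".*$/, "", label)                # strip the closing quote onward
         printf "%s %d\n", label, $NF / 86400
       }'
}

# Example (scheme, host, and port are assumptions for illustration):
#   curl -s https://stemedb.example.com:18180/metrics | expiry_days_from_metrics
```

For the sample line above, 2592000 seconds is reported as 30 days.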

3. Set up external monitoring:

# Monitor from outside (catches firewall issues)
# Cron job on monitoring server:
0 */6 * * * /usr/local/bin/check-cert.sh stemedb.example.com
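
The cron entry above references check-cert.sh, which this runbook does not show. A minimal sketch of what such a script might look like follows. The thresholds mirror the 30-day and 7-day alerts; the default port of 443, the exit codes, and the classify_pem helper name are assumptions for illustration.

```shell
#!/bin/sh
# classify_pem: hypothetical helper; prints a status line for a PEM file and
# returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
classify_pem() {
  pem="$1"
  if ! openssl x509 -in "$pem" -noout -checkend $((7 * 86400)) >/dev/null 2>&1; then
    echo "CRITICAL: certificate expires within 7 days (or could not be read)"
    return 2
  elif ! openssl x509 -in "$pem" -noout -checkend $((30 * 86400)) >/dev/null 2>&1; then
    echo "WARNING: certificate expires within 30 days"
    return 1
  fi
  echo "OK: certificate valid for more than 30 days"
}

# Usage: check-cert.sh <hostname> [port]
# Fetches the served certificate and classifies it.
if [ "$#" -ge 1 ]; then
  tmp=$(mktemp) || exit 2
  trap 'rm -f "$tmp"' EXIT
  echo | openssl s_client -servername "$1" -connect "$1:${2:-443}" 2>/dev/null |
    openssl x509 > "$tmp" 2>/dev/null
  classify_pem "$tmp"
fi
```

A failed connection leaves the temporary file empty, which the sketch deliberately reports as CRITICAL so that an unreachable endpoint is not mistaken for a healthy one.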

Operational Best Practices

1. Renew at day 60 of the 90-day Let's Encrypt lifetime (30 days before expiry):

Edit /etc/letsencrypt/renewal/stemedb.example.com.conf:

renew_before_expiry = 30 days

2. Document certificate renewal procedures:

Maintain runbook with:

  • CA contact information
  • DNS/domain registrar access
  • Escalation path if renewal fails

3. Test renewal quarterly:

# Quarterly manual test
certbot renew --cert-name stemedb.example.com --force-renewal --dry-run

Escalation

Escalate immediately if:

  • Certificate expires in <48 hours and renewal is failing
  • CA rate limits prevent renewal
  • DNS validation requires domain registrar access that the on-call engineer does not have
  • Certificate already expired and affecting production

Escalation path:

  1. Primary on-call: Infrastructure SRE
  2. Secondary: Security engineer (CA coordination)
  3. Final escalation: VP Engineering + Legal (CA contract issues)

References