This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
266 lines
10 KiB
YAML
# Alertmanager configuration for Slack integration
#
# This configuration sends StemeDB alerts to Slack channels by severity.
# Merge this with your existing alertmanager.yml or pagerduty-config.yml.

receivers:
  # Critical alerts -> #stemedb-alerts-critical (high visibility)
  - name: 'slack-critical'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
        channel: '#stemedb-alerts-critical'
        username: 'StemeDB Alerts'
        icon_emoji: ':rotating_light:'
        title: ':fire: StemeDB CRITICAL Alert'
        # Use the annotation shared by the group; ranging over .Alerts would
        # concatenate one URL per alert into a single broken link.
        title_link: '{{ .CommonAnnotations.dashboard }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Component:* {{ .Labels.component }}
          *Instance:* {{ .Labels.instance }}

          {{ .Annotations.summary }}

          *Description:*
          {{ .Annotations.description }}

          *Impact:*
          {{ .Annotations.impact }}

          *Action Required:*
          {{ .Annotations.action }}

          <{{ .Annotations.runbook }}|View Runbook> | <{{ .Annotations.dashboard }}|View Dashboard>
          {{ end }}
        color: 'danger'
        send_resolved: true

  # Warning alerts -> #stemedb-alerts-warning (medium visibility)
  - name: 'slack-warning'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_WARNING>'
        channel: '#stemedb-alerts-warning'
        username: 'StemeDB Alerts'
        icon_emoji: ':warning:'
        title: ':warning: StemeDB Warning Alert'
        # Use the annotation shared by the group; ranging over .Alerts would
        # concatenate one URL per alert into a single broken link.
        title_link: '{{ .CommonAnnotations.dashboard }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Component:* {{ .Labels.component }}
          *Instance:* {{ .Labels.instance }}

          {{ .Annotations.summary }}

          *Description:*
          {{ .Annotations.description }}

          <{{ .Annotations.runbook }}|View Runbook>
          {{ end }}
        color: 'warning'
        send_resolved: true

  # Info alerts -> #stemedb-alerts-info (low visibility, audit trail)
  - name: 'slack-info'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_INFO>'
        channel: '#stemedb-alerts-info'
        username: 'StemeDB Alerts'
        icon_emoji: ':information_source:'
        title: 'StemeDB Info'
        text: |
          {{ range .Alerts }}
          {{ .Annotations.summary }}

          {{ .Annotations.description }}

          <{{ .Annotations.runbook }}|Details>
          {{ end }}
        color: 'good'
        send_resolved: false

# Slack Integration Setup Guide

## 1. Create Slack App

1. Go to https://api.slack.com/apps
2. Click **Create New App** → **From scratch**
3. Name: `StemeDB Alerts`
4. Select your workspace

## 2. Enable Incoming Webhooks

1. In your app → **Incoming Webhooks**
2. Toggle **Activate Incoming Webhooks** to ON
3. Click **Add New Webhook to Workspace**
4. Select channel (e.g., `#stemedb-alerts-critical`)
5. Click **Allow**
6. Copy webhook URL (starts with `https://hooks.slack.com/services/...`)
7. Repeat for warning and info channels

## 3. Configure Alertmanager

Replace placeholders with your webhook URLs:

```yaml
api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
```

Becomes:

```yaml
api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
```

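The receivers only take effect once a routing tree points at them. A minimal sketch of such a tree follows, assuming your Prometheus rules set a `severity` label with the values `critical`, `warning`, and `info`. The label values and the `group_by` keys are assumptions, not taken from the StemeDB alert rules.

```yaml
route:
  # Fallback receiver for anything the child routes below do not match
  receiver: 'slack-info'
  group_by: ['alertname', 'component']
  routes:
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warning'
```

Alerts without a matching `severity` fall through to `slack-info`. After merging, `amtool check-config alertmanager.yml` can validate the combined file before you reload Alertmanager.
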
## 4. Test Integration

```bash
# Send test message directly to Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Test alert from StemeDB monitoring setup",
    "username": "StemeDB Alerts",
    "icon_emoji": ":rotating_light:"
  }'
```

## 5. Recommended Channel Structure

Create three Slack channels:

| Channel | Purpose | Members | Notifications |
|---------|---------|---------|---------------|
| `#stemedb-alerts-critical` | Critical alerts requiring immediate action | On-call engineers, managers | @channel |
| `#stemedb-alerts-warning` | Warning alerts for investigation | Engineering team | @here |
| `#stemedb-alerts-info` | Info alerts for audit trail | Engineering team, optional | None |

## 6. Channel Topics

Set channel topics with useful links:

```
#stemedb-alerts-critical
🔴 Critical StemeDB alerts | On-call: @oncall-engineer | Runbooks: https://docs/runbooks | Dashboards: https://grafana/stemedb
```

```
#stemedb-alerts-warning
🟡 StemeDB warning alerts | Escalate to #stemedb-alerts-critical if critical | Runbooks: https://docs/runbooks
```

```
#stemedb-alerts-info
ℹ️ StemeDB informational alerts | No action required | Mute this channel if too noisy
```

## 7. Slack Workflow Integration (Advanced)

For automated incident response, create Slack workflows:

### Critical Alert Workflow

Triggered by: Message posted to `#stemedb-alerts-critical` with "CRITICAL"

Steps:
1. **Create incident channel** (`#incident-YYYY-MM-DD-HH-MM`)
2. **Add participants** (@oncall-engineer, @manager, @sre-lead)
3. **Post incident template** with runbook links
4. **Start Zoom call** for coordination
5. **Create PagerDuty incident** if not auto-created

### Resolution Workflow

Triggered by: Reaction `:white_check_mark:` on critical alert

Steps:
1. **Mark incident as resolved** in PagerDuty
2. **Post resolution message** in incident channel
3. **Request post-mortem** (create template doc)
4. **Archive incident channel** after 7 days

## Troubleshooting

### Messages not appearing in Slack

1. **Verify webhook URL:**
   ```bash
   curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
     -H 'Content-Type: application/json' \
     -d '{"text":"test"}'
   ```

2. **Check Alertmanager logs:**
   ```bash
   journalctl -u alertmanager -f | grep slack
   ```

3. **Verify app permissions:**
   - App must have `incoming-webhook` scope
   - App must be installed in workspace

### Alert formatting broken

- Alertmanager renders the Go templates before sending; Slack then formats the resulting text with its own `mrkdwn` syntax, which differs from standard Markdown
- Test formatting with https://api.slack.com/docs/messages/builder
- Use `\n` for line breaks, `*bold*`, `_italic_`, `` `code` ``

### Too many notifications

- Mute `#stemedb-alerts-info` channel (low priority)
- Increase `group_interval` in Alertmanager (batch more alerts)
- Add inhibition rules to suppress related alerts

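The last two items are both Alertmanager settings. The sketch below uses illustrative values; the intervals and the `equal` labels are assumptions to tune for your environment, and the `route` keys should be merged into your existing top-level route rather than added as a second one.

```yaml
route:
  receiver: 'slack-info'
  group_by: ['alertname', 'component']
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 10m   # wait before notifying about alerts added to an existing group
  repeat_interval: 4h   # wait before re-sending a notification that is still firing

inhibit_rules:
  # While a critical alert fires, mute warning alerts that share
  # the same instance and component.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance', 'component']
```

A longer `group_interval` trades notification latency for fewer messages, so it is usually safer to raise it for the warning and info routes than for the critical one.
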
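### Alerts not resolving

- Set `send_resolved: true` in the Slack config (Alertmanager defaults to `false`, and the info receiver above sets it to `false` explicitly)
- Check that the alert expression actually stops matching after the fix; the rule's `for` duration only delays firing, not resolution
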
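With `send_resolved: true`, Alertmanager reuses the same templates for the resolved notification, so a fixed title and color make it look like a fresh alert. One possible tweak, sketched here for the critical receiver, derives both from the notification status. `.Status`, `.CommonLabels`, and `toUpper` are standard Alertmanager template features; the title wording is only a suggestion.

```yaml
receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
        channel: '#stemedb-alerts-critical'
        send_resolved: true
        # .Status is "firing" while any alert in the group fires, otherwise "resolved"
        title: '[{{ .Status | toUpper }}] StemeDB: {{ .CommonLabels.alertname }}'
        # Red while firing, green once resolved
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        # text: keep the template from the main configuration
```

The `color` expression shown is the same one Alertmanager uses when `color` is left unset, so simply removing the hard-coded `color: 'danger'` would have a similar effect.
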
## Best Practices

1. **Channel naming**: Use consistent prefix (`stemedb-alerts-*`)
2. **Color coding**: Critical=red (`danger`), Warning=orange (`warning`), Info=green (`good`)
3. **Actionable messages**: Include runbook links and next steps
4. **Mention on-call**: Use `@oncall-engineer` handle in critical channel
5. **Archive old channels**: Auto-archive incident channels after 7 days
6. **Review periodically**: Check alert volume, tune thresholds
7. **Test regularly**: Send test alerts monthly to verify routing

## Example Alert Flow

```
┌─────────────────────────────────────────────────────────────┐
│ Prometheus fires "WALDiskNearlyFull" alert                  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Alertmanager routes to 'slack-critical' receiver            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Message posted to #stemedb-alerts-critical                  │
│ "🔥 WAL disk usage >90% on prod-node-1"                     │
│ + Runbook link + Dashboard link                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ On-call engineer clicks runbook                             │
│ Follows steps: Check disk, run cleanup, increase size       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Disk usage drops to 75%                                     │
│ Prometheus marks alert as resolved                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Alertmanager sends resolved notification to Slack           │
│ "✅ WAL disk usage now 75% on prod-node-1"                  │
└─────────────────────────────────────────────────────────────┘
```
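
For this flow to work, the Prometheus rule has to carry the labels and annotations that the Slack templates read (`severity`, `component`, `summary`, `description`, `impact`, `action`, `runbook`, `dashboard`). Below is a hypothetical rule for `WALDiskNearlyFull`, assuming node_exporter filesystem metrics and a WAL volume mounted at `/var/lib/stemedb/wal`. The mount point, `for` duration, annotation wording, and URLs are placeholders, not StemeDB's shipped rule.

```yaml
groups:
  - name: stemedb-wal
    rules:
      - alert: WALDiskNearlyFull
        # Fraction of the WAL volume in use; fires above 90%
        expr: |
          1 - (
            node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb/wal"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/stemedb/wal"}
          ) > 0.90
        for: 5m
        labels:
          severity: critical   # routed to the 'slack-critical' receiver
          component: wal
        annotations:
          summary: 'WAL disk usage >90% on {{ $labels.instance }}'
          description: 'WAL volume is {{ $value | humanizePercentage }} full.'
          impact: 'Writes may fail once the WAL volume is full.'
          action: 'Check disk usage, run cleanup, or increase the volume size.'
          runbook: 'https://docs/runbooks/wal-disk'
          dashboard: 'https://grafana/stemedb'
```

Once the expression stops matching (for example, usage drops to 75%), Prometheus stops sending the alert and Alertmanager posts the resolved notification, provided the receiver has `send_resolved: true`.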