stemedb/docs/operations/monitoring/alerting/slack-config.yml
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

# Alertmanager configuration for Slack integration
#
# This configuration sends StemeDB alerts to Slack channels by severity.
# Merge this with your existing alertmanager.yml or pagerduty-config.yml.
receivers:
# Critical alerts -> #stemedb-alerts-critical (high visibility)
- name: 'slack-critical'
  slack_configs:
  - api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
    channel: '#stemedb-alerts-critical'
    username: 'StemeDB Alerts'
    icon_emoji: ':rotating_light:'
    title: ':fire: StemeDB CRITICAL Alert'
    # Use the annotation shared by the whole group; ranging over .Alerts here
    # would concatenate several URLs into one invalid link.
    title_link: '{{ .CommonAnnotations.dashboard }}'
    text: |
      {{ range .Alerts }}
      *Alert:* {{ .Labels.alertname }}
      *Severity:* {{ .Labels.severity }}
      *Component:* {{ .Labels.component }}
      *Instance:* {{ .Labels.instance }}
      {{ .Annotations.summary }}
      *Description:*
      {{ .Annotations.description }}
      *Impact:*
      {{ .Annotations.impact }}
      *Action Required:*
      {{ .Annotations.action }}
      <{{ .Annotations.runbook }}|View Runbook> | <{{ .Annotations.dashboard }}|View Dashboard>
      {{ end }}
    color: 'danger'
    send_resolved: true
# Warning alerts -> #stemedb-alerts-warning (medium visibility)
- name: 'slack-warning'
  slack_configs:
  - api_url: '<YOUR_SLACK_WEBHOOK_URL_WARNING>'
    channel: '#stemedb-alerts-warning'
    username: 'StemeDB Alerts'
    icon_emoji: ':warning:'
    title: ':warning: StemeDB Warning Alert'
    # Use the annotation shared by the whole group; ranging over .Alerts here
    # would concatenate several URLs into one invalid link.
    title_link: '{{ .CommonAnnotations.dashboard }}'
    text: |
      {{ range .Alerts }}
      *Alert:* {{ .Labels.alertname }}
      *Component:* {{ .Labels.component }}
      *Instance:* {{ .Labels.instance }}
      {{ .Annotations.summary }}
      *Description:*
      {{ .Annotations.description }}
      <{{ .Annotations.runbook }}|View Runbook>
      {{ end }}
    color: 'warning'
    send_resolved: true
# Info alerts -> #stemedb-alerts-info (low visibility, audit trail)
- name: 'slack-info'
  slack_configs:
  - api_url: '<YOUR_SLACK_WEBHOOK_URL_INFO>'
    channel: '#stemedb-alerts-info'
    username: 'StemeDB Alerts'
    icon_emoji: ':information_source:'
    title: 'StemeDB Info'
    text: |
      {{ range .Alerts }}
      {{ .Annotations.summary }}
      {{ .Annotations.description }}
      <{{ .Annotations.runbook }}|Details>
      {{ end }}
    color: 'good'
    send_resolved: false
# Slack Integration Setup Guide
## 1. Create Slack App
1. Go to https://api.slack.com/apps
2. Click **Create New App** → **From scratch**
3. Name: `StemeDB Alerts`
4. Select your workspace
## 2. Enable Incoming Webhooks
1. In your app → **Incoming Webhooks**
2. Toggle **Activate Incoming Webhooks** to ON
3. Click **Add New Webhook to Workspace**
4. Select channel (e.g., `#stemedb-alerts-critical`)
5. Click **Allow**
6. Copy webhook URL (starts with `https://hooks.slack.com/services/...`)
7. Repeat for warning and info channels
## 3. Configure Alertmanager
Replace placeholders with your webhook URLs:
```yaml
api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
```
Becomes:
```yaml
api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
```
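A webhook URL is effectively a credential, so one option is to keep it out of the committed file. A minimal sketch, assuming an Alertmanager version that supports `api_url_file` in `slack_configs`; the secret path shown is hypothetical:

```yaml
receivers:
- name: 'slack-critical'
  slack_configs:
  # Hypothetical path. The file holds only the webhook URL and should be
  # readable by the alertmanager user alone.
  - api_url_file: '/etc/alertmanager/secrets/slack-critical-url'
    channel: '#stemedb-alerts-critical'
    send_resolved: true
```

`api_url` and `api_url_file` are mutually exclusive, so set only one of them per entry.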
## 4. Test Integration
```bash
# Send test message directly to Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Test alert from StemeDB monitoring setup",
    "username": "StemeDB Alerts",
    "icon_emoji": ":rotating_light:"
  }'
```
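The request above only checks the webhook itself. To exercise the full Prometheus → Alertmanager → Slack path, one approach is a temporary always-firing Prometheus rule. This is a sketch: the group name, alert name, and annotation text are hypothetical, and it assumes alerts labeled `severity: info` are routed to the `slack-info` receiver.

```yaml
groups:
- name: stemedb-routing-test            # hypothetical group name
  rules:
  - alert: StemeDBSlackRoutingTest      # hypothetical alert name
    # vector(1) always returns a sample, so the alert fires on the first
    # rule evaluation (no `for` delay is set).
    expr: vector(1)
    labels:
      severity: info
      component: monitoring
    annotations:
      summary: 'Test alert for Slack routing'
      description: 'Safe to ignore. Remove this rule after verifying delivery.'
      runbook: 'https://docs/runbooks'
```

Remove the rule once the message shows up in the expected channel; otherwise it keeps firing indefinitely.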
## 5. Recommended Channel Structure
Create three Slack channels:
| Channel | Purpose | Members | Notifications |
|---------|---------|---------|---------------|
| `#stemedb-alerts-critical` | Critical alerts requiring immediate action | On-call engineers, managers | @channel |
| `#stemedb-alerts-warning` | Warning alerts for investigation | Engineering team | @here |
| `#stemedb-alerts-info` | Info alerts for audit trail | Engineering team (optional) | None |
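
The receivers only take effect once a `route` tree sends alerts to them. A minimal sketch of severity-based routing, assuming alerts carry a `severity` label with the values `critical`, `warning`, and `info`, and an Alertmanager version that supports `matchers`. The grouping labels and timing values are illustrative assumptions, not recommendations from this guide:

```yaml
route:
  receiver: 'slack-info'              # fallback for alerts that match no child route
  group_by: ['alertname', 'component']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - matchers:
    - severity="critical"
    receiver: 'slack-critical'
  - matchers:
    - severity="warning"
    receiver: 'slack-warning'
  - matchers:
    - severity="info"
    receiver: 'slack-info'
```

Child routes are evaluated in order and the first match wins, so the fallback receiver only sees alerts without a recognized `severity` value.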
## 6. Channel Topics
Set channel topics with useful links:
```
#stemedb-alerts-critical
🔴 Critical StemeDB alerts | On-call: @oncall-engineer | Runbooks: https://docs/runbooks | Dashboards: https://grafana/stemedb
```
```
#stemedb-alerts-warning
🟡 StemeDB warning alerts | Escalate to #stemedb-alerts-critical if critical | Runbooks: https://docs/runbooks
```
```
#stemedb-alerts-info
StemeDB informational alerts | No action required | Mute this channel if too noisy
```
## 7. Slack Workflow Integration (Advanced)
For automated incident response, create Slack workflows:
### Critical Alert Workflow
Triggered by: Message posted to `#stemedb-alerts-critical` with "CRITICAL"
Steps:
1. **Create incident channel** (`#incident-YYYY-MM-DD-HH-MM`)
2. **Add participants** (@oncall-engineer, @manager, @sre-lead)
3. **Post incident template** with runbook links
4. **Start Zoom call** for coordination
5. **Create PagerDuty incident** if not auto-created
### Resolution Workflow
Triggered by: Reaction `:white_check_mark:` on critical alert
Steps:
1. **Mark incident as resolved** in PagerDuty
2. **Post resolution message** in incident channel
3. **Request post-mortem** (create template doc)
4. **Archive incident channel** after 7 days
## Troubleshooting
### Messages not appearing in Slack
1. **Verify webhook URL:**
   ```bash
   curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
     -H 'Content-Type: application/json' \
     -d '{"text":"test"}'
   ```
2. **Check Alertmanager logs:**
   ```bash
   journalctl -u alertmanager -f | grep -i slack
   ```
3. **Verify app permissions:**
   - App must have `incoming-webhook` scope
   - App must be installed in workspace
### Alert formatting broken
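- Alertmanager renders `title` and `text` with Go templates; Slack then formats the rendered text using its own `mrkdwn` syntax (not standard Markdown)
- Test formatting with https://api.slack.com/docs/messages/builder
- Inside a YAML block scalar (`text: |`) line breaks are literal; use `\n` only in double-quoted strings. Slack formatting: `*bold*`, `_italic_`, `` `code` ``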
### Too many notifications
- Mute `#stemedb-alerts-info` channel (low priority)
- Increase `group_wait` and `group_interval` in Alertmanager to batch more alerts per message, and `repeat_interval` to re-send still-firing alerts less often
- Add inhibition rules to suppress related alerts
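
The last two suggestions can be sketched as a fragment to merge into an existing Alertmanager configuration. The labels in `equal` assume alerts carry `component` and `instance` labels, as the message templates in this file do; the timing values are placeholders to tune:

```yaml
route:
  # Merge these keys into your existing top-level route.
  group_wait: 1m         # wait before the first notification for a new group
  group_interval: 10m    # wait before notifying about new alerts in an existing group
  repeat_interval: 12h   # how often a still-firing group is re-sent

inhibit_rules:
# Suppress warning alerts while a critical alert is firing for the same
# component and instance.
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  equal: ['component', 'instance']
```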
### Alerts not resolving
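- Set `send_resolved: true` in the Slack config (Alertmanager defaults to `false` for Slack receivers, and this file sets `false` explicitly for info)
- Confirm the alert expression actually stops matching in Prometheus; the `for` duration only delays firing and does not control resolution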
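
With `send_resolved: true`, a fixed `color: 'danger'` also colors the resolved message red. One common pattern is to derive the color and title from the notification status. A sketch for the critical receiver, using the `.Status` field that Alertmanager exposes to templates; the resolved title wording is an assumption:

```yaml
slack_configs:
- api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
  channel: '#stemedb-alerts-critical'
  send_resolved: true
  # .Status is "firing" while any alert in the group is firing, else "resolved".
  color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
  title: '{{ if eq .Status "firing" }}:fire: StemeDB CRITICAL Alert{{ else }}:white_check_mark: StemeDB Resolved{{ end }}'
```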
## Best Practices
1. **Channel naming**: Use consistent prefix (`stemedb-alerts-*`)
2. **Color coding**: Critical=red (`danger`), Warning=orange (`warning`), Info=green (`good`)
3. **Actionable messages**: Include runbook links and next steps
4. **Mention on-call**: Use `@oncall-engineer` handle in critical channel
5. **Archive old channels**: Auto-archive incident channels after 7 days
6. **Review periodically**: Check alert volume, tune thresholds
7. **Test regularly**: Send test alerts monthly to verify routing
## Example Alert Flow
```
┌─────────────────────────────────────────────────────────────┐
│ Prometheus fires "WALDiskNearlyFull" alert                  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Alertmanager routes to 'slack-critical' receiver            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Message posted to #stemedb-alerts-critical                  │
│ "🔥 WAL disk usage >90% on prod-node-1"                     │
│ + Runbook link + Dashboard link                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ On-call engineer clicks runbook                             │
│ Follows steps: Check disk, run cleanup, increase size       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Disk usage drops to 75%                                     │
│ Prometheus marks alert as resolved                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Alertmanager sends resolved notification to Slack           │
│ "✅ WAL disk usage now 75% on prod-node-1"                  │
└─────────────────────────────────────────────────────────────┘
```
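
For this flow to render fully, the Prometheus rule has to supply the labels and annotations that the Slack templates read (`component`, `summary`, `description`, `impact`, `action`, `runbook`, `dashboard`). A hypothetical rule sketch: the expression uses node_exporter filesystem metrics with an assumed mount point `/var/lib/stemedb/wal`, and the threshold, `for` duration, annotation wording, and URLs are placeholders rather than StemeDB's actual alert definition.

```yaml
groups:
- name: stemedb-wal                     # hypothetical group name
  rules:
  - alert: WALDiskNearlyFull
    # Fires when less than 10% of the WAL filesystem has been free for 5 minutes.
    expr: |
      node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb/wal"}
        / node_filesystem_size_bytes{mountpoint="/var/lib/stemedb/wal"} < 0.10
    for: 5m
    labels:
      severity: critical
      component: wal
    annotations:
      summary: 'WAL disk usage >90% on {{ $labels.instance }}'
      description: 'Only {{ $value | humanizePercentage }} of the WAL volume is free.'
      impact: 'Writes may fail if the WAL volume fills up.'
      action: 'Check disk, run cleanup, increase size.'
      runbook: 'https://docs/runbooks'
      dashboard: 'https://grafana/stemedb'
```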