stemedb/docs/operations/monitoring/alerting/slack-config.yml

# Alertmanager configuration for Slack integration
#
# This configuration sends StemeDB alerts to Slack channels by severity.
# Merge this with your existing alertmanager.yml or pagerduty-config.yml.

receivers:
  # Critical alerts -> #stemedb-alerts-critical (high visibility)
  - name: 'slack-critical'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
        channel: '#stemedb-alerts-critical'
        username: 'StemeDB Alerts'
        icon_emoji: ':rotating_light:'
        title: ':fire: StemeDB CRITICAL Alert'
        # Link to the first alert's dashboard; ranging over every alert would concatenate URLs
        title_link: '{{ (index .Alerts 0).Annotations.dashboard }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Component:* {{ .Labels.component }}
          *Instance:* {{ .Labels.instance }}

          {{ .Annotations.summary }}

          *Description:*
          {{ .Annotations.description }}

          *Impact:*
          {{ .Annotations.impact }}

          *Action Required:*
          {{ .Annotations.action }}

          <{{ .Annotations.runbook }}|View Runbook> | <{{ .Annotations.dashboard }}|View Dashboard>
          {{ end }}
        color: 'danger'
        send_resolved: true

  # Warning alerts -> #stemedb-alerts-warning (medium visibility)
  - name: 'slack-warning'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_WARNING>'
        channel: '#stemedb-alerts-warning'
        username: 'StemeDB Alerts'
        icon_emoji: ':warning:'
        title: ':warning: StemeDB Warning Alert'
        # Link to the first alert's dashboard; ranging over every alert would concatenate URLs
        title_link: '{{ (index .Alerts 0).Annotations.dashboard }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Component:* {{ .Labels.component }}
          *Instance:* {{ .Labels.instance }}

          {{ .Annotations.summary }}

          *Description:*
          {{ .Annotations.description }}

          <{{ .Annotations.runbook }}|View Runbook>
          {{ end }}
        color: 'warning'
        send_resolved: true

  # Info alerts -> #stemedb-alerts-info (low visibility, audit trail)
  - name: 'slack-info'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL_INFO>'
        channel: '#stemedb-alerts-info'
        username: 'StemeDB Alerts'
        icon_emoji: ':information_source:'
        title: 'StemeDB Info'
        text: |
          {{ range .Alerts }}
          {{ .Annotations.summary }}

          {{ .Annotations.description }}

          <{{ .Annotations.runbook }}|Details>
          {{ end }}
        color: 'good'
        send_resolved: false

# Slack Integration Setup Guide

## 1. Create Slack App

1. Go to https://api.slack.com/apps
2. Click **Create New App** → **From scratch**
3. Name: `StemeDB Alerts`
4. Select your workspace

## 2. Enable Incoming Webhooks

1. In your app → **Incoming Webhooks**
2. Toggle **Activate Incoming Webhooks** to ON
3. Click **Add New Webhook to Workspace**
4. Select channel (e.g., `#stemedb-alerts-critical`)
5. Click **Allow**
6. Copy webhook URL (starts with `https://hooks.slack.com/services/...`)
7. Repeat for warning and info channels

## 3. Configure Alertmanager

Replace placeholders with your webhook URLs:

```yaml
api_url: '<YOUR_SLACK_WEBHOOK_URL_CRITICAL>'
```

Becomes:

```yaml
api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
```
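
The receivers only take effect once a route points at them. A minimal routing sketch follows, assuming alerts carry a `severity` label with the values `critical`, `warning`, and `info`. The label values, the `group_by` keys, and the choice of `slack-warning` as the fallback receiver are assumptions, not part of the configuration above. The `matchers` syntax needs Alertmanager 0.22 or later.

```yaml
route:
  # Fallback for alerts that match none of the child routes (assumed choice)
  receiver: 'slack-warning'
  group_by: ['alertname', 'component']
  routes:
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warning'
    - matchers:
        - severity="info"
      receiver: 'slack-info'
```

If you already have a `route` block (for example from a PagerDuty setup), add the child routes to it rather than defining a second top-level `route`.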

## 4. Test Integration

```bash
# Send test message directly to Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Test alert from StemeDB monitoring setup",
    "username": "StemeDB Alerts",
    "icon_emoji": ":rotating_light:"
  }'
```

## 5. Recommended Channel Structure

Create three Slack channels:

| Channel | Purpose | Members | Notifications |
|---------|---------|---------|---------------|
| `#stemedb-alerts-critical` | Critical alerts requiring immediate action | On-call engineers, managers | @channel |
| `#stemedb-alerts-warning` | Warning alerts for investigation | Engineering team | @here |
| `#stemedb-alerts-info` | Info alerts for audit trail | Engineering team, optional | None |
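
The Notifications column does not happen automatically: a webhook message only notifies channel members if the message text contains a mention. One way to do this, sketched here as an assumption rather than as part of the receivers above, is to start the `text` template with Slack's special mention syntax (`<!channel>` or `<!here>`):

```yaml
# Fragment of the 'slack-critical' receiver; only the first line of `text` is new.
# The shortened alert body is illustrative.
text: |
  <!channel>
  {{ range .Alerts }}
  *Alert:* {{ .Labels.alertname }}
  {{ .Annotations.summary }}
  {{ end }}
```

Use `<!here>` in the warning receiver to notify only members who are currently active.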

## 6. Channel Topics

Set channel topics with useful links:

```
#stemedb-alerts-critical
🔴 Critical StemeDB alerts | On-call: @oncall-engineer | Runbooks: https://docs/runbooks | Dashboards: https://grafana/stemedb
```

```
#stemedb-alerts-warning
🟡 StemeDB warning alerts | Escalate to #stemedb-alerts-critical if critical | Runbooks: https://docs/runbooks
```

```
#stemedb-alerts-info
ℹ️ StemeDB informational alerts | No action required | Mute this channel if too noisy
```

## 7. Slack Workflow Integration (Advanced)

For automated incident response, create Slack workflows:

### Critical Alert Workflow

Triggered by: Message posted to `#stemedb-alerts-critical` with "CRITICAL"

Steps:
1. **Create incident channel** (`#incident-YYYY-MM-DD-HH-MM`)
2. **Add participants** (@oncall-engineer, @manager, @sre-lead)
3. **Post incident template** with runbook links
4. **Start Zoom call** for coordination
5. **Create PagerDuty incident** if not auto-created

### Resolution Workflow

Triggered by: Reaction `:white_check_mark:` on critical alert

Steps:
1. **Mark incident as resolved** in PagerDuty
2. **Post resolution message** in incident channel
3. **Request post-mortem** (create template doc)
4. **Archive incident channel** after 7 days

## Troubleshooting

### Messages not appearing in Slack

1. **Verify webhook URL:**
   ```bash
   # A valid webhook responds with HTTP 200 and the body "ok"
   curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
     -H 'Content-Type: application/json' \
     -d '{"text":"test"}'
   ```

2. **Check Alertmanager logs:**
   ```bash
   journalctl -u alertmanager -f | grep slack
   ```

3. **Verify app permissions:**
   - App must have `incoming-webhook` scope
   - App must be installed in workspace

### Alert formatting broken

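- Alertmanager renders the Go templates before sending; Slack then formats the result with its own `mrkdwn` syntax, which differs from standard Markdown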
- Test formatting with https://api.slack.com/docs/messages/builder
- Use `\n` for line breaks, `*bold*`, `_italic_`, `` `code` ``
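
With `send_resolved: true`, the same template is reused for resolved notifications, so a fixed title and color make a recovery look like a new incident. A hedged sketch of a status-aware variant, using the `.Status` field that Alertmanager exposes to notification templates (the wording and emoji are illustrative):

```yaml
# Fragment for the 'slack-critical' receiver
title: '{{ if eq .Status "firing" }}:fire: StemeDB CRITICAL Alert{{ else }}:white_check_mark: StemeDB Resolved{{ end }}'
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
```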

### Too many notifications

- Mute `#stemedb-alerts-info` channel (low priority)
- Increase `group_interval` in Alertmanager (batch more alerts)
- Add inhibition rules to suppress related alerts
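
The grouping and inhibition settings mentioned above can be sketched as follows. The durations, the `group_by` keys, and the fallback receiver are illustrative assumptions to tune for your own alert volume; `source_matchers` and `target_matchers` need Alertmanager 0.22 or later.

```yaml
route:
  receiver: 'slack-warning'   # assumed fallback receiver
  group_by: ['alertname', 'component']
  group_wait: 30s        # wait before the first notification for a new group
  group_interval: 10m    # wait before notifying about alerts added to an existing group
  repeat_interval: 4h    # wait before re-sending a notification that is still firing

inhibit_rules:
  # While a critical alert fires, mute warnings for the same component and instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['component', 'instance']
```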

### Alerts not resolving

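- Set `send_resolved: true` on the receiver's Slack config. Alertmanager defaults it to `false`, and this configuration sets it to `false` explicitly for `slack-info`
- Verify that the alert expression stops matching once the condition clears. The Prometheus `for` duration only delays firing; it does not control resolution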

## Best Practices

1. **Channel naming**: Use consistent prefix (`stemedb-alerts-*`)
2. **Color coding**: Critical=red (`danger`), Warning=orange (`warning`), Info=green (`good`)
3. **Actionable messages**: Include runbook links and next steps
4. **Mention on-call**: Use `@oncall-engineer` handle in critical channel
5. **Archive old channels**: Auto-archive incident channels after 7 days
6. **Review periodically**: Check alert volume, tune thresholds
7. **Test regularly**: Send test alerts monthly to verify routing

## Example Alert Flow

```
┌─────────────────────────────────────────────────────────────┐
│  Prometheus fires "WALDiskNearlyFull" alert                 │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  Alertmanager routes to 'slack-critical' receiver           │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  Message posted to #stemedb-alerts-critical                 │
│  "🔥 WAL disk usage >90% on prod-node-1"                    │
│  + Runbook link + Dashboard link                            │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  On-call engineer clicks runbook                            │
│  Follows steps: Check disk, run cleanup, increase size      │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  Disk usage drops to 75%                                    │
│  Prometheus marks alert as resolved                         │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  Alertmanager sends resolved notification to Slack          │
│  "✅ WAL disk usage now 75% on prod-node-1"                 │
└─────────────────────────────────────────────────────────────┘
```
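
For this flow to produce a complete Slack message, the Prometheus rule has to supply the labels and annotations that the receiver templates read (`severity`, `component`, `summary`, `description`, `impact`, `action`, `runbook`, `dashboard`). A minimal sketch of such a rule, assuming node_exporter filesystem metrics and a hypothetical WAL mount point; the group name, threshold duration, annotation wording, and URLs are placeholders:

```yaml
groups:
  - name: stemedb-wal   # hypothetical group name
    rules:
      - alert: WALDiskNearlyFull
        # Assumes node_exporter metrics; the mount point is hypothetical
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb/wal"}
             / node_filesystem_size_bytes{mountpoint="/var/lib/stemedb/wal"}) > 0.90
        for: 5m
        labels:
          severity: critical
          component: wal
        annotations:
          summary: 'WAL disk usage >90% on {{ $labels.instance }}'
          description: 'WAL volume is {{ $value | humanizePercentage }} full.'
          impact: 'Writes may fail once the volume fills.'
          action: 'Check disk, run cleanup, increase volume size.'
          runbook: 'https://docs/runbooks'
          dashboard: 'https://grafana/stemedb'
```

Any annotation the rule omits renders as an empty value in the Slack message, so keep the rule and the receiver template in sync.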