stemedb/docs/operations/monitoring/grafana/README.md

# Grafana Dashboards for StemeDB

This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production.

## Dashboards

| Dashboard | Purpose | Refresh Rate |
|-----------|---------|--------------|
| **storage-health.json** | WAL performance, storage latency, index lookup timing | 30s |
| **cluster-overview.json** | Node status, replication lag, sync operations, gossip | 10s |
| **sli-dashboard.json** | Request rate, latency percentiles, error rate, availability | 15s |

## Prerequisites

- Prometheus configured to scrape StemeDB `/metrics` endpoint
- Grafana 8.0+ installed
- Network access from Grafana to Prometheus

## Import Instructions

### Option 1: Grafana UI

1. Open Grafana → **Dashboards** → **Import**
2. Click **Upload JSON file**
3. Select dashboard file (e.g., `storage-health.json`)
4. Configure data source:
   - **Prometheus**: Select your Prometheus data source
5. Click **Import**
6. Repeat for all three dashboards

### Option 2: Grafana API

```bash
# Set Grafana credentials
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"

# Import all dashboards
for dashboard in storage-health cluster-overview sli-dashboard; do
  curl -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d @"$dashboard.json"
done
```

### Option 3: Grafana Provisioning (Automated)

Create `/etc/grafana/provisioning/dashboards/stemedb.yaml`:

```yaml
apiVersion: 1

providers:
  - name: 'stemedb'
    orgId: 1
    folder: 'StemeDB'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/stemedb
```

Copy dashboard files:

```bash
sudo mkdir -p /var/lib/grafana/dashboards/stemedb
sudo cp *.json /var/lib/grafana/dashboards/stemedb/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards/
sudo systemctl restart grafana-server
```

## Dashboard Overview

### Storage Health Dashboard

**Panels:**
- WAL Fsync Latency (p50, p95, p99) - Track write path performance
- WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%)
- WAL Write Rate - Writes/sec and MB/sec throughput
- WAL Error Rate - Detect write failures
- Storage Operation Latency - KV operation timing by backend (fjall/redb)
- Index Lookup Latency - Subject/predicate index performance
- Storage Operations/sec - Read/write operation rates

**Use for:**
- Diagnosing slow writes (check fsync latency)
- Capacity planning (disk usage trend)
- Identifying storage bottlenecks (operation latency)

### Cluster Overview Dashboard

**Panels:**
- Node Status - Alive/Suspect/Dead node counts
- Replication Lag - Sync delay by peer (alerts >5min)
- Sync Operations/sec - Replication throughput
- Merkle Diff Size - Divergence magnitude
- Cluster Convergence State - % of nodes in sync
- Gossip Message Rate - SWIM protocol health

**Use for:**
- Detecting node failures (status changes)
- Monitoring cluster health (convergence ratio)
- Troubleshooting replication issues (lag spikes)

### SLI Dashboard

**Panels:**
- Request Rate - Traffic by endpoint
- Request Latency p99 - Heatmap showing latency distribution
- Error Rate - Errors by type and layer
- Availability - Success rate gauge (SLO: >99%)
- Request Status Distribution - 2xx/4xx/5xx breakdown
- Latency Distribution - p50/p95/p99 across all endpoints
- Circuit Breaker Status - Open/half-open count

**Use for:**
- Validating SLO compliance (99% availability, p99 <500ms)
- Detecting outages (availability drops)
- Identifying slow endpoints (latency spikes)

## Alert Annotations

Dashboards include embedded Grafana alerts:

- **High Replication Lag** (cluster-overview) - Fires when lag >300s for 5min
- **High WAL Error Rate** (storage-health) - Fires when error rate >0.01/sec
- **High Error Rate** (sli-dashboard) - Fires when API errors >0.01/sec

These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration.

## Customization

### Update Prometheus Data Source

Edit dashboard JSON, find:

```json
"datasource": "Prometheus"
```

Replace with your data source name/UID.

### Adjust Thresholds

For gauge panels, modify `thresholds.steps`:

```json
"thresholds": {
  "steps": [
    {"value": 0, "color": "green"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "red"}
  ]
}
```

### Change Refresh Rate

Modify `refresh` field at dashboard root:

```json
"refresh": "30s"  // Change to "10s", "1m", etc.
```

## Troubleshooting

### Dashboard shows "No data"

1. **Check Prometheus scrape config:**
   ```yaml
   scrape_configs:
     - job_name: 'stemedb'
       static_configs:
         - targets: ['localhost:18180']
   ```

2. **Verify metrics endpoint:**
   ```bash
   curl http://localhost:18180/metrics | grep stemedb_
   ```

3. **Check Prometheus targets:**
   - Open Prometheus → Status → Targets
   - Verify `stemedb` job shows "UP"

### Metrics missing

If specific metrics don't appear:

- **WAL metrics**: Ensure Layer 1 instrumentation is deployed
- **Storage metrics**: Ensure Layer 2 instrumentation is deployed
- **HTTP metrics**: Ensure Layer 3 instrumentation is deployed
- **Error metrics**: Ensure Layer 4 instrumentation is deployed

### Grafana shows "Panel plugin not found"

Update dashboard `type` field to use standard panel types:
- `graph` → `timeseries`
- `gauge` → `gauge`
- `stat` → `stat`
- `heatmap` → `heatmap`
- `piechart` → `piechart`

## Next Steps

After importing dashboards:

1. **Configure alerts** - See `../prometheus/alerts/` for alert rules
2. **Set up notification channels** - PagerDuty, Slack, email
3. **Create runbooks** - Link alerts to `../../runbooks/` docs
4. **Test alerts** - Simulate failures to verify alert delivery

## Support

For issues with dashboards:
- Check Grafana logs: `journalctl -u grafana-server -f`
- Verify Prometheus connectivity: `curl $GRAFANA_URL/api/datasources`
- Review dashboard JSON for syntax errors