# Grafana Dashboards for StemeDB This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production. ## Dashboards | Dashboard | Purpose | Refresh Rate | |-----------|---------|--------------| | **storage-health.json** | WAL performance, storage latency, index lookup timing | 30s | | **cluster-overview.json** | Node status, replication lag, sync operations, gossip | 10s | | **sli-dashboard.json** | Request rate, latency percentiles, error rate, availability | 15s | ## Prerequisites - Prometheus configured to scrape StemeDB `/metrics` endpoint - Grafana 8.0+ installed - Network access from Grafana to Prometheus ## Import Instructions ### Option 1: Grafana UI 1. Open Grafana → **Dashboards** → **Import** 2. Click **Upload JSON file** 3. Select dashboard file (e.g., `storage-health.json`) 4. Configure data source: - **Prometheus**: Select your Prometheus data source 5. Click **Import** 6. Repeat for all three dashboards ### Option 2: Grafana API ```bash # Set Grafana credentials GRAFANA_URL="http://localhost:3000" GRAFANA_API_KEY="your-api-key" # Import all dashboards for dashboard in storage-health cluster-overview sli-dashboard; do curl -X POST "$GRAFANA_URL/api/dashboards/db" \ -H "Authorization: Bearer $GRAFANA_API_KEY" \ -H "Content-Type: application/json" \ -d @"$dashboard.json" done ``` ### Option 3: Grafana Provisioning (Automated) Create `/etc/grafana/provisioning/dashboards/stemedb.yaml`: ```yaml apiVersion: 1 providers: - name: 'stemedb' orgId: 1 folder: 'StemeDB' type: file disableDeletion: false updateIntervalSeconds: 10 allowUiUpdates: true options: path: /var/lib/grafana/dashboards/stemedb ``` Copy dashboard files: ```bash sudo mkdir -p /var/lib/grafana/dashboards/stemedb sudo cp *.json /var/lib/grafana/dashboards/stemedb/ sudo chown -R grafana:grafana /var/lib/grafana/dashboards/ sudo systemctl restart grafana-server ``` ## Dashboard Overview ### Storage Health Dashboard **Panels:** - WAL Fsync Latency (p50, p95, p99) - Track write path performance - WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%) - WAL Write Rate - Writes/sec and MB/sec throughput - WAL Error Rate - Detect write failures - Storage Operation Latency - KV operation timing by backend (fjall/redb) - Index Lookup Latency - Subject/predicate index performance - Storage Operations/sec - Read/write operation rates **Use for:** - Diagnosing slow writes (check fsync latency) - Capacity planning (disk usage trend) - Identifying storage bottlenecks (operation latency) ### Cluster Overview Dashboard **Panels:** - Node Status - Alive/Suspect/Dead node counts - Replication Lag - Sync delay by peer (alerts >5min) - Sync Operations/sec - Replication throughput - Merkle Diff Size - Divergence magnitude - Cluster Convergence State - % of nodes in sync - Gossip Message Rate - SWIM protocol health **Use for:** - Detecting node failures (status changes) - Monitoring cluster health (convergence ratio) - Troubleshooting replication issues (lag spikes) ### SLI Dashboard **Panels:** - Request Rate - Traffic by endpoint - Request Latency p99 - Heatmap showing latency distribution - Error Rate - Errors by type and layer - Availability - Success rate gauge (SLO: >99%) - Request Status Distribution - 2xx/4xx/5xx breakdown - Latency Distribution - p50/p95/p99 across all endpoints - Circuit Breaker Status - Open/half-open count **Use for:** - Validating SLO compliance (99% availability, p99 <500ms) - Detecting outages (availability drops) - Identifying slow endpoints (latency spikes) ## Alert Annotations Dashboards include embedded Grafana alerts: - **High Replication Lag** (cluster-overview) - Fires when lag >300s for 5min - **High WAL Error Rate** (storage-health) - Fires when error rate >0.01/sec - **High Error Rate** (sli-dashboard) - Fires when API errors >0.01/sec These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration. ## Customization ### Update Prometheus Data Source Edit dashboard JSON, find: ```json "datasource": "Prometheus" ``` Replace with your data source name/UID. ### Adjust Thresholds For gauge panels, modify `thresholds.steps`: ```json "thresholds": { "steps": [ {"value": 0, "color": "green"}, {"value": 70, "color": "yellow"}, {"value": 90, "color": "red"} ] } ``` ### Change Refresh Rate Modify `refresh` field at dashboard root: ```json "refresh": "30s" // Change to "10s", "1m", etc. ``` ## Troubleshooting ### Dashboard shows "No data" 1. **Check Prometheus scrape config:** ```yaml scrape_configs: - job_name: 'stemedb' static_configs: - targets: ['localhost:18180'] ``` 2. **Verify metrics endpoint:** ```bash curl http://localhost:18180/metrics | grep stemedb_ ``` 3. **Check Prometheus targets:** - Open Prometheus → Status → Targets - Verify `stemedb` job shows "UP" ### Metrics missing If specific metrics don't appear: - **WAL metrics**: Ensure Layer 1 instrumentation is deployed - **Storage metrics**: Ensure Layer 2 instrumentation is deployed - **HTTP metrics**: Ensure Layer 3 instrumentation is deployed - **Error metrics**: Ensure Layer 4 instrumentation is deployed ### Grafana shows "Panel plugin not found" Update dashboard `type` field to use standard panel types: - `graph` → `timeseries` - `gauge` → `gauge` - `stat` → `stat` - `heatmap` → `heatmap` - `piechart` → `piechart` ## Next Steps After importing dashboards: 1. **Configure alerts** - See `../prometheus/alerts/` for alert rules 2. **Set up notification channels** - PagerDuty, Slack, email 3. **Create runbooks** - Link alerts to `../../runbooks/` docs 4. **Test alerts** - Simulate failures to verify alert delivery ## Support For issues with dashboards: - Check Grafana logs: `journalctl -u grafana-server -f` - Verify Prometheus connectivity: `curl $GRAFANA_URL/api/datasources` - Review dashboard JSON for syntax errors