stemedb/docs/operations/monitoring/grafana/README.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00


# Grafana Dashboards for StemeDB
This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production.
## Dashboards
| Dashboard | Purpose | Refresh Rate |
|-----------|---------|--------------|
| **storage-health.json** | WAL performance, storage latency, index lookup timing | 30s |
| **cluster-overview.json** | Node status, replication lag, sync operations, gossip | 10s |
| **sli-dashboard.json** | Request rate, latency percentiles, error rate, availability | 15s |
## Prerequisites
- Prometheus configured to scrape StemeDB `/metrics` endpoint
- Grafana 8.0+ installed
- Network access from Grafana to Prometheus
## Import Instructions
### Option 1: Grafana UI
1. Open Grafana → **Dashboards** → **Import**
2. Click **Upload JSON file**
3. Select dashboard file (e.g., `storage-health.json`)
4. Configure data source:
   - **Prometheus**: Select your Prometheus data source
5. Click **Import**
6. Repeat for all three dashboards
### Option 2: Grafana API
```bash
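# Set Grafana credentials
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"

# Import all dashboards. The API expects the dashboard wrapped in a
# {"dashboard": ..., "overwrite": ...} payload, so wrap each file with jq.
# Assumes the files are plain dashboard exports; requires jq.
for dashboard in storage-health cluster-overview sli-dashboard; do
  jq '{dashboard: (. + {id: null}), overwrite: true}' "$dashboard.json" |
    curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_API_KEY" \
      -H "Content-Type: application/json" \
      -d @-
done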
```
### Option 3: Grafana Provisioning (Automated)
Create `/etc/grafana/provisioning/dashboards/stemedb.yaml`:
```yaml
apiVersion: 1
providers:
  - name: 'stemedb'
    orgId: 1
    folder: 'StemeDB'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/stemedb
```
Copy dashboard files:
```bash
sudo mkdir -p /var/lib/grafana/dashboards/stemedb
sudo cp *.json /var/lib/grafana/dashboards/stemedb/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards/
sudo systemctl restart grafana-server
```
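Before copying, it can help to confirm that all three dashboard files listed in the table above are actually present. The following is a minimal sketch; the function name `check_dashboards` is illustrative and not part of StemeDB.

```bash
# Hypothetical helper: verify the three dashboard files exist and are non-empty.
# Usage: check_dashboards <directory>
check_dashboards() {
    dir="${1:-.}"
    missing=0
    for name in storage-health cluster-overview sli-dashboard; do
        if [ ! -s "$dir/$name.json" ]; then
            echo "missing or empty: $dir/$name.json" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

For example, `check_dashboards . && echo "ready to copy"` run from this directory prints the message only when all three files are in place.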
## Dashboard Overview
### Storage Health Dashboard
**Panels:**
- WAL Fsync Latency (p50, p95, p99) - Track write path performance
- WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%)
- WAL Write Rate - Writes/sec and MB/sec throughput
- WAL Error Rate - Detect write failures
- Storage Operation Latency - KV operation timing by backend (fjall/redb)
- Index Lookup Latency - Subject/predicate index performance
- Storage Operations/sec - Read/write operation rates
**Use for:**
- Diagnosing slow writes (check fsync latency)
- Capacity planning (disk usage trend)
- Identifying storage bottlenecks (operation latency)
### Cluster Overview Dashboard
**Panels:**
- Node Status - Alive/Suspect/Dead node counts
- Replication Lag - Sync delay by peer (alerts >5min)
- Sync Operations/sec - Replication throughput
- Merkle Diff Size - Divergence magnitude
- Cluster Convergence State - % of nodes in sync
- Gossip Message Rate - SWIM protocol health
**Use for:**
- Detecting node failures (status changes)
- Monitoring cluster health (convergence ratio)
- Troubleshooting replication issues (lag spikes)
### SLI Dashboard
**Panels:**
- Request Rate - Traffic by endpoint
- Request Latency p99 - Heatmap showing latency distribution
- Error Rate - Errors by type and layer
- Availability - Success rate gauge (SLO: >99%)
- Request Status Distribution - 2xx/4xx/5xx breakdown
- Latency Distribution - p50/p95/p99 across all endpoints
- Circuit Breaker Status - Open/half-open count
**Use for:**
- Validating SLO compliance (99% availability, p99 <500ms)
- Detecting outages (availability drops)
- Identifying slow endpoints (latency spikes)
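To make the 99% availability target concrete: availability here is taken to be the share of requests that did not fail. The sketch below works through that arithmetic with `awk`. The helper names and the request counts are made up for illustration, and what counts as a "failed" request (for example, only 5xx responses) is an assumption this README does not spell out.

```bash
# Availability as a percentage: (total - errors) / total * 100.
# Usage: availability_pct <total_requests> <failed_requests>
availability_pct() {
    awk -v t="$1" -v e="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }
        printf "%.2f\n", (t - e) / t * 100
    }'
}

# Exit status 0 when availability is strictly above the target (default 99%).
# Usage: meets_slo <total_requests> <failed_requests> [target_pct]
meets_slo() {
    awk -v t="$1" -v e="$2" -v target="${3:-99}" 'BEGIN {
        exit !(t > 0 && (t - e) / t * 100 > target)
    }'
}

availability_pct 100000 500   # prints 99.50
```

With 100,000 requests, the 99% target leaves room for fewer than 1,000 failures, so `meets_slo 100000 500` succeeds while `meets_slo 100000 2000` does not.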
## Alert Annotations
Dashboards include embedded Grafana alerts:
- **High Replication Lag** (cluster-overview) - Fires when lag >300s for 5min
- **High WAL Error Rate** (storage-health) - Fires when error rate >0.01/sec
- **High Error Rate** (sli-dashboard) - Fires when API errors >0.01/sec
These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration.
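A threshold of 0.01/sec is easier to reason about as a count: over a five-minute window it corresponds to 3 events. The sketch below shows the conversion. The helper names are hypothetical, and the five-minute window is an assumption, since this README does not state the rate window the dashboards use.

```bash
# Average events per second over a window.
# Usage: rate_per_sec <event_count> <window_seconds>
rate_per_sec() {
    awk -v n="$1" -v w="$2" 'BEGIN { printf "%.4f\n", n / w }'
}

# Exit status 0 when the rate is strictly above the threshold (default 0.01/sec).
# Usage: rate_exceeds <event_count> <window_seconds> [threshold]
rate_exceeds() {
    awk -v n="$1" -v w="$2" -v th="${3:-0.01}" 'BEGIN { exit !(n / w > th) }'
}

rate_per_sec 6 300   # prints 0.0200
```

Under that assumption, six errors in five minutes would exceed the threshold, while two would not.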
## Customization
### Update Prometheus Data Source
Edit dashboard JSON, find:
```json
"datasource": "Prometheus"
```
Replace with your data source name/UID.
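When the same change is needed in several files, the substitution can be scripted. This is a rough sketch using `sed`; the function name `set_datasource` is illustrative. It only matches the exact string form shown above and does not handle dashboards that store the data source as an object.

```bash
# Replace the string-valued datasource name in dashboard JSON read from stdin.
# Usage: set_datasource <new_name> < dashboard.json > dashboard.new.json
# Assumes the new name contains no '|', '&', or backslash characters.
set_datasource() {
    sed "s|\"datasource\": \"Prometheus\"|\"datasource\": \"$1\"|g"
}

echo '{"datasource": "Prometheus"}' | set_datasource "Prometheus-Prod"
# prints {"datasource": "Prometheus-Prod"}
```

Writing to a new file and reviewing the diff before replacing the original is safer than editing in place.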
### Adjust Thresholds
For gauge panels, modify `thresholds.steps`:
```json
"thresholds": {
  "steps": [
    {"value": 0, "color": "green"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "red"}
  ]
}
```
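The steps read as lower bounds: assuming each step applies from its `value` upward until the next step begins, a reading of 75 falls in the yellow band. The sketch below mirrors that lookup for the 0 / 70 / 90 steps in the example; `threshold_color` is a hypothetical helper for illustration, not something Grafana provides.

```bash
# Map a gauge value to the color of the highest step whose value it reaches,
# using the 0 / 70 / 90 steps from the example above.
threshold_color() {
    awk -v v="$1" 'BEGIN {
        color = "green"
        if (v >= 70) color = "yellow"
        if (v >= 90) color = "red"
        print color
    }'
}

threshold_color 75   # prints yellow
```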
### Change Refresh Rate
Modify the `refresh` field at the dashboard root, for example to `"10s"` or `"1m"`:
```json
"refresh": "30s"
```
## Troubleshooting
### Dashboard shows "No data"
1. **Check Prometheus scrape config:**
   ```yaml
   scrape_configs:
     - job_name: 'stemedb'
       static_configs:
         - targets: ['localhost:18180']
   ```
2. **Verify metrics endpoint:**
   ```bash
   curl http://localhost:18180/metrics | grep stemedb_
   ```
3. **Check Prometheus targets:**
   - Open Prometheus → Status → Targets
   - Verify `stemedb` job shows "UP"
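The endpoint check in step 2 can be turned into a count, which is easier to use in a script than eyeballing `grep` output. This is a minimal sketch; `count_metrics` is a hypothetical helper, and the metric name in the sample input is invented for illustration.

```bash
# Count sample lines (not "# HELP" / "# TYPE" comments) whose metric name
# starts with the given prefix. Reads Prometheus text format from stdin.
# Usage: curl -s http://localhost:18180/metrics | count_metrics stemedb_
count_metrics() {
    # grep -c exits non-zero when the count is 0; keep the status clean.
    grep -c "^$1" || true
}

count_metrics stemedb_ <<'EOF'
# HELP stemedb_example_total Hypothetical metric, for illustration only.
# TYPE stemedb_example_total counter
stemedb_example_total 42
process_cpu_seconds_total 1.5
EOF
# prints 1
```

A count of 0 against the live endpoint suggests StemeDB is not exporting metrics at all, as opposed to Prometheus failing to scrape them.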
### Metrics missing
If specific metrics don't appear:
- **WAL metrics**: Ensure Layer 1 instrumentation is deployed
- **Storage metrics**: Ensure Layer 2 instrumentation is deployed
- **HTTP metrics**: Ensure Layer 3 instrumentation is deployed
- **Error metrics**: Ensure Layer 4 instrumentation is deployed
### Grafana shows "Panel plugin not found"
Update dashboard `type` field to use standard panel types:
- `graph` → `timeseries`
- `gauge` → `gauge`
- `stat` → `stat`
- `heatmap` → `heatmap`
- `piechart` → `piechart`
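Only the `graph` entry in the list above is an actual rename, so it is the one worth scripting. A rough sketch with `sed` follows; `migrate_graph_panels` is an illustrative name. It changes the `type` field only, so panel options written for the old type are left untouched and may still need adjusting in the Grafana UI.

```bash
# Rewrite the legacy "graph" panel type to "timeseries" in dashboard JSON
# read from stdin. Tolerates optional whitespace after the colon.
migrate_graph_panels() {
    sed 's/"type":[[:space:]]*"graph"/"type": "timeseries"/g'
}

echo '{"type": "graph", "title": "WAL Write Rate"}' | migrate_graph_panels
# prints {"type": "timeseries", "title": "WAL Write Rate"}
```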
## Next Steps
After importing dashboards:
1. **Configure alerts** - See `../prometheus/alerts/` for alert rules
2. **Set up notification channels** - PagerDuty, Slack, email
3. **Create runbooks** - Link alerts to `../../runbooks/` docs
4. **Test alerts** - Simulate failures to verify alert delivery
## Support
For issues with dashboards:
- Check Grafana logs: `journalctl -u grafana-server -f`
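- Verify the Prometheus data source is registered in Grafana (the API call needs credentials): `curl -H "Authorization: Bearer $GRAFANA_API_KEY" "$GRAFANA_URL/api/datasources"`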
- Review dashboard JSON for syntax errors