This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
222 lines
5.9 KiB
Markdown
222 lines
5.9 KiB
Markdown
# Grafana Dashboards for StemeDB
|
|
|
|
This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production.
|
|
|
|
## Dashboards
|
|
|
|
| Dashboard | Purpose | Refresh Rate |
|
|
|-----------|---------|--------------|
|
|
| **storage-health.json** | WAL performance, storage latency, index lookup timing | 30s |
|
|
| **cluster-overview.json** | Node status, replication lag, sync operations, gossip | 10s |
|
|
| **sli-dashboard.json** | Request rate, latency percentiles, error rate, availability | 15s |
|
|
|
|
## Prerequisites
|
|
|
|
- Prometheus configured to scrape StemeDB `/metrics` endpoint
|
|
- Grafana 8.0+ installed
|
|
- Network access from Grafana to Prometheus
|
|
|
|
## Import Instructions
|
|
|
|
### Option 1: Grafana UI
|
|
|
|
1. Open Grafana → **Dashboards** → **Import**
|
|
2. Click **Upload JSON file**
|
|
3. Select dashboard file (e.g., `storage-health.json`)
|
|
4. Configure data source:
|
|
- **Prometheus**: Select your Prometheus data source
|
|
5. Click **Import**
|
|
6. Repeat for all three dashboards
|
|
|
|
### Option 2: Grafana API
|
|
|
|
```bash
|
|
# Set Grafana credentials
|
|
GRAFANA_URL="http://localhost:3000"
|
|
GRAFANA_API_KEY="your-api-key"
|
|
|
|
# Import all dashboards
|
|
for dashboard in storage-health cluster-overview sli-dashboard; do
|
|
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
|
|
-H "Authorization: Bearer $GRAFANA_API_KEY" \
|
|
-H "Content-Type: application/json" \
|
|
-d @"$dashboard.json"
|
|
done
|
|
```
|
|
|
|
### Option 3: Grafana Provisioning (Automated)
|
|
|
|
Create `/etc/grafana/provisioning/dashboards/stemedb.yaml`:
|
|
|
|
```yaml
|
|
apiVersion: 1
|
|
|
|
providers:
|
|
- name: 'stemedb'
|
|
orgId: 1
|
|
folder: 'StemeDB'
|
|
type: file
|
|
disableDeletion: false
|
|
updateIntervalSeconds: 10
|
|
allowUiUpdates: true
|
|
options:
|
|
path: /var/lib/grafana/dashboards/stemedb
|
|
```
|
|
|
|
Copy dashboard files:
|
|
|
|
```bash
|
|
sudo mkdir -p /var/lib/grafana/dashboards/stemedb
|
|
sudo cp *.json /var/lib/grafana/dashboards/stemedb/
|
|
sudo chown -R grafana:grafana /var/lib/grafana/dashboards/
|
|
sudo systemctl restart grafana-server
|
|
```
|
|
|
|
## Dashboard Overview
|
|
|
|
### Storage Health Dashboard
|
|
|
|
**Panels:**
|
|
- WAL Fsync Latency (p50, p95, p99) - Track write path performance
|
|
- WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%)
|
|
- WAL Write Rate - Writes/sec and MB/sec throughput
|
|
- WAL Error Rate - Detect write failures
|
|
- Storage Operation Latency - KV operation timing by backend (fjall/redb)
|
|
- Index Lookup Latency - Subject/predicate index performance
|
|
- Storage Operations/sec - Read/write operation rates
|
|
|
|
**Use for:**
|
|
- Diagnosing slow writes (check fsync latency)
|
|
- Capacity planning (disk usage trend)
|
|
- Identifying storage bottlenecks (operation latency)
|
|
|
|
### Cluster Overview Dashboard
|
|
|
|
**Panels:**
|
|
- Node Status - Alive/Suspect/Dead node counts
|
|
- Replication Lag - Sync delay by peer (alerts >5min)
|
|
- Sync Operations/sec - Replication throughput
|
|
- Merkle Diff Size - Divergence magnitude
|
|
- Cluster Convergence State - % of nodes in sync
|
|
- Gossip Message Rate - SWIM protocol health
|
|
|
|
**Use for:**
|
|
- Detecting node failures (status changes)
|
|
- Monitoring cluster health (convergence ratio)
|
|
- Troubleshooting replication issues (lag spikes)
|
|
|
|
### SLI Dashboard
|
|
|
|
**Panels:**
|
|
- Request Rate - Traffic by endpoint
|
|
- Request Latency p99 - Heatmap showing latency distribution
|
|
- Error Rate - Errors by type and layer
|
|
- Availability - Success rate gauge (SLO: >99%)
|
|
- Request Status Distribution - 2xx/4xx/5xx breakdown
|
|
- Latency Distribution - p50/p95/p99 across all endpoints
|
|
- Circuit Breaker Status - Open/half-open count
|
|
|
|
**Use for:**
|
|
- Validating SLO compliance (99% availability, p99 <500ms)
|
|
- Detecting outages (availability drops)
|
|
- Identifying slow endpoints (latency spikes)
|
|
|
|
## Alert Annotations
|
|
|
|
Dashboards include embedded Grafana alerts:
|
|
|
|
- **High Replication Lag** (cluster-overview) - Fires when lag >300s for 5min
|
|
- **High WAL Error Rate** (storage-health) - Fires when error rate >0.01/sec
|
|
- **High Error Rate** (sli-dashboard) - Fires when API errors >0.01/sec
|
|
|
|
These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration.
|
|
|
|
## Customization
|
|
|
|
### Update Prometheus Data Source
|
|
|
|
Edit dashboard JSON, find:
|
|
|
|
```json
|
|
"datasource": "Prometheus"
|
|
```
|
|
|
|
Replace with your data source name/UID.
|
|
|
|
### Adjust Thresholds
|
|
|
|
For gauge panels, modify `thresholds.steps`:
|
|
|
|
```json
|
|
"thresholds": {
|
|
"steps": [
|
|
{"value": 0, "color": "green"},
|
|
{"value": 70, "color": "yellow"},
|
|
{"value": 90, "color": "red"}
|
|
]
|
|
}
|
|
```
|
|
|
|
### Change Refresh Rate
|
|
|
|
Modify `refresh` field at dashboard root:
|
|
|
|
```json
|
|
"refresh": "30s" // Change to "10s", "1m", etc.
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Dashboard shows "No data"
|
|
|
|
1. **Check Prometheus scrape config:**
|
|
```yaml
|
|
scrape_configs:
|
|
- job_name: 'stemedb'
|
|
static_configs:
|
|
- targets: ['localhost:18180']
|
|
```
|
|
|
|
2. **Verify metrics endpoint:**
|
|
```bash
|
|
curl http://localhost:18180/metrics | grep stemedb_
|
|
```
|
|
|
|
3. **Check Prometheus targets:**
|
|
- Open Prometheus → Status → Targets
|
|
- Verify `stemedb` job shows "UP"
|
|
|
|
### Metrics missing
|
|
|
|
If specific metrics don't appear:
|
|
|
|
- **WAL metrics**: Ensure Layer 1 instrumentation is deployed
|
|
- **Storage metrics**: Ensure Layer 2 instrumentation is deployed
|
|
- **HTTP metrics**: Ensure Layer 3 instrumentation is deployed
|
|
- **Error metrics**: Ensure Layer 4 instrumentation is deployed
|
|
|
|
### Grafana shows "Panel plugin not found"
|
|
|
|
Update dashboard `type` field to use standard panel types:
|
|
- `graph` → `timeseries`
|
|
- `gauge` → `gauge`
|
|
- `stat` → `stat`
|
|
- `heatmap` → `heatmap`
|
|
- `piechart` → `piechart`
|
|
|
|
## Next Steps
|
|
|
|
After importing dashboards:
|
|
|
|
1. **Configure alerts** - See `../prometheus/alerts/` for alert rules
|
|
2. **Set up notification channels** - PagerDuty, Slack, email
|
|
3. **Create runbooks** - Link alerts to `../../runbooks/` docs
|
|
4. **Test alerts** - Simulate failures to verify alert delivery
|
|
|
|
## Support
|
|
|
|
For issues with dashboards:
|
|
- Check Grafana logs: `journalctl -u grafana-server -f`
|
|
- Verify Prometheus connectivity: `curl $GRAFANA_URL/api/datasources`
|
|
- Review dashboard JSON for syntax errors
|