Grafana Dashboards for StemeDB
This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production.
Dashboards
| Dashboard | Purpose | Refresh Rate |
|---|---|---|
| storage-health.json | WAL performance, storage latency, index lookup timing | 30s |
| cluster-overview.json | Node status, replication lag, sync operations, gossip | 10s |
| sli-dashboard.json | Request rate, latency percentiles, error rate, availability | 15s |
Prerequisites
- Prometheus configured to scrape the StemeDB /metrics endpoint
- Grafana 8.0+ installed
- Network access from Grafana to Prometheus
Import Instructions
Option 1: Grafana UI
- Open Grafana → Dashboards → Import
- Click Upload JSON file
- Select dashboard file (e.g., storage-health.json)
- Configure data source:
- Prometheus: Select your Prometheus data source
- Click Import
- Repeat for all three dashboards
Option 2: Grafana API
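# Set Grafana credentials
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"
# Import all dashboards. The API expects the dashboard model wrapped as
# {"dashboard": ..., "overwrite": ...}. This assumes the JSON files are plain
# dashboard exports; skip the jq step if they already contain that envelope.
for dashboard in storage-health cluster-overview sli-dashboard; do
  jq '{dashboard: (. + {id: null}), overwrite: true}' "$dashboard.json" |
    curl -X POST "$GRAFANA_URL/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_API_KEY" \
      -H "Content-Type: application/json" \
      -d @-
done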
Option 3: Grafana Provisioning (Automated)
Create /etc/grafana/provisioning/dashboards/stemedb.yaml:
apiVersion: 1
providers:
  - name: 'stemedb'
    orgId: 1
    folder: 'StemeDB'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/stemedb
Copy dashboard files:
sudo mkdir -p /var/lib/grafana/dashboards/stemedb
sudo cp *.json /var/lib/grafana/dashboards/stemedb/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards/
sudo systemctl restart grafana-server
Dashboard Overview
Storage Health Dashboard
Panels:
- WAL Fsync Latency (p50, p95, p99) - Track write path performance
- WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%)
- WAL Write Rate - Writes/sec and MB/sec throughput
- WAL Error Rate - Detect write failures
- Storage Operation Latency - KV operation timing by backend (fjall/redb)
- Index Lookup Latency - Subject/predicate index performance
- Storage Operations/sec - Read/write operation rates
Use for:
- Diagnosing slow writes (check fsync latency)
- Capacity planning (disk usage trend)
- Identifying storage bottlenecks (operation latency)
Cluster Overview Dashboard
Panels:
- Node Status - Alive/Suspect/Dead node counts
- Replication Lag - Sync delay by peer (alerts >5min)
- Sync Operations/sec - Replication throughput
- Merkle Diff Size - Divergence magnitude
- Cluster Convergence State - % of nodes in sync
- Gossip Message Rate - SWIM protocol health
Use for:
- Detecting node failures (status changes)
- Monitoring cluster health (convergence ratio)
- Troubleshooting replication issues (lag spikes)
SLI Dashboard
Panels:
- Request Rate - Traffic by endpoint
- Request Latency p99 - Heatmap showing latency distribution
- Error Rate - Errors by type and layer
- Availability - Success rate gauge (SLO: >99%)
- Request Status Distribution - 2xx/4xx/5xx breakdown
- Latency Distribution - p50/p95/p99 across all endpoints
- Circuit Breaker Status - Open/half-open count
Use for:
- Validating SLO compliance (99% availability, p99 <500ms)
- Detecting outages (availability drops)
- Identifying slow endpoints (latency spikes)
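The 99% availability target can be read as an error budget. The following is a minimal sketch of that arithmetic; the 30-day window and the function name are assumptions for illustration, not something StemeDB defines. Under those assumptions, a 99% target leaves roughly 432 minutes (about 7.2 hours) of allowed unavailability per window.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability target.

    `slo` is a fraction (0.99 for 99%). The 30-day default window is an
    assumption of this sketch.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo) * minutes_in_window
```

Comparing the downtime observed on the Availability panel against this figure gives a quick sense of how much budget remains.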
Alert Annotations
Dashboards include embedded Grafana alerts:
- High Replication Lag (cluster-overview) - Fires when lag >300s for 5min
- High WAL Error Rate (storage-health) - Fires when error rate >0.01/sec
- High Error Rate (sli-dashboard) - Fires when API errors >0.01/sec
These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration.
Customization
Update Prometheus Data Source
Edit dashboard JSON, find:
"datasource": "Prometheus"
Replace with your data source name/UID.
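Editing every occurrence by hand is error-prone when a dashboard has many panels. The sketch below walks a loaded dashboard and swaps the data source name wherever it appears. The function name is hypothetical, and it only handles the string form shown above; dashboards saved by newer Grafana versions may store an object under this key instead.

```python
def set_datasource(node, old, new):
    """Recursively replace "datasource" string values equal to `old`.

    Returns the number of replacements. Only the plain-string form
    ("datasource": "Prometheus") is handled; object-valued entries are
    left untouched.
    """
    changed = 0
    if isinstance(node, dict):
        for key, value in list(node.items()):
            if key == "datasource" and value == old:
                node[key] = new
                changed += 1
            else:
                changed += set_datasource(value, old, new)
    elif isinstance(node, list):
        for item in node:
            changed += set_datasource(item, old, new)
    return changed
```

A typical use would be to read a dashboard file with json.load, call set_datasource(dashboard, "Prometheus", "<your data source name>"), and write the result back with json.dump.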
Adjust Thresholds
For gauge panels, modify thresholds.steps:
"thresholds": {
  "steps": [
    {"value": 0, "color": "green"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "red"}
  ]
}
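To apply the same thresholds to every gauge panel in one pass, something like the following could be used. It is a sketch under assumptions: the helper names are hypothetical, and rather than assuming where the thresholds object sits inside a panel (the location differs between Grafana versions), it searches each gauge panel for any "thresholds" object that has a "steps" list.

```python
def iter_panels(container):
    """Yield every panel, including panels nested inside row panels."""
    for panel in container.get("panels", []):
        yield panel
        yield from iter_panels(panel)


def find_thresholds(node):
    """Yield each dict stored under a "thresholds" key that has a "steps" list."""
    if isinstance(node, dict):
        for key, value in node.items():
            if (key == "thresholds" and isinstance(value, dict)
                    and isinstance(value.get("steps"), list)):
                yield value
            else:
                yield from find_thresholds(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_thresholds(item)


def set_gauge_thresholds(dashboard, steps):
    """Replace thresholds.steps on gauge panels; return how many were updated."""
    updated = 0
    for panel in iter_panels(dashboard):
        if panel.get("type") != "gauge":
            continue
        for thresholds in find_thresholds(panel):
            # Copy each step so panels do not share the same dict objects.
            thresholds["steps"] = [dict(step) for step in steps]
            updated += 1
    return updated
```

Panels of other types keep their existing thresholds, which matches the "for gauge panels" scope above.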
Change Refresh Rate
Modify the refresh field at the dashboard root:
"refresh": "30s"
Change the value to "10s", "1m", etc. JSON does not allow comments, so do not leave notes inside the file.
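If the refresh rate needs to change across several dashboards, the edit can be scripted. This is a minimal sketch with a hypothetical function name; the format check (digits followed by s, m, h, or d) is an assumption of the sketch that covers values such as "10s" and "1m", not a full statement of what Grafana accepts.

```python
import re

# Assumed format for this sketch: a number followed by a unit.
_INTERVAL_RE = re.compile(r"^\d+[smhd]$")


def set_refresh(dashboard, interval):
    """Set the root-level "refresh" field of a loaded dashboard dict."""
    if not _INTERVAL_RE.match(interval):
        raise ValueError(f"unexpected refresh interval: {interval!r}")
    dashboard["refresh"] = interval
    return dashboard
```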
Troubleshooting
Dashboard shows "No data"
- Check Prometheus scrape config:
  scrape_configs:
    - job_name: 'stemedb'
      static_configs:
        - targets: ['localhost:18180']
- Verify metrics endpoint:
  curl http://localhost:18180/metrics | grep stemedb_
- Check Prometheus targets:
  - Open Prometheus → Status → Targets
  - Verify the stemedb job shows "UP"
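The grep check above can also be done programmatically, which helps when comparing what a node exposes against what a dashboard expects. The sketch below parses Prometheus text exposition output and lists the metric names that start with stemedb_. The function name is hypothetical, and no specific StemeDB metric names are assumed.

```python
def stemedb_metric_names(exposition_text, prefix="stemedb_"):
    """Return the sorted, de-duplicated metric names that start with `prefix`.

    Expects Prometheus text exposition format: comment lines start with "#",
    sample lines look like `name{labels} value` or `name value`.
    """
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        parts = line.split("{", 1)[0].split()
        if parts and parts[0].startswith(prefix):
            names.add(parts[0])
    return sorted(names)
```

The input text could come from the same endpoint used above, for example by reading http://localhost:18180/metrics with urllib.request.urlopen and decoding the response body. An empty result points to a scrape target or instrumentation problem rather than a dashboard problem.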
Metrics missing
If specific metrics don't appear:
- WAL metrics: Ensure Layer 1 instrumentation is deployed
- Storage metrics: Ensure Layer 2 instrumentation is deployed
- HTTP metrics: Ensure Layer 3 instrumentation is deployed
- Error metrics: Ensure Layer 4 instrumentation is deployed
Grafana shows "Panel plugin not found"
In the dashboard JSON, update the type field of the affected panels to a standard panel type:
- graph → timeseries
- gauge → gauge
- stat → stat
- heatmap → heatmap
- piechart → piechart
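The type rewrite can be applied across a whole dashboard with a short script. This is a sketch under assumptions: the function and mapping names are hypothetical, and it only changes the type string. It does not convert panel options between panel implementations, so the result should still be reviewed in Grafana.

```python
# Only graph → timeseries changes the value; the other types already match.
PANEL_TYPE_MAP = {"graph": "timeseries"}


def migrate_panel_types(container, mapping=PANEL_TYPE_MAP):
    """Rewrite panel "type" values per `mapping`; return the number changed.

    Recurses into nested "panels" lists so panels inside rows are covered.
    """
    changed = 0
    for panel in container.get("panels", []):
        new_type = mapping.get(panel.get("type"))
        if new_type is not None and new_type != panel["type"]:
            panel["type"] = new_type
            changed += 1
        changed += migrate_panel_types(panel, mapping)
    return changed
```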
Next Steps
After importing dashboards:
- Configure alerts - See ../prometheus/alerts/ for alert rules
- Set up notification channels - PagerDuty, Slack, email
- Create runbooks - Link alerts to ../../runbooks/ docs
- Test alerts - Simulate failures to verify alert delivery
Support
For issues with dashboards:
- Check Grafana logs: journalctl -u grafana-server -f
- Verify Prometheus connectivity: curl $GRAFANA_URL/api/datasources
- Review dashboard JSON for syntax errors
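For the last item, a quick syntax check can be scripted with the standard library. The sketch below uses a hypothetical function name; it reports where parsing fails and confirms that the top-level value is a JSON object. It does not validate the dashboard against Grafana's schema.

```python
import json


def check_dashboard_json(text):
    """Return None if `text` is a JSON object, otherwise a short error message."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level JSON value is not an object"
    return None
```

Running this over the contents of each dashboard file before importing catches common hand-editing mistakes such as trailing commas or leftover // comments.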