stemedb/docs/operations/monitoring/grafana
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00
..
cluster-overview.json feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00
README.md feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00
sli-dashboard.json feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00
storage-health.json feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00

Grafana Dashboards for StemeDB

This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production.

Dashboards

Dashboard Purpose Refresh Rate
storage-health.json WAL performance, storage latency, index lookup timing 30s
cluster-overview.json Node status, replication lag, sync operations, gossip 10s
sli-dashboard.json Request rate, latency percentiles, error rate, availability 15s

Prerequisites

  • Prometheus configured to scrape StemeDB /metrics endpoint
  • Grafana 8.0+ installed
  • Network access from Grafana to Prometheus

Import Instructions

Option 1: Grafana UI

  1. Open Grafana → DashboardsImport
  2. Click Upload JSON file
  3. Select dashboard file (e.g., storage-health.json)
  4. Configure data source:
    • Prometheus: Select your Prometheus data source
  5. Click Import
  6. Repeat for all three dashboards

Option 2: Grafana API

# Set Grafana credentials
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"

# Import all dashboards
for dashboard in storage-health cluster-overview sli-dashboard; do
  curl -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d @"$dashboard.json"
done

Option 3: Grafana Provisioning (Automated)

Create /etc/grafana/provisioning/dashboards/stemedb.yaml:

apiVersion: 1

providers:
  - name: 'stemedb'
    orgId: 1
    folder: 'StemeDB'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/stemedb

Copy dashboard files:

sudo mkdir -p /var/lib/grafana/dashboards/stemedb
sudo cp *.json /var/lib/grafana/dashboards/stemedb/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards/
sudo systemctl restart grafana-server

Dashboard Overview

Storage Health Dashboard

Panels:

  • WAL Fsync Latency (p50, p95, p99) - Track write path performance
  • WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%)
  • WAL Write Rate - Writes/sec and MB/sec throughput
  • WAL Error Rate - Detect write failures
  • Storage Operation Latency - KV operation timing by backend (fjall/redb)
  • Index Lookup Latency - Subject/predicate index performance
  • Storage Operations/sec - Read/write operation rates

Use for:

  • Diagnosing slow writes (check fsync latency)
  • Capacity planning (disk usage trend)
  • Identifying storage bottlenecks (operation latency)

Cluster Overview Dashboard

Panels:

  • Node Status - Alive/Suspect/Dead node counts
  • Replication Lag - Sync delay by peer (alerts >5min)
  • Sync Operations/sec - Replication throughput
  • Merkle Diff Size - Divergence magnitude
  • Cluster Convergence State - % of nodes in sync
  • Gossip Message Rate - SWIM protocol health

Use for:

  • Detecting node failures (status changes)
  • Monitoring cluster health (convergence ratio)
  • Troubleshooting replication issues (lag spikes)

SLI Dashboard

Panels:

  • Request Rate - Traffic by endpoint
  • Request Latency p99 - Heatmap showing latency distribution
  • Error Rate - Errors by type and layer
  • Availability - Success rate gauge (SLO: >99%)
  • Request Status Distribution - 2xx/4xx/5xx breakdown
  • Latency Distribution - p50/p95/p99 across all endpoints
  • Circuit Breaker Status - Open/half-open count

Use for:

  • Validating SLO compliance (99% availability, p99 <500ms)
  • Detecting outages (availability drops)
  • Identifying slow endpoints (latency spikes)

Alert Annotations

Dashboards include embedded Grafana alerts:

  • High Replication Lag (cluster-overview) - Fires when lag >300s for 5min
  • High WAL Error Rate (storage-health) - Fires when error rate >0.01/sec
  • High Error Rate (sli-dashboard) - Fires when API errors >0.01/sec

These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration.

Customization

Update Prometheus Data Source

Edit dashboard JSON, find:

"datasource": "Prometheus"

Replace with your data source name/UID.

Adjust Thresholds

For gauge panels, modify thresholds.steps:

"thresholds": {
  "steps": [
    {"value": 0, "color": "green"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "red"}
  ]
}

Change Refresh Rate

Modify refresh field at dashboard root:

"refresh": "30s"  // Change to "10s", "1m", etc.

Troubleshooting

Dashboard shows "No data"

  1. Check Prometheus scrape config:

    scrape_configs:
      - job_name: 'stemedb'
        static_configs:
          - targets: ['localhost:18180']
    
  2. Verify metrics endpoint:

    curl http://localhost:18180/metrics | grep stemedb_
    
  3. Check Prometheus targets:

    • Open Prometheus → Status → Targets
    • Verify stemedb job shows "UP"

Metrics missing

If specific metrics don't appear:

  • WAL metrics: Ensure Layer 1 instrumentation is deployed
  • Storage metrics: Ensure Layer 2 instrumentation is deployed
  • HTTP metrics: Ensure Layer 3 instrumentation is deployed
  • Error metrics: Ensure Layer 4 instrumentation is deployed

Grafana shows "Panel plugin not found"

Update dashboard type field to use standard panel types:

  • graphtimeseries
  • gaugegauge
  • statstat
  • heatmapheatmap
  • piechartpiechart

Next Steps

After importing dashboards:

  1. Configure alerts - See ../prometheus/alerts/ for alert rules
  2. Set up notification channels - PagerDuty, Slack, email
  3. Create runbooks - Link alerts to ../../runbooks/ docs
  4. Test alerts - Simulate failures to verify alert delivery

Support

For issues with dashboards:

  • Check Grafana logs: journalctl -u grafana-server -f
  • Verify Prometheus connectivity: curl $GRAFANA_URL/api/datasources
  • Review dashboard JSON for syntax errors