History

jml 3e7eddc074 feat: add enterprise production readiness infrastructure This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>		2026-02-12 06:08:15 +00:00
..
cluster-overview.json	feat: add enterprise production readiness infrastructure	2026-02-12 06:08:15 +00:00
README.md	feat: add enterprise production readiness infrastructure	2026-02-12 06:08:15 +00:00
sli-dashboard.json	feat: add enterprise production readiness infrastructure	2026-02-12 06:08:15 +00:00
storage-health.json	feat: add enterprise production readiness infrastructure	2026-02-12 06:08:15 +00:00

README.md

Grafana Dashboards for StemeDB

This directory contains pre-configured Grafana dashboards for monitoring StemeDB in production.

Dashboards

Dashboard	Purpose	Refresh Rate
storage-health.json	WAL performance, storage latency, index lookup timing	30s
cluster-overview.json	Node status, replication lag, sync operations, gossip	10s
sli-dashboard.json	Request rate, latency percentiles, error rate, availability	15s

Prerequisites

Prometheus configured to scrape StemeDB /metrics endpoint
Grafana 8.0+ installed
Network access from Grafana to Prometheus

Import Instructions

Option 1: Grafana UI

Open Grafana → Dashboards → Import
Click Upload JSON file
Select dashboard file (e.g., storage-health.json)
Configure data source:
- Prometheus: Select your Prometheus data source
Click Import
Repeat for all three dashboards

Option 2: Grafana API

# Set Grafana credentials
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"

# Import all dashboards
for dashboard in storage-health cluster-overview sli-dashboard; do
  curl -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d @"$dashboard.json"
done

Option 3: Grafana Provisioning (Automated)

Create /etc/grafana/provisioning/dashboards/stemedb.yaml:

apiVersion: 1

providers:
  - name: 'stemedb'
    orgId: 1
    folder: 'StemeDB'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/stemedb

Copy dashboard files:

sudo mkdir -p /var/lib/grafana/dashboards/stemedb
sudo cp *.json /var/lib/grafana/dashboards/stemedb/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards/
sudo systemctl restart grafana-server

Dashboard Overview

Storage Health Dashboard

Panels:

WAL Fsync Latency (p50, p95, p99) - Track write path performance
WAL Disk Usage - Monitor disk capacity (alerts at 70%/90%)
WAL Write Rate - Writes/sec and MB/sec throughput
WAL Error Rate - Detect write failures
Storage Operation Latency - KV operation timing by backend (fjall/redb)
Index Lookup Latency - Subject/predicate index performance
Storage Operations/sec - Read/write operation rates

Use for:

Diagnosing slow writes (check fsync latency)
Capacity planning (disk usage trend)
Identifying storage bottlenecks (operation latency)

Cluster Overview Dashboard

Panels:

Node Status - Alive/Suspect/Dead node counts
Replication Lag - Sync delay by peer (alerts >5min)
Sync Operations/sec - Replication throughput
Merkle Diff Size - Divergence magnitude
Cluster Convergence State - % of nodes in sync
Gossip Message Rate - SWIM protocol health

Use for:

Detecting node failures (status changes)
Monitoring cluster health (convergence ratio)
Troubleshooting replication issues (lag spikes)

SLI Dashboard

Panels:

Request Rate - Traffic by endpoint
Request Latency p99 - Heatmap showing latency distribution
Error Rate - Errors by type and layer
Availability - Success rate gauge (SLO: >99%)
Request Status Distribution - 2xx/4xx/5xx breakdown
Latency Distribution - p50/p95/p99 across all endpoints
Circuit Breaker Status - Open/half-open count

Use for:

Validating SLO compliance (99% availability, p99 <500ms)
Detecting outages (availability drops)
Identifying slow endpoints (latency spikes)

Alert Annotations

Dashboards include embedded Grafana alerts:

High Replication Lag (cluster-overview) - Fires when lag >300s for 5min
High WAL Error Rate (storage-health) - Fires when error rate >0.01/sec
High Error Rate (sli-dashboard) - Fires when API errors >0.01/sec

These alerts can be forwarded to Alertmanager for PagerDuty/Slack integration.

Customization

Update Prometheus Data Source

Edit dashboard JSON, find:

"datasource": "Prometheus"

Replace with your data source name/UID.

Adjust Thresholds

For gauge panels, modify thresholds.steps:

"thresholds": {
  "steps": [
    {"value": 0, "color": "green"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "red"}
  ]
}

Change Refresh Rate

Modify refresh field at dashboard root:

"refresh": "30s"  // Change to "10s", "1m", etc.

Troubleshooting

Dashboard shows "No data"

Check Prometheus scrape config:

scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']

Verify metrics endpoint:

curl http://localhost:18180/metrics | grep stemedb_

Check Prometheus targets:
- Open Prometheus → Status → Targets
- Verify stemedb job shows "UP"

Metrics missing

If specific metrics don't appear:

WAL metrics: Ensure Layer 1 instrumentation is deployed
Storage metrics: Ensure Layer 2 instrumentation is deployed
HTTP metrics: Ensure Layer 3 instrumentation is deployed
Error metrics: Ensure Layer 4 instrumentation is deployed

Grafana shows "Panel plugin not found"

Update dashboard type field to use standard panel types:

graph → timeseries
gauge → gauge
stat → stat
heatmap → heatmap
piechart → piechart

Next Steps

After importing dashboards:

Configure alerts - See ../prometheus/alerts/ for alert rules
Set up notification channels - PagerDuty, Slack, email
Create runbooks - Link alerts to ../../runbooks/ docs
Test alerts - Simulate failures to verify alert delivery

Support

For issues with dashboards:

Check Grafana logs: journalctl -u grafana-server -f
Verify Prometheus connectivity: curl $GRAFANA_URL/api/datasources
Review dashboard JSON for syntax errors