# Monitoring Guide

This guide covers monitoring the rdev API with Prometheus and Grafana.

## Prerequisites

```bash
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```

## Metrics Endpoint

rdev exposes Prometheus metrics at `/metrics`:

```bash
curl http://rdev-api:8080/metrics
```

## Available Metrics

### HTTP Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `http_requests_total` | Counter | Total HTTP requests |
| `http_request_duration_seconds` | Histogram | Request latency |
| `http_requests_in_flight` | Gauge | Current active requests |

Labels: `method`, `path`, `status`

### Command Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_commands_total` | Counter | Total commands executed |
| `rdev_commands_active` | Gauge | Currently running commands |
| `rdev_command_duration_seconds` | Histogram | Command execution time |

Labels: `project`, `type` (claude/shell/git), `status`

### SSE Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_sse_connections_total` | Counter | Total SSE connections |
| `rdev_sse_connections_active` | Gauge | Active SSE connections |
| `rdev_sse_events_sent_total` | Counter | Total events sent |

Labels: `project`, `event_type`

### Auth Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_auth_requests_total` | Counter | Auth attempts |
| `rdev_auth_failures_total` | Counter | Auth failures |

Labels: `reason` (invalid, revoked, expired, ip_blocked)

### Rate Limit Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_ratelimit_requests_total` | Counter | Rate limit checks |
| `rdev_ratelimit_rejected_total` | Counter | Rejected requests |

## Prometheus Configuration

### ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rdev-api
  namespace: rdev
  labels:
    app: rdev-api
spec:
  selector:
    matchLabels:
      app: rdev-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

### Static Config

```yaml
scrape_configs:
  - job_name: 'rdev-api'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - rdev
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: rdev-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
```

## Grafana Dashboards

### Overview Dashboard

```json
{
  "title": "rdev API Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\"}[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Latency P99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"rdev-api\"}[5m]))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\",status=~\"5..\"}[5m])",
          "legendFormat": "5xx errors"
        }
      ]
    },
    {
      "title": "Active Commands",
      "type": "gauge",
      "targets": [
        {
          "expr": "rdev_commands_active",
          "legendFormat": "{{project}}"
        }
      ]
    }
  ]
}
```

### Key PromQL Queries

**Request rate by endpoint:**

```promql
rate(http_requests_total{job="rdev-api"}[5m])
```

**P99 latency:**

```promql
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m]))
```

**Error rate percentage:**

```promql
100 * sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="rdev-api"}[5m]))
```

Both sides are summed before dividing. Without aggregation the division matches series label-for-label, so only the 5xx series pair up and the ratio is always 1.

**Command execution rate:**

```promql
rate(rdev_commands_total{job="rdev-api"}[5m])
```

**Average command duration:**

```promql
rate(rdev_command_duration_seconds_sum[5m]) / rate(rdev_command_duration_seconds_count[5m])
```

## Alerting

### PrometheusRule

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rdev-api-alerts
  namespace: rdev
spec:
  groups:
    - name: rdev-api
      rules:
        - alert: RdevAPIHighErrorRate
          # Sum both sides so the ratio covers all requests, not only 5xx series
          expr: |
            sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="rdev-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "rdev API error rate > 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: RdevAPIHighLatency
          expr: |
            histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m])) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "rdev API p99 latency > 2s"
            description: "P99 latency is {{ $value | humanizeDuration }}"

        - alert: RdevAPIPodDown
          expr: up{job="rdev-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "rdev API pod is down"

        - alert: RdevAPIHighCommandQueue
          expr: rdev_commands_active > 4
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active commands"
            description: "{{ $value }} commands currently running"

        - alert: RdevAPIHighRateLimit
          expr: |
            rate(rdev_ratelimit_rejected_total[5m])
            / rate(rdev_ratelimit_requests_total[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High rate limit rejection rate"
```

## Logging

### Log Format

rdev uses structured JSON logging:

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "request_id": "req-abc123",
  "method": "POST",
  "path": "/projects/test/claude",
  "status": 201,
  "duration_ms": 45,
  "client_ip": "10.0.0.1"
}
```

### Log Levels

| Level | Description |
|-------|-------------|
| `debug` | Detailed debugging info |
| `info` | Normal operations |
| `warn` | Potential issues |
| `error` | Errors requiring attention |

### Loki/Promtail

```yaml
# promtail config
scrape_configs:
  - job_name: rdev-api
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rdev-api
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
            path: path
            status: status
      - labels:
          level:
          path:
```

### LogQL Queries

**Errors in last hour:**

```logql
{app="rdev-api"} |= "error"
```

**Slow requests:**

```logql
{app="rdev-api"} | json | duration_ms > 1000
```

**Requests by status:**

```logql
sum by (status) (count_over_time({app="rdev-api"} | json [1h]))
```

## Health Checks

### Liveness

```bash
curl http://rdev-api:8080/health
# Returns 200 if process is alive
```

### Readiness

```bash
curl http://rdev-api:8080/ready
# Returns 200 if ready to serve traffic
# Checks: database connectivity, K8s API access
```

Response:

```json
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "kubernetes": {
      "status": "healthy",
      "latency_ms": 12
    }
  }
}
```
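
### Parsing the Readiness Response

The readiness payload above can also be checked by a script rather than by eye. The following is a minimal sketch, not part of rdev itself: the helper name `unhealthy_checks` is hypothetical, and it assumes that any check whose `status` is something other than `"healthy"` should be treated as failing (the guide only documents the healthy case).

```python
import json


def unhealthy_checks(body: str) -> list[str]:
    """Return the names of checks whose status is not "healthy".

    Assumption: the payload has the shape shown in the Readiness section,
    with a top-level "checks" object keyed by check name.
    """
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(
        name
        for name, result in checks.items()
        if result.get("status") != "healthy"
    )


# Sample body copied from the documented readiness response.
sample = """
{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 5},
    "kubernetes": {"status": "healthy", "latency_ms": 12}
  }
}
"""

failing = unhealthy_checks(sample)
print("all checks healthy" if not failing else f"failing checks: {failing}")
```

In practice the body would come from `curl http://rdev-api:8080/ready`; the sample string keeps the sketch runnable without a cluster.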