# Monitoring Guide

This guide covers monitoring the rdev API: Prometheus metrics, Grafana dashboards, alerting, logging with Loki, and health checks.

## Metrics Endpoint

rdev exposes Prometheus metrics at `/metrics`:

```bash
curl http://rdev-api:8080/metrics
```
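To inspect the endpoint from a script, the text exposition format can be parsed in a few lines of Python. This is a minimal sketch, not part of rdev: `parse_metrics` and the sample payload are hypothetical, and the parser assumes label values contain no escaped quotes or braces. In practice the text would be the body returned by `/metrics`.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples.

    Simplified: timestamps are ignored and escaped quotes are not handled.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload using metric names from the tables in this guide.
SAMPLE = '''\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
rdev_commands_active{project="test"} 3
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```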

## Available Metrics

### HTTP Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `http_requests_total` | Counter | Total HTTP requests |
| `http_request_duration_seconds` | Histogram | Request latency |
| `http_requests_in_flight` | Gauge | Current active requests |

Labels: `method`, `path`, `status`

### Command Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_commands_total` | Counter | Total commands executed |
| `rdev_commands_active` | Gauge | Currently running commands |
| `rdev_command_duration_seconds` | Histogram | Command execution time |

Labels: `project`, `type` (claude/shell/git), `status`

### SSE Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_sse_connections_total` | Counter | Total SSE connections |
| `rdev_sse_connections_active` | Gauge | Active SSE connections |
| `rdev_sse_events_sent_total` | Counter | Total events sent |

Labels: `project`, `event_type`

### Auth Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_auth_requests_total` | Counter | Auth attempts |
| `rdev_auth_failures_total` | Counter | Auth failures |

Labels: `reason` (invalid, revoked, expired, ip_blocked)

### Rate Limit Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_ratelimit_requests_total` | Counter | Rate limit checks |
| `rdev_ratelimit_rejected_total` | Counter | Rejected requests |

## Prometheus Configuration

### ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rdev-api
  namespace: rdev
  labels:
    app: rdev-api
spec:
  selector:
    matchLabels:
      app: rdev-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

### Scrape Config (Kubernetes Service Discovery)

```yaml
scrape_configs:
  - job_name: 'rdev-api'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - rdev
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: rdev-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
```
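The two `keep` rules act as successive filters on the discovered endpoints. The sketch below mimics that behaviour in Python to show which targets survive. The target list and the `keep` helper are hypothetical. Prometheus anchors relabel regexes at both ends, which is why `fullmatch` is used here.

```python
import re


def keep(targets, source_label, regex):
    """Mimic a `keep` relabel action: drop targets whose label does not match.

    A missing label is treated as an empty string, as Prometheus does.
    """
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical endpoints discovered in the rdev namespace.
targets = [
    {"__meta_kubernetes_service_label_app": "rdev-api",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_service_label_app": "rdev-api",
     "__meta_kubernetes_endpoint_port_name": "grpc"},
    {"__meta_kubernetes_service_label_app": "rdev-api-worker",
     "__meta_kubernetes_endpoint_port_name": "http"},
]

kept = keep(targets, "__meta_kubernetes_service_label_app", "rdev-api")
kept = keep(kept, "__meta_kubernetes_endpoint_port_name", "http")
print(kept)
```

Because the regex is anchored, a service labelled `rdev-api-worker` is not matched by `rdev-api` and is dropped along with the non-`http` port.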

## Grafana Dashboards

### Overview Dashboard

```json
{
  "title": "rdev API Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\"}[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Latency P99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"rdev-api\"}[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{job=\"rdev-api\",status=~\"5..\"}[5m]))",
          "legendFormat": "5xx errors"
        }
      ]
    },
    {
      "title": "Active Commands",
      "type": "gauge",
      "targets": [
        {
          "expr": "rdev_commands_active",
          "legendFormat": "{{project}}"
        }
      ]
    }
  ]
}
```

### Key PromQL Queries

**Request rate by endpoint:**
```promql
rate(http_requests_total{job="rdev-api"}[5m])
```

**P99 latency:**
```promql
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m])))
```
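For intuition about what `histogram_quantile` returns, the following sketch estimates a quantile from cumulative buckets by linear interpolation inside the bucket that contains the target rank. It is a simplified illustration of the idea rather than Prometheus's actual implementation, and the bucket bounds and counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets [(upper_bound, count), ...].

    Assumes 0 < q <= 1 and at least two buckets, sorted by upper bound and
    ending with +Inf, like the `le` series of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical cumulative request counts per latency bucket (seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.778
```

The estimate is only as precise as the bucket layout: any quantile that lands in the `+Inf` bucket is reported as the highest finite bound.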

**Error rate percentage:**
```promql
100 * sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="rdev-api"}[5m]))
```
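As a worked example of what this percentage is meant to express (5xx requests as a share of all requests), the sketch below computes it from two hypothetical scrapes of `http_requests_total`. The counter values, the 300-second window, and the helper names are illustrative assumptions. Prometheus also handles counter resets and extrapolation, which this sketch ignores.

```python
def per_second_rate(before, after, window_seconds):
    """Per-second increase of each counter series between two scrapes."""
    return {
        key: (after[key] - before.get(key, 0.0)) / window_seconds
        for key in after
    }


def error_percentage(rates):
    """Share of requests with a 5xx status, summed across all series."""
    total = sum(rates.values())
    errors = sum(
        rate for (_method, _path, status), rate in rates.items()
        if status.startswith("5")
    )
    return 100.0 * errors / total if total else 0.0


# Hypothetical counter values keyed by (method, path, status), 300 s apart.
before = {
    ("GET", "/projects", "200"): 1000.0,
    ("POST", "/projects/test/claude", "201"): 200.0,
    ("POST", "/projects/test/claude", "500"): 10.0,
}
after = {
    ("GET", "/projects", "200"): 1570.0,
    ("POST", "/projects/test/claude", "201"): 215.0,
    ("POST", "/projects/test/claude", "500"): 25.0,
}

rates = per_second_rate(before, after, 300)
print(round(error_percentage(rates), 2))  # 2.5
```

Both the numerator and the denominator are summed over every series before dividing, so the result is a single service-wide percentage.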

**Command execution rate:**
```promql
rate(rdev_commands_total{job="rdev-api"}[5m])
```

**Average command duration:**
```promql
rate(rdev_command_duration_seconds_sum[5m])
/ rate(rdev_command_duration_seconds_count[5m])
```

## Alerting

### PrometheusRule

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rdev-api-alerts
  namespace: rdev
spec:
  groups:
    - name: rdev-api
      rules:
        - alert: RdevAPIHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="rdev-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "rdev API error rate > 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: RdevAPIHighLatency
          expr: |
            histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m]))) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "rdev API p99 latency > 2s"
            description: "P99 latency is {{ $value | humanizeDuration }}"

        - alert: RdevAPIPodDown
          expr: up{job="rdev-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "rdev API pod is down"

        - alert: RdevAPIHighCommandQueue
          expr: rdev_commands_active > 4
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active commands"
            description: "{{ $value }} commands currently running"

        - alert: RdevAPIHighRateLimit
          expr: |
            rate(rdev_ratelimit_rejected_total[5m])
            / rate(rdev_ratelimit_requests_total[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High rate limit rejection rate"
```
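Each rule above combines a threshold with a `for` duration: the alert stays pending until the expression has been true at every evaluation for that long, and any false evaluation resets the timer. The sketch below illustrates this for `RdevAPIHighCommandQueue` (`rdev_commands_active > 4`, `for: 5m`). The sample values and the 60-second evaluation interval are assumptions, and `firing_at` is a hypothetical helper.

```python
def firing_at(samples, threshold, for_seconds):
    """Return the evaluation timestamps at which the alert would be firing.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    """
    firing = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= for_seconds:
                firing.append(ts)  # held long enough: firing
        else:
            pending_since = None  # condition cleared: reset the timer
    return firing


# Hypothetical rdev_commands_active values, evaluated every 60 seconds.
samples = [
    (0, 5), (60, 6), (120, 3),             # brief spike, resets at t=120
    (180, 5), (240, 5), (300, 6),
    (360, 7), (420, 5), (480, 5),          # true continuously since t=180
    (540, 2),
]

print(firing_at(samples, threshold=4, for_seconds=300))  # [480]
```

The first spike never fires because it clears after two evaluations. The second one fires only at t=480, once the condition has held for the full 300 seconds.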

## Logging

### Log Format

rdev uses structured JSON logging:

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "request_id": "req-abc123",
  "method": "POST",
  "path": "/projects/test/claude",
  "status": 201,
  "duration_ms": 45,
  "client_ip": "10.0.0.1"
}
```
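Because every entry is a single JSON object, logs can be filtered with standard tooling. The sketch below parses a few lines and picks out slow requests and per-status counts. The second log line, the 1000 ms threshold, and the helper name `parse_lines` are illustrative assumptions; only the first line mirrors the format shown above.

```python
import json
from collections import Counter


def parse_lines(lines):
    """Yield one dict per valid JSON log line, skipping anything unparsable."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


LOG_LINES = [
    '{"level":"info","time":"2024-01-15T10:30:00Z","msg":"request completed",'
    '"request_id":"req-abc123","method":"POST","path":"/projects/test/claude",'
    '"status":201,"duration_ms":45,"client_ip":"10.0.0.1"}',
    # Hypothetical slow, failing request.
    '{"level":"error","time":"2024-01-15T10:30:05Z","msg":"request completed",'
    '"request_id":"req-def456","method":"POST","path":"/projects/test/claude",'
    '"status":500,"duration_ms":1500,"client_ip":"10.0.0.2"}',
    'plain text line that is not JSON',
]

records = list(parse_lines(LOG_LINES))
slow = [r["request_id"] for r in records if r.get("duration_ms", 0) > 1000]
by_status = Counter(r["status"] for r in records if "status" in r)

print(slow)             # ['req-def456']
print(dict(by_status))  # {201: 1, 500: 1}
```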

### Log Levels

| Level | Description |
|-------|-------------|
| `debug` | Detailed debugging info |
| `info` | Normal operations |
| `warn` | Potential issues |
| `error` | Errors requiring attention |

### Loki/Promtail

```yaml
# promtail config
scrape_configs:
  - job_name: rdev-api
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
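      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rdev-api
        action: keep
      # Copy the pod's app label to `app` so LogQL selectors like {app="rdev-api"} match
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app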
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
            path: path
            status: status
      - labels:
          level:
          path:
```

### LogQL Queries

**Error-level entries (set the query time range to the last hour):**
```logql
{app="rdev-api"} | json | level="error"
```

**Slow requests:**
```logql
{app="rdev-api"} | json | duration_ms > 1000
```

**Requests by status:**
```logql
sum by (status) (count_over_time({app="rdev-api"} | json [1h]))
```

## Health Checks

### Liveness

```bash
curl http://rdev-api:8080/health
# Returns 200 if process is alive
```

### Readiness

```bash
curl http://rdev-api:8080/ready
# Returns 200 if ready to serve traffic
# Checks: database connectivity, K8s API access
```

Example response:
```json
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "kubernetes": {
      "status": "healthy",
      "latency_ms": 12
    }
  }
}
```
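A deployment script or smoke test can use this response body to report which dependency is failing. The sketch below is one way to do that, assuming the field layout shown above. The helper names and the `"unhealthy"` status value in the second payload are assumptions, since this guide only documents the healthy case.

```python
import json


def unhealthy_checks(body):
    """Return the sorted names of checks whose status is not "healthy"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(
        name for name, check in checks.items()
        if check.get("status") != "healthy"
    )


def is_ready(body):
    """True when the overall status and every individual check are healthy."""
    payload = json.loads(body)
    return payload.get("status") == "healthy" and not unhealthy_checks(body)


HEALTHY = json.dumps({
    "status": "healthy",
    "checks": {
        "database": {"status": "healthy", "latency_ms": 5},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
})

# Hypothetical degraded response; the "unhealthy" value is an assumption.
DEGRADED = json.dumps({
    "status": "unhealthy",
    "checks": {
        "database": {"status": "unhealthy", "latency_ms": 3000},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
})

print(is_ready(HEALTHY), unhealthy_checks(HEALTHY))
print(is_ready(DEGRADED), unhealthy_checks(DEGRADED))
```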