jordan 72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry

2026-01-25 19:57:46 -07:00

Monitoring Guide

This guide covers monitoring the rdev API with Prometheus and Grafana, plus logging and health checks.

Metrics Endpoint

rdev exposes Prometheus metrics at /metrics:

curl http://rdev-api:8080/metrics
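
The endpoint returns the Prometheus text exposition format, which is easy to inspect from a script. The following is a minimal Python sketch, not part of rdev, that parses such output into tuples. The sample payload and its values are made up for illustration, and the parser is deliberately simple: it ignores timestamps and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of a /metrics response; the numbers are illustrative.
EXAMPLE = '''\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
http_requests_in_flight 3
'''

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

For anything beyond quick inspection, an official Prometheus client library parser is a safer choice than a hand-rolled regex.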

Available Metrics

HTTP Metrics

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests
`http_request_duration_seconds`	Histogram	Request latency
`http_requests_in_flight`	Gauge	Current active requests

Labels: method, path, status

Command Metrics

Metric	Type	Description
`rdev_commands_total`	Counter	Total commands executed
`rdev_commands_active`	Gauge	Currently running commands
`rdev_command_duration_seconds`	Histogram	Command execution time

Labels: project, type (claude/shell/git), status

SSE Metrics

Metric	Type	Description
`rdev_sse_connections_total`	Counter	Total SSE connections
`rdev_sse_connections_active`	Gauge	Active SSE connections
`rdev_sse_events_sent_total`	Counter	Total events sent

Labels: project, event_type

Auth Metrics

Metric	Type	Description
`rdev_auth_requests_total`	Counter	Auth attempts
`rdev_auth_failures_total`	Counter	Auth failures

Labels: reason (invalid, revoked, expired, ip_blocked)

Rate Limit Metrics

Metric	Type	Description
`rdev_ratelimit_requests_total`	Counter	Rate limit checks
`rdev_ratelimit_rejected_total`	Counter	Rejected requests

Prometheus Configuration

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rdev-api
  namespace: rdev
  labels:
    app: rdev-api
spec:
  selector:
    matchLabels:
      app: rdev-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Scrape Config (Kubernetes Service Discovery)

scrape_configs:
  - job_name: 'rdev-api'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - rdev
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: rdev-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
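
Prometheus anchors relabel regexes at both ends, so `rdev-api` keeps only targets whose label value is exactly `rdev-api`. The Python sketch below imitates the two `keep` rules to show the effect. The target dictionaries are hypothetical, and Python's `re` module stands in for the RE2 syntax that Prometheus actually uses.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Imitate a relabel rule with action: keep.

    The source label values are joined with the separator, and a target
    survives only if the whole joined string matches the regex.
    """
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


# Hypothetical discovered targets; only the first satisfies both rules.
targets = [
    {"__meta_kubernetes_service_label_app": "rdev-api",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_service_label_app": "rdev-api-canary",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_service_label_app": "rdev-api",
     "__meta_kubernetes_endpoint_port_name": "debug"},
]
targets = keep(targets, ["__meta_kubernetes_service_label_app"], "rdev-api")
targets = keep(targets, ["__meta_kubernetes_endpoint_port_name"], "http")
print(len(targets))
```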

Grafana Dashboards

Overview Dashboard

{
  "title": "rdev API Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\"}[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Latency P99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"rdev-api\"}[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\",status=~\"5..\"}[5m])",
          "legendFormat": "5xx errors"
        }
      ]
    },
    {
      "title": "Active Commands",
      "type": "gauge",
      "targets": [
        {
          "expr": "rdev_commands_active",
          "legendFormat": "{{project}}"
        }
      ]
    }
  ]
}

Key PromQL Queries

Request rate by endpoint:

sum by (method, path) (rate(http_requests_total{job="rdev-api"}[5m]))

P99 latency:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m])))
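
`histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the requested rank. The Python sketch below follows that documented behaviour for classic histograms. The bucket boundaries and counts are invented for illustration, and edge cases that Prometheus handles (NaN values, fewer than two buckets) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (le, count) buckets.

    Assumes at least one finite bucket plus the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lies beyond the last finite bucket: report its bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds, cumulative count).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
print(histogram_quantile(0.99, BUCKETS))
```

Because the estimate is interpolated, its precision depends on how closely the bucket boundaries bracket the latencies of interest.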

Error rate percentage:

100 * sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="rdev-api"}[5m]))
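
An overall error percentage is a ratio of totals: the increase of all 5xx series divided by the increase of every series, whatever their `method` and `path` labels. The Python sketch below works through that arithmetic on two counter snapshots. The snapshot values are hypothetical, and counter resets are not handled.

```python
def error_percentage(before, after):
    """Percentage of requests with a 5xx status between two snapshots.

    Each snapshot maps (method, path, status) to the value of
    http_requests_total for that label combination.
    """
    total = 0.0
    errors = 0.0
    for key, value in after.items():
        increase = value - before.get(key, 0.0)
        total += increase
        if key[2].startswith("5"):  # key[2] is the status label
            errors += increase
    return 100.0 * errors / total if total else 0.0


# Hypothetical counter values from two scrapes.
BEFORE = {
    ("GET", "/projects", "200"): 100.0,
    ("POST", "/projects/test/claude", "201"): 20.0,
    ("POST", "/projects/test/claude", "500"): 1.0,
}
AFTER = {
    ("GET", "/projects", "200"): 190.0,
    ("POST", "/projects/test/claude", "201"): 29.0,
    ("POST", "/projects/test/claude", "500"): 2.0,
}
print(error_percentage(BEFORE, AFTER))
```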

Command execution rate:

rate(rdev_commands_total{job="rdev-api"}[5m])

Average command duration:

rate(rdev_command_duration_seconds_sum[5m])
/ rate(rdev_command_duration_seconds_count[5m])

Alerting

PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rdev-api-alerts
  namespace: rdev
spec:
  groups:
    - name: rdev-api
      rules:
        - alert: RdevAPIHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="rdev-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "rdev API error rate > 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: RdevAPIHighLatency
          expr: |
            histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m]))) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "rdev API p99 latency > 2s"
            description: "P99 latency is {{ $value | humanizeDuration }}"

        - alert: RdevAPIPodDown
          expr: up{job="rdev-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "rdev API pod is down"

        - alert: RdevAPIHighCommandQueue
          expr: rdev_commands_active > 4
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active commands"
            description: "{{ $value }} commands currently running"

        - alert: RdevAPIHighRateLimit
          expr: |
            sum(rate(rdev_ratelimit_rejected_total[5m]))
            / sum(rate(rdev_ratelimit_requests_total[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High rate limit rejection rate"
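
Each rule's `for` clause means the expression must hold at every evaluation for that long before the alert moves from pending to firing; a single evaluation below the threshold resets the timer. The Python sketch below models this for `RdevAPIHighCommandQueue`. The 60-second evaluation interval and the sample values are assumptions chosen for illustration.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    samples is a list of (timestamp_seconds, value) pairs, one per rule
    evaluation, in chronological order.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip resets the timer
            states.append("inactive")
    return states


# Hypothetical rdev_commands_active readings, evaluated every 60 seconds.
VALUES = [5, 5, 5, 5, 5, 5, 3, 5]
SAMPLES = [(i * 60, v) for i, v in enumerate(VALUES)]
print(alert_states(SAMPLES, threshold=4, for_seconds=300))
```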

Logging

Log Format

rdev uses structured JSON logging:

{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "request_id": "req-abc123",
  "method": "POST",
  "path": "/projects/test/claude",
  "status": 201,
  "duration_ms": 45,
  "client_ip": "10.0.0.1"
}
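
Because every entry is a single JSON object, logs can be filtered with a few lines of code when Loki is not at hand. The Python sketch below finds slow requests and counts requests per status. The field names come from the example above, while the log lines themselves and the 1000 ms threshold are illustrative assumptions.

```python
import json
from collections import Counter


def parse_log_lines(lines):
    """Yield decoded log entries, skipping lines that are not JSON objects."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def slow_requests(entries, threshold_ms=1000):
    """Return entries whose duration_ms exceeds the threshold."""
    return [e for e in entries if e.get("duration_ms", 0) > threshold_ms]


def count_by_status(entries):
    """Count entries per HTTP status code."""
    return Counter(e["status"] for e in entries if "status" in e)


# Hypothetical log lines in the format shown above.
LINES = [
    '{"level":"info","msg":"request completed","path":"/projects/test/claude","status":201,"duration_ms":45}',
    '{"level":"info","msg":"request completed","path":"/projects/test/claude","status":201,"duration_ms":1800}',
    '{"level":"error","msg":"request completed","path":"/projects/test/git","status":500,"duration_ms":12}',
    'not json',
]
entries = list(parse_log_lines(LINES))
print(len(slow_requests(entries)), dict(count_by_status(entries)))
```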

Log Levels

Level	Description
`debug`	Detailed debugging info
`info`	Normal operations
`warn`	Potential issues
`error`	Errors requiring attention

Loki/Promtail

# promtail config
scrape_configs:
  - job_name: rdev-api
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rdev-api
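        action: keep
      # Copy the pod's app label onto the log stream so that
      # {app="rdev-api"} selectors in LogQL match.
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app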
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
            path: path
            status: status
      - labels:
          level:
          path:

LogQL Queries

Error-level entries (set the query time range to the last hour):

{app="rdev-api"} | json | level="error"

Slow requests:

{app="rdev-api"} | json | duration_ms > 1000

Requests by status:

sum by (status) (count_over_time({app="rdev-api"} | json [1h]))

Health Checks

Liveness

curl http://rdev-api:8080/health
# Returns 200 if process is alive

Readiness

curl http://rdev-api:8080/ready
# Returns 200 if ready to serve traffic
# Checks: database connectivity, K8s API access

Response:

{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "kubernetes": {
      "status": "healthy",
      "latency_ms": 12
    }
  }
}
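
A deployment script or smoke test can use this body to report which dependency is failing. The Python sketch below is one way to do that, under two assumptions that the example response does not confirm: a failing check reports a `status` other than `healthy`, and a non-200 response makes `urlopen` raise `HTTPError`, which is left to the caller here.

```python
import json
import urllib.request


def failing_checks(body):
    """Return the sorted names of checks whose status is not 'healthy'."""
    checks = body.get("checks", {})
    return sorted(
        name for name, check in checks.items()
        if check.get("status") != "healthy"
    )


def fetch_ready(url="http://rdev-api:8080/ready", timeout=2.0):
    """Fetch and decode the readiness body; raises on non-200 responses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# The response shown above, plus a hypothetical degraded variant.
HEALTHY = {
    "status": "healthy",
    "checks": {
        "database": {"status": "healthy", "latency_ms": 5},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
}
DEGRADED = {
    "status": "unhealthy",
    "checks": {
        "database": {"status": "unhealthy", "latency_ms": 2000},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
}
print(failing_checks(HEALTHY), failing_checks(DEGRADED))
```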
