Major refactoring to hexagonal (ports & adapters) architecture:

- Add service layer (apikey_service, project_service) for business logic
- Add webhook system with dispatcher and delivery tracking
- Add command queue with priority-based processing
- Add rate limiting with sliding window algorithm
- Add audit logging for command execution
- Add OpenTelemetry integration (traces, metrics, spans)
- Add circuit breaker for fault tolerance
- Add cached repository wrapper for performance
- Add comprehensive validation package
- Add Kubernetes client integration for pod management
- Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks)
- Add network policy and PodDisruptionBudget for k8s
- Remove legacy executor and projects/registry packages
- Untrack secrets.yaml (now managed via envault)
- Add coverage.out to .gitignore
- Add e2e test infrastructure with docker-compose
- Add comprehensive documentation (API, architecture, operations, plans)
- Add golangci-lint config and pre-commit hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Monitoring Guide
This guide covers monitoring the rdev API with Prometheus and Grafana, plus logging with Loki and health checks.
Metrics Endpoint
rdev exposes Prometheus metrics at /metrics:
curl http://rdev-api:8080/metrics
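To spot-check a scrape by hand, the text returned by the endpoint can be parsed with a few lines of Python. This is a minimal sketch, not a full parser: it assumes plain `name{labels} value` sample lines without escaped quotes or timestamps. The helper `parse_samples` and the example payload are hypothetical and only illustrate the shape of the output.

```python
import re

# Matches `name{label="v",...} value` or `name value`. This is a simplification
# of the Prometheus text format (no escaped quotes, no timestamps).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified pattern cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like what the /metrics endpoint might return.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
http_requests_in_flight 3
"""

samples = parse_samples(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

For anything beyond a quick check, a maintained parser is a better choice than this regular expression.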
Available Metrics
HTTP Metrics
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_flight | Gauge | Current active requests |
Labels: method, path, status
Command Metrics
| Metric | Type | Description |
|---|---|---|
| rdev_commands_total | Counter | Total commands executed |
| rdev_commands_active | Gauge | Currently running commands |
| rdev_command_duration_seconds | Histogram | Command execution time |
Labels: project, type (claude/shell/git), status
SSE Metrics
| Metric | Type | Description |
|---|---|---|
| rdev_sse_connections_total | Counter | Total SSE connections |
| rdev_sse_connections_active | Gauge | Active SSE connections |
| rdev_sse_events_sent_total | Counter | Total events sent |
Labels: project, event_type
Auth Metrics
| Metric | Type | Description |
|---|---|---|
| rdev_auth_requests_total | Counter | Auth attempts |
| rdev_auth_failures_total | Counter | Auth failures |
Labels: reason (invalid, revoked, expired, ip_blocked)
Rate Limit Metrics
| Metric | Type | Description |
|---|---|---|
| rdev_ratelimit_requests_total | Counter | Rate limit checks |
| rdev_ratelimit_rejected_total | Counter | Rejected requests |
Prometheus Configuration
ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rdev-api
  namespace: rdev
  labels:
    app: rdev-api
spec:
  selector:
    matchLabels:
      app: rdev-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Scrape Config (Kubernetes Service Discovery)
scrape_configs:
  - job_name: 'rdev-api'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - rdev
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: rdev-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
Grafana Dashboards
Overview Dashboard
{
  "title": "rdev API Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\"}[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Latency P99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"rdev-api\"}[5m]))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\",status=~\"5..\"}[5m])",
          "legendFormat": "5xx errors"
        }
      ]
    },
    {
      "title": "Active Commands",
      "type": "gauge",
      "targets": [
        {
          "expr": "rdev_commands_active",
          "legendFormat": "{{project}}"
        }
      ]
    }
  ]
}
Key PromQL Queries
Request rate by endpoint:
sum by (method, path) (rate(http_requests_total{job="rdev-api"}[5m]))
P99 latency:
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m]))
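The quantile is an estimate: Prometheus finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that arithmetic for classic histograms under simplifying assumptions (0 < q <= 1, at least two buckets, the last one being +Inf). The bucket bounds and counts are hypothetical, and plain counts stand in for the per-second bucket rates that `rate()` would produce.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    Assumes 0 < q <= 1 and at least two buckets.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical cumulative buckets: (le in seconds, observations <= le).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]

p99 = histogram_quantile(0.99, BUCKETS)
print(f"estimated p99: {p99:.2f}s")
```

With these buckets the 99th observation lands halfway through the 1.0 to 2.0 second bucket, so the estimate depends on where the bucket boundaries sit. Boundaries close to the latency thresholds that matter give tighter estimates.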
Error rate percentage:
100 * sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="rdev-api"}[5m]))
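A worked example of the error-percentage arithmetic, using hypothetical counter values for one 5-minute window. It is a simplified sketch: real `rate()` also handles counter resets and extrapolates to the window edges, which this helper ignores.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter over a window (no resets)."""
    return (last - first) / window_seconds


# Hypothetical counter values at the start and end of a 5-minute window,
# keyed by the `status` label.
WINDOW_SECONDS = 300
start = {"200": 10_000, "404": 200, "500": 40, "503": 10}
end = {"200": 12_910, "404": 230, "500": 85, "503": 25}

rates = {status: per_second_rate(start[status], end[status], WINDOW_SECONDS)
         for status in start}

# Aggregate across series first, then divide once.
total_rate = sum(rates.values())
error_rate = sum(rate for status, rate in rates.items() if status.startswith("5"))
error_percent = 100 * error_rate / total_rate

print(f"{error_rate:.2f} of {total_rate:.2f} req/s failed ({error_percent:.1f}%)")
```

Here 0.2 of roughly 10 requests per second return a 5xx status, which is about 2%.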
Command execution rate:
rate(rdev_commands_total{job="rdev-api"}[5m])
Average command duration:
rate(rdev_command_duration_seconds_sum[5m])
/ rate(rdev_command_duration_seconds_count[5m])
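The average works because a histogram exports a running `_sum` of observed seconds and a running `_count` of observations; dividing their increases over the same window gives the mean duration. A minimal sketch with hypothetical scrape values (the helper name is illustrative only):

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Mean seconds per command between two scrapes of _sum and _count."""
    completed = count_end - count_start
    if completed == 0:
        return None  # no commands finished in the window
    return (sum_end - sum_start) / completed


# Hypothetical values: 30 commands finished and took 60 seconds in total.
avg = average_duration(1200.0, 1260.0, 400, 430)
print(avg)
```

The window length cancels out of the division, so the result is the same whether it is computed from raw increases or from per-second rates.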
Alerting
PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rdev-api-alerts
  namespace: rdev
spec:
  groups:
    - name: rdev-api
      rules:
        - alert: RdevAPIHighErrorRate
          # Aggregate before dividing so 5xx traffic is compared with all traffic.
          expr: |
            sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="rdev-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "rdev API error rate > 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: RdevAPIHighLatency
          expr: |
            histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m])) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "rdev API p99 latency > 2s"
            description: "P99 latency is {{ $value | humanizeDuration }}"
        - alert: RdevAPIPodDown
          expr: up{job="rdev-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "rdev API pod is down"
        - alert: RdevAPIHighCommandQueue
          expr: rdev_commands_active > 4
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active commands"
            description: "{{ $value }} commands currently running"
        - alert: RdevAPIHighRateLimit
          expr: |
            rate(rdev_ratelimit_rejected_total[5m])
              / rate(rdev_ratelimit_requests_total[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High rate limit rejection rate"
Logging
Log Format
rdev uses structured JSON logging:
{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "request_id": "req-abc123",
  "method": "POST",
  "path": "/projects/test/claude",
  "status": 201,
  "duration_ms": 45,
  "client_ip": "10.0.0.1"
}
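Because every entry is a JSON object with the fields shown above, logs can also be filtered outside of a log backend. The sketch below assumes one JSON object per line and picks out slow requests; the helper `slow_requests`, the second log line, and its `/projects/test/shell` path are hypothetical.

```python
import json


def slow_requests(lines, threshold_ms=1000):
    """Yield parsed log records whose duration_ms exceeds threshold_ms."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output such as stack traces
        if not isinstance(record, dict):
            continue  # skip JSON values that are not log objects
        if record.get("duration_ms", 0) > threshold_ms:
            yield record


# Example input: the first line follows the format above, the rest are made up.
LINES = [
    '{"level":"info","msg":"request completed","request_id":"req-abc123",'
    '"path":"/projects/test/claude","status":201,"duration_ms":45}',
    '{"level":"warn","msg":"request completed","request_id":"req-def456",'
    '"path":"/projects/test/shell","status":200,"duration_ms":2300}',
    "panic: not a JSON line",
]

slow = list(slow_requests(LINES))
for record in slow:
    print(record["request_id"], record["path"], record["duration_ms"])
```

The same function can read from standard input by passing `sys.stdin` instead of a list.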
Log Levels
| Level | Description |
|---|---|
| debug | Detailed debugging info |
| info | Normal operations |
| warn | Potential issues |
| error | Errors requiring attention |
Loki/Promtail
# promtail config
scrape_configs:
  - job_name: rdev-api
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rdev-api
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
            path: path
            status: status
      - labels:
          level:
          path:
LogQL Queries
Errors in last hour:
{app="rdev-api"} | json | level="error"
Slow requests:
{app="rdev-api"} | json | duration_ms > 1000
Requests by status:
sum by (status) (count_over_time({app="rdev-api"} | json [1h]))
Health Checks
Liveness
curl http://rdev-api:8080/health
# Returns 200 if process is alive
Readiness
curl http://rdev-api:8080/ready
# Returns 200 if ready to serve traffic
# Checks: database connectivity, K8s API access
Response:
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "kubernetes": {
      "status": "healthy",
      "latency_ms": 12
    }
  }
}
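A script that consumes this response can report which dependency is failing. The sketch below assumes the body has the shape shown above and treats any check status other than "healthy" as failing. The value "unhealthy" in the second payload is an assumption, since only the healthy response is documented here, and `failing_checks` is a hypothetical helper.

```python
import json


def failing_checks(body):
    """Return the sorted names of checks whose status is not "healthy"."""
    payload = json.loads(body)
    return sorted(
        name
        for name, check in payload.get("checks", {}).items()
        if check.get("status") != "healthy"
    )


# Body matching the documented readiness response.
READY = """
{"status": "healthy",
 "checks": {"database": {"status": "healthy", "latency_ms": 5},
            "kubernetes": {"status": "healthy", "latency_ms": 12}}}
"""

# Hypothetical failure payload; the real field values may differ.
DEGRADED = """
{"status": "unhealthy",
 "checks": {"database": {"status": "unhealthy", "latency_ms": 5000},
            "kubernetes": {"status": "healthy", "latency_ms": 12}}}
"""

print(failing_checks(READY))
print(failing_checks(DEGRADED))
```

An empty list means every listed check passed. The HTTP status code remains the primary readiness signal.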