# Monitoring Guide
This guide covers monitoring the rdev API: metrics with Prometheus and Grafana, logs with Loki, and health checks.
## Metrics Endpoint
rdev exposes Prometheus metrics at `/metrics`:
```bash
curl http://rdev-api:8080/metrics
```
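To inspect a scrape by hand, the text exposition format can be parsed with a few lines of Python. This is a minimal sketch using only the standard library; `parse_metrics` and the sample payload are illustrative and not part of rdev, and the parser deliberately ignores timestamps and escaped quotes inside label values.

```python
import re

# Matches lines such as: http_requests_total{method="GET",status="200"} 42
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples.

    Simplified: skips comment lines, ignores timestamps and escaped quotes.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload shaped like the output of the curl command above.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
http_requests_in_flight 3
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

For anything beyond quick inspection, a maintained parser is a safer choice than this regular-expression approach.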
## Available Metrics
### HTTP Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `http_requests_total` | Counter | Total HTTP requests |
| `http_request_duration_seconds` | Histogram | Request latency |
| `http_requests_in_flight` | Gauge | Current active requests |

Labels: `method`, `path`, `status`
### Command Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `rdev_commands_total` | Counter | Total commands executed |
| `rdev_commands_active` | Gauge | Currently running commands |
| `rdev_command_duration_seconds` | Histogram | Command execution time |

Labels: `project`, `type` (claude/shell/git), `status`
### SSE Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `rdev_sse_connections_total` | Counter | Total SSE connections |
| `rdev_sse_connections_active` | Gauge | Active SSE connections |
| `rdev_sse_events_sent_total` | Counter | Total events sent |

Labels: `project`, `event_type`
### Auth Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `rdev_auth_requests_total` | Counter | Auth attempts |
| `rdev_auth_failures_total` | Counter | Auth failures |

Labels: `reason` (invalid, revoked, expired, ip_blocked)
### Rate Limit Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `rdev_ratelimit_requests_total` | Counter | Rate limit checks |
| `rdev_ratelimit_rejected_total` | Counter | Rejected requests |
## Prometheus Configuration
### ServiceMonitor (Prometheus Operator)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rdev-api
  namespace: rdev
  labels:
    app: rdev-api
spec:
  selector:
    matchLabels:
      app: rdev-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
### Scrape Config (without Prometheus Operator)
```yaml
scrape_configs:
  - job_name: 'rdev-api'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - rdev
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: rdev-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
```
## Grafana Dashboards
### Overview Dashboard
```json
{
  "title": "rdev API Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum by (method, path) (rate(http_requests_total{job=\"rdev-api\"}[5m]))",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Latency P99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"rdev-api\"}[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{job=\"rdev-api\",status=~\"5..\"}[5m]))",
          "legendFormat": "5xx errors"
        }
      ]
    },
    {
      "title": "Active Commands",
      "type": "gauge",
      "targets": [
        {
          "expr": "rdev_commands_active",
          "legendFormat": "{{project}}"
        }
      ]
    }
  ]
}
```
### Key PromQL Queries
**Request rate by endpoint:**
```promql
sum by (method, path) (rate(http_requests_total{job="rdev-api"}[5m]))
```
**P99 latency:**
```promql
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m])))
```
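`histogram_quantile` estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that arithmetic for classic cumulative buckets. The function name mirrors PromQL only for readability, the bucket counts are invented, and edge cases that Prometheus itself handles (for example negative bounds or native histograms) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds): 100 observations in total.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]

print(histogram_quantile(0.99, BUCKETS))  # 1.5
```

Here the 99th observation lands halfway through the 1.0 to 2.0 second bucket, so the estimate is 1.5 seconds. The result is only as precise as the bucket boundaries, which is worth keeping in mind when choosing alert thresholds.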
**Error rate percentage:**
```promql
100 * sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="rdev-api"}[5m]))
```
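As a worked example of the arithmetic behind an error-rate percentage, the sketch below takes two snapshots of a request counter keyed by status code and divides the summed 5xx increase by the summed total increase. The helper name and the numbers are hypothetical, and counter resets are not handled.

```python
def error_rate_percent(before, after):
    """Percentage of 5xx responses between two counter snapshots.

    `before` and `after` map a status code string to the counter value
    at the start and at the end of the window.
    """
    # Increase per status code over the window.
    deltas = {status: after[status] - before.get(status, 0) for status in after}
    total = sum(deltas.values())
    if total == 0:
        return 0.0  # no traffic in the window
    errors = sum(d for status, d in deltas.items() if status.startswith("5"))
    return 100.0 * errors / total


# Hypothetical snapshots taken five minutes apart.
BEFORE = {"200": 1000, "500": 10}
AFTER = {"200": 1190, "500": 20}

print(error_rate_percent(BEFORE, AFTER))  # 5.0
```

Both sides of the division are aggregated across status codes first. Dividing series by series would compare each 5xx series only with itself.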
**Command execution rate:**
```promql
rate(rdev_commands_total{job="rdev-api"}[5m])
```
**Average command duration:**
```promql
rate(rdev_command_duration_seconds_sum[5m])
/ rate(rdev_command_duration_seconds_count[5m])
```
## Alerting
### PrometheusRule
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rdev-api-alerts
  namespace: rdev
spec:
  groups:
    - name: rdev-api
      rules:
        - alert: RdevAPIHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="rdev-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "rdev API error rate > 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: RdevAPIHighLatency
          expr: |
            histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m]))) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "rdev API p99 latency > 2s"
            description: "P99 latency is {{ $value | humanizeDuration }}"
        - alert: RdevAPIPodDown
          expr: up{job="rdev-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "rdev API pod is down"
        - alert: RdevAPIHighCommandQueue
          expr: rdev_commands_active > 4
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active commands"
            description: "{{ $value }} commands currently running"
        - alert: RdevAPIHighRateLimit
          expr: |
            sum(rate(rdev_ratelimit_rejected_total[5m]))
              / sum(rate(rdev_ratelimit_requests_total[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High rate limit rejection rate"
```
## Logging
### Log Format
rdev uses structured JSON logging:
```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "request_id": "req-abc123",
  "method": "POST",
  "path": "/projects/test/claude",
  "status": 201,
  "duration_ms": 45,
  "client_ip": "10.0.0.1"
}
```
### Log Levels
| Level | Description |
|-------|-------------|
| `debug` | Detailed debugging info |
| `info` | Normal operations |
| `warn` | Potential issues |
| `error` | Errors requiring attention |
### Loki/Promtail
```yaml
# promtail config
scrape_configs:
  - job_name: rdev-api
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rdev-api
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
            path: path
            status: status
      - labels:
          level:
          path:
```
### LogQL Queries
**Error-level log entries:**
```logql
{app="rdev-api"} | json | level="error"
```
**Slow requests:**
```logql
{app="rdev-api"} | json | duration_ms > 1000
```
**Requests by status:**
```logql
sum by (status) (count_over_time({app="rdev-api"} | json [1h]))
```
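The slow-request and per-status queries can also be approximated offline, for instance on output saved from `kubectl logs`. The sketch below assumes one JSON object per line with the fields shown under Log Format. The helper names and sample lines are hypothetical.

```python
import json
from collections import Counter


def parse_lines(lines):
    """Yield dict entries from JSON log lines, skipping unparseable ones."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a startup banner or a stack trace
        if isinstance(entry, dict):
            yield entry


def slow_requests(lines, threshold_ms=1000):
    """Entries whose duration_ms exceeds the threshold."""
    return [e for e in parse_lines(lines) if e.get("duration_ms", 0) > threshold_ms]


def count_by_status(lines):
    """Number of entries per HTTP status code."""
    return Counter(e["status"] for e in parse_lines(lines) if "status" in e)


# Hypothetical log lines in the structured format described above.
LOGS = [
    '{"level": "info", "msg": "request completed", "request_id": "req-abc123", "status": 201, "duration_ms": 45}',
    '{"level": "error", "msg": "request completed", "request_id": "req-def456", "status": 500, "duration_ms": 1800}',
    "not json at all",
]

for entry in slow_requests(LOGS):
    print(entry["request_id"], entry["duration_ms"])
print(count_by_status(LOGS))
```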
## Health Checks
### Liveness
```bash
curl http://rdev-api:8080/health
# Returns 200 if process is alive
```
### Readiness
```bash
curl http://rdev-api:8080/ready
# Returns 200 if ready to serve traffic
# Checks: database connectivity, K8s API access
```
Response:
```json
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "kubernetes": {
      "status": "healthy",
      "latency_ms": 12
    }
  }
}
```
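A script that consumes the readiness response can report which dependency is failing. The sketch below assumes the response shape shown above and treats any check status other than `"healthy"` as failing. The value `"unhealthy"` in the sample is an assumption, since the status strings for failures are not documented here, and `failing_checks` is a hypothetical helper.

```python
import json


def failing_checks(body):
    """Return the sorted names of checks whose status is not "healthy"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(
        name for name, check in checks.items()
        if check.get("status") != "healthy"
    )


# Body copied from the example response above.
READY_BODY = json.dumps({
    "status": "healthy",
    "checks": {
        "database": {"status": "healthy", "latency_ms": 5},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
})

# Hypothetical degraded response; the "unhealthy" value is assumed.
DEGRADED_BODY = json.dumps({
    "status": "unhealthy",
    "checks": {
        "database": {"status": "healthy", "latency_ms": 5},
        "kubernetes": {"status": "unhealthy", "latency_ms": 900},
    },
})

print(failing_checks(READY_BODY))
print(failing_checks(DEGRADED_BODY))
```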