# Monitoring Guide

This guide covers monitoring the rdev API with Prometheus and Grafana.

## Metrics Endpoint

rdev exposes Prometheus metrics at `/metrics`:

```bash
curl http://rdev-api:8080/metrics
```

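
To inspect the endpoint's output from a script, the text exposition format can be parsed with a few lines of Python. This is a minimal sketch, not a full parser: it skips `# HELP` and `# TYPE` lines, does not unescape label values, and ignores optional timestamps. The sample values in `SAMPLE` are invented for illustration and are not real rdev output.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, HELP, or TYPE line
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output, shaped like the metrics listed in this guide.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
http_requests_in_flight 3
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In practice the `text` argument would be the body returned by the `curl` command above.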
## Available Metrics

### HTTP Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `http_requests_total` | Counter | Total HTTP requests |
| `http_request_duration_seconds` | Histogram | Request latency |
| `http_requests_in_flight` | Gauge | Current active requests |

Labels: `method`, `path`, `status`

### Command Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_commands_total` | Counter | Total commands executed |
| `rdev_commands_active` | Gauge | Currently running commands |
| `rdev_command_duration_seconds` | Histogram | Command execution time |

Labels: `project`, `type` (claude/shell/git), `status`

### SSE Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_sse_connections_total` | Counter | Total SSE connections |
| `rdev_sse_connections_active` | Gauge | Active SSE connections |
| `rdev_sse_events_sent_total` | Counter | Total events sent |

Labels: `project`, `event_type`

### Auth Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_auth_requests_total` | Counter | Auth attempts |
| `rdev_auth_failures_total` | Counter | Auth failures |

Labels: `reason` (invalid, revoked, expired, ip_blocked)

### Rate Limit Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rdev_ratelimit_requests_total` | Counter | Rate limit checks |
| `rdev_ratelimit_rejected_total` | Counter | Rejected requests |

## Prometheus Configuration

### ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rdev-api
  namespace: rdev
  labels:
    app: rdev-api
spec:
  selector:
    matchLabels:
      app: rdev-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

### Static Config

Without the Prometheus Operator, add a scrape job to `prometheus.yml`. This job discovers targets through Kubernetes service discovery in the `rdev` namespace and keeps only the `http` port of endpoints whose Service is labeled `app: rdev-api`:

```yaml
scrape_configs:
  - job_name: 'rdev-api'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - rdev
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: rdev-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
```

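
The effect of the two `keep` rules can be sketched in Python. Prometheus anchors relabel regexes at both ends, so `rdev-api` does not match `rdev-api-worker`; `re.fullmatch` models that here. The target list, addresses, and the `rdev-api-worker` service are hypothetical, and Python's `re` is only a stand-in for the RE2 syntax Prometheus uses.

```python
import re


def keep(targets, source_label, regex):
    """Keep only targets whose source label fully matches the regex."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


def apply_keep_rules(targets, rules):
    """Apply (source_label, regex) keep rules in order, as relabeling does."""
    for source_label, regex in rules:
        targets = keep(targets, source_label, regex)
    return targets


# The two rules from the scrape job above.
RULES = [
    ("__meta_kubernetes_service_label_app", "rdev-api"),
    ("__meta_kubernetes_endpoint_port_name", "http"),
]

# Hypothetical discovered endpoints.
TARGETS = [
    {"__address__": "10.0.0.5:8080",
     "__meta_kubernetes_service_label_app": "rdev-api",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__address__": "10.0.0.5:6060",
     "__meta_kubernetes_service_label_app": "rdev-api",
     "__meta_kubernetes_endpoint_port_name": "debug"},
    {"__address__": "10.0.0.9:8080",
     "__meta_kubernetes_service_label_app": "rdev-api-worker",
     "__meta_kubernetes_endpoint_port_name": "http"},
]

if __name__ == "__main__":
    for target in apply_keep_rules(TARGETS, RULES):
        print(target["__address__"])
```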
## Grafana Dashboards

### Overview Dashboard

```json
{
  "title": "rdev API Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"rdev-api\"}[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Latency P99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"rdev-api\"}[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{job=\"rdev-api\",status=~\"5..\"}[5m]))",
          "legendFormat": "5xx errors"
        }
      ]
    },
    {
      "title": "Active Commands",
      "type": "gauge",
      "targets": [
        {
          "expr": "rdev_commands_active",
          "legendFormat": "{{project}}"
        }
      ]
    }
  ]
}
```

### Key PromQL Queries

**Request rate by endpoint:**
```promql
sum by (method, path) (rate(http_requests_total{job="rdev-api"}[5m]))
```

**P99 latency:**
```promql
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m])))
```

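
`histogram_quantile` estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that arithmetic in Python under simplifying assumptions: the bucket boundaries and counts are hypothetical (this guide does not list rdev's bucket layout), and the lowest bucket is assumed to start at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket with upper bound +Inf must be present and holds the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for buckets le=0.1, 0.5, 1, 2, +Inf.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.99, BUCKETS))
```

With these counts the 99th-percentile rank lands halfway through the 1 s to 2 s bucket, so the estimate depends on the bucket boundaries as much as on the data. Buckets that are wide around the alert threshold make the estimate coarse.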
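**Error rate percentage:**
```promql
100 * sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="rdev-api"}[5m]))
```

Both sides are aggregated with `sum()` because dividing the raw series matches each 5xx series only against itself, which always yields 100%.
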
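
The arithmetic behind an overall error-rate percentage can be sketched as follows: compute a per-second rate for every `http_requests_total` series, sum the rates whose `status` matches `5..`, and divide by the sum over all series. This is a simplified model, assuming two samples per series and no counter resets (the real `rate()` also handles resets and extrapolation). The series values are hypothetical.

```python
import re


def counter_rate(first, last, window_seconds):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / window_seconds


def error_rate_percent(series, window_seconds):
    """series maps (method, path, status) -> (first_value, last_value)."""
    total = 0.0
    errors = 0.0
    for (_method, _path, status), (first, last) in series.items():
        rate = counter_rate(first, last, window_seconds)
        total += rate
        if re.fullmatch(r"5..", status):  # same matcher as status=~"5.."
            errors += rate
    return 100 * errors / total if total else 0.0


# Hypothetical counter values at the start and end of a 5-minute window.
SERIES = {
    ("GET", "/projects", "200"): (1000, 1570),
    ("POST", "/projects/test/claude", "201"): (200, 227),
    ("POST", "/projects/test/claude", "500"): (10, 13),
}

if __name__ == "__main__":
    print(error_rate_percent(SERIES, window_seconds=300))
```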
**Command execution rate:**
```promql
rate(rdev_commands_total{job="rdev-api"}[5m])
```

**Average command duration:**
```promql
rate(rdev_command_duration_seconds_sum[5m])
  / rate(rdev_command_duration_seconds_count[5m])
```

## Alerting

### PrometheusRule

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rdev-api-alerts
  namespace: rdev
spec:
  groups:
    - name: rdev-api
      rules:
        # Aggregate both sides with sum(); dividing raw series would match
        # each 5xx series only against itself and always yield 1.
        - alert: RdevAPIHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="rdev-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="rdev-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "rdev API error rate > 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"

        # sum by (le) gives one overall p99 instead of one per label set.
        - alert: RdevAPIHighLatency
          expr: |
            histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="rdev-api"}[5m]))) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "rdev API p99 latency > 2s"
            description: "P99 latency is {{ $value | humanizeDuration }}"

        - alert: RdevAPIPodDown
          expr: up{job="rdev-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "rdev API pod is down"

        - alert: RdevAPIHighCommandQueue
          expr: rdev_commands_active > 4
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active commands"
            description: "{{ $value }} commands currently running"

        - alert: RdevAPIHighRateLimit
          expr: |
            rate(rdev_ratelimit_rejected_total[5m])
              / rate(rdev_ratelimit_requests_total[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High rate limit rejection rate"
```

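
The `for:` field delays firing until the expression has been true at every evaluation for the given duration; until then the alert is pending, and a single false evaluation resets it. The Python sketch below models that state machine for `RdevAPIHighCommandQueue` (`rdev_commands_active > 4`, `for: 5m`). The 60-second evaluation interval and the sample values are assumptions for illustration.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each evaluation.

    samples: list of (timestamp_seconds, value), ordered by time.
    """
    pending_since = None
    states = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition became true at this evaluation
            held = ts - pending_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            pending_since = None  # any false evaluation resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Hypothetical rdev_commands_active values, one evaluation every 60 s.
SAMPLES = [(0, 3), (60, 5), (120, 5), (180, 5), (240, 5), (300, 5), (360, 5), (420, 2)]

if __name__ == "__main__":
    for ts, state in alert_states(SAMPLES, threshold=4, for_seconds=300):
        print(ts, state)
```

A short spike above the threshold therefore never pages anyone, at the cost of a detection delay equal to the `for:` duration.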
## Logging

### Log Format

rdev uses structured JSON logging:

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "request_id": "req-abc123",
  "method": "POST",
  "path": "/projects/test/claude",
  "status": 201,
  "duration_ms": 45,
  "client_ip": "10.0.0.1"
}
```

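
A companion script or test harness that should log in the same shape can reuse the field names above. The following is a sketch using Python's standard `logging` module, not rdev's own logger; the `JSONFormatter` class and the `rdev-tool` logger name are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Request fields copied from the record when the caller supplied them.
    FIELDS = ("request_id", "method", "path", "status", "duration_ms", "client_ip")
    # Python names the level WARNING; the log format above uses "warn".
    LEVELS = {"WARNING": "warn"}

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "level": self.LEVELS.get(record.levelname, record.levelname.lower()),
            "time": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "msg": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    logger = logging.getLogger("rdev-tool")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "request completed",
        extra={"request_id": "req-abc123", "method": "POST", "status": 201},
    )
```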
### Log Levels

| Level | Description |
|-------|-------------|
| `debug` | Detailed debugging info |
| `info` | Normal operations |
| `warn` | Potential issues |
| `error` | Errors requiring attention |

### Loki/Promtail

```yaml
# promtail config
scrape_configs:
  - job_name: rdev-api
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rdev-api
        action: keep
      # Copy the pod's app label onto the stream so {app="rdev-api"} selectors match.
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
            path: path
            status: status
      - labels:
          level:
          path:
```

### LogQL Queries

**Errors (with the query time range set to the last hour):**
```logql
{app="rdev-api"} | json | level="error"
```

**Slow requests:**
```logql
{app="rdev-api"} | json | duration_ms > 1000
```

**Requests by status:**
```logql
sum by (status) (count_over_time({app="rdev-api"} | json [1h]))
```

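
When Loki is not available, for example while checking output from `kubectl logs`, the same questions can be answered offline. This Python sketch mirrors the slow-request filter and the count by status, assuming one JSON object per line in the log format described earlier. The sample lines are hypothetical.

```python
import json
from collections import Counter


def parse_lines(lines):
    """Return one dict per valid JSON log line, skipping anything else."""
    entries = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a plain-text line or a stack trace
        if isinstance(entry, dict):
            entries.append(entry)
    return entries


def slow_requests(entries, threshold_ms=1000):
    """Entries whose duration_ms exceeds the threshold."""
    return [e for e in entries if e.get("duration_ms", 0) > threshold_ms]


def count_by_status(entries):
    """Number of entries per HTTP status."""
    return Counter(e["status"] for e in entries if "status" in e)


# Hypothetical log lines in the format shown in this guide.
LINES = [
    '{"level": "info", "msg": "request completed", "status": 201, "duration_ms": 45}',
    '{"level": "info", "msg": "request completed", "status": 200, "duration_ms": 1800}',
    '{"level": "error", "msg": "request failed", "status": 500, "duration_ms": 30}',
    "not json",
]

if __name__ == "__main__":
    entries = parse_lines(LINES)
    print(slow_requests(entries))
    print(count_by_status(entries))
```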
## Health Checks

### Liveness

```bash
curl http://rdev-api:8080/health
# Returns 200 if process is alive
```

### Readiness

```bash
curl http://rdev-api:8080/ready
# Returns 200 if ready to serve traffic
# Checks: database connectivity, K8s API access
```

Response:
```json
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "kubernetes": {
      "status": "healthy",
      "latency_ms": 12
    }
  }
}
```
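
A deployment script or smoke test can interpret the readiness response programmatically. The sketch below treats the service as ready only when the HTTP status is 200 and every entry under `checks` reports `"healthy"`. The shape of a failing response (here `"unhealthy"` with a non-200 status) is an assumption, since this guide only shows the healthy case.

```python
import json


def failing_checks(body):
    """Return the sorted names of checks whose status is not "healthy"."""
    payload = json.loads(body)
    return sorted(
        name
        for name, check in payload.get("checks", {}).items()
        if check.get("status") != "healthy"
    )


def is_ready(http_status, body):
    """Ready means HTTP 200 and no failing checks."""
    return http_status == 200 and not failing_checks(body)


# The healthy response from this guide, and a hypothetical failing one.
HEALTHY = json.dumps({
    "status": "healthy",
    "checks": {
        "database": {"status": "healthy", "latency_ms": 5},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
})
DEGRADED = json.dumps({
    "status": "unhealthy",
    "checks": {
        "database": {"status": "unhealthy"},
        "kubernetes": {"status": "healthy", "latency_ms": 12},
    },
})

if __name__ == "__main__":
    print(is_ready(200, HEALTHY))
    print(failing_checks(DEGRADED))
```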