jordan 72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry

Major refactoring to hexagonal (ports & adapters) architecture:

- Add service layer (apikey_service, project_service) for business logic
- Add webhook system with dispatcher and delivery tracking
- Add command queue with priority-based processing
- Add rate limiting with sliding window algorithm
- Add audit logging for command execution
- Add OpenTelemetry integration (traces, metrics, spans)
- Add circuit breaker for fault tolerance
- Add cached repository wrapper for performance
- Add comprehensive validation package
- Add Kubernetes client integration for pod management
- Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks)
- Add network policy and PodDisruptionBudget for k8s
- Remove legacy executor and projects/registry packages
- Untrack secrets.yaml (now managed via envault)
- Add coverage.out to .gitignore
- Add e2e test infrastructure with docker-compose
- Add comprehensive documentation (API, architecture, operations, plans)
- Add golangci-lint config and pre-commit hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-25 19:57:46 -07:00

3.4 KiB


Runbook: Authentication Failures

Alert

RdevAPIAuthFailures: High rate of authentication failures

Impact

Legitimate users may be unable to access the API
Potential security incident (brute force)
Service degradation

Investigation

1. Confirm the Issue

# Check auth failure metrics
curl -s http://rdev-api:8080/metrics | grep auth_failures

# Check auth logs
kubectl -n rdev logs -l app=rdev-api --since=10m | grep -E "(UNAUTHORIZED|KEY_REVOKED|KEY_EXPIRED|IP_NOT_ALLOWED)"

2. Identify Failure Type

# Count by failure reason
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep -oE '"code":"[^"]+' | sort | uniq -c | sort -rn

Common reasons:

UNAUTHORIZED - Invalid or missing key
KEY_REVOKED - Key was revoked
KEY_EXPIRED - Key has expired
IP_NOT_ALLOWED - IP not in allowlist
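To pick the matching remediation section quickly, the per-reason count can be reduced to the single most frequent failure code. This is a minimal sketch, assuming the API writes JSON log lines with a `"code":"..."` field (as the grep patterns in this runbook imply). `dominant_auth_failure` is a hypothetical helper name, not something shipped with rdev.

```shell
#!/bin/sh
# Reads API logs on stdin and prints the most frequent auth failure code.
# Assumption: log lines are JSON and contain a field like "code":"KEY_EXPIRED".
dominant_auth_failure() {
  grep -oE '"code":"(UNAUTHORIZED|KEY_REVOKED|KEY_EXPIRED|IP_NOT_ALLOWED)"' |
    cut -d'"' -f4 |                 # keep only the code value
    sort | uniq -c | sort -rn |     # count occurrences, highest first
    awk 'NR == 1 { print $2 }'      # print the top code, nothing if no match
}
```

Possible usage: `kubectl -n rdev logs -l app=rdev-api --since=10m | dominant_auth_failure`. The function prints nothing when no known failure code appears in the input.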

3. Check for Attack Patterns

# Check unique IPs making failed requests
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep UNAUTHORIZED | grep -oE '"client_ip":"[^"]+' | sort | uniq -c | sort -rn

# Check request patterns
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep UNAUTHORIZED | grep -oE '"path":"[^"]+' | sort | uniq -c | sort -rn
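A single IP responsible for most failures suggests brute force, while failures spread across many IPs point more toward a key or configuration problem. The sketch below computes the share of UNAUTHORIZED responses attributable to the top client IP. It assumes JSON log lines with a `"client_ip":"..."` field, as in the commands above; `top_failing_ip` is a hypothetical helper name, and what share counts as suspicious is left to the operator.

```shell
#!/bin/sh
# Prints "<ip> <failures> <percent>" for the client IP with the most
# UNAUTHORIZED responses in the logs read from stdin.
# Assumption: log lines are JSON and contain "client_ip":"<address>".
top_failing_ip() {
  grep UNAUTHORIZED |
    grep -oE '"client_ip":"[^"]+' |
    cut -d'"' -f4 |
    sort | uniq -c | sort -rn |
    awk '{ total += $1; if (NR == 1) { ip = $2; n = $1 } }
         END { if (total > 0) printf "%s %d %d\n", ip, n, (100 * n) / total }'
}
```

Possible usage: `kubectl -n rdev logs -l app=rdev-api --since=10m | top_failing_ip`. The percentage is truncated to a whole number.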

Remediation

If Keys Are Invalid (UNAUTHORIZED)

Verify keys exist in database:

# Open a shell in the API pod, then run the query from inside it
kubectl -n rdev exec -it deployment/rdev-api -- sh
psql "$DATABASE_URL" -c "SELECT id, name, key_prefix, revoked_at FROM api_keys;"

Help users create new keys if needed
If brute force is detected:
- Block offending IPs at ingress level
- Tighten rate limits

If Keys Are Revoked (KEY_REVOKED)

Check who revoked and when:

SELECT id, name, revoked_at, revoked_by FROM api_keys WHERE revoked_at IS NOT NULL;

Determine if revocation was intentional
Issue new keys to affected users whose access is still legitimate

If Keys Are Expired (KEY_EXPIRED)

Check which keys expired:

SELECT id, name, expires_at FROM api_keys WHERE expires_at < NOW();

Issue new keys to affected users
Consider extending default expiration if too short

If IP Not Allowed (IP_NOT_ALLOWED)

Check which keys have IP restrictions:

SELECT id, name, allowed_ips FROM api_keys WHERE allowed_ips IS NOT NULL;

Verify client IPs match allowlist
Update the allowlist if legitimate client IPs changed, for example:
- Cloud provider IP ranges changed
- User moved networks
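Verifying a client IP against an allowlist entry can be done by hand with a small CIDR check. This is a sketch only: it assumes `allowed_ips` holds IPv4 addresses or CIDR ranges (the column's actual format is not stated in this runbook), and `ip_to_int` and `ip_in_cidr` are hypothetical helper names. IPv6 is not handled.

```shell
#!/bin/sh
# Converts a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Returns success (exit 0) when IPv4 address $1 falls inside CIDR $2.
# A bare address with no /prefix is treated as a /32.
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  [ "$bits" = "$2" ] && bits=32
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

Possible usage: `ip_in_cidr 10.1.2.3 10.0.0.0/8 && echo allowed`, with the client IP taken from the failing log line and the range taken from the `allowed_ips` query result.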

If Under Attack

Immediate: Block at ingress

# Add to ingress annotations (only the listed source ranges are allowed; all other IPs are rejected)
nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.0.0/16"

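Short-term: Tighten rate limits (lower the allowed requests per second)

# Changing the env var triggers a rolling restart of the deployment
kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=2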

Long-term:
- Implement IP-based blocking
- Add fail2ban-style lockout
- Review API key issuance process

Verification

# Check auth success rate
curl -s http://rdev-api:8080/metrics | grep -E "auth_(requests|failures)"

# Test authentication
curl -H "X-API-Key: $VALID_KEY" http://rdev-api:8080/projects

# Check logs for successful auths
kubectl -n rdev logs -l app=rdev-api --since=5m | grep "request completed" | head -5
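To turn the raw counters into a single number that can be compared before and after remediation, the metrics output can be reduced to a failure percentage. This is a sketch under assumptions: the metric names `auth_requests_total` and `auth_failures_total` are hypothetical (the runbook only greps for `auth_(requests|failures)`), samples are assumed to carry no trailing timestamp, and `auth_failure_percent` is an illustrative helper name.

```shell
#!/bin/sh
# Prints the auth failure percentage from Prometheus text-format metrics on stdin.
# Assumption: metrics are named auth_requests_total and auth_failures_total
# (hypothetical names), possibly with labels, and have no trailing timestamp.
auth_failure_percent() {
  awk '
    /^#/ { next }                                   # skip HELP and TYPE lines
    $1 ~ /^auth_requests_total/ { req  += $NF }     # sum across label sets
    $1 ~ /^auth_failures_total/ { fail += $NF }
    END {
      if (req > 0) printf "%.1f\n", 100 * fail / req
      else print "n/a"
    }'
}
```

Possible usage: `curl -s http://rdev-api:8080/metrics | auth_failure_percent`. Because the counters are cumulative, the value reflects the lifetime of the process; comparing two readings taken a few minutes apart gives a better picture of recovery.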

Post-Incident

Review auth failure patterns
Update IP allowlists if needed
Communicate with affected users
Consider additional security measures:
- API key rotation policy
- Automated key expiration alerts
- IP-based anomaly detection
