rdev/docs/operations/runbooks/auth-failures.md
jordan a9ad3d8304
2026-02-10 23:16:56 -07:00

Runbook: Authentication Failures

Alert

RdevAPIAuthFailures: High rate of authentication failures

Impact

  • Legitimate users unable to access API
  • Potential security incident (brute force)
  • Service degradation

Investigation

1. Confirm the Issue

# Check auth failure metrics
curl -s http://rdev-api:8080/metrics | grep auth_failures

# Check auth logs
kubectl -n rdev logs -l app=rdev-api --since=10m | grep -E "(UNAUTHORIZED|KEY_REVOKED|KEY_EXPIRED|IP_NOT_ALLOWED)"
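To put a number on the failure rate, the request and failure counters can be compared directly. The following is a minimal sketch, assuming the metrics are Prometheus counters named `auth_requests_total` and `auth_failures_total`; both names are hypothetical, so check the actual `/metrics` output and adjust them.

```shell
# Sum every sample of one counter (across label sets) from Prometheus
# text-format metrics on stdin. Assumes no timestamps follow the value.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      line = $0
      sub(/[{][^}]*[}]/, "", line)     # drop the {label="..."} block, if any
      n = split(line, f, " ")
      if (n >= 2 && f[1] == name) total += f[2]
    }
    END { printf "%d\n", total }
  '
}

# Print the auth failure percentage for metrics text on stdin.
# The two metric names below are assumptions, not confirmed rdev names.
auth_failure_pct() {
  metrics=$(cat)
  requests=$(printf '%s\n' "$metrics" | sum_metric auth_requests_total)
  failures=$(printf '%s\n' "$metrics" | sum_metric auth_failures_total)
  if [ "$requests" -eq 0 ]; then
    echo "n/a"
  else
    echo $(( failures * 100 / requests ))
  fi
}
```

Usage: `curl -s http://rdev-api:8080/metrics | auth_failure_pct`. The result is an integer percentage over the lifetime of the counters, so it is most useful when compared between two points in time.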

2. Identify Failure Type

# Count by failure reason
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep -oE '"code":"[^"]+' | sort | uniq -c | sort -rn

Common reasons:

  • UNAUTHORIZED - Invalid or missing key
  • KEY_REVOKED - Key was revoked
  • KEY_EXPIRED - Key has expired
  • IP_NOT_ALLOWED - IP not in allowlist

3. Check for Attack Patterns

# Check unique IPs making failed requests
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep UNAUTHORIZED | grep -oE '"client_ip":"[^"]+' | sort | uniq -c | sort -rn

# Check request patterns
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep UNAUTHORIZED | grep -oE '"path":"[^"]+' | sort | uniq -c | sort -rn
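To separate a few noisy sources from background failures, the per-IP counts above can be filtered by a threshold. This is a sketch only: `noisy_ips` is a hypothetical helper, and the default threshold of 20 failures is an arbitrary illustrative value rather than a recommended limit.

```shell
# Read JSON log lines on stdin and print "count ip" for every client IP
# whose UNAUTHORIZED count is at or above the threshold.
# Usage: noisy_ips [THRESHOLD]   (default 20, an illustrative value)
noisy_ips() {
  threshold="${1:-20}"
  grep UNAUTHORIZED \
    | grep -oE '"client_ip":"[^"]+' \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk -v t="$threshold" '$1 >= t { print $1, $2 }'
}
```

Usage: `kubectl -n rdev logs -l app=rdev-api --since=10m | noisy_ips 50`. Many failures from one IP suggest a misconfigured client or a brute-force attempt. Failures spread thinly across many IPs point to a distributed source.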

Remediation

If Keys Are Invalid (UNAUTHORIZED)

  1. Verify keys exist in database:

    # Open a shell in the API pod, then run the query from inside it
    kubectl -n rdev exec -it deployment/rdev-api -- sh
    psql "$DATABASE_URL" -c "SELECT id, name, key_prefix, revoked_at FROM api_keys;"
    
  2. Help users create new keys if needed

  3. If brute force detected:

    • Block offending IPs at ingress level
    • Tighten rate limiting (lower the allowed request rate)

If Keys Are Revoked (KEY_REVOKED)

  1. Check who revoked and when:

    SELECT id, name, revoked_at, revoked_by FROM api_keys WHERE revoked_at IS NOT NULL;
    
  2. Determine if revocation was intentional

  3. If the affected users are legitimate, issue them new keys

If Keys Are Expired (KEY_EXPIRED)

  1. Check which keys expired:

    SELECT id, name, expires_at FROM api_keys WHERE expires_at < NOW();
    
  2. Issue new keys to affected users

  3. Consider extending default expiration if too short

If IP Not Allowed (IP_NOT_ALLOWED)

  1. Check which keys have IP restrictions:

    SELECT id, name, allowed_ips FROM api_keys WHERE allowed_ips IS NOT NULL;
    
  2. Verify client IPs match allowlist

  3. Update allowlist if legitimate IPs changed:

    • Cloud provider IP ranges change
    • User moved networks
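Checking a client IP against an allowlist entry by eye is error-prone for anything other than /8, /16, or /24 ranges. The sketch below tests membership with integer arithmetic. It assumes the `allowed_ips` entries are IPv4 addresses or CIDR strings, and the storage format is not confirmed here. `ip_to_int` and `ip_in_cidr` are hypothetical helper names, and IPv6 is not handled.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1                 # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if IPv4 address $1 falls inside CIDR $2 (for example 10.0.0.0/8).
ip_in_cidr() {
  case "$2" in
    */*) cidr=$2 ;;
    *)   cidr="$2/32" ;;    # bare address: treat as a single host
  esac
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${cidr%/*}")
  bits=${cidr#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

Usage: `ip_in_cidr 203.0.113.7 203.0.113.0/24 && echo allowed`. Take the client IP from the `client_ip` field of the failing log line and the range from the key's `allowed_ips` value.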

If Under Attack

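  1. Immediate: Restrict access at ingress with a Traefik ipAllowList Middleware. It admits only the listed source ranges, so offending IPs are blocked by being left out of the list:

    # Traefik ipAllowList Middleware CRD (only these source ranges are admitted):
    # apiVersion: traefik.io/v1alpha1
    # kind: Middleware
    # metadata:
    #   name: internal-only
    # spec:
    #   ipAllowList:
    #     sourceRange:
    #       - "10.0.0.0/8"
    #       - "192.168.0.0/16"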
    
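  2. Short-term: Tighten rate limiting by lowering the allowed requests per second (changing the env var triggers a rolling restart of the deployment)

    kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=2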
    
  3. Long-term:

    • Implement IP-based blocking
    • Add fail2ban-style lockout
    • Review API key issuance process
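During an incident the allowed ranges for the Middleware in step 1 often need changing quickly. A small generator keeps the manifest consistent. This is a sketch: `render_allowlist` is a hypothetical helper, and placing the Middleware in the `rdev` namespace is an assumption.

```shell
# Print a Traefik ipAllowList Middleware manifest for a name and CIDR list.
# Usage: render_allowlist NAME CIDR [CIDR...]
render_allowlist() {
  name="$1"; shift
  cat <<EOF
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ${name}
spec:
  ipAllowList:
    sourceRange:
EOF
  # One quoted list entry per CIDR argument
  for cidr in "$@"; do
    printf '      - "%s"\n' "$cidr"
  done
}
```

Usage: `render_allowlist internal-only 10.0.0.0/8 192.168.0.0/16 | kubectl -n rdev apply -f -`. A Middleware has no effect until a route references it. With Kubernetes Ingress resources this is done through the `traefik.ingress.kubernetes.io/router.middlewares` annotation, whose value takes the form `<namespace>-<name>@kubernetescrd`.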

Verification

# Compare auth request and failure counters (failures should stop climbing)
curl -s http://rdev-api:8080/metrics | grep -E "auth_(requests|failures)"

# Test authentication
curl -H "X-API-Key: $VALID_KEY" http://rdev-api:8080/projects

# Check logs for successful auths
kubectl -n rdev logs -l app=rdev-api --since=5m | grep "request completed" | head -5

Post-Incident

  1. Review auth failure patterns
  2. Update IP allowlists if needed
  3. Communicate with affected users
  4. Consider additional security measures:
    • API key rotation policy
    • Automated key expiration alerts
    • IP-based anomaly detection