# Runbook: Authentication Failures

## Alert

**RdevAPIAuthFailures**: High rate of authentication failures

## Impact

- Legitimate users unable to access API
- Potential security incident (brute force)
- Service degradation

## Investigation

### 1. Confirm the Issue

```bash
# Check auth failure metrics
curl -s http://rdev-api:8080/metrics | grep auth_failures

# Check auth logs
kubectl -n rdev logs -l app=rdev-api --since=10m | grep -E "(UNAUTHORIZED|KEY_REVOKED|KEY_EXPIRED|IP_NOT_ALLOWED)"
```

### 2. Identify Failure Type

```bash
# Count by failure reason
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep -oE '"code":"[^"]+' | sort | uniq -c | sort -rn
```

Common reasons:
- `UNAUTHORIZED` - Invalid or missing key
- `KEY_REVOKED` - Key was revoked
- `KEY_EXPIRED` - Key has expired
- `IP_NOT_ALLOWED` - IP not in allowlist
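
The count in step 2 picks up every `"code"` field, including errors unrelated to authentication. The following is a minimal sketch of a helper that tallies only the four reasons listed above. It assumes the logs are JSON lines carrying a `"code":"<REASON>"` field, as the grep implies; `count_auth_failures` is a hypothetical name and not part of rdev.

```bash
# Tally only the four auth failure reasons from JSON log lines on stdin.
# Assumption: each line carries a "code":"<REASON>" field.
# Prints one "REASON COUNT" pair per line, highest count first.
count_auth_failures() {
  grep -oE '"code":"(UNAUTHORIZED|KEY_REVOKED|KEY_EXPIRED|IP_NOT_ALLOWED)"' \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk '{ printf "%s %d\n", $2, $1 }'
}
```

It can be fed from the same log stream, for example `kubectl -n rdev logs -l app=rdev-api --since=10m | count_auth_failures`. The reason with the highest count points to the matching Remediation section.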

### 3. Check for Attack Patterns

```bash
# Check unique IPs making failed requests
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep UNAUTHORIZED | grep -oE '"client_ip":"[^"]+' | sort | uniq -c | sort -rn

# Check request patterns
kubectl -n rdev logs -l app=rdev-api --since=10m | \
  grep UNAUTHORIZED | grep -oE '"path":"[^"]+' | sort | uniq -c | sort -rn
```
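
To turn the per-IP counts into a short list of suspects, the same pipeline can be wrapped with a threshold. This is a sketch under the same log-format assumptions as the commands above (`"code"` and `"client_ip"` fields); `flag_noisy_ips` is a hypothetical name and the default threshold of 20 failures is an arbitrary placeholder, not a tuned value.

```bash
# Print client IPs with at least MIN UNAUTHORIZED failures, most active first.
# Reads JSON log lines on stdin; MIN is the first argument (placeholder default: 20).
flag_noisy_ips() {
  local min="${1:-20}"
  grep UNAUTHORIZED \
    | grep -oE '"client_ip":"[^"]+' \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk -v min="$min" '$1 >= min { printf "%s %d\n", $2, $1 }'
}
```

For example, `kubectl -n rdev logs -l app=rdev-api --since=10m | flag_noisy_ips 50` lists only the IPs with 50 or more failures in the window. Many failures from a few IPs suggests brute force; a thin spread across many IPs is more likely a client misconfiguration or a rotated key.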

## Remediation

### If Keys Are Invalid (UNAUTHORIZED)

1. Verify keys exist in database:
   ```bash
   # Open a shell in the API pod, then run the query from inside it
   kubectl -n rdev exec -it deployment/rdev-api -- sh
   psql "$DATABASE_URL" -c "SELECT id, name, key_prefix, revoked_at FROM api_keys;"
   ```

2. Help users create new keys if needed

3. If brute force detected:
   - Block offending IPs at ingress level
   - Tighten rate limiting

### If Keys Are Revoked (KEY_REVOKED)

1. Check who revoked and when:
   ```sql
   SELECT id, name, revoked_at, revoked_by FROM api_keys WHERE revoked_at IS NOT NULL;
   ```

2. Determine if revocation was intentional

3. Issue new keys to affected users if legitimate

### If Keys Are Expired (KEY_EXPIRED)

1. Check which keys expired:
   ```sql
   SELECT id, name, expires_at FROM api_keys WHERE expires_at < NOW();
   ```

2. Issue new keys to affected users

3. Consider extending default expiration if too short

### If IP Not Allowed (IP_NOT_ALLOWED)

1. Check which keys have IP restrictions:
   ```sql
   SELECT id, name, allowed_ips FROM api_keys WHERE allowed_ips IS NOT NULL;
   ```

2. Verify client IPs match allowlist

3. Update allowlist if legitimate IPs changed:
   - Cloud provider IP ranges changed
   - User moved networks
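
Step 2 can be checked by hand, but a small helper avoids mistakes with CIDR arithmetic. The sketch below assumes `allowed_ips` entries are IPv4 addresses or IPv4 CIDR blocks; the actual column format is not specified in this runbook, and IPv6 is not handled. `ip_to_int` and `ip_in_cidr` are hypothetical helper names.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1   # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return 0 if IPv4 address $1 is inside $2, where $2 is a CIDR block
# (e.g. 10.0.0.0/8) or a single address (treated as /32).
ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits=32 mask
  case "$2" in */*) bits="${2#*/}" ;; esac
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  fi
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

For example, `ip_in_cidr 10.1.2.3 10.0.0.0/8 && echo allowed` prints `allowed`, while an address outside the block makes the function return non-zero. Take the client IP from the `client_ip` log field and the block from the `allowed_ips` query above.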

### If Under Attack

1. **Immediate**: Block at ingress using Traefik ipAllowList Middleware
   ```yaml
   # Traefik ipAllowList Middleware CRD: only the listed source ranges are allowed.
   # The Middleware must be attached to the rdev-api route to take effect.
   apiVersion: traefik.io/v1alpha1
   kind: Middleware
   metadata:
     name: internal-only
   spec:
     ipAllowList:
       sourceRange:
         - "10.0.0.0/8"
         - "192.168.0.0/16"
   ```

2. **Short-term**: Tighten rate limits (lower the allowed requests per second)
   ```bash
   kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=2
   ```

3. **Long-term**:
   - Implement IP-based blocking
   - Add fail2ban-style lockout
   - Review API key issuance process

## Verification

```bash
# Check auth success rate
curl -s http://rdev-api:8080/metrics | grep -E "auth_(requests|failures)"

# Test authentication
curl -H "X-API-Key: $VALID_KEY" http://rdev-api:8080/projects

# Check logs for successful auths
kubectl -n rdev logs -l app=rdev-api --since=5m | grep "request completed" | head -5
```
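
The first check above prints raw counters; a failure percentage is easier to compare before and after remediation. The sketch below sums counters from Prometheus text-format output on stdin. The exact metric names exposed by rdev-api are not given in this runbook, so the default substrings `auth_failures` and `auth_requests` simply mirror the grep pattern above and are an assumption. `auth_failure_pct` is a hypothetical name, and lines are assumed to end with the sample value (no trailing timestamp).

```bash
# Print the auth failure percentage from Prometheus text-format metrics on stdin.
# $1 / $2: substrings of the failure and request counter names (assumed defaults).
auth_failure_pct() {
  awk -v fail="${1:-auth_failures}" -v req="${2:-auth_requests}" '
    /^#/ { next }   # skip HELP and TYPE comment lines
    {
      # keep the metric name, drop any {label} set
      i = index($1, "{")
      name = i ? substr($1, 1, i - 1) : $1
      if (index(name, fail)) failures += $NF
      else if (index(name, req)) requests += $NF
    }
    END {
      if (requests > 0) printf "%.1f\n", 100 * failures / requests
      else print "n/a"
    }'
}
```

For example, `curl -s http://rdev-api:8080/metrics | auth_failure_pct` prints a single percentage, or `n/a` if no request counter was found. Because the counters are cumulative since process start, compare two readings taken a few minutes apart instead of relying on one absolute value.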

## Post-Incident

1. Review auth failure patterns
2. Update IP allowlists if needed
3. Communicate with affected users
4. Consider additional security measures:
   - API key rotation policy
   - Automated key expiration alerts
   - IP-based anomaly detection