# Runbook: Circuit Breaker Stuck

## Symptom

- Agent getting 429 "Too Many Requests" responses
- Dashboard shows circuit breaker in "OPEN" state
- Legitimate agent unable to submit assertions
- Circuit breaker won't transition to "HALF_OPEN" or "CLOSED"

**Metrics Alerts:**
- `stemedb_circuit_breaker_state{state="OPEN"}` > 0 for >1 hour
- `stemedb_requests_rejected_total{reason="circuit_breaker"}` increasing

**Response Headers:**
```
HTTP/1.1 429 Too Many Requests
x-circuit-breaker-state: OPEN
retry-after: 3600
```

---

## Quick Diagnosis

```
Circuit breaker stuck
│
├─► Check: curl .../admin/circuit_breakers | jq '.circuit_breakers[] | select(.state=="OPEN")'
│   └─► Agent banned? → §1 Manual Reset
│
├─► Check: When was circuit breaker opened?
│   └─► >1 hour ago but still OPEN? → §2 Stuck in OPEN
│
├─► Check: Agent repeatedly failing?
│   └─► Automatic ban due to failures → §3 Legitimate Ban
│
└─► Check: Circuit breaker in HALF_OPEN but requests still failing?
    └─► Stuck in HALF_OPEN loop → §4 HALF_OPEN Loop
```

---

## Common Causes

1. **Manual ban not reset** - Likelihood: **40%**
   - Admin manually opened circuit breaker
   - Forgot to reset after issue resolved
   - No automatic timeout configured

2. **Automatic ban due to high failure rate** - Likelihood: **30%**
   - Agent submitting low-quality assertions (quarantined)
   - Agent hitting rate limits
   - Agent violating content defense rules

3. **Circuit breaker timeout too long** - Likelihood: **15%**
   - Default timeout (1 hour) too conservative
   - Agent blocked longer than needed
   - No process to review stuck breakers

4. **HALF_OPEN loop (test requests failing)** - Likelihood: **15%**
   - Agent still misconfigured
   - Content defense still rejecting
   - Circuit breaker testing with same bad requests

---

## Circuit Breaker State Machine

```
CLOSED (normal)
│
├─► Failure rate >30% over 5 min
│   └─► OPEN (banned)
│       │
│       ├─► Wait timeout (default: 1 hour)
│       │   └─► HALF_OPEN (testing)
│       │       │
│       │       ├─► Test requests succeed
│       │       │   └─► CLOSED (restored)
│       │       │
│       │       └─► Test requests fail
│       │           └─► OPEN (banned again)
│       │
│       └─► Manual reset
│           └─► HALF_OPEN or CLOSED
```

---

## Resolution Steps

### §1. Manual Reset (Intended Ban)

**Diagnostic:**

```bash
# List all circuit breakers in OPEN state
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN")'

# Expected output:
# {
#   "agent_id": "8f3a2b1c...",
#   "state": "OPEN",
#   "opened_at": "2026-02-11T09:00:00Z",
#   "reason": "flooding_quarantine",
#   "failure_count": 487,
#   "timeout_until": "2026-02-11T10:00:00Z"
# }

# Check if ban was manual
journalctl -u stemedb-api | grep "circuit_breaker.*manual"
```

**Resolution: Manual reset**

⚠️ **WARNING:** Reset only if you are confident the agent's issue is resolved. Otherwise the breaker will immediately re-open.

```bash
# Get agent ID
AGENT_ID="8f3a2b1c..."
# Check current state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID

# Option 1: Reset to HALF_OPEN (conservative - test first)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "issue_resolved"}'

# Expected response:
# {"status": "reset", "agent_id": "8f3a2b1c...", "state": "HALF_OPEN"}

# Wait for agent to submit test assertion
# If succeeds → Transitions to CLOSED
# If fails → Returns to OPEN

# Option 2: Reset to CLOSED (aggressive - trust immediately)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "CLOSED", "reason": "false_positive"}'

# Verify state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
# Should return: "CLOSED" or "HALF_OPEN"
```

**Test agent access:**

```bash
# Submit test assertion from agent
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{
    "concept_path": "test/circuit_breaker",
    "predicate": "reset_test",
    "value": true,
    "confidence": 0.9
  }'

# Should return: 201 Created (not 429)
```

**If failed:** Reset to HALF_OPEN but immediately returns to OPEN → Agent still submitting bad requests. Fix agent first.

---

### §2. Stuck in OPEN (Timeout Not Expiring)

**Diagnostic:**

```bash
# Check timeout expiry
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN") | {agent_id, timeout_until, now: (now | todate)}'

# If timeout_until is in the past but still OPEN → Bug or manual ban with no timeout

# Check for manual ban
journalctl -u stemedb-api | grep "circuit_breaker.*$AGENT_ID"
```

**Resolution: Force reset**

```bash
# Force transition to HALF_OPEN
AGENT_ID="stuck-agent-id"
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "timeout_expired", "force": true}'

# Monitor transition
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
# Should transition: OPEN → HALF_OPEN → CLOSED (after test request)
```

**If failed:** Force reset doesn't work → Potential bug. Escalate to engineering. Workaround: Restart server (resets all circuit breakers to CLOSED).

---

### §3. Legitimate Ban (Agent Still Misbehaving)

**Diagnostic:**

```bash
# Check why agent was banned
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '{reason, failure_count, failure_rate}'

# Check recent quarantine items from this agent
# (URL is quoted so the shell does not treat "?" as a glob character)
curl "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq '.items[0:5]'

# Check agent's recent assertion history
curl http://localhost:18180/metrics | grep "stemedb_ingest_rejected_total.*$AGENT_ID"
```

**Resolution: Fix agent, then reset**

**Step 1: Identify agent issue**

Common issues:
- Submitting duplicate assertions (same concept_path/predicate repeatedly)
- Low-quality data (confidence too high for source authority)
- Malformed payloads
- Rate limiting (>1K assertions/min)

**Step 2: Contact agent operator**

```bash
# Get agent contact info (if available)
curl http://localhost:18180/v1/admin/agents/$AGENT_ID | jq '.contact'

# Or check agent metadata
curl http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "agent/'$AGENT_ID'/metadata", "lens": "recency"}'
```

**Step 3: Test fix**

```bash
# After agent operator claims fix, reset to HALF_OPEN
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "agent_fixed"}'

# Agent submits test assertion
# Monitor for success/failure
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
```

**If failed:** Agent still misbehaving after "fix" → Keep banned. Agent must resolve issue before reset.

---

### §4. HALF_OPEN Loop (Test Requests Failing)

**Diagnostic:**

```bash
# Check how many times circuit breaker has cycled HALF_OPEN → OPEN
curl http://localhost:18180/metrics | grep "circuit_breaker_transitions.*$AGENT_ID"

# If count >5 in last hour → Loop detected

# Check test request failures
journalctl -u stemedb-api | grep "circuit_breaker.*half_open_test.*$AGENT_ID"
```

**Resolution: Relax test threshold**

⚠️ **NOTE:** Default: Circuit breaker tests with 5 requests. If 3+ succeed, transitions to CLOSED. If 3+ fail, returns to OPEN.

```bash
# Temporarily relax test threshold (requires restart)
# If stemedb-api runs under systemd, an export in your shell does not reach the
# service. Set the variable in the unit's environment, or set
# half_open_success_threshold = 2 under [circuit_breaker] in
# /etc/stemedb/config.toml instead.
export STEMEDB_CIRCUIT_BREAKER_HALF_OPEN_SUCCESS_THRESHOLD=2  # Lower from 3 to 2
sudo systemctl restart stemedb-api

# Reset circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "relaxed_threshold"}'

# Monitor
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
```

**If failed:** Still looping → Agent fundamentally broken. Keep banned until operator resolves.
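
To make the effect of relaxing the threshold concrete, the HALF_OPEN rule from the note in §4 can be sketched as a small shell function. This is an illustrative model only, not StemeDB's implementation: the name `half_open_outcome` is hypothetical, and the assumption that the breaker re-opens once the success threshold can no longer be reached (failures ≥ requests − threshold + 1) is inferred from the default "3 of 5" behavior rather than confirmed by the source.

```bash
# Hypothetical model of the HALF_OPEN test rule (not the server's actual code).
# Usage: half_open_outcome <successes> <failures> [success_threshold] [request_count]
# Defaults mirror the documented values: threshold 3, 5 test requests.
half_open_outcome() (
  successes=$1
  failures=$2
  threshold=${3:-3}
  total=${4:-5}
  # ASSUMPTION: the breaker re-opens once the success threshold is unreachable.
  fail_limit=$(( total - threshold + 1 ))
  if [ "$successes" -ge "$threshold" ]; then
    echo "CLOSED"
  elif [ "$failures" -ge "$fail_limit" ]; then
    echo "OPEN"
  else
    echo "HALF_OPEN"   # test window still in progress
  fi
)

half_open_outcome 3 2      # prints CLOSED
half_open_outcome 2 3      # prints OPEN
half_open_outcome 2 3 2    # prints CLOSED (threshold relaxed from 3 to 2)
half_open_outcome 1 1      # prints HALF_OPEN
```

Under these assumptions, an agent that passes only 2 of 5 test requests returns to OPEN with the default threshold but reaches CLOSED with the relaxed one, which is why relaxing the threshold should stay temporary.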

---

## Validation

After applying a resolution, validate that the circuit breaker is functioning:

- [ ] **Circuit breaker state is CLOSED**
  ```bash
  curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
  # Should return: "CLOSED"
  ```

- [ ] **Agent can submit assertions**
  ```bash
  # Test assertion from agent
  curl -X POST http://localhost:18180/v1/assert \
    -H "X-Agent-Signature: $AGENT_SIGNATURE" \
    -d '{...}'
  # Should return: 201 Created
  ```

- [ ] **No 429 responses**
  ```bash
  curl http://localhost:18180/metrics | grep "stemedb_requests_rejected_total.*circuit_breaker.*$AGENT_ID"
  # Counter should stop increasing
  ```

- [ ] **Circuit breaker metrics healthy**
  ```bash
  curl http://localhost:18180/metrics | grep "circuit_breaker_state.*$AGENT_ID"
  # Should show: stemedb_circuit_breaker_state{agent_id="...",state="CLOSED"} 1
  ```

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_circuit_breakers
    rules:
      - alert: StemeDBCircuitBreakerOpen
        expr: stemedb_circuit_breaker_state{state="OPEN"} > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck open (>1 hour)"
          description: "Agent {{ $labels.agent_id }} banned for >1h"

      - alert: StemeDBCircuitBreakerLoop
        # increase() counts transitions over the window; rate() would yield a per-second value
        expr: increase(stemedb_circuit_breaker_transitions_total[1h]) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker looping"
          description: "Agent {{ $labels.agent_id }} cycling >5 times/hour"
```

### Configuration Changes

**To prevent recurrence:**

1. **Review stuck breakers daily:** Add to on-call checklist
2. **Tune timeouts:** Adjust based on agent behavior patterns
3. **Document ban reasons:** Always add reason when manually opening
4. **Agent health checks:** Implement agent-side health checks before submitting

**Example: Shorter timeout for pilot**

```toml
# /etc/stemedb/config.toml
[circuit_breaker]
timeout_seconds = 1800  # 30 minutes instead of 1 hour
half_open_success_threshold = 3
half_open_request_count = 5
```

---

## Circuit Breaker Admin Workflow

**Standard procedure for stuck circuit breakers:**

1. **Identify stuck breaker:**
   ```bash
   curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
   ```

2. **Investigate cause:**
   - Check quarantine items from agent
   - Review failure reason
   - Contact agent operator

3. **Decide action:**
   - If agent fixed → Reset to HALF_OPEN
   - If false positive → Reset to CLOSED
   - If still broken → Keep banned

4. **Document decision:**
   - Add note to incident log
   - Update agent metadata if persistent issue

5. **Monitor transition:**
   - Watch for immediate re-ban (indicates agent still broken)
   - Verify assertion rate returns to normal

---

## Response Headers Reference

**Circuit breaker state is communicated via response headers:**

| State | Status Code | Headers |
|-------|-------------|---------|
| **CLOSED** | 201 Created | (none) |
| **OPEN** | 429 Too Many Requests | `x-circuit-breaker-state: OPEN`<br>`retry-after: 3600` |
| **HALF_OPEN** | 429 Too Many Requests | `x-circuit-breaker-state: HALF_OPEN`<br>`retry-after: 60` |

**Agent Implementation Guidelines:**

Agents should:
1. Check for `x-circuit-breaker-state` header on 429 responses
2. If `OPEN`: Back off for `retry-after` seconds
3. If `HALF_OPEN`: Retry cautiously (exponential backoff)
4. Log circuit breaker state for operator visibility

---

## Related Runbooks

- [Quarantine Overflow](./quarantine-overflow.md) - Related content defense issues
- [High Query Latency](./high-query-latency.md) - Performance impact
- [Server Won't Start](./server-wont-start.md) - Restart impacts circuit breakers

---

## Last Updated

2026-02-11