stemedb/docs/operations/runbooks/circuit-breaker-stuck.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

432 lines
12 KiB
Markdown

# Runbook: Circuit Breaker Stuck
## Symptom
- Agent getting 429 "Too Many Requests" responses
- Dashboard shows circuit breaker in "OPEN" state
- Legitimate agent unable to submit assertions
- Circuit breaker won't transition to "HALF_OPEN" or "CLOSED"
**Metrics Alerts:**
- `stemedb_circuit_breaker_state{state="OPEN"}` > 0 for >1 hour
- `stemedb_requests_rejected_total{reason="circuit_breaker"}` increasing
**Response Headers:**
```
HTTP/1.1 429 Too Many Requests
x-circuit-breaker-state: OPEN
retry-after: 3600
```
---
## Quick Diagnosis
```
Circuit breaker stuck
├─► Check: curl .../admin/circuit_breakers | jq '.circuit_breakers[] | select(.state=="OPEN")'
│ └─► Agent banned? → §1 Manual Ban
├─► Check: When was circuit breaker opened?
│ └─► >1 hour ago but still OPEN? → §2 Stuck in OPEN
├─► Check: Agent repeatedly failing?
│ └─► Automatic ban due to failures → §3 Legitimate Ban
└─► Check: Circuit breaker in HALF_OPEN but requests still failing?
└─► Stuck in HALF_OPEN loop → §4 HALF_OPEN Loop
```
---
## Common Causes
1. **Manual ban not reset** — Likelihood: **40%**
- Admin manually opened circuit breaker
- Forgot to reset after issue resolved
- No automatic timeout configured
2. **Automatic ban due to high failure rate** — Likelihood: **30%**
- Agent submitting low-quality assertions (quarantined)
- Agent hitting rate limits
- Agent violating content defense rules
3. **Circuit breaker timeout too long** — Likelihood: **15%**
- Default timeout (1 hour) too conservative
- Agent blocked longer than needed
- No process to review stuck breakers
4. **HALF_OPEN loop (test requests failing)** — Likelihood: **15%**
- Agent still misconfigured
- Content defense still rejecting
- Circuit breaker testing with same bad requests
---
## Circuit Breaker State Machine
```
CLOSED (normal)
├─► Failure rate >30% over 5 min
│ └─► OPEN (banned)
│ │
│ ├─► Wait timeout (default: 1 hour)
│ │ └─► HALF_OPEN (testing)
│ │ │
│ │ ├─► Test requests succeed
│ │ │ └─► CLOSED (restored)
│ │ │
│ │ └─► Test requests fail
│ │ └─► OPEN (banned again)
│ │
│ └─► Manual reset
│ └─► HALF_OPEN or CLOSED
```
---
## Resolution Steps
### §1. Manual Reset (Intended Ban)
**Diagnostic:**
```bash
# List all circuit breakers in OPEN state
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN")'
# Expected output:
# {
# "agent_id": "8f3a2b1c...",
# "state": "OPEN",
# "opened_at": "2026-02-11T09:00:00Z",
# "reason": "flooding_quarantine",
# "failure_count": 487,
# "timeout_until": "2026-02-11T10:00:00Z"
# }
# Check if ban was manual
journalctl -u stemedb-api | grep "circuit_breaker.*manual"
```
**Resolution: Manual reset**
⚠️ **WARNING:** Only reset if confident agent issue is resolved. Otherwise will immediately re-open.
```bash
# Get agent ID
AGENT_ID="8f3a2b1c..."
# Check current state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID
# Option 1: Reset to HALF_OPEN (conservative - test first)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "issue_resolved"}'
# Expected response:
# {"status": "reset", "agent_id": "8f3a2b1c...", "state": "HALF_OPEN"}
# Wait for agent to submit test assertion
# If succeeds → Transitions to CLOSED
# If fails → Returns to OPEN
# Option 2: Reset to CLOSED (aggressive - trust immediately)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "CLOSED", "reason": "false_positive"}'
# Verify state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
# Should return: "CLOSED" or "HALF_OPEN"
```
**Test agent access:**
```bash
# Submit test assertion from agent
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-H "X-Agent-Signature: $AGENT_SIGNATURE" \
-d '{
"concept_path": "test/circuit_breaker",
"predicate": "reset_test",
"value": true,
"confidence": 0.9
}'
# Should return: 201 Created (not 429)
```
**If failed:** Reset to HALF_OPEN but immediately returns to OPEN → Agent still submitting bad requests. Fix agent first.
---
### §2. Stuck in OPEN (Timeout Not Expiring)
**Diagnostic:**
```bash
# Check timeout expiry
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN") | {agent_id, timeout_until, now: (now | todate)}'
# If timeout_until is in the past but still OPEN → Bug or manual ban with no timeout
# Check for manual ban
journalctl -u stemedb-api | grep "circuit_breaker.*$AGENT_ID"
```
**Resolution: Force reset**
```bash
# Force transition to HALF_OPEN
AGENT_ID="stuck-agent-id"
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "timeout_expired", "force": true}'
# Monitor transition
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
# Should transition: OPEN → HALF_OPEN → CLOSED (after test request)
```
**If failed:** Force reset doesn't work → Potential bug. Escalate to engineering. Workaround: Restart server (resets all circuit breakers to CLOSED).
---
### §3. Legitimate Ban (Agent Still Misbehaving)
**Diagnostic:**
```bash
# Check why agent was banned
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '{reason, failure_count, failure_rate}'
# Check recent quarantine items from this agent
curl http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID | jq '.items[0:5]'
# Check agent's recent assertion history
curl http://localhost:18180/metrics | grep "stemedb_ingest_rejected_total.*$AGENT_ID"
```
**Resolution: Fix agent, then reset**
**Step 1: Identify agent issue**
Common issues:
- Submitting duplicate assertions (same concept_path/predicate repeatedly)
- Low-quality data (confidence too high for source authority)
- Malformed payloads
- Rate limiting (>1K assertions/min)
**Step 2: Contact agent operator**
```bash
# Get agent contact info (if available)
curl http://localhost:18180/v1/admin/agents/$AGENT_ID | jq '.contact'
# Or check agent metadata
curl http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "agent/'$AGENT_ID'/metadata", "lens": "recency"}'
```
**Step 3: Test fix**
```bash
# After agent operator claims fix, reset to HALF_OPEN
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "agent_fixed"}'
# Agent submits test assertion
# Monitor for success/failure
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
```
**If failed:** Agent still misbehaving after "fix" → Keep banned. Agent must resolve issue before reset.
---
### §4. HALF_OPEN Loop (Test Requests Failing)
**Diagnostic:**
```bash
# Check how many times circuit breaker has cycled HALF_OPEN → OPEN
curl http://localhost:18180/metrics | grep "circuit_breaker_transitions.*$AGENT_ID"
# If count >5 in last hour → Loop detected
# Check test request failures
journalctl -u stemedb-api | grep "circuit_breaker.*half_open_test.*$AGENT_ID"
```
**Resolution: Increase test threshold**
⚠️ **NOTE:** Default: Circuit breaker tests with 5 requests. If 3+ succeed, transitions to CLOSED. If 3+ fail, returns to OPEN.
```bash
# Temporarily relax test threshold (requires restart)
export STEMEDB_CIRCUIT_BREAKER_HALF_OPEN_SUCCESS_THRESHOLD=2 # Lower from 3 to 2
sudo systemctl restart stemedb-api
# Reset circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "relaxed_threshold"}'
# Monitor
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
```
**If failed:** Still looping → Agent fundamentally broken. Keep banned until operator resolves.
---
## Validation
After applying resolution, validate circuit breaker is functioning:
- [ ] **Circuit breaker state is CLOSED**
```bash
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
# Should return: "CLOSED"
```
- [ ] **Agent can submit assertions**
```bash
# Test assertion from agent
curl -X POST http://localhost:18180/v1/assert \
-H "X-Agent-Signature: $AGENT_SIGNATURE" \
-d '{...}'
# Should return: 201 Created
```
- [ ] **No 429 responses**
```bash
curl http://localhost:18180/metrics | grep "stemedb_requests_rejected_total.*circuit_breaker.*$AGENT_ID"
# Counter should stop increasing
```
- [ ] **Circuit breaker metrics healthy**
```bash
curl http://localhost:18180/metrics | grep "circuit_breaker_state.*$AGENT_ID"
# Should show: stemedb_circuit_breaker_state{agent_id="...",state="CLOSED"} 1
```
---
## Prevention
### Monitoring
**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
- name: stemedb_circuit_breakers
rules:
- alert: StemeDBCircuitBreakerOpen
expr: stemedb_circuit_breaker_state{state="OPEN"} > 0
for: 1h
labels:
severity: warning
annotations:
summary: "Circuit breaker stuck open (>1 hour)"
description: "Agent {{ $labels.agent_id }} banned for >1h"
- alert: StemeDBCircuitBreakerLoop
expr: rate(stemedb_circuit_breaker_transitions_total[1h]) > 5
for: 30m
labels:
severity: warning
annotations:
summary: "Circuit breaker looping"
description: "Agent {{ $labels.agent_id }} cycling >5 times/hour"
```
### Configuration Changes
**To prevent recurrence:**
1. **Review stuck breakers daily:** Add to on-call checklist
2. **Tune timeouts:** Adjust based on agent behavior patterns
3. **Document ban reasons:** Always add reason when manually opening
4. **Agent health checks:** Implement agent-side health checks before submitting
**Example: Shorter timeout for pilot**
```toml
# /etc/stemedb/config.toml
[circuit_breaker]
timeout_seconds = 1800 # 30 minutes instead of 1 hour
half_open_success_threshold = 3
half_open_request_count = 5
```
---
## Circuit Breaker Admin Workflow
**Standard procedure for stuck circuit breakers:**
1. **Identify stuck breaker:**
```bash
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
```
2. **Investigate cause:**
- Check quarantine items from agent
- Review failure reason
- Contact agent operator
3. **Decide action:**
- If agent fixed → Reset to HALF_OPEN
- If false positive → Reset to CLOSED
- If still broken → Keep banned
4. **Document decision:**
- Add note to incident log
- Update agent metadata if persistent issue
5. **Monitor transition:**
- Watch for immediate re-ban (indicates agent still broken)
- Verify assertion rate returns to normal
---
## Response Headers Reference
**Circuit breaker state is communicated via response headers:**
| State | Status Code | Headers |
|-------|-------------|---------|
| **CLOSED** | 201 Created | (none) |
| **OPEN** | 429 Too Many Requests | `x-circuit-breaker-state: OPEN`<br>`retry-after: 3600` |
| **HALF_OPEN** | 429 Too Many Requests | `x-circuit-breaker-state: HALF_OPEN`<br>`retry-after: 60` |
**Agent Implementation Guidelines:**
Agents should:
1. Check for `x-circuit-breaker-state` header on 429 responses
2. If `OPEN`: Back off for `retry-after` seconds
3. If `HALF_OPEN`: Retry cautiously (exponential backoff)
4. Log circuit breaker state for operator visibility
---
## Related Runbooks
- [Quarantine Overflow](./quarantine-overflow.md) - Related content defense issues
- [High Query Latency](./high-query-latency.md) - Performance impact
- [Server Won't Start](./server-wont-start.md) - Restart impacts circuit breakers
---
## Last Updated
2026-02-11