# Runbook: Circuit Breaker Stuck

## Symptom

- Agent getting 429 "Too Many Requests" responses
- Dashboard shows circuit breaker in "OPEN" state
- Legitimate agent unable to submit assertions
- Circuit breaker won't transition to "HALF_OPEN" or "CLOSED"

**Metrics Alerts:**
- `stemedb_circuit_breaker_state{state="OPEN"}` > 0 for >1 hour
- `stemedb_requests_rejected_total{reason="circuit_breaker"}` increasing

**Response Headers:**
```
HTTP/1.1 429 Too Many Requests
x-circuit-breaker-state: OPEN
retry-after: 3600
```

---

## Quick Diagnosis

```
Circuit breaker stuck
│
├─► Check: curl .../admin/circuit_breakers | jq '.circuit_breakers[] | select(.state=="OPEN")'
│   └─► Agent banned? → §1 Manual Reset
│
├─► Check: When was circuit breaker opened?
│   └─► >1 hour ago but still OPEN? → §2 Stuck in OPEN
│
├─► Check: Agent repeatedly failing?
│   └─► Automatic ban due to failures → §3 Legitimate Ban
│
└─► Check: Circuit breaker in HALF_OPEN but requests still failing?
    └─► Stuck in HALF_OPEN loop → §4 HALF_OPEN Loop
```
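The decision tree above can be sketched as a small shell function for use in triage scripts. This is only an illustration, not part of StemeDB: the function name `triage_breaker`, its arguments, and the `section-N` outputs are hypothetical. It assumes the operator has already read the breaker state from the admin endpoint, decided from the logs whether the ban was manual, and checked whether `timeout_until` is in the past.

```bash
# Hypothetical helper: map observed breaker facts to a runbook section.
# Usage: triage_breaker STATE BAN_SOURCE TIMEOUT_EXPIRED
#   STATE            OPEN | HALF_OPEN | CLOSED (from the admin endpoint)
#   BAN_SOURCE       manual | automatic (operator's finding from the logs)
#   TIMEOUT_EXPIRED  yes | no (is timeout_until in the past?)
triage_breaker() {
  local state="$1" ban_source="$2" timeout_expired="$3"
  case "$state" in
    CLOSED)    echo "not-stuck" ;;
    HALF_OPEN) echo "section-4" ;;   # HALF_OPEN loop
    OPEN)
      if [ "$ban_source" = "manual" ]; then
        echo "section-1"             # manual ban, needs manual reset
      elif [ "$timeout_expired" = "yes" ]; then
        echo "section-2"             # timeout passed but still OPEN
      else
        echo "section-3"             # automatic ban, agent still failing
      fi ;;
    *) echo "unknown state: $state" >&2; return 1 ;;
  esac
}

# Example: triage_breaker OPEN automatic yes   # prints section-2
```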

---

## Common Causes

1. **Manual ban not reset** - Likelihood: **40%**
   - Admin manually opened circuit breaker
   - Forgot to reset after issue resolved
   - No automatic timeout configured

2. **Automatic ban due to high failure rate** - Likelihood: **30%**
   - Agent submitting low-quality assertions (quarantined)
   - Agent hitting rate limits
   - Agent violating content defense rules

3. **Circuit breaker timeout too long** - Likelihood: **15%**
   - Default timeout (1 hour) too conservative
   - Agent blocked longer than needed
   - No process to review stuck breakers

4. **HALF_OPEN loop (test requests failing)** - Likelihood: **15%**
   - Agent still misconfigured
   - Content defense still rejecting
   - Circuit breaker testing with same bad requests

---

## Circuit Breaker State Machine

```
CLOSED (normal)
│
├─► Failure rate >30% over 5 min
│   └─► OPEN (banned)
│       │
│       ├─► Wait timeout (default: 1 hour)
│       │   └─► HALF_OPEN (testing)
│       │       │
│       │       ├─► Test requests succeed
│       │       │   └─► CLOSED (restored)
│       │       │
│       │       └─► Test requests fail
│       │           └─► OPEN (banned again)
│       │
│       └─► Manual reset
│           └─► HALF_OPEN or CLOSED
```
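To make the transitions above easier to reason about, they can be written as a lookup on the current state and an event. This is a minimal sketch, not StemeDB's implementation: the function name `next_state` and the event labels (`failure_rate_exceeded`, `timeout_elapsed`, `manual_reset`, `tests_succeeded`, `tests_failed`) are hypothetical names chosen for illustration. Only the states and the arrows come from the diagram.

```bash
# Sketch of the transitions drawn above. Event names are hypothetical labels.
# Usage: next_state CURRENT EVENT [RESET_TARGET]
next_state() {
  local current="$1" event="$2" target="${3:-HALF_OPEN}"
  case "$current:$event" in
    CLOSED:failure_rate_exceeded) echo OPEN ;;       # >30% failures over 5 min
    OPEN:timeout_elapsed)         echo HALF_OPEN ;;  # default wait: 1 hour
    OPEN:manual_reset)
      # A manual reset may target HALF_OPEN (test first) or CLOSED (trust now)
      case "$target" in
        HALF_OPEN|CLOSED) echo "$target" ;;
        *) echo "invalid reset target: $target" >&2; return 1 ;;
      esac ;;
    HALF_OPEN:tests_succeeded)    echo CLOSED ;;     # restored
    HALF_OPEN:tests_failed)       echo OPEN ;;       # banned again
    *) echo "$current" ;;  # any other event leaves the state unchanged
  esac
}

# Example: next_state OPEN manual_reset CLOSED   # prints CLOSED
```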

---

## Resolution Steps

### §1. Manual Reset (Intended Ban)

**Diagnostic:**
```bash
# List all circuit breakers in OPEN state
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN")'

# Expected output:
# {
#   "agent_id": "8f3a2b1c...",
#   "state": "OPEN",
#   "opened_at": "2026-02-11T09:00:00Z",
#   "reason": "flooding_quarantine",
#   "failure_count": 487,
#   "timeout_until": "2026-02-11T10:00:00Z"
# }

# Check if ban was manual
journalctl -u stemedb-api | grep "circuit_breaker.*manual"
```

**Resolution: Manual reset**

⚠️ **WARNING:** Only reset if you are confident the agent issue is resolved. Otherwise the breaker will immediately re-open.

```bash
# Get agent ID
AGENT_ID="8f3a2b1c..."

# Check current state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID

# Option 1: Reset to HALF_OPEN (conservative - test first)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "issue_resolved"}'

# Expected response:
# {"status": "reset", "agent_id": "8f3a2b1c...", "state": "HALF_OPEN"}

# Wait for agent to submit test assertion
# If succeeds → Transitions to CLOSED
# If fails → Returns to OPEN

# Option 2: Reset to CLOSED (aggressive - trust immediately)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "CLOSED", "reason": "false_positive"}'

# Verify state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
# Should return: "CLOSED" or "HALF_OPEN"
```
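The two reset options above differ only in `target_state` and `reason`, so a small wrapper can reduce copy-paste mistakes during an incident. The sketch below is an assumption-laden convenience, not a StemeDB tool: `build_reset_payload` and `reset_breaker` are hypothetical names, and it assumes reasons are simple identifiers (no quotes or backslashes), as in the examples in this runbook. The endpoint and JSON fields are the ones shown above.

```bash
# Hypothetical wrapper around the reset call shown in this runbook.
# build_reset_payload TARGET_STATE REASON -> prints the JSON request body
build_reset_payload() {
  local target="$1" reason="$2"
  case "$target" in
    HALF_OPEN|CLOSED) ;;  # the only targets this runbook uses
    *) echo "target_state must be HALF_OPEN or CLOSED" >&2; return 1 ;;
  esac
  # Assumes REASON is a plain identifier such as issue_resolved
  printf '{"target_state": "%s", "reason": "%s"}\n' "$target" "$reason"
}

# reset_breaker AGENT_ID TARGET_STATE REASON -> POSTs the reset request
reset_breaker() {
  local agent_id="$1" payload
  payload=$(build_reset_payload "$2" "$3") || return 1
  curl -sS -X POST "http://localhost:18180/v1/admin/circuit_breakers/$agent_id/reset" \
    -H "Content-Type: application/json" \
    -d "$payload"
}

# Example: reset_breaker "$AGENT_ID" HALF_OPEN issue_resolved
```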

**Test agent access:**
```bash
# Submit test assertion from agent
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{
    "concept_path": "test/circuit_breaker",
    "predicate": "reset_test",
    "value": true,
    "confidence": 0.9
  }'

# Should return: 201 Created (not 429)
```

**If failed:** Reset to HALF_OPEN but immediately returns to OPEN → Agent still submitting bad requests. Fix agent first.

---

### §2. Stuck in OPEN (Timeout Not Expiring)

**Diagnostic:**
```bash
# Check timeout expiry
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN") | {agent_id, timeout_until, now: (now | todate)}'

# If timeout_until is in the past but still OPEN → Bug or manual ban with no timeout

# Check for manual ban
journalctl -u stemedb-api | grep "circuit_breaker.*$AGENT_ID"
```
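The "is `timeout_until` in the past?" comparison above can also be scripted. The following is a minimal sketch with a hypothetical function name, `timeout_expired`. It assumes both timestamps are UTC and use the exact `YYYY-MM-DDTHH:MM:SSZ` shape shown in the sample output (no fractional seconds, no offsets), which lets them be compared as plain numbers once the punctuation is stripped.

```bash
# Hypothetical helper: succeeds (exit 0) if TIMEOUT_UNTIL is earlier than NOW.
# Both arguments must be UTC timestamps like 2026-02-11T10:00:00Z.
# Usage: timeout_expired TIMEOUT_UNTIL NOW
timeout_expired() {
  local until_digits now_digits
  # Strip everything but digits: 2026-02-11T10:00:00Z -> 20260211100000
  until_digits=$(printf '%s' "$1" | tr -cd '0-9')
  now_digits=$(printf '%s' "$2" | tr -cd '0-9')
  # Fixed-width digit strings compare correctly as integers
  [ "$until_digits" -lt "$now_digits" ]
}

# Example, using the current UTC time as NOW:
#   timeout_expired "2026-02-11T10:00:00Z" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
#     && echo "timeout passed: breaker should have left OPEN"
```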

**Resolution: Force reset**

```bash
# Force transition to HALF_OPEN
AGENT_ID="stuck-agent-id"

curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "timeout_expired", "force": true}'

# Monitor transition
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'

# Should transition: OPEN → HALF_OPEN → CLOSED (after test request)
```

**If failed:** Force reset doesn't work → Potential bug. Escalate to engineering. Workaround: Restart server (resets all circuit breakers to CLOSED).

---

### §3. Legitimate Ban (Agent Still Misbehaving)

**Diagnostic:**
```bash
# Check why agent was banned
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '{reason, failure_count, failure_rate}'

# Check recent quarantine items from this agent
# (URL is quoted so the shell does not treat "?" as a glob character)
curl "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq '.items[0:5]'

# Check agent's recent rejected assertions
curl http://localhost:18180/metrics | grep "stemedb_ingest_rejected_total.*$AGENT_ID"
```

**Resolution: Fix agent, then reset**

**Step 1: Identify agent issue**

Common issues:
- Submitting duplicate assertions (same concept_path/predicate repeatedly)
- Low-quality data (confidence too high for source authority)
- Malformed payloads
- Rate limiting (>1K assertions/min)

**Step 2: Contact agent operator**

```bash
# Get agent contact info (if available)
curl http://localhost:18180/v1/admin/agents/$AGENT_ID | jq '.contact'

# Or check agent metadata
curl http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "agent/'$AGENT_ID'/metadata", "lens": "recency"}'
```

**Step 3: Test fix**

```bash
# After agent operator claims fix, reset to HALF_OPEN
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "agent_fixed"}'

# Agent submits test assertion
# Monitor for success/failure

curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
```

**If failed:** Agent still misbehaving after "fix" → Keep banned. Agent must resolve issue before reset.

---

### §4. HALF_OPEN Loop (Test Requests Failing)

**Diagnostic:**
```bash
# Check how many times circuit breaker has cycled HALF_OPEN → OPEN
curl http://localhost:18180/metrics | grep "circuit_breaker_transitions.*$AGENT_ID"

# If count >5 in last hour → Loop detected

# Check test request failures
journalctl -u stemedb-api | grep "circuit_breaker.*half_open_test.*$AGENT_ID"
```

**Resolution: Relax test threshold**

⚠️ **NOTE:** Default: Circuit breaker tests with 5 requests. If 3+ succeed, transitions to CLOSED. If 3+ fail, returns to OPEN.

```bash
# Temporarily relax test threshold (requires restart)
# Note: a variable exported in your shell is not inherited by a systemd-managed
# service. Set it in the unit's environment (for example with
# `sudo systemctl edit stemedb-api` and an `Environment=` line) before restarting.
export STEMEDB_CIRCUIT_BREAKER_HALF_OPEN_SUCCESS_THRESHOLD=2  # Lower from 3 to 2

sudo systemctl restart stemedb-api

# Reset circuit breaker
# (per §2, a restart resets all breakers to CLOSED; this puts the agent back under test)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "relaxed_threshold"}'

# Monitor
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
```
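The default HALF_OPEN rule noted in this section (5 test requests, 3 or more successes close the breaker, 3 or more failures re-open it) can be modeled to see what a relaxed threshold changes. This is a sketch under stated assumptions, not the server's code: `half_open_outcome` is a hypothetical name, and the generalization "re-open once the success threshold can no longer be reached" is an assumption that happens to match the documented defaults.

```bash
# Hypothetical model of the HALF_OPEN test window.
# Usage: half_open_outcome SUCCESSES FAILURES [SUCCESS_THRESHOLD] [REQUEST_COUNT]
# Defaults mirror the documented values: threshold 3, request count 5.
half_open_outcome() {
  local successes="$1" failures="$2"
  local threshold="${3:-3}" request_count="${4:-5}"
  if [ "$successes" -ge "$threshold" ]; then
    echo CLOSED       # enough test requests succeeded
  elif [ "$failures" -gt $((request_count - threshold)) ]; then
    echo OPEN         # assumed rule: threshold can no longer be reached
  else
    echo HALF_OPEN    # still testing
  fi
}

# Example: with the threshold lowered to 2, two successes are enough:
#   half_open_outcome 2 3 2   # prints CLOSED (would be OPEN with the default 3)
```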

**If failed:** Still looping → Agent fundamentally broken. Keep banned until operator resolves.

---

## Validation

After applying a resolution, validate that the circuit breaker is functioning:

- [ ] **Circuit breaker state is CLOSED**
  ```bash
  curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
  # Should return: "CLOSED"
  ```

- [ ] **Agent can submit assertions**
  ```bash
  # Test assertion from agent
  curl -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -H "X-Agent-Signature: $AGENT_SIGNATURE" \
    -d '{...}'
  # Should return: 201 Created
  ```

- [ ] **No 429 responses**
  ```bash
  curl http://localhost:18180/metrics | grep "stemedb_requests_rejected_total.*circuit_breaker.*$AGENT_ID"
  # Counter should stop increasing
  ```

- [ ] **Circuit breaker metrics healthy**
  ```bash
  curl http://localhost:18180/metrics | grep "circuit_breaker_state.*$AGENT_ID"
  # Should show: stemedb_circuit_breaker_state{agent_id="...",state="CLOSED"} 1
  ```
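The last checklist item can be turned into a pass/fail check instead of an eyeball comparison. The sketch below uses a hypothetical function name, `breaker_closed_in_metrics`, and assumes the metric line has exactly the label set and order shown in the checklist (`agent_id`, then `state`). If the exporter emits extra labels, the match would need loosening.

```bash
# Hypothetical check: reads Prometheus text-format metrics on stdin and
# succeeds only if the CLOSED gauge for AGENT_ID is present with value 1.
# Usage: curl -s http://localhost:18180/metrics | breaker_closed_in_metrics AGENT_ID
breaker_closed_in_metrics() {
  local agent_id="$1"
  # -F: fixed string (no regex), -x: whole line must match, -q: quiet
  grep -qxF "stemedb_circuit_breaker_state{agent_id=\"$agent_id\",state=\"CLOSED\"} 1"
}

# Example:
#   curl -s http://localhost:18180/metrics | breaker_closed_in_metrics "$AGENT_ID" \
#     && echo "breaker metric healthy"
```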

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_circuit_breakers
    rules:
      - alert: StemeDBCircuitBreakerOpen
        expr: stemedb_circuit_breaker_state{state="OPEN"} > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck open (>1 hour)"
          description: "Agent {{ $labels.agent_id }} banned for >1h"

      - alert: StemeDBCircuitBreakerLoop
        # increase() counts transitions over the window; rate() is a per-second average
        expr: increase(stemedb_circuit_breaker_transitions_total[1h]) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker looping"
          description: "Agent {{ $labels.agent_id }} cycling >5 times/hour"
```

### Configuration Changes

**To prevent recurrence:**

1. **Review stuck breakers daily:** Add to on-call checklist
2. **Tune timeouts:** Adjust based on agent behavior patterns
3. **Document ban reasons:** Always add reason when manually opening
4. **Agent health checks:** Implement agent-side health checks before submitting

**Example: Shorter timeout for pilot**
```toml
# /etc/stemedb/config.toml
[circuit_breaker]
timeout_seconds = 1800  # 30 minutes instead of 1 hour
half_open_success_threshold = 3
half_open_request_count = 5
```

---

## Circuit Breaker Admin Workflow

**Standard procedure for stuck circuit breakers:**

1. **Identify stuck breaker:**
   ```bash
   curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
   ```

2. **Investigate cause:**
   - Check quarantine items from agent
   - Review failure reason
   - Contact agent operator

3. **Decide action:**
   - If agent fixed → Reset to HALF_OPEN
   - If false positive → Reset to CLOSED
   - If still broken → Keep banned

4. **Document decision:**
   - Add note to incident log
   - Update agent metadata if persistent issue

5. **Monitor transition:**
   - Watch for immediate re-ban (indicates agent still broken)
   - Verify assertion rate returns to normal

---

## Response Headers Reference

**Circuit breaker state is communicated via response headers:**

| State | Status Code | Headers |
|-------|-------------|---------|
| **CLOSED** | 201 Created | (none) |
| **OPEN** | 429 Too Many Requests | `x-circuit-breaker-state: OPEN`<br>`retry-after: 3600` |
| **HALF_OPEN** | 429 Too Many Requests | `x-circuit-breaker-state: HALF_OPEN`<br>`retry-after: 60` |

**Agent Implementation Guidelines:**

Agents should:
1. Check for `x-circuit-breaker-state` header on 429 responses
2. If `OPEN`: Back off for `retry-after` seconds
3. If `HALF_OPEN`: Retry cautiously (exponential backoff)
4. Log circuit breaker state for operator visibility
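
The guidelines above leave the backoff arithmetic to the agent. One possible reading is sketched below, with several assumptions: `backoff_seconds` is a hypothetical name, the HALF_OPEN delay doubles from the server's `retry-after` value on each attempt, and the delay is capped at 3600 seconds (chosen here to match the OPEN `retry-after` in the table, not taken from StemeDB).

```bash
# Hypothetical agent-side helper: seconds to wait before the next attempt.
# Usage: backoff_seconds BREAKER_STATE RETRY_AFTER ATTEMPT
#   BREAKER_STATE  value of the x-circuit-breaker-state header
#   RETRY_AFTER    value of the retry-after header, in seconds
#   ATTEMPT        1 for the first retry, 2 for the second, ...
backoff_seconds() {
  local state="$1" retry_after="$2" attempt="$3" delay
  case "$state" in
    OPEN)
      # Honor the server's retry-after value as-is
      echo "$retry_after" ;;
    HALF_OPEN)
      # Exponential backoff: retry_after * 2^(attempt-1), capped at 1 hour
      delay=$((retry_after << (attempt - 1)))
      if [ "$delay" -gt 3600 ]; then
        delay=3600
      fi
      echo "$delay" ;;
    *)
      # No breaker header: nothing to wait for
      echo 0 ;;
  esac
}

# Example: third retry while HALF_OPEN with retry-after: 60
#   backoff_seconds HALF_OPEN 60 3   # prints 240
```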

---

## Related Runbooks

- [Quarantine Overflow](./quarantine-overflow.md) - Related content defense issues
- [High Query Latency](./high-query-latency.md) - Performance impact
- [Server Won't Start](./server-wont-start.md) - Restart impacts circuit breakers

---

## Last Updated

2026-02-11