stemedb/docs/operations/runbooks/circuit-breaker-stuck.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

12 KiB

Runbook: Circuit Breaker Stuck

Symptom

  • Agent getting 429 "Too Many Requests" responses
  • Dashboard shows circuit breaker in "OPEN" state
  • Legitimate agent unable to submit assertions
  • Circuit breaker won't transition to "HALF_OPEN" or "CLOSED"

Metrics Alerts:

  • stemedb_circuit_breaker_state{state="OPEN"} > 0 for >1 hour
  • stemedb_requests_rejected_total{reason="circuit_breaker"} increasing

Response Headers:

HTTP/1.1 429 Too Many Requests
x-circuit-breaker-state: OPEN
retry-after: 3600

Quick Diagnosis

Circuit breaker stuck
    │
    ├─► Check: curl .../admin/circuit_breakers | jq '.circuit_breakers[] | select(.state=="OPEN")'
    │   └─► Agent banned? → §1 Manual Ban
    │
    ├─► Check: When was circuit breaker opened?
    │   └─► >1 hour ago but still OPEN? → §2 Stuck in OPEN
    │
    ├─► Check: Agent repeatedly failing?
    │   └─► Automatic ban due to failures → §3 Legitimate Ban
    │
    └─► Check: Circuit breaker in HALF_OPEN but requests still failing?
        └─► Stuck in HALF_OPEN loop → §4 HALF_OPEN Loop

Common Causes

  1. Manual ban not reset — Likelihood: 40%

    • Admin manually opened circuit breaker
    • Forgot to reset after issue resolved
    • No automatic timeout configured
  2. Automatic ban due to high failure rate — Likelihood: 30%

    • Agent submitting low-quality assertions (quarantined)
    • Agent hitting rate limits
    • Agent violating content defense rules
  3. Circuit breaker timeout too long — Likelihood: 15%

    • Default timeout (1 hour) too conservative
    • Agent blocked longer than needed
    • No process to review stuck breakers
  4. HALF_OPEN loop (test requests failing) — Likelihood: 15%

    • Agent still misconfigured
    • Content defense still rejecting
    • Circuit breaker testing with same bad requests

Circuit Breaker State Machine

CLOSED (normal)
    │
    ├─► Failure rate >30% over 5 min
    │   └─► OPEN (banned)
    │           │
    │           ├─► Wait timeout (default: 1 hour)
    │           │   └─► HALF_OPEN (testing)
    │           │           │
    │           │           ├─► Test requests succeed
    │           │           │   └─► CLOSED (restored)
    │           │           │
    │           │           └─► Test requests fail
    │           │               └─► OPEN (banned again)
    │           │
    │           └─► Manual reset
    │               └─► HALF_OPEN or CLOSED

Resolution Steps

§1. Manual Reset (Intended Ban)

Diagnostic:

# List all circuit breakers in OPEN state
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN")'

# Expected output:
# {
#   "agent_id": "8f3a2b1c...",
#   "state": "OPEN",
#   "opened_at": "2026-02-11T09:00:00Z",
#   "reason": "flooding_quarantine",
#   "failure_count": 487,
#   "timeout_until": "2026-02-11T10:00:00Z"
# }

# Check if ban was manual
journalctl -u stemedb-api | grep "circuit_breaker.*manual"

Resolution: Manual reset

⚠️ WARNING: Only reset if confident agent issue is resolved. Otherwise will immediately re-open.

# Get agent ID
AGENT_ID="8f3a2b1c..."

# Check current state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID

# Option 1: Reset to HALF_OPEN (conservative - test first)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "issue_resolved"}'

# Expected response:
# {"status": "reset", "agent_id": "8f3a2b1c...", "state": "HALF_OPEN"}

# Wait for agent to submit test assertion
# If succeeds → Transitions to CLOSED
# If fails → Returns to OPEN

# Option 2: Reset to CLOSED (aggressive - trust immediately)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "CLOSED", "reason": "false_positive"}'

# Verify state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
# Should return: "CLOSED" or "HALF_OPEN"

Test agent access:

# Submit test assertion from agent
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{
    "concept_path": "test/circuit_breaker",
    "predicate": "reset_test",
    "value": true,
    "confidence": 0.9
  }'

# Should return: 201 Created (not 429)

If failed: Reset to HALF_OPEN but immediately returns to OPEN → Agent still submitting bad requests. Fix agent first.


§2. Stuck in OPEN (Timeout Not Expiring)

Diagnostic:

# Check timeout expiry
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN") | {agent_id, timeout_until, now: (now | todate)}'

# If timeout_until is in the past but still OPEN → Bug or manual ban with no timeout

# Check for manual ban
journalctl -u stemedb-api | grep "circuit_breaker.*$AGENT_ID"

Resolution: Force reset

# Force transition to HALF_OPEN
AGENT_ID="stuck-agent-id"

curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "timeout_expired", "force": true}'

# Monitor transition
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'

# Should transition: OPEN → HALF_OPEN → CLOSED (after test request)

If failed: Force reset doesn't work → Potential bug. Escalate to engineering. Workaround: Restart server (resets all circuit breakers to CLOSED).


§3. Legitimate Ban (Agent Still Misbehaving)

Diagnostic:

# Check why agent was banned
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '{reason, failure_count, failure_rate}'

# Check recent quarantine items from this agent
curl http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID | jq '.items[0:5]'

# Check agent's recent assertion history
curl http://localhost:18180/metrics | grep "stemedb_ingest_rejected_total.*$AGENT_ID"

Resolution: Fix agent, then reset

Step 1: Identify agent issue

Common issues:

  • Submitting duplicate assertions (same concept_path/predicate repeatedly)
  • Low-quality data (confidence too high for source authority)
  • Malformed payloads
  • Rate limiting (>1K assertions/min)

Step 2: Contact agent operator

# Get agent contact info (if available)
curl http://localhost:18180/v1/admin/agents/$AGENT_ID | jq '.contact'

# Or check agent metadata
curl http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "agent/'$AGENT_ID'/metadata", "lens": "recency"}'

Step 3: Test fix

# After agent operator claims fix, reset to HALF_OPEN
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "agent_fixed"}'

# Agent submits test assertion
# Monitor for success/failure

curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'

If failed: Agent still misbehaving after "fix" → Keep banned. Agent must resolve issue before reset.


§4. HALF_OPEN Loop (Test Requests Failing)

Diagnostic:

# Check how many times circuit breaker has cycled HALF_OPEN → OPEN
curl http://localhost:18180/metrics | grep "circuit_breaker_transitions.*$AGENT_ID"

# If count >5 in last hour → Loop detected

# Check test request failures
journalctl -u stemedb-api | grep "circuit_breaker.*half_open_test.*$AGENT_ID"

Resolution: Increase test threshold

⚠️ NOTE: Default: Circuit breaker tests with 5 requests. If 3+ succeed, transitions to CLOSED. If 3+ fail, returns to OPEN.

# Temporarily relax test threshold (requires restart)
export STEMEDB_CIRCUIT_BREAKER_HALF_OPEN_SUCCESS_THRESHOLD=2  # Lower from 3 to 2

sudo systemctl restart stemedb-api

# Reset circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
  -H "Content-Type: application/json" \
  -d '{"target_state": "HALF_OPEN", "reason": "relaxed_threshold"}'

# Monitor
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'

If failed: Still looping → Agent fundamentally broken. Keep banned until operator resolves.


Validation

After applying resolution, validate circuit breaker is functioning:

  • Circuit breaker state is CLOSED

    curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
    # Should return: "CLOSED"
    
  • Agent can submit assertions

    # Test assertion from agent
    curl -X POST http://localhost:18180/v1/assert \
      -H "X-Agent-Signature: $AGENT_SIGNATURE" \
      -d '{...}'
    # Should return: 201 Created
    
  • No 429 responses

    curl http://localhost:18180/metrics | grep "stemedb_requests_rejected_total.*circuit_breaker.*$AGENT_ID"
    # Counter should stop increasing
    
  • Circuit breaker metrics healthy

    curl http://localhost:18180/metrics | grep "circuit_breaker_state.*$AGENT_ID"
    # Should show: stemedb_circuit_breaker_state{agent_id="...",state="CLOSED"} 1
    

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_circuit_breakers
    rules:
      - alert: StemeDBCircuitBreakerOpen
        expr: stemedb_circuit_breaker_state{state="OPEN"} > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck open (>1 hour)"
          description: "Agent {{ $labels.agent_id }} banned for >1h"

      - alert: StemeDBCircuitBreakerLoop
        expr: rate(stemedb_circuit_breaker_transitions_total[1h]) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker looping"
          description: "Agent {{ $labels.agent_id }} cycling >5 times/hour"

Configuration Changes

To prevent recurrence:

  1. Review stuck breakers daily: Add to on-call checklist
  2. Tune timeouts: Adjust based on agent behavior patterns
  3. Document ban reasons: Always add reason when manually opening
  4. Agent health checks: Implement agent-side health checks before submitting

Example: Shorter timeout for pilot

# /etc/stemedb/config.toml
[circuit_breaker]
timeout_seconds = 1800  # 30 minutes instead of 1 hour
half_open_success_threshold = 3
half_open_request_count = 5

Circuit Breaker Admin Workflow

Standard procedure for stuck circuit breakers:

  1. Identify stuck breaker:

    curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
    
  2. Investigate cause:

    • Check quarantine items from agent
    • Review failure reason
    • Contact agent operator
  3. Decide action:

    • If agent fixed → Reset to HALF_OPEN
    • If false positive → Reset to CLOSED
    • If still broken → Keep banned
  4. Document decision:

    • Add note to incident log
    • Update agent metadata if persistent issue
  5. Monitor transition:

    • Watch for immediate re-ban (indicates agent still broken)
    • Verify assertion rate returns to normal

Response Headers Reference

Circuit breaker state is communicated via response headers:

State Status Code Headers
CLOSED 201 Created (none)
OPEN 429 Too Many Requests x-circuit-breaker-state: OPEN
retry-after: 3600
HALF_OPEN 429 Too Many Requests x-circuit-breaker-state: HALF_OPEN
retry-after: 60

Agent Implementation Guidelines:

Agents should:

  1. Check for x-circuit-breaker-state header on 429 responses
  2. If OPEN: Back off for retry-after seconds
  3. If HALF_OPEN: Retry cautiously (exponential backoff)
  4. Log circuit breaker state for operator visibility


Last Updated

2026-02-11