Runbook: Circuit Breaker Stuck
Symptom
- Agent getting 429 "Too Many Requests" responses
- Dashboard shows circuit breaker in "OPEN" state
- Legitimate agent unable to submit assertions
- Circuit breaker won't transition to "HALF_OPEN" or "CLOSED"
Metrics Alerts:
- stemedb_circuit_breaker_state{state="OPEN"} > 0 for >1 hour
- stemedb_requests_rejected_total{reason="circuit_breaker"} increasing
Response Headers:
HTTP/1.1 429 Too Many Requests
x-circuit-breaker-state: OPEN
retry-after: 3600
Quick Diagnosis
Circuit breaker stuck
│
├─► Check: curl .../admin/circuit_breakers | jq '.circuit_breakers[] | select(.state=="OPEN")'
│ └─► Agent banned? → §1 Manual Reset
│
├─► Check: When was circuit breaker opened?
│ └─► >1 hour ago but still OPEN? → §2 Stuck in OPEN
│
├─► Check: Agent repeatedly failing?
│ └─► Automatic ban due to failures → §3 Legitimate Ban
│
└─► Check: Circuit breaker in HALF_OPEN but requests still failing?
└─► Stuck in HALF_OPEN loop → §4 HALF_OPEN Loop
Common Causes
- Manual ban not reset (likelihood: 40%)
  - Admin manually opened circuit breaker
  - Forgot to reset after issue resolved
  - No automatic timeout configured
- Automatic ban due to high failure rate (likelihood: 30%)
  - Agent submitting low-quality assertions (quarantined)
  - Agent hitting rate limits
  - Agent violating content defense rules
- Circuit breaker timeout too long (likelihood: 15%)
  - Default timeout (1 hour) too conservative
  - Agent blocked longer than needed
  - No process to review stuck breakers
- HALF_OPEN loop, test requests failing (likelihood: 15%)
  - Agent still misconfigured
  - Content defense still rejecting
  - Circuit breaker testing with same bad requests
Circuit Breaker State Machine
CLOSED (normal)
│
├─► Failure rate >30% over 5 min
│ └─► OPEN (banned)
│ │
│ ├─► Wait timeout (default: 1 hour)
│ │ └─► HALF_OPEN (testing)
│ │ │
│ │ ├─► Test requests succeed
│ │ │ └─► CLOSED (restored)
│ │ │
│ │ └─► Test requests fail
│ │ └─► OPEN (banned again)
│ │
│ └─► Manual reset
│ └─► HALF_OPEN or CLOSED
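The transitions in the diagram above can be sketched as a small lookup function. This is only an illustration of the state machine, not StemeDB's implementation: the function name `next_state` and the event names are hypothetical labels, and defaulting a manual reset to HALF_OPEN is an assumption.

```shell
# Illustrative sketch of the transitions in the diagram above.
# next_state and the event names are hypothetical, not StemeDB identifiers.
# $1 = current state, $2 = event, $3 = target state (manual_reset only)
next_state() {
  case "$1:$2" in
    CLOSED:failure_rate_exceeded) echo "OPEN" ;;        # failure rate >30% over 5 min
    OPEN:timeout_elapsed)         echo "HALF_OPEN" ;;   # default timeout: 1 hour
    HALF_OPEN:tests_succeeded)    echo "CLOSED" ;;      # restored
    HALF_OPEN:tests_failed)       echo "OPEN" ;;        # banned again
    OPEN:manual_reset)            echo "${3:-HALF_OPEN}" ;;  # assumed default target
    *)                            echo "$1" ;;          # no transition defined: stay put
  esac
}
```

For example, `next_state OPEN manual_reset CLOSED` prints `CLOSED`, while an event that has no arrow in the diagram leaves the state unchanged.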
Resolution Steps
§1. Manual Reset (Intended Ban)
Diagnostic:
# List all circuit breakers in OPEN state
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN")'
# Expected output:
# {
# "agent_id": "8f3a2b1c...",
# "state": "OPEN",
# "opened_at": "2026-02-11T09:00:00Z",
# "reason": "flooding_quarantine",
# "failure_count": 487,
# "timeout_until": "2026-02-11T10:00:00Z"
# }
# Check if ban was manual
journalctl -u stemedb-api | grep "circuit_breaker.*manual"
Resolution: Manual reset
⚠️ WARNING: Only reset if you are confident the agent issue is resolved. Otherwise the circuit breaker will immediately re-open.
# Get agent ID
AGENT_ID="8f3a2b1c..."
# Check current state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID
# Option 1: Reset to HALF_OPEN (conservative - test first)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "issue_resolved"}'
# Expected response:
# {"status": "reset", "agent_id": "8f3a2b1c...", "state": "HALF_OPEN"}
# Wait for agent to submit test assertion
# If succeeds → Transitions to CLOSED
# If fails → Returns to OPEN
# Option 2: Reset to CLOSED (aggressive - trust immediately)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "CLOSED", "reason": "false_positive"}'
# Verify state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
# Should return: "CLOSED" or "HALF_OPEN"
Test agent access:
# Submit test assertion from agent
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-H "X-Agent-Signature: $AGENT_SIGNATURE" \
-d '{
"concept_path": "test/circuit_breaker",
"predicate": "reset_test",
"value": true,
"confidence": 0.9
}'
# Should return: 201 Created (not 429)
If failed: The breaker resets to HALF_OPEN but immediately returns to OPEN → The agent is still submitting bad requests. Fix the agent first.
§2. Stuck in OPEN (Timeout Not Expiring)
Diagnostic:
# Check timeout expiry
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state == "OPEN") | {agent_id, timeout_until, now: (now | todate)}'
# If timeout_until is in the past but still OPEN → Bug or manual ban with no timeout
# Check for manual ban
journalctl -u stemedb-api | grep "circuit_breaker.*$AGENT_ID"
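The "timeout_until is in the past" check can be scripted instead of read by eye. The sketch below assumes both timestamps are UTC and use the same fixed-width format as the §1 example output (for example `2026-02-11T10:00:00Z`), in which case plain string order matches chronological order. The helper name `timeout_expired` is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) when timeout_until is earlier than now.
# Assumes both values are UTC timestamps in the same fixed-width format
# (e.g. 2026-02-11T10:00:00Z), so string order matches chronological order.
# $1 = timeout_until, $2 = current time (optional, defaults to now in UTC)
timeout_expired() {
  awk -v t="$1" -v n="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}" \
    'BEGIN { exit !(t < n) }'
}
```

If `timeout_expired "$TIMEOUT_UNTIL"` succeeds while the breaker still reports OPEN, the timeout is not being honored and the force reset in this section applies. `TIMEOUT_UNTIL` here stands for the value copied from the diagnostic output.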
Resolution: Force reset
# Force transition to HALF_OPEN
AGENT_ID="stuck-agent-id"
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "timeout_expired", "force": true}'
# Monitor transition
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
# Should transition: OPEN → HALF_OPEN → CLOSED (after test request)
If failed: Force reset doesn't work → Potential bug. Escalate to engineering. Workaround: Restart server (resets all circuit breakers to CLOSED).
§3. Legitimate Ban (Agent Still Misbehaving)
Diagnostic:
# Check why agent was banned
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '{reason, failure_count, failure_rate}'
# Check recent quarantine items from this agent
curl "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq '.items[0:5]'
# Check agent's recent assertion history
curl http://localhost:18180/metrics | grep "stemedb_ingest_rejected_total.*$AGENT_ID"
Resolution: Fix agent, then reset
Step 1: Identify agent issue
Common issues:
- Submitting duplicate assertions (same concept_path/predicate repeatedly)
- Low-quality data (confidence too high for source authority)
- Malformed payloads
- Exceeding rate limits (>1K assertions/min)
Step 2: Contact agent operator
# Get agent contact info (if available)
curl http://localhost:18180/v1/admin/agents/$AGENT_ID | jq '.contact'
# Or check agent metadata
curl http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "agent/'$AGENT_ID'/metadata", "lens": "recency"}'
Step 3: Test fix
# After agent operator claims fix, reset to HALF_OPEN
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "agent_fixed"}'
# Agent submits test assertion
# Monitor for success/failure
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
If failed: Agent still misbehaving after "fix" → Keep banned. Agent must resolve issue before reset.
§4. HALF_OPEN Loop (Test Requests Failing)
Diagnostic:
# Check how many times circuit breaker has cycled HALF_OPEN → OPEN
curl http://localhost:18180/metrics | grep "circuit_breaker_transitions.*$AGENT_ID"
# If count >5 in last hour → Loop detected
# Check test request failures
journalctl -u stemedb-api | grep "circuit_breaker.*half_open_test.*$AGENT_ID"
Resolution: Relax test threshold
⚠️ NOTE: Default: Circuit breaker tests with 5 requests. If 3+ succeed, transitions to CLOSED. If 3+ fail, returns to OPEN.
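# Temporarily relax test threshold (requires restart).
# A variable exported in your shell is not passed to a systemd service, so set it
# in a drop-in override instead (sudo systemctl edit stemedb-api), lowering 3 to 2:
#   [Service]
#   Environment=STEMEDB_CIRCUIT_BREAKER_HALF_OPEN_SUCCESS_THRESHOLD=2
sudo systemctl restart stemedb-api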
# Reset circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/reset \
-H "Content-Type: application/json" \
-d '{"target_state": "HALF_OPEN", "reason": "relaxed_threshold"}'
# Monitor
watch -n 2 'curl -s http://localhost:18180/v1/admin/circuit_breakers/'$AGENT_ID' | jq .state'
If failed: Still looping → Agent fundamentally broken. Keep banned until operator resolves.
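To see what lowering the success threshold changes, the default test rule from the note in this section (5 test requests, 3 or more successes close the breaker, 3 or more failures re-open it) can be modeled as below. This is a hypothetical model, not StemeDB's code. In particular, the rule that the breaker re-opens once the success threshold can no longer be reached is an assumption chosen to match the documented defaults.

```shell
# Hypothetical model of the HALF_OPEN test rule; not StemeDB's actual code.
# $1 = successes so far, $2 = failures so far,
# $3 = success threshold (default 3), $4 = test request count (default 5)
half_open_outcome() {
  if [ "$1" -ge "${3:-3}" ]; then
    echo "CLOSED"
  elif [ "$2" -gt $(( ${4:-5} - ${3:-3} )) ]; then
    # Assumption: re-open once the success threshold can no longer be reached.
    echo "OPEN"
  else
    echo "HALF_OPEN"   # still testing
  fi
}
```

Under this model, an agent with 2 successes and 3 failures returns to OPEN with the default threshold (`half_open_outcome 2 3`), but transitions to CLOSED once the threshold is lowered to 2 (`half_open_outcome 2 3 2`).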
Validation
After applying resolution, validate circuit breaker is functioning:
- Circuit breaker state is CLOSED
  curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID | jq '.state'
  # Should return: "CLOSED"
- Agent can submit assertions
  # Test assertion from agent
  curl -X POST http://localhost:18180/v1/assert \
    -H "X-Agent-Signature: $AGENT_SIGNATURE" \
    -d '{...}'
  # Should return: 201 Created
- No 429 responses
  curl http://localhost:18180/metrics | grep "stemedb_requests_rejected_total.*circuit_breaker.*$AGENT_ID"
  # Counter should stop increasing
- Circuit breaker metrics healthy
  curl http://localhost:18180/metrics | grep "circuit_breaker_state.*$AGENT_ID"
  # Should show: stemedb_circuit_breaker_state{agent_id="...",state="CLOSED"} 1
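Confirming that the rejection counter has stopped increasing takes two scrapes, not one. The sketch below compares two saved copies of the metrics output. It assumes the counter carries the agent ID and `reason="circuit_breaker"` as labels, as the grep pattern in the validation checks suggests. The helper names `rejected_count` and `rejections_stopped` are hypothetical.

```shell
# Hypothetical helper: sum the circuit-breaker rejection counter for one agent
# from Prometheus text-format output.
# Assumes the metric carries the agent ID and reason="circuit_breaker" as labels.
# $1 = metrics text, $2 = agent ID
rejected_count() {
  printf '%s\n' "$1" \
    | grep 'stemedb_requests_rejected_total' \
    | grep 'circuit_breaker' \
    | grep -F "$2" \
    | awk '{ sum += $NF } END { print sum + 0 }'
}

# Succeeds (exit 0) when the counter did not grow between two scrapes.
# $1 = earlier metrics text, $2 = later metrics text, $3 = agent ID
rejections_stopped() {
  [ "$(rejected_count "$2" "$3")" -le "$(rejected_count "$1" "$3")" ]
}
```

In practice the two inputs would be captured a minute or so apart, for example with `before=$(curl -s http://localhost:18180/metrics)`, and then checked with `rejections_stopped "$before" "$after" "$AGENT_ID"`.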
Prevention
Monitoring
Set up alerts for:
# Prometheus alert rules
groups:
  - name: stemedb_circuit_breakers
    rules:
      - alert: StemeDBCircuitBreakerOpen
        expr: stemedb_circuit_breaker_state{state="OPEN"} > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck open (>1 hour)"
          description: "Agent {{ $labels.agent_id }} banned for >1h"
      - alert: StemeDBCircuitBreakerLoop
        # increase() counts transitions over the window; rate() would be per-second
        expr: increase(stemedb_circuit_breaker_transitions_total[1h]) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker looping"
          description: "Agent {{ $labels.agent_id }} cycling >5 times/hour"
Configuration Changes
To prevent recurrence:
- Review stuck breakers daily: Add to on-call checklist
- Tune timeouts: Adjust based on agent behavior patterns
- Document ban reasons: Always add reason when manually opening
- Agent health checks: Implement agent-side health checks before submitting
Example: Shorter timeout for pilot
# /etc/stemedb/config.toml
[circuit_breaker]
timeout_seconds = 1800 # 30 minutes instead of 1 hour
half_open_success_threshold = 3
half_open_request_count = 5
Circuit Breaker Admin Workflow
Standard procedure for stuck circuit breakers:
- Identify stuck breaker:
  curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.state != "CLOSED")'
- Investigate cause:
  - Check quarantine items from agent
  - Review failure reason
  - Contact agent operator
- Decide action:
  - If agent fixed → Reset to HALF_OPEN
  - If false positive → Reset to CLOSED
  - If still broken → Keep banned
- Document decision:
  - Add note to incident log
  - Update agent metadata if persistent issue
- Monitor transition:
  - Watch for immediate re-ban (indicates agent still broken)
  - Verify assertion rate returns to normal
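The "Decide action" step maps directly onto the reset requests used in the resolution steps. A minimal sketch of that mapping follows. The helper names `decide_target` and `reset_payload` and the finding labels are hypothetical; the `target_state` and `reason` fields are taken from the reset examples in this runbook.

```shell
# Hypothetical helpers for the "Decide action" step; names are illustrative.
# $1 = finding: agent_fixed | false_positive | still_broken
decide_target() {
  case "$1" in
    agent_fixed)    echo "HALF_OPEN" ;;  # test before trusting
    false_positive) echo "CLOSED" ;;     # trust immediately
    *)              echo "" ;;           # still broken: keep banned, no reset
  esac
}

# Build the JSON body for the reset endpoint used in the resolution steps.
# $1 = target state, $2 = reason
reset_payload() {
  printf '{"target_state": "%s", "reason": "%s"}\n' "$1" "$2"
}
```

The output of `reset_payload "$(decide_target agent_fixed)" agent_fixed` can be passed to the reset call with `-d`. An empty result from `decide_target` means no reset should be sent.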
Response Headers Reference
Circuit breaker state is communicated via response headers:
| State | Status Code | Headers |
|---|---|---|
| CLOSED | 201 Created | (none) |
| OPEN | 429 Too Many Requests | x-circuit-breaker-state: OPEN, retry-after: 3600 |
| HALF_OPEN | 429 Too Many Requests | x-circuit-breaker-state: HALF_OPEN, retry-after: 60 |
Agent Implementation Guidelines:
Agents should:
- Check for the x-circuit-breaker-state header on 429 responses
- If OPEN: Back off for retry-after seconds
- If HALF_OPEN: Retry cautiously (exponential backoff)
- Log circuit breaker state for operator visibility
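The guidelines above can be sketched on the agent side as two small helpers: one reads a header from a raw response, the other turns the breaker state into a wait time. Both names are hypothetical, and the 3600-second cap on exponential backoff is an assumption, not a documented StemeDB limit.

```shell
# Hypothetical agent-side helpers; names and the 3600s cap are assumptions.

# Extract a header value (case-insensitive name) from raw response headers.
# $1 = raw headers, $2 = header name
header_value() {
  printf '%s\n' "$1" | tr -d '\r' | awk -v name="$2" '{
    i = index($0, ":")
    if (i > 0 && tolower(substr($0, 1, i - 1)) == tolower(name)) {
      value = substr($0, i + 1)
      sub(/^[ \t]+/, "", value)
      print value
      exit
    }
  }'
}

# Decide how many seconds to wait after a 429.
# $1 = breaker state, $2 = retry-after seconds, $3 = attempt number (1-based)
backoff_seconds() {
  case "$1" in
    OPEN)
      echo "$2" ;;   # honor retry-after in full
    HALF_OPEN)
      # Exponential backoff: retry-after doubled per attempt, capped at 3600s.
      awk -v base="$2" -v attempt="${3:-1}" 'BEGIN {
        delay = base * 2 ^ (attempt - 1)
        print (delay > 3600 ? 3600 : delay)
      }' ;;
    *)
      echo 0 ;;      # no breaker state reported: no extra wait
  esac
}
```

An agent would capture the response headers (for example with `curl -s -D - -o /dev/null`), read the state and `retry-after` with `header_value`, log the state, and sleep for `backoff_seconds` before the next attempt.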
Related Runbooks
- Quarantine Overflow - Related content defense issues
- High Query Latency - Performance impact
- Server Won't Start - Restart impacts circuit breakers
Last Updated
2026-02-11