# Runbook: Quarantine Overflow

## Symptom

- Quarantine dashboard panel shows 100+ pending items
- Admin receives alerts that the "quarantine_pending" metric is high
- Legitimate assertions are being quarantined (false positives)
- A single agent is flooding the quarantine queue

**Metrics Alerts:**
- `stemedb_quarantine_pending` > 100 for 10 minutes
- `stemedb_quarantine_rate_per_agent` > 50/min for a single agent

---

## Quick Diagnosis

```
Quarantine overflow
│
├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
│   └─► Single agent? → §1 Single Agent Flooding
│
├─► Check: Are items "Duplicate" or "LowQuality"?
│   └─► Multiple agents, varied reasons → §2 Multiple Agents
│
├─► Check: Recent system changes?
│   └─► Content defense tuned too aggressive → §3 False Positives
│
└─► Check: Legitimate surge (e.g., new data source)?
    └─► Expected behavior → §4 Legitimate Surge
```

---

## Common Causes

1. **Single agent flooding** - Likelihood: **45%**
   - Misconfigured agent
   - Agent in retry loop
   - Malicious actor testing limits

2. **Content defense too aggressive** - Likelihood: **25%**
   - Recently tuned thresholds
   - False positive rate high
   - Quality scoring bugs

3. **Multiple agents with low-quality data** - Likelihood: **20%**
   - Integration issues
   - Bad data sources
   - Extraction pipeline bugs

4. **Legitimate surge** - Likelihood: **10%**
   - New data source onboarded
   - Backfill operation
   - Expected high-volume event

---

## Resolution Steps

### §1. Single Agent Flooding

**Diagnostic:**

```bash
# List quarantine items grouped by agent (top 5 by count)
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'

# Expected output (flooding):
# [
#   {"agent": "8f3a2b1c...", "count": 487},  <-- Flooding!
#   {"agent": "7d2e5f9a...", "count": 12},
#   {"agent": "6c1b4a8e...", "count": 8}
# ]

# Check agent's recent assertions (URL quoted so the shell does not glob on "?")
curl "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'

# Check circuit breaker status for this agent
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'
```

**Resolution: Ban agent via circuit breaker**

```bash
# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..."  # Replace with actual agent ID

# Check current circuit breaker state
curl "http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID"

# Manually open circuit breaker (ban agent)
curl -X POST "http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open" \
  -H "Content-Type: application/json" \
  -d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'

# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}

# Verify agent now gets 429 responses (-i prints the status line and headers)
curl -i -X POST http://localhost:18180/v1/assert \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN
```

**Bulk reject all items from flooding agent:**

```bash
# Get all quarantine item IDs from flooding agent
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')

# Batch reject
for id in $ITEM_IDS; do
  curl -X POST "http://localhost:18180/v1/admin/quarantine/$id/reject" \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done

# Verify quarantine count reduced
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
```

**If failed:** The agent may be bypassing the circuit breaker → check whether it is submitting under different keys. A firewall-level ban may be needed.

---

### §2. Multiple Agents (False Positives)

**Diagnostic:**

```bash
# Check quarantine reasons
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

# Expected output:
# [
#   {"reason": "LowQuality", "count": 87},
#   {"reason": "UntrustedHighConfidence", "count": 34},
#   {"reason": "Duplicate", "count": 12}
# ]

# Sample the first 3 items for one reason (repeat for each reason)
curl http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'
```

**Resolution: Tune content defense thresholds**

⚠️ **NOTE:** Requires restart to apply new thresholds.

```bash
# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

# Adjust quality threshold (example: lower from 0.7 to 0.5)
export STEMEDB_QUALITY_THRESHOLD=0.5

# Or in config file /etc/stemedb/config.toml:
cat >> /etc/stemedb/config.toml <30%, content defense is too aggressive

# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
```

**Resolution: Revert to default thresholds**

```bash
# Default thresholds (tested in production readiness UAT)
cat > /etc/stemedb/config.toml <20 items
```

- [ ] **False positive rate <20%**
  ```bash
  curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)'
  # approved/(approved+rejected) should be <0.2
  ```

- [ ] **Quarantine rate stabilized**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
  # Should be <10/min for pilot workloads
  ```

- [ ] **Legitimate assertions not quarantined**
  - Submit test assertion from known-good agent
  - Should immediately appear in dashboard (not quarantined)

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"

      - alert: StemeDBAgentFlooding
        # rate() is per-second, so multiply by 60 to compare against 50/min
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"

      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"
```

### Configuration Changes

**To prevent recurrence:**

1. **Agent flooding:** Tune circuit breaker thresholds (failure_rate, timeout)
2. **False positives:** Regularly review and adjust content defense thresholds based on approval/rejection rates
3. **Legitimate surges:** Create agent allowlist for backfill operations
4. **Review capacity:** Assign on-call rotation for quarantine review (aim for <24hr SLA)

**Example: Stricter circuit breaker**

```toml
# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3  # Open after 30% quarantine rate
timeout_seconds = 3600        # Ban for 1 hour
min_requests = 20             # Require 20 requests before evaluating
```

---

## Quarantine Dashboard Workflow

**Standard review procedure:**

1. **Open dashboard:** http://localhost:18188/quarantine
2. **Sort by agent:** Identify flooding patterns
3. **Review sample items:** Check assertion quality
4. **Batch action:**
   - If flooding → Ban agent via circuit breaker
   - If false positives → Approve batch + adjust thresholds
   - If legitimate → Approve individually or add to allowlist
5. **Document decision:** Add note to item before approve/reject

---

## Admin Endpoint Reference

⚠️ **CRITICAL WARNING:** Admin endpoints have NO authentication. They must be restricted to the internal network only.

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/v1/admin/quarantine` | GET | List all quarantine items |
| `/v1/admin/quarantine?agent_id={id}` | GET | Filter by agent |
| `/v1/admin/quarantine/{id}/approve` | POST | Promote item to main store |
| `/v1/admin/quarantine/{id}/reject` | POST | Permanently reject item |
| `/v1/admin/circuit_breakers` | GET | List all circuit breaker states |
| `/v1/admin/circuit_breakers/{id}/open` | POST | Manually ban agent |
| `/v1/admin/circuit_breakers/{id}/reset` | POST | Unban agent |
| `/v1/admin/content_defense/thresholds` | GET | Current thresholds |
| `/v1/admin/content_defense/allowlist` | POST | Add agent to allowlist |

---

## Related Runbooks

- [Circuit Breaker Stuck](./circuit-breaker-stuck.md) - Agent ban management
- [High Query Latency](./high-query-latency.md) - Performance impact of large quarantine
- [Server Won't Start](./server-wont-start.md) - Disk full from quarantine overflow

---

## Last Updated

2026-02-11