stemedb/docs/operations/runbooks/quarantine-overflow.md

# Runbook: Quarantine Overflow

## Symptom

- Quarantine dashboard panel shows 100+ pending items
- Admins receiving alerts that the `stemedb_quarantine_pending` metric is high
- Legitimate assertions getting quarantined (false positives)
- Single agent flooding quarantine queue

**Metrics Alerts:**
- `stemedb_quarantine_pending` > 100 for 10 minutes
- `stemedb_quarantine_rate_per_agent` > 50/min for single agent

---

## Quick Diagnosis

```
Quarantine overflow
    │
    ├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
    │   └─► Single agent? → §1 Single Agent Flooding
    │
    ├─► Check: Are items "Duplicate" or "LowQuality"?
    │   └─► Multiple agents, varied reasons → §2 Multiple Agents
    │
    ├─► Check: Recent system changes?
    │   └─► Content defense tuned too aggressively → §3 Content Defense Too Aggressive
    │
    └─► Check: Legitimate surge (e.g., new data source)?
        └─► Expected behavior → §4 Legitimate Surge
```
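
The first branch of the tree comes down to counting items per agent. As a minimal sketch, the helper below does that count with standard tools only. The function name `top_agents` is hypothetical, and it assumes the agent IDs arrive one per line (for example from `jq -r '.items[].agent_id'`).

```bash
# Count quarantine items per agent from a list of agent IDs on stdin
# (one ID per line) and print the top offenders as "<count> <agent_id>".
# top_agents is an illustrative helper, not part of StemeDB.
top_agents() {
  local limit="${1:-5}"
  sort | uniq -c | sort -rn | head -n "$limit" | awk '{print $1, $2}'
}

# Usage against a live server (assumes jq is installed):
#   curl -s http://localhost:18180/v1/admin/quarantine \
#     | jq -r '.items[].agent_id' | top_agents 5
```

If one agent's count dwarfs the rest, continue with §1. A flat distribution points to §2 or §3.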

---

## Common Causes

1. **Single agent flooding** — Likelihood: **45%**
   - Misconfigured agent
   - Agent in retry loop
   - Malicious actor testing limits

2. **Content defense too aggressive** — Likelihood: **25%**
   - Recently tuned thresholds
   - False positive rate high
   - Quality scoring bugs

3. **Multiple agents with low-quality data** — Likelihood: **20%**
   - Integration issues
   - Bad data sources
   - Extraction pipeline bugs

4. **Legitimate surge** — Likelihood: **10%**
   - New data source onboarded
   - Backfill operation
   - Expected high-volume event

---

## Resolution Steps

### §1. Single Agent Flooding

**Diagnostic:**
```bash
# List quarantine items grouped by agent
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'

# Expected output (flooding):
# [
#   {"agent": "8f3a2b1c...", "count": 487},  <-- Flooding!
#   {"agent": "7d2e5f9a...", "count": 12},
#   {"agent": "6c1b4a8e...", "count": 8}
# ]

# Check agent's recent assertions
curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'

# Check circuit breaker status for this agent
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'
```

**Resolution: Ban agent via circuit breaker**

```bash
# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..."  # Replace with actual agent ID

# Check current circuit breaker state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID

# Manually open circuit breaker (ban agent)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open \
  -H "Content-Type: application/json" \
  -d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'

# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}

# Verify agent now gets 429 responses (-i prints the response headers)
curl -i -X POST http://localhost:18180/v1/assert \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN
```

**Bulk reject all items from flooding agent:**

```bash
# Get all quarantine item IDs from flooding agent
# (URL is quoted so the shell does not treat "?" as a glob character)
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')

# Batch reject
for id in $ITEM_IDS; do
  curl -s -X POST "http://localhost:18180/v1/admin/quarantine/$id/reject" \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done

# Verify quarantine count reduced
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
```

**If failed:** Agent bypassing circuit breaker → Check if using different keys. May need firewall-level ban.
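
Because rejection is permanent, it can help to preview a bulk reject before running it. The following is a sketch only: `bulk_reject`, `BASE_URL`, and `DRY_RUN` are hypothetical names introduced here, and the request shape is copied from the reject call above.

```bash
# Reject quarantine items whose IDs arrive on stdin, one per line.
# DRY_RUN defaults to 1, so nothing is sent unless DRY_RUN=0 is set.
# bulk_reject, BASE_URL and DRY_RUN are illustrative names.
BASE_URL="${BASE_URL:-http://localhost:18180}"

bulk_reject() {
  local reason="$1" count=0 id
  while IFS= read -r id; do
    [ -n "$id" ] || continue            # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would reject $id (reason: $reason)"
    else
      curl -sf -X POST "$BASE_URL/v1/admin/quarantine/$id/reject" \
        -H "Content-Type: application/json" \
        -d "{\"reason\": \"$reason\"}" >/dev/null \
        || echo "failed to reject $id" >&2
    fi
    count=$((count + 1))
  done
  echo "processed $count item(s)"
}

# Usage:
#   echo "$ITEM_IDS" | bulk_reject agent_flooding            # preview
#   echo "$ITEM_IDS" | DRY_RUN=0 bulk_reject agent_flooding  # execute
```

Review the preview output against the dashboard before switching to `DRY_RUN=0`.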

---

### §2. Multiple Agents (False Positives)

**Diagnostic:**
```bash
# Check quarantine reasons
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

# Expected output:
# [
#   {"reason": "LowQuality", "count": 87},
#   {"reason": "UntrustedHighConfidence", "count": 34},
#   {"reason": "Duplicate", "count": 12}
# ]

# Sample items from each reason
# (collect matches into an array first; slicing the streamed objects directly fails)
curl -s http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'
```

**Resolution: Tune content defense thresholds**

⚠️ **NOTE:** Requires restart to apply new thresholds.

```bash
# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

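# Adjust quality threshold (example: lower from 0.7 to 0.5)
# Note: the variable must be set in the service's environment to take effect;
# a systemd unit does not inherit exports from your interactive shell.
export STEMEDB_QUALITY_THRESHOLD=0.5

# Or persist in /etc/stemedb/config.toml. If a [content_defense] table already
# exists, edit it in place instead: TOML rejects a table that is defined twice.
sudo tee -a /etc/stemedb/config.toml >/dev/null <<EOF
[content_defense]
quality_threshold = 0.5
confidence_threshold = 0.9  # Raised from 0.8 to reduce false positives
duplicate_lookback_hours = 24
EOF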

# Restart server
sudo systemctl restart stemedb-api

# Verify new thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds
```

**Batch approve legitimate items:**

```bash
# Sample and approve items manually (for known-good agents)
# (-r emits raw IDs without JSON quotes)
curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[] | select(.agent_id == "KNOWN_GOOD_AGENT") | .id' | xargs -I {} \
  curl -s -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Verify items promoted
curl http://localhost:18180/metrics | grep stemedb_quarantine_approved_total
```

**If failed:** False positives persist after tuning → Review quality scoring logic. May be a bug in `ContentDefenseLayer`.

---

### §3. Content Defense Too Aggressive

**Diagnostic:**
```bash
# Check false positive rate
curl -s http://localhost:18180/metrics | grep -E '(quarantine_approved_total|quarantine_rejected_total)'

# Calculate false positive rate:
# FP_rate = quarantine_approved_total / (quarantine_approved_total + quarantine_rejected_total)

# If FP_rate >30%, content defense is too aggressive

# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
```
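
The false positive rate formula above can be evaluated directly from the `/metrics` output. The sketch below assumes the counters are named `stemedb_quarantine_approved_total` and `stemedb_quarantine_rejected_total`, that the sample value is the last field on each line (no trailing timestamps), and that label values contain no spaces. `fp_rate` is a hypothetical helper name.

```bash
# Compute approved / (approved + rejected) from Prometheus text on stdin.
# Labelled series of the same counter are summed together.
fp_rate() {
  awk '
    /^#/ { next }                                   # skip HELP/TYPE lines
    $1 ~ /^stemedb_quarantine_approved_total/ { approved += $NF }
    $1 ~ /^stemedb_quarantine_rejected_total/ { rejected += $NF }
    END {
      total = approved + rejected
      if (total == 0) { print "n/a"; exit }         # nothing reviewed yet
      printf "%.3f\n", approved / total
    }'
}

# Usage:
#   curl -s http://localhost:18180/metrics | fp_rate
```

A result above `0.300` corresponds to the >30% threshold used in this section. Note that the counters are cumulative since process start, so the value reflects lifetime behavior rather than the last hour.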

**Resolution: Revert to default thresholds**

```bash
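# Default thresholds (tested in production readiness UAT).
# Back up the config, then set these values in the existing [content_defense]
# table. Do not overwrite the file with `cat >`: that would drop every other
# section, such as [circuit_breaker].
sudo cp /etc/stemedb/config.toml /etc/stemedb/config.toml.bak
sudoedit /etc/stemedb/config.toml
# [content_defense]
# quality_threshold = 0.6
# confidence_threshold = 0.85
# duplicate_lookback_hours = 48
# untrusted_confidence_threshold = 0.95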

sudo systemctl restart stemedb-api

# Monitor quarantine rate
watch -n 10 'curl -s http://localhost:18180/metrics | grep quarantine_pending'
```

**If failed:** Even defaults too aggressive → May indicate upstream data quality issues. Review agent implementations.

---

### §4. Legitimate Surge

**Diagnostic:**
```bash
# Check if surge is expected
# - Recent data source onboarding?
# - Backfill operation in progress?
# - Known high-volume event?

# Check quarantine rate over time
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute

# Compare to historical baseline (if available)
# If current rate 10x baseline → surge likely

# Check assertion rate (should also be high)
curl http://localhost:18180/metrics | grep stemedb_ingest_rate_per_minute
```
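
The baseline comparison is simple arithmetic. As a sketch, with `surge_check` as a hypothetical helper and the 10x cutoff taken from the comment above:

```bash
# Compare the current quarantine rate to a historical baseline.
# Args: current_rate baseline_rate (same unit, e.g. items per minute)
surge_check() {
  awk -v cur="$1" -v base="$2" 'BEGIN {
    if (base <= 0) { print "no baseline"; exit }
    ratio = cur / base
    printf "%.1fx baseline: %s\n", ratio, (ratio >= 10 ? "surge likely" : "within normal range")
  }'
}

# Usage:
#   surge_check 120 10
```

A high ratio alone does not prove the surge is legitimate. Confirm that the ingest rate rose in step and that the source of the volume is known.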

**Resolution: Increase quarantine review capacity**

```bash
# Option 1: Batch approve known-good patterns
# (Example: Approve all items from trusted agent during backfill)
TRUSTED_AGENT="known-backfill-agent-id"

curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$TRUSTED_AGENT" | jq -r '.items[].id' | xargs -I {} \
  curl -s -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Option 2: Temporarily disable content defense for trusted agents
# (Add to agent allowlist)
curl -X POST http://localhost:18180/v1/admin/content_defense/allowlist \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "'"$TRUSTED_AGENT"'", "expires_at": "2026-02-12T00:00:00Z", "reason": "backfill_operation"}'

# Option 3: Scale review team (manual triage)
# Assign additional staff to review quarantine dashboard
```

**If failed:** Surge overwhelming even with increased capacity → Consider pausing ingest, scaling infrastructure, or auto-approving low-risk items.

---

## Validation

After applying resolution, validate quarantine is manageable:

- [ ] **Quarantine count <50**
  ```bash
  curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
  # Should be <50
  ```

- [ ] **No single agent dominating**
  ```bash
  curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map(length) | max'
  # No agent should have >20 items
  ```

- [ ] **False positive rate <20%**
  ```bash
  curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)'
  # approved/(approved+rejected) should be <0.2
  ```

- [ ] **Quarantine rate stabilized**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
  # Should be <10/min for pilot workloads
  ```

- [ ] **Legitimate assertions not quarantined**
  - Submit test assertion from known-good agent
  - Should immediately appear in dashboard (not quarantined)
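
The metric-based items in this checklist can be checked in one pass over a saved `/metrics` dump. The sketch below assumes `stemedb_quarantine_pending` and `stemedb_quarantine_rate_per_minute` are exposed as un-labelled gauges. `metric_value` and `check_below` are hypothetical helper names.

```bash
# Print the value of an un-labelled metric from Prometheus text on stdin.
# Exits non-zero if the metric is absent.
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { if (!found) exit 1 }'
}

# Report OK/FAIL depending on whether the metric is strictly below a limit.
check_below() {
  local name="$1" limit="$2" value
  if ! value=$(metric_value "$name"); then
    echo "MISSING $name"
    return 1
  fi
  if awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v < l) }'; then
    echo "OK $name=$value (< $limit)"
  else
    echo "FAIL $name=$value (>= $limit)"
    return 1
  fi
}

# Usage:
#   metrics=$(curl -s http://localhost:18180/metrics)
#   printf '%s\n' "$metrics" | check_below stemedb_quarantine_pending 50
#   printf '%s\n' "$metrics" | check_below stemedb_quarantine_rate_per_minute 10
```

The per-agent and known-good-agent checks still need the admin API and a manual test submission.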

---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"

      - alert: StemeDBAgentFlooding
        # rate() is per second, so multiply by 60 to compare against 50/min
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"

      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"
```

### Configuration Changes

**To prevent recurrence:**

1. **Agent flooding:** Tune circuit breaker thresholds (`failure_rate_threshold`, `timeout_seconds`)
2. **False positives:** Regularly review and adjust content defense thresholds based on approval/rejection rates
3. **Legitimate surges:** Create agent allowlist for backfill operations
4. **Review capacity:** Assign on-call rotation for quarantine review (aim for <24hr SLA)

**Example: Stricter circuit breaker**
```toml
# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3  # Open after 30% quarantine rate
timeout_seconds = 3600  # Ban for 1 hour
min_requests = 20  # Require 20 requests before evaluating
```
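
To reason about how these settings interact, the sketch below models one plausible evaluation rule: ignore agents with fewer than `min_requests` requests, then open when the quarantined share reaches `failure_rate_threshold`. This is an assumption for illustration, not the server's confirmed logic. In particular, treating the threshold as inclusive is a guess, and `should_open` is a hypothetical name.

```bash
# Model whether a breaker would open under the example settings.
# Args: total_requests quarantined_requests [threshold] [min_requests]
should_open() {
  awk -v total="$1" -v bad="$2" -v thr="${3:-0.3}" -v min="${4:-20}" 'BEGIN {
    if (total == 0 || total < min) { print "closed (too few requests)"; exit }
    rate = bad / total
    print (rate >= thr ? "open" : "closed")
  }'
}

# Usage:
#   should_open 100 40    # 40% quarantined, above the 0.3 example threshold
#   should_open 10 9      # below min_requests, not evaluated
```

Running a few recent agents' counts through a model like this before changing the config gives a rough sense of how many would have been banned.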

---

## Quarantine Dashboard Workflow

**Standard review procedure:**

1. **Open dashboard:** http://localhost:18188/quarantine
2. **Sort by agent:** Identify flooding patterns
3. **Review sample items:** Check assertion quality
4. **Batch action:**
   - If flooding → Ban agent via circuit breaker
   - If false positives → Approve batch + adjust thresholds
   - If legitimate → Approve individually or add to allowlist
5. **Document decision:** Add note to item before approve/reject

---

## Admin Endpoint Reference

⚠️ **CRITICAL WARNING:** Admin endpoints have NO authentication. They must be reachable from the internal network only.

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/v1/admin/quarantine` | GET | List all quarantine items |
| `/v1/admin/quarantine?agent_id={id}` | GET | Filter by agent |
| `/v1/admin/quarantine/{id}/approve` | POST | Promote item to main store |
| `/v1/admin/quarantine/{id}/reject` | POST | Permanently reject item |
| `/v1/admin/circuit_breakers` | GET | List all circuit breaker states |
| `/v1/admin/circuit_breakers/{id}` | GET | Get circuit breaker state for one agent |
| `/v1/admin/circuit_breakers/{id}/open` | POST | Manually ban agent |
| `/v1/admin/circuit_breakers/{id}/reset` | POST | Unban agent |
| `/v1/admin/content_defense/thresholds` | GET | Current thresholds |
| `/v1/admin/content_defense/allowlist` | POST | Add agent to allowlist |

---

## Related Runbooks

- [Circuit Breaker Stuck](./circuit-breaker-stuck.md) - Agent ban management
- [High Query Latency](./high-query-latency.md) - Performance impact of large quarantine
- [Server Won't Start](./server-wont-start.md) - Disk full from quarantine overflow

---

## Last Updated

2026-02-11