# Runbook: Quarantine Overflow
## Symptom
- Quarantine dashboard panel shows 100+ pending items
- Admins receive alerts that the `stemedb_quarantine_pending` metric is high
- Legitimate assertions are being quarantined (false positives)
- A single agent is flooding the quarantine queue

**Metrics Alerts:**
- `stemedb_quarantine_pending` > 100 for 10 minutes
- `stemedb_quarantine_rate_per_agent` > 50/min for a single agent
---
## Quick Diagnosis
```
Quarantine overflow
├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
│   └─► Single agent? → §1 Single Agent Flooding
├─► Check: Are items "Duplicate" or "LowQuality"?
│   └─► Multiple agents, varied reasons → §2 Multiple Agents (False Positives)
├─► Check: Recent system changes?
│   └─► Content defense tuned too aggressive → §3 Content Defense Too Aggressive
└─► Check: Legitimate surge (e.g., new data source)?
    └─► Expected behavior → §4 Legitimate Surge
```
---
## Common Causes
1. **Single agent flooding** (likelihood: **45%**)
   - Misconfigured agent
   - Agent in retry loop
   - Malicious actor testing limits
2. **Content defense too aggressive** (likelihood: **25%**)
   - Recently tuned thresholds
   - False positive rate high
   - Quality scoring bugs
3. **Multiple agents with low-quality data** (likelihood: **20%**)
   - Integration issues
   - Bad data sources
   - Extraction pipeline bugs
4. **Legitimate surge** (likelihood: **10%**)
   - New data source onboarded
   - Backfill operation
   - Expected high-volume event
---
## Resolution Steps
### §1. Single Agent Flooding
**Diagnostic:**
```bash
# List quarantine items grouped by agent
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'
# Expected output (flooding):
# [
# {"agent": "8f3a2b1c...", "count": 487}, <-- Flooding!
# {"agent": "7d2e5f9a...", "count": 12},
# {"agent": "6c1b4a8e...", "count": 8}
# ]
# Check agent's recent assertions
curl "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'
# Check circuit breaker status for this agent
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'
```
**Resolution: Ban agent via circuit breaker**
```bash
# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..." # Replace with actual agent ID
# Check current circuit breaker state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID
# Manually open circuit breaker (ban agent)
curl -X POST "http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open" \
  -H "Content-Type: application/json" \
  -d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'
# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}
# Verify agent now gets 429 responses
curl -X POST http://localhost:18180/v1/assert \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN
```
**Bulk reject all items from flooding agent:**
```bash
# Get all quarantine item IDs from flooding agent
# (the URL is quoted so the shell does not treat "?" as a glob character)
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')
# Batch reject
for id in $ITEM_IDS; do
  curl -X POST "http://localhost:18180/v1/admin/quarantine/$id/reject" \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done
# Verify quarantine count reduced
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
```
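
Because the reject endpoint removes items permanently, it can help to preview the batch before sending it. The following is a minimal sketch that wraps the same reject call in a function reading item IDs from stdin and defaulting to a dry run. The names `reject_items`, `DRY_RUN`, and `STEMEDB_URL` are hypothetical helpers introduced here for illustration; they are not part of StemeDB.

```bash
# Hypothetical helper: reject quarantine items whose IDs arrive on stdin.
# DRY_RUN defaults to 1, so nothing is sent unless DRY_RUN=0 is set explicitly.
# STEMEDB_URL is an assumed override for the base URL used in this runbook.
reject_items() {
  local base="${STEMEDB_URL:-http://localhost:18180}"
  local id
  while IFS= read -r id; do
    [ -n "$id" ] || continue                 # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would reject $id"
    else
      # -f makes curl return non-zero on HTTP errors so failures are reported
      curl -fsS -X POST "$base/v1/admin/quarantine/$id/reject" \
        -H "Content-Type: application/json" \
        -d '{"reason": "agent_flooding"}' \
        || echo "failed to reject $id" >&2
    fi
  done
}
```

To preview, pipe the IDs in with `echo "$ITEM_IDS" | reject_items`; once the list looks right, rerun as `echo "$ITEM_IDS" | DRY_RUN=0 reject_items`.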
**If failed:** Agent bypassing circuit breaker → Check if using different keys. May need firewall-level ban.
---
### §2. Multiple Agents (False Positives)
**Diagnostic:**
```bash
# Check quarantine reasons
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'
# Expected output:
# [
# {"reason": "LowQuality", "count": 87},
# {"reason": "UntrustedHighConfidence", "count": 34},
# {"reason": "Duplicate", "count": 12}
# ]
# Sample items from each reason
# (collect the matches into an array first; slicing individual objects with .[0:3] fails)
curl http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'
```
**Resolution: Tune content defense thresholds**
⚠️ **NOTE:** Requires restart to apply new thresholds.
```bash
# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds
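# Adjust quality threshold (example: lower from 0.7 to 0.5)
# Note: an export in this shell does not reach a systemd-managed service;
# set the variable in the unit's environment if the service runs under systemd
export STEMEDB_QUALITY_THRESHOLD=0.5
# Or in config file /etc/stemedb/config.toml
# (if a [content_defense] table already exists, edit it in place: TOML rejects duplicate tables)
cat >> /etc/stemedb/config.toml <<EOF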
[content_defense]
quality_threshold = 0.5
confidence_threshold = 0.9 # Raised from 0.8 to reduce false positives
duplicate_lookback_hours = 24
EOF
# Restart server
sudo systemctl restart stemedb-api
# Verify new thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds
```
**Batch approve legitimate items:**
```bash
# Sample and approve items manually (for known-good agents)
curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[] | select(.agent_id == "KNOWN_GOOD_AGENT") | .id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve
# Verify items promoted
curl http://localhost:18180/metrics | grep stemedb_quarantine_approved_total
```
**If failed:** False positives persist after tuning → Review quality scoring logic. This may be a bug in ContentDefenseLayer.
---
### §3. Content Defense Too Aggressive
**Diagnostic:**
```bash
# Check false positive rate
curl -s http://localhost:18180/metrics | grep -E '(quarantine_total|quarantine_approved_total|quarantine_rejected_total)'
# Calculate false positive rate:
# FP_rate = quarantine_approved_total / (quarantine_approved_total + quarantine_rejected_total)
# If FP_rate > 30%, content defense is too aggressive
# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
```
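
The false positive formula above can be computed directly from the metrics output instead of by hand. This is a minimal sketch that assumes the endpoint serves the standard Prometheus text format, that the two counters carry the names used in this runbook, and that each sample line ends with its value (no trailing timestamp). The function name `fp_rate` is a hypothetical helper, not a StemeDB command.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# approved / (approved + rejected) with three decimals, or "n/a" if no reviews.
fp_rate() {
  awk '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)            # drop any {label="..."} suffix
      # series with different labels are summed together
      if (name == "stemedb_quarantine_approved_total") approved += $NF
      if (name == "stemedb_quarantine_rejected_total") rejected += $NF
    }
    END {
      total = approved + rejected
      if (total == 0) { print "n/a"; exit }
      printf "%.3f\n", approved / total
    }
  '
}
```

Usage: `curl -s http://localhost:18180/metrics | fp_rate`. A result above `0.300` corresponds to the ">30%" condition in the diagnostic.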
**Resolution: Revert to default thresholds**
```bash
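# Default thresholds (tested in production readiness UAT)
# WARNING: the redirect below overwrites the entire config file. Back it up first,
# and if the file holds other sections, edit [content_defense] in place instead.
cp /etc/stemedb/config.toml /etc/stemedb/config.toml.bak
cat > /etc/stemedb/config.toml <<EOF
[content_defense]
quality_threshold = 0.6
confidence_threshold = 0.85
duplicate_lookback_hours = 48
untrusted_confidence_threshold = 0.95
EOF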
sudo systemctl restart stemedb-api
# Monitor quarantine rate
watch -n 10 'curl -s http://localhost:18180/metrics | grep quarantine_pending'
```
**If failed:** Even defaults too aggressive → May indicate upstream data quality issues. Review agent implementations.
---
### §4. Legitimate Surge
**Diagnostic:**
```bash
# Check if surge is expected
# - Recent data source onboarding?
# - Backfill operation in progress?
# - Known high-volume event?
# Check quarantine rate over time
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
# Compare to historical baseline (if available)
# If current rate 10x baseline → surge likely
# Check assertion rate (should also be high)
curl http://localhost:18180/metrics | grep stemedb_ingest_rate_per_minute
```
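
The "10x baseline" comparison above can be scripted once a baseline figure is known. The sketch below assumes Prometheus text-format output in which each sample line ends with its value; `metric_value` and `is_surge` are hypothetical helper names, and treating "at least 10x" as the cutoff is an assumption about how the rule of thumb should be applied.

```bash
# Hypothetical helper: print the value of one metric from text-format metrics
# on stdin (summing across label sets); returns non-zero if it is absent.
metric_value() {
  awk -v want="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)            # drop any {label="..."} suffix
      if (name == want) { sum += $NF; found = 1 }
    }
    END {
      if (!found) exit 1
      print sum
    }
  '
}

# Hypothetical helper: succeed when CURRENT >= FACTOR * BASELINE.
# FACTOR defaults to 10; a zero or negative baseline never counts as a surge.
is_surge() {
  local current="$1" baseline="$2" factor="${3:-10}"
  awk -v c="$current" -v b="$baseline" -v f="$factor" \
    'BEGIN { exit !(b > 0 && c >= b * f) }'
}
```

Usage: capture the current rate with `current=$(curl -s http://localhost:18180/metrics | metric_value stemedb_quarantine_rate_per_minute)`, then run `is_surge "$current" "$BASELINE" && echo "surge likely"`, where `BASELINE` is a value taken from your own historical data.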
**Resolution: Increase quarantine review capacity**
```bash
# Option 1: Batch approve known-good patterns
# (Example: Approve all items from trusted agent during backfill)
TRUSTED_AGENT="known-backfill-agent-id"
curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$TRUSTED_AGENT" | jq -r '.items[].id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve
# Option 2: Temporarily disable content defense for trusted agents
# (Add to agent allowlist)
curl -X POST http://localhost:18180/v1/admin/content_defense/allowlist \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "'"$TRUSTED_AGENT"'", "expires_at": "2026-02-12T00:00:00Z", "reason": "backfill_operation"}'
# Option 3: Scale review team (manual triage)
# Assign additional staff to review quarantine dashboard
```
**If failed:** Surge overwhelming even with increased capacity → Consider pausing ingest, scaling infrastructure, or auto-approving low-risk items.
---
## Validation
After applying resolution, validate quarantine is manageable:
- [ ] **Quarantine count <50**
```bash
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
# Should be <50
```
- [ ] **No single agent dominating**
```bash
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map(length) | max'
# No agent should have >20 items
```
- [ ] **False positive rate <20%**
```bash
curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)'
# approved/(approved+rejected) should be <0.2
```
- [ ] **Quarantine rate stabilized**
```bash
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
# Should be <10/min for pilot workloads
```
- [ ] **Legitimate assertions not quarantined**
  - Submit test assertion from known-good agent
  - Should immediately appear in dashboard (not quarantined)
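
The numeric checks in this list can be evaluated in one pass once the four values have been collected with the commands above. This is a minimal sketch using the thresholds stated in the checklist; `validate_quarantine` is a hypothetical helper name, and the caller is assumed to pass plain numbers.

```bash
# Hypothetical helper: print PASS/FAIL for each numeric validation check and
# return non-zero if any check fails.
# Arguments: pending count, max items held by one agent,
#            false positive rate (0..1), quarantine rate per minute.
validate_quarantine() {
  local pending="$1" max_per_agent="$2" fp_rate="$3" rate_per_min="$4"
  awk -v p="$pending" -v a="$max_per_agent" -v f="$fp_rate" -v r="$rate_per_min" '
    function check(label, ok) {
      printf "%s %s\n", (ok ? "PASS" : "FAIL"), label
      if (!ok) failed = 1
    }
    BEGIN {
      check("pending < 50", p < 50)
      check("max items per agent <= 20", a <= 20)
      check("false positive rate < 0.2", f < 0.2)
      check("quarantine rate < 10/min", r < 10)
      exit failed
    }
  '
}
```

For example, `validate_quarantine 12 5 0.1 3` prints four `PASS` lines and returns success. The final checklist item (a test assertion from a known-good agent) still needs a manual check.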
---
## Prevention
### Monitoring
**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"
      - alert: StemeDBAgentFlooding
        # rate() returns a per-second value, so multiply by 60 to compare against 50/min
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"
      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"
```
### Configuration Changes
**To prevent recurrence:**
1. **Agent flooding:** Tune circuit breaker thresholds (`failure_rate_threshold`, `timeout_seconds`, `min_requests`)
2. **False positives:** Regularly review and adjust content defense thresholds based on approval/rejection rates
3. **Legitimate surges:** Create agent allowlist for backfill operations
4. **Review capacity:** Assign on-call rotation for quarantine review (aim for <24hr SLA)
**Example: Stricter circuit breaker**
```toml
# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3 # Open after 30% quarantine rate
timeout_seconds = 3600 # Ban for 1 hour
min_requests = 20 # Require 20 requests before evaluating
```
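
To reason about how these settings interact before changing them, the sketch below models the decision in shell. It assumes the breaker evaluates an agent only after `min_requests` requests and opens once the quarantined share reaches `failure_rate_threshold`, as the config comments suggest; the exact comparison operators are an assumption and the real implementation may differ. `should_open` is a hypothetical function name.

```bash
# Hypothetical model of the circuit breaker decision (not StemeDB code).
# Arguments: total requests, quarantined requests,
#            failure_rate_threshold (default 0.3), min_requests (default 20).
# Succeeds (exit 0) when the breaker would open under the assumed rule.
should_open() {
  local requests="$1" quarantined="$2"
  local threshold="${3:-0.3}" min_requests="${4:-20}"
  awk -v r="$requests" -v q="$quarantined" -v t="$threshold" -v m="$min_requests" \
    'BEGIN { exit !(r > 0 && r >= m && q / r >= t) }'
}
```

Under this model, `should_open 20 7` succeeds (35% of 20 requests quarantined), while `should_open 19 19` does not, because the agent has not yet reached `min_requests`.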
---
## Quarantine Dashboard Workflow
**Standard review procedure:**
1. **Open dashboard:** http://localhost:18188/quarantine
2. **Sort by agent:** Identify flooding patterns
3. **Review sample items:** Check assertion quality
4. **Batch action:**
   - If flooding → Ban agent via circuit breaker
   - If false positives → Approve batch + adjust thresholds
   - If legitimate → Approve individually or add to allowlist
5. **Document decision:** Add note to item before approve/reject
---
## Admin Endpoint Reference
**CRITICAL WARNING:** Admin endpoints have NO authentication. Must be restricted to internal network only.
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/v1/admin/quarantine` | GET | List all quarantine items |
| `/v1/admin/quarantine?agent_id={id}` | GET | Filter by agent |
| `/v1/admin/quarantine/{id}/approve` | POST | Promote item to main store |
| `/v1/admin/quarantine/{id}/reject` | POST | Permanently reject item |
| `/v1/admin/circuit_breakers` | GET | List all circuit breaker states |
| `/v1/admin/circuit_breakers/{id}/open` | POST | Manually ban agent |
| `/v1/admin/circuit_breakers/{id}/reset` | POST | Unban agent |
| `/v1/admin/content_defense/thresholds` | GET | Current thresholds |
| `/v1/admin/content_defense/allowlist` | POST | Add agent to allowlist |
---
## Related Runbooks
- [Circuit Breaker Stuck](./circuit-breaker-stuck.md) - Agent ban management
- [High Query Latency](./high-query-latency.md) - Performance impact of large quarantine
- [Server Won't Start](./server-wont-start.md) - Disk full from quarantine overflow
---
## Last Updated
2026-02-11