# Runbook: Quarantine Overflow

## Symptom

- The quarantine dashboard panel shows 100+ pending items
- Admins receive alerts that the `stemedb_quarantine_pending` metric is high
- Legitimate assertions are being quarantined (false positives)
- A single agent is flooding the quarantine queue

**Metrics Alerts:**
- `stemedb_quarantine_pending` > 100 for 10 minutes
- `stemedb_quarantine_rate_per_agent` > 50/min for a single agent

---

## Quick Diagnosis

```
Quarantine overflow
│
├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
│   └─► Single agent? → §1 Single Agent Flooding
│
├─► Check: Are items "Duplicate" or "LowQuality"?
│   └─► Multiple agents, varied reasons → §2 Multiple Agents
│
├─► Check: Recent system changes?
│   └─► Content defense tuned too aggressive → §3 Content Defense Too Aggressive
│
└─► Check: Legitimate surge (e.g., new data source)?
    └─► Expected behavior → §4 Legitimate Surge
```

---

## Common Causes

1. **Single agent flooding** (likelihood: **45%**)
   - Misconfigured agent
   - Agent in retry loop
   - Malicious actor testing limits

2. **Content defense too aggressive** (likelihood: **25%**)
   - Recently tuned thresholds
   - False positive rate high
   - Quality scoring bugs

3. **Multiple agents with low-quality data** (likelihood: **20%**)
   - Integration issues
   - Bad data sources
   - Extraction pipeline bugs

4. **Legitimate surge** (likelihood: **10%**)
   - New data source onboarded
   - Backfill operation
   - Expected high-volume event

---

## Resolution Steps

### §1. Single Agent Flooding

**Diagnostic:**
```bash
# List the top five agents by number of quarantined items
curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'

# Expected output (flooding):
# [
#   {"agent": "8f3a2b1c...", "count": 487},   <-- Flooding!
#   {"agent": "7d2e5f9a...", "count": 12},
#   {"agent": "6c1b4a8e...", "count": 8}
# ]

# Check agent's recent assertions
# (the URL is quoted so the shell does not treat "?" as a glob character)
curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'

# Check circuit breaker status for this agent
curl -s http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'
```

**Resolution: Ban agent via circuit breaker**

```bash
# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..."  # Replace with actual agent ID

# Check current circuit breaker state
curl "http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID"

# Manually open circuit breaker (ban agent)
curl -X POST "http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open" \
  -H "Content-Type: application/json" \
  -d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'

# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}

# Verify agent now gets 429 responses (-i prints the status line and headers)
curl -i -X POST http://localhost:18180/v1/assert \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN
```

**Bulk reject all items from flooding agent:**

```bash
# Get all quarantine item IDs from flooding agent
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')

# Batch reject
for id in $ITEM_IDS; do
  curl -X POST "http://localhost:18180/v1/admin/quarantine/$id/reject" \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done

# Verify quarantine count reduced
curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | length'
```

**If failed:** Agent bypassing circuit breaker → Check if using different keys. May need firewall-level ban.

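
The bulk-reject loop above can be wrapped in a small function with a dry-run switch, so the affected items can be reviewed before anything is rejected. This is a minimal sketch and not part of StemeDB: `reject_items`, `BASE_URL`, and `DRY_RUN` are hypothetical names, while the endpoint and JSON body are the ones used elsewhere in this runbook.

```bash
# Hypothetical helper: BASE_URL and DRY_RUN are illustrative, not StemeDB settings.
BASE_URL="${BASE_URL:-http://localhost:18180}"

# reject_items REASON
# Reads quarantine item IDs on stdin (one per line) and POSTs a reject for each.
# With DRY_RUN=1 it only prints what it would do.
# REASON is interpolated into JSON as-is, so it must not contain quotes.
reject_items() {
  local reason="$1" id
  while IFS= read -r id; do
    [ -n "$id" ] || continue              # skip blank lines
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would reject $id ($reason)"
      continue
    fi
    curl -sf -X POST "$BASE_URL/v1/admin/quarantine/$id/reject" \
      -H "Content-Type: application/json" \
      -d "{\"reason\": \"$reason\"}" > /dev/null \
      || echo "failed to reject $id" >&2  # keep going, but report the failure
  done
}

# Usage (review first, then rerun without DRY_RUN=1):
#   curl -s "$BASE_URL/v1/admin/quarantine?agent_id=$AGENT_ID" \
#     | jq -r '.items[].id' | DRY_RUN=1 reject_items agent_flooding
```
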
---

### §2. Multiple Agents (False Positives)

**Diagnostic:**
```bash
# Check quarantine reasons
curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

# Expected output:
# [
#   {"reason": "LowQuality", "count": 87},
#   {"reason": "UntrustedHighConfidence", "count": 34},
#   {"reason": "Duplicate", "count": 12}
# ]

# Sample items from each reason
# (collect the matches into an array before slicing; slicing a single object fails)
curl -s http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'
```

**Resolution: Tune content defense thresholds**

⚠️ **NOTE:** Requires restart to apply new thresholds.

```bash
# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

# Adjust quality threshold (example: lower from 0.7 to 0.5)
export STEMEDB_QUALITY_THRESHOLD=0.5

# Or in config file /etc/stemedb/config.toml.
# Appending assumes the file has no [content_defense] table yet; if it does,
# edit the existing table instead, because TOML does not allow a table to be
# defined twice.
sudo tee -a /etc/stemedb/config.toml > /dev/null <<EOF
[content_defense]
quality_threshold = 0.5
confidence_threshold = 0.9  # Raised from 0.8 to reduce false positives
duplicate_lookback_hours = 24
EOF

# Restart server
sudo systemctl restart stemedb-api

# Verify new thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds
```

**Batch approve legitimate items:**

```bash
# Sample and approve items manually (for known-good agents)
# (jq -r emits raw IDs without JSON quotes)
curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[] | select(.agent_id == "KNOWN_GOOD_AGENT") | .id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Verify items promoted
curl -s http://localhost:18180/metrics | grep stemedb_quarantine_approved_total
```

**If failed:** False positives persist after tuning → Review quality scoring logic. May be bug in ContentDefenseLayer.

---

### §3. Content Defense Too Aggressive

**Diagnostic:**
```bash
# Check false positive rate (the rejected counter is needed for the formula below)
curl -s http://localhost:18180/metrics | grep -E 'quarantine_(total|approved_total|rejected_total)'

# Calculate false positive rate:
# FP_rate = quarantine_approved_total / (quarantine_approved_total + quarantine_rejected_total)

# If FP_rate >30%, content defense is too aggressive

# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
```

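
The false positive calculation described above can be scripted against the `/metrics` output instead of being worked out by hand. This is a minimal sketch, assuming both counters are exposed under exactly the names `stemedb_quarantine_approved_total` and `stemedb_quarantine_rejected_total` with no labels; `fp_rate` is a hypothetical helper name.

```bash
# fp_rate: read Prometheus text-format metrics on stdin and print
# approved / (approved + rejected) to three decimal places.
# Assumption: each counter appears once, without labels. Labelled series
# would have to be summed before dividing.
fp_rate() {
  awk '
    $1 == "stemedb_quarantine_approved_total" { approved = $2 }
    $1 == "stemedb_quarantine_rejected_total" { rejected = $2 }
    END {
      total = approved + rejected
      if (total == 0) { print "n/a"; exit }   # nothing reviewed yet
      printf "%.3f\n", approved / total
    }'
}

# Usage: a result above 0.300 matches the "too aggressive" rule above.
#   curl -s http://localhost:18180/metrics | fp_rate
```
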
**Resolution: Revert to default thresholds**

```bash
# Default thresholds (tested in production readiness UAT)
# WARNING: this overwrites the entire config file. Back it up first and
# re-add any other sections (for example [circuit_breaker]) afterwards.
sudo cp /etc/stemedb/config.toml /etc/stemedb/config.toml.bak
sudo tee /etc/stemedb/config.toml > /dev/null <<EOF
[content_defense]
quality_threshold = 0.6
confidence_threshold = 0.85
duplicate_lookback_hours = 48
untrusted_confidence_threshold = 0.95
EOF

sudo systemctl restart stemedb-api

# Monitor quarantine rate
watch -n 10 'curl -s http://localhost:18180/metrics | grep quarantine_pending'
```

**If failed:** Even defaults too aggressive → May indicate upstream data quality issues. Review agent implementations.

---

### §4. Legitimate Surge

**Diagnostic:**
```bash
# Check if surge is expected
# - Recent data source onboarding?
# - Backfill operation in progress?
# - Known high-volume event?

# Check quarantine rate over time
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute

# Compare to historical baseline (if available)
# If current rate 10x baseline → surge likely

# Check assertion rate (should also be high)
curl http://localhost:18180/metrics | grep stemedb_ingest_rate_per_minute
```

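
The 10x rule of thumb above can be expressed as a small check. This is only a sketch: `is_surge` is a hypothetical helper, and both rates are passed in by hand because this runbook does not say where a historical baseline is stored.

```bash
# is_surge CURRENT BASELINE [FACTOR]
# Exits 0 when CURRENT is at least FACTOR times BASELINE (FACTOR defaults
# to 10, per the rule of thumb above). Rates are per minute and may be
# fractional. A zero or missing baseline never counts as a surge.
is_surge() {
  awk -v cur="$1" -v base="$2" -v factor="${3:-10}" \
    'BEGIN { exit !(base > 0 && cur >= factor * base) }'
}

# Usage, with a current rate of 120/min against a baseline of 10/min:
#   if is_surge 120 10; then echo "surge likely, see Resolution below"; fi
```
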
**Resolution: Increase quarantine review capacity**

```bash
# Option 1: Batch approve known-good patterns
# (Example: Approve all items from trusted agent during backfill)
TRUSTED_AGENT="known-backfill-agent-id"

curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$TRUSTED_AGENT" | jq -r '.items[].id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Option 2: Temporarily disable content defense for trusted agents
# (Add to agent allowlist)
curl -X POST http://localhost:18180/v1/admin/content_defense/allowlist \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "'"$TRUSTED_AGENT"'", "expires_at": "2026-02-12T00:00:00Z", "reason": "backfill_operation"}'

# Option 3: Scale review team (manual triage)
# Assign additional staff to review quarantine dashboard
```

**If failed:** Surge overwhelming even with increased capacity → Consider pausing ingest, scaling infrastructure, or auto-approving low-risk items.

---

## Validation

After applying resolution, validate quarantine is manageable:

- [ ] **Quarantine count <50**
  ```bash
  curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
  # Should be <50
  ```

- [ ] **No single agent dominating**
  ```bash
  curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map(length) | max'
  # No agent should have >20 items
  ```

- [ ] **False positive rate <20%**
  ```bash
  curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)'
  # approved/(approved+rejected) should be <0.2
  ```

- [ ] **Quarantine rate stabilized**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
  # Should be <10/min for pilot workloads
  ```

- [ ] **Legitimate assertions not quarantined**
  - Submit test assertion from known-good agent
  - Should immediately appear in dashboard (not quarantined)

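
The metric-based checks in the list above can be run against a single `/metrics` snapshot. This is a minimal sketch with several assumptions: it uses the `stemedb_quarantine_pending` gauge as a stand-in for the item count returned by the admin endpoint, it expects both metrics to be exposed without labels, and `metric_value`, `check_below`, and `check_quarantine_health` are hypothetical helper names. The limits (50 and 10) are the ones from the checklist.

```bash
# metric_value NAME: print the value of an unlabelled metric from
# Prometheus text-format input on stdin (prints nothing if absent).
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# check_below LABEL VALUE LIMIT: report OK when VALUE is present and < LIMIT.
check_below() {
  if [ -n "$2" ] && awk -v v="$2" -v lim="$3" 'BEGIN { exit !(v < lim) }'; then
    echo "OK   $1=$2 (<$3)"
  else
    echo "FAIL $1=${2:-missing} (limit $3)"
    return 1
  fi
}

# check_quarantine_health: read a metrics snapshot on stdin, run both
# checks, and return non-zero if either fails.
check_quarantine_health() {
  local snapshot status=0
  snapshot=$(cat)
  check_below pending \
    "$(printf '%s\n' "$snapshot" | metric_value stemedb_quarantine_pending)" 50 || status=1
  check_below rate_per_minute \
    "$(printf '%s\n' "$snapshot" | metric_value stemedb_quarantine_rate_per_minute)" 10 || status=1
  return "$status"
}

# Usage:
#   curl -s http://localhost:18180/metrics | check_quarantine_health
```
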
---

## Prevention

### Monitoring

**Set up alerts for:**

```yaml
# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"

      - alert: StemeDBAgentFlooding
        # rate() is per second, so multiply by 60 to compare against 50/min
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"

      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"
```

### Configuration Changes

**To prevent recurrence:**

1. **Agent flooding:** Tune circuit breaker thresholds (failure_rate, timeout)
2. **False positives:** Regularly review and adjust content defense thresholds based on approval/rejection rates
3. **Legitimate surges:** Create agent allowlist for backfill operations
4. **Review capacity:** Assign on-call rotation for quarantine review (aim for <24hr SLA)

**Example: Stricter circuit breaker**
```toml
# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3  # Open after 30% quarantine rate
timeout_seconds = 3600        # Ban for 1 hour
min_requests = 20             # Require 20 requests before evaluating
```

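
As a worked example of how these settings might interact, the sketch below encodes one plausible reading: the breaker opens once an agent has made at least `min_requests` requests and the quarantined fraction has reached `failure_rate_threshold`. This reading is an assumption. The runbook does not specify the evaluation window or whether the comparison is inclusive, and `should_open` is a hypothetical helper, not StemeDB code.

```bash
# should_open QUARANTINED TOTAL
# Exits 0 when, under the assumed rule, a breaker configured as above would
# trip: TOTAL >= min_requests and QUARANTINED / TOTAL >= failure_rate_threshold.
should_open() {
  awk -v q="$1" -v n="$2" -v threshold=0.3 -v min_requests=20 \
    'BEGIN { exit !(n >= min_requests && q / n >= threshold) }'
}

# Examples:
#   should_open 7 20    -> trips (35% of 20 requests)
#   should_open 5 20    -> does not trip (25%)
#   should_open 10 10   -> does not trip (fewer than 20 requests)
```
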
---

## Quarantine Dashboard Workflow

**Standard review procedure:**

1. **Open dashboard:** http://localhost:18188/quarantine
2. **Sort by agent:** Identify flooding patterns
3. **Review sample items:** Check assertion quality
4. **Batch action:**
   - If flooding → Ban agent via circuit breaker
   - If false positives → Approve batch + adjust thresholds
   - If legitimate → Approve individually or add to allowlist
5. **Document decision:** Add note to item before approve/reject

---

## Admin Endpoint Reference

⚠️ **CRITICAL WARNING:** Admin endpoints have NO authentication. Must be restricted to internal network only.

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/v1/admin/quarantine` | GET | List all quarantine items |
| `/v1/admin/quarantine?agent_id={id}` | GET | Filter by agent |
| `/v1/admin/quarantine/{id}/approve` | POST | Promote item to main store |
| `/v1/admin/quarantine/{id}/reject` | POST | Permanently reject item |
| `/v1/admin/circuit_breakers` | GET | List all circuit breaker states |
| `/v1/admin/circuit_breakers/{id}/open` | POST | Manually ban agent |
| `/v1/admin/circuit_breakers/{id}/reset` | POST | Unban agent |
| `/v1/admin/content_defense/thresholds` | GET | Current thresholds |
| `/v1/admin/content_defense/allowlist` | POST | Add agent to allowlist |

---

## Related Runbooks

- [Circuit Breaker Stuck](./circuit-breaker-stuck.md) - Agent ban management
- [High Query Latency](./high-query-latency.md) - Performance impact of large quarantine
- [Server Won't Start](./server-wont-start.md) - Disk full from quarantine overflow

---

## Last Updated

2026-02-11