stemedb/docs/operations/runbooks/quarantine-overflow.md

Runbook: Quarantine Overflow

Symptom

  • Quarantine dashboard panel shows 100+ pending items
  • Admins receive alerts that the "quarantine_pending" metric is high
  • Legitimate assertions getting quarantined (false positives)
  • Single agent flooding quarantine queue

Metrics Alerts:

  • stemedb_quarantine_pending > 100 for 10 minutes
  • stemedb_quarantine_rate_per_agent > 50/min for single agent

Quick Diagnosis

Quarantine overflow
    │
    ├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
    │   └─► Single agent? → §1 Single Agent Flooding
    │
    ├─► Check: Are items "Duplicate" or "LowQuality"?
    │   └─► Multiple agents, varied reasons → §2 Multiple Agents
    │
    ├─► Check: Recent system changes?
    │   └─► Content defense tuned too aggressive → §3 Content Defense Too Aggressive
    │
    └─► Check: Legitimate surge (e.g., new data source)?
        └─► Expected behavior → §4 Legitimate Surge

Common Causes

  1. Single agent flooding — Likelihood: 45%

    • Misconfigured agent
    • Agent in retry loop
    • Malicious actor testing limits
  2. Content defense too aggressive — Likelihood: 25%

    • Recently tuned thresholds
    • False positive rate high
    • Quality scoring bugs
  3. Multiple agents with low-quality data — Likelihood: 20%

    • Integration issues
    • Bad data sources
    • Extraction pipeline bugs
  4. Legitimate surge — Likelihood: 10%

    • New data source onboarded
    • Backfill operation
    • Expected high-volume event

Resolution Steps

§1. Single Agent Flooding

Diagnostic:

# List quarantine items grouped by agent
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'

# Expected output (flooding):
# [
#   {"agent": "8f3a2b1c...", "count": 487},  <-- Flooding!
#   {"agent": "7d2e5f9a...", "count": 12},
#   {"agent": "6c1b4a8e...", "count": 8}
# ]

# Check agent's recent assertions
curl "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'

# Check circuit breaker status for this agent
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'
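To judge how dominant the top agent is, it can help to express its count as a share of the whole queue. The following is a minimal sketch, not part of StemeDB: `top_agent_share` is a hypothetical helper that reads one agent ID per line and prints the busiest agent, its item count, and its percentage of all items (truncated to a whole number). It assumes agent IDs contain no whitespace.

```shell
# top_agent_share (hypothetical helper): reads one agent ID per line on stdin
# and prints "<agent> <count> <percent>" for the agent with the most items.
# Prints nothing when the input is empty.
top_agent_share() {
  sort | uniq -c | sort -rn | awk '
    # Every line is "<count> <agent>"; the first line is the largest count.
    { total += $1; if (NR == 1) { top = $2; n = $1 } }
    END { if (total > 0) printf "%s %d %d\n", top, n, (n * 100) / total }'
}
```

One possible way to feed it, assuming the response shape shown above: `curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[].agent_id' | top_agent_share`.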

Resolution: Ban agent via circuit breaker

# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..."  # Replace with actual agent ID

# Check current circuit breaker state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID

# Manually open circuit breaker (ban agent)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open \
  -H "Content-Type: application/json" \
  -d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'

# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}

# Verify agent now gets 429 responses
curl -X POST http://localhost:18180/v1/assert \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN

Bulk reject all items from flooding agent:

# Get all quarantine item IDs from flooding agent
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')

# Batch reject
for id in $ITEM_IDS; do
  curl -X POST http://localhost:18180/v1/admin/quarantine/$id/reject \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done

# Verify quarantine count reduced
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
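Because rejection is permanent, a dry run before the real batch may be worthwhile. The sketch below wraps the loop above in a hypothetical `reject_items` function that reads item IDs from stdin, prints the requests when `DRY_RUN=1`, and reports failed requests instead of silently continuing. `STEMEDB_URL` and `DRY_RUN` are illustrative variable names, not StemeDB settings.

```shell
# reject_items [REASON] (hypothetical helper): reads quarantine item IDs on
# stdin and POSTs a reject for each one. The function body runs in a subshell
# so its variables do not leak into the caller.
#   DRY_RUN=1    print the requests instead of sending them (assumed name)
#   STEMEDB_URL  base URL, defaults to the address used in this runbook
reject_items() (
  base="${STEMEDB_URL:-http://localhost:18180}"
  reason="${1:-agent_flooding}"
  failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue            # skip blank lines
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "POST $base/v1/admin/quarantine/$id/reject reason=$reason"
      continue
    fi
    # -f makes curl return non-zero on HTTP errors so failures are counted.
    if ! curl -fsS -o /dev/null -X POST "$base/v1/admin/quarantine/$id/reject" \
        -H "Content-Type: application/json" \
        -d "{\"reason\": \"$reason\"}"; then
      echo "reject failed: $id" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]                   # non-zero exit if any request failed
)
```

A dry run could look like `printf '%s\n' $ITEM_IDS | DRY_RUN=1 reject_items agent_flooding`; drop `DRY_RUN=1` once the printed list looks right.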

If failed: Agent bypassing circuit breaker → Check if using different keys. May need firewall-level ban.


§2. Multiple Agents (False Positives)

Diagnostic:

# Check quarantine reasons
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

# Expected output:
# [
#   {"reason": "LowQuality", "count": 87},
#   {"reason": "UntrustedHighConfidence", "count": 34},
#   {"reason": "Duplicate", "count": 12}
# ]

# Sample items from each reason
curl http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'

Resolution: Tune content defense thresholds

⚠️ NOTE: Requires restart to apply new thresholds.

# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

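# Adjust quality threshold (example: lower from 0.7 to 0.5)
# If stemedb-api runs under systemd, set this in the unit's environment;
# an export in your login shell does not reach the service.
export STEMEDB_QUALITY_THRESHOLD=0.5

# Or in config file /etc/stemedb/config.toml.
# If a [content_defense] table already exists, edit it in place instead:
# appending a second table with the same name is invalid TOML.
sudo tee -a /etc/stemedb/config.toml >/dev/null <<EOF
[content_defense]
quality_threshold = 0.5
confidence_threshold = 0.9  # Raised from 0.8 to reduce false positives
duplicate_lookback_hours = 24
EOF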

# Restart server
sudo systemctl restart stemedb-api

# Verify new thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

Batch approve legitimate items:

# Sample and approve items manually (for known-good agents)
curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[] | select(.agent_id == "KNOWN_GOOD_AGENT") | .id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Verify items promoted
curl http://localhost:18180/metrics | grep stemedb_quarantine_approved_total

If failed: False positives persist after tuning → Review quality scoring logic. May be bug in ContentDefenseLayer.


§3. Content Defense Too Aggressive

Diagnostic:

# Check false positive rate
curl -s http://localhost:18180/metrics | grep -E 'quarantine_(total|approved_total|rejected_total)'

# Calculate false positive rate:
# FP_rate = quarantine_approved_total / (quarantine_approved_total + quarantine_rejected_total)

# If FP_rate >30%, content defense is too aggressive

# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
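The false positive formula above can be computed directly from the metrics output. This is a minimal sketch using a hypothetical `fp_rate` helper; it assumes the metrics endpoint emits Prometheus text format without trailing timestamps, and it sums across label sets if the counters carry labels.

```shell
# fp_rate (hypothetical helper): reads Prometheus text-format metrics on stdin
# and prints approved / (approved + rejected) to two decimals.
# Prints "n/a" when nothing has been approved or rejected yet.
# Assumes the sample value is the last field on the line (no timestamps).
fp_rate() {
  awk '
    $1 ~ /^stemedb_quarantine_approved_total/ { approved += $NF }
    $1 ~ /^stemedb_quarantine_rejected_total/ { rejected += $NF }
    END {
      if (approved + rejected == 0) { print "n/a"; exit }
      printf "%.2f\n", approved / (approved + rejected)
    }'
}
```

Example use: `curl -s http://localhost:18180/metrics | fp_rate`. A result above 0.30 matches the "too aggressive" condition described above.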

Resolution: Revert to default thresholds

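# Default thresholds (tested in production readiness UAT)
# WARNING: this overwrites the entire config file. Back it up first, and if the
# file contains other sections, edit [content_defense] in place instead.
sudo cp /etc/stemedb/config.toml /etc/stemedb/config.toml.bak
sudo tee /etc/stemedb/config.toml >/dev/null <<EOF
[content_defense]
quality_threshold = 0.6
confidence_threshold = 0.85
duplicate_lookback_hours = 48
untrusted_confidence_threshold = 0.95
EOF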

sudo systemctl restart stemedb-api

# Monitor quarantine rate
watch -n 10 'curl -s http://localhost:18180/metrics | grep quarantine_pending'

If failed: Even defaults too aggressive → May indicate upstream data quality issues. Review agent implementations.


§4. Legitimate Surge

Diagnostic:

# Check if surge is expected
# - Recent data source onboarding?
# - Backfill operation in progress?
# - Known high-volume event?

# Check quarantine rate over time
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute

# Compare to historical baseline (if available)
# If current rate 10x baseline → surge likely

# Check assertion rate (should also be high)
curl http://localhost:18180/metrics | grep stemedb_ingest_rate_per_minute
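The "10x baseline" comparison above can be made explicit. The sketch below is a hypothetical `is_surge` helper; the baseline has to come from your own historical data, since this runbook does not define where it is stored.

```shell
# is_surge CURRENT BASELINE [FACTOR] (hypothetical helper)
# Exit 0 when CURRENT >= FACTOR * BASELINE (FACTOR defaults to 10),
# exit 1 when it is not, and exit 2 when the baseline is unusable (<= 0).
is_surge() {
  awk -v cur="$1" -v base="$2" -v factor="${3:-10}" 'BEGIN {
    if (base <= 0) exit 2
    exit ((cur >= factor * base) ? 0 : 1)
  }'
}
```

For example, `is_surge 120 10 && echo "surge likely"` prints the message because 120 is at least ten times 10.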

Resolution: Increase quarantine review capacity

# Option 1: Batch approve known-good patterns
# (Example: Approve all items from trusted agent during backfill)
TRUSTED_AGENT="known-backfill-agent-id"

curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$TRUSTED_AGENT" | jq -r '.items[].id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Option 2: Temporarily disable content defense for trusted agents
# (Add to agent allowlist)
curl -X POST http://localhost:18180/v1/admin/content_defense/allowlist \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "'"$TRUSTED_AGENT"'", "expires_at": "2026-02-12T00:00:00Z", "reason": "backfill_operation"}'

# Option 3: Scale review team (manual triage)
# Assign additional staff to review quarantine dashboard

If failed: Surge overwhelming even with increased capacity → Consider pausing ingest, scaling infrastructure, or auto-approving low-risk items.


Validation

After applying resolution, validate quarantine is manageable:

  • Quarantine count <50

    curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
    # Should be <50
    
  • No single agent dominating

    curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map(length) | max'
    # No agent should have >20 items
    
  • False positive rate <20%

    curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)'
    # approved/(approved+rejected) should be <0.2
    
  • Quarantine rate stabilized

    curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
    # Should be <10/min for pilot workloads
    
  • Legitimate assertions not quarantined

    • Submit test assertion from known-good agent
    • Should immediately appear in dashboard (not quarantined)
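The numeric checks in this list all have the form "value must be below a limit", so they can share one pass/fail helper. This is a sketch only; `check_lt` is a hypothetical name, and the values passed to it would come from the curl commands above.

```shell
# check_lt LABEL VALUE LIMIT (hypothetical helper)
# Prints PASS and returns 0 when VALUE < LIMIT, otherwise prints FAIL and
# returns 1. awk does the comparison so decimal values such as 0.15 work.
check_lt() {
  if awk -v v="$2" -v l="$3" 'BEGIN { exit ((v < l) ? 0 : 1) }'; then
    echo "PASS $1: $2 < $3"
  else
    echo "FAIL $1: $2 >= $3"
    return 1
  fi
}
```

For instance, `check_lt quarantine_count "$(curl -s http://localhost:18180/v1/admin/quarantine | jq '.items | length')" 50` would cover the first item in the list.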

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"

      - alert: StemeDBAgentFlooding
        # rate() is per-second; multiply by 60 to compare against 50/min
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"

      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"

Configuration Changes

To prevent recurrence:

  1. Agent flooding: Tune circuit breaker thresholds (failure_rate, timeout)
  2. False positives: Regularly review and adjust content defense thresholds based on approval/rejection rates
  3. Legitimate surges: Create agent allowlist for backfill operations
  4. Review capacity: Assign on-call rotation for quarantine review (aim for <24hr SLA)

Example: Stricter circuit breaker

# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3  # Open after 30% quarantine rate
timeout_seconds = 3600  # Ban for 1 hour
min_requests = 20  # Require 20 requests before evaluating
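To reason about how these settings interact, the sketch below models one plausible reading of them: the breaker is not evaluated until `min_requests` requests have been seen, and it opens once the quarantined fraction reaches `failure_rate_threshold`. This is an assumption about the semantics, not StemeDB's actual implementation; whether the real comparison is inclusive is not stated in this runbook. `breaker_should_open`, `FAILURE_RATE_THRESHOLD`, and `MIN_REQUESTS` are hypothetical names.

```shell
# breaker_should_open QUARANTINED TOTAL (hypothetical model)
# Exit 0 if the breaker would open under the assumed semantics, else exit 1.
# Defaults mirror the example config: threshold 0.3, min_requests 20.
breaker_should_open() {
  awk -v q="$1" -v n="$2" \
      -v thr="${FAILURE_RATE_THRESHOLD:-0.3}" -v min="${MIN_REQUESTS:-20}" 'BEGIN {
    if (n < min || n == 0) exit 1       # too few requests to evaluate
    exit ((q / n >= thr) ? 0 : 1)       # assumed inclusive comparison
  }'
}
```

Under this model, an agent with 10 of 20 requests quarantined trips the breaker, while an agent with 10 of 15 does not, because it has not yet reached `min_requests`.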

Quarantine Dashboard Workflow

Standard review procedure:

  1. Open dashboard: http://localhost:18188/quarantine
  2. Sort by agent: Identify flooding patterns
  3. Review sample items: Check assertion quality
  4. Batch action:
    • If flooding → Ban agent via circuit breaker
    • If false positives → Approve batch + adjust thresholds
    • If legitimate → Approve individually or add to allowlist
  5. Document decision: Add note to item before approve/reject

Admin Endpoint Reference

⚠️ CRITICAL WARNING: Admin endpoints have NO authentication. Must be restricted to internal network only.

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /v1/admin/quarantine | GET | List all quarantine items |
| /v1/admin/quarantine?agent_id={id} | GET | Filter by agent |
| /v1/admin/quarantine/{id}/approve | POST | Promote item to main store |
| /v1/admin/quarantine/{id}/reject | POST | Permanently reject item |
| /v1/admin/circuit_breakers | GET | List all circuit breaker states |
| /v1/admin/circuit_breakers/{id}/open | POST | Manually ban agent |
| /v1/admin/circuit_breakers/{id}/reset | POST | Unban agent |
| /v1/admin/content_defense/thresholds | GET | Current thresholds |
| /v1/admin/content_defense/allowlist | POST | Add agent to allowlist |


Last Updated

2026-02-11