Runbook: Quarantine Overflow
Symptom
- Quarantine dashboard panel shows 100+ pending items
- Admins receive alerts that the "quarantine_pending" metric is high
- Legitimate assertions getting quarantined (false positives)
- Single agent flooding quarantine queue
Metrics Alerts:
- stemedb_quarantine_pending > 100 for 10 minutes
- stemedb_quarantine_rate_per_agent > 50/min for a single agent
Quick Diagnosis
Quarantine overflow
│
├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
│ └─► Single agent? → §1 Single Agent Flooding
│
├─► Check: Are items "Duplicate" or "LowQuality"?
│ └─► Multiple agents, varied reasons → §2 Multiple Agents
│
├─► Check: Recent system changes?
│ └─► Content defense tuned too aggressively → §3 Content Defense Too Aggressive
│
└─► Check: Legitimate surge (e.g., new data source)?
    └─► Expected behavior → §4 Legitimate Surge
Common Causes
- Single agent flooding (likelihood: 45%)
  - Misconfigured agent
  - Agent in retry loop
  - Malicious actor testing limits
- Content defense too aggressive (likelihood: 25%)
  - Recently tuned thresholds
  - False positive rate high
  - Quality scoring bugs
- Multiple agents with low-quality data (likelihood: 20%)
  - Integration issues
  - Bad data sources
  - Extraction pipeline bugs
- Legitimate surge (likelihood: 10%)
  - New data source onboarded
  - Backfill operation
  - Expected high-volume event
Resolution Steps
§1. Single Agent Flooding
Diagnostic:
# List quarantine items grouped by agent
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'
# Expected output (flooding):
# [
# {"agent": "8f3a2b1c...", "count": 487}, <-- Flooding!
# {"agent": "7d2e5f9a...", "count": 12},
# {"agent": "6c1b4a8e...", "count": 8}
# ]
# Check agent's recent assertions
curl "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'
# Check circuit breaker status for this agent
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'
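To decide quickly whether one agent dominates the queue, the per-agent counts can be reduced to a single share figure. This is a minimal sketch, assuming the counts have been flattened to "agent_id count" lines; the `top_agent_share` helper name is hypothetical, and what share counts as flooding is a judgment call, not a StemeDB default.

```shell
# top_agent_share: read "agent_id count" lines on stdin and print the
# agent with the most items plus its share of the total (0..1).
top_agent_share() {
  awk '
    { total += $2; if ($2 > max) { max = $2; agent = $1 } }
    END {
      if (total == 0) { print "none 0.00"; exit }
      printf "%s %.2f\n", agent, max / total
    }'
}

# Counts taken from the flooding example output in this section.
printf '%s\n' "8f3a2b1c... 487" "7d2e5f9a... 12" "6c1b4a8e... 8" | top_agent_share
# prints: 8f3a2b1c... 0.96
```

A share close to 1.0 points at §1; a flat distribution across agents points at §2 or §3.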
Resolution: Ban agent via circuit breaker
# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..." # Replace with actual agent ID
# Check current circuit breaker state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID
# Manually open circuit breaker (ban agent)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open \
-H "Content-Type: application/json" \
-d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'
# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}
# Verify agent now gets 429 responses
curl -X POST http://localhost:18180/v1/assert \
-H "X-Agent-Signature: $AGENT_SIGNATURE" \
-d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN
Bulk reject all items from flooding agent:
# Get all quarantine item IDs from flooding agent
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')
# Batch reject
for id in $ITEM_IDS; do
  curl -X POST "http://localhost:18180/v1/admin/quarantine/$id/reject" \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done
# Verify quarantine count reduced
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
If failed: Agent bypassing circuit breaker → Check if using different keys. May need firewall-level ban.
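Bulk rejection is destructive, so it can help to preview the calls and count failures before running it for real. The sketch below wraps the reject endpoint used in this section. The `reject_items` function and the `BASE_URL` and `DRY_RUN` variables are illustrative names, not StemeDB settings.

```shell
# Base URL of the admin API (hypothetical variable; defaults to the runbook's host).
BASE_URL="${BASE_URL:-http://localhost:18180}"

# reject_items: read quarantine item IDs on stdin, one per line, and reject each.
# With DRY_RUN=1 the request is printed instead of sent.
reject_items() {
  failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue            # skip blank lines
    url="$BASE_URL/v1/admin/quarantine/$id/reject"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "curl -X POST $url"
      continue
    fi
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    curl -fsS -X POST "$url" \
      -H "Content-Type: application/json" \
      -d '{"reason": "agent_flooding"}' > /dev/null || failed=$((failed + 1))
  done
  echo "failed=$failed"
  [ "$failed" -eq 0 ]
}

# Dry run with two made-up item IDs.
printf '%s\n' "item-1" "item-2" | DRY_RUN=1 reject_items
```

Once the dry-run output looks right, the same function can be fed the real IDs, for example `echo "$ITEM_IDS" | reject_items`.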
§2. Multiple Agents (False Positives)
Diagnostic:
# Check quarantine reasons
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'
# Expected output:
# [
# {"reason": "LowQuality", "count": 87},
# {"reason": "UntrustedHighConfidence", "count": 34},
# {"reason": "Duplicate", "count": 12}
# ]
# Sample items from each reason
curl http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'
Resolution: Tune content defense thresholds
⚠️ NOTE: Requires restart to apply new thresholds.
# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds
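# Adjust quality threshold (example: lower from 0.7 to 0.5)
# Note: the variable must be set in the stemedb-api service environment;
# exporting it in an interactive shell does not affect a systemd-managed process.
export STEMEDB_QUALITY_THRESHOLD=0.5
# Or in config file /etc/stemedb/config.toml.
# If a [content_defense] section already exists, edit it in place instead of
# appending: TOML does not allow the same table to be defined twice.
sudo tee -a /etc/stemedb/config.toml > /dev/null <<EOF
[content_defense]
quality_threshold = 0.5
confidence_threshold = 0.9 # Raised from 0.8 to reduce false positives
duplicate_lookback_hours = 24
EOF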
# Restart server
sudo systemctl restart stemedb-api
# Verify new thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds
Batch approve legitimate items:
# Sample and approve items manually (for known-good agents)
curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[] | select(.agent_id == "KNOWN_GOOD_AGENT") | .id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve
# Verify items promoted
curl http://localhost:18180/metrics | grep stemedb_quarantine_approved_total
If failed: False positives persist after tuning → Review the quality scoring logic. This may indicate a bug in ContentDefenseLayer.
§3. Content Defense Too Aggressive
Diagnostic:
# Check false positive rate
curl http://localhost:18180/metrics | grep -E 'quarantine_(approved|rejected)_total'
# Calculate false positive rate:
# FP_rate = quarantine_approved_total / (quarantine_approved_total + quarantine_rejected_total)
# If FP_rate >30%, content defense is too aggressive
# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
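The false positive rate formula above can be computed directly from the metrics output. This is a sketch, assuming the counters are exposed as `stemedb_quarantine_approved_total` and `stemedb_quarantine_rejected_total` (the names used in the alert rules in this runbook) with the value as the last field on each line; the `fp_rate` helper name is hypothetical.

```shell
# fp_rate: read Prometheus text-format metrics on stdin and print
# approved / (approved + rejected), summing over any label sets.
# Assumes sample lines carry no trailing timestamp.
fp_rate() {
  awk '
    /^stemedb_quarantine_approved_total[{ ]/ { approved += $NF }
    /^stemedb_quarantine_rejected_total[{ ]/ { rejected += $NF }
    END {
      total = approved + rejected
      if (total == 0) { print "n/a"; exit }
      printf "%.2f\n", approved / total
    }'
}

# Sample input; against a live server, pipe in:
#   curl -s http://localhost:18180/metrics | fp_rate
printf '%s\n' \
  "# TYPE stemedb_quarantine_approved_total counter" \
  "stemedb_quarantine_approved_total 45" \
  "stemedb_quarantine_rejected_total 105" | fp_rate
# prints: 0.30
```

Because these are cumulative counters, the result reflects everything since the last restart. For a recent window, compare two readings taken some time apart.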
Resolution: Revert to default thresholds
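# Default thresholds (tested in production readiness UAT)
# WARNING: this replaces the entire config file. Back it up first, and if the
# file holds other sections (e.g. [circuit_breaker]), edit [content_defense]
# in place instead.
sudo cp /etc/stemedb/config.toml /etc/stemedb/config.toml.bak
sudo tee /etc/stemedb/config.toml > /dev/null <<EOF
[content_defense]
quality_threshold = 0.6
confidence_threshold = 0.85
duplicate_lookback_hours = 48
untrusted_confidence_threshold = 0.95
EOF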
sudo systemctl restart stemedb-api
# Monitor quarantine rate
watch -n 10 'curl -s http://localhost:18180/metrics | grep quarantine_pending'
If failed: Even defaults too aggressive → May indicate upstream data quality issues. Review agent implementations.
§4. Legitimate Surge
Diagnostic:
# Check if surge is expected
# - Recent data source onboarding?
# - Backfill operation in progress?
# - Known high-volume event?
# Check quarantine rate over time
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
# Compare to historical baseline (if available)
# If current rate 10x baseline → surge likely
# Check assertion rate (should also be high)
curl http://localhost:18180/metrics | grep stemedb_ingest_rate_per_minute
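The 10x rule of thumb above is simple arithmetic, and can be sketched as a small helper. The `surge_ratio` name and the sample numbers are hypothetical; the baseline has to come from your own historical data.

```shell
# surge_ratio CURRENT BASELINE: print current/baseline and whether it
# meets the 10x rule of thumb from this runbook.
surge_ratio() {
  awk -v cur="$1" -v base="$2" 'BEGIN {
    if (base <= 0) { print "baseline unavailable"; exit }
    ratio = cur / base
    printf "%.1fx %s\n", ratio, (ratio >= 10 ? "surge-likely" : "within-normal-range")
  }'
}

surge_ratio 120 8    # hypothetical: 120/min now vs. 8/min baseline
# prints: 15.0x surge-likely
```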
Resolution: Increase quarantine review capacity
# Option 1: Batch approve known-good patterns
# (Example: Approve all items from trusted agent during backfill)
TRUSTED_AGENT="known-backfill-agent-id"
curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$TRUSTED_AGENT" | jq -r '.items[].id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve
# Option 2: Temporarily disable content defense for trusted agents
# (Add to agent allowlist)
curl -X POST http://localhost:18180/v1/admin/content_defense/allowlist \
-H "Content-Type: application/json" \
  -d '{"agent_id": "'"$TRUSTED_AGENT"'", "expires_at": "2026-02-12T00:00:00Z", "reason": "backfill_operation"}'
# Option 3: Scale review team (manual triage)
# Assign additional staff to review quarantine dashboard
If failed: Surge overwhelming even with increased capacity → Consider pausing ingest, scaling infrastructure, or auto-approving low-risk items.
Validation
After applying resolution, validate quarantine is manageable:
- Quarantine count <50
  curl http://localhost:18180/v1/admin/quarantine | jq '.items | length' # Should be <50
- No single agent dominating
  curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map(length) | max' # No agent should have >20 items
- False positive rate <20%
  curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)' # approved/(approved+rejected) should be <0.2
- Quarantine rate stabilized
  curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute # Should be <10/min for pilot workloads
- Legitimate assertions not quarantined
  - Submit test assertion from known-good agent
  - Should immediately appear in dashboard (not quarantined)
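The numeric checks in this list can be recorded as PASS/FAIL lines for the incident log. This is a minimal sketch using the strict upper bounds stated above; the `check_max` helper and the sample readings are hypothetical, so substitute the values returned by the commands in the checklist.

```shell
# check_max NAME VALUE LIMIT: print PASS if VALUE < LIMIT, else FAIL
# (and return non-zero so callers can react).
check_max() {
  name=$1; value=$2; limit=$3
  if awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v < l) }'; then
    echo "PASS $name=$value (<$limit)"
  else
    echo "FAIL $name=$value (limit <$limit)"
    return 1
  fi
}

# Hypothetical readings.
check_max pending_items 42 50
check_max false_positive_rate 0.25 0.2 || true
check_max quarantine_rate_per_min 6 10
# prints:
#   PASS pending_items=42 (<50)
#   FAIL false_positive_rate=0.25 (limit <0.2)
#   PASS quarantine_rate_per_min=6 (<10)
```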
Prevention
Monitoring
Set up alerts for:
# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"
      - alert: StemeDBAgentFlooding
        # rate() is per second, so multiply by 60 to compare against 50/min.
        # Assumes stemedb_quarantine_total carries an agent_id label.
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"
      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"
Configuration Changes
To prevent recurrence:
- Agent flooding: Tune circuit breaker thresholds (failure_rate, timeout)
- False positives: Regularly review and adjust content defense thresholds based on approval/rejection rates
- Legitimate surges: Create agent allowlist for backfill operations
- Review capacity: Assign on-call rotation for quarantine review (aim for <24hr SLA)
Example: Stricter circuit breaker
# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3 # Open after 30% quarantine rate
timeout_seconds = 3600 # Ban for 1 hour
min_requests = 20 # Require 20 requests before evaluating
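To make the interaction between `min_requests` and `failure_rate_threshold` concrete, the sketch below evaluates the example values for a few request counts. It assumes the breaker opens when the quarantined fraction is at or above the threshold and only after `min_requests` requests have been seen. That reading follows the config comments and is not confirmed server behavior, and the `should_open` helper is hypothetical.

```shell
# should_open REQUESTS QUARANTINED: print OPEN or CLOSED under the
# example config (min_requests = 20, failure_rate_threshold = 0.3).
should_open() {
  awk -v req="$1" -v q="$2" -v minreq=20 -v thr=0.3 'BEGIN {
    if (req < minreq) { print "CLOSED (too few requests)"; exit }
    print ((q / req >= thr) ? "OPEN" : "CLOSED")
  }'
}

should_open 10 9     # below min_requests, so not evaluated
should_open 100 35   # 35% quarantined
should_open 100 12   # 12% quarantined
# prints:
#   CLOSED (too few requests)
#   OPEN
#   CLOSED
```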
Quarantine Dashboard Workflow
Standard review procedure:
- Open dashboard: http://localhost:18188/quarantine
- Sort by agent: Identify flooding patterns
- Review sample items: Check assertion quality
- Batch action:
  - If flooding → Ban agent via circuit breaker
  - If false positives → Approve batch + adjust thresholds
  - If legitimate → Approve individually or add to allowlist
- Document decision: Add note to item before approve/reject
Admin Endpoint Reference
⚠️ CRITICAL WARNING: Admin endpoints have NO authentication. Must be restricted to internal network only.
| Endpoint | Method | Purpose |
|---|---|---|
| /v1/admin/quarantine | GET | List all quarantine items |
| /v1/admin/quarantine?agent_id={id} | GET | Filter by agent |
| /v1/admin/quarantine/{id}/approve | POST | Promote item to main store |
| /v1/admin/quarantine/{id}/reject | POST | Permanently reject item |
| /v1/admin/circuit_breakers | GET | List all circuit breaker states |
| /v1/admin/circuit_breakers/{id}/open | POST | Manually ban agent |
| /v1/admin/circuit_breakers/{id}/reset | POST | Unban agent |
| /v1/admin/content_defense/thresholds | GET | Current thresholds |
| /v1/admin/content_defense/allowlist | POST | Add agent to allowlist |
Related Runbooks
- Circuit Breaker Stuck - Agent ban management
- High Query Latency - Performance impact of large quarantine
- Server Won't Start - Disk full from quarantine overflow
Last Updated
2026-02-11