jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


Runbook: Quarantine Overflow

Symptom

- Quarantine dashboard panel shows 100+ pending items
- Admins receive alerts that the stemedb_quarantine_pending metric is high
- Legitimate assertions are quarantined (false positives)
- A single agent floods the quarantine queue

Metrics Alerts:

stemedb_quarantine_pending > 100 for 10 minutes
stemedb_quarantine_rate_per_agent > 50/min for single agent

Quick Diagnosis

Quarantine overflow
    │
    ├─► Check: curl .../admin/quarantine | jq '.items | group_by(.agent_id)'
    │   └─► Single agent? → §1 Single Agent Flooding
    │
    ├─► Check: Are items "Duplicate" or "LowQuality"?
    │   └─► Multiple agents, varied reasons → §2 Multiple Agents (False Positives)
    │
    ├─► Check: Recent system changes?
    │   └─► Content defense tuned too aggressively → §3 Content Defense Too Aggressive
    │
    └─► Check: Legitimate surge (e.g., new data source)?
        └─► Expected behavior → §4 Legitimate Surge

Common Causes

Single agent flooding — Likelihood: 45%
- Misconfigured agent
- Agent in retry loop
- Malicious actor testing limits
Content defense too aggressive — Likelihood: 25%
- Recently tuned thresholds
- False positive rate high
- Quality scoring bugs
Multiple agents with low-quality data — Likelihood: 20%
- Integration issues
- Bad data sources
- Extraction pipeline bugs
Legitimate surge — Likelihood: 10%
- New data source onboarded
- Backfill operation
- Expected high-volume event

Resolution Steps

§1. Single Agent Flooding

Diagnostic:

# List quarantine items grouped by agent
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map({agent: .[0].agent_id, count: length}) | sort_by(.count) | reverse | .[0:5]'

# Expected output (flooding):
# [
#   {"agent": "8f3a2b1c...", "count": 487},  <-- Flooding!
#   {"agent": "7d2e5f9a...", "count": 12},
#   {"agent": "6c1b4a8e...", "count": 8}
# ]

# Check agent's recent assertions
curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=8f3a2b1c..." | jq '.items[0:5]'

# Check circuit breaker status for this agent
curl http://localhost:18180/v1/admin/circuit_breakers | jq '.circuit_breakers[] | select(.agent_id == "8f3a2b1c...")'

Resolution: Ban agent via circuit breaker

# Get agent's full public key from quarantine item
AGENT_ID="8f3a2b1c..."  # Replace with actual agent ID

# Check current circuit breaker state
curl http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID

# Manually open circuit breaker (ban agent)
curl -X POST http://localhost:18180/v1/admin/circuit_breakers/$AGENT_ID/open \
  -H "Content-Type: application/json" \
  -d '{"reason": "flooding_quarantine", "duration_seconds": 3600}'

# Expected response:
# {"status": "opened", "agent_id": "8f3a2b1c...", "state": "OPEN", "until": "2026-02-11T11:23:45Z"}

# Verify agent now gets 429 responses (-i prints the status line and headers)
curl -i -X POST http://localhost:18180/v1/assert \
  -H "X-Agent-Signature: $AGENT_SIGNATURE" \
  -d '{...}'
# Should return: 429 Too Many Requests with x-circuit-breaker-state: OPEN

Bulk reject all items from flooding agent:

# Get all quarantine item IDs from flooding agent
ITEM_IDS=$(curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$AGENT_ID" | jq -r '.items[].id')

# Batch reject
for id in $ITEM_IDS; do
  curl -X POST http://localhost:18180/v1/admin/quarantine/$id/reject \
    -H "Content-Type: application/json" \
    -d '{"reason": "agent_flooding"}'
done

# Verify quarantine count reduced
curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
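
The reject loop above is irreversible per item ("Permanently reject"), so it can help to preview what it will touch. The following is a minimal sketch of a dry-run wrapper, not part of StemeDB: `reject_items`, `DRY_RUN`, and `BASE_URL` are hypothetical names introduced here, while the endpoint path and request body are the ones used in the loop above.

```shell
# reject_items REASON: read quarantine item IDs (one per line) on stdin and
# POST a reject for each. With DRY_RUN=1 the requests are only printed.
# BASE_URL defaults to the address used throughout this runbook.
# The body runs in a subshell so its variables do not leak into the caller.
reject_items() (
  reason=$1
  base=${BASE_URL:-http://localhost:18180}
  count=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue            # skip blank lines
    url="$base/v1/admin/quarantine/$id/reject"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would POST $url reason=$reason"
    else
      # -f makes curl exit non-zero on HTTP errors so failures are reported
      curl -fsS -X POST "$url" \
        -H "Content-Type: application/json" \
        -d "{\"reason\": \"$reason\"}" >/dev/null || echo "failed: $id" >&2
    fi
    count=$((count + 1))
  done
  echo "processed $count item(s)"
)
```

Usage: pipe the ID list in, first with `DRY_RUN=1` and then without it, for example `echo "$ITEM_IDS" | DRY_RUN=1 reject_items agent_flooding`.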

If failed: Agent bypassing circuit breaker → Check if using different keys. May need firewall-level ban.

§2. Multiple Agents (False Positives)

Diagnostic:

# Check quarantine reasons
curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

# Expected output:
# [
#   {"reason": "LowQuality", "count": 87},
#   {"reason": "UntrustedHighConfidence", "count": 34},
#   {"reason": "Duplicate", "count": 12}
# ]

# Sample items from each reason
curl -s http://localhost:18180/v1/admin/quarantine | jq '[.items[] | select(.reason == "LowQuality")] | .[0:3]'

Resolution: Tune content defense thresholds

⚠️ NOTE: Requires restart to apply new thresholds.

# Current thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

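# Adjust quality threshold (example: lower from 0.7 to 0.5)
# Note: an export in your shell only affects processes started from that shell.
# For the systemd-managed service, set the variable in the unit's environment
# (e.g. via `sudo systemctl edit stemedb-api`) before restarting.
export STEMEDB_QUALITY_THRESHOLD=0.5

# Or in config file /etc/stemedb/config.toml.
# If a [content_defense] table already exists, edit it in place instead of
# appending: TOML does not allow the same table to be defined twice.
sudo tee -a /etc/stemedb/config.toml >/dev/null <<EOF
[content_defense]
quality_threshold = 0.5
confidence_threshold = 0.9  # Raised from 0.8 to reduce false positives
duplicate_lookback_hours = 24
EOF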

# Restart server
sudo systemctl restart stemedb-api

# Verify new thresholds
curl http://localhost:18180/v1/admin/content_defense/thresholds

Batch approve legitimate items:

# Approve all quarantined items from a known-good agent (review a sample first)
curl -s http://localhost:18180/v1/admin/quarantine | jq -r '.items[] | select(.agent_id == "KNOWN_GOOD_AGENT") | .id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Verify items promoted
curl http://localhost:18180/metrics | grep stemedb_quarantine_approved_total

If failed: False positives persist after tuning → Review quality scoring logic. May be bug in ContentDefenseLayer.

§3. Content Defense Too Aggressive

Diagnostic:

# Check false positive rate
curl -s http://localhost:18180/metrics | grep -E 'quarantine_(approved|rejected)_total'

# Calculate false positive rate:
# FP_rate = quarantine_approved_total / (quarantine_approved_total + quarantine_rejected_total)

# If FP_rate >30%, content defense is too aggressive

# Review recent config changes
journalctl -u stemedb-api -n 500 | grep -i "content_defense"
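
The false positive rate formula above can be computed directly from the metrics output. This is a minimal sketch under stated assumptions: `fp_rate` is a hypothetical helper name, the two counter names are the ones used in this runbook, and the endpoint is assumed to emit plain Prometheus text format without trailing timestamps (so the last field of each sample line is the value).

```shell
# fp_rate: read Prometheus text-format metrics on stdin and print
# approved / (approved + rejected) to three decimals.
# Samples with different label sets are summed; "# HELP"/"# TYPE" lines
# are ignored because they do not start with the metric name.
fp_rate() {
  awk '
    /^stemedb_quarantine_approved_total/ { approved += $NF }
    /^stemedb_quarantine_rejected_total/ { rejected += $NF }
    END {
      total = approved + rejected
      if (total == 0) { print "n/a"; exit }   # no reviewed items yet
      printf "%.3f\n", approved / total
    }
  '
}
```

Usage: `curl -s http://localhost:18180/metrics | fp_rate`. A result above 0.300 corresponds to the ">30%" condition above. The counters are cumulative since process start, so this is a lifetime rate rather than a recent one.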

Resolution: Revert to default thresholds

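# Default thresholds (tested in production readiness UAT).
# Back up the config, then set these values in the existing [content_defense]
# table. Do not rewrite the file with `cat >`: that would discard every other
# section (for example [circuit_breaker]).
sudo cp /etc/stemedb/config.toml /etc/stemedb/config.toml.bak
sudoedit /etc/stemedb/config.toml
#   [content_defense]
#   quality_threshold = 0.6
#   confidence_threshold = 0.85
#   duplicate_lookback_hours = 48
#   untrusted_confidence_threshold = 0.95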

sudo systemctl restart stemedb-api

# Monitor quarantine rate
watch -n 10 'curl -s http://localhost:18180/metrics | grep quarantine_pending'

If failed: Even defaults too aggressive → May indicate upstream data quality issues. Review agent implementations.

§4. Legitimate Surge

Diagnostic:

# Check if surge is expected
# - Recent data source onboarding?
# - Backfill operation in progress?
# - Known high-volume event?

# Check quarantine rate over time
curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute

# Compare to historical baseline (if available)
# If current rate 10x baseline → surge likely

# Check assertion rate (should also be high)
curl http://localhost:18180/metrics | grep stemedb_ingest_rate_per_minute
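
The "10x baseline" comparison above can be made explicit once the current and baseline rates are known. A minimal sketch: `is_surge` is a hypothetical helper, and the factor of 10 is the rule of thumb from the comment above, not a StemeDB setting.

```shell
# is_surge CURRENT BASELINE [FACTOR]: exit 0 when CURRENT is at least
# FACTOR times BASELINE (FACTOR defaults to 10), otherwise exit 1.
# A baseline of 0 or less is treated as "no baseline" and never a surge.
is_surge() {
  awk -v cur="$1" -v base="$2" -v factor="${3:-10}" \
    'BEGIN { exit !(base > 0 && cur >= factor * base) }'
}
```

Usage: `is_surge 120 10 && echo "surge likely"`. Apply the same check to the ingest rate: a legitimate surge should show both rates elevated, while a high quarantine rate with a normal ingest rate points back to §1 through §3.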

Resolution: Increase quarantine review capacity

# Option 1: Batch approve known-good patterns
# (Example: Approve all items from trusted agent during backfill)
TRUSTED_AGENT="known-backfill-agent-id"

curl -s "http://localhost:18180/v1/admin/quarantine?agent_id=$TRUSTED_AGENT" | jq -r '.items[].id' | xargs -I {} \
  curl -X POST http://localhost:18180/v1/admin/quarantine/{}/approve

# Option 2: Temporarily disable content defense for trusted agents
# (Add to agent allowlist)
curl -X POST http://localhost:18180/v1/admin/content_defense/allowlist \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "'$TRUSTED_AGENT'", "expires_at": "2026-02-12T00:00:00Z", "reason": "backfill_operation"}'

# Option 3: Scale review team (manual triage)
# Assign additional staff to review quarantine dashboard

If failed: Surge overwhelming even with increased capacity → Consider pausing ingest, scaling infrastructure, or auto-approving low-risk items.

Validation

After applying resolution, validate quarantine is manageable:

Quarantine count <50

curl http://localhost:18180/v1/admin/quarantine | jq '.items | length'
# Should be <50

No single agent dominating

curl http://localhost:18180/v1/admin/quarantine | jq '.items | group_by(.agent_id) | map(length) | max'
# No agent should have >20 items

False positive rate <20%

curl http://localhost:18180/metrics | grep -E '(quarantine_approved|quarantine_rejected)'
# approved/(approved+rejected) should be <0.2

Quarantine rate stabilized

curl http://localhost:18180/metrics | grep stemedb_quarantine_rate_per_minute
# Should be <10/min for pilot workloads

Legitimate assertions not quarantined
- Submit test assertion from known-good agent
- Should immediately appear in dashboard (not quarantined)
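
The four numeric checks above can be evaluated together once the values have been collected with the commands shown. A minimal sketch: `validate_quarantine` is a hypothetical helper, and the limits are copied from this section (pending <50, no agent >20 items, false positive rate <0.2, rate <10/min).

```shell
# validate_quarantine PENDING MAX_PER_AGENT FP_RATE RATE_PER_MIN
# Prints one PASS/FAIL line per check; exits 1 if any check fails.
validate_quarantine() {
  awk -v pending="$1" -v per_agent="$2" -v fp="$3" -v rate="$4" '
    function check(label, ok,    status) {
      status = ok ? "PASS" : "FAIL"
      print status ": " label
      return ok ? 0 : 1
    }
    BEGIN {
      failed = 0
      failed += check("pending items < 50",        pending < 50)
      failed += check("max items per agent <= 20", per_agent <= 20)
      failed += check("false positive rate < 0.2", fp < 0.2)
      failed += check("quarantine rate < 10/min",  rate < 10)
      exit (failed > 0)
    }
  '
}
```

Usage: `validate_quarantine 12 5 0.1 3`. The last check in the list (a test assertion from a known-good agent) still has to be done by hand.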

Prevention

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_quarantine
    rules:
      - alert: StemeDBQuarantineOverflow
        expr: stemedb_quarantine_pending > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quarantine queue overflow (>100 items)"
          description: "Current count: {{ $value }}"

      - alert: StemeDBAgentFlooding
        # rate() is per second; multiply by 60 to compare against 50/min
        expr: sum by (agent_id) (rate(stemedb_quarantine_total[5m])) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent flooding quarantine"
          description: "Agent {{ $labels.agent_id }} submitting >50/min"

      - alert: StemeDBHighFalsePositiveRate
        expr: rate(stemedb_quarantine_approved_total[1h]) / (rate(stemedb_quarantine_approved_total[1h]) + rate(stemedb_quarantine_rejected_total[1h])) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Content defense false positive rate high (>30%)"

Configuration Changes

To prevent recurrence:

Agent flooding: Tune circuit breaker thresholds (failure_rate, timeout)
False positives: Regularly review and adjust content defense thresholds based on approval/rejection rates
Legitimate surges: Create agent allowlist for backfill operations
Review capacity: Assign on-call rotation for quarantine review (aim for <24hr SLA)

Example: Stricter circuit breaker

# /etc/stemedb/config.toml
[circuit_breaker]
failure_rate_threshold = 0.3  # Open after 30% quarantine rate
timeout_seconds = 3600  # Ban for 1 hour
min_requests = 20  # Require 20 requests before evaluating
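
To reason about how the example settings interact, the decision can be sketched as below. This is an assumed reading of the comments in the config above (evaluate only after `min_requests`, open at or above `failure_rate_threshold`), not confirmed StemeDB behavior. `should_open`, `MIN_REQUESTS`, and `FAILURE_RATE_THRESHOLD` are hypothetical names.

```shell
# should_open QUARANTINED TOTAL: exit 0 if the breaker would open.
# Assumed semantics:
#   - fewer than MIN_REQUESTS total requests: never open
#   - otherwise open when QUARANTINED / TOTAL >= FAILURE_RATE_THRESHOLD
should_open() {
  awk -v q="$1" -v n="$2" \
      -v min="${MIN_REQUESTS:-20}" -v thr="${FAILURE_RATE_THRESHOLD:-0.3}" \
    'BEGIN { exit !(n > 0 && n >= min && q / n >= thr) }'
}
```

Usage: `should_open 10 20 && echo "breaker opens"`. With these values, an agent with 9 of 10 requests quarantined is not yet banned because it is below `min_requests`; lowering `min_requests` catches flooding sooner at the cost of banning agents on small samples.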

Quarantine Dashboard Workflow

Standard review procedure:

1. Open dashboard: http://localhost:18188/quarantine
2. Sort by agent: Identify flooding patterns
3. Review sample items: Check assertion quality
4. Batch action:
   - If flooding → Ban agent via circuit breaker
   - If false positives → Approve batch + adjust thresholds
   - If legitimate → Approve individually or add to allowlist
5. Document decision: Add note to item before approve/reject

Admin Endpoint Reference

⚠️ CRITICAL WARNING: Admin endpoints have NO authentication. Must be restricted to internal network only.

| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/admin/quarantine` | GET | List all quarantine items |
| `/v1/admin/quarantine?agent_id={id}` | GET | Filter by agent |
| `/v1/admin/quarantine/{id}/approve` | POST | Promote item to main store |
| `/v1/admin/quarantine/{id}/reject` | POST | Permanently reject item |
| `/v1/admin/circuit_breakers` | GET | List all circuit breaker states |
| `/v1/admin/circuit_breakers/{id}/open` | POST | Manually ban agent |
| `/v1/admin/circuit_breakers/{id}/reset` | POST | Unban agent |
| `/v1/admin/content_defense/thresholds` | GET | Current thresholds |
| `/v1/admin/content_defense/allowlist` | POST | Add agent to allowlist |

Related Runbooks

Circuit Breaker Stuck - Agent ban management
High Query Latency - Performance impact of large quarantine
Server Won't Start - Disk full from quarantine overflow

Last Updated

2026-02-11
