Pilot Success Criteria

Definition of "done" for StemeDB pilot deployments

This document defines the acceptance criteria for validating a StemeDB pilot before it is promoted to production. All "Must Pass" criteria are ship blockers.


Overview

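| Section | Must Pass | Should Pass | Nice to Have | Total |
|---------|-----------|-------------|--------------|-------|
| 1. Performance | 3 | 2 | 1 | 6 |
| 2. Functional | 4 | 2 | 1 | 7 |
| 3. Operational | 3 | 2 | 1 | 6 |
| 4. Demo Validation | 5 | 0 | 0 | 5 |
| 5. Acceptance | - | - | - | - |
| Total | 15 | 6 | 3 | 24 |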

Pass threshold: All 15 "Must Pass" + 4/6 "Should Pass" = 19/24 minimum
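
The threshold is a rule, not just a total: a run with 14 "Must Pass" and 5 "Should Pass" also scores 19/24, yet it fails because a ship blocker is open. A minimal sketch of that rule follows; `pilot_verdict` is a hypothetical helper name, not one of the StemeDB scripts.

```shell
# pilot_verdict MUST_PASSED SHOULD_PASSED
# Prints PASS only when all 15 "Must Pass" criteria and at least 4 of the
# 6 "Should Pass" criteria are met. "Nice to Have" results are ignored.
pilot_verdict() {
  must_passed=$1
  should_passed=$2
  if [ "$must_passed" -eq 15 ] && [ "$should_passed" -ge 4 ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

pilot_verdict 15 4   # minimum passing combination (19/24)
pilot_verdict 14 5   # also 19/24, but one ship blocker failed
```

The two calls print `PASS` and `FAIL` respectively.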


1. Performance Requirements

Must Pass

1.1 Sub-Second Query Latency (p99 <1s)

Requirement: p99 query latency <1 second at 10K assertions baseline.

Test Procedure:

# Load 10K assertions
./scripts/load-test-data.sh --count 10000

# Run query load test (100 queries/sec for 5 minutes)
./scripts/query-load-test.sh \
  --rate 100 \
  --duration 300 \
  --endpoint /v1/query \
  --lens recency

# Extract p99 latency
curl http://localhost:18180/metrics | grep 'stemedb_query_latency_seconds{quantile="0.99"}'

Expected Result:

stemedb_query_latency_seconds{quantile="0.99"} 0.987  # <1.0 ✅

Acceptance:

  • Pass: p99 <1000ms
  • ⚠️ Warning: p99 1000-1500ms (acceptable with explanation)
  • Fail: p99 >1500ms
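
The bands above can be applied mechanically once the p99 value has been pulled out of the metrics text. The sketch below assumes the metric line has the shape shown in the expected result; `metric_value` and `classify_query_p99` are hypothetical helper names.

```shell
# metric_value SERIES
# Reads Prometheus text from stdin and prints the value of the first sample
# whose name-plus-labels field equals SERIES exactly.
metric_value() {
  awk -v series="$1" '$1 == series { print $2; exit }'
}

# classify_query_p99 SECONDS
# Maps a p99 latency in seconds onto the bands of criterion 1.1.
classify_query_p99() {
  awk -v p99="$1" 'BEGIN {
    p99 += 0
    if (p99 < 1.0)       print "pass"
    else if (p99 <= 1.5) print "warn"
    else                 print "fail"
  }'
}

# Example against a live server (not run here):
#   p99=$(curl -s http://localhost:18180/metrics \
#     | metric_value 'stemedb_query_latency_seconds{quantile="0.99"}')
#   classify_query_p99 "$p99"
```

Comparing in awk avoids the integer-only arithmetic of `[ ... -lt ... ]`, which cannot handle values such as 0.987.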

1.2 Sustained Ingest Rate (1K assertions/sec, 5 minutes)

Requirement: Handle 1,000 assertions/sec sustained for 5 minutes with p99 latency <200ms.

Test Procedure:

# Run ingest load test
./scripts/ingest-load-test.sh \
  --rate 1000 \
  --duration 300

# Monitor metrics
curl http://localhost:18180/metrics | grep -E '(ingest_rate|wal_fsync_latency)'

Expected Result:

# Ingest rate maintained
rate(stemedb_assertions_total[1m]) ~= 1000

# WAL fsync latency <200ms
stemedb_wal_fsync_latency_seconds{quantile="0.99"} 0.189  # <0.2 ✅

Acceptance:

  • Pass: 1K/sec sustained, p99 <200ms, no errors
  • ⚠️ Warning: 800-1000/sec OR p99 200-300ms
  • Fail: <800/sec OR p99 >300ms OR errors >1%
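
Because this criterion combines three measurements, it helps to check the fail conditions first and treat anything that is neither a fail nor a clean pass as a warning. A minimal sketch under that reading; `classify_ingest` is a hypothetical helper, and the error percentage is assumed to come from the load-test output.

```shell
# classify_ingest RATE_PER_SEC P99_SECONDS ERROR_PERCENT
# Applies the acceptance bands of criterion 1.2.
classify_ingest() {
  awk -v rate="$1" -v p99="$2" -v err="$3" 'BEGIN {
    rate += 0; p99 += 0; err += 0
    if (rate < 800 || p99 > 0.3 || err > 1)          print "fail"
    else if (rate >= 1000 && p99 < 0.2 && err == 0)  print "pass"
    else                                             print "warn"
  }'
}

classify_ingest 1000 0.189 0   # prints "pass"
classify_ingest 900 0.189 0    # prints "warn" (rate between 800 and 1000)
```

Note that this sketch reports an error rate above 0% but at most 1% as a warning, a case the bands above leave unspecified.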

1.3 Conflict Detection (Score >0.5 on contradictions)

Requirement: ConflictLens assigns conflict_score >0.5 when assertions contradict.

Test Procedure:

# Submit contradictory assertions
# First claim: adverse event rate of 0.002 (0.2%)
curl -X POST http://localhost:18180/v1/assert \
  -d '{
    "concept_path": "drug/aspirin/safety",
    "predicate": "adverse_event_rate",
    "value": 0.002,
    "confidence": 0.95,
    "agent_id": "fda-clinical-trial"
  }'

# Second claim: adverse event rate of 0.12 (12%), which contradicts the first
curl -X POST http://localhost:18180/v1/assert \
  -d '{
    "concept_path": "drug/aspirin/safety",
    "predicate": "adverse_event_rate",
    "value": 0.12,
    "confidence": 0.7,
    "agent_id": "anecdotal-reports"
  }'

# Query with ConflictLens
curl -X POST http://localhost:18180/v1/query \
  -d '{
    "concept_path": "drug/aspirin/safety",
    "lens": "conflict"
  }' | jq '.conflict_score'

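Expected Result:

The jq filter above prints only the conflict_score value, which must be >0.5 (high conflict detected). The full response is shown for context:

{
  "conflict_score": 0.87,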
  "assertions": [
    {"value": 0.002, "confidence": 0.95, "agent": "fda-clinical-trial"},
    {"value": 0.12, "confidence": 0.7, "agent": "anecdotal-reports"}
  ]
}

Acceptance:

  • Pass: conflict_score >0.5 for contradictory values
  • Fail: conflict_score ≤0.5

Should Pass

1.4 Concurrent Query Capacity (100 readers, <2x degradation)

Requirement: Support 100 concurrent readers with <2x latency degradation vs baseline.

Test Procedure:

# Measure baseline (1 concurrent reader)
# -T sets the Content-Type of the POST body; ab requires it alongside -p
ab -n 1000 -c 1 -p query.json -T application/json http://localhost:18180/v1/query
# Note: mean latency (e.g., 50ms)

# Measure under load (100 concurrent readers)
ab -n 10000 -c 100 -p query.json -T application/json http://localhost:18180/v1/query
# Note: mean latency (e.g., 85ms)

# Calculate degradation
echo "scale=2; 85 / 50" | bc  # = 1.7x (acceptable)

Expected Result:

  • Baseline: 50ms mean
  • Under load: <100ms mean (<2x degradation)

Acceptance:

  • Pass: <2x degradation
  • ⚠️ Warning: 2-3x degradation
  • Fail: >3x degradation
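
The ratio check can be scripted rather than computed by hand. The sketch below assumes ab's usual summary line `Time per request: N [ms] (mean)`; the helper names are hypothetical.

```shell
# ab_mean_ms
# Reads ab output from stdin and prints the per-request mean in ms,
# skipping the "(mean, across all concurrent requests)" line.
ab_mean_ms() {
  awk '/^Time per request:/ && $NF == "(mean)" { print $4; exit }'
}

# classify_degradation BASELINE_MS LOADED_MS
# Applies the bands of criterion 1.4 to the ratio loaded/baseline.
classify_degradation() {
  awk -v base="$1" -v load="$2" 'BEGIN {
    if (base + 0 <= 0) { print "invalid-baseline"; exit 1 }
    ratio = load / base
    if (ratio < 2)       print "pass"
    else if (ratio <= 3) print "warn"
    else                 print "fail"
  }'
}

classify_degradation 50 85   # ratio 1.7, prints "pass"
```

To use it with real runs, pipe each `ab` invocation through `ab_mean_ms` and pass the two results to `classify_degradation`.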

1.5 Replication Lag <1s (Cluster Only)

Requirement: Three-node cluster maintains replication lag <1 second.

Test Procedure:

# Submit assertion to Node 1
curl -X POST http://node1:18180/v1/assert -d '{...}'

# Wait 1 second
sleep 1

# Query from Node 2 (different node)
curl -X POST http://node2:18180/v1/query -d '{...}'
# Should return the assertion

# Check replication lag metric
curl http://node1:18180/metrics | grep replication_lag_seconds

Expected Result:

replication_lag_seconds{node="node1"} 0.234  # <1.0 ✅
replication_lag_seconds{node="node2"} 0.456  # <1.0 ✅
replication_lag_seconds{node="node3"} 0.123  # <1.0 ✅

Acceptance:

  • Pass: All nodes <1s
  • ⚠️ Warning: Any node 1-5s
  • Fail: Any node >5s
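
Since the verdict depends on the slowest node, the check reduces to finding the largest `replication_lag_seconds` sample. A minimal sketch, assuming the metric lines look like the expected result above; `classify_replication_lag` is a hypothetical helper.

```shell
# classify_replication_lag
# Reads Prometheus text from stdin, takes the worst (largest)
# replication_lag_seconds sample, and applies the bands of criterion 1.5.
classify_replication_lag() {
  awk '
    /^replication_lag_seconds/ {
      n++
      if ($2 + 0 > worst) worst = $2 + 0
    }
    END {
      if (n == 0)          { print "no-data"; exit }
      if (worst < 1)       print "pass"
      else if (worst <= 5) print "warn"
      else                 print "fail"
    }'
}

# Example against a live cluster (not run here):
#   curl -s http://node1:18180/metrics | classify_replication_lag
```

Reporting `no-data` separately keeps a missing metric from being mistaken for zero lag.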

Nice to Have

1.6 Dashboard Load Time <2s

Requirement: StemeDB dashboard loads in <2 seconds.

Test Procedure:

# Measure time to fetch the page HTML (server response only, not browser rendering)
curl -w 'time_total: %{time_total}s\n' -o /dev/null -s http://localhost:18188/

# Or use browser DevTools Network tab
# Load: http://localhost:18188/
# Check: DOMContentLoaded time

Expected Result:

  • DOMContentLoaded: <2000ms

Acceptance:

  • Pass: <2s
  • ⚠️ Warning: 2-5s
  • Fail: >5s

2. Functional Requirements

Must Pass

2.1 Complete Audit Trail (Export 100 assertions with signatures)

Requirement: Export 100 assertions with full provenance chain and verify Ed25519 signatures.

Test Procedure:

# Query 100 assertions
curl -X POST http://localhost:18180/v1/query \
  -d '{
    "concept_path": "drug/*",
    "lens": "recency",
    "limit": 100
  }' > assertions.json

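# Verify each signature
jq -r '.assertions[] | .signature' assertions.json | while read -r sig; do
  # Placeholder: extract the public key and signed message for this assertion,
  # then verify the Ed25519 signature. As written, this loop only prints each
  # signature and verifies nothing.
  echo "Verifying $sig..."
done

# Check provenance fields
jq '.assertions[] | select(.provenance == null or .provenance == "")' assertions.json
# Should return empty (all have provenance)

# Check agent_id fields
jq '.assertions[] | select(.agent_id == null or .agent_id == "")' assertions.json
# Should return empty (all have agent_id)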

Expected Result:

  • 100 assertions exported
  • All have non-empty provenance field
  • All have non-empty agent_id field
  • All signatures verify successfully

Acceptance:

  • Pass: 100/100 valid signatures + provenance
  • Fail: Any missing provenance or invalid signature

2.2 Source Retraction Cascade

Requirement: Retracting source cascades to 110+ dependent assertions.

Test Procedure:

# Submit source + 110 dependent assertions
./scripts/seed-retraction-test-data.sh

# Retract source
curl -X POST http://localhost:18180/v1/retract \
  -d '{
    "concept_path": "source/CARDIOVASC_MEGA_TRIAL",
    "reason": "study_retracted_fabricated_data",
    "cascade": true
  }'

# Query retracted assertions
curl -X POST http://localhost:18180/v1/query \
  -d '{
    "concept_path": "drug/*/cardiovascular_risk",
    "lens": "recency",
    "include_retracted": true
  }' | jq '[.assertions[] | select(.lifecycle_stage == "RETRACTED")] | length'

Expected Result:

111  # Source + 110 dependents (≥110 ✅)

Acceptance:

  • Pass: ≥110 assertions retracted
  • Fail: <110 assertions retracted

2.3 Multi-Lens Resolution

Requirement: RecencyLens, ConsensusLens, and AuthorityLens return different winners for same query.

Test Procedure:

# Submit 3 assertions (different agents, times, confidence)
curl -X POST http://localhost:18180/v1/assert -d '{
  "concept_path": "drug/aspirin/dosage",
  "predicate": "recommended_mg",
  "value": 81,
  "confidence": 0.95,
  "agent_id": "fda-guidelines",
  "timestamp": "2024-01-01T00:00:00Z"
}'

curl -X POST http://localhost:18180/v1/assert -d '{
  "concept_path": "drug/aspirin/dosage",
  "predicate": "recommended_mg",
  "value": 100,
  "confidence": 0.7,
  "agent_id": "mayo-clinic",
  "timestamp": "2025-06-01T00:00:00Z"
}'

curl -X POST http://localhost:18180/v1/assert -d '{
  "concept_path": "drug/aspirin/dosage",
  "predicate": "recommended_mg",
  "value": 325,
  "confidence": 0.6,
  "agent_id": "patient-forum",
  "timestamp": "2025-12-01T00:00:00Z"
}'

# Query with each lens
curl -X POST http://localhost:18180/v1/query \
  -d '{"concept_path": "drug/aspirin/dosage", "lens": "recency"}' \
  | jq '.assertions[0].value'
# Expected: 325 (most recent)

curl -X POST http://localhost:18180/v1/query \
  -d '{"concept_path": "drug/aspirin/dosage", "lens": "authority"}' \
  | jq '.assertions[0].value'
# Expected: 81 (highest confidence from FDA)

curl -X POST http://localhost:18180/v1/query \
  -d '{"concept_path": "drug/aspirin/dosage", "lens": "consensus"}' \
  | jq '.assertions[0].value'
# Expected: 100 (middle value, balances recency + authority)

Expected Result:

  • RecencyLens returns: 325 (latest timestamp)
  • AuthorityLens returns: 81 (FDA, highest confidence)
  • ConsensusLens returns: 100 (middle value)

All 3 lenses return different winners

Acceptance:

  • Pass: 3 different winners across lenses
  • Fail: Same winner for all lenses (indicates lens not working)
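
Once the three winning values have been captured from the queries above, the pass condition is simply that no two of them are equal. A minimal sketch; `all_distinct` is a hypothetical helper and compares its arguments as strings.

```shell
# all_distinct VALUE...
# Succeeds (exit status 0) only when no two arguments are equal.
all_distinct() {
  unique=$(printf '%s\n' "$@" | sort -u | awk 'END { print NR }')
  [ "$unique" -eq "$#" ]
}

if all_distinct 325 81 100; then
  echo "Pass: 3 different winners"
else
  echo "Fail: at least two lenses agree"
fi
```

Because the comparison is textual, `81` and `81.0` would count as different values, so the three values should come from the same `jq` filter.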

2.4 Health Endpoint Returns 200

Requirement: /v1/health returns 200 with valid JSON.

Test Procedure:

curl -i http://localhost:18180/v1/health

Expected Result:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 12345,
  "assertion_count": 10234
}

Acceptance:

  • Pass: 200 status + valid JSON
  • Fail: Non-200 status OR malformed JSON

Should Pass

2.5 Query with Complex Lens (AuthorityLens with deep chain)

Requirement: AuthorityLens resolves assertions with trust chain depth ≥3.

Test Procedure:

# Submit assertions with trust chain:
# Agent A → Agent B → Agent C → Agent D (depth 3)

./scripts/seed-trust-chain.sh --depth 3

# Query with AuthorityLens
curl -X POST http://localhost:18180/v1/query \
  -d '{
    "concept_path": "research/deep_chain",
    "lens": "authority"
  }' | jq '.trust_chain_depth'

Expected Result:

3  # Depth ≥3 ✅

Acceptance:

  • Pass: Depth ≥3
  • Fail: Depth <3

2.6 Time-Travel Query (2023 vs 2025 comparison)

Requirement: Query returns different results for different timestamps.

Test Procedure:

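# Query as of 2023 (needs seed data with an assertion dated on or before 2023-01-01; the assertions submitted in 2.3 start at 2024-01-01)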
curl -X POST http://localhost:18180/v1/query \
  -d '{
    "concept_path": "drug/aspirin/dosage",
    "lens": "recency",
    "as_of": "2023-01-01T00:00:00Z"
  }' | jq '.assertions[0].value'
# Expected: 81 (old guideline)

# Query as of 2025
curl -X POST http://localhost:18180/v1/query \
  -d '{
    "concept_path": "drug/aspirin/dosage",
    "lens": "recency",
    "as_of": "2025-12-31T23:59:59Z"
  }' | jq '.assertions[0].value'
# Expected: 325 (updated guideline)

Expected Result:

  • 2023: 81
  • 2025: 325
  • Different values

Acceptance:

  • Pass: Different values for different timestamps
  • Fail: Same value (time-travel not working)

Nice to Have

2.7 Swagger UI Accessible

Requirement: OpenAPI docs accessible at /swagger-ui.

Test Procedure:

curl -I http://localhost:18180/swagger-ui/

Expected Result:

HTTP/1.1 200 OK
Content-Type: text/html

Acceptance:

  • Pass: 200 status
  • ⚠️ Warning: 404 (acceptable if documented)

3. Operational Requirements

Must Pass

3.1 Backup/Restore Roundtrip

Requirement: Load 10K assertions → backup → restore → verify count matches.

Test Procedure:

# Load 10K assertions
./scripts/load-test-data.sh --count 10000

# Check count
ORIGINAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Original count: $ORIGINAL_COUNT"

# Backup
sudo ./scripts/backup-stemedb.sh
BACKUP_DIR=$(ls -dt backups/stemedb-backup-* | head -1)

# Stop server
sudo systemctl stop stemedb-api

# Restore
sudo ./scripts/restore-stemedb.sh "$BACKUP_DIR"

# Start server
sudo systemctl start stemedb-api

# Wait for startup
sleep 10

# Check count
RESTORED_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Restored count: $RESTORED_COUNT"

# Verify match
[ "$ORIGINAL_COUNT" -eq "$RESTORED_COUNT" ] && echo "✅ Pass" || echo "❌ Fail"

Expected Result:

Original count: 10234
Restored count: 10234
✅ Pass

Acceptance:

  • Pass: Counts match exactly
  • Fail: Counts differ
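
The fixed `sleep 10` in the procedure can race a slow startup and produce a false failure. One alternative is to poll the health endpoint until it answers; the sketch below is a generic retry loop, with `wait_until` as a hypothetical helper name.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 0 on success, or 1 once at
# least TIMEOUT_SECONDS have been spent sleeping. INTERVAL_SECONDS must be
# a positive integer.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Example after "systemctl start stemedb-api" (not run here):
#   wait_until 60 2 curl -sf http://localhost:18180/v1/health \
#     || echo "server did not become healthy within 60s"
```

`curl -f` makes a non-2xx response count as a failure, so the loop keeps waiting until the endpoint actually returns success.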

3.2 Node Failure Recovery (Three-Node Cluster)

Requirement: Kill Node 2 → queries continue → node recovers → re-replicates <5 min.

Test Procedure:

# Kill Node 2
ssh node2 "sudo systemctl stop stemedb-api"

# Verify cluster detects failure
curl http://node1:18181/cluster/members | jq '.members[] | select(.id=="node2") | .status'
# Expected: "DOWN"

# Submit query to Node 1 (should succeed)
curl -X POST http://node1:18180/v1/query -d '{...}'
# Expected: 200 OK

# Restart Node 2
ssh node2 "sudo systemctl start stemedb-api"

# Wait for re-replication
sleep 300  # 5 minutes

# Check replication lag
curl http://node2:18180/metrics | grep replication_lag_seconds
# Expected: <1.0

Expected Result:

  • Node 2 failure detected within 30s
  • Queries continue to succeed on Node 1 & 3
  • Node 2 recovers and re-replicates within 5 minutes
  • Final replication lag <1s

Acceptance:

  • Pass: All criteria met
  • Fail: Queries failed OR recovery >5 min

3.3 Rolling Restart (Three-Node Cluster, Zero Downtime)

Requirement: Restart nodes one-by-one during load test → 100% success rate.

Test Procedure:

# Start load test (background)
./scripts/query-load-test.sh --rate 10 --duration 600 &
LOAD_PID=$!

# Wait 60s for baseline
sleep 60

# Restart Node 1
ssh node1 "sudo systemctl restart stemedb-api"
sleep 60

# Restart Node 2
ssh node2 "sudo systemctl restart stemedb-api"
sleep 60

# Restart Node 3
ssh node3 "sudo systemctl restart stemedb-api"
sleep 60

# Wait for load test to complete
wait $LOAD_PID

# Check success rate
grep "Success rate" load-test-results.log

Expected Result:

Success rate: 100.0% (6000/6000 requests succeeded)

Acceptance:

  • Pass: 100% success rate
  • ⚠️ Warning: success rate ≥98% but <100%
  • Fail: <98% success rate
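
Classifying from the raw request counts avoids rounding problems, since a rate printed as 100.0% may hide a handful of failures. The sketch assumes the log line has the format shown in the expected result; `classify_success_rate` is a hypothetical helper, and it treats any rate from 98% up to but not including 100% as a warning.

```shell
# classify_success_rate
# Reads load-test log text from stdin, finds the "succeeded/total" counts on
# the "Success rate" line, and applies the bands of criterion 3.3.
classify_success_rate() {
  awk '
    /Success rate/ {
      if (match($0, /[0-9]+\/[0-9]+/)) {
        split(substr($0, RSTART, RLENGTH), counts, "/")
        ok = counts[1] + 0
        total = counts[2] + 0
        found = 1
      }
    }
    END {
      if (!found || total == 0)        { print "no-data"; exit }
      if (ok == total)                 print "pass"
      else if (ok * 100 >= total * 98) print "warn"
      else                             print "fail"
    }'
}

# Example (not run here):
#   classify_success_rate < load-test-results.log
```

Integer comparison (`ok * 100 >= total * 98`) keeps the 98% boundary exact.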

Should Pass

3.4 Metrics Exposed (Prometheus Format)

Requirement: /metrics endpoint returns Prometheus-format metrics.

Test Procedure:

curl http://localhost:18180/metrics | head -20

Expected Result:

# HELP stemedb_assertions_total Total assertions ingested
# TYPE stemedb_assertions_total counter
stemedb_assertions_total 10234

# HELP stemedb_query_latency_seconds Query latency histogram
# TYPE stemedb_query_latency_seconds histogram
stemedb_query_latency_seconds_bucket{le="0.005"} 1234
...

Acceptance:

  • Pass: Valid Prometheus format
  • Fail: Invalid format OR endpoint unreachable
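
"Valid Prometheus format" is hard to judge by eye from the first 20 lines. Where Prometheus tooling is installed, `promtool check metrics` reads metrics from stdin and performs a full check. When it is not available, a rough line-shape lint can be sketched as below; this is an approximation that does not handle every legal label value, and `lint_prometheus_text` is a hypothetical helper.

```shell
# lint_prometheus_text
# Reads metrics text from stdin. Blank lines and "#" comment lines are
# skipped; every other line must look like: name{labels} value [timestamp].
# Prints offending lines and returns non-zero if any are found, or if no
# sample lines were seen at all.
lint_prometheus_text() {
  awk '
    /^[ \t]*$/ { next }
    /^#/       { next }
    /^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})?[ \t]+(-?[0-9.]+([eE][-+]?[0-9]+)?|[-+]?Inf|NaN)([ \t]+-?[0-9]+)?[ \t]*$/ {
      ok++
      next
    }
    { bad++; printf "line %d: %s\n", NR, $0 }
    END { if (bad > 0 || ok == 0) exit 1 }
  '
}

# Example against a live server (not run here):
#   curl -s http://localhost:18180/metrics | lint_prometheus_text && echo "looks valid"
```

An unreachable endpoint yields empty input, which this sketch also reports as a failure, matching the "endpoint unreachable" fail condition.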

3.5 Grafana Dashboard Loads

Requirement: Grafana dashboard displays StemeDB metrics without errors.

Test Procedure:

  1. Open http://localhost:3000 (Grafana)
  2. Navigate to "StemeDB Overview" dashboard
  3. Check all panels load without errors

Expected Result:

  • All panels display data
  • No "No data" or "Error" messages

Acceptance:

  • Pass: All panels load
  • ⚠️ Warning: 1-2 panels missing data
  • Fail: >2 panels missing data

Nice to Have

3.6 Backup Automation (Cron Job Running)

Requirement: Daily backup cron job configured and executed.

Test Procedure:

# Check cron job exists
sudo crontab -l | grep backup-stemedb

# Expected:
# 0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Check last backup
ls -lt backups/ | head -3

# Expected: Backup from last 24 hours

Acceptance:

  • Pass: Cron job exists + recent backup
  • ⚠️ Warning: Cron job exists but no recent backup
  • Fail: No cron job
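
Checking "backup from last 24 hours" by reading `ls -lt` output is error-prone; `find` can test the modification time directly. A minimal sketch, assuming backups are entries named `stemedb-backup-*` directly under the backup directory (as in 3.1) and a `find` that supports `-mmin`; `recent_backup_exists` is a hypothetical helper.

```shell
# recent_backup_exists DIR [MAX_AGE_MINUTES]
# Succeeds when DIR contains at least one stemedb-backup-* entry modified
# within the last MAX_AGE_MINUTES (default 1440, i.e. 24 hours).
recent_backup_exists() {
  dir=$1
  max_age_min=${2:-1440}
  newest=$(find "$dir" -mindepth 1 -maxdepth 1 -name 'stemedb-backup-*' \
             -mmin "-$max_age_min" 2>/dev/null | head -n 1)
  [ -n "$newest" ]
}

# Example (not run here):
#   recent_backup_exists backups && echo "recent backup found"
```

Combined with the `crontab -l` check above, a missing recent backup maps to the Warning band and a missing cron entry to Fail.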

4. Demo Validation: 5 Amazement Moments

All 5 moments must be demonstrable without errors.

Moment 1: Conflicting Claims (FDA 0.2% vs Anecdotal 12%)

Setup:

./scripts/demo-moment-1-conflicting-claims.sh

Demo Script:

  1. Show 2 assertions: FDA (0.2%) vs Anecdotal (12%)
  2. Query with ConflictLens → Shows conflict_score: 0.87
  3. Query with AuthorityLens → Returns FDA value (higher confidence)
  4. Amazement: "Same data, different answers based on lens choice"

Acceptance:

  • Pass: ConflictLens detects conflict, AuthorityLens picks FDA
  • Fail: Lenses don't differentiate

Moment 2: Source Retraction Cascade (110 Assertions Flagged)

Setup:

./scripts/demo-moment-2-retraction.sh

Demo Script:

  1. Show study with 110 dependent drug safety assertions
  2. Retract study: POST /v1/retract with cascade: true
  3. Query retracted assertions → 111 total (study + dependents)
  4. Amazement: "One retraction cascades to 110+ assertions automatically"

Acceptance:

  • Pass: ≥110 assertions retracted (111 expected: study + 110 dependents)
  • Fail: <110 assertions retracted

Moment 3: Audit Trail (Provenance Chain to Source)

Setup:

./scripts/demo-moment-3-audit-trail.sh

Demo Script:

  1. Query assertion: "Drug X has adverse event rate 5%"
  2. Show provenance: "Clinical trial ABC, 2024-06-15"
  3. Trace to source: "Trial ABC run by Pharma Corp, funded by..."
  4. Verify signature: Ed25519 signature valid
  5. Amazement: "Full audit trail from claim to original source"

Acceptance:

  • Pass: Provenance chain complete, signature valid
  • Fail: Missing provenance OR invalid signature

Moment 4: Time-Travel (Query 2023 vs 2025 Guidelines)

Setup:

./scripts/demo-moment-4-time-travel.sh

Demo Script:

  1. Query aspirin dosage as of 2023 → Returns 81mg
  2. Query same as of 2025 → Returns 325mg
  3. Show timeline of changes (3 updates over 2 years)
  4. Amazement: "See how medical guidelines evolved over time"

Acceptance:

  • Pass: Different values for different timestamps
  • Fail: Same value (time-travel not working)

Moment 5: Lens-Based Resolution (3 Lenses → 3 Winners)

Setup:

./scripts/demo-moment-5-lens-resolution.sh

Demo Script:

  1. Show 5 conflicting assertions for "recommended dosage"
  2. Query with RecencyLens → Returns latest assertion
  3. Query with ConsensusLens → Returns middle value
  4. Query with AuthorityLens → Returns highest confidence assertion
  5. Amazement: "Same query, 3 different answers - you choose resolution strategy"

Acceptance:

  • Pass: 3 lenses return 3 different winners
  • Fail: Lenses return same winner

5. Acceptance Criteria

Must Pass (Ship Blockers)

All 15 "Must Pass" criteria must be met:

  • 1.1 Query latency p99 <1s
  • 1.2 Sustained ingest 1K/sec
  • 1.3 Conflict detection >0.5
  • 2.1 Audit trail complete
  • 2.2 Retraction cascade ≥110
  • 2.3 Multi-lens resolution
  • 2.4 Health endpoint 200 OK
  • 3.1 Backup/restore roundtrip
  • 3.2 Node failure recovery (cluster)
  • 3.3 Rolling restart (cluster)
  • 4.1 Moment 1: Conflicting claims
  • 4.2 Moment 2: Retraction cascade
  • 4.3 Moment 3: Audit trail
  • 4.4 Moment 4: Time-travel
  • 4.5 Moment 5: Lens resolution

At least 4/6 "Should Pass" required:

  • 1.4 Concurrent query capacity
  • 1.5 Replication lag <1s (cluster)
  • 2.5 Complex lens (deep chain)
  • 2.6 Time-travel query
  • 3.4 Metrics exposed
  • 3.5 Grafana dashboard

Nice to Have (Optional)

Not required for pilot approval:

  • 1.6 Dashboard load time <2s
  • 2.7 Swagger UI accessible
  • 3.6 Backup automation (cron)

Validation Report Template

Copy this template to document pilot validation results:

# StemeDB Pilot Validation Report

**Date:** YYYY-MM-DD
**Deployment:** [Single-node / Three-node cluster]
**Instance Type:** [AWS t3.large / etc.]
**Assertions:** [Count]
**Evaluator:** [Name]

## Results Summary

| Category | Must Pass | Should Pass | Nice to Have | Total |
|----------|-----------|-------------|--------------|-------|
| Performance | [X/3] | [X/2] | [X/1] | [X/6] |
| Functional | [X/4] | [X/2] | [X/1] | [X/7] |
| Operational | [X/3] | [X/2] | [X/1] | [X/6] |
| Demo | [X/5] | [0/0] | [0/0] | [X/5] |
| **Total** | **[X/15]** | **[X/6]** | **[X/3]** | **[X/24]** |

**Pass Threshold:** 15/15 Must Pass + 4/6 Should Pass = 19/24 minimum
**Actual Score:** [X/24]
**Status:** [✅ PASS / ❌ FAIL]

## Detailed Results

[Paste test results for each criterion]

## Blockers (if any)

[List any "Must Pass" failures]

## Recommendations

[Next steps for production deployment]

## Sign-Off

- [ ] Engineering Lead: ___________________ Date: ___________
- [ ] Operations Lead: ___________________ Date: ___________
- [ ] Product Lead: ___________________    Date: ___________


Last Updated: 2026-02-11