# StemeDB Operations Guide

**Welcome to the StemeDB operations hub.** This documentation provides everything you need to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.

## Quick Links

| Need to... | Go to |
|------------|-------|
| **Deploy for the first time** | [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md) |
| **Troubleshoot an incident** | [Operational Runbooks](./runbooks/) |
| **Scale to production** | [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md) |
| **Size your deployment** | [Resource Sizing Guide](./reference-architecture/resource-sizing.md) |
| **Configure networking** | [Network Requirements](./reference-architecture/network-requirements.md) |
| **Deploy with Docker Compose** | [Pilot with Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml) |
| **Set up reverse proxy** | [Nginx Config](./deployment/nginx/stemedb.conf) / [Envoy Config](./deployment/envoy/stemedb.yaml) |
| **Validate pilot success** | [Pilot Success Criteria](./pilot-success-criteria.md) |

---

## Operations Documentation

### 🚨 Runbooks

**When things go wrong at 2am**, these runbooks provide step-by-step incident response procedures:

- **[Server Won't Start](./runbooks/server-wont-start.md)** - Port conflicts, TLS errors, WAL corruption
- **[High Query Latency](./runbooks/high-query-latency.md)** - Performance degradation, replication lag
- **[Quarantine Overflow](./runbooks/quarantine-overflow.md)** - Content defense queue management
- **[Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)** - Agent bans and manual resets
- **[Restore from Backup](./runbooks/restore-from-backup.md)** - Disaster recovery procedures
- **[Disk Full](./runbooks/disk-full.md)** - Storage management and WAL cleanup
- **[Add Node to Cluster](./runbooks/add-node.md)** - Cluster expansion procedures

**Start here:** [Troubleshooting Flowchart](./troubleshooting-flowchart.md) - Decision tree from symptom to runbook

---

### 🏗️ Reference Architectures

**Choose your deployment model** based on scale, availability requirements, and operational maturity:

| Architecture | Target | Assertions | Queries/sec | RTO/RPO | Guide |
|--------------|--------|-----------|-------------|---------|-------|
| **Single-Node Pilot** | PoC, friendly pilot | <10K | <100/sec | 2hr / 24hr | [Guide](./reference-architecture/single-node-pilot.md) |
| **Three-Node Cluster** | Production | <100K | <1K/sec | 5min / 1min | [Guide](./reference-architecture/three-node-cluster.md) |
| **Enterprise (future)** | Large-scale | >100K | >1K/sec | 1min / 0min | Roadmap (P6+) |

**Also see:**

- [Network Requirements](./reference-architecture/network-requirements.md) - Ports, firewalls, TLS, DNS
- [Resource Sizing](./reference-architecture/resource-sizing.md) - CPU, RAM, disk calculations

---

### 📦 Deployment Examples

**Infrastructure-as-Code** examples ready to customize for your environment:

- **[Docker Compose + Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml)** - Turnkey deployment with Prometheus + Grafana
- **[Nginx Reverse Proxy](./deployment/nginx/stemedb.conf)** - TLS termination, rate limiting, security headers
- **[Envoy Gateway](./deployment/envoy/stemedb.yaml)** - Advanced load balancing, circuit breakers, retries

---

### ✅ Pilot Success Criteria

**Before going to production**, validate your pilot meets these criteria:

- **[Pilot Success Criteria](./pilot-success-criteria.md)** - Performance, functional, operational requirements
- **5 Amazement Moments** - Demo validation checklist
- **Acceptance Criteria** - Must Pass / Should Pass / Nice to Have

---

## Common Tasks

### First-Time Deployment

1. Review [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md)
2. Follow [Resource Sizing Guide](./reference-architecture/resource-sizing.md) to choose hardware
3. Deploy using [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml)
4. Configure reverse proxy ([Nginx](./deployment/nginx/stemedb.conf) or [Envoy](./deployment/envoy/stemedb.yaml))
5. Validate against [Pilot Success Criteria](./pilot-success-criteria.md)

### Incident Response

1. Identify symptom (error message, alert, user report)
2. Check [Troubleshooting Flowchart](./troubleshooting-flowchart.md)
3. Follow relevant runbook (see list above)
4. Document resolution and add to runbook if new scenario

### Scaling to Production

1. Validate pilot success with [Success Criteria](./pilot-success-criteria.md)
2. Review [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md)
3. Plan migration (data backup, node provisioning, DNS changes)
4. Execute deployment with rolling validation
5. Set up monitoring (see [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml))

---

## Prerequisites

**Before using these operations guides**, ensure you've completed:

- ✅ [Production Readiness Verification](../../uat/production-readiness/README.md) - 84% CLI score, all critical checks pass
- ✅ [Load Testing](../../uat/production-readiness/README.md#load-testing) - 10K assertions baseline, 1K/sec sustained
- ✅ [Backup/Restore Testing](../../scripts/) - Validated roundtrip recovery

---

## Support

**For questions or issues:**

- 📖 **Documentation bugs:** Report at [GitHub Issues](https://github.com/anthropics/stemedb/issues)
- 💬 **Community support:** [Discussion forum link TBD]
- 🚨 **Security issues:** security@stemedb.io (or your org's security contact)

---

## Contributing

**Operations documentation is living documentation.** If you:

- Encounter an incident not covered by runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example

Submit pull requests to keep this guide current and valuable.

---

**Last Updated:** 2026-02-11
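
As a rough aid for the Reference Architectures table in this guide, the sketch below maps an expected assertion count and query rate to the smallest tier whose limits cover both. The names `Tier` and `pick_architecture` are hypothetical and not part of StemeDB or its tooling. The table does not say which tier applies to a value exactly at a limit, so the sketch assumes the larger one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_assertions: Optional[int]  # exclusive upper bound; None means unbounded
    max_qps: Optional[int]         # exclusive upper bound; None means unbounded


# Limits copied from the Reference Architectures table in this guide.
# Assumption: a value exactly at a limit falls into the next tier up.
TIERS = [
    Tier("Single-Node Pilot", 10_000, 100),
    Tier("Three-Node Cluster", 100_000, 1_000),
    Tier("Enterprise (future)", None, None),
]


def pick_architecture(assertions: int, qps: float) -> str:
    """Return the name of the smallest tier that covers both inputs."""
    if assertions < 0 or qps < 0:
        raise ValueError("assertions and qps must be non-negative")
    for tier in TIERS:
        fits_assertions = tier.max_assertions is None or assertions < tier.max_assertions
        fits_qps = tier.max_qps is None or qps < tier.max_qps
        if fits_assertions and fits_qps:
            return tier.name
    # The last tier is unbounded, so the loop always returns before this point.
    raise AssertionError("unreachable")


if __name__ == "__main__":
    print(pick_architecture(5_000, 50))    # small data set, light query load
    print(pick_architecture(5_000, 400))   # small data set, heavier query load
    print(pick_architecture(250_000, 50))  # data set beyond the cluster limit
```

Either dimension alone can push a deployment into the next tier, which is why the function checks both limits together. Treat the result as a starting point and confirm it against the Resource Sizing Guide.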