# StemeDB Operations Guide
**Welcome to the StemeDB operations hub.** This guide covers how to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.
## Quick Links
| Need to... | Go to |
|------------|-------|
| **Deploy to k3s (100 projects)** | [k3s Deploy Roadmap](./deployment/k8s-deploy-roadmap.md) |
| **Deploy for the first time** | [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md) |
| **Troubleshoot an incident** | [Operational Runbooks](./runbooks/) |
| **Scale to production** | [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md) |
| **Size your deployment** | [Resource Sizing Guide](./reference-architecture/resource-sizing.md) |
| **Configure networking** | [Network Requirements](./reference-architecture/network-requirements.md) |
| **Deploy with Docker Compose** | [Pilot with Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml) |
| **Set up reverse proxy** | [Nginx Config](./deployment/nginx/stemedb.conf) / [Envoy Config](./deployment/envoy/stemedb.yaml) |
| **Validate pilot success** | [Pilot Success Criteria](./pilot-success-criteria.md) |
---
## Operations Documentation
### 🚨 Runbooks
**When things go wrong at 2am**, these runbooks provide step-by-step incident response procedures:
- **[Server Won't Start](./runbooks/server-wont-start.md)** - Port conflicts, TLS errors, WAL corruption
- **[High Query Latency](./runbooks/high-query-latency.md)** - Performance degradation, replication lag
- **[Quarantine Overflow](./runbooks/quarantine-overflow.md)** - Content defense queue management
- **[Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)** - Agent bans and manual resets
- **[Restore from Backup](./runbooks/restore-from-backup.md)** - Disaster recovery procedures
- **[Disk Full](./runbooks/disk-full.md)** - Storage management and WAL cleanup
- **[Add Node to Cluster](./runbooks/add-node.md)** - Cluster expansion procedures

**Start here:** [Troubleshooting Flowchart](./troubleshooting-flowchart.md) - Decision tree from symptom to runbook

---
### 🏗️ Reference Architectures
**Choose your deployment model** based on scale, availability requirements, and operational maturity:
| Architecture | Target | Assertions | Queries/sec | RTO/RPO | Guide |
|--------------|--------|-----------|-------------|---------|-------|
| **Single-Node Pilot** | PoC, friendly pilot | <10K | <100/sec | 2hr / 24hr | [Guide](./reference-architecture/single-node-pilot.md) |
| **Three-Node Cluster** | Production | <100K | <1K/sec | 5min / 1min | [Guide](./reference-architecture/three-node-cluster.md) |
| **Enterprise (future)** | Large-scale | >100K | >1K/sec | 1min / 0min | Roadmap (P6+) |

**Also see:**
- [Network Requirements](./reference-architecture/network-requirements.md) - Ports, firewalls, TLS, DNS
- [Resource Sizing](./reference-architecture/resource-sizing.md) - CPU, RAM, disk calculations
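
As a rough aid for reading the table above, the tier boundaries can be expressed as a small lookup. This is a minimal sketch, not part of StemeDB: the function name `recommend_architecture` and the `Tier` type are hypothetical, "K" is read as thousands, and treating anything at or beyond the Three-Node limits as Enterprise is an assumption, since the table does not define the exact boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_assertions: int  # exclusive upper bound, taken from the table
    max_qps: int         # exclusive upper bound, taken from the table


# Thresholds copied from the reference architecture table.
TIERS = [
    Tier("Single-Node Pilot", 10_000, 100),
    Tier("Three-Node Cluster", 100_000, 1_000),
]


def recommend_architecture(assertions: int, queries_per_sec: int) -> str:
    """Return the smallest tier whose limits cover both inputs."""
    for tier in TIERS:
        if assertions < tier.max_assertions and queries_per_sec < tier.max_qps:
            return tier.name
    # Assumption: workloads beyond the Three-Node limits map to the future tier.
    return "Enterprise (future)"


if __name__ == "__main__":
    print(recommend_architecture(5_000, 50))
    print(recommend_architecture(5_000, 500))
    print(recommend_architecture(250_000, 2_000))
```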
---
### 📦 Deployment Examples
**Infrastructure-as-Code** examples ready to customize for your environment:
- **[Docker Compose + Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml)** - Turnkey deployment with Prometheus + Grafana
- **[Nginx Reverse Proxy](./deployment/nginx/stemedb.conf)** - TLS termination, rate limiting, security headers
- **[Envoy Gateway](./deployment/envoy/stemedb.yaml)** - Advanced load balancing, circuit breakers, retries
---
### ✅ Pilot Success Criteria
**Before going to production**, validate your pilot meets these criteria:
- **[Pilot Success Criteria](./pilot-success-criteria.md)** - Performance, functional, operational requirements
- **5 Amazement Moments** - Demo validation checklist
- **Acceptance Criteria** - Must Pass / Should Pass / Nice to Have
---
## Common Tasks
### First-Time Deployment
1. Review [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md)
2. Follow [Resource Sizing Guide](./reference-architecture/resource-sizing.md) to choose hardware
3. Deploy using [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml)
4. Configure reverse proxy ([Nginx](./deployment/nginx/stemedb.conf) or [Envoy](./deployment/envoy/stemedb.yaml))
5. Validate against [Pilot Success Criteria](./pilot-success-criteria.md)
### Incident Response
1. Identify symptom (error message, alert, user report)
2. Check [Troubleshooting Flowchart](./troubleshooting-flowchart.md)
3. Follow the relevant runbook (see the Runbooks list above)
4. Document the resolution, and add the scenario to the runbooks if it is not already covered
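
The symptom-to-runbook step above can be sketched as a keyword lookup. This is an illustration only: the keywords are taken from the runbook descriptions in this guide, while the matching logic and the names `RUNBOOKS` and `suggest_runbooks` are hypothetical and do not reproduce the actual troubleshooting flowchart.

```python
# Keywords come from the runbook summaries in this guide; the matching itself
# is an assumption and not the official troubleshooting flowchart.
RUNBOOKS = {
    "./runbooks/server-wont-start.md": ["port conflict", "tls error", "wal corruption"],
    "./runbooks/high-query-latency.md": ["latency", "performance degradation", "replication lag"],
    "./runbooks/quarantine-overflow.md": ["quarantine"],
    "./runbooks/circuit-breaker-stuck.md": ["circuit breaker", "agent ban"],
    "./runbooks/restore-from-backup.md": ["restore", "disaster recovery"],
    "./runbooks/disk-full.md": ["disk full", "wal cleanup"],
    "./runbooks/add-node.md": ["add node", "cluster expansion"],
}

# When nothing matches, fall back to the flowchart rather than guessing.
FALLBACK = "./troubleshooting-flowchart.md"


def suggest_runbooks(symptom: str) -> list[str]:
    """Return runbooks whose keywords appear in the symptom description."""
    text = symptom.lower()
    matches = [
        path
        for path, keywords in RUNBOOKS.items()
        if any(keyword in text for keyword in keywords)
    ]
    return matches or [FALLBACK]


if __name__ == "__main__":
    print(suggest_runbooks("Replication lag on node 2"))
    print(suggest_runbooks("Clients see intermittent timeouts"))
```

A lookup like this only narrows the search; an unmatched symptom still goes through the flowchart, and a new scenario is a candidate for a new runbook entry.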
### Scaling to Production
1. Validate pilot success with [Success Criteria](./pilot-success-criteria.md)
2. Review [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md)
3. Plan migration (data backup, node provisioning, DNS changes)
4. Execute deployment with rolling validation
5. Set up monitoring (see [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml))
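
One way to read "rolling validation" in step 4 is: deploy one node at a time and stop as soon as a node fails its check. The sketch below assumes that interpretation. `rolling_deploy`, `deploy`, and `validate` are hypothetical names; the real deploy and health-check mechanics are described in the Three-Node Cluster guide, not here.

```python
from typing import Callable, Iterable


def rolling_deploy(
    nodes: Iterable[str],
    deploy: Callable[[str], None],
    validate: Callable[[str], bool],
) -> list[str]:
    """Deploy to one node at a time and stop at the first failed validation.

    Returns the nodes that were deployed and validated successfully.
    """
    completed: list[str] = []
    for node in nodes:
        deploy(node)
        if not validate(node):
            # Halt here so the remaining nodes keep running the previous version.
            raise RuntimeError(
                f"validation failed on {node}; "
                f"halted after {len(completed)} healthy node(s)"
            )
        completed.append(node)
    return completed


if __name__ == "__main__":
    # Stand-in callables; real ones would invoke your deploy tooling and checks.
    deployed: list[str] = []
    result = rolling_deploy(
        ["node-1", "node-2", "node-3"],
        deploy=deployed.append,
        validate=lambda node: True,
    )
    print(result)
```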
---
## Prerequisites
**Before using these operations guides**, ensure you've completed:
- ✅ [Production Readiness Verification](../../uat/production-readiness/README.md) - 84% CLI score, all critical checks pass
- ✅ [Load Testing](../../uat/production-readiness/README.md#load-testing) - 10K assertions baseline, 1K/sec sustained
- ✅ [Backup/Restore Testing](../../scripts/) - Validated roundtrip recovery
---
## Support
**For questions or issues:**
- 📖 **Documentation bugs:** Report at [GitHub Issues](https://github.com/anthropics/stemedb/issues)
- 💬 **Community support:** [Discussion forum link TBD]
- 🚨 **Security issues:** security@stemedb.io (or your org's security contact)
---
## Contributing
**Operations documentation is living documentation.** If you:
- Encounter an incident not covered by runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example

Submit pull requests to keep this guide current and valuable.

---
**Last Updated:** 2026-03-02