# StemeDB Operations Guide

**Welcome to the StemeDB operations hub.** This documentation provides everything you need to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.

## Quick Links

| Need to... | Go to |
|------------|-------|
| **Deploy to k3s (100 projects)** | [k3s Deploy Roadmap](./deployment/k8s-deploy-roadmap.md) |
| **Deploy for the first time** | [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md) |
| **Troubleshoot an incident** | [Operational Runbooks](./runbooks/) |
| **Scale to production** | [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md) |
| **Size your deployment** | [Resource Sizing Guide](./reference-architecture/resource-sizing.md) |
| **Configure networking** | [Network Requirements](./reference-architecture/network-requirements.md) |
| **Deploy with Docker Compose** | [Pilot with Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml) |
| **Set up reverse proxy** | [Nginx Config](./deployment/nginx/stemedb.conf) / [Envoy Config](./deployment/envoy/stemedb.yaml) |
| **Validate pilot success** | [Pilot Success Criteria](./pilot-success-criteria.md) |

---

## Operations Documentation

### 🚨 Runbooks

**When things go wrong at 2am**, these runbooks provide step-by-step incident response procedures:

- **[Server Won't Start](./runbooks/server-wont-start.md)** - Port conflicts, TLS errors, WAL corruption
- **[High Query Latency](./runbooks/high-query-latency.md)** - Performance degradation, replication lag
- **[Quarantine Overflow](./runbooks/quarantine-overflow.md)** - Content defense queue management
- **[Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)** - Agent bans and manual resets
- **[Restore from Backup](./runbooks/restore-from-backup.md)** - Disaster recovery procedures
- **[Disk Full](./runbooks/disk-full.md)** - Storage management and WAL cleanup
- **[Add Node to Cluster](./runbooks/add-node.md)** - Cluster expansion procedures

**Start here:** [Troubleshooting Flowchart](./troubleshooting-flowchart.md) - Decision tree from symptom to runbook

---

### 🏗️ Reference Architectures

**Choose your deployment model** based on scale, availability requirements, and operational maturity:

| Architecture | Target | Assertions | Queries/sec | RTO / RPO | Guide |
|--------------|--------|------------|-------------|-----------|-------|
| **Single-Node Pilot** | PoC, friendly pilot | <10K | <100 | 2hr / 24hr | [Guide](./reference-architecture/single-node-pilot.md) |
| **Three-Node Cluster** | Production | <100K | <1K | 5min / 1min | [Guide](./reference-architecture/three-node-cluster.md) |
| **Enterprise (future)** | Large-scale | >100K | >1K | 1min / 0min | Roadmap (P6+) |

**Also see:**

- [Network Requirements](./reference-architecture/network-requirements.md) - Ports, firewalls, TLS, DNS
- [Resource Sizing](./reference-architecture/resource-sizing.md) - CPU, RAM, disk calculations

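
As a rough illustration of how the reference architecture table maps expected load to a deployment model, the Python sketch below encodes the table's thresholds. The `Tier` class, the `pick_tier` helper, and the handling of boundary values (exactly 10K or 100K assertions moves up a tier) are assumptions made for this example and are not part of StemeDB. The linked guides and the Resource Sizing Guide remain the authority for real sizing decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One row of the reference architecture table (hypothetical helper type)."""
    name: str
    max_assertions: float  # exclusive upper bound on stored assertions
    max_qps: float         # exclusive upper bound on queries per second
    rto: str               # recovery time objective, as listed in the table
    rpo: str               # recovery point objective, as listed in the table


# Thresholds copied from the table; ordered from smallest to largest tier.
TIERS = [
    Tier("Single-Node Pilot", 10_000, 100, "2hr", "24hr"),
    Tier("Three-Node Cluster", 100_000, 1_000, "5min", "1min"),
    Tier("Enterprise (future)", float("inf"), float("inf"), "1min", "0min"),
]


def pick_tier(assertions: int, qps: float) -> Tier:
    """Return the smallest tier whose limits cover both load figures.

    Assumption: a value exactly on a boundary is pushed to the next tier,
    because the table states strict "<" limits.
    """
    for tier in TIERS:
        if assertions < tier.max_assertions and qps < tier.max_qps:
            return tier
    return TIERS[-1]


if __name__ == "__main__":
    # Example: 5K assertions but 500 queries/sec exceeds the pilot's query limit.
    chosen = pick_tier(assertions=5_000, qps=500)
    print(f"{chosen.name}: RTO {chosen.rto}, RPO {chosen.rpo}")
```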
---

### 📦 Deployment Examples

**Infrastructure-as-Code** examples ready to customize for your environment:

- **[Docker Compose + Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml)** - Turnkey deployment with Prometheus + Grafana
- **[Nginx Reverse Proxy](./deployment/nginx/stemedb.conf)** - TLS termination, rate limiting, security headers
- **[Envoy Gateway](./deployment/envoy/stemedb.yaml)** - Advanced load balancing, circuit breakers, retries

---

### ✅ Pilot Success Criteria

**Before going to production**, validate your pilot meets these criteria:

- **[Pilot Success Criteria](./pilot-success-criteria.md)** - Performance, functional, operational requirements
- **5 Amazement Moments** - Demo validation checklist
- **Acceptance Criteria** - Must Pass / Should Pass / Nice to Have

---

## Common Tasks

### First-Time Deployment

1. Review the [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md)
2. Follow the [Resource Sizing Guide](./reference-architecture/resource-sizing.md) to choose hardware
3. Deploy using the [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml)
4. Configure a reverse proxy ([Nginx](./deployment/nginx/stemedb.conf) or [Envoy](./deployment/envoy/stemedb.yaml))
5. Validate against the [Pilot Success Criteria](./pilot-success-criteria.md)

### Incident Response

1. Identify the symptom (error message, alert, user report)
2. Check the [Troubleshooting Flowchart](./troubleshooting-flowchart.md)
3. Follow the relevant runbook (see the Runbooks list above)
4. Document the resolution, and add it to the runbooks if the scenario is new

### Scaling to Production

1. Validate pilot success with the [Success Criteria](./pilot-success-criteria.md)
2. Review the [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md)
3. Plan the migration (data backup, node provisioning, DNS changes)
4. Execute the deployment with rolling validation
5. Set up monitoring (see the [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml))

---

## Prerequisites

**Before using these operations guides**, ensure you've completed:

- ✅ [Production Readiness Verification](../../uat/production-readiness/README.md) - 84% CLI score, all critical checks pass
- ✅ [Load Testing](../../uat/production-readiness/README.md#load-testing) - 10K assertions baseline, 1K/sec sustained
- ✅ [Backup/Restore Testing](../../scripts/) - Validated roundtrip recovery

---

## Support

**For questions or issues:**

- 📖 **Documentation bugs:** Report at [GitHub Issues](https://github.com/anthropics/stemedb/issues)
- 💬 **Community support:** [Discussion forum link TBD]
- 🚨 **Security issues:** security@stemedb.io (or your org's security contact)

---

## Contributing

**Operations documentation is living documentation.** If you:

- Encounter an incident not covered by the runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example

Submit pull requests to keep this guide current and valuable.

---

**Last Updated:** 2026-03-07