StemeDB Operations Guide
Welcome to the StemeDB operations hub. These guides cover deploying, monitoring, troubleshooting, and maintaining StemeDB in production environments.
Quick Links
| Need to... | Go to |
|---|---|
| Deploy to k3s (100 projects) | k3s Deploy Roadmap |
| Deploy for the first time | Single-Node Pilot Architecture |
| Troubleshoot an incident | Operational Runbooks |
| Scale to production | Three-Node Cluster Architecture |
| Size your deployment | Resource Sizing Guide |
| Configure networking | Network Requirements |
| Deploy with Docker Compose | Pilot with Monitoring |
| Set up reverse proxy | Nginx Config / Envoy Config |
| Validate pilot success | Pilot Success Criteria |
Operations Documentation
🚨 Runbooks
When things go wrong at 2am, these runbooks provide step-by-step incident response procedures:
- Server Won't Start - Port conflicts, TLS errors, WAL corruption
- High Query Latency - Performance degradation, replication lag
- Quarantine Overflow - Content defense queue management
- Circuit Breaker Stuck - Agent bans and manual resets
- Restore from Backup - Disaster recovery procedures
- Disk Full - Storage management and WAL cleanup
- Add Node to Cluster - Cluster expansion procedures
Start here: Troubleshooting Flowchart - Decision tree from symptom to runbook
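As a rough illustration of the symptom-to-runbook mapping, the sketch below matches a free-text symptom against the topics listed beside each runbook above. The keyword table and the `suggest_runbooks` helper are hypothetical and exist only for this example; the Troubleshooting Flowchart remains the real decision tree.

```python
# Hypothetical keyword table built from the runbook summaries above.
# It is an illustration only and is not part of StemeDB.
RUNBOOK_KEYWORDS = {
    "Server Won't Start": ("port conflict", "tls error", "wal corruption"),
    "High Query Latency": ("latency", "performance degradation", "replication lag"),
    "Quarantine Overflow": ("quarantine",),
    "Circuit Breaker Stuck": ("circuit breaker", "agent ban"),
    "Restore from Backup": ("restore", "disaster recovery"),
    "Disk Full": ("disk full", "wal cleanup"),
    "Add Node to Cluster": ("add node", "cluster expansion"),
}


def suggest_runbooks(symptom: str) -> list[str]:
    """Return the runbooks whose keywords appear in the symptom text."""
    text = symptom.lower()
    return [
        runbook
        for runbook, keywords in RUNBOOK_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


if __name__ == "__main__":
    for report in ("TLS error on startup", "Replication lag is rising"):
        print(report, "->", suggest_runbooks(report))
```

A symptom that matches nothing returns an empty list, which is the cue to fall back to the flowchart and, once resolved, to record the new scenario in a runbook.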
🏗️ Reference Architectures
Choose your deployment model based on scale, availability requirements, and operational maturity:
| Architecture | Target | Assertions | Queries/sec | RTO / RPO | Guide |
|---|---|---|---|---|---|
| Single-Node Pilot | PoC, friendly pilot | <10K | <100 | 2 hr / 24 hr | Guide |
| Three-Node Cluster | Production | <100K | <1K | 5 min / 1 min | Guide |
| Enterprise (future) | Large-scale | >100K | >1K | 1 min / 0 min | Roadmap (P6+) |
Also see:
- Network Requirements - Ports, firewalls, TLS, DNS
- Resource Sizing - CPU, RAM, disk calculations
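The thresholds in the reference architecture table can be read as a simple selection rule. The following is a minimal sketch under stated assumptions: the `Tier` type and `pick_architecture` function are hypothetical names, and because the table does not say which tier applies at exactly 10K or 100K assertions, the sketch assumes boundary values move up to the next tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_assertions: int  # exclusive upper bound, from the table
    max_qps: int         # exclusive upper bound, from the table


# Ordered from smallest to largest deployment.
TIERS = (
    Tier("Single-Node Pilot", max_assertions=10_000, max_qps=100),
    Tier("Three-Node Cluster", max_assertions=100_000, max_qps=1_000),
)
LARGEST_TIER = "Enterprise (future)"


def pick_architecture(assertions: int, qps: int) -> str:
    """Return the smallest tier whose limits cover both dimensions."""
    for tier in TIERS:
        if assertions < tier.max_assertions and qps < tier.max_qps:
            return tier.name
    # Assumption: anything at or beyond the cluster limits needs the largest tier.
    return LARGEST_TIER


if __name__ == "__main__":
    print(pick_architecture(assertions=5_000, qps=50))
    print(pick_architecture(assertions=5_000, qps=500))
    print(pick_architecture(assertions=250_000, qps=50))
```

Both dimensions must fit, so a small dataset with a high query rate still selects the larger tier. RTO/RPO targets are left out of the sketch; a stricter recovery requirement may push you to a larger tier regardless of load.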
📦 Deployment Examples
Infrastructure-as-Code examples ready to customize for your environment:
- Docker Compose + Monitoring - Turnkey deployment with Prometheus + Grafana
- Nginx Reverse Proxy - TLS termination, rate limiting, security headers
- Envoy Gateway - Advanced load balancing, circuit breakers, retries
✅ Pilot Success Criteria
Before going to production, validate that your pilot meets these criteria:
- Pilot Success Criteria - Performance, functional, operational requirements
- 5 Amazement Moments - Demo validation checklist
- Acceptance Criteria - Must Pass / Should Pass / Nice to Have
Common Tasks
First-Time Deployment
- Review Single-Node Pilot Architecture
- Follow Resource Sizing Guide to choose hardware
- Deploy using Docker Compose example
- Configure reverse proxy (Nginx or Envoy)
- Validate against Pilot Success Criteria
Incident Response
- Identify symptom (error message, alert, user report)
- Check Troubleshooting Flowchart
- Follow relevant runbook (see list above)
- Document the resolution, and add it to the relevant runbook if the scenario is new
Scaling to Production
- Validate the pilot against the Pilot Success Criteria
- Review Three-Node Cluster Architecture
- Plan migration (data backup, node provisioning, DNS changes)
- Execute deployment with rolling validation
- Set up monitoring (see Docker Compose example)
Prerequisites
Before using these operations guides, ensure you've completed:
- ✅ Production Readiness Verification - 84% CLI score, all critical checks pass
- ✅ Load Testing - 10K assertions baseline, 1K/sec sustained
- ✅ Backup/Restore Testing - Validated roundtrip recovery
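The prerequisites above could be checked mechanically before a rollout. This is a hedged sketch only: it assumes the figures in the list (84% CLI score, 10K assertions, 1K/sec) are minimum thresholds, which the list does not state explicitly, and the `ReadinessReport` type and its field names are hypothetical rather than part of any StemeDB tool.

```python
from dataclasses import dataclass

# Assumption: the figures quoted in the prerequisites are minimums.
MIN_CLI_SCORE_PCT = 84.0
MIN_LOAD_TEST_ASSERTIONS = 10_000
MIN_SUSTAINED_RATE_PER_SEC = 1_000.0


@dataclass
class ReadinessReport:
    """Hypothetical summary of the three prerequisite activities."""
    cli_score_pct: float
    critical_checks_passed: bool
    load_test_assertions: int
    sustained_rate_per_sec: float
    backup_roundtrip_validated: bool


def unmet_prerequisites(report: ReadinessReport) -> list[str]:
    """Return the names of prerequisites the report does not satisfy."""
    unmet = []
    if report.cli_score_pct < MIN_CLI_SCORE_PCT or not report.critical_checks_passed:
        unmet.append("Production Readiness Verification")
    if (
        report.load_test_assertions < MIN_LOAD_TEST_ASSERTIONS
        or report.sustained_rate_per_sec < MIN_SUSTAINED_RATE_PER_SEC
    ):
        unmet.append("Load Testing")
    if not report.backup_roundtrip_validated:
        unmet.append("Backup/Restore Testing")
    return unmet


if __name__ == "__main__":
    report = ReadinessReport(80.0, True, 10_000, 500.0, False)
    print(unmet_prerequisites(report))
```

An empty result means all three prerequisites are met under these assumed thresholds; adjust the constants to whatever your own readiness tooling reports.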
Support
For questions or issues:
- 📖 Documentation bugs: Report at GitHub Issues
- 💬 Community support: [Discussion forum link TBD]
- 🚨 Security issues: security@stemedb.io (or your org's security contact)
Contributing
Operations documentation is living documentation. If you:
- Encounter an incident not covered by runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example
Submit pull requests to keep this guide current and valuable.
Last Updated: 2026-03-02