StemeDB Operations Guide
Welcome to the StemeDB operations hub. This documentation covers how to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.
Quick Links
| Need to... | Go to |
|---|---|
| Deploy to k3s (100 projects) | k3s Deploy Roadmap |
| Deploy for the first time | Single-Node Pilot Architecture |
| Troubleshoot an incident | Operational Runbooks |
| Scale to production | Three-Node Cluster Architecture |
| Size your deployment | Resource Sizing Guide |
| Configure networking | Network Requirements |
| Deploy with Docker Compose | Pilot with Monitoring |
| Set up reverse proxy | Nginx Config / Envoy Config |
| Validate pilot success | Pilot Success Criteria |
Operations Documentation
🚨 Runbooks
When things go wrong at 2am, these runbooks provide step-by-step incident response procedures:
- Server Won't Start - Port conflicts, TLS errors, WAL corruption
- High Query Latency - Performance degradation, replication lag
- Quarantine Overflow - Content defense queue management
- Circuit Breaker Stuck - Agent bans and manual resets
- Restore from Backup - Disaster recovery procedures
- Disk Full - Storage management and WAL cleanup
- Add Node to Cluster - Cluster expansion procedures
Start here: Troubleshooting Flowchart - Decision tree from symptom to runbook
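As a rough illustration of the symptom-to-runbook idea, the sketch below maps free-text symptoms to the runbook titles listed above. The keyword lists are taken from the runbook summaries, but the `RUNBOOKS` table, the `find_runbooks` helper, and the substring matching are all hypothetical; the actual flowchart may branch differently.

```python
from __future__ import annotations

# Hypothetical lookup table: runbook title -> symptom keywords.
# Keywords are drawn from the runbook summaries in this guide; the
# matching logic is an illustration, not the real flowchart.
RUNBOOKS: dict[str, list[str]] = {
    "Server Won't Start": ["port conflict", "tls error", "wal corruption"],
    "High Query Latency": ["performance degradation", "replication lag"],
    "Quarantine Overflow": ["quarantine"],
    "Circuit Breaker Stuck": ["agent ban", "circuit breaker"],
    "Restore from Backup": ["disaster recovery", "restore"],
    "Disk Full": ["disk full", "wal cleanup"],
    "Add Node to Cluster": ["cluster expansion", "add node"],
}


def find_runbooks(symptom: str) -> list[str]:
    """Return runbook titles whose keywords appear in the symptom text."""
    text = symptom.lower()
    return [
        title
        for title, keywords in RUNBOOKS.items()
        if any(keyword in text for keyword in keywords)
    ]


if __name__ == "__main__":
    print(find_runbooks("TLS error on startup"))
```

A symptom that matches no keyword returns an empty list, which is the cue to fall back to the full flowchart.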
🏗️ Reference Architectures
Choose your deployment model based on scale, availability requirements, and operational maturity:
| Architecture | Target | Assertions | Queries/sec | RTO/RPO | Guide |
|---|---|---|---|---|---|
| Single-Node Pilot | PoC, friendly pilot | <10K | <100 | 2hr / 24hr | Guide |
| Three-Node Cluster | Production | <100K | <1K | 5min / 1min | Guide |
| Enterprise (future) | Large-scale | >100K | >1K | 1min / 0min | Roadmap (P6+) |
Also see:
- Network Requirements - Ports, firewalls, TLS, DNS
- Resource Sizing - CPU, RAM, disk calculations
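The tier boundaries in the architecture table can be expressed as a small selection function. This is a minimal sketch: `choose_architecture` is a hypothetical helper, and it assumes that a workload sitting exactly on a boundary (for example, exactly 100K assertions) moves up to the next tier, which the table does not specify.

```python
def choose_architecture(assertions: int, queries_per_sec: int) -> str:
    """Pick the smallest tier whose limits cover both numbers.

    Thresholds come from the reference architecture table; treating
    boundary values as belonging to the larger tier is an assumption.
    """
    if assertions < 10_000 and queries_per_sec < 100:
        return "Single-Node Pilot"
    if assertions < 100_000 and queries_per_sec < 1_000:
        return "Three-Node Cluster"
    return "Enterprise (future)"


if __name__ == "__main__":
    # A small data set with a heavier query load still needs the cluster tier.
    print(choose_architecture(assertions=5_000, queries_per_sec=500))
```

Either dimension alone can push a deployment into the next tier, so both are checked together. The Resource Sizing guide remains the place to turn for actual hardware numbers.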
📦 Deployment Examples
Infrastructure-as-Code examples ready to customize for your environment:
- Docker Compose + Monitoring - Turnkey deployment with Prometheus + Grafana
- Nginx Reverse Proxy - TLS termination, rate limiting, security headers
- Envoy Gateway - Advanced load balancing, circuit breakers, retries
✅ Pilot Success Criteria
Before going to production, validate your pilot meets these criteria:
- Pilot Success Criteria - Performance, functional, operational requirements
- 5 Amazement Moments - Demo validation checklist
- Acceptance Criteria - Must Pass / Should Pass / Nice to Have
Common Tasks
First-Time Deployment
- Review Single-Node Pilot Architecture
- Follow Resource Sizing Guide to choose hardware
- Deploy using Docker Compose example
- Configure reverse proxy (Nginx or Envoy)
- Validate against Pilot Success Criteria
Incident Response
- Identify symptom (error message, alert, user report)
- Check Troubleshooting Flowchart
- Follow relevant runbook (see list above)
- Document the resolution, and add it to a runbook if the scenario is new
Scaling to Production
- Validate pilot success with Success Criteria
- Review Three-Node Cluster Architecture
- Plan migration (data backup, node provisioning, DNS changes)
- Execute deployment with rolling validation
- Set up monitoring (see Docker Compose example)
Prerequisites
Before using these operations guides, ensure you've completed:
- ✅ Production Readiness Verification - 84% CLI score, all critical checks pass
- ✅ Load Testing - 10K assertions baseline, 1K/sec sustained
- ✅ Backup/Restore Testing - Validated roundtrip recovery
Support
For questions or issues:
- 📖 Documentation bugs: Report at GitHub Issues
- 💬 Community support: [Discussion forum link TBD]
- 🚨 Security issues: security@stemedb.io (or your org's security contact)
Contributing
Operations documentation is living documentation. If you:
- Encounter an incident not covered by runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example
Submit pull requests to keep this guide current and valuable.
Last Updated: 2026-03-02