
StemeDB Operations Guide

Welcome to the StemeDB operations hub. This guide covers how to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.

Quick Links

Need to...	Go to
Deploy to k3s (100 projects)	k3s Deploy Roadmap
Deploy for the first time	Single-Node Pilot Architecture
Troubleshoot an incident	Operational Runbooks
Scale to production	Three-Node Cluster Architecture
Size your deployment	Resource Sizing Guide
Configure networking	Network Requirements
Deploy with Docker Compose	Pilot with Monitoring
Set up reverse proxy	Nginx Config / Envoy Config
Validate pilot success	Pilot Success Criteria

Operations Documentation

🚨 Runbooks

When things go wrong at 2am, these runbooks provide step-by-step incident response procedures:

Server Won't Start - Port conflicts, TLS errors, WAL corruption
High Query Latency - Performance degradation, replication lag
Quarantine Overflow - Content defense queue management
Circuit Breaker Stuck - Agent bans and manual resets
Restore from Backup - Disaster recovery procedures
Disk Full - Storage management and WAL cleanup
Add Node to Cluster - Cluster expansion procedures

Start here: Troubleshooting Flowchart - Decision tree from symptom to runbook
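
As a rough illustration of the symptom-to-runbook idea, the sketch below maps free-text symptoms to the runbook titles listed above. The keyword lists and the `suggest_runbooks` helper are hypothetical, chosen for illustration only; the Troubleshooting Flowchart remains the authoritative decision tree.

```python
# Hypothetical keyword table: runbook titles come from the list above,
# but the keywords are illustrative assumptions, not part of StemeDB.
RUNBOOK_KEYWORDS = {
    "Server Won't Start": ("port conflict", "tls", "wal corruption"),
    "High Query Latency": ("latency", "slow", "replication lag"),
    "Quarantine Overflow": ("quarantine",),
    "Circuit Breaker Stuck": ("circuit breaker", "agent ban"),
    "Restore from Backup": ("restore", "disaster recovery"),
    "Disk Full": ("disk full", "no space", "wal cleanup"),
    "Add Node to Cluster": ("add node", "cluster expansion"),
}


def suggest_runbooks(symptom: str) -> list[str]:
    """Return every runbook whose keywords appear in the symptom text."""
    text = symptom.lower()
    return [
        name
        for name, keywords in RUNBOOK_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


if __name__ == "__main__":
    print(suggest_runbooks("TLS handshake error on startup"))
    print(suggest_runbooks("Replication lag and disk full alerts"))
```

A symptom can match more than one runbook, so the helper returns a list rather than a single title. An empty list means the symptom is not covered and may be a candidate for a new runbook.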

🏗️ Reference Architectures

Choose your deployment model based on scale, availability requirements, and operational maturity:

Architecture	Target	Assertions	Queries/sec	RTO/RPO	Guide
Single-Node Pilot	PoC, friendly pilot	<10K	<100/sec	2hr / 24hr	Guide
Three-Node Cluster	Production	<100K	<1K/sec	5min / 1min	Guide
Enterprise (future)	Large-scale	>100K	>1K/sec	1min / 0min	Roadmap (P6+)
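
The table can be read as a simple lookup: pick the smallest tier whose limits cover your workload and whose RTO meets your availability target. The sketch below encodes the table's numbers. The `Architecture` class and `choose_architecture` function are hypothetical names for illustration, and treating a value exactly at a boundary (for example 100K assertions) as belonging to the next tier is an assumption, since the table does not say.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Architecture:
    name: str
    max_assertions: float  # exclusive upper bound, from the table
    max_qps: float         # exclusive upper bound, from the table
    rto_minutes: int
    rpo_minutes: int


# Values copied from the table above, ordered from smallest tier to largest.
ARCHITECTURES = [
    Architecture("Single-Node Pilot", 10_000, 100, rto_minutes=120, rpo_minutes=24 * 60),
    Architecture("Three-Node Cluster", 100_000, 1_000, rto_minutes=5, rpo_minutes=1),
    Architecture("Enterprise (future)", float("inf"), float("inf"), rto_minutes=1, rpo_minutes=0),
]


def choose_architecture(
    assertions: int,
    peak_qps: float,
    required_rto_minutes: Optional[int] = None,
) -> Architecture:
    """Return the smallest tier that covers the workload and the RTO target."""
    for arch in ARCHITECTURES:
        fits_scale = assertions < arch.max_assertions and peak_qps < arch.max_qps
        fits_rto = required_rto_minutes is None or arch.rto_minutes <= required_rto_minutes
        if fits_scale and fits_rto:
            return arch
    # Nothing smaller fits, so fall back to the largest tier.
    return ARCHITECTURES[-1]


if __name__ == "__main__":
    print(choose_architecture(5_000, 50).name)
    print(choose_architecture(5_000, 50, required_rto_minutes=10).name)
```

Note that a small workload with a strict recovery target still lands on the Three-Node Cluster, because the pilot's 2hr RTO does not satisfy it. Operational maturity, the third factor named above, is not modeled here.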

Also see:

Network Requirements - Ports, firewalls, TLS, DNS
Resource Sizing - CPU, RAM, disk calculations

📦 Deployment Examples

Infrastructure-as-Code examples ready to customize for your environment:

Docker Compose + Monitoring - Turnkey deployment with Prometheus + Grafana
Nginx Reverse Proxy - TLS termination, rate limiting, security headers
Envoy Gateway - Advanced load balancing, circuit breakers, retries

✅ Pilot Success Criteria

Before going to production, validate your pilot meets these criteria:

Pilot Success Criteria - Performance, functional, operational requirements
5 Amazement Moments - Demo validation checklist
Acceptance Criteria - Must Pass / Should Pass / Nice to Have

Common Tasks

First-Time Deployment

Review Single-Node Pilot Architecture
Follow Resource Sizing Guide to choose hardware
Deploy using Docker Compose example
Configure reverse proxy (Nginx or Envoy)
Validate against Pilot Success Criteria

Incident Response

Identify symptom (error message, alert, user report)
Check Troubleshooting Flowchart
Follow relevant runbook (see list above)
Document the resolution, and add it to the runbooks if the scenario is new

Scaling to Production

Validate pilot success with Success Criteria
Review Three-Node Cluster Architecture
Plan migration (data backup, node provisioning, DNS changes)
Execute deployment with rolling validation
Set up monitoring (see Docker Compose example)
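
One way to read "rolling validation" in the steps above is to bring nodes up one at a time and confirm each is healthy before touching the next. The sketch below shows that control flow only. `rolling_deploy` and its `deploy` and `validate` callables are hypothetical placeholders for whatever tooling and health checks your environment provides; they are not StemeDB APIs.

```python
from typing import Callable, Iterable


def rolling_deploy(
    nodes: Iterable[str],
    deploy: Callable[[str], None],
    validate: Callable[[str], bool],
) -> list[str]:
    """Deploy nodes one at a time, halting at the first failed validation.

    Returns the nodes that were deployed and validated successfully.
    """
    validated: list[str] = []
    for node in nodes:
        deploy(node)
        if not validate(node):
            # Stop here so remaining nodes keep running the previous version.
            raise RuntimeError(
                f"validation failed on {node}; validated so far: {validated}"
            )
        validated.append(node)
    return validated


if __name__ == "__main__":
    # Stand-in callables for demonstration; replace with real tooling.
    deployed = []
    result = rolling_deploy(["node-1", "node-2", "node-3"], deployed.append, lambda n: True)
    print(result)
```

Halting on the first failure keeps the remaining nodes untouched, which limits the blast radius of a bad rollout and leaves a known-good subset to fall back on.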

Prerequisites

Before using these operations guides, ensure you've completed:

✅ Production Readiness Verification - 84% CLI score, all critical checks pass
✅ Load Testing - 10K assertions baseline, 1K/sec sustained
✅ Backup/Restore Testing - Validated roundtrip recovery

Support

For questions or issues:

📖 Documentation bugs: Report at GitHub Issues
💬 Community support: [Discussion forum link TBD]
🚨 Security issues: security@stemedb.io (or your org's security contact)

Contributing

Operations documentation is living documentation. If you:

Encounter an incident not covered by runbooks → Add it
Find an architecture pattern that works well → Document it
Discover a configuration improvement → Share the example

Submit pull requests to keep this guide current and valuable.

Last Updated: 2026-03-02