# StemeDB Operations Guide

**Welcome to the StemeDB operations hub.** This guide covers how to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.

## Quick Links

| Need to... | Go to |
|------------|-------|
| **Deploy to k3s (100 projects)** | [k3s Deploy Roadmap](./deployment/k8s-deploy-roadmap.md) |
| **Deploy for the first time** | [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md) |
| **Troubleshoot an incident** | [Operational Runbooks](./runbooks/) |
| **Scale to production** | [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md) |
| **Size your deployment** | [Resource Sizing Guide](./reference-architecture/resource-sizing.md) |
| **Configure networking** | [Network Requirements](./reference-architecture/network-requirements.md) |
| **Deploy with Docker Compose** | [Pilot with Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml) |
| **Set up reverse proxy** | [Nginx Config](./deployment/nginx/stemedb.conf) / [Envoy Config](./deployment/envoy/stemedb.yaml) |
| **Validate pilot success** | [Pilot Success Criteria](./pilot-success-criteria.md) |

---

## Operations Documentation

### 🚨 Runbooks

**When things go wrong at 2am**, these runbooks provide step-by-step incident response procedures:

- **[Server Won't Start](./runbooks/server-wont-start.md)** - Port conflicts, TLS errors, WAL corruption
- **[High Query Latency](./runbooks/high-query-latency.md)** - Performance degradation, replication lag
- **[Quarantine Overflow](./runbooks/quarantine-overflow.md)** - Content defense queue management
- **[Circuit Breaker Stuck](./runbooks/circuit-breaker-stuck.md)** - Agent bans and manual resets
- **[Restore from Backup](./runbooks/restore-from-backup.md)** - Disaster recovery procedures
- **[Disk Full](./runbooks/disk-full.md)** - Storage management and WAL cleanup
- **[Add Node to Cluster](./runbooks/add-node.md)** - Cluster expansion procedures

**Start here:** [Troubleshooting Flowchart](./troubleshooting-flowchart.md) - Decision tree from symptom to runbook

---

### 🏗️ Reference Architectures

**Choose your deployment model** based on scale, availability requirements, and operational maturity:

| Architecture | Target | Assertions | Queries/sec | RTO/RPO | Guide |
|--------------|--------|-----------|-------------|---------|-------|
| **Single-Node Pilot** | PoC, friendly pilot | <10K | <100/sec | 2hr / 24hr | [Guide](./reference-architecture/single-node-pilot.md) |
| **Three-Node Cluster** | Production | <100K | <1K/sec | 5min / 1min | [Guide](./reference-architecture/three-node-cluster.md) |
| **Enterprise (future)** | Large-scale | >100K | >1K/sec | 1min / 0min | Roadmap (P6+) |

**Also see:**
- [Network Requirements](./reference-architecture/network-requirements.md) - Ports, firewalls, TLS, DNS
- [Resource Sizing](./reference-architecture/resource-sizing.md) - CPU, RAM, disk calculations
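
As a rough illustration of how the table above can be read, the following sketch picks the smallest architecture whose assertion and query-rate limits both cover a workload. The names `Architecture`, `TIERS`, and `recommend_architecture` are hypothetical and not part of StemeDB. The table does not say which tier a value exactly on a boundary (for example, 10K assertions) belongs to, so assigning it to the larger tier is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    """One row of the reference architecture table (hypothetical helper type)."""
    name: str
    max_assertions: int  # exclusive upper bound on stored assertions
    max_qps: int         # exclusive upper bound on queries per second


# Upper bounds copied from the table above. The Enterprise tier has no upper
# bound, so it is used as the fallback rather than listed here.
TIERS = [
    Architecture("Single-Node Pilot", 10_000, 100),
    Architecture("Three-Node Cluster", 100_000, 1_000),
]


def recommend_architecture(assertions: int, queries_per_sec: float) -> str:
    """Return the smallest tier whose limits both exceed the workload.

    Assumption: a value exactly on a boundary moves up to the larger tier,
    because the table only gives strict "<" limits.
    """
    for tier in TIERS:
        if assertions < tier.max_assertions and queries_per_sec < tier.max_qps:
            return tier.name
    return "Enterprise (future)"


if __name__ == "__main__":
    print(recommend_architecture(5_000, 50))    # small workload
    print(recommend_architecture(5_000, 500))   # query rate exceeds the pilot limit
    print(recommend_architecture(200_000, 50))  # assertion count exceeds the cluster limit
```

This only encodes the scale columns. RTO/RPO targets and operational maturity still need to be weighed by hand against the linked guides.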

---

### 📦 Deployment Examples

**Infrastructure-as-Code** examples ready to customize for your environment:

- **[Docker Compose + Monitoring](./deployment/docker-compose/pilot-with-monitoring.yml)** - Turnkey deployment with Prometheus + Grafana
- **[Nginx Reverse Proxy](./deployment/nginx/stemedb.conf)** - TLS termination, rate limiting, security headers
- **[Envoy Gateway](./deployment/envoy/stemedb.yaml)** - Advanced load balancing, circuit breakers, retries

---

### ✅ Pilot Success Criteria

**Before going to production**, validate your pilot meets these criteria:

- **[Pilot Success Criteria](./pilot-success-criteria.md)** - Performance, functional, operational requirements
- **5 Amazement Moments** - Demo validation checklist
- **Acceptance Criteria** - Must Pass / Should Pass / Nice to Have

---

## Common Tasks

### First-Time Deployment

1. Review [Single-Node Pilot Architecture](./reference-architecture/single-node-pilot.md)
2. Follow [Resource Sizing Guide](./reference-architecture/resource-sizing.md) to choose hardware
3. Deploy using [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml)
4. Configure reverse proxy ([Nginx](./deployment/nginx/stemedb.conf) or [Envoy](./deployment/envoy/stemedb.yaml))
5. Validate against [Pilot Success Criteria](./pilot-success-criteria.md)

### Incident Response

1. Identify the symptom (error message, alert, or user report)
2. Check the [Troubleshooting Flowchart](./troubleshooting-flowchart.md)
3. Follow the relevant runbook (see the Runbooks list above)
4. Document the resolution, and add a new runbook if the scenario is not yet covered
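
Steps 1 to 3 can be sketched as a keyword lookup from a symptom description to a runbook. This is a minimal illustration, not a replacement for the flowchart: the runbook paths come from the Runbooks list above, but the keywords and the names `RUNBOOK_KEYWORDS` and `find_runbooks` are assumptions made for this example.

```python
# Runbook paths are taken from the Runbooks list in this guide. The keywords
# are illustrative assumptions derived from each runbook's one-line summary.
RUNBOOK_KEYWORDS = {
    "./runbooks/server-wont-start.md": ("port conflict", "tls", "wal corruption"),
    "./runbooks/high-query-latency.md": ("latency", "replication lag"),
    "./runbooks/quarantine-overflow.md": ("quarantine",),
    "./runbooks/circuit-breaker-stuck.md": ("circuit breaker", "agent ban"),
    "./runbooks/restore-from-backup.md": ("restore", "disaster recovery"),
    "./runbooks/disk-full.md": ("disk full", "wal cleanup"),
    "./runbooks/add-node.md": ("add node", "cluster expansion"),
}

FLOWCHART = "./troubleshooting-flowchart.md"


def find_runbooks(symptom: str) -> list[str]:
    """Return runbooks whose keywords appear in the symptom text.

    Falls back to the troubleshooting flowchart when nothing matches.
    """
    text = symptom.lower()
    matches = [
        path
        for path, keywords in RUNBOOK_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    return matches or [FLOWCHART]


if __name__ == "__main__":
    for report in ("TLS handshake failed on startup", "p99 latency doubled", "something odd"):
        print(report, "->", find_runbooks(report))
```

A symptom can match more than one runbook (for example, a full disk that also corrupts the WAL), which is why the function returns a list rather than a single path.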

### Scaling to Production

1. Validate pilot success with [Success Criteria](./pilot-success-criteria.md)
2. Review [Three-Node Cluster Architecture](./reference-architecture/three-node-cluster.md)
3. Plan migration (data backup, node provisioning, DNS changes)
4. Execute deployment with rolling validation
5. Set up monitoring (see [Docker Compose example](./deployment/docker-compose/pilot-with-monitoring.yml))

---

## Prerequisites

**Before using these operations guides**, ensure you've completed:

- ✅ [Production Readiness Verification](../../uat/production-readiness/README.md) - 84% CLI score, all critical checks pass
- ✅ [Load Testing](../../uat/production-readiness/README.md#load-testing) - 10K assertions baseline, 1K/sec sustained
- ✅ [Backup/Restore Testing](../../scripts/) - Validated roundtrip recovery

---

## Support

**For questions or issues:**

- 📖 **Documentation bugs:** Report at [GitHub Issues](https://github.com/anthropics/stemedb/issues)
- 💬 **Community support:** [Discussion forum link TBD]
- 🚨 **Security issues:** security@stemedb.io (or your org's security contact)

---

## Contributing

**Operations documentation is living documentation.** If you:

- Encounter an incident not covered by runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example

Submit pull requests to keep this guide current and valuable.

---

**Last Updated:** 2026-03-07