# StemeDB Reference Architectures

**Choose the right deployment model** for your scale, availability requirements, and operational maturity.

---

## Architecture Comparison

| Architecture | Target Use Case | Assertions | Queries/sec | Availability | RTO/RPO | Complexity |
|--------------|----------------|-----------|-------------|--------------|---------|------------|
| **[Single-Node Pilot](./single-node-pilot.md)** | PoC, friendly pilot, development | <10K | <100/sec | Single point of failure | 2hr / 24hr | ⭐ Low |
| **[Three-Node Cluster](./three-node-cluster.md)** | Production, enterprise pilot | <100K | <1K/sec | Survives 1 node failure | 5min / 1min | ⭐⭐ Medium |
| **Enterprise Cluster** (Roadmap P6) | Large-scale production | >100K | >1K/sec | Survives 2 node failures | 1min / 10s | ⭐⭐⭐ High |

---

## Quick Links

| Need to... | Go to |
|------------|-------|
| **Deploy first pilot** | [Single-Node Pilot](./single-node-pilot.md) |
| **Scale to production** | [Three-Node Cluster](./three-node-cluster.md) |
| **Configure networking** | [Network Requirements](./network-requirements.md) |
| **Size hardware** | [Resource Sizing](./resource-sizing.md) |
| **View architecture diagrams** | [Diagrams Directory](./diagrams/) |

---

## Decision Tree

```
What's your use case?
│
├─► Proof of concept / Friendly pilot
│   └─► [Single-Node Pilot](./single-node-pilot.md)
│       • Simplest deployment
│       • Manual recovery acceptable
│       • <10K assertions
│       • Deploy time: <2 hours
│
├─► Production deployment
│   └─► [Three-Node Cluster](./three-node-cluster.md)
│       • High availability (1 node failure)
│       • Automatic replication
│       • <100K assertions, <1K queries/sec
│       • Deploy time: <1 day
│
└─► Large-scale production
    └─► Enterprise Cluster (Roadmap P6)
        • Multi-region support
        • Automatic failover
        • >100K assertions, >1K queries/sec
        • Requires enterprise support
```

---

## Key Concepts

### RTO (Recovery Time Objective)

**How long until service is restored after failure?**

- **Single-Node:** 2 hours (manual restore from backup)
- **Three-Node:** 5 minutes (automatic failover to remaining nodes)
- **Enterprise:** 1 minute (multi-region automatic failover)

### RPO (Recovery Point Objective)

**How much data loss is acceptable?**

- **Single-Node:** 24 hours (daily backup schedule)
- **Three-Node:** 1 minute (real-time replication with replication factor 2)
- **Enterprise:** 10 seconds (multi-region replication)

### Replication Factor

**How many copies of each assertion?**

- **Single-Node:** 1 copy (no replication)
- **Three-Node:** 2 copies (survives 1 node loss)
- **Enterprise:** 3 copies (survives 2 node losses)

### Consistency Model

**All deployments use eventual consistency via CRDTs:**

- Writes accepted immediately (optimistic)
- Conflicts resolved at read-time via Lenses
- Replication lag typically <1s within cluster
- No distributed transactions or 2PC overhead

---

## Architecture Principles

**All StemeDB architectures follow these principles:**

1. **Append-Only:** No overwrites, all history preserved
2. **Conflict-Free:** CRDTs for automatic merge without coordination
3. **Lens-Based Resolution:** Conflicts resolved at query time, not write time
4. **Content-Addressed:** Assertions identified by BLAKE3 hash, enabling Merkle sync
5. **Zero-Copy Serialization:** rkyv for minimal overhead

**See:** [Architecture Overview](../../../architecture.md) for full details.

---

## Migration Paths

### Single-Node → Three-Node

**When to migrate:**

- Assertion count approaching 10K
- Query latency >1s sustained
- Need for high availability
- Production readiness validation complete

**Migration procedure:**

1. Provision 2 new nodes
2. Configure cluster on all 3 nodes
3. Restart single-node with cluster config
4. Trigger Merkle sync to replicate data
5. Update DNS/load balancer to point to cluster

**Estimated downtime:** 5-15 minutes for replication

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed steps.

### Three-Node → Enterprise Cluster

**When to migrate:**

- Assertion count approaching 100K
- Query rate >1K/sec
- Need for multi-region deployment
- Compliance requirements for geo-redundancy

**Requires:** Enterprise support (Roadmap P6)

---

## Deployment Checklist

**Before deploying ANY architecture:**

- [ ] **Production readiness verification passed**
  - See: [UAT Production Readiness](../../../../uat/production-readiness/README.md)
  - Minimum 84% CLI score required
- [ ] **Backup/restore tested**
  - Validated backup script execution
  - Tested restore roundtrip
  - Documented recovery procedures
- [ ] **Network configuration complete**
  - Firewall rules applied
  - DNS records configured
  - TLS certificates provisioned
  - See: [Network Requirements](./network-requirements.md)
- [ ] **Monitoring set up**
  - Prometheus scraping /metrics
  - Grafana dashboards deployed
  - Alerts configured (disk, latency, availability)
- [ ] **Runbooks reviewed**
  - Team familiar with [7 operational runbooks](../../runbooks/)
  - On-call rotation established
  - Escalation paths documented
- [ ] **Pilot success criteria defined**
  - See: [Pilot Success Criteria](../../pilot-success-criteria.md)
  - Acceptance tests written
  - Demo script prepared

---

## Related Documentation

- [Operations Hub](../../README.md) - Main operations documentation
- [Deployment Examples](../../deployment/) - IaC configs (Docker Compose, Nginx, Envoy)
- [Operational Runbooks](../../runbooks/) - Incident response procedures
- [Pilot Success Criteria](../../pilot-success-criteria.md) - Validation checklist

---

**Last Updated:** 2026-02-11
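
The thresholds in the Architecture Comparison table and the Decision Tree can also be read as a small selection rule. The following is a minimal sketch in Python, not part of StemeDB: the `Architecture` class and the `recommend_architecture` function are hypothetical names introduced for illustration, and the limits are copied from the table above. It assumes the table's bounds are strict, so a workload at exactly 100K assertions falls through to the Enterprise tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    """Hypothetical record for one row of the comparison table."""
    name: str
    max_assertions: float       # exclusive upper bound on assertion count
    max_queries_per_sec: float  # exclusive upper bound on query rate
    replication_factor: int     # copies of each assertion

    @property
    def node_failures_tolerated(self) -> int:
        # N copies of each assertion survive the loss of N - 1 nodes.
        return self.replication_factor - 1


# Ordered from lowest to highest complexity; limits taken from the table.
ARCHITECTURES = [
    Architecture("Single-Node Pilot", 10_000, 100, 1),
    Architecture("Three-Node Cluster", 100_000, 1_000, 2),
    Architecture("Enterprise Cluster (Roadmap P6)", float("inf"), float("inf"), 3),
]


def recommend_architecture(assertions, queries_per_sec, node_failures_to_survive=0):
    """Return the simplest architecture whose limits cover the workload."""
    for arch in ARCHITECTURES:
        if (
            assertions < arch.max_assertions
            and queries_per_sec < arch.max_queries_per_sec
            and arch.node_failures_tolerated >= node_failures_to_survive
        ):
            return arch
    # Nothing smaller fits, so fall back to the largest tier.
    return ARCHITECTURES[-1]


if __name__ == "__main__":
    print(recommend_architecture(5_000, 50).name)
    print(recommend_architecture(5_000, 50, node_failures_to_survive=1).name)
    print(recommend_architecture(500_000, 2_000).name)
```

The helper deliberately ignores RTO/RPO, operational maturity, and the deployment checklist, all of which should still drive the real decision.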