stemedb/docs/operations/reference-architecture/README.md

# StemeDB Reference Architectures

**Choose the right deployment model** for your scale, availability requirements, and operational maturity.

---

## Architecture Comparison

| Architecture | Target Use Case | Assertions | Queries/sec | Availability | RTO/RPO | Complexity |
|--------------|----------------|-----------|-------------|--------------|---------|------------|
| **[Single-Node Pilot](./single-node-pilot.md)** | PoC, friendly pilot, development | <10K | <100/sec | Single point of failure | 2hr / 24hr | ⭐ Low |
| **[Three-Node Cluster](./three-node-cluster.md)** | Production, enterprise pilot | <100K | <1K/sec | Survives 1 node failure | 5min / 1min | ⭐⭐ Medium |
| **Enterprise Cluster** (Roadmap P6) | Large-scale production | >100K | >1K/sec | Survives 2 node failures | 1min / 10s | ⭐⭐⭐ High |

---

## Quick Links

| Need to... | Go to |
|------------|-------|
| **Deploy first pilot** | [Single-Node Pilot](./single-node-pilot.md) |
| **Scale to production** | [Three-Node Cluster](./three-node-cluster.md) |
| **Configure networking** | [Network Requirements](./network-requirements.md) |
| **Size hardware** | [Resource Sizing](./resource-sizing.md) |
| **View architecture diagrams** | [Diagrams Directory](./diagrams/) |

---

## Decision Tree

```
What's your use case?
    │
    ├─► Proof of concept / Friendly pilot
    │   └─► [Single-Node Pilot](./single-node-pilot.md)
    │       • Simplest deployment
    │       • Manual recovery acceptable
    │       • <10K assertions
    │       • Deploy time: <2 hours
    │
    ├─► Production deployment
    │   └─► [Three-Node Cluster](./three-node-cluster.md)
    │       • High availability (1 node failure)
    │       • Automatic replication
    │       • <100K assertions, <1K queries/sec
    │       • Deploy time: <1 day
    │
    └─► Large-scale production
        └─► Enterprise Cluster (Roadmap P6)
            • Multi-region support
            • Automatic failover
            • >100K assertions, >1K queries/sec
            • Requires enterprise support
```

---

## Key Concepts

### RTO (Recovery Time Objective)

**How long until service is restored after failure?**

- **Single-Node:** 2 hours (manual restore from backup)
- **Three-Node:** 5 minutes (automatic failover to remaining nodes)
- **Enterprise:** 1 minute (multi-region automatic failover)

### RPO (Recovery Point Objective)

**How much data loss is acceptable?**

- **Single-Node:** 24 hours (daily backup schedule)
- **Three-Node:** 1 minute (real-time replication with replication factor 2)
- **Enterprise:** 10 seconds (multi-region replication)

### Replication Factor

**How many copies of each assertion?**

- **Single-Node:** 1 copy (no replication)
- **Three-Node:** 2 copies (survives 1 node loss)
- **Enterprise:** 3 copies (survives 2 node losses)

### Consistency Model

**All deployments use eventual consistency via CRDTs:**
- Writes accepted immediately (optimistic)
- Conflicts resolved at read-time via Lenses
- Replication lag typically <1s within cluster
- No distributed transactions or 2PC overhead

---

## Architecture Principles

**All StemeDB architectures follow these principles:**

1. **Append-Only:** No overwrites, all history preserved
2. **Conflict-Free:** CRDTs for automatic merge without coordination
3. **Lens-Based Resolution:** Conflicts resolved at query time, not write time
4. **Content-Addressed:** Assertions identified by BLAKE3 hash, enabling Merkle sync
5. **Zero-Copy Serialization:** rkyv for minimal overhead

**See:** [Architecture Overview](../../../architecture.md) for full details.

---

## Migration Paths

### Single-Node → Three-Node

**When to migrate:**
- Assertion count approaching 10K
- Query latency >1s sustained
- Need for high availability
- Production readiness validation complete

**Migration procedure:**
1. Provision 2 new nodes
2. Configure cluster on all 3 nodes
3. Restart single-node with cluster config
4. Trigger Merkle sync to replicate data
5. Update DNS/load balancer to point to cluster

**Estimated downtime:** 5-15 minutes for replication

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed steps.

### Three-Node → Enterprise Cluster

**When to migrate:**
- Assertion count approaching 100K
- Query rate >1K/sec
- Need for multi-region deployment
- Compliance requirements for geo-redundancy

**Requires:** Enterprise support (Roadmap P6)

---

## Deployment Checklist

**Before deploying ANY architecture:**

- [ ] **Production readiness verification passed**
  - See: [UAT Production Readiness](../../../../uat/production-readiness/README.md)
  - Minimum 84% CLI score required

- [ ] **Backup/restore tested**
  - Validated backup script execution
  - Tested restore roundtrip
  - Documented recovery procedures

- [ ] **Network configuration complete**
  - Firewall rules applied
  - DNS records configured
  - TLS certificates provisioned
  - See: [Network Requirements](./network-requirements.md)

- [ ] **Monitoring set up**
  - Prometheus scraping /metrics
  - Grafana dashboards deployed
  - Alerts configured (disk, latency, availability)

- [ ] **Runbooks reviewed**
  - Team familiar with [7 operational runbooks](../../runbooks/)
  - On-call rotation established
  - Escalation paths documented

- [ ] **Pilot success criteria defined**
  - See: [Pilot Success Criteria](../../pilot-success-criteria.md)
  - Acceptance tests written
  - Demo script prepared

---

## Related Documentation

- [Operations Hub](../../README.md) - Main operations documentation
- [Deployment Examples](../../deployment/) - IaC configs (Docker Compose, Nginx, Envoy)
- [Operational Runbooks](../../runbooks/) - Incident response procedures
- [Pilot Success Criteria](../../pilot-success-criteria.md) - Validation checklist

---

**Last Updated:** 2026-02-11