# StemeDB Reference Architectures

**Choose the right deployment model** for your scale, availability requirements, and operational maturity.

---
## Architecture Comparison

| Architecture | Target Use Case | Assertions | Queries/sec | Availability | RTO / RPO | Complexity |
|--------------|-----------------|------------|-------------|--------------|-----------|------------|
| **[Single-Node Pilot](./single-node-pilot.md)** | PoC, friendly pilot, development | <10K | <100/sec | Single point of failure | 2 hr / 24 hr | ⭐ Low |
| **[Three-Node Cluster](./three-node-cluster.md)** | Production, enterprise pilot | <100K | <1K/sec | Survives 1 node failure | 5 min / 1 min | ⭐⭐ Medium |
| **Enterprise Cluster** (Roadmap P6) | Large-scale production | >100K | >1K/sec | Survives 2 node failures | 1 min / 10 s | ⭐⭐⭐ High |

---
## Quick Links

| Need to... | Go to |
|------------|-------|
| **Deploy first pilot** | [Single-Node Pilot](./single-node-pilot.md) |
| **Scale to production** | [Three-Node Cluster](./three-node-cluster.md) |
| **Configure networking** | [Network Requirements](./network-requirements.md) |
| **Size hardware** | [Resource Sizing](./resource-sizing.md) |
| **View architecture diagrams** | [Diagrams Directory](./diagrams/) |

---
## Decision Tree

```
What's your use case?
│
├─► Proof of concept / Friendly pilot
│   └─► [Single-Node Pilot](./single-node-pilot.md)
│       • Simplest deployment
│       • Manual recovery acceptable
│       • <10K assertions
│       • Deploy time: <2 hours
│
├─► Production deployment
│   └─► [Three-Node Cluster](./three-node-cluster.md)
│       • High availability (1 node failure)
│       • Automatic replication
│       • <100K assertions, <1K queries/sec
│       • Deploy time: <1 day
│
└─► Large-scale production
    └─► Enterprise Cluster (Roadmap P6)
        • Multi-region support
        • Automatic failover
        • >100K assertions, >1K queries/sec
        • Requires enterprise support
```
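The same decision can be written as a small function, which is handy in capacity-planning scripts. This is only a sketch: the `Workload` type, its field names, and `choose_architecture` are hypothetical and not part of StemeDB. The thresholds come from the comparison table above, and how the exact boundary values (10K, 100K) are treated is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a planned deployment (not a StemeDB type)."""
    assertions: int           # expected number of stored assertions
    queries_per_sec: float    # expected sustained query rate
    production: bool          # False for PoC, friendly pilot, or development
    multi_region: bool = False


def choose_architecture(w: Workload) -> str:
    """Return the reference architecture suggested by the decision tree."""
    # Large-scale production: multi-region, or beyond the three-node envelope.
    if w.multi_region or w.assertions > 100_000 or w.queries_per_sec > 1_000:
        return "Enterprise Cluster (Roadmap P6)"
    # Production use, or a pilot that has outgrown the single-node envelope.
    if w.production or w.assertions >= 10_000 or w.queries_per_sec >= 100:
        return "Three-Node Cluster"
    return "Single-Node Pilot"


if __name__ == "__main__":
    pilot = Workload(assertions=5_000, queries_per_sec=20, production=False)
    print(choose_architecture(pilot))
```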

---
## Key Concepts

### RTO (Recovery Time Objective)

**How long until service is restored after failure?**

- **Single-Node:** 2 hours (manual restore from backup)
- **Three-Node:** 5 minutes (automatic failover to remaining nodes)
- **Enterprise:** 1 minute (multi-region automatic failover)

### RPO (Recovery Point Objective)

**How much data loss is acceptable?**

- **Single-Node:** 24 hours (daily backup schedule)
- **Three-Node:** 1 minute (real-time replication with replication factor 2)
- **Enterprise:** 10 seconds (multi-region replication)

### Replication Factor

**How many copies of each assertion?**

- **Single-Node:** 1 copy (no replication)
- **Three-Node:** 2 copies (survives 1 node loss)
- **Enterprise:** 3 copies (survives 2 node losses)
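The figures in this list follow from simple arithmetic: if each assertion is stored on `n` distinct nodes, its data survives the loss of up to `n - 1` of them. A minimal sketch, where `tolerated_node_losses` is an illustrative helper and not a StemeDB function:

```python
def tolerated_node_losses(replication_factor: int) -> int:
    """Number of node losses an assertion survives when each copy
    lives on a different node (assumed placement)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


# Replication factors taken from the list above.
deployments = {"Single-Node": 1, "Three-Node": 2, "Enterprise": 3}

for name, copies in deployments.items():
    print(f"{name}: {copies} copies, survives {tolerated_node_losses(copies)} node loss(es)")
```

Note that this covers data survival only. Whether the service keeps answering queries during a failure also depends on failover, which the RTO figures describe.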

### Consistency Model

**All deployments use eventual consistency via CRDTs:**
- Writes accepted immediately (optimistic)
- Conflicts resolved at read-time via Lenses
- Replication lag typically <1s within cluster
- No distributed transactions or 2PC overhead
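
To make "accept writes immediately, resolve at read time" concrete, here is a minimal sketch using a grow-only set, one of the simplest CRDTs. It is an illustration under assumptions, not StemeDB's implementation: the `Assertion` fields, the `merge` and `read` functions, and the `latest_wins` lens are all hypothetical, and the source does not say which CRDT type or lens API StemeDB actually uses.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Assertion:
    subject: str
    value: str
    timestamp: int  # hypothetical logical timestamp


Replica = FrozenSet[Assertion]
Lens = Callable[[Iterable[Assertion]], Optional[Assertion]]


def merge(a: Replica, b: Replica) -> Replica:
    """Grow-only set merge. Set union is commutative, associative, and
    idempotent, so replicas converge without coordination."""
    return a | b


def latest_wins(candidates: Iterable[Assertion]) -> Optional[Assertion]:
    """Example lens: pick the newest assertion, breaking ties by value."""
    return max(candidates, key=lambda x: (x.timestamp, x.value), default=None)


def read(replica: Replica, subject: str, lens: Lens) -> Optional[Assertion]:
    """Conflicting assertions are all kept; the lens chooses at query time."""
    return lens(a for a in replica if a.subject == subject)


if __name__ == "__main__":
    node_a = frozenset({Assertion("sky", "blue", 1)})
    node_b = frozenset({Assertion("sky", "grey", 2)})
    merged = merge(node_a, node_b)  # both writes survive the merge
    print(read(merged, "sky", latest_wins))
```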

---
## Architecture Principles

**All StemeDB architectures follow these principles:**

1. **Append-Only:** No overwrites, all history preserved
2. **Conflict-Free:** CRDTs for automatic merge without coordination
3. **Lens-Based Resolution:** Conflicts resolved at query time, not write time
4. **Content-Addressed:** Assertions identified by BLAKE3 hash, enabling Merkle sync
5. **Zero-Copy Serialization:** rkyv for minimal overhead

**See:** [Architecture Overview](../../../architecture.md) for full details.
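
The append-only and content-addressed principles can be illustrated in a few lines. This sketch makes two loud assumptions: it uses BLAKE2b from Python's standard library as a stand-in, because BLAKE3 is not in the standard library, and it compares flat sets of IDs instead of walking a Merkle tree. The names `content_id`, `put`, and `missing_ids` are hypothetical.

```python
import hashlib
from typing import Dict, Set


def content_id(payload: bytes) -> str:
    """Derive an ID from the bytes themselves.
    Stand-in hash: BLAKE2b (stdlib), NOT the BLAKE3 that StemeDB uses."""
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


def put(store: Dict[str, bytes], payload: bytes) -> str:
    """Append-only insert: identical content maps to the same key,
    so re-inserting is a no-op and nothing is overwritten."""
    cid = content_id(payload)
    store.setdefault(cid, payload)
    return cid


def missing_ids(local: Dict[str, bytes], remote_ids: Set[str]) -> Set[str]:
    """IDs the local node still needs to fetch from a peer. A real Merkle
    sync would compare subtree hashes instead of full ID sets."""
    return remote_ids - set(local)


if __name__ == "__main__":
    node_a: Dict[str, bytes] = {}
    node_b: Dict[str, bytes] = {}
    put(node_a, b"sky is blue")
    put(node_b, b"sky is blue")
    put(node_b, b"sky is grey")
    print(missing_ids(node_a, set(node_b)))
```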

---
## Migration Paths

### Single-Node → Three-Node

**When to migrate:**
- Assertion count approaching 10K
- Sustained query latency >1s
- Need for high availability
- Production readiness validation complete

**Migration procedure:**
1. Provision 2 new nodes
2. Configure the cluster on all 3 nodes
3. Restart the original single node with the cluster config
4. Trigger Merkle sync to replicate data to the new nodes
5. Update DNS/load balancer to point to the cluster

**Estimated downtime:** 5-15 minutes while data replicates

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed steps.

### Three-Node → Enterprise Cluster

**When to migrate:**
- Assertion count approaching 100K
- Query rate >1K/sec
- Need for multi-region deployment
- Compliance requirements for geo-redundancy

**Requires:** Enterprise support (Roadmap P6)

---
## Deployment Checklist

**Before deploying ANY architecture:**

- [ ] **Production readiness verification passed**
  - See: [UAT Production Readiness](../../../../uat/production-readiness/README.md)
  - Minimum 84% CLI score required

- [ ] **Backup/restore tested**
  - Validated backup script execution
  - Tested restore roundtrip
  - Documented recovery procedures

- [ ] **Network configuration complete**
  - Firewall rules applied
  - DNS records configured
  - TLS certificates provisioned
  - See: [Network Requirements](./network-requirements.md)

- [ ] **Monitoring set up**
  - Prometheus scraping `/metrics`
  - Grafana dashboards deployed
  - Alerts configured (disk, latency, availability)

- [ ] **Runbooks reviewed**
  - Team familiar with [7 operational runbooks](../../runbooks/)
  - On-call rotation established
  - Escalation paths documented

- [ ] **Pilot success criteria defined**
  - See: [Pilot Success Criteria](../../pilot-success-criteria.md)
  - Acceptance tests written
  - Demo script prepared

---
## Related Documentation

- [Operations Hub](../../README.md) - Main operations documentation
- [Deployment Examples](../../deployment/) - IaC configs (Docker Compose, Nginx, Envoy)
- [Operational Runbooks](../../runbooks/) - Incident response procedures
- [Pilot Success Criteria](../../pilot-success-criteria.md) - Validation checklist

---

**Last Updated:** 2026-02-11