StemeDB Reference Architectures
Choose the right deployment model for your scale, availability requirements, and operational maturity.
Architecture Comparison
| Architecture | Target Use Case | Assertions | Queries/sec | Availability | RTO/RPO | Complexity |
|---|---|---|---|---|---|---|
| Single-Node Pilot | PoC, friendly pilot, development | <10K | <100/sec | Single point of failure | 2hr / 24hr | ⭐ Low |
| Three-Node Cluster | Production, enterprise pilot | <100K | <1K/sec | Survives 1 node failure | 5min / 1min | ⭐⭐ Medium |
| Enterprise Cluster (Roadmap P6) | Large-scale production | >100K | >1K/sec | Survives 2 node failures | 1min / 10s | ⭐⭐⭐ High |
Quick Links
| Need to... | Go to |
|---|---|
| Deploy first pilot | Single-Node Pilot |
| Scale to production | Three-Node Cluster |
| Configure networking | Network Requirements |
| Size hardware | Resource Sizing |
| View architecture diagrams | Diagrams Directory |
Decision Tree
What's your use case?
│
├─► Proof of concept / Friendly pilot
│ └─► [Single-Node Pilot](./single-node-pilot.md)
│ • Simplest deployment
│ • Manual recovery acceptable
│ • <10K assertions
│ • Deploy time: <2 hours
│
├─► Production deployment
│ └─► [Three-Node Cluster](./three-node-cluster.md)
│ • High availability (1 node failure)
│ • Automatic replication
│ • <100K assertions, <1K queries/sec
│ • Deploy time: <1 day
│
└─► Large-scale production
└─► Enterprise Cluster (Roadmap P6)
• Multi-region support
• Automatic failover
• >100K assertions, >1K queries/sec
• Requires enterprise support
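The decision tree above can be expressed as a small selection function. This is a minimal sketch, not part of StemeDB: the `Workload` type, its field names, and `recommend_architecture` are hypothetical, and the thresholds are taken from the comparison table. The table leaves the exact boundaries (exactly 10K or 100K assertions) undefined, so the sketch assumes the conservative choice of moving up a tier at the boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of an expected deployment workload."""
    assertions: int             # expected number of stored assertions
    queries_per_sec: float      # expected sustained query rate
    needs_ha: bool = False      # must survive the loss of a node
    multi_region: bool = False  # must span more than one region


def recommend_architecture(w: Workload) -> str:
    """Map a workload to one of the three architectures in the comparison table."""
    # Assumption: at or above a three-node ceiling, move up to the enterprise tier.
    if w.multi_region or w.assertions >= 100_000 or w.queries_per_sec >= 1_000:
        return "Enterprise Cluster (Roadmap P6)"
    # Assumption: at or above a single-node ceiling, or any HA need, use three nodes.
    if w.needs_ha or w.assertions >= 10_000 or w.queries_per_sec >= 100:
        return "Three-Node Cluster"
    return "Single-Node Pilot"


if __name__ == "__main__":
    print(recommend_architecture(Workload(assertions=5_000, queries_per_sec=50)))
    print(recommend_architecture(Workload(assertions=50_000, queries_per_sec=500)))
```

Availability needs override size: a small workload that must survive a node failure still lands on the Three-Node Cluster.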
Key Concepts
RTO (Recovery Time Objective)
How long until service is restored after failure?
- Single-Node: 2 hours (manual restore from backup)
- Three-Node: 5 minutes (automatic failover to remaining nodes)
- Enterprise: 1 minute (multi-region automatic failover)
RPO (Recovery Point Objective)
How much data loss is acceptable?
- Single-Node: 24 hours (daily backup schedule)
- Three-Node: 1 minute (real-time replication with replication factor 2)
- Enterprise: 10 seconds (multi-region replication)
Replication Factor
How many copies of each assertion?
- Single-Node: 1 copy (no replication)
- Three-Node: 2 copies (survives 1 node loss)
- Enterprise: 3 copies (survives 2 node losses)
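The relationship between copies and tolerated node losses in the list above is simple arithmetic. The sketch below assumes each copy lives on a distinct node; the function name is hypothetical.

```python
def node_losses_survived(replication_factor: int) -> int:
    """Number of node losses after which at least one copy of an assertion remains.

    Assumes every copy is placed on a different node.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


if __name__ == "__main__":
    for name, copies in [("Single-Node", 1), ("Three-Node", 2), ("Enterprise", 3)]:
        print(f"{name}: {copies} copies, survives {node_losses_survived(copies)} node loss(es)")
```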
Consistency Model
All deployments use eventual consistency via CRDTs:
- Writes accepted immediately (optimistic)
- Conflicts resolved at read-time via Lenses
- Replication lag typically <1s within cluster
- No distributed transactions or 2PC overhead
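To illustrate the model, here is a minimal sketch of optimistic writes, coordination-free merge, and read-time resolution. It is not StemeDB's implementation: it uses a grow-only set as the CRDT (the source does not say which CRDT types StemeDB uses), and the `Assertion` fields, `merge`, `read`, and the `latest_wins` lens are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Assertion:
    """Hypothetical assertion shape; real StemeDB fields may differ."""
    subject: str
    predicate: str
    value: str
    timestamp: int
    source: str


Replica = FrozenSet[Assertion]
Lens = Callable[[List[Assertion]], Optional[Assertion]]


def merge(a: Replica, b: Replica) -> Replica:
    """Grow-only set merge: union is commutative, associative, and idempotent."""
    return a | b


def latest_wins(candidates: List[Assertion]) -> Optional[Assertion]:
    """Example lens: pick the newest assertion, breaking ties by source name."""
    return max(candidates, key=lambda x: (x.timestamp, x.source), default=None)


def read(replica: Replica, subject: str, predicate: str, lens: Lens) -> Optional[Assertion]:
    """Conflicting assertions are all kept; the lens chooses one at read time."""
    candidates = [x for x in replica if x.subject == subject and x.predicate == predicate]
    return lens(candidates)


if __name__ == "__main__":
    # Two replicas accept conflicting writes without coordinating.
    node_a: Replica = frozenset({Assertion("db", "status", "up", 1, "node-a")})
    node_b: Replica = frozenset({Assertion("db", "status", "down", 2, "node-b")})
    merged = merge(node_a, node_b)
    print(len(merged))                                        # both assertions are preserved
    print(read(merged, "db", "status", latest_wins).value)   # the lens resolves the conflict
```

Because the merge is a set union, replicas converge regardless of the order in which they exchange data, which is why no distributed transaction is needed. A different lens (for example, one that prefers a trusted source) can be applied to the same data without rewriting it.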
Architecture Principles
All StemeDB architectures follow these principles:
- Append-Only: No overwrites, all history preserved
- Conflict-Free: CRDTs for automatic merge without coordination
- Lens-Based Resolution: Conflicts resolved at query time, not write time
- Content-Addressed: Assertions identified by BLAKE3 hash, enabling Merkle sync
- Zero-Copy Serialization: rkyv for minimal overhead
See: Architecture Overview for full details.
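The append-only and content-addressed principles can be sketched together. The example below makes several assumptions and simplifications that are not from StemeDB: it uses `hashlib.blake2b` from the Python standard library as a stand-in for BLAKE3, canonical JSON instead of rkyv for serialization, and a flat comparison of ID sets instead of a Merkle tree. The class and function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, Set


def assertion_id(assertion: dict) -> str:
    """Content address: hash of a canonical serialization.

    Stand-in: BLAKE2b (stdlib) replaces BLAKE3, and sorted-key JSON replaces rkyv.
    """
    canonical = json.dumps(assertion, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.blake2b(canonical, digest_size=32).hexdigest()


class AppendOnlyStore:
    """Stores assertions by content address; existing entries are never overwritten."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def append(self, assertion: dict) -> str:
        key = assertion_id(assertion)
        # setdefault keeps the first copy, so re-appending identical content is a no-op.
        self._items.setdefault(key, assertion)
        return key

    def ids(self) -> Set[str]:
        return set(self._items)

    def get(self, key: str) -> dict:
        return self._items[key]


def sync(source: AppendOnlyStore, target: AppendOnlyStore) -> int:
    """Copy assertions that the target lacks; returns how many were transferred.

    Simplification: compares full ID sets. A Merkle tree would find the same
    difference while exchanging far fewer hashes.
    """
    missing = source.ids() - target.ids()
    for key in missing:
        target.append(source.get(key))
    return len(missing)


if __name__ == "__main__":
    a, b = AppendOnlyStore(), AppendOnlyStore()
    a.append({"subject": "db", "predicate": "status", "value": "up"})
    a.append({"subject": "db", "predicate": "region", "value": "eu"})
    print(sync(a, b))  # number of assertions copied on the first pass
    print(sync(a, b))  # nothing left to copy on the second pass
```

Since identical content always hashes to the same ID, two nodes can determine what to exchange by comparing hashes alone, without inspecting the assertions themselves.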
Migration Paths
Single-Node → Three-Node
When to migrate:
- Assertion count approaching 10K
- Query latency >1s sustained
- Need for high availability
- Production readiness validation complete
Migration procedure:
- Provision 2 new nodes
- Configure cluster on all 3 nodes
- Restart single-node with cluster config
- Trigger Merkle sync to replicate data
- Update DNS/load balancer to point to cluster
Estimated downtime: 5-15 minutes for replication
See: Add Node Runbook for detailed steps.
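The "when to migrate" triggers above could be checked periodically. This is a rough sketch with hypothetical names; in particular, the source does not define "approaching", so the 80% threshold below is an assumption that should be tuned per deployment.

```python
from typing import List

SINGLE_NODE_ASSERTION_LIMIT = 10_000  # from the architecture comparison table


def single_node_migration_reasons(
    assertions: int,
    sustained_query_latency_s: float,
    needs_high_availability: bool,
    readiness_validated: bool,
    approach_fraction: float = 0.8,  # assumption: "approaching" means 80% of the limit
) -> List[str]:
    """Return the migration triggers that currently apply (empty list if none)."""
    reasons = []
    if assertions >= approach_fraction * SINGLE_NODE_ASSERTION_LIMIT:
        reasons.append("assertion count approaching 10K")
    if sustained_query_latency_s > 1.0:
        reasons.append("sustained query latency above 1s")
    if needs_high_availability:
        reasons.append("high availability required")
    if readiness_validated:
        reasons.append("production readiness validation complete")
    return reasons


if __name__ == "__main__":
    print(single_node_migration_reasons(8_500, 1.5, False, False))
```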
Three-Node → Enterprise Cluster
When to migrate:
- Assertion count approaching 100K
- Query rate >1K/sec
- Need for multi-region deployment
- Compliance requirements for geo-redundancy
Requires: Enterprise support (Roadmap P6)
Deployment Checklist
Before deploying ANY architecture:
- Production readiness verification passed
  - See: UAT Production Readiness
  - Minimum 84% CLI score required
- Backup/restore tested
  - Validated backup script execution
  - Tested restore roundtrip
  - Documented recovery procedures
- Network configuration complete
  - Firewall rules applied
  - DNS records configured
  - TLS certificates provisioned
  - See: Network Requirements
- Monitoring set up
  - Prometheus scraping /metrics
  - Grafana dashboards deployed
  - Alerts configured (disk, latency, availability)
- Runbooks reviewed
  - Team familiar with 7 operational runbooks
  - On-call rotation established
  - Escalation paths documented
- Pilot success criteria defined
  - See: Pilot Success Criteria
  - Acceptance tests written
  - Demo script prepared
Related Documentation
- Operations Hub - Main operations documentation
- Deployment Examples - IaC configs (Docker Compose, Nginx, Envoy)
- Operational Runbooks - Incident response procedures
- Pilot Success Criteria - Validation checklist
Last Updated: 2026-02-11