stemedb/docs/operations/reference-architecture/README.md

StemeDB Reference Architectures

Choose the right deployment model for your scale, availability requirements, and operational maturity.


Architecture Comparison

| Architecture | Target Use Case | Assertions | Queries/sec | Availability | RTO / RPO | Complexity |
|---|---|---|---|---|---|---|
| Single-Node Pilot | PoC, friendly pilot, development | <10K | <100/sec | Single point of failure | 2hr / 24hr | Low |
| Three-Node Cluster | Production, enterprise pilot | <100K | <1K/sec | Survives 1 node failure | 5min / 1min | Medium |
| Enterprise Cluster (Roadmap P6) | Large-scale production | >100K | >1K/sec | Survives 2 node failures | 1min / 10s | High |

| Need to... | Go to |
|---|---|
| Deploy first pilot | Single-Node Pilot |
| Scale to production | Three-Node Cluster |
| Configure networking | Network Requirements |
| Size hardware | Resource Sizing |
| View architecture diagrams | Diagrams Directory |

Decision Tree

What's your use case?
    │
    ├─► Proof of concept / Friendly pilot
    │   └─► [Single-Node Pilot](./single-node-pilot.md)
    │       • Simplest deployment
    │       • Manual recovery acceptable
    │       • <10K assertions
    │       • Deploy time: <2 hours
    │
    ├─► Production deployment
    │   └─► [Three-Node Cluster](./three-node-cluster.md)
    │       • High availability (1 node failure)
    │       • Automatic replication
    │       • <100K assertions, <1K queries/sec
    │       • Deploy time: <1 day
    │
    └─► Large-scale production
        └─► Enterprise Cluster (Roadmap P6)
            • Multi-region support
            • Automatic failover
            • >100K assertions, >1K queries/sec
            • Requires enterprise support
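
The decision tree can also be expressed as a small function, which is handy for capacity-planning scripts. The sketch below is illustrative only: the `Workload` type and `choose_architecture` function are hypothetical names, not part of StemeDB, and the handling of exact boundary values (for example, exactly 100K assertions) is an assumption, since the thresholds above do not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of an expected workload."""
    assertions: int             # expected number of stored assertions
    queries_per_sec: float      # expected sustained query rate
    needs_ha: bool = False      # must the service survive a node failure?
    multi_region: bool = False  # is multi-region deployment required?


def choose_architecture(w: Workload) -> str:
    """Map a workload onto one of the three reference architectures."""
    # Enterprise: >100K assertions, >1K queries/sec, or multi-region.
    if w.multi_region or w.assertions > 100_000 or w.queries_per_sec > 1_000:
        return "Enterprise Cluster (Roadmap P6)"
    # Three-node: anything beyond the single-node limits, or an HA requirement.
    if w.needs_ha or w.assertions >= 10_000 or w.queries_per_sec >= 100:
        return "Three-Node Cluster"
    # Otherwise the simplest deployment is sufficient.
    return "Single-Node Pilot"


if __name__ == "__main__":
    print(choose_architecture(Workload(assertions=5_000, queries_per_sec=20)))
```

A small pilot that needs high availability is routed to the three-node cluster even though its size alone would fit on a single node.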

Key Concepts

RTO (Recovery Time Objective)

How long until service is restored after failure?

  • Single-Node: 2 hours (manual restore from backup)
  • Three-Node: 5 minutes (automatic failover to remaining nodes)
  • Enterprise: 1 minute (multi-region automatic failover)

RPO (Recovery Point Objective)

How much data loss is acceptable?

  • Single-Node: 24 hours (daily backup schedule)
  • Three-Node: 1 minute (real-time replication with replication factor 2)
  • Enterprise: 10 seconds (multi-region replication)
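
To check a business requirement against these objectives, the RTO and RPO figures can be encoded as data. This is a minimal sketch: the `OBJECTIVES` table copies the values listed above, while the `architectures_meeting` helper is a hypothetical name introduced for illustration.

```python
from datetime import timedelta

# (RTO, RPO) per architecture, copied from the figures above.
OBJECTIVES = {
    "Single-Node": (timedelta(hours=2), timedelta(hours=24)),
    "Three-Node": (timedelta(minutes=5), timedelta(minutes=1)),
    "Enterprise": (timedelta(minutes=1), timedelta(seconds=10)),
}


def architectures_meeting(max_downtime: timedelta,
                          max_data_loss: timedelta) -> list[str]:
    """Return the architectures whose RTO and RPO both fit the requirement."""
    return [
        name
        for name, (rto, rpo) in OBJECTIVES.items()
        if rto <= max_downtime and rpo <= max_data_loss
    ]


if __name__ == "__main__":
    # Example: tolerate at most 10 minutes of downtime and 5 minutes of data loss.
    print(architectures_meeting(timedelta(minutes=10), timedelta(minutes=5)))
```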

Replication Factor

How many copies of each assertion?

  • Single-Node: 1 copy (no replication)
  • Three-Node: 2 copies (survives 1 node loss)
  • Enterprise: 3 copies (survives 2 node losses)
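
The relationship between copies and survivable node losses in the list above is simple arithmetic: with N copies of each assertion, data survives the loss of N - 1 nodes. The sketch below states that rule; the function name is hypothetical, and it covers data survival only, not any quorum or availability rules a real cluster may add.

```python
def tolerated_node_losses(replication_factor: int) -> int:
    """Number of node losses after which at least one copy still exists."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


if __name__ == "__main__":
    for name, copies in [("Single-Node", 1), ("Three-Node", 2), ("Enterprise", 3)]:
        print(f"{name}: {copies} copies, survives {tolerated_node_losses(copies)} node loss(es)")
```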

Consistency Model

All deployments use eventual consistency via CRDTs:

  • Writes accepted immediately (optimistic)
  • Conflicts resolved at read time via Lenses
  • Replication lag typically <1s within a cluster
  • No distributed transactions or 2PC overhead
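
The following sketch illustrates the general idea of coordination-free writes with read-time resolution. It assumes a grow-only set, one of the simplest CRDTs, and a "latest timestamp wins" lens; this document does not specify which CRDTs or lenses StemeDB actually uses, and the `Assertion` fields, `merge`, and `latest_wins_lens` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    """Hypothetical assertion shape used only for this example."""
    subject: str
    value: str
    timestamp: int  # assumed logical clock
    source: str


def merge(a: frozenset, b: frozenset) -> frozenset:
    """Grow-only set merge: union is commutative, associative, and idempotent."""
    return a | b


def latest_wins_lens(store: frozenset, subject: str) -> Optional[Assertion]:
    """Resolve conflicting assertions at read time by picking the newest one."""
    candidates = [x for x in store if x.subject == subject]
    if not candidates:
        return None
    # Tie-break on source so every replica picks the same winner.
    return max(candidates, key=lambda x: (x.timestamp, x.source))


if __name__ == "__main__":
    # Two replicas accept conflicting writes without coordinating.
    replica_a = frozenset({Assertion("sky", "blue", 1, "node-a")})
    replica_b = frozenset({Assertion("sky", "grey", 2, "node-b")})
    merged = merge(replica_a, replica_b)
    print(len(merged), latest_wins_lens(merged, "sky").value)
```

Both conflicting assertions remain in the merged store; only the lens decides which one a reader sees, so a different lens could give a different answer from the same data.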

Architecture Principles

All StemeDB architectures follow these principles:

  1. Append-Only: No overwrites, all history preserved
  2. Conflict-Free: CRDTs for automatic merge without coordination
  3. Lens-Based Resolution: Conflicts resolved at query time, not write time
  4. Content-Addressed: Assertions identified by BLAKE3 hash, enabling Merkle sync
  5. Zero-Copy Serialization: rkyv for minimal overhead

See: Architecture Overview for full details.
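
As an illustration of principles 1 and 4, here is a minimal sketch of an append-only, content-addressed store. It is not StemeDB's storage engine: `AppendOnlyStore` is a hypothetical class, and because BLAKE3 is not in the Python standard library, the example substitutes `hashlib.blake2b` as a stand-in hash.

```python
import hashlib


class AppendOnlyStore:
    """Toy content-addressed log: entries are never overwritten or deleted."""

    def __init__(self) -> None:
        self._log: list[tuple[str, bytes]] = []  # ordered history
        self._index: dict[str, int] = {}         # digest -> position in the log

    def append(self, payload: bytes) -> str:
        """Store the payload and return its content address."""
        # Stand-in hash: StemeDB uses BLAKE3; blake2b is used here for portability.
        digest = hashlib.blake2b(payload, digest_size=32).hexdigest()
        if digest not in self._index:  # identical content is stored once
            self._index[digest] = len(self._log)
            self._log.append((digest, payload))
        return digest

    def get(self, digest: str) -> bytes:
        """Look up a payload by its content address."""
        return self._log[self._index[digest]][1]

    def __len__(self) -> int:
        return len(self._log)


if __name__ == "__main__":
    store = AppendOnlyStore()
    first = store.append(b"sky is blue")
    second = store.append(b"sky is blue")  # same content, same address
    print(first == second, len(store))
```

Because the identifier is derived from the content, two nodes that hold the same assertion agree on its ID without coordination, which is what makes hash-based sync possible.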


Migration Paths

Single-Node → Three-Node

When to migrate:

  • Assertion count approaching 10K
  • Query latency >1s sustained
  • Need for high availability
  • Production readiness validation complete
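
These triggers could be evaluated automatically from monitoring data. The sketch below is a rough illustration: the function name and parameters are hypothetical, and the 80% headroom used to interpret "approaching 10K" is an assumption, not a documented threshold.

```python
def migration_triggers(assertions: int,
                       sustained_latency_s: float,
                       needs_ha: bool,
                       headroom: float = 0.8) -> list[str]:
    """Return the reasons, if any, to plan a move to the three-node cluster."""
    reasons = []
    # Assumption: "approaching" means at or above 80% of the 10K limit.
    if assertions >= headroom * 10_000:
        reasons.append("assertion count approaching 10K")
    if sustained_latency_s > 1.0:
        reasons.append("sustained query latency above 1s")
    if needs_ha:
        reasons.append("high availability required")
    return reasons


if __name__ == "__main__":
    print(migration_triggers(assertions=8_500, sustained_latency_s=0.2, needs_ha=False))
```

The function reports reasons rather than a yes/no answer, leaving the decision, and the check that production readiness validation is complete, to the operator.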

Migration procedure:

  1. Provision 2 new nodes
  2. Configure the cluster on all 3 nodes
  3. Restart the existing single node with the cluster configuration
  4. Trigger Merkle sync to replicate data to the new nodes
  5. Update DNS/load balancer to point to the cluster

Estimated downtime: 5-15 minutes while the initial replication completes

See: Add Node Runbook for detailed steps.
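
To give an intuition for step 4, the sketch below shows a simplified, one-level version of hash-based sync: content addresses are grouped into buckets, bucket hashes are compared, and only buckets that differ need to be examined further. This is not StemeDB's sync protocol. The function names are hypothetical, `hashlib.blake2b` stands in for BLAKE3, and a real Merkle tree would recurse over several levels and exchange hashes over the network instead of holding both sets in one process.

```python
import hashlib


def bucket_hashes(digests: set[str]) -> dict[str, str]:
    """Group digests by their first hex character and hash each sorted bucket."""
    buckets: dict[str, list[str]] = {}
    for d in digests:
        buckets.setdefault(d[0], []).append(d)
    return {
        prefix: hashlib.blake2b("".join(sorted(items)).encode(),
                                digest_size=16).hexdigest()
        for prefix, items in buckets.items()
    }


def missing_on_target(source: set[str], target: set[str]) -> set[str]:
    """Find digests the target lacks, inspecting only buckets whose hashes differ."""
    src, dst = bucket_hashes(source), bucket_hashes(target)
    differing = {p for p in src if src[p] != dst.get(p)}
    return {d for d in source if d[0] in differing and d not in target}


if __name__ == "__main__":
    existing_node = {"a1f0", "a2c3", "b7e9"}  # made-up content addresses
    new_node = {"a1f0", "b7e9"}
    print(missing_on_target(existing_node, new_node))
```

A freshly provisioned node starts empty, so the first sync transfers everything; later syncs only touch the buckets that have diverged.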

Three-Node → Enterprise Cluster

When to migrate:

  • Assertion count approaching 100K
  • Query rate >1K/sec
  • Need for multi-region deployment
  • Compliance requirements for geo-redundancy

Requires: Enterprise support (Roadmap P6)


Deployment Checklist

Before deploying ANY architecture:

  • Production readiness verification passed
  • Backup/restore tested
    • Validated backup script execution
    • Tested restore roundtrip
    • Documented recovery procedures
  • Network configuration complete
    • Firewall rules applied
    • DNS records configured
    • TLS certificates provisioned
    • See: Network Requirements
  • Monitoring set up
    • Prometheus scraping /metrics
    • Grafana dashboards deployed
    • Alerts configured (disk, latency, availability)
  • Runbooks reviewed
  • Pilot success criteria defined



Last Updated: 2026-02-11