jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


StemeDB Reference Architectures

Choose the right deployment model for your scale, availability requirements, and operational maturity.

Architecture Comparison

| Architecture | Target Use Case | Assertions | Queries/sec | Availability | RTO / RPO | Complexity |
|---|---|---|---|---|---|---|
| Single-Node Pilot | PoC, friendly pilot, development | <10K | <100/sec | Single point of failure | 2hr / 24hr | ⭐ Low |
| Three-Node Cluster | Production, enterprise pilot | <100K | <1K/sec | Survives 1 node failure | 5min / 1min | ⭐⭐ Medium |
| Enterprise Cluster (Roadmap P6) | Large-scale production | >100K | >1K/sec | Survives 2 node failures | 1min / 10s | ⭐⭐⭐ High |

Quick Links

| Need to... | Go to |
|---|---|
| Deploy first pilot | Single-Node Pilot |
| Scale to production | Three-Node Cluster |
| Configure networking | Network Requirements |
| Size hardware | Resource Sizing |
| View architecture diagrams | Diagrams Directory |

Decision Tree

What's your use case?
    │
    ├─► Proof of concept / Friendly pilot
    │   └─► [Single-Node Pilot](./single-node-pilot.md)
    │       • Simplest deployment
    │       • Manual recovery acceptable
    │       • <10K assertions
    │       • Deploy time: <2 hours
    │
    ├─► Production deployment
    │   └─► [Three-Node Cluster](./three-node-cluster.md)
    │       • High availability (survives 1 node failure)
    │       • Automatic replication
    │       • <100K assertions, <1K queries/sec
    │       • Deploy time: <1 day
    │
    └─► Large-scale production
        └─► Enterprise Cluster (Roadmap P6)
            • Multi-region support
            • Automatic failover
            • >100K assertions, >1K queries/sec
            • Requires enterprise support
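
The decision tree and the comparison table can be read as a simple threshold check. The sketch below is one possible encoding, not part of StemeDB: the `Workload` type, the `choose_architecture` function, and the returned labels are hypothetical names, and the handling of values that sit exactly on a boundary (for example exactly 100K assertions) is an assumption, since the table only gives `<` and `>` ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of an expected deployment workload."""
    assertions: int           # expected number of stored assertions
    queries_per_sec: float    # expected sustained query rate
    needs_ha: bool = False    # must the service survive a node failure?


def choose_architecture(w: Workload) -> str:
    """Map a workload to one of the three reference architectures.

    Thresholds come from the comparison table:
    10K assertions / 100 queries/sec and 100K assertions / 1K queries/sec.
    """
    if w.assertions > 100_000 or w.queries_per_sec > 1_000:
        return "enterprise-cluster"
    if w.needs_ha or w.assertions >= 10_000 or w.queries_per_sec >= 100:
        return "three-node-cluster"
    return "single-node-pilot"


if __name__ == "__main__":
    print(choose_architecture(Workload(assertions=5_000, queries_per_sec=50)))
    print(choose_architecture(Workload(assertions=50_000, queries_per_sec=500)))
```

A small workload that needs high availability is routed to the three-node cluster, because the single-node pilot is a single point of failure regardless of size.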

Key Concepts

RTO (Recovery Time Objective)

How long until service is restored after a failure?

- Single-Node: 2 hours (manual restore from backup)
- Three-Node: 5 minutes (automatic failover to remaining nodes)
- Enterprise: 1 minute (multi-region automatic failover)

RPO (Recovery Point Objective)

How much data loss is acceptable?

- Single-Node: 24 hours (daily backup schedule)
- Three-Node: 1 minute (real-time replication with replication factor 2)
- Enterprise: 10 seconds (multi-region replication)

Replication Factor

How many copies of each assertion are stored?

- Single-Node: 1 copy (no replication)
- Three-Node: 2 copies (survives 1 node loss)
- Enterprise: 3 copies (survives 2 node losses)
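
The three figures above follow one pattern: with N copies of each assertion, N - 1 node losses can be absorbed before the last copy is gone. A minimal sketch of that arithmetic follows; the dictionary keys and the function name are hypothetical, and it assumes that any single surviving copy is enough to keep an assertion available.

```python
# Replication factors as listed above (the keys are illustrative labels).
REPLICATION_FACTOR = {
    "single-node": 1,
    "three-node": 2,
    "enterprise": 3,
}


def tolerated_node_losses(replication_factor: int) -> int:
    """Number of nodes holding a copy that can be lost while one copy remains."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


if __name__ == "__main__":
    for name, rf in REPLICATION_FACTOR.items():
        print(f"{name}: {rf} copies, tolerates {tolerated_node_losses(rf)} node loss(es)")
```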

Consistency Model

All deployments use eventual consistency via CRDTs:

- Writes are accepted immediately (optimistic)
- Conflicts are resolved at read time via Lenses
- Replication lag is typically <1s within a cluster
- No distributed transactions or 2PC overhead
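
To illustrate the idea of accepting every write and resolving conflicts only at read time, here is a minimal sketch. It is not StemeDB's API. It models each node's state as a grow-only set (one of the simplest CRDTs), merges by set union, and applies two hypothetical lenses to the merged data. The `Assertion` fields and the functions `merge`, `latest_wins`, and `prefer_source` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Assertion:
    """Hypothetical assertion shape; the real schema may differ."""
    subject: str
    predicate: str
    value: str
    timestamp: int
    source: str


def merge(a: FrozenSet[Assertion], b: FrozenSet[Assertion]) -> FrozenSet[Assertion]:
    """Grow-only set merge: union is commutative, associative, and idempotent."""
    return a | b


def latest_wins(assertions, subject: str, predicate: str) -> Optional[str]:
    """Lens: pick the value with the highest timestamp (source breaks ties)."""
    matches = [x for x in assertions
               if x.subject == subject and x.predicate == predicate]
    if not matches:
        return None
    return max(matches, key=lambda x: (x.timestamp, x.source)).value


def prefer_source(assertions, subject: str, predicate: str, trusted: str) -> Optional[str]:
    """Lens: prefer a trusted source, falling back to latest_wins."""
    trusted_only = frozenset(x for x in assertions if x.source == trusted)
    answer = latest_wins(trusted_only, subject, predicate)
    if answer is not None:
        return answer
    return latest_wins(assertions, subject, predicate)


# Two nodes accept conflicting writes without coordinating.
node_a = frozenset({Assertion("sky", "color", "blue", 1, "alice")})
node_b = frozenset({Assertion("sky", "color", "grey", 2, "bob")})
merged = merge(node_a, node_b)
```

Both conflicting assertions survive the merge. The stored data is identical for every reader, and only the lens chosen at query time decides which value is returned.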

Architecture Principles

All StemeDB architectures follow these principles:

- Append-Only: No overwrites, all history preserved
- Conflict-Free: CRDTs for automatic merge without coordination
- Lens-Based Resolution: Conflicts resolved at query time, not write time
- Content-Addressed: Assertions identified by BLAKE3 hash, enabling Merkle sync
- Zero-Copy Serialization: rkyv for minimal overhead
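
The append-only and content-addressed principles can be sketched together. The example below is an illustration under stated assumptions, not the StemeDB implementation. It uses `hashlib.blake2b` from the Python standard library as a stand-in for BLAKE3 (which is not in the standard library) and canonical JSON as a stand-in for rkyv. The flat key comparison in `missing_from` is a simplification of Merkle sync, which compares hash trees instead of full key sets. `content_address` and `AppendOnlyStore` are hypothetical names.

```python
import hashlib
import json


def content_address(assertion: dict) -> str:
    """Derive an identifier from the assertion's content.

    Stand-ins: canonical JSON instead of rkyv, BLAKE2b instead of BLAKE3.
    """
    canonical = json.dumps(assertion, sort_keys=True, separators=(",", ":"))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=32).hexdigest()


class AppendOnlyStore:
    """Hypothetical store: entries are added, never overwritten or deleted."""

    def __init__(self) -> None:
        self._items = {}

    def put(self, assertion: dict) -> str:
        key = content_address(assertion)
        # setdefault keeps the first copy, so re-inserting is a no-op.
        self._items.setdefault(key, assertion)
        return key

    def keys(self) -> set:
        return set(self._items)

    def missing_from(self, other: "AppendOnlyStore") -> set:
        """Keys the other store holds that this one lacks (what to fetch)."""
        return other.keys() - self.keys()


if __name__ == "__main__":
    local, remote = AppendOnlyStore(), AppendOnlyStore()
    local.put({"subject": "sky", "value": "blue"})
    remote.put({"value": "blue", "subject": "sky"})   # same content, same key
    remote.put({"subject": "sea", "value": "green"})
    print(len(local.missing_from(remote)))
```

Because the key is derived from the content, two nodes that receive the same assertion independently arrive at the same identifier, which is what makes comparing hashes sufficient to find what is missing.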

See: Architecture Overview for full details.

Migration Paths

Single-Node → Three-Node

When to migrate:

- Assertion count is approaching 10K
- Query latency is sustained above 1s
- High availability is required
- Production readiness validation is complete
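
The measurable triggers above could be checked automatically, for example from a monitoring job. The following sketch is hypothetical: the function name, its parameters, and in particular the 80% cutoff used to interpret "approaching 10K" are assumptions, and "sustained" is interpreted here as every recent latency sample exceeding the threshold. The production readiness validation item is a manual sign-off and is left out.

```python
from typing import List, Sequence


def migration_reasons(
    assertion_count: int,
    recent_latencies_s: Sequence[float],
    needs_ha: bool,
    *,
    assertion_limit: int = 10_000,    # single-node limit from the table
    approach_ratio: float = 0.8,      # assumption: "approaching" means 80%
    latency_threshold_s: float = 1.0,
) -> List[str]:
    """Return the reasons, if any, to move from single-node to three-node."""
    reasons = []
    if assertion_count >= assertion_limit * approach_ratio:
        reasons.append("assertion count approaching limit")
    # "Sustained" is taken to mean every recent sample is over the threshold.
    if recent_latencies_s and min(recent_latencies_s) > latency_threshold_s:
        reasons.append("sustained query latency above threshold")
    if needs_ha:
        reasons.append("high availability required")
    return reasons


if __name__ == "__main__":
    print(migration_reasons(8_500, [1.2, 1.4, 1.1], needs_ha=False))
```

An empty list means none of the measurable triggers has fired. A single slow query does not count, because one sample under the threshold breaks the "sustained" condition.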

Migration procedure:

1. Provision 2 new nodes
2. Configure the cluster on all 3 nodes
3. Restart the existing single node with the cluster config
4. Trigger Merkle sync to replicate data to the new nodes
5. Update DNS/load balancer to point to the cluster

Estimated downtime: 5-15 minutes while data replicates to the new nodes

See: Add Node Runbook for detailed steps.

Three-Node → Enterprise Cluster