jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


Resource Sizing Guide

Hardware sizing calculations for StemeDB deployments

Quick Reference Table

Assertions	Queries/sec	Deployment	CPU	RAM	Disk (WAL+DB)	Monthly Cost (AWS)
<10K	<100	Single-node	2-4 vCPU	4-8GB	50GB	~$87
<50K	<500	Single-node or 3-node	4-8 vCPU	8-16GB	100GB	~$180 (1) or ~$425 (3)
<100K	<1K	Three-node	8 vCPU	16GB	200GB	~$425
<500K	<5K	Five-node (P6)	16 vCPU	32GB	500GB	~$1,200
<1M	<10K	Enterprise (P6)	32 vCPU	64GB	1TB	~$3,000

Costs are estimates for AWS us-east-1. Actual costs vary by region and instance type.

Sizing Methodology

CPU Calculation

Formula:

vCPUs = (query_rate × 0.005) + (ingest_rate × 0.002) + 2

Where:

query_rate = queries per second (peak)
ingest_rate = assertions per second (sustained)
+2 = baseline for background tasks (compaction, replication)

Examples:

Pilot (100 queries/sec, 50 assertions/sec):

vCPUs = (100 × 0.005) + (50 × 0.002) + 2
      = 0.5 + 0.1 + 2
      = 2.6 vCPUs → **4 vCPUs** (round up to the next instance size)

Production (1K queries/sec, 500 assertions/sec):

vCPUs = (1000 × 0.005) + (500 × 0.002) + 2
      = 5 + 1 + 2
      = 8 vCPUs → **8 vCPUs**

Overhead factors:

Add 50% for cluster coordination (3-node)
Add 100% for complex lens queries (AuthorityLens with deep chains)
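
The CPU formula and overhead factors above can be sketched as a small helper. This is only an illustration: the function name, the instance-size ladder, and the choice to sum the overhead percentages and apply them to the whole estimate are assumptions, not something the guide specifies.

```python
import math

# Hypothetical ladder of vCPU sizes to round up to (an assumption, not from the guide).
VCPU_SIZES = (2, 4, 8, 16, 32)


def estimate_vcpus(query_rate, ingest_rate, cluster=False, complex_lens=False):
    """Return (raw_estimate, provisioned_vcpus) using the guide's CPU formula.

    query_rate   -- peak queries per second
    ingest_rate  -- sustained assertions per second
    cluster      -- add 50% for 3-node cluster coordination
    complex_lens -- add 100% for complex lens queries
    """
    raw = query_rate * 0.005 + ingest_rate * 0.002 + 2
    # Assumption: overhead percentages are summed and applied to the full estimate.
    overhead = 1.0 + (0.5 if cluster else 0.0) + (1.0 if complex_lens else 0.0)
    raw *= overhead
    # Round up to the next size on the ladder (small tolerance for float error).
    for size in VCPU_SIZES:
        if size >= raw - 1e-9:
            return raw, size
    return raw, math.ceil(raw)


pilot_raw, pilot_vcpus = estimate_vcpus(100, 50)
prod_raw, prod_vcpus = estimate_vcpus(1000, 500)
```

With the guide's two examples as inputs, the helper reproduces the 4 vCPU and 8 vCPU recommendations; passing `cluster=True` for the production case raises the raw estimate to 12 vCPUs.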

RAM Calculation

Formula:

RAM_GB = data_size + index_overhead + cache_size + 2

Where:

data_size = total assertion count × ~1KB per assertion (in GB: assertions × 0.000001)
index_overhead = ~10% of data_size
cache_size = configurable (default: 1GB)
+2GB = OS + StemeDB runtime

Examples:

10K assertions:

Data size: 10K × 1KB = 10MB
Index: 10MB × 0.1 = 1MB
Cache: 1GB (default)
RAM = 10MB + 1MB + 1GB + 2GB ≈ 3GB → **4GB** (with headroom)

100K assertions:

Data size: 100K × 1KB = 100MB
Index: 100MB × 0.1 = 10MB
Cache: 2GB (recommended)
RAM = 100MB + 10MB + 2GB + 2GB ≈ 4.1GB → **8GB** (with headroom)

1M assertions:

Data size: 1M × 1KB = 1GB
Index: 1GB × 0.1 = 100MB
Cache: 4GB (recommended)
RAM = 1GB + 100MB + 4GB + 2GB ≈ 7.1GB → **16GB** (with headroom)

Memory pressure indicators:

Swap usage >0 → Insufficient RAM
Cache hit rate <80% → Increase cache_size
OOM kills → Increase RAM or reduce cache_size
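
The RAM arithmetic in the worked examples can be reproduced with a short function. This sketch follows the examples (about 1KB per assertion, indexes at 10% of data, 2GB for OS and runtime, decimal units); the function and parameter names are hypothetical.

```python
def estimate_ram_gb(assertions, cache_gb=1.0, kb_per_assertion=1.0,
                    index_ratio=0.1, runtime_gb=2.0):
    """Estimate working RAM in GB before headroom is added.

    Mirrors the worked examples: data + index + cache + OS/runtime baseline.
    """
    data_gb = assertions * kb_per_assertion / 1_000_000  # KB -> GB (decimal)
    index_gb = data_gb * index_ratio
    return data_gb + index_gb + cache_gb + runtime_gb


small = estimate_ram_gb(10_000)                   # default 1GB cache
medium = estimate_ram_gb(100_000, cache_gb=2.0)   # recommended 2GB cache
large = estimate_ram_gb(1_000_000, cache_gb=4.0)  # recommended 4GB cache
```

The results (about 3GB, 4.1GB, and 7.1GB) are the pre-headroom figures; the guide then rounds up to 4GB, 8GB, and 16GB.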

Disk Calculation

Components:

WAL (Write-Ahead Log):

WAL_size = daily_assertions × retention_days × 10KB / 1000  # evaluated in the examples below as 10KB of WAL per 1,000 assertions

Database (KV Store + Indexes):

DB_size = total_assertions × 1KB + (total_assertions × 0.1KB)  # +10% for indexes

Backups:

Backup_size = (WAL_size + DB_size) × retention_count

Examples:

10K assertions, 7-day WAL retention:

Daily ingest: 1K assertions/day
WAL: 1K × 7 days × 10KB / 1000 = 70KB ≈ 1MB (negligible)
DB: 10K × 1KB + (10K × 0.1KB) = 10MB + 1MB = 11MB
Backups: (1MB + 11MB) × 7 = 84MB

Total: 1MB + 11MB + 84MB ≈ 96MB → **50GB** (with 500× headroom for growth)

100K assertions, 7-day WAL retention:

Daily ingest: 10K assertions/day
WAL: 10K × 7 days × 10KB / 1000 = 700KB ≈ 1MB
DB: 100K × 1KB + (100K × 0.1KB) = 100MB + 10MB = 110MB
Backups: (1MB + 110MB) × 7 = 777MB

Total: 1MB + 110MB + 777MB ≈ 888MB → **100GB** (with 100× headroom)

1M assertions, 7-day WAL retention:

Daily ingest: 100K assertions/day
WAL: 100K × 7 days × 10KB / 1000 = 7MB
DB: 1M × 1KB + (1M × 0.1KB) = 1GB + 100MB = 1.1GB
Backups: (7MB + 1.1GB) × 7 = 7.75GB

Total: 7MB + 1.1GB + 7.75GB ≈ 8.86GB → **200GB** (with 20× headroom)

Disk type:

SSD required - HDD will bottleneck WAL fsync
IOPS: 3K minimum, 10K recommended
Throughput: 100 MB/sec minimum
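
The three disk components can be combined in one helper. This is a sketch that follows the worked examples, which evaluate the WAL term as 10KB per 1,000 assertions and use decimal units; the function name and the returned keys are hypothetical.

```python
def estimate_disk_mb(total_assertions, daily_assertions,
                     wal_retention_days=7, backup_count=7):
    """Return WAL, DB, backup, and total sizes in MB (before headroom)."""
    # WAL: 10KB per 1,000 assertions, as evaluated in the worked examples.
    wal_kb = daily_assertions * wal_retention_days * 10 / 1000
    # DB: 1KB per assertion plus 10% for indexes.
    db_kb = total_assertions * 1.0 + total_assertions * 0.1
    wal_mb = wal_kb / 1000
    db_mb = db_kb / 1000
    backup_mb = (wal_mb + db_mb) * backup_count
    return {
        "wal_mb": wal_mb,
        "db_mb": db_mb,
        "backup_mb": backup_mb,
        "total_mb": wal_mb + db_mb + backup_mb,
    }


sizes = estimate_disk_mb(total_assertions=1_000_000, daily_assertions=100_000)
```

For the 1M-assertion case this gives roughly 8,856MB in total, matching the example's 8.86GB. For the smaller cases the guide rounds the WAL up to 1MB before computing backups, so its totals come out slightly higher than this unrounded calculation.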

Network Calculation

Ingest bandwidth:

Inbound_Mbps = assertions/sec × 1KB × 8 bits/byte / 1000

Query bandwidth:

Outbound_Mbps = queries/sec × 5KB × 8 bits/byte / 1000

Replication bandwidth (cluster only):

Replication_Mbps = assertions/sec × 1KB × replication_factor × 8 bits/byte / 1000

Examples:

100 assertions/sec, 100 queries/sec, single-node:

Inbound: 100 × 1KB × 8 / 1000 = 0.8 Mbps
Outbound: 100 × 5KB × 8 / 1000 = 4 Mbps
Total: ~5 Mbps → **100 Mbps** (with 20× headroom)

1K assertions/sec, 1K queries/sec, three-node (factor 2):

Inbound: 1000 × 1KB × 8 / 1000 = 8 Mbps
Outbound: 1000 × 5KB × 8 / 1000 = 40 Mbps
Replication: 1000 × 1KB × 2 × 8 / 1000 = 16 Mbps
Total: ~64 Mbps → **1 Gbps** (with 15× headroom)
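
The bandwidth formulas can be checked with a few lines of code. The sketch below assumes the guide's payload sizes (1KB per assertion, 5KB per query response); the function name and return layout are hypothetical.

```python
def estimate_bandwidth_mbps(assertions_per_sec, queries_per_sec,
                            replication_factor=0):
    """Return inbound, outbound, replication, and total bandwidth in Mbps.

    Use replication_factor=0 for a single node (no replication traffic).
    """
    inbound = assertions_per_sec * 1 * 8 / 1000    # 1KB per assertion
    outbound = queries_per_sec * 5 * 8 / 1000      # 5KB per query response
    replication = assertions_per_sec * 1 * replication_factor * 8 / 1000
    return {
        "inbound": inbound,
        "outbound": outbound,
        "replication": replication,
        "total": inbound + outbound + replication,
    }


single = estimate_bandwidth_mbps(100, 100)
cluster = estimate_bandwidth_mbps(1000, 1000, replication_factor=2)
```

The single-node case totals 4.8 Mbps (the guide's "~5 Mbps") and the three-node case totals 64 Mbps.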

Instance Type Selection

AWS (us-east-1)

Assertions	Instance Type	vCPU	RAM	Network (up to)	Cost/month
<10K	t3.medium	2	4GB	5 Gbps	$30
<50K	t3.large	2	8GB	5 Gbps	$60
<100K	t3.xlarge	4	16GB	5 Gbps	$122
<500K	m5.2xlarge	8	32GB	10 Gbps	$277
<1M	m5.4xlarge	16	64GB	10 Gbps	$554

Use t3 (burstable) for pilot, m5 (general purpose) for production

GCP (us-central1)

Assertions	Machine Type	vCPU	RAM	Network	Cost/month
<10K	n1-standard-1	1	3.75GB	2 Gbps	$25
<50K	n2-standard-2	2	8GB	10 Gbps	$65
<100K	n2-standard-4	4	16GB	10 Gbps	$130
<500K	n2-standard-8	8	32GB	16 Gbps	$260
<1M	n2-standard-16	16	64GB	32 Gbps	$520

Azure (East US)

Assertions	VM Size	vCPU	RAM	Network	Cost/month
<10K	Standard_B2s	2	4GB	Moderate	$30
<50K	Standard_D2s_v3	2	8GB	Moderate	$70
<100K	Standard_D4s_v3	4	16GB	High	$140
<500K	Standard_D8s_v3	8	32GB	High	$280
<1M	Standard_D16s_v3	16	64GB	Very High	$560

Growth Planning

Capacity Thresholds

When to scale vertically (bigger instance):

CPU sustained >70%
RAM used >80%
Disk >80%
Query latency p99 >500ms

When to scale horizontally (add nodes):

Single-node at max instance size
Need for high availability (1→3 nodes)
Query rate >1K/sec sustained
Write rate >1K assertions/sec
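
The thresholds above translate directly into a decision helper. This is a sketch only: the `NodeStats` type, the field names, and the rule that a horizontal trigger takes priority over a vertical one are assumptions, not part of the guide.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical snapshot of sustained utilization for one deployment."""
    cpu_pct: float
    ram_pct: float
    disk_pct: float
    p99_latency_ms: float
    query_rate: float        # queries/sec, sustained
    write_rate: float        # assertions/sec
    at_max_instance: bool = False
    needs_ha: bool = False


def scaling_advice(s: NodeStats) -> str:
    """Map the guide's capacity thresholds to a scaling recommendation."""
    horizontal = (s.at_max_instance or s.needs_ha
                  or s.query_rate > 1000 or s.write_rate > 1000)
    vertical = (s.cpu_pct > 70 or s.ram_pct > 80
                or s.disk_pct > 80 or s.p99_latency_ms > 500)
    # Assumption: horizontal triggers win when both kinds fire.
    if horizontal:
        return "scale horizontally"
    if vertical:
        return "scale vertically"
    return "no action"


busy = NodeStats(cpu_pct=75, ram_pct=50, disk_pct=40,
                 p99_latency_ms=120, query_rate=200, write_rate=50)
advice = scaling_advice(busy)
```

Here `advice` is "scale vertically" because only the CPU threshold is exceeded.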

Scaling Timeline

10K → 50K assertions:

Growth rate: 1K/month typical
Timeline: 40 months
Action: Monitor, no scaling needed yet

50K → 100K assertions:

Growth rate: 5K/month typical
Timeline: 10 months
Action: Plan migration to 3-node cluster

100K → 500K assertions:

Growth rate: 10K/month typical
Timeline: 40 months
Action: Scale to 5-node cluster (requires P6)
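
Each timeline above is the remaining assertion count divided by the monthly growth rate. A minimal sketch, with a hypothetical function name and an assumed constant growth rate:

```python
import math


def months_to_reach(current, target, growth_per_month):
    """Months until `target` assertions at a constant monthly growth rate."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    return max(0, math.ceil((target - current) / growth_per_month))


timeline = [
    months_to_reach(10_000, 50_000, 1_000),
    months_to_reach(50_000, 100_000, 5_000),
    months_to_reach(100_000, 500_000, 10_000),
]
```

These three calls reproduce the 40, 10, and 40 month figures listed above.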

Pilot Sizing Recommendations

Friendly Pilot (<10K assertions)

Recommended:

Deployment: Single-node
Instance: t3.medium (AWS) or equivalent
Disk: 50GB SSD
Network: 100 Mbps
Cost: ~$87/month

Rationale:

Minimal cost for early validation
Easy to deploy and manage
Sufficient for 50 concurrent users
Migrate to larger when validated

Production Pilot (<100K assertions)

Recommended:

Deployment: Three-node cluster
Instance: t3.xlarge × 3 (AWS) or equivalent
Disk: 200GB SSD per node
Network: 1 Gbps per node
Cost: ~$425/month

Rationale:

High availability (survives 1 node failure)
Room to grow to 100K assertions
Sufficient for 500 concurrent users
Production-ready architecture

Monitoring for Capacity

Metrics to Track

# Prometheus queries
- CPU: rate(process_cpu_seconds_total[5m]) * 100
  # Alert: >70% sustained (100% = one full core; divide by vCPU count for whole-node utilization)

- RAM: process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
  # Alert: >80%

- Disk: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
  # Alert: >80%

- Query latency: histogram_quantile(0.99, sum by (le) (rate(stemedb_query_latency_seconds_bucket[5m])))
  # Alert: >0.5 (500ms)

- Replication lag: replication_lag_seconds
  # Alert: >5

Capacity Planning Dashboard

Grafana panels:

Assertion growth (30-day trend)
CPU/RAM/Disk utilization
Query rate (30-day trend)
Time-to-threshold (days until 80% capacity)
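
One way to derive the time-to-threshold panel is a linear projection over recent usage samples. The sketch below fits a least-squares line to (day, used) pairs and projects when usage crosses 80% of capacity. The function name and input format are hypothetical, and linear growth is an assumption that may not hold for bursty workloads.

```python
def days_until_threshold(samples, capacity, threshold=0.8):
    """Project days until usage reaches threshold * capacity.

    samples -- list of (day, used) pairs in the same unit as capacity
    Returns 0.0 if already over the limit, None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    intercept = mean_y - slope * mean_x
    last_day, last_used = samples[-1]
    limit = capacity * threshold
    if last_used >= limit:
        return 0.0
    if slope <= 0:
        return None
    # Day on which the fitted line crosses the limit, relative to the last sample.
    return (limit - intercept) / slope - last_day


# Example: disk usage in GB growing 1GB/day on a 200GB volume.
usage = [(0, 100), (10, 110), (20, 120), (30, 130)]
remaining = days_until_threshold(usage, capacity=200)
```

In this example the fitted line reaches the 160GB limit on day 60, so `remaining` is 30 days after the last sample.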

Related Documentation

Single-Node Architecture - Sizing for single-node
Three-Node Cluster - Sizing for cluster
Network Requirements - Bandwidth calculations
Disk Full Runbook - Storage management

Last Updated: 2026-02-11
