Resource Sizing Guide

Hardware sizing calculations for StemeDB deployments


Quick Reference Table

| Assertions | Queries/sec | Deployment | CPU | RAM | Disk (WAL+DB) | Monthly Cost (AWS) |
|---|---|---|---|---|---|---|
| <10K | <100 | Single-node | 2-4 vCPU | 4-8GB | 50GB | ~$87 |
| <50K | <500 | Single-node or 3-node | 4-8 vCPU | 8-16GB | 100GB | ~$180 (1 node) or ~$425 (3 nodes) |
| <100K | <1K | Three-node | 8 vCPU | 16GB | 200GB | ~$425 |
| <500K | <5K | Five-node (P6) | 16 vCPU | 32GB | 500GB | ~$1,200 |
| <1M | <10K | Enterprise (P6) | 32 vCPU | 64GB | 1TB | ~$3,000 |

Costs are estimates for AWS us-east-1. Actual costs vary by region and instance type.


Sizing Methodology

CPU Calculation

Formula:

vCPUs = (query_rate × 0.005) + (ingest_rate × 0.002) + 2

Where:

  • query_rate = queries per second (peak)
  • ingest_rate = assertions per second (sustained)
  • +2 = baseline for background tasks (compaction, replication)

Examples:

Pilot (100 queries/sec, 50 assertions/sec):

vCPUs = (100 × 0.005) + (50 × 0.002) + 2
      = 0.5 + 0.1 + 2
      = 2.6 vCPUs → **4 vCPUs** (round up)

Production (1K queries/sec, 500 assertions/sec):

vCPUs = (1000 × 0.005) + (500 × 0.002) + 2
      = 5 + 1 + 2
      = 8 vCPUs → **8 vCPUs**

Overhead factors:

  • Add 50% for cluster coordination (3-node)
  • Add 100% for complex lens queries (AuthorityLens with deep chains)
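
The CPU formula and overhead factors above can be sketched in Python. This is a minimal illustration, not part of StemeDB: the function names, the list of standard instance sizes, and the choice to stack the two overhead factors additively are assumptions, since the guide does not say how the factors combine.

```python
import math

# Instance sizes to round up to. This list is an assumption for illustration.
STANDARD_VCPU_SIZES = [2, 4, 8, 16, 32]


def raw_vcpus(query_rate: float, ingest_rate: float,
              clustered: bool = False, complex_lens: bool = False) -> float:
    """Apply the guide's CPU formula, then the optional overhead factors."""
    base = query_rate * 0.005 + ingest_rate * 0.002 + 2
    factor = 1.0
    if clustered:
        factor += 0.5   # +50% for 3-node cluster coordination
    if complex_lens:
        factor += 1.0   # +100% for complex lens queries
    return base * factor  # additive stacking of factors is an assumption


def round_up_vcpus(vcpus: float) -> int:
    """Pick the smallest standard size that covers the raw estimate."""
    needed = round(vcpus, 6)  # guard against floating-point noise
    for size in STANDARD_VCPU_SIZES:
        if size >= needed:
            return size
    return math.ceil(needed)  # beyond the list: fall back to a plain ceiling


print(round_up_vcpus(raw_vcpus(100, 50)))     # pilot example
print(round_up_vcpus(raw_vcpus(1000, 500)))   # production example
```

With these assumptions, the production example on a three-node cluster (`clustered=True`) comes out at 12 raw vCPUs, which rounds up to the 16 vCPU size.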

RAM Calculation

Formula:

RAM_GB = data_size_GB + index_size_GB + cache_size_GB + 2

Where:

  • data_size_GB = assertions × 1KB (about 0.000001 GB per assertion)
  • index_size_GB = ~10% of data size (data_size_GB × 0.1)
  • cache_size_GB = configurable (default: 1GB)
  • +2GB = OS + StemeDB runtime

Examples:

10K assertions:

Data size: 10K × 1KB = 10MB
Index: 10MB × 0.1 = 1MB
Cache: 1GB (default)
RAM = 10MB + 1MB + 1GB + 2GB ≈ 3GB → **4GB** (with headroom)

100K assertions:

Data size: 100K × 1KB = 100MB
Index: 100MB × 0.1 = 10MB
Cache: 2GB (recommended)
RAM = 100MB + 10MB + 2GB + 2GB ≈ 4.1GB → **8GB** (with headroom)

1M assertions:

Data size: 1M × 1KB = 1GB
Index: 1GB × 0.1 = 100MB
Cache: 4GB (recommended)
RAM = 1GB + 100MB + 4GB + 2GB ≈ 7.1GB → **16GB** (with headroom)

Memory pressure indicators:

  • Swap usage >0 → Insufficient RAM
  • Cache hit rate <80% → Increase cache_size
  • OOM kills → Increase RAM or reduce cache_size
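
The RAM arithmetic in the worked examples can be reproduced with a short Python sketch. It assumes what the examples use (1KB per assertion, a 10% index overhead, decimal units, and a 2GB baseline). The function name and parameter names are hypothetical.

```python
def ram_gb(assertions: int, cache_gb: float = 1.0,
           kb_per_assertion: float = 1.0, index_ratio: float = 0.1,
           base_gb: float = 2.0) -> float:
    """Estimate RAM in GB before headroom, following the worked examples."""
    data_gb = assertions * kb_per_assertion / 1_000_000  # KB -> GB (decimal)
    index_gb = data_gb * index_ratio                      # ~10% of data size
    return data_gb + index_gb + cache_gb + base_gb        # + OS and runtime


for count, cache in [(10_000, 1.0), (100_000, 2.0), (1_000_000, 4.0)]:
    print(f"{count:>9} assertions, {cache}GB cache: {ram_gb(count, cache):.2f} GB")
```

The result is the pre-headroom figure. The examples above then round up to the next common instance memory size (4GB, 8GB, 16GB).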

Disk Calculation

Components:

  1. WAL (Write-Ahead Log):

    WAL_size = daily_assertions × retention_days × (10KB / 1000)  # ~10 bytes of WAL per assertion, as evaluated in the examples below
    
  2. Database (KV Store + Indexes):

    DB_size = total_assertions × 1KB + (total_assertions × 0.1KB)  # +10% for indexes
    
  3. Backups:

    Backup_size = (WAL_size + DB_size) × retention_count
    

Examples:

10K assertions, 7-day WAL retention:

Daily ingest: 1K assertions/day
WAL: 1K × 7 days × 10KB / 1000 = 70KB ≈ 1MB (negligible)
DB: 10K × 1KB + (10K × 0.1KB) = 10MB + 1MB = 11MB
Backups: (1MB + 11MB) × 7 = 84MB

Total: 1MB + 11MB + 84MB ≈ 96MB → **50GB** (with 500× headroom for growth)

100K assertions, 7-day WAL retention:

Daily ingest: 10K assertions/day
WAL: 10K × 7 days × 10KB / 1000 = 700KB ≈ 1MB
DB: 100K × 1KB + (100K × 0.1KB) = 100MB + 10MB = 110MB
Backups: (1MB + 110MB) × 7 = 777MB

Total: 1MB + 110MB + 777MB ≈ 888MB → **100GB** (with 100× headroom)

1M assertions, 7-day WAL retention:

Daily ingest: 100K assertions/day
WAL: 100K × 7 days × 10KB / 1000 = 7MB
DB: 1M × 1KB + (1M × 0.1KB) = 1GB + 100MB = 1.1GB
Backups: (7MB + 1.1GB) × 7 = 7.75GB

Total: 7MB + 1.1GB + 7.75GB ≈ 8.86GB → **200GB** (with 20× headroom)
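
The three disk components can be combined in a small Python sketch. It follows the worked examples, which evaluate the WAL term as about 10 bytes per assertion (10KB / 1000). That reading is exposed as the `wal_bytes_per_assertion` parameter so it can be changed if real WAL records are larger. All names here are hypothetical, and the sketch does not round small WAL sizes up to 1MB as the examples do.

```python
from dataclasses import dataclass


@dataclass
class DiskEstimate:
    wal_mb: float
    db_mb: float
    backup_mb: float

    @property
    def total_mb(self) -> float:
        return self.wal_mb + self.db_mb + self.backup_mb


def estimate_disk(total_assertions: int, daily_assertions: int,
                  wal_retention_days: int = 7, backup_count: int = 7,
                  wal_bytes_per_assertion: float = 10.0) -> DiskEstimate:
    """Estimate WAL, database, and backup sizes in MB (decimal units)."""
    # WAL: daily ingest x retention x bytes per assertion (assumed 10 bytes)
    wal_mb = daily_assertions * wal_retention_days * wal_bytes_per_assertion / 1_000_000
    # DB: 1KB of data plus 0.1KB of index per assertion
    db_mb = total_assertions * 1.1 / 1000
    # Backups: each retained backup holds a copy of WAL + DB
    backup_mb = (wal_mb + db_mb) * backup_count
    return DiskEstimate(wal_mb, db_mb, backup_mb)


est = estimate_disk(total_assertions=1_000_000, daily_assertions=100_000)
print(f"WAL {est.wal_mb:.0f}MB, DB {est.db_mb:.0f}MB, "
      f"backups {est.backup_mb:.0f}MB, total {est.total_mb / 1000:.2f}GB")
```

Dividing the provisioned disk size by `total_mb` gives the headroom multiple quoted in the examples.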

Disk type:

  • SSD required - HDD will bottleneck WAL fsync
  • IOPS: 3K minimum, 10K recommended
  • Throughput: 100 MB/sec minimum

Network Calculation

Ingest bandwidth:

Inbound_Mbps = assertions/sec × 1KB × 8 bits / 1000

Query bandwidth:

Outbound_Mbps = queries/sec × 5KB × 8 bits / 1000

Replication bandwidth (cluster only):

Replication_Mbps = assertions/sec × 1KB × replication_factor × 8 bits / 1000

Examples:

100 assertions/sec, 100 queries/sec, single-node:

Inbound: 100 × 1KB × 8 / 1000 = 0.8 Mbps
Outbound: 100 × 5KB × 8 / 1000 = 4 Mbps
Total: ~5 Mbps → **100 Mbps** (with 20× headroom)

1K assertions/sec, 1K queries/sec, three-node (factor 2):

Inbound: 1000 × 1KB × 8 / 1000 = 8 Mbps
Outbound: 1000 × 5KB × 8 / 1000 = 40 Mbps
Replication: 1000 × 1KB × 2 × 8 / 1000 = 16 Mbps
Total: ~64 Mbps → **1 Gbps** (with 15× headroom)
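
The bandwidth formulas can be checked with a short Python sketch. The function name and the dictionary layout are hypothetical. The 1KB assertion size and 5KB response size are the values used in the formulas above.

```python
def bandwidth_mbps(assertions_per_sec: float, queries_per_sec: float,
                   replication_factor: int = 0,
                   assertion_kb: float = 1.0, response_kb: float = 5.0) -> dict:
    """Return inbound, outbound, replication, and total bandwidth in Mbps."""
    inbound = assertions_per_sec * assertion_kb * 8 / 1000
    outbound = queries_per_sec * response_kb * 8 / 1000
    # Single-node deployments pass replication_factor=0
    replication = assertions_per_sec * assertion_kb * replication_factor * 8 / 1000
    return {
        "inbound": inbound,
        "outbound": outbound,
        "replication": replication,
        "total": inbound + outbound + replication,
    }


print(bandwidth_mbps(100, 100))                           # single-node example
print(bandwidth_mbps(1000, 1000, replication_factor=2))   # three-node example
```

The totals are the pre-headroom figures. The examples above then choose a link speed well above them.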

Instance Type Selection

AWS (us-east-1)

| Assertions | Instance Type | vCPU | RAM | Network | Cost/month |
|---|---|---|---|---|---|
| <10K | t3.medium | 2 | 4GB | 5 Gbps | $30 |
| <50K | t3.large | 2 | 8GB | 5 Gbps | $60 |
| <100K | t3.xlarge | 4 | 16GB | 5 Gbps | $122 |
| <500K | m5.2xlarge | 8 | 32GB | 10 Gbps | $277 |
| <1M | m5.4xlarge | 16 | 64GB | 10 Gbps | $554 |

Use t3 (burstable) instances for pilots and m5 (general purpose) instances for production.

GCP (us-central1)

| Assertions | Machine Type | vCPU | RAM | Network | Cost/month |
|---|---|---|---|---|---|
| <10K | n1-standard-1 | 1 | 3.75GB | 2 Gbps | $25 |
| <50K | n2-standard-2 | 2 | 8GB | 10 Gbps | $65 |
| <100K | n2-standard-4 | 4 | 16GB | 10 Gbps | $130 |
| <500K | n2-standard-8 | 8 | 32GB | 16 Gbps | $260 |
| <1M | n2-standard-16 | 16 | 64GB | 32 Gbps | $520 |

Azure (East US)

| Assertions | VM Size | vCPU | RAM | Network | Cost/month |
|---|---|---|---|---|---|
| <10K | Standard_B2s | 2 | 4GB | Moderate | $30 |
| <50K | Standard_D2s_v3 | 2 | 8GB | Moderate | $70 |
| <100K | Standard_D4s_v3 | 4 | 16GB | High | $140 |
| <500K | Standard_D8s_v3 | 8 | 32GB | High | $280 |
| <1M | Standard_D16s_v3 | 16 | 64GB | Very High | $560 |

Growth Planning

Capacity Thresholds

When to scale vertically (bigger instance):

  • CPU sustained >70%
  • RAM used >80%
  • Disk >80%
  • Query latency p99 >500ms

When to scale horizontally (add nodes):

  • Single-node at max instance size
  • Need for high availability (1→3 nodes)
  • Query rate >1K/sec sustained
  • Write rate >1K assertions/sec
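
The thresholds above can be expressed as a simple check. This is a sketch only: the function name, parameter names, and return shape are hypothetical, and it assumes the caller passes sustained averages rather than instantaneous readings.

```python
def scaling_advice(cpu_pct: float, ram_pct: float, disk_pct: float,
                   p99_latency_ms: float, queries_per_sec: float,
                   writes_per_sec: float, at_max_instance: bool = False,
                   needs_ha: bool = False) -> dict:
    """Return the vertical and horizontal scaling triggers that currently fire."""
    vertical = []
    if cpu_pct > 70:
        vertical.append("CPU sustained >70%")
    if ram_pct > 80:
        vertical.append("RAM used >80%")
    if disk_pct > 80:
        vertical.append("Disk >80%")
    if p99_latency_ms > 500:
        vertical.append("Query latency p99 >500ms")

    horizontal = []
    if at_max_instance:
        horizontal.append("Single-node at max instance size")
    if needs_ha:
        horizontal.append("Need for high availability")
    if queries_per_sec > 1000:
        horizontal.append("Query rate >1K/sec sustained")
    if writes_per_sec > 1000:
        horizontal.append("Write rate >1K assertions/sec")

    return {"vertical": vertical, "horizontal": horizontal}


print(scaling_advice(cpu_pct=75, ram_pct=60, disk_pct=50,
                     p99_latency_ms=300, queries_per_sec=200, writes_per_sec=100))
```

Returning the reasons rather than a single verdict keeps the decision with the operator, since several triggers can fire at once.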

Scaling Timeline

10K → 50K assertions:

  • Growth rate: 1K/month typical
  • Timeline: 40 months
  • Action: Monitor, no scaling needed yet

50K → 100K assertions:

  • Growth rate: 5K/month typical
  • Timeline: 10 months
  • Action: Plan migration to 3-node cluster

100K → 500K assertions:

  • Growth rate: 10K/month typical
  • Timeline: 40 months
  • Action: Scale to 5-node cluster (requires P6)
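
The timelines above are the remaining assertion count divided by the monthly growth rate. A minimal Python sketch, with a hypothetical function name and a constant growth rate assumed:

```python
import math


def months_to_reach(current: int, target: int, growth_per_month: int) -> int:
    """Months until `target` assertions, assuming a constant monthly growth rate."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    remaining = max(target - current, 0)
    return math.ceil(remaining / growth_per_month)


print(months_to_reach(10_000, 50_000, 1_000))     # 10K -> 50K
print(months_to_reach(50_000, 100_000, 5_000))    # 50K -> 100K
print(months_to_reach(100_000, 500_000, 10_000))  # 100K -> 500K
```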

Pilot Sizing Recommendations

Friendly Pilot (<10K assertions)

Recommended:

  • Deployment: Single-node
  • Instance: t3.medium (AWS) or equivalent
  • Disk: 50GB SSD
  • Network: 100 Mbps
  • Cost: ~$87/month

Rationale:

  • Minimal cost for early validation
  • Easy to deploy and manage
  • Sufficient for 50 concurrent users
  • Migrate to a larger instance once the pilot is validated

Production Pilot (<100K assertions)

Recommended:

  • Deployment: Three-node cluster
  • Instance: t3.xlarge × 3 (AWS) or equivalent
  • Disk: 200GB SSD per node
  • Network: 1 Gbps per node
  • Cost: ~$425/month

Rationale:

  • High availability (survives 1 node failure)
  • Room to grow to 100K assertions
  • Sufficient for 500 concurrent users
  • Production-ready architecture

Monitoring for Capacity

Metrics to Track

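# Prometheus queries
- CPU: rate(process_cpu_seconds_total[5m]) * 100
  # Percent of one core (can exceed 100 on multi-core hosts)
  # Alert: >70% sustained

- RAM: process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
  # Needs on(...)/ignoring(...) matching if the two series carry different labels
  # Alert: >80%

- Disk: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
  # Alert: >80%

- Query latency: histogram_quantile(0.99, sum by (le) (rate(stemedb_query_latency_seconds_bucket[5m])))
  # rate() over the buckets gives a recent p99 instead of one since process start
  # Alert: >0.5 (500ms)

- Replication lag: replication_lag_seconds
  # Alert: >5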

Capacity Planning Dashboard

Grafana panels:

  1. Assertion growth (30-day trend)
  2. CPU/RAM/Disk utilization
  3. Query rate (30-day trend)
  4. Time-to-threshold (days until 80% capacity)
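
A time-to-threshold panel needs some extrapolation of the utilization trend. One simple option, sketched here under the assumption of linear growth, is a least-squares line through recent daily samples. The function name and the sample format are hypothetical, and this is not how any particular Grafana panel computes it.

```python
from typing import List, Optional, Tuple


def days_until_threshold(samples: List[Tuple[float, float]],
                         threshold: float = 0.80) -> Optional[float]:
    """Days from the latest sample until utilization crosses `threshold`.

    `samples` is a list of (day, utilization_fraction) pairs.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    # Ordinary least-squares slope and intercept
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    last_day = max(x for x, _ in samples)
    crossing_day = (threshold - intercept) / slope
    return max(crossing_day - last_day, 0.0)


usage = [(0, 0.50), (10, 0.55), (20, 0.60), (30, 0.65)]
print(days_until_threshold(usage))
```

A linear fit is the simplest choice. If growth is accelerating, it will overestimate the time remaining.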


Last Updated: 2026-02-11