# Resource Sizing Guide

**Hardware sizing calculations for StemeDB deployments**

---

## Quick Reference Table

| Assertions | Queries/sec | Deployment | CPU | RAM | Disk (WAL+DB) | Monthly Cost (AWS) |
|-----------|-------------|------------|-----|-----|---------------|-------------------|
| **<10K** | <100 | Single-node | 2-4 vCPU | 4-8GB | 50GB | ~$87 |
| **<50K** | <500 | Single-node or 3-node | 4-8 vCPU | 8-16GB | 100GB | ~$180 (1) or ~$425 (3) |
| **<100K** | <1K | Three-node | 8 vCPU | 16GB | 200GB | ~$425 |
| **<500K** | <5K | Five-node (P6) | 16 vCPU | 32GB | 500GB | ~$1,200 |
| **<1M** | <10K | Enterprise (P6) | 32 vCPU | 64GB | 1TB | ~$3,000 |

*Costs are estimates for AWS us-east-1. Actual costs vary by region and instance type.*

---

## Sizing Methodology

### CPU Calculation

**Formula:**

```
vCPUs = (query_rate × 0.005) + (ingest_rate × 0.002) + 2
```

**Where:**
- `query_rate` = queries per second (peak)
- `ingest_rate` = assertions per second (sustained)
- `+2` = baseline for background tasks (compaction, replication)

**Examples:**

**Pilot (100 queries/sec, 50 assertions/sec):**

```
vCPUs = (100 × 0.005) + (50 × 0.002) + 2
      = 0.5 + 0.1 + 2
      = 2.6 vCPUs → **4 vCPUs** (round up)
```

**Production (1K queries/sec, 500 assertions/sec):**

```
vCPUs = (1000 × 0.005) + (500 × 0.002) + 2
      = 5 + 1 + 2
      = 8 vCPUs → **8 vCPUs**
```

**Overhead factors:**
- Add 50% for cluster coordination (3-node)
- Add 100% for complex lens queries (AuthorityLens with deep chains)

---

### RAM Calculation

**Formula:**

```
RAM_GB = data_size_GB + index_overhead_GB + cache_size_GB + 2
```

**Where:**
- `data_size_GB` = total assertion count × 1KB per assertion (about 0.000001 GB each)
- `index_overhead_GB` = ~10% of data size
- `cache_size_GB` = configurable (default: 1GB)
- `+2` = 2GB for OS + StemeDB runtime

**Examples:**

**10K assertions:**

```
Data size: 10K × 1KB = 10MB
Index: 10MB × 0.1 = 1MB
Cache: 1GB (default)
RAM = 10MB + 1MB + 1GB + 2GB ≈ 3GB → **4GB** (with headroom)
```

**100K assertions:**

```
Data size: 100K × 1KB = 100MB
Index: 100MB × 0.1 = 10MB
Cache: 2GB (recommended)
RAM = 100MB + 10MB + 2GB + 2GB ≈ 4.1GB → **8GB** (with headroom)
```

**1M assertions:**

```
Data size: 1M × 1KB = 1GB
Index: 1GB × 0.1 = 100MB
Cache: 4GB (recommended)
RAM = 1GB + 100MB + 4GB + 2GB ≈ 7.1GB → **16GB** (with headroom)
```

**Memory pressure indicators:**
- Swap usage >0 → Insufficient RAM
- Cache hit rate <80% → Increase cache_size
- OOM kills → Increase RAM or reduce cache_size

---

### Disk Calculation

**Components:**

1. **WAL (Write-Ahead Log):**
   ```
   WAL_size = daily_assertions × retention_days × 10KB / 1000   # result in MB
   ```

2. **Database (KV Store + Indexes):**
   ```
   DB_size = total_assertions × 1KB + (total_assertions × 0.1KB)  # +10% for indexes
   ```

3. **Backups:**
   ```
   Backup_size = (WAL_size + DB_size) × retention_count
   ```

**Examples:**

**10K assertions, 7-day WAL retention:**

```
Daily ingest: 1K assertions/day
WAL: 1K × 7 days × 10KB / 1000 = 70MB
DB: 10K × 1KB + (10K × 0.1KB) = 10MB + 1MB = 11MB
Backups: (70MB + 11MB) × 7 = 567MB
Total: 70MB + 11MB + 567MB = 648MB → **50GB** (with ~75× headroom for growth)
```

**100K assertions, 7-day WAL retention:**

```
Daily ingest: 10K assertions/day
WAL: 10K × 7 days × 10KB / 1000 = 700MB
DB: 100K × 1KB + (100K × 0.1KB) = 100MB + 10MB = 110MB
Backups: (700MB + 110MB) × 7 = 5.67GB
Total: 700MB + 110MB + 5.67GB ≈ 6.5GB → **100GB** (with ~15× headroom)
```

**1M assertions, 7-day WAL retention:**

```
Daily ingest: 100K assertions/day
WAL: 100K × 7 days × 10KB / 1000 = 7,000MB = 7GB
DB: 1M × 1KB + (1M × 0.1KB) = 1GB + 100MB = 1.1GB
Backups: (7GB + 1.1GB) × 7 = 56.7GB
Total: 7GB + 1.1GB + 56.7GB = 64.8GB → **200GB** (with ~3× headroom)
```

**Disk type:**
- **SSD required** - HDD will bottleneck WAL fsync
- IOPS: 3K minimum, 10K recommended
- Throughput: 100 MB/sec minimum

---

### Network Calculation

**Ingest bandwidth:**

```
Inbound = assertions/sec × 1KB × 8 bits / 1000   # result in Mbps
```

**Query bandwidth:**

```
Outbound = queries/sec × 5KB × 8 bits / 1000   # result in Mbps
```
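
The ingest and query bandwidth formulas above can be sketched in Python as follows. This is a minimal illustration, not part of StemeDB: the function name `bandwidth_mbps` and its parameters are hypothetical, the 1KB assertion and 5KB response sizes are the averages assumed by the formulas, and the optional `replication_factor` applies the cluster replication formula from this section.

```python
def bandwidth_mbps(assertions_per_sec, queries_per_sec, replication_factor=0,
                   assertion_kb=1.0, response_kb=5.0):
    """Estimate bandwidth in Mbps from the guide's network formulas.

    assertion_kb and response_kb are the assumed average payload sizes
    (1KB per assertion, 5KB per query response).
    """
    def kb_per_sec_to_mbps(kb_per_sec):
        # KB/sec × 8 bits gives kilobits/sec; dividing by 1000 gives Mbps.
        return kb_per_sec * 8 / 1000

    inbound = kb_per_sec_to_mbps(assertions_per_sec * assertion_kb)
    outbound = kb_per_sec_to_mbps(queries_per_sec * response_kb)
    # Replication only applies to clusters; a factor of 0 means single-node.
    replication = kb_per_sec_to_mbps(
        assertions_per_sec * assertion_kb * replication_factor
    )
    return {
        "inbound": inbound,
        "outbound": outbound,
        "replication": replication,
        "total": inbound + outbound + replication,
    }


if __name__ == "__main__":
    # Single-node pilot: 100 assertions/sec, 100 queries/sec.
    print(bandwidth_mbps(100, 100))
    # Three-node cluster with replication factor 2.
    print(bandwidth_mbps(1000, 1000, replication_factor=2))
```

The result is a raw estimate; the guide's worked examples then round up to the next common link speed to leave headroom.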

**Replication bandwidth (cluster only):**

```
Replication = assertions/sec × 1KB × replication_factor × 8 bits / 1000   # result in Mbps
```

**Examples:**

**100 assertions/sec, 100 queries/sec, single-node:**

```
Inbound: 100 × 1KB × 8 / 1000 = 0.8 Mbps
Outbound: 100 × 5KB × 8 / 1000 = 4 Mbps
Total: ~5 Mbps → **100 Mbps** (with 20× headroom)
```

**1K assertions/sec, 1K queries/sec, three-node (factor 2):**

```
Inbound: 1000 × 1KB × 8 / 1000 = 8 Mbps
Outbound: 1000 × 5KB × 8 / 1000 = 40 Mbps
Replication: 1000 × 1KB × 2 × 8 / 1000 = 16 Mbps
Total: ~64 Mbps → **1 Gbps** (with 15× headroom)
```

---

## Instance Type Selection

### AWS (us-east-1)

| Assertions | Instance Type | vCPU | RAM | Network | Cost/month |
|-----------|---------------|------|-----|---------|------------|
| <10K | t3.medium | 2 | 4GB | 5 Gbps | $30 |
| <50K | t3.large | 2 | 8GB | 5 Gbps | $60 |
| <100K | t3.xlarge | 4 | 16GB | 5 Gbps | $122 |
| <500K | m5.2xlarge | 8 | 32GB | 10 Gbps | $277 |
| <1M | m5.4xlarge | 16 | 64GB | 10 Gbps | $554 |

*Use t3 (burstable) for pilot, m5 (general purpose) for production*

### GCP (us-central1)

| Assertions | Machine Type | vCPU | RAM | Network | Cost/month |
|-----------|--------------|------|-----|---------|------------|
| <10K | n1-standard-1 | 1 | 3.75GB | 2 Gbps | $25 |
| <50K | n2-standard-2 | 2 | 8GB | 10 Gbps | $65 |
| <100K | n2-standard-4 | 4 | 16GB | 10 Gbps | $130 |
| <500K | n2-standard-8 | 8 | 32GB | 16 Gbps | $260 |
| <1M | n2-standard-16 | 16 | 64GB | 32 Gbps | $520 |

### Azure (East US)

| Assertions | VM Size | vCPU | RAM | Network | Cost/month |
|-----------|---------|------|-----|---------|------------|
| <10K | Standard_B2s | 2 | 4GB | Moderate | $30 |
| <50K | Standard_D2s_v3 | 2 | 8GB | Moderate | $70 |
| <100K | Standard_D4s_v3 | 4 | 16GB | High | $140 |
| <500K | Standard_D8s_v3 | 8 | 32GB | High | $280 |
| <1M | Standard_D16s_v3 | 16 | 64GB | Very High | $560 |

---

## Growth Planning

### Capacity Thresholds

**When to scale vertically (bigger instance):**
- CPU sustained >70%
- RAM used >80%
- Disk >80%
- Query latency p99 >500ms

**When to scale horizontally (add nodes):**
- Single-node at max instance size
- Need for high availability (1→3 nodes)
- Query rate >1K/sec sustained
- Write rate >1K assertions/sec

### Scaling Timeline

**10K → 50K assertions:**
- Growth rate: 1K/month typical
- Timeline: 40 months
- Action: Monitor, no scaling needed yet

**50K → 100K assertions:**
- Growth rate: 5K/month typical
- Timeline: 10 months
- Action: Plan migration to 3-node cluster

**100K → 500K assertions:**
- Growth rate: 10K/month typical
- Timeline: 40 months
- Action: Scale to 5-node cluster (requires P6)

---

## Pilot Sizing Recommendations

### Friendly Pilot (<10K assertions)

**Recommended:**
- **Deployment:** Single-node
- **Instance:** t3.medium (AWS) or equivalent
- **Disk:** 50GB SSD
- **Network:** 100 Mbps
- **Cost:** ~$87/month

**Rationale:**
- Minimal cost for early validation
- Easy to deploy and manage
- Sufficient for 50 concurrent users
- Migrate to a larger instance once validated

### Production Pilot (<100K assertions)

**Recommended:**
- **Deployment:** Three-node cluster
- **Instance:** t3.xlarge × 3 (AWS) or equivalent
- **Disk:** 200GB SSD per node
- **Network:** 1 Gbps per node
- **Cost:** ~$425/month

**Rationale:**
- High availability (survives 1 node failure)
- Room to grow to 100K assertions
- Sufficient for 500 concurrent users
- Production-ready architecture

---

## Monitoring for Capacity

### Metrics to Track

```yaml
# Prometheus queries
- CPU: rate(process_cpu_seconds_total[5m]) * 100
  # Alert: >70% sustained
- RAM: process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
  # Alert: >80%
- Disk: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
  # Alert: >80%
# histogram_quantile expects per-second bucket rates, so wrap the buckets in rate()
- Query latency: histogram_quantile(0.99, sum by (le) (rate(stemedb_query_latency_seconds_bucket[5m])))
  # Alert: >0.5 (500ms)
- Replication lag: replication_lag_seconds
  # Alert: >5
```

### Capacity Planning Dashboard

**Grafana panels:**
1. Assertion growth (30-day trend)
2. CPU/RAM/Disk utilization
3. Query rate (30-day trend)
4. Time-to-threshold (days until 80% capacity)

---

## Related Documentation

- [Single-Node Architecture](./single-node-pilot.md) - Sizing for single-node
- [Three-Node Cluster](./three-node-cluster.md) - Sizing for cluster
- [Network Requirements](./network-requirements.md) - Bandwidth calculations
- [Disk Full Runbook](../../runbooks/disk-full.md) - Storage management

---

**Last Updated:** 2026-02-11