
# Resource Sizing Guide

**Hardware sizing calculations for StemeDB deployments**

---

## Quick Reference Table

| Assertions | Queries/sec | Deployment | CPU | RAM | Disk (WAL+DB) | Monthly Cost (AWS) |
|-----------|-------------|------------|-----|-----|---------------|-------------------|
| **<10K** | <100 | Single-node | 2-4 vCPU | 4-8GB | 50GB | ~$87 |
| **<50K** | <500 | Single-node or 3-node | 4-8 vCPU | 8-16GB | 100GB | ~$180 (single-node) or ~$425 (3-node) |
| **<100K** | <1K | Three-node | 8 vCPU | 16GB | 200GB | ~$425 |
| **<500K** | <5K | Five-node (P6) | 16 vCPU | 32GB | 500GB | ~$1,200 |
| **<1M** | <10K | Enterprise (P6) | 32 vCPU | 64GB | 1TB | ~$3,000 |

*Costs are estimates for AWS us-east-1. Actual costs vary by region and instance type.*

---

## Sizing Methodology

### CPU Calculation

**Formula:**
```
vCPUs = (query_rate × 0.005) + (ingest_rate × 0.002) + 2
```

**Where:**
- `query_rate` = queries per second (peak)
- `ingest_rate` = assertions per second (sustained)
- `+2` = baseline for background tasks (compaction, replication)

**Examples:**

**Pilot (100 queries/sec, 50 assertions/sec):**
```
vCPUs = (100 × 0.005) + (50 × 0.002) + 2
      = 0.5 + 0.1 + 2
      = 2.6 vCPUs → 4 vCPUs (rounded up to the next instance size)
```

**Production (1K queries/sec, 500 assertions/sec):**
```
vCPUs = (1000 × 0.005) + (500 × 0.002) + 2
      = 5 + 1 + 2
      = 8 vCPUs → 8 vCPUs (already an instance size)
```

**Overhead factors:**
- Add 50% for cluster coordination (3-node)
- Add 100% for complex lens queries (AuthorityLens with deep chains)
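The CPU formula and overhead factors above can be sketched as a small calculator. This is a minimal illustration, not part of StemeDB: the function names, the `INSTANCE_VCPUS` list, and the choice to treat the two overhead factors as additive on top of the base estimate are all assumptions.

```python
# Common cloud vCPU sizes to round up to (assumed for illustration).
INSTANCE_VCPUS = (2, 4, 8, 16, 32, 64)


def estimate_vcpus(query_rate, ingest_rate, cluster=False, complex_lens=False):
    """Raw vCPU estimate from the sizing formula.

    query_rate  -- peak queries per second
    ingest_rate -- sustained assertions per second
    """
    base = query_rate * 0.005 + ingest_rate * 0.002 + 2
    overhead = 0.0
    if cluster:
        overhead += 0.5  # +50% for 3-node cluster coordination
    if complex_lens:
        overhead += 1.0  # +100% for complex lens queries
    # Assumption: overhead factors add together rather than compound.
    # Rounding removes floating-point noise before size comparison.
    return round(base * (1 + overhead), 6)


def round_up_to_instance(vcpus):
    """Round a raw estimate up to the next size in INSTANCE_VCPUS."""
    for size in INSTANCE_VCPUS:
        if vcpus <= size:
            return size
    raise ValueError(f"{vcpus} vCPUs exceeds the largest listed size")


if __name__ == "__main__":
    for name, q, i in (("Pilot", 100, 50), ("Production", 1000, 500)):
        raw = estimate_vcpus(q, i)
        print(f"{name}: {raw} raw -> {round_up_to_instance(raw)} vCPUs")
```

Keeping the raw estimate separate from the rounding step makes it easy to see how much headroom the chosen instance size adds.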

---

### RAM Calculation

**Formula:**
```
RAM_GB = data_size_GB + index_overhead_GB + cache_size_GB + 2
```

**Where:**
- `data_size_GB` = total assertion count × ~1KB per assertion (`assertions × 0.000001` in GB)
- `index_overhead_GB` = ~10% of data size (`data_size_GB × 0.1`)
- `cache_size_GB` = configurable (default: 1GB)
- `+2` = OS + StemeDB runtime (GB)

**Examples:**

**10K assertions:**
```
Data size: 10K × 1KB = 10MB
Index: 10MB × 0.1 = 1MB
Cache: 1GB (default)
RAM = 10MB + 1MB + 1GB + 2GB ≈ 3GB → 4GB (with headroom)
```

**100K assertions:**
```
Data size: 100K × 1KB = 100MB
Index: 100MB × 0.1 = 10MB
Cache: 2GB (recommended)
RAM = 100MB + 10MB + 2GB + 2GB ≈ 4.1GB → 8GB (with headroom)
```

**1M assertions:**
```
Data size: 1M × 1KB = 1GB
Index: 1GB × 0.1 = 100MB
Cache: 4GB (recommended)
RAM = 1GB + 100MB + 4GB + 2GB ≈ 7.1GB → 16GB (with headroom)
```

**Memory pressure indicators:**
- Swap usage >0 → Insufficient RAM
- Cache hit rate <80% → Increase cache_size
- OOM kills → Increase RAM or reduce cache_size
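The RAM arithmetic in the worked examples (about 1KB per assertion, 10% index overhead, a configurable cache, and a 2GB baseline) can be sketched as follows. The function and constant names are hypothetical, and decimal units (1GB = 1,000,000KB) are assumed to match the examples.

```python
KB_PER_ASSERTION = 1    # average assertion size used in the worked examples
INDEX_FRACTION = 0.10   # indexes take ~10% of data size
OS_RUNTIME_GB = 2       # OS + StemeDB runtime baseline


def estimate_ram_gb(assertions, cache_gb=1.0):
    """Estimate required RAM in GB before adding headroom."""
    data_gb = assertions * KB_PER_ASSERTION / 1_000_000  # KB -> GB (decimal)
    index_gb = data_gb * INDEX_FRACTION
    return data_gb + index_gb + cache_gb + OS_RUNTIME_GB


if __name__ == "__main__":
    for count, cache in ((10_000, 1), (100_000, 2), (1_000_000, 4)):
        print(f"{count:>9,} assertions, {cache}GB cache: "
              f"{estimate_ram_gb(count, cache):.2f} GB")
```

At these scales the cache and the baseline dominate the total, so `cache_gb` is the parameter worth tuning first.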

---

### Disk Calculation

**Components:**

1. **WAL (Write-Ahead Log):**
   ```
   WAL_size = daily_assertions × retention_days × 10KB / 1000   # result in MB (10KB of WAL per assertion)
   ```

2. **Database (KV Store + Indexes):**
   ```
   DB_size = total_assertions × 1KB + (total_assertions × 0.1KB)  # +10% for indexes
   ```

3. **Backups:**
   ```
   Backup_size = (WAL_size + DB_size) × retention_count
   ```

**Examples:**

**10K assertions, 7-day WAL retention:**
```
Daily ingest: 1K assertions/day
WAL: 1K × 7 days × 10KB / 1000 = 70MB
DB: 10K × 1KB + (10K × 0.1KB) = 10MB + 1MB = 11MB
Backups: (70MB + 11MB) × 7 = 567MB

Total: 70MB + 11MB + 567MB = 648MB → 50GB (~77× headroom for growth)
```

**100K assertions, 7-day WAL retention:**
```
Daily ingest: 10K assertions/day
WAL: 10K × 7 days × 10KB / 1000 = 700MB
DB: 100K × 1KB + (100K × 0.1KB) = 100MB + 10MB = 110MB
Backups: (700MB + 110MB) × 7 = 5.67GB

Total: 700MB + 110MB + 5.67GB ≈ 6.5GB → 200GB (~30× headroom, matching the Quick Reference Table)
```

**1M assertions, 7-day WAL retention:**
```
Daily ingest: 100K assertions/day
WAL: 100K × 7 days × 10KB / 1000 = 7GB
DB: 1M × 1KB + (1M × 0.1KB) = 1GB + 100MB = 1.1GB
Backups: (7GB + 1.1GB) × 7 = 56.7GB

Total: 7GB + 1.1GB + 56.7GB ≈ 64.8GB → 1TB (~15× headroom, matching the Quick Reference Table)
```
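The three component formulas can be combined into a single estimate. The sketch below is illustrative only: `estimate_disk_mb` and its parameters are hypothetical names, it takes 10KB of WAL and 1.1KB of database space per assertion from the formulas, and it treats the `/ 1000` as a KB to MB conversion.

```python
def estimate_disk_mb(total_assertions, daily_assertions,
                     wal_retention_days=7, backup_count=7):
    """Return WAL, database, backup, and total disk usage in MB."""
    # WAL: 10KB per assertion for each retained day, KB -> MB.
    wal_mb = daily_assertions * wal_retention_days * 10 / 1000
    # Database: 1KB of data plus 0.1KB of index per assertion, KB -> MB.
    db_mb = (total_assertions * 1 + total_assertions * 0.1) / 1000
    # Backups: each retained backup holds a full copy of WAL + database.
    backup_mb = (wal_mb + db_mb) * backup_count
    return {
        "wal_mb": wal_mb,
        "db_mb": db_mb,
        "backup_mb": backup_mb,
        "total_mb": wal_mb + db_mb + backup_mb,
    }


if __name__ == "__main__":
    for total, daily in ((10_000, 1_000), (100_000, 10_000),
                         (1_000_000, 100_000)):
        sizes = estimate_disk_mb(total, daily)
        print(f"{total:>9,} assertions: {sizes['total_mb'] / 1000:.2f} GB")
```

Backups dominate the total because every retained copy includes the WAL, so lowering `backup_count` or excluding the WAL from backups is likely the largest lever on disk usage.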

**Disk type:**
- **SSD required** - HDD will bottleneck WAL fsync
- IOPS: 3K minimum, 10K recommended
- Throughput: 100 MB/sec minimum

---

### Network Calculation

**Ingest bandwidth:**
```
Inbound (Mbps) = assertions/sec × 1KB × 8 bits/byte / 1000
```

**Query bandwidth:**
```
Outbound (Mbps) = queries/sec × 5KB × 8 bits/byte / 1000
```

**Replication bandwidth (cluster only):**
```
Replication (Mbps) = assertions/sec × 1KB × replication_factor × 8 bits/byte / 1000
```

**Examples:**

**100 assertions/sec, 100 queries/sec, single-node:**
```
Inbound: 100 × 1KB × 8 / 1000 = 0.8 Mbps
Outbound: 100 × 5KB × 8 / 1000 = 4 Mbps
Total: ~5 Mbps → 100 Mbps (with 20× headroom)
```

**1K assertions/sec, 1K queries/sec, three-node (factor 2):**
```
Inbound: 1000 × 1KB × 8 / 1000 = 8 Mbps
Outbound: 1000 × 5KB × 8 / 1000 = 40 Mbps
Replication: 1000 × 1KB × 2 × 8 / 1000 = 16 Mbps
Total: ~64 Mbps → 1 Gbps (with 15× headroom)
```
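The bandwidth formulas translate directly into code. This is a minimal sketch with hypothetical names; the 1KB assertion size and 5KB query response size come from the formulas above, and a `replication_factor` of 0 stands for a single-node deployment.

```python
def estimate_bandwidth_mbps(assertions_per_sec, queries_per_sec,
                            replication_factor=0):
    """Return inbound, outbound, replication, and total bandwidth in Mbps."""
    # KB/sec × 8 bits per byte gives Kbps; dividing by 1000 gives Mbps.
    inbound = assertions_per_sec * 1 * 8 / 1000
    outbound = queries_per_sec * 5 * 8 / 1000
    replication = assertions_per_sec * 1 * replication_factor * 8 / 1000
    return {
        "inbound": inbound,
        "outbound": outbound,
        "replication": replication,
        "total": inbound + outbound + replication,
    }


if __name__ == "__main__":
    single = estimate_bandwidth_mbps(100, 100)
    cluster = estimate_bandwidth_mbps(1000, 1000, replication_factor=2)
    print(f"Single-node total: {single['total']:.1f} Mbps")
    print(f"Three-node total:  {cluster['total']:.1f} Mbps")
```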

---

## Instance Type Selection

### AWS (us-east-1)

| Assertions | Instance Type | vCPU | RAM | Network | Cost/month |
|-----------|---------------|------|-----|---------|------------|
| <10K | t3.medium | 2 | 4GB | Up to 5 Gbps | $30 |
| <50K | t3.large | 2 | 8GB | Up to 5 Gbps | $60 |
| <100K | t3.xlarge | 4 | 16GB | Up to 5 Gbps | $122 |
| <500K | m5.2xlarge | 8 | 32GB | Up to 10 Gbps | $277 |
| <1M | m5.4xlarge | 16 | 64GB | Up to 10 Gbps | $554 |

*Use t3 (burstable) for pilot, m5 (general purpose) for production*
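Once CPU and RAM estimates are available, picking a row from the AWS table can be automated. The sketch below hard-codes the table above; `pick_instance` is a hypothetical helper, and the prices are the same estimates shown in the table rather than live pricing.

```python
# (instance type, vCPU, RAM in GB, estimated cost per month in USD),
# copied from the AWS table and ordered from cheapest to most expensive.
AWS_INSTANCES = [
    ("t3.medium", 2, 4, 30),
    ("t3.large", 2, 8, 60),
    ("t3.xlarge", 4, 16, 122),
    ("m5.2xlarge", 8, 32, 277),
    ("m5.4xlarge", 16, 64, 554),
]


def pick_instance(min_vcpus, min_ram_gb):
    """Return the cheapest listed instance meeting both minimums, or None."""
    for name, vcpus, ram_gb, _cost in AWS_INSTANCES:
        if vcpus >= min_vcpus and ram_gb >= min_ram_gb:
            return name
    return None  # nothing in the table is large enough


if __name__ == "__main__":
    print(pick_instance(4, 8))    # 4 vCPUs and 8GB needed
    print(pick_instance(32, 64))  # larger than any listed instance
```

A `None` result signals that the workload has outgrown the listed instance types and horizontal scaling should be considered.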

### GCP (us-central1)

| Assertions | Machine Type | vCPU | RAM | Network | Cost/month |
|-----------|--------------|------|-----|---------|------------|
| <10K | n1-standard-1 | 1 | 3.75GB | 2 Gbps | $25 |
| <50K | n2-standard-2 | 2 | 8GB | 10 Gbps | $65 |
| <100K | n2-standard-4 | 4 | 16GB | 10 Gbps | $130 |
| <500K | n2-standard-8 | 8 | 32GB | 16 Gbps | $260 |
| <1M | n2-standard-16 | 16 | 64GB | 32 Gbps | $520 |

### Azure (East US)

| Assertions | VM Size | vCPU | RAM | Network | Cost/month |
|-----------|---------|------|-----|---------|------------|
| <10K | Standard_B2s | 2 | 4GB | Moderate | $30 |
| <50K | Standard_D2s_v3 | 2 | 8GB | Moderate | $70 |
| <100K | Standard_D4s_v3 | 4 | 16GB | High | $140 |
| <500K | Standard_D8s_v3 | 8 | 32GB | High | $280 |
| <1M | Standard_D16s_v3 | 16 | 64GB | Very High | $560 |

---

## Growth Planning

### Capacity Thresholds

**When to scale vertically (bigger instance):**
- CPU sustained >70%
- RAM used >80%
- Disk >80%
- Query latency p99 >500ms

**When to scale horizontally (add nodes):**
- Single-node at max instance size
- Need for high availability (1→3 nodes)
- Query rate >1K/sec sustained
- Write rate >1K assertions/sec
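The thresholds above can be expressed as a simple check that reports which scaling triggers have fired. This is a sketch only: `scaling_advice` and its parameters are hypothetical, and the inputs are assumed to be sustained values taken from monitoring.

```python
def scaling_advice(cpu_pct, ram_pct, disk_pct, p99_latency_ms,
                   query_rate=0, write_rate=0,
                   at_max_instance=False, needs_ha=False):
    """Return the vertical and horizontal scaling triggers that fired."""
    vertical = []
    if cpu_pct > 70:
        vertical.append("CPU sustained >70%")
    if ram_pct > 80:
        vertical.append("RAM used >80%")
    if disk_pct > 80:
        vertical.append("Disk >80%")
    if p99_latency_ms > 500:
        vertical.append("Query latency p99 >500ms")

    horizontal = []
    if at_max_instance:
        horizontal.append("Single-node at max instance size")
    if needs_ha:
        horizontal.append("High availability required")
    if query_rate > 1000:
        horizontal.append("Query rate >1K/sec")
    if write_rate > 1000:
        horizontal.append("Write rate >1K assertions/sec")

    return {"vertical": vertical, "horizontal": horizontal}


if __name__ == "__main__":
    advice = scaling_advice(cpu_pct=75, ram_pct=50, disk_pct=85,
                            p99_latency_ms=200, query_rate=1500)
    print(advice)
```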

### Scaling Timeline

**10K → 50K assertions:**
- Growth rate: 1K/month typical
- Timeline: 40 months
- Action: Monitor, no scaling needed yet

**50K → 100K assertions:**
- Growth rate: 5K/month typical
- Timeline: 10 months
- Action: Plan migration to 3-node cluster

**100K → 500K assertions:**
- Growth rate: 10K/month typical
- Timeline: 40 months
- Action: Scale to 5-node cluster (requires P6)
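Each timeline above is the remaining assertion count divided by the monthly growth rate. A minimal sketch, assuming linear growth and using a hypothetical helper name:

```python
import math


def months_to_reach(current, target, growth_per_month):
    """Months until `current` assertions grow to `target` at a linear rate."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    remaining = max(target - current, 0)
    return math.ceil(remaining / growth_per_month)


if __name__ == "__main__":
    tiers = (
        (10_000, 50_000, 1_000),
        (50_000, 100_000, 5_000),
        (100_000, 500_000, 10_000),
    )
    for current, target, rate in tiers:
        print(f"{current:,} -> {target:,}: "
              f"{months_to_reach(current, target, rate)} months")
```

Substituting a measured growth rate for the typical rates listed here gives a timeline specific to the deployment.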

---

## Pilot Sizing Recommendations

### Friendly Pilot (<10K assertions)

**Recommended:**
- **Deployment:** Single-node
- **Instance:** t3.medium (AWS) or equivalent
- **Disk:** 50GB SSD
- **Network:** 100 Mbps
- **Cost:** ~$87/month

**Rationale:**
- Minimal cost for early validation
- Easy to deploy and manage
- Sufficient for 50 concurrent users
- Migrate to a larger instance once the pilot is validated

### Production Pilot (<100K assertions)

**Recommended:**
- **Deployment:** Three-node cluster
- **Instance:** t3.xlarge × 3 (AWS) or equivalent
- **Disk:** 200GB SSD per node
- **Network:** 1 Gbps per node
- **Cost:** ~$425/month

**Rationale:**
- High availability (survives 1 node failure)
- Room to grow to 100K assertions
- Sufficient for 500 concurrent users
- Production-ready architecture

---

## Monitoring for Capacity

### Metrics to Track

```yaml
# Prometheus queries
- CPU: rate(process_cpu_seconds_total[5m]) * 100
  # Alert: >70% sustained

- RAM: process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
  # Alert: >80%

- Disk: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
  # Alert: >80%

- Query latency: histogram_quantile(0.99, sum by (le) (rate(stemedb_query_latency_seconds_bucket[5m])))
  # Alert: >0.5 (500ms)

- Replication lag: replication_lag_seconds
  # Alert: >5
```

### Capacity Planning Dashboard

**Grafana panels:**
1. Assertion growth (30-day trend)
2. CPU/RAM/Disk utilization
3. Query rate (30-day trend)
4. Time-to-threshold (days until 80% capacity)
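The time-to-threshold panel needs a projection of when utilization will cross 80%. One way to compute it outside Grafana is a least-squares trend over daily samples, sketched below. The function name and the one-sample-per-day input format are assumptions, and the projection starts from the most recent sample.

```python
def days_until_threshold(daily_usage_pct, threshold_pct=80.0):
    """Project days until usage reaches the threshold from a linear trend.

    daily_usage_pct -- utilization samples, one per day, oldest first.
    Returns 0.0 if already at the threshold, None if usage is not growing.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples")
    latest = daily_usage_pct[-1]
    if latest >= threshold_pct:
        return 0.0

    # Ordinary least-squares slope with x = day index (0, 1, 2, ...).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_usage_pct))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var  # percentage points per day

    if slope <= 0:
        return None  # flat or shrinking usage never reaches the threshold
    return (threshold_pct - latest) / slope


if __name__ == "__main__":
    print(days_until_threshold([50, 51, 52, 53]))
```

Prometheus also provides a built-in `predict_linear()` function that can drive a similar panel directly from a query.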

---

## Related Documentation

- [Single-Node Architecture](./single-node-pilot.md) - Sizing for single-node
- [Three-Node Cluster](./three-node-cluster.md) - Sizing for cluster
- [Network Requirements](./network-requirements.md) - Bandwidth calculations
- [Disk Full Runbook](../../runbooks/disk-full.md) - Storage management

---

**Last Updated:** 2026-02-11