stemedb/docs/operations/reference-architecture/resource-sizing.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

344 lines
8.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Resource Sizing Guide
**Hardware sizing calculations for StemeDB deployments**
---
## Quick Reference Table
| Assertions | Queries/sec | Deployment | CPU | RAM | Disk (WAL+DB) | Monthly Cost (AWS) |
|-----------|-------------|------------|-----|-----|---------------|-------------------|
| **<10K** | <100 | Single-node | 2-4 vCPU | 4-8GB | 50GB | ~$87 |
| **<50K** | <500 | Single-node or 3-node | 4-8 vCPU | 8-16GB | 100GB | ~$180 (1) or ~$425 (3) |
| **<100K** | <1K | Three-node | 8 vCPU | 16GB | 200GB | ~$425 |
| **<500K** | <5K | Five-node (P6) | 16 vCPU | 32GB | 500GB | ~$1,200 |
| **<1M** | <10K | Enterprise (P6) | 32 vCPU | 64GB | 1TB | ~$3,000 |
*Costs are estimates for AWS us-east-1. Actual costs vary by region and instance type.*
---
## Sizing Methodology
### CPU Calculation
**Formula:**
```
vCPUs = (query_rate × 0.005) + (ingest_rate × 0.002) + 2
```
**Where:**
- `query_rate` = queries per second (peak)
- `ingest_rate` = assertions per second (sustained)
- `+2` = baseline for background tasks (compaction, replication)
**Examples:**
**Pilot (100 queries/sec, 50 assertions/sec):**
```
vCPUs = (100 × 0.005) + (50 × 0.002) + 2
= 0.5 + 0.1 + 2
= 2.6 vCPUs → **4 vCPUs** (round up)
```
**Production (1K queries/sec, 500 assertions/sec):**
```
vCPUs = (1000 × 0.005) + (500 × 0.002) + 2
= 5 + 1 + 2
= 8 vCPUs → **8 vCPUs**
```
**Overhead factors:**
- Add 50% for cluster coordination (3-node)
- Add 100% for complex lens queries (AuthorityLens with deep chains)
---
### RAM Calculation
**Formula:**
```
RAM_GB = (assertions × 0.0001) + (index_overhead × 0.1) + cache_size + 2
```
**Where:**
- `assertions` = total assertion count
- `index_overhead` = ~10% of data size
- `cache_size` = configurable (default: 1GB)
- `+2GB` = OS + StemeDB runtime
**Examples:**
**10K assertions:**
```
Data size: 10K × 1KB = 10MB
Index: 10MB × 0.1 = 1MB
Cache: 1GB (default)
RAM = 10MB + 1MB + 1GB + 2GB ≈ 3GB → **4GB** (with headroom)
```
**100K assertions:**
```
Data size: 100K × 1KB = 100MB
Index: 100MB × 0.1 = 10MB
Cache: 2GB (recommended)
RAM = 100MB + 10MB + 2GB + 2GB ≈ 4.1GB → **8GB** (with headroom)
```
**1M assertions:**
```
Data size: 1M × 1KB = 1GB
Index: 1GB × 0.1 = 100MB
Cache: 4GB (recommended)
RAM = 1GB + 100MB + 4GB + 2GB ≈ 7.1GB → **16GB** (with headroom)
```
**Memory pressure indicators:**
- Swap usage >0 → Insufficient RAM
- Cache hit rate <80% Increase cache_size
- OOM kills Increase RAM or reduce cache_size
---
### Disk Calculation
**Components:**
1. **WAL (Write-Ahead Log):**
```
WAL_size = daily_assertions × retention_days × 10KB / 1000
```
2. **Database (KV Store + Indexes):**
```
DB_size = total_assertions × 1KB + (total_assertions × 0.1KB) # +10% for indexes
```
3. **Backups:**
```
Backup_size = (WAL_size + DB_size) × retention_count
```
**Examples:**
**10K assertions, 7-day WAL retention:**
```
Daily ingest: 1K assertions/day
WAL: 1K × 7 days × 10KB / 1000 = 70KB ≈ 1MB (negligible)
DB: 10K × 1KB + (10K × 0.1KB) = 10MB + 1MB = 11MB
Backups: (1MB + 11MB) × 7 = 84MB
Total: 1MB + 11MB + 84MB ≈ 96MB → **50GB** (with 500× headroom for growth)
```
**100K assertions, 7-day WAL retention:**
```
Daily ingest: 10K assertions/day
WAL: 10K × 7 days × 10KB / 1000 = 700KB ≈ 1MB
DB: 100K × 1KB + (100K × 0.1KB) = 100MB + 10MB = 110MB
Backups: (1MB + 110MB) × 7 = 777MB
Total: 1MB + 110MB + 777MB ≈ 888MB → **100GB** (with 100× headroom)
```
**1M assertions, 7-day WAL retention:**
```
Daily ingest: 100K assertions/day
WAL: 100K × 7 days × 10KB / 1000 = 7MB
DB: 1M × 1KB + (1M × 0.1KB) = 1GB + 100MB = 1.1GB
Backups: (7MB + 1.1GB) × 7 = 7.75GB
Total: 7MB + 1.1GB + 7.75GB ≈ 8.86GB → **200GB** (with 20× headroom)
```
**Disk type:**
- **SSD required** - HDD will bottleneck WAL fsync
- IOPS: 3K minimum, 10K recommended
- Throughput: 100 MB/sec minimum
---
### Network Calculation
**Ingest bandwidth:**
```
Inbound = assertions/sec × 1KB × 8 bits / 1000 = Mbps
```
**Query bandwidth:**
```
Outbound = queries/sec × 5KB × 8 bits / 1000 = Mbps
```
**Replication bandwidth (cluster only):**
```
Replication = assertions/sec × 1KB × replication_factor × 8 bits / 1000 = Mbps
```
**Examples:**
**100 assertions/sec, 100 queries/sec, single-node:**
```
Inbound: 100 × 1KB × 8 / 1000 = 0.8 Mbps
Outbound: 100 × 5KB × 8 / 1000 = 4 Mbps
Total: ~5 Mbps → **100 Mbps** (with 20× headroom)
```
**1K assertions/sec, 1K queries/sec, three-node (factor 2):**
```
Inbound: 1000 × 1KB × 8 / 1000 = 8 Mbps
Outbound: 1000 × 5KB × 8 / 1000 = 40 Mbps
Replication: 1000 × 1KB × 2 × 8 / 1000 = 16 Mbps
Total: ~64 Mbps → **1 Gbps** (with 15× headroom)
```
---
## Instance Type Selection
### AWS (us-east-1)
| Assertions | Instance Type | vCPU | RAM | Network | Cost/month |
|-----------|---------------|------|-----|---------|------------|
| <10K | t3.medium | 2 | 4GB | 5 Gbps | $30 |
| <50K | t3.large | 2 | 8GB | 5 Gbps | $60 |
| <100K | t3.xlarge | 4 | 16GB | 5 Gbps | $122 |
| <500K | m5.2xlarge | 8 | 32GB | 10 Gbps | $277 |
| <1M | m5.4xlarge | 16 | 64GB | 10 Gbps | $554 |
*Use t3 (burstable) for pilot, m5 (general purpose) for production*
### GCP (us-central1)
| Assertions | Machine Type | vCPU | RAM | Network | Cost/month |
|-----------|--------------|------|-----|---------|------------|
| <10K | n1-standard-1 | 1 | 3.75GB | 2 Gbps | $25 |
| <50K | n2-standard-2 | 2 | 8GB | 10 Gbps | $65 |
| <100K | n2-standard-4 | 4 | 16GB | 10 Gbps | $130 |
| <500K | n2-standard-8 | 8 | 32GB | 16 Gbps | $260 |
| <1M | n2-standard-16 | 16 | 64GB | 32 Gbps | $520 |
### Azure (East US)
| Assertions | VM Size | vCPU | RAM | Network | Cost/month |
|-----------|---------|------|-----|---------|------------|
| <10K | Standard_B2s | 2 | 4GB | Moderate | $30 |
| <50K | Standard_D2s_v3 | 2 | 8GB | Moderate | $70 |
| <100K | Standard_D4s_v3 | 4 | 16GB | High | $140 |
| <500K | Standard_D8s_v3 | 8 | 32GB | High | $280 |
| <1M | Standard_D16s_v3 | 16 | 64GB | Very High | $560 |
---
## Growth Planning
### Capacity Thresholds
**When to scale vertically (bigger instance):**
- CPU sustained >70%
- RAM used >80%
- Disk >80%
- Query latency p99 >500ms
**When to scale horizontally (add nodes):**
- Single-node at max instance size
- Need for high availability (1→3 nodes)
- Query rate >1K/sec sustained
- Write rate >1K assertions/sec
### Scaling Timeline
**10K → 50K assertions:**
- Growth rate: 1K/month typical
- Timeline: 40 months
- Action: Monitor, no scaling needed yet
**50K → 100K assertions:**
- Growth rate: 5K/month typical
- Timeline: 10 months
- Action: Plan migration to 3-node cluster
**100K → 500K assertions:**
- Growth rate: 10K/month typical
- Timeline: 40 months
- Action: Scale to 5-node cluster (requires P6)
---
## Pilot Sizing Recommendations
### Friendly Pilot (<10K assertions)
**Recommended:**
- **Deployment:** Single-node
- **Instance:** t3.medium (AWS) or equivalent
- **Disk:** 50GB SSD
- **Network:** 100 Mbps
- **Cost:** ~$87/month
**Rationale:**
- Minimal cost for early validation
- Easy to deploy and manage
- Sufficient for 50 concurrent users
- Migrate to larger when validated
### Production Pilot (<100K assertions)
**Recommended:**
- **Deployment:** Three-node cluster
- **Instance:** t3.xlarge × 3 (AWS) or equivalent
- **Disk:** 200GB SSD per node
- **Network:** 1 Gbps per node
- **Cost:** ~$425/month
**Rationale:**
- High availability (survives 1 node failure)
- Room to grow to 100K assertions
- Sufficient for 500 concurrent users
- Production-ready architecture
---
## Monitoring for Capacity
### Metrics to Track
```yaml
# Prometheus queries
- CPU: rate(process_cpu_seconds_total[5m]) * 100
# Alert: >70% sustained
- RAM: process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
# Alert: >80%
- Disk: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
# Alert: >80%
- Query latency: histogram_quantile(0.99, stemedb_query_latency_seconds_bucket)
# Alert: >0.5 (500ms)
- Replication lag: replication_lag_seconds
# Alert: >5
```
### Capacity Planning Dashboard
**Grafana panels:**
1. Assertion growth (30-day trend)
2. CPU/RAM/Disk utilization
3. Query rate (30-day trend)
4. Time-to-threshold (days until 80% capacity)
---
## Related Documentation
- [Single-Node Architecture](./single-node-pilot.md) - Sizing for single-node
- [Three-Node Cluster](./three-node-cluster.md) - Sizing for cluster
- [Network Requirements](./network-requirements.md) - Bandwidth calculations
- [Disk Full Runbook](../../runbooks/disk-full.md) - Storage management
---
**Last Updated:** 2026-02-11