Resource Sizing Guide
Hardware sizing calculations for StemeDB deployments
Quick Reference Table
| Assertions | Queries/sec | Deployment | CPU | RAM | Disk (WAL+DB) | Monthly Cost (AWS) |
|---|---|---|---|---|---|---|
| <10K | <100 | Single-node | 2-4 vCPU | 4-8GB | 50GB | ~$87 |
| <50K | <500 | Single-node or 3-node | 4-8 vCPU | 8-16GB | 100GB | ~$180 (1 node) or ~$425 (3 nodes) |
| <100K | <1K | Three-node | 8 vCPU | 16GB | 200GB | ~$425 |
| <500K | <5K | Five-node (P6) | 16 vCPU | 32GB | 500GB | ~$1,200 |
| <1M | <10K | Enterprise (P6) | 32 vCPU | 64GB | 1TB | ~$3,000 |
Costs are estimates for AWS us-east-1. Actual costs vary by region and instance type.
Sizing Methodology
CPU Calculation
Formula:
vCPUs = (query_rate × 0.005) + (ingest_rate × 0.002) + 2
Where:
- query_rate = queries per second (peak)
- ingest_rate = assertions per second (sustained)
- +2 = baseline for background tasks (compaction, replication)
Examples:
Pilot (100 queries/sec, 50 assertions/sec):
vCPUs = (100 × 0.005) + (50 × 0.002) + 2
= 0.5 + 0.1 + 2
= 2.6 vCPUs → **4 vCPUs** (round up)
Production (1K queries/sec, 500 assertions/sec):
vCPUs = (1000 × 0.005) + (500 × 0.002) + 2
= 5 + 1 + 2
= 8 vCPUs → **8 vCPUs**
Overhead factors:
- Add 50% for cluster coordination (3-node)
- Add 100% for complex lens queries (AuthorityLens with deep chains)
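The CPU formula and overhead factors above can be sketched as a small helper. This is an illustrative sketch, not StemeDB tooling: the function name, the list of instance sizes, and the choice to apply the overhead percentages as multipliers on the raw estimate are all assumptions.

```python
import math

# Assumed list of commonly available vCPU counts, used only for rounding up.
INSTANCE_VCPUS = [2, 4, 8, 16, 32, 64]


def estimate_vcpus(query_rate, ingest_rate, cluster=False, complex_lens=False):
    """Return (raw_estimate, rounded_size) using the sizing formula above."""
    raw = query_rate * 0.005 + ingest_rate * 0.002 + 2
    # Assumption: overheads are percentages of the raw estimate and add together.
    overhead = 1.0 + (0.5 if cluster else 0.0) + (1.0 if complex_lens else 0.0)
    # Round away float noise so an exact 8.0 does not spill over to 16.
    raw = round(raw * overhead, 6)
    for size in INSTANCE_VCPUS:
        if size >= raw:
            return raw, size
    return raw, math.ceil(raw)  # larger than any listed size


print(estimate_vcpus(100, 50))                   # pilot example
print(estimate_vcpus(1000, 500))                 # production example
print(estimate_vcpus(1000, 500, cluster=True))   # production with 3-node overhead
```

The pilot and production calls reproduce the two worked examples; the third shows how the 50% cluster overhead pushes the same workload to the next instance size.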
RAM Calculation
Formula:
RAM_GB = data_size_GB + (data_size_GB × 0.1) + cache_size_GB + 2
Where:
- data_size_GB = total assertion count × 1KB per assertion, converted to GB
- data_size_GB × 0.1 = index overhead (~10% of data size)
- cache_size_GB = configurable (default: 1GB)
- +2 = OS + StemeDB runtime, in GB
Examples:
10K assertions:
Data size: 10K × 1KB = 10MB
Index: 10MB × 0.1 = 1MB
Cache: 1GB (default)
RAM = 10MB + 1MB + 1GB + 2GB ≈ 3GB → **4GB** (with headroom)
100K assertions:
Data size: 100K × 1KB = 100MB
Index: 100MB × 0.1 = 10MB
Cache: 2GB (recommended)
RAM = 100MB + 10MB + 2GB + 2GB ≈ 4.1GB → **8GB** (with headroom)
1M assertions:
Data size: 1M × 1KB = 1GB
Index: 1GB × 0.1 = 100MB
Cache: 4GB (recommended)
RAM = 1GB + 100MB + 4GB + 2GB ≈ 7.1GB → **16GB** (with headroom)
Memory pressure indicators:
- Swap usage >0 → Insufficient RAM
- Cache hit rate <80% → Increase cache_size
- OOM kills → Increase RAM or reduce cache_size
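The RAM examples above follow the same arithmetic each time, which can be sketched as below. The function name and parameters are hypothetical; the 1KB per assertion, 10% index overhead, and 2GB baseline are taken from the examples, and decimal units (1GB = 1,000,000KB) are assumed.

```python
def estimate_ram_gb(assertions, cache_gb=1.0, kb_per_assertion=1.0):
    """Return the raw RAM estimate in GB, before headroom is added."""
    data_gb = assertions * kb_per_assertion / 1_000_000  # KB -> GB (decimal)
    index_gb = data_gb * 0.1                             # ~10% of data size
    baseline_gb = 2.0                                    # OS + StemeDB runtime
    return data_gb + index_gb + cache_gb + baseline_gb


for count, cache in [(10_000, 1.0), (100_000, 2.0), (1_000_000, 4.0)]:
    print(f"{count:>9} assertions, {cache}GB cache: {estimate_ram_gb(count, cache):.2f}GB")
```

The result is the pre-headroom figure; choosing the instance size above it (4GB, 8GB, 16GB in the examples) is left as a separate judgment.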
Disk Calculation
Components:
- WAL (Write-Ahead Log):
  WAL_size = daily_assertions × retention_days × 10KB / 1000  # result in MB
- Database (KV Store + Indexes):
  DB_size = total_assertions × 1KB + (total_assertions × 0.1KB)  # +10% for indexes
- Backups:
  Backup_size = (WAL_size + DB_size) × retention_count
Examples:
10K assertions, 7-day WAL retention:
Daily ingest: 1K assertions/day
WAL: 1K × 7 days × 10KB / 1000 = 70MB
DB: 10K × 1KB + (10K × 0.1KB) = 10MB + 1MB = 11MB
Backups: (70MB + 11MB) × 7 = 567MB
Total: 70MB + 11MB + 567MB = 648MB → **50GB** (with ~75× headroom for growth)
100K assertions, 7-day WAL retention:
Daily ingest: 10K assertions/day
WAL: 10K × 7 days × 10KB / 1000 = 700MB
DB: 100K × 1KB + (100K × 0.1KB) = 100MB + 10MB = 110MB
Backups: (700MB + 110MB) × 7 = 5.67GB
Total: 700MB + 110MB + 5.67GB ≈ 6.5GB → **100GB** (with ~15× headroom)
1M assertions, 7-day WAL retention:
Daily ingest: 100K assertions/day
WAL: 100K × 7 days × 10KB / 1000 = 7,000MB = 7GB
DB: 1M × 1KB + (1M × 0.1KB) = 1GB + 100MB = 1.1GB
Backups: (7GB + 1.1GB) × 7 = 56.7GB
Total: 7GB + 1.1GB + 56.7GB = 64.8GB → **200GB** (with ~3× headroom)
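The three disk components can be combined in one helper, sketched here under stated assumptions: the function name is hypothetical, the `/ 1000` in the WAL formula is read as a KB-to-MB conversion, and backups are assumed to live on the same volume as the WAL and database.

```python
def estimate_disk_mb(total_assertions, daily_assertions,
                     wal_retention_days=7, backup_count=7):
    """Return a per-component disk estimate in MB (decimal units)."""
    # WAL: 10KB per assertion over the retention window, KB -> MB.
    wal_mb = daily_assertions * wal_retention_days * 10 / 1000
    # Database: 1KB per assertion plus 0.1KB (10%) for indexes, KB -> MB.
    db_mb = total_assertions * (1 + 0.1) / 1000
    # Backups: one full copy of WAL + DB per retained backup.
    backups_mb = (wal_mb + db_mb) * backup_count
    return {
        "wal_mb": wal_mb,
        "db_mb": db_mb,
        "backups_mb": backups_mb,
        "total_mb": wal_mb + db_mb + backups_mb,
    }


for total, daily in [(10_000, 1_000), (100_000, 10_000), (1_000_000, 100_000)]:
    est = estimate_disk_mb(total, daily)
    print(f"{total:>9} assertions: {est['total_mb'] / 1000:.2f}GB before headroom")
```

Backups dominate the total at every scale, so moving them to object storage (for example the S3 archival path) reduces the local disk requirement substantially.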
Disk type:
- SSD required - HDD will bottleneck WAL fsync
- IOPS: 3K minimum, 10K recommended
- Throughput: 100 MB/sec minimum
Network Calculation
Ingest bandwidth:
Inbound_Mbps = assertions/sec × 1KB × 8 bits / 1000
Query bandwidth:
Outbound_Mbps = queries/sec × 5KB × 8 bits / 1000
Replication bandwidth (cluster only):
Replication_Mbps = assertions/sec × 1KB × replication_factor × 8 bits / 1000
Examples:
100 assertions/sec, 100 queries/sec, single-node:
Inbound: 100 × 1KB × 8 / 1000 = 0.8 Mbps
Outbound: 100 × 5KB × 8 / 1000 = 4 Mbps
Total: ~5 Mbps → **100 Mbps** (with 20× headroom)
1K assertions/sec, 1K queries/sec, three-node (factor 2):
Inbound: 1000 × 1KB × 8 / 1000 = 8 Mbps
Outbound: 1000 × 5KB × 8 / 1000 = 40 Mbps
Replication: 1000 × 1KB × 2 × 8 / 1000 = 16 Mbps
Total: ~64 Mbps → **1 Gbps** (with 15× headroom)
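The bandwidth formulas above can be checked with a short sketch. The function name is hypothetical; the 1KB assertion size and 5KB query response size come from the formulas, and a replication factor of 0 is used here to represent a single node.

```python
def estimate_bandwidth_mbps(assertions_per_sec, queries_per_sec, replication_factor=0):
    """Return inbound, outbound, replication, and total bandwidth in Mbps."""
    inbound = assertions_per_sec * 1 * 8 / 1000    # 1KB per assertion
    outbound = queries_per_sec * 5 * 8 / 1000      # 5KB per query response
    replication = assertions_per_sec * 1 * replication_factor * 8 / 1000
    return {
        "inbound": inbound,
        "outbound": outbound,
        "replication": replication,
        "total": inbound + outbound + replication,
    }


print(estimate_bandwidth_mbps(100, 100))                          # single-node example
print(estimate_bandwidth_mbps(1000, 1000, replication_factor=2))  # three-node example
```

Both calls reproduce the totals in the examples (about 5 Mbps and 64 Mbps) before headroom is applied.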
Instance Type Selection
AWS (us-east-1)
| Assertions | Instance Type | vCPU | RAM | Network | Cost/month |
|---|---|---|---|---|---|
| <10K | t3.medium | 2 | 4GB | 5 Gbps | $30 |
| <50K | t3.large | 2 | 8GB | 5 Gbps | $60 |
| <100K | t3.xlarge | 4 | 16GB | 5 Gbps | $122 |
| <500K | m5.2xlarge | 8 | 32GB | 10 Gbps | $277 |
| <1M | m5.4xlarge | 16 | 64GB | 10 Gbps | $554 |
Use t3 (burstable) instances for pilots and m5 (general purpose) instances for production.
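Once vCPU and RAM requirements are known, picking a row from the AWS table is a simple lookup. The sketch below copies the table values into a list; the function name and the "cheapest instance that fits" rule are assumptions, and the prices are the estimates from this guide, not live pricing.

```python
# (instance type, vCPU, RAM in GB, estimated USD per month) from the AWS table.
AWS_INSTANCES = [
    ("t3.medium", 2, 4, 30),
    ("t3.large", 2, 8, 60),
    ("t3.xlarge", 4, 16, 122),
    ("m5.2xlarge", 8, 32, 277),
    ("m5.4xlarge", 16, 64, 554),
]


def pick_instance(min_vcpus, min_ram_gb):
    """Return the cheapest listed instance meeting both minimums, or None."""
    candidates = [
        inst for inst in AWS_INSTANCES
        if inst[1] >= min_vcpus and inst[2] >= min_ram_gb
    ]
    return min(candidates, key=lambda inst: inst[3]) if candidates else None


print(pick_instance(2, 4))    # smallest pilot footprint
print(pick_instance(4, 8))    # 4 vCPUs forces t3.xlarge even though 8GB would do
print(pick_instance(32, 64))  # nothing in the table is large enough
```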
GCP (us-central1)
| Assertions | Machine Type | vCPU | RAM | Network | Cost/month |
|---|---|---|---|---|---|
| <10K | n1-standard-1 | 1 | 3.75GB | 2 Gbps | $25 |
| <50K | n2-standard-2 | 2 | 8GB | 10 Gbps | $65 |
| <100K | n2-standard-4 | 4 | 16GB | 10 Gbps | $130 |
| <500K | n2-standard-8 | 8 | 32GB | 16 Gbps | $260 |
| <1M | n2-standard-16 | 16 | 64GB | 32 Gbps | $520 |
Azure (East US)
| Assertions | VM Size | vCPU | RAM | Network | Cost/month |
|---|---|---|---|---|---|
| <10K | Standard_B2s | 2 | 4GB | Moderate | $30 |
| <50K | Standard_D2s_v3 | 2 | 8GB | Moderate | $70 |
| <100K | Standard_D4s_v3 | 4 | 16GB | High | $140 |
| <500K | Standard_D8s_v3 | 8 | 32GB | High | $280 |
| <1M | Standard_D16s_v3 | 16 | 64GB | Very High | $560 |
Growth Planning
Capacity Thresholds
When to scale vertically (bigger instance):
- CPU sustained >70%
- RAM used >80%
- Disk >80%
- Query latency p99 >500ms
When to scale horizontally (add nodes):
- Single-node at max instance size
- Need for high availability (1→3 nodes)
- Query rate >1K/sec sustained
- Write rate >1K assertions/sec
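The thresholds above can be encoded as a simple check, for example in a capacity review script. This is a sketch only: the function name and parameters are hypothetical, and the inputs are assumed to be sustained values already pulled from monitoring.

```python
def scaling_advice(cpu_pct, ram_pct, disk_pct, p99_ms,
                   query_rate, write_rate,
                   at_max_instance=False, needs_ha=False):
    """Return (vertical_reasons, horizontal_reasons) as lists of strings."""
    vertical = []
    if cpu_pct > 70:
        vertical.append("CPU sustained >70%")
    if ram_pct > 80:
        vertical.append("RAM used >80%")
    if disk_pct > 80:
        vertical.append("Disk >80%")
    if p99_ms > 500:
        vertical.append("Query latency p99 >500ms")

    horizontal = []
    if at_max_instance:
        horizontal.append("Single-node at max instance size")
    if needs_ha:
        horizontal.append("Need for high availability")
    if query_rate > 1000:
        horizontal.append("Query rate >1K/sec sustained")
    if write_rate > 1000:
        horizontal.append("Write rate >1K assertions/sec")
    return vertical, horizontal


vertical, horizontal = scaling_advice(75, 60, 85, 200, 400, 50)
print("Scale up:", vertical)
print("Scale out:", horizontal)
```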
Scaling Timeline
10K → 50K assertions:
- Growth rate: 1K/month typical
- Timeline: 40 months
- Action: Monitor, no scaling needed yet
50K → 100K assertions:
- Growth rate: 5K/month typical
- Timeline: 10 months
- Action: Plan migration to 3-node cluster
100K → 500K assertions:
- Growth rate: 10K/month typical
- Timeline: 40 months
- Action: Scale to 5-node cluster (requires P6)
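Each timeline above is the remaining growth divided by the monthly growth rate. A minimal sketch, assuming linear growth and a hypothetical function name:

```python
import math


def months_to_reach(current, target, growth_per_month):
    """Return whole months until `target` assertions at a constant growth rate."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    remaining = max(target - current, 0)
    return math.ceil(remaining / growth_per_month)


print(months_to_reach(10_000, 50_000, 1_000))     # 10K -> 50K
print(months_to_reach(50_000, 100_000, 5_000))    # 50K -> 100K
print(months_to_reach(100_000, 500_000, 10_000))  # 100K -> 500K
```

Substituting a measured growth rate for the "typical" figures gives a deployment-specific timeline.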
Pilot Sizing Recommendations
Friendly Pilot (<10K assertions)
Recommended:
- Deployment: Single-node
- Instance: t3.medium (AWS) or equivalent
- Disk: 50GB SSD
- Network: 100 Mbps
- Cost: ~$87/month
Rationale:
- Minimal cost for early validation
- Easy to deploy and manage
- Sufficient for 50 concurrent users
- Migrate to a larger instance once the pilot is validated
Production Pilot (<100K assertions)
Recommended:
- Deployment: Three-node cluster
- Instance: t3.xlarge × 3 (AWS) or equivalent
- Disk: 200GB SSD per node
- Network: 1 Gbps per node
- Cost: ~$425/month
Rationale:
- High availability (survives 1 node failure)
- Room to grow to 100K assertions
- Sufficient for 500 concurrent users
- Production-ready architecture
Monitoring for Capacity
Metrics to Track
# Prometheus queries
- CPU: rate(process_cpu_seconds_total[5m]) * 100
# Alert: >70% sustained
- RAM: process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
# Alert: >80%
- Disk: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
# Alert: >80%
- Query latency: histogram_quantile(0.99, sum by (le) (rate(stemedb_query_latency_seconds_bucket[5m])))
# Alert: >0.5 (500ms)
- Replication lag: replication_lag_seconds
# Alert: >5
Capacity Planning Dashboard
Grafana panels:
- Assertion growth (30-day trend)
- CPU/RAM/Disk utilization
- Query rate (30-day trend)
- Time-to-threshold (days until 80% capacity)
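One way to compute the time-to-threshold panel is a linear fit over recent daily usage samples. The sketch below is an assumption about how such a value could be derived, not a description of an existing StemeDB dashboard; the function name and input format are hypothetical.

```python
def days_until_threshold(samples, capacity, threshold=0.8):
    """Estimate days until usage reaches threshold * capacity.

    samples: daily usage values, oldest first, in the same unit as capacity.
    Returns 0.0 if already over the limit, None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    limit = capacity * threshold
    if samples[-1] >= limit:
        return 0.0
    # Least-squares slope of usage against day index.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    return (limit - samples[-1]) / slope


# Disk usage in GB over four days on a 50GB volume.
print(days_until_threshold([10, 12, 14, 16], capacity=50))
```

A 30-day window matches the trend panels listed above; shorter windows react faster but are noisier.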
Related Documentation
- Single-Node Architecture - Sizing for single-node
- Three-Node Cluster - Sizing for cluster
- Network Requirements - Bandwidth calculations
- Disk Full Runbook - Storage management
Last Updated: 2026-02-11