# Single-Node Pilot Architecture

**Target:** Proof of concept, friendly pilot, development environments

**⚠️ NOT RECOMMENDED FOR PRODUCTION** - Single point of failure, manual recovery required

---
## Overview
The single-node architecture is the simplest StemeDB deployment: one server running `stemedb-api` with local storage. Suitable for early pilots, development, and demonstrations where availability is not critical.
```
[See: diagrams/single-node.txt for ASCII diagram]
```
---
## Target Specifications
| Metric | Value |
|--------|-------|
| **Assertions** | <10,000 |
| **Queries/sec** | <100 |
| **Concurrent users** | <50 |
| **Availability** | Best effort (single point of failure) |
| **RTO** | 2 hours (manual restore) |
| **RPO** | 24 hours (daily backup) |
---
## Hardware Requirements
### Minimum (Pilot <5K assertions)
- **CPU:** 2 vCPUs
- **RAM:** 4GB
- **Disk:** 50GB SSD (30GB WAL + 20GB DB)
- **Network:** 100 Mbps

**Example instances:**
- AWS: `t3.medium` (2 vCPU, 4GB)
- GCP: `e2-medium` (2 vCPU, 4GB)
- Azure: `Standard_B2s` (2 vCPU, 4GB)
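A quick preflight check can confirm that a host meets these minimums before installation. The sketch below is illustrative and not part of StemeDB: the helper names `check_one` and `check_minimum`, the `DATA_PATH` variable, and the 3500 MB memory floor are all assumptions. It reads the host's figures from standard Linux tools (`nproc`, `/proc/meminfo`, `df`).

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers (not shipped with StemeDB).

# check_one LABEL ACTUAL REQUIRED: print a result line, return non-zero on failure.
check_one() {
  if [ "$2" -ge "$3" ]; then
    echo "$1: ok ($2 >= $3)"
  else
    echo "$1: FAIL ($2 < $3)"
    return 1
  fi
}

# check_minimum CPUS MEM_MB DISK_GB MIN_CPUS MIN_MEM_MB MIN_DISK_GB
# Runs all three checks and returns non-zero if any of them failed.
check_minimum() {
  local rc=0
  check_one "vCPUs" "$1" "$4" || rc=1
  check_one "RAM (MB)" "$2" "$5" || rc=1
  check_one "Disk (GB)" "$3" "$6" || rc=1
  return "$rc"
}

# Read the current host's figures with standard Linux tools.
cpus=$(nproc)
mem_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
# DATA_PATH is an assumed variable; point it at the filesystem that will hold /data.
disk_gb=$(df -Pk "${DATA_PATH:-/}" | awk 'NR==2 {print int($2 / 1024 / 1024)}')

# Minimum tier: 2 vCPUs, 4GB RAM, 50GB disk. The 3500 MB floor is an assumed
# allowance, because a "4GB" instance reports less than 4096 MB as MemTotal.
check_minimum "$cpus" "$mem_mb" "$disk_gb" 2 3500 50 || echo "host is below the pilot minimum"
```

For the recommended tier, the same call with `4 7500 100` as the last three arguments would apply, again with an assumed memory allowance.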
### Recommended (Pilot <10K assertions)
- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
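- **Network:** 1 Gbps

**Example instances** (closest general-purpose sizes; these match the 8GB RAM target but provide 2 vCPUs, so choose the next size up if the workload is CPU-bound):
- AWS: `t3.large` (2 vCPU, 8GB)
- GCP: `n2-standard-2` (2 vCPU, 8GB)
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB)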

**See:** [Resource Sizing Guide](./resource-sizing.md) for calculations.

---
## Architecture Diagram
**Component layout:**
```
┌─────────────────────────────────────────────────────┐
│                   StemeDB Server                    │
│  ┌───────────────────────────────────────────────┐  │
│  │  stemedb-api (Port 18180)                     │  │
│  │  ┌─────────────┐    ┌──────────────┐          │  │
│  │  │ HTTP Router │───▶│   Ingest     │          │  │
│  │  │   (Axum)    │    │   Pipeline   │          │  │
│  │  └─────────────┘    └──────┬───────┘          │  │
│  │                            │                  │  │
│  │  ┌──────────────────┐      ▼                  │  │
│  │  │  Query Engine    │  ┌────────────┐         │  │
│  │  │  (Lenses)        │  │    WAL     │         │  │
│  │  └────────┬─────────┘  └────────────┘         │  │
│  │           │              /data/wal/           │  │
│  │           ▼                                   │  │
│  │  ┌──────────────────┐                         │  │
│  │  │  HybridStore     │                         │  │
│  │  │  • KV Store      │                         │  │
│  │  │  • Indexes       │                         │  │
│  │  └──────────────────┘                         │  │
│  │       /data/db/                               │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
         ▲                          │
         │                          ▼
    ┌─────────┐           ┌──────────────────┐
    │ Clients │           │ Backups (daily)  │
    │ (Agents,│           │ /backups/        │
    │  Dash)  │           │ (rsync-based)    │
    └─────────┘           └──────────────────┘
```
---
## Deployment Steps
### Prerequisites
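- [ ] Ubuntu 22.04 or RHEL 9 server with systemd
- [ ] Access to a `stemedb-api` release binary (installed in Step 1)
- [ ] `sudo` privileges to create the service user, directories, and systemd unit (Steps 2-4)
- [ ] Firewall rules applied (see Network Configuration)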
### Step 1: Install StemeDB
```bash
# Download binary (replace with your release URL)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
# Verify installation
stemedb-api --version
# Expected: stemedb-api 0.1.0
```
### Step 2: Create Data Directories
```bash
# Create directories
sudo mkdir -p /data/{wal,db}
sudo mkdir -p /backups
# Create stemedb user
sudo useradd -r -s /bin/false stemedb
# Set permissions
sudo chown -R stemedb:stemedb /data
sudo chown -R stemedb:stemedb /backups
sudo chmod 755 /data/{wal,db}
```
### Step 3: Configure Environment
```bash
# Create config directory (it does not exist on a fresh host)
sudo mkdir -p /etc/stemedb
# Create config file
sudo tee /etc/stemedb/config.env <<EOF
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
STEMEDB_DB_DIR=/data/db
STEMEDB_METER_ENABLED=true
RUST_LOG=info
EOF
# Set permissions
sudo chmod 600 /etc/stemedb/config.env
```
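Before starting the service, it can help to confirm that the environment file defines every key the server needs. The following is a minimal sketch, assuming the three keys `STEMEDB_BIND_ADDR`, `STEMEDB_WAL_DIR`, and `STEMEDB_DB_DIR` from the config above are the required ones. The `missing_keys` helper is hypothetical, and the demo runs against a throwaway sample file, not the real `/etc/stemedb/config.env`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each required key that is missing or empty in an env file.
# Usage: missing_keys FILE KEY [KEY...]
missing_keys() {
  local file=$1 key
  shift
  for key in "$@"; do
    # A key counts as present only if it has a non-empty value: KEY=<something>.
    grep -Eq "^${key}=.+" "$file" || echo "$key"
  done
}

# Demo against a throwaway sample that omits STEMEDB_DB_DIR.
sample=$(mktemp)
cat > "$sample" <<'EOF'
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
RUST_LOG=info
EOF
missing_keys "$sample" STEMEDB_BIND_ADDR STEMEDB_WAL_DIR STEMEDB_DB_DIR
# prints: STEMEDB_DB_DIR
rm -f "$sample"
```

Because the real file is mode 600 and owned by root, checking it in place would need `sudo` (for example by running the function inside `sudo bash`).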
### Step 4: Create systemd Service
```bash
# Create service file
sudo tee /etc/systemd/system/stemedb-api.service <<EOF
[Unit]
Description=StemeDB API Server
After=network.target
[Service]
Type=simple
User=stemedb
Group=stemedb
EnvironmentFile=/etc/stemedb/config.env
ExecStart=/usr/local/bin/stemedb-api
Restart=on-failure
RestartSec=5s
# Resource limits
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd
sudo systemctl daemon-reload
# Enable service
sudo systemctl enable stemedb-api
```
### Step 5: Start Server
```bash
# Start service
sudo systemctl start stemedb-api
# Check status
sudo systemctl status stemedb-api
# Verify health
curl http://localhost:18180/v1/health
# Expected: {"status": "healthy", "version": "0.1.0", ...}
```
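The server may take a few seconds to replay the WAL before it answers, so a single `curl` straight after `systemctl start` can fail spuriously. A small retry loop is one way to handle that. This is a sketch, not a StemeDB tool: the `retry_until_ok` helper is hypothetical, and the attempt count and delay in the commented example are arbitrary choices.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a probe command until it succeeds or attempts run out.
# Usage: retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ok() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}

# Demo with a probe that always succeeds.
retry_until_ok 3 0 true
# prints: ready after 1 attempt(s)

# Against a running server (curl -f treats HTTP statuses >= 400 as failures):
# retry_until_ok 15 2 curl -fsS http://localhost:18180/v1/health
```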
### Step 6: Configure Reverse Proxy (Optional)
**For TLS termination and external access:**
See: [Nginx Config](../../deployment/nginx/stemedb.conf) for complete example.
```bash
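# Install nginx (Ubuntu; on RHEL 9 use: sudo dnf install nginx)
sudo apt install nginx
# Copy config (run from the repository root)
sudo cp docs/operations/deployment/nginx/stemedb.conf /etc/nginx/sites-available/stemedb
# Enable site (sites-available/sites-enabled is the Debian/Ubuntu layout)
sudo ln -s /etc/nginx/sites-available/stemedb /etc/nginx/sites-enabled/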
sudo nginx -t
sudo systemctl reload nginx
```
### Step 7: Set Up Daily Backups
```bash
# Copy backup script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh
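# Add a daily backup at 2 AM to root's crontab
# (appends to existing entries; run once to avoid duplicate lines)
( sudo crontab -l 2>/dev/null; echo '0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1' ) | sudo crontab -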
# Test backup
sudo /usr/local/bin/backup-stemedb.sh
ls -lh /backups/
```
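A cron job that fails silently would stretch the 24-hour RPO without anyone noticing, so a freshness check on the backup directory is a useful companion. The sketch below is hypothetical and makes one loud assumption: each backup run leaves at least one file under `/backups` with a current modification time, such as an archive or marker file. If the backup script copies files with preserved timestamps (as `rsync -a` does), this check would need a marker file written at the end of each run.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only if DIR contains a file modified within MAX_HOURS.
# Usage: backup_is_fresh DIR MAX_HOURS
backup_is_fresh() {
  local dir=$1 max_hours=$2 recent
  # -mmin -N matches files modified less than N minutes ago.
  recent=$(find "$dir" -type f -mmin "-$((max_hours * 60))" -print 2>/dev/null | head -n 1)
  [ -n "$recent" ]
}

# Example: 24 hours matches the RPO stated for this architecture.
if backup_is_fresh /backups 24; then
  echo "backup is fresh"
else
  echo "no backup newer than 24h in /backups"
fi
```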

**Estimated deployment time:** 1-2 hours

---
## Network Configuration
### Ports
| Port | Protocol | Purpose | Expose To |
|------|----------|---------|-----------|
| **18180** | TCP/HTTP | API queries, ingest | Clients (via reverse proxy) |
| **18180** | TCP/HTTP | Metrics endpoint | Internal monitoring |
### Firewall Rules
**AWS Security Group:**
```bash
# Allow HTTP from load balancer only
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-lb \
--protocol tcp \
--port 18180
# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-bastion \
--protocol tcp \
--port 22
```
**iptables:**
```bash
# Allow HTTP from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP
# Persist rules (the file write must run as root, so pipe through sudo tee)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null
```

**See:** [Network Requirements](./network-requirements.md) for full details.

---
## Monitoring
### Prometheus
**Scrape configuration:**
```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']
    metrics_path: '/metrics'
    scrape_interval: 15s
```
### Key Metrics to Monitor
```promql
# Query latency (should be <200ms p99)
stemedb_query_latency_seconds{quantile="0.99"}
# Ingest rate (assertions/sec)
rate(stemedb_assertions_total[1m])
# WAL fsync latency (should be <10ms)
stemedb_wal_fsync_latency_seconds
# Disk usage as a fraction of capacity (alert above 0.8)
1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}
# Memory usage
process_resident_memory_bytes
```
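Before Prometheus is wired up, the same thresholds can be spot-checked directly against the `/metrics` endpoint. The sketch below parses the Prometheus text exposition format with `awk`. The helpers `series_value` and `exceeds` are hypothetical, the sample values are invented for the demo, and the parsing assumes label values contain no spaces.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of the first sample whose series
# (name plus label set, exactly as exposed) equals SERIES. Reads stdin.
series_value() {
  awk -v series="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    $1 == series { print $2; exit }
  '
}

# Hypothetical helper: succeed if VALUE is numerically greater than LIMIT.
exceeds() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v > limit) }'
}

# Invented sample data in exposition format; real values come from /metrics.
sample='# TYPE stemedb_query_latency_seconds summary
stemedb_query_latency_seconds{quantile="0.5"} 0.012
stemedb_query_latency_seconds{quantile="0.99"} 0.251
stemedb_wal_fsync_latency_seconds 0.004'

p99=$(printf '%s\n' "$sample" | series_value 'stemedb_query_latency_seconds{quantile="0.99"}')
if exceeds "$p99" 0.2; then
  echo "p99 ${p99}s exceeds the 200ms target"
fi
# prints: p99 0.251s exceeds the 200ms target
```

Against a live server, the sample would be replaced by `curl -fsS http://localhost:18180/metrics`, assuming the metric is exposed with exactly that label set.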
### Grafana Dashboard
**See:** Example dashboard in `docker-compose/pilot-with-monitoring.yml` stack.

**Key panels:**
- Query latency (p50, p95, p99)
- Ingest rate (assertions/sec)
- Disk usage (WAL, DB, total)
- Error rate (4xx, 5xx responses)
---
## Failure Scenarios
### Server Failure
**Impact:** Complete outage, all queries and writes fail

**Recovery:**
1. Provision new server
2. Restore from backup (see [Restore Runbook](../../runbooks/restore-from-backup.md))
3. Update DNS to point to new server
4. Validate with test queries

**Estimated RTO:** 2 hours (manual)

**Data loss:** Up to 24 hours (everything written since the last daily backup)
### Disk Failure
**Impact:** Data loss, server won't start

**Recovery:**
1. Replace disk
2. Restore from backup
3. Restart server

**Estimated RTO:** 2 hours

**Data loss:** Up to 24 hours (everything written since the last daily backup)
### Process Crash (OOM, segfault)
**Impact:** Temporary outage, automatic restart via systemd

**Recovery:**
- Automatic (systemd restart after 5s)
- WAL replay recovers in-flight data

**Estimated RTO:** 10-30 seconds

**Data loss:** None (WAL preserves writes)

---
## Limitations
**Single-node architecture has these limitations:**

1. **No High Availability:**
   - Server failure = complete outage
   - No automatic failover
   - Manual recovery required
2. **No Horizontal Scaling:**
   - Single CPU/RAM/disk bottleneck
   - Can't add capacity by adding nodes
3. **Manual Recovery:**
   - Restore from backup is a manual process
   - Downtime 1-2 hours typical
4. **Limited Throughput:**
   - ~100 queries/sec typical
   - ~100 assertions/sec write capacity
5. **Data Loss Risk:**
   - Daily backups = up to 24hr data loss
   - No real-time replication

**For production deployments, use [Three-Node Cluster](./three-node-cluster.md) instead.**

---
## When to Migrate
**Migrate to three-node cluster when:**
- [ ] Assertion count approaching 10,000
- [ ] Query latency p99 >500ms sustained
- [ ] Availability requirements tighten (need <5min RTO)
- [ ] Pilot validated, moving to production
- [ ] Compliance requires redundancy
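The two quantitative triggers in this checklist can be evaluated mechanically. The sketch below is hypothetical: the `migration_signals` helper is not part of StemeDB, and treating "approaching" as 80% of the 10,000-assertion limit is an assumption. It takes a single p99 reading, so the "sustained" condition still has to be judged from a dashboard over time.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one line per migration trigger that has fired.
# Usage: migration_signals ASSERTION_COUNT P99_LATENCY_MS
migration_signals() {
  local assertions=$1 p99_ms=$2
  local limit=10000 approach_pct=80   # 80% is an assumed reading of "approaching"
  if [ $((assertions * 100)) -ge $((limit * approach_pct)) ]; then
    echo "assertion count $assertions is at or above ${approach_pct}% of $limit"
  fi
  if [ "$p99_ms" -gt 500 ]; then
    echo "p99 latency ${p99_ms}ms exceeds 500ms"
  fi
}

migration_signals 8500 620
# prints:
#   assertion count 8500 is at or above 80% of 10000
#   p99 latency 620ms exceeds 500ms
```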

**Migration procedure:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster)

---
## Cost Estimate
**AWS example (t3.large, us-east-1):**
| Resource | Monthly Cost |
|----------|--------------|
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| **Total** | **~$87/month** |
**GCP example (n2-standard-2, us-central1):**
| Resource | Monthly Cost |
|----------|--------------|
| Compute (n2-standard-2) | $65 |
| Storage (100GB SSD) | $17 |
| Backup (500GB Cloud Storage) | $10 |
| **Total** | **~$92/month** |
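
The totals above are plain sums of the line items, which can be re-derived when prices or sizes change. The sketch below is a hypothetical helper that adds up the cost column of a markdown table in this layout and skips the Total row. It is shown with the AWS rows from this section.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum the dollar amounts in the second column of a
# markdown table read from stdin, ignoring any row labelled Total.
sum_costs() {
  awk -F'|' '
    $3 ~ /[$][0-9]/ && $2 !~ /Total/ {
      gsub(/[^0-9.]/, "", $3)   # keep only digits and the decimal point
      total += $3
    }
    END { printf "%d\n", total }
  '
}

sum_costs <<'EOF'
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| **Total** | **~$87/month** |
EOF
# prints: 87
```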
---
## Related Documentation
- [Three-Node Cluster](./three-node-cluster.md) - Production architecture
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Pilot Success Criteria](../../pilot-success-criteria.md) - Validation checklist
- [Deployment Example](../../deployment/docker-compose/pilot-with-monitoring.yml) - Docker Compose stack
---
**Last Updated:** 2026-02-11