# Single-Node Pilot Architecture

**Target:** Proof of concept, friendly pilot, development environments

**⚠️ NOT RECOMMENDED FOR PRODUCTION** - Single point of failure, manual recovery required

---

## Overview

The single-node architecture is the simplest StemeDB deployment: one server running `stemedb-api` with local storage. Suitable for early pilots, development, and demonstrations where availability is not critical.

```
[See: diagrams/single-node.txt for ASCII diagram]
```

---

## Target Specifications

| Metric | Value |
|--------|-------|
| **Assertions** | <10,000 |
| **Queries/sec** | <100 |
| **Concurrent users** | <50 |
| **Availability** | Best effort (single point of failure) |
| **RTO** | 2 hours (manual restore) |
| **RPO** | 24 hours (daily backup) |

---

## Hardware Requirements

### Minimum (Pilot <5K assertions)

- **CPU:** 2 vCPUs
- **RAM:** 4GB
- **Disk:** 50GB SSD (30GB WAL + 20GB DB)
- **Network:** 100 Mbps

**Example instances:**
- AWS: `t3.medium` (2 vCPU, 4GB)
- GCP: `e2-medium` (2 shared vCPU, 4GB)
- Azure: `Standard_B2s` (2 vCPU, 4GB)

### Recommended (Pilot <10K assertions)

- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps

**Example instances** (closest 8GB matches; each provides 2 vCPUs, so pick the next size up, such as `t3.xlarge`, `n2-standard-4`, or `Standard_D4s_v3`, if you need the full 4 vCPUs):
- AWS: `t3.large` (2 vCPU, 8GB)
- GCP: `n2-standard-2` (2 vCPU, 8GB)
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB)

**See:** [Resource Sizing Guide](./resource-sizing.md) for calculations.

---

## Architecture Diagram

**Component layout:**

```
┌─────────────────────────────────────────────────────┐
│                   StemeDB Server                    │
│  ┌───────────────────────────────────────────────┐  │
│  │  stemedb-api (Port 18180)                     │  │
│  │  ┌─────────────┐    ┌──────────────┐          │  │
│  │  │ HTTP Router │───▶│  Ingest      │          │  │
│  │  │   (Axum)    │    │  Pipeline    │          │  │
│  │  └─────────────┘    └──────┬───────┘          │  │
│  │                            │                  │  │
│  │  ┌──────────────────┐      ▼                  │  │
│  │  │  Query Engine    │ ┌────────────┐          │  │
│  │  │  (Lenses)        │ │    WAL     │          │  │
│  │  └────────┬─────────┘ └────────────┘          │  │
│  │           │            /data/wal/             │  │
│  │           ▼                                   │  │
│  │  ┌──────────────────┐                         │  │
│  │  │  HybridStore     │                         │  │
│  │  │  • KV Store      │                         │  │
│  │  │  • Indexes       │                         │  │
│  │  └──────────────────┘                         │  │
│  │    /data/db/                                  │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
         ▲                             │
         │                             ▼
    ┌─────────┐               ┌──────────────────┐
    │ Clients │               │ Backups (daily)  │
    │ (Agents,│               │ /backups/        │
    │  Dash)  │               │ (rsync-based)    │
    └─────────┘               └──────────────────┘
```

---

## Deployment Steps

### Prerequisites

- [ ] Ubuntu 22.04 or RHEL 9 server
- [ ] `stemedb-api` binary installed
- [ ] systemd service configured
- [ ] Firewall rules applied

### Step 1: Install StemeDB

```bash
# Download binary (replace with your release URL)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Verify installation
stemedb-api --version
# Expected: stemedb-api 0.1.0
```

### Step 2: Create Data Directories

```bash
# Create directories
sudo mkdir -p /data/{wal,db}
sudo mkdir -p /backups

# Create stemedb user
sudo useradd -r -s /bin/false stemedb

# Set permissions
sudo chown -R stemedb:stemedb /data
sudo chown -R stemedb:stemedb /backups
sudo chmod 755 /data/{wal,db}
```
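
Before moving on, it can help to confirm that the directories exist and are writable. The following is a minimal sketch of such a check; `check_dirs` is a hypothetical helper (not part of StemeDB) and it only tests the paths passed to it.

```bash
# Hypothetical helper: verify each argument is an existing, writable directory.
check_dirs() {
  local d
  local rc=0
  for d in "$@"; do
    if [ ! -d "$d" ] || [ ! -w "$d" ]; then
      echo "not a writable directory: ${d}" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example call for the paths created in this step:
#   check_dirs /data/wal /data/db /backups
```

Run the check as the `stemedb` user for the writability test to be meaningful, since root can write almost anywhere.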

### Step 3: Configure Environment

```bash
# Create the config directory (tee fails if it does not exist)
sudo mkdir -p /etc/stemedb

# Create config file
sudo tee /etc/stemedb/config.env > /dev/null <<EOF
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
STEMEDB_DB_DIR=/data/db
STEMEDB_METER_ENABLED=true
RUST_LOG=info
EOF

# Set permissions (root-only; systemd reads EnvironmentFile as root)
sudo chmod 600 /etc/stemedb/config.env
```
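
A typo in a variable name is easy to miss until the service fails to start. As a rough sketch, the function below checks that an env file defines the bind address and both storage paths from this step. `check_env_file` is a hypothetical helper, and treating these three keys as the required set is an assumption; adjust the list to whatever your build actually requires.

```bash
# Hypothetical helper: confirm an env file defines the keys set in Step 3.
# Assumption: the bind address and the two directories are the required keys.
check_env_file() {
  local file="$1"
  local key
  local missing=0
  for key in STEMEDB_BIND_ADDR STEMEDB_WAL_DIR STEMEDB_DB_DIR; do
    # Match "KEY=" at the start of a line, so commented-out entries do not count.
    if ! grep -q "^${key}=" "$file"; then
      echo "missing: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (needs root because the file is mode 600):
#   check_env_file /etc/stemedb/config.env
```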

### Step 4: Create systemd Service

```bash
# Create service file
sudo tee /etc/systemd/system/stemedb-api.service <<EOF
[Unit]
Description=StemeDB API Server
After=network.target

[Service]
Type=simple
User=stemedb
Group=stemedb
EnvironmentFile=/etc/stemedb/config.env
ExecStart=/usr/local/bin/stemedb-api
Restart=on-failure
RestartSec=5s

# Resource limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd
sudo systemctl daemon-reload

# Enable service
sudo systemctl enable stemedb-api
```

### Step 5: Start Server

```bash
# Start service
sudo systemctl start stemedb-api

# Check status
sudo systemctl status stemedb-api

# Verify health
curl http://localhost:18180/v1/health
# Expected: {"status": "healthy", "version": "0.1.0", ...}
```
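
The server may need a moment after `systemctl start` before it answers requests, so a single `curl` can fail spuriously. The sketch below polls the health endpoint until it responds or a retry budget runs out. `wait_for_health` is a hypothetical helper, and the defaults (30 attempts, 1 second apart) are arbitrary assumptions rather than StemeDB recommendations.

```bash
# Hypothetical helper: poll a health URL until it returns an HTTP success.
# Arguments: URL, maximum attempts, delay in seconds between attempts.
wait_for_health() {
  local url="${1:-http://localhost:18180/v1/health}"
  local attempts="${2:-30}"
  local delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each try.
    if curl -fs --max-time 2 "$url" > /dev/null; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after ${attempts} attempts" >&2
  return 1
}

# Example call:
#   wait_for_health http://localhost:18180/v1/health 30 1
```

The function only checks the HTTP status; it does not inspect the `status` field in the JSON body.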

### Step 6: Configure Reverse Proxy (Optional)

**For TLS termination and external access:**

See: [Nginx Config](../../deployment/nginx/stemedb.conf) for complete example.

```bash
# Install nginx (Ubuntu; on RHEL 9 use: sudo dnf install nginx)
sudo apt install nginx

# Copy config (on RHEL, copy to /etc/nginx/conf.d/stemedb.conf instead and skip the symlink)
sudo cp docs/operations/deployment/nginx/stemedb.conf /etc/nginx/sites-available/stemedb

# Enable site, validate the config, then reload
sudo ln -s /etc/nginx/sites-available/stemedb /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```

### Step 7: Set Up Daily Backups

```bash
# Copy backup script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh

# Open root's crontab for editing
sudo crontab -e

# In the editor, add this line to run a daily backup at 2 AM:
# 0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Test backup
sudo /usr/local/bin/backup-stemedb.sh
ls -lh /backups/
```
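
A cron job that silently stops running undermines the 24-hour RPO listed under Target Specifications. As a rough sketch, the function below reports whether the backup directory contains anything modified within a given window. `backup_is_fresh` is a hypothetical helper, and it assumes the backup script writes its output as entries directly under `/backups`; the real layout produced by `backup-stemedb.sh` may differ.

```bash
# Hypothetical helper: succeed if a directory holds an entry newer than N minutes.
# Assumption: backups appear as files or directories directly under the path.
backup_is_fresh() {
  local dir="${1:-/backups}"
  local max_age_min="${2:-1440}"   # 24 hours, matching the stated RPO
  local recent
  # List top-level entries modified within the window; keep the first match.
  recent="$(find "$dir" -mindepth 1 -maxdepth 1 -mmin "-${max_age_min}" 2>/dev/null | head -n 1)"
  if [ -n "$recent" ]; then
    echo "fresh backup: ${recent}"
    return 0
  fi
  echo "no backup newer than ${max_age_min} minutes in ${dir}" >&2
  return 1
}

# Example call:
#   backup_is_fresh /backups 1440
```

This only checks modification times. It does not verify that a backup is complete or restorable.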

**Estimated deployment time:** 1-2 hours

---

## Network Configuration

### Ports

| Port | Protocol | Purpose | Expose To |
|------|----------|---------|-----------|
| **18180** | TCP/HTTP | API queries, ingest | Clients (via reverse proxy) |
| **18180** | TCP/HTTP | Metrics endpoint | Internal monitoring |

### Firewall Rules

**AWS Security Group:**
```bash
# Allow HTTP from load balancer only
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```

**iptables:**
```bash
# Allow HTTP from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

# Persist rules (pipe through sudo tee; a plain ">" redirect would run without root)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null
```

**See:** [Network Requirements](./network-requirements.md) for full details.

---

## Monitoring

### Prometheus

**Scrape configuration:**

```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

### Key Metrics to Monitor

```promql
# Query latency (should be <200ms p99)
stemedb_query_latency_seconds{quantile="0.99"}

# Ingest rate (assertions/sec)
rate(stemedb_assertions_total[1m])

# WAL fsync latency (should be <10ms)
stemedb_wal_fsync_latency_seconds

# Disk usage on /data as a fraction (alert at 0.80)
1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}

# Memory usage
process_resident_memory_bytes
```
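
Where Prometheus is not yet in place, the 80% disk threshold can also be checked from the shell. The following is a minimal sketch built on `df -P`; `disk_used_pct` and `disk_over_threshold` are hypothetical helper names, not StemeDB tooling.

```bash
# Hypothetical helper: print the used-space percentage for the filesystem holding a path.
# df -P emits POSIX-format output; column 5 of the second line is "Capacity" (e.g. "42%").
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage is at or above the limit (default 80).
disk_over_threshold() {
  local path="$1"
  local limit="${2:-80}"
  local used
  used="$(disk_used_pct "$path")"
  [ "$used" -ge "$limit" ]
}

# Example call:
#   if disk_over_threshold /data 80; then echo "disk above 80%"; fi
```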

### Grafana Dashboard

**See:** Example dashboard in `docker-compose/pilot-with-monitoring.yml` stack.

**Key panels:**
- Query latency (p50, p95, p99)
- Ingest rate (assertions/sec)
- Disk usage (WAL, DB, total)
- Error rate (4xx, 5xx responses)

---

## Failure Scenarios

### Server Failure

**Impact:** Complete outage, all queries and writes fail

**Recovery:**
1. Provision new server
2. Restore from backup (see [Restore Runbook](../../runbooks/restore-from-backup.md))
3. Update DNS to point to new server
4. Validate with test queries

**Estimated RTO:** 2 hours (manual)

**Data loss:** Last 24 hours (if daily backup)

### Disk Failure

**Impact:** Data loss, server won't start

**Recovery:**
1. Replace disk
2. Restore from backup
3. Restart server

**Estimated RTO:** 2 hours

**Data loss:** Last 24 hours

### Process Crash (OOM, segfault)

**Impact:** Temporary outage, automatic restart via systemd

**Recovery:**
- Automatic (systemd restart after 5s)
- WAL replay recovers in-flight data

**Estimated RTO:** 10-30 seconds

**Data loss:** None (WAL preserves writes)
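
Automatic restarts can hide a crash loop, so it is worth checking how often systemd has restarted the service. The sketch below reads the standard `NRestarts` property via `systemctl show`. The helper names and the default limit of 3 are illustrative assumptions, not StemeDB recommendations.

```bash
# Hypothetical helper: print how many times systemd has restarted a unit.
# NRestarts is a standard systemd service property (systemd 235 and later).
restart_count() {
  local unit="${1:-stemedb-api}"
  systemctl show "$unit" --property=NRestarts --value
}

# Hypothetical helper: succeed when the restart count has reached the limit.
is_crash_looping() {
  local unit="${1:-stemedb-api}"
  local limit="${2:-3}"
  [ "$(restart_count "$unit")" -ge "$limit" ]
}

# Example call:
#   if is_crash_looping stemedb-api 3; then journalctl -u stemedb-api -n 50; fi
```
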

---

## Limitations

**Single-node architecture has these limitations:**

1. **No High Availability:**
   - Server failure = complete outage
   - No automatic failover
   - Manual recovery required

2. **No Horizontal Scaling:**
   - Single CPU/RAM/disk bottleneck
   - Can't add capacity by adding nodes

3. **Manual Recovery:**
   - Restore from backup is manual process
   - Downtime 1-2 hours typical

4. **Limited Throughput:**
   - ~100 queries/sec typical
   - ~100 assertions/sec write capacity

5. **Data Loss Risk:**
   - Daily backups = up to 24hr data loss
   - No real-time replication

**For production deployments, use [Three-Node Cluster](./three-node-cluster.md) instead.**

---

## When to Migrate

**Migrate to three-node cluster when:**

- [ ] Assertion count approaching 10,000
- [ ] Query latency p99 >500ms sustained
- [ ] Availability requirements tighten (need <5min RTO)
- [ ] Pilot validated, moving to production
- [ ] Compliance requires redundancy

**Migration procedure:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster)
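
The two measurable triggers in the checklist (assertion count and p99 latency) can be turned into a simple automated check. The sketch below is one possible reading: `should_migrate` is a hypothetical helper, and treating "approaching 10,000" as 8,000 assertions is an assumption you should tune. The remaining checklist items are judgment calls and are not covered.

```bash
# Hypothetical helper: print "migrate" or "stay" from two measured values.
# Arguments: assertion count, p99 query latency in ms, optional warning level.
should_migrate() {
  local assertions="$1"
  local p99_ms="$2"
  local warn_at="${3:-8000}"   # assumption: "approaching 10,000" means 80% of it
  if [ "$assertions" -ge "$warn_at" ] || [ "$p99_ms" -gt 500 ]; then
    echo "migrate"
  else
    echo "stay"
  fi
}

# Example call:
#   should_migrate 9000 120
```
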

---

## Cost Estimate

**AWS example (t3.large, us-east-1):**

| Resource | Monthly Cost |
|----------|--------------|
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| **Total** | **~$87/month** |

**GCP example (n2-standard-2, us-central1):**

| Resource | Monthly Cost |
|----------|--------------|
| Compute (n2-standard-2) | $65 |
| Storage (100GB SSD) | $17 |
| Backup (500GB Cloud Storage) | $10 |
| **Total** | **~$92/month** |
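
To re-run the totals with your own prices, a few lines of shell arithmetic are enough. The sketch below sums the whole-dollar line items from the two tables; `monthly_total` is a hypothetical helper, and the figures are the estimates above, not live cloud pricing.

```bash
# Hypothetical helper: sum whole-dollar monthly line items.
monthly_total() {
  local total=0
  local item
  for item in "$@"; do
    total=$((total + item))
  done
  echo "$total"
}

# AWS rows: compute, storage, backup, data transfer
monthly_total 60 10 12 5   # prints 87

# GCP rows: compute, storage, backup
monthly_total 65 17 10     # prints 92
```
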

---

## Related Documentation

- [Three-Node Cluster](./three-node-cluster.md) - Production architecture
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Pilot Success Criteria](../../pilot-success-criteria.md) - Validation checklist
- [Deployment Example](../../deployment/docker-compose/pilot-with-monitoring.yml) - Docker Compose stack

---

**Last Updated:** 2026-02-11