stemedb/docs/operations/reference-architecture/single-node-pilot.md

# Single-Node Pilot Architecture

**Target:** Proof of concept, friendly pilot, development environments

**⚠️ NOT RECOMMENDED FOR PRODUCTION** - Single point of failure, manual recovery required

---

## Overview

The single-node architecture is the simplest StemeDB deployment: one server running `stemedb-api` with local storage. It is suitable for early pilots, development, and demonstrations where availability is not critical.

**See:** `diagrams/single-node.txt` for the ASCII diagram.

---

## Target Specifications

| Metric | Value |
|--------|-------|
| **Assertions** | <10,000 |
| **Queries/sec** | <100 |
| **Concurrent users** | <50 |
| **Availability** | Best effort (single point of failure) |
| **RTO** | 2 hours (manual restore) |
| **RPO** | 24 hours (daily backup) |

---

## Hardware Requirements

### Minimum (Pilot <5K assertions)

- **CPU:** 2 vCPUs
- **RAM:** 4GB
- **Disk:** 50GB SSD (30GB WAL + 20GB DB)
- **Network:** 100 Mbps

**Example instances:**
- AWS: `t3.medium` (2 vCPU, 4GB)
- GCP: `e2-medium` (2 shared vCPUs, 4GB)
- Azure: `Standard_B2s` (2 vCPU, 4GB)

### Recommended (Pilot <10K assertions)

- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps

**Example instances:**
- AWS: `t3.large` (2 vCPU, 8GB)
- GCP: `n2-standard-2` (2 vCPU, 8GB)
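- Azure: `Standard_D2s_v3` (2 vCPU, 8GB)

**Note:** These instances match the 8GB RAM target but provide 2 vCPUs, not 4. If the workload is CPU-bound, the next size up (for example AWS `t3.xlarge`, 4 vCPU, 16GB) meets the CPU target.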

**See:** [Resource Sizing Guide](./resource-sizing.md) for calculations.
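
A quick preflight check can confirm that a host meets the Minimum tier before installation. The sketch below is illustrative and not part of StemeDB: the helper names `meets_minimum` and `check_this_host` are hypothetical, the `/data` mount point is taken from the deployment steps, and the roughly 10% tolerance on RAM and disk is an assumption to account for capacity reserved by the kernel and filesystem.

```bash
# Preflight sketch: compare a host against the Minimum tier
# (2 vCPUs, 4GB RAM, 50GB disk). Helper names are hypothetical.

# Usage: meets_minimum <vcpus> <ram_mb> <disk_gb>
# Prints one line per shortfall; returns non-zero if any check fails.
meets_minimum() {
  local vcpus="$1" ram_mb="$2" disk_gb="$3" failed=0
  if [ "$vcpus" -lt 2 ]; then
    echo "CPU: $vcpus vCPUs, need 2"
    failed=1
  fi
  # Assumed tolerance: a nominal 4GB instance reports slightly less usable RAM.
  if [ "$ram_mb" -lt 3600 ]; then
    echo "RAM: ${ram_mb}MB, need ~4096MB"
    failed=1
  fi
  # Assumed tolerance: a nominal 50GB volume loses some space to the filesystem.
  if [ "$disk_gb" -lt 45 ]; then
    echo "Disk: ${disk_gb}GB, need ~50GB"
    failed=1
  fi
  return "$failed"
}

# Gather live values on Linux and run the check (requires GNU df for -BG).
check_this_host() {
  local vcpus ram_mb disk_gb
  vcpus=$(nproc)
  ram_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
  # Total size of the filesystem backing /data, in whole GB.
  disk_gb=$(df -P -BG /data | awk 'NR==2 {sub("G", "", $2); print $2}')
  meets_minimum "$vcpus" "$ram_mb" "$disk_gb"
}
```

On the target server, run `check_this_host && echo "Host meets the Minimum tier"` after `/data` is mounted. For the Recommended tier, raise the thresholds accordingly.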

---

## Architecture Diagram

**Component layout:**

```
┌─────────────────────────────────────────────────────┐
│                  StemeDB Server                     │
│  ┌───────────────────────────────────────────────┐  │
│  │          stemedb-api (Port 18180)            │  │
│  │  ┌─────────────┐    ┌──────────────┐         │  │
│  │  │ HTTP Router │───▶│ Ingest       │         │  │
│  │  │ (Axum)      │    │ Pipeline     │         │  │
│  │  └─────────────┘    └──────┬───────┘         │  │
│  │                            │                  │  │
│  │  ┌──────────────────┐     ▼                  │  │
│  │  │ Query Engine     │  ┌────────────┐        │  │
│  │  │ (Lenses)         │  │ WAL        │        │  │
│  │  └────────┬─────────┘  └────────────┘        │  │
│  │           │              /data/wal/           │  │
│  │           ▼                                   │  │
│  │  ┌──────────────────┐                        │  │
│  │  │ HybridStore      │                        │  │
│  │  │ • KV Store       │                        │  │
│  │  │ • Indexes        │                        │  │
│  │  └──────────────────┘                        │  │
│  │     /data/db/                                │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
        ▲                           │
        │                           ▼
   ┌─────────┐            ┌──────────────────┐
   │ Clients │            │ Backups (daily)  │
   │ (Agents,│            │ /backups/        │
   │ Dash)   │            │ (rsync-based)    │
   └─────────┘            └──────────────────┘
```

---

## Deployment Steps

### Prerequisites

- [ ] Ubuntu 22.04 or RHEL 9 server
- [ ] `stemedb-api` binary installed
- [ ] systemd service configured
- [ ] Firewall rules applied

### Step 1: Install StemeDB

```bash
# Download binary (replace with your release URL); -f makes curl fail on HTTP errors
sudo curl -fL https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Verify installation
stemedb-api --version
# Expected: stemedb-api 0.1.0
```

### Step 2: Create Data Directories

```bash
# Create directories
sudo mkdir -p /data/{wal,db}
sudo mkdir -p /backups

# Create stemedb user
sudo useradd -r -s /bin/false stemedb

# Set permissions
sudo chown -R stemedb:stemedb /data
sudo chown -R stemedb:stemedb /backups
sudo chmod 755 /data/{wal,db}
```

### Step 3: Configure Environment

```bash
# Create config directory and file
sudo mkdir -p /etc/stemedb
sudo tee /etc/stemedb/config.env > /dev/null <<EOF
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/data/wal
STEMEDB_DB_DIR=/data/db
STEMEDB_METER_ENABLED=true
RUST_LOG=info
EOF

# Set permissions
sudo chmod 600 /etc/stemedb/config.env
```
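
Before starting the service, it can help to confirm that the environment file defines the keys the server depends on. The following is a minimal sketch: `missing_keys` is a hypothetical helper, and treating the bind address and the two directory keys as required is an assumption based on the config above, not documented StemeDB behavior.

```bash
# Sketch: report keys from Step 3 that are absent or empty in an env file.
# missing_keys is a hypothetical helper, not part of StemeDB.
# Usage: missing_keys <path-to-env-file>
missing_keys() {
  local file="$1" key missing=0
  for key in STEMEDB_BIND_ADDR STEMEDB_WAL_DIR STEMEDB_DB_DIR; do
    # Match "KEY=value" at the start of a line with a non-empty value.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Example (run as root, since the file is mode 600):
# missing_keys /etc/stemedb/config.env && echo "config looks complete"
```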

### Step 4: Create systemd Service

```bash
# Create service file
sudo tee /etc/systemd/system/stemedb-api.service <<EOF
[Unit]
Description=StemeDB API Server
After=network.target

[Service]
Type=simple
User=stemedb
Group=stemedb
EnvironmentFile=/etc/stemedb/config.env
ExecStart=/usr/local/bin/stemedb-api
Restart=on-failure
RestartSec=5s

# Resource limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd
sudo systemctl daemon-reload

# Enable service
sudo systemctl enable stemedb-api
```

### Step 5: Start Server

```bash
# Start service
sudo systemctl start stemedb-api

# Check status
sudo systemctl status stemedb-api

# Verify health
curl http://localhost:18180/v1/health
# Expected: {"status": "healthy", "version": "0.1.0", ...}
```
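
The server may need a few seconds after `systemctl start` before it answers health checks, so a single `curl` can fail spuriously. A small retry wrapper, sketched below, avoids that in scripted deployments. `retry` is a hypothetical helper, and the 10 attempts at 3-second intervals in the example are assumed values, not StemeDB recommendations.

```bash
# Sketch: retry a command until it succeeds or the attempts run out.
# retry is a hypothetical helper, not part of StemeDB.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to ~30s for the health endpoint from Step 5.
# retry 10 3 curl -fsS http://localhost:18180/v1/health
```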

### Step 6: Configure Reverse Proxy (Optional)

**For TLS termination and external access:**

See: [Nginx Config](../../deployment/nginx/stemedb.conf) for complete example.

```bash
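# Install nginx (Ubuntu; on RHEL 9 use `sudo dnf install -y nginx`)
sudo apt install -y nginx

# Copy config
sudo cp docs/operations/deployment/nginx/stemedb.conf /etc/nginx/sites-available/stemedb

# Enable site (Debian/Ubuntu layout; on RHEL place the file in /etc/nginx/conf.d/ instead)
sudo ln -s /etc/nginx/sites-available/stemedb /etc/nginx/sites-enabled/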
sudo nginx -t
sudo systemctl reload nginx
```

### Step 7: Set Up Daily Backups

```bash
# Copy backup script
sudo cp scripts/backup-stemedb.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/backup-stemedb.sh

# Edit root's crontab
sudo crontab -e

# In the editor, add this line for a daily backup at 2 AM:
# 0 2 * * * /usr/local/bin/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Test backup
sudo /usr/local/bin/backup-stemedb.sh
ls -lh /backups/
```
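
Because the 24-hour RPO depends on the cron job actually running, it is worth checking backup freshness as well. The sketch below is one way to do that. `backup_is_fresh` is a hypothetical helper, and it assumes that each run of the backup script creates or updates at least one top-level entry under `/backups`, which may not hold for every backup layout.

```bash
# Sketch: succeed only if the backup directory holds an entry modified
# within the last max_hours (default 24, matching the daily RPO).
# backup_is_fresh is a hypothetical helper, not part of StemeDB.
# Usage: backup_is_fresh <dir> [max_hours]
backup_is_fresh() {
  local dir="$1" max_hours="${2:-24}" recent
  # -mmin -N matches entries modified less than N minutes ago;
  # -mindepth 1 excludes the directory itself.
  recent=$(find "$dir" -mindepth 1 -maxdepth 1 -mmin "-$((max_hours * 60))" 2>/dev/null | head -n 1)
  [ -n "$recent" ]
}

# Example:
# backup_is_fresh /backups || echo "WARNING: no backup in the last 24 hours"
```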

**Estimated deployment time:** 1-2 hours

---

## Network Configuration

### Ports

| Port | Protocol | Purpose | Expose To |
|------|----------|---------|-----------|
| **18180** | TCP/HTTP | API queries, ingest | Clients (via reverse proxy) |
| **18180** | TCP/HTTP | Metrics endpoint | Internal monitoring |

### Firewall Rules

**AWS Security Group:**
```bash
# Allow HTTP from load balancer only
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180

# Allow SSH from bastion
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-bastion \
  --protocol tcp \
  --port 22
```

**iptables:**
```bash
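# Allow loopback (local health checks, reverse proxy, Prometheus scrape)
sudo iptables -A INPUT -i lo -j ACCEPT

# Allow HTTP from internal network only
sudo iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 18180 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 18180 -j DROP

# Persist rules (the write must run as root; path assumes iptables-persistent)
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null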
```

**See:** [Network Requirements](./network-requirements.md) for full details.

---

## Monitoring

### Prometheus

**Scrape configuration:**

```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'stemedb'
    static_configs:
      - targets: ['localhost:18180']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

### Key Metrics to Monitor

```promql
# Query latency (should be <200ms p99)
stemedb_query_latency_seconds{quantile="0.99"}

# Ingest rate (assertions/sec)
rate(stemedb_assertions_total[1m])

# WAL fsync latency (should be <10ms)
stemedb_wal_fsync_latency_seconds

# Disk usage as a fraction of capacity (alert at 0.80; requires node_exporter)
1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}

# Memory usage
process_resident_memory_bytes
```
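
Early in a pilot, before Prometheus is in place, the 80% disk threshold can also be checked directly on the host. This is a minimal sketch: `disk_used_pct` and `disk_alert` are hypothetical helpers, not StemeDB tooling, and they rely only on POSIX `df -P` output.

```bash
# Sketch: report the used-space percentage for a mount point and decide
# whether it crosses an alert threshold. Helper names are hypothetical.

# Usage: disk_used_pct <mount_point>  (prints an integer, e.g. 42)
disk_used_pct() {
  # In POSIX df output, column 5 of the second line is capacity, e.g. "42%".
  df -P "$1" | awk 'NR==2 {sub("%", "", $5); print $5}'
}

# Usage: disk_alert <used_pct> [threshold_pct]
# Returns 0 when usage is at or above the threshold (default 80).
disk_alert() {
  [ "$1" -ge "${2:-80}" ]
}

# Example:
# if disk_alert "$(disk_used_pct /data)"; then echo "disk usage above 80%"; fi
```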

### Grafana Dashboard

**See:** Example dashboard in `docker-compose/pilot-with-monitoring.yml` stack.

**Key panels:**
- Query latency (p50, p95, p99)
- Ingest rate (assertions/sec)
- Disk usage (WAL, DB, total)
- Error rate (4xx, 5xx responses)

---

## Failure Scenarios

### Server Failure

**Impact:** Complete outage, all queries and writes fail

**Recovery:**
1. Provision new server
2. Restore from backup (see [Restore Runbook](../../runbooks/restore-from-backup.md))
3. Update DNS to point to new server
4. Validate with test queries

**Estimated RTO:** 2 hours (manual)

**Data loss:** Up to 24 hours (everything written since the last daily backup)

### Disk Failure

**Impact:** Data loss, server won't start

**Recovery:**
1. Replace disk
2. Restore from backup
3. Restart server

**Estimated RTO:** 2 hours

**Data loss:** Up to 24 hours (everything written since the last daily backup)

### Process Crash (OOM, segfault)

**Impact:** Temporary outage, automatic restart via systemd

**Recovery:**
- Automatic (systemd restart after 5s)
- WAL replay recovers in-flight data

**Estimated RTO:** 10-30 seconds

**Data loss:** None expected (writes already in the WAL are replayed on restart)

---

## Limitations

**Single-node architecture has these limitations:**

1. **No High Availability:**
   - Server failure = complete outage
   - No automatic failover
   - Manual recovery required

2. **No Horizontal Scaling:**
   - Single CPU/RAM/disk bottleneck
   - Can't add capacity by adding nodes

3. **Manual Recovery:**
   - Restore from backup is a manual process
   - Downtime 1-2 hours typical

4. **Limited Throughput:**
   - ~100 queries/sec typical
   - ~100 assertions/sec write capacity

5. **Data Loss Risk:**
   - Daily backups = up to 24hr data loss
   - No real-time replication

**For production deployments, use [Three-Node Cluster](./three-node-cluster.md) instead.**

---

## When to Migrate

**Migrate to three-node cluster when:**

- [ ] Assertion count approaching 10,000
- [ ] Query latency p99 >500ms sustained
- [ ] Availability requirements tighten (need <5min RTO)
- [ ] Pilot validated, moving to production
- [ ] Compliance requires redundancy

**Migration procedure:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster)
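
The two measurable triggers in the checklist (assertion count and p99 latency) can be evaluated mechanically, for example from a periodic report. The sketch below assumes you already have both numbers from your own monitoring. `should_migrate` is a hypothetical helper, and reading "approaching 10,000" as 80% of the limit (8,000 assertions) is an assumed cutoff.

```bash
# Sketch: evaluate the two measurable migration triggers.
# should_migrate is a hypothetical helper; the 8,000 cutoff is an assumption.
# Usage: should_migrate <assertion_count> <p99_latency_ms>
# Prints each trigger that fired; returns 0 if migration is indicated.
should_migrate() {
  local count="$1" p99_ms="$2" indicated=1
  if [ "$count" -ge 8000 ]; then
    echo "assertion count $count is approaching 10,000"
    indicated=0
  fi
  if [ "$p99_ms" -gt 500 ]; then
    echo "p99 latency ${p99_ms}ms exceeds 500ms"
    indicated=0
  fi
  return "$indicated"
}

# Example:
# should_migrate 9200 310 && echo "Plan the move to the three-node cluster"
```

The remaining checklist items (availability requirements, production readiness, compliance) are judgment calls and are not covered by this check.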

---

## Cost Estimate

**AWS example (t3.large, us-east-1):**

| Resource | Monthly Cost |
|----------|--------------|
| Compute (t3.large) | $60 |
| Storage (100GB SSD) | $10 |
| Backup (500GB S3) | $12 |
| Data transfer | $5 |
| **Total** | **~$87/month** |

**GCP example (n2-standard-2, us-central1):**

| Resource | Monthly Cost |
|----------|--------------|
| Compute (n2-standard-2) | $65 |
| Storage (100GB SSD) | $17 |
| Backup (500GB Cloud Storage) | $10 |
| **Total** | **~$92/month** |

---

## Related Documentation

- [Three-Node Cluster](./three-node-cluster.md) - Production architecture
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Pilot Success Criteria](../../pilot-success-criteria.md) - Validation checklist
- [Deployment Example](../../deployment/docker-compose/pilot-with-monitoring.yml) - Docker Compose stack

---

**Last Updated:** 2026-02-11