stemedb/docs/operations/reference-architecture/diagrams/single-node.txt
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00

# Single-Node Architecture Diagram
## High-Level Flow
```
┌──────────────────────────────────────────────────────────────────────┐
│                             Client Layer                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐              │
│  │    Agents    │   │  Dashboard   │   │  CLI Tools   │              │
│  │  (Ed25519)   │   │   (Web UI)   │   │    (curl)    │              │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘              │
│         │                  │                  │                      │
│         └──────────────────┴──────────────────┘                      │
│                            │                                         │
│                            │ HTTPS (443)                             │
│                            ▼                                         │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                         Reverse Proxy Layer                          │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │                          Nginx / Envoy                           │ │
│ │  • TLS termination                                               │ │
│ │  • Rate limiting                                                 │ │
│ │  • Security headers                                              │ │
│ │  • Request logging                                               │ │
│ └──────────────────────────┬───────────────────────────────────────┘ │
│                            │ HTTP (18180)                            │
│                            ▼                                         │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                            StemeDB Server                            │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │                       stemedb-api Process                        │ │
│ │                                                                  │ │
│ │  ┌───────────────┐           ┌────────────────┐                  │ │
│ │  │  HTTP Router  │           │    Content     │                  │ │
│ │  │  (Axum)       │──────────▶│    Defense     │                  │ │
│ │  │               │           │     Layer      │                  │ │
│ │  │ • /v1/assert  │           │ • Quarantine   │                  │ │
│ │  │ • /v1/query   │           │ • Circuit      │                  │ │
│ │  │ • /v1/health  │           │   Breaker      │                  │ │
│ │  │ • /metrics    │           └────────┬───────┘                  │ │
│ │  └───────┬───────┘                    │                          │ │
│ │          │                            ▼                          │ │
│ │          │                   ┌────────────────┐                  │ │
│ │          │                   │   Ingestion    │                  │ │
│ │          │                   │    Pipeline    │                  │ │
│ │          │                   │ • Validate     │                  │ │
│ │          │                   │ • Sign check   │                  │ │
│ │          │                   │ • BLAKE3 hash  │                  │ │
│ │          │                   └────────┬───────┘                  │ │
│ │          │                            │                          │ │
│ │          │                            ▼                          │ │
│ │          │                   ┌────────────────┐                  │ │
│ │          │                   │      WAL       │                  │ │
│ │          │                   │    (fsync)     │                  │ │
│ │          │                   │   /data/wal/   │                  │ │
│ │          │                   └────────┬───────┘                  │ │
│ │          │                            │                          │ │
│ │          │                            ▼                          │ │
│ │          │                   ┌────────────────┐                  │ │
│ │          └──────────────────▶│  HybridStore   │                  │ │
│ │                              │ • KV Store     │                  │ │
│ │  ┌───────────────┐           │ • Indexes      │                  │ │
│ │  │ Query Engine  │◀──────────│ • Merkle Tree  │                  │ │
│ │  │ • Lenses      │           │   /data/db/    │                  │ │
│ │  │ • Conflict    │           └────────────────┘                  │ │
│ │  │   Resolution  │                                               │ │
│ │  └───────┬───────┘                                               │ │
│ │          │                                                       │ │
│ │          └────────────────────────────────────────────────────┐  │ │
│ │                                                               │  │ │
│ └───────────────────────────────────────────────────────────────┼──┘ │
│                                                                 │    │
│  Port 18180 (HTTP)                                              │    │
└─────────────────────────────────────────────────────────────────┼────┘
                                                ┌──────────────────────┐
                                                │   Metrics Scraper    │
                                                │     (Prometheus)     │
                                                │     GET /metrics     │
                                                └──────────────────────┘
```
## Storage Layer
```
/data/
├── wal/                        Write-Ahead Log (crash recovery)
│   ├── segment-00001.log       10MB segments
│   ├── segment-00002.log       Fsync on every write
│   └── segment-00003.log       7-day retention
├── db/                         KV Store + Indexes
│   ├── assertions.kv           Content-addressed storage
│   ├── indexes/
│   │   ├── concept_path.idx    Tail-path matching
│   │   ├── predicate.idx       Predicate lookup
│   │   └── agent.idx           Agent-based queries
│   └── merkle_tree.dat         BLAKE3 Merkle tree
└── metadata.json               Assertion count, version
```
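The WAL behaviour noted in the tree above (10MB segments, fsync on every write) can be sketched roughly as follows. This is an illustrative sketch only, not StemeDB's actual implementation: the `WalWriter` class, the 4-byte length-prefix framing, and the rotate-before-write policy are all assumptions. Only the `segment-NNNNN.log` naming and the 10MB cap come from the layout above. The 7-day retention policy is not shown.

```python
import os

SEGMENT_MAX_BYTES = 10 * 1024 * 1024  # 10MB segments, per the layout above


class WalWriter:
    """Hypothetical append-only WAL writer (illustration, not StemeDB code)."""

    def __init__(self, wal_dir, segment_max_bytes=SEGMENT_MAX_BYTES):
        self.wal_dir = wal_dir
        self.segment_max_bytes = segment_max_bytes
        self.index = 1
        os.makedirs(wal_dir, exist_ok=True)
        self._open_segment()

    def _segment_path(self):
        return os.path.join(self.wal_dir, f"segment-{self.index:05d}.log")

    def _open_segment(self):
        path = self._segment_path()
        # Append mode: an existing segment is extended, never truncated.
        self.file = open(path, "ab")
        self.size = os.path.getsize(path)

    def append(self, record: bytes) -> str:
        # Assumed framing: 4-byte big-endian length prefix, then the payload.
        frame = len(record).to_bytes(4, "big") + record
        # Rotate first if this frame would push a non-empty segment past the cap.
        if self.size > 0 and self.size + len(frame) > self.segment_max_bytes:
            self.file.close()
            self.index += 1
            self._open_segment()
        self.file.write(frame)
        self.file.flush()              # move Python's buffer into the OS
        os.fsync(self.file.fileno())   # ask the OS to persist it to disk
        self.size += len(frame)
        return self._segment_path()

    def close(self):
        self.file.close()
```

Calling `os.fsync` after every append trades throughput for durability: a record is only acknowledged once the kernel has been asked to flush it to stable storage.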
## Backup Flow
```
┌──────────────┐
│   Cron Job   │  Daily at 2 AM
│ (0 2 * * *)  │
└──────┬───────┘
┌────────────────────────────┐
│  backup-stemedb.sh         │
│  • Stop writes (optional)  │
│  • rsync WAL + DB          │
│  • Create metadata.json    │
│  • Resume writes           │
└──────┬─────────────────────┘
┌────────────────────────────┐
│  /backups/                 │
│  stemedb-backup-YYYYMMDD/  │
│  ├── wal/                  │
│  ├── db/                   │
│  └── metadata.json         │
└────────────────────────────┘
```
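The copy-and-record steps of the backup flow can be sketched as below. This is a minimal illustration, not the contents of `backup-stemedb.sh`: the function name `backup_stemedb`, the use of `shutil.copytree` in place of rsync, and the fields written to `metadata.json` are all assumptions. Pausing and resuming writes is not shown.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_stemedb(data_dir, backup_root, now=None):
    """Copy wal/ and db/ into a dated directory and write metadata.json.

    Hypothetical helper; the real script uses rsync and may record
    different metadata.
    """
    now = now or datetime.now(timezone.utc)
    data_dir, backup_root = Path(data_dir), Path(backup_root)
    target = backup_root / f"stemedb-backup-{now:%Y%m%d}"
    # Refuse to overwrite a backup that already exists for this date.
    target.mkdir(parents=True, exist_ok=False)

    file_count = 0
    for name in ("wal", "db"):
        shutil.copytree(data_dir / name, target / name)
        file_count += sum(1 for p in (target / name).rglob("*") if p.is_file())

    # Assumed metadata fields; the source only says metadata.json is created.
    metadata = {
        "created_at": now.isoformat(),
        "source": str(data_dir),
        "file_count": file_count,
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return target
```

With the paths from the diagrams, a call would look like `backup_stemedb("/data", "/backups")`.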
## Failure Mode (Server Down)
```
┌──────────────┐
│   Clients    │
└──────┬───────┘
       ❌ Connection refused
┌──────────────────────┐
│  Manual Recovery     │
│  1. Provision server │
│  2. Restore backup   │
│  3. Update DNS       │
│  4. Validate health  │
│                      │
│  RTO: ~2 hours       │
│  RPO: ~24 hours      │
└──────────────────────┘
```
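Step 4 of the manual recovery ("Validate health") could be scripted along these lines. This sketch assumes that `GET /v1/health` returns HTTP 200 once the server is ready, which the diagrams do not state. The helper names `http_status` and `wait_until_healthy` and the retry parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except OSError:
        # Connection refused, DNS failure, or timeout (URLError is an OSError).
        return None


def wait_until_healthy(url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; return the attempt number, or None on failure."""
    for attempt in range(1, attempts + 1):
        if probe(url) == 200:  # assumption: 200 means healthy
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

On the restored server this might be run as `wait_until_healthy("http://127.0.0.1:18180/v1/health")`, using the port shown in the architecture diagram. The `probe` and `sleep` parameters exist so the loop can be exercised without a live server.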
## Key Characteristics
- **Simplicity:** Single server, easy to deploy and manage
- **Cost:** ~$87/month (AWS t3.large)
- **Availability:** Single point of failure, no automatic failover
- **Capacity:** <10K assertions, <100 queries/sec
- **Recovery:** Manual restore from backup (~2 hour RTO, ~24 hour RPO with daily backups)
- **Use Case:** PoC, friendly pilot, development environments

⚠️ NOT RECOMMENDED FOR PRODUCTION - Use three-node cluster for HA