This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# Single-Node Architecture Diagram

## High-Level Flow

```
┌──────────────────────────────────────────────────────────────────────┐
│                             Client Layer                             │
│    ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │
│    │    Agents    │   │  Dashboard   │   │  CLI Tools   │            │
│    │  (Ed25519)   │   │   (Web UI)   │   │    (curl)    │            │
│    └──────┬───────┘   └──────┬───────┘   └──────┬───────┘            │
│           │                  │                  │                    │
│           └──────────────────┴──────────────────┘                    │
│                              │                                       │
│                              │ HTTPS (443)                           │
│                              ▼                                       │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                         Reverse Proxy Layer                          │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │                          Nginx / Envoy                           │ │
│ │  • TLS termination                                               │ │
│ │  • Rate limiting                                                 │ │
│ │  • Security headers                                              │ │
│ │  • Request logging                                               │ │
│ └────────────────────────────┬─────────────────────────────────────┘ │
│                              │ HTTP (18180)                          │
│                              ▼                                       │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                            StemeDB Server                            │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │                       stemedb-api Process                        │ │
│ │                                                                  │ │
│ │     ┌───────────────┐           ┌────────────────┐               │ │
│ │     │  HTTP Router  │           │    Content     │               │ │
│ │     │    (Axum)     │──────────▶│    Defense     │               │ │
│ │     │               │           │     Layer      │               │ │
│ │     │ • /v1/assert  │           │ • Quarantine   │               │ │
│ │     │ • /v1/query   │           │ • Circuit      │               │ │
│ │     │ • /v1/health  │           │   Breaker      │               │ │
│ │     │ • /metrics    │           └────────┬───────┘               │ │
│ │     └───────┬───────┘                    │                       │ │
│ │             │                            ▼                       │ │
│ │             │                   ┌────────────────┐               │ │
│ │             │                   │   Ingestion    │               │ │
│ │             │                   │    Pipeline    │               │ │
│ │             │                   │ • Validate     │               │ │
│ │             │                   │ • Sign check   │               │ │
│ │             │                   │ • BLAKE3 hash  │               │ │
│ │             │                   └────────┬───────┘               │ │
│ │             │                            │                       │ │
│ │             │                            ▼                       │ │
│ │             │                   ┌────────────────┐               │ │
│ │             │                   │      WAL       │               │ │
│ │             │                   │    (fsync)     │               │ │
│ │             │                   │   /data/wal/   │               │ │
│ │             │                   └────────┬───────┘               │ │
│ │             │                            │                       │ │
│ │             │                            ▼                       │ │
│ │             │                   ┌────────────────┐               │ │
│ │             └──────────────────▶│  HybridStore   │               │ │
│ │                                 │ • KV Store     │               │ │
│ │     ┌───────────────┐           │ • Indexes      │               │ │
│ │     │ Query Engine  │◀──────────│ • Merkle Tree  │               │ │
│ │     │ • Lenses      │           │ /data/db/      │               │ │
│ │     │ • Conflict    │           └────────────────┘               │ │
│ │     │   Resolution  │                                            │ │
│ │     └───────┬───────┘                                            │ │
│ │             │                                                    │ │
│ │             └─────────────────────────────────────────────────┐  │ │
│ │                                                               │  │ │
│ └───────────────────────────────────────────────────────────────┼──┘ │
│                                                                 │    │
│  Port 18180 (HTTP)                                              │    │
└─────────────────────────────────────────────────────────────────┼────┘
                                                                  │
                                                                  ▼
                                                       ┌──────────────────────┐
                                                       │   Metrics Scraper    │
                                                       │     (Prometheus)     │
                                                       │     GET /metrics     │
                                                       └──────────────────────┘
```

## Storage Layer

```
/data/
├── wal/                        Write-Ahead Log (crash recovery)
│   ├── segment-00001.log       10MB segments
│   ├── segment-00002.log       Fsync on every write
│   └── segment-00003.log       7-day retention
│
├── db/                         KV Store + Indexes
│   ├── assertions.kv           Content-addressed storage
│   ├── indexes/
│   │   ├── concept_path.idx    Tail-path matching
│   │   ├── predicate.idx       Predicate lookup
│   │   └── agent.idx           Agent-based queries
│   └── merkle_tree.dat         BLAKE3 Merkle tree
│
└── metadata.json               Assertion count, version
```
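
The WAL annotations in the tree above (10MB segments, fsync on every write, 7-day retention) can be sketched roughly as follows. This is an illustration under stated assumptions, not StemeDB's implementation: the `SegmentedWal` class, the length-prefixed record framing, the reading of "10MB" as 10 MiB, and the mtime-based pruning rule are all hypothetical. Only the segment naming, size, and retention window come from the tree.

```python
import os
import time
from pathlib import Path

SEGMENT_BYTES = 10 * 1024 * 1024      # "10MB segments" (assumed to mean 10 MiB)
RETENTION_SECONDS = 7 * 24 * 60 * 60  # "7-day retention"


class SegmentedWal:
    """Hypothetical append-only log: rotates segments and fsyncs every write."""

    def __init__(self, directory, segment_bytes=SEGMENT_BYTES):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.segment_bytes = segment_bytes
        existing = sorted(self.directory.glob("segment-*.log"))
        # Resume at the highest existing segment, or start at segment 1.
        self.index = int(existing[-1].stem.split("-")[1]) if existing else 1

    def _path(self, index):
        return self.directory / f"segment-{index:05d}.log"

    def append(self, payload: bytes) -> Path:
        # Assumed framing: 4-byte big-endian length, then the payload.
        record = len(payload).to_bytes(4, "big") + payload
        path = self._path(self.index)
        size = path.stat().st_size if path.exists() else 0
        if size > 0 and size + len(record) > self.segment_bytes:
            self.index += 1  # rotate before the segment would overflow
            path = self._path(self.index)
        with open(path, "ab") as segment:
            segment.write(record)
            segment.flush()
            os.fsync(segment.fileno())  # "Fsync on every write"
        return path

    def prune(self, now=None):
        """Delete closed segments whose mtime is past the retention window."""
        now = time.time() if now is None else now
        removed = []
        for path in sorted(self.directory.glob("segment-*.log")):
            is_active = path == self._path(self.index)
            if not is_active and now - path.stat().st_mtime > RETENTION_SECONDS:
                path.unlink()
                removed.append(path.name)
        return removed
```

Calling `os.fsync` on the file makes the record durable but does not persist the directory entry of a newly created segment; on POSIX systems a production WAL would typically also fsync the containing directory after rotation.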

## Backup Flow

```
┌──────────────┐
│   Cron Job   │  Daily at 2 AM
│ (0 2 * * *)  │
└──────┬───────┘
       │
       ▼
┌────────────────────────────┐
│ backup-stemedb.sh          │
│ • Stop writes (optional)   │
│ • rsync WAL + DB           │
│ • Create metadata.json     │
│ • Resume writes            │
└──────┬─────────────────────┘
       │
       ▼
┌────────────────────────────┐
│ /backups/                  │
│ stemedb-backup-YYYYMMDD/   │
│ ├── wal/                   │
│ ├── db/                    │
│ └── metadata.json          │
└────────────────────────────┘
```
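
The copy-and-describe steps in the backup box can be sketched as below. This is a hedged illustration rather than the contents of `backup-stemedb.sh`: the `run_backup` function is hypothetical, `shutil.copytree` stands in for the rsync step, the optional stop/resume of writes is omitted, and the `metadata.json` field names are assumptions. Only the directory layout and the `stemedb-backup-YYYYMMDD` naming come from the diagram.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def run_backup(data_dir, backup_root, now=None):
    """Copy wal/ and db/ into a dated backup directory and write metadata.json."""
    now = now or datetime.now(timezone.utc)
    data_dir, backup_root = Path(data_dir), Path(backup_root)
    target = backup_root / f"stemedb-backup-{now:%Y%m%d}"
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a backup

    file_counts = {}
    for name in ("wal", "db"):
        # shutil.copytree stands in for the rsync step in the diagram.
        shutil.copytree(data_dir / name, target / name)
        file_counts[name] = sum(
            1 for p in (target / name).rglob("*") if p.is_file()
        )

    metadata = {  # field names are assumptions, not the script's real schema
        "created_at": now.isoformat(),
        "source": str(data_dir),
        "file_counts": file_counts,
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return target
```

Failing when the dated directory already exists is a deliberate choice in this sketch: a second run on the same day cannot silently overwrite the first backup.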

## Failure Mode (Server Down)

```
┌──────────────┐
│   Clients    │
└──────┬───────┘
       │
       ▼
 ❌ Connection refused
       │
       ▼
┌──────────────────────┐
│ Manual Recovery      │
│ 1. Provision server  │
│ 2. Restore backup    │
│ 3. Update DNS        │
│ 4. Validate health   │
│                      │
│ RTO: ~2 hours        │
│ RPO: ~24 hours       │
└──────────────────────┘
```
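
Step 4 of the manual recovery ("Validate health") could be automated with a small polling loop like the one below. It is a sketch under assumptions: `http_status` and `wait_until_healthy` are hypothetical helpers, and treating HTTP 200 from `/v1/health` as "healthy" is an assumption about the endpoint's contract. The path and port 18180 come from the diagrams above.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # e.g. connection refused while the restore is running


def wait_until_healthy(url, attempts=30, delay=10.0,
                       probe=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200; return the probe count, or None."""
    for attempt in range(1, attempts + 1):
        if probe(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)  # no sleep after the final failed probe
    return None
```

For example, `wait_until_healthy("http://127.0.0.1:18180/v1/health")` would probe a locally restored server for up to about five minutes with the defaults; the address is an assumed one. The `probe` and `sleep` parameters exist so the loop can be exercised without a network.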

## Key Characteristics

- **Simplicity:** Single server, easy to deploy and manage
- **Cost:** ~$87/month (AWS t3.large)
- **Availability:** Single point of failure, no automatic failover
- **Capacity:** <10K assertions, <100 queries/sec
- **Recovery:** Manual restore from backup (~2-hour RTO, ~24-hour RPO)
- **Use Case:** PoC, friendly pilot, development environments

⚠️ NOT RECOMMENDED FOR PRODUCTION. Use a three-node cluster for HA.