This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
5.8 KiB
StemeDB Operations Guide
Welcome to the StemeDB operations hub. This documentation provides everything you need to deploy, monitor, troubleshoot, and maintain StemeDB in production environments.
Quick Links
| Need to... | Go to |
|---|---|
| Deploy for the first time | Single-Node Pilot Architecture |
| Troubleshoot an incident | Operational Runbooks |
| Scale to production | Three-Node Cluster Architecture |
| Size your deployment | Resource Sizing Guide |
| Configure networking | Network Requirements |
| Deploy with Docker Compose | Pilot with Monitoring |
| Set up reverse proxy | Nginx Config / Envoy Config |
| Validate pilot success | Pilot Success Criteria |
Operations Documentation
🚨 Runbooks
When things go wrong at 2am, these runbooks provide step-by-step incident response procedures:
- Server Won't Start - Port conflicts, TLS errors, WAL corruption
- High Query Latency - Performance degradation, replication lag
- Quarantine Overflow - Content defense queue management
- Circuit Breaker Stuck - Agent bans and manual resets
- Restore from Backup - Disaster recovery procedures
- Disk Full - Storage management and WAL cleanup
- Add Node to Cluster - Cluster expansion procedures
Start here: Troubleshooting Flowchart - Decision tree from symptom to runbook
🏗️ Reference Architectures
Choose your deployment model based on scale, availability requirements, and operational maturity:
| Architecture | Target | Assertions | Queries/sec | RTO/RPO | Guide |
|---|---|---|---|---|---|
| Single-Node Pilot | PoC, friendly pilot | <10K | <100/sec | 2hr / 24hr | Guide |
| Three-Node Cluster | Production | <100K | <1K/sec | 5min / 1min | Guide |
| Enterprise (future) | Large-scale | >100K | >1K/sec | 1min / 0min | Roadmap (P6+) |
Also see:
- Network Requirements - Ports, firewalls, TLS, DNS
- Resource Sizing - CPU, RAM, disk calculations
📦 Deployment Examples
Infrastructure-as-Code examples ready to customize for your environment:
- Docker Compose + Monitoring - Turnkey deployment with Prometheus + Grafana
- Nginx Reverse Proxy - TLS termination, rate limiting, security headers
- Envoy Gateway - Advanced load balancing, circuit breakers, retries
✅ Pilot Success Criteria
Before going to production, validate your pilot meets these criteria:
- Pilot Success Criteria - Performance, functional, operational requirements
- 5 Amazement Moments - Demo validation checklist
- Acceptance Criteria - Must Pass / Should Pass / Nice to Have
Common Tasks
First-Time Deployment
- Review Single-Node Pilot Architecture
- Follow Resource Sizing Guide to choose hardware
- Deploy using Docker Compose example
- Configure reverse proxy (Nginx or Envoy)
- Validate against Pilot Success Criteria
Incident Response
- Identify symptom (error message, alert, user report)
- Check Troubleshooting Flowchart
- Follow relevant runbook (see list above)
- Document resolution and add to runbook if new scenario
Scaling to Production
- Validate pilot success with Success Criteria
- Review Three-Node Cluster Architecture
- Plan migration (data backup, node provisioning, DNS changes)
- Execute deployment with rolling validation
- Set up monitoring (see Docker Compose example)
Prerequisites
Before using these operations guides, ensure you've completed:
- ✅ Production Readiness Verification - 84% CLI score, all critical checks pass
- ✅ Load Testing - 10K assertions baseline, 1K/sec sustained
- ✅ Backup/Restore Testing - Validated roundtrip recovery
Support
For questions or issues:
- 📖 Documentation bugs: Report at GitHub Issues
- 💬 Community support: [Discussion forum link TBD]
- 🚨 Security issues: security@stemedb.io (or your org's security contact)
Contributing
Operations documentation is living documentation. If you:
- Encounter an incident not covered by runbooks → Add it
- Find an architecture pattern that works well → Document it
- Discover a configuration improvement → Share the example
Submit pull requests to keep this guide current and valuable.
Last Updated: 2026-02-11