# rdev Implementation Plan v2

> Weeks 5-10: From 75% Complete to Pristine Production

## Current State (After Week 4)

### Completed

| Component | Status | Test Coverage |
|-----------|--------|---------------|
| Hexagonal Architecture | ✅ | Domain, Ports, Services |
| Authentication | ✅ | 394 lines |
| HTTP API + OpenAPI | ✅ | 1,189 lines |
| Command Execution | ✅ | 359 lines |
| Command Sanitization | ✅ | 257 lines |
| SSE Streaming | ✅ | Last-Event-ID support |
| Rate Limiting | ✅ | 413 lines |
| Command Limiting | ✅ | 414 lines |
| Database + Migrations | ✅ | Auto-migrations |
| Domain Models | ✅ | 542 lines |
| Port Interfaces | ✅ | 380 lines |
| Prometheus Metrics | ✅ | Path normalization |
| Validation Package | ✅ | 548 lines |

### Remaining Gaps

| Gap | Impact | Priority |
|-----|--------|----------|
| Claude config file I/O | Handlers broken | CRITICAL |
| Legacy code mixed in | Technical debt | HIGH |
| Hardcoded projects | Scalability | HIGH |
| No adapter tests | Reliability | HIGH |
| IP allowlisting | Security | HIGH |
| Production manifests | Deployment | MEDIUM |
| Validation not integrated | Consistency | MEDIUM |
| Documentation gaps | Usability | MEDIUM |

---

## Philosophy: Foundation First

```
Week 5-6: Clean the House
├── Remove all legacy code
├── Fix broken functionality
└── Achieve 100% working state

Week 7-8: Strengthen the Foundation
├── Complete test coverage
├── Add missing security features
└── Production-harden deployment

Week 9-10: Polish and Document
├── Performance optimization
├── Comprehensive documentation
└── Final quality gates
```

---

## Week 5: Legacy Removal & Core Fixes

**Goal**: Remove all legacy code, fix Claude config, integrate validation

### Task 5.1: Remove Legacy Code (4h)
**Files to delete:**
- `internal/executor/executor.go` → replaced by `internal/adapter/kubernetes/executor.go`
- `internal/projects/registry.go` → replaced by `internal/adapter/kubernetes/project_repository.go`

**Files to update:**
- `internal/handlers/claude_config.go` → Use service layer, not legacy executor
- `cmd/rdev-api/main.go` → Remove legacy imports

**Acceptance:**
- `go build ./...` passes
- No imports from `internal/executor` or `internal/projects`
- All tests pass

### Task 5.2: Implement Claude Config File I/O (6h)
**Problem**: Handlers exist but don't actually read/write files

**Create:**
```
internal/service/claude_config_service.go
internal/adapter/kubernetes/claude_config_repository.go
internal/port/claude_config_repository.go
```

**Operations to implement:**
```go
type ClaudeConfigRepository interface {
	// List items in .claude/{type}/ directory
	List(ctx context.Context, podName, itemType string) ([]ConfigItem, error)

	// Get single item content
	Get(ctx context.Context, podName, itemType, name string) (*ConfigItem, error)

	// Create new item (write file)
	Create(ctx context.Context, podName, itemType string, item *ConfigItem) error

	// Update existing item
	Update(ctx context.Context, podName, itemType, name string, content string) error

	// Delete item (remove file)
	Delete(ctx context.Context, podName, itemType, name string) error
}
```

**Implementation via kubectl:**
```bash
# List:   kubectl exec pod -- ls /workspace/.claude/commands/
# Get:    kubectl exec pod -- cat /workspace/.claude/commands/deploy.md
# Create: kubectl exec -i pod -- sh -c 'cat > /workspace/.claude/commands/new.md'
#         (file content is piped on stdin, which requires -i)
# Update: same as Create; 'cat >' overwrites the existing file
# Delete: kubectl exec pod -- rm /workspace/.claude/commands/old.md
```
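
The kubectl invocations above interpolate an item type and a name into a path inside the pod, so the adapter probably wants to validate both before building the command line. The following is a minimal sketch of that idea, not the actual adapter: `configPath`, `getArgs`, the `safeName` pattern, and the assumption that every item is a single `.md` file are all hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// workspaceRoot mirrors the /workspace/.claude layout used in the kubectl examples.
const workspaceRoot = "/workspace/.claude"

// safeName is an assumed naming rule: it rejects slashes and dots so that a
// name can never escape the item directory.
var safeName = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$`)

// configPath returns the in-pod path of a config item,
// e.g. /workspace/.claude/commands/deploy.md.
func configPath(itemType, name string) (string, error) {
	switch itemType {
	case "commands", "skills", "agents":
	default:
		return "", fmt.Errorf("unknown item type %q", itemType)
	}
	if !safeName.MatchString(name) {
		return "", fmt.Errorf("invalid item name %q", name)
	}
	return path.Join(workspaceRoot, itemType, name+".md"), nil
}

// getArgs builds the argument list for `kubectl exec POD -- cat PATH`.
func getArgs(podName, itemType, name string) ([]string, error) {
	p, err := configPath(itemType, name)
	if err != nil {
		return nil, err
	}
	return []string{"exec", podName, "--", "cat", p}, nil
}

func main() {
	args, err := getArgs("claudebox-pantheon-0", "commands", "deploy")
	fmt.Println(args, err)

	_, err = getArgs("claudebox-pantheon-0", "commands", "../../etc/passwd")
	fmt.Println(err)
}
```

Passing the arguments as a slice (rather than a single shell string) keeps the item name out of any shell parsing for the read path; the write path still goes through `sh -c`, which is one more reason to validate names first.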

**Acceptance:**
- Can list/create/read/update/delete commands, skills, agents via API
- E2E test proves round-trip works

### Task 5.3: Integrate Validation Package (3h)
**Replace inline checks with validate package:**

**Before:**
```go
if req.Name == "" {
	api.WriteBadRequest(w, r, "name is required")
	return
}
```

**After:**
```go
v := validate.New()
v.Required(req.Name, "name")
v.Name(req.Name, "name") // alphanumeric, 1-64 chars
if err := v.Error(); err != nil {
	api.WriteBadRequest(w, r, err.Error())
	return
}
```
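
For readers unfamiliar with the accumulate-then-report style used in the "After" snippet, here is a small self-contained sketch of how such a validator could work. It is illustrative only: the real `validate` package may differ, and the `Validator` type, the error wording, and the `namePattern` regular expression are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// namePattern encodes the "alphanumeric, 1-64 chars" rule from the comment above.
var namePattern = regexp.MustCompile(`^[A-Za-z0-9]{1,64}$`)

// Validator collects every failed check so the caller can report them together.
type Validator struct {
	problems []string
}

// New returns an empty validator.
func New() *Validator { return &Validator{} }

// Required records a problem when value is empty or whitespace only.
func (v *Validator) Required(value, field string) {
	if strings.TrimSpace(value) == "" {
		v.problems = append(v.problems, field+" is required")
	}
}

// Name records a problem when a non-empty value breaks the naming rule.
// Empty values are left to Required so the same field is not reported twice.
func (v *Validator) Name(value, field string) {
	if value != "" && !namePattern.MatchString(value) {
		v.problems = append(v.problems, field+" must be alphanumeric, 1-64 chars")
	}
}

// Error returns nil when every check passed, otherwise one combined error.
func (v *Validator) Error() error {
	if len(v.problems) == 0 {
		return nil
	}
	return errors.New(strings.Join(v.problems, "; "))
}

func main() {
	v := New()
	v.Required("", "name")
	v.Name("bad name!", "name")
	fmt.Println(v.Error())
}
```

Collecting all problems before responding is what makes error messages consistent across endpoints: each handler declares its rules, and the message format lives in one place.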

**Files to update:**
- `internal/handlers/keys.go`
- `internal/handlers/projects.go`
- `internal/handlers/claude_config.go`
- `internal/service/project_service.go`

**Acceptance:**
- All inline validation replaced with validate package
- Consistent error messages across all endpoints
- All handler tests pass

### Task 5.4: Consolidate Docker Images (1h)
**Current state:** 4 Dockerfiles with unclear purpose

**Action:**
- Keep `Dockerfile` as single canonical image
- Delete `Dockerfile.api`, `Dockerfile.api.prebuild`, `Dockerfile.api.simple`
- Update any CI/scripts referencing old files

**Acceptance:**
- Single `Dockerfile` builds and runs correctly
- No references to deleted Dockerfiles

---

## Week 6: Dynamic Project Discovery

**Goal**: Remove hardcoded projects, discover from K8s

### Task 6.1: Define Project Labels (1h)
**K8s label convention:**
```yaml
metadata:
  labels:
    rdev.orchard9.ai/project: "true"
    rdev.orchard9.ai/name: "pantheon"
  annotations:
    # Label values cannot contain "/", so the workspace path is an annotation.
    rdev.orchard9.ai/workspace: "/workspace"
    rdev.orchard9.ai/description: "Go API backend"
```

**Update existing pods:**
- claudebox-pantheon-0
- claudebox-aeries-0

### Task 6.2: Implement Label Discovery (4h)
**Update `internal/adapter/kubernetes/project_repository.go`:**

```go
func (r *ProjectRepository) RefreshStatus(ctx context.Context) error {
	// List pods with label rdev.orchard9.ai/project=true
	pods, err := r.client.CoreV1().Pods(r.namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "rdev.orchard9.ai/project=true",
	})
	if err != nil {
		return fmt.Errorf("list project pods: %w", err)
	}

	// For each pod, extract project info from labels and annotations.
	// The workspace path is read from an annotation because Kubernetes
	// label values cannot contain "/".
	for _, pod := range pods.Items {
		project := domain.Project{
			ID:          domain.ProjectID(pod.Labels["rdev.orchard9.ai/name"]),
			Name:        pod.Labels["rdev.orchard9.ai/name"],
			Description: pod.Annotations["rdev.orchard9.ai/description"],
			PodName:     pod.Name,
			Workspace:   pod.Annotations["rdev.orchard9.ai/workspace"],
			Status:      mapPodPhase(pod.Status.Phase),
		}
		r.register(project)
	}
	return nil
}
```

**Acceptance:**
- Projects auto-discovered from labeled pods
- No hardcoded project list
- New pods automatically appear

### Task 6.3: Add Project ConfigMap Support (3h)
**For complex project configuration:**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdev-projects
data:
  pantheon.yaml: |
    name: pantheon
    description: Go API backend
    pod_selector: claudebox-pantheon-0
    workspace: /workspace
    allowed_commands:
      - claude
      - shell
      - git
    max_concurrent_commands: 5
```

**Implementation:**
- Read ConfigMap on startup
- Merge with label-discovered projects
- ConfigMap takes precedence for settings
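
The merge rule in the list above ("ConfigMap takes precedence for settings") can be sketched as a field-by-field overlay. This is only one plausible reading of the rule: the `Project` struct here is a trimmed, hypothetical stand-in for `domain.Project`, and `mergeProjects` is an invented helper name.

```go
package main

import "fmt"

// Project is a simplified, hypothetical stand-in for domain.Project that holds
// only the fields appearing in the ConfigMap example.
type Project struct {
	Name                  string
	Description           string
	PodName               string
	Workspace             string
	AllowedCommands       []string
	MaxConcurrentCommands int
}

// mergeProjects overlays ConfigMap entries on label-discovered projects, keyed
// by project name. A field set in the ConfigMap wins; a field left at its zero
// value keeps whatever discovery found.
func mergeProjects(discovered, configured map[string]Project) map[string]Project {
	merged := make(map[string]Project, len(discovered)+len(configured))
	for name, p := range discovered {
		merged[name] = p
	}
	for name, cfg := range configured {
		p := merged[name] // zero value when the project exists only in the ConfigMap
		p.Name = name
		if cfg.Description != "" {
			p.Description = cfg.Description
		}
		if cfg.PodName != "" {
			p.PodName = cfg.PodName
		}
		if cfg.Workspace != "" {
			p.Workspace = cfg.Workspace
		}
		if len(cfg.AllowedCommands) > 0 {
			p.AllowedCommands = cfg.AllowedCommands
		}
		if cfg.MaxConcurrentCommands > 0 {
			p.MaxConcurrentCommands = cfg.MaxConcurrentCommands
		}
		merged[name] = p
	}
	return merged
}

func main() {
	discovered := map[string]Project{
		"pantheon": {Name: "pantheon", Description: "from annotation", PodName: "claudebox-pantheon-0"},
	}
	configured := map[string]Project{
		"pantheon": {Description: "Go API backend", MaxConcurrentCommands: 5},
	}
	fmt.Printf("%+v\n", mergeProjects(discovered, configured)["pantheon"])
}
```

One open design question this sketch leaves unanswered is whether a project that appears only in the ConfigMap (with no labeled pod) should be listed at all; here it is kept, with an empty pod name unless the ConfigMap provides one.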

### Task 6.4: Pod Watch for Real-Time Updates (4h)
**Instead of polling, watch for changes:**

```go
func (r *ProjectRepository) StartWatching(ctx context.Context) error {
	watcher, err := r.client.CoreV1().Pods(r.namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: "rdev.orchard9.ai/project=true",
	})
	if err != nil {
		return fmt.Errorf("watch project pods: %w", err)
	}

	go func() {
		defer watcher.Stop()
		// The channel closes when the watch ends (for example on context
		// cancellation or a server-side timeout); a production version
		// would re-establish the watch at that point.
		for event := range watcher.ResultChan() {
			switch event.Type {
			case watch.Added:
				r.register(podToProject(event.Object))
			case watch.Deleted:
				r.unregister(podToProjectID(event.Object))
			case watch.Modified:
				r.update(podToProject(event.Object))
			}
		}
	}()
	return nil
}
```

**Acceptance:**
- Projects appear within 1s of pod creation
- Projects disappear within 1s of pod deletion
- No polling required

---

## Week 7: Security & Test Completion

**Goal**: IP allowlisting, comprehensive adapter tests

### Task 7.1: IP Allowlisting (4h)
**Schema update:**
```sql
ALTER TABLE api_keys ADD COLUMN allowed_ips CIDR[];
```

**Domain update:**
```go
type APIKey struct {
	// ... existing fields
	AllowedIPs []net.IPNet `json:"allowed_ips,omitempty"`
}
```

**Middleware update:**
```go
func (m *AuthMiddleware) checkIPAllowed(key *domain.APIKey, clientIP string) bool {
	if len(key.AllowedIPs) == 0 {
		return true // No restriction
	}
	ip := net.ParseIP(clientIP)
	if ip == nil {
		return false // Unparseable address: fail closed
	}
	for _, allowed := range key.AllowedIPs {
		if allowed.Contains(ip) {
			return true
		}
	}
	return false
}
```

**Acceptance:**
- Keys can have IP restrictions
- Requests from non-allowed IPs get 403
- Admin can create unrestricted keys
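
Two details around the middleware check are easy to get wrong: turning stored CIDR strings into networks, and stripping the port from the address the HTTP server reports. The sketch below shows one way to handle both with the standard library; `parseAllowlist`, `clientIP`, and `allowed` are hypothetical helper names, not part of the existing code.

```go
package main

import (
	"fmt"
	"net"
)

// parseAllowlist converts CIDR strings (as they might be read from the
// allowed_ips column) into networks, rejecting malformed entries.
func parseAllowlist(cidrs []string) ([]*net.IPNet, error) {
	nets := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", c, err)
		}
		nets = append(nets, n)
	}
	return nets, nil
}

// clientIP extracts the IP from a "host:port" string such as
// http.Request.RemoteAddr. It returns nil when no valid IP is present.
func clientIP(remoteAddr string) net.IP {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // no port present
	}
	return net.ParseIP(host)
}

// allowed reports whether remoteAddr falls inside any network. An empty
// allowlist means "no restriction"; an unparseable address is rejected.
func allowed(nets []*net.IPNet, remoteAddr string) bool {
	if len(nets) == 0 {
		return true
	}
	ip := clientIP(remoteAddr)
	if ip == nil {
		return false
	}
	for _, n := range nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	nets, err := parseAllowlist([]string{"10.0.0.0/8", "192.168.1.10/32"})
	if err != nil {
		panic(err)
	}
	fmt.Println(allowed(nets, "10.1.2.3:54321"), allowed(nets, "192.168.1.11:443"))
}
```

When the API sits behind an ingress or load balancer, `RemoteAddr` is the proxy's address, so the middleware would need a trusted forwarded-address header instead; which header to trust depends on the deployment and is not covered here.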

### Task 7.2: Adapter Integration Tests (6h)
**Create test infrastructure:**

```
tests/
├── integration/
│   ├── postgres_test.go      # Real postgres via docker
│   ├── kubernetes_test.go    # Mock kubectl
│   └── testdata/
│       └── docker-compose.yml
```

**Postgres adapter tests:**
- CRUD operations for API keys
- Scope/project array handling
- Connection pool behavior
- Migration idempotency

**Kubernetes adapter tests:**
- Mock kubectl responses
- Command execution with output
- Error handling (pod not found, timeout)
- Claude config file operations

**Memory adapter tests:**
- Stream publisher pub/sub
- Event replay buffer
- Concurrent subscriber handling

**Acceptance:**
- All adapters have >80% coverage
- Tests run in CI without real K8s
- Docker-compose for postgres tests
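
Running the Kubernetes adapter tests without a real cluster usually comes down to hiding the kubectl call behind a small interface and substituting a fake in tests. The following is a rough sketch of that seam under assumed names (`Runner`, `fakeRunner`, `listCommands`, `ErrPodNotFound`); the real adapter's types are likely different.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Runner is a hypothetical seam: production code would shell out to kubectl,
// while tests substitute a fake with canned responses.
type Runner interface {
	Run(args ...string) (string, error)
}

// ErrPodNotFound stands in for the adapter's "pod not found" error.
var ErrPodNotFound = errors.New("pod not found")

// fakeRunner returns canned output keyed by the joined argument list and
// records every call so tests can assert on what was executed.
type fakeRunner struct {
	outputs map[string]string
	calls   [][]string
}

func (f *fakeRunner) Run(args ...string) (string, error) {
	f.calls = append(f.calls, args)
	out, ok := f.outputs[strings.Join(args, " ")]
	if !ok {
		return "", ErrPodNotFound
	}
	return out, nil
}

// listCommands is the code under test: it lists command files in a pod.
func listCommands(r Runner, pod string) ([]string, error) {
	out, err := r.Run("exec", pod, "--", "ls", "/workspace/.claude/commands/")
	if err != nil {
		return nil, fmt.Errorf("list commands in %s: %w", pod, err)
	}
	return strings.Fields(out), nil
}

// isNotFound reports whether err wraps ErrPodNotFound.
func isNotFound(err error) bool { return errors.Is(err, ErrPodNotFound) }

func main() {
	fake := &fakeRunner{outputs: map[string]string{
		"exec claudebox-pantheon-0 -- ls /workspace/.claude/commands/": "deploy.md\ntest.md\n",
	}}
	fmt.Println(listCommands(fake, "claudebox-pantheon-0"))

	_, err := listCommands(fake, "missing-pod")
	fmt.Println(isNotFound(err))
}
```

The same fake can also simulate timeouts by returning a different sentinel error, which covers the "pod not found, timeout" cases in the list above without any cluster access.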

### Task 7.3: Service Layer Tests (4h)
**Create:**
```
internal/service/project_service_test.go
internal/service/apikey_service_test.go
internal/service/claude_config_service_test.go
```

**Test patterns:**
- Happy path for all operations
- Error propagation from adapters
- Business rule enforcement
- Metrics recording

### Task 7.4: Improve E2E Test Coverage (4h)
**Expand `tests/e2e/e2e_test.go`:**

```go
func TestE2E_FullCommandLifecycle(t *testing.T) {
	// 1. Create API key
	// 2. Execute claude command
	// 3. Stream output via SSE
	// 4. Verify completion event
	// 5. Check metrics incremented
}

func TestE2E_RateLimiting(t *testing.T) {
	// Send 101 requests rapidly
	// Verify 429 on 101st request
	// Wait for bucket refill
	// Verify request succeeds
}

func TestE2E_SSEReconnection(t *testing.T) {
	// Start command
	// Connect to stream
	// Disconnect
	// Reconnect with Last-Event-ID
	// Verify replay
}

func TestE2E_ConcurrentCommands(t *testing.T) {
	// Start 5 commands
	// Verify 6th blocked
	// Complete one
	// Verify 6th now succeeds
}
```
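
The SSE reconnection test depends on the server being able to replay events that a client missed. As background for what "Verify replay" would assert, here is a minimal sketch of a bounded replay buffer keyed by event ID. It is an illustration, not the existing stream publisher: `Event`, `ReplayBuffer`, and the sequential numeric IDs are assumptions, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Event is a simplified stream event with a monotonically increasing ID.
type Event struct {
	ID   uint64
	Data string
}

// ReplayBuffer keeps the most recent events so a reconnecting client can
// catch up from the ID it sent in Last-Event-ID.
type ReplayBuffer struct {
	capacity int
	events   []Event
	nextID   uint64
}

// NewReplayBuffer returns a buffer that retains at most capacity events.
func NewReplayBuffer(capacity int) *ReplayBuffer {
	return &ReplayBuffer{capacity: capacity, nextID: 1}
}

// Append stores a new event, evicting the oldest one when the buffer is full.
func (b *ReplayBuffer) Append(data string) Event {
	ev := Event{ID: b.nextID, Data: data}
	b.nextID++
	b.events = append(b.events, ev)
	if len(b.events) > b.capacity {
		b.events = b.events[len(b.events)-b.capacity:]
	}
	return ev
}

// Since returns the buffered events whose ID is greater than lastEventID,
// in order. Events already evicted cannot be replayed.
func (b *ReplayBuffer) Since(lastEventID uint64) []Event {
	var out []Event
	for _, ev := range b.events {
		if ev.ID > lastEventID {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	buf := NewReplayBuffer(3)
	for _, line := range []string{"a", "b", "c", "d"} {
		buf.Append(line)
	}
	// A client that last saw event 2 reconnects and receives what it missed.
	for _, ev := range buf.Since(2) {
		fmt.Println(ev.ID, ev.Data)
	}
}
```

An E2E test would also want a case where the client's Last-Event-ID is older than the oldest buffered event, since the replay is then necessarily incomplete and the API should make that visible somehow.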

---

## Week 8: Production Hardening

**Goal**: Production-ready K8s manifests, reliability features

### Task 8.1: K8s Manifest Hardening (4h)
**Update `deployments/k8s/base/`:**

```yaml
# deployment.yaml
spec:
  template:
    spec:
      containers:
        - name: rdev-api
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          securityContext:
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
```

```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rdev-api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rdev-api
```

```yaml
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rdev-api-policy
spec:
  podSelector:
    matchLabels:
      app: rdev-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress
      ports:
        - port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: databases
      ports:
        - port: 5432
    - to:
        - podSelector:
            matchLabels:
              rdev.orchard9.ai/project: "true"
```

### Task 8.2: RBAC Configuration (2h)
```yaml
# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdev-api
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rdev-api-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rdev-api-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rdev-api-role
subjects:
  - kind: ServiceAccount
    name: rdev-api
```

### Task 8.3: Graceful Shutdown (3h)
```go
// cmd/rdev-api/main.go
func main() {
	// ... setup ...

	srv := &http.Server{
		Addr:    cfg.Addr,
		Handler: router,
	}

	// Start server
	go func() {
		if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
			log.Fatal(err)
		}
	}()

	// Wait for interrupt
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit

	// Graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Disable keep-alives so idle connections are not reused during shutdown
	srv.SetKeepAlivesEnabled(false)

	// Stop accepting new connections and wait for active requests to finish
	if err := srv.Shutdown(ctx); err != nil {
		log.Error("forced shutdown", "error", err)
	}

	// Close database connections
	db.Close()

	log.Info("server stopped gracefully")
}
```

### Task 8.4: Circuit Breaker for K8s (3h)
**Protect against K8s API failures:**

```go
type CircuitBreaker struct {
	failures    int
	threshold   int
	resetAfter  time.Duration
	lastFailure time.Time
	state       State // Closed, Open, HalfOpen
	mu          sync.Mutex
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	if cb.state == Open {
		if time.Since(cb.lastFailure) < cb.resetAfter {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
		// Cool-down elapsed: let calls through as a trial
		cb.state = HalfOpen
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		// A failed trial reopens immediately; otherwise open at the threshold
		if cb.state == HalfOpen || cb.failures >= cb.threshold {
			cb.state = Open
		}
	} else {
		cb.failures = 0
		cb.state = Closed
	}
	return err
}
```

### Task 8.5: Health Check Enhancements (2h)
```go
// /health - Basic liveness
func (h *HealthHandler) Health(w http.ResponseWriter, r *http.Request) {
	api.WriteSuccess(w, r, map[string]string{"status": "ok"})
}

// /ready - Full readiness
func (h *HealthHandler) Ready(w http.ResponseWriter, r *http.Request) {
	checks := make(map[string]string)

	// Database connectivity
	if err := h.db.PingContext(r.Context()); err != nil {
		checks["database"] = "unhealthy: " + err.Error()
	} else {
		checks["database"] = "healthy"
	}

	// K8s connectivity
	if err := h.k8sClient.Ping(r.Context()); err != nil {
		checks["kubernetes"] = "unhealthy: " + err.Error()
	} else {
		checks["kubernetes"] = "healthy"
	}

	// Check for any unhealthy
	for _, status := range checks {
		if strings.HasPrefix(status, "unhealthy") {
			api.WriteError(w, r, http.StatusServiceUnavailable,
				"NOT_READY", "service not ready", checks)
			return
		}
	}

	api.WriteSuccess(w, r, map[string]any{
		"status": "ready",
		"checks": checks,
	})
}
```

---

## Week 9: Performance & Observability

**Goal**: OpenTelemetry, performance optimization

### Task 9.1: OpenTelemetry Integration (6h)
**Add tracing:**

```go
// cmd/rdev-api/main.go
func initTracing() (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracehttp.New(context.Background(),
		otlptracehttp.WithEndpoint(os.Getenv("OTEL_EXPORTER_ENDPOINT")),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("rdev-api"),
			semconv.ServiceVersion(Version),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```

**Instrument handlers:**
```go
func (h *ProjectsHandler) RunClaude(w http.ResponseWriter, r *http.Request) {
	ctx, span := tracer.Start(r.Context(), "RunClaude")
	defer span.End()

	span.SetAttributes(
		attribute.String("project.id", projectID),
		attribute.String("command.type", "claude"),
	)

	// ... handler logic ...

	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	}
}
```

### Task 9.2: Connection Pool Tuning (2h)
**Database:**
```go
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(10)
db.SetConnMaxLifetime(5 * time.Minute)
db.SetConnMaxIdleTime(1 * time.Minute)
```

**HTTP client for K8s:**
```go
transport := &http.Transport{
	MaxIdleConns:        100,
	MaxIdleConnsPerHost: 10,
	IdleConnTimeout:     90 * time.Second,
}
```

### Task 9.3: Response Caching (3h)
**Cache project list (changes infrequently):**

```go
type CachedProjectRepository struct {
	inner     port.ProjectRepository
	cache     *sync.Map
	ttl       time.Duration
	lastFetch time.Time
	mu        sync.RWMutex
}

func (r *CachedProjectRepository) List(ctx context.Context) ([]domain.Project, error) {
	r.mu.RLock()
	if time.Since(r.lastFetch) < r.ttl {
		if cached, ok := r.cache.Load("projects"); ok {
			r.mu.RUnlock()
			return cached.([]domain.Project), nil
		}
	}
	r.mu.RUnlock()

	r.mu.Lock()
	defer r.mu.Unlock()

	// Double-check after acquiring write lock
	if time.Since(r.lastFetch) < r.ttl {
		if cached, ok := r.cache.Load("projects"); ok {
			return cached.([]domain.Project), nil
		}
	}

	projects, err := r.inner.List(ctx)
	if err != nil {
		return nil, err
	}

	r.cache.Store("projects", projects)
	r.lastFetch = time.Now()
	return projects, nil
}
```

### Task 9.4: Benchmark Suite (3h)
```go
// internal/handlers/projects_bench_test.go

func BenchmarkRunClaude(b *testing.B) {
	// Setup
	handler := setupTestHandler()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		req := httptest.NewRequest("POST", "/projects/test/claude",
			strings.NewReader(`{"prompt":"test"}`))
		rec := httptest.NewRecorder()
		handler.RunClaude(rec, req)
	}
}

func BenchmarkSSEStreaming(b *testing.B) {
	// Measure event throughput
}

func BenchmarkAuthMiddleware(b *testing.B) {
	// Measure auth overhead
}
```

---

## Week 10: Documentation & Polish

**Goal**: Comprehensive docs, final quality pass

### Task 10.1: Architecture Documentation (4h)
**Create `docs/architecture/`:**

```
docs/architecture/
├── README.md           # Overview + diagrams
├── hexagonal.md        # Port/adapter pattern
├── security.md         # Auth, sanitization, rate limiting
├── streaming.md        # SSE protocol, reconnection
└── diagrams/
    ├── system-context.mmd
    ├── component.mmd
    └── sequence-command.mmd
```

**Include:**
- System context diagram
- Component diagram
- Sequence diagrams for key flows
- ADRs (Architecture Decision Records)

### Task 10.2: API Documentation (3h)
**Enhance OpenAPI spec:**
- Add examples for all endpoints
- Document error codes
- Add authentication examples
- Include rate limit headers

**Create `docs/api/`:**
- Quick start guide
- Authentication guide
- SSE client examples (JS, Python, Go)
- Error handling guide

### Task 10.3: Operations Documentation (3h)
**Create `docs/operations/`:**

```
docs/operations/
├── deployment.md        # K8s deployment guide
├── monitoring.md        # Prometheus/Grafana setup
├── troubleshooting.md   # Common issues
├── runbooks/
│   ├── high-cpu.md
│   ├── high-memory.md
│   ├── pod-not-found.md
│   └── auth-failures.md
└── disaster-recovery.md
```

### Task 10.4: Final Quality Gate (4h)
**Run comprehensive checks:**

```bash
# Static analysis
golangci-lint run ./...

# Security scan
gosec ./...

# Test coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html

# Benchmark baseline
go test -bench=. -benchmem ./... > benchmark.txt

# Dependency audit
go list -m all | nancy sleuth

# Build all targets
go build ./...
GOOS=linux GOARCH=amd64 go build ./...

# Docker build
docker build -t rdev-api:latest .
```

**Coverage targets:**
| Package | Target |
|---------|--------|
| internal/auth | >90% |
| internal/handlers | >85% |
| internal/service | >90% |
| internal/adapter/* | >80% |
| internal/domain | >95% |
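
The per-package targets in the table could be enforced automatically rather than checked by eye. The sketch below parses the per-package summary lines that `go test -cover ./...` prints and reports packages at or below their target. It is a hypothetical helper, not an existing script: `checkCoverage`, the `targets` slice, and the `rdev/...` module path used in `main` are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// coverLine matches per-package summary lines from `go test -cover`, e.g.
// "ok  rdev/internal/auth  0.012s  coverage: 91.3% of statements".
var coverLine = regexp.MustCompile(`^ok\s+(\S+)\s+.*coverage: ([0-9.]+)% of statements`)

// targets mirrors the coverage table; a package must be strictly above min.
var targets = []struct {
	match string
	min   float64
}{
	{"internal/auth", 90},
	{"internal/handlers", 85},
	{"internal/service", 90},
	{"internal/adapter/", 80},
	{"internal/domain", 95},
}

// checkCoverage returns one message per package that misses its target.
func checkCoverage(output string) []string {
	var failures []string
	for _, line := range strings.Split(output, "\n") {
		m := coverLine.FindStringSubmatch(line)
		if m == nil {
			continue // not a coverage summary line
		}
		pct, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue
		}
		for _, t := range targets {
			if strings.Contains(m[1], t.match) && pct <= t.min {
				failures = append(failures,
					fmt.Sprintf("%s: %.1f%% (target >%.0f%%)", m[1], pct, t.min))
			}
		}
	}
	return failures
}

func main() {
	out := "ok  \trdev/internal/auth\t0.012s\tcoverage: 91.3% of statements\n" +
		"ok  \trdev/internal/handlers\t0.020s\tcoverage: 80.0% of statements\n"
	for _, f := range checkCoverage(out) {
		fmt.Println(f)
	}
}
```

A CI step could pipe the test output into such a tool and fail the build when the returned list is non-empty.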

### Task 10.5: Release Preparation (2h)
**Create release checklist:**

```markdown
## v1.0.0 Release Checklist

### Pre-release
- [ ] All tests pass
- [ ] Coverage targets met
- [ ] Security scan clean
- [ ] Benchmarks acceptable
- [ ] Documentation complete
- [ ] CHANGELOG.md updated
- [ ] Version bumped

### Release
- [ ] Tag created
- [ ] Docker image built and pushed
- [ ] K8s manifests updated
- [ ] Release notes published

### Post-release
- [ ] Smoke test in staging
- [ ] Monitor error rates
- [ ] Monitor latency
- [ ] Announce to users
```

---

## Summary: Week-by-Week

| Week | Focus | Key Deliverables |
|------|-------|------------------|
| **5** | Legacy Removal & Core Fixes | Clean codebase, working Claude config, integrated validation |
| **6** | Dynamic Project Discovery | Label-based discovery, ConfigMap support, pod watching |
| **7** | Security & Tests | IP allowlisting, adapter tests, service tests, E2E |
| **8** | Production Hardening | K8s manifests, RBAC, graceful shutdown, circuit breaker |
| **9** | Performance & Observability | OpenTelemetry, connection tuning, caching, benchmarks |
| **10** | Documentation & Polish | Architecture docs, API docs, ops docs, final QA |

---

## Success Criteria: Pristine Project

### Code Quality
- [ ] No legacy code remaining
- [ ] 100% of handlers use service layer
- [ ] All validation via validate package
- [ ] Consistent error handling throughout
- [ ] No TODO/FIXME without ticket

### Test Coverage
- [ ] >85% overall coverage
- [ ] All adapters have integration tests
- [ ] E2E tests cover all user journeys
- [ ] Benchmark suite for performance regression

### Security
- [ ] Command sanitization (shell injection)
- [ ] IP allowlisting support
- [ ] Rate limiting enforced
- [ ] Secrets never logged
- [ ] RBAC configured

### Production Ready
- [ ] Resource limits set
- [ ] Health/readiness probes
- [ ] Graceful shutdown
- [ ] Network policies
- [ ] PodDisruptionBudget
- [ ] Monitoring dashboards

### Documentation
- [ ] Architecture documented
- [ ] API fully documented with examples
- [ ] Operations runbooks
- [ ] Troubleshooting guide
- [ ] Deployment guide

### Observability
- [ ] Prometheus metrics
- [ ] OpenTelemetry tracing
- [ ] Structured logging
- [ ] Error tracking

---

## Estimated Effort

| Week | Hours |
|------|-------|
| 5 | 14h |
| 6 | 12h |
| 7 | 18h |
| 8 | 14h |
| 9 | 14h |
| 10 | 16h |
| **Total** | **88h** |

At ~15h/week pace: **6 weeks** to pristine.
At ~30h/week pace: **3 weeks** to pristine.
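
The pace figures follow from the table by summing the weekly hours and rounding up to whole calendar weeks. A small sketch of that arithmetic, with the helper names `totalHours` and `calendarWeeks` invented for illustration:

```go
package main

import "fmt"

// plannedHours holds the per-week estimates from the effort table.
var plannedHours = map[int]int{5: 14, 6: 12, 7: 18, 8: 14, 9: 14, 10: 16}

// totalHours sums the estimates across all planned weeks.
func totalHours() int {
	total := 0
	for _, h := range plannedHours {
		total += h
	}
	return total
}

// calendarWeeks rounds total/pace up to whole weeks.
func calendarWeeks(total, pace int) int {
	return (total + pace - 1) / pace
}

func main() {
	total := totalHours()
	fmt.Println(total, calendarWeeks(total, 15), calendarWeeks(total, 30))
}
```

With 88 hours in total, 88/15 is just under 6 and 88/30 is just under 3, so both paces leave almost no slack inside the rounded week counts.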