jordan 3b35900a2d feat: enterprise worker pool with HTTP sidecar pattern

Implements horizontally-scalable worker pool architecture:
- claudebox-sidecar: HTTP server for Claude Code, git, and SDLC ops
- rdev-worker: standalone worker binary polling rdev-api for tasks
- HTTP client adapter for sidecar communication
- HPA with custom Prometheus metrics for autoscaling
- ServiceMonitor for metrics scraping

Code review fixes applied:
- URL-encode query parameters in GitStatus (Critical #1)
- Remove unused shellQuote function (Critical #2)
- Use stdlib strings.Split/TrimSpace (Critical #3)
- Add version injection via ldflags (Warning #4)
- Add debug logging for swallowed git/sdlc errors (Warning #5, #6)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 16:21:11 -07:00


Orchard Studio: Gap Analysis

This document maps the delta between current rdev capabilities and what Orchard Studio requires.

Current Foundation (What We Have)

| Capability | Status | Location |
|------------|--------|----------|
| SDLC Classifier | ✅ Complete | `internal/sdlc/classifier.go` |
| Feature State Machine | ✅ Complete | `internal/sdlc/` (10 phases, 31 rules) |
| Composable Templates | ✅ Complete | `internal/adapter/templates/` |
| Worker Pod Execution | ✅ Complete | `internal/worker/sdlc_executor.go` |
| Webhook Dispatcher | ✅ Complete | `internal/webhook/dispatcher.go` |
| Project Provisioning | ✅ Complete | K8s namespace, DNS, git repo |
| Database Provisioning | ✅ Complete | CockroachDB adapter |
| Tree Workflows | ✅ Proven | `cookbooks/trees/*.yaml` |

Gap 0: Design Reference Capture & Processing

Current: No mechanism for users to provide visual inspiration. Features are described purely in text.

Required: Users can provide URLs or screenshots as design references, which inform the Architect's questions and the Blueprint's design system section.

What's Missing

┌─────────────────────────────────────────────────────────────────────────┐
│  CURRENT FLOW                                                            │
│                                                                          │
│  User: "Build a pricing page"                                            │
│  Architect: *asks about data model, endpoints...*                        │
│  (No visual context, design decisions are guesswork)                     │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│  REQUIRED FLOW                                                           │
│                                                                          │
│  User: "Build a pricing page like this" + [URL or screenshot]            │
│  System: Captures screenshot, stores with Blueprint                      │
│  Architect: "I see a dark theme with 3 tiers..." → asks clarifying Qs   │
│  Blueprint: Populates designSystem section with extracted tokens         │
└─────────────────────────────────────────────────────────────────────────┘

Two Input Types

| Input | Capture Method | Storage |
|-------|----------------|---------|
| URL | Playwright screenshots the page automatically | `/references/{blueprintId}/{refId}.png` |
| Screenshot | User uploads image (drag/drop, paste, file picker) | Same storage path |

Implementation Required

Reference Capture Service:
- For URLs: Reuse verify_executor.go pattern (Playwright pod)
- For uploads: Standard file upload handling
- Store thumbnails alongside Blueprint
Chat Endpoint Enhancement:
- Accept references[] array in request body
- Process references before LLM call
- Include reference images in Architect prompt context
Architect Prompt Updates:
- Describe what it observes in natural language
- Ask clarifying questions about design intent
- Extract structured design tokens into Blueprint
Blueprint Schema:
- Add references.items[] array
- Add sections.designSystem section
- Track which references informed which design decisions
Plan Pane Rendering:
- Show reference thumbnails in UI
- Display extracted design tokens
- Allow user to add annotations
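
The request shape and storage layout above can be sketched as follows. This is a minimal illustration, not the rdev implementation: `ChatRequest`, `Reference`, `validate`, and `storagePath` are hypothetical names, and only the `/references/{blueprintId}/{refId}.png` layout comes from the table above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Reference is a hypothetical shape for one design reference in the
// chat request; exactly one of URL or UploadID is expected to be set.
type Reference struct {
	ID       string `json:"id"`
	URL      string `json:"url,omitempty"`      // captured via Playwright
	UploadID string `json:"uploadId,omitempty"` // user-uploaded screenshot
}

// ChatRequest is an assumed request body for the chat endpoint.
type ChatRequest struct {
	Message    string      `json:"message"`
	References []Reference `json:"references"`
}

// validate rejects references that name both or neither input type.
func (r Reference) validate() error {
	if r.ID == "" {
		return errors.New("reference id is required")
	}
	if (r.URL == "") == (r.UploadID == "") {
		return errors.New("reference needs exactly one of url or uploadId")
	}
	return nil
}

// storagePath follows the layout from the input-types table:
// /references/{blueprintId}/{refId}.png
func storagePath(blueprintID, refID string) string {
	return fmt.Sprintf("/references/%s/%s.png", blueprintID, refID)
}

func main() {
	body := []byte(`{"message":"Build a pricing page like this",
		"references":[{"id":"ref1","url":"https://example.com/pricing"}]}`)

	var req ChatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		panic(err)
	}
	for _, ref := range req.References {
		if err := ref.validate(); err != nil {
			panic(err)
		}
		fmt.Println(storagePath("bp42", ref.ID))
	}
}
```

Validating references before the LLM call keeps malformed input out of the Architect prompt context.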

Complexity: Medium

- URL capture reuses existing Playwright infrastructure
- File upload is a standard pattern
- Main work is Architect prompt engineering for visual understanding
- LLM vision capabilities needed (Claude can see images natively)

Gap 1: Blueprint Storage & Chat API

Current: Features are created via POST /sdlc/features with a complete spec. No iterative refinement.

Required: Multi-turn conversation that builds a Blueprint incrementally.

What's Missing

┌─────────────────────────────────────────────────────────────────┐
│  CURRENT FLOW                                                   │
│                                                                 │
│  User writes spec → POST /sdlc/features → Feature created       │
│  (one shot, no iteration)                                       │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  REQUIRED FLOW                                                  │
│                                                                 │
│  User message → Architect responds + updates Blueprint →        │
│  User message → Architect responds + updates Blueprint →        │
│  ...repeat until ready...                                       │
│  User: "build it" → Blueprint → SDLC Feature → Build            │
└─────────────────────────────────────────────────────────────────┘

Implementation Required

Database Tables:
- blueprints - stores structured Blueprint JSON
- blueprint_messages - conversation history with snapshots
API Endpoints:
- POST /projects/{projectId}/blueprint/chat - send message, get reply + updated blueprint
- GET /projects/{projectId}/blueprints - list blueprints
- GET /projects/{projectId}/blueprints/{blueprintId} - get specific blueprint
- DELETE /projects/{projectId}/blueprints/{blueprintId} - discard draft
Service Layer:
- ArchitectService - manages conversation, calls LLM, updates Blueprint

Complexity: Medium

- Schema is defined (see app-vision.md)
- Standard CRUD + LLM integration
- Most work is in prompt engineering for Architect

Gap 2: Architect Agent Persona

Current: We have coding agents (/implement-feature). They write code, not specs.

Required: An agent that asks questions, fills in a structured Blueprint, knows when to stop.

What's Missing

┌─────────────────────────────────────────────────────────────────┐
│  CURRENT AGENTS                                                 │
│                                                                 │
│  User: "Add cat photos"                                         │
│  Agent: *immediately writes code*                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  ARCHITECT AGENT                                                │
│                                                                 │
│  User: "Add cat photos"                                         │
│  Architect: "Should photos be public or friends-only?"          │
│  User: "Public"                                                 │
│  Architect: "Got it. Do you want likes, comments, or neither?"  │
│  ...continues until Blueprint is complete...                    │
└─────────────────────────────────────────────────────────────────┘

Implementation Required

System Prompt:
- .claude/agents/architect.md - detailed persona
- Structured output format (reply + Blueprint JSON)
- Question strategy (when to ask vs assume)
Structured Output Parsing:
- LLM returns {reply: string, blueprint: Blueprint}
- Validate Blueprint against schema
- Handle partial updates (delta vs full replacement)
Completeness Logic:
- isReadyToBuild(blueprint) function
- Clear rules for when questions are resolved
- Override mechanism for user to force build
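
A minimal sketch of the parsing and completeness steps listed above. The `Blueprint` type here is a deliberately tiny stand-in (the real schema lives in app-vision.md), and the readiness rule, the `openQuestions` field, and the function signatures are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Blueprint is a hypothetical, reduced stand-in for the real schema.
type Blueprint struct {
	Feature       string   `json:"feature"`
	OpenQuestions []string `json:"openQuestions"`
}

// ArchitectOutput mirrors the {reply, blueprint} shape described above.
type ArchitectOutput struct {
	Reply     string     `json:"reply"`
	Blueprint *Blueprint `json:"blueprint"`
}

// parseArchitectOutput decodes the raw LLM text and rejects responses
// that are not valid JSON, carry unknown fields, or omit required parts.
func parseArchitectOutput(raw string) (ArchitectOutput, error) {
	var out ArchitectOutput
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&out); err != nil {
		return out, fmt.Errorf("malformed architect output: %w", err)
	}
	if out.Reply == "" || out.Blueprint == nil {
		return out, errors.New("architect output missing reply or blueprint")
	}
	return out, nil
}

// isReadyToBuild applies an assumed rule: a named feature with no open
// questions is ready, and the user can force the build regardless.
func isReadyToBuild(bp Blueprint, force bool) bool {
	if force {
		return true
	}
	return bp.Feature != "" && len(bp.OpenQuestions) == 0
}

func main() {
	raw := `{"reply":"Public or friends-only?","blueprint":{"feature":"cat-photos","openQuestions":["visibility"]}}`
	out, err := parseArchitectOutput(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(isReadyToBuild(*out.Blueprint, false)) // one question is still open
	fmt.Println(isReadyToBuild(*out.Blueprint, true))  // user override

	// Malformed output: keep the previous Blueprint and retry the LLM call.
	if _, err := parseArchitectOutput("Sure! Here is the plan..."); err != nil {
		fmt.Println("fallback:", err)
	}
}
```

On a parse failure the caller would keep the last valid Blueprint snapshot rather than overwrite it, which is one way to provide the fallback handling noted under Complexity.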

Complexity: Medium-High

- Prompt engineering is iterative
- Structured output from LLMs can be fragile
- Need fallback handling for malformed responses

Gap 3: Operation Tracking (Tree Runner in DB)

Current: Tree workflows run via shell script (tree-runner.sh). State in local JSON files.

Required: Operations tracked in database, queryable via API, streamable to UI.

What's Missing

┌─────────────────────────────────────────────────────────────────┐
│  CURRENT                                                        │
│                                                                 │
│  ./tree-runner.sh slackpath-1.yaml                              │
│  → Runs in terminal                                             │
│  → State in .checkpoints/slackpath-1.json                       │
│  → No API visibility                                            │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  REQUIRED                                                       │
│                                                                 │
│  POST /operations/start {tree: "slackpath-1"}                   │
│  → Returns operation_id                                         │
│  → State in operations table                                    │
│  → GET /operations/{id}/stream returns SSE events               │
└─────────────────────────────────────────────────────────────────┘

Implementation Required

Database Tables:
- operations - tracks running/completed operations
- operation_events - event log for replay/streaming
Service Layer:
- OrchestratorService - manages operation lifecycle
- Port tree-runner logic from bash to Go
- Event emission during execution
API Endpoints:
- POST /projects/{projectId}/operations - start operation
- GET /projects/{projectId}/operations/{operationId} - get status
- GET /projects/{projectId}/operations/{operationId}/stream - SSE stream
Worker Integration:
- SDLC executor emits events as it progresses
- Events written to operation_events table
- SSE handler reads from table and streams
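
Part of porting the tree runner to Go is resolving step dependencies into an execution order. The sketch below uses Kahn's algorithm for that. The `Step` type and its fields are assumptions, since the YAML format consumed by tree-runner.sh is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Step is a hypothetical node in a tree workflow.
type Step struct {
	Name      string
	DependsOn []string
}

// executionOrder returns step names in an order where every step comes
// after its dependencies (Kahn's algorithm). It fails on unknown
// dependencies and on cycles.
func executionOrder(steps []Step) ([]string, error) {
	indegree := make(map[string]int, len(steps))
	dependents := make(map[string][]string)
	for _, s := range steps {
		indegree[s.Name] = 0
	}
	for _, s := range steps {
		for _, dep := range s.DependsOn {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("step %q depends on unknown step %q", s.Name, dep)
			}
			indegree[s.Name]++
			dependents[dep] = append(dependents[dep], s.Name)
		}
	}

	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready) // deterministic order among independent steps

	var order []string
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		for _, next := range dependents[name] {
			indegree[next]--
			if indegree[next] == 0 {
				ready = append(ready, next)
			}
		}
	}
	if len(order) != len(steps) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := executionOrder([]Step{
		{Name: "deploy", DependsOn: []string{"test"}},
		{Name: "test", DependsOn: []string{"implement"}},
		{Name: "implement"},
	})
	fmt.Println(order, err)
}
```

An OrchestratorService could compute this order once at operation start and record one operation_events row as each step begins and ends.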

Complexity: High

- Tree runner logic is non-trivial (dependencies, outputs, error handling)
- SSE streaming requires careful connection management
- Need to handle operation cancellation and resumption

Gap 4: Real-Time Progress Streaming

Current: Webhooks fire on build complete. No per-step visibility.

Required: SSE stream showing "Designing schema... Writing handlers... Running tests..."

What's Missing

┌─────────────────────────────────────────────────────────────────┐
│  CURRENT                                                        │
│                                                                 │
│  Build starts → ... silence ... → Webhook: "build complete"    │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  REQUIRED                                                       │
│                                                                 │
│  Build starts →                                                 │
│    event: {"phase": "spec", "status": "complete"}               │
│    event: {"phase": "design", "status": "in_progress"}          │
│    event: {"phase": "design", "status": "complete"}             │
│    event: {"phase": "implement", "progress": 0.5}               │
│    ...                                                          │
│    event: {"status": "complete", "url": "..."}                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Required

SDLC Executor Changes:
- Emit events at phase transitions
- Emit progress within phases (task completion)
- Write events to operation_events table
SSE Handler:
- GET /operations/{id}/stream
- Long-lived connection
- Read events from DB (or Redis pub/sub)
- Handle client disconnection gracefully

Event Types:

type OperationEvent struct {
    Type      string    // "phase", "progress", "artifact", "error", "complete"
    Phase     string    // "spec", "design", "implement", "test", "deploy"
    Status    string    // "in_progress", "complete", "failed"
    Message   string    // Human-readable
    Progress  float64   // 0.0 to 1.0 for granular progress
    Timestamp time.Time
}
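
A minimal sketch of the SSE handler side, assuming events arrive on a Go channel (in the real system they would be read from operation_events or a pub/sub feed). The JSON tags are an addition so the wire format matches the lowercase keys in the required-flow diagram. `streamEvents` and `demo` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// OperationEvent repeats the struct above with JSON tags added.
type OperationEvent struct {
	Type      string    `json:"type"`
	Phase     string    `json:"phase,omitempty"`
	Status    string    `json:"status,omitempty"`
	Message   string    `json:"message,omitempty"`
	Progress  float64   `json:"progress,omitempty"`
	Timestamp time.Time `json:"timestamp"`
}

// streamEvents writes each event from the channel as one SSE frame and
// stops when the channel closes or the client disconnects.
func streamEvents(events <-chan OperationEvent) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		for {
			select {
			case <-r.Context().Done(): // client went away
				return
			case ev, open := <-events:
				if !open {
					return
				}
				payload, err := json.Marshal(ev)
				if err != nil {
					return
				}
				fmt.Fprintf(w, "event: %s\ndata: %s\n\n", ev.Type, payload)
				flusher.Flush()
			}
		}
	}
}

// demo runs the handler against an in-memory recorder and returns the
// raw stream body.
func demo() string {
	events := make(chan OperationEvent, 2)
	events <- OperationEvent{Type: "phase", Phase: "design", Status: "in_progress"}
	events <- OperationEvent{Type: "complete", Status: "complete"}
	close(events)

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/operations/op_1/stream", nil)
	streamEvents(events)(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Print(demo())
}
```

Selecting on the request context is what lets the handler exit promptly when a client disconnects instead of holding the connection open.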

Complexity: Medium

- SSE is straightforward in Go
- Main work is instrumenting SDLC executor
- Need to balance granularity vs noise

Gap 5: Blueprint → SDLC Feature Conversion

Current: SDLC features are created manually with spec documents.

Required: Automated conversion from structured Blueprint to SDLC feature spec.

What's Missing

┌─────────────────────────────────────────────────────────────────┐
│  CURRENT                                                        │
│                                                                 │
│  Human writes: spec.md with prose description                   │
│  → POST /sdlc/features                                          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  REQUIRED                                                       │
│                                                                 │
│  Blueprint JSON → Template rendering → spec.md                  │
│  → Automated POST /sdlc/features                                │
└─────────────────────────────────────────────────────────────────┘

Implementation Required

Spec Template:

# Feature: {{.Feature}}

## Summary
{{.Summary}}

## Data Model
{{range .Sections.DataModel.Entities}}
### {{.Name}}
| Field | Type |
|-------|------|
{{range .Fields}}| {{.Name}} | {{.Type}} |
{{end}}
{{end}}

## API Endpoints
{{range .Sections.APIEndpoints.Endpoints}}
- `{{.Method}} {{.Path}}` - {{.Description}}
{{end}}

## UI Components
{{range .Sections.UIComponents.Components}}
- **{{.Name}}**: {{.Purpose}}
{{end}}

## Assumptions
{{range .Assumptions}}
- {{.Assumption}}
{{end}}

Conversion Service:
- Takes Blueprint, renders spec.md
- Creates SDLC feature via existing API
- Links Blueprint to created feature (built_feature_slug)
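
The rendering step can be sketched with Go's `text/template`. The types below are inferred from the field names used in the spec template and are assumptions, not the authoritative Blueprint schema. The template is shortened to two sections to keep the example small.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical types inferred from the template's field references.
type Field struct{ Name, Type string }

type Entity struct {
	Name   string
	Fields []Field
}

type Endpoint struct{ Method, Path, Description string }

type Blueprint struct {
	Feature  string
	Summary  string
	Sections struct {
		DataModel    struct{ Entities []Entity }
		APIEndpoints struct{ Endpoints []Endpoint }
	}
}

// specTemplate is a shortened version of the spec template. Each range
// action starts at the beginning of a line so no stray blank lines are
// emitted inside the table.
const specTemplate = `# Feature: {{.Feature}}

## Summary
{{.Summary}}

## Data Model
{{range .Sections.DataModel.Entities}}### {{.Name}}
| Field | Type |
|-------|------|
{{range .Fields}}| {{.Name}} | {{.Type}} |
{{end}}
{{end}}## API Endpoints
{{range .Sections.APIEndpoints.Endpoints}}- ` + "`{{.Method}} {{.Path}}`" + ` - {{.Description}}
{{end}}`

// renderSpec turns a Blueprint into spec.md content.
func renderSpec(bp Blueprint) (string, error) {
	tmpl, err := template.New("spec").Parse(specTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, bp); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	bp := Blueprint{Feature: "cat-photos", Summary: "Users can upload public cat photos."}
	bp.Sections.DataModel.Entities = []Entity{
		{Name: "Photo", Fields: []Field{{Name: "id", Type: "uuid"}, {Name: "url", Type: "text"}}},
	}
	bp.Sections.APIEndpoints.Endpoints = []Endpoint{
		{Method: "POST", Path: "/photos", Description: "Upload a photo"},
	}
	spec, err := renderSpec(bp)
	if err != nil {
		panic(err)
	}
	fmt.Print(spec)
}
```

The rendered string would then be sent to the existing POST /sdlc/features endpoint as the spec document.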

Complexity: Low

- Template rendering is straightforward
- SDLC feature creation already exists
- Main work is template design

Gap 6: Frontend (Next.js Studio)

Current: No frontend. All interaction via API/CLI.

Required: Three-pane interface (Chat, Plan, Preview).

What's Missing

Everything. This is a new application.

Implementation Required

Project Setup:
- Next.js 14 with App Router
- Tailwind CSS for styling
- Authentication (integrate with rdev auth)

Core Components:

apps/studio/
├── app/
│   ├── page.tsx              # Template selection
│   ├── projects/
│   │   └── [id]/
│   │       └── page.tsx      # Three-pane workspace
│   └── api/                  # Proxy to rdev-api
├── components/
│   ├── ChatPane.tsx
│   ├── PlanPane.tsx
│   ├── PreviewPane.tsx
│   ├── ActivityFeed.tsx
│   └── BuildProgress.tsx
└── lib/
    ├── api.ts               # rdev-api client
    └── sse.ts               # SSE connection manager

State Management:
- Blueprint state (updated on each chat response)
- Operation state (updated via SSE)
- UI state (which pane is focused, etc.)
Key Interactions:
- Send chat message → receive reply + blueprint
- Click "Build It" → start operation → show progress
- Operation complete → refresh preview iframe

Complexity: Medium

- Standard Next.js app
- SSE client requires careful handling
- Most complexity is in polish and UX

Gap 7: Platform Service Infrastructure

Current: Projects manage their own integrations. No shared services, no credential management.

Required: A service catalog with provisioning, credential injection, and upgrade paths for existing projects.

The "Upgrade" Problem

┌─────────────────────────────────────────────────────────────────┐
│  CURRENT                                                        │
│                                                                 │
│  Project created 3 months ago                                   │
│  → No centralized logging                                       │
│  → No analytics                                                 │
│  → Rolling your own email                                       │
│  → No easy way to add platform services                         │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  REQUIRED                                                       │
│                                                                 │
│  POST /projects/{id}/services                                   │
│  { "serviceType": "logging", "provider": "loki" }               │
│                                                                 │
│  → Provision credentials                                        │
│  → Inject into K8s secrets                                      │
│  → Create integration PR with config changes                    │
│  → Project now ships logs to centralized system                 │
└─────────────────────────────────────────────────────────────────┘

Service Rollout Order

Build the infrastructure with the simplest service first, then add complexity:

| Order | Service | Why This Order |
|-------|---------|----------------|
| 1 | Logging | Pure infrastructure, no user-facing code changes |
| 2 | Email | Simple API calls, clear success/failure |
| 3 | Stats | Frontend SDK + backend events |
| 4 | Auth | Most complex (middleware, user model, protected routes) |

Implementation Required

1. Service Catalog

# internal/platform/catalog.yaml
services:
  logging:
    description: "Centralized log aggregation"
    providers:
      loki:
        name: "Grafana Loki"
        credentials:
          - LOKI_URL
          - LOKI_TENANT_ID
        integration:
          go:
            config_template: "loki-logger.go.tmpl"
            env_example: ["LOKI_URL", "LOKI_TENANT_ID"]
          node:
            packages: ["pino", "pino-loki"]
            config_template: "pino-loki.ts.tmpl"

  email:
    description: "Transactional email"
    providers:
      resend:
        name: "Resend"
        credentials:
          - RESEND_API_KEY
        integration:
          go:
            packages: ["github.com/resendlabs/resend-go"]
            service_template: "email-service.go.tmpl"
          node:
            packages: ["resend"]
            service_template: "email-client.ts.tmpl"

  stats:
    description: "Product analytics"
    providers:
      posthog:
        name: "PostHog"
        credentials:
          - POSTHOG_API_KEY
          - POSTHOG_HOST
        integration:
          go:
            packages: ["github.com/posthog/posthog-go"]
          node:
            packages: ["posthog-js", "posthog-node"]
            provider_template: "analytics-provider.tsx.tmpl"

  auth:
    description: "User authentication"
    providers:
      clerk:
        name: "Clerk"
        credentials:
          - CLERK_SECRET_KEY
          - NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY
        integration:
          node:
            packages: ["@clerk/nextjs"]
            middleware_template: "clerk-middleware.ts.tmpl"
            provider_template: "clerk-provider.tsx.tmpl"

2. Database Schema

-- Track which services a project uses
CREATE TABLE project_services (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID NOT NULL REFERENCES projects(id),
    service_type TEXT NOT NULL,      -- 'logging', 'email', 'stats', 'auth'
    provider TEXT NOT NULL,           -- 'loki', 'resend', 'posthog', 'clerk'
    environment TEXT NOT NULL,        -- 'staging', 'production', 'all'

    -- Encrypted credentials
    credentials_encrypted BYTEA,

    -- Non-sensitive config
    config JSONB NOT NULL DEFAULT '{}',

    -- Status tracking
    status TEXT NOT NULL DEFAULT 'provisioning',
    -- provisioning → active → needs_update → deprovisioned

    -- Integration tracking
    integration_status TEXT DEFAULT 'pending',
    -- pending → pr_created → integrated → needs_update
    integration_pr_url TEXT,
    integration_commit TEXT,

    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    UNIQUE(project_id, service_type, environment)
);
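
The `status` lifecycle in the schema comment can be enforced in Go before any UPDATE is issued. The transition table below is one possible reading of the arrow chain: letting `needs_update` return to `active`, and letting any live state move to `deprovisioned`, are assumptions rather than stated rules.

```go
package main

import "fmt"

// ServiceStatus values come from the status column comment.
type ServiceStatus string

const (
	StatusProvisioning  ServiceStatus = "provisioning"
	StatusActive        ServiceStatus = "active"
	StatusNeedsUpdate   ServiceStatus = "needs_update"
	StatusDeprovisioned ServiceStatus = "deprovisioned"
)

// allowed lists the assumed legal next states for each status.
// deprovisioned has no entry, so it is terminal.
var allowed = map[ServiceStatus][]ServiceStatus{
	StatusProvisioning: {StatusActive, StatusDeprovisioned},
	StatusActive:       {StatusNeedsUpdate, StatusDeprovisioned},
	StatusNeedsUpdate:  {StatusActive, StatusDeprovisioned},
}

// canTransition reports whether moving from one status to another is
// permitted by the table above.
func canTransition(from, to ServiceStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusProvisioning, StatusActive))
	fmt.Println(canTransition(StatusDeprovisioned, StatusActive))
}
```

The same pattern would apply to `integration_status` with its own table.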

3. Provisioner Interface

// internal/port/platform_provisioner.go
type PlatformProvisioner interface {
    // Provision creates credentials for a project
    Provision(ctx context.Context, req ProvisionRequest) (*ProvisionResult, error)

    // Verify checks if credentials are still valid
    Verify(ctx context.Context, projectID uuid.UUID, creds map[string]string) error

    // Deprovision cleans up (optional, for account removal)
    Deprovision(ctx context.Context, projectID uuid.UUID) error
}

type ProvisionRequest struct {
    ProjectID   uuid.UUID
    ProjectName string
    Environment string  // "staging", "production"
}

type ProvisionResult struct {
    Credentials map[string]string  // Encrypted before storage
    Config      map[string]string  // Non-sensitive config
}
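
One way to produce the `credentials_encrypted` value is AES-256-GCM from the Go standard library, with the nonce prepended so the result fits a single BYTEA column. This is a sketch under assumptions: the document does not specify a cipher, and where the key comes from (KMS, K8s secret, and so on) is left open. `encryptCredentials` and `decryptCredentials` are hypothetical names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; the key must be 16, 24, or 32 bytes.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptCredentials serializes the map to JSON and seals it. The
// random nonce is stored in front of the ciphertext.
func encryptCredentials(key []byte, creds map[string]string) ([]byte, error) {
	plaintext, err := json.Marshal(creds)
	if err != nil {
		return nil, err
	}
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptCredentials reverses encryptCredentials and fails if the blob
// was truncated or tampered with.
func decryptCredentials(key, blob []byte) (map[string]string, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	plaintext, err := gcm.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return nil, err
	}
	var creds map[string]string
	if err := json.Unmarshal(plaintext, &creds); err != nil {
		return nil, err
	}
	return creds, nil
}

func main() {
	key := make([]byte, 32) // demo key; a real key would come from a secret store
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := encryptCredentials(key, map[string]string{
		"LOKI_URL":       "http://loki:3100",
		"LOKI_TENANT_ID": "project-xyz",
	})
	if err != nil {
		panic(err)
	}
	creds, err := decryptCredentials(key, blob)
	fmt.Println(creds["LOKI_TENANT_ID"], err)
}
```

GCM is authenticated, so a modified row fails on decrypt instead of yielding corrupted credentials.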

4. Service Addition API

POST /projects/{projectId}/services
{
  "serviceType": "logging",
  "provider": "loki"       // Optional, uses platform default
}

Response:
{
  "serviceId": "svc_abc123",
  "status": "provisioning",
  "integrationMethod": "pr",  // or "direct"
  "prUrl": null  // Populated when PR is created
}

GET /projects/{projectId}/services/{serviceId}
{
  "serviceId": "svc_abc123",
  "serviceType": "logging",
  "provider": "loki",
  "status": "active",
  "integrationStatus": "integrated",
  "integrationCommit": "abc123...",
  "credentials": {
    "LOKI_URL": "[redacted]",
    "LOKI_TENANT_ID": "project-xyz"
  }
}

5. Integration Flow

POST /projects/{projectId}/services {serviceType: "logging", provider: "loki"}
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│  1. PROVISION                                                   │
│                                                                 │
│  LokiProvisioner.Provision()                                    │
│  → Create tenant in Loki (or use shared with project prefix)   │
│  → Generate credentials                                         │
│  → Store encrypted in project_services                          │
└─────────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. INJECT                                                      │
│                                                                 │
│  K8sSecretInjector.Inject()                                     │
│  → Add LOKI_URL, LOKI_TENANT_ID to project's K8s secret        │
│  → Trigger deployment restart to pick up new env vars          │
└─────────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. INTEGRATE                                                   │
│                                                                 │
│  IntegrationService.CreatePR() or .DirectCommit()               │
│  → Clone project repo                                           │
│  → Apply integration templates:                                 │
│    • Update logger config to ship to Loki                       │
│    • Add env vars to .env.example                               │
│    • Update deployment to mount secrets                         │
│  → Create PR (or direct commit for new projects)                │
└─────────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. VERIFY                                                      │
│                                                                 │
│  After PR merge / deploy:                                       │
│  → Check logs appearing in Loki                                 │
│  → Update integration_status to "integrated"                    │
└─────────────────────────────────────────────────────────────────┘

Complexity: High

- Service catalog is straightforward (YAML/DB)
- Each provisioner is unique (Loki vs Resend vs PostHog)
- Credential encryption and management needs care
- Integration templates need to handle Go + Node + various frameworks
- PR creation requires git operations

Starting Point: Logging with Loki

// internal/adapter/loki/provisioner.go
type LokiProvisioner struct {
    lokiURL    string
    adminToken string  // For tenant creation if using multi-tenant Loki
}

func (p *LokiProvisioner) Provision(ctx context.Context, req ProvisionRequest) (*ProvisionResult, error) {
    // For single-tenant Loki, just create a unique label prefix
    tenantID := fmt.Sprintf("project-%s", req.ProjectID)

    return &ProvisionResult{
        Credentials: map[string]string{
            "LOKI_URL":       p.lokiURL,
            "LOKI_TENANT_ID": tenantID,
        },
        Config: map[string]string{
            "service_name": req.ProjectName,
        },
    }, nil
}
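
To show how a project would use the provisioned values, the sketch below builds a request for Loki's HTTP push endpoint (`/loki/api/v1/push`). The tenant goes in the `X-Scope-OrgID` header, which multi-tenant Loki uses to select the tenant. `newPushRequest` is a hypothetical helper, and using `service_name` as the stream label is an assumption based on the Config map above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// pushPayload matches the JSON body of Loki's push endpoint:
// {"streams":[{"stream":{labels},"values":[["<unix ns>","<line>"]]}]}
type pushPayload struct {
	Streams []stream `json:"streams"`
}

type stream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

// newPushRequest builds (but does not send) a push request carrying a
// single log line stamped with the given Unix-nanosecond time.
func newPushRequest(lokiURL, tenantID, service, line string, unixNano int64) (*http.Request, error) {
	body, err := json.Marshal(pushPayload{Streams: []stream{{
		Stream: map[string]string{"service_name": service},
		Values: [][2]string{{strconv.FormatInt(unixNano, 10), line}},
	}}})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, lokiURL+"/loki/api/v1/push", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Scope-OrgID", tenantID) // tenant selection for multi-tenant Loki
	return req, nil
}

func main() {
	req, err := newPushRequest("http://loki:3100", "project-xyz", "cool-project",
		"server started", time.Now().UnixNano())
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("X-Scope-OrgID"))
}
```

In practice a logging library such as the ones named in the service catalog would handle batching and retries; the sketch only shows where `LOKI_URL` and `LOKI_TENANT_ID` end up.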

Gap 8: Dual Environment Support

Current: Single deployment per project. Main branch = production.

Required: Staging + Production environments. Build deploys to staging, "Publish" promotes to production.

The Environment Model

┌─────────────────────────────────────────────────────────────────┐
│  Project: cool-project                                          │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STAGING                                                 │   │
│  │  staging.cool-project.threesix.ai                       │   │
│  │                                                          │   │
│  │  • Where development happens                             │   │
│  │  • Preview pane shows this                               │   │
│  │  • "Build It" deploys here                               │   │
│  │  • May use test credentials for services                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                         │                                       │
│                    [Publish]                                    │
│                         │                                       │
│                         ▼                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  PRODUCTION                                              │   │
│  │  cool-project.threesix.ai                               │   │
│  │                                                          │   │
│  │  • User-facing, stable                                   │   │
│  │  • Only updated via explicit "Publish"                   │   │
│  │  • Production credentials for services                   │   │
│  │  • Enabled after first publish                           │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Implementation Required

1. DNS Changes

// On project creation, create both records (prod may be placeholder)
CreateDNSRecord("staging.cool-project.threesix.ai", stagingIP)
CreateDNSRecord("cool-project.threesix.ai", prodIP)  // Or placeholder until first publish

2. K8s Deployment Model

# Option A: Two deployments in same namespace
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cool-project-staging
  namespace: cool-project
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cool-project-production
  namespace: cool-project

# Option B: Two namespaces (cleaner isolation)
# cool-project-staging namespace
# cool-project-production namespace

Recommendation: Same namespace, two deployments. This is simpler to manage, and secrets can be shared or scoped per deployment.

3. Database Model

Two options:

A. Same database, table-name prefixes:

-- Staging tables
staging_users, staging_posts, staging_...

-- Production tables
prod_users, prod_posts, prod_...

B. Separate databases (cleaner):

cool-project-staging (CockroachDB database)
cool-project-production (CockroachDB database)

Recommendation: Separate databases. Cleaner isolation, no risk of cross-env data access.

4. Project Schema Updates

ALTER TABLE projects ADD COLUMN environments JSONB NOT NULL DEFAULT '{
  "staging": {"enabled": true, "deployed_at": null},
  "production": {"enabled": false, "deployed_at": null, "published_at": null}
}';

5. Publish API

POST /projects/{projectId}/publish
{
  "fromEnvironment": "staging",  // Usually staging
  "toEnvironment": "production"
}

Response:
{
  "operationId": "op_xyz789",
  "status": "publishing",
  "streamUrl": "/operations/{operationId}/stream"
}

Publish Flow:

1. Validate staging is healthy
2. Provision production credentials for any services (if they do not exist)
3. Run migrations on production database
4. Deploy staging image to production deployment
5. Health check production
6. Update DNS if needed
7. Update project.environments.production
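
The publish flow is a fixed sequence that must stop at the first failure. A minimal sketch of that coordination follows, with stub steps standing in for the real work. The step names and the `runPublish` and `demoPublish` helpers are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step is one unit of the publish flow.
type step struct {
	name string
	run  func(ctx context.Context) error
}

// runPublish executes steps in order and stops at the first failure or
// cancellation. It returns the names of the steps that completed.
func runPublish(ctx context.Context, steps []step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := ctx.Err(); err != nil {
			return done, err
		}
		if err := s.run(ctx); err != nil {
			return done, fmt.Errorf("publish step %q failed: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

// demoPublish wires the seven steps to stubs. failAt names a step that
// should fail, or "" for a clean run.
func demoPublish(failAt string) ([]string, error) {
	names := []string{
		"validate-staging", "provision-credentials", "run-migrations",
		"deploy-image", "health-check", "update-dns", "update-project-record",
	}
	steps := make([]step, 0, len(names))
	for _, name := range names {
		name := name // capture the per-iteration value for the closure
		steps = append(steps, step{name: name, run: func(context.Context) error {
			if name == failAt {
				return errors.New("simulated failure")
			}
			return nil
		}})
	}
	return runPublish(context.Background(), steps)
}

func main() {
	done, err := demoPublish("health-check")
	fmt.Println(done)
	fmt.Println(err)
}
```

Returning the completed steps gives the caller enough information to emit operation events and to decide what, if anything, needs rolling back.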

Complexity: Medium

- DNS: Already have CloudflareAdapter, just create two records
- K8s: Straightforward deployment duplication
- Database: CockroachDB adapter supports multiple databases
- Main complexity is the publish flow coordination

Defer Until After Gap 7

Dual environments depend on per-environment service credentials, so build Gap 7 (services) first and extend it in stages:

1. Provision services for a single environment initially
2. Then extend to environment-aware provisioning
3. Then add the publish flow that syncs services to production

Summary: Work Required

| Gap | Effort | Dependencies | Critical Path |
|-----|--------|--------------|---------------|
| 0. Design References | 2-3 days | Gap 1 (storage) | Yes (for design flows) |
| 1. Blueprint Storage | 2-3 days | None | Yes |
| 2. Architect Agent | 3-5 days | Gap 1 | Yes |
| 3. Operation Tracking | 4-6 days | None | Yes |
| 4. Progress Streaming | 2-3 days | Gap 3 | Yes |
| 5. Blueprint → SDLC | 1-2 days | Gap 1 | Yes |
| 6. Frontend | 5-7 days | Gaps 1-5 | Yes |
| 7. Platform Services | 5-8 days | None (can start now) | Parallel track |
| 8. Dual Environments | 3-5 days | Gap 7 | After services work |

Total Estimate: 4-5 weeks of focused work (Gaps 7-8 can run in parallel with Gaps 1-6)

Service Rollout (within Gap 7):

1. Logging (Loki) - 2 days
2. Email (Resend) - 2 days
3. Stats (PostHog) - 2 days
4. Auth (Clerk) - 3 days

Note: Gap 0 (Design References) can be implemented in parallel with Gap 2 (Architect Agent) since both involve Architect prompt engineering. The reference capture infrastructure (Gap 0) builds on Gap 1's storage layer.

Critical Path

                    ┌──► Gap 0 (References) ──┐
                    │                         │
Gap 1 (Blueprint) ──┼──► Gap 2 (Architect) ───┼──► Gap 5 (Conversion)
                    │                         │
                    │                         └──► Gap 6 (Frontend)
                    │                              ▲
Gap 3 (Operations) ─┴──► Gap 4 (Streaming) ────────┘


Parallel Track:

Gap 7 (Services) ──► Logging ──► Email ──► Stats ──► Auth
        │
        └──► Gap 8 (Environments) ──► Publish Flow

Gap 7 can start immediately and run parallel to the Studio work. Gap 8 depends on Gap 7 for service credential handling per environment.
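
As a cross-check on the estimate, the effort ranges and dependencies from the summary table can be run through a small earliest-finish calculation. This is a sketch under a strong assumption: every gap starts the moment its dependencies finish, with no staffing limit. The `gap` type and helper names are hypothetical.

```go
package main

import "fmt"

// gap holds an effort range in working days plus dependencies, taken
// from the summary table.
type gap struct {
	id        int
	low, high int
	deps      []int
}

// studioGaps lists Gaps 0-6, ordered so dependencies come first.
func studioGaps() []gap {
	return []gap{
		{id: 1, low: 2, high: 3},
		{id: 0, low: 2, high: 3, deps: []int{1}},
		{id: 2, low: 3, high: 5, deps: []int{1}},
		{id: 3, low: 4, high: 6},
		{id: 4, low: 2, high: 3, deps: []int{3}},
		{id: 5, low: 1, high: 2, deps: []int{1}},
		{id: 6, low: 5, high: 7, deps: []int{1, 2, 3, 4, 5}},
	}
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// finishTimes computes the earliest finish day of each gap for the
// optimistic (low) and pessimistic (high) effort figures.
func finishTimes(gaps []gap) (low, high map[int]int) {
	low, high = map[int]int{}, map[int]int{}
	for _, g := range gaps {
		startLow, startHigh := 0, 0
		for _, d := range g.deps {
			startLow = maxInt(startLow, low[d])
			startHigh = maxInt(startHigh, high[d])
		}
		low[g.id] = startLow + g.low
		high[g.id] = startHigh + g.high
	}
	return low, high
}

func main() {
	gaps := studioGaps()
	serialLow, serialHigh := 0, 0
	for _, g := range gaps {
		serialLow += g.low
		serialHigh += g.high
	}
	low, high := finishTimes(gaps)
	fmt.Printf("serial: %d-%d days\n", serialLow, serialHigh)
	fmt.Printf("fully parallel, Gap 6 done: %d-%d days\n", low[6], high[6])
}
```

With the table's figures this gives 19-29 working days if Gaps 0-6 are done one after another, and 11-16 days with unlimited parallelism. The 4-5 week estimate (20-25 working days) sits inside the serial range, which suggests it assumes mostly sequential work on the Studio track.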

Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Architect outputs malformed JSON | High | Medium | JSON schema validation, retry logic |
| SSE connections drop | Medium | Low | Client-side reconnection, event replay from DB |
| Blueprint schema too restrictive | Medium | Medium | Start minimal, add sections iteratively |
| LLM latency affects chat UX | Low | High | Stream partial responses, show typing indicator |
| Build failures leave broken state | Low | Medium | SDLC already handles partial state |

What's NOT a Gap

These are already solved by the current rdev foundation:

- Project provisioning - K8s, DNS, git all work
- Template seeding - Composable monorepo templates
- SDLC execution - Classifier + worker + artifact tracking
- CI/CD - Woodpecker integration
- Database provisioning - CockroachDB adapter
- Webhooks - Event dispatcher with retry

The foundation is solid. The gaps are about exposing existing capabilities through a conversational UI, not rebuilding core functionality.
