rdev/docs/features/operations-audit.md
jordan c280a92012 feat: add operations audit system and template improvements
Operations Audit (new feature):
- Add Operation domain model with status tracking (pending, running, completed, failed, cancelled)
- Add OperationRepository with PostgreSQL implementation
- Add OperationService for CRUD and lifecycle management
- Add operations handlers (list, get, cancel endpoints)
- Add migration 015_operations.sql for operations table
- Add operation cleanup worker for stale operation handling
- Add ErrOperationNotFound to domain errors

Template Improvements:
- Add CLAUDE.md configuration files to astro-landing, default, and go-api templates
- Fix PORT template variable usage in nginx configs for app templates
- Add replace directives for local pkg module in Go templates
- Simplify Go service/worker Dockerfiles for workspace builds
- Fix TypeScript error in logger template

Other:
- Refactor landing-test.sh cookbook script
- Update CLAUDE.md version reference

Note: Some files exceed 500-line limit (pre-existing debt + new feature)
- component.go: 550 lines (unchanged, pre-existing)
- main.go: 522 lines (added operations wiring)
- operation_repo.go: 569 lines (new, needs splitting)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:08:57 -07:00


Operations Audit System

Status: Spec
Purpose: Make automated development debuggable via API

Overview

Every action on a project is an Operation. Operations capture what happened, step-by-step, with enough detail to pinpoint failures without digging through logs.

GET /projects/testgo1/operations?status=failed

→ Operation "build" failed at step "build-api": git executable not found

Design Principles

  1. Queryable via API - No kubectl, no Woodpecker UI, no guessing
  2. Comprehensive, not verbose - Capture essence + detail separately
  3. 30-day retention - Operations are for debugging, not compliance
  4. Linked to permanent audit - audit_log stays forever, operations link to it

Data Model

Operations Table

CREATE TABLE operations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id TEXT NOT NULL,
    type TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'running',

    -- Correlation
    request_id TEXT,              -- HTTP request that initiated
    triggered_by UUID,            -- Parent operation (build triggered by component.add)
    commit_sha TEXT,              -- Git commit this operation created/triggered
    external_ref TEXT,            -- Woodpecker build#, K8s deployment, etc.

    -- Timing
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ,
    duration_ms INT,

    -- Content (JSONB for flexibility)
    input JSONB,                  -- What was requested
    output JSONB,                 -- What was produced

    -- Error handling: essence + detail
    error TEXT,                   -- One-line summary
    error_detail TEXT,            -- Full stack/output (truncated to 10KB)

    -- Steps
    steps JSONB NOT NULL DEFAULT '[]'
);

-- Indexes
CREATE INDEX idx_ops_project_time ON operations(project_id, started_at DESC);
CREATE INDEX idx_ops_project_status ON operations(project_id, status) WHERE status IN ('running', 'failed');
CREATE INDEX idx_ops_commit ON operations(commit_sha) WHERE commit_sha IS NOT NULL;
CREATE INDEX idx_ops_cleanup ON operations(started_at);  -- plain index: a partial-index predicate cannot call NOW() (not IMMUTABLE)
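
The timing and error columns above are filled in when an operation ends. A minimal Go sketch of that step, assuming a hypothetical `Finish` method and `truncateDetail` helper (neither name comes from the spec); the 10 KB cap follows the comment on `error_detail`:

```go
package main

import (
	"fmt"
	"time"
	"unicode/utf8"
)

// maxErrorDetailBytes mirrors the "truncated to 10KB" note on error_detail.
const maxErrorDetailBytes = 10 * 1024

// Operation holds only the columns this sketch touches.
type Operation struct {
	Status      string
	StartedAt   time.Time
	CompletedAt *time.Time
	DurationMs  *int64
	Error       string
	ErrorDetail string
}

// truncateDetail cuts s to at most max bytes without splitting a UTF-8 rune.
func truncateDetail(s string, max int) string {
	if len(s) <= max {
		return s
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// Finish sets completed_at and duration_ms. An empty summary means success
// (an assumption of this sketch); otherwise the operation is marked failed.
func (op *Operation) Finish(now time.Time, summary, detail string) {
	ms := now.Sub(op.StartedAt).Milliseconds()
	op.CompletedAt = &now
	op.DurationMs = &ms
	op.Status = "completed"
	if summary != "" {
		op.Status = "failed"
		op.Error = summary
		op.ErrorDetail = truncateDetail(detail, maxErrorDetailBytes)
	}
}

func main() {
	start := time.Date(2026, 2, 1, 20, 31, 45, 0, time.UTC)
	op := &Operation{Status: "running", StartedAt: start}
	op.Finish(start.Add(87*time.Second), "build-api: git executable not found", "Full kaniko output...")
	fmt.Println(op.Status, *op.DurationMs) // failed 87000
}
```

Keeping the one-line `error` separate from the capped `error_detail` lets list responses stay small while the detail endpoint can still return the full output.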

Step Structure

{
  "name": "build-api",
  "status": "failed",
  "started_at": "2026-02-01T20:31:45Z",
  "duration_ms": 17000,
  "output": {"image": "registry.threesix.ai/testgo1/api:abc123"},
  "error": "git executable not found",
  "error_detail": "exec: \"git\": executable file not found in $PATH\n  at /app/pkg/app.go:24"
}
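
On the Go side, a step like the one above could map to a struct with JSON tags. This is a sketch only: the `OperationStep` name appears in the implementation plan, but the field types and the `decodeSteps` helper are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// OperationStep mirrors one element of the operations.steps JSONB array.
type OperationStep struct {
	Name        string          `json:"name"`
	Status      string          `json:"status"`
	StartedAt   *time.Time      `json:"started_at,omitempty"`
	DurationMs  int64           `json:"duration_ms"`
	Output      json.RawMessage `json:"output,omitempty"` // free-form, kept as raw JSON
	Error       string          `json:"error,omitempty"`
	ErrorDetail string          `json:"error_detail,omitempty"`
}

// decodeSteps parses the raw value of the steps column.
func decodeSteps(raw []byte) ([]OperationStep, error) {
	var steps []OperationStep
	if err := json.Unmarshal(raw, &steps); err != nil {
		return nil, fmt.Errorf("decode steps: %w", err)
	}
	return steps, nil
}

func main() {
	raw := []byte(`[{"name":"build-api","status":"failed","started_at":"2026-02-01T20:31:45Z","duration_ms":17000,"error":"git executable not found"}]`)
	steps, err := decodeSteps(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(steps[0].Name, steps[0].Status, steps[0].DurationMs)
}
```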

Operation Types

| Type | Trigger | Key Steps |
|------|---------|-----------|
| project.create | POST /projects | create_pod, create_repo, activate_ci, create_dns |
| component.add | POST /projects/{id}/components | render_template, commit_files, create_deployment |
| build | Woodpecker webhook | git, build-{component}, deploy-{component} |
| resource.provision | POST /projects/{id}/databases | create_database, create_user, store_credentials |

API

List Operations

GET /projects/{id}/operations
GET /projects/{id}/operations?status=failed
GET /projects/{id}/operations?type=build
GET /projects/{id}/operations?since=1h
GET /projects/{id}/operations?limit=50

Response:

{
  "data": [
    {
      "id": "op-abc123",
      "type": "build",
      "status": "failed",
      "started_at": "2026-02-01T20:31:45Z",
      "duration_ms": 87000,
      "error": "build-api: git executable not found",
      "steps_summary": "git ✓ → build-web ✓ → build-api ✗"
    }
  ]
}
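
The `steps_summary` field in the response above can be derived from the step list. A small sketch, assuming a hypothetical `stepsSummary` helper; the marker used for steps that are neither completed nor failed is an assumption, since the spec shows only ✓ and ✗.

```go
package main

import (
	"fmt"
	"strings"
)

// Step carries just the fields the summary needs.
type Step struct {
	Name   string
	Status string
}

// stepsSummary renders steps as "git ✓ → build-web ✓ → build-api ✗".
func stepsSummary(steps []Step) string {
	parts := make([]string, 0, len(steps))
	for _, s := range steps {
		mark := "…" // assumption: marker for pending or running steps
		switch s.Status {
		case "completed":
			mark = "✓"
		case "failed":
			mark = "✗"
		}
		parts = append(parts, s.Name+" "+mark)
	}
	return strings.Join(parts, " → ")
}

func main() {
	fmt.Println(stepsSummary([]Step{
		{Name: "git", Status: "completed"},
		{Name: "build-web", Status: "completed"},
		{Name: "build-api", Status: "failed"},
	}))
}
```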

Get Operation Detail

GET /projects/{id}/operations/{operation_id}

Response:

{
  "data": {
    "id": "op-abc123",
    "type": "build",
    "status": "failed",
    "triggered_by": "op-xyz789",
    "commit_sha": "abc123",
    "external_ref": "build#42",
    "started_at": "2026-02-01T20:31:45Z",
    "completed_at": "2026-02-01T20:33:12Z",
    "duration_ms": 87000,
    "input": {
      "commit_message": "Add service component: api"
    },
    "steps": [
      {"name": "git", "status": "completed", "duration_ms": 5000},
      {"name": "build-web", "status": "completed", "duration_ms": 48000},
      {
        "name": "build-api",
        "status": "failed",
        "duration_ms": 17000,
        "error": "git executable not found",
        "error_detail": "/app/pkg/app/app.go:24:2: github.com/jordan/testgo1/pkg@v0.0.0: exec: \"git\": executable file not found..."
      }
    ],
    "error": "build-api: git executable not found",
    "error_detail": "Full kaniko output..."
  }
}

Find by Commit

GET /projects/{id}/operations?commit=abc123

Returns operations that created or were triggered by this commit.
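
The query parameters shown in this section (`status`, `type`, `since`, `limit`, `commit`) could be parsed into one filter value before reaching the repository. A sketch under assumptions: `ListFilter`, `parseListFilter`, and the default and maximum limits are illustrative, not part of the spec. Note that `time.ParseDuration` accepts values such as `1h` or `30m` but has no day unit, so `since=7d` would need custom handling.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// ListFilter is a hypothetical container for the list endpoint's filters.
type ListFilter struct {
	Status string
	Type   string
	Commit string
	Since  time.Duration // zero means no lower bound
	Limit  int
}

const (
	defaultLimit = 50  // assumption: the spec does not state a default
	maxLimit     = 200 // assumption: cap to keep responses bounded
)

// parseListFilter validates the query string of GET /projects/{id}/operations.
func parseListFilter(q url.Values) (ListFilter, error) {
	f := ListFilter{
		Status: q.Get("status"),
		Type:   q.Get("type"),
		Commit: q.Get("commit"),
		Limit:  defaultLimit,
	}
	if s := q.Get("since"); s != "" {
		d, err := time.ParseDuration(s)
		if err != nil || d <= 0 {
			return f, fmt.Errorf("invalid since %q", s)
		}
		f.Since = d
	}
	if s := q.Get("limit"); s != "" {
		n, err := strconv.Atoi(s)
		if err != nil || n < 1 {
			return f, fmt.Errorf("invalid limit %q", s)
		}
		if n > maxLimit {
			n = maxLimit
		}
		f.Limit = n
	}
	return f, nil
}

func main() {
	q, _ := url.ParseQuery("status=failed&since=1h&limit=50")
	f, err := parseListFilter(q)
	fmt.Println(f.Status, f.Since, f.Limit, err)
}
```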

Correlation

Request → Operation

HTTP Request (X-Request-ID: req-123)
    ↓
Handler creates Operation (id: op-abc, request_id: req-123)
    ↓
Service executes steps, updates operation
    ↓
Response includes operation_id
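
The flow above can be sketched as a handler that records the operation before doing any work and returns its ID. Everything here is illustrative: `store`, `createProjectHandler`, and the counter-based IDs are stand-ins for the real service and UUIDs, and the store is not safe for concurrent use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Operation holds the correlation fields relevant to this flow.
type Operation struct {
	ID        string
	Type      string
	RequestID string
	Status    string
}

// store is a hypothetical in-memory stand-in for the operation service.
type store struct{ ops []*Operation }

// create records a new running operation tied to the originating request.
func (s *store) create(opType, requestID string) *Operation {
	op := &Operation{
		ID:        fmt.Sprintf("op-%d", len(s.ops)+1), // real IDs would be UUIDs
		Type:      opType,
		RequestID: requestID,
		Status:    "running",
	}
	s.ops = append(s.ops, op)
	return op
}

// createProjectHandler creates the operation first so failures are captured too.
func createProjectHandler(s *store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		op := s.create("project.create", r.Header.Get("X-Request-ID"))
		// ... create_pod, create_repo, activate_ci, create_dns would run here ...
		op.Status = "completed"
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusCreated)
		_ = json.NewEncoder(w).Encode(map[string]string{"operation_id": op.ID})
	}
}

func main() {
	s := &store{}
	req := httptest.NewRequest(http.MethodPost, "/projects", nil)
	req.Header.Set("X-Request-ID", "req-123")
	rec := httptest.NewRecorder()
	createProjectHandler(s).ServeHTTP(rec, req)
	fmt.Println(rec.Code, s.ops[0].RequestID, s.ops[0].Status)
}
```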

Component Add → Build

component.add (op-abc)
    → commits to git (sha: abc123)
    → operation.commit_sha = "abc123"

Woodpecker webhook fires for abc123
    → rdev looks up: SELECT id FROM operations WHERE commit_sha = 'abc123'
    → creates build operation (triggered_by: op-abc)
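
A sketch of the webhook-side lookup described above, using an in-memory slice in place of the `operations` table. The names `memRepo`, `findByCommit`, and `onBuildWebhook` are hypothetical, and skipping earlier `build` rows when choosing the parent is an assumption (so that a re-delivered webhook does not link a build to another build).

```go
package main

import "fmt"

// Operation holds only the fields used for commit correlation.
type Operation struct {
	ID          string
	Type        string
	CommitSHA   string
	TriggeredBy string
}

// memRepo is a hypothetical in-memory stand-in for the operations table.
type memRepo struct{ ops []Operation }

// findByCommit plays the role of
// SELECT id FROM operations WHERE commit_sha = $1,
// returning the most recent non-build operation for that commit.
func (r *memRepo) findByCommit(sha string) (Operation, bool) {
	for i := len(r.ops) - 1; i >= 0; i-- {
		if r.ops[i].CommitSHA == sha && r.ops[i].Type != "build" {
			return r.ops[i], true
		}
	}
	return Operation{}, false
}

// onBuildWebhook creates a build operation and links it to its parent, if any.
func (r *memRepo) onBuildWebhook(sha string) Operation {
	build := Operation{
		ID:        fmt.Sprintf("op-build-%d", len(r.ops)+1),
		Type:      "build",
		CommitSHA: sha,
	}
	if parent, ok := r.findByCommit(sha); ok {
		build.TriggeredBy = parent.ID
	}
	r.ops = append(r.ops, build)
	return build
}

func main() {
	repo := &memRepo{ops: []Operation{{ID: "op-abc", Type: "component.add", CommitSHA: "abc123"}}}
	build := repo.onBuildWebhook("abc123")
	fmt.Println(build.Type, build.TriggeredBy)
}
```

A build for a commit that rdev did not create (for example a manual push) simply has no `triggered_by` value.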

Linking to Permanent Audit

Operations are temporary (30d). For compliance, audit_log is permanent.

-- Add operation_id to audit_log
ALTER TABLE audit_log ADD COLUMN operation_id UUID;
CREATE INDEX idx_audit_operation ON audit_log(operation_id) WHERE operation_id IS NOT NULL;

Query permanent history via audit_log, debug recent issues via operations.

Implementation

Phase 1: Foundation

  • Migration: operations table
  • Domain: Operation, OperationStep
  • Port: OperationRepository
  • Adapter: PostgreSQL implementation
  • Handler: GET /projects/{id}/operations
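
One possible shape for the Phase 1 port, with an in-memory adapter that is handy for handler tests. The method set is an assumption, not the interface from the codebase; only the `OperationRepository` and `ErrOperationNotFound` names come from the source.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ErrOperationNotFound is returned when no operation matches the lookup.
var ErrOperationNotFound = errors.New("operation not found")

// Operation is trimmed to the fields this sketch needs.
type Operation struct {
	ID        string
	ProjectID string
	Type      string
	Status    string
}

// OperationRepository is a hypothetical minimal port.
type OperationRepository interface {
	Create(ctx context.Context, op Operation) error
	Get(ctx context.Context, projectID, id string) (Operation, error)
	List(ctx context.Context, projectID, status string) ([]Operation, error)
}

// memoryRepo is an in-memory adapter; the real one would use PostgreSQL.
type memoryRepo struct {
	mu  sync.RWMutex
	ops []Operation
}

// Compile-time check that memoryRepo satisfies the port.
var _ OperationRepository = (*memoryRepo)(nil)

func (r *memoryRepo) Create(_ context.Context, op Operation) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.ops = append(r.ops, op)
	return nil
}

func (r *memoryRepo) Get(_ context.Context, projectID, id string) (Operation, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	for _, op := range r.ops {
		if op.ProjectID == projectID && op.ID == id {
			return op, nil
		}
	}
	return Operation{}, ErrOperationNotFound
}

// List returns newest-first, like ORDER BY started_at DESC; empty status means all.
func (r *memoryRepo) List(_ context.Context, projectID, status string) ([]Operation, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []Operation
	for i := len(r.ops) - 1; i >= 0; i-- {
		op := r.ops[i]
		if op.ProjectID == projectID && (status == "" || op.Status == status) {
			out = append(out, op)
		}
	}
	return out, nil
}

func main() {
	ctx := context.Background()
	var repo OperationRepository = &memoryRepo{}
	_ = repo.Create(ctx, Operation{ID: "op-abc123", ProjectID: "testgo1", Type: "build", Status: "failed"})
	failed, _ := repo.List(ctx, "testgo1", "failed")
	_, err := repo.Get(ctx, "testgo1", "missing")
	fmt.Println(len(failed), errors.Is(err, ErrOperationNotFound)) // 1 true
}
```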

Phase 2: Instrumentation

  • Instrument: project.create handler
  • Instrument: component.add handler
  • Instrument: resource provisioning
  • Add operation_id to responses

Phase 3: Build Integration

  • Woodpecker webhook receiver endpoint
  • Parse build events into operation steps
  • Link via commit_sha

Phase 4: Cleanup

  • Background job: delete operations older than 30d
  • Add operation_id column to audit_log
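
The retention rule behind the cleanup job can be sketched in memory. In the real worker this would presumably be a single `DELETE ... WHERE started_at < $1` with the cutoff computed in Go; `purgeExpired` and the `retention` constant below are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// retention matches the 30-day policy in the design principles.
const retention = 30 * 24 * time.Hour

// Operation holds only what the retention check needs.
type Operation struct {
	ID        string
	StartedAt time.Time
}

// purgeExpired returns the operations to keep and the number removed.
// An operation started exactly at the cutoff is kept.
func purgeExpired(ops []Operation, now time.Time) ([]Operation, int) {
	cutoff := now.Add(-retention)
	kept := make([]Operation, 0, len(ops))
	for _, op := range ops {
		if !op.StartedAt.Before(cutoff) {
			kept = append(kept, op)
		}
	}
	return kept, len(ops) - len(kept)
}

func main() {
	now := time.Date(2026, 3, 15, 0, 0, 0, 0, time.UTC)
	ops := []Operation{
		{ID: "op-old", StartedAt: now.AddDate(0, 0, -31)},
		{ID: "op-new", StartedAt: now.AddDate(0, 0, -1)},
	}
	kept, removed := purgeExpired(ops, now)
	fmt.Println(len(kept), removed) // 1 1
}
```

Passing `now` in as a parameter keeps the cutoff testable without waiting on the clock.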

Files to Create/Modify

internal/
├── domain/
│   └── operation.go              # NEW: Operation, OperationStep, OperationType
├── port/
│   └── operation.go              # NEW: OperationRepository interface
├── adapter/
│   └── postgres/
│       └── operation_repo.go     # NEW: PostgreSQL implementation
├── service/
│   └── operation_service.go      # NEW: Business logic
├── handlers/
│   ├── operations.go             # NEW: API handlers
│   ├── project.go                # MODIFY: Create operation on project.create
│   ├── component.go              # MODIFY: Create operation on component.add
│   └── webhooks.go               # MODIFY: Handle Woodpecker build events
└── worker/
    └── cleanup.go                # NEW: 30-day retention cleanup

migrations/
└── 015_operations.sql            # NEW: Table + indexes

Example Debugging Session

# Project deployment failing. What happened?
$ curl -s "$API/projects/testgo1/operations?status=failed" | jq '.data[0] | {id, type, error, steps_summary}'
{
  "id": "op-abc123",
  "type": "build",
  "error": "build-api: git executable not found",
  "steps_summary": "git ✓ → build-web ✓ → build-api ✗"
}

# Get details
$ curl -s "$API/projects/testgo1/operations/op-abc123" | jq '.data.steps[-1] | {name, status, error, error_detail}'
{
  "name": "build-api",
  "status": "failed",
  "error": "git executable not found",
  "error_detail": "exec: \"git\": executable file not found in $PATH..."
}

# What triggered this build?
$ curl -s "$API/projects/testgo1/operations/op-abc123" | jq '.data.triggered_by'
"op-xyz789"

# What was that operation?
$ curl -s "$API/projects/testgo1/operations/op-xyz789" | jq '.data | {type, input}'
{
  "type": "component.add",
  "input": {"template": "service", "name": "api"}
}

# Root cause: component.add triggered build, build failed due to missing git in Dockerfile
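
The manual walk in this session (failed build, then its `triggered_by` parent) can be automated. A sketch assuming the operations have already been fetched into a map keyed by ID; `triggerChain` is a hypothetical helper, and a real client would issue one GET per hop instead.

```go
package main

import "fmt"

// Operation holds the fields needed to follow triggered_by links.
type Operation struct {
	ID          string
	Type        string
	TriggeredBy string
}

// triggerChain follows triggered_by from id back to the originating operation.
// It stops on a missing parent and guards against cycles.
func triggerChain(ops map[string]Operation, id string) []Operation {
	var chain []Operation
	seen := map[string]bool{}
	for id != "" && !seen[id] {
		op, ok := ops[id]
		if !ok {
			break // parent may already be past the 30-day retention
		}
		seen[id] = true
		chain = append(chain, op)
		id = op.TriggeredBy
	}
	return chain
}

func main() {
	ops := map[string]Operation{
		"op-abc123": {ID: "op-abc123", Type: "build", TriggeredBy: "op-xyz789"},
		"op-xyz789": {ID: "op-xyz789", Type: "component.add"},
	}
	for _, op := range triggerChain(ops, "op-abc123") {
		fmt.Println(op.ID, op.Type)
	}
}
```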

Open Questions

  1. Stream running operations? - Could add SSE endpoint for real-time step updates
  2. CLI integration? - rdev debug testgo1 to show recent failures
  3. Alerting? - Webhook when operation fails?