rdev/docs/features/operations-audit.md
jordan c280a92012 feat: add operations audit system and template improvements
Operations Audit (new feature):
- Add Operation domain model with status tracking (pending, running, completed, failed, cancelled)
- Add OperationRepository with PostgreSQL implementation
- Add OperationService for CRUD and lifecycle management
- Add operations handlers (list, get, cancel endpoints)
- Add migration 015_operations.sql for operations table
- Add operation cleanup worker for stale operation handling
- Add ErrOperationNotFound to domain errors

Template Improvements:
- Add CLAUDE.md configuration files to astro-landing, default, and go-api templates
- Fix PORT template variable usage in nginx configs for app templates
- Add replace directives for local pkg module in Go templates
- Simplify Go service/worker Dockerfiles for workspace builds
- Fix TypeScript error in logger template

Other:
- Refactor landing-test.sh cookbook script
- Update CLAUDE.md version reference

Note: Some files exceed 500-line limit (pre-existing debt + new feature)
- component.go: 550 lines (unchanged, pre-existing)
- main.go: 522 lines (added operations wiring)
- operation_repo.go: 569 lines (new, needs splitting)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:08:57 -07:00


# Operations Audit System
**Status**: Spec
**Purpose**: Make automated development debuggable via API
## Overview
Every action on a project is an **Operation**. Operations capture what happened, step-by-step, with enough detail to pinpoint failures without digging through logs.
```
GET /projects/testgo1/operations?status=failed
→ Operation "build" failed at step "build-api": git executable not found
```
## Design Principles
1. **Queryable via API** - No kubectl, no Woodpecker UI, no guessing
2. **Comprehensive, not verbose** - Capture essence + detail separately
3. **30-day retention** - Operations are for debugging, not compliance
4. **Linked to permanent audit** - `audit_log` is kept permanently; its rows reference operations through `operation_id`
## Data Model
### Operations Table
```sql
CREATE TABLE operations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id TEXT NOT NULL,
    type TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'running',

    -- Correlation
    request_id TEXT,    -- HTTP request that initiated
    triggered_by UUID,  -- Parent operation (build triggered by component.add)
    commit_sha TEXT,    -- Git commit this operation created/triggered
    external_ref TEXT,  -- Woodpecker build#, K8s deployment, etc.

    -- Timing
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ,
    duration_ms INT,

    -- Content (JSONB for flexibility)
    input JSONB,   -- What was requested
    output JSONB,  -- What was produced

    -- Error handling: essence + detail
    error TEXT,         -- One-line summary
    error_detail TEXT,  -- Full stack/output (truncated to 10KB)

    -- Steps
    steps JSONB NOT NULL DEFAULT '[]'
);
-- Indexes
CREATE INDEX idx_ops_project_time ON operations(project_id, started_at DESC);
CREATE INDEX idx_ops_project_status ON operations(project_id, status) WHERE status IN ('running', 'failed');
CREATE INDEX idx_ops_commit ON operations(commit_sha) WHERE commit_sha IS NOT NULL;
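-- Plain index for the retention sweep. A partial index on NOW() is rejected by
-- PostgreSQL because index predicates may only use IMMUTABLE functions.
CREATE INDEX idx_ops_cleanup ON operations(started_at);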
```
### Step Structure
```json
{
  "name": "build-api",
  "status": "failed",
  "started_at": "2026-02-01T20:31:45Z",
  "duration_ms": 17000,
  "output": {"image": "registry.threesix.ai/testgo1/api:abc123"},
  "error": "git executable not found",
  "error_detail": "exec: \"git\": executable file not found in $PATH\n at /app/pkg/app.go:24"
}
```
### Operation Types
| Type | Trigger | Key Steps |
|------|---------|-----------|
| `project.create` | `POST /projects` | create_pod, create_repo, activate_ci, create_dns |
| `component.add` | `POST /projects/{id}/components` | render_template, commit_files, create_deployment |
| `build` | Woodpecker webhook | git, build-{component}, deploy-{component} |
| `resource.provision` | `POST /projects/{id}/databases` | create_database, create_user, store_credentials |
## API
### List Operations
```
GET /projects/{id}/operations
GET /projects/{id}/operations?status=failed
GET /projects/{id}/operations?type=build
GET /projects/{id}/operations?since=1h
GET /projects/{id}/operations?limit=50
```
Response:
```json
{
  "data": [
    {
      "id": "op-abc123",
      "type": "build",
      "status": "failed",
      "started_at": "2026-02-01T20:31:45Z",
      "duration_ms": 87000,
      "error": "build-api: git executable not found",
      "steps_summary": "git ✓ → build-web ✓ → build-api ✗"
    }
  ]
}
```
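The `steps_summary` field condenses the step list into one line. One way it might be derived is sketched below; the `Step` type and `stepsSummary` function are hypothetical, and the `…` marker for steps that are neither completed nor failed is an assumption, since the spec only shows `✓` and `✗`.

```go
package main

import (
	"fmt"
	"strings"
)

// Step holds only the fields the summary needs.
type Step struct {
	Name   string
	Status string
}

// stepsSummary joins step names with a status marker, in execution order.
func stepsSummary(steps []Step) string {
	parts := make([]string, 0, len(steps))
	for _, s := range steps {
		mark := "…" // assumed marker for pending or running steps
		switch s.Status {
		case "completed":
			mark = "✓"
		case "failed":
			mark = "✗"
		}
		parts = append(parts, s.Name+" "+mark)
	}
	return strings.Join(parts, " → ")
}

func main() {
	fmt.Println(stepsSummary([]Step{
		{Name: "git", Status: "completed"},
		{Name: "build-web", Status: "completed"},
		{Name: "build-api", Status: "failed"},
	}))
}
```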
### Get Operation Detail
```
GET /projects/{id}/operations/{operation_id}
```
Response:
```json
{
  "data": {
    "id": "op-abc123",
    "type": "build",
    "status": "failed",
    "triggered_by": "op-xyz789",
    "commit_sha": "abc123",
    "external_ref": "build#42",
    "started_at": "2026-02-01T20:31:45Z",
    "completed_at": "2026-02-01T20:33:12Z",
    "duration_ms": 87000,
    "input": {
      "commit_message": "Add service component: api"
    },
    "steps": [
      {"name": "git", "status": "completed", "duration_ms": 5000},
      {"name": "build-web", "status": "completed", "duration_ms": 48000},
      {
        "name": "build-api",
        "status": "failed",
        "duration_ms": 17000,
        "error": "git executable not found",
        "error_detail": "/app/pkg/app/app.go:24:2: github.com/jordan/testgo1/pkg@v0.0.0: exec: \"git\": executable file not found..."
      }
    ],
    "error": "build-api: git executable not found",
    "error_detail": "Full kaniko output..."
  }
}
```
### Find by Commit
```
GET /projects/{id}/operations?commit=abc123
```
Returns operations that created or were triggered by this commit.
## Correlation
### Request → Operation
```
HTTP Request (X-Request-ID: req-123)
  → Handler creates Operation (id: op-abc, request_id: req-123)
  → Service executes steps, updates operation
  → Response includes operation_id
```
### Component Add → Build
```
component.add (op-abc)
→ commits to git (sha: abc123)
→ operation.commit_sha = "abc123"
Woodpecker webhook fires for abc123
→ rdev looks up: SELECT id FROM operations WHERE commit_sha = 'abc123'
→ creates build operation (triggered_by: op-abc)
```
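A minimal in-memory sketch of this correlation follows. The `store` type and its methods are hypothetical. Because the build operation stores the same `commit_sha`, the lookup here skips `build` rows so that a repeated webhook still resolves to the originating operation; that filter is an assumption layered on top of the `SELECT` shown above.

```go
package main

import "fmt"

// Operation carries only the fields involved in commit correlation.
type Operation struct {
	ID          string
	Type        string
	CommitSHA   string
	TriggeredBy string
}

// store is an in-memory stand-in for the operations table.
type store struct{ ops []*Operation }

func (s *store) create(opType, sha, triggeredBy string) *Operation {
	op := &Operation{
		ID:          fmt.Sprintf("op-%d", len(s.ops)+1),
		Type:        opType,
		CommitSHA:   sha,
		TriggeredBy: triggeredBy,
	}
	s.ops = append(s.ops, op)
	return op
}

// originOf returns the earliest non-build operation recorded for sha, or nil.
func (s *store) originOf(sha string) *Operation {
	for _, op := range s.ops {
		if op.CommitSHA == sha && op.Type != "build" {
			return op
		}
	}
	return nil
}

// onBuildWebhook creates the build operation for a commit and links it to
// the operation that produced the commit, when one is known.
func (s *store) onBuildWebhook(sha string) *Operation {
	parent := ""
	if origin := s.originOf(sha); origin != nil {
		parent = origin.ID
	}
	return s.create("build", sha, parent)
}

func main() {
	s := &store{}
	s.create("component.add", "abc123", "")
	build := s.onBuildWebhook("abc123")
	fmt.Println(build.ID, build.TriggeredBy)
}
```

A build for a commit pushed outside rdev has no originating operation, so `TriggeredBy` stays empty in that case.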
### Linking to Permanent Audit
Operations are temporary (30d). For compliance, `audit_log` is permanent.
```sql
-- Add operation_id to audit_log
ALTER TABLE audit_log ADD COLUMN operation_id UUID;
CREATE INDEX idx_audit_operation ON audit_log(operation_id) WHERE operation_id IS NOT NULL;
```
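Query permanent history via `audit_log`; debug recent issues via operations. The new column carries no foreign-key constraint, so audit rows keep their `operation_id` after the operation itself has been purged.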
## Implementation
### Phase 1: Foundation
- [ ] Migration: operations table
- [ ] Domain: Operation, OperationStep
- [ ] Port: OperationRepository
- [ ] Adapter: PostgreSQL implementation
- [ ] Handler: GET /projects/{id}/operations
### Phase 2: Instrumentation
- [ ] Instrument: project.create handler
- [ ] Instrument: component.add handler
- [ ] Instrument: resource provisioning
- [ ] Add operation_id to responses
### Phase 3: Build Integration
- [ ] Woodpecker webhook receiver endpoint
- [ ] Parse build events into operation steps
- [ ] Link via commit_sha
### Phase 4: Cleanup
- [ ] Background job: delete operations older than 30d
- [ ] Add operation_id column to audit_log
## Files to Create/Modify
```
internal/
├── domain/
│   └── operation.go              # NEW: Operation, OperationStep, OperationType
├── port/
│   └── operation.go              # NEW: OperationRepository interface
├── adapter/
│   └── postgres/
│       └── operation_repo.go     # NEW: PostgreSQL implementation
├── service/
│   └── operation_service.go      # NEW: Business logic
├── handlers/
│   ├── operations.go             # NEW: API handlers
│   ├── project.go                # MODIFY: Create operation on project.create
│   ├── component.go              # MODIFY: Create operation on component.add
│   └── webhooks.go               # MODIFY: Handle Woodpecker build events
└── worker/
    └── cleanup.go                # NEW: 30-day retention cleanup
migrations/
└── 015_operations.sql            # NEW: Table + indexes
```
## Example Debugging Session
```bash
# Project deployment failing. What happened?
$ curl -s "$API/projects/testgo1/operations?status=failed" | jq '.data[0] | {id, type, error, steps_summary}'
{
"id": "op-abc123",
"type": "build",
"error": "build-api: git executable not found",
"steps_summary": "git ✓ → build-web ✓ → build-api ✗"
}
# Get details
$ curl -s "$API/projects/testgo1/operations/op-abc123" | jq '.data.steps[-1] | {name, status, error, error_detail}'
{
"name": "build-api",
"status": "failed",
"error": "git executable not found",
"error_detail": "exec: \"git\": executable file not found in $PATH..."
}
# What triggered this build?
$ curl -s "$API/projects/testgo1/operations/op-abc123" | jq '.data.triggered_by'
"op-xyz789"
# What was that operation?
$ curl -s "$API/projects/testgo1/operations/op-xyz789" | jq '.data | {type, input}'
{
"type": "component.add",
"input": {"template": "service", "name": "api"}
}
# Root cause: component.add triggered build, build failed due to missing git in Dockerfile
```
## Open Questions
1. **Stream running operations?** - Could add SSE endpoint for real-time step updates
2. **CLI integration?** - `rdev debug testgo1` to show recent failures
3. **Alerting?** - Webhook when operation fails?