stemedb/applications/aphoria/docs/corpus-architecture.md
jml bb0c33f8d3 fix(api): enable querying of CLI-created community corpus items
## Problem
CLI-created community corpus items (tier 3) were stored correctly but
invisible via API queries. Two issues blocked discoverability:

1. **Prefix mismatch**: API hardcoded 'community://pattern/' for
   aggregated patterns, but CLI creates 'community://rust/http/...' URIs
2. **Query parameter parsing**: Axum's default parser doesn't support
   bracket notation (?sources[]=value) used by the dashboard

Result: 0/22 CLI-created items were queryable.

## Solution

### Fix 1: Broaden Community Prefix
- Changed: 'community://pattern/' → 'community://' in corpus handler
- Impact: Now matches both aggregated patterns AND CLI-created items
- Backward compatible: Broader prefix includes narrower results

### Fix 2: Add QsQuery Extractor
- Added: serde_qs dependency + custom QsQuery extractor
- Supports: Bracket notation for array parameters (?sources[]=a&sources[]=b)
- Compatible: Works with JavaScript URLSearchParams standard
- Tested: 3 new unit tests for extractor behavior

## Verification
- All 22 CLI-created community items now queryable (was 0)
- Source filtering works: community (22), RFC (2), vendor (5)
- Multi-source queries work: ?sources[]=community&sources[]=rfc → 24
- All 89 API tests pass + 3 new extractor tests
- Clippy clean (0 warnings)
- No regressions in existing functionality

## Files Changed
- crates/stemedb-api/Cargo.toml: Add serde_qs dependency
- crates/stemedb-api/src/extractors.rs: New QsQuery extractor (117 lines)
- crates/stemedb-api/src/handlers/aphoria/corpus.rs: Use QsQuery, broaden prefix
- crates/stemedb-api/src/lib.rs: Export extractors module

Also includes: Scale-adaptive thresholds, wiki corpus extraction,
documentation updates, and dashboard UI improvements from prior work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-09 15:54:35 +00:00


Corpus Database Architecture

Audience: Engineers integrating Aphoria with StemeDB API, ops teams deploying both systems.

What you'll learn:

  • How Aphoria's corpus database integrates with StemeDB API
  • URI scheme inference for authoritative sources
  • Where CLI-created corpus items live
  • Git hooks for automatic binary rebuilds
  • Production deployment patterns

Quick Reference

# Aphoria CLI writes to:
~/.aphoria/corpus-db/

# StemeDB API reads from:
data/db/  # Default, or configure STEMEDB_CORPUS_DB_DIR

# Make API see Aphoria corpus:
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
stemedb-api

Database Separation

The Problem

Aphoria and StemeDB API use separate databases:

Aphoria CLI:
  └─ corpus create/build → ~/.aphoria/corpus-db/

StemeDB API:
  └─ GET /v1/aphoria/corpus → data/db/

Result: Items created via CLI aren't visible in API/Dashboard
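
The mismatch comes down to which directory each process opens. The following is a minimal sketch, assuming the API falls back to data/db whenever STEMEDB_CORPUS_DB_DIR is unset; the function names are hypothetical and not taken from the StemeDB codebase, and treating an empty value as unset is also an assumption.

```rust
use std::path::PathBuf;

/// Default directory the API reads when no override is set (per the quick reference).
const DEFAULT_CORPUS_DB_DIR: &str = "data/db";

/// Hypothetical helper: pick the corpus directory from an optional override.
/// `override_dir` stands in for the value of STEMEDB_CORPUS_DB_DIR.
pub fn resolve_corpus_db_dir(override_dir: Option<&str>) -> PathBuf {
    match override_dir {
        // Assumption: an empty value is treated the same as an unset one.
        Some(dir) if !dir.trim().is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(DEFAULT_CORPUS_DB_DIR),
    }
}

/// Reads the real environment variable and applies the same rule.
pub fn corpus_db_dir_from_env() -> PathBuf {
    resolve_corpus_db_dir(std::env::var("STEMEDB_CORPUS_DB_DIR").ok().as_deref())
}
```

With no override the sketch resolves to data/db, which is why items written to ~/.aphoria/corpus-db/ never show up until the variable is set.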

The Solution

Three integration patterns:

Pattern 1: Point API at Aphoria's Corpus Database

# .env
STEMEDB_CORPUS_DB_DIR=/home/user/.aphoria/corpus-db

# Start API
cargo run --release -p stemedb-api

Pros:

  • Zero synchronization needed
  • Single source of truth
  • Changes immediately visible

Cons:

  • API has read-only access (can't write to corpus)
  • Not suitable if API needs to write corpus items

Pattern 2: Shared Directory for Both

# Create shared directory
sudo mkdir -p /var/lib/stemedb/corpus
sudo chown aphoria:stemedb /var/lib/stemedb/corpus
sudo chmod 775 /var/lib/stemedb/corpus

# .aphoria/config.toml
[episteme]
corpus_data_dir = "/var/lib/stemedb/corpus"

# StemeDB API
export STEMEDB_CORPUS_DB_DIR="/var/lib/stemedb/corpus"

Pros:

  • Single database, no sync
  • Both systems have write access
  • Production-ready pattern

Cons:

  • Requires deployment coordination
  • Permissions management needed

Pattern 3: Sync Mechanism (Future)

# Planned (not yet implemented)
aphoria corpus sync --to-api --api-db-dir data/db

Use case: When databases must remain separate.


URI Scheme Inference

The Problem

Corpus items need URI-schemed subjects for API prefix scanning:

# Without URI scheme (won't work):
subject: "tls/certificate_verification"

# API query:
curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=rfc'
# Scans for keys starting with "subject:rfc://" → plain subjects never match
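
To see why a plain subject is skipped, here is a small sketch of a prefix scan over an ordered key space. It assumes keys of the form "subject:<subject>" and uses a BTreeMap as a stand-in for the real corpus store; `scan_prefix` and `source_prefix` here are illustrative helpers, not the store's actual API.

```rust
use std::collections::BTreeMap;

/// Builds the scan prefix for a source name such as "rfc".
pub fn source_prefix(source: &str) -> String {
    format!("subject:{source}://")
}

/// Stand-in for a store prefix scan: returns every key that starts with `prefix`.
pub fn scan_prefix<'a>(store: &'a BTreeMap<String, String>, prefix: &str) -> Vec<&'a str> {
    store
        // Jump to the first key >= prefix, then stop once keys no longer match.
        .range(prefix.to_string()..)
        .take_while(|(key, _)| key.starts_with(prefix))
        .map(|(key, _)| key.as_str())
        .collect()
}
```

A key such as "subject:tls/certificate_verification" sorts outside the "subject:rfc://" range, so the scan cannot return it no matter what its authority says.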

The Solution

Automatic URI inference based on authority and tier:

// In aphoria corpus create
Authority: "RFC 5246 Section 7.4.2"
Tier: 0

// Auto-inferred:
subject_uri: "rfc://tls/certificate_verification"

Inference Rules

Condition                                   | Scheme       | Example
--------------------------------------------|--------------|----------------------------
Already has ://                             | Preserved    | rfc://test → rfc://test
Authority contains "rfc" (case-insensitive) | rfc://       | "RFC 5280" → rfc://...
Authority contains "owasp"                  | owasp://     | "OWASP Top 10" → owasp://...
Authority contains "cwe"                    | cwe://       | "CWE-120" → cwe://...
Tier 2                                      | vendor://    | GitHub docs → vendor://...
Tier 3                                      | community:// | Team wiki → community://...
Tier 0/1 unrecognized                       | corpus://    | Unknown → corpus://...

Priority: Existing scheme > Authority matching > Tier-based > Fallback
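
The rules can be sketched as one function. This is an illustration of the documented rules, not the actual `infer_subject_uri` implementation. It assumes that the "owasp" and "cwe" checks are case-insensitive like the "rfc" check, that the keywords are tested in table order, and that any tier other than 2 or 3 falls back to corpus://.

```rust
/// Sketch of the inference rules; the signature is an assumption.
pub fn infer_subject_uri(subject: &str, authority: &str, tier: u8) -> String {
    // Rule 1: a subject that already carries a scheme is preserved as-is.
    if subject.contains("://") {
        return subject.to_string();
    }
    let authority = authority.to_lowercase();
    // Rule 2: authority keywords take priority over the tier.
    let scheme = if authority.contains("rfc") {
        "rfc"
    } else if authority.contains("owasp") {
        "owasp"
    } else if authority.contains("cwe") {
        "cwe"
    } else {
        // Rule 3: tier-based schemes, then the corpus:// fallback.
        match tier {
            2 => "vendor",
            3 => "community",
            _ => "corpus",
        }
    };
    format!("{scheme}://{subject}")
}
```

Under these assumptions, `infer_subject_uri("tls/validation", "RFC 5280 Section 6.1", 0)` yields "rfc://tls/validation", matching the first CLI example.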

Examples

# RFC claim (tier 0)
aphoria corpus create \
  --subject "tls/validation" \
  --authority "RFC 5280 Section 6.1" \
  --tier 0
# Stored as: subject:rfc://tls/validation

# OWASP claim (tier 1)
aphoria corpus create \
  --subject "password/storage" \
  --authority "OWASP Password Storage Cheat Sheet" \
  --tier 1
# Stored as: subject:owasp://password/storage

# Vendor docs (tier 2)
aphoria corpus create \
  --subject "postgresql/connection_pool" \
  --authority "PostgreSQL Documentation" \
  --tier 2
# Stored as: subject:vendor://postgresql/connection_pool

# Community (tier 3)
aphoria corpus create \
  --subject "api/rest/pagination" \
  --authority "Team wiki: API standards" \
  --tier 3
# Stored as: subject:community://api/rest/pagination

# Already schemed (preserved)
aphoria corpus create \
  --subject "custom://myapp/feature" \
  --authority "Internal spec" \
  --tier 2
# Stored as: subject:custom://myapp/feature

CLI-Created Corpus Source

The Problem

Items created with aphoria corpus create weren't visible in:

aphoria corpus list
# Showed: RFC, OWASP, VendorDocs
# Missing: CLI-created items

aphoria corpus build
# Total assertions: 86
# Missing: CLI-created items

The Solution

CLI-created items are now a first-class corpus source:

// Tagged at creation time
metadata: {
    "source": "cli_create",
    "description": "...",
    "authority_source": "...",
    "category": "..."
}

// Discovered by CliCreatedBuilder
impl AsyncCorpusBuilder for CliCreatedBuilder {
    async fn build(...) -> Vec<Assertion> {
        // Scan corpus DB
        // Filter by metadata: "source": "cli_create"
        // Return assertions
    }
}
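
The filtering step inside the builder might look roughly like the sketch below. The `Assertion` struct here is a simplified stand-in with assumed fields, and the function is synchronous for brevity; only the "source": "cli_create" tag comes from the description above.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real assertion type (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub metadata: HashMap<String, String>,
}

/// Keeps only assertions tagged with metadata "source": "cli_create".
pub fn cli_created(assertions: &[Assertion]) -> Vec<Assertion> {
    assertions
        .iter()
        .filter(|a| a.metadata.get("source").map(String::as_str) == Some("cli_create"))
        .cloned()
        .collect()
}
```

Filtering on the metadata tag rather than on the URI scheme means an item keeps its inferred scheme (for example rfc:// or community://) and is still recognized as CLI-created.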

Now They Appear

aphoria corpus list
# Available corpus sources:
#   rfc:// (Tier 0) - RFC
#   owasp:// (Tier 1) - OWASP
#   vendor:// (Tier 2) - VendorDocs
#   cli:// (Tier 3) - CLI-Created Items  ← NEW

aphoria corpus build
# Corpus build complete:
#   Total assertions: 157
#   CLI-Created Items: 3 assertions  ← NEW

Querying CLI-Created Items

# Via API
curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=cli'

# Via Dashboard
# Navigate to: http://localhost:3000/corpus
# Filter by "CLI-Created" source
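
The query above uses bracket notation for array parameters. As a rough illustration of what such a parser has to do, here is a std-only sketch; it is not the API's QsQuery extractor (which is described as built on serde_qs), and the function name is hypothetical. It accepts raw brackets and the percent-encoded form that browsers send, and it does not percent-decode values.

```rust
/// Collects the values of an array parameter written as `name[]=value`.
/// Handles both `sources[]=a` and `sources%5B%5D=a`.
pub fn array_param(query: &str, name: &str) -> Vec<String> {
    let raw_key = format!("{name}[]");
    let encoded_key = format!("{name}%5B%5D");
    query
        .trim_start_matches('?')
        .split('&')
        // Ignore fragments that have no `=`.
        .filter_map(|pair| pair.split_once('='))
        .filter(|(key, _)| *key == raw_key || key.eq_ignore_ascii_case(&encoded_key))
        .map(|(_, value)| value.to_string())
        .collect()
}
```

For example, `array_param("?sources[]=community&sources[]=rfc", "sources")` collects both source names in order.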

Git Hooks for Binary Rebuilds

The Problem

Developer workflow:

  1. git pull (gets CLI definition changes)
  2. Run aphoria corpus create
  3. Error: "unrecognized subcommand 'create'"
  4. Confusion, time wasted
  5. Realize binary is stale: cargo build --release -p aphoria

The Solution

Automatic rebuild hooks:

# .git/hooks/post-merge
if git diff-tree ... | grep -q "^applications/aphoria/src/cli"; then
    echo "🔧 CLI changed, rebuilding aphoria..."
    cargo build --release -p aphoria
fi

Installed Hooks

  • post-merge - After git pull or git merge
  • post-checkout - After git checkout <branch>
  • post-rewrite - After git rebase

What Triggers Rebuild

  • Aphoria CLI: applications/aphoria/src/cli/
  • API handlers: crates/stemedb-api/src/
  • Simulator: crates/stemedb-sim/src/
  • Core libraries: crates/stemedb-*
  • Dependencies: Cargo.toml changes
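
The trigger list amounts to a path-to-target mapping. A minimal sketch follows, assuming each changed path is checked against the most specific prefix first; the `Rebuild` enum and `rebuild_for` function are hypothetical names, and the real hooks are shell scripts that may group paths differently.

```rust
/// Rebuild targets suggested by the trigger list (names are illustrative).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Rebuild {
    Aphoria,
    Api,
    Simulator,
    Workspace,
}

/// Maps one changed path to the rebuild it would trigger, if any.
pub fn rebuild_for(path: &str) -> Option<Rebuild> {
    if path.starts_with("applications/aphoria/src/cli/") {
        Some(Rebuild::Aphoria)
    } else if path.starts_with("crates/stemedb-api/src/") {
        Some(Rebuild::Api)
    } else if path.starts_with("crates/stemedb-sim/src/") {
        Some(Rebuild::Simulator)
    } else if path.starts_with("crates/stemedb-") || path.ends_with("Cargo.toml") {
        // Other core crates and dependency changes: rebuild everything (an assumption).
        Some(Rebuild::Workspace)
    } else {
        None
    }
}
```

Checking the specific prefixes before the crates/stemedb-* catch-all keeps an API-only change from being reported as a full workspace rebuild.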

Installation

Hooks live in .git/hooks/, which git does not track. To check whether they are present in a new clone:

cd /home/jml/Workspace/stemedb  # adjust to your clone path
ls -la .git/hooks/post-*

# If missing, check GIT-HOOKS-IMPLEMENTATION.md for setup

Bypass Hook (Emergency)

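# Temporarily disable all hooks for one command
# (--no-verify only skips pre-merge-commit and commit-msg, not post-merge)
git -c core.hooksPath=/dev/null pull

# Or set the env var (read by the hook scripts, not by git itself)
GIT_HOOKS_DISABLE=1 git pull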

Deployment Configurations

Local Development

Aphoria:

# Default: uses ~/.aphoria/corpus-db/
aphoria corpus create ...
aphoria corpus build

StemeDB API:

# Point to Aphoria's corpus
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
cargo run --release -p stemedb-api

Docker Compose

version: '3.8'

volumes:
  corpus-db:

services:
  stemedb-api:
    image: stemedb-api:latest
    environment:
      - STEMEDB_CORPUS_DB_DIR=/var/lib/stemedb/corpus
    volumes:
      - corpus-db:/var/lib/stemedb/corpus
    ports:
      - "18180:18180"

  aphoria-builder:
    image: aphoria:latest
    volumes:
      - corpus-db:/var/lib/stemedb/corpus
      - ./aphoria-config.toml:/etc/aphoria/config.toml
    command: corpus build

Kubernetes

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: corpus-db
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stemedb-api
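spec:
  selector:
    matchLabels:
      app: stemedb-api  # label name is an assumption; a Deployment requires a selector
  template:
    metadata:
      labels:
        app: stemedb-api
    spec: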
      containers:
      - name: api
        image: stemedb-api:latest
        env:
        - name: STEMEDB_CORPUS_DB_DIR
          value: /var/lib/stemedb/corpus
        volumeMounts:
        - name: corpus-db
          mountPath: /var/lib/stemedb/corpus
      volumes:
      - name: corpus-db
        persistentVolumeClaim:
          claimName: corpus-db

Production (Bare Metal)

# 1. Create shared corpus directory
sudo mkdir -p /var/lib/stemedb/corpus
sudo chown aphoria:stemedb /var/lib/stemedb/corpus
sudo chmod 775 /var/lib/stemedb/corpus

# 2. Configure Aphoria (tee runs as root; a plain `>` redirect would not)
sudo mkdir -p /etc/aphoria
sudo tee /etc/aphoria/config.toml > /dev/null <<EOF
[episteme]
corpus_data_dir = "/var/lib/stemedb/corpus"
EOF

# 3. Configure StemeDB API
sudo tee /etc/systemd/system/stemedb-api.service > /dev/null <<EOF
[Service]
Environment="STEMEDB_CORPUS_DB_DIR=/var/lib/stemedb/corpus"
ExecStart=/usr/local/bin/stemedb-api
User=stemedb
Group=stemedb
EOF

# 4. Reload unit files and start the service
sudo systemctl daemon-reload
sudo systemctl start stemedb-api

Integration Patterns

Pattern A: API-First (Read-Only Corpus)

Use case: Dashboard-driven architecture, corpus rarely changes.

Workflow:
1. Ops team creates corpus items via CLI
2. API serves them to dashboard
3. Developers view in dashboard (read-only)

Database:
- Aphoria: ~/.aphoria/corpus-db/ (write)
- API: points to Aphoria DB (read)

Config:

# API
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"

Pattern B: CLI-First (Frequent Corpus Updates)

Use case: Active corpus curation, frequent CLI usage.

Workflow:
1. Developers create corpus items via CLI
2. CLI builds corpus
3. API/dashboard reflect latest corpus

Database:
- Aphoria: /var/lib/stemedb/corpus (write)
- API: /var/lib/stemedb/corpus (read)

Config:

# .aphoria/config.toml
[episteme]
corpus_data_dir = "/var/lib/stemedb/corpus"

# API
export STEMEDB_CORPUS_DB_DIR="/var/lib/stemedb/corpus"

Pattern C: Hybrid (Separate Stores + Sync)

Use case: Different corpus items in different stores.

Workflow:
1. Aphoria: authoritative corpus (RFC, OWASP, CLI-created)
2. API: ephemeral assertions from scans
3. Periodic sync or query union

Database:
- Aphoria: ~/.aphoria/corpus-db/
- API: data/db/
- Sync: manual or scheduled

Sync (when implemented):

# Planned
aphoria corpus sync --to-api --api-db-dir data/db

Troubleshooting

"Items created but not visible in API"

Symptom:

aphoria corpus create --subject "test" ...
# Created corpus item: corpus://test/enabled

curl 'http://localhost:18180/v1/aphoria/corpus'
# {"items":[], "total_matching": 0}

Diagnosis:

# Check API config
env | grep STEMEDB_CORPUS_DB_DIR
# If empty, API is using data/db/

# Check Aphoria corpus DB
ls -la ~/.aphoria/corpus-db/
# Should see fjall/, redb/, wal/

Fix:

export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
# Restart API
pkill -f stemedb-api
stemedb-api &

"Command not found after git pull"

Symptom:

git pull
aphoria corpus create ...
# error: unrecognized subcommand 'create'

Diagnosis:

# Check binary date
ls -lh target/release/aphoria
# -rwxr-xr-x ... Jan 15 10:00 aphoria

# Check CLI code date
ls -lh applications/aphoria/src/cli/mod.rs
# -rw-r--r-- ... Feb 09 14:30 mod.rs  ← Newer!
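
The date comparison above can also be scripted. The sketch below assumes that file modification times are a good enough signal of staleness; the function names are hypothetical. Note that git sets a file's modification time when it writes the file during a checkout or pull, so this is a heuristic rather than a guarantee.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// A binary is considered stale when the source was modified after it was built.
pub fn is_stale(binary_mtime: SystemTime, source_mtime: SystemTime) -> bool {
    source_mtime > binary_mtime
}

/// Compares the modification times of two files on disk.
/// Returns an error if either file is missing or its mtime is unavailable.
pub fn binary_is_stale(binary: &Path, source: &Path) -> io::Result<bool> {
    let binary_mtime = fs::metadata(binary)?.modified()?;
    let source_mtime = fs::metadata(source)?.modified()?;
    Ok(is_stale(binary_mtime, source_mtime))
}
```

Called with target/release/aphoria and applications/aphoria/src/cli/mod.rs, it would report true for the listing shown in the diagnosis.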

Fix:

# Rebuild
cargo build --release -p aphoria

# Or check if hooks are installed
ls -la .git/hooks/post-merge
# Should be executable and contain rebuild logic

"Corpus items have wrong URI scheme"

Symptom:

aphoria corpus create \
  --subject "tls/validation" \
  --authority "RFC 5280" \
  --tier 0

# API query returns nothing
curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=rfc'
# {"items":[]}

Diagnosis:

# Check stored subject (via debug scan)
aphoria scan --show-observations | grep tls
# If shows: subject:tls/validation (no rfc://)
# Then URI inference didn't work

Fix: Rebuild the aphoria binary (URI inference was added in a recent version):

cargo build --release -p aphoria

"Dashboard shows duplicate corpus items"

Symptom: Dashboard displays same item multiple times.

Diagnosis:

# Check if corpus built multiple times
aphoria corpus build --verbose
# Look for same assertion appearing under multiple builders

Cause: CLI-created items might also match RFC/OWASP builders if they have matching metadata.

Fix: This is expected behavior if:

  1. Item was created via CLI with RFC authority
  2. RFC builder also fetches it from RFC source
  3. Both versions appear in corpus

To deduplicate, ensure CLI-created items use unique subjects or authorities that don't overlap with fetched sources.
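
One way to spot the overlap before it reaches the dashboard is to group subjects by the builder that produced them. This is a sketch only: it assumes you can obtain (builder name, subject) pairs, for example from the verbose build output, and `overlapping_subjects` is a hypothetical helper rather than part of Aphoria.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns the subjects produced by more than one builder, with the builder names.
/// Input pairs are (builder name, subject).
pub fn overlapping_subjects(items: &[(&str, &str)]) -> BTreeMap<String, BTreeSet<String>> {
    let mut by_subject: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for (builder, subject) in items {
        by_subject
            .entry(subject.to_string())
            .or_default()
            .insert(builder.to_string());
    }
    // Keep only subjects that more than one builder emitted.
    by_subject.retain(|_, builders| builders.len() > 1);
    by_subject
}
```

Any subject the function returns is a candidate for renaming on the CLI side so it no longer collides with a fetched source.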


Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                    Aphoria CLI                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  aphoria corpus create                                  │
│       │                                                 │
│       ├─► infer_subject_uri()                          │
│       │   (RFC/OWASP/CWE → scheme)                     │
│       │                                                 │
│       ├─► create_corpus_item()                         │
│       │   metadata: "source": "cli_create"             │
│       │                                                 │
│       └─► Store: ~/.aphoria/corpus-db/                 │
│            Key: "subject:rfc://tls/validation"         │
│                                                         │
│  aphoria corpus build                                   │
│       │                                                 │
│       ├─► HardcodedBuilder                             │
│       ├─► RfcBuilder (network)                         │
│       ├─► OwaspBuilder (network)                       │
│       ├─► VendorDocsBuilder                            │
│       └─► CliCreatedBuilder ← NEW                      │
│            Filter: "source": "cli_create"              │
│                                                         │
└─────────────────────────────────────────────────────────┘
                         │
                         │ Shared Database
                         ↓
┌─────────────────────────────────────────────────────────┐
│              ~/.aphoria/corpus-db/                      │
│                                                         │
│  subject:rfc://tls/validation → Assertion              │
│  subject:owasp://password/storage → Assertion          │
│  subject:community://api/rest → Assertion              │
│                                                         │
└─────────────────────────────────────────────────────────┘
                         ↑
                         │ STEMEDB_CORPUS_DB_DIR
                         │
┌─────────────────────────────────────────────────────────┐
│                  StemeDB API                            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  GET /v1/aphoria/corpus?sources[]=rfc                  │
│       │                                                 │
│       └─► corpus_store.scan_prefix("subject:rfc://")   │
│            ↓                                            │
│            Returns: RFC assertions                      │
│                                                         │
└─────────────────────────────────────────────────────────┘
                         │
                         │ HTTP
                         ↓
┌─────────────────────────────────────────────────────────┐
│              Aphoria Dashboard                          │
│                                                         │
│  Filter: [RFC] [OWASP] [CLI-Created]                   │
│  ┌─────────────────────────────────┐                   │
│  │ rfc://tls/validation            │                   │
│  │ Tier 0 | Security                │                   │
│  │ TLS cert verification MUST...   │                   │
│  └─────────────────────────────────┘                   │
│                                                         │
└─────────────────────────────────────────────────────────┘

See Also