## Problem CLI-created community corpus items (tier 3) were stored correctly but invisible via API queries. Two issues blocked discoverability: 1. **Prefix mismatch**: API hardcoded 'community://pattern/' for aggregated patterns, but CLI creates 'community://rust/http/...' URIs 2. **Query parameter parsing**: Axum's default parser doesn't support bracket notation (?sources[]=value) used by the dashboard Result: 0/22 CLI-created items were queryable. ## Solution ### Fix 1: Broaden Community Prefix - Changed: 'community://pattern/' → 'community://' in corpus handler - Impact: Now matches both aggregated patterns AND CLI-created items - Backward compatible: Broader prefix includes narrower results ### Fix 2: Add QsQuery Extractor - Added: serde_qs dependency + custom QsQuery extractor - Supports: Bracket notation for array parameters (?sources[]=a&sources[]=b) - Compatible: Works with JavaScript URLSearchParams standard - Tested: 3 new unit tests for extractor behavior ## Verification - ✅ All 22 CLI-created community items now queryable (was 0) - ✅ Source filtering works: community (22), RFC (2), vendor (5) - ✅ Multi-source queries work: ?sources[]=community&sources[]=rfc → 24 - ✅ All 89 API tests pass + 3 new extractor tests - ✅ Clippy clean (0 warnings) - ✅ No regressions in existing functionality ## Files Changed - crates/stemedb-api/Cargo.toml: Add serde_qs dependency - crates/stemedb-api/src/extractors.rs: New QsQuery extractor (117 lines) - crates/stemedb-api/src/handlers/aphoria/corpus.rs: Use QsQuery, broaden prefix - crates/stemedb-api/src/lib.rs: Export extractors module Also includes: Scale-adaptive thresholds, wiki corpus extraction, documentation updates, and dashboard UI improvements from prior work. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
18 KiB
Corpus Database Architecture
Audience: Engineers integrating Aphoria with StemeDB API, ops teams deploying both systems.
What you'll learn:
- How Aphoria's corpus database integrates with StemeDB API
- URI scheme inference for authoritative sources
- Where CLI-created corpus items live
- Git hooks for automatic binary rebuilds
- Production deployment patterns
Quick Reference
# Aphoria CLI writes to:
~/.aphoria/corpus-db/
# StemeDB API reads from:
data/db/ # Default, or configure STEMEDB_CORPUS_DB_DIR
# Make API see Aphoria corpus:
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
stemedb-api
Database Separation
The Problem
Aphoria and StemeDB API use separate databases:
Aphoria CLI:
└─ corpus create/build → ~/.aphoria/corpus-db/
StemeDB API:
└─ GET /v1/aphoria/corpus → data/db/
Result: Items created via CLI aren't visible in API/Dashboard
The Solution
Three integration patterns:
Pattern 1: Shared Database (Recommended for Development)
Point API to Aphoria's corpus database:
# .env
STEMEDB_CORPUS_DB_DIR=/home/user/.aphoria/corpus-db
# Start API
cargo run --release -p stemedb-api
Pros:
- Zero synchronization needed
- Single source of truth
- Changes immediately visible
Cons:
- API has read-only access (can't write to corpus)
- Not suitable if API needs to write corpus items
Pattern 2: Unified Database (Recommended for Production)
Use shared directory for both:
# Create shared directory
sudo mkdir -p /var/lib/stemedb/corpus
sudo chown aphoria:stemedb /var/lib/stemedb/corpus
sudo chmod 775 /var/lib/stemedb/corpus
# .aphoria/config.toml
[episteme]
corpus_data_dir = "/var/lib/stemedb/corpus"
# StemeDB API
export STEMEDB_CORPUS_DB_DIR="/var/lib/stemedb/corpus"
Pros:
- Single database, no sync
- Both systems have write access
- Production-ready pattern
Cons:
- Requires deployment coordination
- Permissions management needed
Pattern 3: Sync Mechanism (Future)
# Planned (not yet implemented)
aphoria corpus sync --to-api --api-db-dir data/db
Use case: When databases must remain separate.
URI Scheme Inference
The Problem
Corpus items need URI-schemed subjects for API prefix scanning:
# Without URI scheme (won't work):
subject: "tls/certificate_verification"
# API queries:
curl '/v1/aphoria/corpus?sources[]=rfc'
# Scans for "subject:rfc://" → doesn't match plain subjects
The Solution
Automatic URI inference based on authority and tier:
// In aphoria corpus create
Authority: "RFC 5246 Section 7.4.2"
Tier: 0
// Auto-inferred:
subject_uri: "rfc://tls/certificate_verification"
Inference Rules
| Condition | Scheme | Example |
|---|---|---|
Already has :// |
Preserved | rfc://test → rfc://test |
| Authority contains "rfc" (case-insensitive) | rfc:// |
"RFC 5280" → rfc://... |
| Authority contains "owasp" | owasp:// |
"OWASP Top 10" → owasp://... |
| Authority contains "cwe" | cwe:// |
"CWE-120" → cwe://... |
| Tier 2 | vendor:// |
GitHub docs → vendor://... |
| Tier 3 | community:// |
Team wiki → community://... |
| Tier 0/1 unrecognized | corpus:// |
Unknown → corpus://... |
Priority: Authority matching > Tier-based > Fallback
Examples
# RFC claim (tier 0)
aphoria corpus create \
--subject "tls/validation" \
--authority "RFC 5280 Section 6.1" \
--tier 0
# Stored as: subject:rfc://tls/validation
# OWASP claim (tier 1)
aphoria corpus create \
--subject "password/storage" \
--authority "OWASP Password Storage Cheat Sheet" \
--tier 1
# Stored as: subject:owasp://password/storage
# Vendor docs (tier 2)
aphoria corpus create \
--subject "postgresql/connection_pool" \
--authority "PostgreSQL Documentation" \
--tier 2
# Stored as: subject:vendor://postgresql/connection_pool
# Community (tier 3)
aphoria corpus create \
--subject "api/rest/pagination" \
--authority "Team wiki: API standards" \
--tier 3
# Stored as: subject:community://api/rest/pagination
# Already schemed (preserved)
aphoria corpus create \
--subject "custom://myapp/feature" \
--authority "Internal spec" \
--tier 2
# Stored as: subject:custom://myapp/feature
CLI-Created Corpus Source
The Problem
Items created with aphoria corpus create weren't visible in:
aphoria corpus list
# Showed: RFC, OWASP, VendorDocs
# Missing: CLI-created items
aphoria corpus build
# Total assertions: 86
# Missing: CLI-created items
The Solution
CLI-created items are now a first-class corpus source:
// Tagged at creation time
metadata: {
"source": "cli_create",
"description": "...",
"authority_source": "...",
"category": "..."
}
// Discovered by CliCreatedBuilder
impl AsyncCorpusBuilder for CliCreatedBuilder {
async fn build(...) -> Vec<Assertion> {
// Scan corpus DB
// Filter by metadata: "source": "cli_create"
// Return assertions
}
}
Now They Appear
aphoria corpus list
# Available corpus sources:
# rfc:// (Tier 0) - RFC
# owasp:// (Tier 1) - OWASP
# vendor:// (Tier 2) - VendorDocs
# cli:// (Tier 3) - CLI-Created Items ← NEW
aphoria corpus build
# Corpus build complete:
# Total assertions: 157
# CLI-Created Items: 3 assertions ← NEW
Querying CLI-Created Items
# Via API
curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=cli'
# Via Dashboard
# Navigate to: http://localhost:3000/corpus
# Filter by "CLI-Created" source
Git Hooks for Binary Rebuilds
The Problem
Developer workflow:
git pull(gets CLI definition changes)- Run
aphoria corpus create - Error: "unrecognized subcommand 'create'"
- Confusion, time wasted
- Realize binary is stale:
cargo build --release -p aphoria
The Solution
Automatic rebuild hooks:
# .git/hooks/post-merge
if git diff-tree ... | grep -q "^applications/aphoria/src/cli"; then
echo "🔧 CLI changed, rebuilding aphoria..."
cargo build --release -p aphoria
fi
Installed Hooks
post-merge - After git pull or git merge
post-checkout - After git checkout <branch>
post-rewrite - After git rebase
What Triggers Rebuild
- Aphoria CLI:
applications/aphoria/src/cli/ - API handlers:
crates/stemedb-api/src/ - Simulator:
crates/stemedb-sim/src/ - Core libraries:
crates/stemedb-* - Dependencies:
Cargo.tomlchanges
Installation
Hooks are in .git/hooks/ (not tracked by git). To install on new clone:
cd /home/jml/Workspace/stemedb
ls -la .git/hooks/post-*
# If missing, check GIT-HOOKS-IMPLEMENTATION.md for setup
Bypass Hook (Emergency)
# Temporarily disable all hooks
git pull --no-verify
# Or set env var
GIT_HOOKS_DISABLE=1 git pull
Deployment Configurations
Local Development
Aphoria:
# Default: uses ~/.aphoria/corpus-db/
aphoria corpus create ...
aphoria corpus build
StemeDB API:
# Point to Aphoria's corpus
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
cargo run --release -p stemedb-api
Docker Compose
version: '3.8'
volumes:
corpus-db:
services:
stemedb-api:
image: stemedb-api:latest
environment:
- STEMEDB_CORPUS_DB_DIR=/var/lib/stemedb/corpus
volumes:
- corpus-db:/var/lib/stemedb/corpus
ports:
- "18180:18180"
aphoria-builder:
image: aphoria:latest
volumes:
- corpus-db:/var/lib/stemedb/corpus
- ./aphoria-config.toml:/etc/aphoria/config.toml
command: corpus build
Kubernetes
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: corpus-db
spec:
accessModes: [ReadWriteMany]
resources:
requests:
storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: stemedb-api
spec:
template:
spec:
containers:
- name: api
image: stemedb-api:latest
env:
- name: STEMEDB_CORPUS_DB_DIR
value: /var/lib/stemedb/corpus
volumeMounts:
- name: corpus-db
mountPath: /var/lib/stemedb/corpus
volumes:
- name: corpus-db
persistentVolumeClaim:
claimName: corpus-db
Production (Bare Metal)
# 1. Create shared corpus directory
sudo mkdir -p /var/lib/stemedb/corpus
sudo chown aphoria:stemedb /var/lib/stemedb/corpus
sudo chmod 775 /var/lib/stemedb/corpus
# 2. Configure Aphoria
cat > /etc/aphoria/config.toml <<EOF
[episteme]
corpus_data_dir = "/var/lib/stemedb/corpus"
EOF
# 3. Configure StemeDB API
cat > /etc/systemd/system/stemedb-api.service <<EOF
[Service]
Environment="STEMEDB_CORPUS_DB_DIR=/var/lib/stemedb/corpus"
ExecStart=/usr/local/bin/stemedb-api
User=stemedb
Group=stemedb
EOF
# 4. Start services
systemctl start stemedb-api
Integration Patterns
Pattern A: API-First (Read-Only Corpus)
Use case: Dashboard-driven architecture, corpus rarely changes.
Workflow:
1. Ops team creates corpus items via CLI
2. API serves them to dashboard
3. Developers view in dashboard (read-only)
Database:
- Aphoria: ~/.aphoria/corpus-db/ (write)
- API: points to Aphoria DB (read)
Config:
# API
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
Pattern B: CLI-First (Frequent Corpus Updates)
Use case: Active corpus curation, frequent CLI usage.
Workflow:
1. Developers create corpus items via CLI
2. CLI builds corpus
3. API/dashboard reflect latest corpus
Database:
- Aphoria: /var/lib/stemedb/corpus (write)
- API: /var/lib/stemedb/corpus (read)
Config:
# .aphoria/config.toml
[episteme]
corpus_data_dir = "/var/lib/stemedb/corpus"
# API
export STEMEDB_CORPUS_DB_DIR="/var/lib/stemedb/corpus"
Pattern C: Hybrid (Separate Stores + Sync)
Use case: Different corpus items in different stores.
Workflow:
1. Aphoria: authoritative corpus (RFC, OWASP, CLI-created)
2. API: ephemeral assertions from scans
3. Periodic sync or query union
Database:
- Aphoria: ~/.aphoria/corpus-db/
- API: data/db/
- Sync: manual or scheduled
Sync (when implemented):
# Planned
aphoria corpus sync --to-api --api-db-dir data/db
Troubleshooting
"Items created but not visible in API"
Symptom:
aphoria corpus create --subject "test" ...
# Created corpus item: corpus://test/enabled
curl 'http://localhost:18180/v1/aphoria/corpus'
# {"items":[], "total_matching": 0}
Diagnosis:
# Check API config
env | grep STEMEDB_CORPUS_DB_DIR
# If empty, API is using data/db/
# Check Aphoria corpus DB
ls -la ~/.aphoria/corpus-db/
# Should see fjall/, redb/, wal/
Fix:
export STEMEDB_CORPUS_DB_DIR="$HOME/.aphoria/corpus-db"
# Restart API
pkill -f stemedb-api
stemedb-api &
"Command not found after git pull"
Symptom:
git pull
aphoria corpus create ...
# error: unrecognized subcommand 'create'
Diagnosis:
# Check binary date
ls -lh target/release/aphoria
# -rwxr-xr-x ... Jan 15 10:00 aphoria
# Check CLI code date
ls -lh applications/aphoria/src/cli/mod.rs
# -rw-r--r-- ... Feb 09 14:30 mod.rs ← Newer!
Fix:
# Rebuild
cargo build --release -p aphoria
# Or check if hooks are installed
ls -la .git/hooks/post-merge
# Should be executable and contain rebuild logic
"Corpus items have wrong URI scheme"
Symptom:
aphoria corpus create \
--subject "tls/validation" \
--authority "RFC 5280" \
--tier 0
# API query fails
curl '/v1/aphoria/corpus?sources[]=rfc'
# {"items":[]}
Diagnosis:
# Check stored subject (via debug scan)
aphoria scan --show-observations | grep tls
# If shows: subject:tls/validation (no rfc://)
# Then URI inference didn't work
Fix: Rebuild aphoria binary (URI inference added in recent version):
cargo build --release -p aphoria
"Dashboard shows duplicate corpus items"
Symptom: Dashboard displays same item multiple times.
Diagnosis:
# Check if corpus built multiple times
aphoria corpus build --verbose
# Look for same assertion appearing under multiple builders
Cause: CLI-created items might also match RFC/OWASP builders if they have matching metadata.
Fix: This is expected behavior if:
- Item was created via CLI with RFC authority
- RFC builder also fetches it from RFC source
- Both versions appear in corpus
To deduplicate, ensure CLI-created items use unique subjects or authorities that don't overlap with fetched sources.
Architecture Diagram
┌─────────────────────────────────────────────────────────┐
│ Aphoria CLI │
├─────────────────────────────────────────────────────────┤
│ │
│ aphoria corpus create │
│ │ │
│ ├─► infer_subject_uri() │
│ │ (RFC/OWASP/CWE → scheme) │
│ │ │
│ ├─► create_corpus_item() │
│ │ metadata: "source": "cli_create" │
│ │ │
│ └─► Store: ~/.aphoria/corpus-db/ │
│ Key: "subject:rfc://tls/validation" │
│ │
│ aphoria corpus build │
│ │ │
│ ├─► HardcodedBuilder │
│ ├─► RfcBuilder (network) │
│ ├─► OwaspBuilder (network) │
│ ├─► VendorDocsBuilder │
│ └─► CliCreatedBuilder ← NEW │
│ Filter: "source": "cli_create" │
│ │
└─────────────────────────────────────────────────────────┘
│
│ Shared Database
↓
┌─────────────────────────────────────────────────────────┐
│ ~/.aphoria/corpus-db/ │
│ │
│ subject:rfc://tls/validation → Assertion │
│ subject:owasp://password/storage → Assertion │
│ subject:community://api/rest → Assertion │
│ │
└─────────────────────────────────────────────────────────┘
↑
│ STEMEDB_CORPUS_DB_DIR
│
┌─────────────────────────────────────────────────────────┐
│ StemeDB API │
├─────────────────────────────────────────────────────────┤
│ │
│ GET /v1/aphoria/corpus?sources[]=rfc │
│ │ │
│ └─► corpus_store.scan_prefix("subject:rfc://") │
│ ↓ │
│ Returns: RFC assertions │
│ │
└─────────────────────────────────────────────────────────┘
│
│ HTTP
↓
┌─────────────────────────────────────────────────────────┐
│ Aphoria Dashboard │
│ │
│ Filter: [RFC] [OWASP] [CLI-Created] │
│ ┌─────────────────────────────────┐ │
│ │ rfc://tls/validation │ │
│ │ Tier 0 | Security │ │
│ │ TLS cert verification MUST... │ │
│ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
See Also
- CLI Reference - Complete command reference
- Configuration Reference - Configuration file reference
- README - Quickstart and key concepts
- Comparison Modes - Deep dive on verification logic
- Scale-Adaptive Thresholds - Community corpus thresholds