stemedb/applications/aphoria/roadmap.md
jordan a734be3a0d feat: Phase 7 Content Defense + code structure refactoring
Content Defense (Phase 7):
- Add SimilarityIndex with MinHash/LSH for near-duplicate detection
- Add QuarantineStore for flagged assertions awaiting admin review
- Add CircuitBreakerStore for per-agent circuit breaker state
- Add ContentDefenseLayer for ingestion pipeline integration
- Add API endpoints for quarantine and circuit breaker management
- Add research module with gap detection and documentation fetching

Code Structure Improvements:
- Extract research CLI commands to research_commands.rs
- Extract API routers to routers.rs module
- Extract key_codec extraction functions to separate module
- Extract test modules to separate files across multiple crates
- All files now under 500 line limit per pre-commit hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 12:44:05 -07:00


Aphoria Roadmap


Phase 0: StemeDB Foundation

Tracked in: roadmap.md § 5D. Concept Hierarchy

Changes to the core database that Aphoria depends on. Shipped as Phase 5D of the main StemeDB roadmap.

| Aphoria Phase 0 | StemeDB Phase 5D | Status |
|---|---|---|
| 0.1 ConceptPath Type | 5D.1 ConceptPath Type | |
| 0.2 ConceptPath in Assertion | (implicit in 5D.1) | |
| 0.3 Hierarchical Index | 5D.4 Hierarchical Query | |
| 0.4 Alias Store | 5D.3 Alias Store + 5D.5 Alias Resolution | |
| 0.5 Source Class Inference | 5D.6 Source Class Inference | |
| 0.6 Concept API Endpoints | 5D.7 Concept API Endpoints | |

Spec: docs/specs/concept-hierarchy.md


Phase 2: CLI Core

Phase 2 was built before Phase 1 (authoritative corpus expansion). The CLI pipeline works end-to-end with a bootstrapped corpus of 11 hardcoded assertions covering TLS, JWT, CORS, secrets, and rate limiting.

| Task | Status |
|---|---|
| 2.1 Project Walker | walker/mod.rs, walker/path_mapper.rs, walker/language.rs |
| 2.2 Extractors (7) | tls_verify, jwt_config, hardcoded_secrets, timeout_config, dep_versions, cors_config, rate_limit |
| 2.3 Ingestion Bridge | bridge.rs: BLAKE3 hashing, Ed25519 signing, claim→assertion conversion |
| 2.4 Conflict Query | episteme.rs: LocalEpisteme with check_conflicts() |
| 2.5 Report Output | report/: table (comfy-table), JSON, SARIF 2.1.0, markdown |
| 2.6 Acknowledge Command | lib.rs acknowledge() |
| Baseline & Diff | lib.rs set_baseline(), show_diff() |
| Status Command | lib.rs show_status() |

118 tests pass. Clippy and fmt clean.


Phase 2A: Concept Matching

Status: Complete. Tail-path matching (2A.1), alias-aware queries (2A.2), and auto-alias creation (2A.3) all implemented.

2A.1 Leaf-Based Concept Matching (Aphoria-side fix)

Implemented in episteme.rs via ConceptIndex:

  • make_key(subject, predicate) extracts tail 2 path segments + predicate
  • build(assertions) creates in-memory index keyed by tail path
  • lookup(subject, predicate) finds matching authoritative assertions
  • check_conflicts() uses ConceptIndex instead of QueryEngine for cross-scheme matching

Integration tests prove TLS and JWT conflicts are detected correctly.
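
The tail-path idea above can be sketched as follows. This is a minimal illustration, not the code in episteme.rs: the `#` key separator, the `required` predicate, and the choice to store plain subject strings as index values are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Build a lookup key from the last two path segments of a subject plus the predicate.
/// The scheme prefix (`code://`, `rfc://`, ...) is dropped so different schemes can match.
/// The `#` separator is an assumption of this sketch.
fn make_key(subject: &str, predicate: &str) -> String {
    let path = subject.split_once("://").map(|(_, rest)| rest).unwrap_or(subject);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let start = segments.len().saturating_sub(2);
    format!("{}#{}", segments[start..].join("/"), predicate)
}

/// In-memory index keyed by tail path; values are the full authoritative subjects.
struct ConceptIndex {
    by_tail: HashMap<String, Vec<String>>,
}

impl ConceptIndex {
    /// Index every (subject, predicate) pair under its tail-path key.
    fn build(assertions: &[(&str, &str)]) -> Self {
        let mut by_tail: HashMap<String, Vec<String>> = HashMap::new();
        for (subject, predicate) in assertions {
            by_tail
                .entry(make_key(subject, predicate))
                .or_default()
                .push(subject.to_string());
        }
        ConceptIndex { by_tail }
    }

    /// Return the authoritative subjects that share the tail path and predicate.
    fn lookup(&self, subject: &str, predicate: &str) -> &[String] {
        self.by_tail
            .get(&make_key(subject, predicate))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}

fn main() {
    let index = ConceptIndex::build(&[("rfc://5246/tls/cert_verification", "required")]);
    let hits = index.lookup("code://rust/myapp/tls/cert_verification", "required");
    println!("{:?}", hits);
}
```

Because the key ignores the scheme and the leading path segments, a `code://` claim and an `rfc://` assertion land in the same bucket whenever their last two segments and predicate agree.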

2A.2 Alias Resolution in QueryEngine (StemeDB-side fix)

Wired AliasStore into QueryEngine.execute():

  • Added resolve_aliases: bool field to Query (defaults to false)
  • Added alias_store: Option<Arc<dyn AliasStore>> to QueryEngine
  • Added .with_alias_store() builder method
  • When resolve_aliases: true, expands subject via AliasStore.resolve_all() before index lookup
  • Added fetch_by_subjects() and fetch_by_subjects_predicate() for multi-subject deduplication
  • Modified Query.matches() to skip subject filtering when aliases are resolved
  • Skips fast path (MV lookup) when resolve_aliases: true
  • Gracefully degrades when no alias store is configured

7 unit tests in engine/tests/alias_resolution.rs. This is the architecturally correct long-term fix that complements leaf matching.
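
A rough sketch of the expansion step, under stated assumptions: `InMemoryAliasStore` and `subjects_to_fetch` are hypothetical names standing in for the real `AliasStore` trait object and the logic inside `QueryEngine.execute()`. It shows only the behaviour listed above: expand when the flag is set, deduplicate, and fall back to the plain subject when no store is configured.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for an alias store: maps a subject to its known aliases.
struct InMemoryAliasStore {
    aliases: HashMap<String, Vec<String>>,
}

impl InMemoryAliasStore {
    /// Return the subject itself plus every alias recorded for it.
    fn resolve_all(&self, subject: &str) -> Vec<String> {
        let mut out = vec![subject.to_string()];
        if let Some(extra) = self.aliases.get(subject) {
            out.extend(extra.iter().cloned());
        }
        out
    }
}

/// Expand the query subject into the list of subjects to fetch.
fn subjects_to_fetch(
    subject: &str,
    resolve_aliases: bool,
    store: Option<&InMemoryAliasStore>,
) -> Vec<String> {
    match (resolve_aliases, store) {
        (true, Some(s)) => {
            // BTreeSet removes duplicates and gives a stable order.
            let unique: BTreeSet<String> = s.resolve_all(subject).into_iter().collect();
            unique.into_iter().collect()
        }
        // Flag off, or no store configured: degrade to the plain subject.
        _ => vec![subject.to_string()],
    }
}

fn main() {
    let code = "code://rust/myapp/tls/cert_verification";
    let mut aliases = HashMap::new();
    aliases.insert(
        code.to_string(),
        vec!["rfc://5246/tls/cert_verification".to_string()],
    );
    let store = InMemoryAliasStore { aliases };

    println!("{:?}", subjects_to_fetch(code, true, Some(&store)));
    println!("{:?}", subjects_to_fetch(code, true, None));
}
```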

2A.3 Auto-Alias Creation

When Aphoria ingests authoritative assertions and code claims that share leaf names, it automatically creates aliases between them:

  • code://rust/myapp/tls/cert_verification → rfc://5246/tls/cert_verification
  • code://rust/myapp/auth/jwt/audience_validation → rfc://7519/jwt/audience_validation

This bridges 2A.1 (leaf matching) with 2A.2 (alias resolution) — leaf matching identifies candidates, aliases persist the relationship.

Implementation:

  • Added auto_create_aliases: bool config option to AliasConfig (defaults to true)
  • Added AliasOrigin::AutoDetected variant to stemedb-core for tracking auto-created aliases
  • Wired GenericAliasStore into LocalEpisteme for alias persistence
  • In check_conflicts(), when a code claim matches an authoritative claim by leaf, calls AliasStore.set_alias() to persist the relationship with AliasOrigin::AutoDetected
  • Alias creation is idempotent (skips if alias already exists)
  • 4 unit tests verify: alias creation on conflict, no creation when disabled, correct origin, idempotency
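
The idempotency and config-gating rules can be illustrated with a small sketch. `InMemoryAliasStore`, `set_alias_if_absent`, `on_leaf_match`, and the `Manual` variant are hypothetical names for this example; only `AliasOrigin::AutoDetected` and the `auto_create_aliases` flag come from the description above.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AliasOrigin {
    Manual, // assumed variant, used here only to show that existing aliases are kept
    AutoDetected,
}

/// Hypothetical in-memory store: alias source -> (target, origin).
struct InMemoryAliasStore {
    entries: HashMap<String, (String, AliasOrigin)>,
}

impl InMemoryAliasStore {
    /// Write the alias only if none exists yet. Returns true when a new alias was stored.
    fn set_alias_if_absent(&mut self, from: &str, to: &str, origin: AliasOrigin) -> bool {
        if self.entries.contains_key(from) {
            return false; // idempotent: keep whatever is already recorded
        }
        self.entries.insert(from.to_string(), (to.to_string(), origin));
        true
    }
}

/// Called when a code claim matches an authoritative claim by leaf.
fn on_leaf_match(
    store: &mut InMemoryAliasStore,
    auto_create_aliases: bool,
    code_subject: &str,
    authoritative_subject: &str,
) -> bool {
    auto_create_aliases
        && store.set_alias_if_absent(code_subject, authoritative_subject, AliasOrigin::AutoDetected)
}

fn main() {
    let mut store = InMemoryAliasStore { entries: HashMap::new() };
    store.set_alias_if_absent("code://rust/myapp/cors/origin", "owasp://example", AliasOrigin::Manual);

    let code = "code://rust/myapp/tls/cert_verification";
    let rfc = "rfc://5246/tls/cert_verification";
    println!("first scan created alias: {}", on_leaf_match(&mut store, true, code, rfc));
    println!("second scan created alias: {}", on_leaf_match(&mut store, true, code, rfc));
    println!("{:?}", store.entries.get(code));
}
```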

Phase 1: Authoritative Corpus Expansion

Expanded from 11 hardcoded assertions to a pluggable corpus system with RFC, OWASP, and Vendor sources.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     aphoria corpus build                         │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ RFC Ingester │  │ OWASP        │  │ Vendor Bootstrapper   │  │
│  │ (Tier 0)     │  │ Ingester     │  │ (Tier 2)              │  │
│  │              │  │ (Tier 1)     │  │                       │  │
│  └──────┬───────┘  └──────┬───────┘  └───────────┬───────────┘  │
│         │                 │                      │              │
│         └─────────────────┼──────────────────────┘              │
│                           ▼                                     │
│                  ┌─────────────────┐                            │
│                  │ CorpusRegistry  │                            │
│                  └────────┬────────┘                            │
│                           ▼                                     │
│                  ┌─────────────────┐                            │
│                  │ LocalEpisteme   │                            │
│                  │ ingest_         │                            │
│                  │ authoritative() │                            │
│                  └─────────────────┘                            │
└─────────────────────────────────────────────────────────────────┘

1.1 CorpusBuilder Trait

| Task | Status |
|---|---|
| CorpusBuilder trait | corpus/mod.rs: name, scheme, default_tier, build, requires_network |
| CorpusRegistry | Manages multiple builders, build_all(), list_builders() |
| CorpusBuildResult | Stats per builder, total assertions, success/fail/skip counts |
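
To make the relationship between the trait, the registry, and the result type concrete, here is a simplified sketch. The method names follow the table above, but every signature, the `offline` parameter on `build_all`, and the `StubBuilder` type with its values are assumptions for illustration; the real definitions live in corpus/mod.rs and may differ.

```rust
/// Simplified, hypothetical shape of the builder trait.
trait CorpusBuilder {
    fn name(&self) -> &str;
    fn scheme(&self) -> &str;
    fn default_tier(&self) -> u8;
    fn requires_network(&self) -> bool;
    /// Returns the built assertions (plain strings in this sketch) or an error message.
    fn build(&self) -> Result<Vec<String>, String>;
}

#[derive(Debug, Default, PartialEq)]
struct CorpusBuildResult {
    total_assertions: usize,
    succeeded: usize,
    failed: usize,
    skipped: usize,
}

struct CorpusRegistry {
    builders: Vec<Box<dyn CorpusBuilder>>,
}

impl CorpusRegistry {
    fn list_builders(&self) -> Vec<&str> {
        self.builders.iter().map(|b| b.name()).collect()
    }

    /// Run every builder; a failing source is counted and the loop continues.
    fn build_all(&self, offline: bool) -> CorpusBuildResult {
        let mut result = CorpusBuildResult::default();
        for builder in &self.builders {
            if offline && builder.requires_network() {
                result.skipped += 1;
                continue;
            }
            match builder.build() {
                Ok(assertions) => {
                    result.succeeded += 1;
                    result.total_assertions += assertions.len();
                }
                Err(_) => result.failed += 1,
            }
        }
        result
    }
}

/// Toy builder used only to exercise the registry.
struct StubBuilder {
    name: &'static str,
    scheme: &'static str,
    tier: u8,
    network: bool,
    outcome: Result<usize, &'static str>,
}

impl CorpusBuilder for StubBuilder {
    fn name(&self) -> &str { self.name }
    fn scheme(&self) -> &str { self.scheme }
    fn default_tier(&self) -> u8 { self.tier }
    fn requires_network(&self) -> bool { self.network }
    fn build(&self) -> Result<Vec<String>, String> {
        match self.outcome {
            Ok(n) => Ok((0..n).map(|i| format!("{}://stub/{}", self.scheme, i)).collect()),
            Err(e) => Err(e.to_string()),
        }
    }
}

/// Registry with made-up outcomes: one success, one fetch failure, one offline-capable source.
fn sample_registry() -> CorpusRegistry {
    CorpusRegistry {
        builders: vec![
            Box::new(StubBuilder { name: "rfc", scheme: "rfc", tier: 0, network: true, outcome: Ok(3) }),
            Box::new(StubBuilder { name: "owasp", scheme: "owasp", tier: 1, network: true, outcome: Err("fetch failed") }),
            Box::new(StubBuilder { name: "vendor", scheme: "vendor", tier: 2, network: false, outcome: Ok(2) }),
        ],
    }
}

fn main() {
    let registry = sample_registry();
    for b in &registry.builders {
        println!("{} ({}://, tier {})", b.name(), b.scheme(), b.default_tier());
    }
    println!("online:  {:?}", registry.build_all(false));
    println!("offline: {:?}", registry.build_all(true));
}
```

Keeping the skip and failure counts in the result, rather than returning early on the first error, is what lets one unreachable source leave the rest of the corpus intact.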

1.2 RFC Ingester

| Task | Status |
|---|---|
| RfcCorpusBuilder | corpus/rfc.rs |
| HTTP fetching | Via ureq, cached to ~/.cache/aphoria/rfc-cache/ |
| RFC 2119 keyword parsing | MUST, MUST NOT, SHOULD, SHALL extraction |
| RFC-specific parsers | JWT (7519), OAuth (6749), Bearer (6750), TLS 1.3 (8446), TLS BCP (7525), TOTP (6238), Basic Auth (7617), HTTP (9110) |
| Concept mapping | rfc://{number}/{topic} at Tier 0 (Regulatory) |
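
A minimal sketch of RFC 2119 keyword extraction, assuming a plain substring scan over period-delimited sentences. The function name `extract_normative` is hypothetical, and the real parser in corpus/rfc.rs may work differently. The table lists MUST, MUST NOT, SHOULD, and SHALL; this sketch also recognises SHALL NOT and SHOULD NOT (both defined by RFC 2119) so that a negated requirement is not reported as its positive form.

```rust
/// Requirement keywords, ordered so negated forms are tried before their positive prefix.
static KEYWORDS: [&str; 6] = ["MUST NOT", "SHALL NOT", "SHOULD NOT", "MUST", "SHALL", "SHOULD"];

/// Return (keyword, sentence) for every sentence containing a requirement keyword.
/// Splitting on '.' is a simplification: it would also split text such as version numbers.
fn extract_normative(text: &str) -> Vec<(&'static str, String)> {
    text.split('.')
        .map(str::trim)
        .filter(|sentence| !sentence.is_empty())
        .filter_map(|sentence| {
            KEYWORDS
                .iter()
                .find(|keyword| sentence.contains(**keyword))
                .map(|keyword| (*keyword, sentence.to_string()))
        })
        .collect()
}

fn main() {
    // Made-up sample text, not quoted from any RFC.
    let text = "The implementation MUST validate the aud claim. \
                Clients MUST NOT send tokens over plain HTTP. \
                This section is informative.";
    for (keyword, sentence) in extract_normative(text) {
        println!("[{}] {}", keyword, sentence);
    }
}
```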

1.3 OWASP Ingester

| Task | Status |
|---|---|
| OwaspCorpusBuilder | corpus/owasp.rs |
| HTTP fetching | From GitHub raw content, cached to ~/.cache/aphoria/owasp-cache/ |
| Markdown parsing | MUST/SHOULD statements, section context |
| Cheat sheet parsers | Authentication, JWT, TLS, Secrets, Input Validation, Session, CSRF, Password Storage, HTTP Headers |
| Concept mapping | owasp://cheatsheet/{topic}/{claim} at Tier 1 (Clinical) |

1.4 Vendor Docs

| Task | Status |
|---|---|
| VendorCorpusBuilder | corpus/vendor.rs |
| PostgreSQL claims | pool_size, idle_timeout, ssl_mode |
| Redis claims | timeout, max_retries, tls |
| reqwest claims | cert_verification, connect_timeout, request_timeout |
| hyper claims | keep_alive_timeout, max_concurrent_streams |
| Go net/http claims | read_timeout, write_timeout, idle_timeout, min_tls_version |
| tokio-postgres claims | pool_size, ssl_mode |
| SQLx claims | max_connections, idle_timeout |
| Concept mapping | vendor://{product}/{topic}/{claim} at Tier 2 (Observational) |

1.5 Hardcoded Refactor

| Task | Status |
|---|---|
| HardcodedCorpusBuilder | corpus/hardcoded.rs: original 11 assertions |
| create_authoritative_assertion() | Made public in episteme.rs for corpus builders |

1.6 CLI Integration

| Task | Status |
|---|---|
| aphoria corpus build | Fetches and ingests from all sources |
| --only rfc,owasp,vendor | Filter to specific sources |
| --offline | Skip network-requiring sources |
| --clear-cache | Clear cache before building |
| aphoria corpus list | List available corpus sources |
| CorpusConfig | cache_dir, include_*, rfc_list options |

1.7 Error Handling

| Task | Status |
|---|---|
| RfcFetch error | Per-RFC fetch failures with context |
| OwaspFetch error | Per-cheat-sheet fetch failures with context |
| CorpusBuild error | General corpus build failures |
| Graceful degradation | Continue with other sources if one fails |

Files: corpus/mod.rs, corpus/hardcoded.rs, corpus/rfc.rs, corpus/owasp.rs, corpus/vendor.rs

Tests: 118 tests pass. Clippy and fmt clean.


Phase 3: Skill Integration

Complete. Aphoria is now usable in Claude Code agent workflows.

3.1 Claude Code Skill

| Task | Status |
|---|---|
| skill/SKILL.md | Comprehensive skill definition with all commands |
| /aphoria scan | Scan project, show conflicts grouped by verdict |
| /aphoria scan --fix | Interactive fix workflow |
| /aphoria ack | Acknowledge conflicts as intentional |
| /aphoria status | Show status and baseline |
| /aphoria diff | Show changes since baseline |
| /aphoria init | Initialize Aphoria |
| /aphoria baseline | Set baseline |
| skill/install.sh | Install script for ~/.claude/skills/aphoria/ |

Files: skill/SKILL.md, skill/install.sh, skill/hooks.json

3.2 Agent Pre-Flight Hook

| Task | Status |
|---|---|
| --exit-code flag | Returns 2 for BLOCK, 1 for FLAG only, 0 for clean |
| --strict flag | Lower thresholds (FLAG at 0.3, BLOCK at 0.5) |
| Hook template | skill/hooks.json with PreCommit and PrePush examples |

Usage:

{
  "hooks": {
    "PreCommit": [{"command": "aphoria scan --format sarif --exit-code"}],
    "PrePush": [{"command": "aphoria scan --strict --exit-code"}]
  }
}
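
How the thresholds and exit codes fit together might look like the sketch below. It assumes each conflict carries a numeric score and that a score at or above a threshold triggers the verdict; neither detail is stated in this roadmap. Only the strict values (FLAG at 0.3, BLOCK at 0.5) and the 2/1/0 exit codes are taken from the table above, and the type and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Pass,
    Flag,
    Block,
}

struct Thresholds {
    flag: f64,
    block: f64,
}

/// Thresholds used with --strict, as listed in the table above.
const STRICT: Thresholds = Thresholds { flag: 0.3, block: 0.5 };

/// Classify one conflict score (assumption: higher score means a more severe conflict).
fn verdict(score: f64, thresholds: &Thresholds) -> Verdict {
    if score >= thresholds.block {
        Verdict::Block
    } else if score >= thresholds.flag {
        Verdict::Flag
    } else {
        Verdict::Pass
    }
}

/// Map a scan's verdicts to the process exit code: 2 for any BLOCK, 1 for FLAG only, 0 for clean.
fn exit_code(verdicts: &[Verdict]) -> i32 {
    if verdicts.contains(&Verdict::Block) {
        2
    } else if verdicts.contains(&Verdict::Flag) {
        1
    } else {
        0
    }
}

fn main() {
    let scores = [0.1, 0.35, 0.6];
    let verdicts: Vec<Verdict> = scores.iter().map(|s| verdict(*s, &STRICT)).collect();
    println!("{:?} -> exit {}", verdicts, exit_code(&verdicts));
}
```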

3.3 Alias Suggestion Workflow

Alias creation is now automatic (Phase 2A.3). When Aphoria scans a project:

  1. Tail-path matching finds authoritative assertions
  2. Aliases are auto-created with AliasOrigin::AutoDetected
  3. Future queries use the alias automatically

The skill documents the suggestion flow for manual alias management:

  • y (Accept): Creates alias
  • n (Reject): Records intentional difference
  • defer: Flags for later review

Phase 4: Pre-Commit Integration

Depends on Phase 3 (skill validates the UX before hook automation).

4.1 Git Pre-Commit Hook

A git pre-commit hook that runs Aphoria before every commit:

#!/bin/sh
# .git/hooks/pre-commit

aphoria scan --exit-code --format table

if [ $? -eq 2 ]; then
    echo "BLOCKED: Fix conflicts before committing"
    exit 1
fi

Or using pre-commit framework (.pre-commit-config.yaml):

repos:
  - repo: local
    hooks:
      - id: aphoria
        name: Aphoria Security Lint
        entry: aphoria scan --exit-code
        language: system
        pass_filenames: false

4.2 Baseline Mode

Already implemented in Phase 2. For existing projects with many conflicts:

$ aphoria baseline
Baseline recorded: 12 existing conflicts frozen.
Future scans will only report new conflicts.

4.3 Diff-Only Scanning

Scan only changed files instead of the whole project:

# Scan only staged files
aphoria scan --staged

# Scan only files changed since baseline
aphoria scan --since-baseline

This makes pre-commit hooks fast even in large projects.


Phase 5: Research Agent Loop

The research agent fills gaps in authoritative coverage by consulting official documentation.

5.1 Gap Detection

| Task | Status |
|---|---|
| Gap struct | research/gap_detector.rs: concept_path, topic, predicate, source info |
| detect_gaps() | Compares claims against ConceptIndex, identifies missing coverage |
| Topic normalization | Extracts last 2 path segments for cross-scheme matching |
| Deduplication | Deduplicates gaps by topic+predicate key |
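
Gap detection with topic normalization and deduplication could be sketched like this. The `Gap` fields follow the table above (source info omitted), but the signature of `detect_gaps`, the use of a `HashSet` of covered keys in place of the real ConceptIndex, and the example predicates are assumptions.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
struct Gap {
    concept_path: String,
    topic: String,
    predicate: String,
}

/// Normalize a concept path to its last two segments, ignoring the scheme.
fn topic_of(concept_path: &str) -> String {
    let path = concept_path
        .split_once("://")
        .map(|(_, rest)| rest)
        .unwrap_or(concept_path);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    segments[segments.len().saturating_sub(2)..].join("/")
}

/// Report claims with no authoritative coverage, at most once per (topic, predicate).
/// `covered` stands in for a lookup against the authoritative index.
fn detect_gaps(claims: &[(&str, &str)], covered: &HashSet<(String, String)>) -> Vec<Gap> {
    let mut seen = HashSet::new();
    let mut gaps = Vec::new();
    for &(path, predicate) in claims {
        let key = (topic_of(path), predicate.to_string());
        if covered.contains(&key) {
            continue; // authoritative coverage exists, so this is not a gap
        }
        if seen.insert(key.clone()) {
            gaps.push(Gap {
                concept_path: path.to_string(),
                topic: key.0,
                predicate: key.1,
            });
        }
    }
    gaps
}

fn main() {
    let mut covered = HashSet::new();
    covered.insert(("tls/cert_verification".to_string(), "enabled".to_string()));

    let claims = [
        ("code://rust/app_a/redis/timeout", "value"),
        ("code://go/app_b/redis/timeout", "value"),
        ("code://rust/app_a/tls/cert_verification", "enabled"),
    ];
    println!("{:#?}", detect_gaps(&claims, &covered));
}
```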

5.2 Gap Storage

| Task | Status |
|---|---|
| GapRecord | research/gap_store.rs: tracking metadata, project count, research status |
| GapStore | JSON-backed persistent storage with atomic saves |
| Project tracking | Records which projects reported each gap |
| Research eligibility | is_eligible_for_research() with threshold and cooldown |
| Gap pruning | prune_old_gaps() removes stale entries |
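
One plausible reading of the eligibility rule, sketched with assumed field names and an assumed signature: a gap qualifies once enough distinct projects have reported it and it has not been researched within the cooldown window. The one-day cooldown in the demo is a made-up value; timestamps are Unix seconds passed in by the caller so the check stays deterministic.

```rust
use std::collections::BTreeSet;

/// Hypothetical, trimmed-down gap record.
struct GapRecord {
    topic: String,
    projects: BTreeSet<String>,
    last_researched_at: Option<u64>, // Unix seconds; None if never researched
}

impl GapRecord {
    fn new(topic: &str) -> Self {
        GapRecord {
            topic: topic.to_string(),
            projects: BTreeSet::new(),
            last_researched_at: None,
        }
    }

    /// Record that a project reported this gap; repeated reports count once.
    fn record_project(&mut self, project: &str) {
        self.projects.insert(project.to_string());
    }

    fn project_count(&self) -> usize {
        self.projects.len()
    }

    /// Eligible when enough projects reported the gap and the cooldown has elapsed.
    fn is_eligible_for_research(&self, threshold: usize, cooldown_secs: u64, now: u64) -> bool {
        if self.project_count() < threshold {
            return false;
        }
        match self.last_researched_at {
            None => true,
            Some(t) => now.saturating_sub(t) >= cooldown_secs,
        }
    }
}

fn main() {
    let mut gap = GapRecord::new("redis/timeout");
    for project in ["app_a", "app_b", "app_c"] {
        gap.record_project(project);
    }
    let day = 86_400; // assumed cooldown for the demo
    println!(
        "{}: {} projects, eligible = {}",
        gap.topic,
        gap.project_count(),
        gap.is_eligible_for_research(3, day, 1_000_000)
    );
}
```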

5.3 Quality Validation

| Task | Status |
|---|---|
| QualityValidator | research/quality.rs: validates researched claims |
| Source attribution | Checks for authoritative domains (rfc-editor, owasp, vendor docs) |
| Normative language | Verifies MUST/SHOULD/SHALL keywords present |
| Vague content detection | Rejects "it depends", "typically", etc. |
| Consistency scoring | Detects conflicting claims on same subject |
| QualityReport | Detailed per-claim validation results |
| filter_passed() | Returns only claims meeting quality threshold |
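
Three of the checks above (source attribution, normative language, vague content) can be combined into a simple filter, sketched below. The `Claim` type, the exact domain list, and the all-or-nothing pass rule are assumptions; the real validator produces a scored QualityReport and also checks consistency, which this sketch leaves out.

```rust
/// A researched claim plus the URL it was extracted from (hypothetical shape).
struct Claim {
    text: String,
    source_url: String,
}

// Illustrative lists; the real validator's lists are likely longer.
const AUTHORITATIVE_DOMAINS: [&str; 2] = ["rfc-editor.org", "owasp.org"];
const NORMATIVE_KEYWORDS: [&str; 3] = ["MUST", "SHOULD", "SHALL"];
const VAGUE_PHRASES: [&str; 2] = ["it depends", "typically"];

/// Extract the host part of a URL without pulling in a URL-parsing crate.
fn host_of(url: &str) -> &str {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    rest.split('/').next().unwrap_or(rest)
}

fn has_authoritative_source(claim: &Claim) -> bool {
    let host = host_of(&claim.source_url);
    AUTHORITATIVE_DOMAINS
        .iter()
        .any(|d| host == *d || host.ends_with(&format!(".{}", d)))
}

fn has_normative_language(claim: &Claim) -> bool {
    NORMATIVE_KEYWORDS.iter().any(|k| claim.text.contains(*k))
}

fn is_vague(claim: &Claim) -> bool {
    let lower = claim.text.to_lowercase();
    VAGUE_PHRASES.iter().any(|p| lower.contains(*p))
}

/// Keep only claims that pass every check.
fn filter_passed(claims: Vec<Claim>) -> Vec<Claim> {
    claims
        .into_iter()
        .filter(|c| has_authoritative_source(c) && has_normative_language(c) && !is_vague(c))
        .collect()
}

/// Made-up claims and URLs used only for the demo.
fn sample_claims() -> Vec<Claim> {
    let make = |text: &str, url: &str| Claim {
        text: text.to_string(),
        source_url: url.to_string(),
    };
    vec![
        make("Servers MUST verify the certificate chain.", "https://www.rfc-editor.org/rfc/example"),
        make("Clients SHOULD typically retry.", "https://owasp.org/example"),
        make("Pools MUST be bounded.", "https://blog.example.com/post"),
    ]
}

fn main() {
    for claim in filter_passed(sample_claims()) {
        println!("kept: {} ({})", claim.text, claim.source_url);
    }
}
```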

5.4 Research Execution

| Task | Status |
|---|---|
| Researcher | research/researcher.rs: orchestrates research pipeline |
| DocumentationSource | Configurable sources with URL patterns and topics |
| Default sources | Redis, PostgreSQL, Go, Rust, OWASP, Kafka, MongoDB |
| Content fetching | HTTP with timeout and size limits |
| Normative extraction | Regex-based MUST/SHOULD/SHALL extraction |
| Section tracking | Extracts heading context for attribution |
| Confidence scoring | Based on keyword strength, statement length, content size |

5.5 CLI Integration

| Task | Status |
|---|---|
| aphoria research run | Run research agent with configurable threshold |
| aphoria research status | Show gap statistics and research progress |
| aphoria research gaps | List gaps by project count |
| --threshold | Minimum projects before researching (default: 3) |
| --strict | Use strict quality validation |
| --prune | Remove stale gaps before researching |
| --ready | Show only gaps ready for research |

Files: research/mod.rs, research/gap_detector.rs, research/gap_store.rs, research/quality.rs, research/researcher.rs, research/tests.rs

5.6 Community Corpus Contributions

Future: Users can opt in to contribute patterns anonymously.

  • "Every Rust project has this JWT pattern" → pre-built alias set
  • "This Redis config is always acknowledged" → adjust default threshold
  • "This TLS pattern is always a real bug" → elevate threshold

Milestone Summary

| Phase | Deliverable | Depends On | Status |
|---|---|---|---|
| 0 | ConceptPath in StemeDB | concept-hierarchy spec | |
| 2 | Aphoria CLI (scan, report, ack) | Phase 0 | |
| 2A.1 | Leaf-based concept matching | Phase 2 | |
| 2A.2 | Alias resolution in QueryEngine | Phase 2 | |
| 2A.3 | Auto-alias creation | Phase 2A.2 | |
| 1 | Authoritative corpus expansion | Phase 0 | |
| 3 | Claude Code skill + hooks | Phase 2A | |
| 5 | Research agent loop | Phase 3 | |
| 4 | Pre-commit integration (git hooks, diff scanning) | Phase 3 | NEXT |

Current state:

  • Phase 1 is complete: RFC, OWASP, and Vendor corpus builders with aphoria corpus build CLI
  • Phase 2A is complete: conflict detection via tail-path matching, alias-aware QueryEngine, and auto-alias creation
  • Phase 3 is complete: /aphoria skill installed to ~/.claude/skills/aphoria/, hook templates ready
  • Phase 5 is complete: Research agent with gap detection, quality validation, and official doc research

Next: Phase 4 — Pre-commit integration (git hooks, diff-only scanning).