stemedb/applications/aphoria/roadmap-archive.md

# Aphoria Roadmap Archive

> Completed phases moved from `roadmap.md`. Full implementation details preserved in git history.

---

## Phase 0: StemeDB Foundation ✅

ConceptPath type, hierarchical index, alias store, source class inference, concept API endpoints.
All shipped as Phase 5D of the main StemeDB roadmap.

**Spec:** [docs/specs/concept-hierarchy.md](../../docs/specs/concept-hierarchy.md)

---

## Phase 2: CLI Core ✅

End-to-end CLI pipeline with 10 extractors and a bootstrapped corpus of 11 hardcoded assertions.

| Task | Status |
|------|--------|
| 2.1 Project Walker | ✅ `walker/mod.rs`, `walker/path_mapper.rs`, `walker/language.rs` |
| 2.2 Extractors (10) | ✅ `tls_verify`, `jwt_config`, `hardcoded_secrets`, `timeout_config`, `dep_versions`, `cors_config`, `rate_limit`, `weak_crypto`, `command_injection`, `sql_injection` |
| 2.3 Ingestion Bridge | ✅ `bridge.rs` — BLAKE3 hashing, Ed25519 signing, claim→assertion conversion |
| 2.4 Conflict Query | ✅ `episteme.rs` — LocalEpisteme with check_conflicts() |
| 2.5 Report Output | ✅ `report/` — table (comfy-table), JSON, SARIF 2.1.0, markdown |
| 2.6 Acknowledge Command | ✅ `lib.rs` acknowledge() |
| Baseline & Diff | ✅ `lib.rs` set_baseline(), show_diff() |
| Status Command | ✅ `lib.rs` show_status() |

### Phase 2 Code Quality Fixes ✅

- DES/RC4 concept path misclassification: Split into `check_hash_pattern()` and `check_encryption_pattern()`
- SHA1 edge case: Documented as intentionally broad
- JS exec() regex: Tightened to require `child_process.` prefix

---

## Phase 2A: Concept Matching ✅

- **2A.1 Leaf-Based Matching**: `ConceptIndex` with tail-path matching (last 2 segments + predicate)
- **2A.2 Alias Resolution**: Wired `AliasStore` into `QueryEngine.execute()` with `resolve_aliases: bool`
- **2A.3 Auto-Alias Creation**: Auto-creates aliases when code and authority share leaf names
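
The tail-path matching in 2A.1 can be pictured as comparing only the last two path segments plus the predicate. The following is a minimal sketch, not the actual `ConceptIndex` API: the `/` separator, the key format, and the helper names `tail_key` and `tail_matches` are all assumptions made for illustration.

```rust
/// Builds a match key from the last two segments of a `/`-separated concept
/// path plus the predicate. Returns `None` if the path has fewer than two
/// segments. (Hypothetical helper; the `/` separator is an assumption.)
fn tail_key(path: &str, predicate: &str) -> Option<String> {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let n = segments.len();
    if n < 2 {
        return None;
    }
    Some(format!("{}/{}#{}", segments[n - 2], segments[n - 1], predicate))
}

/// Two (path, predicate) pairs match when both tail keys exist and are equal.
fn tail_matches(a: (&str, &str), b: (&str, &str)) -> bool {
    match (tail_key(a.0, a.1), tail_key(b.0, b.1)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

Under these assumptions, a code-side path and an authority-side path with different prefixes still match as long as their final two segments and predicate agree.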

---

## Phase 1: Authoritative Corpus Expansion ✅

Expanded from 11 hardcoded assertions to a pluggable corpus system.

- **1.1 CorpusBuilder Trait** ✅ — name, scheme, default_tier, build, requires_network
- **1.2 RFC Ingester** ✅ — HTTP fetching, RFC 2119 keyword parsing, 8 RFC-specific parsers
- **1.3 OWASP Ingester** ✅ — GitHub raw content, 9 cheat sheet parsers
- **1.4 Vendor Docs** ✅ — PostgreSQL, Redis, reqwest, hyper, Go net/http, tokio-postgres, SQLx
- **1.5 Hardcoded Refactor** ✅ — Original 11 assertions as `HardcodedCorpusBuilder`
- **1.6 CLI Integration** ✅ — `aphoria corpus build/list`, `--only`, `--offline`, `--clear-cache`
- **1.7 Error Handling** ✅ — Per-source graceful degradation

**Files:** `corpus/mod.rs`, `corpus/hardcoded.rs`, `corpus/rfc.rs`, `corpus/owasp.rs`, `corpus/vendor.rs`

---

## Phase 3: Skill Integration ✅

- **3.1 Claude Code Skill** ✅ — `/aphoria scan`, `scan --fix`, `ack`, `status`, `diff`, `init`, `baseline`
- **3.2 Agent Pre-Flight Hook** ✅ — `--exit-code` (2=BLOCK, 1=FLAG, 0=clean), `--strict`
- **3.3 Alias Suggestion** ✅ — Auto-alias creation from Phase 2A.3
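
The `--exit-code` contract in 3.2 (2=BLOCK, 1=FLAG, 0=clean) can be sketched as taking the most severe finding of a scan. This is an illustrative sketch only: the `Severity` enum and `exit_code` function are hypothetical names, and the assumption that the worst finding decides the code is not confirmed by this archive. The effect of `--strict` is not modeled.

```rust
/// Hypothetical severity levels; the discriminants mirror the documented
/// exit codes (0 = clean, 1 = FLAG, 2 = BLOCK).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Clean = 0,
    Flag = 1,
    Block = 2,
}

/// Returns the exit code for a scan, assuming the most severe finding wins.
/// An empty list of findings counts as clean.
fn exit_code(findings: &[Severity]) -> i32 {
    findings.iter().copied().max().unwrap_or(Severity::Clean) as i32
}
```

A pre-flight hook would then treat a non-zero code as a reason to stop or warn, depending on its own policy.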

---

## Phase 4: Full-Cycle Pre-Commit (Scan + Sync) ✅

Bidirectional knowledge sync: extract → check → classify → update → gate.

- **4A Observational Claims** ✅ — `--sync` records novel claims as Tier 4 observations
- **4B Self-Conflict Detection** ✅ — Drift detection with `Verdict::Drift`
- **4C Diff-Only Scanning** ✅ — `--staged` for fast pre-commit hooks
- **4D Enhanced Ack** ✅ — `--reason`, `aphoria update` for policy changes
- **4E Hosted Mode** ✅ — Team aggregation via central StemeDB server, `HostedClient`

---

## Phase 4.5: Ephemeral Scan Mode ✅

Scans run 40x faster by skipping Episteme storage. The default ephemeral mode takes ~0.25s; persistent mode takes ~1-2s.

- `ScanMode` enum (Ephemeral default, Persistent opt-in with `--persist`)
- `EphemeralDetector` — in-memory corpus + ConceptIndex
- `check_conflicts_pure()` extracted as standalone function
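
As a rough illustration of the default-versus-opt-in behavior described above, the sketch below models a two-variant `ScanMode` with `Ephemeral` as the default. The variant names come from the list above; the `scan_mode_from_args` helper and its argument handling are assumptions, not the real CLI parser.

```rust
/// Scan mode with `Ephemeral` as the default, as described in the phase notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum ScanMode {
    #[default]
    Ephemeral,
    Persistent,
}

/// Hypothetical helper: selects `Persistent` only when `--persist` is present.
fn scan_mode_from_args(args: &[&str]) -> ScanMode {
    if args.iter().any(|a| *a == "--persist") {
        ScanMode::Persistent
    } else {
        ScanMode::default()
    }
}
```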

---

## Phase 5: Research Agent Loop ✅

- **5.1 Gap Detection** ✅ — `detect_gaps()` compares claims against ConceptIndex
- **5.2 Gap Storage** ✅ — JSON-backed persistent storage with eligibility tracking
- **5.3 Quality Validation** ✅ — Source attribution, normative language, vague content detection
- **5.4 Research Execution** ✅ — HTTP fetching, normative extraction, confidence scoring
- **5.5 CLI Integration** ✅ — `aphoria research run/status/gaps`
- **5.6 Community Corpus** ✅ — Opt-in anonymous pattern sharing with privacy-preserving anonymization
- **5.7 Security Extractors** ✅ — weak_crypto, command_injection, sql_injection

---

## Phase 6: Federated Policy & Trust Packs ✅

- **6.1 Trust Pack Format** ✅ — rkyv serialization, Ed25519 signing
- **6.2 Policy Management** ✅ — Local and remote loading with caching
- **6.3 Core Integration** ✅ — EphemeralDetector + LocalEpisteme policy ingestion
- **6.4 CLI Commands** ✅ — `aphoria policy export`, auto-loading

---

## Phase 6.5: Trust Pack Extensions ✅

- **6.5.1 Predicate Aliases** ✅ — `enabled` ↔ `required` ↔ `mandatory` ↔ `enforced`
- **6.5.2 Pack Signing Key Rotation** ✅ — `aphoria policy resign`, signature chain audit trail
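
One way to read the predicate aliases in 6.5.1 is as a single equivalence class that collapses to a canonical name before comparison. The sketch below assumes exactly that; `canonical_predicate` and `predicates_match` are hypothetical helpers, and choosing `enabled` as the canonical form is arbitrary.

```rust
/// The four predicates listed as equivalent in 6.5.1.
const EQUIVALENT: [&str; 4] = ["enabled", "required", "mandatory", "enforced"];

/// Maps any member of the equivalence class to its first entry; other
/// predicates pass through unchanged. (Canonical choice is an assumption.)
fn canonical_predicate(p: &str) -> &str {
    if EQUIVALENT.iter().any(|e| *e == p) {
        EQUIVALENT[0]
    } else {
        p
    }
}

/// Two predicates match when their canonical forms are equal.
fn predicates_match(a: &str, b: &str) -> bool {
    canonical_predicate(a) == canonical_predicate(b)
}
```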

---

## Phase 7: Declarative Extractors ✅

TOML-defined custom extractors without Rust code.

- **7.1 Core Types** ✅ — `DeclarativeExtractorDef`, `DeclarativeExtractor`
- **7.2 Configuration** ✅ — `[[extractors.declarative]]` in aphoria.toml
- **7.3 Validation** ✅ — ReDoS protection, confidence validation
- **7.4 Registry Integration** ✅ — Enable/disable, Trust Pack integration
- **7.5 Error Handling** ✅
- **7.6 Tests** ✅ — 22 unit + 7 integration tests

---

## Phase 7.5: LLM-in-the-Loop Extraction ✅

Gemini-powered semantic extraction for high-value files.

- **7.5.1 LLM Extractor** ✅ — `GeminiClient`, structured JSON output
- **7.5.2 Selective Triggering** ✅ — `is_high_value_file()`, token budget
- **7.5.3 Cost Controls** ✅ — BLAKE3 caching, budget enforcement
- **7.5.4 Configuration** ✅ — `[llm]` section in aphoria.toml

---

## Phase 7.6: Pattern Learning Store ✅

Remembers the patterns the LLM finds so they can be promoted to declarative extractors.

- **7.6.1 Schema** ✅ — `LearnedPattern`, `ClaimTemplate`, `ValueType`
- **7.6.2 PatternStore** ✅ — JSON-backed, RwLock thread safety, Levenshtein dedup
- **7.6.3 Normalization** ✅ — Version/boolean/number/string placeholder replacement
- **7.6.4 Configuration** ✅ — `[learning]` section
- **7.6.5 Scan Integration** ✅ — Project hash, record/update patterns
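
The Levenshtein deduplication mentioned in 7.6.2 could work along the following lines. This is a sketch under assumptions: it uses the standard two-row dynamic-programming edit distance, and both `is_near_duplicate` and its `max_distance` parameter are hypothetical, since the archive does not state the threshold `PatternStore` uses.

```rust
/// Classic Levenshtein edit distance over Unicode scalar values,
/// keeping only the previous and current rows of the DP table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let v = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1) // deletion
                .min(curr[j] + 1); // insertion
            curr.push(v);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical dedup check: treats two normalized patterns as duplicates
/// when their edit distance is within `max_distance`.
fn is_near_duplicate(a: &str, b: &str, max_distance: usize) -> bool {
    levenshtein(a, b) <= max_distance
}
```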

---

## Phase 7.7: Pattern → Extractor Promotion ✅

Learned patterns become declarative extractors via LLM regex generation.

- **7.7.1 Pipeline** ✅ — `PromotionPipeline`, `RegexGenerator`, `ExtractorValidator`, `YamlWriter`
- **7.7.2 Regex Generation** ✅ — Multi-example prompt, ReDoS safety
- **7.7.3 Validation** ✅ — Positive tests, timing validation
- **7.7.4 Human Review** ✅ — `aphoria extractors review/stats/candidates/promote`
- **7.7.5 Extractor Output** ✅ — YAML files in `.aphoria/extractors/learned/`

---

## Phase 7.8: LLM Prompt Evaluation ✅

Golden fixtures with precision/recall metrics and regression detection.

- **7.8.1 Fixture Format** ✅ — TOML-based with `must_contain`/`must_not_contain`
- **7.8.2 Claim Matching** ✅ — Tail-path matching, type coercion
- **7.8.3 Metrics** ✅ — Precision/Recall/F1, per-category breakdown
- **7.8.4 Harness** ✅ — Live/Cached/Mock modes, regression detection
- **7.8.5 Reports** ✅ — Table, JSON, Markdown
- **7.8.6 CLI** ✅ — `aphoria eval run/baseline/update-baseline/list-fixtures/validate-fixtures`
- **7.8.7 Seed Fixtures** ✅ — 10 fixtures across tls, jwt, secrets, auth, negative, edge
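
For reference, the precision, recall, and F1 metrics in 7.8.3 follow the standard definitions over true positives, false positives, and false negatives. The sketch below computes them; the function name and the convention of returning 0.0 for an empty denominator are assumptions, not necessarily what the harness does.

```rust
/// Returns (precision, recall, F1) from raw counts.
/// Empty denominators yield 0.0 (an assumed convention).
fn precision_recall_f1(tp: u32, fp: u32, fn_count: u32) -> (f64, f64, f64) {
    let ratio = |num: u32, den: u32| if den == 0 { 0.0 } else { num as f64 / den as f64 };
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_count);
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

In fixture terms, a `must_contain` claim that the extractor misses counts as a false negative, and a `must_not_contain` claim that it emits counts as a false positive.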

---

## Phase 8: Enterprise Extractor Improvements ✅

42 extractors total. Enterprise-grade detection for production codebases.

- **8.1 High-Entropy Secrets** ✅ — Shannon entropy, known prefixes (AWS/Stripe/GitHub/GitLab/Slack)
- **8.2 Framework Extractors (10)** ✅ — Spring, Django, Express, Rails, ASP.NET, Laravel, FastAPI, Next.js, Flask, NestJS
- **8.3 Config Deep Parsing** ✅ — YAML/JSON/TOML tree walking, 11 security rules
- **8.4 Semantic TLS Version** ✅ — Cross-language const detection, Terraform, Kubernetes
- **8.5 ORM SQL Injection** ✅ — Django/SQLAlchemy/GORM/ActiveRecord/Prisma/Sequelize
- **8.6 Path Traversal** ✅
- **8.7 Unvalidated Redirects** ✅
- **8.8 Weak Password** ✅
- **8.9 Security Headers** ✅
- **8.10 Insecure Deserialization** ✅
- **8.11 SSRF** ✅
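
The Shannon entropy measure behind 8.1 can be sketched as below. It computes entropy in bits per byte over the byte distribution of a candidate string; how the real extractor tokenizes candidates and which threshold it applies are not stated in this archive, so none is assumed here.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string in bits per byte: H = -sum(p_i * log2(p_i)),
/// where p_i is the relative frequency of each distinct byte.
fn shannon_entropy(s: &str) -> f64 {
    if s.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for b in s.bytes() {
        *counts.entry(b).or_insert(0) += 1;
    }
    let len = s.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}
```

Random-looking tokens score higher than repetitive or dictionary-like strings, which is why entropy is a common complement to known-prefix checks.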

---

## Phase 9: Autonomous Extractor Generation ✅

Fully self-improving extraction system.

- **9.1 Autonomous Promotion** ✅ — >0.95 confidence, >10 projects, full audit trail
- **9.2 Shadow Mode Testing** ✅ — Isolated metrics, graduation gate, FP tracking
- **9.3 Auto-Rollback** ✅ — FP rate >15% triggers automatic rollback
- **9.4 Cross-Project Learning** ✅ — Privacy-preserving pattern sync, community extractors
- **9.5 Extractor Versioning** ✅ — Changelogs, rollback, A/B comparison
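
The numeric gates in 9.1 and 9.3 can be written down directly. The sketch below encodes only the thresholds stated above (confidence above 0.95, more than 10 projects, false-positive rate above 15%); the `CandidateStats` struct, the function names, and the way the FP rate is computed from counts are assumptions.

```rust
/// Hypothetical summary of a learned pattern that is a promotion candidate.
struct CandidateStats {
    confidence: f64,
    project_count: u32,
}

/// 9.1: promote autonomously only above 0.95 confidence and above 10 projects.
fn eligible_for_autonomous_promotion(c: &CandidateStats) -> bool {
    c.confidence > 0.95 && c.project_count > 10
}

/// 9.3: roll back when the false-positive rate exceeds 15%.
/// With no findings at all there is no rate, so no rollback is triggered.
fn should_roll_back(false_positives: u32, total_findings: u32) -> bool {
    total_findings > 0 && (false_positives as f64 / total_findings as f64) > 0.15
}
```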

---

## Phase 10.1: Acknowledgment Expiry ✅

Time-limited exceptions with `--expires` flag.

- `--expires 90d` or `--expires 2026-12-31` (ISO 8601)
- Expired acks resurface as BLOCK
- Expired acks are preserved for the audit trail, per patent claim 25
- All report formatters show expiry info

**Files:** `src/expiry.rs`, `cli.rs`, `report/*.rs`
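
The two accepted `--expires` forms (`90d` and an ISO 8601 date such as `2026-12-31`) could be parsed roughly as follows. This is a standard-library-only sketch, not the contents of `src/expiry.rs`: the `Expiry` enum and `parse_expiry` function are hypothetical, and the date check only validates ranges, not real calendar dates.

```rust
/// Hypothetical parsed form of an `--expires` value.
#[derive(Debug, PartialEq, Eq)]
enum Expiry {
    /// Relative form, e.g. `90d`.
    AfterDays(u32),
    /// Absolute form, e.g. `2026-12-31`.
    OnDate { year: u16, month: u8, day: u8 },
}

/// Parses `<N>d` or `YYYY-MM-DD`. Returns `None` for anything else.
/// Note: day-of-month is only range-checked (1..=31), not calendar-checked.
fn parse_expiry(input: &str) -> Option<Expiry> {
    if let Some(days) = input.strip_suffix('d') {
        return days.parse::<u32>().ok().map(Expiry::AfterDays);
    }
    let parts: Vec<&str> = input.split('-').collect();
    if let [y, m, d] = parts.as_slice() {
        let year: u16 = y.parse().ok()?;
        let month: u8 = m.parse().ok()?;
        let day: u8 = d.parse().ok()?;
        if y.len() == 4 && (1..=12).contains(&month) && (1..=31).contains(&day) {
            return Some(Expiry::OnDate { year, month, day });
        }
    }
    None
}
```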

---

## Phase 11: Evidence-Based Authority ✅

Evidence levels (ProductSpec > Standard > Research > Commit-only) with evidence-aware graduation.

- **11.1 Types** ✅ — `EvidenceLevel`, `PatternEvidence` with ADR/spec/RFC references
- **11.2 Detection** ✅ — Commit message parsing, ADR/spec file detection
- **11.3 Graduation** ✅ — Thresholds vary by evidence (ProductSpec: 1 usage, Commit-only: 10)
- **11.4 Display** ✅ — Evidence chain in output, `--evidence` filter

**Files:** `src/evidence/mod.rs`, `evidence/types.rs`, `evidence/detection.rs`
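
The evidence ordering and the two graduation thresholds stated above can be sketched like this. The variant names are guesses at how `EvidenceLevel` might be spelled, and the thresholds for Standard and Research are deliberately left as `None` because this archive does not state them.

```rust
/// Evidence levels, declared weakest first so the derived ordering gives
/// ProductSpec > Standard > Research > CommitOnly. (Variant names assumed.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum EvidenceLevel {
    CommitOnly,
    Research,
    Standard,
    ProductSpec,
}

/// Usages required before a pattern graduates. Only the two values given in
/// the phase notes are filled in; the others are unknown here.
fn graduation_threshold(level: EvidenceLevel) -> Option<u32> {
    match level {
        EvidenceLevel::ProductSpec => Some(1),
        EvidenceLevel::CommitOnly => Some(10),
        EvidenceLevel::Standard | EvidenceLevel::Research => None,
    }
}
```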

---

## Phase 12: Knowledge Scope Hierarchy ✅

Organization → Team → Project scope levels with inheritance.

- **12.1 Scope Types** ✅ — `ScopeLevel` enum, `ScopeConfig`
- **12.2 Inheritance** ✅ — Security: no opt-out, Conventions: override with justification
- **12.3 Override Workflow** ✅ — Justification + evidence required
- **12.4 Cross-Scope Queries** ✅ — `--scope org/team/project`, `--exclude-inherited`

**Files:** `src/scope/mod.rs`, `scope/config.rs`, `scope/resolver.rs`, `scope/override_record.rs`, `scope/store.rs`
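
The inheritance rules in 12.2 and 12.3 amount to a small decision: security rules can never be overridden, while conventions can be overridden when both a justification and evidence are supplied. The sketch below illustrates that reading; `RuleKind`, `OverrideRequest`, and `override_allowed` are hypothetical names and do not reflect the types in `scope/override_record.rs`.

```rust
/// Hypothetical classification of an inherited rule.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuleKind {
    Security,
    Convention,
}

/// Hypothetical override request carrying the two required fields.
struct OverrideRequest<'a> {
    justification: &'a str,
    evidence: &'a str,
}

/// Security rules have no opt-out; conventions need non-empty
/// justification and evidence.
fn override_allowed(kind: RuleKind, req: &OverrideRequest<'_>) -> bool {
    match kind {
        RuleKind::Security => false,
        RuleKind::Convention => {
            !req.justification.trim().is_empty() && !req.evidence.trim().is_empty()
        }
    }
}
```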

---

## Phase 13: Knowledge Lifecycle Management ✅

Active → Deprecated → Superseded → Archived lifecycle for patterns.

- **13.1 Status Types** ✅ — `KnowledgeStatus` enum with history tracking
- **13.2 Deprecation** ✅ — `aphoria deprecate` with `--reason`, `--superseded-by`, `--sunset-date`
- **13.3 Migration Guidance** ✅ — Warnings in scan output, links to replacements
- **13.4 Migration Dashboard** ✅ — `aphoria migrations status`, progress tracking, export

**Files:** `src/lifecycle/mod.rs`, `lifecycle/store.rs`, `lifecycle/migration.rs`

---

## Phase 16: Ignore & Exclusion System ✅

Clean scans by excluding test fixtures and intentional patterns.

- **16.1 Glob Patterns** ✅ — `globset` with `**`, `*`, `?` support
- **16.2 `.aphoriaignore`** ✅ — Gitignore-style patterns, merged with aphoria.toml
- **16.3 Inline Comments** ✅ — `// aphoria:ignore`, `ignore-next-line`, `ignore-block`
- **16.4 Ack Export/Import** ✅ — `.aphoria/acks.toml`, version-controllable

---

## The Self-Learning Vision (Complete)

```
Phase 7: Declarative Extractors                          ✅
    ↓
Phase 7.5: LLM-in-the-Loop (Gemini semantic extraction) ✅
    ↓
Phase 7.6: Pattern Learning (remember what LLM finds)    ✅
    ↓
Phase 7.7: Pattern Promotion (patterns → extractors)     ✅
    ↓
Phase 7.8: LLM Prompt Evaluation (measure & improve)     ✅
    ↓
Phase 8: Enterprise Extractors (42 total)                ✅
    ↓
Phase 9: Autonomous Generation (fully self-improving)     ✅
```

## Milestone Summary (Completed)

| Phase | Deliverable | Status |
|-------|-------------|--------|
| 0 | ConceptPath in StemeDB | ✅ |
| 2 | Aphoria CLI (scan, report, ack) | ✅ |
| 2A | Concept matching (leaf, alias, auto-alias) | ✅ |
| 1 | Authoritative corpus expansion | ✅ |
| 3 | Claude Code skill + hooks | ✅ |
| 4 | Full-cycle pre-commit (sync, drift, staged, hosted) | ✅ |
| 4.5 | Ephemeral scan mode (40x faster) | ✅ |
| 5 | Research agent loop + community corpus | ✅ |
| 6 | Federated Policy & Trust Packs | ✅ |
| 6.5 | Trust Pack Extensions | ✅ |
| 7 | Declarative Extractors | ✅ |
| 7.5 | LLM-in-the-Loop Extraction | ✅ |
| 7.6 | Pattern Learning Store | ✅ |
| 7.7 | Pattern → Extractor Promotion | ✅ |
| 7.8 | LLM Prompt Evaluation | ✅ |
| 8 | Enterprise Extractors (42 total) | ✅ |
| 9 | Autonomous Extractor Generation | ✅ |
| 10.1 | Acknowledgment Expiry | ✅ |
| 11 | Evidence-Based Authority | ✅ |
| 12 | Knowledge Scope Hierarchy | ✅ |
| 13 | Knowledge Lifecycle Management | ✅ |
| 16 | Ignore & Exclusion System | ✅ |