stemedb/applications/aphoria/roadmap.md
jordan 42d4e09508 feat: Index persistence (Phase 5C) - vector hot/cold, visual checkpoint
2026-02-02 15:43:18 -07:00


# Aphoria Roadmap
---
## Phase 0: StemeDB Foundation
> **Tracked in:** [roadmap.md § 5D. Concept Hierarchy](../../roadmap.md)
Changes to the core database that Aphoria depends on. These ship before the CLI and are tracked in the main StemeDB roadmap as **Phase 5D**.
| Aphoria Phase 0 | StemeDB Phase 5D | Status |
|-----------------|------------------|--------|
| 0.1 ConceptPath Type | 5D.1 ConceptPath Type | ⬜ |
| 0.2 ConceptPath in Assertion | (implicit in 5D.1) | ⬜ |
| 0.3 Hierarchical Index | 5D.4 Hierarchical Query | ⬜ |
| 0.4 Alias Store | 5D.3 Alias Store + 5D.5 Alias Resolution | ⬜ |
| 0.5 Source Class Inference | 5D.6 Source Class Inference | ⬜ |
| 0.6 Concept API Endpoints | 5D.7 Concept API Endpoints | ⬜ |
**Spec:** [docs/specs/concept-hierarchy.md](../../docs/specs/concept-hierarchy.md)
---
## Phase 1: Authoritative Corpus
Before Aphoria can find conflicts, Episteme needs the authoritative sources to conflict against.
### 1.1 RFC Ingester
A CLI tool (or ingestion module) that:
- Fetches RFC text from `rfc-editor.org` (text format, no PDF parsing needed)
- Extracts normative statements (MUST, MUST NOT, SHOULD, SHALL per RFC 2119)
- Maps each statement to a ConceptPath: `rfc://{number}/{topic}/{claim}`
- Ingests as Tier 0 assertions
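
The extraction step could be sketched roughly as follows. This is a minimal illustration, not the planned ingester: `NormativeStatement` and `extract_normative` are hypothetical names, sentences are split naively on `.` (which will mis-split section numbers such as "4.1.3"), and the negated RFC 2119 forms are included so that "SHOULD NOT" is not reported as "SHOULD".

```rust
/// A sentence that contains an RFC 2119 keyword (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct NormativeStatement {
    pub keyword: &'static str,
    pub text: String,
}

/// Trim leading and trailing non-letters, e.g. "MUST," becomes "MUST".
fn strip_punct(word: &str) -> &str {
    word.trim_matches(|c: char| !c.is_ascii_alphabetic())
}

/// Return the first uppercase RFC 2119 keyword in the sentence, if any.
fn first_keyword(words: &[&str]) -> Option<&'static str> {
    for (i, word) in words.iter().enumerate() {
        let negated = words.get(i + 1).map(|w| strip_punct(w)) == Some("NOT");
        let keyword = match (strip_punct(word), negated) {
            ("MUST", false) => "MUST",
            ("MUST", true) => "MUST NOT",
            ("SHALL", false) => "SHALL",
            ("SHALL", true) => "SHALL NOT",
            ("SHOULD", false) => "SHOULD",
            ("SHOULD", true) => "SHOULD NOT",
            _ => continue,
        };
        return Some(keyword);
    }
    None
}

/// Split `text` into sentences and keep the ones with a normative keyword.
pub fn extract_normative(text: &str) -> Vec<NormativeStatement> {
    let mut found = Vec::new();
    for sentence in text.split('.') {
        // Splitting on whitespace also collapses the hard line wraps in RFC text.
        let words: Vec<&str> = sentence.split_whitespace().collect();
        if let Some(keyword) = first_keyword(&words) {
            found.push(NormativeStatement {
                keyword,
                text: words.join(" "),
            });
        }
    }
    found
}
```

Matching is case-sensitive on purpose: RFC 2119 keywords are conventionally written in capitals, so a lowercase "must" is treated as ordinary prose.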
Start with a curated list of security-relevant RFCs:
| RFC | Topic |
|-----|-------|
| 7519 | JWT |
| 6749 | OAuth 2.0 |
| 6750 | Bearer tokens |
| 8446 | TLS 1.3 |
| 7525 | TLS best practices (obsoleted by RFC 9325) |
| 6238 | TOTP |
| 7617 | HTTP Basic Auth |
| 9110 | HTTP Semantics |
### 1.2 OWASP Ingester
Parse OWASP Cheat Sheets (markdown source on GitHub):
- Extract each recommendation as a claim
- Map to `owasp://cheatsheet/{topic}/{claim}`
- Ingest as Tier 1 assertions
Priority cheat sheets: Authentication, JWT, TLS, Secrets Management, Input Validation, Session Management.
### 1.3 Vendor Docs (Manual Bootstrap)
For v1, manually curate a small set of vendor doc claims:
- Postgres connection pool recommendations
- Redis timeout defaults
- Common HTTP client library defaults (reqwest, hyper, net/http)
These are `vendor://{product}/{topic}/{claim}` at Tier 2.
This doesn't need to be exhaustive. It needs to cover the claims that Aphoria's extractors will actually find in code.
---
## Phase 2: CLI Core
The Aphoria binary itself.
### 2.1 Project Walker
Input: a project root path.
Output: a list of files to scan, each tagged with:
- Language (rust, go, python, typescript, yaml, toml, json)
- ConceptPath prefix derived from directory structure
```
crates/citadeldb/src/auth/jwt.rs
→ language: rust
→ prefix: code://rust/citadeldb/auth/jwt
```
Normalization rules:
- Strip `src/`, `lib/`, `pkg/`, `internal/` (language boilerplate)
- Strip `crates/`, `packages/`, `apps/` (monorepo wrappers)
- Map `config/`, `deploy/`, `infra/` to `code://config/{project}/...`
- File extension determines language, not directory
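
A minimal sketch of these normalization rules, assuming paths are given relative to the project root. The function names are hypothetical, and passing `project` explicitly for the `code://config/{project}/...` case is an assumption, since the rules above do not say where the project name comes from.

```rust
use std::path::{Component, Path};

/// Directory names dropped during normalization (boilerplate and monorepo wrappers).
const STRIPPED: [&str; 7] = ["src", "lib", "pkg", "internal", "crates", "packages", "apps"];
/// Directory names that reroute a file to the `code://config/...` namespace.
const CONFIG_DIRS: [&str; 3] = ["config", "deploy", "infra"];

/// Language is decided by file extension only; `None` means the file is not scanned.
pub fn language_for(path: &str) -> Option<&'static str> {
    match Path::new(path).extension()?.to_str()? {
        "rs" => Some("rust"),
        "go" => Some("go"),
        "py" => Some("python"),
        "ts" => Some("typescript"),
        "yaml" | "yml" => Some("yaml"),
        "toml" => Some("toml"),
        "json" => Some("json"),
        _ => None,
    }
}

/// Derive the ConceptPath prefix for one file.
pub fn concept_prefix(rel_path: &str, project: &str) -> Option<String> {
    let language = language_for(rel_path)?;
    let without_ext = Path::new(rel_path).with_extension("");
    // Keep only ordinary path segments (skips a leading "./" and similar).
    let segments: Vec<&str> = without_ext
        .components()
        .filter_map(|c| match c {
            Component::Normal(s) => s.to_str(),
            _ => None,
        })
        .collect();
    let is_config = segments.iter().any(|s| CONFIG_DIRS.contains(s));
    let kept: Vec<&str> = segments
        .iter()
        .copied()
        .filter(|s| !STRIPPED.contains(s) && !CONFIG_DIRS.contains(s))
        .collect();
    let root = if is_config {
        format!("code://config/{}", project)
    } else {
        format!("code://{}", language)
    };
    if kept.is_empty() {
        Some(root)
    } else {
        Some(format!("{}/{}", root, kept.join("/")))
    }
}
```

With these rules, `crates/citadeldb/src/auth/jwt.rs` yields `code://rust/citadeldb/auth/jwt`, matching the example above. One known rough edge of the sketch: a file named `lib.rs` loses its own segment, because `lib` is in the stripped set.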
### 2.2 Extractors
Each extractor is a module that:
- Takes a file path + content + language
- Returns a `Vec<ExtractedClaim>`
Ship these extractors in v1:
| Extractor | What it finds | Languages |
|-----------|--------------|-----------|
| `tls_verify` | TLS certificate verification disabled | rust, go, python, js/ts |
| `jwt_config` | JWT validation settings (aud, exp, alg) | rust, go, python, js/ts |
| `hardcoded_secrets` | Credentials in source (not .env) | all |
| `timeout_config` | HTTP/DB/Redis timeout values | all (config files) |
| `dep_versions` | Known-vulnerable dependency versions | Cargo.toml, go.mod, package.json, requirements.txt |
| `cors_config` | CORS allow-origin settings | rust, go, js/ts |
| `rate_limit` | Rate limiting disabled or unreasonable | rust, go, js/ts |
Extractors use regex + AST patterns, not LLMs. Each extractor declares:
- The patterns it searches for
- The ConceptPath leaf it maps to
- The predicate (e.g., `config_value`, `enabled`, `version`)
- How to extract the ObjectValue from the match
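
One way this declaration could look in code. The trait shape, the `TlsVerify` type, and the `cert_verification` leaf are assumptions for illustration, and plain substring matching stands in for the regex + AST patterns so the sketch needs no external crates.

```rust
/// Value carried by a claim (only the variants this sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub enum ObjectValue {
    Boolean(bool),
    Text(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedClaim {
    pub path: String,
    pub predicate: &'static str,
    pub value: ObjectValue,
    pub source_location: String,
}

/// What each extractor declares: a leaf, a predicate, and a scan routine.
pub trait Extractor {
    fn leaf(&self) -> &'static str;
    fn predicate(&self) -> &'static str;
    fn extract(&self, prefix: &str, file: &str, content: &str) -> Vec<ExtractedClaim>;
}

/// Toy version of the `tls_verify` extractor.
pub struct TlsVerify;

impl Extractor for TlsVerify {
    fn leaf(&self) -> &'static str {
        "cert_verification"
    }

    fn predicate(&self) -> &'static str {
        "enabled"
    }

    fn extract(&self, prefix: &str, file: &str, content: &str) -> Vec<ExtractedClaim> {
        // Substrings that indicate certificate verification is switched off.
        const PATTERNS: [&str; 3] = [
            "danger_accept_invalid_certs(true)", // reqwest (Rust)
            "InsecureSkipVerify: true",          // crypto/tls (Go)
            "verify=False",                      // requests (Python)
        ];
        content
            .lines()
            .enumerate()
            .filter(|(_, line)| PATTERNS.iter().any(|p| line.contains(*p)))
            .map(|(i, _)| ExtractedClaim {
                path: format!("{}/{}", prefix, self.leaf()),
                predicate: self.predicate(),
                value: ObjectValue::Boolean(false),
                // Line numbers are 1-based, as in "src/auth/jwt.rs:47".
                source_location: format!("{}:{}", file, i + 1),
            })
            .collect()
    }
}
```

Keeping the leaf and predicate as trait methods means the walker can list what every extractor produces without running it.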
### 2.3 Ingestion Bridge
Connect extractor output to the Episteme ingestion pipeline:
```
ExtractedClaim {
    path: code://rust/citadeldb/auth/jwt/audience_validation
    predicate: "enabled"
    value: Boolean(false)
    source_location: "src/auth/jwt.rs:47"
    confidence: 1.0  // regex match, not heuristic
}

Assertion {
    subject: ConceptPath::parse("code://rust/citadeldb/auth/jwt/audience_validation")
    predicate: "enabled"
    object: ObjectValue::Boolean(false)
    source_class: SourceClass::Expert  // inferred from code:// scheme
    source_hash: blake3(file_content)
    source_metadata: { "file": "src/auth/jwt.rs", "line": 47 }
    confidence: 1.0
    lifecycle: LifecycleStage::Approved  // code is deployed, it's a fact about the code
}
```
The bridge handles:
- ConceptPath construction from extractor output
- Source hash computation (BLAKE3 of the file at scan time)
- Source metadata encoding (file path, line number, extraction method)
- Signing with the Aphoria agent's keypair
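
Two of the smaller bridge steps can be sketched with the standard library alone. Both helper names are hypothetical. The scheme-to-tier mapping follows the tiers used elsewhere in this roadmap (`rfc://` Tier 0, `owasp://` Tier 1, `vendor://` Tier 2, `code://` and `internal://` Tier 3). Hashing and signing are left out because they need external crates.

```rust
/// Tier implied by a ConceptPath scheme; `None` for unknown or missing schemes.
pub fn tier_for_scheme(path: &str) -> Option<u8> {
    let (scheme, _) = path.split_once("://")?;
    match scheme {
        "rfc" => Some(0),
        "owasp" => Some(1),
        "vendor" => Some(2),
        "code" | "internal" => Some(3),
        _ => None,
    }
}

/// Split a source location such as "src/auth/jwt.rs:47" into file and line.
/// Splitting at the last ':' keeps any earlier colons in the file part.
pub fn split_location(location: &str) -> Option<(&str, u32)> {
    let (file, line) = location.rsplit_once(':')?;
    Some((file, line.parse().ok()?))
}
```

The file and line from `split_location` are what would populate the `source_metadata` fields shown in the `Assertion` example.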
### 2.4 Conflict Query
After ingestion, query Episteme for each extracted concept:
```rust
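for claim in extracted_claims {
    // Fetch every claim on this concept, following aliases to authoritative paths.
    let results = query_engine.query(Query {
        subject: Some(claim.path.to_string()),
        resolve_aliases: true,
        hierarchical: false,
        lens: Some("skeptic"),
        ..Default::default()
    });
    // Report only when the disagreement exceeds the configured threshold.
    if results.conflict_score > threshold {
        report.add_conflict(claim, results);
    }
}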
```
The Skeptic lens returns all claims for the concept across all aliased paths, with a conflict score. If the code claim (Tier 3) contradicts an RFC claim (Tier 0), the conflict score will be high because of the tier spread.
### 2.5 Report Output
```
$ aphoria scan ./citadeldb --format table
┌─────────────────────────────────────────────────────────────────────┐
│ Aphoria Report: citadeldb                                           │
│ Scanned: 142 files │ Claims: 23 │ Conflicts: 3                      │
├──────────┬───────────────────────────────────────┬──────────┬───────┤
│ Verdict  │ Concept                               │ Score    │ Tier  │
├──────────┼───────────────────────────────────────┼──────────┼───────┤
│ BLOCK    │ auth/jwt/audience_validation          │ 0.92     │ 0↔3   │
│ BLOCK    │ net/tls/cert_verification             │ 0.87     │ 1↔3   │
│ FLAG     │ http/timeout                          │ 0.54     │ 2↔3   │
└──────────┴───────────────────────────────────────┴──────────┴───────┘

Details:

  BLOCK  code://rust/citadeldb/auth/jwt/audience_validation
         Your code: aud validation disabled (src/auth/jwt.rs:47)
         RFC 7519:  aud validation MUST be enabled (Tier 0)
         Action:    Fix or acknowledge with: aphoria ack <path> --reason "..."

  BLOCK  code://rust/citadeldb/net/tls/cert_verification
         Your code: verify = false (src/net/client.rs:23)
         OWASP:     verification required (Tier 1)
         Action:    Fix or acknowledge with: aphoria ack <path> --reason "..."

  FLAG   code://rust/citadeldb/http/timeout
         Your code: timeout = 0 (infinite) (config/production.yaml:8)
         reqwest:   default timeout 30s (Tier 2)
         Action:    Review recommended
```
Output formats: `table` (default), `json`, `sarif` (for CI integration), `markdown`.
### 2.6 Acknowledge Command
```
$ aphoria ack code://rust/citadeldb/auth/jwt/audience_validation \
    --reason "Internal service, no external JWT consumers. Accepted risk per SEC-2024-003."
```
This creates a new Assertion:
- Subject: `internal://decision/citadeldb/auth/jwt/audience_validation`
- Predicate: `deviation_accepted`
- Object: Text with the reason
- SourceClass: Expert (Tier 3)
- Aliased to: `code://rust/citadeldb/auth/jwt/audience_validation`
The conflict still exists in Episteme, but the acknowledgment is recorded. On the next scan, the conflict is still reported, now with context: "Acknowledged by [agent] on [date]: [reason]." The Skeptic lens treats the acknowledgment as one more claim on the same concept.
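
The subject derivation can be sketched as below. `Acknowledgment`, `decision_subject`, and `acknowledge` are hypothetical names. Dropping the first segment after `code://` (the language) is inferred from the single example above, and rejecting an empty reason is an assumption.

```rust
/// The fields of the acknowledgment assertion listed above.
#[derive(Debug, PartialEq)]
pub struct Acknowledgment {
    pub subject: String,
    pub predicate: &'static str,
    pub reason: String,
    pub alias_of: String,
}

/// `code://{language}/{rest}` becomes `internal://decision/{rest}`.
pub fn decision_subject(code_path: &str) -> Option<String> {
    let rest = code_path.strip_prefix("code://")?;
    // Drop the first segment (the language tag).
    let (_, tail) = rest.split_once('/')?;
    if tail.is_empty() {
        return None;
    }
    Some(format!("internal://decision/{}", tail))
}

/// Build the acknowledgment for a code path; `None` if the path is not a
/// `code://` path or the reason is blank.
pub fn acknowledge(code_path: &str, reason: &str) -> Option<Acknowledgment> {
    if reason.trim().is_empty() {
        return None;
    }
    Some(Acknowledgment {
        subject: decision_subject(code_path)?,
        predicate: "deviation_accepted",
        reason: reason.to_string(),
        alias_of: code_path.to_string(),
    })
}
```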
---
## Phase 3: Skill Integration
### 3.1 Claude Code Skill
A `/aphoria` skill that wraps the CLI:
```
/aphoria scan          Scan current project, report conflicts
/aphoria scan --fix    Scan and offer to fix each conflict
/aphoria ack <path>    Acknowledge a conflict with a reason
/aphoria status        Show current conflict summary
/aphoria diff          Show new conflicts since last scan
```
The skill runs the CLI binary, parses the JSON output, and presents results inline in the Claude Code session.
### 3.2 Agent Pre-Flight Hook
A Claude Code hook that runs Aphoria before certain operations:
```json
{
  "hooks": {
    "pre-commit": "aphoria scan --format sarif --exit-code",
    "pre-deploy": "aphoria scan --strict --exit-code"
  }
}
```
`--exit-code` returns non-zero if any BLOCK verdicts exist, preventing the commit or deploy.
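
The rule reduces to a small function. The `Verdict` enum and the specific status value `1` are assumptions of this sketch, since the text only promises "non-zero", and `--strict` is not modeled because its semantics are not defined here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Block,
    Flag,
}

/// Exit status for a scan: non-zero only when `--exit-code` was passed and
/// at least one BLOCK verdict is present. FLAG verdicts never fail the run.
pub fn exit_status(verdicts: &[Verdict], exit_code_flag: bool) -> i32 {
    let blocked = verdicts.iter().any(|v| *v == Verdict::Block);
    if exit_code_flag && blocked {
        1
    } else {
        0
    }
}
```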
### 3.3 Alias Suggestion Workflow
When Aphoria scans a new project and finds concepts that share leaf names with existing authoritative paths, it prompts:
```
New concept detected: code://rust/newproject/auth/jwt/audience_validation
Suggested alias:
→ rfc://7519/jwt/audience_validation (Tier 0, RFC 7519 Section 4.1.3)
Accept? [y/n/defer]
```
Accepting creates the alias. Deferring flags it for later review. Rejecting records that these are intentionally different concepts.
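
Leaf matching, the trigger for the prompt, might look like this. `suggest_aliases` is a hypothetical name, and exact string equality on the last path segment is the simplest reading of "share leaf names". A real matcher would probably also consider the parent segments.

```rust
/// Last segment of a ConceptPath, e.g. "audience_validation".
fn leaf(path: &str) -> Option<&str> {
    let (_, rest) = path.split_once("://")?;
    rest.rsplit('/').next().filter(|s| !s.is_empty())
}

/// Authoritative paths whose leaf equals the leaf of the newly seen concept.
pub fn suggest_aliases<'a>(new_path: &str, authoritative: &[&'a str]) -> Vec<&'a str> {
    let target = match leaf(new_path) {
        Some(t) => t,
        None => return Vec::new(),
    };
    authoritative
        .iter()
        .copied()
        .filter(|p| leaf(*p) == Some(target))
        .collect()
}
```

Each returned path would become one "Suggested alias" line, and the user's answer (accept, defer, reject) is then recorded separately.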
---
## Phase 4: CI Integration
### 4.1 GitHub Action
```yaml
- name: Aphoria Scan
  uses: orchard9/aphoria-action@v1
  with:
    episteme-url: ${{ secrets.EPISTEME_URL }}
    fail-on: block
    format: sarif
```
Publishes SARIF results to GitHub Security tab. BLOCK verdicts fail the check. FLAG verdicts appear as warnings.
### 4.2 PR Comment Bot
On pull request, Aphoria scans the diff (not the whole project) and comments:
```
## Aphoria Report
This PR introduces 1 new conflict:
| File | Conflict | Score |
|------|----------|-------|
| src/auth/jwt.rs:47 | Disables aud validation (RFC 7519 requires it) | 0.92 |
Run `aphoria ack` to acknowledge, or fix before merge.
```
### 4.3 Baseline Mode
For existing projects with many conflicts, `aphoria baseline` records the current state. Subsequent scans only report *new* conflicts. This prevents the "500 warnings so we ignore all of them" problem.
```
$ aphoria baseline
Baseline recorded: 12 existing conflicts frozen.
Future scans will only report new conflicts.
```
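
The filtering behind baseline mode can be sketched as a set difference. The `Baseline` type is hypothetical, and identifying a conflict by its concept path alone is an assumption. A fuller version might also compare the claimed value, so that a changed setting on a frozen path is reported again.

```rust
use std::collections::BTreeSet;

/// Conflict paths frozen when `aphoria baseline` was run.
pub struct Baseline {
    known: BTreeSet<String>,
}

impl Baseline {
    /// Record the conflicts present at baseline time.
    pub fn record<I, S>(conflicts: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Baseline {
            known: conflicts.into_iter().map(Into::into).collect(),
        }
    }

    /// Number of frozen conflicts.
    pub fn len(&self) -> usize {
        self.known.len()
    }

    /// Keep only the conflicts that were not present in the baseline.
    pub fn new_conflicts<'a>(&self, current: &[&'a str]) -> Vec<&'a str> {
        current
            .iter()
            .copied()
            .filter(|c| !self.known.contains(*c))
            .collect()
    }
}
```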
---
## Phase 5: Research Agent Loop
### 5.1 Gap Detection
When Aphoria extracts a claim and no authoritative source exists for that concept, log it as a gap:
```
GAP: code://rust/citadeldb/cache/redis/max_memory_policy
No authoritative source found for redis/max_memory_policy
Seen in 3 projects
```
### 5.2 Research Agent Trigger
When a gap is seen across N projects (configurable, default 3), dispatch a research agent:
1. Agent searches for authoritative documentation on `redis max_memory_policy`
2. Finds Redis official docs
3. Extracts normative claims: "default is `noeviction`, recommended `allkeys-lru` for cache use cases"
4. Ingests as `vendor://redis/cache/max_memory_policy` at Tier 2
5. Future Aphoria scans now have something to conflict against
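
The trigger condition in front of these steps could be tracked as follows. `GapTracker` is a hypothetical type. Counting distinct projects per concept, and firing only once when the count first reaches the threshold, are both assumptions of this sketch.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Tracks which projects have hit each gap concept.
pub struct GapTracker {
    threshold: usize,
    seen: BTreeMap<String, BTreeSet<String>>, // concept -> distinct projects
}

impl GapTracker {
    pub fn new(threshold: usize) -> Self {
        GapTracker {
            threshold,
            seen: BTreeMap::new(),
        }
    }

    /// Record one gap sighting. Returns `true` exactly once per concept:
    /// when the number of distinct projects first reaches the threshold.
    pub fn record(&mut self, concept: &str, project: &str) -> bool {
        let projects = self.seen.entry(concept.to_string()).or_default();
        let before = projects.len();
        projects.insert(project.to_string());
        before < self.threshold && projects.len() >= self.threshold
    }
}

impl Default for GapTracker {
    /// Uses the default of N = 3 projects stated above.
    fn default() -> Self {
        GapTracker::new(3)
    }
}
```

Returning `true` only on the crossing keeps a popular gap from dispatching a new research agent on every later scan.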
### 5.3 Community Corpus Contributions
Users who run Aphoria can opt in to contribute their alias mappings and acknowledgment patterns (anonymized) to a shared corpus. Common patterns propagate:
- "Every Rust project has this JWT pattern" → pre-built alias set for Rust JWT libraries
- "This Redis config is always flagged and always acknowledged" → raise the default threshold for that concept so it is reported less often
- "This TLS pattern is always a real bug" → lower the default threshold for that concept so it is reported more readily
---
## Milestone Summary
| Phase | Deliverable | Depends On |
|-------|-------------|------------|
| 0 | ConceptPath in StemeDB | concept-hierarchy spec |
| 1 | Authoritative corpus (RFCs, OWASP) | Phase 0 |
| 2 | Aphoria CLI (scan, report, ack) | Phase 0, Phase 1 |
| 3 | Claude Code skill + hooks | Phase 2 |
| 4 | CI integration (GitHub Action, PR bot) | Phase 2 |
| 5 | Research agent loop | Phase 2, Phase 4 (gap data) |
Phase 0 and Phase 1 can run in parallel — the corpus ingestion uses the ConceptPath types as they're built. Phase 2 is the critical path. Everything after Phase 2 is distribution and flywheel.