jordan 42d4e09508 feat: Index persistence (Phase 5C) - vector hot/cold, visual checkpoint

2026-02-02 15:43:18 -07:00

Aphoria Roadmap

Phase 0: StemeDB Foundation

Tracked in: roadmap.md § 5D. Concept Hierarchy

Changes to the core database that Aphoria depends on. These ship before the CLI and are tracked in the main StemeDB roadmap as Phase 5D.

Aphoria Phase 0	StemeDB Phase 5D	Status
0.1 ConceptPath Type	5D.1 ConceptPath Type	⬜
0.2 ConceptPath in Assertion	(implicit in 5D.1)	⬜
0.3 Hierarchical Index	5D.4 Hierarchical Query	⬜
0.4 Alias Store	5D.3 Alias Store + 5D.5 Alias Resolution	⬜
0.5 Source Class Inference	5D.6 Source Class Inference	⬜
0.6 Concept API Endpoints	5D.7 Concept API Endpoints	⬜

Spec: docs/specs/concept-hierarchy.md

Phase 1: Authoritative Corpus

Before Aphoria can find conflicts, Episteme needs the authoritative sources to conflict against.

1.1 RFC Ingester

A CLI tool (or ingestion module) that:

Fetches RFC text from rfc-editor.org (text format, no PDF parsing needed)
Extracts normative statements marked by RFC 2119 keywords (MUST, MUST NOT, SHALL, SHOULD, and their negations)
Maps each statement to a ConceptPath: rfc://{number}/{topic}/{claim}
Ingests as Tier 0 assertions
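
The extraction step above could start as simply as the sketch below. It is a minimal illustration that assumes plain-text RFC input and a naive sentence split on ". "; the names `NormativeStatement`, `extract_normative`, and `contains_word` are hypothetical and not part of any existing Aphoria or Episteme API. Mapping each statement to a ConceptPath is left out.

```rust
/// RFC 2119 requirement keywords, negated forms first so "MUST NOT" wins over "MUST".
const KEYWORDS: [&str; 6] = ["MUST NOT", "SHALL NOT", "SHOULD NOT", "MUST", "SHALL", "SHOULD"];

/// A normative sentence together with the keyword that marked it (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct NormativeStatement {
    pub keyword: &'static str,
    pub sentence: String,
}

/// Splits RFC text into sentences and keeps those containing an
/// uppercase RFC 2119 keyword as a whole word.
pub fn extract_normative(text: &str) -> Vec<NormativeStatement> {
    // Collapse whitespace so sentences wrapped across lines are rejoined.
    let flat = text.split_whitespace().collect::<Vec<_>>().join(" ");
    flat.split(". ")
        .filter_map(|s| {
            let sentence = s.trim().trim_end_matches('.');
            let keyword = KEYWORDS
                .iter()
                .copied()
                .find(|k| contains_word(sentence, k))?;
            Some(NormativeStatement {
                keyword,
                sentence: sentence.to_string(),
            })
        })
        .collect()
}

/// True if `needle` occurs in `haystack` without a letter directly before or after it.
fn contains_word(haystack: &str, needle: &str) -> bool {
    haystack.match_indices(needle).any(|(i, _)| {
        let before = haystack[..i].chars().next_back();
        let after = haystack[i + needle.len()..].chars().next();
        !before.map_or(false, |c| c.is_alphabetic())
            && !after.map_or(false, |c| c.is_alphabetic())
    })
}
```

A real ingester would need a better sentence splitter (section numbers and abbreviations also contain ". "), but the keyword match itself stays this simple because RFC 2119 keywords are only normative in uppercase.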

Start with a curated list of security-relevant RFCs:

RFC	Topic
7519	JWT
6749	OAuth 2.0
6750	Bearer tokens
8446	TLS 1.3
7525	TLS best practices (obsoleted by RFC 9325)
6238	TOTP
7617	HTTP Basic Auth
9110	HTTP Semantics

1.2 OWASP Ingester

Parse OWASP Cheat Sheets (markdown source on GitHub):

Extract each recommendation as a claim
Map to owasp://cheatsheet/{topic}/{claim}
Ingest as Tier 1 assertions

Priority cheat sheets: Authentication, JWT, TLS, Secrets Management, Input Validation, Session Management.

1.3 Vendor Docs (Manual Bootstrap)

For v1, manually curate a small set of vendor doc claims:

Postgres connection pool recommendations
Redis timeout defaults
Common HTTP client library defaults (reqwest, hyper, net/http)

These are vendor://{product}/{topic}/{claim} at Tier 2.

This doesn't need to be exhaustive. It needs to cover the claims that Aphoria's extractors will actually find in code.

Phase 2: CLI Core

The Aphoria binary itself.

2.1 Project Walker

Input: a project root path. Output: a list of files to scan, each tagged with:

Language (rust, go, python, typescript, yaml, toml, json)
ConceptPath prefix derived from directory structure

crates/citadeldb/src/auth/jwt.rs
  → language: rust
  → prefix: code://rust/citadeldb/auth/jwt

Normalization rules:

Strip src/, lib/, pkg/, internal/ (language boilerplate)
Strip crates/, packages/, apps/ (monorepo wrappers)
Map config/, deploy/, infra/ to code://config/{project}/...
File extension determines language, not directory
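
The normalization rules above can be sketched as a single function. This is an illustration only: `ScanTarget`, `language_for`, and `classify` are hypothetical names, the `project` argument is an assumption about where the `{project}` segment comes from, and boilerplate segments are stripped wherever they appear in the path.

```rust
/// Result of walking one file: detected language and ConceptPath prefix (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ScanTarget {
    pub language: &'static str,
    pub prefix: String,
}

/// The file extension alone determines the language; unsupported files yield None.
fn language_for(ext: &str) -> Option<&'static str> {
    match ext {
        "rs" => Some("rust"),
        "go" => Some("go"),
        "py" => Some("python"),
        "ts" | "tsx" => Some("typescript"),
        "yaml" | "yml" => Some("yaml"),
        "toml" => Some("toml"),
        "json" => Some("json"),
        _ => None,
    }
}

/// Applies the normalization rules to a path relative to the project root.
pub fn classify(rel_path: &str, project: &str) -> Option<ScanTarget> {
    let (stem, ext) = rel_path.rsplit_once('.')?;
    let language = language_for(ext)?;
    let mut is_config = false;
    let mut segments: Vec<&str> = Vec::new();
    for seg in stem.split('/').filter(|s| !s.is_empty()) {
        match seg {
            // Language boilerplate and monorepo wrappers carry no meaning.
            "src" | "lib" | "pkg" | "internal" => {}
            "crates" | "packages" | "apps" => {}
            // Deployment directories are rerouted under code://config/{project}/...
            "config" | "deploy" | "infra" => is_config = true,
            other => segments.push(other),
        }
    }
    let prefix = if is_config {
        format!("code://config/{}/{}", project, segments.join("/"))
    } else {
        format!("code://{}/{}", language, segments.join("/"))
    };
    Some(ScanTarget { language, prefix })
}
```

With these rules, `crates/citadeldb/src/auth/jwt.rs` maps to `code://rust/citadeldb/auth/jwt`, matching the example above.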

2.2 Extractors

Each extractor is a module that:

Takes a file path + content + language
Returns a Vec<ExtractedClaim>

Ship these extractors in v1:

Extractor	What it finds	Languages
`tls_verify`	TLS certificate verification disabled	rust, go, python, js/ts
`jwt_config`	JWT validation settings (aud, exp, alg)	rust, go, python, js/ts
`hardcoded_secrets`	Credentials in source (not .env)	all
`timeout_config`	HTTP/DB/Redis timeout values	all (config files)
`dep_versions`	Known-vulnerable dependency versions	Cargo.toml, go.mod, package.json, requirements.txt
`cors_config`	CORS allow-origin settings	rust, go, js/ts
`rate_limit`	Rate limiting disabled or unreasonable	rust, go, js/ts

Extractors use regex + AST patterns, not LLMs. Each extractor declares:

The patterns it searches for
The ConceptPath leaf it maps to
The predicate (e.g., config_value, enabled, version)
How to extract the ObjectValue from the match
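
One possible shape for this contract is sketched below. The trait, the `SourceFile` struct, and the field layout of `ExtractedClaim` are assumptions for illustration, and plain substring matching stands in for the regex + AST patterns the real extractors would use. The three patterns are the certificate-verification switches of reqwest, Go's crypto/tls, and Python's requests.

```rust
/// Value extracted from a match (a small subset of what ObjectValue might hold).
#[derive(Debug, Clone, PartialEq)]
pub enum ObjectValue {
    Boolean(bool),
    Text(String),
}

/// One claim found in source code (field layout is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedClaim {
    pub path: String,
    pub predicate: &'static str,
    pub value: ObjectValue,
    pub source_location: String,
    pub confidence: f64,
}

/// Input handed to an extractor by the project walker (hypothetical type).
pub struct SourceFile<'a> {
    pub path: &'a str,
    pub content: &'a str,
    pub language: &'a str,
    pub prefix: &'a str,
}

/// What every extractor declares: its leaf, its predicate, and how it matches.
pub trait Extractor {
    fn leaf(&self) -> &'static str;
    fn predicate(&self) -> &'static str;
    fn extract(&self, file: &SourceFile) -> Vec<ExtractedClaim>;
}

/// Substrings that disable certificate verification in common HTTP clients.
const DISABLED_PATTERNS: [&str; 3] = [
    "danger_accept_invalid_certs(true)", // reqwest (Rust)
    "InsecureSkipVerify: true",          // crypto/tls (Go)
    "verify=False",                      // requests (Python)
];

pub struct TlsVerify;

impl Extractor for TlsVerify {
    fn leaf(&self) -> &'static str {
        "cert_verification"
    }

    fn predicate(&self) -> &'static str {
        "enabled"
    }

    fn extract(&self, file: &SourceFile) -> Vec<ExtractedClaim> {
        file.content
            .lines()
            .enumerate()
            .filter(|(_, line)| DISABLED_PATTERNS.iter().any(|p| line.contains(*p)))
            .map(|(index, _)| ExtractedClaim {
                path: format!("{}/{}", file.prefix, self.leaf()),
                predicate: self.predicate(),
                // A match means verification is switched off.
                value: ObjectValue::Boolean(false),
                source_location: format!("{}:{}", file.path, index + 1),
                confidence: 1.0,
            })
            .collect()
    }
}
```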

2.3 Ingestion Bridge

Connect extractor output to the Episteme ingestion pipeline:

ExtractedClaim {
    path: code://rust/citadeldb/auth/jwt/audience_validation
    predicate: "enabled"
    value: Boolean(false)
    source_location: "src/auth/jwt.rs:47"
    confidence: 1.0  // regex match, not heuristic
}
        ↓
Assertion {
    subject: ConceptPath::parse("code://rust/citadeldb/auth/jwt/audience_validation")
    predicate: "enabled"
    object: ObjectValue::Boolean(false)
    source_class: SourceClass::Expert  // inferred from code:// scheme
    source_hash: blake3(file_content)
    source_metadata: { "file": "src/auth/jwt.rs", "line": 47 }
    confidence: 1.0
    lifecycle: LifecycleStage::Approved  // code is deployed, it's a fact about the code
}

The bridge handles:

ConceptPath construction from extractor output
Source hash computation (BLAKE3 of the file at scan time)
Source metadata encoding (file path, line number, extraction method)
Signing with the Aphoria agent's keypair
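
A rough sketch of the first three responsibilities follows; signing is omitted. The `Assertion` struct here is a simplified stand-in for the real type, the numeric `tier` field replaces `SourceClass` (only `Expert` is named in this roadmap), and the hash function is passed in as a closure so the sketch does not assume a particular BLAKE3 binding. All function names are hypothetical.

```rust
/// File and line recovered from an extractor's "file:line" location string.
#[derive(Debug, PartialEq)]
pub struct SourceMetadata {
    pub file: String,
    pub line: u32,
}

/// Simplified stand-in for the Assertion record (not the real StemeDB type).
#[derive(Debug, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
    pub tier: u8,
    pub source_hash: String,
    pub source_metadata: SourceMetadata,
    pub confidence: f64,
}

/// Infers the source tier from the ConceptPath scheme, following the tiers
/// used in this roadmap: rfc = 0, owasp = 1, vendor = 2, code/internal = 3.
pub fn tier_for(path: &str) -> Option<u8> {
    let (scheme, _) = path.split_once("://")?;
    match scheme {
        "rfc" => Some(0),
        "owasp" => Some(1),
        "vendor" => Some(2),
        "code" | "internal" => Some(3),
        _ => None,
    }
}

/// Splits "src/auth/jwt.rs:47" into file path and line number.
pub fn parse_location(location: &str) -> Option<SourceMetadata> {
    let (file, line) = location.rsplit_once(':')?;
    Some(SourceMetadata {
        file: file.to_string(),
        line: line.parse().ok()?,
    })
}

/// Builds an assertion from extractor output. `hash` is supplied by the
/// caller (BLAKE3 of the file in the real bridge).
pub fn to_assertion(
    path: &str,
    predicate: &str,
    location: &str,
    file_content: &[u8],
    confidence: f64,
    hash: impl Fn(&[u8]) -> String,
) -> Option<Assertion> {
    Some(Assertion {
        subject: path.to_string(),
        predicate: predicate.to_string(),
        tier: tier_for(path)?,
        source_hash: hash(file_content),
        source_metadata: parse_location(location)?,
        confidence,
    })
}
```

Returning `None` for an unknown scheme or a malformed location keeps bad extractor output from reaching the ingestion pipeline.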

2.4 Conflict Query

After ingestion, query Episteme for each extracted concept:

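for claim in &extracted_claims {
    // Ask Episteme for every claim about this concept, following aliases.
    let results = query_engine.query(Query {
        subject: Some(claim.path.to_string()),
        resolve_aliases: true,
        hierarchical: false,
        lens: Some("skeptic"),
        ..Default::default()
    });

    // Report only concepts whose conflict score exceeds the configured threshold.
    if results.conflict_score > threshold {
        report.add_conflict(claim, results);
    }
}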

The Skeptic lens returns all claims for the concept across all aliased paths, with a conflict score. If the code claim (Tier 3) contradicts an RFC claim (Tier 0), the conflict score will be high because of the tier spread.
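
To make the loop testable without a running Episteme instance, the query engine could sit behind a trait. The sketch below assumes such a trait; `QueryEngine`, `QueryResult`, and `find_conflicts` are hypothetical names, and `Query` carries only the fields shown above.

```rust
/// Query parameters, limited to the fields used in this roadmap.
pub struct Query {
    pub subject: Option<String>,
    pub resolve_aliases: bool,
    pub hierarchical: bool,
    pub lens: Option<&'static str>,
}

/// Only the part of a query result the conflict check needs.
pub struct QueryResult {
    pub conflict_score: f64,
}

/// Abstraction over the Episteme query engine (hypothetical trait).
pub trait QueryEngine {
    fn query(&self, query: Query) -> QueryResult;
}

/// Returns every concept path whose Skeptic-lens conflict score exceeds `threshold`.
pub fn find_conflicts<E: QueryEngine>(
    engine: &E,
    paths: &[String],
    threshold: f64,
) -> Vec<(String, f64)> {
    paths
        .iter()
        .filter_map(|path| {
            let result = engine.query(Query {
                subject: Some(path.clone()),
                resolve_aliases: true,
                hierarchical: false,
                lens: Some("skeptic"),
            });
            // Keep the path only when the score is strictly above the threshold.
            (result.conflict_score > threshold).then(|| (path.clone(), result.conflict_score))
        })
        .collect()
}
```

A test double that returns fixed scores is then enough to exercise the reporting path end to end.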

2.5 Report Output

$ aphoria scan ./citadeldb --format table

┌──────────────────────────────────────────────────────────────────────┐
│ Aphoria Report: citadeldb                                          │
│ Scanned: 142 files │ Claims: 23 │ Conflicts: 3                     │
├──────────┬───────────────────────────────────────┬──────────┬───────┤
│ Verdict  │ Concept                               │ Score    │ Tier  │
├──────────┼───────────────────────────────────────┼──────────┼───────┤
│ BLOCK    │ auth/jwt/audience_validation           │ 0.92     │ 0↔3  │
│ BLOCK    │ net/tls/cert_verification              │ 0.87     │ 1↔3  │
│ FLAG     │ http/timeout                           │ 0.54     │ 2↔3  │
└──────────┴───────────────────────────────────────┴──────────┴───────┘

Details:

  BLOCK  code://rust/citadeldb/auth/jwt/audience_validation
         Your code:  aud validation disabled        (src/auth/jwt.rs:47)
         RFC 7519:   aud validation MUST be enabled  (Tier 0)
         Action:     Fix or acknowledge with: aphoria ack <path> --reason "..."

  BLOCK  code://rust/citadeldb/net/tls/cert_verification
         Your code:  verify = false                  (src/net/client.rs:23)
         OWASP:      verification required           (Tier 1)
         Action:     Fix or acknowledge with: aphoria ack <path> --reason "..."

  FLAG   code://rust/citadeldb/http/timeout
         Your code:  timeout = 0 (infinite)          (config/production.yaml:8)
         reqwest:    blocking default timeout 30s    (Tier 2)
         Action:     Review recommended

Output formats: table (default), json, sarif (for CI integration), markdown.

2.6 Acknowledge Command

$ aphoria ack code://rust/citadeldb/auth/jwt/audience_validation \
    --reason "Internal service, no external JWT consumers. Accepted risk per SEC-2024-003."

This creates a new Assertion:

Subject: internal://decision/citadeldb/auth/jwt/audience_validation
Predicate: deviation_accepted
Object: Text with the reason
SourceClass: Expert (Tier 3)
Aliased to: code://rust/citadeldb/auth/jwt/audience_validation

The conflict still exists in Episteme, but the acknowledgment is recorded. On the next scan, the conflict still appears, now with context: "Acknowledged by [agent] on [date]: [reason]." The Skeptic lens treats the acknowledgment as one more claim about the same concept.
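
The subject of the acknowledgment can be derived mechanically from the code path. The sketch below assumes the rule implied by the example above (drop the language segment and re-root under `internal://decision/`); `Acknowledgment` and `acknowledge` are hypothetical names.

```rust
/// The assertion fields created by `aphoria ack` (simplified, hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Acknowledgment {
    pub subject: String,
    pub predicate: &'static str,
    pub reason: String,
    pub alias_of: String,
}

/// Builds the acknowledgment for a `code://{language}/{concept...}` path.
/// Returns None if the path is not a code path or has no concept part.
pub fn acknowledge(code_path: &str, reason: &str) -> Option<Acknowledgment> {
    let rest = code_path.strip_prefix("code://")?;
    // The first segment is the language; the decision path omits it.
    let (_language, concept) = rest.split_once('/')?;
    if concept.is_empty() {
        return None;
    }
    Some(Acknowledgment {
        subject: format!("internal://decision/{}", concept),
        predicate: "deviation_accepted",
        reason: reason.to_string(),
        alias_of: code_path.to_string(),
    })
}
```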

Phase 3: Skill Integration

3.1 Claude Code Skill

A /aphoria skill that wraps the CLI:

/aphoria scan          Scan current project, report conflicts
/aphoria scan --fix    Scan and offer to fix each conflict
/aphoria ack <path>    Acknowledge a conflict with a reason
/aphoria status        Show current conflict summary
/aphoria diff          Show new conflicts since last scan

The skill runs the CLI binary, parses the JSON output, and presents results inline in the Claude Code session.

3.2 Agent Pre-Flight Hook

A Claude Code hook that runs Aphoria before certain operations:

{
  "hooks": {
    "pre-commit": "aphoria scan --format sarif --exit-code",
    "pre-deploy": "aphoria scan --strict --exit-code"
  }
}

With --exit-code, the scan exits non-zero if any BLOCK verdicts exist, which prevents the commit or deploy.
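
The exit-status rule is small enough to state in code. This sketch assumes only the two verdicts shown in the report and an exit status of 1 for failure; the actual status value and the meaning of --strict are not specified here.

```rust
/// Verdicts as they appear in the report.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Block,
    Flag,
}

/// Process exit status for a scan: non-zero only when `--exit-code` was
/// requested and at least one BLOCK verdict is present. FLAG never fails.
pub fn exit_status(verdicts: &[Verdict], exit_code_requested: bool) -> i32 {
    if exit_code_requested && verdicts.contains(&Verdict::Block) {
        1
    } else {
        0
    }
}
```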

3.3 Alias Suggestion Workflow

When Aphoria scans a new project and finds concepts that share leaf names with existing authoritative paths, it prompts:

New concept detected: code://rust/newproject/auth/jwt/audience_validation

Suggested alias:
  → rfc://7519/jwt/audience_validation (Tier 0, RFC 7519 Section 4.1.3)

Accept? [y/n/defer]

Accepting creates the alias. Deferring flags it for later review. Rejecting records that these are intentionally different concepts.
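
Candidate aliases can be found by comparing leaf names, as described above. The following is a minimal sketch with hypothetical function names; it matches on the final path segment only and leaves the accept/defer/reject bookkeeping out.

```rust
/// Last non-empty segment of a ConceptPath, e.g. "audience_validation".
fn leaf(path: &str) -> Option<&str> {
    path.rsplit('/').next().filter(|s| !s.is_empty())
}

/// Returns the authoritative paths that share a leaf name with `new_path`.
pub fn suggest_aliases<'a>(new_path: &str, authoritative: &[&'a str]) -> Vec<&'a str> {
    match leaf(new_path) {
        Some(target) => authoritative
            .iter()
            .copied()
            .filter(|candidate| leaf(candidate) == Some(target))
            .collect(),
        None => Vec::new(),
    }
}
```

Leaf equality is deliberately loose, which is why the workflow asks for confirmation instead of creating the alias automatically.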

Phase 4: CI Integration

4.1 GitHub Action

- name: Aphoria Scan
  uses: orchard9/aphoria-action@v1
  with:
    episteme-url: ${{ secrets.EPISTEME_URL }}
    fail-on: block
    format: sarif

Publishes SARIF results to the GitHub Security tab. BLOCK verdicts fail the check. FLAG verdicts appear as warnings.

4.2 PR Comment Bot

On pull request, Aphoria scans the diff (not the whole project) and comments:

## Aphoria Report

This PR introduces 1 new conflict:

| File | Conflict | Score |
|------|----------|-------|
| src/auth/jwt.rs:47 | Disables aud validation (RFC 7519 requires it) | 0.92 |

Run `aphoria ack` to acknowledge, or fix before merge.

4.3 Baseline Mode

For existing projects with many conflicts, aphoria baseline records the current state. Subsequent scans only report new conflicts. This prevents the "500 warnings so we ignore all of them" problem.

$ aphoria baseline
Baseline recorded: 12 existing conflicts frozen.
Future scans will only report new conflicts.
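
Filtering against a baseline amounts to a set difference. The sketch below assumes conflicts are identified by their concept path alone; how the baseline is stored on disk is not covered, and `new_conflicts` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Returns the conflicts from the current scan that are not frozen in the
/// baseline, preserving scan order.
pub fn new_conflicts<'a>(baseline: &HashSet<String>, current: &'a [String]) -> Vec<&'a str> {
    current
        .iter()
        .filter(|path| !baseline.contains(path.as_str()))
        .map(|path| path.as_str())
        .collect()
}
```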

Phase 5: Research Agent Loop

5.1 Gap Detection

When Aphoria extracts a claim and no authoritative source exists for that concept, log it as a gap:

GAP: code://rust/citadeldb/cache/redis/max_memory_policy
     No authoritative source found for redis/max_memory_policy
     Seen in 3 projects

5.2 Research Agent Trigger

When a gap is seen across N projects (configurable, default 3), dispatch a research agent:

Agent searches for authoritative documentation on redis max_memory_policy
Finds Redis official docs
Extracts normative claims: "default is noeviction, recommended allkeys-lru for cache use cases"
Ingests as vendor://redis/cache/max_memory_policy at Tier 2
Future Aphoria scans now have something to conflict against
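
The trigger condition can be tracked with a map from concept to the set of projects that reported the gap. This is a sketch under the assumption that gaps are keyed by a project-independent concept such as `redis/max_memory_policy`; `GapTracker` is a hypothetical type, and dispatching the agent itself is out of scope.

```rust
use std::collections::{HashMap, HashSet};

/// Counts distinct projects per unanswered concept.
pub struct GapTracker {
    seen: HashMap<String, HashSet<String>>,
    threshold: usize,
}

impl GapTracker {
    /// `threshold` is N, the number of projects that triggers research (default 3).
    pub fn new(threshold: usize) -> Self {
        GapTracker {
            seen: HashMap::new(),
            threshold,
        }
    }

    /// Records one gap sighting. Returns true exactly once per concept: at the
    /// moment the number of distinct projects first reaches the threshold.
    pub fn record(&mut self, concept: &str, project: &str) -> bool {
        let projects = self.seen.entry(concept.to_string()).or_default();
        let is_new_project = projects.insert(project.to_string());
        is_new_project && projects.len() == self.threshold
    }

    /// Number of distinct projects that have reported this gap.
    pub fn project_count(&self, concept: &str) -> usize {
        self.seen.get(concept).map_or(0, |projects| projects.len())
    }
}
```

Returning true only on the threshold crossing avoids dispatching a second research agent when a fourth or fifth project reports the same gap.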

5.3 Community Corpus Contributions

Users who run Aphoria can opt in to contribute their alias mappings and acknowledgment patterns (anonymized) to a shared corpus. Common patterns propagate:

"Every Rust project has this JWT pattern" → pre-built alias set for Rust JWT libraries
"This Redis config is always flagged and always acknowledged" → raise the default conflict threshold for that concept so it is reported less often
"This TLS pattern is always a real bug" → lower the default conflict threshold so it is reported more readily

Milestone Summary

Phase	Deliverable	Depends On
0	ConceptPath in StemeDB	concept-hierarchy spec
1	Authoritative corpus (RFCs, OWASP)	Phase 0
2	Aphoria CLI (scan, report, ack)	Phase 0, Phase 1
3	Claude Code skill + hooks	Phase 2
4	CI integration (GitHub Action, PR bot)	Phase 2
5	Research agent loop	Phase 2, Phase 4 (gap data)

Phase 1 depends on the Phase 0 types, but the two can overlap: corpus ingestion adopts the ConceptPath types as they are built. Phase 2 is the critical path. Everything after Phase 2 is distribution and flywheel.
