# Aphoria Roadmap

---

## Phase 0: StemeDB Foundation

Changes to the core database that Aphoria depends on. These ship before the CLI.

### 0.1 ConceptPath Type

Add the `ConceptPath` struct to `stemedb-core`, covering parsing, validation, wire format (`scheme://segments/leaf`), prefix matching, and parent traversal. Backward-compatible: bare strings parse as `custom://{string}`.

**Depends on:** [concept-hierarchy spec](../../docs/specs/concept-hierarchy.md)
**Crate:** `stemedb-core`

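A minimal sketch of what such a type could look like, using only the standard library. The field names, the error type, and the method names (`to_wire`, `parent`, `is_prefix_of`) are assumptions for illustration; the concept-hierarchy spec is the authority on the real API and validation rules.

```rust
/// Hypothetical ConceptPath sketch; field and method names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConceptPath {
    scheme: String,
    segments: Vec<String>, // the last element is the leaf
}

impl ConceptPath {
    /// Parse `scheme://a/b/leaf`; bare strings fall back to `custom://{string}`.
    pub fn parse(input: &str) -> Result<Self, String> {
        let (scheme, rest) = match input.split_once("://") {
            Some((s, r)) => (s, r),
            None => ("custom", input),
        };
        if scheme.is_empty() || !scheme.chars().all(|c| c.is_ascii_alphanumeric()) {
            return Err(format!("invalid scheme in {input:?}"));
        }
        let segments: Vec<String> = rest.split('/').map(str::to_string).collect();
        if segments.iter().any(|s| s.is_empty()) {
            return Err(format!("empty segment in {input:?}"));
        }
        Ok(Self { scheme: scheme.to_string(), segments })
    }

    /// Wire format: `scheme://segments/leaf`.
    pub fn to_wire(&self) -> String {
        format!("{}://{}", self.scheme, self.segments.join("/"))
    }

    /// Parent path, or `None` when only one segment remains.
    pub fn parent(&self) -> Option<Self> {
        if self.segments.len() <= 1 {
            return None;
        }
        let mut p = self.clone();
        p.segments.pop();
        Some(p)
    }

    /// True when `self` equals `other` or is an ancestor of it, compared per segment.
    pub fn is_prefix_of(&self, other: &Self) -> bool {
        self.scheme == other.scheme
            && other.segments.len() >= self.segments.len()
            && self.segments.iter().zip(&other.segments).all(|(a, b)| a == b)
    }
}
```

Comparing whole segments rather than raw strings keeps `auth` from matching `authz`, which is the same property the trailing `/` provides in the index scan of 0.3.
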
### 0.2 ConceptPath in Assertion

Replace `Assertion.subject: EntityId` with `Assertion.subject: ConceptPath`. Update rkyv serialization. Update all downstream consumers (ingestion, query, lenses, API, tests).

**Crates:** `stemedb-core`, `stemedb-ingest`, `stemedb-query`, `stemedb-lens`, `stemedb-api`

### 0.3 Hierarchical Index

Update `IndexStore` key construction to use the ConceptPath wire format. Verify that `scan_prefix` on `S:{concept_path}/` returns all descendants. No new index structure is needed: because path segments are separated by `/`, a byte-level prefix scan over the existing keys already selects the whole subtree.

**Crate:** `stemedb-storage`

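The prefix-scan property can be illustrated with an ordered map. This is a sketch only: `BTreeMap` stands in for the real `IndexStore`, whose API is not shown in this document, and `scan_descendants` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Return every key stored under `S:{concept_path}/` using an ordered range scan.
/// `BTreeMap` is a stand-in for the real `IndexStore` (an assumption).
pub fn scan_descendants<'a>(
    index: &'a BTreeMap<String, u64>,
    concept_path: &str,
) -> Vec<&'a str> {
    // The trailing `/` keeps `.../auth` from also matching `.../authz`.
    let prefix = format!("S:{concept_path}/");
    index
        .range(prefix.clone()..)
        .take_while(|(k, _)| k.starts_with(&prefix))
        .map(|(k, _)| k.as_str())
        .collect()
}
```

Because keys are ordered bytewise, all descendants of a path sit in one contiguous range, so the scan stops at the first key that no longer carries the prefix.
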
### 0.4 Alias Store

Add `CA:` (alias → canonical) and `CAR:` (canonical → all aliases) key prefixes. Implement alias resolution in the query path: look up aliases before the index scan, merge results, and deduplicate. Alias resolution is transitive.

**Crates:** `stemedb-storage`, `stemedb-query`

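One way the two key spaces and transitive resolution might fit together is sketched below with in-memory maps. The `AliasStore` type and its methods are hypothetical, and the cycle guard is an added assumption, since the roadmap does not say how alias cycles are handled.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-in for the two key spaces; the real store layout is an assumption.
#[derive(Default)]
pub struct AliasStore {
    ca: HashMap<String, String>,            // CA:  alias -> canonical
    car: HashMap<String, BTreeSet<String>>, // CAR: canonical -> all aliases
}

impl AliasStore {
    pub fn add_alias(&mut self, alias: &str, canonical: &str) {
        self.ca.insert(alias.to_string(), canonical.to_string());
        self.car
            .entry(canonical.to_string())
            .or_default()
            .insert(alias.to_string());
    }

    /// Follow CA: links until a path with no further mapping is reached.
    /// The visited set guards against alias cycles.
    pub fn canonical(&self, path: &str) -> String {
        let mut seen = BTreeSet::new();
        let mut cur = path.to_string();
        while let Some(next) = self.ca.get(&cur) {
            if !seen.insert(cur.clone()) {
                break;
            }
            cur = next.clone();
        }
        cur
    }

    /// Every path to scan for a query: the canonical path plus all paths that
    /// reach it through CAR:, deduplicated and sorted.
    pub fn expand(&self, path: &str) -> Vec<String> {
        let mut out = BTreeSet::new();
        let mut stack = vec![self.canonical(path)];
        while let Some(p) = stack.pop() {
            if out.insert(p.clone()) {
                if let Some(aliases) = self.car.get(&p) {
                    stack.extend(aliases.iter().cloned());
                }
            }
        }
        out.into_iter().collect()
    }
}
```

The query path would call `expand` first and then run one index scan per returned path, merging the results.
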
### 0.5 Source Class Inference

Wire scheme-based tier inference into ingestion. If no explicit `source_class` is set, infer it from the ConceptPath scheme: `rfc://` → Tier 0, `code://` → Tier 3, etc.

**Crate:** `stemedb-ingest`

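As a rough sketch, the inference could be a single match on the scheme. The `owasp` and `vendor` tiers below are taken from Phase 1 of this roadmap; the function name, the `u8` tier representation, and returning `None` for unknown schemes are assumptions.

```rust
/// Infer a source tier from a path's scheme when no explicit `source_class` is set.
/// Unknown schemes yield `None` because their default tier is not specified here.
pub fn infer_tier(concept_path: &str, explicit: Option<u8>) -> Option<u8> {
    if explicit.is_some() {
        return explicit; // an explicit source_class always wins
    }
    let scheme = concept_path.split_once("://").map(|(s, _)| s)?;
    match scheme {
        "rfc" => Some(0),
        "owasp" => Some(1),
        "vendor" => Some(2),
        "code" => Some(3),
        _ => None,
    }
}
```
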
### 0.6 Concept API Endpoints

```
POST   /v1/concepts/alias            Create alias
GET    /v1/concepts/aliases/{path}   List aliases for a path
DELETE /v1/concepts/alias            Remove alias
GET    /v1/concepts/tree/{prefix}    Browse hierarchy under prefix
GET    /v1/concepts/suggest          Suggested aliases (shared leaf detection)
```

**Crate:** `stemedb-api`

---

## Phase 1: Authoritative Corpus

Before Aphoria can find conflicts, Episteme needs the authoritative sources to conflict against.

### 1.1 RFC Ingester

A CLI tool (or ingestion module) that:
- Fetches RFC text from `rfc-editor.org` (text format, no PDF parsing needed)
- Extracts normative statements (MUST, MUST NOT, SHOULD, SHALL per RFC 2119)
- Maps each statement to a ConceptPath: `rfc://{number}/{topic}/{claim}`
- Ingests as Tier 0 assertions

Start with a curated list of security-relevant RFCs:

| RFC | Topic |
|-----|-------|
| 7519 | JWT |
| 6749 | OAuth 2.0 |
| 6750 | Bearer tokens |
| 8446 | TLS 1.3 |
| 7525 | TLS best practices |
| 6238 | TOTP |
| 7617 | HTTP Basic Auth |
| 9110 | HTTP Semantics |

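The extraction step could start as simple keyword matching over sentences, as in the sketch below. The names `Normative` and `extract_normative` are hypothetical, sentence splitting on `". "` is deliberately naive, and the negated forms SHALL NOT and SHOULD NOT (also defined by RFC 2119) are included so they are not misreported as their positive counterparts.

```rust
/// RFC 2119 keywords, negated forms first so "MUST NOT" is not reported as "MUST".
static KEYWORDS: [&str; 6] = ["MUST NOT", "SHALL NOT", "SHOULD NOT", "MUST", "SHALL", "SHOULD"];

#[derive(Debug, PartialEq)]
pub struct Normative {
    pub keyword: &'static str,
    pub sentence: String,
}

/// Return each sentence that contains an upper-case RFC 2119 keyword.
pub fn extract_normative(text: &str) -> Vec<Normative> {
    // RFC text is hard-wrapped, so collapse whitespace before splitting.
    let flat = text.split_whitespace().collect::<Vec<_>>().join(" ");
    flat.split(". ")
        .filter_map(|sentence| {
            let keyword = KEYWORDS.iter().copied().find(|k| sentence.contains(*k))?;
            Some(Normative {
                keyword,
                sentence: sentence.trim_end_matches('.').to_string(),
            })
        })
        .collect()
}
```

Mapping each extracted sentence to a `{topic}/{claim}` pair is the harder part and is not attempted here.
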
### 1.2 OWASP Ingester

Parse OWASP Cheat Sheets (markdown source on GitHub):
- Extract each recommendation as a claim
- Map to `owasp://cheatsheet/{topic}/{claim}`
- Ingest as Tier 1 assertions

Priority cheat sheets: Authentication, JWT, TLS, Secrets Management, Input Validation, Session Management.

### 1.3 Vendor Docs (Manual Bootstrap)

For v1, manually curate a small set of vendor doc claims:
- Postgres connection pool recommendations
- Redis timeout defaults
- Common HTTP client library defaults (reqwest, hyper, net/http)

These are `vendor://{product}/{topic}/{claim}` at Tier 2.

This doesn't need to be exhaustive. It needs to cover the claims that Aphoria's extractors will actually find in code.

---

## Phase 2: CLI Core

The Aphoria binary itself.

### 2.1 Project Walker

Input: a project root path.

Output: a list of files to scan, each tagged with:
- Language (rust, go, python, typescript, yaml, toml, json)
- ConceptPath prefix derived from directory structure

```
crates/citadeldb/src/auth/jwt.rs
  → language: rust
  → prefix: code://rust/citadeldb/auth/jwt
```

Normalization rules:
- Strip `src/`, `lib/`, `pkg/`, `internal/` (language boilerplate)
- Strip `crates/`, `packages/`, `apps/` (monorepo wrappers)
- Map `config/`, `deploy/`, `infra/` to `code://config/{project}/...`
- File extension determines language, not directory

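The normalization rules above might be applied as in the following sketch. `concept_prefix` and `language_for` are hypothetical names, the extension list is an assumption, and passing the project name in as a parameter is one possible way to fill `{project}` for config directories.

```rust
/// Map a file extension to a language tag; the extension list is an assumption.
fn language_for(ext: &str) -> Option<&'static str> {
    Some(match ext {
        "rs" => "rust",
        "go" => "go",
        "py" => "python",
        "ts" | "tsx" => "typescript",
        "yaml" | "yml" => "yaml",
        "toml" => "toml",
        "json" => "json",
        _ => return None,
    })
}

/// Derive `(language, prefix)` for a path relative to the project root.
pub fn concept_prefix(project: &str, rel_path: &str) -> Option<(&'static str, String)> {
    const STRIP: [&str; 7] = ["src", "lib", "pkg", "internal", "crates", "packages", "apps"];
    const CONFIG: [&str; 3] = ["config", "deploy", "infra"];

    // The extension, not the directory, decides the language.
    let (stem, ext) = rel_path.rsplit_once('.')?;
    let language = language_for(ext)?;

    // Drop boilerplate and monorepo wrapper directories.
    let parts: Vec<&str> = stem
        .split('/')
        .filter(|p| !p.is_empty() && !STRIP.contains(p))
        .collect();

    let prefix = match parts.split_first() {
        Some((first, rest)) if CONFIG.contains(first) => {
            format!("code://config/{project}/{}", rest.join("/"))
        }
        _ => format!("code://{language}/{}", parts.join("/")),
    };
    Some((language, prefix))
}
```

With these rules, `concept_prefix("citadeldb", "crates/citadeldb/src/auth/jwt.rs")` yields the `code://rust/citadeldb/auth/jwt` prefix from the example above.
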
### 2.2 Extractors

Each extractor is a module that:
- Takes a file path + content + language
- Returns a `Vec<ExtractedClaim>`

Ship these extractors in v1:

| Extractor | What it finds | Languages |
|-----------|--------------|-----------|
| `tls_verify` | TLS certificate verification disabled | rust, go, python, js/ts |
| `jwt_config` | JWT validation settings (aud, exp, alg) | rust, go, python, js/ts |
| `hardcoded_secrets` | Credentials in source (not .env) | all |
| `timeout_config` | HTTP/DB/Redis timeout values | all (config files) |
| `dep_versions` | Known-vulnerable dependency versions | Cargo.toml, go.mod, package.json, requirements.txt |
| `cors_config` | CORS allow-origin settings | rust, go, js/ts |
| `rate_limit` | Rate limiting disabled or unreasonable | rust, go, js/ts |

Extractors use regex + AST patterns, not LLMs. Each extractor declares:
- The patterns it searches for
- The ConceptPath leaf it maps to
- The predicate (e.g., `config_value`, `enabled`, `version`)
- How to extract the ObjectValue from the match

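A possible shape for the extractor interface, sketched with a simplified `tls_verify` extractor. Plain substring matching stands in for the regex and AST patterns to keep the example dependency-free. The trait, the struct fields, and the reduced `ObjectValue` enum are assumptions, and the three patterns are illustrative examples of settings that disable certificate verification.

```rust
/// Reduced stand-in for the real ObjectValue type (an assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum ObjectValue {
    Boolean(bool),
}

#[derive(Debug, PartialEq)]
pub struct ExtractedClaim {
    pub leaf: &'static str,      // ConceptPath leaf the match maps to
    pub predicate: &'static str, // e.g. "enabled"
    pub value: ObjectValue,
    pub source_location: String, // "path:line"
}

pub trait Extractor {
    fn extract(&self, path: &str, content: &str, language: &str) -> Vec<ExtractedClaim>;
}

/// Flags lines that switch off TLS certificate verification.
pub struct TlsVerify;

impl Extractor for TlsVerify {
    fn extract(&self, path: &str, content: &str, language: &str) -> Vec<ExtractedClaim> {
        // One literal pattern per language; a real extractor would use regex or AST queries.
        let pattern = match language {
            "rust" => "danger_accept_invalid_certs(true)",
            "go" => "InsecureSkipVerify: true",
            "python" => "verify=False",
            _ => return Vec::new(),
        };
        content
            .lines()
            .enumerate()
            .filter(|(_, line)| line.contains(pattern))
            .map(|(i, _)| ExtractedClaim {
                leaf: "net/tls/cert_verification",
                predicate: "enabled",
                value: ObjectValue::Boolean(false),
                source_location: format!("{path}:{}", i + 1), // 1-based line number
            })
            .collect()
    }
}
```

Keeping the leaf and predicate as static data on each extractor makes the mapping from pattern to ConceptPath auditable without running any code.
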
### 2.3 Ingestion Bridge

Connect extractor output to the Episteme ingestion pipeline:

```
ExtractedClaim {
    path: code://rust/citadeldb/auth/jwt/audience_validation
    predicate: "enabled"
    value: Boolean(false)
    source_location: "src/auth/jwt.rs:47"
    confidence: 1.0  // regex match, not heuristic
}
        ↓
Assertion {
    subject: ConceptPath::parse("code://rust/citadeldb/auth/jwt/audience_validation")
    predicate: "enabled"
    object: ObjectValue::Boolean(false)
    source_class: SourceClass::Expert  // inferred from code:// scheme
    source_hash: blake3(file_content)
    source_metadata: { "file": "src/auth/jwt.rs", "line": 47 }
    confidence: 1.0
    lifecycle: LifecycleStage::Approved  // code is deployed, it's a fact about the code
}
```

The bridge handles:
- ConceptPath construction from extractor output
- Source hash computation (BLAKE3 of the file at scan time)
- Source metadata encoding (file path, line number, extraction method)
- Signing with the Aphoria agent's keypair

### 2.4 Conflict Query

After ingestion, query Episteme for each extracted concept:

```rust
for claim in extracted_claims {
    // Exact-path lookup through the Skeptic lens, following aliases
    // so authoritative claims on aliased paths are included.
    let results = query_engine.query(Query {
        subject: Some(claim.path.to_string()),
        resolve_aliases: true,
        hierarchical: false,
        lens: Some("skeptic"),
        ..Default::default()
    });

    // Only conflicts scoring above the threshold reach the report.
    if results.conflict_score > threshold {
        report.add_conflict(claim, results);
    }
}
```

The Skeptic lens returns all claims for the concept across all aliased paths, with a conflict score. If the code claim (Tier 3) contradicts an RFC claim (Tier 0), the conflict score will be high because of the tier spread.

### 2.5 Report Output

```
$ aphoria scan ./citadeldb --format table

┌─────────────────────────────────────────────────────────────────────┐
│ Aphoria Report: citadeldb                                           │
│ Scanned: 142 files │ Claims: 23 │ Conflicts: 3                      │
├──────────┬───────────────────────────────────────┬──────────┬───────┤
│ Verdict  │ Concept                               │ Score    │ Tier  │
├──────────┼───────────────────────────────────────┼──────────┼───────┤
│ BLOCK    │ auth/jwt/audience_validation          │ 0.92     │ 0↔3   │
│ BLOCK    │ net/tls/cert_verification             │ 0.87     │ 1↔3   │
│ FLAG     │ http/timeout                          │ 0.54     │ 2↔3   │
└──────────┴───────────────────────────────────────┴──────────┴───────┘

Details:

BLOCK  code://rust/citadeldb/auth/jwt/audience_validation
  Your code:  aud validation disabled (src/auth/jwt.rs:47)
  RFC 7519:   aud validation MUST be enabled (Tier 0)
  Action:     Fix or acknowledge with: aphoria ack <path> --reason "..."

BLOCK  code://rust/citadeldb/net/tls/cert_verification
  Your code:  verify = false (src/net/client.rs:23)
  OWASP:      verification required (Tier 1)
  Action:     Fix or acknowledge with: aphoria ack <path> --reason "..."

FLAG   code://rust/citadeldb/http/timeout
  Your code:  timeout = 0 (infinite) (config/production.yaml:8)
  reqwest:    default timeout 30s (Tier 2)
  Action:     Review recommended
```

Output formats: `table` (default), `json`, `sarif` (for CI integration), `markdown`.

### 2.6 Acknowledge Command

```
$ aphoria ack code://rust/citadeldb/auth/jwt/audience_validation \
    --reason "Internal service, no external JWT consumers. Accepted risk per SEC-2024-003."
```

This creates a new Assertion:
- Subject: `internal://decision/citadeldb/auth/jwt/audience_validation`
- Predicate: `deviation_accepted`
- Object: Text with the reason
- SourceClass: Expert (Tier 3)
- Aliased to: `code://rust/citadeldb/auth/jwt/audience_validation`

The conflict still exists in Episteme, but the acknowledgment is recorded. On the next scan, the conflict still shows, but with context: "Acknowledged by [agent] on [date]: [reason]." The Skeptic lens sees the acknowledgment as an additional claim in the space.

---

## Phase 3: Skill Integration

### 3.1 Claude Code Skill

A `/aphoria` skill that wraps the CLI:

```
/aphoria scan              Scan current project, report conflicts
/aphoria scan --fix        Scan and offer to fix each conflict
/aphoria ack <path>        Acknowledge a conflict with a reason
/aphoria status            Show current conflict summary
/aphoria diff              Show new conflicts since last scan
```

The skill runs the CLI binary, parses the JSON output, and presents results inline in the Claude Code session.

### 3.2 Agent Pre-Flight Hook

A Claude Code hook that runs Aphoria before certain operations:

```json
{
  "hooks": {
    "pre-commit": "aphoria scan --format sarif --exit-code",
    "pre-deploy": "aphoria scan --strict --exit-code"
  }
}
```

`--exit-code` returns non-zero if any BLOCK verdicts exist, preventing the commit or deploy.

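The `--exit-code` rule reduces to a small predicate over the verdicts in a report. In this sketch the `Verdict` enum and the function name are hypothetical, and the concrete status value 1 is an assumption, since the roadmap only requires "non-zero".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Block,
    Flag,
}

/// Process exit status for `--exit-code`: non-zero when any BLOCK verdict exists.
/// FLAG verdicts alone do not fail the run.
pub fn exit_code(verdicts: &[Verdict]) -> i32 {
    if verdicts.contains(&Verdict::Block) {
        1 // assumed value; any non-zero status stops the hook
    } else {
        0
    }
}
```

The CLI entry point would pass the result to `std::process::exit` after the report has been written.
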
### 3.3 Alias Suggestion Workflow

When Aphoria scans a new project and finds concepts that share leaf names with existing authoritative paths, it prompts:

```
New concept detected: code://rust/newproject/auth/jwt/audience_validation

Suggested alias:
  → rfc://7519/jwt/audience_validation (Tier 0, RFC 7519 Section 4.1.3)

Accept? [y/n/defer]
```

Accepting creates the alias. Deferring flags it for later review. Rejecting records that these are intentionally different concepts.

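Shared-leaf detection can be sketched as a lookup keyed by the last path segment. The function names are hypothetical, and matching on the exact leaf string is an assumption; the real matcher may normalize or rank candidates.

```rust
use std::collections::HashMap;

/// Last path segment, e.g. `audience_validation`.
fn leaf(path: &str) -> &str {
    path.rsplit('/').next().unwrap_or(path)
}

/// For each newly scanned path, list the authoritative paths that share its leaf name.
/// Paths without any match are omitted from the result.
pub fn suggest_aliases<'a>(
    new_paths: &[&'a str],
    authoritative: &[&'a str],
) -> Vec<(&'a str, Vec<&'a str>)> {
    // Index authoritative paths by leaf so each new path needs one lookup.
    let mut by_leaf: HashMap<&str, Vec<&'a str>> = HashMap::new();
    for &auth in authoritative {
        by_leaf.entry(leaf(auth)).or_default().push(auth);
    }
    new_paths
        .iter()
        .filter_map(|&p| by_leaf.get(leaf(p)).map(|hits| (p, hits.clone())))
        .collect()
}
```
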

---

## Phase 4: CI Integration

### 4.1 GitHub Action

```yaml
- name: Aphoria Scan
  uses: orchard9/aphoria-action@v1
  with:
    episteme-url: ${{ secrets.EPISTEME_URL }}
    fail-on: block
    format: sarif
```

Publishes SARIF results to the GitHub Security tab. BLOCK verdicts fail the check. FLAG verdicts appear as warnings.

### 4.2 PR Comment Bot

On pull request, Aphoria scans the diff (not the whole project) and comments:

```
## Aphoria Report

This PR introduces 1 new conflict:

| File | Conflict | Score |
|------|----------|-------|
| src/auth/jwt.rs:47 | Disables aud validation (RFC 7519 requires it) | 0.92 |

Run `aphoria ack` to acknowledge, or fix before merge.
```

### 4.3 Baseline Mode

For existing projects with many conflicts, `aphoria baseline` records the current state. Subsequent scans only report *new* conflicts. This prevents the "500 warnings so we ignore all of them" problem.

```
$ aphoria baseline
Baseline recorded: 12 existing conflicts frozen.
Future scans will only report new conflicts.
```

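Conceptually, a baseline-aware scan reports the set difference between the current conflicts and the frozen ones. The sketch below keys conflicts by concept path, which is an assumption; the real baseline may also need to track file location or claim value.

```rust
use std::collections::BTreeSet;

/// Conflicts present in the current scan but absent from the recorded baseline.
/// Keying by concept path is an assumption made for this sketch.
pub fn new_conflicts(baseline: &BTreeSet<String>, current: &BTreeSet<String>) -> Vec<String> {
    current.difference(baseline).cloned().collect()
}
```
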

---

## Phase 5: Research Agent Loop

### 5.1 Gap Detection

When Aphoria extracts a claim and no authoritative source exists for that concept, log it as a gap:

```
GAP: code://rust/citadeldb/cache/redis/max_memory_policy
  No authoritative source found for redis/max_memory_policy
  Seen in 3 projects
```

### 5.2 Research Agent Trigger

When a gap is seen across N projects (configurable, default 3), dispatch a research agent:

1. Agent searches for authoritative documentation on `redis max_memory_policy`
2. Finds Redis official docs
3. Extracts normative claims: "default is `noeviction`, recommended `allkeys-lru` for cache use cases"
4. Ingests as `vendor://redis/cache/max_memory_policy` at Tier 2
5. Future Aphoria scans now have something to conflict against

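The trigger condition can be sketched as a counter of distinct projects per gap. `GapTracker` is a hypothetical name, and firing exactly once when the threshold is first reached is an assumption, made so the same gap does not dispatch repeated research runs.

```rust
use std::collections::{BTreeSet, HashMap};

/// Tracks which projects have reported each gap and signals when to dispatch research.
pub struct GapTracker {
    threshold: usize, // N distinct projects; the roadmap's default is 3
    seen: HashMap<String, BTreeSet<String>>,
}

impl GapTracker {
    pub fn new(threshold: usize) -> Self {
        Self { threshold, seen: HashMap::new() }
    }

    /// Record a gap for `concept` seen in `project`. Returns true exactly once,
    /// when the number of distinct projects first reaches the threshold.
    pub fn record(&mut self, concept: &str, project: &str) -> bool {
        let projects = self.seen.entry(concept.to_string()).or_default();
        let was_new = projects.insert(project.to_string());
        was_new && projects.len() == self.threshold
    }
}
```

Repeated scans of the same project do not advance the count, since projects are stored in a set.
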
### 5.3 Community Corpus Contributions

Users who run Aphoria can opt in to contribute their alias mappings and acknowledgment patterns (anonymized) to a shared corpus. Common patterns propagate:

- "Every Rust project has this JWT pattern" → pre-built alias set for Rust JWT libraries
- "This Redis config is always flagged and always acknowledged" → raise the default conflict threshold for that concept so it is reported less often
- "This TLS pattern is always a real bug" → lower the default conflict threshold for that concept so it is always reported

---

## Milestone Summary

| Phase | Deliverable | Depends On |
|-------|-------------|------------|
| 0 | ConceptPath in StemeDB | concept-hierarchy spec |
| 1 | Authoritative corpus (RFCs, OWASP) | Phase 0 |
| 2 | Aphoria CLI (scan, report, ack) | Phase 0, Phase 1 |
| 3 | Claude Code skill + hooks | Phase 2 |
| 4 | CI integration (GitHub Action, PR bot) | Phase 2 |
| 5 | Research agent loop | Phase 2, Phase 4 (gap data) |

Phase 0 and Phase 1 can run in parallel: the corpus ingestion uses the ConceptPath types as they're built. Phase 2 is the critical path. Everything after Phase 2 is distribution and flywheel.