# Aphoria Technical Spec

**Status:** Draft
**Date:** 2026-02-02

---

## Overview

Aphoria is a CLI binary that scans codebases, extracts implicit claims from config and code, ingests them into a local Episteme instance, and reports conflicts against authoritative sources.

```
aphoria scan <project-root> [--config aphoria.toml] [--format table|json|sarif|markdown]
aphoria ack <concept-path> --reason "..."
aphoria baseline
aphoria diff
aphoria status
```

---

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                         aphoria CLI                          │
│                                                              │
│  ┌──────────┐   ┌────────────┐   ┌──────────┐   ┌────────┐   │
│  │  Walker  │──▶│ Extractors │──▶│ Ingester │──▶│ Report │   │
│  │          │   │            │   │          │   │        │   │
│  │ fs walk  │   │ tls_verify │   │ bridge   │   │ table  │   │
│  │ lang det │   │ jwt_config │   │ to       │   │ json   │   │
│  │ path map │   │ secrets    │   │ episteme │   │ sarif  │   │
│  │          │   │ timeouts   │   │          │   │ md     │   │
│  │          │   │ deps       │   │          │   │        │   │
│  │          │   │ cors       │   │          │   │        │   │
│  │          │   │ rate_limit │   │          │   │        │   │
│  └──────────┘   └────────────┘   └──────────┘   └────────┘   │
│                                     │               ▲        │
│                                     ▼               │        │
│                             ┌──────────────┐        │        │
│                             │   Episteme   │────────┘        │
│                             │   (local)    │  query +        │
│                             │              │  conflict       │
│                             └──────────────┘  scores         │
└──────────────────────────────────────────────────────────────┘
```

Aphoria depends on:
- `stemedb-core` (types: ConceptPath, Assertion, SourceClass)
- `stemedb-storage` (KVStore, IndexStore, AliasStore)
- `stemedb-ingest` (ingestion pipeline)
- `stemedb-query` (query engine, lenses)

It does **not** depend on `stemedb-api`. Aphoria talks to Episteme directly through the Rust crate APIs, not over HTTP. This makes it fast (no network round-trip) and self-contained (no server process needed).

---

## Crate Structure

```
crates/
  aphoria/
    Cargo.toml
    src/
      main.rs                   CLI entrypoint (clap)
      config.rs                 aphoria.toml parsing
      walker/
        mod.rs                  Project walker orchestration
        language.rs             Language detection
        path_mapper.rs          Directory → ConceptPath mapping
        normalizer.rs           Path normalization rules per language
      extractors/
        mod.rs                  Extractor trait + registry
        tls_verify.rs           TLS certificate verification
        jwt_config.rs           JWT validation settings
        hardcoded_secrets.rs    Credentials in source
        timeout_config.rs       HTTP/DB/Redis timeouts
        dep_versions.rs         Vulnerable dependency versions
        cors_config.rs          CORS allow-origin
        rate_limit.rs           Rate limiting config
      bridge.rs                 ExtractedClaim → Assertion conversion
      conflict.rs               Conflict query + scoring
      report/
        mod.rs                  Report generation orchestration
        table.rs                Terminal table output
        json.rs                 JSON output
        sarif.rs                SARIF for CI integration
        markdown.rs             Markdown output
      ack.rs                    Acknowledge command
      baseline.rs               Baseline snapshot
      diff.rs                   Delta since last scan
```

---

## Configuration

`aphoria.toml` at project root (optional, sensible defaults):

```toml
[project]
name = "citadeldb"
language = "rust"  # auto-detected if omitted

[episteme]
data_dir = "~/.aphoria/db"  # local Episteme instance
# url = "http://localhost:3000"  # future: remote instance

[thresholds]
block = 0.7  # conflict score >= this → BLOCK
flag = 0.4   # conflict score >= this → FLAG
# below flag threshold → PASS (not reported)

[extractors]
enabled = ["tls_verify", "jwt_config", "hardcoded_secrets", "timeout_config", "dep_versions", "cors_config", "rate_limit"]
# disabled = ["rate_limit"]  # alternative: disable specific ones

[extractors.timeout_config]
min_reasonable_ms = 1000    # flag timeouts below this
max_reasonable_ms = 300000  # flag timeouts above this

[extractors.dep_versions]
advisory_db = "~/.aphoria/advisory-db"  # rustsec/advisory-db clone

[scan]
exclude = ["target/", "node_modules/", ".git/", "vendor/"]
max_file_size = 1048576  # skip files > 1MB

[aliases]
auto_suggest = true       # suggest aliases when shared leaves detected
auto_accept_tier0 = true  # auto-accept alias suggestions to Tier 0 sources
```

---

## Walker

### Language Detection

Priority order:
1. Explicit `language` in `aphoria.toml`
2. Dominant language heuristic (count files by extension)
3. Per-file extension mapping

| Extension | Language |
|-----------|----------|
| `.rs` | rust |
| `.go` | go |
| `.py` | python |
| `.ts`, `.tsx` | typescript |
| `.js`, `.jsx` | javascript |
| `.yaml`, `.yml` | yaml |
| `.toml` | toml |
| `.json` | json |
| `.env`, `.env.*` | dotenv |
| `Dockerfile`, `docker-compose.*` | docker |
| `Cargo.toml` | cargo-manifest |
| `go.mod` | go-mod |
| `package.json` | npm-manifest |
| `requirements.txt`, `pyproject.toml` | python-manifest |

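The per-file mapping in the table can be sketched as a plain lookup. This is a minimal illustration, not the actual `language.rs`: the function name `detect_language` is hypothetical, and checking exact file names before extensions (so that `Cargo.toml` resolves to `cargo-manifest` rather than `toml`) is an assumption the table does not state.

```rust
/// Hypothetical helper: maps a file name to a language label from the table.
/// Assumption: exact file names win over extensions.
fn detect_language(file_name: &str) -> Option<&'static str> {
    // Manifests and other special file names first.
    match file_name {
        "Cargo.toml" => return Some("cargo-manifest"),
        "go.mod" => return Some("go-mod"),
        "package.json" => return Some("npm-manifest"),
        "requirements.txt" | "pyproject.toml" => return Some("python-manifest"),
        "Dockerfile" => return Some("docker"),
        _ => {}
    }
    if file_name.starts_with("docker-compose.") {
        return Some("docker");
    }
    if file_name == ".env" || file_name.starts_with(".env.") {
        return Some("dotenv");
    }
    // Otherwise fall back to the text after the last dot.
    let ext = file_name.rsplit_once('.')?.1;
    match ext {
        "rs" => Some("rust"),
        "go" => Some("go"),
        "py" => Some("python"),
        "ts" | "tsx" => Some("typescript"),
        "js" | "jsx" => Some("javascript"),
        "yaml" | "yml" => Some("yaml"),
        "toml" => Some("toml"),
        "json" => Some("json"),
        _ => None,
    }
}
```

Files with no recognized name or extension return `None`, which a caller could treat as "skip this file".
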
### Path Mapping

Directory structure maps to ConceptPath segments. Language-specific stripping rules remove boilerplate directories:

**Rust:**
```
Strip: src/, crates/
Keep:  everything else

crates/citadeldb/src/auth/jwt.rs
  → ["rust", "citadeldb", "auth", "jwt"]
  → code://rust/citadeldb/auth/jwt/{leaf from extractor}
```

**Go:**
```
Strip: cmd/, internal/, pkg/
Keep:  everything else

internal/auth/jwt/validator.go
  → ["go", "{module_name}", "auth", "jwt"]
  → code://go/{module_name}/auth/jwt/{leaf}
```

**Python:**
```
Strip: src/, lib/
Keep:  everything else

src/auth/jwt_handler.py
  → ["python", "{package_name}", "auth", "jwt_handler"]
  → code://python/{package_name}/auth/jwt_handler/{leaf}
```

**Config files:**
```
config/production.yaml
  → code://config/{project_name}/production/{leaf}

.env.production
  → code://config/{project_name}/env_production/{leaf}

docker-compose.yml
  → code://docker/{project_name}/{leaf}
```

The project name comes from:
1. `aphoria.toml` `project.name`
2. `Cargo.toml` `[package] name`
3. `go.mod` module name (last segment)
4. `package.json` `name`
5. Directory name of project root

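As a rough sketch of the Rust stripping rule, the following reproduces the `crates/citadeldb/src/auth/jwt.rs` example. The function names `rust_path_segments` and `concept_path` are hypothetical, segments are plain strings rather than the real `ConceptPath` type, and the Go and Python rules (which also insert a module or package name) are not covered.

```rust
/// Boilerplate directories stripped for Rust projects, per the rule above.
const RUST_STRIP: [&str; 2] = ["src", "crates"];

/// Hypothetical helper: project-relative path → ConceptPath segments.
fn rust_path_segments(rel_path: &str) -> Vec<String> {
    let mut parts: Vec<&str> = rel_path.split('/').filter(|p| !p.is_empty()).collect();
    // The last component is the file; keep its stem without the extension.
    let file = parts.pop();
    let mut segments = vec!["rust".to_string()];
    segments.extend(
        parts
            .into_iter()
            .filter(|dir| !RUST_STRIP.contains(dir))
            .map(String::from),
    );
    if let Some(file) = file {
        let stem = file.rsplit_once('.').map_or(file, |(stem, _ext)| stem);
        segments.push(stem.to_string());
    }
    segments
}

/// Joins segments and an extractor-provided leaf into a `code://` path string.
fn concept_path(segments: &[String], leaf: &str) -> String {
    format!("code://{}/{}", segments.join("/"), leaf)
}
```

With the leaf `audience_validation` supplied by an extractor, the example file maps to `code://rust/citadeldb/auth/jwt/audience_validation`.
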
### File Filtering

Skip:
- Directories in `scan.exclude` list
- Files larger than `scan.max_file_size`
- Binary files (detected by null byte in first 8KB)
- Generated files (`*.generated.*`, `*.pb.go`, `*_generated.rs`)
- Test files (configurable: include or exclude)

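The size, binary, and generated-file checks are simple enough to sketch directly. The helper names below are hypothetical, the file content is assumed to be already in memory, and the glob patterns are approximated with substring and suffix tests. Directory exclusion and the test-file switch are left out.

```rust
/// Returns true if a null byte appears in the first 8 KiB of `bytes`.
fn looks_binary(bytes: &[u8]) -> bool {
    let window = &bytes[..bytes.len().min(8 * 1024)];
    window.contains(&0)
}

/// Approximates `*.generated.*`, `*.pb.go`, and `*_generated.rs`.
fn is_generated(file_name: &str) -> bool {
    file_name.contains(".generated.")
        || file_name.ends_with(".pb.go")
        || file_name.ends_with("_generated.rs")
}

/// Hypothetical combined check; `max_file_size` mirrors `scan.max_file_size`.
fn should_skip(file_name: &str, bytes: &[u8], max_file_size: usize) -> bool {
    bytes.len() > max_file_size || is_generated(file_name) || looks_binary(bytes)
}
```
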

---

## Extractors

### Trait Definition

```rust
/// A claim extractor that finds implicit decisions in source code.
pub trait Extractor: Send + Sync {
    /// Unique identifier for this extractor.
    fn name(&self) -> &str;

    /// File types this extractor operates on.
    fn languages(&self) -> &[Language];

    /// Extract claims from a file's content.
    ///
    /// - `path_segments`: The ConceptPath segments derived from the file's location.
    /// - `content`: The file content as a string.
    /// - `language`: The detected language of the file.
    ///
    /// Returns zero or more extracted claims.
    fn extract(
        &self,
        path_segments: &[String],
        content: &str,
        language: Language,
    ) -> Vec<ExtractedClaim>;
}
```

### ExtractedClaim

```rust
/// A claim extracted from source code by an Extractor.
pub struct ExtractedClaim {
    /// The full ConceptPath for this claim.
    /// Scheme is always "code" for code-extracted claims.
    pub concept_path: ConceptPath,

    /// The predicate describing what aspect of the concept this claims.
    /// Examples: "enabled", "config_value", "version", "allow_origin"
    pub predicate: String,

    /// The extracted value.
    pub value: ObjectValue,

    /// Source file path relative to project root.
    pub file: String,

    /// Line number in the source file (1-indexed).
    pub line: usize,

    /// The matched source text (the actual code/config that was matched).
    pub matched_text: String,

    /// Confidence of extraction.
    /// 1.0 for exact regex matches.
    /// Lower for heuristic matches.
    pub confidence: f32,

    /// Human-readable description of what was found.
    /// Example: "JWT audience validation is disabled"
    pub description: String,
}
```

### Extractor: tls_verify

**What it finds:** TLS/SSL certificate verification disabled.

**Patterns:**

Rust (reqwest):
```
Pattern: danger_accept_invalid_certs\s*\(\s*true\s*\)
Leaf: cert_verification
Predicate: enabled
Value: Boolean(false)
```

Rust (native-tls):
```
Pattern: accept_invalid_certs\s*\(\s*true\s*\)
```

Go (net/http):
```
Pattern: InsecureSkipVerify\s*:\s*true
```

Python (requests):
```
Pattern: verify\s*=\s*False
```

Node.js:
```
Pattern: rejectUnauthorized\s*:\s*false
Pattern: NODE_TLS_REJECT_UNAUTHORIZED.*['"]0['"]
```

YAML/TOML/JSON config:
```
Pattern: (tls_verify|ssl_verify|verify_ssl|verify_tls)\s*[:=]\s*(false|no|0|off)
```

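To show how a pattern hit turns into a claim with a leaf, predicate, value, and 1-indexed line, here is a dependency-free sketch for the reqwest pattern. It hand-rolls the equivalent of `danger_accept_invalid_certs\s*\(\s*true\s*\)` instead of using the `regex` crate, works line by line (so it will not match a call split across lines), and uses a simplified `Finding` struct as a stand-in for `ExtractedClaim`. All names here are hypothetical.

```rust
/// Simplified stand-in for `ExtractedClaim`; the real struct has more fields.
#[derive(Debug, PartialEq)]
struct Finding {
    leaf: &'static str,
    predicate: &'static str,
    value: bool,
    line: usize, // 1-indexed
}

/// Hand-rolled equivalent of `name\s*\(\s*true\s*\)` within a single line.
fn calls_with_true(line: &str, name: &str) -> bool {
    let mut rest = line;
    while let Some(pos) = rest.find(name) {
        let after = rest[pos + name.len()..].trim_start();
        if let Some(args) = after.strip_prefix('(') {
            if let Some(tail) = args.trim_start().strip_prefix("true") {
                if tail.trim_start().starts_with(')') {
                    return true;
                }
            }
        }
        // Keep looking after this occurrence of the name.
        rest = &rest[pos + name.len()..];
    }
    false
}

/// Emits one finding per line that disables certificate verification.
fn scan_tls_verify(content: &str) -> Vec<Finding> {
    content
        .lines()
        .enumerate()
        .filter(|(_, line)| calls_with_true(line, "danger_accept_invalid_certs"))
        .map(|(idx, _)| Finding {
            leaf: "cert_verification",
            predicate: "enabled",
            value: false,
            line: idx + 1,
        })
        .collect()
}
```

A real extractor would presumably compile the regex once and attach `matched_text`, `confidence`, and a description as well.
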
### Extractor: jwt_config

**What it finds:** JWT validation settings.

**Claims extracted per finding:**

| Leaf | Predicate | What it means |
|------|-----------|---------------|
| `audience_validation` | `enabled` | Whether `aud` claim is validated |
| `expiry_validation` | `enabled` | Whether `exp` claim is validated |
| `algorithm_restriction` | `config_value` | Allowed algorithms (or "none" if unrestricted) |
| `signature_verification` | `enabled` | Whether signatures are verified |

**Patterns (Rust, jsonwebtoken crate):**

```
// aud validation
Pattern: set_audience.*\[\]|validate_aud.*false|aud.*None
Leaf: audience_validation
Value: Boolean(false)

// Dangerous: algorithm none
Pattern: Algorithm::None|alg.*none|allow_none.*true
Leaf: algorithm_restriction
Value: Text("none_allowed")

// Signature skip
Pattern: dangerous_insecure|skip_signature|verify.*false
Leaf: signature_verification
Value: Boolean(false)
```

**Patterns (Go, golang-jwt):**

```
Pattern: jwt\.Parse\(.*func\(.*\*jwt\.Token\).*\{[^}]*return.*signingKey
  (without any algorithm check in the callback)
```

This is a heuristic match (confidence < 1.0): detecting missing validation is harder than detecting explicit disabling.

### Extractor: hardcoded_secrets

**What it finds:** Credentials, API keys, tokens in source (not in .env or .gitignore'd files).

**Patterns:**

```
// API keys
Pattern: (api[_-]?key|apikey)\s*[:=]\s*["'][A-Za-z0-9_\-]{20,}["']
Leaf: api_key_storage
Predicate: storage_method
Value: Text("hardcoded")

// Passwords
Pattern: (password|passwd|pwd)\s*[:=]\s*["'][^"']+["']
  (excluding: "password", "changeme", "placeholder", "CHANGE_ME", "xxx", test patterns)
Leaf: password_storage
Predicate: storage_method
Value: Text("hardcoded")

// AWS keys
Pattern: (AKIA[0-9A-Z]{16})
Leaf: aws_credentials
Predicate: storage_method
Value: Text("hardcoded")

// Private keys
Pattern: -----BEGIN (RSA |EC |DSA )?PRIVATE KEY-----
Leaf: private_key_storage
Predicate: storage_method
Value: Text("hardcoded_in_source")
```

**Exclusions:** Files matching `*test*`, `*example*`, `*fixture*`, `*mock*` are scanned but findings are marked with lower confidence (0.5).

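The AWS key pattern and the confidence downgrade can be illustrated without the `regex` crate. This is a sketch under assumptions: `find_aws_access_key` and `confidence_for` are hypothetical names, the path match is case-insensitive (the spec does not say), and 1.0 is used for non-test files because `ExtractedClaim` documents 1.0 for exact matches.

```rust
/// Finds the first token shaped like `AKIA[0-9A-Z]{16}` in `text`.
fn find_aws_access_key(text: &str) -> Option<&str> {
    let bytes = text.as_bytes();
    let mut start = 0;
    while let Some(offset) = text[start..].find("AKIA") {
        let begin = start + offset;
        let end = begin + 20; // "AKIA" plus 16 more characters
        if end <= bytes.len()
            && bytes[begin + 4..end]
                .iter()
                .all(|b| b.is_ascii_digit() || b.is_ascii_uppercase())
        {
            return Some(&text[begin..end]);
        }
        // Not a full match here; resume the search one byte later.
        start = begin + 1;
    }
    None
}

/// Findings in test-like files keep a lower confidence instead of being dropped.
fn confidence_for(path: &str) -> f32 {
    let lowered = path.to_lowercase();
    let test_like = ["test", "example", "fixture", "mock"]
        .iter()
        .any(|marker| lowered.contains(marker));
    if test_like { 0.5 } else { 1.0 }
}
```
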
### Extractor: timeout_config

**What it finds:** HTTP client, database, and cache timeout values.

**Patterns:**

```
// Zero/infinite timeout
Pattern: timeout\s*[:=]\s*(0|None|null|nil|infinity|Inf)
Leaf: {context}/timeout (context from surrounding code: http, db, redis, etc.)
Predicate: config_value
Value: Number(0.0)
Description: "Timeout disabled (infinite wait)"

// Unreasonably low timeout
Pattern: timeout\s*[:=]\s*(\d+)
  where value_ms < config.min_reasonable_ms
Leaf: {context}/timeout
Value: Number(extracted_value)
Description: "Timeout {value}ms below minimum reasonable {min}ms"

// Unreasonably high timeout
Pattern: timeout\s*[:=]\s*(\d+)
  where value_ms > config.max_reasonable_ms
```

**Unit detection:** Heuristic based on magnitude and surrounding context:
- Value > 1000000 → likely nanoseconds
- Value > 1000 and < 1000000 → likely milliseconds
- Value < 100 → likely seconds
- Presence of "ms", "sec", "Duration::from_secs" → explicit unit

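The unit heuristic could look roughly like the sketch below. The names `TimeUnit`, `guess_unit`, and `to_millis` are hypothetical. Two things are assumptions rather than part of the spec: explicit markers are checked before magnitude, and values that fall in the ranges the list leaves open (100 to 1000, and exactly 1000000) return `None`. The substring test for markers is deliberately crude.

```rust
#[derive(Debug, PartialEq)]
enum TimeUnit {
    Nanos,
    Millis,
    Secs,
}

/// Guesses the unit of a timeout value from nearby text and magnitude.
fn guess_unit(value: u64, context: &str) -> Option<TimeUnit> {
    // Explicit markers in the surrounding text take priority (assumption).
    if context.contains("Duration::from_secs") || context.contains("sec") {
        return Some(TimeUnit::Secs);
    }
    if context.contains("ms") {
        return Some(TimeUnit::Millis);
    }
    // Otherwise fall back to the magnitude ranges from the list above.
    if value > 1_000_000 {
        Some(TimeUnit::Nanos)
    } else if value > 1_000 && value < 1_000_000 {
        Some(TimeUnit::Millis)
    } else if value < 100 {
        Some(TimeUnit::Secs)
    } else {
        None // range not covered by the heuristic
    }
}

/// Normalizes to milliseconds for comparison with `min_reasonable_ms`.
fn to_millis(value: u64, unit: &TimeUnit) -> u64 {
    match unit {
        TimeUnit::Nanos => value / 1_000_000,
        TimeUnit::Millis => value,
        TimeUnit::Secs => value * 1_000,
    }
}
```

Under these assumptions, `timeout = 30` is read as 30 seconds (30000 ms), which sits inside the default 1000 to 300000 ms window.
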
### Extractor: dep_versions

**What it finds:** Dependencies with known vulnerabilities.

**Sources:**
- `Cargo.toml` → check against RustSec Advisory DB
- `go.mod` → check against Go Vulnerability Database
- `package.json` → check against npm audit advisories
- `requirements.txt` / `pyproject.toml` → check against PyPI advisory data

**Output:**

```
Leaf: dep/{package_name}/version
Predicate: installed_version
Value: Text("1.0.2")
Description: "openssl 1.0.2 has known vulnerability CVE-2024-XXXX"
```

The advisory databases are downloaded locally and refreshed periodically. Aphoria doesn't call external APIs during scan.

### Extractor: cors_config

**What it finds:** Overly permissive CORS configuration.

**Patterns:**

```
// Allow all origins
Pattern: allow_origin\s*\(\s*["']\*["']\s*\)|Access-Control-Allow-Origin.*\*|AllowAllOrigins.*true
Leaf: cors/allow_origin
Predicate: config_value
Value: Text("*")
Description: "CORS allows all origins"

// Allow credentials with wildcard
Pattern: (allow_credentials|AllowCredentials).*true
  (in proximity to allow_origin *)
Leaf: cors/credentials_with_wildcard
Predicate: enabled
Value: Boolean(true)
Description: "CORS allows credentials with wildcard origin"
```

### Extractor: rate_limit

**What it finds:** Rate limiting disabled or set unreasonably high.

**Patterns:**

```
// Rate limiting disabled
Pattern: (rate_limit|ratelimit).*disabled|rate_limit\s*[:=]\s*(0|false|off|none)
Leaf: rate_limit/enabled
Predicate: enabled
Value: Boolean(false)

// Unreasonably high limit
Pattern: (rate_limit|ratelimit|max_requests)\s*[:=]\s*(\d+)
  where value > 10000 per minute (configurable)
Leaf: rate_limit/max_requests
Predicate: config_value
Value: Number(extracted_value)
```

---

## Ingestion Bridge

### Claim → Assertion Mapping

```rust
use serde_json::json;

fn to_assertion(
    claim: &ExtractedClaim,
    agent_keypair: &Ed25519Keypair,
    scan_timestamp: u64,
) -> Assertion {
    // Provenance for the finding. Serialization yields a Result, which is
    // turned into an Option with `.ok()` when the Assertion is built.
    let source_metadata = serde_json::to_vec(&json!({
        "file": claim.file,
        "line": claim.line,
        "matched_text": claim.matched_text,
        "extractor": claim.concept_path.leaf(),
        "scan_tool": "aphoria",
        "scan_version": env!("CARGO_PKG_VERSION"),
    }));

    // Hash of the source location plus the matched text.
    let source_hash = blake3::hash(
        format!("{}:{}:{}", claim.file, claim.line, claim.matched_text).as_bytes()
    );

    Assertion {
        subject: claim.concept_path.to_string(),  // EntityId = String
        predicate: claim.predicate.clone(),
        object: claim.value.clone(),
        parent_hash: None,
        source_hash: *source_hash.as_bytes(),
        source_class: SourceClass::Expert,  // code:// scheme default
        visual_hash: None,
        epoch: None,
        source_metadata: source_metadata.ok(),
        lifecycle: LifecycleStage::Approved,
        signatures: vec![sign(agent_keypair, claim)],  // `claim` is already a reference
        confidence: claim.confidence,
        timestamp: scan_timestamp,
        vector: None,
    }
}
```

### Idempotency

Same code produces the same claims. Same claims produce the same assertion hashes (content-addressed). Re-scanning a project that hasn't changed ingests nothing new. This is guaranteed by BLAKE3 content addressing in the existing Episteme pipeline.

When code changes between scans, new assertions are created. Old assertions remain (append-only). The `diff` command compares the current scan's assertions against the last scan's to show what changed.

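The effect of content addressing on re-scans can be shown with a toy store. This is only an illustration of the idea: it hashes the same `file:line:matched_text` string that `to_assertion` feeds to BLAKE3, but uses the standard library's `DefaultHasher` as a stand-in so the example has no dependencies. The `Store` type and both function names are hypothetical, and the real pipeline's hash input may cover more fields.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Stand-in for the BLAKE3 source hash (assumption: same input string).
fn content_key(file: &str, line: usize, matched_text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    format!("{}:{}:{}", file, line, matched_text).hash(&mut hasher);
    hasher.finish()
}

/// Toy append-only store keyed by content hash.
struct Store {
    seen: HashSet<u64>,
}

impl Store {
    fn new() -> Self {
        Store { seen: HashSet::new() }
    }

    /// Returns true only when the key was not already present,
    /// so re-ingesting identical content is a no-op.
    fn ingest(&mut self, key: u64) -> bool {
        self.seen.insert(key)
    }
}
```

Ingesting the same finding twice stores it once, while a finding that moved to another line produces a new key and therefore a new entry.
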
### Scan Metadata

Each scan is recorded as an assertion about itself:

```
Subject: aphoria://scan/{project_name}/{scan_id}
Predicate: completed
Object: Text(json!({
    "project": "citadeldb",
    "files_scanned": 142,
    "claims_extracted": 23,
    "conflicts_found": 3,
    "blocks": 2,
    "flags": 1,
    "timestamp": 1706832000
}))
```

This enables `aphoria diff`, which compares two scan records and their associated assertions.

---

## Conflict Detection

### Query Strategy

After ingestion, for each extracted claim:

```rust
async fn check_conflict(
    claim: &ExtractedClaim,
    query_engine: &QueryEngine,
    threshold_block: f32, // `block` from [thresholds] in aphoria.toml
    threshold_flag: f32,  // `flag` from [thresholds] in aphoria.toml
) -> Option<ConflictResult> {
    // 1. Query with Skeptic lens, resolving aliases
    let results = query_engine.query(Query {
        subject: Some(claim.concept_path.to_string()),
        predicate: Some(claim.predicate.clone()),
        lens: Some("skeptic".to_string()),
        resolve_aliases: true,
        source_class_decay: true,
        ..Default::default()
    }).await;

    // 2. Check if any authoritative source disagrees
    let code_value = &claim.value;
    let mut conflicts = Vec::new();

    for assertion in &results.assertions {
        if assertion.source_class.tier() < 3       // Tier 0, 1, or 2
            && assertion.object != *code_value     // Different value
        {
            conflicts.push(ConflictingSource {
                path: assertion.subject.clone(),
                source_class: assertion.source_class,
                value: assertion.object.clone(),
                confidence: assertion.confidence,
            });
        }
    }

    if conflicts.is_empty() {
        return None;
    }

    // 3. Compute conflict score
    // Higher when tier spread is larger and authoritative sources are confident
    let max_tier_weight = conflicts.iter()
        .map(|c| c.source_class.authority_weight())
        .max_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
        .unwrap_or(0.0);

    let code_weight = SourceClass::Expert.authority_weight(); // 0.5

    let conflict_score = max_tier_weight * (1.0 - code_weight);
    // Tier 0 vs Tier 3: 1.0 * 0.5 = 0.50

    // Scale by the highest confidence among the conflicting sources
    let max_confidence = conflicts.iter()
        .map(|c| c.confidence)
        .max_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
        .unwrap_or(1.0);
    let boosted_score = conflict_score * max_confidence;

    // Normalize: tier spread (3 - min_tier) of 0→3 maps to 0.4→0.95
    let min_tier = conflicts.iter()
        .map(|c| c.source_class.tier())
        .min()
        .unwrap_or(3) as f32;
    let normalized = 0.4 + (3.0 - min_tier) / 3.0 * 0.55;
    let final_score = normalized.max(boosted_score);

    Some(ConflictResult {
        claim: claim.clone(),
        conflicts,
        conflict_score: final_score,
        verdict: if final_score >= threshold_block { Verdict::Block }
                 else if final_score >= threshold_flag { Verdict::Flag }
                 else { Verdict::Pass },
    })
}
```

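To make the arithmetic in step 3 easier to check by hand, here it is restated as a standalone function for a single strongest source. The function name and parameters are hypothetical, the code weight of 0.5 is taken from the comment in `check_conflict`, and the authority weights passed in the usage note are illustrative values, since the spec does not list the per-tier weights.

```rust
/// Restates the scoring arithmetic from `check_conflict`.
/// `authority_weight` and `confidence` are assumed to lie in [0, 1];
/// `min_tier` is the lowest tier among the conflicting sources (0 to 3).
fn conflict_score(authority_weight: f32, confidence: f32, min_tier: u8) -> f32 {
    let code_weight = 0.5; // SourceClass::Expert, per the comment above
    // Weighted term, scaled by the source's confidence.
    let boosted = authority_weight * (1.0 - code_weight) * confidence;
    // Tier term: min_tier 0 gives 0.95, min_tier 3 gives 0.4.
    let normalized = 0.4 + (3.0 - f32::from(min_tier)) / 3.0 * 0.55;
    normalized.max(boosted)
}
```

For example, `conflict_score(1.0, 1.0, 0)` evaluates to about 0.95 and `conflict_score(0.7, 0.9, 2)` to about 0.58. Under the stated assumptions the weighted term cannot exceed 0.5, while the tier term is at least about 0.58 for tiers 0 to 2, so with these constants the tier term always decides the final score.
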
### Verdict Levels

| Verdict | Condition | Meaning |
|---------|-----------|---------|
| BLOCK | `conflict_score >= 0.7` | Authoritative source strongly contradicts. Fix or explicitly acknowledge. |
| FLAG | `conflict_score >= 0.4` | Potential disagreement. Review recommended. |
| PASS | `conflict_score < 0.4` | No significant conflict (or no authoritative data). |
| ACK | Any score, acknowledged | Conflict exists but has been explicitly accepted. |

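The table reduces to a small decision function. The sketch below uses a hypothetical `ReportLabel` enum (the `Verdict` type in `check_conflict` only shows `Block`, `Flag`, and `Pass`) and takes the thresholds as parameters so the defaults from `[thresholds]` can be overridden.

```rust
#[derive(Debug, PartialEq)]
enum ReportLabel {
    Block,
    Flag,
    Pass,
    Ack,
}

/// Maps a conflict score to the label shown in the report.
/// An acknowledgment takes precedence at any score, as in the table.
fn report_label(score: f32, block: f32, flag: f32, acknowledged: bool) -> ReportLabel {
    if acknowledged {
        ReportLabel::Ack
    } else if score >= block {
        ReportLabel::Block
    } else if score >= flag {
        ReportLabel::Flag
    } else {
        ReportLabel::Pass
    }
}
```
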
### Acknowledged Conflicts

When a conflict has been acknowledged (via `aphoria ack`), the acknowledgment assertion exists in Episteme. The conflict still has a score, but the report marks it as ACK instead of BLOCK/FLAG:

```
ACK  code://rust/citadeldb/auth/jwt/audience_validation
     Your code:    aud validation disabled (src/auth/jwt.rs:47)
     RFC 7519:     aud validation MUST be enabled (Tier 0)
     Acknowledged: 2026-01-15 by jordan
     Reason:       "Internal service, no external JWT consumers. SEC-2024-003."
```

The acknowledgment doesn't suppress the conflict. It adds context. A future `--strict` mode can treat acknowledged conflicts as blocks again (for audits).

---

## Report Formats

### SARIF (for CI)

SARIF (Static Analysis Results Interchange Format) is the standard for CI security tools. GitHub, GitLab, and Azure DevOps all consume it.

```json
{
  "$schema": "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/main/sarif-2.1/schema/sarif-schema-2.1.0.json",
  "version": "2.1.0",
  "runs": [{
    "tool": {
      "driver": {
        "name": "aphoria",
        "version": "0.1.0",
        "informationUri": "https://github.com/orchard9/aphoria"
      }
    },
    "results": [{
      "ruleId": "epistemic-drift/tls-verify",
      "level": "error",
      "message": {
        "text": "TLS certificate verification disabled. OWASP requires verification (Tier 1, conflict score 0.87)."
      },
      "locations": [{
        "physicalLocation": {
          "artifactLocation": { "uri": "src/net/client.rs" },
          "region": { "startLine": 23 }
        }
      }]
    }]
  }]
}
```

### JSON (for programmatic consumption)

```json
{
  "project": "citadeldb",
  "scan_id": "abc123",
  "timestamp": 1706832000,
  "summary": {
    "files_scanned": 142,
    "claims_extracted": 23,
    "conflicts": 3,
    "blocks": 2,
    "flags": 1
  },
  "conflicts": [
    {
      "concept_path": "code://rust/citadeldb/auth/jwt/audience_validation",
      "predicate": "enabled",
      "code_value": false,
      "file": "src/auth/jwt.rs",
      "line": 47,
      "conflict_score": 0.92,
      "verdict": "BLOCK",
      "conflicting_sources": [
        {
          "path": "rfc://7519/jwt/audience_validation",
          "source_class": "Regulatory",
          "value": true,
          "confidence": 1.0
        }
      ],
      "acknowledged": null
    }
  ]
}
```

---

## Baseline and Diff

### Baseline

`aphoria baseline` records the current scan as the baseline. Subsequent scans only report *new* conflicts.

Implementation: store the baseline scan ID in `.aphoria/baseline` in the project root. The `diff` logic compares the current scan's conflict set against the baseline's.

```
.aphoria/
  baseline       # scan ID of the baseline
  config.toml    # symlink or copy of aphoria.toml
  agent.key      # Ed25519 keypair for this project's Aphoria agent
```

### Diff

`aphoria diff` shows:
- New conflicts (in current scan but not baseline)
- Resolved conflicts (in baseline but not current scan)
- Changed conflicts (same concept, different score or verdict)

```
$ aphoria diff

NEW       code://rust/citadeldb/cache/redis/max_connections
          Your code: max_connections = 10000 (config/redis.yaml:5)
          Vendor: recommended max 128 per instance (Tier 2)
          Conflict: 0.48 — FLAG

RESOLVED  code://rust/citadeldb/net/tls/cert_verification
          Previously: verify = false → BLOCK
          Current: verify = true → PASS

1 new conflict, 1 resolved, 0 changed.
```

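The three diff categories amount to a comparison of two keyed sets. The sketch below assumes each scan's conflicts are available as a map from concept path to conflict score, which is a simplification: the `DiffSummary` type and `diff_conflicts` function are hypothetical, and only the score is compared because the verdict follows from it when thresholds are unchanged.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default, PartialEq)]
struct DiffSummary {
    added: Vec<String>,    // in current scan but not baseline
    resolved: Vec<String>, // in baseline but not current scan
    changed: Vec<String>,  // in both, with a different score
}

/// Compares two conflict sets keyed by concept path.
fn diff_conflicts(
    baseline: &BTreeMap<String, f32>,
    current: &BTreeMap<String, f32>,
) -> DiffSummary {
    let mut out = DiffSummary::default();
    for (path, score) in current {
        match baseline.get(path) {
            None => out.added.push(path.clone()),
            Some(old) if old != score => out.changed.push(path.clone()),
            Some(_) => {} // unchanged conflict, not reported
        }
    }
    for path in baseline.keys() {
        if !current.contains_key(path) {
            out.resolved.push(path.clone());
        }
    }
    out
}
```
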

---

## Agent Keypair

Aphoria signs assertions with a per-project Ed25519 keypair stored in `.aphoria/agent.key`. Generated on first `aphoria scan` if it doesn't exist.

The keypair identifies "Aphoria scanning project X" as a distinct agent in Episteme's trust system. Multiple projects have different keypairs. This enables:
- Per-project audit trails ("which Aphoria agent found this?")
- TrustRank per Aphoria instance (a well-calibrated Aphoria gains reputation)
- Distinguishing human-authored assertions from Aphoria-extracted ones

---

## Episteme Instance

### Local Mode (Default)

Aphoria ships with an embedded Episteme instance. No server needed. The database lives at `~/.aphoria/db/` (configurable). Multiple projects share the same local instance; their assertions are namespaced by ConceptPath (`code://rust/citadeldb/...` vs `code://go/other-project/...`).

The authoritative corpus (RFCs, OWASP) is also in the local instance. `aphoria init` bootstraps it.

```
$ aphoria init
Downloading RFC corpus (auth, crypto, TLS) ... 127 assertions ingested.
Downloading OWASP cheat sheets ... 89 assertions ingested.
Ready. Run `aphoria scan <project>` to begin.
```

### Remote Mode (Future)

```toml
[episteme]
url = "https://episteme.example.com"
api_key = "${APHORIA_API_KEY}"
```

In remote mode, Aphoria ingests into and queries from a shared Episteme instance. This enables:
- Cross-project conflict detection ("same JWT misconfiguration in 12 repos")
- Shared authoritative corpus (ingested once, used by all Aphoria agents)
- Centralized acknowledgment management

---

## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | No conflicts above threshold |
| 1 | FLAG-level conflicts found (with `--exit-code`) |
| 2 | BLOCK-level conflicts found (with `--exit-code`) |
| 3 | Scan error (file access, Episteme connection, etc.) |

`--exit-code` enables non-zero exits. Without it, Aphoria always exits 0 (for interactive use where the report is the output, not the exit code).

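The conflict-related rows of the table can be expressed as one function. This sketch uses a hypothetical `exit_code` helper and assumes that BLOCK takes precedence when a scan has both BLOCK and FLAG findings, which the table implies but does not state. Scan errors (code 3) are not modeled here.

```rust
/// Maps conflict counts to a process exit code.
/// `exit_code_enabled` mirrors the `--exit-code` flag.
fn exit_code(blocks: usize, flags: usize, exit_code_enabled: bool) -> i32 {
    if !exit_code_enabled {
        return 0; // interactive use: the report is the output
    }
    if blocks > 0 {
        2 // assumption: BLOCK outranks FLAG when both are present
    } else if flags > 0 {
        1
    } else {
        0
    }
}
```
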

---

## Performance Targets

| Metric | Target |
|--------|--------|
| Scan time, 1000-file Rust project | < 5 seconds |
| Scan time, 10000-file monorepo | < 30 seconds |
| Per-file extraction (all extractors) | < 5ms |
| Conflict query per claim | < 10ms |
| Local Episteme startup | < 100ms |
| Memory usage during scan | < 256MB |

The performance bottleneck is I/O (reading files), not extraction (regex matching). The conflict query is a local KV lookup, not a network call.

---

## Dependencies

| Dependency | Purpose |
|------------|---------|
| `clap` | CLI argument parsing |
| `ignore` | File walking (respects .gitignore, fast) |
| `regex` | Pattern matching in extractors |
| `serde` + `serde_json` | Config parsing, JSON output |
| `toml` | aphoria.toml parsing |
| `comfy-table` | Terminal table output |
| `stemedb-core` | Types |
| `stemedb-storage` | Local KV store |
| `stemedb-ingest` | Assertion ingestion |
| `stemedb-query` | Conflict queries |
| `ed25519-dalek` | Agent keypair + signing |
| `blake3` | Content hashing |

No LLM dependency. No network dependency (in local mode). No runtime other than tokio (for async KV store operations).