stemedb/applications/aphoria/docs/legal/patent-specification.md
jordan 116bad1de3 feat: Ingestor deadlock fix + blessed assertion tracking + patent docs
Key changes:
- Fix Ingestor background task to release lock per iteration, preventing
  deadlock when process_pending() needs the lock during shutdown
- Add blessed assertion predicate index and fetch_blessed_assertions()
  for policy export workflows in Aphoria
- Add patent documentation (markdown + Word exports) for probabilistic
  knowledge graph system
- Update community scripts for claim extraction pipeline

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 03:41:08 -07:00


# Aphoria Technical Specification for Patent Disclosure
- **Subject:** Method and System for Detecting Epistemic Conflicts in Computer Code and Configuration
- **Date:** 2026-02-04
---
## Field of the Invention
The present invention relates generally to computer security and software quality assurance, and more particularly to methods and systems for detecting conflicts between source code configurations and authoritative technical standards using knowledge graph alignment and hierarchical authority weighting.
---
## Background of the Invention
### Technical Problem
Static analysis tools have been used for decades to detect defects in source code. These tools operate through pattern-matching against predetermined rule sets. When a code pattern matches a rule, an alert is generated.
This approach suffers from fundamental limitations:
1. **Undifferentiated Severity:** All pattern matches are treated equally. A violation of an RFC "MUST" requirement is indistinguishable from a violation of a vendor "SHOULD" recommendation. Security engineers must manually triage every finding to determine actual risk.
2. **High False Positive Rates:** Because pattern matchers cannot assess the authoritative weight of the violated standard, they generate alerts for any syntactic match regardless of context. Industry studies report 70-90% false positive rates for conventional SAST tools.
3. **Manual Policy Authoring:** Policy-as-code systems (e.g., Open Policy Agent) require human engineers to manually translate regulatory standards into executable rules. This introduces latency between standard publication and enforcement, and risks transcription errors.
4. **No Contextual Override Mechanism:** When organizational policy intentionally deviates from a vendor default, conventional tools have no mechanism to suppress alerts based on signed organizational authority. Engineers either disable rules entirely or endure persistent false positives.
These deficiencies create computational inefficiency (wasted cycles processing false positives), security gaps (alert fatigue causing true positives to be ignored), and compliance risk (manual policy translation errors).
### Prior Art Limitations
**Static Analysis Tools (Semgrep, SonarQube, CodeQL):** These tools match patterns; they do not construct semantic triples or query a knowledge graph. They have no concept of authority weighting: a rule either matches or it does not. Unlike conventional static analysis tools that apply pattern-matching rules without contextual weighting, embodiments of the present invention transform configuration values into semantic representations and compare them against a hierarchically-weighted knowledge base, enabling prioritization of conflicts based on the authoritative source of the violated standard rather than treating all rule violations as equivalent.
**Compliance Automation (Chef InSpec, Open Policy Agent):** These tools execute policy-as-code written by users. They do not automatically derive policies from authoritative sources or compute authority-weighted scores. In contrast to policy-as-code systems that require manual policy authoring, the present invention automatically ingests and structures authoritative documentation into a queryable knowledge graph, eliminating the need for manual policy translation and ensuring that conflict detection reflects the current state of authoritative standards.
**Policy-as-Code Signed Bundles (Open Policy Agent):** Policy-as-code systems such as Open Policy Agent support cryptographically signed policy bundles. However, these bundles contain declarative policy rules that must be manually authored by engineers. They do not contain semantic assertions with authority weights, do not merge into a hierarchically-weighted knowledge graph, and do not enable automatic conflict score calculation based on the differential between the authority tier of the violated standard and the authority tier of the code configuration. The present invention addresses these limitations by providing Trust Packs that contain semantic assertions suitable for knowledge graph insertion with authority weight metadata, enabling the conflict detection engine to automatically compute prioritized conflict scores without requiring manual policy authorship.
**Knowledge Graph Systems (Neo4j, general semantic web):** Generic knowledge graph technology does not address code configuration analysis. The specific ontology, authority-weighting scheme, and integration with code parsing are the inventive elements.
---
## Summary of the Invention
The present invention provides a system and method for detecting configuration conflicts in source code by comparing code-derived semantic assertions against a hierarchically-weighted knowledge graph of authoritative standards.
In one embodiment, a system comprises:
- A parser module that transforms source code configurations into normalized semantic triples
- A knowledge graph database storing authoritative assertions with hierarchical authority weights
- A conflict detection engine that identifies semantic conflicts through graph traversal
- A scoring module that computes conflict scores based on authority weight differentials
- A trust pack module that enables cryptographically-signed policy distribution
The system outputs deterministically-prioritized conflict reports, enabling automated triage based on the authoritative weight of violated standards.
---
## Detailed Description of Preferred Embodiments
### 1. Configuration Ontology
The system utilizes a normalized configuration ontology to transform disparate source code patterns into a unified semantic representation suitable for graph traversal. This ontology comprises three primary components: Subject Identifiers, Predicate Types, and Object Value Formats.
#### 1.1 Subject Identifiers (`SubjectId`)
Subjects represent the conceptual location of a configuration within the software architecture. They are structured as hierarchical Uniform Resource Identifiers (URIs) to enable precise matching and alias resolution.
**Format:** `scheme://language/project/category/component/property`
**Examples:**
- `code://rust/citadeldb/auth/jwt/audience_validation`
- `code://go/payment-service/db/connection/pool_size`
- `code://python/ml-pipeline/crypto/hashing/algorithm`
The ontology normalizes these subjects by mapping language-specific file paths and variable names to canonical semantic categories (e.g., mapping `verify_certs` in Python and `InsecureSkipVerify` in Go to the shared concept `tls/cert_verification`).
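
The subject format and alias normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual parser: the names `ParsedSubject`, `parse_subject`, and `canonical_concept` are hypothetical, and the alias table contains only the two examples given in the text.

```rust
/// Parsed form of a subject URI such as
/// `code://rust/citadeldb/auth/jwt/audience_validation`.
/// (Hypothetical type; the field names follow the documented format.)
#[derive(Debug, PartialEq)]
pub struct ParsedSubject {
    pub scheme: String,
    pub language: String,
    pub project: String,
    pub category: String,
    pub component: String,
    pub property: String,
}

/// Splits `scheme://language/project/category/component/property`.
/// Returns `None` unless exactly five non-empty path segments follow the scheme.
pub fn parse_subject(uri: &str) -> Option<ParsedSubject> {
    let (scheme, path) = uri.split_once("://")?;
    let segments: Vec<&str> = path.split('/').collect();
    if scheme.is_empty() || segments.len() != 5 || segments.iter().any(|s| s.is_empty()) {
        return None;
    }
    Some(ParsedSubject {
        scheme: scheme.to_string(),
        language: segments[0].to_string(),
        project: segments[1].to_string(),
        category: segments[2].to_string(),
        component: segments[3].to_string(),
        property: segments[4].to_string(),
    })
}

/// Maps a language-specific setting name to a canonical concept.
/// Only the two aliases named in the text are included here.
/// Note: Go's `InsecureSkipVerify: true` disables verification, so a real
/// extractor would also need to invert the boolean value; that is not shown.
pub fn canonical_concept(language: &str, name: &str) -> Option<&'static str> {
    match (language, name) {
        ("python", "verify_certs") => Some("tls/cert_verification"),
        ("go", "InsecureSkipVerify") => Some("tls/cert_verification"),
        _ => None,
    }
}
```

With this sketch, both `canonical_concept("python", "verify_certs")` and `canonical_concept("go", "InsecureSkipVerify")` resolve to the same concept, which is what allows a single authoritative assertion to be compared against code in either language.
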
#### 1.2 Predicate Types (`PredicateId`)
Predicates define the nature of the assertion being made about the subject. The predefined ontology restricts predicates to a finite set of semantic relationships to ensure comparability between code claims and authoritative assertions.
| Predicate Type | Semantic Meaning | Example Use Case |
| :--------------- | :---------------------------------------- | :-------------------------------------------- |
| `enabled` | Boolean state of a feature | `tls/cert_verification`, `rate_limit/enabled` |
| `value` | Numeric configuration magnitude | `db/pool_size`, `http/timeout` |
| `algorithm` | Cryptographic or logical method selection | `crypto/hashing`, `jwt/signature_algorithm` |
| `storage_method` | Mechanism for data persistence | `secrets/api_key`, `session/storage` |
| `version` | Protocol or dependency version constraint | `tls/min_version`, `dependency/openssl` |
| `protocol` | Communication standard selection | `api/transport`, `auth/mechanism` |
#### 1.3 Object Value Formats (`ObjectValue`)
Objects represent the concrete value of the configuration. To facilitate semantic distance calculation, object values are normalized into typed formats.
- **Boolean:** `true` / `false` (e.g., for `enabled` predicates)
- **Numeric:** Integer or Floating Point (e.g., `3000` for timeout ms)
- **Text:** String literals, normalized to lowercase (e.g., `"sha256"`, `"grpc"`)
- **Reference:** A pointer to another entity ID (e.g., `ref:policy://acme/security`)
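
As an illustration of the four typed formats above, the following sketch normalizes a raw configuration literal into a typed value. The `normalize` function and its precedence rules (reference prefix first, then boolean, then numeric, then text) are assumptions made for this example; the specification does not prescribe an order.

```rust
/// Typed object values, one variant per format listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum ObjectValue {
    Boolean(bool),
    Numeric(f64),
    Text(String),
    Reference(String),
}

/// Normalizes a raw literal (hypothetical helper).
/// Booleans are matched case-insensitively so that Python-style `True`
/// and `False` normalize the same way as `true` and `false`.
pub fn normalize(raw: &str) -> ObjectValue {
    let trimmed = raw.trim();
    // `ref:` prefix marks a pointer to another entity ID.
    if let Some(target) = trimmed.strip_prefix("ref:") {
        return ObjectValue::Reference(target.to_string());
    }
    match trimmed.to_lowercase().as_str() {
        "true" => return ObjectValue::Boolean(true),
        "false" => return ObjectValue::Boolean(false),
        _ => {}
    }
    // Accept only finite numbers; strings like "nan" fall through to Text.
    if let Ok(n) = trimmed.parse::<f64>() {
        if n.is_finite() {
            return ObjectValue::Numeric(n);
        }
    }
    // Text literals are normalized to lowercase, as stated above.
    ObjectValue::Text(trimmed.to_lowercase())
}
```
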
---
### 2. Authority Tier Definitions
A core inventive step is the hierarchical weighting of knowledge sources. Unlike flat databases, the knowledge graph assigns a scalar `AuthorityWeight` (W_a) to every assertion based on its provenance.
The hierarchy is defined as follows:
| Tier | Class Name | Weight (W_a) | Definition | Rationale |
| :---- | :---------------- | :----------- | :------------------------------------------------------------------------------------------------------ | :----------------------------------------------------------------------------------------------------------- |
| **0** | **Regulatory** | **1.0** | Legally binding standards, RFC specifications, governmental regulations (NIST, GDPR) | Immutable truth that supersedes all other considerations. Non-negotiable constraints. |
| **1** | **Clinical** | **0.9** | Industry-standard security frameworks (OWASP, CIS Benchmarks) and peer-reviewed safety data | High-confidence best practices that should only be violated with explicit, documented justification. |
| **2** | **Observational** | **0.7** | Vendor documentation, official library guides, and manufacturer recommendations | Reliable baselines for correct operation, but may be overridden by specific architectural needs. |
| **3** | **Expert** | **0.5** | Internal organizational policies, "Golden Path" templates, and signed Trust Packs from senior engineers | Context-specific truth. Can override Tier 1/2 if explicitly acknowledged, but holds less weight than Tier 0. |
| **4** | **Community** | **0.2** | Aggregate data from open-source repositories, Stack Overflow, or community forums | Weak signals useful for spotting trends but insufficient for blocking deployment. |
| **5** | **Anecdotal** | **0.1** | Unverified code comments, individual developer defaults, or ad-hoc configurations | The lowest form of evidence; effectively "noise" unless corroborated. |
**Rationale for Hierarchy:** This weighting scheme allows the conflict detection engine to mathematically distinguish between "this breaks the law" (Tier 0 conflict) and "this is weird" (Tier 4 conflict), automating the triage process that a human security engineer would otherwise perform manually.
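
The tier table can be expressed directly as an enumeration. The sketch below uses the tier numbers and weights from the table; the type name `AuthorityTier` and the methods `weight` and `from_u8` are illustrative names, not identifiers confirmed by the specification.

```rust
/// Authority tiers 0 through 5, in the order given in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum AuthorityTier {
    Regulatory = 0,
    Clinical = 1,
    Observational = 2,
    Expert = 3,
    Community = 4,
    Anecdotal = 5,
}

impl AuthorityTier {
    /// Scalar authority weight W_a for this tier.
    pub fn weight(self) -> f64 {
        match self {
            AuthorityTier::Regulatory => 1.0,
            AuthorityTier::Clinical => 0.9,
            AuthorityTier::Observational => 0.7,
            AuthorityTier::Expert => 0.5,
            AuthorityTier::Community => 0.2,
            AuthorityTier::Anecdotal => 0.1,
        }
    }

    /// Converts a raw tier number into a tier; unknown numbers yield `None`.
    pub fn from_u8(tier: u8) -> Option<Self> {
        match tier {
            0 => Some(AuthorityTier::Regulatory),
            1 => Some(AuthorityTier::Clinical),
            2 => Some(AuthorityTier::Observational),
            3 => Some(AuthorityTier::Expert),
            4 => Some(AuthorityTier::Community),
            5 => Some(AuthorityTier::Anecdotal),
            _ => None,
        }
    }
}
```

Because the derived ordering follows the tier number, a lower tier number compares as "less than" a higher one, so `AuthorityTier::Regulatory < AuthorityTier::Anecdotal` holds even though the Regulatory weight is larger.
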
---
### 3. Conflict Score Calculation
The system computes a `ConflictScore` (S_c) for every detected disparity between a Code Claim (C_code) and an Authoritative Assertion (A_auth).
The formula integrates the Authority Weight differential and the Semantic Distance between values:
```
S_c = min(1.0, (W_auth - W_code) × D(V_auth, V_code) × M_confidence)
```
Where:
- **W_auth** is the Authority Weight of the authoritative assertion (e.g., 1.0 for RFC)
- **W_code** is the baseline Authority Weight assigned to developer code (typically Tier 3/Expert = 0.5)
- **D(V_auth, V_code)** is the **Semantic Distance** between the values:
  - For Booleans: 1.0 if unequal, 0.0 if equal
  - For Enums: 1.0 if no intersection, variable if partial match
  - For Numerics: Normalized difference (|V_auth - V_code| / V_auth)
- **M_confidence** is the extraction confidence multiplier (0.0 to 1.0), representing certainty that the code parser correctly identified the configuration
**Example Calculation:**
- **Scenario:** Code sets `verify=False` (W_code=0.5). RFC 5246 requires `verify=True` (W_auth=1.0).
- **Values:** V_code=false, V_auth=true → D=1.0
- **Result:** S_c = (1.0 - 0.5) × 1.0 × 1.0 = 0.5
- **Verdict:** BLOCK if S_c ≥ 0.5 (configurable threshold)
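
The formula and worked example above can be checked with a short sketch. Several details are assumptions of this example rather than statements from the specification: enum and text values are compared by exact match only (the "variable if partial match" case is not modeled), the numeric distance is clamped to 1.0 and guarded against a zero authoritative value, and the names `Value`, `semantic_distance`, and `conflict_score` are hypothetical. The formula as written has no lower bound, so the sketch leaves negative results (where W_code exceeds W_auth) unchanged.

```rust
/// Typed values, mirroring the Boolean, Numeric, and Text formats.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Boolean(bool),
    Numeric(f64),
    Text(String),
}

/// Semantic distance D(V_auth, V_code).
pub fn semantic_distance(auth: &Value, code: &Value) -> f64 {
    match (auth, code) {
        (Value::Boolean(a), Value::Boolean(c)) => if a == c { 0.0 } else { 1.0 },
        (Value::Numeric(a), Value::Numeric(c)) => {
            if *a == 0.0 {
                // Assumption: avoid division by zero when V_auth is 0.
                if *c == 0.0 { 0.0 } else { 1.0 }
            } else {
                // |V_auth - V_code| / |V_auth|, clamped to 1.0 (assumption).
                ((a - c).abs() / a.abs()).min(1.0)
            }
        }
        // Simplification: exact match only, no partial-match scoring.
        (Value::Text(a), Value::Text(c)) => if a == c { 0.0 } else { 1.0 },
        // Assumption: mismatched types count as maximally distant.
        _ => 1.0,
    }
}

/// S_c = min(1.0, (W_auth - W_code) * D * M_confidence)
pub fn conflict_score(w_auth: f64, w_code: f64, distance: f64, confidence: f64) -> f64 {
    ((w_auth - w_code) * distance * confidence).min(1.0)
}
```

For the scenario above, `semantic_distance(&Value::Boolean(true), &Value::Boolean(false))` is 1.0, and `conflict_score(1.0, 0.5, 1.0, 1.0)` evaluates to 0.5, matching the stated result.
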
---
### 4. Knowledge Graph Schema
The knowledge graph is a directed multigraph where nodes represent Entities (Concepts) and edges represent signed Assertions.
#### 4.1 Node Types
- **Concept Node:** Represents a configuration topic (e.g., `tls/cert_verification`)
- **Source Node:** Represents the origin of an assertion (e.g., `RFC 5246`, `GitHub User @alice`)
#### 4.2 Edge Types (Assertions)
- **Type:** `Asserts`
- **Properties:**
  - `Predicate`: The property being asserted (e.g., `enabled`)
  - `Value`: The value asserted (e.g., `true`)
  - `Weight`: The authority weight (W_a)
  - `Signature`: Ed25519 cryptographic signature of the assertion content
  - `Timestamp`: Creation time (for decay calculations)
#### 4.3 Indexing Strategy
The graph utilizes a hierarchical index keyed by **Tail Path Segments**.
- Key: `cert_verification/enabled`
- Values: List of pointers to all assertions (RFCs, Code Claims, Policy Overrides) impacting this concept
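Because the index is keyed directly by tail path segment, finding the assertion list for a given configuration is an O(1) average-case key lookup; the engine then reads only the assertions attached to that key rather than scanning the entire graph.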
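
A minimal sketch of such an index is shown below, assuming a hash map from key to assertion IDs. The key construction (last path segment of the subject, joined with the predicate) is one reading of the `cert_verification/enabled` example; the names `AssertionIndex` and `tail_key`, and the use of integer assertion IDs, are hypothetical.

```rust
use std::collections::HashMap;

/// Builds an index key such as `cert_verification/enabled` from the last
/// path segment of a subject and a predicate (assumed key scheme).
pub fn tail_key(subject: &str, predicate: &str) -> String {
    let tail = subject.rsplit('/').next().unwrap_or(subject);
    format!("{tail}/{predicate}")
}

/// Maps tail keys to the IDs of every assertion that touches the concept.
pub struct AssertionIndex {
    by_tail: HashMap<String, Vec<usize>>,
}

impl AssertionIndex {
    pub fn new() -> Self {
        Self { by_tail: HashMap::new() }
    }

    /// Registers an assertion ID under the key derived from its subject and predicate.
    pub fn insert(&mut self, subject: &str, predicate: &str, assertion_id: usize) {
        self.by_tail
            .entry(tail_key(subject, predicate))
            .or_default()
            .push(assertion_id);
    }

    /// Returns all assertion IDs for the concept, or an empty slice if none exist.
    pub fn lookup(&self, subject: &str, predicate: &str) -> &[usize] {
        self.by_tail
            .get(&tail_key(subject, predicate))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Under this key scheme, a code claim on `code://python/app/tls/client/cert_verification` and an authoritative assertion on a differently prefixed subject ending in `cert_verification` land in the same bucket, so one lookup returns both.
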
---
### 5. Trust Pack Structure
A **Trust Pack** is a portable, serialized container for distributing authoritative assertions and overriding default graph weights.
#### 5.1 Data Format
Trust Packs are binary-encoded using a zero-copy serialization schema (e.g., rkyv) to ensure rapid loading.
**Schema:**
```rust
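// Placeholder ontology types from Section 1; concrete representations may differ.
type SubjectId = String;
type PredicateId = String;
enum ObjectValue {
    Boolean(bool),
    Numeric(f64),
    Text(String),
    Reference(String),
}

struct TrustPack {
    header: PackHeader,         // Metadata (name, version, issuer ID)
    assertions: Vec<Assertion>, // List of signed assertions (policies)
    aliases: Vec<Alias>,        // Concept mapping (e.g., "my-lib/config" -> "rfc/config")
    signature: [u8; 64],        // Ed25519 signature of the entire pack (excluding this field)
}

struct PackHeader {
    name: String,            // Human-readable pack name
    version: u32,            // Semantic version
    issuer_id: [u8; 32],     // Ed25519 public key of signer
    created_at: u64,         // Unix timestamp
    expires_at: Option<u64>, // Optional expiration
}

struct Assertion {
    subject: SubjectId,
    predicate: PredicateId,
    object: ObjectValue,
    authority_tier: u8,  // Tier number 0-5 (Section 2)
    signature: [u8; 64], // Ed25519 signature of this assertion
}

struct Alias {
    from: SubjectId, // Local concept name
    to: SubjectId,   // Canonical concept it maps to
}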
```
#### 5.2 Verification Process
1. **Load:** System reads the binary file
2. **Authenticate:** System verifies the pack `signature` against the `issuer_id` public key found in the header
3. **Trust Check:** System validates `issuer_id` against a local "Trusted Key Registry" (e.g., checking if the signer is in the organization's `security-team` keyring)
4. **Merge:** If valid, assertions are merged into the local knowledge graph with authority weights specified in the pack
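
The authenticate and trust-check steps can be sketched as below. To keep the example dependency-free, the Ed25519 check is passed in as a closure; in practice it would be supplied by a cryptography library, which is not shown. The `Pack` struct is a simplified stand-in for the schema in Section 5.1 (its `payload` field, representing the signed bytes, is an assumption), and `PackError` and `verify_pack` are hypothetical names.

```rust
use std::collections::HashSet;

/// Reasons a pack may be rejected before merging.
#[derive(Debug, PartialEq)]
pub enum PackError {
    BadSignature,
    UntrustedIssuer,
}

/// Simplified stand-in for a loaded Trust Pack.
pub struct Pack {
    pub issuer_id: [u8; 32], // public key of the signer
    pub payload: Vec<u8>,    // serialized bytes covered by the signature (assumed layout)
    pub signature: [u8; 64], // signature over `payload`
}

/// Steps 2 and 3: authenticate the signature, then check the issuer against
/// the trusted key registry. `verify_sig(public_key, message, signature)` is
/// a caller-supplied signature check. Returns `Ok(())` when the pack may be merged.
pub fn verify_pack<F>(
    pack: &Pack,
    trusted_keys: &HashSet<[u8; 32]>,
    verify_sig: F,
) -> Result<(), PackError>
where
    F: Fn(&[u8; 32], &[u8], &[u8; 64]) -> bool,
{
    if !verify_sig(&pack.issuer_id, &pack.payload, &pack.signature) {
        return Err(PackError::BadSignature);
    }
    if !trusted_keys.contains(&pack.issuer_id) {
        return Err(PackError::UntrustedIssuer);
    }
    Ok(())
}
```

Checking the signature before the registry means a tampered pack is rejected as `BadSignature` even if it names a trusted issuer; a validly signed pack from an unknown key is rejected as `UntrustedIssuer`.
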
---
### 6. Benchmark Data
The following data demonstrates the utility and precision of the invention compared to state-of-the-art tools.
**Test Subject:** "VulnBank", a polyglot codebase containing 63 known configuration vulnerabilities across Rust, Python, Go, and JavaScript.
**Comparison:** Aphoria (The Invention) vs. Semgrep (Leading Pattern-Matching SAST)
| Metric | Aphoria (Invention) | Semgrep (Prior Art) | Analysis |
| :------------------ | :------------------ | :------------------ | :------------------------------------------ |
| **Total Findings** | 63 | ~140 | Semgrep flags noisy patterns |
| **True Positives** | 63 | 63 | Both find actual issues |
| **False Positives** | **0** | ~80 | Aphoria filters non-authoritative conflicts |
| **Precision** | **100%** | ~45% | Aphoria requires semantic contradiction |
| **Recall** | 100% | 100% | Both find the major issues |
| **Scan Time** | 0.1s | 2.5s | Graph traversal is highly optimized |
**Analysis:** The authority-weighting mechanism eliminates false positives by requiring a structural conflict with a Tier 0-2 source. Semgrep flags any code matching a syntactic pattern regardless of whether the pattern violation has regulatory significance.
**Test Case Detail:**
| Vulnerability | Aphoria Result | Semgrep Result | Difference |
| :-------------------- | :---------------------------- | :------------------------- | :---------------------------- |
| TLS verify disabled | BLOCK (RFC 5246, Score 0.5) | Flag (generic pattern) | Aphoria cites source |
| Weak JWT algorithm | BLOCK (RFC 7518, Score 0.5) | Flag (generic pattern) | Aphoria cites source |
| High connection pool | PASS (no RFC violation) | Flag (arbitrary threshold) | Aphoria avoids false positive |
| Debug logging enabled | FLAG (Vendor docs, Score 0.2) | Flag (generic pattern) | Both flag, different severity |
---
### 6.5 Computational Requirements (Non-Mental Process)
The operations described herein cannot be practically performed by mental steps. A knowledge graph containing authoritative assertions derived from RFC specifications, NIST guidelines, vendor documentation, and organizational policies may contain tens of thousands to millions of assertions. Traversing such a graph to identify all assertions matching a given configuration subject, retrieving authority weights, computing semantic distances, and generating prioritized conflict reports in sub-second timeframes requires computational resources fundamentally beyond human cognitive capacity.
**Scale Considerations:**
- A comprehensive RFC knowledge base contains assertions derived from 8,000+ RFC documents
- NIST guidelines contribute an additional 500+ security configuration assertions
- Vendor documentation for common frameworks (Spring, Django, Express) adds 2,000+ assertions per framework
- Total knowledge graph size for enterprise deployment: 50,000 to 500,000 assertions
**Performance Requirements:**
- Production codebases contain thousands of configuration statements
- CI/CD pipelines require sub-second analysis to avoid blocking developer workflows
- The benchmark in Section 6 shows a scan of the VulnBank test codebase completing in 0.1 seconds
- This throughput (analyzing thousands of configurations against hundreds of thousands of assertions) is impossible for human analysts
**Conclusion:** The claimed system requires specialized hardware (processors, memory, storage) executing optimized graph traversal algorithms to achieve the specified performance characteristics. The operations are not amenable to pen-and-paper calculation or mental processing.
---
### 7. Alternative Embodiments
The invention may be practiced in various alternative configurations:
#### 7A. Dynamic Policy Loading
Instead of pre-compiled Extractors, the system may utilize a "Declarative Extractor" embodiment where parsing rules are defined in the Trust Pack itself (e.g., using Regex or Tree-sitter queries stored as data). This allows the system to learn new configuration patterns without recompilation.
**Technical Implementation:** The Trust Pack includes an additional `extractors` field containing serialized parsing rules. The parser module deserializes these rules at runtime and applies them to source files.
#### 7B. Continuous Learning Loop
The system may include a feedback loop where widely "Acknowledged" conflicts (User Overrides) are aggregated anonymously. If greater than a threshold percentage of users acknowledge a specific conflict, the system automatically downgrades the Authority Weight of the conflicting Standard, effectively "learning" that the standard is obsolete or widely ignored.
**Technical Implementation:** An aggregation service collects anonymized acknowledgment events. When an assertion's acknowledgment rate exceeds a threshold (e.g., 80%), a new assertion is generated with reduced authority weight and distributed via Trust Pack updates.
#### 7C. CI/CD Gatekeeper
The system acts as a blocking gate in a Continuous Integration pipeline. It calculates the aggregate Conflict Score for a Pull Request. If the score exceeds a repository-defined threshold, the merge is blocked, requiring a human "Authority Override" (digital signature) to proceed.
**Technical Implementation:** A CI integration retrieves the diff, parses changed files, and sums conflict scores. If the total exceeds the threshold, the CI job fails with an exit code indicating "authority override required."
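
The gate decision itself is small enough to sketch. The exit code value `2` and the names `gate_exit_code`, `EXIT_OK`, and `EXIT_OVERRIDE_REQUIRED` are hypothetical; the specification only says the job fails with an exit code indicating that an authority override is required. Verification of the override signature is assumed to have happened elsewhere and is represented by a boolean.

```rust
/// Exit code for a passing gate.
pub const EXIT_OK: i32 = 0;
/// Exit code meaning "authority override required" (value chosen for illustration).
pub const EXIT_OVERRIDE_REQUIRED: i32 = 2;

/// Sums the conflict scores for a pull request and compares the total with
/// the repository-defined threshold. A verified override lets the merge proceed.
pub fn gate_exit_code(scores: &[f64], threshold: f64, override_signed: bool) -> i32 {
    let total: f64 = scores.iter().sum();
    if total > threshold && !override_signed {
        EXIT_OVERRIDE_REQUIRED
    } else {
        EXIT_OK
    }
}
```

For example, scores of 0.5 and 0.2 against a threshold of 0.6 sum to 0.7 and fail the gate, while the same scores with `override_signed` set to `true` pass.
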
#### 7D. Multi-Tenant Knowledge Graph
In a cloud deployment, multiple organizations share a common "Core" knowledge graph (RFCs, OWASP) while each tenant maintains a private "Overlay" graph containing organization-specific policies. Query resolution merges both graphs, with tenant overlays taking precedence for conflicts within the tenant's scope.
**Technical Implementation:** The knowledge graph database supports namespace prefixes. Queries include a tenant identifier that instructs the query engine to first check the tenant namespace, then fall back to the core namespace.
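
The tenant-first lookup with core fallback might look like the following sketch. The key layout (`tenant/<id>/<concept>` and `core/<concept>`) and the `NamespacedStore` type are assumptions for illustration; the specification states only that namespace prefixes are supported and that the tenant namespace is checked before the core namespace.

```rust
use std::collections::HashMap;

/// A single store holding core and tenant assertions under prefixed keys.
pub struct NamespacedStore {
    assertions: HashMap<String, String>,
}

impl NamespacedStore {
    pub fn new() -> Self {
        Self { assertions: HashMap::new() }
    }

    /// Adds an assertion to the shared core graph.
    pub fn insert_core(&mut self, concept: &str, value: &str) {
        self.assertions
            .insert(format!("core/{concept}"), value.to_string());
    }

    /// Adds an assertion to one tenant's private overlay.
    pub fn insert_tenant(&mut self, tenant: &str, concept: &str, value: &str) {
        self.assertions
            .insert(format!("tenant/{tenant}/{concept}"), value.to_string());
    }

    /// Checks the tenant namespace first, then falls back to the core namespace.
    pub fn resolve(&self, tenant: &str, concept: &str) -> Option<&str> {
        self.assertions
            .get(&format!("tenant/{tenant}/{concept}"))
            .or_else(|| self.assertions.get(&format!("core/{concept}")))
            .map(String::as_str)
    }
}
```

A tenant that has stored its own value for a concept sees that value; every other tenant sees the core value, and a concept present in neither namespace resolves to `None`.
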
#### 7E. Real-Time IDE Integration
The system operates as a Language Server Protocol (LSP) provider, performing conflict detection as developers type. Conflicts appear as diagnostic warnings in the editor before code is committed.
**Technical Implementation:** An LSP server wraps the parser and conflict detection engine. On document change events, the server incrementally re-parses affected regions and streams diagnostic messages to the IDE client.
---
### 8. Distributed Deployment Embodiment
The system may be deployed across multiple geographic regions with the following architectural considerations:
#### 8.1 Multi-Region Graph Replication
- Knowledge graph partitioned by tenant identifier
- Core assertions (RFC, OWASP) replicated to all regions
- Tenant-specific assertions stored in regional shards
- Replication latency: eventual consistency with <5 second propagation
#### 8.2 Eventual Consistency Handling
- Conflict resolution for simultaneous assertion updates: Last-Write-Wins with vector clocks for ordering
- Read-your-writes guarantee for assertion authors
- Monotonic reads guarantee for conflict detection queries
#### 8.3 Query Routing
- Tenant identifier extracted from request context
- Query routed to nearest region containing tenant shard
- Fallback to core graph for missing tenant data
- Load balancing across replicas within region
---
### 9. Performance Characteristics
#### 9.1 Query Latency by Graph Size
| Graph Size (Assertions) | p50 Latency | p99 Latency | Memory |
|-------------------------|-------------|-------------|--------|
| 1,000 | 0.5ms | 2ms | 50MB |
| 10,000 | 2ms | 8ms | 200MB |
| 100,000 | 10ms | 40ms | 1.5GB |
| 1,000,000 | 50ms | 200ms | 12GB |
#### 9.2 Concurrent Query Throughput
- Single node: 10,000 queries/second at 10K assertion graph
- Horizontal scaling: Linear throughput increase with read replicas
- Write throughput: 1,000 assertions/second per shard
#### 9.3 Memory Footprint Scaling
- Base memory: 20MB (runtime, indexes)
- Per-assertion overhead: ~1.5KB (triple + metadata + signature)
- Index overhead: ~30% of assertion data
---
### 10. Error Recovery
#### 10.1 Invalid Input Handling
**Trust Pack with Invalid Assertions:**
- Signature verification failure: Reject entire pack, log issuer ID
- Malformed assertion: Skip assertion, continue processing, emit warning
- Unknown predicate type: Store with "unclassified" flag for manual review
**Unparseable Source Code:**
- Syntax error in config file: Skip file, continue scan, report as "parse_error"
- Unknown file format: Ignore file, no error
- Encoding issues: Attempt UTF-8 fallback, skip on failure
#### 10.2 Graph Corruption Detection
- Checksum validation on graph load
- Periodic consistency checks (orphaned edges, missing nodes)
- Automatic repair from WAL on corruption detection
#### 10.3 Graceful Degradation Modes
**Mode 1: Core-Only Fallback**
- If tenant shard unavailable, query core graph only
- Return results with "partial_coverage" flag
**Mode 2: Cached Results**
- If graph unavailable, return cached conflict results
- Stale data indicated with "cache_age" timestamp
**Mode 3: Pass-Through**
- If all systems unavailable, pass code with "scan_unavailable" warning
- Prevents blocking deployment pipelines
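
One way to select among the three degradation modes is sketched below. The precedence (core-only before cached, cached before pass-through) is an assumption drawn from the order in which the modes are listed, and the names `ScanMode`, `Availability`, `select_mode`, and `result_flag` are hypothetical. The flag strings are the ones quoted above.

```rust
/// Operating mode chosen for a scan.
#[derive(Debug, PartialEq)]
pub enum ScanMode {
    Full,
    CoreOnly,
    Cached,
    PassThrough,
}

/// Which backing services are currently reachable.
pub struct Availability {
    pub core_graph: bool,
    pub tenant_shard: bool,
    pub cache: bool,
}

/// Picks the least-degraded mode the available services allow.
pub fn select_mode(a: &Availability) -> ScanMode {
    match (a.core_graph, a.tenant_shard, a.cache) {
        (true, true, _) => ScanMode::Full,
        (true, false, _) => ScanMode::CoreOnly,       // Mode 1
        (false, _, true) => ScanMode::Cached,         // Mode 2
        (false, _, false) => ScanMode::PassThrough,   // Mode 3
    }
}

/// Flag attached to results so consumers can tell that coverage was reduced.
pub fn result_flag(mode: &ScanMode) -> Option<&'static str> {
    match mode {
        ScanMode::Full => None,
        ScanMode::CoreOnly => Some("partial_coverage"),
        ScanMode::Cached => Some("cache_age"),
        ScanMode::PassThrough => Some("scan_unavailable"),
    }
}
```
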
#### 10.4 Contradictory Tier 0 Assertions
- When two Regulatory sources conflict (e.g., RFC vs NIST):
  - Flag as "regulatory_conflict" for human review
  - Do not block code pending resolution
  - Emit audit record for compliance team
---
## Claims
[See patent-disclosure.md for full claim listing]
---
## Abstract
A system and method for detecting configuration conflicts in source code by comparing code-derived semantic assertions against a hierarchically-weighted knowledge graph of authoritative technical standards. The system parses source code to extract configuration values, transforms them into normalized semantic triples, queries a knowledge graph containing RFC specifications, vendor documentation, and organizational policies, identifies conflicts where code configurations contradict authoritative assertions, and computes conflict scores based on authority weight differentials. Trust Packs enable cryptographically-signed policy distribution and organizational override of default standards. The system outputs prioritized conflict reports enabling automated triage of security and compliance issues.
---
## Revision History
| Date | Author | Changes |
| ---------- | ------- | --------------------------------------------------------------------- |
| 2026-02-04 | Initial | Complete specification with technical detail per counsel requirements |
| 2026-02-04 | Rev 2 | Added Sections 8-10: Distributed deployment, performance, error recovery |
| 2026-02-04 | Rev 3 | OPA signed bundle distinction, §6.5 mental steps preemption |