stemedb/docs/specs/visual-hash-query.md
jordan 1ce4004807 feat: Complete Phase 2 (The Cortex) - query, lens, and API layers
2026-02-01 13:22:44 -07:00

# Visual Hash Query Support
> **Status:** Draft
> **Phase:** 2.4 (MVP)
> **Pillars:** First-Class Contradiction (Visual Domain), Visual Anchoring
> **Primary Use Cases:** Financial Due Diligence, GLP-1 Living Review, Consumer Health Intelligence
---
## Problem Statement
The `visual_hash: Option<PHash>` field exists on `Assertion` and is stored/returned by the API, but there is no way to query by visual similarity. The field is write-only from a query perspective.
This creates a gap between what we promise and what we deliver:
| Promise (from use-cases) | Reality |
|--------------------------|---------|
| "Query by visual similarity" | No query parameter exists |
| "Find all assertions anchored to visually similar images" | Must scan entire dataset manually |
| "Catches duplicate evidence, fake variations, source reuse" | Cannot detect duplicates |
### The Pain Points
**Research Agent:** "I extract the same clinical trial table 10 times. Each extraction drifts slightly. I stored visual_hash hoping to detect duplicates. I can't query by it."
**Human Supervisor (3am incident):** "Agent made a bad decision based on an assertion with visual_hash. I need to find all assertions from that same screenshot. I can't."
**M&A Analyst:** "I have two screenshots of the same org chart from different sources. I want to prove they corroborate each other. I can't query by visual similarity."
---
## MVP Scope (Phase 2.4)
### What We're Building
A brute-force visual similarity query capability using hamming distance on perceptual hashes.
### API Changes
#### Query Struct (`stemedb-query`)
```rust
pub struct Query {
    // ... existing fields ...

    /// Filter by visual similarity to a reference pHash.
    /// Hex-encoded 8-byte perceptual hash (16 hex chars).
    pub visual_near: Option<String>,

    /// Maximum hamming distance for visual_near matching.
    /// Default: 8 bits (sweet spot for pHash similarity).
    /// Range: 0-64 (0 = exact match, 64 = match anything).
    pub visual_threshold: Option<u32>,
}
```
#### QueryBuilder (`stemedb-query`)
```rust
impl QueryBuilder {
    /// Filter by visual similarity to a reference pHash.
    ///
    /// # Arguments
    /// * `hash` - Hex-encoded pHash (16 chars)
    /// * `threshold` - Max hamming distance (default: 8)
    ///
    /// # Example
    /// ```
    /// let query = QueryBuilder::new()
    ///     .subject("BioStart_CEO")
    ///     .predicate("reports_to")
    ///     .visual_near("a3f2b1c4d5e6f708", 8)
    ///     .build();
    /// ```
    pub fn visual_near(mut self, hash: &str, threshold: u32) -> Self {
        self.visual_near = Some(hash.to_string());
        self.visual_threshold = Some(threshold);
        self
    }
}
```
#### HTTP API (`stemedb-api`)
```
GET /v1/query?visual_near=a3f2b1c4d5e6f708&visual_threshold=8
Query Parameters:
visual_near - Hex-encoded pHash (16 characters, optional)
visual_threshold - Max hamming distance (0-64, default: 8, optional)
```
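The threshold rule above (range 0-64, default 8) could be enforced at the API boundary roughly as follows. This is a minimal sketch: `resolve_threshold` and the two constants are hypothetical names for illustration, not existing `stemedb-api` items, and the error type is a plain `String` only to keep the example self-contained.

```rust
/// Default from the parameter table above.
const DEFAULT_VISUAL_THRESHOLD: u32 = 8;
/// Two 64-bit hashes cannot differ in more than 64 bits.
const MAX_VISUAL_THRESHOLD: u32 = 64;

/// Resolve the optional `visual_threshold` query parameter.
/// (Hypothetical helper; name and error type are assumptions.)
pub fn resolve_threshold(raw: Option<u32>) -> Result<u32, String> {
    // A missing parameter falls back to the documented default.
    let threshold = raw.unwrap_or(DEFAULT_VISUAL_THRESHOLD);
    if threshold > MAX_VISUAL_THRESHOLD {
        return Err(format!(
            "visual_threshold must be in 0-{MAX_VISUAL_THRESHOLD}, got {threshold}"
        ));
    }
    Ok(threshold)
}
```

Rejecting out-of-range values here, rather than clamping them silently, keeps a mistyped threshold from turning into a "match anything" query.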
### Core Algorithm
```rust
/// Compute hamming distance between two perceptual hashes.
///
/// Hamming distance = number of differing bits.
/// For 8-byte pHash, range is 0-64.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```
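Because the hash is exactly 8 bytes, the same distance can also be computed with a single 64-bit XOR and popcount. The sketch below is an equivalent formulation rather than a change to the spec; `hamming_distance_u64` is a hypothetical name.

```rust
/// Hamming distance computed on the hashes viewed as 64-bit integers.
/// Byte order does not affect the result as long as both inputs
/// are converted the same way.
pub fn hamming_distance_u64(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    (u64::from_be_bytes(*a) ^ u64::from_be_bytes(*b)).count_ones()
}
```

Both versions return identical results; which one is faster in practice would need to be measured on the target hardware.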
### Matching Logic
In `Query::matches()`:
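```rust
// If visual_near is specified, check visual similarity
if let Some(ref target_hash) = self.visual_near {
    match &assertion.visual_hash {
        Some(assertion_hash) => {
            // `matches()` returns bool, so a parse failure cannot be
            // propagated with `?`. Assuming `parse_hex_phash` returns a
            // Result, treat a malformed hash as a non-match here and
            // reject bad input earlier (e.g. at the API boundary).
            let Ok(target) = parse_hex_phash(target_hash) else {
                return false;
            };
            let distance = hamming_distance(&target, assertion_hash);
            let threshold = self.visual_threshold.unwrap_or(8);
            if distance > threshold {
                return false; // Too dissimilar
            }
        }
        None => return false, // No visual_hash, can't match
    }
}
```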
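The matching logic relies on `parse_hex_phash`, which this spec names but does not define. One possible shape is sketched below; the signature and the `String` error type are assumptions made for illustration only.

```rust
/// Parse a 16-character hex string into an 8-byte pHash.
/// (Signature assumed; the spec only names the function.)
pub fn parse_hex_phash(s: &str) -> Result<[u8; 8], String> {
    // Checking every byte up front also guarantees the input is ASCII,
    // and rejects a leading '+' that `from_str_radix` would accept.
    if s.len() != 16 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected 16 hex characters, got {s:?}"));
    }
    let mut out = [0u8; 8];
    for (i, byte) in out.iter_mut().enumerate() {
        // Safe to slice by byte index because the input is ASCII.
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16)
            .map_err(|e| e.to_string())?;
    }
    Ok(out)
}
```

Since the target hash is the same for every assertion in a scan, parsing it once when the query is built, instead of once per `matches()` call, avoids repeated work.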
### Go SDK Changes
```go
// QueryBuilder addition
func (b *QueryBuilder) VisualNear(hash string, threshold uint32) *QueryBuilder {
	b.visualNear = hash
	b.visualThreshold = threshold
	return b
}

// Usage
query := steme.NewQuery().
	Subject("BioStart_CEO").
	Predicate("reports_to").
	VisualNear("a3f2b1c4d5e6f708", 8).
	Build()
```
---
## Threshold Guidance
Hamming distance thresholds for 64-bit pHash:
| Threshold | Meaning | Use Case |
|-----------|---------|----------|
| 0 | Exact match | Deduplication, identical images |
| 1-4 | Very similar | Safety-critical (medications, financial) |
| 5-8 | Similar | General visual matching (default) |
| 9-12 | Loosely similar | Catch resized/cropped variants |
| 13+ | Weak similarity | Exploratory search |
### Domain-Specific Recommendations
| Domain | Recommended Threshold | Rationale |
|--------|----------------------|-----------|
| Consumer Health | 4-5 | False positives for medications are dangerous |
| Financial Due Diligence | 8-10 | Screenshots get cropped/resized |
| Pharma/Clinical | 6-8 | Balance accuracy vs OCR variations |
| General | 8 | pHash literature default |
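If a client wants these recommendations available in code, the table could be mirrored as a small lookup. This is only a sketch: the `Domain` enum and its methods are hypothetical, and picking the lower bound of each range as the default is an assumption (it is the stricter choice), not something the table prescribes.

```rust
/// Domains from the recommendation table (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Domain {
    ConsumerHealth,
    FinancialDueDiligence,
    PharmaClinical,
    General,
}

impl Domain {
    /// Inclusive (min, max) recommended hamming thresholds.
    pub fn recommended_range(self) -> (u32, u32) {
        match self {
            Domain::ConsumerHealth => (4, 5),
            Domain::FinancialDueDiligence => (8, 10),
            Domain::PharmaClinical => (6, 8),
            Domain::General => (8, 8),
        }
    }

    /// Assumption: default to the strictest value in the range.
    pub fn default_threshold(self) -> u32 {
        self.recommended_range().0
    }

    /// True when a requested threshold is looser than recommended,
    /// which a client might surface as a warning.
    pub fn is_looser_than_recommended(self, threshold: u32) -> bool {
        threshold > self.recommended_range().1
    }
}
```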
---
## Test Cases
### Unit Tests (`stemedb-query`)
```rust
#[test]
fn test_visual_near_exact_match() {
    // Same hash, threshold 0 -> matches
    let hash = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let assertion = make_assertion_with_visual_hash(hash);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 0)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_within_threshold() {
    // Stored hash differs from the query hash by 3 bits, threshold 5 -> matches
    // Last byte: 0x08 ^ 0x0F = 0x07 (3 bits set)
    let hash_b = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x0F];
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_exceeds_threshold() {
    // Stored hash differs from the query hash by 9 bits, threshold 5 -> no match
    // First two bytes: 0xA3 has 4 bits set, 0xF2 has 5 bits set
    let hash_b = [0x00, 0x00, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_visual_near_skips_assertions_without_hash() {
    // Assertion has no visual_hash -> not matched
    let assertion = make_assertion_without_visual_hash();
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 8)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_hamming_distance_zero() {
    let hash = [0xFF; 8];
    assert_eq!(hamming_distance(&hash, &hash), 0);
}

#[test]
fn test_hamming_distance_max() {
    let a = [0x00; 8];
    let b = [0xFF; 8];
    assert_eq!(hamming_distance(&a, &b), 64);
}
```
### Integration Tests (`stemedb-api`)
```rust
#[tokio::test]
async fn test_visual_query_api() {
    let app = create_test_app().await;

    // Insert assertion with visual_hash
    let resp = app.post("/v1/assertions")
        .json(&json!({
            "subject": "STEP-1_Trial",
            "predicate": "primary_endpoint",
            "object": {"Number": 14.9},
            "visual_hash": "a3f2b1c4d5e6f708"
        }))
        .send().await;
    assert_eq!(resp.status(), 201);

    // Query by visual similarity
    let resp = app.get("/v1/query")
        .query(&[
            ("visual_near", "a3f2b1c4d5e6f700"), // 1 bit different (0x08 vs 0x00)
            ("visual_threshold", "8")
        ])
        .send().await;
    assert_eq!(resp.status(), 200);

    let results: Vec<Assertion> = resp.json().await;
    assert_eq!(results.len(), 1);
}
```
---
## Performance Characteristics
### MVP (Brute Force)
| Operation | Complexity | Notes |
|-----------|------------|-------|
| Visual query | O(N) | Scans all assertions |
| Hamming distance | O(1) | 8 XORs + 8 popcounts |
| Per-assertion overhead | ~10 ns | Negligible |
**Expected latency:**
- 10K assertions: ~100μs
- 100K assertions: ~1ms
- 1M assertions: ~10ms
- 10M assertions: ~100ms (needs index)
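The brute-force scan and the latency arithmetic behind these figures can be sketched as follows. `scan_visual_near` and `estimated_scan_micros` are hypothetical names, and the estimate simply multiplies the ~10 ns per-assertion figure from the table by N, so it covers the hash comparison only and not storage reads or deserialization.

```rust
/// Hamming distance as defined in the Core Algorithm section.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// O(N) scan: return the indices of all entries whose hash is within
/// `threshold` bits of `target`. Entries without a hash never match.
pub fn scan_visual_near(
    hashes: &[Option<[u8; 8]>],
    target: &[u8; 8],
    threshold: u32,
) -> Vec<usize> {
    hashes
        .iter()
        .enumerate()
        .filter_map(|(i, h)| match h {
            Some(h) if hamming_distance(target, h) <= threshold => Some(i),
            _ => None,
        })
        .collect()
}

/// Back-of-envelope scan time in microseconds at ~10 ns per assertion.
pub fn estimated_scan_micros(n_assertions: u64) -> u64 {
    n_assertions * 10 / 1_000
}
```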
### When to Worry
If visual queries consistently exceed 100ms, Phase 3 indexing is overdue. Start planning earlier: the Phase 3a trigger below is 50ms at p99.
---
## Future Directions (Based on Performance)
### Phase 3a: BK-Tree Index
**Trigger:** Visual queries exceed 50ms at p99.
**Approach:** Build a BK-tree (Burkhard-Keller tree) optimized for hamming distance.
```rust
struct BKTree {
    root: Option<Box<BKNode>>,
}

struct BKNode {
    hash: PHash,
    assertion_id: Hash,
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}
```
**Benefits:**
- Roughly O(log N) average case for small thresholds; pruning weakens as the threshold grows, approaching a full O(N) scan
- Well-suited for discrete metric spaces (hamming distance)
**Storage:** `VIX:{hash_prefix}` keys in sled for persistence.
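To make the approach concrete, here is a minimal in-memory sketch of BK-tree insertion and threshold search. It is illustrative only: `u64` ids stand in for the assertion `Hash` type, each node keeps a list of ids so that identical hashes (distance 0) can share a node, and sled persistence is not covered. All method names are hypothetical.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

type PHash = [u8; 8];

fn hamming(a: &PHash, b: &PHash) -> u32 {
    (u64::from_be_bytes(*a) ^ u64::from_be_bytes(*b)).count_ones()
}

struct BKNode {
    hash: PHash,
    ids: Vec<u64>,                       // assertions sharing this exact hash
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}

impl BKNode {
    fn new(hash: PHash, id: u64) -> Self {
        BKNode { hash, ids: vec![id], children: HashMap::new() }
    }

    fn insert(&mut self, hash: PHash, id: u64) {
        let d = hamming(&self.hash, &hash);
        if d == 0 {
            self.ids.push(id);
            return;
        }
        // Descend along the edge labelled with the distance to this node.
        match self.children.entry(d) {
            Entry::Occupied(mut e) => e.get_mut().insert(hash, id),
            Entry::Vacant(e) => {
                e.insert(Box::new(BKNode::new(hash, id)));
            }
        }
    }

    fn query(&self, target: &PHash, threshold: u32, out: &mut Vec<u64>) {
        let d = hamming(&self.hash, target);
        if d <= threshold {
            out.extend_from_slice(&self.ids);
        }
        // Triangle inequality: matches can only live under edges whose
        // label lies in [d - threshold, d + threshold].
        let lo = d.saturating_sub(threshold);
        let hi = d.saturating_add(threshold);
        for (&edge, child) in &self.children {
            if edge >= lo && edge <= hi {
                child.query(target, threshold, out);
            }
        }
    }
}

#[derive(Default)]
pub struct BKTree {
    root: Option<Box<BKNode>>,
}

impl BKTree {
    pub fn insert(&mut self, hash: PHash, id: u64) {
        match self.root.as_mut() {
            Some(root) => root.insert(hash, id),
            None => self.root = Some(Box::new(BKNode::new(hash, id))),
        }
    }

    /// Ids of all hashes within `threshold` bits of `target`, sorted.
    pub fn query(&self, target: &PHash, threshold: u32) -> Vec<u64> {
        let mut out = Vec::new();
        if let Some(root) = &self.root {
            root.query(target, threshold, &mut out);
        }
        out.sort_unstable();
        out
    }
}
```

The pruning window widens with the threshold, so small thresholds visit few subtrees while large ones visit most of the tree.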
### Phase 3b: VP-Tree Index
**Trigger:** Need range queries with variable thresholds.
**Approach:** Vantage-point tree for metric space search.
**Benefits:**
- Better for continuous thresholds
- Can support "find K nearest" queries
### Phase 3c: Locality-Sensitive Hashing (LSH)
**Trigger:** Need sub-linear time for very large datasets (100M+ assertions).
**Approach:** Hash pHashes into buckets where similar hashes collide.
**Benefits:**
- O(1) average-case bucket lookup, plus verification of the candidates found in each bucket
- Scales to billions of assertions
**Tradeoffs:**
- Probabilistic (may miss some matches)
- Requires tuning hash functions
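One simple way to get bucket collisions for hamming distance is to split each hash into fixed bands and index every band separately, in the spirit of bit-sampling LSH. The sketch below uses four 16-bit bands; `BandIndex` and its methods are hypothetical, and the band layout is an assumption, not a tuned design. With four bands, any hash within 3 bits of the target is guaranteed to share at least one band (pigeonhole); at larger distances a true match can be missed, which is the tradeoff listed above.

```rust
use std::collections::HashMap;

type PHash = [u8; 8];
const BANDS: usize = 4; // four 16-bit bands

fn hamming(a: &PHash, b: &PHash) -> u32 {
    (u64::from_be_bytes(*a) ^ u64::from_be_bytes(*b)).count_ones()
}

/// Split a 64-bit hash into four 16-bit bucket keys.
fn band_keys(h: &PHash) -> [u16; BANDS] {
    [
        u16::from_be_bytes([h[0], h[1]]),
        u16::from_be_bytes([h[2], h[3]]),
        u16::from_be_bytes([h[4], h[5]]),
        u16::from_be_bytes([h[6], h[7]]),
    ]
}

pub struct BandIndex {
    tables: Vec<HashMap<u16, Vec<usize>>>, // one bucket table per band
    hashes: Vec<PHash>,                    // id -> full hash
}

impl BandIndex {
    pub fn new() -> Self {
        BandIndex { tables: vec![HashMap::new(); BANDS], hashes: Vec::new() }
    }

    /// Index a hash and return its id (its insertion position).
    pub fn insert(&mut self, h: PHash) -> usize {
        let id = self.hashes.len();
        for (table, key) in self.tables.iter_mut().zip(band_keys(&h)) {
            table.entry(key).or_default().push(id);
        }
        self.hashes.push(h);
        id
    }

    /// Candidates are ids sharing at least one band with the target;
    /// each is then verified with the exact hamming distance.
    pub fn query(&self, target: &PHash, threshold: u32) -> Vec<usize> {
        let mut candidates: Vec<usize> = self
            .tables
            .iter()
            .zip(band_keys(target))
            .filter_map(|(table, key)| table.get(&key))
            .flatten()
            .copied()
            .collect();
        candidates.sort_unstable();
        candidates.dedup();
        candidates.retain(|&id| hamming(&self.hashes[id], target) <= threshold);
        candidates
    }
}
```

Because every candidate is verified, the index never returns a false positive; the only error mode is a missed match whose differing bits happen to touch every band.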
### Phase 3d: Visual Clustering Dashboard
**Trigger:** Users want "show me distinct visual fingerprints."
**Approach:** Background job that clusters visual_hashes and maintains cluster centroids.
```
GET /v1/visual-clusters?subject=TechCorp
-> Returns:
{
  "clusters": [
    { "centroid": "a3f2b1c4d5e6f708", "count": 47, "representative_hash": "..." },
    { "centroid": "b4e3c2d5a6f70819", "count": 12, "representative_hash": "..." }
  ]
}
```
### Phase 3e: Pre-Assertion Dedup Check
**Trigger:** Ingestion workers want to avoid duplicates.
**Approach:** Endpoint to check visual similarity before creating assertion.
```
GET /v1/assertions/visual-check?visual_hash=a3f2b1c4d5e6f708&threshold=8
-> Returns existing assertion hashes if any match, empty otherwise
```
---
## Non-Goals (Explicit Scope Limits)
1. **Not making visual_hash mandatory.** It remains optional. Assertions without visual_hash simply don't match visual queries.
2. **Not building image processing.** Episteme doesn't compute pHashes. Clients compute them before asserting.
3. **Not building a visual search UI.** This is query infrastructure. Apps build UIs on top.
4. **Not optimizing for exact match.** Use `source_hash` for exact binary matching. `visual_hash` is for perceptual similarity.
---
## Implementation Checklist
### Core (`stemedb-query`)
- [ ] Add `visual_near: Option<String>` to `Query` struct
- [ ] Add `visual_threshold: Option<u32>` to `Query` struct (default: 8)
- [ ] Implement `hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32`
- [ ] Implement hex parsing for pHash strings
- [ ] Add visual matching to `Query::matches()`
- [ ] Add `.visual_near(hash, threshold)` to `QueryBuilder`
### API (`stemedb-api`)
- [ ] Add `visual_near` query parameter to `QueryParams` DTO
- [ ] Add `visual_threshold` query parameter to `QueryParams` DTO
- [ ] Update OpenAPI spec with new parameters
### SDK (`sdk/go/steme`)
- [ ] Add `VisualNear(hash, threshold)` to `QueryBuilder`
- [ ] Add integration test for visual queries
### Tests
- [ ] `test_visual_near_exact_match`
- [ ] `test_visual_near_within_threshold`
- [ ] `test_visual_near_exceeds_threshold`
- [ ] `test_visual_near_skips_assertions_without_hash`
- [ ] `test_hamming_distance_zero`
- [ ] `test_hamming_distance_max`
### Documentation
- [ ] Add threshold guidance to SDK docs
- [ ] Add domain-specific recommendations to use-case docs
- [ ] Update `ai-lookup/services/query.md` with visual query info
---
## Stakeholder Sign-Off
Based on empathy interviews:
| Persona | Concern | How MVP Addresses |
|---------|---------|-------------------|
| Research Agent | "Let me query by visual similarity" | Core feature |
| Human Supervisor | "O(N) is fine if Phase 3 is planned" | Future directions documented |
| Lead Orchestrator | "Don't mandate visual_hash" | Remains optional |
| On-Call SRE | "Keep API simple" | Two optional query params (`visual_near`, `visual_threshold`), sensible default |
| M&A Analyst | "Threshold 8-10 for screenshots" | Default 8, guidance for 10 |
| Consumer Health | "False positives are dangerous" | Domain guidance: threshold 4-5 |
---
## Open Questions
1. **Should we add `visual_hash` to `QueryAudit`?** When auditing "what did the agent query?", do we capture visual_near parameters?
2. **Should visual queries combine with other filters?** Current spec: yes (AND with subject, predicate, etc.). Is this correct?
3. **What's the actual distribution of visual_hash population?** If <1% of assertions have visual_hash, should we warn users?
---
## References
- [pHash paper](http://www.phash.org/docs/pubs/thesis_zauner.pdf) - Perceptual hashing algorithms
- [BK-tree](https://en.wikipedia.org/wiki/BK-tree) - Metric tree for discrete distances
- [VP-tree](https://en.wikipedia.org/wiki/Vantage-point_tree) - Metric tree for general metrics
- [LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) - Probabilistic nearest-neighbor