# Visual Hash Query Support

> **Status:** Draft
> **Phase:** 2.4 (MVP)
> **Pillars:** First-Class Contradiction (Visual Domain), Visual Anchoring
> **Primary Use Cases:** Financial Due Diligence, GLP-1 Living Review, Consumer Health Intelligence

---

## Problem Statement

The `visual_hash: Option<PHash>` field exists on `Assertion` and is stored/returned by the API, but there is no way to query by visual similarity. The field is write-only from a query perspective.

This creates a gap between what we promise and what we deliver:

| Promise (from use-cases) | Reality |
|--------------------------|---------|
| "Query by visual similarity" | No query parameter exists |
| "Find all assertions anchored to visually similar images" | Must scan entire dataset manually |
| "Catches duplicate evidence, fake variations, source reuse" | Cannot detect duplicates |

### The Pain Points

**Research Agent:** "I extract the same clinical trial table 10 times. Each extraction drifts slightly. I stored visual_hash hoping to detect duplicates. I can't query by it."

**Human Supervisor (3am incident):** "Agent made a bad decision based on an assertion with visual_hash. I need to find all assertions from that same screenshot. I can't."

**M&A Analyst:** "I have two screenshots of the same org chart from different sources. I want to prove they corroborate each other. I can't query by visual similarity."

---

## MVP Scope (Phase 2.4)

### What We're Building

A brute-force visual similarity query capability using hamming distance on perceptual hashes.

### API Changes

#### Query Struct (`stemedb-query`)

```rust
pub struct Query {
    // ... existing fields ...

    /// Filter by visual similarity to a reference pHash.
    /// Hex-encoded 8-byte perceptual hash (16 hex chars).
    pub visual_near: Option<String>,

    /// Maximum hamming distance for visual_near matching.
    /// Default: 8 bits (sweet spot for pHash similarity).
    /// Range: 0-64 (0 = exact match, 64 = match any assertion that has a visual_hash).
    pub visual_threshold: Option<u32>,
}
```

#### QueryBuilder (`stemedb-query`)

```rust
impl QueryBuilder {
    /// Filter by visual similarity to a reference pHash.
    ///
    /// # Arguments
    /// * `hash` - Hex-encoded pHash (16 chars)
    /// * `threshold` - Max hamming distance, 0-64 (8 is the recommended default)
    ///
    /// # Example
    /// ```
    /// let query = QueryBuilder::new()
    ///     .subject("BioStart_CEO")
    ///     .predicate("reports_to")
    ///     .visual_near("a3f2b1c4d5e6f708", 8)
    ///     .build();
    /// ```
    pub fn visual_near(mut self, hash: &str, threshold: u32) -> Self {
        self.visual_near = Some(hash.to_string());
        self.visual_threshold = Some(threshold);
        self
    }
}
```

#### HTTP API (`stemedb-api`)

```
GET /v1/query?visual_near=a3f2b1c4d5e6f708&visual_threshold=8

Query Parameters:
  visual_near      - Hex-encoded pHash (16 characters, optional)
  visual_threshold - Max hamming distance (0-64, default: 8, optional)
```
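
The parameter rules above (16 hex characters, threshold 0-64, default 8) can be enforced before a query is built. The following is a minimal sketch, not the actual `stemedb-api` code: `parse_hex_phash` is named in this spec but not defined here, so its signature is an assumption, and `validate_visual_params` and `DEFAULT_VISUAL_THRESHOLD` are hypothetical names.

```rust
/// Default hamming-distance threshold, as stated in the parameter list above.
const DEFAULT_VISUAL_THRESHOLD: u32 = 8;

/// Assumed signature: decode a 16-character hex string into an 8-byte pHash.
fn parse_hex_phash(s: &str) -> Result<[u8; 8], String> {
    if s.len() != 16 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("visual_near must be 16 hex characters, got {:?}", s));
    }
    let mut out = [0u8; 8];
    for (i, byte) in out.iter_mut().enumerate() {
        // Byte-index slicing is safe: the string was checked to be ASCII hex.
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).map_err(|e| e.to_string())?;
    }
    Ok(out)
}

/// Hypothetical validation step for the two query parameters.
/// Returns Ok(None) when no visual filter was requested.
fn validate_visual_params(
    visual_near: Option<&str>,
    visual_threshold: Option<u32>,
) -> Result<Option<([u8; 8], u32)>, String> {
    let hex = match visual_near {
        Some(hex) => hex,
        // Assumption: a threshold without visual_near is ignored, not rejected.
        None => return Ok(None),
    };
    let target = parse_hex_phash(hex)?;
    let threshold = visual_threshold.unwrap_or(DEFAULT_VISUAL_THRESHOLD);
    if threshold > 64 {
        return Err(format!("visual_threshold must be 0-64, got {}", threshold));
    }
    Ok(Some((target, threshold)))
}
```

Rejecting malformed input at the API boundary would let the handler return a client error instead of silently returning an empty result set; whether the real service should do that is a design decision this spec leaves open.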

### Core Algorithm

```rust
/// Compute hamming distance between two perceptual hashes.
///
/// Hamming distance = number of differing bits.
/// For 8-byte pHash, range is 0-64.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```
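
As a worked example of the definition above, the hashes `a3f2b1c4d5e6f708` and `a3f2b1c4d5e6f70f` differ only in the last byte (`0x08 ^ 0x0F = 0x07`), so their distance is 3. The sketch below repeats the byte-wise function and adds an equivalent formulation that packs each hash into a `u64`; `hamming_distance_u64` is a hypothetical name and not part of the proposed API.

```rust
/// Byte-wise version, as specified in this section.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

/// Equivalent formulation (hypothetical): one 64-bit XOR and one popcount.
/// Byte order does not matter as long as both sides use the same one.
pub fn hamming_distance_u64(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    (u64::from_be_bytes(*a) ^ u64::from_be_bytes(*b)).count_ones()
}
```

Both functions return the same value for every pair of inputs; the `u64` form simply makes the single-word nature of the comparison explicit.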

### Matching Logic

In `Query::matches()`:

```rust
// If visual_near is specified, check visual similarity
if let Some(ref target_hash) = self.visual_near {
    match &assertion.visual_hash {
        Some(assertion_hash) => {
            // `matches()` returns bool, so `?` cannot be used here. An
            // unparseable target (assuming parse_hex_phash returns a Result)
            // is treated as "no match".
            let target = match parse_hex_phash(target_hash) {
                Ok(target) => target,
                Err(_) => return false,
            };
            let distance = hamming_distance(&target, assertion_hash);
            let threshold = self.visual_threshold.unwrap_or(8);
            if distance > threshold {
                return false; // Too dissimilar
            }
        }
        None => return false, // No visual_hash, can't match
    }
}
```
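
The matching rule above can also be read as a filter over the whole dataset. The sketch below shows that brute-force scan with the target hash decoded once by the caller rather than once per assertion. The `Assertion` struct here is a pared-down, hypothetical stand-in for the real type, and `visual_scan` is an illustrative name.

```rust
/// Hypothetical stand-in for the real `Assertion` type.
pub struct Assertion {
    pub id: u64,
    pub visual_hash: Option<[u8; 8]>,
}

pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

/// Return the ids of all assertions within `threshold` bits of `target`.
/// Assertions without a visual_hash never match, as in the rule above.
pub fn visual_scan(assertions: &[Assertion], target: &[u8; 8], threshold: u32) -> Vec<u64> {
    assertions
        .iter()
        .filter(|a| match &a.visual_hash {
            Some(hash) => hamming_distance(target, hash) <= threshold,
            None => false,
        })
        .map(|a| a.id)
        .collect()
}
```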

### Go SDK Changes

```go
// QueryBuilder addition
func (b *QueryBuilder) VisualNear(hash string, threshold uint32) *QueryBuilder {
	b.visualNear = hash
	b.visualThreshold = threshold
	return b
}

// Usage
query := steme.NewQuery().
	Subject("BioStart_CEO").
	Predicate("reports_to").
	VisualNear("a3f2b1c4d5e6f708", 8).
	Build()
```

---

## Threshold Guidance

Hamming distance thresholds for 64-bit pHash:

| Threshold | Meaning | Use Case |
|-----------|---------|----------|
| 0 | Exact match | Deduplication, identical images |
| 1-4 | Very similar | Safety-critical (medications, financial) |
| 5-8 | Similar | General visual matching (default) |
| 9-12 | Loosely similar | Catch resized/cropped variants |
| 13+ | Weak similarity | Exploratory search |
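
The bands in the table above map directly onto a range match. The sketch below is illustrative only: `SimilarityBand` and `similarity_band` are hypothetical names, and nothing in this spec requires the bands to exist as a type.

```rust
/// Hypothetical labels for the threshold bands in the table above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SimilarityBand {
    Exact,
    VerySimilar,
    Similar,
    LooselySimilar,
    Weak,
}

/// Classify a hamming distance (0-64) into the band it falls in.
pub fn similarity_band(distance: u32) -> SimilarityBand {
    match distance {
        0 => SimilarityBand::Exact,
        1..=4 => SimilarityBand::VerySimilar,
        5..=8 => SimilarityBand::Similar,
        9..=12 => SimilarityBand::LooselySimilar,
        _ => SimilarityBand::Weak,
    }
}
```

A client could use such a label to explain a match to a human reviewer, for example by showing "very similar" next to a result with distance 3.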

### Domain-Specific Recommendations

| Domain | Recommended Threshold | Rationale |
|--------|----------------------|-----------|
| Consumer Health | 4-5 | False positives for medications are dangerous |
| Financial Due Diligence | 8-10 | Screenshots get cropped/resized |
| Pharma/Clinical | 6-8 | Balance accuracy vs OCR variations |
| General | 8 | pHash literature default |

---

## Test Cases

### Unit Tests (`stemedb-query`)

```rust
#[test]
fn test_visual_near_exact_match() {
    // Same hash, threshold 0 -> matches
    let hash = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let assertion = make_assertion_with_visual_hash(hash);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 0)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_within_threshold() {
    // Hashes differ by 3 bits, threshold 5 -> matches
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x0F]; // 3 bits differ
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_exceeds_threshold() {
    // Hashes differ by 9 bits, threshold 5 -> no match
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0x00, 0x00, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08]; // 9 bits differ (0xA3 has 4 set bits, 0xF2 has 5)
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_visual_near_skips_assertions_without_hash() {
    // Assertion has no visual_hash -> not matched
    let assertion = make_assertion_without_visual_hash();
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 8)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_hamming_distance_zero() {
    let hash = [0xFF; 8];
    assert_eq!(hamming_distance(&hash, &hash), 0);
}

#[test]
fn test_hamming_distance_max() {
    let a = [0x00; 8];
    let b = [0xFF; 8];
    assert_eq!(hamming_distance(&a, &b), 64);
}
```

### Integration Tests (`stemedb-api`)

```rust
#[tokio::test]
async fn test_visual_query_api() {
    let app = create_test_app().await;

    // Insert assertion with visual_hash
    let resp = app.post("/v1/assertions")
        .json(&json!({
            "subject": "STEP-1_Trial",
            "predicate": "primary_endpoint",
            "object": {"Number": 14.9},
            "visual_hash": "a3f2b1c4d5e6f708"
        }))
        .send().await;
    assert_eq!(resp.status(), 201);

    // Query by visual similarity
    let resp = app.get("/v1/query")
        .query(&[
            ("visual_near", "a3f2b1c4d5e6f700"), // 1 bit different (0x08 vs 0x00)
            ("visual_threshold", "8")
        ])
        .send().await;
    assert_eq!(resp.status(), 200);

    let results: Vec<Assertion> = resp.json().await;
    assert_eq!(results.len(), 1);
}
```

---

## Performance Characteristics

### MVP (Brute Force)

| Operation | Complexity | Notes |
|-----------|------------|-------|
| Visual query | O(N) | Scans all assertions |
| Hamming distance | O(1) | 8 XORs + 8 popcounts |
| Per-assertion overhead | ~10 ns | Negligible |

**Expected latency:**

- 10K assertions: ~100μs
- 100K assertions: ~1ms
- 1M assertions: ~10ms
- 10M assertions: ~100ms (needs index)
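
The latency figures above are estimates derived from the per-assertion overhead, so it may be worth measuring them on target hardware. The sketch below is one hedged way to do that using only the standard library: it fabricates hashes with a SplitMix64 generator (an arbitrary choice for reproducibility), packs each pHash into a `u64`, and times only the distance loop. All function names are hypothetical, and real queries would also pay for storage reads and deserialization, which this does not capture.

```rust
use std::time::{Duration, Instant};

/// One SplitMix64 step; used only to fabricate reproducible test hashes.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Generate `n` synthetic 64-bit hashes from a seed.
pub fn synthetic_hashes(n: usize, seed: u64) -> Vec<u64> {
    let mut state = seed;
    (0..n).map(|_| splitmix64(&mut state)).collect()
}

/// Scan every hash, count those within `threshold` bits of `target`,
/// and report how long the scan took.
pub fn timed_scan(hashes: &[u64], target: u64, threshold: u32) -> (usize, Duration) {
    let start = Instant::now();
    let matches = hashes
        .iter()
        .filter(|&&h| (h ^ target).count_ones() <= threshold)
        .count();
    (matches, start.elapsed())
}
```

Calling `timed_scan(&synthetic_hashes(1_000_000, 42), target, 8)` in a release build gives a rough lower bound on scan time for one million assertions; the number will vary by machine.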

### When to Worry

If visual queries consistently exceed 100ms, it's time for Phase 3 indexing.

---

## Future Directions (Based on Performance)

### Phase 3a: BK-Tree Index

**Trigger:** Visual queries exceed 50ms at p99.

**Approach:** Build a BK-tree (Burkhard-Keller tree) optimized for hamming distance.

```rust
struct BKTree {
    root: Option<Box<BKNode>>,
}

struct BKNode {
    hash: PHash,
    assertion_id: Hash,
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}
```

**Benefits:**

- Sub-linear average case for small thresholds (roughly O(log N) when the threshold is small relative to the 64-bit hash length; cost grows toward O(N) as the threshold increases)
- Well-suited for discrete metric spaces (hamming distance)

**Storage:** `VIX:{hash_prefix}` keys in sled for persistence.
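
To make the BK-tree approach concrete, here is a minimal in-memory sketch of insertion and bounded-distance lookup. It is not the planned implementation: it assumes the pHash is packed into a `u64`, uses `u64` assertion ids in place of the `Hash` type, stores a list of ids per node so identical hashes can share a node, and ignores sled persistence entirely.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

type PHash = u64; // assumption: 8-byte pHash packed into a u64

fn distance(a: PHash, b: PHash) -> u32 {
    (a ^ b).count_ones()
}

struct BKNode {
    hash: PHash,
    assertion_ids: Vec<u64>,             // all assertions sharing this exact hash
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}

impl BKNode {
    fn new(hash: PHash, id: u64) -> Self {
        BKNode { hash, assertion_ids: vec![id], children: HashMap::new() }
    }

    fn insert(&mut self, hash: PHash, id: u64) {
        let d = distance(hash, self.hash);
        if d == 0 {
            self.assertion_ids.push(id);
            return;
        }
        // Descend into the child at distance d, creating it if absent.
        match self.children.entry(d) {
            Entry::Occupied(mut e) => e.get_mut().insert(hash, id),
            Entry::Vacant(e) => {
                e.insert(Box::new(BKNode::new(hash, id)));
            }
        }
    }

    fn query(&self, target: PHash, threshold: u32, out: &mut Vec<u64>) {
        let d = distance(target, self.hash);
        if d <= threshold {
            out.extend_from_slice(&self.assertion_ids);
        }
        // Triangle inequality: matches can only live under edges whose
        // label lies in [d - threshold, d + threshold].
        let lo = d.saturating_sub(threshold);
        let hi = d.saturating_add(threshold);
        for (&edge, child) in &self.children {
            if edge >= lo && edge <= hi {
                child.query(target, threshold, out);
            }
        }
    }
}

#[derive(Default)]
pub struct BKTree {
    root: Option<Box<BKNode>>,
}

impl BKTree {
    pub fn insert(&mut self, hash: PHash, id: u64) {
        match &mut self.root {
            Some(root) => root.insert(hash, id),
            None => self.root = Some(Box::new(BKNode::new(hash, id))),
        }
    }

    /// Ids of all assertions within `threshold` bits of `target` (unordered).
    pub fn query(&self, target: PHash, threshold: u32) -> Vec<u64> {
        let mut out = Vec::new();
        if let Some(root) = &self.root {
            root.query(target, threshold, &mut out);
        }
        out
    }
}
```

The pruning step is where the savings come from: with a small threshold most child edges fall outside the allowed window and whole subtrees are skipped, while a threshold near 64 admits every edge and the lookup degenerates into a full traversal.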

### Phase 3b: VP-Tree Index

**Trigger:** Need range queries with variable thresholds.

**Approach:** Vantage-point tree for metric space search.

**Benefits:**

- Better for continuous thresholds
- Can support "find K nearest" queries

### Phase 3c: Locality-Sensitive Hashing (LSH)

**Trigger:** Need sub-linear time for very large datasets (100M+ assertions).

**Approach:** Hash pHashes into buckets where similar hashes collide.

**Benefits:**

- Near-constant bucket lookup on average; total query cost also depends on how many candidates must be verified
- Scales to billions of assertions

**Tradeoffs:**

- Probabilistic (may miss some matches)
- Requires tuning hash functions
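
One simple way to realize "similar hashes collide" for hamming distance is banding: split each 64-bit hash into bands and bucket by the exact value of each band. The sketch below assumes 8 bands of 8 bits, which is an illustrative choice rather than a tuned one, and `BandIndex` is a hypothetical name. By the pigeonhole principle, two hashes that differ in at most 7 bits must agree on at least one whole band, so such pairs are always found; pairs that differ in 8 or more bits can be missed, which is the probabilistic tradeoff noted above.

```rust
use std::collections::HashMap;

const BANDS: usize = 8; // assumption: 8 bands of 8 bits each

/// Hypothetical banding index over 64-bit pHashes.
pub struct BandIndex {
    buckets: Vec<HashMap<u8, Vec<usize>>>, // one bucket map per band
    hashes: Vec<u64>,                      // slot -> stored hash
}

impl BandIndex {
    pub fn new() -> Self {
        BandIndex {
            buckets: (0..BANDS).map(|_| HashMap::new()).collect(),
            hashes: Vec::new(),
        }
    }

    /// Store a hash and return the slot it was assigned.
    pub fn insert(&mut self, hash: u64) -> usize {
        let slot = self.hashes.len();
        self.hashes.push(hash);
        let bytes = hash.to_be_bytes();
        for (band, byte) in bytes.iter().enumerate() {
            self.buckets[band].entry(*byte).or_default().push(slot);
        }
        slot
    }

    /// Slots whose hash shares at least one band with `target` and lies
    /// within `threshold` bits of it, in ascending slot order.
    pub fn query(&self, target: u64, threshold: u32) -> Vec<usize> {
        let mut candidates: Vec<usize> = Vec::new();
        let bytes = target.to_be_bytes();
        for (band, byte) in bytes.iter().enumerate() {
            if let Some(slots) = self.buckets[band].get(byte) {
                candidates.extend_from_slice(slots);
            }
        }
        candidates.sort_unstable();
        candidates.dedup();
        // Verify candidates with the exact distance to drop false positives.
        candidates.retain(|&slot| (self.hashes[slot] ^ target).count_ones() <= threshold);
        candidates
    }
}
```

With the default threshold of 8, a pair differing by exactly one bit in every byte would share no band and would not be returned, so a production design would need more bands, overlapping bands, or a fallback scan if full recall matters.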

### Phase 3d: Visual Clustering Dashboard

**Trigger:** Users want "show me distinct visual fingerprints."

**Approach:** Background job that clusters visual_hashes and maintains cluster centroids.

```
GET /v1/visual-clusters?subject=TechCorp
-> Returns:
{
  "clusters": [
    { "centroid": "a3f2b1c4d5e6f708", "count": 47, "representative_hash": "..." },
    { "centroid": "b4e3c2d5a6f70819", "count": 12, "representative_hash": "..." }
  ]
}
```
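
The clustering algorithm is not specified here. As one possible starting point, the sketch below uses greedy "leader" clustering: each hash joins the first existing cluster whose representative is within the threshold, otherwise it starts a new cluster. This is an assumption, not the planned design; it keeps the first-seen hash as the representative and does not compute a centroid, and `Cluster` and `leader_cluster` are hypothetical names.

```rust
/// Hypothetical cluster summary: a representative hash and a member count.
#[derive(Debug, PartialEq, Eq)]
pub struct Cluster {
    pub representative_hash: u64,
    pub count: usize,
}

/// Greedy leader clustering over 64-bit pHashes packed into u64 values.
/// Results depend on input order, and the cost is O(N * clusters).
pub fn leader_cluster(hashes: &[u64], threshold: u32) -> Vec<Cluster> {
    let mut clusters: Vec<Cluster> = Vec::new();
    for &hash in hashes {
        let existing = clusters
            .iter()
            .position(|c| (c.representative_hash ^ hash).count_ones() <= threshold);
        match existing {
            Some(i) => clusters[i].count += 1,
            None => clusters.push(Cluster { representative_hash: hash, count: 1 }),
        }
    }
    clusters
}
```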

### Phase 3e: Pre-Assertion Dedup Check

**Trigger:** Ingestion workers want to avoid duplicates.

**Approach:** Endpoint to check visual similarity before creating assertion.

```
GET /v1/assertions/visual-check?visual_hash=a3f2b1c4d5e6f708&threshold=8
-> Returns existing assertion hashes if any match, empty otherwise
```

---

## Non-Goals (Explicit Scope Limits)

1. **Not making visual_hash mandatory.** It remains optional. Assertions without visual_hash simply don't match visual queries.

2. **Not building image processing.** Episteme doesn't compute pHashes. Clients compute them before asserting.

3. **Not building a visual search UI.** This is query infrastructure. Apps build UIs on top.

4. **Not optimizing for exact match.** Use `source_hash` for exact binary matching. `visual_hash` is for perceptual similarity.

---

## Implementation Checklist

### Core (`stemedb-query`)

- [ ] Add `visual_near: Option<String>` to `Query` struct
- [ ] Add `visual_threshold: Option<u32>` to `Query` struct (default: 8)
- [ ] Implement `hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32`
- [ ] Implement hex parsing for pHash strings
- [ ] Add visual matching to `Query::matches()`
- [ ] Add `.visual_near(hash, threshold)` to `QueryBuilder`

### API (`stemedb-api`)

- [ ] Add `visual_near` query parameter to `QueryParams` DTO
- [ ] Add `visual_threshold` query parameter to `QueryParams` DTO
- [ ] Update OpenAPI spec with new parameters

### SDK (`sdk/go/steme`)

- [ ] Add `VisualNear(hash, threshold)` to `QueryBuilder`
- [ ] Add integration test for visual queries

### Tests

- [ ] `test_visual_near_exact_match`
- [ ] `test_visual_near_within_threshold`
- [ ] `test_visual_near_exceeds_threshold`
- [ ] `test_visual_near_skips_assertions_without_hash`
- [ ] `test_hamming_distance_zero`
- [ ] `test_hamming_distance_max`

### Documentation

- [ ] Add threshold guidance to SDK docs
- [ ] Add domain-specific recommendations to use-case docs
- [ ] Update `ai-lookup/services/query.md` with visual query info

---

## Stakeholder Sign-Off

Based on empathy interviews:

| Persona | Concern | How MVP Addresses |
|---------|---------|-------------------|
| Research Agent | "Let me query by visual similarity" | Core feature |
| Human Supervisor | "O(N) is fine if Phase 3 is planned" | Future directions documented |
| Lead Orchestrator | "Don't mandate visual_hash" | Remains optional |
| On-Call SRE | "Keep API simple" | Two optional query params (`visual_near`, `visual_threshold`), sensible default threshold |
| M&A Analyst | "Threshold 8-10 for screenshots" | Default 8, guidance for 10 |
| Consumer Health | "False positives are dangerous" | Domain guidance: threshold 4-5 |

---

## Open Questions

1. **Should `QueryAudit` record visual query parameters?** When auditing "what did the agent query?", do we capture `visual_near` and `visual_threshold`?

2. **Should visual queries combine with other filters?** Current spec: yes (AND with subject, predicate, etc.). Is this correct?

3. **What fraction of assertions actually carry a visual_hash?** If <1% of assertions have visual_hash, should we warn users that visual queries will skip the rest?

---

## References

- [pHash paper](http://www.phash.org/docs/pubs/thesis_zauner.pdf) - Perceptual hashing algorithms
- [BK-tree](https://en.wikipedia.org/wiki/BK-tree) - Metric tree for discrete distances
- [VP-tree](https://en.wikipedia.org/wiki/Vantage-point_tree) - Metric tree for general metrics
- [LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) - Probabilistic nearest-neighbor