stemedb/docs/specs/visual-hash-query.md

# Visual Hash Query Support

> **Status:** Draft
> **Phase:** 2.4 (MVP)
> **Pillars:** First-Class Contradiction (Visual Domain), Visual Anchoring
> **Primary Use Cases:** Financial Due Diligence, GLP-1 Living Review, Consumer Health Intelligence

---

## Problem Statement

The `visual_hash: Option<PHash>` field exists on `Assertion` and is stored/returned by the API, but there is no way to query by visual similarity. The field is write-only from a query perspective.

This creates a gap between what we promise and what we deliver:

| Promise (from use-cases) | Reality |
|--------------------------|---------|
| "Query by visual similarity" | No query parameter exists |
| "Find all assertions anchored to visually similar images" | Must scan entire dataset manually |
| "Catches duplicate evidence, fake variations, source reuse" | Cannot detect duplicates |

### The Pain Points

**Research Agent:** "I extract the same clinical trial table 10 times. Each extraction drifts slightly. I stored visual_hash hoping to detect duplicates. I can't query by it."

**Human Supervisor (3am incident):** "Agent made a bad decision based on an assertion with visual_hash. I need to find all assertions from that same screenshot. I can't."

**M&A Analyst:** "I have two screenshots of the same org chart from different sources. I want to prove they corroborate each other. I can't query by visual similarity."

---

## MVP Scope (Phase 2.4)

### What We're Building

A brute-force visual similarity query capability using hamming distance on perceptual hashes.

### API Changes

#### Query Struct (`stemedb-query`)

```rust
pub struct Query {
    // ... existing fields ...

    /// Filter by visual similarity to a reference pHash.
    /// Hex-encoded 8-byte perceptual hash (16 hex chars).
    pub visual_near: Option<String>,

    /// Maximum hamming distance for visual_near matching.
    /// Default: 8 bits (sweet spot for pHash similarity).
    /// Range: 0-64 (0 = exact match, 64 = match any assertion that has a visual_hash).
    pub visual_threshold: Option<u32>,
}
```

#### QueryBuilder (`stemedb-query`)

```rust
impl QueryBuilder {
    /// Filter by visual similarity to a reference pHash.
    ///
    /// # Arguments
    /// * `hash` - Hex-encoded pHash (16 chars)
    /// * `threshold` - Max hamming distance (0-64; 8 is the recommended default)
    ///
    /// # Example
    /// ```
    /// let query = QueryBuilder::new()
    ///     .subject("BioStart_CEO")
    ///     .predicate("reports_to")
    ///     .visual_near("a3f2b1c4d5e6f708", 8)
    ///     .build();
    /// ```
    pub fn visual_near(mut self, hash: &str, threshold: u32) -> Self {
        self.visual_near = Some(hash.to_string());
        self.visual_threshold = Some(threshold);
        self
    }
}
```

#### HTTP API (`stemedb-api`)

```
GET /v1/query?visual_near=a3f2b1c4d5e6f708&visual_threshold=8

Query Parameters:
  visual_near      - Hex-encoded pHash (16 characters, optional)
  visual_threshold - Max hamming distance (0-64, default: 8, optional)
```
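The parameter contract above (threshold in 0-64, default 8) could be enforced at the API boundary before a query is built. The following is a minimal sketch under that assumption; `resolve_threshold` and the two constants are hypothetical names, not existing `stemedb-api` items.

```rust
/// Default and maximum hamming-distance thresholds, taken from the
/// parameter description above.
const DEFAULT_VISUAL_THRESHOLD: u32 = 8;
const MAX_VISUAL_THRESHOLD: u32 = 64;

/// Hypothetical helper: apply the default when the parameter is absent
/// and reject values outside the documented 0-64 range.
pub fn resolve_threshold(raw: Option<u32>) -> Result<u32, String> {
    match raw {
        None => Ok(DEFAULT_VISUAL_THRESHOLD),
        Some(t) if t <= MAX_VISUAL_THRESHOLD => Ok(t),
        Some(t) => Err(format!(
            "visual_threshold must be in 0-{}, got {}",
            MAX_VISUAL_THRESHOLD, t
        )),
    }
}
```

Rejecting out-of-range values early would let the handler return a 4xx response instead of silently matching every hashed assertion.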

### Core Algorithm

```rust
/// Compute hamming distance between two perceptual hashes.
///
/// Hamming distance = number of differing bits.
/// For 8-byte pHash, range is 0-64.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```
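Because the hash is exactly 8 bytes, the same distance can also be computed with one 64-bit XOR and one popcount. This is a sketch of an equivalent variant, not a proposed replacement; `hamming_distance_u64` is a hypothetical name.

```rust
/// Hamming distance via a single 64-bit XOR and popcount.
/// The byte order used for the conversion does not affect the result,
/// as long as both inputs are converted the same way.
pub fn hamming_distance_u64(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    (u64::from_be_bytes(*a) ^ u64::from_be_bytes(*b)).count_ones()
}
```

For example, `a3f2b1c4d5e6f708` and `a3f2b1c4d5e6f70f` differ only in the last byte (`0x08 ^ 0x0F = 0x07`), giving a distance of 3.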

### Matching Logic

In `Query::matches()`:

```rust
// If visual_near is specified, check visual similarity
if let Some(ref target_hash) = self.visual_near {
    match &assertion.visual_hash {
        Some(assertion_hash) => {
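            // `matches()` returns bool, so a parse error cannot be propagated
            // with `?`. Ideally the hash is validated once when the query is built.
            let target = match parse_hex_phash(target_hash) {
                Ok(hash) => hash,
                Err(_) => return false, // Malformed hash: treat as no match
            };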
            let distance = hamming_distance(&target, assertion_hash);
            let threshold = self.visual_threshold.unwrap_or(8);
            if distance > threshold {
                return false; // Too dissimilar
            }
        }
        None => return false, // No visual_hash, can't match
    }
}
```
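The matching logic relies on a `parse_hex_phash` helper that this spec does not define. One possible shape is sketched below, assuming it returns a `Result` with a string error; the signature and error type are assumptions made for illustration.

```rust
/// Hypothetical implementation of `parse_hex_phash`: decode exactly
/// 16 hex characters (upper or lower case) into an 8-byte hash.
pub fn parse_hex_phash(s: &str) -> Result<[u8; 8], String> {
    if s.len() != 16 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected 16 hex characters, got {:?}", s));
    }
    let mut out = [0u8; 8];
    for (i, byte) in out.iter_mut().enumerate() {
        // Slicing by byte offset is safe: the input is verified ASCII.
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16)
            .map_err(|e| e.to_string())?;
    }
    Ok(out)
}
```

Parsing once when the query is constructed, rather than once per assertion inside `matches()`, would keep the per-assertion cost to the distance computation alone.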

### Go SDK Changes

```go
// QueryBuilder addition
func (b *QueryBuilder) VisualNear(hash string, threshold uint32) *QueryBuilder {
    b.visualNear = hash
    b.visualThreshold = threshold
    return b
}

// Usage
query := steme.NewQuery().
    Subject("BioStart_CEO").
    Predicate("reports_to").
    VisualNear("a3f2b1c4d5e6f708", 8).
    Build()
```

---

## Threshold Guidance

Hamming distance thresholds for 64-bit pHash:

| Threshold | Meaning | Use Case |
|-----------|---------|----------|
| 0 | Exact match | Deduplication, identical images |
| 1-4 | Very similar | Safety-critical (medications, financial) |
| 5-8 | Similar | General visual matching (default) |
| 9-12 | Loosely similar | Catch resized/cropped variants |
| 13+ | Weak similarity | Exploratory search |

### Domain-Specific Recommendations

| Domain | Recommended Threshold | Rationale |
|--------|----------------------|-----------|
| Consumer Health | 4-5 | False positives for medications are dangerous |
| Financial Due Diligence | 8-10 | Screenshots get cropped/resized |
| Pharma/Clinical | 6-8 | Balance accuracy vs OCR variations |
| General | 8 | Common starting point for 64-bit pHash; tune per dataset |

---

## Test Cases

### Unit Tests (`stemedb-query`)

```rust
#[test]
fn test_visual_near_exact_match() {
    // Same hash, threshold 0 -> matches
    let hash = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let assertion = make_assertion_with_visual_hash(hash);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 0)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_within_threshold() {
    // Hashes differ by 3 bits, threshold 5 -> matches
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x0F]; // 3 bits differ
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_exceeds_threshold() {
    // Hashes differ by 9 bits, threshold 5 -> no match
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0x00, 0x00, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08]; // 9 bits differ (0xA3: 4, 0xF2: 5)
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_visual_near_skips_assertions_without_hash() {
    // Assertion has no visual_hash -> not matched
    let assertion = make_assertion_without_visual_hash();
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 8)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_hamming_distance_zero() {
    let hash = [0xFF; 8];
    assert_eq!(hamming_distance(&hash, &hash), 0);
}

#[test]
fn test_hamming_distance_max() {
    let a = [0x00; 8];
    let b = [0xFF; 8];
    assert_eq!(hamming_distance(&a, &b), 64);
}
```

### Integration Tests (`stemedb-api`)

```rust
#[tokio::test]
async fn test_visual_query_api() {
    let app = create_test_app().await;

    // Insert assertion with visual_hash
    let resp = app.post("/v1/assertions")
        .json(&json!({
            "subject": "STEP-1_Trial",
            "predicate": "primary_endpoint",
            "object": {"Number": 14.9},
            "visual_hash": "a3f2b1c4d5e6f708"
        }))
        .send().await;
    assert_eq!(resp.status(), 201);

    // Query by visual similarity
    let resp = app.get("/v1/query")
        .query(&[
            ("visual_near", "a3f2b1c4d5e6f700"), // 1 bit different (0x08 vs 0x00)
            ("visual_threshold", "8")
        ])
        .send().await;
    assert_eq!(resp.status(), 200);

    let results: Vec<Assertion> = resp.json().await;
    assert_eq!(results.len(), 1);
}
```

---

## Performance Characteristics

### MVP (Brute Force)

| Operation | Complexity | Notes |
|-----------|------------|-------|
| Visual query | O(N) | Scans all assertions |
| Hamming distance | O(1) | 8 XORs + 8 popcounts |
| Per-assertion overhead | ~10 ns | Negligible |

**Expected latency:**
- 10K assertions: ~100μs
- 100K assertions: ~1ms
- 1M assertions: ~10ms
- 10M assertions: ~100ms (needs index)

### When to Worry

If visual queries consistently exceed 100ms, it's time for Phase 3 indexing.
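The O(N) behaviour described in this section amounts to one filtered pass over the stored hashes. The sketch below illustrates that pass over an in-memory slice; `Record` and `scan_visual_near` are hypothetical stand-ins and say nothing about how assertions are actually stored or iterated.

```rust
/// Hypothetical in-memory record: an identifier plus an optional pHash.
pub struct Record {
    pub id: u64,
    pub visual_hash: Option<[u8; 8]>,
}

fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Linear scan: one distance computation per record, O(N) overall.
/// Records without a visual hash never match.
pub fn scan_visual_near(records: &[Record], target: &[u8; 8], threshold: u32) -> Vec<u64> {
    records
        .iter()
        .filter(|r| {
            r.visual_hash
                .as_ref()
                .map_or(false, |h| hamming_distance(target, h) <= threshold)
        })
        .map(|r| r.id)
        .collect()
}
```

Timing a function like this against a realistic dataset is one way to check the latency estimates above before committing to an index.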

---

## Future Directions (Based on Performance)

### Phase 3a: BK-Tree Index

**Trigger:** Visual queries exceed 50ms at p99.

**Approach:** Build a BK-tree (Burkhard-Keller tree) optimized for hamming distance.

```rust
use std::collections::HashMap;

struct BKTree {
    root: Option<Box<BKNode>>,
}

struct BKNode {
    hash: PHash,
    assertion_id: Hash,
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}
```

**Benefits:**
- Roughly O(log N) lookups for small thresholds in practice; cost grows toward O(N) as the threshold increases
- Well-suited for discrete metric spaces (hamming distance)

**Storage:** `VIX:{hash_prefix}` keys in sled for persistence.
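To make the BK-tree approach concrete, here is a minimal in-memory sketch of insertion and bounded-distance search. It makes several assumptions: `PHash` is `[u8; 8]`, a `u64` stands in for the assertion identifier, and each node keeps a list of ids so that identical hashes do not deepen the tree. Persistence under `VIX:` keys is not modelled.

```rust
use std::collections::HashMap;

type PHash = [u8; 8];

fn hamming_distance(a: &PHash, b: &PHash) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

struct BKNode {
    hash: PHash,
    assertion_ids: Vec<u64>,             // all assertions sharing this exact hash
    children: HashMap<u32, Box<BKNode>>, // distance to this node -> subtree
}

impl BKNode {
    fn new(hash: PHash, id: u64) -> Self {
        BKNode { hash, assertion_ids: vec![id], children: HashMap::new() }
    }

    fn insert(&mut self, hash: PHash, id: u64) {
        let d = hamming_distance(&self.hash, &hash);
        if d == 0 {
            self.assertion_ids.push(id);
            return;
        }
        match self.children.get_mut(&d) {
            Some(child) => child.insert(hash, id),
            None => {
                self.children.insert(d, Box::new(BKNode::new(hash, id)));
            }
        }
    }

    fn search(&self, target: &PHash, threshold: u32, out: &mut Vec<u64>) {
        let d = hamming_distance(&self.hash, target);
        if d <= threshold {
            out.extend_from_slice(&self.assertion_ids);
        }
        // Triangle inequality: matches can only sit in subtrees whose edge
        // distance lies within [d - threshold, d + threshold].
        let lo = d.saturating_sub(threshold);
        let hi = d.saturating_add(threshold);
        for (edge, child) in &self.children {
            if (lo..=hi).contains(edge) {
                child.search(target, threshold, out);
            }
        }
    }
}

#[derive(Default)]
pub struct BKTree {
    root: Option<Box<BKNode>>,
}

impl BKTree {
    pub fn insert(&mut self, hash: PHash, id: u64) {
        match self.root.as_mut() {
            Some(root) => root.insert(hash, id),
            None => self.root = Some(Box::new(BKNode::new(hash, id))),
        }
    }

    /// Returns matching ids in unspecified order.
    pub fn query(&self, target: &PHash, threshold: u32) -> Vec<u64> {
        let mut out = Vec::new();
        if let Some(root) = &self.root {
            root.search(target, threshold, &mut out);
        }
        out
    }
}
```

The pruning window widens with the threshold, so a threshold of 64 visits every node and degenerates to the brute-force scan.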

### Phase 3b: VP-Tree Index

**Trigger:** Need range queries with variable thresholds.

**Approach:** Vantage-point tree for metric space search.

**Benefits:**
- Better for continuous thresholds
- Can support "find K nearest" queries

### Phase 3c: Locality-Sensitive Hashing (LSH)

**Trigger:** Need sub-linear time for very large datasets (100M+ assertions).

**Approach:** Hash pHashes into buckets where similar hashes collide.

**Benefits:**
- Sub-linear average-case lookup (a few bucket probes plus verification of the candidates)
- Scales to billions of assertions

**Tradeoffs:**
- Probabilistic (may miss some matches)
- Requires tuning hash functions
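One simple way to realise the bucketing idea is banding: split each 64-bit hash into fixed-width bands and index every band separately, so that hashes sharing any band become candidates. The sketch below uses 4 bands of 16 bits, which is an assumed tuning choice, and all type and function names are hypothetical. It is closer to multi-index hashing than to a randomized LSH family, and it shows the tradeoff listed above: matches within 3 bits are always found (at least one band must be identical), while more distant matches can be missed.

```rust
use std::collections::HashMap;

type PHash = [u8; 8];

const BANDS: usize = 4; // 4 bands x 16 bits (assumed tuning choice)

fn hamming_distance(a: &PHash, b: &PHash) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Split a hash into (band index, 16-bit band value) bucket keys.
fn band_keys(hash: &PHash) -> [(usize, u16); BANDS] {
    let mut keys = [(0usize, 0u16); BANDS];
    for (i, key) in keys.iter_mut().enumerate() {
        *key = (i, u16::from_be_bytes([hash[2 * i], hash[2 * i + 1]]));
    }
    keys
}

#[derive(Default)]
pub struct BandIndex {
    buckets: HashMap<(usize, u16), Vec<usize>>, // bucket key -> positions in `entries`
    entries: Vec<(PHash, u64)>,                 // (hash, assertion id)
}

impl BandIndex {
    pub fn insert(&mut self, hash: PHash, assertion_id: u64) {
        let pos = self.entries.len();
        self.entries.push((hash, assertion_id));
        for key in band_keys(&hash) {
            self.buckets.entry(key).or_default().push(pos);
        }
    }

    /// Collect candidates that share at least one band with the target,
    /// then verify each candidate with the exact hamming distance.
    pub fn query(&self, target: &PHash, threshold: u32) -> Vec<u64> {
        let mut candidates: Vec<usize> = band_keys(target)
            .iter()
            .filter_map(|key| self.buckets.get(key))
            .flatten()
            .copied()
            .collect();
        candidates.sort_unstable();
        candidates.dedup();
        candidates
            .into_iter()
            .filter(|&pos| hamming_distance(target, &self.entries[pos].0) <= threshold)
            .map(|pos| self.entries[pos].1)
            .collect()
    }
}
```

With the default threshold of 8, a hash that differs by one bit in each of the four bands (distance 4) shares no bucket with the target and is not returned, which is the kind of miss the tradeoffs above refer to.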

### Phase 3d: Visual Clustering Dashboard

**Trigger:** Users want "show me distinct visual fingerprints."

**Approach:** Background job that clusters visual_hashes and maintains cluster centroids.

```
GET /v1/visual-clusters?subject=TechCorp
-> Returns:
{
  "clusters": [
    { "centroid": "a3f2b1c4d5e6f708", "count": 47, "representative_hash": "..." },
    { "centroid": "b4e3c2d5a6f70819", "count": 12, "representative_hash": "..." }
  ]
}
```
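As a rough illustration of what the background job might compute, the sketch below uses greedy leader clustering: each hash joins the first cluster whose leader is within the threshold, otherwise it starts a new cluster. This is a simplification chosen for brevity. The "centroid" here is just the first hash seen in the cluster, results depend on input order, and `Cluster` and `cluster_hashes` are hypothetical names.

```rust
type PHash = [u8; 8];

fn hamming_distance(a: &PHash, b: &PHash) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

pub struct Cluster {
    pub centroid: PHash, // leader: the first hash assigned to this cluster
    pub count: usize,
}

/// Greedy leader clustering over a list of hashes.
pub fn cluster_hashes(hashes: &[PHash], threshold: u32) -> Vec<Cluster> {
    let mut clusters: Vec<Cluster> = Vec::new();
    for hash in hashes {
        let existing = clusters
            .iter()
            .position(|c| hamming_distance(&c.centroid, hash) <= threshold);
        match existing {
            Some(i) => clusters[i].count += 1,
            None => clusters.push(Cluster { centroid: *hash, count: 1 }),
        }
    }
    clusters
}
```

Each hash is compared against every existing leader, so the cost is O(N x clusters); that is acceptable for a background job but not for the query path.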

### Phase 3e: Pre-Assertion Dedup Check

**Trigger:** Ingestion workers want to avoid duplicates.

**Approach:** Endpoint to check visual similarity before creating assertion.

```
GET /v1/assertions/visual-check?visual_hash=a3f2b1c4d5e6f708&threshold=8
-> Returns existing assertion hashes if any match, empty otherwise
```

---

## Non-Goals (Explicit Scope Limits)

1. **Not making visual_hash mandatory.** It remains optional. Assertions without visual_hash simply don't match visual queries.

2. **Not building image processing.** Episteme doesn't compute pHashes. Clients compute them before asserting.

3. **Not building a visual search UI.** This is query infrastructure. Apps build UIs on top.

4. **Not optimizing for exact match.** Use `source_hash` for exact binary matching. `visual_hash` is for perceptual similarity.

---

## Implementation Checklist

### Core (`stemedb-query`)

- [ ] Add `visual_near: Option<String>` to `Query` struct
- [ ] Add `visual_threshold: Option<u32>` to `Query` struct (default: 8)
- [ ] Implement `hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32`
- [ ] Implement hex parsing for pHash strings
- [ ] Add visual matching to `Query::matches()`
- [ ] Add `.visual_near(hash, threshold)` to `QueryBuilder`

### API (`stemedb-api`)

- [ ] Add `visual_near` query parameter to `QueryParams` DTO
- [ ] Add `visual_threshold` query parameter to `QueryParams` DTO
- [ ] Update OpenAPI spec with new parameters

### SDK (`sdk/go/steme`)

- [ ] Add `VisualNear(hash, threshold)` to `QueryBuilder`
- [ ] Add integration test for visual queries

### Tests

- [ ] `test_visual_near_exact_match`
- [ ] `test_visual_near_within_threshold`
- [ ] `test_visual_near_exceeds_threshold`
- [ ] `test_visual_near_skips_assertions_without_hash`
- [ ] `test_hamming_distance_zero`
- [ ] `test_hamming_distance_max`

### Documentation

- [ ] Add threshold guidance to SDK docs
- [ ] Add domain-specific recommendations to use-case docs
- [ ] Update `ai-lookup/services/query.md` with visual query info

---

## Stakeholder Sign-Off

Based on empathy interviews:

| Persona | Concern | How MVP Addresses |
|---------|---------|-------------------|
| Research Agent | "Let me query by visual similarity" | Core feature |
| Human Supervisor | "O(N) is fine if Phase 3 is planned" | Future directions documented |
| Lead Orchestrator | "Don't mandate visual_hash" | Remains optional |
| On-Call SRE | "Keep API simple" | Two optional query params; threshold has a sensible default |
| M&A Analyst | "Threshold 8-10 for screenshots" | Default 8, guidance for 10 |
| Consumer Health | "False positives are dangerous" | Domain guidance: threshold 4-5 |

---

## Open Questions

1. **Should we add `visual_hash` to `QueryAudit`?** When auditing "what did the agent query?", do we capture visual_near parameters?

2. **Should visual queries combine with other filters?** Current spec: yes (AND with subject, predicate, etc.). Is this correct?

3. **What's the actual distribution of visual_hash population?** If <1% of assertions have visual_hash, should we warn users?

---

## References

- [pHash paper](http://www.phash.org/docs/pubs/thesis_zauner.pdf) - Perceptual hashing algorithms
- [BK-tree](https://en.wikipedia.org/wiki/BK-tree) - Metric tree for discrete distances
- [VP-tree](https://en.wikipedia.org/wiki/Vantage-point_tree) - Metric tree for general metrics
- [LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) - Probabilistic nearest-neighbor