# Visual Hash Query Support

> **Status:** Draft
> **Phase:** 2.4 (MVP)
> **Pillars:** First-Class Contradiction (Visual Domain), Visual Anchoring
> **Primary Use Cases:** Financial Due Diligence, GLP-1 Living Review, Consumer Health Intelligence

---

## Problem Statement

The `visual_hash: Option<[u8; 8]>` field exists on `Assertion` and is stored/returned by the API, but there is no way to query by visual similarity. The field is write-only from a query perspective.

This creates a gap between what we promise and what we deliver:

| Promise (from use-cases) | Reality |
|--------------------------|---------|
| "Query by visual similarity" | No query parameter exists |
| "Find all assertions anchored to visually similar images" | Must scan entire dataset manually |
| "Catches duplicate evidence, fake variations, source reuse" | Cannot detect duplicates |

### The Pain Points

**Research Agent:** "I extract the same clinical trial table 10 times. Each extraction drifts slightly. I stored visual_hash hoping to detect duplicates. I can't query by it."

**Human Supervisor (3am incident):** "Agent made a bad decision based on an assertion with visual_hash. I need to find all assertions from that same screenshot. I can't."

**M&A Analyst:** "I have two screenshots of the same org chart from different sources. I want to prove they corroborate each other. I can't query by visual similarity."

---

## MVP Scope (Phase 2.4)

### What We're Building

A brute-force visual similarity query capability using hamming distance on perceptual hashes.

### API Changes

#### Query Struct (`stemedb-query`)

```rust
pub struct Query {
    // ... existing fields ...

    /// Filter by visual similarity to a reference pHash.
    /// Hex-encoded 8-byte perceptual hash (16 hex chars).
    pub visual_near: Option<String>,

    /// Maximum hamming distance for visual_near matching.
    /// Default: 8 bits (sweet spot for pHash similarity).
    /// Range: 0-64 (0 = exact match, 64 = match anything).
    pub visual_threshold: Option<u32>,
}
```

#### QueryBuilder (`stemedb-query`)

```rust
impl QueryBuilder {
    /// Filter by visual similarity to a reference pHash.
    ///
    /// # Arguments
    /// * `hash` - Hex-encoded pHash (16 chars)
    /// * `threshold` - Max hamming distance (default: 8)
    ///
    /// # Example
    /// ```
    /// let query = QueryBuilder::new()
    ///     .subject("BioStart_CEO")
    ///     .predicate("reports_to")
    ///     .visual_near("a3f2b1c4d5e6f708", 8)
    ///     .build();
    /// ```
    pub fn visual_near(mut self, hash: &str, threshold: u32) -> Self {
        self.visual_near = Some(hash.to_string());
        self.visual_threshold = Some(threshold);
        self
    }
}
```

#### HTTP API (`stemedb-api`)

```
GET /v1/query?visual_near=a3f2b1c4d5e6f708&visual_threshold=8

Query Parameters:
  visual_near      - Hex-encoded pHash (16 characters, optional)
  visual_threshold - Max hamming distance (0-64, default: 8, optional)
```

### Core Algorithm

```rust
/// Compute hamming distance between two perceptual hashes.
///
/// Hamming distance = number of differing bits.
/// For 8-byte pHash, range is 0-64.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```

### Matching Logic

In `Query::matches()`:

```rust
// If visual_near is specified, check visual similarity
if let Some(ref target_hash) = self.visual_near {
    match &assertion.visual_hash {
        Some(assertion_hash) => {
            // `matches()` returns bool, so `?` cannot be used here. An
            // unparseable reference hash is treated as "no match"; this
            // assumes `parse_hex_phash` returns a `Result`.
            let Ok(target) = parse_hex_phash(target_hash) else {
                return false;
            };
            let distance = hamming_distance(&target, assertion_hash);
            let threshold = self.visual_threshold.unwrap_or(8);
            if distance > threshold {
                return false; // Too dissimilar
            }
        }
        None => return false, // No visual_hash, can't match
    }
}
```

### Go SDK Changes

```go
// QueryBuilder addition
func (b *QueryBuilder) VisualNear(hash string, threshold uint32) *QueryBuilder {
	b.visualNear = hash
	b.visualThreshold = threshold
	return b
}

// Usage
query := steme.NewQuery().
	Subject("BioStart_CEO").
	Predicate("reports_to").
	VisualNear("a3f2b1c4d5e6f708", 8).
	Build()
```

---

## Threshold Guidance

Hamming distance thresholds for 64-bit pHash:

| Threshold | Meaning | Use Case |
|-----------|---------|----------|
| 0 | Exact match | Deduplication, identical images |
| 1-4 | Very similar | Safety-critical (medications, financial) |
| 5-8 | Similar | General visual matching (default) |
| 9-12 | Loosely similar | Catch resized/cropped variants |
| 13+ | Weak similarity | Exploratory search |

### Domain-Specific Recommendations

| Domain | Recommended Threshold | Rationale |
|--------|----------------------|-----------|
| Consumer Health | 4-5 | False positives for medications are dangerous |
| Financial Due Diligence | 8-10 | Screenshots get cropped/resized |
| Pharma/Clinical | 6-8 | Balance accuracy vs OCR variations |
| General | 8 | pHash literature default |

---

## Test Cases

### Unit Tests (`stemedb-query`)

```rust
#[test]
fn test_visual_near_exact_match() {
    // Same hash, threshold 0 -> matches
    let hash = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let assertion = make_assertion_with_visual_hash(hash);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 0)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_within_threshold() {
    // Hashes differ by 3 bits, threshold 5 -> matches
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x0F]; // 3 bits differ
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_exceeds_threshold() {
    // Hashes differ by 9 bits, threshold 5 -> no match
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0x00, 0x00, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08]; // 9 bits differ (4 in 0xA3, 5 in 0xF2)
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_visual_near_skips_assertions_without_hash() {
    // Assertion has no visual_hash -> not matched
    let assertion = make_assertion_without_visual_hash();
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 8)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_hamming_distance_zero() {
    let hash = [0xFF; 8];
    assert_eq!(hamming_distance(&hash, &hash), 0);
}

#[test]
fn test_hamming_distance_max() {
    let a = [0x00; 8];
    let b = [0xFF; 8];
    assert_eq!(hamming_distance(&a, &b), 64);
}
```

### Integration Tests (`stemedb-api`)

```rust
#[tokio::test]
async fn test_visual_query_api() {
    let app = create_test_app().await;

    // Insert assertion with visual_hash
    let resp = app.post("/v1/assertions")
        .json(&json!({
            "subject": "STEP-1_Trial",
            "predicate": "primary_endpoint",
            "object": {"Number": 14.9},
            "visual_hash": "a3f2b1c4d5e6f708"
        }))
        .send().await;
    assert_eq!(resp.status(), 201);

    // Query by visual similarity
    let resp = app.get("/v1/query")
        .query(&[
            ("visual_near", "a3f2b1c4d5e6f700"), // 1 bit different (0x08 vs 0x00)
            ("visual_threshold", "8")
        ])
        .send().await;
    assert_eq!(resp.status(), 200);

    // The response element type is not specified in this draft;
    // `serde_json::Value` is a generic stand-in.
    let results: Vec<serde_json::Value> = resp.json().await;
    assert_eq!(results.len(), 1);
}
```

---

## Performance Characteristics

### MVP (Brute Force)

| Operation | Complexity | Notes |
|-----------|------------|-------|
| Visual query | O(N) | Scans all assertions |
| Hamming distance | O(1) | 8 XORs + 8 popcounts |
| Per-assertion overhead | ~10 ns | Negligible |

**Expected latency:**

- 10K assertions: ~100μs
- 100K assertions: ~1ms
- 1M assertions: ~10ms
- 10M assertions: ~100ms (needs index)

### When to Worry

If visual queries consistently exceed 100ms, it's time for Phase 3 indexing.

---

## Future Directions (Based on Performance)

### Phase 3a: BK-Tree Index

**Trigger:** Visual queries exceed 50ms at p99.

**Approach:** Build a BK-tree (Burkhard-Keller tree) optimized for hamming distance.
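To make the BK-tree approach concrete, here is a minimal in-memory sketch of insertion and threshold search. It is an illustration under assumptions, not the planned implementation: the `PHash` alias, the `u64` assertion ids, and the `ids` list per node are hypothetical stand-ins, and persistence is ignored entirely.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// 64-bit perceptual hash as 8 raw bytes (assumed alias).
type PHash = [u8; 8];

/// Number of differing bits between two hashes (0-64).
fn hamming_distance(a: &PHash, b: &PHash) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

struct BKNode {
    hash: PHash,
    /// Hypothetical u64 ids; several assertions may share one hash.
    ids: Vec<u64>,
    /// Edge label = distance between this node's hash and the child's hash.
    children: HashMap<u32, Box<BKNode>>,
}

impl BKNode {
    fn new(hash: PHash, id: u64) -> Self {
        BKNode { hash, ids: vec![id], children: HashMap::new() }
    }

    fn insert(&mut self, hash: PHash, id: u64) {
        let d = hamming_distance(&self.hash, &hash);
        if d == 0 {
            self.ids.push(id); // identical hash: attach to this node
            return;
        }
        match self.children.entry(d) {
            Entry::Occupied(mut slot) => slot.get_mut().insert(hash, id),
            Entry::Vacant(slot) => {
                slot.insert(Box::new(BKNode::new(hash, id)));
            }
        }
    }

    fn search(&self, target: &PHash, threshold: u32, out: &mut Vec<u64>) {
        let d = hamming_distance(&self.hash, target);
        if d <= threshold {
            out.extend_from_slice(&self.ids);
        }
        // Triangle inequality: matches can only live under edges labelled
        // within [d - threshold, d + threshold].
        let lo = d.saturating_sub(threshold);
        let hi = d + threshold;
        for (&edge, child) in &self.children {
            if edge >= lo && edge <= hi {
                child.search(target, threshold, out);
            }
        }
    }
}

#[derive(Default)]
struct BKTree {
    root: Option<Box<BKNode>>,
}

impl BKTree {
    fn insert(&mut self, hash: PHash, id: u64) {
        match self.root.as_mut() {
            Some(root) => root.insert(hash, id),
            None => self.root = Some(Box::new(BKNode::new(hash, id))),
        }
    }

    /// Ids of all stored hashes within `threshold` bits of `target`, sorted.
    fn query(&self, target: &PHash, threshold: u32) -> Vec<u64> {
        let mut out = Vec::new();
        if let Some(root) = &self.root {
            root.search(target, threshold, &mut out);
        }
        out.sort_unstable();
        out
    }
}
```

The pruning step is what saves work over the brute-force scan: with a small threshold, most child edges fall outside the allowed band and their subtrees are never visited. With a threshold near 64 the band covers every edge and the search degrades to a full traversal. The node layout proposed in this document follows.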
```rust
struct BKTree {
    root: Option<Box<BKNode>>,
}

struct BKNode {
    hash: PHash,
    assertion_id: Hash,
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}
```

**Benefits:**

- O(log N) average case for bounded threshold queries
- Well-suited for discrete metric spaces (hamming distance)

**Storage:** `VIX:{hash_prefix}` keys in sled for persistence.

### Phase 3b: VP-Tree Index

**Trigger:** Need range queries with variable thresholds.

**Approach:** Vantage-point tree for metric space search.

**Benefits:**

- Better for continuous thresholds
- Can support "find K nearest" queries

### Phase 3c: Locality-Sensitive Hashing (LSH)

**Trigger:** Need sub-linear time for very large datasets (100M+ assertions).

**Approach:** Hash pHashes into buckets where similar hashes collide.

**Benefits:**

- O(1) average case
- Scales to billions of assertions

**Tradeoffs:**

- Probabilistic (may miss some matches)
- Requires tuning hash functions

### Phase 3d: Visual Clustering Dashboard

**Trigger:** Users want "show me distinct visual fingerprints."

**Approach:** Background job that clusters visual_hashes and maintains cluster centroids.

```
GET /v1/visual-clusters?subject=TechCorp

-> Returns:
{
  "clusters": [
    { "centroid": "a3f2b1c4d5e6f708", "count": 47, "representative_hash": "..." },
    { "centroid": "b4e3c2d5a6f70819", "count": 12, "representative_hash": "..." }
  ]
}
```

### Phase 3e: Pre-Assertion Dedup Check

**Trigger:** Ingestion workers want to avoid duplicates.

**Approach:** Endpoint to check visual similarity before creating assertion.

```
GET /v1/assertions/visual-check?visual_hash=a3f2b1c4d5e6f708&threshold=8

-> Returns existing assertion hashes if any match, empty otherwise
```

---

## Non-Goals (Explicit Scope Limits)

1. **Not making visual_hash mandatory.** It remains optional. Assertions without visual_hash simply don't match visual queries.
2. **Not building image processing.** Episteme doesn't compute pHashes. Clients compute them before asserting.
3. **Not building a visual search UI.** This is query infrastructure. Apps build UIs on top.
4. **Not optimizing for exact match.** Use `source_hash` for exact binary matching. `visual_hash` is for perceptual similarity.

---

## Implementation Checklist

### Core (`stemedb-query`)

- [ ] Add `visual_near: Option<String>` to `Query` struct
- [ ] Add `visual_threshold: Option<u32>` to `Query` struct (default: 8)
- [ ] Implement `hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32`
- [ ] Implement hex parsing for pHash strings
- [ ] Add visual matching to `Query::matches()`
- [ ] Add `.visual_near(hash, threshold)` to `QueryBuilder`

### API (`stemedb-api`)

- [ ] Add `visual_near` query parameter to `QueryParams` DTO
- [ ] Add `visual_threshold` query parameter to `QueryParams` DTO
- [ ] Update OpenAPI spec with new parameters

### SDK (`sdk/go/steme`)

- [ ] Add `VisualNear(hash, threshold)` to `QueryBuilder`
- [ ] Add integration test for visual queries

### Tests

- [ ] `test_visual_near_exact_match`
- [ ] `test_visual_near_within_threshold`
- [ ] `test_visual_near_exceeds_threshold`
- [ ] `test_visual_near_skips_assertions_without_hash`
- [ ] `test_hamming_distance_zero`
- [ ] `test_hamming_distance_max`

### Documentation

- [ ] Add threshold guidance to SDK docs
- [ ] Add domain-specific recommendations to use-case docs
- [ ] Update `ai-lookup/services/query.md` with visual query info

---

## Stakeholder Sign-Off

Based on empathy interviews:

| Persona | Concern | How MVP Addresses |
|---------|---------|-------------------|
| Research Agent | "Let me query by visual similarity" | Core feature |
| Human Supervisor | "O(N) is fine if Phase 3 is planned" | Future directions documented |
| Lead Orchestrator | "Don't mandate visual_hash" | Remains optional |
| On-Call SRE | "Keep API simple" | Single query param, sensible default |
| M&A Analyst | "Threshold 8-10 for screenshots" | Default 8, guidance for 10 |
| Consumer Health | "False positives are dangerous" | Domain guidance: threshold 4-5 |
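The checklist item "Implement hex parsing for pHash strings" and the documented 0-64 threshold range are both left unspecified in this draft. The sketch below shows one way they might be handled at the API boundary, so that malformed input is rejected once instead of being re-checked per assertion. The names `VisualQueryError` and `resolve_threshold`, and the exact signature of `parse_hex_phash`, are assumptions made for illustration.

```rust
/// 64-bit perceptual hash as 8 raw bytes (assumed alias).
type PHash = [u8; 8];

const DEFAULT_VISUAL_THRESHOLD: u32 = 8;
const MAX_VISUAL_THRESHOLD: u32 = 64;

/// Hypothetical error type for invalid visual query parameters.
#[derive(Debug, PartialEq)]
enum VisualQueryError {
    /// Input was not exactly 16 bytes long; carries the actual length.
    BadLength(usize),
    /// Input contained a non-hex character.
    BadDigit(char),
    /// Threshold was above 64 bits.
    ThresholdOutOfRange(u32),
}

/// Convert one ASCII hex digit (either case) to its 4-bit value.
fn hex_digit(b: u8) -> Result<u8, VisualQueryError> {
    (b as char)
        .to_digit(16)
        .map(|d| d as u8)
        .ok_or(VisualQueryError::BadDigit(b as char))
}

/// Parse a 16-character hex string into an 8-byte pHash.
fn parse_hex_phash(s: &str) -> Result<PHash, VisualQueryError> {
    let bytes = s.as_bytes();
    if bytes.len() != 16 {
        return Err(VisualQueryError::BadLength(bytes.len()));
    }
    let mut out = [0u8; 8];
    for (i, pair) in bytes.chunks_exact(2).enumerate() {
        let hi = hex_digit(pair[0])?;
        let lo = hex_digit(pair[1])?;
        out[i] = (hi << 4) | lo;
    }
    Ok(out)
}

/// Apply the default of 8 and reject thresholds outside 0-64.
fn resolve_threshold(requested: Option<u32>) -> Result<u32, VisualQueryError> {
    match requested {
        None => Ok(DEFAULT_VISUAL_THRESHOLD),
        Some(t) if t <= MAX_VISUAL_THRESHOLD => Ok(t),
        Some(t) => Err(VisualQueryError::ThresholdOutOfRange(t)),
    }
}
```

Parsing the reference hash once, before the scan, also keeps the per-assertion work down to the hamming distance itself. Whether an invalid parameter should surface as an HTTP 400 or be silently treated as "no match" is a design decision this sketch does not settle.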
---

## Open Questions

1. **Should we add `visual_hash` to `QueryAudit`?** When auditing "what did the agent query?", do we capture visual_near parameters?
2. **Should visual queries combine with other filters?** Current spec: yes (AND with subject, predicate, etc.). Is this correct?
3. **What's the actual distribution of visual_hash population?** If <1% of assertions have visual_hash, should we warn users?

---

## References

- [pHash paper](http://www.phash.org/docs/pubs/thesis_zauner.pdf) - Perceptual hashing algorithms
- [BK-tree](https://en.wikipedia.org/wiki/BK-tree) - Metric tree for discrete distances
- [VP-tree](https://en.wikipedia.org/wiki/Vantage-point_tree) - Metric tree for general metrics
- [LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) - Probabilistic nearest-neighbor