stemedb/docs/specs/visual-hash-query.md

Visual Hash Query Support

Status: Draft
Phase: 2.4 (MVP)
Pillars: First-Class Contradiction (Visual Domain), Visual Anchoring
Primary Use Cases: Financial Due Diligence, GLP-1 Living Review, Consumer Health Intelligence


Problem Statement

The visual_hash: Option<PHash> field exists on Assertion and is stored/returned by the API, but there is no way to query by visual similarity. The field is write-only from a query perspective.

This creates a gap between what we promise and what we deliver:

| Promise (from use-cases) | Reality |
|---|---|
| "Query by visual similarity" | No query parameter exists |
| "Find all assertions anchored to visually similar images" | Must scan entire dataset manually |
| "Catches duplicate evidence, fake variations, source reuse" | Cannot detect duplicates |

The Pain Points

Research Agent: "I extract the same clinical trial table 10 times. Each extraction drifts slightly. I stored visual_hash hoping to detect duplicates. I can't query by it."

Human Supervisor (3am incident): "Agent made a bad decision based on an assertion with visual_hash. I need to find all assertions from that same screenshot. I can't."

M&A Analyst: "I have two screenshots of the same org chart from different sources. I want to prove they corroborate each other. I can't query by visual similarity."


MVP Scope (Phase 2.4)

What We're Building

A brute-force visual similarity query capability using hamming distance on perceptual hashes.

API Changes

Query Struct (stemedb-query)

pub struct Query {
    // ... existing fields ...

    /// Filter by visual similarity to a reference pHash.
    /// Hex-encoded 8-byte perceptual hash (16 hex chars).
    pub visual_near: Option<String>,

    /// Maximum hamming distance for visual_near matching.
    /// Default: 8 bits (sweet spot for pHash similarity).
    /// Range: 0-64 (0 = exact match, 64 = match anything).
    pub visual_threshold: Option<u32>,
}

QueryBuilder (stemedb-query)

impl QueryBuilder {
    /// Filter by visual similarity to a reference pHash.
    ///
    /// # Arguments
    /// * `hash` - Hex-encoded pHash (16 chars)
    /// * `threshold` - Max hamming distance, 0-64 (8 is the recommended default; this method takes it explicitly)
    ///
    /// # Example
    /// ```
    /// let query = QueryBuilder::new()
    ///     .subject("BioStart_CEO")
    ///     .predicate("reports_to")
    ///     .visual_near("a3f2b1c4d5e6f708", 8)
    ///     .build();
    /// ```
    pub fn visual_near(mut self, hash: &str, threshold: u32) -> Self {
        self.visual_near = Some(hash.to_string());
        self.visual_threshold = Some(threshold);
        self
    }
}

HTTP API (stemedb-api)

GET /v1/query?visual_near=a3f2b1c4d5e6f708&visual_threshold=8

Query Parameters:
  visual_near      - Hex-encoded pHash (16 characters, optional)
  visual_threshold - Max hamming distance (0-64, default: 8, optional)
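
The parameter list above gives a range and a default for visual_threshold but does not say what happens to out-of-range values. A minimal sketch of one way the API layer might resolve the parameter follows. The constant names and `resolve_visual_threshold` are hypothetical, and rejecting (rather than clamping) values above 64 is an assumption, not something this spec decides.

```rust
/// Default and maximum hamming-distance thresholds, taken from the spec text.
pub const DEFAULT_VISUAL_THRESHOLD: u32 = 8;
pub const MAX_VISUAL_THRESHOLD: u32 = 64;

/// Resolve the optional `visual_threshold` query parameter.
///
/// Assumption: values above 64 are rejected with an error message instead of
/// being clamped. A missing parameter falls back to the default of 8.
pub fn resolve_visual_threshold(raw: Option<u32>) -> Result<u32, String> {
    match raw {
        None => Ok(DEFAULT_VISUAL_THRESHOLD),
        Some(t) if t <= MAX_VISUAL_THRESHOLD => Ok(t),
        Some(t) => Err(format!("visual_threshold must be 0-64, got {t}")),
    }
}
```

Rejecting early keeps a typo such as 80 from silently behaving like "match anything".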

Core Algorithm

/// Compute hamming distance between two perceptual hashes.
///
/// Hamming distance = number of differing bits.
/// For 8-byte pHash, range is 0-64.
pub fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

Matching Logic

In Query::matches():

// If visual_near is specified, check visual similarity
if let Some(ref target_hash) = self.visual_near {
    // matches() returns bool, so `?` cannot be used here. An unparsable
    // reference hash matches nothing; validate the hex earlier (for example
    // in the API layer) so this branch is not hit in practice.
    let Ok(target) = parse_hex_phash(target_hash) else {
        return false;
    };
    let threshold = self.visual_threshold.unwrap_or(8);
    match &assertion.visual_hash {
        Some(assertion_hash) => {
            if hamming_distance(&target, assertion_hash) > threshold {
                return false; // Too dissimilar
            }
        }
        None => return false, // No visual_hash, can't match
    }
}
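
The matching logic calls parse_hex_phash, which this spec lists in the checklist but does not define. A minimal sketch is shown here, assuming the function returns a Result with a small error enum. The `PHashParseError` type and the `hex_val` helper are hypothetical names introduced for illustration.

```rust
/// Hypothetical error type for pHash parsing failures.
#[derive(Debug, PartialEq, Eq)]
pub enum PHashParseError {
    /// Input was not exactly 16 bytes long (the actual length is attached).
    BadLength(usize),
    /// Input contained a character outside 0-9, a-f, A-F.
    BadDigit,
}

/// Parse a 16-character hex string into an 8-byte pHash.
/// Accepts upper- and lower-case hex digits.
pub fn parse_hex_phash(s: &str) -> Result<[u8; 8], PHashParseError> {
    let bytes = s.as_bytes();
    if bytes.len() != 16 {
        return Err(PHashParseError::BadLength(bytes.len()));
    }
    let mut out = [0u8; 8];
    // Each pair of hex characters becomes one output byte.
    for (i, pair) in bytes.chunks_exact(2).enumerate() {
        let hi = hex_val(pair[0]).ok_or(PHashParseError::BadDigit)?;
        let lo = hex_val(pair[1]).ok_or(PHashParseError::BadDigit)?;
        out[i] = (hi << 4) | lo;
    }
    Ok(out)
}

/// Convert one ASCII hex digit to its numeric value.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}
```

Parsing once per query, rather than once per assertion, keeps the per-assertion cost close to the hamming distance computation alone.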

Go SDK Changes

// QueryBuilder addition
func (b *QueryBuilder) VisualNear(hash string, threshold uint32) *QueryBuilder {
    b.visualNear = hash
    b.visualThreshold = threshold
    return b
}

// Usage
query := steme.NewQuery().
    Subject("BioStart_CEO").
    Predicate("reports_to").
    VisualNear("a3f2b1c4d5e6f708", 8).
    Build()

Threshold Guidance

Hamming distance thresholds for 64-bit pHash:

| Threshold | Meaning | Use Case |
|---|---|---|
| 0 | Exact match | Deduplication, identical images |
| 1-4 | Very similar | Safety-critical (medications, financial) |
| 5-8 | Similar | General visual matching (default) |
| 9-12 | Loosely similar | Catch resized/cropped variants |
| 13+ | Weak similarity | Exploratory search |

Domain-Specific Recommendations

| Domain | Recommended Threshold | Rationale |
|---|---|---|
| Consumer Health | 4-5 | False positives for medications are dangerous |
| Financial Due Diligence | 8-10 | Screenshots get cropped/resized |
| Pharma/Clinical | 6-8 | Balance accuracy vs OCR variations |
| General | 8 | pHash literature default |

Test Cases

Unit Tests (stemedb-query)

#[test]
fn test_visual_near_exact_match() {
    // Same hash, threshold 0 -> matches
    let hash = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let assertion = make_assertion_with_visual_hash(hash);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 0)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_within_threshold() {
    // Hashes differ by 3 bits, threshold 5 -> matches
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x0F]; // 3 bits differ
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(query.matches(&assertion));
}

#[test]
fn test_visual_near_exceeds_threshold() {
    // Hashes differ by 9 bits, threshold 5 -> no match
    let hash_a = [0xA3, 0xF2, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08];
    let hash_b = [0x00, 0x00, 0xB1, 0xC4, 0xD5, 0xE6, 0xF7, 0x08]; // 9 bits differ (0xA3 has 4 set bits, 0xF2 has 5)
    let assertion = make_assertion_with_visual_hash(hash_b);
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 5)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_visual_near_skips_assertions_without_hash() {
    // Assertion has no visual_hash -> not matched
    let assertion = make_assertion_without_visual_hash();
    let query = QueryBuilder::new()
        .visual_near("a3f2b1c4d5e6f708", 8)
        .build();
    assert!(!query.matches(&assertion));
}

#[test]
fn test_hamming_distance_zero() {
    let hash = [0xFF; 8];
    assert_eq!(hamming_distance(&hash, &hash), 0);
}

#[test]
fn test_hamming_distance_max() {
    let a = [0x00; 8];
    let b = [0xFF; 8];
    assert_eq!(hamming_distance(&a, &b), 64);
}

Integration Tests (stemedb-api)

#[tokio::test]
async fn test_visual_query_api() {
    let app = create_test_app().await;

    // Insert assertion with visual_hash
    let resp = app.post("/v1/assertions")
        .json(&json!({
            "subject": "STEP-1_Trial",
            "predicate": "primary_endpoint",
            "object": {"Number": 14.9},
            "visual_hash": "a3f2b1c4d5e6f708"
        }))
        .send().await;
    assert_eq!(resp.status(), 201);

    // Query by visual similarity
    let resp = app.get("/v1/query")
        .query(&[
            ("visual_near", "a3f2b1c4d5e6f700"), // 1 bit different (0x08 vs 0x00)
            ("visual_threshold", "8")
        ])
        .send().await;
    assert_eq!(resp.status(), 200);

    let results: Vec<Assertion> = resp.json().await;
    assert_eq!(results.len(), 1);
}

Performance Characteristics

MVP (Brute Force)

| Operation | Complexity | Notes |
|---|---|---|
| Visual query | O(N) | Scans all assertions |
| Hamming distance | O(1) | 8 XORs + 8 popcounts |
| Per-assertion overhead | ~10 ns | Negligible |

Expected latency:

  • 10K assertions: ~100μs
  • 100K assertions: ~1ms
  • 1M assertions: ~10ms
  • 10M assertions: ~100ms (needs index)

When to Worry

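If visual queries consistently exceed 100ms, the brute-force scan is no longer adequate and Phase 3 indexing is overdue. Note that the Phase 3a trigger is stricter (50ms at p99), so index work should normally start before this point.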
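
To make the O(N) behaviour concrete, here is a minimal sketch of the brute-force scan together with the back-of-envelope latency arithmetic used above. The `Row` struct is a hypothetical stand-in for an assertion, and the 10 ns per-row figure is the spec's own estimate, not a measurement.

```rust
/// Hamming distance between two 8-byte hashes (same definition as the spec).
fn hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Hypothetical stand-in for an assertion: an id plus an optional pHash.
pub struct Row {
    pub id: u64,
    pub visual_hash: Option<[u8; 8]>,
}

/// Linear scan: ids of all rows within `threshold` bits of `target`.
/// Rows without a visual_hash never match.
pub fn scan_visual_near(rows: &[Row], target: &[u8; 8], threshold: u32) -> Vec<u64> {
    rows.iter()
        .filter(|r| match &r.visual_hash {
            Some(h) => hamming_distance(target, h) <= threshold,
            None => false,
        })
        .map(|r| r.id)
        .collect()
}

/// Estimated scan time in nanoseconds, assuming ~10 ns per row.
pub fn estimated_scan_ns(rows: u64) -> u64 {
    rows * 10
}
```

With this estimate, 1M rows come to 10,000,000 ns (10ms), which matches the list above.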


Future Directions (Based on Performance)

Phase 3a: BK-Tree Index

Trigger: Visual queries exceed 50ms at p99.

Approach: Build a BK-tree (Burkhard-Keller tree) optimized for hamming distance.

struct BKTree {
    root: Option<Box<BKNode>>,
}

struct BKNode {
    hash: PHash,
    assertion_id: Hash,
    children: HashMap<u32, Box<BKNode>>, // distance -> child
}

Benefits:

  • Roughly O(log N) lookups on average for small thresholds; pruning weakens as the threshold grows, approaching a full scan
  • Well-suited for discrete metric spaces (hamming distance)

Storage: VIX:{hash_prefix} keys in sled for persistence.
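
The following is a minimal in-memory sketch of how BK-tree insert and range search could work. It is not the planned implementation. For brevity it uses u64 hashes and u64 ids in place of PHash and Hash, stores children by value, and leaves out sled persistence. All names are illustrative.

```rust
use std::collections::HashMap;

/// Hamming distance between two 64-bit hashes.
fn dist(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

struct BKNode {
    hash: u64,
    ids: Vec<u64>,                  // assertions sharing this exact hash
    children: HashMap<u32, BKNode>, // distance from this node -> subtree
}

impl BKNode {
    fn new(hash: u64) -> Self {
        BKNode { hash, ids: Vec::new(), children: HashMap::new() }
    }
}

#[derive(Default)]
pub struct BKTree {
    root: Option<BKNode>,
}

impl BKTree {
    /// Walk down by distance until a node with the same hash exists
    /// (creating it if needed), then record the id there.
    pub fn insert(&mut self, hash: u64, id: u64) {
        let mut node = self.root.get_or_insert_with(|| BKNode::new(hash));
        loop {
            let d = dist(node.hash, hash);
            if d == 0 {
                node.ids.push(id);
                return;
            }
            node = node.children.entry(d).or_insert_with(|| BKNode::new(hash));
        }
    }

    /// Return ids of all hashes within `threshold` bits of `target`.
    pub fn within(&self, target: u64, threshold: u32) -> Vec<u64> {
        let mut out = Vec::new();
        let mut stack: Vec<&BKNode> = self.root.iter().collect();
        while let Some(node) = stack.pop() {
            let d = dist(node.hash, target);
            if d <= threshold {
                out.extend_from_slice(&node.ids);
            }
            // Triangle inequality: a match can only live in a child whose
            // edge distance lies in [d - threshold, d + threshold].
            let lo = d.saturating_sub(threshold);
            let hi = d.saturating_add(threshold);
            for (&edge, child) in &node.children {
                if edge >= lo && edge <= hi {
                    stack.push(child);
                }
            }
        }
        out
    }
}
```

Result order depends on HashMap iteration, so callers that need stable output should sort the returned ids.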

Phase 3b: VP-Tree Index

Trigger: Need range queries with variable thresholds.

Approach: Vantage-point tree for metric space search.

Benefits:

  • Better for continuous thresholds
  • Can support "find K nearest" queries

Phase 3c: Locality-Sensitive Hashing (LSH)

Trigger: Need sub-linear time for very large datasets (100M+ assertions).

Approach: Hash pHashes into buckets where similar hashes collide.

Benefits:

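  • Near-constant-time candidate lookup on average (a fixed number of bucket probes, plus verification of the candidates found)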
  • Scales to billions of assertions

Tradeoffs:

  • Probabilistic (may miss some matches)
  • Requires tuning hash functions
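
As an illustration of the bucketing idea and its tradeoff, here is a minimal sketch that splits each 64-bit hash into four 16-bit bands and indexes every band. It uses fixed bands in place of randomly sampled bit positions, which is a simplification of bit-sampling LSH and not a tuned design. `BandIndex` and its methods are hypothetical names.

```rust
use std::collections::HashMap;

const BANDS: u32 = 4; // 4 bands x 16 bits = 64 bits
const BAND_BITS: u32 = 16;

/// Bucket key for band `i` of a hash: (band index, 16-bit band value).
fn band_key(hash: u64, i: u32) -> (u32, u16) {
    (i, (hash >> (i * BAND_BITS)) as u16)
}

#[derive(Default)]
pub struct BandIndex {
    // Bucket key -> (hash, id) pairs whose band matches that key.
    buckets: HashMap<(u32, u16), Vec<(u64, u64)>>,
}

impl BandIndex {
    /// Register a hash under each of its four band keys.
    /// Assumes each id is inserted once.
    pub fn insert(&mut self, hash: u64, id: u64) {
        for i in 0..BANDS {
            self.buckets.entry(band_key(hash, i)).or_default().push((hash, id));
        }
    }

    /// Probe the four buckets of `target`, then verify each candidate with
    /// an exact hamming distance check. Only hashes sharing at least one
    /// full band with `target` are ever considered.
    pub fn query(&self, target: u64, threshold: u32) -> Vec<u64> {
        let mut out = Vec::new();
        for i in 0..BANDS {
            if let Some(bucket) = self.buckets.get(&band_key(target, i)) {
                for &(hash, id) in bucket {
                    if (hash ^ target).count_ones() <= threshold && !out.contains(&id) {
                        out.push(id);
                    }
                }
            }
        }
        out
    }
}
```

By the pigeonhole principle, any hash within 3 bits of the target shares at least one band and is always found. At larger distances a true match can be missed, for example a hash that differs by one bit in every band (distance 4), which is the "may miss some matches" tradeoff listed above.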

Phase 3d: Visual Clustering Dashboard

Trigger: Users want "show me distinct visual fingerprints."

Approach: Background job that clusters visual_hashes and maintains cluster centroids.

GET /v1/visual-clusters?subject=TechCorp
-> Returns:
{
  "clusters": [
    { "centroid": "a3f2b1c4d5e6f708", "count": 47, "representative_hash": "..." },
    { "centroid": "b4e3c2d5a6f70819", "count": 12, "representative_hash": "..." }
  ]
}

Phase 3e: Pre-Assertion Dedup Check

Trigger: Ingestion workers want to avoid duplicates.

Approach: Endpoint to check visual similarity before creating assertion.

GET /v1/assertions/visual-check?visual_hash=a3f2b1c4d5e6f708&threshold=8
-> Returns existing assertion hashes if any match, empty otherwise

Non-Goals (Explicit Scope Limits)

  1. Not making visual_hash mandatory. It remains optional. Assertions without visual_hash simply don't match visual queries.

  2. Not building image processing. Episteme doesn't compute pHashes. Clients compute them before asserting.

  3. Not building a visual search UI. This is query infrastructure. Apps build UIs on top.

  4. Not optimizing for exact match. Use source_hash for exact binary matching. visual_hash is for perceptual similarity.


Implementation Checklist

Core (stemedb-query)

  • Add visual_near: Option<String> to Query struct
  • Add visual_threshold: Option<u32> to Query struct (default: 8)
  • Implement hamming_distance(a: &[u8; 8], b: &[u8; 8]) -> u32
  • Implement hex parsing for pHash strings
  • Add visual matching to Query::matches()
  • Add .visual_near(hash, threshold) to QueryBuilder

API (stemedb-api)

  • Add visual_near query parameter to QueryParams DTO
  • Add visual_threshold query parameter to QueryParams DTO
  • Update OpenAPI spec with new parameters

SDK (sdk/go/steme)

  • Add VisualNear(hash, threshold) to QueryBuilder
  • Add integration test for visual queries

Tests

  • test_visual_near_exact_match
  • test_visual_near_within_threshold
  • test_visual_near_exceeds_threshold
  • test_visual_near_skips_assertions_without_hash
  • test_hamming_distance_zero
  • test_hamming_distance_max

Documentation

  • Add threshold guidance to SDK docs
  • Add domain-specific recommendations to use-case docs
  • Update ai-lookup/services/query.md with visual query info

Stakeholder Sign-Off

Based on empathy interviews:

| Persona | Concern | How MVP Addresses |
|---|---|---|
| Research Agent | "Let me query by visual similarity" | Core feature |
| Human Supervisor | "O(N) is fine if Phase 3 is planned" | Future directions documented |
| Lead Orchestrator | "Don't mandate visual_hash" | Remains optional |
| On-Call SRE | "Keep API simple" | One required parameter (visual_near); visual_threshold is optional with a sensible default |
| M&A Analyst | "Threshold 8-10 for screenshots" | Default 8, guidance for 10 |
| Consumer Health | "False positives are dangerous" | Domain guidance: threshold 4-5 |

Open Questions

  1. Should we add visual_hash to QueryAudit? When auditing "what did the agent query?", do we capture visual_near parameters?

  2. Should visual queries combine with other filters? Current spec: yes (AND with subject, predicate, etc.). Is this correct?

  3. What's the actual distribution of visual_hash population? If <1% of assertions have visual_hash, should we warn users?


References

  • pHash paper - Perceptual hashing algorithms
  • BK-tree - Metric tree for discrete distances
  • VP-tree - Metric tree for general metrics
  • LSH - Probabilistic nearest-neighbor