tidaldb/docs/research/ann_for_tidaldb.md
jordan 413b712c0a chore: initialize tidalDB repository with schema foundation and standards
- Schema phase 1 (tasks 01-02): EntityId, EntityKind, Timestamp, Score, SignalTypeDef, DecayModel, Window, WindowSet — all with property tests and benchmarks scaffolding
- Stub modules for storage, signals, query, ranking
- Full documentation suite: VISION, USE_CASES, SEQUENCE, API, CODING_GUIDELINES, ai-lookup, research docs, specs, roadmap, planning docs
- Marketing site (Next.js) with blog infrastructure
- .claude/ agents and skills for the tidalDB development workflow
- Foundation standards enforced: thiserror + tracing declared as dependencies, clippy::unwrap_used = deny added to lint config
- .gitignore hardened: .next/, node_modules/, .env, secrets, logs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 12:52:20 -07:00

19 KiB
Raw Blame History

ANN for tidalDB: USearch with adaptive filtered search

Use USearch (Unum Cloud) as tidalDB's vector index engine, with an adaptive query planner layered on top. USearch is the only actively maintained Rust-available ANN library that supports predicate-based filtering during HNSW graph traversal — the exact algorithmic primitive tidalDB needs. ScyllaDB, ClickHouse, and DuckDB already embed USearch in production at scale, validating the approach. The filtered search callback architecture lets tidalDB evaluate arbitrary metadata predicates (date ranges, followed-creator sets, seen-item exclusions) inline during graph traversal, avoiding both post-filter recall collapse and pre-filter brute-force degradation. At 10M vectors of dimension 1536, expect ~60 GB RAM at float32 or ~15 GB with int8 quantization, with mmap persistence for instant restart via USearch's view() mode.


Rust ANN library landscape: most options are inadequate

The Rust ecosystem for HNSW is fragmented. Of eight libraries evaluated, only two support filtered search with active maintenance — and one dramatically outperforms the other.

Library Type Stars Active Filtered Search mmap Incremental Add/Delete Quantization QPS (high-dim)
usearch C++ FFI ~3,500 (Jan 2026) predicate callback view() / (lazy) f16, i8, binary ~127K167K
hnsw_rs Pure Rust 221 Filterable trait (vectors) / ~10K30K est.
hannoy Pure Rust (LMDB) New RoaringBitmap (LMDB) / Binary only Competitive w/ FAISS
hnswlib-rs (CoreNN) Pure Rust New (bincode) / (tombstone) i8 Unknown
hora Pure Rust ~2,600 (2021) N/A — abandoned
instant-distance Pure Rust 343 ⚠️ Low N/A
arroy Pure Rust (LMDB) ~500 (superseded) RoaringBitmap Binary Degrades at high-dim
hnsw (rust-cv) Pure Rust ~200 (2021) N/A — abandoned

USearch dominates on every axis that matters for tidalDB. On a 92-core Intel Xeon with 10M vectors, USearch achieves 126,582 QPS at float32 and 166,667 QPS at int8 — roughly 150x faster than Lucene at equivalent recall. ScyllaDB reports 12,000 QPS at >97% recall on 100M vectors of dimension 768 in production, with p99 latency under 40ms. No pure-Rust library publishes competitive numbers at this scale.

The pure-Rust alternative hnsw_rs (jean-pierreBoth) deserves mention as a fallback. It supports filtering via a Filterable trait, offers mmap for vector data, has SIMD acceleration via simdeez, and supports an unusually broad set of distance metrics including Jaccard, Hamming, and Hellinger divergence. It lacks quantization and deletion support, and has no published high-dimensional benchmarks, but at 145K crates.io downloads it has real production usage in genomics and bioinformatics.

hannoy (the new Meilisearch HNSW backend) is architecturally interesting — an LMDB-backed, disk-native HNSW with DiskANN-inspired graph patching for incremental updates. It replaced arroy in Meilisearch v1.29 (December 2025), delivering 10x search speedup and 2x index size reduction. However, it is tightly coupled to Meilisearch internals and far too new (v0.0.3) for production embedding in another database.

Qdrant's internal HNSW implementation (pure Rust, with ACORN-1 support) cannot be extracted as a standalone library. The code is deeply coupled to Qdrant's segment/storage/API layers, and the only third-party extraction attempt (qdrant-lib, 63 stars) carries massive dependency bloat.


The filtered ANN problem: why USearch's callback architecture works

The core challenge is clear: "find 100 nearest items, but only from items created in the last 7 days by followed creators, excluding seen items." Naive post-filtering fails catastrophically when the filter retains less than ~10% of the corpus — recall drops to near zero because the top-k ANN candidates contain almost no filter-matching items. Pure pre-filtering (brute-force over the filtered set) works perfectly at 1% selectivity but becomes prohibitively slow when the filtered set exceeds a few hundred thousand vectors.

Production systems converge on the same solution: evaluate filters during graph traversal, with an adaptive query planner that switches strategies based on estimated filter selectivity. Here is how the major systems handle it:

Qdrant builds a "filterable HNSW" with extra graph edges per payload value, ensuring subgraph connectivity under filters. Its query planner estimates filter cardinality, then selects: payload-index-only scan for very selective filters (<1-2%), filterable HNSW traversal for moderate selectivity, and ACORN two-hop expansion for compound low-selectivity filters. Starting with v1.16, Qdrant integrates ACORN as a configurable per-query option — 2-10x slower than regular HNSW but dramatically better recall for restrictive multi-attribute filters.

Weaviate takes a simpler approach: build both an inverted index and an HNSW index per shard. Pre-filtering produces a RoaringBitmap allow-list; HNSW traversal follows all edges normally but only adds allow-listed nodes to results. A flatSearchCutoff parameter triggers brute-force when the filtered set is small enough. Their documentation shows recall@10 remaining near 0.99 from 100% down to 1% selectivity with this hybrid approach. Since v1.27, Weaviate adds ACORN-style two-hop expansion for low-correlation filter scenarios.

Pinecone uses a single-stage approach with adaptive IVF: metadata bitmaps per field are intersected with IVF cluster assignments, excluding irrelevant clusters entirely. Their published benchmarks show recall@10 of 0.989 on YFCC, stable across all selectivities — and filters actually speed up search by 35% at 1% selectivity because they reduce the effective search space.

The critical insight across all systems: at extreme selectivity (<1-2%), everyone falls back to pre-filter + brute-force over the small matched set, which gives exact results quickly. The differentiating engineering happens in the 5-30% selectivity "danger zone" where the filtered set is too large for brute-force but too sparse for standard HNSW to maintain recall.

USearch's filtered_search(query, k, |key| predicate(key)) implements the correct in-graph filtering primitive. The predicate receives each candidate node's Key (u64) during traversal. Nodes failing the predicate are skipped for results but still used for graph navigation — preserving search quality. tidalDB's architecture would be:

User query → parse filter conditions → estimate selectivity via metadata indexes
  → if selectivity < 2%: pre-filter (roaring bitmap) → brute-force top-k
  → if selectivity 2-100%: index.filtered_search(vector, k, |key| check_metadata(key))

This mirrors exactly how ScyllaDB uses USearch in production.


The ACORN paper (Patel et al., Stanford, SIGMOD 2024) introduces the most theoretically elegant solution to filtered ANN. Its core insight: expand each HNSW node's neighbor list from M to γ·M candidates during construction. When a selective filter eliminates most nodes, the surviving ~γ·M × selectivity edges still approximate the standard M edges needed for navigability. With γ=12 and 8% selectivity, each node retains roughly 12 × 16 × 0.08 ≈ 15 usable edges — close to the standard M=16.

ACORN achieves 21,000x higher QPS at 0.9 recall compared to pre-filtering and post-filtering across multiple datasets (SIFT1M, LAION, TripClick). The lightweight ACORN-1 variant uses two-hop neighbor scanning instead of graph densification — at most 5x lower QPS than full ACORN-γ but with 9-53x lower construction time. ACORN-1 is what Qdrant (v1.16) and Weaviate (v1.27) have adopted.

For 1536-dimensional embeddings specifically, ACORN is the strongest academic approach. The ETH Zurich FANNS benchmark (2025) tested all major filtered ANN methods on 2.7M documents with 1024-dim transformer embeddings. ACORN was the only method supporting all filter types (exact match, range, set membership, and combinations), and it maintained performance while Filtered-DiskANN, CAPS, and UNG all failed to reach >25% recall on this high-dimensional dataset.

Other notable academic approaches and their applicability:

  • Filtered-DiskANN (Microsoft, WWW 2023): builds label-aware Vamana graphs. Excellent for categorical label filters — deployed in Microsoft Ads with 30-80% revenue gains. Limited to equality predicates on ~1,000 labels; struggles with high-dimensional transformer embeddings.
  • SeRF (SIGMOD 2024): segment graph specifically for range filters on ordered attributes (timestamps, prices). Excellent for tidalDB's "last 7 days" filter component but static — no incremental updates.
  • NHQ (TKDE 2022 / NeurIPS 2024): fuses embedding distance with attribute dissimilarity into a combined metric. Fast (10-315x over baselines) but returns approximate filter matches — not guaranteed to satisfy predicates. Unsuitable when hard filter compliance is required.
  • CAPS (2023): partition-based approach with 10% the index size of graph methods. Impressive on low-dimensional data but fails on transformer embeddings at scale.

Practical recommendation: tidalDB does not need to implement ACORN directly. USearch's predicate callback during traversal achieves the same effect (skipping non-matching nodes while preserving graph navigation). If recall degrades under very selective compound filters, tidalDB can implement ACORN-1 style two-hop expansion by having the predicate maintain state and exploring neighbors-of-neighbors — or simply fall back to pre-filter + brute-force for the most selective cases. The adaptive query planner handles this automatically.


Memory, persistence, and quantization at scale

At 1536 dimensions with HNSW (M=16), memory is dominated by raw vectors — the graph adds only ~300 bytes per node (~5% overhead):

Scale Float32 Vectors HNSW Graph Total With f16 With int8 With Binary + Rescore
1M 5.72 GB 0.29 GB 6.0 GB 3.2 GB 1.7 GB 0.5 GB
10M 57.2 GB 2.86 GB 60 GB 31.5 GB 17.2 GB 4.7 GB
100M 572 GB 28.6 GB 601 GB 314 GB 172 GB 47 GB

USearch's f16 quantization is the optimal default for tidalDB. It halves memory with negligible recall loss (<1%) — bringing 10M vectors from 60 GB to ~32 GB, comfortably fitting on a single 64 GB node. Int8 quantization reduces to 17 GB with 1-3% recall loss. Binary quantization achieves 32x compression but requires full-precision rescoring from disk for acceptable recall.

Persistence strategy: USearch provides three modes — save() for full serialization, load() for deserialization into RAM, and view() for zero-copy mmap serving. The recommended pattern for tidalDB:

  1. Active index in RAM for reads and writes during operation
  2. Periodic save() to persist to disk (coordinated with tidalDB's WAL/checkpointing)
  3. On restart: view() for immediate read-only serving while a writable copy loads in background
  4. For datasets exceeding RAM: mmap vectors to NVMe SSD while keeping the HNSW graph (~29 GB for 100M vectors) in memory. Expect 2-10x latency increase for mmap'd vectors depending on OS page cache hit rates. Milvus specifically recommends HNSW over IVF for mmap workloads because graph traversal locality is better than IVF's random cluster access.

Incremental updates: USearch supports add(key, vector) and remove(key) natively. Deletion is lazy (tombstoning). One constraint: reserve(capacity) must be called before first write, requiring capacity planning. tidalDB should either over-provision (2x expected count) or implement segment-based index management — build new segments for incoming data, periodically merge segments into a rebuilt index that reclaims tombstoned space. This mirrors Qdrant's and Tantivy/Lucene's proven segment architecture.

The DiskANN alternative (Microsoft) uses a fundamentally different approach for datasets that don't fit in RAM: a single-layer Vamana graph with Product Quantization in memory for coarse search, full vectors on SSD for rescoring. DiskANN achieves <5ms mean latency at 95%+ recall on 1 billion 128D vectors using SSD + 64 GB RAM. A pure-Rust DiskANN implementation exists (infinilabs/diskann) but is early-stage. For tidalDB's single-node scale (≤100M vectors), HNSW with mmap is simpler and sufficient.


Multi-vector retrieval needs no special indexing

For "For You" feeds driven by a user's history of interactions, the question is how to query with a preference derived from multiple embeddings. The answer is PinnerSage-style multi-query with result merging — no special index modifications required.

Pinterest's PinnerSage system (KDD 2020, 400M+ MAU in production) proved that averaging multiple interest embeddings catastrophically loses information. Averaging embeddings for "hiking," "cooking," and "cars" produces a centroid best represented by "energy boosting breakfast" — semantically unrelated to any actual interest. Instead, PinnerSage clusters user interactions via Ward hierarchical clustering into 3-100 coherent interest groups per user, represents each with a medoid (actual item, not centroid), and issues separate ANN queries per interest cluster.

For tidalDB, this means:

  1. Pre-compute user interest clusters offline (3-10 clusters per user)
  2. Store cluster medoids/centroids per user
  3. At query time: issue 3-10 standard filtered_search calls (one per top cluster), merge and deduplicate results by score
  4. For users with <5 interactions: simple weighted average is acceptable

This requires only standard single-vector ANN queries — USearch's filtered_search works directly. The total query cost scales linearly with cluster count, but since each query is independent, they can execute in parallel.

For cosine vs. inner product: OpenAI 1536D embeddings are designed for cosine similarity. Normalize vectors at insertion time and use L2 distance (equivalent to cosine for unit vectors, and more SIMD-friendly). If tidalDB later adds collaborative-filtering-style embeddings where magnitude carries meaning, implement the XBOX transformation (append one extra dimension) to convert MIPS to L2.


Implementation recommendation: wrap USearch, build the planner

Recommended architecture: Embed USearch as tidalDB's vector index via its Rust crate (Apache-2.0, single dependency on cxx), and build three layers on top.

Layer 1 — Metadata indexes (tidalDB builds): Maintain roaring bitmaps per high-cardinality filter value (creator_id → bitmap of their item keys), B-tree indexes for range attributes (created_at), and a bloom filter or hash set for per-user seen-item exclusion. These enable both fast cardinality estimation and efficient predicate evaluation inside USearch's callback.

Layer 2 — Adaptive query planner (tidalDB builds): Before each search, estimate filter selectivity from metadata index statistics:

  • Selectivity <2%: Pre-filter via bitmap intersection → brute-force L2 scan over matched vectors (exact, fast on small sets)
  • Selectivity 2-100%: Call index.filtered_search(query, k, |key| check_all_filters(key)) — USearch handles in-graph filtered traversal
  • Fallback: If filtered_search returns fewer than k results with high ef_search, widen the search or fall back to pre-filter + brute-force

Layer 3 — Persistence and lifecycle (tidalDB builds): WAL-based durability wrapping USearch's save/load/view. Segment-based index management for growing datasets. Periodic compaction to reclaim tombstoned vectors. On restart, view() for immediate read serving.

Why not build HNSW from scratch: Implementing a correct, high-performance, concurrent HNSW with SIMD-optimized distance computation is 6-12 months of dedicated systems work. USearch's C++ core has been battle-tested across ScyllaDB (1B vectors), ClickHouse, and DuckDB. The FFI boundary via CXX is thin and well-maintained. The engineering effort is better spent on tidalDB's metadata filtering, query planning, and persistence layers — the parts that differentiate a database from a bare index.

Why not use hnsw_rs instead: It's pure Rust (avoiding FFI), but lacks quantization, deletion support, and published high-dimensional benchmarks. For a performance-critical vector database, USearch's 10-100x performance advantage (via SimSIMD and architectural optimizations) outweighs FFI purity concerns. If tidalDB later needs to eliminate C++ dependencies, hnsw_rs is a viable migration target — its Filterable trait provides the same predicate-during-traversal capability.


Open questions requiring benchmarking in tidalDB's conditions

Must benchmark before committing:

  • USearch filtered_search latency as a function of predicate evaluation cost. If tidalDB's check_all_filters(key) requires random access to a metadata store, the overhead per HNSW hop could dominate. Benchmark with realistic filter complexity (bitmap lookup + range check + set membership) to establish the latency budget per predicate call.
  • Recall@10 and QPS at 1536D for USearch at 1M and 10M vectors with tidalDB's actual filter selectivity distribution. No published benchmark tests USearch's filtered search at this dimensionality.
  • Memory overhead of USearch's graph structure at 1536D with M=16 vs M=32 — higher M improves recall under selective filters but increases memory.
  • reserve() capacity planning: what happens when the index fills up? Is there a graceful resize path or does it require a full rebuild?

Should investigate:

  • USearch's behavior under concurrent writes + filtered reads — ScyllaDB validates concurrent operation but tidalDB's access patterns may differ.
  • The crossover point where pre-filter brute-force beats filtered HNSW for tidalDB's data distribution. This determines the query planner's switching threshold.
  • Whether USearch's view() mmap mode supports concurrent search adequately, or if a writable load() is always needed for production serving.
  • f16 vs. int8 quantization recall impact specifically for OpenAI text-embedding-3-large vectors — quantization tolerance varies by embedding model.
  • Incremental index degradation: after 100K inserts + 50K deletes without compaction, how much does recall degrade?
  • ACORN-1 style two-hop expansion: can this be implemented within USearch's predicate callback (by maintaining traversal state), or would it require patching USearch's C++ core?