jordan 413b712c0a chore: initialize tidalDB repository with schema foundation and standards

- Schema phase 1 (tasks 01-02): EntityId, EntityKind, Timestamp, Score, SignalTypeDef, DecayModel, Window, WindowSet — all with property tests and benchmarks scaffolding
- Stub modules for storage, signals, query, ranking
- Full documentation suite: VISION, USE_CASES, SEQUENCE, API, CODING_GUIDELINES, ai-lookup, research docs, specs, roadmap, planning docs
- Marketing site (Next.js) with blog infrastructure
- .claude/ agents and skills for the tidalDB development workflow
- Foundation standards enforced: thiserror + tracing declared as dependencies, clippy::unwrap_used = deny added to lint config
- .gitignore hardened: .next/, node_modules/, .env, secrets, logs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 12:52:20 -07:00


Tantivy is the right engine for tidalDB, with one critical pattern to get right

Tantivy is a strong fit for tidalDB's embedded full-text search needs, and the feared integration blocker — extracting raw BM25 scores without Tantivy's own top-K selection — is not a blocker at all. The Collector trait, Weight/Scorer pipeline, and DocSet::seek API provide exactly the hooks tidalDB needs to treat Tantivy as a scoring primitive rather than a complete search engine. The real engineering risk lies elsewhere: keeping Tantivy's segment storage consistent with tidalDB's entity store under failure conditions, and managing segment merge latency at scale. This report covers the exact API patterns, consistency strategies, performance expectations, and hybrid fusion approach for the integration.

Tantivy is currently at version 0.25.0, is MIT-licensed, maintained by the Quickwit team (acquired by Datadog in January 2025), and represents roughly 40,000 lines of Rust — substantial but well-structured. The Collector/Scorer API has been stable since the 0.20 rewrite. Multiple production systems embed it successfully, including Quickwit (distributed log search), ParadeDB (Postgres extension), and Milvus (vector database scalar filtering). One notable rejection: SurrealDB built their own BM25 engine because Tantivy's non-ACID commit model conflicted with their transactional requirements — a cautionary signal relevant to tidalDB's dual-write problem.

Per-document scoring works cleanly through three distinct APIs

The key risk identified in the brief — that extracting raw BM25 scores per document might require internal API hacking — is unfounded. Tantivy's scoring pipeline is explicitly designed as a composable chain: Query → Weight → Scorer → Collector, where the Collector is the user's code. tidalDB has three well-supported approaches, listed from most to least recommended.

Approach 1: Custom Collector (best for "give me all BM25 scores"). The Collector trait lets you capture every (DocAddress, Score) pair without any top-K filtering. The critical detail: requires_scoring() must return true or Tantivy skips BM25 computation entirely.

use tantivy::collector::{Collector, SegmentCollector};
use tantivy::{DocId, Score, SegmentOrdinal, SegmentReader, DocAddress};

struct AllScoresCollector;
struct AllScoresSegmentCollector {
    segment_ord: SegmentOrdinal,
    scores: Vec<(DocAddress, Score)>,
}

impl Collector for AllScoresCollector {
    type Fruit = Vec<(DocAddress, Score)>;
    type Child = AllScoresSegmentCollector;

    fn for_segment(&self, segment_local_id: SegmentOrdinal, _segment: &SegmentReader)
        -> tantivy::Result<Self::Child> {
        Ok(AllScoresSegmentCollector {
            segment_ord: segment_local_id,
            scores: Vec::new(),
        })
    }

    // Must return true; otherwise Tantivy skips BM25 computation for this collector.
    fn requires_scoring(&self) -> bool { true }

    fn merge_fruits(&self, segment_fruits: Vec<Vec<(DocAddress, Score)>>)
        -> tantivy::Result<Self::Fruit> {
        Ok(segment_fruits.into_iter().flatten().collect())
    }
}

impl SegmentCollector for AllScoresSegmentCollector {
    type Fruit = Vec<(DocAddress, Score)>;
    fn collect(&mut self, doc: DocId, score: Score) {
        self.scores.push((DocAddress::new(self.segment_ord, doc), score));
    }
    fn harvest(self) -> Self::Fruit { self.scores }
}

// Usage: returns ALL matching docs with BM25 scores, no top-K
let all_scores = searcher.search(&query, &AllScoresCollector)?;

Approach 2: Weight::scorer + DocSet::seek (best for "score these specific doc IDs"). This is the pattern for tidalDB's re-ranking use case — when you already have a candidate set from ANN or signal filtering and want BM25 scores for just those documents. The Scorer trait extends DocSet, which provides seek(target) -> DocId. Seek advances to the first doc ≥ target; if it returns exactly the target, the document matches the query, and scorer.score() gives its BM25 score.

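use tantivy::query::{EnableScoring, Query, Scorer};
use tantivy::{DocSet, TERMINATED};

// `candidates_by_segment` is a hypothetical Vec<Vec<DocId>>: one list of segment-local
// doc IDs per segment ordinal, each sorted ascending (doc IDs only have meaning per segment).
let weight = query.weight(EnableScoring::enabled_from_searcher(&searcher))?;
for (seg_ord, segment_reader) in searcher.segment_readers().iter().enumerate() {
    let mut scorer = weight.scorer(segment_reader, 1.0)?;
    for &target_doc_id in &candidates_by_segment[seg_ord] {
        if scorer.doc() == TERMINATED {
            break; // posting list exhausted for this segment
        }
        if target_doc_id < scorer.doc() {
            continue; // cursor already passed this ID, so it does not match
        }
        if scorer.seek(target_doc_id) == target_doc_id {
            let bm25_score = scorer.score();
            // Feed bm25_score into tidalDB's ranking profile
        }
    }
}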

The caveat: seek() only moves forward. Candidate doc IDs must be pre-sorted in ascending order. This is a Lucene-inherited design — the posting list cursor is forward-only. For tidalDB's use case of scoring ANN candidates, sort the segment-local doc IDs first.
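The forward-only constraint can be illustrated without Tantivy at all. The following is a minimal sketch over a toy posting list, not Tantivy's implementation: PostingCursor, score_candidates, and the local TERMINATED sentinel are hypothetical names chosen to mirror the DocSet vocabulary used above.

```rust
/// Sentinel for an exhausted cursor (a local stand-in, not Tantivy's constant).
const TERMINATED: u32 = u32::MAX;

/// Toy forward-only cursor over one segment's postings: (doc_id, score), sorted by doc_id.
struct PostingCursor<'a> {
    postings: &'a [(u32, f32)],
    pos: usize,
}

impl<'a> PostingCursor<'a> {
    fn new(postings: &'a [(u32, f32)]) -> Self {
        PostingCursor { postings, pos: 0 }
    }

    /// Current doc ID, or TERMINATED once the list is exhausted.
    fn doc(&self) -> u32 {
        self.postings.get(self.pos).map_or(TERMINATED, |p| p.0)
    }

    /// Advance to the first doc >= target. The cursor never moves backwards.
    fn seek(&mut self, target: u32) -> u32 {
        while self.doc() < target {
            self.pos += 1;
        }
        self.doc()
    }

    /// Score of the current doc; only valid while the cursor is not terminated.
    fn score(&self) -> f32 {
        self.postings[self.pos].1
    }
}

/// Sort and deduplicate the candidates, then score only those that appear in the postings.
fn score_candidates(postings: &[(u32, f32)], candidates: &mut Vec<u32>) -> Vec<(u32, f32)> {
    candidates.sort_unstable();
    candidates.dedup();
    let mut cursor = PostingCursor::new(postings);
    let mut out = Vec::new();
    for &target in candidates.iter() {
        let reached = cursor.seek(target);
        if reached == TERMINATED {
            break; // nothing left to match in this segment
        }
        if reached == target {
            out.push((target, cursor.score()));
        }
    }
    out
}
```

Because the candidates are sorted before the loop, each seek target is at least as large as the previous one, so a single pass over the postings is enough regardless of how the ANN stage ordered its results.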

Approach 3: Weight::for_each (middle ground). Calls a closure for every matching (DocId, Score) pair within a segment. Less flexible than a full Collector but useful for simple score extraction without the trait boilerplate. Note also that Query::explain() returns a structured Explanation tree for any single document — useful for debugging but too expensive for bulk scoring.

Keeping Tantivy and tidalDB's entity store in sync

This is where the real integration complexity lives. Tantivy is crash-safe within itself — meta.json updates atomically, uncommitted documents vanish on crash, and the index always recovers to its last successful commit. But Tantivy has no concept of external transactions. Writing to both Tantivy and an external database is the classic dual-write problem, with four failure modes that must be explicitly handled.

Tantivy's commit model in brief: Documents are queued in memory across internal indexing threads (up to 8). Nothing is visible until commit(), which flushes all in-memory segments to disk and atomically updates meta.json. A crash before commit completion rolls back to the previous state. Each operation gets a monotonically increasing opstamp (u64), and commits can carry an arbitrary string payload via set_payload() — this is the coordination primitive.

The single-writer lock is enforced via a filesystem lock file (.tantivy-writer.lock). Only one IndexWriter can exist per index at a time. The writer is internally multi-threaded, so add_document() and delete_term() are thread-safe, but commit() requires exclusive access. For tidalDB, this means serializing write access through a single writer instance, likely behind an Arc<RwLock<IndexWriter>>.

The recommended consistency pattern is DB-primary with Tantivy as a derived index:

1. Write to tidalDB's entity store first, within a transaction that also writes to an outbox table (or use change data capture, CDC).
2. A background indexer reads the outbox and feeds documents into Tantivy's IndexWriter.
3. On each Tantivy commit, obtain a PreparedCommit from prepare_commit(), call set_payload() on it with the last processed outbox sequence number, and then call commit().
4. On crash recovery, read Tantivy's last commit payload to determine the resume point and replay from there.

This treats the entity store as the source of truth and Tantivy as a materialized view that can be rebuilt. The lag between entity store write and search visibility equals outbox_poll_interval + tantivy_commit_time.
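The resume-from-watermark logic in the steps above can be sketched with an in-memory stand-in. This is an illustration under stated assumptions, not Tantivy's API: OutboxEntry, MockIndex, and run_indexer are hypothetical, and outbox sequence numbers are assumed to start at 1 so that 0 can mean "nothing indexed yet".

```rust
/// Hypothetical outbox row: a monotonically increasing sequence number plus the document text.
#[derive(Clone, Debug, PartialEq)]
struct OutboxEntry {
    seq: u64,
    doc: String,
}

/// In-memory stand-in for a search index with commit semantics (not Tantivy's API).
#[derive(Default)]
struct MockIndex {
    pending: Vec<String>,
    committed: Vec<String>,
    /// Payload stored with the last successful commit: the last outbox seq it covers.
    watermark: Option<u64>,
}

impl MockIndex {
    fn add(&mut self, doc: &str) {
        self.pending.push(doc.to_string());
    }

    /// Publish pending docs and record the watermark in the same step.
    fn commit(&mut self, last_seq: u64) {
        self.committed.append(&mut self.pending);
        self.watermark = Some(last_seq);
    }

    /// Simulate a crash: uncommitted docs vanish, committed state survives.
    fn crash(&mut self) {
        self.pending.clear();
    }
}

/// Index every outbox entry newer than the watermark, then commit once.
/// Returns how many entries were (re)indexed.
fn run_indexer(outbox: &[OutboxEntry], index: &mut MockIndex) -> usize {
    let resume_after = index.watermark.unwrap_or(0); // assumes seq starts at 1
    let todo: Vec<&OutboxEntry> = outbox.iter().filter(|e| e.seq > resume_after).collect();
    for entry in &todo {
        index.add(&entry.doc);
    }
    if let Some(last) = todo.last() {
        index.commit(last.seq);
    }
    todo.len()
}
```

The design point is that the watermark travels with the commit itself, so after a crash the index and its resume position cannot disagree; any work lost with the uncommitted batch is simply replayed from the outbox.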

For tighter consistency, use prepare_commit() for pseudo-two-phase commit: Call prepare_commit() to flush segments to disk without making them visible, then write to the DB, then call commit() or abort(). If the process crashes between prepare_commit() and commit(), Tantivy rolls back, and the gap is healed by replaying from the DB using the opstamp watermark. This is not true 2PC — a crash between DB commit and Tantivy commit leaves the DB ahead — but the recovery path is deterministic.

Document updates require delete-then-add — there is no atomic update API. Use a designated ID field, call delete_term(Term::from_field_text(id_field, "doc-123")), then add_document(new_doc), then commit(). Both operations within the same commit batch are safe; the delete applies to prior commits and earlier operations in the batch.

Performance at 10M documents is feasible but not heavily benchmarked

The most authoritative benchmarks come from the search-benchmark-game (maintained by the Tantivy team) running on English Wikipedia (~6M documents) on an AWS c7i.2xlarge, and from Tantivy author Paul Masurel's 2017 blog posts. Concrete numbers at 10M documents specifically are scarce, but extrapolation is reasonable given the architecture.

Indexing throughput on the Wikipedia corpus (5M documents, title + body fields, positions indexed): ~53,000 docs/sec with 4 threads on 2017 hardware, or about 94 seconds for the full corpus. With merging enabled, this drops to ~21,000–28,000 docs/sec due to background merge overhead. For simpler structured documents (HTTP logs), throughput reached ~135,000 docs/sec. tidalDB's 4–5 text field documents would likely land at 30,000–50,000 docs/sec, putting a full 10M document index build at 3–6 minutes on modern hardware.

Query latency on warm cache is consistently in the microseconds to low milliseconds range for single-threaded queries across term, phrase, and boolean query types on the 6M-document Wikipedia corpus. The Tantivy README historically claimed "approximately 2x faster than Lucene," though Lucene 10.3 (late 2025) has closed much of that gap. Tantivy's advantages are strongest on COUNT queries (popcnt optimization) and phrase queries (sorted array intersection).

Memory model is mmap-based for search and budget-controlled for indexing. The IndexWriter takes a configurable heap budget (default 1GB in the CLI, minimum ~15MB per thread). Search requires minimal anonymous memory — index files are memory-mapped, and performance depends on OS page cache residency. For a 10M document index with 4–5 text fields, expect an index size of roughly 5–8 GB on disk (based on the ~38% compression ratio observed for Wikipedia: ~3.1 GB index from ~8 GB raw JSON). Keeping this in page cache requires equivalent RAM.
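The build-time and index-size estimates above are simple ratios, and writing them down makes it easy to re-run them with measured numbers. The helper names below are hypothetical, and the projection assumes that the Wikipedia index-to-raw ratio carries over to tidalDB's documents, which has not been verified.

```rust
/// Minutes needed to index `docs` documents at a sustained `docs_per_sec`.
fn build_minutes(docs: u64, docs_per_sec: u64) -> f64 {
    docs as f64 / docs_per_sec as f64 / 60.0
}

/// Index-to-raw size ratio, e.g. 3.1 GB of index from 8 GB of raw JSON.
fn index_ratio(index_gb: f64, raw_gb: f64) -> f64 {
    index_gb / raw_gb
}

/// Projected index size for a corpus of `raw_gb`, assuming the ratio carries over.
fn projected_index_gb(raw_gb: f64, ratio: f64) -> f64 {
    raw_gb * ratio
}
```

With the figures quoted above, build_minutes(10_000_000, 30_000) is about 5.6 and build_minutes(10_000_000, 50_000) is about 3.3, which is where the 3–6 minute range comes from. index_ratio(3.1, 8.0) gives roughly 0.39; applied to a hypothetical 16 GB raw corpus, projected_index_gb yields about 6.2 GB.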

Scaling to 10M is architecturally sound. Tantivy uses u32 doc IDs per segment (a limit of roughly 4 billion documents per segment) and searches segments in parallel when configured with a thread pool. Segment count matters: half a dozen segments have negligible impact versus a single segment, but hundreds of tiny segments degrade query performance measurably. The LogMergePolicy handles this automatically in steady state.

Start with Reciprocal Rank Fusion, graduate to tuned linear combination

For combining BM25 text scores with ANN vector similarity scores, Reciprocal Rank Fusion (RRF) with k=60 is the recommended starting point, and a tuned linear combination with min-max normalization is the upgrade path when relevance labels become available.

RRF (Cormack, Clarke, Büttcher, SIGIR 2009) fuses ranked lists using only rank positions, eliminating the score normalization problem entirely:

RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))

Documents appearing in only one list contribute only that term. The original paper showed RRF outperforming Condorcet fusion all 7 times tested (p ≈ 0.008) and CombMNZ 6/7 times (p ≈ 0.04), with typical MAP improvements of 4–5% over competing methods. The k=60 constant is not sensitive — values from 30–100 yield nearly identical results. A Rust implementation exists on crates.io as the rrf crate, supporting weighted fusion in a one-liner: fuse_weighted(&[bm25_list, vector_list], &[1.0, 1.0], 60).
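For reference, the RRF formula is small enough to implement directly. The sketch below uses only the standard library and is independent of the rrf crate mentioned above; rrf_fuse is a hypothetical name, document IDs are assumed to be u64, and ranks are taken as 1-based positions in each best-first list.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over any number of ranked lists of document IDs.
/// Each list is ordered best-first; ranks are 1-based.
fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (i, &doc) in list.iter().enumerate() {
            let rank = (i + 1) as f64;
            // A document missing from a list simply contributes nothing for that list.
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by doc ID for deterministic output.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

For example, fusing a BM25 list [1, 2, 3] with an ANN list [3, 1, 4] at k = 60 ranks document 1 first (1/61 + 1/62), then document 3 (1/63 + 1/61), followed by the two documents that appear in only one list.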

Production systems are split on approach. Qdrant and Elasticsearch default to RRF. Weaviate switched from RRF (rankedFusion) to min-max normalized linear combination (relativeScoreFusion) as their default in v1.24, arguing it preserves score distribution information. Vespa benchmarks on NFCorpus showed atan-normalized linear combination (NDCG@10 = 0.341) beating RRF (0.320), though margins are dataset-dependent. OpenSearch supports both and recommends RRF when score distributions are heterogeneous.

The score scale mismatch is real but solvable. BM25 scores are unbounded above (typically 0–25+), while cosine similarity is bounded in [-1, 1] (often treated as [0, 1] in practice). For linear combination, min-max normalization (norm(s) = (s - min) / (max - min)) is the most validated approach (Lee 1997, Wu et al. 2006). An alternative is atan normalization (norm(s) = 2·atan(s/C)/π), which Vespa uses and which avoids the need to know the global min/max at query time.
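Both normalizations are one-liners in practice. The sketch below is a direct transcription of the two formulas; the function names are hypothetical, and returning 0.0 when every score in the batch is identical is one arbitrary convention among several.

```rust
/// Min-max normalization of a batch of scores into [0, 1].
/// If all scores are equal the range is zero, so every score maps to 0.0 by convention.
fn min_max(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|&s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}

/// atan normalization: maps a non-negative score into [0, 1) using scale constant `c`.
/// Works per score, so no batch-wide min or max is needed.
fn atan_norm(score: f32, c: f32) -> f32 {
    2.0 * (score / c).atan() / std::f32::consts::PI
}
```

Note the practical difference: min_max depends on the other scores in the same result set, so the same document can normalize differently from query to query, while atan_norm depends only on the choice of C.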

Bruch, Gai, and Ingber (ACM TOIS, 2024) challenged the "RRF needs no tuning" narrative, finding that convex combination (linear with learned α) outperforms RRF in both in-domain and out-of-domain settings when even a small training set is available. Their key insight: RRF discards score magnitude information, which is wasteful when both scoring functions produce meaningful distances. For tidalDB, this suggests starting with RRF for zero-configuration robustness, then implementing score(d) = α·norm(bm25(d)) + (1-α)·cosine_sim(d) once relevance labels exist to tune α.
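Once labels exist, tuning α can start as a plain grid search. The sketch below is a simplified illustration, not the procedure from the paper: Labeled, correct_pairs, and tune_alpha are hypothetical names, BM25 is assumed to be normalized already, and the objective is a basic pairwise count of relevant documents outscoring non-relevant ones.

```rust
/// One labeled example: normalized BM25, cosine similarity, and a binary relevance label.
struct Labeled {
    bm25_norm: f32,
    cosine: f32,
    relevant: bool,
}

/// Convex combination of two scores that are already on comparable scales.
fn hybrid_score(alpha: f32, bm25_norm: f32, cosine: f32) -> f32 {
    alpha * bm25_norm + (1.0 - alpha) * cosine
}

/// Count (relevant, non-relevant) pairs where the relevant doc scores strictly higher.
fn correct_pairs(alpha: f32, docs: &[Labeled]) -> usize {
    let mut n = 0;
    for r in docs.iter().filter(|d| d.relevant) {
        for x in docs.iter().filter(|d| !d.relevant) {
            let rel = hybrid_score(alpha, r.bm25_norm, r.cosine);
            let non = hybrid_score(alpha, x.bm25_norm, x.cosine);
            if rel > non {
                n += 1;
            }
        }
    }
    n
}

/// Grid-search alpha in steps of 0.1, keeping the first value with the most correct pairs.
fn tune_alpha(docs: &[Labeled]) -> f32 {
    let mut best = (0.0_f32, 0_usize);
    for step in 0..=10 {
        let alpha = step as f32 / 10.0;
        let score = correct_pairs(alpha, docs);
        if score > best.1 {
            best = (alpha, score);
        }
    }
    best.0
}
```

A real evaluation would group pairs per query and use a ranking metric such as NDCG, but the structure stays the same: score with a candidate α, measure, keep the best.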

Operational gotchas that will bite in production

Segment merging is the primary latency risk. Merging runs in background threads managed by the IndexWriter, governed by the LogMergePolicy (default). After each commit, the policy evaluates whether small segments should be merged into larger ones. Merging does not block readers — a Searcher captures an immutable snapshot at acquisition time — but it consumes CPU and disk I/O that can cause latency spikes on I/O-constrained systems. For bulk loading, set NoMergePolicy during ingest and trigger merging afterward. For steady-state operation, the LogMergePolicy parameters (min_num_segments, max_docs_before_merge, del_docs_ratio_before_merge) should be tuned to tidalDB's write pattern. Call wait_merging_threads() before dropping the IndexWriter.

Schema evolution is additive-only. New fields can be added to an existing index — old segments simply lack data for those fields, which is treated as absent. Removing fields or changing field types requires a full re-index. Changing tokenizers for existing fields also requires re-indexing, since old segments were tokenized with the old analyzer. Tantivy's JSON field type (added in 0.17) provides schema flexibility for semi-structured data without knowing nested field names in advance. A full re-index of 10M documents at ~30K docs/sec takes approximately 5–6 minutes — operationally feasible if the entity store is the source of truth and the index can be rebuilt into a new directory and swapped atomically.

The single-writer lock is non-negotiable. Tantivy enforces one IndexWriter per index via a filesystem lock file. The writer is internally multi-threaded (up to 8 threads), so single-writer does not mean single-threaded, but it does mean tidalDB's write path must serialize through a single writer instance. If the process crashes, the lock file may remain as a stale lock that must be manually deleted. This matches the DB-primary architecture: a single background indexer process owns the Tantivy writer.

Commit frequency is a throughput/latency tradeoff. Each commit flushes one segment per active indexing thread, creating potentially 4–8 new segments per commit. Committing too frequently creates many small segments, increasing merge pressure and degrading query performance until merging catches up. Committing too rarely increases the lag between entity store write and search visibility. For tidalDB's use case, committing every 1–5 seconds (or every N thousand documents) is a reasonable starting point, with the LogMergePolicy handling segment consolidation automatically.
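The "every 1–5 seconds or every N thousand documents" rule can be captured in a small policy object that the background indexer consults after each batch. This is a hypothetical sketch: CommitPolicy and its fields are illustrative names, and the thresholds are placeholders to be tuned against measured merge pressure.

```rust
use std::time::Duration;

/// Hypothetical commit trigger: commit when either threshold is reached.
struct CommitPolicy {
    max_pending_docs: usize,
    max_interval: Duration,
}

impl CommitPolicy {
    /// Decide whether the indexer should commit now.
    fn should_commit(&self, pending_docs: usize, since_last_commit: Duration) -> bool {
        if pending_docs == 0 {
            return false; // nothing to publish, so avoid creating empty commits
        }
        pending_docs >= self.max_pending_docs || since_last_commit >= self.max_interval
    }
}
```

The document-count threshold bounds segment size during bursts, while the time threshold bounds search-visibility lag during quiet periods.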

Why not build a minimal BM25 engine instead?

A minimal BM25-only inverted index in Rust — no phrase queries, no fuzzy matching, no segment merging — would require roughly 2,000–4,000 lines of code: tokenization (~300 lines using rust-stemmers), term dictionary with an FST or HashMap (~300 lines), posting lists with basic compression (~300 lines), field norms for length normalization (~150 lines), BM25 scoring (~200 lines), disk serialization (~400 lines), and boolean query processing (~300 lines). An in-memory-only version using the bm25 crate on crates.io gets even simpler.

This is a trap. The first 80% is easy; the remaining 20% is where Tantivy's 40,000 lines live: concurrent indexing with configurable thread pools, crash-safe atomic commits, segment merging with configurable policies, mmap-based I/O for low-memory search, LZ4/Zstd compression for doc stores, delete handling via alive bitsets, multi-segment query execution, and the full Weight/Scorer pipeline with block-max WAND pruning. tidalDB would eventually need incremental updates, concurrent read/write, and crash safety — all of which Tantivy provides and a minimal engine does not. The bm25 crate is useful for prototyping but offers no persistence, no concurrent access, and no incremental updates.

The correct comparison is not lines of code but time-to-production. ParadeDB, Quickwit, and Milvus all embed Tantivy rather than building their own inverted index, despite having the engineering resources to do so. SurrealDB is the notable exception, and they cite ACID requirements as the primary reason — a constraint that tidalDB's DB-primary architecture already handles by treating Tantivy as a derived index rather than a source of truth. Notably, Meilisearch built their own engine (milli, ~17K lines) on top of LMDB, but they needed Algolia-style bucket ranking, not BM25 — a fundamentally different scoring model that would have required fighting Tantivy's BM25 assumptions.

Open questions that need prototyping before committing

DocAddress mapping. Tantivy's DocAddress is a (SegmentOrdinal, DocId) pair that changes when segments merge. tidalDB needs a stable external ID → DocAddress mapping. The standard pattern is to store an external ID as a fast field and maintain a lookup, but the performance cost of this mapping at 10M documents needs measurement. ParadeDB solved this by integrating with Postgres's ctid system — tidalDB will need its own equivalent.
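One way to frame the measurement is a lookup that is rebuilt for each searcher snapshot. The sketch below models segments as plain vectors rather than Tantivy fast fields; Addr and build_lookup are hypothetical names, and the per-document Option stands in for whatever delete tracking the real index exposes.

```rust
use std::collections::HashMap;

/// (segment ordinal, segment-local doc ID), mirroring the shape of a DocAddress.
type Addr = (u32, u32);

/// Rebuild the external-ID lookup from a snapshot of segments.
/// `segments[ord][doc_id]` is the external ID stored for that doc, or None if deleted.
fn build_lookup(segments: &[Vec<Option<u64>>]) -> HashMap<u64, Addr> {
    let mut map = HashMap::new();
    for (ord, seg) in segments.iter().enumerate() {
        for (doc_id, ext) in seg.iter().enumerate() {
            if let Some(id) = ext {
                map.insert(*id, (ord as u32, doc_id as u32));
            }
        }
    }
    map
}
```

After a merge the same external ID resolves to a different address, which is why any cached lookup has to be tied to a specific snapshot rather than kept across reloads. The cost to measure is the rebuild time and memory of this map at 10M entries.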

Score stability across commits. BM25 scores depend on corpus statistics (document frequency, average field length). As documents are added or removed, scores for the same query-document pair shift. If tidalDB's ranking profiles use BM25 as a feature with learned weights, score drift could degrade ranking quality. This needs characterization: how much do BM25 scores drift as a 10M-document corpus grows by 1%? By 10%?
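A first-order estimate of the drift can be computed before any prototyping. The sketch below assumes the common Lucene-style BM25 form with k1 = 1.2 and b = 0.75, and models the simplest case, where the corpus grows but the query term's document frequency stays fixed. All function names are hypothetical, and real drift also depends on changes in average field length.

```rust
/// BM25 IDF in the Lucene-style form ln(1 + (N - n + 0.5) / (n + 0.5)).
fn idf(total_docs: f64, doc_freq: f64) -> f64 {
    (1.0 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln()
}

/// BM25 score of one term in one document, with assumed defaults k1 = 1.2 and b = 0.75.
fn bm25(total_docs: f64, doc_freq: f64, tf: f64, field_len: f64, avg_len: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    let norm = k1 * (1.0 - b + b * field_len / avg_len);
    idf(total_docs, doc_freq) * tf * (k1 + 1.0) / (tf + norm)
}

/// Relative score change when the corpus grows by `growth` (0.10 means 10%)
/// and none of the new documents contain the term.
fn relative_drift(total_docs: f64, doc_freq: f64, growth: f64) -> f64 {
    let before = bm25(total_docs, doc_freq, 1.0, 100.0, 100.0);
    let after = bm25(total_docs * (1.0 + growth), doc_freq, 1.0, 100.0, 100.0);
    (after - before) / before
}
```

Under these assumptions the drift is logarithmic in corpus size: for a term found in 1,000 of 10M documents, 10% corpus growth moves the score by roughly one percent. Terms whose frequency shifts sharply relative to the corpus would drift more, which is what the prototype should characterize.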

Seek performance on candidate sets. The DocSet::seek() pattern for scoring ANN candidates needs benchmarking. If tidalDB retrieves 1,000 ANN candidates and seeks through Tantivy's posting lists for each, the forward-only constraint means worst-case traversal of the entire posting list. For high-frequency terms, this could be expensive. A prototype should measure seek latency for candidate sets of 100, 1,000, and 10,000 documents against queries of varying selectivity.

Merge latency under concurrent search load. The LogMergePolicy runs merges in background threads that compete for I/O bandwidth with mmap-based search. On a system serving p99 latency SLAs while continuously ingesting documents, the interaction between merge I/O and query I/O needs measurement on tidalDB's target hardware, particularly if the index exceeds available RAM and the page cache cannot hold everything.

Two-phase commit reliability. The prepare_commit() → external DB write → commit() pattern needs fault injection testing. Specifically: what happens if prepare_commit() succeeds, the DB write commits, and then the process crashes before commit()? Tantivy will roll back on restart, but the entity store will be ahead. The recovery path (replaying from the DB using the opstamp watermark) needs to be proven correct under concurrent operations.
