---
title: "Tantivy is a derived index, and that changes everything about crash recovery"
date: "2026-02-22"
author: "Jordan Washburn"
description: "The entity store is the source of truth. Tantivy is a materialized view. This architectural decision eliminated distributed transactions, simplified crash recovery to a single sequence number, and gave us a synchronous flush in 12 lines of code."
tags: ["search", "architecture", "rust", "internals"]
---
The hardest problem in embedding a full-text search engine is not tokenization, BM25 scoring, or segment merging. It is keeping two storage systems in sync.

You have your primary store -- the place entities live, the thing your WAL protects. And you have Tantivy, which maintains its own segment files, its own commit log, its own crash recovery. If you treat both as sources of truth, you need to atomically update both on every write. That is a distributed transaction problem. In a single-process embedded database. The absurdity of the situation tells you the mental model is wrong.

The right mental model: Tantivy is a derived index. The entity store is canonical. Tantivy is a materialized view that can be rebuilt from storage at any time, for any reason. This decision, made before we wrote a line of integration code, eliminated an entire class of consistency problems and gave us a crash recovery story that fits in a single sentence: read the last committed sequence number from Tantivy, replay from there.

This post describes what we built, how the syncer architecture works with Tantivy's grain instead of against it, and why "derived index" is the unifying model for every secondary index in tidalDB.

## The research said this would work
Before writing integration code, we produced a [research document](/docs/research/tantivy.md) that surveyed Tantivy's internals, scored the API surface for our use case, and identified the real engineering risks. The research confirmed three things.

First, raw BM25 score extraction -- the thing we needed most -- was well-supported through the Collector trait. Second, `DocSet::seek()` enabled targeted scoring of pre-existing candidate sets without full index scans. Third, the consistency problem was solvable if we chose the right ownership model.

The research also flagged a cautionary signal: SurrealDB rejected Tantivy because its non-ACID commit model conflicted with their transactional requirements. They built their own BM25 engine. We took the opposite lesson. SurrealDB needed Tantivy to be a co-equal participant in transactions. We needed it to be a follower. The DB-primary architecture we had already chosen -- WAL-first durability, entity store as source of truth, everything else derived -- meant Tantivy's commit model was not a liability. It was irrelevant. If Tantivy loses data, the entity store still has it.

## What "derived" means in practice
Every write to tidalDB's entity store produces a side effect: a `PendingWrite` struct is sent over a crossbeam channel to a background thread called the text syncer.
```rust
// From tidal/src/db/items.rs -- entity write path
if let Ok(guard) = self.text_tx.lock()
    && let Some(tx) = guard.as_ref()
{
    let _ = tx.send(crate::text::PendingWrite {
        entity_id: id,
        metadata: metadata.clone(),
        seq: 0,
        deleted: false,
    });
}
```
The entity store write completes immediately. The application does not wait for Tantivy. The `PendingWrite` carries the entity ID, the metadata to index, and a sequence number. The sequence number is the coordination primitive -- the one piece of state that connects the entity store's timeline to Tantivy's timeline.

The channel is unbounded, so the send never blocks, and it is best-effort: if the syncer thread has shut down, the send fails, the error is discarded (`let _ =`), and that update never reaches Tantivy. But the entity store recorded it. This asymmetry is the entire point: the entity store is always ahead of or equal to the text index. Never behind.

## The syncer thread
The syncer is a loop on a dedicated thread. It receives `PendingWrite` events, batches them, and commits to Tantivy when either of two thresholds is reached: 1,000 documents or 2 seconds, whichever comes first.
```rust
// Simplified from tidal/src/text/syncer.rs -- flush handling omitted for clarity
loop {
    match self.rx.recv_timeout(Duration::from_millis(100)) {
        Ok(update) => {
            if update.deleted {
                writer.delete_item(update.entity_id);
            } else {
                writer.index_item(update.entity_id, &update.metadata)?;
            }
            last_seq = update.seq;
            pending_count += 1;
            // Check both thresholds here. Under a steady trickle of writes the
            // Timeout arm never fires, so the time bound must hold on this path too.
            if pending_count >= self.commit_every_n
                || last_commit.elapsed() >= self.commit_every
            {
                writer.commit(last_seq)?;
                pending_count = 0;
                last_commit = Instant::now();
            }
        }
        Err(RecvTimeoutError::Timeout) => {
            // Idle: commit whatever is pending once the time threshold has passed.
            if pending_count > 0 && last_commit.elapsed() >= self.commit_every {
                writer.commit(last_seq)?;
                pending_count = 0;
                last_commit = Instant::now();
            }
        }
        Err(RecvTimeoutError::Disconnected) => {
            // All senders are gone and the channel is empty: final commit, then exit.
            if pending_count > 0 {
                writer.commit(last_seq)?;
            }
            break;
        }
    }
}
```
Batch commits amortize Tantivy's per-commit cost. Each commit flushes one segment per indexing thread, updates `meta.json` atomically, and makes new documents visible to readers. Committing too frequently creates many small segments and increases merge pressure. Committing too rarely increases the lag between an entity write and its visibility in search. The 1,000-doc / 2-second thresholds are a starting point that bounds worst-case commit lag to roughly 2 seconds (plus the 100 ms poll interval and the reader reload delay) while keeping segment count manageable.

The syncer holds Tantivy's writer lock for its entire lifetime. This is correct because Tantivy enforces single-writer semantics -- only one `IndexWriter` can exist per index at a time. There is no benefit to releasing and re-acquiring the lock between commits.

On shutdown (channel disconnect), the syncer drains remaining events and commits before exiting. No write that reached the channel is dropped.

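
The batching and shutdown behavior described above can be exercised without Tantivy. The following is a minimal sketch, not the tidalDB implementation: it assumes `std::sync::mpsc` in place of crossbeam, a reduced `PendingWrite` carrying only `seq`, and a `Vec<u64>` that records the sequence number each commit would have received. `run_syncer` and `demo` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Reduced stand-in for the post's `PendingWrite`; only `seq` matters here.
pub struct PendingWrite {
    pub seq: u64,
}

/// Runs the batching loop until the channel disconnects. Returns the sequence
/// number passed to each (simulated) commit, in order.
pub fn run_syncer(
    rx: Receiver<PendingWrite>,
    commit_every_n: usize,
    commit_every: Duration,
) -> Vec<u64> {
    let mut commits = Vec::new();
    let mut pending = 0usize;
    let mut last_seq = 0u64;
    let mut last_commit = Instant::now();
    loop {
        let disconnected = match rx.recv_timeout(Duration::from_millis(100)) {
            Ok(write) => {
                last_seq = write.seq;
                pending += 1;
                false
            }
            Err(RecvTimeoutError::Timeout) => false,
            // Reported only after every buffered message has been received.
            Err(RecvTimeoutError::Disconnected) => true,
        };
        let due = pending >= commit_every_n || last_commit.elapsed() >= commit_every;
        if pending > 0 && (due || disconnected) {
            commits.push(last_seq); // stands in for writer.commit(last_seq)
            pending = 0;
            last_commit = Instant::now();
        }
        if disconnected {
            break;
        }
    }
    commits
}

/// Sends writes with sequence numbers 1..=total, then drops the sender.
/// The time threshold is set very high so only the count threshold and the
/// shutdown path trigger commits.
pub fn demo(total: u64, commit_every_n: usize) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    for seq in 1..=total {
        tx.send(PendingWrite { seq }).unwrap();
    }
    drop(tx);
    run_syncer(rx, commit_every_n, Duration::from_secs(3600))
}
```

With 2,500 writes and a 1,000-document threshold, the sketch commits at sequence numbers 1000 and 2000, then once more at 2500 when the channel disconnects, which is the shutdown drain in miniature.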
## The commit payload: one number does all the work
Every Tantivy commit carries a payload -- an arbitrary string stored in `meta.json`. We store the last processed sequence number:
```rust
// From tidal/src/text/writer.rs
pub fn commit(&mut self, last_seq: u64) -> crate::Result<()> {
    let mut prepared = self.writer
        .prepare_commit()
        .map_err(|e| TidalError::Internal(format!("tantivy prepare_commit: {e}")))?;
    prepared.set_payload(&last_seq.to_string());
    prepared
        .commit()
        .map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
    Ok(())
}
```
This is two-phase commit within Tantivy -- `prepare_commit()` flushes segments to disk, `set_payload()` attaches the sequence number, `commit()` makes everything visible atomically. But it is not two-phase commit between Tantivy and the entity store. The entity store does not participate. It does not need to. It is already durable. Tantivy is catching up.

On restart, crash recovery reads the payload:
```rust
// From tidal/src/text/writer.rs
pub fn last_committed_seq(index: &tantivy::Index) -> u64 {
    index
        .load_metas()
        .ok()
        .and_then(|meta| meta.payload)
        .and_then(|p| p.parse::<u64>().ok())
        .unwrap_or(0)
}
```
If the sequence number is 0 (fresh index, missing or unparseable payload, or unreadable metadata), rebuild the entire text index from the entity store. If it is N, replay entity writes from N+1 forward. The entity store is the source of truth. The sequence number bounds the replay window. That is the entire crash recovery protocol.

Compare this to the alternative: if Tantivy were a co-equal source of truth, a crash between the entity store commit and the Tantivy commit would leave them inconsistent. You would need to detect the inconsistency, determine which system is ahead, and reconcile. That is a distributed systems problem. In an embedded database. We chose not to have that problem.

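
The recovery decision itself is small enough to write as a pure function. This is a sketch under assumptions, not code from the repository: `RecoveryPlan` and `recovery_plan` are hypothetical names, and the input is the raw commit payload string as it would come back from `meta.json`.

```rust
/// What the text index should do on startup.
#[derive(Debug, PartialEq, Eq)]
pub enum RecoveryPlan {
    /// No usable sequence number: rebuild everything from the entity store.
    RebuildAll,
    /// Replay entity writes starting at this sequence number.
    ReplayFrom(u64),
}

/// Maps the commit payload to a recovery plan. A missing payload, an
/// unparseable payload, and a payload of 0 all mean "rebuild", mirroring the
/// `unwrap_or(0)` fallback in `last_committed_seq`.
pub fn recovery_plan(payload: Option<&str>) -> RecoveryPlan {
    match payload.and_then(|p| p.parse::<u64>().ok()) {
        Some(n) if n > 0 => RecoveryPlan::ReplayFrom(n.saturating_add(1)),
        _ => RecoveryPlan::RebuildAll,
    }
}
```

There is no branch for "the text index is ahead of the entity store", because in this design that state cannot be produced by the write path.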
## The reader/writer split
Tantivy separates reads from writes at the architecture level. The `IndexWriter` owns mutation. The `IndexReader` owns search. They do not share locks. A `Searcher` obtained from the reader captures a snapshot of the committed segments and does not see subsequent commits until the reader is reloaded.

We work with this design, not against it:
```rust
// From tidal/src/text/index.rs
pub struct TextIndex {
    pub(crate) index: Index,
    pub(crate) writer: Mutex<IndexWriter>,
    pub(crate) reader: IndexReader,
    pub(crate) fields: Arc<TantivyFields>,
    pub(crate) config: TextIndexConfig,
    pub(crate) entity_map: Arc<DashMap<u64, (u32, u32)>>,
}
```
The writer lives behind a `Mutex` and is exclusively owned by the syncer thread. The reader reloads automatically after each commit (with a short delay) in production, or manually in tests. This means search queries never block on writes, and writes never block on search queries. The only contention point is the reader reload, which is a lightweight operation -- swap a pointer to the new set of segment readers.

The `entity_map` is a `DashMap` that translates entity IDs to Tantivy `(segment_ord, doc_id)` pairs. It is rebuilt after every commit. This map enables the targeted scoring path: given a set of candidate entities from vector search, look up their Tantivy doc addresses and seek directly to them in the posting list. No full index scan required.

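
The lookup step can be sketched as follows. This is an illustration under assumptions rather than the tidalDB code: it uses a plain `HashMap` instead of `DashMap`, and `resolve_candidates` is a hypothetical name. It resolves candidate entity IDs to doc addresses and sorts them so that each segment can later be scored in ascending doc order.

```rust
use std::collections::HashMap;

/// `(segment_ord, doc_id)`, mirroring the value type of `entity_map`.
pub type DocAddr = (u32, u32);

/// Resolves candidate entity IDs to `(segment_ord, doc_id, entity_id)` triples,
/// sorted ascending by segment and then by doc id. Candidates that are not in
/// the map (for example, not yet committed to the text index) are skipped.
pub fn resolve_candidates(
    entity_map: &HashMap<u64, DocAddr>,
    candidates: &[u64],
) -> Vec<(u32, u32, u64)> {
    let mut resolved: Vec<(u32, u32, u64)> = candidates
        .iter()
        .filter_map(|id| entity_map.get(id).map(|&(seg, doc)| (seg, doc, *id)))
        .collect();
    // Tuple ordering sorts by segment_ord first, then doc_id.
    resolved.sort_unstable();
    resolved
}
```

Skipping unmapped candidates is one possible policy. It matches the derived-index model, where an entity can exist in the store before the text index has caught up with it.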
## The synchronous flush
Tests need determinism. "Sleep 2.5 seconds and hope the syncer has committed" is not determinism. So we added a flush channel:
```rust
// From tidal/src/db/items.rs
pub fn flush_text_index(&self) -> crate::Result<()> {
    if let Some(ref flush_tx) = self.text_flush_tx {
        let (ack_tx, ack_rx) = crossbeam::channel::bounded(1);
        let _ = flush_tx.send(ack_tx);
        let _ = ack_rx.recv_timeout(std::time::Duration::from_secs(10));
    }
    self.reload_text_index()
}
```
The caller sends a one-shot `Sender<()>` to the syncer. The syncer drains all pending writes from the channel, commits, and sends acknowledgment back. The caller blocks until it receives the ack, then reloads the reader. After `flush_text_index()` returns, every entity written before the flush is visible in search. Deterministic. No sleep. Note that the code above discards send and timeout errors, so the guarantee assumes the syncer is running and acknowledges within the 10-second timeout.

This pattern is simple because the syncer already owns the writer lock and the commit logic. The flush is not a new code path -- it is a trigger for the existing commit path with a synchronization barrier.

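
The flush handshake can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the tidalDB syncer: it uses `std::sync::mpsc` instead of crossbeam, polls the two channels in turn because the standard library has no `select`, counts documents instead of committing to Tantivy, and, unlike the function shown above, returns an error when the acknowledgment does not arrive. `Handles`, `spawn_syncer`, `flush`, and `demo` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender, TryRecvError};
use std::thread;
use std::time::Duration;

pub struct Handles {
    pub write_tx: Sender<u64>,
    /// Each flush request carries the one-shot sender used for the ack.
    pub flush_tx: Sender<Sender<usize>>,
}

pub fn spawn_syncer() -> Handles {
    let (write_tx, write_rx) = mpsc::channel::<u64>();
    let (flush_tx, flush_rx) = mpsc::channel::<Sender<usize>>();
    thread::spawn(move || syncer_loop(write_rx, flush_rx));
    Handles { write_tx, flush_tx }
}

fn syncer_loop(write_rx: Receiver<u64>, flush_rx: Receiver<Sender<usize>>) {
    let mut committed = 0usize; // documents made visible so far
    let mut pending = 0usize; // documents received but not yet committed
    loop {
        match flush_rx.try_recv() {
            Ok(ack) => {
                // Writes sent before the flush request are already queued:
                // drain them, "commit", then acknowledge.
                pending += write_rx.try_iter().count();
                committed += pending;
                pending = 0;
                let _ = ack.send(committed);
            }
            Err(TryRecvError::Empty) => {}
            Err(TryRecvError::Disconnected) => break,
        }
        match write_rx.recv_timeout(Duration::from_millis(10)) {
            Ok(_entity_id) => pending += 1,
            Err(RecvTimeoutError::Timeout) => {}
            // A real syncer would commit any pending writes before exiting.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}

/// Blocks until the syncer has committed everything sent before this call.
/// Returns the committed document count, or an error if no ack arrives.
pub fn flush(h: &Handles, timeout: Duration) -> Result<usize, String> {
    let (ack_tx, ack_rx) = mpsc::channel();
    h.flush_tx
        .send(ack_tx)
        .map_err(|_| "syncer has shut down".to_string())?;
    ack_rx
        .recv_timeout(timeout)
        .map_err(|e| format!("flush not acknowledged: {}", e))
}

pub fn demo() -> Result<usize, String> {
    let h = spawn_syncer();
    for id in 0..250u64 {
        h.write_tx.send(id).map_err(|e| e.to_string())?;
    }
    flush(&h, Duration::from_secs(10))
}
```

The barrier works because the caller enqueues its writes before it enqueues the flush request, so by the time the syncer sees the request, every earlier write is already sitting in the write channel.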
## Two collectors, two use cases
The search pipeline needs BM25 scores in two modes. The `AllScoresCollector` returns every matching document with its score -- no top-K truncation, no re-ranking by Tantivy. This is the building block for hybrid search, where BM25 scores are one signal among many and the final ranking is done by the profile executor, not the text index.
```rust
// From tidal/src/text/collectors.rs
impl SegmentCollector for AllScoresSegmentCollector {
    type Fruit = Vec<(EntityId, f32)>;

    fn collect(&mut self, doc: DocId, score: Score) {
        let eid_val = self.entity_id_col.get_val(doc);
        self.results.push((EntityId::new(eid_val), score));
    }

    fn harvest(self) -> Self::Fruit {
        self.results
    }
}
```
The key detail is not in the segment collector shown above: on the parent `Collector` implementation, `requires_scoring()` returns `true`. Without this, Tantivy skips BM25 computation entirely and every document receives a score of 0.0. The research document identified this as a gotcha. The implementation handles it.

The `score_candidates` function takes the other path. Given a pre-sorted list of `(segment_ord, doc_id, entity_id)` triples -- entities already selected by vector search or signal ranking -- it seeks through Tantivy's posting lists and scores only those documents. This is the targeted scoring path the research document recommended:
```rust
// From tidal/src/text/collectors.rs -- seek-based scoring
for &(_, doc_id, entity_id) in seg_candidates {
    if scorer.doc() > doc_id {
        continue;
    }
    let reached = scorer.seek(doc_id);
    if reached == doc_id {
        results.push((entity_id, scorer.score()));
    }
    if reached == TERMINATED {
        break;
    }
}
```
`DocSet::seek()` is forward-only -- a Lucene-inherited design where the posting list cursor never moves backward. Candidates must be sorted ascending by `(segment_ord, doc_id)`. The implementation groups candidates by segment to reuse scorers, then seeks within each segment in order.
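
To see why the ordering requirement matters, here is a toy forward-only cursor with the same loop shape as the snippet above. It is a sketch for illustration only: `PostingCursor` is a hypothetical stand-in for a Tantivy scorer, the scores are supplied up front rather than computed, and the `TERMINATED` value here is chosen for the sketch.

```rust
/// Sentinel returned once the cursor has moved past the last posting.
pub const TERMINATED: u32 = u32::MAX;

/// Toy posting list cursor over the documents that match a query.
pub struct PostingCursor<'a> {
    docs: &'a [u32],   // matching doc ids, ascending
    scores: &'a [f32], // precomputed score for each matching doc
    pos: usize,
}

impl<'a> PostingCursor<'a> {
    pub fn new(docs: &'a [u32], scores: &'a [f32]) -> Self {
        assert_eq!(docs.len(), scores.len());
        Self { docs, scores, pos: 0 }
    }

    /// Current doc id, or `TERMINATED` when exhausted.
    pub fn doc(&self) -> u32 {
        self.docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }

    /// Score of the current doc. Must not be called once exhausted.
    pub fn score(&self) -> f32 {
        self.scores[self.pos]
    }

    /// Advances to the first doc id >= `target`. Never moves backward.
    pub fn seek(&mut self, target: u32) -> u32 {
        while self.pos < self.docs.len() && self.docs[self.pos] < target {
            self.pos += 1;
        }
        self.doc()
    }
}

/// Scores `(doc_id, entity_id)` candidates within one segment.
/// Correct only when `candidates` is sorted ascending by doc id.
pub fn score_sorted_candidates(
    cursor: &mut PostingCursor<'_>,
    candidates: &[(u32, u64)],
) -> Vec<(u64, f32)> {
    let mut results = Vec::new();
    for &(doc_id, entity_id) in candidates {
        if cursor.doc() > doc_id {
            continue; // cursor is already past this candidate: no match
        }
        let reached = cursor.seek(doc_id);
        if reached == TERMINATED {
            break; // nothing left to match in this segment
        }
        if reached == doc_id {
            results.push((entity_id, cursor.score()));
        }
    }
    results
}
```

With matching docs `[2, 5, 9]`, the sorted candidates `[(5, 101), (9, 103)]` both score. Passed in the order `[(9, 103), (5, 101)]`, the cursor is already at doc 9 when candidate 5 is examined, so that candidate is silently missed.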
## The pattern underneath
Tantivy is not the only derived index in tidalDB.

The HNSW vector index is a derived index. Embeddings are stored durably in the entity store. The in-memory HNSW graph is built from those stored embeddings. If the graph is lost, rebuild it.

The bitmap indexes (category, format, creator, tags) are derived indexes. They are built from entity metadata on startup and updated inline on writes. If they are lost, scan the entity store and rebuild.

The signal ledger's decay scores and windowed counts are derived state. The WAL holds the raw signal events. The ledger is a materialized view. If it is lost, replay from the last checkpoint.

Every secondary data structure in tidalDB follows the same pattern:

1. One canonical store holds the truth.
2. Derived indexes are materialized views of that truth.
3. A sequence number (or checkpoint) bounds the rebuild window.
4. Crash recovery is replay, not reconciliation.

This is not a novel insight. It is the log-structured view of databases that Jay Kreps described in 2013 -- the log is the source of truth, and everything else is a consumer of that log. But applying it consistently, to every index in the system, eliminates an entire category of bugs. You never ask "are these two systems in sync?" You ask "how far behind is the derived index?" And the answer is always a number.

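
The four-point pattern above can be expressed as a small trait. This is a hypothetical sketch and not an interface that exists in tidalDB: `DerivedIndex`, `LogEntry`, `lag`, `catch_up`, and the toy `WordCount` index are all invented for illustration.

```rust
/// One entry in the canonical, sequence-numbered log.
pub struct LogEntry {
    pub seq: u64,
    pub text: String,
}

/// Anything that can be rebuilt by replaying the canonical log.
pub trait DerivedIndex {
    /// Highest sequence number this index has applied (0 = nothing applied).
    fn applied_seq(&self) -> u64;
    /// Applies one entry. Entries arrive in ascending sequence order.
    fn apply(&mut self, entry: &LogEntry);
}

/// "How far behind is the derived index?" The answer is always a number.
pub fn lag(log: &[LogEntry], index: &dyn DerivedIndex) -> u64 {
    let head = log.last().map_or(0, |e| e.seq);
    head.saturating_sub(index.applied_seq())
}

/// Recovery is replay: apply every entry newer than the index's own
/// sequence number. Returns how many entries were applied.
pub fn catch_up(log: &[LogEntry], index: &mut dyn DerivedIndex) -> usize {
    let from = index.applied_seq();
    let mut applied = 0;
    for entry in log.iter().filter(|e| e.seq > from) {
        index.apply(entry);
        applied += 1;
    }
    applied
}

/// Toy derived index: a running word count over the log.
pub struct WordCount {
    pub applied: u64,
    pub words: usize,
}

impl DerivedIndex for WordCount {
    fn applied_seq(&self) -> u64 {
        self.applied
    }
    fn apply(&mut self, entry: &LogEntry) {
        self.words += entry.text.split_whitespace().count();
        self.applied = entry.seq;
    }
}

/// Returns (lag before replay, entries applied, lag after replay).
pub fn demo() -> (u64, usize, u64) {
    let log: Vec<LogEntry> = ["a b", "c", "d e f"]
        .iter()
        .enumerate()
        .map(|(i, t)| LogEntry { seq: i as u64 + 1, text: t.to_string() })
        .collect();
    // The index "crashed" after applying seq 1, which contained two words.
    let mut index = WordCount { applied: 1, words: 2 };
    let before = lag(&log, &index);
    let applied = catch_up(&log, &mut index);
    (before, applied, lag(&log, &index))
}
```

Nothing in `catch_up` compares the contents of the index against the log. The only shared state is the sequence number, which is what keeps recovery a replay and not a reconciliation.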
## What the research planned vs what shipped
The research document made six recommendations for the Tantivy integration. Here is how they mapped to implementation:

| Research recommendation | What shipped |
|------------------------|--------------|
| Custom `Collector` for bulk BM25 extraction | `AllScoresCollector` -- returns all `(EntityId, f32)` pairs |
| `Weight::scorer` + `DocSet::seek` for candidate scoring | `score_candidates()` -- seek-based scoring of pre-sorted candidate sets |
| DB-primary with Tantivy as derived index | Entity store writes first, syncer feeds Tantivy asynchronously |
| Commit payload with sequence number for crash recovery | `prepare_commit()` + `set_payload()` + `last_committed_seq()` |
| Commit every 1-5 seconds | 1,000 documents or 2 seconds, whichever first |
| Single background indexer owns the writer | `TextIndexSyncer` on dedicated thread, holds writer lock for entire run |

The research flagged two open questions that we deferred: BM25 score drift as the corpus grows, and seek performance on large candidate sets. Both are characterization work that matters at scale. At the current target of tens of thousands of documents, a BM25 query against a 10K-document index returns in 0.26 ms. The research's concern about segment merge latency under concurrent search load has not materialized -- the `LogMergePolicy` handles steady-state ingest without measurable impact on query latency.

## Why it matters that this is boring
The Tantivy integration has no clever tricks. No lock-free algorithms. No custom allocators. A background thread reads from a channel, batches writes, and commits periodically. A sequence number in the commit payload enables crash recovery. A reader snapshot serves search queries without blocking writes.

This is boring on purpose. The interesting parts of tidalDB are the signal engine, the ranking profiles, the diversity enforcement, the hybrid fusion pipeline. Full-text search is a building block -- important, but not the place where we want architectural novelty. Tantivy is 40,000 lines of well-tested Rust that handles tokenization, BM25 scoring, segment merging, and crash-safe commits. We wrote 600 lines of integration code to treat it as a derived index.

The alternative -- building a minimal BM25 engine from scratch -- would have been roughly 3,000 lines for the basics, and then the gradual accumulation of the remaining 37,000 lines as we discovered we needed concurrent indexing, crash safety, segment merging, and block-max WAND pruning. Every database team that embeds full-text search faces this choice. ParadeDB, Quickwit, and Milvus all chose Tantivy. We chose it for the same reason: the integration complexity is bounded and predictable. The from-scratch complexity is not.

The derived index pattern made the integration simple. And it will make the next integration simple too, whatever it is -- because the principle does not change. One store holds the truth. Everything else follows.

---
*The text index integration is at [tidal/src/text/](https://github.com/orchard9/tidalDB/tree/main/tidal/src/text). The search executor is at [tidal/src/query/search/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/search/executor.rs). The Tantivy research document is at [docs/research/tantivy.md](https://github.com/orchard9/tidalDB/blob/main/docs/research/tantivy.md). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*