tidaldb/docs/specs/06-text-retrieval.md

# Text Retrieval Specification

**Status:** Draft
**Authors:** tidalDB Engineering
**Date:** 2026-02-20
**Depends on:** Storage Engine (01), Entity Model (02), Signal System (03)
**Research:** `docs/research/tantivy.md`, `docs/research/ann_for_tidaldb.md`

---

## Table of Contents

1. [Design Principles](#1-design-principles)
2. [Inverted Index Design](#2-inverted-index-design)
3. [BM25 Scoring](#3-bm25-scoring)
4. [Query Parsing](#4-query-parsing)
5. [Phrase Matching](#5-phrase-matching)
6. [Boolean Operators](#6-boolean-operators)
7. [Field-Scoped Search](#7-field-scoped-search)
8. [Autocomplete and Suggest](#8-autocomplete-and-suggest)
9. [Typo Tolerance](#9-typo-tolerance)
10. [Segment Management](#10-segment-management)
11. [Hybrid Fusion with Vector Retrieval](#11-hybrid-fusion-with-vector-retrieval)
12. [Integration with Storage Engine](#12-integration-with-storage-engine)
13. [Trait Abstraction](#13-trait-abstraction)
14. [Performance Targets](#14-performance-targets)
15. [Invariants and Correctness Guarantees](#15-invariants-and-correctness-guarantees)
16. [Configuration Reference](#16-configuration-reference)

---

## 1. Design Principles

Text retrieval is one leg of tidalDB's hybrid search pipeline. The other leg is vector retrieval (USearch HNSW, spec 05). Together they answer the question: "given a user's query string and optional query embedding, which entities are most relevant?" Text retrieval produces BM25 relevance scores. Vector retrieval produces cosine similarity scores. Fusion merges these into a single ranked list that feeds the ranking pipeline.

### 1.1 Design Axioms

1. **BM25 relevance is the floor.** An irrelevant result never surfaces because the user likes the creator or the item has high engagement. Text match quality gates the entire search pipeline. If the text score is zero (no term overlap) and no vector is provided, the item is excluded.

2. **Tantivy is the engine, behind a trait boundary.** The `TextIndex` trait abstracts all full-text operations. The production implementation wraps Tantivy 0.25+. Tests use a `MockTextIndex`. If Tantivy proves insufficient for a specific workload, the implementation can be swapped without touching any module outside `storage/text/`. This follows the same pattern as fjall/redb in the storage engine (01-storage-engine.md Section 4.4) and USearch in the vector index.

3. **The text index is a secondary index, not a source of truth.** The entity store (redb) is the source of truth. The text index (Tantivy) is a derived materialized view that can be rebuilt from the entity store at any time. If Tantivy's index is corrupted or lost, the database rebuilds it. This is the same principle as StemeDB's materialized views (thoughts.md) and follows the Tantivy research recommendation (docs/research/tantivy.md).

4. **One entity, one document.** Each entity in the entity store maps to exactly one document in the text index. The document's fields mirror the entity's `text` and `keyword` type metadata fields as defined in the entity model (02-entity-model.md). Entity creation inserts a document. Entity update replaces the document (delete + insert). Entity archive or delete removes the document.

5. **Raw BM25 scores are extractable.** tidalDB's ranking pipeline needs per-document BM25 scores as a feature -- not Tantivy's internal top-K ranking. The custom Collector and Weight/Scorer/seek() APIs provide this (docs/research/tantivy.md, Approaches 1 and 2).

---

## 2. Inverted Index Design

### 2.1 Document Model

Every active entity in the entity store is represented as a Tantivy document. The document schema is derived from the entity definition's metadata fields:

| Entity Field Type | Tantivy Field Type | Indexed | Stored | Positions |
|---|---|---|---|---|
| `text` | `TEXT` | Tokenized, BM25 | Yes | Yes (for phrase queries) |
| `keyword` | `STRING` | Exact-match, not tokenized | Yes | No |
| `keywords` (multi-value) | `STRING` (one entry per value) | Exact-match, not tokenized | Yes | No |
| `i64` | `I64` | Fast field (sorted numeric) | Yes | No |
| `f64` | `F64` | Fast field (sorted numeric) | Yes | No |
| `bool` | `BOOL` | Fast field | Yes | No |
| `timestamp` | `I64` (nanos since epoch) | Fast field (sorted numeric) | Yes | No |
| `duration` | `F64` (seconds) | Fast field (sorted numeric) | Yes | No |

Additionally, every document carries:

- **`_entity_id`**: A `BYTES` fast field containing the 8-byte big-endian entity ID. This is the stable identifier that survives segment merges. It is the bridge between Tantivy's internal `DocAddress` (which changes on merge) and tidalDB's entity model.
- **`_entity_kind`**: A `U64` fast field encoding the entity kind byte (`0x01` = Item, `0x02` = User, `0x03` = Creator). Enables per-kind queries without maintaining separate indexes.

### 2.2 Field Registry

The Tantivy schema is built dynamically from the entity definitions registered via `define_entity()`. Each entity kind contributes its text and keyword fields to a shared Tantivy schema. Field names are prefixed with the entity kind to avoid collisions:

```
item.title         -> TEXT, positions, stored
item.description   -> TEXT, positions, stored
item.category      -> STRING, stored
item.tags          -> STRING (multi-value), stored
item.hashtags      -> STRING (multi-value), stored
creator.name       -> TEXT, positions, stored
creator.handle     -> STRING, stored
```

Field names in user-facing queries omit the prefix (users write `title:jazz`, not `item.title:jazz`). The query parser resolves the prefix based on the target entity kind in the search request.

### 2.3 Analyzer Chain

Text fields pass through an analyzer chain before indexing and before query parsing. The default chain:

```
Input text
  |
  v
[Unicode Segmenter]     -- ICU-based word boundary detection
  |
  v
[Lowercase Filter]      -- ASCII + Unicode lowercasing
  |
  v
[Stop Word Filter]      -- Language-specific stop words (optional, off by default)
  |
  v
[Stemmer]               -- Snowball stemmer, language-configurable
  |
  v
Indexed terms
```

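**Default tokenizer:** `tantivy::tokenizer::TextAnalyzer` composed with:
- `SimpleTokenizer` as the base tokenizer. It splits on whitespace and punctuation; it does not perform ICU segmentation, so ICU-based word boundary detection requires registering a separate tokenizer.
- `LowerCaser` filter
- `Stemmer` filter with `Language::English` default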

**Language-aware analysis.** The entity definition can specify a language per text field. Different languages get different stemmers and stop word lists:

```rust
Field::text("title").language(Language::English),
Field::text("description").language(Language::Japanese),
```

Japanese, Chinese, and Korean require segmentation tokenizers (lindera or jieba). When a CJK language is specified, the analyzer chain substitutes a CJK-specific tokenizer. This is a per-field configuration, not per-index -- a single entity can have an English title and a Japanese description.

**Keyword fields are NOT analyzed.** They are indexed as exact byte sequences. `tag:tutorial` matches the exact string "tutorial", not stemmed variants.
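
To make the difference between analyzed text fields and unanalyzed keyword fields concrete, here is a simplified, standard-library-only sketch of the first stages of the chain (splitting, lowercasing, optional stop words). It is not the Tantivy analyzer: stemming is deliberately omitted, and the function name `analyze` is an assumption made for illustration.

```rust
/// Split on non-alphanumeric characters, lowercase each token, and drop
/// stop words. Stemming is intentionally omitted in this sketch.
pub fn analyze(text: &str, stop_words: &[&str]) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .filter(|token| !stop_words.contains(&token.as_str()))
        .collect()
}
```

With an empty stop word list, `analyze("Jazz Piano: The Basics!", &[])` yields `["jazz", "piano", "the", "basics"]`. A keyword field would instead index the whole input string unchanged.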

### 2.4 Position Indexes

All `text` type fields are indexed with positions enabled. This is required for:

- **Exact phrase matching**: `"jazz piano"` requires knowing that "jazz" appears at position N and "piano" at position N+1 in the same field.
- **Proximity queries**: terms within N positions of each other (future extension).
- **Phrase boosting**: exact phrase matches score higher than individual term matches.

Position indexes increase index size by approximately 30-40% compared to term-only indexing. At 10M documents with 4-5 text fields, this adds roughly 1.5-2 GB to the index. This is acceptable given the phrase matching requirement.

### 2.5 Term Frequency and Document Frequency

Tantivy stores per-segment term frequency (TF) and document frequency (DF) natively. These power BM25 scoring:

- **Term frequency (TF)**: number of times term t appears in document d in field f. Stored in posting lists.
- **Document frequency (DF)**: number of documents containing term t in field f. Stored in the term dictionary per segment.
- **Field norms**: encoded document field lengths for BM25 length normalization. Stored by Tantivy as one byte per document per indexed text field.

No additional storage beyond Tantivy's default is required for BM25 computation.

---

## 3. BM25 Scoring

### 3.1 Formula

tidalDB uses the standard Okapi BM25 formula as implemented by Tantivy:

```
BM25(q, d) = SUM over t in q:
    IDF(t) * (tf(t,d) * (k1 + 1)) / (tf(t,d) + k1 * (1 - b + b * (|d| / avgdl)))
```

Where:
- `q` = query (set of terms)
- `d` = document
- `t` = a term in the query
- `tf(t, d)` = frequency of term t in document d (within a specific field)
- `|d|` = length of document d (in the specific field, measured in tokens)
- `avgdl` = average document length across the corpus (in the specific field)
- `IDF(t) = ln(1 + (N - n(t) + 0.5) / (n(t) + 0.5))` where N = total documents, n(t) = documents containing term t
- `k1` = term saturation parameter (default: 1.2)
- `b` = length normalization parameter (default: 0.75)
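
The formula can be transcribed directly. The following is a sketch for a single field, not Tantivy's implementation (which works on encoded field norms); the names `Bm25Params`, `idf`, `bm25_term`, and `bm25` are illustrative.

```rust
/// BM25 tuning parameters (defaults as listed above).
#[derive(Debug, Clone, Copy)]
pub struct Bm25Params {
    pub k1: f64,
    pub b: f64,
}

impl Default for Bm25Params {
    fn default() -> Self {
        Self { k1: 1.2, b: 0.75 }
    }
}

/// IDF(t) = ln(1 + (N - n(t) + 0.5) / (n(t) + 0.5))
pub fn idf(total_docs: u64, doc_freq: u64) -> f64 {
    let n = total_docs as f64;
    let df = doc_freq as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Contribution of one query term to BM25(q, d) within a single field.
pub fn bm25_term(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64, p: Bm25Params) -> f64 {
    if tf <= 0.0 {
        // Term absent: contributes nothing (also avoids 0/0 when k1 = 0).
        return 0.0;
    }
    let length_norm = 1.0 - p.b + p.b * (doc_len / avg_doc_len);
    idf * (tf * (p.k1 + 1.0)) / (tf + p.k1 * length_norm)
}

/// BM25(q, d): sum over query terms, each given as (tf in d, doc_freq).
pub fn bm25(
    terms: &[(f64, u64)],
    doc_len: f64,
    avg_doc_len: f64,
    total_docs: u64,
    p: Bm25Params,
) -> f64 {
    terms
        .iter()
        .map(|&(tf, df)| bm25_term(tf, doc_len, avg_doc_len, idf(total_docs, df), p))
        .sum()
}
```

A useful sanity check: for a document of exactly average length and `tf = 1`, the fraction reduces to 1 and the term contributes exactly `IDF(t)`.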

### 3.2 Parameter Defaults

| Parameter | Default | Range | Effect |
|-----------|---------|-------|--------|
| `k1` | 1.2 | 0.0 - 3.0 | Higher k1 increases the impact of term frequency. At k1=0, TF has no effect (binary matching). At k1=3.0, high-TF documents are strongly preferred. 1.2 is the TREC-validated default. |
| `b` | 0.75 | 0.0 - 1.0 | Higher b penalizes long documents more. At b=0, no length normalization. At b=1.0, full normalization. 0.75 balances well for mixed-length content (short titles, long descriptions). |

These parameters are configurable per ranking profile. The `search` profile uses the defaults. A profile tuned for short-form content (tweets, titles) might use `b=0.3` to reduce length normalization penalty.

### 3.3 Per-Field BM25 with Field Boosting

BM25 is computed independently per text field. The final text score for a document is a weighted sum across fields:

```
text_score(q, d) = SUM over f in fields:
    field_boost(f) * BM25_f(q, d)
```

Default field boost weights:

| Field | Default Boost | Rationale |
|-------|---------------|-----------|
| `title` | 3.0 | Title matches are the strongest relevance signal. A title containing the exact query terms is almost certainly relevant. |
| `description` | 1.0 | Baseline relevance. Description is the primary text body. |
| `tags` | 2.0 | Tag matches indicate topical relevance. Tags are curated by the creator. |
| `hashtags` | 2.0 | Same as tags. Hashtag matches are strong topical signals. |
| `creator.name` | 2.5 | Creator name matches are high-intent. The user is looking for this creator. |
| `creator.handle` | 3.0 | Handle matches are exact-intent. Even stronger than name. |

Field boosts are configurable per ranking profile:

```rust
db.define_profile(ProfileDef {
    name: "search",
    text_config: TextConfig {
        field_boosts: vec![
            ("title", 3.0),
            ("description", 1.0),
            ("tags", 2.0),
        ],
        bm25_k1: 1.2,
        bm25_b: 0.75,
    },
    ..Default::default()
})?;
```
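
The weighted sum from this section can be sketched as follows. The function name `text_score` and its slice-based signature are illustrative. The spec does not say how a field without a configured boost is treated; this sketch assumes a boost of 1.0 for such fields.

```rust
/// text_score(q, d) = SUM over fields of field_boost(f) * BM25_f(q, d).
/// Assumption: a field with no configured boost uses 1.0.
pub fn text_score(per_field_bm25: &[(&str, f64)], field_boosts: &[(&str, f64)]) -> f64 {
    per_field_bm25
        .iter()
        .map(|&(field, score)| {
            let boost = field_boosts
                .iter()
                .find(|&&(name, _)| name == field)
                .map_or(1.0, |&(_, b)| b);
            boost * score
        })
        .sum()
}
```

For example, per-field scores of 2.0 (title) and 1.5 (description) under the default boosts give 3.0 * 2.0 + 1.0 * 1.5 = 7.5.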

### 3.4 IDF Computation

Document frequencies and document counts are stored per segment. At query time Tantivy sums them across all segments and computes IDF from the index-wide totals. This is Tantivy's default behavior and requires no special handling.

**Corpus statistics stability.** BM25 scores depend on corpus statistics (DF, avgdl). As documents are added or removed, scores for the same query-document pair shift. For tidalDB's use case, this is acceptable because:

1. Score normalization before fusion (Section 11) absorbs absolute score drift.
2. Ranking profiles use relative ordering, not absolute score thresholds.
3. At 10M documents, adding 1% (100K documents) shifts IDF values by less than 0.5% for common terms.

If score stability becomes critical (e.g., for A/B testing with absolute score comparisons), a periodic `IndexReader::reload()` cadence can be configured to control when new corpus statistics take effect.

### 3.5 Score Normalization

Raw BM25 scores are unbounded (typically 0-25+ depending on query length and corpus). For fusion with vector similarity scores (bounded [0, 1]), normalization is required. See Section 11 for normalization strategies.

---

## 4. Query Parsing

### 4.1 Grammar

The search query language supports the syntax defined in API.md. The grammar is specified here in extended BNF:

```ebnf
query           ::= clause ( clause )*

clause          ::= [ boolean_op ] term_expr
                  | [ boolean_op ] group

boolean_op      ::= 'AND' | 'OR' | 'NOT'

group           ::= '(' query ')'

term_expr       ::= negation
                  | field_scoped
                  | phrase
                  | hashtag
                  | wildcard
                  | bare_term

negation        ::= '-' bare_term
                  | '-' phrase
                  | 'NOT' term_expr

field_scoped    ::= field_name ':' ( bare_term | phrase )

phrase          ::= '"' word ( word )* '"'

hashtag         ::= '#' word

wildcard        ::= word '*'

bare_term       ::= word

field_name      ::= 'title' | 'description' | 'tag' | 'tags'
                  | 'creator' | 'category' | 'hashtag'
                  | IDENTIFIER

word            ::= [a-zA-Z0-9_]+

IDENTIFIER      ::= [a-zA-Z_] [a-zA-Z0-9_]*
```

### 4.2 Operator Precedence

From highest to lowest binding:

1. **Negation**: `-term`, `NOT term` (prefix unary, binds tightest)
2. **Grouping**: `(expr)` (explicit grouping overrides all precedence)
3. **AND**: `a AND b` (binary, left-associative)
4. **OR**: `a OR b` (binary, left-associative)
5. **Implicit OR**: `a b` (space-separated terms default to OR, ranked by relevance)

Examples:

| Input | Parsed As | Semantics |
|-------|-----------|-----------|
| `jazz piano tutorial` | `jazz OR piano OR tutorial` | Any term matches, ranked by BM25 |
| `jazz AND piano NOT beginner` | `(jazz AND piano) AND (NOT beginner)` | Must contain jazz and piano, must not contain beginner |
| `"jazz piano"` | `PHRASE("jazz", "piano")` | Adjacent terms in order |
| `-beginner` | `NOT beginner` | Exclude documents containing "beginner" |
| `jazz pian*` | `jazz OR PREFIX(pian)` | "jazz" or any term starting with "pian" |
| `title:jazz` | `FIELD(title, jazz)` | Match "jazz" only in the title field |
| `tag:tutorial` | `FIELD(tag, tutorial)` | Exact match in the tag field |
| `#jazz` | `FIELD(hashtags, jazz)` | Exact match in hashtags |
| `(jazz OR blues) AND piano` | `(jazz OR blues) AND piano` | Grouped OR within AND |

### 4.3 AST Design

The query parser produces an abstract syntax tree consumed by the query planner:

```rust
/// A parsed search query, ready for planning.
pub enum SearchQuery {
    /// A single term, optionally stemmed.
    Term {
        text: String,
        field: Option<FieldName>,
    },

    /// An exact phrase: terms must appear adjacent and in order.
    Phrase {
        terms: Vec<String>,
        field: Option<FieldName>,
    },

    /// A prefix wildcard: matches all terms starting with the prefix.
    Prefix {
        prefix: String,
        field: Option<FieldName>,
    },

    /// Boolean AND: all children must match.
    And(Vec<SearchQuery>),

    /// Boolean OR: any child may match.
    Or(Vec<SearchQuery>),

    /// Boolean NOT: exclude documents matching the child.
    Not(Box<SearchQuery>),

    /// Field-scoped query: restrict matching to a specific field.
    /// Redundant with the `field` option on Term/Phrase/Prefix but
    /// kept for clarity when the parser produces the tree.
    FieldScoped {
        field: FieldName,
        inner: Box<SearchQuery>,
    },

    /// Hashtag sugar: `#jazz` -> `FieldScoped(hashtags, Term("jazz"))`
    Hashtag(String),
}
```

### 4.4 Tantivy Query Translation

The AST is translated to Tantivy query types:

| AST Node | Tantivy Query |
|----------|---------------|
| `Term { text, field: None }` | `BooleanQuery::union` over per-field `TermQuery` with field boosts |
| `Term { text, field: Some(f) }` | `TermQuery` on field f |
| `Phrase { terms, field: None }` | `BooleanQuery::union` over per-field `PhraseQuery` with field boosts |
| `Phrase { terms, field: Some(f) }` | `PhraseQuery` on field f |
| `Prefix { prefix, field }` | `RegexQuery` or `PhrasePrefixQuery` on the field(s) |
| `And(children)` | `BooleanQuery` with all children as `Must` |
| `Or(children)` | `BooleanQuery` with all children as `Should` |
| `Not(child)` | `BooleanQuery` with child as `MustNot` |
| `FieldScoped { field, inner }` | Recursive translation with field context |
| `Hashtag(tag)` | `TermQuery` on the `hashtags` field (exact match, no analysis) |

For bare terms with no field scope, the query is expanded across all text fields with field-level boosts. Given the query `jazz`:

```rust
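use tantivy::query::{BooleanQuery, BoostQuery, Query, TermQuery};
use tantivy::{schema::{Field, IndexRecordOption}, Score, Term};

// `title`, `description`, `tags`, `hashtags` are `Field` handles from the schema.
let boosted = |field: Field, boost: Score| -> Box<dyn Query> {
    let term = Term::from_field_text(field, "jazz");
    let query = TermQuery::new(term, IndexRecordOption::WithFreqs);
    Box::new(BoostQuery::new(Box::new(query), boost))
};
BooleanQuery::union(vec![
    boosted(title, 3.0),       // item.title boost
    boosted(description, 1.0), // item.description
    boosted(tags, 2.0),        // exact in item.tags
    boosted(hashtags, 2.0),    // exact in item.hashtags
])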
```

### 4.5 Error Recovery

Malformed queries must not produce errors. The parser degrades gracefully:

| Malformation | Recovery |
|-------------|----------|
| Unmatched `"` | Treat the opening `"` as literal; parse remaining as bare terms |
| Unmatched `(` | Treat `(` as ignored; parse remaining as flat clause list |
| Empty query `""` | Return zero results with no error |
| Only operators `AND OR NOT` | Treat operators as bare terms |
| Unknown field `foo:bar` | Treat `foo:bar` as bare term `foo:bar` |
| Consecutive operators `AND AND jazz` | Ignore duplicate operators, parse `AND jazz` |

The parser never returns an error to the user. It always produces a best-effort AST. The original query string is preserved in the response for display/debugging.
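
Two of the recovery rules in the table (operator-only queries and repeated operators) can be sketched at the token level. This is an illustrative fragment, not the parser: the `Token` type and the `recover` function are assumptions, and the sketch collapses only identical consecutive operators so that sequences such as `AND NOT` are preserved.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Op(String),
    Word(String),
}

fn is_operator(s: &str) -> bool {
    matches!(s, "AND" | "OR" | "NOT")
}

/// Apply two recovery rules to a whitespace-split token list:
/// 1. A query made only of operators is treated as bare terms.
/// 2. A repeated operator (`AND AND jazz`) is collapsed to one (`AND jazz`).
pub fn recover(raw: &[&str]) -> Vec<Token> {
    if raw.iter().all(|t| is_operator(t)) {
        return raw.iter().map(|t| Token::Word(t.to_string())).collect();
    }
    let mut out: Vec<Token> = Vec::new();
    for &t in raw {
        if is_operator(t) {
            // Skip an operator identical to the one just emitted.
            if out.last() == Some(&Token::Op(t.to_string())) {
                continue;
            }
            out.push(Token::Op(t.to_string()));
        } else {
            out.push(Token::Word(t.to_string()));
        }
    }
    out
}
```

Neither rule can fail, which is consistent with the requirement that parsing always yields a best-effort result.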

---

## 5. Phrase Matching

### 5.1 Exact Phrase

Quoted strings produce phrase queries. `"jazz piano"` matches only documents where "jazz" appears immediately before "piano" in the same field, after tokenization and analysis.

**Implementation:** Tantivy's `PhraseQuery` uses position indexes to verify adjacency. Each term in the phrase must appear at consecutive positions in the posting list.

**Cross-field behavior:** A phrase query without a field scope is expanded across all text fields. The phrase must match within a single field -- not across fields. `"jazz piano"` matches a title containing "jazz piano" but does not match a document with "jazz" in the title and "piano" in the description.
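
The adjacency check can be illustrated without Tantivy. The sketch below assumes each phrase term's positions within one field are available as a sorted list; `phrase_matches_in_field` and `phrase_matches` are hypothetical names, and Tantivy's `PhraseQuery` performs the equivalent work over its posting lists.

```rust
/// `positions[i]` holds the sorted token positions of the i-th phrase term
/// within one field of one document.
pub fn phrase_matches_in_field(positions: &[Vec<u32>]) -> bool {
    let (first, rest) = match positions.split_first() {
        Some(parts) => parts,
        None => return false, // an empty phrase matches nothing
    };
    first.iter().any(|&start| {
        rest.iter().enumerate().all(|(i, later)| {
            // The term i+1 places further along the phrase must sit
            // exactly i+1 positions after `start`.
            start
                .checked_add(i as u32 + 1)
                .map_or(false, |pos| later.binary_search(&pos).is_ok())
        })
    })
}

/// A phrase matches a document only if it matches within a single field.
/// `fields[f]` is the per-term position table for field `f`.
pub fn phrase_matches(fields: &[Vec<Vec<u32>>]) -> bool {
    fields.iter().any(|field| phrase_matches_in_field(field))
}
```

With "jazz" at positions `[0, 5]` and "piano" at `[3, 6]`, the phrase matches through the pair (5, 6). With "jazz" only in the title and "piano" only in the description, no single field satisfies the check.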

### 5.2 Phrase Boosting

Phrase matches receive a multiplicative boost over individual term matches. When a query contains both bare terms and a phrase, the phrase component scores higher:

```
Query: "jazz piano" tutorial

Scoring breakdown for a matching document:
  phrase_score = BM25("jazz piano" as phrase) * phrase_boost
  term_score   = BM25("tutorial" as term)
  total        = phrase_score + term_score
```

| Parameter | Default | Range | Effect |
|-----------|---------|-------|--------|
| `phrase_boost` | 2.0 | 1.0 - 10.0 | Multiplicative boost for phrase matches over individual term matches. |

### 5.3 Proximity Queries (Future Extension)

Proximity queries (terms within N positions) are not in the initial implementation. The position index infrastructure supports them. When needed, the syntax `"jazz piano"~3` (terms within 3 positions) can be added by setting a slop of 3 on Tantivy's `PhraseQuery` (`PhraseQuery::set_slop(3)`).

---

## 6. Boolean Operators

### 6.1 AND

All terms connected by AND must appear in the matching document. AND is translated to a `BooleanQuery` with all clauses as `Must`:

```
jazz AND piano  ->  BooleanQuery([Must(jazz), Must(piano)])
```

BM25 scoring is still computed for AND queries. Documents matching all terms are scored by the sum of per-term BM25 scores. AND restricts the candidate set; BM25 ranks within it.

### 6.2 OR (Default)

Space-separated terms without explicit operators are treated as OR. Any matching term contributes to the document's score:

```
jazz piano tutorial  ->  BooleanQuery([Should(jazz), Should(piano), Should(tutorial)])
```

Documents matching more terms score higher (BM25 scores accumulate). A document matching all three terms outscores a document matching two, which outscores a document matching one.

### 6.3 NOT / Exclusion

NOT and the `-` prefix exclude documents containing the specified term. Excluded documents are removed from the result set entirely -- they do not receive a score of zero, they are absent.

```
jazz NOT beginner    ->  BooleanQuery([Must(jazz), MustNot(beginner)])
jazz -beginner       ->  same translation
```

A query consisting solely of NOT terms (`-jazz -piano`) is invalid -- it would match every document except those containing the excluded terms. The parser treats this as an empty result set with no error.

### 6.4 Grouping

Parentheses override operator precedence:

```
(jazz OR blues) AND piano  ->  BooleanQuery([
    Must(BooleanQuery([Should(jazz), Should(blues)])),
    Must(piano)
])
```

Grouping nests arbitrarily: `((jazz OR blues) AND piano) NOT beginner` is valid.

### 6.5 Boolean + BM25 Interaction

Boolean operators constrain the candidate set. BM25 ranks within the constrained set. The interaction:

| Clause Type | Effect on Candidates | Effect on BM25 Score |
|-------------|---------------------|---------------------|
| `Must` (AND) | Document must match | Term contributes to BM25 score |
| `Should` (OR) | Document may match | Matching terms contribute to BM25 score; non-matching terms contribute 0 |
| `MustNot` (NOT) | Document must not match | Term does not contribute to score (document excluded) |

For a pure OR query like `jazz piano tutorial`, a document matching only "jazz" is still returned -- but it scores lower than a document matching all three terms. This is the expected "ranked OR" behavior for keyword search.
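
The split between candidate selection and scoring can be shown with a small in-memory evaluator. This is a sketch over a simplified AST (`Query` is a reduced stand-in for `SearchQuery`), and `matched_terms` uses a count of matching positive terms as a crude proxy for the summed BM25 score.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
pub enum Query {
    Term(String),
    And(Vec<Query>),
    Or(Vec<Query>),
    Not(Box<Query>),
}

/// Candidate-set test: does the document (a set of terms) satisfy the query?
pub fn matches(q: &Query, doc: &HashSet<&str>) -> bool {
    match q {
        Query::Term(t) => doc.contains(t.as_str()),
        Query::And(children) => children.iter().all(|c| matches(c, doc)),
        Query::Or(children) => children.iter().any(|c| matches(c, doc)),
        Query::Not(child) => !matches(child, doc),
    }
}

/// Scoring proxy: number of positive terms that match. Terms under NOT
/// never contribute, mirroring the `MustNot` row in the table above.
pub fn matched_terms(q: &Query, doc: &HashSet<&str>) -> usize {
    match q {
        Query::Term(t) => usize::from(doc.contains(t.as_str())),
        Query::And(children) | Query::Or(children) => {
            children.iter().map(|c| matched_terms(c, doc)).sum()
        }
        Query::Not(_) => 0,
    }
}
```

For the OR query `jazz piano tutorial`, a document containing only "jazz" passes `matches` with a proxy score of 1, while a document containing all three terms scores 3.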

---

## 7. Field-Scoped Search

### 7.1 Field Syntax

The `field:term` syntax restricts matching to a specific field:

```
title:jazz          -> TermQuery on title field only
tag:tutorial        -> TermQuery on tags field (exact keyword match)
creator:jazzacademy -> TermQuery on creator.handle field (exact keyword match)
```

### 7.2 Field Resolution

The parser maps user-facing field names to internal Tantivy field names based on the target entity kind in the search request:

| User-Facing Field | Entity Kind | Internal Field | Match Type |
|-------------------|-------------|----------------|------------|
| `title` | Item | `item.title` | Tokenized BM25 |
| `description` | Item | `item.description` | Tokenized BM25 |
| `tag` or `tags` | Item | `item.tags` | Exact keyword |
| `category` | Item | `item.category` | Exact keyword |
| `hashtag` | Item | `item.hashtags` | Exact keyword |
| `creator` | Item/Creator | `creator.handle` | Exact keyword |
| `name` | Creator | `creator.name` | Tokenized BM25 |
| `handle` | Creator | `creator.handle` | Exact keyword |
| `language` | Item/Creator | `{kind}.language` | Exact keyword |
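
The resolution table can be expressed as a single lookup. The sketch below is illustrative: `EntityKind`, `MatchType`, and `resolve_field` are assumed names, and returning `None` for an unknown field corresponds to the Section 4.5 rule that such input falls back to a bare term.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityKind {
    Item,
    Creator,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchType {
    TokenizedBm25,
    ExactKeyword,
}

/// Map a user-facing field name to (internal field name, match type)
/// for the entity kind targeted by the search request.
pub fn resolve_field(kind: EntityKind, user_field: &str) -> Option<(&'static str, MatchType)> {
    use EntityKind::*;
    use MatchType::*;
    match (kind, user_field) {
        (Item, "title") => Some(("item.title", TokenizedBm25)),
        (Item, "description") => Some(("item.description", TokenizedBm25)),
        (Item, "tag") | (Item, "tags") => Some(("item.tags", ExactKeyword)),
        (Item, "category") => Some(("item.category", ExactKeyword)),
        (Item, "hashtag") => Some(("item.hashtags", ExactKeyword)),
        (_, "creator") => Some(("creator.handle", ExactKeyword)),
        (Creator, "name") => Some(("creator.name", TokenizedBm25)),
        (Creator, "handle") => Some(("creator.handle", ExactKeyword)),
        (Item, "language") => Some(("item.language", ExactKeyword)),
        (Creator, "language") => Some(("creator.language", ExactKeyword)),
        _ => None, // unknown field: caller falls back to a bare term
    }
}
```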

### 7.3 Mixed Queries

A query can mix field-scoped and unscoped terms:

```
title:jazz piano tutorial
```

This parses as a single implicit OR over three clauses (Section 4.2):
- `FieldScoped(title, Term("jazz"))`
- `Term("piano")` across all fields
- `Term("tutorial")` across all fields

The field-scoped term searches only in the title. The unscoped terms search across all text fields with field boosts. A document matching any clause is returned and ranked by BM25. To require the title match, write it explicitly: `title:jazz AND (piano OR tutorial)`.

### 7.4 Keyword Field Behavior

Field-scoped searches on `keyword` type fields (tags, category, hashtags, handle) use exact matching, not BM25:

- `tag:tutorial` matches the exact tag string "tutorial"
- `tag:tutorials` does NOT match "tutorial" (no stemming on keyword fields)
- `tag:jazz piano` is parsed as `tag:jazz OR piano` -- only "jazz" is field-scoped

For multi-word exact keyword matches, use quotes: `tag:"jazz piano"` (matches if "jazz piano" is a single tag value).

### 7.5 Field Boost Configuration

Field boosts are configurable per ranking profile, enabling different search experiences:

```rust
// Profile optimized for finding specific content (title-heavy)
TextConfig {
    field_boosts: vec![("title", 5.0), ("description", 1.0), ("tags", 1.5)],
    ..Default::default()
}

// Profile optimized for topic discovery (tag/category-heavy)
TextConfig {
    field_boosts: vec![("title", 2.0), ("description", 1.0), ("tags", 4.0), ("category", 3.0)],
    ..Default::default()
}
```

---

## 8. Autocomplete and Suggest

### 8.1 Architecture

Autocomplete serves the `SUGGEST` operation from API.md. It provides fast prefix-based completions as the user types, powered by three data sources:

```
User types "jazz pia"
     |
     v
+-------------------+     +---------------------+     +-------------------+
|  Term Prefix      |     |  Popular Queries     |     |  Personal History |
|  Index            |     |  (Signal-Weighted)   |     |  (Per-User)       |
+-------------------+     +---------------------+     +-------------------+
| Tantivy term      |     | Top query strings    |     | User's recent     |
| dictionary scan   |     | by result-click      |     | searches and      |
| for "pia*"        |     | signal velocity      |     | engaged items     |
+--------+----------+     +---------+-----------+     +--------+----------+
         |                          |                          |
         v                          v                          v
     +---------------------------------------------------+
     |  Merge + Deduplicate + Rank by:                    |
     |    1. Personal history recency (if for_user)       |
     |    2. Popular query velocity                       |
     |    3. Term frequency in index                      |
     +---------------------------------------------------+
         |
         v
     ["jazz piano", "jazz piano tutorial", "jazz piano chords", ...]
```

### 8.2 Term Prefix Completions

Tantivy's term dictionary supports ordered iteration over terms. Given prefix "pia", scanning the term dictionary yields all terms starting with "pia" (piano, pianist, pianos, etc.). The scan is O(log N + k) where N is the dictionary size and k is the number of matching terms.

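**Implementation:** for each segment, `segment_reader.inverted_index(field)?.terms().range()` returns a term stream builder. Bounding it with `.ge(prefix)` and an exclusive upper bound just past the prefix, then calling `.into_stream()?`, iterates over the matching terms. The term's document frequency is used as a popularity proxy.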
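
The ordered prefix scan can be illustrated with a `BTreeMap` standing in for the term dictionary. This is an analogy rather than the Tantivy code path; `complete_prefix` is a hypothetical name, and the map values play the role of document frequency.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Return up to `limit` terms starting with `prefix`, most frequent first.
/// `dict` maps term -> document frequency and stands in for the term dictionary.
pub fn complete_prefix(
    dict: &BTreeMap<String, u64>,
    prefix: &str,
    limit: usize,
) -> Vec<(String, u64)> {
    let mut hits: Vec<(String, u64)> = dict
        // Seek to the first term >= prefix, then walk forward in order.
        .range::<str, _>((Bound::Included(prefix), Bound::Unbounded))
        .take_while(|(term, _)| term.starts_with(prefix))
        .map(|(term, df)| (term.clone(), *df))
        .collect();
    // Rank by document frequency (descending), ties broken alphabetically.
    hits.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    hits.truncate(limit);
    hits
}
```

Because terms sharing a prefix are contiguous in sorted order, the scan stops at the first term that no longer starts with the prefix.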

### 8.3 Popular Query Suggestions

A separate in-memory data structure tracks popular query strings:

```rust
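use dashmap::DashMap;
use std::sync::atomic::AtomicU64;

/// Tracks query popularity for autocomplete suggestions.
struct QueryPopularity {
    /// Query string -> (total_count, velocity_1h, last_seen)
    queries: DashMap<String, QueryStats>,
}

struct QueryStats {
    total_count: AtomicU64,
    /// 1-hour velocity as an `f64`, stored as raw bits via `f64::to_bits`
    /// and read back with `f64::from_bits` (std has no `AtomicF64`).
    velocity_1h: AtomicU64,
    last_seen_ns: AtomicU64,
}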
```

**Population:** When a `SEARCH` query is executed, the query string is recorded in this structure. When a search result is clicked (`search_click` signal), the query string's count is incremented. This means popular suggestions are weighted by result-click engagement, not just query frequency -- avoiding suggesting queries that produce poor results.

**Trending queries:** When the suggest `prefix` is empty, return the queries with the highest 1-hour velocity. This powers the "trending searches" feature in API.md.
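
Since the standard library has no atomic `f64`, the 1-hour velocity has to be stored as bits in an `AtomicU64`. One possible wrapper is sketched below; the type name `AtomicF64Cell` and the choice of `Ordering::Relaxed` are assumptions, suitable only when the value is an independent statistic that needs no ordering with other memory operations.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// An `f64` stored as raw bits inside an `AtomicU64`.
pub struct AtomicF64Cell(AtomicU64);

impl AtomicF64Cell {
    pub fn new(value: f64) -> Self {
        Self(AtomicU64::new(value.to_bits()))
    }

    pub fn load(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }

    pub fn store(&self, value: f64) {
        self.0.store(value.to_bits(), Ordering::Relaxed);
    }

    /// Add `delta` with a compare-and-swap loop; returns the previous value.
    pub fn fetch_add(&self, delta: f64) -> f64 {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            let next = (f64::from_bits(current) + delta).to_bits();
            match self
                .0
                .compare_exchange_weak(current, next, Ordering::Relaxed, Ordering::Relaxed)
            {
                Ok(_) => return f64::from_bits(current),
                Err(actual) => current = actual, // lost the race: retry
            }
        }
    }
}
```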

### 8.4 Personalized Suggestions

When `for_user` is provided in the suggest request, the user's recent search history and engaged items contribute to suggestions:

1. **Recent searches**: the user's last 100 search queries, ordered by recency.
2. **Engaged item terms**: terms from titles/tags of items the user has positively engaged with (liked, completed, saved) in the last 7 days.

Personalized suggestions are ranked above popular suggestions when they match the prefix. This enables "jazz pia" to suggest "jazz piano tutorial" if the user recently searched for or engaged with jazz piano content.

### 8.5 "Did You Mean" (Typo Correction on Submit)

When a submitted search query returns fewer than a configurable threshold of results (`did_you_mean_threshold`, default: 5), the system attempts typo correction:

1. For each query term, compute edit-distance-1 and edit-distance-2 variants.
2. Look up each variant in the term dictionary.
3. If a variant exists with higher document frequency than the original term, suggest it.
4. Format as: `did_you_mean: "jazz piano"` in the search response.

This is a post-search operation. The original query still executes and returns whatever results it finds. The suggestion is advisory.
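
The four steps can be sketched as follows. For brevity the sketch swaps variant generation for a brute-force scan: it walks a small in-memory dictionary and keeps terms within Levenshtein distance 2, which selects the same candidates at higher cost. The names `levenshtein` and `did_you_mean` are illustrative, and the tie-break (alphabetical) is an assumption.

```rust
use std::collections::HashMap;

/// Classic Levenshtein distance over Unicode scalar values.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != *cb);
            cur.push(substitute.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Suggest the most frequent dictionary term within edit distance 2 whose
/// document frequency exceeds that of the original term.
/// `dict` maps term -> document frequency.
pub fn did_you_mean(term: &str, dict: &HashMap<String, u64>) -> Option<String> {
    let original_df = dict.get(term).copied().unwrap_or(0);
    dict.iter()
        .filter(|(cand, df)| {
            cand.as_str() != term
                && **df > original_df
                && levenshtein(term, cand.as_str()) <= 2
        })
        // Highest DF wins; ties go to the alphabetically smaller term.
        .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(cand, _)| cand.clone())
}
```

A caller would invoke `did_you_mean` only when the result count falls below `did_you_mean_threshold`, and attach the suggestion to the response without altering the results.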

### 8.6 Performance Target

| Operation | Latency Target | Constraint |
|-----------|---------------|------------|
| Prefix autocomplete | < 10 ms p99 | At 10M documents, 500K unique terms |
| Trending suggestions (empty prefix) | < 5 ms p99 | In-memory lookup |
| "Did you mean" | < 15 ms p99 | Edit distance computation over term dictionary |

---

## 9. Typo Tolerance

### 9.1 Fuzzy Matching Strategy

Typo tolerance is applied selectively, not universally. Exact matching is always preferred. Fuzzy matching activates only as a fallback:

```
Query: "jazz pino"
  |
  v
[Exact search for "jazz" AND "pino"]
  |
  v
Results < fuzzy_threshold (default: 5)?
  |
  YES --> [Fuzzy expand "pino" to edit distance 1]
  |         -> finds "piano" (DF: 50,000)
  |         -> re-search with "jazz piano"
  |
  NO  --> Return exact results
```

### 9.2 Edit Distance Rules

| Term Length | Max Edit Distance | Rationale |
|-------------|------------------|-----------|
| 1-3 chars | 0 (no fuzzy) | Too many false positives. "cat" -> "car", "can", "bat" -- too noisy. |
| 4-5 chars | 1 | Short terms tolerate 1 typo. "jaaz" -> "jazz", "pino" -> "piano". |
| 6+ chars | 2 | Longer terms tolerate 2 typos. "tutoral" -> "tutorial", "begginer" -> "beginner". |
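
The length rule, together with the fallback condition from Section 9.1, is small enough to state as code. The function names are illustrative, and length is counted in characters rather than bytes, which is an assumption the table does not spell out.

```rust
/// Maximum edit distance allowed for a term, by character count.
pub fn max_edit_distance(term: &str) -> u8 {
    match term.chars().count() {
        0..=3 => 0, // too short: fuzzy matching disabled
        4..=5 => 1,
        _ => 2, // hard cap
    }
}

/// Fuzzy expansion is a fallback: it runs only when the exact search
/// returned fewer results than `fuzzy_threshold`.
pub fn should_fuzzy_expand(exact_result_count: usize, fuzzy_threshold: usize) -> bool {
    exact_result_count < fuzzy_threshold
}
```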

### 9.3 Implementation

**Tantivy's `FuzzyTermQuery`** supports Levenshtein automaton-based fuzzy matching. For each term that produces insufficient results, a `FuzzyTermQuery` is constructed with the appropriate max edit distance. Tantivy compiles a Levenshtein DFA that scans the term dictionary in a single pass, collecting all terms within the edit distance.

```rust
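use tantivy::{query::FuzzyTermQuery, Term};

// For the term "tutoral" (7 chars, max_distance=2).
// `new` matches whole terms; `new_prefix` would also match any term that
// merely starts with a near-match, which is not wanted here.
let fuzzy_query = FuzzyTermQuery::new(
    Term::from_field_text(field, "tutoral"),
    2,          // max_distance
    true,       // transpositions count as distance 1
);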
```

**Transpositions** (swapping adjacent characters, e.g., "paino" -> "piano") count as edit distance 1, not 2. This is the Damerau-Levenshtein model, which better matches human typing errors.
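
To show the effect of counting a transposition as one edit, here is a sketch of the optimal string alignment distance, a restricted form of Damerau-Levenshtein. It is a reference computation for small inputs, not how Tantivy evaluates fuzzy queries (Tantivy uses an automaton), and `osa_distance` is an illustrative name.

```rust
/// Optimal string alignment distance: insertions, deletions, substitutions,
/// and swaps of two adjacent characters each cost 1.
pub fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut d = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() {
        d[i][0] = i;
    }
    for j in 0..=b.len() {
        d[0][j] = j;
    }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let cost = usize::from(a[i - 1] != b[j - 1]);
            let mut best = (d[i - 1][j] + 1)
                .min(d[i][j - 1] + 1)
                .min(d[i - 1][j - 1] + cost);
            // Adjacent transposition: "ai" <-> "ia".
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(d[i - 2][j - 2] + 1);
            }
            d[i][j] = best;
        }
    }
    d[a.len()][b.len()]
}
```

Under this measure "paino" is at distance 1 from "piano", whereas plain Levenshtein distance would report 2.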

### 9.4 Performance Considerations

Levenshtein automaton construction is O(|alphabet|^d) where d is the max edit distance. For d=2, this is manageable. For d=3+, the automaton becomes prohibitively large. The max edit distance of 2 is a hard cap.

Fuzzy matching is NOT applied to:
- Phrase queries (phrase must match exactly after stemming)
- Field-scoped keyword queries (exact match semantics)
- Prefix/wildcard queries (already flexible)
- Terms inside boolean NOT clauses

---

## 10. Segment Management

### 10.1 Tantivy Segment Model

Tantivy organizes the index into immutable segments. Each segment contains a self-contained inverted index, stored columns (fast fields), position data, and a document store. New documents are buffered in memory and flushed as new segments on commit.

```
Tantivy Index Lifecycle

  write(doc)  write(doc)  write(doc)
       |           |           |
       v           v           v
  +---------------------------------------+
  |  IndexWriter (in-memory buffer)       |
  |  - up to 8 concurrent indexing threads|
  |  - configurable heap budget           |
  +---------------------------------------+
                    |
            commit() triggers
                    |
                    v
  +--------+  +--------+  +--------+
  | Seg 0  |  | Seg 1  |  | Seg 2  |  <- on-disk, immutable
  +--------+  +--------+  +--------+
                    |
        merge policy evaluates
                    |
                    v
  +---------------------------+
  | Merged Segment            |  <- replaces Seg 0 + Seg 1
  +---------------------------+
```

### 10.2 Commit Strategy

Tantivy commits control when new documents become searchable. Each commit:
1. Flushes all in-memory documents as new segment(s) on disk.
2. Atomically updates `meta.json` to include new segments.
3. Optionally runs the merge policy to schedule background merges.

**tidalDB's commit cadence:**

| Parameter | Default | Range | Rationale |
|-----------|---------|-------|-----------|
| `text_index.commit_interval` | 1 second | 100ms - 10s | Time between automatic commits. 1s balances search freshness against segment proliferation. |
| `text_index.commit_batch_size` | 5,000 | 100 - 50,000 | Force commit when this many documents are buffered, even if the interval has not elapsed. |

At 1-second commit intervals, a document is committed within about 1 second of reaching the `IndexWriter` buffer; end-to-end visibility also includes outbox polling and reader reload (Section 10.4). Under burst writes (e.g., 10K entities imported), the batch size trigger forces early commits so the number of buffered documents stays bounded.

**Each commit creates up to 1 segment per active indexing thread.** With 4 threads and 1-second commits, up to 4 new segments are created per second. The merge policy consolidates these.
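The two commit triggers (interval and batch size) combine into a simple decision. The sketch below is hypothetical: `CommitPolicy` and `should_commit` are illustrative names, and skipping commits when nothing is buffered is an assumption made here to avoid empty segments, not something the spec mandates.

```rust
use std::time::Duration;

/// Hypothetical mirror of `text_index.commit_interval` and
/// `text_index.commit_batch_size`.
struct CommitPolicy {
    interval: Duration,
    batch_size: usize,
}

/// Decide whether the indexer should commit now.
fn should_commit(policy: &CommitPolicy, buffered_docs: usize, since_last_commit: Duration) -> bool {
    // Assumption: an empty buffer never triggers a commit.
    if buffered_docs == 0 {
        return false;
    }
    // Commit when either the batch is full or the interval has elapsed.
    buffered_docs >= policy.batch_size || since_last_commit >= policy.interval
}
```

With the defaults (1 second, 5,000 documents), a 10K-entity burst commits twice on the batch trigger before the interval fires.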

### 10.3 Merge Policy

tidalDB uses Tantivy's `LogMergePolicy` (default) with tuned parameters:

| Parameter | Default | Rationale |
|-----------|---------|-----------|
| `min_merge_size` | 8 | Minimum number of segments before merging is considered. Prevents merging when segment count is already low. |
| `max_docs_before_merge` | 10,000,000 | Segments larger than this are never merged into. Prevents rewriting very large segments. |
| `min_num_segments` | 8 | Merge is triggered when segment count exceeds this. |
| `max_merge_factor` | 10 | Maximum segments merged in a single operation. Bounds merge I/O. |

**Target: fewer than 20 segments at steady state.** At 10M documents, this means segments of 500K-2M documents each. Tantivy searches segments in parallel (when configured with a thread pool), so segment count has diminishing impact on query latency up to approximately 30 segments.

### 10.4 Real-Time Indexing Visibility

The timeline from entity write to searchability:

```
Entity write acknowledged
  |
  | WAL durably logged (0 ms)
  |
  v
Outbox entry created (0 ms)
  |
  | Background indexer polls outbox
  | (poll_interval, default 100ms)
  |
  v
Document added to IndexWriter buffer (<1 ms)
  |
  | Next commit fires
  | (commit_interval, default 1s)
  |
  v
Segment flushed to disk, meta.json updated
  |
  | IndexReader reloaded
  | (reader_reload_interval, default 500ms)
  |
  v
Document visible to search queries
```

**Worst-case visibility latency:** `outbox_poll_interval + commit_interval + reader_reload_interval` = 100ms + 1000ms + 500ms = **1.6 seconds**.

**Typical visibility latency:** Approximately **500-800ms** (poll + commit overlap, reader may already be reloading).

### 10.5 Delete Handling

When an entity is archived or deleted, its document must be removed from the text index. Tantivy's delete mechanism:

1. Call `index_writer.delete_term(Term::from_field_bytes(entity_id_field, &entity_id_bytes))`.
2. The delete is recorded as a tombstone (bitset marking the document as deleted).
3. Deleted documents are excluded from search results after the next commit and reader reload.
4. Physical removal occurs during segment merging -- the merge process skips deleted documents, reclaiming space.

**Delete-then-add for updates.** Entity metadata updates (e.g., title change) require removing the old document and inserting a new one. Within a single commit batch, the delete applies to prior segments and earlier operations in the batch. The add creates a new document in the new segment.

```rust
// Entity update: title changed
writer.delete_term(Term::from_field_bytes(entity_id_field, &id_bytes));
writer.add_document(new_tantivy_doc)?;
writer.commit()?;
```

### 10.6 Merge Latency Mitigation

Segment merging consumes CPU and I/O in background threads. Under concurrent search load, merges can cause latency spikes. Mitigations:

1. **Readers are never blocked by merges.** A `Searcher` captures an immutable snapshot of the index at acquisition time. Ongoing merges do not affect active searches.
2. **I/O priority.** Merge threads should be configured with lower I/O scheduling priority than search threads (via `ionice` or equivalent on Linux).
3. **Merge rate limiting.** Tantivy's `MergePolicy` can be configured to limit concurrent merges. Default: 1 concurrent merge operation.
4. **Bulk load mode.** During initial data import, set `NoMergePolicy` to skip background merging entirely. After import completes, switch to `LogMergePolicy` and trigger a one-time merge sweep.

---

## 11. Hybrid Fusion with Vector Retrieval

### 11.1 Two-Phase Retrieval Architecture

tidalDB's search pipeline retrieves candidates from two independent indexes, then fuses results:

```
User query: "jazz piano tutorial" + query_embedding
     |
     +------> [Text Index (Tantivy)]           [Vector Index (USearch)]  <---+
     |         BM25 search                      ANN search                   |
     |         query: "jazz piano tutorial"     vector: query_embedding      |
     |         top_k_text candidates            top_k_vector candidates      |
     |              |                                  |                     |
     |              v                                  v                     |
     |         +------------------------------------------+                  |
     |         |  Fusion (RRF or Linear Combination)      |                  |
     |         |  Merge two ranked lists into one          |                  |
     |         +------------------------------------------+                  |
     |              |                                                        |
     |              v                                                        |
     |         Fused candidate set (up to top_k_text + top_k_vector unique)  |
     |              |                                                        |
     +------------- | ----> [Ranking Pipeline] ---> [Diversity] ---> Results |
                    |
              Signal scoring, profile boosts,
              personalization, quality gates
```

### 11.2 Candidate Retrieval Sizes

Each index returns an independent top-k candidate set:

| Parameter | Default | Range | Rationale |
|-----------|---------|-------|-----------|
| `top_k_text` | 200 | 50 - 1,000 | Number of BM25 candidates. 200 is sufficient for most queries. Increase for very broad queries. |
| `top_k_vector` | 200 | 50 - 1,000 | Number of ANN candidates. Matches text retrieval for balanced fusion. |

The union of both candidate sets (up to 400 unique entities) feeds the ranking pipeline. Documents appearing in both lists receive fused scores.

### 11.3 Reciprocal Rank Fusion (RRF)

**Default fusion strategy.** RRF uses rank positions only, eliminating the score normalization problem:

```
RRF_score(d) = sum over each ranked list L:
    1 / (k + rank_L(d))
```

Where:
- `k` = smoothing constant (default: 60)
- `rank_L(d)` = 1-based rank of document d in list L
- If document d does not appear in list L, it contributes 0 from that list

**Pseudocode:**

```rust
use std::collections::HashMap;

// Assumes `EntityId: Copy + Eq + Hash + Ord`.
fn reciprocal_rank_fusion(
    text_results: &[(EntityId, f32)],   // sorted by BM25 desc
    vector_results: &[(EntityId, f32)], // sorted by similarity desc
    k: u32,                              // default: 60
) -> Vec<(EntityId, f64)> {
    let mut scores: HashMap<EntityId, f64> = HashMap::new();

    // Score from text results (only the rank is used, not the BM25 score)
    for (rank_0, (entity_id, _bm25_score)) in text_results.iter().enumerate() {
        let rank = (rank_0 + 1) as f64; // 1-based rank
        *scores.entry(*entity_id).or_default() += 1.0 / (f64::from(k) + rank);
    }

    // Score from vector results
    for (rank_0, (entity_id, _similarity)) in vector_results.iter().enumerate() {
        let rank = (rank_0 + 1) as f64;
        *scores.entry(*entity_id).or_default() += 1.0 / (f64::from(k) + rank);
    }

    // Sort by fused score descending. Ties are broken by entity ID so the
    // output does not depend on HashMap iteration order.
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

**Evidence for RRF as default:** Cormack, Clarke, Buttcher (SIGIR 2009) showed RRF outperforming Condorcet fusion all 7 times tested and CombMNZ 6/7 times (p ~ 0.04). The k=60 constant is robust -- values from 30 to 100 produce nearly identical results. Qdrant and Elasticsearch default to RRF. RRF requires no score normalization, no training data, and no tuning -- ideal for tidalDB's zero-configuration starting point.

### 11.4 Linear Combination (Tuned Fusion)

**Upgrade path when relevance labels exist.** A convex combination uses normalized scores:

```
fused_score(d) = alpha * norm(text_score(d)) + (1 - alpha) * vector_score(d)
```

Where:
- `alpha` = text weight (configurable per ranking profile, default: 0.6)
- `norm()` = score normalization function (see 11.5)
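- `vector_score(d)` is cosine similarity, which is bounded in [-1, 1], not [0, 1]; if negative similarities can reach fusion, clamp or rescale them (e.g., `(1 + cos) / 2`) so both terms share the [0, 1] range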

**Evidence for linear combination as upgrade:** Bruch, Gai, Ingber (ACM TOIS, 2024) showed convex combination outperforms RRF in both in-domain and out-of-domain settings when even a small training set is available. The key insight: RRF discards score magnitude information.
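As an illustration of the convex combination, the sketch below fuses one candidate's scores. It assumes both inputs are already in [0, 1] and that a candidate missing from one list contributes 0 from that list (mirroring the RRF convention); the name `linear_fusion` is hypothetical.

```rust
/// Convex combination of a normalized text score and a vector score.
/// `None` means the candidate did not appear in that result list.
fn linear_fusion(alpha: f64, text_norm: Option<f64>, vector_score: Option<f64>) -> f64 {
    // Keep alpha inside [0, 1] so the combination stays convex.
    let alpha = alpha.clamp(0.0, 1.0);
    // Assumption: an absent score contributes 0 rather than being imputed.
    let text = text_norm.unwrap_or(0.0);
    let vector = vector_score.unwrap_or(0.0);
    alpha * text + (1.0 - alpha) * vector
}
```

With the default `alpha = 0.6`, a candidate with normalized text score 1.0 and vector score 0.5 fuses to 0.6 * 1.0 + 0.4 * 0.5 = 0.8.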

### 11.5 Score Normalization

BM25 scores must be normalized before linear combination. Two strategies:

**Min-Max Normalization (default for linear combination):**

```
norm(s) = (s - min_score) / (max_score - min_score)
```

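Where `min_score` and `max_score` are taken from the current query's text result set, so BM25 scores map to [0, 1] per query and the same raw score can normalize differently across queries. The formula is undefined when `max_score == min_score` (for example, a single result), so that case needs explicit handling.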
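A minimal sketch of per-query min-max normalization follows. Mapping every score to 1.0 when all scores are equal is an assumption chosen for illustration (it treats a lone match as a full-strength match); the function name is hypothetical and scores are assumed to be finite.

```rust
/// Min-max normalize one query's BM25 scores into [0, 1].
fn min_max_normalize(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|&s| {
            if range > 0.0 {
                (s - min) / range
            } else {
                // Assumption: with no spread (single result or all ties),
                // every score normalizes to 1.0 instead of dividing by zero.
                1.0
            }
        })
        .collect()
}
```

For scores `[12.0, 8.0, 4.0]` this yields `[1.0, 0.5, 0.0]`. Note that the lowest-ranked text hit always normalizes to 0, which is one reason the atan alternative exists.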

**Atan Normalization (alternative):**

```
norm(s) = (2 / pi) * atan(s / C)
```

Where C is a corpus-dependent constant (default: 10.0 for typical BM25 score ranges). This avoids needing min/max from the current result set. Vespa uses this approach.
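The atan formula translates directly into code. This sketch assumes non-negative BM25 scores and a positive `c`; the function name is hypothetical.

```rust
use std::f32::consts::PI;

/// Query-independent normalization: maps a score in [0, inf) into [0, 1).
/// A score equal to `c` maps to exactly the midpoint, 0.5.
fn atan_normalize(score: f32, c: f32) -> f32 {
    (2.0 / PI) * (score / c).atan()
}
```

Because `atan(1) = pi / 4`, the constant `C` is the raw score that normalizes to 0.5, which gives a practical way to pick it: set `C` near the BM25 score of a "reasonably good" match in the corpus.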

### 11.6 Configurable Fusion per Profile

Fusion strategy is set per ranking profile:

```rust
db.define_profile(ProfileDef {
    name: "search",
    candidate: Candidate::Hybrid {
        text_weight: 0.6,      // alpha for linear combination
        vector_weight: 0.4,    // 1 - alpha
        fusion: Fusion::Rrf { k: 60 },  // or Fusion::Linear { normalize: MinMax }
    },
    ..Default::default()
})?;
```

| Fusion Strategy | When to Use | Configuration |
|----------------|-------------|---------------|
| `Fusion::Rrf { k: 60 }` | Default. No training data. Heterogeneous score distributions. | k: smoothing constant (30-100). |
| `Fusion::Linear { normalize: MinMax }` | Training data available. Known score distributions. | alpha via text_weight/vector_weight. |
| `Fusion::Linear { normalize: Atan { c: 10.0 } }` | Need query-independent normalization. | C: corpus-dependent constant. |
| `Fusion::TextOnly` | No query embedding provided (text-only search). | N/A |
| `Fusion::VectorOnly` | No query text provided (semantic-only search). | N/A |

### 11.7 Text-Only and Vector-Only Fallback

When only a text query is provided (no `vector` in the search request), the pipeline skips vector retrieval entirely. BM25 scores pass directly to the ranking pipeline without normalization. This is `Fusion::TextOnly`.

When only a vector is provided (empty `query` string), the pipeline skips text retrieval. Vector similarity scores pass directly. This is `Fusion::VectorOnly`.

When both are provided but one index returns zero results (e.g., the text query matches nothing), the other index's results are used alone. Documents do not receive a fusion penalty for being absent from an empty result list.
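The fallback rules reduce to a decision on which result lists are non-empty. The sketch below is illustrative: `EffectiveFusion` and `effective_fusion` are hypothetical names, and a skipped retrieval is modeled as zero hits.

```rust
/// Which path the pipeline takes after candidate retrieval.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EffectiveFusion {
    /// Both lists have hits: apply the profile's configured fusion.
    Fused,
    /// Only text hits: BM25 scores pass through unchanged.
    TextOnly,
    /// Only vector hits: similarity scores pass through unchanged.
    VectorOnly,
    /// Neither index returned anything.
    NoResults,
}

/// A retrieval that was skipped (no query text or no vector) counts as 0 hits.
fn effective_fusion(text_hits: usize, vector_hits: usize) -> EffectiveFusion {
    match (text_hits > 0, vector_hits > 0) {
        (true, true) => EffectiveFusion::Fused,
        (true, false) => EffectiveFusion::TextOnly,
        (false, true) => EffectiveFusion::VectorOnly,
        (false, false) => EffectiveFusion::NoResults,
    }
}
```

Treating "not requested" and "returned nothing" identically is what guarantees that documents are never penalized for being absent from an empty list.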

### 11.8 Cross-Scoring Optimization

For the ranking pipeline's signal scoring phase, the raw BM25 score and raw vector similarity score are preserved as features on each candidate, even after fusion:

```rust
pub struct FusedCandidate {
    pub entity_id: EntityId,
    pub fused_score: f64,
    pub text_score: Option<f32>,    // raw BM25, pre-normalization
    pub vector_score: Option<f32>,  // raw cosine similarity
    pub text_rank: Option<u32>,     // 1-based rank in text results
    pub vector_rank: Option<u32>,   // 1-based rank in vector results
}
```

This enables ranking profiles to apply additional boosts based on raw scores:

```rust
// Boost items that scored well on BOTH text and vector
Boost::hybrid_match_bonus(0.1),  // +10% for items appearing in both lists
```
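One possible reading of the bonus, sketched under stated assumptions: the bonus is multiplicative on the fused score, and "appearing in both lists" is detected from the two rank fields. `EntityId` is aliased to `u64` purely for illustration, and `apply_hybrid_match_bonus` is a hypothetical helper, not the actual evaluation of `Boost::hybrid_match_bonus`.

```rust
/// Stand-in for the spec's `EntityId`; the concrete type is an assumption.
type EntityId = u64;

/// Mirrors the `FusedCandidate` struct from this section.
struct FusedCandidate {
    entity_id: EntityId,
    fused_score: f64,
    text_score: Option<f32>,
    vector_score: Option<f32>,
    text_rank: Option<u32>,
    vector_rank: Option<u32>,
}

/// Returns the fused score with the bonus applied when the candidate
/// was retrieved by both the text index and the vector index.
fn apply_hybrid_match_bonus(candidate: &FusedCandidate, bonus: f64) -> f64 {
    let in_both = candidate.text_rank.is_some() && candidate.vector_rank.is_some();
    if in_both {
        candidate.fused_score * (1.0 + bonus)
    } else {
        candidate.fused_score
    }
}
```

Keeping the raw scores and ranks on the candidate is what makes this kind of boost possible after fusion has collapsed them into a single number.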

---

## 12. Integration with Storage Engine

### 12.1 Dual-Write Outbox Pattern

The entity store is the source of truth. The text index is a derived index. Consistency between them follows the outbox pattern recommended in the Tantivy research (docs/research/tantivy.md):

```
Entity write request
     |
     v
+-----------------------------+
| WAL: write EntityWrite      |
| record with seqno N         |
+-----------------------------+
     |
     v
+-----------------------------+
| Entity Store (redb):        |
| write metadata to META key  |
+-----------------------------+
     |
     v
+-----------------------------+
| Outbox (fjall or redb):     |
| write (seqno N, entity_id,  |
|   operation: Insert/Update/ |
|   Delete, field_data)       |
+-----------------------------+
     |
     | (all above in same WAL record / atomic batch)
     |
     v
ACK returned to caller
     |
     | (asynchronous, background thread)
     v
+-----------------------------+
| Text Index Background       |
| Indexer:                    |
|   1. Poll outbox for        |
|      entries > last_seqno   |
|   2. For each entry:        |
|      - Insert: add_document |
|      - Update: delete + add |
|      - Delete: delete_term  |
|   3. Store last_seqno in    |
|      commit payload         |
|   4. Commit Tantivy         |
+-----------------------------+
```

### 12.2 Background Indexer

The background indexer is a dedicated thread that drains the outbox and feeds Tantivy:

```rust
/// Background thread that keeps the text index synchronized
/// with the entity store.
struct TextIndexer {
    /// Tantivy IndexWriter -- single-writer lock.
    writer: IndexWriter,
    /// Last outbox sequence number successfully committed to Tantivy.
    last_committed_seqno: u64,
    /// Polling interval for outbox reads.
    poll_interval: Duration,
    /// Maximum documents per commit batch.
    commit_batch_size: usize,
}
```

**Indexer loop:**

1. Read outbox entries with `seqno > last_committed_seqno`, up to `commit_batch_size`.
2. For each entry, translate to Tantivy operations (add, delete, update).
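3. Call `writer.prepare_commit()`, attach the highest processed `seqno` with `PreparedCommit::set_payload()`, then call `commit()` on the prepared commit. The payload is persisted atomically with the new segments, so the stored `seqno` always matches the indexed documents.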
4. Update `last_committed_seqno`.
5. Sleep for `poll_interval` if no entries were found.
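Step 1 of the loop is an ordered range read. The sketch below models the outbox as an in-memory `BTreeMap` to show the read semantics; the real outbox lives in the storage engine, and `OutboxOp` and `next_batch` are hypothetical names.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Unbounded};

/// Operation kinds carried by an outbox entry (Section 12.4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutboxOp {
    Insert,
    Update,
    Delete,
}

/// Return up to `batch_size` entries with `seqno > last_committed_seqno`,
/// in ascending seqno order. Assumes seqnos start at 1, so a fresh
/// indexer can start from `last_committed_seqno = 0`.
fn next_batch(
    outbox: &BTreeMap<u64, OutboxOp>,
    last_committed_seqno: u64,
    batch_size: usize,
) -> Vec<(u64, OutboxOp)> {
    outbox
        .range((Excluded(last_committed_seqno), Unbounded))
        .take(batch_size)
        .map(|(&seqno, &op)| (seqno, op))
        .collect()
}
```

After a successful commit, the indexer would advance `last_committed_seqno` to the seqno of the last element in the returned batch; an empty batch is the signal to sleep for `poll_interval`.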

### 12.3 Crash Recovery

On startup, the text indexer:

1. Opens the Tantivy index.
2. Reads the last commit's payload to recover `last_committed_seqno`.
3. Replays all outbox entries with `seqno > last_committed_seqno`.
4. Resumes normal polling.
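Step 2 depends on how the seqno is serialized into the commit payload. The sketch below assumes a decimal string and treats a missing payload as "nothing committed yet"; both the format and the name `recover_last_seqno` are assumptions, not part of the spec.

```rust
use std::num::ParseIntError;

/// Recover `last_committed_seqno` from the last commit's payload.
/// `None` means the index has never committed with a payload, so
/// replay starts from the beginning of the outbox (seqno 0).
fn recover_last_seqno(payload: Option<&str>) -> Result<u64, ParseIntError> {
    match payload {
        None => Ok(0),
        // Assumption: the payload is the seqno as a decimal string.
        Some(text) => text.trim().parse::<u64>(),
    }
}
```

A parse error here is worth surfacing rather than defaulting to 0: an unreadable payload suggests the index was written by a different version or is damaged, and a full rebuild (Section 12.5) may be the safer path.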

**Failure modes and recovery:**

| Failure | State After Crash | Recovery |
|---------|-------------------|----------|
| Crash before Tantivy commit | Entity store ahead of text index. Outbox entries exist for uncommitted docs. | Replay from `last_committed_seqno`. Documents appear in search after recovery. |
| Crash during Tantivy commit | Tantivy rolls back to last successful commit. | Same as above -- replay from last committed seqno. |
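| Crash after Tantivy commit but before outbox cleanup | Outbox still holds entries that are already indexed. | Recovery reads `last_committed_seqno` from the commit payload and skips entries at or below it; cleanup then removes them. If an entry is re-delivered anyway, duplicate deletes are no-ops, but a duplicate add would leave two documents for the same entity, and segment merges do not deduplicate. Applying inserts as delete-then-add on `_entity_id` keeps replay idempotent (invariant 9). |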
| Tantivy index corruption | Text index is unusable. | Full rebuild from entity store (Section 12.5). |

### 12.4 Outbox Key Encoding

Outbox entries are stored in the LSM-tree (fjall) for write performance:

```
Key:  OUTBOX{seqno:8BE}
Value: {operation:1}{entity_kind:1}{entity_id:8BE}{field_data:variable}
```

| Operation Byte | Meaning |
|---------------|---------|
| `0x01` | Insert (new entity) |
| `0x02` | Update (metadata changed) |
| `0x03` | Delete (entity archived/deleted) |

Outbox entries are cleaned up after the text indexer confirms they have been committed to Tantivy. Cleanup is a range delete: all keys with `seqno <= last_committed_seqno`.
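The key layout can be sketched as below. Big-endian encoding is what makes byte-wise key order equal seqno order, so both the indexer's range scan and the cleanup range delete are contiguous. Treating `OUTBOX` as a literal ASCII prefix is an assumption (it may be a compact tag in the real keyspace), and the function names are hypothetical.

```rust
/// Assumed literal key prefix; the actual keyspace tag may differ.
const OUTBOX_PREFIX: &[u8] = b"OUTBOX";

/// Encode `OUTBOX{seqno:8BE}`.
fn encode_outbox_key(seqno: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(OUTBOX_PREFIX.len() + 8);
    key.extend_from_slice(OUTBOX_PREFIX);
    key.extend_from_slice(&seqno.to_be_bytes());
    key
}

/// Decode a key back to its seqno; `None` if the prefix or length is wrong.
fn decode_outbox_key(key: &[u8]) -> Option<u64> {
    let rest = key.strip_prefix(OUTBOX_PREFIX)?;
    if rest.len() != 8 {
        return None;
    }
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(rest);
    Some(u64::from_be_bytes(bytes))
}
```

With this layout, cleanup after a commit is a delete over the key range from `encode_outbox_key(0)` through `encode_outbox_key(last_committed_seqno)` inclusive.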

### 12.5 Full Rebuild

The text index can be rebuilt from scratch using the entity store:

```rust
impl TextIndex for TantivyTextIndex {
    fn rebuild_from(&self, entity_store: &dyn EntityStore) -> Result<(), TextIndexError> {
        // 1. Create a new empty Tantivy index in a temporary directory
        // 2. Set NoMergePolicy for bulk load
        // 3. Scan all active entities from the entity store
        // 4. For each entity, extract text/keyword fields, add_document()
        // 5. Commit in chunks of `commit_batch_size`
        // 6. Switch merge policy to LogMergePolicy
        // 7. Trigger one-time merge sweep
        // 8. Atomically swap the old index directory for the new one
        // 9. Reload the IndexReader
        todo!("outline only; see steps 1-9 above")
    }
    // Remaining `TextIndex` methods omitted.
}
```

**Rebuild performance:** At ~30,000 docs/sec (measured for structured documents with 4-5 text fields on the Tantivy benchmark), a full 10M document rebuild completes in approximately **5-6 minutes**. The old index continues serving queries during the rebuild. The swap is atomic (directory rename).

### 12.6 Consistency Guarantees

The text index is **eventually consistent** with the entity store. The maximum lag is bounded by:

```
max_lag = outbox_poll_interval + commit_interval + reader_reload_interval
        = 100ms + 1000ms + 500ms
        = 1.6 seconds (worst case)
```

This means:
- A newly written entity is searchable within 1.6 seconds.
- A deleted entity may still appear in search results for up to 1.6 seconds after deletion.
- An updated entity may return stale text matches for up to 1.6 seconds.

For tidalDB's use case, this is acceptable. Content platforms routinely tolerate 1-5 second indexing lag. If sub-second freshness is critical, reduce `commit_interval` to 200ms (at the cost of more frequent segment creation and higher merge pressure).
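The effect of tuning `commit_interval` on the lag bound is simple arithmetic, sketched here with `std::time::Duration`. The function name is hypothetical; the three inputs correspond to `text_index.outbox_poll_interval`, `text_index.commit_interval`, and `text_index.reader_reload_interval`.

```rust
use std::time::Duration;

/// Worst-case lag: each stage just misses its tick and waits a full period.
fn worst_case_lag(poll: Duration, commit: Duration, reader_reload: Duration) -> Duration {
    poll + commit + reader_reload
}
```

With the defaults (100 ms, 1 s, 500 ms) the bound is 1.6 seconds. Reducing `commit_interval` to 200 ms gives 100 ms + 200 ms + 500 ms = 800 ms, at which point `reader_reload_interval` dominates and is the next setting to lower.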

---

## 13. Trait Abstraction

### 13.1 TextIndex Trait

All text retrieval operations are accessed through this trait. No module outside `storage/text/` interacts with Tantivy types directly.

```rust
/// Trait for the full-text search index.
///
/// The text index is a secondary index over entity metadata.
/// It is not a source of truth -- it can be rebuilt from the entity store.
pub trait TextIndex: Send + Sync {
    /// Index a document for a newly created or updated entity.
    ///
    /// For updates, the caller must call `delete_document` first.
    /// Fields are extracted from the entity's metadata according to the
    /// entity definition's field types (text and keyword fields only).
    fn index_document(
        &self,
        entity_kind: EntityKind,
        entity_id: EntityId,
        fields: &[(FieldName, FieldValue)],
    ) -> Result<(), TextIndexError>;

    /// Execute a text search query and return matching entities with BM25 scores.
    ///
    /// Results are sorted by BM25 score descending.
    /// `field_boosts` are per-field BM25 boost weights.
    /// `limit` caps the number of results.
    fn search(
        &self,
        entity_kind: EntityKind,
        query: &SearchQuery,
        field_boosts: &[(FieldName, f32)],
        limit: usize,
    ) -> Result<Vec<TextSearchResult>, TextIndexError>;

    /// Score a specific set of entity IDs against a query.
    ///
    /// Used by the ranking pipeline to obtain BM25 scores for entities
    /// that were retrieved by vector search but need text relevance scoring.
    /// Returns scores only for entities that match the query.
    fn score_candidates(
        &self,
        entity_kind: EntityKind,
        query: &SearchQuery,
        field_boosts: &[(FieldName, f32)],
        candidate_ids: &[EntityId],
    ) -> Result<Vec<TextSearchResult>, TextIndexError>;

    /// Return autocomplete suggestions for a prefix.
    ///
    /// Combines term dictionary prefix scan, popular query suggestions,
    /// and optionally personalized suggestions.
    fn suggest(
        &self,
        entity_kind: EntityKind,
        prefix: &str,
        limit: usize,
    ) -> Result<Vec<Suggestion>, TextIndexError>;

    /// Remove a document from the text index.
    ///
    /// The document is tombstoned and excluded from future search results.
    /// Physical removal occurs during segment merging.
    fn delete_document(
        &self,
        entity_kind: EntityKind,
        entity_id: EntityId,
    ) -> Result<(), TextIndexError>;

    /// Rebuild the entire text index from the entity store.
    ///
    /// Used for crash recovery when the text index is corrupted,
    /// or when the entity schema changes in ways that require re-indexing
    /// (e.g., tokenizer change, new text field added to existing entities).
    fn rebuild_from(
        &self,
        entity_store: &dyn EntityStore,
    ) -> Result<(), TextIndexError>;

    /// Commit pending changes and make them visible to searchers.
    ///
    /// Called by the background indexer on its commit cadence.
    /// Returns the commit opstamp for outbox coordination.
    fn commit(&self) -> Result<u64, TextIndexError>;

    /// Return the number of documents currently in the index.
    fn doc_count(&self) -> Result<u64, TextIndexError>;
}
```

### 13.2 Supporting Types

```rust
/// A text search result: entity ID with BM25 score.
pub struct TextSearchResult {
    pub entity_id: EntityId,
    pub score: f32,
}

/// An autocomplete suggestion.
pub struct Suggestion {
    /// The suggested completion string.
    pub text: String,
    /// Suggestion source for UI rendering.
    pub source: SuggestionSource,
    /// Relevance/popularity score for ranking suggestions.
    pub score: f64,
}

pub enum SuggestionSource {
    /// From the term dictionary (term completion).
    TermCompletion,
    /// From popular query tracking.
    PopularQuery,
    /// From the user's personal history.
    PersonalHistory,
    /// From trending queries.
    TrendingQuery,
}

pub enum TextIndexError {
    /// Tantivy internal error.
    Engine(String),
    /// Schema mismatch: field not found in index.
    FieldNotFound(FieldName),
    /// Index is being rebuilt; queries are temporarily unavailable.
    Rebuilding,
    /// I/O error during index operations.
    Io(std::io::Error),
}
```

### 13.3 MockTextIndex

For testing, `MockTextIndex` implements the `TextIndex` trait with an in-memory inverted index:

```rust
/// In-memory text index for deterministic testing.
///
/// Uses a simple HashMap<(FieldName, String), Vec<EntityId>> for term
/// lookups and a naive TF-IDF scorer. Not performant, but correct.
/// Enables unit testing of the query parser, fusion logic, and
/// ranking pipeline without Tantivy on disk.
pub struct MockTextIndex {
    documents: HashMap<EntityId, Vec<(FieldName, String)>>,
    inverted_index: HashMap<(FieldName, String), Vec<EntityId>>,
}
```

The mock implements all trait methods with simplified but functionally correct behavior. BM25 scoring uses a basic TF-IDF approximation. Phrase matching checks term adjacency in the stored document text. This is sufficient for testing query parsing, fusion, and ranking integration.

### 13.4 TantivyTextIndex

The production implementation:

```rust
/// Production text index backed by Tantivy.
pub struct TantivyTextIndex {
    /// Tantivy index handle.
    index: tantivy::Index,
    /// Single writer. Wrapped in Arc<Mutex<>> because `IndexWriter::commit`
    /// takes `&mut self`, while the `TextIndex` trait methods take `&self`
    /// and the writer is shared with the background indexer thread.
    writer: Arc<Mutex<IndexWriter>>,
    /// Reader for search operations. Internally uses a pool of Searcher
    /// instances. Reloaded on commit to see new segments.
    reader: IndexReader,
    /// Field name -> Tantivy Field mapping.
    field_map: HashMap<String, tantivy::schema::Field>,
    /// The entity_id fast field for DocAddress -> EntityId resolution.
    entity_id_field: tantivy::schema::Field,
    /// The entity_kind fast field for per-kind queries.
    entity_kind_field: tantivy::schema::Field,
}
```

---

## 14. Performance Targets

### 14.1 Search Latency

| Operation | Target | Corpus Size | Conditions |
|-----------|--------|-------------|------------|
| Single-term keyword search | < 5 ms p50, < 10 ms p99 | 10M documents | Warm cache, single thread |
| Multi-term OR search (3 terms) | < 10 ms p50, < 20 ms p99 | 10M documents | Warm cache, single thread |
| Phrase search | < 10 ms p50, < 20 ms p99 | 10M documents | Warm cache, 2-3 word phrase |
| Boolean AND + NOT | < 10 ms p50, < 20 ms p99 | 10M documents | Warm cache |
| Field-scoped search | < 5 ms p50, < 10 ms p99 | 10M documents | Single field, warm cache |
| Hybrid fusion (text + vector) | < 30 ms p50, < 50 ms p99 | 10M documents | Both indexes warm, includes fusion computation |

### 14.2 Indexing Throughput

| Operation | Target | Conditions |
|-----------|--------|------------|
| Bulk indexing (initial load) | > 30,000 docs/sec | 4 indexing threads, NoMergePolicy, 4-5 text fields per doc |
| Incremental indexing (steady state) | > 10,000 docs/sec | LogMergePolicy active, concurrent search load |
| Full rebuild (10M docs) | < 6 minutes | 4 threads, temporary index directory |

### 14.3 Autocomplete

| Operation | Target | Conditions |
|-----------|--------|------------|
| Prefix autocomplete | < 10 ms p99 | 500K unique terms, 10M documents |
| Trending suggestions | < 5 ms p99 | In-memory, no disk I/O |
| Personalized suggestions | < 10 ms p99 | User history in memory |

### 14.4 Real-Time Visibility

| Metric | Target |
|--------|--------|
| Entity write to searchable | < 1.6 seconds (worst case) |
| Entity write to searchable | < 800 ms (typical) |
| Entity delete to unsearchable | < 1.6 seconds (worst case) |

### 14.5 Resource Budget

| Resource | Budget at 10M Documents | Notes |
|----------|------------------------|-------|
| Disk space (index) | 5-8 GB | 4-5 text fields, positions indexed, ~38% compression ratio |
| RAM (page cache) | 5-8 GB recommended | mmap-based search; performance depends on page cache residency |
| RAM (IndexWriter heap) | 256 MiB | Configurable. 256 MiB supports 4 indexing threads at 64 MiB each. |
| Background threads | 2 | 1 for the indexer loop, 1 for merge operations |

---

## 15. Invariants and Correctness Guarantees

These invariants must hold at all times. Property tests and integration tests enforce them.

| # | Invariant | Test Strategy |
|---|-----------|---------------|
| 1 | Every active entity in the entity store has exactly one corresponding document in the text index (eventually, within the consistency window). | Periodic consistency check: scan entity store, verify each entity has a text index document. |
| 2 | No archived or deleted entity appears in text search results. | Property test: archive entity, verify it disappears from search within the consistency window. |
| 3 | A phrase query `"A B"` matches only documents where token A appears immediately before token B in the same field. | Property test: generate random documents, verify phrase matches against position indexes. |
| 4 | Boolean NOT never produces false negatives: if a document does not contain the excluded term, it must not be excluded. | Property test: documents that match the positive clauses and do not contain the NOT term must appear in results. |
| 5 | Field-scoped queries never match in fields other than the specified field. | Property test: `title:X` with X only in description returns zero results. |
| 6 | The text index can be fully rebuilt from the entity store and produce identical search results. | Integration test: build index, query, rebuild, query again, compare results. |
| 7 | BM25 scores are deterministic: the same query against the same corpus always produces the same scores (within floating-point precision). | Property test: run same query twice, verify scores match. |
| 8 | The outbox never loses an entry: every entity write produces an outbox entry that is eventually consumed by the text indexer. | Crash test: inject failures during entity write, verify outbox entries survive recovery. |
| 9 | Duplicate outbox replay does not corrupt the text index. | Test: replay the same outbox range twice, verify search results are correct (no duplicate documents). |
| 10 | Autocomplete suggestions never include terms from deleted/archived entities (eventually, within the consistency window). | Integration test: delete entity with unique term, verify term disappears from suggestions after commit + merge. |

---

## 16. Configuration Reference

### 16.1 Text Index Configuration

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `text_index.enabled` | `true` | bool | Enable/disable the text index entirely. When disabled, SEARCH queries with text return an error. |
| `text_index.data_dir` | `{data_dir}/text_index/` | path | Directory for Tantivy index files. |
| `text_index.writer_heap_budget` | 256 MiB | 64 MiB - 2 GiB | Memory budget for Tantivy's IndexWriter. Divided among indexing threads. |
| `text_index.indexing_threads` | 4 | 1 - 8 | Number of concurrent indexing threads within Tantivy. |
| `text_index.commit_interval` | 1 second | 100ms - 10s | Time between automatic Tantivy commits. |
| `text_index.commit_batch_size` | 5,000 | 100 - 50,000 | Maximum documents buffered before forcing a commit. |
| `text_index.reader_reload_interval` | 500 ms | 100ms - 5s | How often the IndexReader checks for new commits. |

### 16.2 Outbox Configuration

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `text_index.outbox_poll_interval` | 100 ms | 10ms - 1s | How often the background indexer polls the outbox. |
| `text_index.outbox_batch_size` | 1,000 | 100 - 10,000 | Maximum outbox entries processed per indexer cycle. |

### 16.3 Merge Policy Configuration

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `text_index.merge_policy` | `log` | `log`, `none` | Merge strategy. `none` disables merging (for bulk load). |
| `text_index.merge_min_segments` | 8 | 2 - 50 | Minimum segment count to trigger merge. |
| `text_index.merge_max_factor` | 10 | 2 - 20 | Maximum segments merged in one operation. |

### 16.4 BM25 Configuration (per Ranking Profile)

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `bm25_k1` | 1.2 | 0.0 - 3.0 | Term frequency saturation parameter. |
| `bm25_b` | 0.75 | 0.0 - 1.0 | Document length normalization parameter. |
| `phrase_boost` | 2.0 | 1.0 - 10.0 | Multiplicative boost for phrase matches. |
| `field_boosts` | See Section 3.3 | field -> f32 | Per-field BM25 boost weights. |

### 16.5 Fusion Configuration (per Ranking Profile)

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `fusion` | `Rrf { k: 60 }` | See Section 11.6 | Fusion strategy for hybrid search. |
| `top_k_text` | 200 | 50 - 1,000 | BM25 candidate set size for fusion. |
| `top_k_vector` | 200 | 50 - 1,000 | ANN candidate set size for fusion. |
| `text_weight` | 0.6 | 0.0 - 1.0 | Text score weight in linear combination. |
| `vector_weight` | 0.4 | 0.0 - 1.0 | Vector score weight in linear combination. |

### 16.6 Autocomplete Configuration

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `suggest.max_term_completions` | 10 | 1 - 100 | Maximum term completions from the term dictionary. |
| `suggest.max_popular_queries` | 100,000 | 10,000 - 1,000,000 | Maximum popular query strings tracked in memory. |
| `suggest.popular_query_decay` | 24 h | 1 h - 7 d | Half-life for popular query velocity decay. |
| `suggest.did_you_mean_threshold` | 5 | 0 - 100 | "Did you mean" suggestions are offered when a query returns fewer results than this. 0 disables. |
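
A half-life parameter implies exponential decay, which the sketch below spells out for `suggest.popular_query_decay`. It also encodes one reading of `suggest.did_you_mean_threshold`: suggestions are offered when the result count falls below the threshold, which makes 0 a natural "disabled" value. Both function names are hypothetical, and Section 8 is authoritative for the actual behavior.

```rust
use std::time::Duration;

/// Exponential decay with a half-life: the value halves every `half_life`.
pub fn decayed_velocity(velocity: f64, elapsed: Duration, half_life: Duration) -> f64 {
    let half_lives = elapsed.as_secs_f64() / half_life.as_secs_f64();
    velocity * 0.5f64.powf(half_lives)
}

/// Assumed trigger rule: offer "did you mean" when results are scarce.
/// With a threshold of 0 the comparison can never be true, so the
/// feature is effectively disabled.
pub fn should_offer_did_you_mean(result_count: usize, threshold: usize) -> bool {
    result_count < threshold
}
```

With the default 24-hour half-life, a query string with velocity 100 decays to 50 after one day and to 25 after two, so a query has to keep being issued to stay in the popular set.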

### 16.7 Typo Tolerance Configuration

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `fuzzy.enabled` | `true` | bool | Enable/disable typo tolerance. |
| `fuzzy.min_term_length` | 4 | 1 - 10 | Minimum term length for fuzzy matching. |
| `fuzzy.short_term_distance` | 1 | 0 - 2 | Max edit distance for terms with length < 6. |
| `fuzzy.long_term_distance` | 2 | 0 - 3 | Max edit distance for terms with length >= 6. |
| `fuzzy.result_threshold` | 5 | 0 - 100 | Fuzzy fallback runs when exact matching returns fewer results than this. 0 = always fuzzy. |
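
The parameters above combine into a small decision procedure, sketched here under stated assumptions: term length is counted in Unicode scalar values rather than bytes, the length cutoff of 6 is taken from the table descriptions, and the fallback rule reads `fuzzy.result_threshold` as "run fuzzy matching when exact results fall below this, or always when it is 0". `FuzzyConfig` and both functions are hypothetical names.

```rust
/// Hypothetical in-memory form of the typo tolerance parameters.
#[derive(Debug, Clone)]
pub struct FuzzyConfig {
    pub enabled: bool,
    pub min_term_length: usize,
    pub short_term_distance: u8,
    pub long_term_distance: u8,
    pub result_threshold: usize,
}

impl Default for FuzzyConfig {
    /// Defaults taken from the table above.
    fn default() -> Self {
        Self {
            enabled: true,
            min_term_length: 4,
            short_term_distance: 1,
            long_term_distance: 2,
            result_threshold: 5,
        }
    }
}

/// Maximum edit distance allowed for a single query term (0 = exact only).
pub fn max_edit_distance(cfg: &FuzzyConfig, term: &str) -> u8 {
    if !cfg.enabled {
        return 0;
    }
    // Assumption: length is measured in characters, not bytes.
    let len = term.chars().count();
    if len < cfg.min_term_length {
        0
    } else if len < 6 {
        cfg.short_term_distance
    } else {
        cfg.long_term_distance
    }
}

/// Whether the fuzzy pass should run, given the exact-match result count.
pub fn should_run_fuzzy(cfg: &FuzzyConfig, exact_results: usize) -> bool {
    cfg.enabled && (cfg.result_threshold == 0 || exact_results < cfg.result_threshold)
}
```

With the defaults, `cat` is matched exactly, `tidal` tolerates one edit, and `search` tolerates two. Short terms are excluded because even a single edit on a three-letter word matches a large share of the term dictionary.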

---

## References

- **Tantivy Research:** `docs/research/tantivy.md` -- Custom Collector API, dual-write consistency, segment merge latency, RRF vs linear combination analysis
- **ANN Research:** `docs/research/ann_for_tidaldb.md` -- USearch selection, filtered search architecture, memory/persistence planning
- **Storage Engine Spec:** `docs/specs/01-storage-engine.md` -- WAL, outbox pattern, key encoding, hybrid storage backend
- **Entity Model Spec:** `docs/specs/02-entity-model.md` -- Field types (text, keyword, keywords), entity lifecycle, embedding management
- **Signal System Spec:** `docs/specs/03-signal-system.md` -- Signal write path, WAL-first durability
- **Cormack, Clarke, Büttcher.** "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009. -- RRF algorithm, k=60 default, statistical significance results
- **Bruch, Gai, Ingber.** "An Analysis of Fusion Functions for Hybrid Retrieval." ACM TOIS 2024. -- Convex combination outperforms RRF once its weight is tuned on a small set of training examples
- **Lee, J.H.** "Analyses of Multiple Evidence Combination." SIGIR 1997. -- Min-max score normalization for rank fusion
- **Robertson, Zaragoza.** "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR, 2009. -- BM25 formula, parameter analysis, k1/b defaults
- **Tantivy 0.25 documentation** (docs.rs/tantivy) -- Collector trait, Weight/Scorer pipeline, LogMergePolicy, schema API
- **Quickwit engineering blog** -- Tantivy segment management at scale, commit frequency tradeoffs
- **Vespa engineering blog** -- Atan normalization for hybrid search, NDCG comparison of fusion methods