tidaldb/docs/planning/milestone-5/phase-1/task-01-text-index-core.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00


Task 01: TextIndex Core

Delivers

TextIndex struct, Tantivy schema generation from tidalDB schema text field definitions, IndexWriter/IndexReader lifecycle, entity_id fast field, TextIndex::open() and TextIndex::close().

Also extends Schema and SchemaBuilder with TextFieldDef — the declaration of which metadata keys to index for full-text search, and whether they are tokenized text or keyword (raw) fields.

Complexity: L

Dependencies

  • None from prior m5 tasks (this is the foundation)
  • tidalDB Schema (schema/validation.rs) — will be extended
  • Cargo.toml: the tantivy dependency must be added, plus tempfile as a dev-dependency (the on-disk test below calls tempfile::tempdir())

Technical Design

1. Add tantivy to Cargo.toml

tantivy = "0.22"

Use 0.22 — stable API, widely deployed, Collector trait and DocSet::seek available.

2. Add TextFieldDef to Schema

In schema/validation.rs, add:

/// Declaration of a text field for full-text search indexing.
///
/// When a text field is declared in the schema, items written to tidalDB
/// will have the corresponding metadata key indexed by Tantivy for full-text search.
#[derive(Debug, Clone)]
pub struct TextFieldDef {
    /// The metadata key to index (e.g., "title", "description", "tags").
    pub key: String,
    /// Whether this field is tokenized (full-text) or raw (keyword/exact-match).
    pub field_type: TextFieldType,
}

/// The Tantivy indexing mode for a text field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TextFieldType {
    /// Full tokenization with Tantivy's default tokenizer (splits on whitespace and
    /// punctuation, lowercases, drops tokens longer than 40 bytes).
    /// Good for: title, description, body text.
    Text,
    /// Raw storage, no tokenization. Only exact-match queries work.
    /// Good for: category, format, creator_id, language tags.
    Keyword,
}
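
The practical difference between the two modes can be illustrated with a small standalone sketch. The `tokenize` function below is only a rough approximation written for this example, not Tantivy's tokenizer, and the two `*_matches` helpers are hypothetical names that do not exist in tidalDB.

```rust
/// Rough stand-in for a tokenizing analyzer: split on non-alphanumeric
/// characters and lowercase each token. Not Tantivy's actual implementation.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect()
}

/// A `Text` field matches when the (lowercased) query term equals any token.
fn text_field_matches(field_value: &str, term: &str) -> bool {
    let term = term.to_lowercase();
    tokenize(field_value).iter().any(|t| *t == term)
}

/// A `Keyword` field matches only when the whole raw value is identical.
fn keyword_field_matches(field_value: &str, term: &str) -> bool {
    field_value == term
}

fn main() {
    let title = "Intro to Tidal Pools";
    // Tokenized: a single word from the title is enough to match.
    assert!(text_field_matches(title, "tidal"));
    // Raw: the same word does not match the full stored value.
    assert!(!keyword_field_matches(title, "tidal"));
    // Raw: exact value matches, but case and punctuation are significant.
    assert!(keyword_field_matches("sci-fi", "sci-fi"));
    assert!(!keyword_field_matches("sci-fi", "Sci-Fi"));
}
```

This is why identifiers and enumerated values such as category or creator_id are better declared as Keyword: tokenizing "sci-fi" would split it into two unrelated terms.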

Add text_fields: Vec<TextFieldDef> to Schema and SchemaBuilder. Add SchemaBuilder::text_field(key, TextFieldType) builder method. Expose Schema::text_fields() -> &[TextFieldDef].
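
A minimal sketch of how the builder method and accessor could fit together is shown here. The `Schema` and `SchemaBuilder` types are simplified stand-ins that hold only text fields; the real types in schema/validation.rs carry more state, and the infallible `build()` is an assumption made for the example.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TextFieldType {
    Text,
    Keyword,
}

#[derive(Debug, Clone)]
pub struct TextFieldDef {
    pub key: String,
    pub field_type: TextFieldType,
}

/// Stand-in for tidalDB's Schema; the real type holds more than text fields.
#[derive(Debug, Default)]
pub struct Schema {
    text_fields: Vec<TextFieldDef>,
}

impl Schema {
    /// Declared text fields, in declaration order.
    #[must_use]
    pub fn text_fields(&self) -> &[TextFieldDef] {
        &self.text_fields
    }
}

/// Stand-in for tidalDB's SchemaBuilder.
#[derive(Debug, Default)]
pub struct SchemaBuilder {
    text_fields: Vec<TextFieldDef>,
}

impl SchemaBuilder {
    /// Declare a metadata key for full-text indexing (chainable).
    #[must_use]
    pub fn text_field(mut self, key: impl Into<String>, field_type: TextFieldType) -> Self {
        self.text_fields.push(TextFieldDef { key: key.into(), field_type });
        self
    }

    /// Hypothetical `build`; the real builder may validate and return a Result.
    #[must_use]
    pub fn build(self) -> Schema {
        Schema { text_fields: self.text_fields }
    }
}

fn main() {
    let schema = SchemaBuilder::default()
        .text_field("title", TextFieldType::Text)
        .text_field("tags", TextFieldType::Keyword)
        .build();
    assert_eq!(schema.text_fields().len(), 2);
    assert_eq!(schema.text_fields()[0].key, "title");
    assert_eq!(schema.text_fields()[1].field_type, TextFieldType::Keyword);
}
```

Returning a slice from `text_fields()` lets `TextIndex::open()` take `&[TextFieldDef]` directly without cloning the schema.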

3. TextIndex Module Structure

Create tidal/src/text/ module with:

tidal/src/text/
├── mod.rs          # pub re-exports
├── index.rs        # TextIndex struct and config
├── writer.rs       # TextIndexWriter
├── syncer.rs       # TextIndexSyncer (task-03)
├── collectors.rs   # Scoring collectors (task-04)
└── query.rs        # TextQueryParser (task-05)

Add pub mod text; to tidal/src/lib.rs.

4. TextIndex Struct

// tidal/src/text/index.rs

use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use tantivy::{Index, IndexReader, IndexWriter, ReloadPolicy, schema as tv_schema};

use crate::schema::{EntityId, TextFieldDef, TextFieldType};
use crate::TidalError;

/// Configuration for the text index.
#[derive(Debug, Clone)]
pub struct TextIndexConfig {
    /// Directory for Tantivy index files.
    pub index_dir: PathBuf,
    /// IndexWriter heap budget in bytes. Default: 50MB.
    pub heap_budget_bytes: usize,
    /// Maximum documents before forcing a commit.
    pub commit_every_n_docs: usize,
    /// Maximum seconds between commits.
    pub commit_every_secs: u64,
}

impl Default for TextIndexConfig {
    fn default() -> Self {
        Self {
            index_dir: PathBuf::from("data/text_index"),
            heap_budget_bytes: 50 * 1024 * 1024, // 50MB
            commit_every_n_docs: 1000,
            commit_every_secs: 2,
        }
    }
}

/// Fields that every Tantivy document must have.
pub(crate) struct TantivyFields {
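    /// Fast field holding the tidalDB entity ID (u64). Read at query time to map a
    /// matching DocAddress back to its EntityId. The reverse lookup (EntityId to
    /// document, e.g. for deletes) is a term lookup and needs an INDEXED field.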
    pub entity_id: tv_schema::Field,
    /// Declared text fields from the tidalDB schema.
    pub text_fields: Vec<(String, tv_schema::Field, TextFieldType)>,
}

/// The text index. Wraps Tantivy's Index, IndexWriter, and IndexReader.
///
/// Thread-safe: the IndexWriter is behind a Mutex (Tantivy enforces single-writer),
/// the IndexReader provides lock-free snapshot reads.
///
/// IMPORTANT: `TextIndex` is a derived index. The entity store is the source of truth.
/// If the Tantivy index is lost, call `rebuild_from()` to reconstruct it.
pub struct TextIndex {
    pub(crate) index: Index,
    pub(crate) writer: Mutex<IndexWriter>,
    pub(crate) reader: IndexReader,
    pub(crate) fields: Arc<TantivyFields>,
    pub(crate) config: TextIndexConfig,
}
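
The two commit thresholds in `TextIndexConfig` suggest a "whichever comes first" policy. The sketch below shows one way that decision could be expressed. `CommitPolicy` and `should_commit` are hypothetical names for this example, and how the syncer (task-03) actually combines the thresholds is an assumption.

```rust
use std::time::Duration;

/// Mirrors the two commit thresholds from `TextIndexConfig`.
struct CommitPolicy {
    commit_every_n_docs: usize,
    commit_every_secs: u64,
}

impl CommitPolicy {
    /// Hypothetical helper: true when either threshold has been reached
    /// and there is at least one uncommitted document.
    fn should_commit(&self, pending_docs: usize, since_last_commit: Duration) -> bool {
        if pending_docs == 0 {
            // Nothing buffered: committing would only create an empty segment.
            return false;
        }
        pending_docs >= self.commit_every_n_docs
            || since_last_commit >= Duration::from_secs(self.commit_every_secs)
    }
}

fn main() {
    // Same values as the defaults in TextIndexConfig.
    let policy = CommitPolicy { commit_every_n_docs: 1000, commit_every_secs: 2 };
    assert!(!policy.should_commit(0, Duration::from_secs(10)));
    assert!(!policy.should_commit(10, Duration::from_millis(500)));
    assert!(policy.should_commit(1000, Duration::from_millis(500)));
    assert!(policy.should_commit(10, Duration::from_secs(2)));
}
```

With the defaults, a burst of writes commits every 1000 documents, while a trickle of writes becomes searchable within roughly two seconds.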

5. TextIndex::open() and ::close()

impl TextIndex {
    /// Open or create a TextIndex from the given config and field definitions.
    ///
    /// If the directory already holds a Tantivy index, opens it and verifies that
    /// its schema matches the one built from `text_fields`.
    /// Otherwise creates the directory if needed and initializes a new index.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if Tantivy initialization fails.
    pub fn open(config: TextIndexConfig, text_fields: &[TextFieldDef]) -> crate::Result<Self> {
        // 1. Build Tantivy schema
        let (tv_schema, fields) = build_tantivy_schema(text_fields)?;

        // 2. Open or create index. Checking only that the directory exists is not
        //    enough: an empty directory (e.g. a fresh tempdir) has no index to open.
        std::fs::create_dir_all(&config.index_dir)
            .map_err(|e| TidalError::Internal(format!("create index dir: {e}")))?;
        let dir = tantivy::directory::MmapDirectory::open(&config.index_dir)
            .map_err(|e| TidalError::Internal(format!("tantivy open dir: {e}")))?;
        let index = Index::open_or_create(dir, tv_schema)
            .map_err(|e| TidalError::Internal(format!("tantivy open or create: {e}")))?;

        // 3. Create IndexWriter with heap budget
        let writer = index
            .writer(config.heap_budget_bytes)
            .map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;

        // 4. Create IndexReader with auto-reload on commit
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::OnCommitWithDelay)
            .try_into()
            .map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;

        Ok(Self {
            index,
            writer: Mutex::new(writer),
            reader,
            fields: Arc::new(fields),
            config,
        })
    }

    /// Open an in-memory text index for testing.
    pub fn ephemeral(text_fields: &[TextFieldDef]) -> crate::Result<Self> {
        let (tv_schema, fields) = build_tantivy_schema(text_fields)?;
        let index = Index::create_in_ram(tv_schema);
        let writer = index
            .writer(15 * 1024 * 1024) // 15MB minimum for ephemeral
            .map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::Manual)
            .try_into()
            .map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;
        let config = TextIndexConfig {
            index_dir: PathBuf::from(":memory:"),
            ..Default::default()
        };
        Ok(Self {
            index,
            writer: Mutex::new(writer),
            reader,
            fields: Arc::new(fields),
            config,
        })
    }

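    /// Graceful shutdown: commit pending documents, then wait for background merges.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer fails to commit or merge.
    pub fn close(self) -> crate::Result<()> {
        let mut writer = self
            .writer
            .into_inner()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        // Flush anything still buffered; uncommitted documents are lost when the
        // writer is dropped.
        writer
            .commit()
            .map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
        // `wait_merging_threads` consumes the writer.
        writer
            .wait_merging_threads()
            .map_err(|e| TidalError::Internal(format!("tantivy merge wait: {e}")))
    }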

    /// Get a reference to the fields mapping (for writer and collector use).
    #[must_use]
    pub fn fields(&self) -> &Arc<TantivyFields> {
        &self.fields
    }
}

/// Construct a Tantivy schema from tidalDB text field definitions.
///
/// Always adds:
/// - `entity_id`: u64 fast field for DocAddress -> EntityId lookup at query time
///
/// For each TextFieldDef:
/// - `TextFieldType::Text` → `TEXT | STORED` (tokenized, stored for highlight)
/// - `TextFieldType::Keyword` → `STRING | STORED` (raw, stored)
fn build_tantivy_schema(
    text_fields: &[TextFieldDef],
) -> crate::Result<(tv_schema::Schema, TantivyFields)> {
    let mut sb = tv_schema::Schema::builder();

    // entity_id fast field — every document must have this
    let entity_id_field = sb.add_u64_field(
        "entity_id",
        tv_schema::FAST | tv_schema::STORED,
    );

    let mut fields = Vec::with_capacity(text_fields.len());
    for def in text_fields {
        let options = match def.field_type {
            TextFieldType::Text => tv_schema::TEXT | tv_schema::STORED,
            TextFieldType::Keyword => tv_schema::STRING | tv_schema::STORED,
        };
        let field = sb.add_text_field(&def.key, options);
        fields.push((def.key.clone(), field, def.field_type.clone()));
    }

    let schema = sb.build();
    Ok((
        schema,
        TantivyFields {
            entity_id: entity_id_field,
            text_fields: fields,
        },
    ))
}
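
As written, `build_tantivy_schema` returns a `Result` but has no failing path. One plausible use for the error channel is rejecting field keys that would collide inside the Tantivy schema. The sketch below is an assumption about what such a pre-check might look like; `validate_text_field_keys` is a hypothetical helper, and it returns a `String` error only to stay self-contained (the real code would use `TidalError`).

```rust
use std::collections::HashSet;

/// Name reserved for the entity ID field that is added to every document.
const ENTITY_ID_FIELD: &str = "entity_id";

/// Hypothetical pre-check: reject keys that are empty, reserved, or repeated.
fn validate_text_field_keys(keys: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for &key in keys {
        if key.is_empty() {
            return Err("text field key must not be empty".to_string());
        }
        if key == ENTITY_ID_FIELD {
            return Err(format!("text field key `{key}` is reserved"));
        }
        // `insert` returns false when the key was already present.
        if !seen.insert(key) {
            return Err(format!("text field key `{key}` declared twice"));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_text_field_keys(&["title", "tags"]).is_ok());
    assert!(validate_text_field_keys(&["title", "entity_id"]).is_err());
    assert!(validate_text_field_keys(&["title", "title"]).is_err());
    assert!(validate_text_field_keys(&[""]).is_err());
}
```

Running a check like this before adding fields to the schema builder would surface a bad tidalDB schema as an ordinary error at `TextIndex::open()` time.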

6. TextIndex must be Send + Sync

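tantivy::Index and tantivy::IndexReader are Send + Sync. This design treats tantivy::IndexWriter as Send but not Sync, hence the Mutex<IndexWriter>: Mutex<T> is Send + Sync whenever T: Send. Arc<TantivyFields> and TextIndexConfig hold only owned, thread-safe data. Every field of TextIndex is therefore Send + Sync, so the compiler derives both auto traits for TextIndex and no unsafe impl is needed.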
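
Because the auto traits are derived silently, a later change (for example adding an `Rc` or `Cell` field) could remove them without any warning at the definition site. A compile-time assertion guards against that. The sketch below uses stand-in types so it compiles on its own; `FakeWriter` and `FakeIndex` are hypothetical and only model a Send-but-not-Sync writer behind a Mutex.

```rust
use std::cell::Cell;
use std::sync::Mutex;

/// Stand-in for a writer that is Send but not Sync (Cell<u64> is !Sync).
struct FakeWriter {
    pending: Cell<u64>,
}

/// Stand-in for TextIndex: the non-Sync writer sits behind a Mutex.
struct FakeIndex {
    writer: Mutex<FakeWriter>,
}

/// Compiles only if T is Send + Sync; does nothing at runtime.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    // Mutex<FakeWriter> is Sync because FakeWriter is Send.
    assert_send_sync::<FakeIndex>();
    // assert_send_sync::<FakeWriter>(); // would not compile: Cell<u64> is !Sync

    let idx = FakeIndex { writer: Mutex::new(FakeWriter { pending: Cell::new(0) }) };
    idx.writer.lock().unwrap().pending.set(3);
    assert_eq!(idx.writer.lock().unwrap().pending.get(), 3);
}
```

In the real crate the same helper could live in the test module and be called as `assert_send_sync::<TextIndex>()`, which turns the "TextIndex is Send + Sync" acceptance criterion into a build failure if it ever stops holding.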

Acceptance Criteria

  • TextFieldDef and TextFieldType types in schema/validation.rs
  • SchemaBuilder::text_field(key, TextFieldType) builder method
  • Schema::text_fields() -> &[TextFieldDef] accessor
  • tidal/src/text/ module created with pub mod text; in lib.rs
  • TextIndex::open(config, text_fields) creates or opens a Tantivy index
  • TextIndex::ephemeral(text_fields) creates an in-memory index for tests
  • TextIndex::close(self) calls wait_merging_threads()
  • entity_id fast field present in every Tantivy document
  • Text fields use TEXT | STORED options (tokenized)
  • Keyword fields use STRING | STORED options (raw/exact)
  • TextIndex is Send + Sync
  • Unit tests: open_and_close_on_disk, ephemeral_creates_valid_index, schema_has_entity_id_field, text_fields_correct_options, keyword_fields_correct_options
  • cargo check, cargo fmt, and cargo clippy -- -D warnings all pass

Test Strategy

#[cfg(test)]
mod tests {
    use super::*;
    use crate::schema::{TextFieldDef, TextFieldType};

    fn test_fields() -> Vec<TextFieldDef> {
        vec![
            TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
            TextFieldDef { key: "tags".into(), field_type: TextFieldType::Keyword },
        ]
    }

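    #[test]
    fn ephemeral_creates_valid_index() {
        let idx = TextIndex::ephemeral(&test_fields()).unwrap();
        let fields = idx.fields();
        // entity_id field exists in the Tantivy schema and matches the cached handle
        let schema = idx.index.schema();
        assert_eq!(schema.get_field("entity_id").unwrap(), fields.entity_id);
        // declared text fields are present
        assert!(fields.text_fields.iter().any(|(k, _, _)| k == "title"));
        assert!(fields.text_fields.iter().any(|(k, _, _)| k == "tags"));
        idx.close().unwrap();
    }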

    #[test]
    fn open_and_close_on_disk() {
        let dir = tempfile::tempdir().unwrap();
        let config = TextIndexConfig {
            index_dir: dir.path().to_path_buf(),
            ..Default::default()
        };
        let idx = TextIndex::open(config.clone(), &test_fields()).unwrap();
        idx.close().unwrap();
        // Reopen
        let idx2 = TextIndex::open(config, &test_fields()).unwrap();
        idx2.close().unwrap();
    }
}