# Task 01: TextIndex Core

## Delivers

`TextIndex` struct, Tantivy schema generation from tidalDB schema text field definitions, `IndexWriter`/`IndexReader` lifecycle, `entity_id` fast field, `TextIndex::open()` and `TextIndex::close()`.

Also extends `Schema` and `SchemaBuilder` with `TextFieldDef`: the declaration of which metadata keys to index for full-text search, and whether each one is a tokenized text field or a keyword (raw) field.

## Complexity: L

## Dependencies

- None from prior m5 tasks (this is the foundation)
- tidalDB `Schema` (schema/validation.rs): will be extended
- Cargo.toml: `tantivy` dependency must be added

## Technical Design

### 1. Add `tantivy` to Cargo.toml

```toml
tantivy = "0.22"
```

Use `0.22`: stable API, widely deployed, and both the `Collector` trait and `DocSet::seek` are available.

### 2. Add TextFieldDef to Schema

In `schema/validation.rs`, add:

```rust
/// Declaration of a text field for full-text search indexing.
///
/// When a text field is declared in the schema, items written to tidalDB
/// will have the corresponding metadata key indexed by Tantivy for full-text search.
#[derive(Debug, Clone)]
pub struct TextFieldDef {
    /// The metadata key to index (e.g., "title", "description", "tags").
    pub key: String,
    /// Whether this field is tokenized (full-text) or raw (keyword/exact-match).
    pub field_type: TextFieldType,
}

/// The Tantivy indexing mode for a text field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TextFieldType {
    /// Full tokenization with Tantivy's default tokenizer (splits on whitespace
    /// and punctuation, lowercases, drops tokens longer than 40 bytes).
    /// Good for: title, description, body text.
    Text,
    /// Indexed as a single untokenized term. Only exact-match queries work.
    /// Good for: category, format, creator_id, language tags.
    Keyword,
}
```

Add `text_fields: Vec<TextFieldDef>` to `Schema` and `SchemaBuilder`. Add `SchemaBuilder::text_field(key, TextFieldType)` builder method. Expose `Schema::text_fields() -> &[TextFieldDef]`.

### 3. TextIndex Module Structure

Create `tidal/src/text/` module with:

```
tidal/src/text/
├── mod.rs          # pub re-exports
├── index.rs        # TextIndex struct and config
├── writer.rs       # TextIndexWriter
├── syncer.rs       # TextIndexSyncer (task-03)
├── collectors.rs   # Scoring collectors (task-04)
└── query.rs        # TextQueryParser (task-05)
```

Add `pub mod text;` to `tidal/src/lib.rs`.

### 4. TextIndex Struct

```rust
// tidal/src/text/index.rs

use std::path::PathBuf;
use std::sync::{Arc, Mutex};

use tantivy::{Index, IndexReader, IndexWriter, ReloadPolicy, schema as tv_schema};

// `EntityId` is not imported here: nothing in this file uses it yet, and an
// unused import would fail `cargo clippy -- -D warnings`.
use crate::schema::{TextFieldDef, TextFieldType};
use crate::TidalError;

/// Configuration for the text index.
#[derive(Debug, Clone)]
pub struct TextIndexConfig {
    /// Directory for Tantivy index files.
    pub index_dir: PathBuf,
    /// IndexWriter heap budget in bytes. Default: 50MB.
    pub heap_budget_bytes: usize,
    /// Maximum documents before forcing a commit.
    pub commit_every_n_docs: usize,
    /// Maximum seconds between commits.
    pub commit_every_secs: u64,
}

impl Default for TextIndexConfig {
    fn default() -> Self {
        Self {
            index_dir: PathBuf::from("data/text_index"),
            heap_budget_bytes: 50 * 1024 * 1024, // 50MB
            commit_every_n_docs: 1000,
            commit_every_secs: 2,
        }
    }
}

/// Fields that every Tantivy document must have.
pub(crate) struct TantivyFields {
    /// Fast field for the tidalDB entity ID (u64). Used for EntityId->DocAddress mapping.
    pub entity_id: tv_schema::Field,
    /// Declared text fields from the tidalDB schema.
    pub text_fields: Vec<(String, tv_schema::Field, TextFieldType)>,
}

/// The text index. Wraps Tantivy's Index, IndexWriter, and IndexReader.
///
/// Thread-safe: the IndexWriter is behind a Mutex (Tantivy enforces single-writer),
/// the IndexReader provides lock-free snapshot reads.
///
/// IMPORTANT: `TextIndex` is a derived index. The entity store is the source of truth.
/// If the Tantivy index is lost, call `rebuild_from()` to reconstruct it.
pub struct TextIndex {
    pub(crate) index: Index,
    pub(crate) writer: Mutex<IndexWriter>,
    pub(crate) reader: IndexReader,
    pub(crate) fields: Arc<TantivyFields>,
    pub(crate) config: TextIndexConfig,
}
```

### 5. TextIndex::open() and ::close()

```rust
impl TextIndex {
    /// Open or create a TextIndex from the given config and field definitions.
    ///
    /// If the index directory already contains an index, opens it.
    /// Otherwise creates the directory (if needed) and a new index in it.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if Tantivy initialization fails, including
    /// when an existing index was built with a different schema.
    pub fn open(config: TextIndexConfig, text_fields: &[TextFieldDef]) -> crate::Result<Self> {
        // 1. Build Tantivy schema
        let (tv_schema, fields) = build_tantivy_schema(text_fields)?;

        // 2. Open or create the index. An existing directory is not proof that an
        //    index lives in it (a fresh tempdir exists but is empty), so let
        //    Tantivy decide based on the directory contents.
        std::fs::create_dir_all(&config.index_dir)
            .map_err(|e| TidalError::Internal(format!("create index dir: {e}")))?;
        let dir = tantivy::directory::MmapDirectory::open(&config.index_dir)
            .map_err(|e| TidalError::Internal(format!("tantivy open dir: {e}")))?;
        let index = Index::open_or_create(dir, tv_schema)
            .map_err(|e| TidalError::Internal(format!("tantivy open or create: {e}")))?;

        // 3. Create IndexWriter with heap budget
        let writer = index
            .writer(config.heap_budget_bytes)
            .map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;

        // 4. Create IndexReader with auto-reload on commit
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::OnCommitWithDelay)
            .try_into()
            .map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;

        Ok(Self {
            index,
            writer: Mutex::new(writer),
            reader,
            fields: Arc::new(fields),
            config,
        })
    }

    /// Open an in-memory text index for testing.
    pub fn ephemeral(text_fields: &[TextFieldDef]) -> crate::Result<Self> {
        let (tv_schema, fields) = build_tantivy_schema(text_fields)?;
        let index = Index::create_in_ram(tv_schema);

        let writer = index
            .writer(15 * 1024 * 1024) // 15MB: near Tantivy's minimum budget, enough for tests
            .map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;

        // Manual reload: tests call `reader.reload()` explicitly after a commit.
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::Manual)
            .try_into()
            .map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;

        let config = TextIndexConfig {
            index_dir: PathBuf::from(":memory:"),
            ..Default::default()
        };

        Ok(Self {
            index,
            writer: Mutex::new(writer),
            reader,
            fields: Arc::new(fields),
            config,
        })
    }

    /// Graceful shutdown: wait for background merges to complete.
    ///
    /// This does not call `commit()`; callers should commit pending writes first.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer lock is poisoned or if
    /// waiting on the merge threads fails.
    pub fn close(self) -> crate::Result<()> {
        // No `mut` needed: `wait_merging_threads` consumes the writer by value.
        let writer = self
            .writer
            .into_inner()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        writer
            .wait_merging_threads()
            .map_err(|e| TidalError::Internal(format!("tantivy merge wait: {e}")))
    }

    /// Get a reference to the fields mapping (for writer and collector use).
    ///
    /// `pub(crate)` because `TantivyFields` is itself `pub(crate)`.
    #[must_use]
    pub(crate) fn fields(&self) -> &Arc<TantivyFields> {
        &self.fields
    }
}

/// Construct a Tantivy schema from tidalDB text field definitions.
///
/// Always adds:
/// - `entity_id`: u64 fast field for EntityId -> DocAddress mapping
///
/// For each TextFieldDef:
/// - `TextFieldType::Text` → `TEXT | STORED` (tokenized, stored for highlight)
/// - `TextFieldType::Keyword` → `STRING | STORED` (raw, stored)
fn build_tantivy_schema(
    text_fields: &[TextFieldDef],
) -> crate::Result<(tv_schema::Schema, TantivyFields)> {
    let mut sb = tv_schema::Schema::builder();

    // entity_id fast field: every document must have this
    let entity_id_field = sb.add_u64_field(
        "entity_id",
        tv_schema::FAST | tv_schema::STORED,
    );

    let mut fields = Vec::with_capacity(text_fields.len());
    for def in text_fields {
        let options = match def.field_type {
            TextFieldType::Text => tv_schema::TEXT | tv_schema::STORED,
            TextFieldType::Keyword => tv_schema::STRING | tv_schema::STORED,
        };
        let field = sb.add_text_field(&def.key, options);
        fields.push((def.key.clone(), field, def.field_type.clone()));
    }

    let schema = sb.build();
    Ok((
        schema,
        TantivyFields {
            entity_id: entity_id_field,
            text_fields: fields,
        },
    ))
}
```

### 6. TextIndex must be Send + Sync

`tantivy::Index` is `Send + Sync`. `tantivy::IndexWriter` is `Send` (not `Sync`), hence the `Mutex`. `tantivy::IndexReader` is `Send + Sync`. `Mutex<IndexWriter>` is `Send + Sync` when `IndexWriter: Send`. The remaining fields (`Arc<TantivyFields>` and `TextIndexConfig`) hold only plain data, so `TextIndex` is `Send + Sync` through the compiler's auto traits, with no `unsafe impl` required.
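The `Send + Sync` reasoning in section 6 can be enforced by the compiler rather than by inspection. The sketch below uses only the standard library: `FakeWriter` and `FakeTextIndex` are hypothetical stand-ins (not Tantivy or tidalDB types) that mimic a `Send`-but-not-`Sync` writer wrapped in a `Mutex`. In the real crate, the same `assert_send_sync` helper would presumably be instantiated with `TextIndex` inside a unit test.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};

/// Compiles only if `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}

/// Hypothetical stand-in for a writer that is `Send` but not `Sync`:
/// `Cell` allows unsynchronized interior mutation, so it cannot be shared.
#[allow(dead_code)]
struct FakeWriter {
    pending_docs: Cell<u64>,
}

/// Hypothetical stand-in that mirrors the shape of `TextIndex`:
/// a mutex-guarded writer plus shared, immutable field metadata.
#[allow(dead_code)]
struct FakeTextIndex {
    writer: Mutex<FakeWriter>,
    fields: Arc<Vec<String>>,
}

/// Returns `true` if this function compiles at all; the check happens at
/// compile time, not at run time.
fn text_index_is_send_sync() -> bool {
    // `Mutex<T>` is `Sync` whenever `T: Send`, so the wrapper passes.
    assert_send_sync::<FakeTextIndex>();
    // The bare writer would be rejected:
    // assert_send_sync::<FakeWriter>(); // error: `Cell<u64>` is not `Sync`
    true
}
```

A check of this kind fails the build as soon as someone adds a non-thread-safe field (for example an `Rc` or a `RefCell`) to the struct, which is cheaper than discovering the regression at a call site.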
## Acceptance Criteria

- [ ] `TextFieldDef` and `TextFieldType` types in `schema/validation.rs`
- [ ] `SchemaBuilder::text_field(key, TextFieldType)` builder method
- [ ] `Schema::text_fields() -> &[TextFieldDef]` accessor
- [ ] `tidal/src/text/` module created with `pub mod text;` in `lib.rs`
- [ ] `TextIndex::open(config, text_fields)` creates or opens a Tantivy index
- [ ] `TextIndex::ephemeral(text_fields)` creates an in-memory index for tests
- [ ] `TextIndex::close(self)` calls `wait_merging_threads()`
- [ ] `entity_id` fast field present in every Tantivy document
- [ ] `Text` fields use `TEXT | STORED` options (tokenized)
- [ ] `Keyword` fields use `STRING | STORED` options (raw/exact)
- [ ] `TextIndex` is `Send + Sync`
- [ ] Unit tests: `open_and_close_on_disk`, `ephemeral_creates_valid_index`, `schema_has_entity_id_field`, `text_fields_correct_options`, `keyword_fields_correct_options`
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -- -D warnings` all pass

## Test Strategy

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use crate::schema::{TextFieldDef, TextFieldType};

    fn test_fields() -> Vec<TextFieldDef> {
        vec![
            TextFieldDef {
                key: "title".into(),
                field_type: TextFieldType::Text,
            },
            TextFieldDef {
                key: "tags".into(),
                field_type: TextFieldType::Keyword,
            },
        ]
    }

    #[test]
    fn ephemeral_creates_valid_index() {
        let idx = TextIndex::ephemeral(&test_fields()).unwrap();
        let fields = idx.fields();

        // entity_id field exists in the Tantivy schema
        assert_eq!(
            idx.index.schema().get_field_name(fields.entity_id),
            "entity_id"
        );
        // Declared text fields are registered
        assert!(fields.text_fields.iter().any(|(k, _, _)| k == "title"));
        assert!(fields.text_fields.iter().any(|(k, _, _)| k == "tags"));

        idx.close().unwrap();
    }

    #[test]
    fn open_and_close_on_disk() {
        // The tempdir exists but is empty, so `open` must create the index here.
        let dir = tempfile::tempdir().unwrap();
        let config = TextIndexConfig {
            index_dir: dir.path().to_path_buf(),
            ..Default::default()
        };

        let idx = TextIndex::open(config.clone(), &test_fields()).unwrap();
        idx.close().unwrap();

        // Reopen: the index now exists and must load with the same schema
        let idx2 = TextIndex::open(config, &test_fields()).unwrap();
        idx2.close().unwrap();
    }
}
```
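The schema extension from section 2 (`SchemaBuilder::text_field` and `Schema::text_fields()`) is specified only in prose, so a rough sketch of one possible shape may help when writing its tests. Everything here is an assumption for illustration: the real `Schema` and `SchemaBuilder` carry other fields and validation that are omitted, the by-value builder style and the infallible `build()` are guesses, and duplicate keys are not handled.

```rust
/// Mirrors the enum from section 2.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TextFieldType {
    Text,
    Keyword,
}

/// Mirrors the struct from section 2.
#[derive(Debug, Clone)]
pub struct TextFieldDef {
    pub key: String,
    pub field_type: TextFieldType,
}

/// Hypothetical, stripped-down builder: only the text-field part is shown.
#[derive(Debug, Default)]
pub struct SchemaBuilder {
    text_fields: Vec<TextFieldDef>,
}

/// Hypothetical, stripped-down schema: only the text-field part is shown.
#[derive(Debug)]
pub struct Schema {
    text_fields: Vec<TextFieldDef>,
}

impl SchemaBuilder {
    /// Declare a metadata key to be indexed for full-text search.
    /// Declaration order is preserved.
    #[must_use]
    pub fn text_field(mut self, key: impl Into<String>, field_type: TextFieldType) -> Self {
        self.text_fields.push(TextFieldDef {
            key: key.into(),
            field_type,
        });
        self
    }

    /// Finish building. Assumed infallible here; the real builder may validate.
    #[must_use]
    pub fn build(self) -> Schema {
        Schema {
            text_fields: self.text_fields,
        }
    }
}

impl Schema {
    /// Declared text fields, in declaration order.
    #[must_use]
    pub fn text_fields(&self) -> &[TextFieldDef] {
        &self.text_fields
    }
}
```

With this shape, `schema.text_fields()` can be passed straight to `TextIndex::open(config, schema.text_fields())`, since both sides use `&[TextFieldDef]`.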