Task 01: TextIndex Core
Delivers
TextIndex struct, Tantivy schema generation from tidalDB schema text field definitions, IndexWriter/IndexReader lifecycle, entity_id fast field, TextIndex::open() and TextIndex::close().
Also extends Schema and SchemaBuilder with TextFieldDef — the declaration of which metadata keys to index for full-text search, and whether they are tokenized text or keyword (raw) fields.
Complexity: L
Dependencies
- None from prior m5 tasks (this is the foundation)
- tidalDB Schema (schema/validation.rs): will be extended
- Cargo.toml: tantivy dependency must be added
Technical Design
1. Add tantivy to Cargo.toml
tantivy = "0.22"
Pin to 0.22: the API is stable and widely deployed, and both the Collector trait and DocSet::seek are available.
2. Add TextFieldDef to Schema
In schema/validation.rs, add:
/// Declaration of a text field for full-text search indexing.
///
/// When a text field is declared in the schema, items written to tidalDB
/// will have the corresponding metadata key indexed by Tantivy for full-text search.
#[derive(Debug, Clone)]
pub struct TextFieldDef {
    /// The metadata key to index (e.g., "title", "description", "tags").
    pub key: String,
    /// Whether this field is tokenized (full-text) or raw (keyword/exact-match).
    pub field_type: TextFieldType,
}

/// The Tantivy indexing mode for a text field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TextFieldType {
    /// Full tokenization with Tantivy's default tokenizer
    /// (splits on non-alphanumeric characters, then lowercases).
    /// Good for: title, description, body text.
    Text,
    /// Indexed as a single raw token, no tokenization. Only exact-match queries work.
    /// Good for: category, format, creator_id, language tags.
    Keyword,
}
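To make the difference between the two modes concrete, here is a small std-only sketch. It does not call Tantivy; `index_terms` is a hypothetical helper that only approximates the terms each mode would produce (the real default analyzer also drops very long tokens, which is omitted here).

```rust
/// Stand-in for the `TextFieldType` enum defined above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TextFieldType {
    Text,
    Keyword,
}

/// Hypothetical helper: approximates the terms indexed for a value.
/// `Text` splits on non-alphanumeric characters and lowercases;
/// `Keyword` keeps the whole value as one unmodified term.
fn index_terms(value: &str, field_type: TextFieldType) -> Vec<String> {
    match field_type {
        TextFieldType::Text => value
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(str::to_lowercase)
            .collect(),
        TextFieldType::Keyword => vec![value.to_string()],
    }
}

fn main() {
    // Tokenized: matches queries on any individual word.
    let title = index_terms("Deep-Sea Diving, Vol. 2", TextFieldType::Text);
    println!("{title:?}"); // ["deep", "sea", "diving", "vol", "2"]

    // Raw: matches only the exact original string.
    let category = index_terms("Deep-Sea Diving", TextFieldType::Keyword);
    println!("{category:?}"); // ["Deep-Sea Diving"]
}
```

The practical consequence is that a `Keyword` field is case- and punctuation-sensitive, so values such as category names should be normalized before they are written.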
Add text_fields: Vec<TextFieldDef> to Schema and SchemaBuilder.
Add SchemaBuilder::text_field(key, TextFieldType) builder method.
Expose Schema::text_fields() -> &[TextFieldDef].
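A minimal sketch of what these three additions could look like. The `Schema` and `SchemaBuilder` below are reduced stand-ins that hold only `text_fields`; the consuming-builder style (`mut self`), the `Default` derive, and an infallible `build()` are assumptions for illustration, not the existing tidalDB API.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TextFieldType {
    Text,
    Keyword,
}

#[derive(Debug, Clone)]
pub struct TextFieldDef {
    pub key: String,
    pub field_type: TextFieldType,
}

/// Reduced stand-in: the real Schema carries other definitions as well.
#[derive(Debug, Default)]
pub struct Schema {
    text_fields: Vec<TextFieldDef>,
}

impl Schema {
    /// Accessor for the declared text fields, in declaration order.
    pub fn text_fields(&self) -> &[TextFieldDef] {
        &self.text_fields
    }
}

/// Reduced stand-in for the real builder.
#[derive(Debug, Default)]
pub struct SchemaBuilder {
    text_fields: Vec<TextFieldDef>,
}

impl SchemaBuilder {
    /// Declare a metadata key to be indexed for full-text search.
    #[must_use]
    pub fn text_field(mut self, key: impl Into<String>, field_type: TextFieldType) -> Self {
        self.text_fields.push(TextFieldDef { key: key.into(), field_type });
        self
    }

    /// Assumed infallible here; the real builder may validate and return a Result.
    pub fn build(self) -> Schema {
        Schema { text_fields: self.text_fields }
    }
}

fn main() {
    let schema = SchemaBuilder::default()
        .text_field("title", TextFieldType::Text)
        .text_field("tags", TextFieldType::Keyword)
        .build();
    for def in schema.text_fields() {
        println!("{} -> {:?}", def.key, def.field_type);
    }
}
```

The slice returned by `text_fields()` can be passed straight to `TextIndex::open(config, schema.text_fields())`.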
3. TextIndex Module Structure
Create tidal/src/text/ module with:
tidal/src/text/
├── mod.rs # pub re-exports
├── index.rs # TextIndex struct and config
├── writer.rs # TextIndexWriter
├── syncer.rs # TextIndexSyncer (task-03)
├── collectors.rs # Scoring collectors (task-04)
└── query.rs # TextQueryParser (task-05)
Add pub mod text; to tidal/src/lib.rs.
4. TextIndex Struct
// tidal/src/text/index.rs
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use tantivy::{Index, IndexReader, IndexWriter, ReloadPolicy, schema as tv_schema};
use crate::schema::{EntityId, TextFieldDef, TextFieldType};
use crate::TidalError;
/// Configuration for the text index.
#[derive(Debug, Clone)]
pub struct TextIndexConfig {
    /// Directory for Tantivy index files.
    pub index_dir: PathBuf,
    /// IndexWriter heap budget in bytes. Default: 50MB.
    pub heap_budget_bytes: usize,
    /// Maximum documents before forcing a commit.
    pub commit_every_n_docs: usize,
    /// Maximum seconds between commits.
    pub commit_every_secs: u64,
}

impl Default for TextIndexConfig {
    fn default() -> Self {
        Self {
            index_dir: PathBuf::from("data/text_index"),
            heap_budget_bytes: 50 * 1024 * 1024, // 50MB
            commit_every_n_docs: 1000,
            commit_every_secs: 2,
        }
    }
}

/// Fields that every Tantivy document must have.
pub(crate) struct TantivyFields {
    /// Fast field for the tidalDB entity ID (u64). Used for EntityId->DocAddress mapping.
    pub entity_id: tv_schema::Field,
    /// Declared text fields from the tidalDB schema.
    pub text_fields: Vec<(String, tv_schema::Field, TextFieldType)>,
}

/// The text index. Wraps Tantivy's Index, IndexWriter, and IndexReader.
///
/// Thread-safe: the IndexWriter is behind a Mutex (Tantivy enforces single-writer),
/// the IndexReader provides lock-free snapshot reads.
///
/// IMPORTANT: `TextIndex` is a derived index. The entity store is the source of truth.
/// If the Tantivy index is lost, call `rebuild_from()` to reconstruct it.
pub struct TextIndex {
    pub(crate) index: Index,
    pub(crate) writer: Mutex<IndexWriter>,
    pub(crate) reader: IndexReader,
    pub(crate) fields: Arc<TantivyFields>,
    pub(crate) config: TextIndexConfig,
}
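The two commit thresholds in `TextIndexConfig` are consumed by the syncer (task-03), which is not specified in this task. One plausible reading is "commit when either limit is reached", sketched below with std only. `CommitThresholds` and `should_commit` are hypothetical names, and skipping the commit when nothing is pending is an assumption.

```rust
use std::time::Duration;

/// Only the two commit-related fields of `TextIndexConfig`.
struct CommitThresholds {
    commit_every_n_docs: usize,
    commit_every_secs: u64,
}

/// Hypothetical helper: decide whether the writer should commit now.
/// Commits when either threshold is reached, but never for an empty batch.
fn should_commit(
    cfg: &CommitThresholds,
    pending_docs: usize,
    since_last_commit: Duration,
) -> bool {
    if pending_docs == 0 {
        return false;
    }
    pending_docs >= cfg.commit_every_n_docs
        || since_last_commit >= Duration::from_secs(cfg.commit_every_secs)
}

fn main() {
    // Same values as the defaults above.
    let cfg = CommitThresholds { commit_every_n_docs: 1000, commit_every_secs: 2 };

    // Few docs, recent commit: keep buffering.
    println!("{}", should_commit(&cfg, 10, Duration::from_millis(500))); // false
    // Document threshold reached.
    println!("{}", should_commit(&cfg, 1000, Duration::from_millis(500))); // true
    // Time threshold reached with at least one pending doc.
    println!("{}", should_commit(&cfg, 1, Duration::from_secs(2))); // true
}
```

With the defaults, a newly written document becomes searchable after roughly two seconds at most under this policy, plus the reader's reload delay.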
5. TextIndex::open() and ::close()
impl TextIndex {
    /// Open or create a TextIndex from the given config and field definitions.
    ///
    /// If `index_dir` already holds a Tantivy index, opens it.
    /// Otherwise (missing or empty directory), creates a new index there.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if Tantivy initialization fails, including
    /// when an existing index was built with a different schema.
    pub fn open(config: TextIndexConfig, text_fields: &[TextFieldDef]) -> crate::Result<Self> {
        // 1. Build Tantivy schema
        let (tv_schema, fields) = build_tantivy_schema(text_fields)?;

        // 2. Open or create index. Checking only that the directory exists is not
        //    enough: an empty directory (e.g. a fresh tempdir) has no index to open.
        std::fs::create_dir_all(&config.index_dir)
            .map_err(|e| TidalError::Internal(format!("create index dir: {e}")))?;
        let dir = tantivy::directory::MmapDirectory::open(&config.index_dir)
            .map_err(|e| TidalError::Internal(format!("tantivy open dir: {e}")))?;
        let index = Index::open_or_create(dir, tv_schema)
            .map_err(|e| TidalError::Internal(format!("tantivy open or create: {e}")))?;

        // 3. Create IndexWriter with heap budget
        let writer = index
            .writer(config.heap_budget_bytes)
            .map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;

        // 4. Create IndexReader with auto-reload on commit
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::OnCommitWithDelay)
            .try_into()
            .map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;

        Ok(Self {
            index,
            writer: Mutex::new(writer),
            reader,
            fields: Arc::new(fields),
            config,
        })
    }

    /// Open an in-memory text index for testing.
    pub fn ephemeral(text_fields: &[TextFieldDef]) -> crate::Result<Self> {
        let (tv_schema, fields) = build_tantivy_schema(text_fields)?;
        let index = Index::create_in_ram(tv_schema);
        let writer = index
            .writer(15 * 1024 * 1024) // 15MB minimum for ephemeral
            .map_err(|e| TidalError::Internal(format!("tantivy writer: {e}")))?;
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::Manual)
            .try_into()
            .map_err(|e| TidalError::Internal(format!("tantivy reader: {e}")))?;
        let config = TextIndexConfig {
            index_dir: PathBuf::from(":memory:"),
            ..Default::default()
        };
        Ok(Self {
            index,
            writer: Mutex::new(writer),
            reader,
            fields: Arc::new(fields),
            config,
        })
    }
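
    /// Graceful shutdown: wait for background merges to complete.
    ///
    /// This does not call `commit()`; commit pending writes before closing.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer lock is poisoned or a
    /// merge thread fails.
    pub fn close(self) -> crate::Result<()> {
        // `wait_merging_threads` consumes the writer, so the binding need not be `mut`
        // (a `mut` here would trip `unused_mut` under `-D warnings`).
        let writer = self
            .writer
            .into_inner()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        writer
            .wait_merging_threads()
            .map_err(|e| TidalError::Internal(format!("tantivy merge wait: {e}")))
    }

    /// Get a reference to the fields mapping (for writer and collector use).
    ///
    /// `pub(crate)` because `TantivyFields` itself is crate-private.
    #[must_use]
    pub(crate) fn fields(&self) -> &Arc<TantivyFields> {
        &self.fields
    }
}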

/// Construct a Tantivy schema from tidalDB text field definitions.
///
/// Always adds:
/// - `entity_id`: u64 fast field for EntityId -> DocAddress mapping
///
/// For each TextFieldDef:
/// - `TextFieldType::Text` → `TEXT | STORED` (tokenized, stored for highlight)
/// - `TextFieldType::Keyword` → `STRING | STORED` (raw, stored)
fn build_tantivy_schema(
    text_fields: &[TextFieldDef],
) -> crate::Result<(tv_schema::Schema, TantivyFields)> {
    let mut sb = tv_schema::Schema::builder();

    // entity_id fast field: every document must have this
    let entity_id_field = sb.add_u64_field(
        "entity_id",
        tv_schema::FAST | tv_schema::STORED,
    );

    let mut fields = Vec::with_capacity(text_fields.len());
    for def in text_fields {
        let options = match def.field_type {
            TextFieldType::Text => tv_schema::TEXT | tv_schema::STORED,
            TextFieldType::Keyword => tv_schema::STRING | tv_schema::STORED,
        };
        let field = sb.add_text_field(&def.key, options);
        fields.push((def.key.clone(), field, def.field_type.clone()));
    }

    let schema = sb.build();
    Ok((
        schema,
        TantivyFields {
            entity_id: entity_id_field,
            text_fields: fields,
        },
    ))
}
6. TextIndex must be Send + Sync
tantivy::Index is Send + Sync. tantivy::IndexWriter is Send (not Sync) — hence the Mutex<IndexWriter>. tantivy::IndexReader is Send + Sync. Mutex<IndexWriter> is Send + Sync when IndexWriter: Send. So TextIndex is Send + Sync implicitly.
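The auto-trait reasoning above can be pinned down with a compile-time assertion. The sketch below uses stand-in types rather than Tantivy: `FakeWriter` is made `Send` but not `Sync` with a `Cell`, mirroring the assumption stated above about the writer, and `assert_send_sync` is a hypothetical helper name.

```rust
use std::cell::Cell;
use std::sync::Mutex;

/// Stand-in for a writer that is `Send` but not `Sync`
/// (`Cell` is what removes `Sync` here).
struct FakeWriter {
    pending: Cell<u64>,
}

/// Stand-in for `TextIndex`: the writer sits behind a `Mutex`.
struct FakeTextIndex {
    writer: Mutex<FakeWriter>,
}

/// Compiles only if `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    // `Mutex<T>` is `Sync` whenever `T: Send`, so this compiles.
    assert_send_sync::<FakeTextIndex>();
    // assert_send_sync::<FakeWriter>(); // would not compile: `Cell<u64>` is not `Sync`

    let idx = FakeTextIndex {
        writer: Mutex::new(FakeWriter { pending: Cell::new(0) }),
    };

    // Sharing `&idx` across threads requires `FakeTextIndex: Sync`.
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                let w = idx.writer.lock().unwrap();
                w.pending.set(w.pending.get() + 1);
            });
        }
    });

    println!("{}", idx.writer.lock().unwrap().pending.get()); // 4
}
```

In the crate itself, the same helper applied to `TextIndex` inside a unit test would turn the "TextIndex is Send + Sync" criterion into a build failure if a future field breaks it.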
Acceptance Criteria
- TextFieldDef and TextFieldType types in schema/validation.rs
- SchemaBuilder::text_field(key, TextFieldType) builder method
- Schema::text_fields() -> &[TextFieldDef] accessor
- tidal/src/text/ module created with pub mod text; in lib.rs
- TextIndex::open(config, text_fields) creates or opens a Tantivy index
- TextIndex::ephemeral(text_fields) creates an in-memory index for tests
- TextIndex::close(self) calls wait_merging_threads()
- entity_id fast field present in every Tantivy document
- Text fields use TEXT | STORED options (tokenized)
- Keyword fields use STRING | STORED options (raw/exact)
- TextIndex is Send + Sync
- Unit tests: open_and_close, ephemeral_creates_valid_index, schema_has_entity_id_field, text_fields_correct_options, keyword_fields_correct_options
- cargo check, cargo fmt, cargo clippy -- -D warnings all pass
Test Strategy
#[cfg(test)]
mod tests {
    use super::*;
    use crate::schema::{TextFieldDef, TextFieldType};

    fn test_fields() -> Vec<TextFieldDef> {
        vec![
            TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
            TextFieldDef { key: "tags".into(), field_type: TextFieldType::Keyword },
        ]
    }

    #[test]
    fn ephemeral_creates_valid_index() {
        let idx = TextIndex::ephemeral(&test_fields()).unwrap();
        let fields = idx.fields();
        // entity_id field exists in the Tantivy schema and matches the stored handle
        assert_eq!(
            idx.index.schema().get_field("entity_id").unwrap(),
            fields.entity_id
        );
        // declared text fields are registered
        assert!(fields.text_fields.iter().any(|(k, _, _)| k == "title"));
        assert!(fields.text_fields.iter().any(|(k, _, _)| k == "tags"));
        idx.close().unwrap();
    }

    #[test]
    fn open_and_close_on_disk() {
        // tempdir() creates the directory, so open() must handle an existing empty dir
        let dir = tempfile::tempdir().unwrap();
        let config = TextIndexConfig {
            index_dir: dir.path().to_path_buf(),
            ..Default::default()
        };
        let idx = TextIndex::open(config.clone(), &test_fields()).unwrap();
        idx.close().unwrap();

        // Reopen
        let idx2 = TextIndex::open(config, &test_fields()).unwrap();
        idx2.close().unwrap();
    }
}