# Task 03: Tantivy Merge Policy Tuning

## Delivers

Configured `LogMergePolicy` for Tantivy's embedded text index. Benchmarked segment count evolution under steady-state write load at 1M documents. Verified that background merges do not block concurrent reads. Segment count stays below 20 at steady state.

## Complexity

M

## Dependencies

- task-01 complete (scale benchmark infrastructure, 1M-item TidalDb)
- `docs/research/tantivy.md` (LogMergePolicy parameters, segment merge behavior)

## Technical Design

### 1. Current state

The research doc identifies segment merging as the primary latency risk. Tantivy's `LogMergePolicy` governs when small segments are merged into larger ones after each commit. The relevant parameters:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `min_num_segments` | 8 | Minimum number of segments in a level before a merge fires |
| `max_docs_before_merge` | 10_000_000 | Segments larger than this are never merged |
| `del_docs_ratio_before_merge` | 1.0 | Fraction of deleted docs in a segment that triggers a merge |
| `min_layer_size` | 10_000 | Minimum docs per segment layer in log merge |

Currently, `TextIndex` uses Tantivy's default `LogMergePolicy` without tuning. At 1M documents with a commit every 1000 items or 2 seconds (whichever comes first), the syncer produces ~1000 commits during initial ingest, each creating 1-8 small segments. Without a tuned merge policy, segment count can grow to 50+ before merges catch up.

### 2. Merge policy configuration

Apply tuned parameters in `TextIndex` construction:

```rust
use tantivy::merge_policy::LogMergePolicy;

fn configure_merge_policy() -> LogMergePolicy {
    let mut policy = LogMergePolicy::default();

    // Merge when 4+ segments exist at same level (more aggressive than default 8).
    // At 1M docs with 1000-doc commits, this keeps steady-state segments < 15.
    policy.set_min_num_segments(4);

    // Allow merges of segments up to 5M docs. Default 10M is fine for single-node
    // but 5M reduces the maximum merge duration (smaller max segment = faster merge).
    policy.set_max_docs_before_merge(5_000_000);

    // Trigger merge when 30% of docs in a segment are deleted (default 100% = never).
    // tidalDB uses delete-then-add for updates, so deleted docs accumulate.
    policy.set_del_docs_ratio_before_merge(0.3);

    policy
}

// In TextIndex construction:
// 50 MB total memory budget shared across 2 indexing threads.
let index_writer = index.writer_with_num_threads(2, 50_000_000)?;
index_writer.set_merge_policy(Box::new(configure_merge_policy()));
```

### 3. Segment count observation benchmark

This is not a standard Criterion benchmark: it measures segment count evolution over time under combined write and read load. Implement it as a test that prints a report:

```rust
use std::collections::HashMap;
use std::time::Duration;

// Only run manually:
// cargo test --manifest-path tidal/Cargo.toml -- tantivy_segment_evolution --ignored --nocapture
#[test]
#[ignore]
fn tantivy_segment_evolution() {
    let db = build_scale_db_with_incremental_writes();

    // Phase 1: Bulk ingest 1M items (already done in build)
    // Observe segment count after initial ingest
    let seg_count_after_ingest = db.text_segment_count(); // expose via TextIndex
    println!("Segments after 1M ingest: {seg_count_after_ingest}");

    // Phase 2: Steady-state writes (100 items every 2 seconds, 10 rounds)
    for round in 0..10 {
        let base_id = 1_000_000 + round * 100;
        for i in 0..100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Steady state item {}", base_id + i));
            db.write_item_with_metadata(EntityId::new(base_id + i), &meta).unwrap();
        }
        std::thread::sleep(Duration::from_secs(2));
        db.reload_text_index().unwrap();

        let seg_count = db.text_segment_count();
        println!("Round {round}: segments = {seg_count}");
        assert!(seg_count < 20, "Segment count {seg_count} exceeds threshold of 20");
    }

    // Phase 3: Read latency during merge
    // Fire a search while merge is in progress (if detectable)
    let query = Search::builder()
        .query("steady state")
        .limit(20)
        .build()
        .unwrap();

    let start = std::time::Instant::now();
    let results = db.search(&query).unwrap();
    let latency = start.elapsed();

    println!("Search latency during steady state: {latency:?}");
    println!("Results: {}", results.items.len());
    assert!(
        latency < Duration::from_millis(100),
        "Search latency {latency:?} exceeds 100ms during steady-state merge"
    );
}
```

### 4. Concurrent read/write verification

Spawn a reader thread that continuously searches while the writer thread ingests documents. Measure reader latency percentiles:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;

#[test]
#[ignore]
fn tantivy_concurrent_read_write_latency() {
    let db = Arc::new(build_1m_db());
    let db_reader = db.clone();

    // Reader: 100 searches spaced 50 ms apart, recording each latency.
    let reader_handle = std::thread::spawn(move || {
        let query = Search::builder()
            .query("tutorial")
            .limit(10)
            .build()
            .unwrap();

        let mut latencies = Vec::with_capacity(100);
        for _ in 0..100 {
            let start = std::time::Instant::now();
            let _ = db_reader.search(&query).unwrap();
            latencies.push(start.elapsed());
            std::thread::sleep(Duration::from_millis(50));
        }
        latencies
    });

    // Writer: add 5000 items while reader is searching
    for i in 0..5000u64 {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Concurrent write item {i}"));
        db.write_item_with_metadata(EntityId::new(2_000_000 + i), &meta).unwrap();
    }

    // `mut` is required because `sort` sorts in place.
    let mut latencies = reader_handle.join().unwrap();
    latencies.sort();

    let p50 = latencies[latencies.len() / 2];
    let p99 = latencies[latencies.len() * 99 / 100];

    println!("Read latency during concurrent writes:");
    println!("  p50 = {p50:?}");
    println!("  p99 = {p99:?}");
    assert!(
        p99 < Duration::from_millis(100),
        "p99 read latency {p99:?} exceeds 100ms during concurrent writes"
    );
}
```

### 5. TextIndex API extension

To observe segment count, expose a method on `TextIndex`:

```rust
impl TextIndex {
    /// Returns the current number of searchable segments.
    ///
    /// Useful for monitoring merge policy effectiveness.
    #[must_use]
    pub fn segment_count(&self) -> usize {
        self.reader.searcher().segment_readers().len()
    }
}
```

And propagate through `TidalDb`:

```rust
impl TidalDb {
    /// Returns the current Tantivy segment count (for diagnostics).
    #[must_use]
    pub fn text_segment_count(&self) -> usize {
        self.text_index.segment_count()
    }
}
```

## Acceptance Criteria

- [ ] `LogMergePolicy` configured with `min_num_segments=4`, `max_docs_before_merge=5_000_000`, `del_docs_ratio_before_merge=0.3`
- [ ] `TextIndex::segment_count()` method exposed for monitoring
- [ ] Segment count < 20 after 1M document ingest and 10 rounds of 100-document steady-state writes
- [ ] Concurrent read latency p99 < 100ms during active writes at 1M documents
- [ ] Merge policy parameters documented in `docs/profiling/tantivy-merge-tuning.md`
- [ ] No regression in existing `tidal/benches/text_index.rs` and `tidal/benches/search.rs` benchmarks

## Test Strategy

1. **Segment count test:** `tantivy_segment_evolution` (ignored test, run manually) verifies segment count stays below 20 at steady state.
2. **Concurrent latency test:** `tantivy_concurrent_read_write_latency` (ignored test) measures read p99 during active writes.
3. **Regression:** Run existing `cargo bench --manifest-path tidal/Cargo.toml --bench text_index` and `--bench search` before and after applying the merge policy change. Latency should not regress.
4. **Unit test:** Verify that `configure_merge_policy()` returns a policy with the expected parameter values (Tantivy exposes getters for some policy fields).
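
The percentile step in the concurrent latency test could be factored into a small helper so that the p50/p99 definition is stated once and can itself be unit tested. The sketch below is one possible shape, not part of the existing codebase: the `percentile` function name and its nearest-rank definition are assumptions made for illustration, and it depends only on the standard library.

```rust
use std::time::Duration;

/// Hypothetical helper (not an existing tidalDB API): nearest-rank percentile
/// over latency samples that are already sorted in ascending order.
///
/// `pct` must be in 1..=100. Returns `None` for an empty slice or an
/// out-of-range `pct`.
fn percentile(sorted: &[Duration], pct: usize) -> Option<Duration> {
    if sorted.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    // Nearest-rank method: 1-based rank = ceil(pct / 100 * n),
    // computed with integer arithmetic to avoid float rounding.
    let n = sorted.len();
    let rank = (n * pct + 99) / 100;
    Some(sorted[rank - 1])
}
```

In the test, this would be called as `percentile(&latencies, 99)` after `latencies.sort()`. With 100 samples, the nearest-rank p99 is the 99th-smallest sample, whereas indexing at `len * 99 / 100` selects the largest sample; either convention works for the acceptance criterion as long as the chosen one is documented alongside the reported numbers.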