tidaldb/docs/planning/milestone-7/phase-3/task-03-tantivy-merge-policy.md
2026-02-23 22:41:16 -07:00

Task 03: Tantivy Merge Policy Tuning

Delivers

Configured LogMergePolicy for Tantivy's embedded text index. Benchmarked segment count evolution under steady-state write load at 1M documents. Verified that background merges do not block concurrent reads. Segment count stays below 20 at steady state.

Complexity

M

Dependencies

  • task-01 complete (scale benchmark infrastructure, 1M-item TidalDb)
  • docs/research/tantivy.md (LogMergePolicy parameters, segment merge behavior)

Technical Design

1. Current state

The research doc identifies segment merging as the primary latency risk. Tantivy's LogMergePolicy governs when small segments are merged into larger ones after each commit. The relevant parameters:

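| Parameter | Default | Description |
| --- | --- | --- |
| min_num_segments | 8 | Minimum number of segments in one size level before that level is merged |
| max_docs_before_merge | 10_000_000 | Segments with more docs than this are excluded from merging |
| del_docs_ratio_before_merge | 1.0 | Deleted-doc fraction above which a segment becomes a merge candidate; at 1.0, deletes are ignored |
| min_layer_size | 10_000 | Segments smaller than this are all treated as belonging to the same bottom level |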

Currently, TextIndex uses Tantivy's default LogMergePolicy without tuning. At 1M documents with a commit every 1000 items or 2 seconds, the syncer produces roughly 1000 commits during initial ingest (1M / 1000), each creating 1-8 small segments. Without a tuned merge policy, the segment count can grow to 50+ before merges catch up.
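To make the interaction between min_layer_size and min_num_segments concrete, here is a minimal sketch of log-level grouping. It is a simplified model written for this document, not Tantivy's implementation: the function names are hypothetical, the 0.75 level width is assumed to match Tantivy's default level_log_size and should be checked against the pinned version, and the deleted-docs threshold and max_docs_before_merge filter are left out.

```rust
/// Simplified model of log-level grouping (NOT Tantivy's actual code).
/// `doc_counts` holds the doc count of each segment.
fn group_into_levels(
    mut doc_counts: Vec<u32>,
    min_layer_size: u32,
    level_log_size: f64,
) -> Vec<Vec<u32>> {
    // Largest segments first, so each level is anchored by its biggest member.
    doc_counts.sort_unstable_by(|a, b| b.cmp(a));

    let mut levels: Vec<Vec<u32>> = Vec::new();
    // log2 size of the largest segment in the level currently being filled.
    let mut level_top = f64::INFINITY;

    for docs in doc_counts {
        // Segments below min_layer_size are treated as if they had that size,
        // which places all tiny segments in one bottom level.
        let log_size = f64::from(docs.max(min_layer_size)).log2();
        if log_size < level_top - level_log_size {
            // Too small for the current level: open a new one.
            level_top = log_size;
            levels.push(Vec::new());
        }
        if let Some(level) = levels.last_mut() {
            level.push(docs);
        }
    }
    levels
}

/// Levels holding at least `min_num_segments` segments are merge candidates.
fn merge_candidates(levels: &[Vec<u32>], min_num_segments: usize) -> Vec<&[u32]> {
    levels
        .iter()
        .filter(|level| level.len() >= min_num_segments)
        .map(Vec::as_slice)
        .collect()
}
```

Under this model, two large segments (500,000 and 400,000 docs) plus five 1,000-doc segments form two levels. The five-segment bottom level is a merge candidate at min_num_segments = 4 but not at the default of 8, which is the effect the tuning in the next section relies on.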

2. Merge policy configuration

Apply tuned parameters in TextIndex construction:

use tantivy::merge_policy::LogMergePolicy;

fn configure_merge_policy() -> LogMergePolicy {
    let mut policy = LogMergePolicy::default();
    // Merge when 4+ segments exist at same level (more aggressive than default 8).
    // At 1M docs with 1000-doc commits, this keeps steady-state segments < 15.
    policy.set_min_num_segments(4);
    // Allow merges of segments up to 5M docs. Default 10M is fine for single-node
    // but 5M reduces the maximum merge duration (smaller max segment = faster merge).
    policy.set_max_docs_before_merge(5_000_000);
    // Trigger merge when 30% of docs in a segment are deleted (default 100% = never).
    // tidalDB uses delete-then-add for updates, so deleted docs accumulate.
    policy.set_del_docs_ratio_before_merge(0.3);
    policy
}

// In TextIndex construction:
let index_writer = index.writer_with_num_threads(2, 50_000_000)?; // 2 indexing threads sharing a 50 MB memory budget
index_writer.set_merge_policy(Box::new(configure_merge_policy()));

3. Segment count observation benchmark

This is not a standard Criterion benchmark -- it measures segment count evolution over time under write+read load. Implement as a test that prints a report:

use std::{collections::HashMap, time::Duration};

#[test]
#[ignore] // only run manually: cargo test --manifest-path tidal/Cargo.toml -- tantivy_segment_evolution --ignored --nocapture
fn tantivy_segment_evolution() {
    let db = build_scale_db_with_incremental_writes();

    // Phase 1: Bulk ingest 1M items (already done in build)
    // Observe segment count after initial ingest
    let seg_count_after_ingest = db.text_segment_count(); // expose via TextIndex
    println!("Segments after 1M ingest: {seg_count_after_ingest}");

    // Phase 2: Steady-state writes (100 items every 2 seconds, 10 rounds)
    for round in 0..10 {
        let base_id = 1_000_000 + round * 100;
        for i in 0..100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Steady state item {}", base_id + i));
            db.write_item_with_metadata(EntityId::new(base_id + i), &meta).unwrap();
        }
        std::thread::sleep(Duration::from_secs(2));
        db.reload_text_index().unwrap();

        let seg_count = db.text_segment_count();
        println!("Round {round}: segments = {seg_count}");
        assert!(seg_count < 20, "Segment count {seg_count} exceeds threshold of 20");
    }

    // Phase 3: Read latency during merge
    // Fire a search while merge is in progress (if detectable)
    let query = Search::builder()
        .query("steady state")
        .limit(20)
        .build()
        .unwrap();

    let start = std::time::Instant::now();
    let results = db.search(&query).unwrap();
    let latency = start.elapsed();
    println!("Search latency during steady state: {latency:?}");
    println!("Results: {}", results.items.len());

    assert!(
        latency < Duration::from_millis(100),
        "Search latency {latency:?} exceeds 100ms during steady-state merge"
    );
}
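The fixed two-second sleep in the loop above assumes the syncer has committed by the time the index is reloaded. If that proves flaky on slower machines, one option is to poll for visibility with a deadline before reading the segment count. The helper below is a minimal sketch of that idea; `wait_until` is a hypothetical name, and the usage shown in the trailing comment relies on a hypothetical `last_item_is_searchable` check built on `db.search`.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls `condition` every `interval` until it returns true or `timeout` elapses.
/// Returns true if the condition was met in time. The condition is always
/// evaluated at least once, even with a zero timeout.
fn wait_until(
    timeout: Duration,
    interval: Duration,
    mut condition: impl FnMut() -> bool,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}

// Hypothetical usage inside the steady-state loop, replacing the fixed sleep:
//
// let visible = wait_until(Duration::from_secs(10), Duration::from_millis(100), || {
//     db.reload_text_index().unwrap();
//     last_item_is_searchable(&db) // hypothetical helper built on db.search
// });
// assert!(visible, "writes from this round never became searchable");
```

Polling on visibility rather than on the segment count itself keeps the threshold assertion strict: the count is still checked once per round, only after the round's writes are known to be committed.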

4. Concurrent read/write verification

Spawn a reader thread that continuously searches while the writer thread ingests documents. Measure reader latency percentiles:

use std::sync::Arc;

#[test]
#[ignore]
fn tantivy_concurrent_read_write_latency() {
    let db = Arc::new(build_1m_db());

    let db_reader = db.clone();
    let reader_handle = std::thread::spawn(move || {
        let query = Search::builder()
            .query("tutorial")
            .limit(10)
            .build()
            .unwrap();

        let mut latencies = Vec::with_capacity(100);
        for _ in 0..100 {
            let start = std::time::Instant::now();
            let _ = db_reader.search(&query).unwrap();
            latencies.push(start.elapsed());
            std::thread::sleep(Duration::from_millis(50));
        }
        latencies
    });

    // Writer: add 5000 items while reader is searching
    for i in 0..5000u64 {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Concurrent write item {i}"));
        db.write_item_with_metadata(EntityId::new(2_000_000 + i), &meta).unwrap();
    }

    let mut latencies = reader_handle.join().unwrap(); // must be mutable to sort in place
    latencies.sort();
    let p50 = latencies[latencies.len() / 2];
    // Nearest-rank p99: with 100 samples, `len * 99 / 100` would index the maximum.
    let p99 = latencies[(latencies.len() * 99 + 99) / 100 - 1];
    println!("Read latency during concurrent writes:");
    println!("  p50 = {p50:?}");
    println!("  p99 = {p99:?}");

    assert!(
        p99 < Duration::from_millis(100),
        "p99 read latency {p99:?} exceeds 100ms during concurrent writes"
    );
}
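In the test above the reader runs for a fixed 100 iterations (about 5 seconds), so if the 5000 writes finish sooner, many samples are taken with no writer active. A hedged alternative is to let the reader loop until the writer signals completion, and to compute percentiles with an explicit nearest-rank rule. The sketch below uses only the standard library; `read_latencies_during_writes`, `percentile`, and `SetOnDrop` are hypothetical names, and the closures stand in for `db.search` and the write loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Sets the flag when dropped, so the reader also stops if the writer panics.
struct SetOnDrop<'a>(&'a AtomicBool);

impl Drop for SetOnDrop<'_> {
    fn drop(&mut self) {
        self.0.store(true, Ordering::Release);
    }
}

/// Runs `write` on the calling thread while a reader thread times `read`
/// in a loop. The reader takes at least one sample and stops only after
/// the writer has finished. Returns the latencies sorted ascending.
fn read_latencies_during_writes<R, W>(read: R, write: W, pause: Duration) -> Vec<Duration>
where
    R: Fn() + Send,
    W: FnOnce(),
{
    let writer_done = AtomicBool::new(false);
    let done = &writer_done;

    let mut latencies = thread::scope(|scope| {
        let reader = scope.spawn(move || {
            let mut samples = Vec::new();
            loop {
                let start = Instant::now();
                read();
                samples.push(start.elapsed());
                if done.load(Ordering::Acquire) {
                    break;
                }
                thread::sleep(pause);
            }
            samples
        });

        {
            let _signal = SetOnDrop(done);
            write();
        } // the flag is set here, whether `write` returned or panicked

        reader.join().expect("reader thread panicked")
    });

    latencies.sort_unstable();
    latencies
}

/// Nearest-rank percentile over an ascending-sorted slice.
/// `pct` must be in 1..=100; returns None otherwise or for an empty slice.
fn percentile(sorted: &[Duration], pct: usize) -> Option<Duration> {
    if sorted.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    // rank = ceil(pct / 100 * n), then converted to a zero-based index.
    let rank = (sorted.len() * pct + 99) / 100;
    sorted.get(rank - 1).copied()
}
```

With 100 samples, `percentile(&latencies, 99)` selects the 99th smallest value rather than the maximum, so a single outlier does not decide the p99 assertion on its own.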

5. TextIndex API extension

To observe segment count, expose a method on TextIndex:

impl TextIndex {
    /// Returns the current number of searchable segments.
    ///
    /// Useful for monitoring merge policy effectiveness.
    #[must_use]
    pub fn segment_count(&self) -> usize {
        self.reader.searcher().segment_readers().len()
    }
}

And propagate through TidalDb:

impl TidalDb {
    /// Returns the current Tantivy segment count (for diagnostics).
    #[must_use]
    pub fn text_segment_count(&self) -> usize {
        self.text_index.segment_count()
    }
}

Acceptance Criteria

  • LogMergePolicy configured with min_num_segments=4, max_docs_before_merge=5_000_000, del_docs_ratio_before_merge=0.3
  • TextIndex::segment_count() method exposed for monitoring
  • Segment count < 20 after 1M document ingest and 10 rounds of 100-document steady-state writes
  • Concurrent read latency p99 < 100ms during active writes at 1M documents
  • Merge policy parameters documented in docs/profiling/tantivy-merge-tuning.md
  • No regression in existing tidal/benches/text_index.rs and tidal/benches/search.rs benchmarks

Test Strategy

  1. Segment count test: tantivy_segment_evolution (ignored test, run manually) verifies segment count stays below 20 at steady state.
  2. Concurrent latency test: tantivy_concurrent_read_write_latency (ignored test) measures read p99 during active writes.
  3. Regression: Run existing cargo bench --manifest-path tidal/Cargo.toml --bench text_index and --bench search before and after applying the merge policy change. Latency should not regress.
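  4. Unit test: Verify that configure_merge_policy() returns a policy with the expected parameter values. LogMergePolicy may not expose public getters for these fields in the pinned Tantivy version; if it does not, assert on the policy's Debug output or on the merge candidates it computes for a synthetic segment list.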