tidaldb/docs/planning/milestone-7/phase-3/task-03-tantivy-merge-policy.md
2026-02-23 22:41:16 -07:00


# Task 03: Tantivy Merge Policy Tuning
## Delivers
Configured `LogMergePolicy` for Tantivy's embedded text index. Benchmarked segment count evolution under steady-state write load at 1M documents. Verified that background merges do not block concurrent reads. Segment count stays below 20 at steady state.
## Complexity
M
## Dependencies
- task-01 complete (scale benchmark infrastructure, 1M-item TidalDb)
- `docs/research/tantivy.md` (LogMergePolicy parameters, segment merge behavior)
## Technical Design
### 1. Current state
The research doc identifies segment merging as the primary latency risk. Tantivy's `LogMergePolicy` governs when small segments are merged into larger ones after each commit. The relevant parameters:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `min_num_segments` | 8 | Minimum number of segments at the same level before a merge fires |
| `max_docs_before_merge` | 10_000_000 | Segments larger than this are never selected for merging |
| `del_docs_ratio_before_merge` | 1.0 | Fraction of deleted docs in a segment above which it becomes a merge candidate |
| `min_layer_size` | 10_000 | Segments smaller than this (in docs) are all treated as one bottom level |

Currently, `TextIndex` uses Tantivy's default `LogMergePolicy` without tuning. At 1M documents with commits every 1000 items or 2 seconds, the syncer produces ~1000 commits during initial ingest (1M / 1000), each creating 1-8 small segments. Without a tuned merge policy, segment count can grow to 50+ before merges catch up.
### 2. Merge policy configuration
Apply tuned parameters in `TextIndex` construction:
```rust
use tantivy::merge_policy::LogMergePolicy;

fn configure_merge_policy() -> LogMergePolicy {
    let mut policy = LogMergePolicy::default();

    // Merge when 4+ segments exist at the same level (more aggressive than the default of 8).
    // At 1M docs with 1000-doc commits, this keeps steady-state segments < 15.
    policy.set_min_num_segments(4);

    // Allow merges of segments up to 5M docs. Default 10M is fine for single-node
    // but 5M reduces the maximum merge duration (smaller max segment = faster merge).
    policy.set_max_docs_before_merge(5_000_000);

    // Trigger merge when 30% of docs in a segment are deleted (default 100% = never).
    // tidalDB uses delete-then-add for updates, so deleted docs accumulate.
    policy.set_del_docs_ratio_before_merge(0.3);

    policy
}

// In TextIndex construction:
// 2 indexing threads sharing a 50 MB total memory budget.
let index_writer = index.writer_with_num_threads(2, 50_000_000)?;
index_writer.set_merge_policy(Box::new(configure_merge_policy()));
```
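To build intuition for why a lower `min_num_segments` reduces segment count, the following is a toy model, not Tantivy's actual `LogMergePolicy` algorithm. It assumes one segment per commit, instantaneous merges, and merging only among segments of identical size; it ignores `min_layer_size`, deletes, and background merge lag. The function name `simulate_segments` is hypothetical.

```rust
/// Toy model of tiered merging. NOT Tantivy's LogMergePolicy implementation.
///
/// Each commit adds one segment of `docs_per_commit` docs. Whenever the newest
/// `min_num_segments` segments have identical size, they merge instantly into
/// one segment. Returns the doc counts of the surviving segments, largest first.
fn simulate_segments(commits: u64, docs_per_commit: u64, min_num_segments: usize) -> Vec<u64> {
    assert!(min_num_segments >= 2, "a merge needs at least two segments");
    let mut segments: Vec<u64> = Vec::new();
    for _ in 0..commits {
        segments.push(docs_per_commit);
        // A merge can complete a new run of equal-sized segments, so cascade.
        loop {
            let newest = segments[segments.len() - 1];
            let run = segments.iter().rev().take_while(|&&s| s == newest).count();
            if run < min_num_segments {
                break;
            }
            let start = segments.len() - run;
            let merged: u64 = segments.drain(start..).sum();
            segments.push(merged);
        }
    }
    segments
}
```

Under these assumptions, `simulate_segments(1000, 1000, 4)` ends with 10 segments and `simulate_segments(1000, 1000, 8)` ends with 13. Because merges are treated as instantaneous, the model gives an idealized floor; the benchmark in the next section measures what happens when merges lag behind commits.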
### 3. Segment count observation benchmark
This is not a standard Criterion benchmark -- it measures segment count evolution over time under write+read load. Implement as a test that prints a report:
```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[test]
#[ignore] // only run manually: cargo test --manifest-path tidal/Cargo.toml -- tantivy_segment_evolution --ignored --nocapture
fn tantivy_segment_evolution() {
    let db = build_scale_db_with_incremental_writes();

    // Phase 1: Bulk ingest 1M items (already done in build)
    // Observe segment count after initial ingest
    let seg_count_after_ingest = db.text_segment_count(); // expose via TextIndex
    println!("Segments after 1M ingest: {seg_count_after_ingest}");

    // Phase 2: Steady-state writes (100 items every 2 seconds, 10 rounds)
    for round in 0..10 {
        let base_id = 1_000_000 + round * 100;
        for i in 0..100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Steady state item {}", base_id + i));
            db.write_item_with_metadata(EntityId::new(base_id + i), &meta).unwrap();
        }
        std::thread::sleep(Duration::from_secs(2));
        db.reload_text_index().unwrap();
        let seg_count = db.text_segment_count();
        println!("Round {round}: segments = {seg_count}");
        assert!(seg_count < 20, "Segment count {seg_count} exceeds threshold of 20");
    }

    // Phase 3: Read latency during merge
    // Fire a search while merge is in progress (if detectable)
    let query = Search::builder()
        .query("steady state")
        .limit(20)
        .build()
        .unwrap();
    let start = Instant::now();
    let results = db.search(&query).unwrap();
    let latency = start.elapsed();
    println!("Search latency during steady state: {latency:?}");
    println!("Results: {}", results.items.len());
    assert!(
        latency < Duration::from_millis(100),
        "Search latency {latency:?} exceeds 100ms during steady-state merge"
    );
}
```
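The Phase 2 loop sleeps for exactly 2 seconds, which is the same interval as the syncer's time-based commit, so a round could in principle be observed just before its commit lands. One possible alternative is to poll for a condition with a deadline. The sketch below is a generic, std-only helper; `wait_until` and the timeout values in the usage note are hypothetical and not part of the existing test infrastructure.

```rust
use std::time::{Duration, Instant};

/// Polls `cond` every `poll` interval until it returns true or `timeout` elapses.
/// Returns true if the condition was met, false on timeout.
/// The condition is always evaluated at least once.
fn wait_until<F: FnMut() -> bool>(timeout: Duration, poll: Duration, mut cond: F) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if cond() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(poll);
    }
}
```

In the test, the fixed sleep could then become something like `wait_until(Duration::from_secs(10), Duration::from_millis(100), || { db.reload_text_index().unwrap(); db.text_segment_count() < 20 })`, with the assertion applied to the returned `bool`. Note that this changes the check from "below 20 at a fixed instant" to "below 20 within a deadline", which may or may not be the intended acceptance semantics.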
### 4. Concurrent read/write verification
Spawn a reader thread that continuously searches while the writer thread ingests documents. Measure reader latency percentiles:
```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};

#[test]
#[ignore]
fn tantivy_concurrent_read_write_latency() {
    let db = Arc::new(build_1m_db());
    let db_reader = db.clone();

    let reader_handle = std::thread::spawn(move || {
        let query = Search::builder()
            .query("tutorial")
            .limit(10)
            .build()
            .unwrap();
        let mut latencies = Vec::with_capacity(100);
        for _ in 0..100 {
            let start = Instant::now();
            let _ = db_reader.search(&query).unwrap();
            latencies.push(start.elapsed());
            std::thread::sleep(Duration::from_millis(50));
        }
        latencies
    });

    // Writer: add 5000 items while reader is searching
    for i in 0..5000u64 {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Concurrent write item {i}"));
        db.write_item_with_metadata(EntityId::new(2_000_000 + i), &meta).unwrap();
    }

    // `mut` is required: the latencies are sorted in place before indexing.
    let mut latencies = reader_handle.join().unwrap();
    latencies.sort();
    let p50 = latencies[latencies.len() / 2];
    let p99 = latencies[latencies.len() * 99 / 100];
    println!("Read latency during concurrent writes:");
    println!("  p50 = {p50:?}");
    println!("  p99 = {p99:?}");
    assert!(
        p99 < Duration::from_millis(100),
        "p99 read latency {p99:?} exceeds 100ms during concurrent writes"
    );
}
```
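With exactly 100 samples, `latencies[latencies.len() * 99 / 100]` selects index 99, which is the maximum rather than the 99th value, and `latencies[latencies.len() / 2]` selects the 51st value. That is a conservative choice for an upper-bound assertion, but if a stricter definition is wanted, a nearest-rank percentile is one option. The helper below is a minimal std-only sketch; the name `percentile` and the integer-percent parameter are assumptions for illustration.

```rust
use std::time::Duration;

/// Nearest-rank percentile over a slice that is already sorted ascending.
///
/// `pct` is an integer percentage in 1..=100. Returns `None` for an empty
/// slice or an out-of-range `pct`. Integer arithmetic avoids float rounding.
fn percentile(sorted: &[Duration], pct: usize) -> Option<Duration> {
    if sorted.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    // rank = ceil(pct / 100 * n), computed with integer ceiling division.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

For 100 sorted samples this returns the 50th value for `pct = 50` and the 99th value for `pct = 99`, so a single outlier among 100 reads would no longer decide the p99 assertion on its own.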
### 5. TextIndex API extension
To observe segment count, expose a method on `TextIndex`:
```rust
impl TextIndex {
    /// Returns the current number of searchable segments.
    ///
    /// Useful for monitoring merge policy effectiveness.
    #[must_use]
    pub fn segment_count(&self) -> usize {
        self.reader.searcher().segment_readers().len()
    }
}
```
And propagate through `TidalDb`:
```rust
impl TidalDb {
    /// Returns the current Tantivy segment count (for diagnostics).
    #[must_use]
    pub fn text_segment_count(&self) -> usize {
        self.text_index.segment_count()
    }
}
```
## Acceptance Criteria
- [ ] `LogMergePolicy` configured with `min_num_segments=4`, `max_docs_before_merge=5_000_000`, `del_docs_ratio_before_merge=0.3`
- [ ] `TextIndex::segment_count()` method exposed for monitoring
- [ ] Segment count < 20 after 1M document ingest and 10 rounds of 100-document steady-state writes
- [ ] Concurrent read latency p99 < 100ms during active writes at 1M documents
- [ ] Merge policy parameters documented in `docs/profiling/tantivy-merge-tuning.md`
- [ ] No regression in existing `tidal/benches/text_index.rs` and `tidal/benches/search.rs` benchmarks
## Test Strategy
1. **Segment count test:** `tantivy_segment_evolution` (ignored test, run manually) verifies segment count stays below 20 at steady state.
2. **Concurrent latency test:** `tantivy_concurrent_read_write_latency` (ignored test) measures read p99 during active writes.
3. **Regression:** Run existing `cargo bench --manifest-path tidal/Cargo.toml --bench text_index` and `--bench search` before and after applying the merge policy change. Latency should not regress.
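4. **Unit test:** Verify that `configure_merge_policy()` returns a policy with the expected parameter values. Confirm which getters the pinned Tantivy version exposes on `LogMergePolicy`; for any field without a getter, assert on the policy's `Debug` output instead.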
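
If the pinned Tantivy version turns out not to expose getters for the tuned fields, one way to keep the unit test meaningful is to hold the tuned values in a plain struct that `configure_merge_policy()` reads from, and assert on that struct. The sketch below is std-only; `MergeTuning`, `TUNED`, and `validate` are hypothetical names, not part of Tantivy or the existing `TextIndex`.

```rust
/// Tuned merge parameters from this task, kept in one place so tests can
/// assert on them. Hypothetical helper type, not a Tantivy type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MergeTuning {
    pub min_num_segments: usize,
    pub max_docs_before_merge: usize,
    pub del_docs_ratio_before_merge: f32,
}

pub const TUNED: MergeTuning = MergeTuning {
    min_num_segments: 4,
    max_docs_before_merge: 5_000_000,
    del_docs_ratio_before_merge: 0.3,
};

impl MergeTuning {
    /// Sanity-checks the values before they are handed to the merge policy.
    pub fn validate(&self) -> Result<(), String> {
        if self.min_num_segments < 2 {
            return Err("min_num_segments must be at least 2".to_string());
        }
        if self.max_docs_before_merge == 0 {
            return Err("max_docs_before_merge must be positive".to_string());
        }
        let ratio = self.del_docs_ratio_before_merge;
        if !(ratio > 0.0 && ratio <= 1.0) {
            return Err(format!("del_docs_ratio_before_merge {ratio} not in (0, 1]"));
        }
        Ok(())
    }
}

// In configure_merge_policy(), the setters would then read from TUNED, e.g.:
//   policy.set_min_num_segments(TUNED.min_num_segments);
//   policy.set_max_docs_before_merge(TUNED.max_docs_before_merge);
//   policy.set_del_docs_ratio_before_merge(TUNED.del_docs_ratio_before_merge);
```

The unit test can then assert `TUNED.validate().is_ok()` and compare `TUNED` against the values listed in the acceptance criteria, independent of what the policy type itself exposes.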