# Task 03: Tantivy Merge Policy Tuning
## Delivers
Configured `LogMergePolicy` for Tantivy's embedded text index. Benchmarked segment count evolution under steady-state write load at 1M documents. Verified that background merges do not block concurrent reads. Segment count stays below 20 at steady state.
## Complexity
M
## Dependencies
- task-01 complete (scale benchmark infrastructure, 1M-item TidalDb)
- `docs/research/tantivy.md` (LogMergePolicy parameters, segment merge behavior)
## Technical Design
### 1. Current state
The research doc identifies segment merging as the primary latency risk. Tantivy's `LogMergePolicy` governs when small segments are merged into larger ones after each commit. The relevant parameters:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `min_num_segments` | 8 | Minimum number of same-level segments before a merge fires |
| `max_docs_before_merge` | 10_000_000 | Segments larger than this are never merged |
| `del_docs_ratio_before_merge` | 1.0 | Fraction of deleted docs in a segment that triggers a merge |
| `min_layer_size` | 10_000 | Segments smaller than this are all treated as belonging to the lowest layer |

Currently, `TextIndex` uses Tantivy's default `LogMergePolicy` without tuning. At 1M documents with commits every 1000 items or 2 seconds, the syncer produces ~1000 commits during initial ingest, each creating 1-8 small segments. Without a tuned merge policy, segment count can grow to 50+ before merges catch up.
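To make the interaction between `min_num_segments` and `min_layer_size` concrete, the following is a simplified, standalone model of log-style level grouping. It is a sketch, not Tantivy's implementation: it ignores deletes and `max_docs_before_merge`, the function names are hypothetical, and the level width of 0.75 (in log2 units) is an assumption about the default `level_log_size`.

```rust
// Simplified model of log-style merge levels. Not Tantivy's code:
// deletes and `max_docs_before_merge` are ignored.
const MIN_LAYER_SIZE: u32 = 10_000; // default from the parameter table
const LEVEL_LOG_SIZE: f64 = 0.75; // assumed default level width, in log2 units

/// Groups segment sizes (doc counts) into levels, largest segments first.
fn group_into_levels(mut sizes: Vec<u32>) -> Vec<Vec<u32>> {
    sizes.sort_unstable_by(|a, b| b.cmp(a));
    let mut levels: Vec<Vec<u32>> = Vec::new();
    let mut level_top = f64::MAX;
    for size in sizes {
        // Segments below MIN_LAYER_SIZE are sized as if they held MIN_LAYER_SIZE docs,
        // so all tiny segments land in the same lowest level.
        let log_size = f64::from(size.max(MIN_LAYER_SIZE)).log2();
        if log_size < level_top - LEVEL_LOG_SIZE {
            // More than one level width below the current level's top: open a new level.
            level_top = log_size;
            levels.push(Vec::new());
        }
        if let Some(level) = levels.last_mut() {
            level.push(size);
        }
    }
    levels
}

/// Returns the levels that hold at least `min_num_segments` segments.
fn merge_candidates(sizes: Vec<u32>, min_num_segments: usize) -> Vec<Vec<u32>> {
    group_into_levels(sizes)
        .into_iter()
        .filter(|level| level.len() >= min_num_segments)
        .collect()
}
```

Under this model, four freshly committed 1000-doc segments next to one 50_000-doc segment form a merge candidate when `min_num_segments` is 4, but not when it is 8.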
### 2. Merge policy configuration
Apply tuned parameters in `TextIndex` construction:
```rust
use tantivy::merge_policy::LogMergePolicy;
use tantivy::IndexWriter;

fn configure_merge_policy() -> LogMergePolicy {
    let mut policy = LogMergePolicy::default();
    // Merge when 4+ segments exist at the same level (more aggressive than the default 8).
    // At 1M docs with 1000-doc commits, this should keep steady-state segments < 15.
    policy.set_min_num_segments(4);
    // Allow merges of segments up to 5M docs. The default 10M is fine for single-node,
    // but 5M reduces the maximum merge duration (smaller max segment = faster merge).
    policy.set_max_docs_before_merge(5_000_000);
    // Trigger a merge when 30% of docs in a segment are deleted (default 100% = never).
    // tidalDB uses delete-then-add for updates, so deleted docs accumulate.
    policy.set_del_docs_ratio_before_merge(0.3);
    policy
}

// In TextIndex construction:
// 2 indexing threads sharing a 50 MB total memory budget.
let index_writer: IndexWriter = index.writer_with_num_threads(2, 50_000_000)?;
index_writer.set_merge_policy(Box::new(configure_merge_policy()));
```
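The `del_docs_ratio_before_merge` setting is easiest to reason about as a per-segment ratio check. The sketch below is a standalone illustration under stated assumptions: `SegmentStats` is a hypothetical stand-in for the counters a merge policy sees, and the strict `>` comparison is assumed (it is what makes the default of 1.0 behave as "never").

```rust
/// Hypothetical stand-in for the per-segment counters a merge policy inspects.
struct SegmentStats {
    max_doc: u32,
    num_deleted_docs: u32,
}

/// Fraction of a segment's documents that are marked deleted.
fn deleted_ratio(seg: &SegmentStats) -> f32 {
    if seg.max_doc == 0 {
        return 0.0;
    }
    seg.num_deleted_docs as f32 / seg.max_doc as f32
}

/// True if any segment in the level exceeds the threshold (strict comparison assumed).
fn exceeds_delete_threshold(level: &[SegmentStats], threshold: f32) -> bool {
    level.iter().any(|seg| deleted_ratio(seg) > threshold)
}
```

With a threshold of 0.3, a 10_000-doc segment qualifies once more than 3000 of its docs have been replaced by delete-then-add updates.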
### 3. Segment count observation benchmark
This is not a standard Criterion benchmark -- it measures segment count evolution over time under write+read load. Implement as a test that prints a report:
```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[test]
#[ignore] // only run manually: cargo test --manifest-path tidal/Cargo.toml -- tantivy_segment_evolution --ignored --nocapture
fn tantivy_segment_evolution() {
    let db = build_scale_db_with_incremental_writes();

    // Phase 1: Bulk ingest 1M items (already done in build)
    // Observe segment count after initial ingest
    let seg_count_after_ingest = db.text_segment_count(); // expose via TextIndex
    println!("Segments after 1M ingest: {seg_count_after_ingest}");

    // Phase 2: Steady-state writes (100 items every 2 seconds, 10 rounds)
    for round in 0..10u64 {
        let base_id = 1_000_000 + round * 100;
        for i in 0..100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Steady state item {}", base_id + i));
            db.write_item_with_metadata(EntityId::new(base_id + i), &meta).unwrap();
        }
        std::thread::sleep(Duration::from_secs(2));
        db.reload_text_index().unwrap();

        let seg_count = db.text_segment_count();
        println!("Round {round}: segments = {seg_count}");
        assert!(seg_count < 20, "Segment count {seg_count} exceeds threshold of 20");
    }

    // Phase 3: Read latency during merge
    // Fire a search while merge is in progress (if detectable)
    let query = Search::builder()
        .query("steady state")
        .limit(20)
        .build()
        .unwrap();

    let start = Instant::now();
    let results = db.search(&query).unwrap();
    let latency = start.elapsed();
    println!("Search latency during steady state: {latency:?}");
    println!("Results: {}", results.items.len());

    assert!(
        latency < Duration::from_millis(100),
        "Search latency {latency:?} exceeds 100ms during steady-state merge"
    );
}
```
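The test above waits a fixed 2 seconds per round before sampling. If that proves flaky on slower machines, one option is to poll until the segment count settles or a timeout expires. The helper below is a generic sketch; `poll_until` is a hypothetical name and not part of tidalDB or Tantivy.

```rust
use std::time::{Duration, Instant};

/// Calls `probe` every `interval` until `done` accepts its value or `timeout` elapses.
/// Returns the last observed value and whether the condition was met.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> T,
    done: impl Fn(&T) -> bool,
) -> (T, bool) {
    let deadline = Instant::now() + timeout;
    loop {
        let value = probe();
        if done(&value) {
            return (value, true);
        }
        if Instant::now() >= deadline {
            return (value, false);
        }
        std::thread::sleep(interval);
    }
}

// Possible use inside the segment test (project calls as named in this document):
// let (segs, settled) = poll_until(
//     Duration::from_secs(10),
//     Duration::from_millis(200),
//     || { db.reload_text_index().unwrap(); db.text_segment_count() },
//     |n| *n < 20,
// );
```

Polling keeps the assertion threshold unchanged while making the wait adaptive; the fixed sleep remains the simpler choice if timing is stable.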
### 4. Concurrent read/write verification
Spawn a reader thread that continuously searches while the writer thread ingests documents. Measure reader latency percentiles:
```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};

#[test]
#[ignore]
fn tantivy_concurrent_read_write_latency() {
    let db = Arc::new(build_1m_db());

    let db_reader = db.clone();
    let reader_handle = std::thread::spawn(move || {
        let query = Search::builder()
            .query("tutorial")
            .limit(10)
            .build()
            .unwrap();

        // 100 samples spaced 50 ms apart: at least 5 s of sampling.
        let mut latencies = Vec::with_capacity(100);
        for _ in 0..100 {
            let start = Instant::now();
            let _ = db_reader.search(&query).unwrap();
            latencies.push(start.elapsed());
            std::thread::sleep(Duration::from_millis(50));
        }
        latencies
    });

    // Writer: add 5000 items while reader is searching
    for i in 0..5000u64 {
        let mut meta = HashMap::new();
        meta.insert("title".to_string(), format!("Concurrent write item {i}"));
        db.write_item_with_metadata(EntityId::new(2_000_000 + i), &meta).unwrap();
    }

    // `mut` is required because the samples are sorted in place below.
    let mut latencies = reader_handle.join().unwrap();
    latencies.sort();
    let p50 = latencies[latencies.len() / 2];
    let p99 = latencies[latencies.len() * 99 / 100];
    println!("Read latency during concurrent writes:");
    println!("  p50 = {p50:?}");
    println!("  p99 = {p99:?}");

    assert!(
        p99 < Duration::from_millis(100),
        "p99 read latency {p99:?} exceeds 100ms during concurrent writes"
    );
}
```
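With exactly 100 samples, the index `len * 99 / 100` used above selects the largest sample, which is a conservative reading of p99. If a conventional nearest-rank percentile is preferred, a small helper along these lines could replace the manual indexing. This is a sketch; `percentile` is a hypothetical helper, not an existing tidalDB function.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least `pct` percent
/// of all samples are less than or equal to it.
/// Returns `None` for an empty slice or a `pct` above 100.
fn percentile(samples: &mut [Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct > 100 {
        return None;
    }
    samples.sort_unstable();
    // Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    let rank = (pct * samples.len() + 99) / 100;
    // Ranks are 1-based; pct = 0 maps to the smallest sample.
    Some(samples[rank.max(1) - 1])
}
```

For 100 samples this returns the 99th-smallest value for `pct = 99` and the maximum only for `pct = 100`.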
### 5. TextIndex API extension
To observe segment count, expose a method on `TextIndex`:
```rust
impl TextIndex {
    /// Returns the current number of searchable segments.
    ///
    /// Useful for monitoring merge policy effectiveness.
    #[must_use]
    pub fn segment_count(&self) -> usize {
        self.reader.searcher().segment_readers().len()
    }
}
```
And propagate through `TidalDb`:
```rust
impl TidalDb {
    /// Returns the current Tantivy segment count (for diagnostics).
    #[must_use]
    pub fn text_segment_count(&self) -> usize {
        self.text_index.segment_count()
    }
}
```
## Acceptance Criteria
- [ ] `LogMergePolicy` configured with `min_num_segments=4`, `max_docs_before_merge=5_000_000`, `del_docs_ratio_before_merge=0.3`
- [ ] `TextIndex::segment_count()` method exposed for monitoring
- [ ] Segment count < 20 after 1M document ingest and 10 rounds of 100-document steady-state writes
- [ ] Concurrent read latency p99 < 100ms during active writes at 1M documents
- [ ] Merge policy parameters documented in `docs/profiling/tantivy-merge-tuning.md`
- [ ] No regression in existing `tidal/benches/text_index.rs` and `tidal/benches/search.rs` benchmarks
## Test Strategy
1. **Segment count test:** `tantivy_segment_evolution` (ignored test, run manually) verifies segment count stays below 20 at steady state.
2. **Concurrent latency test:** `tantivy_concurrent_read_write_latency` (ignored test, run manually) measures read p99 during active writes.
3. **Regression:** Run the existing benchmarks (`cargo bench --manifest-path tidal/Cargo.toml --bench text_index` and `--bench search`) before and after applying the merge policy change. Latency should not regress.
4. **Unit test:** Verify that `configure_merge_policy()` returns a policy with the expected parameter values. This depends on what the Tantivy version in use exposes: use getters if the relevant fields have them, otherwise assert on the policy's `Debug` output.
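If the policy fields turn out not to be readable from a constructed `LogMergePolicy`, one way to keep the unit test meaningful is to hold the tuned values in a plain struct that `configure_merge_policy()` reads from, and assert on that struct. The sketch below is standalone; `MergeTuning`, `TUNED`, and `validate` are hypothetical names, and the accepted ranges (at least 2 segments, a ratio in (0, 1]) are assumptions rather than documented Tantivy constraints.

```rust
/// Hypothetical holder for the tuned values; not part of Tantivy or tidalDB.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MergeTuning {
    min_num_segments: usize,
    max_docs_before_merge: usize,
    del_docs_ratio_before_merge: f32,
}

impl MergeTuning {
    /// Values from the acceptance criteria of this task.
    const TUNED: Self = Self {
        min_num_segments: 4,
        max_docs_before_merge: 5_000_000,
        del_docs_ratio_before_merge: 0.3,
    };

    /// Rejects values outside the ranges assumed to be sensible for a log merge policy.
    fn validate(&self) -> Result<(), String> {
        if self.min_num_segments < 2 {
            return Err(format!(
                "min_num_segments must be >= 2, got {}",
                self.min_num_segments
            ));
        }
        let ratio = self.del_docs_ratio_before_merge;
        if !(ratio > 0.0 && ratio <= 1.0) {
            return Err(format!(
                "del_docs_ratio_before_merge must be in (0, 1], got {ratio}"
            ));
        }
        Ok(())
    }
}
```

`configure_merge_policy()` would then pass each field to the matching setter, so the unit test only needs to check `MergeTuning::TUNED`.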