tidaldb/docs/profiling/hotspot-analysis.md
2026-02-23 22:41:16 -07:00

Flamegraph Hotspot Analysis

Profiling Setup

macOS (primary)

cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*

Linux

cargo install flamegraph
cargo flamegraph --manifest-path tidal/Cargo.toml --bench scale

Both tools generate profiles that can be viewed in the browser (samply) or as SVG (flamegraph).

SVG artifacts: docs/profiling/retrieve_1m.svg and docs/profiling/search_1m.svg. These are generated by running the benchmarks and are not committed to the repository.

Expected Hot Paths

Based on code analysis of the RETRIEVE and SEARCH pipelines:

RETRIEVE (for_you profile, 1M items)

| Stage | Hot Path | Estimated % |
|---|---|---|
| Stage 3: Signal scoring | DashMap lookup + exp() per candidate | ~45% |
| Stage 1: Candidate generation | RoaringBitmap universe scan | ~20% |
| Stage 2: Filter evaluation | Bitmap intersections | ~15% |
| Stage 4: Sort (top-K) | Vec::sort_unstable on scored candidates | ~10% |
| Stage 5: Result assembly | Struct allocation, metadata lookup | ~10% |

SEARCH (text_only query, 1M items)

| Stage | Hot Path | Estimated % |
|---|---|---|
| BM25 engine | Tantivy posting list traversal, IDF scoring | ~50% |
| ANN (USearch HNSW) | Graph traversal at ef_search=200 | ~30% |
| RRF fusion | Score normalization, merge sort | ~10% |
| Post-filter | Bitmap filter evaluation | ~10% |

Optimizations Applied

1. Sort optimization for top-K (RETRIEVE Stage 4)

Applied: sort_by -> sort_unstable_by at both sort sites in tidal/src/ranking/executor/mod.rs.

Deferred: The full select_nth_unstable_by optimization (an O(N) partition followed by an O(K log K) sort of the top K) requires knowing limit at the sort site. In the current executor design, the caller truncates to limit after normalization, so the sort site has no access to K. Applying select_nth_unstable_by would require threading limit through the scoring pipeline, a more invasive refactor that is deferred to a future optimization pass.

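Current benefit: sort_unstable_by has the same O(N log N) complexity as sort_by, but it sorts in place without allocating a scratch buffer and does not preserve the relative order of equal elements. The expected gain is ~5-10% in sort time; this is an estimate and has not yet been measured.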
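
The deferred partition-then-sort approach could look roughly like the sketch below. It is an illustration only, not the executor's code: the Scored tuple type, the top_k function, and the assumption that limit is available at the sort site are all hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical scored candidate: (entity id, score). The real executor's
/// types are not shown in this document.
pub type Scored = (u64, f32);

/// Orders candidates best-first. total_cmp is a total order on f32, so a
/// stray NaN cannot make the comparator inconsistent.
fn by_score_desc(a: &Scored, b: &Scored) -> Ordering {
    b.1.total_cmp(&a.1)
}

/// Returns the `limit` highest-scoring candidates, sorted best-first.
pub fn top_k(mut scored: Vec<Scored>, limit: usize) -> Vec<Scored> {
    if limit == 0 {
        return Vec::new();
    }
    if limit < scored.len() {
        // O(N) partition: afterwards the `limit` best candidates occupy
        // scored[..limit], in no particular order.
        scored.select_nth_unstable_by(limit - 1, by_score_desc);
        scored.truncate(limit);
    }
    // O(K log K): only the survivors are fully sorted.
    scored.sort_unstable_by(by_score_desc);
    scored
}
```

When limit is at least the number of candidates, the function falls back to a plain sort_unstable_by, which matches the behavior applied today.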

2. LogMergePolicy for Tantivy (SEARCH BM25)

Reduces segment count from potentially hundreds to < 20 at steady state. Fewer segments = fewer posting list merges during BM25 scoring.

Location: tidal/src/text/index.rs. LogMergePolicy is configured with:

  • min_num_segments = 4
  • max_docs_before_merge = 5_000_000
  • del_docs_ratio_before_merge = 0.3

3. ef_construction=400 (ANN build quality)

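A higher ef_construction makes the index builder consider more candidate neighbors for each inserted vector, which produces a better-connected HNSW graph. Search can then reach the true nearest neighbors at the same ef_search, improving recall without increasing search latency. The cost is a slower index build.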

Deferred Hotspot Work

The following optimizations were identified but deferred as premature until flamegraph profiling confirms they are in the top-3 hotspots:

  • DashMap lookup sharding: Batch signal reads by pre-sorting entity IDs by DashMap shard to improve cache locality. Expected gain: ~10-15% in Stage 3.
  • exp() approximation: Replace (-lambda * dt).exp() with a fast approximation (e.g., exp_fast via bit manipulation). Expected gain: ~5-8% in Stage 3 on high-throughput paths.
  • Tantivy heap budget increase: Raise the indexing heap budget from 50MB to 100MB to reduce flush frequency during a 1M-item ingest.

These should be re-evaluated after actual flamegraph data is collected.
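
To make the exp() item concrete, one well-known bit-manipulation technique is Schraudolph's approximation, which writes a scaled and biased copy of x directly into the exponent and mantissa bits of a float. The sketch below is an assumption about what such a helper might look like. The names exp_fast and decay_weight are hypothetical, and the relative error is a few percent, so it would only be usable if signal scoring tolerates that.

```rust
/// Schraudolph-style approximation of e^x for f32 (hypothetical helper).
/// Relative error is a few percent; it trades accuracy for speed.
pub fn exp_fast(x: f32) -> f32 {
    if x.is_nan() {
        return x;
    }
    // Outside this range the bit trick would leave the valid f32 exponent
    // range, so clamp to the limiting values instead.
    if x < -87.0 {
        return 0.0;
    }
    if x > 88.0 {
        return f32::INFINITY;
    }
    // 2^23 / ln(2): moves x / ln(2) into the exponent field of an f32.
    const SCALE: f32 = 12_102_203.0;
    // 127 * 2^23 (the exponent bias) minus a small shift that balances the
    // error of the linear mantissa interpolation.
    const BIAS: f32 = 1_064_866_805.0;
    let bits = (x * SCALE + BIAS) as i32;
    f32::from_bits(bits as u32)
}

/// Exponential time-decay weight, mirroring (-lambda * dt).exp().
pub fn decay_weight(lambda: f32, dt: f32) -> f32 {
    exp_fast(-lambda * dt)
}
```

Any swap of this kind should be benchmarked against the standard exp() and checked for its effect on ranking before it is adopted.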

Before/After Evidence

| Optimization | Change | Expected Benefit |
|---|---|---|
| sort_by -> sort_unstable_by (Stage 4) | Eliminates stability bookkeeping | ~5-10% reduction in sort time |
| LogMergePolicy (SEARCH BM25) | Reduces segment count from potentially 50+ to < 20 | Fewer posting list merges per query |
| ef_construction=400 (ANN build) | Higher graph quality -> fewer graph re-traversals during search | Improved recall without latency regression |

Note: Actual before/after latency deltas require running samply or flamegraph on the release bench binary. Flamegraph SVG artifacts are machine-generated and too large for git, so they are not committed to the repository. To generate:

cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*

Next Steps

  1. Run samply record target/release/deps/scale-* on macOS target hardware
  2. Identify top-3 hotspots from the flamegraph (by cumulative self-time)
  3. Apply the corresponding optimization to any of those hotspots that accounts for > 10% of total time
  4. Re-run benchmark before/after to confirm improvement
  5. Update this document with actual SVG artifacts and measured deltas
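
Steps 2 and 3 amount to a simple triage rule, sketched below under stated assumptions. The Frame struct and hotspots_to_optimize function are hypothetical, and the sketch assumes the profile has already been exported as per-function self-sample counts, a step this document does not describe.

```rust
/// One row of a flattened profile (hypothetical shape): a function name and
/// the number of samples attributed to that function itself.
pub struct Frame<'a> {
    pub name: &'a str,
    pub self_samples: u64,
}

/// Ranks functions by self-sample share, keeps the top three, and returns
/// those whose share of total samples exceeds `threshold` (e.g. 0.10).
pub fn hotspots_to_optimize<'a>(frames: &[Frame<'a>], threshold: f64) -> Vec<(&'a str, f64)> {
    let total: u64 = frames.iter().map(|f| f.self_samples).sum();
    if total == 0 {
        return Vec::new();
    }
    let mut shares: Vec<(&'a str, f64)> = frames
        .iter()
        .map(|f| (f.name, f.self_samples as f64 / total as f64))
        .collect();
    // Largest share first.
    shares.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    shares
        .into_iter()
        .take(3)
        .filter(|(_, share)| *share > threshold)
        .collect()
}
```

A function outside the top three is never returned, even if it exceeds the threshold, which matches the order of the steps above.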