Flamegraph Hotspot Analysis
Profiling Setup
macOS (primary)
cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*
Linux
cargo install flamegraph
cargo flamegraph --manifest-path tidal/Cargo.toml --bench scale
Both tools generate profiles that can be viewed in the browser (samply) or as SVG (flamegraph).
SVG artifacts: docs/profiling/retrieve_1m.svg and docs/profiling/search_1m.svg
These are generated by running the benchmarks — not committed to the repository.
Expected Hot Paths
Based on code analysis of the RETRIEVE and SEARCH pipelines:
RETRIEVE (for_you profile, 1M items)
| Stage | Hot Path | Estimated % |
|---|---|---|
| Stage 3: Signal scoring | DashMap lookup + exp() per candidate | ~45% |
| Stage 1: Candidate generation | RoaringBitmap universe scan | ~20% |
| Stage 2: Filter evaluation | Bitmap intersections | ~15% |
| Stage 4: Sort (top-K) | Vec::sort_unstable on scored candidates | ~10% |
| Stage 5: Result assembly | Struct allocation, metadata lookup | ~10% |
SEARCH (text_only query, 1M items)
| Stage | Hot Path | Estimated % |
|---|---|---|
| BM25 engine | Tantivy posting list traversal, IDF scoring | ~50% |
| ANN (USearch HNSW) | Graph traversal at ef_search=200 | ~30% |
| RRF fusion | Score normalization, merge sort | ~10% |
| Post-filter | Bitmap filter evaluation | ~10% |
Optimizations Applied
1. Sort optimization for top-K (RETRIEVE Stage 4)
Applied: sort_by -> sort_unstable_by at both sort sites in
tidal/src/ranking/executor/mod.rs.
Deferred: The full select_nth_unstable_by optimization (O(N) partition + O(K log K) sort)
requires knowing limit at the sort site. In the current executor design, truncation to
limit is applied by the caller after normalization, so the sort site does not have access
to K. Applying select_nth_unstable_by would require passing limit through the scoring
pipeline, which is a more invasive refactor deferred to a future optimization pass.
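Current benefit: sort_unstable_by has the same O(N log N) complexity as sort_by but is
expected to be ~5-10% faster in practice, because it sorts in place and skips the temporary
buffer and equal-element ordering guarantee that a stable sort needs.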
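The deferred partition-then-sort approach could look roughly like the sketch below, assuming limit were passed down to the sort site. Scored, by_score_desc, and top_k are hypothetical names for illustration only; they are not taken from tidal/src/ranking/executor/mod.rs.

```rust
use std::cmp::Ordering;

/// Hypothetical scored candidate; the real executor type may differ.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    id: u64,
    score: f32,
}

/// Descending order by score. `total_cmp` gives a total order even for NaN.
fn by_score_desc(a: &Scored, b: &Scored) -> Ordering {
    b.score.total_cmp(&a.score)
}

/// Returns the `limit` highest-scoring candidates, sorted best-first.
/// Cost: O(N) partition plus O(K log K) sort of the surviving prefix.
fn top_k(mut candidates: Vec<Scored>, limit: usize) -> Vec<Scored> {
    if limit == 0 {
        return Vec::new();
    }
    if limit < candidates.len() {
        // After this call, the `limit` best elements occupy
        // candidates[..limit], in arbitrary order.
        candidates.select_nth_unstable_by(limit - 1, by_score_desc);
        candidates.truncate(limit);
    }
    // Only the surviving prefix (at most `limit` elements) is fully sorted.
    candidates.sort_unstable_by(by_score_desc);
    candidates
}
```

Because truncation currently happens in the caller after normalization, a change like this is only valid if normalization does not depend on the candidates that would be dropped.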
2. LogMergePolicy for Tantivy (SEARCH BM25)
Reduces segment count from potentially hundreds to < 20 at steady state. Fewer segments = fewer posting list merges during BM25 scoring.
Location: tidal/src/text/index.rs — LogMergePolicy configured with:
- min_num_segments = 4
- max_docs_before_merge = 5_000_000
- del_docs_ratio_before_merge = 0.3
3. ef_construction=400 (ANN build quality)
Higher construction quality reduces the number of graph re-traversals needed during search, improving recall without increasing search latency.
Deferred Hotspot Work
The following optimizations were identified but deferred as premature until flamegraph profiling confirms they are in the top-3 hotspots:
- DashMap lookup sharding: Batch signal reads by pre-sorting entity IDs by DashMap shard to improve cache locality. Expected gain: ~10-15% in Stage 3.
- exp() approximation: Replace (-lambda * dt).exp() with a fast approximation (e.g., exp_fast via bit manipulation). Expected gain: ~5-8% in Stage 3 on high-throughput paths.
- Tantivy heap budget increase: 50MB→100MB to reduce flushing frequency under 1M ingest.
These should be re-evaluated after actual flamegraph data is collected.
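To make the exp() item concrete, here is a minimal sketch of one bit-manipulation approach (a Schraudolph-style approximation on f32). The name exp_fast comes from the list above; the constants, the f32 input type, the clamping range, and the decay helper are assumptions for illustration, not the project's implementation. The relative error of this variant is roughly 4% at worst, so whether it is acceptable for signal scoring would need to be checked against ranking quality before adopting it.

```rust
/// Approximates e^x by writing a scaled, biased value directly into the
/// exponent and mantissa bits of an f32 (Schraudolph-style trick).
/// Assumed accuracy: relative error within roughly 4% over the clamped range.
fn exp_fast(x: f32) -> f32 {
    // 2^23 / ln(2): maps x onto the f32 exponent field scale.
    const SCALE: f32 = 12_102_203.0;
    // 127 * 2^23 (exponent bias) minus an offset that balances the error.
    const BIAS: f32 = 1_064_866_805.0;

    // Below about -87 the true result underflows the normal f32 range.
    if x < -87.0 {
        return 0.0;
    }
    // Above about 88 the bit pattern would overflow into inf/NaN territory.
    let x = x.min(88.0);

    f32::from_bits((SCALE * x + BIAS) as i32 as u32)
}

/// Hypothetical decay weight: exp(-lambda * dt) using the approximation.
fn decay(lambda: f32, dt: f32) -> f32 {
    exp_fast(-lambda * dt)
}
```

For decay scoring the argument is never positive, so only the lower clamp matters in practice; the upper clamp is there to keep the function safe as a general-purpose helper.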
Before/After Evidence
| Optimization | Change | Expected Benefit |
|---|---|---|
| sort_by -> sort_unstable_by (Stage 4) | Eliminates stability bookkeeping | ~5-10% reduction in sort time |
| LogMergePolicy (SEARCH BM25) | Reduces segment count from potentially 50+ to < 20 | Fewer posting list merges per query |
| ef_construction=400 (ANN build) | Higher graph quality -> fewer graph re-traversals during search | Improved recall without latency regression |
Note: Actual before/after latency deltas require running samply/flamegraph on the
release bench binary. Flamegraph SVG artifacts are machine-generated and are not committed
to the repository -- they are too large for git. To generate:
cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*
Next Steps
- Run samply record target/release/deps/scale-* on macOS target hardware
- Identify top-3 hotspots from the flamegraph (by cumulative self-time)
- Apply the matching deferred optimization to any hotspot that accounts for > 10% of total time
- Re-run benchmark before/after to confirm improvement
- Update this document with actual SVG artifacts and measured deltas