tidaldb/docs/profiling/hotspot-analysis.md
2026-02-23 22:41:16 -07:00

# Flamegraph Hotspot Analysis
## Profiling Setup
### macOS (primary)
```bash
cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*
```
### Linux
```bash
cargo install flamegraph
cargo flamegraph --manifest-path tidal/Cargo.toml --bench scale
```
samply opens its profile in a browser-based viewer; `cargo flamegraph` writes an SVG.
_SVG artifacts: `docs/profiling/retrieve_1m.svg` and `docs/profiling/search_1m.svg`_
_These are generated by running the benchmarks — not committed to the repository._
## Expected Hot Paths
Based on code analysis of the RETRIEVE and SEARCH pipelines:
### RETRIEVE (`for_you` profile, 1M items)
| Stage | Hot Path | Estimated % |
|-------|---------|-------------|
| Stage 3: Signal scoring | DashMap lookup + `exp()` per candidate | ~45% |
| Stage 1: Candidate generation | RoaringBitmap universe scan | ~20% |
| Stage 2: Filter evaluation | Bitmap intersections | ~15% |
| Stage 4: Sort (top-K) | `sort_unstable_by` on scored candidates | ~10% |
| Stage 5: Result assembly | Struct allocation, metadata lookup | ~10% |
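To make the Stage 3 row concrete, the following is a minimal sketch of a scoring loop that pays one map lookup and one `exp()` per candidate. It is not the code in `tidal`: the `Signal` struct, its fields, and `score_candidates` are hypothetical names, and a standard `HashMap` stands in for the DashMap. Only the decay form `(-lambda * dt).exp()` is taken from this document.

```rust
use std::collections::HashMap;

/// Hypothetical per-entity signal: a weight and the time it was last updated.
struct Signal {
    weight: f32,
    updated_at_secs: f32,
}

/// Scores each candidate as `weight * exp(-lambda * dt)`.
/// A std `HashMap` stands in for the concurrent map (assumption).
fn score_candidates(
    candidates: &[u64],
    signals: &HashMap<u64, Signal>,
    now_secs: f32,
    lambda: f32,
) -> Vec<(u64, f32)> {
    candidates
        .iter()
        .map(|&id| {
            // One hash lookup per candidate.
            let score = match signals.get(&id) {
                Some(s) => {
                    // Elapsed time since the last update, never negative.
                    let dt = (now_secs - s.updated_at_secs).max(0.0);
                    // One exp() per candidate that has a signal.
                    s.weight * (-lambda * dt).exp()
                }
                // Candidates without a signal score zero in this sketch.
                None => 0.0,
            };
            (id, score)
        })
        .collect()
}
```

Because both costs scale linearly with the candidate count, shrinking the candidate set in Stages 1 and 2 reduces Stage 3 time proportionally in this model.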
### SEARCH (`text_only` query, 1M items)
| Stage | Hot Path | Estimated % |
|-------|---------|-------------|
| BM25 engine | Tantivy posting list traversal, IDF scoring | ~50% |
| ANN (USearch HNSW) | Graph traversal at ef_search=200 | ~30% |
| RRF fusion | Score normalization, merge sort | ~10% |
| Post-filter | Bitmap filter evaluation | ~10% |
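For reference, the textbook form of reciprocal rank fusion gives each document the sum of `1 / (k + rank)` over the result lists it appears in. The sketch below implements that form only; the function name `rrf_fuse` and the choice of `k` are assumptions, and the table's mention of score normalization suggests the fusion step in `tidal` may differ.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d),
/// with ranks starting at 1. Returns documents sorted best first.
fn rrf_fuse(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut fused: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (i, &doc) in ranking.iter().enumerate() {
            // `i` is zero-based, so the rank is `i + 1`.
            *fused.entry(doc).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut out: Vec<(u64, f64)> = fused.into_iter().collect();
    // Highest fused score first; ties broken by id so the output is deterministic.
    out.sort_unstable_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    out
}
```

With a BM25 list `[1, 2, 3]`, an ANN list `[2, 4, 1]`, and `k = 60.0`, document 2 ranks first because it sits near the top of both lists.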
## Optimizations Applied
### 1. Sort optimization for top-K (RETRIEVE Stage 4)
**Applied:** `sort_by` -> `sort_unstable_by` at both sort sites in
`tidal/src/ranking/executor/mod.rs`.
**Deferred:** The full `select_nth_unstable_by` optimization (O(N) partition + O(K log K) sort)
requires knowing `limit` at the sort site. In the current executor design, truncation to
`limit` is applied by the caller after normalization, so the sort site does not have access
to `K`. Applying `select_nth_unstable_by` would require passing `limit` through the scoring
pipeline, which is a more invasive refactor deferred to a future optimization pass.
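**Current benefit:** `sort_unstable_by` has the same O(N log N) complexity as `sort_by` but
sorts in place, without the temporary buffer a stable sort allocates. The expected gain is
~5-10% of sort time; this is an estimate and has not been measured yet.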
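The deferred variant could look roughly like the sketch below, assuming `limit` were available at the sort site. The `Scored` struct, `by_score_desc`, and `top_k` are hypothetical names for illustration and do not reflect the executor's actual types.

```rust
use std::cmp::Ordering;

/// Hypothetical scored candidate; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    id: u64,
    score: f32,
}

/// Descending by score; `total_cmp` gives a total order even if a NaN appears.
fn by_score_desc(a: &Scored, b: &Scored) -> Ordering {
    b.score.total_cmp(&a.score)
}

/// Keeps the `limit` highest-scoring candidates, sorted best first.
fn top_k(mut candidates: Vec<Scored>, limit: usize) -> Vec<Scored> {
    if limit == 0 {
        return Vec::new();
    }
    if limit < candidates.len() {
        // O(N) partition: every element before index `limit` ranks at least
        // as high as every element after it.
        candidates.select_nth_unstable_by(limit, by_score_desc);
        candidates.truncate(limit);
    }
    // O(K log K): order only the survivors.
    candidates.sort_unstable_by(by_score_desc);
    candidates
}
```

The partition step only pays off when `limit` is much smaller than the candidate count; when the two are close, the plain `sort_unstable_by` already in place does about the same work.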
### 2. LogMergePolicy for Tantivy (SEARCH BM25)
Reduces segment count from potentially hundreds to < 20 at steady state.
Fewer segments = fewer posting list merges during BM25 scoring.
**Location:** `tidal/src/text/index.rs`, with `LogMergePolicy` configured with:
- `min_num_segments = 4`
- `max_docs_before_merge = 5_000_000`
- `del_docs_ratio_before_merge = 0.3`
### 3. ef_construction=400 (ANN build quality)
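A larger `ef_construction` makes the HNSW builder consider more neighbor candidates per insert,
producing a better-connected graph at the cost of longer index builds. The expected result is
higher recall at the same `ef_search`, so search latency should not increase.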
## Deferred Hotspot Work
The following optimizations were identified but deferred as premature until
flamegraph profiling confirms they are in the top-3 hotspots:
- **DashMap lookup sharding:** Batch signal reads by pre-sorting entity IDs by DashMap shard
to improve cache locality. Expected gain: ~10-15% in Stage 3.
- **`exp()` approximation:** Replace `(-lambda * dt).exp()` with a fast approximation
(e.g., `exp_fast` via bit manipulation). Expected gain: ~5-8% in Stage 3 on high-throughput paths.
- **Tantivy heap budget increase:** 50MB -> 100MB to reduce flushing frequency under 1M ingest.
These should be re-evaluated after actual flamegraph data is collected.
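As an illustration of the `exp()` approximation item, here is a minimal sketch of a Schraudolph-style bit trick for `f32`. The names `exp_fast` and `decayed` are hypothetical, and the constants are the commonly used ones for this trick rather than values tuned for `tidal`. The relative error is a few percent, so whether that is acceptable for ranking would need to be checked against real score distributions.

```rust
/// Approximates e^x by writing a scaled copy of x directly into the bit
/// pattern of an f32, so the integer part lands in the exponent field and
/// the fraction is linearly interpolated by the mantissa.
fn exp_fast(x: f32) -> f32 {
    // 2^23 / ln(2): one unit of x / ln(2) equals one step of the exponent field.
    const SCALE: f32 = 12_102_203.0;
    // 127 * 2^23 (the exponent bias) minus a small correction that centers the error.
    const BIAS: f32 = 1_064_866_805.0;
    // Clamp so the bit pattern stays a finite, positive, normal float.
    let x = x.clamp(-87.0, 88.0);
    f32::from_bits((SCALE * x + BIAS) as u32)
}

/// Decayed signal weight using the approximation in place of `(-lambda * dt).exp()`.
fn decayed(weight: f32, lambda: f32, dt: f32) -> f32 {
    weight * exp_fast(-lambda * dt)
}
```

The clamp means very old signals decay to a tiny positive value rather than exactly zero, which differs slightly from `exp()` underflow behavior.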
## Before/After Evidence
| Optimization | Change | Expected Benefit |
|-------------|--------|-----------------|
| `sort_by` -> `sort_unstable_by` (Stage 4) | Eliminates stability bookkeeping | ~5-10% reduction in sort time |
| LogMergePolicy (SEARCH BM25) | Reduces segment count from potentially 50+ to < 20 | Fewer posting list merges per query |
| ef_construction=400 (ANN build) | Higher graph quality -> fewer graph re-traversals during search | Improved recall without latency regression |
_Note: Actual before/after latency deltas require running `samply`/`flamegraph` on the
release bench binary. Flamegraph SVG artifacts are machine-generated and are not committed
to the repository -- they are too large for git. To generate:_
```bash
cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*
```
## Next Steps
1. Run `samply record target/release/deps/scale-*` on macOS target hardware
2. Identify top-3 hotspots from the flamegraph (by cumulative self-time)
3. Apply the matching deferred optimization for any hotspot that exceeds 10% of total time
4. Re-run benchmark before/after to confirm improvement
5. Update this document with actual SVG artifacts and measured deltas