tidaldb/docs/profiling/hotspot-analysis.md

# Flamegraph Hotspot Analysis

## Profiling Setup

### macOS (primary)

```bash
cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
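# The scale-* glob also matches non-executables such as the dep-info file (scale-<hash>.d),
# so select the binary (the only match without a file extension) explicitly.
BENCH_BIN=$(find target/release/deps -maxdepth 1 -type f -name 'scale-*' ! -name '*.*' | head -n 1)
samply record "$BENCH_BIN"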
```

### Linux

```bash
cargo install flamegraph
cargo flamegraph --manifest-path tidal/Cargo.toml --bench scale
```

samply opens the recorded profile in the Firefox Profiler UI in the browser; `cargo flamegraph` writes an SVG (`flamegraph.svg` by default).

_SVG artifacts: `docs/profiling/retrieve_1m.svg` and `docs/profiling/search_1m.svg`_
_These are generated by running the benchmarks — not committed to the repository._

## Expected Hot Paths

The percentages below are estimates from code analysis of the RETRIEVE and SEARCH pipelines. They are not measured profile data.

### RETRIEVE (`for_you` profile, 1M items)

| Stage | Hot Path | Estimated % |
|-------|---------|-------------|
| Stage 3: Signal scoring | DashMap lookup + `exp()` per candidate | ~45% |
| Stage 1: Candidate generation | RoaringBitmap universe scan | ~20% |
| Stage 2: Filter evaluation | Bitmap intersections | ~15% |
| Stage 4: Sort (top-K) | `Vec::sort_unstable` on scored candidates | ~10% |
| Stage 5: Result assembly | Struct allocation, metadata lookup | ~10% |

### SEARCH (`text_only` query, 1M items)

| Stage | Hot Path | Estimated % |
|-------|---------|-------------|
| BM25 engine | Tantivy posting list traversal, IDF scoring | ~50% |
| ANN (USearch HNSW) | Graph traversal at ef_search=200 | ~30% |
| RRF fusion | Score normalization, merge sort | ~10% |
| Post-filter | Bitmap filter evaluation | ~10% |
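
For orientation, the RRF fusion stage in the table can be pictured with a textbook reciprocal rank fusion sketch. This is not the repository's implementation: `rrf_fuse`, the `u64` document IDs, and the constant `k = 60.0` are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Textbook reciprocal rank fusion: each ranking contributes 1 / (k + rank)
/// per document, with rank starting at 1. Hypothetical helper, not project code.
fn rrf_fuse(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, &doc) in ranking.iter().enumerate() {
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + (idx + 1) as f64);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Best fused score first; ties broken by id so the order is deterministic.
    fused.sort_unstable_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}

fn main() {
    let bm25 = vec![10, 20, 30]; // hypothetical BM25 ranking, best first
    let ann = vec![20, 40, 10]; // hypothetical ANN ranking, best first
    // k = 60.0 is an assumed, commonly used constant.
    for (doc, score) in rrf_fuse(&[bm25, ann], 60.0) {
        println!("doc {doc}: {score:.5}");
    }
}
```

The per-document map update and the final sort are the parts that would show up in a profile of this stage.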

## Optimizations Applied

### 1. Sort optimization for top-K (RETRIEVE Stage 4)

**Applied:** `sort_by` -> `sort_unstable_by` at both sort sites in
`tidal/src/ranking/executor/mod.rs`.

**Deferred:** The full `select_nth_unstable_by` optimization (O(N) partition + O(K log K) sort)
requires knowing `limit` at the sort site. In the current executor design, truncation to
`limit` is applied by the caller after normalization, so the sort site does not have access
to `K`. Applying `select_nth_unstable` would require passing `limit` through the scoring
pipeline, which is a more invasive refactor deferred to a future optimization pass.
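
A minimal sketch of what the deferred change could look like once `limit` is available at the sort site. The `Scored` struct, `by_score_desc`, and `top_k` are hypothetical names; the executor's real types are not shown in this document.

```rust
use std::cmp::Ordering;

/// Hypothetical scored candidate; stands in for the executor's real type.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    id: u64,
    score: f32,
}

/// Descending by score, ties broken by ascending id so results are deterministic.
fn by_score_desc(a: &Scored, b: &Scored) -> Ordering {
    b.score.total_cmp(&a.score).then_with(|| a.id.cmp(&b.id))
}

/// Keeps the `limit` best candidates, sorted best-first.
fn top_k(mut candidates: Vec<Scored>, limit: usize) -> Vec<Scored> {
    if limit == 0 {
        return Vec::new();
    }
    if limit < candidates.len() {
        // O(N) on average: afterwards, indices 0..limit hold the best
        // `limit` candidates in arbitrary order.
        candidates.select_nth_unstable_by(limit - 1, by_score_desc);
        candidates.truncate(limit);
    }
    // O(K log K): order only the survivors.
    candidates.sort_unstable_by(by_score_desc);
    candidates
}

fn main() {
    let candidates: Vec<Scored> = [0.2_f32, 0.9, 0.5, 0.7, 0.1]
        .iter()
        .enumerate()
        .map(|(i, &score)| Scored { id: i as u64, score })
        .collect();
    let ids: Vec<u64> = top_k(candidates, 3).iter().map(|c| c.id).collect();
    println!("{ids:?}");
}
```

The explicit tie-break on `id` matters here because neither `select_nth_unstable_by` nor `sort_unstable_by` preserves the input order of equal-score candidates.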

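**Current benefit:** `sort_unstable_by` has the same O(N log N) complexity as `sort_by`, but it
sorts in place and skips the work needed to preserve the order of equal elements. The expected
gain is ~5-10% of sort time (an estimate, not yet measured). Candidates with equal scores may
come back in a different relative order than before.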

### 2. LogMergePolicy for Tantivy (SEARCH BM25)

Reduces the segment count from potentially hundreds to < 20 at steady state.
Each query is scored against every segment, so fewer segments means fewer posting
lists to traverse and merge during BM25 scoring.

**Location:** `tidal/src/text/index.rs` — `LogMergePolicy` configured with:
- `min_num_segments = 4`
- `max_docs_before_merge = 5_000_000`
- `del_docs_ratio_before_merge = 0.3`

### 3. ef_construction=400 (ANN build quality)

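`ef_construction` sets the size of the candidate list used while building the HNSW graph.
A larger value produces a better-connected graph, which is expected to improve recall at the
same `ef_search` without adding search latency. The cost is a slower index build.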

## Deferred Hotspot Work

The following optimizations were identified but deferred as premature until
flamegraph profiling confirms they are in the top-3 hotspots:

- **DashMap lookup sharding:** Batch signal reads by pre-sorting entity IDs by DashMap shard
  to improve cache locality. Expected gain: ~10-15% in Stage 3.
- **`exp()` approximation:** Replace `(-lambda * dt).exp()` with a fast approximation
  (e.g., `exp_fast` via bit manipulation). Expected gain: ~5-8% in Stage 3 on high-throughput paths.
- **Tantivy heap budget increase:** 50MB→100MB to reduce flushing frequency under 1M ingest.

These should be re-evaluated after actual flamegraph data is collected.
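
To make the `exp()` item concrete, here is a Schraudolph-style sketch of a bit-manipulation approximation next to the exact decay. The names `decay_exact` and `exp_fast`, the decay rate, and the clamp at `-700.0` are assumptions for illustration. This variant omits the usual bias correction, so it overestimates by up to about 6%; whether that error is acceptable for ranking would have to be checked against real scores.

```rust
/// Exact decay weight, matching the `(-lambda * dt).exp()` expression above.
fn decay_exact(lambda: f64, dt: f64) -> f64 {
    (-lambda * dt).exp()
}

/// Schraudolph-style approximation of e^x, intended for x <= 0 (decay weights).
/// It writes x / ln(2) into the exponent field of an f64, which interpolates
/// linearly between powers of two. Hypothetical helper, not project code.
fn exp_fast(x: f64) -> f64 {
    const TWO_POW_52: f64 = 4503599627370496.0; // 2^52, the f64 mantissa width
    const SCALE: f64 = TWO_POW_52 / std::f64::consts::LN_2;
    const BIAS: f64 = 1023.0 * TWO_POW_52; // f64 exponent bias, shifted into place
    if x < -700.0 {
        // The true value is below 1e-300; treat it as zero and avoid negative bit patterns.
        return 0.0;
    }
    let bits = (SCALE * x + BIAS) as i64;
    f64::from_bits(bits as u64)
}

fn main() {
    let lambda = 0.1; // hypothetical decay rate
    let mut max_rel_err: f64 = 0.0;
    for step in 0..=2000 {
        let dt = step as f64 * 0.1;
        let exact = decay_exact(lambda, dt);
        let approx = exp_fast(-lambda * dt);
        max_rel_err = max_rel_err.max(((approx - exact) / exact).abs());
    }
    println!("max relative error: {max_rel_err:.4}");
}
```

Running a comparison like the one in `main` over the real range of `lambda * dt` is a cheap way to decide whether the approximation is worth profiling at all.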

## Before/After Evidence

| Optimization | Change | Expected Benefit |
|-------------|--------|-----------------|
| `sort_by` -> `sort_unstable_by` (Stage 4) | Eliminates stability bookkeeping | ~5-10% reduction in sort time |
| LogMergePolicy (SEARCH BM25) | Reduces segment count from potentially hundreds to < 20 | Fewer posting list merges per query |
| ef_construction=400 (ANN build) | Higher graph quality -> fewer graph re-traversals during search | Improved recall without latency regression |

_Note: No before/after latency deltas have been measured yet; collecting them requires running
`samply` or `flamegraph` on the release bench binary. Flamegraph SVG artifacts are
machine-generated and too large for git, so they are not committed to the repository.
To generate them, run the commands under Profiling Setup._

## Next Steps

1. Run `samply record` against the release `scale` bench binary (see Profiling Setup) on macOS target hardware
2. Identify the top-3 hotspots from the flamegraph (by self-time summed across call sites)
3. Apply the matching deferred optimization if its hotspot accounts for > 10% of total time
4. Re-run benchmark before/after to confirm improvement
5. Update this document with actual SVG artifacts and measured deltas