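# Flamegraph Hotspot Analysis

## Profiling Setup

### macOS (primary)

```bash
# Install the profiler
cargo install samply
# Build the bench targets with the release profile (does not run them)
cargo build --manifest-path tidal/Cargo.toml --release --benches
# Record the scale bench; if the glob also matches a scale-*.d file, pass the executable path explicitly
samply record target/release/deps/scale-*
```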

### Linux

```bash
# cargo-flamegraph drives perf on Linux
cargo install flamegraph
# Profile the scale bench and write the flamegraph as an SVG
cargo flamegraph --manifest-path tidal/Cargo.toml --bench scale
```

`samply` opens the recorded profile in a browser-based viewer, while `cargo flamegraph` writes an SVG file.

_SVG artifacts: `docs/profiling/retrieve_1m.svg` and `docs/profiling/search_1m.svg`_
_These are generated by running the benchmarks and are not committed to the repository._

## Expected Hot Paths

The estimates below come from reading the code of the RETRIEVE and SEARCH pipelines, not from measured profiles:

### RETRIEVE (`for_you` profile, 1M items)

| Stage | Hot Path | Estimated % |
|-------|---------|-------------|
| Stage 3: Signal scoring | DashMap lookup + `exp()` per candidate | ~45% |
| Stage 1: Candidate generation | RoaringBitmap universe scan | ~20% |
| Stage 2: Filter evaluation | Bitmap intersections | ~15% |
| Stage 4: Sort (top-K) | `sort_unstable_by` on scored candidates | ~10% |
| Stage 5: Result assembly | Struct allocation, metadata lookup | ~10% |

### SEARCH (`text_only` query, 1M items)

| Stage | Hot Path | Estimated % |
|-------|---------|-------------|
| BM25 engine | Tantivy posting list traversal, IDF scoring | ~50% |
| ANN (USearch HNSW) | Graph traversal at ef_search=200 | ~30% |
| RRF fusion | Score normalization, merge sort | ~10% |
| Post-filter | Bitmap filter evaluation | ~10% |

## Optimizations Applied

### 1. Sort optimization for top-K (RETRIEVE Stage 4)

**Applied:** `sort_by` -> `sort_unstable_by` at both sort sites in
`tidal/src/ranking/executor/mod.rs`.

**Deferred:** The full `select_nth_unstable_by` optimization (O(N) partition + O(K log K) sort)
requires knowing `limit` at the sort site. In the current executor design, the caller truncates
to `limit` after normalization, so the sort site does not have access to `K`. Applying
`select_nth_unstable_by` would require passing `limit` through the scoring pipeline, a more
invasive refactor that is deferred to a future optimization pass.

**Current benefit:** `sort_unstable_by` has the same O(N log N) complexity as `sort_by` but is
expected to be ~5-10% faster in practice because it sorts in place and skips stability
bookkeeping. Candidates with equal scores may come back in a different relative order.
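
The deferred partition-then-sort approach can be sketched as follows, assuming `limit` were available at the sort site. The `Scored` struct, the comparator, and both function names are hypothetical stand-ins for illustration, not the types used in `tidal/src/ranking/executor/mod.rs`.

```rust
use std::cmp::Ordering;

/// Hypothetical scored candidate; the real executor type may differ.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    id: u64,
    score: f32,
}

/// Highest score first; ties broken by id so the order is deterministic.
fn by_score_desc(a: &Scored, b: &Scored) -> Ordering {
    b.score.total_cmp(&a.score).then_with(|| a.id.cmp(&b.id))
}

/// Current shape: sort all N candidates, then truncate to K.
fn top_k_full_sort(mut candidates: Vec<Scored>, k: usize) -> Vec<Scored> {
    candidates.sort_unstable_by(by_score_desc);
    candidates.truncate(k);
    candidates
}

/// Deferred shape: O(N) partition around the K-th element,
/// then sort only the K survivors (O(K log K)).
fn top_k_select(mut candidates: Vec<Scored>, k: usize) -> Vec<Scored> {
    if k == 0 {
        return Vec::new();
    }
    if k < candidates.len() {
        // After this call, the first k elements are the k best, in arbitrary order.
        candidates.select_nth_unstable_by(k - 1, by_score_desc);
        candidates.truncate(k);
    }
    candidates.sort_unstable_by(by_score_desc);
    candidates
}
```

Both functions return the same top-K list. The second one fully sorts only K elements after the partition, which is where the saving would come from when K is much smaller than N.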

### 2. LogMergePolicy for Tantivy (SEARCH BM25)

Reduces segment count from potentially hundreds to < 20 at steady state.
Fewer segments mean fewer posting list merges during BM25 scoring.

**Location:** `tidal/src/text/index.rs`, where `LogMergePolicy` is configured with:
- `min_num_segments = 4`
- `max_docs_before_merge = 5_000_000`
- `del_docs_ratio_before_merge = 0.3`

### 3. ef_construction=400 (ANN build quality)

Higher construction quality reduces the number of graph re-traversals needed
during search, improving recall without increasing search latency. The cost is
paid at index build time, which grows with `ef_construction`.

## Deferred Hotspot Work

The following optimizations were identified but deferred as premature until
flamegraph profiling confirms they are in the top-3 hotspots:

- **DashMap lookup sharding:** Batch signal reads by pre-sorting entity IDs by DashMap shard
  to improve cache locality. Expected gain: ~10-15% in Stage 3.
- **`exp()` approximation:** Replace `(-lambda * dt).exp()` with a fast approximation
  (e.g., `exp_fast` via bit manipulation). Expected gain: ~5-8% in Stage 3 on high-throughput paths.
- **Tantivy heap budget increase:** 50 MB → 100 MB to reduce flushing frequency during 1M-item ingest.

These should be re-evaluated after actual flamegraph data is collected.
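
As a rough illustration of the `exp()` idea, the sketch below uses a Schraudolph-style bit trick on `f32`. The names `exp_fast`, `decay_exact`, `decay_fast`, and `max_relative_error` are hypothetical, and the constants are the commonly quoted ones for this trick rather than anything taken from the codebase. The approximation is off by a few percent, so whether that is acceptable for ranking scores would need to be checked before adopting it.

```rust
/// Approximate e^x by writing a scaled, biased value straight into the
/// exponent and mantissa bits of an f32 (Schraudolph-style trick).
fn exp_fast(x: f32) -> f32 {
    // Outside this range the result would leave the normal f32 exponent range.
    let x = x.clamp(-87.0, 88.0);
    const SCALE: f32 = 12_102_203.0; // about 2^23 / ln(2)
    const BIAS: f32 = 1_064_866_805.0; // 127 * 2^23 minus an error-balancing shift
    let bits = (SCALE * x + BIAS) as i32;
    f32::from_bits(bits as u32)
}

/// Exact decay weight, as written in the bullet above.
fn decay_exact(lambda: f32, dt: f32) -> f32 {
    (-lambda * dt).exp()
}

/// Same weight using the approximation.
fn decay_fast(lambda: f32, dt: f32) -> f32 {
    exp_fast(-lambda * dt)
}

/// Largest relative error of `exp_fast` over evenly spaced points in [lo, hi].
fn max_relative_error(lo: f32, hi: f32, steps: u32) -> f32 {
    let mut worst = 0.0_f32;
    for i in 0..=steps {
        let x = lo + (hi - lo) * (i as f32) / (steps as f32);
        let exact = x.exp();
        let rel = ((exp_fast(x) - exact) / exact).abs();
        worst = worst.max(rel);
    }
    worst
}
```

Calling `max_relative_error(-10.0, 0.0, 1000)` gives a quick bound on the error over the decay range of interest before deciding whether the speedup is worth the loss of precision.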

## Before/After Evidence

| Optimization | Change | Expected Benefit |
|-------------|--------|-----------------|
| `sort_by` -> `sort_unstable_by` (Stage 4) | Eliminates stability bookkeeping | ~5-10% reduction in sort time |
| LogMergePolicy (SEARCH BM25) | Reduces segment count from potentially 50+ to < 20 | Fewer posting list merges per query |
| ef_construction=400 (ANN build) | Higher graph quality -> fewer graph re-traversals during search | Improved recall without latency regression |

_Note: Actual before/after latency deltas require running `samply`/`flamegraph` on the
release bench binary. Flamegraph SVG artifacts are machine-generated and are not committed
to the repository because they are too large for git. To generate:_

```bash
# Install the profiler
cargo install samply
# Build the bench targets with the release profile (does not run them)
cargo build --manifest-path tidal/Cargo.toml --release --benches
# Record the scale bench; if the glob also matches a scale-*.d file, pass the executable path explicitly
samply record target/release/deps/scale-*
```

## Next Steps

1. Run `samply record target/release/deps/scale-*` on macOS target hardware
2. Identify top-3 hotspots from the flamegraph (by cumulative self-time)
3. Apply the matching deferred optimization if its hotspot accounts for > 10% of total time
4. Re-run the benchmark before and after each change to confirm the improvement
5. Update this document with actual SVG artifacts and measured deltas
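
The 10% threshold in step 3 reflects how a stage-local gain shrinks when expressed as end-to-end time. A small worked sketch, using the estimated (not measured) percentages from this document; the function names are hypothetical:

```rust
/// Fraction of end-to-end time saved when a stage that takes `stage_share`
/// of total time has its own time reduced by `stage_gain`.
/// Both arguments are fractions in 0.0..=1.0.
fn end_to_end_saving(stage_share: f64, stage_gain: f64) -> f64 {
    stage_share * stage_gain
}

/// Upper-bound estimates taken from the tables and bullets above.
fn estimated_savings() -> [(&'static str, f64); 3] {
    [
        // Stage 4 is ~10% of RETRIEVE; sort_unstable_by saves up to ~10% of it.
        ("sort_unstable_by, Stage 4", end_to_end_saving(0.10, 0.10)),
        // Stage 3 is ~45% of RETRIEVE; shard batching saves up to ~15% of it.
        ("DashMap shard batching, Stage 3", end_to_end_saving(0.45, 0.15)),
        // Stage 3 is ~45% of RETRIEVE; exp() approximation saves up to ~8% of it.
        ("exp() approximation, Stage 3", end_to_end_saving(0.45, 0.08)),
    ]
}
```

With these upper-bound estimates, the sort change saves about 1% end to end, while the two Stage 3 ideas would save roughly 3.6% and 6.75%. Small stages are therefore poor targets even when the local speedup looks attractive.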