jordan 213b8efcca feat: complete M6-M7 + Enterprise Readiness milestones; split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-23 22:41:16 -07:00

4.6 KiB


Flamegraph Hotspot Analysis

Profiling Setup

macOS (primary)

cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*

Linux

cargo install flamegraph
cargo flamegraph --manifest-path tidal/Cargo.toml --bench scale

Both tools generate profiles that can be viewed in the browser (samply) or as SVG (flamegraph).

SVG artifacts: docs/profiling/retrieve_1m.svg and docs/profiling/search_1m.svg. These are generated by running the benchmarks and are not committed to the repository.

Expected Hot Paths

The percentages below are estimates from code analysis of the RETRIEVE and SEARCH pipelines, not measured profile data:

RETRIEVE (`for_you` profile, 1M items)

| Stage | Hot Path | Estimated % |
| --- | --- | --- |
| Stage 3: Signal scoring | DashMap lookup + `exp()` per candidate | ~45% |
| Stage 1: Candidate generation | RoaringBitmap universe scan | ~20% |
| Stage 2: Filter evaluation | Bitmap intersections | ~15% |
| Stage 4: Sort (top-K) | `Vec::sort_unstable` on scored candidates | ~10% |
| Stage 5: Result assembly | Struct allocation, metadata lookup | ~10% |

SEARCH (`text_only` query, 1M items)

| Stage | Hot Path | Estimated % |
| --- | --- | --- |
| BM25 engine | Tantivy posting list traversal, IDF scoring | ~50% |
| ANN (USearch HNSW) | Graph traversal at ef_search=200 | ~30% |
| RRF fusion | Score normalization, merge sort | ~10% |
| Post-filter | Bitmap filter evaluation | ~10% |

Optimizations Applied

1. Sort optimization for top-K (RETRIEVE Stage 4)

Applied: sort_by -> sort_unstable_by at both sort sites in tidal/src/ranking/executor/mod.rs.

Deferred: The full select_nth_unstable_by optimization (O(N) partition + O(K log K) sort) requires knowing limit at the sort site. In the current executor design, the caller truncates to limit after normalization, so the sort site does not have access to K. Applying select_nth_unstable_by would require passing limit through the scoring pipeline, a more invasive refactor that is deferred to a future optimization pass.

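Current benefit: sort_unstable_by has the same O(N log N) complexity as sort_by, but it sorts in place without the auxiliary allocation a stable sort needs and does not have to preserve the order of equal elements. Expected gain: ~5-10% of sort time (an estimate, not yet measured).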
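The deferred partition-then-sort approach can be sketched as follows, assuming limit were available at the sort site. The `Scored` type, the `top_k` function, and the tie-break on `id` are hypothetical names introduced for illustration; the executor's real types are not shown in this document.

```rust
use std::cmp::Ordering;

/// Hypothetical scored candidate (illustration only, not the executor's type).
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    id: u64,
    score: f64,
}

/// Descending by score; ties broken by ascending id so the output is deterministic.
fn by_score_desc(a: &Scored, b: &Scored) -> Ordering {
    b.score.total_cmp(&a.score).then(a.id.cmp(&b.id))
}

/// Returns the `limit` best candidates, best first.
fn top_k(mut items: Vec<Scored>, limit: usize) -> Vec<Scored> {
    if limit == 0 {
        return Vec::new();
    }
    if limit < items.len() {
        // O(N) partition: afterwards items[..limit] holds the best `limit`
        // candidates in arbitrary order.
        items.select_nth_unstable_by(limit - 1, by_score_desc);
        items.truncate(limit);
    }
    // O(K log K): sort only the survivors.
    items.sort_unstable_by(by_score_desc);
    items
}
```

Calling `top_k(candidates, 3)` on five candidates with scores 0.2, 0.9, 0.5, 0.7, 0.1 (ids 0 through 4) yields ids 1, 3, 2. The gain over a full sort only matters when K is much smaller than N.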

2. LogMergePolicy for Tantivy (SEARCH BM25)

Reduces segment count from potentially hundreds to < 20 at steady state. Fewer segments = fewer posting list merges during BM25 scoring.

Location: tidal/src/text/index.rs — LogMergePolicy configured with:

min_num_segments = 4
max_docs_before_merge = 5_000_000
del_docs_ratio_before_merge = 0.3

3. ef_construction=400 (ANN build quality)

A higher ef_construction builds a higher-quality HNSW graph at the cost of longer index build time. Search then needs fewer graph re-traversals, which improves recall without increasing search latency.

Deferred Hotspot Work

The following optimizations were identified but deferred as premature until flamegraph profiling confirms they are in the top-3 hotspots:

- DashMap lookup sharding: Batch signal reads by pre-sorting entity IDs by DashMap shard to improve cache locality. Expected gain: ~10-15% in Stage 3.
- exp() approximation: Replace (-lambda * dt).exp() with a fast approximation (e.g., exp_fast via bit manipulation). Expected gain: ~5-8% in Stage 3 on high-throughput paths.
- Tantivy heap budget increase: 50 MB to 100 MB, to reduce flushing frequency under 1M ingest.

These should be re-evaluated after actual flamegraph data is collected.
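For reference, one possible shape of the exp() approximation is sketched below. It uses Schraudolph's bit-manipulation trick, which is an assumption on our part: the document only names "exp_fast via bit manipulation" and does not specify an algorithm. The function names `decay_exact` and `decay_fast` are hypothetical. The approximation has a worst-case relative error of roughly 4%, so whether it is acceptable depends on how sensitive ranking is to the decay term.

```rust
/// Schraudolph-style approximation of e^x (illustrative sketch).
/// It writes a linear function of x into the high 32 bits of an IEEE 754
/// double, so the exponent field does the work of 2^(x / ln 2).
fn exp_fast(x: f64) -> f64 {
    // 2^20 / ln 2: one unit of binary exponent corresponds to bit 20 of the high word.
    const A: f64 = 1_048_576.0 / std::f64::consts::LN_2;
    // 1023 * 2^20 is the exponent bias shifted into place; 60801 is
    // Schraudolph's correction term that reduces the average error.
    const B: f64 = 1_072_693_248.0 - 60_801.0;

    if x.is_nan() {
        return x;
    }
    // Outside this range the bit trick would overflow the exponent field.
    if x < -700.0 {
        return 0.0;
    }
    if x > 700.0 {
        return f64::INFINITY;
    }
    let hi = (A * x + B) as i64;
    f64::from_bits((hi as u64) << 32)
}

/// Exact exponential decay, as written in the list above.
fn decay_exact(lambda: f64, dt: f64) -> f64 {
    (-lambda * dt).exp()
}

/// Same decay using the approximation.
fn decay_fast(lambda: f64, dt: f64) -> f64 {
    exp_fast(-lambda * dt)
}
```

A before/after benchmark should compare `decay_exact` and `decay_fast` on realistic lambda and dt distributions, and also check that top-K ordering does not change materially.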

Before/After Evidence

| Optimization | Change | Expected Benefit |
| --- | --- | --- |
| `sort_by` -> `sort_unstable_by` (Stage 4) | Eliminates stability bookkeeping | ~5-10% reduction in sort time |
| LogMergePolicy (SEARCH BM25) | Reduces segment count from potentially 50+ to < 20 | Fewer posting list merges per query |
| ef_construction=400 (ANN build) | Higher graph quality -> fewer graph re-traversals during search | Improved recall without latency regression |

Note: Actual before/after latency deltas require running samply/flamegraph on the release bench binary. Flamegraph SVG artifacts are machine-generated and are not committed to the repository -- they are too large for git. To generate:

cargo install samply
cargo build --manifest-path tidal/Cargo.toml --release --benches
samply record target/release/deps/scale-*

Next Steps

1. Run samply record target/release/deps/scale-* on macOS target hardware.
2. Identify the top-3 hotspots from the flamegraph (by cumulative self-time).
3. For each of those hotspots that accounts for > 10% of total time, apply the corresponding optimization.
4. Re-run the benchmark before and after each change to confirm the improvement.
5. Update this document with actual SVG artifacts and measured deltas.
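The triage rule in the steps above (top-3 by self-time, act only above 10%) can be sketched as a small helper. The function name `hotspots_to_optimize` and its input shape are hypothetical; in practice the per-function self-times would be read off the samply or flamegraph output by hand or from an export.

```rust
/// Given (function name, self-time) samples, returns the top-3 entries by
/// self-time whose share of total time exceeds `threshold` (e.g. 0.10).
/// Each returned pair is (name, share of total time in 0.0..=1.0).
fn hotspots_to_optimize<'a>(
    self_times: &[(&'a str, f64)],
    threshold: f64,
) -> Vec<(&'a str, f64)> {
    let total: f64 = self_times.iter().map(|&(_, t)| t).sum();
    if total <= 0.0 {
        return Vec::new();
    }
    // Convert absolute self-times into shares of the total.
    let mut shares: Vec<(&'a str, f64)> = self_times
        .iter()
        .map(|&(name, t)| (name, t / total))
        .collect();
    // Largest share first.
    shares.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    // Keep the top three, then drop any that fall below the threshold.
    shares
        .into_iter()
        .take(3)
        .filter(|&(_, share)| share > threshold)
        .collect()
}
```

With a threshold of 0.10, a profile dominated by one function (say 95% of self-time) returns only that function, which signals that the other deferred optimizations are not worth applying yet.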
