# Task 06: Flamegraph Profiling + Hotspot Optimization ## Delivers Flamegraph profiles of RETRIEVE and SEARCH hot paths at 1M items. Top-3 hotspots documented. Any function consuming > 10% of total time is optimized with before/after benchmark evidence. ## Complexity L ## Dependencies - task-01 complete (1M-item benchmark suite with baselines) ## Technical Design ### 1. Profiling tooling Use `cargo flamegraph` (wraps `perf` on Linux, `dtrace` on macOS) to generate SVG flamegraphs from the scale benchmark: ```bash # Install if needed cargo install flamegraph # Profile RETRIEVE at 1M items # Run the scale benchmark binary under perf/dtrace sampling cargo flamegraph --manifest-path tidal/Cargo.toml \ --bench scale \ -o docs/profiling/retrieve_1m.svg \ -- --bench "retrieve_1m/for_you" # Profile SEARCH at 1M items cargo flamegraph --manifest-path tidal/Cargo.toml \ --bench scale \ -o docs/profiling/search_1m.svg \ -- --bench "search_1m/text_only" ``` On macOS, if `dtrace` is restricted, use `Instruments.app` with the Time Profiler template targeting the bench binary, or use `samply`: ```bash cargo install samply cargo bench --manifest-path tidal/Cargo.toml --bench scale --no-run samply record target/release/deps/scale-* -- --bench "retrieve_1m/for_you" ``` ### 2. Profiling harness Write a standalone binary that exercises the hot path in a tight loop for profiling (avoids Criterion's measurement overhead interfering with the profile): ```rust // tidal/benches/profile_retrieve.rs (not a Criterion bench -- raw binary) use std::collections::HashMap; use std::time::Duration; use tidaldb::TidalDb; use tidaldb::query::retrieve::Retrieve; use tidaldb::ranking::diversity::DiversityConstraints; use tidaldb::schema::{ DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window, }; fn main() { // Build 1M-item DB (same as task-01) let db = build_scale_db(); let query = Retrieve::builder() .profile("for_you") .limit(20) .diversity(DiversityConstraints::new().max_per_creator(2)) .build() .unwrap(); // Warm up for _ in 0..10 { let _ = db.retrieve(&query).unwrap(); } // Hot loop for profiler sampling let iterations = 1000; let start = std::time::Instant::now(); for _ in 0..iterations { let _ = db.retrieve(&query).unwrap(); } let elapsed = start.elapsed(); println!( "{iterations} iterations in {elapsed:?} ({:.2} us/iter)", elapsed.as_micros() as f64 / iterations as f64 ); } ``` Register as an example or standalone binary: ```toml [[example]] name = "profile_retrieve" [[example]] name = "profile_search" ``` ### 3. Expected hotspot categories Based on the RETRIEVE pipeline architecture (5 stages), likely hotspots: | Stage | Expected cost | Why | |-------|--------------|-----| | 1. Candidate generation | Low | Bitmap scan or DashMap iteration | | 2. Filter evaluation | Medium | Bitmap intersections over 1M-bit bitmaps | | 3. Signal scoring | High | DashMap lookups + exp() for every candidate x every signal | | 4. Diversity enforcement | Low-Medium | Sorting + greedy selection | | 5. Result assembly | Low | Top-K collection | For SEARCH: | Stage | Expected cost | Why | |-------|--------------|-----| | BM25 scoring | High | Tantivy posting list traversal across multiple segments | | ANN search | High | USearch graph traversal with distance computation | | RRF fusion | Low | Rank merging is O(n log n) on small candidate sets | | Profile scoring | Medium | Same as RETRIEVE stage 3 | ### 4. Optimization patterns by hotspot type #### If DashMap lookup is > 10% The `entries.get(&(entity_id, type_id))` call on every candidate for every signal type involves hashing and shard locking. Optimization: batch reads by pre-sharding candidates by shard index: ```rust /// Pre-group entity IDs by DashMap shard to minimize lock contention /// and improve cache locality during the scoring pass. fn batch_score_by_shard( entries: &DashMap<(EntityId, SignalTypeId), EntitySignalEntry>, candidates: &[EntityId], type_id: SignalTypeId, lambda: f64, now_ns: u64, ) -> Vec<(EntityId, f64)> { // DashMap has 16 shards; group candidates by shard index let shard_count = entries.shards().len(); let mut sharded: Vec> = vec![Vec::new(); shard_count]; for &id in candidates { let hash = entries.hash_usize(&(id, type_id)); let shard_idx = hash % shard_count; sharded[shard_idx].push(id); } let mut results = Vec::with_capacity(candidates.len()); for shard_candidates in &sharded { for &id in shard_candidates { if let Some(entry) = entries.get(&(id, type_id)) { let score = entry.hot.current_score(0, now_ns, lambda); results.push((id, score)); } } } results } ``` #### If exp() is > 10% The `(-lambda * dt).exp()` call dominates when scoring hundreds of candidates. Optimization: use a fast approximation for ranking-only queries (relative ordering is sufficient): ```rust /// Fast exponential approximation for ranking (not absolute scoring). /// Uses the identity: exp(x) ~ (1 + x/1024)^1024 for small |x|. /// Accuracy: < 0.1% relative error for |x| < 10. #[inline] fn fast_exp_approx(x: f64) -> f64 { // Schraudolph's algorithm: interpret float bits as exp approximation // Only suitable when relative ordering matters, not absolute values. let y = 1.0 + x / 1024.0; y.powi(1024) } ``` Or use the Jacobs forward-decay trick from the research doc: factor out `e^(-lambda * t_now)` which is constant across all entities, and rank by `S_static` alone (zero exp() calls at query time). #### If bitmap intersection is > 10% RoaringBitmap AND operations on 1M-bit bitmaps can be optimized by checking cardinality first and short-circuiting. The current evaluator already does selectivity-ordered AND; verify this is working. #### If sort is > 10% The top-K selection in diversity enforcement sorts all scored candidates. Replace with a partial sort (selection algorithm): ```rust // Instead of: candidates.sort_by(|a, b| b.score.partial_cmp(&a.score)); // Use: select the top K without fully sorting fn top_k_partial_sort(candidates: &mut [(EntityId, f64)], k: usize) { if k < candidates.len() { candidates.select_nth_unstable_by(k, |a, b| { b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal) }); candidates.truncate(k); candidates.sort_unstable_by(|a, b| { b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal) }); } } ``` ### 5. Measurement methodology For each identified hotspot: 1. **Before optimization:** Record Criterion benchmark result from task-01 baselines. 2. **Apply optimization.** 3. **After optimization:** Re-run the exact same Criterion benchmark. 4. **Document:** Before/after numbers, percentage improvement, and whether the hotspot dropped below 10%. ### 6. Output artifacts - `docs/profiling/retrieve_1m.svg` -- RETRIEVE flamegraph - `docs/profiling/search_1m.svg` -- SEARCH flamegraph - `docs/profiling/hotspot-analysis.md` -- Written analysis of top-3 hotspots with before/after numbers ## Acceptance Criteria - [ ] Flamegraph SVGs generated for RETRIEVE and SEARCH at 1M items - [ ] Top-3 hotspots identified and documented with percentage of total time - [ ] Any hotspot > 10% optimized with before/after Criterion benchmark evidence - [ ] `docs/profiling/hotspot-analysis.md` contains: hotspot name, percentage, root cause, optimization applied, before/after latency - [ ] No correctness regression: existing test suites pass after optimizations - [ ] No performance regression in non-optimized paths (measured via existing benchmarks) ## Test Strategy 1. **Profiling (manual):** Generate flamegraphs and visually inspect. This is inherently a manual analysis task. 2. **Optimization correctness:** For each optimization applied, run the full test suite (`cargo test --manifest-path tidal/Cargo.toml --lib`) to verify no behavioral change. 3. **Performance validation:** For each optimization, compare Criterion before/after results. The optimization must show measurable improvement (> 5%) to justify the code change. 4. **Property test stability:** If any optimized function has existing property tests (e.g., `HotSignalState` accuracy), verify they still pass.