tidaldb/docs/planning/milestone-7/phase-3/task-06-flamegraph-profiling.md
2026-02-23 22:41:16 -07:00

8.3 KiB

Task 06: Flamegraph Profiling + Hotspot Optimization

Delivers

Flamegraph profiles of RETRIEVE and SEARCH hot paths at 1M items. Top-3 hotspots documented. Any function consuming > 10% of total time is optimized with before/after benchmark evidence.

Complexity

L

Dependencies

  • task-01 complete (1M-item benchmark suite with baselines)

Technical Design

1. Profiling tooling

Use cargo flamegraph (wraps perf on Linux, dtrace on macOS) to generate SVG flamegraphs from the scale benchmark:

# Install if needed
cargo install flamegraph

# Profile RETRIEVE at 1M items
# Run the scale benchmark binary under perf/dtrace sampling
cargo flamegraph --manifest-path tidal/Cargo.toml \
    --bench scale \
    -o docs/profiling/retrieve_1m.svg \
    -- --bench "retrieve_1m/for_you"

# Profile SEARCH at 1M items
cargo flamegraph --manifest-path tidal/Cargo.toml \
    --bench scale \
    -o docs/profiling/search_1m.svg \
    -- --bench "search_1m/text_only"

On macOS, if dtrace is restricted, use Instruments.app with the Time Profiler template targeting the bench binary, or use samply:

cargo install samply
cargo bench --manifest-path tidal/Cargo.toml --bench scale --no-run
samply record target/release/deps/scale-* -- --bench "retrieve_1m/for_you"

2. Profiling harness

Write a standalone binary that exercises the hot path in a tight loop for profiling (avoids Criterion's measurement overhead interfering with the profile):

// tidal/benches/profile_retrieve.rs (not a Criterion bench -- raw binary)

use std::collections::HashMap;
use std::time::Duration;

use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};

fn main() {
    // Build 1M-item DB (same as task-01)
    let db = build_scale_db();

    let query = Retrieve::builder()
        .profile("for_you")
        .limit(20)
        .diversity(DiversityConstraints::new().max_per_creator(2))
        .build()
        .unwrap();

    // Warm up
    for _ in 0..10 {
        let _ = db.retrieve(&query).unwrap();
    }

    // Hot loop for profiler sampling
    let iterations = 1000;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        let _ = db.retrieve(&query).unwrap();
    }
    let elapsed = start.elapsed();
    println!(
        "{iterations} iterations in {elapsed:?} ({:.2} us/iter)",
        elapsed.as_micros() as f64 / iterations as f64
    );
}

Register as an example or standalone binary:

[[example]]
name = "profile_retrieve"

[[example]]
name = "profile_search"

3. Expected hotspot categories

Based on the RETRIEVE pipeline architecture (5 stages), likely hotspots:

Stage Expected cost Why
1. Candidate generation Low Bitmap scan or DashMap iteration
2. Filter evaluation Medium Bitmap intersections over 1M-bit bitmaps
3. Signal scoring High DashMap lookups + exp() for every candidate x every signal
4. Diversity enforcement Low-Medium Sorting + greedy selection
5. Result assembly Low Top-K collection

For SEARCH:

Stage Expected cost Why
BM25 scoring High Tantivy posting list traversal across multiple segments
ANN search High USearch graph traversal with distance computation
RRF fusion Low Rank merging is O(n log n) on small candidate sets
Profile scoring Medium Same as RETRIEVE stage 3

4. Optimization patterns by hotspot type

If DashMap lookup is > 10%

The entries.get(&(entity_id, type_id)) call on every candidate for every signal type involves hashing and shard locking. Optimization: batch reads by pre-sharding candidates by shard index:

/// Pre-group entity IDs by DashMap shard to minimize lock contention
/// and improve cache locality during the scoring pass.
fn batch_score_by_shard(
    entries: &DashMap<(EntityId, SignalTypeId), EntitySignalEntry>,
    candidates: &[EntityId],
    type_id: SignalTypeId,
    lambda: f64,
    now_ns: u64,
) -> Vec<(EntityId, f64)> {
    // DashMap has 16 shards; group candidates by shard index
    let shard_count = entries.shards().len();
    let mut sharded: Vec<Vec<EntityId>> = vec![Vec::new(); shard_count];
    for &id in candidates {
        let hash = entries.hash_usize(&(id, type_id));
        let shard_idx = hash % shard_count;
        sharded[shard_idx].push(id);
    }

    let mut results = Vec::with_capacity(candidates.len());
    for shard_candidates in &sharded {
        for &id in shard_candidates {
            if let Some(entry) = entries.get(&(id, type_id)) {
                let score = entry.hot.current_score(0, now_ns, lambda);
                results.push((id, score));
            }
        }
    }
    results
}

If exp() is > 10%

The (-lambda * dt).exp() call dominates when scoring hundreds of candidates. Optimization: use a fast approximation for ranking-only queries (relative ordering is sufficient):

/// Fast exponential approximation for ranking (not absolute scoring).
/// Uses the identity: exp(x) ~ (1 + x/1024)^1024 for small |x|.
/// Accuracy: < 0.1% relative error for |x| < 10.
#[inline]
fn fast_exp_approx(x: f64) -> f64 {
    // Schraudolph's algorithm: interpret float bits as exp approximation
    // Only suitable when relative ordering matters, not absolute values.
    let y = 1.0 + x / 1024.0;
    y.powi(1024)
}

Or use the Jacobs forward-decay trick from the research doc: factor out e^(-lambda * t_now) which is constant across all entities, and rank by S_static alone (zero exp() calls at query time).

If bitmap intersection is > 10%

RoaringBitmap AND operations on 1M-bit bitmaps can be optimized by checking cardinality first and short-circuiting. The current evaluator already does selectivity-ordered AND; verify this is working.

If sort is > 10%

The top-K selection in diversity enforcement sorts all scored candidates. Replace with a partial sort (selection algorithm):

// Instead of: candidates.sort_by(|a, b| b.score.partial_cmp(&a.score));
// Use: select the top K without fully sorting
fn top_k_partial_sort(candidates: &mut [(EntityId, f64)], k: usize) {
    if k < candidates.len() {
        candidates.select_nth_unstable_by(k, |a, b| {
            b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
        });
        candidates.truncate(k);
        candidates.sort_unstable_by(|a, b| {
            b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
        });
    }
}

5. Measurement methodology

For each identified hotspot:

  1. Before optimization: Record Criterion benchmark result from task-01 baselines.
  2. Apply optimization.
  3. After optimization: Re-run the exact same Criterion benchmark.
  4. Document: Before/after numbers, percentage improvement, and whether the hotspot dropped below 10%.

6. Output artifacts

  • docs/profiling/retrieve_1m.svg -- RETRIEVE flamegraph
  • docs/profiling/search_1m.svg -- SEARCH flamegraph
  • docs/profiling/hotspot-analysis.md -- Written analysis of top-3 hotspots with before/after numbers

Acceptance Criteria

  • Flamegraph SVGs generated for RETRIEVE and SEARCH at 1M items
  • Top-3 hotspots identified and documented with percentage of total time
  • Any hotspot > 10% optimized with before/after Criterion benchmark evidence
  • docs/profiling/hotspot-analysis.md contains: hotspot name, percentage, root cause, optimization applied, before/after latency
  • No correctness regression: existing test suites pass after optimizations
  • No performance regression in non-optimized paths (measured via existing benchmarks)

Test Strategy

  1. Profiling (manual): Generate flamegraphs and visually inspect. This is inherently a manual analysis task.
  2. Optimization correctness: For each optimization applied, run the full test suite (cargo test --manifest-path tidal/Cargo.toml --lib) to verify no behavioral change.
  3. Performance validation: For each optimization, compare Criterion before/after results. The optimization must show measurable improvement (> 5%) to justify the code change.
  4. Property test stability: If any optimized function has existing property tests (e.g., HotSignalState accuracy), verify they still pass.