jordan 213b8efcca feat: complete M6-M7 + Enterprise Readiness milestones; split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-23 22:41:16 -07:00

8.3 KiB

Task 06: Flamegraph Profiling + Hotspot Optimization

Delivers

Flamegraph profiles of RETRIEVE and SEARCH hot paths at 1M items. Top-3 hotspots documented. Any function consuming > 10% of total time is optimized with before/after benchmark evidence.

Complexity

Dependencies

task-01 complete (1M-item benchmark suite with baselines)

Technical Design

1. Profiling tooling

Use cargo flamegraph (wraps perf on Linux, dtrace on macOS) to generate SVG flamegraphs from the scale benchmark:

# Install if needed
cargo install flamegraph

# Profile RETRIEVE at 1M items
# Run the scale benchmark binary under perf/dtrace sampling
cargo flamegraph --manifest-path tidal/Cargo.toml \
    --bench scale \
    -o docs/profiling/retrieve_1m.svg \
    -- --bench "retrieve_1m/for_you"

# Profile SEARCH at 1M items
cargo flamegraph --manifest-path tidal/Cargo.toml \
    --bench scale \
    -o docs/profiling/search_1m.svg \
    -- --bench "search_1m/text_only"

On macOS, if dtrace is restricted, use Instruments.app with the Time Profiler template targeting the bench binary, or use samply:

cargo install samply
cargo bench --manifest-path tidal/Cargo.toml --bench scale --no-run
samply record target/release/deps/scale-* -- --bench "retrieve_1m/for_you"

2. Profiling harness

Write a standalone binary that exercises the hot path in a tight loop for profiling (avoids Criterion's measurement overhead interfering with the profile):

// tidal/examples/profile_retrieve.rs (not a Criterion bench -- raw binary)

use std::collections::HashMap;
use std::time::Duration;

use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};

fn main() {
    // Build 1M-item DB (same as task-01)
    let db = build_scale_db();

    let query = Retrieve::builder()
        .profile("for_you")
        .limit(20)
        .diversity(DiversityConstraints::new().max_per_creator(2))
        .build()
        .unwrap();

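    // Warm up caches and lazily initialized state
    for _ in 0..10 {
        std::hint::black_box(db.retrieve(&query).unwrap());
    }

    // Hot loop for profiler sampling; black_box keeps each result observable
    // so the optimizer cannot discard work inside retrieve()
    let iterations = 1000;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        std::hint::black_box(db.retrieve(&query).unwrap());
    }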
    let elapsed = start.elapsed();
    println!(
        "{iterations} iterations in {elapsed:?} ({:.2} us/iter)",
        elapsed.as_micros() as f64 / iterations as f64
    );
}

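Register both harnesses as Cargo examples in tidal/Cargo.toml (profile_search is the SEARCH counterpart of the harness above):

[[example]]
name = "profile_retrieve"

[[example]]
name = "profile_search"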

3. Expected hotspot categories

Based on the RETRIEVE pipeline architecture (5 stages), likely hotspots:

Stage	Expected cost	Why
1. Candidate generation	Low	Bitmap scan or DashMap iteration
2. Filter evaluation	Medium	Bitmap intersections over 1M-bit bitmaps
3. Signal scoring	High	DashMap lookups + exp() for every candidate x every signal
4. Diversity enforcement	Low-Medium	Sorting + greedy selection
5. Result assembly	Low	Top-K collection

For SEARCH:

Stage	Expected cost	Why
BM25 scoring	High	Tantivy posting list traversal across multiple segments
ANN search	High	USearch graph traversal with distance computation
RRF fusion	Low	Rank merging is O(n log n) on small candidate sets
Profile scoring	Medium	Same as RETRIEVE stage 3

4. Optimization patterns by hotspot type

If DashMap lookup is > 10%

The entries.get(&(entity_id, type_id)) call runs once per candidate per signal type, and each call hashes the key and takes a shard read lock. Optimization: group candidates by shard index before scoring, so that lookups hitting the same shard run back to back:

/// Pre-group entity IDs by DashMap shard to minimize lock contention
/// and improve cache locality during the scoring pass.
fn batch_score_by_shard(
    entries: &DashMap<(EntityId, SignalTypeId), EntitySignalEntry>,
    candidates: &[EntityId],
    type_id: SignalTypeId,
    lambda: f64,
    now_ns: u64,
) -> Vec<(EntityId, f64)> {
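    // The shard count is fixed at construction (the default scales with
    // available parallelism), so read it from the map instead of assuming 16.
    // shards() and determine_shard() require dashmap's `raw-api` feature.
    let shard_count = entries.shards().len();
    let mut sharded: Vec<Vec<EntityId>> = vec![Vec::new(); shard_count];
    for &id in candidates {
        let hash = entries.hash_usize(&(id, type_id));
        // DashMap picks the shard from shifted hash bits, not hash % shard_count
        let shard_idx = entries.determine_shard(hash);
        sharded[shard_idx].push(id);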
    }

    let mut results = Vec::with_capacity(candidates.len());
    for shard_candidates in &sharded {
        for &id in shard_candidates {
            if let Some(entry) = entries.get(&(id, type_id)) {
                let score = entry.hot.current_score(0, now_ns, lambda);
                results.push((id, score));
            }
        }
    }
    results
}

If exp() is > 10%

The (-lambda * dt).exp() call dominates when scoring hundreds of candidates. Optimization: use a fast approximation for ranking-only queries (relative ordering is sufficient):

/// Fast exponential approximation for ranking (not absolute scoring).
/// Uses the limit form: exp(x) ~ (1 + x/1024)^1024 for small |x|.
/// Relative error is about x^2 / 2048: under 0.1% for |x| < 1.4,
/// rising to roughly 5% at |x| = 10.
#[inline]
fn fast_exp_approx(x: f64) -> f64 {
    // Limit-definition approximation (1 + x/n)^n with n = 1024 = 2^10.
    // Increasing in x for x > -1024, so the ordering of individual terms is kept.
    // Only suitable when relative ordering matters, not absolute values.
    let y = 1.0 + x / 1024.0;
    y.powi(1024)
}

Or use the Jacobs forward-decay trick from the research doc: factor out e^(-lambda * t_now) which is constant across all entities, and rank by S_static alone (zero exp() calls at query time).
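The factoring works because, for events at times t_j with weights w_j, the decayed score is sum_j w_j * e^(-lambda * (t_now - t_j)) = e^(-lambda * t_now) * sum_j w_j * e^(lambda * t_j). The first factor is identical for every entity that shares the same lambda, so ordering by S_static = sum_j w_j * e^(lambda * t_j) gives the same ranking. A minimal sketch of that idea follows. ForwardDecayed, the landmark parameter, and the use of u32 IDs are hypothetical names for illustration, not tidaldb's API; times are measured from a landmark so the exponent stays small.

```rust
/// Hypothetical accumulator (not tidaldb's API). Stores
/// S_static = sum_j w_j * exp(lambda * (t_j - landmark)).
#[derive(Debug, Clone, Copy, Default)]
struct ForwardDecayed {
    s_static: f64,
}

impl ForwardDecayed {
    /// Write path: one exp() per recorded event.
    fn record(&mut self, weight: f64, t_secs: f64, landmark_secs: f64, lambda: f64) {
        self.s_static += weight * (lambda * (t_secs - landmark_secs)).exp();
    }

    /// Absolute decayed score at `now_secs`; only needed when the true value matters.
    fn score_at(&self, now_secs: f64, landmark_secs: f64, lambda: f64) -> f64 {
        self.s_static * (-lambda * (now_secs - landmark_secs)).exp()
    }
}

/// Read path: rank by S_static alone, with no exp() calls.
/// Valid only when every entity uses the same lambda and landmark.
fn rank_by_static(entities: &[(u32, ForwardDecayed)]) -> Vec<u32> {
    let mut order: Vec<(u32, f64)> = entities
        .iter()
        .map(|(id, state)| (*id, state.s_static))
        .collect();
    order.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    order.into_iter().map(|(id, _)| id).collect()
}
```

The shortcut no longer applies once scores with different lambda values are summed into one ranking score, because the factored-out term is then not common to all summands. The landmark also has to be advanced periodically in a long-running system, since e^(lambda * (t - landmark)) grows without bound.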

If bitmap intersection is > 10%

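RoaringBitmap AND operations on 1M-bit bitmaps can be sped up by checking cardinality first: intersect the smallest bitmap first and short-circuit as soon as the running result is empty. The current evaluator already performs selectivity-ordered AND; confirm from the profile that this ordering actually takes effect.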
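For reference when reading the profile, selectivity-ordered intersection with an empty-result short-circuit can be sketched as follows. This is an illustration, not the evaluator's code: BTreeSet<u32> stands in for RoaringBitmap so the example needs only the standard library, and intersect_by_selectivity is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Intersect filter result sets smallest-first, stopping as soon as the
/// running result is empty. BTreeSet<u32> stands in for RoaringBitmap.
fn intersect_by_selectivity(mut filters: Vec<BTreeSet<u32>>) -> BTreeSet<u32> {
    // Most selective (smallest) filter first: the running result can never
    // grow beyond it, so later intersections touch fewer elements.
    filters.sort_by_key(|f| f.len());

    let mut iter = filters.into_iter();
    let Some(mut acc) = iter.next() else {
        // No filters at all: this sketch treats that as "no candidates".
        return BTreeSet::new();
    };

    for filter in iter {
        if acc.is_empty() {
            break; // short-circuit: nothing left to intersect
        }
        acc.retain(|id| filter.contains(id));
    }
    acc
}
```

If the flamegraph shows intersection time dominated by the largest bitmaps, that suggests the ordering or the short-circuit is not being applied on that path.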

If sort is > 10%

The top-K selection in diversity enforcement sorts all scored candidates. Replace with a partial sort (selection algorithm):

// Instead of: candidates.sort_by(|a, b| b.1.total_cmp(&a.1));
// Use: select the top K without fully sorting, O(n + k log k)
fn top_k_partial_sort(candidates: &mut Vec<(EntityId, f64)>, k: usize) {
    if k < candidates.len() {
        // Partition so the k highest scores occupy candidates[..k], in any order
        candidates.select_nth_unstable_by(k, |a, b| b.1.total_cmp(&a.1));
        // Takes a Vec rather than a slice, because slices cannot be truncated
        candidates.truncate(k);
    }
    // Order the survivors; this also covers inputs with k or fewer entries
    candidates.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
}
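Since this change has to preserve results exactly, one way to check it is to compare the selection-based version against a plain full sort on the same input. The sketch below assumes u32 as a stand-in for EntityId and uses hypothetical function names. With tied scores the two versions may order the tied IDs differently, so the comparison is only exact for distinct scores.

```rust
/// Top-K by descending score using selection, then sorting only the K survivors.
fn top_k_select(mut scored: Vec<(u32, f64)>, k: usize) -> Vec<(u32, f64)> {
    if k < scored.len() {
        scored.select_nth_unstable_by(k, |a, b| b.1.total_cmp(&a.1));
        scored.truncate(k);
    }
    scored.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    scored
}

/// Reference implementation: sort everything, then keep the first K.
fn top_k_full_sort(mut scored: Vec<(u32, f64)>, k: usize) -> Vec<(u32, f64)> {
    scored.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

/// Deterministic input with pairwise distinct scores (10007 is prime,
/// so i * 7919 mod 10007 does not repeat for i < 10007).
fn sample_scores(n: u32) -> Vec<(u32, f64)> {
    (0..n).map(|i| (i, ((i * 7919) % 10007) as f64)).collect()
}
```

A property test over random inputs would cover more cases; the fixed input here only keeps the sketch dependency-free.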

5. Measurement methodology

For each identified hotspot:

Before optimization: Record Criterion benchmark result from task-01 baselines.
Apply optimization.
After optimization: Re-run the exact same Criterion benchmark.
Document: Before/after numbers, percentage improvement, and whether the hotspot dropped below 10%.
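The arithmetic behind the "Document" step can be sketched as below. HotspotReport and its field names are hypothetical; the 5% and 10% thresholds are the ones this task states (minimum improvement to justify a change, and the hotspot share limit).

```rust
/// Relative latency improvement in percent; positive means the "after" run is faster.
fn improvement_pct(before_ns: f64, after_ns: f64) -> f64 {
    (before_ns - after_ns) / before_ns * 100.0
}

/// One row of the hotspot analysis (hypothetical layout).
struct HotspotReport<'a> {
    name: &'a str,
    before_ns: f64,
    after_ns: f64,
    /// Share of total profile time after the optimization, in percent.
    share_after_pct: f64,
}

impl HotspotReport<'_> {
    /// The change is worth keeping only if it improves latency by more than 5%.
    fn justified(&self) -> bool {
        improvement_pct(self.before_ns, self.after_ns) > 5.0
    }

    /// The hotspot counts as resolved once its share drops below 10%.
    fn resolved(&self) -> bool {
        self.share_after_pct < 10.0
    }

    /// One-line summary suitable for hotspot-analysis.md.
    fn summary(&self) -> String {
        format!(
            "{}: {:.0} ns -> {:.0} ns ({:.1}% faster), share after: {:.1}%",
            self.name,
            self.before_ns,
            self.after_ns,
            improvement_pct(self.before_ns, self.after_ns),
            self.share_after_pct
        )
    }
}
```

Both numbers should come from the same Criterion benchmark ID on the same machine, otherwise the percentage mixes code changes with environment noise.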

6. Output artifacts

docs/profiling/retrieve_1m.svg -- RETRIEVE flamegraph
docs/profiling/search_1m.svg -- SEARCH flamegraph
docs/profiling/hotspot-analysis.md -- Written analysis of top-3 hotspots with before/after numbers

Acceptance Criteria

Flamegraph SVGs generated for RETRIEVE and SEARCH at 1M items
Top-3 hotspots identified and documented with percentage of total time
Any hotspot > 10% optimized with before/after Criterion benchmark evidence
docs/profiling/hotspot-analysis.md contains: hotspot name, percentage, root cause, optimization applied, before/after latency
No correctness regression: existing test suites pass after optimizations
No performance regression in non-optimized paths (measured via existing benchmarks)

Test Strategy

Profiling (manual): Generate flamegraphs and visually inspect. This is inherently a manual analysis task.
Optimization correctness: For each optimization applied, run the full test suite (cargo test --manifest-path tidal/Cargo.toml --lib) to verify no behavioral change.
Performance validation: For each optimization, compare Criterion before/after results. The optimization must show measurable improvement (> 5%) to justify the code change.
Property test stability: If any optimized function has existing property tests (e.g., HotSignalState accuracy), verify they still pass.
