tidaldb/docs/planning/milestone-7/phase-3/task-06-flamegraph-profiling.md

# Task 06: Flamegraph Profiling + Hotspot Optimization

## Delivers

Flamegraph profiles of RETRIEVE and SEARCH hot paths at 1M items. Top-3 hotspots documented. Any function consuming > 10% of total time is optimized with before/after benchmark evidence.

## Complexity

L

## Dependencies

- task-01 complete (1M-item benchmark suite with baselines)

## Technical Design

### 1. Profiling tooling

Use `cargo flamegraph` (wraps `perf` on Linux, `dtrace` on macOS) to generate SVG flamegraphs from the scale benchmark:

```bash
# Install if needed
cargo install flamegraph

# Profile RETRIEVE at 1M items
# Run the scale benchmark binary under perf/dtrace sampling
cargo flamegraph --manifest-path tidal/Cargo.toml \
    --bench scale \
    -o docs/profiling/retrieve_1m.svg \
    -- --bench "retrieve_1m/for_you"

# Profile SEARCH at 1M items
cargo flamegraph --manifest-path tidal/Cargo.toml \
    --bench scale \
    -o docs/profiling/search_1m.svg \
    -- --bench "search_1m/text_only"
```

On macOS, if `dtrace` is restricted, use `Instruments.app` with the Time Profiler template targeting the bench binary, or use `samply`:

```bash
cargo install samply
cargo bench --manifest-path tidal/Cargo.toml --bench scale --no-run
samply record target/release/deps/scale-* -- --bench "retrieve_1m/for_you"
```

### 2. Profiling harness

Write a standalone binary that exercises the hot path in a tight loop for profiling (avoids Criterion's measurement overhead interfering with the profile):

```rust
// tidal/benches/profile_retrieve.rs (not a Criterion bench -- raw binary)

use std::collections::HashMap;
use std::time::Duration;

use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};

fn main() {
    // Build 1M-item DB (same as task-01)
    let db = build_scale_db();

    let query = Retrieve::builder()
        .profile("for_you")
        .limit(20)
        .diversity(DiversityConstraints::new().max_per_creator(2))
        .build()
        .unwrap();

    // Warm up
    for _ in 0..10 {
        let _ = db.retrieve(&query).unwrap();
    }

    // Hot loop for profiler sampling
    let iterations = 1000;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        let _ = db.retrieve(&query).unwrap();
    }
    let elapsed = start.elapsed();
    println!(
        "{iterations} iterations in {elapsed:?} ({:.2} us/iter)",
        elapsed.as_micros() as f64 / iterations as f64
    );
}
```

Register as an example or standalone binary:

```toml
[[example]]
name = "profile_retrieve"

[[example]]
name = "profile_search"
```

### 3. Expected hotspot categories

Based on the RETRIEVE pipeline architecture (5 stages), likely hotspots:

| Stage | Expected cost | Why |
|-------|--------------|-----|
| 1. Candidate generation | Low | Bitmap scan or DashMap iteration |
| 2. Filter evaluation | Medium | Bitmap intersections over 1M-bit bitmaps |
| 3. Signal scoring | High | DashMap lookups + exp() for every candidate x every signal |
| 4. Diversity enforcement | Low-Medium | Sorting + greedy selection |
| 5. Result assembly | Low | Top-K collection |

For SEARCH:
| Stage | Expected cost | Why |
|-------|--------------|-----|
| BM25 scoring | High | Tantivy posting list traversal across multiple segments |
| ANN search | High | USearch graph traversal with distance computation |
| RRF fusion | Low | Rank merging is O(n log n) on small candidate sets |
| Profile scoring | Medium | Same as RETRIEVE stage 3 |

### 4. Optimization patterns by hotspot type

#### If DashMap lookup is > 10%

The `entries.get(&(entity_id, type_id))` call on every candidate for every signal type involves hashing and shard locking. Optimization: batch reads by pre-sharding candidates by shard index:

```rust
/// Pre-group entity IDs by DashMap shard to minimize lock contention
/// and improve cache locality during the scoring pass.
fn batch_score_by_shard(
    entries: &DashMap<(EntityId, SignalTypeId), EntitySignalEntry>,
    candidates: &[EntityId],
    type_id: SignalTypeId,
    lambda: f64,
    now_ns: u64,
) -> Vec<(EntityId, f64)> {
    // DashMap has 16 shards; group candidates by shard index
    let shard_count = entries.shards().len();
    let mut sharded: Vec<Vec<EntityId>> = vec![Vec::new(); shard_count];
    for &id in candidates {
        let hash = entries.hash_usize(&(id, type_id));
        let shard_idx = hash % shard_count;
        sharded[shard_idx].push(id);
    }

    let mut results = Vec::with_capacity(candidates.len());
    for shard_candidates in &sharded {
        for &id in shard_candidates {
            if let Some(entry) = entries.get(&(id, type_id)) {
                let score = entry.hot.current_score(0, now_ns, lambda);
                results.push((id, score));
            }
        }
    }
    results
}
```

#### If exp() is > 10%

The `(-lambda * dt).exp()` call dominates when scoring hundreds of candidates. Optimization: use a fast approximation for ranking-only queries (relative ordering is sufficient):

```rust
/// Fast exponential approximation for ranking (not absolute scoring).
/// Uses the identity: exp(x) ~ (1 + x/1024)^1024 for small |x|.
/// Accuracy: < 0.1% relative error for |x| < 10.
#[inline]
fn fast_exp_approx(x: f64) -> f64 {
    // Schraudolph's algorithm: interpret float bits as exp approximation
    // Only suitable when relative ordering matters, not absolute values.
    let y = 1.0 + x / 1024.0;
    y.powi(1024)
}
```

Or use the Jacobs forward-decay trick from the research doc: factor out `e^(-lambda * t_now)` which is constant across all entities, and rank by `S_static` alone (zero exp() calls at query time).

#### If bitmap intersection is > 10%

RoaringBitmap AND operations on 1M-bit bitmaps can be optimized by checking cardinality first and short-circuiting. The current evaluator already does selectivity-ordered AND; verify this is working.

#### If sort is > 10%

The top-K selection in diversity enforcement sorts all scored candidates. Replace with a partial sort (selection algorithm):

```rust
// Instead of: candidates.sort_by(|a, b| b.score.partial_cmp(&a.score));
// Use: select the top K without fully sorting
fn top_k_partial_sort(candidates: &mut [(EntityId, f64)], k: usize) {
    if k < candidates.len() {
        candidates.select_nth_unstable_by(k, |a, b| {
            b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
        });
        candidates.truncate(k);
        candidates.sort_unstable_by(|a, b| {
            b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
        });
    }
}
```

### 5. Measurement methodology

For each identified hotspot:

1. **Before optimization:** Record Criterion benchmark result from task-01 baselines.
2. **Apply optimization.**
3. **After optimization:** Re-run the exact same Criterion benchmark.
4. **Document:** Before/after numbers, percentage improvement, and whether the hotspot dropped below 10%.

### 6. Output artifacts

- `docs/profiling/retrieve_1m.svg` -- RETRIEVE flamegraph
- `docs/profiling/search_1m.svg` -- SEARCH flamegraph
- `docs/profiling/hotspot-analysis.md` -- Written analysis of top-3 hotspots with before/after numbers

## Acceptance Criteria

- [ ] Flamegraph SVGs generated for RETRIEVE and SEARCH at 1M items
- [ ] Top-3 hotspots identified and documented with percentage of total time
- [ ] Any hotspot > 10% optimized with before/after Criterion benchmark evidence
- [ ] `docs/profiling/hotspot-analysis.md` contains: hotspot name, percentage, root cause, optimization applied, before/after latency
- [ ] No correctness regression: existing test suites pass after optimizations
- [ ] No performance regression in non-optimized paths (measured via existing benchmarks)

## Test Strategy

1. **Profiling (manual):** Generate flamegraphs and visually inspect. This is inherently a manual analysis task.
2. **Optimization correctness:** For each optimization applied, run the full test suite (`cargo test --manifest-path tidal/Cargo.toml --lib`) to verify no behavioral change.
3. **Performance validation:** For each optimization, compare Criterion before/after results. The optimization must show measurable improvement (> 5%) to justify the code change.
4. **Property test stability:** If any optimized function has existing property tests (e.g., `HotSignalState` accuracy), verify they still pass.