8.3 KiB
Task 06: Flamegraph Profiling + Hotspot Optimization
Delivers
Flamegraph profiles of RETRIEVE and SEARCH hot paths at 1M items. Top-3 hotspots documented. Any function consuming > 10% of total time is optimized with before/after benchmark evidence.
Complexity
L
Dependencies
- task-01 complete (1M-item benchmark suite with baselines)
Technical Design
1. Profiling tooling
Use cargo flamegraph (wraps perf on Linux, dtrace on macOS) to generate SVG flamegraphs from the scale benchmark:
# Install if needed
cargo install flamegraph
# Profile RETRIEVE at 1M items
# Run the scale benchmark binary under perf/dtrace sampling
cargo flamegraph --manifest-path tidal/Cargo.toml \
--bench scale \
-o docs/profiling/retrieve_1m.svg \
-- --bench "retrieve_1m/for_you"
# Profile SEARCH at 1M items
cargo flamegraph --manifest-path tidal/Cargo.toml \
--bench scale \
-o docs/profiling/search_1m.svg \
-- --bench "search_1m/text_only"
On macOS, if dtrace is restricted, use Instruments.app with the Time Profiler template targeting the bench binary, or use samply:
cargo install samply
cargo bench --manifest-path tidal/Cargo.toml --bench scale --no-run
samply record target/release/deps/scale-* -- --bench "retrieve_1m/for_you"
2. Profiling harness
Write a standalone binary that exercises the hot path in a tight loop for profiling (avoids Criterion's measurement overhead interfering with the profile):
// tidal/benches/profile_retrieve.rs (not a Criterion bench -- raw binary)
use std::collections::HashMap;
use std::time::Duration;
use tidaldb::TidalDb;
use tidaldb::query::retrieve::Retrieve;
use tidaldb::ranking::diversity::DiversityConstraints;
use tidaldb::schema::{
DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};
fn main() {
// Build 1M-item DB (same as task-01)
let db = build_scale_db();
let query = Retrieve::builder()
.profile("for_you")
.limit(20)
.diversity(DiversityConstraints::new().max_per_creator(2))
.build()
.unwrap();
// Warm up
for _ in 0..10 {
let _ = db.retrieve(&query).unwrap();
}
// Hot loop for profiler sampling
let iterations = 1000;
let start = std::time::Instant::now();
for _ in 0..iterations {
let _ = db.retrieve(&query).unwrap();
}
let elapsed = start.elapsed();
println!(
"{iterations} iterations in {elapsed:?} ({:.2} us/iter)",
elapsed.as_micros() as f64 / iterations as f64
);
}
Register as an example or standalone binary:
[[example]]
name = "profile_retrieve"
[[example]]
name = "profile_search"
3. Expected hotspot categories
Based on the RETRIEVE pipeline architecture (5 stages), likely hotspots:
| Stage | Expected cost | Why |
|---|---|---|
| 1. Candidate generation | Low | Bitmap scan or DashMap iteration |
| 2. Filter evaluation | Medium | Bitmap intersections over 1M-bit bitmaps |
| 3. Signal scoring | High | DashMap lookups + exp() for every candidate x every signal |
| 4. Diversity enforcement | Low-Medium | Sorting + greedy selection |
| 5. Result assembly | Low | Top-K collection |
For SEARCH:
| Stage | Expected cost | Why |
|---|---|---|
| BM25 scoring | High | Tantivy posting list traversal across multiple segments |
| ANN search | High | USearch graph traversal with distance computation |
| RRF fusion | Low | Rank merging is O(n log n) on small candidate sets |
| Profile scoring | Medium | Same as RETRIEVE stage 3 |
4. Optimization patterns by hotspot type
If DashMap lookup is > 10%
The entries.get(&(entity_id, type_id)) call on every candidate for every signal type involves hashing and shard locking. Optimization: batch reads by pre-sharding candidates by shard index:
/// Pre-group entity IDs by DashMap shard to minimize lock contention
/// and improve cache locality during the scoring pass.
fn batch_score_by_shard(
entries: &DashMap<(EntityId, SignalTypeId), EntitySignalEntry>,
candidates: &[EntityId],
type_id: SignalTypeId,
lambda: f64,
now_ns: u64,
) -> Vec<(EntityId, f64)> {
// DashMap has 16 shards; group candidates by shard index
let shard_count = entries.shards().len();
let mut sharded: Vec<Vec<EntityId>> = vec![Vec::new(); shard_count];
for &id in candidates {
let hash = entries.hash_usize(&(id, type_id));
let shard_idx = hash % shard_count;
sharded[shard_idx].push(id);
}
let mut results = Vec::with_capacity(candidates.len());
for shard_candidates in &sharded {
for &id in shard_candidates {
if let Some(entry) = entries.get(&(id, type_id)) {
let score = entry.hot.current_score(0, now_ns, lambda);
results.push((id, score));
}
}
}
results
}
If exp() is > 10%
The (-lambda * dt).exp() call dominates when scoring hundreds of candidates. Optimization: use a fast approximation for ranking-only queries (relative ordering is sufficient):
/// Fast exponential approximation for ranking (not absolute scoring).
/// Uses the identity: exp(x) ~ (1 + x/1024)^1024 for small |x|.
/// Accuracy: < 0.1% relative error for |x| < 10.
#[inline]
fn fast_exp_approx(x: f64) -> f64 {
// Schraudolph's algorithm: interpret float bits as exp approximation
// Only suitable when relative ordering matters, not absolute values.
let y = 1.0 + x / 1024.0;
y.powi(1024)
}
Or use the Jacobs forward-decay trick from the research doc: factor out e^(-lambda * t_now) which is constant across all entities, and rank by S_static alone (zero exp() calls at query time).
If bitmap intersection is > 10%
RoaringBitmap AND operations on 1M-bit bitmaps can be optimized by checking cardinality first and short-circuiting. The current evaluator already does selectivity-ordered AND; verify this is working.
If sort is > 10%
The top-K selection in diversity enforcement sorts all scored candidates. Replace with a partial sort (selection algorithm):
// Instead of: candidates.sort_by(|a, b| b.score.partial_cmp(&a.score));
// Use: select the top K without fully sorting
fn top_k_partial_sort(candidates: &mut [(EntityId, f64)], k: usize) {
if k < candidates.len() {
candidates.select_nth_unstable_by(k, |a, b| {
b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
});
candidates.truncate(k);
candidates.sort_unstable_by(|a, b| {
b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
});
}
}
5. Measurement methodology
For each identified hotspot:
- Before optimization: Record Criterion benchmark result from task-01 baselines.
- Apply optimization.
- After optimization: Re-run the exact same Criterion benchmark.
- Document: Before/after numbers, percentage improvement, and whether the hotspot dropped below 10%.
6. Output artifacts
docs/profiling/retrieve_1m.svg-- RETRIEVE flamegraphdocs/profiling/search_1m.svg-- SEARCH flamegraphdocs/profiling/hotspot-analysis.md-- Written analysis of top-3 hotspots with before/after numbers
Acceptance Criteria
- Flamegraph SVGs generated for RETRIEVE and SEARCH at 1M items
- Top-3 hotspots identified and documented with percentage of total time
- Any hotspot > 10% optimized with before/after Criterion benchmark evidence
docs/profiling/hotspot-analysis.mdcontains: hotspot name, percentage, root cause, optimization applied, before/after latency- No correctness regression: existing test suites pass after optimizations
- No performance regression in non-optimized paths (measured via existing benchmarks)
Test Strategy
- Profiling (manual): Generate flamegraphs and visually inspect. This is inherently a manual analysis task.
- Optimization correctness: For each optimization applied, run the full test suite (
cargo test --manifest-path tidal/Cargo.toml --lib) to verify no behavioral change. - Performance validation: For each optimization, compare Criterion before/after results. The optimization must show measurable improvement (> 5%) to justify the code change.
- Property test stability: If any optimized function has existing property tests (e.g.,
HotSignalStateaccuracy), verify they still pass.