235 lines
8.3 KiB
Markdown
235 lines
8.3 KiB
Markdown
# Task 06: Flamegraph Profiling + Hotspot Optimization
|
|
|
|
## Delivers
|
|
|
|
Flamegraph profiles of RETRIEVE and SEARCH hot paths at 1M items. Top-3 hotspots documented. Any function consuming > 10% of total time is optimized with before/after benchmark evidence.
|
|
|
|
## Complexity
|
|
|
|
L
|
|
|
|
## Dependencies
|
|
|
|
- task-01 complete (1M-item benchmark suite with baselines)
|
|
|
|
## Technical Design
|
|
|
|
### 1. Profiling tooling
|
|
|
|
Use `cargo flamegraph` (wraps `perf` on Linux, `dtrace` on macOS) to generate SVG flamegraphs from the scale benchmark:
|
|
|
|
```bash
|
|
# Install if needed
|
|
cargo install flamegraph
|
|
|
|
# Profile RETRIEVE at 1M items
|
|
# Run the scale benchmark binary under perf/dtrace sampling
|
|
cargo flamegraph --manifest-path tidal/Cargo.toml \
|
|
--bench scale \
|
|
-o docs/profiling/retrieve_1m.svg \
|
|
-- --bench "retrieve_1m/for_you"
|
|
|
|
# Profile SEARCH at 1M items
|
|
cargo flamegraph --manifest-path tidal/Cargo.toml \
|
|
--bench scale \
|
|
-o docs/profiling/search_1m.svg \
|
|
-- --bench "search_1m/text_only"
|
|
```
|
|
|
|
On macOS, if `dtrace` is restricted, use `Instruments.app` with the Time Profiler template targeting the bench binary, or use `samply`:
|
|
|
|
```bash
|
|
cargo install samply
|
|
cargo bench --manifest-path tidal/Cargo.toml --bench scale --no-run
|
|
samply record target/release/deps/scale-* -- --bench "retrieve_1m/for_you"
|
|
```
|
|
|
|
### 2. Profiling harness
|
|
|
|
Write a standalone binary that exercises the hot path in a tight loop for profiling (avoids Criterion's measurement overhead interfering with the profile):
|
|
|
|
```rust
|
|
// tidal/benches/profile_retrieve.rs (not a Criterion bench -- raw binary)
|
|
|
|
use std::collections::HashMap;
|
|
use std::time::Duration;
|
|
|
|
use tidaldb::TidalDb;
|
|
use tidaldb::query::retrieve::Retrieve;
|
|
use tidaldb::ranking::diversity::DiversityConstraints;
|
|
use tidaldb::schema::{
|
|
DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
|
|
};
|
|
|
|
fn main() {
|
|
// Build 1M-item DB (same as task-01)
|
|
let db = build_scale_db();
|
|
|
|
let query = Retrieve::builder()
|
|
.profile("for_you")
|
|
.limit(20)
|
|
.diversity(DiversityConstraints::new().max_per_creator(2))
|
|
.build()
|
|
.unwrap();
|
|
|
|
// Warm up
|
|
for _ in 0..10 {
|
|
let _ = db.retrieve(&query).unwrap();
|
|
}
|
|
|
|
// Hot loop for profiler sampling
|
|
let iterations = 1000;
|
|
let start = std::time::Instant::now();
|
|
for _ in 0..iterations {
|
|
let _ = db.retrieve(&query).unwrap();
|
|
}
|
|
let elapsed = start.elapsed();
|
|
println!(
|
|
"{iterations} iterations in {elapsed:?} ({:.2} us/iter)",
|
|
elapsed.as_micros() as f64 / iterations as f64
|
|
);
|
|
}
|
|
```
|
|
|
|
Register as an example or standalone binary:
|
|
|
|
```toml
|
|
[[example]]
|
|
name = "profile_retrieve"
|
|
|
|
[[example]]
|
|
name = "profile_search"
|
|
```
|
|
|
|
### 3. Expected hotspot categories
|
|
|
|
Based on the RETRIEVE pipeline architecture (5 stages), likely hotspots:
|
|
|
|
| Stage | Expected cost | Why |
|
|
|-------|--------------|-----|
|
|
| 1. Candidate generation | Low | Bitmap scan or DashMap iteration |
|
|
| 2. Filter evaluation | Medium | Bitmap intersections over 1M-bit bitmaps |
|
|
| 3. Signal scoring | High | DashMap lookups + exp() for every candidate x every signal |
|
|
| 4. Diversity enforcement | Low-Medium | Sorting + greedy selection |
|
|
| 5. Result assembly | Low | Top-K collection |
|
|
|
|
For SEARCH:
|
|
| Stage | Expected cost | Why |
|
|
|-------|--------------|-----|
|
|
| BM25 scoring | High | Tantivy posting list traversal across multiple segments |
|
|
| ANN search | High | USearch graph traversal with distance computation |
|
|
| RRF fusion | Low | Rank merging is O(n log n) on small candidate sets |
|
|
| Profile scoring | Medium | Same as RETRIEVE stage 3 |
|
|
|
|
### 4. Optimization patterns by hotspot type
|
|
|
|
#### If DashMap lookup is > 10%
|
|
|
|
The `entries.get(&(entity_id, type_id))` call on every candidate for every signal type involves hashing and shard locking. Optimization: batch reads by pre-sharding candidates by shard index:
|
|
|
|
```rust
|
|
/// Pre-group entity IDs by DashMap shard to minimize lock contention
|
|
/// and improve cache locality during the scoring pass.
|
|
fn batch_score_by_shard(
|
|
entries: &DashMap<(EntityId, SignalTypeId), EntitySignalEntry>,
|
|
candidates: &[EntityId],
|
|
type_id: SignalTypeId,
|
|
lambda: f64,
|
|
now_ns: u64,
|
|
) -> Vec<(EntityId, f64)> {
|
|
// DashMap has 16 shards; group candidates by shard index
|
|
let shard_count = entries.shards().len();
|
|
let mut sharded: Vec<Vec<EntityId>> = vec![Vec::new(); shard_count];
|
|
for &id in candidates {
|
|
let hash = entries.hash_usize(&(id, type_id));
|
|
let shard_idx = hash % shard_count;
|
|
sharded[shard_idx].push(id);
|
|
}
|
|
|
|
let mut results = Vec::with_capacity(candidates.len());
|
|
for shard_candidates in &sharded {
|
|
for &id in shard_candidates {
|
|
if let Some(entry) = entries.get(&(id, type_id)) {
|
|
let score = entry.hot.current_score(0, now_ns, lambda);
|
|
results.push((id, score));
|
|
}
|
|
}
|
|
}
|
|
results
|
|
}
|
|
```
|
|
|
|
#### If exp() is > 10%
|
|
|
|
The `(-lambda * dt).exp()` call dominates when scoring hundreds of candidates. Optimization: use a fast approximation for ranking-only queries (relative ordering is sufficient):
|
|
|
|
```rust
|
|
/// Fast exponential approximation for ranking (not absolute scoring).
|
|
/// Uses the identity: exp(x) ~ (1 + x/1024)^1024 for small |x|.
|
|
/// Accuracy: < 0.1% relative error for |x| < 10.
|
|
#[inline]
|
|
fn fast_exp_approx(x: f64) -> f64 {
|
|
// Schraudolph's algorithm: interpret float bits as exp approximation
|
|
// Only suitable when relative ordering matters, not absolute values.
|
|
let y = 1.0 + x / 1024.0;
|
|
y.powi(1024)
|
|
}
|
|
```
|
|
|
|
Or use the Jacobs forward-decay trick from the research doc: factor out `e^(-lambda * t_now)` which is constant across all entities, and rank by `S_static` alone (zero exp() calls at query time).
|
|
|
|
#### If bitmap intersection is > 10%
|
|
|
|
RoaringBitmap AND operations on 1M-bit bitmaps can be optimized by checking cardinality first and short-circuiting. The current evaluator already does selectivity-ordered AND; verify this is working.
|
|
|
|
#### If sort is > 10%
|
|
|
|
The top-K selection in diversity enforcement sorts all scored candidates. Replace with a partial sort (selection algorithm):
|
|
|
|
```rust
|
|
// Instead of: candidates.sort_by(|a, b| b.score.partial_cmp(&a.score));
|
|
// Use: select the top K without fully sorting
|
|
fn top_k_partial_sort(candidates: &mut [(EntityId, f64)], k: usize) {
|
|
if k < candidates.len() {
|
|
candidates.select_nth_unstable_by(k, |a, b| {
|
|
b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
|
|
});
|
|
candidates.truncate(k);
|
|
candidates.sort_unstable_by(|a, b| {
|
|
b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
|
|
});
|
|
}
|
|
}
|
|
```
|
|
|
|
### 5. Measurement methodology
|
|
|
|
For each identified hotspot:
|
|
|
|
1. **Before optimization:** Record Criterion benchmark result from task-01 baselines.
|
|
2. **Apply optimization.**
|
|
3. **After optimization:** Re-run the exact same Criterion benchmark.
|
|
4. **Document:** Before/after numbers, percentage improvement, and whether the hotspot dropped below 10%.
|
|
|
|
### 6. Output artifacts
|
|
|
|
- `docs/profiling/retrieve_1m.svg` -- RETRIEVE flamegraph
|
|
- `docs/profiling/search_1m.svg` -- SEARCH flamegraph
|
|
- `docs/profiling/hotspot-analysis.md` -- Written analysis of top-3 hotspots with before/after numbers
|
|
|
|
## Acceptance Criteria
|
|
|
|
- [ ] Flamegraph SVGs generated for RETRIEVE and SEARCH at 1M items
|
|
- [ ] Top-3 hotspots identified and documented with percentage of total time
|
|
- [ ] Any hotspot > 10% optimized with before/after Criterion benchmark evidence
|
|
- [ ] `docs/profiling/hotspot-analysis.md` contains: hotspot name, percentage, root cause, optimization applied, before/after latency
|
|
- [ ] No correctness regression: existing test suites pass after optimizations
|
|
- [ ] No performance regression in non-optimized paths (measured via existing benchmarks)
|
|
|
|
## Test Strategy
|
|
|
|
1. **Profiling (manual):** Generate flamegraphs and visually inspect. This is inherently a manual analysis task.
|
|
2. **Optimization correctness:** For each optimization applied, run the full test suite (`cargo test --manifest-path tidal/Cargo.toml --lib`) to verify no behavioral change.
|
|
3. **Performance validation:** For each optimization, compare Criterion before/after results. The optimization must show measurable improvement (> 5%) to justify the code change.
|
|
4. **Property test stability:** If any optimized function has existing property tests (e.g., `HotSignalState` accuracy), verify they still pass.
|