eBPF Overhead on Hot Paths Research Directive
You are Brendan Gregg, Senior Performance Architect and author of "BPF Performance Tools." Your pioneering work on systems performance analysis, flame graphs, and eBPF observability defines the field. You've instrumented production systems at Netflix handling millions of requests per second, and you understand the difference between "it should work" and "it survives production."
You are going to empirically validate the overhead claims for Maxwell's eBPF kprobes on memory syscalls — specifically, the step files claim ~500ns per probe with <1% system overhead, but these numbers need rigorous benchmarking across realistic workload profiles before we commit this design to production.
Context
Maxwell's eBPF Instrumentation
Maxwell uses eBPF kprobes attached to the munmap and madvise syscalls to track memory entropy for Landauer's Tax. Every time a monitored VM releases memory, the probe generates an entropy event and emits it through a perf event array; the daemon then processes the event and debits the VM's energy wallet.
┌─────────────────────────────────────────────────────────────────┐
│ HOT PATH │
│ │
│ Application calls munmap(addr, len) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ KPROBE INTERCEPT (trace_munmap) │ │
│ │ ├─ bpf_get_current_pid_tgid() ~20ns │ │
│ │ ├─ bpf_map_lookup_elem(monitored_pids) ~50ns │ │
│ │ ├─ PT_REGS_PARM extraction ~10ns │ │
│ │ ├─ bpf_ktime_get_ns() ~20ns │ │
│ │ └─ bpf_perf_event_output() ~200-400ns │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Syscall proceeds normally │
│ │ │
│ ▼ │
│ Userspace daemon reads perf buffer (async) │
└─────────────────────────────────────────────────────────────────┘
CLAIMED OVERHEAD: ~500ns per syscall
CLAIMED SYSTEM IMPACT: <1% at typical workloads
STATUS: UNVALIDATED
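As a sanity check on the claimed figures, the per-step estimates in the diagram can be summed and converted into a share of CPU time. This is a minimal sketch: the step costs are copied from the diagram, while the function names and the assumption that all events land on a single core are illustrative only.

```python
# Component estimates copied from the hot-path diagram (nanoseconds).
# bpf_perf_event_output is given as a range, so each entry is (low, high).
COMPONENTS_NS = {
    "bpf_get_current_pid_tgid": (20, 20),
    "bpf_map_lookup_elem": (50, 50),
    "PT_REGS_PARM extraction": (10, 10),
    "bpf_ktime_get_ns": (20, 20),
    "bpf_perf_event_output": (200, 400),
}


def probe_cost_range_ns(components=COMPONENTS_NS):
    """Sum the low and high estimates across all probe steps."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def core_share_percent(per_probe_ns, events_per_sec):
    """Percent of one core's time spent inside the probe at a given rate."""
    return per_probe_ns * events_per_sec / 1e9 * 100


if __name__ == "__main__":
    low, high = probe_cost_range_ns()
    print(f"estimated probe cost: {low}-{high} ns")
    for rate in (1_000, 10_000, 100_000, 1_000_000):
        share = core_share_percent(high, rate)
        print(f"{rate:>9} events/s -> {share:.2f}% of one core")
```

Under these estimates the claimed ~500ns is the upper end of the range (300-500 ns), and the <1% figure holds at 10K events/s (0.5% of one core) but not at 100K events/s (5%). The estimates exclude the kprobe entry cost itself, which is part of what the benchmarks need to measure.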
Why This Matters
- Landauer's Tax is on the hot path — every memory release triggers the probe
- Memory-intensive workloads exist — Redis, PostgreSQL, Python GC can generate 10K+ munmap/s
- Tail latency is critical — p99 impact matters more than median
- Alternative designs exist — tracepoints, ringbuf, sampling — need data to choose
Current Implementation (from sprint-3-1)
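SEC("kprobe/__x64_sys_munmap")
int BPF_KPROBE(trace_munmap)
{
    // Upper 32 bits hold the tgid (the userspace-visible process ID);
    // lower 32 bits hold the kernel pid (the thread ID).
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid >> 32;
    if (!should_trace(pid))
        return 0;
    // NOTE: on x86-64 kernels built with syscall wrappers, __x64_sys_munmap
    // takes a single struct pt_regs * argument, so PT_REGS_PARM2(ctx) does
    // not read the munmap length. Verify bytes_freed before trusting it;
    // libbpf's BPF_KSYSCALL macro handles the unwrapping.
    u64 len = PT_REGS_PARM2(ctx);
    struct entropy_event event = {
        .pid = pid,                    // process ID (kernel tgid)
        .tgid = pid_tgid & 0xFFFFFFFF, // thread ID (kernel pid), despite the field name
        .bytes_freed = len,
        .timestamp_ns = bpf_ktime_get_ns(),
        .event_type = ENTROPY_MUNMAP,
    };
    bpf_perf_event_output(ctx, &entropy_events, BPF_F_CURRENT_CPU,
                          &event, sizeof(event));
    return 0;
}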
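The 64-bit value returned by bpf_get_current_pid_tgid() packs two IDs, and the kernel's naming differs from userspace's. The sketch below shows the split a userspace consumer would apply; the function name and the example IDs are hypothetical.

```python
def split_pid_tgid(pid_tgid):
    """Return (tgid, tid) from the 64-bit helper return value."""
    tgid = pid_tgid >> 32        # process ID as seen from userspace
    tid = pid_tgid & 0xFFFFFFFF  # kernel "pid", i.e. the thread ID
    return tgid, tid


# Hypothetical example: process 4242, thread 4250.
value = (4242 << 32) | 4250
print(split_pid_tgid(value))
```

A monitored_pids lookup keyed on the upper half matches whole processes; keying on the lower half would match individual threads instead.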
Research Questions
Primary Questions
1. What is the actual per-probe overhead of kprobe on munmap/madvise syscalls?
   - Median latency added to syscall
   - 99th and 99.9th percentile latency
   - Variance under load
   - Comparison: empty probe vs. full entropy probe
2. Does BPF_MAP_TYPE_RINGBUF (Linux 5.8+) reduce overhead vs BPF_MAP_TYPE_PERF_EVENT_ARRAY?
   - Per-event overhead comparison
   - Batching efficiency at high event rates
   - Memory footprint differences
   - Userspace polling overhead
3. Can we use tracepoints instead of kprobes for lower overhead?
   - Compare: kprobe/__x64_sys_munmap vs tracepoint/syscalls/sys_enter_munmap
   - Stability across kernel versions
   - Available context (can we get the same data?)
   - Measured latency difference
4. How does overhead scale with event frequency?
   - Test at: 1K/s, 10K/s, 100K/s, 1M/s event rates
   - Identify knee points where overhead becomes significant
   - CPU utilization curve
   - Event loss rates
5. What is the impact on real-world workloads?
   - Redis: memory-intensive key expiration
   - PostgreSQL: buffer pool management
   - Python: garbage collection patterns
   - Node.js: V8 heap management
Methodology
Phase 1: Microbenchmarks (Synthetic Workloads)
Baseline Establishment
# Test system configuration
# - Kernel: 6.1+ with BTF enabled
# - CPU: Pin to single core for consistency
# - Frequency: Fixed (disable turbo, governor=performance)
# - No other workloads running
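# Baseline: memory throughput without any probes
# Note: sysbench memory writes to a preallocated buffer and does not call munmap per operation,
# so take the munmap syscall baseline from test_entropy_storm.c (Phase 2) with probes detached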
sysbench memory --memory-block-size=4K --memory-oper=write \
--memory-access-mode=rnd --threads=1 --time=60 run
# Record: ops/sec, latency distribution
Probe Overhead Measurement
# Test matrix:
# ┌────────────────────────┬───────────────────┬────────────────────┐
# │ Probe Type │ Map Type │ Event Rate Target │
# ├────────────────────────┼───────────────────┼────────────────────┤
# │ Empty kprobe │ N/A │ 10K, 100K/s │
# │ kprobe + hash lookup │ HASH │ 10K, 100K/s │
# │ kprobe + perf output │ PERF_EVENT_ARRAY │ 10K, 100K/s │
# │ kprobe + ringbuf │ RINGBUF │ 10K, 100K/s │
# │ tracepoint + perf │ PERF_EVENT_ARRAY │ 10K, 100K/s │
# │ tracepoint + ringbuf │ RINGBUF │ 10K, 100K/s │
# └────────────────────────┴───────────────────┴────────────────────┘
Measurement Tools
# Per-syscall latency (requires bpftrace)
bpftrace -e '
kprobe:__x64_sys_munmap { @start[tid] = nsecs; }
kretprobe:__x64_sys_munmap /@start[tid]/ {
@latency = hist(nsecs - @start[tid]);
delete(@start[tid]);
}
'
# CPU overhead
perf stat -e cycles,instructions,cache-misses \
-p <test_pid> sleep 60
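# Event throughput and loss
sysctl -w kernel.bpf_stats_enabled=1 # Required before run_cnt and run_time_ns are reported
bpftool prog show # Check run_cnt and run_time_ns
# The ftrace stats below cover the ftrace ring buffer only. Perf buffer losses reach the
# userspace reader as lost-sample records, and ringbuf drops appear as failed
# bpf_ringbuf_reserve() calls, so count both in the harness.
cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats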
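The run counters give a cheap first estimate of mean time per probe invocation. The sketch below assumes the JSON form of the listing (`bpftool -j prog show`) and that kernel.bpf_stats_enabled was set while the workload ran; the sample record and its values are made up for illustration.

```python
import json


def mean_probe_ns(bpftool_json, prog_name):
    """Mean nanoseconds per run for one program in `bpftool -j prog show` output."""
    for prog in json.loads(bpftool_json):
        if prog.get("name") != prog_name:
            continue
        run_cnt = prog.get("run_cnt", 0)
        if run_cnt == 0:
            raise ValueError("no runs recorded; is kernel.bpf_stats_enabled set?")
        return prog["run_time_ns"] / run_cnt
    raise KeyError(prog_name)


# Hypothetical, trimmed record; real output carries many more fields.
sample = json.dumps([
    {"id": 7, "type": "kprobe", "name": "trace_munmap",
     "run_time_ns": 4_500_000, "run_cnt": 10_000},
])
print(mean_probe_ns(sample, "trace_munmap"))
```

This is a mean of program run time only: it hides tail latency and excludes the cost of entering the kprobe, so it complements the per-syscall histogram rather than replacing it.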
Phase 2: Stress Test Workloads
High-Frequency Allocation/Deallocation
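// test_entropy_storm.c
// Generates a controlled rate of munmap syscalls
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#define ALLOC_SIZE (4 * 1024) // 4KB pages
static volatile sig_atomic_t running = 1; // Clear from a signal handler to stop
void generate_entropy_events(int target_rate_per_sec) {
    // Pace against absolute deadlines so sleep latency does not accumulate
    const long interval_ns = 1000000000L / target_rate_per_sec;
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    while (running) {
        void *ptr = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ptr == MAP_FAILED)
            break;
        memset(ptr, 0xAB, ALLOC_SIZE); // Fault the page in
        munmap(ptr, ALLOC_SIZE); // This triggers probe
        next.tv_nsec += interval_ns;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec += 1;
            next.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}
// Run at: 1K/s, 10K/s, 50K/s, 100K/s
// Measure: actual achieved rate, CPU%, latency percentiles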
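Latency percentiles for these runs can be computed from raw per-syscall samples with the nearest-rank method. The sketch below is one way to do it with the standard library; the function names and the synthetic sample set are illustrative assumptions, not part of the harness.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Fraction(str(pct)) keeps 99.9 exact, avoiding float rounding in the rank.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ns):
    """Return the percentiles the report tables ask for."""
    return {f"p{p}": percentile(latencies_ns, p) for p in (50, 99, 99.9)}


# Synthetic example: 9,980 fast calls and 20 slow outliers.
samples = [480] * 9_980 + [9_000] * 20
print(summarize(samples))
```

In the synthetic set the median and p99 both sit at the fast value while p99.9 lands on the outliers, which is why the report asks for all three rather than a mean.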
madvise Pattern Testing
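// Test MADV_DONTNEED and MADV_FREE patterns
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#define ITERATIONS 100000 // Illustrative; size for a stable sample
void test_madvise_overhead(size_t region_size, int advice) {
    void *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return;
    // Touch all pages
    memset(region, 0xCD, region_size);
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    // After the first call the pages are already released (or marked free),
    // so later iterations time the syscall path and probe, not page reclaim
    for (int i = 0; i < ITERATIONS; i++) {
        madvise(region, region_size, advice);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    // Report: ops/sec, ns/op
    printf("%.0f ops/sec, %.1f ns/op\n", ITERATIONS / (ns / 1e9), ns / ITERATIONS);
    munmap(region, region_size);
}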
Phase 3: Real Application Benchmarks
Redis Memory Stress
# Redis configured with aggressive eviction
redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
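# Generate workload with high eviction rate
# -r spreads writes over a random keyspace; without it redis-benchmark runs against a single key and never reaches maxmemory
redis-benchmark -t set,get -r 10000000 -n 10000000 -d 1024 -c 50 -P 16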
# Metrics to capture:
# - ops/sec (baseline vs with probes)
# - latency percentiles (p50, p99, p99.9)
# - Redis memory fragmentation ratio
PostgreSQL Buffer Management
# postgresql.conf: PostgreSQL with limited shared_buffers
shared_buffers = 256MB
work_mem = 4MB
# Run pgbench with a dataset larger than shared_buffers
pgbench -i -s 100 testdb # ~1.5GB dataset
pgbench -c 20 -j 4 -T 300 testdb
# Metrics:
# - TPS (baseline vs with probes)
# - Buffer eviction rate
# - Query latency distribution
Python GC Patterns
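# test_python_gc.py
import gc
import time

def generate_garbage():
    """Create objects that will trigger GC and munmap."""
    garbage = []
    for _ in range(10000):
        garbage.append([0] * 1000)  # List of ints (~8 KB each)
    del garbage  # Reference counting frees the lists here
    gc.collect()

# Run with probes attached
# Measure: GC pause times, munmap frequency, overall throughput
# Note: whether this reaches munmap depends on the allocator. glibc serves ~8 KB
# blocks from the heap (below its default 128 KiB mmap threshold), so confirm the
# syscall mix with strace -c before relying on this workload.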
Phase 4: Comparison Testing
Ring Buffer vs Perf Event Array
// ringbuf_probe.bpf.c
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256KB
} events SEC(".maps");

SEC("kprobe/__x64_sys_munmap")
int trace_munmap_ringbuf(struct pt_regs *ctx) {
    struct entropy_event *event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event)
        return 0;
    // Fill event...
    bpf_ringbuf_submit(event, 0);
    return 0;
}

// perf_array_probe.bpf.c
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");

SEC("kprobe/__x64_sys_munmap")
int trace_munmap_perf(struct pt_regs *ctx) {
    struct entropy_event event = {};
    // Fill event...
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}
Comparison metrics:
- Per-event kernel-side overhead (time in probe)
- Event loss rate under pressure
- Userspace poll() latency and CPU usage
- Memory efficiency (buffer sizing)
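For the buffer sizing metric, a rough capacity estimate helps pick sensible ring buffer sizes before measuring. The sketch below assumes each ringbuf record carries an 8-byte header and is padded to 8-byte alignment, and it uses a hypothetical 32-byte entropy_event, since the struct's real size is not given here.

```python
RINGBUF_HEADER_BYTES = 8  # per-record header prepended by the kernel


def ringbuf_record_bytes(payload_bytes):
    """Bytes one record occupies: header plus payload rounded up to 8."""
    aligned = (payload_bytes + 7) // 8 * 8
    return RINGBUF_HEADER_BYTES + aligned


def ringbuf_capacity(buffer_bytes, payload_bytes):
    """Approximate number of records that fit before reserve starts failing."""
    return buffer_bytes // ringbuf_record_bytes(payload_bytes)


def seconds_until_full(buffer_bytes, payload_bytes, events_per_sec):
    """How long a stalled consumer can be tolerated at a given event rate."""
    return ringbuf_capacity(buffer_bytes, payload_bytes) / events_per_sec


if __name__ == "__main__":
    size = 256 * 1024  # matches max_entries in ringbuf_probe.bpf.c
    event = 32         # hypothetical sizeof(struct entropy_event)
    for rate in (10_000, 100_000, 1_000_000):
        ms = seconds_until_full(size, event, rate) * 1000
        print(f"{rate:>9} events/s -> buffer absorbs about {ms:.1f} ms of stall")
```

Under these assumptions a 256KB buffer holds roughly 6,500 events, about 65 ms of headroom at 100K events/s, so the daemon's polling interval is likely to matter as much as the map type.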
Kprobe vs Tracepoint
// tracepoint_probe.bpf.c
SEC("tracepoint/syscalls/sys_enter_munmap")
int trace_munmap_tp(struct trace_event_raw_sys_enter *ctx) {
    // ctx->args[1] is length parameter
    u64 len = ctx->args[1];
    struct entropy_event event = {
        .pid = bpf_get_current_pid_tgid() >> 32,
        .bytes_freed = len,
        // ...
    };
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}
Comparison metrics:
- Latency overhead per event
- Stability across kernel versions (5.10, 5.15, 6.1, 6.6)
- Available context (arguments, return values)
- Verifier complexity
Deliverables
Primary Output: Benchmark Report (10-15 pages)
1. Executive Summary (1 page)
- Validated overhead numbers
- Recommended configuration for Maxwell
- Go/no-go for production deployment
2. Methodology (2 pages)
- Test environment specification
- Benchmark design rationale
- Statistical validity discussion
3. Microbenchmark Results (4 pages)
- Per-probe latency breakdown
- Map type comparison (ringbuf vs perf array)
- Probe type comparison (kprobe vs tracepoint)
- Scaling curves (overhead vs event rate)
- Tables with p50, p99, p99.9 latencies
4. Application Benchmark Results (3 pages)
- Redis impact analysis
- PostgreSQL impact analysis
- Python GC impact analysis
- CPU overhead measurements
5. Recommendations (2 pages)
- Optimal configuration for Maxwell
- Event rate thresholds
- Fallback strategies for high-load scenarios
- Kernel version requirements
6. Appendix
- Raw data tables
- Test scripts (reproducible)
- System configuration details
Secondary Outputs
1. Decision Matrix

| Configuration | Median Latency | p99 Latency | Event Loss @ 100K/s | CPU Overhead | Recommendation |
|---|---|---|---|---|---|
| kprobe + perf_event_array | TBD | TBD | TBD | TBD | Current impl |
| kprobe + ringbuf | TBD | TBD | TBD | TBD | If loss unacceptable |
| tracepoint + ringbuf | TBD | TBD | TBD | TBD | If stability needed |

2. Overhead Budget Validation

   Maxwell's target: <1% scheduler overhead at 10,000 ticks/second
   Available time per tick: 100,000 ns (100us)
   Budget for eBPF: 1,000 ns (1%)
   Measured actual: [TBD] ns
   Result: [PASS/FAIL with margin]

3. Test Harness Code
   - Reproducible benchmark suite
   - Automated data collection scripts
   - Visualization notebooks (latency histograms, scaling curves)
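The budget arithmetic can be scripted so the PASS/FAIL line is produced the same way on every run. This is a minimal sketch: the function name and return shape are assumptions, and the 500 ns input is the unvalidated claim, not a measurement.

```python
def budget_check(measured_ns, ticks_per_sec=10_000, budget_percent=1):
    """Compare a measured per-probe cost against the per-tick eBPF budget."""
    tick_ns = 1_000_000_000 // ticks_per_sec     # time available per tick
    budget_ns = tick_ns * budget_percent // 100  # share reserved for eBPF
    margin_ns = budget_ns - measured_ns
    return {
        "tick_ns": tick_ns,
        "budget_ns": budget_ns,
        "margin_ns": margin_ns,
        "result": "PASS" if margin_ns >= 0 else "FAIL",
    }


# Using the claimed (not yet measured) 500 ns figure as a placeholder.
print(budget_check(500))
```

Feeding in the measured p99 rather than the median gives the stricter reading of the budget, in line with the emphasis on tail latency.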
Success Criteria
Minimum Viable Validation
- Measured per-probe overhead with statistical confidence (n >= 10000 samples)
- Tested at 1K, 10K, 100K events/second
- Compared at least 2 map types (perf_event_array, ringbuf)
- Compared kprobe vs tracepoint
- Tested on at least 2 kernel versions (5.15 LTS, 6.1+)
- Measured real application impact (Redis or PostgreSQL)
Full Validation
- All minimum criteria met
- Tested all 3 real applications (Redis, PostgreSQL, Python)
- Characterized event loss behavior under overload
- Identified scaling knee points with confidence intervals
- Provided actionable configuration recommendations
- Reproducible test suite committed to repository
Go/No-Go Criteria
GREEN: Proceed with current design
- Per-probe overhead < 1000ns (p99)
- System overhead < 1% at 10K events/s
- Event loss < 0.01% at 10K events/s
YELLOW: Proceed with modifications
- Per-probe overhead 1000-2000ns (p99)
- System overhead 1-3% at 10K events/s
- Recommend ringbuf or tracepoint
RED: Redesign required
- Per-probe overhead > 2000ns (p99)
- System overhead > 3% at 10K events/s
- Need sampling or batch approaches
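The thresholds above can be encoded so every benchmark run is classified the same way. The sketch below makes two interpretive assumptions not stated in the criteria: any single RED condition forces RED, and a run that misses GREEN only on event loss is treated as YELLOW.

```python
def classify(p99_ns, system_overhead_pct, event_loss_pct):
    """Map measured results at 10K events/s onto GREEN / YELLOW / RED."""
    # Assumption: one RED threshold is enough to require a redesign.
    if p99_ns > 2000 or system_overhead_pct > 3:
        return "RED"
    # GREEN requires all three criteria to hold at once.
    if p99_ns < 1000 and system_overhead_pct < 1 and event_loss_pct < 0.01:
        return "GREEN"
    # Assumption: everything in between, including excess loss alone, is YELLOW.
    return "YELLOW"


# Hypothetical results for illustration only.
print(classify(p99_ns=800, system_overhead_pct=0.5, event_loss_pct=0.001))
print(classify(p99_ns=1500, system_overhead_pct=2.0, event_loss_pct=0.0))
print(classify(p99_ns=2500, system_overhead_pct=0.5, event_loss_pct=0.0))
```

Values that fall exactly on a boundary (for example a p99 of exactly 1000 ns) land in YELLOW under this reading.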
References
Essential Reading
1. "BPF Performance Tools" by Brendan Gregg (2019)
   - Chapter 4: BPF Tracing Tools
   - Chapter 6: CPUs (kprobe overhead discussion)
2. Linux Kernel Documentation
3. Performance Measurement Papers
   - "Measuring the Overhead of BPF" (various LPC talks)
   - "Low-Overhead Performance Monitoring" (EuroSys papers)
Tools
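# Essential tools for benchmarking
# (Ubuntu package names; perf and bpftool ship in linux-tools-$(uname -r))
apt install linux-tools-common linux-tools-$(uname -r) bpftrace sysbench
# BPF-specific
bpftool version # Confirm the packaged bpftool is on PATH
pip install py-spy # For Python profiling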
Kernel Requirements
# Check BTF support
ls /sys/kernel/btf/vmlinux
# Check ringbuf support (5.8+)
uname -r # Should be >= 5.8
# Verify kernel config
grep -E "CONFIG_BPF|CONFIG_DEBUG_INFO_BTF" /boot/config-$(uname -r)
Notes
Scope Boundaries
- Focus on overhead measurement, not functionality testing
- Assume probes are correctly implemented (verified by sprint-3-1)
- Don't optimize probe code — measure current implementation first
- Production kernel versions only (5.10 LTS, 5.15 LTS, 6.1+, 6.6+)
Potential Pitfalls
1. Measurement perturbation: Measuring probes with probes adds overhead
   - Use hardware counters (RDTSC, perf) where possible
   - Account for measurement overhead in analysis
2. System noise: Background processes affect measurements
   - Use dedicated test machine or container
   - Multiple runs with statistical analysis
   - Report confidence intervals
3. Kernel version variance: Different kernels have different BPF JIT quality
   - Test on multiple kernel versions
   - Note significant differences
4. Workload representation: Synthetic tests may not reflect production
   - Include real application benchmarks
   - Document workload characteristics
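One way to report confidence intervals for a tail percentile is a percentile bootstrap over the raw latency samples. The sketch below is illustrative only: the function names, the resample count, and the synthetic data are assumptions, and other interval methods may suit the final report better.

```python
import random


def p99(samples):
    """Nearest-rank 99th percentile using integer arithmetic."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[max(rank, 1) - 1]


def bootstrap_ci(samples, stat, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for stat(samples)."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    n = len(samples)
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    low = estimates[int(alpha * n_resamples)]
    high = estimates[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return low, high


# Synthetic example: 1% of calls are slow outliers.
samples = [500] * 990 + [5_000] * 10
print("p99:", p99(samples), "95% CI:", bootstrap_ci(samples, p99))
```

When the outlier fraction sits right at the percentile being estimated, as in this synthetic set, the interval becomes very wide, which is a useful signal that more samples are needed before quoting a p99.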
Research Philosophy
Gregg's Principles Applied:
- Measure, don't guess — The step files claim ~500ns, but have we actually measured it?
- Percentiles over averages — p99 matters more than mean for latency-sensitive paths
- Test at scale — 1K/s is easy; 100K/s exposes real issues
- Reproduce and verify — All benchmarks must be reproducible
Honest Assessment Required:
If the numbers don't support the current design, say so. Maxwell's success depends on accurate overhead characterization. Better to discover problems now than in production with real agent workloads.
Document Status: Research Directive
Topic: eBPF Overhead Validation
Last Updated: 2026-02