jordan 9a9e58c935 Initial commit: research notes journal

Moved from maxwell/blog to standalone repository.

- Next.js research journal application
- Notes 001-005 with YAML/MD content structure
- Claude Code configuration for blog development

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-07 13:12:07 -07:00


eBPF Overhead on Hot Paths Research Directive

You are Brendan Gregg, Senior Performance Architect and author of "BPF Performance Tools." Your pioneering work on systems performance analysis, flame graphs, and eBPF observability defines the field. You've instrumented production systems at Netflix handling millions of requests per second, and you understand the difference between "it should work" and "it survives production."

You are going to empirically validate the overhead claims for Maxwell's eBPF kprobes on memory syscalls — specifically, the step files claim ~500ns per probe with <1% system overhead, but these numbers need rigorous benchmarking across realistic workload profiles before we commit this design to production.

Context

Maxwell's eBPF Instrumentation

Maxwell uses eBPF kprobes attached to the munmap and madvise syscalls to track memory entropy for Landauer's Tax. Every time a monitored VM releases memory, the probe generates an entropy event, pushes it to userspace through a perf event array, and the daemon processes it to debit the VM's energy wallet.

┌─────────────────────────────────────────────────────────────────┐
│                         HOT PATH                                 │
│                                                                  │
│  Application calls munmap(addr, len)                             │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  KPROBE INTERCEPT (trace_munmap)                            │ │
│  │  ├─ bpf_get_current_pid_tgid()          ~20ns               │ │
│  │  ├─ bpf_map_lookup_elem(monitored_pids) ~50ns               │ │
│  │  ├─ PT_REGS_PARM extraction             ~10ns               │ │
│  │  ├─ bpf_ktime_get_ns()                  ~20ns               │ │
│  │  └─ bpf_perf_event_output()             ~200-400ns          │ │
│  └─────────────────────────────────────────────────────────────┘ │
│         │                                                        │
│         ▼                                                        │
│  Syscall proceeds normally                                       │
│         │                                                        │
│         ▼                                                        │
│  Userspace daemon reads perf buffer (async)                      │
└─────────────────────────────────────────────────────────────────┘

CLAIMED OVERHEAD: ~500ns per syscall
CLAIMED SYSTEM IMPACT: <1% at typical workloads
STATUS: UNVALIDATED
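
As a sanity check on the claimed figures, the per-helper estimates in the diagram can be summed and converted into a share of one CPU core at a given event rate. The sketch below is only arithmetic on the unvalidated estimates above; the names `COMPONENT_NS`, `probe_cost_range_ns`, and `core_fraction` are illustrative, and the kprobe entry cost itself, which the diagram does not list, is not included.

```python
# Per-helper cost estimates copied from the hot-path diagram (unvalidated), in ns.
# Each entry is (low, high); only bpf_perf_event_output() is given as a range.
COMPONENT_NS = {
    "bpf_get_current_pid_tgid": (20, 20),
    "bpf_map_lookup_elem": (50, 50),
    "PT_REGS_PARM extraction": (10, 10),
    "bpf_ktime_get_ns": (20, 20),
    "bpf_perf_event_output": (200, 400),
}


def probe_cost_range_ns(components=COMPONENT_NS):
    """Sum the low and high estimates across all helpers."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def core_fraction(per_probe_ns, events_per_sec):
    """Fraction of one CPU core spent inside the probe."""
    return per_probe_ns * events_per_sec / 1e9


low, high = probe_cost_range_ns()
print(f"estimated probe cost: {low}-{high} ns")
for rate in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{rate:>9,} events/s -> {core_fraction(high, rate):.2%} of one core")
```

Under these estimates a 500 ns probe stays below 1% of one core up to roughly 20,000 events/s, so the <1% claim depends heavily on what counts as a typical workload.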

Why This Matters

Landauer's Tax is on the hot path — every memory release triggers the probe
Memory-intensive workloads exist — Redis, PostgreSQL, Python GC can generate 10K+ munmap/s
Tail latency is critical — p99 impact matters more than median
Alternative designs exist — tracepoints, ringbuf, sampling — need data to choose

Current Implementation (from sprint-3-1)

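SEC("kprobe/__x64_sys_munmap")
int BPF_KPROBE(trace_munmap)
{
    // Upper 32 bits hold the TGID (the userspace process ID);
    // lower 32 bits hold the kernel PID (the thread ID).
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid >> 32;

    if (!should_trace(pid))
        return 0;

    // Caveat: on x86-64 kernels built with syscall wrappers,
    // __x64_sys_munmap() receives a single struct pt_regs * that holds the
    // user arguments, so PT_REGS_PARM2(ctx) may not be the munmap length.
    // Confirm the value before trusting bytes_freed in benchmark output.
    u64 len = PT_REGS_PARM2(ctx);

    struct entropy_event event = {
        .pid = pid,                     // Process ID (TGID)
        .tgid = pid_tgid & 0xFFFFFFFF,  // Thread ID, despite the field name
        .bytes_freed = len,
        .timestamp_ns = bpf_ktime_get_ns(),
        .event_type = ENTROPY_MUNMAP,
    };

    bpf_perf_event_output(ctx, &entropy_events, BPF_F_CURRENT_CPU,
                          &event, sizeof(event));
    return 0;
}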

Research Questions

Primary Questions

1. What is the actual per-probe overhead of kprobe on munmap/madvise syscalls?
   - Median latency added to syscall
   - 99th and 99.9th percentile latency
   - Variance under load
   - Comparison: empty probe vs. full entropy probe

2. Does BPF_MAP_TYPE_RINGBUF (Linux 5.8+) reduce overhead vs BPF_MAP_TYPE_PERF_EVENT_ARRAY?
   - Per-event overhead comparison
   - Batching efficiency at high event rates
   - Memory footprint differences
   - Userspace polling overhead

3. Can we use tracepoints instead of kprobes for lower overhead?
   - Compare: kprobe/__x64_sys_munmap vs tracepoint/syscalls/sys_enter_munmap
   - Stability across kernel versions
   - Available context (can we get the same data?)
   - Measured latency difference

4. How does overhead scale with event frequency?
   - Test at: 1K/s, 10K/s, 100K/s, 1M/s event rates
   - Identify knee points where overhead becomes significant
   - CPU utilization curve
   - Event loss rates

5. What is the impact on real-world workloads?
   - Redis: memory-intensive key expiration
   - PostgreSQL: buffer pool management
   - Python: garbage collection patterns
   - Node.js: V8 heap management
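
One way to make "knee point" concrete for the scaling question is to look at the per-event cost (overhead divided by event rate) and flag the first rate at which it rises well above the cost seen at the lowest rate. The sketch below is one possible definition, not an established one; `find_knee` and the `factor=1.5` threshold are assumptions chosen for illustration.

```python
def find_knee(points, factor=1.5):
    """Return the first event rate whose per-event cost exceeds `factor` times
    the per-event cost at the lowest measured rate, or None if none does.

    points: iterable of (events_per_sec, cpu_overhead_fraction) pairs.
    """
    ordered = sorted(points)
    if not ordered:
        return None
    base_rate, base_overhead = ordered[0]
    base_cost = base_overhead / base_rate
    for rate, overhead in ordered[1:]:
        if overhead / rate > factor * base_cost:
            return rate
    return None


# Made-up numbers purely to show the call shape; these are not measurements.
placeholder = [(1_000, 0.0005), (10_000, 0.005), (100_000, 0.08), (1_000_000, 0.9)]
print(find_knee(placeholder))
```

If overhead scales linearly with rate, the per-event cost is flat and the function returns None; a knee only appears when something other than the probe body (buffer contention, wakeups, event loss handling) starts to dominate.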

Methodology

Phase 1: Microbenchmarks (Synthetic Workloads)

Baseline Establishment

# Test system configuration
# - Kernel: 6.1+ with BTF enabled
# - CPU: Pin to single core for consistency
# - Frequency: Fixed (disable turbo, governor=performance)
# - No other workloads running

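# Baseline: memory throughput without any probes attached
# (sysbench memory mainly stresses memory bandwidth rather than munmap rate;
# use the Phase 2 generator for a syscall-rate baseline)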
sysbench memory --memory-block-size=4K --memory-oper=write \
    --memory-access-mode=rnd --threads=1 --time=60 run

# Record: ops/sec, latency distribution

Probe Overhead Measurement

# Test matrix:
# ┌────────────────────────┬───────────────────┬────────────────────┐
# │ Probe Type             │ Map Type          │ Event Rate Target  │
# ├────────────────────────┼───────────────────┼────────────────────┤
# │ Empty kprobe           │ N/A               │ 10K, 100K/s        │
# │ kprobe + hash lookup   │ HASH              │ 10K, 100K/s        │
# │ kprobe + perf output   │ PERF_EVENT_ARRAY  │ 10K, 100K/s        │
# │ kprobe + ringbuf       │ RINGBUF           │ 10K, 100K/s        │
# │ tracepoint + perf      │ PERF_EVENT_ARRAY  │ 10K, 100K/s        │
# │ tracepoint + ringbuf   │ RINGBUF           │ 10K, 100K/s        │
# └────────────────────────┴───────────────────┴────────────────────┘
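
The matrix above can be expanded into an explicit run list so that every configuration is exercised at every target rate the same number of times. This is a minimal sketch; `build_runs` and the default of three repeats per cell are assumptions, not part of the directive.

```python
from itertools import product

# (probe type, map type) rows copied from the test matrix.
PROBE_CONFIGS = [
    ("empty kprobe", None),
    ("kprobe + hash lookup", "HASH"),
    ("kprobe + perf output", "PERF_EVENT_ARRAY"),
    ("kprobe + ringbuf", "RINGBUF"),
    ("tracepoint + perf", "PERF_EVENT_ARRAY"),
    ("tracepoint + ringbuf", "RINGBUF"),
]
TARGET_RATES = [10_000, 100_000]  # events per second


def build_runs(configs=PROBE_CONFIGS, rates=TARGET_RATES, repeats=3):
    """Return one dict per benchmark run: every config at every rate, repeated."""
    return [
        {"probe": probe, "map_type": map_type, "rate": rate, "repeat": rep}
        for (probe, map_type), rate, rep in product(configs, rates, range(1, repeats + 1))
    ]


runs = build_runs()
print(len(runs), "runs; first:", runs[0])
```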

Measurement Tools

# Per-syscall latency (requires bpftrace)
bpftrace -e '
    kprobe:__x64_sys_munmap { @start[tid] = nsecs; }
    kretprobe:__x64_sys_munmap /@start[tid]/ {
        @latency = hist(nsecs - @start[tid]);
        delete(@start[tid]);
    }
'

# CPU overhead
perf stat -e cycles,instructions,cache-misses \
    -p <test_pid> sleep 60

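# Event throughput and loss
sysctl -w kernel.bpf_stats_enabled=1  # Required for run_cnt / run_time_ns
bpftool prog show  # Check run_cnt and run_time_ns
# Lost perf-buffer events are reported to the userspace reader as lost-sample
# records; the stats file below only covers the ftrace ring buffer
cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats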
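
Once per-call latencies are available as raw samples (for example by printing the nanosecond deltas instead of aggregating them with hist(), which is an assumption about how the data is exported), the latency added by a probe can be reported as the difference between probed and baseline percentiles. The sketch below uses a nearest-rank percentile; the function names are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def added_latency(baseline_ns, probed_ns, percentiles=(50, 99, 99.9)):
    """Per-percentile difference between probed and baseline syscall latency."""
    return {
        p: percentile(probed_ns, p) - percentile(baseline_ns, p)
        for p in percentiles
    }
```

Comparing percentile to percentile, rather than subtracting means, keeps tail effects visible, which matters here because p99 impact is the stated priority.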

Phase 2: Stress Test Workloads

High-Frequency Allocation/Deallocation

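// test_entropy_storm.c
// Generates a controlled rate of munmap syscalls

#define _DEFAULT_SOURCE  // For MAP_ANONYMOUS under strict -std modes
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define ALLOC_SIZE (4 * 1024)  // One 4 KiB page

static volatile sig_atomic_t running = 1;  // Clear from a signal handler to stop

void generate_entropy_events(int target_rate_per_sec) {
    // Assumes target_rate_per_sec > 1 so that tv_nsec stays below one second
    struct timespec interval;
    interval.tv_sec = 0;
    interval.tv_nsec = 1000000000L / target_rate_per_sec;

    while (running) {
        void *ptr = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ptr == MAP_FAILED)
            break;
        memset(ptr, 0xAB, ALLOC_SIZE);  // Fault the page in before releasing it
        munmap(ptr, ALLOC_SIZE);  // This triggers the probe
        // The sleep is added on top of the loop body, so the achieved rate
        // falls below the target; always record the actual rate
        nanosleep(&interval, NULL);
    }
}

// Run at: 1K/s, 10K/s, 50K/s, 100K/s
// Measure: actual achieved rate, CPU%, latency percentiles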
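
Because the generator sleeps a fixed interval after each iteration, the achieved rate is bounded by the interval plus the loop body plus any sleep overshoot. The sketch below models that bound; `work_ns` is a hypothetical per-iteration cost, and treating the full 50 µs default Linux timer slack as overshoot is a deliberately pessimistic assumption.

```python
DEFAULT_TIMER_SLACK_NS = 50_000  # Default Linux timer slack for normal threads


def achieved_rate(target_per_sec, work_ns, sleep_overshoot_ns=DEFAULT_TIMER_SLACK_NS):
    """Upper bound on iterations/s for a loop that does work, then sleeps 1/target."""
    interval_ns = 1_000_000_000 / target_per_sec
    return 1_000_000_000 / (interval_ns + work_ns + sleep_overshoot_ns)


for target in (1_000, 10_000, 50_000, 100_000):
    # work_ns=2_000 is a placeholder for mmap + memset + munmap, not a measurement.
    print(f"target {target:>7,}/s -> at most {achieved_rate(target, 2_000):,.0f}/s")
```

Under these assumptions a single thread cannot approach 100K/s with nanosleep pacing, so the higher rates likely need multiple threads, a busy-wait loop, or a reduced timer slack.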

madvise Pattern Testing

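// Test MADV_DONTNEED and MADV_FREE patterns
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define ITERATIONS 100000  // Placeholder; size for a stable measurement

void test_madvise_overhead(size_t region_size, int advice) {
    void *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return;

    // Touch all pages
    memset(region, 0xCD, region_size);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // After the first call the range is no longer freshly populated, so later
    // iterations exercise a cheaper path; re-touch inside the loop to measure
    // the release path on every call
    for (int i = 0; i < ITERATIONS; i++) {
        madvise(region, region_size, advice);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    // Report: ops/sec, ns/op
    printf("%.0f ops/sec, %.1f ns/op\n", ITERATIONS / (ns / 1e9), ns / ITERATIONS);
    munmap(region, region_size);
}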

Phase 3: Real Application Benchmarks

Redis Memory Stress

# Redis configured with aggressive eviction
redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru

# Generate workload with high eviction rate
redis-benchmark -t set,get -n 10000000 -d 1024 -c 50 -P 16

# Metrics to capture:
# - ops/sec (baseline vs with probes)
# - latency percentiles (p50, p99, p99.9)
# - Redis memory fragmentation ratio

PostgreSQL Buffer Management

# postgresql.conf: deliberately small shared_buffers
shared_buffers = 256MB
work_mem = 4MB

# Run pgbench with a dataset larger than shared_buffers
pgbench -i -s 100 testdb  # ~1.5GB dataset
pgbench -c 20 -j 4 -T 300 testdb

# Metrics:
# - TPS (baseline vs with probes)
# - Buffer eviction rate
# - Query latency distribution

Python GC Patterns

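# test_python_gc.py
import gc
import time

def generate_garbage():
    """Create and release objects.

    Blocks this small are usually served from the allocator's heap, so confirm
    with the probe that munmap is actually reached.
    """
    garbage = []
    for _ in range(10000):
        garbage.append([0] * 1000)  # List of ints
    del garbage
    gc.collect()

if __name__ == "__main__":
    start = time.perf_counter()
    generate_garbage()
    print(f"elapsed: {time.perf_counter() - start:.3f}s")

# Run with probes attached
# Measure: GC pause times, munmap frequency, overall throughput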

Phase 4: Comparison Testing

Ring Buffer vs Perf Event Array

// ringbuf_probe.bpf.c
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256KB
} events SEC(".maps");

SEC("kprobe/__x64_sys_munmap")
int trace_munmap_ringbuf(struct pt_regs *ctx) {
    struct entropy_event *event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event) return 0;

    // Fill event...
    bpf_ringbuf_submit(event, 0);
    return 0;
}

// perf_array_probe.bpf.c
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");

SEC("kprobe/__x64_sys_munmap")
int trace_munmap_perf(struct pt_regs *ctx) {
    struct entropy_event event = {};
    // Fill event...
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}

Comparison metrics:

Per-event kernel-side overhead (time in probe)
Event loss rate under pressure
Userspace poll() latency and CPU usage
Memory efficiency (buffer sizing)
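
Buffer sizing and loss rate can be estimated before running anything. The sketch below computes how many events fit in a ring buffer, how long the consumer can stall before the buffer fills at a given rate, and the loss rate from delivered/lost counters. The 32-byte event size is an assumption (the layout of struct entropy_event is not shown here); the 8-byte per-record header reflects the BPF ring buffer format.

```python
RINGBUF_HDR_BYTES = 8  # Per-record header in the BPF ring buffer


def ringbuf_capacity_events(buf_bytes, event_bytes):
    """Number of records that fit, with each payload rounded up to 8 bytes."""
    payload = -(-event_bytes // 8) * 8  # Ceiling to a multiple of 8
    return buf_bytes // (RINGBUF_HDR_BYTES + payload)


def stall_budget_ms(buf_bytes, event_bytes, events_per_sec):
    """How long the consumer can be idle before an empty buffer fills up."""
    return ringbuf_capacity_events(buf_bytes, event_bytes) / events_per_sec * 1000


def loss_rate(delivered, lost):
    """Fraction of generated events that never reached userspace."""
    total = delivered + lost
    return lost / total if total else 0.0


EVENT_BYTES = 32  # Assumed sizeof(struct entropy_event)
for rate in (10_000, 100_000, 1_000_000):
    ms = stall_budget_ms(256 * 1024, EVENT_BYTES, rate)
    print(f"{rate:>9,} events/s: 256 KiB buffer absorbs a {ms:.1f} ms consumer stall")
```

The same stall-budget reasoning applies to the perf event array, except that its buffers are per CPU, so the headroom depends on how events are distributed across cores.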

Kprobe vs Tracepoint

// tracepoint_probe.bpf.c
SEC("tracepoint/syscalls/sys_enter_munmap")
int trace_munmap_tp(struct trace_event_raw_sys_enter *ctx) {
    // ctx->args[1] is length parameter
    u64 len = ctx->args[1];

    struct entropy_event event = {
        .pid = bpf_get_current_pid_tgid() >> 32,
        .bytes_freed = len,
        // ...
    };

    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}

Comparison metrics:

Latency overhead per event
Stability across kernel versions (5.10, 5.15, 6.1, 6.6)
Available context (arguments, return values)
Verifier complexity

Deliverables

Primary Output: Benchmark Report (10-15 pages)

1. Executive Summary (1 page)
   - Validated overhead numbers
   - Recommended configuration for Maxwell
   - Go/no-go for production deployment

2. Methodology (2 pages)
   - Test environment specification
   - Benchmark design rationale
   - Statistical validity discussion

3. Microbenchmark Results (4 pages)
   - Per-probe latency breakdown
   - Map type comparison (ringbuf vs perf array)
   - Probe type comparison (kprobe vs tracepoint)
   - Scaling curves (overhead vs event rate)
   - Tables with p50, p99, p99.9 latencies

4. Application Benchmark Results (3 pages)
   - Redis impact analysis
   - PostgreSQL impact analysis
   - Python GC impact analysis
   - CPU overhead measurements

5. Recommendations (2 pages)
   - Optimal configuration for Maxwell
   - Event rate thresholds
   - Fallback strategies for high-load scenarios
   - Kernel version requirements

6. Appendix
   - Raw data tables
   - Test scripts (reproducible)
   - System configuration details

Secondary Outputs

Decision Matrix

Configuration	Median Latency	p99 Latency	Event Loss @ 100K/s	CPU Overhead	Recommendation
kprobe + perf_event_array	TBD	TBD	TBD	TBD	Current impl
kprobe + ringbuf	TBD	TBD	TBD	TBD	If loss unacceptable
tracepoint + ringbuf	TBD	TBD	TBD	TBD	If stability needed

Overhead Budget Validation

Maxwell's target: <1% scheduler overhead at 10,000 ticks/second
Available time per tick: 100,000 ns (100us)
Budget for eBPF: 1,000 ns (1%)

Measured actual: [TBD] ns
Result: [PASS/FAIL with margin]
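
The budget arithmetic above can be written out so that the PASS/FAIL line is filled in mechanically once a measurement exists. This sketch assumes one probe invocation per tick, which the budget implies but does not state; the function names are illustrative.

```python
def tick_budget(ticks_per_sec=10_000, budget_percent=1):
    """Return (time available per tick, eBPF budget per tick), both in ns."""
    tick_ns = 1_000_000_000 / ticks_per_sec
    return tick_ns, tick_ns * budget_percent / 100


def validate(measured_ns, ticks_per_sec=10_000, budget_percent=1):
    """Return ('PASS' or 'FAIL', margin in ns); a negative margin means over budget."""
    _, budget_ns = tick_budget(ticks_per_sec, budget_percent)
    status = "PASS" if measured_ns <= budget_ns else "FAIL"
    return status, budget_ns - measured_ns


print(tick_budget())   # Time per tick and eBPF budget at 10,000 ticks/s
print(validate(500))   # 500 ns is the claimed figure, not a measurement
```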

Test Harness Code
- Reproducible benchmark suite
- Automated data collection scripts
- Visualization notebooks (latency histograms, scaling curves)

Success Criteria

Minimum Viable Validation

Measured per-probe overhead with statistical confidence (n >= 10000 samples)
Tested at 1K, 10K, 100K events/second
Compared at least 2 map types (perf_event_array, ringbuf)
Compared kprobe vs tracepoint
Tested on at least 2 kernel versions (5.15 LTS, 6.1+)
Measured real application impact (Redis or PostgreSQL)
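
"Statistical confidence" for a tail percentile can be expressed as a bootstrap confidence interval: resample the latency samples with replacement, recompute the statistic each time, and take the central range of the results. The sketch below is one common way to do this, not a method prescribed by this directive; the resample count, the 95% level, and the fixed seed are assumptions.

```python
import math
import random


def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(99 * len(ordered) / 100)) - 1]


def bootstrap_ci(samples, stat=p99, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` over `samples`."""
    rng = random.Random(seed)  # Fixed seed so reruns give the same interval
    n = len(samples)
    estimates = sorted(stat(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return estimates[lo_idx], estimates[hi_idx]
```

With n >= 10000 samples, only about 100 of them lie above p99, so the interval on p99 is much wider than on the median; p99.9 would need correspondingly more samples.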

Full Validation

All minimum criteria met
Tested all 3 real applications (Redis, PostgreSQL, Python)
Characterized event loss behavior under overload
Identified scaling knee points with confidence intervals
Provided actionable configuration recommendations
Reproducible test suite committed to repository

Go/No-Go Criteria

GREEN: Proceed with current design
- Per-probe overhead < 1000ns (p99)
- System overhead < 1% at 10K events/s
- Event loss < 0.01% at 10K events/s

YELLOW: Proceed with modifications
- Per-probe overhead 1000-2000ns (p99)
- System overhead 1-3% at 10K events/s
- Recommend ringbuf or tracepoint

RED: Redesign required
- Per-probe overhead > 2000ns (p99)
- System overhead > 3% at 10K events/s
- Need sampling or batch approaches
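
The thresholds above translate directly into a classifier. The sketch below takes the worst result across criteria; that worst-of rule, and treating event loss at or above 0.01% as YELLOW (the criteria only define a loss threshold for GREEN), are assumptions.

```python
def classify(p99_ns, system_overhead_pct, loss_pct):
    """Map measured results at 10K events/s to GREEN / YELLOW / RED."""
    if p99_ns > 2000 or system_overhead_pct > 3:
        return "RED"
    if p99_ns >= 1000 or system_overhead_pct >= 1 or loss_pct >= 0.01:
        return "YELLOW"
    return "GREEN"


# The claimed figures (not measurements) would land in GREEN if loss is zero.
print(classify(p99_ns=500, system_overhead_pct=0.5, loss_pct=0.0))
```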

References

Essential Reading

"BPF Performance Tools" by Brendan Gregg (2019)
- Chapter 4: BCC
- Chapter 6: CPUs (kprobe overhead discussion)
Linux Kernel Documentation
- BPF Design Q&A
- BPF Ring Buffer
Performance Measurement Papers
- "Measuring the Overhead of BPF" (various LPC talks)
- "Low-Overhead Performance Monitoring" (EuroSys papers)

Tools

# Essential tools for benchmarking (Ubuntu package names; perf and bpftool
# ship in linux-tools for the running kernel; on Debian use linux-perf and bpftool)
apt install linux-tools-common linux-tools-$(uname -r) bpftrace sysbench

# Profiling
pip install py-spy     # For Python profiling

Kernel Requirements

# Check BTF support
ls /sys/kernel/btf/vmlinux

# Check ringbuf support (5.8+)
uname -r  # Should be >= 5.8

# Verify kernel config
grep -E "CONFIG_BPF|CONFIG_DEBUG_INFO_BTF" /boot/config-$(uname -r)
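
The same checks can be scripted so that the benchmark harness refuses to run on an unsuitable host. This is a minimal sketch with illustrative names; the version comparison is only a heuristic, since distribution kernels may backport features.

```python
import os
import platform
import re


def kernel_at_least(release, minimum=(5, 8)):
    """True if a release string such as '5.15.0-91-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False  # Unparseable release string: treat as unsupported
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_host():
    """Report whether ring buffer and BTF prerequisites appear to be met."""
    return {
        "ringbuf_capable": kernel_at_least(platform.release()),
        "btf_available": os.path.exists("/sys/kernel/btf/vmlinux"),
    }
```

Calling `check_host()` on the test machine returns a small dict that can be logged alongside the benchmark results as part of the system configuration record.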

Notes

Scope Boundaries

Focus on overhead measurement, not functionality testing
Assume probes are correctly implemented (verified by sprint-3-1)
Don't optimize probe code — measure current implementation first
Production kernel versions only (5.10 LTS, 5.15 LTS, 6.1+, 6.6+)

Potential Pitfalls

1. Measurement perturbation: Measuring probes with probes adds overhead
   - Use hardware counters (RDTSC, perf) where possible
   - Account for measurement overhead in analysis

2. System noise: Background processes affect measurements
   - Use dedicated test machine or container
   - Multiple runs with statistical analysis
   - Report confidence intervals

3. Kernel version variance: Different kernels have different BPF JIT quality
   - Test on multiple kernel versions
   - Note significant differences

4. Workload representation: Synthetic tests may not reflect production
   - Include real application benchmarks
   - Document workload characteristics

Research Philosophy

Gregg's Principles Applied:

Measure, don't guess — The step files claim ~500ns, but have we actually measured it?
Percentiles over averages — p99 matters more than mean for latency-sensitive paths
Test at scale — 1K/s is easy; 100K/s exposes real issues
Reproduce and verify — All benchmarks must be reproducible

Honest Assessment Required:

If the numbers don't support the current design, say so. Maxwell's success depends on accurate overhead characterization. Better to discover problems now than in production with real agent workloads.

Document Status: Research Directive
Topic: eBPF Overhead Validation
Last Updated: 2026-02
