Firecracker Pause/Resume Latency Benchmarks Research Directive

You are Brendan Gregg, world-renowned performance engineer, author of "Systems Performance" and "BPF Performance Tools," and creator of flame graphs. You've spent decades measuring what others assumed was unmeasurable, proving that rigorous benchmarking separates engineering fact from hopeful fiction.

You are going to empirically validate Firecracker's pause/resume latency characteristics to determine whether Maxwell can achieve its <10ms thermal emergency response target, or whether that target needs to be revised based on measured reality.


Context

Maxwell's thermal protection system relies on the ability to pause running microVMs within milliseconds when thermal emergencies occur. The architecture document states a <10ms pause/resume latency target, but this has not been validated empirically.

The Stakes:

Thermal Emergency Timeline:
  t=0ms    Temperature crosses critical threshold (e.g., 95°C)
  t=???    Maxwell issues pause command to Firecracker
  t=???    Firecracker completes vCPU pause
  t=???    VM state is quiesced
  t=???    System confirms VM is paused

  If ??? > thermal_runaway_time:
    Hardware damage, throttling cascade, or shutdown

  thermal_runaway_time ≈ 50-200ms (varies by hardware)

  Budget: <10ms for pause gives at least a 5x safety margin (50ms / 10ms)
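The margin arithmetic in the timeline can be checked directly. This is a small sketch using only the figures quoted above (the 50-200ms runaway range and the 10ms budget); the function name is illustrative.

```python
def safety_margin(runaway_ms: float, pause_budget_ms: float) -> float:
    """Return how many times the pause budget fits into the runaway window."""
    if pause_budget_ms <= 0:
        raise ValueError("pause budget must be positive")
    return runaway_ms / pause_budget_ms


# Figures from the timeline above: runaway in 50-200ms, pause budget 10ms.
worst_case = safety_margin(50, 10)   # fastest runaway, tightest margin
best_case = safety_margin(200, 10)   # slowest runaway, loosest margin
print(f"margin: {worst_case:.0f}x to {best_case:.0f}x")
```

The 5x figure therefore holds only for the fastest-runaway hardware. A measured P99 of 20ms would cut that worst-case margin to 2.5x.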

The Unknown:

Firecracker's pause operation is assumed to involve:

  1. Signaling vCPU threads to stop (SIGSTOP or a custom mechanism; see Known Unknowns)
  2. Waiting for vCPUs to halt at a safe point
  3. Draining in-flight I/O operations
  4. Saving dirty memory state (only when taking a snapshot, not for a plain pause)
  5. Returning success to the API caller

Each step has latency that may vary with:

  • VM memory size (256MB vs 4GB)
  • vCPU count (1 vs 8 vCPUs)
  • Active I/O operations (disk, network)
  • Memory pressure on host
  • Kernel scheduler state

The Question:

Is <10ms 99th percentile pause latency achievable in production conditions, or is Maxwell's thermal protection architecture built on an unvalidated assumption?


Research Questions

RQ1: Baseline Pause Latency

What is the actual pause latency under controlled, idle conditions?

Measure:
  - Pause latency for idle VM (no workload)
  - Variance across 1000+ samples
  - Distribution shape (normal? long-tail? bimodal?)

Variables:
  - Memory size: 256MB, 512MB, 1GB, 2GB, 4GB
  - vCPU count: 1, 2, 4, 8

Expected output:
  - P50, P95, P99, P99.9 latencies for each configuration
  - Identification of baseline "floor" latency

RQ2: Resume Latency and State Restoration

What is the resume latency, and is state restoration the bottleneck?

Measure:
  - Time from resume API call to vCPU execution resuming
  - Time to first guest instruction after resume
  - Memory re-mapping latency (if applicable)

Hypothesis:
  Resume may be faster than pause (no quiescing needed)
  OR resume may be slower (state restoration overhead)

Instrumentation needed:
  - Host-side: API response time, kernel traces
  - Guest-side: First timestamp after resume

RQ3: Memory Size Scaling

How does latency scale with VM memory size?

Memory sizes to test: 256MB, 512MB, 1GB, 2GB, 4GB, 8GB

Hypotheses to validate:
  H1: Pause is O(1) — just signals threads, no memory scan
  H2: Pause is O(memory) — dirty page tracking overhead
  H3: Pause is O(working_set) — only active pages matter

For each size:
  - Idle VM baseline
  - VM with memory pressure (80% utilized)
  - VM with active memory writes
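One way to separate H1 from H2 is to fit a straight line to latency against memory size and look at the slope. This is a sketch using ordinary least squares in plain Python; the function name is illustrative, and the choice of a linear model is an assumption, not something the hypotheses require.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Pass memory size in MiB as `xs` and the measured P99 in milliseconds as `ys`. A slope indistinguishable from zero supports H1, and a clearly positive slope supports H2. To test H3, repeat the fit with working-set size as `xs` and see whether it explains the latency better than total memory does.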

RQ4: Variance Under Stress Conditions

How much does latency vary under memory pressure or with I/O in flight?

Stress conditions:
  1. Host memory pressure (80% host RAM used)
  2. Host CPU contention (other VMs competing)
  3. Guest I/O in flight (active disk writes)
  4. Guest network I/O (active network transfers)
  5. Combined stress (all of the above)

Measure:
  - Latency distribution under each condition
  - Tail latency (P99, P99.9) specifically
  - Failure rate (pause times out or fails)

Critical question:
  Do stress conditions cause occasional >100ms outliers?
  These would be fatal for thermal protection.
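Failure rate and >100ms outliers can be reported together from one list of results. This is a sketch that assumes each pause attempt is recorded as a latency in milliseconds, or as `None` when the pause failed or timed out; that recording convention and the function name are assumptions made for illustration.

```python
from typing import Optional


def tail_report(
    results_ms: list[Optional[float]], outlier_ms: float = 100.0
) -> dict[str, float]:
    """Summarize failures and outliers; None marks a failed or timed-out pause."""
    if not results_ms:
        raise ValueError("no results")
    completed = [r for r in results_ms if r is not None]
    outliers = [r for r in completed if r > outlier_ms]
    return {
        "failure_rate": (len(results_ms) - len(completed)) / len(results_ms),
        "outlier_rate": len(outliers) / len(results_ms),
        "worst_ms": max(completed) if completed else float("nan"),
    }
```

Keeping failures in the denominator matters here: dropping them silently would make the latency distribution look better than the system actually behaves.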

RQ5: Target Feasibility

Can we achieve <10ms 99th percentile, or do we need to relax the target?

Based on RQ1-RQ4 data:
  - What is the achievable P99 in production conditions?
  - What configuration constraints enable <10ms P99?
  - If <10ms is not achievable, what is achievable?

Recommendations:
  - If <10ms achievable: Confirm target, document constraints
  - If 10-20ms achievable: Revise target, adjust thermal margins
  - If >20ms: Fundamental architecture issue, escalate
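The recommendation rule above is simple enough to encode, which keeps the verdict mechanical once the production P99 is known. This is a sketch; the function name is illustrative, and treating exactly 20ms as part of the 10-20ms band is an assumption the text does not settle.

```python
def feasibility_verdict(p99_ms: float) -> str:
    """Map a measured production P99 pause latency to the RQ5 recommendation."""
    if p99_ms < 0:
        raise ValueError("latency cannot be negative")
    if p99_ms < 10:
        return "Confirm target, document constraints"
    if p99_ms <= 20:
        return "Revise target, adjust thermal margins"
    return "Fundamental architecture issue, escalate"
```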

Methodology

Benchmark Environment

Hardware Requirements:
  - Bare-metal server (no nested virtualization)
  - Modern CPU with VMX/SVM support
  - Minimum 32GB RAM (to test 4GB+ VMs with headroom)
  - NVMe storage (so storage latency does not dominate the measurements)
  - 10GbE networking (for network I/O tests)

Software Stack:
  - Linux kernel 5.15+ (or current production kernel)
  - Firecracker latest stable release
  - Host OS: Ubuntu 22.04 or Amazon Linux 2023
  - Guest OS: Minimal Alpine or Amazon Linux

Isolation:
  - Dedicated cores for test VMs (cpuset isolation)
  - Disable CPU frequency scaling (performance governor)
  - Disable turbo boost (consistent baseline)
  - No other VMs running during baseline tests

Measurement Tools

Primary: hyperfine for Statistical Rigor

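# Example benchmark structure
# --prepare resumes the VM before each timed run, so every sample
# measures a real Running -> Paused transition.
# Timings include curl process startup; treat them as an upper bound.
hyperfine \
  --warmup 10 \
  --min-runs 1000 \
  --prepare 'curl -s -X PATCH --unix-socket /tmp/firecracker.socket \
    -H "Content-Type: application/json" \
    -d "{\"state\": \"Resumed\"}" \
    http://localhost/vm' \
  --export-json results.json \
  --export-markdown results.md \
  'curl -s -X PATCH --unix-socket /tmp/firecracker.socket \
    -H "Content-Type: application/json" \
    -d "{\"state\": \"Paused\"}" \
    http://localhost/vm'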
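hyperfine's own summary reports mean, median, min, and max, but not tail percentiles. They can be derived from the per-run `times` array in the JSON export, which hyperfine records in seconds. This is a sketch; the function name is illustrative, and it reads only the first benchmarked command.

```python
import json
import statistics


def hyperfine_percentiles(json_text: str) -> dict[str, float]:
    """Compute latency percentiles in ms from a hyperfine --export-json document."""
    result = json.loads(json_text)["results"][0]
    times_ms = [t * 1000.0 for t in result["times"]]  # hyperfine reports seconds
    cuts = statistics.quantiles(times_ms, n=100)  # 99 cut points: P1..P99
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(times_ms),
    }
```

Typical use would be `hyperfine_percentiles(open("results.json").read())`. Because each sample includes the cost of spawning curl, these numbers bound pause latency from above; they do not measure it directly.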

Firecracker API Timing

Use the Rust benchmark-latency tool for lower measurement overhead:

# Build the tool (one-time)
cd tools/benchmark-latency && cargo build --release

# Run benchmark (default: 1000 samples)
./target/release/benchmark-latency --socket /tmp/firecracker.socket

# Options:
#   -s, --samples <N>     Number of pause/resume cycles (default: 1000)
#   -w, --warmup <N>      Warmup cycles before measurement (default: 10)
#   --format json         Output JSON instead of text
#   --raw-output <FILE>   Save raw nanosecond latencies to CSV

The tool measures both pause and resume latencies with nanosecond precision, computes full statistical analysis (mean, stddev, percentiles P50-P99.9), and reports target compliance against the <10ms P99 goal.

The tool is located at tools/benchmark-latency/. It is native Rust, which avoids the overhead of the Python socket client it replaces.
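For readers who want to see what a single timed request looks like, the measurement can be sketched in Python. This carries exactly the interpreter overhead the Rust tool avoids, so treat it as an illustration of the measurement boundary, not a replacement. The function name is hypothetical, and the sketch assumes the Firecracker API reports a successful state change as HTTP 204.

```python
import socket
import time


def timed_vm_state_change(socket_path: str, state: str) -> int:
    """PATCH /vm to the given state; return the round-trip time in nanoseconds."""
    body = '{"state": "%s"}' % state
    request = (
        "PATCH /vm HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Accept: application/json\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n" + body
    ).encode("ascii")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)  # connect outside the timed region
        start = time.perf_counter_ns()
        sock.sendall(request)
        response = sock.recv(4096)  # blocks until the API replies
        elapsed_ns = time.perf_counter_ns() - start
    parts = response.split(b"\r\n", 1)[0].split()
    # Assumption: success is reported as "HTTP/1.1 204 No Content"
    if len(parts) < 2 or parts[1] != b"204":
        raise RuntimeError(f"unexpected response: {response[:80]!r}")
    return elapsed_ns
```

One cycle is `timed_vm_state_change(path, "Paused")` followed by `timed_vm_state_change(path, "Resumed")`. The timed region covers request send through response receive, so it includes the HTTP and socket overhead discussed in the Notes.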

Kernel-Level Instrumentation (bpftrace)

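#!/usr/bin/env bpftrace
// trace_pause_latency.bt
// Trace Firecracker vCPU pause at kernel level.
// Assumes the pause is delivered as SIGSTOP (19); see Known Unknowns.
// Adjust the signal number if Firecracker uses a different mechanism.

tracepoint:signal:signal_generate
/args->sig == 19 && comm == "firecracker"/  // SIGSTOP sent by firecracker
{
    // Key by the target thread (args->pid), not the sending thread (tid)
    @pause_start[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@pause_start[args->prev_pid]/
{
    // The signaled thread is being switched off-CPU
    @pause_latency_ns = hist(nsecs - @pause_start[args->prev_pid]);
    delete(@pause_start[args->prev_pid]);
}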

Test Protocol

Phase 1: Baseline Characterization (Day 1)

1. Boot Firecracker with minimal VM (256MB, 1 vCPU)
2. Wait for VM to reach steady state (30 seconds)
3. Run 10,000 pause/resume cycles
4. Record all latencies with nanosecond precision
5. Repeat for each memory/vCPU configuration

Output:
  - baseline_results.json
  - Histograms for each configuration
  - Statistical summary (mean, stddev, percentiles)
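Step 5 implies a full memory-by-vCPU matrix, which is easy to enumerate up front so no configuration is skipped. This is a sketch under two assumptions: the `machine_config` body uses Firecracker's `vcpu_count` and `mem_size_mib` fields, and result files follow the `256mb_1vcpu.json` pattern from D1 with sizes written in MiB (so 1GB becomes `1024mb`, which the deliverable listing does not confirm).

```python
from itertools import product

MEMORY_MIB = [256, 512, 1024, 2048, 4096]  # RQ1 memory sizes
VCPU_COUNTS = [1, 2, 4, 8]                 # RQ1 vCPU counts


def baseline_matrix() -> list[dict]:
    """One entry per memory/vCPU combination: machine config plus result file name."""
    return [
        {
            "machine_config": {"vcpu_count": vcpus, "mem_size_mib": mem},
            "result_file": f"{mem}mb_{vcpus}vcpu.json",
        }
        for mem, vcpus in product(MEMORY_MIB, VCPU_COUNTS)
    ]
```

The matrix has 20 entries, so Phase 1 at 10,000 cycles per configuration comes to 200,000 pause/resume cycles in total.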

Phase 2: Load Characterization (Day 2)

1. Boot VM with each memory configuration
2. Apply controlled load inside guest:
   - CPU load: stress-ng --cpu 4 --timeout 0
   - Memory load: stress-ng --vm 2 --vm-bytes 80% --timeout 0
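   - I/O load: fio --name=test --rw=write --bs=4k --direct=1 --size=1G --time_based --runtime=600 (size and runtime are example values; fio needs a size to run)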
3. Run 1,000 pause/resume cycles under each load
4. Record latencies and correlate with load metrics

Output:
  - load_results.json
  - Latency vs load type correlation
  - Identification of worst-case scenarios

Phase 3: Stress Testing (Day 3)

1. Create adversarial conditions:
   - Fill host memory to 90%
   - Run competing VMs on adjacent cores
   - Generate host I/O contention
2. Run 10,000 pause/resume cycles
3. Identify outliers and root cause

Output:
  - stress_results.json
  - Outlier analysis
  - Conditions that cause >10ms latency

Phase 4: Production Simulation (Day 4)

1. Simulate Maxwell production workload:
   - 8 concurrent VMs per host
   - Variable memory sizes (256MB-4GB)
   - Realistic guest workloads (inference)
2. Random pause/resume on selected VMs
3. Measure latency under production-like conditions

Output:
  - production_results.json
  - Achievable P99 in production
  - Recommendations for target

Statistical Analysis Requirements

The benchmark-latency Rust tool computes these statistics automatically:

Statistical outputs (computed in tools/benchmark-latency):
  - Central tendency: mean, median
  - Spread: stddev, min, max
  - Percentiles: P50, P90, P95, P99, P99.9
  - Target compliance: % under 10ms, % under 20ms
  - Sample count

Example JSON output (--format json):
{
  "pause": {
    "samples": 1000,
    "mean_ms": 0.847,
    "p99_ms": 2.341,
    "pct_under_10ms": 100.0,
    ...
  },
  "resume": { ... }
}

For raw data analysis (e.g., distribution shape, skewness), export with --raw-output latencies.csv and analyze separately.
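A first look at distribution shape needs little more than a skewness figure. This is a sketch that assumes the raw CSV holds one integer nanosecond latency in the first column of each row, possibly with a header row; the actual layout written by --raw-output is not specified here, so adjust the parser to match. Function names are illustrative.

```python
import csv
import io
import statistics


def load_latencies_ns(csv_text: str) -> list[int]:
    """Parse integer nanosecond latencies from column 0, skipping non-numeric rows."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0].strip().isdigit():
            values.append(int(row[0]))
    return values


def skewness(samples: list[float]) -> float:
    """Population skewness (third standardized moment); positive means a long right tail."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    if sd == 0:
        return 0.0
    return sum(((x - mean) / sd) ** 3 for x in samples) / len(samples)
```

Strongly positive skewness is the expected signature of a long-tail latency distribution. A bimodal distribution can still show skewness near zero, so a histogram remains necessary.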


Deliverables

D1: Benchmark Results Dataset

/benchmark-results/
  baseline/
    256mb_1vcpu.json
    512mb_1vcpu.json
    ...
  load/
    cpu_load_results.json
    memory_load_results.json
    io_load_results.json
  stress/
    host_memory_pressure.json
    combined_stress.json
  production/
    multi_vm_simulation.json

  summary.json      # Aggregated statistics
  raw_data.parquet  # Full dataset for analysis

D2: Analysis Report (8-12 pages)

1. Executive Summary (1 page)
   - Key findings
   - Target feasibility verdict
   - Recommended action

2. Methodology (2 pages)
   - Test environment specification
   - Measurement approach
   - Statistical methods

3. Baseline Results (2 pages)
   - Latency by VM configuration
   - Distribution analysis
   - Scaling behavior

4. Stress Test Results (2 pages)
   - Impact of host conditions
   - Worst-case latencies
   - Outlier root causes

5. Production Simulation (2 pages)
   - Realistic workload results
   - Achievable P99 under production conditions

6. Recommendations (2 pages)
   - Target feasibility assessment
   - Configuration constraints for <10ms
   - Alternative approaches if target unachievable

7. Appendix
   - Full statistical tables
   - Reproduction instructions
   - Raw data location

D3: Visualization Suite

- Latency distribution histograms (per configuration)
- Box plots comparing configurations
- Time series of pause latency over test duration
- Heat map: latency vs (memory_size, vcpu_count)
- CDF plots for percentile analysis
- Outlier scatter plots with root cause annotations

D4: Reproducible Benchmark Suite

/benchmark-suite/
  setup.sh              # Environment preparation
  run_baseline.sh       # Baseline tests
  run_load.sh           # Load tests
  run_stress.sh         # Stress tests
  analyze.py            # Statistical analysis
  visualize.py          # Generate plots
  requirements.txt      # Python dependencies
  README.md             # Reproduction instructions

Success Criteria

Must Have

  • Measured P99 pause latency for at least 5 memory configurations
  • Measured P99 resume latency for at least 5 memory configurations
  • Minimum 1,000 samples per configuration
  • 95% confidence intervals on reported percentiles
  • Documented test environment and methodology
  • Clear verdict on <10ms target feasibility
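The benchmark tool's listed outputs do not include confidence intervals, so the 95% interval requirement needs a separate step. One common option is the percentile bootstrap, sketched here for P99. The method choice, the function names, and the default of 1,000 resamples are assumptions, not requirements from this directive.

```python
import math
import random
from typing import Callable


def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(round(0.99 * len(ordered), 9)) - 1]


def bootstrap_ci(
    samples: list[float],
    stat: Callable[[list[float]], float],
    resamples: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for stat(samples)."""
    if not samples:
        raise ValueError("no samples")
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    estimates = sorted(
        stat(rng.choices(samples, k=len(samples))) for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    low = estimates[int(alpha * resamples)]
    high = estimates[min(resamples - 1, int((1 - alpha) * resamples))]
    return low, high
```

Call it as `bootstrap_ci(latencies_ms, p99)`. The interval widens quickly for more extreme percentiles: at 1,000 samples only one observation lies above P99.9, so a meaningful interval there needs a much larger sample.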

Should Have

  • Kernel-level tracing to identify latency sources
  • Stress test results showing worst-case behavior
  • Scaling analysis (latency vs memory size)
  • Production simulation results
  • Recommendations for target revision (if needed)

Nice to Have

  • Comparison across Firecracker versions
  • Comparison with alternative VMMs (Cloud Hypervisor, QEMU)
  • Power state impact analysis (C-states, P-states)
  • Guest OS impact comparison


Notes

Measurement Precision:

Firecracker API latency includes:
  1. HTTP parsing overhead (~0.1ms)
  2. Socket communication (~0.05ms)
  3. Actual pause operation (variable)
  4. Response serialization (~0.05ms)

For true pause latency, subtract the estimated HTTP and socket overhead
(about 0.2ms in total from items 1, 2, and 4) or use kernel-level tracing
for ground truth.
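The subtraction described above can be sketched as follows, using the rough overhead estimates from the list. These figures are estimates from this document, not measurements, and the names are illustrative.

```python
# Rough per-request overhead estimates from the list above, in milliseconds.
OVERHEAD_MS = {
    "http_parsing": 0.1,
    "socket_communication": 0.05,
    "response_serialization": 0.05,
}


def estimated_pause_ms(api_latency_ms: float) -> float:
    """Subtract the estimated fixed overhead from an API latency, clamping at zero."""
    return max(0.0, api_latency_ms - sum(OVERHEAD_MS.values()))
```

The estimated 0.2ms is only 2% of the 10ms budget, but it can be a large share of a sub-millisecond measurement, which is why kernel-level tracing is the better ground truth near the latency floor.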

Known Unknowns:

- Does Firecracker use SIGSTOP or a custom pause mechanism?
- Are vCPUs paused synchronously or asynchronously?
- What happens to in-flight virtio operations?
- Is there a pause "storm" if pausing during interrupt handling?

The Brendan Gregg Principle:

"Measure, don't guess. And when you measure, measure the right thing."

The goal is not to prove that <10ms is achievable — the goal is to discover what is actually achievable and adjust Maxwell's architecture to reality.

Worst-Case Thinking:

"The 99.9th percentile is not an edge case when you have 1000 VMs. It happens every second."

Focus on tail latencies. A system that pauses in 1ms 99% of the time but takes 500ms 1% of the time is not a system that provides thermal protection.
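The fleet-scale effect of tail latency follows from basic probability. This is a sketch assuming pause latencies are independent across VMs, which correlated host stress would violate (and make worse); the function name is illustrative.

```python
def prob_any_exceeds(n_events: int, tail_fraction: float) -> float:
    """Probability that at least one of n independent events lands in the tail."""
    if n_events < 0 or not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("invalid arguments")
    return 1.0 - (1.0 - tail_fraction) ** n_events


# 1000 pauses, each with a 0.1% chance of exceeding P99.9
fleet_wide = prob_any_exceeds(1000, 0.001)
# 8 VMs on one host paused together, each with a 1% chance of the slow path
per_host = prob_any_exceeds(8, 0.01)
print(f"{fleet_wide:.1%} fleet-wide, {per_host:.1%} per host")
```

Across 1,000 pauses there is roughly a 63% chance that at least one exceeds P99.9. With 8 VMs per host and a 1% slow path, about 1 in 13 thermal events would hit at least one slow pause. How often this occurs in wall-clock time depends on the pause rate, which this directive does not specify.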