Firecracker Pause/Resume Latency Benchmarks Research Directive
You are Brendan Gregg, world-renowned performance engineer, author of "Systems Performance" and "BPF Performance Tools," and creator of flame graphs. You've spent decades measuring what others assumed was unmeasurable, proving that rigorous benchmarking separates engineering fact from hopeful fiction.
You are going to empirically validate Firecracker's pause/resume latency characteristics to determine whether Maxwell can achieve its <10ms thermal emergency response target, or whether that target needs to be revised based on measured reality.
Context
Maxwell's thermal protection system relies on the ability to pause running microVMs within milliseconds when thermal emergencies occur. The architecture document states a <10ms pause/resume latency target, but this has not been validated empirically.
The Stakes:
Thermal Emergency Timeline:
t=0ms Temperature crosses critical threshold (e.g., 95°C)
t=??? Maxwell issues pause command to Firecracker
t=??? Firecracker completes vCPU pause
t=??? VM state is quiesced
t=??? System confirms VM is paused
If ??? > thermal_runaway_time:
Hardware damage, throttling cascade, or shutdown
thermal_runaway_time ≈ 50-200ms (varies by hardware)
Budget: <10ms for pause gives a 5x safety margin at the 50ms low end of that range (20x at 200ms)
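The margin arithmetic in the timeline can be written out as a small sketch. The function and constant names are illustrative only; the figures are the ones quoted above. Note that the budget covers the pause alone, while detection and command issue also consume part of the runaway window.

```python
def safety_margin(runaway_ms: float, pause_budget_ms: float) -> float:
    """Return how many times the pause budget fits into the runaway window."""
    if pause_budget_ms <= 0:
        raise ValueError("pause budget must be positive")
    return runaway_ms / pause_budget_ms


# Figures taken from the timeline above.
RUNAWAY_RANGE_MS = (50.0, 200.0)
PAUSE_BUDGET_MS = 10.0

# Margin at the fast (pessimistic) and slow (optimistic) ends of the range.
worst_case = safety_margin(RUNAWAY_RANGE_MS[0], PAUSE_BUDGET_MS)
best_case = safety_margin(RUNAWAY_RANGE_MS[1], PAUSE_BUDGET_MS)
```

If the measured P99 turns out to be 20ms, the same function gives a 2.5x margin at the 50ms end, which is the number the thermal margin discussion would need to start from.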
The Unknown:
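Firecracker's pause operation is assumed to involve the following steps (the exact mechanism is unverified, see Known Unknowns):
- Signaling vCPU threads to stop (SIGSTOP or a Firecracker-specific mechanism)
- Waiting for vCPUs to halt at a safe point
- Draining in-flight I/O operations
- Saving dirty memory state (only when taking a snapshot, not for a plain pause)
- Returning success to the API caller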
Each step has latency that may vary with:
- VM memory size (256MB vs 4GB)
- vCPU count (1 vs 8 vCPUs)
- Active I/O operations (disk, network)
- Memory pressure on host
- Kernel scheduler state
The Question:
Is <10ms 99th percentile pause latency achievable in production conditions, or is Maxwell's thermal protection architecture built on an unvalidated assumption?
Research Questions
RQ1: Baseline Pause Latency
What is the actual pause latency under controlled, idle conditions?
Measure:
- Pause latency for idle VM (no workload)
- Variance across 1000+ samples
- Distribution shape (normal? long-tail? bimodal?)
Variables:
- Memory size: 256MB, 512MB, 1GB, 2GB, 4GB
- vCPU count: 1, 2, 4, 8
Expected output:
- P50, P95, P99, P99.9 latencies for each configuration
- Identification of baseline "floor" latency
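A minimal sketch of how the requested percentiles could be derived from raw samples, assuming the nearest-rank definition (other definitions interpolate and give slightly different values). The helper names are hypothetical and not part of any existing tool.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    # round() guards against float noise such as 999.0000000000001 before ceil().
    rank = math.ceil(round(p * len(ordered) / 100, 9))  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return the percentiles listed under 'Expected output'."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99, 99.9)}
```

With exactly 1000 samples, P99.9 under this definition is simply the second-largest sample, so a single slow cycle can move it. That is one reason to collect well over 1000 samples per configuration when P99.9 matters.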
RQ2: Resume Latency and State Restoration
What is the resume latency, and is state restoration the bottleneck?
Measure:
- Time from resume API call to vCPU execution resuming
- Time to first guest instruction after resume
- Memory re-mapping latency (if applicable)
Hypothesis:
Resume may be faster than pause (no quiescing needed)
OR resume may be slower (state restoration overhead)
Instrumentation needed:
- Host-side: API response time, kernel traces
- Guest-side: First timestamp after resume
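For the host-side API response time, a rough Python sketch is shown below. It assumes the API socket path used elsewhere in this document, that Firecracker keeps the connection open between requests, and that a successful state change returns HTTP 204; treat all three as assumptions to verify. The class and function names are illustrative. Interpreter overhead is included in every sample, so this is for exploration only and does not cover the guest-side timestamp.

```python
import http.client
import json
import socket
import time


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        super().__init__("localhost", timeout=timeout)
        self._socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._socket_path)
        self.sock = sock


def set_vm_state(conn, state):
    """Send PATCH /vm with the given state; return (elapsed_ns, http_status)."""
    body = json.dumps({"state": state})
    start = time.perf_counter_ns()
    conn.request("PATCH", "/vm", body=body,
                 headers={"Content-Type": "application/json"})
    response = conn.getresponse()
    response.read()  # drain the body so the connection can be reused
    elapsed = time.perf_counter_ns() - start
    return elapsed, response.status


def pause_resume_cycle(conn):
    """Run one pause followed by one resume and report both API latencies."""
    pause_ns, pause_status = set_vm_state(conn, "Paused")
    resume_ns, resume_status = set_vm_state(conn, "Resumed")
    ok = pause_status == 204 and resume_status == 204  # assumed success code
    return {"pause_ns": pause_ns, "resume_ns": resume_ns, "ok": ok}


# Usage (requires a running Firecracker instance):
#   conn = UnixHTTPConnection("/tmp/firecracker.socket")
#   print(pause_resume_cycle(conn))
```

Timing pause and resume separately inside one cycle keeps the two distributions comparable, since both see the same VM state and host conditions.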
RQ3: Memory Size Scaling
How does latency scale with VM memory size?
Memory sizes to test: 256MB, 512MB, 1GB, 2GB, 4GB, 8GB
Hypotheses to validate:
H1: Pause is O(1) — just signals threads, no memory scan
H2: Pause is O(memory) — dirty page tracking overhead
H3: Pause is O(working_set) — only active pages matter
For each size:
- Idle VM baseline
- VM with memory pressure (80% utilized)
- VM with active memory writes
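One way to separate H1 from H2 is to fit a straight line to latency against memory size and check whether the fitted growth is meaningful relative to the floor. The sketch below uses ordinary least squares; the 10% flatness threshold and the function names are arbitrary assumptions, not values from this directive.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def classify_scaling(mem_mib, latency_ms, flat_ratio=0.10):
    """Label latency as flat if fitted growth over the range is small versus the floor."""
    slope, intercept = fit_line(mem_mib, latency_ms)
    floor = intercept + slope * min(mem_mib)      # fitted latency at smallest VM
    if floor <= 0:
        raise ValueError("fitted floor must be positive")
    growth = slope * (max(mem_mib) - min(mem_mib))  # fitted increase across range
    if abs(growth) <= flat_ratio * floor:
        return "O(1)-like"
    return "scales with memory"
```

Memory size alone cannot distinguish H2 from H3. Comparing the idle runs with the memory-pressure and active-write runs at the same size is what separates total memory from working set.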
RQ4: Variance Under Stress Conditions
What's the latency variance under memory pressure or I/O in flight?
Stress conditions:
1. Host memory pressure (80% host RAM used)
2. Host CPU contention (other VMs competing)
3. Guest I/O in flight (active disk writes)
4. Guest network I/O (active network transfers)
5. Combined stress (all of the above)
Measure:
- Latency distribution under each condition
- Tail latency (P99, P99.9) specifically
- Failure rate (pause times out or fails)
Critical question:
Do stress conditions cause occasional >100ms outliers?
These would be fatal for thermal protection.
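The critical question can be answered directly from each stress run by counting samples above the 100ms line and tracking failed pauses separately from slow ones. A minimal sketch, with hypothetical names and input shape:

```python
def stress_tail_report(latencies_ms, failed_attempts, fatal_ms=100.0):
    """Summarize one stress condition.

    latencies_ms: latencies of pauses that completed.
    failed_attempts: number of pauses that timed out or returned an error.
    """
    attempts = len(latencies_ms) + failed_attempts
    if attempts == 0:
        raise ValueError("no attempts recorded")
    fatal = [x for x in latencies_ms if x > fatal_ms]
    return {
        "attempts": attempts,
        "failure_rate": failed_attempts / attempts,
        "fatal_outliers": len(fatal),
        "worst_ms": max(latencies_ms) if latencies_ms else None,
    }
```

Keeping failures out of the latency list avoids two distortions: a timeout recorded as a latency inflates the tail, and a timeout dropped silently hides the worst case entirely.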
RQ5: Target Feasibility
Can we achieve <10ms 99th percentile, or do we need to relax the target?
Based on RQ1-RQ4 data:
- What is the achievable P99 in production conditions?
- What configuration constraints enable <10ms P99?
- If <10ms is not achievable, what is achievable?
Recommendations:
- If <10ms achievable: Confirm target, document constraints
- If 10-20ms achievable: Revise target, adjust thermal margins
- If >20ms: Fundamental architecture issue, escalate
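The recommendation bands above map onto a simple decision function. This sketch treats exactly 10ms as falling in the 10-20ms band, which is an assumption about how the boundary should be read.

```python
def feasibility_verdict(p99_ms: float) -> str:
    """Map a measured production P99 (in ms) onto the recommendation bands."""
    if p99_ms < 0:
        raise ValueError("latency cannot be negative")
    if p99_ms < 10.0:
        return "confirm target, document constraints"
    if p99_ms <= 20.0:
        return "revise target, adjust thermal margins"
    return "escalate: fundamental architecture issue"
```

The input should be the P99 from the production simulation, not the idle baseline, since the verdict is about production conditions.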
Methodology
Benchmark Environment
Hardware Requirements:
- Bare-metal server (no nested virtualization)
- Modern CPU with VMX/SVM support
- Minimum 32GB RAM (to test 4GB+ VMs with headroom)
- NVMe storage (so disk latency does not dominate the measurements)
- 10GbE networking (for network I/O tests)
Software Stack:
- Linux kernel 5.15+ (or current production kernel)
- Firecracker latest stable release
- Host OS: Ubuntu 22.04 or Amazon Linux 2023
- Guest OS: Minimal Alpine or Amazon Linux
Isolation:
- Dedicated cores for test VMs (cpuset isolation)
- Disable CPU frequency scaling (performance governor)
- Disable turbo boost (consistent baseline)
- No other VMs running during baseline tests
Measurement Tools
Primary: hyperfine for Statistical Rigor
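# Example benchmark structure
# --prepare resumes the VM before every timed run; without it, only the
# first run would pause a running VM. Timings include curl process startup.
hyperfine \
  --warmup 10 \
  --min-runs 1000 \
  --prepare 'curl -s -X PATCH --unix-socket /tmp/firecracker.socket -H "Content-Type: application/json" -d "{\"state\": \"Resumed\"}" http://localhost/vm' \
  --export-json results.json \
  --export-markdown results.md \
  'curl -s -X PATCH --unix-socket /tmp/firecracker.socket -H "Content-Type: application/json" -d "{\"state\": \"Paused\"}" http://localhost/vm'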
Firecracker API Timing
Use the Rust benchmark-latency tool for lower measurement overhead:
# Build the tool (one-time)
cd tools/benchmark-latency && cargo build --release
# Run benchmark (default: 1000 samples)
./target/release/benchmark-latency --socket /tmp/firecracker.socket
# Options:
# -s, --samples <N> Number of pause/resume cycles (default: 1000)
# -w, --warmup <N> Warmup cycles before measurement (default: 10)
# --format json Output JSON instead of text
# --raw-output <FILE> Save raw nanosecond latencies to CSV
The tool measures both pause and resume latencies with nanosecond precision, computes full statistical analysis (mean, stddev, percentiles P50-P99.9), and reports target compliance against the <10ms P99 goal.
The tool lives at tools/benchmark-latency/. The native Rust implementation avoids the overhead a Python socket client would add to each measurement.
Kernel-Level Instrumentation (bpftrace)
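#!/usr/bin/env bpftrace
// trace_pause_latency.bt
// Trace Firecracker vCPU pause at kernel level.
// Assumes the pause is delivered as SIGSTOP (19), which is unverified
// (see Known Unknowns). Change the signal number if Firecracker uses
// a different signal to stop its vCPU threads.

// A firecracker thread sends the signal: record the time, keyed by the
// target thread (args->pid), not by the sender (tid).
tracepoint:signal:signal_generate
/args->sig == 19 && comm == "firecracker"/
{
    @pause_start[args->pid] = nsecs;
}

// The target thread is switched off-CPU: treat that as the pause taking effect.
tracepoint:sched:sched_switch
/@pause_start[args->prev_pid]/
{
    @pause_latency_ns = hist(nsecs - @pause_start[args->prev_pid]);
    delete(@pause_start[args->prev_pid]);
}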
Test Protocol
Phase 1: Baseline Characterization (Day 1)
1. Boot Firecracker with minimal VM (256MB, 1 vCPU)
2. Wait for VM to reach steady state (30 seconds)
3. Run 10,000 pause/resume cycles
4. Record all latencies with nanosecond precision
5. Repeat for each memory/vCPU configuration
Output:
- baseline_results.json
- Histograms for each configuration
- Statistical summary (mean, stddev, percentiles)
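Step 5 implies a full sweep over the memory and vCPU values from RQ1. A small sketch of that matrix follows; the key names mem_size_mib and vcpu_count mirror Firecracker's machine configuration fields, while the cycles key and function name are illustrative.

```python
from itertools import product

MEMORY_MIB = (256, 512, 1024, 2048, 4096)  # RQ1 memory sizes
VCPUS = (1, 2, 4, 8)                       # RQ1 vCPU counts
CYCLES = 10_000                            # pause/resume cycles per configuration


def baseline_matrix():
    """Return one entry per (memory, vCPU) combination to benchmark."""
    return [
        {"mem_size_mib": mem, "vcpu_count": vcpus, "cycles": CYCLES}
        for mem, vcpus in product(MEMORY_MIB, VCPUS)
    ]


configs = baseline_matrix()
total_cycles = sum(c["cycles"] for c in configs)
```

The sweep comes to 20 configurations and 200,000 cycles in total, plus a 30 second settle time per boot, which is worth checking against the one-day allocation for this phase.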
Phase 2: Load Characterization (Day 2)
1. Boot VM with each memory configuration
2. Apply controlled load inside guest:
- CPU load: stress-ng --cpu 4 --timeout 0
- Memory load: stress-ng --vm 2 --vm-bytes 80% --timeout 0
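- I/O load: fio --name=test --rw=write --bs=4k --direct=1 --size=1G --time_based --runtime=600 (size and runtime are placeholder values; fio will not start without a size or target file)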
3. Run 1,000 pause/resume cycles under each load
4. Record latencies and correlate with load metrics
Output:
- load_results.json
- Latency vs load type correlation
- Identification of worst-case scenarios
Phase 3: Stress Testing (Day 3)
1. Create adversarial conditions:
- Fill host memory to 90%
- Run competing VMs on adjacent cores
- Generate host I/O contention
2. Run 10,000 pause/resume cycles
3. Identify outliers and root cause
Output:
- stress_results.json
- Outlier analysis
- Conditions that cause >10ms latency
Phase 4: Production Simulation (Day 4)
1. Simulate Maxwell production workload:
- 8 concurrent VMs per host
- Variable memory sizes (256MB-4GB)
- Realistic guest workloads (inference)
2. Random pause/resume on selected VMs
3. Measure latency under production-like conditions
Output:
- production_results.json
- Achievable P99 in production
- Recommendations for target
Statistical Analysis Requirements
The benchmark-latency Rust tool (tools/benchmark-latency) computes these statistics automatically:
- Central tendency: mean, median
- Spread: stddev, min, max
- Percentiles: P50, P90, P95, P99, P99.9
- Target compliance: % under 10ms, % under 20ms
- Sample count
Example JSON output (--format json):
{
"pause": {
"samples": 1000,
"mean_ms": 0.847,
"p99_ms": 2.341,
"pct_under_10ms": 100.0,
...
},
"resume": { ... }
}
For raw data analysis (e.g., distribution shape, skewness), export with --raw-output latencies.csv and analyze separately.
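A sketch of the separate analysis step, here computing sample skewness as a first check for a long tail. The CSV layout is an assumption: the first column is taken to hold a latency in nanoseconds and any non-numeric row (such as a header) is skipped. The real --raw-output format should be checked before relying on this.

```python
import csv
import io
import statistics


def load_latencies_ms(csv_text):
    """Parse CSV text into milliseconds; assumes column 0 is latency in ns."""
    latencies = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            latencies.append(int(row[0]) / 1e6)
        except ValueError:
            continue  # header or malformed row
    return latencies


def skewness(samples):
    """Population skewness: 0 for symmetric data, positive for a right tail."""
    n = len(samples)
    if n < 3:
        raise ValueError("need at least three samples")
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    if sd == 0:
        return 0.0
    return sum(((x - mean) / sd) ** 3 for x in samples) / n


# Usage:
#   with open("latencies.csv") as f:
#       print(skewness(load_latencies_ms(f.read())))
```

Skewness is a single number and will not reveal a bimodal distribution. A histogram or CDF of the same data is still needed to answer the distribution-shape question in RQ1.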
Deliverables
D1: Benchmark Results Dataset
/benchmark-results/
baseline/
256mb_1vcpu.json
512mb_1vcpu.json
...
load/
cpu_load_results.json
memory_load_results.json
io_load_results.json
stress/
host_memory_pressure.json
combined_stress.json
production/
multi_vm_simulation.json
summary.json # Aggregated statistics
raw_data.parquet # Full dataset for analysis
D2: Analysis Report (8-12 pages)
1. Executive Summary (1 page)
- Key findings
- Target feasibility verdict
- Recommended action
2. Methodology (2 pages)
- Test environment specification
- Measurement approach
- Statistical methods
3. Baseline Results (2 pages)
- Latency by VM configuration
- Distribution analysis
- Scaling behavior
4. Stress Test Results (2 pages)
- Impact of host conditions
- Worst-case latencies
- Outlier root causes
5. Production Simulation (2 pages)
- Realistic workload results
- Achievable P99 under production conditions
6. Recommendations (2 pages)
- Target feasibility assessment
- Configuration constraints for <10ms
- Alternative approaches if target unachievable
7. Appendix
- Full statistical tables
- Reproduction instructions
- Raw data location
D3: Visualization Suite
- Latency distribution histograms (per configuration)
- Box plots comparing configurations
- Time series of pause latency over test duration
- Heat map: latency vs (memory_size, vcpu_count)
- CDF plots for percentile analysis
- Outlier scatter plots with root cause annotations
D4: Reproducible Benchmark Suite
/benchmark-suite/
setup.sh # Environment preparation
run_baseline.sh # Baseline tests
run_load.sh # Load tests
run_stress.sh # Stress tests
analyze.py # Statistical analysis
visualize.py # Generate plots
requirements.txt # Python dependencies
README.md # Reproduction instructions
Success Criteria
Must Have
- Measured P99 pause latency for at least 5 memory configurations
- Measured P99 resume latency for at least 5 memory configurations
- Minimum 1,000 samples per configuration
- 95% confidence intervals on the reported percentiles
- Documented test environment and methodology
- Clear verdict on <10ms target feasibility
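The confidence interval requirement can be met in several ways. One option, sketched here, is a percentile bootstrap on P99. This is a swapped-in technique chosen for illustration and is not prescribed by this directive; the function names, resample count and fixed seed are assumptions.

```python
import math
import random


def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(round(99 * len(ordered) / 100, 9))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def bootstrap_ci_p99(samples, resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for P99; returns (low, high)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(samples)
    # Resample with replacement and recompute P99 each time.
    estimates = sorted(p99(rng.choices(samples, k=n)) for _ in range(resamples))
    alpha = (1 - confidence) / 2
    low = estimates[int(alpha * resamples)]
    high = estimates[min(int((1 - alpha) * resamples), resamples - 1)]
    return low, high
```

A wide interval at 1,000 samples is itself a finding: it means the P99 estimate is not yet stable enough to support a verdict, and the sample count for that configuration should be raised.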
Should Have
- Kernel-level tracing to identify latency sources
- Stress test results showing worst-case behavior
- Scaling analysis (latency vs memory size)
- Production simulation results
- Recommendations for target revision (if needed)
Nice to Have
- Comparison across Firecracker versions
- Comparison with alternative VMMs (Cloud Hypervisor, QEMU)
- Power state impact analysis (C-states, P-states)
- Guest OS impact comparison
References
Firecracker Documentation
Performance Measurement
- Gregg, B. (2020). "Systems Performance: Enterprise and the Cloud" (2nd Edition)
- Gregg, B. (2019). "BPF Performance Tools"
- hyperfine: Command-line benchmarking tool
- bpftrace Reference Guide
Statistical Methods
Related Benchmarks
Notes
Measurement Precision:
Firecracker API latency includes:
1. HTTP parsing overhead (~0.1ms)
2. Socket communication (~0.05ms)
3. Actual pause operation (variable)
4. Response serialization (~0.05ms)
For true pause latency, subtract HTTP overhead
or use kernel-level tracing for ground truth.
Known Unknowns:
- Does Firecracker use SIGSTOP or a custom pause mechanism?
- Are vCPUs paused synchronously or asynchronously?
- What happens to in-flight virtio operations?
- Is there a pause "storm" if pausing during interrupt handling?
The Brendan Gregg Principle:
"Measure, don't guess. And when you measure, measure the right thing."
The goal is not to prove that <10ms is achievable — the goal is to discover what is actually achievable and adjust Maxwell's architecture to reality.
Worst-Case Thinking:
"The 99.9th percentile is not an edge case when you have 1000 VMs. It happens every second."
Focus on tail latencies. A system that pauses in 1ms 99% of the time but takes 500ms 1% of the time is not a system that provides thermal protection.
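The arithmetic behind this can be sketched as follows, assuming pause events are independent and each has the same chance of landing in the tail. Real outliers often cluster under host stress, so treat these numbers as a lower bound on concern. The function names are illustrative.

```python
def expected_exceedances(n_events: int, tail_fraction: float) -> float:
    """Expected number of events that land beyond the given percentile."""
    return n_events * tail_fraction


def prob_any_exceeds(n_events: int, tail_fraction: float) -> float:
    """Probability that at least one of n independent events lands in the tail."""
    return 1.0 - (1.0 - tail_fraction) ** n_events


# 1000 pause events against the P99.9 tail (0.1% of events).
expected = expected_exceedances(1000, 0.001)
at_least_one = prob_any_exceeds(1000, 0.001)

# Pausing all 8 VMs on one host when 1% of pauses take 500ms.
host_hit = prob_any_exceeds(8, 0.01)
```

Across 1000 pause events, one P99.9 event is expected and the chance of seeing at least one is roughly 63%. For a single host pausing 8 VMs, a 1% slow-pause rate means roughly an 8% chance that at least one VM is still running when the emergency window closes.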