How does criterion::Criterion provide statistically robust benchmarking compared to simple timing measurements?

Simple timing measurements are prone to noise, outliers, and misleading results. Criterion provides statistical analysis that accounts for these issues, giving you confidence that your benchmarks reflect actual performance rather than measurement artifacts.
A naive benchmark measures execution time once or a few times:
use std::time::Instant;

fn naive_benchmark() {
    let start = Instant::now();
    for _ in 0..1000 {
        let _ = expensive_operation();
    }
    let elapsed = start.elapsed();
    println!("Time: {:?}", elapsed);
}

fn expensive_operation() -> i32 {
    (0..1000).sum()
}

This approach has several problems: there is no warm-up, so early iterations run with cold caches; a single measurement gives no sense of variance or outliers; the compiler may optimize away the unused result; and there is no way to tell whether a difference between two runs is real or noise.
Criterion runs many iterations, analyzes the distribution, and reports confidence intervals:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("expensive_operation", |b| {
        b.iter(|| expensive_operation(black_box(1000)))
    });
}

fn expensive_operation(n: i32) -> i32 {
    (0..n).sum()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The output includes statistical analysis rather than a single number.
Criterion runs a warm-up phase to let caches fill and the CPU stabilize:
use criterion::Criterion;

fn demonstrates_warmup() {
    let _c = Criterion::default();
    // Criterion automatically:
    // 1. Runs the benchmark briefly to estimate speed
    // 2. Runs a warm-up period (default: 3 seconds)
    // 3. Then begins actual measurement
    // Warm-up ensures:
    // - CPU caches are hot
    // - Branch predictors are trained
    // - JIT compilation (if applicable) is complete
    // - CPU frequency has stabilized
}

Without warm-up, early iterations run slower due to cold caches and untrained branch predictors.
Criterion collects many samples across different iteration counts:
use criterion::Criterion;

fn sampling_process() {
    // Criterion's sampling strategy:
    // 1. Start with small iteration counts (e.g., 1 iteration per sample)
    // 2. Gradually increase iteration count
    // 3. Collect ~100 samples total
    // 4. Each sample measures time for N iterations
    // Example sampling sequence:
    //   Sample 1:   1 iteration
    //   Sample 2:   1 iteration
    //   ...
    //   Sample 50:  10 iterations
    //   ...
    //   Sample 100: 100 iterations
    // This provides data at multiple time scales
    // and helps identify outliers and noise
}

Multiple samples reveal the distribution of execution times, not just a point estimate.
Criterion calculates confidence intervals, not just averages:
// Example Criterion output:
// expensive_operation  time: [1.2345 us 1.2456 us 1.2578 us]
//                            ^lower    ^estimate ^upper
// Found 4 outliers among 100 measurements (4.00%)
//   2 (2.00%) high mild
//   2 (2.00%) high severe

// The confidence interval tells you:
// - With 95% confidence, the true mean lies within this range
// - Narrow interval = reliable measurement
// - Wide interval = high variance or noise

The confidence interval accounts for measurement uncertainty:
use criterion::{Criterion, Throughput};

fn with_throughput(c: &mut Criterion) {
    // Throughput is configured on a benchmark group, not on the Bencher.
    let mut group = c.benchmark_group("throughput");
    group.throughput(Throughput::Bytes(1024));
    group.bench_function("process_1kb", |b| {
        b.iter(|| {
            let data = vec![0u8; 1024];
            process_data(&data)
        })
    });
    group.finish();
}

fn process_data(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

// Output shows throughput:
// process_1kb  time:  [2.3456 us 2.4567 us 2.5678 us]
//              thrpt: [398.12 MiB/s 412.34 MiB/s 426.78 MiB/s]

Criterion identifies and classifies outliers:
// Outliers are measurements that deviate significantly from the rest of the sample.
// Criterion classifies them using fences built from the quartiles:
//
// Classification by severity:
// - Mild:   between 1.5 and 3.0 IQR (interquartile range) beyond the quartiles
// - Severe: more than 3.0 IQR beyond the quartiles
//
// Classification by direction:
// - Low:  Faster than expected (rare)
// - High: Slower than expected (common; caused by interrupts, scheduling, etc.)

// Example output:
// Found 8 outliers among 100 measurements (8.00%)
//   1 (1.00%) low mild
//   3 (3.00%) high mild
//   4 (4.00%) high severe

Outliers indicate system noise. Many severe outliers suggest an unreliable measurement environment.
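The mild/severe classification can be reproduced with Tukey-style IQR fences. This is a sketch of the idea in stdlib Rust, not Criterion's actual code; the `Outlier` enum and `classify` function, along with the quartile values, are illustrative.

```rust
// Classify a measurement against Tukey-style IQR fences (sketch of the
// idea, not Criterion's implementation).
#[derive(Debug, PartialEq)]
enum Outlier {
    None,
    MildLow,
    SevereLow,
    MildHigh,
    SevereHigh,
}

fn classify(x: f64, q1: f64, q3: f64) -> Outlier {
    let iqr = q3 - q1;
    match x {
        x if x < q1 - 3.0 * iqr => Outlier::SevereLow,
        x if x < q1 - 1.5 * iqr => Outlier::MildLow,
        x if x > q3 + 3.0 * iqr => Outlier::SevereHigh,
        x if x > q3 + 1.5 * iqr => Outlier::MildHigh,
        _ => Outlier::None,
    }
}

fn main() {
    // Hypothetical quartiles from a run: q1 = 1.20 us, q3 = 1.26 us (IQR = 0.06).
    let (q1, q3) = (1.20, 1.26);
    assert_eq!(classify(1.23, q1, q3), Outlier::None);
    // Mild fence at q3 + 1.5 * IQR = 1.35; severe fence at q3 + 3.0 * IQR = 1.44.
    assert_eq!(classify(1.40, q1, q3), Outlier::MildHigh);
    assert_eq!(classify(2.00, q1, q3), Outlier::SevereHigh);
}
```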
Criterion compares against previous runs to detect performance regressions:
use criterion::{black_box, BenchmarkId, Criterion};

fn regression_detection(c: &mut Criterion) {
    let mut group = c.benchmark_group("comparison");
    for size in [100, 1000, 10000].iter() {
        group.bench_with_input(BenchmarkId::new("algorithm", size), size, |b, &size| {
            b.iter(|| algorithm(black_box(size)))
        });
    }
    group.finish();
}

fn algorithm(n: i32) -> i32 {
    (0..n).sum()
}

// On subsequent runs, Criterion loads previous results.
// Output includes a comparison:
// algorithm/100  time:   [1.2345 us 1.2456 us 1.2578 us]
//                change: [+2.3456% +3.4567% +4.5678%] (p = 0.03 < 0.05)
//                Performance has regressed.

The statistical significance (p-value) tells you whether the change is real or noise:
// Criterion's regression output:
// change: [+1.23% +2.34% +3.45%]
//          ^lower ^estimate ^upper
//
// If the confidence interval includes 0%, the change is not statistically significant.
//
// p-value interpretation:
// p < 0.05:  Statistically significant change
// p >= 0.05: Change could be noise

Criterion accounts for loop overhead:
use criterion::{black_box, Criterion};

fn overhead_handling(c: &mut Criterion) {
    c.bench_function("with_black_box", |b| {
        b.iter(|| {
            // black_box prevents the compiler from optimizing away results
            let result = expensive_operation(black_box(100));
            black_box(result)
        })
    });
}

// Without black_box:
//   let _ = expensive_operation(100);
//   The compiler might optimize this to nothing if the result is unused.
// With black_box:
//   The compiler cannot assume anything about the value,
//   forcing the actual computation.

The black_box function prevents the compiler from optimizing away computations whose results are unused.
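The same primitive has been available in the standard library as std::hint::black_box since Rust 1.66, so the effect can be sketched without any external crate. Note that black_box is purely an optimization barrier: it returns its argument unchanged.

```rust
use std::hint::black_box;

fn expensive_operation(n: i32) -> i32 {
    (0..n).sum()
}

fn main() {
    // Without black_box, the compiler may constant-fold the call and the
    // loop below could be optimized to nothing in release builds.
    // With black_box, the input is opaque and the result counts as "used",
    // so the computation must actually happen.
    let mut last = 0;
    for _ in 0..10 {
        last = black_box(expensive_operation(black_box(100)));
    }
    // black_box does not alter values, only optimization assumptions.
    assert_eq!(last, 4_950);
    println!("result = {last}");
}
```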
Criterion provides tools for comparing algorithms:
use criterion::Criterion;

fn compare_implementations(c: &mut Criterion) {
    let mut group = c.benchmark_group("sorting");
    // Reversed input so the sort actually has work to do.
    let data: Vec<i32> = (0..1000).rev().collect();
    group.bench_function("slice::sort", |b| {
        b.iter_batched(
            || data.clone(),
            |mut d| d.sort(),
            criterion::BatchSize::SmallInput,
        )
    });
    group.bench_function("slice::sort_unstable", |b| {
        b.iter_batched(
            || data.clone(),
            |mut d| d.sort_unstable(),
            criterion::BatchSize::SmallInput,
        )
    });
    group.finish();
}

// Output:
// sorting/slice::sort           time: [45.678 us 46.789 us 47.890 us]
// sorting/slice::sort_unstable  time: [23.456 us 24.567 us 25.678 us]
//                               48.5% faster

Criterion supports parameterized benchmarks with automatic scaling:
use criterion::{BenchmarkId, Criterion, Throughput};

fn parameterized_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("search");
    for size in [100u64, 1_000, 10_000, 100_000] {
        // Throughput::Elements takes a u64 element count.
        group.throughput(Throughput::Elements(size));
        let data: Vec<i32> = (0..size as i32).collect();
        let target = (size / 2) as i32;
        group.bench_with_input(BenchmarkId::new("linear", size), &data, |b, data| {
            b.iter(|| data.iter().position(|&x| x == target))
        });
        group.bench_with_input(BenchmarkId::new("binary", size), &data, |b, data| {
            b.iter(|| data.binary_search(&target))
        });
    }
    group.finish();
}

// Output shows scaling behavior:
// search/linear/100   time: [123.45 ns]
// search/binary/100   time: [15.67 ns]
// search/linear/1000  time: [1.23 us]
// search/binary/1000  time: [21.34 ns]
// ...

Criterion can estimate memory throughput:
use criterion::{black_box, Criterion, Throughput};

fn memory_throughput(c: &mut Criterion) {
    let mut group = c.benchmark_group("memory");
    group.throughput(Throughput::Bytes(1024 * 1024)); // 1 MB
    group.bench_function("copy_1mb", |b| {
        let src = vec![0u8; 1024 * 1024];
        let mut dst = vec![0u8; 1024 * 1024];
        b.iter(|| {
            dst.copy_from_slice(&src);
            black_box(&dst)
        })
    });
    group.finish();
}

// Output includes throughput:
// memory/copy_1mb  time:  [123.45 us 124.56 us 125.67 us]
//                  thrpt: [7.8 GiB/s 8.0 GiB/s 8.2 GiB/s]

Criterion allows fine-tuning of measurement parameters:
use criterion::Criterion;

fn custom_configuration() -> Criterion {
    Criterion::default()
        // Target measurement time per benchmark (not per sample)
        .measurement_time(std::time::Duration::from_secs(5))
        // Number of samples to take (default is 100)
        .sample_size(500)
        // Warm-up time (default is 3 seconds)
        .warm_up_time(std::time::Duration::from_secs(2))
        // Significance level for regression detection
        .significance_level(0.05)
        // Confidence level for confidence intervals
        .confidence_level(0.95)
    // Note: the sampling mode (Auto, Linear, or Flat) is configured per
    // benchmark group via BenchmarkGroup::sampling_mode, not on Criterion.
}

Criterion helps identify and account for noise:
// Common noise sources:
// 1. CPU frequency scaling
// 2. Cache effects from other processes
// 3. OS scheduling
// 4. Thermal throttling
// 5. Memory allocation patterns

// Criterion mitigates noise by:
// - Running many samples
// - Statistical analysis to identify outliers
// - Long measurement times
// - Confidence intervals

// For additional noise reduction:
// - Use a dedicated benchmark machine
// - Disable CPU frequency scaling
// - Run with high priority
// - Disable turbo boost
// - Run multiple times and compare

Criterion generates HTML reports with detailed analysis:
use criterion::Criterion;

fn html_reports() {
    // HTML reports are generated by default in target/criterion/
    // (plot rendering requires gnuplot or the plotters backend).
    // Each benchmark gets a detailed report including:
    // - Sample distribution
    // - Regression analysis
    // - Comparison with previous runs
    // - Probability density (PDF) visualization
    let _criterion = Criterion::default();
    // To disable plot generation: Criterion::default().without_plots()
}

The HTML reports provide visualizations that help identify patterns in measurement noise.
Criterion's statistics are based on robust estimators:
// Mean: Weighted average across all samples
// - Not a simple arithmetic mean
// - Weighted by the number of iterations in each sample
// Confidence Interval: Bootstrap estimation
// - Resamples from the measured data
// - Estimates the distribution of the mean
// - 95% confidence level by default
// Outlier Detection: IQR-based
// - Uses median and quartiles
// - Robust to extreme outliers
// - Classifies by severity and direction
// Regression Detection: Paired t-test
// - Compares distributions from two runs
// - Accounts for variance in both measurements
// - Reports statistical significanceuse std::time::Instant;
use criterion::{black_box, Criterion};
// Simple timing
fn simple_timing() {
    let times: Vec<u64> = (0..10)
        .map(|_| {
            let start = Instant::now();
            for _ in 0..1000 {
                black_box(expensive_operation(black_box(100)));
            }
            start.elapsed().as_nanos() as u64
        })
        .collect();
    let mean: f64 = times.iter().sum::<u64>() as f64 / times.len() as f64;
    println!("Mean: {} ns", mean);
    // What's missing?
    // - No confidence interval
    // - No outlier detection
    // - No warm-up
    // - No regression detection
    // - No noise characterization
}

// Criterion
fn criterion_timing(c: &mut Criterion) {
    c.bench_function("expensive", |b| {
        b.iter(|| expensive_operation(black_box(100)))
    });
    // Criterion provides:
    // - Confidence intervals
    // - Outlier detection and classification
    // - Proper warm-up
    // - Regression detection
    // - Statistical significance testing
    // - Visual reports
}

fn expensive_operation(n: i32) -> i32 {
    (0..n).sum()
}

Criterion provides statistically robust benchmarking through:
Warm-up phase: Eliminates cold-start effects from caches and branch predictors.
Multiple sampling: Collects data across many iterations and iteration counts, revealing the full distribution of execution times.
Statistical analysis: Calculates confidence intervals that quantify measurement uncertainty, rather than reporting a single point estimate.
Outlier detection: Identifies and classifies measurements that deviate from the norm, helping you understand measurement noise.
Regression detection: Compares against previous runs with statistical significance testing, distinguishing real performance changes from noise.
Visualization: HTML reports provide visual analysis of measurement distributions and trends.
Simple timing measurements give you a number, but Criterion gives you confidence in that number. For production codebases where performance matters, this statistical rigor helps you make informed decisions about optimizations and catch regressions before they reach users.
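As a final illustration of the bootstrap idea behind those confidence intervals, here is a stdlib-only sketch. The timing samples are synthetic, the xorshift PRNG is a deliberately tiny deterministic stand-in for a real random source, and 1,000 resamples is far fewer than a production analysis would use; this is an illustration of the technique, not Criterion's implementation.

```rust
// Bootstrap confidence interval for the mean (stdlib-only sketch with a
// deterministic xorshift PRNG; synthetic data, illustration only).
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    // Synthetic "timings" in nanoseconds.
    let samples: Vec<f64> = vec![1200.0, 1210.0, 1190.0, 1205.0, 1198.0, 1202.0, 1215.0, 1207.0];
    let n = samples.len();
    let mut state = 0x1234_5678_9abc_def0u64;

    // Resample with replacement and record the mean of each resample.
    let mut means: Vec<f64> = (0..1_000)
        .map(|_| {
            let sum: f64 = (0..n)
                .map(|_| samples[(xorshift(&mut state) as usize) % n])
                .sum();
            sum / n as f64
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());

    // Approximate 95% interval: the 2.5th and 97.5th percentiles of the means.
    let (lo, hi) = (means[25], means[974]);
    println!("bootstrap mean CI: [{lo:.1} ns, {hi:.1} ns]");
    // Every bootstrap mean must lie within the sample's own range.
    assert!(lo >= 1190.0 && hi <= 1215.0);
    assert!(lo <= hi);
}
```

Resampling from the measured data, rather than assuming a particular distribution, is what makes the interval robust to the skewed, outlier-heavy distributions that benchmark timings typically have.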