When using rayon, how does par_chunks differ from par_iter for slice processing?

par_iter processes elements individually in parallel: rayon recursively splits the slice and its work-stealing scheduler decides how finely to divide the work, potentially down to single elements. par_chunks divides the slice into fixed-size chunks first, then processes each chunk in parallel, where each parallel task handles multiple consecutive elements sequentially within its chunk. The key difference is granularity: par_iter allows parallelism down to the element level, while par_chunks reduces parallel overhead by batching elements into fewer, larger tasks. Choose par_iter for expensive or variable-cost operations where per-element overhead is negligible relative to the work; choose par_chunks when per-element work is cheap and the overhead of parallel task management would dominate.

Basic par_iter Processing

use rayon::prelude::*;
 
fn par_iter_example() {
    let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // par_iter: each element processed as separate parallel task
    let squares: Vec<i32> = data.par_iter()
        .map(|&x| x * x)
        .collect();
    
    println!("Squares: {:?}", squares);
    // [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
}
 
fn par_iter_characteristics() {
    let data: Vec<i32> = (0..1000).collect();
    
    // Creates ~1000 parallel tasks (conceptually)
    // Rayon's work-stealing distributes them across threads
    let sum: i32 = data.par_iter().sum();
    
    // Each element is independently scheduled
    // Maximum parallelism but also maximum scheduling overhead
}

par_iter treats each element as an independent parallel unit of work.

Basic par_chunks Processing

use rayon::prelude::*;
 
fn par_chunks_example() {
    let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // par_chunks: divide into chunks of 3 elements each
    let chunk_sums: Vec<i32> = data.par_chunks(3)
        .map(|chunk| chunk.iter().sum())
        .collect();
    
    println!("Chunk sums: {:?}", chunk_sums);
    // [6, 15, 24, 10]  // chunks: [1,2,3], [4,5,6], [7,8,9], [10]
}
 
fn par_chunks_characteristics() {
    let data: Vec<i32> = (0..1000).collect();
    
    // Creates ~1000/100 = 10 parallel tasks
    // Each task processes 100 elements sequentially
    let sum: i32 = data.par_chunks(100)
        .map(|chunk| chunk.iter().sum())
        .sum();
    
    // Fewer parallel tasks, less scheduling overhead
    // Each thread processes a batch sequentially
}

par_chunks groups elements into chunks, processing each chunk as a unit.

Visualizing the Difference

use rayon::prelude::*;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
 
fn visualize_difference() {
    let data: Vec<i32> = (0..12).collect();
    let counter = Arc::new(AtomicUsize::new(0));
    
    // par_iter: 12 parallel tasks (conceptually)
    // [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]
    // Each element is its own task
    
    let c1 = counter.clone();
    let iter_count = data.par_iter()
        .map(|&x| {
            c1.fetch_add(1, Ordering::SeqCst);
            x * 2
        })
        .count();
    
    println!("par_iter processed {} elements", iter_count);
    
    // par_chunks(4): 3 parallel tasks
    // [0,1,2,3] | [4,5,6,7] | [8,9,10,11]
    // Each chunk is one task
    
    let c2 = counter.clone();
    let chunk_count = data.par_chunks(4)
        .map(|chunk| {
            c2.fetch_add(1, Ordering::SeqCst);
            chunk.len()
        })
        .count();
    
    println!("par_chunks created {} tasks", chunk_count);
    // par_chunks created 3 tasks (vs 12 for par_iter)
}

The number of parallel tasks differs dramatically based on chunk size.

Performance: Overhead vs Parallelism

use rayon::prelude::*;
use std::time::Instant;
 
fn performance_comparison() {
    let data: Vec<i32> = (0..1_000_000).collect();
    
    // Cheap operation: overhead dominates with par_iter
    let start = Instant::now();
    let sum: i64 = data.par_iter()
        .map(|&x| x as i64)
        .sum();
    let iter_time = start.elapsed();
    println!("par_iter (cheap): {:?}", iter_time);
    
    // par_chunks reduces overhead for cheap operations
    let start = Instant::now();
    let sum: i64 = data.par_chunks(10_000)
        .map(|chunk| chunk.iter().map(|&x| x as i64).sum())
        .sum();
    let chunks_time = start.elapsed();
    println!("par_chunks (cheap): {:?}", chunks_time);
    
    // For very cheap operations, larger chunks can be faster
    // due to reduced parallel scheduling overhead
}
 
fn expensive_operation_comparison() {
    let data: Vec<i32> = (0..1000).collect();
    
    // Expensive operation: parallelism worth the overhead
    let start = Instant::now();
    let results: Vec<_> = data.par_iter()
        .map(|&x| {
            // Simulate expensive computation
            (0..1000).fold(x, |acc, _| acc.wrapping_mul(7).wrapping_add(1))
        })
        .collect();
    let iter_time = start.elapsed();
    println!("par_iter (expensive): {:?}", iter_time);
    
    // For expensive operations, par_iter is often better
    // Work-stealing can balance load dynamically
}

Cheap operations benefit from chunking; expensive operations benefit from fine-grained parallelism.

Cache Locality Benefits

use rayon::prelude::*;
 
fn cache_locality() {
    let data: Vec<Vec<i32>> = (0..1000)
        .map(|i| (0..1000).map(|j| i * 1000 + j).collect())
        .collect();
    
    // par_iter: processes each inner vector in parallel
    // Good for independent operations on each vector
    let sums: Vec<i64> = data.par_iter()
        .map(|inner| inner.iter().map(|&x| x as i64).sum())
        .collect();
    
    // par_chunks: processes multiple vectors per task
    // Better cache locality for the chunk of vectors
    let sums: Vec<i64> = data.par_chunks(100)
        .map(|chunk| {
            // Process multiple inner vectors in one task
            // Data stays in cache longer
            chunk.iter().map(|inner| {
                inner.iter().map(|&x| x as i64).sum()
            }).sum()
        })
        .collect();
}

par_chunks can improve cache locality by keeping related data in a single task.

When par_iter Excels

use rayon::prelude::*;
 
fn par_iter_advantages() {
    // 1. Uniform expensive operations
    let data: Vec<i32> = (0..1000).collect();
    let hashes: Vec<u64> = data.par_iter()
        .map(|&x| {
            // Each hash is expensive - parallelism worthwhile
            let mut hash = x as u64;
            for _ in 0..1000 {
                hash = hash.wrapping_mul(31).wrapping_add(hash >> 3);
            }
            hash
        })
        .collect();
    
    // 2. Variable work per element (work-stealing helps)
    let variable_work: Vec<i32> = (0..1000).collect();
    let results: Vec<i64> = variable_work.par_iter()
        .map(|&n| {
            // Some elements are much more expensive
            (0..n).fold(0i64, |acc, i| acc + i as i64)
        })
        .collect();
    // Rayon's work-stealing balances load automatically
    
    // 3. When you need per-element parallelism
    let flags: Vec<bool> = data.par_iter()
        .map(|&x| x % 2 == 0)
        .collect();
}
 
fn variable_workload_example() {
    // Work-stealing demonstration
    let data: Vec<u32> = (0..100).collect();
    
    // Some tasks are fast, some are slow
    let results: Vec<u64> = data.par_iter()
        .map(|&n| {
            // Tasks with larger n do more work
            (0..n * 1000).fold(0u64, |acc, _| acc.wrapping_add(1))
        })
        .collect();
    
    // par_iter with work-stealing: fast tasks can help with slow tasks
    // par_chunks: one slow chunk blocks its thread
}

par_iter excels with expensive or variable-cost operations where work-stealing matters.

When par_chunks Excels

use rayon::prelude::*;
 
fn par_chunks_advantages() {
    let data: Vec<i32> = (0..1_000_000).collect();
    
    // 1. Very cheap operations - avoid overhead
    let sum: i64 = data.par_chunks(50_000)
        .map(|chunk| chunk.iter().map(|&x| x as i64).sum())
        .sum();
    
    // 2. Sequential processing within a chunk is natural
    // (write through par_chunks_mut so each task has exclusive access
    // to its own output slice; indexing a shared `output` from inside
    // a parallel closure would not compile)
    let mut output = vec![0i32; data.len()];
    output.par_chunks_mut(10_000)
        .zip(data.par_chunks(10_000))
        .for_each(|(out_chunk, in_chunk)| {
            for (out, &val) in out_chunk.iter_mut().zip(in_chunk) {
                // Sequential processing within chunk
                *out = val * 2;
            }
        });
    
    // 3. Reducing contention on shared resources
    let results: Vec<i64> = data.par_chunks(100_000)
        .map(|chunk| {
            // In real code, acquire a lock or make a network call once
            // per chunk here, rather than once per element
            let local_sum: i64 = chunk.iter().map(|&x| x as i64).sum();
            local_sum
        })
        .collect();
}
 
fn memory_efficiency() {
    let large_data: Vec<f64> = vec![0.0; 100_000_000];
    
    // par_iter: many parallel tasks, each accessing large_data
    // More memory pressure from parallel execution contexts
    
    // par_chunks: fewer tasks, sequential access within chunk
    // Better memory access patterns
    let sum: f64 = large_data.par_chunks(1_000_000)
        .map(|chunk| chunk.iter().sum())
        .sum();
}

par_chunks excels when overhead dominates or when sequential processing within chunks is beneficial.

Chunk Size Selection

use rayon::prelude::*;
 
fn chunk_size_guidance() {
    let data: Vec<i32> = (0..1_000_000).collect();
    
    // Too small: overhead dominates
    let sum: i64 = data.par_chunks(1)
        .map(|chunk| chunk.iter().map(|&x| x as i64).sum())
        .sum();
    // Same as par_iter - 1 million tasks
    
    // Too large: not enough parallelism
    let sum: i64 = data.par_chunks(1_000_000)
        .map(|chunk| chunk.iter().map(|&x| x as i64).sum())
        .sum();
    // Single task - no parallelism
    
    // Balance: enough chunks to utilize all threads
    let num_threads = rayon::current_num_threads();
    // ~10 chunks per thread; max(1) guards against a zero chunk size
    // (par_chunks panics if the chunk size is 0)
    let chunk_size = (data.len() / (num_threads * 10)).max(1);
    
    let sum: i64 = data.par_chunks(chunk_size)
        .map(|chunk| chunk.iter().map(|&x| x as i64).sum())
        .sum();
    
    // General guidance:
    // - Cheap operations: larger chunks (1000-10000 elements)
    // - Moderate operations: medium chunks (100-1000 elements)
    // - Expensive operations: smaller chunks (1-100 elements)
    // - Benchmark for your specific case
}

Chunk size balances parallelism overhead against scheduling costs.

par_chunks_mut for In-Place Modification

use rayon::prelude::*;
 
fn par_chunks_mut_example() {
    let mut data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // Modify each chunk in place
    data.par_chunks_mut(3)
        .for_each(|chunk| {
            for item in chunk.iter_mut() {
                *item *= 2;
            }
        });
    
    println!("Modified: {:?}", data);
    // [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
    
    // par_iter_mut equivalent
    let mut data2 = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    data2.par_iter_mut().for_each(|item| *item *= 2);
}
 
fn in_place_batch_processing() {
    let mut image = vec![0u8; 1920 * 1080 * 3];  // RGB image, 3 bytes per pixel
    
    // Process 100 rows at a time
    let row_stride = 1920 * 3;
    image.par_chunks_mut(row_stride * 100)
        .for_each(|chunk| {
            for channel in chunk.iter_mut() {
                // Brighten each channel value by 10%
                // (float-to-u8 casts saturate at 255 in Rust)
                *channel = (*channel as f32 * 1.1) as u8;
            }
        });
}

par_chunks_mut enables batch in-place modifications.

Combining with Other Operations

use rayon::prelude::*;
 
fn combined_operations() {
    let data: Vec<i32> = (0..1000).collect();
    
    // par_chunks + filter + map
    let even_sums: Vec<i64> = data.par_chunks(100)
        .map(|chunk| {
            chunk.iter()
                .filter(|&&x| x % 2 == 0)
                .map(|&x| x as i64)
                .sum()
        })
        .collect();
    
    // par_iter equivalent (more tasks, same result)
    let even_sum: i64 = data.par_iter()
        .filter(|&&x| x % 2 == 0)
        .map(|&x| x as i64)
        .sum();
    
    // Sequential processing within parallel chunks
    let results: Vec<Vec<i32>> = data.par_chunks(50)
        .map(|chunk| {
            // Build local result, avoiding contention
            chunk.iter()
                .filter(|&&x| x % 3 == 0)
                .cloned()
                .collect()
        })
        .collect();
    
    // Flatten results
    let all_results: Vec<i32> = results.into_iter()
        .flatten()
        .collect();
}

Both integrate with iterator adapters; par_chunks wraps sequential iteration per chunk.

Real-World Example: Batch Processing

use rayon::prelude::*;
use std::collections::HashMap;
 
fn batch_database_updates() {
    struct Update { id: u64, value: String }
    
    let updates: Vec<Update> = (0..10_000)
        .map(|i| Update {
            id: i,
            value: format!("value_{}", i),
        })
        .collect();
    
    // Batch updates - reduce database round trips
    let results: Vec<Vec<u64>> = updates.par_chunks(100)
        .map(|batch| {
            // Simulate batch database update
            batch.iter().map(|u| u.id).collect()
        })
        .collect();
    
    // par_iter would require 10,000 individual updates
    // par_chunks requires only 100 batch updates
}
 
fn histogram_with_chunks() {
    let data: Vec<u32> = (0..1_000_000).map(|x| x % 100).collect();
    
    // Build local histograms in each chunk, then merge
    let global_histogram: HashMap<u32, u64> = data.par_chunks(100_000)
        .map(|chunk| {
            let mut local_hist = HashMap::new();
            for &val in chunk {
                *local_hist.entry(val).or_insert(0) += 1;
            }
            local_hist
        })
        .reduce(HashMap::new, |mut acc, local| {
            for (k, v) in local {
                *acc.entry(k).or_insert(0) += v;
            }
            acc
        });
    
    // Fewer HashMap operations than per-element parallelism
}

par_chunks enables efficient batch processing patterns.

Comparison with par_split

use rayon::prelude::*;
 
fn split_vs_chunks() {
    let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // par_chunks: fixed-size chunks
    // [1,2,3,4] | [5,6,7,8] | [9,10]
    let chunk_results: Vec<&[i32]> = data.par_chunks(4).collect();
    
    // par_split: splits wherever a predicate matches, dropping the
    // separator elements - segment sizes depend on the data
    let segments: Vec<&[i32]> = data.par_split(|&x| x % 4 == 0).collect();
    // [1,2,3] | [5,6,7] | [9,10]  (4 and 8 are the separators)
    
    // par_chunks_exact: yields only full-size chunks; leftover
    // elements are not yielded but are available via remainder()
    let exact = data.par_chunks_exact(3);
    let leftover = exact.remainder();  // [10]
    let exact_chunks: Vec<&[i32]> = exact.collect();
    // [1,2,3] | [4,5,6] | [7,8,9]
}

par_chunks provides fixed-size chunking; par_chunks_exact yields only full-size chunks and exposes leftover elements via remainder().

Decision Guide

use rayon::prelude::*;
 
fn decision_guide() {
    // Use par_iter when:
    // 1. Per-element operation is expensive (computations, I/O)
    // 2. Work per element varies significantly
    // 3. You need fine-grained load balancing
    // 4. Elements are independent and uniform
    
    // Use par_chunks when:
    // 1. Per-element operation is cheap (simple arithmetic)
    // 2. You're doing batch operations (database, network)
    // 3. Sequential processing within batch is natural
    // 4. You want to reduce parallel overhead
    // 5. Cache locality matters
    
    // Benchmark both for your specific case!
}
 
fn benchmark_template() {
    let data: Vec<i64> = (0..1_000_000).collect();
    
    // Test different approaches
    let start = std::time::Instant::now();
    let sum: i64 = data.par_iter().sum();
    println!("par_iter: {:?}", start.elapsed());
    
    for chunk_size in [100, 1000, 10000, 100000] {
        let start = std::time::Instant::now();
        let sum: i64 = data.par_chunks(chunk_size)
            .map(|c| c.iter().sum())
            .sum();
        println!("par_chunks({}): {:?}", chunk_size, start.elapsed());
    }
}

Benchmark both approaches to find the optimal strategy for your workload.

Comparison Summary

Aspect            par_iter                     par_chunks
Parallelism       Per-element                  Per-chunk
Tasks created     N (one per element)          N / chunk_size
Overhead          Higher (more tasks)          Lower (fewer tasks)
Load balancing    Fine-grained work-stealing   Coarse-grained
Cache locality    Depends on access pattern    Better (sequential in chunk)
Best for          Expensive, variable work     Cheap, uniform work
Batch operations  Awkward                      Natural
Memory pattern    Random access                Sequential within chunk

Synthesis

The choice between par_iter and par_chunks represents a trade-off between parallelism granularity and overhead:

Use par_iter when:

  • Per-element work is computationally expensive
  • Work per element varies (work-stealing helps balance)
  • Elements are truly independent
  • You want maximum parallelism with automatic load balancing
  • The overhead of task management is negligible compared to the work

Use par_chunks when:

  • Per-element work is cheap and uniform
  • You're performing batch operations (database updates, API calls)
  • Sequential processing within a batch is natural
  • You want to reduce parallel scheduling overhead
  • Cache locality matters for your workload

Key insight: par_chunks is essentially par_iter with batching built in: it divides work into larger chunks before parallelizing, reducing the number of parallel tasks. The optimal chunk size depends on your specific workload: large enough to amortize scheduling overhead but small enough to utilize all threads effectively. For truly expensive operations, par_iter with its fine-grained work-stealing often performs better. For cheap operations or batch-style work, par_chunks with an appropriately chosen size wins. When in doubt, benchmark both approaches with realistic data sizes.