Rust walkthroughs
In rayon, how does par_chunks differ from par_iter for slice processing?

par_iter exposes each element as an independent unit of parallel work and lets rayon's work-stealing scheduler decide how to split and distribute the range across threads. par_chunks divides the slice into fixed-size chunks first, then parallelizes over the chunks, with each parallel task handling its chunk's elements sequentially. The key difference is granularity: par_iter allows parallelism down to the element level, while par_chunks reduces parallel overhead by batching elements into fewer, larger tasks. Choose par_iter for expensive or variable-cost operations where fine-grained load balancing pays off; choose par_chunks when per-element work is cheap and the overhead of parallel task management would otherwise dominate.
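The difference is easiest to see in the sequential counterparts that these adapters parallelize: iter visits elements one at a time, while chunks yields fixed-size sub-slices (the last may be shorter). A minimal std-only sketch, before any rayon is involved:

```rust
fn main() {
    let data = [1, 2, 3, 4, 5, 6, 7];
    // Element-level view: what par_iter parallelizes
    let per_element: Vec<i32> = data.iter().map(|&x| x * x).collect();
    // Chunk-level view: what par_chunks parallelizes
    let per_chunk: Vec<i32> = data.chunks(3).map(|c| c.iter().sum()).collect();
    println!("per-element: {:?}", per_element); // [1, 4, 9, 16, 25, 36, 49]
    println!("per-chunk:   {:?}", per_chunk);   // [6, 15, 7] -- chunks [1,2,3], [4,5,6], [7]
}
```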
use rayon::prelude::*;
fn par_iter_example() {
let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// par_iter: each element processed as separate parallel task
let squares: Vec<i32> = data.par_iter()
.map(|&x| x * x)
.collect();
println!("Squares: {:?}", squares);
// [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
}
fn par_iter_characteristics() {
let data: Vec<i32> = (0..1000).collect();
// Creates ~1000 parallel tasks (conceptually)
// Rayon's work-stealing distributes them across threads
let sum: i32 = data.par_iter().sum();
// Each element is independently scheduled
// Maximum parallelism but also maximum scheduling overhead
}
par_iter treats each element as an independent parallel unit of work.
use rayon::prelude::*;
fn par_chunks_example() {
let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// par_chunks: divide into chunks of 3 elements each
let chunk_sums: Vec<i32> = data.par_chunks(3)
.map(|chunk| chunk.iter().sum())
.collect();
println!("Chunk sums: {:?}", chunk_sums);
// [6, 15, 24, 10] // chunks: [1,2,3], [4,5,6], [7,8,9], [10]
}
fn par_chunks_characteristics() {
let data: Vec<i32> = (0..1000).collect();
// Creates ~1000/100 = 10 parallel tasks
// Each task processes 100 elements sequentially
let sum: i32 = data.par_chunks(100)
.map(|chunk| chunk.iter().sum())
.sum();
// Fewer parallel tasks, less scheduling overhead
// Each thread processes a batch sequentially
}
par_chunks groups elements into chunks, processing each chunk as a unit.
use rayon::prelude::*;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
fn visualize_difference() {
let data: Vec<i32> = (0..12).collect();
let counter = Arc::new(AtomicUsize::new(0));
// par_iter: 12 parallel tasks (conceptually)
// [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]
// Each element is its own task
let c1 = counter.clone();
let iter_count = data.par_iter()
.map(|&x| {
c1.fetch_add(1, Ordering::SeqCst);
x * 2
})
.count();
println!("par_iter processed {} elements", iter_count);
// par_chunks(4): 3 parallel tasks
// [0,1,2,3] | [4,5,6,7] | [8,9,10,11]
// Each chunk is one task
let c2 = counter.clone();
let chunk_count = data.par_chunks(4)
.map(|chunk| {
c2.fetch_add(1, Ordering::SeqCst);
chunk.len()
})
.count();
println!("par_chunks created {} tasks", chunk_count);
// par_chunks created 3 tasks (vs 12 for par_iter)
}
The number of parallel tasks differs dramatically based on chunk size.
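The chunk count follows from a ceiling division, since the last chunk may be short. A small helper (the name is ours, for illustration) makes the arithmetic explicit:

```rust
// par_chunks(size) over a slice of length len yields
// ceiling(len / size) chunks; the last one may be shorter.
fn num_chunks(len: usize, size: usize) -> usize {
    (len + size - 1) / size
}

fn main() {
    assert_eq!(num_chunks(12, 4), 3);      // [0,1,2,3] | [4,5,6,7] | [8,9,10,11]
    assert_eq!(num_chunks(10, 3), 4);      // last chunk holds a single element
    assert_eq!(num_chunks(1000, 100), 10); // the 1000-element example above
    println!("ok");
}
```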
use rayon::prelude::*;
use std::time::Instant;
fn performance_comparison() {
let data: Vec<i32> = (0..1_000_000).collect();
// Cheap operation: overhead dominates with par_iter
let start = Instant::now();
let sum: i64 = data.par_iter()
.map(|&x| x as i64)
.sum();
let iter_time = start.elapsed();
println!("par_iter (cheap): {:?}", iter_time);
// par_chunks reduces overhead for cheap operations
let start = Instant::now();
let sum: i64 = data.par_chunks(10_000)
.map(|chunk| chunk.iter().map(|&x| x as i64).sum())
.sum();
let chunks_time = start.elapsed();
println!("par_chunks (cheap): {:?}", chunks_time);
// For very cheap operations, larger chunks can be faster
// due to reduced parallel scheduling overhead
}
fn expensive_operation_comparison() {
let data: Vec<i32> = (0..1000).collect();
// Expensive operation: parallelism worth the overhead
let start = Instant::now();
let results: Vec<_> = data.par_iter()
.map(|&x| {
// Simulate expensive computation
(0..1000).fold(x, |acc, _| acc.wrapping_mul(7).wrapping_add(1))
})
.collect();
let iter_time = start.elapsed();
println!("par_iter (expensive): {:?}", iter_time);
// For expensive operations, par_iter is often better
// Work-stealing can balance load dynamically
}
Cheap operations benefit from chunking; expensive operations benefit from fine-grained parallelism.
use rayon::prelude::*;
fn cache_locality() {
let data: Vec<Vec<i32>> = (0..1000)
.map(|i| (0..1000).map(|j| i * 1000 + j).collect())
.collect();
// par_iter: processes each inner vector in parallel
// Good for independent operations on each vector
let sums: Vec<i64> = data.par_iter()
.map(|inner| inner.iter().map(|&x| x as i64).sum())
.collect();
// par_chunks: processes multiple vectors per task
// Better cache locality for the chunk of vectors
let sums: Vec<i64> = data.par_chunks(100)
.map(|chunk| {
// Process multiple inner vectors in one task
// Data stays in cache longer
chunk.iter().map(|inner| {
inner.iter().map(|&x| x as i64).sum()
}).sum()
})
.collect();
}
par_chunks can improve cache locality by keeping related data in a single task.
use rayon::prelude::*;
fn par_iter_advantages() {
// 1. Uniform expensive operations
let data: Vec<i32> = (0..1000).collect();
let hashes: Vec<u64> = data.par_iter()
.map(|&x| {
// Each hash is expensive - parallelism worthwhile
let mut hash = x as u64;
for _ in 0..1000 {
hash = hash.wrapping_mul(31).wrapping_add(hash >> 3);
}
hash
})
.collect();
// 2. Variable work per element (work-stealing helps)
let variable_work: Vec<i32> = (0..1000).collect();
let results: Vec<i64> = variable_work.par_iter()
.map(|&n| {
// Some elements are much more expensive
(0..n).fold(0i64, |acc, i| acc + i as i64)
})
.collect();
// Rayon's work-stealing balances load automatically
// 3. When you need per-element parallelism
let flags: Vec<bool> = data.par_iter()
.map(|&x| x % 2 == 0)
.collect();
}
fn variable_workload_example() {
// Work-stealing demonstration
let data: Vec<u32> = (0..100).collect();
// Some tasks are fast, some are slow
let results: Vec<u64> = data.par_iter()
.map(|&n| {
// Tasks with larger n do more work
(0..n * 1000).fold(0u64, |acc, _| acc.wrapping_add(1))
})
.collect();
// par_iter with work-stealing: fast tasks can help with slow tasks
// par_chunks: one slow chunk blocks its thread
}
par_iter excels with expensive or variable-cost operations where work-stealing matters.
use rayon::prelude::*;
fn par_chunks_advantages() {
let data: Vec<i32> = (0..1_000_000).collect();
// 1. Very cheap operations - avoid overhead
let sum: i64 = data.par_chunks(50_000)
.map(|chunk| chunk.iter().map(|&x| x as i64).sum())
.sum();
// 2. Sequential processing within chunk is natural
let mut output = vec![0i32; data.len()];
// Write through disjoint mutable chunks of the output;
// a shared `&mut output` captured by the parallel closure would not compile
output.par_chunks_mut(10_000)
.zip(data.par_chunks(10_000))
.for_each(|(out_chunk, in_chunk)| {
// Sequential processing within each chunk pair
// Can use local state efficiently
for (out, &val) in out_chunk.iter_mut().zip(in_chunk) {
*out = val * 2;
}
});
// 3. Reducing contention on shared resources
let results: Vec<i64> = data.par_chunks(100_000)
.map(|chunk| {
// Pay shared-resource costs (e.g. a lock acquisition) once per chunk, not per element
let local_sum: i64 = chunk.iter().map(|&x| x as i64).sum();
local_sum
})
.collect();
}
fn memory_efficiency() {
let large_data: Vec<f64> = vec![0.0; 100_000_000];
// par_iter: many parallel tasks, each accessing large_data
// More memory pressure from parallel execution contexts
// par_chunks: fewer tasks, sequential access within chunk
// Better memory access patterns
let sum: f64 = large_data.par_chunks(1_000_000)
.map(|chunk| chunk.iter().sum())
.sum();
}
par_chunks excels when overhead dominates or when sequential processing within chunks is beneficial.
use rayon::prelude::*;
fn chunk_size_guidance() {
let data: Vec<i32> = (0..1_000_000).collect();
// Too small: overhead dominates
let sum: i64 = data.par_chunks(1)
.map(|chunk| chunk.iter().map(|&x| x as i64).sum())
.sum();
// Same as par_iter - 1 million tasks
// Too large: not enough parallelism
let sum: i64 = data.par_chunks(1_000_000)
.map(|chunk| chunk.iter().map(|&x| x as i64).sum())
.sum();
// Single task - no parallelism
// Balance: enough chunks to utilize all threads
let num_threads = rayon::current_num_threads();
let chunk_size = (data.len() / (num_threads * 10)).max(1); // ~10 chunks per thread, never zero
let sum: i64 = data.par_chunks(chunk_size)
.map(|chunk| chunk.iter().map(|&x| x as i64).sum())
.sum();
// General guidance:
// - Cheap operations: larger chunks (1000-10000 elements)
// - Moderate operations: medium chunks (100-1000 elements)
// - Expensive operations: smaller chunks (1-100 elements)
// - Benchmark for your specific case
}
Chunk size balances parallelism overhead against scheduling costs.
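The "about ten chunks per thread" rule above can be packaged as a small helper. This is a sketch with names of our own choosing, not a rayon API; the .max(1) guards tiny inputs, since par_chunks panics on a chunk size of zero:

```rust
// Hypothetical helper: pick a chunk size that yields roughly
// `chunks_per_thread` chunks per worker thread.
fn suggest_chunk_size(len: usize, num_threads: usize, chunks_per_thread: usize) -> usize {
    // Guard against a zero chunk size when the input is smaller
    // than num_threads * chunks_per_thread
    (len / (num_threads * chunks_per_thread)).max(1)
}

fn main() {
    assert_eq!(suggest_chunk_size(1_000_000, 8, 10), 12_500);
    assert_eq!(suggest_chunk_size(50, 8, 10), 1); // tiny input: one element per chunk
    println!("ok");
}
```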
use rayon::prelude::*;
fn par_chunks_mut_example() {
let mut data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// Modify each chunk in place
data.par_chunks_mut(3)
.for_each(|chunk| {
for item in chunk.iter_mut() {
*item *= 2;
}
});
println!("Modified: {:?}", data);
// [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
// par_iter_mut equivalent
let mut data2 = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
data2.par_iter_mut().for_each(|item| *item *= 2);
}
fn in_place_batch_processing() {
let mut image = vec![0u8; 1920 * 1080 * 3]; // RGB image
// Process 100 rows at a time
let row_stride = 1920 * 3;
image.par_chunks_mut(row_stride * 100)
.for_each(|chunk| {
// Sequential pass over this band of rows
for pixel in chunk.iter_mut() {
// Brighten each channel by 10%, clamped to 255
*pixel = (*pixel as f32 * 1.1).min(255.0) as u8;
}
});
}
par_chunks_mut enables batch in-place modifications.
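The same pattern exists sequentially as chunks_mut, which hands out disjoint mutable sub-slices; this disjointness is what lets par_chunks_mut safely distribute mutation across threads. A std-only sketch:

```rust
fn main() {
    let mut data = vec![1, 2, 3, 4, 5, 6];
    // chunks_mut yields non-overlapping &mut [i32] sub-slices;
    // par_chunks_mut parallelizes exactly this division of the data
    for chunk in data.chunks_mut(2) {
        for item in chunk.iter_mut() {
            *item *= 10;
        }
    }
    assert_eq!(data, vec![10, 20, 30, 40, 50, 60]);
    println!("{:?}", data);
}
```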
use rayon::prelude::*;
fn combined_operations() {
let data: Vec<i32> = (0..1000).collect();
// par_chunks + filter + map
let even_sums: Vec<i64> = data.par_chunks(100)
.map(|chunk| {
chunk.iter()
.filter(|&&x| x % 2 == 0)
.map(|&x| x as i64)
.sum()
})
.collect();
// par_iter equivalent (more tasks, same result)
let even_sum: i64 = data.par_iter()
.filter(|&&x| x % 2 == 0)
.map(|&x| x as i64)
.sum();
// Sequential processing within parallel chunks
let results: Vec<Vec<i32>> = data.par_chunks(50)
.map(|chunk| {
// Build local result, avoiding contention
chunk.iter()
.filter(|&&x| x % 3 == 0)
.cloned()
.collect()
})
.collect();
// Flatten results
let all_results: Vec<i32> = results.into_iter()
.flatten()
.collect();
}
Both integrate with iterator adapters; par_chunks wraps sequential iteration per chunk.
use rayon::prelude::*;
use std::collections::HashMap;
fn batch_database_updates() {
struct Update { id: u64, value: String }
let updates: Vec<Update> = (0..10_000)
.map(|i| Update {
id: i,
value: format!("value_{}", i),
})
.collect();
// Batch updates - reduce database round trips
let results: Vec<Vec<u64>> = updates.par_chunks(100)
.map(|batch| {
// Simulate batch database update
batch.iter().map(|u| u.id).collect()
})
.collect();
// par_iter would require 10,000 individual updates
// par_chunks requires only 100 batch updates
}
fn histogram_with_chunks() {
let data: Vec<u32> = (0..1_000_000).map(|x| x % 100).collect();
// Build local histograms in each chunk, then merge
let global_histogram: HashMap<u32, u64> = data.par_chunks(100_000)
.map(|chunk| {
let mut local_hist = HashMap::new();
for &val in chunk {
*local_hist.entry(val).or_insert(0) += 1;
}
local_hist
})
.reduce(HashMap::new, |mut acc, local| {
for (k, v) in local {
*acc.entry(k).or_insert(0) += v;
}
acc
});
// Fewer HashMap operations than per-element parallelism
}
par_chunks enables efficient batch processing patterns.
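The local-then-merge structure is independent of rayon; a std-only sketch over sequential chunks shows the same two phases (local histograms, then a fold into a global map):

```rust
use std::collections::HashMap;

fn main() {
    let data: Vec<u32> = (0..100).map(|x| x % 4).collect();
    // Phase 1: build one local histogram per chunk
    // Phase 2: fold the local maps into a global one
    let global: HashMap<u32, u64> = data
        .chunks(25)
        .map(|chunk| {
            let mut local: HashMap<u32, u64> = HashMap::new();
            for &v in chunk {
                *local.entry(v).or_insert(0) += 1;
            }
            local
        })
        .fold(HashMap::new(), |mut acc, local| {
            for (k, v) in local {
                *acc.entry(k).or_insert(0) += v;
            }
            acc
        });
    assert_eq!(global[&0], 25); // residues 0..4 are uniform over 0..100
    println!("distinct keys: {}", global.len()); // 4
}
```

Swapping chunks for par_chunks and fold for reduce (with HashMap::new as the identity) gives the parallel version above.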
use rayon::prelude::*;
fn split_vs_chunks() {
let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// par_chunks: fixed size chunks
// [1,2,3,4] | [5,6,7,8] | [9,10]
let chunk_results: Vec<&[i32]> = data.par_chunks(4).collect();
// par_split: splits wherever a separator predicate matches
// (ParallelSlice::par_split, e.g. data.par_split(|&x| x == 0))
// par_chunks, by contrast, is for fixed-size batch division
// par_chunks_exact: every yielded chunk has exactly the requested size
let exact_chunks: Vec<&[i32]> = data.par_chunks_exact(3).collect();
// [1,2,3] | [4,5,6] | [7,8,9] -- [10] is not yielded
// par_chunks_exact provides remainder() for leftover elements
}
par_chunks provides fixed-size chunking; the exact variant guarantees uniform chunk sizes, exposing any leftover elements via remainder().
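The std slice API mirrors this contract: chunks_exact yields only full-size chunks and exposes the tail through remainder(). A sequential sketch of the same semantics:

```rust
fn main() {
    let data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let mut it = data.chunks_exact(3);
    let full: Vec<&[i32]> = it.by_ref().collect();
    assert_eq!(full, vec![&[1, 2, 3][..], &[4, 5, 6][..], &[7, 8, 9][..]]);
    // [10] is never yielded; it is only reachable through remainder()
    assert_eq!(it.remainder(), &[10]);
    println!("ok");
}
```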
use rayon::prelude::*;
fn decision_guide() {
// Use par_iter when:
// 1. Per-element operation is expensive (computations, I/O)
// 2. Work per element varies significantly
// 3. You need fine-grained load balancing
// 4. Elements are independent and uniform
// Use par_chunks when:
// 1. Per-element operation is cheap (simple arithmetic)
// 2. You're doing batch operations (database, network)
// 3. Sequential processing within batch is natural
// 4. You want to reduce parallel overhead
// 5. Cache locality matters
// Benchmark both for your specific case!
}
fn benchmark_template() {
let data: Vec<i64> = (0..1_000_000).collect();
// Test different approaches
let start = std::time::Instant::now();
let sum: i64 = data.par_iter().sum();
println!("par_iter: {:?}", start.elapsed());
for chunk_size in [100, 1000, 10000, 100000] {
let start = std::time::Instant::now();
let sum: i64 = data.par_chunks(chunk_size)
.map(|c| c.iter().sum())
.sum();
println!("par_chunks({}): {:?}", chunk_size, start.elapsed());
}
}
Benchmark both approaches to find the optimal strategy for your workload.
| Aspect | par_iter | par_chunks |
|--------|------------|--------------|
| Parallelism | Per-element | Per-chunk |
| Tasks created | Up to N (adaptive splitting) | ⌈N / chunk_size⌉ |
| Overhead | Higher (more tasks) | Lower (fewer tasks) |
| Load balancing | Fine-grained work-stealing | Coarse-grained |
| Cache locality | Depends on access pattern | Better (sequential in chunk) |
| Best for | Expensive, variable work | Cheap, uniform work |
| Batch operations | Awkward | Natural |
| Memory pattern | Random access | Sequential within chunk |
The choice between par_iter and par_chunks represents a trade-off between parallelism granularity and overhead:
Use par_iter when:
- Per-element work is expensive (heavy computation, I/O)
- Work per element varies and fine-grained work-stealing pays off
- You need per-element load balancing and results

Use par_chunks when:
- Per-element work is cheap and scheduling overhead would dominate
- You are doing batch-style operations (database, network)
- Sequential processing within a batch is natural
- Cache locality or reduced contention on shared resources matters
Key insight: par_chunks is essentially par_iter with batching built in—it divides work into larger chunks before parallelizing, reducing the number of parallel tasks. The optimal chunk size depends on your specific workload: large enough to amortize scheduling overhead but small enough to utilize all threads effectively. For truly expensive operations, par_iter with its fine-grained work-stealing often performs better. For cheap operations or batch-style work, par_chunks with an appropriately chosen size wins. When in doubt, benchmark both approaches with realistic data sizes.