How does Rayon's par_iter() handle work stealing and thread scheduling under the hood?

Rayon transforms sequential iteration into parallel execution through a sophisticated work-stealing scheduler built on fork-join parallelism. Understanding this mechanism explains why Rayon achieves near-linear speedup on multi-core systems and how to structure code for optimal performance.
Rayon's execution model centers on fork-join parallelism, where work is recursively split (forked) and results are combined (joined). The par_iter() call transforms a collection into a parallel iterator that the scheduler distributes across threads:
```rust
use rayon::prelude::*;

fn sum_parallel(data: &[i32]) -> i32 {
    data.par_iter().sum()
}
```

This deceptively simple call triggers a complex orchestration. The parallel iterator doesn't immediately divide work evenly among threads. Instead, it creates a tree of tasks that threads discover and claim dynamically.
Each thread in Rayon's thread pool maintains a local deque (double-ended queue) of tasks. This deque has a critical property: the owning thread operates on the bottom (LIFO - last in, first out), while other threads steal from the top (FIFO - first in, first out).
```rust
use std::collections::VecDeque;

// Conceptual model of a thread's work queue (Rayon's real deques are lock-free)
struct WorkerQueue {
    tasks: VecDeque<Task>,
}

impl WorkerQueue {
    // Owner pushes and pops from the bottom (LIFO)
    fn push(&mut self, task: Task) { self.tasks.push_back(task); }
    fn pop(&mut self) -> Option<Task> { self.tasks.pop_back() }
    // Thieves steal from the top (FIFO)
    fn steal(&mut self) -> Option<Task> { self.tasks.pop_front() }
}
```

When a thread spawns a task via join() or iterates over a parallel iterator, it pushes the spawned work onto its local queue. The thread then continues executing, periodically checking its queue for work.
When you call par_iter(), Rayon doesn't pre-divide the collection into chunks. Instead, it creates a splitter that can recursively divide work:
```rust
fn parallel_sum(data: &[i32]) -> i32 {
    data.par_iter().sum()
}

// Conceptually, this becomes:
const THRESHOLD: usize = 1024; // illustrative sequential cutoff

fn parallel_sum_recursive(data: &[i32]) -> i32 {
    if data.len() < THRESHOLD {
        // Sequential execution for small chunks
        data.iter().sum()
    } else {
        let (left, right) = data.split_at(data.len() / 2);
        let (left_sum, right_sum) = rayon::join(
            || parallel_sum_recursive(left),
            || parallel_sum_recursive(right),
        );
        left_sum + right_sum
    }
}
```

The rayon::join() function is the primitive that enables work stealing. When called, it pushes one closure onto the local queue and executes the other immediately. If the pushed work isn't stolen by the time the immediate work completes, the thread pops it from its own queue.
Work stealing occurs when a thread exhausts its local queue:
```rust
// Simplified model of a worker thread's main loop
fn worker_loop(thread_id: usize, pool: &ThreadPool) {
    let local_queue = pool.get_queue(thread_id);
    loop {
        // First, try local work
        if let Some(task) = local_queue.pop() {
            task.execute();
            continue;
        }
        // No local work? Try to steal from others
        let victim = random_other_thread(thread_id);
        if let Some(task) = pool.get_queue(victim).steal() {
            task.execute();
            continue;
        }
        // No work anywhere? Sleep or yield
        thread::yield_now();
    }
}
```

The random victim selection is crucial: it reduces contention when multiple threads try to steal from the same victim. This randomized approach, combined with the split ordering, creates natural load balancing.
The asymmetric access pattern optimizes cache locality:
```rust
use rayon::prelude::*;

fn process_data(data: &mut [Data]) {
    data.par_iter_mut().for_each(|item| {
        // Process item
    });
}
```

When a thread pushes tasks from splitting a collection, the "most recent" split represents data that's likely still in cache. By processing LIFO, the thread works on hot data first. When another thread steals from the top (FIFO), it receives "older" tasks: larger chunks that represent independent work with better cache characteristics for the stealing thread.
Rayon automatically adjusts task granularity. The split budget is derived from the number of worker threads, and a range splits further when its pieces get stolen, a sign that other threads are idle:

```rust
use rayon::prelude::*;

// With 8 CPUs, this might split into anywhere from 8 to a few dozen
// tasks, depending on how much stealing occurs
let sum: i32 = (0..1_000_000)
    .into_par_iter()
    .map(|x| expensive_computation(x))
    .sum();

// If no thread goes idle, splitting stops early, so cheap per-item
// operations aren't drowned in task overhead
let sum: i32 = (0..1_000_000)
    .into_par_iter()
    .sum();
```

In short, the splitter reacts to the thread count and to observed stealing rather than measuring the cost of individual operations.
You can manually control splitting with with_min_len():

```rust
data.par_iter()
    .with_min_len(1000) // Each task processes at least 1000 items
    .for_each(|item| process(item));
```

Rayon maintains a global thread pool, initialized on first use:
```rust
use rayon::ThreadPoolBuilder;

// Custom thread pool configuration
let pool = ThreadPoolBuilder::new()
    .num_threads(4)
    .thread_name(|i| format!("rayon-worker-{}", i))
    .build()
    .unwrap();

pool.install(|| {
    // This runs in the custom pool
    data.par_iter().for_each(|item| process(item));
});
```

The default pool creates one thread per available logical CPU. Each thread has its own deque, and threads communicate through work stealing without a central dispatcher.
A critical aspect of Rayon's scheduler is how it handles blocking operations:
```rust
use rayon::prelude::*;

// PROBLEM: blocking inside parallel iteration
items.par_iter().for_each(|item| {
    let result = blocking_io_operation(item); // BAD: this worker thread stalls
    process(result);
});
```

When a thread blocks, it can't steal work or execute other tasks. The work-stealing queue accumulates, but the blocked thread's tasks wait. Other threads will eventually steal them, but with reduced parallelism.
Rayon itself has no spawn_blocking; keep blocking operations out of the pool, or hand them to an async runtime such as tokio (whose tokio::task::spawn_blocking exists for exactly this):

```rust
// Better: do the blocking I/O outside the Rayon pool, then parallelize
// the CPU-bound part
let inputs: Vec<_> = items.iter().map(|item| blocking_io_operation(item)).collect();
inputs.par_iter().for_each(|input| process(input));
```

Rayon's iterator splitters cap how many times work is divided, which also bounds recursion depth. With manual rayon::join() recursion, it's up to you to add a cutoff:
```rust
fn deep_parallel(data: &[i32], depth: usize) -> i32 {
    if depth > 32 || data.len() < 1024 {
        return data.iter().sum();
    }
    let (left, right) = data.split_at(data.len() / 2);
    let (l, r) = rayon::join(
        || deep_parallel(left, depth + 1),
        || deep_parallel(right, depth + 1),
    );
    l + r
}
```

Tracking the current fork depth and falling back to sequential execution past a threshold keeps deeply nested parallelism from exhausting the stack.
Beyond iterators, Rayon provides the lower-level primitives spawn and scope:
```rust
use rayon::spawn;
use std::sync::atomic::{AtomicUsize, Ordering};

// spawn requires a 'static closure, so this counter must outlive the call
static COUNTER: AtomicUsize = AtomicUsize::new(0);

// Fire-and-forget parallelism: no handle, no guaranteed completion point
spawn(|| {
    COUNTER.fetch_add(1, Ordering::Relaxed);
});

// Scoped parallelism with guaranteed completion; borrowing locals is fine here
let local_counter = AtomicUsize::new(0);
rayon::scope(|s| {
    s.spawn(|_| {
        // This completes before scope exits
        local_counter.fetch_add(1, Ordering::Relaxed);
    });
});
// All tasks spawned in the scope have completed here
```

scope ensures all spawned tasks complete before the closure returns, enabling safe borrowing of stack data:
```rust
let mut data = vec![0; 1000];
rayon::scope(|s| {
    s.spawn(|_| {
        // Can safely borrow data because scope waits for completion
        data[0] = 42;
    });
});
// data[0] is guaranteed to be 42 here
```

The work-stealing approach provides several guarantees:
- Near-optimal load balancing: threads that finish early steal from busy threads, ensuring all cores stay utilized.
- Low overhead: tasks are cheap to create, and the cost is amortized across many iterations.
- Cache friendliness: LIFO processing keeps related data in cache; stealing transfers larger, independent chunks.
```rust
use rayon::prelude::*;

// Benchmark comparison
fn benchmark_parallel_vs_sequential(data: &[i32]) {
    // Sequential: ~N iterations on a single thread
    let seq_sum: i32 = data.iter().sum();
    // Parallel: ~N/CPU iterations per thread, plus stealing overhead
    let par_sum: i32 = data.par_iter().sum();
    assert_eq!(seq_sum, par_sum);
    // For large N with per-item work that outweighs scheduling costs,
    // parallel approaches a CPU_COUNT-times speedup; for small N or
    // trivially cheap items, the overhead can dominate
}
```

ThreadPoolBuilder also exposes thread lifecycle hooks, useful for observing the pool:
```rust
use rayon::ThreadPoolBuilder;

ThreadPoolBuilder::new()
    .num_threads(4)
    .start_handler(|thread_id| {
        eprintln!("Thread {} started", thread_id);
    })
    .exit_handler(|thread_id| {
        eprintln!("Thread {} exiting", thread_id);
    })
    .build_global()
    .unwrap();
```

Rayon's par_iter() achieves efficient parallelism through a decentralized work-stealing scheduler. Rather than centrally assigning work, each thread maintains a local task queue and steals from others when idle. The fork-join model naturally creates a task tree that adapts to available parallelism without requiring explicit chunking.
The LIFO/FIFO split between owners and thieves optimizes cache locality: threads work on recently-split (cache-hot) data while stealing transfers older, larger (independent) chunks. This asymmetry, combined with random victim selection and adaptive granularity, produces near-linear speedup on multi-core systems.
Understanding this model informs how to structure parallel code: prefer fine-grained parallelism over manual chunking, avoid blocking in parallel iterators, and trust the scheduler to balance load. The abstraction lets you write data.par_iter().sum() while the runtime orchestrates complex work distribution across all available cores.