Rust walkthroughs
When using rayon::ThreadPool, how does thread-local storage behave for parallel iterators?

Thread-local storage in Rayon behaves differently from what single-threaded code would lead you to expect, because Rayon's work-stealing scheduler may execute tasks on any thread in the pool, and a single operation can be split across threads mid-execution. Each thread in the ThreadPool maintains its own thread-local state, so one parallel iterator operation may touch thread-local storage on several threads as chunks are stolen and executed. The standard thread_local! macro works for per-thread caches that benefit from locality, but understand that values are not shared between threads and you cannot rely on a single thread executing an entire operation.
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // Create a thread pool with 4 threads
    let pool = ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();
    // Run the closure inside this pool
    pool.install(|| {
        (0..100).into_par_iter().for_each(|i| {
            println!("Processing {} on thread {:?}", i, rayon::current_thread_index());
        });
    });
}

Rayon's thread pool distributes work across threads using work-stealing.
use std::cell::RefCell;

thread_local! {
    static COUNTER: RefCell<i32> = RefCell::new(0);
}

fn main() {
    // Each OS thread has its own COUNTER instance
    let t1 = std::thread::spawn(|| {
        COUNTER.with(|c| {
            *c.borrow_mut() += 1;
            println!("Thread 1: {}", *c.borrow());
        });
    });
    let t2 = std::thread::spawn(|| {
        COUNTER.with(|c| {
            *c.borrow_mut() += 1;
            println!("Thread 2: {}", *c.borrow());
        });
    });
    // Join so main doesn't exit before the threads finish
    t1.join().unwrap();
    t2.join().unwrap();
}

Standard thread-local storage is tied to OS threads; each thread sees its own copy.
use rayon::prelude::*;
use std::cell::RefCell;

// The standard thread_local! macro works on Rayon worker threads too
thread_local! {
    static BUFFER: RefCell<Vec<i32>> = RefCell::new(Vec::new());
}

fn main() {
    let data: Vec<i32> = (0..1000).collect();
    // Each Rayon worker thread gets its own BUFFER
    data.par_iter().for_each(|&i| {
        BUFFER.with(|buf| {
            buf.borrow_mut().push(i);
            // Each thread accumulates its own batch
        });
    });
    // But we can't easily retrieve the accumulated values
    // because they're distributed across the worker threads
}

On Rayon workers, thread_local! creates per-worker-thread storage, not per-task storage.
use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();
    pool.install(|| {
        (0..20).into_par_iter().for_each(|i| {
            // current_thread_index() returns the Rayon thread index,
            // not the OS thread ID
            if let Some(idx) = rayon::current_thread_index() {
                println!("Item {} on thread {}", i, idx);
            }
        });
    });
}

rayon::current_thread_index() returns the index within the Rayon pool.
use rayon::prelude::*;
use std::cell::RefCell;

thread_local! {
    static LOCAL_SUM: RefCell<i32> = RefCell::new(0);
}

fn main() {
    let data: Vec<i32> = (0..10000).collect();
    // Each thread accumulates into its own local storage
    data.par_iter().for_each(|&i| {
        LOCAL_SUM.with(|sum| {
            *sum.borrow_mut() += i;
        });
    });
    // Problem: the sums are distributed across threads,
    // so we can't easily get the total.
    // Solution: use a parallel reduction instead
    let total: i32 = data.par_iter().sum();
    println!("Total: {}", total);
}

Work stealing means tasks can execute on any thread, so thread-local values aren't predictable.
use rayon::prelude::*;
use std::cell::RefCell;

thread_local! {
    static THREAD_RESULTS: RefCell<Vec<i32>> = RefCell::new(Vec::new());
}

fn main() {
    let data: Vec<i32> = (0..1000).collect();
    // Each thread accumulates its own results
    data.par_iter().for_each(|&i| {
        THREAD_RESULTS.with(|results| {
            results.borrow_mut().push(i * 2);
        });
    });
    // But we still can't easily collect the per-thread results:
    // thread-local storage provides no "flush" mechanism.
    // Better approach: collect in parallel and let Rayon combine
    let doubled: Vec<i32> = data.par_iter().map(|&i| i * 2).collect();
    assert_eq!(doubled.len(), data.len());
}

Thread-local storage is for per-thread caches, not for aggregating parallel results.
use rayon::prelude::*;
use std::cell::RefCell;

// Thread-local scratch space reused across items on the same thread
thread_local! {
    static COMPUTE_CACHE: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

fn expensive_computation(n: u64) -> u64 {
    // Simulate expensive work
    n.pow(2)
}

fn main() {
    let inputs: Vec<u64> = (0..1000).collect();
    // Each thread reuses its own scratch vector
    inputs.par_iter().for_each(|&n| {
        COMPUTE_CACHE.with(|cache| {
            let mut c = cache.borrow_mut();
            // Reuse the allocation for intermediate results
            c.clear();
            c.push(n);
            c.push(n * 2);
            let result = expensive_computation(n);
            c.push(result);
        });
    });
}

Thread-local caches avoid repeated allocation within each thread.
use rayon::prelude::*;
use std::cell::RefCell;

thread_local! {
    // Lazy initialization happens per thread on first access
    static THREAD_ID: RefCell<Option<usize>> = RefCell::new(None);
}

fn main() {
    (0..100).into_par_iter().for_each(|_i| {
        THREAD_ID.with(|id| {
            let mut id_ref = id.borrow_mut();
            if id_ref.is_none() {
                // Initialize on first use per thread
                *id_ref = Some(rayon::current_thread_index().unwrap());
                println!("Thread {:?} initialized", id_ref);
            }
        });
    });
}

Each thread initializes its thread-local storage on first access.
use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(2) // Only 2 threads
        .build()
        .unwrap();
    pool.install(|| {
        // This parallel iterator splits work across 2 threads,
        // even though there are 100 items
        (0..100).into_par_iter().for_each(|i| {
            // Thread index will only be 0 or 1
            println!("Thread {:?}: item {}", rayon::current_thread_index(), i);
        });
    });
}

The number of distinct thread-local copies is limited by the number of threads in the pool.
use rayon::prelude::*;

fn main() {
    let data: Vec<i64> = (0..1_000_000).collect();
    // Use parallel sum - Rayon handles the reduction efficiently
    let sum: i64 = data.par_iter().sum();
    println!("Sum: {}", sum);
    // If you need per-split aggregation, use fold: each split gets
    // its own accumulator, with no thread-local storage required
    let chunk_sums: Vec<i64> = data
        .par_iter()
        .fold(
            || 0i64,           // Identity for each split
            |acc, &x| acc + x, // Accumulate within each split
        )
        .collect();
    let total: i64 = chunk_sums.iter().sum();
    println!("Total from chunks: {}", total);
}

Use Rayon's built-in reductions instead of manual thread-local accumulation.
use rayon::prelude::*;
use std::cell::RefCell;

// Scratch buffer for per-thread string formatting
thread_local! {
    static STRING_BUFFER: RefCell<String> = RefCell::new(String::new());
}

fn format_item(item: i32) -> String {
    STRING_BUFFER.with(|buf| {
        let mut s = buf.borrow_mut();
        s.clear();
        s.push_str("Item: ");
        s.push_str(&item.to_string());
        s.push_str(" processed");
        s.clone() // Return an owned copy; the buffer itself is reused
    })
}

fn main() {
    let items: Vec<i32> = (0..1000).collect();
    // Each thread reuses its buffer, so the scratch String is
    // allocated once per thread rather than once per item
    let formatted: Vec<String> = items.par_iter()
        .map(|&i| format_item(i))
        .collect();
    assert_eq!(formatted.len(), 1000);
}

Thread-local scratch buffers reduce allocation overhead within threads.
use rayon::prelude::*;

fn main() {
    let matrix: Vec<Vec<i32>> = vec![
        (0..10).collect(),
        (10..20).collect(),
        (20..30).collect(),
    ];
    // Nested parallelism
    matrix.par_iter().for_each(|row| {
        row.par_iter().for_each(|&_val| {
            // The inner iteration may execute on the same or a different
            // thread than the outer one; thread-local state is still
            // per-thread, not per-parallel-iterator-level
        });
    });
}

Nested parallelism doesn't create nested thread-local scopes.
use std::cell::RefCell;

thread_local! {
    static POOL_LOCAL: RefCell<i32> = RefCell::new(0);
}

fn main() {
    let pool1 = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();
    let pool2 = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();
    // The static is declared once, but every worker thread in
    // every pool holds its own independent copy
    pool1.install(|| {
        POOL_LOCAL.with(|v| *v.borrow_mut() = 1);
    });
    pool2.install(|| {
        POOL_LOCAL.with(|v| {
            // pool2's threads never saw pool1's write; this prints 0
            println!("Pool 2 sees: {}", *v.borrow());
        });
    });
}

The thread-local declaration is global, but each worker thread in each pool holds an independent copy; values written in one pool are invisible in another.
use rayon::prelude::*;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// To have per-pool state, pass it explicitly
struct PoolContext {
    processed: AtomicUsize,
}

fn main() {
    let pool1_ctx = Arc::new(PoolContext {
        processed: AtomicUsize::new(0),
    });
    let pool1 = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();
    let ctx = pool1_ctx.clone();
    pool1.install(|| {
        (0..100).into_par_iter().for_each(|_| {
            ctx.processed.fetch_add(1, Ordering::Relaxed);
        });
    });
    println!("Processed: {}", pool1_ctx.processed.load(Ordering::Relaxed));
}

For per-pool state, pass context explicitly rather than using thread-local storage.
use rayon::prelude::*;
use std::cell::RefCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

thread_local! {
    static COUNTER: RefCell<usize> = RefCell::new(0);
}

fn main() {
    let global_count = Arc::new(AtomicUsize::new(0));
    // Each thread increments its own local counter
    (0..1000).into_par_iter().for_each(|_| {
        COUNTER.with(|c| {
            *c.borrow_mut() += 1;
        });
        global_count.fetch_add(1, Ordering::Relaxed);
    });
    // global_count == 1000, but the COUNTER values are distributed
    // across threads with no way to read them all from here
    println!("Global count: {}", global_count.load(Ordering::Relaxed));
}

Thread-local values are scattered across threads with no global access.
use rayon::prelude::*;
use std::cell::RefCell;

// Thread-local is useful for:
// 1. Per-thread caches (avoid repeated computation)
// 2. Scratch buffers (avoid repeated allocation)
// 3. Thread-specific resources (connections, handles)
thread_local! {
    // Example: per-thread RNG state
    static RNG_STATE: RefCell<u64> = RefCell::new(0);
}

fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    // Each thread maintains its own RNG state with no contention
    (0..1000).into_par_iter().for_each(|i| {
        RNG_STATE.with(|s| {
            let mut state = s.borrow_mut();
            // Mix in the item index so a zero seed still produces output
            *state = (*state).wrapping_add(i as u64 + 1);
            let _random = xorshift(&mut state);
        });
    });
}

Thread-local storage works best for independent per-thread state.
use rayon::prelude::*;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::cell::RefCell;

thread_local! {
    static LOCAL_COUNTER: RefCell<usize> = RefCell::new(0);
}

fn main() {
    let data: Vec<usize> = (0..10_000).collect();
    // Thread-local: no contention, but no easy aggregation
    data.par_iter().for_each(|_| {
        LOCAL_COUNTER.with(|c| *c.borrow_mut() += 1);
    });
    // Atomic: some contention, but global visibility
    let atomic_counter = Arc::new(AtomicUsize::new(0));
    data.par_iter().for_each(|_| {
        atomic_counter.fetch_add(1, Ordering::Relaxed);
    });
    println!("Atomic count: {}", atomic_counter.load(Ordering::Relaxed));
}

Use thread-local storage for thread-private state; use atomics for shared counting.
| Aspect | Behavior |
|--------|----------|
| Scope | Per-Rayon-thread (per worker) |
| Lifetime | Persists across tasks on the same thread |
| Work stealing | Tasks may run on different threads than expected |
| Access | No global view of all thread-local values |
| Use cases | Scratch buffers, per-thread caches, RNG state |
| Limitation | Cannot aggregate across threads automatically |
Thread-local storage in Rayon's ThreadPool operates at the worker thread level, not the task level:
Thread-local is per-worker-thread: Each thread in the pool has its own copy of the thread-local variable. When a task runs on thread 0, it sees thread 0's copy; when work-stealing moves work to thread 1, subsequent operations see thread 1's copy. There is no way to access all thread-local values globally; you only ever see the current thread's value.
Work stealing changes execution context: A single parallel iterator operation may execute chunks on different threads due to work stealing. This means a "per-thread" value in your mental model actually means "per-chunk-if-stolen," making thread-local values unpredictable if you expect all items to see the same local state.
Thread-local shines for scratch storage: The right use is for per-thread buffers, caches, or state that shouldn't be shared: reusing allocations, maintaining thread-local RNG state, or caching expensive per-thread computations. The wrong use is trying to accumulate results across threads; you need fold/reduce patterns or atomics for that.
Key insight: Rayon's thread-local storage is tied to the worker threads, which persist for the lifetime of the pool. This is excellent for amortizing costs (reuse allocations, maintain caches) but useless for aggregating parallel results. For aggregation, use Rayon's built-in fold and reduce operations which handle the parallel reduction correctly. For shared state, use Arc with atomics or mutexes. Thread-local in Rayon is an optimization for per-thread state, not a communication mechanism.