How does rayon::slice::ParallelSlice::par_chunks handle uneven chunk sizes for parallel processing?

par_chunks divides a slice into chunks of the specified size and processes them in parallel, with the final chunk potentially being smaller when the slice length isn't evenly divisible by the chunk size. This "remainder chunk" contains all remaining elements and is processed just like the full-size chunks—Rayon doesn't discard elements or pad the chunk. The chunk size determines the maximum elements per chunk, not a fixed size, so a slice of 10 elements with chunk size 3 produces four chunks: three chunks of 3 elements each and one remainder chunk of 1 element. This design ensures all elements are processed regardless of divisibility, but requires your parallel closure to handle chunks of varying sizes.

Basic Chunk Division

use rayon::prelude::*;
 
fn main() {
    let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // Split into chunks of 3
    // 10 elements / 3 per chunk = 3 full chunks + 1 remainder
    // Chunks: [1,2,3], [4,5,6], [7,8,9], [10]
    
    data.par_chunks(3).for_each(|chunk| {
        println!("Chunk: {:?}", chunk);
    });
    
    // Output (order may vary due to parallelism):
    // Chunk: [1, 2, 3]
    // Chunk: [4, 5, 6]
    // Chunk: [7, 8, 9]
    // Chunk: [10]
}

The final chunk contains remaining elements when division isn't even.

Understanding Chunk Sizes

use rayon::prelude::*;
 
fn main() {
    // Different slice lengths with chunk size 4
    
    // Length 8, chunk size 4: perfect division
    // Chunks: [0,1,2,3], [4,5,6,7] - both size 4
    let data: Vec<i32> = (0..8).collect();
    println!("8 elements, chunk size 4:");
    data.par_chunks(4).for_each(|chunk| {
        println!("  Chunk size {}: {:?}", chunk.len(), chunk);
    });
    
    // Length 10, chunk size 4: remainder
    // Chunks: [0,1,2,3], [4,5,6,7], [8,9] - last is smaller
    let data: Vec<i32> = (0..10).collect();
    println!("\n10 elements, chunk size 4:");
    data.par_chunks(4).for_each(|chunk| {
        println!("  Chunk size {}: {:?}", chunk.len(), chunk);
    });
    
    // Length 3, chunk size 4: single small chunk
    // Chunks: [0,1,2] - entire slice is one chunk
    let data: Vec<i32> = (0..3).collect();
    println!("\n3 elements, chunk size 4:");
    data.par_chunks(4).for_each(|chunk| {
        println!("  Chunk size {}: {:?}", chunk.len(), chunk);
    });
    
    // Length 0, chunk size 4: no chunks
    let data: Vec<i32> = vec![];
    println!("\n0 elements, chunk size 4:");
    data.par_chunks(4).for_each(|chunk| {
        println!("  Chunk: {:?}", chunk);
    });
    println!("  (no output - empty slice produces no chunks)");
}

Chunk sizes vary: full-size chunks followed by a potentially smaller remainder.

Handling Variable Chunk Sizes in Closures

use rayon::prelude::*;
 
fn main() {
    let data: Vec<i32> = (0..10).collect();
    
    // CORRECT: Handle variable chunk sizes
    let sum: i32 = data.par_chunks(3).map(|chunk| {
        // Each chunk can be different size
        // Last chunk may have 1-2 elements instead of 3
        chunk.iter().sum()
    }).sum();
    
    println!("Sum: {}", sum);
    
    // WRONG: Assuming fixed chunk size
    // Don't do this - it will panic or produce wrong results
    // let wrong: i32 = data.par_chunks(3).map(|chunk| {
    //     chunk[0] + chunk[1] + chunk[2]  // PANIC on last chunk!
    // }).sum();
    
    // CORRECT: if you need indexed access, use get() for bounds checking
    let first_elements: Vec<i32> = data.par_chunks(3)
        .filter_map(|chunk| chunk.get(0).copied())
        .collect();
    
    println!("First elements: {:?}", first_elements);
}

Your parallel closure must handle chunks of varying sizes safely.

Par Chunks vs Sequential Chunks

use rayon::prelude::*;
 
fn main() {
    let data: Vec<i32> = (0..10).collect();
    
    // Sequential chunks - guaranteed order
    println!("Sequential chunks (ordered):");
    for chunk in data.chunks(3) {
        println!("  {:?}", chunk);
    }
    
    // Parallel chunks - order depends on parallel execution
    println!("\nParallel chunks (unordered):");
    data.par_chunks(3).for_each(|chunk| {
        // Order is NOT guaranteed
        println!("  {:?}", chunk);
    });
    
    // To get ordered results, use collect or map+reduce
    let results: Vec<i32> = data.par_chunks(3)
        .map(|chunk| chunk.iter().sum())
        .collect();  // Preserves order
    
    println!("\nOrdered results: {:?}", results);
}

par_chunks processes in parallel; order of execution is not guaranteed.

Par Chunks Mut for Mutable Access

use rayon::prelude::*;
 
fn main() {
    let mut data: Vec<i32> = (0..10).collect();
    
    // par_chunks_mut allows modifying elements in place
    data.par_chunks_mut(3).for_each(|chunk| {
        // Each chunk can be modified
        for elem in chunk.iter_mut() {
            *elem *= 2;
        }
    });
    
    println!("Doubled: {:?}", data);
    
    // The remainder chunk is also modified
    // Input: [0,1,2,3,4,5,6,7,8,9]
    // Chunks: [0,1,2], [3,4,5], [6,7,8], [9]
    // After:  [0,2,4,6,8,10,12,14,16,18]
    
    // Example: Normalize each chunk by its max
    let mut data: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0];
    
    data.par_chunks_mut(3).for_each(|chunk| {
        if let Some(max_val) = chunk.iter().cloned().max_by(|a, b| a.partial_cmp(b).unwrap()) {
            for elem in chunk.iter_mut() {
                *elem /= max_val;
            }
        }
    });
    
    println!("Normalized per chunk: {:?}", data);
}

par_chunks_mut divides into mutable chunks for in-place modification.

Chunk Size Selection

use rayon::prelude::*;
 
fn main() {
    let data: Vec<i32> = (0..1000).collect();
    
    // SMALL CHUNK SIZE: More chunks, more overhead
    // Each chunk is a separate parallel task
    // Too small = excessive scheduling overhead
    
    let small: Vec<i32> = data.par_chunks(1)
        .map(|chunk| chunk.iter().sum())
        .collect();
    
    // LARGE CHUNK SIZE: Fewer chunks, less parallelism
    // If chunk size >= data length, only one chunk (no parallelism)
    
    let large: Vec<i32> = data.par_chunks(1000)
        .map(|chunk| chunk.iter().sum())
        .collect();
    // Only 1 chunk, no parallelism benefit
    
    // BALANCED: Enough chunks for parallelism, big enough to amortize overhead
    
    let balanced: Vec<i32> = data.par_chunks(100)
        .map(|chunk| chunk.iter().sum())
        .collect();
    // 10 chunks, good for parallel execution
    
    // GENERAL GUIDANCE:
    // - More CPU cores = can use smaller chunks
    // - More expensive work per element = can use smaller chunks
    // - Cheap work per element = use larger chunks
    // - Aim for at least as many chunks as cores
    
    println!("Small (chunk=1): {} results", small.len());
    println!("Large (chunk=1000): {} results", large.len());
    println!("Balanced (chunk=100): {} results", balanced.len());
}

Choose chunk size based on workload and available parallelism.

Work Distribution Across Threads

use rayon::prelude::*;
use std::sync::Mutex;
 
fn main() {
    let data: Vec<i32> = (0..20).collect();
    
    // Optionally size the global pool; build_global must run before the
    // pool's first use and fails if the pool already exists
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .unwrap();
    
    // Collect chunks as worker threads process them
    let processed: Mutex<Vec<Vec<i32>>> = Mutex::new(Vec::new());
    
    data.par_chunks(3).for_each(|chunk| {
        // Different threads may process different chunks;
        // Rayon's work-stealing scheduler distributes them
        processed.lock().unwrap().push(chunk.to_vec());
    });
    
    println!("Processed chunks: {}", processed.lock().unwrap().len());
    
    // Rayon distributes work based on:
    // 1. Number of available threads
    // 2. Current thread workload
    // 3. Work-stealing between threads
    // 4. Chunk count relative to thread count
}

Rayon's work-stealing scheduler distributes chunks across available threads.

Par Chunks Exact for Fixed Sizes

use rayon::prelude::*;
 
fn main() {
    let data: Vec<i32> = (0..10).collect();
    
    // par_chunks_exact yields only full-sized chunks; the remainder
    // (fewer than chunk_size elements) is never visited by the iterator.
    // With 10 elements and chunk size 3, the last element (9) is skipped.
    // (The parallel ChunksExact iterator also exposes a remainder()
    // accessor, mirroring the std slice iterator.)
    let chunk_size = 3;
    
    data.par_chunks_exact(chunk_size).for_each(|chunk| {
        println!("Exact chunk: {:?}", chunk);
    });
    
    // Handle the remainder separately
    let remainder_start = (data.len() / chunk_size) * chunk_size;
    let remainder = &data[remainder_start..];
    if !remainder.is_empty() {
        println!("Remainder: {:?}", remainder);
    }
    
    // Alternatively, use par_chunks and handle variable sizes
    data.par_chunks(3).for_each(|chunk| {
        // chunk can be size 3 OR smaller for the last one
        println!("Chunk size {}: {:?}", chunk.len(), chunk);
    });
}

par_chunks_exact guarantees uniform size but excludes remainder elements.

Remainder Handling Patterns

use rayon::prelude::*;
 
fn main() {
    let data: Vec<i32> = (0..10).collect();
    let chunk_size = 3;
    
    // PATTERN 1: Process everything with par_chunks
    // Simple but must handle variable sizes
    let sum: i32 = data.par_chunks(chunk_size)
        .map(|chunk| chunk.iter().sum())
        .sum();
    println!("Total sum: {}", sum);
    
    // PATTERN 2: Separate remainder processing
    let full_chunks_count = data.len() / chunk_size;
    let remainder_len = data.len() % chunk_size;
    
    let full_chunks_sum: i32 = data.par_chunks_exact(chunk_size)
        .map(|chunk| chunk.iter().sum())
        .sum();
    
    let remainder_sum: i32 = if remainder_len > 0 {
        data[full_chunks_count * chunk_size..].iter().sum()
    } else {
        0
    };
    
    println!("Full chunks: {}, Remainder: {}", full_chunks_sum, remainder_sum);
    
    // PATTERN 3: Pad to make even
    let padded_len = ((data.len() + chunk_size - 1) / chunk_size) * chunk_size;
    let mut padded_data = data.clone();
    padded_data.resize(padded_len, 0);  // Pad with zeros
    
    // Now all chunks are equal size
    let padded_sum: i32 = padded_data.par_chunks_exact(chunk_size)
        .map(|chunk| chunk.iter().sum())
        .sum();
    println!("Padded sum: {}", padded_sum);
    
    // Note: padding changes results for some operations
}

Choose your remainder handling strategy based on your computation requirements.

Practical Example: Parallel Matrix Processing

use rayon::prelude::*;
 
fn main() {
    // Process a matrix stored in row-major order in row-sized chunks
    let width: usize = 10;
    let height: usize = 7;  // Not divisible by typical chunk sizes
    let matrix: Vec<f64> = (0..width * height)
        .map(|i| i as f64)
        .collect();
    
    // Process rows in parallel: each chunk is exactly one row
    let row_sums: Vec<f64> = matrix.par_chunks(width)
        .map(|row| row.iter().sum())
        .collect();
    
    println!("Row sums: {:?}", row_sums);
    // 7 rows -> 7 chunks, each exactly width elements:
    // all chunks are the same size because width divides the matrix length
    
    // More realistic: process blocks of rows
    let rows_per_chunk = 2;  // 7 rows -> 3 blocks of 2 rows + 1 block of 1 row
    let row_indices: Vec<usize> = (0..height).collect();
    let block_sums: Vec<f64> = row_indices
        .par_chunks(rows_per_chunk)
        .map(|rows| {
            rows.iter()
                .flat_map(|&row_idx| {
                    let start = row_idx * width;
                    &matrix[start..start + width]
                })
                .sum()
        })
        .collect();
    
    println!("Block sums: {:?}", block_sums);
    // The last block has fewer rows (just row 6)
}

Row-based chunking often produces even chunks; block-based chunking may have remainders.

Practical Example: Parallel File Processing

use rayon::prelude::*;
 
fn main() {
    // Simulate processing records in chunks
    let records: Vec<String> = (0..23)
        .map(|i| format!("Record {}", i))
        .collect();
    
    let batch_size = 5;
    
    // Process in batches, handling remainder
    let batch_results: Vec<usize> = records.par_chunks(batch_size)
        .map(|batch| {
            println!("Processing batch of {} records", batch.len());
            batch.len()  // Each batch returns count
        })
        .collect();
    
    println!("Batch sizes: {:?}", batch_results);
    // Output: [5, 5, 5, 5, 3] - last batch is smaller
    
    // All records processed, including remainder
    let total: usize = batch_results.iter().sum();
    println!("Total processed: {}", total);
    assert_eq!(total, records.len());
}

Real-world batch processing often has uneven final batches.

Combining with Reduce Operations

use rayon::prelude::*;
 
fn main() {
    let data: Vec<i32> = (0..100).collect();
    
    // Sum using chunk-based processing
    // Remainder handling is automatic - all elements included
    let sum: i32 = data.par_chunks(7)
        .map(|chunk| chunk.iter().sum())
        .sum();
    
    println!("Sum: {}", sum);
    assert_eq!(sum, (0..100).sum());
    
    // Find max using chunk-based processing
    let max: Option<i32> = data.par_chunks(13)
        .filter_map(|chunk| chunk.iter().max().copied())
        .max();
    
    println!("Max: {:?}", max);
    assert_eq!(max, Some(99));
    
    // Filter across chunks
    let even_count: usize = data.par_chunks(10)
        .map(|chunk| chunk.iter().filter(|&&x| x % 2 == 0).count())
        .sum();
    
    println!("Even numbers: {}", even_count);
    
    // All elements processed regardless of chunk size
    // The 7, 13, and 10 don't divide 100 evenly, but results are correct
}

Reduction operations correctly handle the remainder chunk.

Synthesis

Chunk size behavior:

Slice Length | Chunk Size | Chunks Produced | Last Chunk Size
10           | 3          | 4               | 1 (remainder)
10           | 4          | 3               | 2 (remainder)
10           | 5          | 2               | 5 (exact)
10           | 10         | 1               | 10 (entire slice)
10           | 20         | 1               | 10 (single chunk)
0            | 5          | 0               | N/A (empty)
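
These rows can be verified directly. par_chunks produces the same chunk boundaries as the standard library's sequential chunks, so the counts are easy to assert without a thread pool (a small sketch using only std):

```rust
fn main() {
    // Chunk boundaries of par_chunks match sequential chunks exactly,
    // so each table row can be checked against the std iterator.
    let data: Vec<i32> = (0..10).collect();

    assert_eq!(data.chunks(3).count(), 4);
    assert_eq!(data.chunks(3).last().unwrap().len(), 1); // remainder
    assert_eq!(data.chunks(4).count(), 3);
    assert_eq!(data.chunks(4).last().unwrap().len(), 2); // remainder
    assert_eq!(data.chunks(5).count(), 2);
    assert_eq!(data.chunks(5).last().unwrap().len(), 5); // exact
    assert_eq!(data.chunks(10).count(), 1);
    assert_eq!(data.chunks(20).count(), 1);
    assert_eq!(data.chunks(20).last().unwrap().len(), 10); // entire slice

    let empty: Vec<i32> = vec![];
    assert_eq!(empty.chunks(5).count(), 0); // empty slice: no chunks

    println!("all table rows verified");
}
```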

Key behaviors:

Aspect               | Behavior
Remainder            | Always processed; may be smaller than chunk_size
Order                | Execution order undefined; result order preserved with collect
Empty slice          | Produces no chunks
Single-element slice | Single chunk of size 1
Chunk size >= length | Single chunk containing the entire slice

Chunk size selection factors:

Factor           | Recommendation
Work per element | More work per element = smaller chunks are fine
Number of cores  | More cores = smaller chunks beneficial
Overhead concern | Larger chunks reduce scheduling overhead
Load balancing   | Smaller chunks enable better work stealing

Method comparison:

Method                  | Guarantees                                  | Use Case
par_chunks(n)           | All elements processed; last may be smaller | General parallel processing
par_chunks_exact(n)     | Only full chunks; excludes remainder        | When remainder is handled separately
par_chunks_mut(n)       | Mutable access to all elements              | In-place modification
par_chunks_exact_mut(n) | Mutable full chunks only                    | In-place with separate remainder
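
The exact-mut variants split the slice the same way. A minimal sketch using std's sequential chunks_exact_mut (rayon's par_chunks_exact_mut divides at the same boundaries) shows the full-chunks-plus-separate-remainder pattern:

```rust
fn main() {
    let mut data: Vec<i32> = (0..10).collect();

    {
        // Full-sized chunks: [0,1,2], [3,4,5], [6,7,8]
        let mut chunks = data.chunks_exact_mut(3);
        for chunk in chunks.by_ref() {
            for elem in chunk.iter_mut() {
                *elem *= 2;
            }
        }
        // into_remainder() hands back the leftover elements ([9] here),
        // which get different treatment in this sketch
        for elem in chunks.into_remainder() {
            *elem *= 10;
        }
    }

    assert_eq!(data, vec![0, 2, 4, 6, 8, 10, 12, 14, 16, 90]);
    println!("{:?}", data);
}
```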

Key insight: par_chunks handles uneven division by including a final "remainder chunk" that contains all remaining elements, ensuring complete coverage of the input slice. The chunk size you specify is a maximum, not a guarantee—your parallel closures must handle chunks of varying sizes, typically by iterating over the chunk rather than assuming a fixed number of elements. This design prioritizes completeness over uniformity: every element is processed exactly once, but the final chunk may be smaller. For cases where you need guaranteed equal-sized chunks, par_chunks_exact excludes the remainder, requiring you to handle those elements separately. The choice of chunk size affects parallelism granularity: smaller chunks create more parallel tasks (more overhead, better load balancing), while larger chunks create fewer tasks (less overhead, potentially unbalanced). A reasonable starting point is to have at least as many chunks as available threads, adjusting based on the computational cost of processing each element.
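
One way to turn that starting point into code is a small helper (suggest_chunk_size is a hypothetical name, not a rayon API) that targets a few chunks per thread; with rayon you would pass rayon::current_num_threads() as num_threads:

```rust
/// Illustrative heuristic, not a rayon API: pick a chunk size that yields
/// roughly `tasks_per_thread` chunks per thread, so the work-stealing
/// scheduler has spare tasks for load balancing.
fn suggest_chunk_size(len: usize, num_threads: usize, tasks_per_thread: usize) -> usize {
    let target_chunks = (num_threads * tasks_per_thread).max(1);
    // Ceiling division, clamped to at least 1 (par_chunks panics on 0)
    ((len + target_chunks - 1) / target_chunks).max(1)
}

fn main() {
    // 1000 elements, 8 threads, ~4 tasks per thread -> chunks of 32 elements
    assert_eq!(suggest_chunk_size(1000, 8, 4), 32);
    // Tiny or empty inputs still get a valid (nonzero) chunk size
    assert_eq!(suggest_chunk_size(0, 8, 4), 1);

    println!("chunk size for 1000 elements on 8 threads: {}",
             suggest_chunk_size(1000, 8, 4));
}
```

Targeting several chunks per thread (rather than exactly one) leaves slack for work stealing when some chunks finish faster than others.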