How does rand::seq::IteratorRandom::choose uniformly sample from iterators of unknown length?

IteratorRandom::choose uses reservoir sampling with a single-element reservoir. On encountering the i-th element (1-indexed), the algorithm keeps it with probability 1/i, replacing the currently held element. This yields uniform selection: the probability that element j ends up as the final choice is 1/j (its chance of being kept when seen) multiplied by its chance of not being replaced at each subsequent step, and that product telescopes to exactly 1/n for every element once n elements have been processed.

Basic choose Usage

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn basic_choose() {
    let items = vec![1, 2, 3, 4, 5];
    
    // Choose a random element from the iterator
    let chosen = items.iter().choose(&mut thread_rng());
    
    // Result is Option<&item>
    // Some(item) if iterator was non-empty
    // None if iterator was empty
    assert!(chosen.is_some());
}

choose returns an Option because the iterator might be empty.

The Uniform Sampling Problem

// Problem: Choose uniformly from an iterator
// - You don't know the length in advance
// - You can only iterate once (stream)
// - You can't store all elements (possibly infinite)
 
// Naive approach (requires knowing length):
use rand::Rng;  // gen_range comes from the Rng trait

fn naive_choose<T>(items: &[T]) -> Option<&T> {
    if items.is_empty() {
        return None;
    }
    let index = rand::thread_rng().gen_range(0..items.len());
    Some(&items[index])
}
// This works for slices, but not for iterators of unknown length
 
// The challenge: How to choose uniformly without knowing length?

Iterators don't always provide an exact len(), and size_hint() only guarantees bounds, which may be loose.

Reservoir Sampling Algorithm

use rand::Rng;
 
fn reservoir_sampling_explanation() {
    // Algorithm for choosing from unknown-length stream:
    
    // 1. Store first element (i=1)
    // 2. For each subsequent element i (i=2, 3, ...):
    //    - Select it with probability 1/i
    //    - If selected, replace current choice
    
    // Example walkthrough:
    // Elements: [A, B, C, D]
    
    // i=1: Store A. Choice = A.
    // i=2: Select B with probability 1/2.
    //      If selected (50%): Choice = B
    //      If not (50%): Choice = A
    // i=3: Select C with probability 1/3.
    //      If selected (33%): Choice = C
    //      If not (67%): Choice stays (A or B)
    // i=4: Select D with probability 1/4.
    //      If selected (25%): Choice = D
    //      If not (75%): Choice stays
    
    // After 4 elements, each has probability 1/4:
    // P(A) = 1 × (1-1/2) × (1-1/3) × (1-1/4) = 1/4
    // P(B) = 1/2 × (1-1/3) × (1-1/4) = 1/4
    // P(C) = 1/3 × (1-1/4) = 1/4
    // P(D) = 1/4
}

Each element has equal probability because selection chance decreases as more elements are seen.

Implementation Logic

use rand::Rng;
 
fn choose_implementation<I, R>(mut iter: I, mut rng: R) -> Option<I::Item>
where
    I: Iterator,
    R: Rng,
{
    // Step 1: Take the first element as initial choice
    let mut chosen = iter.next()?;
    
    // Step 2: Process remaining elements
    for (i, element) in iter.enumerate() {
        // i is 0-indexed, we want 1-indexed count
        // We've seen i+2 elements now (initial + i more + current)
        let count = i + 2;  // Number of elements seen including this one
        
        // Select this element with probability 1/count
        if rng.gen_range(0..count) == 0 {
            chosen = element;
        }
    }
    
    Some(chosen)
}
 
// This is essentially what IteratorRandom::choose does

The algorithm stores one element and potentially replaces it on each iteration.
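The claimed distribution can be checked empirically. This sketch reruns the same single-element loop many times using a hand-rolled xorshift64 generator (an arbitrary stand-in for a real Rng, chosen so the example needs no external crate) and verifies each of four elements wins about a quarter of the time:

```rust
// Minimal xorshift64 generator, used instead of the rand crate
// so this sketch is self-contained.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

fn reservoir_choose(items: &[char], rng: &mut XorShift64) -> Option<char> {
    let mut chosen = *items.first()?;
    for (i, &item) in items.iter().enumerate().skip(1) {
        // The (i+1)-th element seen is kept with probability 1/(i+1)
        if rng.next_u64() % (i as u64 + 1) == 0 {
            chosen = item;
        }
    }
    Some(chosen)
}

fn main() {
    let items = ['A', 'B', 'C', 'D'];
    let mut rng = XorShift64(0x9E3779B97F4A7C15);
    let mut counts = [0u32; 4];
    let trials = 100_000;

    for _ in 0..trials {
        let c = reservoir_choose(&items, &mut rng).unwrap();
        counts[(c as u8 - b'A') as usize] += 1;
    }

    // Each count should be near trials/4 = 25_000
    for &count in &counts {
        assert!((24_000..=26_000).contains(&count), "count = {count}");
    }
    println!("counts: {counts:?}");
}
```

The counts won't be exactly 25,000 each, but after 100,000 trials any systematic bias would show up far outside the asserted window.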

Mathematical Proof of Uniformity

fn proof_of_uniformity() {
    // Claim: After processing n elements, each has probability 1/n
    
    // For the first element:
    // P(surviving) = P(not replaced at step 2) × P(not replaced at step 3) × ... × P(not replaced at step n)
    //              = (1 - 1/2) × (1 - 1/3) × ... × (1 - 1/n)
    //              = 1/2 × 2/3 × 3/4 × ... × (n-1)/n
    //              = 1/n  (telescoping product)
    
    // For element at position k:
    // P(being chosen at step k) = 1/k
    // P(surviving after) = (1 - 1/(k+1)) × ... × (1 - 1/n) = k/n
    // P(final) = 1/k × k/n = 1/n
    
    // For the last element:
    // P(being chosen) = 1/n (directly)
    
    // Result: All n elements have probability 1/n
}

The telescoping product ensures uniform probability for all elements.
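The telescoping argument can also be confirmed numerically. This std-only sketch evaluates P(chosen at step k) × P(surviving after) for every position and checks that each comes out to 1/n:

```rust
fn main() {
    let n = 10;
    for k in 1..=n {
        // P(chosen at step k) = 1/k
        let mut p = 1.0 / k as f64;
        // P(not replaced at steps k+1 ..= n) = product of (1 - 1/i)
        for i in (k + 1)..=n {
            p *= 1.0 - 1.0 / i as f64;
        }
        // The product telescopes: 1/k * k/n = 1/n for every k
        assert!((p - 1.0 / n as f64).abs() < 1e-12, "k = {k}: p = {p}");
    }
    println!("every position has probability 1/{n}");
}
```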

choose_multiple for Larger Samples

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn choose_multiple_example() {
    let items = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // Choose 3 elements uniformly
    let chosen: Vec<_> = items.iter().choose_multiple(&mut thread_rng(), 3);
    
    // Returns exactly 3 elements (or fewer if iterator is shorter)
    assert_eq!(chosen.len(), 3);
    
    // Each element has equal probability of being included
    // Uses reservoir sampling with 3-element reservoir
}

choose_multiple extends the algorithm to select multiple elements.

Reservoir Sampling for Multiple Elements

use rand::Rng;
 
fn reservoir_multiple<I, R>(mut iter: I, mut rng: R, k: usize) -> Vec<I::Item>
where
    I: Iterator,
    R: Rng,
{
    let mut reservoir: Vec<I::Item> = Vec::with_capacity(k);
    
    // Fill reservoir with first k elements
    for _ in 0..k {
        if let Some(item) = iter.next() {
            reservoir.push(item);
        } else {
            return reservoir;  // Fewer than k elements
        }
    }
    
    // Process remaining elements
    for (i, element) in iter.enumerate() {
        let count = k + i + 1;  // Elements seen so far
        
        // Random index in [0, count)
        let j = rng.gen_range(0..count);
        
        // If j < k, replace reservoir[j] with this element
        if j < k {
            reservoir[j] = element;
        }
    }
    
    reservoir
}
 
// Each of the n elements has probability k/n of being in final reservoir
// Each subset of size k is equally likely

For k samples, the algorithm maintains a k-element reservoir.
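The k/n inclusion claim can be tested empirically too. This sketch drives the k-element reservoir with a small LCG (Knuth's MMIX constants; an arbitrary stand-in for a real Rng so no external crate is needed) and counts how often each of n elements lands in the final reservoir:

```rust
// Minimal LCG; the high bits are used because LCG low bits are weak.
struct Lcg(u64);

impl Lcg {
    fn below(&mut self, bound: u64) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }
}

fn reservoir_k(n: usize, k: usize, rng: &mut Lcg) -> Vec<usize> {
    // Fill the reservoir with the first k elements (here: their indices)
    let mut reservoir: Vec<usize> = (0..k.min(n)).collect();
    // Element i (0-indexed) is the (i+1)-th seen
    for i in k..n {
        let j = rng.below(i as u64 + 1) as usize;
        if j < k {
            reservoir[j] = i;
        }
    }
    reservoir
}

fn main() {
    let (n, k, trials) = (10, 3, 100_000);
    let mut rng = Lcg(42);
    let mut hits = vec![0u32; n];

    for _ in 0..trials {
        for &e in &reservoir_k(n, k, &mut rng) {
            hits[e] += 1;
        }
    }

    // Expected inclusion rate is k/n = 0.3, so about 30_000 hits each
    for &h in &hits {
        assert!((28_000..=32_000).contains(&h), "hits = {h}");
    }
    println!("hits: {hits:?}");
}
```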

Handling Empty Iterators

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn empty_iterator() {
    let empty: Vec<i32> = vec![];
    
    let result = empty.iter().choose(&mut thread_rng());
    
    // Returns None for empty iterator
    assert!(result.is_none());
    
    // choose_multiple also handles empty:
    let multiple: Vec<_> = empty.iter().choose_multiple(&mut thread_rng(), 3);
    assert!(multiple.is_empty());
}

choose returns None when the iterator produces no elements.

Size Hint Optimization

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn size_hint_optimization() {
    // IteratorRandom::choose may use size_hint for optimization
    // If size_hint is exact and small, can use direct random index
    
    // For iterators with known size:
    let vec = vec![1, 2, 3, 4, 5];
    // size_hint returns (5, Some(5))
    // Can choose: random index in [0, 5), return that element
    // No need for reservoir sampling
    
    // For iterators with unknown size:
    let iter = (0..).filter(|x| x % 2 == 0).take(100);
    // size_hint here is only a bound: (0, Some(100))
    // Must use reservoir sampling
}

When size_hint provides exact size, a simpler algorithm may be used.
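The size_hint shapes described above can be observed directly with std iterators, no rand needed:

```rust
fn main() {
    // Vec's iterator knows its exact length
    let v = vec![1, 2, 3, 4, 5];
    assert_eq!(v.iter().size_hint(), (5, Some(5)));

    // Ranges are also exact-size
    assert_eq!((0..100).size_hint(), (100, Some(100)));

    // filter keeps only the upper bound: the lower bound drops to 0,
    // so a sampler can no longer trust the hint to be exact
    let filtered = v.iter().filter(|x| **x % 2 == 0);
    assert_eq!(filtered.size_hint(), (0, Some(5)));

    println!("size hints behave as documented");
}
```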

Infinite Iterators

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn infinite_iterators() {
    // Works with infinite iterators!
    // Process elements one at a time, maintain single choice
    
    // This would process forever without .take():
    // let chosen = (0..).choose(&mut thread_rng());  // Never returns
    
    // With .take(), processes limited elements:
    let chosen = (0..1_000_000).choose(&mut thread_rng());
    assert!(chosen.is_some());
    
    // Works because algorithm is online:
    // - Processes one element at a time
    // - Stores only the current choice
    // - Memory: O(1)
}

The algorithm is online—processing elements one at a time with constant memory.

Memory Efficiency

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn memory_efficiency() {
    // choose stores only:
    // 1. Current chosen element (or None)
    // 2. A counter
    
    // Memory: O(1)
    // Even for iterators yielding large items:
    
    let large_items: Vec<Vec<u8>> = (0..1000)
        .map(|_| vec![0u8; 1_000_000])  // 1 MB per item
        .collect();
    
    // choose only stores ONE item at a time
    let chosen = large_items.iter().choose(&mut thread_rng());
    
    // Not storing all items, just current choice
    // Works with streaming data that can't fit in memory
}

O(1) memory means the algorithm works with arbitrarily large streams.
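To make the streaming claim concrete, this sketch reservoir-chooses from a million lazily generated values while retaining only one at a time. It draws randomness from std's RandomState hasher (a hypothetical stand-in for a real Rng, used only so the example needs no external crate):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

fn main() {
    // std-only randomness: hash a counter with a randomly seeded hasher
    let state = RandomState::new();
    let rand_below = |i: u64, bound: u64| -> u64 {
        let mut h = state.build_hasher();
        h.write_u64(i);
        h.finish() % bound
    };

    // Reservoir-choose over 0..1_000_000: the values are produced
    // lazily, and only `chosen` is ever stored, so memory is O(1)
    let mut iter = 0..1_000_000u64;
    let mut chosen = iter.next().unwrap();
    for (i, item) in iter.enumerate() {
        // This is the (i+2)-th value seen; keep it with probability 1/(i+2)
        if rand_below(i as u64, i as u64 + 2) == 0 {
            chosen = item;
        }
    }

    assert!(chosen < 1_000_000);
    println!("picked {chosen}");
}
```

The same loop works unchanged for a stream read from disk or a network, since nothing but the current choice and a counter is retained.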

Comparison with collect and Random Index

use rand::seq::IteratorRandom;
use rand::{thread_rng, Rng};
 
fn comparison() {
    let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    
    // Approach 1: collect then choose (requires storing all)
    let all: Vec<_> = data.iter().collect();
    let index = thread_rng().gen_range(0..all.len());
    let chosen1 = &all[index];
    // Memory: O(n) - must store all elements
    
    // Approach 2: choose directly (constant memory)
    let chosen2 = data.iter().choose(&mut thread_rng());
    // Memory: O(1) - stores only one element
    
    // Both give uniform distribution
    // Approach 2 works on streams of unknown length
    
    // When to use approach 1:
    // - You already have all data in memory
    // - You need random access multiple times
    
    // When to use approach 2:
    // - Data is streaming
    // - Don't want to store all elements
    // - Single pass over data
}

choose is essential when you can't or don't want to store all elements.

Iterator Size Classes

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn size_classes() {
    // Size known exactly (slice, Vec, etc.):
    let vec = vec![1, 2, 3, 4, 5];
    // May use fast path: random index
    
    // Size bounded (range, etc.):
    let range_iter = 0..100;
    // size_hint gives exact size
    // Fast path possible
    
    // Size unknown (filter, etc.):
    let filtered = vec.iter().filter(|x| *x % 2 == 0);
    // size_hint here is only a bound: (0, Some(5))
    // Must use reservoir sampling
    
    // Size potentially infinite:
    let infinite = std::iter::repeat(1).take(100);
    // Works fine with reservoir sampling
    // (if infinite and no take, would run forever)
}

The algorithm adapts to different iterator characteristics.

Weighted Random Choice

use rand::seq::SliceRandom;
use rand::thread_rng;
 
fn weighted_choice() {
    // choose gives a uniform distribution.
    // For weighted choice, use SliceRandom::choose_weighted (a slice
    // method; IteratorRandom itself has no weighted variant):
    
    let items = vec![
        ("common", 10.0),
        ("rare", 1.0),
        ("legendary", 0.1),
    ];
    
    // choose_weighted returns Result (Err on invalid weights):
    let result = items.choose_weighted(&mut thread_rng(), |item| item.1);
    
    // Or choose_multiple_weighted for several distinct selections:
    let results: Vec<_> = items
        .choose_multiple_weighted(&mut thread_rng(), 2, |item| item.1)
        .unwrap()
        .collect();
}

For non-uniform selection, use SliceRandom's choose_weighted (on slices) instead of choose.

Performance Characteristics

use rand::seq::IteratorRandom;
 
fn performance() {
    // Time complexity: O(n)
    // Must examine every element exactly once
    // Cannot skip elements (unknown length)
    
    // Space complexity: O(1)
    // Single element storage
    // Constant regardless of iterator size
    
    // Random bits needed: O(n)
    // One random number per element (roughly)
    
    // For exact-size iterators:
    // May use O(1) random bits (single index)
    // Depends on implementation details
}

Time is linear in iterator length, but memory is constant.

Thread Safety and Rng

use rand::seq::IteratorRandom;
use rand::{thread_rng, SeedableRng, rngs::StdRng};
 
fn thread_safety() {
    // thread_rng() is thread-local
    // Each thread has its own RNG
    
    // For reproducible results across runs:
    let v = vec![1, 2, 3];
    let mut rng = StdRng::seed_from_u64(42);
    let result1 = v.iter().choose(&mut rng);
    
    let mut rng2 = StdRng::seed_from_u64(42);
    let result2 = v.iter().choose(&mut rng2);
    
    // Seeded identically, so the results are identical:
    assert_eq!(result1, result2);
    
    // For parallel usage:
    // Each thread needs its own Rng
    // thread_rng() provides one per thread
}

Each thread should have its own RNG for parallel random selection.

Practical Example: Log Sampling

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn log_sampling() {
    // Sample one log line uniformly from a potentially huge log
    fn sample_log_line(filename: &str) -> Option<String> {
        use std::fs::File;
        use std::io::{BufRead, BufReader};
        
        let file = File::open(filename).ok()?;
        let reader = BufReader::new(file);
        
        // Process lines one at a time
        // No need to load entire file into memory
        reader.lines()
            .filter_map(|line| line.ok())
            .choose(&mut thread_rng())
    }
    
    // Works for files larger than memory
    // Only one line is retained at any time
}

Sampling from large files without loading them entirely.

Practical Example: Reservoir for Analytics

use rand::seq::IteratorRandom;
use rand::thread_rng;
 
fn analytics_reservoir() {
    // Maintain a uniform sample over all events seen so far
    struct ReservoirSample<T> {
        sample: Vec<T>,
        k: usize,      // Reservoir size (Vec::capacity may round up, so track k)
        count: usize,  // Total events seen
    }
    
    impl<T> ReservoirSample<T> {
        fn with_capacity(k: usize) -> Self {
            ReservoirSample {
                sample: Vec::with_capacity(k),
                k,
                count: 0,
            }
        }
        
        fn add(&mut self, item: T, rng: &mut impl rand::Rng) {
            self.count += 1;
            
            if self.sample.len() < self.k {
                self.sample.push(item);
            } else {
                // Keep this item with probability k/count by replacing
                // a uniformly chosen reservoir slot
                let j = rng.gen_range(0..self.count);
                if j < self.k {
                    self.sample[j] = item;
                }
            }
        }
    }
    
    // This is essentially choose_multiple's algorithm
}

Streaming analytics use reservoir sampling for unbiased samples.

Synthesis

The core algorithm:

// For single element:
// 1. Store first element
// 2. For each subsequent element i (i = 2, 3, ...):
//    - Replace current choice with probability 1/i
// Result: Each of n elements has probability 1/n
 
// For k elements:
// 1. Store first k elements
// 2. For each subsequent element i (i = k+1, k+2, ...):
//    - Generate random index j in [0, i)
//    - If j < k, replace reservoir[j] with this element
// Result: Each element has probability k/n of being selected

Why it works:

// The probability of element at position k being in final result:
// P(chosen at k) × P(not replaced after)
// = 1/k × (k/(k+1) × (k+1)/(k+2) × ... × (n-1)/n)
// = 1/k × k/n
// = 1/n
 
// The telescoping product (k/(k+1) × (k+1)/(k+2) × ...) cancels
// All intermediate terms cancel, leaving k/n

Key properties:

// 1. Uniform: Each element has equal probability
// 2. Online: Process elements one at a time
// 3. Memory O(1): Store only current choice (or k for multiple)
// 4. Time O(n): Examine each element once
// 5. Works for unknown length: No need for size in advance

When to use:

// Use choose when:
// - Iterator length unknown
// - Single-pass processing required
// - Memory constrained
// - Streaming data
 
// Use collect + random index when:
// - Data already in memory
// - Need multiple random selections
// - Size known exactly

Key insight: IteratorRandom::choose implements reservoir sampling to uniformly select from iterators of unknown length by maintaining a probability of 1/i for selecting the i-th element, ensuring that after n elements, each has exactly 1/n probability of being the final choice. The algorithm is online (single pass), memory-efficient (O(1) storage), and works for any iterator—even infinite ones (with a take). For multiple elements, choose_multiple extends the same logic using a k-element reservoir where each new element has probability k/i of entering the reservoir by replacing a random existing element.