How does regex::RegexBuilder::size_limit prevent denial-of-service from pathological patterns?

RegexBuilder::size_limit caps the heap memory the compiled regex's internal state machine may consume, so patterns that would expand into enormous finite automata fail at compile time instead of exhausting memory. This targets a different attack vector than a backtracking engine's backtrack limit: backtrack limits bound time during matching, while size limits bound space during compilation. (Note that the regex crate's automata-based engine never backtracks and guarantees linear-time matching, so size_limit is its main denial-of-service defense.)

The Memory Problem in Regex Compilation

use regex::Regex;

fn memory_explosion_example() {
    // Counted repetitions are expanded at compile time, so nesting them
    // multiplies state counts: a{100} needs roughly 100 states, and
    // (a{100}){100} needs roughly 10,000.
    let pattern = "(a{100}){100}";

    // Deeply nested repetitions can demand millions of states;
    // without a limit, compilation could consume gigabytes of memory.
    let _ = Regex::new(pattern);
}

Regex engines compile patterns into state machines. Patterns with many alternations or nested counted repetitions can expand into state machines with millions of states.

Setting Size Limits

use regex::RegexBuilder;

fn basic_size_limit() -> Result<(), regex::Error> {
    // Default size limit: ~10 MB of heap.
    // Patterns exceeding this fail to compile.

    // Set a custom size limit (in bytes):
    let re = RegexBuilder::new(r"a(b|c)*d")
        .size_limit(1024 * 1024)  // 1 MB limit
        .build()?;

    // The pattern compiles because its state machine fits in 1 MB
    assert!(re.is_match("abbbbcdd"));

    Ok(())
}

The size limit restricts how large the compiled state machine can grow.

The Default Protection

use regex::{Regex, RegexBuilder};

fn default_limits() {
    // The default size limit is approximately 10 MB,
    // which stops most memory-exhaustion attacks.

    // A pattern with large counted repetitions may or may not fit,
    // depending on how many states the engine generates:
    let _result = Regex::new(r"(a?){50}b(c?){50}");

    // Explicitly small limit for untrusted input:
    let result = RegexBuilder::new(r"(a?){50}b")
        .size_limit(100 * 1024)  // 100 KB
        .build();

    match result {
        Ok(_re) => println!("Compiled successfully"),
        Err(e) => println!("Pattern too large: {}", e),
    }
}

The default 10MB limit protects against most accidental memory exhaustion.

Size Limit vs Backtrack Limit

use regex::RegexBuilder;

fn different_protections() {
    // Two different attack vectors:

    // 1. SIZE LIMIT: prevents exponential STATE SPACE.
    //    The pattern compiles into too many NFA/DFA states;
    //    the attack happens during COMPILATION.
    //
    //    Example: repetition explosion
    //    (a{100}){100} expands to roughly 10,000 states

    let _result = RegexBuilder::new(r"(a|b){20}")
        .size_limit(1024)  // Very small, for demonstration
        .build();
    // Likely fails: compiled regex exceeds the size limit

    // 2. BACKTRACK LIMIT: prevents exponential TIME.
    //    The pattern causes too many match attempts;
    //    the attack happens during EXECUTION.
    //
    //    Example: catastrophic backtracking (in backtracking engines)
    //    a(a?){50}b matched against "aaaaaaa...aaa"
    //
    //    Note: the regex crate has no backtrack_limit, because its
    //    automata-based engine guarantees linear-time matching.
    //    Backtracking engines such as the fancy-regex crate expose
    //    RegexBuilder::backtrack_limit for this purpose.

    // These are INDEPENDENT protections:
    // - a size limit bounds compilation memory
    // - a backtrack limit bounds execution work
}

Size limits protect memory during compilation; backtrack limits, where an engine offers them, protect time during matching.

Patterns That Create Large State Machines

use regex::RegexBuilder;

fn large_state_patterns() {
    // Pattern features that inflate state counts:

    // 1. Counted repetition of alternations: the {30} is expanded,
    //    duplicating the alternation's states thirty times
    let _p1 = "(a|aa|aaa|aaaa){30}";

    // 2. Nested counted repetitions multiply:
    //    (a{100}){100} needs roughly 100 * 100 states
    let _p2 = "(a{100}){100}";

    // 3. Large counted repetitions: one state per position
    let _p3 = "a{1000}";

    // 4. Repeated character classes: the class itself is compact,
    //    but the {100} still expands to 100 copies
    let _p4 = "[a-z0-9_-]{100}";

    // See where a pattern starts to fit:
    for limit in [1024, 10 * 1024, 100 * 1024, 1024 * 1024] {
        let result = RegexBuilder::new("(a|b|c|d|e){20}")
            .size_limit(limit)
            .build();

        match result {
            Ok(_) => println!("Compiled at {} bytes", limit),
            Err(_) => println!("Failed at {} bytes", limit),
        }
    }
}

Repeated alternations and nested counted repetitions cause state-machine explosions.

Practical Attack Scenario

use regex::{Regex, RegexBuilder};

fn dos_attack_example() {
    // Attacker-controlled pattern input:
    fn compile_user_pattern(input: &str) -> Result<Regex, String> {
        // WITHOUT an explicit limit:
        // Regex::new(input)  // up to ~10 MB per compiled pattern adds up

        // WITH protection:
        RegexBuilder::new(input)
            .size_limit(50 * 1024)  // 50 KB max
            .build()
            .map_err(|e| format!("Pattern rejected: {}", e))
    }

    // An attacker might try:
    let malicious_patterns = [
        "(a|b){50}",        // Repeated alternation
        "(a?){100}",        // Repeated optional
        "(a{100}){100}",    // Nested counted repetition
    ];

    for pattern in malicious_patterns {
        match compile_user_pattern(pattern) {
            Ok(_) => println!("Accepted: {}", pattern),
            Err(e) => println!("Rejected: {}", e),
        }
    }
}

Size limits protect against malicious patterns in user-controlled regex input.

Memory Calculation Details

use regex::Regex;

fn memory_internals() {
    // The size limit approximately accounts for:
    //
    // 1. NFA states (each state stores its transitions)
    // 2. Literal prefilter tables
    // 3. Character class tables
    // 4. Capture group information
    // 5. Various other internal data structures

    // Example: simple pattern
    let _re = Regex::new(r"a+b+c+").unwrap();
    // Minimal memory: a handful of states, simple transitions

    // Example: counted repetition
    let _re = Regex::new(r"(a|b|c|d|e|f|g|h|i|j){10}").unwrap();
    // The {10} expands into ten copies of the alternation, so this
    // uses noticeably more memory than the pattern's length suggests

    // The engine uses a lazy DFA at match time:
    // - The NFA is compiled up front (bounded by size_limit)
    // - DFA states are generated on demand during matching
    // - The lazy DFA's cache is bounded separately by
    //   RegexBuilder::dfa_size_limit
}

The size limit covers all internal data structures, not just raw state count.

Combining Size and Backtrack Limits

use regex::{Regex, RegexBuilder};

fn comprehensive_protection() -> Result<Regex, regex::Error> {
    // For untrusted patterns in the regex crate, size_limit is the main
    // knob; matching time is already guaranteed to be linear.
    RegexBuilder::new(r"[a-z]+@[a-z]+\.[a-z]+")
        .size_limit(100 * 1024)  // 100 KB for compilation
        .build()

    // In a backtracking engine such as fancy-regex, pair the size limit
    // with backtrack_limit to also cap matching work:
    //   fancy_regex::RegexBuilder::new(pattern)
    //       .backtrack_limit(100_000)
    //       .build()
}

// For trusted patterns, you might relax limits:
fn trusted_pattern() -> Result<Regex, regex::Error> {
    RegexBuilder::new(r"very complex trusted pattern")
        .size_limit(100 * 1024 * 1024)  // 100 MB for complex patterns
        .build()
}

Use size limits (and, in backtracking engines, backtrack limits) for defense-in-depth against untrusted patterns.

Error Handling for Size Limits

use regex::{Error, RegexBuilder};

fn handle_compilation_errors() {
    let pattern = "(a|b){100}";  // Will exceed small limits

    let result = RegexBuilder::new(pattern)
        .size_limit(1024)  // 1 KB: too small
        .build();

    match result {
        Ok(_re) => {
            println!("Compiled successfully");
        }
        Err(Error::CompiledTooBig(size_limit)) => {
            println!("Pattern would exceed the {}-byte size limit", size_limit);
            println!("Increase size_limit or simplify the pattern");
        }
        Err(e) => {
            println!("Other error: {}", e);
        }
    }
}

The CompiledTooBig error variant provides the size limit that was exceeded.

Realistic Size Limit Values

use regex::RegexBuilder;

fn realistic_limits() {
    // Conservative limits for untrusted input:
    let _conservative = RegexBuilder::new(r"[a-z]+")
        .size_limit(50 * 1024)  // 50 KB
        .build();

    // Moderate limits for semi-trusted input:
    let _moderate = RegexBuilder::new(r"complex pattern")
        .size_limit(1024 * 1024)  // 1 MB
        .build();

    // Relaxed limits for trusted/validated patterns:
    let _relaxed = RegexBuilder::new(r"very complex trusted pattern")
        .size_limit(10 * 1024 * 1024)  // 10 MB (the default)
        .build();

    // Minimal limits for strict security:
    let _minimal = RegexBuilder::new(r"simple pattern")
        .size_limit(10 * 1024)  // 10 KB
        .build();

    // The right limit depends on:
    // - Expected pattern complexity
    // - Trust level of the pattern source
    // - Available memory
    // - Performance requirements
}

Choose size limits based on threat model and resource constraints.

Patterns That Bypass Size Limits

use regex::RegexBuilder;

fn still_vulnerable() {
    // A size limit doesn't protect against EVERYTHING:

    // 1. Matching still costs time proportional to input length, and
    //    in backtracking engines (e.g. fancy-regex) patterns such as
    //    a(a?){50}b can be exponentially slow; cap those with a
    //    backtrack limit there.
    let _re = RegexBuilder::new("a(a?){50}b")
        .size_limit(1024 * 1024)
        .build()
        .unwrap();
    // Compiles small; any remaining cost shows up at match time

    // 2. Memory in the INPUT string:
    //    size_limit says nothing about input size.
    //    let input = "a".repeat(1_000_000_000);  // would allocate 1 GB

    // 3. Capture group overhead:
    //    many captures use memory beyond the state machine itself
    let _p = "(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)";

    // 4. Large literal patterns:
    //    "a".repeat(1_000_000) creates few interesting states,
    //    but the pattern string itself is 1 MB
}

Size limits specifically target state machine size, not all attack vectors.

Diagnostic Output

use regex::RegexBuilder;

fn debug_pattern_size() {
    // If a pattern fails to compile, find out how much room it needs:

    fn try_compile_with_increasing_limits(pattern: &str) {
        let limits: Vec<usize> = vec![
            1024,             // 1 KB
            10 * 1024,        // 10 KB
            100 * 1024,       // 100 KB
            1024 * 1024,      // 1 MB
            10 * 1024 * 1024, // 10 MB
        ];

        for limit in limits {
            match RegexBuilder::new(pattern)
                .size_limit(limit)
                .build()
            {
                Ok(_) => {
                    println!("Compiled at {} bytes", limit);
                    return;
                }
                Err(_) => {
                    println!("Failed at {} bytes", limit);
                }
            }
        }
    }

    try_compile_with_increasing_limits("(a|b|c|d|e){30}");
}

Incrementally increase limits to find a pattern's actual memory requirement.

Best Practices

use regex::{Regex, RegexBuilder};

fn best_practices() {
    // 1. Always set a size limit for untrusted patterns
    //    (in a backtracking engine, also set a backtrack limit)
    fn safe_compile(pattern: &str) -> Result<Regex, String> {
        RegexBuilder::new(pattern)
            .size_limit(100 * 1024)  // 100 KB
            .build()
            .map_err(|e| e.to_string())
    }

    // 2. Validate patterns before compilation
    fn validate_pattern(pattern: &str) -> Result<(), String> {
        let max_len = 1000;
        if pattern.len() > max_len {
            return Err(format!("Pattern too long: {} > {}", pattern.len(), max_len));
        }

        // Crude heuristic: reject large counted repetitions up front.
        // (The size limit is the real safety net; this just fails fast.)
        if pattern.contains("{50") || pattern.contains("{100") {
            return Err("Pattern contains suspicious repetition".into());
        }

        Ok(())
    }

    // 3. Use simpler alternatives when possible
    fn simplify_pattern() {
        // Instead of: (a|b|c|d|e|f|g|h|i|j)+
        // Consider:   [a-j]+
        // Character classes compile into compact tables

        let _good = Regex::new(r"[a-j]+");                 // Efficient
        let _bad = Regex::new(r"(a|b|c|d|e|f|g|h|i|j)+");  // Less efficient
    }

    // 4. Pre-compile patterns when possible
    fn precompile() {
        // Don't compile on every request; cache compiled regexes
        use std::sync::OnceLock;
        static CACHED: OnceLock<Regex> = OnceLock::new();

        let _re = CACHED.get_or_init(|| Regex::new(r"pattern").unwrap());
    }
}

Combine size limits with validation and pattern simplification for robust protection.

Summary

fn summary() {
    // 1. size_limit constrains compiled state machine memory
    // 2. Default is ~10 MB, suitable for most legitimate patterns
    // 3. Protects against compilation-time memory exhaustion
    // 4. The regex crate needs no backtrack limit: its engine guarantees
    //    linear-time matching (backtracking engines like fancy-regex
    //    pair a size limit with a backtrack limit)
    // 5. Counted repetitions of alternations are common attack vectors
    // 6. Error::CompiledTooBig reports the limit that was exceeded
    // 7. Size limits don't bound input size or match-time work
    // 8. Character classes [abc] compile more compactly than alternations (a|b|c)
    // 9. Pre-compile and cache patterns to avoid repeated compilation
}

Key insight: RegexBuilder::size_limit protects against memory-exhaustion attacks by bounding the size of the compiled regex state machine during compilation. In the regex crate this is the primary defense needed, because its automata-based engine already guarantees linear-time matching; in backtracking engines such as fancy-regex, a size limit should be paired with a backtrack limit, which caps execution work instead. The two limits target different attack vectors: size limits stop patterns whose compiled form explodes (like nested counted repetitions (a{100}){100}), while backtrack limits stop patterns that cause exponential matching time in backtracking engines (like a(a?){50}b against long strings of "a"). For untrusted regex input, choose limits based on your threat model and resource constraints. The default 10 MB size limit is a reasonable balance for most applications, but it should be reduced for public-facing services that compile user-supplied patterns, where memory exhaustion is a denial-of-service risk.