How does regex::RegexBuilder::size_limit prevent denial-of-service from exponential backtracking?

regex::RegexBuilder::size_limit sets a maximum memory bound on the compiled regex state machine, preventing denial-of-service attacks where malicious patterns could cause the regex engine to consume excessive memory during compilation. While the regex crate is generally immune to catastrophic backtracking due to its NFA-based implementation, certain patterns with massive character classes or deeply nested repetitions can produce enormous compiled state machines that exhaust memory. The size_limit method constrains the compiled automaton size, causing compilation to fail rather than allowing runaway memory consumption. This is distinct from execution-time protection—regex uses lazy DFAs and bounded backtracking that naturally limit execution time—but protects against the compilation phase where certain pathological patterns could otherwise allocate gigabytes of memory for their state machines.

The Problem: Runaway Compilation

use regex::Regex;
 
fn main() {
    // A pattern that compiles to a huge state machine
    let pattern = "[a-z]{50}";
    
    // This compiles but creates a large automaton
    let re = Regex::new(pattern).unwrap();
    
    println!("Pattern compiled successfully");
    // For very large patterns, this could consume significant memory
}

Patterns with large character classes or complex alternations can create huge state machines.

Basic size_limit Usage

use regex::RegexBuilder;
 
fn main() {
    // Limit compiled regex to 1MB
    let result = RegexBuilder::new(r"[a-z]{50}")
        .size_limit(1024 * 1024)  // 1 MB limit
        .build();
    
    match result {
        Ok(re) => println!("Compiled: {:?}", re),
        Err(e) => println!("Compilation failed: {}", e),
    }
}

size_limit constrains the memory used by the compiled state machine.

Patterns That Trigger Large Compilation

use regex::{Regex, RegexBuilder};
 
fn main() {
    // These patterns can create large compiled state machines
    
    // 1. Large counted repetitions
    let pattern1 = r"a{100}";  // 100 consecutive 'a' states
    
    // 2. Large character classes with repetitions
    let pattern2 = r"[a-zA-Z0-9]{50}";  // Many states for character combinations
    
    // 3. Complex alternations
    let pattern3 = r"(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z){20}";
    
    // 4. Nested repetitions
    let pattern4 = r"(a+){10}";  // Can create many states
    
    for (name, pattern) in [("large repetition", pattern1), 
                            ("char class repetition", pattern2),
                            ("alternation repetition", pattern3),
                            ("nested repetition", pattern4)] {
        let result = RegexBuilder::new(pattern)
            .size_limit(100_000)  // 100KB limit
            .build();
        
        match result {
            Ok(_) => println!("{}: compiled within limit", name),
            Err(e) => println!("{}: exceeded limit - {}", name, e),
        }
    }
}

Various pattern constructs can lead to large compiled state machines.

Understanding What size_limit Limits

use regex::RegexBuilder;
 
fn main() {
    // size_limit constrains the COMPILED state machine, not:
    // - Execution time (handled by the regex engine's bounded execution)
    // - Input string size (not limited)
    // - Match result size (not limited)
    
    // This pattern with a limited character class
    let small_pattern = r"a{10}";
    let small_re = RegexBuilder::new(small_pattern)
        .size_limit(10_000)
        .build()
        .unwrap();
    
    // Can still match against very long strings
    let long_string = "a".repeat(1_000_000);
    let matches: Vec<_> = small_re.find_iter(&long_string).collect();
    println!("Found {} matches in 1M character string", matches.len());
    
    // size_limit only affects compilation, not execution
}

size_limit affects compilation memory, not execution behavior.

Default Size Limit

use regex::RegexBuilder;
 
fn main() {
    // The default size limit is approximately 10MB
    // Patterns exceeding this will fail to compile
    
    // Check the current default
    let re = RegexBuilder::new(r"a")
        .build()
        .unwrap();
    
    // The default is set by the regex crate's internal constant
    println!("Default size limit is around 10MB");
    
    // You can check if a pattern exceeds default by attempting compilation
    let large_pattern = format!("[a-z]{{{}}}", 1000);
    match Regex::new(&large_pattern) {
        Ok(_) => println!("Pattern within default limit"),
        Err(e) => println!("Pattern exceeded default: {}", e),
    }
}

The regex crate has a sensible default limit around 10MB.

Setting Appropriate Limits

use regex::RegexBuilder;
 
fn main() {
    // Different applications need different limits
    
    // Web API handling untrusted input - be strict
    let api_limit = 100 * 1024;  // 100KB
    
    // Configuration file parsing - moderate
    let config_limit = 1024 * 1024;  // 1MB
    
    // Trusted internal patterns - generous
    let internal_limit = 10 * 1024 * 1024;  // 10MB
    
    // Example: parsing user-provided regex in a search API
    let user_pattern = r"[a-z]+@[a-z]+\.[a-z]{2,}";
    
    match RegexBuilder::new(user_pattern)
        .size_limit(api_limit)
        .build() {
        Ok(re) => println!("User pattern accepted"),
        Err(e) => println!("Pattern too complex: {}", e),
    }
}

Choose limits based on your threat model and use case.

Compilation Error Handling

use regex::{Regex, RegexBuilder};
use std::fmt;
 
#[derive(Debug)]
enum RegexError {
    Compilation(String),
    SizeLimit(String),
}
 
impl fmt::Display for RegexError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RegexError::Compilation(s) => write!(f, "Regex compilation error: {}", s),
            RegexError::SizeLimit(s) => write!(f, "Regex size limit exceeded: {}", s),
        }
    }
}
 
impl std::error::Error for RegexError {}
 
fn compile_safe_regex(pattern: &str, max_size: usize) -> Result<Regex, RegexError> {
    RegexBuilder::new(pattern)
        .size_limit(max_size)
        .build()
        .map_err(|e| {
            let msg = e.to_string();
            if msg.contains("exceeds size limit") || msg.contains("too big") {
                RegexError::SizeLimit(msg)
            } else {
                RegexError::Compilation(msg)
            }
        })
}
 
fn main() {
    let pattern = r"[a-z]{1000}";
    
    match compile_safe_regex(pattern, 1024) {
        Ok(re) => println!("Compiled: {:?}", re),
        Err(RegexError::SizeLimit(msg)) => {
            println!("Pattern rejected - too large: {}", msg);
        }
        Err(RegexError::Compilation(msg)) => {
            println!("Invalid pattern syntax: {}", msg);
        }
    }
}

Handle size limit errors distinctly from syntax errors for better user feedback.

Comparison with Execution Time Limits

use regex::RegexBuilder;
use std::time::{Duration, Instant};
 
fn main() {
    // regex crate has built-in protection against exponential backtracking
    // Unlike some regex engines, it uses NFA-based execution
    
    // This "evil" pattern that causes ReDoS in backtracking engines
    let evil_pattern = r"(a+)+$";
    let re = RegexBuilder::new(evil_pattern)
        .size_limit(1024 * 1024)
        .build()
        .unwrap();
    
    // In backtracking engines, this would be catastrophic
    let input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
    
    let start = Instant::now();
    let is_match = re.is_match(input);
    let duration = start.elapsed();
    
    println!("Match: {}, Time: {:?}", is_match, duration);
    // With regex crate, this completes quickly
    // The NFA-based engine doesn't have exponential backtracking
    
    // size_limit is for compilation, not execution
    println!("Note: regex crate's NFA engine prevents ReDoS by design");
}

The regex crate prevents ReDoS through its NFA engine; size_limit addresses compilation memory.

Real-World Attack Scenario

use regex::{Regex, RegexBuilder};
 
fn main() {
    // Scenario: User-provided regex patterns in a search feature
    
    // Malicious user tries to submit a pattern that creates huge state machine
    let malicious_patterns = [
        // Large counted repetition
        r"[a-z]{10000}",
        // Deeply nested alternatives
        r"((((((((((a|b))))))))))",
        // Large character class combinations
        r"[ -~]{1000}",  // ASCII printable range
    ];
    
    let size_limit = 50 * 1024;  // 50KB limit for user patterns
    
    for (i, pattern) in malicious_patterns.iter().enumerate() {
        match RegexBuilder::new(pattern)
            .size_limit(size_limit)
            .build() {
            Ok(_) => println!("Pattern {}: accepted", i),
            Err(e) => println!("Pattern {}: rejected - {}", i, e),
        }
    }
    
    // Without size_limit, these could exhaust server memory
}

size_limit protects services that accept user-provided regex patterns.

Nesting and Alternation Impact

use regex::RegexBuilder;
 
fn main() {
    // Patterns that can create large compiled sizes
    
    // Each nesting level multiplies states
    let shallow = r"(a|b)";           // Small
    let medium = r"((a|b)|(c|d))";    // Medium  
    let deep = r"((((a|b)|(c|d))|((e|f)|(g|h)))|(((i|j)|(k|l))|((m|n)|(o|p))))";  // Large
    
    let limit = 10 * 1024;  // 10KB
    
    for (name, pattern) in [("shallow", shallow), ("medium", medium), ("deep", deep)] {
        match RegexBuilder::new(pattern)
            .size_limit(limit)
            .build() {
            Ok(_) => println!("{}: compiled", name),
            Err(_) => println!("{}: exceeded limit", name),
        }
    }
    
    // With repetition, size grows exponentially
    let repeated = r"(a|b){20}";  // Alternation with repetition
    
    match RegexBuilder::new(repeated)
        .size_limit(1024)
        .build() {
        Ok(_) => println!("repeated: compiled"),
        Err(_) => println!("repeated: exceeded limit"),
    }
}

Nesting depth and alternations multiply compiled state machine size.

Combining with Other Builder Options

use regex::RegexBuilder;
 
fn main() {
    // size_limit can be combined with other builder options
    
    let pattern = r"(?i)[a-z]{10}";  // Case insensitive
    
    let re = RegexBuilder::new(pattern)
        .size_limit(100 * 1024)      // Limit compiled size
        .case_insensitive(true)      // Explicit case insensitive
        .multi_line(true)            // Multi-line mode
        .dot_matches_new_line(true)  // Dot matches \n
        .build()
        .unwrap();
    
    // Test the compiled regex
    let text = "ABCDEFGHIJ";
    assert!(re.is_match(text));
    println!("Case insensitive match works");
}

size_limit integrates with other RegexBuilder configuration options.

Monitoring Compiled Size

use regex::Regex;
 
fn estimate_regex_complexity(pattern: &str) -> usize {
    // Rough heuristic: count special constructs
    let mut complexity = 1;
    let chars: Vec<char> = pattern.chars().collect();
    
    for i in 0..chars.len() {
        match chars[i] {
            '*' | '+' | '?' => complexity *= 2,
            '{' => {
                // Counted repetition - potentially large
                complexity += 10;
            }
            '[' => {
                // Character class
                complexity += 5;
            }
            '|' => {
                // Alternation
                complexity *= 2;
            }
            '(' if i + 1 < chars.len() && chars[i + 1] != '?' => {
                // Grouping
                complexity += 2;
            }
            _ => {}
        }
    }
    
    complexity
}
 
fn main() {
    let patterns = [
        r"a",
        r"a*",
        r"a{10}",
        r"[a-z]{10}",
        r"(a|b){10}",
    ];
    
    for pattern in patterns {
        let complexity = estimate_regex_complexity(pattern);
        let recommended_limit = complexity * 1024;  // Rough estimate
        
        println!("Pattern: {}", pattern);
        println!("  Estimated complexity: {}", complexity);
        println!("  Recommended size_limit: {} bytes", recommended_limit);
        
        // Try compiling with suggested limit
        let result = regex::RegexBuilder::new(pattern)
            .size_limit(recommended_limit.max(1024))
            .build();
        
        match result {
            Ok(_) => println!("  Result: compiled successfully"),
            Err(e) => println!("  Result: failed - {}", e),
        }
    }
}

Heuristics can help estimate appropriate limits before compilation.

size_limit vs bytes::Regex

use regex::bytes::RegexBuilder as BytesRegexBuilder;
use regex::RegexBuilder;
 
fn main() {
    // size_limit applies to both string and bytes regex
    
    let pattern = r"[a-z]{50}";
    
    // String regex
    let string_re = RegexBuilder::new(pattern)
        .size_limit(50 * 1024)
        .build();
    
    // Bytes regex
    let bytes_re = BytesRegexBuilder::new(pattern)
        .size_limit(50 * 1024)
        .build();
    
    println!("String regex: {:?}", string_re.is_ok());
    println!("Bytes regex: {:?}", bytes_re.is_ok());
    
    // Both support the same size_limit for their compiled state machines
}

size_limit works for both regex::Regex and regex::bytes::Regex.

Fuzzing and Testing Patterns

use regex::{Regex, RegexBuilder};
 
fn test_pattern_safety(pattern: &str) -> Result<Regex, String> {
    // Compile with strict limits to test pattern safety
    let result = RegexBuilder::new(pattern)
        .size_limit(10 * 1024)  // 10KB limit for testing
        .build();
    
    match result {
        Ok(re) => {
            // Also test with various inputs to ensure reasonable execution
            let test_inputs = ["", "a", "test", &"a".repeat(100)];
            
            for input in test_inputs {
                // This should be fast with regex crate's NFA
                let _ = re.is_match(input);
            }
            
            Ok(re)
        }
        Err(e) => Err(format!("Pattern unsafe: {}", e)),
    }
}
 
fn main() {
    let test_patterns = [
        r"\d+",           // Safe
        r"[a-z]+",        // Safe
        r"[a-z]{10000}",  // Large - may fail size limit
        r"(a|b){100}",    // Complex - may fail size limit
    ];
    
    for pattern in test_patterns {
        match test_pattern_safety(pattern) {
            Ok(_) => println!("{}: SAFE", pattern),
            Err(e) => println!("{}: {}", pattern, e),
        }
    }
}

Test user-provided patterns with strict limits before using them in production.

Synthesis

What size_limit protects against:

Scenario Protected? Mechanism
Compilation memory exhaustion Yes Limits state machine size
Runtime ReDoS attacks No (not needed) NFA engine prevents backtracking
Input string size No Not the purpose of size_limit
Match result size No Not the purpose of size_limit

Recommended limits by context:

Context Recommended Limit Rationale
User-provided patterns 50-100 KB Untrusted input needs strict limits
Application patterns 1-10 MB Trusted patterns can be larger
System utilities Default (10 MB) Balance safety and functionality

Pattern characteristics affecting size:

Construct Impact on Size
Literal characters Minimal
Simple character classes Small
Large counted repetitions {n} Linear in n
Large character classes Proportional to class size
Nested alternations Exponential in depth
Repetition of alternation Multiplicative

Key insight: regex::RegexBuilder::size_limit addresses a specific attack vector—malicious patterns that compile to enormous state machines—rather than the more common ReDoS attack based on exponential backtracking. The regex crate's NFA-based execution engine already prevents the classic ReDoS scenario where a pattern like (a+)+$ matches exponentially against crafted input. Instead, size_limit protects the compilation phase: a pattern like [a-z]{100000} could allocate hundreds of megabytes for its state machine, and without a limit, a server accepting user-provided regex patterns could be memory-starved. The default limit of approximately 10MB provides reasonable protection for most use cases, but security-conscious applications handling untrusted patterns should set stricter limits (50-100KB). The method is part of RegexBuilder alongside other compilation-time options like case_insensitive and multi_line, and it applies before any execution happens—if compilation fails due to size, no state machine is created and no input can be processed. This makes it a first line of defense for regex-powered search features, configuration parsers that accept patterns, or any system where users can influence the regex but shouldn't be able to consume arbitrary server memory.