How does regex::RegexBuilder::size_limit prevent denial-of-service from pathological patterns?

size_limit restricts the memory allocated for compiled regex state machines, preventing pathological patterns that would otherwise cause exponential state explosion during compilation or execution. The limit constrains the size of the deterministic finite automaton (DFA) that the regex engine builds, rejecting patterns that would require too many states—patterns that could otherwise consume excessive memory or CPU time, enabling denial-of-service attacks.

Basic size_limit Configuration

use regex::RegexBuilder;
 
fn basic_size_limit() {
    // Default size limit is approximately 10MB
    let regex = RegexBuilder::new(r"\d+")
        .size_limit(1024 * 1024)  // 1MB limit
        .build();
    
    match regex {
        Ok(re) => {
            // Regex compiled successfully within limit
            println!("Pattern compiled: {:?}", re);
        }
        Err(e) => {
            // Pattern exceeded size limit
            eprintln!("Regex too large: {}", e);
        }
    }
}

The size_limit method sets a byte limit on the compiled regex state machine size.

Pathological Pattern Example

use regex::RegexBuilder;
 
fn pathological_pattern() {
    // This pattern can cause exponential state explosion:
    // a?^n a^n matches but requires exponential backtracking
    let pattern = "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaa";
    
    // Without size limit, this could hang or consume massive memory
    let result = RegexBuilder::new(pattern)
        .size_limit(1024)  // Very small limit to demonstrate
        .build();
    
    match result {
        Ok(_) => println!("Pattern compiled successfully"),
        Err(e) => println!("Pattern rejected: {}", e),
        // Error: "regex size limit exceeded"
    }
}

Pathological patterns like nested optional quantifiers can cause state explosion.

How State Machine Size Grows

use regex::RegexBuilder;
 
fn state_size_growth() {
    // Simple patterns: small state machine
    let simple = RegexBuilder::new(r"\d+")
        .build()
        .unwrap();
    // Few states needed
    
    // Character classes: moderate growth
    let moderate = RegexBuilder::new(r"[a-zA-Z0-9]+")
        .build()
        .unwrap();
    // More states for character ranges
    
    // Alternation: can grow multiplicatively
    let complex = RegexBuilder::new(r"(abc|abd|abe|abf|abg)+")
        .build()
        .unwrap();
    // States multiply with alternation depth
    
    // Nested quantifiers: potential exponential growth
    let dangerous = RegexBuilder::new(r"(a+)+b")
        .size_limit(100 * 1024)  // May need higher limit
        .build();
    // Can require many states due to backtracking potential
}

State machine complexity depends on pattern structure; certain constructs cause multiplicative state growth.

Size Limit vs Execution Time

use regex::RegexBuilder;
 
fn size_vs_execution() {
    // size_limit affects compilation, not just execution
    // Large state machines also take longer to execute
    
    // Pattern that creates large state machine
    let pattern = r"(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p)+";
    
    // Low limit: Rejects at compile time
    let low_limit = RegexBuilder::new(pattern)
        .size_limit(100)
        .build();
    assert!(low_limit.is_err());
    
    // High limit: Accepts but may use more memory/CPU
    let high_limit = RegexBuilder::new(pattern)
        .size_limit(10 * 1024 * 1024)  // 10MB
        .build();
    assert!(high_limit.is_ok());
}

Larger state machines consume more memory during both compilation and execution.

Common Attack Patterns

use regex::RegexBuilder;
 
fn common_attack_patterns() {
    // Pattern 1: Nested quantifiers (a+)+ 
    // Causes exponential backtracking on "aaa...aaa!"
    let nested = "(a+)+b";
    // Input "aaaaaaaaaaaaaaaaaaaac" causes exponential time
    
    // Pattern 2: Overlapping alternatives (a|a)*
    // Creates many states for overlapping matches
    let overlapping = "(a|aa|aaa)+";
    
    // Pattern 3: Backreferences
    // Can require exponential states
    let backref = r"(a+)\1";
    
    // Pattern 4: Counted repetitions with ranges
    // a{1,1000} creates many states
    let counted = "a{1,1000}";
    
    // size_limit protects against all these
    for pattern in [nested, overlapping, backref, counted] {
        let result = RegexBuilder::new(pattern)
            .size_limit(1024 * 1024)  // 1MB
            .build();
        
        if result.is_err() {
            println!("Pattern '{}' rejected due to size limit", pattern);
        }
    }
}

These patterns are commonly exploited in ReDoS (Regular Expression Denial of Service) attacks.

Setting Appropriate Limits

use regex::RegexBuilder;
 
fn appropriate_limits() {
    // For simple validation patterns
    let validation = RegexBuilder::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
        .size_limit(64 * 1024)  // 64KB is usually enough for validation
        .build();
    
    // For complex search patterns
    let search = RegexBuilder::new(r"\b\w+\b")
        .size_limit(1024 * 1024)  // 1MB for typical search
        .build();
    
    // For user-provided patterns (need strict limits)
    fn compile_user_pattern(pattern: &str) -> Result<regex::Regex, regex::Error> {
        RegexBuilder::new(pattern)
            .size_limit(256 * 1024)  // Conservative 256KB limit
            .build()
    }
    
    // For trusted patterns (internal use)
    let trusted = RegexBuilder::new(r"complex|pattern|here")
        .size_limit(10 * 1024 * 1024)  // 10MB for trusted patterns
        .build();
}

Set size limits based on pattern source and complexity needs.

Size Limit vs Backtrack Limit

use regex::RegexBuilder;
 
fn size_vs_backtrack() {
    // size_limit: Controls state machine size at compile time
    // Prevents patterns that create too many states
    
    // backtrack_limit: Controls execution time
    // Prevents patterns from running too long
    
    // Both limits work together for DoS protection
    let result = RegexBuilder::new(r"(a+)+b")
        .size_limit(1024 * 1024)      // Limit state machine size
        .backtrack_limit(100_000)      // Limit backtracking steps
        .build();
    
    // size_limit catches patterns at compile time
    // backtrack_limit catches patterns at execution time
    
    match result {
        Ok(re) => {
            // Use carefully
            let text = "aaaaaaaaaaaaaaaaaaaaaaaaaac";
            match re.find(text) {
                Some(m) => println!("Match: {:?}", m),
                None => println!("No match"),
            }
        }
        Err(e) => {
            println!("Rejected: {}", e);
        }
    }
}

size_limit prevents compilation-time DoS; backtrack_limit prevents execution-time DoS.

Measuring Actual State Size

use regex::RegexBuilder;
 
fn measure_state_size() {
    // There's no direct API to query state machine size
    // But you can approximate by trying different limits
    
    fn find_min_size_limit(pattern: &str) -> Option<usize> {
        // Binary search for minimum viable size limit
        let mut low = 1024usize;
        let mut high = 100 * 1024 * 1024;  // 100MB
        
        while low < high {
            let mid = low + (high - low) / 2;
            match RegexBuilder::new(pattern)
                .size_limit(mid)
                .build()
            {
                Ok(_) => high = mid,
                Err(_) => low = mid + 1,
            }
        }
        
        Some(low)
    }
    
    // Simple pattern needs little space
    if let Some(size) = find_min_size_limit(r"\d+") {
        println!("Pattern \\d+ needs approximately {} bytes", size);
    }
}

The actual state machine size depends on pattern complexity and optimization.

Comparison with Other Regex Engines

use regex::RegexBuilder;
 
fn engine_comparison() {
    // The 'regex' crate uses a hybrid NFA/DFA approach
    // size_limit applies to the compiled DFA
    
    // Other engines handle limits differently:
    // - PCRE: Uses recursion limit (stack-based)
    // - RE2: Always-bounded execution (Google's safe regex)
    // - Oniguruma: Various limit options
    
    // The 'regex' crate approach:
    // 1. Compiles to NFA first
    // 2. Lazily builds DFA states during execution
    // 3. size_limit caps total DFA state cache size
    
    // This provides protection while maintaining performance
    let result = RegexBuilder::new(r"complex|pattern|here")
        .size_limit(1024 * 1024)
        .build();
}

The regex crate's hybrid approach balances safety with performance.

Pattern Rewriting to Reduce Size

use regex::RegexBuilder;
 
fn pattern_optimization() {
    // Avoid patterns that create large state machines
    
    // Bad: Nested quantifiers
    let bad_pattern = r"(a|b)*c";
    // Creates many states due to potential backtracking
    
    // Better: Flatten if possible
    let better_pattern = r"[ab]*c";
    // Simpler state machine
    
    // Bad: Overlapping alternatives
    let bad = r"(a|ab|abc)+";
    // Many overlapping possibilities
    
    // Better: Specific ordering or non-overlapping
    let better = r"(?:a|ab|abc)+";
    // Still overlapping but clearer intent
    
    // Bad: Unnecessary groups
    let bad = r"((a|b))+";
    
    // Better: Non-capturing or simplified
    let better = r"(?:a|b)+";
    
    // Test both patterns
    for pattern in [bad_pattern, better_pattern] {
        let result = RegexBuilder::new(pattern)
            .size_limit(1024)
            .build();
        println!("Pattern '{}': {}", pattern, 
            if result.is_ok() { "OK" } else { "Rejected" });
    }
}

Rewriting patterns can significantly reduce state machine size.

Error Handling for Size Limit

use regex::RegexBuilder;
use std::error::Error;
 
fn handle_size_limit_error() -> Result<(), Box<dyn Error>> {
    let pattern = r"(a+)+b";
    
    let regex = RegexBuilder::new(pattern)
        .size_limit(100)
        .build()
        .map_err(|e| {
            if e.to_string().contains("size limit") {
                // Specific handling for size limit errors
                format!("Pattern '{}' creates too large a state machine", pattern)
            } else {
                format!("Pattern compilation failed: {}", e)
            }
        })?;
    
    Ok(())
}
 
fn safe_user_regex(pattern: &str) -> Result<regex::Regex, String> {
    RegexBuilder::new(pattern)
        .size_limit(256 * 1024)  // 256KB limit for user patterns
        .build()
        .map_err(|e| {
            // Provide user-friendly error message
            if e.to_string().contains("size limit") {
                "Pattern is too complex. Simplify nested quantifiers or reduce alternations.".to_string()
            } else {
                format!("Invalid pattern: {}", e)
            }
        })
}

Catch and handle size limit errors with user-friendly messages.

Testing Size Limits

use regex::RegexBuilder;
 
#[cfg(test)]
mod tests {
    use super::*;
    
    #[test]
    fn test_reasonable_patterns_compile() {
        // Reasonable patterns should compile within size limit
        let patterns = [
            r"\d+",
            r"[a-zA-Z]+",
            r"^https?://[\w.-]+",
            r"^\w+@\w+\.\w+$",
        ];
        
        for pattern in patterns {
            let result = RegexBuilder::new(pattern)
                .size_limit(64 * 1024)
                .build();
            assert!(result.is_ok(), "Pattern '{}' should compile", pattern);
        }
    }
    
    #[test]
    fn test_pathological_patterns_rejected() {
        // Pathological patterns should be rejected
        let patterns = [
            r"(a+)+",
            r"(a|aa)+",
            r"(?:a?){1000}a{1000}",
        ];
        
        for pattern in patterns {
            let result = RegexBuilder::new(pattern)
                .size_limit(1024)  // Very strict limit
                .build();
            // May succeed with higher limits, but strict limits should reject
            // Actual behavior depends on pattern optimization
        }
    }
}

Test that reasonable patterns compile and pathological ones are caught.

Integration in Web Services

use regex::RegexBuilder;
use std::collections::HashMap;
use std::sync::Arc;
 
// Service for compiling user regex patterns safely
struct SafeRegexService {
    cache: HashMap<String, Arc<regex::Regex>>,
    max_size: usize,
}
 
impl SafeRegexService {
    fn new() -> Self {
        Self {
            cache: HashMap::new(),
            max_size: 256 * 1024,  // 256KB
        }
    }
    
    fn compile(&mut self, pattern: &str) -> Result<Arc<regex::Regex>, String> {
        // Check cache first
        if let Some(re) = self.cache.get(pattern) {
            return Ok(Arc::clone(re));
        }
        
        // Compile with size limit
        let re = RegexBuilder::new(pattern)
            .size_limit(self.max_size)
            .backtrack_limit(50_000)  // Also limit execution
            .build()
            .map_err(|e| format!("Pattern rejected: {}", e))?;
        
        let re = Arc::new(re);
        self.cache.insert(pattern.to_string(), Arc::clone(&re));
        Ok(re)
    }
    
    fn validate_input(&self, pattern: &str, input: &str) -> Result<bool, String> {
        let re = self.compile(pattern)?;
        Ok(re.is_match(input))
    }
}

Web services should use size limits for any user-provided regex patterns.

Synthesis

Quick comparison:

Limit Type What it Limits When Applied Attack Prevented
size_limit DFA state machine memory Compile time State explosion
backtrack_limit Backtracking iterations Execution time Catastrophic backtracking

Common limits:

use regex::RegexBuilder;
 
fn common_limits() {
    // Conservative for user input
    RegexBuilder::new("pattern")
        .size_limit(64 * 1024)      // 64KB
        .backtrack_limit(10_000);
    
    // Moderate for internal patterns
    RegexBuilder::new("pattern")
        .size_limit(1024 * 1024)    // 1MB
        .backtrack_limit(100_000);
    
    // Relaxed for trusted/known patterns
    RegexBuilder::new("pattern")
        .size_limit(10 * 1024 * 1024)  // 10MB
        .backtrack_limit(1_000_000);
}

Key insight: size_limit is a compile-time protection against ReDoS attacks that works by limiting the memory required for the regex engine's state machine. The regex crate uses a hybrid NFA/DFA approach where states are lazily built during execution; size_limit caps the total size of this state cache. This prevents patterns that would create exponentially many states—patterns like (a|b)* where the number of possible states grows with input length. The protection is proactive: rather than allowing pathological patterns to compile and potentially hang during execution, size_limit rejects them upfront during compilation. For comprehensive DoS protection, combine size_limit with backtrack_limit: the former catches patterns that create large state machines, while the latter catches patterns that would execute too many operations during matching. When accepting regex patterns from untrusted sources—search functionality, configuration files, API endpoints—always apply strict size limits to prevent denial-of-service attacks through regex state explosion.