How does `regex::RegexBuilder::size_limit` prevent denial-of-service from exponential backtracking?

regex::RegexBuilder::size_limit sets a byte limit on the size of the compiled regex's internal state machine, preventing malicious or accidental patterns from consuming excessive memory during compilation. Certain regex patterns—particularly those with nested quantifiers like (a+)+b—can cause exponential state space explosion when compiled to a deterministic finite automaton (DFA). The size limit catches these pathological patterns at compile time, rejecting them before they can exhaust system resources. This is distinct from runtime backtracking limits; size_limit protects against compilation-time denial-of-service, not matching-time attacks.

The Exponential State Explosion Problem

use regex::Regex;
 
fn problematic_patterns() {
    // These patterns can cause exponential DFA growth
    
    // Nested quantifiers: (a+)+b
    // Matches "ab", "aab", "aaab", etc., but the DFA has exponential states
    // because each "a" can be matched by either the inner or outer +
    
    // Simple input: "aaaab"
    // The DFA must track all possible ways to partition the "a"s
    
    // This pattern works for reasonable inputs:
    let pattern = "(a+)+b";
    let re = Regex::new(pattern).unwrap();
    assert!(re.is_match("aaaab"));
    
    // But with certain inputs, DFA compilation could be huge
    // size_limit prevents this by capping DFA size
}

Nested quantifiers and alternations can create patterns where the minimal DFA has exponentially more states than the pattern length.

Basic size_limit Usage

use regex::RegexBuilder;
 
fn basic_size_limit() {
    // Default size limit is reasonable for most patterns
    let re = RegexBuilder::new(r"\w+@\w+\.\w+")
        .build()
        .unwrap();
    
    // Set explicit size limit (in bytes)
    let re = RegexBuilder::new(r"(a+)+b")
        .size_limit(10_000)  // 10KB limit for compiled DFA
        .build();
    
    match re {
        Ok(regex) => println!("Pattern compiled successfully"),
        Err(e) => println!("Pattern rejected: {}", e),
    }
}

The size_limit method sets the maximum bytes the compiled regex can consume.

Default Size Limit

use regex::RegexBuilder;
 
fn default_limit() {
    // Default: approximately 10MB for the DFA
    // This is generous for normal patterns
    
    // Most patterns compile well under the limit
    let normal = RegexBuilder::new(r"\d{4}-\d{2}-\d{2}")
        .build()
        .unwrap();
    
    // Patterns that exceed the limit fail to compile
    let problematic = RegexBuilder::new(r"(a|b|c|d|e|f)+")
        .size_limit(100)  // Very small limit for demonstration
        .build();
    
    match problematic {
        Ok(_) => println!("Compiled"),
        Err(e) => println!("Failed: {}", e),
        // Failed: compiled regex exceeds size limit
    }
}

The default limit is set high enough for practical patterns but catches pathological cases.

Size Limit vs Backtracking Limit

use regex::RegexBuilder;
 
fn two_different_limits() {
    // size_limit: compilation-time limit
    // Controls DFA/memory size during pattern compilation
    
    let re = RegexBuilder::new(r"(a+)+b")
        .size_limit(1_000_000)  // Compilation limit
        .build()
        .unwrap();
    
    // backtracking_limit (if using backtracking engine)
    // Controls runtime matching steps
    // Note: Rust's regex crate uses DFAs, not backtracking by default
    
    // The regex crate primarily uses:
    // - DFA (deterministic finite automaton)
    // - NFA (nondeterministic finite automaton) simulation
    // - Hybrid approaches
    
    // size_limit catches problems BEFORE matching starts
    // backtracking_limit catches problems DURING matching
}

size_limit prevents compilation attacks; it doesn't limit matching time.

Pattern Types That Trigger Size Limits

use regex::RegexBuilder;
 
fn exponential_patterns() {
    // Pattern 1: Nested quantifiers
    let nested = RegexBuilder::new(r"(a+)+")
        .size_limit(1000)
        .build();
    // Inner + and outer + create exponential possibilities
    
    // Pattern 2: Multiple overlapping alternations
    let overlapping = RegexBuilder::new(r"(a|aa|aaa|aaaa)+")
        .size_limit(1000)
        .build();
    // Many ways to match same strings
    
    // Pattern 3: Repeated optional groups
    let optional = RegexBuilder::new(r"(a?){100}")
        .size_limit(1000)
        .build();
    // Exponential combinations with many optional elements
    
    // These patterns create huge DFAs because:
    // - Each branch or quantifier adds state combinations
    // - Overlapping possibilities multiply states
    // - The minimal DFA becomes exponentially large
}

Patterns with nested quantifiers, overlapping alternations, or many optional elements can hit size limits.

Why DFAs Can Be Exponentially Large

// Understanding DFA state explosion:
//
// Pattern: (a+)+b
// Input possibilities: "ab", "aab", "aaab", "aaaab", ...
//
// For input "aaab" (three 'a's followed by 'b'):
// How many ways can we match the 'a's with (a+)+?
//
// (aaa)b    - outer + matches all three
// (a)(aa)b  - outer + has two groups: "a" and "aa"
// (aa)(a)b  - outer + has two groups: "aa" and "a"
// (a)(a)(a)b - outer + has three groups
//
// The DFA must track ALL these possibilities
// to correctly reject or accept
//
// For n 'a's: approximately 2^(n-1) groupings
// DFA states grow exponentially
 
fn explain_state_explosion() {
    // Simple alternation: reasonable
    let simple = RegexBuilder::new(r"a|b|c")
        .build()
        .unwrap();
    // DFA: ~3 states (one per alternative)
    
    // Nested quantifier: can be exponential
    let nested = RegexBuilder::new(r"(a+)+")
        .size_limit(10_000)
        .build();
    // DFA: potentially thousands of states
    // Because: inner + has states, outer + multiplies them
    
    // The regex crate's size_limit catches these
    // before they consume all available memory
}

DFAs must track all possible match paths, leading to exponential state growth for certain patterns.

Practical Example: Email Validation

use regex::RegexBuilder;
 
fn email_validation() {
    // Naive email pattern (simplified)
    let naive = RegexBuilder::new(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+")
        .build()
        .unwrap();
    
    // Complex email pattern (RFC 5322-ish)
    // This has many alternations and optional parts
    let complex_pattern = r#"(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"#;
    
    // Even complex patterns usually compile fine
    let complex = RegexBuilder::new(complex_pattern)
        .build()
        .unwrap();
    
    // size_limit protects against intentional DoS patterns
    // Not normal complex patterns
}

Normal complex patterns compile fine; size limits catch pathological cases.

Measuring Compiled Regex Size

use regex::Regex;
 
fn measure_size() {
    // The regex crate doesn't expose exact DFA size directly
    // But you can observe size_limit behavior:
    
    // Pattern that should be small
    let small = Regex::new(r"hello").unwrap();
    println!("Small pattern compiled");
    
    // Pattern approaching limits
    let medium = Regex::new(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
        .unwrap();
    println!("Medium pattern compiled");
    
    // If a pattern fails size_limit, it returns an error
    // rather than consuming unbounded memory
    
    // To check if your pattern is close to limits,
    // progressively reduce size_limit and see when it fails
}

Size limits prevent unbounded memory; patterns approaching limits fail with clear errors.

Setting Appropriate Size Limits

use regex::RegexBuilder;
 
fn setting_limits() {
    // Default: ~10MB - good for most applications
    
    // Small patterns: very low limit is fine
    let tiny = RegexBuilder::new(r"\d{4}")
        .size_limit(1000)  // 1KB
        .build()
        .unwrap();
    
    // Complex but valid patterns: increase limit
    let complex = RegexBuilder::new(r"[a-z]+@[a-z]+\.[a-z]{2,}")
        .size_limit(100_000)  // 100KB
        .build()
        .unwrap();
    
    // For accepting user-defined patterns:
    // Use conservative limits to prevent DoS
    fn compile_user_pattern(pattern: &str) -> Result<Regex, regex::Error> {
        RegexBuilder::new(pattern)
            .size_limit(50_000)  // Conservative: 50KB
            .build()
    }
    
    // For trusted internal patterns:
    // Higher limits or no limit (None not available, just use default)
    fn compile_trusted_pattern(pattern: &str) -> Result<Regex, regex::Error> {
        RegexBuilder::new(pattern)
            .size_limit(10_000_000)  // 10MB - generous for trusted patterns
            .build()
    }
}

Choose size limits based on whether patterns are trusted (internal) or untrusted (user input).

size_limit and Regex Engine Types

use regex::RegexBuilder;
 
fn engine_types() {
    // The regex crate uses multiple engines internally:
    // - PikeVM (NFA simulation) - handles any pattern, slower
    // - DFA (deterministic) - fast, but can be huge
    // - Hybrid/Lazy DFA - builds DFA on-demand
    
    // size_limit primarily affects DFA compilation
    // When DFA would be too large, the crate may fall back
    // to slower engines or reject the pattern
    
    let pattern = "(a|b)*c";
    
    // This pattern creates a reasonably-sized DFA
    let re = RegexBuilder::new(pattern)
        .size_limit(100_000)
        .build()
        .unwrap();
    
    // The regex crate intelligently chooses engines:
    // - Tries to build fast DFA
    // - Falls back to slower NFA if DFA too large
    // - Or fails if even fallback exceeds limits
}

The regex crate may use slower engines when fast DFAs would exceed size limits.

Attack Mitigation Example

use regex::RegexBuilder;
 
fn mitigating_attacks() {
    // Malicious user provides nested quantifiers
    let attack_patterns = vec
![
        "(a+)+b",           // Nested quantifiers
        "(a|aa)+",          // Overlapping alternatives
        "((a+)+)+",         // Triple nested
        "(.?){1000}",       // Exponential optional
    ];
    
    let safe_limit = 100_000;  // 100KB
    
    for pattern in attack_patterns {
        let result = RegexBuilder::new(pattern)
            .size_limit(safe_limit)
            .build();
        
        match result {
            Ok(_) => println!("Pattern '{}' compiled (under limit)", pattern),
            Err(e) => println!("Pattern '{}' rejected: {}", pattern, e),
        }
    }
    
    // Output likely:
    // Pattern '(a+)+b' compiled (under limit) - may be under default
    // Pattern '(a|aa)+' compiled (under limit)
    // Pattern '((a+)+)+' may be rejected - too complex
    // Pattern '(.?){1000}' rejected - too many states
    
    // Attack prevented: program doesn't hang or crash
}

Size limits transform potential DoS attacks into compile-time errors.

Comparison with Other Regex Libraries

// Comparison with other regex implementations:
//
// PCRE/Perl-style regex engines:
// - Use backtracking
// - Can have exponential TIME on certain inputs
// - Need backtracking limits (steps, time)
//
// RE2 (Google's regex library):
// - Uses DFAs (like Rust's regex crate)
// - Has compilation size limits
// - Guaranteed linear time matching
//
// Rust regex crate:
// - Uses DFAs and NFAs
// - Has size_limit for compilation
// - Guaranteed no exponential runtime
// - Best of both: fast AND safe
 
fn rust_regex_advantages() {
    // Rust's regex crate guarantees:
    // 1. O(n) matching time (linear in input)
    // 2. Bounded compilation memory (size_limit)
    // 3. No backtracking explosions
    
    // Other libraries might need:
    // - Backtracking limits
    // - Timeouts
    // - Input sanitization
}

Rust's regex crate uses DFA-based matching, preventing runtime backtracking attacks; size_limit prevents compilation attacks.

Related: bytes_limit for Compiled Regex

use regex::RegexBuilder;
 
fn bytes_limit() {
    // bytes_limit is another related setting
    // It limits the size of the compiled program in bytes
    
    let re = RegexBuilder::new(r"(a+)+")
        .size_limit(10_000)      // DFA size limit
        .build();
    
    // Note: Different regex implementations may have
    // different limit parameters. Check documentation.
    
    // The regex crate's size_limit specifically limits
    // the total heap memory used by compiled regex
}

The size limit constrains heap memory used by the compiled regex state machine.

Best Practices for Size Limits

use regex::RegexBuilder;
 
fn best_practices() {
    // 1. Use default for trusted internal patterns
    let internal = RegexBuilder::new(r"^[a-z]+$")
        .build()
        .unwrap();
    
    // 2. Set conservative limits for user-provided patterns
    fn compile_user_regex(pattern: &str) -> Result<regex::Regex, String> {
        RegexBuilder::new(pattern)
            .size_limit(10_000)  // 10KB - very conservative
            .build()
            .map_err(|e| format!("Invalid regex: {}", e))
    }
    
    // 3. Validate pattern complexity separately if needed
    fn is_pattern_safe(pattern: &str) -> bool {
        // Check for known dangerous patterns
        let dangerous = ["(a+)+", "(.*)*", "(.?){", "((.+)+)+"];
        !dangerous.iter().any(|d| pattern.contains(d))
    }
    
    // 4. Document limits in public APIs
    /// Compiles a user-provided regex pattern.
    /// 
    /// # Limits
    /// - Size limit: 50KB (prevents DoS from complex patterns)
    /// - Pattern must not contain nested quantifiers
    pub fn compile_pattern(pattern: &str) -> Result<regex::Regex, regex::Error> {
        RegexBuilder::new(pattern)
            .size_limit(50_000)
            .build()
    }
    
    // 5. Test patterns at various sizes
    #[cfg(test)]
    mod tests {
        use super::*;
        
        #[test]
        fn test_size_limit() {
            // Should compile
            assert!(compile_user_regex(r"\d+").is_ok());
            
            // Should fail - too complex
            // (actual behavior depends on limit and pattern)
        }
    }
}

Follow security best practices when accepting user-provided regex patterns.

Synthesis

Size limit purpose:

// size_limit prevents: Excessive memory during COMPILATION
// NOT: Excessive time during MATCHING
 
// The threat model:
// Attacker provides: (a+)+b
// With naive DFA:    Exponential states -> OOM
// With size_limit:   Pattern rejected -> Safe error

How it works:

// During compilation:
// 1. Parse regex pattern
// 2. Build NFA from pattern
// 3. Convert NFA to DFA (or simulate)
// 4. Track memory usage
// 5. If usage > size_limit: Abort and return error
// 6. If under limit: Complete and return Regex

Key distinction:

// Exponential BACKTRACKING (other engines):
// - Problem: Matching time
// - Input: "aaaaaaaaaaaaaaaX" against "(a+)+b"
// - Each 'a' creates branches
// - Matching takes exponential time
// - Solution: Time limits, backtracking limits
 
// Exponential STATE SPACE (Rust's issue):
// - Problem: Compilation memory
// - Pattern: "(a+)+" or similar
// - DFA has exponential states
// - Compilation uses exponential memory
// - Solution: size_limit
 
// Rust's regex crate prevents BOTH:
// - No backtracking engine -> No exponential TIME
// - size_limit -> No exponential MEMORY

Key insight: size_limit is a compile-time defense against patterns that would create enormous deterministic finite automata. The regex crate's DFA-based matching already prevents the classic "catastrophic backtracking" attacks that affect backtracking regex engines. But DFAs themselves can be exponentially large for certain patterns—size_limit catches these at compilation time, transforming a potential denial-of-service into a clear error message. For applications accepting user-provided regex patterns, setting a conservative size limit (like 10-50KB) is essential for security. For internal trusted patterns, the default limit (~10MB) is usually sufficient, but can be increased for legitimately complex patterns.

How does regex::RegexBuilder::size_limit prevent denial-of-service from exponential backtracking?