How does regex::RegexBuilder::size_limit prevent exponential backtracking in pathological cases?

regex::RegexBuilder::size_limit constrains the memory allocated for the compiled regex's internal state machine, preventing denial-of-service attacks from pathological patterns that would otherwise cause exponential state machine growth. When certain regex patterns are compiled—particularly those with nested quantifiers like (a+)+ or alternations with overlapping matches—the resulting finite automaton can grow exponentially relative to the pattern size. The size_limit method sets an upper bound on this growth, causing compilation to fail rather than consuming unbounded memory. This protection is essential for applications accepting user-provided regex patterns, where malicious inputs could otherwise exhaust system resources.

The Exponential Growth Problem

use regex::Regex;
 
fn main() {
    // This pattern looks simple but has exponential state explosion
    let pattern = "(a+)+b";
    
    // Without size limits, compilation could consume massive memory
    // The nested quantifiers create many overlapping possibilities
    match Regex::new(pattern) {
        Ok(re) => println!("Compiled: {:?}", re),
        Err(e) => println!("Error: {}", e),
    }
    
    // On pathological input like "aaaaaaaaaaaaaaaaaaaaaaaaaac",
    // matching can take exponential time
}

Nested quantifiers and overlapping alternations cause state machine explosion during compilation.

Basic size_limit Usage

use regex::RegexBuilder;
 
fn main() {
    // Set a size limit of 1MB for the compiled regex
    let result = RegexBuilder::new(r"(a+)+b")
        .size_limit(1024 * 1024)  // 1MB limit
        .build();
    
    match result {
        Ok(re) => println!("Compiled successfully"),
        Err(e) => println!("Compilation failed: {}", e),
    }
}

size_limit specifies the maximum memory allowed for the compiled state machine.

Default Size Limit

use regex::RegexBuilder;
 
fn main() {
    // The default size limit is approximately 10MB
    // This protects against most pathological patterns
    
    // Check if a pattern exceeds default limits
    let result = RegexBuilder::new(r"(a|b|c|d|e|f|g)+")
        .build();
    
    match result {
        Ok(re) => {
            println!("Pattern compiled within default limits");
        }
        Err(e) => {
            println!("Pattern too large: {}", e);
        }
    }
}

The default limit provides baseline protection without configuration.

Patterns That Cause State Explosion

use regex::RegexBuilder;
 
fn main() {
    // Nested quantifiers - classic pathological pattern
    let nested = "(a+)+";
    
    // Alternation with overlapping possibilities
    let overlapping = "(a|a)+";
    
    // Complex character classes with quantifiers
    let char_class = "([a-z][a-z])+";
    
    // Nested groups with alternations
    let nested_alt = "((a|b)+)+";
    
    let patterns = [nested, overlapping, char_class, nested_alt];
    
    for pattern in patterns {
        let result = RegexBuilder::new(pattern)
            .size_limit(100 * 1024)  // 100KB - intentionally small
            .build();
        
        match result {
            Ok(_) => println!("'{}' compiled OK", pattern),
            Err(_) => println!("'{}' exceeded size limit", pattern),
        }
    }
}

Nested quantifiers and overlapping alternatives are common causes of state explosion.

Exponential Backtracking in Matching

use regex::Regex;
use std::time::Instant;
 
fn main() {
    // Pattern vulnerable to backtracking
    let re = Regex::new(r"(a+)+b").unwrap();
    
    // Input that triggers exponential backtracking
    let input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac";
    
    let start = Instant::now();
    let is_match = re.is_match(input);
    let duration = start.elapsed();
    
    println!("Match: {}, Time: {:?}", is_match, duration);
    // This can take seconds or longer with enough 'a's
    // The pattern tries to match 'b' but input ends with 'c'
    // Each 'a' can belong to either inner or outer group
    // Exponentially many combinations are tried
}

The size_limit helps with compilation memory; backtracking limits help with match time.

Combining size_limit with backtrack_limit

use regex::RegexBuilder;
 
fn main() {
    // Protect against both compilation and matching issues
    let result = RegexBuilder::new(r"(a+)+b")
        .size_limit(1024 * 1024)          // 1MB for compilation
        .backtrack_limit(100_000)          // Limit backtracking steps
        .build();
    
    match result {
        Ok(re) => {
            let input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac";
            match re.is_match(input) {
                true => println!("Matched"),
                false => println!("No match"),
            }
        }
        Err(e) => println!("Compilation error: {}", e),
    }
}

Use both limits: size_limit for compilation, backtrack_limit for execution.

Nested Quantifier Analysis

use regex::RegexBuilder;
 
fn main() {
    // Each level of nesting multiplies state possibilities
    
    // Level 1: a+ - linear states
    let level1 = "a+";
    
    // Level 2: (a+)+ - quadratic to exponential
    let level2 = "(a+)+";
    
    // Level 3: ((a+)+)+ - exponential explosion
    let level3 = "((a+)+)+";
    
    for (name, pattern) in [("L1", level1), ("L2", level2), ("L3", level3)] {
        let result = RegexBuilder::new(pattern)
            .size_limit(10 * 1024)  // Small limit to show effect
            .build();
        
        println!("{} '{}': {}", name, pattern, 
            if result.is_ok() { "OK" } else { "FAILED" });
    }
}

Deeper nesting creates more dramatic state explosion.

Character Class Explosion

use regex::RegexBuilder;
 
fn main() {
    // Large character classes multiply states
    let small = "[a-z]+";  // 26 characters
    let medium = "[a-zA-Z0-9]+";  // 62 characters
    let large = "[\u{0000}-\u{FFFF}]+";  // Many characters
    
    // Combined with nested groups
    let nested_small = "([a-z]+)+";
    let nested_large = "([\u{0000}-\u{FFFF}]+)+";
    
    let patterns = [
        ("small_char_class", small),
        ("medium_char_class", medium),
        ("large_char_class", large),
        ("nested_small", nested_small),
        ("nested_large", nested_large),
    ];
    
    for (name, pattern) in patterns {
        let result = RegexBuilder::new(pattern)
            .size_limit(100 * 1024)
            .build();
        
        println!("{}: {}", name, if result.is_ok() { "OK" } else { "FAILED" });
    }
}

Character class size amplifies the state explosion from quantifiers.

Alternation Overlap

use regex::RegexBuilder;
 
fn main() {
    // Non-overlapping alternation - efficient
    let efficient = "(a|b|c|d)+";
    
    // Overlapping alternation - can cause issues
    let overlapping = "(a|ab|abc|abcd)+";
    
    // Ambiguous alternation - pathological
    let ambiguous = "(a.*|.*a)+";
    
    let patterns = [
        ("efficient", efficient),
        ("overlapping", overlapping),
        ("ambiguous", ambiguous),
    ];
    
    for (name, pattern) in patterns {
        let result = RegexBuilder::new(pattern)
            .size_limit(50 * 1024)
            .build();
        
        println!("{}: {}", name, if result.is_ok() { "OK" } else { "FAILED" });
    }
}

Overlapping alternatives in alternations create ambiguous paths that explode state count.

Safe Pattern Alternatives

use regex::RegexBuilder;
 
fn main() {
    // Instead of: (a+)+b  (vulnerable)
    // Use: a+b  (safe)
    let safe = RegexBuilder::new(r"a+b")
        .build();
    
    // Instead of: (a|a)+  (redundant)
    // Use: a+  (equivalent)
    let simplified = RegexBuilder::new(r"a+")
        .build();
    
    // Instead of: (.*)*  (can be pathological)
    // Use: .*  (equivalent)
    let greedy = RegexBuilder::new(r".*")
        .build();
    
    // All these compile efficiently
    println!("Safe patterns compiled successfully");
}

Simplifying patterns often eliminates pathological behavior.

Detecting Problematic Patterns

use regex::RegexBuilder;
 
fn is_pattern_safe(pattern: &str) -> Result<bool, String> {
    let result = RegexBuilder::new(pattern)
        .size_limit(1024 * 1024)  // 1MB
        .backtrack_limit(100_000)
        .build();
    
    match result {
        Ok(_) => Ok(true),
        Err(e) => {
            let err_str = e.to_string();
            if err_str.contains("exceeds size limit") {
                Err("Pattern too complex - possible state explosion".to_string())
            } else {
                Err(format!("Invalid pattern: {}", err_str))
            }
        }
    }
}
 
fn main() {
    let patterns = [
        r"\d+",           // Safe
        r"(a+)+",         // Problematic
        r"[a-z]+",        // Safe
        r"((a|b)+)+",     // Problematic
    ];
    
    for pattern in patterns {
        match is_pattern_safe(pattern) {
            Ok(_) => println!("'{}' is safe", pattern),
            Err(e) => println!("'{}' has issues: {}", pattern, e),
        }
    }
}

Validation functions can check patterns before using them in production.

Resource Limits in Production

use regex::RegexBuilder;
 
struct SafeRegex {
    pattern: String,
    compiled: Option<regex::Regex>,
}
 
impl SafeRegex {
    fn new(pattern: &str, max_size: usize) -> Result<Self, String> {
        let compiled = RegexBuilder::new(pattern)
            .size_limit(max_size)
            .backtrack_limit(100_000)
            .build()
            .map_err(|e| format!("Pattern rejected: {}", e))?;
        
        Ok(Self {
            pattern: pattern.to_string(),
            compiled: Some(compiled),
        })
    }
    
    fn is_match(&self, input: &str) -> bool {
        self.compiled.as_ref().map_or(false, |re| re.is_match(input))
    }
}
 
fn main() {
    // Production-safe regex with limits
    match SafeRegex::new(r"(a+)+b", 1024 * 1024) {
        Ok(re) => println!("Created safe regex"),
        Err(e) => println!("Rejected: {}", e),
    }
}

Wrapping regex creation with limits provides consistent protection.

Size Limit Units

use regex::RegexBuilder;
 
fn main() {
    // size_limit is in bytes for the compiled DFA/NFA state machine
    
    // Small patterns: ~1KB
    let small = RegexBuilder::new(r"\d+")
        .size_limit(1024)
        .build();
    println!("Small pattern: {}", small.is_ok());
    
    // Medium patterns: ~100KB
    let medium = RegexBuilder::new(r"[a-z]+@[a-z]+\.[a-z]+")
        .size_limit(100 * 1024)
        .build();
    println!("Medium pattern: {}", medium.is_ok());
    
    // Large patterns: ~10MB (default)
    let large = RegexBuilder::new(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
        .build();  // Uses default limit
    println!("Large pattern: {}", large.is_ok());
}

Size is measured in bytes for the internal state representation.

When to Increase size_limit

use regex::RegexBuilder;
 
fn main() {
    // Some legitimate patterns need more memory
    // Complex but well-formed patterns
    
    // Email regex (complex but usually OK)
    let email = RegexBuilder::new(
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    )
    .size_limit(2 * 1024 * 1024)  // 2MB
    .build();
    
    // URL parsing (complex patterns)
    let url = RegexBuilder::new(
        r"https?://[^\s/$.?#].[^\s]*"
    )
    .size_limit(5 * 1024 * 1024)  // 5MB
    .build();
    
    // IPv6 addresses (many alternatives)
    let ipv6 = RegexBuilder::new(
        r"([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}"
    )
    .size_limit(1024 * 1024)  // 1MB
    .build();
    
    println!("Complex patterns compiled: {} {} {}", 
        email.is_ok(), url.is_ok(), ipv6.is_ok());
}

Increase limits for legitimate complex patterns that are well-formed.

Measuring Compiled Size

use regex::RegexBuilder;
 
fn main() {
    // Unfortunately, there's no direct way to query compiled size
    // But you can detect if size limit was hit
    
    let patterns = [
        (r"\d+", "simple"),
        (r"(a+)+", "nested"),
        (r"((a|b)+)+", "double_nested"),
    ];
    
    for (pattern, name) in patterns {
        // Try with increasing limits
        for limit_kb in [10, 100, 1000] {
            let result = RegexBuilder::new(pattern)
                .size_limit(limit_kb * 1024)
                .build();
            
            if result.is_ok() {
                println!("'{}' compiled with {}KB limit", name, limit_kb);
                break;
            }
        }
    }
}

Binary search on size limit can approximate compiled size requirements.

Comparison with Other Regex Engines

use regex::RegexBuilder;
 
fn main() {
    // Rust's regex crate is already optimized
    // But pathological patterns can still cause issues
    
    // The regex crate uses:
    // 1. Lazy DFA construction (defers some state creation)
    // 2. NFA simulation for backreferences
    // 3. Multiple engines internally
    
    // size_limit protects the state machine construction
    // Even with optimizations, some patterns explode
    
    // Example: backtracking regex engines can hang forever
    // Rust's regex won't hang during matching (has internal limits)
    // But compilation can still use excessive memory
    
    let result = RegexBuilder::new(r"(a+)+b")
        .size_limit(1024)  // Very small to demonstrate
        .build();
    
    match result {
        Ok(_) => println!("Compiled"),
        Err(_) => println!("Size limit hit - pattern too complex"),
    }
}

Rust's regex has multiple internal protections; size_limit is one layer.

Attack Surface for User Patterns

use regex::RegexBuilder;
 
// Application that accepts user regex patterns
fn search_with_user_pattern(text: &str, pattern: &str) -> Result<bool, String> {
    let re = RegexBuilder::new(pattern)
        .size_limit(1024 * 1024)      // 1MB max compiled size
        .backtrack_limit(100_000)     // Limit match backtracking
        .build()
        .map_err(|e| format!("Invalid pattern: {}", e))?;
    
    Ok(re.is_match(text))
}
 
fn main() {
    // Benign user input
    let result = search_with_user_pattern("hello world", r"world");
    println!("Benign: {:?}", result);
    
    // Malicious input - exponential pattern
    let result = search_with_user_pattern("test", r"(a+)+b");
    println!("Malicious: {:?}", result);
    
    // Malicious input - trying to exhaust memory
    let result = search_with_user_pattern("test", &"a".repeat(10000));
    println!("Long pattern: {:?}", result);
}

Applications accepting user patterns must enforce limits.

Size Limit vs Backtrack Limit

use regex::RegexBuilder;
 
fn main() {
    // size_limit: compilation-time memory protection
    // Controls how large the compiled state machine can be
    
    // backtrack_limit: execution-time step protection
    // Controls how many backtracking steps during matching
    
    let pattern = r"(a+)+b";
    
    // Without backtrack_limit, matching can hang
    let re = RegexBuilder::new(pattern)
        .size_limit(10 * 1024 * 1024)
        .backtrack_limit(100_000)  // Important for this pattern!
        .build()
        .unwrap();
    
    let input = "aaaaaaaaaaaaaaaaaaaaaaaaaac";
    
    // With backtrack_limit, this returns quickly
    let is_match = re.is_match(input);
    println!("Match: {}", is_match);
    
    // Both limits work together:
    // size_limit prevents compilation DoS
    // backtrack_limit prevents matching DoS
}

Two limits protect different attack vectors: compilation and execution.

Pattern Rewrite Strategies

use regex::RegexBuilder;
 
fn main() {
    // Original pathological patterns and safe rewrites
    
    // Pathological: (a+)+
    // Safe: a+
    // The outer quantifier is redundant
    
    // Pathological: (a|a)+
    // Safe: a+
    // Overlapping alternatives are redundant
    
    // Pathological: (a*|b*)*
    // Safe: [ab]*
    // Character class is more efficient
    
    // Pathological: .*(.*)*
    // Safe: .*
    // Nested greedy quantifiers are redundant
    
    let safe_patterns = [
        r"a+",
        r"a+",
        r"[ab]*",
        r".*",
    ];
    
    for pattern in safe_patterns {
        let result = RegexBuilder::new(pattern)
            .size_limit(1024)  // Small limit
            .build();
        println!("'{}': {}", pattern, if result.is_ok() { "OK" } else { "FAILED" });
    }
}

Many pathological patterns have equivalent safe rewrites.

Error Handling for Size Limits

use regex::{RegexBuilder, Error};
 
fn main() {
    let result = RegexBuilder::new(r"(a+)+")
        .size_limit(100)  // Extremely small
        .build();
    
    match result {
        Ok(re) => println!("Compiled: {:?}", re),
        Err(e) => {
            // Check error type
            let err_str = e.to_string();
            
            if err_str.contains("size limit") {
                eprintln!("Pattern too complex: state machine too large");
                // Handle size limit error specifically
            } else if err_str.contains("parse") {
                eprintln!("Syntax error in pattern");
                // Handle syntax error
            } else {
                eprintln!("Other error: {}", e);
            }
        }
    }
}

Handle size limit errors specifically for user feedback.

Synthesis

What causes state explosion:

Pattern Feature Risk Level Explanation
Nested quantifiers (a+)+ High Exponential state combinations
Overlapping alternations (a|ab)+ High Ambiguous matching paths
Large character classes [\u0000-\uFFFF]+ Medium Multiplies states
Nested groups ((a+)+)+ Very High Each level multiplies complexity
Greedy nested quantifiers (.*)* High Exponential backtracking

Protection mechanisms:

Limit Protects Against When Applied
size_limit Memory exhaustion During compilation
backtrack_limit CPU exhaustion During matching

Default values:

Setting Default Typical Override
size_limit ~10MB Lower for untrusted input
backtrack_limit ~100,000 Lower for strict timeouts

Best practices:

// For trusted patterns (developer-provided)
let re = Regex::new(r"complex_pattern")?;
 
// For user-provided patterns
let re = RegexBuilder::new(user_pattern)
    .size_limit(1024 * 1024)
    .backtrack_limit(100_000)
    .build()?;

Key insight: size_limit protects the compilation phase by bounding the memory needed to construct the regex engine's internal state machine. Pathological patterns—particularly nested quantifiers and overlapping alternations—can cause the number of states to grow exponentially with pattern complexity. By default, the regex crate has a ~10MB limit that protects against accidental misuse. For applications accepting user-provided patterns, lower limits prevent denial-of-service attacks where malicious patterns would otherwise consume all available memory. The size_limit complements backtrack_limit: the former prevents excessive memory during compilation, while the latter prevents excessive CPU cycles during matching. Both limits are essential for production systems that process untrusted regex patterns, and together they ensure bounded resource consumption regardless of input complexity.