How does regex::RegexBuilder::size_limit prevent exponential backtracking in pathological cases?
regex::RegexBuilder::size_limit constrains the memory allocated for the compiled regex's internal state machine, preventing denial-of-service attacks from pathological patterns that would otherwise cause exponential state machine growth. When certain regex patterns are compiledâparticularly those with nested quantifiers like (a+)+ or alternations with overlapping matchesâthe resulting finite automaton can grow exponentially relative to the pattern size. The size_limit method sets an upper bound on this growth, causing compilation to fail rather than consuming unbounded memory. This protection is essential for applications accepting user-provided regex patterns, where malicious inputs could otherwise exhaust system resources.
The Exponential Growth Problem
use regex::Regex;
fn main() {
// This pattern looks simple but has exponential state explosion
let pattern = "(a+)+b";
// Without size limits, compilation could consume massive memory
// The nested quantifiers create many overlapping possibilities
match Regex::new(pattern) {
Ok(re) => println!("Compiled: {:?}", re),
Err(e) => println!("Error: {}", e),
}
// On pathological input like "aaaaaaaaaaaaaaaaaaaaaaaaaac",
// matching can take exponential time
}Nested quantifiers and overlapping alternations cause state machine explosion during compilation.
Basic size_limit Usage
use regex::RegexBuilder;
fn main() {
// Set a size limit of 1MB for the compiled regex
let result = RegexBuilder::new(r"(a+)+b")
.size_limit(1024 * 1024) // 1MB limit
.build();
match result {
Ok(re) => println!("Compiled successfully"),
Err(e) => println!("Compilation failed: {}", e),
}
}size_limit specifies the maximum memory allowed for the compiled state machine.
Default Size Limit
use regex::RegexBuilder;
fn main() {
// The default size limit is approximately 10MB
// This protects against most pathological patterns
// Check if a pattern exceeds default limits
let result = RegexBuilder::new(r"(a|b|c|d|e|f|g)+")
.build();
match result {
Ok(re) => {
println!("Pattern compiled within default limits");
}
Err(e) => {
println!("Pattern too large: {}", e);
}
}
}The default limit provides baseline protection without configuration.
Patterns That Cause State Explosion
use regex::RegexBuilder;
fn main() {
// Nested quantifiers - classic pathological pattern
let nested = "(a+)+";
// Alternation with overlapping possibilities
let overlapping = "(a|a)+";
// Complex character classes with quantifiers
let char_class = "([a-z][a-z])+";
// Nested groups with alternations
let nested_alt = "((a|b)+)+";
let patterns = [nested, overlapping, char_class, nested_alt];
for pattern in patterns {
let result = RegexBuilder::new(pattern)
.size_limit(100 * 1024) // 100KB - intentionally small
.build();
match result {
Ok(_) => println!("'{}' compiled OK", pattern),
Err(_) => println!("'{}' exceeded size limit", pattern),
}
}
}Nested quantifiers and overlapping alternatives are common causes of state explosion.
Exponential Backtracking in Matching
use regex::Regex;
use std::time::Instant;
fn main() {
// Pattern vulnerable to backtracking
let re = Regex::new(r"(a+)+b").unwrap();
// Input that triggers exponential backtracking
let input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac";
let start = Instant::now();
let is_match = re.is_match(input);
let duration = start.elapsed();
println!("Match: {}, Time: {:?}", is_match, duration);
// This can take seconds or longer with enough 'a's
// The pattern tries to match 'b' but input ends with 'c'
// Each 'a' can belong to either inner or outer group
// Exponentially many combinations are tried
}The size_limit helps with compilation memory; backtracking limits help with match time.
Combining size_limit with backtrack_limit
use regex::RegexBuilder;
fn main() {
// Protect against both compilation and matching issues
let result = RegexBuilder::new(r"(a+)+b")
.size_limit(1024 * 1024) // 1MB for compilation
.backtrack_limit(100_000) // Limit backtracking steps
.build();
match result {
Ok(re) => {
let input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac";
match re.is_match(input) {
true => println!("Matched"),
false => println!("No match"),
}
}
Err(e) => println!("Compilation error: {}", e),
}
}Use both limits: size_limit for compilation, backtrack_limit for execution.
Nested Quantifier Analysis
use regex::RegexBuilder;
fn main() {
// Each level of nesting multiplies state possibilities
// Level 1: a+ - linear states
let level1 = "a+";
// Level 2: (a+)+ - quadratic to exponential
let level2 = "(a+)+";
// Level 3: ((a+)+)+ - exponential explosion
let level3 = "((a+)+)+";
for (name, pattern) in [("L1", level1), ("L2", level2), ("L3", level3)] {
let result = RegexBuilder::new(pattern)
.size_limit(10 * 1024) // Small limit to show effect
.build();
println!("{} '{}': {}", name, pattern,
if result.is_ok() { "OK" } else { "FAILED" });
}
}Deeper nesting creates more dramatic state explosion.
Character Class Explosion
use regex::RegexBuilder;
fn main() {
// Large character classes multiply states
let small = "[a-z]+"; // 26 characters
let medium = "[a-zA-Z0-9]+"; // 62 characters
let large = "[\u{0000}-\u{FFFF}]+"; // Many characters
// Combined with nested groups
let nested_small = "([a-z]+)+";
let nested_large = "([\u{0000}-\u{FFFF}]+)+";
let patterns = [
("small_char_class", small),
("medium_char_class", medium),
("large_char_class", large),
("nested_small", nested_small),
("nested_large", nested_large),
];
for (name, pattern) in patterns {
let result = RegexBuilder::new(pattern)
.size_limit(100 * 1024)
.build();
println!("{}: {}", name, if result.is_ok() { "OK" } else { "FAILED" });
}
}Character class size amplifies the state explosion from quantifiers.
Alternation Overlap
use regex::RegexBuilder;
fn main() {
// Non-overlapping alternation - efficient
let efficient = "(a|b|c|d)+";
// Overlapping alternation - can cause issues
let overlapping = "(a|ab|abc|abcd)+";
// Ambiguous alternation - pathological
let ambiguous = "(a.*|.*a)+";
let patterns = [
("efficient", efficient),
("overlapping", overlapping),
("ambiguous", ambiguous),
];
for (name, pattern) in patterns {
let result = RegexBuilder::new(pattern)
.size_limit(50 * 1024)
.build();
println!("{}: {}", name, if result.is_ok() { "OK" } else { "FAILED" });
}
}Overlapping alternatives in alternations create ambiguous paths that explode state count.
Safe Pattern Alternatives
use regex::RegexBuilder;
fn main() {
// Instead of: (a+)+b (vulnerable)
// Use: a+b (safe)
let safe = RegexBuilder::new(r"a+b")
.build();
// Instead of: (a|a)+ (redundant)
// Use: a+ (equivalent)
let simplified = RegexBuilder::new(r"a+")
.build();
// Instead of: (.*)* (can be pathological)
// Use: .* (equivalent)
let greedy = RegexBuilder::new(r".*")
.build();
// All these compile efficiently
println!("Safe patterns compiled successfully");
}Simplifying patterns often eliminates pathological behavior.
Detecting Problematic Patterns
use regex::RegexBuilder;
fn is_pattern_safe(pattern: &str) -> Result<bool, String> {
let result = RegexBuilder::new(pattern)
.size_limit(1024 * 1024) // 1MB
.backtrack_limit(100_000)
.build();
match result {
Ok(_) => Ok(true),
Err(e) => {
let err_str = e.to_string();
if err_str.contains("exceeds size limit") {
Err("Pattern too complex - possible state explosion".to_string())
} else {
Err(format!("Invalid pattern: {}", err_str))
}
}
}
}
fn main() {
let patterns = [
r"\d+", // Safe
r"(a+)+", // Problematic
r"[a-z]+", // Safe
r"((a|b)+)+", // Problematic
];
for pattern in patterns {
match is_pattern_safe(pattern) {
Ok(_) => println!("'{}' is safe", pattern),
Err(e) => println!("'{}' has issues: {}", pattern, e),
}
}
}Validation functions can check patterns before using them in production.
Resource Limits in Production
use regex::RegexBuilder;
struct SafeRegex {
pattern: String,
compiled: Option<regex::Regex>,
}
impl SafeRegex {
fn new(pattern: &str, max_size: usize) -> Result<Self, String> {
let compiled = RegexBuilder::new(pattern)
.size_limit(max_size)
.backtrack_limit(100_000)
.build()
.map_err(|e| format!("Pattern rejected: {}", e))?;
Ok(Self {
pattern: pattern.to_string(),
compiled: Some(compiled),
})
}
fn is_match(&self, input: &str) -> bool {
self.compiled.as_ref().map_or(false, |re| re.is_match(input))
}
}
fn main() {
// Production-safe regex with limits
match SafeRegex::new(r"(a+)+b", 1024 * 1024) {
Ok(re) => println!("Created safe regex"),
Err(e) => println!("Rejected: {}", e),
}
}Wrapping regex creation with limits provides consistent protection.
Size Limit Units
use regex::RegexBuilder;
fn main() {
// size_limit is in bytes for the compiled DFA/NFA state machine
// Small patterns: ~1KB
let small = RegexBuilder::new(r"\d+")
.size_limit(1024)
.build();
println!("Small pattern: {}", small.is_ok());
// Medium patterns: ~100KB
let medium = RegexBuilder::new(r"[a-z]+@[a-z]+\.[a-z]+")
.size_limit(100 * 1024)
.build();
println!("Medium pattern: {}", medium.is_ok());
// Large patterns: ~10MB (default)
let large = RegexBuilder::new(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
.build(); // Uses default limit
println!("Large pattern: {}", large.is_ok());
}Size is measured in bytes for the internal state representation.
When to Increase size_limit
use regex::RegexBuilder;
fn main() {
// Some legitimate patterns need more memory
// Complex but well-formed patterns
// Email regex (complex but usually OK)
let email = RegexBuilder::new(
r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)
.size_limit(2 * 1024 * 1024) // 2MB
.build();
// URL parsing (complex patterns)
let url = RegexBuilder::new(
r"https?://[^\s/$.?#].[^\s]*"
)
.size_limit(5 * 1024 * 1024) // 5MB
.build();
// IPv6 addresses (many alternatives)
let ipv6 = RegexBuilder::new(
r"([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}"
)
.size_limit(1024 * 1024) // 1MB
.build();
println!("Complex patterns compiled: {} {} {}",
email.is_ok(), url.is_ok(), ipv6.is_ok());
}Increase limits for legitimate complex patterns that are well-formed.
Measuring Compiled Size
use regex::RegexBuilder;
fn main() {
// Unfortunately, there's no direct way to query compiled size
// But you can detect if size limit was hit
let patterns = [
(r"\d+", "simple"),
(r"(a+)+", "nested"),
(r"((a|b)+)+", "double_nested"),
];
for (pattern, name) in patterns {
// Try with increasing limits
for limit_kb in [10, 100, 1000] {
let result = RegexBuilder::new(pattern)
.size_limit(limit_kb * 1024)
.build();
if result.is_ok() {
println!("'{}' compiled with {}KB limit", name, limit_kb);
break;
}
}
}
}Binary search on size limit can approximate compiled size requirements.
Comparison with Other Regex Engines
use regex::RegexBuilder;
fn main() {
// Rust's regex crate is already optimized
// But pathological patterns can still cause issues
// The regex crate uses:
// 1. Lazy DFA construction (defers some state creation)
// 2. NFA simulation for backreferences
// 3. Multiple engines internally
// size_limit protects the state machine construction
// Even with optimizations, some patterns explode
// Example: backtracking regex engines can hang forever
// Rust's regex won't hang during matching (has internal limits)
// But compilation can still use excessive memory
let result = RegexBuilder::new(r"(a+)+b")
.size_limit(1024) // Very small to demonstrate
.build();
match result {
Ok(_) => println!("Compiled"),
Err(_) => println!("Size limit hit - pattern too complex"),
}
}Rust's regex has multiple internal protections; size_limit is one layer.
Attack Surface for User Patterns
use regex::RegexBuilder;
// Application that accepts user regex patterns
fn search_with_user_pattern(text: &str, pattern: &str) -> Result<bool, String> {
let re = RegexBuilder::new(pattern)
.size_limit(1024 * 1024) // 1MB max compiled size
.backtrack_limit(100_000) // Limit match backtracking
.build()
.map_err(|e| format!("Invalid pattern: {}", e))?;
Ok(re.is_match(text))
}
fn main() {
// Benign user input
let result = search_with_user_pattern("hello world", r"world");
println!("Benign: {:?}", result);
// Malicious input - exponential pattern
let result = search_with_user_pattern("test", r"(a+)+b");
println!("Malicious: {:?}", result);
// Malicious input - trying to exhaust memory
let result = search_with_user_pattern("test", &"a".repeat(10000));
println!("Long pattern: {:?}", result);
}Applications accepting user patterns must enforce limits.
Size Limit vs Backtrack Limit
use regex::RegexBuilder;
fn main() {
// size_limit: compilation-time memory protection
// Controls how large the compiled state machine can be
// backtrack_limit: execution-time step protection
// Controls how many backtracking steps during matching
let pattern = r"(a+)+b";
// Without backtrack_limit, matching can hang
let re = RegexBuilder::new(pattern)
.size_limit(10 * 1024 * 1024)
.backtrack_limit(100_000) // Important for this pattern!
.build()
.unwrap();
let input = "aaaaaaaaaaaaaaaaaaaaaaaaaac";
// With backtrack_limit, this returns quickly
let is_match = re.is_match(input);
println!("Match: {}", is_match);
// Both limits work together:
// size_limit prevents compilation DoS
// backtrack_limit prevents matching DoS
}Two limits protect different attack vectors: compilation and execution.
Pattern Rewrite Strategies
use regex::RegexBuilder;
fn main() {
// Original pathological patterns and safe rewrites
// Pathological: (a+)+
// Safe: a+
// The outer quantifier is redundant
// Pathological: (a|a)+
// Safe: a+
// Overlapping alternatives are redundant
// Pathological: (a*|b*)*
// Safe: [ab]*
// Character class is more efficient
// Pathological: .*(.*)*
// Safe: .*
// Nested greedy quantifiers are redundant
let safe_patterns = [
r"a+",
r"a+",
r"[ab]*",
r".*",
];
for pattern in safe_patterns {
let result = RegexBuilder::new(pattern)
.size_limit(1024) // Small limit
.build();
println!("'{}': {}", pattern, if result.is_ok() { "OK" } else { "FAILED" });
}
}Many pathological patterns have equivalent safe rewrites.
Error Handling for Size Limits
use regex::{RegexBuilder, Error};
fn main() {
let result = RegexBuilder::new(r"(a+)+")
.size_limit(100) // Extremely small
.build();
match result {
Ok(re) => println!("Compiled: {:?}", re),
Err(e) => {
// Check error type
let err_str = e.to_string();
if err_str.contains("size limit") {
eprintln!("Pattern too complex: state machine too large");
// Handle size limit error specifically
} else if err_str.contains("parse") {
eprintln!("Syntax error in pattern");
// Handle syntax error
} else {
eprintln!("Other error: {}", e);
}
}
}
}Handle size limit errors specifically for user feedback.
Synthesis
What causes state explosion:
| Pattern Feature | Risk Level | Explanation |
|---|---|---|
Nested quantifiers (a+)+ |
High | Exponential state combinations |
Overlapping alternations (a|ab)+ |
High | Ambiguous matching paths |
Large character classes [\u0000-\uFFFF]+ |
Medium | Multiplies states |
Nested groups ((a+)+)+ |
Very High | Each level multiplies complexity |
Greedy nested quantifiers (.*)* |
High | Exponential backtracking |
Protection mechanisms:
| Limit | Protects Against | When Applied |
|---|---|---|
size_limit |
Memory exhaustion | During compilation |
backtrack_limit |
CPU exhaustion | During matching |
Default values:
| Setting | Default | Typical Override |
|---|---|---|
size_limit |
~10MB | Lower for untrusted input |
backtrack_limit |
~100,000 | Lower for strict timeouts |
Best practices:
// For trusted patterns (developer-provided)
let re = Regex::new(r"complex_pattern")?;
// For user-provided patterns
let re = RegexBuilder::new(user_pattern)
.size_limit(1024 * 1024)
.backtrack_limit(100_000)
.build()?;Key insight: size_limit protects the compilation phase by bounding the memory needed to construct the regex engine's internal state machine. Pathological patternsâparticularly nested quantifiers and overlapping alternationsâcan cause the number of states to grow exponentially with pattern complexity. By default, the regex crate has a ~10MB limit that protects against accidental misuse. For applications accepting user-provided patterns, lower limits prevent denial-of-service attacks where malicious patterns would otherwise consume all available memory. The size_limit complements backtrack_limit: the former prevents excessive memory during compilation, while the latter prevents excessive CPU cycles during matching. Both limits are essential for production systems that process untrusted regex patterns, and together they ensure bounded resource consumption regardless of input complexity.
