How does regex::RegexBuilder::size_limit prevent denial-of-service from pathological patterns?
size_limit restricts the memory allocated for compiled regex state machines, preventing pathological patterns that would otherwise cause exponential state explosion during compilation or execution. The limit constrains the size of the deterministic finite automaton (DFA) that the regex engine builds, rejecting patterns that would require too many statesâpatterns that could otherwise consume excessive memory or CPU time, enabling denial-of-service attacks.
Basic size_limit Configuration
use regex::RegexBuilder;
fn basic_size_limit() {
// Default size limit is approximately 10MB
let regex = RegexBuilder::new(r"\d+")
.size_limit(1024 * 1024) // 1MB limit
.build();
match regex {
Ok(re) => {
// Regex compiled successfully within limit
println!("Pattern compiled: {:?}", re);
}
Err(e) => {
// Pattern exceeded size limit
eprintln!("Regex too large: {}", e);
}
}
}The size_limit method sets a byte limit on the compiled regex state machine size.
Pathological Pattern Example
use regex::RegexBuilder;
fn pathological_pattern() {
// This pattern can cause exponential state explosion:
// a?^n a^n matches but requires exponential backtracking
let pattern = "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaa";
// Without size limit, this could hang or consume massive memory
let result = RegexBuilder::new(pattern)
.size_limit(1024) // Very small limit to demonstrate
.build();
match result {
Ok(_) => println!("Pattern compiled successfully"),
Err(e) => println!("Pattern rejected: {}", e),
// Error: "regex size limit exceeded"
}
}Pathological patterns like nested optional quantifiers can cause state explosion.
How State Machine Size Grows
use regex::RegexBuilder;
fn state_size_growth() {
// Simple patterns: small state machine
let simple = RegexBuilder::new(r"\d+")
.build()
.unwrap();
// Few states needed
// Character classes: moderate growth
let moderate = RegexBuilder::new(r"[a-zA-Z0-9]+")
.build()
.unwrap();
// More states for character ranges
// Alternation: can grow multiplicatively
let complex = RegexBuilder::new(r"(abc|abd|abe|abf|abg)+")
.build()
.unwrap();
// States multiply with alternation depth
// Nested quantifiers: potential exponential growth
let dangerous = RegexBuilder::new(r"(a+)+b")
.size_limit(100 * 1024) // May need higher limit
.build();
// Can require many states due to backtracking potential
}State machine complexity depends on pattern structure; certain constructs cause multiplicative state growth.
Size Limit vs Execution Time
use regex::RegexBuilder;
fn size_vs_execution() {
// size_limit affects compilation, not just execution
// Large state machines also take longer to execute
// Pattern that creates large state machine
let pattern = r"(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p)+";
// Low limit: Rejects at compile time
let low_limit = RegexBuilder::new(pattern)
.size_limit(100)
.build();
assert!(low_limit.is_err());
// High limit: Accepts but may use more memory/CPU
let high_limit = RegexBuilder::new(pattern)
.size_limit(10 * 1024 * 1024) // 10MB
.build();
assert!(high_limit.is_ok());
}Larger state machines consume more memory during both compilation and execution.
Common Attack Patterns
use regex::RegexBuilder;
fn common_attack_patterns() {
// Pattern 1: Nested quantifiers (a+)+
// Causes exponential backtracking on "aaa...aaa!"
let nested = "(a+)+b";
// Input "aaaaaaaaaaaaaaaaaaaac" causes exponential time
// Pattern 2: Overlapping alternatives (a|a)*
// Creates many states for overlapping matches
let overlapping = "(a|aa|aaa)+";
// Pattern 3: Backreferences
// Can require exponential states
let backref = r"(a+)\1";
// Pattern 4: Counted repetitions with ranges
// a{1,1000} creates many states
let counted = "a{1,1000}";
// size_limit protects against all these
for pattern in [nested, overlapping, backref, counted] {
let result = RegexBuilder::new(pattern)
.size_limit(1024 * 1024) // 1MB
.build();
if result.is_err() {
println!("Pattern '{}' rejected due to size limit", pattern);
}
}
}These patterns are commonly exploited in ReDoS (Regular Expression Denial of Service) attacks.
Setting Appropriate Limits
use regex::RegexBuilder;
fn appropriate_limits() {
// For simple validation patterns
let validation = RegexBuilder::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
.size_limit(64 * 1024) // 64KB is usually enough for validation
.build();
// For complex search patterns
let search = RegexBuilder::new(r"\b\w+\b")
.size_limit(1024 * 1024) // 1MB for typical search
.build();
// For user-provided patterns (need strict limits)
fn compile_user_pattern(pattern: &str) -> Result<regex::Regex, regex::Error> {
RegexBuilder::new(pattern)
.size_limit(256 * 1024) // Conservative 256KB limit
.build()
}
// For trusted patterns (internal use)
let trusted = RegexBuilder::new(r"complex|pattern|here")
.size_limit(10 * 1024 * 1024) // 10MB for trusted patterns
.build();
}Set size limits based on pattern source and complexity needs.
Size Limit vs Backtrack Limit
use regex::RegexBuilder;
fn size_vs_backtrack() {
// size_limit: Controls state machine size at compile time
// Prevents patterns that create too many states
// backtrack_limit: Controls execution time
// Prevents patterns from running too long
// Both limits work together for DoS protection
let result = RegexBuilder::new(r"(a+)+b")
.size_limit(1024 * 1024) // Limit state machine size
.backtrack_limit(100_000) // Limit backtracking steps
.build();
// size_limit catches patterns at compile time
// backtrack_limit catches patterns at execution time
match result {
Ok(re) => {
// Use carefully
let text = "aaaaaaaaaaaaaaaaaaaaaaaaaac";
match re.find(text) {
Some(m) => println!("Match: {:?}", m),
None => println!("No match"),
}
}
Err(e) => {
println!("Rejected: {}", e);
}
}
}size_limit prevents compilation-time DoS; backtrack_limit prevents execution-time DoS.
Measuring Actual State Size
use regex::RegexBuilder;
fn measure_state_size() {
// There's no direct API to query state machine size
// But you can approximate by trying different limits
fn find_min_size_limit(pattern: &str) -> Option<usize> {
// Binary search for minimum viable size limit
let mut low = 1024usize;
let mut high = 100 * 1024 * 1024; // 100MB
while low < high {
let mid = low + (high - low) / 2;
match RegexBuilder::new(pattern)
.size_limit(mid)
.build()
{
Ok(_) => high = mid,
Err(_) => low = mid + 1,
}
}
Some(low)
}
// Simple pattern needs little space
if let Some(size) = find_min_size_limit(r"\d+") {
println!("Pattern \\d+ needs approximately {} bytes", size);
}
}The actual state machine size depends on pattern complexity and optimization.
Comparison with Other Regex Engines
use regex::RegexBuilder;
fn engine_comparison() {
// The 'regex' crate uses a hybrid NFA/DFA approach
// size_limit applies to the compiled DFA
// Other engines handle limits differently:
// - PCRE: Uses recursion limit (stack-based)
// - RE2: Always-bounded execution (Google's safe regex)
// - Oniguruma: Various limit options
// The 'regex' crate approach:
// 1. Compiles to NFA first
// 2. Lazily builds DFA states during execution
// 3. size_limit caps total DFA state cache size
// This provides protection while maintaining performance
let result = RegexBuilder::new(r"complex|pattern|here")
.size_limit(1024 * 1024)
.build();
}The regex crate's hybrid approach balances safety with performance.
Pattern Rewriting to Reduce Size
use regex::RegexBuilder;
fn pattern_optimization() {
// Avoid patterns that create large state machines
// Bad: Nested quantifiers
let bad_pattern = r"(a|b)*c";
// Creates many states due to potential backtracking
// Better: Flatten if possible
let better_pattern = r"[ab]*c";
// Simpler state machine
// Bad: Overlapping alternatives
let bad = r"(a|ab|abc)+";
// Many overlapping possibilities
// Better: Specific ordering or non-overlapping
let better = r"(?:a|ab|abc)+";
// Still overlapping but clearer intent
// Bad: Unnecessary groups
let bad = r"((a|b))+";
// Better: Non-capturing or simplified
let better = r"(?:a|b)+";
// Test both patterns
for pattern in [bad_pattern, better_pattern] {
let result = RegexBuilder::new(pattern)
.size_limit(1024)
.build();
println!("Pattern '{}': {}", pattern,
if result.is_ok() { "OK" } else { "Rejected" });
}
}Rewriting patterns can significantly reduce state machine size.
Error Handling for Size Limit
use regex::RegexBuilder;
use std::error::Error;
fn handle_size_limit_error() -> Result<(), Box<dyn Error>> {
let pattern = r"(a+)+b";
let regex = RegexBuilder::new(pattern)
.size_limit(100)
.build()
.map_err(|e| {
if e.to_string().contains("size limit") {
// Specific handling for size limit errors
format!("Pattern '{}' creates too large a state machine", pattern)
} else {
format!("Pattern compilation failed: {}", e)
}
})?;
Ok(())
}
fn safe_user_regex(pattern: &str) -> Result<regex::Regex, String> {
RegexBuilder::new(pattern)
.size_limit(256 * 1024) // 256KB limit for user patterns
.build()
.map_err(|e| {
// Provide user-friendly error message
if e.to_string().contains("size limit") {
"Pattern is too complex. Simplify nested quantifiers or reduce alternations.".to_string()
} else {
format!("Invalid pattern: {}", e)
}
})
}Catch and handle size limit errors with user-friendly messages.
Testing Size Limits
use regex::RegexBuilder;
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_reasonable_patterns_compile() {
// Reasonable patterns should compile within size limit
let patterns = [
r"\d+",
r"[a-zA-Z]+",
r"^https?://[\w.-]+",
r"^\w+@\w+\.\w+$",
];
for pattern in patterns {
let result = RegexBuilder::new(pattern)
.size_limit(64 * 1024)
.build();
assert!(result.is_ok(), "Pattern '{}' should compile", pattern);
}
}
#[test]
fn test_pathological_patterns_rejected() {
// Pathological patterns should be rejected
let patterns = [
r"(a+)+",
r"(a|aa)+",
r"(?:a?){1000}a{1000}",
];
for pattern in patterns {
let result = RegexBuilder::new(pattern)
.size_limit(1024) // Very strict limit
.build();
// May succeed with higher limits, but strict limits should reject
// Actual behavior depends on pattern optimization
}
}
}Test that reasonable patterns compile and pathological ones are caught.
Integration in Web Services
use regex::RegexBuilder;
use std::collections::HashMap;
use std::sync::Arc;
// Service for compiling user regex patterns safely
struct SafeRegexService {
cache: HashMap<String, Arc<regex::Regex>>,
max_size: usize,
}
impl SafeRegexService {
fn new() -> Self {
Self {
cache: HashMap::new(),
max_size: 256 * 1024, // 256KB
}
}
fn compile(&mut self, pattern: &str) -> Result<Arc<regex::Regex>, String> {
// Check cache first
if let Some(re) = self.cache.get(pattern) {
return Ok(Arc::clone(re));
}
// Compile with size limit
let re = RegexBuilder::new(pattern)
.size_limit(self.max_size)
.backtrack_limit(50_000) // Also limit execution
.build()
.map_err(|e| format!("Pattern rejected: {}", e))?;
let re = Arc::new(re);
self.cache.insert(pattern.to_string(), Arc::clone(&re));
Ok(re)
}
fn validate_input(&self, pattern: &str, input: &str) -> Result<bool, String> {
let re = self.compile(pattern)?;
Ok(re.is_match(input))
}
}Web services should use size limits for any user-provided regex patterns.
Synthesis
Quick comparison:
| Limit Type | What it Limits | When Applied | Attack Prevented |
|---|---|---|---|
size_limit |
DFA state machine memory | Compile time | State explosion |
backtrack_limit |
Backtracking iterations | Execution time | Catastrophic backtracking |
Common limits:
use regex::RegexBuilder;
fn common_limits() {
// Conservative for user input
RegexBuilder::new("pattern")
.size_limit(64 * 1024) // 64KB
.backtrack_limit(10_000);
// Moderate for internal patterns
RegexBuilder::new("pattern")
.size_limit(1024 * 1024) // 1MB
.backtrack_limit(100_000);
// Relaxed for trusted/known patterns
RegexBuilder::new("pattern")
.size_limit(10 * 1024 * 1024) // 10MB
.backtrack_limit(1_000_000);
}Key insight: size_limit is a compile-time protection against ReDoS attacks that works by limiting the memory required for the regex engine's state machine. The regex crate uses a hybrid NFA/DFA approach where states are lazily built during execution; size_limit caps the total size of this state cache. This prevents patterns that would create exponentially many statesâpatterns like (a|b)* where the number of possible states grows with input length. The protection is proactive: rather than allowing pathological patterns to compile and potentially hang during execution, size_limit rejects them upfront during compilation. For comprehensive DoS protection, combine size_limit with backtrack_limit: the former catches patterns that create large state machines, while the latter catches patterns that would execute too many operations during matching. When accepting regex patterns from untrusted sourcesâsearch functionality, configuration files, API endpointsâalways apply strict size limits to prevent denial-of-service attacks through regex state explosion.
