How does regex::RegexBuilder::size_limit prevent denial-of-service from pathological patterns?
RegexBuilder::size_limit constrains the total heap memory used by the compiled regex's internal state machine, preventing runaway memory consumption from patterns that would create enormous finite automata. This protection targets a different attack vector than backtrack_limit—while backtrack limits prevent exponential time complexity during matching, size limits prevent exponential space complexity during compilation.
The Memory Problem in Regex Compilation
use regex::Regex;
fn memory_explosion_example() {
// This pattern creates exponentially large automata
// a(aa)*b matches a, aaa, aaaaa, etc.
// But the NFA state count grows exponentially with pattern features
let pattern = "a(aaa)*b";
// Some pathological patterns create millions of states
// Without limits, compilation could consume GB of memory
}Regex engines compile patterns into state machines. Complex patterns with many alternations or nested quantifiers can create state machines with millions of states.
Setting Size Limits
use regex::Regex;
fn basic_size_limit() -> Result<(), regex::Error> {
// Default size limit: ~10MB of heap
// Patterns exceeding this fail to compile
// Set a custom size limit (in bytes)
let re = Regex::builder()
.size_limit(1024 * 1024) // 1 MB limit
.build(r"a(b|c)*d")?;
// The pattern compiles if its state machine fits in 1MB
assert!(re.is_match("abbbbcdd"));
Ok(())
}The size limit restricts how large the compiled state machine can grow.
The Default Protection
use regex::Regex;
fn default_limits() {
// The default size limit is approximately 10MB
// This prevents most memory exhaustion attacks
// When a pattern would create a state machine > 10MB:
let result = Regex::new(r"(a?){50}b(c?){50}");
// This pattern might succeed or fail depending on
// how many states the regex engine generates
// Explicitly small limit for untrusted input:
let result = Regex::builder()
.size_limit(100 * 1024) // 100 KB
.build(r"(a?){50}b");
match result {
Ok(re) => println!("Compiled successfully"),
Err(e) => println!("Pattern too large: {}", e),
}
}The default 10MB limit protects against most accidental memory exhaustion.
Size Limit vs Backtrack Limit
use regex::Regex;
fn different_protections() {
// Two different attack vectors:
// 1. SIZE LIMIT: Prevents exponential STATE SPACE
// Pattern compiles into too many NFA/DFA states
// Attack happens during COMPILATION
//
// Example: Alternation explosion
// (a|b|c|d|e){20} can create millions of states
let result = Regex::builder()
.size_limit(1024) // Very small for demonstration
.build(r"(a|b){20}");
// Fails: "regex too large"
// 2. BACKTRACK LIMIT: Prevents exponential TIME
// Pattern causes too many match attempts
// Attack happens during EXECUTION
//
// Example: Catastrophic backtracking
// a(a?){50}b matched against "aaaaaaa...aaa"
let result = Regex::builder()
.backtrack_limit(1000) // Limit backtracking steps
.build(r"a(b|c)*d");
// These are INDEPENDENT protections:
// - size_limit affects compilation memory
// - backtrack_limit affects execution time
}Size limits protect memory during compilation; backtrack limits protect time during matching.
Patterns That Create Large State Machines
use regex::Regex;
fn large_state_patterns() {
// Pattern features that explode state count:
// 1. Many alternations with overlapping characters
let pattern = "(a|aa|aaa|aaaa){30}";
// Each alternation multiplies states
// 2. Nested quantifiers with variable lengths
let pattern = "(a?){40}b";
// Each `?` doubles potential states
// 3. Large counted repetitions
let pattern = "a{1000}";
// Explicit state for each position
// 4. Complex character classes
let pattern = "[a-z0-9_-]{100}";
// States for each character class match
// These patterns may hit size limits:
for limit in [1024, 10*1024, 100*1024, 1024*1024] {
let result = Regex::builder()
.size_limit(limit)
.build("(a|b|c|d|e){20}");
match result {
Ok(_) => println!("Compiled at {} bytes", limit),
Err(_) => println!("Failed at {} bytes", limit),
}
}
}Alternation combinations and nested quantifiers create state machine explosions.
Practical Attack Scenario
use regex::Regex;
fn dos_attack_example() {
// Attacker-controlled pattern input:
fn compile_user_pattern(input: &str) -> Result<Regex, String> {
// WITHOUT protection:
// Regex::new(input)? // User could crash server
// WITH protection:
Regex::builder()
.size_limit(50 * 1024) // 50 KB max
.build(input)
.map_err(|e| format!("Pattern rejected: {}", e))
}
// Attacker might try:
let malicious_patterns = [
"(a|b){50}", // Alternation explosion
"(a?){100}", // Optional explosion
"a(aaa)*b(ccc)*d", // Nested quantifiers
];
for pattern in malicious_patterns {
match compile_user_pattern(pattern) {
Ok(_) => println!("Accepted: {}", pattern),
Err(e) => println!("Rejected: {}", e),
}
}
}Size limits protect against malicious patterns in user-controlled regex input.
Memory Calculation Details
use regex::Regex;
fn memory_internals() {
// The size limit accounts for:
//
// 1. NFA states (each state has transitions)
// 2. DFA states (if compiled to DFA)
// 3. Character class tables
// 4. Capture group information
// 5. Various internal data structures
// Example: Simple pattern
let re = Regex::new(r"a+b+c+").unwrap();
// Uses minimal memory: few states, simple transitions
// Example: Complex pattern
let re = Regex::new(r"(a|b|c|d|e|f|g|h|i|j){10}").unwrap();
// Uses much more memory: 10^10 potential states
// Actually optimized, but demonstrates the concept
// The engine may use lazy DFA compilation:
// - NFA compiled upfront
// - DFA states generated on-demand during matching
// - size_limit still applies to NFA size
}The size limit covers all internal data structures, not just raw state count.
Combining Size and Backtrack Limits
use regex::Regex;
fn comprehensive_protection() -> Result<Regex, regex::Error> {
// For untrusted patterns, use both limits:
Regex::builder()
.size_limit(100 * 1024) // 100 KB for compilation
.backtrack_limit(100_000) // 100K steps for matching
.build(r"[a-z]+@[a-z]+\.[a-z]+")
// This protects against:
// - Memory exhaustion (size_limit)
// - CPU exhaustion (backtrack_limit)
}
// For trusted patterns, you might relax limits:
fn trusted_pattern() -> Result<Regex, regex::Error> {
Regex::builder()
.size_limit(100 * 1024 * 1024) // 100 MB for complex patterns
.build(r"very complex trusted pattern")
}Use both limits for defense-in-depth against untrusted patterns.
Error Handling for Size Limits
use regex::{Regex, Error};
fn handle_compilation_errors() {
let pattern = "(a|b){100}"; // Will exceed small limits
let result = Regex::builder()
.size_limit(1024) // 1 KB - too small
.build(pattern);
match result {
Ok(re) => {
println!("Compiled successfully");
}
Err(Error::CompiledTooBig(size_limit)) => {
println!("Pattern would create state machine > {} bytes", size_limit);
println!("Increase size_limit or simplify pattern");
}
Err(e) => {
println!("Other error: {}", e);
}
}
}The CompiledTooBig error variant provides the size limit that was exceeded.
Realistic Size Limit Values
use regex::Regex;
fn realistic_limits() {
// Conservative limits for untrusted input:
let conservative = Regex::builder()
.size_limit(50 * 1024) // 50 KB
.build(r"[a-z]+");
// Moderate limits for semi-trusted input:
let moderate = Regex::builder()
.size_limit(1024 * 1024) // 1 MB
.build(r"complex pattern");
// Relaxed limits for trusted/validated patterns:
let relaxed = Regex::builder()
.size_limit(10 * 1024 * 1024) // 10 MB (default)
.build(r"very complex trusted pattern");
// Minimal limits for strict security:
let minimal = Regex::builder()
.size_limit(10 * 1024) // 10 KB
.build(r"simple pattern");
// The right limit depends on:
// - Expected pattern complexity
// - Trust level of pattern source
// - Available memory
// - Performance requirements
}Choose size limits based on threat model and resource constraints.
Patterns That Bypass Size Limits
use regex::Regex;
fn still_vulnerable() {
// Size limit doesn't protect against EVERYTHING:
// 1. Time-based attacks still possible
// (use backtrack_limit for these)
let pattern = "a(a?){50}b";
let re = Regex::builder()
.size_limit(1024 * 1024)
.build(pattern)
.unwrap();
// This compiles small, but matches slowly!
// 2. Memory in the INPUT string
let input = "a".repeat(1_000_000_000); // 1 GB input
// size_limit doesn't limit input size
// 3. Capture group overhead
let pattern = "(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)";
// Many captures use memory beyond state machine
// 4. Literal string matching
let pattern = "a".repeat(1_000_000); // 1 MB pattern
// Literal patterns may not create many states
// But the pattern string itself is large
}Size limits specifically target state machine size, not all attack vectors.
Diagnostic Output
use regex::Regex;
fn debug_pattern_size() {
// If a pattern fails to compile, check why:
fn try_compile_with_increasing_limits(pattern: &str) {
let limits: Vec<usize> = vec![
1024, // 1 KB
10 * 1024, // 10 KB
100 * 1024, // 100 KB
1024 * 1024, // 1 MB
10 * 1024 * 1024, // 10 MB
];
for limit in limits {
match Regex::builder()
.size_limit(limit)
.build(pattern)
{
Ok(_) => {
println!("Compiled at {} bytes", limit);
return;
}
Err(_) => {
println!("Failed at {} bytes", limit);
}
}
}
}
try_compile_with_increasing_limits("(a|b|c|d|e){30}");
}Incrementally increase limits to find a pattern's actual memory requirement.
Best Practices
use regex::Regex;
fn best_practices() {
// 1. Always set limits for untrusted patterns
fn safe_compile(pattern: &str) -> Result<Regex, String> {
Regex::builder()
.size_limit(100 * 1024) // 100 KB
.backtrack_limit(10_000) // 10K steps
.build(pattern)
.map_err(|e| e.to_string())
}
// 2. Validate patterns before compilation
fn validate_pattern(pattern: &str) -> Result<(), String> {
let max_len = 1000;
if pattern.len() > max_len {
return Err(format!("Pattern too long: {} > {}", pattern.len(), max_len));
}
// Check for suspicious patterns
if pattern.contains("{50") || pattern.contains("{100") {
return Err("Pattern contains suspicious repetition".into());
}
Ok(())
}
// 3. Use simpler alternatives when possible
fn simplify_pattern() {
// Instead of: (a|b|c|d|e|f|g|h|i|j)+
// Consider: [a-j]+
// Character classes compile more efficiently
let good = Regex::new(r"[a-j]+"); // Efficient
let bad = Regex::new(r"(a|b|c|d|e|f|g|h|i|j)+"); // Less efficient
}
// 4. Pre-compile patterns when possible
fn precompile() {
// Don't compile on every request
// Cache compiled regexes
use std::sync::OnceLock;
static CACHED: OnceLock<Regex> = OnceLock::new();
let re = CACHED.get_or_init(|| {
Regex::new(r"pattern").unwrap()
});
}
}Combine size limits with validation and pattern simplification for robust protection.
Summary
fn summary() {
// 1. size_limit constrains compiled state machine memory
// 2. Default is ~10MB, suitable for most legitimate patterns
// 3. Protects against compilation-time memory exhaustion
// 4. Independent from backtrack_limit (which protects execution time)
// 5. Alternation combinations are common attack vectors
// 6. Use both size_limit and backtrack_limit for untrusted input
// 7. Error::CompiledTooBig indicates limit was exceeded
// 8. Size limits don't protect against time-based attacks
// 9. Character classes [abc] are more efficient than alternations (a|b|c)
// 10. Pre-compile patterns when possible to avoid repeated compilation
}Key insight: RegexBuilder::size_limit protects against memory exhaustion attacks by limiting the size of the compiled regex state machine during compilation, complementing backtrack_limit which protects against CPU exhaustion during matching. The two limits target different attack vectors: size limits prevent patterns that create exponentially large state machines (like nested alternations (a|b|c){N}), while backtrack limits prevent patterns that cause exponential matching time (like a(a?){N}b against long strings). For untrusted regex input, set both limits to appropriate values based on your threat model and resource constraints. The default 10MB limit is a reasonable balance for most applications but should be reduced for public-facing regex compilation services where memory exhaustion is a denial-of-service risk.
