Loading pageā¦
Rust walkthroughs
Loading pageā¦
regex::RegexBuilder::size_limit prevent denial-of-service from exponential backtracking?regex::RegexBuilder::size_limit sets a byte limit on the size of the compiled regex's internal state machine, preventing malicious or accidental patterns from consuming excessive memory during compilation. Certain regex patternsāparticularly those with nested quantifiers like (a+)+bācan cause exponential state space explosion when compiled to a deterministic finite automaton (DFA). The size limit catches these pathological patterns at compile time, rejecting them before they can exhaust system resources. This is distinct from runtime backtracking limits; size_limit protects against compilation-time denial-of-service, not matching-time attacks.
use regex::Regex;
fn problematic_patterns() {
// These patterns can cause exponential DFA growth
// Nested quantifiers: (a+)+b
// Matches "ab", "aab", "aaab", etc., but the DFA has exponential states
// because each "a" can be matched by either the inner or outer +
// Simple input: "aaaab"
// The DFA must track all possible ways to partition the "a"s
// This pattern works for reasonable inputs:
let pattern = "(a+)+b";
let re = Regex::new(pattern).unwrap();
assert!(re.is_match("aaaab"));
// But with certain inputs, DFA compilation could be huge
// size_limit prevents this by capping DFA size
}Nested quantifiers and alternations can create patterns where the minimal DFA has exponentially more states than the pattern length.
use regex::RegexBuilder;
fn basic_size_limit() {
// Default size limit is reasonable for most patterns
let re = RegexBuilder::new(r"\w+@\w+\.\w+")
.build()
.unwrap();
// Set explicit size limit (in bytes)
let re = RegexBuilder::new(r"(a+)+b")
.size_limit(10_000) // 10KB limit for compiled DFA
.build();
match re {
Ok(regex) => println!("Pattern compiled successfully"),
Err(e) => println!("Pattern rejected: {}", e),
}
}The size_limit method sets the maximum bytes the compiled regex can consume.
use regex::RegexBuilder;
fn default_limit() {
// Default: approximately 10MB for the DFA
// This is generous for normal patterns
// Most patterns compile well under the limit
let normal = RegexBuilder::new(r"\d{4}-\d{2}-\d{2}")
.build()
.unwrap();
// Patterns that exceed the limit fail to compile
let problematic = RegexBuilder::new(r"(a|b|c|d|e|f)+")
.size_limit(100) // Very small limit for demonstration
.build();
match problematic {
Ok(_) => println!("Compiled"),
Err(e) => println!("Failed: {}", e),
// Failed: compiled regex exceeds size limit
}
}The default limit is set high enough for practical patterns but catches pathological cases.
use regex::RegexBuilder;
fn two_different_limits() {
// size_limit: compilation-time limit
// Controls DFA/memory size during pattern compilation
let re = RegexBuilder::new(r"(a+)+b")
.size_limit(1_000_000) // Compilation limit
.build()
.unwrap();
// backtracking_limit (if using backtracking engine)
// Controls runtime matching steps
// Note: Rust's regex crate uses DFAs, not backtracking by default
// The regex crate primarily uses:
// - DFA (deterministic finite automaton)
// - NFA (nondeterministic finite automaton) simulation
// - Hybrid approaches
// size_limit catches problems BEFORE matching starts
// backtracking_limit catches problems DURING matching
}size_limit prevents compilation attacks; it doesn't limit matching time.
use regex::RegexBuilder;
fn exponential_patterns() {
// Pattern 1: Nested quantifiers
let nested = RegexBuilder::new(r"(a+)+")
.size_limit(1000)
.build();
// Inner + and outer + create exponential possibilities
// Pattern 2: Multiple overlapping alternations
let overlapping = RegexBuilder::new(r"(a|aa|aaa|aaaa)+")
.size_limit(1000)
.build();
// Many ways to match same strings
// Pattern 3: Repeated optional groups
let optional = RegexBuilder::new(r"(a?){100}")
.size_limit(1000)
.build();
// Exponential combinations with many optional elements
// These patterns create huge DFAs because:
// - Each branch or quantifier adds state combinations
// - Overlapping possibilities multiply states
// - The minimal DFA becomes exponentially large
}Patterns with nested quantifiers, overlapping alternations, or many optional elements can hit size limits.
// Understanding DFA state explosion:
//
// Pattern: (a+)+b
// Input possibilities: "ab", "aab", "aaab", "aaaab", ...
//
// For input "aaab" (three 'a's followed by 'b'):
// How many ways can we match the 'a's with (a+)+?
//
// (aaa)b - outer + matches all three
// (a)(aa)b - outer + has two groups: "a" and "aa"
// (aa)(a)b - outer + has two groups: "aa" and "a"
// (a)(a)(a)b - outer + has three groups
//
// The DFA must track ALL these possibilities
// to correctly reject or accept
//
// For n 'a's: approximately 2^(n-1) groupings
// DFA states grow exponentially
fn explain_state_explosion() {
// Simple alternation: reasonable
let simple = RegexBuilder::new(r"a|b|c")
.build()
.unwrap();
// DFA: ~3 states (one per alternative)
// Nested quantifier: can be exponential
let nested = RegexBuilder::new(r"(a+)+")
.size_limit(10_000)
.build();
// DFA: potentially thousands of states
// Because: inner + has states, outer + multiplies them
// The regex crate's size_limit catches these
// before they consume all available memory
}DFAs must track all possible match paths, leading to exponential state growth for certain patterns.
use regex::RegexBuilder;
fn email_validation() {
// Naive email pattern (simplified)
let naive = RegexBuilder::new(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+")
.build()
.unwrap();
// Complex email pattern (RFC 5322-ish)
// This has many alternations and optional parts
let complex_pattern = r#"(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"#;
// Even complex patterns usually compile fine
let complex = RegexBuilder::new(complex_pattern)
.build()
.unwrap();
// size_limit protects against intentional DoS patterns
// Not normal complex patterns
}Normal complex patterns compile fine; size limits catch pathological cases.
use regex::Regex;
fn measure_size() {
// The regex crate doesn't expose exact DFA size directly
// But you can observe size_limit behavior:
// Pattern that should be small
let small = Regex::new(r"hello").unwrap();
println!("Small pattern compiled");
// Pattern approaching limits
let medium = Regex::new(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
.unwrap();
println!("Medium pattern compiled");
// If a pattern fails size_limit, it returns an error
// rather than consuming unbounded memory
// To check if your pattern is close to limits,
// progressively reduce size_limit and see when it fails
}Size limits prevent unbounded memory; patterns approaching limits fail with clear errors.
use regex::RegexBuilder;
fn setting_limits() {
// Default: ~10MB - good for most applications
// Small patterns: very low limit is fine
let tiny = RegexBuilder::new(r"\d{4}")
.size_limit(1000) // 1KB
.build()
.unwrap();
// Complex but valid patterns: increase limit
let complex = RegexBuilder::new(r"[a-z]+@[a-z]+\.[a-z]{2,}")
.size_limit(100_000) // 100KB
.build()
.unwrap();
// For accepting user-defined patterns:
// Use conservative limits to prevent DoS
fn compile_user_pattern(pattern: &str) -> Result<Regex, regex::Error> {
RegexBuilder::new(pattern)
.size_limit(50_000) // Conservative: 50KB
.build()
}
// For trusted internal patterns:
// Higher limits or no limit (None not available, just use default)
fn compile_trusted_pattern(pattern: &str) -> Result<Regex, regex::Error> {
RegexBuilder::new(pattern)
.size_limit(10_000_000) // 10MB - generous for trusted patterns
.build()
}
}Choose size limits based on whether patterns are trusted (internal) or untrusted (user input).
use regex::RegexBuilder;
fn engine_types() {
// The regex crate uses multiple engines internally:
// - PikeVM (NFA simulation) - handles any pattern, slower
// - DFA (deterministic) - fast, but can be huge
// - Hybrid/Lazy DFA - builds DFA on-demand
// size_limit primarily affects DFA compilation
// When DFA would be too large, the crate may fall back
// to slower engines or reject the pattern
let pattern = "(a|b)*c";
// This pattern creates a reasonably-sized DFA
let re = RegexBuilder::new(pattern)
.size_limit(100_000)
.build()
.unwrap();
// The regex crate intelligently chooses engines:
// - Tries to build fast DFA
// - Falls back to slower NFA if DFA too large
// - Or fails if even fallback exceeds limits
}The regex crate may use slower engines when fast DFAs would exceed size limits.
use regex::RegexBuilder;
fn mitigating_attacks() {
// Malicious user provides nested quantifiers
let attack_patterns = vec
![
"(a+)+b", // Nested quantifiers
"(a|aa)+", // Overlapping alternatives
"((a+)+)+", // Triple nested
"(.?){1000}", // Exponential optional
];
let safe_limit = 100_000; // 100KB
for pattern in attack_patterns {
let result = RegexBuilder::new(pattern)
.size_limit(safe_limit)
.build();
match result {
Ok(_) => println!("Pattern '{}' compiled (under limit)", pattern),
Err(e) => println!("Pattern '{}' rejected: {}", pattern, e),
}
}
// Output likely:
// Pattern '(a+)+b' compiled (under limit) - may be under default
// Pattern '(a|aa)+' compiled (under limit)
// Pattern '((a+)+)+' may be rejected - too complex
// Pattern '(.?){1000}' rejected - too many states
// Attack prevented: program doesn't hang or crash
}Size limits transform potential DoS attacks into compile-time errors.
// Comparison with other regex implementations:
//
// PCRE/Perl-style regex engines:
// - Use backtracking
// - Can have exponential TIME on certain inputs
// - Need backtracking limits (steps, time)
//
// RE2 (Google's regex library):
// - Uses DFAs (like Rust's regex crate)
// - Has compilation size limits
// - Guaranteed linear time matching
//
// Rust regex crate:
// - Uses DFAs and NFAs
// - Has size_limit for compilation
// - Guaranteed no exponential runtime
// - Best of both: fast AND safe
fn rust_regex_advantages() {
// Rust's regex crate guarantees:
// 1. O(n) matching time (linear in input)
// 2. Bounded compilation memory (size_limit)
// 3. No backtracking explosions
// Other libraries might need:
// - Backtracking limits
// - Timeouts
// - Input sanitization
}Rust's regex crate uses DFA-based matching, preventing runtime backtracking attacks; size_limit prevents compilation attacks.
use regex::RegexBuilder;
fn bytes_limit() {
// bytes_limit is another related setting
// It limits the size of the compiled program in bytes
let re = RegexBuilder::new(r"(a+)+")
.size_limit(10_000) // DFA size limit
.build();
// Note: Different regex implementations may have
// different limit parameters. Check documentation.
// The regex crate's size_limit specifically limits
// the total heap memory used by compiled regex
}The size limit constrains heap memory used by the compiled regex state machine.
use regex::RegexBuilder;
fn best_practices() {
// 1. Use default for trusted internal patterns
let internal = RegexBuilder::new(r"^[a-z]+$")
.build()
.unwrap();
// 2. Set conservative limits for user-provided patterns
fn compile_user_regex(pattern: &str) -> Result<regex::Regex, String> {
RegexBuilder::new(pattern)
.size_limit(10_000) // 10KB - very conservative
.build()
.map_err(|e| format!("Invalid regex: {}", e))
}
// 3. Validate pattern complexity separately if needed
fn is_pattern_safe(pattern: &str) -> bool {
// Check for known dangerous patterns
let dangerous = ["(a+)+", "(.*)*", "(.?){", "((.+)+)+"];
!dangerous.iter().any(|d| pattern.contains(d))
}
// 4. Document limits in public APIs
/// Compiles a user-provided regex pattern.
///
/// # Limits
/// - Size limit: 50KB (prevents DoS from complex patterns)
/// - Pattern must not contain nested quantifiers
pub fn compile_pattern(pattern: &str) -> Result<regex::Regex, regex::Error> {
RegexBuilder::new(pattern)
.size_limit(50_000)
.build()
}
// 5. Test patterns at various sizes
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_size_limit() {
// Should compile
assert!(compile_user_regex(r"\d+").is_ok());
// Should fail - too complex
// (actual behavior depends on limit and pattern)
}
}
}Follow security best practices when accepting user-provided regex patterns.
Size limit purpose:
// size_limit prevents: Excessive memory during COMPILATION
// NOT: Excessive time during MATCHING
// The threat model:
// Attacker provides: (a+)+b
// With naive DFA: Exponential states -> OOM
// With size_limit: Pattern rejected -> Safe errorHow it works:
// During compilation:
// 1. Parse regex pattern
// 2. Build NFA from pattern
// 3. Convert NFA to DFA (or simulate)
// 4. Track memory usage
// 5. If usage > size_limit: Abort and return error
// 6. If under limit: Complete and return RegexKey distinction:
// Exponential BACKTRACKING (other engines):
// - Problem: Matching time
// - Input: "aaaaaaaaaaaaaaaX" against "(a+)+b"
// - Each 'a' creates branches
// - Matching takes exponential time
// - Solution: Time limits, backtracking limits
// Exponential STATE SPACE (Rust's issue):
// - Problem: Compilation memory
// - Pattern: "(a+)+" or similar
// - DFA has exponential states
// - Compilation uses exponential memory
// - Solution: size_limit
// Rust's regex crate prevents BOTH:
// - No backtracking engine -> No exponential TIME
// - size_limit -> No exponential MEMORYKey insight: size_limit is a compile-time defense against patterns that would create enormous deterministic finite automata. The regex crate's DFA-based matching already prevents the classic "catastrophic backtracking" attacks that affect backtracking regex engines. But DFAs themselves can be exponentially large for certain patternsāsize_limit catches these at compilation time, transforming a potential denial-of-service into a clear error message. For applications accepting user-provided regex patterns, setting a conservative size limit (like 10-50KB) is essential for security. For internal trusted patterns, the default limit (~10MB) is usually sufficient, but can be increased for legitimately complex patterns.