Loading pageā¦
Rust walkthroughs
Loading pageā¦
regex::RegexBuilder::size_limit prevent denial-of-service from exponential backtracking?regex::RegexBuilder::size_limit sets a maximum memory bound on the compiled regex state machine, preventing denial-of-service attacks where malicious patterns could cause the regex engine to consume excessive memory during compilation. While the regex crate is generally immune to catastrophic backtracking due to its NFA-based implementation, certain patterns with massive character classes or deeply nested repetitions can produce enormous compiled state machines that exhaust memory. The size_limit method constrains the compiled automaton size, causing compilation to fail rather than allowing runaway memory consumption. This is distinct from execution-time protectionāregex uses lazy DFAs and bounded backtracking that naturally limit execution timeābut protects against the compilation phase where certain pathological patterns could otherwise allocate gigabytes of memory for their state machines.
use regex::Regex;
fn main() {
// A pattern that compiles to a huge state machine
let pattern = "[a-z]{50}";
// This compiles but creates a large automaton
let re = Regex::new(pattern).unwrap();
println!("Pattern compiled successfully");
// For very large patterns, this could consume significant memory
}Patterns with large character classes or complex alternations can create huge state machines.
use regex::RegexBuilder;
fn main() {
// Limit compiled regex to 1MB
let result = RegexBuilder::new(r"[a-z]{50}")
.size_limit(1024 * 1024) // 1 MB limit
.build();
match result {
Ok(re) => println!("Compiled: {:?}", re),
Err(e) => println!("Compilation failed: {}", e),
}
}size_limit constrains the memory used by the compiled state machine.
use regex::{Regex, RegexBuilder};
fn main() {
// These patterns can create large compiled state machines
// 1. Large counted repetitions
let pattern1 = r"a{100}"; // 100 consecutive 'a' states
// 2. Large character classes with repetitions
let pattern2 = r"[a-zA-Z0-9]{50}"; // Many states for character combinations
// 3. Complex alternations
let pattern3 = r"(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z){20}";
// 4. Nested repetitions
let pattern4 = r"(a+){10}"; // Can create many states
for (name, pattern) in [("large repetition", pattern1),
("char class repetition", pattern2),
("alternation repetition", pattern3),
("nested repetition", pattern4)] {
let result = RegexBuilder::new(pattern)
.size_limit(100_000) // 100KB limit
.build();
match result {
Ok(_) => println!("{}: compiled within limit", name),
Err(e) => println!("{}: exceeded limit - {}", name, e),
}
}
}Various pattern constructs can lead to large compiled state machines.
use regex::RegexBuilder;
fn main() {
// size_limit constrains the COMPILED state machine, not:
// - Execution time (handled by the regex engine's bounded execution)
// - Input string size (not limited)
// - Match result size (not limited)
// This pattern with a limited character class
let small_pattern = r"a{10}";
let small_re = RegexBuilder::new(small_pattern)
.size_limit(10_000)
.build()
.unwrap();
// Can still match against very long strings
let long_string = "a".repeat(1_000_000);
let matches: Vec<_> = small_re.find_iter(&long_string).collect();
println!("Found {} matches in 1M character string", matches.len());
// size_limit only affects compilation, not execution
}size_limit affects compilation memory, not execution behavior.
use regex::RegexBuilder;
fn main() {
// The default size limit is approximately 10MB
// Patterns exceeding this will fail to compile
// Check the current default
let re = RegexBuilder::new(r"a")
.build()
.unwrap();
// The default is set by the regex crate's internal constant
println!("Default size limit is around 10MB");
// You can check if a pattern exceeds default by attempting compilation
let large_pattern = format!("[a-z]{{{}}}", 1000);
match Regex::new(&large_pattern) {
Ok(_) => println!("Pattern within default limit"),
Err(e) => println!("Pattern exceeded default: {}", e),
}
}The regex crate has a sensible default limit around 10MB.
use regex::RegexBuilder;
fn main() {
// Different applications need different limits
// Web API handling untrusted input - be strict
let api_limit = 100 * 1024; // 100KB
// Configuration file parsing - moderate
let config_limit = 1024 * 1024; // 1MB
// Trusted internal patterns - generous
let internal_limit = 10 * 1024 * 1024; // 10MB
// Example: parsing user-provided regex in a search API
let user_pattern = r"[a-z]+@[a-z]+\.[a-z]{2,}";
match RegexBuilder::new(user_pattern)
.size_limit(api_limit)
.build() {
Ok(re) => println!("User pattern accepted"),
Err(e) => println!("Pattern too complex: {}", e),
}
}Choose limits based on your threat model and use case.
use regex::{Regex, RegexBuilder};
use std::fmt;
#[derive(Debug)]
enum RegexError {
Compilation(String),
SizeLimit(String),
}
impl fmt::Display for RegexError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
RegexError::Compilation(s) => write!(f, "Regex compilation error: {}", s),
RegexError::SizeLimit(s) => write!(f, "Regex size limit exceeded: {}", s),
}
}
}
impl std::error::Error for RegexError {}
fn compile_safe_regex(pattern: &str, max_size: usize) -> Result<Regex, RegexError> {
RegexBuilder::new(pattern)
.size_limit(max_size)
.build()
.map_err(|e| {
let msg = e.to_string();
if msg.contains("exceeds size limit") || msg.contains("too big") {
RegexError::SizeLimit(msg)
} else {
RegexError::Compilation(msg)
}
})
}
fn main() {
let pattern = r"[a-z]{1000}";
match compile_safe_regex(pattern, 1024) {
Ok(re) => println!("Compiled: {:?}", re),
Err(RegexError::SizeLimit(msg)) => {
println!("Pattern rejected - too large: {}", msg);
}
Err(RegexError::Compilation(msg)) => {
println!("Invalid pattern syntax: {}", msg);
}
}
}Handle size limit errors distinctly from syntax errors for better user feedback.
use regex::RegexBuilder;
use std::time::{Duration, Instant};
fn main() {
// regex crate has built-in protection against exponential backtracking
// Unlike some regex engines, it uses NFA-based execution
// This "evil" pattern that causes ReDoS in backtracking engines
let evil_pattern = r"(a+)+$";
let re = RegexBuilder::new(evil_pattern)
.size_limit(1024 * 1024)
.build()
.unwrap();
// In backtracking engines, this would be catastrophic
let input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
let start = Instant::now();
let is_match = re.is_match(input);
let duration = start.elapsed();
println!("Match: {}, Time: {:?}", is_match, duration);
// With regex crate, this completes quickly
// The NFA-based engine doesn't have exponential backtracking
// size_limit is for compilation, not execution
println!("Note: regex crate's NFA engine prevents ReDoS by design");
}The regex crate prevents ReDoS through its NFA engine; size_limit addresses compilation memory.
use regex::{Regex, RegexBuilder};
fn main() {
// Scenario: User-provided regex patterns in a search feature
// Malicious user tries to submit a pattern that creates huge state machine
let malicious_patterns = [
// Large counted repetition
r"[a-z]{10000}",
// Deeply nested alternatives
r"((((((((((a|b))))))))))",
// Large character class combinations
r"[ -~]{1000}", // ASCII printable range
];
let size_limit = 50 * 1024; // 50KB limit for user patterns
for (i, pattern) in malicious_patterns.iter().enumerate() {
match RegexBuilder::new(pattern)
.size_limit(size_limit)
.build() {
Ok(_) => println!("Pattern {}: accepted", i),
Err(e) => println!("Pattern {}: rejected - {}", i, e),
}
}
// Without size_limit, these could exhaust server memory
}size_limit protects services that accept user-provided regex patterns.
use regex::RegexBuilder;
fn main() {
// Patterns that can create large compiled sizes
// Each nesting level multiplies states
let shallow = r"(a|b)"; // Small
let medium = r"((a|b)|(c|d))"; // Medium
let deep = r"((((a|b)|(c|d))|((e|f)|(g|h)))|(((i|j)|(k|l))|((m|n)|(o|p))))"; // Large
let limit = 10 * 1024; // 10KB
for (name, pattern) in [("shallow", shallow), ("medium", medium), ("deep", deep)] {
match RegexBuilder::new(pattern)
.size_limit(limit)
.build() {
Ok(_) => println!("{}: compiled", name),
Err(_) => println!("{}: exceeded limit", name),
}
}
// With repetition, size grows exponentially
let repeated = r"(a|b){20}"; // Alternation with repetition
match RegexBuilder::new(repeated)
.size_limit(1024)
.build() {
Ok(_) => println!("repeated: compiled"),
Err(_) => println!("repeated: exceeded limit"),
}
}Nesting depth and alternations multiply compiled state machine size.
use regex::RegexBuilder;
fn main() {
// size_limit can be combined with other builder options
let pattern = r"(?i)[a-z]{10}"; // Case insensitive
let re = RegexBuilder::new(pattern)
.size_limit(100 * 1024) // Limit compiled size
.case_insensitive(true) // Explicit case insensitive
.multi_line(true) // Multi-line mode
.dot_matches_new_line(true) // Dot matches \n
.build()
.unwrap();
// Test the compiled regex
let text = "ABCDEFGHIJ";
assert!(re.is_match(text));
println!("Case insensitive match works");
}size_limit integrates with other RegexBuilder configuration options.
use regex::Regex;
fn estimate_regex_complexity(pattern: &str) -> usize {
// Rough heuristic: count special constructs
let mut complexity = 1;
let chars: Vec<char> = pattern.chars().collect();
for i in 0..chars.len() {
match chars[i] {
'*' | '+' | '?' => complexity *= 2,
'{' => {
// Counted repetition - potentially large
complexity += 10;
}
'[' => {
// Character class
complexity += 5;
}
'|' => {
// Alternation
complexity *= 2;
}
'(' if i + 1 < chars.len() && chars[i + 1] != '?' => {
// Grouping
complexity += 2;
}
_ => {}
}
}
complexity
}
fn main() {
let patterns = [
r"a",
r"a*",
r"a{10}",
r"[a-z]{10}",
r"(a|b){10}",
];
for pattern in patterns {
let complexity = estimate_regex_complexity(pattern);
let recommended_limit = complexity * 1024; // Rough estimate
println!("Pattern: {}", pattern);
println!(" Estimated complexity: {}", complexity);
println!(" Recommended size_limit: {} bytes", recommended_limit);
// Try compiling with suggested limit
let result = regex::RegexBuilder::new(pattern)
.size_limit(recommended_limit.max(1024))
.build();
match result {
Ok(_) => println!(" Result: compiled successfully"),
Err(e) => println!(" Result: failed - {}", e),
}
}
}Heuristics can help estimate appropriate limits before compilation.
use regex::bytes::RegexBuilder as BytesRegexBuilder;
use regex::RegexBuilder;
fn main() {
// size_limit applies to both string and bytes regex
let pattern = r"[a-z]{50}";
// String regex
let string_re = RegexBuilder::new(pattern)
.size_limit(50 * 1024)
.build();
// Bytes regex
let bytes_re = BytesRegexBuilder::new(pattern)
.size_limit(50 * 1024)
.build();
println!("String regex: {:?}", string_re.is_ok());
println!("Bytes regex: {:?}", bytes_re.is_ok());
// Both support the same size_limit for their compiled state machines
}size_limit works for both regex::Regex and regex::bytes::Regex.
use regex::{Regex, RegexBuilder};
fn test_pattern_safety(pattern: &str) -> Result<Regex, String> {
// Compile with strict limits to test pattern safety
let result = RegexBuilder::new(pattern)
.size_limit(10 * 1024) // 10KB limit for testing
.build();
match result {
Ok(re) => {
// Also test with various inputs to ensure reasonable execution
let test_inputs = ["", "a", "test", &"a".repeat(100)];
for input in test_inputs {
// This should be fast with regex crate's NFA
let _ = re.is_match(input);
}
Ok(re)
}
Err(e) => Err(format!("Pattern unsafe: {}", e)),
}
}
fn main() {
let test_patterns = [
r"\d+", // Safe
r"[a-z]+", // Safe
r"[a-z]{10000}", // Large - may fail size limit
r"(a|b){100}", // Complex - may fail size limit
];
for pattern in test_patterns {
match test_pattern_safety(pattern) {
Ok(_) => println!("{}: SAFE", pattern),
Err(e) => println!("{}: {}", pattern, e),
}
}
}Test user-provided patterns with strict limits before using them in production.
What size_limit protects against:
| Scenario | Protected? | Mechanism | |----------|------------|-----------| | Compilation memory exhaustion | Yes | Limits state machine size | | Runtime ReDoS attacks | No (not needed) | NFA engine prevents backtracking | | Input string size | No | Not the purpose of size_limit | | Match result size | No | Not the purpose of size_limit |
Recommended limits by context:
| Context | Recommended Limit | Rationale | |---------|------------------|-----------| | User-provided patterns | 50-100 KB | Untrusted input needs strict limits | | Application patterns | 1-10 MB | Trusted patterns can be larger | | System utilities | Default (10 MB) | Balance safety and functionality |
Pattern characteristics affecting size:
| Construct | Impact on Size |
|-----------|---------------|
| Literal characters | Minimal |
| Simple character classes | Small |
| Large counted repetitions {n} | Linear in n |
| Large character classes | Proportional to class size |
| Nested alternations | Exponential in depth |
| Repetition of alternation | Multiplicative |
Key insight: regex::RegexBuilder::size_limit addresses a specific attack vectorāmalicious patterns that compile to enormous state machinesārather than the more common ReDoS attack based on exponential backtracking. The regex crate's NFA-based execution engine already prevents the classic ReDoS scenario where a pattern like (a+)+$ matches exponentially against crafted input. Instead, size_limit protects the compilation phase: a pattern like [a-z]{100000} could allocate hundreds of megabytes for its state machine, and without a limit, a server accepting user-provided regex patterns could be memory-starved. The default limit of approximately 10MB provides reasonable protection for most use cases, but security-conscious applications handling untrusted patterns should set stricter limits (50-100KB). The method is part of RegexBuilder alongside other compilation-time options like case_insensitive and multi_line, and it applies before any execution happensāif compilation fails due to size, no state machine is created and no input can be processed. This makes it a first line of defense for regex-powered search features, configuration parsers that accept patterns, or any system where users can influence the regex but shouldn't be able to consume arbitrary server memory.