What are the trade-offs between tower::limit::ConcurrencyLimit and RateLimit for controlling service load?
ConcurrencyLimit controls how many requests can be in-flight simultaneously, applying backpressure (rather than rejecting) when the limit is reached, while RateLimit controls how many requests are accepted per time window, allowing bursts up to the limit and delaying the rest. Both protect services from overload but address different problems: ConcurrencyLimit prevents resource exhaustion from concurrent work, while RateLimit prevents abuse and ensures fair usage over time.
The Purpose of Load Limiting Middleware
use tower::Service;
use std::task::{Context, Poll};
// Without limits, services can be overwhelmed by:
// 1. Too many concurrent requests (memory, CPU, connection exhaustion)
// 2. Too many requests per second (abuse, resource starvation)
// 3. Both combined (cascading failures)
// tower::limit provides two middleware:
// - ConcurrencyLimit: Limits in-flight requests
// - RateLimit: Limits requests per time window

Load limiting middleware protects services from being overwhelmed by controlling how requests flow into a service.
ConcurrencyLimit: Controlling In-Flight Requests
use tower::limit::ConcurrencyLimit;
use tower::{Service, ServiceExt};
use std::time::Duration;
async fn concurrency_limit_example() {
// ConcurrencyLimit wraps a service and limits concurrent requests
let inner_service = make_service();
// Allow at most 10 requests to be processing at once
let limited_service = ConcurrencyLimit::new(inner_service, 10);
// How it works:
// - Requests beyond the limit wait: poll_ready returns Pending until a permit frees
// - When a request completes, the next queued request starts
// - Total concurrent requests never exceeds 10
// This protects against:
// - Memory exhaustion from too many simultaneous operations
// - Database connection pool exhaustion
// - Thread pool saturation
}

ConcurrencyLimit bounds the number of requests actively being processed at any moment.
How ConcurrencyLimit Queues Requests
use tower::limit::ConcurrencyLimit;
use tower::{Service, ServiceExt};
use std::convert::Infallible;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;
async fn concurrency_queue_behavior() {
let counter = Arc::new(AtomicUsize::new(0));
let counter_clone = counter.clone();
// Service that tracks concurrent requests
let service = tower::service_fn(move |_| {
let counter = counter_clone.clone();
async move {
let current = counter.fetch_add(1, Ordering::SeqCst) + 1;
println!("Concurrent requests: {}", current);
// Simulate work
tokio::time::sleep(Duration::from_millis(100)).await;
counter.fetch_sub(1, Ordering::SeqCst);
Ok::<_, Infallible>("done")
}
});
let limited = ConcurrencyLimit::new(service, 3); // Max 3 concurrent
// If we send 10 requests:
// - First 3 start immediately
// - Next 7 are queued (waiting for completion)
// - As each completes, next in queue starts
// - Concurrent count never exceeds 3
}

Requests beyond the concurrency limit are buffered until capacity becomes available.
RateLimit: Controlling Requests Per Time Window
use tower::limit::{rate::Rate, RateLimit};
use tower::{Service, ServiceExt};
use std::time::Duration;
async fn rate_limit_example() {
let inner_service = make_service();
// Allow at most 100 requests per second
let limited_service = RateLimit::new(inner_service, Rate::new(100, Duration::from_secs(1)));
// How it works:
// - Tracks requests within each time window
// - Requests beyond limit are delayed until next window
// - Allows bursts up to the limit, then throttles
// This protects against:
// - API abuse and DDoS attacks
// - Resource starvation from high request rates
// - Downstream service overload
}

RateLimit bounds how many requests are accepted per time period, smoothing out traffic spikes.
How RateLimit Throttles Requests
use tower::limit::{rate::Rate, RateLimit};
use tower::{Service, ServiceExt};
use std::convert::Infallible;
use std::time::Duration;
async fn rate_limit_behavior() {
let service = tower::service_fn(|_| async { Ok::<_, Infallible>("done") });
// 5 requests per second
let limited = RateLimit::new(service, Rate::new(5, Duration::from_secs(1)));
// If we send 10 requests in quick succession:
// - First 5: processed immediately
// - Next 5: delayed until 1 second window resets
// After 1 second:
// - Rate limit resets
// - Another 5 requests allowed
// This creates a "smoothed" request pattern:
// - Bursts are allowed up to limit
// - Beyond limit: requests are throttled, not rejected
}

RateLimit delays excess requests rather than rejecting them, smoothing traffic over time.
Key Differences in Behavior
use tower::limit::{ConcurrencyLimit, RateLimit};
use std::time::Duration;
fn comparing_limits() {
// ConcurrencyLimit:
// - Limits: In-flight requests (how many processing NOW)
// - Effect: Queues requests when at capacity
// - Recovery: Immediate (when request completes)
// - Protection: Resource exhaustion
// - Metric: "Concurrent requests"
// RateLimit:
// - Limits: Requests per time window (how many per SECOND)
// - Effect: Delays requests when at limit
// - Recovery: Time-based (when window resets)
// - Protection: Abuse, overload over time
// - Metric: "Requests per second"
// Example scenario:
// - ConcurrencyLimit(10): At most 10 requests processing at once
// - Can handle 1000 requests/second if each takes 10ms
// - Or 10 requests/second if each takes 1 second
// - RateLimit(10, 1s): At most 10 requests per second
// - Whether requests take 1ms or 10 seconds
// - Independent of request duration
}

| Aspect | ConcurrencyLimit | RateLimit |
|---|---|---|
| What it limits | In-flight requests | Requests per time window |
| Recovery trigger | Request completion | Time passage |
| Request handling | Queues until capacity | Delays until window reset |
| Dependent on request duration | Yes | No |
| Primary protection | Resource exhaustion | Abuse/overload |
When to Use ConcurrencyLimit
use tower::limit::ConcurrencyLimit;
use tower::{Service, ServiceExt};
async fn concurrency_use_cases() {
// Use ConcurrencyLimit when:
// 1. Resource-bound services (databases, external APIs)
let db_service = ConcurrencyLimit::new(database_service(), 50);
// Database has limited connection pool
// ConcurrencyLimit prevents pool exhaustion
// 2. Memory-bound operations
let memory_service = ConcurrencyLimit::new(large_allocation_service(), 10);
// Each request allocates lots of memory
// Limit prevents OOM by bounding concurrent allocations
// 3. CPU-bound operations
let cpu_service = ConcurrencyLimit::new(compute_service(), num_cpus::get());
// Limit concurrent CPU-intensive work to core count
// 4. Backend services with hard concurrency limits
let backend = ConcurrencyLimit::new(external_api_service(), 20);
// External API allows only 20 concurrent connections
}

Use ConcurrencyLimit when the constraint is the number of concurrent operations, not the rate of requests.
When to Use RateLimit
use tower::limit::{rate::Rate, RateLimit};
use std::time::Duration;
async fn rate_limit_use_cases() {
// Use RateLimit when:
// 1. API rate limiting (respect third-party limits)
let api_service = RateLimit::new(
third_party_api(),
Rate::new(100, Duration::from_secs(1)), // API allows 100 requests/second
);
// 2. Abuse prevention
let user_service = RateLimit::new(
user_endpoint(),
Rate::new(10, Duration::from_secs(1)), // Each user gets 10 requests/second
);
// 3. Fair resource allocation
let shared_service = RateLimit::new(
shared_resource(),
Rate::new(50, Duration::from_secs(1)), // 50 requests/second total
);
// Prevents one client from monopolizing service
// 4. Cost control (paid APIs)
let paid_service = RateLimit::new(
expensive_api(),
Rate::new(1000, Duration::from_secs(86_400)), // Budget allows 1000 calls/day
);
}

Use RateLimit when the constraint is requests per time period, regardless of concurrent load.
Composing Both Limits Together
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use std::time::Duration;
async fn combined_limits() {
// Services often need BOTH limits:
let service = tower::ServiceBuilder::new()
.layer(RateLimitLayer::new(1000, Duration::from_secs(1))) // Max 1000/sec (outermost)
.layer(ConcurrencyLimitLayer::new(100)) // Max 100 concurrent
.service(make_service());
// Why both?
// - RateLimit prevents request flood from overwhelming service
// - ConcurrencyLimit prevents slow requests from consuming all resources
// Example: Web service with database
// - RateLimit: Allow 1000 requests/second
// - ConcurrencyLimit: Only 50 concurrent DB queries
// - If all requests need DB, only 50 process at once
// - Rate limit prevents queue from growing unbounded
// Layer ordering matters!
// RateLimit then ConcurrencyLimit:
// - RateLimit gates first, then ConcurrencyLimit
// - Requests past rate limit never reach concurrency limit
// ConcurrencyLimit then RateLimit:
// - ConcurrencyLimit gates first, then RateLimit
// - Different behavior for queued requests
}

Combining both limits provides defense-in-depth: rate limiting for abuse prevention, concurrency limiting for resource protection.
Layer Ordering Effects
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use std::time::Duration;
async fn ordering_matters() {
// Order 1: RateLimit then ConcurrencyLimit
let service_v1 = tower::ServiceBuilder::new()
.layer(RateLimitLayer::new(100, Duration::from_secs(1)))
.layer(ConcurrencyLimitLayer::new(10))
.service(inner_service());
// Request flow:
// 1. RateLimit checks: 100/sec allowed?
// - If no: delayed
// - If yes: pass to ConcurrencyLimit
// 2. ConcurrencyLimit checks: <10 in flight?
// - If no: queued
// - If yes: process
// Order 2: ConcurrencyLimit then RateLimit
let service_v2 = tower::ServiceBuilder::new()
.layer(ConcurrencyLimitLayer::new(10))
.layer(RateLimitLayer::new(100, Duration::from_secs(1)))
.service(inner_service());
// Request flow:
// 1. ConcurrencyLimit checks: <10 in flight?
// - If no: queued (not rate limited!)
// - If yes: pass to RateLimit
// 2. RateLimit checks: 100/sec allowed?
// - If no: delayed
// - If yes: process
// The first layer is applied first to incoming requests
// Choose based on which limit you want to enforce first
}

Layer order determines which limit is checked first and affects how queued requests are handled.
Handling Backpressure
use tower::limit::ConcurrencyLimit;
use tower::{Service, ServiceExt};
use std::future::Future;
use std::pin::Pin;
async fn backpressure() {
// ConcurrencyLimit creates backpressure naturally:
// - When at capacity, poll_ready returns Pending
// - Caller must wait for capacity
// - This propagates backpressure through the service stack
let limited = ConcurrencyLimit::new(service(), 5);
// When calling the service:
// 1. Call poll_ready() first
// 2. If Pending, wait for notification
// 3. When Ready, call call()
// RateLimit also creates backpressure:
// - When at limit, poll_ready returns Pending
// - Waits until rate window resets
// Backpressure is essential for:
// - Preventing unbounded queues
// - Propagating load through the system
// - Letting callers know to slow down
}

Both middleware types propagate backpressure through poll_ready, allowing callers to react to load.
Response to Overload
use tower::limit::{ConcurrencyLimit, RateLimit};
use std::time::Duration;
async fn overload_response() {
// ConcurrencyLimit under overload:
// - poll_ready stays Pending; waiting work piles up at the callers
// - On its own it never rejects; combine with LoadShed to fail fast
// - Load shedding converts extreme overload into quick errors instead of waits
// RateLimit under overload:
// - Delays requests until next window
// - Never rejects, only throttles
// - Eventual processing: all requests eventually handled
// ConcurrencyLimit alone applies backpressure; with LoadShed it becomes "fail-fast"
// RateLimit is "eventual processing" with delays
// For systems that prefer graceful degradation:
// - Use RateLimit with reasonable limits
// For systems that must protect resources:
// - Use ConcurrencyLimit to bound resources
// For comprehensive protection:
// - Use both, with RateLimit outermost
}

ConcurrencyLimit applies backpressure under sustained overload (and can shed load when paired with LoadShed), while RateLimit delays all requests until capacity is available.
Metrics and Observability
use tower::limit::{ConcurrencyLimit, RateLimit};
use std::time::Duration;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
async fn observability() {
// ConcurrencyLimit metrics to track:
// - Current concurrent requests (gauge)
// - Queue size (how many waiting)
// - Shed requests, when combined with LoadShed (counter)
// - Queue wait time (histogram)
// RateLimit metrics to track:
// - Current window count (gauge)
// - Requests delayed (counter)
// - Delay duration (histogram)
// - Window resets (counter)
// Example: Track concurrent requests
let counter = Arc::new(AtomicUsize::new(0));
let service = tower::service_fn(move |_| {
let counter = counter.clone();
async move {
let current = counter.fetch_add(1, Ordering::SeqCst) + 1;
// metric: concurrent_requests = current
let result = process_request().await;
counter.fetch_sub(1, Ordering::SeqCst);
result
}
});
let limited = ConcurrencyLimit::new(service, 10);
// Monitor queue depth and adjust limits dynamically
}

Tracking metrics for both limits helps tune them appropriately.
Dynamic Limit Adjustment
use tower::limit::ConcurrencyLimit;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
async fn dynamic_limits() {
// ConcurrencyLimit can be adjusted based on conditions:
// Example: Adjust based on error rate
let current_limit = Arc::new(AtomicUsize::new(100));
// Lower limit when errors increase
// Raise limit when healthy
// This requires custom middleware or periodic checks
// RateLimit can be adjusted per client:
let client_tiers = vec![
("free", 10), // 10 req/sec
("pro", 100), // 100 req/sec
("enterprise", 1000), // 1000 req/sec
];
// Implementation requires per-client rate limiters,
// e.g. a map from client key to a dedicated RateLimit instance
}

Limits can be adjusted dynamically based on system health, time of day, or client tier.
Practical Configuration
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use std::time::Duration;
async fn practical_config() {
// Guidelines for ConcurrencyLimit:
// Database: Limit to connection pool size
let db_limit = ConcurrencyLimitLayer::new(20); // Pool of 20
// External API: Check their documentation
let api_limit = ConcurrencyLimitLayer::new(10); // API allows 10 concurrent
// CPU-bound: Limit to CPU cores (num_cpus crate)
let cpu_limit = ConcurrencyLimitLayer::new(num_cpus::get());
// Memory-bound: Limit based on memory per request
// If each request needs 100MB and you have 8GB:
let memory_limit = ConcurrencyLimitLayer::new(80); // 8GB / 100MB
// Guidelines for RateLimit:
// Third-party API: Use their published limits
let api_rate = RateLimitLayer::new(100, Duration::from_secs(1));
// Fair usage: Allow reasonable bursts
let fair_rate = RateLimitLayer::new(50, Duration::from_secs(1));
// Cost control: Match your budget
let budget_rate = RateLimitLayer::new(10_000, Duration::from_secs(86_400));
// Combined: Protect both resources and rate
let protected = tower::ServiceBuilder::new()
.layer(RateLimitLayer::new(100, Duration::from_secs(1)))
.layer(ConcurrencyLimitLayer::new(50))
.service(my_service());
}

Configuration depends on resource constraints and external limits.
Summary Table
fn summary_table() {
// | Aspect | ConcurrencyLimit | RateLimit |
// |--------|------------------|-----------|
// | Limits | In-flight requests | Requests per time |
// | Protection | Resource exhaustion | Abuse/overload |
// | Dependent on duration | Yes | No |
// | Recovery | Request completion | Time passage |
// | Overload response | Backpressure (wait) | Delay (throttle) |
// | Primary use case | Resource bounding | Traffic shaping |
// | Scenario | Recommended |
// |----------|-------------|
// | Database connection limit | ConcurrencyLimit |
// | API rate limit compliance | RateLimit |
// | Memory-bound operations | ConcurrencyLimit |
// | User quota enforcement | RateLimit |
// | Production service | Both (RateLimit outer) |
}

Synthesis
Quick reference:
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use std::time::Duration;
use tower::ServiceBuilder;
fn quick_reference() {
// ConcurrencyLimit: Bound concurrent requests
// Use when: Resource has hard limit (connections, memory, threads)
let concurrency = ConcurrencyLimitLayer::new(50);
// RateLimit: Bound requests per time
// Use when: Enforcing quotas, preventing abuse, API limits
let rate = RateLimitLayer::new(100, Duration::from_secs(1));
// Combined: Defense in depth
let combined = ServiceBuilder::new()
.layer(rate) // Rate limit first (traffic shaping)
.layer(concurrency) // Then concurrency (resource protection)
.service(my_service());
}

Key insight: ConcurrencyLimit and RateLimit address fundamentally different problems: concurrency limiting protects resources by bounding how many operations can be in progress simultaneously, while rate limiting protects against abuse and overload by bounding how many requests are accepted per time window. ConcurrencyLimit is request-duration-aware: if each request takes 100ms, a limit of 10 allows 100 requests/second, but if each takes 10 seconds, the same limit allows only 1 request/second. RateLimit is duration-agnostic: whether requests take 1ms or 10 seconds, only N requests are accepted per time window. The choice depends on what you're protecting: use ConcurrencyLimit when you have limited resources (database connections, memory, CPU cores) and need to prevent exhaustion; use RateLimit when you need to enforce quotas, comply with API limits, or prevent abuse regardless of resource usage. In production, combining both provides defense-in-depth: RateLimit at the outer layer shapes incoming traffic to acceptable rates, while ConcurrencyLimit at the inner layer ensures resource bounds are respected even when requests take longer than expected.
