What is the purpose of tower::load_shed::LoadShedLayer for handling backpressure in overloaded services?

LoadShedLayer wraps a service so that requests are rejected instead of queued whenever the inner service is not ready to accept more work. In plain Tower usage, a caller waits for the service's poll_ready to resolve before calling it; LoadShed instead reports itself ready immediately and, if the inner service was still pending, fails the next call with an Overloaded error. This implements load shedding as a backpressure mechanism: rather than letting requests pile up behind a busy service and eventually time out, the layer fails fast, protecting both the service and its clients from resource exhaustion and cascade failures.

The Problem: Cascade Failures

// Without load shedding, overloaded services cause cascade failures
 
async fn without_load_shedding() {
    // Scenario: Service receives 1000 requests, can only handle 100
    
    // 1. All 1000 requests are accepted
    // 2. Queue builds up
    // 3. Latency increases for all requests
    // 4. Clients timeout waiting
    // 5. Clients retry, adding MORE requests
    // 6. Service becomes completely unresponsive
    // 7. Downstream services also fail
    // 8. System-wide cascade failure
    
    // This is the "death spiral" pattern:
    // - More load -> more queuing -> higher latency
    // - Higher latency -> more timeouts -> more retries
    // - More retries -> even more load
    // - Cycle continues until complete failure
}

Without load shedding, overloaded services create a feedback loop of increasing load and failures.
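The arithmetic of the death spiral can be made concrete with a back-of-the-envelope sketch (illustrative numbers only, plain Rust with no Tower dependency):

```rust
// Illustrative arithmetic only: a service that drains 100 req/s while
// 1000 req/s arrive builds backlog at 900 req/s, so the waiting time
// for a newly arriving request grows without bound.
fn backlog_after(seconds: u64, arrival_rate: u64, service_rate: u64) -> u64 {
    seconds * arrival_rate.saturating_sub(service_rate)
}

fn main() {
    // After 10 seconds the queue holds 9000 requests...
    let backlog = backlog_after(10, 1000, 100);
    // ...and the newest request waits backlog / service_rate = 90 s,
    // far past any client timeout, so its work is wasted anyway.
    let wait_secs = backlog / 100;
    assert_eq!(backlog, 9000);
    assert_eq!(wait_secs, 90);
    println!("backlog={backlog} wait={wait_secs}s");
}
```

Every one of those 9000 queued requests still consumes memory and will still be processed, even though its client gave up long ago.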

Backpressure and Load Shedding

// Backpressure: Mechanism to slow down or stop incoming work
// Load Shedding: Actively rejecting work when overloaded
 
// Two approaches to backpressure:
// 1. Cooperative: Tell clients to slow down (e.g., HTTP 429)
// 2. Aggressive: Refuse to accept work (load shedding)
 
// Load shedding is a form of backpressure that:
// - Fails fast instead of queuing
// - Protects the service from overload
// - Protects clients from timeout cascades
// - Allows partial service availability
 
// Key insight: Better to fail some requests fast
// than to fail all requests slowly

Load shedding is an aggressive backpressure technique that prioritizes partial availability over complete failure.
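The difference between queueing backpressure and load shedding shows up even in the standard library: a bounded channel's blocking send waits for space (backpressure), while try_send fails fast (shedding). A minimal std-only sketch:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // A bounded queue of 2 slots stands in for service capacity.
    let (tx, _rx) = sync_channel::<u32>(2);

    tx.try_send(1).unwrap(); // accepted
    tx.try_send(2).unwrap(); // accepted

    // The queue is now full. A blocking `send` here would be
    // backpressure (wait for space); `try_send` is load shedding
    // (fail fast, handing the rejected work back to the caller).
    match tx.try_send(3) {
        Err(TrySendError::Full(rejected)) => println!("shed request {rejected}"),
        _ => unreachable!("queue capacity is 2"),
    }
}
```

Note that try_send returns the rejected value to the caller, who can then retry, fall back, or report an error upstream.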

The LoadShedLayer Design

use tower::load_shed::LoadShedLayer;
use tower::limit::ConcurrencyLimitLayer;
use tower::{Service, ServiceBuilder};
 
async fn load_shed_layer_basics() {
    // LoadShedLayer::new() takes no configuration: the signal it acts
    // on is the inner service's readiness, not a latency threshold.
    // When the service below it is not ready (poll_ready is Pending),
    // the wrapper rejects new requests instead of waiting.
    
    // Typical usage:
    let service = ServiceBuilder::new()
        // Shed load whenever the layers below are not ready
        .layer(LoadShedLayer::new())
        // e.g. a concurrency limit whose exhaustion triggers shedding
        .layer(ConcurrencyLimitLayer::new(10))
        // The underlying service
        .service(my_service());
    
    // The layer:
    // 1. Polls the inner service for readiness
    // 2. Always reports itself as ready to callers
    // 3. Forwards the call if the inner service was ready
    // 4. Returns an Overloaded error if it was not
}

LoadShedLayer adds automatic load shedding based on the inner service's readiness.

How LoadShed Works

use tower::load_shed::LoadShedLayer;
use tower::{Service, ServiceExt};
 
async fn how_it_works() {
    // LoadShed changes the meaning of readiness for its callers
    
    // poll_ready:
    // 1. Polls the inner service
    // 2. Remembers whether it was Ready or Pending
    // 3. Always returns Ready itself, so callers never wait here
    
    // call:
    // 1. If the inner service was ready: forward the request
    // 2. If it was Pending: return the Overloaded error immediately,
    //    consuming no queue space and no extra resources
    
    // The "loaded" signal therefore comes from whatever sits beneath:
    // a concurrency limit out of permits, a full buffer, a rate limit
    // mid-interval, or any service whose poll_ready returns Pending.
    // Once the inner service is ready again, requests flow as normal.
}

The layer converts the inner service's Pending readiness into immediate rejection, and resumes forwarding once the service is ready again.
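One way to picture the fail-fast half of this behavior is a tiny state machine: calls either pass through or fail immediately depending on a loaded flag. A deliberately simplified, std-only sketch (not tower's actual implementation):

```rust
// Stand-in error type; tower's real one is load_shed::error::Overloaded.
#[derive(Debug, PartialEq)]
enum ShedError {
    Overloaded,
}

struct MiniLoadShed {
    loaded: bool,
}

impl MiniLoadShed {
    // How `loaded` gets set is the interesting part in a real system;
    // here it is toggled by hand for illustration.
    fn set_loaded(&mut self, loaded: bool) {
        self.loaded = loaded;
    }

    // Either forward the request or fail it right away: the caller is
    // never made to wait.
    fn call(&mut self, req: &str) -> Result<String, ShedError> {
        if self.loaded {
            return Err(ShedError::Overloaded); // shed: fail fast
        }
        Ok(format!("handled: {req}"))
    }
}

fn main() {
    let mut svc = MiniLoadShed { loaded: false };
    assert_eq!(svc.call("a"), Ok("handled: a".to_string()));
    svc.set_loaded(true); // the service is now overloaded
    assert_eq!(svc.call("b"), Err(ShedError::Overloaded));
    println!("ok");
}
```

The key property: rejecting a call costs almost nothing, so the wrapper stays responsive no matter how overloaded the work underneath is.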

Basic Example with Service

use tower::{Service, ServiceExt};
use tower::load_shed::LoadShedLayer;
use tower::Layer;
use std::task::{Context, Poll};
use futures::future::{ready, Ready};
 
// A service that can report itself as busy (not ready)
struct BusyService {
    at_capacity: bool,
}
 
impl Service<String> for BusyService {
    type Response = String;
    type Error = String;
    type Future = Ready<Result<String, String>>;
 
    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        if self.at_capacity {
            // Signals overload to LoadShed (a real service would also
            // register the waker before returning Pending)
            Poll::Pending
        } else {
            Poll::Ready(Ok(()))
        }
    }
 
    fn call(&mut self, req: String) -> Self::Future {
        ready(Ok(format!("Processed: {}", req)))
    }
}
 
async fn basic_example() {
    let mut service = LoadShedLayer::new().layer(BusyService { at_capacity: false });
    
    // When the inner service is ready, requests succeed
    let response = service.ready().await.unwrap().call("test".to_string()).await;
    
    // When the inner service reports Pending from poll_ready,
    // LoadShed still reports itself ready, but call() fails with an
    // Overloaded error instead of making the caller wait
}

The layer wraps any service and rejects requests whenever the inner service is not ready.

The Overloaded Error

use tower::load_shed::error::Overloaded;
use tower::BoxError;
 
async fn overloaded_error() {
    // When the inner service is not ready, calls fail with the
    // Overloaded error from tower::load_shed::error
    
    // Because LoadShed must surface either the inner service's error
    // or its own, its error type is BoxError; detect shedding with
    // err.is::<Overloaded>() or err.downcast_ref::<Overloaded>()
    
    // The error indicates:
    // - Service is intentionally rejecting requests
    // - The inner service was at capacity
    // - Client should handle this gracefully
    
    // Typical client handling:
    // 1. Return HTTP 503 Service Unavailable
    // 2. Client may retry with exponential backoff
    // 3. Circuit breaker may activate
    
    // This is different from a timeout:
    // - Timeout: waited too long, gave up
    // - LoadShed: didn't wait, rejected immediately
    // - LoadShed is faster and saves resources
}

The Overloaded error indicates intentional rejection, not a service failure.
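Since the wrapped service surfaces errors as a boxed trait object, callers distinguish intentional shedding by downcasting. A std-only sketch of the pattern, using a stand-in Overloaded type (the real one lives in tower::load_shed::error):

```rust
use std::error::Error;
use std::fmt;

// Stand-in for tower::load_shed::error::Overloaded so the downcast
// pattern can be shown with the standard library alone.
#[derive(Debug)]
struct Overloaded;

impl fmt::Display for Overloaded {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "service overloaded")
    }
}
impl Error for Overloaded {}

// Same shape as tower::BoxError.
type BoxError = Box<dyn Error + Send + Sync>;

fn status_for(err: &BoxError) -> u16 {
    if err.is::<Overloaded>() {
        503 // intentional rejection: tell the client to retry later
    } else {
        500 // anything else is an unexpected failure
    }
}

fn main() {
    let shed: BoxError = Box::new(Overloaded);
    assert_eq!(status_for(&shed), 503);
    let other: BoxError = "db connection reset".into();
    assert_eq!(status_for(&other), 500);
    println!("ok");
}
```

The same downcast works for other layers' errors (e.g. a timeout's Elapsed), so one error-mapping function can classify the whole stack.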

Integration with Tower Middleware

use tower::ServiceBuilder;
use tower::load_shed::LoadShedLayer;
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use tower::timeout::TimeoutLayer;
use std::time::Duration;
 
async fn middleware_stack() {
    // LoadShedLayer integrates with other Tower middleware
    
    let service = ServiceBuilder::new()
        // Load shedding: reject when the layers below are not ready
        .layer(LoadShedLayer::new())
        
        // Rate limiting: max 100 requests per second
        .layer(RateLimitLayer::new(100, Duration::from_secs(1)))
        
        // Concurrency limit: max 10 concurrent requests
        .layer(ConcurrencyLimitLayer::new(10))
        
        // Timeout: fail if processing takes > 5 seconds
        .layer(TimeoutLayer::new(Duration::from_secs(5)))
        
        // The actual service
        .service(my_service());
    
    // Order matters: LoadShed is outermost, so when the rate or
    // concurrency limit beneath it is exhausted (poll_ready Pending),
    // new requests are rejected immediately instead of queueing:
    // 1. LoadShed (sheds when anything below is at capacity)
    // 2. Rate limit (coarse throughput cap)
    // 3. Concurrency limit (resource protection)
    // 4. Timeout (final safety net)
    // 5. Service handles the request
}

LoadShedLayer works well with other middleware in a layered defense strategy.

Layering Order Considerations

use tower::ServiceBuilder;
use tower::load_shed::LoadShedLayer;
use tower::limit::ConcurrencyLimitLayer;
 
async fn layering_order() {
    // Order of layers affects behavior significantly
    
    // Option 1: LoadShed outside ConcurrencyLimit
    // When all 10 permits are taken, the limit's poll_ready is
    // Pending, so LoadShed rejects new requests immediately
    let service = ServiceBuilder::new()
        .layer(LoadShedLayer::new())
        .layer(ConcurrencyLimitLayer::new(10))
        .service(my_service());
    
    // Option 2: ConcurrencyLimit outside LoadShed
    // Callers wait (queue) for a permit FIRST; shedding only happens
    // if the service under LoadShed itself reports it is not ready
    let service = ServiceBuilder::new()
        .layer(ConcurrencyLimitLayer::new(10))
        .layer(LoadShedLayer::new())
        .service(my_service());
    
    // Recommended order depends on your priorities:
    // - Fail fast at capacity: LoadShed outside ConcurrencyLimit
    // - Fair queueing for a permit: ConcurrencyLimit outside LoadShed
}

Layer order determines when load shedding is evaluated relative to other limits.
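The "shed at the front vs. queue for a slot" distinction can be mimicked with a plain permit counter: try_acquire models shedding in front of the concurrency limit, while a blocking acquire would model queueing. A std-only sketch (Permits is a made-up helper, not a tower type):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Shedding in front of a concurrency limit behaves like try_acquire
// on a permit counter: if no permit is free, fail now, don't queue.
struct Permits(AtomicUsize);

impl Permits {
    fn try_acquire(&self) -> bool {
        let mut free = self.0.load(Ordering::Acquire);
        while free > 0 {
            match self
                .0
                .compare_exchange(free, free - 1, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => return true,       // took a permit
                Err(actual) => free = actual, // raced; retry with fresh count
            }
        }
        false // no permits free: shed
    }

    fn release(&self) {
        self.0.fetch_add(1, Ordering::Release);
    }
}

fn main() {
    let permits = Permits(AtomicUsize::new(2));
    assert!(permits.try_acquire());  // request 1 admitted
    assert!(permits.try_acquire());  // request 2 admitted
    assert!(!permits.try_acquire()); // request 3 shed immediately
    permits.release();               // request 1 finishes
    assert!(permits.try_acquire());  // request 4 admitted
    println!("ok");
}
```

Option 2 in the section above corresponds to waiting in line for a permit instead of the try_acquire fast-fail.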

Controlling When Shedding Occurs

use tower::ServiceBuilder;
use tower::load_shed::LoadShedLayer;
use tower::limit::ConcurrencyLimitLayer;
use tower::buffer::BufferLayer;
 
async fn shedding_configuration() {
    // LoadShedLayer itself has no tuning parameters: it sheds exactly
    // when the service beneath it is not ready. You control shedding
    // by choosing what sits underneath and how it is sized:
    // 1. Concurrency limit: sheds once N requests are in flight
    // 2. Buffer: sheds once the queue of buffered requests is full
    // 3. Rate limit: sheds while the per-interval quota is spent
    
    // Aggressive: a small limit sheds early and protects latency,
    // but may reject requests the service could have absorbed
    let aggressive = ServiceBuilder::new()
        .layer(LoadShedLayer::new())
        .layer(ConcurrencyLimitLayer::new(10))
        .service(my_service());
    
    // Conservative: a larger limit plus a buffer absorbs bursts before
    // shedding begins, at the cost of higher queueing latency
    let conservative = ServiceBuilder::new()
        .layer(LoadShedLayer::new())
        .layer(BufferLayer::new(256))
        .layer(ConcurrencyLimitLayer::new(100))
        .service(my_service());
    
    // Size the limits from your service's measured capacity and your
    // clients' timeouts: shedding should begin before queueing delay
    // would push responses past the client deadline
}

The capacity limits beneath LoadShed determine when shedding begins; size them so rejection starts before clients would time out.
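Whatever limit ultimately triggers shedding, its capacity number has to come from somewhere; Little's law (L = λ × W) gives a rough starting point. A sketch with illustrative figures:

```rust
// Little's law: average requests in flight (L) equals arrival rate (λ)
// times average latency (W). Use it to size the limit that triggers
// shedding. All numbers here are illustrative.
fn concurrency_limit(target_rps: f64, avg_latency_secs: f64) -> usize {
    (target_rps * avg_latency_secs).ceil() as usize
}

fn main() {
    // 400 req/s at 250 ms average keeps ~100 requests in flight, so a
    // concurrency limit near 100 admits healthy traffic and starts
    // shedding only once the service genuinely falls behind.
    assert_eq!(concurrency_limit(400.0, 0.25), 100);
    println!("ok");
}
```

Treat the result as a first guess to refine with load testing, not a guarantee.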

Handling LoadShed Errors

use tower::load_shed::error::Overloaded;
use tower::BoxError;
use http::{Response, StatusCode};
 
async fn handle_loadshed_error() {
    // A LoadShed service's error type is BoxError, carrying either the
    // inner service's error or load_shed::error::Overloaded
    
    // In a web service, translate Overloaded to HTTP 503:
    fn translate_error(err: BoxError) -> Response<()> {
        if err.is::<Overloaded>() {
            // The service is intentionally rejecting requests
            return Response::builder()
                .status(StatusCode::SERVICE_UNAVAILABLE)
                .header("Retry-After", "5")  // Suggest retry delay
                .body(())
                .unwrap();
        }
        // Any other error is an unexpected failure
        Response::builder()
            .status(StatusCode::INTERNAL_SERVER_ERROR)
            .body(())
            .unwrap()
    }
    
    // Clients can then:
    // 1. Wait and retry
    // 2. Use fallback/cached data
    // 3. Return error to their caller
    
    // With circuit breaker:
    // Repeated Overloaded errors can open the circuit,
    // which stops sending requests to this service instance
}

LoadShed errors should be converted to appropriate client-facing responses.

LoadShed vs Timeout

use tower::load_shed::LoadShedLayer;
use tower::timeout::TimeoutLayer;
use tower::ServiceBuilder;
use std::time::Duration;
 
async fn loadshed_vs_timeout() {
    // Both protect against slow requests, but differently
    
    // Timeout:
    // - Waits until the timeout duration elapses
    // - Then returns an error
    // - Request still consumes resources while waiting
    // - Good for handling hung requests
    
    // LoadShed:
    // - Rejects immediately if the service is not ready
    // - No waiting time
    // - Prevents new requests from starting
    // - Good for handling overload
    
    // Use BOTH together:
    let service = ServiceBuilder::new()
        .layer(LoadShedLayer::new())
        .layer(TimeoutLayer::new(Duration::from_secs(5)))
        .service(my_service());
    
    // This provides:
    // - LoadShed: reject while at capacity (fast fail)
    // - Timeout: fail if an accepted request takes too long
    
    // Key difference:
    // - Timeout is per-request: "This request is too slow"
    // - LoadShed is service-wide: "Service is overloaded, try later"
}

Timeout handles slow individual requests; LoadShed handles service-wide overload.

LoadShed vs ConcurrencyLimit

use tower::load_shed::LoadShedLayer;
use tower::limit::ConcurrencyLimitLayer;
use tower::ServiceBuilder;
 
async fn loadshed_vs_concurrency() {
    // The two solve different halves of the same problem
    
    // ConcurrencyLimit:
    // - Caps the number of concurrent requests
    // - Makes excess callers WAIT for a permit
    // - A static limit, expressed as a count
    
    // LoadShed:
    // - Decides what happens when capacity is exhausted
    // - REJECTS instead of waiting
    // - Reacts to the inner service's readiness, whatever its cause
    
    // Use BOTH together:
    let service = ServiceBuilder::new()
        .layer(LoadShedLayer::new())             // Reject when full
        .layer(ConcurrencyLimitLayer::new(100))  // Max 100 concurrent
        .service(my_service());
    
    // This provides:
    // - ConcurrencyLimit: prevents resource exhaustion
    // - LoadShed: turns "wait for a permit" into "fail fast"
    
    // ConcurrencyLimit defines capacity (how much work is allowed)
    // LoadShed defines the overflow policy (what happens beyond it)
}

ConcurrencyLimit defines how much work is allowed; LoadShed decides that overflow is rejected rather than queued.

Real-World Integration Example

use tower::ServiceBuilder;
use tower::load_shed::LoadShedLayer;
use tower::load_shed::error::Overloaded;
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use tower::timeout::TimeoutLayer;
use tower::BoxError;
use std::time::Duration;
use http::{Response, StatusCode};
 
async fn production_stack() {
    // Production-ready service with layered protection
    
    let service = ServiceBuilder::new()
        // Layer 1: Load shedding (reject when the limits below are full)
        .layer(LoadShedLayer::new())
        
        // Layer 2: Rate limiting (prevent abuse)
        .layer(RateLimitLayer::new(1000, Duration::from_secs(1)))
        
        // Layer 3: Concurrency limit (resource protection)
        .layer(ConcurrencyLimitLayer::new(100))
        
        // Layer 4: Timeout (final safety net)
        .layer(TimeoutLayer::new(Duration::from_secs(10)))
        
        .service(my_service());
    
    // Error handling:
    // RateLimit / ConcurrencyLimit exhaustion surfaces as shedding
    // LoadShed: returns load_shed::error::Overloaded
    // Timeout: returns timeout::error::Elapsed
    
    // Each error should be mapped appropriately:
    fn map_error(e: BoxError) -> Response<String> {
        if e.is::<Overloaded>() {
            return Response::builder()
                .status(StatusCode::SERVICE_UNAVAILABLE)
                .body("Service temporarily unavailable".to_string())
                .unwrap();
        }
        Response::builder()
            .status(StatusCode::INTERNAL_SERVER_ERROR)
            .body("Internal error".to_string())
            .unwrap()
    }
}

A production service uses multiple layers for comprehensive protection.

Monitoring LoadShed State

use tower::load_shed::error::Overloaded;
use tower::BoxError;
 
async fn monitoring() {
    // LoadShed itself exposes no metrics; observe it from the outside
    // by classifying the errors it returns
    
    // Useful signals to export (the names here are illustrative):
    // - requests_shed: counter of calls that failed with Overloaded
    // - requests_total: all calls, to compute the shed ratio
    
    // Count shed requests in your error-handling path:
    fn record(err: &BoxError) {
        if err.is::<Overloaded>() {
            // e.g. increment a "service.requests_shed" counter in
            // whatever metrics crate you use
        }
    }
    
    // Monitoring helps tune the capacity limits beneath LoadShed:
    // - Shedding constantly: limits too low or service under-provisioned
    // - Never shedding but clients time out: limits too high
}

Monitoring the LoadShed state helps tune thresholds and understand service behavior.
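A minimal way to get that shed counter is to bump an atomic wherever the overload error is handled; the exported metric name below is illustrative, not something tower defines:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Global counter of shed requests; in practice you would feed this to
// your metrics pipeline instead of printing it.
static REQUESTS_SHED: AtomicU64 = AtomicU64::new(0);

// Call this from the error-handling path whenever a request fails
// with the Overloaded error.
fn record_shed() {
    REQUESTS_SHED.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    // Pretend three calls came back with the Overloaded error.
    for _ in 0..3 {
        record_shed();
    }
    let shed = REQUESTS_SHED.load(Ordering::Relaxed);
    println!("tower_load_shed_requests_shed {shed}");
}
```

Exporting the counter alongside total request count gives the shed ratio, which is usually the number worth alerting on.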

Trade-offs of Load Shedding

use tower::load_shed::LoadShedLayer;
 
async fn tradeoffs() {
    // Advantages of LoadShed:
    // 1. Fast failure: clients get an error immediately
    // 2. Resource protection: prevents resource exhaustion
    // 3. Cascade prevention: stops overload from spreading
    // 4. Availability: the service stays responsive (by rejecting)
    
    // Disadvantages:
    // 1. Rejected requests: some valid requests fail
    // 2. Requires tuning: the capacity limits beneath it need calibration
    // 3. May overreact: a brief burst can trigger shedding
    // 4. Client handling: clients must handle rejection gracefully
    
    // When to use:
    // - Services with variable load
    // - Services that must stay responsive
    // - Distributed systems needing cascade protection
    // - APIs where clients can retry or use fallbacks
    
    // When NOT to use:
    // - Services where every request must succeed
    // - Systems without retry/fallback mechanisms
    // - When a simple rate limit alone is sufficient
    
    // Tuning the limits beneath LoadShed:
    // - Too low: unnecessary rejections, underutilized capacity
    // - Too high: overload builds before shedding kicks in
    // - Right: sheds when truly overloaded, accepts when healthy
}

LoadShed provides significant benefits but requires careful configuration and client cooperation.

Comparison Table

async fn comparison_table() {
    // | Mechanism | Based On | Action | Queue | Use Case |
    // |-----------|----------|--------|-------|----------|
    // | RateLimit | Request rate | Delay excess | No | Prevent abuse |
    // | ConcurrencyLimit | Concurrent count | Queue excess | Yes | Resource protection |
    // | LoadShed | Inner readiness | Reject excess | No | Overload protection |
    // | Timeout | Request duration | Fail request | N/A | Hung requests |
    
    // | Layer | Latency | Resource Use | Client Impact |
    // |-------|---------|--------------|---------------|
    // | RateLimit | Variable (wait for quota) | Minimal | Delayed |
    // | ConcurrencyLimit | Variable (queue wait) | Moderate | Delayed |
    // | LoadShed | Lowest (immediate reject) | Minimal | 503 error |
    // | Timeout | High (waits full duration) | High | Timeout error |
}

Different layers serve different purposes; LoadShed specializes in latency-based overload protection.

Synthesis

Quick reference:

use tower::load_shed::LoadShedLayer;
use tower::limit::{ConcurrencyLimitLayer, RateLimitLayer};
use tower::timeout::TimeoutLayer;
use tower::ServiceBuilder;
use std::time::Duration;
 
// Basic usage
let service = ServiceBuilder::new()
    .layer(LoadShedLayer::new())
    .service(my_service());
 
// In a middleware stack (LoadShed outermost, so it sheds when the
// limits below are at capacity)
let service = ServiceBuilder::new()
    .layer(LoadShedLayer::new())
    .layer(RateLimitLayer::new(100, Duration::from_secs(1)))
    .layer(ConcurrencyLimitLayer::new(50))
    .layer(TimeoutLayer::new(Duration::from_secs(5)))
    .service(my_service());
 
// The layer:
// - Polls the inner service's readiness
// - Rejects requests while the inner service is not ready
// - Returns the load_shed::error::Overloaded error
// - Resumes forwarding once the inner service is ready again

Key insight: LoadShedLayer implements the load shedding pattern as a Tower middleware, turning backpressure (the inner service's Pending readiness) into immediate rejection. Unlike rate limiting (which paces request throughput) or a concurrency limit (which makes excess callers wait), load shedding answers the question of what to do when capacity is exhausted: fail the request now rather than queue it. This prevents cascade failures by failing fast, protecting both the service (from resource exhaustion) and clients (from timeout cascades). LoadShedLayer itself has no knobs; tune the capacity limits beneath it so that shedding begins before queueing delay would push responses past client deadlines. Combine it with rate limits, concurrency limits, and timeouts for comprehensive protection, and convert the Overloaded error to an appropriate response (typically HTTP 503) with retry guidance for clients.