Resilience Frameworks

These frameworks provide resilience patterns for service calls.

A resilience pattern is a design pattern used in software systems (especially distributed systems and microservices) to help applications stay stable, responsive, and fault-tolerant when things go wrong — like network failures, timeouts, overload, or downstream service crashes.

Think of it as defensive strategies your application uses to “bend but not break” under stress.

Why?

  • Networks are unreliable (latency, packet loss).
  • Services fail or slow down unexpectedly.
  • Traffic spikes can overwhelm dependencies.
  • Distributed systems make failures inevitable, not exceptional.

Without resilience patterns, one failing service can cause a cascading failure across the whole system (classic "domino effect").

Core Patterns

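Here are the core patterns. Many are implemented in libraries such as Resilience4j, Netflix Hystrix (now in maintenance mode), and Spring Cloud Circuit Breaker. The parameter names below follow Resilience4j's configuration where a matching setting exists.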

1. Circuit Breaker

  • Detects failures and “opens the circuit” to stop making calls to a failing service.
  • Protects the system from repeated failed attempts.
  • Example: Like a fuse box at home — when overload happens, it cuts the current.

Parameters (configurable):

  • failureRateThreshold → % of failures to trigger breaker (e.g., 50%).
  • slowCallRateThreshold → % of slow calls that triggers the breaker (e.g., 100%).
  • slowCallDurationThreshold → time limit (e.g., 2s) after which a call counts as “slow.”
  • slidingWindowType → COUNT_BASED (last N calls) or TIME_BASED (last N seconds).
  • slidingWindowSize → size of window for failure calculation.
  • minimumNumberOfCalls → minimum calls before breaker can trip.
  • permittedNumberOfCallsInHalfOpenState → trial calls when half-open.
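  • waitDurationInOpenState → how long the breaker stays open before moving to half-open.
  • automaticTransitionFromOpenToHalfOpenEnabled → true/false; if true, the breaker moves to half-open on its own once waitDurationInOpenState elapses, instead of waiting for the next call.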

Metrics: Failure Rate, Slow Call Rate, State, Transition Count.
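
The state machine described above can be sketched in a few lines of plain Python. This is a minimal, single-threaded illustration with a count-based window, not Resilience4j's implementation: the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the injectable `clock` argument are assumptions made for the example, and slow-call tracking is left out.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical sketch: parameter names mirror the list above in snake_case.
    def __init__(self, failure_rate_threshold=50.0, sliding_window_size=10,
                 minimum_number_of_calls=5, wait_duration_in_open_state=30.0,
                 permitted_calls_in_half_open=2, clock=time.monotonic):
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_number_of_calls = minimum_number_of_calls
        self.wait_duration_in_open_state = wait_duration_in_open_state
        self.permitted_calls_in_half_open = permitted_calls_in_half_open
        self.clock = clock
        self.window = deque(maxlen=sliding_window_size)  # True means "failed"
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.half_open_results = []

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.wait_duration_in_open_state:
                raise CircuitOpenError("circuit is open; call rejected")
            # Wait time is over: let a few trial calls through.
            self.state = "HALF_OPEN"
            self.half_open_results = []
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "HALF_OPEN":
            self.half_open_results.append(failed)
            if len(self.half_open_results) >= self.permitted_calls_in_half_open:
                if self._rate(self.half_open_results) >= self.failure_rate_threshold:
                    self._open()
                else:
                    self.state = "CLOSED"
                    self.window.clear()
            return
        self.window.append(failed)
        # The breaker cannot trip until enough calls have been observed.
        if (len(self.window) >= self.minimum_number_of_calls
                and self._rate(self.window) >= self.failure_rate_threshold):
            self._open()

    @staticmethod
    def _rate(results):
        return 100.0 * sum(results) / len(results)

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

A caller would wrap each outbound request as `breaker.call(fetch_order, order_id)` (both names hypothetical) and treat `CircuitOpenError` as a signal to degrade rather than wait.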


2. Retry

  • Automatically retries a failed request with a delay/backoff.
  • Useful for temporary glitches (e.g., a momentary network blip).
  • Needs careful tuning (too many retries can worsen the problem).

Parameters:

  • maxAttempts → maximum number of attempts, including the initial call.
  • waitDuration → delay between retries.
  • retryExceptions → exception types eligible for retry.
  • ignoreExceptions → exceptions not retried.
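  • intervalFunction → function that computes the wait before each retry (constant, exponential backoff, or randomized).
  • enableExponentialBackoff → true/false; grow the wait after each failed attempt.
  • enableRandomizedWait → true/false; add random jitter to the wait so clients do not retry in lockstep.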

Metrics: Retry Attempts, Retry Success/Failure Count, Backoff Duration.
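
As a rough illustration of how these settings interact, here is a small retry helper in plain Python. It is a sketch under stated assumptions rather than any library's API: the function name `retry`, the `backoff_multiplier` and `randomized_wait_factor` arguments, and the injectable `sleep` hook are all made up for the example.

```python
import random
import time


def retry(fn, max_attempts=3, wait_duration=0.5, backoff_multiplier=2.0,
          randomized_wait_factor=0.0, retry_exceptions=(Exception,),
          ignore_exceptions=(), sleep=time.sleep):
    """Call fn() up to max_attempts times (the first call counts as attempt 1)."""
    delay = wait_duration
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ignore_exceptions:
            # Listed as non-retryable: fail immediately.
            raise
        except retry_exceptions:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Optional jitter: +/- randomized_wait_factor of the current delay.
            jitter = delay * randomized_wait_factor
            sleep(delay + random.uniform(-jitter, jitter))
            delay *= backoff_multiplier  # exponential backoff
```

With the defaults above, a call that keeps failing is tried three times with waits of 0.5s and 1.0s in between. Setting `backoff_multiplier=1.0` gives a constant wait.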


3. Timeout

  • Limits the max time a request can take.
  • Prevents “hanging” calls that block resources.

Parameters:

  • timeoutDuration → max allowed time (e.g., 2s).
  • cancelRunningFuture → whether to cancel running task.

Metrics: Timeout Count, Avg Latency, P95/P99 Latency, Aborted Requests.
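
One way to enforce a time limit is to run the call on a worker thread and stop waiting once the limit passes. The sketch below uses Python's standard `concurrent.futures`. The helper name `call_with_timeout` and the demo function `slow_service` are assumptions for illustration. Note that Python cannot interrupt a thread that is already running, so `cancel_running_future` here only prevents a task that has not started yet.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

# Shared worker pool for time-limited calls (size chosen arbitrarily).
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_duration, *args, cancel_running_future=True):
    """Run fn(*args) and raise TimeoutError if it takes longer than timeout_duration seconds."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_duration)
    except FutureTimeoutError:
        if cancel_running_future:
            # Best effort: a task that is already executing keeps running.
            future.cancel()
        raise TimeoutError(f"call exceeded {timeout_duration}s")


def slow_service(delay):
    """Hypothetical stand-in for a remote call that takes `delay` seconds."""
    time.sleep(delay)
    return "done"
```

For example, `call_with_timeout(slow_service, 2.0, 0.1)` returns `"done"`, while a call that sleeps longer than the limit raises `TimeoutError` and frees the caller to move on.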


4. Bulkhead

  • Isolates resources into pools (like compartments in a ship).
  • If one pool is overloaded, others remain unaffected.
  • Example: Keep API calls to external payment service in their own thread pool.

Parameters:

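  • maxConcurrentCalls → max calls allowed at once (semaphore bulkhead).
  • maxWaitDuration → how long a caller waits for a permit when the bulkhead is full.
  • maxThreadPoolSize → max number of threads (thread pool bulkhead).
  • coreThreadPoolSize → number of threads kept in the pool (thread pool bulkhead).
  • queueCapacity → max requests waiting in the queue.
  • keepAliveDuration → how long idle threads above the core size live before being terminated.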

Metrics: Active Calls, Pool Utilization %, Rejected Calls, Queue Wait Time.
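
The semaphore variant can be sketched with Python's standard `threading` module. This is an illustrative example only: the `Bulkhead` class, the `BulkheadFullError` exception, and the `rejected_calls` counter are hypothetical names, and the counter is not updated atomically.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when no permit is available within max_wait_duration."""


class Bulkhead:
    # Hypothetical semaphore bulkhead; one instance per protected dependency.
    def __init__(self, max_concurrent_calls=10, max_wait_duration=0.0):
        self._permits = threading.BoundedSemaphore(max_concurrent_calls)
        self.max_wait_duration = max_wait_duration
        self.rejected_calls = 0  # simple metric, approximate under contention

    def call(self, fn, *args, **kwargs):
        if self.max_wait_duration > 0:
            acquired = self._permits.acquire(timeout=self.max_wait_duration)
        else:
            acquired = self._permits.acquire(blocking=False)  # fail fast
        if not acquired:
            self.rejected_calls += 1
            raise BulkheadFullError("bulkhead is full; call rejected")
        try:
            return fn(*args, **kwargs)
        finally:
            self._permits.release()  # always hand the permit back
```

Giving the payment service its own `Bulkhead` instance means a slow payment provider can exhaust only that instance's permits, and calls to other dependencies keep flowing.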


5. Rate Limiter

  • Restricts the number of requests per unit time.
  • Prevents overload of downstream services.
  • Example: Only allow 100 API calls/sec to external service.

Parameters:

  • limitForPeriod → max calls allowed per refresh period.
  • limitRefreshPeriod → duration of one period (e.g., 1s).
  • timeoutDuration → max wait for permit.

Metrics: Allowed Calls, Denied Calls, Wait Time, Utilization %.
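
A minimal fixed-window limiter shows how limitForPeriod and limitRefreshPeriod work together. The sketch assumes a single thread and rejects immediately when no permit is left, which corresponds to a timeoutDuration of zero. The `RateLimiter` class, `try_acquire` method, and `clock` argument are names invented for the example.

```python
import time


class RateLimiter:
    # Hypothetical fixed-window limiter; not thread-safe as written.
    def __init__(self, limit_for_period=100, limit_refresh_period=1.0,
                 clock=time.monotonic):
        self.limit_for_period = limit_for_period
        self.limit_refresh_period = limit_refresh_period
        self.clock = clock
        self._period_start = clock()
        self._used = 0

    def try_acquire(self):
        """Return True if the call is allowed in the current period."""
        now = self.clock()
        elapsed = now - self._period_start
        if elapsed >= self.limit_refresh_period:
            # Skip ahead by whole periods and refill the permits.
            periods = int(elapsed // self.limit_refresh_period)
            self._period_start += periods * self.limit_refresh_period
            self._used = 0
        if self._used < self.limit_for_period:
            self._used += 1
            return True
        return False
```

A caller checks `limiter.try_acquire()` before each outbound request and either drops, queues, or delays the request when it returns `False`.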


6. Fallback

  • Provides an alternative response when a call fails.
  • Example: If product pricing service is down, return “price unavailable” instead of crashing the app.

Parameters: (Fallback is usually code-defined rather than config-based, so parameters = what you define in logic)

  • fallbackMethod → function to invoke when primary fails.
  • fallbackResponse → static or computed response.
  • onExceptionTypes → exception triggers for fallback.

Metrics: Fallback Count, Fallback Success Rate, Degraded Mode Ratio.
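
Because fallback lives in code, a decorator is one natural way to express it. The sketch below replays the pricing example in plain Python. The names `with_fallback`, `price_unavailable`, and `get_price` are hypothetical, and the outage is simulated by raising `ConnectionError`.

```python
import functools


def with_fallback(fallback, on_exception_types=(Exception,)):
    """Decorator: if the wrapped call raises a listed exception, return fallback(...) instead."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except on_exception_types as exc:
                # Pass the original arguments plus the error to the fallback.
                return fallback(*args, exc=exc, **kwargs)
        return wrapper
    return decorator


def price_unavailable(product_id, exc):
    # Static degraded response; a real fallback might read a cached price instead.
    return {"product_id": product_id, "price": None, "status": "price unavailable"}


@with_fallback(price_unavailable, on_exception_types=(ConnectionError, TimeoutError))
def get_price(product_id):
    raise ConnectionError("pricing service is down")  # simulated outage
```

Exceptions outside `on_exception_types` still propagate, so programming errors are not silently masked as a degraded response.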


7. Cache

  • Stores frequently used results locally to reduce repeated calls.
  • Improves performance and reduces exposure to dependency failures.

Parameters:

  • ttl (time-to-live) → how long before entry expires.
  • maxSize → max entries allowed.
  • evictionPolicy → LRU, LFU, FIFO.
  • refreshAhead → whether to refresh before expiry.
  • keyStrategy → how to generate cache keys.

Metrics: Hit Rate, Miss Rate, Evictions, Cache Size, Entry Staleness.
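
To make ttl, maxSize, and an LRU evictionPolicy concrete, here is a small in-memory cache in plain Python. It is a sketch, not a production cache: the `TTLCache` class, the `get_or_load` method, and the `loader` callback are hypothetical names, and refreshAhead is not modelled.

```python
import time
from collections import OrderedDict


class TTLCache:
    # Hypothetical cache with per-entry expiry and least-recently-used eviction.
    def __init__(self, ttl=60.0, max_size=128, clock=time.monotonic):
        self.ttl = ttl
        self.max_size = max_size
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader(key) on a miss or expired entry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self._entries.move_to_end(key)  # mark as recently used
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = loader(key)
        self._entries[key] = (now + self.ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

The hit rate from the metrics list is then `hits / (hits + misses)`. If `loader` raises, nothing is cached and the error reaches the caller, which is where a fallback could take over.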


Where Resilience Patterns Are Used

  • Microservices (protect against downstream failures)
  • API gateways (limit or shape traffic)
  • GraphQL resolvers / REST endpoints (wrap external service calls)
  • Database access (timeouts, retries, caching)
