Spring Boot Resilience4j Retry & CircuitBreaker Tutorial

In a microservices world, failures are normal: network issues, slow downstream services, temporary outages, rate limits… If your Spring Boot services call other APIs or databases, you must design for failure.

In this in-depth guide, we’ll build production-ready fault tolerance using Resilience4j with Spring Boot 3.x – focusing on Retry and CircuitBreaker.

What you’ll learn:
  • Resilience4j basics & why it replaced Hystrix
  • Spring Boot 3.x setup with Resilience4j (Maven + YAML config)
  • Implementing Retry and CircuitBreaker for remote REST calls
  • Fallback methods & combining patterns (Retry + CircuitBreaker + TimeLimiter)
  • Metrics, monitoring (Actuator), and how to test failures
  • Production best practices & common pitfalls to avoid

1. What is Resilience4j and Why Use It?

Resilience4j is a lightweight, modular fault-tolerance library inspired by Netflix Hystrix, but designed for Java 8+ and functional programming.

Feature        | What it gives you
Retry          | Automatically retry failed operations with backoff
CircuitBreaker | Stop hitting a failing service and fail fast
TimeLimiter    | Fail calls that exceed a time limit
RateLimiter    | Throttle calls to external services
Bulkhead       | Isolate failures and limit concurrent calls

Resilience4j integrates very nicely with Spring Boot 3.x via auto-configuration and simple annotations.

2. Example Scenario – Calling a Remote Pricing Service

Throughout this tutorial, we’ll use a realistic example:

  • inventory-service (your main Spring Boot service)
  • calls a remote pricing-service over HTTP
  • pricing-service sometimes fails or is slow

We’ll protect the remote call with:

  • @Retry – auto-retry a few times if it fails
  • @CircuitBreaker – open the circuit if the failure rate is high
  • Fallback – return cached / default price when downstream is down

3. Project Setup – Spring Boot 3.x + Resilience4j

3.1. Maven Dependencies

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Resilience4j Spring Boot 3 integration
         (not managed by the Spring Boot BOM – declare a version or import the resilience4j-bom) -->
    <dependency>
        <groupId>io.github.resilience4j</groupId>
        <artifactId>resilience4j-spring-boot3</artifactId>
    </dependency>

    <!-- Required so the @Retry / @CircuitBreaker annotations are applied via AOP -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-aop</artifactId>
    </dependency>

    <!-- Optional: Actuator for health & metrics -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <!-- Optional: Micrometer Prometheus registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>
If you also use Spring Cloud, you can integrate Resilience4j via Spring Cloud CircuitBreaker, but here we use it directly for more control.

4. Basic Resilience4j Configuration (application.yml)

4.1. Enabling Actuator Endpoints (recommended)

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always

4.2. Retry & CircuitBreaker Base Configuration

Create or update application.yml:

resilience4j:
  retry:
    configs:
      default:
        max-attempts: 3             # 1 initial call + 2 retries
        wait-duration: 500ms        # wait between attempts
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        retry-exceptions:
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:
          - com.example.demo.exception.BusinessException
    instances:
      priceService:
        base-config: default

  circuitbreaker:
    configs:
      default:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 20            # number of calls to measure
        minimum-number-of-calls: 10
        failure-rate-threshold: 50         # percentage
        wait-duration-in-open-state: 10s   # how long circuit stays OPEN
        permitted-number-of-calls-in-half-open-state: 3
        automatic-transition-from-open-to-half-open-enabled: true
        record-exceptions:
          - java.io.IOException
          - org.springframework.web.client.HttpServerErrorException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:
          - com.example.demo.exception.BusinessException
    instances:
      priceService:
        base-config: default

4.3. TimeLimiter (Optional but Important for Slow APIs)

resilience4j:
  timelimiter:
    configs:
      default:
        timeout-duration: 2s
        cancel-running-future: true
    instances:
      priceService:
        base-config: default

5. 🔥 Deep Dive: Resilience4j Configuration Properties Explained

Understanding why each configuration exists helps you tune correctness and performance for production workloads. Below is a full breakdown of the most important properties we used.

5.1. 🔁 Retry Configuration — Meaning of Each Property

Property                       | What It Controls                                   | Recommended Range
max-attempts                   | Total attempts allowed (1 initial call + retries)  | 3–5
wait-duration                  | Delay between retry attempts                       | 200–800ms
enable-exponential-backoff     | Whether to increase wait time after each retry     | true in most cases
exponential-backoff-multiplier | How much the delay grows each retry                | 1.5–2
retry-exceptions               | Only these exception types trigger a retry         | Network / I/O errors
ignore-exceptions              | Treated as business errors → fail immediately      | Validation / business exceptions

Rule of thumb: Retries help with transient failures (network glitches, timeouts). Don’t retry validation or permanent business errors: they will never succeed on retry.
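If you ever need the same policy in plain Java instead of YAML, these properties map one-to-one onto the RetryConfig builder. A minimal sketch, assuming the BusinessException type from the YAML above and the priceClient we build in section 6 (IntervalFunction comes from resilience4j-core):

RetryConfig retryConfig = RetryConfig.custom()
        .maxAttempts(3)                                    // 1 initial call + 2 retries
        .intervalFunction(IntervalFunction.ofExponentialBackoff(
                Duration.ofMillis(500), 2.0))              // 500ms, 1000ms, ...
        .retryExceptions(IOException.class, ResourceAccessException.class)
        .ignoreExceptions(BusinessException.class)         // business errors: fail immediately
        .build();

Retry retry = Retry.of("priceService", retryConfig);

// Functional style: decorate any Supplier with the retry behaviour
Supplier<PriceResponse> decoratedCall =
        Retry.decorateSupplier(retry, () -> priceClient.getPrice(42L));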

5.2. 🛡 CircuitBreaker Configuration — Meaning of Each Property

Property                                            | What It Controls                                                   | Why It Matters
sliding-window-type                                 | Whether to measure calls by count or time                          | COUNT_BASED is easier for REST APIs; TIME_BASED for streaming
sliding-window-size                                 | Number of recent calls to consider                                 | Larger window = more stable decision, slower to react
minimum-number-of-calls                             | Min calls before evaluating the failure rate                       | Prevents the circuit opening on small sample sizes
failure-rate-threshold                              | % of failed calls required to open the circuit                     | 50% is a common starting point
wait-duration-in-open-state                         | How long the circuit stays OPEN before trying again                | Enough time for the dependency to recover (e.g. 10–30s)
permitted-number-of-calls-in-half-open-state        | Number of trial calls in HALF_OPEN                                 | 3–10 is typical; too high can cause spikes
automatic-transition-from-open-to-half-open-enabled | Automatically move from OPEN to HALF_OPEN after the wait-duration  | true is usually what you want
record-exceptions                                   | Which exceptions count as failures for the failure rate            | Real downstream failures (5xx, timeouts, I/O)
ignore-exceptions                                   | Exceptions that should not count as circuit failures               | Business rules, validation errors, etc.

Key insight: the CircuitBreaker protects your system by stopping calls to a failing dependency. The window size and threshold decide when the breaker trips; if the window is too small, the circuit flaps (opens and closes frequently).
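The same properties are available on the CircuitBreakerConfig builder if you prefer to configure the breaker in code. A minimal sketch mirroring the YAML above (BusinessException is again the assumed business exception type):

CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(20)
        .minimumNumberOfCalls(10)
        .failureRateThreshold(50)                           // percent
        .waitDurationInOpenState(Duration.ofSeconds(10))
        .permittedNumberOfCallsInHalfOpenState(3)
        .automaticTransitionFromOpenToHalfOpenEnabled(true)
        .recordExceptions(IOException.class, ResourceAccessException.class)
        .ignoreExceptions(BusinessException.class)          // don't count business errors as failures
        .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("priceService", cbConfig);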

5.3. ⏱ TimeLimiter — Timeout Behavior (Optional but Highly Recommended)

Property              | Description                                                | Typical Value
timeout-duration      | Maximum time allowed for a remote call                     | 1–3 seconds for REST APIs
cancel-running-future | Whether to cancel the underlying task when timeout occurs  | true in most cases

TimeLimiter protects threads from being blocked forever. CircuitBreaker protects the dependency and your system from repeated failures. Use both together for slow or flaky remote services.
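For completeness, the programmatic equivalent is equally small (a sketch of the builder API):

TimeLimiterConfig timeLimiterConfig = TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(2))     // fail calls slower than 2s
        .cancelRunningFuture(true)                  // cancel the underlying task on timeout
        .build();

TimeLimiter timeLimiter = TimeLimiter.of("priceService", timeLimiterConfig);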

5.4. 🎯 Suggested Defaults for Real Projects

  • Retries: 3–4 attempts with exponential backoff
  • Timeout (TimeLimiter): 2 seconds
  • CircuitBreaker sliding window: 20 calls
  • Failure threshold: 50%
  • Half-open trial calls: 3–5

These defaults work well for REST APIs with P99 latency under ~500ms. You should always tune them later using real production metrics from Prometheus / Grafana.

🧠 Performance tip: Too many retries = more load on a broken service → cascading failures. Keep retry counts low and use exponential backoff.

6. Implementing a Resilient Client with Retry & CircuitBreaker

6.1. DTO for Remote Response

public record PriceResponse(
        Long productId,
        BigDecimal price,
        String currency
) {}

6.2. RestTemplate Bean

@Configuration
public class HttpClientConfig {

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
            .setConnectTimeout(Duration.ofSeconds(2))
            .setReadTimeout(Duration.ofSeconds(2))
            .build();
    }
}

6.3. PriceClient – Remote Call Wrapped with Resilience4j

@Service
public class PriceClient {

    private final RestTemplate restTemplate;

    @Value("${pricing-service.base-url:http://localhost:8081}")
    private String pricingServiceBaseUrl;

    public PriceClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    @Retry(name = "priceService", fallbackMethod = "getPriceFallback")
    @CircuitBreaker(name = "priceService", fallbackMethod = "getPriceFallback")
    public PriceResponse getPrice(Long productId) {

        String url = pricingServiceBaseUrl + "/api/prices/" + productId;

        ResponseEntity<PriceResponse> response =
                restTemplate.getForEntity(url, PriceResponse.class);

        return response.getBody();
    }

    /**
     * Fallback method must have same parameters plus Throwable as last parameter.
     */
    public PriceResponse getPriceFallback(Long productId, Throwable throwable) {
        // Return cached / default / last-known price
        BigDecimal defaultPrice = BigDecimal.valueOf(0.00);

        // Optionally log the root cause
        System.err.println("Fallback triggered for product " + productId
                + " due to: " + throwable.getClass().getSimpleName()
                + " - " + throwable.getMessage());

        return new PriceResponse(productId, defaultPrice, "USD");
    }
}
The order in which you declare the annotations does not matter: by default the Retry aspect wraps the CircuitBreaker aspect, so each retry attempt passes through (and is recorded by) the circuit breaker. If failures persist and the thresholds are exceeded, the circuit opens and subsequent calls fail fast without hitting the remote service.

7. Exposing a REST Endpoint that Uses the Resilient Client

@RestController
@RequestMapping("/api/products")
public class ProductController {

    private final PriceClient priceClient;

    public ProductController(PriceClient priceClient) {
        this.priceClient = priceClient;
    }

    @GetMapping("/{id}/price")
    public PriceResponse getProductPrice(@PathVariable Long id) {
        return priceClient.getPrice(id);
    }
}

Now, every call to /api/products/{id}/price is protected by Retry and CircuitBreaker.
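A quick manual check (assuming the service runs on the default port 8080 and an arbitrary product id of 42):

GET http://localhost:8080/api/products/42/price

{"productId": 42, "price": 99.99, "currency": "USD"}    ← normal response
{"productId": 42, "price": 0.00, "currency": "USD"}     ← fallback when pricing-service is down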


8. How Retry Works in Resilience4j (Under the Hood)

With our earlier YAML:

  • max-attempts: 3 → 1 initial call + 2 retries.
  • wait-duration: 500ms → wait 500ms between attempts.
  • enable-exponential-backoff: true → the wait grows after each attempt (500ms, then 1000ms, …).
  • retry-exceptions → retry IOException and ResourceAccessException only.
  • ignore-exceptions → do not retry BusinessException (fail fast).

8.1. Visual Timeline

Initial call  ──X (IOException)
                │
 wait 500ms     ▼
Retry #1    ────X (IOException)
                │
 wait 1000ms    ▼
Retry #2    ────X (IOException)
                │
     All attempts failed → Fallback method invoked
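You don’t have to guess whether these retries actually happen: every Retry instance publishes events you can log. A small sketch (assumes an SLF4J logger; the RetryRegistry bean is provided by the Spring Boot starter):

@Component
public class RetryEventLogger {

    private static final Logger log = LoggerFactory.getLogger(RetryEventLogger.class);

    public RetryEventLogger(RetryRegistry retryRegistry) {
        retryRegistry.retry("priceService").getEventPublisher()
                .onRetry(event -> log.warn("Retry attempt #{} for priceService, cause: {}",
                        event.getNumberOfRetryAttempts(), event.getLastThrowable().toString()))
                .onError(event -> log.error("Retries exhausted for priceService, cause: {}",
                        event.getLastThrowable().toString()));
    }
}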

8.2. Custom Retry Instance for a Specific API

You can override configuration per instance:

resilience4j:
  retry:
    instances:
      priceService:
        max-attempts: 5
        wait-duration: 300ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 1.5

Once a call fails with one of the configured exceptions, Resilience4j will retry according to this config before giving up and calling the fallback.


9. How CircuitBreaker Works in Resilience4j

9.1. States: CLOSED → OPEN → HALF_OPEN

  • CLOSED: normal operation. All calls are allowed and counted.
  • OPEN: too many failures → short-circuit; calls fail immediately with CallNotPermittedException.
  • HALF_OPEN: some test calls are allowed; if they succeed → go to CLOSED, else back to OPEN.

9.2. Visual State Diagram (Textual)

[CLOSED]
   │  (failure-rate > threshold within sliding window)
   ▼
[OPEN]
   │  (after wait-duration-in-open-state)
   ▼
[HALF_OPEN]
   │  (if trial calls succeed)
   ├──► [CLOSED]
   │
   └──► (if trial calls fail) back to [OPEN]

From our YAML:

  • sliding-window-size: 20 → consider last 20 calls.
  • failure-rate-threshold: 50 → if 10 out of 20 calls fail → circuit goes OPEN.
  • wait-duration-in-open-state: 10s → stay OPEN for 10 seconds before trying HALF_OPEN.
  • permitted-number-of-calls-in-half-open-state: 3 → only 3 test calls allowed.
💡 Benefit: we stop hammering a broken service and we stop failing slowly; users get a consistent error or a fallback immediately.
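State transitions can also be observed directly, which is handy while testing. A small sketch that logs every transition of the priceService breaker (assumes an SLF4J logger; the CircuitBreakerRegistry bean is provided by the starter):

@Component
public class CircuitBreakerEventLogger {

    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerEventLogger.class);

    public CircuitBreakerEventLogger(CircuitBreakerRegistry registry) {
        registry.circuitBreaker("priceService").getEventPublisher()
                .onStateTransition(event -> log.warn("priceService circuit: {} -> {}",
                        event.getStateTransition().getFromState(),
                        event.getStateTransition().getToState()));
    }
}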

10. Combining Retry + CircuitBreaker + TimeLimiter

Retry and CircuitBreaker address errors. But what about slow responses? That’s where TimeLimiter helps.

10.1. TimeLimiter Configuration Recap

resilience4j:
  timelimiter:
    instances:
      priceService:
        timeout-duration: 2s
        cancel-running-future: true

10.2. Using TimeLimiter with CompletableFuture

TimeLimiter works with async calls. Example with CompletableFuture:

@Service
public class AsyncPriceClient {

    private final RestTemplate restTemplate;
    private final TimeLimiter timeLimiter;

    // Scheduler used by the TimeLimiter to enforce the timeout
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public AsyncPriceClient(RestTemplate restTemplate,
                            TimeLimiterRegistry timeLimiterRegistry) {
        this.restTemplate = restTemplate;
        this.timeLimiter = timeLimiterRegistry.timeLimiter("priceService");
    }

    public CompletableFuture<PriceResponse> getPriceAsync(Long productId) {

        Supplier<CompletableFuture<PriceResponse>> supplier =
            () -> CompletableFuture.supplyAsync(() -> {
                String url = "http://localhost:8081/api/prices/" + productId;
                return restTemplate.getForObject(url, PriceResponse.class);
            });

        // executeCompletionStage needs a ScheduledExecutorService to schedule the timeout
        return timeLimiter.executeCompletionStage(scheduler, supplier)
                .toCompletableFuture();
    }
}
If the remote call doesn’t finish within timeout-duration, a TimeoutException will be thrown. You can wrap this with @Retry and @CircuitBreaker as well for a full protection stack.
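The annotation style works too, as long as the method returns a CompletableFuture. A sketch combining all three annotations on an async variant (the fallback must also return a CompletableFuture; the method and fallback names here are illustrative):

@TimeLimiter(name = "priceService", fallbackMethod = "getPriceAsyncFallback")
@CircuitBreaker(name = "priceService", fallbackMethod = "getPriceAsyncFallback")
@Retry(name = "priceService", fallbackMethod = "getPriceAsyncFallback")
public CompletableFuture<PriceResponse> getPriceAsyncProtected(Long productId) {
    return CompletableFuture.supplyAsync(() ->
            restTemplate.getForObject(
                    "http://localhost:8081/api/prices/" + productId, PriceResponse.class));
}

public CompletableFuture<PriceResponse> getPriceAsyncFallback(Long productId, Throwable t) {
    // Default price when the call times out, fails, or the circuit is open
    return CompletableFuture.completedFuture(
            new PriceResponse(productId, BigDecimal.ZERO, "USD"));
}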

11. Observability: Metrics & Monitoring Resilience4j

With spring-boot-starter-actuator and Resilience4j on the classpath, you automatically get metrics:

  • resilience4j.circuitbreaker.calls
  • resilience4j.retry.calls
  • resilience4j.timelimiter.calls

11.1. Check Metrics via Actuator

GET http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.calls
GET http://localhost:8080/actuator/metrics/resilience4j.retry.calls
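If you also want circuit breaker state to show up under /actuator/health, enable the Resilience4j health indicator. A sketch of the relevant properties, based on the Resilience4j Spring Boot integration:

resilience4j:
  circuitbreaker:
    configs:
      default:
        register-health-indicator: true

management:
  health:
    circuitbreakers:
      enabled: true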

11.2. Prometheus + Grafana (Optional)

If you added micrometer-registry-prometheus, expose metrics:

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus

Then you can scrape /actuator/prometheus from Prometheus and build Grafana dashboards:

  • Circuit state (open / closed)
  • Failure rate over time
  • Retry count, timeouts, fallback usage
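To scrape /actuator/prometheus as mentioned above, a minimal Prometheus job could look like this (a sketch; assumes the service is reachable at localhost:8080):

scrape_configs:
  - job_name: 'inventory-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']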

12. Testing Failure Scenarios (Very Important)

Don’t just rely on happy-path tests. You should simulate failures:

12.1. Simulate Intermittent Failures in Fake Pricing Service

@RestController
@RequestMapping("/api/prices")
public class FakePricingController {

    private final Random random = new Random();

    @GetMapping("/{id}")
    public PriceResponse getPrice(@PathVariable Long id) {

        int value = random.nextInt(10);

        if (value < 3) {
            // 30% of the time: simulate slow response
            try { Thread.sleep(3000); } catch (InterruptedException ignored) {}
        }

        if (value > 7) {
            // 20% of the time: simulate failure
            throw new RuntimeException("Downstream pricing service failed");
        }

        return new PriceResponse(id, BigDecimal.valueOf(99.99), "USD");
    }
}

Now, hit /api/products/{id}/price multiple times and watch:

  • Retries being applied
  • CircuitBreaker opening after too many failures
  • Fallback being used when circuit is open or all retries fail
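To turn one of these checks into an automated test, you can drive the circuit into the OPEN state from the registry instead of waiting for real failures. A minimal sketch, assuming JUnit 5 and the PriceClient from section 6:

@SpringBootTest
class PriceClientFallbackTest {

    @Autowired
    private PriceClient priceClient;

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @Test
    void returnsFallbackPriceWhenCircuitIsOpen() {
        // Force the circuit OPEN so the next call is short-circuited
        circuitBreakerRegistry.circuitBreaker("priceService").transitionToOpenState();

        PriceResponse response = priceClient.getPrice(1L);

        // The fallback kicks in instead of calling pricing-service
        assertEquals(0, BigDecimal.valueOf(0.00).compareTo(response.price()));
        assertEquals("USD", response.currency());
    }
}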

13. Best Practices for Resilience4j in Production

13.1. Use Different Instances per Remote Dependency

Don’t reuse a single priceService config for everything. Instead:

resilience4j:
  circuitbreaker:
    instances:
      priceService: { base-config: default }
      stockService: { base-config: default }
      paymentService: { base-config: default }

This way, one noisy dependency doesn’t affect others.

13.2. Choose the Right Retry Count

  • Too many retries → more pressure on an already slow/failing service.
  • Too few retries → transient network glitches may not be recovered from.
  • Common choice: 3–5 attempts with exponential backoff.

13.3. Always Provide Meaningful Fallbacks

  • Return cached data or last-known-good values when possible.
  • Return a well-formed error response instead of raw exceptions.
  • Log fallback usage with enough context for debugging.

13.4. Don’t Abuse CircuitBreaker

  • Use it only for remote calls (HTTP, DB, external systems).
  • Don’t wrap CPU-heavy local operations with CircuitBreaker.

13.5. Monitor Metrics and Tune Continually

Start with conservative defaults, then adjust:

  • failure-rate-threshold (e.g., 50%)
  • sliding-window-size
  • timeout-duration (TimeLimiter)

Use real production metrics to tune these values over time.


14. Summary & Next Steps

By integrating Resilience4j with Spring Boot 3, you get a powerful, flexible toolkit for building resilient microservices.

  • @Retry handles transient failures.
  • @CircuitBreaker prevents cascading failures from broken dependencies.
  • TimeLimiter ensures slow calls don’t hog resources.
  • Fallback strategies keep the user experience graceful.
  • Metrics + Actuator provide visibility into real-world behavior.

Combine these patterns with good observability and careful tuning, and your Spring Boot services will stay responsive even when dependencies fail.


15. FAQ: Resilience4j Retry & CircuitBreaker in Spring Boot

Q1. What is Resilience4j used for in Spring Boot?

Resilience4j provides fault-tolerance patterns like Retry, CircuitBreaker, TimeLimiter, RateLimiter and Bulkhead. In Spring Boot it’s commonly used to protect HTTP calls to external APIs, databases and other microservices.

Q2. Can I use Retry and CircuitBreaker together?

Yes. A common pattern is to apply @Retry for transient errors and @CircuitBreaker to stop calls when failure rate is high. You can annotate the same method with both and share the same named configuration instance.

Q3. When should I use TimeLimiter?

Use TimeLimiter when you want to fail calls that are taking too long (e.g. slow downstream API). It’s especially useful for async / non-blocking or CompletableFuture-based calls.

Q4. What is a good starting configuration for CircuitBreaker?

For typical REST APIs, a good starting point is: sliding-window-size = 20, failure-rate-threshold = 50, wait-duration-in-open-state = 10s, permitted-number-of-calls-in-half-open-state = 3, then tune using metrics.

Q5. Should I retry all exceptions?

No. Only retry transient errors like I/O issues or timeouts. Never retry business exceptions (validation failures, domain errors) – they will not succeed on retry.