Spring Boot Resilience4j Retry & CircuitBreaker Tutorial
In a microservices world, failures are normal: network issues, slow downstream services, temporary outages, rate limits… If your Spring Boot services call other APIs or databases, you must design for failure.
In this in-depth guide, we’ll build production-ready fault tolerance using Resilience4j with Spring Boot 3.x – focusing on Retry and CircuitBreaker.
- Resilience4j basics & why it replaced Hystrix
- Spring Boot 3.x setup with Resilience4j (Maven + YAML config)
- Implementing Retry and CircuitBreaker for remote REST calls
- Fallback methods & combining patterns (Retry + CircuitBreaker + TimeLimiter)
- Metrics, monitoring (Actuator), and how to test failures
- Production best practices & common pitfalls to avoid
1. What is Resilience4j and Why Use It?
Resilience4j is a lightweight, modular fault-tolerance library inspired by Netflix Hystrix, but designed for Java 8+ and functional programming.
| Feature | What it gives you |
|---|---|
| Retry | Automatically retry failed operations with backoff |
| CircuitBreaker | Stop hitting a failing service and fail-fast |
| TimeLimiter | Fail calls that exceed a time limit |
| RateLimiter | Throttle calls to external services |
| Bulkhead | Isolate failures and limit concurrent calls |
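All of these patterns share the same lightweight decorator design. As a taste of the functional (non-Spring) API, here is a minimal sketch that decorates a plain `Supplier` with a Retry — `callRemoteService()` is a placeholder, not a real service:

```java
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class FunctionalStyleDemo {

    public static void main(String[] args) {
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)
                .waitDuration(Duration.ofMillis(500))
                .build();

        Retry retry = Retry.of("demo", config);

        // Decorate any Supplier; the decorated call retries up to 3 times
        Supplier<String> decorated =
                Retry.decorateSupplier(retry, FunctionalStyleDemo::callRemoteService);
        System.out.println(decorated.get());
    }

    // Placeholder for a real remote call
    static String callRemoteService() {
        return "OK";
    }
}
```

In the rest of this tutorial we'll use the Spring Boot annotation style instead, but the same registries and configs sit underneath.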
2. Example Scenario – Calling a Remote Pricing Service
Throughout this tutorial, we’ll use a realistic example:
- inventory-service (your main Spring Boot service)
- calls a remote pricing-service over HTTP
- pricing-service sometimes fails or is slow
We’ll protect the remote call with:
- @Retry – auto-retry a few times if it fails
- @CircuitBreaker – open the circuit if the failure rate is high
- Fallback – return cached / default price when downstream is down
3. Project Setup – Spring Boot 3.x + Resilience4j
3.1. Maven Dependencies
```xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Required: the @Retry/@CircuitBreaker annotations are implemented as Spring AOP aspects -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-aop</artifactId>
    </dependency>
    <!-- Resilience4j Spring Boot 3 integration (not managed by the Boot BOM, so a version
         is required; check Maven Central for the latest 2.x release) -->
    <dependency>
        <groupId>io.github.resilience4j</groupId>
        <artifactId>resilience4j-spring-boot3</artifactId>
        <version>2.2.0</version>
    </dependency>
    <!-- Optional: Actuator for health & metrics -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Optional: Micrometer Prometheus registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>
```
4. Basic Resilience4j Configuration (application.yml)
4.1. Enabling Actuator Endpoints (recommended)
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always
```
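If you also want circuit-breaker state reflected under /actuator/health, Resilience4j can register health indicators. A minimal sketch using its `register-health-indicator` property, wired into the `default` config we define below:

```yaml
management:
  health:
    circuitbreakers:
      enabled: true

resilience4j:
  circuitbreaker:
    configs:
      default:
        register-health-indicator: true
```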
4.2. Retry & CircuitBreaker Base Configuration
Create or update application.yml:
```yaml
resilience4j:
  retry:
    configs:
      default:
        max-attempts: 3          # 1 initial call + 2 retries
        wait-duration: 500ms     # wait between attempts
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        retry-exceptions:
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:
          - com.example.demo.exception.BusinessException
    instances:
      priceService:
        base-config: default
  circuitbreaker:
    configs:
      default:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 20            # number of calls to measure
        minimum-number-of-calls: 10
        failure-rate-threshold: 50         # percentage
        wait-duration-in-open-state: 10s   # how long circuit stays OPEN
        permitted-number-of-calls-in-half-open-state: 3
        automatic-transition-from-open-to-half-open-enabled: true
        record-exceptions:
          - java.io.IOException
          - org.springframework.web.client.HttpServerErrorException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:
          - com.example.demo.exception.BusinessException
    instances:
      priceService:
        base-config: default
```
4.3. TimeLimiter (Optional but Important for Slow APIs)
```yaml
resilience4j:
  timelimiter:
    configs:
      default:
        timeout-duration: 2s
        cancel-running-future: true
    instances:
      priceService:
        base-config: default
```
5. 🔥 Deep Dive: Resilience4j Configuration Properties Explained
Understanding why each configuration exists helps you tune correctness and performance for production workloads. Below is a full breakdown of the most important properties we used.
5.1. 🔁 Retry Configuration — Meaning of Each Property
| Property | What It Controls | Recommended Range |
|---|---|---|
| `max-attempts` | Total attempts allowed (the initial call plus retries) | 3–5 |
| `wait-duration` | Delay between retry attempts | 200–800ms |
| `enable-exponential-backoff` | Whether to increase the wait time after each retry | `true` in most cases |
| `exponential-backoff-multiplier` | How much the delay grows each retry | 1.5–2 |
| `retry-exceptions` | Only these exception types trigger a retry | Network / I/O errors |
| `ignore-exceptions` | These are treated as business errors → fail immediately | Validation / business exceptions |
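If you ever need the same policy outside Spring's auto-configuration, the YAML maps one-to-one onto the programmatic builder. A minimal sketch mirroring the retry config above (the nested `BusinessException` stands in for `com.example.demo.exception.BusinessException` from the YAML):

```java
import java.io.IOException;
import java.time.Duration;

import org.springframework.web.client.ResourceAccessException;

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class RetryConfigSketch {

    // Stand-in for com.example.demo.exception.BusinessException from the YAML
    static class BusinessException extends RuntimeException {}

    static Retry buildPriceServiceRetry() {
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)  // 1 initial call + 2 retries
                // waits 500ms, then 1000ms: exponential backoff with multiplier 2
                .intervalFunction(IntervalFunction.ofExponentialBackoff(Duration.ofMillis(500), 2))
                .retryExceptions(IOException.class, ResourceAccessException.class)
                .ignoreExceptions(BusinessException.class)  // business errors fail fast
                .build();
        return Retry.of("priceService", config);
    }
}
```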
5.2. 🛡 CircuitBreaker Configuration — Meaning of Each Property
| Property | What It Controls | Why It Matters |
|---|---|---|
| `sliding-window-type` | Whether to measure calls by count or time | COUNT_BASED is easier for REST APIs; TIME_BASED for streaming |
| `sliding-window-size` | Number of recent calls to consider | Larger window = more stable decision, slower to react |
| `minimum-number-of-calls` | Min calls before evaluating the failure rate | Prevents the circuit opening on small sample sizes |
| `failure-rate-threshold` | % of failed calls required to open the circuit | 50% is a common starting point |
| `wait-duration-in-open-state` | How long the circuit stays OPEN before trying again | Enough time for the dependency to recover (e.g. 10–30s) |
| `permitted-number-of-calls-in-half-open-state` | Number of trial calls in HALF_OPEN | 3–10 is typical; too high can cause spikes |
| `automatic-transition-from-open-to-half-open-enabled` | Automatically move from OPEN to HALF_OPEN after the wait duration | `true` is usually what you want |
| `record-exceptions` | Which exceptions count as failures for the failure rate | Real downstream failures (5xx, timeouts, I/O) |
| `ignore-exceptions` | Exceptions that should not count as circuit failures | Business rules, validation errors, etc. |
CircuitBreaker protects your system by stopping calls to a failing dependency. The sliding-window size and failure threshold together decide when the breaker trips: too small a window leads to flapping (the circuit opens and closes frequently).
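The same settings can be expressed with the programmatic builder; a sketch of the equivalent of our YAML from section 4.2:

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

public class CircuitBreakerConfigSketch {

    static CircuitBreaker buildPriceServiceBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .slidingWindowType(SlidingWindowType.COUNT_BASED)
                .slidingWindowSize(20)                    // evaluate the last 20 calls
                .minimumNumberOfCalls(10)                 // don't judge on tiny samples
                .failureRateThreshold(50)                 // open at >= 50% failures
                .waitDurationInOpenState(Duration.ofSeconds(10))
                .permittedNumberOfCallsInHalfOpenState(3) // trial calls in HALF_OPEN
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build();
        return CircuitBreaker.of("priceService", config);
    }
}
```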
5.3. ⏱ TimeLimiter — Timeout Behavior (Optional but Highly Recommended)
| Property | Description | Typical Value |
|---|---|---|
| `timeout-duration` | Maximum time allowed for a remote call | 1–3 seconds for REST APIs |
| `cancel-running-future` | Whether to cancel the underlying task when the timeout occurs | `true` in most cases |
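For completeness, the programmatic equivalent of these two properties (a sketch, reusing our `priceService` instance name):

```java
import java.time.Duration;

import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

public class TimeLimiterConfigSketch {

    static TimeLimiter buildPriceServiceTimeLimiter() {
        TimeLimiterConfig config = TimeLimiterConfig.custom()
                .timeoutDuration(Duration.ofSeconds(2))  // fail calls slower than 2s
                .cancelRunningFuture(true)               // cancel the underlying future on timeout
                .build();
        return TimeLimiter.of("priceService", config);
    }
}
```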
5.4. 🎯 Suggested Defaults for Real Projects
- Retries: 3–4 attempts with exponential backoff
- Timeout (TimeLimiter): 2 seconds
- CircuitBreaker sliding window: 20 calls
- Failure threshold: 50%
- Half-open trial calls: 3–5
These defaults work well for REST APIs with P99 latency under ~500ms. You should always tune them later using real production metrics from Prometheus / Grafana.
6. Implementing a Resilient Client with Retry & CircuitBreaker
6.1. DTO for Remote Response
```java
import java.math.BigDecimal;

public record PriceResponse(
        Long productId,
        BigDecimal price,
        String currency
) {}
```
6.2. RestTemplate Bean
```java
import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class HttpClientConfig {

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
                // on Spring Boot 3.4+ these builder methods are named connectTimeout(..)/readTimeout(..)
                .setConnectTimeout(Duration.ofSeconds(2))
                .setReadTimeout(Duration.ofSeconds(2))
                .build();
    }
}
```
6.3. PriceClient – Remote Call Wrapped with Resilience4j
```java
import java.math.BigDecimal;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

@Service
public class PriceClient {

    private static final Logger log = LoggerFactory.getLogger(PriceClient.class);

    private final RestTemplate restTemplate;

    @Value("${pricing-service.base-url:http://localhost:8081}")
    private String pricingServiceBaseUrl;

    public PriceClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    @Retry(name = "priceService", fallbackMethod = "getPriceFallback")
    @CircuitBreaker(name = "priceService", fallbackMethod = "getPriceFallback")
    public PriceResponse getPrice(Long productId) {
        String url = pricingServiceBaseUrl + "/api/prices/" + productId;
        ResponseEntity<PriceResponse> response =
                restTemplate.getForEntity(url, PriceResponse.class);
        return response.getBody();
    }

    /**
     * Fallback method: same parameters as the protected method, plus a
     * Throwable as the last parameter, and the same return type.
     */
    public PriceResponse getPriceFallback(Long productId, Throwable throwable) {
        // Return cached / default / last-known price
        BigDecimal defaultPrice = BigDecimal.ZERO;
        // Log the root cause so fallback usage is visible in production
        log.warn("Fallback triggered for product {} due to: {} - {}",
                productId, throwable.getClass().getSimpleName(), throwable.getMessage());
        return new PriceResponse(productId, defaultPrice, "USD");
    }
}
```
When both annotations sit on the same method, Resilience4j's default aspect order wraps Retry around CircuitBreaker, so every retry attempt passes through the breaker and is recorded in its sliding window.
7. Exposing a REST Endpoint that Uses the Resilient Client
```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/products")
public class ProductController {

    private final PriceClient priceClient;

    public ProductController(PriceClient priceClient) {
        this.priceClient = priceClient;
    }

    @GetMapping("/{id}/price")
    public PriceResponse getProductPrice(@PathVariable Long id) {
        return priceClient.getPrice(id);
    }
}
```
Now, every call to /api/products/{id}/price is protected by Retry and CircuitBreaker.
8. How Retry Works in Resilience4j (Under the Hood)
With our earlier YAML:
- max-attempts: 3 → 1 initial call + 2 retries.
- wait-duration: 500ms → wait 500ms between attempts.
- enable-exponential-backoff: true → wait times grow (500ms, 1000ms, 2000ms…).
- retry-exceptions → retry only IOException and ResourceAccessException.
- ignore-exceptions → do not retry BusinessException (fail fast).
8.1. Visual Timeline
```text
Initial call ──X (IOException)
      │
      │ wait 500ms
      ▼
Retry #1 ────X (IOException)
      │
      │ wait 1000ms
      ▼
Retry #2 ────X (IOException)
      │
      ▼
All attempts failed → Fallback method invoked
```
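You don't have to guess when retries happen: Resilience4j publishes events per attempt. A small sketch that logs them, assuming the `RetryRegistry` that resilience4j-spring-boot3 auto-configures:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryRegistry;

@Component
public class RetryEventLogger {

    private static final Logger log = LoggerFactory.getLogger(RetryEventLogger.class);

    public RetryEventLogger(RetryRegistry retryRegistry) {
        Retry retry = retryRegistry.retry("priceService");
        // Fires once per retry attempt (not for the initial call)
        retry.getEventPublisher().onRetry(event ->
                log.info("Retry #{} for '{}' after waiting {}",
                        event.getNumberOfRetryAttempts(),
                        event.getName(),
                        event.getWaitInterval()));
    }
}
```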
8.2. Custom Retry Instance for a Specific API
You can override configuration per instance:
```yaml
resilience4j:
  retry:
    instances:
      priceService:
        max-attempts: 5
        wait-duration: 300ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 1.5
```
Once a call fails with one of the configured exceptions, Resilience4j will retry according to this config before giving up and calling the fallback.
9. How CircuitBreaker Works in Resilience4j
9.1. States: CLOSED → OPEN → HALF_OPEN
- CLOSED: normal operation. All calls are allowed and counted.
- OPEN: too many failures → short-circuit; calls fail immediately with CallNotPermittedException.
- HALF_OPEN: a limited number of trial calls are allowed; if they succeed → back to CLOSED, otherwise → back to OPEN.
9.2. Visual State Diagram (Textual)
```text
[CLOSED]
    │ (failure rate > threshold within sliding window)
    ▼
[OPEN]
    │ (after wait-duration-in-open-state)
    ▼
[HALF_OPEN]
    │
    ├──► [CLOSED]   (if trial calls succeed)
    │
    └──► [OPEN]     (if trial calls fail)
```
From our YAML:
- sliding-window-size: 20 → consider last 20 calls.
- failure-rate-threshold: 50 → if 10 out of 20 calls fail → circuit goes OPEN.
- wait-duration-in-open-state: 10s → stay OPEN for 10 seconds before trying HALF_OPEN.
- permitted-number-of-calls-in-half-open-state: 3 → only 3 test calls allowed.
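These transitions can be observed at runtime through the breaker's event publisher. A sketch, again assuming the auto-configured registry:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

@Component
public class CircuitBreakerStateLogger {

    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerStateLogger.class);

    public CircuitBreakerStateLogger(CircuitBreakerRegistry registry) {
        CircuitBreaker breaker = registry.circuitBreaker("priceService");
        // Logs CLOSED -> OPEN, OPEN -> HALF_OPEN, HALF_OPEN -> CLOSED, ...
        breaker.getEventPublisher().onStateTransition(event ->
                log.info("CircuitBreaker '{}' transition: {}",
                        event.getCircuitBreakerName(),
                        event.getStateTransition()));
    }
}
```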
10. Combining Retry + CircuitBreaker + TimeLimiter
Retry and CircuitBreaker address errors. But what about slow responses? That’s where TimeLimiter helps.
10.1. TimeLimiter Configuration Recap
```yaml
resilience4j:
  timelimiter:
    instances:
      priceService:
        timeout-duration: 2s
        cancel-running-future: true
```
10.2. Using TimeLimiter with CompletableFuture
TimeLimiter works with async calls. Example with CompletableFuture:
```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.function.Supplier;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterRegistry;

@Service
public class AsyncPriceClient {

    private final RestTemplate restTemplate;
    private final TimeLimiter timeLimiter;
    // Scheduler the TimeLimiter uses to enforce the timeout
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public AsyncPriceClient(RestTemplate restTemplate,
                            TimeLimiterRegistry timeLimiterRegistry) {
        this.restTemplate = restTemplate;
        this.timeLimiter = timeLimiterRegistry.timeLimiter("priceService");
    }

    public CompletableFuture<PriceResponse> getPriceAsync(Long productId) {
        Supplier<CompletableFuture<PriceResponse>> supplier =
                () -> CompletableFuture.supplyAsync(() -> {
                    String url = "http://localhost:8081/api/prices/" + productId;
                    return restTemplate.getForObject(url, PriceResponse.class);
                });

        // executeCompletionStage requires a scheduler to schedule the timeout check
        return timeLimiter.executeCompletionStage(scheduler, supplier)
                .toCompletableFuture();
    }
}
```
If the remote call takes longer than the configured timeout-duration, a TimeoutException will be thrown.
You can wrap this with @Retry and @CircuitBreaker as well for a full protection stack.
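With annotations, the full stack on one async method could look like the sketch below (a sketch, not the only layout: note that @TimeLimiter requires a CompletionStage return type, and the fallback must return a CompletableFuture too):

```java
import java.math.BigDecimal;
import java.util.concurrent.CompletableFuture;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;

@Service
public class GuardedPriceClient {

    private final RestTemplate restTemplate;

    public GuardedPriceClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // Default aspect order: Retry( CircuitBreaker( TimeLimiter( method ) ) )
    @Retry(name = "priceService")
    @CircuitBreaker(name = "priceService", fallbackMethod = "priceFallback")
    @TimeLimiter(name = "priceService")
    public CompletableFuture<PriceResponse> getPriceAsync(Long productId) {
        return CompletableFuture.supplyAsync(() ->
                restTemplate.getForObject(
                        "http://localhost:8081/api/prices/" + productId,
                        PriceResponse.class));
    }

    // Fallback for async methods must also return a CompletableFuture
    public CompletableFuture<PriceResponse> priceFallback(Long productId, Throwable t) {
        return CompletableFuture.completedFuture(
                new PriceResponse(productId, BigDecimal.ZERO, "USD"));
    }
}
```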
11. Observability: Metrics & Monitoring Resilience4j
With spring-boot-starter-actuator and Resilience4j on the classpath, you automatically get metrics:
- resilience4j.circuitbreaker.calls
- resilience4j.retry.calls
- resilience4j.timelimiter.calls

(Prometheus exports these with underscores, e.g. resilience4j_circuitbreaker_calls.)
11.1. Check Metrics via Actuator
GET http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.calls
GET http://localhost:8080/actuator/metrics/resilience4j.retry.calls
11.2. Prometheus + Grafana (Optional)
If you added micrometer-registry-prometheus, expose metrics:
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
```
Then you can scrape /actuator/prometheus from Prometheus and build Grafana dashboards:
- Circuit state (open / closed)
- Failure rate over time
- Retry count, timeouts, fallback usage
12. Testing Failure Scenarios (Very Important)
Don’t just rely on happy-path tests. You should simulate failures:
12.1. Simulate Intermittent Failures in Fake Pricing Service
```java
import java.math.BigDecimal;
import java.util.Random;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/prices")
public class FakePricingController {

    private final Random random = new Random();

    @GetMapping("/{id}")
    public PriceResponse getPrice(@PathVariable Long id) {
        int value = random.nextInt(10);
        if (value < 3) {
            // 30% of the time: simulate a slow response
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        if (value > 7) {
            // 20% of the time: simulate a failure
            throw new RuntimeException("Downstream pricing service failed");
        }
        return new PriceResponse(id, BigDecimal.valueOf(99.99), "USD");
    }
}
```
Now, hit /api/products/{id}/price multiple times and watch:
- Retries being applied
- CircuitBreaker opening after too many failures
- Fallback being used when circuit is open or all retries fail
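For repeatable checks, you can also assert breaker behavior directly in a test. A minimal sketch, assuming JUnit 5 and the auto-configured CircuitBreakerRegistry (transitionToOpenState forces the state without waiting for real failures):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

@SpringBootTest
class PriceServiceCircuitBreakerTest {

    @Autowired
    private CircuitBreakerRegistry registry;

    @Test
    void circuitCanBeForcedOpenForTesting() {
        CircuitBreaker breaker = registry.circuitBreaker("priceService");

        // Force the breaker OPEN instead of waiting for 10 real failures
        breaker.transitionToOpenState();
        assertEquals(CircuitBreaker.State.OPEN, breaker.getState());

        // Reset back to CLOSED so other tests are unaffected
        breaker.reset();
        assertEquals(CircuitBreaker.State.CLOSED, breaker.getState());
    }
}
```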
13. Best Practices for Resilience4j in Production
13.1. Use Different Instances per Remote Dependency
Don’t reuse a single priceService config for everything. Instead:
```yaml
resilience4j:
  circuitbreaker:
    instances:
      priceService:   { base-config: default }
      stockService:   { base-config: default }
      paymentService: { base-config: default }
```
This way, one noisy dependency doesn’t affect others.
13.2. Choose the Right Retry Count
- Too many retries → more pressure on an already slow/failing service.
- Too few retries → you won't recover from transient network glitches.
- Common choice: 3–5 attempts with exponential backoff.
13.3. Always Provide Meaningful Fallbacks
- Return cached data or last-known-good values when possible (see the sketch after this list).
- Return a well-formed error response instead of raw exceptions.
- Log fallback usage with enough context for debugging.
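A small sketch of the "last-known-good" idea: cache each successful response and serve it from the fallback. The class and wiring are hypothetical, not part of the earlier PriceClient:

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.stereotype.Service;

// Hypothetical helper: remembers the last good price per product
@Service
public class LastKnownPriceCache {

    private final Map<Long, PriceResponse> lastKnownPrices = new ConcurrentHashMap<>();

    // Call this after every successful remote fetch
    public void remember(PriceResponse response) {
        lastKnownPrices.put(response.productId(), response);
    }

    // Use this inside the fallback method instead of a hard-coded default
    public PriceResponse lastKnownOrDefault(Long productId) {
        return lastKnownPrices.getOrDefault(productId,
                new PriceResponse(productId, BigDecimal.ZERO, "USD"));
    }
}
```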
13.4. Don’t Abuse CircuitBreaker
- Use it only for remote calls (HTTP, DB, external systems).
- Don’t wrap CPU-heavy local operations with CircuitBreaker.
13.5. Monitor Metrics and Tune Continually
Start with conservative defaults, then adjust:
- failure-rate-threshold (e.g., 50%)
- sliding-window-size
- timeout-duration (TimeLimiter)
Use real production metrics to tune these values over time.
14. Summary & Next Steps
By integrating Resilience4j with Spring Boot 3, you get a powerful, flexible toolkit for building resilient microservices.
- @Retry handles transient failures.
- @CircuitBreaker prevents cascading failures from broken dependencies.
- TimeLimiter ensures slow calls don’t hog resources.
- Fallback strategies keep the user experience graceful.
- Metrics + Actuator provide visibility into real-world behavior.
Combine these patterns with good observability and careful tuning, and your Spring Boot services will stay responsive even when dependencies fail.
15. FAQ: Resilience4j Retry & CircuitBreaker in Spring Boot
Q1. What is Resilience4j used for in Spring Boot?
Resilience4j provides fault-tolerance patterns like Retry, CircuitBreaker, TimeLimiter, RateLimiter and Bulkhead. In Spring Boot it’s commonly used to protect HTTP calls to external APIs, databases and other microservices.
Q2. Can I use Retry and CircuitBreaker together?
Yes. A common pattern is to apply @Retry for transient errors and @CircuitBreaker to stop calls when failure rate is high.
You can annotate the same method with both and share the same named configuration instance.
Q3. When should I use TimeLimiter?
Use TimeLimiter when you want to fail calls that are taking too long (e.g. slow downstream API).
It’s especially useful for async / non-blocking or CompletableFuture-based calls.
Q4. What is a good starting configuration for CircuitBreaker?
For typical REST APIs, a good starting point is:
sliding-window-size = 20, failure-rate-threshold = 50,
wait-duration-in-open-state = 10s, permitted-number-of-calls-in-half-open-state = 3 — then tune using metrics.
Q5. Should I retry all exceptions?
No. Only retry transient errors like I/O issues or timeouts. Never retry business exceptions (validation failures, domain errors) – they will not succeed on retry.