Retry Mechanism Implementation
Integrating with the Founda platform requires handling scenarios where requests may not reach their destination due to transient issues like network problems or (downstream) service downtime. This occurs most commonly when we are unable to deliver a request to the downstream data sources due to connectivity issues between Founda and the Provider Organization.
The Founda platform does not automatically retry failed requests for you; the received error code is returned in the response as-is. For this reason, we strongly advise implementing a retry mechanism. Depending on your requirements, an implementation using exponential backoff is often the best fit.
Retrying failed requests is not always the right choice. This is especially important to consider in (synchronous) patient- or doctor-facing applications. Specific requests that require direct user feedback need to be designed to deal with potential unavailability of downstream systems.
When deciding whether to retry, always consider the HTTP status code. The following guidance applies:
- Implement retries for transient error responses such as:
  - 503 Service Unavailable
  - 504 Gateway Timeout
  - 502 Bad Gateway
  - 408 Request Timeout
- Do not retry on errors that are unlikely to resolve themselves or indicate client-side issues, such as:
  - 400 Bad Request
  - 401 Unauthorized
  - 403 Forbidden
  - 404 Not Found
  - 409 Conflict
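This classification can be expressed as a simple lookup. The sketch below mirrors the status-code sets listed above; the function name `is_retryable` is illustrative, not part of the Founda API:

```python
# Status codes worth retrying: transient server-side or timeout errors.
RETRYABLE_STATUS_CODES = {408, 502, 503, 504}

# Client-side or semantic errors: retrying will not help.
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404, 409}

def is_retryable(status_code: int) -> bool:
    """Return True if a failed request with this status is worth retrying."""
    return status_code in RETRYABLE_STATUS_CODES

print(is_retryable(503))  # True: transient, retry
print(is_retryable(404))  # False: client-side, do not retry
```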
The general recommendation is to use exponential backoff, an approach that manages retries by incrementally increasing the delay between attempts, balancing network load against the likelihood of successful delivery.
For example:
- Initial Attempt: Make your request.
- Failure Detection: On failure, check if the status code suggests a transient error.
- First Retry: If retryable, wait for a moderate initial delay (typically between 5 and 30 seconds, depending on your application).
- Subsequent Retries: Increase the delay for each retry exponentially.
- Retry Limit: Cap the retries at a reasonable number (e.g., 4 attempts).
- Ceiling on Delay: Implement a maximum delay (e.g., 15 minutes) to avoid long waits.
- Jitter: Add random jitter to prevent synchronized retry patterns in large-scale outages.
- Monitor Retry Attempts: Keep track of retries and adjust your strategy based on outcomes.
- Error Handling: Have a plan for handling requests that fail after all retries.
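The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `send_request` is a hypothetical callable returning an HTTP status code, and the delay, cap, and attempt values are the examples from the list above.

```python
import random
import time

# Transient errors worth retrying, per the guidance above.
RETRYABLE = {408, 502, 503, 504}

def retry_with_backoff(
    send_request,                # hypothetical callable returning a status code
    max_attempts: int = 4,       # retry limit
    initial_delay: float = 5.0,  # seconds; tune per application (5-30s)
    max_delay: float = 900.0,    # ceiling on delay: 15 minutes
    sleep=time.sleep,            # injectable for testing
) -> int:
    """Retry a request with exponential backoff and jitter.

    Retries only transient errors, doubles the delay each attempt,
    caps it, adds random jitter, and gives up after the retry limit.
    """
    status = 0
    for attempt in range(max_attempts):
        status = send_request()
        if status < 400:
            return status                # success
        if status not in RETRYABLE:
            return status                # non-transient: do not retry
        if attempt == max_attempts - 1:
            break                        # retry limit reached
        delay = min(initial_delay * (2 ** attempt), max_delay)
        delay += random.uniform(0, delay * 0.1)  # jitter: desynchronize clients
        sleep(delay)
    # All retries exhausted: surface the failure to the caller and monitoring.
    return status
```

Injecting `sleep` keeps the sketch testable and lets you record delays for the monitoring step; in production you would also log each attempt and the final outcome.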