The primary use case is network operations with intermittent errors - things like HTTP timeouts and so on.
We've had some processes where these were wrapped in Retry (pretty common - anything over the network can fail for a myriad of reasons), and from the log timings we could see that it sometimes retried, but until we rerolled it into custom retry logic (which is annoying to do, and error prone) we did not know why. Once the retried errors surfaced, we were able to address the root cause: the other side had a load balancer set up, and one of the backup nodes had an issue with its certificate. They quickly addressed that, and retries went down significantly.
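For reference, that custom retry logic boils down to something like the following - a rough Python sketch rather than actual UiPath code, with made-up names (retry_with_logging, call_service) just to show the shape: log every failed attempt instead of swallowing it, then retry.

import logging
import time

log = logging.getLogger(__name__)

def retry_with_logging(action, attempts=3, delay_seconds=5):
    # Call `action` up to `attempts` times, logging every failure instead of hiding it.
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # This is the part the stock Retry hides: which exception actually fired.
            log.warning("Attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of attempts - surface the last error
            time.sleep(delay_seconds)

# usage (call_service is hypothetical):
# result = retry_with_logging(lambda: call_service(request))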
The other common issue with HTTP is 429 (too many requests), where the client code did not correctly implement a backoff/throttling strategy. A wait+retry usually works around that, but without knowing that the call actually came back with a 429, the root cause remains unknown.
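For the 429 case specifically, the wait+retry workaround we mean is roughly the following (again a Python sketch; the timings and header handling are placeholder assumptions) - the important part is that the 429 gets logged before the wait, otherwise you never learn that throttling was the problem.

import logging
import time
import requests

log = logging.getLogger(__name__)

def get_with_throttle(url, attempts=5, base_delay=2):
    for attempt in range(1, attempts + 1):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Log the throttling so the root cause is visible, then back off.
        # Assumes Retry-After (if present) is given in seconds.
        wait = int(response.headers.get("Retry-After", base_delay * attempt))
        log.warning("Got 429 on attempt %d/%d, waiting %ds", attempt, attempts, wait)
        time.sleep(wait)
    raise RuntimeError("Still throttled after %d attempts" % attempts)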
Outside of plain HTTP calls - when interacting with websites, for example - we've had situations where certain parts were retried, and while we could see in the logs that the retries did fire, again we didn't know why. Usually this ends up in a really awkward setup like this, just for tracing:
Retry
{
    Try
    {
        something that could fail
    }
    Catch
    {
        log error
        rethrow
    }
}
This is very ugly (especially on the design canvas - the mental overhead of these nestings adds up pretty quickly), but it is the quickest way to log the error before Retry swallows it.
In many cases with the websites, once the errors are known they're usually pretty easy to fix (90% of the time it's either a timing issue or a caching issue), but a retry on a UI interaction can really eat up execution time.
The last use case: on occasion we've had things that were wrapped in a Retry, but after surfacing the errors we found that the only errors we were getting were unrecoverable anyway, so the Retry was just adding execution time without any benefit. Sure, that screams “lazy dev error”, and it probably was (or the intermittent issues went away/were fixed in the meantime, who knows), but to confirm that we needed to alter the code (potentially introducing another error, as with all changes).
So in essence this request is to eliminate the need for these ugly intermediate catches just to know what's going on, and to be able to use Retry more easily without sacrificing error tracing.
Sidenote: an ideal situation would be that, aside from logging retried exceptions, we could also pass a predicate for bailouts or a list of non-retryable exception types, but that's a much more complicated scenario to get right. So just logging would already help a lot for now; the sketch below shows what we mean by the rest.
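To illustrate the sidenote: this is the same shape as the earlier sketch, just extended with a bail-out predicate and a list of non-retryable exception types. Purely hypothetical - the parameter names (non_retryable, should_bail) are made up to show the idea, not an existing API.

import logging
import time

log = logging.getLogger(__name__)

def retry(action, attempts=3, delay_seconds=5, non_retryable=(), should_bail=None):
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Every retried exception gets logged - the core of this request.
            log.warning("Attempt %d/%d failed: %r", attempt, attempts, exc)
            # Bail out immediately on errors we already know are unrecoverable.
            if isinstance(exc, tuple(non_retryable)) or (should_bail and should_bail(exc)):
                raise
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)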