
Add option for more control on connection recovery #1535


Description

@gavanore

Is your feature request related to a problem? Please describe.

My colleague chatted with you on Discord about this. Our main use case is disparate clusters, each behind its own load balancer. We want to use one of those clusters as the primary location and only swap to the other cluster when the primary is completely down/unavailable. We think this could be achieved fairly easily by consulting some externally pluggable logic when setting up the ConnectionFactory.

Describe the solution you'd like

Ultimately, something like RetryListener, but for connections and not just for topology recovery. It could also be done with lambdas (similar to Predicate), receiving the connection that failed and perhaps a connection retry count to help make decisions.

We envision setting cluster tags on our servers that tell the client which cluster it is connected to, and perhaps an additional tag indicating that the address used is behind a load balancer. We could then check whether the server tags indicate a load-balancer address, combined with the reason the connection was shut down.
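For illustration, a rough sketch of what that check might look like. The "cluster_tags" server property is purely an assumption on our part (not current behavior); only Connection.getServerProperties() and getCloseReason() are existing client API here:

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ShutdownSignalException;

// Sketch only: the "cluster_tags" server property is an assumption, not current behavior.
static boolean shouldFailOver(Connection failed) {
    Object tags = failed.getServerProperties().get("cluster_tags"); // hypothetical key
    boolean behindLoadBalancer = tags != null && tags.toString().contains("load-balancer");
    ShutdownSignalException reason = failed.getCloseReason();
    boolean unexpected = reason != null && !reason.isInitiatedByApplication();
    // only cascade to the other cluster when an LB-fronted node closed the connection unexpectedly
    return behindLoadBalancer && unexpected;
}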

Maybe an easy way to plug this in with the current code is an interface that returns an AddressResolver; the default implementation would simply return the configured AddressResolver unconditionally, which preserves current behavior.

So, maybe, all notional:

import com.rabbitmq.client.AddressResolver;
import com.rabbitmq.client.Connection;

public interface ConnectionRetryListener {
    /** Returns the AddressResolver to use for the next connection attempt. */
    AddressResolver onRetry(Connection failed, Exception cause, int retryCount);
}

/* somewhere in ConnectionFactory initialization */
if (this.connectionRetryListener == null) {
    // default: keep using the configured resolver, preserving current behavior
    this.connectionRetryListener = (conn, cause, count) -> this.addressResolver;
}

Then we could return a non-shuffled list of [secondary, primary] when there's an unexpected issue or when the retry count goes above some tolerable level, and otherwise have the client attempt [primary, secondary] in normal scenarios. Or skip returning primary and secondary together entirely and let the implementation decide which one to try on its own, e.g. try the primary three times, then the secondary three times, then give up (see the sketch below).
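A minimal sketch of that last variant, assuming the notional ConnectionRetryListener above; ListAddressResolver is an existing client class, and the hostnames are placeholders:

import com.rabbitmq.client.Address;
import com.rabbitmq.client.ListAddressResolver;
import java.util.Collections;

// Notional strategy: try the primary cluster for the first three attempts,
// then fall back to the secondary; the caller can still give up past some count.
static ConnectionRetryListener primaryThenSecondary() {
    Address primary = new Address("primary-lb.example.com", 5672);     // placeholder host
    Address secondary = new Address("secondary-lb.example.com", 5672); // placeholder host
    return (failedConn, cause, retryCount) -> {
        Address target = retryCount < 3 ? primary : secondary;
        return new ListAddressResolver(Collections.singletonList(target));
    };
}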

I have not yet looked at the downstream impacts of wiring this through the existing code. First we just want to hash out ideas on what you do and don't like. We're willing to do the legwork to contribute.

Describe alternatives you've considered

Currently we override AddressResolver to always return a fixed list and skip shuffling. This mostly works well, but there are edge cases where a client may cascade to the more distant cluster while its primary is still up.
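For reference, the current workaround is roughly the following sketch; the hostnames are placeholders, and it assumes a client version where AddressResolver exposes the maybeShuffle default method:

import com.rabbitmq.client.Address;
import com.rabbitmq.client.AddressResolver;
import java.util.Arrays;
import java.util.List;

// Fixed-order resolver: primary first, secondary second, no shuffling,
// so the client only cascades when the primary is unreachable.
public class FixedOrderAddressResolver implements AddressResolver {

    private final List<Address> addresses = Arrays.asList(
            new Address("primary-lb.example.com", 5672),    // placeholder host
            new Address("secondary-lb.example.com", 5672)); // placeholder host

    @Override
    public List<Address> getAddresses() {
        return addresses;
    }

    @Override
    public List<Address> maybeShuffle(List<Address> input) {
        return input; // keep the declared order instead of shuffling
    }
}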

Additional context

No response
