
Seeing a lot of MOVED errors when RouteByLatency is enabled in ClusterClient #3023

Closed
@srikar-jilugu

Description


When we benchmarked our ElastiCache cluster (cluster mode enabled) with the RouteByLatency option enabled on go-redis v9.5.1, we saw an increase in the average response time of our Redis operations (GET and pipelined commands). While debugging the issue with additional logging, we saw a large number of MOVED errors, which caused retries and in turn increased overall latency.

moved, ask, addr = isMovedError(lastErr)

Debugging further, we observed that the slotClosestNode func returns a random node across all shards when every node is marked as failing.

return c.nodes.Random()

In our case this situation (all nodes marked as failing) happens frequently, which is what causes the frequent MOVED errors.
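
For context, here is roughly what the selection logic looks like, reconstructed from the snippets above; this is a sketch, not a verbatim copy of cluster.go:

    var node *clusterNode
    for _, n := range nodes {
        if n.Failing() {
            continue
        }
        if node == nil || n.Latency() < node.Latency() {
            node = n
        }
    }
    if node != nil {
        return node, nil
    }

    // Every node serving the slot is marked as failing, so the client
    // falls back to a random node across the whole cluster. That node may
    // belong to a different shard, and the command then fails with MOVED.
    return c.nodes.Random()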

Expected Behavior

Response time should not increase when RouteByLatency is enabled; if anything, it should decrease. MOVED errors should also be rare once the client's current cluster state is updated.

Current Behavior

An increase in MOVED errors and, with the same traffic, a corresponding increase in the number of GET commands issued (due to retries), in the engine CPU utilisation of all nodes, and in overall latency.

Possible Solution

When all the nodes are marked as failing, choosing a random node within the shard associated with the slot (even though its nodes are marked as failing) should solve the problem; this is what is already done when RouteRandomly is enabled.

Steps to Reproduce

  1. An ElastiCache cluster (we are using engine 7.1.0) with multiple shards and replicas for each (we used 2 shards with 3 nodes each).
  2. go-redis v9.5.1 with RouteByLatency enabled, at a throughput of around 10-20k rpm of GET and pipelined GET commands (see the sketch after this list).
  3. Multiple ECS tasks (we are using 10) spread across multiple availability zones.
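
For reference, a minimal sketch of the client setup used in step 2; the endpoint address is a placeholder, not our actual cluster configuration:

    package main

    import (
        "context"

        "github.com/redis/go-redis/v9"
    )

    func main() {
        // RouteByLatency makes the ClusterClient route read commands to the
        // node with the lowest observed latency. Placeholder endpoint below.
        rdb := redis.NewClusterClient(&redis.ClusterOptions{
            Addrs:          []string{"my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379"},
            RouteByLatency: true,
        })
        defer rdb.Close()

        ctx := context.Background()

        // Plain GET plus pipelined GETs, matching the benchmark traffic mix.
        _ = rdb.Get(ctx, "some-key").Err()

        pipe := rdb.Pipeline()
        pipe.Get(ctx, "key-1")
        pipe.Get(ctx, "key-2")
        _, _ = pipe.Exec(ctx)
    }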

Possible Implementation

We changed the slotClosestNode func to implement the fix described above, and when we benchmarked again it did reduce the MOVED errors (and hence the response time).

This is the fix we made in our fork:

    for _, n := range nodes {
        if n.Failing() {
            continue
        }
        if node == nil || n.Latency() < node.Latency() {
            node = n
        }
    }

    if node != nil {
        return node, nil
    }

    // If all nodes are failing, return a random node from the nodes
    // corresponding to the slot rather than from the whole cluster.
    randomNodes := rand.Perm(len(nodes))
    return nodes[randomNodes[0]], nil
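
Since only the first element of the permutation is used, nodes[rand.Intn(len(nodes))] would pick the random node equivalently without building the full permutation.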
