Description
When we benchmarked our ElastiCache cluster (cluster mode enabled) using go-redis v9.5.1 with the RouteByLatency option enabled, we saw an increase in the average response time of our Redis operations (get and pipeline commands). While debugging, we added logging and saw a large number of MOVED errors. These caused retries, which in turn increased overall latency.
Line 966 in d43a9fa
On further debugging, we observed that the slotClosestNode func returns a random node across all shards when all the nodes are marked as failing.
Line 750 in d43a9fa
In our case, this situation (all nodes marked as failing) occurs frequently, which causes frequent MOVED errors.
Expected Behavior
Response time should not increase when RouteByLatency is enabled; if anything, it should decrease. MOVED errors should also be rare once the client's current cluster state has been updated.
Current Behavior
An increase in MOVED errors and, as a result, increases in Get throughput (with the same traffic), engine CPU utilisation on all nodes, and overall latency.
Possible Solution
When all the nodes are marked as failing, choosing a random node within the shard associated with the slot (even though those nodes are marked as failing) might solve this problem. This is what is done when RouteRandomly is enabled.
Steps to Reproduce
- ElastiCache cluster (we are using engine 7.1.0) with multiple shards and replicas for each (we used 2 shards with 3 nodes each)
- go-redis v9.5.1 with RouteByLatency enabled, throughput around 10-20k rpm with get and pipeline.get
- Multiple ECS tasks (we are using 10) spread across multiple availability zones
Context (Environment)
Detailed Description
Possible Implementation
We changed the slotClosestNode func to implement the fix described above. When we benchmarked again, the change reduced the MOVED errors (and hence the response time).
This is the fix we made in our fork:
	for _, n := range nodes {
		if n.Failing() {
			continue
		}
		if node == nil || n.Latency() < node.Latency() {
			node = n
		}
	}
	if node != nil {
		return node, nil
	}
	// If all nodes are failing - return random node from the nodes corresponding to the slot
	randomNodes := rand.Perm(len(nodes))
	return nodes[randomNodes[0]], nil