
Seeing a lot of MOVED errors when RouteByLatency is enabled in ClusterClient #3023

Closed
@srikar-jilugu

Description


When we benchmarked our ElastiCache cluster (cluster mode enabled) with the RouteByLatency option enabled on go-redis v9.5.1, we saw an increase in the average response time of our Redis operations (GET and pipelined commands). While debugging the issue with additional logging, we saw a large number of MOVED errors, which caused retries and in turn increased overall latency.

moved, ask, addr = isMovedError(lastErr)

Debugging further, we observed that the slotClosestNode func returns a random node across all shards when every node is marked as failing.

return c.nodes.Random()

In our case this situation (all nodes marked as failing) happens frequently, which is what causes the frequent MOVED errors.
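
For context, here is roughly what the selection logic looks like, reconstructed from the snippets above; this is a sketch, not a verbatim copy of cluster.go:

    var node *clusterNode
    for _, n := range nodes {
        if n.Failing() {
            continue
        }
        if node == nil || n.Latency() < node.Latency() {
            node = n
        }
    }
    if node != nil {
        return node, nil
    }

    // Every node serving the slot is marked as failing, so the client
    // falls back to a random node across the whole cluster. That node may
    // belong to a different shard, and the command then fails with MOVED.
    return c.nodes.Random()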

Expected Behavior

Response time should not increase when RouteByLatency is enabled; if anything, it should decrease. MOVED errors should also be rare once the client's current cluster state is updated.

Current Behavior

An increase in MOVED errors and, with the same traffic, a corresponding increase in the number of GET commands issued (due to retries), in the engine CPU utilisation of all nodes, and in overall latency.

Possible Solution

When all the nodes are marked as failing, choosing a random node within the shard associated with the slot (even though its nodes are marked as failing) should solve the problem; this is what is already done when RouteRandomly is enabled.

Steps to Reproduce

  1. An ElastiCache cluster (we are using engine 7.1.0) with multiple shards and replicas for each (we used 2 shards with 3 nodes each).
  2. go-redis v9.5.1 with RouteByLatency enabled, at a throughput of around 10-20k rpm of GET and pipelined GET commands (see the sketch after this list).
  3. Multiple ECS tasks (we are using 10) spread across multiple availability zones.
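
For reference, a minimal sketch of the client setup used in step 2; the endpoint address is a placeholder, not our actual cluster configuration:

    package main

    import (
        "context"

        "github.com/redis/go-redis/v9"
    )

    func main() {
        // RouteByLatency makes the ClusterClient route read commands to the
        // node with the lowest observed latency. Placeholder endpoint below.
        rdb := redis.NewClusterClient(&redis.ClusterOptions{
            Addrs:          []string{"my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379"},
            RouteByLatency: true,
        })
        defer rdb.Close()

        ctx := context.Background()

        // Plain GET plus pipelined GETs, matching the benchmark traffic mix.
        _ = rdb.Get(ctx, "some-key").Err()

        pipe := rdb.Pipeline()
        pipe.Get(ctx, "key-1")
        pipe.Get(ctx, "key-2")
        _, _ = pipe.Exec(ctx)
    }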

Possible Implementation

We changed the slotClosestNode func to implement the fix described above, and when we benchmarked again it did reduce the MOVED errors (and hence the response time).

This is the fix we made in our fork:

    for _, n := range nodes {
        if n.Failing() {
            continue
        }
        if node == nil || n.Latency() < node.Latency() {
            node = n
        }
    }

    if node != nil {
        return node, nil
    }

    // If all nodes are failing, return a random node from the nodes
    // corresponding to the slot rather than from the whole cluster.
    randomNodes := rand.Perm(len(nodes))
    return nodes[randomNodes[0]], nil
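
Since only the first element of the permutation is used, nodes[rand.Intn(len(nodes))] would pick the random node equivalently without building the full permutation.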
