Description
A race condition exists in the redis cluster client involving c.activeAddrs(a slice that stores redis node addresses). Specifically, the cluster client’s GC function writes to c.activeAddrs, while concurrently, callers may invoke the Addrs function to read from it.
Write path
Function GC overwrites the activeAddrs from its beginning.
// GC removes unused nodes.
func (c *clusterNodes) GC(generation uint32) {
//nolint:prealloc
var collected []*clusterNode
c.mu.Lock()
c.activeAddrs = c.activeAddrs[:0] // <-------------
for addr, node := range c.nodes {
if node.Generation() >= generation {
c.activeAddrs = append(c.activeAddrs, addr) // <-------------
if c.opt.RouteByLatency {
go node.updateLatency()
}
continue
}
delete(c.nodes, addr)
collected = append(collected, node)
}
c.mu.Unlock()
for _, node := range collected {
_ = node.Client.Close()
}
}
Read path
The function Addrs() returns a reference to a string slice. The returned addrs references either c.activeAddrs or c.addrs from the original object.
func (c *clusterNodes) Addrs() ([]string, error) {
var addrs []string
c.mu.RLock()
closed := c.closed //nolint:ifshort
if !closed {
if len(c.activeAddrs) > 0 {
addrs = c.activeAddrs // <-------------
} else {
addrs = c.addrs // <-------------
}
}
c.mu.RUnlock()
if closed {
return nil, pool.ErrClosed
}
if len(addrs) == 0 {
return nil, errClusterNoNodes
}
return addrs, nil // <-------------
}
Expected Behavior
There is no race condition between the "addrs" consumers and the GC function.
Current Behavior
WARNING: DATA RACE
Write at 0x00c00068a5d0 by goroutine 248:
github.com/redis/go-redis/v9.(*clusterNodes).GC()
/home/user/work/go-redis-shawnw/osscluster.go:521 +0x284
github.com/redis/go-redis/v9.newClusterState.func1()
/home/user/work/go-redis-shawnw/osscluster.go:690 +0x4c
Previous read at 0x00c00068a5d0 by goroutine 247:
github.com/redis/go-redis/v9.(*ClusterClient).loadState()
/home/user/work/go-redis-shawnw/osscluster.go:1221 +0x2be
github.com/redis/go-redis/v9.(*ClusterClient).loadState-fm()
<autogenerated>:1 +0x47
github.com/redis/go-redis/v9.(*clusterStateHolder).Reload()
/home/user/work/go-redis-shawnw/osscluster.go:848 +0x50
github.com/redis/go-redis/v9.(*clusterStateHolder).LazyReload.func1()
/home/user/work/go-redis-shawnw/osscluster.go:863 +0x8f
Possible Solution
Return a new copy of the addr slice instead of a reference to the original slice.
Steps to Reproduce
- Apply the attached patch: repro.patch
- Run the new test in the patch:
go test -race -v --timeout 1m -run "TestGinkgoSuite"
Context (Environment)
The race condition has been detected in the real Redis cluster and is also reproducible with the above steps.
Possible Implementation
One possible solution is to modify the func (c *clusterNodes) Addrs() ([]string, error) function to return a copied slice instead of a reference.
func (c *clusterNodes) Addrs() ([]string, error) {
...
if !closed {
if len(c.activeAddrs) > 0 {
addrs = c.activeAddrs // make a copy
} else {
addrs = c.addrs // make a copy
}
}
The fix might slightly increase the runtime of the Addrs() function, as copying the original slice generally takes more time than creating a simple reference. However, this impact appears to be negligible. Details are as follows:
- Benchmarking on an "AMD EPYC 7B13" host shows that copying a slice of 500 addresses takes approximately 2.5 microseconds.
- The existing callers of the Addr() function, such as ClusterClient.loadState, clusterNodes.Random, and ClusterClient.cmdsInfo, are not on the perf critical path. Therefore, adding a few microseconds is unlikely to have any significant impact.