Description
Bug Report
`LettuceConnectionFactory.SharedConnection#resetConnection` hangs forever and causes a deadlock
Current Behavior
I have enabled `validateConnection` on the Lettuce connection factory, and occasionally my service stops serving incoming requests entirely. The thread dump shows that all the HTTP threads are waiting for the connection:
Stack trace
// http threads, take one for example
"reactor-http-epoll-6" #126 daemon prio=5 os_prio=0 cpu=16164.68ms elapsed=26788.53s allocated=1510M defined_classes=693 tid=0x0000560e1cfc1000 nid=0x168b waiting for monitor entry [0x00007fdb977c2000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1295)
- waiting to lock <0x000000070a63d728> (a java.lang.Object)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
at reactor.core.publisher.MonoSupplier.call(MonoSupplier.java:85)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:224)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)
All of the HTTP threads are waiting for a lock that is held by the thread below:
"lettuce-epollEventLoop-5-1" #31 daemon prio=5 os_prio=0 cpu=7049.44ms elapsed=26823.40s allocated=1441M defined_classes=171 tid=0x0000560e1dd67000 nid=0x13de waiting on condition [0x00007fdbb8753000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00000007197dec70> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/Unknown Source)
at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/Unknown Source)
at java.util.concurrent.CompletableFuture.waitingGet([email protected]/Unknown Source)
at java.util.concurrent.CompletableFuture.join([email protected]/Unknown Source)
at org.springframework.data.redis.connection.lettuce.LettuceFutureUtils.join(LettuceFutureUtils.java:68)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider.release(LettuceConnectionProvider.java:74)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.release(LettuceConnectionFactory.java:1596)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.resetConnection(LettuceConnectionFactory.java:1360)
- locked <0x000000070a63d728> (a java.lang.Object)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.validateConnection(LettuceConnectionFactory.java:1346)
- locked <0x000000070a63d728> (a java.lang.Object)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1302)
- locked <0x000000070a63d728> (a java.lang.Object)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
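The thread holding the monitor is `lettuce-epollEventLoop-5-1`, a Lettuce event loop thread, and it is parked in `CompletableFuture.join`. One possible explanation, which I have not confirmed, is that the future returned by `releaseAsync` can only be completed by work scheduled on that same event loop thread, so the join can never return. The sketch below reproduces that pattern with plain JDK classes only. `EventLoopSelfBlockDemo` and `blocksItself` are hypothetical names, the single-threaded executor merely stands in for the event loop, and the wait is bounded so the demo terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EventLoopSelfBlockDemo {

    /**
     * Returns true if a task running on a single-threaded "event loop" timed out
     * while waiting for work that was queued on that same loop.
     */
    static boolean blocksItself(long timeoutMillis) {
        // Assumption: one thread plays the role of the Lettuce event loop.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Boolean> outcome = new CompletableFuture<>();
            eventLoop.execute(() -> {
                // Stand-in for releaseAsync(...): the completion is queued on the same loop.
                CompletableFuture<Void> closeFuture =
                        CompletableFuture.runAsync(() -> { }, eventLoop);
                try {
                    // Stand-in for the blocking join, bounded here so the demo ends.
                    closeFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    outcome.complete(false);
                } catch (TimeoutException e) {
                    // The queued work could not run because this thread was busy waiting for it.
                    outcome.complete(true);
                } catch (Exception e) {
                    outcome.completeExceptionally(e);
                }
            });
            return outcome.join();
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("self-blocked: " + blocksItself(200)); // prints "self-blocked: true"
    }
}
```

With an unbounded `join()` in place of the timed `get`, the task would wait forever, which matches the symptom in the dump. Whether this is what actually happens inside `release` is still an open question.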
Input Code
Our application uses WebFlux to handle API requests, but I found that `LettuceConnectionFactory` uses `synchronized` to protect `getConnection`:
// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#getConnection
@Nullable
StatefulConnection<E, E> getConnection() {
    synchronized (this.connectionMonitor) {
        if (this.connection == null) {
            this.connection = getNativeConnection();
        }
        if (getValidateConnection()) {
            validateConnection();
        }
        return this.connection;
    }
}
Inside `validateConnection`, the call to `resetConnection` hangs:
// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#validateConnection
void validateConnection() {
    synchronized (this.connectionMonitor) {
        boolean valid = false;
        if (connection != null && connection.isOpen()) {
            try {
                if (connection instanceof StatefulRedisConnection) {
                    ((StatefulRedisConnection) connection).sync().ping();
                }
                if (connection instanceof StatefulRedisClusterConnection) {
                    ((StatefulRedisClusterConnection) connection).sync().ping();
                }
                valid = true;
            } catch (Exception e) {
                log.debug("Validation failed", e);
            }
        }
        if (!valid) {
            log.info("Validation of shared connection failed. Creating a new connection.");
            // the line below hangs
            resetConnection();
            this.connection = getNativeConnection();
        }
    }
}
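Because both methods hold `connectionMonitor` while they block, every other caller of `getConnection` queues up behind the stuck thread with no way to give up. As an illustration only, and not how Spring Data Redis is implemented, the sketch below shows how a timed lock would let callers fail fast instead of waiting forever. `TimedConnectionGuard`, `withLock`, and `callerFailsFast` are hypothetical names, and only JDK classes are used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class TimedConnectionGuard {

    private final ReentrantLock lock = new ReentrantLock();

    /** Runs the action under the lock, or fails if the lock is not free in time. */
    <T> T withLock(long timeoutMillis, Supplier<T> action) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the connection lock", e);
        }
        if (!acquired) {
            throw new IllegalStateException(
                    "Connection lock not acquired within " + timeoutMillis + " ms");
        }
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    /** Demo: returns true if a caller fails fast while another thread is stuck holding the lock. */
    static boolean callerFailsFast() {
        TimedConnectionGuard guard = new TimedConnectionGuard();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // This thread simulates the holder that is stuck inside resetConnection.
        Thread stuck = new Thread(() -> guard.withLock(1000, () -> {
            held.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return null;
        }));
        stuck.start();
        try {
            held.await();
            guard.withLock(100, () -> "connection");
            return false; // not reached while the holder is stuck
        } catch (IllegalStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the simulated holder finish
        }
    }
}
```

A timed lock still blocks the calling thread for up to the timeout, so in a reactive application it only limits the damage and does not remove the blocking call itself.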
Expected behavior/code
`resetConnection` should complete within a bounded time and should never deadlock.
Environment
- Lettuce version(s): 6.1.2.RELEASE
- Redis version: 5.0.9
- SpringDataRedis: 2.5.1
Relevant Redis configuration:
spring.redis.cluster.nodes={{spring_redis_cluster_nodes}} // we have 6 nodes
spring.redis.password={{spring_redis_password}}
spring.redis.cluster.max-redirects=5
spring.redis.cluster.topology-refresh-interval=10
spring.redis.lettuce.pool.min-idle=500
spring.redis.lettuce.pool.max-active=5000
spring.redis.lettuce.pool.max-wait=-1
spring.redis.lettuce.pool.max-idle=1000
spring.redis.timeout=10000
spring.redis.database=0
Possible Solution
`org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider#release` appears to wait for the future indefinitely. Maybe a timeout could partially avoid this situation? I still don't know why `release` hangs.
default void release(StatefulConnection<?, ?> connection) {
    LettuceFutureUtils.join(releaseAsync(connection));
}
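A minimal sketch of the timeout idea, using only JDK classes. `BoundedJoin` and its `join(future, timeout)` method are hypothetical and are not part of `LettuceFutureUtils`. The exception type and the choice of timeout are assumptions as well.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedJoin {

    /** Hypothetical helper: waits for the future, but no longer than the given timeout. */
    static <T> T join(CompletableFuture<T> future, Duration timeout) {
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("Release did not complete within " + timeout, e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for release", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Release failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // An already completed future returns immediately.
        System.out.println(join(CompletableFuture.completedFuture("released"), Duration.ofSeconds(1)));

        // A future that never completes now fails after the timeout instead of hanging.
        try {
            join(new CompletableFuture<Void>(), Duration.ofMillis(100));
        } catch (IllegalStateException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

This would only be a partial fix: the calling thread still blocks for up to the timeout while holding `connectionMonitor`, and the underlying reason the release future never completes remains unexplained.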
Additional context
Reference
This issue was originally posted as redis/lettuce#1861.