Reactive Redis hangs forever and causes a deadlock  #2179

Open
@coney

Bug Report

LettuceConnectionFactory.SharedConnection#resetConnection hangs forever and causes a deadlock

Current Behavior

I have enabled validateConnection on the Lettuce connection factory, and occasionally my service stops serving incoming requests. The thread dump shows that all the HTTP threads are waiting for the connection:

Stack trace
// http threads, take one for example
"reactor-http-epoll-6" #126 daemon prio=5 os_prio=0 cpu=16164.68ms elapsed=26788.53s allocated=1510M defined_classes=693 tid=0x0000560e1cfc1000 nid=0x168b waiting for monitor entry  [0x00007fdb977c2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1295)
	- waiting to lock <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
	at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
	at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
	at reactor.core.publisher.MonoSupplier.call(MonoSupplier.java:85)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:224)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)

All of the HTTP threads are waiting for a lock that is held by the thread below:

"lettuce-epollEventLoop-5-1" #31 daemon prio=5 os_prio=0 cpu=7049.44ms elapsed=26823.40s allocated=1441M defined_classes=171 tid=0x0000560e1dd67000 nid=0x13de waiting on condition  [0x00007fdbb8753000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007197dec70> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(Unknown Source)
	at java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
	at java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
	at java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
	at java.util.concurrent.CompletableFuture.join(Unknown Source)
	at org.springframework.data.redis.connection.lettuce.LettuceFutureUtils.join(LettuceFutureUtils.java:68)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider.release(LettuceConnectionProvider.java:74)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.release(LettuceConnectionFactory.java:1596)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.resetConnection(LettuceConnectionFactory.java:1360)
	- locked <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.validateConnection(LettuceConnectionFactory.java:1346)
	- locked <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1302)
	- locked <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
	at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
	at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
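
The two traces share one pattern: one thread waits on a future while it still owns the monitor, so every other thread that needs the monitor is reported as BLOCKED. The following is a minimal sketch of that pattern only, not of the Spring Data Redis code. The class name, the thread names, and the `release` future (standing in for whatever `releaseAsync` returns) are hypothetical, and the sketch says nothing about why the real future never completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class MonitorHeldWhileWaiting {

    // Stands in for SharedConnection's connectionMonitor (hypothetical here).
    static final Object connectionMonitor = new Object();

    // Returns the state observed for a second thread while the first thread
    // waits on a future inside the synchronized block.
    static Thread.State observeWaiterState() {
        CompletableFuture<Void> release = new CompletableFuture<>();
        CountDownLatch holderHasLock = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (connectionMonitor) {
                holderHasLock.countDown();
                release.join(); // waits while still owning the monitor
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (connectionMonitor) {
                // a real caller would fetch the shared connection here
            }
        }, "waiter");

        try {
            holder.start();
            holderHasLock.await();
            waiter.start();

            // Poll until the waiter is parked on the monitor (give up after ~2 s).
            Thread.State state = waiter.getState();
            for (int i = 0; i < 200 && state != Thread.State.BLOCKED; i++) {
                Thread.sleep(10);
                state = waiter.getState();
            }

            release.complete(null); // unblock the holder so the demo terminates
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state: " + observeWaiterState());
    }
}
```

In the sketch the future is completed by the main thread so the program exits. In the dump above nothing completes it, so the waiting threads never get the monitor.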

Input Code

Our application uses WebFlux to handle API requests, but I found that LettuceConnectionFactory uses `synchronized` to guard getConnection:
// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#getConnection
@Nullable
StatefulConnection<E, E> getConnection() {

	synchronized (this.connectionMonitor) {

		if (this.connection == null) {
			this.connection = getNativeConnection();
		}

		if (getValidateConnection()) {
			validateConnection();
		}

		return this.connection;
	}
}

Inside validateConnection, the call to resetConnection hangs:

// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#validateConnection
		void validateConnection() {

			synchronized (this.connectionMonitor) {

				boolean valid = false;

				if (connection != null && connection.isOpen()) {
					try {

						if (connection instanceof StatefulRedisConnection) {
							((StatefulRedisConnection) connection).sync().ping();
						}

						if (connection instanceof StatefulRedisClusterConnection) {
							((StatefulRedisClusterConnection) connection).sync().ping();
						}
						valid = true;
					} catch (Exception e) {
						log.debug("Validation failed", e);
					}
				}

				if (!valid) {

					log.info("Validation of shared connection failed. Creating a new connection.");
					// the line below hangs
					resetConnection();
					this.connection = getNativeConnection();
				}
			}
		}

Expected behavior/code

resetConnection should complete within a bounded time and should not cause a deadlock.

Environment

  • Lettuce version(s): 6.1.2.RELEASE
  • Redis version: 5.0.9
  • SpringDataRedis: 2.5.1

redis relevant configuration:

# we have 6 nodes
spring.redis.cluster.nodes={{spring_redis_cluster_nodes}}
spring.redis.password={{spring_redis_password}}
spring.redis.cluster.max-redirects=5
spring.redis.cluster.topology-refresh-interval=10
spring.redis.lettuce.pool.min-idle=500
spring.redis.lettuce.pool.max-active=5000
spring.redis.lettuce.pool.max-wait=-1
spring.redis.lettuce.pool.max-idle=1000
spring.redis.timeout=10000
spring.redis.database=0 

Possible Solution

org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider#release appears to wait for the future indefinitely. A timeout might partially mitigate this situation. I still don't know why release hangs.

	default void release(StatefulConnection<?, ?> connection) {
		LettuceFutureUtils.join(releaseAsync(connection));
	}
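
A minimal sketch of the timeout idea, assuming a bounded wait on the future is acceptable. The `BoundedJoin` class, its `join` helper, and the 200 ms value are hypothetical and are not part of Spring Data Redis or Lettuce. The sketch only shows the general technique of replacing an unbounded `CompletableFuture.join()` with `get(timeout, unit)`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedJoin {

    // Hypothetical helper: wait for the future, but give up after the timeout
    // instead of parking the calling thread indefinitely.
    static <T> T join(CompletableFuture<T> future, long timeout, TimeUnit unit) {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            throw new IllegalStateException("Timed out after " + timeout + " " + unit, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Stands in for a release future that never completes.
        CompletableFuture<Void> neverCompletes = new CompletableFuture<>();
        try {
            join(neverCompletes, 200, TimeUnit.MILLISECONDS);
        } catch (IllegalStateException e) {
            System.out.println("release gave up: " + e.getMessage());
        }
    }
}
```

A timeout like this would only bound how long the monitor is held. It would not explain or fix whatever prevents the release future from completing.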

Additional context

stacktrace.zip

Reference

The original issue was posted on redis/lettuce#1861
