connect only to primary host for load balancer scenario #999 #1001

ehennum · 2018-09-17T16:23:52Z

To create a DataMovementManager for use with a load balancer, the factory call should be:

newDataMovementManager(DatabaseClient.ConnectionPolicy.PRIMARY_HOST)

No FilteredForestConfiguration can be used when connecting only to the primary host. Thus, no whitelist should be possible or needed. The built-in HostAvailabilityListener and NoResponseListener should work in failover situations.

Please consider whether the design changes are usable and test in an ALB environment.

If the changes are good, please merge into 4.1.1-develop.

...gic-client-api/src/main/java/com/marklogic/client/datamovement/HostAvailabilityListener.java

srinathgit · 2018-09-17T18:28:58Z

@ehennum ,

So, in the case where a MarkLogic host is down, will the user still have to remove 'HostAvailabilityListener' and 'NoResponseListener' and add a custom listener based on the exception they get from their load balancer?

ehennum · 2018-09-18T02:03:56Z

@srinathgit, no, it should no longer be necessary to remove HostAvailabilityListener or NoResponseListener.

At least, that's the intent of the change to the getHostUnavailableExceptions() method:

https://github.com/marklogic/java-client-api/pull/1001/files#diff-e40b825725298441567c39d40f59a6e1L157

Rather the opposite: to detect a failure of the load balancer itself, adopters will need to add a custom listener for the specific errors generated by their preferred load balancer.

Possibly, as a failsafe in the primary host case, the batcher should keep a count of the number of retry efforts on the job and fail if it exceeds a limit.
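
A minimal sketch of the failsafe idea above, assuming a job-wide counter that a failure listener consults before each retry. The class name RetryBudget and its methods are hypothetical and are not part of the Java Client API; how the batcher would actually wire this in is left open.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of the Java Client API): counts retry
 * attempts across a whole job and reports when the limit is exceeded.
 */
final class RetryBudget {
    private final int maxRetries;
    // AtomicInteger because batcher threads may fail concurrently.
    private final AtomicInteger attempts = new AtomicInteger();

    RetryBudget(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    /** Records one retry attempt; returns true while the job may keep retrying. */
    boolean tryAcquire() {
        return attempts.incrementAndGet() <= maxRetries;
    }

    /** Number of retry attempts recorded so far, including refused ones. */
    int attempts() {
        return attempts.get();
    }
}
```

A failure listener could call tryAcquire() before retrying a batch and fail the job once it returns false, so that a permanently broken gateway does not produce endless retries.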

srinathgit · 2018-09-18T03:31:11Z

@ehennum, in a 3-node AWS cluster with an ALB, I ran a write batcher job. During the job, if a MarkLogic instance is shut down (to force a forest failover) and a request is sent to that host, a "FailedRequestException" is thrown, which is not part of the "hostUnavailableExceptions" list. There is a method "withHostUnavailableExceptions" where we can specify the list of host-unavailable exceptions; however, "FailedRequestException" can't be added to the list because it is too generic and could be thrown for other valid reasons. So, in this scenario, the job is not shut down and the failed batches are not retried.

20:37:02.848 [pool-2-thread-17] WARN  c.m.c.d.impl.WriteBatcherImpl - Error writing batch: com.marklogic.client.FailedRequestException: Local message: failed to apply resource at documents: Bad Gateway. Server Message: Server (not a REST instance?) did not respond with an expected REST Error message.
20:37:02.848 [pool-2-thread-17] DEBUG c.m.client.impl.OkHttpServices - Posting documents
20:37:02.848 [pool-2-thread-17] DEBUG c.m.client.impl.OkHttpServices - Sending multipart for /v1/documents
com.marklogic.client.FailedRequestException: Local message: failed to apply resource at documents: Bad Gateway. Server Message: Server (not a REST instance?) did not respond with an expected REST Error message.
	at com.marklogic.client.impl.OkHttpServices.checkStatus(OkHttpServices.java:4327)
	at com.marklogic.client.impl.OkHttpServices.postResource(OkHttpServices.java:3400)
	at com.marklogic.client.impl.OkHttpServices.postBulkDocuments(OkHttpServices.java:3480)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:614)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:606)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.datamovement.impl.WriteBatcherImpl$BatchWriter.run(WriteBatcherImpl.java:1060)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

In an on-prem 3-node MarkLogic cluster, during the same job, if a MarkLogic instance is shut down, java.net.SocketException is thrown, which is part of the "hostUnavailableExceptions" list.

20:20:34.074 [pool-2-thread-8] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-18.marklogic.com", black-listing it for PT10M
com.marklogic.client.MarkLogicIOException: java.net.SocketException: Connection reset
	at com.marklogic.client.impl.OkHttpServices.sendRequestOnce(OkHttpServices.java:708)
	at com.marklogic.client.impl.OkHttpServices.sendRequestOnce(OkHttpServices.java:700)
	at com.marklogic.client.impl.OkHttpServices.doPost(OkHttpServices.java:4071)
	at com.marklogic.client.impl.OkHttpServices.postResource(OkHttpServices.java:3372)
	at com.marklogic.client.impl.OkHttpServices.postBulkDocuments(OkHttpServices.java:3480)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:614)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:606)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.datamovement.impl.WriteBatcherImpl$BatchWriter.run(WriteBatcherImpl.java:1060)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at okio.Okio$2.read(Okio.java:140)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
	at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
	at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
	at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
	at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at com.burgstaller.okhttp.AuthenticationCacheInterceptor.intercept(AuthenticationCacheInterceptor.java:45)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
	at okhttp3.RealCall.execute(RealCall.java:77)
	at com.marklogic.client.impl.OkHttpServices.sendRequestOnce(OkHttpServices.java:706)
	... 14 common frames omitted

So, if "FailedRequestException" is going to be thrown, shouldn't a CustomHostAvailabilityListener be used that checks for both "FailedRequestException" and the exception message, so that legitimate failed requests are not retried?
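
To make the question concrete, here is a rough sketch of the kind of check such a custom listener might perform: it treats a failure as a gateway problem only when both the exception type and the message match. The FailedRequestException class below is a local stand-in for com.marklogic.client.FailedRequestException so the snippet compiles on its own, and GatewayFailureCheck is a hypothetical name. Matching on the "Bad Gateway" text is an assumption taken from the log above, not a documented contract.

```java
/** Local stand-in for com.marklogic.client.FailedRequestException (assumption for this sketch). */
class FailedRequestException extends RuntimeException {
    FailedRequestException(String message) {
        super(message);
    }
}

/** Hypothetical helper: decides whether a failure looks like a load balancer 502. */
final class GatewayFailureCheck {
    private GatewayFailureCheck() {}

    /**
     * True only for a FailedRequestException whose message mentions
     * "Bad Gateway"; any other FailedRequestException is treated as a
     * legitimate failure and is not a retry candidate.
     */
    static boolean isGatewayFailure(Throwable failure) {
        if (!(failure instanceof FailedRequestException)) {
            return false;
        }
        String message = failure.getMessage();
        return message != null && message.contains("Bad Gateway");
    }
}
```

With this check, the exception from the ALB log would qualify for retry, while a FailedRequestException raised for an ordinary bad request would not.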

ehennum · 2018-09-18T16:09:19Z

@srinathgit , thanks for doing the comparison. It sounds like the Java Client API receives a different HTTP response in the ALB scenario. If so, can you report the difference in the HTTP response that causes the same server event to throw a FailedRequestException instead of a SocketException?

The only change in the HostAvailabilityListener behavior in the primary host case is that it skips the host refresh that culminates in the invalidation of retry on this line:

java-client-api/marklogic-client-api/src/main/java/com/marklogic/client/datamovement/HostAvailabilityListener.java

Line 299 in 431c1a2

shouldWeRetry = false;

srinathgit · 2018-09-18T20:02:00Z

@ehennum, I printed the HTTP response for the ALB scenario. After the instance is stopped, I get the following response:

Response{protocol=http/1.1, code=502, message=Bad Gateway, url=http://srinath-q-elasticl-a1rzued8pwa4-882617851.us-east-2.elb.amazonaws.com:8008/v1/documents}

Response body:

<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>

srinathgit · 2018-09-18T21:51:18Z

@ehennum ,

I spoke with @mattsunsjf ,

The ELB can send only an "HTTP code=502, message=Bad Gateway" response for requests to the MarkLogic instance that was shut down. It cannot act as a pass-through and send the response it got to the client.
Regarding requests being sent to the downed instance, Matt will open a task in JIRA and look further to see what the right behavior is. In my case, I had downed the instance that had the "Security" db, so I was getting "Bad Gateway" for every request. Nevertheless, even when an instance that doesn't have the "Security" db is taken down, there were cases where requests were sent to it.
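
Since the client only ever sees the status code in this case, one option is to classify on the code itself. A later commit on this PR is titled "always retry on 502 or 504"; the sketch below illustrates that rule in isolation. The class and method names are hypothetical and this is not the code from that commit.

```java
/** Hypothetical helper: classifies HTTP status codes returned through a gateway. */
final class GatewayStatus {
    static final int BAD_GATEWAY = 502;
    static final int GATEWAY_TIMEOUT = 504;

    private GatewayStatus() {}

    /** True for the two gateway statuses that are treated as retry conditions. */
    static boolean isRetryable(int statusCode) {
        return statusCode == BAD_GATEWAY || statusCode == GATEWAY_TIMEOUT;
    }
}
```

Checking the numeric code avoids depending on the wording of the HTML error page, which the load balancer generates and which may differ between products.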

ehennum · 2018-09-18T22:25:12Z

@srinathgit, thanks for digging deeper. It's useful to know that no pass-through configuration is possible but that redirection from a downed instance might be possible.

Does the load balancer still send 502 errors only when the Security db is unavailable or in all cases? That is, if all forests have failed over correctly, does the load balancer send different errors that would reliably indicate a retry condition for the client?

connect only to primary host for load balancer scenario #999

538b01e

ehennum added this to the java-client-api-4.1.1 milestone Sep 17, 2018

ehennum requested a review from srinathgit September 17, 2018 16:23

srinathgit reviewed Sep 17, 2018

...gic-client-api/src/main/java/com/marklogic/client/datamovement/HostAvailabilityListener.java

ehennum requested a review from anu3990 September 17, 2018 18:01

anu3990 approved these changes Sep 17, 2018

minimum hosts must be 1 for primary host connection #999

431c1a2

ehennum added 8 commits September 19, 2018 18:16

distribute-timestamps of cluster and move property to common #994

c833115

specify connection type as direct or gateway on client #999

689db52

always retry on 502 or 504 so ALB / ELB configuration is optional #999

f2632ab

get forest configuration even in load balancer case #999

7d1821b

outer retry on changed forests in load balancer case #999

a83a762

correction to test for forest difference #999

0adb450

revert to inner retry #999

6d418c2

reword log messages to identify hosts without implying connection #999

8168ea9

ehennum merged commit 1fa4390 into 4.1.1-develop Sep 25, 2018

ehennum deleted the LoadBalancer branch September 25, 2018 17:13
