connect only to primary host for load balancer scenario #999 #1001

Merged 10 commits into 4.1.1-develop on Sep 25, 2018

@ehennum ehennum commented Sep 17, 2018

To create a DataMovementManager for use with a load balancer, the factory call should be:

newDataMovementManager(DatabaseClient.ConnectionPolicy.PRIMARY_HOST)

No FilteredForestConfiguration can be used when connecting only to the primary host, so a whitelist is neither possible nor needed. The built-in HostAvailabilityListener and NoResponseListener should work in failover situations.

Please consider whether the design changes are usable and test in an ALB environment.

If the changes are good, please merge into 4.1.1-develop.

@ehennum ehennum added this to the java-client-api-4.1.1 milestone Sep 17, 2018
@ehennum ehennum requested a review from srinathgit September 17, 2018 16:23
@ehennum ehennum requested a review from anu3990 September 17, 2018 18:01
@srinathgit

@ehennum ,

So, when a MarkLogic host is down, will the user still have to remove 'HostAvailabilityListener' and 'NoResponseListener' and add a custom listener based on the exception they get from their load balancer?

ehennum commented Sep 18, 2018

@srinathgit, no, it should no longer be necessary to remove HostAvailabilityListener or NoResponseListener.

At least, that's the intent of the change to the getHostUnavailableExceptions() method:

https://github.com/marklogic/java-client-api/pull/1001/files#diff-e40b825725298441567c39d40f59a6e1L157

If anything, the opposite holds: to detect a failure of the load balancer itself, adopters will need to add a custom listener for the specific errors generated by their preferred load balancer.

Possibly, as a failsafe in the primary host case, the batcher should keep a count of the number of retry efforts on the job and fail if it exceeds a limit.
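
As a rough illustration of the retry-limit idea above, here is a minimal sketch of a job-wide retry counter. `RetryBudget` and its methods are hypothetical names, not part of the Java Client API, and nothing here reflects how the batcher is actually implemented; the assumption is only that a failure listener would call `tryAcquire()` before each retry and stop the job once it returns false.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of the Java Client API): caps retries across a whole job. */
final class RetryBudget {
    private final int maxRetries;
    private final AtomicInteger used = new AtomicInteger();

    RetryBudget(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    /** Claims one retry; returns false once the job-wide limit has been reached. */
    boolean tryAcquire() {
        while (true) {
            int current = used.get();
            if (current >= maxRetries) {
                return false; // limit reached: the caller should fail the job
            }
            if (used.compareAndSet(current, current + 1)) {
                return true;  // this thread claimed the next retry slot
            }
            // another batch thread won the race; re-read the counter and try again
        }
    }

    /** Number of retries claimed so far. */
    int used() {
        return used.get();
    }
}
```

The compare-and-set loop keeps the count accurate when several batch threads fail at the same time, which is the case a job-level limit has to handle.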

srinathgit commented Sep 18, 2018

@ehennum, in a 3-node AWS cluster behind an ALB, I ran a write batcher job. If a MarkLogic instance is shut down during the job (to force a forest failover) and a request is sent to that host, a "FailedRequestException" is thrown, which is not part of the "hostUnavailableExceptions" list. The "withHostUnavailableExceptions" method lets us specify the list of host-unavailable exceptions, but "FailedRequestException" can't be added to that list because it is too generic and could be thrown for other valid reasons. So, in this scenario, the job is not shut down and the failed batches are not retried.

20:37:02.848 [pool-2-thread-17] WARN  c.m.c.d.impl.WriteBatcherImpl - Error writing batch: com.marklogic.client.FailedRequestException: Local message: failed to apply resource at documents: Bad Gateway. Server Message: Server (not a REST instance?) did not respond with an expected REST Error message.
20:37:02.848 [pool-2-thread-17] DEBUG c.m.client.impl.OkHttpServices - Posting documents
20:37:02.848 [pool-2-thread-17] DEBUG c.m.client.impl.OkHttpServices - Sending multipart for /v1/documents
com.marklogic.client.FailedRequestException: Local message: failed to apply resource at documents: Bad Gateway. Server Message: Server (not a REST instance?) did not respond with an expected REST Error message.
	at com.marklogic.client.impl.OkHttpServices.checkStatus(OkHttpServices.java:4327)
	at com.marklogic.client.impl.OkHttpServices.postResource(OkHttpServices.java:3400)
	at com.marklogic.client.impl.OkHttpServices.postBulkDocuments(OkHttpServices.java:3480)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:614)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:606)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.datamovement.impl.WriteBatcherImpl$BatchWriter.run(WriteBatcherImpl.java:1060)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

In an on-prem 3-node MarkLogic cluster, if a MarkLogic instance is shut down during the same job, java.net.SocketException is thrown, which is part of the "hostUnavailableExceptions" list.

20:20:34.074 [pool-2-thread-8] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-18.marklogic.com", black-listing it for PT10M
com.marklogic.client.MarkLogicIOException: java.net.SocketException: Connection reset
	at com.marklogic.client.impl.OkHttpServices.sendRequestOnce(OkHttpServices.java:708)
	at com.marklogic.client.impl.OkHttpServices.sendRequestOnce(OkHttpServices.java:700)
	at com.marklogic.client.impl.OkHttpServices.doPost(OkHttpServices.java:4071)
	at com.marklogic.client.impl.OkHttpServices.postResource(OkHttpServices.java:3372)
	at com.marklogic.client.impl.OkHttpServices.postBulkDocuments(OkHttpServices.java:3480)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:614)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:606)
	at com.marklogic.client.impl.GenericDocumentImpl.write(GenericDocumentImpl.java:1)
	at com.marklogic.client.datamovement.impl.WriteBatcherImpl$BatchWriter.run(WriteBatcherImpl.java:1060)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at okio.Okio$2.read(Okio.java:140)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
	at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
	at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
	at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
	at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at com.burgstaller.okhttp.AuthenticationCacheInterceptor.intercept(AuthenticationCacheInterceptor.java:45)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
	at okhttp3.RealCall.execute(RealCall.java:77)
	at com.marklogic.client.impl.OkHttpServices.sendRequestOnce(OkHttpServices.java:706)
	... 14 common frames omitted

So, if "FailedRequestException" is going to be thrown, shouldn't a CustomHostAvailabilityListener be used that checks for both "FailedRequestException" and the exception message, so that legitimate failed requests are not retried?
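
To make the question concrete, here is a minimal sketch of the kind of check such a custom listener might perform. It is self-contained: the `FailedRequestException` class below is a local stand-in for `com.marklogic.client.FailedRequestException`, and `GatewayFailureClassifier.isRetryable` is a hypothetical name. Matching on the "Bad Gateway" text from the log above is an assumption and is fragile, since other load balancers may word the error differently.

```java
import java.net.SocketException;

/** Local stand-in for com.marklogic.client.FailedRequestException so the sketch compiles alone. */
class FailedRequestException extends RuntimeException {
    FailedRequestException(String message) {
        super(message);
    }
}

/** Hypothetical classifier: decides whether a batch failure looks like a host or gateway outage. */
final class GatewayFailureClassifier {
    private static final int MAX_CAUSE_DEPTH = 32; // guards against cyclic cause chains

    private GatewayFailureClassifier() {}

    static boolean isRetryable(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < MAX_CAUSE_DEPTH; t = t.getCause(), depth++) {
            // Direct connection: the socket-level failure seen in the on-prem trace.
            if (t instanceof SocketException) {
                return true;
            }
            // Behind a load balancer: only treat the generic exception as an outage
            // when its message carries the 502 text (assumed marker, see lead-in).
            if (t instanceof FailedRequestException) {
                String message = t.getMessage();
                if (message != null && message.contains("Bad Gateway")) {
                    return true;
                }
            }
        }
        return false; // anything else is treated as a legitimate failure and not retried
    }
}
```

Walking the cause chain matters because the on-prem trace wraps the SocketException inside a MarkLogicIOException.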

ehennum commented Sep 18, 2018

@srinathgit , thanks for doing the comparison. It sounds like the Java Client API receives a different HTTP response in the ALB scenario. If so, can you report the difference in the HTTP response that causes the same server event to throw a FailedRequestException instead of a SocketException?

The only change in the HostAvailabilityListener behavior in the primary host case is that it skips the host refresh that culminates in the invalidation of retry on this line:

srinathgit commented Sep 18, 2018

@ehennum, I printed the HTTP response for the ALB scenario. After the instance is stopped, this is the response I get:

Response{protocol=http/1.1, code=502, message=Bad Gateway, url=http://srinath-q-elasticl-a1rzued8pwa4-882617851.us-east-2.elb.amazonaws.com:8008/v1/documents}

Response body:

<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>

@srinathgit

@ehennum ,

I spoke with @mattsunsjf ,

  1. For requests to the MarkLogic instance that was shut down, the ELB can send only an "HTTP code=502, message=Bad Gateway" response. It cannot act as a pass-through and send the response it got to the client.
  2. Regarding requests being sent to the downed instance, Matt will open a task in JIRA and look further into what the right behavior is. In my case, I had downed the instance that had the "Security" db, so I was getting "Bad Gateway" for every request. Nevertheless, even when an instance that doesn't have the "Security" db is taken down, there were cases where requests were sent to it.

ehennum commented Sep 18, 2018

@srinathgit, thanks for digging deeper. It's useful to know that no pass-through configuration is possible but that redirection from a downed instance might be possible.

Does the load balancer still send 502 errors only when the Security db is unavailable or in all cases? That is, if all forests have failed over correctly, does the load balancer send different errors that would reliably indicate a retry condition for the client?

@ehennum ehennum merged commit 1fa4390 into 4.1.1-develop Sep 25, 2018
@ehennum ehennum deleted the LoadBalancer branch September 25, 2018 17:13