Possible Race Condition during black-listing of hosts  #526

Closed
@srinathgit

Description

A. The following test was run on a 3-node cluster (rh7v-intel64-90-test-4/5/6.marklogic.com), with a database that has forests on each of the nodes.
B. The DMSDK job is started from a client machine. While loading is in progress, the client machine is disconnected from the servers (the VPN is disconnected), and after a while the VPN is reconnected.
C. Chronologically, the events are as follows:
At 20:44:18.082, [pool-1-thread-1] blacklists "rh7v-intel64-90-test-5.marklogic.com"
At 20:44:18.097, [main] blacklists "rh7v-intel64-90-test-6.marklogic.com"

Any forestConfig obtained after this time should contain neither "rh7v-intel64-90-test-5.marklogic.com" nor "rh7v-intel64-90-test-6.marklogic.com". However:

At 20:44:18.113, [pool-1-thread-1] uses hosts [rh7v-intel64-90-test-6.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] with forests for "WriteHostBatcher"
and
At 20:44:18.150, [main] uses hosts [rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] with forests for "WriteHostBatcher"
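
One interleaving that would produce host lists like the two above is a lost update: each thread reads the shared host list before the other has written its change, removes only its own failed host, and writes the result back. The sketch below replays that interleaving deterministically on a single thread. It is only an illustration of the suspected pattern, not the DMSDK implementation: `LostUpdateSketch`, `currentHosts`, and `without` are hypothetical names, and the host names are shortened.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class LostUpdateSketch {
    // Hypothetical stand-in for the batcher's current host list (not a real DMSDK type).
    static final AtomicReference<List<String>> currentHosts = new AtomicReference<>();

    // Returns a copy of hosts with the given host removed.
    static List<String> without(List<String> hosts, String host) {
        List<String> copy = new ArrayList<>(hosts);
        copy.remove(host);
        return copy;
    }

    // Replays the suspected interleaving: both callers read before either one writes.
    static List<String> racyInterleaving() {
        currentHosts.set(Arrays.asList("test-4", "test-5", "test-6"));
        List<String> seenByPool = currentHosts.get();     // pool-1-thread-1 reads all three hosts
        List<String> seenByMain = currentHosts.get();     // main reads all three hosts
        currentHosts.set(without(seenByPool, "test-5"));  // pool thread publishes [test-4, test-6]
        currentHosts.set(without(seenByMain, "test-6"));  // main overwrites it; test-5 is back
        return currentHosts.get();
    }

    // Same two removals, each applied atomically to the latest value.
    static List<String> atomicUpdates() {
        currentHosts.set(Arrays.asList("test-4", "test-5", "test-6"));
        currentHosts.updateAndGet(hosts -> without(hosts, "test-5"));
        currentHosts.updateAndGet(hosts -> without(hosts, "test-6"));
        return currentHosts.get();
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + racyInterleaving()); // [test-4, test-5]
        System.out.println("atomic: " + atomicUpdates());    // [test-4]
    }
}
```

In the racy replay the second write discards the first removal, so the already blacklisted host reappears in the list. Applying each removal to the latest value, as `updateAndGet` does here, keeps both.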

D. 20:44:39.245 [pool-1-thread-1] INFO c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5.marklogic.com] hosts with forests for "WriteHostBatcher"

At 20:44:39.245, when [pool-1-thread-1] blacklists "rh7v-intel64-90-test-4", the job should have stopped, because the other two hosts were already blacklisted and the number of available hosts at this point is 0. Instead, [pool-1-thread-1] appears to filter host "rh7v-intel64-90-test-4" out of the latest forest config, which is the one obtained by the [main] thread at 20:44:18.150 ([rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-4.marklogic.com]), and the job continues.
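
The expected behavior described above can be sketched as follows, assuming the set of blacklisted hosts is tracked cumulatively and the available hosts are always derived from the full host list rather than from whichever config was written last. `BlacklistSketch` and its methods are hypothetical and are not part of the DMSDK API; host names are shortened, and the minimum of 1 mirrors the `withMinHosts(1)` setting used in the test.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BlacklistSketch {
    private final List<String> allHosts;
    private final Set<String> blacklisted = new HashSet<>();
    private final int minHosts;

    BlacklistSketch(List<String> allHosts, int minHosts) {
        this.allHosts = allHosts;
        this.minHosts = minHosts;
    }

    // Records the host as blacklisted and returns the hosts that remain available.
    // Throws IllegalStateException when fewer than minHosts remain, which stands in
    // for stopping the job. synchronized keeps the add and the recomputation atomic.
    synchronized Set<String> blacklist(String host) {
        blacklisted.add(host);
        Set<String> available = new LinkedHashSet<>(allHosts);
        available.removeAll(blacklisted);
        if (available.size() < minHosts) {
            throw new IllegalStateException(
                "only " + available.size() + " host(s) available, minimum is " + minHosts);
        }
        return available;
    }

    public static void main(String[] args) {
        BlacklistSketch sketch =
            new BlacklistSketch(Arrays.asList("test-4", "test-5", "test-6"), 1);
        System.out.println(sketch.blacklist("test-5")); // [test-4, test-6]
        System.out.println(sketch.blacklist("test-6")); // [test-4]
        try {
            sketch.blacklist("test-4");
        } catch (IllegalStateException e) {
            System.out.println("job would stop: " + e.getMessage());
        }
    }
}
```

Because every call recomputes availability from the full host list and the accumulated blacklist, the third call sees 0 available hosts regardless of the order in which the first two arrived.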
A log snippet is below; the complete log is available in exception.txt.

20:44:06.444 [main] INFO  c.m.c.d.impl.WriteBatcherImpl - Adding DatabaseClient on port 8000 for host "rh7v-intel64-90-test-5.marklogic.com" to the rotation
20:44:06.819 [main] INFO  c.m.c.d.impl.WriteBatcherImpl - Adding DatabaseClient on port 8000 for host "rh7v-intel64-90-test-6.marklogic.com" to the rotation
20:44:06.819 [main] INFO  c.m.c.d.impl.WriteBatcherImpl - Adding DatabaseClient on port 8000 for host "rh7v-intel64-90-test-4.marklogic.com" to the rotation
20:44:06.179 [main] INFO  c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-6.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] hosts with forests for "WriteHostBatcher"
20:44:18.082 [pool-1-thread-1] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-5.marklogic.com", black-listing it for PT15S
20:44:18.097 [main] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-6.marklogic.com", black-listing it for PT15S
20:44:18.113 [pool-1-thread-1] INFO  c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-6.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] hosts with forests for "WriteHostBatcher"
20:44:18.150 [main] INFO  c.m.c.d.impl.WriteBatcherImpl - Adding DatabaseClient on port 8000 for host "rh7v-intel64-90-test-5.marklogic.com" to the rotation
20:44:18.150 [main] INFO  c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] hosts with forests for "WriteHostBatcher"
20:44:39.245 [pool-1-thread-1] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-4", black-listing it for PT15S
20:44:39.245 [pool-1-thread-1] INFO  c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5.marklogic.com] hosts with forests for "WriteHostBatcher"
20:44:54.422 [pool-4-thread-1] INFO  c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-6.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] hosts with forests for "WriteHostBatcher"
20:44:54.422 [pool-4-thread-1] INFO  c.m.c.d.impl.WriteBatcherImpl - Adding DatabaseClient on port 8000 for host "rh7v-intel64-90-test-6.marklogic.com" to the rotation
20:44:54.422 [pool-4-thread-1] INFO  c.m.c.d.impl.WriteBatcherImpl - Adding DatabaseClient on port 8000 for host "rh7v-intel64-90-test-4.marklogic.com" to the rotation

Test:

@Test
public void testFailOver() throws Exception {
    try {
        final String query1 = "fn:count(fn:doc())";

        final AtomicInteger successCount = new AtomicInteger(0);

        final MutableBoolean failState = new MutableBoolean(false);
        final AtomicInteger failCount = new AtomicInteger(0);

        WriteBatcher ihb2 = dmManager.newWriteBatcher();
        ihb2.withBatchSize(20);
        //ihb2.withThreadCount(120);
        dmManager.startJob(ihb2);

        ihb2.setBatchFailureListeners(
            new HostAvailabilityListener(dmManager)
                .withSuspendTimeForHostUnavailable(Duration.ofSeconds(15))
                .withMinHosts(1)
        );
        ihb2.onBatchSuccess(
            (client, batch) -> {
                successCount.addAndGet(batch.getItems().length);
                System.out.println("Success Host: " + client.getHost());
                System.out.println("Success batch number: " + batch.getJobBatchNumber());
                System.out.println("Success Job writes so far: " + batch.getJobWritesSoFar());
            }
        )
        .onBatchFailure(
            (client, batch, throwable) -> {
                System.out.println("Failed batch number: " + batch.getJobBatchNumber());
                /*try {
                    System.out.println("Retrying batch: " + batch.getJobBatchNumber());
                    ihb2.retry(batch);
                }
                catch (Exception e) {
                    System.out.println("Retry of batch " + batch.getJobBatchNumber() + " failed");
                    e.printStackTrace();
                }*/

                throwable.printStackTrace();
                failState.setTrue();
                failCount.addAndGet(batch.getItems().length);
            });

        for (int j = 0; j < 20000; j++) {
            String uri = "/local/ABC-" + j;
            ihb2.add(uri, stringHandle);
        }

        ihb2.flushAndWait();

        System.out.println("Fail : " + failCount.intValue());
        System.out.println("Success : " + successCount.intValue());
        System.out.println("Count : " + dbClient.newServerEval().xquery(query1).eval().next().getNumber().intValue());

        Assert.assertTrue(dbClient.newServerEval().xquery(query1).eval().next().getNumber().intValue() == 20000);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
