Description
I'm using batch checkpoint mode in a clustered environment. All works fine with the M1 release, but in M2 I am seeing messages occasionally get dropped.
I think the culprit is the change in commit 2b57f34 to return the value of "replace(key, oldVal, newVal)" - whereas it previously always returned true when the existingSequence < sequenceNumber, but I'm not too familiar with the code so I can't be sure this is it. I'm thinking it could be a race condition in ShardCheckpointer where the value stored in DynamoDB changes between the call to getCheckpoint( ) and checkpointStore.replace(...), causing the method to return false.
All I can say for sure if that I can switch between version M1 and M2 and in a load test of ~4500 messages I will usually have a handful of messages dropped (~40). Switching back to the M1 release resolves the issue. Unfortunately this is manual performance testing that I haven't automated, so I don't have a simple test case to share.
In my clustere used case I'm happy for my application to receive some duplicates but obviously not for messages to be dropped. There's a comment in the code about an upcoming "shard leader election implementation", so I guess there are more changes coming to this area?