Checkpointing logic resulting in lost messages in a clustered environment (M2 release)

I'm using batch checkpoint mode in a clustered environment. All works fine with the M1 release, but in M2 I am seeing messages occasionally get dropped.

I think the culprit is the change in commit 2b57f34801df625587438536c66e994524899b4f to return the value of "replace(key, oldVal, newVal)" - whereas it previously always returned true when the existingSequence < sequenceNumber, but I'm not too familiar with the code so I can't be sure this is it. I'm thinking it could be a race condition in ShardCheckpointer where the value stored in DynamoDB changes between the call to getCheckpoint( ) and checkpointStore.replace(...), causing the method to return false.

All I can say for sure if that I can switch between version M1 and M2 and in a load test of ~4500 messages I will usually have a handful of messages dropped (~40). Switching back to the M1 release resolves the issue. Unfortunately this is manual performance testing that I haven't automated, so I don't have a simple test case to share.

In my clustere used case I'm happy for my application to receive some duplicates but obviously not for messages to be dropped. There's a comment in the code about an upcoming "shard leader election implementation", so I guess there are more changes coming to this area?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Checkpointing logic resulting in lost messages in a clustered environment (M2 release) #90

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Checkpointing logic resulting in lost messages in a clustered environment (M2 release) #90

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions