Skip to content

Checkpointing logic resulting in lost messages in a clustered environment (M2 release) #90

Closed
@s-porpoise

Description

@s-porpoise

I'm using batch checkpoint mode in a clustered environment. All works fine with the M1 release, but in M2 I am seeing messages occasionally get dropped.

I think the culprit is the change in commit 2b57f34 to return the value of "replace(key, oldVal, newVal)" - whereas it previously always returned true when the existingSequence < sequenceNumber, but I'm not too familiar with the code so I can't be sure this is it. I'm thinking it could be a race condition in ShardCheckpointer where the value stored in DynamoDB changes between the call to getCheckpoint( ) and checkpointStore.replace(...), causing the method to return false.

All I can say for sure if that I can switch between version M1 and M2 and in a load test of ~4500 messages I will usually have a handful of messages dropped (~40). Switching back to the M1 release resolves the issue. Unfortunately this is manual performance testing that I haven't automated, so I don't have a simple test case to share.

In my clustere used case I'm happy for my application to receive some duplicates but obviously not for messages to be dropped. There's a comment in the code about an upcoming "shard leader election implementation", so I guess there are more changes coming to this area?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions