Skip to content

Worker stuck after "Protocol out of sync" #1744

Closed
@isamaru

Description

@isamaru

After switching to 1.4.5 from an older 1.3 version, we see some workers getting stuck with this pattern in the logs:

  1. "Protocol out of sync" with no previous errors
  2. "Closing connection. KafkaConnectionError: Socket EVENT_READ without in-flight-requests"
  3. "[kafka.client] Node 2 connection failed -- refreshing metadata"
  4. 5 minutes later, worker starts looping out infinite "Duplicate close() with error: [Error 7] RequestTimedOutError: Request timed out after 305000 ms", but with no apparent attempts to reconnect

More detailed (INFO) logs here:
kafka-python-logs.txt

We found a worker which was stuck like this for 2 days, processing no messages but not failing directly or even rebalancing the group, causing lag on its partition. The broker is running and other workers can connect to that broker in that period.
Note that Node 2 is leader for the partition which the worker is assigned to. Group coordinator is Node 1 which is why heartbeat keeps beating.

This seems to be the same thing as #1728 which wasn't completely fixed.

#1733 was about fixing one possible cause for that error (= avoid it). I believe that in our case the error is legitimate (temporary connection problems to broker).
The real issue is that the worker is not able to restore itself after this error happens and becomes stuck instead of either giving up and dying or reconnecting.
I've tried to find code responsible for reconnecting (which doesn't seem to fire) but I don't understand your codebase that well. I will continue investigating, this is important for us.

(We have deployed this on several environments and see this on both 1.0.1 and 2.1 brokers, identified by client as 1.0.0)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions