High CPU usage in KafkaConsumer.poll() when subscribed to many topics with no new messages (possibly SSL related) #1315

Closed
@rmechler

Description

Experiencing high CPU usage when sitting idle in poll() (i.e., waiting for a timeout when there are no new messages on the broker). It gets worse the more topics I am subscribed to (I have CPU pegged at 100% with 40 topics). Note that I am using 1.3.4 with mostly default configs, and have also repro'd this on current master.
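
For reference, the consumer loop is basically just this (a sketch; the broker address, topic names, and handle() call are placeholders, and SSL cert options are omitted):

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             group_id='my-group',
                             security_protocol='SSL')  # plus ssl_cafile etc., omitted here
    consumer.subscribe(['topic-{}'.format(i) for i in range(40)])

    while True:
        # block up to 5s waiting for new messages; CPU pegs here while idle
        records = consumer.poll(timeout_ms=5000)
        for tp, msgs in records.items():
            handle(msgs)  # placeholder for application processing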

There seem to be a couple of things at play here. One is that poll() will do fetch requests in a tight loop. The other, the one that really seems to be killing CPU, is that when a fetch response is received, the low-level poll() gets into a relatively tight loop as the payload buffer fills, adding a relatively small number of bytes at a time. This explains the effect of adding more topics: the fetch responses are bigger, so it spends more time in this tight loop. Here's some debug output based on a couple of probes I put in the code:

In conn.py: _recv()

        if staged_bytes != self._next_payload_bytes:
            print("staged: {}   payload: {}".format(staged_bytes, self._next_payload_bytes))
            return None

In consumer/group.py: _poll_once()

        print("fetch!")
        # Send any new fetches (won't resend pending fetches)
        self._fetcher.send_fetches()

So, for one topic I get output like this while blocked in poll():

fetch!
staged: 4   payload: 104
fetch!
staged: 12   payload: 104
fetch!
staged: 50   payload: 104
fetch!
staged: 68   payload: 104
fetch!
staged: 86   payload: 104
fetch!
fetch!
staged: 4   payload: 104

For 2 topics:

fetch!
staged: 4   payload: 179
fetch!
staged: 12   payload: 179
fetch!
staged: 51   payload: 179
fetch!
staged: 69   payload: 179
fetch!
staged: 87   payload: 179
fetch!
staged: 105   payload: 179
fetch!
staged: 143   payload: 179
fetch!
staged: 161   payload: 179
fetch!
fetch!
staged: 4   payload: 197
fetch!

For 40 topics:

fetch!
staged: 2867   payload: 3835
fetch!
staged: 2885   payload: 3835
fetch!
staged: 2939   payload: 3835
fetch!
staged: 2957   payload: 3835
fetch!
staged: 2975   payload: 3835
staged: 4   payload: 3799
fetch!
staged: 12   payload: 3799
fetch!
staged: 58   payload: 3799
fetch!
staged: 76   payload: 3799
fetch!
staged: 94   payload: 3799
fetch!
staged: 112   payload: 3799
fetch!
staged: 154   payload: 3799
fetch!
... and many, many more

So it gets stuck spinning in this, and CPU goes to 100%.
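
An easy way to see it, using the consumer from the sketch above: compare process CPU time to wall time around a single blocking poll() while the topics are idle (sketch):

    import time

    t_wall, t_cpu = time.monotonic(), time.process_time()
    consumer.poll(timeout_ms=5000)  # no new messages, so this just waits out the timeout
    print("wall: {:.2f}s  cpu: {:.2f}s".format(time.monotonic() - t_wall,
                                               time.process_time() - t_cpu))

With 40 topics the CPU time comes out close to the wall time, which matches the pegged core.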

I tried mitigating this using consumer fetch config:

    fetch_min_bytes=1000000,
    fetch_max_wait_ms=2000,

but that did nothing.
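
(For completeness, those settings were passed straight to the KafkaConsumer constructor, i.e. something like this, with a placeholder broker address:)

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             fetch_min_bytes=1000000,
                             fetch_max_wait_ms=2000)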

The only thing that gets the CPU down is to do a non-blocking poll() instead of using a timeout, and then do a short sleep when there are no result records (my application can tolerate that latency). It looks like poll() used to support something like this, i.e., there was a sleep parameter that caused a sleep for the remainder of the timeout period if there were no records on the first fetch. Looks like that was removed in 237bd73, not sure why.
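
Concretely, the workaround looks something like this (the sleep duration is just whatever latency the application can tolerate; handle() is a placeholder):

    import time

    while True:
        records = consumer.poll(timeout_ms=0)  # non-blocking poll instead of a timeout
        if not records:
            time.sleep(1.0)  # back off; keeps CPU near zero at the cost of up to 1s latency
            continue
        for tp, msgs in records.items():
            handle(msgs)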

So... like I said, I can work around the continuous fetching with my own sleep. It would be good to understand the real problem, which is the tight _recv() loop, and whether anything can be done about it.
