Description
Experiencing high CPU usage when sitting idle in poll() (i.e., waiting for a timeout when there are no new messages on the broker). It gets worse the more topics I am subscribed to (CPU pegged at 100% with 40 topics). Note that I am using 1.3.4 with mostly default configs, and repro'd it on current master as well.
There seem to be a couple of things at play here. One is that poll() will do fetch requests in a tight loop. The other, the one that really seems to be killing the CPU, is that when a fetch response is received, the low-level poll() gets into a relatively tight loop as the payload buffer fills, adding a relatively small number of bytes at a time. This explains the effect of adding more topics: the fetch responses are bigger, so more time is spent in this tight loop. Here's some debug output based on a couple of probes I put in the code:
In conn.py: _recv()
if staged_bytes != self._next_payload_bytes:
    print("staged: {} payload: {}".format(staged_bytes, self._next_payload_bytes))
    return None
In consumer/group.py: _poll_once()
print("fetch!")
# Send any new fetches (won't resend pending fetches)
self._fetcher.send_fetches()
So, for one topic I get output like this while blocked in poll():
fetch!
staged: 4 payload: 104
fetch!
staged: 12 payload: 104
fetch!
staged: 50 payload: 104
fetch!
staged: 68 payload: 104
fetch!
staged: 86 payload: 104
fetch!
fetch!
staged: 4 payload: 104
For 2 topics:
fetch!
staged: 4 payload: 179
fetch!
staged: 12 payload: 179
fetch!
staged: 51 payload: 179
fetch!
staged: 69 payload: 179
fetch!
staged: 87 payload: 179
fetch!
staged: 105 payload: 179
fetch!
staged: 143 payload: 179
fetch!
staged: 161 payload: 179
fetch!
fetch!
staged: 4 payload: 197
fetch!
For 40 topics:
fetch!
staged: 2867 payload: 3835
fetch!
staged: 2885 payload: 3835
fetch!
staged: 2939 payload: 3835
fetch!
staged: 2957 payload: 3835
fetch!
staged: 2975 payload: 3835
staged: 4 payload: 3799
fetch!
staged: 12 payload: 3799
fetch!
staged: 58 payload: 3799
fetch!
staged: 76 payload: 3799
fetch!
staged: 94 payload: 3799
fetch!
staged: 112 payload: 3799
fetch!
staged: 154 payload: 3799
fetch!
... and many, many more
So it gets stuck spinning in this loop, and the CPU goes to 100%.
I tried mitigating this using the consumer fetch config:
fetch_min_bytes=1000000,
fetch_max_wait_ms=2000,
but that did nothing.
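For context, a minimal sketch of the kind of consumer setup involved (the broker address and topic names here are placeholders, not my actual config; everything else is left at defaults):

from kafka import KafkaConsumer

# Placeholder broker and topics -- the real setup subscribes to ~40 topics.
consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    fetch_min_bytes=1000000,   # ask the broker to hold the fetch until ~1 MB is available...
    fetch_max_wait_ms=2000,    # ...or 2 seconds have passed
)
consumer.subscribe(['topic-1', 'topic-2'])

# Blocking poll with a timeout -- this is where the CPU spins while idle.
records = consumer.poll(timeout_ms=5000)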
The only thing that gets the CPU down is to do a non-blocking poll() instead of using a timeout, and then do a short sleep when there are no records in the result (my application can tolerate that latency). It looks like poll() used to support something like this, i.e., there was a sleep parameter that caused it to sleep for the remainder of the timeout period if there were no records on the first fetch. Looks like that was removed in 237bd73, not sure why.
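The workaround looks roughly like this (the 0.5s sleep is arbitrary, and handle() is just a stand-in for whatever processing the application does):

import time

while True:
    records = consumer.poll(timeout_ms=0)   # non-blocking poll, returns immediately
    if not records:
        time.sleep(0.5)                     # back off instead of letting poll() spin
        continue
    for tp, messages in records.items():
        for message in messages:
            handle(message)                 # placeholder for actual processing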
So... like I said, I can work around the continuous fetching with my own sleep. It would be good to understand the real problem, which is the tight _recv() loop, and whether anything can be done about it.