Skip to content

Unexpected WebSocketConnectionClosedException when streaming long running commands, missing pings? #1841

Closed as not planned
@hterik

Description

@hterik

What happened (please include outputs or screenshots):
After exactly 300sec of idle, exec commands streamed with stream loose their connection, even though

  • pod is still running,
  • The process that was execd in container is still running (checked with kubectl exec POD_NAME ps aux)
  • Network otherwise appears to be healthy.
  • No warnings logged by kube-apiserver

Following stack trace is given

  File "kubernetes/stream/ws_client.py", line 141, in readline_stderr
    return self.readline_channel(STDERR_CHANNEL, timeout=timeout)
  File "kubernetes/stream/ws_client.py", line 104, in readline_channel
    self.update(timeout=(timeout - time.time() + start))
  File "kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
  File "websocket/_core.py", line 401, in recv_data_frame
    frame = self.recv_frame()
  File "websocket/_core.py", line 440, in recv_frame
    return self.frame_buffer.recv_frame()
  File "websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
  File "websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
  File "websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "websocket/_core.py", line 524, in _recv
    return recv(self.sock, bufsize)
  File "websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost. 

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
(Mostly inspired by https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py)

from kubernetes.stream import stream
...
...
s = stream(
    _client.connect_get_namespaced_pod_exec,
    _pod.metadata.name,
    _pod.metadata.namespace,
    command=["sleep", "100000"],           
    async_req=False,
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
    _preload_content=False,
)

def tail_logs():
    if (ret_stdout := s.readline_stdout(timeout=1)) is not None:
        print(ret_stdout)

while (retcode := s.returncode) is None:
    tail_logs(s)
tail_logs(s)

Anything else we need to know?:
I've experimented by adding s.sock.ping() in each iteration of the tail_logs function and it appears to improve the situation but some times it still get stuck inside readline_stdout, despite the timeout=1 parameter being set. 🤨 It get stuck so long that ping never has chance to run within the next 300sec timeout.
Does one need to run ping in a separate thread for this to work? That's how it is done inside the WebSocketApp, as used by the long lived connection example from websocket library.
Can one use WebSocketapp together with stream or do i need to implement similar ping-feature myself when using stream?

I'm also curious where 300sec come from, is there a way to find this time through connection handshake or is it a setting in my cluster/azure?

Environment:

  • Kubernetes version: Azure AKS v1.23.5
  • Python version 3.9.7
  • Python client version 23.6.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugCategorizes issue or PR as related to a bug.lifecycle/rottenDenotes an issue or PR that has aged beyond stale and will be auto-closed.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions