Description
What happened (please include outputs or screenshots):
After exactly 300 seconds of idle time, exec commands streamed with `stream` lose their connection, even though:
- the pod is still running,
- the process that was exec'd in the container is still running (checked with `kubectl exec POD_NAME ps aux`),
- the network otherwise appears to be healthy,
- no warnings are logged by kube-apiserver.
The following stack trace is given:

```
File "kubernetes/stream/ws_client.py", line 141, in readline_stderr
    return self.readline_channel(STDERR_CHANNEL, timeout=timeout)
File "kubernetes/stream/ws_client.py", line 104, in readline_channel
    self.update(timeout=(timeout - time.time() + start))
File "kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
File "websocket/_core.py", line 401, in recv_data_frame
    frame = self.recv_frame()
File "websocket/_core.py", line 440, in recv_frame
    return self.frame_buffer.recv_frame()
File "websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
File "websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
File "websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
File "websocket/_core.py", line 524, in _recv
    return recv(self.sock, bufsize)
File "websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost.
```
What you expected to happen:
- The connection is not lost.
- If it is lost, the connection is automatically reconnected (up to some configured timeout limit).
Reading https://github.com/websocket-client/websocket-client#long-lived-connection, there is a description of a built-in way in the websocket library to keep a connection alive. It appears that the `stream` command does not use this method.
How to reproduce it (as minimally and precisely as possible):
(Mostly inspired by https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py)
```python
from kubernetes.stream import stream

...

s = stream(
    _client.connect_get_namespaced_pod_exec,
    _pod.metadata.name,
    _pod.metadata.namespace,
    command=["sleep", "100000"],
    async_req=False,
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
    _preload_content=False,
)

def tail_logs(s):
    if (ret_stdout := s.readline_stdout(timeout=1)) is not None:
        print(ret_stdout)

while (retcode := s.returncode) is None:
    tail_logs(s)
tail_logs(s)
```
Anything else we need to know?:
I've experimented with adding `s.sock.ping()` in each iteration of the `tail_logs` function, and it appears to improve the situation, but sometimes it still gets stuck inside `readline_stdout` despite the `timeout=1` parameter being set. 🤨 It gets stuck for so long that the ping never has a chance to run within the next 300-second window.

Does one need to run `ping` in a separate thread for this to work? That's how it is done inside `WebSocketApp`, as used by the long-lived connection example from the websocket library. Can one use `WebSocketApp` together with `stream`, or do I need to implement a similar ping feature myself when using `stream`?

I'm also curious where the 300-second limit comes from: is there a way to discover this value through the connection handshake, or is it a setting in my cluster/Azure?
Environment:
- Kubernetes version: Azure AKS v1.23.5
- Python version: 3.9.7
- Python client version: 23.6.0