Skip to content

Git.execute's kill_after_timeout callback assumes procps #1756

Open
@EliahKagan

Description

@EliahKagan

Background

Calling Git.execute—whether directly, or indirectly by calling the dynamic attributes of a Git instance—and passing kill_after_timeout with a non-None value, creates a timer on a separate thread that calls the local kill_process function. This callback function uses os.kill to kill the process. Before killing the process, it enumerates the process's direct children. If sending the first signal succeeds (basically, if the parent process still existed), it also attempts to kill the child processes.

The children are enumerated with ps --ppid:

GitPython/git/cmd.py

Lines 1010 to 1014 in fe082ad

p = Popen(
["ps", "--ppid", str(pid)],
stdout=PIPE,
creationflags=PROC_CREATIONFLAGS,
)

The problem

The --ppid option is not POSIX. Most GNU/Linux systems have procps, whose ps implementation supports --ppid. I am unsure if any other implementations of ps support it. The procps tools generally run only on Linux-based systems, because they use the /proc filesystem (and assume it is laid out as in Linux). Although they can run on any such system, Alpine Linux and some minimal GNU/Linux environments do not ship them, defaulting to ps from busybox instead.

As demonstrated below, macOS ps does not support --ppid. Nor do FreeBSD, NetBSD, OpenBSD, or DragonFly. AIX does not have --ppid. illumos does not have --ppid; nor does Solaris, though -ppid (with one -) can be used in 11.4.27 or higher. Although Cygwin mimics Linux where feasible, its /proc filesystem is different, and its ps does not support --ppid either (nor even some important POSIX options like -o).

The callback parses stdout from that ps command, but does not examine the exit status or stderr. The effect is that an error message on a system without procps (or another ps supporting --ppid, if there is one) is printed, and the parent process is still sent SIGKILL, but its children are never found or sent signals.

As detailed below, although the fetch, pull, and push methods of the Remote class accept a kill_after_timeout argument, they do not use Git.execute, so they are unaffected by this bug.

Steps to reproduce

On macOS 13 (on a GitHub Actions CI runner with tmate), I created this script in a directory in $PATH, named it git-sleep, and marked it executable:

#!/bin/sh
sleep "$@"

Then I called sleep on a Git instance with a kill_after_timeout argument specifying a shorter duration than the sleep:

bash-3.2$ python
Python 3.12.0 (v3.12.0:0fb18b02c8, Oct  2 2023, 09:45:56) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from git import Git
>>> Git().sleep(10, kill_after_timeout=5)
ps: illegal option -- -
usage: ps [-AaCcEefhjlMmrSTvwXx] [-O fmt | -o fmt] [-G gid[,gid...]]
          [-g grp[,grp...]] [-u [uid,uid...]]
          [-p pid[,pid...]] [-t tty[,tty...]] [-U user[,user...]]
       ps [-L]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 741, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1320, in _call_process
    return self.execute(call, **exec_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1117, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(-9)
  cmdline: git sleep 10
  stderr: 'Timeout: the command "git sleep 10" did not complete in 5 secs.'
>>>

Impact

1. The other kill_after_timeout is unaffected

There are two callables defined in git/cmd.py that accept an optional kill_after_timeout argument: the "internal" top-level handle_process_output function that is not listed in __all__ but is used throughout GitPython, and the public Git.execute method (also used when dynamic Git methods are called). The meaning of this argument is subtly different, and the associated implementations completely different.

This bug affects only the one in the Git class. Thus it does not affect common uses of timeouts in interacting with remotes: the Remote.fetch, Remote.push, and Remote.pull methods accept kill_after_timeout arguments, but they forward them to handle_process_output.

2. But this one should work on all Unix-like systems

From context, I think it is unintended not to support common Unix-like systems such as macOS. The Git.execute docstring says "This feature is not supported on Windows" and makes no other claims about compatibility, from which I think readers will reasonably infer that other platforms are believed supported. When called on a native Windows system (not Cygwin) with a non-None value for kill_after_timeout, it raises a GitCommandError. Other systems, including Cygwin, raise no exception and register the kill_process callback. kill_after_timeout is thus in effect documented to work on all systems except native Windows.

3. What happens if the child processes aren't sent SIGKILL?

I don't know how much of a problem it is for SIGKILL to be sent only to the parent and not to its direct children. I am not confident I know why that is being done, as opposed to killing only the parent process, or attempting to kill its entire process tree. My guess is that this is because many git commands use a subprocess to do their work. If so, then it may in practice be important—in situations where people pass kill_after_timeout—that the child processes are killed as well.

However, git subprocesses do sometimes use their own subprocesses:

ek@Glub:~$ pstree -a
init(Ubuntu)
  ├─SessionLeader
  │   └─Relay(9)
  │       ├─bash
  │       │   └─git clone https://github.com/huggingface/transformers.git
  │       │       └─git remote-https origin https://github.com/huggingface/transformers.git
  │       │           └─git-remote-http origin https://github.com/huggingface/transformers.git
...

In that example, the git-remote-http process may not receive SIGKILL. I am unsure how much this matters, but if it is a problem, then the more severe it is, the less severe this bug is, because the intended behavior wouldn't help anyway. Likewise, in situations where killing the parent process is sufficient, this bug also does not cause a problem.

That lower descendants are not killed has been reported as #895. That was observed in GitPython 2.0.2, which had the current approach of killing just the direct child processes.

A minor race condition…

One thing I'm a little worried about is a race condition that is currently present, and that I think may not be possible to fix, but that I worry finding child processes in a more portable way may exacerbate. Unless it can be solved or mitigated more deeply, it is a reason, unrelated to performance, to prefer that a portable substitute for the existing use of ps --ppid not be too much slower than the current way. (I likewise worry that if the approach were changed to kill all descendants, then the added time to traverse the whole subtree might exacerbate this race condition.)

Suppose we plan to kill a process P and all its direct child processes including Q, and we find the PID of Q, but before killing Q, all the following happen:

  1. Q dies.
  2. Q is reaped. That is, it is wait(2)ed by its parent--which is either its original parent P or, if P has died, then init--causing its entry in the process table to be removed and its PID to be available for use by a future process.
  3. A new process, R, is created and assigned Q's old PID.

Then when when we try to kill Q, we kill R.

This situation is rare, because in practice the time between when a process is reaped and when a new process is given its PID is only short when the process table is nearly full so the kernel has no less recently relinquished PIDs to give out. But I think it would be best to avoid increasing the risk of it.

There may be other related race conditions, but this is the one that seems it could be worsened by replacing the existing unportable use of ps --ppid with some other technique, if that other technique is markedly slower.

Finding/killing the the subprocesses portably

I am unsure if this should be done, because it is not clear to me that killing the parent process and its direct child processes, as is currently attempted (generally successfully on GNU/Linux and unsuccessfully elsewhere), is necessarily what should happen. Doing anything else might risk incompatibility for some existing use cases on some systems, so I would want to be cautious about doing something altogether different, but I think it should still be considered before proceeding.

However, assuming the current approach of killing the child processes should be preserved, I think there are three cases:

  1. Systems with pgrep/pkill.
  2. Systems whose ps is POSIX-compliant, or at least supports -A and -o.
  3. Cygwin. (And the like, e.g., MSYS2. But sys.platform == "cygwin" still covers that.)

Case 1 could be folded into case 2 if a speed regression is acceptable (but see above on the race condition), or if testing reveals using pgrep or pkill is not significantly faster. Case 3 could be dropped in lieu of modifying the docstring to document that kill_after_timeout is less effective on Cygwin, if reducing complexity is regarded as more important than covering it.

Whether to cover case 3 or not is more a matter of code complexity than time to write and review the code. With or without it, I think most of the time and effort would be on the tests. Currently none cover passing kill_after_timeout to Git.execute or to a dynamic method of a Git object. Only the other kill_after_timeout—of handle_process_output—has test coverage. Because this project has CI on Cygwin, I don't think the tests have to do much to accommodate it—its challenges are ready-made.

(An alternative to dealing with these details is to use psutil, but I'm unsure if the impact of this issue is sufficient to justify adding it as a dependency. It doesn't support all systems, but systems it doesn't support are rare. I think it could be made conditional on the systems it is installable on, and the features that use it be documented as unavailable on other systems. I think this is probably not worth doing just for this, but if it turns out it would help in various other places, and increasing rather than decreasing OS compatibility—as it would here—then it might make sense to consider it. On the other hand, one benefit of GitPython is that it has very few dependencies.)

1. If we have pgrep/pkill

pgrep and pkill are not POSIX, but they are available on many more systems than ps --ppid. Furthermore, it is likely that all systems that support ps --ppid also have pgrep and pkill, because not only are they very common, but procps (which provides the only ps with --ppid I can find, as discussed above) includes an implementation of them. Of course, it's possible (odd, but possible) for a distribution to use procps for ps but not include pgrep and pkill. Whether pkill can be used to consolidate the steps, or pgrep must be used together with something what is already there, is a design decision that should be influenced by a decision about the best order for sending SIGKILL.

If it's acceptable to send SIGKILL to the child processes first, then instead of running ["ps", "--ppid", str(pid)] and most of what comes after it, one option is to run ["pkill", "-P", str(pid)] and then:

  • If it succeeds, immediately call os.kill(pid, signal.SIGKILL) and kill_check.set().
  • If it fails due to pkill not existing—or if it was checked first and found absent—proceed to case 2 (using ps).

If it's not acceptable to send SIGKILL to the child processes first, or if either order is acceptable but it is desirable to share more code with the fallback case 2, then instead of running ["ps", "--ppid", str(pid)], run ["pgrep", "-P", str(pid)], then:

  • If it failed due to pgrep not existing—or if it was checked first and found absent—proceed to case 2.
  • Treat an exit status of 0 or 1 as success; 1 is when there were no children. (This seems to hold across different implementations, but I'll want to look into it further, since these tools are not POSIX or XSI.)
  • Parse the output—each line is just a PID, with no headers, no other columns—into child_pids.
  • Continue with the rest of the kill_process function as it already exists.

2. If ps supports -A and -o

This is almost every Unix-like system used today; POSIX requires these options.

Instead of running ["ps", "--ppid", str(pid)], run ["ps", "-A", "-o", "pid,ppid"], then:

  • Check that the first row is PID and PPID to safeguard against unexpectedly nonstandard ps.
  • Each remaining row should have child and parent process IDs. Filter for the rows where the parent (second column) is what we passed, and populate child_pids with the PIDs from the first column.
  • Continue with the rest of the kill_process function as it already exists.

If using this is as fast a pkill/pgrep, or slower but not by a lot, or code simplicity is considered more important than the small worsening of the rare race condition, then this could be used on all systems except Cygwin. The truth is that it is only out of fear of worsening things in weird situations on GNU/Linux systems with procps that I have even proposed case 1. This is the portable way to do it (except Cygwin).

It may be possible to optimize this with -U to filter the real user ID to os.getuid(), or -u to filter the effective user ID to os.geteuid(), though -u seems to be an XSI extension. I don't know if this would actually make things faster. I don't know if the added complexity, though modest, would be worthwhile even if it does. When doing this, -A would not also be passed.

The reason not to simply omit -A without replacing it, which gets processes that share the caller's EUID, is that it also only shows processes with the same controlling terminal. The reason not to use -a instead is that it doesn't show processes not associated with any terminal. The reason I prefer -A to its synonym -e is that -e seems to be an XSI extension.

3. Cygwin

Running ps on Cygwin gives output that looks like:

      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     1641       1    1641      27868  ?         197609   Nov 14 /usr/bin/ssh-agent
     2201       1    2201      27112  ?         197609 02:35:26 /usr/bin/mintty
     2202    2201    2202      41336  pty0      197609 02:35:26 /usr/bin/bash
     2304    2202    2304      47276  pty0      197609 14:47:12 /usr/bin/ps

This can be modified by various options, but -o is not supported. (-A is not supported either, but it is not needed.)

Instead of running ["ps", "--ppid", str(pid)], we can run ["ps"], then:

  • Check the first row headers, or at least the leading PID and PPID headers that we are going to use, to safeguard against unexpectedly non-Cygwin ps or future changes to Cygwin ps.
  • Continue as in case 2 after the check, making sure to use only the first two fields.

Perspective

I think the what is more important than the how, because:

Test coverage

Unlike the other kill_after_timeout (in handle_process_output), the code path where Git.execute is passed kill_after_timeout has no test coverage. It would be good to test it even if this bug is not fixed. But at the root of both is figuring out if killing the parent process and its direct child processes are what is wanted.

Maintainability

The kill_process callback is never called on (native) Windows, where calling Git.execute with a non-None value for kill_after_timeout raises GitCommandError. But it contains what seem to be the remains of an attempt to support Windows: it passes PROC_CREATIONFLAGS (which is 0 except on Windows) when running ps, and it falls back to signal.SIGTERM when signal.SIGKILL is absent (which it is on Windows).

I discovered this whole issue because I want to remove that code, which I think could lead to future bugs, and I was looking into whether there is any reason not to. A possible reason not to is if kill_process can be easily modified to support Windows—which it could, if it is acceptable to kill either only the parent process, or the whole process tree, though whether it should is another question. Because figuring out what to do about this issue entails figuring that out too, it would open kill_process up to that improvement—dropping its vestigial Windows code if it is not going to support Windows—and possibly others.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions