Git.execute's kill_after_timeout callback assumes procps

### Background

Calling `Git.execute`—whether directly, or indirectly by calling the dynamic attributes of a `Git` instance—and passing `kill_after_timeout` with a non-`None` value, creates a timer on a separate thread that calls the local [`kill_process`](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L1008-L1036) function. This callback function uses `os.kill` to kill the process. Before killing the process, it enumerates the process's *direct children*. If sending the first signal succeeds (basically, if the parent process still existed), it also attempts to kill the child processes.

The children are enumerated with `ps --ppid`:

https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L1010-L1014

### The problem

The `--ppid` option is [not POSIX](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/ps.html). Most GNU/Linux systems have [procps](https://gitlab.com/procps-ng/procps), whose `ps` implementation supports `--ppid`. I am unsure if *any* other implementations of `ps` support it. The procps tools generally run only on Linux-based systems, because they use the `/proc` filesystem (and assume it is laid out as in Linux). Although they can run on any such system, Alpine Linux and some minimal GNU/Linux environments do not ship them, defaulting to `ps` from [busybox](https://busybox.net/) instead.

As demonstrated below, [macOS](https://ss64.com/osx/ps.html) `ps` does not support `--ppid`. Nor do [FreeBSD](https://man.freebsd.org/cgi/man.cgi?query=ps&apropos=0&sektion=0&manpath=FreeBSD+14.0-RELEASE+and+Ports&arch=default&format=html), [NetBSD](https://man.netbsd.org/ps.1), [OpenBSD](https://man.openbsd.org/ps), or [DragonFly](https://leaf.dragonflybsd.org/cgi/web-man?command=ps&section=ANY). [AIX](https://www.ibm.com/docs/en/aix/7.3?topic=p-ps-command) does not have `--ppid`. [illumos](https://illumos.org/man/1/ps) does not have `--ppid`; nor does [Solaris](https://docs.oracle.com/cd/E88353_01/html/E37839/ps-1.html), though `-ppid` (with one `-`) can be used in 11.4.27 or higher. Although Cygwin mimics Linux where feasible, its `/proc` filesystem is different, and its `ps` does not support `--ppid` either (nor even some important POSIX options like `-o`).

The callback parses stdout from that `ps` command, but does not examine the exit status or stderr. The effect is that an error message on a system without procps (or another `ps` supporting `--ppid`, if there is one) is printed, and the parent process is still sent `SIGKILL`, but its children are never found or sent signals.

*As detailed below, although the `fetch`, `pull`, and `push` methods of the `Remote` class accept a `kill_after_timeout` argument, they do not use `Git.execute`, so they are unaffected by this bug.*

### Steps to reproduce

On macOS 13 (on a GitHub Actions CI runner with [tmate](https://github.com/marketplace/actions/debugging-with-tmate)), I created this script in a directory in `$PATH`, named it `git-sleep`, and marked it executable:

```sh
#!/bin/sh
sleep "$@"
```

Then I called `sleep` on a `Git` instance with a `kill_after_timeout` argument specifying a shorter duration than the sleep:

```text
bash-3.2$ python
Python 3.12.0 (v3.12.0:0fb18b02c8, Oct  2 2023, 09:45:56) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from git import Git
>>> Git().sleep(10, kill_after_timeout=5)
ps: illegal option -- -
usage: ps [-AaCcEefhjlMmrSTvwXx] [-O fmt | -o fmt] [-G gid[,gid...]]
          [-g grp[,grp...]] [-u [uid,uid...]]
          [-p pid[,pid...]] [-t tty[,tty...]] [-U user[,user...]]
       ps [-L]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 741, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1320, in _call_process
    return self.execute(call, **exec_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1117, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(-9)
  cmdline: git sleep 10
  stderr: 'Timeout: the command "git sleep 10" did not complete in 5 secs.'
>>>
```

### Impact

#### 1. The *other* `kill_after_timeout` is unaffected

There are two callables defined in `git/cmd.py` that accept an optional `kill_after_timeout` argument: the "internal" top-level [`handle_process_output`](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L96-L108) function that is not listed in `__all__` but is used throughout GitPython, and the public [`Git.execute`](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L820-L837) method (also used when dynamic `Git` methods are called). The meaning of this argument is subtly different, and the associated implementations completely different.

**This bug affects only the one in the `Git` class.** Thus it does not affect common uses of timeouts in interacting with remotes: the [`Remote.fetch`](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/remote.py#L956-L965), [`Remote.push`](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/remote.py#L1073-L1081), and [`Remote.pull`](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/remote.py#L1032-L1040) methods accept `kill_after_timeout` arguments, but they forward them to `handle_process_output`.

#### 2. But this one *should* work on all Unix-like systems

From context, I think it is unintended not to support common Unix-like systems such as macOS. The [`Git.execute` docstring](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L878-L886) says "This feature is not supported on Windows" and makes no other claims about compatibility, from which I think readers will reasonably infer that other platforms are believed supported. When called on a native Windows system (not Cygwin) with a non-`None` value for `kill_after_timeout`, it raises a `GitCommandError`. Other systems, including Cygwin, raise no exception and register the `kill_process` callback. `kill_after_timeout` is thus *in effect* documented to work on all systems except native Windows.

#### 3. What happens if the child processes aren't sent `SIGKILL`?

I don't know how much of a problem it is for `SIGKILL` to be sent only to the parent and not to its direct children. I am not confident I know why that is being done, as opposed to killing only the parent process, or attempting to kill its entire process tree. My *guess* is that this is because many `git` commands use a subprocess to do their work. If so, then it may in practice be important—in situations where people pass `kill_after_timeout`—that the child processes are killed as well.

However, `git` subprocesses do sometimes use their own subprocesses:

```text
ek@Glub:~$ pstree -a
init(Ubuntu)
  ├─SessionLeader
  │   └─Relay(9)
  │       ├─bash
  │       │   └─git clone https://github.com/huggingface/transformers.git
  │       │       └─git remote-https origin https://github.com/huggingface/transformers.git
  │       │           └─git-remote-http origin https://github.com/huggingface/transformers.git
...
```

In that example, the `git-remote-http` process may not receive `SIGKILL`. I am unsure how much this matters, but if it is a problem, then the more severe it is, the *less* severe this bug is, because the intended behavior wouldn't help anyway. Likewise, in situations where killing the parent process is sufficient, this bug also does not cause a problem.

That lower descendants are not killed has been reported as #895. That was observed in GitPython 2.0.2, which had the current approach of [killing just the direct child processes](https://github.com/gitpython-developers/GitPython/blob/2.0.2/git/cmd.py#L635-L639).

### A minor race condition…

One thing I'm a little worried about is a race condition that is currently present, and that I think may not be possible to fix, *but that I worry finding child processes in a more portable way may exacerbate*. Unless it can be solved or mitigated more deeply, it is a reason, unrelated to performance, to prefer that a portable substitute for the existing use of `ps --ppid` *not be too much slower* than the current way. (I likewise worry that if the approach were changed to kill all descendants, then the added time to traverse the whole subtree might exacerbate this race condition.)

Suppose we plan to kill a process P and all its direct child processes including Q, and we find the PID of Q, but before killing Q, all the following happen:

1. Q dies.
2. Q is reaped. That is, it is [wait(2)](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/wait.html)ed by its parent--which is either its original parent P or, if P has died, then `init`--causing its entry in the process table to be removed and its PID to be available for use by a future process.
3. A new process, R, is created and assigned Q's old PID.

Then when when we try to kill Q, we kill R.

**This situation is rare**, because in practice the time between when a process is reaped and when a new process is given its PID is only short [when the process table is nearly full](https://unix.stackexchange.com/a/414974/11938) so the kernel has no less recently relinquished PIDs to give out. But I think it would be best to avoid increasing the risk of it.

There may be other related race conditions, but this is the one that seems it could be worsened by replacing the existing unportable use of `ps --ppid` with some other technique, if that other technique is markedly slower.

### Finding/killing the the subprocesses portably

I am unsure if this should be done, because it is not clear to me that killing the parent process and its direct child processes, as is currently attempted (generally successfully on GNU/Linux and unsuccessfully elsewhere), is necessarily what should happen. Doing anything else might risk incompatibility for some existing use cases on some systems, so I would want to be cautious about doing something altogether different, but I think it should still be considered before proceeding.

However, assuming the current approach of killing the child processes should be preserved, I think there are three cases:

1. Systems with `pgrep`/`pkill`.
2. Systems whose `ps` is POSIX-compliant, or at least supports `-A` and `-o`.
3. Cygwin. (And the like, e.g., MSYS2. But `sys.platform == "cygwin"` still covers that.)

Case 1 could be folded into case 2 if a speed regression is acceptable (but see above on the race condition), or if testing reveals using `pgrep` or `pkill` is not significantly faster. Case 3 could be dropped in lieu of modifying the docstring to document that `kill_after_timeout` is less effective on Cygwin, if reducing complexity is regarded as more important than covering it.

Whether to cover case 3 or not is more a matter of code complexity than time to write and review the code. With or without it, I think most of the time and effort would be on the *tests*. Currently none cover passing `kill_after_timeout` to `Git.execute` or to a dynamic method of a `Git` object. Only the *other* `kill_after_timeout`—of `handle_process_output`—has test coverage. Because this project has CI on Cygwin, I don't think the tests have to do much to accommodate it—its challenges are ready-made.

(An alternative to dealing with these details is to use [psutil](https://pypi.org/project/psutil/), but I'm unsure if the impact of this issue is sufficient to justify adding it as a dependency. It doesn't support all systems, but systems it doesn't support are rare. I think it could be made conditional on the systems it is installable on, and the features that use it be documented as unavailable on other systems. I think this is probably not worth doing just for this, but if it turns out it would help in various other places, and increasing rather than decreasing OS compatibility—as it would here—then it might make sense to consider it. On the other hand, one benefit of GitPython is that it has very few dependencies.)

#### 1. If we have `pgrep`/`pkill`

`pgrep` and `pkill` are not POSIX, but they are available on many more systems than `ps --ppid`. Furthermore, it is likely that all systems that support `ps --ppid` also have `pgrep` and `pkill`, because not only are they very common, but procps (which provides the only `ps` with `--ppid` I can find, as discussed above) [includes an implementation](https://gitlab.com/procps-ng/procps/-/blob/master/src/pgrep.c?ref_type=heads) of them. Of course, it's possible (odd, but possible) for a distribution to use procps for `ps` but not include `pgrep` and `pkill`. Whether `pkill` can be used to consolidate the steps, or `pgrep` must be used together with something what is already there, is a design decision that should be influenced by a decision about the best order for sending `SIGKILL`.

If it's acceptable to send `SIGKILL` to the child processes first, then instead of running `["ps", "--ppid", str(pid)]` *and most of what comes after it*, one option is to run `["pkill", "-P", str(pid)]` and then:

- If it succeeds, immediately call `os.kill(pid, signal.SIGKILL)` and `kill_check.set()`.
- If it fails due to `pkill` not existing—or if it was checked first and found absent—proceed to case 2 (using `ps`).

If it's not acceptable to send `SIGKILL` to the child processes first, or if either order is acceptable but it is desirable to share more code with the fallback case 2, then instead of running `["ps", "--ppid", str(pid)]`, run `["pgrep", "-P", str(pid)]`, then:

- If it failed due to `pgrep` not existing—or if it was checked first and found absent—proceed to case 2.
- Treat an exit status of 0 or 1 as success; 1 is when there were no children. (This seems to hold across different implementations, but I'll want to look into it further, since these tools are not POSIX or [XSI](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap02.html#tag_02_01_04).)
- Parse the output—each line is just a PID, with no headers, no other columns—into `child_pids`.
- Continue with the rest of the `kill_process` function as it already exists.

#### 2. If `ps` supports `-A` and `-o`

This is almost every Unix-like system used today; [POSIX requires these options](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/ps.html).

Instead of running `["ps", "--ppid", str(pid)]`, run `["ps", "-A", "-o", "pid,ppid"]`, then:

- Check that the first row is `PID` and `PPID` to safeguard against unexpectedly nonstandard `ps`.
- Each remaining row should have child and parent process IDs. Filter for the rows where the parent (second column) is what we passed, and populate `child_pids` with the PIDs from the first column.
- Continue with the rest of the `kill_process` function as it already exists.

If using this is as fast a `pkill`/`pgrep`, or slower but not by a lot, or code simplicity is considered more important than the small worsening of the rare race condition, then this could be used on all systems except Cygwin. The truth is that it is only out of fear of worsening things in weird situations on GNU/Linux systems with procps that I have even proposed case 1. This is the portable way to do it (except Cygwin).

It may be possible to optimize this with `-U` to filter the real user ID to `os.getuid()`, or `-u` to filter the effective user ID to `os.geteuid()`, though `-u` seems to be an XSI extension. I don't know if this would actually make things faster. I don't know if the added complexity, though modest, would be worthwhile even if it does. When doing this, `-A` would not also be passed.

The reason not to simply omit `-A` without replacing it, which gets processes that share the caller's EUID, is that it also only shows processes with the same controlling terminal. The reason not to use `-a` instead is that it doesn't show processes not associated with any terminal. The reason I prefer `-A` to its synonym `-e` is that `-e` seems to be an XSI extension.

#### 3. Cygwin

Running `ps` on Cygwin gives output that looks like:

```text
      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     1641       1    1641      27868  ?         197609   Nov 14 /usr/bin/ssh-agent
     2201       1    2201      27112  ?         197609 02:35:26 /usr/bin/mintty
     2202    2201    2202      41336  pty0      197609 02:35:26 /usr/bin/bash
     2304    2202    2304      47276  pty0      197609 14:47:12 /usr/bin/ps
```

This can be modified by various options, but `-o` is not supported. (`-A` is not supported either, but it is not needed.)

Instead of running `["ps", "--ppid", str(pid)]`, we can run `["ps"]`, then:

- Check the first row headers, or at least the leading `PID` and `PPID` headers that we are going to use, to safeguard against unexpectedly non-Cygwin `ps` or future changes to Cygwin `ps`.
- Continue as in case 2 after the check, making sure to use only the first two fields.

### Perspective

I think the *what* is more important than the *how*, because:

#### Test coverage

[Unlike](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/test/test_remote.py#L1006-L1019) the other `kill_after_timeout` (in `handle_process_output`), the code path where `Git.execute` is passed `kill_after_timeout` has no test coverage. It would be good to test it even if this bug is not fixed. But at the root of both is figuring out if killing the parent process and its direct child processes are what is wanted.

#### Maintainability

The `kill_process` callback is never called on (native) Windows, where calling `Git.execute` with a non-`None` value for `kill_after_timeout` [raises](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L960-L964) `GitCommandError`. But it contains what seem to be the remains of an attempt to support Windows: it [passes](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L1013) `PROC_CREATIONFLAGS` (which is 0 [except](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L231-L236) on Windows) when running `ps`, and it [falls back](https://github.com/gitpython-developers/GitPython/blob/fe082ad5e297119a3073dab74fa604eaf547ecc7/git/cmd.py#L1023-L1024) to `signal.SIGTERM` when `signal.SIGKILL` is absent (which it is on Windows).

I discovered this whole issue because I want to remove that code, which I think could lead to future bugs, and I was looking into whether there is any reason not to. A possible reason not to is if `kill_process` can be easily modified to support Windows—which it could, if it is acceptable to kill either only the parent process, or the whole process tree, though whether it *should* is another question. Because figuring out what to do about this issue entails figuring that out too, it would open `kill_process` up to that improvement—dropping its vestigial Windows code if it is not going to support Windows—and possibly others.

	p = Popen(
	["ps", "--ppid", str(pid)],
	stdout=PIPE,
	creationflags=PROC_CREATIONFLAGS,
	)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Git.execute's kill_after_timeout callback assumes procps #1756

Background

The problem

Steps to reproduce

Impact

1. The other `kill_after_timeout` is unaffected

2. But this one should work on all Unix-like systems

3. What happens if the child processes aren't sent `SIGKILL`?

A minor race condition…

Finding/killing the the subprocesses portably

1. If we have `pgrep`/`pkill`

2. If `ps` supports `-A` and `-o`

3. Cygwin

Perspective

Test coverage

Maintainability

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Git.execute's kill_after_timeout callback assumes procps #1756

Description

Background

The problem

Steps to reproduce

Impact

1. The other kill_after_timeout is unaffected

2. But this one should work on all Unix-like systems

3. What happens if the child processes aren't sent SIGKILL?

A minor race condition…

Finding/killing the the subprocesses portably

1. If we have pgrep/pkill

2. If ps supports -A and -o

3. Cygwin

Perspective

Test coverage

Maintainability

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

1. The other `kill_after_timeout` is unaffected

3. What happens if the child processes aren't sent `SIGKILL`?

1. If we have `pgrep`/`pkill`

2. If `ps` supports `-A` and `-o`