
Copy-view behaviour and mutating arrays #24

Closed

@rgommers

Context:

That issue and PR were about unrelated topics, so I'll try to summarize the copy-view and mutation topic here and we can continue the discussion.

Note that the two topics are fairly tightly coupled, because copy/view differences only matter for semantics (as opposed to performance) when they are mixed with mutation.

Mutating arrays

There are a number of features that rely on mutation:

  • In-place operators like +=, *=
  • The out= keyword argument
  • Element and slice assignment with __setitem__
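
For concreteness, a minimal NumPy sketch of all three mechanisms:

```python
import numpy as np

x = np.arange(4, dtype=np.float64)   # [0., 1., 2., 3.]

x += 1                      # in-place operator: no new array is allocated
np.multiply(x, 2, out=x)    # out= keyword: result is written into an existing array
x[1:3] = 0                  # slice assignment via __setitem__

print(x)                    # [2. 0. 0. 8.]
```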

@shoyer's summary of the issue with mutation was: Mutation can be challenging to support in some execution models (at least without another layer of indirection), which is why several projects currently don't support it (TensorFlow and JAX) or only support it half-heartedly (e.g., Dask). The commonality between these libraries is that they build up abstract computations, which are then transformed (e.g., for autodiff) and/or executed in parallel. Even NumPy has "read only" arrays. I'm particularly concerned about new projects that implement this API, which might find the need to support mutation burdensome.

@alextp said: TensorFlow was planning to add mutability and didn't see a real issue with supporting out=.

@shoyer said: It's definitely always possible to support mutation at the Python level via some sort of wrapper layer.

dask.array is perhaps a good example of this. It supports mutating operations and out in some cases, but its support for mutation is still rather limited. For example, it doesn't support assignment like x[:2, :] = some_other_array.

Working around limitations of no support for mutation can usually be done by one of:

  1. Use where for selection, e.g., where(arange(4) == 2, 1, 0)
  2. Calculate the "inverse" of the assignment operator in terms of indexing, e.g., y = array([0, 1]); x = y[[0, 0, 1, 0]] in this case

Some version of (2) always works, though it can be tricky to work out (especially with current APIs). The duality between indexing and assignment is the difference between specifying where elements come from or where they end up.
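
Both workarounds, sketched in NumPy:

```python
import numpy as np

# Mutating version (unavailable without mutation support):
#   x = np.zeros(4, dtype=int); x[2] = 1

# Workaround 1: select with where()
x1 = np.where(np.arange(4) == 2, 1, 0)

# Workaround 2: the "inverse" of the assignment, expressed as a gather
y = np.array([0, 1])
x2 = y[[0, 0, 1, 0]]

assert (x1 == x2).all()    # both produce [0, 0, 1, 0]
```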

The JAX syntax for slice assignment is x.at[idx].set(y), as the non-mutating counterpart of x[idx] = y.

One advantage of the non-mutating version is that JAX can offer reliable arithmetic on array slices with x.at[idx].add(y) (x[idx] += y doesn't work if x[idx] returns a copy).
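
A minimal sketch of that API with jax.numpy (assuming JAX is installed):

```python
import jax.numpy as jnp

x = jnp.zeros(4)

# Functional "assignment": returns a new array, x itself is unchanged
y = x.at[1].set(5.0)       # [0., 5., 0., 0.]

# Reliable slice accumulation; the mutating spelling x[1:3] += 1.0
# doesn't work when indexing returns a copy
z = x.at[1:3].add(1.0)     # [0., 1., 1., 0.]
```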

A disadvantage is that doing this sort of thing inside a loop is almost always a bad idea unless you have a JIT compiler, because every indexed assignment makes a full copy. So the naive translation of an efficient Python loop that fills in an array row by row would now make a copy at each step. Instead, you'd have to rewrite that loop to use something like concatenate (which in my experience is already about as efficient as indexed assignment).
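
A sketch of that rewrite; the function names and the per-row computation are illustrative:

```python
import numpy as np

def fill_rows_mutating(n, m):
    # Efficient with real in-place assignment, but O(n) full copies
    # if every indexed assignment copies the array
    out = np.empty((n, m))
    for i in range(n):
        out[i, :] = np.sin(np.arange(m) + i)   # toy per-row computation
    return out

def fill_rows_functional(n, m):
    # Non-mutating rewrite: build the rows, then concatenate once
    rows = [np.sin(np.arange(m) + i)[None, :] for i in range(n)]
    return np.concatenate(rows, axis=0)
```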

Copy-view behaviour

Libraries like NumPy and PyTorch return views where possible from function calls. It's sometimes hard to predict whether a view or a copy will be returned: it depends not only on the function in question, but also on whether the input array is contiguous, and sometimes even on the input dtype.
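
For example, in NumPy the same reshape call can return either a view or a copy depending on contiguity, which np.shares_memory makes visible:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)                  # C-contiguous

print(np.shares_memory(x, x.T))                 # True: transpose is a view
print(np.shares_memory(x, x.reshape(-1)))       # True: reshape of contiguous data
print(np.shares_memory(x.T, x.T.reshape(-1)))   # False: same call, now a copy
```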

This is one place where it's hard to avoid implementation choices leaking into the API:

  • Static graph based implementations like TensorFlow and MXNet, or a functional implementation like JAX with immutable arrays, will return a copy for a function like transpose().
  • Implementations which support strides and/or use a dynamic graph are able to, and therefore often will, return a view when they can (which is the case for transpose()).

The above copy vs. view difference starts leaking into the API - i.e., the same code starts giving different results for different implementations - when it is combined with an operation that performs in-place mutation of an array (either the base array or the view on it). In the absence of that combination, views are simply a performance optimization that's invisible to the user.
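
A sketch of how the difference becomes observable once mutation is added:

```python
import numpy as np

x = np.arange(4.0)
t = x[:2]      # a view in NumPy; a copy under copy-only semantics
t += 100       # in-place mutation of the view

# NumPy (view semantics) prints [100. 101. 2. 3.]; an implementation
# returning a copy from x[:2] would leave x as [0. 1. 2. 3.]
print(x)
```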

The question is whether copy-view differences should be allowed, and if so how to deal with the semantics that vary between libraries.

To answer whether it should be allowed, let's first ask how often the combination of views and mutation is used. A few observations:

  1. It is normally considered a bug if a library function (e.g. a SciPy or scikit-learn one) mutates any of its input arguments - unless the function is explicitly documented as doing so, which is rare. So the main concern is mutation inside functions, applied to arrays that are either created inside the function or are a copy of an input array.
  2. A search for patterns like *=, += and ] = in SciPy and scikit-learn .py files shows that in-place mutation inside functions is heavily used.
  3. There's a significant difference between mutating a complete array (e.g. with += 1) and mutating part of an array (e.g. with x[:, :2] = y). The former is a lot easier to support for array libraries employing static graphs or a JIT than the latter. See the discussion at Proposal to standardize element-wise elementary mathematical functions #8 (comment) for details.
  4. It's harder to figure out how often the combination of mutating part of an array and that mutation affecting a view occurs. This could be tested, though, with a patched NumPy that raises an exception on mutations affecting a view, by running the test suites of downstream libraries (a rough sketch of the idea follows this list).
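
A rough sketch of how that experiment might be prototyped. The NoViewMutation class is hypothetical; __array_finalize__ and the writeable flag are existing NumPy machinery:

```python
import numpy as np

class NoViewMutation(np.ndarray):
    """Hypothetical sketch of the experiment in point 4: every array
    created as a view of another array is marked read-only, so any
    mutation affecting a view raises. Not an existing NumPy feature."""

    def __array_finalize__(self, obj):
        # Called whenever a new array of this class is created; base is
        # set when this array shares memory with another array.
        if obj is not None and self.base is not None:
            self.flags.writeable = False

x = np.zeros((2, 3)).view(NoViewMutation)
x.flags.writeable = True   # the root array itself may still be mutated

x += 1        # fine: whole-array mutation of the root
v = x[:1]     # any slice/view is created read-only by __array_finalize__
# v += 1      # would raise "ValueError: output array is read-only"
```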

Options for how to standardize

In #8 @shoyer listed the following options for how to deal with mutability:

  1. Require support for in-place operations. Libraries that don't support mutation fully will need to write a wrapper layer, even if it would be inefficient.
  2. Make support for in-place operations optional. Arrays can indicate whether they support mutation via some standard API, e.g., like NumPy's ndarray.flags.writeable. (From later discussion, see Proposal to standardize element-wise elementary mathematical functions #8 (comment) for the implications of that for users of the API; see also the sketch after this list.)
  3. Don't include support for in-place operations in the spec. This is a conservative choice, one which might have negative performance consequences (but it's a little hard to say without looking carefully). At the very least, it might require a library like SciPy to retain a special path for numpy.ndarray objects.
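
To illustrate what option 2 would mean for consumers like SciPy, a sketch under the assumption that the standard exposes something like NumPy's flags.writeable (the helper scale is hypothetical):

```python
import numpy as np

def scale(x, factor):
    # Feature-test mutability instead of assuming it; flags.writeable
    # is the NumPy spelling, a standardized name is not yet settled.
    writeable = getattr(getattr(x, "flags", None), "writeable", False)
    if writeable:
        x *= factor      # fast path: mutate in place
        return x
    return x * factor    # fallback: out-of-place computation
```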

To that I'd like to add a more granular option:

  4. Require support for in-place operations that are unambiguous, and require raising an exception in case a view is mutated.

    Rationale:

    (a) This would require libraries that don't support mutation to write a wrapper layer, but the behaviour would be unambiguous and in most cases the wrapper would not be inefficient.
    (b) In case inefficient mutation is detected (e.g. mutating a large array row by row in a loop), a warning may be emitted.

A variant of this option would be:

  5. Require support for in-place operations that are unambiguous and mutate the whole array at once (i.e. += and out= must be supported, element/slice assignment must raise an exception), and require raising an exception in case a view is mutated.

    The trade-off here is ease of implementation for libraries like Dask and JAX versus putting a rewrite burden on SciPy et al. and a usability burden on end users (the alternative to element/slice assignment is unintuitive).
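
A sketch of what this variant would and would not allow; the exception on partial assignment is hypothetical (NumPy itself accepts the commented-out line):

```python
import numpy as np

x = np.ones((4, 4))

# Allowed under the variant: whole-array mutation
x += 1
np.multiply(x, 2, out=x)

# Disallowed under the variant (would raise in a conforming library):
# x[:, :2] = 0

# The portable, non-mutating rewrite of that assignment:
x = np.concatenate([np.zeros((4, 2)), x[:, 2:]], axis=1)
```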
