API / CoW: copy/view behaviour when constructing DataFrame/Series from a numpy array #50776

Context: with the Copy-on-Write implementation (see https://github.com/pandas-dev/pandas/issues/36195 / [detailed proposal doc](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit), and the overview follow-up issue https://github.com/pandas-dev/pandas/issues/48998), we generally want to ensure that updating (setting) values in one DataFrame/Series never changes another DataFrame/Series.
When constructing a DataFrame/Series from existing DataFrame/Series objects, this can be handled correctly: the newly constructed object keeps track of the original ones, so the Copy-on-Write mechanism is triggered correctly when needed. However, when constructing a DataFrame/Series from a numpy array, this doesn't work, because the CoW mechanism works at the level of the BlockManager and blocks.

Currently the `DataFrame(..)`/`Series(..)` constructors work "zero-copy" for a numpy array input (i.e. they view the original ndarray). While it would be nice to keep this behaviour for performance reasons, it also creates some problems for maintaining the copy/view rules under Copy-on-Write.

There are several aspects to this behaviour (I am showing the examples for Series, but the same is true when constructing a DataFrame from a 2D array):

1\) Updating the Series/DataFrame also updates the numpy array:

```python
>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> s.iloc[0] = 100
>>> arr
array([100,   2,   3])
```

2\) Updating the numpy array also updates the Series/DataFrame:

```python
>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> arr[0] = 100
>>> s
0    100
1      2
2      3
dtype: int64
```

3\) When creating multiple pandas objects that view the same numpy array, updating one also updates the other:

```python
>>> arr = np.array([1, 2, 3])
>>> s1 = pd.Series(arr)
>>> s2 = pd.Series(arr)
>>> s1.iloc[0] = 100
>>> s2
0    100
1      2
2      3
dtype: int64
```

The first two cases could maybe be seen as corner cases (and documented as such, assuming that changing the numpy array after DataFrame/Series creation is an "advanced" use case), although IMO we should ideally prevent users from accidentally mutating a DataFrame/Series in this way. But the third case clearly violates the principles of the copy/view rules under the CoW proposal.
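To see which of these cases applies on a given pandas version, one can check whether a Series still shares memory with the array it was built from. This is only a minimal sketch: `shares_buffer` is a hypothetical helper name (not a pandas API), and the result for the default constructor depends on the pandas version and CoW settings, so no output is shown.

```python
import numpy as np
import pandas as pd


def shares_buffer(obj, arr):
    """Return True if the values of a pandas object overlap in memory with `arr`.

    Hypothetical helper for illustration only, not part of pandas.
    """
    # For a numpy-backed Series, to_numpy() returns a view of the underlying
    # data where possible, so overlapping memory means the constructor did
    # not copy the input array.
    return bool(np.shares_memory(obj.to_numpy(), arr))


arr = np.array([1, 2, 3])
s_default = pd.Series(arr)            # view or copy, depending on the pandas version
s_copied = pd.Series(arr, copy=True)  # explicit copy, decoupled from `arr`

print("default constructor shares memory:", shares_buffer(s_default, arr))
print("copy=True shares memory:", shares_buffer(s_copied, arr))
```

If the first check prints `True`, all three cases above can occur for that object.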

I can think of two ways to fix this: 

1) We always copy input arrays in the constructors (i.e. effectively changing the default to `copy=True`).
    * We will still need _some_ way to create the Series/DataFrame without a copy (for example internally, if we know the array was just created and isn't passed by the user), and we could _maybe_ use the `copy=False` option for this? (But we will have to clearly document that you then circumvent CoW, and that we don't give guarantees about the resulting copy/view behaviour when the array or DataFrame/Series gets mutated.)
    * _Ideally_ we would avoid this for performance reasons, I think, at least for Series (and DataFrame in case the input array has the correct memory layout). But for typical 2D numpy arrays, making a copy by default could actually improve performance: https://github.com/pandas-dev/pandas/issues/50756
2) We keep using a view by default (like we can do for DataFrame/Series input to the constructors), but use some other mechanism to enforce correct CoW behaviour.
    * One idea would be to make the input array read-only (`arr.flags.writeable = False`). This would prevent the user from accidentally mutating the array and thereby updating the pandas object (they can still change the flag back, but having to do so should be a hint about what they are doing). At the same time, the read-only flag would make pandas copy the array when the user tries to mutate the DataFrame/Series (triggering CoW, but through a different mechanism).
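The read-only idea in option 2 can be sketched in plain numpy. The class below is a hypothetical stand-in for a pandas block, not actual pandas internals; it only illustrates "lock the input array, copy on first write" under the assumption that a non-writeable buffer is treated as potentially shared.

```python
import numpy as np


class ReadOnlyBackedValues:
    """Hypothetical stand-in for a pandas block (illustration only)."""

    def __init__(self, arr):
        # Option 2: keep viewing the caller's array (zero-copy), but lock it
        # so the caller cannot silently mutate the data behind our back.
        arr.flags.writeable = False
        self._values = arr

    def setitem(self, key, value):
        # A read-only buffer is treated as "possibly shared": copy it before
        # the first write, so other viewers of the buffer are unaffected.
        if not self._values.flags.writeable:
            self._values = self._values.copy()
        self._values[key] = value

    @property
    def values(self):
        return self._values


arr = np.array([1, 2, 3])
a = ReadOnlyBackedValues(arr)
b = ReadOnlyBackedValues(arr)

a.setitem(0, 100)  # copies first, so neither `arr` nor `b` sees the change

try:
    arr[0] = -1  # the caller's array is now read-only
except ValueError as exc:
    print("write to original array blocked:", exc)

print(a.values.tolist(), b.values.tolist(), arr.tolist())
```

In this sketch the copy is paid only by the object that is actually written to, which is the performance argument for option 2 over always copying in the constructor. The cost is that the constructor changes a flag on an array owned by the user.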

