Description
Context: with the Copy-on-Write implementation (see #36195 / detailed proposal doc, and overview follow up issue #48998), we generally want to avoid that updating (setting) values in one DataFrame/Series also changes another DataFrame/Series.
When constructing a DataFrame/Series from existing DataFrame/Series objects this can be handled correctly (the newly constructed object keeps track of the original ones, ensuring to correctly trigger the Copy-on-Write mechanism when needed). However, when constructing a DataFrame/Series from a numpy array, this doesn't work (the CoW mechanism works on the level of BlockManager and blocks).
Currently the DataFrame(..)
/Series(..)
constructors work "zero-copy" for a numpy array input (i.e. viewing the original ndarray). While it would be nice to keep behaviour for performance reasons, this also creates some problems for maintaining the copy/view rules under Copy-on-Write.
There are several aspects to this behaviour (I am showing the examples for Series, but the same is true for constructing DataFrame from a 2D array):
1) Updating the Series/DataFrame also updates the numpy array:
>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> s.iloc[0] = 100
>>> arr
array([100, 2, 3])
2) Updating the numpy array also updates the Series/DataFrame:
>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> arr[0] = 100
>>> s
0 100
1 2
2 3
dtype: int64
3) When creating multiple pandas objects viewing the same numpy array, updating the one will also update the other:
>>> arr = np.array([1, 2, 3])
>>> s1 = pd.Series(arr)
>>> s2 = pd.Series(arr)
>>> s1.iloc[0] = 100
>>> s2
Out[13]:
0 100
1 2
2 3
dtype: int64
The first two cases could maybe be seen as corner cases (and documented as such, assuming changing the numpy array after DataFrame/Series creation is an "advanced" use case), although IMO ideally we should avoid that users can accidentally mutate DataFrame/Series in this way. But the third case clearly violates the principles of the copy/view rules under the CoW proposal.
I can think of two ways to fix this:
- We always copy input arrays in the constructors (i.e. effectively changing the default to
copy=True
).- We will still need some way to create the Series/DataFrame without a copy (for example internally, if we know the array was just created and isn't passed by the user), and we could maybe use the
copy=False
option for this? (but we will have to clearly document that you then circumvent CoW and we don't give guarantees about the resulting copy/view behaviour when the array or dataframe/series gets mutated. - Ideally we would avoid this for performance reasons, I think, at least for Series (and DataFrame in case the input array has the correct memory layout). But for typical 2D numpy arrays, making a copy by default could actually improve performance: PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756
- We will still need some way to create the Series/DataFrame without a copy (for example internally, if we know the array was just created and isn't passed by the user), and we could maybe use the
- We keep using a view by default (like we can do for DataFrame/Series input to the constructors), but use some other mechanism to enforce correct CoW behaviour.
- One idea would be to change the input array to read-only (
arr.flags.writeable = False
). This would avoid the user to accidentally mutate the array and that way also update the pandas object (you can still change the flags, but that should give a hint to the user about what it is doing), and at the same time that would trigger pandas to make a copy of the array when the user tries to mutate the DataFrame/Series (triggering CoW, but using a different mechanism).
- One idea would be to change the input array to read-only (