Description
In general, almost all DataFrame and Series methods return new data and thus make a copy if needed (that is, if there was no calculation / the data didn't change). But some methods allow you to avoid making this copy with an explicit copy keyword, which defaults to copy=True but which you can change to copy=False manually to avoid the copy.
Example:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# by default a method returns a copy
>>> df2 = df.rename(columns=str.upper)
>>> df2.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4
# explicitly ask not to make a copy
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
     a  b
0  100  3
1    2  4
Now, if Copy-on-Write is enabled, the above behaviour shouldn't happen (because we are updating one dataframe (df) through changing another dataframe (df3)).
In this specific case of rename, it actually already doesn't work like that anymore, and df is not updated:
>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4
This is because of how rename is implemented under the hood: it uses result = self.copy(deep=copy), and so it was always already taking a shallow copy of the calling dataframe. With CoW enabled, a "shallow" copy doesn't exist anymore in the original meaning, but is now essentially a "lazy copy with CoW".
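To make the "shallow copy" point concrete, here is a minimal sketch that calls df.copy(deep=False) directly, which is what the description above says rename does when copy=False is passed. The variable names are illustrative, and the check with np.shares_memory is only one possible way to inspect this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A shallow copy: a new DataFrame object that initially reuses the same data.
shallow = df.copy(deep=False)

# Right after the call, both objects still point at the same buffer,
# with or without CoW.
shares_before_write = np.shares_memory(
    df["a"].to_numpy(), shallow["a"].to_numpy()
)
print(shares_before_write)

# Whether this write is visible in df depends on whether CoW is enabled:
# without CoW the data stays shared, with CoW the write triggers a copy first.
shallow.iloc[0, 0] = 100
print(df.iloc[0, 0])
```

The only difference between the two modes is what happens at the moment of the write, which is why the same shallow copy can be described as a "lazy copy" under CoW.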
But for some other methods, this is actually not yet working
There are several issues/questions here:
- Are we OK with copy=False now actually meaning a "lazy" copy for all those methods?
  - I don't think there is any alternative with the current CoW semantics, but I want to make this explicit and track it, because we 1) should document this (it's a breaking change) and potentially add future warnings for it at some point, and 2) should ensure this behaviour happens correctly for all methods that have a copy keyword.
- Should manually passing copy=True still give an actual hard / "eager" copy?
  - Probably yes (if we keep the keyword, see the third question below), but we should also make sure to test this when CoW is enabled.
- If (in the future with CoW enabled) the default will be to not return a copy, is it still worth keeping the copy keyword?
  - Currently the default is copy=True, so people mostly use the keyword explicitly to set copy=False. But copy=False will become the default in the future, so it will no longer need to be specified explicitly.
  - People can still use copy=True in the future to ensure they get an "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth keeping the keyword around? (They can always call .copy() instead.)
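For the last point, a short sketch of what "they can always call .copy() instead" could look like in practice. It assumes nothing beyond the public rename and copy methods; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Chaining .copy() forces an eager copy, regardless of whether the method
# itself returned a real copy or a lazy one.
eager = df.rename(columns=str.upper).copy()

# The result owns its data immediately, so nothing is deferred to a later write.
shares_memory = np.shares_memory(df["a"].to_numpy(), eager["A"].to_numpy())
print(shares_memory)

# Writing to the eager copy never touches the parent.
eager.iloc[0, 0] = 100
parent_value = int(df.iloc[0, 0])
print(parent_value)
```

Without CoW this can copy twice (once in rename, once in .copy()), which is a cost the copy keyword avoids today.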
DataFrame/Series methods that have a copy keyword (except for the constructors):
- align
- astype
- infer_objects
- merge
- reindex
- reindex_like
- rename
- rename_axis
- set_axis (only added in 1.5)
- set_flags (default False)
- swapaxes
- swaplevel
- to_numpy (default False)
- to_timestamp
- transpose (default False)
- truncate
- tz_convert
- tz_localize
- pd.concat
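As a rough sketch of how the behaviour could be checked across several of these methods, the helper below writes to a method's result and reports whether the parent changed. The helper name write_leaks_to_parent, the selection of methods, and their arguments are all illustrative assumptions and not part of pandas or its test suite. The calls use default arguments; the copy=False or copy=True case could be checked by adding the keyword inside the corresponding lambda.

```python
import pandas as pd


def write_leaks_to_parent(method_call):
    """Return True if writing to the result of method_call(df) changes df."""
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    before = df.copy(deep=True)  # independent snapshot for comparison
    result = method_call(df)
    result.iloc[0, 0] = 100
    return not df.equals(before)


# A few of the listed methods, called in a way that leaves the data unchanged.
calls = {
    "astype": lambda d: d.astype("int64"),
    "infer_objects": lambda d: d.infer_objects(),
    "reindex": lambda d: d.reindex(index=d.index),
    "rename": lambda d: d.rename(columns=str.upper),
    "rename_axis": lambda d: d.rename_axis("idx"),
}

leaks = {name: write_leaks_to_parent(call) for name, call in calls.items()}
print(leaks)
```

With default arguments none of these calls should leak, since they either copy eagerly (without CoW) or copy lazily on the first write (with CoW).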
xref CoW overview issue: #48998