API: CoW and explicit copy keyword in DataFrame/Series methods #50535

Closed
@jorisvandenbossche

Description

In general, almost all DataFrame and Series methods return new data, and thus make a copy when needed (that is, when no calculation happened and the data didn't change). Some methods, however, let you avoid this copy through an explicit copy keyword: it defaults to copy=True, and you can pass copy=False manually to skip the copy.

Example:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# by default a method returns a copy
>>> df2 = df.rename(columns=str.upper)
>>> df2.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4

# explicitly ask not to make a copy
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
     a  b
0  100  3
1    2  4

Now, if Copy-on-Write is enabled, the above behaviour shouldn't happen, because we would be updating one dataframe (df) by changing another dataframe (df3).

In this specific case of rename, it actually already doesn't work anymore like that, and df is not updated:

>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4

This is because of how it is implemented under the hood in rename, using result = self.copy(deep=copy), so with copy=False it was always already taking a shallow copy of the calling dataframe. With CoW enabled, a "shallow" copy doesn't exist anymore in the original meaning, but essentially becomes a "lazy copy with CoW".
But for some other methods, this is actually not yet working.
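To make the "lazy copy with CoW" idea a bit more concrete, here is a toy model in plain Python. It is only a sketch: ToyFrame, _Buffer and all their methods are hypothetical names invented for this illustration, and the owner counter is a simplification, not how pandas tracks references internally.

```python
class _Buffer:
    """Hypothetical shared storage: the data plus a count of objects using it."""

    def __init__(self, data):
        self.data = data
        self.owners = 1


class ToyFrame:
    """Toy stand-in for a DataFrame holding a single list of values."""

    def __init__(self, data):
        self._buf = _Buffer(list(data))

    def lazy_copy(self):
        # Share the buffer for now; the real copy is deferred until a write.
        new = ToyFrame.__new__(ToyFrame)
        new._buf = self._buf
        self._buf.owners += 1
        return new

    def eager_copy(self):
        # Copy the data immediately, like an explicit .copy().
        return ToyFrame(self._buf.data)

    def set(self, i, value):
        # Copy-on-write: detach from a shared buffer before modifying it.
        if self._buf.owners > 1:
            self._buf.owners -= 1
            self._buf = _Buffer(list(self._buf.data))
        self._buf.data[i] = value

    def shares_buffer_with(self, other):
        return self._buf is other._buf

    def to_list(self):
        return list(self._buf.data)


df = ToyFrame([1, 2])
df3 = df.lazy_copy()                  # analogous to copy=False under CoW
shared_before_write = df.shares_buffer_with(df3)

df3.set(0, 100)                       # the write triggers the actual copy
shared_after_write = df.shares_buffer_with(df3)
```

The point of the sketch is that the parent is never modified through the child: the two objects share data only until one of them is written to.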

There are several issues/questions here:

  1. Are we OK with copy=False now actually meaning a "lazy" copy for all those methods?
    • I don't think there is any alternative with the current CoW semantics, but just to make this explicit and track this, because we 1) should document this (it's a breaking change) and potentially add future warnings for this at some point, and 2) ensure this behaviour is correctly happening for all methods that have a copy keyword.
  2. The case of manually passing copy=True should still give an actual hard / "eager" copy?
    • Probably yes (if we keep the keyword, see 3) below), but we should also ensure to test this when CoW is enabled.
  3. If (in the future with CoW enabled) the default will now be to not return a copy, is it still worth it to keep the copy keyword?
    • Currently the default is copy=True, so people typically use the keyword explicitly only to set copy=False. But copy=False will become the default in the future, and so will no longer need to be specified explicitly.
    • People can still use copy=True in the future to ensure they get an "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth it to keep the keyword around? (they can always do .copy() instead)
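As an illustration of the ".copy() instead" alternative from question 3, a minimal sketch that does not use the copy keyword at all, so it should behave the same with or without CoW enabled. The variable names are only for illustration, and the memory check assumes plain integer columns, where to_numpy() does not need to convert the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Rename first, then force an eager copy with an explicit .copy()
# instead of passing copy=True to rename.
eager = df.rename(columns=str.upper).copy()

# The eager copy owns its data: nothing is shared with the parent column.
shares = np.shares_memory(df["a"].to_numpy(), eager["A"].to_numpy())

# Modifying the copy therefore cannot propagate back to df.
eager.iloc[0, 0] = 100
```

If this pattern is good enough for the "I want the copy to happen now" use case, that would be an argument for eventually dropping the keyword.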

DataFrame/Series methods that have a copy keyword (except for the constructors):

  • align
  • astype
  • infer_objects
  • merge
  • reindex
  • reindex_like
  • rename
  • rename_axis
  • set_axis (only added in 1.5)
  • set_flags (default False)
  • swapaxes
  • swaplevel
  • to_numpy (default False)
  • to_timestamp
  • transpose (default False)
  • truncate
  • tz_convert
  • tz_localize
  • pd.concat
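Since this list can drift between versions, a rough way to re-check it against an installed pandas is to inspect the method signatures. This is only a sketch: methods_with_copy_keyword and CANDIDATES are hypothetical helper names, and methods that don't exist in the installed version are simply skipped.

```python
import inspect

import pandas as pd

# Method names taken from the list above (pd.concat is a function, checked separately).
CANDIDATES = [
    "align", "astype", "infer_objects", "merge", "reindex", "reindex_like",
    "rename", "rename_axis", "set_axis", "set_flags", "swapaxes", "swaplevel",
    "to_numpy", "to_timestamp", "transpose", "truncate", "tz_convert",
    "tz_localize",
]


def methods_with_copy_keyword(cls, names):
    """Return (name, default) pairs for methods of cls that accept `copy`."""
    found = []
    for name in names:
        method = getattr(cls, name, None)
        if method is None:
            continue  # not available in this pandas version
        try:
            params = inspect.signature(method).parameters
        except (TypeError, ValueError):
            continue  # signature not introspectable
        if "copy" in params:
            found.append((name, params["copy"].default))
    return found


found = methods_with_copy_keyword(pd.DataFrame, CANDIDATES)
concat_has_copy = "copy" in inspect.signature(pd.concat).parameters

for name, default in found:
    print(f"DataFrame.{name}: copy default = {default!r}")
```

The same helper can be pointed at pd.Series to compare the two classes.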

xref CoW overview issue: #48998
