Description
With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:
df2 = df.rename(columns=str.lower)
By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but now pointing to the same data under the hood (avoiding an initial copy), while preserving the observed behaviour of df2 being a copy / not mutating df when df2 is mutated (through the CoW mechanism, only copying the data in df2 when actually needed upon mutation, i.e. a delayed or lazy copy).
The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:
Lines 6246 to 6249 in 7bf8d6b
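To make the deep=None idea above concrete, here is a toy model in plain Python. It is only a sketch under stated assumptions: ToyFrame, COPY_ON_WRITE, set_value and the _maybe_shared bookkeeping are hypothetical names invented for illustration, and none of this is how pandas actually stores or tracks data.

```python
# Hypothetical stand-in for pd.options.mode.copy_on_write (not a pandas API).
COPY_ON_WRITE = True


class ToyFrame:
    """Toy column store that illustrates lazy copies; not pandas internals."""

    def __init__(self, columns):
        self._columns = dict(columns)  # column name -> list of values
        self._maybe_shared = set()     # columns whose list may be shared

    def copy(self, deep=None):
        # deep=None means: eager copy without CoW, lazy copy with CoW.
        # (In this toy, deep=False is also treated as a lazy copy.)
        if deep is None:
            deep = not COPY_ON_WRITE
        if deep:
            return ToyFrame({k: list(v) for k, v in self._columns.items()})
        lazy = ToyFrame(self._columns)  # new dict, same underlying lists
        lazy._maybe_shared = set(self._columns)
        self._maybe_shared = set(self._columns)
        return lazy

    def rename(self, func):
        # Only the labels change, so the data can be shared lazily.
        result = self.copy(deep=None)
        result._columns = {func(k): v for k, v in result._columns.items()}
        result._maybe_shared = {func(k) for k in result._maybe_shared}
        return result

    def set_value(self, column, row, value):
        # The delayed copy: duplicate a possibly shared column before writing.
        if column in self._maybe_shared:
            self._columns[column] = list(self._columns[column])
            self._maybe_shared.discard(column)
        self._columns[column][row] = value


df = ToyFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = df.rename(str.lower)

shared_before = df2._columns["a"] is df._columns["A"]
df2.set_value("a", 0, 100)  # triggers the copy of column "a" only
shared_after = df2._columns["a"] is df._columns["A"]

print(shared_before, shared_after, df._columns["A"], df2._columns["a"])
# prints: True False [1, 2, 3] [100, 2, 3]
```

The bookkeeping here is deliberately conservative: the parent keeps its columns marked as possibly shared, so it may copy once more than strictly needed on its own first write. That costs time but never breaks the guarantee that mutating one object leaves the other untouched.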
The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to serve as the overview issue to summarize and keep track of the progress on this front.
There is a class of methods that perform an actual operation on the data and return newly calculated data (e.g. reductions or the methods wrapping binary operators) that don't have to be considered here. Only methods that can (potentially, in certain cases) return the original data could make use of this optimization.
Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about; I have to look into what exactly those do to be sure, but left them here to keep track of them, and we can remove them from the list once we know more):
- add_prefix / add_suffix -> TST/CoW: copy-on-write tests for add_prefix and add_suffix #49991
- align -> ENH: Add lazy copy to align #50432
  - Needs a follow-up, see comment -> ENH: Make shallow copy for align nocopy with CoW #50917
- asfreq -> ENH: Add test for asfreq CoW when doing noop #50916
- assign -> ENH/TST: expand copy-on-write to assign() method #50010
- astype -> ENH: Add lazy copy to astype #50802
- between_time -> ENH: Add lazy copy for take and between_time #50476
- bfill / backfill -> ENH: Add CoW optimization to interpolate #51249
- clip -> TST: Add tests for clip with CoW #51492
- convert_dtypes -> ENH: Implement CoW for convert_dtypes #51265
- copy (tackled in initial implementation in #46958)
- drop -> ENH: Add copy-on-write to DataFrame.drop #49689
- drop_duplicates (in case no duplicates are dropped) -> ENH: Add lazy copy for drop duplicates #50431
- droplevel -> ENH: test CoW for drop_level #50552
- dropna -> ENH: Use lazy copy for dropna #50429
- eval -> ENH / CoW: Add lazy copy to eval #53746
- ffill / pad -> ENH: Add CoW optimization to interpolate #51249
- fillna -> ENH: Add CoW optimization for fillna #51279
- filter -> TST: Copy on Write for filter #50589
- get -> TST: add CoW tests for xs() and get() #51292
- head -> TST/CoW: copy-on-write tests for df.head and df.tail #49963
- infer_objects -> ENH: Use lazy copy in infer objects #50428
- insert ?
- interpolate -> ENH: Add CoW optimization to interpolate #51249
- isetitem -> TST: CoW with df.isetitem() #50692
- items -> TST: Test CoW with DataFrame.items() #50595
- iterrows ? -> CoW: Ensure that iterrows does not allow mutating parent #51271
- join / merge -> ENH: enable lazy copy in merge() for CoW #51297
- mask -> ENH: Add lazy copy to where #51336
  - this is covered by where, but could use an independent test -> TST / CoW: Add test for mask #53745
- pipe -> ENH: Add lazy copy to pipe #50567
- pop -> TST: Add test for CoW in pop #50569
- reindex
  - Already handled for reindexing the columns in the initial implementation (#46958), but we can still optimize row selection as well? (in case no actual reindexing takes place) -> TST: add test for reindexing rows with matching index uses shallow copy with CoW #53723
- reindex_like -> ENH: Use cow for reindex_like #50426
- rename (tackled in initial implementation in #46958)
- rename_axis -> ENH: add lazy copy (CoW) mechanism to rename_axis #50415
- reorder_levels -> ENH: add copy on write for df reorder_levels GH49473 #50016
- replace -> ENH: Add lazy copy to replace #50746
  - ENH: Optimize replace to avoid copying when not necessary #50918
  - TODO: Optimize when column not explicitly provided in to_replace?
  - TODO: Optimize list-like
  - TODO: Add note in docs that this is not fully optimized for 2.0 (not necessary if everything is finished by then)
- reset_index (tackled in initial implementation in #46958)
- round (for columns that are not rounded) -> ENH: Add lazy copy to concat and round #50501
- select_dtypes (tackled in initial implementation in #46958)
- set_axis -> ENH/CoW: use lazy copy in set_axis method #49600
- set_flags -> TST: Test cow for set_flags #50489
- set_index -> ENH/CoW: use lazy copy in set_index method #49557
  - TODO: check what happens if parent is mutated -> shouldn't mutate the index! (is the data copied when creating the index?)
- shift -> ENH: Add lazy copy to shift #50753
- sort_index / sort_values (optimization if nothing needs to be sorted)
  - sort_index -> ENH: Add lazy copy for sort_index #50491
  - sort_values -> ENH: Add lazy copy for sort_values #50643
- squeeze -> TST: Test squeeze with CoW #50590
- style. (phofl: I don't think there is anything to do here)
- swapaxes -> ENH: Add lazy copy for swapaxes no op #50573
- swaplevel -> ENH: Add lazy copy to swaplevel #50478
- T / transpose -> BUG: transpose not respecting CoW #51430
- tail -> TST/CoW: copy-on-write tests for df.head and df.tail #49963
- take (optimization if everything is taken?) -> ENH: Add lazy copy for take and between_time #50476
- to_timestamp / to_period -> ENH: Add lazy copy to to_timestamp and to_period #50575
- transform -> BUG / CoW: Series.transform not respecting CoW #53747
- truncate -> ENH: Add lazy copy for truncate #50477
- tz_convert / tz_localize -> ENH: Add lazy copy for tz_convert and tz_localize #50490
- unstack (in optimized case where each column is a slice?)
- update -> TST: add CoW test for update() #51426
- where -> ENH: Add lazy copy to where #51336
- xs -> TST: add CoW tests for xs() and get() #51292
- Series.to_frame() (tackled in initial implementation in #46958)
Top-level functions:
- pd.concat -> ENH: Add lazy copy to concat and round #50501
- pd.merge et al? -> ENH: enable lazy copy in merge() for CoW #51297, ENH: Avoid copy when possible in merge #51327
  - add tests for join
Want to contribute to this issue?
Pull requests tackling one of the bullet points above are certainly welcome!
- Pick one of the methods above (best to stick to one method per PR)
- Update the method to make use of a lazy copy (in many cases this might mean using copy(deep=None) somewhere, but for some methods it will be more involved)
- Add a test for it in /pandas/tests/copy_view/test_methods.py (you can mimic one of the existing ones, e.g. test_select_dtypes)
  - You can run the test with PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
  - The tests make use of a using_copy_on_write fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.
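As a rough idea of what such a test could look like, here is a sketch for rename() that relies only on public pandas and NumPy calls. The function name check_rename_columns_lazy_copy is made up for illustration, and using_copy_on_write is passed as a plain boolean argument here, whereas in the pandas test suite it would be supplied by the fixture. The existing tests in test_methods.py may well use internal helpers instead, so treat this as an approximation of the pattern and not as the project's actual test code.

```python
import numpy as np
import pandas as pd


def check_rename_columns_lazy_copy(using_copy_on_write):
    # Hypothetical test body; in the pandas suite this would be a test_*
    # function and `using_copy_on_write` would come from the fixture.
    df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    df_orig = df.copy()

    df2 = df.rename(columns=str.lower)

    # Right after the call, the data is shared only when CoW is enabled.
    shares = np.shares_memory(df2["a"].to_numpy(), df["A"].to_numpy())
    assert shares == using_copy_on_write

    # Mutating the result must never change the parent, with or without CoW.
    df2.iloc[0, 0] = 100
    assert not np.shares_memory(df2["a"].to_numpy(), df["A"].to_numpy())
    pd.testing.assert_frame_equal(df, df_orig)
```

The first assertion is the only one whose expectation depends on the CoW mode; the checks after the mutation hold in both modes, which is what lets a single test pass with CoW disabled and enabled.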