CoW: Add warning for replace with inplace #56060

Merged
merged 5 commits into from
Nov 20, 2023
Changes from 4 commits
12 changes: 12 additions & 0 deletions pandas/core/generic.py
@@ -100,6 +100,7 @@
    SettingWithCopyError,
    SettingWithCopyWarning,
    _chained_assignment_method_msg,
    _chained_assignment_warning_method_msg,
)
from pandas.util._decorators import (
    deprecate_nonkeyword_arguments,
@@ -7773,6 +7774,17 @@ def replace(
                        ChainedAssignmentError,
                        stacklevel=2,
                    )
            elif not PYPY and not using_copy_on_write():
                ctr = sys.getrefcount(self)
                ref_count = REF_COUNT
                if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
Member

Suggested change
if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
# In non-CoW mode, chained Series access populates the `_item_cache`,
# which raises the ref count above the threshold even though we still
# need to warn. We detect this case (a Series derived from a DataFrame)
# through the presence of `_cacher`.
if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):

(maybe a bit long to put in every place that has this check (after the other PRs), but a comment like this would have explained it to me)

Member Author

I am fine with adding this here. Maybe add a link to this comment in the other PRs?

                    ref_count += 1
Member

The problem might be that in the warning mode, this is not the case? (So we might need to add a not warn_copy_on_write() condition to the if block that updates ref_count.)

Although the tests are not failing...

Member Author

I don't understand your comment, why would this not be the case?

I double checked and it seems to work in warning mode

Member

Related to #55838 (comment), I thought I had disabled the item cache for the warning mode, which I would expect to affect the count.

Member Author

I double checked this locally; the cache is still populated.

Member

OK, will take a closer look

Member

@jorisvandenbossche, Nov 20, 2023

Ah, no, the _cacher points to the DataFrame (in a case like s = df["col"]), so it doesn't increase the ref count for s (which is what is relevant for chained setitem detection). And also it uses a weakref, so shouldn't actually increase the ref count?

The hasattr(self, "_cacher") is kind of a check whether the Series is derived from a DataFrame?
So I think that the reason we need to increase the ref count is still because of _item_cache, not actually _cacher. The presence of _cacher just turns out to be an equivalent check for checking that we are a Series in non-CoW mode (only in that mode _item_cache would be populated). So maybe a more "correct" check (although they will give the same result) would be:

Suggested change
if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
    ref_count += 1
if isinstance(self, ABCSeries) and not (using_copy_on_write() or warn_copy_on_write()):
    ref_count += 1
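
The weakref point in the comment above can be checked with the standard library alone. The following is a minimal sketch, not pandas code: Parent is a hypothetical stand-in for the object a cacher would point back to, and the function only compares how a weak and a strong reference change what sys.getrefcount reports on CPython.

```python
import sys
import weakref


class Parent:
    """Hypothetical stand-in for the object a cacher would point back to."""


def refcount_deltas():
    parent = Parent()
    baseline = sys.getrefcount(parent)

    # A weak reference does not own the object, so it is not counted.
    weak = weakref.ref(parent)
    after_weak = sys.getrefcount(parent)

    # A list holds a strong reference, which is counted.
    strong = [parent]
    after_strong = sys.getrefcount(parent)

    # Both references stay alive until the measurements are done.
    assert weak() is parent and strong[0] is parent
    return after_weak - baseline, after_strong - baseline


weak_delta, strong_delta = refcount_deltas()
print(weak_delta, strong_delta)
```

Only the differences are compared here, because the absolute value returned by sys.getrefcount depends on the interpreter version.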

Member

Hmm, that suggestion gives a bunch of failures (false positives) in pandas/tests/series/methods/test_replace.py

Member

@jorisvandenbossche, Nov 20, 2023

Which is logical, because that's the whole point with _item_cache: you don't know when it is populated or when not in the non-CoW case. So you can't use a single fixed REF_COUNT value to check, this actually depends on the circumstances.

And so what you have (checking _cacher) is a way to check if the Series object is derived from a DataFrame with simple indexing (and has populated _item_cache). So that's probably fine? Or are there ways to get a Series that does not populate the cacher?

Member

Or are there ways to get a Series that does not populate the cacher?

Getting a single column with ["col"], .loc[:, "col"], .iloc[:, 0], .get("col"), .xs("col", axis=1) all populate the cacher.

In theory you can get a Series as the result of a calculation that doesn't do this (df.mean().replace(..)), but of course that could never have done anything useful, so we don't need to care about that. We only need to care about getting a Series that is a view.

So to summarize: your fix is probably fully correct, I only didn't understand why ;) You might want to add a comment about it.

Member Author

Yes your conclusion is correct as far as I can tell. That was the reason why I added the hasattr check. Sorry for omitting this information.

Added a comment
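
The decision the thread converges on can be sketched as a small pure function. This is an illustration under stated assumptions, not the pandas implementation: should_warn and its parameter names are hypothetical, and the numbers in the usage lines are made up rather than the real REF_COUNT.

```python
def should_warn(observed_refcount, base_threshold, derived_from_parent):
    """Return True when the object looks like a chained-access temporary.

    observed_refcount:   value reported by sys.getrefcount for the object
    base_threshold:      count expected for a temporary (illustrative value)
    derived_from_parent: True when an item cache holds one extra reference,
                         which the thread detects via the `_cacher` attribute
    """
    threshold = base_threshold
    if derived_from_parent:
        # The cache adds one reference, so allow one more before deciding
        # that the object is bound to a user variable.
        threshold += 1
    return observed_refcount <= threshold


# Illustrative numbers only.
print(should_warn(3, 3, False))  # temporary without a cache entry
print(should_warn(4, 3, False))  # one extra reference, e.g. a user variable
print(should_warn(4, 3, True))   # the extra reference comes from the cache
```

Raising the threshold only for cache-backed objects is what avoids the false positives reported for the broader check, where the cache may or may not have been populated.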

                if ctr <= ref_count:
                    warnings.warn(
                        _chained_assignment_warning_method_msg,
                        FutureWarning,
                        stacklevel=2,
                    )

        if not is_bool(regex) and to_replace is not None:
            raise ValueError("'to_replace' must be 'None' if 'regex' is not a bool")
13 changes: 13 additions & 0 deletions pandas/errors/__init__.py
@@ -503,6 +503,19 @@ class ChainedAssignmentError(Warning):
)


_chained_assignment_warning_method_msg = (
    "A value is trying to be set on a copy of a DataFrame or Series "
    "through chained assignment using an inplace method.\n"
    "The behavior will change in pandas 3.0. This inplace method will "
    "never work because the intermediate object on which we are setting "
    "values always behaves as a copy.\n\n"
    "For example, when doing 'df[col].method(value, inplace=True)', try "
    "using 'df.method({col: value}, inplace=True)' or "
    "df[col] = df[col].method(value) instead, to perform "
    "the operation inplace on the original object.\n\n"
)


class NumExprClobberingError(NameError):
    """
    Exception raised when trying to use a built-in numexpr name as a variable name.
12 changes: 12 additions & 0 deletions pandas/tests/copy_view/test_replace.py
@@ -4,6 +4,7 @@
from pandas import (
    Categorical,
    DataFrame,
    option_context,
)
import pandas._testing as tm
from pandas.tests.copy_view.util import get_array
@@ -395,6 +396,17 @@ def test_replace_chained_assignment(using_copy_on_write):
        with tm.raises_chained_assignment_error():
            df[["a"]].replace(1, 100, inplace=True)
        tm.assert_frame_equal(df, df_orig)
    else:
        with tm.assert_produces_warning(FutureWarning, match="inplace method"):
            with option_context("mode.chained_assignment", None):
Member

Does it otherwise also raise a SettingWithCopyWarning?

Member Author

yes, [["a"]] currently copies

                df[["a"]].replace(1, 100, inplace=True)

        with tm.assert_produces_warning(FutureWarning, match="inplace method"):
            with option_context("mode.chained_assignment", None):
                df[df.a > 5].replace(1, 100, inplace=True)

        with tm.assert_produces_warning(FutureWarning, match="inplace method"):
            df["a"].replace(1, 100, inplace=True)
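
The pattern these tests rely on (run the call, then check that a FutureWarning matching a pattern was emitted) can be approximated with the standard library. This is a rough sketch and not how tm.assert_produces_warning is implemented: MESSAGE, inplace_method_on_temporary and collect_matching are hypothetical names, and the message is a shortened stand-in.

```python
import re
import warnings

# Shortened, illustrative stand-in for the real warning message.
MESSAGE = "chained assignment using an inplace method"


def inplace_method_on_temporary():
    # stacklevel=2 attributes the warning to the caller's line rather than
    # to this function, as in the diff above.
    warnings.warn(MESSAGE, FutureWarning, stacklevel=2)


def collect_matching(func, category, pattern):
    """Call func and return the recorded warnings of `category` whose
    message matches `pattern`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record even repeated warnings
        func()
    return [
        w
        for w in caught
        if issubclass(w.category, category) and re.search(pattern, str(w.message))
    ]


matches = collect_matching(inplace_method_on_temporary, FutureWarning, "inplace method")
print(len(matches))
```

A test would then assert that the returned list is non-empty, and that it is empty for a call that should stay silent.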


def test_replace_listlike(using_copy_on_write):
1 change: 1 addition & 0 deletions scripts/validate_unwanted_patterns.py
@@ -50,6 +50,7 @@
    "_global_config",
    "_chained_assignment_msg",
    "_chained_assignment_method_msg",
    "_chained_assignment_warning_method_msg",
    "_version_meson",
    # The numba extensions need this to mock the iloc object
    "_iLocIndexer",