Description
#49467 added the feature to detect "chained assignment" and raise an error for this when CoW is enabled (since it will then never work):
>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df["a"][0] = 10
...
ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment.
When using the Copy-on-Write mode, such chained assignment never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.
...
This is done based on the refcount of the calling DataFrame of __setitem__
: if this is only a temporary variable in a chain, it has a lower refcount than when it is assigned to a variable (using sys.getrefcount
).
It seemed that we could robustly check for this with the above approach, but we discovered that this fails if the setitem call is done from a function compiled with cython. We actually encounter this in our own test suite, in the read_spss tests using pyreadstats
(those are skipped for CoW now). pyreadstats
has some functionality written in cython, and for example this function does a valid (non-chained) setitem call to update a column of a dataframe. However, this incorrectly triggers the ChainedAssignmentError.
This is because the cython generated code results in a lower refcount when getting to DataFrame/Series.__setitem__
(a refcount of 2 instead of 3 for a chained case, but so for a valid non-chained case it can give a refcount of 3, which triggers the error).
My current understanding is that both when calling a pure python function that gets interpreted, or calling a compiled cython function, it ends up calling PyObject_SetItem
(which will then eventually call the __setitem__
). However, when calling this through the interpreter, there is an extra refcount increment from having the object on the stack.
It's an annoying issue, and I am not directly sure of a good solution. Some options that I can currently think of:
- Somehow "fix" it and find a way to not raise the error incorrectly:
- Can we detect that we are being called from cython? Currently I didn't find anything.
- Analyse the stack frames. The cython frames are missing, but so we could analyse the code of the frame above to see if this actually contains a chained setitem operation (but this doesn't sound very robust / fun to write ..)
- Make it a warning instead of an error (which makes it less serious if raised as a false positive), and then we can document that libraries should best catch this warning to silence it if calling setitem from cython. Or have some option to enable/disable the warning